Upper bounds for Fisher information

Upper bounds are considered for the Fisher information of random vectors in terms of the total variation and of norms in Sobolev spaces. We also survey and refine a number of known results in this direction.


Introduction
Given a random vector X in R^n with density p, its Fisher information is defined by

I(X) = ∫_{p(x)>0} |∇p(x)|²/p(x) dx. (1.1)

This functional is well-defined and finite when the function √p belongs to the Sobolev space W_2^1(R^n). In all other cases, one puts I(X) = ∞. In the one-dimensional case, the integral in (1.1) makes sense when the density p is locally absolutely continuous and has a derivative p' in the Radon-Nikodym sense. One may then write I(X) = E ρ(X)² in terms of the score function ρ = (log p)', also called the logarithmic derivative of p. Of large interest are also the more general functionals (moments of the scores)

I_k(X) = E |ρ(X)|^k.

Since the Fisher information appears naturally in many mathematical problems, it is useful to know general conditions which ensure that I(X) is finite. For example, for the applicability of the central limit theorem with respect to the relative Fisher information, one needs to verify that this functional becomes finite after taking several convolutions of densities which might have an infinite Fisher information (such as the uniform distributions on bounded intervals). To this aim, it was shown in [3] that, in dimension n = 1, for the sum X = X_1 + X_2 + X_3 of three independent random variables X_j whose densities p_j have finite total variation norms b_j = ‖p_j‖_TV, the Fisher information I(X) is finite and admits an explicit bound (1.2) in terms of b_1, b_2, b_3. Here, adding an independent summand to X may only decrease the Fisher information. On the other hand, it may happen that I(X_1 + X_2) = ∞. With similar conclusions, (1.2) was extended in [2] to higher moments of the scores as a relation (1.3) for the sum X = X_1 + ⋯ + X_{k+1} of k + 1 independent random variables whose densities p_j have total variation norms b_j = ‖p_j‖_TV.
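Since (1.1) is an integral functional of the density, it is easy to probe numerically. The following sketch is our illustration (not part of the paper; the helper name `fisher_information` is ours): it approximates I(X) for a one-dimensional normal density, where the exact value is 1/σ².

```python
import math

def fisher_information(p, dp, lo, hi, n=200_000):
    # Midpoint Riemann sum for I(p) = ∫ p'(x)^2 / p(x) dx over [lo, hi].
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        px = p(x)
        if px > 0:
            total += dp(x) ** 2 / px * h
    return total

# Normal density N(0, sigma^2): the Fisher information equals 1/sigma^2.
sigma = 2.0
p  = lambda x: math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
dp = lambda x: -x / sigma ** 2 * p(x)

I = fisher_information(p, dp, -40.0, 40.0)
print(I)  # close to 1/sigma^2 = 0.25
```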
The usefulness of such relations is explained by the fact that the total variation norm is much more tractable. In particular, this norm can be directly related to the characteristic functions of the involved random variables. As a corollary, the following characterization holds in the case where all X_j are independent, have a finite absolute moment, and a common characteristic function f(t) = E e^{itX_j}, t ∈ R. Namely, the partial sums S_N = X_1 + ⋯ + X_N have a finite Fisher information for some, and then for all, large N if and only if f(t) = o(t^{−ε}) as t → ∞ for some ε > 0. The same conclusion is also true for the moments I_k(S_N) of an arbitrary order k (cf. [2]). One may therefore wonder whether a similar characterization holds in higher dimensions. Keeping aside this question for a separate discussion, one of the purposes of this note is to extend the relation (1.2) to densities on R^n.

Theorem 1.1. For the sum X = X_1 + X_2 + X_3 of three independent random vectors X_j in R^n whose densities p_j have finite total variation norms b_j = ‖p_j‖_TV, we have

I(X) ≤ c (b_1 b_2 b_3)^{2/3}, (1.4)

where c > 0 is an absolute constant.
Note that, modulo an absolute factor, the expression on the right-hand side is slightly better than the one in (1.2), in view of the arithmetic-geometric mean inequality. It is also rather remarkable that the constant in (1.4) is independent of the dimension n (as we will see, one may take c = 18).
In general, the total variation norm of an integrable function u on R^n is defined by

‖u‖_TV = sup ∫_{R^n} u(x) div w(x) dx, (1.5)

where the supremum is taken over all collections of C_0^∞-smooth functions w_i : R^n → R, w = (w_1, …, w_n), such that w_1² + ⋯ + w_n² ≤ 1 pointwise on R^n. This definition leads to the more familiar formula

‖u‖_TV = ∫ |∇u(x)| dx, (1.6)

once u has a weak gradient ∇u. If u = 1_A is the indicator function of a Borel set A in R^n of finite volume, the expression in (1.5) defines the perimeter Per(A) of A. It is finite, for example, when A is open, bounded, and has a C²-smooth boundary (cf. [17], p. 229).
As an example illustrating (1.4), one may consider random vectors X_j uniformly distributed over sets A_j in R^n with finite positive volume v_j = vol_n(A_j) and finite perimeter P_j = Per(A_j). In this case, the densities of X_j are the normalized indicator functions p_j = (1/v_j) 1_{A_j}, and their total variation norms are given by b_j = P_j/v_j. One can therefore conclude that the sum X = X_1 + X_2 + X_3 has finite Fisher information satisfying I(X) ≤ c (P_1 P_2 P_3 / (v_1 v_2 v_3))^{2/3}.
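For a concrete one-dimensional instance of this phenomenon, one can compute I(X_1 + X_2 + X_3) for three independent uniform variables on [0, 1] directly: the density of the sum is the standard piecewise-quadratic Irwin-Hall density, and the integral is finite even though each summand alone has infinite Fisher information. This numerical sketch is ours, not part of the paper.

```python
import math

# Irwin-Hall density for the sum of three independent Uniform[0,1] variables.
def f(x):
    if 0.0 < x < 1.0:
        return 0.5 * x * x
    if 1.0 <= x <= 2.0:
        return 0.5 * (-2.0 * x * x + 6.0 * x - 3.0)
    if 2.0 < x < 3.0:
        return 0.5 * (3.0 - x) ** 2
    return 0.0

def df(x):  # a.e. derivative of f
    if 0.0 < x < 1.0:
        return x
    if 1.0 <= x <= 2.0:
        return 3.0 - 2.0 * x
    if 2.0 < x < 3.0:
        return x - 3.0
    return 0.0

# Midpoint Riemann sum for I = ∫ f'(x)^2 / f(x) dx over (0, 3).
n = 300_000
h = 3.0 / n
I = sum(df(x) ** 2 / f(x) * h for x in ((i + 0.5) * h for i in range(n)) if f(x) > 0)
print(I)  # finite (about 4.56)
```

For two uniforms, the sum has the triangular density f(x) = x near 0, so I = ∫ f'²/f behaves like ∫ dx/x and diverges, matching the remark that I(X_1 + X_2) = ∞ may happen.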
In the one-dimensional situation, the proofs of (1.2)-(1.3) from [2], [3] are based on an application of the Brunn-Minkowski inequality from convex geometry, which allows one to derive these relations for X_j's uniformly distributed over arbitrary bounded intervals. Another ingredient in the argument is the interesting fact that any probability density p on the real line with finite total variation norm may be represented as a "continuous" convex mixture p = ∫ p_t dπ(t) of densities of uniform distributions with the property that ‖p‖_TV = ∫ ‖p_t‖_TV dπ(t), thus reversing Jensen's inequality for the total variation norm.
However, it is not clear how to push this approach forward in the multidimensional situation. Instead, we refine and employ a result from the theory of differentiable measures due to Uglanov and Bogachev, giving a general bound on I(p) without assuming that the density p has a convolution structure. The main difficulty in estimating I(p) concerns mostly the one-dimensional case. If a non-negative integrable function p on the real line has 3 continuous derivatives (this class may be enlarged), it was shown in [5], [6] that I(p) admits an upper bound in terms of the L¹-norms of the derivatives of p up to the third order (inequality (1.7)). In the earlier paper [16], this inequality was stated without proof with existing absolute constants. The relation may be extended to higher dimensions in terms of the corresponding partial derivatives of p. As we will see, the derivatives of the first and second order may actually be eliminated, so that we have the bound (1.8) involving the third-order derivatives only, where c > 0 is an absolute constant.
Applying (1.8) to the convolution of three densities on R^n, we will derive the relation (1.4).
With this approach in mind, one may wonder whether one can obtain similar inequalities for general moments of the scores, so as to extend the inequality (1.3) to higher dimensions. In this connection, let us mention that Krugova [13] has extended the inequality (1.7) by proving a corresponding bound on I_k(p) for the region of real orders 1 ≤ k < 3, with constants C_k depending on k only. But, as is also well known, such a relation cannot be true for k ≥ 3, even if higher order derivatives are involved. For example, the density p(x) = (1/√(2π)) x² e^{−x²/2}, x ∈ R, has integrable derivatives of any order, while I_3(p) = ∞. Hence, the convolution structure of the distribution of the random variable X is essential for the bound (1.3) with k ≥ 3.
Returning to Theorem 1.1, one motivating point for the derivation of multidimensional upper bounds such as (1.4) is the central limit theorem in the i.i.d. model for the normalized sums Z_N = S_N/√N with respect to the relative Fisher information I(Z_N ‖ Z) = I(Z_N) − I(Z) = I(Z_N) − n (which we do not discuss here). This functional appears naturally in other limit theorems and bounds as well. For example, of large interest is the behavior of the relative entropy D(Z_N ‖ Z) = ∫ p_N log(p_N/φ) dx, where p_N denotes the density of Z_N, and φ is the density of the standard normal random vector Z in R^n. Assuming that the distribution of the random vector X_1 in R^n is isotropic, has a finite Fisher information, and satisfies a Poincaré-type inequality, it was recently shown by Courtade, Fathi and Pananjady [8] that D(Z_N ‖ Z) admits a bound of order 1/N (inequality (1.9)). Using (1.4), this bound may be stated under the weaker assumption that X_1 has a density with finite total variation norm b: it suffices to apply (1.9) to the normalized sums of N/3 independent copies of X_1 + X_2 + X_3.
This paper consists of two parts. In Sections 1-7 we focus on the one-dimensional case and discuss various upper bounds on the Fisher information I(p), both in terms of the second and third derivatives of p and for several special classes of probability distributions (such as compactly supported or unimodal distributions). In Section 7, Theorems 1.1-1.2 are proved for n = 1. Sections 8-17 mostly deal with the multidimensional situation. To make the proofs of the main results rigorous, this case requires a careful analysis of basic concepts from the theory of weak derivatives and Sobolev spaces. We therefore include a short reminder of basic definitions and facts of this theory, together with some special results needed for an easy treatment of the Fisher information functional. They are used, in particular, to rigorously justify some of its important properties, such as lower semi-continuity and convexity.

Functions with bounded second derivative
Definition 2.1. Given −∞ ≤ a < b ≤ ∞ and an integer l ≥ 1, we denote by C^l(a, b) the collection of all continuous functions u on the interval (a, b) having continuous derivatives up to order l − 1 such that the derivative u^{(l−1)} is (locally) absolutely continuous. Then u^{(l−1)} has a Radon-Nikodym derivative defined almost everywhere on (a, b), which we denote by u^{(l)}.
In that case, one may also say that u is C^l-smooth on (a, b). When the interval coincides with the whole real line, the notation is shortened to C^l. The Fisher information

I(p) = ∫_{p(x)>0} p'(x)²/p(x) dx (2.1)

is well-defined for any probability density p from the class C¹. If X is a random variable with density p, we have P{p(X) > 0} = 1, so the integration in (2.1) may be performed over the set where p(x) > 0. It may further be restricted to the set of all points x where p is differentiable with p(x) > 0 and p'(x) ≠ 0. Indeed, p'(x) ≠ 0 ⇒ p(x) > 0 (due to the property p ≥ 0), while the set where p'(x) = 0 and p(x) > 0 does not contribute to the integral in (2.1).
To get quantitative bounds on I(p), we will consider the classes C² and C³ and use the derivatives p'' and p'''. One may extend the inequality in (2.2) to the whole real line with a similar constant.

Proposition 2.3. Given a C²-smooth probability density p, we have, for all x ∈ R,

p'(x)² ≤ 2Cp(x), C = sup_x p''(x). (2.3)

The argument is based on two simple calculus lemmas.

Lemma 2.4. Given a non-negative C²-smooth function u on the interval (a, b), finite or not, assume that u''(x) ≤ C a.e. for some constant C. If u satisfies one of the following two conditions

(i) u is non-decreasing with lim inf_{x↓a} u'(x) = 0,

(ii) u is non-increasing with lim inf_{x↑b} u'(x) = 0,

then C ≥ 0, and for all x ∈ (a, b),

u'(x)² ≤ 2Cu(x). (2.4)

Proof. Under the assumption (i), necessarily C ≥ 0. Indeed, otherwise the derivative u' would be decreasing, which implies that u'(x) < u'(a+) = 0 for all a < x < b. Hence, the function u itself would be decreasing.
The scenario in (ii) is similar; it is reduced to (i) by applying the previous step to the function x → u(−x) on the interval (−b, −a).

Proof of Propositions 2.2-2.3. Let p be C²-smooth on the real line. First note that necessarily C ≥ 0 in (2.2). Indeed, if C < 0, then p' is decreasing on [a, b]. But the assumption p(a) = 0 implies p'(a) = 0 (since p ≥ 0), and then we would get that p(x) < 0 for all a < x < b. By a similar argument, we also have p'(b) = 0. Hence, one may apply Lemma 2.5 to the function u = p, and (2.2) follows.
Turning to the next proposition, again necessarily C > 0. Indeed, if C ≤ 0, then p would be concave on the whole real line, which is impossible for probability densities.
To prove the inequality (2.3), first assume that p'(x) → 0 as |x| → ∞ (this is always fulfilled, as will be shown in the next section). As in the proof of Lemma 2.5, consider the open set U = {x ∈ R : p'(x) > 0} and decompose it into at most countably many disjoint intervals (a_k, b_k). Hence, one may apply Lemma 2.5 to the interval (a_k, b_k), and we obtain (2.3) for all a_k ≤ x ≤ b_k. In the case a_k = −∞, Lemma 2.5 is also applicable due to the assumption p'(−∞) = 0. A similar argument allows us to involve the points from the open set V = {x ∈ R : p'(x) < 0} as well, and we obtain (2.3) on the whole real line.
To remove the assumption on the derivative, consider a random variable X with density p together with an independent variable Z which has a C_0^∞-smooth density q supported on a bounded interval ∆. The convolution of p with the density q_ε of the random variable εZ, ε > 0, is given by

p_ε(x) = ∫ p(x − εy) q(y) dy.

This function is C^∞-smooth, and its first two derivatives are given by

p_ε'(x) = ∫ q_ε'(x − y) p(y) dy, p_ε''(x) = ∫ p''(x − εy) q(y) dy. (2.5)

The last equality shows that p_ε''(x) ≤ C for any x ∈ R. Moreover, for every fixed x ∈ R,

p_ε(x) → p(x) and p_ε'(x) → p'(x) as ε → 0. (2.6)

In addition, using the property that q_ε' is bounded for any fixed ε, while q_ε'(x − y) → 0 as |x| → ∞ for every y ∈ R, from the first equality in (2.5) it also follows that p_ε'(x) → 0 as |x| → ∞. Hence, one may apply the first step to the density p_ε, and we get that p_ε'(x)² ≤ 2Cp_ε(x) for all x ∈ R. It remains to let ε → 0 in this inequality and refer to (2.6).
Remark 2.6. In the above argument one may also use smoothing densities that are not compactly supported, such as the standard normal density q(x) = (1/√(2π)) e^{−x²/2}, for example. Note that p and its derivative admit suitable upper bounds whenever 0 < ε ≤ 1. Thus, for each fixed x ∈ R, we have integrable majorants for the functions y → p(x − εy) and y → p'(x − εy) with respect to the probability measure q(y) dy. Hence, the Lebesgue dominated convergence theorem may be applied to obtain the desired relations in (2.6).
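The pointwise bound (2.3) can be sanity-checked numerically on a grid; here we do so for the standard normal density, where C = sup p'' = 2φ(√3) ≈ 0.178. This check is our illustration, not part of the argument.

```python
import math

phi   = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
dphi  = lambda x: -x * phi(x)             # p'
d2phi = lambda x: (x * x - 1.0) * phi(x)  # p''

xs = [-10.0 + i * 0.001 for i in range(20_001)]
C = max(d2phi(x) for x in xs)             # C = sup p'' ≈ 0.178
worst = max(dphi(x) ** 2 - 2.0 * C * phi(x) for x in xs)
print(C, worst)  # worst <= 0, i.e. p'(x)^2 <= 2 C p(x) on the grid
```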

Decay of densities and their derivatives
Suppose that the constant

C = sup_x p''(x) (3.1)

is finite for a given C²-smooth probability density p on the real line. This property turns out to be sufficient to bound p(x) and p'(x) in terms of the tails of the distribution function

F(x) = ∫_{−∞}^x p(y) dy, x ∈ R,

associated to p. Suppose for a moment that p is everywhere positive, so that F : R → (0, 1) and its inverse function F^{−1} : (0, 1) → R represent C³-smooth increasing bijections. The function L(t) = p(F^{−1}(t)) is then differentiable on (0, 1), and we have the identity L'(t) = (p'/p)(F^{−1}(t)).
An application of the bound (2.3) then yields |L'(t)| ≤ √(2C) L(t)^{−1/2}. Thus, the function y(t) = L(t)² satisfies the differential inequality y'(t) ≤ 2√(2C) y(t)^{1/4}, which is the same as (y^{3/4})'(t) ≤ (3/2)√(2C). After integration over the interval (t_0, t), 0 < t_0 < t < 1, we get

y(t)^{3/4} ≤ y(t_0)^{3/4} + (3/2)√(2C)(t − t_0).

Necessarily lim inf_{x→−∞} p(x) = 0, which is equivalent to lim inf_{t_0→0} L(t_0) = 0. Hence, letting t_0 approach zero in a proper way, from the above inequality we obtain that y(t)^{3/4} ≤ (3/2)√(2C) t.
Simplifying the numerical constant and changing the variable t = F(x), we are led to the inequality

p(x) ≤ (9C/2)^{1/3} F(x)^{2/3}, x ∈ R. (3.2)

Now, to remove the assumption that p is positive, one may consider the convolutions p_ε as in the proof of Proposition 2.3, choosing for q the density of the standard normal law. Hence, the above step yields the bound in terms of the distribution function F_ε associated to p_ε. Here, according to Remark 2.6, one may let ε → 0, and then we obtain in the limit the inequality (3.2) without any constraints. Moreover, interchanging the roles of the points −∞ and ∞, we have a similar bound with 1 − F(x) in place of F(x). Once we have established these estimates for p(x), we also obtain similar ones for p'(x), by applying Proposition 2.3. One may now summarize.

Proposition 3.1. Using the constant C as in (3.1), we have for all x ∈ R,

p(x) ≤ (9C/2)^{1/3} min(F(x), 1 − F(x))^{2/3}, (3.3)

|p'(x)| ≤ √(2C) (9C/2)^{1/6} min(F(x), 1 − F(x))^{1/3}. (3.4)

In particular, p(x) → 0 and p'(x) → 0 as |x| → ∞.
The right-hand sides of (3.3)-(3.4) may further be bounded in terms of the absolute moments β_s = E |X|^s of a random variable X with density p. Indeed, by Chebyshev's inequality, min(F(x), 1 − F(x)) ≤ β_s/|x|^s for all x ≠ 0.

Corollary 3.2. Assuming that the constant C and the moment β_s are finite for a real number s > 0, we have for all x ≠ 0,

p(x) ≤ (9C/2)^{1/3} (β_s/|x|^s)^{2/3}, |p'(x)| ≤ √(2C) (9C/2)^{1/6} (β_s/|x|^s)^{1/3}.

In particular, if β_s < ∞ for some s > 3, then p has a bounded total variation.
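Assuming our reading of (3.3), the tail bound can be probed numerically for the standard normal law, with C = sup p'' as in (3.1); the helper names below are ours, and this check is illustrative only.

```python
import math

phi = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # distribution function F

C = max((x * x - 1.0) * phi(x) for x in (i * 0.001 for i in range(-10_000, 10_001)))
K = (9.0 * C / 2.0) ** (1.0 / 3.0)

# Smallest gap between the right- and left-hand sides of (3.3) on a grid.
gap = min(K * min(Phi(x), 1.0 - Phi(x)) ** (2.0 / 3.0) - phi(x)
          for x in (i * 0.01 for i in range(-600, 601)))
print(gap)  # non-negative: the bound holds at every grid point
```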
Another application of Propositions 2.3 and 3.1 concerns an alternative (classical) formula for the Fisher information; there, the argument is completed by performing summation over all k.

Unimodal and quasi-unimodal distributions
Proposition 2.3 may also be applied to control the Fisher information for a large variety of densities, as in the following statement. It remains to add these three estimates.
One interesting case in (4.1) is when a = b.This corresponds to the so-called unimodal distributions on the real line with mode at the point a.With this in mind, the more general case a < b may be referred to as the class of quasi-unimodal distributions.
In the unimodal case, (4.1) simplifies to (4.2). But then one can further relax the basic hypothesis on the second derivative.

Proof. Again, let a_0 = inf{x : p(x) > 0} and b_0 = sup{x : p(x) > 0}, so that necessarily a_0 < a < b_0, by the continuity of p. It is also necessary that p'(a_0+) = p'(b_0−) = 0. This follows from the fact that p' is continuous on (−∞, a) and lim_{x→−∞} p'(x) = 0 (since otherwise p would not be integrable), and similarly for the second half-axis. Also, by the integrability argument, we have C_0 ≥ 0 and C_1 ≥ 0. Hence, we are in a position to apply Lemma 2.4 with u = p, which yields

p'(x)² ≤ 2C_0 p(x) for x < a, p'(x)² ≤ 2C_1 p(x) for x > a.
It remains to repeat the argument from the proof of Proposition 4.1.
As an example, one may consider the symmetric exponential distribution with density p(x) = (1/2) e^{−|x|}. It satisfies the assumptions of Proposition 4.2 with mode at a = 0 and C_0 = C_1 = 1/2. Hence, by (4.2), I(p) ≤ 2√2. Note that in fact I(p) = 1, while Proposition 4.1 is not applicable.
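The exact value I(p) = 1 for the symmetric exponential density is easy to confirm numerically, since p'²/p = p almost everywhere; the sketch below is our illustration.

```python
import math

# Symmetric exponential (Laplace) density p(x) = e^{-|x|}/2 and its a.e. derivative.
p  = lambda x: 0.5 * math.exp(-abs(x))
dp = lambda x: -math.copysign(1.0, x) * p(x)

# Midpoint Riemann sum for I(p) = ∫ p'(x)^2 / p(x) dx; here p'^2/p = p, so I = 1.
n = 400_000
h = 80.0 / n
I = sum(dp(x) ** 2 / p(x) * h for x in (-40.0 + (i + 0.5) * h for i in range(n)))
print(I)  # I(p) = 1
```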

Total variation norm via higher order derivatives
To further develop upper bounds on the Fisher information, we need to see how one can estimate the L¹-norms of the first and second derivatives of a smooth density in terms of the L¹-norm of its third derivative. This is a preliminary step towards Theorem 1.2.
Proposition 5.1. For any function p in C³ that is integrable together with its third derivative, the inequality (5.1) holds.

Perhaps this relation is known (up to the factors in front of the integrals). Note that a similar inequality for the L²-norms is obvious. Indeed, under proper integrability assumptions, applying the Plancherel theorem, (5.2) may be rewritten in terms of the Fourier transform, and then it readily holds in view of the corresponding pointwise bound. However, the finiteness of the integrals on the right-hand side of (5.2) does not guarantee that p will have a finite total variation. For example, consider a C^∞-smooth function p on the real line which vanishes for x ≤ 0 and oscillates at infinity with polynomial rates governed by parameters α and β. As is easy to see, p ∈ L² if and only if α < (2β + 3)/4, which may happen when 1 < β < 3/2. Involving higher order derivatives, one may obtain similar relations in the spirit of the inequality (5.1), like the following ones, which we prefer to state in multiplicative form.

Corollary 5.2. For any function p in C³ that is integrable and has an integrable third derivative, the L¹-norms of p' and p'' admit multiplicative bounds in terms of the L¹-norms of p and p'''.

In other words, we arrive at a convexity-type relation for the sequence a_k = log ‖p^{(k)}‖_{L¹}. An application of (5.3) to p' in place of p leads to a_2 ≤ (1/2)a_1 + (1/2)a_3 + h, which, by (5.4), implies the required bound.
Involving further derivatives in a similar manner, we arrive at the following:

Corollary 5.3. If the function p in C^l is integrable and has an integrable derivative p^{(l)} of order l ≥ 2, then all intermediate derivatives p^{(k)}, 1 ≤ k ≤ l − 1, are integrable as well.
Proof of Proposition 5.1. First, let us derive an upper bound on the L¹-norm of p' over the unit interval. One may start with the weighted L¹-Poincaré-type inequality (5.5), in which equality is attained in the asymptotic sense for indicator-type functions u. To prove it, note that, by Jensen's inequality, the left integral in (5.5) does not exceed the corresponding repeated integral, which proves (5.5). Using this inequality with u = p', we get (5.6). Next, we need to derive an upper bound on the increment m of p on [0, 1] analogous to the right-hand side in (5.1). By Taylor's integral formula, we have an identity for p(x + h), h ∈ R; writing this identity with −h in place of h and averaging, we get a symmetrized identity.

Bounds for Fisher information
Let us now integrate this identity over 0 < h < 1. This gives another general identity. Applying it to the function x → p(1 + x), we also get a companion estimate, and thus, together with (5.6), we arrive at the similar bound (5.7). Let us now apply the relation (5.7) to the functions p(x + k) and perform summation over all integers k. This gives (5.8) with the weight function W(x) = Σ_{k∈Z} w(x + k), which one can easily evaluate using the property that it is 1-periodic. Restricting ourselves to the values x ∈ [0, 1], we see that this function is symmetric about the point x = 1/2 and admits a uniform bound for all x ∈ R, so that (5.8) yields the desired inequality (5.1).

The use of the third derivative
The boundedness condition on the second derivative p'' is guaranteed, for example, by the integrability of the third derivative p'''. Hence, some of the previous statements can be restated in terms of the L¹-norm of p'''. As a first step towards the one-dimensional variant of Theorem 1.2, here we prove the following relation, basically following the arguments described in the book by Bogachev [6] and employing Corollary 5.2. Let us use the notation

I(p) = ∫_{p(x)>0} p'(x)²/p(x) dx

for all non-negative functions p from the class C¹ (even if p is not a probability density).
Proposition 6.1. For any non-negative function p of class C³, the inequality (6.1) holds.

Proof. One may assume that all integrals on the right-hand side are finite and that p is not identically zero. In particular, p, p' and p'' are bounded on the whole real line.
By monotonicity, p(x) > 0 on ∆, and we may define a positive function v on it. Note that p' > 0 on ∆. Turning to the set ∆', note that it is open and can be decomposed into at most countably many disjoint intervals (a_k, b_k). In particular, v(a_k) = 2p'(a_k) in the case a_k > a.
The function v(x) has a continuous derivative satisfying (6.3). In the case a_k = a > −∞, recall that p'(a_k) = 0. Given ε > 0 such that a + ε < b_k, one may apply Lemma 2.4 (i) to the function u = p on the interval (a, a + ε), which yields the same bound with the constant C_ε = sup_{(a,a+ε)} p''. Letting ε → 0 and using the continuity of p'', we have C_ε → p''(a), so that the above relation leads to (6.3) in this case as well. Note that, by (6.3), necessarily p'(a_k) > 0 whenever a_k > a.
As for the remaining case a_k = −∞, note that, since p'(x)² ≤ 2Cp(x) for all x ∈ R with the constant C = sup_x p''(x) (Proposition 2.3), and since p'(x) > 0 on U, we have p'(x) → 0 as x → −∞, where we used the property p(x) → 0 as x → −∞ (Proposition 3.1). Thus, necessarily −∞ < a_k < b_k ≤ b < ∞, p'(a_k) > 0, and the inequality (6.3) holds true on every interval (a_k, b_k). Applying it, together with a simple auxiliary inequality, we further estimate the right-hand side by distinguishing two scenarios: in one of them we obtain an estimate similar to (6.2), while in the other there is a point at which (6.4) gives a corresponding estimate. One can unite both scenarios using a formally weaker relation. Let us now perform summation over all k; the resulting right-hand side dominates that of (6.2). Hence, adding the two inequalities with integrals taken over ∆ and ∆', we obtain a bound over their union. It remains to perform summation over all intervals ∆ contained in the decomposition of U, which leads to the similar bound with the integral taken over U. A similar relation holds true when integrating over the set V = {x ∈ R : p'(x) < 0} in place of U (alternatively, one may apply the previous step to the function x → p(−x)). Adding the two inequalities with integrals over U and V, we then get (6.1).

Theorems 1.1-1.2 in the one-dimensional case
We are prepared to prove the one-dimensional variant of Theorem 1.2.

Proof. Denote by A the integral in (7.1). We apply Corollary 5.2 to (6.1) to bound I(p) in terms of A alone. This inequality is not invariant under rescaling of the space variable. So, let us apply it to the probability densities p_λ(x) = λp(λx) with parameter λ > 0, for which I(p_λ) = λ²I(p) and ∫ |p_λ'''(x)| dx = λ³A. Optimizing over all λ, we arrive at I(p) ≤ cA^{2/3} with constant c = (8√3 + 16)/3 < 10.
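The optimization step rests on the scaling identities I(p_λ) = λ²I(p) and ∫|p_λ'''| = λ³∫|p'''|. These can be confirmed numerically, e.g. for the standard normal density (our sketch; the helper names `fisher` and `third_norm` are ours):

```python
import math

phi   = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
dphi  = lambda x: -x * phi(x)
d3phi = lambda x: (3.0 * x - x ** 3) * phi(x)  # third derivative of phi

def fisher(lam, n=200_000, R=8.0):
    # Fisher information of the rescaled density p_lam(x) = lam * phi(lam * x).
    h = 2.0 * R / n
    return sum((lam ** 2 * dphi(lam * x)) ** 2 / (lam * phi(lam * x)) * h
               for x in (-R + (i + 0.5) * h for i in range(n)))

def third_norm(lam, n=200_000, R=8.0):
    # L1-norm of the third derivative of p_lam.
    h = 2.0 * R / n
    return sum(lam ** 4 * abs(d3phi(lam * x)) * h
               for x in (-R + (i + 0.5) * h for i in range(n)))

ratio_I = fisher(2.0) / fisher(1.0)
ratio_A = third_norm(2.0) / third_norm(1.0)
print(ratio_I, ratio_A)  # ≈ λ² = 4 and λ³ = 8 for λ = 2
```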
We now consider a particular case of the inequality (7.1) when p has a convolution structure.
First, let us recall some basic properties of this operation. Given integrable functions p_1 and p_2, the convolution

p(x) = (p_1 * p_2)(x) = ∫ p_1(x − y) p_2(y) dy (7.2)

is defined a.e. and represents an integrable function with L¹-norm ‖p_1 * p_2‖_{L¹} ≤ ‖p_1‖_{L¹} ‖p_2‖_{L¹}. Moreover, if the p_j are non-negative, the integral in (7.2) is well-defined for every fixed x (although it may be infinite); it does not change when modifying the p_j on a set of Lebesgue measure zero.
In general, convolution improves smoothness. For example, if p_1 and p_2 are bounded, the function p = p_1 * p_2 is bounded and uniformly continuous. In this case, both p_1 and p_2 belong to L²(R), and so do their Fourier transforms, by the Plancherel theorem. Hence, the Fourier transform p̂ = p̂_1 p̂_2 is an integrable function, which implies the desired assertion by applying the inverse Fourier formula.
We will need the following elementary statement.
Lemma 7.2. If non-negative integrable functions p_j, 1 ≤ j ≤ l, are absolutely continuous and have integrable Radon-Nikodym derivatives p_j', then the convolution p = p_1 * ⋯ * p_l belongs to the class C^l. Moreover, its derivatives up to order l − 1 are bounded and integrable, while the l-th Radon-Nikodym derivative of p represents the convolution

p^{(l)} = p_1' * ⋯ * p_l'. (7.4)

Proof. For simplicity, let us consider the case l = 2. Put q_j = p_j', j = 1, 2. According to (7.2), for any x ∈ R,

p(x) = ∫ p_1(x − y) p_2(y) dy = ∫∫_{ξ + y < x} q_1(ξ) p_2(y) dξ dy.

In particular, |p(x)| ≤ ‖q_1‖_{L¹} ‖p_2‖_{L¹} < ∞, so that p is bounded. After the change of variable s = ξ + y, the last double integral may be rewritten as ∫_{−∞}^x (q_1 * p_2)(s) ds. This equality shows that p is absolutely continuous and has an integrable Radon-Nikodym derivative p' = q_1 * p_2. Thus,

p'(x) = ∫ q_1(x − y) p_2(y) dy.

The last integral is finite and represents a continuous bounded function of x, with |p'(x)| ≤ ‖q_1‖_{L¹} ‖q_2‖_{L¹} < ∞ (using sup_x p_2(x) ≤ ‖q_2‖_{L¹}).
After the same change of variable, the resulting double integral may be rewritten so as to show that p' is absolutely continuous and has an integrable Radon-Nikodym derivative p'' = q_1 * q_2. In particular, p ∈ C². The inequality in (7.5) follows from (7.3). The general case l ≥ 2 in (7.4)-(7.5) is similar.
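The case l = 2 of Lemma 7.2 says that (p_1 * p_2)'' = p_1' * p_2'. For two standard normal densities this can be confirmed numerically, since p_1 * p_2 is the N(0, 2) density with an explicit second derivative (our sketch; `conv` is a hypothetical helper name):

```python
import math

phi  = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
dphi = lambda x: -x * phi(x)

def conv(f, g, x, n=40_000, R=20.0):
    # Numerical convolution (f * g)(x) = ∫ f(x - y) g(y) dy (midpoint rule).
    h = 2.0 * R / n
    return sum(f(x - y) * g(y) * h for y in (-R + (i + 0.5) * h for i in range(n)))

# phi * phi is the N(0, 2) density, whose second derivative is explicit:
p   = lambda x: math.exp(-x * x / 4.0) / math.sqrt(4.0 * math.pi)
d2p = lambda x: (x * x / 4.0 - 0.5) * p(x)

x0 = 0.7
lhs = conv(dphi, dphi, x0)  # (p1' * p2')(x0)
rhs = d2p(x0)               # (p1 * p2)''(x0)
print(lhs, rhs)  # the two values agree
```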

In the case l = 3, one may combine Lemma 7.2 with Proposition 7.1 to obtain the following consequence of (7.1).
Proposition 7.3. Given absolutely continuous probability densities p_j, j = 1, 2, 3, the convolution p = p_1 * p_2 * p_3 belongs to the class C³ and has finite Fisher information satisfying

I(p) ≤ c (‖p_1'‖_{L¹} ‖p_2'‖_{L¹} ‖p_3'‖_{L¹})^{2/3}. (7.6)

The inequality (7.6) may be further extended to the class of probability densities of bounded variation by a suitable approximation. On the real line, the total variation semi-norm of a function p is defined by

‖p‖_TV = sup Σ_{k=0}^{m−1} |p(x_{k+1}) − p(x_k)|, (7.7)

where the supremum is taken over all finite collections of points x_0 < x_1 < ⋯ < x_m. If this quantity is finite, the one-sided limits p(x−) and p(x+) exist and are finite for all x ∈ R. Without loss of generality, one may always assume that the value p(x) lies between these limits. For example, one may require that p(x+) = p(x), that is, that p is right-continuous. With this requirement, the value in (7.7) coincides with the so-called essential total variation semi-norm, which is consistent with the definition (1.5).
If p is integrable, then necessarily p(−∞) = p(∞) = 0. Hence, restricted to the linear space of all integrable right-continuous functions p of bounded variation, ‖p‖_TV represents a norm. If p is absolutely continuous and has a Radon-Nikodym derivative p', then

‖p‖_TV = ∫ |p'(x)| dx. (7.8)

Like the Fisher information, the total variation norm does not increase under convolution with an arbitrary probability density q:

‖p * q‖_TV ≤ ‖p‖_TV. (7.9)

Proof of Theorem 1.1 (n = 1). Let p_j, j = 1, 2, 3, be probability densities with finite total variation norms b_j = ‖p_j‖_TV. Introduce the normal density φ_ε(x) = (1/(ε√(2π))) e^{−x²/2ε²} with mean zero and standard deviation ε > 0, and define the convolutions p_{j,ε} = p_j * φ_ε together with p_ε = p_{1,ε} * p_{2,ε} * p_{3,ε}. All these functions represent C^∞-smooth probability densities, so that the relations (7.6) and (7.8) are applicable, and they yield

I(p_ε) ≤ c (b_1 b_2 b_3)^{2/3},

where we made use of (7.9) in the last step. It remains to apply the lower semi-continuity of the Fisher information, which gives I(p) ≤ lim inf_{ε→0} I(p_ε). Thus, Theorem 1.1 is proved with c = 10.
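The contraction property (7.9) and the smoothing scheme of the proof are easy to observe numerically: convolving 1_[0,1] (total variation 2) with φ_ε gives, by (7.8), the total variation ∫ |φ_ε(x) − φ_ε(x − 1)| dx, which stays below 2 and approaches it as ε → 0 (our illustration):

```python
import math

def phi_eps(x, eps):
    return math.exp(-x * x / (2.0 * eps * eps)) / (eps * math.sqrt(2.0 * math.pi))

def tv_smoothed(eps, n=200_000, R=10.0):
    # ||1_[0,1] * phi_eps||_TV = ∫ |phi_eps(x) - phi_eps(x - 1)| dx  by (7.8).
    h = 2.0 * R / n
    return sum(abs(phi_eps(x, eps) - phi_eps(x - 1.0, eps)) * h
               for x in (-R + (i + 0.5) * h for i in range(n)))

tvs = [tv_smoothed(eps) for eps in (1.0, 0.3, 0.1)]
print(tvs)  # increasing toward, but never exceeding, ||1_[0,1]||_TV = 2
```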

Weak derivatives
In spaces of higher dimensions, the general theory of the Fisher information is somewhat different than in dimension one. For example, for the finiteness of I(p), the density p need not be bounded or continuous anymore, in contrast with the one-dimensional case. We refer the interested reader to [4] for related issues.
Upper bounds on the Fisher information for probability densities on R^n may be explored in appropriate Sobolev spaces. The main approach to the definition of Sobolev spaces is based on the integration-by-parts formula. Let us recall some basic notation and facts of this theory and give some additional remarks (for background we refer to [17], [10]). As usual, C_0^∞(R^n) denotes the space of all compactly supported functions w on R^n that have continuous partial derivatives of all orders. We are especially interested in partial derivatives along one variable only, in which case we write D_i^l for the l-th partial derivative in the variable x_i.

Proposition 8.2. A locally integrable function u on the real line has a generalized l-th derivative v of an integer order l ≥ 1 if and only if u = ũ a.e. for some ũ from the class C^l. In this case, v = ũ^{(l)} a.e.
In the case l = 1, Proposition 8.2 thus tells us that a locally integrable function on the real line has a generalized derivative if and only if, after a modification on a set of Lebesgue measure zero, it becomes locally absolutely continuous.
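The case l = 1 can be illustrated numerically (our sketch, not part of the proof): u(x) = |x| is locally absolutely continuous with Radon-Nikodym derivative sign(x), and the integration-by-parts identity ∫ u w' dx = −∫ sign(x) w(x) dx holds for any test function w ∈ C_0^∞(R); here w is a shifted bump function.

```python
import math

def bump(x):   # C0-infinity bump supported on (-1, 1)
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1.0 else 0.0

def dbump(x):  # its derivative
    return -2.0 * x / (1.0 - x * x) ** 2 * bump(x) if abs(x) < 1.0 else 0.0

w  = lambda x: bump(x - 0.3)   # test function supported on (-0.7, 1.3)
dw = lambda x: dbump(x - 0.3)

u = abs                                # u(x) = |x|
v = lambda x: math.copysign(1.0, x)    # its Radon-Nikodym derivative sign(x)

n = 200_000
h = 4.0 / n
xs = [-2.0 + (i + 0.5) * h for i in range(n)]
left  = sum(u(x) * dw(x) * h for x in xs)   # ∫ u w' dx
right = -sum(v(x) * w(x) * h for x in xs)   # -∫ v w dx
print(left, right)  # equal: sign(x) serves as the generalized derivative of |x|
```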
In the proof we involve the so-called regularized functions, which are commonly used for the approximation of Sobolev functions. Let ω ∈ C_0^∞(R^n) be non-negative and compactly supported, with ∫ ω dx = 1, so that ω is a probability density. The regularized functions are defined as the convolutions

u_ε(x) = ∫ u(x − εy) ω(y) dy, ε > 0.
These functions are locally integrable, belong to the class C^∞(R^n), and we have the commutativity (u_{ε_1})_{ε_2} = (u_{ε_2})_{ε_1}, ε_j > 0. Let us list a few elementary properties of the regularized functions, for which the choice of the regularizer ω is irrelevant.

Lemma 8.4. Let u be a locally integrable function on R^n and α a multi-index.

1) u_ε(x) → u(x) as ε → 0 for almost all x ∈ R^n.

2) For any w ∈ C_0^∞(R^n), we have ∫ u_ε w dx = ∫ u w_ε dx. (8.2)

3) Moreover, D^α u_ε = (D^α u)_ε provided that u has a generalized derivative D^α u.
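Property 1) can be observed directly; the following sketch (ours, in dimension n = 1, with a grid-based normalization of the regularizer) regularizes the discontinuous Heaviside step and watches the values converge at a point of continuity.

```python
import math

def omega0(x):  # unnormalized C0-infinity bump on (-1, 1)
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1.0 else 0.0

m = 20_000
h = 2.0 / m
grid = [-1.0 + (i + 0.5) * h for i in range(m)]
Z = sum(omega0(y) * h for y in grid)     # normalizing constant
omega = lambda y: omega0(y) / Z          # probability density (the regularizer)

u = lambda x: 1.0 if x >= 0 else 0.0     # Heaviside step: not even continuous

def u_eps(x, eps):
    # Regularized function u_eps(x) = ∫ u(x - eps*y) omega(y) dy.
    return sum(u(x - eps * y) * omega(y) * h for y in grid)

vals = [u_eps(0.5, eps) for eps in (1.0, 0.4, 0.1)]
print(vals)  # tends to u(0.5) = 1 as eps decreases; each u_eps is smooth
```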
Lemma 8.5. Given a locally integrable function u on R^n and a multi-index α with l = |α|, suppose that, for any w ∈ C_0^∞(R^n),

∫ u D^α w dx = 0. (8.3)

Then u = ũ a.e. for some polynomial ũ(x_1, …, x_n) in n real variables of degree at most l − 1.
Proof. Starting from (8.3), we obtain a similar equality for the regularized functions, i.e.

∫ u_ε D^α w dx = 0. (8.4)

Indeed, according to (8.2), and applying Fubini's theorem so as to justify the change of the order of integration, the above integral equals an integral of u against the regularization of D^α w; but, by (8.3), the inner integral on the right-hand side vanishes for any fixed y ∈ R^n. Now, since u_ε is C^∞-smooth, one may integrate by parts in (8.4) and conclude that ∫ w D^α u_ε dx = 0 for all w ∈ C_0^∞(R^n). This implies that D^α u_ε(x) = 0 for all x ∈ R^n, which is only possible when u_ε is a polynomial of degree at most d = l − 1. It remains to apply property 1) and note that the pointwise limit of polynomials of degree at most d is a polynomial of degree at most d.
Proof of Proposition 8.2. In one direction (the sufficiency part), one may assume that ũ = u, so that the function u belongs to the class C^l. In particular, its l-th derivative u^{(l)}, understood in the Radon-Nikodym sense, is locally integrable on the real line. Hence

u^{(l−1)}(b) − u^{(l−1)}(a) = ∫_a^b u^{(l)}(x) dx for all a < b,

which implies that u^{(l−1)} has bounded variation on every bounded interval. Then one may integrate by parts to get that, for any w ∈ C_0^∞(R),

∫ u^{(l−1)} w' dx = −∫ u^{(l)} w dx.

Since the first l − 1 derivatives of u are continuous, one may further integrate by parts, which leads to the desired equality

∫ u w^{(l)} dx = (−1)^l ∫ u^{(l)} w dx.

According to (8.1), this shows that u^{(l)} serves as an l-th generalized derivative for u.

Arguing in the opposite direction, suppose that, for some locally integrable function v(x), we have

∫ u w^{(l)} dx = (−1)^l ∫ v w dx (8.5)

for all w ∈ C_0^∞(R). Introduce the integration operators

T v(x) = ∫_0^x v(y) dy, T^k v = T(T^{k−1} v), (8.6)

and note that the function T^k v belongs to C^k and has a k-th Radon-Nikodym derivative v, as long as v is locally integrable. In particular, T v is locally absolutely continuous and has v as its Radon-Nikodym derivative. Hence, one may integrate by parts to get that ∫ (T v) w' dx = −∫ v w dx for any w ∈ C_0^∞(R). By repeated integration by parts, we obtain that

∫ (T^l v) w^{(l)} dx = (−1)^l ∫ v w dx.

In view of (8.5), this gives ∫ (u − T^l v) w^{(l)} dx = 0. We are in a position to apply Lemma 8.5 (in dimension n = 1) and conclude that u − T^l v = Q a.e. for some polynomial Q of degree at most d = l − 1. Then ũ = T^l v + Q belongs to C^l, has an l-th generalized derivative v, and is equal to u a.e.

Weak derivatives along single variables
A similar characterization of the generalized partial derivatives D_i^l, 1 ≤ i ≤ n, also holds in the n-dimensional case. Fix an integer l ≥ 1.
Proposition 9.1. A locally integrable function u on R^n, n ≥ 2, has a generalized partial derivative v = D_i^l u, if and only if u = ũ a.e. for some Borel measurable function ũ such that, for almost all points (x_j)_{j≠i} ∈ R^{n−1}, the function f(x_i) = ũ(x_1, ..., x_{i−1}, x_i, x_{i+1}, ..., x_n) belongs to the class C^l and has an l-th Radon-Nikodym derivative f^(l) which is Borel measurable and locally integrable on R^n. In this case, v = f^(l) a.e.
Proof.We apply an argument as in the proof of Proposition 8.2, with a few modifications.
In the sufficiency direction, assume that ũ = u. Write x = (x_i, x̄) with x_i ∈ R, x̄ ∈ R^{n−1}, and let E_i ⊂ R^{n−1} be an exceptional null set of collections x̄ = (x_j)_{j≠i}. By the assumption, for every fixed x̄ outside this set, 1) the function f(x_i) belongs to the class C^l; 2) there is a representative for its l-th Radon-Nikodym derivative f^(l)(x_i) = v(x), which defines a Borel measurable, locally integrable function on R^n.
The first property allows us to perform the repeated integration by parts along the i-th coordinate to get that, for any w ∈ C_0^∞(R^n),

∫ f^(l)(x_i) w(x) dx_i = (−1)^l ∫ f(x_i) D_i^l w(x) dx_i.

By property 2), and since u is Borel measurable and locally integrable, both sides of this equality represent integrable functions on R^{n−1}. Using Fubini's theorem, one may integrate over x̄ to get

∫ v(x) w(x) dx = (−1)^l ∫ u(x) D_i^l w(x) dx.

Hence, according to (8.1), the function v(x) = f^(l)(x_i) serves as an l-th generalized derivative for u.
For the opposite direction, assume that a generalized partial derivative v = D_i^l u exists, so that it may be chosen to be Borel measurable and locally integrable. Then the function x_i → v(x_i, x̄) is locally integrable on the real line for almost all x̄ ∈ R^{n−1}. Indeed, since v is locally integrable, for all integers m, N ≥ 1 the set of points x̄ such that ∫_{−m}^{m} |v(x_i, x̄)| dx_i < ∞ is Borel measurable and has full Lebesgue measure inside the cube B_N. Hence the intersection A of these sets has full Lebesgue measure on R^{n−1}. But it contains exactly those points x̄ ∈ R^{n−1} for which the function x_i → v(x_i, x̄) is integrable on all bounded intervals of the real line.
Next, we employ the integration operators (8.6), applied along the i-th coordinate: for x̄ ∈ A, put

T v(x_i, x̄) = ∫_0^{x_i} v(y, x̄) dy.

Clearly, the function T v is finite, Borel measurable on R × A, and locally integrable, since |T v(x_i, x̄)| ≤ ∫_{−m}^{m} |v(y, x̄)| dy for |x_i| ≤ m and all m, N ≥ 1. The same conclusions are also true for all functions T_k v.
Moreover, since x_i → v(x_i, x̄) is locally integrable, the function x_i → T_k v(x_i, x̄) belongs to the class C^k and has v(x_i, x̄) as a generalized derivative of order k. In particular, T v(x_i, x̄) is locally absolutely continuous with respect to x_i, and one may integrate by parts with respect to this variable to get that, for any w ∈ C_0^∞(R^n),

∫ T v(x_i, x̄) D_i w(x) dx_i = −∫ v(x_i, x̄) w(x) dx_i.

By repeated integration by parts, we obtain that

∫ T_l v(x_i, x̄) D_i^l w(x) dx_i = (−1)^l ∫ v(x_i, x̄) w(x) dx_i.

Here, the integrands represent Borel measurable, integrable functions. Hence, it is possible to integrate both sides according to Fubini's theorem, and then we arrive at

∫ v(x) w(x) dx = (−1)^l ∫ T_l v(x) D_i^l w(x) dx.
Applying (8.1), which is our hypothesis, this yields

∫ (u(x) − T_l v(x)) D_i^l w(x) dx = 0 for all w ∈ C_0^∞(R^n).

We are in a position to apply Lemma 8.5 and conclude that u − T_l v = Q a.e. for some polynomial Q in n real variables of degree at most l − 1. It remains to put ũ(x) = T_l v(x) + Q(x).

This characterization may be used to derive the following assertion, which will be needed in order to correctly introduce the Fisher information.
Corollary 9.3. If a non-negative locally integrable function u on R^n has a generalized gradient with partial derivatives D_i u, then the sets {x ∈ R^n : u(x) = 0, D_i u(x) ≠ 0}, 1 ≤ i ≤ n, have Lebesgue measure zero.
Proof. We may assume that u is properly modified so that u = ũ. Then, using the previous notation x = (x_i, x̄), we have that, for all x̄ except for a null set A_i ⊂ R^{n−1}, the function u_i(x_i) = u(x_i, x̄) is a.e. differentiable, and its derivative u_i′(x_i) serves as a generalized partial derivative D_i u(x). Since u_i ≥ 0, it follows that u_i(x_i) = 0 ⇒ u_i′(x_i) = 0 at every point of differentiability (similarly to dimension one). Hence, the set {x_i ∈ R : u_i(x_i) = 0, u_i′(x_i) ≠ 0} has Lebesgue measure zero on the real line. By Fubini's theorem, the corresponding set in R^n has mes_n-measure zero as well, where mes_n stands for the Lebesgue measure on R^n.
Locally Lipschitz functions.If u has a finite Lipschitz semi-norm in some neighborhood of any point, it is almost everywhere differentiable, so that it has a usual gradient ∇u(x) a.e.(by Rademacher's theorem, cf.[17], p. 50).Such functions are locally absolutely continuous along every line, and the usual gradient also serves as a generalized gradient.As a representative of the modulus of the gradient, one may take a locally finite function

Sobolev spaces
Here and elsewhere, we use the usual notation L^s(R^n) for the space of all measurable functions u on R^n with finite norm

‖u‖_s = (∫ |u(x)|^s dx)^{1/s}.

Given an integer l ≥ 1, the Sobolev space W_l^s(R^n) with parameters (l, s) is defined as the collection of all functions u ∈ L^s(R^n) whose generalized derivatives D^α u of orders |α| ≤ l belong to L^s(R^n). It is a Banach space endowed with the norm

‖u‖_{W_l^s} = Σ_{|α| ≤ l} ‖D^α u‖_s.

When s = 2, we obtain a Hilbert space. Thus, W_1^s(R^n) contains all functions u in L^s(R^n) that have a generalized gradient ∇u such that |∇u| belongs to the same space L^s(R^n).
A function u belongs to W_1^s(R^n) if and only if, after a modification on a set of measure zero, the modified function ũ is locally absolutely continuous on almost all lines parallel to the coordinate axes, with partial Radon-Nikodym derivatives belonging to L^s(R^n), cf. [17], pp. 44-45. This characterization is also a consequence of Corollary 9.2, which in turn is a particular case of the more general Proposition 9.1. Thus, the generalized partial derivatives ∂_{x_i} ũ(x), x = (x_1, ..., x_n) ∈ R^n, with respect to the i-th coordinate may be understood in the usual sense for almost all collections (x_j)_{j≠i}.
If s > 1, the space W_1^s(R^n) is reflexive. For s = 2, another characterization can be given in terms of the Fourier transform

û(t) = ∫ e^{i⟨t,x⟩} u(x) dx, t ∈ R^n,

which is well-defined as an element of L^2(R^n).

In this way, W_1^2(R^n) may be identified with the usual L^2-space over the measure (1 + |t|^2) dt.
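As a quick numerical sanity check of this identification (a sketch; the Fourier convention û(t) = ∫ e^{itx} u(x) dx is assumed here, matching the characteristic-function convention f(t) = E e^{itX} used in the introduction), one can test the identity ‖u‖_2^2 + ‖u′‖_2^2 = (2π)^{−1} ∫ (1 + t^2)|û(t)|^2 dt on a one-dimensional Gaussian, whose transform is known in closed form:

```python
import math

# Numerical check of the Fourier characterization of W_1^2(R):
#   ||u||_2^2 + ||u'||_2^2 = (2*pi)^(-1) * Integral (1 + t^2) |u_hat(t)|^2 dt,
# tested on u(x) = exp(-x^2/2), whose transform (convention
# u_hat(t) = Integral e^{itx} u(x) dx) is sqrt(2*pi) * exp(-t^2/2).

def trapezoid(f, a, b, n):
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

u = lambda x: math.exp(-x * x / 2)
du = lambda x: -x * math.exp(-x * x / 2)            # u'(x)
uhat2 = lambda t: 2 * math.pi * math.exp(-t * t)    # |u_hat(t)|^2

space_side = trapezoid(lambda x: u(x) ** 2 + du(x) ** 2, -10, 10, 20000)
fourier_side = trapezoid(lambda t: (1 + t * t) * uhat2(t), -10, 10, 20000) / (2 * math.pi)

print(space_side, fourier_side)   # both close to 3*sqrt(pi)/2
```

Both quadratures agree with the exact value 3√π/2, illustrating that the W_1^2-norm is an L^2-norm with respect to the weight 1 + |t|^2.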
Example. In contrast with the one-dimensional case, the elements of the Sobolev space W_1^s(R^n) in dimension n ≥ 2 need not be bounded or continuous. One may consider the example of a function which is unbounded near zero, belongs to all L^s(R^n), and has usual partial derivatives ∂_{x_i} u(x) for all x ≠ 0. These derivatives are integrable to any power s < n.

Sobolev inequalities. It is a well-known classical fact that any function u in W_1^1(R^n) satisfies

‖u‖_{n/(n−1)} ≤ C_n ‖∇u‖_1, (10.1)

with a constant C_n independent of u. When the constant is optimal, equality in (10.1) is attained asymptotically as u approaches indicator functions of Euclidean balls.
The inequality (10.1) may be extended to the W_1^s-space with 1 ≤ s < n as the relation

‖u‖_{s*} ≤ C_{n,s} ‖∇u‖_s,

where s* = ns/(n − s) is the so-called Sobolev conjugate.
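The exponent s* is explicit and easy to tabulate; a trivial helper (illustrative only, the name sobolev_conjugate is ours) shows how s* grows with s and blows up as s approaches the dimension n:

```python
# A small illustration of the Sobolev conjugate s* = n*s/(n - s) for 1 <= s < n:
# s* increases with s and tends to infinity as s approaches the dimension n.

def sobolev_conjugate(s, n):
    assert 1 <= s < n
    return n * s / (n - s)

# In R^3: s = 1 gives the Gagliardo-Nirenberg exponent n/(n-1) = 3/2,
# and s = 2 gives the classical exponent 2n/(n-2) = 6.
print(sobolev_conjugate(1, 3), sobolev_conjugate(2, 3))
```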
Elements of W s l (R n ) are called Sobolev functions.They can be well approximated by smooth functions using any regularizer ω and associated regularized functions which we discussed before, cf.Definition 8.3.Let us extend the list of basic properties given in Lemma 8.4 by the following.Below, α denotes an arbitrary multi-index and s ≥ 1.
Proposition 10.1. Let u be a locally integrable function on R^n. 1) as well, and moreover, 3) In addition, u_ε is bounded: , then with some constant C depending on ω only, we have All properties and their proofs are rather standard. For illustration, let us explain the inequality in 7). By Definition 8.3, assuming that u is smooth and that ω is supported in the ball of radius r, we have, by Hölder's inequality, Hence At this step, the smoothness condition may be removed by approximation: let us apply this relation with u_δ in place of u, and then, using the commutativity of the regularization, we get Letting δ ↓ 0, it remains to refer to properties 2) and 5).
We now extend Lemma 7.2 to the multi-dimensional setting.
Proposition 10.2. If the functions u_1, ..., u_l belong to W_1^1(R^n), the convolution u = u_1 * ... * u_l has integrable generalized partial derivatives along every coordinate up to order l. Moreover, with

Proof. To simplify notation, write D = D_i, D^l = D_i^l. One may argue by induction on l.
By the induction hypothesis, v has integrable generalized partial derivatives along every coordinate up to order l − 1.Moreover,

Changing the order of integration and using the induction hypothesis, the latter double integral is equal to

∫ (∫ v(y) D^{l−1} w(x′ + y) dy) Du_l(x′) dx′.

Hence u has a generalized derivative D^l u = D^{l−1} v * Du_l. It remains to recall (10.4), and we arrive at (10.2)-(10.3).
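For l = 2 and n = 1, the content of Proposition 10.2 can be checked numerically: the second derivative of a convolution u_1 * u_2 of two W_1^1-functions is Du_1 * Du_2, and its L^1-norm is bounded by ‖Du_1‖_1 ‖Du_2‖_1, by Young's inequality. A discrete sketch with two tent functions (all integrals replaced by Riemann sums with step h; the bound is not attained here):

```python
# Numerical sketch of Proposition 10.2 for l = 2, n = 1: for u1, u2 in W_1^1(R),
# the convolution u = u1 * u2 has second derivative D^2 u = Du1 * Du2 and
# ||D^2 u||_1 <= ||Du1||_1 * ||Du2||_1 (Young's inequality for convolutions).
# Tested on two tent functions u(x) = max(0, 1 - |x|).

h = 0.005
xs = [i * h for i in range(-600, 601)]            # grid on [-3, 3]

def dtent(x):                                     # a.e. derivative of the tent
    return -1.0 if 0 < x < 1 else (1.0 if -1 < x < 0 else 0.0)

def l1(vals):
    return h * sum(abs(v) for v in vals)

def conv(f, g, x):                                # discrete convolution at x
    return h * sum(f(y) * g(x - y) for y in xs if abs(y) <= 1.0)

d2u = [conv(dtent, dtent, x) for x in xs]         # = Du1 * Du2 on the grid
lhs = l1(d2u)
rhs = l1([dtent(x) for x in xs]) ** 2             # (||Du||_1)^2 = 4
print(lhs, rhs)                                   # lhs is about 8/3, below rhs
```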

BV -space
Definition 11.1. An integrable function u on R^n is said to be a function of bounded variation if, for some signed Borel measures μ_i on R^n, we have

∫ u ∂_{x_i} w dx = −∫ w dμ_i, i = 1, ..., n, (11.1)

for all w ∈ C_0^∞(R^n). The generalized gradient of u is then defined as the vector-valued measure μ = (μ_1, ..., μ_n), whose total variation ‖μ‖_TV in the sense of measure theory is denoted ‖u‖_TV in the sense of the theory of functions. Using (11.1) and the notation div w = Σ_{i=1}^n ∂_{x_i} w_i, one gets

‖u‖_TV = sup Σ_j |μ(A_j)| = sup ∫ u div w dx, (11.2)

where the first supremum runs over all partitions of R^n into Borel sets A_1, ..., A_N, and the next ones are taken over all C_0^∞-smooth maps w = (w_1, ..., w_n) : R^n → R^n such that |w(x)| ≤ 1 for all x ∈ R^n. Thus, one arrives at the formula (1.5). Put

BV(R^n) = {u ∈ L^1(R^n) : ‖u‖_TV < ∞},

which is a Banach space endowed with the norm ‖u‖_BV = ‖u‖_1 + ‖u‖_TV. Thus, an integrable function u belongs to BV(R^n) if and only if the last supremum in (11.2) is finite.
In particular, if u belongs to W_1^1(R^n), then u ∈ BV(R^n) with

‖u‖_TV = ‖∇u‖_1 = sup ∫ u div w dx,

where, as before, the supremum runs over all C_0^∞-maps w = (w_1, ..., w_n) : R^n → R^n such that |w| ≤ 1 pointwise on R^n.
What will be important for us is that regularization does not increase the total variation and BV-norms, in full analogy with the L^s- and W_l^s-norms (cf. also [17]).

Proposition 11.2. For any function u in BV(R^n), the regularized functions u_ε defined in (8.2) satisfy

‖u_ε‖_TV ≤ ‖u‖_TV, ‖u_ε‖_BV ≤ ‖u‖_BV. (11.4)

Proof. Let w = (w_1, ..., w_n) be an arbitrary C_0^∞-smooth map participating in the last supremum in (11.2). We first notice that

∫ u_ε div w dx = ∫ u div ψ dx, (11.5)

where the components of the map ψ = (ψ_1, ..., ψ_n) are C_0^∞-smooth and represent the regularized functions for the w_i by means of the regularizer ω̃(x) = ω(−x). Since ω̃_ε represents a probability density, an application of the Cauchy inequality yields |ψ(x)| ≤ 1, due to the condition |w(x)| ≤ 1. Thus, ψ is one of the maps participating in the last supremum in (11.2), so that the right-hand side of (11.5) cannot exceed ‖u‖_TV. Taking the supremum in (11.5) over all admissible maps w, we arrive at the first inequality in (11.4). Using ‖u_ε‖_1 ≤ ‖u‖_1, we also obtain the second one.
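A discrete illustration of Proposition 11.2 (a sketch, with the total variation of a sampled function computed as the sum of absolute increments): smoothing the indicator of [−1/2, 1/2], whose TV norm equals 2, by a triangle-shaped probability kernel does not increase its total variation.

```python
# Discrete illustration of Proposition 11.2: mollification does not increase
# the total variation. We smooth the indicator of [-1/2, 1/2] (TV norm = 2)
# with a triangle-shaped probability kernel and compare the discrete TVs.

h = 0.002
xs = [i * h for i in range(-1000, 1001)]                # grid on [-2, 2]
u = [1.0 if -0.5 <= x <= 0.5 else 0.0 for x in xs]

m = 100                                                 # kernel half-width (eps = 0.2)
ker = [max(0.0, 1.0 - abs(j) / m) for j in range(-m, m + 1)]
total = sum(ker)
ker = [k / total for k in ker]                          # probability weights

def tv(vals):
    return sum(abs(vals[i + 1] - vals[i]) for i in range(len(vals) - 1))

u_eps = [sum(ker[j] * u[min(max(i + j - m, 0), len(u) - 1)]
             for j in range(len(ker))) for i in range(len(u))]

tv_orig, tv_smooth = tv(u), tv(u_eps)
print(tv_orig, tv_smooth)      # tv_smooth does not exceed tv_orig = 2
```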
The norm in BV(R^n) is lower semi-continuous. More precisely, the following holds.

Proposition 11.3. Suppose that the functions u_k belong to BV(R^n) and have bounded norms in this space. If u_k → u in L^1_loc(R^n), then

‖u‖_TV ≤ lim inf_{k→∞} ‖u_k‖_TV, (11.6)

and similarly for the BV-norm.
Proof. One may assume additionally that u_k → u a.e. Then, for any admissible map w,

∫ u div w dx = lim_{k→∞} ∫ u_k div w dx ≤ lim inf_{k→∞} ‖u_k‖_TV.

Taking the supremum over all w as in (11.2), we arrive at (11.6). Since, by Fatou's lemma, ‖u‖_1 ≤ lim inf_{k→∞} ‖u_k‖_1, we get the lower semi-continuity for the BV-norm as well.

This property is used in the proof of the following variant of the compactness theorem. In a slightly different form it is mentioned in the Remark on p. 146 of [10].

Proposition 11.4. Suppose that the functions u_k have bounded norms in BV(R^n). Then there exists a subsequence u_{k_j} which is a.e. convergent to a function u in BV(R^n) with the property that ‖u_{k_j} − u‖_{L^1(Ω)} → 0 as j → ∞ for all balls Ω in R^n.
Proof. By property 7) with s = 1 in Proposition 10.1, the regularized functions v_k = (u_k)_ε have bounded W_1^1-norms. Moreover, by properties 3) and 6) with s = 1, the v_k are bounded and have bounded gradients, uniformly over all k. Hence, these functions are bounded and equicontinuous. Applying the Arzelà-Ascoli theorem, for every integer N ≥ 1 one can find a subsequence v_{k_j} which converges uniformly on the ball Ω: |x| < N. Moreover, using the diagonal argument, one may find a subsequence v_{k_j} which converges pointwise on R^n and uniformly on all balls. In particular, in view of (11.7), the differences u_{k_i} − u_{k_j} are small in L^1(Ω) for all sufficiently large i and j. Applying this conclusion to the values δ = 2^{−m} and using the diagonal argument, we obtain a further subsequence u_{k_j} which is a Cauchy sequence in L^1(Ω) and therefore has a limit u belonging to this space, so that ‖u_{k_j} − u‖_{L^1(Ω)} → 0 as j → ∞. As before, one may find a further subsequence, again denoted u_{k_j}, which converges to a locally integrable function u a.e. on the whole space R^n, with the property that ‖u_{k_j} − u‖_{L^1(Ω)} → 0 as j → ∞ for all balls Ω. By Fatou's lemma, ‖u‖_1 ≤ lim inf_{j→∞} ‖u_{k_j}‖_1, so that u is integrable. Moreover, by (11.6), using ‖u_k‖_TV = ‖∇u_k‖_1, we also conclude that ‖u‖_TV < ∞. Hence u belongs to BV(R^n).

Convolution of functions of bounded variation
We are prepared to prove a multi-dimensional variant of Lemma 7.2 for the BV -space.
For simplicity, we consider the convolution of two functions only.
We have where we used the property that the regularization does not increase the L^1-norm. Since u, v and their partial derivatives are integrable and C^∞-smooth, the same is true for w. Moreover, using the notation for partial derivatives, this implies, since the regularization does not increase the total variation norm (Proposition 11.2), that the functions D_i w have bounded W_1^1-norms, uniformly over all ε > 0. Hence, we are in a position to apply Proposition 11.4: there exist a sequence ε = ε_k → 0 and functions q_i in BV(R^n), 1 ≤ i ≤ n, such that the partial derivatives D_i w_k for the corresponding functions w = w_k converge to q_i a.e. Applying the lower semi-continuity property (11.6), we conclude that all q_i are functions of bounded variation. We claim that the function q_i represents a generalized derivative ∂_{x_i} p = D_i p, and then (12.3) yields the desired relation (12.1). As required in (9.1), we need to show that

∫ p D_i ψ dx = −∫ ψ q_i dx (12.4)

for all ψ ∈ C_0^∞(R^n). Indeed, by the definition,

∫ w_k D_i ψ dx = −∫ ψ D_i w_k dx. (12.5)

But the regularized functions u = u_k and v = v_k converge in L^1 to p_1 and p_2 respectively, by property 2) in Proposition 10.1, and thus ‖w_k − p‖_1 → 0 as k → ∞.
Hence, the left integrals in (12.5) are convergent as k → ∞ to the left integral in (12.4).
The same is true about the right integrals, since D i w k are convergent to q i locally in L 1 , while ψ is compactly supported.This shows that p has the generalized gradient ∇p = (q 1 , . . ., q n ).
Example. In dimension n = 1, let p_1 = p_2 be the density of the uniform distribution on the interval [−1/2, 1/2]. It is a function of bounded variation with total variation norm ‖p_1‖_TV = ‖p_2‖_TV = 2. The convolution p = p_1 * p_2 represents the density of the so-called triangle distribution. It is absolutely continuous, and its generalized (Radon-Nikodym) derivative is a function of bounded variation with total variation norm ‖p′‖_TV = 4. Hence, (12.1) becomes an equality. Note that Lemma 7.2 is not applicable in this case.
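This example is easy to confirm numerically (a sketch; the triangle density p(x) = max(0, 1 − |x|) is used in closed form): the derivative p′ equals ±1 with jumps of sizes 1, 2, 1 at the points −1, 0, 1, so its total variation is 4 = ‖p_1‖_TV ‖p_2‖_TV.

```python
# Numerical check of the example: p1 = p2 uniform on [-1/2, 1/2] gives the
# triangle density p(x) = max(0, 1 - |x|), whose derivative has total
# variation 4 (jumps of sizes 1, 2, 1), so (12.1) holds with equality.

h = 0.001

def p(x):                       # triangle density, in closed form
    return max(0.0, 1.0 - abs(x))

xs = [i * h for i in range(-1500, 1501)]
vals = [p(x) for x in xs]

# p' via finite differences; its total variation = sum of jumps of p'
dp = [(vals[i + 1] - vals[i]) / h for i in range(len(vals) - 1)]
tv_dp = sum(abs(dp[i + 1] - dp[i]) for i in range(len(dp) - 1))

mass = h * sum(vals)            # sanity check: p integrates to 1
print(mass, tv_dp)
```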

Fisher information in high dimensions
Given a probability density p on R^n, the first basic formula in (1.1) makes sense when p has generalized partial derivatives ∂_{x_i} p as in the definition (9.1). Here the partial derivatives are required to be locally integrable functions. Moreover, they have to be integrable for the finiteness of the Fisher information.
Proposition 13.1. If the function √p belongs to W_1^2(R^n), then p has a generalized gradient, and the integrals in (13.1) and (13.3) coincide.
The proof is based on the chain rule formula and the next general characterization.
Suppose that we are given a continuous function T : [0, ∞) → R which has a continuous derivative T′(t) for t > 0 such that sup_{t ≥ t_0} |T′(t)| < ∞ for any t_0 > 0.
Lemma 13.2. Let p be a non-negative, locally integrable function on R^n having a generalized partial derivative D_i p. Then u = T(p) has a generalized derivative D_i u, if and only if the function T′(p) D_i p 1_{p>0} is locally integrable. In that case, D_i u = T′(p) D_i p 1_{p>0} a.e.
Proof. In the one-dimensional case, this assertion may be refined. Let the function p be locally absolutely continuous on the real line and have a Radon-Nikodym derivative p′. By continuity, the set {x ∈ R : p(x) > 0} is open and may be decomposed into at most countably many intervals (a_k, b_k), finite or not. Then, on every (a_k, b_k), T(p) is locally absolutely continuous and has a Radon-Nikodym derivative T(p)′ = T′(p) p′. Indeed, the assumption of local absolute continuity of p is equivalent to the property that, for all α < β and ε > 0, there is δ > 0 such that Σ_l |p(y_l) − p(x_l)| < ε whenever Σ_l (y_l − x_l) < δ, for any collection of non-overlapping intervals (x_l, y_l) inside [α, β] (cf. e.g. [12]). If this segment is contained in (a_k, b_k), then p is bounded away from zero on [α, β]. Hence, by the same characterization, u = T(p) is locally absolutely continuous on (a_k, b_k) and has a finite derivative q, which exists almost everywhere on this interval. But, since p(x) has a finite derivative p′(x) for almost all x ∈ R, u(x) has derivative T′(p(x)) p′(x) for almost all x ∈ (a_k, b_k). This shows that q = T′(p) p′ a.e., thus proving the claim.
Turning to the general case n ≥ 2, note that |T (t)| ≤ c (1 + t) for all t ≥ 0 with some constant c ≥ 0. Hence u = T (p) is locally integrable.Without loss of generality, let T (0) = 0.
We may assume that p is modified as in Proposition 9.1 for l = 1, with a Borel measurable, locally integrable generalized derivative D_i p. Thus, using the notation x = (x_i, x̄), x_i ∈ R, x̄ ∈ R^{n−1}, the function x_i → p(x) is locally absolutely continuous and has a Radon-Nikodym derivative x_i → D_i p(x) for all x̄ except for a null set E_i ⊂ R^{n−1}. Given such a point x̄, the set U(x̄) = {x_i ∈ R : p(x) > 0} is open and may be decomposed into at most countably many intervals (a_k, b_k), finite or not. According to the one-dimensional claim, the function x_i → u(x) is locally absolutely continuous and has a Radon-Nikodym derivative D_i u(x) = T′(p(x)) D_i p(x) on every such interval a_k < x_i < b_k. Hence, given a C_0^∞-function w on R^n, on any subinterval [α, β] ⊂ (a_k, b_k) one may integrate by parts along the x_i-coordinate to get

∫_α^β D_i u(x) w(x) dx_i = u(β, x̄) w(β, x̄) − u(α, x̄) w(α, x̄) − ∫_α^β u(x) D_i w(x) dx_i.

First assume that T′(p) D_i p 1_{p>0} is locally integrable on R^n. Then this function is locally integrable with respect to x_i, over any bounded interval, for almost all x̄ outside a null set E_i′ containing E_i (cf. the proof of Proposition 9.1). But this equality also holds when a_k and/or b_k are infinite, since w(α, x̄) and w(β, x̄) vanish for α and β sufficiently large. Due to (13.5), one may perform the summation over all k in (13.6), and then we arrive at

∫ D_i u(x) w(x) dx_i = −∫ u(x) D_i w(x) dx_i

with an arbitrary point x̄ outside E_i′ (since the integrands on both sides are supported on a bounded set). Here, the left integral does not change when extending the integration to the whole real line. Using Fubini's theorem, this equality may now be integrated over x̄, and we obtain

∫ T′(p(x)) D_i p(x) 1_{p(x)>0} w(x) dx = −∫ u(x) D_i w(x) dx.

This means that T′(p(x)) D_i p(x) 1_{p(x)>0} serves as a generalized partial derivative for u(x).
Conversely, suppose that u = T(p) has a generalized partial derivative q_i; in particular, it is locally integrable. As we have already noted, for any x̄ outside E_i, the function x_i → u(x) has a Radon-Nikodym derivative T′(p) D_i p on every interval (a_k, b_k). By Proposition 9.1, for almost all x̄, we have q_i = T′(p) D_i p for almost all x_i ∈ (a_k, b_k). Therefore, this equality holds a.e. on the whole set {p(x) > 0}. As a consequence, the function T′(p) D_i p 1_{p>0} is locally integrable.
Proof of Proposition 13.1. Suppose that p has a generalized gradient ∇p with finite I(p). By Corollary 9.3, the left integral in (13.2) may be restricted to the set {p(x) > 0}. Hence, applying Cauchy's inequality, we have

∫ |∇p| dx = ∫_{p>0} (|∇p|/√p) √p dx ≤ (∫_{p>0} |∇p|²/p dx)^{1/2} = √I(p),

which yields (13.2). Thus, p ∈ W_1^1(R^n). In order to derive (13.3), we apply Lemma 13.2 with the function T(t) = √t, t ≥ 0. Since (D_i p/√p) 1_{p>0} is square integrable, we conclude that the function √p has a generalized partial derivative D_i √p = (D_i p/(2√p)) 1_{p>0}. Summing over all i ≤ n and recalling (13.1), we arrive at the representation (13.3).
The converse statement is similar.
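The identity behind Proposition 13.1, I(p) = ∫ p′²/p dx = 4 ∫ (d√p/dx)² dx in dimension one, can be verified numerically on the standard normal density, for which I(p) = 1 (a sketch; derivatives of √p are approximated by finite differences):

```python
import math

# Numerical sketch of the identity in Proposition 13.1 for n = 1:
#   I(p) = Integral p'(x)^2 / p(x) dx = 4 * Integral (d sqrt(p)/dx)^2 dx,
# tested on the standard normal density, for which I(p) = 1.

def p(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

h = 0.001
xs = [i * h for i in range(-8000, 8001)]

fisher = h * sum((-x * p(x)) ** 2 / p(x) for x in xs)     # p'(x) = -x p(x)
sq = [math.sqrt(p(x)) for x in xs]
dsq = [(sq[i + 1] - sq[i]) / h for i in range(len(sq) - 1)]
fisher_sq = 4 * h * sum(d * d for d in dsq)

print(fisher, fisher_sq)   # both close to 1
```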
Denote by P_n the collection of all probability densities on R^n. According to (13.2), for every I > 0, the set

R_n(I) = {|∇p| : p ∈ P_n, I(p) ≤ I}

is bounded in L^1(R^n). The next statement refines this property.
Proposition 14.1 may be considerably sharpened by weakening the basic hypothesis on the convergence in W_1^1(R^n).

Proposition 14.2. Given random vectors X, X_k (k ≥ 1) with values in R^n, suppose that X_k ⇒ X weakly in distribution as k → ∞. Then

I(X) ≤ lim inf_{k→∞} I(X_k). (14.2)

Proof. It is sufficient to prove the following: for any subsequence of X_k, one can extract a further subsequence for which the relation (14.2) holds true, even with lim sup in place of lim inf. For simplicity of notation, let the first subsequence be the whole sequence of positive integers. By the assumption of weak convergence, ∫ v dμ_k → ∫ v dμ as k → ∞ for any bounded continuous function v on R^n, where μ and μ_k denote the distributions of X and X_k, respectively.
One may assume that I(X_k) ≤ I < ∞ for all k. In this case, the X_k have absolutely continuous distributions on R^n whose densities p_k lie in the Sobolev space W_1^1(R^n) and have Fisher information I(p_k) bounded by I. By Proposition 13.1, the p_k have bounded norms in W_1^1(R^n), so that we may apply Proposition 11.4. Thus, some subsequence p_{k_j} is a.e. convergent to a function p ∈ BV(R^n) with the property that

‖p_{k_j} − p‖_{L^1(Ω)} → 0 as j → ∞ (14.3)

on bounded Borel sets Ω in R^n. The latter ensures that p represents a probability density on R^n. Indeed, necessarily p(x) ≥ 0 a.e., as p is a pointwise limit of non-negative functions. In addition, since ∫_Ω p_{k_j} → ∫_Ω p, we have ∫_Ω p ≤ 1 for all bounded Borel sets Ω in R^n, hence ∫_{R^n} p ≤ 1. For the opposite inequality, choose a bounded open set Ω such that P{X ∈ Ω} > 1 − ε for a given ε > 0. Since the X_k converge weakly in distribution, we obtain that lim inf_{j→∞} P{X_{k_j} ∈ Ω} ≥ P{X ∈ Ω} > 1 − ε (cf. e.g. [1]). On the other hand, by (14.3), P{X_{k_j} ∈ Ω} = ∫_Ω p_{k_j} → ∫_Ω p. Hence ∫_Ω p ≥ 1 − ε for any ε > 0, so that ∫_{R^n} p = 1.
In particular, assuming again for simplicity of notation that k_j is the whole sequence of positive integers, we get (by applying Scheffé's lemma) that ∫ |p_k − p| dx → 0. This means that the μ_k converge in total variation norm to the probability measure with density p. Consequently, the measure μ is absolutely continuous with respect to the Lebesgue measure on R^n and has density p. In addition, the weak convergence is strengthened to the property (14.4), which implies that

∫ p_k v dx → ∫ p v dx (14.5)

as k → ∞ for an arbitrary bounded measurable function v on R^n.
The uniform integrability property as in Proposition 13.3 allows us to apply the Dunford-Pettis compactness criterion for the space L^1 over finite measures (cf. [9], p. 20). It implies that the set R_n(I) is pre-compact and also sequentially pre-compact in L^1(Ω) with respect to the weak σ(L^1, L^∞)-topology for any bounded Borel set Ω in R^n. Hence the same is true for the collection of generalized partial derivatives D_i p = ∂_{x_i} p with p ∈ P_n(I), i = 1, ..., n. As a consequence, there is a subsequence of p_k along which the D_i p_k are weakly convergent in L^1(Ω) to some q_i ∈ L^1(Ω). Clearly, these limit functions may be chosen to be common for all such Ω's. Thus, assuming again that the subsequence is the whole sequence, we obtain that

∫ w D_i p_k dx → ∫ w q_i dx as k → ∞, i = 1, ..., n,

for any bounded measurable function w on R^n with compact support. But, restricting ourselves to w ∈ C_0^∞(R^n), the above left integrals are equal to −∫ p_k D_i w dx, according to the definition (9.1) of the weak derivative of p_k, and they converge to −∫ p D_i w dx, by (14.5) with v = D_i w. Thus

∫ p D_i w dx = −∫ w q_i dx.
This shows that the q_i = D_i p serve as generalized partial derivatives for p, so that p has a generalized gradient ∇p = (q_1, ..., q_n). Moreover, since p belongs to BV(R^n), all the D_i p are integrable, and we may conclude that p ∈ W_1^1(R^n). In addition, the above convergence holds for any bounded measurable function w on R^n with compact support. Now, as we observed in Proposition 13.1 and Lemma 13.2, the functions ψ_{i,k} = (D_i p_k/√p_k) 1_{p_k>0} have L^2-norms bounded by √I. Since the balls in L^2(R^n) are weakly compact, one can extract a subsequence of (ψ_{i,k})_{k≥1} which is weakly convergent to some function ψ_i ∈ L^2(R^n). That is, we may assume that, for any i = 1, ..., n,

∫ v ψ_{i,k} dx → ∫ v ψ_i dx as k → ∞

for an arbitrary function v ∈ L^2(R^n). Then, more generally, by the Cauchy and triangle inequalities, we have D_i p = ψ_i √p a.e. on the set {p(x) > 0}.
Finally, since the ψ_{i,k} 1_{p>0} are weakly convergent to ψ_i 1_{p>0} in L^2(R^n), by the lower semi-continuity of the norm with respect to the weak topology, we have

∫ ψ_i² 1_{p>0} dx ≤ lim inf_{k→∞} ∫ ψ_{i,k}² dx.

But this is the same as

∫_{p>0} (D_i p)²/p dx ≤ lim inf_{k→∞} ∫_{p_k>0} (D_i p_k)²/p_k dx.

Summing these inequalities over i ≤ n, we arrive at the relations (14.1)-(14.2).

Convexity of Fisher information
Recall that the collection P_n of all probability densities on R^n represents a convex closed set in L^1(R^n). Another general property of the Fisher information is its convexity:

I(p) ≤ Σ_{k=1}^N α_k I(p_k), (15.1)

where p = Σ_{k=1}^N α_k p_k with arbitrary weights α_k > 0 such that Σ_{k=1}^N α_k = 1. This follows from the convexity of the function R(u, v) = u²/v in the upper half-plane u ∈ R, v > 0.
As a consequence, the collection P n (I) of all probability densities on R n with Fisher information not exceeding a fixed number I represents a convex closed subset of P n .
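The convexity of R(u, v) = u²/v invoked above is easy to probe numerically (a sketch testing the midpoint inequality at random points of the half-plane v > 0):

```python
import random

# Numerical check that R(u, v) = u^2 / v is convex on { u in R, v > 0 },
# which underlies the convexity (15.1) of the Fisher information: we test
# the midpoint inequality R((z1 + z2)/2) <= (R(z1) + R(z2))/2.

def R(u, v):
    return u * u / v

random.seed(1)
ok = True
for _ in range(10000):
    u1, u2 = random.uniform(-5, 5), random.uniform(-5, 5)
    v1, v2 = random.uniform(0.01, 5), random.uniform(0.01, 5)
    mid = R((u1 + u2) / 2, (v1 + v2) / 2)
    if mid > (R(u1, v1) + R(u2, v2)) / 2 + 1e-9:
        ok = False

print(ok)   # True: the midpoint inequality never fails
```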
We need to extend Jensen's inequality (15.1) to arbitrary convex mixtures of probability densities. In order to formulate this more precisely, let us recall the definition of mixtures. For any Borel set A in R^n, the linear functional q → ∫_A q(x) dx is continuous on L^1(R^n) and takes values in [0, 1] when q ∈ P_n. So, given a Borel probability measure π on P_n, one may introduce a Borel probability measure μ on R^n by virtue of the formula

μ(A) = ∫_{P_n} (∫_A q(x) dx) dπ(q). (15.2)

It is absolutely continuous with respect to the Lebesgue measure and has some density p(x) = dμ(x)/dx, called the (convex) mixture of the q's with mixing measure π. For short, p = ∫_{P_n} q dπ(q). In these terms, (15.1) extends to the inequality

I(p) ≤ ∫_{P_n} I(q) dπ(q). (15.4)

Proof. Note that the integral in (15.4) makes sense, since the functional q → I(q) is lower semi-continuous and hence Borel measurable on P_n (Proposition 14.1). We may assume that this integral is finite, so that π is supported on the convex (Borel measurable) set P_n(∞) = ∪_I P_n(I).
Step 1. Suppose that the measure π is supported on some convex compact set K contained in P_n(I). We apply the following general theorem (cf. e.g. Meyer [14], Chapter XI, Theorem T7): if a function I : K → R is convex and lower semi-continuous on a convex compact set K in a locally convex space E, then it admits the representation

I(q) = sup_{l ∈ L} l(q), q ∈ K,

where L denotes the family of all continuous affine functionals l on E such that l(q) < I(q) for all q ∈ K. In our particular case with E = L^1(R^n), any such functional acts on probability densities as l(q) = ∫ ψ q dx with some bounded measurable function ψ on R^n. Hence

I(q) = sup_{ψ ∈ Ψ} ∫ ψ(x) q(x) dx

for some family Ψ of bounded measurable functions ψ on R^n. As a consequence, by the definition (15.2) for the measure μ with density p,

∫_{P_n} I(q) dπ(q) ≥ sup_{ψ ∈ Ψ} ∫_{P_n} (∫ ψ(x) q(x) dx) dπ(q) = sup_{ψ ∈ Ψ} ∫ ψ(x) p(x) dx = I(p),

which is the desired inequality (15.4).
Step 2. Suppose that π is supported on P_n(I) for some I > 0. Since any finite measure on E is Radon, and since the set P_n(I) is closed and convex, there is an increasing sequence of compact subsets K_l ⊂ P_n(I) such that π(∪_l K_l) = 1. Moreover, the K_l can be chosen to be convex (since the closure of the convex hull of a compact set is compact as well). Let π_l denote the normalized restriction of π to K_l, with l sufficiently large so that c_l = π(K_l) > 0, and define its barycenter

p_l = ∫_{K_l} q dπ_l(q), (15.5)

which is a density of some probability measure μ_l on R^n as in the definition (15.3). Then, for any Borel measurable function f on R^n such that |f| ≤ 1, the integrals ∫ f p_l dx and ∫ f p dx are close for large l. Taking the supremum over all admissible f, we get that the p_l converge to p in L^1(R^n). Hence the relation (14.1) holds: I(p) ≤ lim inf_{l→∞} I(p_l). On the other hand, by the previous step,

I(p_l) ≤ ∫_{K_l} I(q) dπ_l(q) = (1/c_l) ∫_{K_l} I(q) dπ(q) → ∫_{P_n(I)} I(q) dπ(q) (15.6)

as l → ∞, and we obtain (15.4).
Step 3. In the general case, we may apply Step 2 to the normalized restrictions π_l of π to the sets K_l = P_n(l). Again, for the densities p_l defined as in (15.5), we obtain (15.6), where P_n(I) should be replaced with P_n(∞). Another application of the lower semi-continuity of the Fisher information finishes the proof. In fact, this approximation property may be generalized similarly to the setting of Proposition 14.2.
Corollary 15.3. Given independent random vectors X and Z with values in R^n, for the random vectors X_ε = X + εZ, ε ∈ R, we have

I(X_ε) ≤ I(X), lim_{ε→0} I(X_ε) = I(X). (15.7)

Proof. For the first claim in (15.7), we may assume that I(X) is finite, so that X has an absolutely continuous distribution with density p having a generalized gradient. In this case, X_ε has a density p_ε representing a convex mixture of probability densities of the form q_h(x) = p(x − h), h ∈ R^n. Since I(q_h) = I(p), we obtain the inequality

I(X + Y) ≤ I(X), (15.8)

which holds whenever the random vectors X and Y in R^n are independent (cf. [15], [11]).
One interesting case in Corollary 15.3 is when Z has a standard normal distribution. Combining both claims in (15.7), we then obtain that the function ε → I(X_ε) is monotone. This choice of smoothing allows one to reduce various relations about the Fisher information I(p), such as (15.8), to the case of C^∞-smooth densities p. Here is another example.
Corollary 15.4. Given a random vector X in R^n, we have I(U(X)) = I(X) for any linear orthogonal map U : R^n → R^n. Indeed, by (13.1), I(U(X)) = I(X) as long as X has a C^∞-smooth density. Hence, in the general case, we have I(U(X) + εZ) = I(X + εZ), where Z is a standard normal random vector in R^n, independent of X (since U(Z) is standard normal). It remains to apply Corollary 15.3.
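The monotonicity of ε → I(X + εZ) under Gaussian smoothing can be illustrated numerically (a sketch; X is uniform on [−1/2, 1/2], so that I(X) = ∞, and the density of X + εZ is expressed through the standard normal distribution function Φ):

```python
import math

# Numerical illustration: for X uniform on [-1/2, 1/2] and Z standard normal,
# X + e*Z has density p_e(x) = Phi((x + 1/2)/e) - Phi((x - 1/2)/e), and the
# Fisher information I(p_e) increases as e decreases (here I(X) = infinity).

def phi(t):
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def Phi(t):
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

def fisher_smoothed(e, h=0.001, lim=8.0):
    n = int(lim / h)
    total = 0.0
    for i in range(-n, n + 1):
        x = i * h
        p = Phi((x + 0.5) / e) - Phi((x - 0.5) / e)
        dp = (phi((x + 0.5) / e) - phi((x - 0.5) / e)) / e
        if p > 0.0:                       # skip the numerically vanished tails
            total += h * dp * dp / p
    return total

vals = [fisher_smoothed(e) for e in (0.6, 0.4, 0.2)]
print(vals)   # increasing: less smoothing gives larger Fisher information
```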

Upper bounds. Proof of Theorem 1.2
We are now prepared to extend several upper bounds for the Fisher information

I(p) = Σ_{i=1}^n ∫ (∂_{x_i} p(x))²/p(x) dx

from the one-dimensional case to higher dimensions (as before, one may adopt the convention that 0/0 = 0). The first upper bounds were developed for densities from the classes C^l on the real line with l = 2 and l = 3. Analogously, one may say that p belongs to C^l(R^n) if, for any i = 1, ..., n and for almost all (x_j)_{j≠i} ∈ R^{n−1}, the function x_i → p(x) belongs to the class C^l on the real line. Here |Ω| = mes_n(Ω) denotes the n-dimensional volume of Ω.
Proof. Note that the inequality (2.2) is homogeneous in p. Applying it to the function in (16.1) with fixed x̄ = (x_j)_{j≠i} ∈ R^{n−1}, we obtain that

(∂_{x_i} p(x))²/p(x) ≤ 2C_i, C_i = ess sup_x ∂²_{x_i} p(x).
Let us now turn to the multidimensional variant of Proposition 7.1, i.e. Theorem 1.2. It may be stated for a slightly more general class of densities as follows. In order to estimate the integral in (17.3), it is sufficient to bound the L^3-norm of the linear functional f(θ) = ⟨v, θ⟩ with v ∈ S^{n−1} over the measure σ_{n−1} via its L^4-norm. As is well known, if Z = (Z_1, ..., Z_n) is a standard normal random vector in R^n, then Z/|Z| is independent of |Z| and is uniformly distributed on the sphere. Therefore, ⟨v, Z⟩/|Z| has the same distribution as f(θ). In addition, by independence, and since ⟨v, Z⟩ ∼ N(0, 1), here, using the independence of the Z_i, we also have, with constant c = c_0 √3 < 18, according to (17.1). As the last step, we use the regularized densities p_{j,ε} = (p_j)_ε, ε > 0. By the assumption, the densities p_j of X_j have finite total variation norms b_j = ‖p_j‖_TV. By Proposition 12.1, the random vector X = X_1 + X_2 + X_3 has an absolutely continuous distribution, whose density p belongs to the Sobolev space W_1^1(R^n). Let X_ε = X_{1,ε} + X_{2,ε} + X_{3,ε} with independent summands X_{j,ε} having densities p_{j,ε}. By the previous step (17.4),

I(X_ε) ≤ c (‖p_{1,ε}‖_TV ‖p_{2,ε}‖_TV ‖p_{3,ε}‖_TV)^{2/3}.
As we know, cf. (11.4), the total variation norm may only decrease under regularization, so that we get

I(X_ε) ≤ c (‖p_1‖_TV ‖p_2‖_TV ‖p_3‖_TV)^{2/3}.
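In dimension n = 1, the phenomenon behind Theorem 1.1 can be seen numerically: for three independent uniform random variables on [0, 1], each density has total variation norm 2 and infinite Fisher information, while the density of the sum (the quadratic B-spline on [0, 3]) has finite Fisher information, approximately 4.562 (a sketch using the closed-form density):

```python
# Numerical sketch for Theorem 1.1 in dimension n = 1: for X1, X2, X3 i.i.d.
# uniform on [0, 1] (each ||p_j||_TV = 2, each I(X_j) = infinity), the density
# of X1 + X2 + X3 is the quadratic B-spline on [0, 3], and I(X) is finite.

def p(x):                       # density of the sum of three uniforms
    if 0 <= x <= 1:
        return x * x / 2
    if 1 <= x <= 2:
        return -x * x + 3 * x - 1.5
    if 2 <= x <= 3:
        return (3 - x) ** 2 / 2
    return 0.0

def dp(x):                      # its (piecewise linear) derivative
    if 0 <= x <= 1:
        return x
    if 1 <= x <= 2:
        return -2 * x + 3
    if 2 <= x <= 3:
        return -(3 - x)
    return 0.0

h = 0.0005
n = int(3 / h)
# midpoint rule; near the edges p ~ t^2/2 and dp^2/p stays bounded (= 2)
fisher = h * sum(dp((i + 0.5) * h) ** 2 / p((i + 0.5) * h) for i in range(n))
print(fisher)   # finite, approximately 4.562
```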

Theorem 1.2.

For any probability density p on R^n having continuous partial derivatives up to the third order,

Lemma 2.5.

Let a C²-smooth function u ≥ 0 be defined on an interval (a, b), finite or not. If lim inf_{x↓a} u′(x) = lim inf_{x↑b} u′(x) = 0, and u″(x) ≤ C a.e. for some constant C, then C ≥ 0 and (2.4) still holds in (a, b).

Proof. Since the function u is continuous, the set U = {x ∈ (a, b) : u(x) > 0} is open and can be decomposed into at most countably many open disjoint intervals (a_k, b_k). If a_k > a, then necessarily u(a_k) = 0. By the assumption, we also have lim inf_{x↓a_k} u′(x) = 0 if a_k = a. In both cases, one may apply Lemma 2.4 (i) to the interval (a_k, b_k), and we obtain (2.4) for all x ∈ (a_k, b_k) with C ≥ 0.

Corollary 3.3.
For any probability density p from the class C^2 such that the constant C in (3.1) is finite, we have
I(p) = − ∫_{p(x)>0} p''(x) log p(x) dx,
as long as the function p''(x) log p(x) is integrable on the set {x ∈ R : p(x) > 0}. Proof. The open set U = {x ∈ R : p(x) > 0} can be decomposed into disjoint intervals (a_k, b_k). Necessarily p(a_k+) = p(b_k−) = 0, including the cases a_k = −∞ and b_k = ∞, by Proposition 3.1. Let a_k < a < b < b_k. Since log p(x) and p'(x) are continuous functions with bounded total variation on [a, b], one may integrate by parts, which gives
− ∫_a^b p''(x) log p(x) dx = − ∫_a^b log p(x) dp'(x) = −p'(b) log p(b) + p'(a) log p(a) + ∫_a^b (p'(x)^2/p(x)) dx.
Since p'(a)^2 ≤ 2C p(a), we have |p'(a) log p(a)| ≤ √(2C p(a)) |log p(a)| → 0 as a → a_k, and similarly |p'(b) log p(b)| → 0 as b → b_k. Hence, in the limit the above formula becomes
− ∫_{a_k}^{b_k} p''(x) log p(x) dx = ∫_{a_k}^{b_k} (p'(x)^2/p(x)) dx.
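A quick numerical illustration of this identity (a sketch using the standard normal density, for which I(p) = 1): both I(p) = ∫ p'^2/p and −∫ p'' log p can be evaluated by a midpoint Riemann sum and compared.

```python
# Corollary 3.3 for the standard normal density phi:
# ∫ phi'(x)^2/phi(x) dx and -∫ phi''(x) log phi(x) dx should both equal 1.
import math

def phi(x):   return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
def dphi(x):  return -x * phi(x)
def d2phi(x): return (x * x - 1) * phi(x)

N, a, b = 100_000, -10.0, 10.0
h = (b - a) / N
xs = [a + (k + 0.5) * h for k in range(N)]
fisher   = sum(dphi(x) ** 2 / phi(x) for x in xs) * h
by_parts = -sum(d2phi(x) * math.log(phi(x)) for x in xs) * h
print(round(fisher, 4), round(by_parts, 4))
```

Here the boundary terms of the integration by parts vanish because phi and phi' decay rapidly, matching the hypotheses of the corollary.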

Proposition 4.1.
Let p be a probability density of class C^2 with finite constant C = ess sup_{x∈R} p''(x). Suppose that p is non-decreasing on a half-axis (−∞, a) and non-increasing on a half-axis (b, ∞) for some a ≤ b. Then I(p) is finite. Moreover, I(p) ≤ 2C(b − a) + 2√(2C) (√p(a) + √p(b)).

Proposition 4.2.
Let p be a continuous density of a unimodal distribution with mode at the point a. Suppose that p is C^2-smooth on the half-axis (−∞, a) and on the half-axis (a, ∞), with finite C_0 = ess sup_{x<a} p''(x) and C_1 = ess sup_{x>a} p''(x). Then I(p) is finite, and moreover I(p) ≤ 2 (√(2C_0) + √(2C_1)) √p(a).
non-empty and open. Let us decompose it into at most countably many disjoint open intervals, and let ∆ = (a, b) be one of these intervals. Necessarily b < ∞, p'(b) = 0, and p'(a) = 0 in the case a > −∞. Since p'(x) is vanishing at infinity (Proposition 3.1), we also have p'(x) → 0 as x → −∞ in the case a = −∞.
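A numerical sanity check of Proposition 4.2 (a sketch assuming the bound reads I(p) ≤ 2(√(2C_0)+√(2C_1))√p(a), the form produced by the tail estimate p'(x)^2 ≤ 2C p(x) together with ∫ |p'|/√p dx = 2√p(a) on each monotone piece): for the standard normal density, with mode a = 0 and I(p) = 1, the bound evaluates to about 1.51.

```python
# Proposition 4.2 for the standard normal density (unimodal, mode a = 0):
# check I(p) = 1 against 2*(sqrt(2*C0)+sqrt(2*C1))*sqrt(p(0)),
# where C0 = C1 = ess sup p'' = max_x (x^2-1)*phi(x), attained near |x| = sqrt(3).
import math

def phi(x): return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

h = 0.0005
xs = [-10 + k * h for k in range(40_000)]        # grid over [-10, 10)
C = max((x * x - 1) * phi(x) for x in xs)
bound = 2 * (math.sqrt(2 * C) + math.sqrt(2 * C)) * math.sqrt(phi(0.0))
fisher = sum(x * x * phi(x) for x in xs) * h     # I(p) = E X^2 = 1
print(round(fisher, 3), round(bound, 3))
```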
this generalized partial derivative is defined uniquely up to a set of Lebesgue measure zero. Of course, if u has continuous usual partial derivatives of orders up to |α|, the generalized α-th derivative exists and may be chosen to be the usual one. As with usual differentiation, generalized derivatives commute in α and have a semigroup structure: if u has a generalized α-th derivative D^α u, which in turn has a generalized β-th derivative v = D^β D^α u, then u has a generalized (α + β)-th derivative v. That is, D^{α+β} = D^β D^α = D^α D^β for all multi-indices α and β. In dimension n = 1 with α = 1, Definition 8.1 returns us to the setting of locally absolutely continuous functions, for which Radon–Nikodym derivatives serve as generalized derivatives. More generally, one may give the following characterization. Proposition 8.2.

Definition 8.3.
are also compactly supported and belong to the class C_0^∞. They are called regularizers. Given a locally integrable function u on R^n, define the convolutions u_ε = u * ω_ε, ε > 0. Returning to Definition 8.1 with an arbitrary α such that |α| = 1, we obtain n partial derivatives v = ∂_{x_i} u = D_i u, and one may speak about the generalized gradient ∇u = (∂_{x_1}u, ..., ∂_{x_n}u) and its Euclidean length |∇u|. Thus, according to (8.1), for all w ∈ C_0^∞(R^n),
∫ u ∂_{x_i} w dx = − ∫ w ∂_{x_i} u dx,  i = 1, ..., n.
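A minimal numeric sketch of this regularization in dimension one (the names ω, u_ε, and the specific bump are illustrative assumptions, not taken from the text): convolving the indicator of [0,1] with the rescaled C^∞ bump ω(x) = c·exp(−1/(1−x^2)) on (−1,1) leaves the function equal to 1 deep inside its support and smooths the jumps near the edges.

```python
# Regularization u_eps = u * omega_eps of u = 1_[0,1] with the C^infty bump
# omega(x) = c * exp(-1/(1-x^2)) on (-1,1), normalized so that ∫ omega = 1.
import math

def bump(x):
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1 else 0.0

h = 1e-4
c = 1.0 / (sum(bump(-1 + (k + 0.5) * h) for k in range(20_000)) * h)

def omega_eps(x, eps):
    return (c / eps) * bump(x / eps)

def u(x):
    return 1.0 if 0 <= x <= 1 else 0.0

def u_eps(x, eps, m=2_000):
    # u_eps(x) = ∫ u(x - y) omega_eps(y) dy over y in (-eps, eps)
    s = 2 * eps / m
    return sum(u(x - (-eps + (j + 0.5) * s)) * omega_eps(-eps + (j + 0.5) * s, eps)
               for j in range(m)) * s

eps = 0.2
# At x = 0.5 the whole window (-eps, eps) sees u = 1, so u_eps(0.5) = 1
# up to quadrature error, while u_eps vanishes outside [-eps, 1 + eps].
print(round(u_eps(0.5, eps), 4))
```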

Corollary 9.2.
Let us formulate Proposition 9.1 once more in this particular case. A locally integrable function u on R^n has a generalized gradient if and only if, after a modification on a set of measure zero, the modified (Borel measurable) function ũ is locally absolutely continuous on almost all lines parallel to the coordinate axes and has partial Radon–Nikodym derivatives that are locally integrable on R^n.
R^n, integrating by parts, we have
∫ u(x) D^l w(x) dx = ∫∫ u_l(x − y) D^l w(x) dx v(y) dy = − ∫∫ Du_l(x − y) D^{l−1} w(x) dx v(y) dy = − ∫∫ Du_l(x') D^{l−1} w(x' + y) dx' v(y) dy.
once the function p has a generalized gradient ∇p = (∂_{x_1}p, ..., ∂_{x_n}p) on [a, b],
∫_a^b |T(p(x_i, x)) D_i p(x_i, x)| 1_{p(x_i,x)>0} dx_i < ∞,  x ∉ E_i.  (13.5)
Necessarily p(a_k, x) = p(b_k, x) = 0, and thus u(a_k, x) = u(b_k, x) = 0, as long as the endpoints a_k and b_k are finite. Hence, letting α → a_k and β → b_k in (13.4), we get

Proposition 16.1.
x_i → p(x) = p(x_1, ..., x_{i−1}, x_i, x_{i+1}, ..., x_n)  (16.1)
belongs to C^l(R), with the additional requirement that the l-th derivative ∂_{x_i}^l p(x) with respect to x_i in the Radon–Nikodym sense is locally integrable on R^n. According to Proposition 9.1, such densities describe representatives of functions p on R^n having generalized partial derivatives D_i^l p = ∂_{x_i}^l p. With this definition, Propositions 2.2–2.3 yield: If the probability density p belongs to the class C^2(R^n) and is supported on a bounded, open, convex set Ω in R^n, then
I(p) ≤ 2Cn|Ω|,  C = max_{1≤i≤n} ess sup_x ∂_{x_i}^2 p(x).
represents a certain interval (a_i, b_i) depending on x. Let us integrate the last inequality over this section to get that
∫_{a_i}^{b_i} (∂_{x_i} p(x))^2/p(x) dx_i ≤ 2C_i(b_i − a_i) ≤ 2C(b_i − a_i).
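A one-dimensional sanity check of the bound I(p) ≤ 2Cn|Ω| (a sketch with the hypothetical density p(x) = (35/32)(1−x^2)^3 on Ω = (−1,1), chosen so that p, p' vanish at ±1 and p is C^2 on all of R): here I(p) = 10.5 exactly, while 2C·1·|Ω| = 21.

```python
# Proposition 16.1 in dimension n = 1 for p(x) = (35/32)*(1-x^2)^3 on (-1,1):
# I(p) = ∫ p'^2/p dx = 10.5 and C = ess sup p'' = 5.25, so 2*C*n*|Omega| = 21.
def p(x):   return (35 / 32) * (1 - x * x) ** 3
def dp(x):  return -(35 / 32) * 6 * x * (1 - x * x) ** 2
def d2p(x): return (35 / 32) * (1 - x * x) * (30 * x * x - 6)

N = 200_000
h = 2.0 / N
xs = [-1 + (k + 0.5) * h for k in range(N)]
# The integrand p'^2/p = (35/32)*36*x^2*(1-x^2) is bounded on (-1,1).
fisher = sum(dp(x) ** 2 / p(x) for x in xs) * h
C = max(d2p(x) for x in xs)
print(round(fisher, 3), round(C, 3))
```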
Namely, u belongs to the Sobolev space W_2^1(R^n) if and only if the function (1 + |t|) û(t) belongs to L^2(R^n); moreover, by the Plancherel theorem, the corresponding norms are equivalent. For elements of the space W_1^1(R^n), the total variation norm is given by
‖u‖_TV = ‖∇u‖_1.  (11.3)
This assertion can be strengthened: suppose that an integrable function u on R^n has a generalized gradient ∇u = (∂_{x_1}u, ..., ∂_{x_n}u). Then u belongs to BV(R^n) if and only if it belongs to W_1^1(R^n), in which case (11.3) holds true. Indeed, returning to (11.2) and
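The identity (11.3) is easy to test numerically in dimension one (a sketch with u(x) = e^{−x^2}, for which the function rises from 0 to 1 and falls back, so that ‖u‖_TV = ‖u'‖_1 = 2):

```python
# One-dimensional check of ||u||_TV = ||u'||_1 for u(x) = exp(-x^2):
# the L1 norm of u' and the supremum-over-partitions total variation,
# approximated on a fine grid, should both be close to 2.
import math

N, a, b = 200_000, -8.0, 8.0
h = (b - a) / N
grid = [a + k * h for k in range(N + 1)]

l1_grad = sum(abs(-2 * x * math.exp(-x * x)) for x in grid[:-1]) * h
tv      = sum(abs(math.exp(-grid[k + 1] ** 2) - math.exp(-grid[k] ** 2))
              for k in range(N))
print(round(l1_grad, 4), round(tv, 4))
```

Since u is monotone on each half-axis, the grid sum for the total variation telescopes to 2·u(0) up to negligible tail terms, matching ∫ |u'| dx.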