A Lower Bound on the Relative Entropy with Respect to a Symmetric Probability

Let $\rho$ and $\mu$ be two probability measures on $\mathbb{R}$ which are not the Dirac mass at $0$. We denote by $H(\mu|\rho)$ the relative entropy of $\mu$ with respect to $\rho$. We prove that, if $\rho$ is symmetric and $\mu$ has a finite first moment, then \[ H(\mu|\rho)\geq \frac{\displaystyle{(\int_{\mathbb{R}}z\,d\mu(z))^2}}{\displaystyle{2\int_{\mathbb{R}}z^2\,d\mu(z)}}\,,\] with equality if and only if $\mu=\rho$.


Introduction
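Given two probability measures $\mu$ and $\rho$ on $\mathbb{R}$, the relative entropy of $\mu$ with respect to $\rho$ (or the Kullback-Leibler divergence of $\rho$ from $\mu$) is
\[ H(\mu|\rho)=\begin{cases}\displaystyle\int_{\mathbb{R}}\ln\frac{d\mu}{d\rho}\,d\mu & \text{if } \mu\ll\rho,\\[1ex] +\infty & \text{otherwise,}\end{cases} \]
where $d\mu/d\rho$ denotes the Radon-Nikodym derivative of $\mu$ with respect to $\rho$ when it exists. In this paper, we prove the following theorem:

Theorem 1. Let $\rho$ and $\mu$ be two probability measures on $\mathbb{R}$ which are not the Dirac mass at $0$. We suppose that
\[ \int_{\mathbb{R}}|z|\,d\mu(z)<+\infty. \]
If $\rho$ is symmetric then
\[ H(\mu|\rho)\geq \frac{\displaystyle{\Big(\int_{\mathbb{R}}z\,d\mu(z)\Big)^2}}{\displaystyle{2\int_{\mathbb{R}}z^2\,d\mu(z)}}\,, \]
with equality if and only if $\mu=\rho$.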
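As a quick numerical sanity check of Theorem 1, the following sketch compares both sides of the inequality for finitely supported measures. The helper names (`relative_entropy`, `lower_bound`) and the particular symmetric $\rho$ are illustrative assumptions, not part of the paper.

```python
import math
import random

def relative_entropy(mu, rho):
    """H(mu|rho) for finitely supported measures given as {point: mass} dicts."""
    total = 0.0
    for z, m in mu.items():
        if m == 0.0:
            continue
        if rho.get(z, 0.0) == 0.0:
            return math.inf  # mu is not absolutely continuous w.r.t. rho
        total += m * math.log(m / rho[z])
    return total

def lower_bound(mu):
    """(first moment)^2 / (2 * second moment) of mu; requires mu != delta_0."""
    m1 = sum(z * m for z, m in mu.items())
    m2 = sum(z * z * m for z, m in mu.items())
    return m1 * m1 / (2.0 * m2)

# A symmetric rho on {-2, -1, 0, 1, 2} (an arbitrary illustrative choice).
rho = {-2: 0.1, -1: 0.25, 0: 0.3, 1: 0.25, 2: 0.1}

random.seed(0)
gaps = []
for _ in range(1000):
    # Random probability vector mu on the support of rho.
    weights = [random.random() for _ in rho]
    s = sum(weights)
    mu = {z: w / s for z, w in zip(rho, weights)}
    gaps.append(relative_entropy(mu, rho) - lower_bound(mu))

min_gap = min(gaps)
print("smallest observed H(mu|rho) - bound:", min_gap)
```

In this experiment the gap should stay non-negative for every sampled $\mu$, and both sides vanish (up to rounding) when $\mu=\rho$.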
A remarkable feature of this inequality is that the lower bound does not depend on the symmetric probability measure $\rho$. We found the following related inequality in the literature (see lemma 3.10 of [2]): if $\rho$ is a probability measure on $\mathbb{R}$ whose first moment $m$ exists and such that […] then, for any probability measure $\mu$ on $\mathbb{R}$ having a first moment, we have […] Our inequality does not require an integrability condition. Instead we assume that $\rho$ is symmetric.
The three following sections are devoted to the proof of theorem 1 under the hypothesis that
\[ \mu\neq\rho,\qquad H(\mu|\rho)<+\infty,\qquad \int_{\mathbb{R}}z^2\,d\mu(z)<+\infty. \]
In section 2 we give some preliminaries and we prove an inequality involving the Cramér transform $I$ of $(Z,Z^2)$ when $Z$ is a random variable with distribution $\rho$. This is the key ingredient for proving theorem 1. In section 3, we first prove the large (i.e. non-strict) inequality when the support of $\mu$ has at least three points.
Finally the proof of the strict inequality in the general case is more technical and is the subject of section 4, which is split into three subsections. We end this paper with an appendix about some general results on the Cramér transform of a probability distribution on $\mathbb{R}^d$.

The Key Inequality

a) Preliminaries
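Since $H(\mu|\rho)<+\infty$, we have $\mu\ll\rho$ and we denote $f=d\mu/d\rho$. It follows from Jensen's inequality that, for any $\mu$-integrable function $\Phi$,
\[ \int_{\mathbb{R}}\Phi\,d\mu-H(\mu|\rho)=\int_{\mathbb{R}}\ln\frac{e^{\Phi}}{f}\,d\mu\leq\ln\int_{\mathbb{R}}\frac{e^{\Phi}}{f}\,d\mu\leq\ln\int_{\mathbb{R}}e^{\Phi}\,d\rho. \]
As a consequence
\[ \sup_{\Phi}\left\{\int_{\mathbb{R}}\Phi\,d\mu-\ln\int_{\mathbb{R}}e^{\Phi}\,d\rho\right\}\leq H(\mu|\rho). \]
In order to bring out the first and second moments of $\mu$, we consider functions $\Phi$ of the form $z\longmapsto uz+vz^2$, $(u,v)\in\mathbb{R}^2$. Then we obtain
\[ H(\mu|\rho)\geq I\big(m_1(\mu),m_2(\mu)\big), \]
where $m_1(\mu)=\int_{\mathbb{R}}z\,d\mu(z)$, $m_2(\mu)=\int_{\mathbb{R}}z^2\,d\mu(z)$ and
\[ \forall (x,y)\in\mathbb{R}^2\qquad I(x,y)=\sup_{(u,v)\in\mathbb{R}^2}\left\{ux+vy-\ln\int_{\mathbb{R}}e^{uz+vz^2}\,d\rho(z)\right\}. \]
Let $\nu_\rho$ (respectively $\nu_\mu$) be the law of $(Z,Z^2)$ when $Z$ is a random variable with distribution $\rho$ (respectively $\mu$). We denote by $\Lambda$ the Log-Laplace of $\nu_\rho$, defined by
\[ \forall (u,v)\in\mathbb{R}^2\qquad \Lambda(u,v)=\ln\int_{\mathbb{R}^2}e^{us+vt}\,d\nu_\rho(s,t)=\ln\int_{\mathbb{R}}e^{uz+vz^2}\,d\rho(z). \]
The function $I$ is the Cramér transform of $\nu_\rho$ (see the appendix for some generalities on the Cramér transform of a probability measure on $\mathbb{R}^d$). We denote by $D_\Lambda$ and $D_I$ the respective domains of $\mathbb{R}^2$ where $\Lambda$ and $I$ are finite.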
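The variational bound behind these preliminaries can be illustrated numerically. For $\Phi(z)=uz+vz^2$, the quantity $\int\Phi\,d\mu-\ln\int e^{\Phi}\,d\rho$ never exceeds $H(\mu|\rho)$. The sketch below evaluates it on a finite grid of $(u,v)$ for two finitely supported measures. The measures, the grid, and the function names are illustrative assumptions; the grid maximum is only a lower approximation of the supremum.

```python
import math

def log_laplace(rho, u, v):
    """ln of the integral of exp(u*z + v*z^2) against rho ({point: mass})."""
    return math.log(sum(m * math.exp(u * z + v * z * z) for z, m in rho.items()))

def relative_entropy(mu, rho):
    """H(mu|rho), assuming mu << rho and both finitely supported."""
    return sum(m * math.log(m / rho[z]) for z, m in mu.items() if m > 0.0)

rho = {-2: 0.1, -1: 0.25, 0: 0.3, 1: 0.25, 2: 0.1}   # symmetric
mu = {-2: 0.05, -1: 0.1, 0: 0.2, 1: 0.35, 2: 0.3}    # arbitrary, mu << rho

m1 = sum(z * m for z, m in mu.items())       # first moment of mu
m2 = sum(z * z * m for z, m in mu.items())   # second moment of mu

# Maximise u*m1 + v*m2 - Lambda(u, v) over a coarse grid of (u, v).
grid = [k / 20.0 for k in range(-60, 61)]
best = max(u * m1 + v * m2 - log_laplace(rho, u, v) for u in grid for v in grid)

H = relative_entropy(mu, rho)
print("grid supremum:", best)
print("H(mu|rho):    ", H)
```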

b) The Key Inequality for I
Let $C_\rho$ be the closed convex hull of the support of $\nu_\rho$. If $\nu_\rho$ is non-degenerate on $\mathbb{R}^2$ (i.e. its support is not included in a hyperplane of $\mathbb{R}^2$), then lemma A.1 in the appendix implies that $C_\rho^o$ is non-empty and
\[ D_I^o=C_\rho^o\subset\{(x,y)\in\mathbb{R}^2 : x^2<y\}. \]
In particular, if $(x,y)\in D_I^o$, then $y>0$. We denote by $A_I$ the admissible domain of $I$, which is the image of $D_\Lambda^o$ under the map $\nabla\Lambda$ (see proposition A.2 in the appendix for some properties of $A_I$). The following proposition states the key inequality for proving our main theorem. A proof of a stronger result is given in section 4 of [4]. For the sake of clarity and completeness, we state and prove the inequality we need to prove theorem 1 in the most efficient way.

Proposition 2. Let $\rho$ be a symmetric probability measure on $\mathbb{R}$ whose support has at least three points. If $x\geq 0$ and $(x,y)$ belongs to $A_I$, then $I$ is differentiable at $(x,y)$ and
\[ \frac{\partial I}{\partial x}(x,y)\geq\frac{x}{y}, \]
with equality if and only if $x=0$.
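Proof. The support of $\rho$ has at least three points, thus $\nu_\rho$ is non-degenerate. By proposition A.2 in the appendix, for any $(x,y)\in A_I$, there exists a unique point $(u,v)\in D_\Lambda^o$ such that $(x,y)=\nabla\Lambda(u,v)$. Since $\rho$ is symmetric, this gives
\[ x=\frac{\displaystyle\int_{\mathbb{R}}z\,e^{uz+vz^2}\,d\rho(z)}{\displaystyle\int_{\mathbb{R}}e^{uz+vz^2}\,d\rho(z)}=\frac{\displaystyle\int_{\mathbb{R}}z\sinh(uz)\,e^{vz^2}\,d\rho(z)}{\displaystyle\int_{\mathbb{R}}\cosh(uz)\,e^{vz^2}\,d\rho(z)}. \]
This formula shows that $u$ and $x$ have the same sign. Moreover $\tanh(t)\leq t$ for any $t\geq 0$ thus, if $x>0$ (hence $u>0$), then $\sinh(uz)\leq uz\cosh(uz)$ for any $z\geq 0$. There is equality if and only if $uz=0$. If $x$ and $u$ are positive then, using the symmetry of $\rho$ and the fact that $\rho$ is not the Dirac mass at $0$,
\[ x=\frac{\displaystyle\int_{\mathbb{R}}z\sinh(uz)\,e^{vz^2}\,d\rho(z)}{\displaystyle\int_{\mathbb{R}}\cosh(uz)\,e^{vz^2}\,d\rho(z)}<u\,\frac{\displaystyle\int_{\mathbb{R}}z^2\cosh(uz)\,e^{vz^2}\,d\rho(z)}{\displaystyle\int_{\mathbb{R}}\cosh(uz)\,e^{vz^2}\,d\rho(z)}=uy. \]
Finally $I$ is differentiable in $A_I$ and
\[ \frac{\partial I}{\partial x}(x,y)=u\geq\frac{x}{y}, \]
with equality if and only if $x=0$.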
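The mechanism of this proof can be checked numerically on a finitely supported symmetric $\rho$. For a parameter $(u,v)$ with $u>0$, let $(x,y)$ be the means of $Z$ and $Z^2$ under the measure with density proportional to $e^{uz+vz^2}$ with respect to $\rho$, i.e. $(x,y)=\nabla\Lambda(u,v)$. The inequality of the proposition then amounts to $x<uy$. The sketch below samples such parameters; the choice of $\rho$, the sampling ranges and the function name are illustrative assumptions.

```python
import math
import random

def tilted_moments(rho, u, v):
    """Return (x, y): the means of Z and Z^2 under the measure with density
    proportional to exp(u*z + v*z^2) with respect to rho, i.e. the gradient
    of the Log-Laplace of (Z, Z^2) at (u, v)."""
    weights = {z: m * math.exp(u * z + v * z * z) for z, m in rho.items()}
    norm = sum(weights.values())
    x = sum(z * w for z, w in weights.items()) / norm
    y = sum(z * z * w for z, w in weights.items()) / norm
    return x, y

# Symmetric measure with five support points (illustrative choice).
rho = {-3: 0.05, -1: 0.3, 0: 0.3, 1: 0.3, 3: 0.05}

random.seed(1)
slacks = []
for _ in range(1000):
    u = random.uniform(0.05, 3.0)   # u > 0, so x > 0 as well
    v = random.uniform(-2.0, 2.0)
    x, y = tilted_moments(rho, u, v)
    slacks.append(u * y - x)        # expected to be positive

min_slack = min(slacks)
print("smallest u*y - x:", min_slack)
```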

Proof of the large inequality
In this section, we only prove the large inequality of theorem 1 under the assumption that $\mu$ has at least three points in its support, since its proof is simpler than the proof of the strict inequality. The case where $\mu$ has only one or two points in its support is handled in section 4.c). The assumption implies that $\rho$ also has at least three points in its support, since $\mu\ll\rho$. As a consequence both $\nu_\rho$ and $\nu_\mu$ are non-degenerate on $\mathbb{R}^2$. We denote by $C_\mu$ the closed convex hull of the support of $\nu_\mu$.

Proof. The non-degeneracy of $\nu$ implies that […] then, by the Hahn-Banach separation theorem (see [9]), there exists $a\in\mathbb{R}^d$ such that […]

By integrating and using the fact that $I$ is even in its first variable, we get […]

Suppose that the support of $\rho$ is not compact. In that case, there exists $n_0\geq 1$ such that, for any $n\geq n_0$, the intersection of $[-n,n]$ with the support of $\mu$, and also with the support of $\rho$, contains at least three points. For any $n\geq n_0$, we introduce $\rho_n$ (respectively $\mu_n$) the probability $\rho$ (respectively $\mu$) conditioned on $[-n,n]$, and then we have, by the inequality we just proved, that
\[ H(\mu_n|\rho_n)\geq F(\mu_n), \]
since $\rho_n$ has a compact support. Here $F(\mu)=m_1(\mu)^2/(2m_2(\mu))$ denotes the lower bound of theorem 1. If $f=d\mu/d\rho$ then a straightforward computation gives that $\mu_n\ll\rho_n$ with
\[ \frac{d\mu_n}{d\rho_n}(z)=\frac{\rho([-n,n])}{\mu([-n,n])}\,f(z),\qquad z\in[-n,n]. \]
We have
\[ H(\mu_n|\rho_n)=\ln\frac{\rho([-n,n])}{\mu([-n,n])}+\frac{1}{\mu([-n,n])}\int_{[-n,n]}\ln f\,d\mu. \]
The hypothesis at the end of the introduction implies that the functions $\ln f$, $z\longmapsto z$ and $z\longmapsto z^2$ are integrable with respect to $\mu$. As a consequence, the dominated convergence theorem implies that
\[ \lim_{n\to+\infty}H(\mu_n|\rho_n)=H(\mu|\rho)\qquad\text{and}\qquad\lim_{n\to+\infty}F(\mu_n)=F(\mu). \]
This proves that $H(\mu|\rho)\geq F(\mu)$, in the case where the support of $\mu$ has at least three points.

Proof of the strict inequality
The proof of the strict inequality is more technical. In subsection 4.a), we prove a refinement of proposition 2. This helps us to prove that $H(\mu|\rho)>F(\mu)$ when $\nu_\mu$ is non-degenerate, in subsection 4.b). Finally we treat the degenerate case in subsection 4.c).

a) Refinement of proposition 2
The proof of the following proposition is also given in section 4 of [4], except for the point (c). For the sake of clarity and completeness, we give its proof in the most efficient way.
Proposition 4. If $\rho$ is a symmetric probability measure on $\mathbb{R}$ whose support has at least three points, then
\[ \forall (x,y)\in D_I^o\qquad I(x,y)\geq\frac{x^2}{2y}. \]
This inequality still holds for $(x,y)\in\partial D_I$ if there exists an increasing sequence $(x_n)_{n\geq 1}$ converging to $|x|$ and such that $(x_n,y)\in D_I^o$ for any $n\geq 1$. We denote by $\sigma^2$ the variance of $\rho$ if it exists. The inequality is strict if one of the following conditions is fulfilled: […]

Proof. According to proposition A.2 in the appendix, the admissible domain $A_I$ of $I$ is an open subset of $\mathbb{R}^2$. If $(x,y)\in A_I$ and $x>0$, then there exists $\varepsilon>0$ such that the ball of radius $2\varepsilon$ centered at $(x,y)$ is included in […] By integrating the inequality of proposition 2 over the interval $]\varepsilon,x[$, we obtain
\[ I(x,y)-I(\varepsilon,y)\geq\frac{x^2-\varepsilon^2}{2y}.\qquad(*) \]
Since $\rho$ is symmetric, the function $I$ is even in its first variable and this inequality still holds if $x<0$ and $\varepsilon$ […]

Moreover, a straightforward computation yields […] and $\rho([-n,n])$ goes to $1$ when $n$ goes to $+\infty$. Hence […] Finally $I$ is lower semi-continuous, thus […] Taking $\varepsilon=0$, we obtain the (large) inequality, since $I(0,y)\geq 0$ for any $y\in\mathbb{R}$.
(c) Let us introduce $I_o$ the Cramér transform of $\rho$. If there exists $u_0>0$ such that $\Lambda(u,0)<+\infty$ for any $u\in\,]-u_0,u_0[$, then theorem A.2 in the appendix implies that $I_o$ is $C^\infty$ at $0$ and a straightforward computation gives […] Thus there exists $\delta>0$ such that […] By applying inequality $(*)$ with $\varepsilon=\delta/2$, we see that $I(x,y)-x^2/(2y)>0$ as soon as $x\neq 0$. This ends the proof of the proposition.

b) The non-degenerate case
We suppose that $\mu$ has at least three points in its support. As in the beginning of section 3, we obtain that $(m_1(\mu),m_2(\mu))\in D_I^o$. Moreover proposition A.1 implies that there exists $(u,v)\in D_\Lambda$ such that
\[ I\big(m_1(\mu),m_2(\mu)\big)=u\,m_1(\mu)+v\,m_2(\mu)-\Lambda(u,v). \]
Let us introduce […] It is a non-empty set since it contains $\mathbb{R}\times\,]-\infty,0[$. For $(s,t)\in D_\Lambda$, we denote by $\rho_{s,t}$ the measure having the density $z\longmapsto\exp(sz+tz^2-\Lambda(s,t))$ with respect to $\rho$. We denote by $E_\rho$ the set of these probability measures. With these notations,
\[ H(\mu|\rho)=H(\mu|\rho_{u,v})+I\big(m_1(\mu),m_2(\mu)\big). \]
If $\mu\notin E_\rho$ then $H(\mu|\rho_{u,v})>0$ and thus $I(m_1(\mu),m_2(\mu))<H(\mu|\rho)$. As a consequence, the strict inequality $H(\mu|\rho)>F(\mu)$ is proved.

Let us suppose now that $\mu\in E_\rho$. Then $\mu=\rho_{u_0,v_0}$ for some $(u_0,v_0)\in D_\Lambda^{(2)}$ which is not $(0,0)$, otherwise $\mu=\rho$. In this case we compute […] so that the terms involved in these inequalities are equal. There are three cases: […] The point (c) in proposition 4 is then verified and thus the strict inequality of theorem 1 is proved in the non-degenerate case.
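Lemma 5. Let $\rho$ be a symmetric probability measure on $\mathbb{R}$ with variance $\sigma^2>0$. We denote by $\Lambda_o$ its Log-Laplace. If the support of $\rho$ contains at least three points, then the function
\[ G_\rho : u\longmapsto\frac{\displaystyle\int_{\mathbb{R}}z^2e^{uz}\,d\rho(z)}{\displaystyle\int_{\mathbb{R}}e^{uz}\,d\rho(z)} \]
has a unique minimum in $D_{\Lambda_o}$ at $0$, where it is equal to $\sigma^2$.

Proof. Suppose that $D_{\Lambda_o}$ […] The function $G_\rho$ is differentiable on $]-u_\infty,u_\infty[$ and $G_\rho'$ has the same sign as
\[ H_\rho : u\longmapsto\int_{\mathbb{R}}z^3e^{uz}\,d\rho(z)\int_{\mathbb{R}}e^{uz}\,d\rho(z)-\int_{\mathbb{R}}z^2e^{uz}\,d\rho(z)\int_{\mathbb{R}}z\,e^{uz}\,d\rho(z), \]
whose derivative is
\[ H_\rho'(u)=\int_{\mathbb{R}}z^4e^{uz}\,d\rho(z)\int_{\mathbb{R}}e^{uz}\,d\rho(z)-\Big(\int_{\mathbb{R}}z^2e^{uz}\,d\rho(z)\Big)^2, \]
which is positive by the Cauchy-Schwarz inequality (the equality case does not occur since $\rho$ has at least three points in its support). As a consequence $H_\rho$ is increasing on $]-u_\infty,u_\infty[$. Since $H_\rho(0)=0$ we conclude that $H_\rho$ (and thus $G_\rho'$) is positive on $]0,u_\infty[$ and negative on $]-u_\infty,0[$. This implies that $G_\rho$ has a unique minimum in $D_{\Lambda_o}$ at $0$. Finally $G_\rho(0)=\sigma^2$.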
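The statement of Lemma 5 can be explored numerically. The sketch below takes $G_\rho(u)=\int z^2e^{uz}\,d\rho/\int e^{uz}\,d\rho$, the second moment of the exponentially tilted measure; this explicit form of $G_\rho$ is an assumption of the sketch, as are the three-point measure and the function name. It evaluates $G_\rho$ on a grid and locates its minimum.

```python
import math

def second_moment_tilted(rho, u):
    """Second moment of the measure with density proportional to exp(u*z)
    with respect to rho ({point: mass})."""
    weights = {z: m * math.exp(u * z) for z, m in rho.items()}
    norm = sum(weights.values())
    return sum(z * z * w for z, w in weights.items()) / norm

rho = {-2: 0.2, 0: 0.6, 2: 0.2}                  # symmetric, three support points
sigma2 = sum(z * z * m for z, m in rho.items())  # variance (the mean is 0)

us = [k / 10.0 for k in range(-50, 51)]          # grid on [-5, 5], contains 0
values = [second_moment_tilted(rho, u) for u in us]
argmin = us[values.index(min(values))]

print("variance:", sigma2)
print("minimum of G on the grid:", min(values), "attained at u =", argmin)
```

With only two support points $\{-a,a\}$ the function would be constant, equal to $a^2$, which is why the three-point assumption matters for uniqueness of the minimum.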

c) The degenerate case
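In this subsection, we assume that $\mu$ has only one or two points in its support. Suppose first that $\mu=\delta_a$ for some $a\neq 0$. Since $\rho\neq\delta_0$ is symmetric and $\mu\ll\rho$, it follows that $\rho(\{a\})\in\,]0,1/2]$ and
\[ H(\mu|\rho)=-\ln\rho(\{a\})\geq\ln 2>\frac{1}{2}=\frac{a^2}{2a^2}=F(\mu). \]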

Lemma 3. If $\nu$ is non-degenerate on $\mathbb{R}^d$ and if its first moment $m$ exists, then $m$ belongs to $C^o$, the interior of the convex hull of the support of $\nu$.
This contradicts the non-degeneracy of $\nu$. Hence $m\in C^o$.

By the hypothesis at the end of the introduction, the first moment of $\nu_\mu$ exists and it is $(m_1(\mu),m_2(\mu))$. Thus it belongs to $C_\mu^o$, by the previous lemma. Since $\mu\ll\rho$, the support of $\mu$ is included in the support of $\rho$. It follows that $C_\mu^o\subset C_\rho^o$ and thus $(m_1(\mu),m_2(\mu))\in C_\rho^o$. Suppose that $\rho$ has a compact support. Then $D_\Lambda=\mathbb{R}^2$ and, by proposition A.2 in the appendix, $A_I$ is equal to the convex set $D_I^o$. Hence proposition 2 implies that
\[ \frac{\partial I}{\partial x}(x,y)\geq\frac{x}{y} \]
for any $(x,y)\in D_I^o$ with $x\geq 0$.

(a) $x\neq 0$ and $(x,y)\in A_I$,
(b) $(x,y)\neq(0,\sigma^2)$ and $(0,0)\in D_\Lambda^o$,
(c) $x\neq 0$, $y>\sigma^2$ and there exists $u_0>0$ such that $\Lambda(u,0)<+\infty$ for any $u\in\,]-u_0,u_0[$.
If $D_\Lambda=\mathbb{R}^2$, then proposition A.2 implies that $A_I=D_I^o$ and the above inequality holds for any $(x,y)\in D_I^o$ and $\varepsilon\in[0,|x|[$.

Let us suppose now that $D_\Lambda$ is strictly included in $\mathbb{R}^2$. We introduce the conditional probability $\rho_n=\rho(\,\cdot\,|[-n,n])$, for any $n\geq 1$ large enough so that $\nu_{\rho_n}$ is still non-degenerate. We denote by $\Lambda_n$ the Log-Laplace of $\nu_{\rho_n}$ and by $I_n$ its Cramér transform. Let $(x,y)\in D_I^o$ and $\varepsilon\in[0,|x|[$. By proposition A.3 of the appendix, there exists a sequence $(x_n,y_n)\in\mathbb{R}^2$ converging to $(x,y)$ and such that
\[ \limsup_{n\to+\infty}I_n(x_n,y_n)\leq I(x,y). \]
Since $(x,y)\in D_I^o$, we have $(x_n,y_n)\in D_I^o$ for $n$ large enough. We notice that $(C_{\rho_n}^o)_{n\geq 1}$ is an increasing sequence of open sets whose union is $C_\rho^o=D_I^o$. As a consequence there exists $n_0\geq 1$ such that $(x_n,y_n)\in D_{I_n}^o=C_{\rho_n}^o$ and $|x_n|>\varepsilon$ for all $n\geq n_0$. Since $D_{\Lambda_n}=\mathbb{R}^2$, we have […]

[…] $(x_n)_{n\geq 1}$ converging to $|x|$ with $(x_n,y)\in D_I^o$ for any $n\geq 1$. For $n$ large enough, $|x_n|>\varepsilon$ and
\[ \frac{I(x_n,y)-I(\varepsilon,y)}{x_n-\varepsilon}\geq\frac{x_n+\varepsilon}{2y}. \]
By convexity of $I$ and by sending $n$ to $+\infty$, we get
\[ \frac{I(|x|,y)-I(\varepsilon,y)}{|x|-\varepsilon}\geq\frac{|x|+\varepsilon}{2y}. \]
Finally $I(x,y)=I(|x|,y)$ and thus inequality $(*)$ is extended to this case.

Let $(x,y)$ belong to $D_I^o$ or be such that there exists an increasing sequence $(x_n)_{n\geq 1}$ converging to $|x|$ with $(x_n,y)\in D_I^o$ for any $n\geq 1$. Let us prove that, if one of the conditions (a), (b) or (c) is fulfilled, then the inequality is strict:
(a) If $x\neq 0$ and $(x,y)\in A_I$ then the inequality is strict, according to the beginning of the proof.
(b) If $(0,0)\in D_\Lambda^o$ then $I$ has a unique minimum at $(0,\sigma^2)$ (see theorems 25.1 and 27.1 of