Optimal transport and Rényi informational divergence

Transport-entropy inequalities are considered in terms of Rényi informational divergence.


Introduction
Let $(E, \rho)$ be a separable metric space. The Kantorovich distance between (Borel) probability measures $\mu$ and $\nu$ on $E$ is defined by
\[ W_1(\mu, \nu) = \inf_\pi \iint \rho(x, y)\, d\pi(x, y), \]
with the infimum taken over all measures $\pi$ on the product space $E \times E$ having $\mu$ and $\nu$ as marginal projections. One often tries to relate it to more tractable distance-like quantities or measures of deviation such as the Kullback-Leibler informational divergence (relative entropy)
\[ D(\nu|\mu) = \int f \log f\, d\mu, \]
assuming that $\nu$ is absolutely continuous with respect to $\mu$ ($\nu \ll \mu$) and has density $f = d\nu/d\mu$. In particular, relations
\[ W_1(\mu, \nu) \le K \sqrt{D(\nu|\mu)} \qquad (1.1) \]
form an important class of transport-entropy inequalities, with interesting connections to high-dimensional phenomena, limit theorems and other problems of Probability, Analysis and Geometry (cf. e.g. [7], [8], [9]). The measure $\mu$ in (1.1) is commonly fixed and is called a reference measure, while $\nu$ is arbitrary.
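The two quantities compared in a transport-entropy inequality can be computed by hand for finitely supported measures on the real line. The following is a minimal sketch, assuming $\rho(x, y) = |x - y|$ and using the classical one-dimensional formula $W_1 = \int |F - G|\,dx$ in terms of the distribution functions; the function names and the sample weights are illustrative choices, not taken from the text. The final bound combines Pinsker's inequality with the elementary estimate $W_1 \le \mathrm{diam}(E) \cdot \mathrm{TV}$, which is valid on bounded spaces only.

```python
import math

def w1_on_line(points, p, q):
    """Kantorovich distance W_1 between discrete laws p and q on sorted points.

    On the real line with rho(x, y) = |x - y|, the optimal cost equals the
    integral of |F - G|, where F and G are the two distribution functions.
    """
    total, F, G = 0.0, 0.0, 0.0
    for i in range(len(points) - 1):
        F += p[i]
        G += q[i]
        total += abs(F - G) * (points[i + 1] - points[i])
    return total

def kl_divergence(q, p):
    """D(nu|mu) = int f log f dmu for the density f = q / p (needs nu << mu)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

points = [0.0, 1.0, 2.0, 3.0]
mu = [0.25, 0.25, 0.25, 0.25]  # reference measure (illustrative)
nu = [0.10, 0.20, 0.30, 0.40]  # arbitrary measure with nu << mu

w1 = w1_on_line(points, mu, nu)
d = kl_divergence(nu, mu)

# Bounded-space bound: W_1 <= diam * TV <= diam * sqrt(D / 2) (Pinsker).
diam = points[-1] - points[0]
pinsker_bound = diam * math.sqrt(d / 2.0)
```

On an unbounded space no such universal bound is available, which is why integrability conditions on the reference measure enter the picture.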
The validity of the inequality (1.1) with some (finite) constant $K$ is known to be equivalent to the property
\[ \int e^{c\rho(x, x_0)^2}\, d\mu(x) < \infty, \qquad (1.2) \]
which should hold with some $c > 0$ and $x_0 \in E$ ([1], [3]). This subgaussian condition may turn out to be rather restrictive in applications, since for the finiteness of $W_1$ one only needs the finiteness of the first moment $\int \rho(x, x_0)\, d\mu(x)$. Therefore, it is natural to consider weaker variants of (1.1) with other informational distances so as to involve a larger class of reference distributions $\mu$. As it turns out, to this aim the Rényi divergence power
\[ D_\alpha(\nu|\mu) = \frac{1}{\alpha - 1} \Bigl[\int f^\alpha\, d\mu - 1\Bigr] \]
of order $\alpha > 1$ can replace $D$. It is related to Rényi's entropy like the Kullback-Leibler divergence is related to Shannon's entropy. Note that $0 \le D_\alpha \le \infty$, the function $\alpha \to D_\alpha$ is nondecreasing, and that $\lim_{\alpha \downarrow 1} D_\alpha = D$ (as long as $D_\alpha < \infty$ for some $\alpha > 1$). We refer to [5, 6] for an account of basic properties of these functionals. The aim of this note is to derive the following characterization complementing the equivalence of (1.1) and (1.2).
(1.5)
However, this moment condition is strictly stronger than (1.4) in general.
Here $d$ is a real parameter (necessarily $d > 0$ for integrability), and $c$ is a normalizing constant. Clearly, $\mu$ has a finite second moment when $d > 2$. In this case, (1.4) tells us that $\mu$ satisfies the transport-entropy inequality (1.3) if and only if $\alpha \ge 1 + \frac{2}{d}$. Note that (1.5) excludes the critical value $\alpha = 1 + \frac{2}{d}$.
One should mention that there exist results relating the transport cost $W_1(\mu, \nu)$ to other quantities depending on the distribution under $\mu$ of the density $f$ of $\nu$. For example, [2] provides a characterization for the inequalities including
Here, the right-hand side has a strong relationship with the Rényi divergence power. However, it does not have the meaning of a distance, and the inequality itself should be regarded from a different perspective.
ECP 20 (2015), paper 4.
Let us also comment on the related functional, the Rényi divergence
\[ d_\alpha(\nu|\mu) = \frac{1}{\alpha - 1} \log \int f^\alpha\, d\mu. \]
We have $D_\alpha = \frac{1}{\alpha - 1}\bigl(e^{(\alpha - 1) d_\alpha} - 1\bigr)$, so $d_\alpha$ and $D_\alpha$ are equivalent when these quantities are small. Since in general $D \le d_\alpha \le D_\alpha$, one may wonder whether or not the transport-entropy inequality (1.3) can be replaced with a stronger relation
\[ W_1(\mu, \nu) \le K \sqrt{d_\alpha(\nu|\mu)}. \]
However, this inequality turns out to be equivalent to the limit case $\alpha = 1$. That is, it holds if and only if the subgaussian integrability condition (1.2) is fulfilled (cf. Remark 4.2 below).
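The relations between $D$, $d_\alpha$ and $D_\alpha$ are easy to check numerically on a finite space. The sketch below assumes the standard definitions $D_\alpha = \frac{1}{\alpha - 1}(\int f^\alpha d\mu - 1)$ and $d_\alpha = \frac{1}{\alpha - 1}\log \int f^\alpha d\mu$; the function names and the two sample distributions are hypothetical.

```python
import math

def density_moment(q, p, alpha):
    """int f^alpha dmu for the density f = q / p of nu with respect to mu."""
    return sum(pi * (qi / pi) ** alpha for qi, pi in zip(q, p))

def renyi_power(q, p, alpha):
    """Renyi divergence power D_alpha(nu|mu) of order alpha > 1."""
    return (density_moment(q, p, alpha) - 1.0) / (alpha - 1.0)

def renyi_divergence(q, p, alpha):
    """Renyi divergence d_alpha(nu|mu) of order alpha > 1."""
    return math.log(density_moment(q, p, alpha)) / (alpha - 1.0)

def kl_divergence(q, p):
    """Kullback-Leibler divergence D(nu|mu), the limit case alpha = 1."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

mu = [0.25, 0.25, 0.25, 0.25]
nu = [0.10, 0.20, 0.30, 0.40]
alphas = [1.001, 1.5, 2.0, 3.0]

D = kl_divergence(nu, mu)
powers = [renyi_power(nu, mu, a) for a in alphas]
renyis = [renyi_divergence(nu, mu, a) for a in alphas]

# Identity D_alpha = (exp((alpha - 1) d_alpha) - 1) / (alpha - 1), term by term.
identity_gaps = [
    abs(P - (math.exp((a - 1.0) * r) - 1.0) / (a - 1.0))
    for a, P, r in zip(alphas, powers, renyis)
]
```

For this pair of measures, `powers` increases with the order, each entry dominates the corresponding entry of `renyis`, and the value at $\alpha = 1.001$ is already close to `D`.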
The paper is organized as follows. We start with the study of the Rényi divergence power as a convex functional on the space of densities and provide its description in the form of the supremum of certain linear functionals (Section 2); some immediate consequences are then developed in Section 3. In Sections 4-5 we prove Theorems 1.1-1.2, actually in a more quantified form of two-sided bounds on the optimal constant $K$ in (1.3). In particular, for $p \ge 2$, we consider the quantities
where the first supremum is running over all functions $u$ on $E$ with Lipschitz semi-norm $\|u\|_{\rm Lip} \le 1$ and $\mu$-mean zero. It will be shown that $K \sim K_{\alpha^*}$ within factors depending on $\alpha \in (1, 2]$ only.

Linearization of the Rényi divergence power
Denote by $P(\mu)$ the collection of all (probability) densities $f$ on an abstract measurable space $E$ with respect to a given probability measure $\mu$. Being convex on $P(\mu)$, the entropy functional $D$ admits a well-known sup-linear representation, namely
\[ D(\nu|\mu) = \sup\Bigl\{\int g\, d\nu : \int e^g\, d\mu \le 1\Bigr\}. \]
In other words, $\int g\, d\nu \le D(\nu|\mu)$ for all $\nu \ll \mu$ if and only if $\int e^g\, d\mu \le 1$. As a first step towards Theorem 1.1, we derive a similar description for the Rényi divergence power of an arbitrary order $\alpha > 1$.
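The sup-linear representation of the entropy can be illustrated on a finite space. The following sketch assumes the representation in the form $D(\nu|\mu) = \sup\{\int g\,d\nu : \int e^g d\mu \le 1\}$, with the supremum attained at $g = \log f$; the helper names, the sample measures and the random test functions are illustrative.

```python
import math
import random

def kl_divergence(q, p):
    """D(nu|mu) = int f log f dmu for the density f = q / p."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def normalize_exponent(g, p):
    """Shift g by a constant so that int e^g dmu = 1 exactly."""
    log_mgf = math.log(sum(pi * math.exp(gi) for gi, pi in zip(g, p)))
    return [gi - log_mgf for gi in g]

def integrate(g, q):
    """int g dnu for a function g and a discrete measure nu."""
    return sum(qi * gi for qi, gi in zip(q, g))

mu = [0.25, 0.25, 0.25, 0.25]
nu = [0.10, 0.20, 0.30, 0.40]
D = kl_divergence(nu, mu)

# The extremal function g = log f satisfies int e^g dmu = 1 and int g dnu = D.
g_star = [math.log(qi / pi) for qi, pi in zip(nu, mu)]
value_at_optimum = integrate(g_star, nu)

# Any other admissible g gives a smaller value of the linear functional.
random.seed(0)
smallest_gap = float("inf")
for _ in range(1000):
    g = normalize_exponent([random.uniform(-3.0, 3.0) for _ in mu], mu)
    smallest_gap = min(smallest_gap, D - integrate(g, nu))
```

Here `smallest_gap` stays nonnegative over all sampled functions, in line with the "if" part of the statement.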
In the sequel, we write $t_+ = \max(t, 0)$ and denote by $L^p(\mu)$ the usual Lebesgue space of all measurable functions $g$ on $E$ with finite norm $\|g\|_p = \bigl(\int |g|^p\, d\mu\bigr)^{1/p}$, $p \ge 1$.

where $c$ is the unique solution to the equation
(2.3)
As an equivalent description, Theorem 2.1 admits the following analog.
We split the proof into two steps. On $P(\mu)$ introduce the concave functional
Note that $Tf$ is just the difference between the left and right-hand sides of (2.1).
, the functional $T$ is bounded above on $P(\mu)$ and attains its maximum at some function $f \in P(\mu) \cap L^\alpha(\mu)$.
Proof. By Hölder's inequality,
up to some constants $c_0$ and $c_1$ depending on $\alpha$ only. Here, when taking the sup over all $f$, one may assume that $\|f\|_\alpha \le C$ with some large $C$.
Therefore, $T$ is bounded above on $P$ by the finite constant
The unit ball of $L^\alpha$ is weakly compact, so there is a subsequence $f_n$ weakly convergent to some $f$ with $\|f\|_\alpha \le C$. Necessarily $f \in P(\mu)$ and
, the maximizer for the functional $T$ is unique and has the form
for some constant $c$.
Consider the functions of the form
where $u$ is a bounded measurable function on $E$ vanishing outside $A_\delta$ and such that $\int u\, d\mu = 0$. Then $f_\varepsilon$ will belong to $P(\mu) \cap L^\alpha(\mu)$ for all sufficiently small $\varepsilon$, and hence $Tf_\varepsilon \le Tf$.
On the other hand, using Taylor's expansion, one can show that
Since $\varepsilon$ may be both positive and negative (although small), we conclude that
for all admissible functions $u$. But this is only possible when $g - \alpha^* f_0^{\alpha - 1} = c$ on $A_\delta$ for some constant $c$. Since $\delta > 0$ may also be arbitrary (although small), this constant $c$ cannot depend on $\delta$. As a result, g − α
Here the left-hand side is dominated by the right-hand side. But if $g(x) > c_1$ for some $x \in E$, then $\min(g(x), c_1) = c_1$, while $\min(g(x), c_2) > c_1$, so the above equality is impossible. Therefore, necessarily $\mu\{g > c_1\} = 0$, which proves the last assertion. In particular, for any $b > 0$, the equation $\varphi(c) = b$ has a unique solution $c$.
. First, using the property $\int f\, d\mu = 1$, we have
Secondly,
Using this extreme function (maximizer), one may rewrite the property "$Tf \le 0$ for all $f$", that is, (2.1), in terms of $D_\alpha$, as indicated in (2.2)-(2.3).
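The computation with the maximizer can be sketched as follows. This is a reconstruction under stated assumptions, not a quotation of the displays (2.1)-(2.3): it takes $D_\alpha(\nu|\mu) = \frac{1}{\alpha - 1}(\int f^\alpha d\mu - 1)$, the conjugate exponent $\alpha^* = \alpha/(\alpha - 1)$ (so that $\alpha^* - 1 = \frac{1}{\alpha - 1}$), and $Tf = \int g f\, d\mu - D_\alpha(\nu|\mu)$; the normalization used in the text may differ.

```latex
\begin{align*}
Tf &= \int g f\, d\mu - \frac{1}{\alpha - 1}\Bigl(\int f^\alpha\, d\mu - 1\Bigr), \\
g - \alpha^* f^{\alpha - 1} = c \ \text{ on } \{f > 0\}
  &\;\Longrightarrow\;
  f = \Bigl(\frac{(g - c)_+}{\alpha^*}\Bigr)^{\alpha^* - 1}, \\
\int f\, d\mu = 1
  &\iff \int (g - c)_+^{\alpha^* - 1}\, d\mu = (\alpha^*)^{\alpha^* - 1}, \\
\int g f\, d\mu &= \alpha^* \int f^\alpha\, d\mu + c, \\
Tf &= \int f^\alpha\, d\mu + c + \alpha^* - 1, \\
Tf \le 0
  &\iff \int (g - c)_+^{\alpha^*}\, d\mu \le -(\alpha^*)^{\alpha^*}\,(c + \alpha^* - 1).
\end{align*}
```

Under these assumptions the last line forces $c \le -(\alpha^* - 1)$, and the choice $c = -\alpha^*$ gives the condition $\int (1 + g/\alpha^*)_+^{\alpha^*} d\mu \le 1$, which tends to $\int e^g d\mu \le 1$ as $\alpha \downarrow 1$.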
Proof of Theorem 2.2. The function
It attains its minimum at a unique point $c$, namely, at which
But this is exactly the equation (2.3), while the inequality $\psi(c) \le -(\alpha^*)^{\alpha^*}(\alpha^* - 1)$ being stated at this point coincides with the condition (2.2).

Necessary and sufficient conditions
Since the description given in Theorem 2.1 for the property
\[ \int g\, d\nu \le D_\alpha(\nu|\mu) \quad \text{for any } \nu \ll \mu \qquad (3.1) \]
is somewhat implicit, it would be interesting to get more tractable conditions, necessary and sufficient, even if not simultaneously. Here we mention some such conditions, together with lower and upper bounds on the constant $c$ appearing in (2.2)-(2.3). To avoid situations when $D_\alpha(\nu|\mu)$ is finite, but the integral in (3.1) does not exist, we assume that $g_+ \in L^{\alpha^*}(\mu)$.
In particular, applying (3.1) to the measure $\nu = \mu$, we get $\int g\, d\mu \le 0$. A different choice leads to a stronger necessary condition
On the other hand, choosing $c = -\alpha^*$ in Theorem 2.2, we arrive at the sufficient condition
As $\alpha \downarrow 1$, both (3.2) and (3.3) are asymptotically optimal. Indeed, in the limit they yield $\int e^g\, d\mu \le 1$, which is necessary and sufficient for the relation $\int g\, d\nu \le D(\nu|\mu)$.
Nevertheless, although quite explicit and workable, (3.2)-(3.3) are not sharp enough to yield simultaneously necessary and sufficient conditions as in Theorem 1.1.
Let us now return to Theorem 2.1 and recall the condition
where $c$ solves the equation
In particular, in the corresponding cases,
Proof. The weaker upper bound $c \le -(\alpha^* - 1)$ immediately follows from (3.4). To refine it, we use (3.5) and apply Markov's inequality to get
Hence, 1 ≤ − c + α* − 1, proving the first statement.

Finiteness of the second moment
We are prepared to turn to Theorems 1.1-1.2, which will be established in a more quantitative form involving the quantities $K_p$ introduced in (1.6). In particular,
where the supremum is taken over the family $L$ of all functions $u$ on $E$ with Lipschitz semi-norm $\|u\|_{\rm Lip} \le 1$, having $\mu$-mean zero. This quantity is finite if and only if $\mu$ has a finite second moment. Indeed, for the finiteness, it is enough to consider the Lipschitz function $u(x) = \rho(x, x_0) - \int \rho(x, x_0)\, d\mu(x)$ in the definition of $K_2$ with an arbitrary fixed point $x_0 \in E$.
Recall that we consider the transport-entropy inequality
\[ W_1(\mu, \nu) \le K \sqrt{D_\alpha(\nu|\mu)} \qquad (4.1) \]
with an arbitrary probability measure $\nu \ll \mu$. For example, if $\nu = \mu_A$ has a constant density $f = \frac{1}{\mu(A)} 1_A$, then (4.1) becomes
Taking for $A$ a ball of a sufficiently large radius so that $\mu(A) > 0$, we get $W_1(\mu, \mu_A) < \infty$, while $\mu_A$ has a finite first moment. Hence, for (4.1) to hold, the reference measure $\mu$ must necessarily have a finite first moment. In that case, by a simple approximation argument, there is no loss of generality in assuming in (4.1) that $\nu$ has a finite first moment as well.
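Proof. By the Kantorovich-Rubinstein theorem, if $\mu$ and $\nu$ have finite first moments, there is the representation
\[ W_1(\mu, \nu) = \sup_u \Bigl[\int u\, d\nu - \int u\, d\mu\Bigr], \]
where the supremum is running over all $u$ on $E$ with $\|u\|_{\rm Lip} \le 1$ (cf. e.g. [4], p. 330).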
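On the real line the duality can be verified directly, since an optimal Lipschitz potential is explicit. The sketch below assumes $\rho(x, y) = |x - y|$ and finitely supported measures; it builds a 1-Lipschitz function $u$ whose slope between consecutive atoms is $-\mathrm{sign}(F - G)$ and compares $\int u\,d\mu - \int u\,d\nu$ with the primal value $\int |F - G|\,dx$. All names and sample weights are illustrative.

```python
def cdf_gaps(p, q):
    """Values of F - G on the intervals between consecutive atoms."""
    gaps, F, G = [], 0.0, 0.0
    for pi, qi in zip(p[:-1], q[:-1]):
        F += pi
        G += qi
        gaps.append(F - G)
    return gaps

def w1_primal(points, p, q):
    """Transport cost on the line: integral of |F - G|."""
    gaps = cdf_gaps(p, q)
    return sum(abs(g) * (points[i + 1] - points[i]) for i, g in enumerate(gaps))

def dual_potential(points, p, q):
    """1-Lipschitz u with slope -sign(F - G) between consecutive atoms."""
    u = [0.0]
    for i, g in enumerate(cdf_gaps(p, q)):
        slope = -1.0 if g > 0 else 1.0
        u.append(u[-1] + slope * (points[i + 1] - points[i]))
    return u

def dual_value(u, p, q):
    """int u dmu - int u dnu for the discrete measures p (mu) and q (nu)."""
    return sum(ui * (pi - qi) for ui, pi, qi in zip(u, p, q))

points = [0.0, 1.0, 2.0, 3.0]
mu = [0.25, 0.25, 0.25, 0.25]
nu = [0.40, 0.10, 0.10, 0.40]

primal = w1_primal(points, mu, nu)
u_opt = dual_potential(points, mu, nu)
dual = dual_value(u_opt, mu, nu)

# The linear potential u(x) = x is also 1-Lipschitz but only sees the means.
linear = dual_value(points, mu, nu)
```

In this example the two measures have equal means, so the linear potential gives nothing, while the constructed potential recovers the full transport cost. This is why the supremum over the whole Lipschitz class is needed.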
Then, (4.1) may equivalently be rewritten as
Given a bounded function $h$ on $E$ such that $\int h\, d\mu = 0$ and $\varepsilon > 0$ small enough, the function $f_\varepsilon = 1 + \varepsilon h$ represents the density of a probability measure, say $\nu = \nu_\varepsilon$, with respect to $\mu$. In this case, (4.
Inserting this in (4.3) and letting $\varepsilon \to 0$, we arrive at
Remark 4.2. Let us look at the possible sharpening of (4.1) in terms of the Rényi divergence, namely
\[ W_1(\mu, \nu) \le K \sqrt{d_\alpha(\nu|\mu)}. \qquad (4.4) \]
By the definition, if $\nu = \mu_A$ has a constant density $f = \frac{1}{\mu(A)} 1_A$ (as before), then
\[ d_\alpha(\mu_A|\mu) = \log \frac{1}{\mu(A)}, \]
which is independent of $\alpha$. On the other hand (following Marton's argument), given two measurable sets $A, B \subset E$ at distance $r = \rho(A, B)$, we have $W_1(\mu_A, \mu_B) \ge r$. Applying the triangle inequality for the metric $W_1$, (4.4) therefore yields
In particular, if $\mu(A) \ge \frac{1}{2}$, writing $B = E \setminus A_r$ in terms of the $r$-neighbourhood $A_r$ of $A$ for the metric $\rho$, we get
But this property is equivalent to the subgaussian condition (1.2). Therefore, (4.4) is equivalent to the standard transport-entropy inequality (1.1), corresponding to the order $\alpha = 1$.
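The key observation of the remark, that the Rényi divergence of a normalized restriction does not depend on the order, can be checked on a finite space. The sketch assumes the standard definitions of $d_\alpha$ and $D_\alpha$ through $\int f^\alpha d\mu$; the reference weights and the set $A$ are arbitrary illustrative choices.

```python
import math

def restricted_measure(mu, subset):
    """Return nu = mu_A (normalized restriction of mu to A) and mu(A)."""
    mass = sum(mu[i] for i in subset)
    nu = [mu[i] / mass if i in subset else 0.0 for i in range(len(mu))]
    return nu, mass

def density_moment(nu, mu, alpha):
    """int f^alpha dmu for the density f = dnu/dmu."""
    return sum(m * (n / m) ** alpha for n, m in zip(nu, mu))

mu = [0.10, 0.20, 0.30, 0.25, 0.15]
A = {0, 1}
nu, mass = restricted_measure(mu, A)

results = {}
for alpha in (1.5, 2.0, 5.0):
    moment = density_moment(nu, mu, alpha)         # equals mu(A)^(1 - alpha)
    d_alpha = math.log(moment) / (alpha - 1.0)     # Renyi divergence
    D_alpha = (moment - 1.0) / (alpha - 1.0)       # Renyi divergence power
    results[alpha] = (d_alpha, D_alpha)
```

Every `d_alpha` equals $\log(1/\mu(A))$ up to rounding, whereas `D_alpha` grows with the order. This is the reason why replacing $D_\alpha$ by $d_\alpha$ brings the inequality back to the case $\alpha = 1$.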

Theorems 1.1-1.2 and their refinements
up to some positive constants $c_\alpha$ and $C_\alpha$ depending on $\alpha$ only.
We also have $K \sim K_2$ for $\alpha > 2$, up to $\alpha$-dependent factors.
Corollary 5.2. For $\alpha \ge 2$, the best value of $K$ in (5.1) satisfies $\sqrt{2/\alpha}\, K_2 \le K \le C K_2$ with some absolute constant $C$.
Indeed, by (5.1)-(5.2) with $\alpha = 2$, and using the monotonicity of the divergence power with respect to $\alpha$, we get
This gives an upper bound $K \le C_2 K_2$, while the lower bound is provided by Theorem 4.1.
As was already mentioned in the previous section, (5.1) may equivalently be rewritten as
for all $u \in L$.
Squaring and using $\sup_\lambda (\lambda a - \lambda^2) = \frac{a^2}{4}$ together with the property that $-u \in L$ for all $u \in L$, we are reduced to the inequality of the form (5.3). That is, we obtain:
Lemma 5.3. Let $K$ be a positive constant. If $\mu$ and $\nu$ have finite first moments and $\nu \ll \mu$, (5.1) is equivalent to the relation
with arbitrary $u \in L$ and $\lambda > 0$.
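The optimization behind Lemma 5.3 is elementary and can be written out. This is a sketch assuming that (5.1) reads $W_1(\mu, \nu) \le K \sqrt{D_\alpha(\nu|\mu)}$ and that the linear functional is $g = \frac{2}{K} \lambda u - \lambda^2$, as used in the proof of Theorem 5.1; the symbol $a$ is shorthand introduced here.

```latex
\[
\lambda a - \lambda^2 = \frac{a^2}{4} - \Bigl(\lambda - \frac{a}{2}\Bigr)^2
\quad\Longrightarrow\quad
\sup_{\lambda > 0}\, (\lambda a - \lambda^2) = \frac{a^2}{4}
\qquad (a \ge 0).
\]
Taking $a = \frac{2}{K} \int u\, d\nu \ge 0$,
\[
\sup_{\lambda > 0} \int \Bigl(\frac{2}{K} \lambda u - \lambda^2\Bigr)\, d\nu
= \frac{1}{K^2} \Bigl(\int u\, d\nu\Bigr)^2 ,
\]
so that
\[
\int \Bigl(\frac{2}{K} \lambda u - \lambda^2\Bigr)\, d\nu \le D_\alpha(\nu|\mu)
\ \text{ for all } \lambda > 0
\quad\iff\quad
\int u\, d\nu \le K \sqrt{D_\alpha(\nu|\mu)} .
\]
```

When $\int u\, d\nu < 0$ both sides of the equivalence hold trivially, since the supremum over $\lambda > 0$ is then $0$ and $D_\alpha \ge 0$.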
Proof of Theorem 5.1 (lower bound on $K$). First assume that $u \in L^{\alpha^*}(\mu)$. By Proposition 3.1 with $g = \frac{2}{K} \lambda u - \lambda^2$ ($\lambda > 0$), we get (3.6) as a necessary condition for (5.4), namely
Restricting the integral to the set $u \ge K\lambda$, so that $\frac{2}{K} \lambda u - \lambda^2 \ge \frac{\lambda}{K} u$, the above yields
To simplify, assume that $\lambda \ge 1$, in which case we thus get
Substituting $\lambda = r/K$ and applying the same inequality to $-u$, we arrive at
(5.5)
In case $0 \le r \le K$, there is a similar obvious bound
At this step, the assumption $u \in L^{\alpha^*}(\mu)$ can easily be removed: (5.5) can always be applied to centered truncated Lipschitz functions $u_n = v_n - \int v_n\, d\mu$, where $v_n = u$ in case $|u| \le n$, and $v_n = \pm n$ depending on whether $u > n$ or $u < -n$. Letting $n \to \infty$, we arrive at
which yields the left inequality in (5.2) with $c_\alpha^{-1} = 1 + 4 (\alpha^*)^{\alpha^*}$.
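The truncation step at the end of the proof relies on the fact that clipping and centering keep a function in the class $L$ while making it bounded. A minimal sketch on a finite grid of the real line, with hypothetical helper names and an illustrative heavy-tailed weight sequence:

```python
def truncate_and_center(values, weights, n):
    """Return u_n = v_n - int v_n dmu, where v_n clips u to [-n, n]."""
    clipped = [max(-n, min(n, v)) for v in values]
    mean = sum(w * v for w, v in zip(weights, clipped))
    return [v - mean for v in clipped]

def lipschitz_seminorm(points, values):
    """Largest slope |u(x) - u(y)| / |x - y| over pairs of distinct points."""
    return max(
        abs(values[i] - values[j]) / abs(points[i] - points[j])
        for i in range(len(points))
        for j in range(i)
    )

# Grid on [-4, 6] with weights proportional to 1 / (1 + x^2) (illustrative).
points = [-4.0 + 0.5 * i for i in range(21)]
raw = [1.0 / (1.0 + x * x) for x in points]
total = sum(raw)
weights = [r / total for r in raw]

# A mean-zero 1-Lipschitz function: u(x) = x - int x dmu.
mean_x = sum(w * x for w, x in zip(weights, points))
u = [x - mean_x for x in points]

u_n = truncate_and_center(u, weights, n=2.0)
seminorm = lipschitz_seminorm(points, u_n)
mean_u_n = sum(w * v for w, v in zip(weights, u_n))
```

Since $|u_n| \le 2n$, the truncated function belongs to every $L^p(\mu)$, so an inequality proved under an integrability assumption on $u$ applies to each $u_n$.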

Note that Theorems 1.1-1.2 are immediately obtained from Theorem 5.1 and Corollary 5.2, since $K_{\alpha^*}$ is finite (with $\alpha \le 2$) if and only if the expression in (1.4) is finite. Before turning to the proof of Theorem 5.1, let us first explain how we will connect (5.1) to the relations as in Theorem 2.1, i.e., $\int g\, d\nu \le D_\alpha(\nu|\mu)$.

Example 1.3. Let $\mu$ be the generalized Cauchy distribution on the Euclidean space $E = \mathbb{R}^n$ (equipped with the Euclidean distance $\rho$), i.e., with density with respect to
Proof of Theorem 2.1. Combining Lemmas 2.3 and 2.4, it remains to look at the value of