On concentration of self-bounding functions

We prove some new concentration inequalities for self-bounding functions using the entropy method. As an application, we recover Talagrand's convex distance inequality. The new Bernstein-like inequalities for self-bounding functions are derived thanks to a careful analysis of the so-called Herbst argument. The latter involves comparison results between solutions of differential inequalities that may be interesting in their own right.


Introduction
Let $X_1,\ldots,X_n$ be independent random variables taking values in some measurable space $\mathcal{X}$, and let $f:\mathcal{X}^n\to\mathbb{R}$ be a real-valued function of $n$ variables. We are interested in concentration of the random variable $Z=f(X_1,\ldots,X_n)$ around its expected value. Well-known concentration inequalities establish quantitative bounds for the probability that $Z$ deviates significantly from its mean under smoothness conditions on the function $f$; see, for example, Ledoux [9] and McDiarmid [12] for surveys.
However, some simple conditions different from smoothness have been shown to guarantee concentration. Throughout the text, for each $i\le n$, $f_i$ denotes a measurable function from $\mathcal{X}^{n-1}$ to $\mathbb{R}$.
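The following condition used by Boucheron, Lugosi, and Massart [2] generalizes the notion of a configuration function introduced by Talagrand [21].
Definition 1. A function $f:\mathcal{X}^n\to\mathbb{R}$ is called self-bounding if for all $x\in\mathcal{X}^n$ and all $i=1,\ldots,n$,
$$0\le f(x)-f_i(x^{(i)})\le 1$$
and
$$\sum_{i=1}^n\bigl(f(x)-f_i(x^{(i)})\bigr)\le f(x),$$
where $x^{(i)}=(x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_n)\in\mathcal{X}^{n-1}$ is obtained by dropping the $i$-th component of $x$.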
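It is shown in [2] that if $f$ is self-bounding then $Z$ satisfies, for all $\lambda\in\mathbb{R}$, the sub-Poissonian inequality
$$\log\mathbb{E}\,e^{\lambda(Z-\mathbb{E}Z)}\le\bigl(e^{\lambda}-\lambda-1\bigr)\,\mathbb{E}Z,$$
which implies that for every $t\ge 0$,
$$\mathbb{P}\{Z\ge\mathbb{E}Z+t\}\le\exp\!\left(\frac{-t^2}{2(\mathbb{E}Z+t/3)}\right)$$
and for all $0<t<\mathbb{E}Z$,
$$\mathbb{P}\{Z\le\mathbb{E}Z-t\}\le\exp\!\left(\frac{-t^2}{2\,\mathbb{E}Z}\right).$$
An often convenient choice for $f_i$ is
$$f_i(x^{(i)})=\inf_{x_i'\in\mathcal{X}}f(x_1,\ldots,x_{i-1},x_i',x_{i+1},\ldots,x_n).\qquad(1)$$
Throughout the paper we implicitly assume that $f_i$ is measurable. (McDiarmid and Reed [13, Lemma 5] point out that this is not a restrictive assumption.)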
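As a small illustration of the self-bounding property, the sketch below checks the two standard conditions, $0\le f(x)-f_i(x^{(i)})\le 1$ and $\sum_i (f(x)-f_i(x^{(i)}))\le f(x)$, by brute force on a finite alphabet. The example function (the number of distinct values among the coordinates) and the helper names `num_distinct`, `drop` and `is_self_bounding` are illustrative assumptions, not taken from the paper.

```python
from itertools import product


def num_distinct(x):
    """f(x): number of distinct values among the coordinates of x."""
    return len(set(x))


def drop(x, i):
    """x^(i): the tuple x with its i-th coordinate removed."""
    return x[:i] + x[i + 1:]


def is_self_bounding(f, f_minus, alphabet, n):
    """Brute-force check of both self-bounding conditions on alphabet^n.

    f acts on n coordinates, f_minus on n - 1 coordinates (the same
    function f_i is used for every i in this sketch).
    """
    for x in product(alphabet, repeat=n):
        diffs = [f(x) - f_minus(drop(x, i)) for i in range(n)]
        if any(d < 0 or d > 1 for d in diffs):
            return False  # violates 0 <= f(x) - f_i(x^(i)) <= 1
        if sum(diffs) > f(x):
            return False  # violates sum_i (f(x) - f_i(x^(i))) <= f(x)
    return True


# Dropping a coordinate lowers the count by one exactly when that value
# occurs once, so the increments sum to the number of singleton values.
distinct_ok = is_self_bounding(num_distinct, num_distinct, (0, 1, 2), 4)

# A non-example: the squared sum of binary coordinates has increments
# larger than one.
square_sum_ok = is_self_bounding(
    lambda x: sum(x) ** 2, lambda x: sum(x) ** 2, (0, 1), 3
)
```

The exhaustive loop is only practical for tiny alphabets and small $n$; it is meant as a sanity check of the definition rather than a proof.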
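Definition 2. A function $f:\mathcal{X}^n\to\mathbb{R}$ is called $(a,b)$-self-bounding if for some $a,b>0$, for all $i=1,\ldots,n$ and all $x\in\mathcal{X}^n$,
$$0\le f(x)-f_i(x^{(i)})\le 1$$
and
$$\sum_{i=1}^n\bigl(f(x)-f_i(x^{(i)})\bigr)\le af(x)+b.$$
McDiarmid and Reed [13] show that under this condition, for all $t>0$,
$$\mathbb{P}\{Z\ge\mathbb{E}Z+t\}\le\exp\!\left(\frac{-t^2}{2(a\,\mathbb{E}Z+b+at)}\right)$$
and
$$\mathbb{P}\{Z\le\mathbb{E}Z-t\}\le\exp\!\left(\frac{-t^2}{2(a\,\mathbb{E}Z+b+t/3)}\right).$$
Maurer [11] considers an even weaker notion: $f$ is called weakly $(a,b)$-self-bounding if for all $x\in\mathcal{X}^n$,
$$\sum_{i=1}^n\bigl(f(x)-f_i(x^{(i)})\bigr)^2\le af(x)+b.$$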
The purpose of this paper is to further sharpen these results. The proofs, just like those of the above-mentioned inequalities, are based on the entropy method pioneered by Ledoux [8] and further developed, among others, by Boucheron, Lugosi, and Massart [3], Boucheron, Bousquet, Lugosi, and Massart [1], Bousquet [4], Klein [6], Massart [10], Rio [16], and Klein and Rio [7]. We present some applications. In particular, we are able to recover Talagrand's celebrated convex distance inequality [21], for which no complete proof based on the entropy method has been available.
For any real number $a\in\mathbb{R}$, we denote by $a_+=\max(a,0)$ and $a_-=\max(-a,0)$ the positive and negative parts of $a$. The main result of the paper is the following.
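Theorem 1. Let $X=(X_1,\ldots,X_n)$ be a vector of independent random variables, each taking values in a measurable set $\mathcal{X}$, and let $f:\mathcal{X}^n\to\mathbb{R}$ be a non-negative measurable function such that $Z=f(X)$ has finite mean. For $a,b\ge 0$, define $c=(3a-1)/6$.
1. If $f$ is $(a,b)$-self-bounding, then for all $\lambda\ge 0$ with $c_+\lambda<1$,
$$\log\mathbb{E}\,e^{\lambda(Z-\mathbb{E}Z)}\le\frac{(a\,\mathbb{E}Z+b)\lambda^2}{2(1-c_+\lambda)}$$
and for all $t>0$,
$$\mathbb{P}\{Z\ge\mathbb{E}Z+t\}\le\exp\!\left(\frac{-t^2}{2(a\,\mathbb{E}Z+b+c_+t)}\right).$$
2. If $f$ is weakly $(a,b)$-self-bounding and $f_i(x^{(i)})\le f(x)$ for all $i\le n$ and all $x\in\mathcal{X}^n$, then for all $0\le\lambda<2/a$,
$$\log\mathbb{E}\,e^{\lambda(Z-\mathbb{E}Z)}\le\frac{(a\,\mathbb{E}Z+b)\lambda^2}{2(1-a\lambda/2)}$$
and for all $t>0$,
$$\mathbb{P}\{Z\ge\mathbb{E}Z+t\}\le\exp\!\left(\frac{-t^2}{2(a\,\mathbb{E}Z+b+at/2)}\right).$$
3. If $f$ is weakly $(a,b)$-self-bounding and $f(x)-f_i(x^{(i)})\le 1$ for all $i\le n$ and all $x\in\mathcal{X}^n$, then for all $0<t\le\mathbb{E}Z$,
$$\mathbb{P}\{Z\le\mathbb{E}Z-t\}\le\exp\!\left(\frac{-t^2}{2(a\,\mathbb{E}Z+b+c_-t)}\right).$$
The bounds of the theorem reflect an interesting asymmetry between the upper and lower tail estimates. If $a\ge 1/3$, the left tail is sub-Gaussian with variance proxy $a\,\mathbb{E}Z+b$. If $a\le 1/3$, then the upper tail is sub-Gaussian. If $a=1/3$ then we get purely sub-Gaussian estimates on both sides. Of course, if $f$ is $(a,b)$-self-bounding for some $a\le 1/3$ then it is also $(1/3,b)$-self-bounding, so for all values of $a\le 1/3$ we obtain sub-Gaussian bounds, though for $a<1/3$ the theorem does not yield optimal constants in the denominator of the exponent. If $a\le 1/3$ and $f$ is weakly $(a,b)$-self-bounding, we thus have
$$\mathbb{P}\{Z\le\mathbb{E}Z-t\}\le\exp\!\left(\frac{-t^2}{2\bigl(a\,\mathbb{E}Z+b+t(1-3a)/6\bigr)}\right).$$
This type of phenomenon appears already in Maurer's bound (2), but the critical value of $a$ is now improved from 1 to 1/3. We have no special reason to believe that the threshold value 1/3 is optimal, but this is the best we get by our analysis.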
Note that the bounds for the upper tail for weakly self-bounded random variables are due to Maurer [11]. They are recalled here for the sake of self-reference.
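The asymmetry between the two tails can be made concrete with a short numerical sketch. It assumes that the constant is $c=(3a-1)/6$ and that the tail bounds have the Bernstein form $\exp\bigl(-t^2/(2(v+c_\pm t))\bigr)$ with $v=a\,\mathbb{E}Z+b$; the function names are illustrative only.

```python
import math


def bernstein_constants(a):
    """Return (c_plus, c_minus) for c = (3a - 1) / 6 (assumed form)."""
    c = (3.0 * a - 1.0) / 6.0
    return max(c, 0.0), max(-c, 0.0)


def tail_bound(t, v, scale):
    """Bernstein-type bound exp(-t^2 / (2 (v + scale * t)))."""
    return math.exp(-t * t / (2.0 * (v + scale * t)))


def upper_and_lower(a, b, mean, t):
    """Upper- and lower-tail bounds with variance proxy v = a * mean + b."""
    v = a * mean + b
    c_plus, c_minus = bernstein_constants(a)
    return tail_bound(t, v, c_plus), tail_bound(t, v, c_minus)


# For a = 1 the lower tail is purely sub-Gaussian (c_minus = 0), while the
# upper tail keeps a linear correction term c_plus * t = t / 3.
upper, lower = upper_and_lower(a=1.0, b=0.0, mean=10.0, t=5.0)
```

At $a=1/3$ both correction terms vanish, which is the threshold discussed above.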

The convex distance inequality
In a remarkable series of papers (see [21], [19], [20]), Talagrand developed an induction method to prove powerful concentration results. Perhaps the most widely used of these is the so-called "convex-distance inequality." Recall first the definition of the convex distance. In the sequel, $\|\cdot\|_2$ denotes the Euclidean norm. For any $x=(x_1,\ldots,x_n)\in\mathcal{X}^n$, let
$$d_T(x,A)=\sup_{\alpha\in[0,\infty)^n:\ \|\alpha\|_2=1}d_\alpha(x,A)$$
denote the convex distance of $x$ from the set $A$, where
$$d_\alpha(x,A)=\inf_{y\in A}\ \sum_{i:\,x_i\ne y_i}|\alpha_i|$$
is a weighted Hamming distance of $x$ to the set $A$. Talagrand's convex distance inequality states that if $X$ is an $\mathcal{X}^n$-valued vector of independent random variables, then for any set $A\subset\mathcal{X}^n$,
$$\mathbb{P}\{X\in A\}\ \mathbb{E}\,e^{d_T(X,A)^2/4}\le 1,$$
which implies, by Markov's inequality, that for any $t>0$,
$$\mathbb{P}\{d_T(X,A)\ge t\}\cdot\mathbb{P}\{X\in A\}\le e^{-t^2/4}.$$
Even though at first sight it is not obvious how Talagrand's result can be used to prove concentration for general functions $f$ of $X$, with relatively little work the theorem may be converted into very useful inequalities. Talagrand [19], Steele [18], and Molloy and Reed [14] survey a large variety of applications. Pollard [15] revisits Talagrand's original proof in order to make it more transparent.
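For small finite sets $A$, the convex distance can be bracketed numerically. The sketch below assumes the standard definitions: $d_\alpha(x,A)$ is the weighted Hamming distance, any unit vector $\alpha$ gives a lower bound on $d_T(x,A)$, and any probability vector $\nu$ on $A$ gives the upper bound $\|(\nu\{y_j\ne x_j\})_{j}\|_2$. All function names are illustrative and not taken from the paper.

```python
import math


def weighted_hamming(x, A, alpha):
    """d_alpha(x, A): smallest alpha-weighted Hamming distance to a point of A."""
    return min(
        sum(w for xi, yi, w in zip(x, y, alpha) if xi != yi) for y in A
    )


def convex_distance_lower_bound(x, A, alpha):
    """Lower bound on d_T(x, A) from one weight vector, rescaled to unit norm."""
    norm = math.sqrt(sum(w * w for w in alpha))
    return weighted_hamming(x, A, [abs(w) / norm for w in alpha])


def convex_distance_upper_bound(x, A, nu):
    """Upper bound on d_T(x, A) from one mixture nu over the points of A."""
    n = len(x)
    mismatch = [
        sum(p for y, p in zip(A, nu) if y[j] != x[j]) for j in range(n)
    ]
    return math.sqrt(sum(m * m for m in mismatch))


# With A = {(1, 0), (0, 1)} and x = (0, 0), the uniform weights and the
# uniform mixture give matching bounds, so d_T(x, A) = 1 / sqrt(2) here.
A = [(1, 0), (0, 1)]
x = (0, 0)
low = convex_distance_lower_bound(x, A, (1.0, 1.0))
high = convex_distance_upper_bound(x, A, (0.5, 0.5))
```

When the two bounds coincide, as in this example, the pair $(\alpha,\nu)$ is a saddle point and the common value is the convex distance.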
Several attempts have been made to recover Talagrand's convex distance inequality using the entropy method (see [3; 11; 13]). However, these attempts have only been able to partially recover Talagrand's result. In [3] we pointed out that the Efron-Stein inequality may be used to show that for all $X$ and $A\subset\mathcal{X}^n$,
$$\mathrm{Var}\bigl(d_T(X,A)\bigr)\le 1.$$
The same argument was used to show that Talagrand's inequality holds (with slightly different constants) for sets $A$ with $\mathbb{P}\{X\in A\}\ge 1/2$. Maurer [11] improved the constants but still fell short of proving it for all sets.
Here we show how Theorem 1 may be used to recover the convex distance inequality with a somewhat worse constant (10 instead of 4) in the exponent. Note that we do not use the full power of Theorem 1. In fact, Maurer's results may also be applied together with the argument below.
The main observation is that the square of the convex distance is self-bounding:
Lemma 1. For any $A\subset\mathcal{X}^n$ and $x\in\mathcal{X}^n$, the function $f(x)=d_T(x,A)^2$ satisfies $0\le f(x)-f_i(x^{(i)})\le 1$, where $f_i$ is defined as in (1). Moreover, $f$ is weakly $(4,0)$-self-bounding.
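Proof. The proof is based on different formulations of the convex distance. Let $\mathcal{M}(A)$ denote the set of probability measures on $A$. Then, using Sion's minimax theorem, we may re-write $d_T$ as
$$d_T(x,A)=\inf_{\nu\in\mathcal{M}(A)}\ \sup_{\alpha:\ \|\alpha\|_2\le 1}\ \sum_{j=1}^n\alpha_j\,\mathbb{E}_\nu\bigl[\mathbb{1}_{x_j\ne Y_j}\bigr],\qquad(3)$$
where $Y=(Y_1,\ldots,Y_n)$ is distributed according to $\nu$. Rather than minimizing in the large space $\mathcal{M}(A)$, we may as well perform minimization on the convex compact set of probability measures on $\{0,1\}^n$ by mapping $y\in A$ to $(\mathbb{1}_{y_j\ne x_j})_{1\le j\le n}$. Denote this mapping by $\chi$. Note that the mapping depends on $x$ but we omit this dependence to lighten notation. The set $\mathcal{M}(A)\circ\chi^{-1}$ of probability measures on $\{0,1\}^n$ coincides with $\mathcal{M}(\chi(A))$. It is convex and compact and therefore the infimum in the last display is achieved at some $\hat\nu$. Then $d_T(x,A)$ is just the Euclidean norm of the vector $\bigl(\hat\nu[\mathbb{1}_{x_j\ne Y_j}]\bigr)_{j\le n}$, and therefore the supremum in (3) is achieved by the vector $\hat\alpha$ of components
$$\hat\alpha_i=\frac{\hat\nu[\mathbb{1}_{x_i\ne Y_i}]}{\sqrt{\sum_{j=1}^n\bigl(\hat\nu[\mathbb{1}_{x_j\ne Y_j}]\bigr)^2}}.$$
For simplicity, assume that the infimum in the definition of $f_i(x^{(i)})$ in (1) is achieved by a proper choice of the $i$-th coordinate.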
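Clearly, $f(x)-f_i(x^{(i)})\ge 0$ for all $i$. On the other hand, let $\hat x_i$ and $\hat\nu_i$ denote the coordinate value and the probability distribution on $A$ that witness the value of $f_i(x^{(i)})$, that is,
$$f_i(x^{(i)})=\sum_{j\ne i}\bigl(\hat\nu_i[\mathbb{1}_{x_j\ne Y_j}]\bigr)^2+\bigl(\hat\nu_i[\mathbb{1}_{\hat x_i\ne Y_i}]\bigr)^2.$$
As $f(x)\le\sum_{j}\bigl(\hat\nu_i[\mathbb{1}_{x_j\ne Y_j}]\bigr)^2$, we have
$$f(x)-f_i(x^{(i)})\le\bigl(\hat\nu_i[\mathbb{1}_{x_i\ne Y_i}]\bigr)^2-\bigl(\hat\nu_i[\mathbb{1}_{\hat x_i\ne Y_i}]\bigr)^2\le 1.$$
It remains to prove that $f$ is weakly $(4,0)$-self-bounding. To this end, we may use once again Sion's minimax theorem, as in [3], to write the convex distance as
$$d_T(x,A)=\inf_{\nu\in\mathcal{M}(A)}\ \sup_{\alpha:\|\alpha\|_2\le 1}\sum_{j=1}^n\alpha_j\,\nu[\mathbb{1}_{x_j\ne Y_j}]=\sup_{\alpha:\|\alpha\|_2\le 1}\ \inf_{\nu\in\mathcal{M}(A)}\sum_{j=1}^n\alpha_j\,\nu[\mathbb{1}_{x_j\ne Y_j}].$$
Denote the pair $(\nu,\alpha)$ at which the saddle point is achieved by $(\hat\nu,\hat\alpha)$. In [3] it is shown that for all $x$,
$$\sum_{i=1}^n\Bigl(\sqrt{f(x)}-\sqrt{f_i(x^{(i)})}\Bigr)^2\le 1.\qquad(4)$$
For completeness, we recall the argument. Write $z=(x_1,\ldots,x_{i-1},\hat x_i,x_{i+1},\ldots,x_n)$, so that
$$\sqrt{f_i(x^{(i)})}=\inf_{\nu\in\mathcal{M}(A)}\ \sup_{\alpha:\|\alpha\|_2\le 1}\sum_{j=1}^n\alpha_j\,\nu[\mathbb{1}_{z_j\ne Y_j}]\ \ge\ \inf_{\nu\in\mathcal{M}(A)}\sum_{j=1}^n\hat\alpha_j\,\nu[\mathbb{1}_{z_j\ne Y_j}].$$
Let $\tilde\nu$ denote the distribution on $A$ that achieves the infimum in the latter expression. Then we have
$$\sqrt{f(x)}=\inf_{\nu\in\mathcal{M}(A)}\sum_{j=1}^n\hat\alpha_j\,\nu[\mathbb{1}_{x_j\ne Y_j}]\le\sum_{j=1}^n\hat\alpha_j\,\tilde\nu[\mathbb{1}_{x_j\ne Y_j}].$$
Hence,
$$\sqrt{f(x)}-\sqrt{f_i(x^{(i)})}\le\hat\alpha_i\,\tilde\nu\bigl[\mathbb{1}_{x_i\ne Y_i}-\mathbb{1}_{\hat x_i\ne Y_i}\bigr]\le\hat\alpha_i,$$
from which (4) follows since $\|\hat\alpha\|_2\le 1$. Finally,
$$\sum_{i=1}^n\bigl(f(x)-f_i(x^{(i)})\bigr)^2=\sum_{i=1}^n\Bigl(\sqrt{f(x)}-\sqrt{f_i(x^{(i)})}\Bigr)^2\Bigl(\sqrt{f(x)}+\sqrt{f_i(x^{(i)})}\Bigr)^2\le 4f(x)\sum_{i=1}^n\hat\alpha_i^2\le 4f(x).$$
Now the convex distance inequality follows easily:
Corollary 1.
$$\mathbb{P}\{X\in A\}\ \mathbb{E}\,e^{d_T(X,A)^2/10}\le 1.$$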
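Proof. First recall that $A=\{x:d_T(x,A)=0\}$. Observe now that combining Lemma 1 and the lower-tail bound of Theorem 1 (with $a=4$, $b=0$, so that $c_-=0$), and choosing $t=\mathbb{E}\,d_T^2(X,A)$, we have
$$\mathbb{P}\{X\in A\}=\mathbb{P}\bigl\{d_T(X,A)^2\le\mathbb{E}\,d_T^2(X,A)-t\bigr\}\le\exp\!\left(-\frac{\mathbb{E}\,d_T^2(X,A)}{8}\right).$$
On the other hand, the bound on the logarithmic moment generating function of weakly $(4,0)$-self-bounding functions in Theorem 1 gives, for $0\le\lambda<1/2$,
$$\log\mathbb{E}\,e^{\lambda(d_T(X,A)^2-\mathbb{E}\,d_T^2(X,A))}\le\frac{2\,\mathbb{E}\,d_T^2(X,A)\,\lambda^2}{1-2\lambda}.$$
Choosing $\lambda=1/10$ yields $\mathbb{E}\,e^{d_T(X,A)^2/10}\le\exp\bigl(\mathbb{E}\,d_T^2(X,A)/8\bigr)$, and multiplying the two bounds gives the desired result.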
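The constant 10 comes from balancing the two exponents in this proof, and the arithmetic can be checked exactly. The sketch assumes the two bounds used for a weakly $(a,0)$-self-bounding $Z$: $\log\mathbb{E}\,e^{\lambda(Z-\mathbb{E}Z)}\le a\,\mathbb{E}Z\,\lambda^2/(2(1-a\lambda/2))$ and, for $a\ge 1/3$, $\mathbb{P}\{Z\le 0\}\le\exp(-\mathbb{E}Z/(2a))$. The variable and function names are hypothetical.

```python
from fractions import Fraction


def log_mgf_coefficient(a, lam):
    """Coefficient of E[Z] in the bound a * lam^2 / (2 (1 - a * lam / 2))."""
    return a * lam ** 2 / (2 * (1 - a * lam / 2))


a = Fraction(4)          # f = d_T^2 is weakly (4, 0)-self-bounding
lam = Fraction(1, 10)    # the exponent 1/10 of the corollary

# log E exp(lam * Z) <= growth * E[Z]
growth = lam + log_mgf_coefficient(a, lam)

# log P{Z <= 0} <= -decay * E[Z]  (lower tail with t = E[Z], c_minus = 0)
decay = Fraction(1, 2) / a

# The product P{X in A} * E exp(Z / 10) is at most exp((growth - decay) E[Z]).
slack = growth - decay
```

With these values `growth` and `decay` are both exactly 1/8, so the product of the two bounds is at most 1. A slightly larger exponent such as 1/9 already makes `growth` exceed `decay`.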

The square of a regular function
Let $g:\mathcal{X}^n\to\mathbb{R}_+$ be a function of $n$ variables and assume that there exists a constant $v>0$ and there are measurable functions $g_i:\mathcal{X}^{n-1}\to\mathbb{R}_+$ such that for all $x\in\mathcal{X}^n$, $g(x)\ge g_i(x^{(i)})$ and
$$\sum_{i=1}^n\bigl(g(x)-g_i(x^{(i)})\bigr)^2\le v.$$
We call such a function $v$-regular. If $X=(X_1,\ldots,X_n)\in\mathcal{X}^n$ is a vector of independent $\mathcal{X}$-valued random variables, then by the Efron-Stein inequality, $\mathrm{Var}(g(X))\le v$. Also, it is shown in [8; 3] that for all $t>0$,
$$\mathbb{P}\{g(X)\ge\mathbb{E}g(X)+t\}\le e^{-t^2/(2v)}.$$
For the lower tail, Maurer [11] showed that if, in addition, $g(x)-g_i(x^{(i)})\le 1$ for all $i$ and $x$, then
$$\mathbb{P}\{g(X)\le\mathbb{E}g(X)-t\}\le e^{-t^2/(2(v+t/3))}.$$
However, in many situations one expects a purely sub-Gaussian behavior of the lower tail, something Maurer's inequality fails to capture. Here we show how Theorem 1 may be used to derive purely sub-Gaussian lower-tail bounds under an additional "bounded differences" condition for the square of $g$.
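Corollary 2. Let $g:\mathcal{X}^n\to\mathbb{R}_+$ be a $v$-regular function such that for all $x\in\mathcal{X}^n$ and $i=1,\ldots,n$, $g(x)^2-g_i(x^{(i)})^2\le 1$. Then for all $0\le t\le\mathbb{E}[g(X)^2]$,
$$\mathbb{P}\bigl\{g(X)^2\le\mathbb{E}[g(X)^2]-t\bigr\}\le\exp\!\left(\frac{-t^2}{8v\,\mathbb{E}[g(X)^2]+t\,(4v-1/3)_-}\right).$$
In particular, if $v\ge 1/12$, then for all $0\le t\le\mathbb{E}g(X)$,
$$\mathbb{P}\{g(X)\le\mathbb{E}g(X)-t\}\le e^{-t^2/(8v)}.$$
Proof. Introduce $f(x)=g(x)^2$ and $f_i(x^{(i)})=g_i(x^{(i)})^2$. Then $0\le f(x)-f_i(x^{(i)})\le 1$. Moreover,
$$\sum_{i=1}^n\bigl(f(x)-f_i(x^{(i)})\bigr)^2=\sum_{i=1}^n\bigl(g(x)-g_i(x^{(i)})\bigr)^2\bigl(g(x)+g_i(x^{(i)})\bigr)^2\le 4g(x)^2\sum_{i=1}^n\bigl(g(x)-g_i(x^{(i)})\bigr)^2\le 4v\,f(x),$$
and therefore $f$ is weakly $(4v,0)$-self-bounding. This means that the third inequality of Theorem 1 is applicable, and this is how the first inequality is obtained.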
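The second inequality follows from the first by noting that as $\mathbb{E}[g(X)^2]\ge(\mathbb{E}g(X))^2$,
$$\mathbb{P}\{g(X)\le\mathbb{E}g(X)-t\}\le\mathbb{P}\Bigl\{g(X)\sqrt{\mathbb{E}[g(X)^2]}\le\mathbb{E}[g(X)^2]-t\sqrt{\mathbb{E}[g(X)^2]}\Bigr\}\le\mathbb{P}\Bigl\{g(X)^2\le\mathbb{E}[g(X)^2]-t\sqrt{\mathbb{E}[g(X)^2]}\Bigr\},$$
and now the first inequality may be applied with $t\sqrt{\mathbb{E}[g(X)^2]}$ in place of $t$.
For a more concrete class of applications, consider a convex function $g$ defined on a bounded hyperrectangle, say $[0,1]^n$. If $X=(X_1,\ldots,X_n)$ are independent random variables taking values in $[0,1]$, then Talagrand [19] shows that
$$\mathbb{P}\{|g(X)-\mathbb{M}g(X)|>t\}\le 4e^{-t^2/(4L^2)},$$
where $\mathbb{M}g(X)$ denotes the median of the random variable $g(X)$ and $L$ is the Lipschitz constant of $g$. (In fact, this inequality holds under the weaker assumption that the level sets $\{x:g(x)\le t\}$ are convex.) Ledoux [8] used the entropy method to prove the one-sided inequality
$$\mathbb{P}\{g(X)-\mathbb{E}g(X)>t\}\le e^{-t^2/(2L^2)}$$
under the condition that $g$ is separately convex, that is, it is convex in any of its variables when the rest of the variables are fixed at an arbitrary value. We may use Ledoux's argument in combination with the corollary above.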
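Let $g:[0,1]^n\to\mathbb{R}$ be a non-negative separately convex function. Without loss of generality we may assume that $g$ is differentiable on $[0,1]^n$ because otherwise one may approximate $g$ by a smooth function in a standard way. Then, denoting
$$g_i(x^{(i)})=\inf_{x_i'\in[0,1]}g(x_1,\ldots,x_{i-1},x_i',x_{i+1},\ldots,x_n),$$
separate convexity gives
$$\sum_{i=1}^n\bigl(g(x)-g_i(x^{(i)})\bigr)^2\le\sum_{i=1}^n\left(\frac{\partial g}{\partial x_i}(x)\right)^2\le L^2.$$
This means that $g$ is $L^2$-regular and therefore Corollary 2 applies as long as $g(x)^2-g_i(x^{(i)})^2$ takes its values in an interval of length 1.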

Proofs
Our starting point is a so-called modified logarithmic Sobolev inequality that goes back (at least) to [10]. This inequality is at the basis of several concentration inequalities proved by the entropy method; see [2; 3; 17; 16; 4; 11; 13].

Theorem 2. (A MODIFIED LOGARITHMIC SOBOLEV INEQUALITY.) Let $X=(X_1,X_2,\ldots,X_n)$ be a vector of independent random variables, each taking values in some measurable space $\mathcal{X}$. Let $f:\mathcal{X}^n\to\mathbb{R}$ be measurable and let $Z=f(X)$. Let $X^{(i)}=(X_1,\ldots,X_{i-1},X_{i+1},\ldots,X_n)$ and let $Z_i$ denote a measurable function of $X^{(i)}$. Introduce $\psi(x)=e^x-x-1$. Then for any $\lambda\in\mathbb{R}$,
$$\lambda\,\mathbb{E}\bigl[Ze^{\lambda Z}\bigr]-\mathbb{E}\bigl[e^{\lambda Z}\bigr]\log\mathbb{E}\bigl[e^{\lambda Z}\bigr]\le\sum_{i=1}^n\mathbb{E}\Bigl[e^{\lambda Z}\,\psi\bigl(-\lambda(Z-Z_i)\bigr)\Bigr].$$
The entropy method converts the modified logarithmic Sobolev inequality into a differential inequality involving the logarithm of the moment generating function of $Z$. A more-or-less standard way of proceeding is as follows.
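If $\lambda\ge 0$ and $f$ is $(a,b)$-self-bounding, then, using $Z-Z_i\le 1$ and the fact that for all $x\in[0,1]$, $\psi(-\lambda x)\le x\,\psi(-\lambda)$,
$$\lambda\,\mathbb{E}\bigl[Ze^{\lambda Z}\bigr]-\mathbb{E}\bigl[e^{\lambda Z}\bigr]\log\mathbb{E}\bigl[e^{\lambda Z}\bigr]\le\psi(-\lambda)\,\mathbb{E}\bigl[(aZ+b)\,e^{\lambda Z}\bigr].$$
For any $\lambda\in\mathbb{R}$, define $G(\lambda)=\log\mathbb{E}\,e^{\lambda(Z-\mathbb{E}Z)}$. Then the previous inequality may be written as the differential inequality
$$\bigl[\lambda-a\,\psi(-\lambda)\bigr]G'(\lambda)-G(\lambda)\le v\,\psi(-\lambda),\qquad(5)$$
where $v=a\,\mathbb{E}Z+b$.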
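On the other hand, if $\lambda\le 0$ and $f$ is weakly $(a,b)$-self-bounding, then since $\psi(x)/x^2$ is non-decreasing over $\mathbb{R}_+$, $\psi(-\lambda(Z-Z_i))\le\psi(-\lambda)(Z-Z_i)^2$, so that
$$\sum_{i=1}^n\psi\bigl(-\lambda(Z-Z_i)\bigr)\le\psi(-\lambda)\sum_{i=1}^n(Z-Z_i)^2\le\psi(-\lambda)(aZ+b).$$
This again leads to the differential inequality (5) but this time for $\lambda\le 0$.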
When $a=1$, this differential inequality can be solved exactly (see [2]), and one obtains
$$G(\lambda)\le v\,\psi(\lambda).$$
The right-hand side is just the logarithm of the moment generating function of a centered Poisson($v$) random variable.
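The Poisson interpretation is easy to confirm numerically. The sketch below compares $v\,\psi(\lambda)$ with the logarithmic moment generating function of $N-v$, where $N$ is Poisson($v$), computed from a truncated series; the truncation level `terms` and the function names are illustrative choices.

```python
import math


def psi(x):
    """psi(x) = e^x - x - 1."""
    return math.exp(x) - x - 1.0


def centered_poisson_log_mgf(lam, v, terms=200):
    """log E exp(lam * (N - v)) for N ~ Poisson(v), by summing the series."""
    term = math.exp(-v)          # P{N = 0} * exp(lam * 0)
    total = term
    for k in range(1, terms):
        # P{N = k} e^{lam k} = P{N = k-1} e^{lam (k-1)} * v e^{lam} / k
        term *= v * math.exp(lam) / k
        total += term
    return math.log(total) - lam * v


v, lam = 3.0, 0.7
series_value = centered_poisson_log_mgf(lam, v)
closed_form = v * psi(lam)
```

The two values agree up to floating-point error for moderate $v$ and $\lambda$, for positive and negative $\lambda$ alike.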
However, when $a\ne 1$, it is not obvious what kind of bounds for $G$ should be expected. If $a>1$, then $\lambda-a\,\psi(-\lambda)$ becomes negative when $\lambda$ is large enough. Since both $G'(\lambda)$ and $G(\lambda)$ are non-negative when $\lambda$ is non-negative, (5) becomes trivial for large values of $\lambda$. Hence, at least when $a>1$, there is no hope to derive Poissonian bounds from (5) for positive values of $\lambda$ (i.e., for the upper tail).
Note that using the fact that $\psi(-\lambda)\le\lambda^2/2$ for $\lambda\ge 0$, (5) implies that for $\lambda\in[0,2/a)$,
$$\left(\frac{1}{\lambda}-\frac{a}{2}\right)G'(\lambda)-\frac{1}{\lambda^2}\,G(\lambda)\le\frac{v}{2}.$$
Observe that the left-hand side is just the derivative of $(1/\lambda-a/2)\,G(\lambda)$. Using the fact that $G(0)=G'(0)=0$, and that $G'(\lambda)\ge 0$ for $\lambda>0$, integrating this differential inequality leads to
$$G(\lambda)\le\frac{v\lambda^2}{2(1-a\lambda/2)}\qquad\text{for }\lambda\in[0,2/a),$$
which, by Markov's inequality and optimization of $\lambda$, leads to a first Bernstein-like upper tail inequality. Note that this is enough to derive the bounds for the upper tail of weakly self-bounded random variables. But we want to prove something more.
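The following lemma is the key in the proof of the theorem. It shows that if $f$ satisfies a self-bounding property, then on the relevant interval, the logarithmic moment generating function of $Z-\mathbb{E}Z$ is upper bounded by $v$ times a function $G_\gamma$ defined by
$$G_\gamma(\lambda)=\frac{\lambda^2}{2(1-\gamma\lambda)}$$
for every $\lambda$ such that $\gamma\lambda<1$, where $\gamma\in\mathbb{R}$ is a real-valued parameter. In the lemma below we mean $c_+^{-1}=\infty$ (resp. $c_-^{-1}=\infty$) when $c_+=0$ (resp. $c_-=0$).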

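Lemma 2. Let $a,v>0$ and let $G$ be a solution of the differential inequality
$$\bigl[\lambda-a\,\psi(-\lambda)\bigr]G'(\lambda)-G(\lambda)\le v\,\psi(-\lambda).$$
Define $c=(3a-1)/6$. Then, for every $\lambda\in(0,c_+^{-1})$,
$$G(\lambda)\le v\,G_{c_+}(\lambda),$$
and for every $\lambda\in(-\theta,0)$,
$$G(\lambda)\le v\,G_{-c_-}(\lambda),$$
where $\theta=c_-^{-1}\bigl(1-\sqrt{1-6c_-}\bigr)$ if $c_->0$ and $\theta=a^{-1}$ whenever $c_-=0$.
This lemma is proved in the next section. First we show how it implies our main result:
Proof of Theorem 1. The upper-tail inequality for $(a,b)$-self-bounding functions follows from Lemma 2 and Markov's inequality by routine calculations, exactly as in the proof of Bernstein's inequality when $c_+>0$, and it is straightforward when $c_+=0$.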
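The bound for the upper tail of weakly $(a,b)$-self-bounding functions is due to Maurer [11]. The derivation of the bound for the lower tail requires some more care. Indeed, we have to check that the condition $\lambda>-\theta$ is harmless. Since $\theta<c_-^{-1}$, by continuity, for every positive $t$,
$$\sup_{u\in(0,\theta)}\left(tu-\frac{vu^2}{2(1-c_-u)}\right)=\sup_{u\in(0,\theta]}\left(tu-\frac{vu^2}{2(1-c_-u)}\right).$$
Note that we are only interested in values of $t$ that are smaller than $\mathbb{E}Z\le v/a$. Now the supremum of
$$u\mapsto tu-\frac{vu^2}{2(1-c_-u)}$$
as a function of $u\in(0,c_-^{-1})$ is achieved either at $u_t=t/v$ (if $c_-=0$) or at $u_t=c_-^{-1}\bigl(1-(1+2tc_-/v)^{-1/2}\bigr)$ (if $c_->0$). It is time to take into account the restriction $t\le v/a$. In the first case, when $u_t=t/v$, this implies that $u_t\le a^{-1}=\theta$, while in the second case, since $a=(1-6c_-)/3$, it implies that $1+2tc_-/v\le(1-6c_-)^{-1}$ and therefore $u_t\le c_-^{-1}\bigl(1-\sqrt{1-6c_-}\bigr)=\theta$. In both cases $u_t\le\theta$, so for every $t\le v/a$ the supremum over $(0,\theta]$ coincides with the supremum over $(0,c_-^{-1})$, and the result follows.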
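The "routine calculations" that turn a bound of the form $G(\lambda)\le v\lambda^2/(2(1-c\lambda))$ into a Bernstein-type tail bound can be sketched as follows. The choice $\lambda=t/(v+ct)$ is the usual convenient (not exactly optimal) value in the Chernoff bound, and it reproduces the exponent $t^2/(2(v+ct))$; the function names are illustrative.

```python
import math


def log_mgf_bound(lam, v, c):
    """v * G_c(lam) = v * lam^2 / (2 (1 - c * lam)), valid when c * lam < 1."""
    return v * lam * lam / (2.0 * (1.0 - c * lam))


def chernoff_exponent(t, v, c):
    """lam * t - v * G_c(lam) evaluated at lam = t / (v + c * t)."""
    lam = t / (v + c * t)
    return lam * t - log_mgf_bound(lam, v, c)


def bernstein_exponent(t, v, c):
    """The closed form t^2 / (2 (v + c * t))."""
    return t * t / (2.0 * (v + c * t))


def best_exponent_on_grid(t, v, c, steps=10000):
    """Grid search of sup over lam in (0, 1/c) of lam * t - v * G_c(lam)."""
    upper = 1.0 / c
    return max(
        (k * upper / steps) * t - log_mgf_bound(k * upper / steps, v, c)
        for k in range(1, steps)
    )


t, v, c = 2.0, 5.0, 1.0 / 3.0
closed = bernstein_exponent(t, v, c)
plugged = chernoff_exponent(t, v, c)
searched = best_exponent_on_grid(t, v, c)
```

The grid search returns a value that is at least as large as the closed form, which reflects that the tail bound $\exp(-t^2/(2(v+ct)))$ is slightly weaker than the fully optimized Chernoff bound but much easier to state.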

Proof of Lemma 2
The entropy method consists in deriving differential inequalities for the logarithmic moment generating functions and solving those differential inequalities. In many circumstances, the differential inequality can be solved exactly as in [10; 2]. The next lemma allows one to deal with a large family of solvable differential inequalities. Lemma 4 will allow us to use this lemma to cope with more difficult cases and this will lead to the proof of Lemma 2.
Lemma 3. Let $f$ be a non-decreasing continuously differentiable function on some interval $I$ containing 0 such that $f(0)=0$, $f'(0)>0$ and $f(x)\ne 0$ for every $x\ne 0$. Let $g$ be a continuous function on $I$ and consider an infinitely many times differentiable function $G$ on $I$ such that $G(0)=G'(0)=0$ and for every $\lambda\in I$,
$$f(\lambda)\,G'(\lambda)-f'(\lambda)\,G(\lambda)\le f^2(\lambda)\,g(\lambda).$$
Then, for every $\lambda\in I$,
$$G(\lambda)\le f(\lambda)\int_0^\lambda g(x)\,dx.$$
Note that the special case when $f(\lambda)=\lambda$ and $g(\lambda)=L^2/2$ is the differential inequality obtained by the Gaussian logarithmic Sobolev inequality via Herbst's argument (see, e.g., Ledoux [9]) and is used to obtain Gaussian concentration inequalities. If we choose $f(\lambda)=e^\lambda-1$ and $g(\lambda)=-\,d\bigl(\lambda/(e^\lambda-1)\bigr)/d\lambda$, we recover the differential inequality used to prove concentration of $(1,0)$-self-bounding functions in [3].
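Except when $a=1$, the differential inequality (5) cannot be solved exactly. A roundabout is provided by the following lemma that compares the solutions of a possibly difficult differential inequality with solutions of a differential equation.
Lemma 4. Let $I$ be an interval containing 0 and let $\rho$ be continuous on $I$. Let $a\ge 0$ and $v>0$. Let $H:I\to\mathbb{R}$ be an infinitely many times differentiable function satisfying
$$\lambda H'(\lambda)-H(\lambda)\le\rho(\lambda)\bigl(aH'(\lambda)+v\bigr)$$
with $aH'(\lambda)+v>0$ for every $\lambda\in I$ and $H'(0)=H(0)=0$. Let $\rho_0:I\to\mathbb{R}$ be a function. Assume that $G_0:I\to\mathbb{R}$ is infinitely many times differentiable such that for every $\lambda\in I$, $aG_0'(\lambda)+1>0$ and $G_0'(0)=G_0(0)=0$ and $G_0''(0)=1$. Assume also that $G_0$ solves the differential equation
$$\lambda G_0'(\lambda)-G_0(\lambda)=\rho_0(\lambda)\bigl(aG_0'(\lambda)+1\bigr).$$
If $\rho(\lambda)\le\rho_0(\lambda)$ for every $\lambda\in I$, then $H\le vG_0$.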
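Proof. Let $I,\rho,a,v,H,G_0,\rho_0$ be defined as in the statement of the lemma. Combining the assumptions on $H$, $\rho_0$, $\rho$ and $G_0$, for every $\lambda\in I$,
$$\lambda H'(\lambda)-H(\lambda)\le\frac{\bigl(\lambda G_0'(\lambda)-G_0(\lambda)\bigr)\bigl(aH'(\lambda)+v\bigr)}{aG_0'(\lambda)+1},$$
or equivalently,
$$\bigl(\lambda+aG_0(\lambda)\bigr)H'(\lambda)-\bigl(1+aG_0'(\lambda)\bigr)H(\lambda)\le v\bigl(\lambda G_0'(\lambda)-G_0(\lambda)\bigr).$$
Setting $f(\lambda)=\lambda+aG_0(\lambda)$ for every $\lambda\in I$ and defining $g:I\to\mathbb{R}$ by
$$g(\lambda)=\frac{v\bigl(\lambda G_0'(\lambda)-G_0(\lambda)\bigr)}{\bigl(\lambda+aG_0(\lambda)\bigr)^2}\ \text{ if }\lambda\ne 0\quad\text{and}\quad g(0)=\frac{v}{2},$$
our assumptions on $G_0$ imply that $g$ is continuous on the whole interval $I$ so that we may apply Lemma 3. Hence, for every $\lambda\in I$,
$$H(\lambda)\le f(\lambda)\int_0^\lambda g(x)\,dx=v\,f(\lambda)\int_0^\lambda\left(\frac{G_0(x)}{f(x)}\right)'dx,$$
and the conclusion follows since $G_0(x)/f(x)$ tends to 0 when $x$ tends to 0.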
Since $G$ is the logarithmic moment generating function of $Z-\mathbb{E}Z$, this can be used to derive a Bernstein-type inequality for the left tail of $Z$. However, the obtained constants are not optimal, so proving Lemma 2 requires some more care.
Proof of Lemma 2. The function 2G γ may be the unique solution of equation ( 6) but this is not the only equation G γ is the solution of.Define .