On dimension-dependent concentration for convex Lipschitz functions in product spaces

Let $n\geq 1$, $K>0$, and let $X=(X_1,X_2,\dots,X_n)$ be a random vector in $\mathbb{R}^n$ with independent $K$--subgaussian components. We show that for every $1$--Lipschitz convex function $f$ in $\mathbb{R}^n$ (the Lipschitz property is with respect to the Euclidean metric), $$ \max\big(\mathbb{P}\big\{f(X)-{\rm Med}\,f(X)\geq t\big\},\mathbb{P}\big\{f(X)-{\rm Med}\,f(X)\leq -t\big\}\big)\leq \exp\bigg( -\frac{c\,t^2}{K^2\log\big(2+\frac{ n}{t^2/K^2}\big)}\bigg),\quad t>0, $$ where $c>0$ is a universal constant. The estimates are optimal in the sense that for every $n\geq \tilde C$ and $t>0$ there exist a random vector $X$ in $\mathbb{R}^n$ with independent $K$--subgaussian components and a $1$--Lipschitz convex function $f$ such that $$ \mathbb{P}\big\{\big|f(X)-{\rm Med}\,f(X)\big|\geq t\big\}\geq \tilde c\,\exp\bigg( -\frac{\tilde C\,t^2}{K^2\log\big(2+\frac{n}{t^2/K^2}\big)}\bigg). $$ The obtained deviation estimates for subgaussian variables are in sharp contrast with the case of variables with bounded $\|X_i\|_{\psi_p}$--norms for $p\in[1,2)$.


Introduction
Concentration in product probability spaces is an active research direction with numerous available results (see, in particular, the monographs [21,9]). Among classical examples of such results are Bernstein-type inequalities [9, Chapter 2] for linear combinations of independent random variables, and the isoperimetric inequality in the Gauss space, which implies subgaussian dimension-free concentration [29,8] (see also [11,3,5,4] as well as [25, Theorem V.1]).
First, let $\mathcal F_n$ be the class of $1$-Lipschitz functions on $\mathbb{R}^n$ (here and further in this note, the Lipschitz property is with respect to the standard Euclidean metric in $\mathbb{R}^n$), let $\mu_1,\dots,\mu_n$ be Borel probability measures on $\mathbb{R}$, and consider the quantity
$$\sup_{f\in\mathcal F_n}\mathbb{P}\big\{|f(X)-{\rm Med}\,f(X)|\geq t\big\},\qquad X\sim\mu_1\times\cdots\times\mu_n,\ t>0.\qquad(1)$$
In particular, it is known that whenever the measures $\mu_i$ satisfy the Poincaré inequality with a non-trivial constant $\lambda>0$, i.e.
$$\lambda\,{\rm Var}_{\mu_i}h\leq\mathbb{E}_{\mu_i}|h'|^2,\qquad 1\leq i\leq n,$$
for every smooth function $h:\mathbb{R}\to\mathbb{R}$, then the product measure $\mu_1\times\cdots\times\mu_n$ satisfies the Poincaré inequality on $\mathbb{R}^n$ with the same constant, which in turn implies a subexponential dimension-free upper bound $\exp(-ct)$ for (1), where $c>0$ depends only on the Poincaré constant [16] (see also, for example, [33, Chapter 2]). Conversely, if $\mu=\mu_1=\mu_2=\dots$ is a probability measure on $\mathbb{R}$ and, for some $t>0$, the quantity (1) is uniformly (over $n$) upper bounded by a quantity strictly less than $1/2$, then necessarily $\mu$ satisfies a Poincaré inequality with a non-trivial constant [13].
A connection between concentration and measure transport inequalities was first highlighted in [22,23]. In particular, it has been established in the literature (see [27, Section 7], [12, Section 5], [6, Corollary 5.1]) that exponential dimension-free concentration for $\mu^{\times n}$, $n\geq1$, is equivalent to a transportation-cost inequality in which the optimal transport cost $\inf_{X\sim\mu,\,Y\sim\nu}\mathbb{E}\,c(X,Y)$, for an appropriate cost function $c$, is bounded in terms of the relative entropy of $\nu$ with respect to $\mu$, for every probability measure $\nu$ absolutely continuous w.r.t. $\mu$; here the infimum is taken over all joint laws of $(X,Y)$ with $X\sim\mu$ and $Y\sim\nu$.
A complete characterization of product measures which enjoy dimension-free subgaussian concentration was obtained in [12] (see also the earlier work [32]). It was shown in [12] that, given a measure $\mu$ on $\mathbb{R}$, the quantity in (1) is upper bounded by $C\exp(-ct^2)$ for some $C,c>0$ (independent of $n$) if and only if there is a constant $D>0$ such that $\mu$ satisfies the following measure transportation inequality (the $T_2$-inequality):
$$\inf_{X\sim\mu,\;Y\sim\nu}\mathbb{E}\,|X-Y|^2\;\leq\;D\int\frac{d\nu}{d\mu}\,\log\frac{d\nu}{d\mu}\,d\mu$$
for every probability measure $\nu$ absolutely continuous w.r.t. $\mu$, where the infimum is over all pairs of random variables $X,Y$ on $\mathbb{R}$ with $X\sim\mu$ and $Y\sim\nu$. We refer to [12] for a more general statement. We would like to mention the logarithmic Sobolev inequality as a well-known sufficient condition for subgaussian concentration [10], [21, Chapter 5], as well as inequalities interpolating between log-Sobolev and Poincaré [20] as sufficient conditions for dimension-free concentration estimates of the form $\exp(-ct^p)$ for the quantities in (1).
Following works of Talagrand [30,31], it has been shown in various settings that by restricting the class of Lipschitz functions to convex (or concave) functions, the worst-case concentration estimates can be significantly improved. As an illustration, it is well known that for every $n\geq1$ there exists a (non-convex) $1$-Lipschitz function $f_n$ in $\mathbb{R}^n$ such that for the random vector $X^{(n)}$ uniformly distributed on the vertices of the cube $\{-1,1\}^n$, one has ${\rm Var}\,f_n(X^{(n)})=\Theta(\sqrt n)$ (see, for example, [33, Problem 4.9]). On the other hand, a classical result of Talagrand [30,31] asserts that there is a universal constant $c>0$ such that, with $\mathcal F_n:=\{\text{convex }1\text{-Lipschitz functions in }\mathbb{R}^n\}$ and with $\mu_1=\mu_2=\dots=\mu_n$ being the uniform measure on $\{-1,1\}$, the quantity in (1) is bounded above by $2\exp(-ct^2)$ for every $t>0$.

A complete characterization of probability measures $\mu$ on $\mathbb{R}$ such that (1) admits dimension-free subgaussian concentration for convex $1$-Lipschitz functions, with $\mu=\mu_1=\mu_2=\dots$, was obtained in [14,15] (see also [1] for an earlier result in this direction). The condition, which is both necessary and sufficient in that setting, is that
$$\mu((t+s,\infty))\leq2\exp(-cs^2)\,\mu((t,\infty))\quad\text{and}\quad\mu((-\infty,-t-s))\leq2\exp(-cs^2)\,\mu((-\infty,-t))\quad\text{for all }s,t>0$$
and some constant $c>0$; it can be interpreted as saying that the distribution $\mu$ has "no gaps". The convex subgaussian concentration is, in turn, implied by the convex log-Sobolev inequality (see [28]). For results dealing with dimension-free subexponential-type concentration for convex Lipschitz functions, we refer to [7,15,2,13].
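To get a feel for the convex case numerically, here is a minimal Monte Carlo sketch (ours, not taken from the note): the function $f(x)=\|x-(1,\dots,1)\|_2$ is convex and $1$-Lipschitz, and its fluctuations around the median on the discrete cube stay of constant order as $n$ grows, in agreement with Talagrand's bound.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_f(n, trials=2000):
    # X is uniform on the vertices of {-1,1}^n; f(x) = ||x - (1,...,1)||_2 is convex and 1-Lipschitz.
    X = rng.choice([-1.0, 1.0], size=(trials, n))
    return np.linalg.norm(X - 1.0, axis=1)

for n in [100, 1000, 10000]:
    vals = sample_f(n)
    med = np.median(vals)
    tail = np.mean(np.abs(vals - med) >= 2.0)
    print(f"n={n:6d}  median={med:8.2f}  std={vals.std():.3f}  P(|f - med| >= 2) ~ {tail:.4f}")
```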
Whereas necessary and sufficient conditions for dimension-free concentration are well understood, those conditions are rather strong. For example, it is easy to construct an unbounded subgaussian distribution which does not satisfy the condition for dimension-free subgaussian concentration mentioned above.
The main purpose of this note is to give an optimal dimension-dependent concentration bound in the class of subgaussian product measures for convex $1$-Lipschitz functions. However, we would like to start with a discussion of $\|\cdot\|_{\psi_p}$-bounded variables for $p\in(0,2)$, to emphasize the difference in tail behavior. We recall the definition of the $\|\cdot\|_{\psi_p}$-(quasi-)norm. Given a real-valued random variable $Y$, we set
$$\|Y\|_{\psi_p}:=\inf\big\{\lambda>0:\ \mathbb{E}\exp\big(|Y|^p/\lambda^p\big)\leq2\big\}.$$
In particular, $\|Y\|_{\psi_2}$ is the subgaussian constant of $Y$, and $\|Y\|_{\psi_1}$ is the subexponential constant. A random variable $Y$ with a bounded $\|\cdot\|_{\psi_p}$-norm satisfies, in view of Markov's inequality,
$$\mathbb{P}\big\{|Y|\geq t\big\}\leq2\exp\big(-t^p/\|Y\|_{\psi_p}^p\big),\qquad t>0.$$

Theorem 1.1. For every $p\in(0,2)$ there is a $c_p>0$ depending only on $p$ with the following property. Let $K>0$, $n\geq2$, and let $X=(X_1,X_2,\dots,X_n)$ be a vector of independent random variables with $\|X_i\|_{\psi_p}\leq K$, $1\leq i\leq n$.
Then for every $1$-Lipschitz convex function $f$ in $\mathbb{R}^n$, we have
$$\mathbb{P}\big\{|f(X)-{\rm Med}\,f(X)|\geq t\big\}\leq2\exp\bigg(-\frac{c_p\,t^2}{K^2(\log n)^{2/p}}\bigg)+2\exp\bigg(-\frac{c_p\,t^p}{K^p}\bigg),\qquad t>0.$$

We were not able to locate the above theorem in the literature, and we provide its proof for completeness. Theorem 1.1 is obtained by a simple reduction to Talagrand's inequality for bounded variables. We note here that the two-level tail behavior for functions of independent variables is a common phenomenon within high-dimensional probability, starting with the classical Bernstein inequality. It can be informally justified by saying that while the deviations of the individual variables in the above theorem are controlled by $\exp(-\Theta(t^p))$, linear combinations of the variables of the form $\sum_{i=1}^n a_iX_i$ (with $\|a\|_\infty\ll\|a\|_2$) exhibit subgaussian behavior in a certain range. Notice that, in the above statement, $2\exp\big(-c_p\,t^2/(K^2(\log n)^{2/p})\big)$ is the dominating term on the right hand side when $t/K=O\big((\log n)^{\frac{2}{p(2-p)}}\big)$. Further, there is no concentration phenomenon when $t/K=O\big((\log n)^{1/p}\big)$. For $t\gg K(\log n)^{\frac{2}{p(2-p)}}$, the tail is estimated by $O\big(\exp(-c_p\,t^p/K^p)\big)$.
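Assuming the two-term form of the bound discussed above, the crossover between the regimes can be checked by direct arithmetic; the following sketch (ours, with illustrative parameters) evaluates both exponents around $t=K(\log n)^{2/(p(2-p))}$ and reports which term dominates.

```python
import numpy as np

def exponents(t, p, K, n):
    # The two exponents from the discussion of Theorem 1.1 (ignoring the constant c_p):
    # a "subgaussian" exponent t^2/(K^2 (log n)^{2/p}) and a "psi_p" exponent (t/K)^p.
    sub_gauss = t**2 / (K**2 * np.log(n) ** (2.0 / p))
    psi_p = (t / K) ** p
    return sub_gauss, psi_p

p, K, n = 1.5, 1.0, 10**6
crossover = K * np.log(n) ** (2.0 / (p * (2.0 - p)))
for ratio in [0.5, 1.0, 2.0]:
    g, w = exponents(ratio * crossover, p, K, n)
    dominant = "subgaussian term" if g <= w else "psi_p term"
    print(f"t = {ratio:3.1f} * crossover   exponents = ({g:10.2f}, {w:10.2f})   dominant: {dominant}")
```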
It can be verified that the statement of Theorem 1.1 is optimal in the following sense (we only consider the range $p\in[1,2)$ here).

Proposition 1.2. For every $p\in[1,2)$ there is a $C_p>0$ depending only on $p$ with the following property. Let $n\geq C_p$, $t>0$, and $K>0$. Then there exist a random vector $X=(X_1,X_2,\dots,X_n)$ of independent random variables with $\|X_i\|_{\psi_p}\leq K$, $1\leq i\leq n$, and a convex $1$-Lipschitz function $f$ for which the deviation probability $\mathbb{P}\{|f(X)-{\rm Med}\,f(X)|\geq t\}$ admits a matching lower bound; here, $\tilde c,\tilde C>0$ denote universal constants.
Now, let $X=(X_1,\dots,X_n)$ be a vector of independent $K$-subgaussian random variables, that is, $\mathbb{E}\exp(X_i^2/K^2)\leq2$ for all $1\leq i\leq n$. Truncating the variables at the level $CK\sqrt{\log n}$ (for an appropriate choice of $C$) and applying the Talagrand convex distance inequality, it is easy to deduce that for every $1$-Lipschitz convex function $f$ the deviation probabilities $\mathbb{P}\{|f(X)-{\rm Med}\,f(X)|\geq t\}$ are bounded by a quantity of order $\exp\big(-c\,t^2/(K^2\log n)\big)$, where the implicit constant depends on $K$ only. A more elaborate argument [19, Lemma 1.8] gives, with the above notation, a variable-dependent estimate valid for all $t>0$ (see also [24]). When bounding the right hand side of that estimate as a function of $n$, $K$, and $t$ only, we again arrive at a bound of order $\exp\big(-c\,t^2/(K^2\log n)\big)$, which is not sharp for large $t$, as our main result below shows.
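To see the size of the improvement, the following sketch (ours, with illustrative parameters) compares the exponent $t^2/(K^2\log n)$ arising from the truncation argument with the exponent $t^2/\big(K^2\log(2+nK^2/t^2)\big)$ of Theorem 1.3 below; for $t$ of order $K\sqrt n$ the gain is roughly a factor of $\log n$.

```python
import numpy as np

def truncation_exponent(t, K, n):
    # exponent of the "truncation + Talagrand" bound: t^2 / (K^2 log n)
    return t**2 / (K**2 * np.log(n))

def theorem_exponent(t, K, n):
    # exponent of Theorem 1.3: t^2 / (K^2 log(2 + n K^2 / t^2))
    return t**2 / (K**2 * np.log(2.0 + n * K**2 / t**2))

K, n = 1.0, 10**6
for t in [3.0, 10.0, 100.0, 1000.0]:
    print(f"t={t:7.1f}   t^2/(K^2 log n) = {truncation_exponent(t, K, n):12.1f}   "
          f"t^2/(K^2 log(2+nK^2/t^2)) = {theorem_exponent(t, K, n):12.1f}")
```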
Theorem 1.3. There is a universal constant $c>0$ with the following property. Let $K>0$, $n\geq2$, and let $X=(X_1,X_2,\dots,X_n)$ be a vector of independent $K$-subgaussian random variables. Then for every $1$-Lipschitz convex function $f$ in $\mathbb{R}^n$, we have
$$\max\big(\mathbb{P}\big\{f(X)-{\rm Med}\,f(X)\geq t\big\},\mathbb{P}\big\{f(X)-{\rm Med}\,f(X)\leq-t\big\}\big)\leq\exp\bigg(-\frac{c\,t^2}{K^2\log\big(2+\frac{n}{t^2/K^2}\big)}\bigg),\qquad t>0.$$

The estimate provided by the theorem is optimal in the following sense:

Proposition 1.4. Let $K>0$, $n\geq\tilde C$, and $t>0$. Then there exist a vector $X=(X_1,X_2,\dots,X_n)$ of independent $K$-subgaussian random variables and a convex $1$-Lipschitz function $f$ such that
$$\mathbb{P}\big\{|f(X)-{\rm Med}\,f(X)|\geq t\big\}\geq\tilde c\,\exp\bigg(-\frac{\tilde C\,t^2}{K^2\log\big(2+\frac{n}{t^2/K^2}\big)}\bigg).$$
Here, $\tilde c,\tilde C>0$ are universal constants.
The structure of the note is as follows. In Section 2, we provide a proof of Theorem 1.1. Section 3 is devoted to proving Propositions 1.2 and 1.4. Finally, in Section 4 we prove the main result of the note, Theorem 1.3.

Proof of Theorem 1.1
Fix $p\in(0,2)$, $K>0$, a natural number $n\geq2$, and a $1$-Lipschitz convex function $f$ in $\mathbb{R}^n$. To prove the theorem, it is sufficient to verify a deviation inequality for the parameter $t\geq CK(\log n)^{1/p}$, where $C>0$ is a large constant depending on $p$. Let $X=(X_1,\dots,X_n)$ be a vector of independent variables with $\|X_i\|_{\psi_p}\leq K$, $1\leq i\leq n$.
For each number $k\geq1$, denote
Further, let $m\geq1$ be the largest integer such that
where the constant $\tilde c=\tilde c(p)>0$ is defined via the relation
We start by writing
To estimate the probability $\mathbb{P}\big\{|f(Y^{(1)})-{\rm Med}\,f(X)|\geq t/2\big\}$, we note that the diameter of the support of each $Y^{(1)}_i$ is at most $2K(4\log n)^{1/p}$, and hence, applying Talagrand's convex distance inequality for bounded variables (2), we get
for a universal constant $c>0$. On the other hand, we observe that
which, together with the last inequality, implies that
Further, for every $k\geq1$ we have
A standard estimate $s^{\lceil\tilde s\rceil}\binom{n}{\lceil\tilde s\rceil}\leq\big(\frac{en\,s}{\tilde s}\big)^{\tilde s}$, valid for any $\tilde s\in[1,n]$ and $s\in(0,(en)^{-1}]$, then implies
for some universal constant $c>0$, where the last inequality follows since $2en\leq\exp\big((2+\log_2(e))\log n\big)$.
For $k\leq m$, we use the inequality
where the last equality follows from (4). Using the definition of $m$ and assuming that the constant $C$ in the assumption on $t$ is sufficiently large, we get
for some $c''',\hat c>0$ depending only on $p$.
For $k>m$, we simply write
and, essentially repeating the above computations, get
for some $c''>0$ depending only on $p$. The result follows.

Proof of Propositions 1.2 and 1.4
First, consider the following basic example. Let $p\in[1,2]$, $\tilde K>0$, and let $\mu$ be the probability measure on $\mathbb{R}$ defined via the relation
It is easy to see that, with the random vector $X$ in $\mathbb{R}^n$ distributed according to $\mu^{\times n}$, the components of $X$ have $\|\cdot\|_{\psi_p}$-norms bounded by $O(\tilde K)$ (with an absolute implicit constant). On the other hand, with the function $f:\mathbb{R}^n\to\mathbb{R}$ given by
we have
which gives the required estimates in the corresponding range of $t$. The main statement of this section is the following proposition.

Proposition 3.1. There exists a universal constant $C>1$ so that the following holds. Let $n\geq C$, $p\in[1,2]$, and $K>0$. Then there exists a random vector $X=(X_1,\dots,X_n)$ with i.i.d. components whose $\|\cdot\|_{\psi_p}$-norm is bounded above by $K$ such that the required lower bound on the deviation probability holds.

Together with the above example, Proposition 3.1 implies Propositions 1.2 and 1.4. The "test" distribution we use to prove Proposition 3.1 is the $n$-fold product of a two-point probability measure $\mu$ with an appropriately chosen atom and weight.
The proof of the proposition relies on a precise lower bound for the tail probability of a binomial random variable. We need the following result.

Lemma 3.2. There exist universal constants $c_b\in(0,1)$ and $C_b>1$ so that the following holds. Let $n$ be a sufficiently large integer.
Remark 3.3. The term $\frac{r^2}{\theta n+r}$ corresponds to the usual Bernstein-type tail estimate, and $\log\big(2+\frac{\theta n+r}{\theta n}\big)$ is the "extra" factor emerging when $\theta n=o(r)$.
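Since the displayed statement of Lemma 3.2 is not reproduced here, the following numerical sketch (ours) should be read as an illustration of Remark 3.3 only: it compares $-\log\mathbb{P}\{{\rm Bin}(n,\theta)\geq\theta n+r\}$ with the expression $\frac{r^2}{\theta n+r}\log\big(2+\frac{\theta n+r}{\theta n}\big)$ and shows that the two agree up to constant factors, both in the Bernstein regime $r\lesssim\theta n$ and in the regime $\theta n=o(r)$.

```python
import numpy as np
from scipy.stats import binom

def predicted_exponent(theta, n, r):
    # Bernstein term r^2/(theta*n + r) times the "extra" logarithmic factor from Remark 3.3.
    tn = theta * n
    return r**2 / (tn + r) * np.log(2.0 + (tn + r) / tn)

n, theta = 10**6, 1e-4          # so that theta*n = 100
for r in [20, 50, 100, 200, 300]:
    # P{Bin(n, theta) >= theta*n + r}; sf(k) = P{Bin > k}, hence the shift by one.
    tail = binom.sf(int(np.ceil(theta * n + r)) - 1, n, theta)
    print(f"r={r:4d}   -log(tail) = {-np.log(tail):8.2f}   "
          f"predicted exponent = {predicted_exponent(theta, n, r):8.2f}")
```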
Although the above statement is based on completely standard calculations, we provide its proof for completeness.
Proof of Lemma 3.2. We will assume that $\sqrt{\theta n}$ (and $\theta n$) is greater than a sufficiently large universal constant and, at the same time, that $\theta$ is smaller than another small universal constant. Those conditions on $\theta$ can be imposed by adjusting the constant $c_b$ in the statement of the lemma. For every $k\leq n$, let $P_k$ denote the probability that a binomial random variable with parameters $n$ and $\theta$ equals $k$, and for $s>0$ let $P_{\geq s}$ denote the probability that it is at least $s$. We claim that in order to prove the lemma it is sufficient to establish the following inequalities
for some universal constants $\bar C,\widetilde C>1$.
To verify the claim, fix any $\theta$ (satisfying the assumptions from the beginning of the proof) and any $r$ with $0<r\leq n-\theta n$. We have $P_{\geq\theta n+r}=P_{\geq\lceil\theta n+r\rceil}$.

Proof of Theorem 1.3
Our proof of Theorem 1.3 is based on a modification of the induction method of Talagrand. In fact, the first part of the proof, which deals with setting up a recursive relation for a modified convex distance, essentially repeats, up to minor changes, the standard account of the method (see, for example, [21, p. 72–79]).
We recall that Talagrand's convex distance between a point $x\in\mathbb{R}^n$ and a set $A\subset\mathbb{R}^n$ is given by
$$\sup_{a\in\mathbb{R}^n_+,\;\|a\|_2\leq1}\;\inf_{y\in A}\;\sum_{i=1}^n a_i\,{\bf 1}_{\{x_i\neq y_i\}}.$$
Since we work with measures with (possibly) unbounded supports, it is crucial for us to track the "quantitative" distance between $x_i$ and $y_i$, $i\leq n$, and to consider the differences $|x_i-y_i|$ instead of the indicators ${\bf 1}_{\{x_i\neq y_i\}}$.
Definition 4.1. Given a point $x\in\mathbb{R}^n$ and a non-empty subset $A$ of $\mathbb{R}^n$, we define the modified convex distance between $x$ and $A$ as
$${\rm dist}_c(x,A):=\sup_{a\in\mathbb{R}^n_+,\;\|a\|_2\leq1}\;\inf_{y\in A}\;\sum_{i=1}^n a_i\,|x_i-y_i|.$$
Given a non-empty $A\subset\mathbb{R}^n$ and $x\in\mathbb{R}^n$, we denote by $U(x,A)$ the set of all vectors in $\mathbb{R}^n_+$ of the form $\big(|x_i-y_i|\big)_{i\leq n}$, $y\in A$, and let $V(x,A)\subset\mathbb{R}^n$ be the convex hull of $U(x,A)$.
Lemma 4.2. For every non-empty $A\subset\mathbb{R}^n$ and every $x\in\mathbb{R}^n$,
$${\rm dist}_c(x,A)={\rm dist}\big(0,V(x,A)\big),$$
where the distance on the right hand side is the usual Euclidean distance in $\mathbb{R}^n$. Furthermore, when $A$ is convex, ${\rm dist}_c(x,A)={\rm dist}(x,A)$.

Proof. The first assertion of the lemma can be derived following Talagrand's treatment of the original convex distance (see, in particular, [21, p. 72–73]).
We will provide the proof of the second assertion of the lemma for the reader's convenience. Let $A$ be a non-empty convex set. Without loss of generality, $A$ is closed and $x\notin A$. By a compactness argument, there is a vector $y\in x-A$ with $\|y\|_2={\rm dist}(0,x-A)={\rm dist}(x,A)$. The extremal property of $y$ implies that for all $z\in x-A$, we have $z\cdot y\geq y\cdot y$. Now, for any $z\in\mathbb{R}^n$, let $\tilde z$ be the vector obtained from $z$ by replacing each component of $z$ by its absolute value. Since $\tilde z\cdot\tilde y\geq z\cdot y$ for every $z$, the set $U(x,A)=\{\tilde z:\ z\in x-A\}$ is contained in the half-space $\{w\in\mathbb{R}^n:\ w\cdot\tilde y\geq\|y\|_2^2\}$, and the same is true for its convex hull $V(x,A)$. Therefore, ${\rm dist}(0,V(x,A))\geq\|\tilde y\|_2=\|y\|_2$. On the other hand, since $x-y\in A$, we have $\tilde y\in U(x,A)\subset V(x,A)$, and therefore ${\rm dist}(0,V(x,A))\leq\|\tilde y\|_2=\|y\|_2$. We conclude that ${\rm dist}(0,V(x,A))=\|y\|_2$, and the result follows.
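For concreteness, here is a small numerical sketch (ours) of the quantities appearing in the lemma, under the representation ${\rm dist}_c(x,A)={\rm dist}(0,V(x,A))$ with $U(x,A)=\{(|x_i-y_i|)_{i\leq n}:\ y\in A\}$ discussed above: the minimum-norm point of the convex hull of finitely many points is computed by a basic Frank–Wolfe iteration, and the output illustrates the chain ${\rm dist}(x,{\rm conv}\,A)\leq{\rm dist}_c(x,A)\leq\min_{y\in A}\|x-y\|_2$.

```python
import numpy as np

def min_norm_in_hull(points, iters=5000):
    # Frank-Wolfe iteration for the minimum-norm point of conv(points); points has shape (m, n).
    v = points[0].astype(float).copy()
    for k in range(iters):
        j = int(np.argmin(points @ v))          # linear minimization over the vertices
        gamma = 2.0 / (k + 2.0)
        v = (1.0 - gamma) * v + gamma * points[j]
    return float(np.linalg.norm(v))

def dist_c(x, A):
    # Modified convex distance: Euclidean distance from 0 to conv{ (|x_i - y_i|)_i : y in A }.
    return min_norm_in_hull(np.abs(x - A))

rng = np.random.default_rng(1)
n, m = 5, 8
A = rng.normal(size=(m, n))                     # a finite test set A (m points in R^n)
x = rng.normal(size=n) + 3.0

print("dist(x, conv A)  =", min_norm_in_hull(A - x))      # general lower bound for dist_c
print("dist_c(x, A)     =", dist_c(x, A))
print("min_y ||x - y||  =", float(np.min(np.linalg.norm(A - x, axis=1))))  # trivial upper bound
```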
Before turning to the proof of Proposition 4.3, let us show how to derive Theorem 1.3 from it.
Proof of Theorem 1.3. First, note that it is sufficient to prove the statement for $t\geq C'K\sqrt{\log n}$ for a large enough universal constant $C'>1$. For the upper tail, let $A:=\{x\in\mathbb{R}^n:\ f(x)\leq{\rm Med}\,f(X)\}$, so that $\mathbb{P}\{X\in A\}\geq\frac12$. Observe that, since $f$ is convex, so is the set $A$, and therefore $A_t:=\{x\in\mathbb{R}^n:\ {\rm dist}_c(x,A)<t\}=\{x\in\mathbb{R}^n:\ {\rm dist}(x,A)<t\}$, in view of Lemma 4.2. Since $f$ is $1$-Lipschitz, $\{f(X)-{\rm Med}\,f(X)\geq t\}\subset\{X\notin A_t\}$, and applying Proposition 4.3 together with Markov's inequality, we get an upper bound on $\mathbb{P}\{X\notin A_t\}$ in terms of a parameter $\delta\in(0,\frac12]$. We choose
$$\delta:=\exp\bigg(-\frac{\tilde c\,t^2/4}{K^2\log\big(2+\frac{K^2n}{\tilde c\,t^2/4}\big)}\bigg)$$
(we can assume that $\delta\leq1/2$ if $C'$ is sufficiently large). Observe that $\log(2+1/\delta)\geq\frac{\tilde c\,t^2/4}{K^2\log(2+\frac{K^2n}{\tilde c\,t^2/4})}$, and hence
$$\log\bigg(2+\frac{n}{\log(2+1/\delta)}\bigg)\leq\log\bigg(2+\frac{K^2n}{\tilde c\,t^2/4}\,\log\Big(2+\frac{K^2n}{\tilde c\,t^2/4}\Big)\bigg)\leq2\log\bigg(2+\frac{K^2n}{\tilde c\,t^2/4}\bigg).$$
Therefore, the resulting deviation estimate is of the form $\exp\big(-\frac{c\,t^2}{K^2\log(2+K^2n/t^2)}\big)$ up to a multiplicative factor, where $c:=\frac14\tilde c$. By assuming $C'>1$ to be sufficiently large and recalling that $t\geq C'K\sqrt{\log n}$, we absorb the multiplicative factor into the exponential, which completes the treatment of the upper tail.
For the lower tail, we take $A:=\{x\in\mathbb{R}^n:\ f(x)\leq{\rm Med}\,f(X)-t\}$ and define $A_t:=\{x\in\mathbb{R}^n:\ {\rm dist}_c(x,A)<t\}=\{x\in\mathbb{R}^n:\ {\rm dist}(x,A)<t\}$ (with the last equality due to convexity of $A$). Then $\{x\in\mathbb{R}^n:\ f(x)\geq{\rm Med}\,f(X)\}\subset A_t^c$ and therefore $\mathbb{P}\{X\in A_t^c\}\geq\frac12$. For $\delta\in(0,\frac12]$, we obtain, in view of Proposition 4.3 and Markov's inequality, the analogous estimate on $\mathbb{P}\{X\in A\}=\mathbb{P}\{f(X)\leq{\rm Med}\,f(X)-t\}$. Now, the same choice of $\delta$ leads to the desired bound.
As we have mentioned above, the proof of Proposition 4.3 is based on induction on the dimension. The next proposition sets up the argument.
Proof. Take an arbitrary element $(x,s)\in\mathbb{R}^n\times\mathbb{R}=\mathbb{R}^{n+1}$. Observe that
where the notation "$\oplus$" should be understood as vector-wise concatenation producing vectors in $\mathbb{R}^{n+1}$. Therefore, every vector of the form
where $v(\alpha)\in V(x,A(\alpha))$, $\alpha\in\mathbb{R}$, and $\nu$ is a discrete probability measure on $\mathbb{R}$ with a finite support, belongs to the convex hull $V((x,s),A)$ of $U((x,s),A)$. Further, we have for every Borel probability measure $\nu$ on $\mathbb{R}$ and every choice of $v(\alpha)\in V(x,A(\alpha))$:
by Jensen's inequality. Hence,
Recall from (14) that ${\rm dist}_c((x,s),A)={\rm dist}(0,V((x,s),A))$ and ${\rm dist}_c(x,A(\alpha))={\rm dist}(0,V(x,A(\alpha)))$. Thus, taking $v(\alpha)\in V(x,A(\alpha))$ so that $\|v(\alpha)\|_2={\rm dist}_c(x,A(\alpha))$ for all $\alpha$, we obtain that
where the infimum is taken over all discrete probability measures $\nu$ on $\mathbb{R}$ with a finite support. Clearly,
Further, applying (16) we get
We write
where the product is taken over all $\alpha$ in the support of $\nu$ (which is a finite set in $\mathbb{R}$), and $\nu\{\alpha\}$ is the probability mass of $\alpha$. Since $\sum_\alpha\nu\{\alpha\}=1$, in view of Hölder's inequality, the quantity in (17) is majorized by the desired expression, and the result follows.
Remark 4.5. The class of measures ν in the above proposition is restricted to discrete measures to avoid any discussion of measurability.
By considering two-point probability measures $\nu$ of the form $\lambda\delta_{X_{n+1}}+(1-\lambda)\delta_y$, from the last proposition we obtain the following corollary.
Next, we record the following elementary fact.
Proof. We have
The last expression, viewed as a function of $\lambda\in[0,1]$, is minimized at $\lambda=\max\big(0,\,1-\frac{a-b}{2c_0R^2}\big)$, and (18) follows since

As an immediate consequence of Corollary 4.6 and Lemma 4.7, by considering two-point measures we obtain the following proposition.
Proposition 4.8. Let $n\geq1$, and let $\mu_1,\mu_2,\dots,\mu_{n+1}$ be probability measures on $\mathbb{R}$. Let $A\subset\mathbb{R}^{n+1}$ be a non-empty subset, and for each $\alpha\in\mathbb{R}$ denote $A(\alpha):=\{x'\in\mathbb{R}^n:\ (x',\alpha)\in A\}$. Let $X=(X_1,X_2,\dots,X_{n+1})$ be distributed in $\mathbb{R}^{n+1}$ according to $\mu_1\times\mu_2\times\cdots\times\mu_{n+1}$, and let $X'$ be the vector of the first $n$ components of $X$. Then for every $\kappa>0$,
where $H(t,y):=\min$
and $h:\mathbb{R}\to\mathbb{R}$ is any function satisfying
Moreover, the function $H(t,y)$ can be represented as

Remark 4.9. Repeating the optimization argument from [21, p. 74], we get for every pair of numbers $t,y$ with $h(y)\geq h(t)$, and for every number $Q\geq4\kappa(y-t)^2$:
where in the second line we applied Lemma 4.7 with $a:=0$, $b:=\frac{h(t)-h(y)}{4\kappa(y-t)^2}$, and $c_0R^2:=\frac14$, and where in the last line we used that the function $s\mapsto s\log\big(2-\exp\big(\frac{h(t)-h(y)}{s}\big)\big)$, $s>0$, is non-decreasing.
The next lemma encapsulates the initial step of the induction.

Lemma 4.10. Let $\mu$ be a $K$-subgaussian probability measure on $\mathbb{R}$, and let $X$ be distributed according to $\mu$. Then for any choice of the parameter $L\geq\sqrt2\,K$ and any non-empty Borel subset $A\subset\mathbb{R}$,

Proof. Since $A$ is a subset of $\mathbb{R}$, the convex distance ${\rm dist}_c(\cdot,A)$ coincides with ${\rm dist}(\cdot,A)$. Without loss of generality, the set $A$ is closed. Let $x\in A$ be a point with ${\rm dist}_c(0,A)={\rm dist}(0,A)=|x|$. Then
The result follows.
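The displayed estimate of Lemma 4.10 is not reproduced above; a plausible reading, consistent with Remark 4.12 below, is a bound on the product $\mathbb{P}\{X\in A\}\,\mathbb{E}\exp\big({\rm dist}_c(X,A)^2/L^2\big)$ in dimension one. The following Monte Carlo sketch (ours, under that assumption) checks, for a Gaussian $\mu$ and half-line sets $A$, that this product remains bounded by a small constant once $L\geq\sqrt2\,K$.

```python
import numpy as np

rng = np.random.default_rng(2)

# For X ~ N(0,1) one has E exp(X^2/K^2) = (1 - 2/K^2)^(-1/2) <= 2 exactly when K^2 >= 8/3,
# so this choice of K makes X a K-subgaussian variable in the sense used in the note.
K = np.sqrt(8.0 / 3.0)
L = np.sqrt(2.0) * K
X = rng.normal(size=2_000_000)

for a in [0.0, 1.0, 2.0, 3.0, 4.0]:
    # A = [a, +infinity); in dimension one dist_c(x, A) = dist(x, A) = max(a - x, 0).
    p_A = np.mean(X >= a)
    ee = np.mean(np.exp(np.maximum(a - X, 0.0) ** 2 / L**2))
    print(f"a={a:.1f}   P(X in A)={p_A:.5f}   E exp(dist^2/L^2)={ee:10.2f}   product={p_A * ee:.3f}")
```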
In the next lemma, we deal with "the main part" of the induction argument. The basic idea is to split the argument into two cases, according to how much of the "total mass" of a set $A$ is located far from the origin.

Lemma 4.11. Let $A$ be a non-empty subset of $\mathbb{R}^m$, $m\geq2$. Further, let $\mu_1,\mu_2,\dots,\mu_m$ be $K$-subgaussian measures on $\mathbb{R}$, each supported on finitely many points, let $X=(X_1,\dots,X_m)$ be distributed according to $\mu_1\times\mu_2\times\cdots\times\mu_m$, and let $X'$ be the vector of the first $m-1$ components of $X$. Assume that for some $R\geq1$, $L\geq16K$ and every $x\in\mathbb{R}$,

Proof. Define parameters $\tilde L:=L/4$ and $M:=L/(8K)$. Our goal is to show that
We consider two cases. First, assume that
$$\mathbb{P}\big\{X\in A\ \mbox{and}\ X_m\in[-\tilde L,\tilde L]\big\}\geq\big(1-\exp(-M^2)\big)\,\mathbb{P}\{X\in A\}.\qquad(19)$$
In this case, we essentially repeat the standard "induction method" argument employed in the proof of dimension-free subgaussian concentration on the cube. Let $x_b$ be a point in $[-\tilde L,\tilde L]$ such that $\mathbb{P}\{X'\in A(x_b)\}\geq\mathbb{P}\{X'\in A(x)\}$ for all $x\in[-\tilde L,\tilde L]$ (such a point $x_b$ exists since, by our assumption, $X'$ can take only finitely many values and hence $\{\mathbb{P}\{X'\in A(x)\}:\ x\in\mathbb{R}\}$ is a finite set). In view of Proposition 4.8,
where
Using the definition of $x_b$, the equation $\frac{16\tilde L^2}{L^2}=1$, and Remark 4.9 with parameters $Q:=1$ and $\kappa:=1/L^2$, we get
On the other hand, for all realizations of $X_m\notin[-\tilde L,\tilde L]$ we can crudely bound the function as
Combining the relations, we get
where the inequality follows from the definition of $h$ and the bound
where the last inequality follows since $2-x\leq\frac1x$ for any $x\in(0,1]$, and since $\mathbb{P}\big\{X\in A\ \mbox{and}\ X_m\in[-\tilde L,\tilde L]\big\}\geq\big(1-\exp(-M^2)\big)\,\mathbb{P}\{X\in A\}$ by (19). Next, we estimate the second summand inside the expectation in (21). We have
where $Z:=\exp(X_m^2/K^2)$ is a non-negative random variable satisfying $\mathbb{E}Z\leq2$ since $X_m$ is $K$-subgaussian. As $L\geq16K$, applying Hölder's and Markov's inequalities, we get
where the last inequality holds since $M\geq2$.
Using the previously obtained result for finitely supported measures and letting $\varepsilon\to0$, we derive the required statement for all open subsets of $\mathbb{R}^n$. Finally, approximating an arbitrary non-empty $A$ with open sets $B\supset A$, we get the result.

Remark 4.12. As we have already mentioned, our proof of the main result uses a modified convex distance, which is crucial for dealing with unbounded random variables. The second main feature of our approach, compared to the original argument of Talagrand, is that we estimate the product $\mathbb{P}\{X\in A\}\,\mathbb{E}\exp\big(({\rm dist}_c(X,A))^2/L(\delta)^2\big)$ from above by a quantity $1/\delta$ depending on $n$ and $t$, rather than by a universal constant. The parameter $\delta$ introduces the necessary additional flexibility.