Central moment inequalities using Stein's method

We derive explicit central moment inequalities for random variables that admit a Stein coupling, such as exchangeable pairs, size-bias couplings or local dependence, among others. The bounds are in terms of moments (not necessarily central) of variables in the Stein coupling, which are typically local in some sense, and therefore easier to bound. In cases where the Stein couplings have the kind of behaviour leading to good normal approximation, the central moments are closely bounded by those of a normal. We show how the bounds can be used to produce concentration inequalities, and compare them to those existing in related settings. Finally, we illustrate the power of the theory by bounding the central moments of sums of neighbourhood statistics in sparse Erdős–Rényi random graphs.


INTRODUCTION
Concentration inequalities are useful and powerful tools for estimating probabilities when exact computation is not possible. They have found important applications in modern statistics (Massart, 2007). Obtaining such inequalities has been an active area of probability for decades; see, for example, Boucheron, Lugosi, and Massart (2013) and the references there. The standard method for deriving strong concentration inequalities requires bounds on the moment generating function. For sums of independent random variables, this leads to the Hoeffding, Bennett and Chernoff inequalities. Such results generalize to martingales in various ways (McDiarmid, 1998), and to functions of independent random variables (Boucheron, Lugosi, and Massart, 2003). The research closest to this paper is concerned with showing concentration inequalities for random variables admitting various coupling constructions related to Stein's method. The first results and key ideas are due to Chatterjee (2007) and Chatterjee and Dey (2010), who worked under the assumption that it is possible to construct an exchangeable pair that is marginally distributed as the variable of interest, and has certain conditional moment boundedness properties. The ideas were extended to bounded size-bias couplings by Ghosh and Goldstein (2011a,b), and to size-bias couplings with a certain conditional monotonicity assumption by Cook, Goldstein, and Johnson (2017). A standard benchmark for normal approximation is the bounded Wasserstein distance, defined through the class $\mathcal{H}_{BW} := \{h : \mathbb{R} \to \mathbb{R} : \|h\|_\infty \le 1,\ \|h'\|_\infty \le 1\}$, which metrizes weak convergence. Given a Stein coupling $(W, W', G)$, and writing $D := W' - W$, it is immediate that the distance from $W$ to the normal can be bounded in terms of $G$ and $D$. Hence, to establish the accuracy of normal approximation, it suffices to show, for instance, that $\mathbb{E}\bigl|\mathbb{E}[GD \mid W] - \sigma^2\bigr| \le \varepsilon\sigma^2$ and that $\mathbb{E}|GD^2| \le \varepsilon\sigma^3$, for a suitable choice of $\varepsilon$.
In this paper, we show that the existence of approximate Stein couplings also yields bounds for the even central moments of $W$. To some extent, the bounds cover cases where the distribution of $W$ is not very close to being normal. To state our first theorem, we define $\|X\|_r := \{\mathbb{E}|X|^r\}^{1/r}$ for any random variable $X$ and $r \in \mathbb{N}$. Throughout the paper, we write $\mu := \mathbb{E}W$ and $\sigma^2 := \operatorname{Var}(W)$.
Theorem 2.1. Let $(W, W', G, R)$ be an approximate Stein coupling such that, for some constant $\varepsilon$ with $0 \le \varepsilon < 1$ and for some random variable $T \ge 0$,
\[ \mathbb{E}[R \mid W] \;\le\; \varepsilon|W - \mu| + T. \tag{2.2} \]
Let $k \in \mathbb{N}$, and suppose that $\|G\|_{2k} \le A$ and that $\|D\|_{2k} \le B$. Then $\|W - \mu\|_{2k}^2$ is bounded by a multiple of $(2k-1)AB/(1-\varepsilon)$, together with terms involving $\|T\|_{2k}$; in particular, for an exact Stein coupling,
\[ \|W - \mu\|_{2k} \;\le\; \sqrt{(2k-1)AB}\,\exp\bigl\{\sqrt{(2k-1)B/A}\bigr\}. \]
The statement of the theorem looks at first sight rather complicated. Before discussing it, we state a related result, with a slightly simpler bound, under the additional assumption that, for each $r \in \mathbb{N}$ of interest, $\mathbb{E}\{|W' - \mu|^r\} \le \mathbb{E}\{|W - \mu|^r\}$, together with an explicit statement about the smallness of $\|T\|_r$. The extra assumption is satisfied for exchangeable pairs, when $\mathcal{L}(W') = \mathcal{L}(W)$, as well as for the Stein coupling that we shall use for sums of independent random variables with mean zero. Note that, in this theorem, odd powers of $|W - \mu|$ are also allowed.
Theorem 2.2. Let $(W, W', G, R)$ be an approximate Stein coupling such that (2.2) is satisfied, and suppose that, for each $r \in \mathbb{N}$ of interest, $\mathbb{E}\{|W' - \mu|^r\} \le \mathbb{E}\{|W - \mu|^r\}$. Then $\|W - \mu\|_r$ is bounded by a multiple of $\sqrt{(r-1)AB}$, together with terms involving $\|T\|_r$; in particular, for an exact Stein coupling,
\[ \|W - \mu\|_r \;\le\; \sqrt{2(r-1)AB}. \]
Theorems 2.1 and 2.2 provide bounds for central moments of $W$, expressed simply in terms of the moments of the random variables $G$ and $D$, which, in practical circumstances, are easier to handle than $W$ itself. However, it is not at first sight clear how good the bounds are likely to be. To get some idea of this, consider the case of a Stein coupling, when $R = 0$. In this case, taking $f(w) = w - \mu$ in (2.1), it follows that $\sigma^2 = \mathbb{E}(GD)$. Thus, using the Cauchy–Schwarz and Hölder inequalities, we have
\[ \sigma^2 \;=\; \mathbb{E}(GD) \;\le\; \|G\|_2\|D\|_2 \;\le\; \|G\|_{2k}\|D\|_{2k} \;\le\; AB. \]
Hence Theorems 2.1 and 2.2 show that $\|W - \mu\|_{2k}$ is sandwiched between $\sigma$ and a multiple of $\{AB\}^{1/2}$, which is the upper bound for $\sigma$ obtained by replacing the 2-norms of $G$ and $D$ in the Cauchy–Schwarz inequality by their $2k$-norms. In many applications concerning asymptotics as the size $n$ of a problem increases, the distribution of $D$ remains more or less constant, while $\sigma^2$ grows like $n$. Since $\mathbb{E}(GD) = \sigma^2$, the distribution of $n^{-1}G$ also remains more or less constant. Hence $B$ can typically be chosen more or less constant in $n$, whereas $A$ is proportional to $n$, implying that $B/A = O(n^{-1})$. Thus the exponential factor in the second inequality of Theorem 2.1 is close to 1 unless $k$ is very large, making the constant multiplying $\{AB\}^{1/2}$ smaller than that in Theorem 2.2.
For example, for a sum $W := \sum_{i=1}^n X_i$ of independent and identically distributed random variables with zero mean, there is a Stein coupling $(W, W - X_I, -nX_I)$, where $I$ denotes a random variable with the uniform distribution on $\{1, 2, \dots, n\}$; in particular, $\varepsilon = 0$. Then $\sigma^2 = n\mathbb{E}(X_1^2)$ and $G$ both carry factors of $n$, whereas $\mathbb{E}(D^2) = \mathbb{E}(X_1^2)$ is constant in $n$. In particular, we can take $A$ and $B$ such that $AB = n\|X_1\|_{2k}^2$, as compared to $\sigma^2 = n\|X_1\|_2^2$, and, for fixed $k$, the bound on $\|W - \mu\|_{2k}$ in Theorem 2.1 is close to $\sqrt{n(2k-1)}\,\|X_1\|_{2k}$. This Stein coupling satisfies the extra condition of Theorem 2.2, giving the bound $\sqrt{2n(2k-1)}\,\|X_1\|_{2k}$. Note that, by considering $X_i = 2\xi_i - 1$ with $\xi_i \sim \mathrm{Bernoulli}(1/2)$, the factor $\sqrt{2k-1}$ is seen to be inevitable, being the rate of growth of $\|N\|_{2k}$ for $N$ standard normal.
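As a numerical sanity check of these rates (an illustration only, not part of the formal development): for Rademacher signs, $\|X_1\|_{2k} = 1$, so the Theorem 2.2 bound reads $\sqrt{2n(2k-1)}$, while $\|W\|_{2k} \ge \sigma = \sqrt n$ always. The following Python sketch, with our own helper name `central_moment_norm`, computes $\|W\|_{2k}$ exactly from the binomial distribution and checks the sandwich.

```python
from math import comb, sqrt

def central_moment_norm(n, k):
    """Exact ||W||_{2k} for W a sum of n independent Rademacher (+-1) signs.

    W = 2*Bin(n, 1/2) - n, so E|W|^{2k} is a finite binomial sum.
    """
    m = sum(comb(n, j) * (2 * j - n) ** (2 * k) for j in range(n + 1)) / 2 ** n
    return m ** (1 / (2 * k))

n, k = 100, 3
sigma = sqrt(n)                       # Var(W) = n, since E X_1^2 = 1
bound = sqrt(2 * n * (2 * k - 1))     # Theorem 2.2 bound, with ||X_1||_{2k} = 1
norm = central_moment_norm(n, k)
assert sigma <= norm <= bound         # the sandwich discussed in the text
```

For $n = 100$ and $k = 3$, the exact norm is about $15.7$, between $\sigma = 10$ and the bound $\sqrt{1000} \approx 31.6$.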
In the next theorem, under slightly more restrictive conditions, it is shown that the product $AB$ in the leading term can be replaced by $\sigma^2$. The extra condition, given in (2.4), would be expected to be satisfied with $\varepsilon_3$ small, if normal approximation were good. Thus, if normal approximation is good enough, the even central moments of $W$ cannot be too much bigger than those of the corresponding normal distribution.
Theorem 2.3. With the assumptions and notation of Theorem 2.1, suppose now that, for non-negative random variables $T_1$ and $T_2$ and non-negative constants $\varepsilon_i$, $1 \le i \le 3$, the conditions (2.3) and (2.4) hold. Then the product $AB$ in the leading term of the bound of Theorem 2.1 can be replaced by $\sigma^2$.
Supposing, in addition, that $\sigma^{-2}\|T_2\|_k \le \varepsilon_4$ with $\varepsilon_4$ small enough, we have a further refinement, Theorem 2.4, showing that the even central moments of $W$ are bounded by a quantity which, under typical asymptotics, is equivalent to the corresponding normal moment. To state the theorem, we define
\[ c_1(k) \;:=\; \sqrt{2k-1}\,/\,\|N\|_{2k}, \]
where, as before, $N$ denotes a standard normal random variable, noting that $c_1$ is increasing in $k$, with $c_1(1) = 1$ and $c_1(\infty) = \sqrt{e}$. We also set $h_k := 2^{-1/2}e^{5/2}\sigma^{-3}AB^2\sqrt{k-1}$.

CONCENTRATION INEQUALITIES FROM CENTRAL MOMENT BOUNDS
The bounds derived above can be used with Markov's inequality to show that the distribution of a random variable $W$ is concentrated about its mean, by starting from
\[ \mathbb{P}[|W - \mu| > t] \;\le\; t^{-2k}\,\mathbb{E}|W - \mu|^{2k} \;=\; t^{-2k}\|W - \mu\|_{2k}^{2k}, \tag{3.1} \]
and choosing $k$ carefully. Suppose that $(W_n,\ n \ge 1)$ is a sequence of random variables with means $\mu_n$ and variances $\sigma_n^2 \asymp n$. Then one weak form of concentration is to say that $W_n$ is concentrated about its mean $\mu_n$ on the scale $(d_n)_{n \ge 1}$ if, for any $r \ge 1$, there exist $K(r)$ and $c(r)$ such that $\mathbb{P}[|W_n - \mu_n| > c(r)d_n] \le K(r)n^{-r}$, uniformly in $n$. Under the assumptions of Theorem 2.1, and for an exact Stein coupling with $\|n^{-1}G\|_{2k} \le \alpha_{n,k}$ and $\|D\|_{2k} \le \beta_{n,k}$, we have
\[ \|W_n - \mu_n\|_{2k} \;\le\; \sqrt{n(2k-1)\alpha_{n,k}\beta_{n,k}}\,\exp\Bigl\{\tfrac{1}{\sqrt n}\sqrt{(2k-1)\beta_{n,k}/\alpha_{n,k}}\Bigr\}. \tag{3.2} \]
Thus, taking $k = k_n := \lceil\log n\rceil$ and
\[ d_n \;:=\; \sqrt{n(2k_n-1)\alpha_{n,k_n}\beta_{n,k_n}}\,\exp\Bigl\{\tfrac{1}{\sqrt n}\sqrt{(2k_n-1)\beta_{n,k_n}/\alpha_{n,k_n}}\Bigr\}, \tag{3.3} \]
the bound (3.1) with $t = c\,d_n$ becomes $c^{-2k_n} \le n^{-2\log c}$, which can be made smaller than $n^{-r}$, for any given $r$, by choosing $c = c(r)$ large enough. Thus, in this sense, Theorem 2.1 shows that $W_n$ is concentrated around $\mu_n$ on the scale $(d_n)_{n \ge 1}$, where $d_n$ is as in (3.3). As remarked earlier, it is often the case in such asymptotics that the distributions of $D$ and $n^{-1}G$ are more or less constant in $n$, in the sense that their tails are uniformly bounded in $n$. However, for any $n$, the norms $\|D\|_k$ and $\|n^{-1}G\|_k$ grow to infinity as $k$ increases, unless the random variables themselves are uniformly bounded. The following lemma is useful in determining how fast the norms grow with $k$; its proof is straightforward, using, for example, saddle point methods.
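The recipe above, Markov's inequality at $t = c\,d_n$ with $k_n = \lceil\log n\rceil$, can be checked numerically. The following Python sketch uses our own helper names `moment_bound` and `tail_bound`, with the exact-coupling moment bound $\sqrt{n(2k-1)\alpha\beta}\,\exp\{\sqrt{(2k-1)\beta/(n\alpha)}\}$ as discussed in the text.

```python
from math import ceil, exp, log, sqrt

def moment_bound(n, k, alpha, beta):
    """Exact-coupling bound on ||W_n - mu_n||_{2k}:
    sqrt(n(2k-1)*alpha*beta) * exp{ sqrt((2k-1)*beta/(n*alpha)) }."""
    return (sqrt(n * (2 * k - 1) * alpha * beta)
            * exp(sqrt((2 * k - 1) * beta / (n * alpha))))

def tail_bound(t, n, alpha, beta):
    """Markov: P[|W_n - mu_n| > t] <= (moment_bound / t)^{2k}, k = ceil(log n)."""
    k = ceil(log(n))
    return (moment_bound(n, k, alpha, beta) / t) ** (2 * k)

n, alpha, beta = 10_000, 1.0, 1.0
k = ceil(log(n))
d_n = moment_bound(n, k, alpha, beta)
# at t = c * d_n the bound is exactly c^{-2k} <= n^{-2 log c}: polynomial decay
c = 2.0
assert abs(tail_bound(c * d_n, n, alpha, beta) - c ** (-2 * k)) < 1e-9 * c ** (-2 * k)
assert tail_bound(c * d_n, n, alpha, beta) <= n ** (-2 * log(c)) * (1 + 1e-9)
```

Increasing $c$ sharpens the polynomial rate $n^{-2\log c}$, exactly as in the definition of concentration on the scale $(d_n)$.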
Lemma 3.1. Let $X$ be a random variable such that, for some $a, b, c > 0$,
\[ \mathbb{P}[|X| \ge x] \;\le\; c\,e^{-bx^a}, \qquad x > 0. \tag{3.4} \]
Then $\|X\|_k \le C(a, b, c)\,k^{1/a}$ for all $k \in \mathbb{N}$. If, instead, $a > 1$ and
\[ \mathbb{P}[|X| \ge x] \;\le\; c\,e^{-b(\log x)^a}, \qquad x > 1, \tag{3.5} \]
then $\|X\|_k \le C_1(a, b)\exp\{C_2(a, b)\,k^{1/(a-1)}\}$, for suitable constants $C_1(a, b)$ and $C_2(a, b)$.
Hence the quantity $d_n$ above is of order $O(\sqrt{n\log n})$ if both $D$ and $n^{-1}G$ are a.s. bounded, for all $n$, by the same constants $x_1$ and $x_2$. If both $D$ and $n^{-1}G$ have tails bounded as in (3.4), uniformly for all $n$, then $d_n = O((\log n)^{1/a + 1/2}\sqrt n)$; if they have tails bounded as in (3.5), uniformly for all $n$, and if $a > 2$, then $d_n = O(n^{1/2+\delta})$ for any $\delta > 0$.
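For a concrete instance of this norm growth (our own illustration): if $X \sim \mathrm{Exp}(1)$, then $\mathbb{E}X^k = k!$, so $\|X\|_k = (k!)^{1/k} \approx k/e$ grows linearly in $k$, matching the exponential-tail case $a = 1$. The helper name `lk_norm_exp` is ours; it uses `lgamma` to avoid overflow at large $k$.

```python
from math import e, exp, lgamma

def lk_norm_exp(k):
    """||X||_k for X ~ Exp(1): E X^k = k!, so ||X||_k = (k!)^{1/k}.

    Computed as exp(lgamma(k+1)/k) to stay within floating-point range."""
    return exp(lgamma(k + 1) / k)

# (k!)^{1/k} ~ k/e: linear growth in k, i.e. the a = 1 case of Lemma 3.1
for k in (1, 5, 20, 80):
    assert lk_norm_exp(k) <= k            # here the constant C = 1 suffices
assert abs(lk_norm_exp(200) / 200 - 1 / e) < 0.01
```

The same computation with sub-exponential tails would instead show stretched-exponential growth in $k$, which is why the scale $d_n$ degrades from $\sqrt{n\log n}$ to $n^{1/2+\delta}$.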
The classical large deviation bounds, such as the Chernoff bounds, deliver much smaller bounds for the probabilities of large deviations than those required for the weaker form of concentration discussed above. We now show that our moment bounds can also deliver analogous results, if values of k n larger than log n are taken.
Under the conditions of Theorem 2.1, we can invoke (3.2), and choose $k = k_n(t)$ to make the principal factor
\[ t^{-2k}\bigl\{n(2k-1)\|n^{-1}G\|_{2k}\|D\|_{2k}\bigr\}^{k} \]
small, for fixed choice of $t$. In particular, we have the following corollary for a.s. bounded $n^{-1}G$ and $D$.
Corollary 3.2. Under the assumptions of Theorem 2.1, for an exact Stein coupling such that $|D| \le x_1$ and $|n^{-1}G| \le x_2$ a.s. for all $n$, we have, for any $t \ge \sqrt{2nx_1x_2e}$,
\[ \mathbb{P}[|W - \mu| \ge t] \;\le\; C\exp\Bigl\{-\frac{t^2}{2e\,nx_1x_2}\Bigr\}, \tag{3.6} \]
for an explicit constant $C$ not depending on $n$ or $t$. The corollary gives good bounds as long as $t \ll n$. However, the factor $e$ in the denominator makes the exponent smaller than that in the Chernoff bound; we return to this point later.
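The choice of $k$ behind bounds of this type is elementary calculus: minimizing the leading Markov factor $t^{-2k}(2k\,nx_1x_2)^k$ over real $k$ (with $2k-1$ replaced by $2k$ for convenience) gives $k^* = t^2/(2e\,nx_1x_2)$ and minimum value $\exp\{-t^2/(2e\,nx_1x_2)\}$, which is where the factor $e$ in the denominator comes from. A quick numerical confirmation, with our own helper name `markov_factor`:

```python
from math import e, exp, log

def markov_factor(k, t, n, x1, x2):
    """Leading Markov-bound factor t^{-2k} * (2k * n * x1 * x2)^k, continuous k."""
    return exp(k * (log(2 * k * n * x1 * x2) - 2 * log(t)))

n, x1, x2, t = 10_000, 1.0, 1.0, 500.0     # note t >= sqrt(2*n*x1*x2*e) here
k_star = t ** 2 / (2 * e * n * x1 * x2)    # calculus-optimal k
best = exp(-t ** 2 / (2 * e * n * x1 * x2))
assert abs(markov_factor(k_star, t, n, x1, x2) - best) < 1e-9 * best
# nearby choices of k do no better: the factor is convex in k with minimum at k*
for k in (0.5 * k_star, 0.9 * k_star, 1.1 * k_star, 2 * k_star):
    assert markov_factor(k, t, n, x1, x2) >= best
```

In practice $k$ must be an integer, which costs at most a bounded multiplicative factor.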
If we only have control over the tails of $|D|$ and $|n^{-1}G|$, we can obtain the following.
Corollary 3.3. Under the assumptions of Theorem 2.1, for an exact Stein coupling such that the tails of $|D|$ and $|n^{-1}G|$ satisfy (3.4), uniformly in $n$, we have
\[ \mathbb{P}[|W - \mu| \ge t] \;\le\; K_1\exp\bigl\{-K_2(t^2/n)^{a/(a+2)}\bigr\}, \qquad t \ge K_3\sqrt n, \]
for suitable constants $K_1$, $K_2$ and $K_3$.
The power of t in the exponent is now no longer as large as 2, though it approaches 2 as a increases. On the other hand, the bound still gives useful results, provided that t ≪ n.
If the conditions of Theorem 2.4 are satisfied, better bounds can be obtained.
Corollary 3.4. Under the assumptions of Theorem 2.4, taking $k = k(y) := \lceil y^2/2\rceil$, we obtain a bound on $\mathbb{P}[|W - \mu| \ge y\sigma]$ whose leading factor is $e^{-y^2/2}$.
For an exact Stein coupling, the quantities $\varepsilon_1$ and $\varepsilon_2$ are zero. In most applications, they, and $\varepsilon_3$ and $\varepsilon_4$, can be expected to depend on $n$ as a power $n^{-\alpha}$, for a suitable $\alpha > 0$ (often $\alpha = 1/2$). However, $\varepsilon_2$ and $\varepsilon_4$ also involve $k$-norms of the error random variables $T_1$ and $T_2$, and the quantity $h_k$ also involves $k$-norms of $D$ and $n^{-1}G$. Assuming that their tails are bounded as in (3.4), these norms can be dealt with much as above, resulting in powers of $k$ as factors; so long as $y \ll n^\beta$ for a suitably small index $\beta$, the bound above is then useful. In particular, if $|D|$, $|n^{-1}G|$, $n^{-1/2}|T_1|$ and $n^{-1}|T_2|$ are uniformly bounded, then good bounds are obtained for $y \ll n^{1/4}$, equivalent to deviations of $W_n$ from its mean of order $o(n^{3/4})$. This is not as good a range as for the Chernoff bounds, but the main exponent is ideal.

APPLICATIONS
Here we use Theorems 2.1–2.4 in some applications.

Sums of independent random variables
Let $X_1, \dots, X_n$ be independent random variables with zero means, let $I$ be uniformly distributed on $\{1, 2, \dots, n\}$ and independent of the $X_i$, and set $W := \sum_{i=1}^n X_i$ and $W' := W - X_I$. Then, for $G := -nX_I$, as mentioned above, it is easily checked that $(W, W', G)$ is a Stein coupling; see Chen and Röllin (2010). Note too that, for $D := W' - W = -X_I$, we have $G = nD$, and so we can take $A = nB$, with $B := \|X_I\|_{2k}$. Writing $\rho_k := \|X_I\|_{2k}/\|X_I\|_2$, and noting that $\|X_I\|_2 = \sigma/\sqrt n$, the bound from Theorem 2.1 becomes
\[ \|W - \mu\|_{2k} \;\le\; \sqrt{2k-1}\,\rho_k\,\sigma\,\exp\bigl\{\sqrt{(2k-1)/n}\bigr\}. \]
The theorems that give sharper bounds rely on establishing (2.4). For the coupling above, writing $\sigma_i^2 := \mathbb{E}X_i^2$, the simplest version is obtained by taking $\varepsilon_3 = 0$ and an explicit choice of $T_2$; but this yields a result that is not as clean as the following one, which we derive by a slightly different argument.
To interpret the proposition, note that, if there is uniform control over the tails of the $X_i$, as in Lemma 3.1, then, for fixed $k$, $h'_k \to 0$ as $n \to \infty$. Furthermore, even for the sub-exponential tails given by (3.5), $h'_k \to 0$ for $k$ growing as $(\alpha\log n)^\beta$, for $0 < \beta < a - 1$.

Local dependence
Let $X_1, \dots, X_n$ be random variables with zero means, and suppose that, for each $i$, there is a neighbourhood $N_i \subseteq \{1, \dots, n\}$ with $i \in N_i$, such that $X_i$ is independent of $\{X_j : j \notin N_i\}$. Set $W := \sum_{i=1}^n X_i$ and $W' := W - \sum_{j \in N_I} X_j$, where $I$ is uniformly distributed on $\{1, \dots, n\}$, and set $G := -nX_I$. Then it is easily checked that $(W, W', G)$ is an exact Stein coupling; see also Chen and Röllin (2010, Construction 2A). Here, $D := W' - W = -\sum_{j \in N_I} X_j$. Thus we can apply the bound of Theorem 2.1 with $A := n\max_i\|X_i\|_{2k}$ and $B := \bigl\|\sum_{j \in N_I} X_j\bigr\|_{2k}$. If, for example, $d := \max_i|N_i|$ and $m := \max_i\|X_i\|_{2k}$, then, using Minkowski's inequality, we could also take $A = nm$ and $B = dm$, and Theorem 2.1 implies that
\[ \|W - \mu\|_{2k} \;\le\; m\sqrt{(2k-1)nd}\,\exp\bigl\{\sqrt{(2k-1)d/n}\bigr\}. \]
Scan statistics furnish standard examples of this kind. If $Y_1, Y_2, \dots, Y_n$ are independent random variables and, for $j = 1, \dots, n - \ell + 1$, we define $X_j := \varphi_j(Y_j, \dots, Y_{j+\ell-1}) - \mathbb{E}\varphi_j(Y_j, \dots, Y_{j+\ell-1})$, where $\varphi_j : \mathbb{R}^\ell \to \mathbb{R}$, then $W = \sum_{j=1}^{n-\ell+1} X_j$ satisfies the hypotheses above with $d = 2\ell + 1$. In particular, if the $Y_i$ are Bernoulli and $\varphi_j(y_j, \dots, y_{j+\ell-1}) = (1 - y_j)y_{j+1}\cdots y_{j+\ell-1}$, then $W$ is the centred number of head runs of length at least $\ell - 1$; here we can set $m = 2$.
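For small $n$, the head-run example can be checked by exhaustive enumeration. The sketch below (our own helper `head_run_moment`) computes the exact central-moment norm of the head-run count for fair coins and compares it with the leading term $m\sqrt{(2k-1)nd}$ of the Theorem 2.1 bound, read off from the choices $A = nm$, $B = dm$, using the values $d = 2\ell+1$ and $m = 2$ given in the text.

```python
from itertools import product
from math import sqrt

def head_run_moment(n, ell, k):
    """Exact ||W - mu||_{2k} for W = number of head-run starts,
    X_j = (1 - y_j) * y_{j+1} * ... * y_{j+ell-1}, fair coins, by enumeration."""
    vals = []
    for y in product((0, 1), repeat=n):
        w = sum((1 - y[j]) * all(y[j + 1:j + ell]) for j in range(n - ell + 1))
        vals.append(w)
    mu = sum(vals) / len(vals)
    m2k = sum((w - mu) ** (2 * k) for w in vals) / len(vals)
    return m2k ** (1 / (2 * k))

n, ell, k = 12, 3, 2
d, m = 2 * ell + 1, 2                      # values suggested in the text
bound = m * sqrt((2 * k - 1) * n * d)      # leading term of the Theorem 2.1 bound
assert head_run_moment(n, ell, k) <= bound
```

The exact norm (about $1.7$ here) is far below the generic bound (about $31.7$), consistent with the remark that sharper bounds follow from bounding $B$ more carefully.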
Sharper bounds may in general be obtained by bounding B more carefully.

Size bias couplings
For a random variable $W \ge 0$ with $\mathbb{E}W = \mu < \infty$, we say that $W^s$ has the size-bias distribution of $W$ if
\[ \mathbb{E}\{Wf(W)\} \;=\; \mu\,\mathbb{E}\{f(W^s)\} \]
for all functions $f$ such that the right hand side is well defined. If $(W, W^s)$ is a coupling of a distribution with its size-biased distribution, and we define $W' := W^s$ and $G := \mu$, then $(W, W', G)$ is a Stein coupling. Theorem 2.1 easily applies in this setting with $D := W^s - W$, so that we can take $A = \mu$ and $B = \|W^s - W\|_{2k}$. If, for example, $|W^s - W| \le c$ for some constant $c$, then $A = \mu$ and $B \le c$, and in this case the bound (3.6) becomes
\[ \mathbb{P}[|W - \mu| \ge t] \;\le\; C\exp\Bigl\{-\frac{t^2}{2e\,\mu c}\Bigr\}. \tag{4.2} \]
This is to be compared with the best known bounds, those of Arratia and Baxendale (2015). A slightly weaker version of their bound, under these hypotheses, is
\[ \mathbb{P}[|W - \mu| \ge t] \;\le\; C\exp\Bigl\{-\frac{t^2}{2\mu c}\Bigr\}, \]
which is better than (4.2), because it does not have the factor $e$ in the denominator of the exponent. If Theorem 2.4 can be applied, as in (3.7), the leading term in the exponent improves to $-t^2/(2\sigma^2)$, at least for $t \le \sigma^{1+\beta}$ for some $\beta > 0$. This is now typically better than $-t^2/(2\mu c)$, since $\sigma^2 = \mu\,\mathbb{E}D$ and $\mathbb{E}D \le c$, with equality only if $\mathcal{L}(a_1W) = \mathrm{Po}(a_2)$ for some $a_1, a_2 > 0$. Our approach can also be used even when the random variable $|W^s - W|$ is not uniformly bounded, which is a difficult problem for exponential concentration inequalities; see Ghosh, Goldstein, and Raič (2011).
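The defining identity $\mathbb{E}\{Wf(W)\} = \mu\,\mathbb{E}\{f(W^s)\}$ can be verified by exact enumeration for the classical size-bias coupling of a sum of independent Bernoulli variables: choose index $i$ with probability $p_i/\mu$ and set $Y_i = 1$. The helper name `exact_check` below is our own.

```python
from itertools import product

def exact_check(ps, f):
    """Compare E[W f(W)] with mu * E[f(W^s)] for W a sum of independent
    Bernoulli(p_i), with W^s = W_i + 1 and index i chosen w.p. p_i / mu."""
    n = len(ps)
    mu = sum(ps)

    def prob(y):  # P[(Y_1,...,Y_n) = y]
        r = 1.0
        for p, yi in zip(ps, y):
            r *= p if yi else 1 - p
        return r

    lhs = sum(prob(y) * sum(y) * f(sum(y)) for y in product((0, 1), repeat=n))
    # E[f(W^s)] = sum_i (p_i / mu) * E[f(W_i + 1)], with W_i = W - Y_i
    rhs = 0.0
    for i, p in enumerate(ps):
        for y in product((0, 1), repeat=n):
            if y[i] == 0:  # enumerate the other coordinates only
                prob_rest = 1.0
                for j, (pj, yj) in enumerate(zip(ps, y)):
                    if j != i:
                        prob_rest *= pj if yj else 1 - pj
                rhs += (p / mu) * prob_rest * f(sum(y) + 1)
    return lhs, mu * rhs

lhs, rhs = exact_check([0.2, 0.5, 0.7], lambda w: w * w)
assert abs(lhs - rhs) < 1e-12
```

In this coupling $|W^s - W| \le 1$, so the discussion above applies with $c = 1$.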

Local neighbourhood statistics of Erdős-Rényi random graphs
Let $\mathcal{G}_n$ be the set of simple, undirected graphs on $n$ vertices with labels $[n] := \{1, \dots, n\}$, and let $G_n$ be an Erdős–Rényi random graph on $[n]$, with edge probability $p := \lambda/n$. Fix $r \in \mathbb{N}$, and, for $G \in \mathcal{G}_n$ and each $i = 1, \dots, n$, let $N_r(i, G)$ be the "$r$-neighbourhood" of $i$, consisting of the vertex-labelled subgraph induced by the vertices at distance no greater than $r$ from vertex $i$ in $G$; note that this includes the edges between vertices at graph distance exactly $r$ from vertex $i$. For each $i = 1, \dots, n$, let $U_i$ be a real-valued function on all vertex-labelled graphs on at most $n$ vertices, whose vertex labels are subsets of $[n]$ containing the label $i$, and set $X_i := U_i(N_r(i, G_n))$ and $W := \sum_{i=1}^n X_i$. Concrete examples with $r = 1$ are given by taking $U_i$ to be the degree of vertex $i$, or the number of triangles containing vertex $i$.
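Extracting the vertex set of $N_r(i, G)$ is a plain breadth-first search; the edges of the induced subgraph are then read off among those vertices. A minimal sketch, with our own helper name `r_ball` and adjacency stored as a dict of neighbour sets:

```python
from collections import deque

def r_ball(adj, i, r):
    """Vertices within graph distance r of vertex i (breadth-first search).

    `adj` maps each vertex to the set of its neighbours."""
    dist = {i: 0}
    q = deque([i])
    while q:
        v = q.popleft()
        if dist[v] == r:       # do not expand beyond radius r
            continue
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                q.append(u)
    return set(dist)

# path graph 0 - 1 - 2 - 3 - 4: the 2-ball around vertex 0 is {0, 1, 2}
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
assert r_ball(path, 0, 2) == {0, 1, 2}
assert r_ball(path, 2, 1) == {1, 2, 3}
```

For sparse $p = \lambda/n$, these balls are small with high probability, which is the key to the moment bounds below.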
For sparse Erdős–Rényi random graphs, $r$-neighbourhoods are small with high probability. Hence, as long as the $U_i$ are well behaved, $\|D\|_{2k}$ should be of a good order. We illustrate this principle in the following theorem. For a graph $G$, let $V(G)$ denote the vertex set, and $|V(G)|$ the number of vertices.
Theorem 4.2. Fix $r \in \mathbb{N}$ and $\beta \ge 0$, and, for $i = 1, \dots, n$, let $U_i$ be a graph function as above such that, for some positive constant $c$ and any graph $G$ in its domain,
\[ |U_i(G)| \;\le\; c\,|V(G)|^\beta. \tag{4.3} \]
Fix $\lambda > 0$, let $G_n$ be an Erdős–Rényi random graph on $[n]$ with edge probability $p := \lambda/n$, and define $X_i := U_i(N_r(i, G_n))$ and $W := \sum_{i=1}^n X_i$. Then there is a constant $C(r, \beta)$, which can be explicitly deduced from (5.12) and (5.18), bounding the central moments $\|W - \mathbb{E}W\|_{2k}$. For $U_i$ the degree of $i$, we can take $c = 1$ and $\beta = 1$ in (4.3). For $U_i$ the number of triangles containing $i$, we can take $c = 1/2$ and $\beta = 2$.

PROOFS
In this section we prove the previous results.

Proofs for general inequalities
Proof of Theorem 2.1. For $f(w) = (w - \mu)^{2k-1}$, using (2.1), we have
\[ \mathbb{E}\{(W - \mu)^{2k}\} \;=\; \mathbb{E}\{G(f(W') - f(W))\} + \mathbb{E}\{Rf(W)\}. \]
Furthermore, by the binomial theorem, we have
\[ f(W') - f(W) \;=\; \sum_{j=1}^{2k-1}\binom{2k-1}{j}(W - \mu)^{2k-1-j}D^j. \]
Now, setting $x := \|W - \mu\|_{2k}^2$ and using the triangle inequality, it follows that $x^k$ is bounded by the sum of the absolute expectations of the resulting terms. Using Hölder's inequality, and because $\|G\|_{2k} \le A$ and $\|D\|_{2k} \le B$, this gives
\[ x^k \;\le\; \sum_{j=1}^{2k-1}\binom{2k-1}{j}AB^jx^{(2k-1-j)/2} + \varepsilon x^k + \|T\|_{2k}\,x^{(2k-1)/2}. \]
Rearranging, we thus have
\[ (1 - \varepsilon)x^k \;\le\; \sum_{j=1}^{2k-1}\binom{2k-1}{j}AB^jx^{(2k-1-j)/2} + \|T\|_{2k}\,x^{(2k-1)/2}. \tag{5.2} \]
The solutions $x$ of (5.2) that also satisfy $x > (2k-1)AB/(1-\varepsilon)$ are seen, by substituting this bound into the right hand side of (5.2), to satisfy (5.3). Since, expanding the power, inequality (5.3) is satisfied by all $x \le (2k-1)AB/(1-\varepsilon)$, it follows that all solutions of (5.2) satisfy (5.3), proving the first inequality. The second follows from the fact that, for $t \ge 0$ and $\gamma \ge 0$, $(1+t)^\gamma - 1 \le \gamma te^{\gamma t}$.
Proof of Theorem 2.3. Much as in the proof of Theorem 2.1, we begin by expanding $\mathbb{E}\{(W - \mu)^{2k}\}$ by means of (2.1) and the binomial theorem. Hence, using the extra assumption, we obtain (5.4). Defining $x := \|W - \mu\|_{2k}^2 \ge \sigma^2$, and bounding the final sum as in the proof of Theorem 2.1, we deduce a polynomial inequality for $x$. Using the fact that, for $a \ge 2$ and $y > 0$,
\[ (1 + y)^a - 1 - ay \;\le\; \tfrac12 a(a-1)y^2(1 + y)^{a-2} \;\le\; \tfrac12 a(a-1)y^2e^{ay}, \tag{5.5} \]
and considering first the case $x \ge (2k-1)\sigma^2$, the claimed bound follows.
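The elementary inequality used at (5.5), namely $(1+y)^a - 1 - ay \le \tfrac12 a(a-1)y^2(1+y)^{a-2} \le \tfrac12 a(a-1)y^2e^{ay}$ for $a \ge 2$ and $y > 0$, admits a quick numerical sanity check; `taylor_gap` is our own helper name.

```python
from math import exp

def taylor_gap(a, y):
    """(1 + y)^a - 1 - a*y, the second-order Taylor remainder bounded in (5.5)."""
    return (1 + y) ** a - 1 - a * y

for a in (2.0, 3.5, 10.0):
    for y in (1e-3, 0.1, 1.0, 3.0):
        mid = 0.5 * a * (a - 1) * y ** 2 * (1 + y) ** (a - 2)
        right = 0.5 * a * (a - 1) * y ** 2 * exp(a * y)
        # tiny tolerances guard against floating-point rounding at equality
        assert taylor_gap(a, y) <= mid + 1e-9
        assert mid <= right + 1e-9
```

For $a = 2$ the first inequality is an identity, $(1+y)^2 - 1 - 2y = y^2$, which is why a small tolerance is needed in floating point.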
Proof of Theorem 2.2. First, take $r = 2k$ with $k \in \mathbb{N}$. Then, using (2.1) with $f(w) = (w - \mu)^{2k-1}$, and invoking the assumption (2.2) and Hölder's inequality, we obtain a bound in which, for this choice of $f$, the difference $f(W') - f(W)$ is controlled by the mean value theorem. Hence, by Hölder's inequality, and because $\mathbb{E}\{|W' - \mu|^r\} \le \mathbb{E}\{|W - \mu|^r\}$, the theorem is proved for $r = 2k$. For $r = 2k + 1$, take $f(w) = (w - \mu)^{2k}\operatorname{sgn}(w - \mu)$; the remainder of the argument is the same.
Proof of Theorem 2.4. We begin from (5.4), defining $x_k := \|W - \mu\|_{2k}^2$. Bounding as before, and using the Taylor expansion argument (5.5) together with the fact that $\sigma \ge B\sqrt{e(2k-1)}$, we obtain a recursive inequality relating $x_l$ to $x_{l-1}$. Using the same argument, and because $h_k$ is increasing in $k$, the recursion holds for each $2 \le l \le k$. Iterating this inequality, starting with $l = k$ and working downwards, proves the theorem.

Independent sums
Proof of Proposition 4.1. As in the proof of Theorem 2.1 (but with $R = 0$), we begin with the expansion (5.8) of $\mathbb{E}\{(W - \mu)^{2k}\}$. For the first term in the sum, by the independence of $X_i$ and $W_i := W - X_i$, and by the mean value theorem, the leading contribution can be identified; it is then bounded using the Hölder and Jensen inequalities. The remaining terms in the sum in (5.8) are bounded using Taylor's theorem. Arguing as in the proof of Theorem 2.4 completes the proof.

Sums of local statistics of sparse Erdős-Rényi random graphs
Proof of Theorem 4.2. To obtain moment bounds for $W$, we define a Stein coupling. Let $(E'_{il},\ 1 \le i < l \le n)$ be independent indicators, independent also of $G_n$, each with $\mathbb{E}E'_{il} = p$. Given $G_n$, define the random graph $G^{(i)}$ by replacing a suitable set of the edge indicators of $G_n$ by the corresponding $E'_{il}$, and let $W^{(i)}$ denote the value of $W$ computed on $G^{(i)}$. Finally, let $J$ be uniform on the set $\{1, \dots, n\}$, independent of the random objects above, and define $W' := W^{(J)}$ and $G := -n(X_J - \mathbb{E}X_J)$. After noting that $X_j$ is independent of $W^{(j)}$, it is easy to see that $(W, W', G)$ is an exact Stein coupling. Moreover, $\mathcal{L}(W') = \mathcal{L}(W)$, so that the central moments of $W$ and $W'$ are equal, and hence both of Theorems 2.1 and 2.2 apply, with $\varepsilon = \varepsilon' = 0$; we use Theorem 2.2.
Lemma 5.1. Let $I$ be a finite index set, let $\mathcal{E}$ be a random (possibly empty) subset of $I$, and set $E := |\mathcal{E}|$. Let $\{Y_i\}_{i \in I}$ be a collection of random variables independent of $\mathcal{E}$ and, for $\ell \in \mathbb{N}$, let $\max_{i \in I}\|Y_i\|_\ell \le y$. Then $\bigl\|\sum_{i \in \mathcal{E}} Y_i\bigr\|_\ell \le y\,\|E\|_\ell$.
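Lemma 5.1 can be verified exactly on small finite examples; the following sketch (our own helper `lemma51_check`) enumerates a random subset $\mathcal{E}$, given as (set, probability) pairs, and finitely supported $Y_i$, and compares the two sides.

```python
from itertools import product

def lemma51_check(subsets, dists, ell):
    """Exact check of || sum_{i in E} Y_i ||_ell <= y * || |E| ||_ell for a
    random subset E independent of the Y_i, each Y_i finitely supported with
    distribution dists[i] = [(value, prob), ...]."""
    lhs_pow = 0.0
    for E, pE in subsets:
        for combo in product(*dists):          # one (value, prob) per coordinate
            p = pE
            s = 0.0
            for i, (v, q) in enumerate(combo):
                p *= q
                if i in E:
                    s += v
            lhs_pow += p * abs(s) ** ell
    y = max(sum(q * abs(v) ** ell for v, q in d) for d in dists) ** (1 / ell)
    rhs_pow = sum(pE * len(E) ** ell for E, pE in subsets)
    return lhs_pow ** (1 / ell), y * rhs_pow ** (1 / ell)

subsets = [(set(), 0.2), ({0}, 0.3), ({0, 1, 2}, 0.5)]
dists = [[(1.0, 0.5), (-1.0, 0.5)], [(2.0, 0.25), (0.0, 0.75)], [(0.5, 1.0)]]
lhs, rhs = lemma51_check(subsets, dists, 4)
assert lhs <= rhs + 1e-12
```

The slack between the two sides reflects the conditional Minkowski step in the proof.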
Lemma 5.3. Fix $n, r \in \mathbb{N}$ and $p \in [0, 1]$, and let $N_r$ be the number of vertices in the $r$-neighbourhood of a given vertex in an Erdős–Rényi random graph on $n$ vertices with edge probability parameter $p$. Then $\|N_r\|_\ell$ admits a bound in terms of $A(x, \ell)$, defined in Lemma 5.2.
Proof. Let $(Z_0, Z_1, Z_2, \dots)$ be the sequence of generation sizes of a Galton–Watson branching process with offspring distribution $\mathrm{Bi}(n, p)$, started from a single individual, so that $Z_0 = 1$. Then, using the usual exploration process coupling, $\sum_{s=0}^r Z_s$ stochastically dominates $N_r$, so that $\|N_r\|_\ell \le \bigl\|\sum_{s=0}^r Z_s\bigr\|_\ell$ for any $r \ge 0$ and $\ell \in \mathbb{N}$. To bound $\|Z_s\|_\ell$, we use the random sum representation of the branching process and repeatedly apply Lemma 5.1, to find
\[ \|Z_s\|_\ell \;\le\; \|B\|_\ell\,\|Z_{s-1}\|_\ell \;\le\; \|B\|_\ell^{\,s}, \]
where $B \sim \mathrm{Bi}(n, p)$. The result now easily follows from Lemma 5.2, and because $A(x, \ell) \ge 2$ for all $x > 0$ and $\ell \in \mathbb{N}$.
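The random-sum step $\|Z_s\|_\ell \le \|B\|_\ell\,\|Z_{s-1}\|_\ell$, and hence $\|Z_s\|_\ell \le \|B\|_\ell^s$, can be verified exactly for a tiny Galton–Watson process; `binom_moment` and `z2_moment` are our own helpers, computing moments by direct enumeration for $\mathrm{Bi}(2, p)$ offspring.

```python
from math import comb

def binom_moment(n, p, ell):
    """E[Bin(n, p)^ell], computed exactly by enumeration."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) * j**ell for j in range(n + 1))

def z2_moment(p, ell):
    """E[Z_2^ell] for a Galton-Watson process with Bi(2, p) offspring, Z_0 = 1:
    conditionally on Z_1 = m, Z_2 ~ Bi(2m, p)."""
    return sum(
        comb(2, m) * p**m * (1 - p)**(2 - m) * binom_moment(2 * m, p, ell)
        for m in range(3)
    )

p, ell = 0.6, 3
b_norm = binom_moment(2, p, ell) ** (1 / ell)      # ||B||_ell, B ~ Bi(2, p)
z2_norm = z2_moment(p, ell) ** (1 / ell)
assert z2_norm <= b_norm ** 2 + 1e-12              # ||Z_2||_ell <= ||B||_ell^2
```

Here $\|Z_2\|_3 \approx 2.08$ against $\|B\|_3^2 \approx 2.24$, so the geometric bound is reasonably tight even at two generations.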