Measure Concentration for Compound Poisson Distributions

We give a simple development of the concentration properties of compound Poisson measures on the nonnegative integers. A new modification of the Herbst argument is applied to an appropriate modified logarithmic-Sobolev inequality to derive new concentration bounds. When the measure of interest does not have finite exponential moments, these bounds exhibit optimal polynomial decay. Simple new proofs are also given for earlier results of Houdré (2002) and Wu (2000).


Introduction
Concentration of measure is a well-studied phenomenon, and in the past 30 years or so it has been explored through a wide array of tools and techniques; [14] [11] [12] offer broad introductions. Results in this area are motivated as much by theoretical questions (in areas such as geometry, functional analysis and probability) as by numerous applications in different fields, including the analysis of algorithms, mathematical physics and empirical processes in statistics.
From the probabilistic point of view, measure concentration describes situations where a random variable is strongly concentrated around a particular value. This is typically quantified by the rate of decay of the probability that the random variable deviates from that value (usually its mean or median) by a certain amount. As a simple concrete example, consider a function $f(W)$ of a Poisson($\lambda$) random variable $W$; if $f : \mathbb{Z}_+ \to \mathbb{R}$ is 1-Lipschitz, i.e., $|f(k) - f(k+1)| \le 1$ for all $k \in \mathbb{Z}_+ = \{0, 1, 2, \ldots\}$, then [2], Although the distribution of $f(W)$ may be quite complex, (1) provides a simple, explicit bound on the probability that it deviates from its mean by an amount $t$. This is a general theme: Under appropriate conditions, it is possible to derive useful, accurate bounds of this type for a large class of random variables with complex and often only partially known distributions. We also note that the consideration of Lipschitz functions is motivated by applications, but it is also related to fundamental concentration properties captured by isoperimetric inequalities [11].
The bound (1) was established in [2] using the so-called "entropy method," pioneered by Ledoux [9][10] [11]. The entropy method consists of two steps. First, a (possibly modified) logarithmic-Sobolev inequality is established for the distribution of interest. Recall that, for an arbitrary probability measure $\mu$ and any nonnegative function $f$ on the same space, the entropy functional $\mathrm{Ent}_\mu(f)$ is defined by $\mathrm{Ent}_\mu(f) = \int f \log f \, d\mu - \left(\int f \, d\mu\right) \log\left(\int f \, d\mu\right)$, whenever all the above integrals exist. In the case of the Poisson, Bobkov and Ledoux [2] proved the following modified log-Sobolev inequality: Writing $P_\lambda$ for the Poisson($\lambda$) measure, for any function $f : \mathbb{Z}_+ \to \mathbb{R}$ with positive values, where $Df(k) = f(k+1) - f(k)$, $k \ge 0$, is the discrete gradient, and $E_\mu$ denotes the expectation operator with respect to a measure $\mu$. In fact, they also established the following sharper bound which we will use below; for any function $f$ on $\mathbb{Z}_+$, The second step in the entropy method is the so-called Herbst argument: Starting from some Lipschitz function $f$, the idea is to use the modified log-Sobolev inequality to obtain an upper bound on the entropy of $e^{\tau f}$, and from that to deduce a differential inequality for the moment-generating function $G(\tau) = E[e^{\tau f}]$ of $f$. Then, solving the differential inequality yields an upper bound on $G(\tau)$, and this leads to a concentration bound via Markov's inequality.
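As a small numerical illustration of the entropy functional recalled above, the following Python sketch evaluates $\mathrm{Ent}_\mu(f)$ for a Poisson($\lambda$) measure truncated to a finite range. The function names, the truncation level and the test functions are illustrative assumptions and are not taken from [2]; the sketch only checks three elementary properties of the functional (positivity for a non-constant $f$, vanishing on constants, and homogeneity $\mathrm{Ent}_\mu(cf) = c\,\mathrm{Ent}_\mu(f)$).

```python
import math

def poisson_pmf(lam, n_max):
    """Poisson(lam) probabilities on {0, ..., n_max}, renormalised after truncation."""
    probs = [math.exp(-lam)]
    for k in range(1, n_max + 1):
        probs.append(probs[-1] * lam / k)
    total = sum(probs)
    return [p / total for p in probs]

def entropy_functional(pmf, f):
    """Ent_mu(f) = E[f log f] - E[f] log E[f] for a strictly positive function f."""
    values = [f(k) for k in range(len(pmf))]
    mean = sum(p * v for p, v in zip(pmf, values))
    mean_f_log_f = sum(p * v * math.log(v) for p, v in zip(pmf, values))
    return mean_f_log_f - mean * math.log(mean)

# Hypothetical example: lambda = 2, truncation at 60, f(k) = 1 + k.
pmf = poisson_pmf(2.0, 60)
ent = entropy_functional(pmf, lambda k: 1.0 + k)
ent_const = entropy_functional(pmf, lambda k: 3.0)            # constants have zero entropy
ent_scaled = entropy_functional(pmf, lambda k: 5.0 * (1.0 + k))  # Ent(c f) = c Ent(f)
```

Positivity follows from Jensen's inequality applied to the convex function $x \log x$, which is why the functional behaves like a measure of the variability of $f$ under $\mu$.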
Our main goal in this work is to carry out a similar program for an arbitrary compound Poisson measure on $\mathbb{Z}_+$. Recall that for any $\lambda > 0$ and any probability measure $Q$ on the natural numbers $\mathbb{N} = \{1, 2, \ldots\}$, the compound Poisson distribution $\mathrm{CP}(\lambda, Q)$ is the distribution of the random sum $Z = \sum_{i=1}^{W} X_i$, where $W \sim$ Poisson($\lambda$) and the $X_i$ are independent random variables with distribution $Q$ on $\mathbb{N}$, also independent of $W$; we denote the $\mathrm{CP}(\lambda, Q)$ measure by $\mathrm{CP}_{\lambda,Q}$. The class of compound Poisson distributions is much richer than the one-dimensional Poisson family. In particular, the $\mathrm{CP}(\lambda, Q)$ law inherits its tail behavior from $Q$: $\mathrm{CP}(\lambda, Q)$ has finite variance iff $Q$ does, it has exponentially decaying tails iff $Q$ does, and so on [13]. It is in part from this versatility of tail behavior that the compound Poisson distribution draws its importance in many applications. Alternatively, $\mathrm{CP}(\lambda, Q)$ is characterized as the infinitely divisible law without a Gaussian component and with Lévy measure $\lambda Q$.
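To make the random-sum definition concrete, here is a minimal Python sketch that computes the probability mass function of $\mathrm{CP}(\lambda, Q)$ directly from that definition, by mixing the convolution powers of $Q$ with Poisson($\lambda$) weights. The helper names, the particular $Q$ and the truncation level are assumptions chosen only for illustration. Because every summand is at least 1, only the terms with at most `n_max` summands contribute to the probabilities on $\{0, \ldots, n_{\max}\}$, so the values computed there are exact up to rounding.

```python
import math

def convolve(a, b, n_max):
    """Convolution of two pmfs on {0, ..., n_max}, discarding mass above n_max."""
    out = [0.0] * (n_max + 1)
    for i, ai in enumerate(a):
        if ai == 0.0:
            continue
        for k, bk in enumerate(b):
            if i + k > n_max:
                break
            out[i + k] += ai * bk
    return out

def compound_poisson_pmf(lam, q, n_max):
    """pmf of Z = X_1 + ... + X_W on {0, ..., n_max}; q[j] = Q_j with q[0] = 0."""
    q_trunc = (list(q) + [0.0] * (n_max + 1))[: n_max + 1]
    pmf = [0.0] * (n_max + 1)
    q_power = [1.0] + [0.0] * n_max      # w-fold convolution of Q, starting at w = 0
    poisson_w = math.exp(-lam)           # P(W = w), starting at w = 0
    for w in range(n_max + 1):
        for n in range(n_max + 1):
            pmf[n] += poisson_w * q_power[n]
        q_power = convolve(q_power, q_trunc, n_max)
        poisson_w *= lam / (w + 1)
    return pmf

# Hypothetical example: lambda = 1.5 and Q = (0.5, 0.3, 0.2) on {1, 2, 3}.
lam = 1.5
q = [0.0, 0.5, 0.3, 0.2]
pmf = compound_poisson_pmf(lam, q, 80)
total = sum(pmf)
mean = sum(n * p for n, p in enumerate(pmf))
q1 = sum(j * qj for j, qj in enumerate(q))   # first moment of Q
```

In this example the total mass is numerically 1 and the mean agrees with $\lambda q_1$, the mean of the $\mathrm{CP}(\lambda, Q)$ law, where $q_1$ is the first moment of $Q$.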
From the above discussion we observe that the Herbst argument depends heavily on the use of moment-generating functions, and therefore implicitly assumes the existence of exponential moments. Our main contribution is a modification of the Herbst argument for the case when the random variables of interest do not satisfy such exponential integrability conditions. We derive what appear to be the first concentration inequalities for a class of infinitely divisible random variables that have finite variance but do not have finite exponential moments. Apart from the derivation of the present results, the modified Herbst argument is applicable in a variety of other cases and may be of independent interest. In particular, this approach can be applied to prove dimension-free inequalities for compound Poisson vectors, as well as power-law concentration bounds for more general infinitely divisible laws.
Our starting point is the following modified log-Sobolev inequality for the compound Poisson measure $\mathrm{CP}_{\lambda,Q}$.
Theorem 1. [Modified Log-Sobolev Inequality for Compound Poisson Measures] For any $\lambda > 0$, any probability measure $Q$ on $\mathbb{N}$ and any bounded $f$ : where
This can be derived easily from [15, Cor. 4.2] of Wu, which was established using elaborate stochastic calculus techniques. In Section 3 we also give an alternative, elementary proof, by tensorizing the Bobkov-Ledoux result (2). Note the elegant similarity between the bounds in (2) and (3).
We then apply our modified Herbst argument to establish concentration bounds for $\mathrm{CP}(\lambda, Q)$ measures under various assumptions on the tail behavior of $Q$. These are stated in Section 2 and proved in Section 4. For example, we establish the following polynomial concentration result. Recall that a function $f$ : and write $q_r$ for its integer moments, $q_r = \sum_{j \ge 1} j^r Q_j$, $r \in \mathbb{N}$.
If $f : \mathbb{Z}_+ \to \mathbb{R}$ is $K$-Lipschitz, then for any positive integer $n < L$ and any $t > 0$ we have, where for the constants $A$, $B$ we can take,
Various stronger and more general results are given in Section 2. There, at the price of more complex constants, we get bounds which, for large $t$, are of (the optimal) order $t^{-L+\delta}$ for any $\delta > 0$. Moreover, since the only property of the compound Poisson distribution used in the proof is that it satisfies the functional inequality of Theorem 1, similar bounds are immediately seen to hold for any measure that satisfies such an inequality. Note that although the bound of Corollary 2 is not useful for small $t$, it is in general impossible to obtain meaningful results for arbitrary $t > 0$. For example, if $f$ is the identity function and $Z \sim$ Poisson($\lambda$) where $\lambda$ is of the form $m + 1/2$ for an integer $m$, then $|Z - E(Z)| \ge 1/2$ with probability 1; a more detailed discussion is given in Section 2.
As noted above, these appear to be some of the first non-exponential concentration bounds that have been derived, with the few recent exceptions discussed next. Of the extensive current literature on concentration, our results are most closely related to the work of Houdré and his co-authors. Using sophisticated technical tools derived from the "covariance representations" developed in [7][8], Houdré [5] obtained concentration bounds for Lipschitz functions of infinitely divisible random vectors with finite exponential moments. In [6], truncation and explicit computations were used to extend these results to the class of stable laws on $\mathbb{R}^d$, and the preprint [4] extends them further to a large class of functionals on Poisson space. To our knowledge, the results in [6] [4] are the only concentration bounds with power-law decay to date. But when specialized to scalar random variables they only apply to distributions with infinite variance, whereas our results hold for compound Poisson random variables with a finite $L$th moment for any $L > 1$. Although the methods of [6][4] as well as the form of the results themselves are very different from those derived here, some more detailed comparisons are possible as outlined in Section 2. Finally, the recent paper [3] contains a different extension of the Herbst argument to certain situations where exponential moments do not exist. The focus there is on moment inequalities for functions of independent random variables, primarily motivated by statistical applications.

Concentration Bounds
The following result is the main motivation for this paper. It illustrates the potential for using the Herbst argument even in cases where the existence of exponential moments fails or cannot be assumed.
1. Taking $\alpha = L - \delta$ for any $\delta > 0$ in the exponent of (4), we get a bound on the tails of $f(Z)$ of order $t^{-(L-\delta)}$ for large $t$. By considering the case where $f$ is the identity function $f(k) = k$, $k \in \mathbb{Z}_+$, we see that this power-law behavior is in fact optimal. In particular, this shows that the tail of the $\mathrm{CP}(\lambda, Q)$ law decays like the tail of $Q$, giving a quantitative version of a classical result from [13].
2. As will become evident from the proof, Theorem 3 holds for any random variable $Z$ with law $\mu$ instead of $\mathrm{CP}_{\lambda,Q}$, as long as $\mu$ satisfies the log-Sobolev inequality of Theorem 1 with respect to some probability measure $Q$ on $\mathbb{N}$ and some $\lambda > 0$, and assuming that $\mu$ has finite moments up to order $L$. The bound (4) remains exactly the same, except that the first moment
3. Integrability properties follow immediately from the theorem: For any for all $\tau < L$, and the same holds for any law $\mu$ as in the previous remark.
Since the support of $\mathrm{CP}(\lambda, Q)$ is $\mathbb{Z}_+$, we would naturally expect the range of $f$ to be highly disconnected. Therefore, to somewhat simplify the expression in the exponent of (4), next we concentrate on the (typical) class of functions $f : \mathbb{Z}_+ \to \mathbb{R}$ whose mean under $\mathrm{CP}_{\lambda,Q}$ is not in the range of $f$:
Corollary 4. [Power-law Concentration for Nice $f$] Suppose that $Z$ has $\mathrm{CP}(\lambda, Q)$ distribution where $Q$ has finite moments up to order $L > 1$, and write $q_1$ for its first moment. If $f : \mathbb{Z}_+ \to \mathbb{R}$ is $K$-Lipschitz and there exists $\epsilon > 0$ such that where $I_\epsilon(\alpha)$ is defined as in Theorem 3, and
Remarks.
4. Similarly to Theorem 3, this corollary gives quantitative bounds on the tail of $f(Z)$ of the order of $t^{-(L-\delta)}$ for any $\delta > 0$. Also, the same result holds for any law $\mu$ as in Remark 2.
5. The exponent in (5) becomes negative exactly when $t > D$, for the same reasons as in Theorem 3. On the other hand, it is obvious that any bound can only be useful for $t > D$ equal to one. Moreover, $D$ and $D_0$ coincide in many special cases, as, e.g., when the range of $f$ is a lattice in $\mathbb{R}$ and its mean $E[f(Z)]$ is on the midpoint between two lattice points. In this sense, the restriction $t > D$ is quite natural.
6. The expression $2|f(0)| + 2K\lambda q_1$ in Theorem 3 is simply an upper bound to the constant In both cases, when $L > 2$ it is possible to obtain potentially sharper results by bounding $D$ above using Jensen's inequality by, where $q_2$ is the second moment of $Q$. Similar expressions can be derived in the case of higher moments.
7. The most closely related results to our power-law concentration bounds appear to be in the recent preprint [4]. The relevant bounds in [4] specialized to Lipschitz functions of $\mathrm{CP}(\lambda, Q)$ random variables require that the probability measure $Q$ be non-atomic, which excludes all the cases we consider. But shortly after the first writing of this paper C. Houdré in a personal communication informed us that this assumption can be removed by an appropriate construction. The details have not been checked by us, but in the following comparison we assume that it does not change the statements in [4]. The main assumptions in [4] are that the random variable of interest has infinite variance, and also certain growth conditions. Because of the infinite-variance assumption, the majority of the results in this paper (corresponding to $L > 2$) apply to cases that are not covered in [4]. As for the growth conditions, they are convenient to check in several important special classes, e.g., for $\alpha$-stable laws on $\mathbb{R}$, but they can be unwieldy in the compound Poisson case, especially as they depend on $Q$ in an intricate way. On the other hand, if $Q$ has infinite variance, [4, Cor. 5.3] gives optimal-order bounds, including the case when $Q$ has infinite mean, for which our results do not apply.
Next we show how the Herbst argument can be used to recover precisely a result of [5] in the case when we have exponential moments.

Remarks.
8. Theorem 1 of [5] gives concentration bounds for a class of infinitely divisible laws with finite exponential moments, and in the compound Poisson case it reduces precisely to (6), which also applies to any random variable $Z$ whose law satisfies the result of Theorem 1. It is also interesting to note that Theorem 5 can be derived by applying [15, Prop. 3.2] to a compound Poisson random variable (constructed via the Wiener-Itô decomposition), and then using Markov's inequality.
9. Theorems 3 and 5 easily generalize to Hölder continuous functions. In the discrete setting of $\mathbb{Z}_+$, $f$ is $K$-Lipschitz iff it is Hölder continuous for every exponent $\beta \ge 1$ with the same constant $K$. But if $f$ is Hölder continuous with exponent $\beta < 1$, this more stringent requirement makes it possible to strengthen Theorem 3 and Theorem 5, by respectively redefining, $C_{j,\epsilon} = 1 + \frac{j^\beta K}{\epsilon}$, and
10. While all our power-law results dealt with two-sided deviations, the bound in Theorem 5 is one-sided. The reason for this discrepancy is that the last step in all the relevant proofs is an application of Markov's inequality, which leads us to restrict attention to nonnegative random variables. When exponential moments exist, the natural consideration of the exponential of the random variable takes care of this issue, but in the case of regular moments we are forced to take absolute values.

Proof of Theorem 1
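An alternative representation for the law of a $\mathrm{CP}(\lambda, Q)$ random variable $Z$ is in terms of the series $Z \overset{\mathcal{D}}{=} \sum_{j \ge 1} j Y_j$ (7), where the $Y_j$ are independent with $Y_j \sim$ Poisson($\lambda Q_j$).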
For each $n$, let $\mu_n$ denote the joint (product) distribution of $(Y_1, \ldots, Y_n)$. In this instance, the tensorization property of the entropy [1][10] [11] can be expressed as where $G : \mathbb{Z}_+^n \to \mathbb{R}_+$ is an arbitrary function, and the entropy on the right-hand side is applied to the restriction $G_j$ of $G$ to its $j$th co-ordinate.
Now given an $f$ as in the statement of the theorem, define the functions $G$ : and $G = e^H$. Let $\tilde{\mu}_n$ denote the distribution of the sum $S_n := \sum_{k=1}^n k Y_k$ and write $H_j : \mathbb{Z}_+ \to \mathbb{R}$ for the restriction of $H$ to the variable $y_j$ with the remaining $y_i$'s fixed. Applying (8) to $G$ we obtain, Using the Bobkov-Ledoux inequality (2) to bound each term in the above sum, and noting that, trivially, $DH_j(y_1, \ldots, y_n) = D^j f\left(\sum_{k=1}^n k y_k\right)$, where the last inequality follows from the fact that $xe^x - e^x + 1 \ge 0$ for $x \ge 0$.
Finally, we want to take the limit as $n \to \infty$ in (9). Since $\tilde{\mu}_n \Rightarrow \mathrm{CP}_{\lambda,Q}$ as $n \to \infty$ by (7), and since $f$ is bounded, by bounded convergence Similarly, changing the order of summation and expectation in the right-hand side of (9) by Fubini, taking $n \to \infty$ by bounded convergence, and interchanging the order again, it converges to This together with (10) implies that (9) yields the required result upon taking $n \to \infty$. ✷
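The series representation used in this proof can be checked numerically. The Python sketch below, written under illustrative assumptions (a $Q$ with finite support, hypothetical helper names, and a truncation level chosen for convenience), convolves the laws of $jY_j$ with $Y_j \sim$ Poisson($\lambda Q_j$) and compares the result with the $\mathrm{CP}(\lambda, Q)$ probabilities obtained from the standard Panjer recursion $P(Z = n) = \frac{\lambda}{n} \sum_{j=1}^{n} j Q_j P(Z = n - j)$, $P(Z = 0) = e^{-\lambda}$. The recursion is a classical fact about compound Poisson laws and is not part of the argument in the text; it serves here only as an independent reference.

```python
import math

def scaled_poisson_pmf(rate, step, n_max):
    """pmf of step * Y on {0, ..., n_max}, where Y ~ Poisson(rate)."""
    pmf = [0.0] * (n_max + 1)
    p = math.exp(-rate)          # P(Y = 0)
    k = 0
    while step * k <= n_max:
        pmf[step * k] = p
        k += 1
        p *= rate / k            # P(Y = k) from P(Y = k - 1)
    return pmf

def convolve(a, b, n_max):
    """Convolution of two pmfs, kept only on {0, ..., n_max}."""
    return [sum(a[i] * b[n - i] for i in range(n + 1)) for n in range(n_max + 1)]

def series_pmf(lam, q, n_max):
    """pmf of sum_j j * Y_j with independent Y_j ~ Poisson(lam * Q_j)."""
    pmf = [1.0] + [0.0] * n_max  # point mass at 0
    for j, qj in q.items():
        pmf = convolve(pmf, scaled_poisson_pmf(lam * qj, j, n_max), n_max)
    return pmf

def panjer_pmf(lam, q, n_max):
    """Reference pmf of CP(lam, Q) from the Panjer recursion."""
    pmf = [math.exp(-lam)] + [0.0] * n_max
    for n in range(1, n_max + 1):
        pmf[n] = lam / n * sum(j * qj * pmf[n - j] for j, qj in q.items() if j <= n)
    return pmf

# Hypothetical example: lambda = 1.5 and Q = (0.5, 0.3, 0.2) on {1, 2, 3}.
lam, q = 1.5, {1: 0.5, 2: 0.3, 3: 0.2}
gap = max(abs(a - b) for a, b in zip(series_pmf(lam, q, 40), panjer_pmf(lam, q, 40)))
```

For a $Q$ with finite support the series has finitely many terms, so no limit in $n$ is needed and the two computations agree up to floating-point rounding.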

Concentration Proofs
For notational convenience we define the function $\eta(x) := xe^x - e^x + 1$, $x \in \mathbb{R}$, and note that it is non-negative; it achieves its minimum at 0; it is strictly convex on $(-1, \infty)$ and strictly concave on $(-\infty, -1)$; it decreases from 1 to 0 as $x$ increases from $-\infty$ to zero, and it is increasing to infinity for $x > 0$.
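The properties of $\eta$ listed above can be read off from its first two derivatives; the following short computation is included only as a reader's check and is not taken from the original text.

```latex
\begin{align*}
\eta(x)   &= x e^{x} - e^{x} + 1, \qquad \eta(0) = 0, \qquad \lim_{x \to -\infty} \eta(x) = 1, \\
\eta'(x)  &= e^{x} + x e^{x} - e^{x} = x e^{x}, \\
\eta''(x) &= (1 + x)\, e^{x}.
\end{align*}
```

Since $\eta'(x)$ has the sign of $x$, the function decreases on $(-\infty, 0)$ and increases on $(0, \infty)$, so its minimum value is $\eta(0) = 0$; and since $\eta''(x)$ has the sign of $1 + x$, the convexity changes exactly at $x = -1$.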
The main technical ingredient of the paper is the following proposition, which is based on a modification of the Herbst argument.
Proposition 7. Suppose that $Z$ has $\mathrm{CP}(\lambda, Q)$ distribution where $Q$ has finite moments up to order $L > 1$. If $f : \mathbb{Z}_+ \to \mathbb{R}$ is bounded and $K$-Lipschitz, then for $t > 0$, $\epsilon > 0$ and $\alpha \in (0, L)$, we have, where $I_\epsilon(\alpha)$ is defined as in Theorem 3 and
Proof of Proposition 7.
Since $f$ is bounded, by its definition $g_\epsilon$ is also bounded above by $2\|f\|_\infty + \epsilon$ and below by $\epsilon$. Therefore, the moment generating function $G(\tau) := E[g_\epsilon(Z)^\tau]$ is well-defined for all $\tau > 0$. Moreover, since both $g_\epsilon$ and $\log g_\epsilon$ are bounded, dominated convergence justifies the following differentiation under the integral, so we can relate $G(\tau)$ to the entropy of $g_\epsilon^\tau$, In order to bound this entropy we will apply Theorem 1 to the function $\phi(x) := \tau \log g_\epsilon(x)$.
First we observe that $g_\epsilon$ can be written as the composition $g_\epsilon = h \circ (f - E[f(Z)])$, where it is easy to verify that the function $h(x) := |x| I_{\{|x| \ge \epsilon\}} + \epsilon I_{\{|x| < \epsilon\}}$ is 1-Lipschitz. And since $f$ is $K$-Lipschitz by assumption, $g_\epsilon$ is itself $K$-Lipschitz. Hence we can bound $D^j \phi$ as The same argument also yields a corresponding lower bound, so that $|D^j \phi(x)| \le \tau \log C_{j,\epsilon}$. Applying Theorem 1 to $\phi$ gives, since $\eta(x)$ is increasing for $x \ge 0$. Combining this with (11) we obtain the following differential inequality valid for all $\tau > 0$: To solve, we integrate with respect to $\tau$ on $(0, \alpha]$ to obtain, for any $\alpha < L$, or, equivalently, where the exchange of sum and integral is justified by Fubini's theorem since all the quantities involved are nonnegative.
To complete the proof we observe that $g_\epsilon \ge |f - E[f(Z)]|$, so that by (12) and an application of Markov's inequality,
Using Proposition 7 we can prove our main results, Theorem 3 and Corollaries 2 and 4.
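The Herbst-type steps in the proof of Proposition 7 rest on a few elementary identities, which can be written out from the definitions alone. The following is a sketch under the stated notation ($G(\tau) = E[g_\epsilon(Z)^\tau]$, $\eta(x) = xe^x - e^x + 1$, and a generic constant $c > 0$ standing in for $\log C_{j,\epsilon}$); it is offered as a reader's aid and should not be read as the exact form of the displayed equations in the proof.

```latex
\begin{align*}
G'(\tau) &= E\big[ g_\epsilon(Z)^{\tau} \log g_\epsilon(Z) \big], \\
\mathrm{Ent}\big(g_\epsilon^{\tau}\big)
  &= E\big[ g_\epsilon^{\tau} \log g_\epsilon^{\tau} \big] - G(\tau) \log G(\tau)
   = \tau G'(\tau) - G(\tau) \log G(\tau), \\
\frac{d}{d\tau} \left[ \frac{1}{\tau} \log G(\tau) \right]
  &= \frac{\tau G'(\tau) - G(\tau) \log G(\tau)}{\tau^{2} G(\tau)}
   = \frac{\mathrm{Ent}\big(g_\epsilon^{\tau}\big)}{\tau^{2} G(\tau)}, \\
\lim_{\tau \downarrow 0} \frac{1}{\tau} \log G(\tau) &= E\big[ \log g_\epsilon(Z) \big], \\
\frac{d}{d\tau} \left[ \frac{e^{c\tau} - 1}{\tau} \right]
  &= \frac{c\tau e^{c\tau} - e^{c\tau} + 1}{\tau^{2}} = \frac{\eta(c\tau)}{\tau^{2}},
  \qquad\text{so}\qquad
  \int_{0}^{\alpha} \frac{\eta(c\tau)}{\tau^{2}} \, d\tau = \frac{e^{c\alpha} - 1}{\alpha} - c .
\end{align*}
```

The third line explains why an upper bound on the entropy of $g_\epsilon^\tau$ turns into a differential inequality for $\frac{1}{\tau} \log G(\tau)$, and the last line gives a closed form for integrals of $\eta(c\tau)/\tau^2$ over $(0, \alpha]$, which is the kind of term that appears when such an inequality is integrated.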
Proof of Theorem 3. The first step is to bring the upper bound in Proposition 7 into a more tractable form. Observe that by its definition, $g_\epsilon(x) \le |f(x) - E[f(Z)]| + \epsilon$, so that, by Jensen's inequality, for a function $f$ satisfying the hypotheses of Proposition 7, Thus the upper bound in Proposition 7 can be weakened to where where we used the fact that the mean of the $\mathrm{CP}(\lambda, Q)$ law is $\lambda q_1$. Substituting in (14) and taking the infimum over $\alpha$ yields the required result (4), and it only remains to remove the boundedness assumption on $f$. But since the bound itself only depends on $f$ via $f(0)$ and $K$, truncating $f$ at level $\pm n$ and passing to the limit $n \to \infty$ proves part (a).
With $T = 2|f(0)| + 2K\lambda q_1 + \epsilon$, in order to evaluate the exponent in (4), we calculate the first two derivatives of $I_\epsilon(\alpha)$ with respect to $\alpha$ as, where the exchange of differentiation and expectation is justified by dominated convergence; observe that, since $C_{j,\epsilon} > 1$, both are positive for all $\alpha > 0$. In particular, since $I_\epsilon(\alpha) > 0$, the exponent (15) can only be negative (equivalently, the bound in (4) can only be less than 1) if the second term in (15) is negative, i.e., if $t > T$. On the other hand, since $I'_\epsilon(0) = 0$ and $I''_\epsilon(\alpha) > 0$ for all $\alpha$, we see that $I_\epsilon(\alpha)$ is locally quadratic around $\alpha = 0$. This means that, as long as $t > T$, choosing $\alpha$ sufficiently small we can make (15) negative; therefore the bound of the theorem is meaningful precisely when $t > T$.
To obtain the alternative representation, fix any $\epsilon > 0$ and set $i_\epsilon(\alpha) = I'_\epsilon(\alpha)$. Since $I''_\epsilon(\alpha)$ is strictly positive, for $t > T$ the expression $I_\epsilon(\alpha) + \alpha \log(T/t)$ is uniquely minimized at $\alpha^* > 0$ which solves $i_\epsilon(\alpha) = \log(t/T) > 0$. Hence, for all $t > T$, integrating by parts, Using the binomial theorem to expand $I_1(n)$, Substituting this bound into (16) and rearranging yields the result. ✷
Next we go on to prove the exponential concentration result Theorem 5 using the classical Herbst argument in conjunction with the modified log-Sobolev inequality of Theorem 1.
Proof of Theorem 5. We proceed similarly to the proof of Proposition 7. Assume $f$ is bounded and $K$-Lipschitz, and let $F(\tau) = E[\exp\{\tau f(Z)\}]$, $\tau > 0$, be the moment-generating function of $f(Z)$. Dominated convergence justifies the differentiation $\le \exp\{H(\alpha) - \alpha t\}$.
The removal of the boundedness assumption is a routine truncation argument as in the proof of Theorem 3 or in [2] [5]. In order to obtain the best bound for the deviation probability, we minimize the exponent over $\alpha \in (0, M/K)$. This yields the first expression in Theorem 5; the second representation follows from a standard argument as in the last part of the proof of Theorem 3 or [5]. ✷
Z)]| . Next we use the Lipschitz property of $f$ to obtain an upper bound for the above exponent which is uniform over all $f$ with $f(0)$ fixed. Since $f(j) \in [f(0) - Kj, f(0) + Kj]$, we have $|f(j)| \le |f(0)| + Kj$, and hence dx, which proves part (b). ✷
Proof of Corollary 4. The proof is identical to that of Theorem 3, with the only difference that, since here we simply have $g_\epsilon(x) = |f(x) - E[f(Z)]|$ for all $x$, we can replace the bound (13) by $E[\log g_\epsilon(Z)] \le \log D$, where $D = E|f(Z) - E[f(Z)]|$. Proceeding as before gives the result. ✷
Proof of Corollary 2. This is an application of Theorem 3 for specific values of $\alpha$ and $\epsilon$: Bounding the infimum by the value at $\alpha = n$ and taking $\epsilon = 1$, Pr