Concentration inequalities for polynomials in $\alpha$-sub-exponential random variables

In this work we derive multi-level concentration inequalities for polynomial functions in independent random variables with $\alpha$-sub-exponential tail decay. A particularly interesting case is given by quadratic forms $f(X_1, \ldots, X_n) = \langle X, AX \rangle$, for which we prove Hanson-Wright-type inequalities with explicit dependence on various norms of the matrix $A$. A consequence of these inequalities is a two-level concentration inequality for quadratic forms in $\alpha$-sub-exponential random variables, such as quadratic Poisson chaos. We provide various applications of these inequalities. Among these are generalizations of results of Rudelson and Vershynin from sub-Gaussian to $\alpha$-sub-exponential random variables, i.e. concentration of the Euclidean norm of the linear image of a random vector, small ball probability estimates and concentration inequalities for the distance between a random vector and a fixed subspace. Moreover, we obtain concentration inequalities for the excess loss in a fixed design linear regression and for the norm of a randomly projected random vector.


Introduction
Let $X_1, \ldots, X_n$ be independent random variables and let $f : \mathbb{R}^n \to \mathbb{R}$ be a measurable function. One of the main and rather classical questions of probability theory consists in finding good estimates on the fluctuations of $f(X_1, \ldots, X_n)$ around a deterministic value (e.g. its expectation or median), i.e. to determine a function $h : [0, \infty) \to [0, 1]$ such that
$$\mathbb{P}\big( |f(X_1, \ldots, X_n) - \mathbb{E} f(X_1, \ldots, X_n)| \ge t \big) \le h(t). \tag{1.1}$$
Of course, $h$ should take into account both the information given by $f$ as well as by $X_1, \ldots, X_n$. Perhaps one of the most well-known concentration inequalities is the tail decay of the Gaussian distribution: if $X_1, \ldots, X_n$ are independent with standard normal distribution $N(0,1)$ and $f(X_1, \ldots, X_n) = n^{-1/2} \sum_{i=1}^n X_i$, then $f(X_1, \ldots, X_n) \sim N(0,1)$ and
$$\mathbb{P}\big( |f(X_1, \ldots, X_n) - \mathbb{E} f(X_1, \ldots, X_n)| \ge t \big) \le 2 \exp\Big( -\frac{t^2}{2} \Big). \tag{1.2}$$
Using the entropy method, it is possible to show that the estimate (1.2) remains true for any Lipschitz function $f$ (see e.g. [Led01, Section 5]). On the other hand, if $f$ is a polynomial of degree 2, then the tails of $f(X_1, \ldots, X_n)$ are heavier. Indeed, the Hanson-Wright inequality states that for a quadratic form $f(X_1, \ldots, X_n) = \langle X, AX \rangle$ in independent, standard Gaussian random variables $X_1, \ldots, X_n$ we have
$$\mathbb{P}\big( |\langle X, AX \rangle - \mathbb{E} \langle X, AX \rangle| \ge t \big) \le 2 \exp\Big( -\frac{1}{C} \min\Big( \frac{t^2}{\|A\|_{\mathrm{HS}}^2}, \frac{t}{\|A\|_{\mathrm{op}}} \Big) \Big). \tag{1.3}$$
Here, $\|A\|_{\mathrm{op}}$ is the operator norm and $\|A\|_{\mathrm{HS}}$ the Hilbert-Schmidt norm (also called Frobenius norm) of $A$, respectively. For a proof see [RV13]. Thus the tails of the quadratic form decay like $\exp(-t)$ for large $t$. There are inequalities similar to (1.3) for multilinear chaos in Gaussian random variables proven in [Lat06] (and in fact, a lower bound using the same quantities as well), and in [AW15] for polynomials in sub-Gaussian random variables. However, a key assumption is that the individual random variables $X_i$ have a sub-Gaussian tail decay, i.e. $\mathbb{P}(|X_i| \ge t) \le c \exp(-C t^2)$ for some constants $c, C$.
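The two regimes in a Hanson-Wright-type bound can be illustrated numerically. The following sketch evaluates the mixed exponent $\min(t^2/\|A\|_{\mathrm{HS}}^2,\, t/\|A\|_{\mathrm{op}})$ from (1.3) for a small diagonal matrix; the matrix and the crossover computation are illustrative choices, not part of the theorem.

```python
import math

def hw_exponent(t, hs_norm, op_norm):
    """Mixed exponent min(t^2/||A||_HS^2, t/||A||_op) from a Hanson-Wright-type bound."""
    return min(t ** 2 / hs_norm ** 2, t / op_norm)

# Hypothetical example A = diag(2, 1): ||A||_HS = sqrt(5), ||A||_op = 2.
hs, op = math.sqrt(5.0), 2.0

# The Gaussian regime t^2/||A||_HS^2 is active for small t, the exponential
# regime t/||A||_op for large t; they cross at t = ||A||_HS^2 / ||A||_op.
crossover = hs ** 2 / op  # = 2.5 for this example

small_t, large_t = 1.0, 10.0
gaussian_active = small_t ** 2 / hs ** 2 < small_t / op
exponential_active = large_t / op < large_t ** 2 / hs ** 2
```

For $t$ below the crossover the bound behaves like a Gaussian tail, above it like an exponential tail, matching the $\exp(-t)$ decay for large $t$ described in the text.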
In recent works [BGS18], [GSS18b], [GSS18a] we have studied similar concentration inequalities for bounded functions $f$ of either independent or weakly dependent random variables. There, the situation is clearly different, since the distribution of $f(X_1, \ldots, X_n)$ has compact support and is thus sub-Gaussian, and the challenge is to give an estimate depending on different quantities derived from $f$ and $X$. However, there are many situations of interest where boundedness does not hold, such as quadratic forms in unbounded random variables as in (1.3). Here it seems reasonable to focus on certain classes of functions for which the tail behavior can be directly traced back to the tails of the random variables under consideration. Therefore, in this note we restrict ourselves to polynomial functions.
In the following results, the setup is as follows. We consider independent random variables $X_1, \ldots, X_n$ which have $\alpha$-sub-exponential tail decay. By this we mean that there exist two constants $c, C$ and a parameter $\alpha > 0$ such that for all $i = 1, \ldots, n$ and $t \ge 0$
$$\mathbb{P}(|X_i| \ge t) \le c \exp(-C t^\alpha). \tag{1.4}$$
There are many interesting choices of random variables $X_i$ of this type, such as bounded random variables (for any $\alpha > 0$), random variables with a sub-Gaussian ($\alpha = 2$) or sub-exponential distribution ($\alpha = 1$) such as Poisson random variables, or "fatter" tails such as Weibull random variables with shape parameter $\alpha \in (0, 1]$. We reformulate condition (1.4) in terms of so-called exponential Orlicz norms, but we emphasize that these two concepts are equivalent. For any random variable $X$ on a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ and $\alpha > 0$ define the (quasi-)norm
$$\|X\|_{\Psi_\alpha} := \inf\{ t > 0 : \mathbb{E} \exp(|X|^\alpha / t^\alpha) \le 2 \}, \tag{1.5}$$
adhering to the standard convention $\inf \emptyset = \infty$. Strictly speaking, this is a norm for $\alpha \ge 1$ only, since otherwise the triangle inequality does not hold. Nevertheless, the above expression makes sense for any $\alpha > 0$, and we choose to call it a norm in these cases as well. For some properties of the Orlicz norms in the case $\alpha \in (0, 1]$, see Appendix A. In this note we concentrate on values $\alpha = 2/q$ for some $q \in \mathbb{N}$, but also prove results for the case $\alpha \in (0, 1]$. Throughout this work, we denote by $C$ an absolute constant and by $C_{l_1, \ldots, l_k}$ a constant that only depends on the parameters $l_1, \ldots, l_k$. For illustration, we start with a simplified version of some of our results which may already be sufficient for application purposes. The first result is a concentration inequality which may be considered as a generalization of the Hanson-Wright inequality (1.3) to quadratic forms in random variables with $\alpha$-sub-exponential tail decay.
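Since $t \mapsto \mathbb{E}\exp(|X|^\alpha/t^\alpha)$ is decreasing, the Orlicz norm (1.5) can be computed numerically for discrete distributions by bisection. A minimal sketch; the Rademacher example and numerical tolerances are illustrative choices:

```python
import math

def psi_alpha_norm(values, probs, alpha, lo=1e-9, hi=1e9, iters=200):
    """||X||_{Psi_alpha} = inf{t > 0 : E exp(|X|^alpha / t^alpha) <= 2} by bisection."""
    def expectation(t):
        total = 0.0
        for v, p in zip(values, probs):
            e = abs(v) ** alpha / t ** alpha
            if e > 700.0:          # exp(700) is near the float limit; treat as infinite
                return float("inf")
            total += p * math.exp(e)
        return total

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if expectation(mid) <= 2.0:
            hi = mid               # condition holds, try a smaller t
        else:
            lo = mid
    return hi

# Rademacher variable X in {-1, +1}: E exp(1/t^2) <= 2 iff t >= 1/sqrt(log 2).
norm = psi_alpha_norm([-1.0, 1.0], [0.5, 0.5], alpha=2.0)
```

For the Rademacher variable the bisection recovers the closed form $1/\sqrt{\log 2} \approx 1.2011$, which can serve as a sanity check of the defining condition in (1.5).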
Proposition 1.1. Let $X_1, \ldots, X_n$ be independent random variables satisfying $\mathbb{E} X_i = 0$, $\mathbb{E} X_i^2 = \sigma_i^2$, $\|X_i\|_{\Psi_\alpha} \le M$ for some $\alpha \in (0,1] \cup \{2\}$, and let $A$ be a symmetric $n \times n$ matrix. For any $t > 0$ we have
$$\mathbb{P}\big( |\langle X, AX \rangle - \mathbb{E} \langle X, AX \rangle| \ge t \big) \le 2 \exp\Big( -\frac{1}{C_\alpha} \min\Big( \frac{t^2}{M^4 \|A\|_{\mathrm{HS}}^2}, \Big( \frac{t}{M^2 \|A\|_{\mathrm{op}}} \Big)^{\alpha/2} \Big) \Big).$$
As we will see in Proposition 1.5, the tail decay $\exp(-t^{\alpha/2} \|A\|_{\mathrm{op}}^{-\alpha/2})$ (for large $t$) can be sharpened by replacing the operator norm by a smaller norm. Actually, the technical result contains up to four different regimes instead of the two above.
The next theorem provides tail estimates for polynomials in independent random variables. Note that this is not a generalization of Proposition 1.1 due to the use of Hilbert-Schmidt norms instead of operator norms.
Theorem 1.2. Let $X_1, \ldots, X_n$ be independent random variables satisfying $\|X_i\|_{\Psi_\alpha} \le M$ for some $\alpha \in (0,1] \cup \{2\}$ and let $f : \mathbb{R}^n \to \mathbb{R}$ be a polynomial of total degree $D \in \mathbb{N}$. Then for all $t > 0$
$$\mathbb{P}(|f(X) - \mathbb{E} f(X)| \ge t) \le 2 \exp\Big( -\frac{1}{C_{D,\alpha}} \min_{1 \le d \le D} \Big( \frac{t}{M^d \|\mathbb{E} f^{(d)}(X)\|_{\mathrm{HS}}} \Big)^{\alpha/d} \Big), \tag{1.6}$$
where $f^{(d)}$ denotes the $d$-tensor of $d$-th order partial derivatives of $f$. Intuitively, Theorem 1.2 states that a polynomial in random variables with tail decay as in (1.4) also exhibits $\alpha$-sub-exponential tail decay whenever the Hilbert-Schmidt norms are not too large. Moreover, the tail decay is "as expected", i.e. one just needs to account for the total degree $D$ by taking the $D$-th root.
One particularly interesting case is when the functional under consideration is a $d$-th order chaos. That is, given a $d$-tensor $A = (a_{i_1 \ldots i_d})$ which we assume to be symmetric, i.e. $a_{i_1 \ldots i_d} = a_{i_{\sigma(1)} \ldots i_{\sigma(d)}}$ for any permutation $\sigma \in S_d$, we consider the polynomial
$$f_{d,A}(X) = \sum_{i_1, \ldots, i_d = 1}^n a_{i_1 \ldots i_d} X_{i_1} \cdots X_{i_d}. \tag{1.7}$$
Additionally, we often assume that $A$ has vanishing generalized diagonal in the sense that $a_{i_1 \ldots i_d} = 0$ whenever $i_1, \ldots, i_d$ are not pairwise different. In this situation, Theorem 1.2 reads as follows:

Corollary 1.3. Let $X_1, \ldots, X_n$ be independent random variables with $\|X_i\|_{\Psi_\alpha} \le M$ for some $\alpha \in (0,1] \cup \{2\}$ and let $A$ be a symmetric $d$-tensor with vanishing generalized diagonal such that $\|A\|_{\mathrm{HS}} \le 1$. Then
$$\mathbb{P}\big( |f_{d,A}(X) - \mathbb{E} f_{d,A}(X)| \ge t \big) \le 2 \exp\Big( -\frac{t^{\alpha/d}}{C_{d,\alpha} M^\alpha} \Big).$$
As in Theorem 1.2, the conclusion is equivalent to a $\Psi_{\alpha/d}$-norm estimate.
1.1. Main results. In comparison to the aforementioned results, our main concentration inequalities provide more refined tail estimates. To this end, we need a family of tensor-product matrix norms $\|A\|_{\mathcal{J}}$ for a $d$-tensor $A$ and a partition $\mathcal{J} \in P_{qd}$ of $\{1, \ldots, qd\}$. For the exact definitions, we refer to (3.5). Using these norms, we may formulate our first result for chaos-type functionals. Note that we focus on the case $\alpha = 2/q$ for some $q \in \mathbb{N}$ only, which is sufficient for many applications, like products or powers of sub-Gaussian or sub-exponential random variables. The general case $\alpha \in (0,1]$ will be treated later.

Theorem 1.4. Let $X_1, \ldots, X_n$ be a set of independent random variables satisfying $\|X_i\|_{\Psi_{2/q}} \le M$ for some $q \in \mathbb{N}$ and $M > 0$, and let $A$ be a symmetric $d$-tensor with vanishing diagonal. Consider $f_{d,A}(X)$ as in (1.7). Then, for any $t > 0$,
$$\mathbb{P}\big( |f_{d,A}(X) - \mathbb{E} f_{d,A}(X)| \ge t \big) \le 2 \exp\Big( -\frac{1}{C_{d,q}} \min_{\mathcal{J} \in P_{qd}} \Big( \frac{t}{M^d \|A\|_{\mathcal{J}}} \Big)^{2/|\mathcal{J}|} \Big).$$

To give an elementary example, consider the case $d = 1$ and $q = 2$. Here, $A = a = (a_1, \ldots, a_n)$ is a vector, and $f_{1,a}$ is just a linear functional of random variables with sub-exponential tails ($\|X_i\|_{\Psi_1} \le M$). It easily follows from the definition that $\|a\|_{\{1,2\}} = |a|$ (i.e. the Euclidean norm of $a$) and $\|a\|_{\{\{1\},\{2\}\}} = \max_i |a_i|$. As a consequence, for any $t > 0$
$$\mathbb{P}\Big( \Big| \sum_{i=1}^n a_i X_i \Big| \ge t \Big) \le 2 \exp\Big( -\frac{1}{C} \min\Big( \frac{t^2}{M^2 |a|^2}, \frac{t}{M \max_i |a_i|} \Big) \Big).$$
Hence, up to constants, we get back a classical result for the tails of a linear form in random variables with sub-exponential tails. For more general functions $f$ and similar results under a Poincaré-type inequality, we refer to [BL97] (the first order case) and [GS18] (the higher order case). Moreover, Theorem 1.4 can be used to give Hanson-Wright-type bounds for quadratic forms in sub-exponential random variables. Here we provide a sharpened version of Proposition 1.1. Let $\langle x, y \rangle$ denote the standard scalar product in $\mathbb{R}^n$.

Proposition 1.5. Let $q \in \mathbb{N}$, let $A = (a_{ij})$ be a symmetric $n \times n$ matrix and let $X_1, \ldots, X_n$ be a set of independent, centered random variables with $\|X_i\|_{\Psi_{2/q}} \le M$ and $\mathbb{E} X_i^2 = \sigma_i^2$. Then, for any $t > 0$,
$$\mathbb{P}\big( |\langle X, AX \rangle - \mathbb{E} \langle X, AX \rangle| \ge t \big) \le 2 \exp\Big( -\frac{1}{C_q} \min\Big( \frac{t^2}{M^4 \|A\|_{\mathrm{HS}}^2}, \frac{t}{M^2 \|A\|_{\mathrm{op}}}, \Big( \frac{t}{M^2 \max_i \|(a_{ij})_j\|_2} \Big)^{2/(q+1)}, \Big( \frac{t}{M^2 \|A\|_\infty} \Big)^{1/q} \Big) \Big).$$
Consequently, for any $x > 0$ we have with probability at least $1 - 2\exp(-x/C)$
$$|\langle X, AX \rangle - \mathbb{E} \langle X, AX \rangle| \le C_q M^2 \big( \sqrt{x}\, \|A\|_{\mathrm{HS}} + x \|A\|_{\mathrm{op}} + x^{(q+1)/2} \max_i \|(a_{ij})_j\|_2 + x^q \|A\|_\infty \big).$$
It is possible to replace $2/q$ by a general $\alpha \in (0,1] \cup \{2\}$ (see Section 6). In this case, we have to replace $2/(q+1)$ by $2\alpha/(2+\alpha)$ and $1/q$ by $\alpha/2$.
Remark. Note that in comparison to the Hanson-Wright inequality (1.3) and Proposition 1.1, the more refined version contains two additional terms. The respective norms $\max_{i=1,\ldots,n} \|(a_{ij})_j\|_2$ and $\|A\|_\infty$ can no longer be written in terms of the eigenvalues of $A$ (in contrast to $\|A\|_{\mathrm{HS}}$ and $\|A\|_{\mathrm{op}}$). Indeed, as we see later, we have $\max_{i=1,\ldots,n} \|(a_{ij})_j\|_2 = \|A\|_{2 \to \infty}$ and $\|A\|_\infty = \max_{i,j} |\langle e_i, A e_j \rangle|$ for the standard basis $(e_i)_i$ of $\mathbb{R}^n$. Moreover, the norms might have a very different scaling in $n$. For example, if $e = (1, \ldots, 1)$ and $A = ee^T - \mathrm{Id}$, then $\|A\|_{\mathrm{HS}} \sim \|A\|_{\mathrm{op}} \sim n$, $\max_i \|(a_{ij})_j\|_2 \sim n^{1/2}$ and $\|A\|_\infty = 1$.
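The different scalings in $n$ from the remark can be checked numerically for $A = ee^T - \mathrm{Id}$. The following sketch computes the four norms directly; the power iteration for the operator norm is an illustrative implementation choice (here it converges immediately, since the all-ones vector is the top eigenvector).

```python
import math

n = 100
# A = ee^T - Id: zero diagonal, ones off the diagonal.
A = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]

hs_norm = math.sqrt(sum(A[i][j] ** 2 for i in range(n) for j in range(n)))            # ~ n
max_row_norm = max(math.sqrt(sum(A[i][j] ** 2 for j in range(n))) for i in range(n))  # ~ n^(1/2)
inf_norm = max(abs(A[i][j]) for i in range(n) for j in range(n))                      # = 1

# Operator norm via power iteration (A is symmetric; top eigenvalue is n - 1).
x = [1.0] * n
for _ in range(100):
    y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    length = math.sqrt(sum(v * v for v in y))
    x = [v / length for v in y]
op_norm = sum(x[i] * sum(A[i][j] * x[j] for j in range(n)) for i in range(n))
```

For $n = 100$ this yields $\|A\|_{\mathrm{HS}} = \sqrt{n^2 - n} \approx 99.5$, $\|A\|_{\mathrm{op}} = n - 1 = 99$, $\max_i \|(a_{ij})_j\|_2 = \sqrt{n-1} \approx 9.95$ and $\|A\|_\infty = 1$, matching the claimed scalings.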
Finally, let us state the result for general polynomials in random variables with bounded Orlicz norms. To fix some notation, if $f : \mathbb{R}^n \to \mathbb{R}$ is a function in $C^D(\mathbb{R}^n)$, for $d \le D$ we denote by $f^{(d)}$ the (symmetric) $d$-tensor of its $d$-th order partial derivatives.
Theorem 1.6. Let $X_1, \ldots, X_n$ be a set of independent random variables satisfying $\|X_i\|_{\Psi_{2/q}} \le M$ for some $q \in \mathbb{N}$ and $M > 0$. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a polynomial of total degree $D \in \mathbb{N}$. Then, for any $t > 0$,
$$\mathbb{P}(|f(X) - \mathbb{E} f(X)| \ge t) \le 2 \exp\Big( -\frac{1}{C_{D,q}} \min_{1 \le d \le D} \min_{\mathcal{J} \in P_{qd}} \Big( \frac{t}{M^d \|\mathbb{E} f^{(d)}(X)\|_{\mathcal{J}}} \Big)^{2/|\mathcal{J}|} \Big).$$
Note that if $f(X) = f_{D,A}(X)$ as in (1.7), only the $D$-th order tensor gives a contribution, i.e. we retrieve Theorem 1.4. We discuss Theorems 1.4 and 1.6 and compare them to known results in Subsection 1.2. A variant of Theorem 1.6 for polynomials in independent random variables with $\|X_i\|_{\Psi_\alpha} \le 1$ for any $\alpha \in (0,1]$ will be derived in Section 6.
Remark. With the help of these inequalities, it is possible to prove many results on concentration of linear and quadratic forms in independent random variables scattered throughout the literature. For example, [NSU17, Lemma A.6] is an immediate consequence of Theorem 1.4 (combined with Lemma A.1 for $f(X, X') = \sum_{i=1}^n a_i X_i X_i'$). In a similar way, one can deduce [Yan+17, Lemma C.4] by applying Theorem 1.4 to the random variables $Z_i := X_i Y_i$, whenever $(X_i, Y_i)$ is a vector with sub-exponential marginal distributions. More generally, one can consider a linear form (or higher order polynomial chaos) in a product of $k$ random variables $X_1, \ldots, X_k$ with sub-exponential tails, for which Lemma A.1 provides estimates for the $\Psi_{1/k}$ norm. Lastly, the results in [EYY12, Appendix B] can be sharpened for $\alpha \in (0,1] \cup \{2\}$ by a more general version of Proposition 1.5, using the same arguments as in [RV13, Section 3] to treat complex-valued matrices.
1.2. Related work. Inequalities for the $L^p$-norms of polynomial chaos have been established in various works. From these $L^p$-norm inequalities one can quite easily derive concentration inequalities. For a thorough discussion of inequalities involving linear forms in independent random variables we refer to [PG99, Chapter 1].
Starting with linear forms, there have been generalizations to certain classes of random variables as well as to multilinear forms of higher degree (also called polynomial chaoses). Among these are the two classes of random variables with either log-convex or log-concave tails (i.e. $t \mapsto -\log \mathbb{P}(|X| \ge t)$ is convex respectively concave). Two-sided $L^p$-norm estimates for the log-convex case were derived in [HMO97] for linear forms and in [KL15] for chaoses of all orders. On the other hand, for measures with log-concave tails similar two-sided estimates have been derived in [GK95; Lat96; Lat99; LŁ03; AL12] under different conditions. Moreover, two-sided estimates for non-negative random variables have been derived in [Mel16], and for chaos of order two in symmetric random variables satisfying the inequality $\|X\|_{2p} \le A \|X\|_p$ in [Mel17].
Our approach is closer to the work of Adamczak and Wolff, [AW15], where the case of polynomials in sub-Gaussian random variables has been treated. Lastly, let us mention the two results [EYY12, Lemma B.2, Lemma B.3] and [VW15, Corollary 1.6], where concentration inequalities for quadratic forms in independent random variables with α-sub-exponential tails have been proven.
To be able to compare our results to the results listed above, let us discuss their conditions. Firstly, the conditions of a bounded Orlicz norm and of log-convex or log-concave tails cannot be compared in general. It is known that random variables with log-convex tails satisfy $\|X\|_{\Psi_1} < \infty$. On the other hand, the tail function of any discrete random variable $X$ is a step function (for example, if $X$ has a geometric distribution, then $-\log \mathbb{P}(X \ge t)$ grows like $t \log(1/(1-p))$), which is neither log-convex nor log-concave, but $X$ can still have a finite $\Psi_\alpha$ norm for some $\alpha$. For example, a Poisson-distributed random variable $X$ satisfies $\|X\|_{\Psi_1} < \infty$.
The condition $\|X\|_{2p} \le \alpha \|X\|_p$ for all $p \ge 1$ and some $\alpha > 1$ used in the works of Meller implies finiteness of the $\Psi_{\tilde\alpha}$-norm for $\tilde\alpha := (\log_2 \alpha)^{-1}$. Especially in the case $\alpha = 2^d$ this yields the existence of the $\Psi_{1/d}$ norm. However, we want to stress that the results in [AL12; KL15; Mel16; Mel17] are two-sided and require very different tools.
Moreover, the two works of Schudy and Sviridenko [SS11; SS12] contain concentration inequalities for polynomials in so-called moment bounded random variables. Therein, a random variable $Z$ is called moment bounded with parameter $L > 0$ if for all $i \ge 1$ we have $\mathbb{E}|Z|^i \le iL\, \mathbb{E}|Z|^{i-1}$. Actually, using Stirling's formula, it is easy to see that moment-boundedness implies $\|Z\|_{\Psi_1} < \infty$, but it is not clear whether the converse implication also holds. However, there is no inequality of the form $L \le C \|X\|_{\Psi_1}$, as can be seen by taking $X \sim \mathrm{Ber}(p)$.
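The Bernoulli counterexample can be made concrete: for $X \sim \mathrm{Ber}(p)$ one computes $\mathbb{E}\exp(X/t) = 1 - p + p\,e^{1/t}$, so $\|X\|_{\Psi_1} = 1/\log(1 + 1/p)$, which tends to $0$ as $p \to 0$, while the moment-bounded parameter stays bounded away from $0$ (e.g. $i = 2$ forces $L \ge 1/2$). A short check of this closed form; the specific values of $p$ are arbitrary:

```python
import math

def psi1_bernoulli(p):
    """Closed form of ||X||_{Psi_1} for X ~ Ber(p): solve 1 - p + p*exp(1/t) = 2."""
    return 1.0 / math.log(1.0 + 1.0 / p)

for p in (0.5, 0.1, 0.001):
    t = psi1_bernoulli(p)
    # At t = ||X||_{Psi_1} the defining expectation equals 2 exactly.
    assert abs((1.0 - p + p * math.exp(1.0 / t)) - 2.0) < 1e-9

# The norm vanishes as p -> 0, so no bound L <= C * ||X||_{Psi_1} can hold uniformly.
small_norm = psi1_bernoulli(1e-6)
```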
Considering quadratic forms in random variables X which are moment bounded and centered, one can easily see that (apart from the constants) the bound in Proposition 1.5 is sharper than the corresponding inequality in [SS12, Theorem 1.1]. Since for log-convex distributions there are two-sided estimates, Proposition 1.5 is sharp in this class. Apart from quadratic forms, due to the different conditions and quantities, it is difficult to compare [SS12] and Theorem 1.6 in general.
1.3. Outline. In Section 2 we formulate and prove several applications which can be deduced from the main results. Section 3 contains the proof for the concentration inequalities for multilinear forms (Theorem 1.4). Thereafter, we provide the proof of Proposition 1.5 in Section 4 and of Theorem 1.6 in Section 5. Section 6 is devoted to some extensions of the main results for random variables with finite Orlicz-norms for any α ∈ (0, 1]. Lastly, we finish this note by collecting some elementary properties of the Orlicz-norms in the Appendix A.

Applications
In the following, we provide some applications of our main results. In particular, all the results in this section follow from either Proposition 1.1 or 1.5. For any random variables X 1 , . . . , X n we write X = (X 1 , . . . , X n ).
2.1. Concentration of the Euclidean norm of a vector with independent components. As a start, Proposition 1.1 can be used to give concentration properties of the Euclidean norm of a linear transformation of X consisting of independent, normalized random variables with sub-exponential tails. We give two different forms thereof. The first form is inspired by the results in [RV13] for sub-Gaussian random variables.
Proposition 2.1. Let $X_1, \ldots, X_n$ be independent random variables satisfying $\mathbb{E} X_i = 0$, $\mathbb{E} X_i^2 = 1$, $\|X_i\|_{\Psi_\alpha} \le M$ for some $\alpha \in (0,1] \cup \{2\}$, and let $B \ne 0$ be an $m \times n$ matrix. For any $c > 0$ and any $t \ge c \|B\|_{\mathrm{HS}}$ we have
$$\mathbb{P}\big( \big| \|BX\|_2 - \|B\|_{\mathrm{HS}} \big| \ge t \big) \le 2 \exp\Big( -\frac{C_{c,\alpha}}{M^4}\, \frac{\min\big( t^\alpha, t^{\alpha/2} \|B\|_{\mathrm{HS}}^{\alpha/2} \big)}{\|B\|_{\mathrm{op}}^\alpha} \Big).$$

Note that in the case $\alpha = 2$ the constant $c$ is not present on the right hand side, and thus we can choose any $t > 0$, which is exactly [RV13, Theorem 2.1]. In the general case, we need to restrict $t$ to be of the order $\|B\|_{\mathrm{HS}}$. The assumption of unit variance can be weakened with some minor modifications, i.e. $\|B\|_{\mathrm{HS}}$ has to be replaced by $(\sum_{i=1}^n \sigma_i^2 \sum_{j=1}^n b_{ij}^2)^{1/2}$ and the constant $C$ will depend on $\min_{i=1,\ldots,n} \sigma_i^2$. We omit the details.

Proof. First off, note that it suffices to prove the inequality for a matrix $B$ such that $\|B\|_{\mathrm{HS}} = 1$ and $t \ge c$, since the general case follows by considering $\tilde{B} := B \|B\|_{\mathrm{HS}}^{-1}$. Let us apply Proposition 1.1 to the matrix $A := B^T B$. An easy calculation shows that $\mathrm{trace}(A) = \mathrm{trace}(B^T B) = \|B\|_{\mathrm{HS}}^2 = 1$, so that we have for any $s > 0$
$$\mathbb{P}\big( \big| \|BX\|_2^2 - 1 \big| \ge s \big) \le 2 \exp\Big( -\frac{1}{C_\alpha M^4} \min\Big( \frac{s^2}{\|B\|_{\mathrm{op}}^2}, \frac{s^{\alpha/2}}{\|B\|_{\mathrm{op}}^\alpha} \Big) \Big). \tag{2.2}$$
Here we have used the estimates $\|A\|_{\mathrm{HS}}^2 \le \|B\|_{\mathrm{op}}^2 \|B\|_{\mathrm{HS}}^2 = \|B\|_{\mathrm{op}}^2$ and $\|A\|_{\mathrm{op}} \le \|B\|_{\mathrm{op}}^2$, as well as the fact that, by Lemma A.2, $\mathbb{E} X_i^2 = 1$ for any $i$ implies $M \ge C_\alpha > 0$. Moreover, since $t \ge c \ge c \|B\|_{\mathrm{op}}$, we may use $\min(c^{2-\alpha} t^\alpha, t^{\alpha/2}) \ge \min(c^{2-\alpha}, 1) \min(t^\alpha, t^{\alpha/2})$. Now, as in [RV13], we use the inequality $|z - 1| \le \min(|z^2 - 1|, |z^2 - 1|^{1/2})$, giving for any $t > 0$
$$\mathbb{P}\big( \big| \|BX\|_2 - 1 \big| \ge t \big) \le \mathbb{P}\big( \big| \|BX\|_2^2 - 1 \big| \ge \max(t, t^2) \big). \tag{2.3}$$
Hence, a combination of (2.2), (2.3) and $\min(\max(r, r^2), \max(r^{1/2}, r)) = r$ yields the claim for $t > c$.

The next corollary provides an alternative estimate for $\|BX\|_2$.

Corollary 2.2. Let $X_1, \ldots, X_n$ be independent, centered random variables satisfying the assumptions of Proposition 2.1. Then $\|BX\|_2$ satisfies a deviation inequality in terms of the four norms of $A = B^T B$ appearing in Proposition 1.5. Corollary 2.2 can be compared to various bounds on the norm $\|BX\|_2$ in the case that $X$ is a sub-Gaussian vector (see for example [HKZ12] or [Ada15]).
For sub-Gaussian vectors with sub-Gaussian constant 1, the known bounds contain terms corresponding to $\sqrt{x}$ and $x$, whereas in the sub-exponential case we need two additional terms to account for the heavier tails of the components.
Proof. Define the quadratic form $Z := \|BX\|_2^2 = \langle BX, BX \rangle = \langle X, B^T B X \rangle = \langle X, AX \rangle$. Using Proposition 1.5 with the matrix $A$ gives, with probability at least $1 - 2\exp(-x/C)$, a bound on $|Z - \mathbb{E} Z|$. From these inequalities the claim easily follows by taking the square root.
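A small simulation illustrating the concentration of $\|BX\|_2$ around $\|B\|_{\mathrm{HS}}$ for sub-exponential coordinates. Here $B = \mathrm{Id}$ and symmetrized exponential (Laplace) variables with unit variance are hypothetical choices for illustration; no claim about the propositions' constants is made.

```python
import math
import random

random.seed(0)

def laplace_unit_var():
    # Difference of two Exp(1) variables has variance 2; rescale to variance 1.
    return (random.expovariate(1.0) - random.expovariate(1.0)) / math.sqrt(2.0)

n = 400            # dimension; with B = Id we have ||B||_HS = sqrt(n)
samples = 200
hs = math.sqrt(n)

norms = []
for _ in range(samples):
    x = [laplace_unit_var() for _ in range(n)]
    norms.append(math.sqrt(sum(v * v for v in x)))

mean_ratio = sum(norms) / samples / hs   # should be close to 1
max_dev = max(abs(r / hs - 1.0) for r in norms)
```

The empirical mean of $\|X\|_2/\sqrt{n}$ is close to 1 and the individual deviations are of order $n^{-1/2}$, consistent with the concentration phenomenon described above.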
2.2. Projections of a random vector and distance to a fixed subspace. It is possible to apply Proposition 1.5 to any matrix $A$ associated to an orthogonal projection. In these cases, the norms can be explicitly calculated. Moreover, these norms do not depend on the structure of the subspace onto which one projects, but merely on its dimension. This leads to the following application, where we replace a fixed projection by a random one.
Corollary 2.3. Let $X_1, \ldots, X_n$ be independent random variables satisfying $\mathbb{E} X_i = 0$, $\mathbb{E} X_i^2 = \sigma_i^2$ and $\|X_i\|_{\Psi_1} \le M$. Furthermore, let $m < n$ and let $P$ be the (random) orthogonal projection onto an $m$-dimensional subspace of $\mathbb{R}^n$, distributed according to the Haar measure on the Grassmannian manifold $G_{m,n}$. For any $x > 0$, with probability at least $1 - 2\exp(-x/C)$, we have
$$\big| \|PX\|_2^2 - \mathbb{E} \|PX\|_2^2 \big| \le C M^2 \big( \sqrt{xm} + x + x^{3/2} + x^2 \big).$$

Proof of Corollary 2.3. This is an application of Proposition 1.5. To see this, note that $\|PX\|_2^2 = \langle X, PX \rangle$. Moreover, for any projection $P$ onto an $m$-dimensional subspace one can see that $\|P\|_{\mathrm{HS}} = \sqrt{m}$, $\|P\|_{\mathrm{op}} = 1$, $\max_i \|(P_{ij})_j\|_2 \le 1$ and $\|P\|_\infty \le 1$.

A very similar result, which follows from Proposition 2.1, is the following variant of [RV13, Corollary 3.1]. We use the notation $d(X, E) = \inf_{e \in E} d(X, e)$ for the distance between an element $X$ and a subset $E$ of a metric space $(M, d)$.
This follows exactly as in [RV13, Corollary 3.1] by using Proposition 2.1.
2.3. Spectral bound for a product of a fixed and a random matrix. We can also extend the second application in [RV13] to any α-sub-exponential random vector as follows.
Proposition 2.5. Let $B$ be a fixed $m \times N$ matrix and let $G$ be an $N \times n$ random matrix with independent entries satisfying $\mathbb{E} g_{ij} = 0$, $\mathbb{E} g_{ij}^2 = 1$ and $\|g_{ij}\|_{\Psi_\alpha} \le M$ for some $\alpha \in (0,1]$. For any $u, v \ge 1$, with probability at least $1 - 2\exp(-(u^\alpha + n v^\alpha)/C)$ we have
$$\|BG\|_{\mathrm{op}} \le C_\alpha M^{4/\alpha} \big( u \|B\|_{\mathrm{HS}} + v\, n^{1/\alpha} \|B\|_{\mathrm{op}} \big).$$

Proof. We mimic the proof of [RV13, Theorem 3.2]. For any fixed $x \in S^{n-1}$ consider the linear operator $T : \mathbb{R}^{Nn} \to \mathbb{R}^m$ given by $T(G) = BGx$, and (by abuse of notation) write $T$ for the matrix corresponding to this linear map in the standard basis. Using Proposition 2.1 applied to the matrix $T$, we obtain a tail bound for $|\,\|BGx\|_2 - \|T\|_{\mathrm{HS}}\,|$. Now, since $\|T\|_{\mathrm{HS}} = \|B\|_{\mathrm{HS}}$ and $\|T\|_{\mathrm{op}} \le \|B\|_{\mathrm{op}}$, this yields a bound for any $t \ge \|B\|_{\mathrm{HS}}$. If we define $t = (2CM^4)^{1/\alpha} u \|B\|_{\mathrm{HS}} + (\log(5) + 1)^{1/\alpha} v n^{1/\alpha} \|B\|_{\mathrm{op}}$ for arbitrary $u, v \ge 1$ and use the inequality $2(r + s)^\alpha \ge r^\alpha + s^\alpha$, valid for all $r, s \ge 0$ and $\alpha \in (0,1]$, we obtain the corresponding pointwise estimate. The last step is again a covering argument as in [RV13]. Choose a $1/2$-covering $\mathcal{N}$ of the unit sphere in $\mathbb{R}^n$ satisfying $|\mathcal{N}| \le 5^n$ (see [Ver12, Lemma 5.2]), and note that a union bound gives the assertion after upper bounding and simplifying the expression $2\|B\|_{\mathrm{HS}} + 2t$.

2.4. Special cases.
It is possible to apply all results to random variables having a Poisson distribution, i.e. $X_i \sim \mathrm{Poi}(\lambda_i)$ for some $\lambda_i \in (0, \infty)$. By using the moment generating function of the Poisson distribution, it is easily seen that
$$\|X_i\|_{\Psi_1} = \Big( \log\Big( 1 + \frac{\log 2}{\lambda_i} \Big) \Big)^{-1} =: g(\lambda_i).$$
The function $g$ is increasing and satisfies $g(x) \sim 1/\log(1/x)$ (for $x \to 0$) and $g(x) \sim x/\log(2)$ (for $x \to \infty$). More generally, if the random variable $|X|$ has a moment generating function $\varphi_{|X|}$ in a neighborhood of $0$, it can be used to explicitly calculate the $\Psi_1$-norm. Indeed, we have $\mathbb{E} \exp(|X|/t) = \varphi_{|X|}(t^{-1})$, and so $\|X\|_{\Psi_1} = 1/\varphi_{|X|}^{-1}(2)$. Thus, as a special case of Proposition 1.5, we obtain the following corollary.
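The closed form for the Poisson $\Psi_1$-norm can be verified directly from the moment generating function $\mathbb{E}\exp(X/t) = \exp(\lambda(e^{1/t} - 1))$; a short sketch, with arbitrary illustrative values of $\lambda$:

```python
import math

def g(lam):
    """||X||_{Psi_1} for X ~ Poi(lam): solve lam * (exp(1/t) - 1) = log 2 for t."""
    return 1.0 / math.log(1.0 + math.log(2.0) / lam)

for lam in (0.1, 1.0, 10.0):
    t = g(lam)
    # At t = g(lam) the defining expectation E exp(X/t) equals 2 exactly.
    assert abs(math.exp(lam * (math.exp(1.0 / t) - 1.0)) - 2.0) < 1e-9

# g is increasing and behaves like lam / log(2) for large lam.
large = 1e6
ratio = g(large) / (large / math.log(2.0))
```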
For Poisson chaos of arbitrary order d ∈ N, one may derive similar results by evaluating Theorem 1.4 or Corollary 6.1 (both for α = 1). Note though that already for d = 1, we lose a logarithmic factor in the exponent. However, we are not aware of any more refined fluctuation estimates for d ≥ 2.
Another interesting example of a sub-exponential random variable arises in stochastic geometry. If $K \subseteq \mathbb{R}^n$ is an isotropic convex body and $X$ is distributed according to the cone measure on $K$, then $\|\langle X, \theta \rangle\|_{\Psi_1} \le c$ for some constant $c$ and any $\theta \in S^{n-1}$. For the details and the proof we refer to [PTT18, Lemma 5.1].
2.5. Concentration properties for fixed design linear regression. It is possible to extend the example of the fixed design linear regression in [HKZ12] to the situation of a sub-exponential noise (instead of sub-Gaussian).
To this end, let $y_1, \ldots, y_n \in \mathbb{R}^d$ be fixed vectors (commonly called the design vectors), let $Y = (y_1, \ldots, y_n)$ (the $d \times n$ design matrix) and assume that the $d \times d$ matrix $YY^T$ is invertible. Here, $\beta$ is the coefficient vector of the least expected squared error and $\hat\beta(X)$ is its ordinary least squares estimator (given the observation $X$). The quality of the estimator $\hat\beta$ can be judged by the excess loss

as can be shown by elementary calculations.
Observe that this is a quadratic form in X i with coefficients depending on the vectors y i . Thus, Proposition 1.5 yields the following corollary.
Corollary 2.7. In the above setting, for any $x > 0$, the excess loss $R(X)$ satisfies a two-level deviation inequality of the form given in Proposition 1.5. Thus the concentration properties of $R(X)$ around its mean depend on the four different norms of the matrix $A$. The factor 4 appears due to the necessary centering of the $X_i$.

2.6. Central limit theorems for quadratic forms and random edge weights. In this section, our aim is to quantify central limit theorems for quadratic forms $Q(X) = Q_A(X) = \sum_{i,j} a_{ij} X_i X_j$ in sub-exponential random variables $X_1, \ldots, X_n$ using concentration of measure results. Typically, the first step is finding conditions such that $Q(X)$ can be approximated by a linear form $L(X)$. This reduces the problem to finding conditions such that $L(X)$ is asymptotically normal (e.g. using the Lyapunov central limit theorem).
The weak convergence of quadratic forms to a normal distribution is classical, and we refer to [Jon87] and [GT99], [Cha08] for general statements (and rates of convergence), as well as [PG99] for general statements on central limit theorems for U -statistics.
Let us first consider the task of approximating $Q(X)$ by a linear form $L(X)$. To this end, assume that $A$ is symmetric with vanishing diagonal. Then we may decompose $Q(X)$ into a linear part $L(X)$ and a degenerate remainder (this is in fact the Hoeffding decomposition of $Q(X)$), and we therefore define $c_n := \mathrm{Var}(L(X))$. Under condition (2.7), the asymptotic behavior of the properly normalized quadratic form is dominated by the linear term. Under additional assumptions on the tail behavior of the $X_i$, the approximation can also be quantified.

Proof. Rewrite the Hoeffding decomposition of $Q$ and recall $c_n = \mathrm{Var}(L(X))$. An application of Theorem 1.4 then yields the stated approximation bound. In the case that the $X_i$ are also identically distributed, (2.7) is equivalent to a condition on the matrix $A$ alone.

We may apply these results to sequences of graphs. Here we always assume that the $X_i$ are identically distributed. For each $n$, let $G_n = (V_n, E_n)$ be some undirected graph on $n$ nodes (which we may consider as a kind of "base graph"). If $A = A^{(n)}$ denotes its adjacency matrix, then (2.7) can be rewritten in terms of $A^{(n)}$, leading to condition (2.10). Sequences of graphs satisfying (2.10) are the complete graph, the complete bipartite graph $G_n = K_{m_1(n), m_2(n)}$ for parameters $m_1(n), m_2(n)$ satisfying $m_1(n) + m_2(n) \to \infty$, and $d_n$-regular graphs for $d_n \to \infty$.
The example of the $n$-stars shows that (2.10) is not sufficient for a central limit theorem of the quadratic form. Indeed, in this case we have $Q(X) = 2 X_1 \sum_{i=2}^n X_i$, where $1$ is the vertex with degree $n - 1$. As is easily seen, $Q(X) = 0$ on $\{X_1 = 0\}$, and thus if the $X_i$ are Bernoulli distributed, the distribution has an atom which does not vanish for $n \to \infty$.
Finally, let us provide an example of a sequence of graphs for which a central limit theorem can be shown by imposing additional conditions. Here we assume that the random variables $X_i$ are non-negative. In this case, they can be used to define edge weights $w_n(X) : E_n \to \mathbb{R}_+$ by $w_n(\{i,j\})(X) = X_i X_j$. Also let $W_n(X) := \sum_{e \in E_n} w_n(e)(X)$ be the total edge weight. Note that $W_n(X) = \frac{1}{2} \langle AX, X \rangle$ for the adjacency matrix $A$ of $G_n$, since $\langle AX, X \rangle$ counts each edge twice.
Note that $W_n(X)$ is neither a sum of independent random variables, nor can it be written as a sum of an $m$-dependent sequence, since $w(e)(X)$ and $w(f)(X)$ are dependent whenever $e \cap f \ne \emptyset$.
In the case that X ∼ Ber(p), the quantity W n (X) has a nice interpretation. If we interpret X v = 0 as a failed vertex in the "base graph" G n , W n (X) is the number of edges in the subgraph that is induced by the (random) vertex set {v ∈ V n : X v = 1}.
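For Bernoulli $X$, the relation between the total edge weight and the quadratic form (each edge is counted twice in $\langle AX, X \rangle$) can be sanity-checked on a small graph; the 5-cycle and the number of trials are arbitrary illustrative choices.

```python
import random

random.seed(1)

# Adjacency matrix of the 5-cycle (symmetric, vanishing diagonal).
n = 5
edges = [(i, (i + 1) % n) for i in range(n)]
A = [[0] * n for _ in range(n)]
for i, j in edges:
    A[i][j] = A[j][i] = 1

def quad_form(A, x):
    return sum(A[i][j] * x[i] * x[j] for i in range(len(x)) for j in range(len(x)))

checks = []
for _ in range(20):
    x = [random.randint(0, 1) for _ in range(n)]  # Bernoulli vertex indicators
    surviving_edges = sum(1 for i, j in edges if x[i] == 1 and x[j] == 1)
    # W_n(X) counts each edge once; <AX, X> counts it twice.
    checks.append(quad_form(A, x) == 2 * surviving_edges)

all_ok = all(checks)
```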
Proof. Consider the linear approximation $L(X)$ given in Lemma 2.8. It is also easy to see that condition (2.11) implies Lyapunov's condition with $\delta = 1$. Consequently, by Lindeberg's central limit theorem, $c_n^{-1/2} L(X)$ converges in distribution to $N(0,1)$.
The claim now easily follows by combining Lemma 2.8 and Slutsky's theorem.
It should be possible to extend the result to any sequence of random graphs satisfying (2.10) and (2.11) by conditioning. Moreover, with appropriately modified conditions, by a more refined analysis it is possible to vary the sub-exponential constant M with n. We omit the details.
3. The multilinear case: Proof of Theorem 1.4

To begin with, let us introduce some notation. Define $[n] := \{1, \ldots, n\}$, and let $i = (i_1, \ldots, i_d) \in [n]^d$ be a multi-index. For any subset $C \subseteq [d]$ with cardinality $|C| > 1$, we may introduce the "generalized diagonal" of $[n]^d$ with respect to $C$ by
$$\Delta_C := \{ i \in [n]^d : i_k = i_l \text{ for all } k, l \in C \}.$$
This notion of generalized diagonals naturally extends to $d$-tensors $A = (a_i)_{i \in [n]^d}$ (obviously, the generalized diagonal of $A$ with respect to $C$ is the set of coefficients $a_i$ such that $i$ lies on the generalized diagonal of $[n]^d$ with respect to $C$). If $d = 2$ and $C = \{1, 2\}$, this gives back the usual notion of the diagonal of an $n \times n$ matrix. Moreover, we write $P_d$ for the set of all partitions of $[d]$. In particular, we may regard $A$ as a multilinear form by setting
$$A(x^{(1)}, \ldots, x^{(d)}) := \sum_{i \in [n]^d} a_i\, x^{(1)}_{i_1} \cdots x^{(d)}_{i_d}.$$
The latter idea may be generalized by noting that any partition $\mathcal{J} = \{J_1, \ldots, J_k\}$ of $[d]$ induces a decomposition of the space of $d$-tensors as follows. Identify the space of all $d$-tensors with $\mathbb{R}^{n^d}$ and decompose it according to $\mathcal{J}$ into a tensor product of spaces of lower-order tensors. For any $x = x^{(1)} \otimes \cdots \otimes x^{(k)}$, the identification with a $d$-tensor is given by $x_{i_1 \ldots i_d} = \prod_{l=1}^k x^{(l)}_{i_{J_l}}$. For example, for $d = 4$ and $\mathcal{J} = \{\{1,4\}, \{2,3\}\}$ we have two matrices $x, y$ and $x_{i_1 i_2 i_3 i_4} = x_{i_1 i_4} y_{i_2 i_3}$. Using this representation, any $d$-tensor $A$ can be trivially identified with a linear functional on $\mathbb{R}^{n^d}$ via the standard scalar product, i.e. $x \mapsto \langle A, x \rangle$.
These identifications give rise to a family of tensor-product matrix norms: for any partition $\mathcal{J} = \{J_1, \ldots, J_k\} \in P_d$, define a norm on the space (3.2) by
$$\|x^{(1)} \otimes \cdots \otimes x^{(k)}\|_{\mathcal{J}} := \prod_{l=1}^k \|x^{(l)}\|_2,$$
where $\|x^{(l)}\|_2$ denotes the Euclidean (Hilbert-Schmidt) norm of the $|J_l|$-tensor $x^{(l)}$. Now, we may define $\|A\|_{\mathcal{J}}$ as the operator norm with respect to $\|\cdot\|_{\mathcal{J}}$:
$$\|A\|_{\mathcal{J}} := \sup\big\{ \langle A, x^{(1)} \otimes \cdots \otimes x^{(k)} \rangle : \|x^{(l)}\|_2 \le 1 \text{ for all } l = 1, \ldots, k \big\}.$$

This family of tensor norms agrees with the definitions in [Lat06] and [AW15] (among others).
Next we extend these definitions to a family of norms $\|A\|_{\mathcal{J}}$ where $A$ is a $d$-tensor but $\mathcal{J} \in P_{qd}$ for some $q \in \mathbb{N}$. To this end, we first embed $A$ into the space of $qd$-tensors. Indeed, denote by $e_q(A)$ the $qd$-tensor given by
$$e_q(A)_{i_1 \ldots i_{qd}} := \begin{cases} a_{i_1 i_{q+1} \cdots i_{(d-1)q+1}} & \text{if } i_{(j-1)q+1} = \ldots = i_{jq} \text{ for all } j = 1, \ldots, d, \\ 0 & \text{otherwise}. \end{cases} \tag{3.4}$$
In other words, we divide $i \in [n]^{qd}$ into $d$ consecutive blocks with $q$ indices in each block, $(i_1, \ldots, i_q), (i_{q+1}, \ldots, i_{2q}), \ldots$, and only consider those indices for which all elements of each block take the same value. In fact, this is an intersection of $d$ "generalized diagonals". Now we set
$$\|A\|_{\mathcal{J}} := \|e_q(A)\|_{\mathcal{J}} \quad \text{for any } \mathcal{J} \in P_{qd}. \tag{3.5}$$
In view of (3.6), the two "extreme" norms corresponding to the coarsest and the finest partition of $[qd]$ deserve special attention. Firstly, it is elementary that
$$\|A\|_{\{[qd]\}} = \|e_q(A)\|_{\mathrm{HS}} = \|A\|_{\mathrm{HS}} \quad \text{and} \quad \|A\|_{\{\{1\}, \ldots, \{qd\}\}} = \max_{i} |a_i| = \|A\|_\infty.$$
To prove Theorem 1.4, we furthermore need some auxiliary results. The first one compares the moments of sums of random variables with finite Orlicz norms to moments of Gaussian polynomials, and the second one provides the estimates for multilinear forms in Gaussian random variables.
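For a matrix ($d = 2$) and $q = 1$, the two extreme norms of the base family of this section reduce to familiar objects: the coarsest partition gives the Hilbert-Schmidt norm, while $\|A\|_{\{\{1\},\{2\}\}} = \sup_{|x| = |y| = 1} \langle x, Ay \rangle$ is the operator norm. A numeric sketch; the example matrix and the power iteration on $A^T A$ are illustrative choices:

```python
import math

A = [[0.0, 2.0],
     [1.0, 0.0]]

# ||A||_{{1,2}}: supremum of <A, x> over HS-unit 2-tensors x, i.e. the HS norm.
hs_norm = math.sqrt(sum(a * a for row in A for a in row))

# ||A||_{{1},{2}}: supremum of <x, Ay> over unit vectors, i.e. the operator norm,
# computed as the square root of the top eigenvalue of A^T A via power iteration.
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

AtA = [[sum(A[k][i] * A[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
v = [1.0, 1.0]
for _ in range(200):
    w = matvec(AtA, v)
    length = math.sqrt(sum(c * c for c in w))
    v = [c / length for c in w]
op_norm = math.sqrt(sum(v[i] * matvec(AtA, v)[i] for i in range(2)))
```

Here the singular values of the example matrix are 2 and 1, so the two norms are $\sqrt{5}$ and $2$, illustrating the strict gap between the coarsest and finest partition norms.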
In the proof of Theorem 1.4, we actually show L p -estimates for f d,A (X). The following proposition provides the link to concentration inequalities. It was originally proven by Adamczak in [Ada06] and [AW15], while at this point we cite it in the form given in [SS18], with a small modification to adjust the constant in front of the exponential.
Proof of Theorem 1.4. For simplicity, we always write $f(X) := f_{d,A}(X)$. Moreover, without loss of generality, we may assume the $X_i$ to be centered. Let $X^{(1)}, \ldots, X^{(d)}$ be independent copies of the random vector $X$. Take a set of i.i.d. Rademacher variables $(\varepsilon^{(j)}_i)$, $i \le n$, $j \le d$, which are independent of the $(X^{(j)})_j$. By standard decoupling and symmetrization inequalities (see [PG99]), the moments of $f(X)$ can be bounded by those of the decoupled and symmetrized chaos. An iteration of Lemma 3.2 together with $\|X_i\|_{\Psi_{2/q}} \le M$ hence leads to a comparison with a Gaussian chaos of order $qd$, where $(g^{(j)}_{i,k})$ is an array of i.i.d. standard Gaussian random variables. Rewriting (recall (3.4)) and applying Theorem 3.3 yields the moment estimates. The proof is now easily completed by applying Proposition 3.4.

4. Hanson-Wright-type inequality: Proof of Proposition 1.5

The main task in the proof of Proposition 1.5 is explicitly calculating the norms $\|A\|_{\mathcal{J}}$.
In the third step, we have iteratively used that for $x^{(j)}$ with $\|x^{(j)}\|_2 \le 1$ we also have $|x^{(j)}_i| \le 1$, and applied the Cauchy-Schwarz inequality $d$ times.
To obtain the lower bound, let $l_1, \ldots, l_d$ be the indices which achieve the maximum, and let $x^{(1)} = \ldots = x^{(q)} = \delta_{l_1}$, $x^{(q+1)} = \ldots = x^{(2q)} = \delta_{l_2}$ and so on.

The following easy observation helps in calculating the norms $\|\cdot\|_{\mathcal{J}}$. For any partition $\mathcal{J} = \{J_1, \ldots, J_k\} \in P_{qd}$ we write $\bar{\mathcal{J}} = \{\bar{J}_1, \ldots, \bar{J}_k\}$ for the system of sets
$$\bar{J}_j := \{ l \in [d] : J_j \cap \{(l-1)q + 1, \ldots, lq\} \ne \emptyset \}.$$
That is, the sets $\bar{J}_j$ indicate which of the $d$ $q$-blocks intersect $J_j$. Note that $\cup_j \bar{J}_j = [d]$, but $\bar{\mathcal{J}}$ need not be a partition of $[d]$. In fact, some sets may even appear more than once (with a slight abuse of notation, we choose to keep the set notation in this case anyway). Note that Remark 3.1 extends from partitions to decompositions (all definitions remain valid, even in the case of some sets appearing multiple times). Nevertheless, we have by definition $\|A\|_{\mathcal{J}} = \|A\|_{\bar{\mathcal{J}}}$, i.e. the norm does not depend on $\mathcal{J}$, but only on its "projection" $\bar{\mathcal{J}}$. We will use this observation in the next lemma to calculate the norms $\|A\|_{\mathcal{J}}$ for quadratic forms (i.e. $d = 2$) and any $q \ge 2$.
Proof. To see (1), write $\tilde{\mathcal{J}} = \{\tilde{J}_1, \ldots, \tilde{J}_k\}$ and use the triangle inequality together with the fact that $\|x\|_\infty \le \|x\|_{\mathrm{HS}}$ for any tensor $x$, where the supremum is taken over all unit vectors $x^{(k)}$. The lower bound follows from (3.6) and Lemma 4.1.
(3) follows from the triangle and Cauchy-Schwarz inequalities. The lower bound is obtained by choosing $y^1, \ldots, y^l$ as a Dirac delta on the row for which $\max_i \|A_{i\cdot}\|$ is attained. To see (4), note that the case $k \ge 2$, $l \ge 2$ is very similar to the second part. If $l = 1$, $k \ge 2$ or $k = 1$, $l \ge 2$, similar arguments as in the third part give the analogous bound for any $x, y^1, \ldots, y^l$ with norm at most one. The lower bound again follows by choosing suitable Dirac deltas.
We are now ready to prove Proposition 1.5. Throughout the rest of this section, for a matrix $A$ we denote by $A^{\mathrm{od}}$ its off-diagonal part and by $A^{\mathrm{d}}$ its diagonal part.
Proof of Proposition 1.5. Lemma 4.2 shows that we only need to consider the four norms $\|A\|_{\mathrm{HS}}$, $\|A\|_{\mathrm{op}}$, $\max_{i=1,\ldots,n} \|(a_{ij})_j\|_2$ and $\|A\|_\infty$. It is easy to see that $\|A\|_{\mathrm{HS}} \ge \|A\|_{\mathrm{op}} \ge \max_i \|(a_{ij})_j\|_2 \ge \|A\|_\infty$. Thus, we need to determine which partitions give rise to which norms.
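This hierarchy of matrix norms is easy to check numerically. The following is a minimal sketch (the variable names are ours, not from the text) for a random symmetric matrix:

```python
# Numerically verify ||A||_HS >= ||A||_op >= max_i ||(a_ij)_j||_2 >= ||A||_inf
# for a random symmetric matrix, as used in the proof of Proposition 1.5.
import numpy as np

rng = np.random.default_rng(0)
n = 8
B = rng.standard_normal((n, n))
A = (B + B.T) / 2  # a symmetric matrix

hs = np.linalg.norm(A, "fro")          # Hilbert-Schmidt norm ||A||_HS
op = np.linalg.norm(A, 2)              # operator (spectral) norm ||A||_op
row = np.linalg.norm(A, axis=1).max()  # max_i ||(a_ij)_j||_2, largest row norm
sup = np.abs(A).max()                  # ||A||_inf, largest entry in modulus

assert hs >= op >= row >= sup
```

Each norm dominates the next because it takes a supremum over a strictly larger set of test vectors.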
The only partition producing the Hilbert-Schmidt norm is $\mathcal{J}_1 = \{[qd]\}$, with $|\mathcal{J}_1| = 1$. The operator norm appears for the decomposition $\mathcal{J}_2 = \{\{1, \ldots, q\}, \{q+1, \ldots, 2q\}\}$ with $|\mathcal{J}_2| = 2$. Moreover, it is easy to see that all partitions $\mathcal{J}_3$ of $[2q]$ giving rise to $\max_{i=1,\ldots,n} \|(a_{ij})_j\|_2$ satisfy $|\mathcal{J}_3| \in \{2, \ldots, q+1\}$. Finally, for all $k = 2, \ldots, 2q$ there are partitions $\mathcal{J}_4$ with $|\mathcal{J}_4| = k$ producing $\|A\|_\infty$. Hence, for a diagonal-free matrix $A$, the claimed bound follows by simply plugging the norms calculated in Lemmas 4.1 and 4.2 into Theorem 1.4. In the last two terms, we may choose the largest admissible cardinality, since otherwise the minimum is attained in the term $t^2 / \|A\|_{\mathrm{HS}}^2$. For matrices with non-vanishing diagonal, we split the quadratic form into an off-diagonal and a purely diagonal part, i. e.
For brevity, let us define $P(t)$ to be the right-hand side of the bound for the off-diagonal part. Using the above decomposition and subadditivity, equation (4.3) can be used to upper bound $p_1(t)$. The diagonal term can be treated by applying Theorem 1.4 for $d = 1$, $q = 4$ and $a = (A_{ii})_{i=1,\ldots,n}$. Moreover, it is easy to see that $\|a\|_{\{1,2,3,4\}} = (\sum_i A_{ii}^2)^{1/2}$ (cf. (3.7)) and $\|a\|_{\mathcal{J}} = \|A^{\mathrm{d}}\|_\infty$ for any other decomposition $\mathcal{J}$. Now it remains to lower bound the minimum by grouping the terms according to the different powers of $t$. Lastly, from the characterization $\|A\|_{\mathrm{op}} = \sup_{x \in S^{n-1}} |\langle x, Ax \rangle|$ it is easily seen that $\|A^{\mathrm{d}}\|_\infty \le \|A\|_{\mathrm{op}}$ and $\|A^{\mathrm{od}}\|_{\mathrm{op}} \le 2\|A\|_{\mathrm{op}}$, and the constant 4 can be changed to 2 by adjusting the constant in the exponent.
5. The polynomial case: Proof of Theorem 1.6
Let us now treat the case of general polynomials $f(X)$ of total degree $D \in \mathbb{N}$. Before we start, we need to discuss some more properties of the norms $\|A\|_{\mathcal{J}}$. To this end, recall the Hadamard product of two $d$-tensors $A, B$, given by $A \circ B := (a_{\mathbf{i}} b_{\mathbf{i}})_{\mathbf{i} \in [n]^d}$ (pointwise multiplication). If we interpret a $d$-tensor as a function $[n]^d \to \mathbb{R}$, we may define "indicator matrices" $1_C$ for a set $C \subseteq [n]^d$ by setting $1_C = (a_{\mathbf{i}})_{\mathbf{i}}$ with $a_{\mathbf{i}} = 1$ if $\mathbf{i} \in C$ and $a_{\mathbf{i}} = 0$ otherwise. If $|\mathcal{J}| > 1$, the contraction inequality $\|A \circ 1_C\|_{\mathcal{J}} \le \|A\|_{\mathcal{J}}$ (5.1) does not hold in general. However, [AW15, Lemma 5.2] shows a number of situations in which such an inequality does hold.
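For the trivial partition $\mathcal{J} = \{[qd]\}$ (the Hilbert-Schmidt norm), (5.1) always holds, since the Hadamard product with an indicator only zeroes out entries. A minimal sketch for $d = 2$ (variable names are ours):

```python
# Hadamard product A ∘ 1_C with the indicator of the diagonal C = {i : i_1 = i_2}.
# For the Hilbert-Schmidt norm (partition of size 1) the contraction
# inequality ||A ∘ 1_C||_HS <= ||A||_HS holds trivially.
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))  # a 2-tensor for illustration

one_C = np.eye(n)       # indicator tensor of the generalized diagonal
hadamard = A * one_C    # pointwise product keeps only entries in C

assert np.linalg.norm(hadamard) <= np.linalg.norm(A)
```

For partitions with $|\mathcal{J}| > 1$, the injective norms involve suprema over products of unit vectors, and zeroing entries can *increase* them, which is why the lemma below is needed.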
Lemma 5.1. Let $A = (a_{\mathbf{i}})_{\mathbf{i} \in [n]^d}$ be a $d$-tensor.
There is a further situation in which a version of (5.1) holds. Indeed, for any partition $K \in P_d$ consider the set $L(K)$ of those indices $\mathbf{i}$ for which the partition of $[d]$ into the level sets of $\mathbf{i}$ is equal to $K$.
Lemma 5.2. Let $\mathcal{J} \in P_{qd}$, $K \in P_d$ and let $A$ be a $d$-tensor. Then the contraction inequality for $e_q(1_{L(K)})$ holds. Proof. This is a generalization of [AW15, Corollary 5.3], which corresponds to the case $q = 1$. First note that, by definition, it suffices to prove the corresponding inequality for any $qd$-tensor $B$. To see this, observe that $e_q(1_{L(K)})$ is the indicator matrix of a set $C$ which can be written as an intersection of $|K|$ generalized diagonals (with the cardinality of the underlying sets of indices in (3.1) always being an integer multiple of $q$) and $|K|(|K|-1)/2$ sets of the form $\{\mathbf{i} : i_{kq+1} \ne i_{lq+1}\}$ for $k < l$, using Lemma 5.1 (2) in the last step. As a consequence, the claim follows by applying Lemma 5.1 (2) again and a generalization of Lemma 5.1 (3).
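To make the set $L(K)$ concrete, the following sketch (for $q = 1$ and small $n$, $d$; all names are ours) enumerates the indices whose level-set partition equals a given $K$:

```python
# Enumerate L(K): the indices i in [n]^d whose level-set partition of the
# positions {1,...,d} equals K (illustrated for n = 3, d = 3, q = 1).
from itertools import product

def level_set_partition(i):
    """Partition of the positions {1,...,d} into the level sets of i."""
    levels = {}
    for pos, val in enumerate(i, start=1):
        levels.setdefault(val, set()).add(pos)
    return sorted((frozenset(s) for s in levels.values()), key=min)

n, d = 3, 3
K = [frozenset({1, 2}), frozenset({3})]  # i_1 = i_2, with i_3 different
L_K = [i for i in product(range(n), repeat=d)
       if level_set_partition(i) == K]

assert (0, 0, 1) in L_K and (0, 1, 0) not in L_K
assert len(L_K) == n * (n - 1)  # a value for {1,2}, then a distinct one for {3}
```

The count $n(n-1)$ reflects exactly the structure used in the proof: membership in $L(K)$ forces equality within each block of $K$ (a generalized diagonal) and inequality across blocks.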
Finally, it remains to note that [AW15, Lemma 5.1] can be generalized as follows.
Lemma 5.3. Let $A$ be a $d$-tensor, and let $v^1, \ldots, v^d \in \mathbb{R}^n$ be any vectors. Then the corresponding bound holds for any partition $\mathcal{J} \in P_{qd}$. Recall equations (4.1) and (4.2). To see the third step, for each $v^l$ we choose a set $J_j$ such that $l \in J_j$ and then define vectors $\tilde{x}$ by the components of the vectors $v^l$ which were attributed to $J_j$; in particular, the product is taken over all the vectors $v^l$ which were attributed to $\tilde{x}$. Before we begin with the proof of the concentration results for general polynomials, let us fix some notation. Boldfaced letters will always represent a vector (mostly a multi-index with integer components), and for any vector $\mathbf{i}$ we let $|\mathbf{i}| := \sum_j i_j$. Given two vectors $\mathbf{k}, \mathbf{l}$ of equal size, we write $\mathbf{k} \le \mathbf{l}$ if $k_j \le l_j$ for all $j$, and $\mathbf{k} < \mathbf{l}$ if $\mathbf{k} \le \mathbf{l}$ and there is at least one index such that $k_j < l_j$. Lastly, by $f \lesssim g$ we mean an inequality of the form $f \le C_{D,q}\, g$.
Proof of Theorem 1.6. We assume $M = 1$. For the general case, given random variables $X_1, \ldots, X_n$ with $\|X_i\|_{\Psi_{2/q}} \le M$, define $Y_i := M^{-1} X_i$. The polynomial $f = f(X)$ can be written as a polynomial $\tilde{f} = \tilde{f}(Y)$ by appropriately modifying the coefficients, i. e. multiplying each monomial by $M^r$, where $r$ is its total degree. Now it remains to see that $\partial_{i_1 \ldots i_j} \tilde{f}(Y) = M^j \partial_{i_1 \ldots i_j} f(X)$.
Step 1. First, we reduce the problem to generalizations of chaos-type functionals (1.7). Indeed, by sorting according to the total degree, $f$ may be represented as a sum of such functionals, where the constants satisfy $c^{(d)}_{(i_1,k_1),\ldots,(i_\nu,k_\nu)} = c^{(d)}_{(i_{\pi_1},k_{\pi_1}),\ldots,(i_{\pi_\nu},k_{\pi_\nu})}$ for any permutation $\pi \in S_\nu$. As in [AW15], by rearranging and making use of the independence of $X_1, \ldots, X_n$, this leads to a moment estimate. Step 2. Note that $\|X_i^k\|_{\psi_{2/(qk)}} = \|X_i\|_{\psi_{2/q}}^k \le 1$. Thus, slightly modifying the proof of Theorem 1.4 (in particular, also using Lemma 3.2 for the non-linear terms), we obtain an $L^p$-estimate in terms of products $g^{(\nu)}_{i_\nu,1} \cdots g^{(\nu)}_{i_\nu,q k_\nu}$.
Here, $(g^{(j)}_{i,k})$ is an array of i.i.d. standard Gaussian random variables. Moreover, the family $(a^{\mathbf{k}}_{\mathbf{i}})_{\nu \in \{1,\ldots,d\}, \mathbf{k} \in I_{\nu,d}, \mathbf{i} \in [n]^\nu}$ gives rise to a $d$-tensor $A^{(d)}$ as follows. Given any index $\mathbf{i} = (i_1, \ldots, i_d)$, there is a unique number $r \in \{1, \ldots, d\}$ of distinct elements $j_1, \ldots, j_r$, with each $j_l$ appearing exactly $k_l$ times in $\mathbf{i}$. Consequently, we set $a_{i_1 \ldots i_d} := a^{(k_1,\ldots,k_r)}_{j_1,\ldots,j_r}$, and $A^{(d)} = (a_{\mathbf{i}})_{\mathbf{i} \in [n]^d}$. Note that this is well-defined due to the symmetry assumption.
For any $\mathbf{k} \in I_{\nu,d}$, denote by $K(\mathbf{k}) = K(k_1, \ldots, k_\nu) \in P_d$ the partition defined by splitting the set $\{1, \ldots, d\}$ into consecutive intervals of lengths $k_1, \ldots, k_\nu$. Now, recalling the definitions of $e_q$ (3.4) and of $L(K)$ (5.2), rewriting and applying Lemma 5.1 yields (5.3). Step 3. Next, we replace $\|A^{(d)}\|_{\mathcal{J}}$ by $\|\mathbb{E} f^{(d)}(X)\|_{\mathcal{J}}$. To this end, first note that for $\mathbf{i} \in [n]^d$ with distinct indices $j_1, \ldots, j_\nu$ which are taken $l_1, \ldots, l_\nu$ times, the remainder corresponds to the set of indices $\mathbf{k}$ satisfying $\mathbf{k} > \mathbf{l}$. If $d = D$, we clearly have $R^{(d)}_{\mathbf{i}} = 0$, and therefore the identity (5.4) holds, where $I = \{I_1, \ldots, I_\nu\}$ is the partition given by the level sets of the index $\mathbf{i}$. It follows that the desired estimate holds for any partition $\mathcal{J} \in P_{qD}$, using the partition of unity $1 = \sum_{K \in P_D} 1_{L(K)}$ and the triangle inequality in the first, equation (5.4) in the second and Lemma 5.2 in the last step. The proof is now completed by induction. More precisely, in the next step we will show that (5.5) holds for any $d \in \{1, \ldots, D-1\}$ and any partitions $I = \{I_1, \ldots, I_\mu\} \in P_d$, $\mathcal{J} = \{J_1, \ldots, J_\nu\} \in P_{qd}$. Having (5.5) at hand, it follows by reverse induction and Lemma 5.2 that the tensor norms can be bounded by the corresponding derivative norms. Plugging this into (5.3) and applying Proposition 3.4 finishes the proof.
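The partition $K(\mathbf{k})$ is purely combinatorial and can be sketched in a few lines (function name is ours):

```python
# K(k_1,...,k_nu): split {1,...,|k|} into consecutive intervals of the
# given lengths, as defined at the start of Step 2/3 above.
def consecutive_partition(k):
    """Return the list of consecutive blocks of {1,...,sum(k)}."""
    blocks, start = [], 1
    for length in k:
        blocks.append(set(range(start, start + length)))
        start += length
    return blocks

# e.g. d = 6 split according to k = (2, 1, 3)
assert consecutive_partition((2, 1, 3)) == [{1, 2}, {3}, {4, 5, 6}]
```

By construction the blocks are disjoint and their union is $\{1, \ldots, |\mathbf{k}|\}$, so the output is indeed an element of $P_d$ with $d = |\mathbf{k}|$.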
Step 4: To show (5.5), let us analyze the "remainder tensors" $R^{(d)}$ in more detail. To this end, fix $d \in \{1, \ldots, D-1\}$ and partitions $I = \{I_1, \ldots, I_\nu\} \in P_d$, $\mathcal{J} = \{J_1, \ldots, J_\mu\} \in P_{qd}$, and let $\mathbf{l}$ be the vector with $l_\alpha := |I_\alpha|$ (note that this implies $|\mathbf{l}| = d$). For any $\mathbf{k} \in I_{\nu,\le D}$ with $\mathbf{k} > \mathbf{l}$, we define a $d$-tensor $S^{(\mathbf{k})}$, where we denote by $j_\alpha$ the value of $\mathbf{i}$ on the level set $I_\alpha$. Clearly, it then remains to prove that there is a partition $K \in P_{q|\mathbf{k}|}$ with $|K| = |\mathcal{J}|$ such that (5.8) holds. The required tensor will be given by an appropriate embedding of the $d$-tensor $S^{(\mathbf{k})}$. We choose the partition $K = \{K_1, \ldots, K_\mu\}$ defined in the following way: for any $j$, we have $J_j \subset K_j$, so that it remains to assign the elements $r \in \{qd+1, \ldots, q|\mathbf{k}|\}$ to the sets $K_j$. Write $r = \eta q + m$ for some $\eta \in \{d, \ldots, |\mathbf{k}|-1\}$ and $m \in \{1, \ldots, q\}$.
We claim that (5.8) holds for this choice of $K$. To see this, let $x^{(\beta)} = (x^{(\beta)}_{\mathbf{i}_{J_\beta}})$, $\beta = 1, \ldots, \mu$, be a collection of vectors satisfying $\|x^{(\beta)}\|_2 \le 1$. This gives rise to a further collection of unit vectors $y^{(\beta)}$ supported on indices with $i_r = i_{\pi(r)}$ (recall the definition of $\pi(r)$ given in the paragraph above). The required identities now follow from the definition of the tensor $\tilde{S}^{(|\mathbf{k}|)}$ and the fact that if $\mathbf{i} \in e_q(L(\tilde{I}))$, then $i_r = i_{\pi(r)}$ for $r > qd$. As this holds true for any collection $x^{(\beta)}$, we obtain (5.8).
6. The general sub-exponential case: $\alpha \in (0, 1]$
Using slightly different techniques than in the proofs of Theorem 1.4 and Theorem 1.6, we may obtain concentration results for polynomials in independent random variables with bounded $\psi_\alpha$-norms for any $\alpha \in (0, 1]$. Here, the key difference is that we compare their moments not to products of Gaussians but to Weibull variables. To this end, we need some more notation. Let $A = (a_{\mathbf{i}})_{\mathbf{i} \in [n]^d}$ be a $d$-tensor and $I \subset [d]$ a set of indices. Then, for any $\mathbf{i}_I := (i_j)_{j \in I}$, we denote by $A_{\mathbf{i}_I} = (a_{\mathbf{i}})_{\mathbf{i}_{I^c}}$ the $(d - |I|)$-tensor obtained by fixing $i_j$, $j \in I$. For instance, if $d = 4$, $I = \{1, 3\}$ and $i_1 = 1$, $i_3 = 2$, then $A_{\mathbf{i}_I} = (a_{1j2k})_{j,k}$. We will also need the notation $P(I^c)$ for the set of all partitions of $I^c$.
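The restriction operation is just slicing. A minimal sketch of the example from the text (the array and names are ours for illustration):

```python
# The "restricted" tensor A_{i_I}: fix the indices in I of a d-tensor.
# Text's example: d = 4, I = {1, 3}, i_1 = 1, i_3 = 2 (1-based indices).
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 4
A = rng.standard_normal((n,) * d)  # a 4-tensor

# with 0-based numpy axes this fixes axis 0 at value 0 and axis 2 at value 1
restricted = A[0, :, 1, :]  # the 2-tensor (a_{1j2k})_{jk}

assert restricted.shape == (n, n)
assert restricted[2, 3] == A[0, 2, 1, 3]
```

The free axes are exactly those indexed by $I^c$, which is why partitions of $I^c$ are the natural index sets for the norms of $A_{\mathbf{i}_I}$.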
For $I = [d]$, i. e. if we fix all indices of $\mathbf{i}$, we interpret $A_{\mathbf{i}_I} = a_{\mathbf{i}}$ as the $\mathbf{i}$-th entry of $A$. Moreover, in this case, we assume that there is a single element $\mathcal{J} \in P(I^c)$ (which we may call the "empty" partition), and $\|A_{\mathbf{i}_I}\|_{\mathcal{J}} = |a_{\mathbf{i}}|$ is just the absolute value of $a_{\mathbf{i}}$. Finally, note that if $I = \emptyset$, $\mathbf{i}_I$ does not indicate any specification. Using the characterization of the $\Psi_\alpha$ norms in terms of the growth of $L^p$ norms (see Appendix A for details), [KL15, Corollary 2] now yields a result similar to Theorem 1.4 for all $\alpha \in (0, 1]$: Corollary 6.1. Let $X_1, \ldots, X_n$ be a set of independent, centered random variables with $\|X_i\|_{\Psi_\alpha} \le M$ for some $\alpha \in (0, 1]$, let $A$ be a symmetric $d$-tensor with vanishing diagonal and consider $f_{d,A}$ as in (1.7). Then a multi-level tail bound holds for any $t > 0$. The main goal of this section is to generalize Corollary 6.1 to arbitrary polynomials, similarly to Theorem 1.6. This yields the following result: Theorem 6.2. Let $X_1, \ldots, X_n$ be a set of independent random variables satisfying $\|X_i\|_{\psi_\alpha} \le M$ for some $\alpha \in (0, 1]$ and $M > 0$, and let $f : \mathbb{R}^n \to \mathbb{R}$ be a polynomial of total degree $D \in \mathbb{N}$. Then a corresponding tail bound holds for any $t > 0$. To prove Theorem 6.2, note that one particular example of centered random variables with $\|X\|_{\Psi_\alpha} \le M$ is given by symmetric Weibull variables with shape parameter $\alpha$ (and scale parameter 1), i. e. symmetric random variables $w$ with $\mathbb{P}(|w| \ge t) = \exp(-t^\alpha)$. In fact, [KL15, Example 3] especially implies the following analogue of Lemma 3.3. Moreover, we need a replacement of Lemma 3.2; here, instead of Gaussian random variables, we use Weibull random variables to compare the $p$-th moments: Lemma 6.4. For any $k \in \mathbb{N}$, any $\alpha \in (0, 1]$ and any $p \ge 2$, if $Y_1, \ldots, Y_n$ are independent symmetric random variables with $\|Y_i\|_{\psi_{\alpha/k}} \le M$, then the corresponding moment comparison holds, where the $w_{ij}$ are i.i.d. Weibull variables with shape parameter $\alpha$. Proof. We extend the arguments given in the proof of [KL15, Corollary 2]. As always, we assume $M = 1$. Moreover, note that it suffices to prove Lemma 6.4 for $p \in 2\mathbb{N}$.
It follows from Lemma 6.3 that $\|w_{ij}\|_p \ge C_\alpha p^{1/\alpha}$ for any $i, j$, from which we easily arrive at $\|w_{i1} \cdots w_{ik}\|_p \ge C_{\alpha,k} p^{k/\alpha}$. Consequently, for a set of independent Rademacher variables $\varepsilon_1, \ldots, \varepsilon_n$ which are independent of the $(Y_i)_i$, we have $\|Y_i\|_p = \|\varepsilon_i Y_i\|_p \le C_\alpha p^{k/\alpha} \le C_{\alpha,k} \|w_{i1} \cdots w_{ik}\|_p$. Therefore, the claim follows for any $m \in \mathbb{N}$ using standard symmetrization inequalities. Our next goal is to adapt Lemmas 5.1, 5.2 and 5.3 to the "restricted" tensors $A_{\mathbf{i}_I}$. That is, we examine whether (a modification of) the inequality (6.1) still holds in this situation, where $\mathcal{J}$ is a partition of $I^c$.
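The growth rate $\|w\|_p \asymp p^{1/\alpha}$ for a symmetric Weibull variable is easy to check, since $\mathbb{P}(|w| \ge t) = \exp(-t^\alpha)$ gives $\mathbb{E}|w|^p = \Gamma(p/\alpha + 1)$ exactly. A minimal numerical sketch (the constants in the assertion are ours, chosen loosely):

```python
# For a symmetric Weibull variable w with P(|w| >= t) = exp(-t^alpha),
# E|w|^p = Gamma(p/alpha + 1), hence ||w||_p grows like p^{1/alpha}
# (by Stirling's formula) -- the rate used in the proof of Lemma 6.4.
import math

alpha = 0.5  # a shape parameter in (0, 1]
for p in (2, 4, 8, 16):
    lp_norm = math.gamma(p / alpha + 1) ** (1 / p)  # exact L^p norm
    ratio = lp_norm / p ** (1 / alpha)
    # the ratio ||w||_p / p^{1/alpha} stays bounded above and below
    assert 0.1 < ratio < 10.0
```

Samples of $|w|$ can be drawn by inversion as $(-\log U)^{1/\alpha}$ for uniform $U$, which is sometimes handy for simulating the comparison in Lemma 6.4.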
Lemma 6.5. Let $A = (a_{\mathbf{i}})_{\mathbf{i} \in [n]^d}$ be a $d$-tensor, $I \subset [d]$ and $\mathbf{i}_I \in [n]^I$ fixed.
Proof. To see (1), we may assume that $\{k_1, \ldots, k_l\} \cap I = \emptyset$ (note that if $\{k_1, \ldots, k_l\} \cap I \ne \emptyset$, either the conditions are not compatible, in which case $(A \circ 1_C)_{\mathbf{i}_I} = 0$, or we can remove some of the conditions and obtain a subset with $\{\tilde{k}_1, \ldots, \tilde{k}_{\tilde{l}}\} \cap I = \emptyset$). In this case, if $C$ is a generalized row, then $(A \circ 1_C)_{\mathbf{i}_I} = A_{\mathbf{i}_I} \circ 1_{C'}$ for some generalized row $C'$ in $I^c$. This proves (1). If $C$ is a generalized diagonal, we have to consider two situations. Assuming $K \cap I = \emptyset$, i. e. $K$ is a subset of $I^c$, we immediately obtain (2). On the other hand, if $K \cap I \ne \emptyset$, then $(A \circ 1_C)_{\mathbf{i}_I} = A_{\mathbf{i}_I} \circ 1_{C'}$ for some generalized row $C'$ in $I^c$, readily leading to (2) again.
(3) is clear. To see (4), one may argue as in the proof of Lemma 5.2 (for $q = 1$), replacing Lemma 5.1 (2) and (3) by the analogues we have just proved. Finally, an easy modification of the proof of Lemma 5.3 yields (5).
We are now ready to prove Theorem 6.2. Here, we recall the notation used in the proof of Theorem 1.6, with the only difference that now, by $f \lesssim g$ we mean an inequality of the form $f \le C(D, \alpha)\, g$, where $C(D, \alpha)$ may depend on $D$ and $\alpha$.
Proof of Theorem 6.2. We will follow the proof of Theorem 1.6. In particular, let us assume M = 1.
Step 1. Recall the inequality from the proof of Theorem 1.6.
Note that $y^{(\mu+1)}$ only has a single non-zero element, and thus it is easy to see that $|y|_K \le 1$. Moreover, (6.5) follows from the definition of the tensor $\tilde{S}^{(|\mathbf{k}|)}$ and the fact that if $\mathbf{i} \in L(\tilde{I})$, then $i_r = i_{\pi(r)}$ for $r > d$. Hence, the supremum on the left-hand side of (6.4) is taken over a subset of the unit ball with respect to $|x|_K$.
Finally, it remains to prove $\|(S^{(|\mathbf{k}|)})_{\mathbf{i}_I}\|_K \lesssim \|(A^{(|\mathbf{k}|)})_{\mathbf{i}_I}\|_K$ (6.6) for any partition $K \in P(I^c)$. This may be achieved as in the proof of Theorem 1.6, replacing Lemma 5.3 by Lemma 6.5 (5). Combining (6.4) and (6.6) yields (6.3), which finishes the proof.
It remains to prove Proposition 1.1 and Theorem 1.2 (from which Corollary 1.3 follows immediately).
Proof of Proposition 1.1. The case $\alpha \in (0, 1]$ follows immediately from the $d = 2$ case of Corollary 6.1. The case $\alpha = 2$ corresponds to the well-known Hanson-Wright inequality, see e. g. [RV13]. Proof of Theorem 1.2. Let $\alpha \in (0, 1]$ and consider the bound given by Theorem 6.2. Fix any $d = 1, \ldots, D$. Then, for any $I \subset [d]$, any $\mathbf{i}_I$ and any $\mathcal{J} \in P(I^c)$, we have $\|(\mathbb{E} f^{(d)}(X))_{\mathbf{i}_I}\|_{\mathcal{J}} \le \|(\mathbb{E} f^{(d)}(X))_{\mathbf{i}_I}\|_{\mathrm{HS}} \le \|\mathbb{E} f^{(d)}(X)\|_{\mathrm{HS}}$ (using (3.7)). If $t/(M^d \|\mathbb{E} f^{(d)}(X)\|_{\mathrm{HS}}) \ge 1$, this immediately yields the result. Otherwise, the tail bound given in Theorem 1.2 is trivial. (In fact, here one needs to ensure that $C_{D,\alpha}$ is sufficiently large, e. g. $C_{D,\alpha} \ge 1$; it is not hard to see that in general this condition is satisfied anyway.) In a similar way, it is possible to derive the same results for $\alpha = 2/q$ and any $q \in \mathbb{N}$ from Theorem 1.6.
From these results, the exponential moment bound follows by standard arguments, see for example [BGS18, Proof of Theorem 1.1].

Appendix A. Properties of Orlicz quasinorms
As mentioned in the introduction, the Orlicz norms (1.5) satisfy the triangle inequality only for $\alpha \ge 1$. However, for any $\alpha \in (0, 1)$, (1.5) is still a quasinorm, which is sufficient for many purposes. We collect some elementary results on Orlicz quasinorms in this appendix. The first result is a Hölder-type inequality for the $\Psi_\alpha$ norms.
Lemma A.1. Let $X_1, \ldots, X_k$ be random variables such that $\|X_i\|_{\Psi_{\alpha_i}} < \infty$ for some $\alpha_i \in (0, 1]$, and let $t$ be given by $t^{-1} := \sum_{i=1}^k \alpha_i^{-1}$. Proof. By homogeneity we can assume $\|X_i\|_{\Psi_{\alpha_i}} = 1$ for all $i = 1, \ldots, k$. We will need the general form of Young's inequality: for all $p_1, \ldots, p_k > 1$ satisfying $\sum_{i=1}^k p_i^{-1} = 1$ and any $x_1, \ldots, x_k \ge 0$ we have $\prod_{i=1}^k x_i \le \sum_{i=1}^k x_i^{p_i}/p_i$, which follows easily from the concavity of the logarithm. If we apply this to $p_i := \alpha_i t^{-1}$ and use the convexity of the exponential function, we obtain the required bound. Consequently, we have $\|\prod_{i=1}^k X_i\|_{\Psi_t} \le 1$. The random variables $X_1, \ldots, X_k$ need not be independent; in particular, we can consider a random vector $X = (X_1, \ldots, X_k)$ with marginals having $\alpha$-sub-exponential tails. The special case $\alpha_i = \alpha$ for all $i = 1, \ldots, k$ gives a product bound in $\Psi_{\alpha/k}$. Now, by Taylor expansion and using the inequality $n^n \le e^n n!$, this gives $\mathbb{E} \exp\big(|X|^\alpha/t^\alpha\big) = 1 + \sum_{n=1}^\infty \frac{\mathbb{E}|X|^{\alpha n}}{t^{\alpha n} n!} \le 1 + \sum_{n=1}^\infty \frac{n^n}{n!\, t^{\alpha n}} \le 1 + \sum_{n=1}^\infty \Big(\frac{e}{t^\alpha}\Big)^n.$
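The generalized Young inequality used in this proof can be sanity-checked numerically; the following sketch (sampling scheme is ours) draws random conjugate exponents with $\sum_i p_i^{-1} = 1$ and random nonnegative arguments:

```python
# Numerical check of the generalized Young inequality
# prod_i x_i <= sum_i x_i^{p_i} / p_i,  with sum_i 1/p_i = 1,
# as used in the proof of Lemma A.1.
import math
import random

random.seed(0)
for _ in range(100):
    # draw exponents p_i > 1 with sum(1/p_i) == 1: normalize random weights
    w = [random.random() + 0.01 for _ in range(3)]
    s = sum(w)
    p = [s / wi for wi in w]  # then 1/p_i = w_i / s sums to 1
    x = [random.uniform(0.0, 5.0) for _ in range(3)]
    lhs = math.prod(x)
    rhs = sum(xi ** pi / pi for xi, pi in zip(x, p))
    assert lhs <= rhs + 1e-9  # small tolerance for floating-point error
```

The inequality is exactly the weighted AM-GM bound obtained from the concavity of the logarithm, as stated in the proof.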