Uniform Hanson-Wright type concentration inequalities for unbounded entries via the entropy method

This paper is devoted to uniform versions of the Hanson-Wright inequality for a random vector $X \in \mathbb{R}^n$ with independent subgaussian components. The core technique of the paper is the entropy method combined with truncations of both the gradients of the functions of interest and the coordinates of $X$ itself. Our results recover, in particular, the classic uniform bound of Talagrand (1996) for Rademacher chaoses and the more recent uniform result of Adamczak (2015), which holds under certain rather strong assumptions on the distribution of $X$. We provide several applications of our techniques: we establish a version of the standard Hanson-Wright inequality, which is tighter in some regimes. Extending our techniques, we show a version of the dimension-free matrix Bernstein inequality that holds for random matrices with a subexponential spectral norm. We apply the derived inequality to the problem of covariance estimation with missing observations and prove an almost optimal high-probability version of the recent result of Lounici (2014). Finally, we show a uniform Hanson-Wright type inequality in the Ising model under Dobrushin's condition. A closely related question was posed by Marton (2003).


Introduction
The concentration properties of quadratic forms of random variables are a classic topic in probability. A well-known result is due to Hanson and Wright (we refer to the form of this inequality presented in [25]), which states that if $A$ is an $n \times n$ real matrix and $X = (X_1, \ldots, X_n)$ is a random vector in $\mathbb{R}^n$ with independent centered components satisfying $\max_i \|X_i\|_{\psi_2} \le K$ (we recall the definition of $\|\cdot\|_{\psi_2}$ below), then for all $t \ge 0$,
$$
\Pr\left(|X^\top A X - \mathbb{E} X^\top A X| \ge t\right) \le 2\exp\left(-c\min\left\{\frac{t^2}{K^4\|A\|_{\mathrm{HS}}^2}, \frac{t}{K^2\|A\|}\right\}\right), \tag{1.1}
$$
for some absolute $c > 0$, where $\|A\|_{\mathrm{HS}} = \sqrt{\sum_{i,j} A_{i,j}^2}$ denotes the Hilbert-Schmidt norm and $\|A\|$ is the operator norm of $A$. An important extension of these results arises when, instead of a single matrix $A$, we have a family of matrices $\mathcal{A}$ and want to understand the behaviour of the random quadratic forms simultaneously for all matrices in the family. As a concrete example, consider an order-2 Rademacher chaos: given a family $\mathcal{A} \subset \mathbb{R}^{n\times n}$ of $n \times n$ real symmetric matrices with zero diagonal, that is, $A_{ii} = 0$ for all $A \in \mathcal{A}$ and all $i = 1, \ldots, n$, one wants to study the random variable
$$
Z_{\mathcal{A}}(\varepsilon) = \sup_{A \in \mathcal{A}} \sum_{i,j} A_{ij}\,\varepsilon_i\varepsilon_j,
$$
where $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$ is a sequence of independent Rademacher signs taking values $\pm 1$ with equal probabilities. In the celebrated paper by Talagrand [28] it was shown, in particular, that there is an absolute constant $c > 0$ such that for any $t \ge 0$,
$$
\Pr\left(|Z_{\mathcal{A}}(\varepsilon) - \mathbb{E} Z_{\mathcal{A}}(\varepsilon)| \ge t\right) \le 2\exp\left(-c\min\left\{\frac{t^2}{\left(\mathbb{E}\sup_{A\in\mathcal{A}}\|A\varepsilon\|\right)^2}, \frac{t}{\sup_{A\in\mathcal{A}}\|A\|}\right\}\right). \tag{1.2}
$$
Similar inequalities in the Gaussian case follow from the results in [8] and [6]. Apart from the new techniques used to prove (1.2), the significance of this result is that previously (see, for example, [20]) similar bounds were one-sided and had a multiplicative constant greater than 1 in front of $\mathbb{E} Z_{\mathcal{A}}(\varepsilon)$. Results with a multiplicative factor not equal to 1 are usually called deviation inequalities, in contrast to concentration inequalities. In a paper with applications to compressive sensing, Dicker and Erdogdu [12] prove deviation inequalities for
$$
Z_{\mathcal{A}}(X) = \sup_{A\in\mathcal{A}}\left(X^\top A X - \mathbb{E} X^\top A X\right) \tag{1.3}
$$
and subgaussian vectors $X$ under some extra assumptions.
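To make the quantities entering (1.1) concrete, here is a small numeric sketch (not from the paper) that computes the two matrix norms and evaluates the resulting tail bound; the choice $c = 1$ and all helper names are illustrative.

```python
import math

def hs_norm(A):
    # Hilbert-Schmidt (Frobenius) norm: square root of the sum of squared entries
    return math.sqrt(sum(x * x for row in A for x in row))

def op_norm(A, iters=200):
    # operator (spectral) norm via power iteration on A^T A
    n = len(A)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]      # A v
        w = [sum(A[i][j] * w[i] for i in range(n)) for j in range(n)]      # A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    return math.sqrt(sum(x * x for x in Av))

def hanson_wright_tail(t, K, A):
    # right-hand side of (1.1) with the illustrative choice c = 1
    return 2.0 * math.exp(-min(t ** 2 / (K ** 4 * hs_norm(A) ** 2),
                               t / (K ** 2 * op_norm(A))))
```

The two regimes of the bound are visible directly: the Gaussian-type term governs small $t$ and the exponential term governs large $t$.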
Additionally, a recent paper by Adamczak, Latała and Meller [4] studies deviation bounds for $Z = X^\top A X - \mathbb{E} X^\top A X$ with Banach space-valued matrices $A$ and Gaussian variables, providing upper and lower bounds for the moments. The deviation inequality for general subgaussian vectors and a single positive semi-definite matrix was obtained by Hsu, Kakade, and Zhang [15]. Returning to concentration inequalities, it was shown by Adamczak [2] that if $X$ satisfies the so-called concentration property with constant $K$, that is, for every 1-Lipschitz function $\varphi : \mathbb{R}^n \to \mathbb{R}$ and any $t \ge 0$ we have $\mathbb{E}|\varphi(X)| < \infty$ and
$$
\Pr\left(|\varphi(X) - \mathbb{E}\varphi(X)| \ge t\right) \le 2\exp\left(-t^2/2K^2\right), \tag{1.4}
$$
then the following bound, similar to (1.2), holds for every $t \ge 0$:
$$
\Pr\left(|Z_{\mathcal{A}}(X) - \mathbb{E} Z_{\mathcal{A}}(X)| \ge t\right) \le 2\exp\left(-\frac{1}{C}\min\left\{\frac{t^2}{K^2\left(\mathbb{E}\sup_{A\in\mathcal{A}}\|AX\|\right)^2}, \frac{t}{K^2\sup_{A\in\mathcal{A}}\|A\|}\right\}\right). \tag{1.5}
$$
This result has applications to covariance estimation and recovers another recent concentration result by Koltchinskii and Lounici [17]; this is discussed further in Section 2. The drawback of (1.5) is that the required concentration property places strong restrictions on the distribution of $X$: while it is satisfied by the standard Gaussian distribution as well as by some log-concave distributions (see [19]), it is not known whether the concentration property holds for general subgaussian entries, even in the simplest case of Rademacher random vectors.
In this paper we extend the aforementioned results in two directions. We extend the result from [10] for bounded variables by allowing non-zero diagonal values of the matrices and unbounded subgaussian variables $X_i$. First, let us recall the following definition. For $\alpha > 0$ the $\psi_\alpha$-norm of a random variable $Y$ is
$$
\|Y\|_{\psi_\alpha} = \inf\left\{t > 0 : \mathbb{E}\exp\left(|Y|^\alpha/t^\alpha\right) \le 2\right\},
$$
which is a proper norm whenever $\alpha \ge 1$. A random variable $Y$ with $\|Y\|_{\psi_1} < \infty$ is referred to as subexponential, a random variable with $\|Y\|_{\psi_2} < \infty$ is referred to as subgaussian, and the corresponding norm is usually called the subgaussian norm. We also use the $L_p$ norm: for $p \ge 1$ we set $\|Y\|_{L_p} = (\mathbb{E}|Y|^p)^{1/p}$. One of our main contributions is the following upper-tail bound.

Theorem 1.1. Suppose that the components of $X = (X_1, \ldots, X_n)$ are independent centered random variables and $\mathcal{A}$ is a finite family of $n \times n$ real symmetric matrices. Denote $M = \|\max_i |X_i|\|_{\psi_2}$. Then, for any $t \ge \max\{M\,\mathbb{E}\sup_{A\in\mathcal{A}}\|AX\|,\ M^2\sup_{A\in\mathcal{A}}\|A\|\}$ we have
$$
\Pr\left(Z_{\mathcal{A}}(X) \ge \mathbb{E} Z_{\mathcal{A}}(X) + t\right) \le \exp\left(-c\min\left\{\frac{t^2}{\left(M\,\mathbb{E}\sup_{A\in\mathcal{A}}\|AX\|\right)^2}, \frac{t}{M^2\sup_{A\in\mathcal{A}}\|A\|}\right\}\right), \tag{1.6}
$$
where $c > 0$ is an absolute constant and $Z_{\mathcal{A}}(X)$ is defined by (1.3).

Remark 1.2.
In Theorem 1.1 and below we assume that all $A \in \mathcal{A}$ are symmetric. This is done only for convenience of presentation; in fact, the analysis may be performed for general square matrices. The only difference is that in many places $A$ should be replaced by $\frac{1}{2}(A + A^\top)$.

Remark 1.3. Notice that even though the above result is stated for finite sets $\mathcal{A}$, it also holds for arbitrary bounded sets. Indeed, for a bounded set of matrices $\mathcal{A}$, since these matrices are finite dimensional, we can consider an increasing sequence of finite subsets $\mathcal{A}_1 \subseteq \mathcal{A}_2 \subseteq \ldots$ whose union is dense in $\mathcal{A}$, so that $Z_{\mathcal{A}_k}(X) \to Z_{\mathcal{A}}(X)$ pointwise. This pointwise convergence implies convergence in probability; in particular, the tail bound for $Z_{\mathcal{A}_k}(X)$ passes to the limit. Since for a subset $\mathcal{A}_k \subseteq \mathcal{A}$ the values $\mathbb{E}\sup_{A\in\mathcal{A}_k}\|AX\|$ and $\sup_{A\in\mathcal{A}_k}\|A\|$ are not greater than those for the original set $\mathcal{A}$, we obtain the bound (1.6) for arbitrary bounded sets.
For the sake of simplicity, we only consider finite sets below.
In particular, Theorem 1.1 recovers the right tail of Talagrand's result (1.2) up to absolute constants, since in this case we obviously have $\|\max_i|\varepsilon_i|\|_{\psi_2} \sim 1$. Furthermore, Theorem 1.1 works without the assumption, used in [28] and [10], that the diagonals of all matrices in $\mathcal{A}$ are zero. Moreover, it is also applicable in some situations when the concentration property (1.4) holds: indeed, if $X$ is a standard normal vector in $\mathbb{R}^n$, then it is well known (see [20]) that $M = \|\max_i |X_i|\|_{\psi_2} \sim \sqrt{\log n}$. If, moreover, the identity matrix $I_n \in \mathcal{A}$, then $\mathbb{E}\sup_{A\in\mathcal{A}}\|AX\| \ge \mathbb{E}\|X\| \sim \sqrt{n}$. Therefore, in this case the factor $M$ is only of at most logarithmic order when compared to $\mathbb{E}\sup_{A\in\mathcal{A}}\|AX\|$.
In the special case where $\mathcal{A}$ consists of just one matrix, our bound recovers a bound similar to the original Hanson-Wright inequality. On the one hand, our bound may have an extra logarithmic factor depending on the dimension $n$. On the other hand, the original term $\max_i\|X_i\|_{\psi_2}\|A\|_{\mathrm{HS}}$ is replaced by the better term $\mathbb{E}\|AX\|$. We discuss this phenomenon below. The core of the proof of the Hanson-Wright inequality in [25] is the decoupling technique, which may be used (at least in a straightforward way) to prove the deviation inequality, but not the concentration inequality, for $\sup_{A\in\mathcal{A}}(X^\top AX - \mathbb{E} X^\top AX)$ in the case where $\mathcal{A}$ consists of more than one matrix.
A natural question to ask is whether one may improve Theorem 1.1 by replacing $M = \|\max_i|X_i|\|_{\psi_2}$ with $K = \max_i\|X_i\|_{\psi_2}$. In Section 2 we show that in the deviation version of Theorem 1.1 this replacement is not possible in some cases. This is quite unexpected in light of the fact that $\|\max_i|X_i|\|_{\psi_2}$ does not appear in the original Hanson-Wright inequality. Therefore, we believe that the form of our result is close to optimal. We also provide the following extension of Theorem 1.1, which may be better in some cases.

Proposition 1.4.
Suppose that the components of $X = (X_1,\ldots,X_n)$ are independent centered random variables. Suppose also that the variables $X_i$ are distributed symmetrically ($X_i$ has the same distribution as $-X_i$). Let $\mathcal{A}$ be a finite family of $n\times n$ real symmetric matrices. Denote $M = \|\max_i|X_i|\|_{\psi_2}$ and $K = \max_i\|X_i\|_{\psi_2}$, and let $G$ be a standard Gaussian vector in $\mathbb{R}^n$. Then, for any $t \ge \max\{MK\,\mathbb{E}\sup_{A\in\mathcal{A}}\|AG\|,\ MK\sup_{A\in\mathcal{A}}\|A\|\}$ we have
$$
\Pr\left(Z_{\mathcal{A}}(X) \ge \mathbb{E} Z_{\mathcal{A}}(X) + t\right) \le \exp\left(-c\min\left\{\frac{t^2}{\left(MK\,\mathbb{E}\sup_{A\in\mathcal{A}}\|AG\|\right)^2}, \frac{t}{MK\sup_{A\in\mathcal{A}}\|A\|}\right\}\right),
$$
where $c > 0$ is an absolute constant and $Z_{\mathcal{A}}(X)$ is defined by (1.3). Indeed, in the case $\mathcal{A} = \{A\}$ we have $\mathbb{E}\|AG\| \sim \|A\|_{\mathrm{HS}}$. The difference is that $K^4$ and $K^2$ are replaced by $M^2K^2$ and $MK$ respectively.

We proceed with some notation that will be used below. For a non-negative random variable $Y$, define its entropy as
$$
\operatorname{Ent}(Y) = \mathbb{E}\, Y\log Y - \mathbb{E}\, Y\log\mathbb{E}\, Y.
$$
Instead of the concentration property (1.4), we also discuss the following closely related property.

Assumption 1.6. We say that a random vector $X$ taking values in $\mathbb{R}^n$ satisfies the logarithmic Sobolev inequality with constant $K > 0$ if for any continuously differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ we have
$$
\operatorname{Ent}(f^2(X)) \le 2K^2\,\mathbb{E}\|\nabla f(X)\|^2, \tag{1.7}
$$
whenever both sides of the inequality are not infinite.
One of the technical contributions of this paper is that we use a similar scheme to prove Theorem 1.1 and to recover (1.5) under the logarithmic Sobolev Assumption 1.6. The application of logarithmic Sobolev inequalities requires computing the gradient of the function of interest, in our case the gradient of $Z_{\mathcal{A}}(X) = \sup_{A\in\mathcal{A}}(X^\top AX - \mathbb{E} X^\top AX)$. In the analysis that we present, there is a need to control the behaviour of $\nabla Z_{\mathcal{A}}(X)$ (or its analogs) and, as in [10] and [2], we use a truncation argument to do so. However, in both cases our proofs make use of the entropy variational formula from [11], which states that for random variables $Y, W$ with $\mathbb{E}\exp(W) < \infty$ we have
$$
\mathbb{E}\left(W\exp(\lambda Y)\right) \le \mathbb{E}\exp(\lambda Y)\log\left(\mathbb{E}\exp(W)\right) + \operatorname{Ent}(\exp(\lambda Y)). \tag{1.8}
$$
Doing so allows us to shorten the proofs and avoid some technicalities appearing in previous papers. Finally, to prove Theorem 1.1 we use a second truncation argument, based on the Hoffman-Jørgensen inequality (see [20]).

We also present two lemmas which are used several times in the text. Both results have short proofs and may be of independent interest.

Lemma 1.7. Suppose that for random variables $Z, W$ and any $\lambda > 0$ we have
$$
\operatorname{Ent}(e^{\lambda Z}) \le \lambda^2\,\mathbb{E}\, W e^{\lambda Z} \quad\text{and}\quad \Pr(W > L + \theta t) \le e^{-t}, \tag{1.9}
$$
where $\theta, L$ are positive constants. Then, the following concentration result holds:
$$
\Pr\left(Z \ge \mathbb{E} Z + t\right) \le \exp\left(-c\min\left\{\frac{t^2}{L}, \frac{t}{\sqrt{\theta}}\right\}\right), \tag{1.10}
$$
where $c > 0$ is an absolute constant. If, moreover, (1.9) holds for any $\lambda \le 0$, we have
$$
\Pr\left(|Z - \mathbb{E} Z| \ge t\right) \le 2\exp\left(-c\min\left\{\frac{t^2}{L}, \frac{t}{\sqrt{\theta}}\right\}\right).
$$
The second technical result is a version of the convex concentration inequality in [28] which does not require the boundedness of the components of $X$.

Lemma 1.8. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex, $L$-Lipschitz function with respect to the Euclidean norm on $\mathbb{R}^n$ and let $X = (X_1,\ldots,X_n)$ be a random vector with independent components. Then, for any $t \ge CL\|\max_i|X_i|\|_{\psi_2}$ we have
$$
\Pr\left(|f(X) - \mathbb{E} f(X)| \ge t\right) \le 2\exp\left(-\frac{ct^2}{L^2\|\max_i|X_i|\|_{\psi_2}^2}\right),
$$
where $c, C > 0$ are absolute constants.
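The variational formula (1.8) can be sanity-checked numerically on a finite probability space. The sketch below (the three-point distribution and all helper names are illustrative choices, not from the paper) verifies the inequality for given $Y$, $W$ and $\lambda$; it holds for arbitrary inputs, which is exactly why it is useful for "decoupling" entropy bounds.

```python
import math

def expect(p, vals):
    # expectation of the random variable `vals` under the distribution `p`
    return sum(pi * v for pi, v in zip(p, vals))

def ent(p, vals):
    # Ent(V) = E V log V - E V log E V for a positive random variable V
    ev = expect(p, vals)
    return expect(p, [v * math.log(v) for v in vals]) - ev * math.log(ev)

def variational_holds(p, Y, W, lam):
    # (1.8): E[W e^{lam Y}] <= E e^{lam Y} * log E e^W + Ent(e^{lam Y})
    e_lam_y = [math.exp(lam * y) for y in Y]
    lhs = expect(p, [w * v for w, v in zip(W, e_lam_y)])
    rhs = (expect(p, e_lam_y) * math.log(expect(p, [math.exp(w) for w in W]))
           + ent(p, e_lam_y))
    return lhs <= rhs + 1e-12
```

The inequality is the Donsker-Varadhan duality for entropy, so `variational_holds` returns `True` for any distribution and any bounded choices of $Y$, $W$, $\lambda$.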
Despite generalizing existing results on convex concentration, Lemma 1.8 follows easily from the truncation approach combined with the Hoffman-Jørgensen inequality. As another application of this technique we provide a version of the matrix Bernstein inequality that holds for random matrices with subexponential spectral norm; this inequality is presented in Section 4. Finally, the same argument showing that it is not possible to replace $M = \|\max_i|X_i|\|_{\psi_2}$ by $K = \max_i\|X_i\|_{\psi_2}$ in Theorem 1.1 is used to show that the same replacement is not possible in Lemma 1.8.
We sum up the structure of the paper:
• Section 2 is devoted to applications and discussions and consists of several parts.
First, we give a simple proof of the uniform bound from [2] under the logarithmic Sobolev assumption. The second part is devoted to improvements of the non-uniform Hanson-Wright inequality (1.1) in the subgaussian regime. Furthermore, we apply our techniques to obtain a uniform concentration result similar to Theorem 1.1 in a particular case of non-independent components: we consider the Ising model under Dobrushin's condition, a setting that has been studied recently in [3] and [13]. The question we study was raised by Marton [22] in a closely related scenario. Finally, we show that it is not possible in general to replace $\|\max_i|X_i|\|_{\psi_2}$ with $\max_i\|X_i\|_{\psi_2}$ in Theorem 1.1 by providing an appropriate counterexample.
• In Section 3 we present our proof of Theorem 1.1. While doing so, we prove Lemma 1.7 and Lemma 1.8. Finally, we give a proof of Proposition 1.4.
• In Section 4 we formulate and prove the dimension-free matrix Bernstein inequality that holds for random matrices with subexponential spectral norm. We demonstrate how our Bernstein inequality can be used in the context of covariance estimation for subgaussian observations improving the state-of-the-art result from [21] for covariance estimation with missing observations.

Some applications and discussions
We begin with some notation that will be used throughout the paper. For a random vector $X$ taking values in $\mathbb{R}^n$, let $X_1,\ldots,X_n$ denote its components. When all components of $X$ are independent, let $X_i'$ denote an independent copy of the component $X_i$.
Throughout the paper $C, c > 0$ are absolute constants that may change from line to line. We write $a \lesssim b$ if $a \le Cb$ for some absolute constant $C > 0$; if $a \lesssim b$ and $b \lesssim a$, we write $a \sim b$. Furthermore, for a square matrix $A$, denote by $\operatorname{Diag}(A)$ the diagonal matrix that has the same elements on the diagonal as $A$; the off-diagonal part of $A$ is defined by $\operatorname{Off}(A) = A - \operatorname{Diag}(A)$. For $a \in \mathbb{R}^n$ we define $\operatorname{diag}(a)$ as the $n\times n$ diagonal matrix with diagonal elements $a$. Finally, for two symmetric (Hermitian) matrices $A, B$ we write $A \preceq B$ whenever $B - A$ is positive semi-definite.

In what follows we also use the following equivalent formulations of tail inequalities. Assume that for a random variable $Y$ and some $a, b > 0$ we have, for any $t \ge 1$,
$$
\Pr\left(Y \ge a\sqrt{t} + bt\right) \le e^{-t}.
$$
The last inequality implies, for any $u \ge \max(a, b)$,
$$
\Pr\left(Y \ge 2u\right) \le \exp\left(-\min\left\{\frac{u^2}{a^2}, \frac{u}{b}\right\}\right),
$$
and vice versa, up to absolute constant factors.
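The equivalence of the two tail formulations can be verified directly: solving $a\sqrt{t} + bt = u$ for $t$ shows that $t \ge \frac{1}{4}\min\{u^2/a^2, u/b\}$, so the one-parameter tail $e^{-t}$ turns into the two-regime form. A short numeric check (helper names illustrative):

```python
import math

def t_for_level(u, a, b):
    # solve a*sqrt(t) + b*t = u for t >= 0 (a quadratic equation in sqrt(t))
    s = (-a + math.sqrt(a * a + 4.0 * b * u)) / (2.0 * b)
    return s * s

def two_regime(u, a, b):
    # min{u^2/a^2, u/b}: subgaussian regime for small u, subexponential for large u
    return min(u * u / (a * a), u / b)
```

Since $a\sqrt{t} + bt = u$ forces $a\sqrt{t} \ge u/2$ or $bt \ge u/2$, the exponent one obtains is at least a quarter of the two-regime expression, which is the "vice versa" direction up to constants.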

Uniform Hanson-Wright inequalities under the logarithmic Sobolev condition
In this paragraph we recover a result from [2] under Assumption 1.6. Consider the random variable $Z_{\mathcal{A}}(X)$ defined by (1.3), as a function of $X$, where $X$ satisfies the logarithmic Sobolev inequality (1.7).
Following [2] we assume, without loss of generality, that $\mathcal{A}$ is a finite set of matrices. Then $Z_{\mathcal{A}}$ is Lebesgue-a.e. differentiable and bounded by a Lipschitz function of $X$ with good concentration properties.

Remark 2.1. Note that Assumption 1.6 applies only to smooth functions, so a standard smoothing argument should be used (see e.g. [19]). For the sake of completeness we recall this argument in Section A. In what follows in this section we assume that none of these potential technical problems appear.
In particular, since $X$ satisfies the logarithmic Sobolev condition with constant $K$, and the function $\sup_{A\in\mathcal{A}}\|AX\|$ is Lipschitz with constant $\sup_{A\in\mathcal{A}}\|A\|$, we have by Theorem 5.3 in [19], for any $t \ge 0$,
$$
\Pr\left(\sup_{A\in\mathcal{A}}\|AX\| \ge \mathbb{E}\sup_{A\in\mathcal{A}}\|AX\| + t\right) \le \exp\left(-\frac{t^2}{2K^2\sup_{A\in\mathcal{A}}\|A\|^2}\right).
$$
Taking squares and using $(a+b)^2 \le 2a^2 + 2b^2$, we conclude that $W = 4K^2\sup_{A\in\mathcal{A}}\|AX\|^2$ satisfies the second condition of (1.9) with $L \sim K^2(\mathbb{E}\sup_{A\in\mathcal{A}}\|AX\|)^2$ and $\theta \sim K^4\sup_{A\in\mathcal{A}}\|A\|^2$. Furthermore, the logarithmic Sobolev condition implies for any $\lambda \in \mathbb{R}$,
$$
\operatorname{Ent}(e^{\lambda Z_{\mathcal{A}}(X)}) \le 4K^2\lambda^2\,\mathbb{E}\sup_{A\in\mathcal{A}}\|AX\|^2 e^{\lambda Z_{\mathcal{A}}(X)}.
$$
Therefore, by Lemma 1.7 it holds for any $t \ge 0$ that
$$
\Pr\left(|Z_{\mathcal{A}}(X) - \mathbb{E} Z_{\mathcal{A}}(X)| \ge t\right) \le 2\exp\left(-c\min\left\{\frac{t^2}{K^2\left(\mathbb{E}\sup_{A\in\mathcal{A}}\|AX\|\right)^2}, \frac{t}{K^2\sup_{A\in\mathcal{A}}\|A\|}\right\}\right),
$$
which coincides with (1.5) for $K$-concentrated vectors up to absolute constant factors.

Remark 2.2.
This result may be used directly to prove concentration for $\|\widehat\Sigma - \Sigma\|$, where $\widehat\Sigma$ is the sample covariance matrix defined as $\widehat\Sigma = \frac{1}{N}\sum_{i=1}^N X_iX_i^\top$ and $X_1,\ldots,X_N$ are centered Gaussian vectors with covariance matrix $\Sigma$ (see Theorem 4.1 in [2]). We return to the covariance estimation problem in Section 4.

Improving Hanson-Wright inequality in the subgaussian regime
Our analysis implies, in particular, an improved version of the Hanson-Wright inequality (1.1) in some cases. We consider a centered random vector $X = (X_1,\ldots,X_n)$ with independent subgaussian components and set $K = \max_i\|X_i\|_{\psi_2}$, $M = \|\max_i|X_i|\|_{\psi_2}$. In this case (1.1) implies that with probability at least $1 - 2e^{-t}$ we have
$$
|X^\top AX - \mathbb{E} X^\top AX| \lesssim K^2\|A\|_{\mathrm{HS}}\sqrt{t} + K^2\|A\|\,t.
$$
At the same time, Theorem 1.1 for a single matrix $\mathcal{A} = \{A\}$ implies, with the same probability,
$$
X^\top AX - \mathbb{E} X^\top AX \lesssim M\,\mathbb{E}\|AX\|\sqrt{t} + M^2\|A\|\,t. \tag{2.1}
$$
Observe that when $|X_i| \le L$ almost surely for every $i \le n$, we have $M \lesssim \min\{K\sqrt{\log n}, L\}$.
The following example illustrates the difference between these two bounds.
Example 2.4. Assume that $\delta_1,\ldots,\delta_n$ are independent Bernoulli random variables with the same mean $\delta$, and consider the centered vector with components $X_i = \delta_i - \delta$. For small $\delta$ one may compute $K = \|X_1\|_{\psi_2}$ explicitly, while $\mathbb{E}\|AX\|$ may be bounded using Theorem 1.1 in [26] (a result equivalent to Theorem 1.1 was also obtained in [7]). Therefore, the standard Hanson-Wright inequality gives, with probability at least $1 - e^{-t}$, a bound whose $\sqrt{t}$-term is driven by $K^2\|A\|_{\mathrm{HS}}$, while (2.1) together with $M \lesssim \min\{K\sqrt{\log n}, 1\}$ implies, for $t \ge 1$ and small $\delta$, an inequality (2.2) whose $\sqrt{t}$-term is smaller by a factor of order $\sqrt{\delta}|\log\delta|$. It is easy to verify that $\lim_{\delta\to 0^+}\sqrt{\delta}|\log\delta| = 0$; thus the inequality (2.2) is better than the Hanson-Wright inequality for this $X$ in the subgaussian regime (when the $t$-term is dominated by the $\sqrt{t}$-term).
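The gain in Example 2.4 can be inspected numerically. The sketch below (parameters and helper names are illustrative; it is not taken from the paper) computes the $\psi_2$-norm of a centered Bernoulli variable by bisection and compares the leading terms for $A = I_n$, using the Jensen bound $\mathbb{E}\|X\| \le \sqrt{n\delta(1-\delta)}$ and the crude bound $M \le 1/\sqrt{\log 2}$ for variables bounded by 1.

```python
import math

def psi2_centered_bernoulli(delta, iters=200):
    # ||X||_{psi_2} = inf{s > 0 : E exp(X^2/s^2) <= 2} for X = Bernoulli(delta) - delta
    def small_enough(s):
        return ((1 - delta) * math.exp(delta ** 2 / s ** 2)
                + delta * math.exp((1 - delta) ** 2 / s ** 2)) <= 2.0
    lo, hi = 1e-8, 10.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if small_enough(mid):
            hi = mid
        else:
            lo = mid
    return hi

n, delta = 10 ** 6, 1e-3
K = psi2_centered_bernoulli(delta)
M_bound = 1.0 / math.sqrt(math.log(2.0))        # psi_2 norm of a variable bounded by 1
hw_term = K ** 2 * math.sqrt(n)                 # K^2 ||I_n||_HS, as in (1.1)
new_term = M_bound * math.sqrt(n * delta * (1 - delta))  # M E||X|| via Jensen, as in (2.1)
```

For $\delta = 10^{-3}$ the improved $\sqrt{t}$-term is several times smaller than the Hanson-Wright one, matching the $\sqrt{\delta}|\log\delta|$ heuristic.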

Uniform concentration results in the Ising model
Consider a random vector $\sigma \in \{-1,1\}^n$ with the distribution defined by
$$
\pi(\sigma) = \frac{1}{Z}\exp\left(\frac{1}{2}\sum_{i\ne j} J_{ij}\sigma_i\sigma_j + \sum_{i=1}^n h_i\sigma_i\right),
$$
where $Z$ is a normalizing factor. This distribution defines the Ising model with parameters $(J, h)$. For an arbitrary function $f$ on $\{-1,1\}^n$ denote the difference operator
$$
\mathfrak{d}_i f(\sigma) = \left(\mathbb{E}_{\pi(\cdot\,|\,\sigma_1,\ldots,\sigma_{i-1},\sigma_{i+1},\ldots)}\left(f(\sigma) - f(T_i\sigma)\right)^2\right)^{1/2},
$$
where the operator $T_i\sigma = (\sigma_1,\ldots,\sigma_{i-1},-\sigma_i,\sigma_{i+1},\ldots)$ flips the sign of the $i$-th component, and $\pi(\cdot\,|\,\sigma_1,\ldots,\sigma_{i-1},\sigma_{i+1},\ldots)$ is the conditional distribution of the $i$-th component given the rest of the elements. The following recent result provides the logarithmic Sobolev inequality for $\sigma$ under a Dobrushin-type condition.

Theorem 2.5 (Proposition 1.1, [13]). Suppose that $\|h\|_\infty \le \alpha$ and $J$ satisfies $J_{ii} = 0$ and
$$
\max_{i\le n}\sum_{j=1}^n |J_{ij}| \le 1 - \rho \tag{2.3}
$$
for some $\rho \in (0,1)$. There is a constant $C = C(\alpha,\rho)$ such that for an arbitrary function $f$ on $\{-1,1\}^n$,
$$
\operatorname{Ent}(f^2(\sigma)) \le C\sum_{i=1}^n \mathbb{E}\left(\mathfrak{d}_i f(\sigma)\right)^2.
$$
Remark 2.6. Following [13], the condition (2.3) will be called Dobrushin's condition.
We may obtain the following uniform concentration result, which is a simple outcome of our Lemma 1.7 and Theorem 2.5.

Proposition 2.7. Let $\mathcal{A}$ be a finite set of symmetric matrices with zero diagonal. In the Ising model under Dobrushin's condition with $\|h\|_\infty \le \alpha$, it holds for any $t \ge 0$ that
$$
\Pr\left(|Z_{\mathcal{A}}(\sigma) - \mathbb{E} Z_{\mathcal{A}}(\sigma)| \ge t\right) \le 2\exp\left(-\frac{1}{C}\min\left\{\frac{t^2}{\left(\mathbb{E}\sup_{A\in\mathcal{A}}\|A\sigma\|\right)^2}, \frac{t}{\sup_{A\in\mathcal{A}}\|A\|}\right\}\right),
$$
where $C$ depends only on $\alpha, \rho$.
Proof. Let $\sigma^{(i)} = (\sigma_1,\ldots,\sigma_{i-1},\sigma_i',\sigma_{i+1},\ldots)$, where, given all but the $i$-th element of $\sigma$, the variables $\sigma_i$ and $\sigma_i'$ are independent and distributed as $\pi(\cdot\,|\,\sigma_1,\ldots,\sigma_{i-1},\sigma_{i+1},\ldots)$. Obviously, we may have all of $\sigma_1,\ldots,\sigma_n$ and $\sigma_1',\ldots,\sigma_n'$ defined on the same discrete probability space, and thus we use the notation $\pi(\cdot)$ and $\pi(\cdot\,|\,\cdot)$ for the distribution and the conditional distribution. Here $\mathcal{A}$ is a given finite set of symmetric matrices with zero diagonal (the diagonal is not important here, since $\sigma_i^2 = 1$). Let us apply Theorem 2.5 to $f(\sigma) = e^{\lambda Z_{\mathcal{A}}(\sigma)/2}$, using the independence of $\sigma_i$ and $\sigma_i'$ given the remaining coordinates together with the elementary bound $(e^{\lambda x/2} - e^{\lambda y/2})^2 \le \frac{\lambda^2}{4}(x-y)^2 e^{\lambda x}$, valid for $x \ge y$ and $\lambda \ge 0$. For $A^*$, the maximizer in (2.4), this yields an entropy bound of the form required in Lemma 1.7. Note that concentration for $\sup_{A\in\mathcal{A}}\|A\sigma\|$ is implied by the same result. Indeed, writing $\sup_{A\in\mathcal{A}}\|A\sigma\| = w^\top\sigma$, where $w = \gamma^\top A^*$ for an appropriate $\gamma$, the expectation of the corresponding difference operator is bounded by $4\sup_{A\in\mathcal{A}}\|A\|$. Therefore, due to the standard Herbst argument (Proposition 6.1 in [11]), Theorem 2.5 implies a subgaussian tail for $\sup_{A\in\mathcal{A}}\|A\sigma\|$. To sum up, by Theorem 2.5 both conditions of Lemma 1.7 are satisfied, and it is left to apply Lemma 1.7, which finishes the proof of the stated inequality, where $C$ only depends on $\alpha, \rho$ from Theorem 2.5. The claim follows.
This implies the stated bound, where we used that $\operatorname{Tr}(BD) \le \operatorname{Tr}(B)\|D\|$, which holds for any pair of symmetric and nonnegative definite matrices $B, D$. The term $\mathbb{E}\sup_{A\in\mathcal{A}}\|A\sigma\|$ on the right-hand side appears in place of the term $\mathbb{E}\|A\sigma\|$ from the example above.
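For intuition, the Ising measure above can be simulated by Glauber dynamics, where each step resamples one coordinate from exactly the conditional distribution $\pi(\cdot\,|\,\sigma_{-i})$ that enters the difference operator. The sketch below (parameter choices and helper names illustrative, not from the paper) also checks Dobrushin's condition $\max_i\sum_j|J_{ij}| \le 1 - \rho$.

```python
import math, random

def glauber_step(sigma, J, h, rng):
    # resample a uniformly chosen site i from pi(. | sigma_{-i})
    n = len(sigma)
    i = rng.randrange(n)
    field = h[i] + sum(J[i][j] * sigma[j] for j in range(n) if j != i)
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))  # P(sigma_i = +1 | rest)
    sigma[i] = 1 if rng.random() < p_plus else -1

def dobrushin_ok(J, rho):
    # condition (2.3): max_i sum_j |J_ij| <= 1 - rho
    return max(sum(abs(x) for x in row) for row in J) <= 1 - rho

n = 6
J = [[0.0 if i == j else 0.08 for j in range(n)] for i in range(n)]  # row sums 0.4
h = [0.1] * n
rng = random.Random(0)
sigma = [1] * n
for _ in range(500):
    glauber_step(sigma, J, h, rng)
```

Under (2.3) the chain mixes rapidly, which is the dynamical counterpart of the concentration phenomenon used in Proposition 2.7.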
Replacing $\|\max_i|X_i|\|_{\psi_2}$ with $\max_i\|X_i\|_{\psi_2}$ in Theorem 1.1 and Lemma 1.8

Essentially, we show that it is not possible to substitute $\|\max_i|X_i|\|_{\psi_2}$ with $\max_i\|X_i\|_{\psi_2}$ in Theorem 1.1 by presenting a concrete counterexample, which was kindly suggested by Radosław Adamczak. Suppose the opposite: there is an absolute constant $C > 0$ such that for any set of matrices $\mathcal{A}$ and any subgaussian random variables $X_1,\ldots,X_n$ it holds with probability at least $1 - e^{-t}$ that
$$
Z_{\mathcal{A}}(X) \le \mathbb{E} Z_{\mathcal{A}}(X) + C\left(K\,\mathbb{E}\sup_{A\in\mathcal{A}}\|AX\|\sqrt{t} + K^2\sup_{A\in\mathcal{A}}\|A\|\,t\right), \tag{2.5}
$$
which implies a deviation inequality with some other constant $C > 0$. Notice that here we allow a multiplicative constant not equal to 1 in front of the expectation. Let us take $\mathcal{A} = \{A^{(1)},\ldots,A^{(n)}\}$ with $A^{(i)}$ having only one nonzero element $A^{(i)}_{ii} = 1$. For the sake of simplicity we take i.i.d. $X_1,\ldots,X_n$ with $\mathbb{E} X_i^2 = 1$, so that $Z_{\mathcal{A}}(X) = \max_{i\le n} X_i^2 - 1$, and (2.5) yields a moment comparison for $\max_{i\le n} X_i^2$. Note that this inequality also holds if we rescale $X_i' = \alpha X_i$ for an arbitrary $\alpha > 0$. Therefore, if $\|X_1\|_{\psi_2} \le 4\|X_1\|_{L_2}$, we can always rescale our random variables so that $\|X_1\|_{L_2} = 1$ and $\|X_1\|_{\psi_2} \le 4$, and the above inequality still holds.
Taking the latter into account, we conclude that there is a constant $D > 0$ such that if a centered random variable $X_1$ satisfies $\|X_1\|_{\psi_2} \le 4\|X_1\|_{L_2}$, then for any $n \ge 1$ the following inequality holds:
$$
\left\|\max_{i\le n} X_i^2\right\|_{L_2} \le D\left\|\max_{i\le n} X_i^2\right\|_{L_1}. \tag{2.6}
$$
It is known that such hypercontractivity of maxima implies a certain regularity of the tails of $X_1^2$: in this case, by Theorem 4.6 in [14], for any $\rho, \varepsilon > 0$ there are constants $A = A(\rho,\varepsilon)$ and $t_0 = t_0(\rho,\varepsilon)$ such that for all $t \ge t_0$,
$$
\Pr\left(X_1^2 > At\right) \le \varepsilon\,\Pr\left(X_1^2 > t\right). \tag{2.7}
$$
The latter does not have to hold for every subgaussian random variable $X_1$. For instance, take a symmetric random variable $X_1$ with $\Pr(|X_1| = 1) = 1 - e^{-r}$ and $\Pr(|X_1| = \sqrt{r}) = e^{-r}$; then $\|X_1\|_{\psi_2} \le 4\|X_1\|_{L_2}$, and the conditions of (2.6) are satisfied. But for large enough $r > At_0$ and for $t = t_0$, we have $\Pr(X_1^2 > At) = \Pr(X_1^2 > t) = e^{-r}$, therefore breaking the tail regularity (2.7). Therefore, it is impossible to establish an inequality of the form (2.5). We note that it is also possible to prove that (2.6) may not hold for the $X_1$ defined above via some direct calculations.
For the same reason it is not possible to replace $\|\max_{i\le n}|X_i|\|_{\psi_2}$ with $\max_{i\le n}\|X_i\|_{\psi_2}$ in Lemma 1.8. Indeed, suppose that for any convex $L$-Lipschitz function $f$ the corresponding bound holds with $\max_{i\le n}\|X_i\|_{\psi_2}$ in place of $\|\max_{i\le n}|X_i|\|_{\psi_2}$. Taking $f(X) = \max_{i\le n}|X_i|$, which is convex and 1-Lipschitz, we again obtain a moment comparison for the maximum. The same choice of $X_1$ implies (2.6) and leads to a contradiction.
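The two-point counterexample can be checked by exact computation: $X_1^2$ takes only the values $1$ and $r$, its $\psi_2$-norm stays comparable to $\|X_1\|_{L_2}$, yet $\mathbb{E}\max_{i\le n} X_i^2$ jumps to order $r$ once $n$ is of order $e^r$. A sketch with the illustrative choice $r = 10$:

```python
import math

r = 10.0
p_big = math.exp(-r)            # P(|X_1| = sqrt(r)); P(|X_1| = 1) = 1 - p_big

def e_max_sq(n):
    # E max_{i<=n} X_i^2 = 1 * P(all small) + r * (1 - P(all small)), exactly
    p_all_small = (1.0 - p_big) ** n
    return p_all_small + r * (1.0 - p_all_small)

L2 = math.sqrt((1.0 - p_big) + r * p_big)   # ||X_1||_{L_2}

def psi2_at_most(s):
    # checks E exp(X_1^2 / s^2) <= 2, i.e. ||X_1||_{psi_2} <= s
    return (1.0 - p_big) * math.exp(1.0 / s ** 2) + p_big * math.exp(r / s ** 2) <= 2.0
```

So the subgaussian norm is small while the rare heavy value $\sqrt{r}$ destroys the tail regularity (2.7) at level $t \sim r$.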

Proof of Theorem 1.1
In this section we assume that the components of $X$ are independent. We recall that $X_i'$ denotes an independent copy of the component $X_i$. The main tool of the proof is the modified logarithmic Sobolev inequality (see Theorem 2 in [10] or Theorem 6.15 in [11]).
For the sake of brevity we denote $Z = Z_{\mathcal{A}}(X)$ in this section. Let us set $Z_i = Z_{\mathcal{A}}(X_1,\ldots,X_{i-1},X_i',X_{i+1},\ldots,X_n)$. Then, by the symmetrized version of the inequality, we have for any $\lambda$,
$$
\operatorname{Ent}(e^{\lambda Z}) \le \lambda^2\sum_{i=1}^n \mathbb{E}\left[(Z - Z_i)_+^2\, e^{\lambda Z}\right]. \tag{3.1}
$$
The right-hand side of the inequality can be "decoupled" by the variational entropy formula (1.8), as is done in the proof of Lemma 1.7, which was presented in the introduction.
By the standard Herbst argument (see e.g. Proposition 6.1 in [11]) we obtain, for any admissible $\lambda > 0$, a bound on the moment generating function of $Z - \mathbb{E} Z$. This moment generating function bound is known to immediately imply the right-tail concentration bound (see the properties of subgamma random variables in [11]). Finally, if (1.9) holds for all $\lambda \in \mathbb{R}$, the two-sided inequality can be derived in the same way.
Remark 3.1. Note that there is also a moment version of the modified logarithmic Sobolev inequality, see e.g. Theorem 2 in [9]. By this theorem the moments of $(Z - \mathbb{E} Z)_+$ are controlled for all $q \ge 2$ in terms of the moments of $W$. If $W$ satisfies a bound equivalent to the second inequality in (1.9) up to absolute constant factors, then it holds for any $q \ge 2$ that
$$
\left\|(Z - \mathbb{E} Z)_+\right\|_{L_q} \le \sqrt{4Lq} + \sqrt{4\theta}\,q.
$$
The last inequality implies (1.10) up to absolute constant factors. We note that similar moment computations were used in [9] to analyze the Rademacher chaos. Similarly, one can introduce the moment analog of the logarithmic Sobolev inequality (see equation 3 in [5]):
$$
\left\|(f(X) - \mathbb{E} f(X))_+\right\|_{L_q} \le K\sqrt{q}\,\bigl\|\,\|\nabla f(X)\|\,\bigr\|_{L_q},
$$
where $K > 0$ is a constant, $\|\cdot\|$ stands for the standard Euclidean norm and $q \ge 2$. Now, if a matching bound on $\|\,\|\nabla f(X)\|\,\|_{L_q}$ holds (which in some cases may be derived by a second application of the moment analog of the logarithmic Sobolev inequality), this implies a result similar to (1.10).
Finally, we establish a version of our result that requires neither that $X_i$ is centered nor that $X_i$ has variance one. In this case it can happen that $\mathbb{E} X^\top AX \ne \operatorname{Tr}(A)$, but in fact the value we subtract does not really affect the concentration properties. In general, we consider the random variable
$$
Z = \sup_{A\in\mathcal{A}}\left(X^\top AX - g(A)\right), \tag{3.2}
$$
where $g : \mathbb{R}^{n\times n} \to \mathbb{R}$ is an arbitrary function.

Lemma 3.2.
Suppose that the components $X_i$ are independent but not necessarily centered, and $|X_i| \le K$ almost surely. Then, for $Z$ defined by (3.2) and for any $t \ge 1$, it holds with probability at least $1 - e^{-t}$ that
$$
Z \le \mathbb{E} Z + CK\left(\mathbb{E}\sup_{A\in\mathcal{A}}\|AX\| + \mathbb{E}\sup_{A\in\mathcal{A}}\|\operatorname{Diag}(A)X\|\right)\sqrt{t} + CK^2\sup_{A\in\mathcal{A}}\|A\|\,t,
$$
where $C$ is an absolute constant.
Proof. Let $A^*$ be the matrix that maximizes $Z(X)$ given $X$. Bounding $Z - Z_i$ in terms of $A^*$, the last step uses $|X_i - X_i'| \le 2K$; a factor 2 appears because $A^*$ is symmetric and thus $X_i$ is counted twice. Applying the triangle inequality, we bound $\sum_i (Z - Z_i)_+^2$, where $\mathbb{E}'[\cdot] = \mathbb{E}[\cdot\,|\,X]$ denotes the expectation with respect to the variables $X_1',\ldots,X_n'$ only. Thus, using $(a+b+c)^2 \le 3(a^2+b^2+c^2)$, we arrive at the first condition of (1.9). Since $|X_i| \le K$, we have by convex concentration for Lipschitz functions (see e.g. Theorem 6.10 in [11]) that $\sup_{A\in\mathcal{A}}\|AX\|$ and $\sup_{A\in\mathcal{A}}\|\operatorname{Diag}(A)X\|$ concentrate around their expectations. Since $\|\operatorname{Diag}(A)\| \le \|A\|$, we get the second condition of (1.9) with
$$
L \sim K^2\left(\mathbb{E}\sup_{A\in\mathcal{A}}\|AX\| + \mathbb{E}\sup_{A\in\mathcal{A}}\|\operatorname{Diag}(A)X\|\right)^2 \quad\text{and}\quad \theta \sim K^4\left(\sup_{A\in\mathcal{A}}\|A\|\right)^2.
$$
Therefore, due to the modified logarithmic Sobolev inequality (3.1), we can use Lemma 1.7. This provides us with the required inequality, where we can neglect the $\theta$-term in front of $\sqrt{t}$ when $t \ge 1$.
Note that our bound has the term $\mathbb{E}\sup_{A\in\mathcal{A}}\|\operatorname{Diag}(A)X\|$, which can be avoided in the case of centered variables $X_i$. Therefore, we obtain a bound matching the previous results (1.5) and (1.2).

Corollary 3.3.
Suppose that $|X_i| \le K$ almost surely and $\mathbb{E} X_i = 0$. Then, for any $t \ge 1$, it holds with probability at least $1 - e^{-t}$ that
$$
Z_{\mathcal{A}}(X) \le \mathbb{E} Z_{\mathcal{A}}(X) + CK\,\mathbb{E}\sup_{A\in\mathcal{A}}\|AX\|\sqrt{t} + CK^2\sup_{A\in\mathcal{A}}\|A\|\,t,
$$
where $C > 0$ is an absolute constant.
In the next two lemmas we show how to get rid of the diagonal term; this finishes the proof of the corollary above.

Lemma 3.5. Suppose that the components of $X$ are independent and centered. Then
$$
\mathbb{E}\sup_{A\in\mathcal{A}}\|\operatorname{Diag}(A)X\| \le C\,\mathbb{E}\sup_{A\in\mathcal{A}}\|AX\|,
$$
where below $\mathbb{E}_\varepsilon$ denotes the expectation with respect to a vector $\varepsilon$ of independent Rademacher signs given the remaining variables.

Proof. Let $X'$ be an independent copy of $X$. By the standard symmetrization argument together with Jensen's inequality and the triangle inequality, the problem reduces to the Rademacher case. Conditionally on $X - X'$, set $\mathcal{A}_{X,X'} = \{A\operatorname{diag}(X - X') : A \in \mathcal{A}\}$. Let $a_1,\ldots,a_n$ be the columns of $A$. Notice that for any matrix $A$ we have
$$
\operatorname{Diag}(A^\top A) = \operatorname{diag}(\|a_1\|^2,\ldots,\|a_n\|^2) \succeq \operatorname{diag}(A_{11}^2,\ldots,A_{nn}^2) = \operatorname{Diag}(A)^2.
$$
Therefore, by Lemma 3.4 we obtain the corresponding bound for the family $\mathcal{A}_{X,X'}$. We now want to get rid of the squares in (3.5). Let $\mathcal{B}$ be an arbitrary set of symmetric $n\times n$ matrices and fix some $B \in \mathcal{B}$. We have $\mathbb{E}\|B\varepsilon\|^2 = \|B\|_{\mathrm{HS}}^2$ and, by Khinchin's inequality with the optimal constant due to [27],
$$
\mathbb{E}\|B\varepsilon\| \ge 2^{-1/2}\|B\|_{\mathrm{HS}}.
$$
Furthermore, by the convex Poincaré inequality (Theorem 3.17, [11]), the fluctuations of $\sup_{B\in\mathcal{B}}\|B\varepsilon\|$ around its mean are controlled. The last inequality combined with (3.5) implies the analogous bound for $\mathbb{E}_\varepsilon\sup_{A\in\mathcal{A}}\|A\varepsilon\|$. Now, taking the expectation with respect to $X, X'$ and applying (3.4) again, we finish the proof.

Truncation for unbounded variables
In this section we finish the proof of Theorem 1.1. In order to apply the bounded version of our inequality, we truncate each variable $X_i$, following the approach of [1] (see the references therein for more details on various applications of this method), where it was used in the context of Talagrand's concentration inequality.
Suppose that $\|\max_i|X_i|\|_{\psi_2} < \infty$ and set $M = 8\,\mathbb{E}\max_i|X_i|$,
$$
Y_i = X_i\mathbf{1}\{|X_i| \le M\}, \qquad W_i = X_i - Y_i = X_i\mathbf{1}\{|X_i| > M\}. \tag{3.6}
$$
The variables $Y_i$ are now bounded by the value $M$. Therefore, the term corresponding to $Y$ in the resulting decomposition of $Z_{\mathcal{A}}(X)$ can be analyzed by Lemma 3.2.
To bound the rest we need to control the deviations of $W = (W_1,\ldots,W_n)$. Note that $\|W\|^2 = W_1^2 + \cdots + W_n^2$ is a sum of independent random variables with bounded $\psi_1$-norms, so we can control its expectation via the Hoffman-Jørgensen inequality. Due to the choice of the truncation level, we have by Markov's inequality
$$
\Pr\left(\max_i|X_i| > M\right) \le \frac{\mathbb{E}\max_i|X_i|}{M} \le \frac{1}{8}.
$$
Therefore, by Proposition 6.8 in [20] and Theorem 6.21 in [20] we obtain a bound on $\mathbb{E}\|W\|^2$, where $K_1$ is an absolute constant; given this bound on the expectation, the deviations of $\|W\|$ satisfy, for every $t > 0$, a subgaussian-type bound (3.8) in terms of $\|\max_{i\le n}|X_i|\|_{\psi_2}$. Taking this into account, Lemma 3.2 can be applied to the variables $Y$. As a matter of fact, we prove the following lemma, which is even slightly stronger than Lemma 1.8.

Lemma 3.6. Let $f : \mathbb{R}^n \to \mathbb{R}$ be separately convex and $L$-Lipschitz with respect to the Euclidean norm in $\mathbb{R}^n$, and let $X = (X_1,\ldots,X_n)$ be a random vector with independent components. Then it holds for $t \ge 1$ that, with probability at least $1 - e^{-t}$,
$$
f(X) \le \mathbb{E} f(X) + CL\left\|\max_i|X_i|\right\|_{\psi_2}\sqrt{t},
$$
where $C > 0$ is an absolute constant. Additionally, if $f$ is convex and $L$-Lipschitz, then for any $t > 0$ the analogous two-sided bound holds.

Proof. By convex concentration (Theorem 6.10 in [11]), for the bounded variables $Y_i$ defined by (3.6) we have a subgaussian bound for $f(Y)$ around $\mathbb{E} f(Y)$ for any $t > 0$. Moreover, due to the Lipschitz assumption, $|f(X) - f(Y)| \le L\|W\|$, and $\|W\|$ is controlled by (3.8) with probability at least $1 - e^{-t}$. Integrating these two bounds, we also get (3.10), which together implies that the claimed inequality holds with probability at least $1 - e^{-t}$. The proof of the lower-tail bound follows from Theorem 7.12 in [11] and the standard relation between the median and the expectation, which holds in our case.
From Lemma 3.6, due to the fact that $\sup_{A\in\mathcal{A}}\|AX\|$ is a convex Lipschitz function of $X$, we obtain the concentration bound (3.11) for $\sup_{A\in\mathcal{A}}\|AY\|$. Moreover, similarly to (3.10), we have (3.12). Next, we bound the difference between $\mathbb{E} Z_{\mathcal{A}}(X)$ and $\mathbb{E} Z_{\mathcal{A}}(Y)$.
Lemma 3.7. We have
$$
|\mathbb{E} Z_{\mathcal{A}}(X) - \mathbb{E} Z_{\mathcal{A}}(Y)| \lesssim \left\|\max_{i\le n}|X_i|\right\|_{\psi_2}\mathbb{E}\sup_{A\in\mathcal{A}}\|AX\|.
$$
Proof. Similarly to (3.7) we have (3.13), where by (3.8), $\mathbb{E}^{1/2}\|W\|^2 \lesssim \|\max_{i\le n}|X_i|\|_{\psi_2}$, and by (3.11) the expectation $\mathbb{E}\sup_{A\in\mathcal{A}}\|AY\|$ is bounded by a multiple of $\mathbb{E}\sup_{A\in\mathcal{A}}\|AX\|$. Taking the square root, and similarly using (3.12), we plug these estimates into (3.13) to get the required inequality.
Therefore, in (3.9) we can use Lemma 3.7 to get (3.14), and by Lemma 3.2 (neglecting the diagonal term for centered $X$ due to Lemma 3.5) we obtain (3.15). Finally, with probability at least $1 - e^{-t}$ for $t \ge 1$, we have from (3.7), (3.12) and (3.11) a bound on $Z_{\mathcal{A}}(X) - Z_{\mathcal{A}}(Y)$, which, using (3.8), turns into a bound of the required order. Combining the last inequality together with (3.14) and (3.15), we finish the proof of Theorem 1.1.
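The truncation (3.6) used in this proof is straightforward to implement. The sketch below (simulated Gaussian coordinates, crude Monte Carlo estimate of the truncation level; all choices illustrative, not from the paper) splits $X$ into the bounded part $Y$ and the remainder $W$ with $M = 8\,\mathbb{E}\max_i|X_i|$.

```python
import random

def truncate(x, M):
    # (3.6): Y_i = X_i * 1{|X_i| <= M}, W_i = X_i - Y_i
    y = [xi if abs(xi) <= M else 0.0 for xi in x]
    w = [xi - yi for xi, yi in zip(x, y)]
    return y, w

rng = random.Random(1)
n, reps = 200, 50
# crude Monte Carlo estimate of E max_i |X_i| for standard Gaussian coordinates
est = sum(max(abs(rng.gauss(0.0, 1.0)) for _ in range(n)) for _ in range(reps)) / reps
M = 8.0 * est
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y, w = truncate(x, M)
```

With this choice of $M$, Markov's inequality guarantees $\Pr(\max_i|X_i| > M) \le 1/8$, which is exactly what the Hoffman-Jørgensen step of the proof needs.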

Proof of Proposition 1.4
The proof is essentially based on the application of the following standard deviation bound instead of the concentration bound (3.11) in the proof of Theorem 1.1. Since we did not find an exact reference, we derive this inequality here.

Lemma 3.8. Suppose that $X_1,\ldots,X_n$ are independent centered random variables and $\mathcal{A}$ is a finite set of symmetric matrices. Let $G$ be a standard normal vector in $\mathbb{R}^n$. Then it holds with probability at least $1 - Ce^{-t}$ that
$$
\sup_{A\in\mathcal{A}}\|AX\| \le C\max_{i\le n}\|X_i\|_{\psi_2}\left(\mathbb{E}\sup_{A\in\mathcal{A}}\|AG\| + \sqrt{t}\sup_{A\in\mathcal{A}}\|A\|\right),
$$
where $C > 0$ is an absolute constant.
Proof. First, we observe that $\sup_{A\in\mathcal{A}}\|AX\| = \sup_{A\in\mathcal{A},\gamma\in S^{n-1}} \gamma^\top AX$. Consider the metric $\rho$ defined by $\rho(a,b) = \|a - b\|\max_{i\le n}\|X_i\|_{\psi_2}$ for any $a, b \in \mathbb{R}^n$. By Theorem 2.2.26 in [29] it holds, for $t \ge 0$ and an absolute constant $C > 0$, that with probability at least $1 - C\exp(-t)$ the supremum $\sup_{A\in\mathcal{A},\gamma\in S^{n-1}} \gamma^\top AX$ is bounded by the $\gamma_2$-functional of the index set with respect to $\rho$ plus a term of order $\sqrt{t}\sup_{A\in\mathcal{A}}\|A\|\max_{i\le n}\|X_i\|_{\psi_2}$; the functional $\gamma_2$ is also defined in [29], and for the sake of brevity we do not introduce its definition here. Finally, applying Talagrand's majorizing measure theorem to compare the $\gamma_2$-functional with the Gaussian expectation $\mathbb{E}\sup_{A\in\mathcal{A}}\|AG\|$, the claim follows.
Setting $M = 8\,\mathbb{E}\max_i|X_i|$ and $K = \max_i\|X_i\|_{\psi_2}$, consider the truncation scheme used in (3.6). Due to the assumption that all $X_i$ are symmetrically distributed, we have $\mathbb{E} Y_i = 0$. Therefore, Lemma 3.8 implies a deviation bound for $\sup_{A\in\mathcal{A}}\|AY\|$, which can be used instead of the convex concentration inequality (3.3) when dealing with the modified logarithmic Sobolev inequality. Following this proof and using the fact that $\max_i|Y_i| \le M$ almost surely, we end up with a concentration bound which holds with probability at least $1 - e^{-t}$ for any $t > 1$. Furthermore, we slightly modify the derivations of the previous section by using Lemma 3.8 instead of (3.11). In particular, we get, with probability at least $1 - e^{-t}$ for any $t > 1$, a bound on $Z_{\mathcal{A}}(X) - Z_{\mathcal{A}}(Y)$, and taking the expectation we also get $|\mathbb{E} Z_{\mathcal{A}}(X) - \mathbb{E} Z_{\mathcal{A}}(Y)| \lesssim MK\,\mathbb{E}\sup_{A\in\mathcal{A}}\|AG\|$. The claim follows from (3.7).

The matrix Bernstein inequality in the subexponential case
As we mentioned above, one of the prominent applications of uniform Hanson-Wright inequalities is a recent concentration result in the Gaussian covariance estimation problem. It is known that covariance estimation may alternatively be approached via the matrix Bernstein inequality, see e.g. [33, 21]. Following the truncation approach taken above, we provide a version of the matrix Bernstein inequality that does not require uniformly bounded matrices. The standard version of the inequality (see [30] and references therein) may be formulated as follows: consider independent centered symmetric random matrices $X_1,\ldots,X_N \in \mathbb{R}^{n\times n}$ such that almost surely $\max_i\|X_i\| \le L$. It holds that
$$
\Pr\left(\left\|\sum_{i=1}^N X_i\right\| \ge t\right) \le 2n\exp\left(-c\min\left\{\frac{t^2}{\sigma^2}, \frac{t}{L}\right\}\right),
$$
where $c$ is an absolute constant and $\sigma^2 = \left\|\sum_{i=1}^N \mathbb{E} X_i^2\right\|$. The first problem with this result is that it does not hold in the general case where only $\max_i\|\,\|X_i\|\,\|_{\psi_1}$ or $\max_i\|\,\|X_i\|\,\|_{\psi_2}$ is bounded. The second problem is that the bound depends on the dimension $n$, which does not allow one to apply this result to operators in infinite-dimensional Hilbert spaces.
For a positive-definite real square matrix $A$ we define the effective rank as $\mathbf{r}(A) = \mathrm{Tr}(A)/\|A\|$. We show the following bound.

Proposition 4.1. Consider a set of independent Hermitian matrices $X_1, \ldots, X_N \in \mathbb{C}^{n \times n}$ such that $\big\|\|X_i\|\big\|_{\psi_1} < \infty$. Set $M = \big\|\max_{i \le N} \|X_i\|\big\|_{\psi_1}$ and let the positive definite matrix defining $\sigma^2$ and the effective rank be as above. There are absolute constants $c, C, c_1 > 0$ such that for any $u \ge c_1 \max\{M, \sigma\}$ the corresponding tail bound holds.

Remark 4.2. Using the well-known bound for the maximum of subexponential random variables (see [20]), we may state the bound with $M = \log N \max_{i \le N} \big\|\|X_i\|\big\|_{\psi_1}$ up to absolute constant factors. When $n = 1$ the effective rank plays no role and our bound recovers the version of the classical Bernstein inequality due to [1]. In that paper it is also shown that the $\log N$ factor cannot be removed in general. This means that $M = \big\|\max_{i \le N} \|X_i\|\big\|_{\psi_1}$ cannot be replaced by $\max_{i \le N} \big\|\|X_i\|\big\|_{\psi_1}$ in general.
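For intuition, the effective rank is immediate to compute and can be far smaller than the ambient dimension; a minimal numerical sketch (the function name is illustrative):

```python
import numpy as np

def effective_rank(A):
    """r(A) = Tr(A) / ||A|| for a symmetric positive (semi)definite matrix,
    where ||A|| is the operator norm, i.e. the largest eigenvalue."""
    eigvals = np.linalg.eigvalsh(A)
    return eigvals.sum() / eigvals.max()

# For the identity every eigenvalue contributes equally: r(I_n) = n.
print(effective_rank(np.eye(5)))  # 5.0

# With a fast-decaying spectrum, r(A) stays bounded while the dimension grows:
A = np.diag([1.0] + [1e-3] * 99)  # 100 x 100 matrix
print(effective_rank(A))          # about 1.1, far below the dimension 100
```

This is why bounds stated in terms of $\mathbf{r}(A)$ remain meaningful for operators on infinite-dimensional Hilbert spaces, where $n$ itself is useless.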
Proof. Fix $U > 0$ and consider the decomposition $X_i = Y_i + Z_i$, chosen so that the matrices $Y_i$ are uniformly bounded by $U$ in the operator norm. By the triangle inequality and the union bound, the two parts can be treated separately. Throughout this proof $c > 0$ is an absolute constant which may change from line to line. It is known that uniformly bounded random matrices satisfy a Bernstein-type inequality (see Theorem 3.1 in [24]), where we used $\|Y_i\| \le U$. However, since we want to present this bound in terms of the $X_i$ rather than the $Y_i$, we need the following modification of the proof of Minsker's theorem. Using the notation of his proof, the corresponding moment bound follows from Lemma 3.1 in [24], where $\phi(t) = e^t - t - 1$. Now, following the same lines of the proof, we replace formula (3.4) and lines (3.5) by their analogues under the corresponding condition on $\sigma^2$. Following the last lines of the proof of Theorem 3.1, we finally obtain the claimed tail bound for $u \ge C\max\{U, \sigma\}$.
We proceed with the analysis of the $Z_i$. Set $U = 8\,\mathbb{E}\max_{i \le N} \|X_i\|$; then, by Markov's inequality, the truncation level is exceeded with probability at most $1/8$. Thus, we can apply Proposition 6.8 from [20] to the $Z_i$, viewed as taking values in the Banach space $(\mathbb{C}^{n \times n}, \|\cdot\|)$ equipped with the spectral norm. This implies that for some absolute constant $K > 0$ the norms $\big\|\|Z_i\|\big\|_{\psi_1}$ are controlled. Using Theorem 6.21 from [20] in $(\mathbb{C}^{n \times n}, \|\cdot\|)$ we obtain a tail estimate with absolute constants $K_1, K_2 > 0$. This implies that for any $u \ge \max_{i \le N} \big\|\|Z_i\|\big\|_{\psi_1}$ we have a subexponential tail bound, where $c > 0$ is an absolute constant. Combining this with (4.1) and the bounds above, for some absolute constants, proves the claim.
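The two-step structure of the proof, a bounded part handled by Minsker's inequality and a tail part handled in the Banach space $(\mathbb{C}^{n\times n}, \|\cdot\|)$, rests on a truncation of each matrix at level $U$. A minimal sketch of one such splitting (the exact scheme and the truncation level in the paper may differ; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def truncate(X, U):
    """One simple truncation consistent with the proof sketch: X = Y + Z,
    where Y = X when ||X||_op <= U (so ||Y|| <= U uniformly) and Z = X
    otherwise, carrying the heavy tail."""
    if np.linalg.norm(X, 2) <= U:
        return X, np.zeros_like(X)
    return np.zeros_like(X), X

# Matrices with subexponential operator norm: centred rank-one Gaussian outer products.
n, N, U = 5, 200, 12.0
mats = []
for _ in range(N):
    g = rng.standard_normal(n)
    mats.append(np.outer(g, g) - np.eye(n))

pairs = [truncate(X, U) for X in mats]
# The bounded parts obey the uniform operator-norm bound, and the split is exact.
assert all(np.linalg.norm(Y, 2) <= U for Y, _ in pairs)
assert all(np.allclose(Y + Z, X) for (Y, Z), X in zip(pairs, mats))
```

With $U$ of the order of $\mathbb{E}\max_i \|X_i\|$, the tail parts $Z_i$ are zero for most samples, which is what makes the Banach-space tail estimates effective.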
To the best of our knowledge, Proposition 4.1 is the first to combine two important properties: it simultaneously captures the effective rank instead of the dimension $n$ and is valid for matrices with subexponential operator norm (in the unbounded case the matrix Bernstein inequality was previously available under the so-called Bernstein moment condition; we refer to [30] and the references therein). We should also compare our result with Proposition 2 in [16]. That inequality has the same form as our bound, but the original dimension $n$ is used instead of the effective rank, and $M = \big\|\max_{i \le N} \|X_i\|\big\|_{\psi_1}$ is replaced by $\max_{i \le N} \big\|\|X_i\|\big\|_{\psi_1} \log\big(N \max_{i \le N} \big\|\|X_i\|\big\|_{\psi_1}^2 / \sigma^2\big)$.

An application to covariance estimation with missing observations
Now we turn to the problem studied in [17] and [21]. Suppose we want to estimate the covariance structure of a random subgaussian vector $X \in \mathbb{R}^n$ (which will be assumed centered) based on $N$ i.i.d. observations $X_1, \ldots, X_N$. For the sake of brevity we work in the finite-dimensional case, although, as in [17], our results do not depend explicitly on the dimension $n$. Recall that a centered random vector $X \in \mathbb{R}^n$ is subgaussian if for all $u \in \mathbb{R}^n$ we have
$$\|\langle X, u\rangle\|_{\psi_2} \lesssim \big(\mathbb{E}\langle X, u\rangle^2\big)^{1/2}. \tag{4.2}$$
Observe that this definition does not require any independence of the components of X.
In what follows we discuss a more general framework suggested in [21]. Let $\delta_{i,j}$, $i \le N$, $j \le n$, be independent Bernoulli random variables with the same mean $\delta$. We assume that instead of observing $X_1, \ldots, X_N$ we observe vectors $Y_1, \ldots, Y_N$ defined by $Y_i^j = \delta_{i,j} X_i^j$. This means that some components of the vectors $X_1, \ldots, X_N$ are missing (replaced by zero), each with probability $1 - \delta$. Since $\delta$ can easily be estimated, we assume it to be known. Following [21], denote $\Sigma^{(\delta)} = N^{-1}\sum_{i=1}^N Y_i Y_i^\top$ and set $\widehat{\Sigma} = \delta^{-1}\mathrm{Diag}(\Sigma^{(\delta)}) + \delta^{-2}\mathrm{Off}(\Sigma^{(\delta)})$. It can easily be shown that $\widehat{\Sigma}$ is an unbiased estimator of $\Sigma = \mathbb{E}X_i X_i^\top$. In particular, we obtain the deviation bound below.

Remark 4.4. The upper bound above provides some important improvements upon Proposition 3 in [21], stated in (4.4). The bound (4.4) depends on $n$ and therefore is not applicable in infinite-dimensional scenarios. It also contains a term proportional to $t^2$, which appears due to a straightforward truncation of each observation. Moreover, that result has an unnecessary factor $\mathbf{r}(\Sigma)$ in the term $\mathbf{r}(\Sigma)t/(N\delta^2)$. Finally, when $\delta = 1$, tighter results may be obtained using high-probability generic chaining bounds for quadratic processes; in particular, Theorem 9 in [17] implies a bound (4.5) that holds with probability at least $1 - e^{-t}$. Unfortunately, that analysis cannot be applied for $\delta < 1$ in general, since assumption (4.2) does not hold for the vector $Y$ defined by $Y_i^j = \delta_{i,j} X_i^j$. Therefore, our technique is a reasonable alternative that works for general $\delta$ and is almost as tight as (4.5) when $\delta = 1$. We also remark that for $\delta = 1$ even sharper versions of (4.5) were obtained in [23].
However, their statistical procedure differs from the sample covariance matrix $\widehat{\Sigma}$.
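The debiasing behind the estimator of [21] can be checked numerically: since $\mathbb{E}\,\delta_{i,j}\delta_{i,k} = \delta^2$ for $j \ne k$ and $= \delta$ for $j = k$, rescaling the off-diagonal of the masked sample second moment by $\delta^{-2}$ and its diagonal by $\delta^{-1}$ removes the bias. A minimal Monte Carlo sketch (function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def masked_covariance_estimator(Y, delta):
    """Debiased covariance estimator from masked observations Y_i:
    rescale the diagonal of the masked sample second moment by 1/delta
    and the off-diagonal by 1/delta^2."""
    S = Y.T @ Y / len(Y)
    D = np.diag(np.diag(S))
    return (S - D) / delta**2 + D / delta

n, N, delta = 4, 200_000, 0.7
Lmat = rng.standard_normal((n, n))
Sigma = Lmat @ Lmat.T                      # ground-truth covariance
X = rng.standard_normal((N, n)) @ Lmat.T   # centred Gaussian sample
mask = rng.random((N, n)) < delta          # independent Bernoulli(delta) coordinates
Y = X * mask                               # observations with missing entries

err = np.abs(masked_covariance_estimator(Y, delta) - Sigma).max()
print(err)  # small: the estimator is unbiased and N is large
```

Without the rescaling, the naive sample second moment of $Y$ would converge to $\delta^2\,\mathrm{Off}(\Sigma) + \delta\,\mathrm{Diag}(\Sigma)$ rather than to $\Sigma$.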
$\lesssim \mathrm{Tr}(\Sigma)$, and the same bound holds for $\big\|\|\mathrm{Diag}(YY^\top)\|\big\|_{\psi_1}$.
Let $A$ be an arbitrary symmetric matrix and let us calculate $\mathbb{E}(A \circ \delta\delta^\top)^2$, where $\circ$ denotes the Hadamard product and $\delta = (\delta_1, \ldots, \delta_n)$ is a vector with independent components having the Bernoulli distribution with the same mean $\delta$. A direct computation can be put together into a single expression in which all the matrices appearing are positive semi-definite, apart from the term $\mathrm{Off}(A)\mathrm{Diag}(A) + \mathrm{Diag}(A)\mathrm{Off}(A)$, which we can obviously bound by $\tfrac{1}{2}\big(\mathrm{Off}(A) + \mathrm{Diag}(A)\big)^2 = A^2/2$. Taking into account $\delta \le 1$, we arrive at the bound (4.6). Recall that $Y = \mathrm{diag}(\delta)X$. Therefore, we have $\mathrm{Off}(YY^\top) = \delta\delta^\top \circ \mathrm{Off}(XX^\top)$. Since the latter has zero diagonal, the term with $\delta$ in the formula above disappears. It holds that $\mathbb{E}\,\mathrm{Off}(XX^\top)^2 \preceq 2\,\mathbb{E}(XX^\top)^2 + 2\,\mathbb{E}\,\mathrm{Diag}(XX^\top)^2$, and we also have from [21] that $\mathbb{E}(XX^\top)^2 \preceq C\,\mathrm{Tr}(\Sigma)\Sigma$. Additionally, due to (4.2) we immediately have $\mathbb{E}X_i^4 \lesssim \Sigma_{ii}^2$. Finally, the following bound holds: $\mathbb{E}\,\mathrm{Diag}(XX^\top)^2 \preceq C\,\mathrm{Diag}(\Sigma)^2 \preceq C\,\mathrm{Tr}(\Sigma)\,\mathrm{Diag}(\Sigma)$.
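The first-moment analogue of this Hadamard-product computation, $\mathbb{E}(A \circ \delta\delta^\top) = \delta\,\mathrm{Diag}(A) + \delta^2\,\mathrm{Off}(A)$, can be verified exactly by enumerating all $2^n$ mask vectors; a small sketch (function name is illustrative):

```python
import numpy as np
from itertools import product

def expected_hadamard(A, delta):
    """E[A ∘ (dd^T)] for d with i.i.d. Bernoulli(delta) coordinates,
    computed exactly by enumerating all 2^n mask vectors."""
    n = A.shape[0]
    out = np.zeros_like(A, dtype=float)
    for bits in product([0.0, 1.0], repeat=n):
        d = np.array(bits)
        k = d.sum()
        p = delta**k * (1 - delta)**(n - k)   # probability of this mask
        out += p * (A * np.outer(d, d))
    return out

A = np.array([[2.0, 1.0, -1.0],
              [1.0, 3.0, 0.5],
              [-1.0, 0.5, 4.0]])
delta = 0.6
D = np.diag(np.diag(A))
# Since E d_i d_j = delta^2 for i != j and delta for i = j:
closed_form = delta * D + delta**2 * (A - D)
assert np.allclose(expected_hadamard(A, delta), closed_form)
```

The same diagonal/off-diagonal split drives the second-moment bound above: diagonal entries pick up a factor $\delta$, off-diagonal entries a factor $\delta^2$.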
Plugging these bounds into (4.6) we get the second inequality. As for the diagonal case, for $A = \mathrm{Diag}(XX^\top)$ we obtain the analogous bound.

Lemma 4.6. For $Y$ as in Lemma 4.5 and any unit vector $u \in \mathbb{R}^n$ we have the following moment bound.

Proof. First, we want to check the initial moment estimate. Obviously, $\|u^\top XX^\top u\|_{L_4} \le \|u^\top X\|_{L_8}^2$.
Next, consider a vector $a \in \mathbb{R}^n$. We show the following bound. First, using $(x + y)^2 \le 2x^2 + 2y^2$, we split the quantity of interest into two terms. To the first term we apply the decoupling inequality (Theorem 6.1.1 in [32]): namely, defining $\delta_1', \ldots, \delta_n'$ as independent copies of $\delta_1, \ldots, \delta_n$, we pass to the decoupled sum, in which only the terms with $k = i$ and $l = j$ do not vanish. It remains to bound the second term.
The last term is simple to bound. Furthermore, we can rewrite the centred sum as a sum of independent random variables. Finally, we can apply (4.8) with $a = \mathrm{diag}(u)X$. Due to (4.2) we have $\mathbb{E}^{1/4}(u^\top X)^4 \lesssim \|\Sigma\|^{1/2}$. Moreover, the vector $\mathrm{diag}(u)X$ also satisfies the subgaussian assumption (4.2) and has the covariance matrix $\mathrm{diag}(u)\Sigma\,\mathrm{diag}(u)$. Therefore, using that $\|u\| = 1$, we conclude the desired bound. Finally, we have the analogous estimate for the diagonal term.

Before we present the proof of the deviation bound, let us recall the following version of Talagrand's concentration inequality for empirical processes. Remarkably, this result can be proven using very similar techniques: first, one may use the modified logarithmic Sobolev inequality to prove a version of Talagrand's concentration inequality in the bounded case, and then apply the same truncation argument as in the proof of Theorem 1.1 to obtain the result in the unbounded case.

Theorem 4.7 (Theorem 4 in [1]). Let $X_1, \ldots, X_N \in \mathcal{X}$ be a sample of independent random variables and let $\mathcal{F}$ be a countable class of measurable functions $\mathcal{X} \to \mathbb{R}$ satisfying (4.9); set $\sigma^2 = \sup_{f \in \mathcal{F}} \sum_{i=1}^N \mathbb{E}f^2(X_i)$. There is an absolute constant $C > 0$ such that the corresponding concentration bound holds.

Proof of the deviation bound. By the triangle inequality,
$$\|\widehat{\Sigma} - \Sigma\| \le \delta^{-1}\big\|\mathrm{Diag}(\Sigma^{(\delta)}) - \mathbb{E}\,\mathrm{Diag}(\Sigma^{(\delta)})\big\| + \delta^{-2}\big\|\mathrm{Off}(\Sigma^{(\delta)}) - \mathbb{E}\,\mathrm{Off}(\Sigma^{(\delta)})\big\|.$$
The following simple lemma shows how to apply the logarithmic Sobolev inequality to non-differentiable functions.
Lemma A.1. Suppose that $X$ satisfies Assumption 1.6. Let $f : \mathbb{R}^n \to \mathbb{R}$ be such that
$$|f(x) - f(y)| \le \|x - y\|\max(L(x), L(y))$$
for some continuous non-negative function $L$. Then for some absolute constant $C > 0$ and any $\lambda \in \mathbb{R}$ it holds that
$$\mathrm{Ent}\big(e^{\lambda f}\big) \le C K^2 \lambda^2\, \mathbb{E}\,L(X)^2 e^{\lambda f(X)}.$$

Proof. Take $F(x) = e^{\lambda f(x)/2}$ and consider a sequence of smooth approximations $F_m(x) = \int \varphi_m(x - u) F(u)\,du$, so that $F_m(x)$ tends to $F(x)$ pointwise; this is possible due to the fact that $F$ is continuous. Moreover, since $\varphi_m(u)$ vanishes for $\|u\| \ge 1/m$, the convolution only involves the values of $F$ in the ball of radius $1/m$ around $x$. It is easy to see that
$$|F(x) - F(y)| = \big|e^{\lambda f(x)/2} - e^{\lambda f(y)/2}\big| \le \frac{|\lambda|}{2}\|x - y\|\max\big(e^{\lambda f(x)/2}, e^{\lambda f(y)/2}\big)\max(L(x), L(y)),$$
and therefore the local Lipschitz constant of $F_m$ at $x$ is bounded by $\frac{|\lambda|}{2}\bar{F}_m(x)\bar{L}_m(x)$, where we set $\bar{L}_m(x) = \sup_{y\,:\,\|x - y\| \le m^{-1}} L(y)$ and $\bar{F}_m(x) = \sup_{y\,:\,\|x - y\| \le m^{-1}} e^{\lambda f(y)/2}$, which tend pointwise to $L(x)$ and $F(x)$, respectively, as $m \to \infty$.
Since each function $F_m$ is smooth, we have by Assumption 1.6 the corresponding entropy bound. Taking the limit $m \to \infty$ proves the required inequality.
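The mollification step above can be illustrated numerically: convolving a continuous but non-differentiable function with a compactly supported bump $\varphi_m$ yields smooth approximations converging pointwise as $m \to \infty$. A one-dimensional sketch (the grid-based quadrature and function names are illustrative):

```python
import numpy as np

def mollify(f, x, m, grid_pts=4001):
    """Approximate F_m(x) = integral of phi_m(x - u) f(u) du, where phi_m is a
    standard bump mollifier supported on [-1/m, 1/m], via a Riemann sum."""
    u = np.linspace(x - 1.0 / m, x + 1.0 / m, grid_pts)
    du = u[1] - u[0]
    t = (x - u) * m
    bump = np.where(np.abs(t) < 1.0,
                    np.exp(-1.0 / np.maximum(1.0 - t**2, 1e-12)), 0.0)
    bump /= bump.sum() * du          # normalize: the discrete phi_m integrates to 1
    return (bump * f(u)).sum() * du

# f(x) = |x| is continuous but not differentiable at 0.
# Away from the kink the mollification is already exact, by symmetry of phi_m:
print(mollify(np.abs, 0.3, 10))      # about 0.3 (support [0.2, 0.4] misses the kink)
# At the kink, F_m(0) > 0 but shrinks as the support [-1/m, 1/m] collapses:
for m in (10, 100, 1000):
    print(m, mollify(np.abs, 0.0, m))
```

The shrinking values at the kink mirror the pointwise convergence $\bar{L}_m(x) \to L(x)$, $\bar{F}_m(x) \to F(x)$ used to close the proof.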