Tail inequalities for sums of random matrices that depend on the intrinsic dimension

This work provides exponential tail inequalities for sums of random matrices that depend only on intrinsic dimensions rather than explicit matrix dimensions.  These tail inequalities are similar to the matrix versions of the Chernoff bound and Bernstein inequality except with the explicit matrix dimensions replaced by a trace quantity that can be small even when the explicit dimensions are large or infinite.  Some applications to covariance estimation and approximate matrix multiplication are given to illustrate the utility of the new bounds.


Introduction
Sums of random matrices arise in many statistical and probabilistic applications, and hence their concentration behavior is of significant interest. Surprisingly, the classical exponential moment method used to derive tail inequalities for scalar random variables carries over to the matrix setting when augmented with certain matrix trace inequalities. This fact was first discovered by Ahlswede and Winter [1], who proved a matrix version of the Chernoff bound using the Golden-Thompson inequality [7,23]: $\operatorname{tr} \exp(A + B) \le \operatorname{tr}(\exp(A) \exp(B))$ for all symmetric matrices $A$ and $B$. Later, it was demonstrated that the same technique could be adapted to yield analogues of other tail bounds such as Azuma's and Bernstein's inequalities [3,9,19,8,16,15]. Recently, a theorem due to Lieb [12] was identified by Tropp [25,24] to yield sharper versions of this general class of tail bounds. Altogether, these results have proved invaluable in constructing and simplifying many probabilistic arguments concerning sums of random matrices.
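As a quick numerical illustration of the Golden-Thompson inequality quoted above, the following sketch checks it on one pair of 2 × 2 symmetric matrices. The matrices and the helper names (`matmul`, `expm`, `trace`) are illustrative choices, not part of the original text, and the matrix exponential is a plain truncated Taylor series that is only adequate for small matrices of small norm.

```python
import math

def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(a, terms=40):
    """Matrix exponential via a truncated Taylor series (small norms only)."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms):
        # After this step, term holds a^k / k!.
        term = [[x / k for x in row] for row in matmul(term, a)]
        result = [[r + s for r, s in zip(res_row, term_row)]
                  for res_row, term_row in zip(result, term)]
    return result

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

# Two symmetric matrices that do not commute (hypothetical example).
A = [[1.0, 0.0], [0.0, -1.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
A_plus_B = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

lhs = trace(expm(A_plus_B))            # tr exp(A + B)
rhs = trace(matmul(expm(A), expm(B)))  # tr(exp(A) exp(B))
print(lhs <= rhs)
```

For this particular pair, the two sides have closed forms, $2\cosh(\sqrt{2})$ and $2\cosh^2(1)$, so the gap between them is strict because $A$ and $B$ do not commute.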
One deficiency of many of these previous inequalities is their dependence on the explicit matrix dimension (this is reviewed in Section 3.3), which prevents their application to infinite dimensional spaces that arise in a variety of data analysis tasks (e.g., [22,18,6,2,11]). In this work, we prove analogous results where the dimension is replaced with a trace quantity that can be small even when the explicit matrix dimension is large or infinite. For instance, in our matrix generalization of Bernstein's inequality, the (scaled) trace of the second moment matrix appears instead of the matrix dimension. Such trace quantities can often be regarded as notions of intrinsic dimension. The price for this improvement is that the more typical exponential tail $e^{-t}$ for $t > 0$ is replaced with a slightly weaker tail $t(e^t - t - 1)^{-1} \approx e^{-t + \log t}$. As $t$ becomes large, the difference becomes negligible. For instance, if $t \ge 2.6$, then $t(e^t - t - 1)^{-1} \le e^{-t/2}$.

There are some previous works that also give tail inequalities for sums of random matrices that do not depend on the explicit matrix dimension, at least in some special cases. For instance, in the case where the summands all have rank one, Oliveira [16] gives a bound where the dimension is replaced by the number of summands. Rudelson and Vershynin [21] also prove similar exponential tail inequalities for sums of rank-one matrices using non-commutative Khinchine moment inequalities by way of a key inequality of Rudelson [20]; the extension to higher rank random matrices is not explicitly worked out, but may be possible. Indeed, Magen and Zouzias [14] pursue this direction, but their argument is complicated and falls short of giving an exponential tail inequality; this point is discussed in Section 3.3.
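The claim about the weakened tail $t(e^t - t - 1)^{-1}$ can be checked numerically. The sketch below is only a grid check, not a proof; the function name `weak_tail` and the grid ranges are illustrative assumptions.

```python
import math

def weak_tail(t):
    """The weakened tail t / (e^t - t - 1), defined for t > 0."""
    return t / (math.expm1(t) - t)  # expm1(t) = e^t - 1

# Check t / (e^t - t - 1) <= e^{-t/2} on a grid starting at t = 2.6.
grid = [2.6 + 0.01 * i for i in range(5000)]
holds_from_2_6 = all(weak_tail(t) <= math.exp(-t / 2) for t in grid)

# The same comparison fails slightly below the threshold, e.g. at t = 2.5.
fails_at_2_5 = weak_tail(2.5) > math.exp(-2.5 / 2)

# The weakened tail is always larger than the usual exponential tail e^{-t}.
small_grid = [0.1 * i for i in range(1, 500)]
always_weaker = all(weak_tail(t) > math.exp(-t) for t in small_grid)

print(holds_from_2_6, fails_at_2_5, always_weaker)
```

The comparison at $t = 2.5$ suggests that the constant 2.6 is close to the smallest value for which the simplified bound $e^{-t/2}$ is valid.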
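To concretely compare the new technique to previous results based on the matrix exponential moment method, consider the sum $\sum_{i=1}^n \gamma_i A_i$, where $A_1, A_2, \dots, A_n$ are fixed symmetric $d \times d$ matrices, and $\gamma_1, \gamma_2, \dots, \gamma_n$ are independent standard normal random variables. Tropp gives the following tail bound for the largest eigenvalue of the sum (Theorem 4.1 in [25]):

$$\Pr\left[ \lambda_{\max}\left( \sum_{i=1}^n \gamma_i A_i \right) > \sqrt{2 \lambda_{\max}(\Sigma)\, t} \right] \le d \cdot e^{-t},$$

where $\Sigma := \sum_{i=1}^n A_i^2$. Combining Theorem 3.2 with Lemma 4.3 in [25] gives the following new tail bound:

$$\Pr\left[ \lambda_{\max}\left( \sum_{i=1}^n \gamma_i A_i \right) > \sqrt{2 \lambda_{\max}(\Sigma)\, t} \right] \le \frac{\operatorname{tr}(\Sigma)}{\lambda_{\max}(\Sigma)} \cdot \frac{t}{e^t - t - 1}.$$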
Note that $\operatorname{tr}(\Sigma)/\lambda_{\max}(\Sigma) \le d$ but $t(e^t - t - 1)^{-1} > e^{-t}$ for $t > 0$, so neither bound always dominates the other. However, for moderately large values of $t$, and when $d \gg \operatorname{tr}(\Sigma)/\lambda_{\max}(\Sigma)$, the new bound is a significant improvement.
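To make the trade-off between the two prefactors concrete, the following sketch compares $d \cdot e^{-t}$ with $(\operatorname{tr}(\Sigma)/\lambda_{\max}(\Sigma)) \cdot t(e^t - t - 1)^{-1}$ for a hypothetical matrix $\Sigma$ specified only through its eigenvalues. The decaying spectrum $1/i^2$, the dimension 10,000, and the function names are assumptions made for illustration.

```python
import math

def weak_tail(t):
    """The weakened tail t / (e^t - t - 1), defined for t > 0."""
    return t / (math.expm1(t) - t)

def intrinsic_dimension(eigenvalues):
    """tr(Sigma) / lambda_max(Sigma) for a PSD matrix given by its eigenvalues."""
    return sum(eigenvalues) / max(eigenvalues)

d = 10_000
eigs = [1.0 / (i * i) for i in range(1, d + 1)]  # hypothetical decaying spectrum
k = intrinsic_dimension(eigs)                    # close to pi^2 / 6, far below d

for t in (1.0, 5.0, 10.0):
    dimension_bound = d * math.exp(-t)   # prefactor d, tail e^{-t}
    intrinsic_bound = k * weak_tail(t)   # prefactor tr/lambda_max, weaker tail
    print(t, dimension_bound, intrinsic_bound)

# With a flat spectrum the intrinsic dimension equals d, and the weaker
# tail makes the intrinsic bound the larger of the two.
flat = [1.0] * 50
flat_k = intrinsic_dimension(flat)
```

In the decaying case the dimension-dependent bound exceeds 1 at $t = 5$ (so it is vacuous), while the intrinsic-dimension bound is already well below 1; in the flat case the ordering reverses, which matches the remark that neither bound dominates.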
The following lemma due to Tropp [24] is a matrix generalization of a scalar result due to Freedman [5] (see also [28]), where the key is the invocation of Theorem 2.1. We give the proof for completeness.

Lemma 2.2 ([24]). Let $I$ be the identity matrix for the range of the $X_i$. Then

E tr exp
Proof. The proof is by induction on $n$. The claim holds trivially for $n = 0$. Now fix $n \ge 1$, and assume as the inductive hypothesis that (2.1) holds with $n$ replaced by $n - 1$. In this case, E tr exp where the first inequality follows from Theorem 2.1 and Jensen's inequality, and the second inequality follows from the inductive hypothesis.
3 Exponential tail inequalities for sums of random matrices

A generic inequality
We first state a generic inequality based on Lemma 2.2. This differs from earlier approaches, which instead combine Markov's inequality with a result similar to Lemma 2.2 (cf. Theorem 3.6 in [25]).

Theorem 3.1. For any $\eta \in \mathbb{R}$ and any $t > 0$, $\cdot\, (e^t - t - 1)^{-1}$.

Proof. Note that $g(x) := e^x - x - 1$ is nonnegative for all $x \in \mathbb{R}$ and increasing for $x \ge 0$. Letting $\{\lambda_i(A)\}$ denote the eigenvalues of $A$, we have where the last inequality $\mathbb{E}[\operatorname{tr}(\exp(A) - I)] \le 0$ follows from Lemma 2.2.
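The two properties of $g$ used in the proof suggest a Markov-type argument along the following lines. This is a sketch for a generic random symmetric matrix $A$ satisfying $\mathbb{E}[\operatorname{tr}(\exp(A) - I)] \le 0$, with eigenvalues $\lambda_i(A)$ and a fixed $t > 0$; it assumes the traces involved are well defined and is not claimed to reproduce the omitted display verbatim.

```latex
% Step 1: since g >= 0 everywhere and g is increasing on [0, infinity),
% the event {lambda_max(A) > t} forces g(lambda_max(A)) >= g(t) > 0.
\mathbf{1}\{\lambda_{\max}(A) > t\}
  \;\le\; \frac{g(\lambda_{\max}(A))}{g(t)}
  \;\le\; \frac{\sum_i g(\lambda_i(A))}{g(t)}
  \;=\; \frac{\operatorname{tr}(\exp(A) - A - I)}{e^t - t - 1}.

% Step 2: take expectations and split the trace.
\Pr[\lambda_{\max}(A) > t]
  \;\le\; \frac{\mathbb{E}[\operatorname{tr}(-A)]
          + \mathbb{E}[\operatorname{tr}(\exp(A) - I)]}{e^t - t - 1}
  \;\le\; \frac{\mathbb{E}[\operatorname{tr}(-A)]}{e^t - t - 1}.
```

The second inequality in Step 1 uses only the nonnegativity of $g$, which is why the bound involves a trace rather than the number of eigenvalues.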
When $\sum_{i=1}^n X_i$ has zero mean, the first sum in the right-hand side of the inequality from Theorem 3.1 vanishes, so the trace is only over a sum of matrix logarithmic moment generating functions.
For an appropriate choice of $\eta$, this trace quantity can be small even when the $X_i$ have large or infinite explicit dimension.

Some specific bounds
We now give some specific bounds as corollaries of Theorem 3.1. The proofs use Theorem 3.1 together with some techniques from previous works (e.g., [1,25]) to yield new tail inequalities that depend on intrinsic notions of dimension rather than the explicit matrix dimensions.
First, we give a bound under a subgaussian-type condition on the distribution.

Theorem 3.2 (Matrix subgaussian bound).
If there exist $\sigma > 0$ and $k > 0$ such that for all $i = 1, \dots, n$, , and for all $\eta > 0$ almost surely; then for any $t > 0$,

Proof. We fix $\eta := 2t/(\sigma^2 n)$. By Theorem 3.1, we obtain By the subadditivity of the map $M \mapsto \lambda_{\max}(M)$ (i.e., $\lambda_{\max}(A) \le \lambda_{\max}(B) + \lambda_{\max}(A - B)$),

We can also give a Bernstein-type bound based on moment conditions. For simplicity, we just state the bound in the case that the $\lambda_{\max}(X_i)$ are bounded almost surely.

Theorem 3.3 (Matrix Bernstein bound).
If there exist $b > 0$, $\sigma > 0$, and $k > 0$ such that for all $i = 1, \dots, n$, almost surely; then for any $t > 0$, and therefore, by the operator monotonicity of the matrix logarithm and the fact $\log(1 + x) \le x$, )) for $0 \le x < 3$, we have by Theorem 3.1 and the subadditivity of the map $M \mapsto \lambda_{\max}(M)$, gives the desired bound.

Discussion
The results of this paper can be viewed as another sharpening of the matrix exponential moment method for deriving exponential tail inequalities for sums of random matrices, which has its origins in the work of Ahlswede and Winter [1] and was subsequently generalized and improved by many others [3,9,19,8,16,15,25,24]. The novel feature of our results when compared to previous results is the absence of explicit dependence on the matrix dimensions. Indeed, nearly all previous tail inequalities using the exponential moment method (either via the Golden-Thompson inequality or Lieb's trace inequality) are roughly of the form $d \cdot e^{-t}$ when the matrices in the sum are $d \times d$ [1,3,9,19,8,15,25,24]. For instance, a corollary of the "Master Tail Bound for Independent Sums" of Tropp (Theorem 3.6 in [25]) can be written as for all $t > 0$ and $\eta > 0$ (see Corollary 3.7 in [25]). Of course, when the random matrices are always confined to a single lower-dimensional space, then these previous results clearly depend only on this lower dimension (i.e., $d$ is replaced by this lower dimension). However, this situation is significantly less general than what is required in many applications that involve very high or infinite dimensional matrices, such as the analysis of ridge regression [11], kernel methods [22,6,2], and Gaussian process methods [18]. Our results therefore widen the applicability of the matrix exponential moment method to handle these cases.
Relative to the tail inequalities of Rudelson and Vershynin [21] and Oliveira [16], we note that our inequalities apply to random matrices of any rank, rather than just rank-one (or low-rank) random matrices. Although [21] and [16] only explicitly provide inequalities for the rank-one case, the work of Magen and Zouzias [14] gives an extension that applies to higher rank random matrices. [14] considers the specific case where $X_1, \dots, X_n$ are i.i.d. copies of a random matrix $X$ which satisfies $\|\mathbb{E}[X]\|_2 \le 1$, $\|X\|_2 \le \gamma$, and $\operatorname{rank}(X) \le r$ almost surely; their bound has the following form: for some unspecified positive constants $c_1$ and $c_2$. It should be noted, however, that the right-hand side decreases only polynomially in $t$ rather than exponentially, which is qualitatively weaker than the previous results of [21] and [16]; therefore, it is not a strict improvement or generalization. One disadvantage of our technique is that in finite dimensional settings, the relevant trace quantity that replaces the dimension may turn out to be of the same order as the dimension $d$ (an example of such a case is discussed next). In such cases, the resulting tail bound from Theorem 3.3 (say) of $k \cdot t(e^t - t - 1)^{-1}$ is looser than the $d \cdot e^{-t}$ tail bound provided by earlier techniques [25], and this can be significant for small values of $t$.
We note that the general matrix exponential moment method used here and in previous work leads to a significantly suboptimal tail inequality in some cases. This was pointed out in [25], but we elaborate on it here further. Suppose $x_1, \dots, x_n \in \{\pm 1\}^d$ are i.i.d. random vectors with independent Rademacher entries: each coordinate of $x_i$ is $+1$ or $-1$ with equal probability. Let . In this case, Theorem 3.3 implies the bound On the other hand, because the $x_i$ have subgaussian projections, it is known that (see Lemma A.1 in Appendix A). First, this latter inequality removes the $d$ factor on the right-hand side. But more importantly, the deviation term $t$ does not scale with $d$ in this inequality, whereas it does in the former. Thus this latter bound provides a much stronger exponential tail: roughly put, Pr[λ max ( )) for some constant $c > 0$ (note that the dimension $d$ does not appear in the exponent); the probability bound from Theorem 3.3 is only of the form $\exp(-\Omega((n/d) \min(\tau, \tau^2)))$. The sub-optimality of Theorem 3.3 is shared by all other existing tail inequalities proved using this exponential moment method. The issue may be related to the asymptotic freeness of the $d \times n$ random matrix (the fact that nearly all high-order moments of random matrices with independent entries vanish asymptotically), which is not exploited in the matrix exponential moment method. This means that the proof technique in the exponential moment method overestimates the contribution of high-order matrix moments that should have vanished. Formalizing this discrepancy would help clarify the limits of this technique, but the task is beyond the scope of this paper. It is also worth mentioning that this phenomenon only appears to hold when the $x_i$ have independent entries (and other similar cases). In cases with correlated entries, our bound is close to best possible in the worst case.

Examples
For a matrix $M$, let $\|M\|_F$ denote its Frobenius norm, and let $\|M\|_2$ denote its spectral norm. If $M$ is symmetric, then $\|M\|_2 = \max\{\lambda_{\max}(M), -\lambda_{\min}(M)\}$, where $\lambda_{\max}(M)$ and $\lambda_{\min}(M)$ are, respectively, the largest and smallest eigenvalues of $M$.

Supremum of a random process
The first example embeds a random process in a diagonal matrix to show that Theorem 3.2 is tight in certain cases.
Let $X = \operatorname{diag}(z_1, z_2, \dots)$ be the random diagonal matrix with the $z_i$ on its diagonal. We have $\mathbb{E}[X] = 0$, and Therefore, letting $t := 2(\tau + \log k) > 2.6$ for $\tau > 0$ and interpreting $\lambda_{\max}(X)$ as $\sup_i \{z_i\}$,

Suppose the $z_i \sim N(0, 1)$ are just $N$ i.i.d. standard Gaussian random variables. Then the above inequality states that the largest of the $z_i$ is $O(\sqrt{\log N + \tau})$ with probability at least $1 - e^{-\tau}$; this is known to be tight up to constants, so the $\log N$ term cannot generally be removed. This fact has been noted by previous works on matrix tail inequalities [25], which also use this example as an extreme case. We note, however, that these previous works are not directly applicable to the case of a countably infinite number of mean-zero Gaussian random variables $z_i \sim N(0, \sigma_i^2)$ (or more generally, subgaussian random variables), whereas the above inequality can be applied as long as the sum of the $\sigma_i^2$ is finite.
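The behavior of the maximum of $N$ i.i.d. standard Gaussians can be illustrated with a small simulation. The sketch below does not use the constants of the inequality above; instead it uses the threshold $\sqrt{2(\tau + \log N)}$, which follows from the Gaussian tail bound $\Pr[z > x] \le e^{-x^2/2}$ and a union bound, and it estimates how often the maximum exceeds that threshold. The function name, the seed, and the values of $N$, $\tau$, and the number of trials are illustrative assumptions.

```python
import math
import random

def exceedance_frequency(num_vars, tau, trials, seed=0):
    """Estimate Pr[max_i z_i > sqrt(2 (tau + log N))] for N i.i.d. N(0, 1) draws."""
    rng = random.Random(seed)
    # Union-bound threshold (an assumption of this sketch, see the lead-in).
    threshold = math.sqrt(2.0 * (tau + math.log(num_vars)))
    exceed = 0
    for _ in range(trials):
        largest = max(rng.gauss(0.0, 1.0) for _ in range(num_vars))
        if largest > threshold:
            exceed += 1
    return exceed / trials, threshold

freq, threshold = exceedance_frequency(num_vars=1000, tau=1.0, trials=200)
print(freq, threshold, math.exp(-1.0))
```

The estimated frequency should sit below $e^{-\tau}$, and the threshold grows like $\sqrt{\log N}$, in line with the order of the maximum discussed above.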

Covariance estimation
Our next example uses Theorem 3.3 to give a spectral norm error bound for estimating the second moment matrix of a random vector from i.i.d. copies. This is relevant in the context of (kernel) principal component analysis of high (or infinite) dimensional data [22].
Since $\lambda_{\max}(-X_i) \le \lambda_{\max}(\Sigma)$, we also have Above, the relevant notion of intrinsic dimension is $\operatorname{tr}(K - \Sigma^2)/\lambda_{\max}(K - \Sigma^2)$, which can be finite even when the random vectors $x_i$ take on values in an infinite dimensional Hilbert space. A related result was given in [29] for Frobenius norm error $\|\Sigma_n - \Sigma\|_F$ rather than spectral norm error. This is generally incomparable to our result, although spectral norm error may be more appropriate in cases where the spectrum is slow to decay.

Approximate matrix multiplication
Finally, we give an example about approximating a matrix product $AB^\top$ using non-uniform sampling of the columns of $A$ and $B$.

Example 4.3. Let $A := [a_1 | a_2 | \cdots | a_m]$ and $B := [b_1 | b_2 | \cdots | b_m]$ be fixed matrices, each with $m$ columns. Assume $a_i \ne 0$ and $b_i \ne 0$ for all $i = 1, 2, \dots, m$. If $m$ is very large, then the straightforward computation of the product $AB^\top$ can be prohibitive. An alternative is to take a small (non-uniform) random sample of the columns of $A$ and $B$, say $a_{i_1}, b_{i_1}, a_{i_2}, b_{i_2}, \dots, a_{i_n}, b_{i_n}$, and then compute a weighted sum of outer products

$$\frac{1}{n} \sum_{j=1}^n \frac{a_{i_j} b_{i_j}^\top}{p_{i_j}},$$

where $p_{i_j} > 0$ is the a priori probability of choosing the column index $i_j \in \{1, 2, \dots, m\}$ (the actual values of the probabilities $p_i$ for $i = 1, 2, \dots, m$ are given below). An analysis of this randomized approximation scheme is given below. This scheme was originally proposed and analyzed by Drineas, Kannan, and Mahoney [4], where the error measure used was the Frobenius norm; here, we analyze the spectral norm error. The spectral norm error was also analyzed in [14], but the result had a worse dependence on the allowed failure probability.
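Let $X_1, X_2, \dots, X_n$ be i.i.d. random matrices with the discrete distribution given by

$$\Pr\left[ X_j = \frac{1}{p_i} \begin{bmatrix} 0 & a_i b_i^\top \\ b_i a_i^\top & 0 \end{bmatrix} \right] = p_i$$

for all $i = 1, 2, \dots, m$, where $p_i := \|a_i\|_2 \|b_i\|_2 / Z$ and $Z := \sum_{i=1}^m \|a_i\|_2 \|b_i\|_2$. Let

$$M_n := \frac{1}{n} \sum_{j=1}^n X_j \quad \text{and} \quad M := \begin{bmatrix} 0 & AB^\top \\ BA^\top & 0 \end{bmatrix}.$$

Note that $\|M_n - M\|_2$ is the spectral norm error of approximating $AB^\top$ using the average of $n$ outer products $\frac{1}{n} \sum_{j=1}^n a_{i_j} b_{i_j}^\top / p_{i_j}$, where the indices are such that $i_j = i$ if and only if $X_j$ is the matrix built from $a_i b_i^\top / p_i$ as above, for $j = 1, 2, \dots, n$.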
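The sampling scheme described in this example can be sketched in a few lines. The code below stores $A$ and $B$ as lists of columns, draws column indices with probability proportional to $\|a_i\|_2 \|b_i\|_2$, and averages the rescaled outer products $a_i b_i^\top / p_i$. The function names, the seed handling, and the list-of-columns representation are assumptions made for illustration; this is not the authors' implementation, and no attempt is made at efficiency.

```python
import math
import random

def norm(v):
    """Euclidean norm of a vector given as a list."""
    return math.sqrt(sum(x * x for x in v))

def sampling_probabilities(a_cols, b_cols):
    """p_i proportional to ||a_i|| * ||b_i|| (all columns assumed nonzero)."""
    weights = [norm(a) * norm(b) for a, b in zip(a_cols, b_cols)]
    z = sum(weights)
    return [w / z for w in weights]

def approx_product(a_cols, b_cols, n, seed=0):
    """Average of n sampled outer products a_i b_i^T / p_i."""
    rng = random.Random(seed)
    p = sampling_probabilities(a_cols, b_cols)
    rows, cols = len(a_cols[0]), len(b_cols[0])
    est = [[0.0] * cols for _ in range(rows)]
    for i in rng.choices(range(len(p)), weights=p, k=n):
        scale = 1.0 / (n * p[i])
        for r in range(rows):
            for c in range(cols):
                est[r][c] += scale * a_cols[i][r] * b_cols[i][c]
    return est

def exact_product(a_cols, b_cols):
    """The full sum of outer products a_i b_i^T over all columns."""
    rows, cols = len(a_cols[0]), len(b_cols[0])
    return [[sum(a[r] * b[c] for a, b in zip(a_cols, b_cols))
             for c in range(cols)] for r in range(rows)]

# Sanity case: when every a_i is a positive multiple of one vector u and every
# b_i equals one vector v, each rescaled outer product equals the full product,
# so the estimate is exact for any sample.
u, v, scales = [1.0, 2.0], [3.0, -1.0, 2.0], [0.5, 1.0, 2.5]
a_cols = [[s * x for x in u] for s in scales]
b_cols = [v[:] for _ in scales]
estimate = approx_product(a_cols, b_cols, n=7)
exact = exact_product(a_cols, b_cols)
```

In general the estimate is random and only equals the exact product in expectation; the sanity case above is a degenerate situation in which the variance is zero, which makes it convenient for checking the rescaling by $1/p_i$.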
We have the following identities: and the following inequalities:

Pr
Mn ] be the numerical (or stable) rank of $A$ and $B$, respectively. Since we have the simplified (but slightly looser) bound (for $t \ge 2.6$) Therefore, for any $\varepsilon \in (0, 1)$ and $\delta \in (0, 1)$, if then with probability at least $1 - \delta$ over the random choice of column indices $i_1, i_2, \dots, i_n$,

A Sums of random vector outer products
The following lemma is a tail inequality for the spectral norm error of the empirical covariance matrix of subgaussian random vectors.This result (without explicit constants) is due to Litvak, Pajor, Rudelson, and Tomczak-Jaegermann [13] (see also [26]).
Lemma A.1. Let $x_1, x_2, \dots, x_n$ be random vectors in $\mathbb{R}^d$ such that, for some $\gamma \ge 0$, for all $i = 1, 2, \dots, n$, almost surely. For all $\varepsilon_0 \in (0, 1/2)$ and $t > 0$,

For completeness, we give a detailed proof of Lemma A.1 by applying the tail inequality in Lemma A.2 to Rayleigh quotients of the empirical covariance matrix, together with a covering argument based on the estimate in Lemma A.3 from [17].