Subadditivity of Matrix phi-Entropy and Concentration of Random Matrices

Matrix concentration inequalities provide a direct way to bound the typical spectral norm of a random matrix. The methods for establishing these results often parallel classical arguments, such as the Laplace transform method. This work develops a matrix extension of the entropy method, and it applies these ideas to obtain some matrix concentration inequalities.

The most effective methods for developing matrix concentration inequalities parallel familiar scalar arguments. For example, it is possible to mimic the Laplace transform method of Bernstein to obtain powerful results for sums of independent random matrices [AW02, Oli10, Tro12b]. Several other papers have adapted martingale methods to the matrix setting [Oli09, Tro11, Min12]. A third line of work [MJC+12, PMT13] contains a matrix extension of Chatterjee's techniques [Cha07, Cha08] for proving concentration inequalities via Stein's method of exchangeable pairs. See the survey [Tro12c] for a more complete bibliography.
In spite of these successes, the study of matrix concentration inequalities is by no means complete. Indeed, one frequently encounters random matrices that do not submit to existing techniques. The aim of this paper is to explore the prospects for adapting ϕ-Sobolev inequalities [LO00, Cha04, BBLM05] to the matrix setting. By doing so, we hope to obtain concentration inequalities that hold for general matrix-valued functions of independent random variables.
It is indeed possible to obtain matrix analogs of the scalar ϕ-Sobolev inequalities for product spaces that appear in [BBLM05]. This theory leads to some interesting concentration inequalities for random matrices. On the other hand, this method is not as satisfying as some other approaches to matrix concentration because the resulting bounds seem to require artificial assumptions. Nevertheless, we believe it is worthwhile to document the techniques and to indicate where matrix ϕ-Sobolev inequalities differ from their scalar counterparts.
1.1. Notation and Background. Before we can discuss our main results, we must instate some notation. The set $\mathbb{R}_+$ contains the nonnegative real numbers, and $\mathbb{R}_{++}$ consists of all positive real numbers. We write $\mathbb{M}^d$ for the complex Banach space of $d \times d$ complex matrices, equipped with the usual $\ell_2$ operator norm $\|\cdot\|$. The normalized trace is the function
$$\operatorname{tr} B := \frac{1}{d} \sum_{j=1}^{d} b_{jj} \quad \text{for } B \in \mathbb{M}^d.$$
The theory can be developed using the standard trace, but additional complications arise.
The set $\mathbb{H}^d$ refers to the real-linear subspace of $d \times d$ Hermitian matrices in $\mathbb{M}^d$. For a matrix $A \in \mathbb{H}^d$, we write $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ for the algebraic minimum and maximum eigenvalues. For each interval $I \subset \mathbb{R}$, we define the set of Hermitian matrices whose eigenvalues fall in that interval:
$$\mathbb{H}^d(I) := \{ A \in \mathbb{H}^d : [\lambda_{\min}(A), \lambda_{\max}(A)] \subset I \}.$$
We also introduce the set $\mathbb{H}^d_+$ of $d \times d$ positive-semidefinite matrices and the set $\mathbb{H}^d_{++}$ of $d \times d$ positive-definite matrices. Curly inequalities refer to the positive-semidefinite order. For example, $A \preccurlyeq B$ means that $B - A$ is positive semidefinite.
Next, let us explain how to extend scalar functions to matrices. Recall that each Hermitian matrix $A \in \mathbb{H}^d$ has a spectral resolution
$$A = \sum_{i=1}^{d} \lambda_i P_i, \tag{1.1}$$
where $\lambda_1, \dots, \lambda_d$ are the eigenvalues of $A$ and the matrices $P_1, \dots, P_d$ are orthogonal projectors that satisfy the orthogonality relations
$$P_i P_j = \delta_{ij} P_j \quad \text{and} \quad \sum_{i=1}^{d} P_i = I,$$
where $\delta_{ij}$ is the Kronecker delta and $I$ is the identity matrix. One obtains a standard matrix function by applying a scalar function to the spectrum of a Hermitian matrix.
Definition 1.1 (Standard Matrix Function). Let $f : I \to \mathbb{R}$ be a function on an interval $I$ of the real line. Suppose that $A \in \mathbb{H}^d(I)$ has the spectral decomposition (1.1). Then
$$f(A) := \sum_{i=1}^{d} f(\lambda_i) P_i.$$
We use lowercase Roman and Greek letters to refer to standard matrix functions. When we apply a familiar real-valued function to a Hermitian matrix, we are referring to the associated standard matrix function. Bold capital letters such as Y, Z denote general matrix functions that are not necessarily standard.
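As an illustration of Definition 1.1, the following minimal sketch computes a standard matrix function from an eigendecomposition. The helper name `standard_matrix_function` and the test matrix are hypothetical choices made for this example; the only library call is NumPy's `numpy.linalg.eigh`.

```python
import numpy as np

def standard_matrix_function(f, A):
    """Return f(A) = sum_i f(lambda_i) P_i for a Hermitian matrix A."""
    eigenvalues, eigenvectors = np.linalg.eigh(A)
    # Scale each eigenvector column by f(lambda_i), then recombine.
    return (eigenvectors * f(eigenvalues)) @ eigenvectors.conj().T

A = np.array([[2.0, 1.0], [1.0, 2.0]])  # Hermitian, eigenvalues 1 and 3
A_squared = standard_matrix_function(lambda t: t**2, A)
A_sqrt = standard_matrix_function(np.sqrt, A)
```

For a polynomial such as t → t², the standard matrix function agrees with the usual matrix power, which gives a quick sanity check.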
1.2.Subadditivity of Matrix Entropies.In this section, we provide an overview of the theory of matrix ϕ-entropies.Our entire approach has a strong parallel with the work of Boucheron et al. [BBLM05].In the matrix setting, however, the technical difficulties are more formidable.
1.2.1.The Class of Matrix Entropies.First, we carve out a class of standard matrix functions that we can use to construct matrix entropies with the same subadditivity properties as their scalar counterparts.
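Definition 1.2 ($\Phi_d$ Function Class). Let $d$ be a natural number. The class $\Phi_d$ contains each function $\phi : \mathbb{R}_+ \to \mathbb{R}$ that is either affine or else satisfies the following three conditions.
(1) $\phi$ is continuous and convex on $\mathbb{R}_+$.
(2) $\phi$ has two continuous derivatives on $\mathbb{R}_{++}$.
(3) Define $\psi(t) = \phi'(t)$ for $t \in \mathbb{R}_{++}$. The derivative $D\psi$ of the standard matrix function $\psi : \mathbb{H}^d_{++} \to \mathbb{H}^d$ is an invertible linear operator on $\mathbb{H}^d_{++}$, and the map $A \mapsto [D\psi(A)]^{-1}$ is concave with respect to the semidefinite order on operators.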
The technical definitions that support requirement (3) appear in Section 2. For now, we just remark that the scalar equivalent of (3) is the statement that t → [ϕ ′′ (t)] −1 is concave on R ++ .
The class $\Phi_1$ coincides with the $\Phi$ function class considered in [BBLM05]. It can be shown that $\Phi_{d+1} \subseteq \Phi_d$ for each natural number $d$, so it is appropriate to introduce the class of matrix entropies:
$$\Phi_\infty := \bigcap_{d=1}^{\infty} \Phi_d.$$
This class consists of scalar functions that satisfy the conditions of Definition 1.2 for an arbitrary choice of dimension $d$. Note that $\Phi_\infty$ is a convex cone: it contains all positive multiples and all finite sums of its elements.
In contrast to the scalar setting, it is not easy to determine what functions are contained in Φ ∞ .The main technical achievement of this paper is to demonstrate that the standard entropy and certain power functions belong to the matrix entropy class.
Theorem 1.3 (Elements of the Matrix Entropy Class).The following functions are members of the Φ ∞ class.
(1) The standard entropy t → t log t.
(2) The power function $t \mapsto t^p$ for each $p \in [1, 2]$.
The statement about the classical entropy can be obtained from standard results in matrix theory, but the statement for power functions demands some effort. The proof of Theorem 1.3 appears in Section 5.
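The matrix ϕ-entropy functional of a random positive-semidefinite matrix $Z$ is
$$H_\phi(Z) := \mathbb{E} \operatorname{tr} \phi(Z) - \operatorname{tr} \phi(\mathbb{E} Z). \tag{1.2}$$
Similarly, the conditional matrix ϕ-entropy functional is
$$H_\phi(Z \mid \mathcal{F}) := \mathbb{E}[\operatorname{tr} \phi(Z) \mid \mathcal{F}] - \operatorname{tr} \phi(\mathbb{E}[Z \mid \mathcal{F}]),$$
where $\mathcal{F}$ is a subalgebra of the master sigma algebra.
For each convex function $\phi$, the trace function $\operatorname{tr} \phi : \mathbb{H}^d_+ \to \mathbb{R}$ is also convex. Therefore, Jensen's inequality implies that the matrix ϕ-entropy is nonnegative:
$$H_\phi(Z) \ge 0.$$
For concreteness, here are some basic examples of matrix ϕ-entropy functionals.
$$H_\phi(Z) = \mathbb{E} \operatorname{tr}(Z \log Z) - \operatorname{tr}\big[(\mathbb{E} Z) \log(\mathbb{E} Z)\big] \quad \text{when } \phi(t) = t \log t,$$
$$H_\phi(Z) = \mathbb{E} \operatorname{tr}(Z^p) - \operatorname{tr}\big[(\mathbb{E} Z)^p\big] \quad \text{when } \phi(t) = t^p \text{ for } p \in [1, 2].$$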
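To make the nonnegativity claim concrete, here is a small numerical sketch. It assumes the definition H_ϕ(Z) = E tr ϕ(Z) − tr ϕ(E Z) with the normalized trace, and it evaluates the functional for a random matrix that takes two fixed positive-definite values. The helper names (`matfun`, `ntr`, `matrix_phi_entropy`) and the sample matrices are illustrative assumptions, not part of the paper.

```python
import numpy as np

def matfun(f, A):
    """Standard matrix function of a Hermitian matrix via its spectrum."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def ntr(A):
    """Normalized trace: tr(A) / d."""
    return np.trace(A).real / A.shape[0]

def matrix_phi_entropy(phi, values, probs):
    """H_phi(Z) = E tr phi(Z) - tr phi(E Z) for a finitely supported Z."""
    mean = sum(p * Z for p, Z in zip(probs, values))
    expected = sum(p * ntr(matfun(phi, Z)) for p, Z in zip(probs, values))
    return expected - ntr(matfun(phi, mean))

phi_entropy = lambda t: t * np.log(t)  # used on positive-definite inputs only
phi_power = lambda t: t ** 1.5         # a power function with p in [1, 2]

values = [np.array([[2.0, 0.5], [0.5, 1.0]]),
          np.array([[1.0, -0.3], [-0.3, 3.0]])]
probs = [0.4, 0.6]

h_entropy = matrix_phi_entropy(phi_entropy, values, probs)
h_power = matrix_phi_entropy(phi_power, values, probs)
```

Both values should come out nonnegative, and the functional vanishes when the matrix is deterministic.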
The key fact about matrix ϕ-entropies is that they satisfy a subadditivity property.Let x := (X 1 , . . ., X n ) denote a vector of independent random variables taking values in a Polish space, and write x −i for the random vector obtained by deleting the ith entry of x.
Consider a positive-semidefinite random matrix Z that can be expressed as a measurable function of the random vector x.
We instate the integrability conditions $\mathbb{E}\|Z\| < \infty$ and $\mathbb{E}\|\phi(Z)\| < \infty$.
Theorem 1.5 (Subadditivity of Matrix ϕ-Entropy). Fix a function $\phi \in \Phi_\infty$. Under the prevailing assumptions,
$$H_\phi(Z) \le \sum_{i=1}^{n} \mathbb{E}\big[ H_\phi(Z \mid x_{-i}) \big]. \tag{1.3}$$
Typically, we apply Theorem 1.5 by way of a corollary. Let $X'_1, \dots, X'_n$ denote independent copies of $X_1, \dots, X_n$, and form the random matrix
$$Z'_i := Z(X_1, \dots, X_{i-1}, X'_i, X_{i+1}, \dots, X_n) \quad \text{for } i = 1, \dots, n.$$
Then $Z'_i$ and $Z$ are independent and identically distributed, conditional on the sigma algebra generated by $x_{-i}$. In particular, these two random matrices are exchangeable counterparts.
Corollary 1.6 (Entropy Bounds via Exchangeability). Fix a function $\phi \in \Phi_\infty$, and write $\psi = \phi'$. With the prevailing notation,
$$H_\phi(Z) \le \frac{1}{2} \sum_{i=1}^{n} \mathbb{E} \operatorname{tr}\big[ (Z - Z'_i)(\psi(Z) - \psi(Z'_i)) \big].$$
Theorem 1.5 and Corollary 1.6 are matrix counterparts of the foundational results from Boucheron et al. [BBLM05, Sec. 3], which establish that scalar ϕ-entropies satisfy a similar subadditivity property. We devote Section 3 to the proof of these results.
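The two bounds can be checked numerically on a toy model. The sketch below assumes that Z is a positive-definite function of two independent Rademacher signs, that ϕ(t) = t log t, and that the entropy is H_ϕ(Z) = E tr ϕ(Z) − tr ϕ(E Z) with the normalized trace. The matrices `A0`, `A1`, `A2` and all helper names are hypothetical. It compares the entropy, the sum of expected conditional entropies, and the exchangeable-pair bound.

```python
import itertools
import numpy as np

def matfun(f, A):
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def ntr(A):
    return np.trace(A).real / A.shape[0]

A0 = np.array([[1.0, 0.2], [0.0, 0.5]])
A1 = np.array([[0.3, 0.0], [0.1, 0.2]])
A2 = np.array([[0.0, 0.4], [0.2, -0.1]])

def Z(x1, x2):
    """A positive-definite matrix function of two signs."""
    M = A0 + x1 * A1 + x2 * A2
    return M @ M.T + np.eye(2)

phi = lambda t: t * np.log(t)
psi = lambda t: np.log(t) + 1.0  # derivative of phi
signs = (-1.0, 1.0)

def entropy(mats):
    """H_phi for the uniform distribution over the listed matrices."""
    mean = sum(mats) / len(mats)
    return np.mean([ntr(matfun(phi, M)) for M in mats]) - ntr(matfun(phi, mean))

# Left-hand side: entropy over both coordinates.
h_total = entropy([Z(a, b) for a in signs for b in signs])

# Subadditivity bound: resample one coordinate with the other held fixed.
h_cond = (np.mean([entropy([Z(a, b) for a in signs]) for b in signs])
          + np.mean([entropy([Z(a, b) for b in signs]) for a in signs]))

def cross(P, Q):
    return ntr((P - Q) @ (matfun(psi, P) - matfun(psi, Q)))

# Exchangeable-pair bound: c plays the role of the independent copy.
triples = list(itertools.product(signs, repeat=3))
h_exch = 0.5 * (np.mean([cross(Z(a, b), Z(c, b)) for a, b, c in triples])
                + np.mean([cross(Z(a, b), Z(a, c)) for a, b, c in triples]))
```

On this example one expects `h_total <= h_cond <= h_exch`, in line with Theorem 1.5 and Corollary 1.6.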
1.3. Some Matrix Concentration Inequalities. Using Corollary 1.6, we can derive concentration inequalities for random matrices. In contrast to some previous approaches to matrix concentration, we need to place some significant restrictions on the type of random matrices we consider. A signed permutation $\Pi \in \mathbb{M}^d$ is a matrix with the properties that (i) each row and each column contains exactly one nonzero entry and (ii) the nonzero entries only take values $+1$ and $-1$.

1.3.1. A Bounded Difference Inequality. We begin with an exponential tail bound for a random matrix whose distribution is invariant under signed permutations.
Theorem 1.8 (Bounded Differences). Let $x := (X_1, \dots, X_n)$ be a vector of independent random variables, and let $x' := (X'_1, \dots, X'_n)$ be an independent copy of $x$. Consider random matrices
$$Y := Y(X_1, \dots, X_i, \dots, X_n) \quad \text{and} \quad Y'_i := Y(X_1, \dots, X'_i, \dots, X_n) \quad \text{for } i = 1, \dots, n.$$
Assume that $\|Y\|$ is bounded almost surely. Introduce the variance measure where the supremum occurs over all possible values of $x$. For each $t \ge 0$, , and Theorem We can also establish moment inequalities for a random matrix whose distribution is invariant under signed permutation.
1.4.Generalized Subadditivity of Matrix ϕ-Entropy.Theorem 1.5 is the shadow of a more sophisticated subadditivity property.We outline the simplest form of this more general result.See the lecture notes of Carlen [Car10] for more background on the topics in this section.
We work in the $*$-algebra $\mathbb{M}^d$ of $d \times d$ complex matrices, equipped with the conjugate transpose operation $*$ and the normalized trace inner product $\langle A, B \rangle := \operatorname{tr}(A^* B)$. We say that a subspace $\mathcal{A} \subset \mathbb{M}^d$ is a $*$-subalgebra when $\mathcal{A}$ contains the identity matrix, $\mathcal{A}$ is closed under matrix multiplication, and $\mathcal{A}$ is closed under conjugate transposition. In other terms, $I \in \mathcal{A}$ and $AB \in \mathcal{A}$ and $A^* \in \mathcal{A}$ whenever $A, B \in \mathcal{A}$.
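In this setting, there is an elegant notion of conditional expectation. The orthogonal projector $\mathbb{E}_{\mathcal{A}} : \mathbb{M}^d \to \mathcal{A}$ onto the $*$-subalgebra $\mathcal{A}$ is called the conditional expectation with respect to the $*$-subalgebra. For $*$-subalgebras $\mathcal{A}$ and $\mathcal{B}$, we say that the conditional expectations $\mathbb{E}_{\mathcal{A}}$ and $\mathbb{E}_{\mathcal{B}}$ commute when
$$\mathbb{E}_{\mathcal{A}} \mathbb{E}_{\mathcal{B}} = \mathbb{E}_{\mathcal{B}} \mathbb{E}_{\mathcal{A}}.$$
This construction generalizes the concept of independence in a probability space. We can define the matrix ϕ-entropy of a positive-semidefinite matrix $A$ conditional on a $*$-subalgebra $\mathcal{A}$:
$$H_\phi(A \mid \mathcal{A}) := \operatorname{tr}\big[ \phi(A) - \phi(\mathbb{E}_{\mathcal{A}} A) \big].$$
Let $\mathcal{A}_1, \dots, \mathcal{A}_n$ be $*$-subalgebras whose conditional expectations commute. Then we can extend the definition of the matrix ϕ-entropy to read
$$H_\phi(A \mid \mathcal{A}_1, \dots, \mathcal{A}_n) := \operatorname{tr}\big[ \phi(A) - \phi(\mathbb{E}_{\mathcal{A}_1} \cdots \mathbb{E}_{\mathcal{A}_n} A) \big].$$
Because of commutativity, the order of the conditional expectations has no effect on the calculation. It turns out that matrix ϕ-entropy admits the following subadditivity property.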
Theorem 1.10 (Subadditivity of Matrix ϕ-Entropy II). Fix a function $\phi \in \Phi_\infty$. Let $\mathcal{A}_1, \dots, \mathcal{A}_n$ be $*$-subalgebras of $\mathbb{M}^d$ whose conditional expectations commute. Then
$$H_\phi(A \mid \mathcal{A}_1, \dots, \mathcal{A}_n) \le \sum_{i=1}^{n} H_\phi(A \mid \mathcal{A}_i).$$
We omit the proof of this result. The argument involves considerations similar to those in the proof of Theorem 1.5, but it requires an extra dose of operator theory. The work in this paper already addresses the more challenging aspects of the proof.
Theorem 1.10 can be seen as a formal extension of the subadditivity of matrix ϕ-entropy expressed in Theorem 1.5. To see why, let $\Omega := \Omega_1 \times \cdots \times \Omega_n$ be a product probability space. The space $L_2(\Omega; \mathbb{M}^d)$ of random matrices is a $*$-algebra with the normalized trace functional $\mathbb{E} \operatorname{tr}$. For each $i = 1, \dots, n$, we can form a $*$-subalgebra $\mathcal{A}_i$ consisting of the random matrices that do not depend on the $i$th factor $\Omega_i$ of the product. The conditional expectation $\mathbb{E}_{\mathcal{A}_i}$ simply integrates out the $i$th random variable. By independence, the family of conditional expectations $\mathbb{E}_{\mathcal{A}_1}, \dots, \mathbb{E}_{\mathcal{A}_n}$ commutes. Using this dictionary, compare the statement of Theorem 1.10 with Theorem 1.5.
1.5.Background on the Entropy Method.This section contains a short summary of related work on the entropy method and on matrix concentration.
Inspired by Talagrand's work [Tal91] on concentration in product spaces, Ledoux [Led97, Led01] and Bobkov & Ledoux [BL98] developed the entropy method for obtaining concentration inequalities on product spaces.This approach is based on new logarithmic Sobolev inequalities for product spaces.Other authors, including Massart [Mas00a,Mas00b], Rio [Rio01], Bousquet [Bou02], and Boucheron et al. [BLM03] have extended these results to obtain additional concentration inequalities.See the book [BLM13] for a comprehensive treatment.
In a related line of work, Latała & Oleszkiewicz [LO00] and Chafaï [Cha04] investigated generalizations of the logarithmic Sobolev inequalities based on ϕ-entropy functionals. The paper [BBLM05] of Boucheron et al. elaborates on these ideas to obtain a new class of moment inequalities; see also the book [BLM13]. Our paper is based heavily on the approach in [BBLM05].
There is a recent line of work that develops concentration inequalities for random matrices by adapting classical arguments from the theory of concentration of measure.The introduction contains an overview of this research, so we will not repeat ourselves.
Our paper can be viewed as a first attempt to obtain concentration inequalities for random matrices using the entropy method. In spirit, our approach is closely related to arguments based on Stein's method [MJC+12, PMT13]. The theory in our paper does not require any notion of exchangeable pairs. On the other hand, the arguments here are substantially harder, and they still result in weaker concentration bounds.
For the classical entropy ϕ : t → t log t, we are aware of some precedents for our subadditivity results, Theorem 1.5 and Theorem 1.10. In this special case, the subadditivity of $H_\phi$ already follows from a classical result [Lin73]. There is a more modern paper [HOZ01] that contains a similar type of subadditivity bound. Very recently, Hansen [Han13] has identified a convexity property of another related quantity, called the residual entropy. Note that, in the literature on quantum statistical mechanics and quantum information theory, the phrase "subadditivity of entropy" refers to a somewhat different kind of bound [LR73]; see [Car10, Sec. 8] for a modern formulation.

1.6. Roadmap. Section 2 contains some background on operator theory. In Section 3, we prove Theorem 1.5 on the subadditivity of the matrix ϕ-entropy. Section 4 describes how to obtain Corollary 1.6. Afterward, in Section 5, we prove that the standard entropy and certain power functions belong to the $\Phi_\infty$ function class. Finally, Sections 6 and 7 derive the matrix concentration inequalities, Theorem 1.8 and Theorem 1.9.

Operators and Functions acting on Matrices
This work involves a substantial amount of operator theory.This section contains a short treatment of the basic facts.See [Bha97,Bha07] for a more complete introduction.
2.1. Linear Operators on Matrices. Let $\mathbb{C}^d$ be the complex Hilbert space of dimension $d$, equipped with the standard inner product $\langle a, b \rangle := a^* b$. We usually identify $\mathbb{M}^d$ with $B(\mathbb{C}^d)$, the complex Banach space of linear operators acting on $\mathbb{C}^d$, equipped with the $\ell_2$ operator norm $\|\cdot\|$.
We can also endow $\mathbb{M}^d$ with the normalized trace inner product $\langle A, B \rangle := \operatorname{tr}(A^* B)$ to form a Hilbert space. As a Hilbert space, $\mathbb{M}^d$ is isometrically isomorphic with $\mathbb{C}^{d^2}$. Let $B(\mathbb{M}^d)$ denote the complex Banach space of linear operators that map the Hilbert space $\mathbb{M}^d$ into itself, equipped with the induced operator norm. The Banach space $B(\mathbb{M}^d)$ is isometrically isomorphic with the Banach space $\mathbb{M}^{d^2}$.
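The identification of matrices with vectors, and of operators on matrices with larger matrices, can be sketched in a few lines. The example below assumes row-major vectorization and uses the operator M → AMB purely as an illustration; neither choice is prescribed by the text. It checks that the normalized trace inner product matches a scaled Euclidean inner product, and that the operator is represented by a d² × d² Kronecker product.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

def random_complex(d):
    return rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))

A, B, M = (random_complex(d) for _ in range(3))

# Normalized trace inner product <A, B> = tr(A* B) / d.
trace_inner = np.trace(A.conj().T @ B) / d

# The same number from the row-major vectorizations in C^{d^2}.
vector_inner = np.vdot(A.reshape(-1), B.reshape(-1)) / d

# The linear operator M -> A M B, written as a d^2 x d^2 matrix
# acting on the row-major vectorization of M.
operator_matrix = np.kron(A, B.T)
applied = (operator_matrix @ M.reshape(-1)).reshape(d, d)
```

With this dictionary, spectral notions for operators on matrices reduce to the familiar notions for d² × d² matrices.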
As a consequence of this construction, every concept from matrix analysis has an immediate analog for linear operators on matrices. An operator $T \in B(\mathbb{M}^d)$ is self-adjoint when $\langle A, T(B) \rangle = \langle T(A), B \rangle$ for all $A, B \in \mathbb{M}^d$. A self-adjoint operator $T$ is positive semidefinite when $\langle A, T(A) \rangle \ge 0$ for all $A \in \mathbb{M}^d$. For self-adjoint operators $S, T \in B(\mathbb{M}^d)$, the notation $S \preccurlyeq T$ means that $T - S$ is positive semidefinite. Each self-adjoint matrix operator $T \in B(\mathbb{M}^d)$ has a spectral resolution of the form
$$T = \sum_{i=1}^{d^2} \lambda_i P_i, \tag{2.1}$$
where $\lambda_1, \dots, \lambda_{d^2}$ are the eigenvalues of $T$ and the spectral projectors $P_1, \dots, P_{d^2}$ are positive-semidefinite operators that satisfy
$$P_i P_j = \delta_{ij} P_j \quad \text{and} \quad \sum_{i=1}^{d^2} P_i = I,$$
where $\delta_{ij}$ is the Kronecker delta and $I$ is the identity operator. As in the matrix case, a self-adjoint operator with nonnegative eigenvalues is the same thing as a positive-semidefinite operator. We can extend a scalar function $f : I \to \mathbb{R}$ on an interval $I$ of the real line to obtain a standard operator function. Indeed, if $T$ has the spectral resolution (2.1) and the eigenvalues of $T$ fall in the interval $I$, we define
$$f(T) := \sum_{i=1}^{d^2} f(\lambda_i) P_i.$$
This definition, of course, parallels the definition for matrices.

2.2. Monotonicity and Convexity. Let $X$ and $Y$ be sets of self-adjoint operators, such as $\mathbb{H}^d(I)$ or the set of self-adjoint operators in $B(\mathbb{M}^d)$. We can introduce notions of monotonicity and convexity for a general function $\Psi : X \to Y$ using the semidefinite order on the spaces of operators.
The convexity of an operator-valued function $\Psi$ is equivalent to a Jensen-type relation:
$$\Psi(\mathbb{E} X) \preccurlyeq \mathbb{E} \Psi(X) \tag{2.2}$$
whenever $X$ is an integrable random operator taking values in the domain of $\Psi$.
In particular, we can apply these definitions to standard matrix and operator functions.Let I be an interval of the real line.We say that the function f : I → R is operator monotone when the lifted map f : H d (I) → H d is monotone for each natural number d.Likewise, the function f : I → R is operator convex when the lifted map f : H d (I) → H d is convex for each natural number d.
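A two-by-two example shows how restrictive these definitions are. The sketch below uses the well-known facts that t → t^{1/2} is operator monotone on the positive half-line while t → t² is not; the specific matrices and the helper `is_psd` are hypothetical choices made for illustration.

```python
import numpy as np

def matfun(f, A):
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def is_psd(M, tol=1e-10):
    """Check positive semidefiniteness of a Hermitian matrix."""
    return bool(np.linalg.eigvalsh(M).min() >= -tol)

A = np.array([[2.0, 1.0], [1.0, 1.0]])
B = np.array([[3.0, 1.0], [1.0, 1.0]])  # B - A = diag(1, 0) is PSD

order_holds = is_psd(B - A)
square_preserves_order = is_psd(B @ B - A @ A)
sqrt_preserves_order = is_psd(matfun(np.sqrt, B) - matfun(np.sqrt, A))
```

Here B² − A² has a negative determinant, so squaring destroys the semidefinite order even though the scalar map t → t² is increasing on the positive half-line.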
Although scalar monotonicity and convexity are quite common, they are much rarer in the matrix setting [Bha97, Chap. 4]. For present purposes, we note that the power functions $t \mapsto t^p$ with $p \in [0, 1]$ are operator monotone and operator concave. The power functions $t \mapsto t^p$ with $p \in [1, 2]$ and the standard entropy $t \mapsto t \log t$ are all operator convex.

2.3. The Derivative of a Vector-Valued Function. The definition of the $\Phi_\infty$ function class involves a requirement that a certain standard matrix function is differentiable. For completeness, we include the background needed to interpret this condition.
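Definition 2.3 (Derivative of a Vector-Valued Function). Let $X$ and $Y$ be Banach spaces, and let $U$ be an open subset of $X$. A function $F : U \to Y$ is differentiable at a point $A \in U$ if there exists a bounded linear operator $T : X \to Y$ for which
$$\lim_{B \to 0} \frac{\| F(A + B) - F(A) - T(B) \|_Y}{\| B \|_X} = 0.$$
When $F$ is differentiable at $A$, the operator $T$ is called the derivative of $F$ at $A$, and we define $DF(A) := T$.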
The derivative and the directional derivative have the following relationship:
$$DF(A)(B) = \frac{d}{ds} F(A + sB) \Big|_{s=0}. \tag{2.3}$$
In Section 5.2, we present an explicit formula for the derivative of a standard matrix function.

Subadditivity of Matrix ϕ-Entropy
In this section, we establish Theorem 1.5, which states that the matrix ϕ-entropy is subadditive for every function in the Φ ∞ class.This result depends on a variational representation for the matrix ϕ-entropy that appears in Section 3.1.We use the variational formula to derive a Jensentype inequality in Section 3.2.The proof of Theorem 1.5 appears in Section 3.3.
3.1.Representation of Matrix ϕ-Entropy as a Supremum.The fundamental fact behind the subadditivity theorem is a representation of the matrix ϕ-entropy as a supremum of affine functions.
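Lemma 3.1 (Supremum Representation for Entropy). Fix a function $\phi \in \Phi_\infty$, and introduce the scalar derivative $\psi = \phi'$. Suppose that $Z$ is a random positive-semidefinite matrix for which $\|Z\|$ and $\|\phi(Z)\|$ are integrable. Then
$$H_\phi(Z) = \sup_{T} \, \mathbb{E} \operatorname{tr}\big[ (\psi(T) - \psi(\mathbb{E} T))(Z - T) + \phi(T) - \phi(\mathbb{E} T) \big]. \tag{3.1}$$
The range of the supremum contains each random positive-definite matrix $T$ for which $\|T\|$ and $\|\phi(T)\|$ are integrable. In particular, the matrix ϕ-entropy $H_\phi$ can be written in the dual form
$$H_\phi(Z) = \sup_{T} \, \mathbb{E} \operatorname{tr}\big[ \Upsilon_1(T) \cdot Z + \Upsilon_2(T) \big], \tag{3.2}$$
where $\Upsilon_i : \mathbb{H}^d_+ \to \mathbb{H}^d$ for $i = 1, 2$.
This result implies that $H_\phi$ is a convex function on the space of random positive-semidefinite matrices. The dual representation of $H_\phi$ is well suited for establishing a form of Jensen's inequality, Lemma 3.3, which is the main ingredient in the proof of the subadditivity property, Theorem 1.5.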
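It may be valuable to see some particular instances of the dual representation of the matrix ϕ-entropy:
$$H_\phi(Z) = \sup_{T} \, \mathbb{E} \operatorname{tr}\big[ (\log T - \log(\mathbb{E} T)) \cdot Z \big] \quad \text{when } \phi(t) = t \log t,$$
$$H_\phi(Z) = \sup_{T} \, \mathbb{E} \operatorname{tr}\big[ p \, (T^{p-1} - (\mathbb{E} T)^{p-1}) \cdot Z - (p - 1)(T^p - (\mathbb{E} T)^p) \big] \quad \text{when } \phi(t) = t^p \text{ for } p \in [1, 2].$$
The first formula is the matrix version of a well-known variational principle for the classical entropy, cf. [BBLM05, p. 525]. In the matrix setting, this result can be derived from the joint convexity of quantum relative entropy [Lin73].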
3.1.1.The Convexity Lemma.To establish the variational formula, we require a convexity result for a quadratic form connected with the function ϕ.
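Lemma 3.2. Fix a function $\phi \in \Phi_\infty$, and let $\psi = \phi'$. Suppose that $Y$ is a random matrix taking values in $\mathbb{H}^d_+$, and let $K$ be a random matrix taking values in $\mathbb{M}^d$. Assume that $\|Y\|$ and $\|K\|$ are integrable. Then
$$\mathbb{E} \langle K, \, D\psi(Y)(K) \rangle \ge \langle \mathbb{E} K, \, D\psi(\mathbb{E} Y)(\mathbb{E} K) \rangle.$$
Proof. The proof hinges on a basic convexity property of quadratic forms. Define a map that takes a matrix $A$ in $\mathbb{H}^d$ and a positive-definite operator $T$ on $\mathbb{M}^d$ to a nonnegative number:
$$Q(A, T) := \langle A, \, T^{-1}(A) \rangle.$$
We assert that the function $Q$ is convex. Indeed, the same result is well known when $A$ and $T$ are replaced by a vector and a positive-definite matrix [Bha07, Exer. 1.5.1], and the extension is immediate from the isometric isomorphism between operators and matrices.
Recall that the $\Phi_\infty$ class requires $A \mapsto [D\psi(A)]^{-1}$ to be a concave map on $\mathbb{H}^d_{++}$. With these observations at hand, we can make the following calculation:
$$\mathbb{E} \langle K, D\psi(Y)(K) \rangle = \mathbb{E} \, Q\big(K, [D\psi(Y)]^{-1}\big) \ge Q\big(\mathbb{E} K, \, \mathbb{E} [D\psi(Y)]^{-1}\big) \ge Q\big(\mathbb{E} K, [D\psi(\mathbb{E} Y)]^{-1}\big) = \langle \mathbb{E} K, D\psi(\mathbb{E} Y)(\mathbb{E} K) \rangle.$$
We obtain the second relation when we apply Jensen's inequality to the convex function $Q$. The third relation depends on the semidefinite Jensen inequality (2.2) for the concave function $A \mapsto [D\psi(A)]^{-1}$, coupled with the fact [Bha97, Prop. V.1.6] that the operator inverse reverses the semidefinite order.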
3.1.2. Proof of Lemma 3.1. The argument parallels the proof of [BBLM05, Lem. 1]. We begin with some reductions. The case where ϕ is an affine function is immediate, so we may require the derivative $\psi = \phi'$ to be non-constant. By approximation, we may also assume that the random matrix $Z$ is strictly positive definite.
[Indeed, since ϕ is continuous on $\mathbb{R}_+$, the Dominated Convergence Theorem implies that the matrix ϕ-entropy $H_\phi$ is continuous on the set containing each positive-semidefinite random matrix $Y$ where $\|Y\|$ and $\|\phi(Y)\|$ are integrable. Therefore, we can approximate a positive-semidefinite random matrix $Z$ by a sequence $\{Y_n\}$ of positive-definite random matrices where $Y_n \to Z$ and be confident that $H_\phi(Y_n) \to H_\phi(Z)$.]
When $T = Z$, the argument of the supremum in (3.1) equals $H_\phi(Z)$. Therefore, our burden is to verify the inequality
$$H_\phi(Z) \ge \mathbb{E} \operatorname{tr}\big[ (\psi(T) - \psi(\mathbb{E} T))(Z - T) + \phi(T) - \phi(\mathbb{E} T) \big] \tag{3.3}$$
for each random positive-definite matrix $T$ that satisfies the same integrability requirements as $Z$.
For simplicity, we assume that the eigenvalues of both Z and T are bounded and bounded away from zero.See Appendix A for the extension to the general case.
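We use an interpolation argument to establish (3.3). Define the family of random matrices
$$T_s := (1 - s) \cdot Z + s \cdot T \quad \text{for } s \in [0, 1].$$
Introduce the real-valued function
$$F(s) := \mathbb{E} \operatorname{tr}\big[ (\psi(T_s) - \psi(\mathbb{E} T_s))(Z - T_s) \big] + H_\phi(T_s). \tag{3.4}$$
Observe that $F(0) = H_\phi(Z)$, while $F(1)$ coincides with the right-hand side of (3.3). Therefore, to establish (3.3), it suffices to show that the function $F(s)$ is weakly decreasing on the interval $[0, 1]$. We intend to prove that $F'(s) \le 0$ for $s \in [0, 1]$. We differentiate the function $F$ to obtain
$$F'(s) = \mathbb{E} \operatorname{tr}\big[ D\psi(T_s)(T - Z) \cdot (Z - T_s) \big] - \operatorname{tr}\big[ D\psi(\mathbb{E} T_s)(\mathbb{E}(T - Z)) \cdot \mathbb{E}(Z - T_s) \big] - \mathbb{E} \operatorname{tr}\big[ (\psi(T_s) - \psi(\mathbb{E} T_s))(T - Z) \big] + \mathbb{E} \operatorname{tr}\big[ (\psi(T_s) - \psi(\mathbb{E} T_s))(T - Z) \big]. \tag{3.5}$$
To handle the first term in (3.4), we applied the product rule, the rule (2.3) for directional derivatives, and the expression $dT_s/ds = T - Z$. We used the identity $D \operatorname{tr} \phi(A) = \psi(A)$ to differentiate the second term. We also relied on the Dominated Convergence Theorem to pass derivatives through expectations, which is justified because ϕ and ψ are continuously differentiable on $\mathbb{H}^d_{++}$ and the eigenvalues of the random matrices are bounded and bounded away from zero. Now, the last two terms in (3.5) cancel. Since $Z - T_s = -s \cdot (T - Z)$, we can rewrite the first two terms using the trace inner product:
$$F'(s) = -s \cdot \big[ \mathbb{E} \langle T - Z, \, D\psi(T_s)(T - Z) \rangle - \langle \mathbb{E}(T - Z), \, D\psi(\mathbb{E} T_s)(\mathbb{E}(T - Z)) \rangle \big].$$
Invoke Lemma 3.2 to conclude that $F'(s) \le 0$ for $s \in [0, 1]$.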

3.2. A Conditional Jensen Inequality. The variational inequality in Lemma 3.1 leads directly to a Jensen inequality for the matrix ϕ-entropy.
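Lemma 3.3 (Conditional Jensen). Fix a function $\phi \in \Phi_\infty$. Suppose that $(X_1, X_2)$ is a pair of independent random variables taking values in a Polish space, and let $Z = Z(X_1, X_2)$ be a random positive-semidefinite matrix for which $\|Z\|$ and $\|\phi(Z)\|$ are integrable. Then
$$H_\phi(\mathbb{E}_1 Z) \le \mathbb{E}_1 H_\phi(Z \mid X_1),$$
where $\mathbb{E}_1$ is the expectation with respect to the first variable $X_1$.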
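Proof. Let $\mathbb{E}_2$ denote the expectation with respect to the second variable $X_2$. The result is a simple consequence of the dual representation (3.2) of the matrix ϕ-entropy:
$$H_\phi(\mathbb{E}_1 Z) = \sup_{T} \, \mathbb{E}_2 \operatorname{tr}\big[ \Upsilon_1(T(X_2)) \cdot (\mathbb{E}_1 Z) + \Upsilon_2(T(X_2)) \big]. \tag{3.6}$$
We have written $T(X_2)$ to emphasize that this matrix depends only on the randomness in $X_2$. To control (3.6), we apply Fubini's theorem to interchange the order of $\mathbb{E}_1$ and $\mathbb{E}_2$, and then we exploit the convexity of the supremum to draw out the expectation $\mathbb{E}_1$:
$$H_\phi(\mathbb{E}_1 Z) \le \mathbb{E}_1 \sup_{T} \, \mathbb{E}_2 \operatorname{tr}\big[ \Upsilon_1(T(X_2)) \cdot Z + \Upsilon_2(T(X_2)) \big] = \mathbb{E}_1 H_\phi(Z \mid X_1).$$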
The last relation is the duality formula (3.2), applied conditionally.
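3.3. Proof of Theorem 1.5. We are now prepared to establish the main result on subadditivity of matrix ϕ-entropy. This theorem is a direct consequence of the conditional Jensen inequality, Lemma 3.3. In this argument, we write $\mathbb{E}_i$ for the expectation with respect to the variable $X_i$. Using the notation from Section 1.2.3, we see that
$$H_\phi(Z \mid x_{-i}) = \mathbb{E}_i \operatorname{tr} \phi(Z) - \operatorname{tr} \phi(\mathbb{E}_i Z).$$
First, separate the matrix ϕ-entropy into two parts by adding and subtracting terms:
$$H_\phi(Z) = \mathbb{E} \operatorname{tr}\big[ \phi(Z) - \phi(\mathbb{E}_1 Z) \big] + \big[ \mathbb{E} \operatorname{tr} \phi(\mathbb{E}_1 Z) - \operatorname{tr} \phi(\mathbb{E} Z) \big].$$
We can rewrite this expression as
$$H_\phi(Z) = \mathbb{E} H_\phi(Z \mid x_{-1}) + H_\phi(\mathbb{E}_1 Z) \le \mathbb{E} H_\phi(Z \mid x_{-1}) + \mathbb{E}_1 H_\phi(Z \mid X_1). \tag{3.7}$$
The inequality follows from Lemma 3.3 because $Z = Z(X_1, x_{-1})$ where $X_1$ and $x_{-1}$ are independent random variables. The first term on the right-hand side of (3.7) coincides with the first summand on the right-hand side of the subadditivity inequality (1.3). We must argue that the remaining summands are contained in the second term on the right-hand side of (3.7). Repeating the argument in the previous paragraph, conditioning on $X_1$, we obtain
$$H_\phi(Z \mid X_1) \le \mathbb{E}\big[ H_\phi(Z \mid x_{-2}) \mid X_1 \big] + \mathbb{E}_2 H_\phi(Z \mid X_1, X_2).$$
Substituting this expression into (3.7), we obtain
$$H_\phi(Z) \le \sum_{i=1}^{2} \mathbb{E} H_\phi(Z \mid x_{-i}) + \mathbb{E}_1 \mathbb{E}_2 H_\phi(Z \mid X_1, X_2).$$
Continuing in this fashion, we arrive at the subadditivity inequality (1.3):
$$H_\phi(Z) \le \sum_{i=1}^{n} \mathbb{E} H_\phi(Z \mid x_{-i}).$$
This completes the proof of Theorem 1.5.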

Entropy Bounds via Exchangeability
In this section, we derive Corollary 1.6, which uses exchangeable pairs to bound the conditional entropies that appear in Theorem 1.5.This result follows from another variational representation of the matrix ϕ-entropy.

4.1. Representation of the Matrix ϕ-Entropy as an Infimum. In this section, we present another formula for the matrix ϕ-entropy.
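Lemma 4.1 (Infimum Representation for Entropy). Fix a function $\phi \in \Phi_\infty$, and let $\psi = \phi'$. Assume that $Z$ is a random positive-semidefinite matrix where $\|Z\|$ and $\|\phi(Z)\|$ are integrable. Then
$$H_\phi(Z) = \inf_{A \in \mathbb{H}^d_+} \mathbb{E} \operatorname{tr}\big[ \phi(Z) - \phi(A) - (Z - A) \psi(A) \big]. \tag{4.1}$$
Let $Z'$ be an independent copy of $Z$. Then
$$H_\phi(Z) \le \frac{1}{2} \, \mathbb{E} \operatorname{tr}\big[ (Z - Z')(\psi(Z) - \psi(Z')) \big]. \tag{4.2}$$
We require a familiar trace inequality [Car10, Thm. 2.11]. This bound simply restates the fact that a convex function lies above its tangents.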
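Proposition 4.2 (Klein's Inequality). Let $f : I \to \mathbb{R}$ be a differentiable convex function on an interval $I$ of the real line. Then
$$\operatorname{tr}\big[ f(B) - f(A) - (B - A) f'(A) \big] \ge 0 \quad \text{for all } A, B \in \mathbb{H}^d(I).$$
With Klein's inequality at hand, the variational inequality follows quickly.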
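Klein's inequality, in the form tr[f(B) − f(A) − (B − A) f′(A)] ≥ 0, is easy to probe numerically. The sketch below draws random positive-definite matrices and evaluates the trace gap for f(t) = t log t. The sampling scheme, the seed, and the helper names are assumptions made for illustration.

```python
import numpy as np

def matfun(f, A):
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def ntr(A):
    return np.trace(A).real / A.shape[0]

rng = np.random.default_rng(1)

def random_pd(d):
    """A random real symmetric positive-definite matrix."""
    G = rng.standard_normal((d, d))
    return G @ G.T + 0.1 * np.eye(d)

f = lambda t: t * np.log(t)
f_prime = lambda t: np.log(t) + 1.0

def klein_gap(A, B):
    """tr[f(B) - f(A) - (B - A) f'(A)], nonnegative for convex f."""
    return ntr(matfun(f, B) - matfun(f, A) - (B - A) @ matfun(f_prime, A))

gaps = [klein_gap(random_pd(3), random_pd(3)) for _ in range(200)]
A_fixed = random_pd(3)
```

The gap vanishes when B = A, which is the matrix analog of a tangent line touching the graph of a convex function.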
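Proof of Lemma 4.1. Every function $\phi \in \Phi_\infty$ is convex and differentiable, so Proposition 4.2 with $f = \phi$ and $B = \mathbb{E} Z$ implies that
$$\operatorname{tr} \phi(\mathbb{E} Z) \ge \operatorname{tr}\big[ \phi(A) + (\mathbb{E} Z - A) \psi(A) \big]$$
for each fixed matrix $A \in \mathbb{H}^d_+$. Substitute this bound into the definition (1.2) of the matrix ϕ-entropy, and draw the expectation out of the trace to reach
$$H_\phi(Z) \le \mathbb{E} \operatorname{tr}\big[ \phi(Z) - \phi(A) - (Z - A) \psi(A) \big]. \tag{4.3}$$
The inequality (4.3) becomes an equality when $A = \mathbb{E} Z$, which establishes the variational representation (4.1). The symmetrized bound (4.2) follows from an exchangeability argument. Select $A = Z'$ in the expression (4.3), and apply the fact that $\mathbb{E} \phi(Z) = \mathbb{E} \phi(Z')$ to obtain
$$H_\phi(Z) \le - \mathbb{E} \operatorname{tr}\big[ (Z - Z') \psi(Z') \big]. \tag{4.4}$$
Since $Z$ and $Z'$ are exchangeable, we can also bound the matrix ϕ-entropy as
$$H_\phi(Z) \le \mathbb{E} \operatorname{tr}\big[ (Z - Z') \psi(Z) \big]. \tag{4.5}$$
Take the average of the two bounds (4.4) and (4.5) to reach the desired inequality (4.2).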
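4.2. Proof of Corollary 1.6. Lemma 4.1 leads to a succinct proof of Corollary 1.6. We continue to use the notation from Section 1.2.3. Apply the inequality (4.2) conditionally to control the conditional matrix ϕ-entropy:
$$H_\phi(Z \mid x_{-i}) \le \frac{1}{2} \, \mathbb{E}\Big[ \operatorname{tr}\big[ (Z - Z'_i)(\psi(Z) - \psi(Z'_i)) \big] \,\Big|\, x_{-i} \Big] \tag{4.6}$$
because $Z'_i$ and $Z$ are conditionally iid, given $x_{-i}$. Take the expectation on both sides of (4.6), and invoke the tower property of conditional expectation:
$$\mathbb{E} H_\phi(Z \mid x_{-i}) \le \frac{1}{2} \, \mathbb{E} \operatorname{tr}\big[ (Z - Z'_i)(\psi(Z) - \psi(Z'_i)) \big]. \tag{4.7}$$
To complete the proof, substitute (4.7) into the right-hand side of the bound (1.3) from the subadditivity result, Theorem 1.5.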

Members of the Φ ∞ function class
In this section, we demonstrate that the classical entropy and certain power functions belong to the $\Phi_\infty$ function class. The main challenge is to verify that $A \mapsto [D\psi(A)]^{-1}$ is a concave operator-valued map. We establish this result for the classical entropy in Section 5.4 and for the power function in Section 5.5.

5.1. Tensor Product Operators. First, we explain the tensor product construction of an operator. The tensor product will allow us to represent the derivative of a standard matrix function compactly.
The operator A ⊗ B is self-adjoint because we assume the factors are Hermitian matrices.
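Suppose that $A, B \in \mathbb{H}^d$ are Hermitian matrices with spectral resolutions
$$A = \sum_{i=1}^{d} \lambda_i P_i \quad \text{and} \quad B = \sum_{j=1}^{d} \mu_j Q_j.$$
Then the tensor product $A \otimes B$ has the spectral resolution
$$A \otimes B = \sum_{i,j=1}^{d} \lambda_i \mu_j \, (P_i \otimes Q_j).$$
In particular, the tensor product of two positive-definite matrices is a positive-definite operator.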
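We also require the Hermite representation of the divided difference:
$$f^{[1]}(\lambda, \mu) = \int_0^1 f'(\tau \lambda + \bar{\tau} \mu) \, d\tau, \tag{5.1}$$
where we have written $\bar{\tau} := 1 - \tau$.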
The following result gives an explicit expression for the derivative of a standard matrix function in terms of a divided difference.
Proposition 5.3 (Daleckiĭ–Kreĭn Formula). Let $f : I \to \mathbb{R}$ be a continuously differentiable function on an interval $I$ of the real line. Suppose that $A \in \mathbb{H}^d(I)$ is a diagonal matrix with $A = \operatorname{diag}(a_1, \dots, a_d)$. The derivative $Df(A) \in B(\mathbb{M}^d)$, and
$$Df(A)(H) = f^{[1]}(A) \odot H \quad \text{for } H \in \mathbb{M}^d,$$
where $\odot$ denotes the Schur (i.e., componentwise) product and $f^{[1]}(A)$ refers to the matrix of divided differences:
$$\big[ f^{[1]}(A) \big]_{ij} := f^{[1]}(a_i, a_j) \quad \text{for } i, j = 1, \dots, d.$$
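The formula can be compared against a finite-difference approximation of the directional derivative. The sketch below takes f = log and a diagonal matrix A, builds the matrix of first divided differences (with f′ on the diagonal), and checks the Schur-product expression against a central difference. The step size, the test data, and the helper names are illustrative assumptions.

```python
import numpy as np

def matfun(f, A):
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def divided_differences(f, f_prime, a):
    """Matrix of first divided differences f^[1](a_i, a_j)."""
    gap = a[:, None] - a[None, :]
    equal = np.isclose(gap, 0.0)
    quotient = (f(a)[:, None] - f(a)[None, :]) / np.where(equal, 1.0, gap)
    # Where a_i == a_j, the divided difference is the derivative f'(a_i).
    return np.where(equal, f_prime(a)[:, None], quotient)

a = np.array([1.0, 2.0, 3.5])
A = np.diag(a)
H = np.array([[0.3, -1.0, 0.5],
              [-1.0, 0.8, 0.2],
              [0.5, 0.2, -0.4]])

# Derivative from the Schur-product formula.
dk_derivative = divided_differences(np.log, lambda t: 1.0 / t, a) * H

# Central finite difference of s -> log(A + s H) at s = 0.
step = 1e-6
fd_derivative = (matfun(np.log, A + step * H)
                 - matfun(np.log, A - step * H)) / (2 * step)
```

The two matrices should agree up to the discretization error of the finite difference.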

Operator Means.
Our approach also relies on the concept of an operator mean.The following definition is due to Kubo & Ando [KA80].
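Definition 5.4 (Operator Mean). Let $f : \mathbb{R}_{++} \to \mathbb{R}_{++}$ be an operator concave function that satisfies $f(1) = 1$. Fix a natural number $d$. Let $S$ and $T$ be positive-definite operators in $B(\mathbb{M}^d)$.
We define the mean of the operators:
$$M_f(S, T) := S^{1/2} \cdot f\big( S^{-1/2} \, T \, S^{-1/2} \big) \cdot S^{1/2}.$$
When $S$ and $T$ commute, the formula simplifies to
$$M_f(S, T) = S \cdot f(S^{-1} T).$$
A few examples may be helpful. The function $f(s) = (1 + s)/2$ represents the usual arithmetic mean, the function $f(s) = s^{1/2}$ gives the geometric mean, and the function $f(s) = 2s/(1 + s)$ yields the harmonic mean. Operator means have a concavity property, which was established in the paper [KA80].
Proposition 5.5. Let $M_f$ be an operator mean, and let $S_1, S_2, T_1, T_2$ be positive-definite operators in $B(\mathbb{M}^d)$. Then
$$\alpha \cdot M_f(S_1, T_1) + \bar{\alpha} \cdot M_f(S_2, T_2) \preccurlyeq M_f(\alpha S_1 + \bar{\alpha} S_2, \; \alpha T_1 + \bar{\alpha} T_2)$$
for $\alpha \in [0, 1]$ and $\bar{\alpha} = 1 - \alpha$.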
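The operator mean construction can be explored with ordinary matrices, which stand in for operators under the isomorphism of Section 2.1. The sketch below assumes the Kubo–Ando form M_f(S, T) = S^{1/2} f(S^{-1/2} T S^{-1/2}) S^{1/2} and checks the three named means against their familiar closed forms, followed by a spot check of joint concavity for the geometric mean. The random test matrices and helper names are hypothetical.

```python
import numpy as np

def matfun(f, A):
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def operator_mean(f, S, T):
    """M_f(S, T) = S^{1/2} f(S^{-1/2} T S^{-1/2}) S^{1/2}."""
    S_half = matfun(np.sqrt, S)
    S_inv_half = matfun(lambda t: 1.0 / np.sqrt(t), S)
    X = S_inv_half @ T @ S_inv_half
    X = (X + X.conj().T) / 2  # remove rounding asymmetry
    return S_half @ matfun(f, X) @ S_half

rng = np.random.default_rng(2)

def random_pd(d):
    G = rng.standard_normal((d, d))
    return G @ G.T + 0.5 * np.eye(d)

S1, T1, S2, T2 = (random_pd(3) for _ in range(4))

arithmetic = operator_mean(lambda s: (1 + s) / 2, S1, T1)
geometric = operator_mean(np.sqrt, S1, T1)
harmonic = operator_mean(lambda s: 2 * s / (1 + s), S1, T1)

# Joint concavity check for the geometric mean at alpha = 0.3.
alpha = 0.3
mixed = operator_mean(np.sqrt, alpha * S1 + (1 - alpha) * S2,
                      alpha * T1 + (1 - alpha) * T2)
averaged = (alpha * operator_mean(np.sqrt, S1, T1)
            + (1 - alpha) * operator_mean(np.sqrt, S2, T2))
concavity_gap = mixed - averaged
concavity_gap = (concavity_gap + concavity_gap.T) / 2
```

The geometric mean G is characterized by the Riccati equation G S⁻¹ G = T, which gives a check that does not depend on the construction itself.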

Entropy.
In this section, we demonstrate that the standard entropy function is a member of the Φ ∞ function class.
Theorem 5.6.The function ϕ : t → t log t − t is a member of the Φ ∞ class.
This result immediately implies Theorem 1.3(1), which states that t → t log t belongs to Φ ∞ .Indeed, the matrix entropy class contains all affine functions and all finite sums of its elements.Theorem 5.6 follows easily from (deep) classical results because the variational representation of the standard entropy from Lemma 3.1 is equivalent with the joint convexity of quantum relative entropy [Lin73].Instead of pursuing this idea, we present an argument that parallels the approach we use to study the power function.Some of the calculations below also appear in [Lie73, Proof of Cor.2.1], albeit in compressed form.
Proof.Fix a positive integer d.We plan to show that the function ϕ : t → t log t − t is a member of the class Φ d .Evidently, ϕ is continuous and convex on R + , and it has two continuous derivatives on R ++ .It remains to verify the concavity condition for the second derivative.
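Write $\psi(t) = \phi'(t) = \log t$, and let $A \in \mathbb{H}^d_{++}$. Without loss of generality, we may choose a basis where $A = \operatorname{diag}(a_1, \dots, a_d)$. The Daleckiĭ–Kreĭn formula, Proposition 5.3, tells us
$$D\psi(A)(H) = \left[ \frac{\log a_i - \log a_j}{a_i - a_j} \right]_{ij} \odot H.$$
As an operator, the derivative acts by Schur multiplication. This formula also makes it clear that the inverse of this operator acts by Schur multiplication:
$$[D\psi(A)]^{-1}(H) = \left[ \frac{a_i - a_j}{\log a_i - \log a_j} \right]_{ij} \odot H.$$
Using the Hermite representation (5.1) of the first divided difference of $t \mapsto e^t$, we find
$$g(\mu, \lambda) := \frac{\mu - \lambda}{\log \mu - \log \lambda} = \int_0^1 \mu^{\tau} \lambda^{\bar{\tau}} \, d\tau.$$
The latter calculation assumes that $\mu \ne \lambda$; it extends to the case $\mu = \lambda$ because both sides of the identity are continuous. As a consequence,
$$[D\psi(A)]^{-1}(H) = \int_0^1 \big[ a_i^{\tau} a_j^{\bar{\tau}} \big]_{ij} \odot H \, d\tau.$$
We discover the expression
$$[D\psi(A)]^{-1} = \int_0^1 \big( A^{\tau} \otimes A^{\bar{\tau}} \big) \, d\tau. \tag{5.2}$$
This formula is correct for every positive-definite matrix.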
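For each $\tau \in [0, 1]$, consider the operator monotone function $f : t \mapsto t^{\tau}$ defined on $\mathbb{R}_+$. Since $f(1) = 1$, we can construct the operator mean $M_f$ associated with the function $f$. Note that $A \otimes I$ and $I \otimes A$ are commuting positive operators. Thus,
$$M_f(A \otimes I, \, I \otimes A) = (A \otimes I) \cdot f\big( (A \otimes I)^{-1} (I \otimes A) \big) = (A \otimes I)(A^{-1} \otimes A)^{\tau} = A^{\bar{\tau}} \otimes A^{\tau}.$$
The map $A \mapsto (A \otimes I, I \otimes A)$ is linear, so Proposition 5.5 guarantees that $A \mapsto A^{\bar{\tau}} \otimes A^{\tau}$ is concave for each $\tau \in [0, 1]$. This result is usually called the Lieb Concavity Theorem [Bha97, Thm. IX.6.1]. Combine this fact with the integral representation (5.2) to reach the conclusion that $A \mapsto [D\psi(A)]^{-1}$ is a concave map on the cone $\mathbb{H}^d_{++}$ of positive-definite matrices.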
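The integral representation behind this argument can be verified entrywise. In a basis where A is diagonal, the inverse of the derivative of the logarithm acts by Schur multiplication with the logarithmic means (a_i − a_j)/(log a_i − log a_j), and each such entry should equal the integral of a_i^τ a_j^{1−τ} over τ in [0, 1]. The sketch below checks this with Gauss–Legendre quadrature; the node count and the test eigenvalues are arbitrary choices.

```python
import numpy as np

# Gauss-Legendre rule mapped from [-1, 1] to [0, 1].
nodes, weights = np.polynomial.legendre.leggauss(40)
tau = (nodes + 1.0) / 2.0
w = weights / 2.0

a = np.array([0.5, 1.0, 4.0])  # eigenvalues of a positive-definite A
a_gap = a[:, None] - a[None, :]
log_gap = np.log(a)[:, None] - np.log(a)[None, :]
equal = np.isclose(a_gap, 0.0)

# Schur multiplier of D(log)(A): divided differences of the logarithm.
log_multiplier = np.where(equal, 1.0 / a[:, None],
                          log_gap / np.where(equal, 1.0, a_gap))

# Schur multiplier of the inverse: logarithmic means, with a_i on the diagonal.
inverse_closed = np.where(equal, a[:, None],
                          a_gap / np.where(equal, 1.0, log_gap))

# Entrywise integral of a_i^tau * a_j^(1 - tau) over tau in [0, 1].
inverse_integral = sum(wk * np.outer(a**tk, a**(1.0 - tk))
                       for tk, wk in zip(tau, w))
```

Because both operators act by Schur multiplication in this basis, the entrywise product of the two multipliers is the all-ones matrix.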
5.5.Power Functions.In this section, we prove that certain power functions belong to the Φ ∞ function class.
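Theorem 5.7. For each $p \in [0, 1]$, the function $\phi : t \mapsto t^{p+1}/(p + 1)$ is a member of the $\Phi_\infty$ class.
This result immediately implies Theorem 1.3(2), which states that $t \mapsto t^{p+1}$ belongs to the class $\Phi_\infty$. Indeed, the matrix entropy class contains all positive multiples of its elements. The proof of Theorem 5.7 follows the same path as Theorem 5.6, but it is somewhat more involved. First, we derive an expression for the function $A \mapsto [D\psi(A)]^{-1}$ where $\psi = \phi'$.
Lemma 5.8. Fix $p \in (0, 1]$, and let $\psi(t) = t^p$ for $t \ge 0$. For each matrix $A \in \mathbb{H}^d_{++}$,
$$[D\psi(A)]^{-1} = \frac{1}{p} \int_0^1 \big( \tau \cdot A^p \otimes I + \bar{\tau} \cdot I \otimes A^p \big)^{(1-p)/p} \, d\tau, \tag{5.3}$$
where $\bar{\tau} := 1 - \tau$.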
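Proof. As before, we may assume without loss of generality that the matrix $A = \operatorname{diag}(a_1, \dots, a_d)$. Using the Daleckiȋ-Kreȋn formula, Proposition 5.3, we see that
$$[D\psi(A)]^{-1}(H) = \Big[\frac{1}{\psi^{[1]}(a_i, a_j)}\Big]_{ij} \odot H.$$
The Hermite representation (5.1) of the first divided difference of $t \mapsto t^{1/p}$ gives
$$\frac{1}{\psi^{[1]}(\mu, \lambda)} = \frac{\mu - \lambda}{\mu^p - \lambda^p} = \frac{(\mu^p)^{1/p} - (\lambda^p)^{1/p}}{\mu^p - \lambda^p} = \frac{1}{p}\int_0^1 \big(\tau\mu^p + \bar\tau\lambda^p\big)^{(1-p)/p}\,d\tau =: g(\mu, \lambda).$$
We use continuity to verify that the latter calculation remains valid when $\mu = \lambda$. Using this function $g$, we can identify a compact representation of the operator:
$$[D\psi(A)]^{-1}(H) = \Big[\sum_{i,j} g(a_i, a_j)\,(E_{ii} \otimes E_{jj})\Big](H),$$
where we write $E_{ij}$ for the matrix with a one in the $(i, j)$ position and zeros elsewhere. It remains to verify that the bracket coincides with the expression (5.3). Indeed,
$$\begin{aligned}
\sum_{i,j} g(a_i, a_j)\,(E_{ii} \otimes E_{jj})
&= \frac{1}{p}\int_0^1 \sum_{i,j} \big(\tau a_i^p + \bar\tau a_j^p\big)^{(1-p)/p}\,(E_{ii} \otimes E_{jj})\,d\tau \\
&= \frac{1}{p}\int_0^1 \Big[\sum_{i,j} \big(\tau a_i^p + \bar\tau a_j^p\big)\,(E_{ii} \otimes E_{jj})\Big]^{(1-p)/p}\,d\tau \\
&= \frac{1}{p}\int_0^1 \big(\tau\, A^p \otimes I + \bar\tau\, I \otimes A^p\big)^{(1-p)/p}\,d\tau.
\end{aligned}$$
The second relation follows from the definition of the standard operator function associated with $t \mapsto t^{(1-p)/p}$. To confirm that the third line equals the second, expand the matrices $A = \sum_i a_i E_{ii}$ (so that $A^p = \sum_i a_i^p E_{ii}$) and $I = \sum_j E_{jj}$ and invoke the bilinearity of the tensor product.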
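The proof of Lemma 5.8 rests on a scalar identity, namely that (µ − λ)/(µ^p − λ^p) equals (1/p) times the integral over [0, 1] of (τ µ^p + (1 − τ) λ^p)^((1−p)/p). The sketch below compares the two sides numerically for a few values. It is only a sanity check under stated assumptions: the function names, the sample points, and the trapezoid resolution are hypothetical choices, not taken from the paper.

```python
import numpy as np

def inv_power_divided_difference(mu, lam, p):
    """Return (mu - lam) / (mu**p - lam**p), extended by continuity to mu == lam."""
    if np.isclose(mu, lam):
        return mu ** (1.0 - p) / p
    return (mu - lam) / (mu**p - lam**p)

def hermite_integral(mu, lam, p, n=20001):
    """Trapezoid value of (1/p) * integral of (tau*mu^p + (1-tau)*lam^p)^((1-p)/p)."""
    tau = np.linspace(0.0, 1.0, n)
    vals = (tau * mu**p + (1.0 - tau) * lam**p) ** ((1.0 - p) / p) / p
    step = tau[1] - tau[0]
    return step * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

# Compare the two sides on a small hypothetical set of exponents and arguments.
worst_gap = 0.0
for p in (0.25, 0.5, 1.0):
    for mu, lam in ((0.5, 2.0), (1.0, 3.0), (1.7, 1.7)):
        lhs = inv_power_divided_difference(mu, lam, p)
        rhs = hermite_integral(mu, lam, p)
        worst_gap = max(worst_gap, abs(lhs - rhs))
```

The pair (1.7, 1.7) exercises the diagonal case µ = λ, where the identity holds by continuity.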
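Proof of Theorem 5.7. We are now prepared to prove that certain power functions belong to the $\Phi_\infty$ function class. Fix an exponent $p \in [0, 1]$, and let $d$ be a fixed positive integer. We intend to show that the function $\varphi(t) = t^{p+1}/(p+1)$ belongs to the $\Phi_d$ class. When $p = 0$, the function $\varphi$ is affine, so we may assume that $p > 0$. It is clear that $\varphi$ is continuous and convex on $\mathbb{R}_+$, and $\varphi$ has two continuous derivatives on $\mathbb{R}_{++}$. It remains to verify that the second derivative has the required concavity property. Let $\psi(t) = \varphi'(t) = t^p$ for $t \geq 0$, and consider a matrix $A \in \mathbb{H}^d_{++}$. Lemma 5.8 demonstrates that
$$[D\psi(A)]^{-1} = \frac{1}{p}\int_0^1 \big(\tau\, A^p \otimes I + \bar\tau\, I \otimes A^p\big)^{(1-p)/p}\,d\tau, \tag{5.4}$$
where we maintain the usage $\bar\tau := 1 - \tau$. For each $\tau \in [0, 1]$, the scalar function $a \mapsto \tau a + \bar\tau$ is operator monotone because it is affine and increasing. On account of the result [And79, Cor. 4.3], the function
$$f(a) := \big(\tau a^p + \bar\tau\big)^{1/p}$$
is also operator monotone. A short calculation shows that $f(1) = 1$. Therefore, we can use $f$ to construct an operator mean $M_f$. Since $A \otimes I$ and $I \otimes A$ are commuting positive operators, we have
$$M_f(A \otimes I, I \otimes A) = (A \otimes I)\cdot f\big(A^{-1} \otimes A\big) = \big(\bar\tau\, A^p \otimes I + \tau\, I \otimes A^p\big)^{1/p}.$$
The map $A \mapsto (A \otimes I, I \otimes A)$ is linear, so Proposition 5.5 ensures that this mean is a concave function of $A$ for each $\tau \in [0, 1]$. Exchanging the roles of $\tau$ and $\bar\tau$, we conclude that
$$A \mapsto \big(\tau\, A^p \otimes I + \bar\tau\, I \otimes A^p\big)^{1/p} \tag{5.5}$$
is a concave map.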
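We are now prepared to check that (5.4) defines a concave operator. Let $S, T$ be arbitrary positive-definite matrices, and choose $\alpha \in [0, 1]$. Suppose that $Z$ is the random matrix that takes value $S$ with probability $\alpha$ and value $T$ with probability $1 - \alpha$. For each $\tau \in [0, 1]$, we compute
$$\begin{aligned}
\mathbb{E}\Big[\big(\tau\, Z^p \otimes I + \bar\tau\, I \otimes Z^p\big)^{1/p}\Big]^{1-p}
&\preccurlyeq \Big[\mathbb{E}\,\big(\tau\, Z^p \otimes I + \bar\tau\, I \otimes Z^p\big)^{1/p}\Big]^{1-p} \\
&\preccurlyeq \big(\tau\, (\mathbb{E} Z)^p \otimes I + \bar\tau\, I \otimes (\mathbb{E} Z)^p\big)^{(1-p)/p}.
\end{aligned}$$
The first relation holds because $t \mapsto t^{1-p}$ is operator concave [Bha97, Thm. V.1.9 and Thm. V.2.5].
To obtain the second relation, we apply the concavity property of the map (5.5), followed by the fact that $t \mapsto t^{1-p}$ is operator monotone [Bha97, Thm. V.1.9]. This calculation establishes the claim that
$$A \mapsto \big(\tau\, A^p \otimes I + \bar\tau\, I \otimes A^p\big)^{(1-p)/p}$$
is concave on $\mathbb{H}^d_{++}$ for each $\tau \in [0, 1]$. In view of the integral representation (5.4), we may conclude that $A \mapsto [D\psi(A)]^{-1}$ is concave on the cone $\mathbb{H}^d_{++}$ of positive-definite matrices.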

A Bounded Difference Inequality for Random Matrices
In this section, we prove Theorem 1.8, a bounded difference inequality for a random matrix whose distribution is invariant under signed permutation. We begin with some preliminaries that support the proof, and we establish the main result in Section 6.2.
6.1. Preliminaries. First, we describe how to compute the expectation of a function of a random matrix whose distribution is invariant under signed permutation. See Definition 1.7 for a reminder of what this requirement means.
Lemma 6.1. Let $f : I \to \mathbb{R}$ be a function on an interval $I$ of the real line. Assume that $X \in \mathbb{H}^d(I)$ is a random matrix whose distribution is invariant under signed permutation. Then
$$\mathbb{E} f(X) = \big(\mathbb{E}\,\overline{\operatorname{tr}}\, f(X)\big)\cdot I,$$
where $\overline{\operatorname{tr}}$ denotes the normalized trace.
Proof. Let $\Pi \in \mathbb{H}^d$ be an arbitrary signed permutation matrix. Observe that
$$\mathbb{E} f(X) = \mathbb{E} f(\Pi^* X \Pi) = \Pi^* \big[\mathbb{E} f(X)\big] \Pi. \tag{6.1}$$
The first relation holds because the distribution of $X$ is invariant under conjugation by $\Pi$. The second relation follows from the definition of a standard matrix function and the fact that $\Pi$ is unitary. We may average (6.1) over $\Pi$ drawn from the uniform distribution on the set of signed permutation matrices. A direct calculation shows that the resulting matrix is diagonal, and its diagonal entries are identically equal to $\overline{\operatorname{tr}}[\mathbb{E} f(X)]$.
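The averaging step at the end of the proof of Lemma 6.1 can be verified by brute force in small dimension. The sketch below enumerates all real signed permutation matrices for d = 3 and averages the conjugates of a fixed symmetric matrix; the result should be the normalized trace times the identity. The helper names and the test matrix are illustrative assumptions, and the code restricts attention to real symmetric matrices for simplicity.

```python
import itertools
import numpy as np

def signed_permutations(d):
    """Yield all d x d real signed permutation matrices (d! * 2**d of them)."""
    identity = np.eye(d)
    for perm in itertools.permutations(range(d)):
        P = identity[list(perm)]              # permutation matrix: rows of I reordered
        for signs in itertools.product((1.0, -1.0), repeat=d):
            yield P * np.array(signs)         # multiply column j by signs[j]

def average_conjugation(M):
    """Average Pi^T M Pi over the uniform distribution on signed permutations."""
    d = M.shape[0]
    total = np.zeros((d, d))
    count = 0
    for Pi in signed_permutations(d):
        total += Pi.T @ M @ Pi
        count += 1
    return total / count

# Hypothetical symmetric matrix standing in for E f(X).
M = np.array([[2.0, -1.0, 0.5], [-1.0, 0.3, 4.0], [0.5, 4.0, -1.2]])
averaged = average_conjugation(M)
normalized_trace = np.trace(M) / M.shape[0]
```

The sign flips cancel every off-diagonal entry, and the permutations equalize the diagonal entries, which is the "direct calculation" the proof refers to.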
We also require a trace inequality that is related to the mean value theorem. This result specializes a lemma from [MJC+12].
6.2. Proof of Theorem 1.8. The argument proceeds in three steps. First, we present some elements of the matrix Laplace transform method. Second, we use the subadditivity of matrix ϕ-entropy to deduce a differential inequality for the trace moment generating function of the random matrix. Finally, we explain how to integrate the differential inequality to obtain the concentration result.
6.2.1. The Matrix Laplace Transform Method. We begin with a matrix extension of the moment generating function (mgf), which has played a major role in recent work on matrix concentration.

6.2.2. A Differential Inequality for the Trace Mgf. Suppose that Y ∈ H^d is a random Hermitian matrix that depends on a random vector x := (X_1, ..., X_n). We require the distribution of Y to be invariant under signed permutations, and we insist that Y is bounded. Without loss of generality, assume that Y has zero mean. Throughout the argument, we let the notation of Section 1.2.3 and Theorem 1.8 prevail.
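Let us explain how to use the subadditivity of matrix ϕ-entropy to derive a differential inequality for the trace mgf. Consider the function $\varphi(t) = t \log t$, which belongs to the $\Phi_\infty$ class because of Theorem 1.3(1). Introduce the random positive-definite matrix $Z := e^{\theta Y}$, where $\theta > 0$. We write out an expression for the matrix ϕ-entropy of $Z$:
$$\begin{aligned}
H_\varphi(Z) &= \mathbb{E}\,\overline{\operatorname{tr}}\big[\varphi(Z) - \varphi(\mathbb{E} Z)\big] \\
&= \mathbb{E}\,\overline{\operatorname{tr}}\big[(\theta Y)\, e^{\theta Y}\big] - \overline{\operatorname{tr}}\big[(\mathbb{E}\, e^{\theta Y}) \log(\mathbb{E}\, e^{\theta Y})\big] \\
&= \theta\, \mathbb{E}\,\overline{\operatorname{tr}}\big[Y e^{\theta Y}\big] - \big(\mathbb{E}\,\overline{\operatorname{tr}}\, e^{\theta Y}\big) \log\big(\mathbb{E}\,\overline{\operatorname{tr}}\, e^{\theta Y}\big) \\
&= \theta\, m'(\theta) - m(\theta) \log m(\theta).
\end{aligned} \tag{6.2}$$
In the third line, we have applied Lemma 6.1 to the logarithm in the second term, relying on the fact that $Y$ is invariant under signed permutations. To reach the last line, we recognize that $m'(\theta) = \mathbb{E}\,\overline{\operatorname{tr}}(Y e^{\theta Y})$. We have used the boundedness of $Y$ to justify this derivative calculation.

Corollary 1.6 provides an upper bound for the matrix ϕ-entropy. Define the derivative $\psi(t) = \varphi'(t) = 1 + \log t$.

Consider the function $f : t \mapsto e^{\theta t}$. Its derivative $f' : t \mapsto \theta e^{\theta t}$ is convex because $\theta > 0$, so Proposition 6.2 delivers the bound

The second relation follows from the fact that $Y$ and $Y'_i$ are exchangeable, conditional on $x_{-i}$. The last line is just the tower property of conditional expectation, combined with the observation that $Y$ is a function of $x$. To continue, we simplify the expression and make some additional bounds.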
The second relation follows from a standard trace inequality and the observation that e^{θY} is positive definite. Last, we identify the variance measure V_Y defined in (1.4) and the trace mgf m(θ).
Combine the expression (6.2) with the inequality (6.3) to arrive at the estimate We can use this differential inequality to obtain bounds on the trace mgf m(θ).
6.2.3. Solving the Differential Inequality. Rearrange the differential inequality (6.4) to obtain

This is where we use the hypothesis that Y has mean zero. Now, we integrate (6.5) from zero to some positive value θ to find that the trace mgf satisfies

The approach in this section is usually referred to as the Herbst argument [Led99].
6.2.4. The Laplace Transform Argument. We are now prepared to finish the argument. Combine the matrix Laplace transform method, Proposition 6.4, with the trace mgf bound (6.6) to reach

To obtain the result for the minimum eigenvalue, we note that

The inequality follows when we apply (6.7) to the random matrix −Y. This completes the proof of Theorem 1.8.

Moment Inequalities for Random Matrices with Bounded Differences
In this section, we prove Theorem 1.9, which gives information about the moments of a random matrix that satisfies a kind of self-bounding property.
Proof of Theorem 1.9. Fix a number q ∈ {2, 3, 4, ...}. Suppose that Y ∈ H^d_+ is a random positive-semidefinite matrix that depends on a random vector x := (X_1, ..., X_n). We require the distribution of Y to be invariant under signed permutations, and we assume that E(‖Y‖^q) < ∞. The notation of Section 1.2.3 and Theorem 1.9 remains in force.

The second inequality derives from the hypothesis (1.5) that V_Y ≼ cY. Note that this bound requires the fact that Y^{q−2} is positive semidefinite.
We have compared the qth trace moment of Y with the (q − 1)th trace moment. Proceeding by iteration, we arrive at

This observation completes the proof of Theorem 1.9.
Appendix A. Lemma 3.1, The General Case
In this appendix, we explain how to prove Lemma 3.1 in full generality. The argument calls for a simple but powerful result, known as the generalized Klein inequality [Pet94, Prop. 3], which allows us to lift a large class of scalar inequalities to matrices.
Proof of Lemma 3.1, General Case. We retain the notation from Lemma 3.1. In particular, we assume that Z is a random positive-definite matrix for which Z and ϕ(Z) are both integrable. We also assume that T is a random positive-definite matrix with T and ϕ(T) integrable.
For n ∈ N, define the function l_n(a) := (a ∨ 1/n) ∧ n, where ∨ denotes the maximum operator and ∧ denotes the minimum operator. Consider the random matrices Z_n := l_n(Z) and T_k := l_k(T) for each k, n ∈ N. These matrices have eigenvalues that are bounded and bounded away from zero, so these entities satisfy the inequality (3.3) we have already established.
Rearrange the terms in this inequality to obtain

Let us begin with the right-hand side of (A.1). We have the sure limit Z_n → Z. Therefore, the Dominated Convergence Theorem guarantees that E Z_n → E Z because Z is integrable and ‖Z_n‖ ≤ ‖Z‖. Likewise, E T_k → E T. The functions ϕ and ψ are continuous, so the limit of the right-hand side of (A.1) satisfies

This expression coincides with the right-hand side of (A.2).
Taking the limit of the left-hand side of (A.1) is more involved because the function ψ may grow quickly at zero and infinity. We accomplish our goal in two steps. First, we take the limit as n → ∞. Afterward, we take the limit as k → ∞.
Introduce the nonnegative function γ(z, t) := ϕ(z) − ϕ(t) − (z − t)ψ(t) for z, t > 0.

Observe that the right-hand side of this inequality is integrable. Indeed, all of the quantities involving T_k are uniformly bounded because the eigenvalues of T_k fall in the range [k^{−1}, k] and the functions ϕ and ψ are continuous on this interval. The terms involving Z may not be bounded, but they are integrable because Z and ϕ(Z) are integrable. We may now apply the Dominated Convergence Theorem to take the limit:

where we rely again on the sure limit Z_n → Z as n → ∞. Boucheron et al. also establish that γ(z, l_k(t)) ≤ γ(z, 1) + γ(z, t) for z, t > 0.
We may assume that the second term on the right-hand side is integrable or else the desired inequality (A.2) would be vacuous. The first term is integrable because Z and ϕ(Z) are integrable. Therefore, we may apply the Dominated Convergence Theorem:
E tr Γ(Z, T_k) → E tr Γ(Z, T) as k → ∞, (A.6)
where we rely again on the sure limit T_k → T as k → ∞. In summary, the limits (A.5) and (A.6) provide that E tr Γ(Z_n, T_k) → E tr Γ(Z, T) as k, n → ∞. In view of the limit (A.3), we have completed the proof of (A.2).
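The truncation l_n used in this proof acts on a positive-definite matrix through its eigenvalues. The following sketch, under the assumption that the standard matrix function is computed from an eigendecomposition, shows the effect on a small example: the spectrum is clipped to [1/n, n], and the truncated matrix agrees with the original once n is large enough. The function name and the test matrix are hypothetical.

```python
import numpy as np

def truncate_spectrum(Z, n):
    """Apply l_n(a) = min(max(a, 1/n), n) to the eigenvalues of a symmetric matrix Z."""
    w, U = np.linalg.eigh(Z)
    w_clipped = np.clip(w, 1.0 / n, float(n))
    return (U * w_clipped) @ U.T

# Hypothetical positive-definite matrix with one tiny and one large eigenvalue.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Z = (Q * np.array([1e-3, 0.5, 40.0])) @ Q.T

Z_10 = truncate_spectrum(Z, 10)        # eigenvalues forced into [0.1, 10]
Z_1000 = truncate_spectrum(Z, 1000)    # truncation is inactive here, so Z_1000 matches Z
```

For a fixed matrix, the truncation is inactive as soon as 1/n lies below the smallest eigenvalue and n lies above the largest one, which is the pointwise content of the sure limit Z_n → Z.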

Definition 1.7 (Invariance under Signed Permutation). A random matrix Y ∈ H^d is invariant under signed permutation if we have the equality of distribution Y ∼ Π* Y Π for each signed permutation Π.

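5.2. The Derivative of a Standard Matrix Function. Next, we present some classical results on the derivative of a standard matrix function. See [Bha97, Sec. V.3] for further details.
Definition 5.2 (Divided Difference). Let $f : I \to \mathbb{R}$ be a continuously differentiable function on an interval $I$ of the real line. The first divided difference is the map $f^{[1]} : \mathbb{R}^2 \to \mathbb{R}$ defined by
$$f^{[1]}(\mu, \lambda) := \begin{cases} f'(\lambda), & \mu = \lambda, \\[2pt] \dfrac{f(\mu) - f(\lambda)}{\mu - \lambda}, & \mu \neq \lambda. \end{cases}$$
Proposition 5.5 (Operator Means are Concave). Let $f : \mathbb{R}_{++} \to \mathbb{R}_{++}$ be an operator monotone function with $f(1) = 1$. Fix a natural number $d$. Suppose that $S_1, S_2, T_1, T_2$ are positive-definite operators in $B(\mathbb{M}^d)$. Then, for each $\alpha \in [0, 1]$ and $\bar\alpha := 1 - \alpha$,
$$\alpha\, M_f(S_1, T_1) + \bar\alpha\, M_f(S_2, T_2) \preccurlyeq M_f(\alpha S_1 + \bar\alpha S_2,\; \alpha T_1 + \bar\alpha T_2).$$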

Definition 6.3 (Trace Mgf). Let $Y$ be a random Hermitian matrix. The normalized trace moment generating function of $Y$ is defined as $m(\theta) := m_Y(\theta) := \mathbb{E}\,\overline{\operatorname{tr}}\, e^{\theta Y}$ for $\theta \in \mathbb{R}$. The expectation need not exist for all values of $\theta$.
The following proposition explains how the trace mgf can be used to study the maximum eigenvalue of a random Hermitian matrix [Tro11, Prop. 3.1].
Proposition 6.4 (Matrix Laplace Transform Method). Let $Y \in \mathbb{H}^d$ be a random matrix with normalized trace mgf $m(\theta) := \mathbb{E}\,\overline{\operatorname{tr}}\, e^{\theta Y}$. For each $t \in \mathbb{R}$,
$$\mathbb{P}\{\lambda_{\max}(Y) \geq t\} \leq d \cdot \inf_{\theta > 0} e^{-\theta t + \log m(\theta)}.$$
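To make Proposition 6.4 concrete, the sketch below does two things. It checks the pathwise inequality exp(θ λ_max(Y)) ≤ d · tr̄ exp(θY) that underlies the method, and it evaluates the right-hand side of the proposition on a grid of θ values for a hypothetical quadratic bound log m(θ) ≤ vθ²/2. The parameter v, the grid, and the function names are assumptions made for illustration; they are not the constants of Theorem 1.8.

```python
import numpy as np

def normalized_trace_exp(Y, theta):
    """Normalized trace of exp(theta * Y) for a symmetric matrix Y."""
    w = np.linalg.eigvalsh(Y)
    return np.mean(np.exp(theta * w))

def laplace_tail_bound(log_mgf, t, d, thetas):
    """Evaluate d * min over the grid of exp(-theta * t + log m(theta))."""
    exponents = np.array([-th * t + log_mgf(th) for th in thetas])
    return d * np.exp(exponents.min())

# Pathwise fact behind the method: exp(theta * lambda_max) <= d * normalized trace.
Y = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 2.0], [0.0, 2.0, -1.0]])
theta = 0.7
lhs = np.exp(theta * np.linalg.eigvalsh(Y)[-1])
rhs = Y.shape[0] * normalized_trace_exp(Y, theta)

# Hypothetical quadratic bound on log m(theta) with parameter v (an assumption).
v, t, d = 2.0, 3.0, 3
thetas = np.linspace(1e-3, 5.0, 5000)
bound = laplace_tail_bound(lambda th: 0.5 * v * th**2, t, d, thetas)
closed_form = d * np.exp(-t**2 / (2.0 * v))
```

With a quadratic log-mgf bound, the infimum is attained at θ = t/v, so the grid value should sit just above the closed form d · exp(−t²/(2v)).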

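The l'Hôpital rule allows us to calculate the value of $\theta^{-1} \log m(\theta)$ at zero. Since $m(0) = 1$,
$$\lim_{\theta \downarrow 0} \frac{\log m(\theta)}{\theta} = \lim_{\theta \downarrow 0} \frac{m'(\theta)}{m(\theta)} = \lim_{\theta \downarrow 0} \frac{\mathbb{E}\,\overline{\operatorname{tr}}(Y e^{\theta Y})}{\mathbb{E}\,\overline{\operatorname{tr}}\, e^{\theta Y}} = \mathbb{E}\,\overline{\operatorname{tr}}\, Y = 0.$$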

Proposition A.1 (Generalized Klein Inequality). For each $k = 1, \dots, n$, suppose that $f_k : I_1 \to \mathbb{R}$ and $g_k : I_2 \to \mathbb{R}$ are functions on intervals $I_1$ and $I_2$ of the real line. Suppose that
$$\sum_{k=1}^{n} f_k(a)\, g_k(b) \geq 0 \quad\text{for all } a \in I_1 \text{ and } b \in I_2.$$
Then, for each natural number $d$,
$$\sum_{k=1}^{n} \operatorname{tr}\big[f_k(A)\, g_k(B)\big] \geq 0 \quad\text{for all } A \in \mathbb{H}^d(I_1) \text{ and } B \in \mathbb{H}^d(I_2).$$
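As an illustration of how Proposition A.1 lifts a scalar inequality to matrices, consider the scalar fact a log a − a log b − a + b ≥ 0 for a, b > 0. Splitting it into four products f_k(a) g_k(b) and applying the proposition gives tr[A log A − A log B − A + B] ≥ 0 for positive-definite A and B. The sketch below evaluates this trace sum on random positive-definite matrices. The decomposition, the helper names, and the random test matrices are illustrative assumptions rather than material from the paper.

```python
import numpy as np

def matrix_function(f, A):
    """Standard matrix function of a symmetric matrix via eigendecomposition."""
    w, U = np.linalg.eigh(A)
    return (U * f(w)) @ U.T

def klein_trace_sum(fs, gs, A, B):
    """Compute the sum over k of tr[f_k(A) g_k(B)]."""
    return sum(np.trace(matrix_function(f, A) @ matrix_function(g, B))
               for f, g in zip(fs, gs))

# Scalar inequality: a*log(a) - a*log(b) - a + b >= 0, written as sum of f_k(a) * g_k(b).
fs = [lambda a: a * np.log(a), lambda a: -a, lambda a: -a, lambda a: np.ones_like(a)]
gs = [lambda b: np.ones_like(b), lambda b: np.log(b), lambda b: np.ones_like(b), lambda b: b]

def random_pd(rng, d):
    """Hypothetical random positive-definite test matrix."""
    G = rng.standard_normal((d, d))
    return G @ G.T + 0.1 * np.eye(d)

rng = np.random.default_rng(1)
values = [klein_trace_sum(fs, gs, random_pd(rng, 4), random_pd(rng, 4))
          for _ in range(20)]
```

The matrices A and B need not commute, which is the point of the proposition: only the scalar inequality has to be verified.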