On spectral distribution of sample covariance matrices from large dimensional and large $k$-fold tensor products

We study the eigenvalue distributions for sums of independent rank-one $k$-fold tensor products of large $n$-dimensional vectors. Previous results in the literature assume that $k=o(n)$ and show that the eigenvalue distributions converge to the celebrated Mar\v{c}enko-Pastur law under appropriate moment conditions on the base vectors. In this paper, motivated by quantum information theory, we study the regime where $k$ grows faster, namely $k=O(n)$. We show that the moment sequences of the eigenvalue distributions have a limit, which is different from the Mar\v{c}enko-Pastur law. As a byproduct, we show that the Mar\v{c}enko-Pastur law limit holds if and only if $k=o(n)$ for this tensor model. The approach is based on the method of moments.


Introduction
For $n \in \mathbb{N}$, let $\{\xi_1, \ldots, \xi_n\}$ be a family of i.i.d. centered complex random variables with unit variance. Denote $y = n^{-1/2}(\xi_1, \ldots, \xi_n)^\top$, let $\{y_\alpha^{(l)} : \alpha \in \mathbb{N}_+,\ 1 \le l \le k\}$ be i.i.d. copies of $y$, and consider
$$Y_\alpha = y_\alpha^{(1)} \otimes \cdots \otimes y_\alpha^{(k)}, \qquad M_{n,k,m} = \sum_{\alpha=1}^{m} \tau_\alpha Y_\alpha Y_\alpha^*, \tag{1.1}$$
where $\{\tau_\alpha : \alpha \in \mathbb{N}_+\}$ is a sequence of nonnegative coefficients. Therefore, each $n^k$-dimensional vector $Y_\alpha$ is a $k$-fold tensor product of $n$-dimensional i.i.d. vectors, and $M_{n,k,m}$ is a sum of $m$ independent rank-one Hermitian matrices of dimension $n^k$. The simplest case $k = 1$ and $\tau_\alpha \equiv 1$ was studied in the seminal paper [8], where the celebrated Marčenko–Pastur law was derived under appropriate moment conditions on the entries of the base vector $y$ in the limiting scheme $n \to \infty$, $m/n \to c > 0$. Many subsequent improvements of the Marčenko–Pastur law appeared in the literature, including [10], [2] and [9]. These papers treat the model (1.1) with a general sequence $\{\tau_\alpha : \alpha \in \mathbb{N}_+\}$. Notably, the latter paper extended the law to a broad family of $y$-vectors, called good vectors, which includes the current setting with i.i.d. components. Later, for the setting $\tau_\alpha \equiv 1$, necessary and sufficient conditions on $Y_1$ for the Marčenko–Pastur law to serve as the limiting spectral distribution (LSD) of $M_{n,1,m}$ were obtained in [11].
Recently, [6] considered the $k$-fold tensor model $M_{n,k,m}$ and established an LSD for its $n^k$ real-valued eigenvalues. For the special case $\tau_\alpha \equiv 1$, the LSD is exactly the Marčenko–Pastur law. A central limit theorem (CLT) was also established for a class of linear spectral statistics, following the approach of [7]. The main assumption in [6] is that the number of tensor factors $k$ is small compared with the space dimension $n$: precisely, $k/n \to 0$ is required for the validity of the LSD, while $k \equiv 2$ is required as $n \to \infty$ for the validity of the CLT. It is natural to consider the setting where $k$ is large. Since a larger $k$ means more dependence among the entries of the vector $Y_1$, one expects that the LSD of $M_{n,k,m}$ no longer obeys the Marčenko–Pastur law when $k$ is large enough. However, the martingale moment bound employed in [6] is less useful when $k$ is large, and that method cannot be directly extended to the case $k = O(n)$. The present paper deals with the case $k = O(n)$ for the model (1.1) and shows that the empirical spectral distribution (ESD) of $M_{n,k,m}$ with $\tau_\alpha \equiv 1$ converges to the Marčenko–Pastur law if and only if $k = o(n)$. Therefore, the vector $Y_1$ with $k = O(n)$ can be seen as a new example of a bad vector, for which the necessary and sufficient condition of [11] does not hold.
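For readers who wish to experiment with the model (1.1), here is a minimal simulation sketch (our own illustration, not from the paper; the helper names `sample_M` and `spectral_moment` are ours). It builds $M_{n,k,m}$ with $\tau_\alpha \equiv 1$ and standard complex Gaussian entries, and evaluates the normalized trace moments $n^{-k}\operatorname{Tr} M_{n,k,m}^p$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_M(n, k, m):
    """Sample M_{n,k,m} = sum_a Y_a Y_a^* with tau_a = 1, where each Y_a is a
    k-fold Kronecker product of independent vectors y = n^{-1/2}(xi_1,...,xi_n)
    with i.i.d. standard complex Gaussian entries."""
    N = n ** k
    M = np.zeros((N, N), dtype=complex)
    for _ in range(m):
        Y = np.ones(1, dtype=complex)
        for _ in range(k):
            xi = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
            Y = np.kron(Y, xi / np.sqrt(n))
        M += np.outer(Y, Y.conj())
    return M

def spectral_moment(M, p):
    """n^{-k} Tr M^p, the p-th moment of the empirical spectral distribution."""
    return np.trace(np.linalg.matrix_power(M, p)).real / M.shape[0]
```

For $k = 1$ and $m/n = c$, these moments should be close to the Marčenko–Pastur moments (e.g. $c$ and $c + c^2$ for $p = 1, 2$); taking $k$ comparable to $n$ lets one probe the $k = O(n)$ regime numerically.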
Another motivation for studying the model (1.1) comes from quantum information theory. In [1], the model $M_{n,k,m}$ was introduced as a quantum analog of the classical probability problem of allocating $r$ balls into $s$ bins. The random vector $Y_1$ was interpreted as a random product state in $(\mathbb{C}^n)^{\otimes k}$. When $k$ is fixed and $m = cn^k$ for some $c > 0$, [1] established the convergence in expectation of the normalized trace moments $n^{-k}\operatorname{Tr} M_{n,k,m}^p$, which coincide with the corresponding moments of the Marčenko–Pastur law. In quantum physics and quantum information theory, it is natural to investigate the behavior of a large number of quantum states. In [12], the quantum entanglement of structured random states was characterized, and the spectral density of the reduced state was studied when the number of quantum states $k$ is large. In [4], the asymptotic behavior of the average entropy of entanglement for elements of an ensemble of random states associated with a graph was studied when the dimension of the quantum subsystem is large.
This paper considers the scenario where $k$ grows to infinity quickly. Namely, we assume that there exist constants $c, d \in (0, \infty)$ such that
$$\frac{m}{n^k} \to c, \qquad \frac{k}{n} \to d, \qquad \text{as } n \to \infty. \tag{1.2}$$
Let us add one more word of motivation for the choice of the regime $k = O(n)$. In hindsight, this is a natural threshold in our quest for limit theorems, and from the point of view of probability theory, our result extends existing results. However, the need for large $k$ is real in quantum information theory. Indeed, a typical real-world scenario would be $n$ fixed (the dimension of the system) and $k$ going to infinity: the system can be used many times, i.e., one works with regularized quantities. At present, we do not know how to fix $n$ and send $k$ to infinity, so we take the regime (1.2) as our next model. Our approach is based on the method of moments. Under appropriate moment conditions on the base variable $\xi_1$ and the sequence of coefficients $\{\tau_\alpha\}$, we derive the limits of the spectral moments of $M_{n,k,m}$ under the limiting scheme (1.2). A striking consequence of our result is that, contrary to all previous results such as [6], the limiting spectral moments found here involve the fourth moment of the base variable $\xi_1$.

Almost sure convergence of spectral moments
For a fixed sequence $\alpha = (\alpha_1, \ldots, \alpha_p)$ and each value $t$ appearing in $\alpha$, we count its frequency by
$$\deg_t(\alpha) = \#\{1 \le u \le p : \alpha_u = t\}. \tag{2.1}$$
The main result of the paper is the following theorem.
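As a small concrete illustration (our own, with the hypothetical helper name `degrees`), the frequency count of the values appearing in a sequence is just a multiset tally:

```python
from collections import Counter

def degrees(alpha):
    """deg_t(alpha) for every value t appearing in alpha: the number of
    indices u with alpha_u = t, as in (2.1)."""
    return Counter(alpha)
```

For example, `degrees((1, 2, 1, 3, 2, 1))` maps $1 \mapsto 3$, $2 \mapsto 2$, $3 \mapsto 1$, and the frequencies always sum to the length $p$ of the sequence.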
Theorem 2.1. Assume $d > 0$ and that the following moment conditions hold:
1. For all $p \in \mathbb{N}$, the $p$-th moment of $\xi_1$ is finite.
2. For all $q \in \mathbb{N}$, the empirical $q$-th moments of the coefficient sequence $\{\tau_\alpha\}$ converge.
Then:
(i) The expected normalized trace moments $n^{-k}\mathbb{E}\operatorname{Tr} M_{n,k,m}^p$ converge to the limit given in (2.3), where $C^{(1)}_{s,p}$ is a special class of graphs that will be defined in the course of the proof (see Lemma 3.1).
(ii) For all fixed $p \in \mathbb{N}_+$, if $k \ge 2$, the variance of $n^{-k}\operatorname{Tr} M_{n,k,m}^p$ admits a summable bound.
In particular, (i) and (ii) imply that $n^{-k}\operatorname{Tr} M_{n,k,m}^p$ converges almost surely to the limit given on the right-hand side of (2.3).
The proof of the theorem is given in Section 3.
Remark 2.1. Here we require the random variable $\xi_1$ to have finite moments of every order. For most matrix models, such moment conditions can be removed by a standard truncation argument. However, the centralization step in the truncation argument fails for our matrix model. More precisely, let $\hat{y}^{(l)}_\alpha$ be the truncated vector for all $1 \le \alpha \le m$ and $1 \le l \le k$. Then the difference in (2.4) can be written as a telescoping sum of $k$ terms, each of which has rank at most $n^{k-1}$. Thus the $n^k \times m$ matrix, whose $\alpha$-th column is (2.4), has rank at most $kn^{k-1}$. Hence, by [3, Theorem A.44], the sup-norm of the difference of the cumulative distribution functions in the centralization step does not exceed $k/n$, which is not negligible when $d > 0$.
Let $\theta = e^{d(m_4 - 1)}$, where $m_4$ denotes the fourth moment of $\xi_1$. The limit moments in (2.3), say $\gamma_p$, are polynomial functions of $\theta$, and the first four can be written out explicitly. For higher exponents $p$, however, some computing code is needed to find an explicit expression for $\gamma_p$.
Remark 2.2. The limiting moment sequence $(\gamma_p)$ grows to infinity extremely fast with $p$. To see this, let $\tau_\alpha \equiv 1$. Then $\gamma_p$ is bounded from below by the first term ($s = 1$) in (2.3), which is $c\theta^{p(p-1)/2}$. Thus Carleman's condition, that is, $\sum_{p} \gamma_p^{-1/(2p)} = \infty$, is not satisfied (see also [5]). In particular, it is not clear whether the moment sequence $(\gamma_p)$ uniquely determines a limiting distribution. However, by the convergence of the moments, we know that the sequence of eigenvalue distributions is tight (almost surely).
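To see quantitatively how the lower bound $c\theta^{p(p-1)/2}$ defeats Carleman's condition, here is a small sketch (our own illustration; the bound on the Carleman terms follows from Remark 2.2, and the sample values $c = 1$, $\theta = e$ are chosen by us):

```python
import math

def carleman_term_upper_bound(p, c=1.0, theta=math.e):
    """Upper bound on gamma_p^{-1/(2p)} obtained from the lower bound
    gamma_p >= c * theta^{p(p-1)/2} of Remark 2.2."""
    log_gamma_lb = math.log(c) + 0.5 * p * (p - 1) * math.log(theta)
    return math.exp(-log_gamma_lb / (2 * p))  # = c^{-1/(2p)} * theta^{-(p-1)/4}

# The upper bounds decay geometrically (ratio theta^{-1/4}), so their sum
# converges, and hence the Carleman series cannot diverge.
partial = sum(carleman_term_upper_bound(p) for p in range(1, 500))
```

With $c = 1$, $\theta = e$, the partial sum stabilizes near $1/(1 - e^{-1/4}) \approx 4.52$, confirming that the series of Carleman terms is finite.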
As a byproduct of our moment method, we give an alternative derivation of the limiting spectral distribution in the case $d = 0$, a result already obtained in [6] using the Stieltjes transform method. The corresponding proposition states that, under $d = 0$ and the moment conditions above: (i) the expected normalized trace moments $n^{-k}\mathbb{E}\operatorname{Tr} M_{n,k,m}^p$ converge to the limit given in (2.5). Here $C_{s,p}$ is a special class of graphs defined later in Lemma 3.1.
(ii) For all fixed $p \in \mathbb{N}_+$, if $k \ge 2$, the same variance bound as in Theorem 2.1 holds. In particular, (i) and (ii) imply that $n^{-k}\operatorname{Tr} M_{n,k,m}^p$ converges almost surely to the limit given on the right-hand side of (2.5).
The proof of the proposition is given in Appendix A. Finally, comparing the two cases $d > 0$ and $d = 0$, it is worth noticing that the fourth moment $m_4$ contributes to the limiting spectral moments only in the case $d > 0$ (Theorem 2.1).

Proof of Theorem 2.1
Before proceeding to the proof, we introduce some preliminaries on graph combinatorics. We first introduce some terminology and notation from graph theory. Denote by $[m]$ the set of integers in the closed interval $[1, m]$. We call $\alpha = (\alpha_1, \ldots, \alpha_p) \in [m]^p$ a sequence of length $p$ with vertices $\alpha_j$ for $1 \le j \le p$. We denote by $|\alpha|$ the number of distinct elements of $\alpha$. If $s = |\alpha|$, then we call $\alpha$ an $s$-sequence. Two sequences are equivalent if one becomes the other under a suitable permutation of $[m]$. The sequence $\alpha$ is canonical if $\alpha_1 = 1$ and $\alpha_u \le \max\{\alpha_1, \ldots, \alpha_{u-1}\} + 1$ for $u \ge 2$. We denote by $C_{s,p}$ the set of all canonical $s$-sequences of length $p$, and by $J_{s,p}(m)$ the set of all $s$-sequences $\alpha \in [m]^p$. Given $\alpha$ and $i_1 = (i_1^{(1)}, \ldots, i_1^{(p)})$, we draw a down edge from $\alpha_u$ to $i_1^{(u)}$ and an up edge from $i_1^{(u)}$ to $\alpha_{u+1}$ for $1 \le u \le p$; here we use the convention that $\alpha_{p+1} = \alpha_1$. We call the down edge from $\alpha_u$ to $i_1^{(u)}$ a down innovation if $i_1^{(u)}$ is different from $i_1^{(1)}, \ldots, i_1^{(u-1)}$. We also call the up edge from $i_1^{(u)}$ to $\alpha_{u+1}$ an up innovation if $\alpha_{u+1}$ is different from $\alpha_1, \ldots, \alpha_u$. We denote the resulting graph by $g(i_1, \alpha)$ and call it a $\Delta(p; \alpha)$-graph. Two graphs $g(i_1, \alpha)$ and $g(i_1', \alpha)$ are said to be equivalent if the two sequences $i_1$ and $i_1'$ are equivalent; we write $g(i_1, \alpha) \sim g(i_1', \alpha)$ if they are equivalent. For each equivalence class, we choose the canonical graph, i.e., the one whose sequence $i_1 = (i_1^{(1)}, \ldots, i_1^{(p)})$ is a canonical $r$-sequence for some $r \in \mathbb{N}_+$. A canonical $\Delta(p; \alpha)$-graph is denoted by $\Delta(p, r, s; \alpha)$ if it has $r$ noncoincident $i$-vertices and $s$ noncoincident $\alpha$-vertices. We classify the $\Delta(p, r, s; \alpha)$-graphs into the following five categories.
$\Delta_1(p, s; \alpha)$: $\Delta(p; \alpha)$-graphs in which each down edge coincides with one and only one up edge. If we glue the coincident edges, the resulting graph is a tree with $p$ edges and $p + 1$ vertices, which implies $r + s = p + 1$.
$\Delta_2(p, r, s; \alpha)$: $\Delta(p; \alpha)$-graphs that contain at least one single edge.
$\Delta_3(p, s; \alpha)$: $\Delta(p; \alpha)$-graphs in which the number of edges between any two vertices is 0 or 2. If we glue the coincident edges, the resulting graph is a connected graph with exactly one cycle; it has $p$ edges and $p$ vertices, which implies $r + s = p$.
$\Delta_4(p, s; \alpha)$: $\Delta(p; \alpha)$-graphs that have two up edges and two down edges with the same endpoints, while every other down edge coincides with one and only one up edge. If we glue the coincident edges, the resulting graph is a tree with $p - 1$ edges and $p$ vertices, implying $r + s = p$.
$\Delta_5(p, r, s; \alpha)$: $\Delta(p; \alpha)$-graphs that do not belong to any of the categories above. In this case, the graph has at most $p - 1$ vertices, since the in-degree equals the out-degree and is at least 2 for every vertex; thus $r + s \le p - 1$.
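Canonical sequences are exactly what combinatorialists call restricted growth strings, so the number of canonical $s$-sequences of length $p$ equals the Stirling number of the second kind $S(p, s)$ (this identification is our own observation, not stated in the paper). A quick enumeration check:

```python
from math import comb, factorial

def canonical_sequences(p):
    """Enumerate all canonical sequences of length p (restricted growth
    strings): alpha_1 = 1 and alpha_u <= max(alpha_1..alpha_{u-1}) + 1."""
    seqs = [(1,)]
    for _ in range(p - 1):
        seqs = [s + (v,) for s in seqs for v in range(1, max(s) + 2)]
    return seqs

def stirling2(p, s):
    """Stirling number of the second kind S(p, s), via inclusion-exclusion."""
    return sum((-1) ** j * comb(s, j) * (s - j) ** p for j in range(s + 1)) // factorial(s)
```

For instance, for $p = 4$ there are $15$ canonical sequences in total (the Bell number $B_4$), of which $S(4,2) = 7$ are $2$-sequences and $S(4,3) = 6$ are $3$-sequences.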
Next, we determine the number of sequences $i_1 \in C_{p+1-s,p}$ such that $g(i_1, \alpha) \in \Delta_1(p, s; \alpha)$ for a given sequence $\alpha \in C_{s,p}$. We have the following lemma, which is motivated by [3, Lemma 3.4].
Lemma 3.1. For any sequence $\alpha \in C_{s,p}$, there is at most one sequence $i_1 \in C_{p+1-s,p}$ such that $g(i_1, \alpha) \in \Delta_1(p, s; \alpha)$. We denote by $C^{(1)}_{s,p}$ the set of those $\alpha \in C_{s,p}$ for which such a sequence exists.
Proof. We define a pair of characteristic sequences $\{u_1, \ldots, u_p\}$ and $\{d_1, \ldots, d_p\}$, where $u_l$ indicates whether the $l$-th up edge is an up innovation, and $d_l$ indicates whether the $l$-th down edge coincides with a previous up innovation. By definition, we always have $u_p = 0$, and since $\alpha_1 = 1$, we always have $d_1 = 0$. For a graph belonging to $\Delta_1(p, s)$, there are exactly $s - 1$ up innovations, and hence there are $s - 1$ $u$-variables equal to $1$ and $s - 1$ $d$-variables equal to $-1$. From the definition, one sees that $d_l = -1$ means that after plotting the $l$-th down edge $(\alpha_l, i_1^{(l)})$, the future path never revisits the vertex $\alpha_l$, which means that the edge $(\alpha_l, i_1^{(l)})$ must coincide with the up innovation leading to the vertex $\alpha_l$. Since there are $r = p + 1 - s$ down innovations leading out of the $r$ $i_1$-vertices, $d_l = 0$ implies that the edge $(\alpha_l, i_1^{(l)})$ must be a down innovation. Therefore, each $d_l = -1$ must follow some $u_j = 1$ with $j < l$, which leads to the restriction (3.4) on the pair of characteristic sequences. Next, we show that each pair of characteristic sequences satisfying (3.4) defines a graph in $\Delta_1(p, s)$ uniquely.
Firstly, we have $i_1^{(1)} = 1$. We use induction to determine the unique graph in $\Delta_1(p, s)$. Suppose that the first $l$ pairs of down and up edges are uniquely determined by the two sequences $\{u_1, \ldots, u_l\}$ and $\{d_1, \ldots, d_l\}$, and that the subgraph formed by the first $l$ pairs of down and up edges satisfies the following properties.
(a1) The subgraph is connected and, with orientation removed, its noncoincident edges form a tree.
(a2) If $\alpha_{l+1} = 1$, then each down edge coincides with an up edge, which means that the subgraph has no single innovations.
(a3) If $\alpha_{l+1} \ne 1$, then from the vertex $\alpha_1$ to $\alpha_{l+1}$ there is exactly one path (a chain without cycles) of down-up-down-up single innovations, and all other down edges coincide with an up edge.
We consider the following four cases to determine the (l + 1)-th pair of down and up edges.
Case 1. $d_{l+1} = 0$ and $u_{l+1} = 1$. Then both edges of the $(l+1)$-th pair are innovations, which determines them uniquely: the down edge leads to a new $i$-vertex and the up edge leads to a new $\alpha$-vertex.
Case 2. $d_{l+1} = 0$ and $u_{l+1} = 0$. The down edge $(\alpha_{l+1}, i_1^{(l+1)})$ is an innovation, so $i_1^{(l+1)}$ is a new vertex, while the up edge $(i_1^{(l+1)}, \alpha_{l+2})$ is not an innovation, so $\alpha_{l+2}$ coincides with some vertex $\alpha_j$ for $1 \le j \le l + 1$. If $\alpha_j \ne \alpha_{l+1}$, then there is a path $i_1^{(l+1)} \to \alpha_j$; note also that there should be a path connecting $\alpha_j$ and $i_1^{(l+1)}$ in the subgraph of the first $l$ pairs of down and up edges. The two paths of undirected noncoincident edges would form a cycle, which is a contradiction. Hence $\alpha_{l+2} = \alpha_j = \alpha_{l+1}$, and the $(l+1)$-th up edge coincides with the $(l+1)$-th down edge. After adding the two edges, the subgraph formed by the first $l + 1$ pairs of down and up edges satisfies properties (a1)–(a3).
Case 3. $d_{l+1} = -1$ and $u_{l+1} = 1$. By (3.4), there is at least one vertex in $\{\alpha_2, \ldots, \alpha_{l+1}\}$ other than $\alpha_1$ that will be visited during the last $p - (l + 1)$ pairs of down and up edges. Thus, by properties (a1) and (a2), we have $\alpha_{l+1} \ne \alpha_1$. Hence there must be a single up innovation leading to the vertex $\alpha_{l+1}$, and the down edge starting from $\alpha_{l+1}$ coincides with this up innovation. Besides, the $(l+1)$-th up edge is an innovation starting from $i_1^{(l+1)}$.
Case 4. $d_{l+1} = -1$ and $u_{l+1} = 0$. The down edge coincides with the only up innovation ending at $\alpha_{l+1}$. Before this up innovation, there must be a single down innovation by property (a3), and the up edge can then be drawn to coincide with this down innovation. If the path of single innovations in the subgraph of the first $l$ pairs of down and up edges has only one pair of down-up innovations, then $\alpha_{l+2} = 1$, and hence the subgraph of the first $l + 1$ pairs has no single innovations, which implies that properties (a1) and (a2) hold. Otherwise, $\alpha_{l+2} \ne 1$ and the subgraph of the first $l + 1$ pairs satisfies properties (a1) and (a3).

Proof of (i)
We use the indices $\alpha, \beta$ for the columns of $Y$, which range over $[m]$, and the multi-index $i = (i_1, \ldots, i_k)$ for the rows of $Y$, which ranges over $[n^k]$ values. For any $p \in \mathbb{N}$, we expand the moment $n^{-k}\mathbb{E}\operatorname{Tr} M_{n,k,m}^p$ as a sum over index sequences, with the convention that $i^{(p+1)} = i^{(1)}$. An observation is that $E(i_1, \alpha) = E(i_1', \alpha')$ if the two sequences $i_1$ and $\alpha$ are equivalent to $i_1'$ and $\alpha'$, respectively. Hence, by (3.5) and (3.1), we obtain the representation (3.7). We first compute the sum over $i_1$ in (3.7). For any sequence $\alpha \in C_{s,p}$, we split the sum according to the category of the graph $g(i_1, \alpha)$, as in (3.8), and deal with the four terms on its right-hand side one by one. For $i_1 \in C_{r,p}$ with $g(i_1, \alpha) \in \Delta_1(p, s; \alpha)$, by the definition of $\Delta_1(p, s; \alpha)$ we have $E(i_1, \alpha) = n^{-p}$ and $p = r + s - 1$; this gives (3.9). For $i_1 \in C_{r,p}$ with $g(i_1, \alpha) \in \Delta_3(p, s; \alpha)$, by the definition of $\Delta_3(p, s; \alpha)$ we have $E(i_1, \alpha) = n^{-p}$ and $p = r + s$; this gives (3.10). For $i_1 \in C_{r,p}$ with $g(i_1, \alpha) \in \Delta_4(p, s; \alpha)$, by the definition of $\Delta_4(p, s; \alpha)$ we have $E(i_1, \alpha) = n^{-p} m_4$ and $p = r + s$; this gives (3.11). For $i_1 \in C_{r,p}$ with $g(i_1, \alpha) \in \Delta_5(p, r, s; \alpha)$, by the definition of $\Delta_5(p, r, s; \alpha)$ we have $E(i_1, \alpha) = O(n^{-p})$ and $p \ge r + s + 1$; this gives (3.12). Substituting (3.9), (3.10), (3.11) and (3.12) into (3.8), and then using (3.13) and (3.14), where the last equality follows from Lemma 3.1, the sum (3.7) can be rewritten accordingly. Recall the definition of $\deg_t(\alpha)$ given in (2.1): the sequence $\alpha$ has exactly $\deg_t(\alpha)$ vertices equal to $t$. The following lemma computes the size of $\Delta_4(p, s; \alpha)$ for given $\alpha \in C^{(1)}_{s,p}$; the count is given by (3.16).
Proof. We sort the graphs in $\Delta_4(p, s; \alpha)$ into two classes. We will show that graphs in $\Delta_4(p, s; \alpha)$ can be transformed into graphs in $\Delta_1(p, s; \alpha)$ by splitting the vertex of the sequence $i_1$ associated with the multiple edges.
The first class consists of graphs whose first $l - 1$ up edges are distinct and whose $l$-th down edge $(\alpha_l, i_1^{(l)})$ coincides with the $j$-th down edge for some $1 \le j < l \le p$. Hence $\alpha_l = \alpha_j$ and $i_1^{(l)} = i_1^{(j)}$. Since the subgraph of the path from $\alpha_j$ to $\alpha_l$ does not contain a cycle, one can find $1 \le j' < l$ such that the $j'$-th up edge leads to $\alpha_l$, while the up edge coinciding with $(\alpha_l, i_1^{(l)})$ belongs to the set of the last $p - (l - 1)$ up edges. Hence the edge $(\alpha_l, i_1^{(l)})$ is not a down innovation. Now we split the vertex $i_1^{(l)}$ into two vertices $i_1^{(l,1)}$ and $i_1^{(l,2)}$: the $l$-th down edge connects $\alpha_l$ to $i_1^{(l,2)}$, which is a down innovation. The rest of the edges can be plotted so that the new graph belongs to $\Delta_1(p, s; \alpha)$. See the graph below.
The second class consists of graphs whose first $l$ down edges are distinct and whose $l$-th up edge $(i_1^{(l)}, \alpha_{l+1})$ coincides with the $j$-th up edge for some $1 \le j < l \le p - 1$. This means that $i_1^{(l)} = i_1^{(j)}$ and $\alpha_{l+1} = \alpha_{j+1}$. Since the subgraph of the path from $i_1^{(j)}$ does not contain a cycle, one can find $j < j' \le l$ such that the $j'$-th down edge goes from $\alpha_{l+1}$, which means that $\alpha_{j'} = \alpha_{l+1}$ and the $j'$-th down edge is not a down innovation. We split the vertex $i_1^{(l)}$ into two vertices $i_1^{(l,1)}$ and $i_1^{(l,2)}$: the $j'$-th down edge now connects $\alpha_{l+1}$ to $i_1^{(l,2)}$, which is a down innovation, and the $l$-th up edge starts from $i_1^{(l,2)}$. The rest of the edges can be plotted so that the new graph belongs to $\Delta_1(p, s; \alpha)$. See the graph below. Therefore, any graph in $\Delta_4(p, s; \alpha)$ can be transformed into a graph in $\Delta_1(p, s; \alpha)$ by splitting the vertex of the sequence $i_1$ associated with the multiple edges. Equivalently, any graph in $\Delta_4(p, s; \alpha)$ can be obtained from a graph in $\Delta_1(p, s; \alpha)$ by gluing two vertices of the sequence $i_1$ that have the same neighborhood. Moreover, gluing different pairs of vertices of the sequence $i_1$ leads to different graphs in $\Delta_4(p, s; \alpha)$.

For fixed $\alpha \in C^{(1)}_{s,p}$, by Lemma 3.1 there exists a unique sequence $i_1$ such that the graph $g(i_1, \alpha) \in \Delta_1(p, s; \alpha)$. For the vertex $t$ on the $\alpha$-line, an observation is that the number of its neighbors on the $i$-line is $\deg_t(\alpha)$. Hence the number of ways to glue two vertices of the sequence $i_1$ with the same neighbor $t$ on the $\alpha$-line is $\binom{\deg_t(\alpha)}{2}$. Thus, for fixed $\alpha \in C^{(1)}_{s,p}$, the number of $i_1 \in C_{r,p}$ such that $g(i_1, \alpha) \in \Delta_4(p, s; \alpha)$ is given by (3.16).
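The count of $\Delta_4(p, s; \alpha)$ graphs can be evaluated directly from the frequencies $\deg_t(\alpha)$. The following sketch is our own reading of (3.16) as $\sum_t \binom{\deg_t(\alpha)}{2}$, which should be checked against the paper's display:

```python
from collections import Counter
from math import comb

def delta4_count(alpha):
    """Sum over values t of binom(deg_t(alpha), 2): the number of ways to
    glue two i-vertices sharing the same alpha-neighbour t (our reading of
    the count (3.16))."""
    return sum(comb(d, 2) for d in Counter(alpha).values())
```

For example, for $\alpha = (1, 2, 1, 3, 2, 1)$ the degrees are $3, 2, 1$, giving $\binom{3}{2} + \binom{2}{2} + \binom{1}{2} = 4$ possible gluings.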
Proof. We argue by contradiction. Suppose not; then there exists a $w \in C_{p+1-s,p}$ such that $g(w, \alpha) \in \Delta_1(p, s; \alpha)$. Note that the path from $\alpha_{j_1}$ to $\alpha_{j_2}$ and the path from $\alpha_{j_1'}$ to $\alpha_{j_2'}$ are two paths from the vertex on the $\alpha$-line labeled $t_1$ to the vertex on the $\alpha$-line labeled $t_2$. The two paths would lead to multiple directed edges or an undirected cycle, which contradicts the definition of $\Delta_1(p, s; \alpha)$.
The following corollary is a direct consequence of Lemma 3.3.
For $\alpha \in C_{s,p} \setminus C^{(1)}_{s,p}$, graphs in $\Delta_3(p, s; \alpha)$ are the $\Delta(p; \alpha)$-graphs in which each down edge coincides with one and only one up edge. If we glue the coincident edges, the resulting graph is a connected graph with exactly one cycle.
The following lemma computes the size of $\Delta_3(p, s; \alpha)$ for given $\alpha$; the count is given by (3.17).
Proof. Recall that each down edge of a graph in $\Delta_3(p, s; \alpha)$ coincides with exactly one up edge, and that the graph is connected with exactly one cycle once coincident edges are glued. We again sort the graphs in $\Delta_3(p, s; \alpha)$ into two classes, and show that graphs in $\Delta_3(p, s; \alpha)$ can be transformed into graphs in $\Delta_1(p, s; \alpha)$ by splitting an appropriate vertex of the sequence $i_1$.
The first class consists of graphs whose subgraph formed by the first $l - 1$ up and down edges has no cycle when orientation is removed and multiple edges are identified, but which acquires a cycle when the $l$-th down edge is added. Hence $3 \le l \le p$, and both $\alpha_l$ and $i_1^{(l)}$ already appear in the subgraph. We split the vertex $i_1^{(l)}$ into two vertices $i_1^{(l,1)}$ and $i_1^{(l,2)}$: the $l$-th down edge is replaced by $(\alpha_l, i_1^{(l,2)})$, which is a down innovation, and the $l$-th up edge starts from $i_1^{(l,2)}$. The rest of the edges can be plotted so that the new graph belongs to $\Delta_1(p, s; \alpha)$. See the graph below.
The second class consists of graphs whose subgraph formed by the first $l - 1$ up edges and the first $l$ down edges has no cycle when orientation is removed and multiple edges are identified, but which acquires a cycle when the $l$-th up edge is added; the construction is symmetric to that of the first class. Summing the contributions of the two classes yields (3.17), where an elementary counting identity is used in the last equality.

Proof of (ii)
For any $p \in \mathbb{N}$, we compute the variance of the moment $n^{-k}\operatorname{Tr} M_{n,k,m}^p$. We use the convention that $i^{(p+1)} = i^{(1)}$ and $j^{(p+1)} = j^{(1)}$. Recalling the model (1.1), we can write the variance as a double sum over pairs of index sequences.

A  The case d = 0
The variance calculation (ii) is identical to the one given in the proof of Theorem 2.1, see Section 3.2.It remains to establish the moment limit (i).

From the definition above, one can see that the set of distinct vertices of a canonical $s$-sequence is $[s]$. Denote by $I_{s,m}$ the set of injective maps from $[s]$ to $[m]$. For a canonical $s$-sequence $\alpha$ and a map $\varphi \in I_{s,m}$, we call $\varphi(\alpha)$ the $s$-sequence $(\varphi(\alpha_1), \ldots, \varphi(\alpha_p))$. For each canonical $s$-sequence, its images under the maps in $I_{s,m}$ give all its equivalent sequences, and hence its equivalence class of sequences in $[m]^p$ has exactly $m(m-1)\cdots(m-s+1)$ distinct elements. Let $\alpha = (\alpha_1, \ldots, \alpha_p) \in [m]^p$ be a fixed canonical $s$-sequence. For $i_1 = (i_1^{(1)}, \ldots, i_1^{(p)})$, draw two parallel lines, referred to as the $\alpha$-line and the $i$-line. Plot $i_1^{(1)}, \ldots, i_1^{(p)}$ on the $i$-line and $\alpha_1, \ldots, \alpha_p$ on the $\alpha$-line. Draw $p$ down edges from $\alpha_u$ to $i_1^{(u)}$ and $p$ up edges from $i_1^{(u)}$ to $\alpha_{u+1}$ for $1 \le u \le p$. We denote by $C^{(1)}_{s,p}$ the set of such canonical sequences $\alpha$; the number of its elements is used in the limit formulas.
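The down/up edge construction on the two lines can be coded directly. The following sketch (our own helper names) builds the edge lists of $g(i_1, \alpha)$ and tests the defining property of $\Delta_1$: the down and up edges coincide pairwise as undirected edges, and the glued graph is a tree, checked via $r + s = p + 1$:

```python
from collections import Counter

def delta_graph_edges(i_seq, alpha):
    """Down edges (alpha_u -> i^(u)) and up edges (i^(u) -> alpha_{u+1}),
    with the convention alpha_{p+1} = alpha_1.  Vertices are tagged so that
    the alpha-line and the i-line stay distinct."""
    p = len(alpha)
    down = [(('a', alpha[u]), ('i', i_seq[u])) for u in range(p)]
    up = [(('i', i_seq[u]), ('a', alpha[(u + 1) % p])) for u in range(p)]
    return down, up

def is_delta1(i_seq, alpha):
    """Delta_1 test: every down edge coincides with exactly one up edge
    (as undirected edges, with multiplicity), and r + s = p + 1, so the
    glued graph is a tree."""
    down, up = delta_graph_edges(i_seq, alpha)
    if Counter(frozenset(e) for e in down) != Counter(frozenset(e) for e in up):
        return False
    return len(set(i_seq)) + len(set(alpha)) == len(alpha) + 1
```

For instance, with $p = 2$, the pair $i_1 = (1, 1)$, $\alpha = (1, 2)$ yields a $\Delta_1$ graph (a two-edge star), while $i_1 = (1, 2)$, $\alpha = (1, 2)$ produces single edges and hence falls into $\Delta_2$.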

Remark 3.1. There exist sequences $\alpha \in C_{s,p} \setminus C^{(1)}_{s,p}$.

In the splitting operation, the first $l - 1$ pairs of up and down edges incident to $i_1^{(l)}$ are redrawn to connect to the new vertex $i_1^{(l,1)}$, and the $l$-th down edge $(\alpha_l, i_1^{(l)})$ is replaced by the new down edge $(\alpha_l, i_1^{(l,2)})$.

where we use $k = o(n)$ in the third equality. Letting $m, n, k \to \infty$ as in (1.2) with $d = 0$, in the case $\tau_1 = \tau_2 = \cdots = 1$, by Lemma 3.1 we have
$$\lim_{n \to \infty} n^{-k}\,\mathbb{E}\operatorname{Tr} M_{n,k,m}^p = \sum_{s=1}^{p} \bigl|C^{(1)}_{s,p}\bigr|\, c^s,$$
which is the $p$-th moment of the Marchenko–Pastur law.
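The $p$-th moment of the Marchenko–Pastur law mentioned here has the standard closed form $\sum_{s=1}^p N(p,s)\,c^s$, where $N(p,s) = \frac{1}{p}\binom{p}{s}\binom{p}{s-1}$ are the Narayana numbers. A quick numerical cross-check (our own sketch; matching these coefficients against the counts entering (2.5) is left to the reader):

```python
from math import comb

def narayana(p, s):
    """Narayana number N(p, s) = (1/p) * C(p, s) * C(p, s-1)."""
    return comb(p, s) * comb(p, s - 1) // p

def mp_moment(p, c):
    """p-th moment of the Marchenko-Pastur law with ratio c:
    sum_{s=1}^p N(p, s) * c^s."""
    return sum(narayana(p, s) * c ** s for s in range(1, p + 1))

def catalan(p):
    """Catalan number C_p = binom(2p, p) / (p + 1)."""
    return comb(2 * p, p) // (p + 1)
```

For $c = 1$ the moments reduce to the Catalan numbers (the semicircle-square moments), e.g. $1, 2, 5, 14$ for $p = 1, \ldots, 4$, and for general $c$ the first two moments are $c$ and $c + c^2$.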