Sparse random tensors: concentration, regularization and applications

We prove a non-asymptotic concentration inequality for sparse inhomogeneous random tensors under the spectral norm. For an order-$k$ inhomogeneous random tensor $T$ with sparsity $p_{\max}\geq \frac{c\log n}{n }$, we show that $\|T-\mathbb E T\|=O(\sqrt{n p_{\max}}\log^{k-2}(n))$ with high probability. The optimality of this bound, up to the logarithmic factor, is shown by an information-theoretic lower bound. By tensor matricization, we extend the range of sparsity to $p_{\max}\geq \frac{c\log n}{n^{k-1}}$ and obtain $\|T-\mathbb E T\|=O(\sqrt{n^{k-1} p_{\max}})$ with high probability. We also provide a simple way to regularize $T$ such that $O(\sqrt{n^{k-1}p_{\max}})$ concentration still holds down to sparsity $p_{\max}\geq \frac{c}{n^{k-1}}$. We present two applications of our concentration and regularization results: (i) a randomized construction of hypergraphs of bounded degree with good expander mixing properties, and (ii) concentration of sparsified tensors under uniform sampling.


Introduction
Tensors have become popular data formats in machine learning and network analysis. Statistical models on tensors and the related algorithms have been widely studied in the last ten years, including tensor decomposition [1,22], tensor completion [27,39], tensor sketching [52,40], tensor PCA [44,13,3], and community detection on hypergraphs [29,23]. This raises an urgent demand for random tensor theory, especially concentration inequalities from a non-asymptotic point of view. There are several concentration results for sub-Gaussian random tensors [45] and Gaussian tensors [4,44,40]. Recently, concentration inequalities for rank-one tensors were also studied in [49], with an application to the capacity of polynomial threshold functions [5]. In many applications in data science, the sparsity of the random tensor is an important aspect. However, there are only a few results for the concentration of order-3 sparse random tensors [27,33], and not much is known for general order-$k$ sparse random tensors.
Inspired by discrepancy properties in random hypergraph theory, we prove concentration inequalities for sparse random tensors measured in the tensor spectral norm. To simplify our presentation, we focus on real-valued order-$k$ $n\times\cdots\times n$ tensors, while the results can be extended to tensors with other dimensions. We denote the set of these tensors by $\mathbb{R}^{n^k}$. We first define the Frobenius inner product and spectral norm for tensors.

Definition 1.1 (Frobenius inner product and spectral norm). For order-$k$ $n\times\cdots\times n$ tensors $T$ and $A$, the Frobenius inner product is defined by the sum of entrywise products:
$$\langle T, A\rangle := \sum_{i_1,\dots,i_k=1}^n t_{i_1,\dots,i_k}\, a_{i_1,\dots,i_k},$$
with the Frobenius norm $\|T\|_F := \sqrt{\langle T,T\rangle}$, and the spectral norm of $T$ is defined by
$$\|T\| := \sup_{\|x_1\|_2=\cdots=\|x_k\|_2=1} |\langle T, x_1\otimes\cdots\otimes x_k\rangle|.$$

Theorem 1.2. Let $k\ge 2$ be fixed and let $T$ be an order-$k$ inhomogeneous random tensor with independent Bernoulli entries whose means are bounded by $p$. Assume $p\ge\frac{c\log n}{n}$ for some constant $c>0$. Then for any $r>0$, there is a constant $C>0$ depending only on $r,c,k$ such that $\|T-\mathbb{E}T\|\le C\sqrt{np}\log^{k-2}(n)$ with probability at least $1-n^{-r}$.
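As a quick numerical illustration of Definition 1.1 (not part of the proofs), the multilinear form and the norms can be computed with `numpy`. The tensor shape, the single homogeneous sampling probability, and the random-search routine below are illustrative choices; random search only certifies a lower bound on the spectral norm, which is NP-hard to compute exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
# Bernoulli tensor with one sampling probability p; the inhomogeneous case
# would use entrywise probabilities p_{i1,...,ik} instead.
T = rng.binomial(1, 0.3, size=(n,) * k).astype(float)

def multilinear_form(T, xs):
    # T(x1, x2, x3) = <T, x1 (x) x2 (x) x3> = sum_{i,j,l} T[i,j,l] x1[i] x2[j] x3[l]
    return np.einsum('ijl,i,j,l->', T, *xs)

def spectral_norm_lower_bound(T, trials=500):
    # ||T|| is a supremum over unit vectors; random trials certify a lower bound only.
    best = 0.0
    for _ in range(trials):
        xs = [rng.standard_normal(n) for _ in range(3)]
        xs = [x / np.linalg.norm(x) for x in xs]
        best = max(best, abs(multilinear_form(T, xs)))
    return best

lb = spectral_norm_lower_bound(T)
fro = np.linalg.norm(T)  # Frobenius norm sqrt(<T, T>)
assert lb <= fro + 1e-9  # consistent with ||T|| <= ||T||_F (Lemma 2.1)
```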
We show that the high probability bound in Theorem 1.2 is optimal up to the logarithmic term in the minimax sense.

Theorem 1.3. Suppose we observe a random tensor $T$ such that $\mathbb{E}T=\theta$ for $\theta\in[0,p]^{n^k}$, where $p\in(0,1]$ and $n\ge16$. Then there exist constants $c_1, c_2>0$ depending only on $k$ such that
$$\inf_{\hat\theta}\sup_{\theta\in[0,p]^{n^k}}\mathbb{P}\left(\|\hat\theta(T)-\theta\|\ge c_1\sqrt{np}\right)\ge c_2,$$
where the infimum is taken over all functions $\hat\theta:\mathbb{R}^{n^k}\to\mathbb{R}^{n^k}$, $T\mapsto\hat\theta(T)$. In particular, if $p\ge\frac{c\log n}{n}$, then there exists a constant $c_3>0$ depending only on $k$ and $c$ such that
$$\inf_{\hat\theta}\sup_{\theta\in[0,p]^{n^k}}\mathbb{E}\|\hat\theta(T)-\theta\|\ge c_3\sqrt{np}.$$
This theorem implies that if we want to preserve the high probability result in Theorem 1.2, $\sqrt{np}\log^{k-2}(n)$ cannot be replaced by terms of order $o(\sqrt{np})$. Hence, the upper bound is tight when $k=2$ and tight up to a logarithmic term when $k>2$. More generally, even if we consider all functions $\hat\theta:\mathbb{R}^{n^k}\to\mathbb{R}^{n^k}$, $T\mapsto\hat\theta(T)$, the deviation $\hat\theta(T)-\mathbb{E}T$ has no high probability bound tighter than $O(\sqrt{np})$.
Taking $k=2$ in Theorem 1.2 together with tensor matricization (see Definition 2.2), we obtain a concentration inequality down to sparsity $p\ge\frac{c\log n}{n^{k-1}}$.
Theorem 1.4. Let $k\ge2$ be fixed. Assume $p\ge\frac{c\log n}{n^{k-1}}$ for some constant $c>0$. Then for any $r>0$, there is a constant $C>0$ depending only on $r,c,k$ such that $\|T-\mathbb{E}T\|\le C\sqrt{n^{k-1}p}$ with probability at least $1-n^{-r}$.
Previous results for tensors include the concentration of sub-Gaussian tensors and expectation bound on the spectral norm for general random tensors [45,40]. The sparsity parameter does not appear in those bounds and directly applying those results would not get the desired concentration for sparse random tensors.
To compare with previous work on the concentration of sparse random hypergraphs (see Definition 2.3 and Definition 2.4), where each hyperedge $\{i_1,\dots,i_k\}$ is generated independently with probability $p_{i_1,\dots,i_k}$, we have the following quick corollary of Theorem 1.4.

Corollary 1.5. Let $k\ge2$ be fixed and let $T$ be the adjacency tensor of a $k$-uniform inhomogeneous random hypergraph with $n$ vertices and $p\ge\frac{c\log n}{n^{k-1}}$ for some constant $c>0$. Then for any $r>0$, there is a constant $C>0$ depending only on $r,c,k$ such that with probability at least $1-n^{-r}$,
$$\|T-\mathbb{E}T\|\le C\sqrt{n^{k-1}p}. \quad (1.1)$$

1.2. From random matrices to random tensors. There have been many fruitful results on the concentration of random matrices, including sparse random matrices. We briefly discuss different proof techniques and their difficulties and limitations in generalizing to random tensors.
For sub-Gaussian matrices, an $\epsilon$-net argument quickly gives the desired spectral norm bound [48]. For Gaussian matrices, one can relate the spectral norm to the maximum of a certain Gaussian process [47]. However, this type of argument does not adapt to sparse random matrices.
Another powerful way to derive a good spectral norm bound for random matrices is the high moment method. Consider a centered $n\times n$ Hermitian random matrix $A$. For any integer $k$, its spectral norm satisfies $\mathbb{E}[\|A\|^{2k}]\le\mathbb{E}[\mathrm{tr}(A^{2k})]$. By taking $k$ growing with $n$, a good estimate of $\mathbb{E}[\mathrm{tr}(A^{2k})]$ implies a good concentration bound on $\|A\|$. It is well known that computing the trace of a power of a random matrix is equivalent to counting a certain class of cycles in a graph. This type of argument, together with more refined modifications and variants (e.g., estimating high moments of the corresponding non-backtracking operator), is particularly useful for bounding the spectral norm of sparse random matrices; see [50,6,7,31,10]. This method can not only obtain the right order of the spectral norm but also capture the exact constant.
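The trace comparison behind the high moment method can be checked numerically. The following sketch (with illustrative dimensions) verifies the deterministic inequalities $\|A\|^{2m}\le\mathrm{tr}(A^{2m})\le n\|A\|^{2m}$ for a real symmetric matrix, which is why the trace moment captures the spectral norm when $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 4
A = rng.standard_normal((n, n))
A = (A + A.T) / 2  # real symmetric (Hermitian) matrix

spec = np.linalg.norm(A, 2)  # spectral norm = max_i |lambda_i|
trace_moment = np.trace(np.linalg.matrix_power(A, 2 * m))

# max_i lambda_i^{2m} <= sum_i lambda_i^{2m} <= n * max_i lambda_i^{2m}
assert spec ** (2 * m) <= trace_moment * (1 + 1e-9)
assert trace_moment <= n * spec ** (2 * m) * (1 + 1e-9)
```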
A different approach is the Friedman-Kahn-Szemerédi argument, which was first applied to obtain the spectral gap of random regular graphs [19]. A similar argument was used in [18] to estimate the largest eigenvalue of sparse Erdős-Rényi graphs. Although this method cannot obtain the exact constant of the spectral norm, it does capture the right order in $n$ and the sparsity parameter $p$. Moreover, it provides a way to regularize sparse random matrices that improves concentration.
A natural question is how those methods can be applied to study the spectral norm of random tensors. For sub-Gaussian random tensors of order $k$, the $\epsilon$-net argument gives a spectral norm bound $O(\sqrt{n})$ [45]. However, the dependence on the order $k$ might not be optimal, and the argument cannot capture the sparsity in the sparse random tensor case. For Gaussian random tensors, surprisingly, none of the above approaches yields a sharp spectral norm bound with the correct constant. Instead, the exact asymptotic spectral norm for Gaussian random tensors was obtained in [4] using techniques from spin glasses. This is also the starting point for a line of further research on tensor PCA and spiked tensor models under Gaussian noise; see for example [44,36,13,3].
However, the tools from spin glasses rely heavily on the assumption of Gaussian distributions and cannot easily be adapted to non-Gaussian cases. One might try to develop a high moment method for random tensors. Unfortunately, there is no natural generalization of the trace or eigenvalues for tensors that matches the cycle-counting interpretation in the random matrix case. Instead, by projecting the random tensor into a matrix form (including the adjacency matrix, self-avoiding matrix, and the non-backtracking matrix of a hypergraph), one can still apply the moment method to obtain some information about the original tensor or hypergraph; see [37,41,17,2]. This approach is particularly useful for the study of community detection problems on random hypergraphs. However, reducing the adjacency tensor to an adjacency matrix entails a strict loss of information, and one cannot recover the exact spectral norm of the original tensor. Due to this barrier to extending other methods to sparse random tensors, we generalize the Friedman-Kahn-Szemerédi argument to obtain a good spectral norm bound.
1.3. Regularization. Regularization of random graphs was first studied in [18], where it was proved that by removing high-degree vertices from a random graph, one can improve the concentration under the spectral norm. A data-driven threshold for finding high-degree vertices in the stochastic block model can be found in [53]. A different regularization analysis was given in [32] by decomposing the adjacency matrix into several parts and modifying a small submatrix. This method was later generalized to other random matrices in [43,42].
We adapt the techniques from [18] and apply them to an inhomogeneous random directed hypergraph (see Definition 2.5), whose adjacency tensor has independent entries. This allows us to prove a concentration inequality (see Theorem 5.2) with the same probability estimate as in [32] for regularized inhomogeneous random directed hypergraphs. The regularization of inhomogeneous random hypergraphs is discussed in Section 6.

Applications.
To demonstrate the usefulness of our concentration and regularization results, we highlight two applications. In Section 6, we show that the concentration and regularization of a sparse Erdős-Rényi hypergraph can be used to construct a sparse random hypergraph with bounded degrees that satisfies a hypergraph expander mixing lemma, improving the construction in [19] of a relatively dense random hypergraph model. In Section 7, we go beyond tensors with entries in $\{0,1\}$ and study a concentration bound for a deterministic tensor under uniform sampling. We improve and generalize the results in [27]. This inequality is useful for estimating the sample size in tensor completion problems.
Organization of the paper. In Section 2, we provide some useful definitions and lemmas for our proofs. In Section 3 we prove all the main results on concentration. In Section 4 we prove the minimax lower bound in Theorem 1.3. In Section 5, we analyze the regularization procedure. In Section 6, we present the construction of sparse hypergraph expanders. In Section 7, we provide the analysis of tensor sparsification.

Preliminaries
In this section, we collect some definitions and lemmas that will be used later in the paper. For ease of notation, we denote the Frobenius inner product between a tensor $T$ and a rank-one tensor $x_1\otimes\cdots\otimes x_k$ by $T(x_1,\dots,x_k):=\langle T, x_1\otimes\cdots\otimes x_k\rangle$, which can be seen as a multilinear form in $x_1,\dots,x_k$. It is worth noting that
$$T(x_1,\dots,x_k)=\sum_{i_1,\dots,i_k=1}^n t_{i_1,\dots,i_k}\,x_{1,i_1}\cdots x_{k,i_k}.$$

Lemma 2.1. Let $T\in\mathbb{R}^{n^k}$ be any order-$k$ tensor for $k\ge2$. We have $\|T\|\le\|T\|_F$.
Proof. By the Cauchy-Schwarz inequality, for unit vectors $x_1,\dots,x_k$,
$$|T(x_1,\dots,x_k)|=|\langle T, x_1\otimes\cdots\otimes x_k\rangle|\le\|T\|_F\,\|x_1\otimes\cdots\otimes x_k\|_F=\|T\|_F.$$
Taking the supremum over unit vectors completes the proof.

We introduce the following operation on a tensor called matricization, also known as flattening or unfolding. For more details, see [30].
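A minimal sketch of matricization for a third-order tensor, assuming the mode-1 unfolding convention in which row $i$ collects the slice $T[i,:,\dots,:]$. It also checks numerically that a rank-one evaluation of $T$ is bounded by the spectral norm of the unfolding, the comparison used later via Lemma 3.5.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 3
T = rng.standard_normal((n,) * k)

# Mode-1 matricization (unfolding): Mat_1(T) has shape n x n^(k-1);
# row i is the vectorized slice T[i, :, :] (row-major order).
Mat1 = T.reshape(n, n ** (k - 1))

# T(x1, x2, x3) = x1^T Mat_1(T) (x2 (x) x3), and ||x2 (x) x3||_2 = 1 for
# unit x2, x3, so |T(x1, x2, x3)| <= ||Mat_1(T)||; taking suprema over unit
# vectors gives ||T|| <= ||Mat_1(T)||.
x1, x2, x3 = [v / np.linalg.norm(v) for v in rng.standard_normal((3, n))]
form = np.einsum('ijl,i,j,l->', T, x1, x2, x3)
assert np.isclose(form, x1 @ Mat1 @ np.kron(x2, x3))  # the unfolding identity

mat_norm = np.linalg.norm(Mat1, 2)  # largest singular value of Mat_1(T)
assert abs(form) <= mat_norm + 1e-9
```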
In the proof of our main results, we need some definitions from hypergraph theory. For any adjacency tensor $T$, $t_{i_{\sigma(1)},\dots,i_{\sigma(k)}}=t_{i_1,\dots,i_k}$ for any permutation $\sigma\in S_k$, so $T$ is symmetric. We may abuse notation and write $t_e$ in place of $t_{i_1,\dots,i_k}$, where $e=\{i_1,\dots,i_k\}$.
For the proof of Theorem 1.4, we will work with a non-symmetric random tensor and we rely on some properties of the corresponding directed hypergraph. We include definitions here.
A $k$-uniform directed hypergraph $H$ consists of a set $V$ of vertices and a set $E$ of directed hyperedges such that each directed hyperedge is an element of $V\times\cdots\times V=V^k$. Let $T$ be the adjacency tensor of $H$, with $t_{i_1,\dots,i_k}=1$ if $(i_1,\dots,i_k)\in E$ and $0$ otherwise. Note that the adjacency tensor $T$ is not symmetric. The degree of a vertex $i$, denoted by $d_i$, is defined by $d_i=\sum_{i_1,\dots,i_{k-1}\in[n]}t_{i,i_1,\dots,i_{k-1}}$.

Finally, we recall the classical Chernoff bound that will be used in our proofs.

Proof of concentration results
3.1. Proof of Theorem 1.2. The proof is a generalization of [18,34] and is suitable for sparse random tensors. This type of method is known as the Friedman-Kahn-Szemerédi argument, originally introduced in [19].

3.1.1. Discretization. Fix $\delta\in(0,1)$ and define the $n$-dimensional ball of radius $t$ as $S_t:=\{x\in\mathbb{R}^n:\|x\|_2\le t\}$. We introduce a set of lattice points in $S_1$ as follows:
$$\mathcal{T}:=S_1\cap\frac{\delta}{\sqrt{n}}\mathbb{Z}^n. \quad (3.1)$$
By the Lipschitz property of spectral norms, we have the following upper bound, which reduces the problem of bounding the spectral norm of $T$ to an optimization problem over $\mathcal{T}$:
$$\|T-\mathbb{E}T\|\le(1-\delta)^{-k}\sup_{y_1,\dots,y_k\in\mathcal{T}}|(T-\mathbb{E}T)(y_1,\dots,y_k)|.$$
The proof follows from Lemma 2.1 in the supplement of [34]. For completeness, we provide the proof here.
Proof. For any $v\in S_{1-\delta}$, consider the cube in $\mathbb{R}^n$ of edge length $\delta/\sqrt{n}$ that contains $v$, with all its vertices in $\frac{\delta}{\sqrt{n}}\mathbb{Z}^n$. The diameter of the cube is $\delta$, so the entire cube is contained in $S_1$. Hence all vertices of this cube are in $\mathcal{T}$, and $S_{1-\delta}\subset\mathrm{convhull}(\mathcal{T})$. Therefore each $v_i\in S_{1-\delta}$ is a convex combination of points in $\mathcal{T}$, and by multilinearity and convexity,
$$\sup_{v_1,\dots,v_k\in S_{1-\delta}}|(T-\mathbb{E}T)(v_1,\dots,v_k)|\le\sup_{y_1,\dots,y_k\in\mathcal{T}}|(T-\mathbb{E}T)(y_1,\dots,y_k)|.$$
Since $\sup_{v_1,\dots,v_k\in S_1}|(T-\mathbb{E}T)(v_1,\dots,v_k)|=(1-\delta)^{-k}\sup_{v_1,\dots,v_k\in S_{1-\delta}}|(T-\mathbb{E}T)(v_1,\dots,v_k)|$ by homogeneity, this completes the proof.

Now for any fixed $k$-tuple $(y_1,\dots,y_k)\in\mathcal{T}\times\cdots\times\mathcal{T}$, we decompose its index set. Define the set of light tuples as
$$L=L(y_1,\dots,y_k):=\left\{(i_1,\dots,i_k):|y_{1,i_1}\cdots y_{k,i_k}|\le\frac{\sqrt{np}}{n}\right\} \quad (3.2)$$
and the set of heavy tuples as
$$\overline{L}=\overline{L}(y_1,\dots,y_k):=\left\{(i_1,\dots,i_k):|y_{1,i_1}\cdots y_{k,i_k}|>\frac{\sqrt{np}}{n}\right\}. \quad (3.3)$$
In the remaining part of our proof, we control the contributions of light and heavy tuples to the spectral norm respectively.

3.1.2. Light tuples.
Let $W=T-\mathbb{E}T$ be the centered random tensor, and denote the entries of $W$ by $w_{i_1,\dots,i_k}$ for $i_1,\dots,i_k\in[n]$. We have the following concentration bound (Lemma 3.2) for the contribution of light tuples, where the parameter $\delta\in(0,1)$ on the right-hand side is determined by the definition of $\mathcal{T}$ in (3.1).
Proof. Denote the summands by $X_{i_1,\dots,i_k}:=y_{1,i_1}\cdots y_{k,i_k}w_{i_1,\dots,i_k}\mathbf{1}\{(i_1,\dots,i_k)\in L\}$, so the contribution from light tuples can be written as $\sum_{i_1,\dots,i_k}X_{i_1,\dots,i_k}$. Since each term in the sum has mean zero and is bounded by $\sqrt{np}/n$, we can apply Bernstein's inequality to get a tail bound at level $c\sqrt{np}$ for any constant $c>0$. By the volume argument (see for example [48]) we have $|\mathcal{T}|\le\exp(n\log(7/\delta))$, hence the $k$-fold product satisfies $|\mathcal{T}\times\cdots\times\mathcal{T}|\le\exp(kn\log(7/\delta))$.
Taking a union bound over all possible $y_1,\dots,y_k\in\mathcal{T}$ completes the proof.
By Lemma 3.2, for any $r>0$, we can take the constant $c$ in Lemma 3.2 large enough such that the bound on the light tuples holds with probability at least $1-n^{-r}$. Therefore, to prove Theorem 1.2, it remains to control the contribution from heavy tuples.
3.1.3. Heavy tuples. Next we show the contribution of heavy tuples can be bounded by $c\sqrt{np}\log^{k-2}(n)$ for some constant $c$ with high enough probability; namely, with high probability,
$$\sup_{y_1,\dots,y_k\in\mathcal{T}}\Big|\sum_{(i_1,\dots,i_k)\in\overline{L}}y_{1,i_1}\cdots y_{k,i_k}\,w_{i_1,\dots,i_k}\Big|\le c\sqrt{np}\log^{k-2}(n).$$
Note that from our definition of heavy tuples in (3.3), we have
$$\sum_{(i_1,\dots,i_k)\in\overline{L}}|y_{1,i_1}\cdots y_{k,i_k}|\,\mathbb{E}t_{i_1,\dots,i_k}\le p\sum_{(i_1,\dots,i_k)\in\overline{L}}\frac{(y_{1,i_1}\cdots y_{k,i_k})^2}{\sqrt{np}/n}\le\frac{pn}{\sqrt{np}}=\sqrt{np},$$
since $\sum_{i_1,\dots,i_k}(y_{1,i_1}\cdots y_{k,i_k})^2\le1$. Therefore it suffices to show that, with high enough probability, for all $y_1,\dots,y_k\in\mathcal{T}$,
$$\sum_{(i_1,\dots,i_k)\in\overline{L}}y_{1,i_1}\cdots y_{k,i_k}\,t_{i_1,\dots,i_k}\le C_k\sqrt{np}\log^{k-2}(n) \quad (3.7)$$
for a constant $C_k$ depending on $k$. We will focus on the heavy tuples $(i_1,\dots,i_k)$ such that $y_{1,i_1},\dots,y_{k,i_k}>0$; we denote this set by $\overline{L}^+$. The remaining cases are similar, and there are $2^k$ different cases in total. We now define the following index sets for a fixed tuple $(y_1,\dots,y_k)$: for $1\le j\le k$ and $s\ge1$,
$$D_j^s:=\left\{i\in[n]:\frac{\delta 2^{s-1}}{\sqrt{n}}\le y_{j,i}<\frac{\delta 2^{s}}{\sqrt{n}}\right\}. \quad (3.8)$$
Also the following definitions are needed:
(2) $e(I_1,\dots,I_k)$: the number of distinct hyperedges between the $k$ vertex sets $I_1,\dots,I_k$;
(3) $\mu(I_1,\dots,I_k)=\mathbb{E}e(I_1,\dots,I_k)$;
(4) $\bar\mu(I_1,\dots,I_k)=p|I_1|\cdots|I_k|$.
The following two lemmas concern properties of sparse directed random hypergraphs (recall Definition 2.5), which are important for the rest of our proof.

Lemma 3.3 (bounded degrees). For any $r,c>0$ there is a constant $c_1>0$ such that, with probability at least $1-n^{-r}$, every $(k-1)$-tuple $(i_1,\dots,i_{k-1})$ has degree $d_{i_1,\dots,i_{k-1}}\le c_1np$.

Proof. For a fixed $(k-1)$-tuple $(i_1,\dots,i_{k-1})$, by Bernstein's inequality, $\mathbb{P}(d_{i_1,\dots,i_{k-1}}>c_1np)\le n^{-c'c_1}$ for a constant $c'>0$, where in the last inequality we use the assumption $p\ge c\log n/n$. Then taking a union bound over all $n^{k-1}$ such tuples, for any $r,c>0$ we can take $c_1$ sufficiently large to make Lemma 3.3 hold.
which gives the desired result for Case (2).
With Lemma 3.3 and Lemma 3.4, we prove our estimate (3.7) for all heavy tuples. Recall that we are dealing with the tuples in $\overline{L}^+$; we then have
$$\sum_{(i_1,\dots,i_k)\in\overline{L}^+}y_{1,i_1}\cdots y_{k,i_k}\,t_{i_1,\dots,i_k}\le\sum_{(s_1,\dots,s_k):\,2^{s_1+\cdots+s_k}\ge\sqrt{np}\,n^{k/2-1}}\alpha_{1,s_1}\cdots\alpha_{k,s_k}\,\sigma_{s_1,\dots,s_k}, \quad (3.13)$$
where the last step follows directly from the definitions in (7). (3.13) implies that it suffices to estimate the contribution of heavy tuples through its index sets. We then bound the contribution of heavy tuples by splitting the indices $(s_1,\dots,s_k)$ into six different categories. Let
$$\mathcal{C}:=\left\{(s_1,\dots,s_k):2^{s_1+\cdots+s_k}\ge\sqrt{np}\,n^{k/2-1},\ |D_1^{s_1}|\le\cdots\le|D_k^{s_k}|\right\} \quad (3.14)$$
be the ordered index set for heavy tuples, where we assume $|D_1^{s_1}|\le\cdots\le|D_k^{s_k}|$. The cases where the sequence $\{|D_i^{s_i}|,1\le i\le k\}$ is ordered differently can be treated similarly, and there are $k!$ of them in total. We then define six categories $\mathcal{C}_1,\dots,\mathcal{C}_6$ in $\mathcal{C}$. In the remaining part of the proof, we show that for all six categories $\{\mathcal{C}_t,1\le t\le6\}$,
$$\sum_{(s_1,\dots,s_k)\in\mathcal{C}_t}\alpha_{1,s_1}\cdots\alpha_{k,s_k}\,\sigma_{s_1,\dots,s_k}\le C_k\sqrt{np}\log^{k-2}(n)$$
for some constant $C_k$ depending only on $k,c_1,c_2,c_3$ and $\delta$, where the constants $c_1,c_2,c_3$ are the same ones as in Lemma 3.3 and Lemma 3.4.
Recalling (6), we will repeatedly use the estimate (3.16). From now on, for simplicity, whenever we sum over $s_i$ for some $1\le i\le k$, the range of $s_i$ is understood to be $1\le s_i\le\lceil\log_2(\sqrt{n}/\delta)\rceil$.
3.1.4. Tuples in $\mathcal{C}_1$. In this case we get the desired estimate, where the last inequality comes from (3.16).
3.1.5. Tuples in $\mathcal{C}_2$. The constraint on $\mathcal{C}_2$ is the same as the condition in Case (1) of Lemma 3.4. Then we have the corresponding estimate, where the last inequality follows from the fact that each $s_i$ satisfies $1\le s_i\le\lceil\log_2(\sqrt{n}/\delta)\rceil$ for $i\in[k]$ (see (3.8)). Therefore (3.21) admits the desired bound for a constant $C$ depending only on $\delta$, $k$ and $c_3$.

3.2. Proof of Theorem 1.4. The following lemma is an inequality comparing the spectral norms of a tensor and its matricization, taken from [51].

Lemma 3.5. For any order-$k$ tensor $T$ and any $1\le m\le k-1$, $\|T\|\le\|\mathrm{Mat}_m(T)\|$.

Now $\mathrm{Mat}_1(T-\mathbb{E}T)$ is an $n\times n^{k-1}$ random matrix whose entries are in one-to-one correspondence with the entries of $T-\mathbb{E}T$. Let $A\in\mathbb{R}^{n^{k-1}\times n^{k-1}}$ be a matrix such that $\mathrm{Mat}_1(T-\mathbb{E}T)$ is a submatrix of $A-\mathbb{E}A$ and all other entries are $0$. Then $A$ is the adjacency matrix of a random directed graph on $n^{k-1}$ vertices with sparsity $p\ge\frac{c\log n}{n^{k-1}}=\frac{c}{k-1}\cdot\frac{\log(n^{k-1})}{n^{k-1}}$. Applying Theorem 1.2 in the matrix case ($k=2$), for any $r>0$ there is a constant $C>0$ depending on $r,c,k$ such that with probability at least $1-n^{-r}$, $\|A-\mathbb{E}A\|\le C\sqrt{n^{k-1}p}$. Then from (3.27), with probability at least $1-n^{-r}$,
$$\|T-\mathbb{E}T\|\le\|\mathrm{Mat}_1(T-\mathbb{E}T)\|\le\|A-\mathbb{E}A\|\le C\sqrt{n^{k-1}p}.$$
This completes the proof of Theorem 1.4.

Proof of Corollary 1.5.
Proof. We consider the set of indices $I=\{(i_1,\dots,i_k):i_1>i_2>\cdots>i_k\}$. Let $T_I$ be the random tensor after zeroing out the entries with indices in $I^c$. Then by Theorem 1.4, with probability $1-n^{-r}$, $\|T_I-\mathbb{E}T_I\|\le C\sqrt{n^{k-1}p}$. For any permutation $\sigma$ in the symmetric group of order $k$, denoted by $S_k$, we repeat this argument for the sets of indices $I_\sigma=\{(i_{\sigma(1)},\dots,i_{\sigma(k)}):i_{\sigma(1)}>i_{\sigma(2)}>\cdots>i_{\sigma(k)}\}$, and conclude by the triangle inequality after summing over all $\sigma\in S_k$.

Proof of minimax lower bound
In this section, we prove Theorem 1.3. We first compute the packing number of the parameter space under the spectral norm, then apply Fano's inequality. We first introduce two useful lemmas for showing this result; we will use the versions in [46].

Lemma 4.1 (Varshamov-Gilbert bound). For $n\ge8$, there exists a subset $S\subset\{0,1\}^n$ such that $|S|\ge2^{n/8}+1$ and for every distinct pair $\omega,\omega'\in S$, the Hamming distance satisfies $H(\omega,\omega')\ge n/8$.

We will also use Fano's inequality in the following form: given parameters $\{\theta_1,\dots,\theta_N\}\subset\Theta$, where $d$ is a metric on $\Theta$, (ii) let $P_i$ be the distribution with respect to parameter $\theta_i$; then for all $i,j\in[N]$, $P_i$ is absolutely continuous with respect to $P_j$.
Since we will apply Fano's inequality with the Kullback-Leibler divergence, we need the following lemma about random tensors with independent Bernoulli entries.

Lemma 4.3. For $0\le a<b\le1$, consider parameters $\theta,\theta'\in[a,b]^{n^k}$, and let $P$ and $P'$ be the corresponding distributions. Then the Kullback-Leibler divergence $D_{\mathrm{KL}}(P\,\|\,P')$ is the sum of the entrywise divergences.

Proof. We first consider the entrywise KL divergence. For $p,q\in[a,b]$,
$$D_{\mathrm{KL}}(\mathrm{Ber}(p)\,\|\,\mathrm{Ber}(q))=p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q}.$$
By the independence of the entries, the divergence between $P$ and $P'$ is the sum of the entrywise divergences.

Now we are ready to prove Theorem 1.3. We note that $H(\omega^{(i)},\omega^{(j)})=\|\omega^{(i)}-\omega^{(j)}\|_2^2$. Let $W$ be a fixed order-$(k-1)$ tensor with entries either $0$ or $1$ and dimensions $n\times\cdots\times n$. The entries of $W$ are designed as follows. Let $m=\lfloor p^{-\frac{1}{k-1}}\rfloor\wedge n$, so $1\le m^{k-1}\le 1/p$. We assign $1$'s to an $m\times\cdots\times m$ subtensor of $W$ and $0$'s to the remaining entries. Then the rank of $W$ is $1$ and $\|W\|=\|W\|_F=m^{(k-1)/2}$. Now we define $\theta^{(i)}$ for $i\in[N]$, where $J\in\mathbb{R}^{n^k}$ is the order-$k$ tensor with all ones. Let $P_i$ be the distribution of a random tensor $T$ associated with parameter $\theta^{(i)}$ for $i\in[N]$; the Kullback-Leibler divergences are controlled via Lemma 4.3, where the last inequality is due to the choice $m=\lfloor p^{-\frac{1}{k-1}}\rfloor\wedge n\le p^{-\frac{1}{k-1}}$. To apply Fano's inequality, we let $\alpha=\frac{nm^{k-1}p^2}{14400}$ and verify that the conditions of Fano's inequality hold for $i,j\in[N]$, which gives the desired result.

Regularization
In this section we present the regularization procedure to obtain concentration of the spectral norm of order $O(\sqrt{n^{k-1}p})$ down to sparsity $p\ge\frac{c}{n^{k-1}}$. Our regularization results show that for sparse hypergraphs, the appearance of high-degree vertices is a barrier to the concentration of the spectral norm. This is the same phenomenon studied in [18,32] for sparse random graphs.
For any order-$k$ tensor $A$ indexed by $[n]$ and any $S\subset[n]$, we define the regularized tensor $A_S$ by zeroing out all entries $a_{i_1,\dots,i_k}$ with $i_1\in S$. When we observe a random tensor $T$, we regularize $T$ as follows: if the degree of a vertex $i$ is greater than $2n^{k-1}p$, we remove all hyperedges of this vertex; in other words, we zero out the corresponding entries in the adjacency tensor. We call the resulting tensor $\tilde{T}$. Let $\tilde{S}\subset[n]$ be the set of vertices with degree greater than $2n^{k-1}p$. Then, with our notation, $\tilde{T}=T_{\tilde{S}}$.
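One way to implement this regularization step numerically (a sketch, with illustrative sizes, assuming the first-index convention for degrees used in the directed hypergraph setting above):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 40, 3
p = 4.0 / n ** (k - 1)  # sparsity regime p = c / n^{k-1}, here with c = 4
T = rng.binomial(1, p, size=(n,) * k).astype(float)

# degree of vertex i: d_i = sum over all t_{i, i1, ..., i_{k-1}}
degrees = T.reshape(n, -1).sum(axis=1)
threshold = 2 * n ** (k - 1) * p
S = np.flatnonzero(degrees > threshold)  # high-degree vertices to regularize

T_reg = T.copy()
T_reg[S] = 0.0  # zero out every entry t_{i, i1, ..., i_{k-1}} with i in S

# after regularization, no vertex exceeds the degree threshold
assert np.all(T_reg.reshape(n, -1).sum(axis=1) <= threshold)
```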
The following lemma shows that with high probability, not many vertices are removed. By our Theorem 1.4, when $p=\frac{c\log n}{n^{k-1}}$ for any $c>0$, the regularization is not needed. Below we assume $p<\frac{\log n}{n^{k-1}}$ for simplicity.
Lemma 5.1. Let $\frac{c}{n^{k-1}}\le p<\frac{\log n}{n^{k-1}}$ for a sufficiently large $c>1$. Then the number of regularized vertices satisfies $|\tilde{S}|\le\frac{1}{n^{k-2}p}$ with probability at least $1-\exp\left(-\frac{n}{6\log n}\right)$.
Proof. Similar to (3.9), by Bernstein's inequality, we have a tail bound on $d_i$ for each $i\in[n]$. Then $\mathbf{1}\{d_i>2n^{k-1}p\}$ is a Bernoulli random variable with mean at most $\mu:=\exp\left(-\frac{3n^{k-1}p}{8}\right)$. Since the $d_i$ are independent for all $i\in[n]$, by Chernoff's inequality (2.2), for any $\lambda\ge0$ we obtain an upper tail bound on $|\tilde{S}|$. Since $n^{k-1}p\ge c$, we can choose a constant $c$ sufficiently large and take $\lambda$ accordingly so that $2+\lambda\le3\lambda$, and from (5.3) we obtain the desired bound, where the last two inequalities are from (5.4) and our assumption that $n^{k-1}p<\log n$.
Theorem 5.2. Let $\frac{c}{n^{k-1}}\le p<\frac{\log n}{n^{k-1}}$ for a sufficiently large $c>1$, and let $\tilde{T}$ be the random tensor $T$ after regularization. Then for any $r>0$, there exists a constant $C$ depending on $c,k,r$ such that $\|\tilde{T}-\mathbb{E}T\|\le C\sqrt{n^{k-1}p}$ with probability at least $1-n^{-r}$.

Proof. We first prove Theorem 5.2 when $k=2$, the matrix case. Let $S$ be any fixed subset of $[n]$ with $|S|\le\frac{1}{p}$. Since the spectral norm of a tensor is bounded by its Frobenius norm, we then have
$$\|P-P^S\|\le\|P-P^S\|_F\le\sqrt{|S|\,np^2}\le\sqrt{np}. \quad (5.6)$$
We consider the random matrix $T^S$ generated from $P^S$ such that if $i_1\in S$, then $t^S_{i_1,i_2}=p^S_{i_1,i_2}=0$ for all $i_2\in[n]$. Applying Lemma 3.2 with $k=2$ to $T^S-P^S$ yields, for any constant $C>0$, a tail bound (5.8) on the contribution of light tuples, where $\delta$ is the parameter associated with $\mathcal{T}$ (see (3.1)). Taking a union bound over all $S\subset[n]$, of which there are $2^n$, the bound holds simultaneously for all such $S$. Now we consider the heavy tuples. Note that $\tilde{T}$ satisfies the bounded-degree condition in Lemma 3.3 with $c_1=2$, and the quantity $e(I_1,I_2)$ corresponding to $\tilde{T}$ is smaller than the one corresponding to $T$. Thus, given the bounded-degree property of $\tilde{T}$, from the proof of Lemma 3.4, the bounded discrepancy conditions in Lemma 3.4 hold for $\tilde{T}$ with probability at least $1-n^{-\frac{1}{2}(c_3-12)}$. As a result, the contribution of heavy tuples can be bounded by $C_1\sqrt{np}$ for a sufficiently large $C_1$.
Take $C=1$ and $\delta=1/2$ in (5.8). From the analysis above, there exists a constant $C_2>1$ depending on $c,c_3$ such that the bound $\|T^{\tilde{S}}-P^{\tilde{S}}\|\le C_2\sqrt{np}$ fails with probability at most $2e^{-n}+n^{-\frac{1}{2}(c_3-12)}$ (5.9). We define two events accordingly. Conditioned on the event $\mathcal{E}^c$, by the triangle inequality,
$$\|\tilde{T}-P\|\le\|T^{\tilde{S}}-P^{\tilde{S}}\|+\|P-P^{\tilde{S}}\|\le C_2\sqrt{np}+\sqrt{np},$$
where the last inequality is from (5.6). Therefore, from (5.5) and (5.9), for any $r>0$ we can take $c_3$ large enough such that the conclusion holds with probability at least $1-n^{-r}$. This completes the proof when $k=2$.
Next we consider the case $k\ge3$. Let $T$ be the adjacency tensor of a $k$-uniform random directed hypergraph and $\mathbb{E}T=P$. Let $A\in\mathbb{R}^{n^{k-1}\times n^{k-1}}$ be a matrix such that $\mathrm{Mat}_1(T)$ is a submatrix of $A$ with all other entries being $0$. Then $A$ is the adjacency matrix of a random directed graph on $n^{k-1}$ vertices with sparsity $p\ge\frac{c}{n^{k-1}}$. Regularizing $A$ by removing vertices of degree greater than $2n^{k-1}p$, we have, with probability at least $1-n^{-r}$, $\|\hat{A}-\mathbb{E}A\|\le C\sqrt{n^{k-1}p}$.
By the way we regularize an order-$k$ random tensor $T$ introduced above, $\mathrm{Mat}_1(\tilde{T}-P)$ is a submatrix of $\hat{A}-\mathbb{E}A$ with all other entries being $0$. Therefore, by Lemma 3.5,
$$\|\tilde{T}-P\|\le\|\mathrm{Mat}_1(\tilde{T}-P)\|\le\|\hat{A}-\mathbb{E}A\|\le C\sqrt{n^{k-1}p}.$$
This completes the proof of Theorem 5.2 for all $k\ge2$.
Our Theorem 5.2 provides the guarantee of concentration after regularization for random tensors with independent entries. For symmetric random tensors, we provide a similar regularization procedure in Section 6, see (6.2).

Sparse hypergraph expanders
The expander mixing lemma for a $d$-regular graph (every vertex has degree $d$) states the following. Let $G$ be a $d$-regular graph on $n$ vertices whose adjacency matrix has second largest eigenvalue in absolute value $\lambda:=\max\{\lambda_2,|\lambda_n|\}<d$. For any two subsets $V_1,V_2\subset V(G)$, let $e(V_1,V_2)$ be the number of edges between $V_1$ and $V_2$. Then
$$\left|e(V_1,V_2)-\frac{d}{n}|V_1||V_2|\right|\le\lambda\sqrt{|V_1||V_2|}. \quad (6.1)$$
(6.1) shows that $d$-regular graphs with small $\lambda$ have a good mixing property: the number of edges between any two vertex subsets is close to the number of edges we would expect if they were drawn at random. Such graphs are called expanders, and the quality of this approximation is controlled by $\lambda$, which is also the mixing rate of simple random walks on $G$ [14].
Hypergraph expanders have recently received significant attention in combinatorics and theoretical computer science [38,16]. Many different definitions have been proposed for hypergraph expanders, each with its own strengths and weaknesses. In this section, we only consider hypergraph models that satisfy a generalized version of the expander mixing lemma (6.1). There are several hypergraph expander mixing lemmas in the literature based on the spectral norm of tensors [20,35,15]. However, for deterministic tensors, the spectral norm is NP-hard to compute [26], hence those estimates might not be applicable in practice. In [9,24], the authors obtained a weaker version of the expander mixing lemma for a sparse deterministic hypergraph model, where the mixing property depends on the second eigenvalue of a regular graph.
Friedman and Wigderson [20] obtained the following spectral norm bound for a random hypergraph model. Consider a $k$-uniform hypergraph on $n$ vertices where $dn^{k-1}$ hyperedges are chosen independently at random, and let $J$ be the order-$k$ tensor with all entries equal to $1$. If $d\ge Ck\log n$, then with high probability $\|T-\frac{d}{n}J\|\le(C\log n)^{k/2}\sqrt{d}$. Combined with their expander mixing lemma in [20], this provides a random hypergraph model with good control of the mixing property. However, in this model each vertex has expected degree $dn^{k-2}$, which is not sparse. To the best of our knowledge, our Theorem 6.2 below is the first construction of a sparse random hypergraph model with bounded degrees that satisfies a $k$-subset expander mixing lemma with high probability.
The idea of applying expander mixing lemma and spectral gap results of sparse expanders to analyze matrix completion and tensor completion has been developed in [25,8,11,21,24]. We believe our result could also be useful for tensor completion or other related problems.
Let $H$ be a $k$-uniform Erdős-Rényi hypergraph (recall Definition 2.3) on $n$ vertices with sparsity $p=\frac{c}{n^{k-1}}$, where each hyperedge is generated independently with probability $p$. Its adjacency tensor is then a symmetric tensor, which we denote by $T$. We construct a regularized hypergraph $H'$ as follows: (1) compute the degree $\tilde{d}_i$ of each vertex $i$; if $\tilde{d}_i>2n^{k-1}p$, zero out all entries $t_{i,i_1,\dots,i_{k-1}}$. We then obtain a new tensor $\tilde{T}$. (6.2)
Note that this regularization procedure also applies to inhomogeneous random hypergraphs by taking $p=\max_{i_1,\dots,i_k\in[n]}p_{i_1,\dots,i_k}$. By our construction, $H'$ is a $k$-uniform hypergraph with degrees at most $2k!n^{k-1}p=2k!c$. Let $J\in\mathbb{R}^{n^k}$ be the order-$k$ tensor with all entries equal to $1$. From Theorem 5.2, the adjacency tensor $T'$ of $H'$ satisfies
$$\|T'-pJ\|\le C\sqrt{n^{k-1}p} \quad (6.3)$$
for some constant $C>0$ with high probability. In the next theorem we use (6.3) to show that $H'$ satisfies an expander mixing lemma with high probability. For subsets $V_1,\dots,V_k\subset V(H)$, define
$$e(V_1,\dots,V_k):=\sum_{i_1\in V_1,\dots,i_k\in V_k}t'_{i_1,\dots,i_k} \quad (6.4)$$
to be the number of hyperedges between $V_1,\dots,V_k$.

Theorem 6.2. Let $H$ be a $k$-uniform Erdős-Rényi hypergraph with sparsity $p=\frac{c}{n^{k-1}}$ for some sufficiently large constant $c>1$. Let $H'$ be the hypergraph $H$ after regularization. Then there exists a constant $C>0$ such that, with high probability, for any subsets $V_1,\dots,V_k\subset V(H)$, we have the following expander mixing lemma:
$$\left|e(V_1,\dots,V_k)-p|V_1|\cdots|V_k|\right|\le C\sqrt{n^{k-1}p\,|V_1|\cdots|V_k|}. \quad (6.5)$$

Proof of Theorem 6.2. Let $\mathbf{1}_{V_i}$ be the indicator vector of $V_i$, $1\le i\le k$, such that the $j$-th entry of $\mathbf{1}_{V_i}$ is $1$ if $j\in V_i$ and $0$ otherwise. Then
$$\left|e(V_1,\dots,V_k)-p|V_1|\cdots|V_k|\right|=\left|(T'-pJ)(\mathbf{1}_{V_1},\dots,\mathbf{1}_{V_k})\right|\le\|T'-pJ\|\prod_{i=1}^k\|\mathbf{1}_{V_i}\|_2=\|T'-pJ\|\sqrt{|V_1|\cdots|V_k|}.$$
The last line follows from the definition of the spectral norm for tensors and (6.3). Then (6.5) follows.
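The hyperedge count $e(V_1,\dots,V_k)$ and the deviation bound in the proof can be checked numerically. The sketch below (illustrative sizes and vertex subsets) uses the computable matricization norm as an upper bound for the tensor spectral norm, via Lemma 3.5, since the tensor spectral norm itself is NP-hard to compute.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, p = 30, 3, 0.1
T = rng.binomial(1, p, size=(n,) * k).astype(float)
J = np.ones((n,) * k)

def indicator(V, n):
    v = np.zeros(n)
    v[list(V)] = 1.0
    return v

V1, V2, V3 = range(0, 10), range(5, 20), range(15, 30)
ones = [indicator(V, n) for V in (V1, V2, V3)]

# e(V1, V2, V3) = T(1_{V1}, 1_{V2}, 1_{V3}): (ordered) hyperedge count
e = np.einsum('ijl,i,j,l->', T, *ones)
expected = p * len(V1) * len(V2) * len(V3)

# |e - p|V1||V2||V3|| <= ||T - pJ|| sqrt(|V1||V2||V3|), and ||T - pJ|| is
# itself bounded by the spectral norm of the mode-1 matricization (Lemma 3.5).
mat_norm = np.linalg.norm((T - p * J).reshape(n, -1), 2)
vol = np.sqrt(len(V1) * len(V2) * len(V3))
assert abs(e - expected) <= mat_norm * vol + 1e-9
```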

Tensor sparsification
In the tensor completion problem, one aims to estimate a low-rank tensor based on a random sample of observed entries. A commonly used notion of rank for tensors is the canonical polyadic (CP) rank; we refer to [30] for more details. Solving a tensor completion problem involves two main steps. First, one samples some entries from a low-rank tensor $T$. Then, based on the observed entries, one solves an optimization problem and shows that its solution is exactly or nearly the original tensor $T$. A fundamental question is: how many observations are required to guarantee that the solution of the optimization problem provides a good recovery of the original tensor $T$?
After random sampling from the original tensor $T$, we obtain a random tensor $\tilde{T}$. If we require the sample size to be small, $\tilde{T}$ will be random and sparse. In the next step, the optimization procedure is based on $\tilde{T}$. In many matrix or tensor completion algorithms, especially non-convex optimization algorithms, we need some stability guarantee on the initial data; see for example [28,27,12]. Therefore, it is important to have a concentration guarantee that $\tilde{T}$ is close to $T$ in some sense.
Another related problem is tensor sparsification. Given a tensor $T$, through some sampling algorithm, one wants to construct a sparsified version $\tilde{T}$ of $T$ such that $\|T-\tilde{T}\|$ is relatively small with high probability. In [40], a non-uniform sampling algorithm was proposed, where the probability of sampling each entry is chosen based on the magnitude of the corresponding entry of $T$. However, without knowing the exact values of the original tensor $T$, a reasonable way to output a sparsified tensor $\tilde{T}$ is through uniform sampling.
We obtain a concentration inequality for the spectral norm of tensors under uniform sampling, which is useful for both of the problems above. It improves the sparsity assumption in the analysis of the initialization step of the tensor completion algorithm proposed in [27], and is applicable to other tensor completion and sparsification problems.
Let $T$ be a deterministic tensor. We obtain a new tensor $\tilde{T}$ by uniformly sampling the entries of $T$ with probability $p$: namely, $\tilde{t}_{i_1,\dots,i_k}=t_{i_1,\dots,i_k}$ with probability $p$, and $\tilde{t}_{i_1,\dots,i_k}=0$ with probability $1-p$, independently for each entry.

Theorem 7.1. Let $k\ge2$ be fixed and $p\ge\frac{c\log n}{n^{k-1}}$ for some constant $c>0$. Then for any $r>0$, there is a constant $C>0$ depending only on $r,c,k$ such that with probability at least $1-n^{-r}$,
$$\|\tilde{T}-pT\|\le C\sqrt{n^{k-1}p}\max_{i_1,\dots,i_k\in[n]}|t_{i_1,\dots,i_k}|.$$

Remark 7.2. Theorem 2.1 in [27] provided an estimate of $\|\tilde{T}-pT\|$ for symmetric $T$ and symmetric sampling, assuming $k=3$ and $p\ge\frac{\log n}{n^{3/2}}$. When $k=3$, we improve the sparsity assumption down to $p\ge\frac{c\log n}{n^2}$, and our result covers non-symmetric tensors under uniform sampling.

Proof of Theorem 7.1. Without loss of generality, we may assume $\max_{i_1,\dots,i_k\in[n]}|t_{i_1,\dots,i_k}|=1$ in our proof. We first prove the result for $k=2$, the matrix case; it is a simple modification of the proof of Theorem 1.2. Let $Z=\tilde{T}-pT$. Then the entries of $Z$ satisfy $|z_{i_1,i_2}|\le1$ and $\mathbb{E}z_{i_1,i_2}=0$. Using the same discretization argument as in Section 3.1.1, we have, for any $\delta\in(0,1)$,
$$\|Z\|\le(1-\delta)^{-2}\sup_{y_1,y_2\in\mathcal{T}}|Z(y_1,y_2)|.$$
Define light and heavy tuples in the same way as in (3.2) and (3.3). For the contribution of light tuples, the proof of Lemma 3.2 goes through unchanged for $Z$. Therefore, for any $r>0$, we can take a constant $C$ large enough such that
$$\sup_{y_1,y_2\in\mathcal{T}}\Big|\sum_{(i_1,i_2)\in L}y_{1,i_1}y_{2,i_2}\,z_{i_1,i_2}\Big|\le C\sqrt{np} \quad (7.3)$$
with probability at least $1-n^{-r}$. Now it remains to control the contribution from heavy tuples: with probability $1-n^{-r}$, there exists a constant $C_1>0$ such that
$$\sup_{y_1,y_2\in\mathcal{T}}\Big|\sum_{(i_1,i_2)\in\overline{L}}y_{1,i_1}y_{2,i_2}\,z_{i_1,i_2}\Big|\le C_1\sqrt{np}. \quad (7.4)$$

Recall $z_{i_1,i_2}=\tilde{t}_{i_1,i_2}-pt_{i_1,i_2}$. Therefore, from (7.3) and (7.4), it suffices to show that with high enough probability, for all $y_1,y_2\in\mathcal{T}$,
$$\Big|\sum_{(i_1,i_2)\in\overline{L}}y_{1,i_1}y_{2,i_2}\,\tilde{t}_{i_1,i_2}\Big|\le C_2\sqrt{np} \quad (7.5)$$
for a constant $C_2>0$. As in the proof in Section 3.1.3, we can focus on the heavy tuples $(i_1,i_2)$ in $\overline{L}^+$ (defined below (3.7)); the remaining cases are similar. We now introduce auxiliary random variables $t'_{i_1,i_2}$ such that $t'_{i_1,i_2}=1$ if $\tilde{t}_{i_1,i_2}=t_{i_1,i_2}$, and $t'_{i_1,i_2}=0$ if $\tilde{t}_{i_1,i_2}=0$.

Since $t'_{i_1,i_2}$ is a Bernoulli random variable with mean $p$, all of our analysis in Section 3.1.3 for the contribution from $\overline{L}^+$ applies without any change. Hence, for any $r>0$, there exists a constant $C_3>0$ such that with probability at least $1-n^{-r}$,
$$\sum_{(i_1,i_2)\in\overline{L}^+}y_{1,i_1}y_{2,i_2}\,t'_{i_1,i_2}\le C_3\sqrt{np}.$$
Therefore (7.5) holds. This finishes the proof of Theorem 7.1 when $k=2$.

Next we extend the result to all $k\ge3$. By tensor matricization, $\mathrm{Mat}_1(\tilde{T}-pT)\in\mathbb{R}^{n\times n^{k-1}}$. By the same argument as in the proof of Theorem 1.4, for any $r>0$ there exists a constant $C$ depending on $r,c,k$ such that with probability $1-n^{-r}$,
$$\|\tilde{T}-pT\|\le\|\mathrm{Mat}_1(\tilde{T}-pT)\|\le C\sqrt{n^{k-1}p}.$$
This finishes the proof for all k ≥ 3.
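The uniform sampling model and the matricization step in the proof can be sketched as follows (illustrative sizes; the comparison uses the computable matricization norm, which itself is bounded by the Frobenius norm):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, p = 20, 3, 0.2
T = rng.standard_normal((n,) * k)  # deterministic tensor to sparsify

# Uniform sampling: keep each entry independently with probability p.
mask = rng.binomial(1, p, size=T.shape).astype(float)
T_sparse = mask * T  # this is the sampled tensor; entrywise mean is p*T

# Every entry of the sampled tensor is either 0 or the original entry.
assert np.all((T_sparse == 0) | (T_sparse == T))

# The proof controls the deviation via its mode-1 matricization; the
# matrix spectral norm is bounded by the Frobenius norm of the deviation.
dev = T_sparse - p * T
dev_norm = np.linalg.norm(dev.reshape(n, -1), 2)
assert dev_norm <= np.linalg.norm(dev) + 1e-9

# Theorem 7.1 predicts dev_norm = O(scale) in the stated sparsity regime.
scale = np.sqrt(n ** (k - 1) * p) * np.abs(T).max()
```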