Normal approximation and confidence region of singular subspaces

This paper concerns the normal approximation of singular subspaces when the noise matrix has i.i.d. entries. Our contributions are three-fold. First, we derive an explicit representation formula for the empirical spectral projectors. The formula is neat and holds for deterministic matrix perturbations. Second, we calculate the expected projection distance between the empirical singular subspaces and the true singular subspaces. Our method allows obtaining an arbitrary $k$-th order approximation of the expected projection distance. Third, we prove non-asymptotic normal approximations of the projection distance with different levels of bias correction. With the $\lceil \log(d_1+d_2)\rceil$-th order bias correction, asymptotic normality holds under the optimal signal-to-noise ratio (SNR) condition, where $d_1$ and $d_2$ denote the matrix sizes. In addition, we show that higher order approximations are unnecessary when $|d_1-d_2|=O((d_1+d_2)^{1/2})$. Finally, we provide comprehensive simulation results to support our theoretical findings.
Unlike existing results, our approach is non-asymptotic and convergence rates are established. Our method allows the rank $r$ to diverge as fast as $o((d_1+d_2)^{1/3})$. Moreover, our method requires no eigen-gap condition (except the SNR) and no constraints between $d_1$ and $d_2$.


Introduction
Matrix singular value decomposition (SVD) is a powerful tool for various purposes across diverse fields. In numerical linear algebra, SVD has been successfully applied to solving linear inverse problems, low-rank matrix approximation, etc. See, e.g., Golub and Van Loan (2012) for more examples. In many machine learning tasks, SVD is crucial for designing computationally efficient algorithms, such as matrix and tensor completion (Cai et al. (2010), Keshavan et al. (2010), Candès and Tao (2010), Xia and Yuan (2018), Xia et al. (2017)) and phase retrieval (Ma et al. (2017), Candès et al. (2015)), where SVD is often applied to generate a warm initial point for non-convex optimization algorithms. In statistical data analysis, SVD is effective for denoising and dimension reduction. For instance, SVD, as a dimension reduction tool, is used for text classification in Kim et al. (2005); see also Li and Wang (2007). In Shabalin and Nobel (2013), SVD shows appealing performance in low-rank matrix denoising. More specifically, Donoho and Gavish (2014) proved that statistically minimax optimal matrix denoising can be attained via precise singular value thresholding. Recently, matrix SVD has been generalized to tensor SVD for tensor denoising; see Xia and Zhou (2019) and Zhang and Xia (2018).
Perturbation analysis is critical for advancing the theory of SVD for low-rank matrix denoising, where the observed data matrix often equals a low-rank information matrix plus a noise matrix. Deterministic perturbation bounds for matrix SVD were established long ago by Davis and Kahan (Davis and Kahan (1970), Yu et al. (2014)) and Wedin (Wedin (1972)). Among these deterministic perturbation bounds, one simple yet useful bound shows that the perturbation of singular vectors is governed by the so-called signal-to-noise ratio (SNR), where the "signal" refers to the smallest non-zero singular value of the information matrix and the "noise" refers to the spectral norm of the noise matrix. This is a quite general result, since the bound does not rely on how well the singular subspaces of the information matrix align with those of the noise matrix. Such a general bound turns out to be reasonably sharp when the noise matrix contains i.i.d. random entries. However, more refined characterizations of singular vectors are needed on the frontiers of statistical inference for matrix SVD. The Davis-Kahan theorem and Wedin's perturbation bounds are stated in terms of the smallest non-zero singular value of the information matrix, so the effects of the large singular values are usually missing. Moreover, the exact numerical factors in these bounds are not well understood.
The behavior of singular values and singular vectors of low-rank perturbations of large rectangular random matrices has attracted considerable attention in recent years. They play a key role in statistical inference with diverse applications; see Li and Li (2018), Naumov et al. (2017), Tang et al. (2018) for some examples in network testing. The asymptotic limits of singular values and singular vectors were first developed by Benaych-Georges and Nadakuditi (2012), where the convergence rate of the largest singular value was also established. More recently, Ding (2017) obtained more precise non-asymptotic concentration bounds for empirical singular values, and also proved non-asymptotic perturbation bounds for empirical singular vectors when the associated singular value has multiplicity 1. In a recent work (Bao et al., 2018), the authors studied the asymptotic limiting distributions of the empirical singular subspaces when the (scaled) singular values are bounded. Specifically, they showed that if the noise matrix is Gaussian, then the limiting distribution of the projection distance is also Gaussian. Unlike these prior works (Ding (2017), Bao et al. (2018)), we focus on non-asymptotic normal approximations of the joint singular subspaces in a different regime. Our approach allows the rank to diverge and imposes no constraints between $d_1$ and $d_2$. In addition, we establish convergence rates and impose no eigen-gap conditions (except the SNR).
In Xia (2019), the low-rank matrix regression model is investigated, where the author proposed a de-biased estimator built on the nuclear norm penalized least squares estimator. The de-biased estimator ends up with a form analogous to the low-rank perturbation of rectangular random matrices. A non-asymptotic normal approximation theory for the projection distance is then proved under a near optimal sample size requirement. The paramount observation is that the mean of the limiting normal distribution is significantly larger than its standard deviation. As a result, a much larger than usual sample size is necessary to trade off the estimation error of the expected projection distance. Most recently, Chen et al. (2018) revealed an interesting phenomenon in the perturbation of eigenvalues and eigenvectors under such non-symmetric random perturbations, showing that the perturbation of the eigen structure is much smaller than that of the singular structure. In addition, non-asymptotic perturbation bounds for empirical singular vectors can be found in Koltchinskii and Xia (2016), Bloemendal et al. (2016) and Abbe et al. (2017). Minimax optimal bounds for singular subspace estimation under low-rank perturbations of large rectangular random matrices are established in Cai and Zhang (2018).
Our goal is to investigate central limit theorems for singular subspaces in the low-rank perturbation model of large rectangular random matrices. As illustrated in Xia (2019), the major difficulty lies in precisely determining the expected projection distance. One conclusive contribution of this paper is an explicit representation formula for the empirical spectral projector. This explicit representation formula allows us to obtain a precise (non-asymptotic) characterization of the expected projection distance. After higher order bias corrections, we prove the normal approximation of the singular subspaces under the optimal (in the consistency regime) SNR requirement. To better present the results and highlight the contributions, let us begin by introducing the standard notation. We denote by $M = U\Lambda V^T$ the unknown $d_1\times d_2$ matrix, where $U\in\mathbb{R}^{d_1\times r}$ and $V\in\mathbb{R}^{d_2\times r}$ are its left and right singular vectors. The diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1,\cdots,\lambda_r)$ contains $M$'s non-increasing positive singular values. The observed data matrix $\hat M\in\mathbb{R}^{d_1\times d_2}$ satisfies the additive model
$$ \hat M = M + Z, $$
where the noise matrix $Z$ has i.i.d. entries. Here, we fix the noise variance to be 1 just for simplicity. For ease of exposition, let $d_1\le d_2$. Let $\hat U\in\mathbb{R}^{d_1\times r}$ and $\hat V\in\mathbb{R}^{d_2\times r}$ be the top-$r$ left and right singular vectors of $\hat M$. Let $\hat\Lambda = \mathrm{diag}(\hat\lambda_1,\cdots,\hat\lambda_r)$ denote the top-$r$ singular values of $\hat M$. We focus on the projection distance between the empirical and true singular subspaces, defined by
$$ \mathrm{dist}^2[(\hat U,\hat V),(U,V)] = \|\hat U\hat U^T - UU^T\|_F^2 + \|\hat V\hat V^T - VV^T\|_F^2. $$
By the Davis-Kahan theorem (Davis and Kahan, 1970) or Wedin's $\sin\Theta$ theorem (Wedin, 1972), the projection distance is bounded by a multiple of $\sqrt{r}\,\|Z\|/\lambda_r$, where $\|\cdot\|$ denotes the spectral norm and $d_2 = \max\{d_1, d_2\}$. Since $\|Z\|\asymp\sqrt{d_2}$ with high probability, it is convenient to consider $\lambda_r\gg\sqrt{d_2}$. In this paper, we focus on the consistency regime¹ $\lambda_r\gg\sqrt{rd_2}$, so that the empirical singular subspaces are consistent. See, e.g., Tao (2012), Koltchinskii and Xia (2016), Cai and Zhang (2018) and Vershynin (2010).
Our contributions are summarized as follows.
1. An explicit representation formula for $\hat U\hat U^T$ and $\hat V\hat V^T$ is derived. In particular, $\hat U\hat U^T$ and $\hat V\hat V^T$ can be completely determined by the sum of a series of matrix products involving only $\Lambda$, $U$, $V$, $Z$ and the matrices $U_\perp\in\mathbb{R}^{d_1\times(d_1-r)}$ and $V_\perp\in\mathbb{R}^{d_2\times(d_2-r)}$, which are chosen so that $(U, U_\perp)$ and $(V, V_\perp)$ are orthonormal matrices. To derive this useful representation formula, we apply the Riesz formula, combinatorial identities, contour integrals, the residue theorem and the generalized Leibniz rule. It is worth pointing out that the representation formula is deterministic as long as $\|Z\| < \lambda_r/2$. We believe this representation formula for spectral projectors should be of independent interest for various purposes.

2. By the representation formula, we prove the normal approximation of $\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$ centered at its expectation; the standardized statistic is denoted by $\hat\varepsilon_1$. In particular, we show that $\hat\varepsilon_1$ converges to a standard normal distribution as long as $\sqrt{rd_2}/\lambda_r\to 0$ and $r^3/d_2\to 0$ as $d_1, d_2\to\infty$. The required SNR is optimal in the consistency regime. Note that our result allows $r$ to diverge as fast as $o((d_1+d_2)^{1/3})$. In addition, no conditions on the eigen-gaps (except $\lambda_r$) are required. The convergence rate is also established. The proof strategy is based on the Gaussian isoperimetric inequality and the Berry-Esseen theorem.

3. The unknown $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$ plays the role of the centering term in $\hat\varepsilon_1$. To derive user-friendly normal approximations of $\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$, it suffices to calculate its expectation explicitly (non-asymptotically). By the representation formula for $\hat U\hat U^T$ and $\hat V\hat V^T$, we obtain approximations of $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$. Different levels of approximation of the expectation yield different levels of bias correction. These levels of approximation are:

(a) Level-1 approximation: $B_1 = 2d\|\Lambda^{-1}\|_F^2$, where $d = d_1 + d_2 - 2r$.

(b) Level-2 approximation: $B_2$, defined by (6).

(c) Level-$k$ approximation: $B_k$, defined by (6), with an approximation error bound involving some absolute constant $C_2 > 0$.
The aforementioned approximation errors hold whenever $C_2d_2/\lambda_r^2 < 1$. An explicit formula for $B_\infty$ is also derived. An intriguing fact is that if $|d_1-d_2| = O((d_1+d_2)^{1/2})$, then the higher order approximations are unnecessary. Different levels of bias correction require different SNR conditions for asymptotic normality. For instance, we prove the normal approximation of $\hat\varepsilon_2$, defined with the $\lceil\log d_2\rceil$-th order bias correction as the centering term. More exactly, we show the asymptotic normality of $\hat\varepsilon_2$ when $\sqrt{rd_2}/\lambda_r\to 0$ and $r^3/d_2\to 0$ as $d_1, d_2\to\infty$. As far as we know, this is the first result on the limiting distribution of singular subspaces which allows the rank $r$ to diverge. Meanwhile, no eigen-gap conditions (except the SNR) are needed. Since our normal approximation is non-asymptotic, we impose no constraints on the relation between $d_1$ and $d_2$.

¹We note that, in the RMT literature (see, e.g., Bao et al. (2018), Ding (2017)), many works studied the problem when $\lambda_r = O(\sqrt{d_2})$ and $\lambda_r\gg(d_1d_2)^{1/4}$. In this paper, we focus on the regime where the empirical singular subspaces are consistent, i.e., $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]\to 0$ as $d_2\to\infty$. As shown in Cai and Zhang (2018), such consistency requires $\sqrt{rd_2}/\lambda_r\to 0$.
The rest of the paper is organized as follows. In Section 2, we derive the explicit representation formula for the empirical spectral projector. The representation formula is established under deterministic perturbations. We prove the normal approximation of $\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$ in Section 3. In particular, we show that $\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$ is asymptotically normal under optimal SNR conditions. In Sections 4 and 5, we develop arbitrary $k$-th level approximations of $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$ and the corresponding normal approximations, where the requirements on the SNR are spelled out explicitly. In Section 6, we propose confidence regions and discuss a data-adaptive shrinkage estimator of the singular values. We then display comprehensive simulation results in Section 7, where, for instance, we show the importance of higher order approximations of $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$ when the matrix has unbalanced sizes, and the effectiveness of shrinkage estimation of singular values. The proofs are collected in Section 8 and the Appendix.

Representation formula of spectral projectors
Let $A$ and $X$ be $d\times d$ symmetric matrices. The matrix $A$ has rank $r = \mathrm{rank}(A)\le d$. Denote the eigen-decomposition of $A$ by $A = \Theta\Lambda\Theta^T$, where $\Lambda = \mathrm{diag}(\lambda_1,\cdots,\lambda_r)$ contains the non-zero, non-increasing eigenvalues of $A$. The $d\times r$ matrix $\Theta = (\theta_1,\cdots,\theta_r)$ consists of $A$'s eigenvectors. The noise matrix $X$ satisfies $\|X\| < \min_{1\le i\le r}|\lambda_i|/2$, where $\|\cdot\|$ denotes the matrix operator norm. Given $\hat A = A + X$, where $A$ and $X$ are unknown, our goal is to estimate $\Theta$. We denote by $\hat\Theta = (\hat\theta_1,\cdots,\hat\theta_r)$ the $d\times r$ matrix containing the eigenvectors of $\hat A$ associated with its $r$ largest eigenvalues in absolute value. Therefore, $\hat\Theta$ represents the empirical version of $\Theta$. We derive the representation formula for $\hat\Theta\hat\Theta^T$ for deterministic $X$. The formula is useful for various purposes.
Apparently, by eq. (3), a simple fact is that $\Theta\Theta^TA = A\Theta\Theta^T = A$. Compared with the famous first-order (w.r.t. $\|X\|$) perturbation bounds of Wedin and Davis-Kahan (Davis and Kahan, 1970; Wedin, 1972), Theorem 1 provides a precise formula for the empirical spectral projector. For instance, we can obtain the second-order approximation $\hat\Theta\hat\Theta^T - \Theta\Theta^T - S_{A,1}(X)$ and even higher order approximations. The proof of Theorem 1 is based on complex analysis of the resolvent, a technique that has been used in Koltchinskii and Lounici (2016), Xia (2019) and Löffler et al. (2019). We note that our representation formula is similar, in spirit, to the perturbation series of the spectral projector for a single eigenvalue developed in Kato (2013). However, our formula investigates the spectral projector for all eigenvalues jointly, which has recently become more useful in low-rank methods.
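To make the role of the first-order term concrete, here is a minimal numerical sketch (ours, not from the paper) for symmetric matrices. It assumes the standard form $S_{A,1}(X) = P_\perp XP_{-1} + P_{-1}XP_\perp$ with $P_{-1} = \Theta\Lambda^{-1}\Theta^T$ and $P_\perp = I - \Theta\Theta^T$, and checks that subtracting it shrinks the projector error from first to second order in $\|X\|/\lambda_r$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 60, 3

# Rank-r symmetric A = Theta Lambda Theta^T with eigenvalues well above ||X||.
Theta, _ = np.linalg.qr(rng.standard_normal((d, r)))
lam = np.array([30.0, 25.0, 20.0])
A = Theta @ np.diag(lam) @ Theta.T

# Symmetric perturbation with spectral norm roughly 2 < lambda_r / 2.
G = rng.standard_normal((d, d))
X = (G + G.T) / np.sqrt(2 * d)

def top_r_projector(S, r):
    # Projector onto the eigenvectors of the r largest |eigenvalues| of S.
    w, Q = np.linalg.eigh(S)
    idx = np.argsort(-np.abs(w))[:r]
    return Q[:, idx] @ Q[:, idx].T

P = Theta @ Theta.T
P_perp = np.eye(d) - P
P_m1 = Theta @ np.diag(1.0 / lam) @ Theta.T      # P_{-1} = Theta Lambda^{-1} Theta^T

S1 = P_perp @ X @ P_m1 + P_m1 @ X @ P_perp       # assumed first-order term S_{A,1}(X)
P_hat = top_r_projector(A + X, r)

print(np.linalg.norm(P_hat - P, 2))              # O(||X|| / lambda_r)
print(np.linalg.norm(P_hat - P - S1, 2))         # O((||X|| / lambda_r)^2)
```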

Normal approximation of spectral projectors
Recall from (1) that $\hat M = M + Z\in\mathbb{R}^{d_1\times d_2}$ with $M = U\Lambda V^T$, where $U\in\mathbb{R}^{d_1\times r}$ and $V\in\mathbb{R}^{d_2\times r}$ satisfy $U^TU = I_r$ and $V^TV = I_r$. The diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1,\cdots,\lambda_r)$ contains the non-increasing positive singular values of $M$. Let $\hat U$ and $\hat V$ be $\hat M$'s top-$r$ left and right singular vectors. We derive the normal approximation of
$$ \mathrm{dist}^2[(\hat U,\hat V),(U,V)] = \|\hat U\hat U^T - UU^T\|_F^2 + \|\hat V\hat V^T - VV^T\|_F^2, $$
which is often called the (squared) projection distance on Grassmannians. To this end, we clarify important notations which shall appear frequently throughout the paper.
To apply the representation formula from Theorem 1, we turn $\hat M$, $M$ and $Z$ into symmetric matrices. For notational consistency, we work with the symmetric dilations, e.g.,
$$ A = \begin{pmatrix} 0 & M \\ M^T & 0 \end{pmatrix}\in\mathbb{R}^{(d_1+d_2)\times(d_1+d_2)}, $$
whose non-zero eigenvalues are $\lambda_i$ and $\lambda_{-i} := -\lambda_i$ for $1\le i\le r$. The eigenvectors corresponding to $\lambda_i$ and $\lambda_{-i}$ are, respectively, $\theta_i = \frac{1}{\sqrt{2}}(u_i^T, v_i^T)^T$ and $\theta_{-i} = \frac{1}{\sqrt{2}}(u_i^T, -v_i^T)^T$. These eigenvectors may not be uniquely defined if the singular value $\lambda_i$ has multiplicity larger than 1. However, the spectral projectors $UU^T$ and $VV^T$ are unique regardless of the multiplicities of $M$'s singular values.
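The following short sketch (illustrative sizes; the helper `dist2` is ours) constructs the symmetric dilation and checks that its non-zero eigenvalues indeed come in pairs $\pm\lambda_i$; it also implements the projection distance used throughout:

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, r = 40, 60, 3

# Low-rank M = U Lambda V^T with orthonormal U, V.
U, _ = np.linalg.qr(rng.standard_normal((d1, r)))
V, _ = np.linalg.qr(rng.standard_normal((d2, r)))
lam = np.array([20.0, 15.0, 10.0])
M = U @ np.diag(lam) @ V.T

# Symmetric dilation: non-zero eigenvalues are +/- lambda_i, with
# eigenvectors (u_i^T, +/- v_i^T)^T / sqrt(2).
A = np.block([[np.zeros((d1, d1)), M], [M.T, np.zeros((d2, d2))]])
w = np.linalg.eigvalsh(A)
print(np.sort(np.abs(w))[-2 * r:])   # 10, 10, 15, 15, 20, 20

def dist2(U_hat, V_hat, U, V):
    """Squared projection distance between (U_hat, V_hat) and (U, V)."""
    return (np.linalg.norm(U_hat @ U_hat.T - U @ U.T, 'fro') ** 2
            + np.linalg.norm(V_hat @ V_hat.T - V @ V.T, 'fro') ** 2)
```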

With the above notation, it is clear that $\mathrm{dist}^2[(\hat U,\hat V),(U,V)] = \|\hat\Theta\hat\Theta^T - \Theta\Theta^T\|_F^2$, so it suffices to prove the normal approximation of $\|\hat\Theta\hat\Theta^T - \Theta\Theta^T\|_F^2$. By Theorem 1 and the fact $\Theta\Theta^TP_\perp = 0$, we can write the decomposition (4) of $\|\hat\Theta\hat\Theta^T - \Theta\Theta^T\|_F^2$, where we used the fact $P_\perp P_\perp = P_\perp$. We prove a CLT for $\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$ with an explicit normalizing factor. Without loss of generality, we assume $d_1\le d_2$ hereafter.
where $d = d_1 + d_2 - 2r$ and $\Phi(x)$ denotes the c.d.f. of the standard normal distribution. By setting $s = \lambda_r/\sqrt{rd_2}$, we conclude that the approximation error in the above bound vanishes as the SNR grows. By Theorem 2, asymptotic normality holds as long as the conditions in (5) hold; in particular, the first condition in (5) is equivalent to $\sqrt{rd_2}/\lambda_r\to 0$. Such an SNR condition is optimal in the consistency regime.

Remark 1.
The normalizing factor $\sqrt{8d}\,\|\Lambda^{-2}\|_F$ comes from the variance of the leading term in the decomposition (4). Clearly, this conclusion relies on the Gaussian assumption. If the entries of $Z$ are not Gaussian, this variance involves the kurtosis of the unknown distribution. The unknown kurtosis makes data-driven statistical inference even more challenging. Finally, we remark that, by the proof of Theorem 2, no constraints between $d_1$ and $d_2$ are needed.
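As a sanity check on the normalizing factor, the following Monte Carlo sketch (illustrative sizes and singular values, ours) centers $\mathrm{dist}^2$ by its empirical mean and rescales by $\sqrt{8d}\,\|\Lambda^{-2}\|_F$; the standard deviation of the resulting statistic should be close to 1:

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, r = 100, 100, 2
lam = np.array([80.0, 60.0])
U, _ = np.linalg.qr(rng.standard_normal((d1, r)))
V, _ = np.linalg.qr(rng.standard_normal((d2, r)))
M = U @ np.diag(lam) @ V.T

def dist2_sample():
    # One draw of the squared projection distance under Gaussian noise.
    Ul, _, Vh = np.linalg.svd(M + rng.standard_normal((d1, d2)))
    Uh, Vr = Ul[:, :r], Vh[:r, :].T
    return (np.linalg.norm(Uh @ Uh.T - U @ U.T, 'fro') ** 2
            + np.linalg.norm(Vr @ Vr.T - V @ V.T, 'fro') ** 2)

samples = np.array([dist2_sample() for _ in range(2000)])
d = d1 + d2 - 2 * r
sigma = np.sqrt(8 * d) * np.sqrt(np.sum(lam ** -4.0))  # sqrt(8d) * ||Lambda^{-2}||_F
eps1 = (samples - samples.mean()) / sigma
print(eps1.std())   # should be close to 1
```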

Approximating the bias
Recalling (4), we can write $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$ in terms of $\mathbb{E}\|P_\perp XP_{-1}\|_F^2$ and $\mathbb{E}\langle\Theta\Theta^T, S_{A,2k}(X)\rangle$, where we used the fact that $\mathbb{E}\,S_{A,2k+1}(X) = 0$ for any integer $k\ge 1$. We aim to determine $\mathbb{E}\|P_\perp XP_{-1}\|_F^2$ and $\mathbb{E}\langle\Theta\Theta^T, S_{A,2k}(X)\rangle$ for all $k\ge 2$. Apparently, by obtaining explicit formulas for $\mathbb{E}\langle\Theta\Theta^T, S_{A,2k}(X)\rangle$ for larger $k$, we end up with a more precise approximation of $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$. In Lemmas 1–3, we provide arbitrary $k$-th order approximations of the bias.

Lemma 1 (First order approximation). The following equation holds:
$$ \mathbb{E}\|P_\perp XP_{-1}\|_F^2 = (d_1 + d_2 - 2r)\,\|\Lambda^{-1}\|_F^2. $$
Moreover, the residual bias terms admit an explicit bound,
where $C_2 > 0$ is an absolute constant (depending on the constant $C_1$).
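A quick Monte Carlo sketch (illustrative sizes and singular values, ours) of the level-1 approximation $B_1 = 2d\|\Lambda^{-1}\|_F^2$: for balanced dimensions and strong SNR, the empirical mean of $\mathrm{dist}^2$ is close to $B_1$, and the small residual gap reflects the higher order terms treated in Lemmas 2 and 3:

```python
import numpy as np

rng = np.random.default_rng(3)
d1, d2, r = 100, 100, 2
lams = np.array([70.0, 50.0])
U, _ = np.linalg.qr(rng.standard_normal((d1, r)))
V, _ = np.linalg.qr(rng.standard_normal((d2, r)))
M = U @ np.diag(lams) @ V.T

vals = []
for _ in range(500):
    Ul, _, Vh = np.linalg.svd(M + rng.standard_normal((d1, d2)))
    Uh, Vr = Ul[:, :r], Vh[:r, :].T
    vals.append(np.linalg.norm(Uh @ Uh.T - U @ U.T, 'fro') ** 2
                + np.linalg.norm(Vr @ Vr.T - V @ V.T, 'fro') ** 2)

d = d1 + d2 - 2 * r
B1 = 2 * d * np.sum(lams ** -2.0)   # level-1 bias approximation
print(np.mean(vals), B1)            # close when lambda_r >> sqrt(d2)
```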

In Lemma 2, we calculate $\mathbb{E}\langle\Theta\Theta^T, S_{A,4}(X)\rangle$; it yields the second order approximation of the bias. Lemma 2 (Second order approximation) states the resulting explicit formula. In general, we calculate the arbitrary $k$-th order approximation in Lemma 3.
where $c_1, C_1, C_2 > 0$ are some absolute constants. Then, the following bound holds,
where $C_3, C_4, C_5, C_6$ are some absolute constants and $B_k$ is defined by (6). The second and higher order terms involve the dimension difference $\Delta_d = d_2 - d_1$. If $d_1 = d_2$, these higher order approximations essentially have the same effect as the first order approximation; see (Bao et al., 2018, Theorems 2.3 and 2.9). Compared with Bao et al. (2018), our results are non-asymptotic. We impose no eigen-gap conditions and no upper bounds on $r$.

Normal approximation after bias corrections
In this section, we prove the normal approximation of $\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$ with explicit centering and normalizing terms. By Theorem 2, it suffices to substitute $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$ with the explicit formulas from Lemmas 1–3. As in Section 4, we consider arbitrary $k$-th level bias corrections for $\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$. Higher order bias corrections, while involving more complicated bias reduction terms, require weaker SNR conditions to guarantee asymptotic normality. For instance, the first order bias correction in Theorem 3 requires $\lambda_r\gg\sqrt{r}\,d_2^{3/4}$ for asymptotic normality, while the $\lceil\log d_2\rceil$-th order bias correction in Theorem 4 only requires the optimal $\lambda_r\gg\sqrt{rd_2}$. Again, the rank $r$ is allowed to diverge as fast as $o((d_1+d_2)^{1/3})$.

By Theorem 3, we conclude that asymptotic normality holds under the stated conditions, which require $\lambda_r^2\gg rd_2^{3/2}$ and $r^3\ll d_2$. The resulting order $d_2^{3/4}$ is larger than the optimal rate $\sqrt{d_2}$; it is improvable by applying higher order bias corrections.

Theorem 4. There exist absolute constants such that, with the level-$k$ bias correction, the following normal approximation bound holds,
where $B_k$ is defined by (6).
By choosing $k = \lceil\log d_2\rceil$, the SNR requirement boils down to $\sqrt{rd_2}/\lambda_r\to 0$, which is optimal in the consistency regime. As in Theorem 2, the conditions impose no constraints between $d_1$ and $d_2$.

Remark 4. To avoid computing the sum of $k$ terms in $B_k$ (6), it suffices to apply $B_\infty$, whose explicit form is given by Remark 2.

We note that it is also possible to develop the asymptotic distributions of the one-sided quantities $\|\hat U\hat U^T - UU^T\|_F^2$ and $\|\hat V\hat V^T - VV^T\|_F^2$. For that purpose, some treatments should be adjusted and the proof modified accordingly. We leave this for future work.
In particular, setting $k = \lceil\log d_2\rceil$ in Theorem 4 yields the corresponding normal approximation. We define the confidence region based on $(\hat U,\hat V)$ by
$$ \mathcal{M}_\alpha(\hat U,\hat V) = \big\{(U,V):\ \mathrm{dist}^2[(\hat U,\hat V),(U,V)]\le B_{\lceil\log d_2\rceil} + z_\alpha\sqrt{8d}\,\|\Lambda^{-2}\|_F\big\}, $$
where $z_\alpha$ denotes the critical value of the standard normal distribution, i.e., $z_\alpha = \Phi^{-1}(1-\alpha)$. Theorem 5 follows immediately from Theorem 4.
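A small sketch of how such a region could be computed in practice (the helper name is ours; it uses the level-1 correction $B_1$ in place of $B_{\lceil\log d_2\rceil}$, whose full formula (6) is not reproduced in this extract):

```python
import numpy as np
from scipy.stats import norm

def confidence_region_radius(s_hat, d1, d2, r, alpha=0.05):
    """Hypothetical radius of M_alpha: (U, V) is covered when
    dist^2[(U_hat, V_hat), (U, V)] <= radius. Uses the plug-in level-1
    correction B_1 = 2 d ||Lambda^{-1}||_F^2, a simplification of B_{log d2}
    that is adequate when d1 ~ d2."""
    d = d1 + d2 - 2 * r
    B1 = 2 * d * np.sum(s_hat ** -2.0)                       # plug-in bias term
    sigma = np.sqrt(8 * d) * np.sqrt(np.sum(s_hat ** -4.0))  # sqrt(8d)*||Lambda^{-2}||_F
    return B1 + norm.ppf(1 - alpha) * sigma                  # z_alpha = Phi^{-1}(1-alpha)
```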

Remark 5. We can also simply replace $B_{\lceil\log d_2\rceil}$ with $B_\infty$, and Theorem 5 still holds under the same conditions.

Theorem 6. Suppose that the conditions of Theorem 4 and (7) hold, and $r = O(1)$.
Then, for any $\alpha\in(0,1)$, the stated coverage guarantee holds. By Theorem 6, the data-driven confidence region $\mathcal{M}_\alpha(\hat U,\hat V)$ is asymptotically valid. For simplicity, we only consider the case of fixed rank. We remark that Theorem 6 still holds if we replace $\hat B_{\lceil\log d_2\rceil}$ with $\hat B_\infty$.

Numerical experiments
For all the simulation settings considered below, we choose the rank $r = 6$ and set the singular values to $\lambda_i = 2^{r-i}\cdot\lambda$ for $i = 1,\cdots,r$, for some positive number $\lambda$. As a result, the signal strength is determined by $\lambda$. The true singular vectors $U\in\mathbb{R}^{d_1\times r}$ and $V\in\mathbb{R}^{d_2\times r}$ are computed from the left and right singular subspaces of a $d_1\times d_2$ Gaussian random matrix.
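In code, this setup reads as follows (a sketch; the random seed and the helper `make_instance` are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_instance(d1, d2, lam, r=6):
    """Rank-r instance with lambda_i = 2^(r-i) * lam and true singular
    subspaces drawn from a d1 x d2 Gaussian random matrix."""
    lams = (2.0 ** np.arange(r - 1, -1, -1)) * lam   # 32*lam, 16*lam, ..., lam
    G = rng.standard_normal((d1, d2))
    Ug, _, Vg = np.linalg.svd(G, full_matrices=False)
    U, V = Ug[:, :r], Vg[:r, :].T                    # singular subspaces of G
    return U, np.diag(lams), V

U, Lam, V = make_instance(100, 100, lam=30.0)
M_hat = U @ Lam @ V.T + rng.standard_normal((100, 100))  # one observed matrix
```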

Higher order approximations of bias and normal approximation
In Simulation 1, we show the effectiveness of approximating $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$; meanwhile, we show the inefficiency of the first order approximation when $|d_1-d_2|$ is comparable to $\min(d_1,d_2)$. In Simulation 2, we demonstrate the benefits of higher order approximations.

Simulation 1. In this simulation, we study the accuracy of the first order approximation and its dependence on $\Delta_d$. First, we set $d_1 = d_2 = d$ for $d = 100, 200, 300$. The signal strength $\lambda$ is chosen as $30, 30.5,\cdots,40$. For each given $\lambda$, the first order approximation $2d\|\Lambda^{-1}\|_F^2$ is recorded. To obtain $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$, we repeat the experiment 500 times for each $\lambda$; the results are displayed in Figure 1(a). Since $d_1 = d_2 = d$, the first order approximation has a similar effect as the higher order approximations, which is verified by Figure 1(a). Second, we set $d_1 = d_2/2 = d$ for $d = 100, 200, 300$, so that $\Delta_d = d_2 - d_1 = d$ is significantly large. Similar experiments are conducted and the results are displayed in Figure 1(b), which clearly shows that the first order approximation is insufficient for estimating $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$. Therefore, if $|d_1-d_2|\gg 0$, we need higher order approximations of $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$.

Simulation 2. In this simulation, we study the effects of higher order approximations when $|d_1-d_2|\gg 0$. More specifically, we choose $d_1 = 500$ and $d_2 = 1000$. The signal strength is $\lambda = 50, 51,\cdots,60$. For each $\lambda$, we repeat the experiment 500 times, producing 500 realizations of $\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$ whose average is recorded as the simulated $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$. Meanwhile, for each $\lambda$, we record the 1st-4th order approximations $B_1, B_2, B_3$ and $B_4$, which are defined by (6). All the results are displayed in Figure 2. They verify that higher order bias corrections indeed improve the accuracy of approximating $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$. They also show that the 1st and 3rd order approximations over-estimate $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$, while the 2nd and 4th order approximations under-estimate it.

Simulation 3. We apply higher order approximations and show the normal approximation of $\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$ with $d_2 = 600$ and rank $r = 6$. We fix the signal strength $\lambda = 50$. The density histograms are based on 5000 realizations from independent experiments. We consider the 1st-4th order approximations, denoted by $\{B_k\}_{k=1}^4$; for each $B_k$ we form the corresponding standardized statistic. The results are shown in Figure 3. This experiment aims to demonstrate the necessity of higher order bias corrections. Indeed, by the density histograms in Figure 3, the first and second order bias corrections are not sufficiently accurate to guarantee the normal approximation, at least when $\lambda\le 50$: the density histograms shift leftward or rightward compared with the standard normal curve. Clearly, the 3rd and 4th order approximations are already close to the simulated mean; again, we observe that the 1st and 3rd order approximations over-estimate $\mathbb{E}\,\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$, while the 2nd and 4th order approximations under-estimate it. On the other hand, after the third or fourth order corrections, the normal approximation is very satisfactory at the same signal strength $\lambda = 50$.

Normal approximation with data-dependent bias corrections
Next, we show normal approximations of $\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$ with data-dependent bias corrections and normalizing factors.

Simulation 4. We apply the 1st order approximation with empirical singular values and show the normal approximation when $d_1 = d_2 = 100$ and $r = 6$. Here, $\hat\Lambda = \mathrm{diag}(\hat\lambda_1,\cdots,\hat\lambda_r)$ denotes the top-$r$ empirical singular values of $\hat M$. The signal strength is $\lambda = 25, 50, 65, 75$. For each $\lambda$, we record 5000 independent experiments and draw the density histogram. The p.d.f. of the standard normal distribution is displayed by the red curve. The results are shown in Figure 4. Since each $\hat\lambda_j$ over-estimates the true $\lambda_j$, the bias correction $2d\|\hat\Lambda^{-1}\|_F^2$ is not sufficiently large. This explains why the density histograms shift rightward compared with the standard normal curve, especially when the signal strength $\lambda$ is moderately strong.

Simulation 5. We apply the 1st order approximation with shrunk singular values and show the normal approximation when $d_1 = d_2 = 100$ and $r = 6$. Here, $\tilde\Lambda = \mathrm{diag}(\tilde\lambda_1,\cdots,\tilde\lambda_r)$ denotes the top-$r$ shrinkage estimators of the $\lambda_j$'s as in (9). Comparing with Figure 4, where $\hat\Lambda$ is used instead of $\tilde\Lambda$, we conclude that $2d\|\tilde\Lambda^{-1}\|_F^2$ is more accurate than $2d\|\hat\Lambda^{-1}\|_F^2$ for bias correction. Indeed, we see that the normal approximation is already satisfactory when the signal strength is $\lambda = 35$.
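Since formula (9) is not reproduced in this extract, the following stand-in only illustrates the idea of shrinking the inflated empirical singular values; the specific inversion $\tilde\lambda_j = (\hat\lambda_j^2 - d_1 - d_2)^{1/2}$, based on the first-order inflation $\hat\lambda_j^2\approx\lambda_j^2 + d_1 + d_2$ for strong signals, is our assumption and not necessarily the paper's estimator (9):

```python
import numpy as np

def shrink_singular_values(s_hat, d1, d2):
    """Hypothetical stand-in for the shrinkage estimator (9): invert the
    assumed first-order inflation E[lambda_hat_j^2] ~ lambda_j^2 + d1 + d2
    that makes each lambda_hat_j over-estimate lambda_j."""
    return np.sqrt(np.maximum(s_hat ** 2 - (d1 + d2), 0.0))
```

This explains the qualitative behavior in Figures 4 and 5: shrinking the $\hat\lambda_j$'s enlarges $2d\|\tilde\Lambda^{-1}\|_F^2$, compensating the otherwise insufficient bias correction.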

Proofs
We only provide the proof of Theorem 1 in this section. Proofs of other theorems are collected in the supplementary file.

Proof of Theorem 1
For notational simplicity, we assume $\lambda_i > 0$ for $1\le i\le r$, i.e., the matrix $A$ is positive semidefinite. The proof is almost identical if $A$ has negative eigenvalues; indeed, if there exist negative eigenvalues, we simply construct a contour that also encloses those negative eigenvalues.
Since $A$ is positive semidefinite, we have $\min_{1\le i\le r}|\lambda_i| = \lambda_r$, and the condition in Theorem 1 is equivalent to $\lambda_r > 2\|X\|$. Recall that $\{\hat\lambda_i, \hat\theta_i\}_{i=1}^d$ denote the eigenvalues and eigenvectors of $\hat A$. Define the contour $\gamma_A$ on the complex plane (shown in Figure 6), chosen such that $\min_{\eta\in\gamma_A}\min_{1\le i\le r}|\eta - \lambda_i| = \lambda_r/2$. Weyl's lemma implies that $\max_{1\le i\le r}|\hat\lambda_i - \lambda_i|\le\|X\|$. We observe that, when $\|X\| < \lambda_r/2$, all of $\{\hat\lambda_i\}_{i=1}^r$ lie inside the contour $\gamma_A$, while $0$ and $\{\hat\lambda_i\}_{i=r+1}^d$ lie outside of it. By Cauchy's integral formula, we get
$$ \hat\Theta\hat\Theta^T = \frac{1}{2\pi i}\oint_{\gamma_A}(\eta - \hat A)^{-1}\,d\eta. $$
Note that $\eta - \hat A = (\eta - A) - X$, so we may write the Neumann series
$$ (\eta - \hat A)^{-1} = \sum_{k\ge 0}\big[(\eta - A)^{-1}X\big]^k(\eta - A)^{-1}. $$
By (11) and (10), for $k\ge 1$ we define
$$ S_{A,k}(X) = \frac{1}{2\pi i}\oint_{\gamma_A}\big[(\eta - A)^{-1}X\big]^k(\eta - A)^{-1}\,d\eta, $$
which is essentially the $k$-th order perturbation term. Therefore, we obtain
$$ \hat\Theta\hat\Theta^T = \Theta\Theta^T + \sum_{k\ge 1}S_{A,k}(X). $$
By (13), it suffices to derive explicit expressions for $\{S_{A,k}(X)\}_{k\ge 1}$. Before dealing with general $k$, let us derive $S_{A,k}(X)$ for $k = 1, 2$ to illustrate the shared structure.
To this end, we denote by $I_r$ the $r\times r$ identity matrix and write the spectral decomposition of the resolvent,
$$ (\eta - A)^{-1} = \sum_{j=1}^{d}\frac{P_j}{\eta - \lambda_j}, $$
where we set $\lambda_j = 0$ for all $r+1\le j\le d$. Here $P_j = \theta_j\theta_j^T$ for all $1\le j\le d$ represents the spectral projector onto $\theta_j$.
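As a sanity check (ours, not from the paper), the Riesz/Cauchy formula can be verified numerically: discretizing a circle $\gamma_A$ that encloses $\lambda_1,\cdots,\lambda_r$ but excludes $0$ recovers the spectral projector $\Theta\Theta^T$; the sizes, eigenvalues and contour below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
d, r = 30, 3
Theta, _ = np.linalg.qr(rng.standard_normal((d, r)))
lam = np.array([10.0, 8.0, 6.0])
A = Theta @ np.diag(lam) @ Theta.T   # contour check with X = 0 for clarity

# Circle gamma_A enclosing lambda_1..lambda_r but not 0, discretized by the
# trapezoid rule (spectrally accurate for periodic analytic integrands).
center, radius = 8.0, 5.0
ts = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
P = np.zeros((d, d), dtype=complex)
for t in ts:
    eta = center + radius * np.exp(1j * t)
    deta = 1j * radius * np.exp(1j * t) * (2.0 * np.pi / ts.size)
    P += np.linalg.inv(eta * np.eye(d) - A) * deta
P = (P / (2.0j * np.pi)).real

print(np.linalg.norm(P - Theta @ Theta.T))   # ~ 1e-12: integral = projector
```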

Derivation of $S_{A,1}(X)$. By the definition of $S_{A,1}(X)$ and the resolvent expansion,
$$ S_{A,1}(X) = \sum_{j_1,j_2=1}^{d}\Big(\frac{1}{2\pi i}\oint_{\gamma_A}\frac{d\eta}{(\eta-\lambda_{j_1})(\eta-\lambda_{j_2})}\Big)\,P_{j_1}XP_{j_2}. \qquad (14) $$
Case 1: both $j_1$ and $j_2$ are greater than $r$. In this case, the contour integral in (14) is zero by the Cauchy integral formula.
Case 2: only one of $j_1$ and $j_2$ is greater than $r$. W.L.O.G., let $j_2 > r$; then the contour integral in (14) equals $\lambda_{j_1}^{-1}$.
Case 3: neither $j_1$ nor $j_2$ is greater than $r$. Clearly, the contour integral in (14) is zero.

Derivation of $S_{A,2}(X)$. By the definition of $S_{A,2}(X)$,
$$ S_{A,2}(X) = \sum_{j_1,j_2,j_3=1}^{d}\Big(\frac{1}{2\pi i}\oint_{\gamma_A}\frac{d\eta}{(\eta-\lambda_{j_1})(\eta-\lambda_{j_2})(\eta-\lambda_{j_3})}\Big)\,P_{j_1}XP_{j_2}XP_{j_3}. \qquad (15) $$
Case 1: all of $j_1, j_2, j_3$ are greater than $r$. The contour integral in (15) is zero by the Cauchy integral formula.
To sum up, we obtain the explicit form of $S_{A,2}(X)$ by collecting the remaining non-vanishing cases.

Derivation of $S_{A,k}(X)$ for general $k$. Recalling the definition of $S_{A,k}(X)$, we write
$$ S_{A,k}(X) = \sum_{j_1,\cdots,j_{k+1}=1}^{d}\Big(\frac{1}{2\pi i}\oint_{\gamma_A}\frac{d\eta}{\prod_{i=1}^{k+1}(\eta-\lambda_{j_i})}\Big)\,P_{j_1}XP_{j_2}X\cdots XP_{j_{k+1}}. \qquad (16) $$
We consider the components of the summation in (16). For instance, consider the case where some $k_1$ indices from $\{j_1,\cdots,j_{k+1}\}$ are not larger than $r$. W.L.O.G., let $j_1,\cdots,j_{k_1}\le r$ and $j_{k_1+1},\cdots,j_{k+1} > r$. By the Cauchy integral formula, the integral in (16) is zero if $k_1 = 0$ or $k_1 = k+1$; therefore, we only focus on the cases $1\le k_1\le k$. Recall that our goal is to prove the representation formula of Theorem 1. Accordingly, in the above summations, we consider the components where $s_1,\cdots,s_{k_1}\ge 1$ and $s_{k_1+1} = \cdots = s_{k+1} = 0$. It turns out that we need to prove an identity summing over $j_1,\cdots,j_{k_1}\ge 1$, and it suffices to prove that, for all $\mathbf{j} = (j_1,\ldots,j_{k_1})\in\{1,\cdots,r\}^{k_1}$, the identity (17) holds. To prove (17), we rewrite its right hand side. Given any $\mathbf{j} = (j_1,\cdots,j_{k_1})\in\{1,\cdots,r\}^{k_1}$, define $v_i(\mathbf{j}) = \#\{p: j_p = i\}$ for $1\le i\le r$. Denoting $t_i(\mathbf{j}) = \sum_{p:\,j_p = i}s_p$ for $1\le i\le r$, the right hand side of (17) can be rewritten in terms of $t_1(\mathbf{j}),\cdots,t_r(\mathbf{j})$, where the last equality is due to the fact that $v_1(\mathbf{j}) + \cdots + v_r(\mathbf{j}) = k_1$. Similarly, the left hand side of (17) can be rewritten in the same form. Therefore, in order to prove (17), it suffices to prove that, for any $\mathbf{j} = (j_1,\cdots,j_{k_1})$, the following equality (18) holds,

where we omit the index $\mathbf{j}$ in the definitions of $v_i(\mathbf{j})$ and $t_i(\mathbf{j})$ without causing confusion. The non-negative integers satisfy $v_1 + \cdots + v_r = k_1$. We define the function $\varphi(\eta)$ accordingly and calculate $\frac{1}{2\pi i}\oint_{\gamma_A}\varphi(\eta)\,d\eta$ by the residue theorem. Clearly, $\mathrm{Res}(\varphi, \eta = \infty) = 0$, so it suffices to calculate $\mathrm{Res}(\varphi, \eta = 0)$. To this end, let $\gamma_0$ be a contour around $0$ such that none of $\{\lambda_k\}_{k=1}^r$ lies inside it. Then, by the Cauchy integral formula, we obtain the residue in terms of $f^{(k-k_1)}(x)$, the $(k-k_1)$-th order derivative of $f(x)$. Applying the general Leibniz rule then proves (18). We conclude the proof of Theorem 1.
Denote by $E_1$ the event, defined in (19), that $\|X\|\le C_2\sqrt{d_2}$; our analysis is conditioned on $E_1$. By Theorem 1, on the event $E_1$, we can expand $\hat\Theta\hat\Theta^T - \Theta\Theta^T$ via the series representation. Therefore, we get a decomposition of $\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$ into the leading term $2\|P_{-1}XP_\perp\|_F^2$ plus a remainder, which is negligible when the signal strength $\lambda_r$ is sufficiently strong. We investigate the normal approximation of the leading term. For some $t > 0$, which shall be determined later, define the function $f(X)$ in (20), which involves the cutoff $\varphi\big(\|X\|/(t\sqrt{d_2})\big)$; here we view $X$ as a variable in $\mathbb{R}^{(d_1+d_2)\times(d_1+d_2)}$ and $\varphi(\cdot):\mathbb{R}_+\to[0,1]$ is a smooth truncation function. Clearly, $\varphi(s)$ is Lipschitz with constant 1. Lemma 4 shows that $f(\cdot)$ is Lipschitz when $\lambda_r\ge C_4\sqrt{d_2}$. The proof of Lemma 4 is in the Appendix, Section B.1.

Lemma 4. There exist absolute constants such that $f(\cdot)$ is Lipschitz whenever $\lambda_r\ge C_4\sqrt{d_2}$.
By Lemma 4 and the Gaussian isoperimetric inequality (see, e.g., Koltchinskii and Lounici (2016, 2017)), it holds with probability at least $1-e^{-s}$, for any $s\ge 1$, that $f(X)$ concentrates around its expectation, for some absolute constant $C_5 > 0$. Now, set $t = C_2$, where $C_2$ is defined in (19); therefore, $\varphi\big(\|X\|/(C_2\sqrt{d_2})\big) = 1$ on the event $E_1$. Meanwhile, a moment bound holds, where the last inequality requires $e^{-c_1d_2/2}\le 1/\sqrt{d_2}$ and we used the fact that $\mathbb{E}^{1/p}\|X\|^p\le C_6\sqrt{d_2}$ for some absolute constant $C_6 > 0$ and any positive integer $p$; see, e.g., Koltchinskii and Xia (2016), Vershynin (2010) and Tao (2012). Together with (21), it holds with probability at least $1-e^{-s}-e^{-c_1d_2}$, for any $s\ge 1$, that the remainder is controlled, for some absolute constant $C_6 > 0$. Therefore, for any $s\ge 1$, with probability at least $1-e^{-s}-e^{-c_1d_2}$, the bound (22) holds, where we assumed $d_2\ge 3r$. We next prove the normal approximation of $2\|P_{-1}XP_\perp\|_F^2$. Similarly as in Xia (2019), by the definitions of $P_{-1}$, $X$ and $P_\perp$, we can write $\|P_{-1}XP_\perp\|_F^2$ explicitly.

Denote by $z_j\in\mathbb{R}^{d_1}$ the $j$-th column of $Z$ for $1\le j\le d_2$. Then $z_1,\cdots,z_{d_2}$ are independent Gaussian random vectors with $\mathbb{E}z_jz_j^T = I_{d_1}$ for all $j$. Therefore, the relevant quadratic forms can be expressed through $\{e_j\}_{j=1}^{d_2}$, the standard basis vectors of $\mathbb{R}^{d_2}$. Similarly, since $U^Tz_j$ and $U_\perp^Tz_j$ are independent Gaussian random vectors, we may work with $\{u_j\}_{j=r+1}^{d_1}$ and $\{v_j\}_{j=r+1}^{d_2}$, the columns of $U_\perp$ and $V_\perp$, respectively. Observe that $Z^Tu_j\sim N(0, I_{d_2})$ for all $r+1\le j\le d_1$, and that $\{Z^Tu_j\}_{j=r+1}^{d_1}$ are independent normal random vectors. Similarly, $Zv_j\sim N(0, I_{d_1})$ are independent for all $r+1\le j\le d_2$. Clearly, $V^TZ^Tu_{j_1}\sim N(0, I_r)$ and $U^TZv_{j_2}\sim N(0, I_r)$ are all independent for $r+1\le j_1\le d_1$ and $r+1\le j_2\le d_2$.
As a result, letting $d = d_1 + d_2 - 2r$, we conclude that $2\|P_{-1}XP_\perp\|_F^2$ is distributed as $2\sum_{j=1}^{d}\|\Lambda^{-1}g_j\|^2$, where we abuse notation and denote by $g_j\sim N(0, I_r)$ independent standard normal vectors. By the Berry-Esseen theorem (Berry (1941) and Esseen (1942)), it holds for some absolute constant $C_7 > 0$ that the normal approximation bound (24) is valid, where we used the fact that $\mathrm{Var}(\|\Lambda^{-1}g_j\|^2) = 2\|\Lambda^{-2}\|_F^2$. In (24), the function $\Phi(x)$ denotes the c.d.f. of the standard normal distribution.
Recall that, on the event $E_1$, $\mathrm{dist}^2[(\hat U,\hat V),(U,V)]$ decomposes into the leading term, whose normal approximation is given in (24), and a remainder, whose upper bound is given in (22). Based on (22), we get, for any $x\in\mathbb{R}$ and any $s\ge 1$, a one-sided bound, where the last inequality is due to (24) and the Lipschitz property of $\Phi(x)$. Similarly, a matching bound holds for any $x\in\mathbb{R}$ and any $s\ge 1$. Finally, we conclude that, for any $s\ge 1$, the claimed Berry-Esseen bound holds, where $d = d_1 + d_2 - 2r$ and $C_6, C_7, c_1$ are positive absolute constants.
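The variance fact used above is easy to check numerically (a standalone sketch, ours):

```python
import numpy as np

rng = np.random.default_rng(5)
r = 4
lam = np.array([9.0, 7.0, 5.0, 3.0])

# Var(||Lambda^{-1} g||^2) with g ~ N(0, I_r) equals 2 * ||Lambda^{-2}||_F^2.
g = rng.standard_normal((200_000, r))
q = np.sum((g / lam) ** 2, axis=1)
print(q.var(), 2 * np.sum(lam ** -4.0))   # the two numbers should agree
```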

A.2. Proof of Theorem 6
It suffices to prove the decomposition (25), where the first term on the RHS of (25) converges to $N(0,1)$ in view of Theorem 4 and (9). It then suffices to show that the second term on the RHS of (25) converges to 0 in distribution. We prove this in two cases.

A.3. Proof of lemmas in Section 4
Observe that $S_{A,k}(X)$ involves the product of $k$ copies of $X$. If $k$ is odd, we immediately get $\mathbb{E}S_{A,k}(X) = 0$, since $Z$ has i.i.d. standard normal entries. Therefore, it suffices to investigate $\mathbb{E}\langle\Theta\Theta^T, S_{A,k}(X)\rangle$ for even $k$.
Proof of Lemma 1. By the definitions of $P_\perp$, $X$ and $P_{-1}$, and by the proof of Theorem 2, we obtain $\mathbb{E}\|P_\perp XP_{-1}\|_F^2 = (d_1 + d_2 - 2r)\|\Lambda^{-1}\|_F^2$, which is the first claim. To prove the second claim, it holds by Theorem 1 that the residual can be expanded, where we used the fact $\Theta\Theta^TP_0 = P_0\Theta\Theta^T = 0$. Then the stated bound follows for some absolute constant $C_2 > 0$, where the last inequality holds as long as $\lambda_r\ge 5C_1\sqrt{d_2}$.

where $\Theta = (\theta_1,\cdots,\theta_r,\theta_{-r},\cdots,\theta_{-1})\in\mathbb{R}^{(d_1+d_2)\times(2r)}$. In addition, we can write the expansion explicitly. Observe that, for any integer $p\ge 0$, a simple moment bound holds. Observe also that $Zv_{j_1}$ is independent of $ZV_\perp$ and $Z^Tu_{j_1}$ is independent of $Z^TU_\perp$. Therefore, the cross terms can be controlled, where the last inequality is due to the independence between $Z^Tu_{j_2}$ and $Z^TU_\perp$, and the independence between $Zv_{j_2}$ and $ZV_\perp$. We conclude that the stated bound holds, where $C_2 > 0$ is some absolute constant and the last inequality is due to $e^{-c_1d_2}\le d_2^{-1}$. We now finalize the proof. If one of the $t_i$'s is odd, then there exists at least one other $t_j$ which is also odd, since the sum of the $t_i$'s is even. Following the same analysis, we conclude that the corresponding contribution is negligible whenever any of $t_1,\cdots,t_{\tau-1}$ is an odd number. Therefore, it suffices to consider the cases where all of $t_1,\cdots,t_{\tau-1}$ are even numbers.
Proof of Lemma 2. From the above analysis, to calculate $\mathbb{E}\langle\Theta\Theta^T, S_{A,4}(X)\rangle$, it suffices to calculate trace terms indexed by $t_1,\cdots,t_{\tau-1}$ and $s_1,\cdots,s_\tau$, where $t_1,\cdots,t_{\tau-1}$ are positive even numbers and $s_1,\cdots,s_\tau$ are positive integers.

A.4. Proof of Lemma 3.
To characterize $\mathbb{E}\langle\Theta\Theta^T, S_{A,2k}(X)\rangle$ more easily, we observe the following property.

Since $\theta_{j_1}$ and $\theta_{j_2}$ are orthogonal, we conclude that $X\theta_{j_1}$ and $X\theta_{j_2}$ are independent normal vectors, from which we deduce that $\theta_{j_1}^TW_{t_1}\theta_{j_2}\,|\,P_\perp XP_\perp$ is sub-exponential. As a result, we conclude, for some absolute constants $C_1, C_2 > 0$, that the dominating terms come from those tuples $(j_1, j_2,\cdots,j_{\tau-1})$ such that $|j_1| = |j_2| = \cdots = |j_{\tau-1}|$. Now, we define $\bar P_j = \lambda_jP_j + \lambda_{-j}P_{-j}$. We then conclude, for some absolute constants $C_1, C_2 > 0$, that it suffices to focus on the effect of individual singular values (i.e., any fixed $1\le j\le r$). Moreover, it is easy to check that $\bar P_j^{-s} = \lambda_j^{-s}\big(P_j + (-1)^sP_{-j}\big)$, implying that the $k$-th order error term has denominator $\lambda_j^{2k}$. To this end, we prove the following lemma in the Appendix.
By Lemma 5 and (27), the stated approximation bound holds for all $k\ge 2$ and some absolute constants $C_1, C_2 > 0$, which concludes the proof.

A.5. Proof of CLT theorems in Section 5
Proof of Theorem 3. Recalling Theorem 2, we end up with the normal approximation centered at the expectation; by Lemma 1, we may replace the expectation with $B_1$ up to the stated error. Therefore, by the Lipschitz property of $\Phi(x)$ and applying similar techniques as in the proof of Theorem 2, we obtain the claimed bound, where the last inequality holds as long as $\lambda_r\ge 9t\sqrt{d_2}$. Therefore, we conclude the proof of Lemma 4.
Proof of Lemma 5. Based on Property 2 and eq. (27), it suffices to calculate the quantities $\mathbb{E}\,\mathrm{tr}\big(\bar P_j^{-s_1}W_{t_1}\bar P_j^{-s_2}W_{t_2}\cdots\bar P_j^{-s_\tau}\big)$, which rely only on the singular value $\lambda_j$ and the singular vectors $u_j, v_j$. Moreover, the actual forms of $u_j, v_j$ do not affect the values. By choosing $\{u_j\}_{j=1}^r$ and $\{v_j\}_{j=1}^r$ as the first $r$ canonical basis vectors in $\mathbb{R}^{d_1}$ and $\mathbb{R}^{d_2}$, it is easy to check that the calculation reduces to the rank-one spiked model with singular value $\lambda_j$. To leverage the dimensionality effect, where $U_\perp^TZV_\perp\in\mathbb{R}^{(d_1-1)\times(d_2-1)}$ has i.i.d. standard normal entries, we consider the rank-one spiked model $\hat M = \lambda uv^T + Z$, where $Z$ has i.i.d. standard normal entries. Let $\hat u$ and $\hat v$ denote the leading left and right singular vectors of $\hat M$. By fact (27), it suffices to calculate the $k$-th order approximation of $\|\hat u\hat u^T - uu^T\|_F^2 + \|\hat v\hat v^T - vv^T\|_F^2$. In the proof, we calculate the errors $\|\hat u\hat u^T - uu^T\|_F^2$ and $\|\hat v\hat v^T - vv^T\|_F^2$ separately. W.L.O.G., we deal with $\|\hat u\hat u^T - uu^T\|_F^2$ and consider $d_1\le d_2$. Recall that we aim to calculate the $k$-th order error term in $\|uu^T - \hat u\hat u^T\|_F^2$. To this end, we write out the error terms $E_{2k}$ and show that $E_{2k} = (-1)^kd^{k-1}$, up to an error involving an absolute constant $C_1 > 0$. We then apply the second-order moment trick (see Xia and Zhou (2019)); denoting $T = \lambda^2(u\otimes u)$ and $\Delta = \lambda uv^TZ^T + \lambda Zvu^T + ZZ^T$, by eq. (4) we can write the desired expansion.