Reduced rank regression with matrix projections for high-dimensional multivariate linear regression model

In this work, we incorporate matrix projections into the reduced rank regression method, and develop reduced rank regression estimators based on random projection and orthogonal projection for the high-dimensional multivariate linear regression model. We propose a consistent estimator of the rank of the coefficient matrix and establish prediction performance bounds for the proposed estimators in terms of mean squared error. Finally, simulation studies and a real data analysis demonstrate that the proposed methods possess good stability, prediction performance and rank consistency compared to some existing methods.


Introduction
Multivariate linear regression methods are widely used statistical tools in regression analysis. In general, a multivariate linear regression with n observations, r responses and p predictors can be expressed as
Y = XB + ε,   (1)
where Y ∈ R^{n×r} denotes a multivariate response matrix, X ∈ R^{n×p} represents a matrix of predictors, ε ∈ R^{n×r} is an error matrix whose entries ε_ij are independent of each other with mean zero and variance σ_ij^2, and B ∈ R^{p×r} is the regression coefficient matrix. The model in (1) is the foundation of multivariate regression analysis, its aim being to study the relationship between X and Y through the regression coefficient matrix B.
For model (1), the ordinary least-squares (OLS) estimator of B is B̂ = X^+ Y, where X^+ denotes the Moore-Penrose inverse of X.
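As a concrete illustration, the OLS estimator B̂ = X^+ Y can be computed with NumPy's pseudoinverse; the dimensions and data below are arbitrary simulated values, not from the paper. The sketch also checks the equivalence with solving one least-squares problem per response.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 50, 8, 5                     # illustrative sizes: n observations, p predictors, r responses
X = rng.standard_normal((n, p))
B_true = rng.standard_normal((p, r))
Y = X @ B_true + 0.1 * rng.standard_normal((n, r))

# OLS estimator B_hat = X^+ Y, with X^+ the Moore-Penrose inverse of X
B_ols = np.linalg.pinv(X) @ Y

# Equivalent to performing OLS separately for each response column
B_cols = np.linalg.lstsq(X, Y, rcond=None)[0]
```

Since no constraint ties the responses together, the matrix estimator and the column-by-column fits coincide, which is exactly the drawback discussed next.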

Existing work
The OLS method with multiple responses, under no constraints, is equivalent to performing OLS estimation for each response variable separately, and so the estimator does not exploit the possible correlation between the responses. In practice, it is often realistic to assume that the response variables are correlated. One way of avoiding this drawback of the OLS method is to consider the reduced rank regression (RRR) model [19]. Reduced rank regression allows the rank of B to be less than min(p, r), so that the model can be parametrized as B = B_1 B_2, where B_1 ∈ R^{p×d}, B_2 ∈ R^{d×r}, and rank(B_1) = rank(B_2) = d. The decomposition B = B_1 B_2 is non-unique since, for any orthogonal matrix O ∈ R^{d×d}, B*_1 = B_1 O and B*_2 = O^T B_2 yield another valid decomposition satisfying B = B*_1 B*_2 = B_1 B_2. Nevertheless, the parameter B of interest is identifiable, and span(B_1) = span(B) and span(B_2^T) = span(B^T). Under constraints on B_1 and B_2, such as B_2 B_2^T = I_d or B_1^T B_1 = I_d, [2, 19, 21] derived the maximum likelihood estimators of the RRR parameters. As there are linear constraints on the regression coefficients, the number of effective parameters is reduced and, as a result, the prediction accuracy may be improved. In high-dimensional data, a large number of predictor variables is typically available, but some of them may not be useful for prediction. Bunea et al. [4] proposed a rank selection criterion for selecting the optimal reduced rank estimator of the regression coefficient matrix in multivariate response regression models, and derived minimax optimal bounds on the mean squared errors of the estimators. Chen et al. [6] proposed an adaptive nuclear norm penalization method with low-rank matrix approximation, and developed a method for simultaneous dimension reduction and coefficient estimation in high-dimensional multivariate linear regression.
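The classical unpenalized rank-d RRR estimator can be sketched as follows: the rank-d fit is the truncated SVD of the OLS fit P_X Y, and the estimator is recovered through the pseudoinverse. The sizes, noise level and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r, d = 60, 10, 6, 2
B1 = rng.standard_normal((p, d))       # rank-d factorization B = B1 B2
B2 = rng.standard_normal((d, r))
X = rng.standard_normal((n, p))
Y = X @ (B1 @ B2) + 0.1 * rng.standard_normal((n, r))

# Rank-d reduced rank estimator: truncate the SVD of the OLS fit P_X Y
P_X_Y = X @ np.linalg.pinv(X) @ Y
U, s, Vt = np.linalg.svd(P_X_Y, full_matrices=False)
fit_d = (U[:, :d] * s[:d]) @ Vt[:d]    # best rank-d approximation of the fit
B_rrr = np.linalg.pinv(X) @ fit_d      # reduced rank estimator, rank(B_rrr) <= d
```

With the true rank d supplied, the truncation discards only noise directions, so B_rrr stays close to B1 B2; choosing d in practice is the rank estimation problem addressed later.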
If some column vectors of the predictor matrix X are nearly linearly dependent (a situation known as multicollinearity), the OLS estimator is known to perform poorly. Similarly, the performance of the reduced rank estimator is not satisfactory when the predictor variables are highly correlated or the signal-to-noise ratio is small. To overcome this problem, a reduced rank ridge regression estimator has been proposed by incorporating a ridge penalty into reduced rank regression [13, 6].
Dimension reduction is a way to reduce the number of random variables using various mathematical and statistical tools. In this regard, random projection is a widely used dimension reduction method in the statistical and machine learning literature. Dobriban and Liu [8] examined different random projection methods in a unified framework, and derived explicit formulas for the accuracy loss of these methods compared to ordinary least squares. Ahfock et al. [1] studied statistical properties of sketched regression algorithms, obtaining new distributional results for a large class of sketched estimators and a conditional central limit theorem for the sketched dataset. Wang et al. [25] examined matrix ridge regression problems based on the classical sketch and the Hessian sketch. Thanei et al. [22] discussed applications of random projections in linear regression models, together with their computational costs and theoretical guarantees on the generalization error. Furthermore, random projection ideas have also been applied to problems in classification [5], clustering [9], and convex optimization [15, 16, 17]. Principal components regression (PCR), based on principal components analysis, is another classical dimension reduction tool, and PCR is commonly used to overcome the multicollinearity problem. Slawski [20] found a connection and made a comparison between principal components regression and random projection methods in classical linear regression.

Main contributions of this work
In this work, we propose three reduced rank estimators with a nuclear norm penalty in the multivariate linear regression model, based on a single random projection, averaged random projections and principal components analysis, respectively. Estimation performance bounds for the proposed estimators are established in terms of mean squared error. Simulation studies and a real data analysis demonstrate that, under certain conditions, the proposed estimators possess good stability and prediction performance compared to some existing methods. In our model, the dimensions p and r can each be either less than or greater than the sample size n. Moreover, the entries ε_ij of the error matrix can have different variances σ_ij^2. Thus, the model considered here differs from those of Bunea et al. [4] and Chen et al. [6], in which all entries of the error matrix are assumed to have the same variance σ^2; their models become a special case of the model considered here. We also develop a consistent estimation approach for the rank of the regression coefficient matrix, and the practical performance of the proposed rank estimation method is demonstrated through simulation studies.

Notation
For a matrix A ∈ R^{n×p}, λ_i(A) denotes the ith largest singular value of A. For m, n ∈ R, m ∧ n denotes min{m, n}. The Frobenius norm, nuclear norm and spectral norm of A are denoted by ||A||_F = (Σ_i λ_i^2(A))^{1/2}, ||A||_* = Σ_i λ_i(A) and ||A||_2 = λ_1(A), respectively. P_A denotes the orthogonal projection matrix A(A^T A)^+ A^T. The vec(·) operator transforms an n × m matrix into an nm-dimensional column vector by stacking the columns of the matrix below each other. A ⊗ B denotes the Kronecker product of two matrices A and B. Finally, tr(·) denotes the trace of a square matrix.

Reduced rank regression with matrix projections
Yuan et al. [27] proposed a reduced rank estimator with nuclear norm penalty by minimizing the penalized least squares criterion
(1/2)||Y − XB||_F^2 + P_μ(B),   (2)
where P_μ(B) = μ||B||_*. The penalty produces sparsity among the singular values and thus achieves dimension reduction and shrinkage estimation simultaneously. Chen et al. [6] developed a new method for simultaneous dimension reduction and coefficient estimation in high-dimensional multivariate regression via adaptive nuclear norm penalization. For this, they replaced the penalty P_μ(B) with μ Σ_i ω_i λ_i(XB), an adaptive nuclear norm of XB, where the ω_i's are non-negative weights.
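The effect of a plain (non-adaptive) nuclear norm penalty can be illustrated by singular value soft-thresholding, the closed-form proximal step behind criteria of this type; this is a generic sketch with illustrative data, not the authors' adaptive weighting scheme.

```python
import numpy as np

def svt(A, mu):
    """Singular value soft-thresholding: the closed-form minimizer of
    0.5 * ||A - M||_F^2 + mu * ||M||_* over all matrices M."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - mu, 0.0)) @ Vt

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
M = svt(A, mu=1.0)   # small singular values are set to zero: simultaneous shrinkage and rank reduction
```

Singular values below μ are zeroed and the rest are shrunk by μ, which is the "sparsity among the singular values" described above.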

Reduced rank regression with single random projection
The OLS estimator and the reduced rank estimator may both perform poorly when the column vectors of the data matrix are highly correlated. To avoid this problem, we develop in this work a two-step estimation method: first, a low-rank matrix is used to approximate the data matrix, and then a reduced rank regression is performed with a nuclear norm penalty. Motivated by the work of Halko et al. [10], we take the low-rank matrix approximation of X to be
X̃ = QQ^T X,   (4)
where XS = QR, S ∈ R^{p×k} is a standard Gaussian matrix (a random matrix whose entries are independent standard normal variables), Q ∈ R^{n×k} is a matrix with k orthonormal columns, and R ∈ R^{k×k} is an upper triangular matrix with positive diagonal elements. Inspired by Eq. (4) and the work of Chen et al. [6], a reduced rank estimator with nuclear norm penalty, based on random projection, can then be derived by minimizing the penalized least squares criterion
(1/2)||Y − X̃B||_F^2 + μ||X̃B||_*.   (5)
The following proposition shows that a closed-form global minimizer of (5) can be found.
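The randomized low-rank approximation X̃ = QQ^T X of Halko et al. can be sketched directly; the sizes below are illustrative, and X is built with a planted low rank so that the quality of the approximation is easy to see.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 100, 40, 12
# Design matrix with planted rank 10 (below the sketch size k)
X = rng.standard_normal((n, 10)) @ rng.standard_normal((10, p))

S = rng.standard_normal((p, k))   # standard Gaussian sketching matrix
Q, R = np.linalg.qr(X @ S)        # XS = QR; Q has k orthonormal columns
X_tilde = Q @ Q.T @ X             # low-rank approximation X_tilde = Q Q^T X
```

When rank(X) ≤ k, the range of XS almost surely captures the range of X and X̃ recovers X exactly; in general the error is governed by the discarded singular values (Lemma 4 in the Appendix).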
The result in Proposition 2.1 follows directly from Lemma 1. The rank of the coefficient matrix B, denoted by r_0, can be regarded as the number of effective linear combinations of predictor variables relating to the response variables. In practice, we need to estimate the rank of B. The form of (6) indicates that the quality of the rank estimator is related to the signal-to-noise ratio and the value of k. By combining the approaches of Bunea et al. [4] and Chen et al. [6], we develop here a rank estimator of B that can be expressed as
r̂ = max{ i : λ_i(P_X̃ Y) > kμ/(η r_X) },   (7)
where k, r_X and η represent the number of columns of the random projection matrix S, the rank of the predictor matrix X and a pre-specified constant, respectively. In practice, the values of k and μ can be obtained by cross-validation.
The following theorem shows that the proposed rank selection method consistently recovers the true rank r_0 under certain conditions.

Theorem 2.1. Suppose the entries of ε ∈ R^{n×r} are independent of each other and ε_ij ∼ N(0, σ_ij^2). Also, let μ = η r_X (1 + θ) √(2V(P_X̃ ε) log(n + r)) / (kδ) and λ_{r_0}(XB) > 2kμ/(η r_X), for any θ > 0 and some δ ∈ (0, 1]. Then, we have
P(r̂ = r_0) ≥ 1 − exp(−θ^2 log(n + r)).

Theorem 2.1 holds when n or r tends to infinity, and p is not restricted. Therefore, the rank consistency of the proposed estimator holds in both classical and high-dimensional settings.
In order to evaluate the performance of the proposed estimators and demonstrate the results obtained, we need to decompose the predictor matrix X and construct two sub random matrices of S as follows.
For a matrix X ∈ R^{n×p} with rank(X) = q and q ≤ n ∧ p, the singular value decomposition (SVD) can be expressed as
X = ΓΛP^T = Γ_1 Λ_1 P_1^T + Γ_2 Λ_2 P_2^T,   (8)
where Γ is an n × q column orthonormal matrix, Γ_1 is the n × l matrix whose columns are the top l left singular vectors of X, Γ_2 is similarly the n × (q − l) matrix whose columns are the bottom q − l left singular vectors of X, Λ is a q × q diagonal matrix, Λ_1 is the l × l diagonal matrix consisting of the top l singular values of X (l < q), Λ_2 is similarly the (q − l) × (q − l) diagonal matrix consisting of the bottom q − l singular values of X, P is a p × q column orthonormal matrix, P_1 is the p × l matrix whose columns are the top l right singular vectors of X, and P_2 is similarly the p × (q − l) matrix whose columns are the bottom q − l right singular vectors of X. Let S_1 = P_1^T S and S_2 = P_2^T S. Then, S_1 and S_2 are l × k and (q − l) × k matrices, respectively. Further, S_1 and S_2 are independent since P_1 and P_2 are column orthonormal matrices with P_1^T P_2 = 0.
In this section, we obtain results that bound the difference between the true value and its estimate based on a single random projection. Our analysis is separated into two parts: first, we describe bounds on the probability of a large deviation; next, we present results on expected values. When no expectation is taken, ||X̃B̂ − XB||_F is a random variable, and in this situation the following theorem gives an error bound holding with a certain probability.
Theorem 2.2. Suppose X̃, B̂ and Λ_2 are as defined in (4), (6) and (8), respectively. Further, let the entries ε_ij of the error matrix ε be independent of each other, each following a normal distribution N(0, σ_ij^2), and let k ≥ l + 4, with l a non-negative integer. Then, for any p × r matrix C with r(C) ≤ r_0, some δ ∈ (0, 1] and all γ ≥ 1, t ≥ 1, the error bound holds with failure probability at most exp(−θ^2 log(n + r)). More specifically, setting C = B yields the bound in (9), where k represents the number of columns of the random projection matrix S.
The following theorem provides a bound for the expected error in the Frobenius norm.

Theorem 2.3. Suppose X̃ and B̂ are as defined in (4) and (6), respectively. Then, for any p × r matrix C with r(C) ≤ r_0, the expected error bound holds; more specifically, setting C = B yields the bound in (10), where k represents the number of columns of the random projection matrix S.
The expected error bound in (10) reveals some interesting features. The bound depends on the value of k, with the first term decreasing and the second term increasing in k. Thus, the choice of k balances the two terms, and can be made so that their sum is minimized.
To compare the expected error bounds obtained by using the approximation of X (the proposed methods) with those obtained without approximating X in model (1), the following corollary is given.
Remark 1. The value of the second term on the right hand side of (10) may be less than the value of the term on the right hand side of (11), and so the sum of the two terms on the right hand side of (10) may also be less than the value of the term on the right hand side of (11), especially when the trailing singular values of X (the entries of Λ_2) are small.

Reduced rank regression with averaged random projection
We have studied the reduced rank estimator with a single random projection. The variance of the reduced rank estimator is an important issue in practical applications; it can be reduced by averaging multiple reduced rank estimators obtained from different random projections. In this section, we propose a reduced rank regression with averaged random projections, inspired by the works of [3, 20, 22].
Let S_1, . . . , S_N ∈ R^{p×k} be independent random projection matrices. Then, we define the averaged random projection estimator as in (12).

Proposition 2.2. Suppose X̃, B̂ and the averaged estimator are as defined in (4), (6) and (12), respectively. We then have the bound in (13). The proof is similar to that of Proposition 4 of Slawski [20]. The result in (13) suggests that the reduced rank estimator with averaged random projection reduces the estimation error, significantly improving the efficiency of the estimator.
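A rough illustration of the averaging idea, under the assumption (one plausible reading of Eq. (12)) that fits from independent sketches are averaged; for simplicity each individual fit here is a plain pseudoinverse fit on X̃ rather than the penalized estimator, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, r, k, m = 80, 30, 5, 15, 20
X = rng.standard_normal((n, p))
B_true = rng.standard_normal((p, 2)) @ rng.standard_normal((2, r))
Y = X @ B_true + 0.1 * rng.standard_normal((n, r))

def sketch_fit(X, Y, k, rng):
    """One fit based on a single random projection: X_tilde = Q Q^T X."""
    S = rng.standard_normal((X.shape[1], k))
    Q, _ = np.linalg.qr(X @ S)
    X_tilde = Q @ Q.T @ X
    return np.linalg.pinv(X_tilde) @ Y

# Averaging over m independent projections reduces the sketch-induced variance
fits = [sketch_fit(X, Y, k, rng) for _ in range(m)]
B_avg = np.mean(fits, axis=0)
```

Each individual fit depends on its own random S, so the fits differ; the average removes much of that projection noise, which is the variance reduction argued for above.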

Reduced rank regression with principal components analysis
For a data matrix X, let ΓΛP^T be the SVD of X as in (8). Then, the top k principal components can be extracted from X by setting X_1 = XP_1, where P_1 ∈ R^{p×k} denotes the matrix of the top k right singular vectors of X. We can then obtain a low-rank matrix approximation of X by using the top k right singular vectors of X as
X̃ = XP_1 P_1^T.   (14)
Eq. (14) is different from Eq. (4). First, P_1 in (14) is a deterministic matrix derived from the principal components analysis of X, while Q in (4) is a random matrix obtained via the QR factorization of X multiplied from the right by a random matrix S. Second, in (14) X̃ is obtained by multiplying X from the right by the orthogonal projection matrix P_1 P_1^T, whereas in (4) X̃ is obtained by orthogonally projecting X onto the column space of Q.
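The PCA-based approximation X̃ = XP_1 P_1^T coincides with the best rank-k approximation of X in Frobenius norm, which the following sketch (illustrative sizes) verifies against a truncated SVD.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, k = 60, 20, 4
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)            # center the columns, as in PCA

# Top-k right singular vectors of X
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P1 = Vt[:k].T                  # p x k, deterministic (no random sketch)
X_pca = X @ P1 @ P1.T          # X_tilde = X P1 P1^T
```

Unlike the randomized X̃ of (4), this approximation is deterministic given X, which is why no failure probability appears in the corresponding bounds.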
By minimizing ||Y − X̃B||_F^2, we obtain the ordinary principal components regression estimator
B̂ = X̃^+ Y.   (15)
Further, a reduced rank estimator with nuclear norm penalty in terms of principal components analysis is obtained by minimizing the penalized least squares criterion (16).

Proposition 2.3. Let X̃ = XP_1 P_1^T and let P_X̃ Y have the singular value decomposition ÛD̂V̂^T. Then, a minimizer of (16) is given in (17). Similarly, the estimated rank of B can be expressed as in (18), where k represents the number of principal components used.

Corollary 2.5. Suppose X̃ and B̂ are as defined in (14) and (17), respectively. Then, for any θ > 0 and some δ ∈ (0, 1], the error bound holds with failure probability at most exp(−θ^2 log(n + r)), where k represents the number of principal components used.

Corollary 2.6. Suppose X̃ and B̂ are as defined in (14) and (17), respectively. Then the expected error bound in (19) holds, where k represents the number of principal components used.
Remark 2. In practice, k should be less than or equal to q. When k = q, the right hand side of (19) is equal to the right hand side of (11). On the other hand, the value of the second term on the right hand side of (19) is less than the value of the term on the right hand side of (11) when k < q, and so the sum of the two terms on the right hand side of (19) may be less than the value of the term on the right hand side of (11).

Simulation study
In this section, we carry out a simulation study to compare the proposed methods with some existing methods in terms of estimation accuracy, prediction accuracy and performance of rank recovery. The simulated data are from the true model in (1). We consider two cases: one in which both p and r are less than n, and another in which p and r are greater than n. The specific details of these two cases are as follows. (a) The row vectors of the predictor matrix X were independently generated from the multivariate normal distribution N(0, Σ_X), where the (i, j)th element of Σ_X is ρ^{|i−j|}, the correlation between the corresponding pair of elements in a row vector of the predictor matrix. The coefficient matrix B is constructed as B = cB_1 B_2. Here, c > 0 is a pre-specified constant, called the signal intensity, which controls the magnitude of the entries of B. All elements of B_1 and B_2 are drawn from the uniform (0, 1) distribution. The entries ε_ij of the error matrix ε are from N(0, σ_ij^2), where σ_ij is drawn from the uniform (a, b) distribution with 0 ≤ a < b. We set the correlation coefficient ρ = 0.1, 0.5 and 0.9, and the signal intensity c = 0.5 and 0.05. In addition, we take the values of (a, b) as (0, 1), (2, 3) and (4, 5). In this case, three scenarios are considered for the simulation study, with Scenario 1: ρ = 0.1, 0.5, 0.9, c = 0.5 and (a, b) = (0, 1); Scenario 2: ρ = 0.1, 0.5, 0.9, c = 0.05 and (a, b) = (2, 3); Scenario 3: ρ = 0.1, 0.5, 0.9, c = 0.05 and (a, b) = (4, 5). We then simulated 100 data sets with n = 60, p = r = 30, r_0 = 10 in all three scenarios. (b) The setting is similar to that in (a), except that the 100 simulated data sets have n = 30, p = r = 40, r_0 = 10.
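The data-generating mechanism of case (a) can be sketched as follows, assuming the construction B = cB_1 B_2 with uniform factors described above (the factor dimensions p × r_0 and r_0 × r are our assumption, chosen so that rank(B) = r_0).

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, r, r0 = 60, 30, 30, 10
rho, c = 0.5, 0.5            # Scenario-1-style settings
a, b = 0.0, 1.0

# AR(1)-type covariance: (Sigma_X)_{ij} = rho^{|i-j|}
idx = np.arange(p)
Sigma_X = rho ** np.abs(idx[:, None] - idx[None, :])
L = np.linalg.cholesky(Sigma_X)
X = rng.standard_normal((n, p)) @ L.T      # rows ~ N(0, Sigma_X)

# Rank-r0 coefficient matrix B = c * B1 B2 with uniform(0, 1) factors
B1 = rng.uniform(0, 1, (p, r0))
B2 = rng.uniform(0, 1, (r0, r))
B = c * B1 @ B2

# Heteroscedastic errors: eps_ij ~ N(0, sigma_ij^2), sigma_ij ~ uniform(a, b)
sigma = rng.uniform(a, b, (n, r))
Y = X @ B + sigma * rng.standard_normal((n, r))
```

Varying ρ, c and (a, b) as listed reproduces the three scenarios; the entry-wise σ_ij is what distinguishes this design from the homoscedastic settings of [4] and [6].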
In the tables and figures presented, ANR, RRR and RAN denote the adaptive nuclear norm penalized estimator [6], the rank penalized estimator [4] and the robustified adaptive nuclear norm penalized estimator [6], respectively. PNR, SNR and MSN represent the nuclear norm penalized estimator with principal components analysis, with a single random projection and with averaged random projections, respectively, while MRE denotes the median rank estimate, reported together with the correct rank recovery percentage.
We made use of cross-validation for selecting k and μ, and compared several different values of η on different generated data sets; these results show that η = 1/2 is a good choice, and so η = 1/2 is used in the following comparisons. ANR, RRR and RAN were computed by using the R package "rrpack", and we implemented all the proposed estimators in R as well. For all methods, the estimation accuracy of the regression coefficient is measured by a mean squared error criterion.

Results for case (a): In terms of the estimation accuracy of the regression coefficient, from Table 1 and Figure 1 we see that the estimation errors of all methods grow as the correlation coefficient ρ increases, and all methods have similar performance when the signal-to-noise ratio is large. The estimation errors of the ANR and RRR methods still grow with increasing ρ when the signal-to-noise ratio is small, while the estimation errors of the PNR, SNR and MSN methods all decrease; therefore, the three proposed methods perform better than the other methods in this situation. Moreover, the estimation errors of the ANR, RRR and RAN methods all increase as the signal-to-noise ratio decreases, for all three correlation coefficients used. The estimation errors of the PNR, SNR and MSN methods also decrease as the signal-to-noise ratio decreases when the correlation coefficient is large, and remain smaller than those of the other methods. Furthermore, we see that the performance of the PNR, SNR and MSN methods is quite stable with changes in the correlation coefficient ρ, and comparable to that of the ANR, RRR and RAN methods. For the prediction accuracy of the regression function, all methods have similar performance when the signal-to-noise ratio is large, but the prediction errors of all methods decrease with increasing values of ρ. The RRR method performs poorly, especially when the signal-to-noise ratio is small.
In most situations, the PNR and MSN methods perform better than the other methods for all values of ρ when the signal-to-noise ratio is small. The SNR and MSN methods perform well when the correlation between the predictor variables is small, while the PNR method performs well when the correlation coefficient ρ is large. For the performance of rank recovery, as seen in Table 1, the proposed methods are better than the existing ones in terms of the median rank estimate and the correct rank recovery percentage.
Results for case (b): In this high-dimensional case, in terms of estimation accuracy, the estimation errors of all methods are relatively large and decrease with increasing values of ρ when the signal strength is high; even so, the proposed methods have smaller estimation errors than the other methods in this situation. The other comparisons are similar to those for case (a) in terms of estimation accuracy, prediction accuracy and the performance of rank recovery.

Illustrative example
A breast cancer dataset was first used by Chin et al. [7]. It contains 89 samples comprising gene expression measurements and comparative genomic hybridization measurements. This dataset has been analyzed by Witten et al. [26] and Chen et al. [6], and the data are available in the R package PMA. It has been shown that some types of cancer are characterized by abnormal alterations of DNA copy number [14]. It is therefore of interest to identify the relationship between DNA copy numbers and RNA expression levels. Here, we regress copy-number variations on the gene expression profile, since such a prediction model can identify copy-number changes related to function. We consider chromosome 18, where p = 294, r = 51 and n = 89. We centered and scaled both the predictor matrix X and the response matrix Y. For comparison of prediction accuracy, the prediction mean squared error (PMSE) is defined as
PMSE = ||Y_t − X_t B̂||_F^2 / (n_t r),
where (Y_t, X_t) represents the test dataset of size n_t and B̂ represents the estimator of B corresponding to each method. We randomly split the data into a training set of size 70 and a test set of size 19. The training dataset is used to fit the model, and the test dataset is then used to evaluate the prediction performance of the estimators. All tuning parameters were selected by ten-fold cross-validation. As seen in Table 3 and Figure 3, the proposed estimators PNR, SNR and MSN are better than the other estimators in terms of prediction performance and stability. More specifically, the prediction error of the MSN method is the smallest and the most stable, followed by the PNR method, while the RRR method performs poorly in this case. Although the RAN method is better than the ANR method, it is not as good as the methods proposed in this work.
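The split-and-evaluate procedure can be sketched as follows, assuming the PMSE takes the scaled squared Frobenius form ||Y_t − X_t B̂||_F^2 / (n_t r) (our assumption, which may differ from the paper's exact scaling); simulated data stand in for the breast cancer dataset, and a simple OLS fit stands in for the competing estimators.

```python
import numpy as np

def pmse(Y_test, X_test, B_hat):
    """Prediction mean squared error on a held-out test set
    (assumed form: squared Frobenius error scaled by n_t * r)."""
    n_t, r = Y_test.shape
    return np.linalg.norm(Y_test - X_test @ B_hat, 'fro') ** 2 / (n_t * r)

rng = np.random.default_rng(7)
n, p, r = 89, 20, 5                     # 89 samples, as in the real data; p, r illustrative
X = rng.standard_normal((n, p))
B = rng.standard_normal((p, r))
Y = X @ B + 0.1 * rng.standard_normal((n, r))

# Random 70/19 train-test split, as in the breast cancer analysis
perm = rng.permutation(n)
train, test = perm[:70], perm[70:]
B_hat = np.linalg.pinv(X[train]) @ Y[train]
err = pmse(Y[test], X[test], B_hat)
```

Repeating the split 100 times and collecting `err` for each method yields distributions like those summarized in Figure 3.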

Discussion
(Table 3 note: Rank and PMSE represent the estimated rank and the prediction mean squared error, respectively. The numbers in parentheses are the corresponding standard deviations. Fig 3: The distribution of PMSE based on the data split at random 100 times.)

We are considering the model exactly as in Bunea et al. [4] and Chen et al. [6], wherein the elements within each row are assumed to be independent. However, we allow for heterogeneity among the components in terms of different variances, which generalizes their model. It will be of interest to consider dependence between components within the rows, and this is something we wish to consider for our future work. A weighted regression scheme can be taken into account in the following manner. Let Σ = [σ_ij]_{n×r} and Σ_j = [diag(σ_j)]^{−1}, where diag(·) represents a diagonal matrix with the enclosed vector on its diagonal, and σ_j denotes the jth column vector of Σ, j = 1, . . . , r. Let I_j be an r × r matrix with the (j, j)th entry being 1 and the remaining entries being 0.
Here, we refer to Eq. (22) as a multivariate weighted linear regression model, in contrast to Eq. (23). Using model (22), weighted counterparts of the proposed estimators can be obtained. Moreover, suppose the σ_ij's have a common factor σ_0, with σ_ij = σ_0 σ⁰_ij, Σ_j = [diag(σ⁰_j)]^{−1} and σ⁰_j = σ_0^{−1} σ_j. We then have ε* ∈ R^{n×r} as an error matrix with entries ε*_ij independent of each other with mean zero and variance σ_0^2, which is similar to the error matrix in Bunea et al. [4] and Chen et al. [6].
Now notice that model (1) may be expressed in vector form as in (24), where vec(Y) and vec(ε) are nr × 1 vectors, vec(B) is a pr × 1 vector, and Cov(vec(ε)) = [diag(vec(Σ))]^2. Thus, the model in (24) is a general linear regression model with heteroscedasticity. Also, the models in (21) and (22) can be written in this form with W = Σ_{j=1}^r (I_j ⊗ Σ_j) a weighting matrix and Cov(W vec(ε)) = I_r ⊗ I_n. If the σ_ij's have the common factor σ_0, then Cov(W vec(ε)) = σ_0^2 I_r ⊗ I_n. In practice, when the σ_ij's are unknown, we can utilize some existing methods (such as [18, 11]) to estimate them in terms of the model in (24).

Appendix A: Some useful lemmas
In order to establish the propositions, theorems and corollaries, we introduce the following lemmas.

Lemma 1.
Let P_X̃ Y have the singular value decomposition P_X̃ Y = UDV^T. Then, for any μ ≥ 0, a global optimal solution of (26) is equivalent to (27). Using von Neumann's trace inequality [24, 12], we obtain the corresponding bound, with equality holding when X̃B has the singular value decomposition X̃B = U diag[λ_i(X̃B)] V^T. Thus, (27) can be re-expressed as (28), and it is clear that the objective function in (28) is minimized by soft-thresholding the singular values.

Lemma 2. Assume that there exists an index m ≤ r_0 such that λ_{m+1}(XB) ≤ (1 − δ)kμ/(η r_X) and λ_m(XB) > (1 + δ)kμ/(η r_X) for some δ ∈ (0, 1]. Then,
P(r̂ ≠ m) ≤ P(||P_X̃ ε||_2 > δkμ/(η r_X)).

Proof. From (7), we have r̂ > m ⟺ λ_{m+1}(P_X̃ Y) > kμ/(η r_X) and r̂ < m ⟺ λ_m(P_X̃ Y) ≤ kμ/(η r_X). Combining the singular value perturbation bounds with the assumed conditions on λ_{m+1}(XB) and λ_m(XB), the proof of the lemma gets completed.

Lemma 3.
Let {M_l} be a finite sequence of fixed n × r matrices, and let {ξ_l} be a finite sequence of independent standard normal variables. Consider the matrix Gaussian series Z = Σ_l ξ_l M_l, and let V(Z) be the matrix variance statistic of the sum, that is,
V(Z) = max{ ||Σ_l M_l M_l^T||_2, ||Σ_l M_l^T M_l||_2 }.
Then E||Z||_2 ≤ √(2V(Z) log(n + r)) and, for all t_0 ≥ 0, we have
P{ ||Z||_2 ≥ E||Z||_2 + t_0 } ≤ exp(−t_0^2 / (2V(Z))).

Proof. The proof of Lemma 3 can be found in Chapter 4 of Tropp [23].

Lemma 4.
Let the SVD of X be as in (8). For a fixed l ≥ 0, suppose S_1 has full row rank. Then, the approximation error satisfies
||(I − P_{XS})X||_F^2 ≤ ||Λ_2||_F^2 + ||Λ_2 S_2 S_1^+||_F^2.

Lemma 5. If the matrices M and N are fixed, and a standard Gaussian matrix G is drawn, then E||MGN||_F^2 = ||M||_F^2 ||N||_F^2.

Lemma 6. Suppose g is a Lipschitz function on matrices satisfying
|g(X) − g(Y)| ≤ L||X − Y||_F for all X, Y,
where L denotes the Lipschitz constant. Draw a standard Gaussian matrix G. Then,
P{ g(G) ≥ E g(G) + Lt } ≤ exp(−t^2/2).

Lemma 7. Let G be an l × k standard Gaussian matrix, and k ≥ l + 4. Then, for all t ≥ 1,
P{ ||G^+||_2 ≥ e√k t/(k − l + 1) } ≤ t^{−(k−l+1)}.

Proof. Proofs of Lemmas 4-7 can be found in Halko et al. [10].
Proof. Let g(X) = ||Λ_2 X S_1^+||_F. Then, by the triangle inequality and standard norm properties, we obtain the Lipschitz bound in (32). By the properties of expected values and Lemma 5, we then obtain the expectation bound in (34). We now define, for all t ≥ 1, the event in (35).
Combining (32), (34) and (35), we obtain the stated bound for all γ, t ≥ 1. Moreover, by basic properties of probability and the use of Lemma 7, we obtain the tail bound, from which the desired inequality follows. Upon using Lemma 4 and the fact that P_{XS} = QQ^T, the proof gets completed.
Further, by Lemmas 2 and 3, and letting t_0 = θ√(2V(P_X̃ ε) log(n + r)), we obtain the stated probability bound. Thus, the proof of the theorem gets completed.

B.2. Proof of Theorem 2.2
Proof. For any p × r matrix C, by the definition of B̂, we have a basic inequality. Recalling the model identity Y = XB + ε, this implies an intermediate bound. From the facts that P_X̃ X̃ = X̃, ⟨M, N⟩_F ≤ ||M||_2 ||N||_* and ||N||_* ≤ √(r(N)) ||N||_F, together with the triangle inequality of norms, we obtain the bounds (37), (38) and (39). Combining (37), (38) and (39) yields (40), and the error bound then follows. As shown in the proof of Theorem 2.1, P{ ||P_X̃ ε||_2 ≥ δμ } ≤ exp(−θ^2 log(n + r)).
Thus, we obtain the stated bound with failure probability at most exp(−θ^2 log(n + r)). Further, setting C = B gives the specialized bound. Upon combining (41) and (42), and using Lemma 8, the proof gets completed.

B.3. Proof of Theorem 2.3
Proof. Starting from (40), note that ε = [ε_ij]_{n×r} is an error matrix whose entries are independent of each other with mean zero and variance σ_ij^2, and so the expectations can be evaluated term by term. Further, setting C = B gives the specialized bound. Upon combining (43)-(48), we complete the proof of the theorem.

B.4. Proof of Corollary 2.4
Proof. From (40) and the proof of Theorem 2.3, we obtain the starting bound. Note that X = ΓΛP^T and X^+ = PΛ^{−1}Γ^T, and so P_X = X(X^T X)^+ X^T = XX^+ = ΓΓ^T. By combining (49) and (50), we complete the proof.