Principal Regression for High Dimensional Covariance Matrices

This manuscript presents an approach to perform generalized linear regression with multiple high dimensional covariance matrices as the outcome. Model parameters are estimated by maximizing a pseudo-likelihood. When the data are high dimensional, the normal likelihood function is ill-posed because the sample covariance matrix is rank-deficient. Thus, a well-conditioned linear shrinkage estimator of the covariance matrix is introduced, with shrinkage coefficients that are common across matrices. Theoretical studies demonstrate that the proposed covariance matrix estimator is asymptotically optimal, achieving the uniformly minimum quadratic loss among all linear combinations of the identity matrix and the sample covariance matrix. Under regularity conditions, the proposed estimator of the model parameters is consistent. The superior performance of the proposed approach over existing methods is illustrated through simulation studies. Applied to a resting-state functional magnetic resonance imaging study acquired from the Alzheimer's Disease Neuroimaging Initiative, the proposed approach identified a brain network within which functional connectivity is significantly associated with Apolipoprotein E $\varepsilon$4, a strong genetic marker for Alzheimer's disease.


Introduction
In this manuscript, we study a regression problem with covariance matrices as the outcome under a high dimensional setting. Suppose y_{it} ∈ R^p is a p-dimensional random vector, the tth acquisition from subject i, for t = 1, . . . , T_i and i = 1, . . . , n, where T_i is the number of observations of subject i and n is the number of subjects. Let T_max = max_i T_i. The high dimensionality refers to the scenario where T_max ≪ p. The data, y_{it}, are assumed to follow a normal distribution with covariance matrix Σ_i. Here, without loss of generality, the distribution mean is assumed to be zero, as the study interest focuses on the covariance matrices. Let x_i ∈ R^q denote the q-dimensional covariates of interest acquired from subject i. For the covariance matrices, we assume the following log-linear regression model, considered in Zhao et al. (2019): for i = 1, . . . , n,

    γ^⊤ Σ_i γ = exp(x_i^⊤ β),    (1)

where γ ∈ R^p is a linear projection and β ∈ R^q is the model coefficient. In x_i, the first element is set to one to include the intercept term. The goal is to estimate γ and β using the observed data {(y_{i1}, . . . , y_{iT_i}), x_i}_{i=1}^n. One application of such a regression problem is to analyze covariate-associated variations in brain coactivation in a functional magnetic resonance imaging (fMRI) study, where covariance/correlation matrices of the fMRI signals are generally utilized to reveal the coactivation patterns. Characterizing these patterns with population/individual covariates is of great interest in neuroimaging studies (Seiler and Holmes, 2017; Zhao et al., 2019). Another example is the study of financial stock market data. Considering a pool of stock returns, covariance matrices over a period of time capture the comovement or synchronicity of the stocks. Firm- and market-level information, such as industry type, firm's cash flow, stock size, and book-to-market ratio, plays an essential role in determining the synchronicity.
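The log-linear relationship in Model (1) can be illustrated numerically. The sketch below (all dimensions, eigenvalue choices, and covariate values are hypothetical) constructs a covariance matrix whose eigenvalue along a given projection γ follows the model exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 5, 2
beta = np.array([0.5, -1.0])        # hypothetical model coefficients
x_i = np.array([1.0, 1.0])          # first element is the intercept

# Build an orthonormal basis; the first column plays the role of gamma.
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
gamma = Q[:, 0]

# Eigenvalue along gamma follows the log-linear model; the rest are arbitrary.
lam = np.array([np.exp(x_i @ beta), 3.0, 2.0, 1.0, 0.5])
Sigma_i = Q @ np.diag(lam) @ Q.T

# Model (1): the variance of the projected data equals exp(x_i' beta).
assert np.isclose(gamma @ Sigma_i @ gamma, np.exp(x_i @ beta))
```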
Quantifying such association is an important topic in financial theory (Zou et al., 2017).
To estimate γ and β, Zhao et al. (2019) proposed a likelihood-based approach that minimizes the negative log-likelihood function in the projection space. One sufficient condition for solving the likelihood-based criterion is that the sample covariance matrices are positive definite.
Thus, the likelihood estimator is ill-posed when T max < p as the sample covariance matrices are rank-deficient. Additionally, it has been shown that when p increases, the sample covariance matrix performs poorly and can lead to invalid conclusions. For example, the largest eigenvalue of the sample covariance matrix is not a consistent estimator, and the eigenvectors can be nearly orthogonal to the truth (Johnstone and Lu, 2009). To circumvent difficulties raised by the high dimensionality, one solution is to impose structural assumptions, such as bandable covariance matrices, sparse covariance matrices, spiked covariance matrices, covariances with a tensor product structure, and latent graphical models (see a review of Cai et al., 2016, and references therein). Another class of high-dimensional covariance matrix estimator is the shrinkage estimator. Daniels and Kass (2001) considered two shrinkage estimators of the covariance matrix, a correlation shrinkage and a rotation shrinkage, offering a compromise between completely unstructured and structured estimators to improve the robustness. Ledoit and Wolf (2004) introduced a well-conditioned estimator of the covariance matrix, which is an optimal linear combination of the identity matrix and the sample covariance matrix under squared error loss. This is equivalent to the optimal linear shrinkage of the eigenvalues while retaining the eigenvectors. Instead of a linear combination, Ledoit and Wolf (2012) extended this work to nonlinear transformations of the sample eigenvalues and presented a way of finding the transformation that is asymptotically equivalent to the oracle linear combination.
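As a point of reference, the Ledoit and Wolf (2004) linear shrinkage idea can be sketched in a few lines of code. The plug-in quantities below follow the standard recipe of shrinking the sample covariance toward a scaled identity; the function name and the simulated dimensions are illustrative only:

```python
import numpy as np

def lw_shrinkage(Y):
    """Ledoit-Wolf (2004) style linear shrinkage of the sample covariance.

    Y : (T, p) data matrix with mean-zero rows.
    Returns a well-conditioned estimate rho*mu*I + (1-rho)*S.
    """
    T, p = Y.shape
    S = Y.T @ Y / T                       # sample covariance
    mu = np.trace(S) / p                  # average eigenvalue (target scale)
    d2 = np.sum((S - mu * np.eye(p))**2) / p
    # Estimate of the error of S around Sigma (capped above by d2).
    b2 = sum(np.sum((np.outer(y, y) - S)**2) for y in Y) / (T**2 * p)
    b2 = min(b2, d2)
    rho = b2 / d2                         # shrinkage intensity in (0, 1]
    return rho * mu * np.eye(p) + (1 - rho) * S

rng = np.random.default_rng(1)
p, T = 50, 20                             # high dimensional: T < p
Y = rng.standard_normal((T, p))
S_star = lw_shrinkage(Y)
# Unlike the rank-deficient sample covariance, S_star is positive definite.
assert np.all(np.linalg.eigvalsh(S_star) > 0)
```

The returned matrix is a convex combination of the identity (scaled by the average sample eigenvalue) and the sample covariance, so its smallest eigenvalue is bounded away from zero even when T < p.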
To model multiple covariance matrices, procedures include regression-type approaches introduced by Anderson (1973), Chiu et al. (1996), Hoff and Niu (2012), Fox and Dunson (2015), and Zou et al. (2017); (common) principal component analysis related methods by Flury (1984), Boik (2002), Hoff (2009), and Franks and Hoff (2019); and methods based on other types of matrix decomposition, such as the Cholesky decomposition (Pourahmadi et al., 2007). Among these, Fox and Dunson (2015) introduced a scalable nonparametric covariance regression model applying low-rank approximation. Franks and Hoff (2019) generalized a Bayesian hierarchical model studying the heterogeneity in the covariance matrices to high dimensional settings. Compared to the above-mentioned approaches, Model (1) offers higher flexibility in modeling the relationship with the covariates. For example, x can be either continuous or categorical, and one can easily include interactions and/or polynomials of the covariates.
In the high dimensional setting considered in this study, γ and β, as well as n covariance matrices will be estimated under Model (1). Because of its computational efficiency and explicit formulations of the tuning parameters, the linear shrinkage approach will be generalized to multiple covariance estimates. Interestingly, it will be shown that estimating each covariance matrix separately, such as using the shrinkage estimator proposed in Ledoit and Wolf (2004), leads to suboptimal estimation accuracy for γ, β and Σ i 's. Thus, a linear shrinkage estimator of the covariance matrix is proposed, of which the shrinkage coefficients are shared across matrices. With the shrinkage estimator, it is proposed to estimate (γ, β) through maximizing a pseudo-likelihood.
The framework proposed in this manuscript has three major contributions.
(1) To our knowledge, this is the first attempt to analyze a large number of high-dimensional covariance matrices varying with covariates in a regression setting.
(2) The proposed shrinkage estimator of the covariance matrices is well-conditioned and asymptotically achieves the uniformly minimum quadratic risk among all linear combinations of the identity matrix and the sample covariance matrix.
(3) Under regularity conditions, the proposed approach achieves consistent estimators of the parameters.
The rest of the paper is organized as follows. Section 2 introduces the proposed shrinkage estimator of the covariance matrices and the pseudo-likelihood based method of estimating γ and β. Section 3 studies the asymptotic properties. In Section 4, the superior performance of the proposed approach over existing methods is demonstrated through simulation studies. Section 5 articulates an application to a resting-state fMRI data set acquired from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Section 6 concludes this paper with a discussion. Technical proofs are collected in the supplementary materials.

Method
Considering the regression model (1), it is proposed to estimate the parameters by solving the following optimization problem.
where Σ̂_i is a positive definite estimator of the covariance matrix Σ_i, to be discussed later, for i = 1, . . . , n; and H ∈ R^{p×p} is a positive definite matrix, set to be the average of the Σ̂_i's. It is essential to impose a constraint on γ, as otherwise the objective function of (2) is minimized at γ = 0 for fixed β. When Σ̂_i = S_i = ∑_{t=1}^{T_i} y_{it} y_{it}^⊤ / T_i (i.e., the sample covariance matrix), which is the proposal in Zhao et al. (2019), solving (2) is equivalent to minimizing the negative log-likelihood function of {γ^⊤ y_{it}}_{i,t} assuming the data are normally distributed. However, when T_max = max_i T_i < p, problem (2) is ill-posed as the S_i's are rank-deficient. Thus, the goal of this manuscript is to propose a well-conditioned estimator of Σ_i that yields optimal properties. To achieve this, a covariate-dependent linear shrinkage estimator, denoted Σ*_i, is proposed, which minimizes the expected squared loss over the shrinkage coefficients (µ, ρ) under regression model (1), where the expectation is taken over the sample covariance matrix S_i. In Section 3, we show that S*_i is a consistent estimator of Σ*_i and is asymptotically uniformly optimal, with respect to the quadratic risk, among all linear combinations of the sample covariance matrices and the identity matrix. The objective function ℓ(β, γ) is an approximation of the negative log-likelihood function when Σ̂_i is replaced by the proposed shrinkage estimator S*_i. Thus, optimizing (2) can be considered a pseudo-likelihood approach under the normality assumption.
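For fixed γ and fixed covariance estimates, the criterion in β reduces to a smooth convex problem that Newton's method solves quickly. The sketch below assumes the log-linear form of the objective from Zhao et al. (2019), namely ∑_i T_i (x_i'β + exp(-x_i'β) s_i) with s_i = γ'Σ̂_i γ; the data, sizes, and function name are all hypothetical:

```python
import numpy as np

def estimate_beta(x, s, T, n_iter=50):
    """Newton's method for the beta-step with gamma held fixed.

    Objective (up to constants): sum_i T_i * (x_i'b + exp(-x_i'b) * s_i),
    where s_i = gamma' Sigma_hat_i gamma.  This is convex in b.
    x : (n, q) covariates, s : (n,) projected variances, T : (n,) sample sizes.
    """
    q = x.shape[1]
    b = np.zeros(q)
    for _ in range(n_iter):
        w = T * s * np.exp(-x @ b)            # working weights
        grad = x.T @ (T - w)
        hess = x.T @ (w[:, None] * x)         # positive definite
        b -= np.linalg.solve(hess, grad)
    return b

# Hypothetical check: projected variances generated exactly from the model.
rng = np.random.default_rng(2)
n = 200
x = np.column_stack([np.ones(n), rng.binomial(1, 0.5, n)])
beta_true = np.array([0.3, -1.0])
s = np.exp(x @ beta_true)                      # noiseless projected variances
T = np.full(n, 50.0)
beta_hat = estimate_beta(x, s, T)
assert np.allclose(beta_hat, beta_true, atol=1e-6)
```

Because the objective is convex in β with a positive definite Hessian, the Newton iteration recovers the generating coefficients exactly in this noiseless check.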
The proofs of Theorem 1 and Lemma 1 are presented in Section A.1 of the supplementary materials.
Formulation (3) introduces a shrinkage estimator of the covariance matrix in which the shrinkage is shared across subjects and is optimal under the squared error loss. For each subject, Σ*_i is a linear combination of the sample covariance matrix S_i and the identity matrix. The weighting parameters, ρ and µ, are population-level parameters shared across subjects. This is equivalent to imposing a linear shrinkage on the sample eigenvalues. Assuming γ is a common eigenvector of all the covariance matrices, µ is the average eigenvalue corresponding to γ. The level of shrinkage is determined by the balance between the accuracy of the S_i's and the variation in the eigenvalues. If the S_i's are accurate, or the errors are small relative to the variation in the eigenvalues, less shrinkage is imposed; otherwise, if the S_i's are inaccurate and the errors are comparable to or even larger than the eigenvalue variability, the sample covariance matrices are shrunk more.
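The shared-shrinkage idea can be sketched as follows. This is a simplified illustration that pools Ledoit-Wolf-type plug-in quantities across subjects to obtain common coefficients (µ, ρ); the manuscript's estimator S*_i additionally depends on the projection γ, so the function below is not the proposed estimator itself:

```python
import numpy as np

def pooled_shrinkage(Ys):
    """Linear shrinkage with coefficients (mu, rho) shared across subjects.

    Ys : list of (T_i, p) mean-zero data matrices.
    Returns the list of shrunk estimates rho*mu*I + (1-rho)*S_i.
    Illustrative pooling only; the manuscript's S_i* uses
    gamma-dependent plug-in quantities.
    """
    p = Ys[0].shape[1]
    Ss = [Y.T @ Y / len(Y) for Y in Ys]
    mu = np.mean([np.trace(S) / p for S in Ss])          # pooled target scale
    d2 = np.mean([np.sum((S - mu * np.eye(p))**2) / p for S in Ss])
    b2 = np.mean([
        sum(np.sum((np.outer(y, y) - S)**2) for y in Y) / (len(Y)**2 * p)
        for Y, S in zip(Ys, Ss)
    ])
    rho = min(b2, d2) / d2                               # common intensity
    return [rho * mu * np.eye(p) + (1 - rho) * S for S in Ss]

rng = np.random.default_rng(3)
p, n = 40, 5
Ys = [rng.standard_normal((20, p)) for _ in range(n)]    # T_i = 20 < p
for S_star in pooled_shrinkage(Ys):
    assert np.all(np.linalg.eigvalsh(S_star) > 0)        # well-conditioned
```

Pooling the plug-in quantities borrows strength across subjects, which is the intuition behind sharing ρ and µ rather than estimating a separate intensity per matrix.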
Algorithm 1 summarizes the optimization procedure. As problem (2) is nonconvex, a series of random initializations of (γ, β) is considered, and the one achieving the minimum value of the objective function is taken as the estimate. The initial values of γ can be set as the eigenvectors of the average sample covariance matrix, S̄ = ∑_{i=1}^n T_i S_i / ∑_{i=1}^n T_i; and the initial value of β is the corresponding solution to (2) obtained by replacing Σ̂_i with a well-conditioned estimator, such as the one proposed in Ledoit and Wolf (2004). When p < ∑_{i=1}^n T_i, S̄ is of full rank, and the sample eigenvectors are consistent estimators assuming all the covariance matrices have the same eigendecomposition.
Step 3 in the algorithm updates the covariance matrix estimators with a global shrinkage parameter. In Section 4, through simulation studies, we show that it improves the performance in estimating the covariance matrices and β with lower bias and higher stability. The details of updating γ and β in Step 4 can be found in Algorithm 1 in Zhao et al. (2019).
For higher-order components, one can first remove the identified components and use the new data to estimate the next one under an additional orthogonality constraint, that is, the new component is orthogonal to the identified ones. Different from Algorithm 2 in Zhao et al. (2019), there is no need to include a rank-completion step, as S*_i is introduced to resolve the rank-deficiency issue. To determine the number of components, the metric of average deviation from diagonality proposed in Zhao et al. (2019) is adopted. Let Γ^(k) ∈ R^{p×k} denote the first k estimated components; the average deviation from diagonality is defined as

    DfD(Γ^(k)) = ∏_{i=1}^n { det(diag(Γ^(k)⊤ Σ̂_i Γ^(k))) / det(Γ^(k)⊤ Σ̂_i Γ^(k)) }^{T_i / ∑_{i'} T_{i'}},

where diag(A) is a diagonal matrix of the diagonal elements in a square matrix A, and det(A) is the determinant of A. If Γ^(k)⊤ Σ̂_i Γ^(k) is a diagonal matrix for all i = 1, . . . , n, then DfD(Γ^(k)) = 1. In practice, k can be chosen before DfD increases far away from one or before a sudden jump occurs.
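Under the weighted geometric-mean form of the criterion in Zhao et al. (2019) (assumed here; by Hadamard's inequality the metric is always at least one), DfD can be computed directly:

```python
import numpy as np

def dfd(Gamma, Sigmas, Ts):
    """Average deviation from diagonality of Gamma' Sigma_i Gamma.

    Weighted geometric-mean form: equals 1 when every projected matrix is
    exactly diagonal, and grows as off-diagonal mass appears.
    """
    Ts = np.asarray(Ts, dtype=float)
    w = Ts / Ts.sum()
    log_dfd = 0.0
    for Sigma, wi in zip(Sigmas, w):
        M = Gamma.T @ Sigma @ Gamma
        # log det(diag(M)) - log det(M), accumulated with weight T_i / sum T
        log_dfd += wi * (np.sum(np.log(np.diag(M))) - np.linalg.slogdet(M)[1])
    return np.exp(log_dfd)

# Hypothetical check with diagonal covariances: DfD of true eigenvectors is 1.
p, k = 6, 2
Gamma = np.eye(p)[:, :k]
Sigmas = [np.diag(np.arange(1.0, p + 1)), np.diag(np.arange(2.0, p + 2))]
assert np.isclose(dfd(Gamma, Sigmas, [50, 80]), 1.0)
```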

Asymptotic Properties
In this section, we study the asymptotic properties of the proposed estimators. For i = 1, . . . , n, it is assumed that Σ_i has the eigendecomposition Σ_i = Π_i Λ_i Π_i^⊤, where Λ_i = diag{λ_{i1}, . . . , λ_{ip}} is a diagonal matrix and Π_i = (π_{i1}, . . . , π_{ip}) is an orthonormal rotation matrix; {λ_{i1}, . . . , λ_{ip}} are the eigenvalues and the columns of Π_i are the corresponding eigenvectors.
Let Y_i = (y_{i1}, . . . , y_{iT_i})^⊤ ∈ R^{T_i×p} denote the data matrix of subject i, and define Z_i = Y_i Π_i. Under the normality assumption, the columns of Z_i = (z_{itj})_{t,j} are uncorrelated, and the rows, z_{it} = (z_{it1}, . . . , z_{itp})^⊤ ∈ R^p for t = 1, . . . , T_i, are normally distributed with mean zero and covariance matrix Λ_i. The following assumptions are imposed.
Assumption A1 There exists a constant C_1, independent of T_max, such that p/T_max ≤ C_1.
Assumption A3 There exists a constant C_2, independent of T_min and T_max, such that ∑_{j=1}^p E(z_{i1j}^8)/p ≤ C_2, for all i ∈ {1, . . . , n}.
Assumption A4 Let Q denote the set of all the quadruples that are made of four distinct integers between 1 and p, for ∀ i ∈ {1, . . . , n}, where |Q| is the cardinality of set Q.
Assumption A5 All the covariance matrices share the same set of eigenvectors, i.e., Π i = Π, for i = 1, . . . , n. For each Σ i , there exists (at least) a column, indexed by j i , such that γ = π ij i and Model (1) is satisfied.
Assumption A1 allows the data dimension, p, to be greater than the (maximum) number of observations, T_max, and to grow at the same rate as T_max. This is a common regularity condition for shrinkage estimators (Ledoit and Wolf, 2004). Assumption A2 guarantees that the average sample covariance matrix S̄ = ∑_{i=1}^n T_i S_i / N utilized in the initial step of Algorithm 1 is positive definite. Together with Assumption A5, the eigenvectors of S̄ are consistent estimators of Π (Anderson, 1963). Assumptions A3 and A4 regulate the higher-order moments of z_{it}, which is equivalent to imposing restrictions on the higher-order moments of y_{it}. When the data are assumed to be normally distributed, both A3 and A4 are satisfied. Assumption A5 assumes that all the covariance matrices share the same eigenspace, though the ordering of the eigenvectors may differ. When p/T_min → 0, the proposed estimators are consistent (Proposition 1). To prove Proposition 1, we first study the asymptotic properties of S*_i and show that S*_i is the optimal linear shrinkage estimator of the covariance matrix under the squared loss. This is accomplished under the assumption that γ is given. As the initialization of γ is already a consistent estimator, the consistency of the solution after iteration follows. For β, it is first shown that the association between the shrinkage estimator, Σ*_i, and the covariates is the same as that of the covariance matrix, Σ_i (Lemma 4). Thus, solving problems (2) and (3) for β is equivalent, and the solution is a consistent estimator of β based on the pseudo-likelihood theory (Gong and Samaniego, 1981). In the iteration step of Algorithm 1, S*_i improves the estimation of the covariance matrices with lower squared loss and, in consequence, improves the estimation of γ and β. In Section 4, the improvement is demonstrated through simulation studies.
In Section 2, the optimization problem (3) introduces a linear combination of the sample covariance matrix and the identity matrix, Σ*_i, that achieves the minimum expected squared error. From Theorem 1, the solution depends on population-level parameters. Thus, the sample counterpart, S*_i, is introduced. Lemma 2 first shows that, asymptotically, the weighting parameters in Σ*_i are well-behaved. Lemma 3 demonstrates that the corresponding sample counterparts of the weighting parameters are consistent estimators. Theorem 2 demonstrates that S*_i performs as well as Σ*_i asymptotically.
Lemma 3. For given (γ, β), as T_min → ∞, the sample counterparts of the weighting parameters are consistent estimators of their population versions.
Theorem 4. Assume (γ, β) is given. With a fixed n ∈ N_+, among all sequences of linear combinations of the identity matrix and the sample covariance matrix in which the combination coefficients are constant over i ∈ {1, . . . , n}, the estimator S*_i attains the minimum average quadratic risk asymptotically; in addition, every sequence of {Σ̂_i}_{i=1}^n that performs as well as {S*_i}_{i=1}^n is asymptotically equivalent to it.
The difference between Σ**_i and Σ*_i is that Σ**_i minimizes the squared loss instead of the expected loss, while asymptotically they are equivalent (Theorems 2 and 3). Theorem 4 presents the main result that, with a fixed sample size n, the proposed shrinkage estimator {S*_i}_{i=1}^n achieves the uniformly minimum (average) quadratic risk asymptotically among all linear combinations of the identity matrix and the sample covariance matrix. Here, "average" refers to an average over the subjects, and "asymptotically" refers to the number of observations within each subject increasing to infinity. Therefore, S*_i is asymptotically optimal. In addition, S*_i is guaranteed to be positive definite (see the discussion in Section A.8 of the supplementary materials). Thus, there exists a unique solution to the optimization problem (2).
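The risk comparison can be probed with a small Monte Carlo experiment. The sketch below (true covariance, dimensions, and the simple Ledoit-Wolf-style shrinkage rule are all illustrative assumptions, not the manuscript's exact S*_i) contrasts the average squared error of a linear shrinkage estimate with that of the raw sample covariance when T < p:

```python
import numpy as np

rng = np.random.default_rng(5)
p, T, n_rep = 30, 15, 20
Sigma = np.diag(np.linspace(0.5, 3.0, p))        # hypothetical true covariance

risk_sample, risk_shrunk = 0.0, 0.0
for _ in range(n_rep):
    Y = rng.multivariate_normal(np.zeros(p), Sigma, size=T)
    S = Y.T @ Y / T                               # rank-deficient: T < p
    mu = np.trace(S) / p
    d2 = np.sum((S - mu * np.eye(p))**2) / p
    b2 = min(sum(np.sum((np.outer(y, y) - S)**2) for y in Y) / (T**2 * p), d2)
    S_star = (b2 / d2) * mu * np.eye(p) + (1 - b2 / d2) * S
    risk_sample += np.sum((S - Sigma)**2) / n_rep
    risk_shrunk += np.sum((S_star - Sigma)**2) / n_rep

# With T < p, shrinkage should reduce the average squared error.
assert risk_shrunk < risk_sample
```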
Next, we study the asymptotic properties of the model coefficient estimator. Let β̂ denote the solution to the optimization problem (2).
Lemma 4. For given γ, the expectation of the linear shrinkage estimator, Σ*_i, satisfies the same covariate association under the projection γ as Σ_i does.
Theorem 5. For given γ, assume Assumptions A1-A5 are satisfied; then β̂ is a consistent estimator of β.
Lemma 4 implies that, under the rotation γ, the expectation of the shrinkage estimator, Σ*_i, has the same association with the covariates as the true covariance matrix, Σ_i, does. S*_i is a consistent estimator of Σ*_i and is positive definite. This substantiates the choice of S*_i to replace the sample covariance matrix S_i in the optimization problem. Theorem 5 states the consistency of β̂.

Simulation Studies

γ is known
In this section, we focus on examining the performance of the proposed method in estimating the covariance matrices and model coefficients by assuming the projection γ is known. Three methods are compared. (i) Estimate each individual covariance matrix using the estimator proposed in Ledoit and Wolf (2004) and replace Σ̂_i with it in the optimization problem (2). We denote this approach as LW-CAP (Ledoit and Wolf based Covariate Assisted Principal regression), where the shrinkage is estimated on each individual covariance matrix. (ii) Estimate the covariance matrices using the proposed shrinkage estimator S*_i in (6). We denote this approach as CS-CAP (Covariate dependent Shrinkage CAP), where the shrinkage parameters are shared across subjects. (iii) Estimate each individual covariance matrix using the sample covariance matrix and plug it into the optimization problem (2). This is the CAP approach proposed in Zhao et al. (2019), which is only applicable when T_min = min_i T_i > p.
The covariance matrices are generated using the eigendecomposition Σ_i = ΠΛ_iΠ^⊤, where Π = (π_1, . . . , π_p) is an orthonormal matrix in R^{p×p} and Λ_i = diag{λ_{i1}, . . . , λ_{ip}} is a diagonal matrix whose diagonal elements are the eigenvalues, for i = 1, . . . , n. In Λ_i, the diagonal elements are exponentially decaying, and the eigenvalues of the second and the fourth dimensions (D2 and D4) satisfy the log-linear model in (1). We consider a case with a single predictor X (thus q = 2), generated from a Bernoulli distribution with probability 0.5 of being one. For D2, the coefficient is β_1 = -1; for D4, β_1 = 1. For the remaining dimensions, λ_{ij}, for i = 1, . . . , n, is generated from a log-normal distribution, where the mean of the corresponding normal distribution decreases from 5 to -1 over j. Cases with p = 20, 50, 100 are considered.
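The simulation design above can be sketched as follows; the noise level on the remaining log-eigenvalues and all variable names are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 20, 50
beta_D2, beta_D4 = -1.0, 1.0                      # coefficients for D2 and D4

# Orthonormal eigenvector matrix shared across subjects.
Pi, _ = np.linalg.qr(rng.standard_normal((p, p)))

# Single binary predictor with intercept (q = 2).
X = np.column_stack([np.ones(n), rng.binomial(1, 0.5, n)])

# Means of the log-eigenvalues decrease from 5 to -1 over the dimensions.
log_means = np.linspace(5, -1, p)

Sigmas = []
for i in range(n):
    log_lam = log_means + 0.1 * rng.standard_normal(p)   # log-normal eigenvalues
    log_lam[1] = X[i] @ np.array([0.0, beta_D2])         # D2: log-linear model
    log_lam[3] = X[i] @ np.array([0.0, beta_D4])         # D4: log-linear model
    Sigmas.append(Pi @ np.diag(np.exp(log_lam)) @ Pi.T)

# Each Sigma_i is symmetric positive definite by construction.
assert all(np.all(np.linalg.eigvalsh(S) > 0) for S in Sigmas)
```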
We first compare the three approaches, LW-CAP, CS-CAP and CAP, under sample sizes n = 50 and T_i = T = 50 for all i, and present the results in Table 1. In the estimation, for dimension j, γ is set to π_j.

γ is unknown
In this section, we evaluate the performance of the CS-CAP approach when γ is unknown and estimated by solving optimization problem (2) using Algorithm 1. The data are generated following the same procedure as in Section 4.1. To evaluate the performance in estimating the projection γ, we consider a similarity metric measured by |⟨γ̂, γ⟩|, where ⟨·, ·⟩ is the inner product of two vectors and γ̂ denotes the estimate of γ. When this metric is one, the two vectors are identical (up to sign flipping); when this metric is zero, the two vectors are orthogonal. The case p = 100 is studied. The performance of the CS-CAP approach is first compared to the LW-CAP approach with sample sizes n = 100 and T_i = T = 100. The results are presented in Table 2.

Data Application

We apply the proposed approach to ADNI resting-state functional magnetic resonance imaging (fMRI) data acquired at the baseline screening. Alzheimer's disease (AD) is an irreversible neurodegenerative disease that destroys memory and related brain functions, causing problems in cognition and behavior.
Apolipoprotein E ε4 (APOE-ε4) has been consistently identified as a strong genetic risk factor for AD. With an increasing number of APOE-ε4 alleles, the lifetime risk of developing AD increases and the age of onset decreases (Corder et al., 1993). Thus, APOE-ε4 is generally treated as a potential therapeutic target (Safieh et al., 2019). In AD studies, resting-state fMRI is another emerging biomarker for diagnosis (Koch et al., 2012), and it is important to articulate the genetic impact on brain functional architecture. In this study, n = 194 subjects diagnosed with either mild cognitive impairment (MCI) or AD are analyzed. Resting-state fMRI data collected at the initial screening are preprocessed.
Time courses are extracted from p = 75 brain regions, including 60 cortical and 15 subcortical regions grouped into 10 functional modules, using the Harvard-Oxford Atlas in FSL (Smith et al., 2004). For each time course, a subsample is taken with an effective sample size of T = 67 to remove the temporal dependence. In the regression model, APOE-ε4, sex and age are entered as the covariates.
The CS-CAP approach is applied to identify brain subnetworks within which the functional connectivity demonstrates a significant association with APOE-ε4. Using the deviation from diagonality criterion, CS-CAP identifies three components. The model coefficients and 95% bootstrap confidence intervals from 500 bootstrap samples are presented in Table 3. From the table, C3 is significantly associated with APOE-ε4 and age; C1 and C2 are significantly associated with sex and age. To better interpret C3, a fused lasso regression (Tibshirani et al., 2005) is employed to sparsify the loading profile, similarly to the sparse principal component analysis proposed in Zou et al. (2006). The fused lasso regularization is defined based on the modular information to impose local smoothness and consistency (Grosenick et al., 2013; Zhao et al., 2020). The selected regions are presented on a brain map. C3 is negatively associated with APOE-ε4, indicating that functional connectivity between regions with loadings of the same sign is lower among APOE-ε4 carriers, while connectivity between regions with loadings of opposite signs is higher. The findings are in line with existing knowledge about AD. Compared to APOE-ε4 non-carriers, more functional connectivity between the left hippocampus and the insular/prefrontal cortex, along with more functional disconnection of the hippocampus, has been observed in APOE-ε4 carriers (De Marco and Venneri; Badhwar et al., 2017). Increased connectivity in the limbic system, including the hippocampus, the amygdala and the thalamus, has been detected in individuals with memory impairment (Gour et al., 2011, 2014), though the effect of APOE-ε4 carriage lacks consensus (Badhwar et al., 2017). It has been shown that limbic hyperconnectivity is positively associated with memory performance, suggesting the preservation of brain function due to increased connectivity in the medial temporal lobe pathology (Gour et al., 2014).

Discussion
In this study, we introduce an approach to perform linear regression with multiple high dimensional covariance matrices as the outcome. A linear shrinkage estimator of the covariance matrix is first introduced, where the shrinkage coefficients are parameters shared across subjects. It is shown that the proposed estimator is asymptotically optimal, achieving the uniformly minimum quadratic loss among all linear combinations of the identity matrix and the sample covariance matrix. Utilizing the well-conditioned estimator of the covariance matrices, a pseudo-likelihood based approach is considered to estimate the linear projection parameter and the model coefficient.
Through simulation studies, the proposed approach demonstrates superior performance in estimating the covariance matrices and the model coefficients, with lower estimation bias and variation than the existing methods. Applied to a resting-state fMRI data set acquired from ADNI, the findings are consistent with existing knowledge about AD.
The proposed framework extends the proposal in Zhao et al. (2019) to the high dimensional scenario.
When p is small, the proposed shrinkage estimator demonstrates lower squared loss than the sample covariance matrix, as suggested by both theoretical results and simulation studies. Different from the linear shrinkage estimator introduced in Ledoit and Wolf (2004), which was proposed for estimating a single covariance matrix, the shrinkage coefficients considered in this study are population-level parameters shared across subjects. This is superior to individual shrinkage because the proposed estimator balances the accuracy of the sample covariance matrices against the variability in the eigenvalues across subjects.
In this study, the asymptotic properties are studied under the assumption that the covariance matrices have the same eigendecomposition. We leave the study of consistency after relaxing this assumption to future research. The proposed shrinkage estimator is optimal with respect to a squared risk; however, this may overshrink the small eigenvalues (Daniels and Kass, 2001). Other types of loss functions, such as Stein's loss, will be considered in the future.

Acknowledgments
Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI).

Supplementary Materials
This supplementary material collects the technical proofs of the theorems in the main text and additional simulation results.

A Theory and Proof
A.1 Proof of Theorem 1 and Lemma 1
For the objective function in (3), in order to minimize it, as the objective function is convex, derivatives are first taken over µ and ρ.
The minimum value of the function is

A.2 Proof of Proposition 1
Proof. Under Assumptions A2 and A5, the eigenvectors of S̄ are consistent estimators of Π. Replacing γ with its estimate in Theorems 2-4 and Theorem 5, the consistency of β̂ follows.

A.3 Proof of Lemma 2
Proof.
(1) For µ: under Assumption A2, the stated convergence holds, where ‖·‖_F denotes the Frobenius norm of a matrix.

A.4 Proof of Lemma 3
Proof. In the proof of Lemma 3, it is assumed that γ is a column of Π_i indexed by j_i, for i = 1, . . . , n (Assumption A5).
(i) First, we prove the consistency ofδ 2 i .
From the above derivation and the fact that both (γ^⊤ S_i γ)^2 and E z_{i1j_i}^4 are bounded, then, as T_min = min_i T_i → ∞, the consistency of δ̂_i^2 follows. The consistency of φ̂_i^2 (for i = 1, . . . , n) and φ̂^2 is straightforward.

A.5 Proof of Theorem 2
In order to prove Theorem 2, the following lemma is firstly introduced. This lemma is also used to prove Lemma A.2 in the next section.
Lemma A.1. If a_i^2 is a sequence of nonnegative random variables (implicitly indexed by T_i) whose expectations converge to zero, for i = 1, . . . , n, and κ_1, κ_2 are two nonrandom scalars, then the stated convergence holds. Analogously, if a^2 is a sequence of nonnegative random variables (implicitly indexed by T_min = min_i T_i) whose expectations converge to zero, and κ_1, κ_2 are two nonrandom scalars, then the analogous conclusion holds.
Consider the complement of the set T_i; since E a_i^2 → 0, there exists an integer T_{i2} such that, for all T_i ≥ T_{i2}, the required bound holds. Bringing together the results inside and outside the set completes the argument. The proof of the second part follows the same strategy. Now, we prove Theorem 2.
Proof. We first prove that S * i is a consistent estimator of Σ * i .
The right-hand side converges to zero; therefore, E a_i^2 → 0 and the conditions in Lemma A.1 are satisfied, and the conclusion follows. An analogous argument gives the remaining convergence. Next, we prove Theorem 3.
First, we need to prove that α^2 − φ^2 converges to zero in quadratic mean.
This proves that α^2 − φ^2 converges to 0 in quadratic mean. In the following, we prove that S*_i is a consistent estimator of Σ**_i.
It is assumed that the samples/subjects are independent. The above quantity on the right is bounded above by a constant; therefore, as T_min → ∞, the convergence follows. Since δ̂^4 is bounded, it can be concluded that, as T_min → ∞, the result holds, which implies the claim.

A.7 Proof of Theorem 4
Proof. For the first statement, by Theorem 3, the second term on the right converges to zero, and the first term is nonnegative by the definition of Σ**_i. For the second statement, as T_min → ∞, the second term on the right-hand side converges to zero. For ε > 0, there exists a constant M > 0 such that when T_min > M, ∑_{j=1}^p λ_{ij}^2/(p T_i) < ε. Thus, ψ_i^2 ≥ (1 − κ) − ε and ψ^2 ≥ (1 − κ) − ε.
For a choice of ε, we have the following. Therefore, for both c ≤ 1 − κ and c > 1 − κ, the smallest eigenvalue of S*_i is bounded away from zero. Analogous to the proof for the largest eigenvalue, for the case where p/T_max does not have a limit, the conclusion also holds for the whole sequence. Since both the largest and the smallest eigenvalues are bounded, S*_i is well-conditioned and invertible.
A.9 Proof of Lemma 4 and Theorem 5
We first prove Lemma 4.
Proof.
Next, we prove that the proposed estimator β is a consistent estimator (Theorem 5).
Proof. Using the consistency of the pseudo-likelihood estimator (Gong and Samaniego, 1981) and the conclusion in Lemma 4, β̂ is a consistent estimator of β.

B Additional Simulation Results

B.1 γ unknown
Here, we present the performance of estimating the fourth dimension (D4) when γ is unknown (Figure B.1). From the figures, as n and T increase, the estimates of the covariance matrices, the projection, and the model coefficient converge to the truth. In Figure B.1, for γ̂, the similarity to π_4 is presented; the data dimension is p = 100, and the sample sizes vary over n = 50, 100, 500, 1000 and T_i = T = 50, 100, 500, 1000.