Multivariate variable selection by means of null-beamforming

This article uses beamforming, a covariate-assisted data projection method, to solve the problem of variable selection for multivariate random-effects regression models. The new approach explores the covariance structure in the data with a small number of random-effects covariates. The basic premise behind the proposal is to scan through a covariate space with a series of forward filters named null-beamformers; each is tailored to a particular covariate in the space and resistant to interference effects originating from other covariates. Applying the proposed method to simulated and real multivariate regression data, we show that it can substantially outperform existing methods of multivariate variable selection in terms of sensitivity and specificity. A theory on selection consistency is established under certain regularity conditions.

MSC2020 subject classifications: Primary 62G08, 62M10; secondary 62P05.


Introduction
The advance of high-throughput technology in science has generated various types of correlated data. Integrative analysis holds great promise for uncovering hidden links between these data. For this purpose, a class of multivariate random-effects regression models is investigated in this paper, where subject-specific random effects are introduced to account for the variation among subjects [11]. Suppose that there are J subjects (or responses in the terminology of multivariate regression) under study, each depending on the same set of random-effects covariates indexed by {1, ..., p}. Let y_j and x_k be column vectors of n measurements on subject j and on covariate k respectively. Then, a multivariate random-effects regression model can be written as

y_j = Σ_{k=1}^p β_{kj} x_k + ε_j, 1 ≤ j ≤ J,

with fixed effects μ_k ∈ R, random-effects coefficients β_{kj} ∈ R and error vectors ε_j ∈ R^n. We assume that the coefficient vectors β_j = (β_{1j}, ..., β_{pj})^T have mean μ = (μ_1, ..., μ_p)^T and covariance matrix Σ and that the error vectors ε_j have mean zero and unknown covariance matrix Λ. We further assume that the random-effects coefficients and the error vectors are independent of each other. As estimation of the fixed effects has already been studied in [17], we focus on inference for the random-effects coefficients in the above model.
Denote by X = (x_1, ..., x_p) the observations on all covariates. Then the conditional covariance matrix cov(y_j|X) admits the decomposition

cov(y_j|X) = XΣX^T + Λ = Σ_{k=1}^p Σ_{l=1}^p σ_{kl} x_k x_l^T + Λ,

where the diagonal entry σ_{kk} of Σ is called the variance component associated with covariate k. This shows that the above random-effects model provides a way to describe the covariance structure of the data. However, when the number of covariates, p, is larger than both n and J, the problem of estimating either these variance components or the random-effects coefficients is ill-posed: the commonly used multivariate least squares criterion does not provide a unique solution. To tackle this issue, we impose a sparsity assumption on the model, namely that only a small number of variance components in the above covariance matrix decomposition have positive values. We are interested in identifying these non-zero variance components. We refer to covariate k as a non-active covariate when β_{k1} = · · · = β_{kJ} = constant. By definition, σ_{kk} = 0 holds if and only if P(covariate k is non-active) = 1. So the above problem is equivalent to a multivariate variable selection problem in which we want to infer the active covariates of a multivariate regression model [17]. A conventional remedy for variable selection is to penalize the magnitudes of the regression coefficients in the least squares criterion. As the penalty increases, estimates are zeroed out, and a subset model is then identified and estimated. Such a remedy is of particular interest when the dimension p is large and the candidate covariates contain many redundant or irrelevant variables. The variable selection procedure LASSO [19] follows this remedy, and over the past two decades much progress has been made along this direction [6,26], among others. As recent research on variable selection mainly focuses on the univariate response setting, limited work has been done on multiple-response settings, e.g., [2,15,16,4,17,24,13].
Despite the above progress, a few issues remain to be addressed. First, most of these methods have been developed for independent measurements, yet there are various applications in which the measurements on each subject are dependent. For instance, sensitivity measurements of a drug can be dependent because the cell lines used in these measurements exhibit genetic relatedness when they are associated with the same types of cancers [10]. In multiple genome-wide association studies, individual genotypes in a subject group are correlated [25]. In neuroimaging, measurements from different sensors outside a brain are dependent as they are generated from the same neuronal sources inside the brain [20]. In finance, returns of different stocks are correlated due to the so-called cross-sectional dependence [9]. Secondly, the existing methods mentioned above are mainly for multivariate fixed-effects regression models, where, given the values of the covariates X, the response covariance structure is determined only by the error terms. In contrast, multivariate random-effects regression models are hierarchical: conditional on the values of the covariates, the responses depend not only on the error terms but also on the random-effects coefficients [11]. Although, in principle, multivariate regression data can be fitted by either a fixed-effects model or a random-effects model, a comparison between these two approaches has not been made in the literature [2]. Finally, most of the existing inference procedures are not computationally scalable to large-scale data with many subjects or many responses, which prohibits their application to big data.
Here, we address these issues by generalizing the idea of beamforming, a covariate-assisted data projection method [23], to multivariate regression settings. Our contributions are three-fold. First, we develop a novel algorithm called principal variable analysis (PVA) to identify important covariates by covariate-interference-adjusted data projections (called forward beamforming or null-beamforming) that account for the maximum amount of variation in the data. Such a procedure provides a principled way to extract information about covariates from multivariate regression data. In the PVA, unlike the existing methods, we gauge the importance of each covariate with respect to the multivariate response by its information index (called power), which is defined as its variance component. The higher the power of a covariate, the greater the amount of variation in the response data it can account for. We estimate the power of each covariate by performing null-beamforming on the data. To adjust for varying background noise levels, we replace the power by the signal-to-noise ratio (SNR), a relative information index for each covariate. In each forward step, after nulling the previously selected covariates, we adjust the SNR values of the remaining covariates and iteratively screen these covariates. The iteration terminates once no covariate significantly stands out in the current step. The above procedure produces a list of highly ranked covariates called principal variables along with their estimated regression coefficients. Based on these selected covariates, the response covariance matrix can be decomposed into two parts: one for the selected variance components and the other for noise. In this sense, the PVA can be viewed as a covariate-assisted principal component analysis. As the beamforming can be implemented through parallel computing, the PVA is scalable to large-scale data.
Secondly, we establish a global property of selection consistency for the PVA under some regularity conditions. In particular, a sufficient condition for consistent selection is imposed on the number of subjects, J, the number of covariates, p, the number of measurements per subject, n, and the number of non-zero entries in cov(y_j|X), m_n: namely, n^{−α_0} log(p) is bounded for some positive constant α_0 and m_n log(n)/J → 0, as n, J and p tend to infinity. This implies that the proposed procedure can handle the variable selection problem in an ultra-high dimensional covariate space. Finally, we conduct a set of simulation studies to evaluate the performance of the PVA against the existing variable selection methods. The numerical results demonstrate that, in terms of sensitivity and specificity, the PVA can substantially outperform existing methods such as the multivariate group LASSO, the multivariate elastic-net, the multivariate LASSO and the multivariate sparse group LASSO, among others. We also apply our method to anticancer drug data, identifying a novel set of genes for predicting drug sensitivity in cancer cell lines. Using the information extracted from the Human Protein Atlas Portal at http://www.proteinatlas.org/cancer, we show that most of the identified genes have significantly high protein staining levels in at least one common cancer.
The remainder of the paper is organized as follows. The details of the proposed methodology and algorithm are provided in Section 2. An asymptotic theory for the proposed procedure is developed in Section 3. The simulation studies and a real data application are presented in Section 4. The discussion and conclusions are given in Section 5. The technical details, proofs, and additional theorems can be found in Appendices A and B. Throughout the paper, we denote by λ_max(·) and λ_min(·) the largest and smallest eigenvalues of a square matrix respectively. For any matrix F_n, we define the spectral norm ||F_n|| = λ_max(F_n^T F_n)^{1/2}. For a sequence of real numbers {u_n}, we say F_n = O(u_n) if ||F_n||/|u_n| is bounded from above and F_n = o(u_n) if ||F_n||/|u_n| tends to zero as n tends to infinity.

Methodology
Let Y = Y_{n×J} = (y_{ij})_{n×J} = (y_1, y_2, ..., y_J) and X = X_{n×p} = (x_{ik})_{n×p} = (x_1, ..., x_p). We reformulate the multivariate random-effects regression model of the previous section in the matrix form

Y = XB + ε, (2.1)

where the unknown random regression coefficient matrix B = B_{p×J} = (β_1, β_2, ..., β_J) and ε = ε_{n×J} = (ε_1, ε_2, ..., ε_J), with β_j and ε_j containing, respectively, the regression coefficients and the error terms of the jth subject. As usual, we start with a least squares-based regression analysis. It can be shown that when p < n, the least squares solution B̂ = (X^T X)^{−1} X^T Y gives the same coefficients as fitting univariate multiple regression models to (y_j, X), 1 ≤ j ≤ J, separately. Note that treating β_{kj}, 1 ≤ j ≤ J, as correlated random coefficients allows us to explore the dependence between y_j, 1 ≤ j ≤ J. As pointed out before, when p > n, X^T X is not invertible and thus the above least squares solution is not unique. To tackle the problem, we assume that the variance components in the decomposition of cov(y_j|X) are sparse, that is, σ_{kk} = 0 for the majority of covariates k ∈ {1, 2, ..., p}. The goal of this paper is to identify the covariates with non-zero variance components and to estimate their regression coefficients given the observations (Y, X).
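As a quick numerical illustration of the claim above (our own sketch, not from the paper), the following verifies that for p < n the multivariate least squares solution (X^T X)^{−1} X^T Y coincides with fitting a separate multiple regression to each response column:

```python
import numpy as np

# For p < n, the multivariate least squares fit of (2.1) equals J separate
# univariate multiple-regression fits, one per response column of Y.
rng = np.random.default_rng(1)
n, p, J = 30, 4, 5
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, J))
B_joint = np.linalg.solve(X.T @ X, X.T @ Y)              # (X^T X)^{-1} X^T Y
B_sep = np.column_stack([np.linalg.lstsq(X, Y[:, j], rcond=None)[0]
                         for j in range(J)])
assert np.allclose(B_joint, B_sep)
```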

Power and signal-to-noise ratio
To rank covariates, we define an information index called power for each covariate by projecting the response data onto the covariate space. The concept of power, defined as the variance of a signal, is borrowed from the field of signal processing, where the sensor observations y_j, 1 ≤ j ≤ J, are often assumed weakly stationary [20,23]. In genetics, this concept describes the so-called pleiotropic effect of a single gene on multiple phenotypic traits, where multivariate linear models have been developed to connect genetic variant data to multiple quantitative traits [5]. In the multivariate random-effects regression setting, the power of a covariate is its variance component in the conditional covariance matrix C = cov(y_j|X) given X, where we model the regression coefficients of the multiple responses on each covariate as realizations of a random variable with a finite second moment. The amount of information on each covariate in these regression coefficients can then be measured by the variability in these coefficients: the larger the variability, the higher the degree of variation in the response data accounted for by this covariate. In practice, the regression coefficients (β_{kj})_{1≤j≤J} at covariate k (and therefore its approximate power Σ_{j=1}^J (β_{kj} − β̄_k)²/J as J tends to infinity, where β̄_k is the average of the β_{kj}) are unknown. We estimate (β_{kj})_{1≤j≤J} by projecting the response data into the coefficient space of the kth covariate along the direction w that minimizes interference with the other covariates and with the background noise. That is, for the kth covariate, we estimate its regression coefficients by the projected data w^T Y in which var(w^T y_j|X) = w^T Cw attains its minimum, subject to w^T x_k = 1. To this end, we consider the Lagrangian L(w, λ) = w^T Cw − λ(w^T x_k − 1) and solve the partial derivative equations, which yield the solution w_k = C^{−1} x_k (x_k^T C^{−1} x_k)^{−1} [23].
If we set C = constant × I_n, where I_n is the n × n identity matrix (i.e., ignoring correlations in the measurements on each subject), then the above estimator reduces to a marginal multivariate least squares estimator. To explain why this approach provides an interference-minimized estimator of the underlying power, assume that the error term ε_j is independent of the p-dimensional regression coefficient β_j, with cov(ε_j) = Λ and cov(β_j) = Σ = (σ_{k1k2})_{p×p}. Then C = XΣX^T + Λ. Under the constraint w^T x_k = 1, we have

w^T Cw = σ_{kk} + Σ_{(k1,k2)≠(k,k)} σ_{k1k2}(w^T x_{k1})(w^T x_{k2}) + w^T Λw,

which yields

min{w^T Cw : w^T x_k = 1} = power of the kth covariate + min{w-dependent interference and noise : w^T x_k = 1}.

This implies that the constraint w^T x_k = 1 acts as a linear filter which allows the power σ_{kk} to pass through, while interference from other covariates and from the background noise is reduced via the minimization. So min{w^T Cw : w^T x_k = 1} is an interference-minimized estimator of the theoretical power σ_{kk}. The above Lagrangian shows that the power of the kth covariate can be expressed as γ_k = (x_k^T C^{−1} x_k)^{−1}. When the observations on the responses are white noise with noise level σ², the power of the kth covariate reduces to σ² w_k^T w_k. So we define the SNR at the kth covariate by SNR_k = γ_k (σ² w_k^T w_k)^{−1}. Analogously, for a subset of covariates indexed by ν = {k_1, k_2, ..., k_m}, their joint power (called the power matrix) can be defined by γ_ν = (x_ν^T C^{−1} x_ν)^{−1}, where the data matrix x_ν = (x_{k1}, ..., x_{km}) consists of the observations on the covariates in ν and the columns of x_ν are assumed linearly independent. Abusing the above notation, we let w and w_ν denote n × m matrices below. Then, we can also define the SNR of the covariate set ν as SNR_ν = tr(γ_ν (σ² w_ν^T w_ν)^{−1}).
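The single-covariate beamformer can be sketched in a few lines. This is our own illustration of the formulas w_k = C^{−1}x_k(x_k^T C^{−1}x_k)^{−1} and γ_k = (x_k^T C^{−1}x_k)^{−1}, assuming C is known and positive definite; the function name is ours:

```python
import numpy as np

def beamformer_power(C, x_k):
    """Interference-minimized power of one covariate, for known C."""
    Cinv_x = np.linalg.solve(C, x_k)         # C^{-1} x_k
    gamma_k = 1.0 / float(x_k @ Cinv_x)      # power: (x_k^T C^{-1} x_k)^{-1}
    w_k = gamma_k * Cinv_x                   # minimizer of w^T C w s.t. w^T x_k = 1
    return gamma_k, w_k
```

Note that γ_k = w_k^T C w_k, i.e., the attained minimum of the constrained variance is exactly the power estimate.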
Using the corresponding Lagrangian, we can show that γ_ν is the covariance matrix of the projected data w_ν^T Y along the interference-minimized directions, in the sense that tr(γ_ν) = min{tr(cov(w^T y_j|X)) : w^T x_ν = I_m}, where tr(·) is the trace operator and I_m is the m × m identity matrix. Note that the constraints w^T x_ν = I_m define m linear filters which null each other. The projection of Y along the interference-minimized directions gives the estimator B̂_ν = w_ν^T Y. This estimator again reduces to a marginal least squares estimator if we set C = constant × I_n. However, in practice, C is unknown and often not diagonal, so we need to estimate C from the data.
Covariates can be correlated. For example, in cancer genomic data, genes as covariates can be highly correlated if they belong to the same pathway. Consequently, the finite-sample power estimator of a covariate may be biased due to interference from other covariates. To address this problem, we further null the previously identified covariates by adding more constraints to the linear filter in each step, as follows. Let ω and ν be two disjoint subsets of the covariates with sizes m_1 and m respectively. To define an ω-nulled power matrix of ν, adding the null constraints w^T x_ω = 0_{m×m1} to the linear filters w^T x_ν = I_m, we consider the optimization problem

min{tr(w^T Cw) : w^T x_ν = I_m, w^T x_ω = 0_{m×m1}}.

Using a Lagrangian argument again, we obtain the optimal weighting matrix

w_{ν|ω} = C^{−1} x_{ν∪ω}(x_{ν∪ω}^T C^{−1} x_{ν∪ω})^{−1} φ_{ν|ω},

where φ_{ν|ω} = (I_m, 0)^T with 0 being the m × m_1 matrix of zeros. The nulled power matrix γ_{ν|ω} is then defined as w_{ν|ω}^T C w_{ν|ω}, the covariance matrix of the projected data along w_{ν|ω}. It can be shown that γ_{ν|ω} is equal to the upper m × m block of (x_{ν∪ω}^T C^{−1} x_{ν∪ω})^{−1}. The nulled signal-to-noise ratio can then be defined as SNR_{ν|ω} = tr(γ_{ν|ω}(σ² w_{ν|ω}^T w_{ν|ω})^{−1}).
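A minimal sketch of the nulled power matrix, assuming a known positive definite C; the function name and argument layout are ours. It builds the combined matrix x_{ν∪ω}, inverts x^T C^{−1} x once, and reads γ_{ν|ω} off as the upper m × m block:

```python
import numpy as np

def nulled_power(C, x_nu, x_omega):
    """omega-nulled power matrix of the covariate set nu, for known C."""
    m = x_nu.shape[1]
    x_all = np.hstack([x_nu, x_omega])               # columns of nu, then omega
    G = x_all.T @ np.linalg.solve(C, x_all)          # x^T C^{-1} x
    Ginv = np.linalg.inv(G)
    gamma = Ginv[:m, :m]                             # upper m x m block
    phi = np.vstack([np.eye(m), np.zeros((x_omega.shape[1], m))])
    w = np.linalg.solve(C, x_all) @ Ginv @ phi       # optimal weighting matrix
    return gamma, w
```

One can check numerically that the returned w satisfies both filter constraints (w^T x_ν = I_m, w^T x_ω = 0) and that γ_{ν|ω} = w^T C w.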

Estimation of response covariance matrix
Note that the power estimation requires an estimator of the response covariance matrix, for example, the sample covariance matrix Ĉ = Σ_{j=1}^J y_j y_j^T/J − ȳȳ^T = (ĉ_{ij}), where ȳ = Σ_{j=1}^J y_j/J = (ȳ_1, ..., ȳ_n)^T and ĉ_{ij} = Σ_{t=1}^J (y_{it} − ȳ_i)(y_{jt} − ȳ_j)/J. As the sample covariance matrix can be inconsistent with the true one when the dimension n is larger than J, [1] and [3] amended it by thresholding its entries: Ĉ_h = Ĉ(τ_nJ) = (ĉ_{ij} I(|ĉ_{ij}| > hτ_nJ)), where I(·) is the indicator function and τ_nJ = log(n)/J with tuning constant h ≥ 0. Under certain mixing conditions, [23] showed that the thresholded sample covariance matrix is consistent with the true one for dependent samples. For a finite sample, the thresholded covariance matrix may still be degenerate when the number of subjects is close to or smaller than the number of measurements per subject. So, following [12], we further shrink the thresholded estimator toward a diagonal matrix, obtaining Ĉ_hs, with the shrinkage weights chosen as in [12] with respect to the inner product ⟨D_1, D_2⟩ = tr(D_1 D_2^T)/n for n × n matrices D_1 and D_2. Having defined Ĉ_hs, we estimate the power matrices γ_ν and γ_{ν|ω} by γ̂_ν = (x_ν^T Ĉ_hs^{−1} x_ν)^{−1} and by γ̂_{ν|ω}, the upper m × m block of (x_{ν∪ω}^T Ĉ_hs^{−1} x_{ν∪ω})^{−1}, respectively. Similarly, the ω-nulled SNR can be estimated by ŜNR_{ν|ω} = tr(γ̂_{ν|ω}(σ̂² ŵ_{ν|ω}^T ŵ_{ν|ω})^{−1}), where σ̂² is an estimate of the noise level and ŵ_{ν|ω} is the plug-in version of w_{ν|ω}.
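The covariance estimator can be sketched as follows. This is our own simplified illustration: the hard threshold follows the text, while the shrinkage weight `rho` and the diagonal target are placeholders in the spirit of [12], not the paper's optimal choice:

```python
import numpy as np

def shrunk_thresholded_cov(Y, h=0.01):
    """Thresholded-then-shrunk estimate of the response covariance C (sketch)."""
    n, J = Y.shape
    ybar = Y.mean(axis=1, keepdims=True)
    C_hat = (Y - ybar) @ (Y - ybar).T / J                 # sample covariance (1/J)
    tau = np.log(n) / J                                   # tau_nJ as in the text
    C_h = np.where(np.abs(C_hat) > h * tau, C_hat, 0.0)   # hard thresholding
    rho = 0.1                                             # placeholder shrinkage weight
    nu = np.trace(C_h) / n                                # average variance level
    return (1.0 - rho) * C_h + rho * nu * np.eye(n)       # shrink toward nu * I_n
```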

Principal variable analysis
We are now ready to describe the PVA for multivariate variable selection. Although we focus on the SNR-based PVA below, a power-based PVA can be defined similarly. Initialization: To start, find 1 ≤ k_1 ≤ p at which the estimated SNR attains its maximum. Set ω_0 = ∅ and ω_1 = {k_1}.
Nulling: In iteration m, m ≥ 2, let ω_{m−1} denote the set of covariates selected in the first m − 1 iterations. For any covariate k not in ω_{m−1}, using the formula (2.2), we calculate the nulled SNR, ŜNR_{k|ω_{m−1}}, as well as an estimated optimal projection direction ŵ. We then find k_m ∉ ω_{m−1} at which ŜNR_{k|ω_{m−1}} attains its maximum.
Forward selection and stopping criterion: After a number of iterations, the nulled SNR values start leveling off, indicating that the remaining covariates make no significant contribution to the covariance structure of the response. This motivates the following stopping criterion. For m ≥ 2, at the end of the mth iteration, we make a scree plot of the nulled SNR values and identify an elbow point. To find the elbow point, we consider the vector linking the highest and the lowest points on the scree plot and compute the orthogonal distance from each point on the plot to this vector; the point with the largest distance is selected as the elbow point. The elbow point partitions the remaining covariates into two subsets, an upper set and a lower set. The lower set, containing the covariates with SNR values below the elbow point, is uninformative about the responses. To test the hypothesis that the upper set is also uninformative, we calculate the mean μ_l and standard deviation θ_l of the lower set. The hypothesis is accepted if the maximum nulled SNR value of the upper set, ŜNR_max = max{ŜNR_{k|ω_{m−1}} : k ∉ ω_{m−1}}, falls into the interval |ŜNR_max − μ_l| ≤ c_0 θ_l, where c_0 is a tuning constant with default value c_0 = 5. Applying the central limit theorem to the SNR values in the lower set, this interval can be shown to have an asymptotic confidence level of 1 − 5.73 × 10^{−7} after multiple testing adjustment. The iteration terminates when the upper set is declared uninformative; otherwise, we update ω_{m−1} and x_{ω_{m−1}} by letting ω_m = {k_m} ∪ ω_{m−1} and x_{ω_m} = (x_{k_m}, x_{ω_{m−1}}), and the iteration continues. Our simulations (not shown here) indicated that the performance of the PVA is not very sensitive to the choice of c_0 for values between 3 and 5.
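The forward PVA loop with the elbow-based stopping rule can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a known noise level σ², uses a vertical-distance elbow rule in place of the exact orthogonal distance, and plugs in a given covariance estimate Ĉ; all names are ours:

```python
import numpy as np

def nulled_snr(C_hat, X, k, selected, sigma2=1.0):
    """Nulled SNR of covariate k after nulling the covariates in `selected`."""
    cols = [k] + list(selected)
    x_all = X[:, cols]
    G = x_all.T @ np.linalg.solve(C_hat, x_all)       # x^T C^{-1} x
    Ginv = np.linalg.inv(G)
    gamma = Ginv[0, 0]                                # nulled power of covariate k
    w = np.linalg.solve(C_hat, x_all) @ Ginv[:, [0]]  # its projection direction
    return gamma / (sigma2 * float(w.T @ w))

def pva_select(C_hat, X, sigma2=1.0, c0=5.0):
    """Forward PVA with the scree-plot elbow stopping rule (simplified)."""
    p = X.shape[1]
    selected = []
    while len(selected) < p - 1:
        rest = [k for k in range(p) if k not in selected]
        snr = np.array([nulled_snr(C_hat, X, k, selected, sigma2) for k in rest])
        order = np.argsort(snr)[::-1]
        vals = snr[order]
        # elbow: point farthest from the segment joining the extreme SNR values
        t = np.linspace(0.0, 1.0, len(vals))
        dist = np.abs(vals - (vals[0] + t * (vals[-1] - vals[0])))
        elbow = int(np.argmax(dist))
        lower = vals[elbow + 1:] if elbow + 1 < len(vals) else vals[-1:]
        if abs(vals[0] - lower.mean()) <= c0 * lower.std():
            break                                     # upper set uninformative: stop
        selected.append(rest[order[0]])
    return selected
```

With orthogonal columns and a single strong variance component, the loop picks out that covariate and then stops, since the remaining nulled SNR values are all near 1.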

Covariate network
Statistical connectivity patterns among the selected covariates are a hallmark feature for connecting pleiotropic traits, such as drug inhibitory concentrations, to genetic variants in genetics and for studying functional networks in neuroscience [5,14]. Here, to quantify such patterns, we compute the regression coefficient-based Pearson correlation coefficient for each pair of selected covariates, as follows. Suppose that q covariates are selected by the PVA. Based on the multivariate least squares, we obtain B̂_0, an estimator of the q × J regression coefficient matrix for these covariates. For any pair of rows (i, j) in B̂_0, we calculate Fisher's z-transformation of their correlation coefficient r_ij, z_ij = 0.5 ln((1 + r_ij)/(1 − r_ij)). For rows i < j, we want to test whether z_ij (i.e., r_ij) is significantly away from 0; there are q(q − 1)/2 such tests in total. Note that if the underlying correlation coefficient is zero, then z_ij ≈ N(0, 1/(J − 3)) in distribution. After Bonferroni correction for multiple testing, we claim that z_ij is significantly away from zero if √(J − 3)|z_ij| > z_{α/2}, where z_{α/2} is the critical value of N(0, 1) at level α/2 = 0.01/(q(q − 1)). For example, for our cancer data in Section 4, where q = 37 and J = 131, we obtained z_{α/2} = 4.33. We then construct a network with q nodes, each standing for a selected covariate (a row in B̂_0), and assign an edge between nodes i and j if z_ij is significantly away from zero.
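The network construction can be sketched as follows, assuming the coefficient matrix B̂_0 is available; the function name is ours. With q = 37 and J = 131 this reproduces the critical value z_{α/2} ≈ 4.33 quoted above:

```python
import numpy as np
from statistics import NormalDist

def covariate_network(B0, alpha=0.01):
    """Edges between selected covariates via Fisher's z on row correlations."""
    q, J = B0.shape
    R = np.corrcoef(B0)                                  # q x q row correlations
    # Bonferroni over q(q-1)/2 tests: two-sided critical value at 0.01/(q(q-1))
    zcrit = NormalDist().inv_cdf(1.0 - alpha / (q * (q - 1)))
    edges = []
    for i in range(q):
        for j in range(i + 1, q):
            z_ij = 0.5 * np.log((1.0 + R[i, j]) / (1.0 - R[i, j]))  # Fisher's z
            if np.sqrt(J - 3) * abs(z_ij) > zcrit:
                edges.append((i, j))
    return edges
```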

Theory
In the literature, no general asymptotic theory has been provided for variable selection in high dimensional multivariate regression models, with the exception of [17]. In [17], Sofer et al. developed a selection consistency theory for a special class of multivariate fixed-effects regression models in which the regression coefficients do not change across responses (i.e., β_1 = · · · = β_J). In this section, we develop a general theory on the selection consistency of the proposed PVA procedure for multivariate random-effects regression models. We divide the theory into two parts according to whether C is known or not. Here, we present only the case where C is estimated; the remaining case is deferred to Appendix B.
As before, assume that the regression coefficient matrix B and the error terms ε in the model (2.1) are independent and that, given X, the covariance matrices of y_j, β_j and ε_j, denoted by C = (c_{ik})_{n×n}, Σ = (σ_{ik})_{p×p} and Λ respectively, do not depend on the index j. Then, we have C = XΣX^T + Λ. Assume that Λ is positive definite. For ease of presentation, we consider the special case where Λ = σ² I_n and x_k^T x_k = n, 1 ≤ k ≤ p. If Λ ≠ σ² I_n, we can transform Y and X to Λ^{−1/2} Y and Λ^{−1/2} X (under which the power is invariant), followed by rescaling Λ^{−1/2} X and B (see [23]); a general theory can then be derived from the special case. We denote the full set of covariates by [1 : p] = {1, 2, ..., p}, corresponding to x_1, ..., x_p, and the true covariate set by ν_0. Let ν = {k_1, ..., k_{p1}} denote any subset of [1 : p] with size |ν|. The (k_1, ..., k_{p1})th columns of X form the data matrix x_ν of the covariate set ν. Letting e_ν be the p × p_1 selection matrix whose (k_j, j)th entries, 1 ≤ j ≤ p_1, equal 1 and whose other entries equal 0, we can write x_ν = Xe_ν. Let σ_k² denote σ_kk in Σ, the underlying power of the kth covariate.
Let A_ν0 = C − x_ν0 e_ν0^T Σ e_ν0 x_ν0^T denote the underlying noise covariance matrix. For any subset ν, if A_ν0 is invertible, then we define the coherence (i.e., collinearity) matrices R_νν and R_νν0 within and between x_ν and x_ν0. Let e_ν^{ν0} be a |ν_0| × |ν| indicator matrix with the (j_l, l)th entry equal to 1, 1 ≤ l ≤ |ν|, and all other entries equal to zero. Using e_ν^{ν0}, we select sub-columns from x_ν0 to form x_ν, namely x_ν = x_ν0 e_ν^{ν0}. To identify the active covariates, we impose the following regularity conditions on the covariance structures of the multivariate response variable and the covariates, where X is treated as deterministic. If X is treated as a random design matrix, parallel conditions can be assumed by replacing O(·) with O_p(·) below.
(C0). There exists a permutation of y_j, 1 ≤ j ≤ J, such that the resulting sequence is strictly stationary with covariance matrix C and (y_j, X), 1 ≤ j ≤ J, follow the model (2.1). The error term ε_j and the p-dimensional regression coefficient β_j are independent of each other.
(C1). There are a constant 0 < r ≤ 1 and a set of active covariates ν_0 of size |ν_0| ≤ rn such that x_ν0 has full column rank and both e_ν0^T Σ e_ν0 and A_ν0 are invertible.
(C2). For ν_0 and r in Condition (C1), as n tends to infinity, there is a constant 0 ≤ α_0 < 1 such that uniformly for any set ν ⊆ [1 : p] with |ν| ≤ rn, R_νν = O(n^{α_0}) and R_νν^{−1} = O(n^{α_0}).

(C3). For ν_0 and r in Condition (C1), as n tends to infinity, uniformly for

(C4). For ν_0 and r in Condition (C1), as n tends to infinity, uniformly for

(C5). There exist positive constants κ_1 and τ_1 such that for any u > 0,

In the last condition, we assume that there exists a permutation π on {1, ..., J} such that y_{π(j)}, 1 ≤ j ≤ J, are strong mixing. Let F_0^{k_0} and F_k^∞ denote the σ-algebras generated by {y_{π(j)} : 0 ≤ j ≤ k_0} and {y_{π(j)} : j ≥ k} respectively, and define the mixing coefficient α(k) = sup_{k_0 ≥ 0} sup{|P(A ∩ B) − P(A)P(B)| : A ∈ F_0^{k_0}, B ∈ F_{k_0+k}^∞}. The mixing coefficient α(k) quantifies the degree of dependence of the process {y_{π(j)}} at lag k. We assume that α(k) decreases exponentially fast as the lag k increases:

(C6). There exist positive constants κ_2 and τ_2 such that α(k) ≤ exp(−τ_2 k^{κ_2}).

The quantity controlled above is o(1), which is negligible; this is natural as ν may contain noisy covariates. Similarly, Condition (C3) says that the ν_0-adjusted power of ν is also negligible. Conditions (C2) and (C3) are assumptions commonly used in the large sample theory for linear regression models (e.g., [21]). To verify Conditions (C1)∼(C3), we refer readers to [8,21] under the assumptions that σ_k² = 0 for k ∉ ν_0, that X is a random matrix satisfying some moment conditions, and that the growth of the dimension p is not too fast compared to the number of measurements per response, n. For example, following [8], assume that X has a concentration property, i.e., for some constant c_1, any u > 0 and ν ⊆ [1 : p] with |ν| ≤ rn; then, as log(p) ≤ n^{α_0}/c_1 − 1 + log(r), Condition (C2) holds with overwhelming probability. Analogously, Condition (C3) holds if x_ν and x_ν0 are asymptotically, uniformly non-coherent with respect to ν ⊆ [1 : p] \ ν_0, in the sense that R_νν0 = o(1).

Condition (C4) is a technical condition which holds when
Letting ν_1 ⊆ ν_0 and ν_2 ⊆ [1 : p] \ ν_0, we show in the next theorem that the sparsistency property holds for the estimated nulled powers. Recall that τ_nJ = log(n)/J. The theorem implies that, uniformly for a ∈ ν_0 \ ν_1 and |ν_1 ∪ ν_2| < rn, the (ν_1 ∪ ν_2)-nulled power γ̂_{a|ν1∪ν2} of a admits the stated expansion up to an o(1) term. Note that, as can be seen from the proofs in Appendix B, the condition n²τ_nJ = o(1) in Theorem 3.1 can be replaced by m_n τ_nJ = o(1), which depends on m_n, the number of non-zero entries in C. We therefore show that the so-called sparsistency property holds for the nulled-power-based PVA. We further show that the sparsistency property also holds for the SNR-based PVA, as follows.
Theorem 3.2. Suppose that Conditions (C0)∼(C6) hold and that τ_nJ n² = o(1) as both n and J tend to infinity. Then, we have: Note that for a ∈ ν_0 \ ν_1 and |ν_1 ∪ ν_2| < rn, the quantity above is bounded away from zero since λ_max(e_ν0^T Σ e_ν0) = O(1). It can also be shown that σ² ζ_0 e_a^{ν0 T} Σ_{ν0\ν1}^{−1} Φ Σ_{ν0\ν1}^{−1} e_a^{ν0} = O(n^{3α_0+2α_1}). Consequently, the leading term in Theorem 3.2 (ii) tends to infinity as n^{1−3α_0−2α_1} tends to infinity. In contrast, for a ∉ ν_0, ŜNR_{a|ν1∪ν2} converges to a constant, as stated in Theorem 3.2 (i). Compared to Theorem B.4 in Appendix B, Theorem 3.2 thus provides a sharper contrast between active and non-active covariates.
Let ω̂_m denote the set of covariates selected by the (SNR-based) PVA. We have the following selection consistency for ω̂_m. Corollary 3.1. Under the conditions in Theorem 3.2, as both n and J tend to infinity, we have selection consistency in the sense that P(ω̂_m = ν_0) → 1.
The above corollary, together with Remark 3.1, implies that, along with the other regularity conditions, if n^{−α_0} log(p) = O(1) for some positive constant α_0 and n²τ_nJ = o(1), then the PVA-based variable selection is consistent. As pointed out before, the condition n²τ_nJ = o(1) can be replaced by the sparsity condition m_n τ_nJ = o(1).

Numerical results
In this section, we assess the performance of the PVA in identifying active covariates using synthetic and real data. As our simulations suggest that the SNR-based PVA performs better than the power-based PVA, we consider four versions of the SNR-based PVA with four different estimators of C, namely Ledoit-Wolf's shrinkage estimator and the optimally shrunk thresholded estimator Ĉ_hs with h = 0.01, 0.005, 0.001. They are denoted by PVA(sh0), PVA(hs1), PVA(hs2) and PVA(hs3) respectively.

Synthetic data
We compare the performance of the PVA to the methods implemented in the R packages 'glmnet' (Friedman, Hastie, Simon, Tibshirani, version 2.1), 'lsgl' (Vincent, version 1.3.5) and 'mrce' (Rothman, version 2.1): the multivariate group LASSO (MGL), the multivariate elastic-net (MENET), the multivariate LASSO (ML), the multivariate group sparse LASSO (MGSL) and multivariate regression with covariance estimation (MRCE), where all these procedures fix their specificity values at approximately the same level as the PVA. A brief introduction to these methods can be found in Appendix A. The Bayesian method of [2] is excluded from our comparison as it is computationally infeasible for the large-scale data considered here.
Sensitivity and specificity are defined as the survival rates in screening of the true active covariates and of the true non-active covariates respectively, namely SEN_D = |D ∩ T|/|T| and SPE_D = |D^c ∩ T^c|/|T^c|, where T is the set of true active covariates and D is the set of m selected covariates. So the specificity SPE_D is close to 1 when p ≫ |T| + m. This holds for most of our simulations; for example, for m = 42, p = 2000 and |T| = 37, we have SPE_D ≥ 0.978.
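These two rates can be computed directly from the selected set and the true active set; the helper below is our own sketch. In the worst case where the 42 selected covariates are all wrong, it reproduces the lower bound SPE_D ≥ 0.978 quoted above:

```python
def sen_spe(detected, true_active, p):
    """Sensitivity and specificity of a selected set D against the truth T."""
    D, T = set(detected), set(true_active)
    sen = len(D & T) / len(T)                # survival rate of true actives
    spe = (p - len(D | T)) / (p - len(T))    # = |D^c ∩ T^c| / |T^c|
    return sen, spe
```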

Setting 4.1 (B was uncorrelated both within rows and between rows):
Modifying a simulation setting in [16], we simulated 50 data sets of (Y, X) from the model (2.1). Each dataset was generated in the following steps. First, we drew an i.i.d. sample of size np from the standard normal N(0, 1) to form an n × p matrix X. Secondly, we drew n independent auto-regressive row-vectors of length J and stacked them to generate an n × J error term matrix ε. Thirdly, we generated B = (β_{kj})_{p×J} = s_0 B_0, where s_0 was a scale factor, B_0 = (b_{kj})_{p×J}, b_{kj} = η_{kj} u_{kj}, with η_{kj} and u_{kj} independently sampled from the Bernoulli distribution Bin(0.1) (0.1 being the success probability) and the uniform distribution U(s_1, s_2) respectively. We considered combinations of (n, p, J, p_0, α, s_0, s_1, s_2) with n = 50, p = 100, 1000, J = 20, p_0 = 5, α = 0, 1, s_0 = 0.45, 0.6 and (s_1, s_2) = (−1, 1), (0.5, 1), (1, 2). Note that α = 0 and 1 corresponded to row-wise uncorrelated and row-wise correlated Bs respectively. The choices (s_1, s_2) = (−1, 1), (0.5, 1) and (1, 2) represent three scenarios for B: (i) rows with non-zero entries oscillated around (and thus were not well separated from) the background 0; (ii) rows with non-zero entries were uniformly bigger than (thus separated from) 0 by amounts not less than 0.5s_0; (iii) rows with non-zero entries were uniformly bigger than (thus separated from) 0 by amounts not less than s_0. Then, we randomly selected a subset S_{p0} of size p_0 from the integers 1 to p and, for every j, set β_{kj} = 0 when k ∉ S_{p0}. Finally, we let Y = XB + ε.
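Setting 4.1 can be sketched as follows. This is our own illustration: the AR(1) coefficient `ar` of the error rows is a placeholder, since the exact auto-regressive specification is not reproduced here:

```python
import numpy as np

def simulate_setting41(n=50, p=100, J=20, p0=5, s0=0.45, s1=-1.0, s2=1.0,
                       ar=0.5, seed=0):
    """One (Y, X) draw following Setting 4.1 (sketch; `ar` is a placeholder)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                      # i.i.d. N(0,1) entries
    eps = np.zeros((n, J))                               # AR(1) rows of length J
    eps[:, 0] = rng.standard_normal(n)
    for t in range(1, J):
        eps[:, t] = ar * eps[:, t - 1] + rng.standard_normal(n)
    eta = rng.binomial(1, 0.1, size=(p, J))              # Bernoulli(0.1) indicators
    u = rng.uniform(s1, s2, size=(p, J))                 # uniform magnitudes
    B = s0 * eta * u
    S = np.sort(rng.choice(p, size=p0, replace=False))   # active covariates
    mask = np.zeros(p, dtype=bool)
    mask[S] = True
    B[~mask, :] = 0.0                                    # beta_kj = 0 for k not in S
    return X @ B + eps, X, B, S
```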
Setting 4.2 (B was uncorrelated within rows but correlated between rows): We adopted Setting 4.1, except that we multiplied the above B_0 by a matrix factor B_f = (0.6^{|k−j|})_{p×p}, resulting in a new B = s_0 B_f B_0 with correlations between non-zero rows.

Setting 4.3 (B was weakly correlated within rows):
We generated 50 data sets of (Y, X) from the model (2.1) for each combination of (n, p, J, p_0), where n = 42, 88, 150, p = 2000, J = 20, 34, 131, and p_0 = 37, 50, 70 is the number of true active covariates underpinning the model. Each data set was generated in the following steps. We began by calculating a J × J sample covariance matrix Ω from the n × J weakly correlated sub-data matrix of the imputed IC50 data. Given Ω, we randomly generated p row vectors from a J-dimensional normal N_J(0, Ω), stacking them together to form a matrix B. We then modified the entries of B so that the resulting matrix contained exactly p_0 non-zero rows, which would later be taken as the p_0 active covariates. The details were omitted. To obtain the matrix X, we let F_0 be the p × p sample covariance matrix of the gene expressions in our cancer drug data, which are described in the next section. Given F_0, we generated n i.i.d. row vectors from a multivariate normal N_p(0, F_0), stacking them together to form the matrix X. We generated the error term matrix ε by sampling J times from N_n(0, σ² I_n) to form its column vectors, where σ² = 0.1. Finally, we obtained Y by setting Y = XB + ε.
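The steps above can be sketched as follows; a synthetic positive-definite Ω and an identity-covariance X stand in for the sample covariances computed from the real IC50 and gene-expression data, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, J, p0, sigma2 = 42, 2000, 20, 37, 0.1

# Stand-in for the J x J sample covariance Omega of the IC50 sub-data
A = rng.standard_normal((J, 2 * J))
Omega = A @ A.T / (2 * J)

# p row vectors drawn from N_J(0, Omega), stacked into B
B = rng.standard_normal((p, J)) @ np.linalg.cholesky(Omega).T

# Keep exactly p0 non-zero rows (the true active covariates)
active = rng.choice(p, size=p0, replace=False)
mask = np.zeros(p, dtype=bool)
mask[active] = True
B[~mask] = 0.0

# Stand-in for X ~ N_p(0, F0), with F0 the gene-expression sample covariance
X = rng.standard_normal((n, p))

# J error columns drawn from N_n(0, sigma2 * I_n)
eps = np.sqrt(sigma2) * rng.standard_normal((n, J))
Y = X @ B + eps
```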
Setting 4.4 (B was strongly correlated within rows): Similar to Setting 4.3, we generated 50 data sets of (Y, X) from the model (2.1) for each combination of (n, p, J, p_0), where n = 42, 88, 150, p = 2000, J = 20, 34, 131, and p_0 = 37, 50, 70 is the number of true active covariates underpinning the model. Each data set was generated in the same steps as in Setting 4.3, except that the matrix Ω was replaced by one with high correlation coefficients. Further details were omitted.
Setting 4.5 (B was moderately correlated within rows): Similar to Setting 4.3, we generated 50 data sets of (Y, X) from the model (2.1) for each combination of (n, p, J, p_0), where n = 20, 42, p = 2000, J = 131, and p_0 = 20, 37. Here, Ω was generated from the n non-missing rows of the IC50 data, while X was produced using the gene expression data corresponding to those non-missing rows. The error term matrix was generated by sampling J times from N_n(0, σ² I_n) as before, but with σ² = 0.0645. Further details were omitted.
For each combination of (n, p, J, p_0, s_0, s_1, s_2) in Settings 4.1 and 4.2, we applied the PVA, MGL, MENET, ML, MSGL and MRCE to each of the 50 data sets and calculated their sensitivity values when the specificity value was fixed at approximately the same level. Note that in Settings 4.3 to 4.5, it was too time-consuming to run MRCE on a PC; in light of this, we skipped MRCE in our comparison in these settings. For the MGL, MENET, ML, MSGL and MRCE, we adjusted their penalty coefficients to achieve approximately the same specificity as that of the PVA. These sensitivity and specificity values were summarized using box-plots as shown in Figures 1∼7. In these figures, sh0, hs1, hs2 and hs3 correspond to the PVA based on the shrunk and thresholded covariance estimators with tuning constants h = 0, 0.01, 0.005, 0.001 respectively, and mgl, menet, mrce, ml and msgl stand for the multivariate group LASSO, the multivariate elastic-net, the multivariate regression with covariance estimation, the multivariate LASSO and the multivariate sparse group LASSO respectively.

[Figure caption: In each panel, from left to right, the odd columns are for sensitivity while the even columns are for specificity. In each panel, box-plots from left to right are for sh0, hs1, hs2, hs3, mgl, menet, mrce, ml and msgl respectively. B is called "Oscillated" if its non-zero entries oscillated around 0 and "Separated" if its non-zero entries were uniformly bigger than 0.]
The results indicated that the PVA substantially outperformed the MGL, MENET, ML, MSGL and MRCE in terms of sensitivity and specificity in all the scenarios under consideration. In Settings 4.1 and 4.2, the results suggested that the performances of the MGL, MENET, ML, MSGL and MRCE deteriorated sharply as the separation, in terms of regression coefficients, between active and non-active covariates decreased. In contrast, the performance of the PVA was much more robust than the other procedures to interference between active and non-active covariates. This was because interference is minimized through the optimization in the null-beamforming, and it explains why the PVA substantially outperformed the other procedures as the separation between active and non-active covariates decreased. In Setting 4.4, when the specificity values were fixed at roughly the same level, compared to the MGL, the sensitivity values of the PVA(hs3) were on average increased by 74%, 97%, 136% and 237% respectively. Compared to the MENET, the sensitivity values were on average increased by 312%, 478%, 443% and 968% respectively; compared to the ML, by 103%, 133%, 163% and 250% respectively; and compared to the MSGL, by 53%, 85%, 110% and 169% respectively. The results also suggested that the improvements of the PVA(hs3) over the other procedures in sensitivity decreased when p_0 changed from 50 to 70, although they were still large. This was expected, as the model complexity increased but the number of measurements per response did not.
In Setting 4.3, we considered a weakly correlated regression coefficient matrix B. With the same combinations of (n, p, J, p_0) as before, the improvements over the other procedures were smaller than under the highly correlated B setting, but they were still substantial. This reflects the fact that the higher the correlations between columns or rows of B, the stronger the intra-correlations received by the response variables; therefore, the PVA, which can exploit correlation structures in the data better than the other methods, yields more accurate variable selection. The results also indicated that the sensitivity improvements of the PVA over the other procedures increased with J and n. A similar result was obtained in Setting 4.5.
We recorded the running times of the above procedures on each of the 50 data sets in each setting. The results showed that, in terms of log-CPU-times in seconds, on average the PVA ran much faster than the ML and MSGL and was also very competitive with the MGL and MENET. The details were omitted.

Anti-cancer drug data
Cancer drugs exert their function through binding to one or more protein targets [22]. The early "one gene, one drug, one cancer" paradigm considers the roles of individual genes and their changes in drug-perturbed states, largely ignoring a target's cellular and physiological context. Likewise, cancer gene-centric methods largely ignore the multi-factor-driven nature of cancer at the cellular level. With the generation of rich data resources for genome-wide gene expression and drug- and cancer-induced perturbations, data-integrative approaches such as the PVA aim to provide systematic insights into the mechanisms of drugs and cancers in a "multiple genes, multiple drugs, multiple types of cancers" paradigm.
In this section, we focus on the following two data sets: IC50 values of drug sensitivity in cancer cell lines and the corresponding gene expression DNA microarrays [10]. The IC50 (median inhibitory concentration) of a drug is the concentration that reduces a biochemical activity, such as cell multiplication, to 50 percent of the value observed in the absence of the inhibitor. The data sets contain gene expression levels of 13321 genes and IC50s of 131 drugs across 586 cell lines. Among these cell lines, only 42 had complete records of their responses to the 131 drugs; here, we considered only these 42 completely observed cell lines. The challenging problem of imputing the remaining cell lines will be addressed in a separate work. We aim to identify biomarkers (a set of genes) that underpin drug sensitivity in cancer cell lines. Multivariate random-effects regression models can be used to recover these biomarkers, where we treat drugs as subjects (or responses), the IC50 values of each drug on the cell lines as measurements, and genes as random-effects covariates. Note that, in the above regression, multiple drugs are simultaneously linked to the same set of covariates, so the higher the number of drugs, the more information about these covariates can be extracted from the drug sensitivity data.
Letting X be the log-gene-expression levels and Y be the IC50 values of the 42 completely observed cell lines, we considered the model (2.1). We constructed a network, displayed in Figure 8, for the selected genes based on their regression coefficients across the 131 drugs. The network was strongly connected, as there always existed a path from any node to any other node. Surprisingly, although by the iterative nulling the selected genes were uncorrelated in their expression levels, they were strongly correlated in their reactions to cancer drugs, as shown in Figure 8. This suggests that these genes are potentially correlated at a higher functional level (e.g., the protein level).
To reveal the potential roles played by these selected genes in cancer drug sensitivity, as their protein products dictate their functions, we investigated their protein stainings in the following 20 common cancers [18]: Breast, Carcinoid, Cervical, Colorectal, Endometrial, Glioma, Head and neck, Liver, Lung, Lymphoma, Melanoma, Ovarian, Pancreatic, Prostate, Renal, Skin, Stomach, Testis, Thyroid and Urothelial. We extracted this information from the Human Protein Atlas portal at http://www.proteinatlas.org/cancer. Following the portal, we classified the protein expression/staining levels into four categories: high, medium, low and not detected, and assigned them scores of 3, 2, 1 and 0 respectively. If a gene played no role in a cancer, it would receive a score of zero, as its protein staining for that cancer would be hardly detectable. We found that 34 of the selected genes had positive staining levels in at least one of these cancers, implying that these genes might play certain functional roles in the growth of some of these cancers. For the remaining 3 selected genes, no information was available in the portal.
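The scoring scheme can be written down directly; the category names follow the portal, while the function names are ours:

```python
# Scores assigned to the Human Protein Atlas staining categories
STAIN_SCORE = {"high": 3, "medium": 2, "low": 1, "not detected": 0}

def staining_scores(staining_by_cancer):
    """Map each cancer's staining category to its numeric score; a gene
    scoring zero everywhere is taken to play no detectable role."""
    return {cancer: STAIN_SCORE[level]
            for cancer, level in staining_by_cancer.items()}

def has_positive_staining(staining_by_cancer):
    """True if the gene has a positive staining level in at least one cancer."""
    return any(s > 0 for s in staining_scores(staining_by_cancer).values())
```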

Conclusion
High-dimensional multivariate regression data, where columns stand for measurements on responses (or subjects), can be fitted by both fixed-effects models and random-effects models with the help of variable selection. The existing multivariate variable selection methods have been put forward mainly for fixed-effects models. In this paper, we have developed a novel approach, called PVA, for selecting random-effects covariates in multivariate random-effects regression models. PVA is covariate-assisted: we project the response data matrix onto the space spanned by each covariate and define a relative information index, the SNR, by the variance component ratio between this covariate and the background noise. The resulting SNR values are then used to rank covariates, and the highly ranked covariates are called principal variables. Through the PVA, we try to find a small number of principal variables that explain the maximum amount of variation in the data. Our approach allows us to consider correlations both between measurements and between responses (between rows and columns of the response data matrix), while the existing methods can only deal with correlation structures between responses. In a multivariate fixed-effects model with many responses, for each covariate we need to estimate many regression coefficients, which is a high-dimensional problem when the number of responses (or subjects) is very large. In contrast, in a multivariate random-effects model, for each covariate we only need to estimate its variance component, which is a low-dimensional problem. This difference provides the foundation for the PVA approach to multivariate variable selection. In multivariate regression models, all responses are related to the same set of covariates, which implies that the larger the number of responses, the more information on the covariates can be extracted from the response data.
Therefore, the accuracy of random-effects covariate selection is expected to increase as the number of responses increases. However, when all covariate variance components are zero, the models reduce to a class of fixed-effects models, where the methods in [17] can be employed but the PVA approach is not applicable.
We have established a novel theory on selection consistency for the proposed method when, along with other regularity conditions, the number of covariates p, the number of non-zero entries m_n in the covariance matrix, the number of measurements per response n and the number of responses J satisfy m_n log(n)/J = o(1) and log(p) = O(n^{α_0}). In particular, we have shown that under these regularity conditions, true active covariates are asymptotically separable from non-active covariates in terms of their power or SNR values as n and J tend to infinity. We have also shown that the nulled power has a higher value than a non-nulled power and is adaptive to response covariance structures; this explains why the PVA can outperform the existing multivariate variable selection procedures in the literature, as these methods are not adaptive to response covariance structures. We have conducted a wide range of simulation studies to compare the PVA with the multivariate group LASSO, the multivariate elastic-net, the multivariate LASSO, the multivariate sparse group LASSO and the MRCE. The simulation results have shown that the PVA can perform substantially better than its competitors in all the scenarios under consideration, while remaining scalable to the data size by iteratively calculating the power or SNR values. The simulation studies in Settings 4.1∼4.5 have shown that even when the response covariance matrix is not sparse or when J is much smaller than n, the PVA can still have a superior performance to the existing methods.
To demonstrate the use of the PVA in practice, we have applied it to a cancer drug dataset and identified a list of principal genes and the related network for predicting drug sensitivity to cancers in a "multiple genes, multiple drugs, multiple types of cancers" paradigm. The correlations of the selected genes at the RNA expression level are largely different from those at their functional level (their contributions to the IC50 values). The results have been further validated by the protein expression levels of these genes in 20 common cancers. We should mention that we also applied the cross-validation-based multivariate group LASSO and multivariate elastic-net to the same dataset; unfortunately, they ended up selecting a few thousand genes, which was very difficult to interpret in practice.

Appendix A: The existing approaches to multivariate variable selection
To introduce the multivariate group LASSO (MGL) and the multivariate elastic-net (MENET), we consider the following penalization problem,

min_B { (2n)^{-1} ||Y − XB||_F^2 + λ [ α Σ_{k=1}^p ||β_{k·}||_2 + (1 − α) ||B||_F^2 ] },

where ||B||_F = sqrt(Σ_{kj} β_{kj}^2) denotes the standard Frobenius norm of B = (β_{kj}), ||Y − XB||_F is the Frobenius norm of Y − XB, β_{k·} is the k-th row of B, n is the number of observations, 0 ≤ α ≤ 1 is a pre-determined constant and λ ≥ 0 is the penalty coefficient. The solution defines the multivariate group LASSO when α = 1, the ridge estimator when α = 0, and the multivariate elastic net when α = 0.5.
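For the α = 1 (group LASSO) case, a bare-bones proximal-gradient solver illustrates how the row-wise penalty zeroes out whole covariate rows of B. This is an illustrative sketch of the penalized criterion, not the glmnet/lsgl implementation, and the ||Y − XB||_F²/(2n) scaling of the loss is an assumption:

```python
import numpy as np

def multivariate_group_lasso(X, Y, lam, n_iter=500):
    """Proximal-gradient sketch for
        min_B ||Y - XB||_F^2 / (2n) + lam * sum_k ||B_k.||_2 ."""
    n, p = X.shape
    B = np.zeros((p, Y.shape[1]))
    step = n / np.linalg.norm(X, 2) ** 2      # 1/L with L = ||X||_2^2 / n
    for _ in range(n_iter):
        G = X.T @ (X @ B - Y) / n             # gradient of the smooth part
        B = B - step * G
        norms = np.linalg.norm(B, axis=1, keepdims=True)
        with np.errstate(divide="ignore", invalid="ignore"):
            shrink = np.maximum(0.0, 1.0 - step * lam / norms)
        B = np.where(norms > 0, shrink * B, 0.0)  # row-wise soft-threshold
    return B
```

Rows of B whose group norm falls below the scaled penalty are set exactly to zero, which is what makes the criterion a variable selection device.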
The multivariate LASSO (ML) and the multivariate sparse group LASSO (MSGL) can be derived from a similar penalization problem, with the row-wise group penalty Σ_k ||β_{k·}||_2 replaced by the entry-wise penalty Σ_{k,j} |β_{kj}| for the ML, and by a convex combination of the two penalties for the MSGL.

Appendix B: Extra theorems, technical details and proofs
In this appendix, for an n × n square matrix D, let ||D|| be the operator norm, i.e. the square root of the largest eigenvalue of DD^T. Slightly abusing notation, we now let ||D||_F denote the size-normalized Frobenius norm sqrt(tr(DD^T)/n), where tr(·) is the trace.
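The two norms used throughout the appendix can be checked numerically with a small sketch (function names are ours):

```python
import numpy as np

def op_norm(D):
    """Operator norm ||D||: square root of the largest eigenvalue of D D^T."""
    return np.linalg.norm(D, 2)

def sized_frob_norm(D):
    """Size-normalized Frobenius norm ||D||_F = sqrt(tr(D D^T) / n)."""
    n = D.shape[0]
    return np.sqrt(np.trace(D @ D.T) / n)
```

With this normalization, the identity matrix has ||I_n||_F = 1 for every n, while its operator norm is also 1.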

B.1. Theory on principal variable analysis with known covariance
We begin with an ideal setting where C is known. This includes the case J = ∞, in which we can estimate C exactly. We establish lower bounds for the SNRs below. The above proposition shows that the SNR-based map has a sharp lower bound of 1 when Λ = σ² I_n. When Λ ≠ σ² I_n, we instead apply the proposition to (Λ^{−1/2} Y, Λ^{−1/2} X) to obtain a Λ-dependent lower bound for the SNR values.
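The whitening step can be sketched as follows, using the symmetric inverse square root computed from the eigendecomposition (names are ours):

```python
import numpy as np

def whiten(Y, X, Lam):
    """Return (Lam^{-1/2} Y, Lam^{-1/2} X): after this transform the error
    covariance is proportional to the identity, so the sharp lower bound
    of 1 for the SNR map applies to the transformed data."""
    w, V = np.linalg.eigh(Lam)  # Lam is symmetric positive definite
    inv_half = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return inv_half @ Y, inv_half @ X
```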
To investigate the asymptotic properties of the power-based and the SNR-based maps, let A_{ν_0} = C − x_{ν_0} e^T_{ν_0} Σ e_{ν_0} x^T_{ν_0}, the remainder of C after subtracting the term x_{ν_0} e^T_{ν_0} Σ e_{ν_0} x^T_{ν_0}. In the next proposition, we show a local consistency of the predictive power with the underlying power e^T_{ν_0} Σ e_{ν_0} at ν_0: the power at ν_0 can be written as the underlying power plus the interferences with the predictors not in ν_0 and with the white noise. These interferences are negligible if the predictors outside ν_0 have zero powers.

Proposition B.2.
If both e^T_{ν_0} Σ e_{ν_0} and A_{ν_0} are invertible and x_{ν_0} has full column rank, then the predictive power matrix admits the decomposition described above. If σ_k² = 0 for k ∉ ν_0 and λ_min(x^T_{ν_0} x_{ν_0}/n) is bounded away from zero as the sample size n tends to infinity, then γ_{ν_0} = e^T_{ν_0} Σ e_{ν_0} + O(n^{−1}). We are now in a position to state a theorem on the global sparsistency property of the power map. In the theorem, we show that for an active predictor the predictive power has a positive limit, whereas for a non-active predictor the predictive power tends to zero. This allows us to separate active predictors from non-active ones by thresholding the power map.
(ii) Uniformly for any ν ⊆ [1 : p] \ ν_0 with |ν| ≤ rn, (iii) uniformly for any ν = ν_1 ∪ ν_2 with ν_1 ⊆ ν_0, ν_2 ⊆ [1 : p] \ ν_0 and |ν| ≤ rn, γ_ν can be partitioned into blocks. The above theorem also indicates that, compared to the underlying power matrix e^T_ν Σ e_ν, the predictive power matrix γ_ν may not be consistent if the collinearity between a pair of predictors does not converge to zero as n tends to infinity. This can be seen from the derivation below of the predictive power at a predictor k_j ∈ ν_0.
The following corollary says that under Condition (C1), the predictor k_j does have positive predictive power, although the power is deteriorated by the interference with other predictors. Therefore, if we employ the estimated predictive power to screen predictors and if Ĉ is consistent with C, then under Conditions (C0)∼(C3) the screening procedure has a sure screening property: for an appropriately chosen threshold, all predictors in ν_0 can be detected with probability approaching one.

Corollary B.1. Under the conditions in Theorem B.1, as n tends infinity, we have:
(i) Uniformly for any k_j ∈ ν_0, the predictive power of the predictor k_j can be expressed as (ii) uniformly for any k ∉ ν_0, the predictive power of the predictor k can be expressed as Let a be the current predictor under investigation and ν_1 ∪ ν_2 the set of predictors identified in the previous steps of the PVA, with size |ν_1 ∪ ν_2| < rn, where 0 < r ≤ 1, ν_1 ⊆ ν_0 and ν_2 ⊆ [1 : p] \ ν_0. For a = k_j ∈ ν_0, let e^a_{ν_0} = e^{{a}}_{ν_0}, a |ν_0|-dimensional column vector with the j-th coordinate equal to 1 and all other coordinates equal to zero. In the next theorem, we show that the global sparsistency property continues to hold for the nulled predictive power and that the nulling can improve the accuracy of power estimation. Then, under Conditions (C0)∼(C3), as n tends to infinity, we have:
where g_{aν_1} = e^T_{ν_1ν_0} (e^T_{ν_0} Σ e_{ν_0})^{−1} e^a_{ν_0}. The power-based variable screening may not be efficient due to the inhomogeneous power background σ² w^T_k w_k. This calls for the SNR-based variable screening. We show that active predictors can be asymptotically separated from non-active ones by means of the nulled SNR.

B.2. Theory on principal variable analysis with estimated covariance
We will show later in Lemma B.4 that under Conditions (C1)∼(C6) the optimal shrinkage covariance estimator Ĉ_hs is consistent with the true covariance C. This allows us to state the following theorems for the case where the unknown C is estimated by Ĉ_hs.
Theorem B.4. Suppose that Conditions (C0)∼(C6) hold and that τ_{nJ} n² = o(1) as both n and J tend to infinity. Then, we have: The above theorem implies that the sparsistency property holds for the estimated predictor power γ̂_a. Using γ̂_a, we can screen the predictors with a pre-specified threshold, say n^{−1+α_0} log(n), obtaining an estimated set of active predictors, ν̂_d = {1 ≤ a ≤ p : γ̂_a > n^{−1+α_0} log(n)}. We can prove the following sure screening property for ν̂_d: if, in addition, m_n log(n)/J = o(1) and n^{2+α_1} τ_{nJ} = o(1), then as both n and J tend to infinity, P(ν_0 = ν̂_d) → 1.
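The screening rule defining ν̂_d is a one-liner; in this sketch (names are ours), `gamma_hat` holds the estimated powers γ̂_a:

```python
import numpy as np

def screen_active(gamma_hat, n, alpha0):
    """Estimated active set: indices a whose estimated power gamma_hat[a]
    exceeds the pre-specified threshold n^(-1+alpha0) * log(n)."""
    threshold = n ** (-1.0 + alpha0) * np.log(n)
    return np.flatnonzero(np.asarray(gamma_hat) > threshold)
```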
To prove (ii), we apply Lemma B.1 to calculate x T ν C −1 x ν . We have which completes the proof of (ii).
To prove (iii), we note that Invoking Lemma B.1, we have Similarly, we can show that The proof is completed by applying Lemma B.1 to each block.

Proof of Theorem B.2.
To prove (i), let ν denote ν_1 ∪ ν_2 ∪ {a}. We partition x^T_ν C^{−1} x_ν into the block matrix below: and note that x_{ν_1} = x_{ν_0} e_{ν_1 ν_0}. Invoking Lemma B.1, we have We partition (x^T_ν C^{−1} x_ν)^{−1} in the same way as we did above for x^T_ν C^{−1} x_ν: Then, by definition we have Let F_{ν_2∪{a}} = R_{(ν_2∪a)(ν_2∪a)} − R_{(ν_2∪a)ν_0} R^{−1}_{ν_0ν_0} R_{ν_0(ν_2∪a)} and define Σ_{ν_1} as in the proof of Theorem B.1. We have Note that the smallest eigenvalue of any main block of a matrix is no smaller than that of the whole matrix. Therefore, by Condition (C3) we have This, together with equations (B.1) and (B.2), shows that γ_{a|ν_1∪ν_2} = O(n^{−1+α_0}). The proof of (i) is completed.
To prove Theorem B.3, we need one more lemma, as follows.
Proof of Lemma B.2. Note that The proof is completed by some algebraic manipulation.
The proof is completed.
Note that in Lemma B.4, under Conditions (C1)∼(C6), we show that the optimal shrinkage covariance estimator Ĉ_hs is consistent with the true covariance C. This allows us to extend Theorems B.1∼B.3 to the case where the unknown C is estimated by Ĉ_hs.
We use the methods developed in [7, 23] to prove the lemma. Following the proof in [23], we can find constants d_t, t = 1, ..., 5, such that for any u > 0, We also have Finally, as n and J tend to infinity, Combining these with the other equalities shown before, we complete the proof.
Let m_n be the number of non-zero entries in the covariance matrix C. In the next lemma, we show the convergence rates of the thresholded estimator. For h = 0, the above results continue to hold if m_n is replaced by n.
For a ∉ ν_0, we have γ̂_a = O_p(n^{−1+α_0} + n² τ_{nJ}) = O_p(n^{−1+α_0}), which implies that, with probability tending to one, a ∉ ν̂_d as n and J tend to infinity.