Envelope method with ignorable missing data

The envelope method was recently proposed to reduce the dimension of the responses in multivariate regression. However, when data are missing, applying the envelope method to the complete cases may lead to biased and inefficient results. In this paper, we generalize envelope estimation to the case where the predictors and/or the responses are missing at random. Specifically, we incorporate the envelope structure into the expectation-maximization (EM) algorithm. As the parameters under the envelope method are not pointwise identifiable, the EM algorithm for the envelope method is not straightforward and requires a special decomposition. Our method is guaranteed to be at least as efficient as the standard EM algorithm. Moreover, it has the potential to outperform the full data MLE. We give asymptotic properties of our method under both normal and non-normal errors. The efficiency gain over the standard EM is confirmed in simulation studies and in an application to the Chronic Renal Insufficiency Cohort (CRIC) study.


Introduction
Recently, a new dimension reduction method called the envelope method was proposed for multivariate regression [11]. Unlike standard dimension reduction methods, the envelope method assumes redundancy among the responses rather than among the predictors. Specifically, it is assumed that some linear combinations of the response variables do not contribute to the regression. Under such a condition, the envelope method is shown to gain efficiency over ordinary least squares, which regresses one response at a time, ignoring the other responses. Similar redundancy structures have also been extended to hold among the predictors or among both predictors and responses. It is known that the estimation of the central space may suffer from bias when the correlations between variables are high [7]. The envelope conditions circumvent the challenge of identifying the central space in the standard dimension reduction problem when such correlations are high, at the cost of obtaining a bigger space containing the parameters of interest, and thus make the envelope estimates more reliable.
A prominent problem when a large number of responses and predictors are collected is the missingness of responses or predictors. Missing data may arise when a subject refuses to respond to certain questions or when the data are simply not collected. The missing data mechanism is said to be missing at random (MAR), or ignorable, if it depends only on the observed data, and missing not at random (MNAR), or nonignorable, otherwise. As [33] suggested, in most MAR scenarios, a complete case analysis would lead to inefficient or possibly biased results. We assume the missingness mechanism is MAR throughout this paper.
In this paper, we generalize the envelope method for data with missing predictors and responses. As the parameters under the envelope method are not pointwise identifiable, such a generalization requires a special decomposition. The importance of the research lies in several aspects. First, with rapidly advancing technology, it is common that high-dimensional responses are collected to characterize multiple aspects of individuals. Biased and inefficient results will be obtained if the analysis deletes all the observations with missing values. Second, while the standard missing data methods typically suffer from an efficiency loss, as compared to the full data analysis, the method that incorporates dimension reduction can potentially recover substantial efficiency. Third, our proposed method to recover the missing information can also be generalized to the predictor envelope model where the redundancy is assumed among the predictors rather than the responses, as well as to the case where the redundancy is present among both the responses and the predictors. And lastly, to the best of our knowledge, our paper is among the first few in the dimension reduction literature to discuss the case where both responses and predictors are subject to missingness.
We organize the paper as follows. In Section 2, we introduce the notations and review the envelope models. In Section 3, we present the observed data likelihood and clarify the difficulty of applying the envelope method directly. In Section 4, we propose an EM envelope algorithm. Simulations are given in Section 5, where we compare the EM envelope method with the existing methods. In Section 6, we apply the EM envelope to the Chronic Renal Insufficiency Cohort (CRIC) data. In Section 7, we present a brief discussion. Section 8 contains the link to our R package.

Preliminary
Let Y_i = (Y_i1, . . . , Y_ir)^T and X_i = (X_i1, . . . , X_ip)^T denote the multivariate responses and predictors for individual i, where T denotes the transpose and i = 1, . . . , n. Also, let Y = (Y_1, . . . , Y_n) ∈ R^{r×n} and X = (X_1, . . . , X_n) ∈ R^{p×n}, where Y ∈ R^{r×n} denotes that Y is an element of the set of all real matrices with dimension r × n. Consider the multivariate linear regression model Y_i = βX_i + ε_i, where the ε_i are independently and identically distributed (i.i.d.) with mean 0 and variance Σ, and β ∈ R^{r×p}. We first assume normality of the error when deriving the EM envelope estimator; we later establish (Propositions 2 and 3) the robustness of our estimator when normality is possibly violated. Let R_Xij = 1 if X_ij is observed and R_Xij = 0 otherwise, for j = 1, . . . , p. Similarly, let R_Yik denote the missing indicator for Y_ik, for k = 1, . . . , r. Let R_i = (R_Xi1, . . . , R_Xip, R_Yi1, . . . , R_Yir)^T denote the vector of missingness indicators of all variables for individual i. Let Y_i,mis and X_i,mis denote the vectors of the missing responses and predictors for individual i.
Let Y_i,obs and X_i,obs denote the vectors of the observed responses and predictors for individual i. Under this notation, different individuals may have different missing responses and predictors, i.e., the lengths and components of Y_i,obs and X_i,obs differ from individual to individual. Let D_i,obs = (X_i,obs, Y_i,obs)^T and D_i,mis = (X_i,mis, Y_i,mis)^T denote the observed data and the missing data for individual i, respectively. Let y_ik and x_ij denote the possible values of Y_ik and X_ij. Then y_i = (y_i1, . . . , y_ir)^T and x_i = (x_i1, . . . , x_ip)^T are the possible values of Y_i and X_i. Let x_i,obs and x_i,mis denote the values of the observed and missing predictors; define y_i,obs and y_i,mis similarly. We assume the missingness is ignorable:

Assumption 1. P(R_i = r_i | D_i,obs, D_i,mis) = P(R_i = r_i | D_i,obs).

Assumption 1 implies that, given the observed data, the failure to observe a variable does not depend on the unobserved data. This particular type of missingness is called missing at random (MAR) or ignorable missingness. A complete case analysis is inefficient and can be seriously biased [32]. Throughout the paper, we assume both covariates and responses are missing at random, as also assumed in [6] and [26].
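The indicator notation can be mirrored directly in code; a minimal sketch with hypothetical values, using NaN to mark a missing entry:

```python
import numpy as np

def missingness_pattern(x, y):
    """Return R_i = (R_Xi1..R_Xip, R_Yi1..R_Yir) for one individual,
    where 1 marks an observed entry and NaN encodes a missing one."""
    return np.concatenate([~np.isnan(x), ~np.isnan(y)]).astype(int)

# Hypothetical individual with p = 3 predictors and r = 2 responses;
# X_i2 and Y_i1 are missing.
x_i = np.array([1.2, np.nan, 0.7])
y_i = np.array([np.nan, 3.4])
R_i = missingness_pattern(x_i, y_i)   # -> [1, 0, 1, 0, 1]
```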
In multivariate regression with fully observed data, the envelope method [11] is motivated by the observation that some characteristics of the responses are unaffected by changes in the predictors. For example, in a randomized trial, the difference between repeated measures of the blood pressure of a patient in the treatment group (or the control group) may only reflect aging over time rather than the treatment effect. A matrix O ∈ R^{r×r} is orthonormal if and only if it satisfies O^T O = I_r, where I_r denotes the identity matrix of dimension r. Consider an orthonormal matrix (Γ, Γ_0) ∈ R^{r×r}, with Γ ∈ R^{r×u}, Γ_0 ∈ R^{r×(r−u)}, and 0 ≤ u ≤ r, such that

Condition 1. span(β) ⊆ span(Γ);
Condition 2. Σ = ΓΩΓ^T + Γ_0Ω_0Γ_0^T for some Ω ∈ R^{u×u} and Ω_0 ∈ R^{(r−u)×(r−u)}.

The subspace span(Γ) satisfying Conditions 1 and 2 is not unique, but Cook et al. [11] defined the envelope to be the smallest such subspace. The dimension u is known as the envelope dimension. Notice that the decomposition of Σ is equivalent to cor(Γ_0^T Y, Γ^T Y | X) = 0. From span(β) ⊆ span(Γ), the regression parameter can be written as β = Γη, where η ∈ R^{u×p}. Therefore, the envelope model can also be written as Y_i = ΓηX_i + ε_i with Var(ε_i) = ΓΩΓ^T + Γ_0Ω_0Γ_0^T. Under Condition 2, we have Var(Γ^T Y) = Ω and Var(Γ_0^T Y) = Ω_0; the matrices Ω and Ω_0 can be thought of as the variances of Y under the new basis (Γ, Γ_0). The null correlation only guarantees that the information in Γ_0^T Y is immaterial in the first two moments. Under the normality assumption of the error, Conditions 1-2 are equivalent to the following two conditions:

Condition 3. The distribution of Γ_0^T Y does not depend on X;
Condition 4. Γ_0^T Y is independent of Γ^T Y given X.

Conditions 3-4 are equivalent to Γ_0^T Y being independent of (Γ^T Y, X). Although the original envelope was developed using Conditions 1-2, we directly define the envelope using Conditions 3-4. The envelope under Conditions 3-4 is in general no smaller than that defined by Conditions 1-2. We prefer Conditions 3-4 because the interpretation of the envelope is more straightforward, especially when normality is violated.
We give a simple example of the envelope model. Consider r = 2 responses and a single predictor with

Y_1 = βX + ε_1,  Y_2 = −βX + ε_2,

where ε_1 and ε_2 follow two normal distributions with a common variance, and they are independent of each other. The predictors X do not affect the sum of the responses, Y_1 + Y_2. Additionally, it can be verified that Y_1 − Y_2 is independent of Y_1 + Y_2; thus, Y_1 + Y_2 can be completely discarded in the regression. That is, the regression of Y on X can be replaced with the regression of Y_1 − Y_2 on X. In this example, span(Γ) = span{(1, −1)^T} and span(Γ_0) = span{(1, 1)^T}. The combination of responses that is involved in the regression, Γ^T Y, is called the material part of Y, and the part that is uninvolved, Γ_0^T Y, is called the immaterial part of Y. Hence, the main focus of the envelope method is to find the column space of Γ, i.e., span(Γ), that fully contains the information about β, i.e., to find an envelope of β.
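The immaterial direction in this example can be checked numerically; the sketch below simulates a model of this form with an assumed coefficient and i.i.d. standard normal errors, and verifies that Y_1 + Y_2 is empirically uncorrelated with both X and the material part:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, 1.5                 # beta is an illustrative coefficient

x  = rng.normal(size=n)
e1 = rng.normal(size=n)                # error in Y1
e2 = rng.normal(size=n)                # error in Y2 (same variance as e1)

y1 =  beta * x + e1
y2 = -beta * x + e2

# Y1 + Y2 = e1 + e2: free of X, and independent of Y1 - Y2 = 2*beta*x + e1 - e2
print(np.corrcoef(y1 + y2, x)[0, 1])        # near 0
print(np.corrcoef(y1 + y2, y1 - y2)[0, 1])  # near 0
```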
Once an estimate Γ̂ of the basis Γ is obtained, β̂_env is obtained by projecting the maximum likelihood estimator β̂ onto the estimated envelope space, β̂_env = P_Γ̂ β̂, where P_A stands for the projection matrix onto span(A). Figure 1 demonstrates the intuition behind the efficiency gain of the envelope method when there is no missing data, or equivalently, with the full data. Consider two groups of individuals (the group with X = 1 is denoted by triangles and the group with X = 0 by circle dots), where each point (triangle or circle dot) denotes one individual. Two responses Y_1 and Y_2 are collected for each individual. Suppose we are interested in estimating the group difference in Y_1. The standard maximum likelihood estimation (MLE) projects all the data onto the Y_1 axis, ignoring the information in Y_2 completely. The density curves of the two group distributions of Y_1 are given at the bottom of Figure 1(a). The two curves are hard to distinguish as they almost overlap. The full data MLE of the group difference is 0.11 with a bootstrap standard error of 0.12 and a p-value of 0.37. Thus, it is hard to distinguish between the two groups. While the true difference between the two group means of Y_1, 0.32, is contained in the 95% confidence interval of the full data MLE, the large variability of the estimator makes the point estimate deviate from the true parameter value.
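The projection step β̂_env = P_Γ̂ β̂ is a one-liner once Γ̂ is available; a sketch with placeholder values (Γ̂ and β̂ here are made-up numbers, not estimates from any real fit):

```python
import numpy as np

def proj(A):
    """Projection matrix P_A = A (A^T A)^{-1} A^T onto span(A)."""
    return A @ np.linalg.solve(A.T @ A, A.T)

# Placeholder values: r = 3 responses, p = 2 predictors, and an
# estimated envelope basis Gamma_hat of dimension u = 1.
beta_hat  = np.array([[1.0, 0.5],
                      [2.0, 1.0],
                      [0.1, 0.2]])
Gamma_hat = np.array([[1.0], [2.0], [0.0]]) / np.sqrt(5.0)

beta_env = proj(Gamma_hat) @ beta_hat   # envelope estimate P_Gamma_hat @ beta_hat
```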
The idea of the envelope method is to reduce the noise in the original data by projecting each observation onto the direction that contains all the information relevant to the regression. The two groups are best distinguished along the direction of the black solid line. In contrast, the two groups have almost identical distributions along the direction orthogonal to the black solid line. That is, the information orthogonal to the black solid line does not contribute to the distinction between the two groups. Thus, eliminating that part of the variation does not sacrifice any relevant information for the regression, but instead makes the regression more efficient. An estimate of the black solid line is shown as the purple dashed line in Figure 1(b). All the points are thus first projected onto the estimated direction, giving Γ̂^T Y, and then projected onto the Y_1 axis. For example, the data point A is first projected onto the estimated envelope direction with intersection B, and then projected onto the Y_1 axis. [11] showed that the envelope method can achieve substantial efficiency gains when the envelope direction is aligned with the eigenspaces of Σ that correspond to relatively small eigenvalues; in that way, linear combinations of Y with larger variances can be eliminated by the projection. In Figure 1(b), the direction that better distinguishes the two groups is aligned with the direction in which the data have less variability, so the envelope method is expected to provide a substantial efficiency gain. The density curves of the two groups under the envelope estimation are shown at the bottom of Figure 1(b), and they have much smaller spreads. The envelope estimator of the group difference is 0.32 with a standard error of 0.03 and a p-value < 0.001. Thus, it is much easier to distinguish between the two groups. Now, consider the case where the predictors X are fully observed but some values of the responses are missing (see Figure 2).
The missingness depends only on observed quantities, so the mechanism is MAR; the missing rate is 30% for Y_1 and 20% for Y_2. A hollow triangle represents Y_1 missing, and a hollow circle dot represents Y_2 missing. The standard EM method is shown in Figure 2(a). Although asymptotically unbiased, the standard EM estimate of the group difference is 0.11. Similar to the full data MLE, the point estimate of the standard EM also deviates from the true parameter value due to large variability. The bootstrap standard error is 0.12 with a p-value of 0.37. The spreads of the two group densities are again relatively large, resulting in a relatively inefficient estimate.
The existing envelope methods for solving for Γ all require the data to be fully observed [11, 15]. Figure 2(b) shows the complete case envelope, where all the observations with missing data are deleted from the analysis. The estimated complete case envelope direction is shown as the blue dashed line in Figure 2(b), which is far from the true envelope direction (black solid line). This leads to severe bias: even the sign of the estimated parameter is incorrect. The complete case envelope estimate is −1.63 with a bootstrap standard error of 0.15 and a p-value < 0.001.
Our method is shown in Figure 2(c). Different from the complete case analysis, we use both the complete cases and the partially observed information. Our proposed method is asymptotically unbiased when the missingness is MAR. The estimated envelope direction is shown as the red dashed line. Our method recovers the envelope direction and achieves a significant efficiency gain over the standard EM, as the density curves have much smaller spreads. The EM envelope estimator is 0.31 with a bootstrap standard error of 0.04 and a p-value < 0.001. It is interesting to see that our method may even outperform the full data MLE, as the efficiency gain from the envelope method outweighs the information loss due to missing data in this illustrative example.

Figure 2: Intuitive illustration of the envelope method in the presence of missing data. Two groups are shown using circle dots (X = 0) and triangles (X = 1). Hollow circle dots or triangles indicate that one of the components of Y is missing: a hollow triangle has Y_1 missing, and a hollow circle dot has Y_2 missing. The solid line is the true envelope direction; the dashed lines are the estimated envelopes using different methods. The density curves of the two groups using different methods are shown at the bottom of each subfigure.

The observed data likelihood
The envelope method proposed by [11] utilizes the full data likelihood function L_full = ∏_{i=1}^n f(y_i | x_i; η, Γ, Ω_0, Ω) to obtain the MLE of the parameters. In the presence of missing data, we replace the full data likelihood with the observed data likelihood

L_obs ∝ ∏_{i=1}^n ∫∫ f(y_i | x_i; η, Γ, Ω_0, Ω) f(x_i; ρ) dy_i,mis dx_i,mis,

where ρ is the parameter of the predictors' distribution and ∝ denotes proportionality, i.e., a multiplicative constant is omitted. Let χ_i,mis denote the set of predictors in X_i that are missing for individual i. For example, if X_i,mis = X_i1, then χ_i,mis = {X_i1}; write χ_i,mis = ∅ when all p predictors are observed for this individual. Since ∫ f(y_i,obs, y_i,mis | x_i; η, Γ, Ω_0, Ω) dy_i,mis = f(y_i,obs | x_i; η, Γ, Ω_0, Ω), we can simplify the observed data likelihood as

L_obs ∝ ∏_{i: χ_i,mis = ∅} f(y_i,obs | x_i; η, Γ, Ω_0, Ω) f(x_i; ρ) × ∏_{i: χ_i,mis ≠ ∅} ∫ f(y_i,obs | x_i; η, Γ, Ω_0, Ω) f(x_i; ρ) dx_i,mis.

The first part of the observed data likelihood corresponds to the likelihood of individuals with fully observed predictors; the second part corresponds to the likelihood of individuals with missing predictors. Hence, the observed data likelihood utilizes more information than the complete case likelihood.
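Under joint normality of (X_i, Y_i), which is the setting of Example A1, each individual's contribution to the observed data likelihood reduces to the normal density of the observed coordinates alone, so no numerical integration is needed; a minimal sketch:

```python
import numpy as np

def normal_logpdf(z, mu, S):
    """Log-density of N(mu, S) evaluated at z."""
    k = z.size
    dev = z - mu
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (k * np.log(2.0 * np.pi) + logdet + dev @ np.linalg.solve(S, dev))

def obs_loglik_i(d, mu, Sigma):
    """Observed-data log-likelihood contribution of one individual under a
    joint normal model for D_i = (X_i, Y_i): integrating a normal density
    over the missing (NaN) coordinates just drops them, leaving the
    marginal normal density of the observed coordinates."""
    obs = ~np.isnan(d)
    return normal_logpdf(d[obs], mu[obs], Sigma[np.ix_(obs, obs)])

# Hypothetical individual: first coordinate missing, second observed at 0
d_i = np.array([np.nan, 0.0])
ll = obs_loglik_i(d_i, np.zeros(2), np.eye(2))
```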
The observed data likelihood is in general hard to calculate as it involves a multivariate integral. A closed form exists under certain distributions: Example A1 in the Appendix derives the closed form of the observed data likelihood when the predictors and responses follow a joint normal distribution. In general, however, the integral in the observed data likelihood may result in a complicated form. [13] pointed out that the envelope method performs poorly when the first order derivative of the objective function does not have a closed form. Even when the observed data likelihood is available in closed form, the parameters are typically intricately intertwined in the likelihood. Together with the fact that the parameters are not pointwise identifiable, this makes it challenging to compute the maximum likelihood estimates under an envelope structure. Such a challenge was also identified in [13] in the context of generalized linear models. In this paper, we propose an EM envelope algorithm that can identify and estimate the envelope space with missing data.

The EM updates
Let l_full(φ | D) = log L_full(φ | D) denote the full data log-likelihood, where φ = (η, Γ, Ω_0, Ω, ρ). In the E-step, we compute Q(φ | φ_t) = E{l_full(φ | D) | D_obs; φ_t}, the conditional expectation of the full data log-likelihood given the observed data under the current parameter value φ_t. Recalling that Σ_1 = ΓΩΓ^T and Σ_2 = Γ_0Ω_0Γ_0^T, we can also use φ = (η, Γ, Σ_1, Σ_2, ρ) as the reparameterized set of parameters. After the E-step, we carry out the M-step. However, the parameters under the envelope method are not pointwise identifiable [11]: for any orthogonal matrix A, replacing Γ with ΓA results in an equivalent model. The EM algorithm for the envelope method is therefore not straightforward and requires a special decomposition in the M-step. We imitate the decomposition of the full data likelihood in [11] to isolate the parameter being optimized from the other parameters, decomposing Q into a part involving ρ and a part Q_2(β, Σ | φ_t) involving the regression parameters. To find the maximizer of Q_2(β, Σ | φ_t), note that under envelope Conditions 3-4, Σ = Σ_1 + Σ_2 and span(β) ⊆ span(Σ_1), which implies Σ_2β = 0. Additionally, by Theorem 3 in [21], Σ^{-1} = Σ_1^† + Σ_2^†, where † indicates the Moore-Penrose inverse. We can then rewrite Q_2 in terms of Σ_1 and Σ_2, with det_0(A) denoting the product of the non-zero eigenvalues of A. Further, we use Lemma 4.3 in [11], reviewed as Lemma 5 in the Appendix: for a given Γ, Lemma 5 yields the partial maximizers of Q_2 in closed form. The elements of Γ are not pointwise identifiable; however, as the resulting objective function depends on Γ only through span(Γ), we only need to estimate the column space of Γ, which is identifiable. The MLE of span(Γ) can be obtained using full Grassmannian optimization [11, 8].
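Under the working normal model, the E-step reduces to standard conditional-normal formulas for the missing coordinates. The sketch below implements this generic normal E-step for one individual; the envelope-specific decomposition enters only in the M-step and is not shown:

```python
import numpy as np

def e_step_row(d, mu, Sigma):
    """Generic normal E-step for one individual: fill the missing (NaN)
    coordinates of d with their conditional means given the observed ones
    under the current working fit N(mu, Sigma), and return the conditional
    covariance needed to complete the second-moment statistics."""
    obs, mis = ~np.isnan(d), np.isnan(d)
    d_hat = d.copy()
    C = np.zeros((d.size, d.size))
    if mis.any():
        S_oo = Sigma[np.ix_(obs, obs)]
        S_mo = Sigma[np.ix_(mis, obs)]
        d_hat[mis] = mu[mis] + S_mo @ np.linalg.solve(S_oo, d[obs] - mu[obs])
        C[np.ix_(mis, mis)] = Sigma[np.ix_(mis, mis)] - S_mo @ np.linalg.solve(S_oo, S_mo.T)
    return d_hat, C

# Example: first coordinate missing, correlation 0.5 with the observed one
d_hat, C = e_step_row(np.array([np.nan, 1.0]),
                      np.zeros(2),
                      np.array([[1.0, 0.5], [0.5, 1.0]]))
# d_hat[0] = E[D_1 | D_2 = 1] = 0.5 and C[0, 0] = Var(D_1 | D_2) = 0.75
```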

Selection of the envelope dimension
The selection of the envelope dimension can be viewed as a diagnostic or model selection step under the envelope framework. Model selection criteria for missing data problems, such as the likelihood ratio test and information criteria including AIC and BIC, typically involve the observed data likelihood. As mentioned, the observed data likelihood may be complicated and not available in closed form. Hence, it is ideal if the model selection criteria can be computed directly from the EM output. [27] proposed information criteria for missing data problems, using the fact that the observed data log-likelihood decomposes as l_obs(φ) = Q(φ | φ) − H(φ | φ), where H(φ′ | φ) = E{log f(D_mis | D_obs; φ′) | D_obs; φ}. The Q function can be computed from the EM output, and the H function can be analytically approximated as part of the EM output.
[18] recommended using the BIC to select the envelope dimension, because the AIC tends to overselect the dimension and the likelihood ratio test is inconsistent. Thus, we generalize the BIC to the missing data problem following [27] as BIC_{H,Q} = −2Q(φ̂ | φ̂) + 2H(φ̂ | φ̂) + pu log n. The penalty term is pu log n because under the envelope model there are pu + r(r+1)/2 unknown parameters in total, and only pu of them vary with the dimension u. The asymptotic properties of BIC_{H,Q} are given in [27].
The computation of the H function is not straightforward since it may not have a closed form. [27] proposed approximating the H function through a truncated Hermite expansion with MCMC sampling. Alternatively, an approximation BIC_Q = −2Q(φ̂ | φ̂) + pu log n can be obtained by omitting the H term. When the proportion of missing information is small, the use of BIC_Q is adequate.
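The resulting selection rule is easy to script; the sketch below assumes hypothetical Q values returned by EM fits at candidate dimensions and picks the u minimizing BIC_Q:

```python
import numpy as np

def bic_q(Q_value, n, p, u):
    """BIC_Q = -2 Q(phi_hat | phi_hat) + p*u*log(n): the approximation of
    BIC_{H,Q} obtained by dropping the H term.  Only the p*u parameters
    that vary with the envelope dimension u enter the penalty."""
    return -2.0 * Q_value + p * u * np.log(n)

# Hypothetical Q values from EM fits at candidate dimensions u = 1, 2, 3
Q_at_u = {1: -100.0, 2: -90.0, 3: -89.0}
u_hat = min(Q_at_u, key=lambda u: bic_q(Q_at_u[u], n=100, p=5, u=u))
```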
The information criterion relies on correct specification of the distribution. Alternatively, we can generalize a bootstrap method for choosing the envelope dimension u, which is more robust to misspecification of the distributions. A similar bootstrap method was proposed by [45, 17] and has been widely used for selecting the dimension of the central space in the dimension reduction literature [30, 46, 48]. We propose to first fix the dimension u of the basis matrix Γ and then bootstrap the data b times to obtain a sequence of envelope space estimates Γ̂_1, . . . , Γ̂_b. If the proposed dimension u* exceeds the true dimension u, then span(Γ̂) can be any space of dimension u* that contains span(Γ), and thus the bootstrap estimates should suffer from large variability compared with the estimate Γ̂ from the original data. Therefore, we choose the largest dimension u* such that the bootstrap estimated spaces are the most similar to Γ̂. To evaluate the variability of Γ̂_1, . . . , Γ̂_b, we use the vector correlation coefficient q² proposed by [25]. Suppose A and B ∈ R^{r×u} are semi-orthonormal matrices; then q²(A, B) = det(A^T B B^T A). We see that q²(A, B) ∈ [0, 1], and a higher value of q² indicates a higher correlation between the two subspaces; when q²(A, B) = 1, span(A) = span(B). Hence, we choose the largest dimension u* such that the average of q²(Γ̂_j, Γ̂) over j = 1, . . . , b is the largest.
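Assuming the vector correlation takes the standard form q²(A, B) = det(AᵀB BᵀA), i.e., the product of the squared canonical correlations between the two subspaces, it can be computed in a few lines:

```python
import numpy as np

def q2(A, B):
    """Vector correlation between span(A) and span(B) for semi-orthonormal
    A, B in R^{r x u}: q^2 = det(A^T B B^T A).  Equals 1 iff the two
    subspaces coincide, and 0 if they share an orthogonal direction."""
    M = A.T @ B
    return float(np.linalg.det(M @ M.T))

# Two one-dimensional subspaces of R^3 spanned by coordinate axes
e1 = np.array([[1.0], [0.0], [0.0]])
e2 = np.array([[0.0], [1.0], [0.0]])
```

Because q² depends only on the spans, rotating the basis columns within the same subspace leaves it unchanged, which is exactly why it is a sensible similarity measure for the non-identifiable basis Γ.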

Asymptotics
The following propositions guarantee the efficiency gain and asymptotic normality of the EM envelope estimator. Specifically, Proposition 1 establishes the asymptotic properties when the densities of both ε and X are correctly specified and that of ε is normal. Proposition 2 extends the result to the case where the distribution of X is correctly specified but ε has a misspecified normal working density. Proposition 3 extends the result further to the case where ε and X both have misspecified normal working densities. Let l* denote the log-likelihood under the working model. Let s_n(φ) = ∇l*(φ) and M_n(φ) = −E{∇²l*(φ)}, where ∇ denotes the gradient with respect to a general parameter φ. We state our regularity conditions first.
(A1) (Observed likelihood) L_obs is unimodal in the parameter space Φ, i.e., the probability distribution has a single maximum, attained at only one point φ_0. If, for example, X_i1 follows a Binomial distribution, X_i2 follows a normal distribution, and X_i1 is independent of X_i2, then regularity conditions (B1)-(B2) hold.
In Examples 2 and 3 above, the observed data follow a Gaussian mixture distribution, which is multimodal and thus violates (A1). However, condition (A1) is only a sufficient condition for the EM algorithm to converge to the MLE. The convergence of the EM algorithm under the Gaussian mixture model is well studied, and therefore we can impose alternative assumptions to substitute for condition (A1) in this case. For example, if the conditions in Theorem 3.3 of Zhao et al. [47] hold, the EM algorithm in Examples 2 and 3 converges to the MLE.
The parameters of the envelope model are φ = (η, Γ, Ω, Ω_0, ρ). We are interested in the properties of β, Σ and ρ, which are functions of φ. Let θ denote our parameter of interest, and let θ̂_em·env and θ̂_em·std denote the EM envelope and the standard EM estimators at convergence of the EM sequence. The following propositions can be proved using the results in [39].
Matrices C_r and E_u are defined in the Appendix. Hence, V_std − V_env ≥ 0, which indicates the efficiency gain of the EM envelope estimator.
When the envelope dimension u = r, the envelope estimator reduces to the standard maximum likelihood estimator. That is, even when the envelope assumptions do not hold, the EM envelope estimator performs as well as the standard EM estimator. Also, following a similar argument as in [11], if the variability of the immaterial part is relatively large, then the efficiency gain will be substantial.
Propositions 2 and 3 below extend Proposition 1 and provide the asymptotics of the missing data envelope estimator when the normality of ε_i is violated. Lemmas 1-4 provide the asymptotics for the standard estimator.

Normal errors
Jia et al. [29] compared the envelope method with some competing estimators such as ridge regression and the Curds and Whey procedure of [3]. They concluded that the envelope model has the best performance when u < p < r < n in the classical domain. Therefore, to avoid duplication, we do not consider those competing estimators here. In this subsection, we compare six estimators: the EM envelope estimator β̂_em·env, the complete case (CC) envelope estimator β̂_cc·env, the full data envelope β̂_full·env, the standard EM estimator β̂_em·std, the standard complete case (CC) estimator β̂_cc·std, and the full data MLE β̂_full·std. The complete case estimators only use the observations without any missing predictors or responses, whereas the full data estimators use the full data without any missingness. In practice, the full data estimators cannot be calculated with missing data; the full data envelope sets a theoretical maximum for the efficiency gained from incorporating the envelope structure. We carry out the simulations in the following steps.
Step 3. Generate the missingness as follows. Set three missingness mechanisms for the predictors, each of the logistic form logit P(R_Xi,j = 1) depending only on fully observed quantities, and five such mechanisms for the responses, logit P(R_Yi,k = 1). For each individual, we randomly choose one missingness mechanism for the predictors and one for the responses. Then, we generate the missingness indicators (R_Xi,1, . . . , R_Xi,p, R_Yi,1, . . . , R_Yi,r), for i = 1, . . . , n, and obtain the observed data for the predictors and responses.
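The exact coefficients of the logistic mechanisms are not reproduced here; a generic MAR generator of this type, with illustrative coefficients, can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(1)

def mar_indicator(obs_part, a, b):
    """Missing-at-random indicator: the probability that a value is
    observed depends only on fully observed quantities,
    logit P(R = 1) = a + b * obs_part.  The coefficients a, b here are
    illustrative, not those used in the paper's simulations."""
    p_obs = 1.0 / (1.0 + np.exp(-(a + b * obs_part)))
    return (rng.uniform(size=np.shape(obs_part)) < p_obs).astype(int)

# e.g. let the missingness of Y_i1 depend on an always-observed X_i1
x1 = rng.normal(size=1000)
R_y1 = mar_indicator(x1, a=2.0, b=1.0)   # 1 = observed, 0 = missing
```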
Under the missingness mechanisms above, each predictor suffers from about 10%-15% missingness and each response about 5%-10%. In our simulations, to simplify the calculation and reduce the computational burden, we apply the 1-D algorithm proposed by [15] to solve for Γ. The 1-D algorithm only provides a √n-consistent estimate of Γ rather than the most efficient estimate; however, we still find good performance of the EM envelope method with the 1-D algorithm. Details about the algorithm are in the Appendix. The median MSEs are 4.44 × 10^-5, 2.00 × 10^-4, 1.02 × 10^-5, 5.34 × 10^-2, 0.69 and 5.23 × 10^-2 for the EM envelope, the complete case envelope, the full data envelope, the standard EM, the standard complete case analysis and the full data MLE, respectively. Detailed comparisons of the six estimators are given in Figure 3 below and Table 1 in the Appendix. For the EM envelope estimator, using BIC_Q to choose the envelope dimension, out of 1000 simulations we correctly estimated the envelope dimension u = 3 with an accuracy of 98.6%; the dimension u = 2 was selected 12 times and u = 4 was selected 2 times. The overselection u = 4 still provides a correct model, although the point estimate may not be as efficient as that under the correct u; the underselection u = 2 could introduce some bias. As expected, the standard complete case analysis suffers from both large variance and large bias. In contrast, the EM envelope is asymptotically unbiased and the most efficient among the four estimators using the observed data, despite the occasional underselection of u. In this simulation setting, the variance of the immaterial part of the responses is relatively large. Thus, by eliminating the variability of the immaterial part, the EM envelope estimate outperforms the standard EM. This confirms the efficiency gain in Proposition 1.
Similar to the illustrative example in Section 2, the EM envelope also outperforms the full data MLE in this simulation, emphasizing the advantage of incorporating a dimension reduction method to recover the efficiency loss due to missing data. The performance of the EM envelope is close to the full data envelope in this case.
In this specific setting, the complete case envelope outperforms the standard EM. This is interesting because the complete case envelope is biased while the standard EM is not. However, the ordering of the two is not certain in general. The complete case data may not have an envelope structure, although in finite samples we can usually find one. Intuitively, if the proportion of missingness is low, the complete case envelope estimate resembles the EM envelope estimate, and thus outperforms the standard EM. If the proportion of missingness is high, the complete case envelope is both biased and inefficient, while the standard EM is still unbiased although inefficient; when the bias of the complete case envelope dominates the MSE, the standard EM outperforms the complete case envelope. When the proportion of missingness is at neither extreme, the complete case envelope is not necessarily better or worse than the standard EM: the standard EM estimate may have a smaller bias but a relatively larger variance, while the complete case envelope may have a larger bias and a smaller variance.
We carried out another simulation study where the steps were the same as above, except that we replaced Ω_0 = 1000I_q with Ω_0 = 10I_q in Step 2. This is a case where the variance of the immaterial part is not as large. The median MSEs of the EM envelope, the complete case envelope, the full data envelope, the standard EM, the standard complete case analysis and the full data MLE are 1.06 × 10^-4, 6.16 × 10^-4, 8.58 × 10^-5, 5.42 × 10^-4, 6.81 × 10^-3 and 5.24 × 10^-4, respectively. Detailed comparisons of the six methods are given in Figure 8 and Table 2 in the Appendix. Out of 1000 simulations, the envelope dimension is correctly estimated as u = 3 with an accuracy of 89.8%, while the remaining 10.2% yield an estimated envelope dimension u > 3. As mentioned, overselection still provides the correct model but may lead to inefficient estimation. The EM envelope and the standard complete case analysis remain the best and the worst estimators using the observed data in terms of MSE; the standard EM now outperforms the complete case envelope. Again, the EM envelope outperforms the full data MLE.

Non-normal errors
To investigate the performance of our estimator under the scenarios of Propositions 2 and 3, we carried out four additional sets of simulations to compare β̂_em·env and β̂_em·std, as well as the other estimators, when the error term ε_i is not normally distributed. Specifically, we consider two scenarios: (i) correctly specified distribution of X_i and (ii) misspecified distribution of X_i. The simulations under scenario (i) are carried out in the following steps.

Step 1*. Set n = 500, r = 10, p = 5, and u = 2. Generate parameters Γ̃ ∈ R^{r×u} and β̃ ∈ R^{r×p}, where the elements are drawn independently from U(0, 1) and U(−10, 10), respectively. By QR decomposition, we obtain Γ from Γ̃, where Γ satisfies Γ^T Γ = I_{u×u}. Set the true regression coefficients as β = P_Γ β̃. Generate a matrix N ∈ R^{p×p} with each element drawn independently from U(−10, 10), and set Σ_x = NN^T.

Step 2*. Generate the full data (X_i, Y_i) for each individual i. We generate X_ij i.i.d. ∼ 25Ber(0.5) for j = 1, . . . , 5. To satisfy the independence conditions, we first draw ε_i1 ∈ R^u and ε_i2 ∈ R^{r−u} independently from the distributions t_5(0, I_u) and t_5(0, 1000I_{r−u}). We then set ε_i = Γε_i1 + Γ_0ε_i2 and Y_i = βX_i + ε_i.

Step 3*. Generate the missingness as in Step 3.

Step 4*. Calculate β̂_em·env, β̂_cc·env, β̂_full·env, β̂_em·std, β̂_cc·std, and β̂_full·std. We calculate β̂_em·std and β̂_em·env using a normal working model for ε_i and a Bernoulli model for X_i, with the parameter updates derived in Example A3. The envelope dimensions for β̂_em·env, β̂_full·env and β̂_cc·env are obtained through the bootstrap method with 20 iterations.

Step 5*. Repeat Steps 2*-4* 1000 times.
Under the above missingness mechanism, the predictors and responses suffer from about 13% missingness. Although the normality of ε_i is violated, the data are still generated under a nontrivial envelope structure defined by Conditions 3–4 with the envelope dimension u = 2.
We use the bootstrap to choose the envelope dimensions for β̂_em·env, β̂_full·env and β̂_cc·env. All the envelope dimensions are correctly specified for β̂_em·env and β̂_full·env. Following Theorem 2 in [41] and Proposition 2, once the envelope dimension is correctly specified, the full data envelope with a misspecified working normal density is still consistent, although it no longer provides the MLE. As for β̂_cc·env, the correct envelope dimension u = 2 is selected 903 out of 1000 times, u = 3 is selected 94 times, and u = 4 is selected the remaining 3 times. We observe that the bootstrap method requires more computational time than the likelihood method, but is more robust in selecting the envelope dimension. It is worth noticing that for the complete case analysis, even if the envelope dimension is correctly specified most of the time, the resulting estimator usually suffers from bias. Under the current missingness mechanism, the bias of the complete case estimator is relatively small. Therefore, all three envelope estimators perform better than the standard estimators with full, complete and all data, because the variance of the immaterial part is much larger than that of the material part. The median MSEs are 4.84 × 10^-4, 1.52 × 10^-2, 1.07 × 10^-3, 0.11, 1.28 × 10^-4, and 1.41 × 10^-2 for β̂_em·env, β̂_em·std, β̂_cc·env, β̂_cc·std, β̂_full·env and β̂_full·std, respectively. Detailed comparisons of the simulation results are given in Figure 4 below and Table 3 in the Appendix. We see that when the error term follows a multivariate t distribution, as long as the envelope independence conditions hold, our EM envelope estimator empirically outperforms the standard estimator. Also, the EM envelope outperforms the full data MLE, suggesting that in practice, our method has the potential to recover the efficiency loss from missing data.

The simulation under scenario (ii) is similar to that under scenario (i). In Step 2*, we generate X_i i.i.d. ∼ t_5(0, Σ_x), where t_ν(μ, Σ) represents the multivariate t distribution with location parameter μ, scale parameter Σ and degrees of freedom ν, Σ_x = NN^T, and each element of N is drawn independently from U(−10, 10). In Step 4*, β̂_em·std and β̂_em·env are obtained using a normal working model for both ε_i and X_i.
All the envelope dimensions for β̂_em·env and β̂_full·env are correctly estimated through the bootstrap method. The dimension for β̂_cc·env is selected correctly 90.5% of the time, while the remaining 9.5% yields an estimated dimension u > 2. All three envelope estimators perform better than the standard estimators with full, complete and all data because the variation of the immaterial part is much larger than that of the material part. The median MSEs are 7.96 × 10^-4, 7.61 × 10^-2, 1.38 × 10^-3, 0.50, 1.52 × 10^-4, and 6.96 × 10^-2 for β̂_em·env, β̂_em·std, β̂_cc·env, β̂_cc·std, β̂_full·env and β̂_full·std, respectively. Detailed comparisons of the simulation results are given in Figure 5 below and Table 4 in the Appendix. We carried out another two sets of simulations where the data generating steps were the same as above, but we changed the distributions of ε_i1 ∈ R^u and ε_i2 ∈ R^{r−u}, generating each element of ε_i1 and ε_i2 independently from two alternative distributions; detailed comparisons are given in Tables 5 and 6 in the Appendix. Under both settings, we see substantial empirical efficiency gains by using our method.

Data analysis
In this section, we apply our proposed method to the Chronic Renal Insufficiency Cohort (CRIC) study. The CRIC study recruited 3939 participants from April 8, 2003 through September 3, 2008, with follow-up through March 31, 2013 [19]. The study cohort was a racially and ethnically diverse group aged from 21 to 74 years with mild to moderate chronic kidney disease (CKD). Each study subject was given an extensive clinical evaluation, and the information collected included quality of life, dietary assessment, physical activity, health behaviors, depression, cognitive function, and blood and urine specimens.
To prevent the development of severe clinical events, it is important to identify CKD patients with a high risk of end-stage renal disease (ESRD) at an early stage. A variety of risk factors for ESRD have been identified in the literature [4,23,35,2,20,1]. It is of interest to investigate the difference in the distributions of baseline biomarkers between the patients who develop ESRD and those who do not. Correlation among risk factors has often been observed in the literature [5]; however, it has not been fully utilized in statistical analyses for predicting ESRD and CVD. Our method leverages the correlation among the risk factors and biomarkers to improve the efficiency of the analysis. Additionally, it is of interest to explore modifiable biomarkers, that is, biomarkers whose distributions differ significantly between patients who do and do not develop ESRD after adjusting for the established biomarkers.
The study participants were distinguished by their ESRD status (binary, 1 for ESRD and 0 for no ESRD) within five years of enrollment. We assumed that death before the progression to ESRD and withdrawal from the study were independent of the ESRD disease status. Thus, we focused our analysis on the remaining 3205 patients. In our analysis, we also adjusted for gender, age, race, systolic and diastolic blood pressures, and hemoglobin. The biomarkers and risk factors are urine albumin, urine creatinine, high sensitivity C-reactive protein (HS CRP), brain natriuretic peptide (BNP), chemokine ligand 12 (CXCL12), fetuin A, fractalkine, myeloperoxidase (MPO), neutrophil gelatinase associated lipocalin (NGAL), fibrinogen, troponin, urine calcium, urine sodium, urine potassium, urine phosphate, high sensitivity troponin T (TNTHS), aldosterone, C-peptide, insulin value, total parathyroid hormone (Total PTH), CO2, 24-hour urine protein, and estimated glomerular filtration rate (EGFR). We performed a log transformation on the highly skewed biomarkers and risk factors. In addition, we divided fetuin A by 10^4, as its scale was quite different from that of the other biomarkers.
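As an illustration, the preprocessing described above amounts to the following sketch. The column names and values are hypothetical, not the actual CRIC variable names; log1p is our choice to guard against zeros in skewed biomarkers.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame standing in for the CRIC biomarker file.
df = pd.DataFrame({
    "urine_albumin": [12.0, 350.0, 4.1, 980.0],   # highly skewed biomarker
    "fetuin_a": [2.1e5, 1.8e5, 2.4e5, 1.9e5],     # much larger scale
})

# Log-transform the highly skewed biomarkers.
df["urine_albumin"] = np.log1p(df["urine_albumin"])

# Rescale fetuin A by 10^4 so its magnitude is comparable to the others.
df["fetuin_a"] = df["fetuin_a"] / 1e4
```

Putting the responses on comparable scales matters here because the envelope decomposition is driven by the covariance of the response vector.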
We first assessed the difference in the distributions of baseline biomarkers versus the ESRD status, unadjusted for the established biomarkers. All the biomarkers except the EGFR had some missingness, ranging from <1% to 6%. As for the predictors, hemoglobin and BMI had relatively low missing rates (15 observations with hemoglobin missing and 5 observations with BMI missing). As the proportion of missingness was relatively low, we used the BIC_Q given in Section 4.2 to select the envelope dimension. The EM envelope method reduced the dimension of the biomarkers from r = 23 to u = 15. The point estimates, bootstrap standard errors, confidence intervals and p-values for the mean difference of biomarkers among ESRD patients versus no ESRD patients are given in the Appendix. The magnitudes of the point estimates of our method are in general slightly smaller than those of the standard EM. For example, the coefficient for urine albumin is 0.56 using our method and 2.54 using the standard EM. This is because in each EM iteration, the envelope estimate is the projection of the standard estimate onto the envelope direction. The reduction in magnitude is interpreted as the noise subtracted from the original estimates. As Louis [34] suggested, a closed form of the asymptotic variance of the standard EM estimator is in general hard to obtain. Hence, we carried out the nonparametric bootstrap 1000 times; that is, we resampled individuals with replacement. The standard errors of our method are also generally smaller than those of the standard method. Figure 6 further shows the empirical cumulative distributions of the ratios of the estimated standard errors of the standard EM versus our method. The ratios are in general greater than 1 (to the right hand side of 1 in Figure 6), indicating that the estimated standard errors are smaller using our method than using the standard EM; this efficiency gain aligns with our theory.
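The nonparametric bootstrap used here, resampling individuals with replacement, can be sketched as follows. The function name and the toy check are ours; in the analysis, the estimator would be the EM envelope fit on each resampled data set.

```python
import numpy as np

def bootstrap_se(data, estimator, n_boot=1000, rng=None):
    """Nonparametric bootstrap: resample rows (individuals) with replacement
    and report the standard deviation of the estimates across resamples."""
    rng = np.random.default_rng(rng)
    n = data.shape[0]
    est = np.array([estimator(data[rng.integers(0, n, size=n)])
                    for _ in range(n_boot)])
    return est.std(axis=0, ddof=1)

# Toy check: the bootstrap SE of a sample mean should be close to s / sqrt(n).
x = np.random.default_rng(1).normal(size=(200, 1))
se = bootstrap_se(x, lambda d: d.mean(axis=0), n_boot=500, rng=2)
```

With missing data, each bootstrap replicate reruns the full EM fit, which is why the bootstrap dimension selection is reported as more computationally expensive than the likelihood-based criterion.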
The mean of the ratio is 1.24 for the coefficients corresponding to ESRD and 1.62 for all coefficients. That is, on average, our method is about 24% more efficient than the standard method for the coefficients corresponding to ESRD and 62% more efficient for all coefficients. The same set of biomarkers (all the aforementioned biomarkers except HS CRP, fetuin A and insulin value) was found by both our method and the standard EM to be significantly different between patients with and without ESRD. Table 7 and Table 8 in the Appendix present details of the results.
It has been found in the literature that although many novel biomarkers are marginally significantly associated with the ESRD status, such associations often disappear after adjusting for the established biomarkers [22,37,28]. That is, they are not as useful as modifiable biomarkers. We next assess the mean difference of baseline biomarkers between patients with and without ESRD, adjusted for the established biomarkers. The EGFR and the amount of urine protein excreted are two established biomarkers for predicting ESRD. Thus, in the subsequent analysis, we use these two variables as predictors rather than responses. The estimated envelope dimension is u = 17. The point estimates, bootstrap standard errors, confidence intervals and p-values for the mean difference of biomarkers for different ESRD status, adjusting for the EGFR and the urine protein, are given in Table 7. The point estimates and the standard errors are again in general smaller using our method as compared with the standard EM. Figure 7 shows the empirical distribution of the ratio between the estimated standard errors of the two methods. The mean of the ratio is 1.92 for the coefficients corresponding to ESRD and 1.86 for all coefficients. Comparing Figure 6 and Figure 7, we see that the EM envelope method achieves an even higher efficiency gain when we adjust for the established biomarkers. As found in the literature, after adjusting for the established biomarkers, the majority of the biomarkers that have been investigated are no longer significant. We observe the same phenomenon using both our method and the standard EM. However, among the few biomarkers that remain significant, there is some discrepancy between the standard EM and our method: our method found HS CRP, aldosterone, and C-peptide significant, which were not identified by the standard EM; whereas the standard EM found NGAL significant, which was not identified by our method.
As our method is more efficient in finite samples, its results are more precise than those of the standard EM.

Discussion
In this paper, we proposed the EM envelope method to achieve more efficient estimation of the coefficients in multivariate regression with missing data. Specifically, we assumed that redundancy exists among the response variables, so that the immaterial part can be omitted from the regression to reduce noise. A similar redundancy structure may also occur among the predictors or among both predictors and responses, and our method can be derived analogously under those scenarios. For example, if we assume that there exist linear combinations of the predictors that do not contribute to the regression and that the missingness mechanism of the predictors and responses is MAR, then our method can be adapted to gain efficiency by discarding the immaterial part of the variation among the predictors. A similar derivation can be made by changing the covariance matrix Σ in this paper to Σ_x, the covariance matrix of the predictors.
An alternative approach to calculating an envelope estimate with missing data is to use the model free approach proposed by [13]. Specifically, we can calculate the standard EM estimator together with its asymptotic variance using the Louis formula. However, the calculation of the asymptotic variance of the EM estimator requires the conditional expectation of the outer product of the complete data score vector, an inherently problem-specific task that usually requires considerable computational effort, as discussed in [36]. Also, this approach requires estimating an envelope in R^{pq} instead of R^q, which makes the problem more challenging. A detailed comparison of the empirical performance of such a model free envelope based on the standard EM estimator versus the EM envelope method is left for future work.
Under model (2.2), we implicitly assumed independence between the predictor X and the error vector ε; in particular, we assumed there is no heteroscedasticity. Su and Cook [42] studied the heteroscedastic envelope when the predictor is categorical, and our method can be readily adapted to their approach. How to apply the envelope method under heteroscedasticity when the predictor is not categorical remains an open problem, which we leave for future research.
Throughout this paper, our method is proposed assuming the missing data mechanism is ignorable. When the data are nonignorably missing, a selection model needs to be specified. We also leave this as a topic for future research.

Appendix A: The derivations of examples
In the following example, we show that if (X_i^T, Y_i^T)^T follows a normal distribution, then (Y_{i,obs}^T, X_{i,obs}^T)^T also follows a normal distribution.

Example A1. Suppose the predictors and responses are normally distributed as

Derivation of Example A1
Since the locations of the missing components vary across individuals, there exists a unique permutation matrix B_i, i.e., a square matrix that has exactly one entry of 1 in each row and each column and 0s elsewhere, that rearranges (X_i^T, Y_i^T)^T so that the observed components come first. Thus, by the properties of the normal distribution, (Y_{i,obs}^T, X_{i,obs}^T)^T is normally distributed. Here O_{a×b} denotes the a × b matrix with all elements equal to 0, k_i is the total length of (X_{i,obs}^T, Y_{i,obs}^T)^T, and l is the total length of (X^T, Y^T)^T. The updates of the parameters β and Σ have been discussed above. Here, we present two examples focusing on the calculation of A_{j,t} and ρ_t.
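The permutation matrix B_i can be constructed directly from the missingness pattern of individual i; a small sketch (the function name is ours):

```python
import numpy as np

def separation_permutation(observed_mask):
    """Build the permutation matrix B_i that reorders a vector so that the
    observed entries (mask True) come first and the missing ones last.
    B_i has exactly one 1 in each row and column and 0s elsewhere."""
    l = len(observed_mask)
    order = [j for j in range(l) if observed_mask[j]] + \
            [j for j in range(l) if not observed_mask[j]]
    B = np.zeros((l, l))
    for new_pos, old_pos in enumerate(order):
        B[new_pos, old_pos] = 1.0
    return B

mask = np.array([True, False, True, False])   # entries 0 and 2 observed
B = separation_permutation(mask)
v = np.array([10.0, 20.0, 30.0, 40.0])
# B @ v stacks the observed entries first, then the missing ones.
```

Because each row and column of B_i contains a single 1, B_i is orthogonal, so permuting a normal vector by B_i preserves normality with permuted mean and covariance.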

Example A2. Under Model (2.1) and assume

Derivation of Example A2
The likelihood of X can be written in normal form with constant C = −(np log 2π)/2. Let A_{i4,t} = E(X_i | θ_t, D_{i,obs}) denote the conditional expectation of X_i given D_{i,obs}, and let ρ_{t+1} = (μ_{t+1}, Σ_{x,t+1}). By Lemma 5, we have μ_{t+1} = A_{4,t}^T/n and Σ_{x,t+1} = (A_{3,t} − 2A_{4,t}μ_{t+1})/n + μ_{t+1}^T μ_{t+1}. Then we calculate A_{1,t}, A_{2,t} and A_{3,t}. Since X_i and Y_i | X_i are normally distributed, following a derivation similar to that of Example A1, given θ_t the missing components are conditionally normal given the observed components. For simplicity, in the derivation of the parameter updates below we focus on the t-th step and omit the subscript t. Since missing values occur at different locations for different individuals, we rearrange X_i and Y_i to separate the missing variables from the observed variables. Write
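The conditional expectations entering the E-step are the standard conditional moments of a partitioned normal vector; a sketch under that assumption (the function name and index arguments are illustrative):

```python
import numpy as np

def conditional_normal(mu, Sigma, obs_idx, mis_idx, v_obs):
    """Mean, covariance, and second moment of the missing block of a joint
    normal vector given the observed block -- the building blocks of the
    E-step expectations such as A_{i4,t} = E(X_i | theta_t, D_{i,obs})."""
    mu_o, mu_m = mu[obs_idx], mu[mis_idx]
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]
    S_mm = Sigma[np.ix_(mis_idx, mis_idx)]
    K = S_mo @ np.linalg.inv(S_oo)                 # regression coefficients
    cond_mean = mu_m + K @ (v_obs - mu_o)
    cond_cov = S_mm - K @ S_mo.T                   # Schur complement
    # E(v_mis v_mis^T | v_obs) = cond_cov + cond_mean cond_mean^T.
    second_moment = cond_cov + np.outer(cond_mean, cond_mean)
    return cond_mean, cond_cov, second_moment
```

The second moment, not just the conditional mean, is needed because the M-step updates of Σ and Σ_x involve expectations of cross products of the imputed components.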

L. Ma et al.
Then, we can obtain A_{i1}, A_{i2} and A_{i3} through

Proof of Example A3
Let β_{i,obs} denote the submatrix of β whose rows correspond to the observed responses Y_{i,obs}. Let Σ_{i,obs} denote the submatrix of Σ whose elements correspond to the covariance of Y_{i,obs}. Let ε_{i,obs} denote the random error corresponding to Y_{i,obs}. Hence, we have Y_{i,obs} = β_{i,obs} X_i + ε_{i,obs}, where ε_{i,obs} independently follows N(0, Σ_{i,obs}).
First, we derive the distribution of X i |Y i,obs given θ = θ t .
The last equation holds because of the form of the moments of a Bernoulli variable. The likelihood function of X can then be written down, and by taking the derivative with respect to π, we obtain the update π_{t+1} = Σ_{i=1}^n π̂_{i,t}/n. For simplicity, we again omit the subscript t in the following derivation. Next, we calculate the conditional covariance matrices A_1, A_2 and A_3. For an individual i, if x_i is not missing, A_{i1}, A_{i2} and A_{i3} can be computed trivially. Hence, we only need to demonstrate the case when x_i is missing. There exists a permutation matrix separating the observed responses from the missing ones; using E(y_{i,mis} y_{i,mis}^T | y_{i,obs}, θ) = Var(y_{i,mis} | y_{i,obs}, θ) + E(y_{i,mis} | y_{i,obs}, θ)E(y_{i,mis}^T | y_{i,obs}, θ), the form of A_{i1} can be obtained.
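For a missing Bernoulli predictor, the weight π̂_i is the posterior probability of X_i = 1 under the working normal model, computed by Bayes' rule; a sketch under that assumption (the function name is ours, and the predictor is taken to be a single 0/1 covariate):

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_pi(y_obs, beta_obs, Sigma_obs, pi):
    """P(X_i = 1 | Y_i,obs) when X_i ~ Ber(pi) and, under the working model,
    Y_i,obs | X_i ~ N(beta_obs * X_i, Sigma_obs)."""
    like1 = multivariate_normal.pdf(y_obs, mean=beta_obs.ravel(), cov=Sigma_obs)
    like0 = multivariate_normal.pdf(y_obs, mean=np.zeros(len(y_obs)), cov=Sigma_obs)
    return pi * like1 / (pi * like1 + (1 - pi) * like0)
```

When the observed responses are equally likely under X_i = 0 and X_i = 1, the posterior reduces to the prior π, as one would expect.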
To calculate A_{i2}, we apply the law of total expectation. After obtaining A_{i1}, A_{i2} and A_{i3}, we obtain A_1, A_2 and A_3 through a summation over i.

Proof of Example 1
First, we prove the case where X_i follows a normal distribution. Since the working model for ε_i is also normal, the estimator θ̂_obs·std is obtained by maximizing the observed data likelihood under the working model. From Example 1 and the notations therein, the estimator θ̂_obs·std is the solution to the following generalized estimating equation (GEE): Σ_{i=1}^n ψ(D_{i,obs}, θ) = 0, where l_i is the log-likelihood of each observation under the working model and ψ(D_{i,obs}, θ) = ∂l_i/∂θ. Throughout the proof, expectations are taken given the observed data pattern.
We need to show that (B1) holds on any compact subset of the parameter space. That is, for any c > 0 and any sequence {D_{i,obs}}_{i=1}^∞ satisfying ||D_{i,obs}|| ≤ c, the sequence of functions ψ(D_{i,obs}, θ) is equicontinuous on any compact subset of the parameter space.
By taking the derivative of ψ(D_{i,obs}, θ) with respect to θ, we see that ∂ψ/∂θ is continuous in θ and D_{i,obs}. Hence, when the parameter space Θ is compact and ||D_{i,obs}|| ≤ c, ∂ψ/∂θ is uniformly bounded. Therefore, ψ(D_{i,obs}, θ) is equicontinuous; that is, regularity condition (B1) holds. Next, we prove that condition (B2) holds; that is, the solution of the estimating equation is unique at θ = θ_0. Since we assumed a fixed missing mechanism and (X_i, Y_i) is of length p + r, there are at most 2^{p+r} − 1 observed data patterns. Let m denote the total number of observed data patterns, and let D*_{i,obs} denote the i-th observed data pattern with probability p*_i for i = 1, . . . , m, satisfying Σ_{i=1}^m p*_i = 1. For example, if for the i-th observed data pattern only X_1 is missing, then D*_{i,obs} = (X_2, . . . , X_p, Y). Let S*_i, B*_i denote the corresponding matrices in Example 1 for the observed data D*_{i,obs}, and let l*_i denote the log-likelihood of the i-th observed data pattern under the working model. Here P_{B(Σ)} ≡ B(B^T ΣB)^{−1} B^T Σ represents the projection onto span(B) relative to Σ. In order to show that the above estimating equation has a unique solution at μ̃ = μ̃_0, we only need to show that the corresponding matrix is full rank, where e_i is the vector of length p + r whose i-th entry equals 1 and whose other entries equal 0. Since there is no predictor or response with a missing rate of 100%, q*_i > 0 for all 1 ≤ i ≤ p + r, and the above matrix is full rank. That is, μ̃ = μ̃_0 is the unique solution. Since μ̃ = (μ_{0x}^T, μ_{0x}^T β_0^T)^T, the solution for μ_x must be unique and μ_x = μ_{0x}. An analogous full-rank argument applies to the second-moment equations. Hence, the above equation implies that Σ̃ = Σ̃_0 is the unique solution. That is, Σ_x = Σ_{0x}, Σ = Σ_0, and β = β_0 are the unique solutions.
Therefore, the solution of (A.3) is unique, so (B2) holds.

Proof of Example 2
Suppose X_i follows a binomial distribution with m trials and success probability p. Without loss of generality, among the n samples, let the first n_0 samples be those for which the covariate X_i is not missing. Then the observed data likelihood can be written as a weighted sum over the possible values of X_i; hence, L(θ) can also be expressed using normal densities, which is a Gaussian mixture model. It is easy to show that ∂ψ_i/∂θ is continuous in θ using the same technique. Hence, following the same proof procedure, we know that (B1)–(B2) hold when X_i follows a binomial distribution.
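The Gaussian mixture form of the observed-data density for an individual with X_i missing can be sketched as follows, marginalizing the binomial covariate out of the working normal model (the function name is ours, and the predictor is taken to be a single binomial covariate):

```python
import numpy as np
from scipy.stats import binom, multivariate_normal

def mixture_density(y, beta, Sigma, m, p):
    """Observed-data density of Y when the binomial covariate X is missing:
    f(y) = sum_{k=0}^{m} C(m,k) p^k (1-p)^(m-k) * N(y; beta * k, Sigma),
    a Gaussian mixture with binomial mixing weights."""
    dens = 0.0
    for k in range(m + 1):
        w = binom.pmf(k, m, p)
        dens += w * multivariate_normal.pdf(y, mean=(beta * k).ravel(), cov=Sigma)
    return dens
```

With m = 1 this reduces to the two-component Bernoulli mixture used in Example A3, which makes the continuity of the score in θ immediate.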