Improved Estimators for Semi-supervised High-dimensional Regression Model

We study a high-dimensional linear regression model in a semi-supervised setting, where for many observations only the vector of covariates $X$ is given, with no response $Y$. We do not make any sparsity assumptions on the vector of coefficients, and we aim to estimate $\mathrm{Var}(Y|X)$. We propose an estimator which is unbiased, consistent, and asymptotically normal. This estimator can be improved by adding zero-estimators arising from the unlabelled data. Adding zero-estimators does not affect the bias and can potentially reduce the variance. To achieve the optimal improvement, many zero-estimators should be used, but this raises the problem of estimating many parameters. We therefore introduce covariate selection algorithms that identify which zero-estimators should be used to improve the above estimator. We further illustrate our approach for other estimators, and present an algorithm that improves estimation for any given variance estimator. Our theoretical results are demonstrated in a simulation study.


Introduction
High-dimensional data analysis, where the number of predictors is larger than the sample size, is a topic of current interest. In such settings, an important goal is to estimate the signal level τ² and the noise level σ², i.e., to quantify how much of the variation in the response variable can be explained by the predictors, and how much is left unexplained. For example, in disease classification using DNA microarray data, where the number of potential predictors, say the genotypes, is enormous for each individual, one may wish to understand how disease risk is associated with genotype versus environmental factors.
Estimating the signal and noise levels is important even in a low-dimensional setting. In particular, a statistical model partitions the total variability of the response variable into two components: the variance of the fitted model, τ², and the variance of the residuals, σ². This partition is at the heart of techniques such as ANOVA and linear regression, where τ² and σ² are commonly referred to as explained versus unexplained variation, or between-treatments versus within-treatments variation. Moreover, in model selection problems, τ² and σ² may be required for computing popular statistics such as Cp, AIC, BIC and R². Both τ² and σ² are also closely related to other important statistical problems, such as genetic heritability and signal detection. Hence, developing good estimators for these quantities is a desirable goal.
When the number of covariates p is much smaller than the number of observations n, and a linear model is assumed, the ordinary least squares (henceforth, OLS) method provides straightforward estimators for τ² and σ². However, when p > n, it becomes more challenging to perform inference on τ² and σ² without further assumptions, such as sparsity of the coefficients.
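As a low-dimensional baseline, the OLS-based estimators can be sketched as follows. This is an illustrative simulation with our own naming, assuming Gaussian covariates with identity covariance; σ̂² = RSS/(n − p) and τ̂² = σ̂²_Y − σ̂² are the standard choices:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma = 5000, 10, 1.0          # p << n: the classical regime
beta = rng.normal(size=p)
X = rng.normal(size=(n, p))
Y = X @ beta + rng.normal(scale=sigma, size=n)

# OLS fit; with p << n the residual-based noise estimate RSS/(n - p) is unbiased.
coef, rss, rank, _ = np.linalg.lstsq(X, Y, rcond=None)
sigma2_hat = float(rss[0]) / (n - p)

# Signal estimate: total variance of Y minus the noise estimate.
tau2_hat = float(np.var(Y, ddof=1)) - sigma2_hat
tau2_true = float(beta @ beta)       # here Cov(X) = I, so tau^2 = ||beta||^2
```

Both estimates are close to the truth for large n; it is exactly this simple recipe that breaks down once p > n, motivating the approach of this paper.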
In practice, the sparsity assumption may be unrealistic in some areas of interest. In such cases, considering only a small number of significant coefficients can lead to biases and inaccuracies.
One relevant example is the problem of missing heritability, i.e., the gap between heritability estimates from genome-wide association studies (GWAS) and the corresponding estimates from twin studies. For example, by 2010, GWAS had identified a relatively small number of covariates that collectively explained around 5% of the total variation in the trait height, a small fraction compared to the 80% of the total variation explained by twin studies (Eichler et al., 2010). Identifying all the GWAS covariates affecting a trait, and measuring how much variation they capture, is believed to bridge a significant fraction of the heritability gap.
With that in mind, methods that rely heavily on the sparsity assumption may, by their nature, underestimate τ². We show in this work that in the semi-supervised setting, in which for many observations only the covariates X are given with no response Y, one may consistently estimate the heritability without sparsity assumptions. We use the term semi-supervised setting to describe a setting in which the distribution of X is known. The setting where the distribution of X is only partially known is not part of this work.
Estimating τ² and σ² in a high-dimensional regression setting is generally a challenging problem. As mentioned above, the sparsity assumption, which means that only a relatively small number of predictors are relevant, plays an important role in this context. Fan et al. (2012) introduced a refitted cross-validation method for estimating σ². Their method is a two-stage procedure in which a variable-selection technique is performed in the first stage, and OLS is used to estimate σ² in the second stage. Sun and Zhang (2012) introduced the scaled lasso algorithm, which jointly estimates the noise level and the regression coefficients by an iterative lasso procedure. Both works provide asymptotic distributional results for their estimators and prove consistency under several assumptions, including sparsity. In the context of heritability estimation, Gorfine et al. (2017) presented the HERRA estimator, which is based on the above methods and is also applicable to time-to-event outcomes, in addition to continuous or dichotomous outcomes. Another recent related work is Cai and Guo (2020), which considers, as we do here, a semi-supervised learning setting. In their work, Cai and Guo proposed the CHIVE estimator of τ², which integrates both labelled and unlabelled data and works well when the model is sparse. They characterize its limiting distribution and calculate confidence intervals for τ². For more related works, see the literature review of Cai and Guo (2020).
Rather than assuming sparsity, or other structural assumptions on the coefficient vector β, a different approach to high-dimensional inference is to assume some knowledge about the covariate distribution. Dicker (2014) uses the method of moments to develop several asymptotically normal estimators of τ² and σ², when the covariates are assumed to be Gaussian. Schwartzman et al. (2019) proposed the GWASH estimator of heritability, which is essentially a modification of one of Dicker's estimators in which the columns of X are standardized.
Unlike Dicker's estimators, the GWASH estimator can also be computed from typical summary statistics, without accessing the original data. Janson et al. (2017) proposed the EigenPrism procedure to estimate τ² and σ². Their method, which is based on singular value decomposition and convex optimization techniques, provides estimates and confidence intervals for normal covariates.
In this paper we introduce a naive estimator of τ² and show that it is asymptotically equivalent to Dicker's estimators when the covariates are normal, an assumption that is relaxed in this work. The naive estimator is also a U-statistic and is asymptotically normal. U-statistics can typically be used to obtain uniformly minimum variance unbiased estimators (UMVUE).
However, when moment restrictions exist, U-statistics are no longer UMVUE, as shown by Hoeffding (1977). Under the assumed semi-supervised setting, the distribution of X is known (and hence, the moments of X are known). Thus, the naive estimator is not UMVUE and can potentially be improved. We demonstrate how its variance can be reduced by using zero-estimators that incorporate the additional information from the unlabelled data.
The contribution of this paper is threefold. First, we propose a novel approach for improving initial estimators of the signal level τ² in the semi-supervised setting, without assuming sparsity or normality of the covariates. The key idea of this approach is to use zero-estimators that are correlated with the initial estimator of τ² in order to reduce its variance without introducing extra bias. Second, we define a new notion of optimality with respect to a linear family of zero-estimators. This allows us to suggest a necessary and sufficient condition for identifying optimal oracle estimators. We use the term oracle to point out that the specific coefficients that compose the optimal linear combination of zero-estimators depend on the unknown parameters. Third, we suggest two estimators that successfully improve initial estimators of τ². We discuss in detail the improvement of the naive estimator and also apply our approach to other estimators. Thus, in fact, we provide an algorithm that has the potential to improve any given estimator of τ².
The rest of this work is organized as follows. In Section 2 we describe our setting and introduce the naive estimator. In Section 3 we introduce the zero-estimator approach and suggest a new notion of optimality with respect to linear families of zero-estimators. An optimal oracle estimator of τ² is also presented. In Section 4 we apply the zero-estimator approach to improve the naive estimator. We then study some theoretical properties of the improved estimators. Simulation results are given in Section 5. Section 6 demonstrates how the zero-estimator approach can be generalized to other estimators. A discussion is given in Section 7, and the proofs are provided in the Appendix.

Preliminaries
We begin by describing our setting and assumptions. Let (X₁, Y₁), ..., (X_n, Y_n) be i.i.d. observations drawn from some unknown distribution, where X_i ∈ R^p and Y_i ∈ R. We consider a semi-supervised setting, where we have access to an unlimited number of i.i.d. observations of the covariates.
Thus, we essentially assume that we know the covariate distribution. Notice that the assumption of a known covariate distribution has already been presented and discussed in the context of high-dimensional regression (e.g., Candes et al. 2017 and Janson et al. 2017), without using the term "semi-supervised learning".
For i = 1, . . ., n we consider the linear model

Y_i = β^T X_i + ε_i,  E(ε_i | X_i) = 0,   (1)

where β ∈ R^p is an unknown vector of coefficients. We also assume that the intercept term is zero, which can be achieved in practice by centering the Y's. Let (X, Y) denote a generic observation and let σ²_Y denote the variance of Y. Notice that it can be decomposed into signal and noise components,

σ²_Y = β^T Σ β + σ²,   (2)

where Var(ε) = E(ε²) = σ² and Cov(X) = Σ.
The signal component τ² ≡ β^T Σ β can be thought of as the total variance explained by the best linear function of the covariates, while the noise component σ² can be thought of as the variance left unexplained. We assume that E(X) ≡ µ is known and that Σ is known and invertible.
Therefore, we can apply the linear transformation X → Σ^{−1/2}(X − µ) and assume w.l.o.g. that µ = 0 and Σ = I. It follows from (2) that σ²_Y = ‖β‖² + σ², which implies that in order to evaluate σ², it is enough to estimate both σ²_Y and ‖β‖². The former can easily be evaluated from the sample, and the main challenge is to derive an estimator for ‖β‖² in the high-dimensional setting.
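The whitening transformation and the resulting variance decomposition can be sketched in a small simulation. Dimensions, names, and the Gaussian design below are illustrative assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 200_000
mu = np.arange(p, dtype=float)               # known mean of X
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)              # known, invertible covariance
beta = rng.normal(size=p)
sigma = 1.0                                  # noise level

X = rng.multivariate_normal(mu, Sigma, size=n)
Y = (X - mu) @ beta + rng.normal(scale=sigma, size=n)

# Whitening: Z = Sigma^{-1/2}(X - mu) has mean 0 and identity covariance,
# and the signal tau^2 = beta' Sigma beta is unchanged by the reparametrization.
evals, evecs = np.linalg.eigh(Sigma)
Z = (X - mu) @ (evecs @ np.diag(evals ** -0.5) @ evecs.T)

tau2 = float(beta @ Sigma @ beta)
var_y = float(np.var(Y))
decomposition = tau2 + sigma ** 2            # sigma_Y^2 = tau^2 + sigma^2
```

With n large, the empirical variance of Y matches τ² + σ² closely, and the whitened covariates have approximately zero mean and identity covariance.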

A Naive Estimator
In order to find an unbiased estimator for ‖β‖² = Σ_{j=1}^p β_j², we first consider the estimation of β_j² for each j. A straightforward approach is given as follows. Let W_ij ≡ X_ij Y_i for i = 1, ..., n and j = 1, ..., p. Notice that E(W_{i₁j} W_{i₂j}) = β_j² for i₁ ≠ i₂, which motivates the U-statistic

β̂_j² ≡ (n(n−1))⁻¹ Σ_{i₁≠i₂} W_{i₁j} W_{i₂j}.   (3)

Thus, unbiased estimates of τ² ≡ ‖β‖² and σ² are given by

τ̂² ≡ Σ_{j=1}^p β̂_j²  and  σ̂² ≡ σ̂²_Y − τ̂²,   (4)

where σ̂²_Y is the sample variance of Y. We use the term naive to describe τ̂², since its construction is relatively simple and straightforward. The naive estimator was also discussed by Kong and Valiant (2018). A similar estimator was proposed by Dicker (2014). Specifically, let

τ̂²_Dicker ≡ (n(n+1))⁻¹ ‖X^T Y‖² − p (n(n+1))⁻¹ ‖Y‖²,

where X is the n × p design matrix and Y = (Y₁, ..., Y_n)^T. The following lemma shows that τ̂² and τ̂²_Dicker are asymptotically equivalent under some conditions.
Lemma 1. When τ² + σ² is bounded and p/n converges to a constant, then

√n (τ̂²_Dicker − τ̂²) →p 0.

Note that in this paper we are interested in a high-dimensional regression setting, and we therefore study the limiting behaviour as n and p go to infinity together. Using Corollary 1 from Dicker (2014), which computes the asymptotic variance of τ̂²_Dicker, together with the above lemma, we obtain the following corollary.
Corollary 1. Under the assumptions of Lemma 1,

The variance of the naive estimator τ̂² under model (1) (without assuming normality) is given by the following proposition.
Proposition 1. Assume model (1) and additionally that β^T A β and ‖A‖²_F are finite. Then,

where

The following proposition shows that the naive estimator is consistent under some minimal assumptions.
Proposition 2. Assume model (1) and additionally that τ² + σ² = O(1) and

Then, τ̂² is consistent. Moreover, when the columns of X are independent and both p/n and E(X⁴_ij) are bounded, then
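The unbiasedness and stability of the naive estimator can be checked with a quick Monte Carlo sketch. The code and naming below are ours, not the authors'; Gaussian covariates with Σ = I and a dense coefficient vector are assumed:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 300, 400, 1.0
beta = np.full(p, 1.0 / np.sqrt(p))      # dense: tau^2 = ||beta||^2 = 1
tau2 = float(beta @ beta)

def naive_tau2(X, Y):
    """Naive estimator: sum over j of (n(n-1))^{-1} sum_{i1 != i2} W_{i1 j} W_{i2 j}."""
    n_obs = X.shape[0]
    W = X * Y[:, None]                            # W_ij = X_ij * Y_i
    col_sums = W.sum(axis=0)
    off_diag = col_sums ** 2 - (W ** 2).sum(axis=0)   # ordered pairs i1 != i2
    return float(off_diag.sum() / (n_obs * (n_obs - 1)))

estimates = []
for _ in range(200):
    X = rng.normal(size=(n, p))
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates.append(naive_tau2(X, Y))
mean_est = float(np.mean(estimates))
```

Even with p > n and no sparsity, the average of the replicates is close to the true τ², illustrating unbiasedness; the spread of the replicates is what the zero-estimator approach of the next section aims to reduce.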

Oracle Estimator
In this section we introduce the zero-estimator approach and study how it can be used to improve the naive estimator. In Section 3.1 we present the zero-estimator approach, and an illustration of this approach is given in Section 3.2. Section 3.3 introduces a new notion of optimality with respect to linear families of zero-estimators. We then find an optimal oracle estimator of τ² and calculate its improvement over the naive estimator.

The Zero-Estimator Approach
We describe the approach in general terms. Consider a random variable V ∼ P, where P belongs to a family of distributions 𝒫. Let g(V) be a zero-estimator, i.e., E_P[g(V)] = 0 for all P ∈ 𝒫. Let T(V) be an unbiased estimator of a certain quantity of interest θ. Then, for any fixed constant c, the estimator

U_c(V) ≡ T(V) + c·g(V)   (6)

is also unbiased for θ. Minimizing Var[U_c(V)] with respect to c yields the minimizer

c* = −Cov[T(V), g(V)] / Var[g(V)].   (7)

In other words, by combining a correlated unbiased estimator of zero with the initial unbiased estimator of θ, one can lower the variance. Note that plugging c* into (6) reveals how much variance can potentially be reduced:

Var[U_{c*}(V)] = Var[T(V)] − Cov²[T(V), g(V)] / Var[g(V)] = Var[T(V)](1 − ρ²),   (8)

where ρ is the correlation coefficient between T(V) and g(V). Therefore, it is best to find an unbiased zero-estimator g(V) that is highly correlated with T(V), the initial unbiased estimator of θ. It is important to notice that c* is an unknown quantity and, therefore, U_{c*} is not a statistic. However, in practice, one can estimate c* by some ĉ* and use the approximation U_ĉ*(V) ≡ T(V) + ĉ*·g(V).
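The variance-reduction mechanism can be illustrated in a deliberately simple hypothetical setup (not from the paper): estimate θ = E(Y) for Y = θ + X + ε, where X ~ N(0, 1) has a known distribution, so its sample mean is a zero-estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, theta = 50, 20_000, 0.0

# Here Cov(T, g) = Var(g) = 1/n, so the optimal coefficient is c* = -1.
c_star = -1.0
T_vals, U_vals = [], []
for _ in range(reps):
    X = rng.normal(size=n)
    eps = rng.normal(size=n)
    Y = theta + X + eps
    T = Y.mean()                      # unbiased for theta, Var = 2/n
    g = X.mean()                      # zero-estimator: E(g) = 0, Var = 1/n
    T_vals.append(T)
    U_vals.append(T + c_star * g)     # control-variate estimator, Var = 1/n

var_T = float(np.var(T_vals))
var_U = float(np.var(U_vals))
```

Since ρ² = 1/2 here, (8) predicts that U halves the variance of T while staying unbiased, which the simulation confirms.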

Illustration of the Zero-Estimator Approach
The following example illustrates how the zero-estimator approach can be applied to improve the naive estimator τ̂² in the simple linear model setting.
Example 1 (p = 1). Assume model (1) with X ∼ N(0, 1). By (8), we wish to find a zero-estimator g(X) that is correlated with τ̂². Consider the estimator U_c = τ̂² + c·ḡ, where ḡ ≡ n⁻¹ Σ_{i=1}^n (X_i² − 1) and c is a fixed constant. The variance of U_c is minimized by c* = −2β², and one can verify that Var(U_{c*}) = Var(τ̂²) − (8/n)β⁴. For more details see Remark 3 in the Appendix.
The above example illustrates the potential of using the additional information that exists in the semi-supervised setting to lower the variance of the initial naive estimator τ̂². However, it also raises the question: can we achieve a lower variance by adding different zero-estimators? One might attempt to reduce the variance by adding zero-estimators such as g_k(X) for k > 2. Surprisingly, this attempt fails. Hence, the unbiased oracle estimator of τ², R ≡ τ̂² − 2β²ḡ, is optimal with respect to zero-estimators of the form g_k(X). This unanticipated result motivated us to extend the idea of the optimal zero-estimator to a general regression setting with p covariates.
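The claims of Example 1 can be checked numerically. The sketch below assumes the zero-estimator is the sample mean of X_i² − 1 (our naming, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta, sigma, reps = 100, 1.0, 1.0, 40_000
c_star = -2 * beta ** 2                      # optimal coefficient from Example 1

naive_vals, improved_vals = [], []
for _ in range(reps):
    X = rng.normal(size=n)
    Y = beta * X + rng.normal(scale=sigma, size=n)
    W = X * Y
    s = W.sum()
    naive = (s ** 2 - (W ** 2).sum()) / (n * (n - 1))   # tau2_hat for p = 1
    g_bar = (X ** 2 - 1).mean()                          # zero-estimator: mean 0
    naive_vals.append(naive)
    improved_vals.append(naive + c_star * g_bar)

var_naive = float(np.var(naive_vals))
var_improved = float(np.var(improved_vals))
reduction = var_naive - var_improved         # theory: 8 * beta**4 / n
```

The observed variance gap matches the predicted 8β⁴/n, and the improved estimator remains centered at β².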

Optimal Oracle Estimator
We now define a new oracle unbiased estimator of τ² and prove that, under some regularity assumptions, this estimator is optimal with respect to a family of zero-estimators. Here, optimality means that the variance cannot be further reduced by including additional zero-estimators from that family. We now define our notion of optimality precisely, in a general setting.
Definition 1. Let T be an unbiased estimator of θ and let g₁, g₂, ... be a sequence of zero-estimators, i.e., E_θ(g_i) = 0 for all i ∈ N and for all θ. Let G be the induced linear family of zero-estimators. For a zero-estimator g* ∈ G, we say that R* ≡ T + g* is an optimal oracle estimator (OOE) with respect to G if Var_θ(T + g*) ≤ Var_θ(T + g) for all g ∈ G and for all θ.
We use the term oracle since g* ≡ Σ_{k=1}^m c*_k g_k for some optimal coefficients c*₁, ..., c*_m, which are a function of the unknown parameter θ. The following theorem suggests a necessary and sufficient condition for obtaining an OOE.
Theorem 1. Let g_m = (g₁, ..., g_m)^T be a vector of zero-estimators and assume the covariance matrix M ≡ Var[g_m] is positive definite for every m. Then, R* is an optimal oracle estimator (OOE) with respect to the family of zero-estimators G if and only if R* is uncorrelated with every zero-estimator g ∈ G, i.e., Cov_θ[R*, g] = 0 for all g ∈ G and for all θ.
Returning to our setting, define the following oracle estimator

where

and let G be the family of zero-estimators of the form

The following theorem shows that T_oracle is an OOE with respect to G.
Theorem 2 (General p). Assume model (1) and additionally that X has moments of all orders.
Then, the oracle estimator T_oracle defined in (9) is an OOE of τ² with respect to G.
Remark 1. For the proofs of Theorem 2 and Proposition 1, homoscedasticity of ε is not required.
We now compute the variance reduction of T_oracle with respect to the naive estimator. The following statement is a corollary of Proposition 1.
Corollary 2. Assume model (1) and additionally that the columns of X are independent. Then,

Moreover, in the special case where X_i ~ i.i.d. N(0, I),

Rewriting (10) yields

Notice that, by the Cauchy-Schwarz inequality, E(X²) = 1 implies E(X⁴) ≥ 1, and therefore Var(T_oracle) < Var(τ̂²). The following example provides intuition about the improvement of Var(T_oracle) over Var(τ̂²).
See Remark 4 in the Appendix for more details about the relative improvement of the optimal oracle estimator.

Proposed Estimators
In this section we show how to use the zero-estimator approach to derive estimators that improve on τ̂². In Section 4.1 we show that estimating all p² optimal coefficients given in (9) may introduce too much variance. Therefore, Sections 4.2 and 4.3 introduce alternative methods that reduce the number of zero-estimators used in the estimation.

The cost of estimation
The optimal oracle estimator defined in (9) is based on adding p² zero-estimators. Therefore, it is reasonable to suggest and study the following estimator instead of the oracle one:

where ψ̂_{jj′} is a U-statistic estimator of ψ_{jj′} ≡ β_j β_{j′} h_{jj′}. Notice that E(ψ̂_{jj′}) = 0 and that for i₁ ≠ i₂ we have E(W_{i₁j} W_{i₂j′}) = β_j β_{j′}; thus, T is an unbiased estimator of τ², and we wish to check whether it reduces the variance of the naive estimator τ̂². This is described in the following proposition.
Proposition 3. Assume model (1) and additionally that τ² + σ² = O(1); E(X⁴_ij) ≤ C for some positive constant C; and p/n = O(1). Then,

where

Note that the second equation in (12) follows from (10). To build some intuition, consider the case where X_i ~ i.i.d. N(0, I) and p = n. Then, the last equation can be rewritten as

Notice that the term (8/n)(2τ²σ² + σ⁴) in (13) reflects the additional variability that comes with the attempt to estimate all p² optimal coefficients. Therefore, the estimator T fails to improve the naive estimator τ̂², and a similar result holds for p/n → c for some positive constant c. Thus, alternative ways of improving the naive estimator are warranted; these are discussed next.

Improvement with a single zero-estimator
A simple way to improve the naive estimator is to add only a single zero-estimator. More specifically, consider the estimator U_{c*} = τ̂² − c* ḡ_n, where c* = Cov[τ̂², ḡ_n] / Var[ḡ_n] and ḡ_n is some zero-estimator. By (8) we have

Notice that U_{c*} is an oracle estimator, and thus c* needs to be estimated in order to eventually construct a non-oracle estimator. Let ḡ_n be the sample mean of some zero-estimators g₁, ..., g_n. By (7), it can be shown that

where θ_j ≡ E(S_ij) and S_ij = W_ij g_i. Notice that Var(g_i) does not depend on i. The derivation of (15) can be found in Remark 5 in the Appendix. Here, we chose this specific g because it worked well in the simulations, but we do not argue that it is the best choice. Let T_{c*} = τ̂² − c* ḡ_n denote the oracle estimator for this specific choice of ḡ_n, where c* is given in (15). Notice that by (14) we have

The following example demonstrates the improvement of Var(T_{c*}) over Var(τ̂²).
Example 3 (Example 2, continued). Consider a setting where n = p; τ² = σ² = 1; X_i ~ N(0, I); and β_j = 1/√p for j = 1, ..., p. Notice that this is an extreme non-sparse setting, since the signal level τ² is distributed uniformly across all p covariates. In this case one can verify that Var(T_{c*}) = 12/n + O(n⁻²), which is (asymptotically) approximately a 40% improvement over the variance of the naive estimator. For more details see Remark 6 in the Appendix.
In view of (15), a straightforward U-statistic estimator ĉ* for c* is

where Var(g_i) is assumed known, as it depends only on the marginal distribution of X. Thus, we suggest the estimator T_ĉ* ≡ τ̂² − ĉ* ḡ_n (18), and prove that T_{c*} and T_ĉ* are asymptotically equivalent under some conditions.
Proposition 4. Assume model (1) and additionally that τ² + σ² and p/n are O(1). Also assume that E(X⁴_ij) is bounded and that the columns of the design matrix X are independent. Then,

We note that the requirement that the columns of X be independent can be relaxed to some form of weak dependence.

Improvement by selecting a small number of covariates
Rather than using a single zero-estimator to improve the naive estimator, we now consider estimating a small number of the coefficients of T_oracle. Recall that T_oracle is based on adding p² zero-estimators to the naive estimator. This estimation comes at a high cost in terms of additional variability, as shown in (13). Therefore, it is reasonable to use only a small number of zero-estimators. Specifically, let B ⊂ {1, ..., p} be a fixed set of indices such that |B| ≪ p, and consider the estimator

By the same argument as in Proposition 3 we now have

Also notice that when X_i ~ N(0, I), (20) can be rewritten as

where τ²_B ≡ Σ_{j∈B} β_j². When τ²_B is sufficiently large, one can expect a significant improvement over the naive estimator using only a small number of zero-estimators. For example, when τ²_B = 0.5, p = n, and τ² = σ² = 1, then T_B reduces Var(τ̂²) by 10%. For more details see Remark 7 in the Appendix.
Notice that we do not assume sparsity of the coefficients. The sparsity assumption essentially ignores covariates that do not belong to the set B. When the β_j's for j ∉ B contribute much of the signal level τ² ≡ ‖β‖², the sparse approach disregards a significant portion of the signal, whereas our estimators do account for it, since all p covariates are used in τ̂².
The following example illustrates some key aspects of our proposed estimators.
Two interesting key points:
1. In the first scenario the estimator T_B has the same asymptotic variance as τ̂², while the estimator T_{c*} reduces the variance by approximately 40%.
2. In the second scenario the variance reduction of T_B is approximately 40%, while T_{c*} has the same asymptotic variance as τ̂². Interestingly, in this example the OOE T_oracle asymptotically improves on the naive estimator by 40% regardless of the scenario, as shown by (11). For more details see Remark 8 in the Appendix.
A desirable set of indices B contains a relatively small number of covariates that capture a significant part of the signal level τ². There are different methods for choosing the covariates to include in B, but these are not a primary focus of this work. For more information about covariate selection methods, see Zambom and Kim (2018), Oda et al. (2020), and the references therein. In Section 5 below we work with a specific selection algorithm defined there.
We call δ a covariate selection algorithm if for every dataset (X_{n×p}, Y_{n×1}) it chooses a subset of indices B_δ from {1, ..., p}. Our proposed estimator for τ², which is based on selecting a small number of covariates, is given in Algorithm 1.
Algorithm 1: Proposed estimator based on covariate selection
Input: data (X, Y) and a selection algorithm γ.
1. Calculate the naive estimator τ̂².
2. Apply algorithm γ to (X, Y) to construct B_γ.
3. Calculate the zero-estimator terms for all j, j′ ∈ B_γ.
Result: Return the estimator T_γ.

Some asymptotic properties of T_γ are given by the following proposition.
Proposition 5. Assume there is a set B ≡ {j : ...}

Notice that the requirement lim_{n→∞} n [P(B_γ ≠ B)]^{1/2} = 0 is stronger than just consistency.
Remark 2 (Practical considerations). Some caution regarding the estimator T_γ is needed in practice. When n is insufficiently large, B_γ might differ from B, and Proposition 5 no longer holds. Specifically, let S ∩ B_γ and B ∩ S_γ be the sets of false positive and false negative errors, respectively, where S = {1, ..., p}\B and S_γ = {1, ..., p}\B_γ. While false negatives merely result in not including some potential zero-estimator terms in our proposed estimator, false positives can lead to substantial bias. This is because the expected value of a post-selected zero-estimator is no longer necessarily zero. A common approach to overcoming this problem is to randomly split the data into two parts, where the first part is used for covariate selection and the second part is used for evaluating the zero-estimator terms.
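The splitting safeguard described in Remark 2 can be sketched as follows. The selection rule (marginal correlation screening) and all names are illustrative stand-ins, since the paper's selection algorithm is specified elsewhere:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 400, 400, 5
beta = np.zeros(p)
beta[:k] = np.sqrt(0.5 / k)                 # a few strong covariates, tau2_B = 0.5
beta[k:] = np.sqrt(0.5 / (p - k))           # dense weak remainder (non-sparse)
X = rng.normal(size=(n, p))
Y = X @ beta + rng.normal(size=n)

half = n // 2
X1, Y1, X2, Y2 = X[:half], Y[:half], X[half:], Y[half:]

# Step 1: select B on the first half via marginal screening |corr(X_j, Y)|.
scores = np.abs(X1.T @ Y1) / half
B = np.argsort(scores)[-k:]

# Step 2: evaluate zero-estimator terms only on the second half.  Each term
# mean(X_ij X_ij' - 1{j==j'}) keeps mean zero because the second half played
# no role in the selection, avoiding post-selection bias.
terms = np.array([[np.mean(X2[:, j] * X2[:, jp] - (j == jp))
                   for jp in B] for j in B])
max_abs_term = float(np.abs(terms).max())
```

Without the split, the same terms computed on the selection sample would no longer have mean zero, which is exactly the false-positive bias the remark warns about.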

Estimating the variance of the proposed estimators
We now suggest estimators for Var(τ̂²), Var(T_γ) and Var(T_ĉ*). Let

The following proposition shows that V̂ar(τ̂²) is consistent under some conditions.
Proposition 6. Assume model (1) and additionally that X_i ~ N(0, I) and

Consider now Var(T_γ) and let

The following proposition shows that V̂ar(T_γ) is consistent.
Proposition 7. Under the assumptions of Propositions 5 and 6,

When normality of the covariates is not assumed, we suggest the following estimators:

and

where the components above are all U-statistic estimators, and β̂_j² is given by (3). Although we do not provide formal proofs here, our simulations support the claim that these estimators are consistent under the assumptions of Proposition 3.

Simulation Results
We now provide a simulation study to illustrate the performance of our estimators. We compare the different estimators discussed earlier in this work:
• The naive estimator τ̂², given in (4).
• The optimal oracle estimator T_oracle, given in (9).
• The estimator T_ĉ*, based on adding a single zero-estimator, given in (18).
• The estimator T_γ, based on selecting a small number of covariates, given by Algorithm 1. Details about the specific selection algorithm we used can be found in Remark 9 in the Appendix.
An additional estimator we include in the simulation study is the PSI (Post Selective Inference) estimator, calculated using the estimateSigma function from the selectiveInference R package. The PSI estimator is based on the LASSO method, which assumes sparsity of the coefficients and therefore ignores small coefficients.
We fix β_j² = τ²_B/5 for j = 1, . . ., 5, and β_j² = (τ² − τ²_B)/(p − 5) for j = 6, . . ., p, where τ² and τ²_B vary among the different scenarios. The number of observations and covariates is n = p = 400, and the residual variance is σ² = 1. For each scenario, we generated 100 independent datasets and estimated τ² using the different estimators. Boxplots of the estimates are shown in Figure 1, and RMSE results are given in Table 1. Code for reproducing the results is available at https://git.io/Jt6bC.

Figure 1 demonstrates that:
• Both of the proposed estimators improve on the naive estimator in terms of RMSE. For example, when τ² = 1 and τ²_B = 1/3, the Single estimator T_ĉ* improves on the naive estimator by 17%, and when τ²_B = 2/3, the Selection estimator T_γ improves on the naive estimator by 15%. When τ² = 2, these improvements are even more substantial.
• As suggested in Example 4, the Selection estimator T_γ works well when τ²_B is large, while the Single estimator T_ĉ* works well when τ²_B is small.
• The PSI estimator is biased in non-sparse settings. For example, when τ²_B = 1/3, the PSI estimator has a larger RMSE than the proposed estimators. When τ²_B = 0.99, the PSI estimator has low bias and therefore low RMSE. This is not surprising, since the PSI estimator is based on the LASSO method, which is known to work well when the true model generating the data is sparse.

Generalization to Other Estimators
The methodology suggested in this paper is not limited to improving the naive estimator; it can also be generalized to other estimators. The key is to add zero-estimators that are highly correlated with the initial estimator of τ²; see Equation (8). Unlike the naive estimator, which has a closed-form expression, other common estimators, such as the EigenPrism estimator (Janson et al., 2017), are computed numerically and have no closed-form representation. This makes the task of finding optimal zero-estimators somewhat more challenging, since the zero-estimators' coefficients also need to be computed numerically.
A comprehensive theory that generalizes the zero-estimator approach to estimators other than the naive one is beyond the scope of this work. However, we present here a general algorithm that achieves improvement without claiming optimality. The algorithm is based on adding a single zero-estimator, as in Section 4.2. It approximates the optimal oracle coefficient c* given in (7) from bootstrap samples and then returns a new estimator composed of both the initial estimator of τ² and a single zero-estimator.
3. Approximate the coefficient c* by

where Ĉov(·) denotes the empirical covariance computed from the bootstrap samples, and Var(ḡ_n) is known from the semi-supervised setting.
Result: Return the empirical estimator T_emp = τ̂² − ĉ* ḡ_n.
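A sketch of Algorithm 2 with the naive estimator as the initial estimator is given below. The naming is ours, and both the single zero-estimator ḡ_n = n⁻¹ Σ_i Σ_j (X_ij² − 1) and the number of bootstrap replicates are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)

def naive_tau2(X, Y):
    n_obs = X.shape[0]
    W = X * Y[:, None]
    s = W.sum(axis=0)
    return float((s ** 2 - (W ** 2).sum(axis=0)).sum() / (n_obs * (n_obs - 1)))

def g_bar(X):
    # single zero-estimator: sample mean of sum_j (X_ij^2 - 1)
    return float((X ** 2 - 1).sum(axis=1).mean())

n, p, n_boot = 300, 50, 200
beta = np.full(p, np.sqrt(1.0 / p))          # tau^2 = 1
X = rng.normal(size=(n, p))
Y = X @ beta + rng.normal(size=n)

# Bootstrap approximation of Cov(tau2_hat, g_bar).
t_boot, g_boot = [], []
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)
    t_boot.append(naive_tau2(X[idx], Y[idx]))
    g_boot.append(g_bar(X[idx]))
cov_hat = float(np.cov(t_boot, g_boot)[0, 1])

var_g = 2 * p / n            # known from the covariate distribution (N(0, I))
c_hat = cov_hat / var_g      # approximated coefficient
T_emp = naive_tau2(X, Y) - c_hat * g_bar(X)  # empirical estimator of Algorithm 2
```

The same wrapper applies unchanged to any numerically computed initial estimator (EigenPrism, PSI, etc.): only the call inside the bootstrap loop changes.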
We now demonstrate the performance of the empirical estimator given by Algorithm 2 together with two initial estimators mentioned earlier: the EigenPrism estimator (Janson et al., 2017) and the PSI estimator, described in Taylor and Tibshirani (2018) and used in Section 5. We consider the same setting as in Section 5. Results are given in Tables 2-3, and the code for reproducing the results is available at https://git.io/Jt6bC.
Tables 2-3 show that the standard error of the empirical estimator is equal to or lower than that of the initial estimators, and that as τ² increases, the improvement over the initial estimators becomes more substantial. As in Section 5, the single zero-estimator approach works especially well when τ²_B is small; otherwise, there is little or no improvement, but also no additional variance or bias is introduced. This highlights the fact that the zero-estimator approach is not limited to improving the naive estimator, but has the potential to improve other estimators as well.

Discussion
This paper presents a new approach for improving the estimation of the explained variance τ² of a high-dimensional regression model in a semi-supervised setting, without assuming sparsity. The key idea is to use a zero-estimator that is correlated with the initial unbiased estimator of τ² in order to lower its variance without introducing additional bias. The semi-supervised setting, where the number of covariate observations is much greater than the number of responses, allows us to construct such zero-estimators. We introduced a new notion of optimality with respect to zero-estimators and presented an oracle estimator that achieves this type of optimality. We proposed two different (non-oracle) estimators that achieve a significant, though not optimal, reduction in the asymptotic variance relative to the naive estimator. Our simulations showed that our approach can be generalized to initial estimators other than the naive estimator.
Many open questions remain for future research. While our proposed estimators improve on the naive estimator, they do not achieve the optimal improvement of the oracle estimator. Thus, it remains unclear if and how optimal improvement can be achieved. Moreover, in this work a strong assumption was made about the size of the unsupervised data, i.e., N = ∞. Generalizing the suggested approach by relaxing this assumption to allow for a more general setting with finite N ≫ n is a natural direction for future work. A more ambitious future goal would be to extend the suggested approach to generalized linear models (GLMs), and specifically to logistic regression. In that case, the concepts of signal and noise levels are less clear-cut and are more challenging to define.

Appendix
Proof of Lemma 1: where X is the n × p design matrix and Y = (Y1, ..., Yn) T . Thus, the naive estimator can also be written as The Dicker estimator for τ 2 is given by τ 2 Dicker ≡ . We need to prove that root-n times the difference between the estimators converges in probability to zero, i.e., √ n (τ 2 Dicker − τ 2 ) p → 0. We have, It is enough to prove that: We start with the first term, . Notice that ωi depends on n but this is suppressed in the notation. To show this, it suffices that n −0.5 and E(ω 2 i ) converge to zero.
Notice that Yi ∼ N (0, τ 2 + σ 2 ) by construction and therefore E(Y 8 i ) = O(1) as n and p go to infinity. Let Vj = X 2 ij − 1 and notice that E(Vj ) = 0. We have The expectation is not 0 when j1 = j2 and j3 = j4 (up to permutations) or when all terms are equal. In the first case we have for a positive constant C1. In the second case we have Hence, as p and n have the same order of magnitude, we have which implies E(ω 2 i ) ≤ K1/n → 0, where K and K1 are positive constants. This completes the proof that We now move to prove that n −2.5 (‖X T Y‖ 2 ) p → 0. By Markov's inequality, for ǫ > 0, where we used (26) in the third equality. Therefore, since p and n have the same order of magnitude and τ 2 + σ 2 is bounded by assumption, the claim follows. Proof of Corollary 1: According to Corollary 1 in Dicker (2014), we have , given that p/n converges to a constant. Therefore we can write by Slutsky's theorem.

Proof of Proposition 1:
Let Wi = (Wi1, ..., Wip) T and notice that τ 2 = Wi 1 j Wi 2 j is a U-statistic of order 2 with the kernel , where wi ∈ R p .
By Theorem 12.3 in van der Vaart (2000), where where W̃2 is an independent copy of W2. Now, let A = E(WiW T i ) be a p × p matrix and notice that where ‖A‖ 2 F is the Frobenius norm of A. Thus, by rewriting (27), the variance of the naive estimator is given by Proof of Proposition 2: The latter is assumed and we now show that the former also holds true. Let λ1 ≥ ... ≥ λp be the eigenvalues of A and notice that A is symmetric. We have that We now prove the moreover part, that is, that independence of the columns of X implies Notice that when j = j ′ we have, which follows from the assumptions that the columns of X are independent and E(Xij ) = 0 for each j. Also notice that in the third row we used the assumption that E(ǫ Similarly, when j ≠ j ′ , This can be written more compactly as where σ 2 Y = β 2 + σ 2 . Therefore, where the last equality holds since 1) and by the Cauchy-Schwarz inequality we have → 0, and we conclude that Var(τ Remark 3. Calculations for Example 1: where in the third equality we used E(X 2 ) = 1 and E(XY ) ≡ β. In the fourth equality the expectation is zero for all i ≠ i1, i2. Now, since X ∼ N (0, 1) and E(ǫ|X) = 0, then Therefore, by (7) we get c * = −2β 2 . Plugging c * back into (8) yields Var(Uc * ) = Var(τ 2 ) − (8/n) β 4 .
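The eigenvalue argument in the proof of Proposition 2 above rests on the identity ‖A‖ 2 F = Σj λ 2 j for a symmetric matrix A. This identity can be sanity-checked numerically; the matrix below is arbitrary, not the A = E(WiW T i ) of the proposition.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = (B + B.T) / 2            # an arbitrary symmetric matrix
lam = np.linalg.eigvalsh(A)  # eigenvalues lambda_1, ..., lambda_p

# For symmetric A: ||A||_F^2 = trace(A^T A) = sum_j lambda_j^2
print(np.isclose(np.sum(A**2), np.sum(lam**2)))  # True
```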
Proof of Theorem 1: 1. We now prove the first direction: OOE ⇒ Cov[R * , g] = 0 for all g ∈ G.
Let R * ≡ T + g * be an OOE for θ with respect to the family of zero-estimators G. By definition, Var Let g = Σ m k=1 ck g k for some fixed m, and note that g ∈ G. Then, Therefore, for all (c1, ..., cm), which can be represented compactly as Assuming M is positive definite and solving for c yields the minimizer cmin = M −1 b. Plugging cmin into (32) yields 2. We now prove the other direction: if R * is uncorrelated with all zero-estimators of a given family G, then it is an OOE.
Let R * = T + g * and R ≡ T + g be unbiased estimators of θ, where g * , g ∈ G. Define g̃ ≡ R * − R = g * − g and notice that g̃ ∈ G. Since by assumption R * is uncorrelated with g̃, and hence Var Proof of Theorem 2: We start by proving Theorem 2 for the special case of p = 2 and then generalize to p > 2. By Theorem 1, it suffices to show that We start by calculating the LHS of (34), namely Cov(τ 2 , g k 1 k 2 ). Recall that τ 2 ≡ β 2 1 + β 2 2 and therefore where the calculations can be justified by arguments similar to those presented in (31). We shall use the following notation: Notice that A, B, C and D are functions of (k1, k2), but this is suppressed in the notation. Write, Thus, rewriting (35) we obtain Similarly, by symmetry, Using (36) and (37) we get We now move to calculate the RHS of (34), namely which by definition is also equal to g20. Similarly, we have h12 Now, observe that for every (k1, k2, d1, d2) ∈ N 4 0 , where the third equality holds since the terms with i1 = i2 vanish. It follows from (40) that Therefore, rewriting (39) we get exactly the same expression as in (38). Hence, equation (34) follows, which completes the proof of Theorem 2 for p = 2.
We now generalize the proof to p > 2. Similarly to (34), we want to show that We begin by calculating the LHS of (42), i.e., the covariance between τ 2 and g k 1 ...kp . By the same type of calculations as in (35), for all (k1, ..., kp) ∈ N p 0 we have where L1, L2 and L3 are generalizations of the notation given in (38). Again, notice that L1, L2 and L3 are functions of k1, ..., kp, but this is suppressed in the notation.
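The optimal-coefficient step in the proof of Theorem 1 above — solving cmin = M −1 b, where M is the covariance matrix of the zero-estimators and b collects their covariances with the initial estimator — can be sketched numerically on a toy model. Everything below (the model, the two zero-estimators built from the Gaussian moments E(X 2 ) = 1 and E(X 4 ) = 3, and the sign convention b k = −Cov(T, g k ) for minimizing Var(T + c T g)) is an illustrative assumption, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta, reps = 200, 1.0, 5000

# Toy model (illustrative): T estimates E[XY]; g1, g2 are zero-estimators
# built from known standard-Gaussian moments E[X^2] = 1 and E[X^4] = 3.
x = rng.standard_normal((reps, n))
y = beta * x + rng.standard_normal((reps, n))
T = (x * y).mean(axis=1)
G = np.column_stack([(x**2 - 1).mean(axis=1), (x**4 - 3).mean(axis=1)])

# Minimize Var(T + c^T g): c_min = M^{-1} b with M = Cov(g) and
# b_k = -Cov(T, g_k); here M and b are estimated by Monte Carlo.
M = np.cov(G, rowvar=False)
b = -np.array([np.cov(T, G[:, 0])[0, 1], np.cov(T, G[:, 1])[0, 1]])
c_min = np.linalg.solve(M, b)

U = T + G @ c_min
print(np.var(U) < np.var(T))  # the combined zero-estimator lowers variance
```

Using several zero-estimators jointly, rather than one at a time, is exactly what motivates the covariate-selection algorithms discussed in the paper: M grows with the number of zero-estimators and must itself be estimated.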

Proof of Corollary 2:
Write, Consider where the second and third equalities are justified by (45) and (43), respectively. Consider now Var where the fifth equality holds since the summand is 0 for all i1 = i2. The summation is nonzero in only three cases: For the first two cases the summation equals 1. For the third case the summation equals Remark 4. Calculations for Example 2.
Recall that by (5) we have Now, when we assume standard Gaussian covariates, one can verify that , where σ 2 Y = σ 2 + τ 2 . Thus, in this case we can write and Var(T oracle ) = Var(τ 2 ) − (8/n) τ 4 = 12/n + O(n −2 ) by (11). More generally, the asymptotic improvement of T oracle over the naive estimator is: where we used the fact that σ 2 Y = τ 2 + σ 2 = 2τ 2 in the second equality. Now, notice that when p = n the reduction is 2/(3 + 2) = 40%, and when p/n converges to zero, the reduction is 2/3 ≈ 66%.
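The two percentages can be checked with a one-line function. The closed form reduction(r) = 2/(3 + 2r) for r = p/n is inferred here from the two endpoint values quoted above (40% at p = n, about 66% as p/n → 0), so treat it as an illustrative reconstruction rather than a formula stated explicitly in the text.

```python
# Asymptotic variance reduction of T_oracle over the naive estimator,
# as a function of r = p/n (assumes sigma_Y^2 = 2*tau^2 as in Example 2;
# the closed form is inferred from the two endpoint values in the text).
def reduction(r: float) -> float:
    return 2.0 / (3.0 + 2.0 * r)

print(round(reduction(1.0), 2))  # p = n     -> 0.4  (a 40% reduction)
print(round(reduction(0.0), 2))  # p/n -> 0  -> 0.67 (about 66%)
```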

Proof of Proposition 3:
Write, We start by calculating the middle term. Let pn(k) where Cn ≡

Write,
Cn Now, notice that Rewriting (53) we get where we used (54) to justify the first equality. By the same type of calculation, one can compute the covariance in (52) over all 60 cases and obtain that Cov τ 2 , We now move to calculate the last term of (51). Recall that Cov (Wi 1 j 1 Wi 2 j 2 Xi 3 j 1 Xi 3 j 2 , Wi 4 j 3 Wi 5 j 4 Xi 6 j 3 Xi 6 j 4 ), where J is now defined to be the set of all quadruples (j1, j2, j3, j4), and I is now defined to be the set of all sextuples (i1, ..., i6) such that i1, i2, i3 are distinct and i4, i5, i6 are distinct. For the set I, there are three different cases to consider: (1) when one of {i1, i2, i3} is equal to one of {i4, i5, i6}; (2) when two of {i1, i2, i3} are equal to two of {i4, i5, i6}; and (3) when {i1, i2, i3} are equal to {i4, i5, i6}. There are (3 choose 1) · 3 = 9 options for the first case, (3 choose 2) · 3! = 18 for the second case, and Cov (Wi 1 j 1 Wi 2 j 2 Xi 3 j 1 Xi 3 j 2 , Wi 4 j 3 Wi 5 j 4 Xi 6 j 3 Xi 6 j 4 ) = p −2 n (2) where in the fourth equality we use (29), and in the fifth equality we used the assumption that E(X 4 ij ) ≤ C for some positive C. Since we assume p/n = O(1), the above expression can be further simplified to By the same type of calculation, one can compute the covariance in (56) over all 594 cases and obtain that Var Lastly, plugging (55) and (57) into (51) we get where the last equality holds by (11).
Remark 5. Calculations for equation (15): where Sij ≡ Wij gi. Also notice that Var (gn) = Var In order to calculate Var(Tc * ) we need to calculate the numerator and denominator of (16). Consider first θj ≡ E(Sij ). Write, where in the last equality we used the assumption that E(ǫ|X) = 0. Since the columns of X are independent, the summation is not zero (up to permutations) when j = k and m = k ′ . In this case we have Notice that in the fourth equality we used the assumption that E(X 2 ij ) = 1 for all j = 1, ..., p. Thus, plugging in τ 2 = 1 and βj = 1/ √ p gives the numerator of (16): Consider now the denominator of (16). Write, Since we assume that the columns of X are independent, the summation is not zero when j1 = j3 and j2 = j4. Thus, Notice that we used the assumption that Σ = I in the last equality. Now, recall by (50) that Var where we used the assumption that n = p in the last equality.

Proof of Proposition 4:
We need to prove that By the Markov and Cauchy-Schwarz inequalities, it is enough to show that where Consider now the denominator of (61). Write, Since we assume that the columns of X are independent, the summation is not zero when j1 = j3 and j2 = j4. Thus, Notice that since we assume Σ = I, we have E(X 2 ij ) = 1 for all i = 1, ..., n and j = 1, ..., p. Now, since we assume that n/p = O(1), by (61) and (62) it is enough to prove that δ 1 n 3 → 0 and δ 2 n 4 → 0.

and since by Corollary 2
we have where in the last equality we used the assumption that E(ǫ|X) = 0. Since we assume that the columns of X are independent, the summation is not zero (up to permutations) when j = k and m = k ′ . In this case we have, by Cauchy-Schwarz, Notice that in the fourth equality we used the assumption that E(X 2 ij ) = 1 for all i = 1, ..., n and j = 1, ..., p. Thus, Plugging in Y = β T X + ǫ yields a sum over (j1, j2), (j3, j4), (j5, j6) and k1 < k2, k3 < k4 of terms of the form βj 3 βj 4 βj 5 βj 6 L1 E (X1j 1 X1j 2 X1j 3 X1j 4 ), where in the fourth equality we used the assumption that E(ǫ 2 |X) = σ 2 . Now, notice that: • L1 is not zero (up to permutation) when j1 = j2 and j3 = j4.
We start by showing that nθ1 → 0. Notice that T 2 B 1A = T B Tγ 1A = T 2 γ 1A. Thus, Now, notice that n E(T 2 γ (1 − 1A)) → 0 by arguments similar to those in (67), with a slight modification: we use the existence of the fourth moments of Tγ and TB, rather than the second moments. Also, by the Cauchy-Schwarz inequality we have, where C is an upper bound on the maximum of the first four moments of Tγ and TB. Therefore, nθ1 → 0.
Consider now nθ2. Write, and notice that the last equation follows by similar arguments.

Table 1:
Boxplots representing the distribution of the estimators. The x-axis stands for τ 2 B. The red dashed line is the true value of τ 2. Summary statistics. An estimate of the standard deviation of the RMSE (σ RMSE ) was calculated using the delta method. The estimator with the lowest RMSE (excluding the oracle) is shown in bold.

Table 2:
Summary statistics equivalent to Table 1 for the EigenPrism estimator.

Table 3:
Summary statistics equivalent to Table 1 for the PSI estimator.