Eﬃcient estimators for expectations in nonlinear parametric regression models with responses missing at random

: We consider nonlinear regression models that are solely deﬁned by a parametric model for the regression function. The responses are assumed to be missing at random, with the missingness depending on multiple covariates. We propose estimators for expectations of a known function of response and covariates. Our estimator is a nonparametric estimator cor- rected for the regression function. We show that it is asymptotically eﬃcient in the H´ajek and Le Cam sense. Simulations and an example using real data conﬁrm the optimality of our approach.


Introduction
In this article we study efficient estimation of expectations in a nonlinear regression model that is defined solely by the conditional constraint and therefore also known as the conditional mean model. Here the regression function r ϑ is assumed to be known up to a parameter vector ϑ and X is a d-dimensional random vector. The nonlinear regression model is an important model for applications; see, for example, the books by Bates and Watts (1998 [1]) and Seber and Wild (1989 [18]).
In the literature it is quite common to introduce a third variable ε = Y − r ϑ (X), especially if the covariates X and errors ε can be assumed to be independent; see, for example, Wang and Rao (2001 [22]), who study linear regression with missing responses. We do not make the independence assumption: in many situations, especially in applications in econometrics, model (1.1), which we consider here, is more suitable because of its flexibility.
We are interested in the scenario when responses Y are possibly missing and work with an indicator variable Z that is 1 if a response Y is observed and 0 if it is missing. Our sample consists of independent copies (X i , Z i Y i , Z i ), i = 1, ..., n, of a base observation (X, ZY, Z). The indicator Z tells us if a zero response is a numerical zero or a missing value. More specifically, we assume that the responses are missing at random (MAR), i.e. the probability that Y is missing depends only on the covariate vector X that is always observed, P (Z = 1|X, Y ) = P (Z = 1|X) = π(X).
The MAR assumption is common in applications; see, for example, the book by Little and Rubin [10]. It in particular implies that Z and Y are conditionally independent given X.
Our goal is to efficiently estimate expectations E{h(X, Y )} of the joint distribution in model (1.1), where h is some known square-integrable function. This is a quite general problem: we basically estimate the entire joint distribution of the vector (X, Y ). In the literature usually only estimation of the mean response E(Y ) is considered; see, for example, Matloff (1981 [11]), Cheng (1994 [3]), Rao (2001, 2002 [22,23]) and, for further references, Müller (2009 [12]). Other examples of such expectations are moments of Y or X, mixed moments, and probabilities involving X and Y such as P (X < Y ). Estimation of E{h(X, Y )} is also considered in Müller (2009 [12]) in the more restrictive nonlinear regression model with independent covariates and errors. That article (1.2) with Γ defined in equation (2.1) in Section 2. Our nonparametric estimator H np for the first part of (1.2) is a partially imputed estimator, where χ(x) is the Nadaraya-Watson estimator of χ(x) = E{h(X, Y )|X = x}, similar to that used in Cheng (1994 [3]) (see Section 2.1 for details). Alternatively one could, as in Cheng and Wei (1986 [4]) and Cheng (1990 [2]), use a full imputation approach, which also replaces observed cases with estimators. In the nonparametric model full imputation and partial imputation are asymptotically equivalent (see Cheng, 1994 [3]), which is intuitively clear since the model contains no structural information. For this article we prefer partial imputation, for reasons of speed and simplicity. We will show that the estimator proposed in this paper is efficient in the sense of Hájek and Le Cam. The efficiency results imply asymptotic normality, which is useful for constructing approximative confidence intervals for expectations E{h(X, Y )} of known square-integrable functions h(X, Y ). To the best of our knowledge, our estimator is the first efficient estimator for E{h(X, Y )} in the parametric MAR multiple regression model (1.1). Müller et al. (2006 [15]) propose an efficient estimator for univariate linear regression, but does not provide technical details. The results of this paper also apply to the usual model with no missing data, i.e. when all indicators equal one and π(·) ≡ 1, so this is covered as a special case. This paper is organized as follows. In the next section we provide a complete and detailed derivation of the stochastic expansion of the nonparametric estimator and of the correction term. Section 3 characterizes efficient estimators of functionals of the joint distribution and gives the efficient influence function for estimating E{h(X, Y )} in our model. The efficiency of our estimator is established by showing that the expansion in Section 2 matches the efficient influence function in Section 3. In Section 4 we explain how our estimator can be implemented and compare it with other methods in various scenarios, using computer simulations. The results are positive throughout and confirm the theoretically proved optimality of our approach. In Section 5 we illustrate our approach by means of a real data set. Some technical details can be found in the Appendix.

Expansion of the estimator
Our estimator H = H np − Γ from (1.2) consists of the nonparametric estimator H np from equation (1.3) and a correction term Γ, which has the form Here g(x) is a consistent estimator of uniformly in x on the support I of X, with ρ h (x) = E{h(X, Y )ε|X = x} and σ 2 (x) = E(ε 2 |X = x). The term Γ incorporates the parametric regression structure and is suggested by the canonical gradient, which characterizes the influence function of an efficient estimator (see Section 3).
To estimate g(x) we can, for example, use a combination of Nadaraya-Watson estimators introduced by Nadaraya and Watson (1964 [16, 24]). The residuals ε i = Y i − r ϑ (X i ) are based on an efficient estimator ϑ of ϑ; see Müller and Van Keilegom (2012 [13]) for an approach using estimating equations, and also for an overview of related efficient methods.
All estimators in Γ, including ϑ, are complete case estimators since only observations with Z = 1 are used; see Müller and Schick (2017 [14]) who show that in the model with MAR responses complete case analysis is efficient for estimating characteristics of the conditional distribution of Y given X. The estimator H np from (1.3), on the contrary, is an imputation estimator. Hence our estimator (1.2) is a combination of imputation and complete case analysis.
In the usual model with no missing data, the partially imputed estimator H np for E{h(X, Y )} reduces to the empirical estimator. However, it is not efficient unless we enhance it by correcting for the unknown parametric regression function using Γ with all Z i = 1 (i = 1, . . . , n) and π ≡ 1, i.e. the efficient estimator becomes In the following, for convenience of notation, we will always use the lower case letter c to represent a generic constant. The norm brackets · refer to the Euclidean norm of a vector. We will assume throughout that π(x) > 0, for all x in the support I of X, to exclude the extreme case that no response is observed, that h(X, Y ) is squareintegrable and that E(ε 2 ) is positive and finite. The covariate vector X and the regression function need to satisfy the following conditions.
Assumption (X). The d-dimensional random vector X has a compact support I and a density f that is bounded and bounded away from zero on I.
To construct an efficient estimator of E{h(X, Y )}, an efficient estimator of ϑ, say ϑ, is needed. Efficient estimation of ϑ in models defined by conditional constraints is discussed in Müller and Van Keilegom (2012 [13]). They show that an efficient estimator ϑ is characterized by the following expansion.
An example of an efficient estimator is provided by Müller and Van Keilegom (2012 [13]), who propose using where σ 2 (x) is a consistent estimator of σ 2 (x) uniformly in x ∈ I, for example, the Nadaraya-Watson estimator. Under the conditions of their Theorem 2.1, the solution ϑ of the above estimating equation satisfies Assumption (T).
In the next two subsections we will expand the partially imputed estimator H np of equation (1.3) and derive the expansion of the correction term Γ introduced in (2.1). Combining the two parts gives the expansion of H = H np − Γ, which is stated in Corollary 2.1 at the end of this section.

Expansion of the nonparametric estimator
Consider the nonparametric partial imputation estimator introduced in (1.3), which imputes only the incomplete cases, as in Cheng (1994 [3]). We propose estimating the conditional expectation χ(x) = E{h(X, Y )|X = x} by the Nadaraya-Watson estimator is a kernel function with integrated boundary correction, i.e. the form of the kernel is different for interior and boundary points x. The letter b = b n denotes a bandwidth sequence which tends to zero as n increases. By using boundary kernels we assume that we know the support of I, which is a rather strong assumption. For practical applications we recommend estimating I using extreme values; see also Remark 2.1.
To derive the expansion of the partially imputed estimator (1.3) we stipulate the following assumption on the kernel K.  Note that assumption (K) is necessary because we consider a scenario with a covariate vector X. This is in contrast to Cheng (1994 [3]), who considers the nonparametric model with univariate covariates X. Cheng uses Theorem 1 of [5] to derive the expansion of his version of H np . The theorem requires a nonnegative kernel function, so it cannot be applied to our multivariate scenario, which requires using higher order kernels. For the construction of such kernels we can use results from Simonoff (1996 [19]); see Remark 2.1 at the end of this subsection for details.
For the ease of derivation we will further assume that π(x) and σ 2 (x) are bounded away from zero on I. In the second conclusion of Lemma 2.1 below we will show that n j=1 K b (X j , x)Z j /n converges to π(x)f (x) in probability, uniformly in x. It follows that with probability tending to one. Hence we can assume, without loss of generality, that the denominator in the Nadaraya-Watson estimator χ(·) is bounded away from zero on I. Finally we need the following two conditions.
Assumption (D). The functions χ, π and f are d + 1 times continuously differentiable on I.
Lemma 2.1 below facilitates the derivation of the asymptotic linearity of H np in Theorem 2.1.
Suppose further that the distribution of (X, V ) has a joint density and that Assumptions (K), (B) and (D) are satisfied.
The vector X is our covariate vector from Assumption (X), whereas V is an arbitrary random vector, so the results can be used at various points of the proof. We will, for example, use Lemma 2.1 with Z in place of g(X, V ). The proof is given in the Appendix.
Proof. To prove this theorem we write 3) We will show that B 1 and the term B 3 given below are asymptotically equivalent. This will be established using Assumptions (X), (K), (B) and (D) and Lemma 2.1. Then we show that the leading term of H np stated in the theorem is an approximation of A + B 3 . We first introduce Then we define B 3 as the conditional expectation of B 2 given the completely observed cases "B", i.e.
Formally B stands for the subset {(X j , Y j , Z j ), j = 1, . . . , n : Z j = 1}. A fairly straightforward but lengthy calculation yields the asymptotic equivalence of B 1 and B 3 , The detailed proof of (2.4) is provided in the Appendix. It remains to examine B 3 more closely. Assume that (X p , Y p , Z p ) does not belong to the set of complete observations B. We have The last but one step follows from the first conclusion in Lemma 2.1. This combined with (2.3) and (2.4) gives the expansion provided in the theorem: This completes the proof.

Remark 2.1.
To specify a kernel that satisfies Assumption (K), we can extend the construction of second order boundary kernels in Section 3.3.1 of Simonoff (1996 [19]) to higher order boundary kernels. Consider, for example, the case d = 2 and X = ( and |∂L i (s 1 )/∂s 1 | < η|s 1 | ν for any s 1 satisfying |s 1 | > ζ, with some constants η, ζ and ν > 1, we first calculate second order kernels where I 1 denotes the support of X 1 . Then the linear combination of T 1 (s 1 , x 1 ) and T 2 (s 1 , x 1 ), , is a univariate third order boundary kernel for X 1 . A third order boundary kernel K 2 (s 2 , x 2 ) for X 2 can be constructed analogously. By taking the product we obtain the desired bivariate third order boundary kernel K(s, The construction of general multivariate higher order boundary kernels is done analogously. For j = 2, 3 . . . , d, we first calculate univariate j-th order boundary kernels T 1 and T 2 , and then a univariate (j + 1)-th order boundary kernel as the linear combination given above. The product of j such univariate (j + 1)-th order boundary kernels yields a multivariate (j + 1)-th order boundary kernel K.
A multivariate (d + 1)-th order boundary kernel constructed in this way will satisfy Assumption (K). If the boundary of the support I is unknown, it can be estimated using extreme values, i.e.

Expansion of the correction term
To expand the additive correction the following assumption is required: Under Assumption (R) on the regression function and Assumption (S) we expand the nonlinear correction Γ in the next theorem. Remember that

Theorem 2.2. Suppose that Assumptions (X), (R), (T) and (S) hold and that
We begin with an auxiliary result. A first order Taylor expansion, using Assumption (R), yields This combined with the square integrability of ρ h (X), Assumption (S), guarantees for any constant c that The second relation uses the Cauchy-Schwarz inequality. Now apply (2.5) to obtain This combined with (2.6) gives Here we use the law of large numbers in the third step and the fourth equation uses the asymptotic linearity of ϑ stated in Assumption (T). This gives the influence function of the correction term, which involves the unknown quantity Replacing g(x) by a uniformly consistent estimator g(x) does not change the asymptotic expansion because We verify (2.9) in the Appendix. This completes the proof.
, σ 2 (x) and π(x) being Nadaraya-Watson estimators of ρ h (x), σ 2 (x) and π(x), respectively. The Nadaraya-Watson estimator is uniformly consistent when X has a compact support. In Section 4 we will use this estimator for our simulation study, and also show more details.
We conclude the section with the final expansion of our estimator H = H np − Γ. The result follows directly from the statements in Theorems 2.1 and 2.2 on H np and Γ. We therefore formulate the result as a corollary.

Efficiency
In this section we calculate the canonical gradient of E{h(X, Y )}, which characterizes the influence function of an efficient estimator of that expectation. The efficiency of our estimator will be established by showing that the canonical gradient equals the influence function obtained in Section 2. We will use results from Müller et al. (2006 [15]) and Müller (2009 [12]) about the canonical gradient, and also from Schick (1993 [17]) about the tangent space in nonlinear regression. Essential for the derivation of canonical gradients is the notion of tangent space: a canonical gradient is characterized as an orthogonal projection of a gradient onto the tangent space, which is the closed linear span of the set of all perturbations of the joint distribution P (dx, dy, dz) within the model. The distribution P depends on the marginal distribution G(dx) of X, the conditional probability π(x) of Z = 1 given X = x and the conditional distribution Q(x, dy) of Y given X = x. Müller et al. (2006 [15]), who also considers regression models with MAR responses, were the first to describe the tangent space for general differentiable functionals κ(G, Q, π) in this model. They write the joint distribution in the form where B p = pδ 1 + (1 − p)δ 0 denote the Bernoulli distribution with parameter p and δ t the Dirac measure at t. To specify the tangent space we assume that G,

3997
Q and π are Hellinger differentiable: where . = means ignoring o p (n −1/2 ) items. Since the perturbed distributions are probability distributions, the Hellinger derivative u belongs to The tangent space is the orthogonal sum As in Müller et al. (2006 [15]), we have no structural assumptions on G and π. This means that we have no further restrictions on the perturbations u and w and can therefore take U = L 2,0 (G) and W = L 2 (G π ). We must, however, take the regression structure into account, i.e. the space V is the subset of V 0 to which v is now restricted. In the following we assume that the subspaces U , V and W are closed and linear.
The canonical gradient g * is an element of the tangent space and has the form where u * (X), Zv * (X) and {Z − π(X)}w * (X) are projections of the gradient (that characterizes the differentiable functional) onto the three orthogonal subspaces of the tangent space. For full details of the results we have just summarized, see pages 352-355 in Müller et al. (2006 [15]). There they provide a detailed characterization of efficient estimators in the model with MAR responses and then specialize them to four specific models for the conditional distribution Q. In the current paper we have Q(x, dy) = f {y − r ϑ (x)|x}dy, where f (·|x) denotes the conditional density of the (conditional mean zero) error distribution given X = x. In order to find V we introduce perturbations of the parameter ϑ and the conditional error density. The exact form of V is only relevant for the derivation of v * and is therefore located in the Appendix; see Section A.4 for details. The derivation of u * and w * is given in the proof of Theorem 3.1 below. In that theorem we provide the explicit representation of the canonical gradient of E{h(X, Y )}. The efficiency of our estimator is formulated subsequently in Corollary 3.1 Theorem 3.1. Let the vector Δ and the matrix I be defined as in Section 2, i.e. Δ = E{ṙ ϑ (X)h(X, Y )σ −2 (X)ε} and I = E{Zṙ ϑ (X)ṙ ϑ (X) σ −2 (X)}. Suppose Assumptions (R), (S) and the Hellinger differentiability assumption (3.1) are satisfied. Further assume that the conditional density of ε given x, f (·|x), has a finite Fisher information I and that I is invertible. Then the canonical gradient of the functional E{h(X, Y )} is Proof. Müller et al. (2006 [15]) and Müller (2009 [12]) show that the canonical gradient g * (X, ZY, Z) from (3.2), now specifically for the functional E{h(X, Y )}, is determined by for all u ∈ U , v ∈ V and w ∈ W . Here we use the fact that the canonical gradient of E{h(X, Y )} is a projection of a gradient of E{h(X, Y )} onto the tangent space. To determine the specific form of g * we set u = 0 and v = 0 in In order to find v * we must take the parametric model structure into account, i.e. the special form of the subset V ⊂ V 0 . The derivation of V and v * is quite elaborate and therefore given in Section A.4 of the Appendix where we prove . It follows from Corollary 2.1 that our estimator is asymptotically linear, and from Theorem 3.1 that the influence function given in Corollary 2.1 equals the canonical gradient g * from Theorem 3.1. Hence our estimator is efficient in the sense of the Hájek and Le Cam theory, which implies asymptotic normality. We formulate this as a corollary.

Linear and nonlinear regression with one covariate
To illustrate the results of the previous sections, we conduct a simulation study comparing various estimators for E(Y ), E(Y 2 ), E(XY ) and E{X exp(XY )}; see Tables 1-4. In each case we consider two regression functions, r ϑ (x) = ϑx and r ϑ (x) = cos(ϑx) with ϑ = 2, and two variance functions, namely a linear variance function σ 2 (x) = 0.6 − 0.5x and a parabolic variance function σ 2 (x) = (x − 0.4) 2 + 0.1. The covariate X is generated from a uniform distribution on [−1, 1] and the error variable η in ε = σ(X)η from a standard normal distribution. In all scenarios we use the logistic distribution function π(x) = 1/{1 + exp(−x)} for the conditional probability, so that about half of the simulated responses are missing. In this section we use, for simplicity, only ordinary kernels instead of the boundary kernels discussed in Remark 2.1.
To evaluate the performance of our asymptotically optimal estimator when sample sizes are small we simulate the mean squared errors (MSE) of H ϑ and H ϑ . Here H ϑ denotes the version of our estimator H from (1.2) that uses the true values of σ 2 (x), π(x) and ϑ in the correction term, whereas H ϑ uses estimators for those quantities. For the calculation of we use a consistent nonparametric estimator for σ 2 (x), namely where K b1 (·) is a Gaussian kernel with bandwidth b 1 and ϑ 0 is the ordinary least squares estimator (or some other consistent estimator of ϑ). In the model with a linear regression function ϑ and ϑ 0 have a closed form, while for the cosine function we use the nls function in R to obtain them. Our nonparametric estimator for π(x) is where K b2 (·) is a Gaussian kernel with bandwidth b 2 ; ρ h (x) is a plug-in estimator for ρ h (x) = E{h(X, Y )ε|X = x}. For our choices of h it will involve the estimators σ 2 (x) and ϑ just described; see below for more details. The nonparametric part H np of our estimator H is the partially imputed estimator (1.3).
It is based on a Nadaraya-Watson estimator for the conditional expectation χ(x) = E{h(X, Y )|X = x}, with a Gaussian kernel K b3 (·) with bandwidth b 3 . We also compare H ϑ and H ϑ with the simple Horvitz-Thompson type es- For estimators involving kernel estimation we use leave-one-out cross validation to select the bandwidth. For example, to obtain the bandwidth b 1 of σ 2 (x), we first calculate, for each complete observation (X j , Y j ), for bandwidths b from a candidate set G. Then b 1 is obtained as For the case h(x, y) = y, for example, we used the set G = {0.1, 0.2, ...0.5} for b 1 and also for b 2 . The bandwidth b 3 for the nonparametric part H np has the form b 3 = an −2/5 , which is indicated to have optimal convergence rate by Cheng (1994 [3]). We chose a = 0.5, 0.6, . . . , 0.9 to determine b 3 .  The simulated mean squared errors for estimating the mean response are given in Table 1. In this case ρ h (x) = E{h(X, Y )ε|X = x} = E{Y ε|X = x} = σ 2 (x). In each row of Table 1 the efficient estimator outperforms the nonparametric estimator without the nonlinear correction, while the simple estimator is inferior to any of its competitors. In the linear regression model the two versions H ϑ and H ϑ of the efficient estimator differ slightly, in contrast to the cosine regression model, where the difference is quite large. This is because the estimator of the slope in the linear regression model is better than that of the frequency parameter in the model with the cosine regression function. The MSEs for different sample sizes confirm the root-n convergence rate of the efficient estimator, as stated in Corollary 2.1. The entries are mean squared errors as in Table 1 Table 2 displays the simulation results for the same scenario as in Table 1, but now the second moment of the response is estimated. The efficient estimator again outperforms both the nonparametric estimator and the simple estimator. In our scenario with normal errors we have ρ h (x) = 2r ϑ (x)σ 2 (x) with r ϑ (x) = ϑx and r ϑ (x) = cos(ϑx). In both regression models it makes little difference whether true values or estimators are used. We consider the same scenario as in Tables 1 and 2, now with h(x, y) = xy.
The MSEs for estimating E(XY ) are given in Table 3. In both regression models ρ h (x) = xσ 2 (x). The first two columns of the left panel (linear regression) indicate that estimating σ 2 (x), π(x) and ϑ increases the MSE slightly. The MSEs in the corresponding columns in the right panel (cosine regression) appear to be similar. The results in Table 3 again confirm the superiority of the efficient estimator as well as the convergence rate.
The results for E{X exp(XY )} are listed in Table 4. Straightforward calculations yield ρ h (x) = σ 2 (x)x 2 exp[{ϑ + σ 2 (x)/2}x 2 ]. The efficient estimator clearly outperforms the competing estimators. As in the previous tables we see that the two estimators H ϑ and H ϑ based on true values and on estimates perform similarly. The influence function of the efficient estimator in Corollary 2.1 contains a non-negligible part that comes from the difference n 1/2 ( ϑ − ϑ). This part is missing if we replace ϑ by ϑ, which explains why in some cases, e.g. the upper left panel in Table 4, H ϑ outperforms H ϑ . However, estimating σ 2 (x) and π(x) adds uncertainty, especially if n is not very large, so that in other cases, for example in the right panel in Table 4, the MSE of H ϑ is larger than that of H ϑ .

Linear regression with two covariates
Finally we consider a bivariate covariate vector X = (X 1 , X 2 ) and a linear regression function r ϑ (x) = ϑ 1 x 1 + ϑ 2 x 2 with ϑ 1 = 1 and ϑ 2 = 2. We modify the scenario of the previous section as follows: the variance function σ 2 (x) = σ 2 (x 1 , x 2 ) is set to be 2.1 − 0.5(x 1 + x 2 ) or (x 1 + x 2 − 0.8) 2 + 0.1, and π(x) = 1/[1+exp{−(x 1 +x 2 )}]. In order to generate correlated covariates X 1 , X 2 we first sample auxiliary random variables W, X 1 and X 2 independently: W is generated from a uniform distribution on [−0.5, 0.5], and X 1 and X 2 are generated from a uniform distribution on [−1, 1]. Then we take X 1 = X 1 + W and X 2 = X 2 + W . Our final estimator is based on kernel estimators. For example, σ 2 (x) now involves a product of two Gaussian-based kernels of order 4 (Wand and Schucany, 1990 [21]), i.e. K(x) = (3 − x 2 )Φ(x)/2, where Φ(·) is the standard Gaussian density function, both using the same bandwidth, to estimate the unknown conditional expectations. Table 5 shows the simulated mean squared errors of estimators of the mean response in the bivariate regression model. In this case ρ h (x) = E{h(X, Y )ε|X = x} = σ 2 (x). Again our efficient estimator outperforms the competing estimators and confirms our theoretical results. The efficient estimator that uses estimates σ 2 (x), π(x) and ϑ is better than the estimator H ϑ , which uses the true values.

An example
In this section we apply our method to a data set of 2139 HIV positive patients from a clinical trial (Hammer et al., 1996 [6]). The data are freely accessible in the R package speff2trial. The entries are simulated mean squared errors of estimators of the mean response, here for the scenario with the bivariate linear regression function r ϑ (X) = ϑ 1 X 1 + ϑ 2 X 2 (ϑ 1 = 1, ϑ 2 = 2) described in Section 4.2.
In the trial patients were randomly assigned to four antiretroviral therapies: (i) zidovudine (ZDV) monotherapy, (ii) ZDV + didanosine (DDI), (iii) ZDV + zalcitabine, and (iv) DDI monotherapy. We want to compare the ZDV monotherapy (i) with the alternative group of therapies (ii)-(iv), and estimate the mean number of CD4 cells in both groups, i.e. the number of white blood cells that fight the infection. An increasing CD4 count indicates that the HIV treatment is more effective.
We are interested in the difference between the mean CD4 counts (Y ) in the monotherapy group and the mean CD4 counts in the alternative therapy group at 96±5 weeks post therapy. There are six covariates: X (1) , age; X (2) , weight; X (3) , CD4 counts at baseline; X (4) , CD4 counts at 20±5 weeks; X (5) , CD8 (immune cells) counts at baseline; X (6) , CD8 counts at 20±5 weeks. Because of deaths and dropouts, 39% of the responses in the monotherapy group and 37% of the responses of the combined therapy group are missing, while all covariates are observed for all patients. Let Z again denote the missingness indicator (which is 1 if Y is observed and 0 if it is missing). As indicated by Hu et al. (2010 [9]) and Tang et al. (2018 [20]), who consider the same data set, it is reasonable to asume that the conditional expectation of the response given the covariates can be modelled using linear regression, and that the response is missing at random. The variable selection results in Tang et al. (2018 [20]) suggest that only X (3) , X (4) and X (6) actually affect Y . We therefore assume E(Y |X) = ϑ X, with a covariate vector X = (1, X (3) , X (4) , X (6) ) and a regression parameter ϑ ∈ R 4 .
We apply our method to the two groups of data separately and construct the efficient estimator for the mean response μ (0) in the monotherapy group and the mean response μ (1) in the combined therapy group. Then we calculate the difference between the means, μ (1) − μ (0) . For the construction of the efficient estimator see Section 4.1.
For comparison we also consider the three estimators for the mean difference μ (1) − μ (0) in Section 7 of Hu et al. (2010 [9]): inverse probability weighting estimation (IPW), augmented inverse probability weighting estimation (AIPW), and semiparametric dimension reduction estimation (SDR). Besides the linear regression model between Y and X, Hu et al. additionally assume a parametric logistic model for the probablity of missingness, i.e. logit{π(X)} = γ X for some parameter γ (which is technically a different statistical model). For the term π(X) in the nonlinear correction term of our efficient estimator, we therefore use the nonparametric estimator (4.1) and a parametric estimator for the logistic model, both based on (X (3) , X (4) , X (6) ) . IPW, inverse probability weighting estimation; AIPW, augmented inverse probability weighting estimation; SDR, semiparametric dimension reduction estimation; EENP, efficient estimator with the nonparametric estimator for the probability of missingness; EEP, a version of the efficient estimator with the logistic model for the probability of missingness.
The point estimators, standard errors and 95% confidence intervals of various methods are given in Table 6. The results of the IPW, AIPW and SDR are taken from Hu et al. (2010 [9]) for comparison. The standard errors of the two versions of the efficient estimator (EENP and EEP in Table 6) are obtained using the bootstrap based on 500 repetitions. The point estimators, standard errors and confidence intervals of our method are close to those of the AIPW and SDR, which both attain an efficiency bound if E(Y |X) and π(X) are correctly specified, as discussed in Section 3 of Hu et al. (2010 [9]). However, our method is efficient without specifying an auxiliary parametric model for π(X). From Table 6 we can see that the results of the EENP and the EEP are very close.

A.1. Proof of Lemma 2.1
Let f 2 denote the joint density of (X, V ) and f (·|x) the conditional density of V given X = x. For the proof of part (1) we write μ(x) = E{g(X, V )K b (X, x)} and, using substitution, obtain

4005
A Taylor expansion gives that for s = (s 1 , . . . , s d ) ∈ S b (x), For example, if x = (x 1 , x 2 ) and s = (s 1 , s 2 ) are two-dimensional vectors, we have Since f (x) and m(x) are d + 1 times continuously differentiable on I, the term in the remainder is When |α| = d + 1, it follows that with β ∈ R d , because f (x) and m(x) are d + 1 times continuously differentiable on I. By Assumption (K) (ii) we have where the second step holds true because of (A.1) and the last step because of Assumption (K) (i). Therefore, by Assumption (B), we have We now prove part (2). Analogously as in the derivation of Theorem 2 in Hansen (2008 [7]), where Y in the original proof is replaced by g(X, V ), we have The assumptions of that theorem are satisfied: 1. Assumptions 1 and 3 in Hansen (2008 [7]) hold true by Assumption (K)(i) and (K)(iii), respectively; 2. we have independent observations, so conditions (2), (4), (7) and (10) in [7] are not needed, and (11) in that article simplifies to θ = 1; 3. condition (5) in [7] is satisfied by Assumption (X), and inspecting the proofs of Theorems 1 and 2 in [7] reveals that condition (3) and (6) in that article can be replaced by the assumption that g(X, Y ) is square integrable for independent data; 4. equation (12) in [7] is met by Assumption (B); equation (13) in that article is satisfied since the support I in (A.3) is compact.

A.2. Proof of equation (2.4) (Theorem 2.1)
We can write B 1 in the form By the second conclusion of Lemma 2.1 we have and therefore where the two items are both bounded away from zero by Assumption (X) and (2.2). This shows In the following we assume that (X p , Y p , Z p ) and (X q , Y q , Z q ) are two different observations which do not belong to the set of complete observations B. Consider where the last equality holds because φ(X p ) and φ(X q ) are conditionally independent given B. Then This combined with Further we obtain For T 1 we have In the third step we used and in the last statement that the variances are finite by assumption. The second term T 2 computes to The last step follows from the first conclusion of Lemma 2.1. Hence we have This combined with (A.4) yields Now use E(B 2 − B 3 ) = 0 and Chebyshev's inequality to obtain n 1/2 |B 2 − B 3 | = o p (1). This and n 1/2 |B 1 − B 2 | = o p (1) finally give

A.3. Proof of equation (2.9) (Theorem 2.2)
We will use similar arguments as in the first part of the proof of the theorem, in particular we write again ε * i = ε i −ṙ ϑ (X i ) ( ϑ − ϑ). Using this notation, equation (2.9) becomes We treat the three parts separately. As in the proof of (2.6), we obtain for the first part In the last step we use the arguments following (2.7) with g(·) ≡ 1, and the fact that g(x) is a consistent estimator of g(x). The second part computes to where the last step uses Assumptions (R) and (T) as well as the uniform consistency of g(x). Finally we show In equation (2.8) in the first part of proof of Theorem 2.2 we have seen that n −1 n i=1 Z i g(X i )ε i is part of the approximation and therefore has the order O p (n −1/2 ). The term on the left-hand side of (A.7) is approximately conditionally centered (given X i ). Since g(x) − g(x) is asympotically negligible, we obtain the desired order o p (n −1/2 ).
Choose, for example, where ϑ i is some consistent estimator of ϑ that does not use (X i , Y i ) if that pair is observed. The other two leave-one-out estimator are defined similarly. Thanks to this construction g(X i ) is independent of Y i and Z i conditional on X i , and we obtain (suppressing the subscript i) E Z( g(X) − g(X))ε = E Z g(X)ε = E E{Z g(X)ε|X} = E π(X)E{ g(X)|X}E{ε|X} = 0.

A.4. Proof of equation (3.6) (Theorem 3.1)
To specify the tangent space V concerning the conditional distribution we introduce perturbations s and t of the two parameters f (·|x) and ϑ. Write F (·|x) for the conditional distribution function of f (·|x) and assume that f (·|x) has finite Fisher information for location, E 2 (ε|x) < ∞, where (·|x) = −f (·|x)/f (·|x) is the score function. The perturbed conditional distribution is Here S is determined by two constraints: the perturbed error conditional density f ns (·|x) must integrate to one, f ns (y|x)dy = 1, and must be centered at zero, yf ns (y|x)dy = 0. As in Schick (1993 [17]), Section 3, we have Therefore the subspace V of V 0 is V = s{x, y − r ϑ (x)} + {y − r ϑ (x)|x}ṙ ϑ (x) a : s ∈ S, a ∈ R d .
Writing this as an iterated expectation, E{E(Zt * t|X)} = E{π(X)E(t * t|X)} = E[E{h(X, Y )t|X}], (A. 10) we see that h(X, Y )/π(X) is a candidate for t * (X, Y ). Since t * must be in S we choose a suitably modified version, namely with ρ h (x) = E{h(X, Y )ε|X = x}. Now plug a * and t * into the formula for v * in (A.8) to obtain formula (3.6).
To verify (A.11) formally, we show that t * satisfies characterization (A.10) and that t * is in V 2 = {t(X, Y ) : t ∈ S}. To prove the first part we consider t = t(X, Y ) ∈ V 2 , that is, by definition of S, E(t|X) = 0 and E(tε|X) = 0. Then