Kernel Machines With Missing Responses

Missing responses are a form of missing data in which outcomes are not always observed. In this work we develop kernel machines that can handle missing responses. First, we propose a kernel machine family that uses mainly the complete cases. For the quadratic loss, we then propose a family of doubly-robust kernel machines. The proposed kernel-machine estimators can be applied to both regression and classification problems. We prove oracle inequalities for the finite-sample differences between the kernel machine risk and the Bayes risk. We use these oracle inequalities to prove consistency and to calculate convergence rates. We demonstrate the performance of the two proposed kernel machine families using both a simulation study and a real-world data analysis.


Introduction
We consider the problem of learning in the presence of missing responses. Missing responses are a type of missing data in which the response variable cannot always be observed. Missing responses are common in market research surveys, medical research, and opinion polls. Our motivating example is the Los Angeles County homeless survey directed by the Los Angeles Homeless Services Authority (LAHSA). Los Angeles County has 2054 tracts, and LAHSA was interested in the homeless count in each tract. For each tract, information such as the median household income and the percentage of unoccupied housing units was collected. Due to budget constraints, LAHSA used stratified spatial sampling of tracts to conduct the survey. There were 244 tracts known to have a large homeless population, and all of these were included in the survey. Out of the 1810 other tracts, 265 were randomly included in the survey, leaving 1545 that were not included. The probability of tract inclusion in the survey depended on the Service Provision Area (SPA): different areas have different probabilities of being visited. Thus, this is a problem of missing responses, as covariates were collected for all tracts but responses were collected only for tracts that were included in the survey. More details can be found in Kriegler and Berk (2010).
Another example of missing responses is the following. Consider a clinical study in which genetic information is collected on all participants but the level of a specific biomarker is collected only on a subsample chosen based on their genetic information. In this example, the genetic information forms the covariates, which are collected for everyone, while the biomarker is the response, which is missing for some of the participants.
Inference for missing data is challenging. There are three main mechanisms leading to missing data: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR); see Little and Rubin (2002, Chapter 1). The two examples of missing responses discussed above can be classified as MAR. In the homeless count example, whether a tract is visited depends on the area and is independent of the actual count in the tract. In the biomarker example, biomarker levels are collected based on the genetic profile and not on the biomarker level itself.
Three main approaches are usually used for handling missing values in statistical analysis. The complete-case analysis uses mainly the observations that have no missing data. These observations are referred to as the complete cases. The second approach is imputation, where a value or a set of values is assigned to each missing value. The third approach is a maximum likelihood approach, which first poses a model for the response given the covariates. Under the MAR assumption, the likelihood function can be written as the likelihood of the response given the covariates multiplied by the likelihood of the covariates and the missing mechanism, which allows the two parts of the likelihood to be maximized separately. Pelckmans et al. (2005) summarize the three approaches; see also Little and Rubin (2002) and Tsiatis (2006, Chapter 6).
We develop a kernel-machine approach for missing responses. Kernel methods, which include SVMs as a special case, are easy-to-compute techniques that enable estimation under weak or no assumptions on the distribution (Steinwart and Christmann, 2008; Hofmann et al., 2008). Kernel machines minimize a regularized version of the empirical risk, where the empirical risk is the average of a loss function over the observed sample. In recent years, kernel methods have been developed for many types of data, including some missing data settings. However, so far, no work has been done on missing responses in the context of kernel machines.
We first propose a family of kernel machines that can be considered as inverse-probability-weighted complete-case estimators (Robins et al., 1994; Tsiatis, 2006). More specifically, we first model the missing mechanism if it is unknown; in some settings it is known by design and then we do not need to model it. Then, we use the estimated inverse probabilities of the observed cases to weight the loss function of the complete cases. We show that if the missing mechanism is specified correctly, the empirical risk, which is the sum of the weighted loss function, is an unbiased estimator for the risk. We then prove oracle inequalities and consistency results, and calculate convergence rates for this type of kernel machine. The main drawback of this approach is that the missing mechanism is estimated using a model, and when this model is misspecified the estimator could be biased.
We propose a doubly-robust kernel-machine estimator in order to overcome the potential bias caused by misspecification of the missing mechanism. Doubly-robust estimators are augmented inverse-probability-weighted complete-case estimators. Scharfstein et al. (1999) first introduced the notion of doubly-robust estimators. Bang and Robins (2005) give an overview of the development of doubly-robust estimators. Zhao et al. (2015) give a new application in the setting of individualized treatment regimes. We face two main challenges when constructing a doubly-robust kernel-machine estimator. The first is that the loss function obtained by adding an augmentation term needs to ensure both double robustness and convexity. The second is that the augmentation-term estimation needs to converge uniformly over a set of functions that grows with the sample size. We are not aware of any doubly-robust estimators in the context of kernel machines. To compute the proposed doubly-robust kernel-machine estimator, we first estimate the missing mechanism and also the conditional distribution of the response given the covariates. The latter is used to calculate the conditional risk. Then, we augment the previous weighted loss function with a weighted conditional risk. The empirical risk based on this loss function has the doubly-robust property, in the sense that if either the missing mechanism or the conditional distribution is correctly specified, not necessarily both, the empirical risk is unbiased. We prove oracle inequalities and consistency results, and calculate convergence rates for quadratic-loss doubly-robust kernel machines.
To illustrate the proposed kernel machine methods, we apply them to simulated data. In the simulation study, we use the proposed kernel machine methods to analyze regression and classification problems. We then analyze the Los Angeles homeless data, comparing the proposed kernel machines to other existing methods.
Approaches for missing responses include the work of Wang and Rao (2002). Under the missing at random assumption, they first imputed the missing response values by kernel regression imputation and then constructed a complete-data empirical likelihood to obtain the mean of the response variable from the imputed data set. Wang et al. (2004) extended a semiparametric regression analysis method to include missing responses. Their interest was to estimate the mean of the response. First they used a partially linear semiparametric regression model to estimate the conditional mean of the response given the covariates; only complete cases are included in this step. Then they used weighted observed responses and the weighted conditional mean of the response to estimate the mean. Smola et al. (2005) developed a framework in which kernel methods can be written as estimators in an exponential family, which can handle both missing covariates and missing responses. They extended the concave convex procedure (Yuille and Rangarajan, 2003) to find a local optimum. However, there is no guarantee of convergence and the computations can be demanding. Liang et al. (2007) proposed a partially linear model for missing responses with measurement errors on the covariates. Azriel et al. (2016) studied a regression problem with missing responses. They showed that when the conditional expectation is not linear in the predictors, the additional observations provide more information. In their work, they constructed the best linear predictor, which depends also on the incomplete data.
In the learning literature, semi-supervised learning is halfway between supervised and unsupervised learning and can be used to handle missing data. In this learning scenario, a dataset has two components: labeled and unlabeled. Semi-supervised learning aims at more accurate prediction by also taking into account the unlabeled data. Semi-supervised learning methods use, for example, distance measures to create clusterings and neighbouring graphs. Then, the obtained structure is used to get a better understanding of the labeled data. However, the semi-supervised approach is different from our proposed kernel machine method, because semi-supervised learning methods do not consider the missing mechanism and do not try to account for the bias of the complete observations. Details about semi-supervised learning are given in Chapelle et al. (2006).
The paper is organized as follows. Background and notation are given in Section 2. In Section 3 we present the proposed kernel machines. Section 4 presents the main theoretical results, including the oracle inequalities, consistency results, and convergence rate calculations. Simulation results are shown in Section 5. The Los Angeles homeless data is analyzed in Section 6. In Section 7 we discuss potential future directions. Technical proofs appear in the Supplementary Material.

Preliminaries
Assume that $n$ independent and identically distributed observations $D = \{(M_1, X_1, Y_1), \ldots, (M_n, X_n, Y_n)\}$ are collected. Here, $M$ is a missingness indicator such that $M = 1$ if $Y$ is observed, and $M = 0$ otherwise. The random vector $X$ is a covariate vector that takes its values in a compact set $\mathcal{X} \subset \mathbb{R}^d$. The random variable $Y$ is the response that takes its values in the set $\mathcal{Y} \subset \mathbb{R}$, where $\mathcal{Y}$ can be, for example, $\{-1, 1\}$ for classification problems and some compact segment of $\mathbb{R}$ for regression problems. Define $\pi(X) = P(M = 1 \mid X)$ as the propensity score.
We need the following assumption which is discussed in Tsiatis (2006).
Assumption 2.1. The missing mechanism is MAR and there is a constant $0 < c < 1/2$ such that $\inf_{x \in \mathcal{X}} \pi(x) \ge 2c > 0$.
Let P be the set of all probability measures that follow Assumption 2.1. In the following, we will focus our analysis on probability measures in P.
We now move to discuss kernel machine learning methods. Let $L : \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a loss function, where $L(Y, f(X))$ can be interpreted as the cost of predicting $Y$ by $f(X)$. We assume that $L$ is a convex and locally Lipschitz continuous loss function, such that for all $a > 0$ there exists a constant $C_L(a) \ge 0$ for which
$$\sup_{y \in \mathcal{Y}} |L(y, s) - L(y, s')| \le C_L(a)\,|s - s'|, \qquad s, s' \in [-a, a].$$
We also assume that $L(y, 0)$ is bounded and, without loss of generality, that $L(y, 0) \le 1$. Define the $L$-risk $R_{L,P}(f) \equiv E[L(Y, f(X))]$ to be the expected loss when using the function $f(X)$ as a predictor of $Y$. Define the Bayes risk as $R^*_{L,P} \equiv \inf_{f \text{ measurable}} R_{L,P}(f)$, which is the smallest possible risk. The empirical risk is defined by
$$R_{L,D}(f) \equiv \frac{1}{n} \sum_{i=1}^{n} L(Y_i, f(X_i)).$$
Let $H$ be a separable reproducing kernel Hilbert space (RKHS) of a bounded measurable kernel on $\mathcal{X}$ and denote its norm by $\|\cdot\|_H$. Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be its reproducing kernel. We assume that $k$ is a universal kernel, which means that $H$ is dense in the space of bounded continuous functions with respect to the supremum norm, and that $\|k\|_\infty \le 1$ (see Chapter 4 of Steinwart and Christmann, 2008; Hofmann et al., 2008, for details). A kernel machine $f_{D,\lambda}$ is the minimizer of the regularized empirical risk,
$$f_{D,\lambda} = \arg\min_{f \in H} \; \lambda \|f\|_H^2 + R_{L,D}(f), \qquad (2.1)$$
where the regularization term $\lambda \|f\|_H^2$ penalizes the RKHS norm of $f$. Since $L$ is a convex loss function, it can be shown (Steinwart and Christmann, 2008, Theorem 5.5 (Representer Theorem)) that there is a unique minimizer to the minimization problem (2.1). Moreover, this minimizer is of the form
$$f_{D,\lambda}(x) = \sum_{i=1}^{n} \alpha_i k(x, X_i),$$
where $\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n$ is a vector of coefficients.

Kernel machines with missing responses
In this section, we derive two types of kernel machines. The first type uses weighted-complete-cases while the second type has the doubly-robust property.

Weighted-complete-case kernel machines
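Let $\hat\pi(X) > 0$ be an estimator of the propensity score. Note that a naive complete-case estimator $R_{L,D}(f)$ for $R_{L,P}(f)$, which averages the loss over the complete cases only, is unbiased only when the expected loss among the complete cases equals $E[L(Y, f(X))]$. This is a restrictive condition which typically requires $M$ to be independent of the pair $(X, Y)$. Therefore, consistency of $R_{L,D}(f)$ to $R_{L,P}(f)$ cannot be guaranteed when using a naive complete-case estimator. Using Assumption 2.1,
$$E\left[\frac{M\,L(Y, f(X))}{\pi(X)}\right] = E\left[\frac{E[M \mid X, Y]\,L(Y, f(X))}{\pi(X)}\right] = E\left[\frac{\pi(X)\,L(Y, f(X))}{\pi(X)}\right] = R_{L,P}(f).$$
Thus, in order to avoid this bias, we propose to weight the complete cases appropriately. Let $\Pi$ be the set of conditional distributions of $M$ given $X$. Define the weighted loss function for missing response data, for $\pi \in \Pi$,
$$L_W(\pi, M, X, Y, f(X)) = \frac{M\,L(Y, f(X))}{\pi(X)}.$$
Define the weighted empirical risk as
$$R_{L_W,D}(f) = \frac{1}{n} \sum_{i=1}^{n} L_W(\hat\pi, M_i, X_i, Y_i, f(X_i)) = \frac{1}{n} \sum_{i=1}^{n} \frac{M_i\,L(Y_i, f(X_i))}{\hat\pi(X_i)}.$$
Since $L(Y, f(X))$ is a convex function, both $L_W(\pi^*, M, X, Y, f(X))$ and $L_W(\hat\pi, M, X, Y, f(X))$ are convex functions. The missing-response kernel machine is defined as
$$f^W_{D,\lambda} = \arg\min_{f \in H} \; \lambda \|f\|_H^2 + R_{L_W,D}(f).$$

Lemma 3.1. Assume that the conditional probability estimator $\hat\pi(x)$ converges to $\pi(x)$ in probability and that Assumption 2.1 holds. Then, for any given $f \in H$, the weighted empirical risk $R_{L_W,D}(f)$ is a consistent estimator for the risk $R_{L,P}(f)$.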
Note that $R_{L_W,D}(f)$ is a consistent estimator of $R_{L,P}(f)$ only under the assumption that $\hat\pi(X)$ is consistent for $\pi(X)$. Since this assumption cannot be verified, we also develop a family of doubly-robust kernel machines.
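For the quadratic loss, the weighted-complete-case kernel machine has a closed-form solution, which Appendix A.1 gives as α = (λI + WK)^{-1} W Y. The following is a minimal sketch of that computation, not the authors' implementation. It assumes a Gaussian RBF kernel and takes W = diag(M_i / (n π̂(X_i))); the 1/n scaling of W and all function names (`rbf_kernel`, `fit_wcc_kernel_machine`, `predict`) are assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF Gram matrix: k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def fit_wcc_kernel_machine(X, y, observed, pi_hat, lam=0.1, gamma=1.0):
    """Weighted-complete-case kernel machine for the quadratic loss.

    Solves (lam * I + W K) alpha = W y, where
    W = diag(M_i / (n * pi_hat_i)) is an assumed scaling of the weights.
    """
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    w = observed / (n * pi_hat)                 # zero weight when y is missing
    y_filled = np.where(observed == 1, y, 0.0)  # missing y never contributes
    # diag(w) @ K is computed row-wise as w[:, None] * K.
    return np.linalg.solve(lam * np.eye(n) + w[:, None] * K, w * y_filled)


def predict(alpha, X_train, X_new, gamma=1.0):
    """Evaluate f(x) = sum_i alpha_i k(x, X_i) (representer form)."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```

With this scaling, a case with a missing response receives weight zero, so its coefficient α_i is zero and the fitted function is spanned by the complete cases only. When every response is observed and π̂ ≡ 1, the solution coincides with ordinary kernel ridge regression.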

Doubly-robust kernel machines
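Suppose the conditional distribution of $Y$ given $X$ is $F_{Y|X}(y \mid X, \beta_0)$, where $\beta_0 \in B$ is an unknown parameter and $B$ is a parameter space. Assume that $F_{Y|X}(y \mid x, \beta)$ is continuously differentiable with respect to $\beta$ for $x \in \mathcal{X}$, $y \in \mathcal{Y}$. Let
$$H(X, \beta, f(X)) = \int_{y \in \mathcal{Y}} L(y, f(X))\, dF_{Y|X}(y \mid X, \beta)$$
denote the conditional risk given $X$ under the parameter $\beta$. Let $\hat\beta$ be an estimator of $\beta_0$, and define $\hat H(X, f(X)) = H(X, \hat\beta, f(X))$. Assume that $\hat\beta \xrightarrow{P} \beta^*$, where $\beta^* \in B$. Here, $\beta^*$ does not necessarily equal $\beta_0$. By the continuous mapping theorem, $F_{Y|X}(y \mid x, \hat\beta) \xrightarrow{P} F_{Y|X}(y \mid x, \beta^*)$. Furthermore, we assume that $H(x, \beta, f(x))$ is a continuous function of $\beta$ for every fixed $x \in \mathcal{X}$. Define the following augmented loss function
$$L_{W,H}(\pi, H, M, X, Y, f(X)) = \frac{M\,L(Y, f(X))}{\pi(X)} - \frac{M - \pi(X)}{\pi(X)}\, H(X, f(X)).$$
This function need not be nonnegative, as opposed to $L$ and $L_W$. The corresponding empirical risk is
$$R_{L_{W,\hat H},D}(f) = \frac{1}{n} \sum_{i=1}^{n} L_{W,\hat H}(\hat\pi, \hat H, M_i, X_i, Y_i, f(X_i)).$$
In order to define the doubly-robust estimator, we need both $L_{W,H}$ and $L_{W,\hat H}$ to be convex functions.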
Lemma 3.2. Let $L(Y, f(X))$ be the quadratic loss, that is, $L(Y, f(X)) = (Y - f(X))^2$. Then, $L_{W,H}$ and $L_{W,\hat H}$ are both convex functions.
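The doubly-robust kernel machine is defined as the minimizer over $f \in H$ of
$$\lambda \|f\|_H^2 + R_{L_{W,\hat H},D}(f). \qquad (3.4)$$
We assume that $\hat\pi(X)$ converges in probability to some conditional probability function $\pi^*(X)$, not necessarily the true $\pi_0(X)$. We also assume that $\hat\beta \xrightarrow{P} \beta^*$, which does not necessarily equal $\beta_0$. It follows that $\hat H(X, f(X)) = H(X, \hat\beta, f(X))$ converges in probability to $H(X, \beta^*, f(X))$. The following lemma states that if either the estimator of $\pi(X)$ or the estimator of $\beta_0$ is consistent, the doubly-robust empirical risk $R_{L_{W,\hat H},D}(f)$ is a consistent estimator of the risk $R_{L,P}(f)$.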

Estimation of the augmentation term
We present two explicit examples of estimation of the augmentation term, which is needed for the doubly-robust estimation. We limit the discussion to the quadratic loss function.

Regression
Consider the following location-shift regression model (Tsiatis, 2006, Chapter 5),
$$Y = \mu(X, \beta_0) + \varepsilon,$$
where $\varepsilon$ is the error with mean zero, independent of $X$. For example, when the model is a linear regression, $\mu(X, \beta_0) = X^T \beta_0$; when the model is a log single index model, $\mu(X, \beta_0) = \log(X^T \beta_0)$. Write
$$H(X, \beta_0, f(X)) = E\left[(Y - f(X))^2 \mid X\right] = E\left[(\mu(X, \beta_0) + \varepsilon - f(X))^2 \mid X\right] = (\mu(X, \beta_0) - f(X))^2 + E\varepsilon^2,$$
where the third equality holds because the error ε has a mean zero and is independent of X.
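Two terms need to be estimated, namely, $\beta_0$ and $E\varepsilon^2$. We first estimate $\beta_0$ by maximizing the likelihood function.
Let $F_{M,X}(m, x)$ and $F(m, x, y, \beta_0)$ denote the joint distributions of $(M, X)$ and $(M, X, Y)$, respectively. Then
$$dF(m, x, y, \beta_0) = dF_{Y|M,X}(y \mid m, x, \beta_0)\, dF_{M,X}(m, x) = dF_{Y|X}(y \mid x, \beta_0)\, dF_{M,X}(m, x),$$
where the second equality follows from Assumption 2.1.
Without loss of generality, suppose that the first $n_1$ triples $(M_i, X_i, Y_i)$ are the complete cases, and that for the last $n - n_1$ observations only the covariates $X_i$ are observed.
The likelihood function can be written as
$$\prod_{i=1}^{n_1} dF_{Y|X}(Y_i \mid X_i, \beta_0) \prod_{i=1}^{n} dF_{M,X}(M_i, X_i).$$
Note that only the first term involves $\beta_0$, and hence it is enough to maximize $\prod_{i=1}^{n_1} dF_{Y|X}(Y_i \mid X_i, \beta)$ with respect to $\beta$. Together with an estimator of $E\varepsilon^2$, the resulting estimator $\hat\beta$ yields an estimator $\hat H(X, f(X))$, which can be substituted in $R_{L_{W,\hat H},D}(f)$. Minimizing (3.4) with this $\hat H$ results in the doubly-robust kernel machines for regression problems.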
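The regression case can be sketched in a few lines under stated assumptions. The sketch below assumes the quadratic loss, an augmented loss of the usual augmented inverse-probability-weighted form M L/π̂ − ((M − π̂)/π̂) Ĥ with Ĥ(X, t) = (µ(X, β̂) − t)² plus a constant, and a linear working model with an intercept fitted by least squares on the complete cases (the Gaussian-error maximum likelihood fit). Expanding the squares under these assumptions shows that the objective equals, up to a term free of f, a ridge objective on the pseudo-response w Y + (1 − w) µ̂ with w = M/π̂, so the estimate of Eε² does not affect the minimizer. This reduction is our own calculation, and all function names are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF Gram matrix: k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def linear_working_model(X, y, observed):
    """Least-squares fit of mu(X, beta) = [1, X] beta on the complete cases.

    Returns fitted values mu_hat for every row of X (assumed working model).
    """
    design = np.column_stack([np.ones(X.shape[0]), X])
    cc = observed == 1
    beta, *_ = np.linalg.lstsq(design[cc], y[cc], rcond=None)
    return design @ beta


def dr_pseudo_response(y, observed, pi_hat, mu_hat):
    """Pseudo-response w * y + (1 - w) * mu_hat with w = M / pi_hat."""
    w = observed / pi_hat
    y_filled = np.where(observed == 1, y, 0.0)  # missing y gets weight zero
    return w * y_filled + (1.0 - w) * mu_hat


def fit_dr_kernel_machine(X, y, observed, pi_hat, mu_hat, lam=0.1, gamma=1.0):
    """Doubly-robust kernel machine for the quadratic loss (sketch).

    Solves (n * lam * I + K) alpha = pseudo-response, i.e. kernel ridge
    regression of the pseudo-response on X.
    """
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    pseudo = dr_pseudo_response(y, observed, pi_hat, mu_hat)
    return np.linalg.solve(n * lam * np.eye(n) + K, pseudo)
```

The pseudo-response makes the double robustness visible: if π̂ is correct, the weights w have conditional mean one and the µ̂ term averages out, while if µ̂ is correct, the residual Y − µ̂ has conditional mean zero whatever the weights are.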

Classification
For classification problems, where $Y \in \{-1, 1\}$, assume that the probability of $Y$ given $X$ follows a logistic model with parameter $\beta_0$. Only the term $P(Y = 1 \mid X)$ needs to be estimated. Using the same argument as before, it is enough to maximize the conditional likelihood of the complete cases, which takes the Bernoulli form since $Y$ is a binary variable that takes values in $\{-1, 1\}$. This is a standard logistic regression and the estimator of $\beta_0$ can be found using standard tools. Substituting $\hat H(X, f(X))$ into $R_{L_{W,\hat H},D}(f)$ and minimizing (3.4) results in the doubly-robust kernel machines for classification problems.
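For the quadratic loss, the augmentation term in the classification case reduces to a quadratic polynomial in the prediction t. The following short calculation is a sketch that uses only the fact that Y takes values in {−1, 1}; the symbol p(X) = P(Y = 1 | X) is introduced here for illustration.

```latex
\begin{align*}
H(X, t) &= E\bigl[(Y - t)^2 \mid X\bigr]
         = E[Y^2 \mid X] - 2t\,E[Y \mid X] + t^2 \\
        &= 1 - 2t\,\bigl(2p(X) - 1\bigr) + t^2 ,
\end{align*}
```

since Y² = 1 and E[Y | X] = 2p(X) − 1. Replacing p(X) by the fitted logistic probability gives the estimated augmentation term, which agrees with the quantity µ(X, β̂) = 2 logit(X, β̂) − 1 appearing in Appendix A.2.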
In both cases, the kernel machines are defined as

Assumptions, conditions and errors
In Section 3 we proved that for any given $f \in H$, the empirical risks of the two proposed kernel machines are consistent estimators of the risk $R_{L,P}(f)$. In this section, we prove universal consistency and derive the learning rates of the proposed kernel machines. Here, universal consistency means that when the training set is sufficiently large, the learning method produces nearly optimal decision functions with high probability for all $P \in \mathcal{P}$. Learning rates provide a framework that is more closely related to practical needs: they quantify how fast $R_{L,P}(f_{D,\lambda})$ converges to the Bayes risk $R^*_{L,P}$. The learning rate of a learning method is defined in Steinwart and Christmann (2008, Lemma 6.5).
In order to prove universal consistency, we first prove oracle inequalities. Oracle inequalities bound the finite-sample distance between the risk of the empirically obtained decision function and the risk attainable by an omniscient oracle that knows the true risk of each decision function. Before giving theoretical results for $f_{D,\lambda}$, we present the following notation and assumptions.
Assumption 4.1. The following property of the estimator π(X) holds.
Note that this assumption can always be satisfied by taking where π(X) is some estimator. Moreover, if a lower bound on the constant c in Assumption 2.1 is known, In order to show the universal consistency of the two proposed kernel machines, we need the following two conditions which depend on the choice of kernel and loss function. Both conditions can be verified.
Condition 4.2. There are constants q > 0 and r ≥ 1, such that the locally Lipschitz constant is bounded by Remark 4.1. Condition 4.1 is used to bound the entropy of the function space H. Linear, Taylor, and Gaussian RBF kernels satisfy for all p > 0, since all of them are infinitely often differentiable (Steinwart and Christmann, 2008, Section 6.4). For the hinge loss, Condition 4.2 holds with q = 0. For the quadratic loss, Condition 4.2 holds with q = 1 (Steinwart and Christmann, 2008, Section 2.2). Define ( 4.1) as the missing mechanism estimation error, and the conditional risk estimation error, respectively. Here H n is the subspace on which the minimization takes place.
For calculating the learning rates we need the following assumption, which is not needed for the consistency results. Define f P,λ = inf f ∈H λ f 2 H + R L,P (f ), and the approximation error is given by Assumption 4.2. There exist constant b and γ ∈ (0, 1] such that This assumption is used to establish learning rates.

Theoretical results of weighted-complete-case kernel machines
Recall that P is the set of all probability distributions that follow Assumption 2.1. We have the following consistency result for the weighted-complete-case kernel machines.
where 0 < λ n < 1. Then, the weighted-complete-case kernel machine is P-universally consistent. In other words, R L,P f W D,λ P −→ R * L,P for all P ∈ P. The proof of Theorem 4.1 is based on an oracle inequality derived for weighted-complete-case kernel machines and can be found in the Supplementary Material (see Theorem C.1). Note that when L is the quadratic loss, the kernel k is Gaussian, and d ≡ 0, then λ n should be chosen such that λ n n 1 2 − → ∞ for an arbitrary small > 0.
Next we derive the learning rate of this learning method.
. Then, the learning rate of the weighted-complete-case kernel-machine learning method is Note that when L is the quadratic loss, the kernel k is Gaussian, and d ≡ 0, the learning rate is n γ 2γ+2 − for an arbitrary small > 0.

Theoretical results of the doubly-robust kernel machines
Before giving the theoretical results of the doubly-robust kernel machines, we discuss some convergence orders related to this learning method. Let P n f = 1 n n i=1 f (X i ) be the empirical measure on sample value X 1 , . . . , X n . Define where H n is the subspace on which the minimization takes place. Note that a n is the mean of i.i.d. bounded random variables and hence a n = O p n − 1 2 . However, unlike a n , the term h n is a supremum of a random process over of set of functions f ∈ H n , where H n is a space that grows with n. We discuss the functional space H n in the proof of the following lemma.
, where d appears in Assumption 4.1.
The previous three convergence orders of a n , h n and Err 2,n are used to prove the universal consistency and derive the learning rate of the doubly-robust kernel machine learning method. The following theorem describes the universal consistency property for doubly robust kernel machines.
Theorem 4.2. Let Assumptions 2.1 and 4.1 hold. Let the loss function L be a quadratic loss. Assume that either |π(X) − π(X)| = O p n − 1 2 or β − β 0 = O p n − 1 2 . Choose 0 < λ n < 1, such that λ n −→ 0 and The proof of Theorem 4.2 is based on an oracle inequality derived for doubly-robust kernel machines and can be found in the Supplementary Material (see Theorem C.2). Note that when the kernel k is Gaussian, and d ≡ 0, then λ n should be chosen such that λ n n 1 2 − → ∞ for an arbitrary small > 0. Finally, we derive the learning rate based on previous results. Note that when the kernel k is Gaussian, and d ≡ 0, the learning rate is n γ 2γ+2 − , > 0 is arbitrary small.

Simulation study
We conducted a simulation study to evaluate the finite-sample performance of the proposed kernel methods for both regression and classification. We compare the proposed methods with the following three existing methods.
Reg: The linear regression method which uses only complete cases.
SSL: The semi-supervised linear regression method of Azriel et al. (2016), which takes into account the missing responses.
CC: The naive kernel machines which use only the complete observations.
For the proposed kernel machines, we consider the following six different settings.
WCC-M: Weighted-complete-case kernel machines with a misspecified missing mechanism, which is estimated by a generalized linear model with the probit link function.
WCC-C: Weighted-complete-case kernel machines with a correctly specified missing mechanism, namely a generalized linear model with the logit link function.
DR-M: Doubly-robust kernel machines with a misspecified missing mechanism and a misspecified regression model.
DR-MR: Doubly-robust kernel machines with a misspecified regression model but a correctly specified missing mechanism.
DR-MM: Doubly-robust kernel machines with a correctly specified regression model and a misspecified missing mechanism.
DRC: Doubly-robust kernel machines with a correctly specified regression model and a correctly specified missing mechanism.
We consider four data-generating mechanisms. The first setting is a toy example that shows the price of ignoring the missing responses. Specifically, in this example, both the density and the missing rate increase with the first covariate $X$. Thus, ignoring the missingness yields an estimator which is based mainly on the smaller values of $X$. However, since the response is a nonlinear curve in $X$, this may lead to biased estimation. The model is given by $Y = \exp(X) + U_2 + U_3 + U_4 + U_5 + \varepsilon$, where $X \sim 4 \cdot \mathrm{Beta}(5, 3)$, $U_2, \ldots, U_5$ are independent uniform variables on $[0, 4]$, and $\varepsilon$ is a standard normal variable. Under the missing mechanism, observations with $X$ in $[0, 2]$ are less likely to be generated than those with $X$ in $(2, 4]$ but are more likely to be observed. For $X$ in the segment $[0, 2]$, the missing rate is about 22%, while it is about 77% for $X$ in the segment $(2, 4]$. The overall missing rate is about 64%; see Figure 1.
Setting 2 is a classification setting considered by Laber and Murphy (2011). Data are generated as $Y = \mathrm{sign}\left(X_2 - \frac{4}{25} X_1^2 - 1 + \varepsilon\right)$, where $X_1$ and $X_2$ are independent uniform random variables on $[0, 5]$ and $\varepsilon$ is a normal variable with mean 0 and standard deviation $\frac{1}{2}$. The missing mechanism is $P(M = 1 \mid X) = \frac{\exp\{\frac{3}{2}(X_2 - X_1)\}}{1 + \exp\{\frac{3}{2}(X_2 - X_1)\}}$. For $Y = 1$, the missing rate is about 20%, while it is about 84% for $Y = -1$, which means that positive labels are more easily observed. The overall missing rate is about 50%.
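The data-generating mechanism of Setting 2 is fully specified in the text and can be sketched as follows. This is an illustrative sketch rather than the simulation code used for the reported results; the function name `generate_setting2` and the seed handling are assumptions.

```python
import numpy as np


def generate_setting2(n, seed=0):
    """Draw n observations from the classification setting (Setting 2).

    Returns covariates X, labels y in {-1, 1}, the observation indicator
    M (1 = response observed), and the true propensity P(M = 1 | X).
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 5.0, size=(n, 2))       # X1, X2 ~ Uniform[0, 5]
    eps = rng.normal(0.0, 0.5, size=n)           # sd 1/2
    latent = X[:, 1] - (4.0 / 25.0) * X[:, 0] ** 2 - 1.0 + eps
    y = np.where(latent >= 0.0, 1, -1)
    # Logistic missing mechanism in X2 - X1 with slope 3/2.
    pi = 1.0 / (1.0 + np.exp(-1.5 * (X[:, 1] - X[:, 0])))
    observed = rng.binomial(1, pi)
    return X, y, observed, pi
```

Because the mechanism depends on the covariates only, the data are MAR by construction, yet the complete cases over-represent positive labels, which is the source of the complete-case bias in this setting.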
The last two settings are taken from examples in Liu et al. (2007). These two settings are motivated by prostate-specific antigen (PSA), which is routinely used as a biomarker for prostate cancer screening. Liu et al. (2007) studied the genetic pathway effect on PSA and used least-squares kernel machines to model the genetic pathway effect. Consider a generic regression model $Y = Z + h(X_1, \ldots, X_p) + \varepsilon$, where $X_1, \ldots, X_p$ are independent uniform variables on $[0, 1]$, $Z = 3\cos(X_1) + 2U$, where $U$ is also a uniform random variable on $[0, 1]$, $h(\cdot)$ is a centered smooth function, and $\varepsilon$ is an independent standard normal random variable. In Setting 3, $p = 5$ and $h(X_1, \ldots, X_5) = 10\cos(X_1) - 15X_2^2 + 10\exp(-X_3)X_4 - 8\sin(X_5)\cos(X_3) + 20X_1X_5$. The missing mechanism is given by Xi 5 .
To summarize, the simulations show that the doubly-robust kernel machine methods perform better, in general, than the other existing methods. When the regression model for doubly-robust kernel machines is correctly specified, the doubly-robust kernel machine methods are recommended. Additionally, if the missing mechanism is correctly estimated, the doubly-robust estimator is the best choice. When little information about the regression model is given, the weighted-complete-case kernel machine is another good choice, especially for large sample size datasets.

Application to the Los Angeles homeless population
We applied the proposed kernel-machine methods to the Los Angeles homeless dataset. The dataset is described in Kriegler and Berk (2010) and covers census tracts in Los Angeles County, where the goal is to estimate the number of homeless in each tract. Due to budget limitations, some tracts were not visited, and consequently the number of homeless in these tracts is missing. The missing mechanism depends on the Service Provision Area (SPA) to which the tract belongs. We use this dataset to compare the performance of the methods mentioned above.
[Figure 3: Boxplots of the variables after log transformation and normalization, including Median House Income. The last two boxplots are of the covariates Residential and PctMinority after normalization.]
Following Azriel et al. (2016), we first delete all tracts with zero median household income and the highly populated tracts, leaving 1797 tracts in the dataset. We also used the same covariate sets as in Azriel et al. (2016). To evaluate the performance of the different methods, we randomly choose 1597 tracts to train the algorithms and then use the remaining 200 tracts for testing. The risk is calculated by the mean square error (MSE) and the weighted mean square error, where the weights are the inverse probabilities of the tracts being visited. We repeated this process 100 times.
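The two risk measures can be sketched as below. This is an illustration only: the text does not state how the weighted MSE is normalized, so dividing by the sum of the weights is an assumption, as are the function names. Both measures are meant to be evaluated on test tracts whose responses were observed.

```python
import numpy as np


def mse(y_true, y_pred):
    """Plain mean square error over the evaluated tracts."""
    return float(np.mean((y_true - y_pred) ** 2))


def weighted_mse(y_true, y_pred, visit_prob):
    """Inverse-probability-weighted mean square error.

    Each tract is weighted by 1 / visit_prob; normalizing by the sum of
    the weights is an assumed convention.
    """
    w = 1.0 / visit_prob
    return float(np.sum(w * (y_true - y_pred) ** 2) / np.sum(w))
```

With equal visit probabilities the two measures coincide; otherwise the weighted version gives more influence to tracts from areas that were rarely visited.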
Since the data are skewed, we first took the log transformation of the observed response and the covariates "Industrial", "PctVacant", "Commercial", and "MedianHouseIncome". No transformation was done for the covariates "Residential" and "PctMinority". We normalized the data. Boxplots of the data after transformation and normalization are shown in Figure 3.
We considered the Reg, SSL, CC, WCC and DR methods described above. For the kernel machine methods we used the RBF kernel. In particular, we used the semi-supervised linear regression method to estimate the responses in the doubly-robust kernel machine method. Table 1 provides the numerical results of the five methods. Overall the methods perform similarly, while the weighted-complete-case kernel machines perform best for both the MSE and the weighted MSE, with the lowest mean, median and standard deviation of the two kinds of risk. In this example, the doubly-robust kernel machine does not perform well, and this could be related to the performance of the semi-supervised linear regression method used in the augmentation term.

Conclusion and discussion
We proposed two kernel-machine methods for handling the missing-response problem. Specifically, we proposed an inverse-probability complete-case estimator which can be applied to any convex loss function. We also proposed a quadratic-loss-based doubly-robust estimator. The empirical risks of these new data-dependent loss functions were shown to be consistent for any function h ∈ H under mild conditions. We presented oracle inequalities and consistency results for both types of kernel machines. We also presented a simulation study and applied these new methods to the Los Angeles homeless dataset.
Several open questions remain and many possible generalizations still exist, especially for the doubly-robust estimator. We would like to extend the quadratic-loss-based doubly-robust estimator to include other convex loss functions. Additionally, we would like to develop a new data-dependent loss function for handling missing covariates while guaranteeing the doubly-robust property at the same time. This work is in progress, where we use imputation methods to define the augmentation term of a doubly-robust estimator.
Appendix A: Calculation in Subsection 3.4

A.1. Weighted-complete-case kernel machines
Let $g(\alpha)$ be the objective function written in terms of $\alpha$, where $\alpha$, $A$ and $W$ are defined as in Section 3.4. Taking the derivative and equating to zero, we have $\alpha = (\lambda I + WK)^{-1} W Y$.
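The closed form above can be checked with a short calculation. The sketch below assumes the quadratic loss, the representer form f = Kα with Gram matrix K, and W = diag(M_i / (n π̂(X_i))). Since the definitions of Section 3.4 are not reproduced here, this choice of W, including the 1/n scaling, is an assumption made for illustration.

```latex
\begin{align*}
g(\alpha) &= \lambda\,\alpha^{\top} K \alpha + (Y - K\alpha)^{\top} W (Y - K\alpha),\\
\nabla g(\alpha) &= 2\lambda K\alpha - 2 K W (Y - K\alpha)
  = 2K\bigl[(\lambda I + WK)\alpha - WY\bigr].
\end{align*}
```

The gradient therefore vanishes whenever (λI + WK)α = WY. Entries of Y that correspond to missing responses are multiplied by a zero weight, so their values do not enter the solution.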

A.2. Doubly-robust kernel machines
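For regression, we have does not depend on α. Consequently, g(α) has the same minimizer of where $\mu(X, \hat\beta)$, $W_1$ and $W_2$ are defined as in Subsection 3.4, Taking the derivative and equating to zero, Next we will derive the kernel machines for the classification problem. Recall that in classification In this situation, we need to minimize Using the same technique as previously, where $\mu(X, \hat\beta) = 2\,\mathrm{logit}(X, \hat\beta) - 1$.

Appendix B: Tables of simulations
Table 2: Descriptive statistics of the four settings: the median, mean, and standard deviation.

Proof of Lemma 3.2. Let $L(y, t) = (y - t)^2$. We now prove that $L_{W,\hat H}(\hat\pi, \hat H, M, X, Y, f(X))$ is convex. The same argument can be used for $L_{W,H}$.
Recall that $\hat H(X, t) = \int_{y \in \mathcal{Y}} L(y, t)\, dF_{Y|X}(y \mid X, \hat\beta)$.
We first show that for every convex loss $L$, $\hat H(X, t)$ is convex. For any $\alpha \in (0, 1)$, by the convexity of $L(y, t)$ with respect to $t$,
$$\hat H(X, \alpha t_1 + (1 - \alpha) t_2) = \int_{y \in \mathcal{Y}} L(y, \alpha t_1 + (1 - \alpha) t_2)\, dF_{Y|X}(y \mid X, \hat\beta) \le \alpha \hat H(X, t_1) + (1 - \alpha) \hat H(X, t_2),$$
which indicates that $\hat H(X, t)$ is a convex function with respect to $t$.
Therefore, when $M = 0$, $L_{W,\hat H}(\hat\pi, \hat H, 0, X, Y, t) = \hat H(X, t)$, which is a convex function for any loss $L$. When $M = 1$ and $L$ is the quadratic loss,
$$\hat H(X, t) = \int_{y \in \mathcal{Y}} (y - t)^2\, dF_{Y|X}(y \mid X, \hat\beta) = V(X, \hat\beta) - 2t\,U(X, \hat\beta) + t^2,$$
where $U(X, \hat\beta) = \int_{y \in \mathcal{Y}} y\, dF_{Y|X}(y \mid X, \hat\beta)$ and $V(X, \hat\beta) = \int_{y \in \mathcal{Y}} y^2\, dF_{Y|X}(y \mid X, \hat\beta)$. Note that $U$ and $V$ are not functions of $t$. Hence, for $M = 1$,
$$L_{W,\hat H}(\hat\pi, \hat H, 1, X, Y, t) = \frac{(Y - t)^2}{\hat\pi(X)} - \frac{1 - \hat\pi(X)}{\hat\pi(X)} \left( V(X, \hat\beta) - 2t\,U(X, \hat\beta) + t^2 \right), \qquad \frac{\partial^2}{\partial t^2} L_{W,\hat H} = \frac{2}{\hat\pi(X)} - \frac{2(1 - \hat\pi(X))}{\hat\pi(X)} = 2.$$
Since the second derivative with respect to $t$ is positive, $L_{W,\hat H}(\hat\pi, \hat H, M, X, Y, t)$ is convex with respect to $t$.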
By the Law of Large Number (LLN), we have Note that The third equality holds because M and Y are independent given X. As a conclusion, we have, R L W , H ,D (f ) P −→ R L,P (f ).

C.3. Oracle Inequality for Weighted-Complete-Case Kernel Machines
Theorem C.1. Let Assumptions 2.1 and 4.1 hold. Then, for fixed λ > 0, n ≥ 1, ε > 0, and η > 0, with probability not less than $1 - e^{-\eta}$,
Proof. By the definition of $f^W_{D,\lambda}$, Hence, where the inequality follows from (C.1). Note that by (3.1), where the second equality holds by conditional expectation and the third equality holds for the MAR missing mechanism. Hence, We first bound expressions $A_n$ and $B_n$. Note that $L(y, 0) \le 1$ for all $y \in \mathcal{Y}$. By Assumption 4.1, where $c$ is defined in Assumption 2.1 and $C_L(\cdot)$ is a Lipschitz constant defined in Section 2.
Using (C.10) we can bound A n and B n of (C.9) Using the similar argument as (Steinwart and Christmann, 2008, Theorem 6.25) for any η > 0, we have where the last inequality is from Hoeffding's inequality (Steinwart and Christmann, 2008, Theorem 6.10).

C.7. Proof of Lemma 4.1
Proof. Define X i (f ) = L(X i , Y i , f (X i )) − H(X i , β 0 , f (X i )) and let X i (f ).
Since f ∞ ≤ f H , the space H n over which the supremum h n is taken is contained in (c 2,n λ) − 1 2 B H . By (C.8), X i (f ) ∞ ≤ 2 r (c 2,n λ) −1 + 1 . Using the functional Hoeffding's inequality (Berestycki et al., 2009, Section 6.5), where K u is a universal constant and C is any constant.
Let C = C Proof. Note that for every f ,

C.9. Proof of Theorem 4.2
Proof. In the proof of Theorem C.2, where A n , B n , C n , and D n are the same as defined in (C.9). For A n + B n , we have the same result as (C.11).
Next we bound C n and D n in the two different situations of (i) and (ii).