Learning Semiparametric Regression with Missing Covariates Using Gaussian Process Models

Missing data often appear as a practical problem while applying classical models in the statistical analysis. In this paper, we consider a semiparametric regression model in the presence of missing covariates for nonparametric components under a Bayesian framework. As it is known that Gaussian processes are a popular tool in nonparametric regression because of their flexibility and the fact that much of the ensuing computation is parametric Gaussian computation. However, in the absence of covariates, the most frequently used covariance functions of a Gaussian process will not be well defined. We propose an imputation method to solve this issue and perform our analysis using Bayesian inference, where we specify the objective priors on the parameters of Gaussian process models. Several simulations are conducted to illustrate effectiveness of our proposed method and further, our method is exemplified via two real datasets, one through Langmuir equation, commonly used in pharmacokinetic models, and another through Auto-mpg data taken from the StatLib library. MSC 2010 subject classifications: Primary 60K35, 60K35; secondary 60K35.


Introduction
In nonparametric regression, the objective is to find relationships between response and covariates without assuming the parametric form of a regression function. Nonparametric regression is a rapidly growing and exciting field. It offers a more flexible way to model the effects of covariates on the response compared to parametric models, which often have more restrictive conditions on the mean function. Many competing methods are available for nonparametric regression, including kernel-based methods, regression splines, smoothing splines, and wavelet and Fourier series expansions. When both responses and covariates are fully observed, the relevant theories and methods are well developed as described in Takezawa (2005). But a drawback of nonparametric regression models lies in its ability of interpretability in contrast to parametric regression models. Thus, various efforts have been addressed on semiparametric models, which balance the interpretation of parametric models and flexibility of nonparametric models. However, there is limited literature on either nonparametric or semiparametric models for missing covariates appearing in the nonparametric components of a regression.
For missing data, there are three basic classifications. If missingness does not depend on either observed or missing values, the data are called missing completely at random (MCAR). While the assumption of missing at random (MAR) is that missingness depends only on the observed values. The MAR is less restrictive than the MCAR. A much more relaxed assumption is missing not at random (MNAR), where the missingness depends on the data that are missing. A compressively review of general parametric statistical inferences with missing data has been discussed in Little and Rubin (2002).
When we come up with nonparametric modeling, one common approach is splines, such as using basis function representations for the mean function (e.g., Denison (2002)). Yau and Kohn (2003) used thin plate splines to allow the mean and variance to change with covariates. In certain applications, the structure may be overly restrictive due to the specific splines used in the model. However, model estimation using regression splines become more challenging when covariates have missingness. Faes et al. (2011) developed a nonparametric model based on spline basis functions, where covariates are missing. They carried out inference using variational Bayes approximations (cf., Beal (2003)) and showed that in the case of missing covariates, variational Bayes approximations produce multimodality in the posterior distributions where the one-to-one mapping does not exist for the unknown function.
In the Bayesian framework, nonparametric regression (or nonparametric classification) problems become the elicitation of the suitable priors on the mean function. Dirichlet process models are very popular methods for Bayesian nonparametric. Wang et al. (2010) developed a classification model to handle incomplete inputs, where they extended the finite Quadratically Gated Mixture of Experts (QGME) developed by Liao et al. (2007) to an infinite QGME via a Dirichlet process prior. Since the Markov chain Monte Carlo (MCMC) based analysis for this model suffers from huge computational costs, Wang et al. (2010) implemented approximate inference via the variational Bayesian approach. Recently, Zhang et al. (2016) proposed an infinite Dirichlet process mixture model to solve unsupervised learning for clustering with missing data. They assumed missing data as latent variables and obtained their posterior distributions using the variational Bayesian expectation maximization algorithm. Often, the computation burden is heavy on all these Dirichlet Process models above. Hence, the inference is carried out using approximate methods like variational Bayes and others. Moreover, the current literature for missing predictors in Dirichlet process models is only focused on clustering problems other than regression.
Gaussian process (GP) models are acknowledged as another popular tool for nonparametric regression. The usage of GP models has been widespread in spatial models, in the analysis of computer experiments and time series, in machine learning and so on (Rasmussen and Williams (2006)). For the properties of GP models, one can refer to Adler (1990), Van Der Vaart and Wellner (1996), Rasmussen and Williams (2006) and Cramér and Leadbetter (2013). Further, with normality assumption on the residuals, Choi and Schervish (2007) have shown assigning GP priors to the unknown regression function would lead to a consistent estimator for the regression function. However, for GP models, the case of missing inputs has received little attention, due to the challenge of propagating the input uncertainty through the nonlinear GP mapping. Only recently, there appear several studies focusing on GP models with inputs subject to some measurement uncertainty (Girard and Murray-Smith (2003), Quiñonero-Candela and Roweis (2003) and Damianou and Lawrence (2015)). They often developed a two-stage procedure for estimating such GP models either using variational Bayesian methods or expectation maximization procedures, wherein the first stage, they estimated the model parameters only for complete cases and then in the second step, they alternately updated model parameters and adjusted estimates of missing input points. However, the situation to deal with noisy inputs due to measurement uncertainty will be quite different from the situation where the inputs are completely missing. Therefore, in this paper, we consider the scenario when an input of GP models is subject to MCAR or MAR as for the purpose to fill in the gap of missing data for GP models in the literature. To avoid the risk of introducing modeling biases in parametric regression models as well as the existing drawbacks of nonparametric regression models (such as the difficulty of interpretation and lack of extrapolation capability), we will consider semiparametric regression models in our study. Specifically, we will use the partially linear model, the most commonly used semiparametric regression model (cf., Engle et al. (1986), Ruppert et al. (2003), Härdle and Liang (2007) and references therein). A GP prior will be assigned to nonparametric components for this semiparametric regression model. Further, we will impute the missing covariate of the nonparametric component via a Bayesian hierarchical model, which will be a key for us to recover the covariance function of the GP prior.
To complete the prior specification of GP models, we need to elicit the priors on the hyperparameters of a GP. Those hyperparameters often control the smoothness and variation of a GP. However, it is often difficult to specify subjective information over hyperparameters of a GP model. Thus, we will consider using noninformative priors. In Berger et al. (2001), they mentioned assigning noninformative priors such as commonly used constant priors and independent Jeffrey's priors for hyperparameters of a GP both fail to yield proper posteriors. Instead, they recommended the 'exact' reference prior for GP models when there is no white noise. Ren et al. (2012) extended the 'exact' reference prior in the case when we have white noises for the responses and showed the posterior is propriety under such prior. We further prove that under some mild conditions, the posterior propriety of the GP under the 'exact' reference prior will still hold in the presence of ignorable missing covariates. In addition, we have conducted a simulation study to compare the results from the 'exact' reference prior with certain weakly informative priors applied to hyperparameters of a GP.
The format of the paper is organized as below. In Section 2, we outline the setting of semiparametric regressions in a Bayesian hierarchical modeling framework. Section 3 will focus on the discussion of sampling methods to estimate model parameters and deriving posterior predictive distribution. In addition, we show the posterior propriety of our model under the "exact" reference prior for GP hyperparameters. Then, in Section 4, we perform several simulation studies to validate our proposed method and compare with existing methods. We present two real-world applications in Section 5 to show the advantage of our proposed model over some competitive models. Finally, in Section 6, we draw the conclusion and point out some future direction.

Semiparametric Regression Models with Ignorable Missing Covariates
The task of finding a good function estimation from a given dataset receives a lot of attention not only in the statistics literature but also in the neural network and machine learning communities. One of the popular approaches for Bayesian nonparametric regression is using a GP prior in modeling the unknown underlying function with nonlinear and nonparametric structures. GP model admits a much richer latent structure than that of a parametric model, where the latter one restricts to certain fixed parametric structure. Thus, the GP model will potentially better approximate the true response function. In this section, we are going to propose our semiparametric regression model in a Bayesian framework to handle missing data, where we will use a GP model to estimate the nonparametric component.
The semiparametric regression model that we consider is given by for i = 1, 2, · · · , n. Here, β = [β 0 , · · · , β p ] is a q × 1 vector of coefficients for fully observed covariates z i = [1, z i1 , · · · , z ip ] and further, define Z = [z 1 , · · · , z n ] ∈ n×q , where q = p + 1. We assume that p << n and denote y = [y 1 , · · · , y n ] . Here, g(·) is the unknown function, x i 's ∈ are the observed inputs (subject to missing) and i 's are random errors. Without loss of generality, we assume i In the absence of covariates z i 's, our model is reduced to a nonparametric model To estimate unknown function g(·), we are going to introduce a GP prior on g(·). We will consider a zero mean GP to avoid confounding of the mean parameter of a GP prior with coefficients β in Model (2.1). Let us denote g(·) ∼ GP (0, σ 2 z k(·, · | )), where k(·, · | ) is the correlation function, and σ 2 z and are hyperparameters of the GP. Then, given any finite n distinct inputs x 1 , · · · , x n ∈ , [g(x 1 ), · · · , g(x n )] will follow a multivariate Gaussian distribution with zero mean vector and covariance matrix Σ, with (i, j)th entry of Σ, i.e., (Σ) ij = σ 2 z k ij = σ 2 z k(x i , x j | ), for i, j = 1, · · · , n. In this paper, we only considered isotropic correlation kernel, that is, for some isotropic correlation function Ψ and ||.|| denote Euclidean distance. A common choice of isotropic correlation functions is the squared exponential kernel (also known as Gaussian kernel), that is, where the scaling parameter σ 2 z controls the variation of the response surface and the length-scale parameter guides the smoothness of sample paths. In Subsection 4.2, we will also discuss about other power exponential correlation and Matérn class of correlation functions.
Let us consider the input x i 's in Model (2.1) are subject to missing. This may happen because respondents in a survey refuse to fill in certain items, or recorders fail to observe an input due to unknown mistakes in an experimental process or others. Denote x = [x 1 , · · · , x n ] and presume that x ∼ f (x | ω), where ω are some unknown parameters. Without loss of generality, we can partition x into x mis , x obs , where x obs collects the fully observed values, while x mis denotes the observations absent of x values. Suppose m out of n covariates x i 's are missing, i.e., x mis ∈ m and x obs ∈ n−m . For imputation of missing covariates under Bayesian framework, we need a probabilistic model to model the missing x i 's. For i = 1, · · · , n, let R i be a binary random variable with success probability π i and use R i to indicate whether x i is observed or not (R i = 1 if x i is missing and 0 otherwise). Then, we define R = [R 1 , · · · , R n ] as a n × 1 vector of missingness indicator. Here, we consider the following two missingness mechanisms: (1) π i = P (R i = 1 | y i , x i , z i ) = p for some constant 0 < p < 1. In this case, x i 's are missing completely at random (MCAR) and the missingness mechanism is independent of the data. ( With these specified missingness mechanisms, our goal is to make the statistical inference for Model (2.1). We estimate parameters , σ 2 z , σ 2 , and β based on marginal likelihood, where we integrate out the unknown function g(·) in the likelihood, i.e., f (y | x, Z, , σ 2 z , σ 2 , β) = N n (Zβ, σ 2 z G). Here, N n (·, ·) indicates a n-dimensional multivariate normal distribution with Zβ being its mean and σ 2 z G being its covariance, where G = ηI n + K, η = σ 2 /σ 2 z is the variance component of the noise-to-signal ratio, and K is n × n isotropic correlation matrix with (i, j)th entry being k ij and depending only on . Throughout the paper, we will interchange the usage of the notation K and K( ) to represent the correlation matrix of a GP whenever it is necessary. To simplify the notation, define Θ = ( , σ 2 z , η, β ) . Given the observed data D = {R, y, x obs , Z}, the likelihood of Θ, ω for Model (2.1) is: Under the two specified missingness mechanisms, P (R i | y i , x i , z i ) will not have any effect on estimation of parameters Θ and imputation values of missing x mis . Thus, when we derive the posterior distribution of parameters Θ and missing values x mis , we can ignore the first term on the right side of the likelihood (2.3). Further, if we assign a prior on ω as π(ω) then we can integrate out nuisance hyperparameters ω in Equation (2.3). Define π(x) = ω f (x | ω)π(ω)dω as the marginal prior on x and factorize π(x) = π(x mis | x obs )π(x obs ). Then, given the data D, the likelihood of Θ is: (2.4) To utilize Bayesian methods to perform the inference on Model (2.1), we need to specify priors on the unknown parameters Θ. One common approach is to use proper priors on Θ, assigning subjective priors or abstracting information from previous data. One advantage of proper priors is that they can always achieve posterior propriety. However, the subjective elicitation of GP hyperparameters (i.e., , η and σ 2 z ) is difficult due to the hard interpretation of their meanings in practice. Therefore, we resort to specify the priors of GP hyperparameters noninformatively. But Berger et al. (2001) showed that the conventional noninformative priors lead to improper posterior. Thus, they derived an exact reference prior under the case without the noise (i.e., σ 2 = 0 in our case). Ren et al. (2012) further examined the effect of noise and derived the "exact" reference prior under the situation σ 2 = 0. In this paper, we aim to extend the posterior propriety of this reference prior in the case of missing data for the GP models.

Posterior Propriety and Posterior Inference
In the Section 3.1, we discuss the posterior propriety with the "exact" reference prior. Then, in Section 3.2, we specify MCMC procedure to carry out Bayesian inference of parameters in Model (2.1). Section 3.3 will be discussed about how to estimate new observations from our proposed model.

Posterior Propriety with the 'Exact' Reference Prior
In this subsection, we aim to prove the posterior propriety of our GP models with the 'exact' reference prior under the situation when the inputs of GP models are missing. Following the discussion of Ren et al. (2012), the 'exact' reference priors of ( , η, σ 2 z ) are based on their Fisher information matrix, which is derived from integrating β out using a flat prior in the likelihood of Θ below provided that x mis is known, (3.1) Here, L * (·) with a subscript ' * ' indicating that x is fully observed in this expression. The Fisher information matrix derived from the integrated likelihood of ( , η, σ 2 z ) in (3.1) is given by is the notation for trace and ∂K/∂l indicates the first-order partial derivative of K with respect to . Applying the derivation of the 'exact' reference prior from Ren et al. (2012), a noninformative prior for Θ is Then, to show the posterior propriety of Θ using the "exact" reference prior (3.3) under the missing data framework for our Model (2.1), we only need to prove the integration of the joint posterior distributions of Θ and x mis below is finite over the domain of Θ and x mis . Here, in (3.4), π(x mis | x obs ) is the prior for x mis given the observed x obs , which depends on the marginal distribution of π(x).
To verify the propriety of the joint posterior (3.4), first, let us integrate out β and σ 2 z from this joint distribution, which yields presence of ignorable missingness in covariates, to show the joint posterior distribution of (Θ, x mis ) is proper, we only need to verify that Using the Condition A1 to Condition A4 in Supplementary S.1 (Bishoyi et al., 2019), Ren et al. (2012) have proved that the integrated likelihood L * * (x mis ) is finite, that is: for a given x mis . Thus, L * * (x mis ) is bounded. Then, this is equivalent to prove that where C is some bounded constant. Therefore, if we add additional condition below, then (3.6) will be finite.
The Condition A5 is easy to achieve. For example, if we specify a proper prior on x, often the conditional distribution of x mis given x obs will be proper as well. Without loss of generality, for the discussion throughout the paper, we will assume the covariates x ) and further, presume the prior of hyperparameter μ x and σ 2 x to be π(μ x , σ 2 x ) ∝ 1/σ 2 x . Integrating out μ x and σ 2 x , the conditional marginal prior for π(x mis | x obs ) follows a multivariate t distribution (see details in Equation (S.1) and its derivation in Supplementary S.2 (Bishoyi et al., 2019)). Moreover, in Subsection 4.3, we analyze the sensitivity of the prior chosen on the missing covariates x i 's that satisfies Condition A5.
From our discussion, using Conditions A1-A4 in Supplementary S.1 (Bishoyi et al., 2019) with additional Condition A5, we can easily establish the posterior propriety of (Θ, x mis ) in Model (2.1). The Condition A1 ensures that the correlation function will decrease to zero as the distance between two points goes to infinity, while the Condition A2 ensures → ∞, a Taylor expansion of the correlation function will follow. The commonly used correlation matrix of the GP model such as the power exponential kernel, Matérn kernel, spherical kernel, rational quadratic kernel, and other isotropic kernels will often automatically satisfy the Conditions A1 and A2.

Bayesian Computation and Sampling Schemes
Since the joint posterior distribution of (Θ, x mis ) in (3.4) is proper, we will rely on this joint posterior to make inference for our proposed Model (2.1) when some of input x i 's are missing.
However, this joint posterior does not have a closed form, thus, we shall resort to MCMC sampling scheme to draw samples of unknown parameters to make an inference. There are two key steps in developing the MCMC scheme. First, we draw the missing values x mis provided that Θ is known and treat the values drawn for x mis as their imputed values. Second, we sample Θ based on observed x obs and imputed x mis . These alternative iterations create a Markov chain that eventually stabilizes to the joint posterior distribution of parameters and missing covariates in (3.4). The detailed steps of MCMC schemes are described below.
Step 1 : draw x mis from its posterior conditional distribution: where the term I * ( , η, 1 | x mis , x obs ) is defined in Equation (3.3) and π(x mis | x obs ) is derived in Equation (S.1). Since the conditional posterior distribution of x mis do not have a closed form, we use a Metropolis-Hastings algorithm to impute values of x mis .
Step 2 : Given the imputed values x mis , we sample the value of using the posterior conditional distribution given by: The posterior conditional distribution of is not closed form, thus we use slice sampling (cf., Neal (2003)) to draw samples of from its posterior conditional distribution.
Step 3 : Provided that x mis and are known, we sample η from its posterior conditional distribution: Also, the posterior conditional distribution of η does not have a closed form and we will use the slice sampling algorithm to draw samples of η from its posterior conditional distribution.
Step 4 : When x mis , and η are known, we draw σ 2 z from its posterior conditional distribution: where IG indicates an inverse gamma distribution with the shape parameter (n − q)/2 and the rate parameter S 2 /2.
Step 5 : Given x mis , , η and σ 2 z , then we can sample Once we give the initial values for , η, β, σ 2 z , and x mis , then the Bayesian computation is done by running MCMC algorithms from Step 1 through Step 5 until the MCMC chain has converged. To evaluate the convergence of the MCMC chains, we run the MCMC chains with 10 different starting values for the unknown parameters. The Gelman-Rubin potential scale reduction factor (cf., Brooks and Gelman (1998)) is found to be very close to 1 at most after 25,000 iterations of MCMC runs in our simulations and examples for all unknown parameters in the model. We also evaluate the convergence by informally looking at trace plots and we find the MCMC chains are mixing well after 25,000 iterations in our simulations and examples. After MCMC samples are converged, the statistical inferences are straightforward by utilizing the MCMC samples. For example, a posterior median estimate and 95% credible interval for the unknown function g(·) can be formed from the median, 2.5%, and 97.5% empirical quantiles of the corresponding MCMC realizations, respectively.

Posterior Predictive Distribution
In Subsection 3.2, we have developed a MCMC algorithm to impute the missing covariates under ignorable missing mechanism as well as to estimate the unknown parameters in Model (2.1) simultaneously. However, often in the study, one of our goals is to predict responses using Model (2.1) when new observations of covariates come; while, another purpose might be using future observations to assess the performance of our proposed models in comparison to other competitive models. For these reasons, in this subsection, we are going to derive the posterior predictive distribution of y new when we observe new covariates in Model (2.1).

Let us presume that the n observations
where π(Θ, x mis | D) is the joint posterior distribution of (Θ, x mis ) derived in (3.4) and , where x i is a training point for i = 1, · · · , n and x test j is a test point for j = 1, · · · , t.
Let M be a total number of iterations for MCMC samples after burn-in period. Then, to generate a random sample y test from its posterior predictive distribution in (3.7), it involves two major iterative steps, that is, for i = 1, · · · , M, Step 1 Draw (Θ (i) , (x mis ) (i) ) from π(Θ, x mis | D), where the detailed steps are described in Subsection 3.2.
Step 2 After given the values of (Θ, x mis ) at the i-th iteration, we sample the i-th iteration values of y test from Then, y test = M i=1 y test(i) /M is the value of the posterior median estimate of y test .

Simulation Examples
In Subsection 4.1, we design some simulation examples to validate the inference procedure proposed in Section 3 and compare the benefits by imputing the missing values in Model (2.1) instead of using complete data only. Also, we compare the results of using our reference priors versus the weakly informative priors for the hyperparameters in the GP prior. Moreover, we empiricially investigage the posterior consistency of our proposed models under the 'exact' reference priors. In Subsection 4.2, we conduct some experiments to evaluate the sensitivity of misspecification of correlation functions for GP priors assigned to g(·) in Model (2.1). Further, we analyze the prior choices assigned for the missing covariates in Subsection 4.3. In the meantime, we perform a simulation study to compare our proposed method with other competitive methods in a nonparametric regression when the covariates are missing.

Simulation I
Consider the semiparametric regression model (2.1) with the following specification, ∼ N(0, σ 2 ) with σ 2 = 0.4 and n = 120. Thus, η = σ 2 /σ 2 z = 0.4. To test the performance of our proposed method, we randomly select 100 data points out of 120 generated data points from (4.1) to be training datasets, while the rest 20 data points are left for the assessment of the prediction power for the model. Next, we create an average of 10%, 25% and 40% missingness of covariates x i 's in the training data according to the procedure described below. That is, we randomly generate the missing indicator from where R i = 1 indicates x i is missing for the ith subject, R i = 0 otherwise. We fix b 1 = −0.1 in (4.2) and then in each simulation run, we solve the value of b 0 to make the average missing probability of p i 's over 100 training points equals to 0.1, 0.25 and 0.4, respectively.
After the data were generated, we employ the MCMC sampling scheme developed in Subsection 3.2 to estimate model parameters and impute missing values of x i 's. We assume π(x i ) x ) with the hyperprior on μ x and σ 2 x being π(μ x , σ 2 x ) ∝ 1/σ 2 x . Using the derivation in Supplementary S.2 (Bishoyi et al., 2019), we know that the conditional prior distribution of x mis given x obs will follow a multivariate t-distribution. Although we advocate to use the reference prior discussed in Subsection 3.1 for the unknown parameters , η, β 0 , β 1 and σ 2 z in Model (4.1), we have also performed a comparison with other two popular priors used for our setting in practice. For the three priors considered, the information containing in the priors is gradually increasing as below: 1. Reference Prior (named Prior 1 ): We use the priors proposed in Subsection 3.1.
2. Vaguely Informative Prior (named Prior 2 ): According to the range of x i 's we generate, we assign a uniform prior on [−20, 20] to . Besides, we presume an inverse gamma prior with mean 1 and variance 100 for σ 2 z and log(η) ∼ N (0, 100). The priors for all other parameters are specified the same as Prior 1.
For each simulated data, we run the MCMC for 100,000 iterations, where the first 50,000 draws are discarded as a burn-in phase and every 10th values of MCMC samples Table 1: Comparison between our proposed method and the naive method via three different priors for Model (4.1). are stored to reduce the level of correlation between successive values of the chain. For each different scenario of missing percentage, we repeat the entire simulation procedure described above for 50 times using different random seeds. The total computation time costs 11 hours to run on a Xeon E5-2690 CPU with 2.60 GHz frequency, 128 GB RAM and 24 cores. Then, under the scenarios of three different priors mentioned above, we compare the parameters estimated in Model (4.1) using our proposed methods (PM) to the naive method. In the naive method, we only use the complete cases (CC) (i.e., those data points where covariate values x i 's are observed) to fit Model (4.1).
A comparison of these two methods with three different priors is shown in Table  1. We have compared the predicted mean squared error (PMSE) of y i 's for test points and the bias of the estimated parameters in Model (4.1). Here, we define PMSE =  Table 1, the bias of the estimated parameters are calculated using the absolute distance between posterior median estimates of the parameters and their corresponding true values in the simulation.
From Table 1, it is clear to see that using our proposed method to impute the missing covariates x i 's, we are able to predict the test points with better accuracy than using the naive method in all three different levels of missingness. Moreover, when the missing rate is higher, the posterior median estimates of the hyperparameters of GP prior as well as the parametric coefficients in Model (4.1) have relative lower biases by using  our proposed method than using the naive method. Generally speaking, the weakly informative prior will yield less absolute bias and smaller PMSE than the other two priors. It makes sense as if the weakly informative prior can provide extra information to pinpoint the right range of the target parameters. However, if we are lacking of such information, by using our proposed reference prior in the analysis, our reference prior is still doing a good job for the inference and prediction in comparison to the vaguely informative prior shown in Table 1 and is not much different than the weakly informative prior. However, an advantage for the practitioners in the usage of our proposed priors is that they can automatically run our program without knowing how to elicit the priors, whereas they are expected to obtain the comparable results as they do have some weakly information on the prior beliefs.
Another important topic is to explore the posterior consistency of our proposed models under the 'exact' reference priors. To empirically investigate this problem, we consider three different sample sizes: n = 100, 500, and 1000 for simulating data from Model (4.1). We illustrate the bias for each parameter averaging over 50 datasets that use different random seeds to simulate the data from Model (4.1). Notice that in Table  2, the row of n = 100 is the same as the row of Prior 1 in Table 1. From Table 2, it is obviously the bias of all parameters become smaller when the sample size gets larger. This pattern indicates that the posterior consistency empirically holds for our proposed models under the 'exact' reference prior.

Simulation II
In this subsection, we design several simulation experiments to test the performance of our proposed method under misspecification of correlation functions for the GP prior assigned to g(·) in Model (2.1). Here, we consider three different types of covariance functions, commonly used in the spatial statistics and machine learning field.

Mátern Class (MC) of Covariance Functions:
with positive parameters ν and , where K ν (·) is a modified Bessel function. The most interesting cases for Mátern class of covariance functions are ν = 3/2 and ν = 5/2 (denote them as MC 3/2 and MC 5/2 in Table 3 and Table S), that is: In our simulation, we have considered both the choice of ν = 3/2 and ν = 5/2.
For each choice of covariance functions (i.e., 4 choices) and each missing percentages for the covariate x i 's (i.e., 10%, 25% and 40%), we apply Model (4.1) to generate 10 different sets of data using different random seeds. Thus, we have a total of 10 × 4 × 3 = 120 datasets. Here, we choose β 0 = −10, β 1 = 20, = 10 and σ 2 z = 2 in Model (4.1), while all other settings are the same as Subsection 4.1. The data is generated in the same way as described in Subsection 4.1 with only changing the covariance function of the GP prior assigned to g(·) in Model (4.1).
We use the mean squared error of imputed missing values of x i 's (MSEx), PMSE of test points y i 's and deviance information criterion (DIC) (Spiegelhalter et al. (2002)) to evaluate the performance and test the goodness of fit for the different choices of covariance functions. In fact, due to the complication of integrating out x mis in the likelihood, we use the conditional DIC defined in Celeux et al. (2006) for computing DIC. Notice that for model comparison, we can define the deviance as where f (y | Θ, x mis , x obs , Z) is the conditional likelihood of y. Then, apply the original definition of DIC to this conditional distribution, which leads to  Table 3: The sensitivity analysis of using squared exponential kernels based on MSEx, PMSE and DIC. 'T' represents the true kernel used in generating the data and 'F' indicates the kernel applied in fitting the data.
where E Θ,x mis (·) implies taking expectation respect to the joint posterior distribution of Θ and x mis , which can be easily approximated using an MCMC run by taking the sample mean of the simulated values of D(Θ, x mis ); and for Θ and x mis , we choose their posterior medians in our study.
Since the SE covariance function has lots of good properties and supports a large class of functions with various shapes, we want to focus on the performance of using SE covariance functions when the other covariance kernels are true. For the reference, we present a detailed result in Table S Table S, we construct Table 3 Table 3, we could see the relative changes of MSEx, PMSE and DIC values of using SE covariance function in comparison to using the true kernel is relatively small. Thus, it shows that the performance of our model using SE covariance under misspecification of covariance kernel is kind of robust. Thus, in our application, we will choose to work with SE covariance.

Simulation III
In this subsection, we first design some simulation examples to assess the sensitivity of the priors assigned to the missing covariates. Then, we perform a simulation MSEx PMSE X X X X X X X X X X

Missing
Prior  study to compare our proposed method with two competitive methods. One using cubic smooth splines, where the missing covariates are imputed through multiple imputation by chained equations (MICE) algorithm (cf., van Buuren and Groothuis-Oudshoorn (2011)). Another one is the method proposed by Faes et al. (2011), who solved the issue of missing covariates in the usage of spline basis functions for nonparametric regression.
Since the focus here is more on the estimation of the function g(·) and its missing covariate, we slightly revise the assumption of Model (4.1) and omit the covariate z i and the intercept there, i.e., for i = 1, · · · , n and n = 120. Out of 120 observations, we randomly select 100 data points for training and use the rest of 20 observations as the test dataset. The percentage of 10%, 25%, and 40% missingness are created among the training points using the missing mechanism shown in Equation (4.2).
First, let us explore the sensitivity of the prior assigned to the missing covariate x i 's. To be specific, we investigate three different priors for x i 's in Model (4.4): 1) Gaussian prior: π(x i ) = N (0, 100); 2) Uniform prior: π(x i ) = 1({−5 ≤ x i ≤ 5})/10, where 1(·) is an indicator function; and 3) standard Cauchy prior: π(x i ) = 1/[π(1 + x 2 i )] for x i ∈ R. Clearly, the second choice of the prior is identical to the distribution where the covariate x i 's are generated from. Typically, the Cauchy distribution has much heavier tailer and thus it might be expected to tolerate more on the misspecification for the choice of the prior on x i 's.
We have repeated 50 times for the process to generate data from Model (4.4) using different random seeds. The values in Table 4 are averaged over 50 times and calculated by using our proposed method in Subsection 3.2 with changing the corresponding priors specified on the x i 's. From Table 4, it is obvious that the values of MSEx and PMSE are almost the same among three different choices of the priors for x i 's in all three scenarios of missingness. This result suggests that the choice of the prior for the covariate x i 's is insensitive to the inference and prediction in our proposed method provided that the prior elicitation is containing the domain of the covariate x i 's.
Next, we are going to examine the performance of our proposed procedure to estimate the unknown function g(·) in comparison with the other two competitive methods. The three methods compared in Table 5 are the following:  Table 5: The comparison of three methods based on MSEx and PMSE.
• Method 2 (M2). We first perform multiple imputation of the missing covariates x i 's using MICE algorithm, which we call the MICE package in R. Next, we fit cubic smoothing spline for g(·) based on the imputed data.
• Method 3 (M3). We apply the method proposed in Faes et al. (2011) to impute missing covariates and estimate g(·) simultaneously. In their paper, they used penalized spline with mixed model representation to estimate g(·) (cf., Section 4.1 of Faes et al. (2011)).
We repeated the entire experiment 50 times using different random seeds and in each dataset, we analyze the data with the three methods described above. We assess the performance of these methods using MSEx and PMSE, and the results shown in Table  5 are averaging over these 50 experiments.
From Table 5, we can see that the values of MSEx for all three methods are similar when the missing rates are comparatively lower (i.e., 10% and 25%). It is obviously seen that when the missing percentage is higher (i.e., 40%), our proposed method performs much better than the other two in term of MSEx. Based on PMSE, the method proposed by Faes et al. (2011) and our proposed method have relative similar PMSE values, although Faes et al. (2011)'s method is doing slightly better in higher missing percentages than ours. In term of PMSE, both our method and Faes et al. (2011)'s method are doing much better than the cubic smoothing splines method. The cubic smoothing splines (i.e., M2) uses a two-stage estimation procedure, thus it fails to take advantage of the functional relationship between the response and covariates when imputing the missing covariates.

Application
Since our approach has successfully applied to the simulated data and recovered the true values of parameters well, we are going to employ our methods to two applications. According to our investigation for the relative robustness of misspecification of covariance functions of GP prior and the sensitivity of the prior choice on the missing covariate x i 's in Section 4, we are going to use SE covariance kernel on the GP prior and assume a normal distribution on the covariate x i 's throughout the applications.

Application I
First, we are going to apply our proposed method in Adsorption Isotherm data for R-113 refrigerant vapors on BPL activated carbon at 298 Kelvin obtained from Mahle et al. (1994). BPL activated carbon is a virgin granular activated carbon designed for use in gas phase applications. It can be reactivated for reuse which eliminates disposal problem. One of the usage of BPL activated carbon is for gas purification and solvent recovery. R-113 is 1,1,2-Trichloro-1,2,2-Trifluoroethane, which is a colorless to water white, nonflammable liquid with a slight, ether like odor at high concentrations. It has been used as a cold degreasing agent, dry cleaning solvent, refrigerant, blowing agent, chemical intermediate and drying agent. The data we considered contains 29 observations. We partitioned the data into training and test datasets, containing 24 and 5 observations, respectively. Figure 1 displays the 24 training points as blue colors and the 5 test points as red colors from Adsorption Isotherm data in Mahle et al. (1994).
Adsorption is usually described through isotherms, that is, the amount of adsorbate on the adsorbent (i.e., loading in Figure 1) as a function of its pressure (defined as partial pressure in Figure 1). It is clear that the loading has a non-decreasing relationship with the partial pressure from Figure 1. The Langmuir equation, defined in Langmuir (1918) is one of the most popular models that correlates the amount of adsorbed gases y on plane surfaces of glass, mica, and platinum with the equilibrium aqueous concentration x through a nonlinear function given by where α > 0, β > 0, n is the total number of observations and i takes account of random measurement errors with the assumption that i i.i.d.
∼ N(0, σ 2 ). This formula is the most commonly used isotherm equation because of its simplicity and its ability to fit a variety of adsorption data. In our dataset, y i in Equation (5.1) presents loading (mol/kg), while x i corresponds to partial pressure (pa) and n = 24. However, some of the assumptions used to derive Equation (5.1) are seldom all true. In addition, the accuracy of the data collected during the experimental procedure may be affected due to various reasons like equipment failure, data entry error and etc. Thus, in the presence of missing or inaccurate data, the inference based on Langmuir equation may be invalid.
When the data is fully observed and accurate, Dey et al. (1997) proposed a model Log Model : to be a competitive model with the Langmuir equation. There are no constraints on the values of parameters α and β in Equation (5.2). However, their defined model is merely based on the approximation of the geometric representation of the data generated from the Langmuir equation to ease the computation.
Particularly, we use the semiparametric model below GP Model : to compare the performance with the model specified in Equation (5.2) as well as with the Langmuir Equation (5.1) using the Adsorption Isotherm Data. We evaluate the accuracy of missing imputation based on mean squared errors criteria for all three models. From Figure 1, the domain of x, i.e., the partial pressure is always positive. We want to incorporate this information on the prior of x i 's. Thus, we consider a truncated normal prior on covariates x i 's, i.e., π(x i | μ x , σ 2 x ) ∝ N + (μ x , σ 2 x ). For the priors on the hyperparameters μ x and σ 2 x , we use the same noninformative priors as before, that is, π(μ x | σ 2 x ) ∝ 1 and π(σ 2 x ) ∝ 1/σ 2 x . Details about computation scheme for Langmuir Model and Log Model are postponed to Supplementary S.4 and Supplementary S.5 in (Bishoyi et al., 2019), while the computation scheme for GP Model is similar as discussed in Section 3 by merely changing the prior on x i 's.
We artificially create missingness in the covariates and compare imputed missing covariate with the true value based on MSEx. In addition, we compare the different models based on DIC and PMSE for the test points. We use Equation (4.2) to yield the ignorable missingness for the covariate x i , where we produce the missingness with three different percentages, i.e., 10%, 25% and 40%. We repeat each generation 50 times. Thus, for each percentage, we average the values of MSEx, DIC, and PMSE over 50 times for Langmuir Model, Log Model, and GP Model, a summary of which is given in   to Langmuir Model from experimental data. Therefore, Log Model will have a high risk of misspecification in real applications. GP Model has nonparametric nature in its fit, thus it will be more flexible in regressing adsorption isotherm data and can avoid misspecification. Hence, our GP Model will be a better choice for the analysis of adsorption isotherm data in comparison to the Langmuir model when we have missing covariates.

Application II
In this subsection, we are going to use our method on Auto-mpg data. This dataset is from the StatLib library maintaining by Carnegie Mellon University and previously was used in the 1983 American Statistical Association Exposition. This data is also available in the Statistics and Machine Learning Toolbox in MATLAB with a filename called "carbig.mat". One of its application goal is to predict the fuel consumption in miles per gallon (mpg) using the weight and horsepower of a car. In this data, it contains 398 PMSE DIC X X X X X X X X X X

Method
Size n = 30 n = 60 n = 90 n = 30 n = 60 n = 90  instances and we have 6 missing values in the horsepower attribute and it is reasonable to consider that missingness in the horsepower attribute is ignorable.
A common approach to model the fuel consumption for this data is to apply the linear regression technique. Our initial study shows that there is a nonlinear relationship between mpg and horsepower, but there is a linear relationship between mpg and the natural logarithm of the weight (denote as log(weight) in Figure 2) of the car. Both of these phenomena can be clearly seen from the original data in Figure 2. We randomly sample 30, 60 and 90 instances, respectively, from the original data and each sample will include those 6 missing observations, which miss the horsepower attribute. We repeat such random draws for 50 times of each 3 cases of instances and we consider the rest of the observations in the data as test points.
We employ the linear regression and our proposed semiparametric model on the three cases of instances for the randomly sampled observations. Specifically, we use the linear structure for the natural logarithm of the weight (in tons) and the nonparametric structure for the horsepower in our proposed model, that is: Similarly, for the linear regression, we use the horsepower and the natural logarithm of the weight as predictor variables and the mpg as the response variable, i.e., In both regressions above, y i corresponds to the mpg, x i is the horsepower attribute, z i indicates the natural logarithm of the weight of the car, i is the random error with the assumption that i ∼ N(0, σ 2 ) and n is the number of instances we consider. From Figure 2, it is natural to assume that the values of the horsepower attribute is nonnegative, thus, we assume a truncated normal prior on x i 's, i.e., π(x i | μ x , σ 2 x ) ∝ N + (μ x , σ 2 x ). All the other priors of unknowns are the same as Subsection 5.1. We compare the performance of both models based on PMSE and DIC, where their values are averaged over 50 draws for each case of instances. Table 7 shows the results for our semiparametric model versus the linear model in the scenarios with missing data imputed. As expected, with the increase in the number of training data points, both values of PMSE and DIC have become smaller for these two models. However, our proposed semiparametric model is able to perform better than the linear model in all different sample sizes situations. Thus, our proposed GP  semiparametric model is superior in the analysis of the Auto-mpg data in comparison to the linear model.

Semiparametric Regression with Missing Covariates``````````M
In addition, we build up Table 8, where we compare the PMSE results for Model (5.3) using complete cases and the MICE algorithm to deal with missing data for Model (5.3). When we compare the values of PMSE in the first row from Table 7 with those shown in Table 8, we can see that our proposed method (which is to impute the missing covariates and estimate the unknown parameters in a simultaneous way) performs much better in comparison with MICE imputation algorithms when the sample size is median (i.e., n = 60 and n = 90); while our method is dominant in the performance of PMSE in comparison to the complete cases analysis in all three cases.

Discussion
In this paper, we have considered the problem of imputation of missing covariates for the nonparametric part in a semiparametric regression under the Bayesian framework. In the absence of a parametric regression part, our semiparametric model can be reduced to a nonparametric regression setting. Our proposed procedure permits us to model nonparametric as well as semiparametric regression in the presence of missing covariate.
To deal with missing covariates for the nonparametric regression is often difficult. Especially, when we assign a GP prior to the unknown nonparametric function, the missing covariates will cause the problem to establish the covariance function of the GP prior. Our proposed method is the first one to solve this problem for the GP prior, while it has still kept the flexibility of GPs in the computation for the nonparametric/semiparametric modeling from Bayesian perspective. Also, we have proved the posterior propriety under the "exact" reference prior for the hyperparameters of GP in the appearance of missing covariates. Moreover, we have illustrated that in the presence of ignorable missing covariates for the semiparametric regression model, our proposed method can perform better than the naive method using complete cases and the cubic splines using imputed data. In addition, we have demonstrated our proposed method is at least comparable to the performance of the spline methods proposed by Faes et al. (2011) to deal with missing covariates in nonparametric regression. In the two applications, we have displayed that our proposed method is able to perform better than the competitive parametric methods when there are missing covariates in the data, especially we are not certain about the parametric relationship between the response and the predictor variables. Thus, our method will be particularly appealing for analyzing the data where the covariates are subject to ignorable missingness and the relationship between the response and the covariates is unclear. However, we are at least expected that the unknown function g(·) has an one-to-one mapping, otherwise, we might encounter the same multimodality problem in the posterior distribution for the missing covariates as mentioned in Beal (2003) and Faes et al. (2011).
Throughout this paper, we assume the missing data mechanism is ignorable. Sometimes, this assumption is somewhat restricted. Thus, our next goal is to extend our proposed procedure for a non-ignorable missing mechanism.

Supplementary Material
Supplementary Materials for "Learning Semiparametric Regression with Missing Covariates Using Gaussian Process Models" (DOI: 10.1214/18-BA1136SUPP; .pdf). We have restated about the four conditions used in Ren et al. (2012) and the derivation for the Conditional Distribution of x mis Given x obs in Section S.1 and Section S.2 of the supplement, respectively. Moreover, we have put the detailed results of MSEx, PMSE and DIC for different covariance kernels in Simulation II of Section 4.2 as Section S.3 of the supplement material. Also, in Section S.4 and Section S.5 of the supplement material, we have included the MCMC sampling scheme for Langmuir model estimation as well as the MCMC sampling scheme for Log Model Estimation for Section 5.2. See more details in Supplement S (http://doi.org/10.2307/1390675).