A New Bayesian Single Index Model with or without Covariates Missing at Random

For many biomedical, environmental, and economic studies, the single index model provides a practical dimension reaction as well as a good physical interpretation of the unknown nonlinear relationship between the response and its multiple predictors. However, widespread uses of existing Bayesian analysis for such models are lacking in practice due to some major impediments, including slow mixing of the Markov Chain Monte Carlo (MCMC), the inability to deal with missing covariates and a lack of theoretical justification of the rate of convergence of Bayesian estimates. We present a new Bayesian single index model with an associated MCMC algorithm that incorporates an efficient Metropolis–Hastings (MH) step for the conditional distribution of the index vector. Our method leads to a model with good interpretations and prediction, implementable Bayesian inference, fast convergence of the MCMC and a first-time extension to accommodate missing covariates. We also obtain, for the first time, the set of sufficient conditions for obtaining the optimal rate of posterior convergence of the overall regression function. We illustrate the practical advantages of our method and computational tool via reanalysis of an environmental study.


Introduction
For many practical studies, including the environmental study (Chambers, 1983)  analysis is inadequate for inference and prediction when the usual assumptions of linear regression model fail. The Single Index Model (SIM) (Stoker, 1986) provides a simple and interpretable framework for understanding a complex, nonlinear relationship between a response variable Y i and its p > 1 dimensional covariate vector X i = (X i1 , … , X ip ). The conditional expectation of Y i given X i under the SIM is only an unknown univariate function of the scalar index Z i = α T X i = ∑ j = 1 p α j X ij , which is an unknown linear projection of the covariate vector X i with unknown index vector α = (α 1 , α 2 , … , α p ). A SIM clearly offers a practical compromise between a completely nonparametric multiple regression and a fully parametric linear regression because it accommodates both nonlinear main effects and higher-order interactions. It also offers clear physical interpretations of the index Z = α T X and relative importance of the predictor effects via the magnitudes of the index parameters (α 1 , … , α p ). These practical features of SIM are highly desirable for some biomedical, environmental and econometric studies dealing with the unknown nonlinear relationship between the response and predictors.
There are some great reviews on the recent rapid development of the frequentist SIM literature (Antoniadis et al., 2004;Hristache et al., 2001). These existing methods are essentially of three categories: (a) average derivative method (Stoker, 1986;Powell et al., 1989), (b) M-estimation (Ichimura, 1993;Hardle et al., 1993;Xia et al., 2002;Xia, 2006) and (c) sliced inverse regression (Li and Duan, 1989;Li, 1991). Average derivative methods require highly restrictive and hard to verify (in practice) conditions to ensure the consistency of the estimator of α. M-estimation approaches usually have good asymptotic properties.
However, these methods often lead to difficult high-dimensional optimization. The sliced inverse regression methods have only limited applications in practice since they require the distributions of X i to be elliptically symmetric.
In existing frequentist literature on SIM, the overwhelming emphasis has been on good empirical and asymptotic properties of the point estimates of the index parameter α and the link function f(·). The usual frequentist approaches to characterize uncertainty are based on either an asymptotic approximation or computationally intensive resampling methods. These methods can break down in practice even with a moderate number of covariates. However, Bayesian methods provide a realistic evaluation of the uncertainty of the estimates as well as of the predictions of future responses for many biomedical and other applications. Most of the existing Bayesian approaches for SIM use some basis representation such as splines (Antoniadis et al., 2004;Wang, 2009) and wavelets (Park et al., 2005) of f(·), along with a multivariate prior density on the coefficients of the chosen basis representation of f(·). For sample size n is moderately large. In Section 2, we propose the new Bayesian SIM using the Ornstein-Uhlenbeck (OU) process prior f(·) with a covariance function that helps us avoid the numerical inversion of the (n × n) covariance matrix within MCMC.
In Section 3, we present a novel and convenient method to generate from the posterior distributions of the parameters of the SIM. For all existing Bayesian methods, a major challenge is to generate samples from the conditional posterior distribution of the index vector α. In general, for the Metropolis-Hastings (MH) step to simulate α within each MCMC iteration, no obvious choice of a proposal density can ensure reasonable acceptance rate and autocorrelation of the posterior samples of α. In this paper, we devise a proposal density based on aligning the mode of the proposal density with that of the target conditional posterior density of α. Our proposal density facilitates a much higher acceptance rate of the MH step for simulating α compared to the acceptance rates for existing methods.
In Section 4, we also present a theoretical justification for our Bayesian method. We provide the minimal regularity conditions required to ensure the optimal convergence rate of the estimate of the overall mean regression function, g(x) = f(α T x). We provide these conditions for the OU process prior (recommended by us) as well as for the Gaussian process (GP) prior with square exponential covariance kernel. These minimal regularity conditions about our GP prior on the unknown f(·) are easy to ensure in practice and help us determine the prior for f(·) to achieve the optimal convergence rate of the Bayesian estimates of mean response.
Like numerous other biomedical studies, our environmental study example has observations with some missing covariates. To the best of our knowledge, there are very few existing frequentist methods and no Bayesian methods for SIM that accommodate missing covariates. These inverse probability weighted, estimating equations-based frequentist approaches (Xue, 2013;Guo et al., 2015;Li and Yang, 2016) essentially do not use observations with missing covariates. Their other major weaknesses include a difficult to estimate finite sample variance of the estimated α even while using bootstrap, high dependence of the performance of the estimates of f(·) on the choice of the bandwidth of the kernel smoothing and computational difficulty associated with the empirical likelihood method. Related important work of Niu et al. (2017), however, focuses only on statistical tests for the parametric single index model. To address these weaknesses, in Section 5, we develop a novel and computationally efficient Bayesian SIM approach to handle Missing-At-Random (MAR) covariates using post-MCMC importance sampling weights. In Section 6, we present simulation studies to demonstrate the finite sample properties of our Bayesian methods and their comparisons with estimates from some competing methods for whom the software is publicly available.
In Section 7, we present the reanalysis of air quality study (Chambers, 1983) to illustrate the practical advantages of our Bayesian method compared to other existing competitors. Section 8 presents some concluding remarks and discussions about various further extensions and relationships among different approaches.

Bayesian Method for SIM
We denote observed data as D n = {(y i , x i ) : i = 1, 2, … , n}, where (y i , x i ) is the observed values of the response Y i ∈ ℝ and corresponding p-dimensional predictors X i ∈ ℝ p for i = 1, … , n independent subjects. Without loss of generality, we assume that the covariates are scaled so that X i ∈ [0, 1] p . We assume the single-index model with independent errors ϵ i ~ N(0, σ 2 ), unknown index parameter α = (α 1 , … , α p ) and an unknown nonlinear univariate link-function f(·). Some generalizations of SIM such as projection pursuit regression (PPR) (Friedman et al., 2001) aim to find appropriate multiple projections and link functions of multidimensional X i to model Y i . However, the models obtained from PPR are hard to interpret because each component of X i can affect E[Y i |X i ] via multiple link functions. To ensure identifiability of the model in (2.1), we further need α ∈ S p − 1 + = α: α 1 2 + α 2 2 + ⋯ + α p 2 = 1, α 1 > 0 (Lin and Kulasekera, 2007). This restriction on α poses challenges for the specification of prior of the index vector α and for the subsequent Bayesian posterior computation. To address this challenge, we specify a Π θ for the one-toone polar transformation θ (Park et al., 2005), where α 1 = sin θ 1 , α 2 = sin θ 2 cos θ 1 , ⋯, α p − 1 = sin θ p − 1 ∏ j = 1 p − 2 cos θ j , α p = ∏ j = 1 p − 1 cos θ j for θ j ∈ [0, π] for all j = 1, … p -1 to ensure α ∈ S p − 1 + . A prior Π α for α corresponds to a specific prior Π θ for θ. In spite α = α(θ) being a function of θ, we sometime suppress this relation in the notation α for the brevity. We assume that the joint prior Π(Λ) of the parameters Λ = (f, θ, σ, κ) of our Bayesian single index model of (2.1) is where Π f | α(θ), κ = Π f | θ, κ is the conditional nonparametric prior for the link-function f(·), given θ and the hyperparameter κ with corresponding hyper-prior Π κ . Also, Π θ and Π σ are marginal priors for θ and σ, respectively. Conditional on α, we require the prior Πf |α,κ to be supported on ℂ[ − p, p], the space of all continuous functions on [ − p, p], because a conditional prior on the function space {f : t ↦ f(t), t = α T x} for a fixed α needs the restriction of the domain |t | ≤ α x ≤ p because α = 1. A practical choice for the prior model of a nonparametric link function is f(·), which is the Gaussian process prior GP(μ, c κ ) indexed by the user-specified prior mean function μ: ℝ ℝ and the positive definite covariance kernel c κ : [ − p, p] × [ − p, p] ℝ, known except the hyper-parameter κ that controls the smoothness of the realizations (sample paths) of GP(μ, c κ ). The choice of the covariance kernel c κ is vital for controlling the sample paths of the GP(μ, c κ ). We opt for the Ornstein-Uhlenbeck (OU) process with the covariance kernel given by c κ t 1 , t 2 = e −κ t 1 − t 2 .
The OU process is often a popular and appropriate choice, and this c κ (t 1 , t 2 ) is a special case of the more general Matérn family of covariance kernels c t 1 , t 2 = τ 2 1 Γ(ν)2 ν − 1 2ν κ t 1 − t 2 ν K ν 2ν κ t 1 − t 2 , where Γ(·) is the gamma function, K ν is the modified Bessel function of the second kind, ν and κ are the smoothness and bandwidth parameters and τ controls the signal-to-noise ratio. The OU process corresponds to (ν = 0.5, τ = 1) in (2.3). Even though the OU stochastic process used as a prior for f(·) has only one unknown hyper-parameter κ, we use the notation Π f|α,κ for the prior f(·) because the evaluation of the posterior distribution involves the joint multivariate prior density of f(·) evaluated at a finite number of values which are functions of only (α, κ) (explained in the following section). We use the prior Π κ for κ to be a discrete uniform on a set J = a: Π κ (a) > 0 of equally spaced grid points in the interval [κ min , κ max ].
This very convenient and practical prior is capable of approximating any continuous prior density for κ to any acceptable level via appropriately choosing the number and gaps of the grid points in J. For the prior Π σ of σ 2 , we use the Inverse-Gamma prior IG(γ 1 , γ 2 ) density with shape γ 1 and scale γ 2 for σ 2 . This big enough class can also accommodate any reasonable prior opinion about σ 2 , and it is also a convenient prior since it is conjugate under the Bayesian SIM model of (2.1).

Author Manuscript
Author Manuscript

Author Manuscript
where Σ α, κ −1 i, j = 0 for j ∉ {i + 1, i, i − 1}, upper and lower subdiagonals are q i = r i / 1 − r i 2 with r i = exp{−κ(t i+1 − t i )} for 1 ≤ i ≤ N − 1, and diagonal entries are s i = 1 + r i q i + r i − 1 q i − 1 for i = 1, · · ·, N. The determinant d N of the tridiagonal matrix Σ α, κ −1 (and hence the determinant of Σ α, κ ) is also evaluated via the sequential formula d m = s m d m − 1 − q m − 1 2 d m − 2 with d 1 = s 1 and d 0 = 1. These help the implementation of the MCMC algorithm via avoiding the numerical inversion and direct computation of the determinant of any potentially high-dimensional covariance matrix at each iteration. The resulting conditional posteriors for the MCMC algorithm are the following: f | σ 2 , κ, θ, D n ϕ n {f; (Σ θ, κ −1 + σ −2 I n ) −1 y/σ 2 , (Σ θ, κ −1 + σ −2 I n ) −1 }, where Π θ (θ) is the prior for the one-one polar transformation θ of α (as explained earlier), ϕ n (.; μ, B) is the n-variate normal density with mean μ and variance matrix B and J is the discrete support of the prior of κ. For the sake of brevity, the above conditional posteriors are given for the case N = n when all the values of α T x i are distinct. We again emphasize that the conditional posterior distributions in (3.2)-(3.4) are straightforward to sample from because we have a closed-form expression for Σ θ, κ −1 and a sequential formula for its determinant.
However, for any choice of prior Π θ (θ), sampling θ from the conditional density of (3.5) is not straightforward in spite of the available expressions for Σ θ, κ −1 and |Σ θ,κ | −½ . Hence, we use a Metropolis-Hastings (MH) step within MCMC iteration to sample θ from the target density in (3.5). For this MH step, our independent proposal densities P j of θ j for j = 1, … , p − 1 are rescaled Beta densities with common support [0, π], that is To assure an appropriate proposal density P j θ j with good acceptance rate for the MH step, we choose d j to satisfy θ j, MAP = πc j d j /{ c j + d j 2 c j + d j + 1 } for fixed value of c j , where θ MAP = (θ 1, MAP , ⋯, θ p, MAP ) is the maxima of the target density in (3.5). We call this θ MAP , as the conditional maximum a posteriori (MAP) of θ. As the support of θ in (3.5) is closed and bounded, we initially fix a set of grid points G, and take θ MAP = argmax θ r ∈ G p θ r | σ 2 , κ, f, D where p(θ r |σ 2 , κ, f, D) is the conditional posterior density of θ evaluated at θ r . The approximation of the mode by θ MAP is improved by making the grid points in G finer. An alternative procedure for computing θ MAP is to use any built-in optimization algorithm available in R or MATLAB. Increasing the value of c j results in lowering the variance, if d j is kept fixed. We decide the value of c j based on the desired variance of the proposal density. For simplicity, we fix all the values of c j = c for j = 1, 2, … , p − 1.
Alternatively, sampling θ from (3.5) and f from (3.2) within each MCMC iteration results in highly autocorrelated posterior samples since f and θ are highly correlated in the posterior.
To circumvent this problem, we modify the MH step for sampling θ by adapting the method of Murray and Adams (2010), using surrogate data h. We describe the MCMC algorithm as follows: A possible choice of S θ is τ 2 I where τ 2 is a fixed, say, τ 2 = 0.1. In the above expressions, The transition kernels used for computing the acceptance probability of new θ based on past sample θ (t) are q(θ (t) ; θ) = ∏ j = 1 ). Here, c j , d j are the parameters for the proposal density used for generating θ and (c j ) are the parameters for the proposal density used in previous iteration number T.

Theoretical Results
The semiparametric Bayesian estimator of the regression function g(x) = f(α T x) in Choi et al. (2011) has been shown to possess posterior consistency. However, the theory and associated conditions for obtaining a desirable convergence rate of the posterior estimate of g(x) are not available yet. Even though we do not evaluate the convergence rates of the Bayesian estimates of f(·) and α separately, our theoretical result describes how the convergence rate of the Bayesian estimate of regression g(x) to the true regression function g 0 (x) depends on the sample size n and the roughness β of the true link function f 0 (·). This result also shows that the asymptotic performance of our Bayesian estimate of g(x) at the observed covariate values x 1 , … , x n is optimal as long as the roughness of the sample path of the prior process of the unknown f(·) is not too high compared to the sample-size n. We also show that this desirable result holds even when the roughness β of the true f 0 (·) is not very high (i.e., f 0 is less than twice differentiable). with Fourier transform f(λ) satisfying ‖f‖ β Sobolev norm of f. Roughly speaking, for integer β, a function belongs to ℍ β if it has partial derivatives up to order β that are all square-integrable.
For the sake of brevity, we consider a special case of the Bayesian SIM model described in (2.1) with σ 2 = 1 for our theoretical development. Let the true data generating parameters be (f 0 (·), α 0 ). We now state the main result on posterior contraction in estimating g 0 (x) = f 0 α 0 T x with respect to the empirical L 2 norm given by g − g 0 2, n 2 = (1/n)∑ i = 1 n g x i − g 0 x i 2 . In particular, we would like to find the minimum possible sequence of positive numbers ϵ n converging to 0 such that E 0 ℙ( g − g 0 2, n ≤ Mϵ n | D n ) 1 for a suitably large positive number M, where E 0 denotes expectation under the true data generation mechanism and D n = (y, X) is the observed data indexed by sample size n.
Assuming g 0 ∈ ℂ β [0, 1] p and recognizing that g 0 is essentially concentrated on a subspace with dimension 1, the optimal (in a minimax sense) rate for estimating g 0 is much faster ϵ n = n −β/(2β + 1) instead of ϵ n = n −β/(2β + p) , the minimax rate of estimating a p-variable function with smoothness β. We investigate the concentration rates for two different choices of prior on the link function f 0 . Theorem 1 presents the concentration rate when the prior density on f 0 is a OU process, i.e., the (i, j) th element of the covariance matrix is c ij = e −κ α T x i − α T x j . Theorem 2 provides the concentration rate when the prior density on f 0 is a Gaussian process with square exponential covariance kernel, i.e., the (i, j) th element of the covariance matrix is c ij = e −κ α T x i − α T x j 2 . Theorem 1.
Theorem 1 shows that we achieve the optimal rate of convergence of the posterior around true mean g 0 (x) if the square root of inverse bandwidth parameter κ of the OU process prior is taken to be of order n (1−2β)/(2β+1) . It is interesting to note that for β < 1/2 (i.e., when the true function f 0 is not even differentiable), we need κ of the OU process prior to go to infinity to achieve the optimal rate of convergence. However, when β > 1/2, we need κ to go to 0 to ensure the optimal convergence rate. Such conditions are required since the OU process has extremely rough sample paths. However, for the Gaussian prior, Theorem 2 shows that irrespective of the smoothness of f 0 , we need to ensure that κ of the Gaussian process goes to infinity to ensure the optimal concentration rate of the posterior. This is expected since the sample paths of a Gaussian process with square exponential covariance kernel are smooth. Therefore, to ensure that the posterior estimate of f converges to f 0 at an optimal rate, we need to ensure that κ goes to infinity. The proofs of Theorem 1 and 2 are

Analysis with Missing-At-Random Covariates
To the best of our knowledge, our paper presents the first extension of the Bayesian singleindex model to accommodate covariates that are missing at random (Little and Rubin, 2014). By an abuse of notation, the entire data D n is expressed as D n = D 1 ∪D 2 , where D 1 and D 2 contain all the observed data from subjects respectively in S 1 and S 2 . Here, S 1 and S 2 are respectively the set of subjects with no missing covariates and the set of subjects with at least 1 missing covariate. We also use the notation X ci = (X im , x io ) to denote the "complete" covariate vector from each subject i ∈ S 2 , where X im denotes the unknown missing covariates and x io denotes the observed parts of the covariates. For each subject i ∈ S 1 , we have X c i = x i 0 with X im being an empty vector. We assume that X ci has the joint density g X (· | γ) independently for i = 1, … , n. It is reasonable and conventional in missing-data literature (Little and Rubin, 2014) to assume that γ shares no parameters with the set of parameters Λ = (f, θ, σ, κ) associated with the regression model (2.1) and its prior Π γ is independent of the joint prior Π(Λ) in (2.2). Now, the joint posterior based on the entire observed data D n is p Λ, γ | D n ∝ L 1 D 1 | Λ, γ × L 2 D 2 | Λ, γ × Π(Λ) × Π γ (γ), where the likelihood based on D 1 is the likelihood based on D 2 is 2) and Π γ (γ) is the prior distribution of γ based on the fully observed covariates x io for the subjects i ∈ S 1 . For L 2 in (5.2), each integral is computed over the sample space of all missing covariates of the subject i ∈ S 2 . For example, in our analysis of the air-quality study, we assume g X (· | γ) to be the multivariate Gaussian density ϕ p (x; μ, Σ) with the popular and conjugate Normal-inverse-Wishart prior π(γ) for γ = (μ, Σ), even though other multivariate distributions can also be accommodated similarly. In this case, the integration of (5.2) has a closed-form expression. Bayesian estimates from (5.1) need Monte Carlo integration with respect to the joint posterior of p(Λ, γ | D n ). To address this challenge, we propose the following method weighted Mote Carlo integration. Note that equation (5.1) can also be expressed as p Λ, γ | D n ∝ p Λ | D 1 L 2 D 2 | λ, γ p 1 γ | D 1 , where p(Λ|D 1 ) is the posterior of (3.1) based on D 1 instead of the whole data D n ; p 1 γ | D 1 ∝ ∏ i ∈ S 1 ϕ p x i | γ Π γ (γ) is the posterior of γ based on fully observed covariates x io from D 1 . For the weighted Monte Carlo integration using weighted samples (instead of usual identically distributed samples from standard MCMC) (Λ*, Γ*), we first obtain Λ* from p(Λ|D 1 ) using the MCMC method described in Section 3. Then we independently sample γ* from the posterior density Π γ γ | D 1 . For example, using the usual conjugate Normal-inverse-Wishart prior Π(μ, Σ) of γ = (μ, Σ) for our analysis, this p 1 (μ, Σ|D 1 ) has the corresponding Normal-inverse-Wishart posterior density. We then compute the sampling weight ω* of (Λ*, γ*) to be proportional to L 2 (D 2 | Λ*) using integration of X im for all subjects i ∈ S 2 . We note that, in our case, with the multivariate normal density for g X (X im , x io |γ), the X im given (x i0 , γ*) has the multivariate normal distribution with a mean and covariance matrix that are known functions of x io and γ*. Unlike our case of multivariate normal density for g X (X im , x io |γ), one may need to perform a Monte Carlo integration for L 2 (D 2 | Λ*) via generating the missing observations X im from the conditional density of X im given (x i0 , γ*). Finally, to obtain the Bayesian estimates, we use the samples (Λ*, Γ*) with the corresponding sampling weights ω*.

Simulation Studies
Throughout our simulation studies, we use the OU process Π f|α,κ as a prior of the link function f(·), with hyper-parameter κ assigned a discrete uniform prior on the interval [0.5, 2] with grid size 0.05. The prior distribution for error variance σ 2 is inverse gamma with shape parameter 2 and rate parameter 0.01. For evaluating the approximate mode of the conditional posterior of θ within MH steps, we use grid size 0.05 for each θ j . We fix each c j at 5,000. We estimated d j using the method described in Section 3. For each Bayesian analysis based on our and other competing Bayesian methods, we use 3,000 MCMC samples after discarding the first 2,000 for burn-in.
In our simulation study 1, we investigate the performance of our Bayesian estimates based on the SIM of (2.1) (BSIM in short) for different sample sizes (n = 50, 100), forms of true link functions and values of the error standard deviations σ ∈ {0.01, 0.5}. Two different true link functions f 0 considered here are quadratic with f 1 (z) = z + z 2 and exponential with f 2 (z) = e z . True index parameter α 0 = (1, 1, 1)/ 3 corresponds to true θ 0 = (0.61547, 0.7854) in polar coordinates. We use 50 replicates of the datasets for each combination of true f 0 , σ and n to approximate the sampling distribution of the estimates. We begin with the performance of the estimates when the true data generating simulation models are (2.1) with N(0, σ 2 ) error distributions. The approximate sampling coverage probability of interval estimates and mean acceptance rate of the Metropolis-Hastings (MH) step are reported in Table 1.
For both forms of true link function f, the average MH acceptance rates of θ are close to 16%. The approximate coverage probabilities of the interval estimates are smaller when the true link function is exponential (f 2 ) compared to when it is quadratic (f 1 ). As expected, an increase in error standard deviation σ also lowers the coverage probabilities irrespective of the form of true f(·). The coverage probability is substantially smaller because the signal-tonoise ratio (the standard deviation of f(Z) divided by σ) when σ = 0.5 is 1/5 times the signalto-noise ratio when σ = 0.1.
Next, in simulation study 2, we compare our BSIM estimates of θ with the estimates from two competing methods: (1) method of Choi et al. (2011) (referred to as Choi-SIM henceforth) and (2) method of first-step projection pursuit regression (PPR). For brevity, we provide the details of simulation study 2 in the Supplementary Materials (Appendix A, Dhara et al., 2019). We have shown that the proposed BSIM method is superior to both Choi-SIM and PPR when the true model is the SIM of (2.1).
In simulation study 3, we compare our BSIM method with two the frequentist methods implemented in the R package MAVE, namely, (1) the kernel-based sliced inverse regression (KSIR in short) method of Li (1991) and (2) the method of central dimension reduction using outer product gradient (CSOPG in short) of Xia (2006). We also compare our method with two competing Bayesian methods, both implemented using MCMC methods, namely the final estimates α from Gramacy-SIM because of the norm of the initial α from Gramacy-SIM may not be 1. Based on the MSSE values from different methods presented in Table 2 for different choices of n and σ, BSIM-based estimates perform similar to CSOPG and Antoniadis-SIM based estimates when σ is small. For larger σ, CSOPG and Antoniadis-SIM perform as well as BSIM. Even though KSIR and Gramacy-SIM perform similar to BSIM when σ is smaller, for higher values of σ, BSIM performs better than both of them.
In simulation study 4, we also compare the MSSE of α from our BSIM with those from competing methods when the true underlying distribution of error ϵ i is a mixture of two normal densities 0.95 N(0, σ 2 ) + 0.05 N(0, (3σ) 2 ) instead of the normal distribution assumed for SIM. In Table 3, the MSSE values from different methods are presented for two sample sizes and 3 different values of σ. Similar to previous simulation study 3, α from BSIM performs better than those from KSIR and Gramacy-SIM. When σ is smaller, BSIM performs better than CSOPG and Antoniadis-SIM. However, for a larger value of σ, BSIM does not outperform CSOPG and Antoniadis-SIM as far the MSSE of α is concerned.
Our simulation study 5 investigates the performance of the extension of our BSIM method to deal with missing-at-random (MAR) covariates, under link functions f 1 and f 2 , sample sizes n = 50 and 100 and error standard deviations σ ∈ {0.01, 0.1, 0.5}.
Each of the 50 replicated datasets for each combination of (f, n, σ) has only variable X 1 potentially missing-at-random (MAR) with probability p 1j that does not depend on X 1 . For the results in Table 4, we use p 1j = 1/ 1 + e 4 + 2x 2i + 3x 3i + y i , which results in about 23% of observations with X 1 MAR. For the results in Table 5, we use p 1j = 1/ 1 + e 2 + 2x 2i + 3x 3i + y i which results in approximately 40% observations with X 1 MAR. In these two tables, we compare the BSIM method using only completely observed data D 1 with our proposed BSIM extension for MAR covariates using full data D = D 1 ∪D 2 . The comparisons of these two competing Bayesian methods are based on the average widths of the Bayesian interval estimates of θ parameters. Not unexpectedly, we obtain narrower 95% credible intervals when we use BSIM extension for available data D = D 1 ∪D 2 = instead of using BSIM only on D 1 . These two tables also present the percentage of improvement in average widths of the interval estimates for using BSIM extension for MAR covariates, and comparison of these two tables show that improvements in widths are more for Table 5 with a higher proportion of observations with missing covariates. It is important to note that even though our simulation model allows only the X 1 to be MAR, the lengths of the credible intervals for both the parameters θ 1 and θ 2 improve when using the missing data extension of BSIM. This may be due to the strong posterior dependence of the parameters θ 1 and θ 2 . The improvements in the widths of interval estimates also increase with the increase in error standard deviation (σ).
In our simulation study 6, we compare our proposed BSIM extension method to MAR covariates with the inverse probability weighted outer product gradient (OPG) method (called NOPG in short), described in Guo et al. (2014) and based on the estimates α of the index vector. Guo et al. (2014) suggested either a kernel-based, nonparametric method or logistic regression to find the probability weights. We use the nonparametric, kernel-based methods to compute the probability weights because the logistic regression-based approach run into numerical issues when the sample size n is 50 (small). We use the median of the sum of square error (MSSE) of α to compare our Bayesian method with Guo et al. (2014)'s NOPG method. The results are tabulated in Table 6. Based on these results, our proposed Bayesian method either performs equally well or better when the error standard deviation σ is small. However, for higher values of σ, NOPG performs better than the proposed method in terms of MSSE of α. However, it is worth noting that computing the 95% confidence interval of α and f(α T X) are computationally difficult for the NOPG method in spite of the asymptotic results in Guo et al. (2014). In comparison, the extension of the Bayesian method to the MAR covariate provides the posterior credible intervals for the parameters of interest, making the Bayesian method more useful in practice.
In simulation study 7, we investigate the 95% coverage probabilities using the complete data (D 1 ), and how it changes when including the data with missing covariates (D 2 ). We keep the same tuning parameters used in previous simulation scenarios. The results, when 20% of the observations have X 1 missing, are in Table 7. For 40% of observations having X 1 missing, the results are summarized in Table 8. We observe that when using D = D 1 ∪ D 2 , the coverage probability drops compared to when only D 1 is used for inference. Moreover, when the percentage of missing data is higher, the coverage probability decreases.
We also compare the predictive performance of the proposed method with other existing methods (CSOPG, KSIR, Antoniadis-SIM and Gramacy-SIM). These results are based on simulation study 8. Here, we fixed the sample size of the overall data and use 80% of the data for training the model. The remaining 20% of the data is used to evaluate the predictive performance of the fitted model. This method is replicated 50 times. The median of the mean absolute errors of the predictions 1 n test Table 9.
Here, x i test denotes the covariate of the i th subject in the test set. n test denotes the number of observations in the test set. When using Bayesian methods, namely Antoniadis-SIM, Gramacy-SIM and BSIM, we use a posterior predictive median for each subject as a final prediction. We observe that the proposed method outperforms the existing methods when the error variance is small (σ = 0.01, 0.1, 0.3). When the error variance is large, the proposed method has comparable performance to existing methods.

Analysis of Air Quality Data
We illustrate our method via reanalysis of the air quality study (Chambers, 1983), which was the motivating study for many major methodological developments in SIM literature (Hristache et al., 2001;Antoniadis et al., 2004;Choi et al., 2011). The full dataset, available via the datasets package in R, contains the logarithm of daily concentration (response) as well as three predictors (covariates) including the solar radiation, wind speed and temperature of New York City for 153 days from May to September 1973. Out of these 153 samples, there are 42 data points with at least one covariate or response missing. We first use the frequentist diagnostic described in Guo et al. (2016) to check whether a SIM is appropriate for modeling the response as a function of three covariates. The observed teststatistic of T DEE = 5.77 results in a small p-value, indicating that it is reasonable to use a SIM. We also provide an exploratory assessment of the adequacy of our Bayesian SIM using the empirical cumulative distribution function (cdf) C(t) = 1 n ∑ i = 1 n I y i ≤ t versus the fitted cdf M(t) = 1 n ∑ i = 1 n Φ t − y i /σ , where I( ⋅ ) denotes the indicator function, Φ(·) denotes the standard normal cumulative distribution function, y i and σ denote the Bayesian estimates of E[Y i |X i ] = f(α T X i ) and σ, respectively (computed via MCMC samples from the joint posterior). Validity of (2.1) is supported by the agreement between C(t) and M(t) across the range of observed responses, t. In Figure 1, we plotted C(t) as the solid blue curve and M(t) as the red dashed curve. The plot reveals that these two curves are nearly identical over the range of all observed responses, indicating that a SIM is appropriate for this study across all observed response values.
Compared to previous methods that do not handle missing data, we first analyze using 111 data points used by previous methods. For our Bayesian method, the hyperparameters of beta proposal density are c 1 = c 2 = 5000 for both coordinates of θ. The prior distribution on error variance σ 2 is inverse gamma with shape parameter 15 and rate parameter 0.06, corresponding to the prior guess of around 0.02 for σ (because the true σ is typically believed to be small for such a semiparametric regression model). The link function f(·) is assigned an OU process prior with the hyper-parameter κ having uniform hyper-prior on a support of grid of points in [0.1, 3] with a common grid interval width 0.05. Similar to the simulation studies, we use 3,000 MCMC iterations (after discarding the first 2,000 iterations for burn-in) to compute the Monte Carlo approximation of the Bayesian estimates and posterior standard deviations of (α 1 , α 2 , α 3 ). Table 10 presents the point estimates of α, obtained from our method as well as three other competing methods.
As is apparent from Table 10, the values of point estimates of α from our Bayesian methods are close to the corresponding estimates obtained from existing methodologies. Unlike the method of Choi et al. (2011), we also provide the posterior standard deviations (a measure of uncertainty in Bayesian estimates) of α 1 , α 2 , α 3 , and these posteriors standard deviations are substantially smaller than the standard errors for competing frequentist estimates. This reanalysis demonstrates the feasibility and advantages of our method compared to existing SIM tools in terms of ease of computation, convergence rate and small autocorrelations among MCMC samples, as well as the ability to handle missing covariates.
In Figure 2, we overlay four estimates of the link function f(z) versus index z obtained from three previous methods (Hristache et al., 2001;Antoniadis et al., 2004;Choi et al., 2011) and our Bayesian method. For three existing methods, only the estimates of α have been provided without any explicit estimate of f(·). To make all four methods comparable, we estimate the function f(·) for these three methods using smoothing splines on the estimated α T x i . Estimated f(·) from our Bayesian method is very similar to the estimates from other two existing Bayesian methods of Choi et al. (2011) and Antoniadis et al. (2004). The frequentist estimate of f(·) (Hristache et al., 2001) appears to be somewhat different from the estimates from all three Bayesian methods.
As mentioned earlier, the air-quality study has 5 observations with covariate solar radiation missing. The point estimate of α obtained from our proposed missing covariate method (last line of Table 10) is very close to the point estimate obtained from analysis using only the completely observed data points. Since only 4.5% of the samples have missing covariates, the method accommodating the missing covariate mechanism did not result in a substantial change in the values of the Bayesian point estimates. Table 10 shows the posterior standard deviations of the index vector α with and without using the observations with missing covariates in the analysis (last two rows of Table 10). Since only a small part of the sample has missing covariates, the posterior standard deviations did not change substantially after incorporating the data points with missing covariates.
We also evaluate the performance of our proposed method for predictions. For this evaluation, we have ignored the 5 observations with missing covariates and used only the remaining 111 observations. We use a 5-fold, cross-validation of the data to provide the mean sum of squares of error for the proposed method and compare it with existing methods. Our Bayesian method has used the proposal density with the tuning parameters c 1 = c 2 = 175. The results are provided in Table 11. BSIM performs better than Choi-SIM and is comparable with Antoniadis-SIM. Gramacy-SIM has better prediction performance than BSIM. However, Gramacy-SIM does not estimate the index vector α as precisely as BSIM.
Hence, BSIM provides a precise estimate of the index vector without compromising the predictive performance.

Conclusion
It is straightforward to extend our method to other priors on f(·), including a Gaussian process with other types of covariance kernels. However, these extensions may come at an additional cost of computation owing to the lack of closed-form expressions of the inverse and determinant of the correlation matrix.
Even though our data example has only one covariate missing, it is straightforward to extend our approach to more than one and even discrete valued covariates subject to missing-atrandom. Even though it is beyond the scope of this paper, our computational method can be extended to handle even nonignorable missing mechanisms. We found that via incorporating observation with missing covariates in data analysis, we can improve the precision/width of the Bayesian estimates, but with some potential bias trade-off.
For the first time in the Bayesian single index model framework, we provide theoretical guarantees in terms of posterior convergence rates of the overall regression function g(x) = f(α T x). The rate is optimal as long as the inverse bandwidth parameter is chosen appropriately. An immediate followup of our theoretical result is to show that the marginal posterior of the index vector α converges at a parametric rate.
We intend to extend our Bayesian methods to the partial linear SIM with regression function (conditional mean response) E(Y | X, Z) = α 1 T Z + f α 2 T X , where X and Z are two covariate vectors with Z having a linear effect and X having a nonlinear effect on the mean response.
An important future research direction is toward modeling different quantiles of responses using Bayesian single index models. The fundamental challenge here is to obtain an appropriate stochastic model of the response that allows flexible skewness.      Comparison of the average length of the 95% Bayesian credible intervals from the complete part of the data (D 1 ) with average length of the 95% credible intervals from the whole data D = D 1 ∪ D 2 , where D 2 is the part with missing observations. The percentage improvement in the length of the interval is reported in the last two columns. In this simulation, 20% of the observations have X 1 missing.

Length of Interval
Percentage Improvement Using only D 1 Using D 1 ∪ D 2 n σ  Comparison of the average lengths of the 95% credible intervals from the complete part of the data (D 1 ) and the 95% credible interval estimated after taking into account the missing part of the data(D 2 ). D stands for the whole dataset given by D = D 1 ∪ D 2 . The percentage improvement in the length of the interval is reported in last two columns. Here, 40% of observations have X 1 missing.

Length of Interval
Percentage Improvement Using only D 1 Using D 1 ∪ D 2 n σ  Comparison of the median of sum of square error (MSSE) of α between the proposed method to deal with missing data and method described in Guo et al. (2014).  95% Coverage probability of θ 1 , θ 2 when only D 1 or both D 1 , D 2 are used. In this simulation, 20% of the observations have X 1 missing.

Coverage Probabilities
Using D 1 Using D = D 1 ∪ D 2 n σ  95% Coverage probability of θ 1 , θ 2 when only D 1 or both D 1 and D 2 are used. In this simulation, 40% of the observations have X 1 missing.

Coverage Probabilities
Using D 1 Using D = D 1 ∪ D 2 n σ  Comparison of predictive performance of the existing methods (CSOPG, KSIR, Antoniadis et al. (2004), Gramacy and Lian (2012) and the proposed method (BSIM). We report the median of the mean absolute errors of prediction based on 50 replicates. 80% of the data was used to train the models and the rest 20% was used for evaluating prediction performance.