Robust Learning for Optimal Treatment Decision with NP-Dimensionality

In order to identify important variables involved in making an optimal treatment decision, Lu et al. (2013) proposed a penalized least squares regression framework for a fixed number of predictors, which is robust against misspecification of the conditional mean model. Two problems arise: (i) in a world of explosively big data, effective methods are needed to handle ultra-high dimensional data sets, for example, where the dimension of predictors is of non-polynomial (NP) order of the sample size; (ii) both the propensity score and conditional mean models need to be estimated from data under NP dimensionality. In this paper, we propose a two-step estimation procedure for deriving the optimal treatment regime under NP dimensionality. In both steps, penalized regressions with a non-concave penalty function are employed, where the conditional mean model of the response given predictors may be misspecified. The asymptotic properties of the proposed estimators, such as weak oracle properties, selection consistency and oracle distributions, are investigated. In addition, we study the limiting distribution of the estimated value function for the obtained optimal treatment regime. The empirical performance of the proposed estimation method is evaluated by simulations and an application to a depression dataset from the STAR*D study.


Introduction
Personalized medicine, which has gained much attention over the past few years, is a medical paradigm that emphasizes the systematic use of individual patient information to optimize that patient's health care. In this paradigm, the primary interest lies in identifying the optimal treatment strategy that assigns the best treatment to a patient based on his/her observed covariates. Formally speaking, a treatment regime is a function that maps the sample space of a patient's covariates to the set of treatments.
There is a growing literature on estimating optimal individualized treatment regimes. Existing approaches can be cast into model-based methods and direct search methods. Popular model-based methods include Q-learning (Watkins and Dayan, 1992; Chakraborty et al., 2010) and A-learning (Robins et al., 2000; Murphy, 2005), where Q-learning models the conditional mean of the response given predictors and treatment, while A-learning models the contrast function. The advantage of A-learning is that it is robust against misspecification of the baseline mean function, provided that the propensity score model is correctly specified. Recently, Zhang et al. (2012) proposed inverse propensity score weighted (IPSW) and augmented-IPSW estimators to directly maximize the mean potential outcome under a given treatment regime, i.e., the value function. Moreover, Zhao et al. (2012) and Zhang et al. (2012) recast the estimation of the value function from a classification perspective and used machine learning tools to directly search for the optimal treatment regimes.
The rapid advances and breakthroughs in technology and communication systems make it possible to gather an extraordinarily large number of prognostic factors for each individual. For example, in the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study, over 350 covariates are collected from each patient. With such data at hand, it is of significant importance to organize and integrate the information that is relevant to making optimal individualized treatment decisions, which makes variable selection an emerging need for implementing personalized medicine. In addition, it is common to encounter studies where the number of covariates is comparable to or much larger than the sample size. Therefore, variable selection is also essential in making high-dimensional statistical inference for estimated optimal treatment regimes. There have been extensive developments of variable selection methods for prediction, for example, LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), MCP (Zhang, 2010) and many others in the context of penalized regression. Their associated inferential properties have been studied when the number of predictors is fixed, diverging with the sample size, or of non-polynomial order of the sample size.
In contrast to the large amount of work on variable selection methods for prediction, variable selection tools for deriving optimal individualized treatment regimes have been less studied, especially when the number of predictors is much larger than the sample size. Among the available methods, Gunter et al. (2011) proposed variable ranking methods based on the marginal qualitative interaction of predictors with treatment. Fan et al. (2015) developed a sequential advantage selection method that extends the marginal ranking methods by selecting important variables with qualitative interactions in a sequential fashion. However, no theoretical justifications are provided for these methods. Qian and Murphy (2011) proposed to estimate the conditional mean response using an L1-penalized regression and studied the error bound of the value function for the estimated treatment regime. However, the associated variable selection properties, such as selection consistency, convergence rates and oracle distributions, were not studied. Lu et al. (2013) introduced a new penalized least squares regression framework, which is robust against misspecification of the conditional mean function. However, they only studied the case where the number of covariates is fixed and the propensity score model is known, as in randomized clinical trials. Song et al. (2015) proposed penalized outcome weighted learning for the case with a fixed number of predictors.
In this paper, we study the penalized least squares regression framework considered in Lu et al. (2013) when the number of predictors is of non-polynomial (NP) order of the sample size. In addition, we consider a more general situation where the propensity score model may depend on predictors and needs to be estimated from data, as is common in observational studies. A two-step estimation procedure is developed. In the first step, penalized regression models are fitted for the propensity score and the conditional mean of the response given predictors. In the second step, the optimal treatment regime is estimated using penalized least squares regression with the estimated propensity score and conditional mean models obtained in the first step. There are several challenges in both the numerical implementation and the derivation of theoretical properties, such as weak oracle and oracle properties, for the proposed two-step estimation procedure. First, since the posited model for the conditional mean of the response given predictors may be misspecified, the associated estimation and variable selection properties under model misspecification with NP dimensionality are not standard. Second, it is unknown how the asymptotic properties of the estimators for the optimal treatment regime obtained in the second step depend on the estimated propensity score and conditional mean models obtained in the first step under NP dimensionality. To our knowledge, these two challenges have not been studied in the literature. Moreover, we estimate the value function of the estimated optimal regime and study the estimator's theoretical properties.
The remainder of the paper is organized as follows. The proposed two-step variable selection procedure for estimating the optimal treatment regime is introduced in Section 2. Sections 3 and 4 establish the weak oracle and oracle properties of the resulting estimators, respectively. Section 5 studies the estimator of the value function of the estimated optimal treatment regime. Simulation results are presented in Section 6. An application to a dataset from the STAR*D study is illustrated in Section 7, followed by a concluding section. All technical proofs are given in the Appendix.

Method
Let Y denote the response, A ∈ A denote the treatment received, where A is the set of available treatment options, and X denote the baseline covariates, including the constant one. For demonstration purposes, we focus on a binary treatment regime, i.e., A = {0, 1}, with 0 for the standard treatment and 1 for the new treatment. We consider the following semiparametric model:

Y = h_0(X) + A X^T β_0 + e, (1)

where h_0(X) is an unspecified baseline function, β_0 is the p-dimensional vector of regression coefficients and e is an independent error with mean 0 and variance σ^2. Under the assumptions of stable unit treatment value (SUTVA) and no unmeasured confounders (Rubin, 1974), it can be shown that the optimal treatment regime d^opt(x) for patients with baseline covariates X = x takes the form d^opt(x) = I(x^T β_0 > 0), where I(·) is the indicator function.
Our primary interest is in estimating the regression coefficients β_0 defining the optimal treatment regime. Let π(x) = P(A = 1|X = x) be the propensity score. We assume a logistic regression model for π(x), i.e., π(x, α_0) = exp(x^T α_0)/{1 + exp(x^T α_0)}, with p-dimensional parameter α_0. Here, we allow the propensity score to depend on covariates, which is common in observational studies, and the parameter α_0 can be estimated from the data. For randomized clinical trials, π(x, α_0) is a constant. We assume that the majority of elements in β_0 and α_0 are zero and refer to the supports supp(β_0) and supp(α_0) as the true underlying sparse models. Consider a study with n subjects. We consider the case where the dimensionality p is of NP order of the sample size n. Define µ(x) = h_0(x) + π(x, α_0) x^T β_0, the conditional mean of the response given covariates X = x. We propose the following two-step estimation procedure to estimate the optimal treatment regime. In the first step, we posit a model Φ(x, θ) for the conditional mean function µ(x), and consider penalized estimation for the propensity score and conditional mean models as follows.
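As a concrete illustration, the two-step procedure can be sketched on synthetic data. The following is a minimal sketch, not the authors' implementation: scikit-learn's L1 (lasso) penalties stand in for the non-concave SCAD-type penalties of the paper, and the baseline function sin(x_1), the sparsity patterns and all tuning values are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

rng = np.random.default_rng(0)
n, p = 400, 50

# Synthetic data with sparse alpha0 (propensity) and beta0 (interaction).
X = rng.normal(size=(n, p))
alpha0 = np.zeros(p); alpha0[:2] = 0.5
beta0 = np.zeros(p); beta0[:3] = 1.0
prop = 1.0 / (1.0 + np.exp(-X @ alpha0))      # logistic propensity model
A = rng.binomial(1, prop)
Y = np.sin(X[:, 0]) + A * (X @ beta0) + rng.normal(size=n)  # h0 = sin(x1)

# Step 1a: penalized logistic regression for the propensity score pi(x, alpha).
ps = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, A)
pi_hat = ps.predict_proba(X)[:, 1]

# Step 1b: penalized regression for a (possibly misspecified) linear working
# model Phi(x, theta) = x^T theta of the conditional mean mu(x).
mu_hat = Lasso(alpha=0.05).fit(X, Y).predict(X)

# Step 2: penalized least squares on A-learning residuals; centering the
# treatment by pi_hat makes the fit target the interaction coefficients beta0.
D = (A - pi_hat)[:, None] * X
beta_hat = Lasso(alpha=0.05).fit(D, Y - mu_hat).coef_

d_hat = (X @ beta_hat > 0).astype(int)        # estimated optimal regime
```

Even though the working model for µ is misspecified here, the centering by the estimated propensity in Step 2 keeps the interaction fit targeted at β_0, which is the robustness property the procedure relies on.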

Non-asymptotic weak oracle properties
In this section we show that the proposed estimators enjoy the weak oracle property; that is, α, β and θ defined in (3)-(5) are sign consistent with probability tending to 1, and are consistent with respect to the L∞ norm. Weak oracle properties of θ are established in the sense that it converges to some least false parameter θ⋆ when the main effect model is misspecified.
Theorem 1 provides the main results. Some regularity conditions are discussed further in Subsections 3.2 and 3.3. A major technical challenge in deriving the weak oracle properties of β is to analyze the deviation in (23), for which we develop a general empirical process result in Proposition 3.1. This result is important in its own right and can be used in analyzing many other high-dimensional semiparametric models where the index parameter of an empirical process is a plug-in estimator. The following notation is introduced to simplify the presentation.
Remark 3.1. In Theorem 1, part (a) corresponds to sparse recovery while part (b) gives the estimators' convergence rates. The weak oracle property of α follows directly from Theorem 2 in Fan and Lv (2011). However, proving this property for β requires further effort, to account for the variability due to plugging in θ and α. The L∞ convergence rate of α1, as well as the nonsparsity size s_α, plays an important role in determining how fast β1 converges.
Remark 3.2. The convergence rate of θ1 will not affect that of β. Because we require the posited propensity score model to be correct, the estimation of β is robust with respect to misspecification of the main effect parameters θ. Simulation results also validate our theoretical findings.

Condition 3.1. Assume that (6)-(13) hold, where 0 < a_3 ≤ 1/2 is some positive constant and d_nθ = min_{j∈M_θ} |θ⋆j| is the minimum half signal. If the response is unbounded, we require an additional moment bound, and the right-hand side of (11) shall be modified accordingly.

Remark 3.3. Conditions (6) and (7) are key assumptions determining the degree of model misspecification. Condition (6) requires that the posited working model Φ provide a good approximation to µ. In that case, the residual µ − Φ will be nearly orthogonal to the Jacobian matrix φ1 and the left-hand side of (6) will be small. There is a trade-off between the simplicity of Φ and its accuracy in approximating µ. Condition (7), which measures the complexity of Φ, automatically holds for the oracle submodel of Φ.
Remark 3.4. Conditions (11) and (12) put constraints on the derivatives of φ, requiring the misspecified function to be smooth. The right-hand-side order in (11) is not too restrictive, since we require n^{γ_θ} ≫ s_θ log n.
Three common examples of the main-effect function Φ are provided below to examine the validity of Condition 3.1.
Example 1. Set Φ = 0. Then no model is needed for Φ, and it is easy to check that Condition 3.1 is satisfied.
Example 2. When a linear model is specified, i.e., Φ(x, θ) = x^T θ, conditions (11) and (12) are automatically satisfied since the second-order derivatives of Φ vanish. If we set d_θ = d_β, (13) holds. Condition (6) is also satisfied automatically. Condition (7) then reduces to (14); each element in the left-hand-side vector in (14) can be viewed as the inner product between the residuals obtained by regressing each noise variable in X_2θ on X_1θ and the residuals obtained by regressing µ on X_1θ. When µ depends only on X_1θ, (14) holds with d_θ = d_β for the Gaussian linear model.

The design matrix
We provide the technical conditions on the design matrix as follows.
Condition 3.2. Assume that (15)-(22) hold, where the sequence b_αβ in (15) satisfies the stated rate constraint.

Remark 3.5. Conditions (15) and (16) control the impact on β of the deviation of the estimated propensity score from its true value, and are thus not needed when the propensity scores are pre-specified. By the definition of W(δ), the magnitudes of the left-hand sides of these two conditions depend on how accurately the misspecified function models µ and how fast θ converges to the least false parameter. The sequence b_αβ in (15) can converge to 0 when X_1β and X_1α are weakly correlated. Even in the most extreme case where X_1α = X_1β, i.e., the propensity score model and the outcome regression model share the same important variables, b_αβ usually does not exceed the order O(b_β). Each element in the left-hand side of (16) is the multiple regression coefficient of the corresponding variable in X_2β on W(δ)X_1α, using weighted least squares with weights π(1 − π), after adjusting for X_1β; these coefficients characterize the weak dependence given X_1β. These two conditions are generally weaker than those imposed by Fan and Lv (2011) (Condition 2), and are therefore more likely to hold.
Remark 3.6. The right-hand side of (20) can be relaxed to O(n^{1/2+γ_θ}/log n) when using the linear model. The additional term √s_θ is due to the penalty on the complexity of the main effect model. This condition typically controls the deviation in (23), where Z = diag(A − π)X. A common approach to bounding such a deviation is to use the classical Bernstein inequality. However, this approach does not work here, because the indexing parameter of the process Φ(·) in (23) is an estimator. To handle this challenge, we bound the left-hand side of (23) using an empirical process argument; a general theory that covers this result is provided in Proposition 3.1.
Remark 3.7. Conditions (21) and (22) aim to control the L∞ norm of the quadratic term of the Taylor expansion, viewed as a function of α, around α_0. Similar to (15) and (16), these two conditions are not needed when α_0 is known.

Deviation
Let a_ij(h), i = 1, …, n, j = 1, …, J, be arbitrary deterministic functions, continuously differentiable with respect to an S-dimensional parameter h. We restrict the domain to a hypercube H centered at some fixed vector. Further, for any t ≥ 3δv_0 d_n, the tail probability bounds (24) and (25) hold for some constant c_0. When the z_i's are bounded, the term ω log n in (24) and (25) can be removed.
Remark 3.8. Proposition 3.1 shows that the magnitude of the deviation is determined by the L2 constraints on the derivatives, the number of summands J, the dimension S of the parameter and the diameter δ of the region.

Oracle properties
In this section we study the oracle property of the estimator β. We assume that max(s_α, s_β) ≪ √n and n^{γ_θ} ≫ s_θ log n. The convergence rates of the estimators are established in Section 4.1 and their asymptotic distributions are provided in Section 4.2. Technical conditions are discussed in Section 4.3.

Rates of convergence
Theorem 2. Assume that Conditions 2.1, 3.1, A.2, A.4 and 4.1 hold; the constraints on b_θ, d_θ, d_nθ and λ_3n are the same as in Theorem 1. Further assume the rate conditions stated above and n^{γ_θ} ≫ s_θ log n. Then there exist a strict local minimizer β of the loss function (5) and a strict local minimizer α of (3) such that α2 = 0 and β2 = 0 with probability tending to 1 as n → ∞, and α1 and β1 converge at the stated rates.

Remark 4.1. We note that when establishing the oracle property of β, only the weak oracle property of θ is required. This is due to the robustness of A-learning and the fact that the propensity score model is correctly specified.
Remark 4.2. The precision of β1 is affected by that of α1, as shown in the corresponding display. When the propensity score is known, the convergence rate of β1 is improved to the stated s_β/n rate.

Asymptotic distributions
To establish the weak convergence of the estimators, we define Σ12 and Σ22 as in the display, where W is shorthand for W(θ⋆1).

Theorem 3 (Oracle property). Under the conditions of Theorem 2 and Condition 4.2, the suitably centered and scaled vector [A_1n α1, A_2n β1] is asymptotically normally distributed with mean 0 and covariance matrix Ω, which is the limit of the displayed matrix, where A_1n is a q_1 × s_α matrix and A_2n is a q_2 × s_β matrix satisfying the stated normalization. We note that the condition on the smoothness of the misspecified function, (20), is strengthened here. To better understand the theorem, we provide the following two corollaries. The first gives the limiting distribution when we correctly specify both the propensity score and the main-effect model, while the second corresponds to the case when the propensity score is known in advance.
Corollary 4.1. Under the conditions of Theorem 3, when we correctly specify the main-effect model, i.e., µ = Φ, A_1n α1 and A_2n β1 are jointly asymptotically normally distributed, with covariance matrix Ω′ given as the limit of the corresponding displayed matrix.
Remark 4.3. Comparing the results of Corollary 4.1 and Theorem 3, the term A^T_2n Σ22 A_2n accounts for the partial specification of model (1). In the extreme case where we correctly specify Φ, β achieves its minimum variance and is independent of α1. In general, we can gain efficiency by positing a good working model for Φ. Numerical studies also suggest that a linear model such as Φ = Xθ is preferred to the constant model. This is in line with our theoretical justification, since W is a diagonal matrix whose ith diagonal element is given in the display.

Corollary 4.2. When the propensity score is known, under the conditions of Theorem 3 with all α's replaced by α_0, then with probability tending to 1, the suitably centered and scaled A_2n β1 is asymptotically normally distributed with mean 0 and covariance matrix Ω′′, which is the limit of the corresponding displayed matrix.

Remark 4.4. An interesting fact implied by Corollary 4.2 is that the asymptotic variance of β1 with an estimated propensity score will be smaller than that of the same estimator had we known the propensity score in advance. A similar result appears in the asymptotic distribution of the estimated value function in the next section. This is in line with semiparametric theory in the fixed-p case, where the variance of the AIPWE becomes smaller when we estimate the parameter in the coarsening probability model, even if we know its true value (see Chapter 9 of Tsiatis, 2006). By doing so, we can borrow information from the linear association between the covariates in W X_1β and those in X_1α.
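The phenomenon in Remark 4.4, that plugging in an estimated propensity can reduce variance relative to using the known true value, can be illustrated in a much simpler setting. The toy comparison below uses an IPW estimate of a treatment mean under randomization; it is an illustration of the general principle, not the paper's estimator, and the data-generating model is an assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
reps, n = 2000, 500
est_known, est_fitted = [], []
for _ in range(reps):
    x = rng.normal(size=n)
    A = rng.binomial(1, 0.5, size=n)      # true propensity is exactly 1/2
    Y = x + A + rng.normal(size=n)        # mean response under treatment is 1
    # IPW estimate of E[Y(1)] using the known propensity ...
    est_known.append(np.mean(A * Y / 0.5))
    # ... and using the estimated (sample-proportion) propensity
    est_fitted.append(np.mean(A * Y) / A.mean())

var_known = np.var(est_known)
var_fitted = np.var(est_fitted)
# var_fitted comes out smaller: the estimated propensity absorbs the
# chance imbalance in the realized treatment assignments.
```

Both estimators are centered at E[Y(1)] = 1, but the Monte Carlo variance of the plug-in version is smaller, mirroring the efficiency gain described in Remark 4.4.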

Technical conditions
Condition 4.1. In addition to (21) and (22) in Condition 3.2, assume that the right-hand side of (20) is strengthened to O(n^{1/2+γ_θ}/√(s_θ log^3 n)), and that conditions (26)-(30) hold.

Remark 4.5. Similar to the interpretation of (15) and (16), (26) corresponds to a notion of weak dependence between the variables in X_1α and X_1β, while (27) requires that X_2β and X_1α be weakly correlated after adjusting for X_1β. The validity of these two conditions is related to the convergence rate of θ1 and the magnitude of the model misspecification ||µ − Φ||∞. Besides, it can be verified that (28)-(30) hold with large probability when the baseline covariates possess sub-Gaussian tails.
Condition 4.2. Assume that the displayed bounds hold, where x_1αi and x_1βi denote the ith rows of the matrices X_1α and X_1β, respectively.

Evaluation of value function
In this section, we derive a nonparametric estimator of the mean response under the optimal treatment regime. By (1), define the average population-level response under a specific regime as V_n(β) = n^{-1} Σ_{i=1}^n E{h_0(x_i) + I(x_i^T β > 0) x_i^T β_0}, where the treatment decision for the ith patient is given by I(x_i^T β > 0). The mean response under the true optimal regime is denoted by V_n(β_0), and it is easy to verify that β_0 is the maximizer of the function V_n.
Similarly to Murphy (2003), we propose to estimate the value function V_n(β_0) using the weighted estimator in (36). This estimator is not doubly robust, but it offers protection against misspecification of the baseline function and improved efficiency. It is not doubly robust because we require the propensity score model to be correctly specified to ensure the oracle property of β. A key condition that guarantees the asymptotic normality of (36) is given as follows.
Condition 5.1. Assume there exists some constant C′ such that the displayed bound holds for all ε > 0.

Remark 5.1. The above condition has an interpretation similar to Condition (3.3) in Qian and Murphy (2011), where a random design was utilized. Condition 5.1 requires that the absolute value of the average contrast function not be too small, which, together with the condition s_β = o(n^{1/4}), ensures the stochastic approximation condition (37).

Theorem 4. Assume that the conditions of Theorem 3 hold. If Condition 5.1 holds and the nonsparsity size s_β satisfies s_β = o(n^{1/4}), then, with probability going to 1, √n{Vn − V_n(β_0)} is asymptotically normally distributed with variance υ_0^2, which is the limit of the displayed quantity, where v_n stands for the indicated vector and Σ22 is defined in Theorem 3.
Remark 5.2. Note that we only need s_β = o(n^{1/2}) to guarantee the weak oracle property and to show the asymptotic normality of β1. Theorem 4 further requires s_β = o(n^{1/4}) to ensure the approximation condition (37).
Remark 5.3. When (37) is satisfied, the asymptotic normality of Vn follows immediately from the oracle property of the estimator β1. The first term σ^2 in (38) is due to the variation of the error terms e_i, while the last two terms correspond to the asymptotic variance of β1.
We provide a corollary here which corresponds to the case where the main-effect model is correctly specified.
Corollary 5.1. In addition to the conditions of Theorem 4, if the main-effect model is correct, then √n{Vn − V_n(β_0)} is asymptotically normally distributed with variance υ_1^2, defined as the limit of the corresponding displayed quantity. Similar to the asymptotic distribution of β1, the following corollary suggests that the proposed estimator is more efficient when we estimate the propensity score by fitting a penalized logistic regression.
Corollary 5.2. Assume the propensity score is known, and that the conditions of Theorem 4 hold with all α's replaced by α_0. Then, with probability going to 1, √n{Vn − V_n(β_0)} is asymptotically normally distributed with variance υ_2^2, which is the limit of the displayed quantity involving v_n, defined in Theorem 4, and Σ′22, defined in Corollary 4.2. By the definition of v_n and the condition that λ_max(X^T_1β X_1β) = O(n), the asymptotic variance reaches its minimum when I(x_i^T β_0 > 0) is close to the propensity score. We characterize this result in the following corollary.
Corollary 5.3. Under the conditions of Theorem 4, if we further assume the displayed condition on the propensity score, then with probability going to 1, √n{Vn − V_n(β_0)} is asymptotically normally distributed with variance σ^2.
Remark 5.4. Such a result is expected, with the following intuition: in an observational study, if the clinician or decision maker has a high chance of assigning the optimal treatment to an individual patient, i.e., the propensity score is close to I(x_i^T β_0 > 0), the variation in estimating the value function is decreased. In other words, the more skillful the clinician or decision maker is, the closer the observed individual response Y_i approaches the potential outcome under the optimal treatment regime.
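A weighted value estimator in the spirit of this section can be sketched as follows. The normalized (ratio) form of the weights and the toy generative model with a known constant propensity of 1/2 are assumptions for illustration, not the exact estimator (36).

```python
import numpy as np

def value_estimate(Y, A, pi_hat, d):
    # Keep subjects whose observed treatment agrees with the regime d and
    # weight them by the inverse probability of the treatment received.
    C = (A == d).astype(float)
    p_obs = A * pi_hat + (1 - A) * (1 - pi_hat)
    w = C / p_obs
    return np.sum(w * Y) / np.sum(w)

# Toy check under randomization: the contrast is 2x, so the optimal
# regime treats exactly when x > 0.
rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
A = rng.binomial(1, 0.5, size=n)
Y = x + A * (2 * x) + rng.normal(size=n)
d = (x > 0).astype(int)
v_hat = value_estimate(Y, A, np.full(n, 0.5), d)
```

Under this toy model the population value of the optimal regime is E[x] + 2E[x I(x > 0)] ≈ 0.80, and v_hat concentrates around that number as n grows.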
Simulations

To evaluate the performance of the estimator, we report the L2 loss of β and that of the first-stage estimator α as well. The number of missed true variables in X_1α and X_1β (denoted FN), the number of selected variables (denoted #S) and the average percentage of correct decisions (denoted PCD), defined as 1 − n^{-1} Σ_{i=1}^n |I(β^T x_i > 0) − I(β_0^T x_i > 0)|, are also reported. In addition, we estimate E(Y⋆(d)) and E(Y⋆(d^opt)), the value functions of the estimated optimal treatment regime and the true optimal regime, respectively, using Monte Carlo simulations. For example, we compute E(Y⋆(d)) by generating data for 10000 subjects from the model, where A is determined by the estimated optimal treatment regime I(β^T x > 0). Taking the average of the 10000 values, we report the averages of mean responses over 500 replications as well as their standard deviations.
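The two evaluation metrics just described can be sketched directly. The perturbed estimate, the dimension and the baseline function below are assumptions standing in for one simulation replication.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 10
beta0 = np.zeros(p); beta0[:3] = 1.0
beta_hat = beta0 + rng.normal(scale=0.1, size=p)  # stand-in for an estimate

# PCD: 1 - n^{-1} sum_i |I(beta_hat' x_i > 0) - I(beta0' x_i > 0)|,
# i.e. the fraction of subjects given the same decision by both regimes.
X_eval = rng.normal(size=(10000, p))
same = (X_eval @ beta_hat > 0) == (X_eval @ beta0 > 0)
pcd = same.mean()

# Monte Carlo estimate of E[Y*(d)] for a regime indexed by beta: generate
# responses with A set by the regime and average over the 10000 subjects.
def mc_value(beta_rule):
    A = (X_eval @ beta_rule > 0).astype(float)
    h0 = X_eval[:, 0]                    # assumed baseline for illustration
    e = rng.normal(size=len(X_eval))
    return np.mean(h0 + A * (X_eval @ beta0) + e)

v_hat_regime = mc_value(beta_hat)
v_opt_regime = mc_value(beta0)
```

Averaging such replications over 500 simulated datasets gives the reported means and standard deviations.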
Table 1 summarizes the results. The proposed estimators perform well when the covariates are independent and the magnitude of the half signal is large. When the sample size is small, about 2.3 and 1.7 important variables in X_1β are missed on average in Models II and III, respectively. However, selection consistency is observed either as the sample size increases or as the signal strength of β increases, as expected. Results for Model I are generally better than those for Models II and III, due to the simple linear form of the baseline function. The value function of the estimated optimal treatment regime slightly increases in all three models as the sample size gets larger, approaching the optimal value function.
The correlation structure of the covariates also influences the performance of the proposed estimators. When an AR(1) structure is chosen, the estimator tends to miss more important variables and produces a larger L2 loss. The second model appears to be the most challenging case. When n = 200 and the signal is relatively weak, more than three important variables are missed on average, and the PCD is only about 66% in this scenario. When the sample size or the magnitude of the signals increases, the performance of the estimators demonstrates trends similar to those obtained in the independent-covariates case.

Sensitivity analysis
We conduct a sensitivity analysis for the case when the propensity score model is misspecified. Instead of fitting a penalized logistic regression model, we estimate the propensity score by the sample proportion, i.e., π_i = n^{-1} Σ_{i=1}^n A_i. The results are summarized in Table 2. The estimators perform comparably to the cases where the propensity score model is correctly specified, which suggests that the proposed estimators may not be sensitive to misspecified propensity score models.
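The sample-proportion variant is simple to sketch. Under randomization the constant propensity is correct, so the second-step fit below still recovers the interaction coefficients; the generative model and lasso stand-in for the SCAD penalty are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 400, 20
X = rng.normal(size=(n, p))
beta0 = np.zeros(p); beta0[:2] = 1.0
A = rng.binomial(1, 0.5, size=n)          # randomized: constant propensity
Y = np.cos(X[:, 1]) + A * (X @ beta0) + rng.normal(size=n)

# Misspecified first step: a single sample proportion in place of a fitted
# penalized logistic model (and a constant working model for the mean).
pi_bar = A.mean()
D = (A - pi_bar)[:, None] * X
beta_hat = Lasso(alpha=0.05).fit(D, Y - Y.mean()).coef_
```

The two signal coefficients are recovered (up to lasso shrinkage) while the noise coefficients stay near zero, matching the insensitivity reported in Table 2.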

Real data example
We further examine our method on the data set from the NIMH-funded STAR*D study, in which a total of 4041 patients with nonpsychotic major depressive disorder (MDD) participated. The aim of the study was to determine the effectiveness of different treatments for people who had not responded to initial medication treatment. Patients first received citalopram (CIT), an SSRI medication. After 8-12 weeks, three more levels of treatment were offered to participants whose first treatment did not give an acceptable response. Available treatment strategies at Level 2 include switching to sertraline (SER), venlafaxine (VEN), bupropion (BUP) or cognitive therapy (CT), and augmenting CIT with one more treatment. Two Level 2A switch options, VEN and BUP, were provided for patients receiving CT without sufficient improvement. Four treatments were available at Level 3 for participants without an adequate response, including medication switch to mirtazapine (MIRT) or nortriptyline (NTP), and medication augmentation with either lithium (Li) or thyroid hormone (THY). Finally, treatment with tranylcypromine (TCP) or a combination of mirtazapine and venlafaxine (MIRT+VEN) was provided at Level 4 for those without sufficient improvement at Level 3.
Here, we compare the treatment strategies BUP (coded as 1) and SER (coded as 0) at Level 2, with covariates including the score changing rates at Level 1, yielding the optimal treatment regime shown in the corresponding display. Apart from these four variables, the SCAD penalty selects another four, including HAWAI (Native Hawaiian/other Pacific Islander), SKRSH (skin rash or not), NVTRM (CNS: tremors) and URNONE (no symptoms in the patients' urination category). The optimal treatment regime using the SCAD penalty is given in the corresponding display.

Conclusion
In this article, we propose a two-step estimator of the optimal treatment strategy, which selects variables and estimates parameters simultaneously in both the propensity score and outcome regression models using penalized regression. Our methodology can handle data sets whose dimensionality grows exponentially with the sample size. Oracle properties of the estimators are given. Variable selection is also involved in the misspecified model, and new mathematical techniques are developed to study the estimator's properties in a general optimization framework. The estimator is shown to be more efficient when the misspecified working model is "closer" to the conditional mean of the response, although our approach does not require correct specification of the baseline function. Numerical results demonstrate that the proposed estimator enjoys model selection consistency and has overall satisfactory performance.
In the case where there are multiple local solutions of our objective functions (5), (3) or (4), although our asymptotic theory only guarantees the existence of a local minimum possessing the oracle property, it is worth mentioning that we can identify the desired oracle estimator using existing algorithms (see Fan et al., 2014; Wang et al., 2013). Theoretical properties can be established in a similar fashion.
The current framework focuses on point-exposure studies. It would be interesting and practically useful to extend our results to dynamic treatment regimes. Significant effort is needed to handle model misspecification across multiple stages. This is beyond the scope of the current paper and is an interesting topic for future research.

A Technical conditions
We give here some regularity conditions on the design matrix and constraints on the penalty function and regularization parameters. Conditions 2 and 5 serve to guarantee the weak oracle property of the estimator, while Conditions 2′ and 5′ ensure the oracle property.

A.1 The design matrix
Condition A.1. The design matrix X satisfies the displayed conditions (with the additional bound required if the response is unbounded), where a_1 and a_2 are constants in (0, 1/2), C ∈ (0, 1), and d_β is a positive constant satisfying log(p) = O(n^{1−2d_β}). The sequences b_α and b_β are allowed to diverge. ∆ is a diagonal matrix with ith diagonal element π_i(1 − π_i).
Conditions (39), (40), (41), (42) and (43) are regularity conditions on the design matrix. They serve a purpose similar to Condition 2 in Fan and Lv (2011). Condition (44) controls the order of magnitude of the largest element in the design matrix. We require the propensity score to be bounded away from 0 and 1 to make X^T_1α ∆ X_1α and X^T_1β ∆ X_1β nondegenerate. As commented in Fan and Lv (2011), these constraints are easy to satisfy. Since the intercept is contained in the design matrix, conditions (42) and (43) imply the corresponding bounds stated in the display.

Condition A.2. The design matrix shall satisfy the displayed conditions, together with conditions (42) and (43). These conditions guarantee the existence of the estimators. Apart from (50), (52) and (57), the conditions on the penalty and regularization parameters are similar in rationale to those in Fan and Lv (2011).

B Proof of Proposition 1
We prove the case of unbounded variables only. Denote by F the set of functions {A_j(h_1) − A_j(h_2)}^T e, for j = 1, …, J and h_1, h_2 ∈ H. By the Symmetrization Lemma (Lemma 2.3.1, van der Vaart and Wellner, 1996), we obtain the displayed bound, where the ǫ_i's are independent Rademacher variables.
If we can show It follows from Lemma 2.2.2 in van der Vaart and Wellner (1996), and therefore it suffices to show For fixed z i , we prove that the stochastic process i a ij (h)z i ǫ i indexed by h is sub-Gaussian with respect to the semi-metric By Lemma 1 in Appendix C, it suffices to show max It follows from the mean-value theorem that max where hs , for s = 1, . . ., S, lies on the line segment between h 1 and h 2 . Together with the fact we have which implies (65). By arguments for sub-Gaussian processes as in the proof of Corollary 2.2.8 in van der Vaart and Wellner (1996), we have where we have with M some constant independent of ǫ and j; thus (64) is satisfied. This completes the proof of (24). To show (25), define

It follows from Chebyshev's inequality and the condition max
where the last inequality holds because sup using arguments similar to those used in showing (65). Now it follows from Lemma 2.3.7 in van der Vaart and Wellner (1996) that where ǫ 1 , . . ., ǫ n are i.i.d. Rademacher variables. Define A to be the event in (66) and B the event Then we can further bound (66) from above by 4Pr(A ∩ B) + 4Pr(B c ).
We first show Pr(B c ) ≤ 2/n. By the definition of the Orlicz norm || • || ψ 1 and Markov's inequality, On the event B, conditional on the z i 's, denote for some constant c 0 . Hence, on the event B, conditional on the z i 's, Note that the right-hand side is independent of the z i 's; thus, using the law of total expectation, we have , which together with (66) and (67) gives the result.
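For completeness, the ψ 1 -Orlicz norm used above is standard; with ψ 1 (x) = e x − 1 it is defined as

```latex
% Orlicz norm for \psi_1(x) = e^x - 1 (exponential tails):
\|X\|_{\psi_1} \;=\; \inf\big\{ C > 0 : \mathbb{E}\,\psi_1\!\big(|X|/C\big) \le 1 \big\},
\qquad \psi_1(x) = e^{x} - 1 ,
```

so that Markov's inequality applied to ψ 1 (|X|/||X|| ψ 1 ) yields the exponential tail bound Pr(|X| > t) ≤ (e t/||X|| ψ 1 − 1) −1 , which is presumably the mechanism behind the bound Pr(B c ) ≤ 2/n here.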

C Proof of Theorem 1
Before proving Theorem 1, we state the following lemmas. The proof of Lemma 2 is given in Appendix G.
Lemma 1. Let z = (z 1 , . . ., z n ) T be an n-dimensional random vector with independent mean-zero components, and let a ∈ R n .
(a) If z 1 , . . ., z n are bounded in [c, d], then for any ǫ ∈ (0, ∞),

Let Z 1 , Z 2 , Ẑ1 , Ẑ2 denote the sub-matrices of Z and Ẑ formed by the components in supp(β 0 ) and its complement, respectively. Define and their complements M c β , M c′ θ respectively. For a given vector ψ or matrix Ψ, ψ M stands for the sub-vector of ψ formed by the components in M, and Ψ M for the sub-matrix of Ψ formed by the rows in M. Besides, the superscript Ψ j refers to the jth column of the matrix Ψ, while the subscript Ψ i stands for the ith row of Ψ.
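The display elided in part (a) of Lemma 1 is presumably a Hoeffding-type tail bound; for independent mean-zero z i taking values in [c, d] and a fixed weight vector a, Hoeffding's inequality in its standard form reads:

```latex
% Hoeffding's inequality for a weighted sum of independent,
% mean-zero variables z_i bounded in [c, d]:
\Pr\big( |a^{T} z| > \epsilon \big)
\;\le\;
2 \exp\!\Big( - \frac{2\epsilon^{2}}{(d - c)^{2}\, \|a\|_{2}^{2}} \Big) .
```

This follows since each summand a i z i has range of length |a i |(d − c), so the sum of squared ranges is (d − c) 2 ||a|| 2 2 .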
Proof of Theorem 1. We break the proof into three steps. Based on Theorem 1 in Fan and Lv (2011), it suffices to prove the existence of β1 , θ1 inside the hypercube with K a large constant, conditional on the event ε, satisfying Step 1. We first show the existence of a solution to equations (68) and (69) inside ℵ for sufficiently large n. For any δ = (δ 1 , . . ., δ We write the left-hand side of (68) as on the set ε 3 ∪ ε 5 ∪ ε 13 , we have Define Note that η 1M β = I 3 in (75), which we represent here using a second-order Taylor expansion around α 1 , where r I 3 in (77) corresponds to the second-order remainder, whose jth component is given as where Σ( α) is a diagonal matrix with ith diagonal element π ′′ (x T 1αi α), with α lying on the line segment between α1 and α 1 . Since π ′′ (•) is a bounded function, we can bound ||r whose order of magnitude is O(s α n 1−2γα log 2 n) by (21). We decompose I 4 in (75) as η 2M β + Z T 1 ( Ẑ1 − Z 1 )β 1 . Using similar arguments, on the set ε 9 , it follows from (22) that Using a Taylor expansion, it is immediate to see that by (22). Combining (79) and (80) gives So far, we have by (76), (77), (78) and (81). Now we approximate ). We present it as It follows from a first-order Taylor expansion that the jth element of ω 1M β can be represented as where ∆( α1 ) is a diagonal matrix with ith diagonal component π(x i , α1 )(1 − π(x i , α1 )), where α1 lies on the line segment between α1 and α 1 . We decompose x j as the Hadamard product of two vectors, denoted by xj • xj , where Since ||A − π|| ∞ ≤ 1 and the elements of ∆( α1 ) are bounded, we have Combining (85) with (86) gives by (42) and (43). By the same argument, we can verify that ||ω 2M β || ∞ is of the same order. Note that on the set ε 11 , these together with (87) yield Define the vector-valued function then equation (68) is equivalent to Ψ 1 (δ β , δ θ ) = 0. It follows from (74), (82) and (88) that By arguments similar to those in the proof of Theorem 2 in Fan and Lv (2011), we have on
the set ε 1 ∪ ε 2 . Thus by (39), (15), (52) and (53), we have Therefore by (87), for sufficiently large n, if (δ and if (δ Similarly, we write the left-hand side of (69) as It is immediate to see that on the set ε 5 . The L ∞ norm of the first term in (93) is bounded by on the set ε 15 .
Step 3. Now we show that the second-order conditions (72) and (73) hold. Because (73) is directly implied by (55), it suffices to show that λ min ( ẐT 1 Ẑ1 ) ≥ λ min (X T 1β ∆X 1β ) for sufficiently large n. Since ( Ẑ1 − Z 1 ) T ( Ẑ1 − Z 1 ) is positive semi-definite, we have For any symmetric matrix Ψ, the absolute value of the minimum eigenvalue can be bounded by implied by the constraint max(l 1 , l 2 ) < γ α . This completes the proof.
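The eigenvalue comparison in Step 3 rests on two standard facts for symmetric matrices, stated here for reference:

```latex
% Weyl's perturbation bound, and a bound on the spectral norm
% of a symmetric matrix by its maximum absolute row sum:
\big| \lambda_{\min}(A) - \lambda_{\min}(B) \big| \;\le\; \|A - B\|_{2},
\qquad
\|A - B\|_{2} \;\le\; \max_{j} \sum_{k} \big| (A - B)_{jk} \big| ,
```

the latter because, for a symmetric matrix M, ||M|| 2 ≤ (||M|| 1 ||M|| ∞ ) 1/2 , and the two induced norms coincide by symmetry.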

D Proof of Theorem 2
We modify the right-hand side orders in ε 13 as O( √ n), and redefine We still have Pr(ε|X) → 1 due to the L 2 constraints in Condition 4.1. The convergence rate of α1 follows directly from Theorem 3 in Fan and Lv (2011); it thus suffices to establish the convergence rate of β under ε. The proof is divided into two steps. In the first step, we establish the s β /n consistency of β1 in the s β -dimensional subspace. In the second step, we show that the estimator is indeed a local minimizer, which satisfies (70) and (72).
Let n be sufficiently large such that s Rewriting Qn (β 1 ) − Qn (δ β ) using a second-order Taylor expansion, we have where Λ β is a diagonal matrix with maximum absolute element bounded by λ 2n κ β . We first show that the minimum eigenvalue of ẐT 1 Ẑ1 /n + Λ β is bounded from below by a positive number. By (112) and the conditions in (45) and (62), it suffices to show By (113) and the conditions in (42) and (43), (115) together with (45) entails, for sufficiently large n, that ). We next show that the second line in (116) is smaller than 0 for sufficiently large τ . It follows from the Cauchy-Schwarz inequality that by which we only need to show It follows from (45) that the left-hand side of (117) is larger than O(τ √ s β ). Therefore, for sufficiently large τ , it suffices to show that its right-hand side is O( √ s β ). We represent α1 − α 1 by It follows from (45) and (61) that Combining (119) with (120), we have which along with the condition in (26) gives that (117) is satisfied. The next step is to prove that the quantity in the first line of (116) is less than 0. An application of Markov's inequality entails , by which it suffices to show that E||v β || 2 2 I ε = O( s β n ). We rewrite nv β as the sum of the following terms: It is immediate to see that since tr( ẐT 1 Ẑ1 ) ≤ tr(X T 1β X 1β ) = s β n, considering that the covariates are standardized. On the set ε 15 , E||Z It follows from (30) that A first-order Taylor expansion gives by (22). Expanding Z T 1 ( Ẑ1 − Z 1 )β 1 to second order using a Taylor expansion around α 1 gives by (22). Similarly, it follows from (21) that By the monotonicity of ρ ′ 2 (•) and the condition in (61), we have This, along with (119)-(127) and the condition on the nonsparsity size max(l 1 , l Therefore the convergence rate of β1 is established.

Step 2. It remains to show that the vector β = ( βT 1 , 0) T is indeed a strict local minimizer of Qn (δ β ) on the space R p × R q , for which we need to check conditions (70) and (72).
E Proof of Theorem 3

Let B nαβ = X T 1β W ∆X 1α ; it follows from (34) that Combining (148) and (149) with the condition in (35) gives Therefore the covariance matrix of the above variable is Σ 22 , which is O(1) by (34). Define . Now the asymptotic normality will follow if we can show Besides, it follows from (32) and (149) that This completes the proof.

F Proof of Theorem 4
Similar to the proof of Theorem 3, under the given conditions, with probability tending to 1, we have In addition, under the conditions in Theorem 3, B −1/2 nβ ( β1 − β 1 ) is asymptotically equivalent to By the definition of Vn , we have We break the proof into two steps. In the first step, we show that (153) is o p (1). Next, we establish the asymptotic normality of the right-hand side of (152).
Step 1. Note that (153) is always nonnegative; it suffices to provide an upper bound. We further partition (154) into the following two pieces.
It follows from the Cauchy-Schwarz inequality that I 1 is smaller than

Step 2. Using arguments similar to those in the proof of Lemma 2, we can show that, with probability at least 1 − s β /n, the following event holds: (155) On the events (150) and (155), we have which is o(1). Therefore, it suffices to establish the asymptotic normality of or equivalently, Under condition (45) and λ max (X T 1β X 1β ) = O(n), we have ) in Condition A.1, where b 1 , b 2 are some positive constants, and ||B|| 2,∞ = sup ||v||=1 ||Bv|| ∞ .