Dimension reduction and variable selection in case control studies via regularized likelihood optimization

Dimension reduction and variable selection are performed routinely in case-control studies, but the literature on the theoretical aspects of the resulting estimates is scarce. We bring our contribution to this literature by studying estimators obtained via L1 penalized likelihood optimization. We show that the optimizers of the L1 penalized retrospective likelihood coincide with the optimizers of the L1 penalized prospective likelihood. This extends the results of Prentice and Pyke (1979), obtained for non-regularized likelihoods. We establish both the sup-norm consistency of the odds ratio, after model selection, and the consistency of subset selection of our estimators. The novelty of our theoretical results consists in the study of these properties under the case-control sampling scheme. Our results hold for selection performed over a large collection of candidate variables, with cardinality allowed to depend and be greater than the sample size. We complement our theoretical results with a novel approach of determining data driven tuning parameters, based on the bisection method. The resulting procedure offers significant computational savings when compared with grid search based methods. All our numerical experiments support strongly our theoretical findings.


Introduction
Case-control studies investigate the relationship between a random outcome Y , typically the disease status, and a number of candidate variables X j , 1 ≤ j ≤ M , that are potentially associated with Y .An important instance is provided by cancer studies, where the X j 's may quantify exposures to certain substances, or may be a collection of genes or genetic markers.One of the problems of interest in such studies is the identification, on the basis of the observed data, of a smaller subset of the set of candidate variables, that can reliably suffice for subsequent analyses.This problem becomes more challenging when the collection of potential disease factors M is very large.Our solution to this problem is variable selection via penalized likelihood optimization.We propose below a computationally efficient selection method and we investigate the theoretical properties of our estimates under the case-control data generating mechanism.We begin by giving the formal framework used throughout the paper and by making our objectives precise.
We consider binary outcomes Y ∈ {0, 1}, where Y = 0 labels the controls (non-disease), and Y = 1 labels the cases (disease).Let f 0 (x), f 1 (x) be, respectively, the conditional distributions of X = (X 1 , . . ., X M ) given Y = 0 and Y = 1.Under the case-control or retrospective sampling scheme we observe two independent samples from each of these conditional distributions.Formally, we observe X 0 1 . . ., X 0 n 0 i.i.d. with density f 0 (x), (1.1) X 1  1 . . ., X 1 n 1 i.i.d. with density f 1 (x), where the superscripts 0 and 1 of X are mnemonic of the fact that the samples correspond to Y = 0 and Y = 1, respectively.For simplicity of notation we assume that n 0 = n 1 = n.We assume that the outcome Y is connected to X via the logistic link function where β ∈ R M , δ 0 ∈ R.
In this article we will further assume that β = β * , where β * ∈ R M has non-zero components corresponding to an index set I * ⊆ {1, . . ., M } with cardinality |I * | = k * , possibly much smaller than M .The variable selection problem can be therefore rephrased in this context as the problem of estimating the unknown set I * from data generated as in (1.1).We note that this problem is not equivalent with the problem of estimating I * from i.i.d.pairs (X i , Y i ), with Y i generated from (1.2), as no random samples from the distribution of Y are available under the sampling scheme (1.1).However, the likelihoods corresponding to the two sampling schemes are intimately related, with results detailing their connections dating back to the 1970s.This link is essential for our procedure and we illustrate it below.Following Prentice and Pyke (1979), Section 3, an application of the Bayes' theorem combined with rearranging terms gives the following re-parametrization of f 0 and f 1 , respectively: × q(x) =: 2p 0 (x)q(x), (1.3) where δ is a new intercept parameter, different than δ 0 , β is given by (1.2), and q(x) is a positive function that integrates to one.The parameters δ, β and q are constrained by the requirement that f 0 and f 1 are probability densities, that is (1.4) p j (x)q(x)dx = 1 2 ; j ∈ {0, 1}.
Therefore, the likelihood function corresponding to data generated as in (1.1), to which we will refer in the sequel as to the retrospective likelihood, is: with parameters δ, β and q related via the constraint (1.4).Notice that L pros is, up to the intercept, exactly the standard logistic regression likelihood, had we observed 2n i.i.d observations (X i , Y i ), with equal number of 0 and 1 responses Y i generated according to (1.2); this quantity and the corresponding sampling scheme are typically referred to in this context as the prospective likelihood, and prospective sampling scheme, respectively.We will also use this terminology below.
The earlier results on the estimation of β via (1.5) did not address the model selection problem and were mostly concerned with the asymptotic properties of the estimates of β.Anderson (1972), Prentice and Pyke (1979), Farewell (1979) are among the pioneers of this work and showed that: (1) The vector β ∈ R M that maximizes the retrospective likelihood L retro (δ, β) under the constraint (1.4) coincides with β ∈ R M that maximizes the prospective likelihood L pros (δ, β); (2) The asymptotic distribution of β, derived under the sampling scheme (1.1) coincides with the asymptotic distribution of β derived under the prospective sampling scheme.
A number of important works continued this program, and provided in depth analyses of various other aspects of the estimators of β in retrospective studies.We refer to Gill, Vardi and Wellner (1988) and Carroll, Wang and Wang (1995), for more general sampling schemes, to Qin and Zhang (1997) for goodness of fit tests, to Breslow, Robins and Wellner (2000) for a study of the efficiency of the estimators, to Murphy and van der Vaart (2001), for partially observed covariates, to Osius (2009), for general semiparametric association models and to Chen, Chatterjee and Carroll (2009), for shrinkage methods tailored to inference in haplotype-based case-control studies and the asymptotic distribution of the resulting estimators.The variable selection problem was not considered in any of these works.Although model selection techniques are routinely used in case-control studies, and are typically based on testing via the asymptotic distribution of β, we are unaware of theoretical analyses of the performance of the resulting estimators under this sampling scheme.
Our contribution to this literature is to provide answers to the model selection analogues of (1), and to formulate goals that replace (2) above by goals targeted to the dimension reduction and selection aspects.Specifically, we propose a model selection method based on a penalized likelihood approach, with a sparsity inducing penalty.In this article we will focus on the ℓ 1 penalized likelihood with tuning parameter λ.We will show the following: (I) For any penalty function pen(β) that is independent of δ, the maximizer of L retro (δ, β) + pen(β) under the constraint (1.4) coincides with the maximizer of the prospective likelihood L pros (δ, β) + pen(β).The result announced in (I) is an immediate extension of result (1) above established in the 1970s for the unpenalized likelihood.We present it in Section 2 below.The results in (II) necessitate an analysis that is completely different than the one needed for (2), and we discuss (a) in Section 4 and (b) in Section 3. The immediate implication of (b) that is relevant to casecontrol studies is the fact that the estimator β, which is supported on a space of potentially much lower dimension than the original R M , yields supnorm consistent estimates of the odds ratio, where the odds ratio is defined as follows.The odds of having Y = 1 for an individual with characteristics X = x are O(x) = P (Y = 1|X = x)/P (Y = 0|X = x), and the odds ratio is defined as O(x)/O(x 0 ), for some reference characteristic X = x 0 .Under model (1.2), the odds ratio becomes We establish the after model selection consistency of estimators of this ratio in Section 3 below.
In this article we will concentrate on the analysis of estimators obtained by optimizing ℓ 1 penalized criteria.The literature on the theoretical aspects of such estimates has seen an explosion in the past few years, together with the development of efficient algorithms for computing them.The results pertaining to generalized linear models are most closely connected to our work, and all of them have been established for what we termed above the prospective sampling scheme, that is for data consisting of (X i , Y i ) i.i.d.pairs.For analyses conducted under this framework, we refer to van der Geer (2008) and Bunea (2008), for sparsity oracle inequalities for the Lasso, and to Bunea (2008), for correct subset selection; we refer to Ravikumar, Wainwright and Lafferty (2008) for related results on graphical models for binary variables.Motivated by the increasing usage of the Lasso type estimators for the analysis of data arising from case-control studies, see e.g., Shi, Lee and Wahba (2007), Wu et al (2009) and the references therein, we complement the existing literature on this type of estimators by providing their theoretical analysis under the case-control data generating mechanism (1.1).To the best of our knowledge, this is the first such analysis proposed in the literature.
The rest of the paper is organized as follow.Section 5 complements the theoretical results of Sections 2 -4, by providing a fast algorithm for finding a data driven tuning parameter for the ℓ 1 penalized optimization problem.Our procedure does not use a grid search.Instead, we use a generalization of the bisection method to compute tuning parameters that yield, respectively, estimators with exactly k = 0, 1, . . ., M non-zero components.We then use a 10-fold cross-validated logistic loss function to which we add a BICtype penalty to select one of these estimators.The corresponding procedure is easy to implement and offers important computational savings over a grid search based procedure, which can be 50 times slower, for the same degree of accuracy.Section 6 contains a detailed analysis of the proposed estimators, via simulations.It supports very strongly all our theoretical findings.Section 7 is a conclusion section, summarizing our findings.All the proofs are collected in the Appendix.

Penalized logistic regression for case-control data
In this section we investigate the type of penalty functions for which estimation of β via penalized retrospective log-likelihood optimization reduces to the estimation of β via penalized prospective likelihood optimization.Recall that we mentioned in (1.5) above that the retrospective likelihood can be written as the product of the prospective likelihood and a term depending on q: L retro (δ, β, q) =: L pros (δ, β) × L(q), where the parameters δ, β and q are constrained by (1.4).Let pen(β) be a function that depends on β alone, and is independent of δ and q.Define the unconstrained maxima where the second maximum is taken over all density functions q.The following result, proved in the Appendix, is an immediate extension of the result derived by Prentice and Pike (1979) for the unpenalized likelihood.
Lemma 2.1.Let pen(β) be any function independent of δ and q.Let δ, β and q be given by (2.1).Then where the maximizer is computed over all δ, β, q satisfying the constraint (1.4).
Lemma 2.1 shows that the penalized log-likelihood estimates of (δ, β), for a likelihood corresponding to data generated as in (1.1) coincide with the penalized prospective log-likelihood estimates, which we rescale by 2n: Lemma 2.1 holds for any function pen(β), as long as it is independent of δ and q.Since we are interested in dimension reduction, we will consider a sparsity inducing penalty.Throughout this paper our estimates will be obtained via (2.2) with the ℓ 1 penalty given below for a tuning parameter λ that will be made precise in the following section.

Consistent estimation of the odds ratio after variable selection
Recall that the true odds ratio (1.6) is given in terms of β * , which is supported on a space of dimension k * , possibly much smaller than M .We show that the estimated odds ratio based on the selected variables corresponding to the non-zero elements of β given by (2.2) above, for penalty (2.3), provides a consistent estimate, in the supremum norm, of the odds ratio: with probability converging to one.For simplicity of notation, we assume in what follows that x 0 = 0.For uniformity of notation, we also denote the intercept parameter given in (1.3) by δ * .Our arguments are based on the following central fact, that may be of independent interest.Define the functions ℓ 0 (θ) =: ℓ 0 (θ; x) = log(1 + e θ ′ x ), and recall the notation θ = (δ, β).In order to write what follows in a compact way we further denote by P 0 and P 1 the probability measures corresponding to the densities f 0 and f 1 defined in (1.3), respectively.For a generic function g we write P 0 g and P 1 g for integration with respect to P 0 and P 1 , respectively.With this notation we define the quantity Theorem 3.1 below, which is central to our paper, establishes that the difference ∆ is small.The proof, given in the Appendix, uses the control of appropriately scaled empirical processes corresponding to the two samples.If ∆ ( δ, β), (δ * , β * ) is small with high probability, relatively standard arguments can be used to show that is also small, with high probability, and we establish this in Corollary 3.2 below.
To arrive at our desired result (3.1), we need to complement Corollary 3.2 with a study of the difference | δ−δ * |.This is done in Theorem 3.3, which also contains a stronger result: it shows that β adapts to the unknown sparsity of β * , in that it is a consistent estimator of β, with the rate of convergence of an estimate based only on I * variables.
The combination of Theorem 3.1, Corollary 3.2 and Theorem 3.3 yields the desired sup-norm consistency of the odds ratio stated in Theorem 3.4.All proofs are collected in the Appendix.All our theoretical results will be proved under the assumption that the design variables and the true parameter components are bounded.We formalize this in Assumption 1 below.

Assumption 1.
(i) There exits a constant L > 0, independent of M and n, such that |X j i | ≤ L, for all i and j, with probability 1.
(ii) There exits a constant B > 0, independent of M and n, such that max Let δ n be any sequence converging to zero with n.Define the tuning sequence If δ n = 1/n and M is polynomial in n, the tuning sequence r is of the order . The results of this section will be relative to estimators obtained via the penalty (2.3), with tuning parameter λ = 2r.For compactness of notation, we let θ . Whenever we use this compact notation, we will also use the notation θ ′ u or θ * ′ u, for some vector u ∈ R M +1 that will be implicitly assumed to have the first component equal to 1.
We give an immediate corollary of this theorem which establishes the supnorm consistency of θ ′ x.It is interesting to note that both Theorem 3.1 above and the corollary below hold under no assumptions on the dependence structure of the design variables.Corollary 3.2.Under Assumption 1, if k * r → 0, then for any γ > 0 The following theorem establishes rates of convergence for the estimates of δ * and β * .It requires minimal conditions on the design variables.We formalize them below.Let V be a (M + 1) × (M + 1) matrix containing a (k * + 1) × (k * + 1) identity matrix, corresponding to I * and the intercept, and with zero elements otherwise.Let X 1 , . . ., X n denote X 0 1 , . . ., X 0 n , and let X n+1 , . . ., X 2n denote X 1 1 , . . ., X 1 n .By convention, each X i is viewed as a vector in R M +1 with the first component equal to 1, i.e.X 0i = 1, for all i.We used this compact notation to avoid unnecessary superscripts.Let Σ be the (M + 1) × (M + 1) sample Hessian matrix with entries Remark.Condition H is a mild condition on Σ, as it requires that this matrix remain semi-positive definite after a slight modification of some of its diagonal elements, those corresponding to I * .This relaxes the conditions needed for the consistency of the estimators based on the non-regularized log-likelihood, when one requires that the Hessian matrix be positive definite, see e.g.Prentice and Pyke (1979).
Let c = 6/bw, for b given by Condition H above and for some positive number w that is arbitrarily close to 1.

Theorem 3.3. Under Assumption 1 and Condition
as n → ∞.
The theorem shows that the rate of convergence of β adapts to the unknown sparsity of β * : M j=1 | β j − β * j | has M terms, so we expect its size to be equal to the optimal rate 1/ √ n of each term multiplied by M .Theorem 3.3 shows that in fact β behaves like an estimator obtained in dimension k * , had this dimension been known, as its rate of convergence in the ℓ 1 norm is, up to constants and logarithmic terms, 1/ √ n multiplied by k * , for our choice of r.
Theorem 3.3 is the result announced in II(b) of the introduction: the estimators of β * analyzed under the retrospective sampling scheme exhibit the same adaptation to sparsity as those analyzed under the prospective sampling scheme.For a full analysis of the ℓ 1 penalized logistic regression estimates based on i.i.d.(X i , Y i ) pairs, with Y i generated as in (1.2) we refer to Bunea (2008), for results obtained under conditions similar to Condition H on the Hessian matrix, and to van de Geer (2008), for results on generalized linear models obtained under conditions on the covariance matrix of the covariates.The difference between the results obtained under the two sampling schemes is essentially minor, and consists in a slight difference in the size of the tuning parameter r, which needs to be larger, by a log n factor, for the case control studies.This is a very small price to pay for not having full information on the joint distribution of X and Y .
Corollary 3.2 and Theorem 3.3 (i) immediately imply, via a first order Taylor expansion, the desired result of this section.We summarize it in Theorem 3.4 below which shows that for appropriate choices of the tuning parameters r, and if the size of the true model does not grow very fast relative to 1/r then, under minimal conditions on the covariates, we can estimate the odds ratio consistently.

Consistent variable selection
In this section we investigate the consistency of the index set I corresponding to the non-zero components of the estimator β discussed above.We show that P(I * = Î) −→ 1 holds for the retrospective scheme, under conditions similar to those needed in the prospective sampling scheme.We state them below.
Condition 1.There exists d > 0 such that For p 0 and p 1 defined in (1.3) above, we assume that the following also holds.
Condition 2. There exits d > 0 such that Remark 1.The constants d in the two conditions above need not be the same; we used the same letter for clarity of notation.Remark 2. The two conditions above can be regarded as conditions guaranteeing the identifiability of the set of true variables I * .Condition 1 requires that there exists some degree of separation between the variables in I * and the rest, in that the correlations between variables in these respective sets are bounded, up to constants, by 1/k * .If k * is small to moderate, the restriction is mild.Condition 2 reinforces Condition 1, by requiring that these variables remain separated even when separation is measured by the entries in the Hessian matrix, which can be regarded as weighted correlations.
Let δ n be any sequence converging to zero with n.Define the tuning sequence and notice that the last term in this definition of r differs by a factor of √ log M from the last term in r given in (3.3) of the previous section.If δ n = 1/n and M is polynomial in n, the tuning sequence is again of the order log n . The following assumption reflects the fact that we can only detect coefficients above the noise level, as quantified by r.Theorem 4.1 shows that I, analyzed under the retrospective sampling scheme, is a consistent estimator of I * , under essentially the same conditions needed for its analysis under the prospective sampling scheme, see Bunea (2008) and also Ravikumar, Wainwright and Lafferty (2008) for related results.

Data based tuning sequences and the bisection method
In Sections 3 and 4 above we showed that if the cardinality of the true model is not larger than √ n, up to constants, the proposed method yields consistent estimation of the odds ratio and of the support of β * , for tuning sequences r given in (3.3) and (4.1), respectively.Since the constants involved in these expressions may be conservative, we complement our theoretical results by offering in this section a fully data driven construction of the tuning parameters.Typical methods involve two main steps.In the first step one computes the regularization path of the solution to (2.2), as the tuning parameter r varies.Then, in a second step, one selects the appropriate value of r from a fine grid of values, by cross-validating the log-likelihood.
We present here a method that follows in spirit this idea, but with two main modifications: we do not need to compute the regularization path, but rather a sketch of it and we choose r via a combined cross-validation/BIC -type procedure.We describe the resulting technique in what follows.
We begin by noting that, unlike ℓ 1 penalized least squares, the regularization path for the ℓ 1 penalized logistic likelihood is not piecewise linear and it cannot be computed analytically, see e.g., Rosset and Zhu (2007), Koh et al. (2007).In this case, approximate regularization paths can be obtained from path following algorithms, see, e.g., Hastie et al. (2004), Park andHastie (2006, 2007), Rosset (2005).These constructions rely on the following observation: the values of the tuning parameter r define a partition of the positive axis, such that for all r belonging to an interval of this partition we obtain the same sparsity pattern.Thus, the full information on the sparsity pattern can be recovered from having a representative inside each such interval.
We call the path corresponding to these representative values a "sketch" of the regularization path.The problem therefore reduces to finding the set of representatives.For this, one could perform a grid search and find the intervals where the sparsity pattern change.However, a grid search method may not include some of these intervals, as they can have arbitrary length.Thus, a grid search may skip some of the sparsity patterns that are in fact present in the regularization path.Figure 1 below offers an instance of this fact.It shows that a grid search with the same computational complexity as the generalized bisection method GBM, described in detail below, can fail to contain the true index set I * in the corresponding regularization path.In this example I * has three elements, which can be recovered in the left panel, and not in the right panel, as there is no value of r in the grid we considered for which we can have exactly three non-zero components.
Our procedure replaces the grid search with a different method, that allows us to obtain, for each dimension 1 ≤ k ≤ M , a value of r for which the solution given by (2.2) has exactly k non-zero components.For each tuning parameter r, let the number of nonzero entries in βr be denoted by n(r).For each dimension 1 ≤ k ≤ M we let h k (r) = n(r) − k.Our method consists in finding r such that h k (r) = 0. To solve this equation we use the Bisection Method, e.g.Burden and Faires (2001), which is a well established computationally efficient method for finding a root z ∈ R of a generic function h(z).The bisection method can be summarized as follows.Let α be the desired degree of accuracy of the algorithm.
We can apply the basic bisection method to solve h k (r) = 0, for each k, and obtain the desired values of the tuning sequence r 1 , . . ., r M .However, performing BBM for each dimension k separately is not computationally efficient, as there is a large amount of overlapped computation.In what follows we propose an extension of the BBM that finds a sequence r 0 , ..., r M such that n(r k ) = k, for all 0 ≤ k ≤ M .The extension uses a queue consisting of pairs (r i , r j ) such that n(r i ) < n(r j ) − 1.
The General Bisection Method for all k (GBM).Initialize all r i with −1.
1. Choose r 0 very large, such that n(r 0 ) = 0. Choose r n = 0, hence n(r n ) = n.2. Initialize a queue q with the pair (r 0 , r n ).The GBM described above offers a way of obtaining candidate values r 0 , . . ., r M for the tuning parameter r.Observe that this method does not use anything specific to the ℓ 1 -regularized logistic loss function.The GBM can be used in connection with any penalized loss function, for a sparsity inducing penalty.This makes the GBM method more general than the one proposed in Park and Hastie (2007), which is restricted to ℓ 1 -regularized generalized linear models.
We performed a quantitative evaluation of the GBM on four problems of different difficulty levels.The data were simulated from a Normal distribution, described in Section 6 below as NOR IID.In Table 1 below we compared the GBM and a grid search in terms of their capability of constructing approximate regularization paths containing the true I * .The percentages reported in the table are percentages of time I * was in the path, over 250 simulations.Table 1 indicates that if the set I * is in the regularization path, it will also be in the GBM path sketch, obtained at a low computational expense.
To complete the selection procedure, we use the dimension stabilized pfold cross-validation procedure summarized below.Let D denote the whole data set, and let D = D 1 ∪ . . .∪ D p be a partition of D in p disjoint subsets, each subset containing the same percentage of cases and controls.In all the experiments presented in the next section we took p = 10.Let D −j = D \ D j .We will denote by r j k a candidate tuning parameter determined using the GBM on D −j .We denote by I j k the set of indices corresponding to the non-zero coefficients of the estimator of β given by

With k from
Step 3, use the BBM on the whole data set D to find the tuning sequence r k and then compute the final estimators using (2.2) and this tuning sequence.Remark.In Section 4 above we showed that the tuning sequence required for correct variable selection needs to be slightly larger than the one needed for accurate estimation of the odds ratio.This suggests that cross-validation, which is routinely used for the latter purpose, is not the appropriate method for constructing a data driven tuning sequence for accurate variable identification.This fact is becoming well recognized in practice and was also pointed out in Leng, C., Lin, Y., and Wahba, G. (2006), for linear models.This motivates the modification we used in Step 3 of our procedure described above, where we added a BIC-type penalty to the cross-validated loss function.BIC-type methods have been used successfully for correct variable identification in a large variety of contexts, and we refer to Bunea and McKeague (2005) and the references therein for related results in prospective models.The numerical experiments presented in Section 6 below indicate very strongly that the same is true for our variable selection procedure, applied to the logistic likelihood and case-control type data.The theoretical analysis of this procedure is beyond the scope of this paper and will be addressed in future work.6. Numerical Experiments 6.1.Simulation design.In this section we illustrate our proposed methodology via simulations, for the three data generating mechanisms described below.We generate X = (X 1 , . . ., X M ) ∈ R M according to one of the following three distributions: (1) SNP: X 1 , . . ., X M are i.i.d. as U , where U is the discrete random variable having zero mean and variance 1: (2) NOR IID: X ∼ N (0, σ 2 I M ).
(3) NOR CORR: X ∼ N (0, Σ).We used the label SNP for the first type of X we considered, as this type of distribution is encountered in the analysis of Single Nucleotide Polymorphism type data; the i.i.d.assumption is not typically met for the basic SNP data sets, but is instead met for what is called tagging SNPs, which are collections of SNP representatives picked at large enough intervals from one another to mimic independence; we used the label SNP here for brevity.
For each of the three distributions of X given above we generated two samples, of size n each, from f 0 (x) and f 1 (x) described in (1.1), respectively, for given P(Y = 1) = π and using the logistic link (1.2).
The non-zero entries of β were set to 1, that is β j = 1, j ∈ I * .In our experiments, we took π = P (Y = 1) = 0.01, i.e. we assumed a rare incidence of the disease.The value of δ 0 given by (1.2) was found numerically in order to obtain the desired incidence π.

6.2.
The behavior of β ′ x.The quality of the estimators of the odds ratio R(x), for a given baseline x 0 , is dictated by the quality of β ′ x as an estimator of β * ′ x.Notice that the sup-norm consistency of β ′ x that follows from Corollary 3.2 and Theorem 3.3 (i) above immediately implies its consistency in empirical norm or mean squared error We present below a numerical study of the mean squared error MSE for k * = 3 and k * = 10, for different values of n and M , with M possibly larger than n.In all the results of this section we report the median MSEs over 200 simulations.Figure 2 shows the decrease of the MSE as the sample size increases, when M is kept fixed and set to M = 2000 in this example.Figure 3 shows that for given configurations (k * , 2n) the size of the MSE is essentially unaffected by an increase in the number of predictors M from 250 to 2000.Our findings are consistent with the behavior of the Lasso estimates in other generalized linear models and with our Theorem 3.4 of Section 3. The overall conclusion is that the size of the MSE is influenced by the model size k * and not by M , the total number of variables in the model.When the variables X j are correlated, the MSE values are higher for the larger value of k * = 10 and very similar to the MSE values obtained for the uncorrelated variables when k * = 3.This supports the fact that the dependency structure of the variables X j 's per se does not have a dramatic impact on the MSE, as suggested by our theoretical results of Section 3, which hold under the very mild Condition H on the design variables.

Variable selection accuracy.
In this section we consider the same simulation scenario as above, and we shift focus to the quality of variable selection.In Figure 4 below we investigated the sensitivity of the estimated probability of correct variable selection, I * = Î, or correct inclusion, I * ⊆ Î, as we varied M = 250, 500, 1000, 2000.In the interest of space, we only report below the cases k * = 3, 2n = 300 (left column) and k * = 10, 2n = 1000 (right column).The results of this section support strongly the theoretical results of Section 4. Our method can identify the correct model with high probability.The accuracy of selection is influenced by the true model size k * and the dependence structure of the variables X j , as shown in Figure 5 and it is

Conclusions
We summarize our overall contributions in this section.
1.In this article we offered a theoretical analysis of the quality of model selection-type estimators in case-control studies.Our focus has been on optimizers of the ℓ 1 regularized logistic likelihood.We showed that these estimators, analyzed under the case-control sampling scheme have model selection properties that are similar to those of the estimates analyzed under the prospective sampling scheme.In particular, we established the after model selection consistency of the odds ratio, and the consistency of subset selection.To the best of our knowledge, this is the first such theoretical analysis conducted for this sampling scheme.
2. We introduced a computationally efficient variable selection and dimension reduction procedure that uses a generalization of the Bisection Method, the GBM.We used the GBM to find simultaneously M tuning parameters, each yielding an estimator with exactly k non-zero entries, 1 ≤ k ≤ M .The final estimator is selected from the set of these M candidates as the minimizer of a p-fold cross-validated log-likelihood to which we added a BIC-type penalty.This technique is general and can be used in connection with other loss functions and sparsity inducing penalty terms.The full theoretical investigation of this promising method is the subject of future research.All our simulation experiments indicate very good performance of this procedure in all the scenarios we considered.Moreover, our technique provides important computational savings over grid search based methods.

Appendix
Proof of Lemma 2.1.Let ( δ, β) and q be given by (2.1).We show that the maximum value of log L retros (δ, β, q) + pen(β), over all (δ, β, q) that satisfy the constraint (1.4), is bounded above and below by log L retros ( δ, β, q) + pen( β).The bound from above is immediate, as a constraint maximum is always smaller or equal than the unconstrained maximum.To show the bound from below, we only need to verify that ( δ, β, q) given by (2.1) satisfy the constraint (1.4).
Let G(δ, β) = log L pros (δ, β)+pen(β).If ( δ, β) are given by (2.1), and pen(β) is not a function of δ then, in particular, we have : and also where p j denotes p j evaluated at ( δ, β).Recall that the maximum likelihood estimator of q is where δ a denotes the Dirac function.Then, condition (1.4) is satisfied by ( δ, β), q, since This concludes the proof of this Lemma.
Proof of Theorem 3.1.We begin by introducing the notation used in the sequel.Let P 0 n =: denote the empirical measures associated with the two samples.For any function g of generic argument x we will use the notation P j n g(x) = 1 n n i=1 g(x i ), for j = 0, 1. Recall that we defined the functions and recall the notation θ = (δ, β).Then, by the definition of the estimator we have for all θ = (δ, β).In particular, for θ = 0, this shows that where L is a common bound on X 0 ij , X 1 ij for all i and j.Therefore, the estimator is effectively computed on the parameter set and it is therefore enough to restrict our study to this set.
Recall that we defined ∆( θ, θ * ) in (3.2) as Note further that since (7.1) holds for all θ it holds in particular for for θ = θ * .Then, by adding ∆( θ, θ * ) + r M j=1 | β j − β * j | to both sides of (7.1) and re-arranging terms we obtain Let ǫ > 0 be a quantity that will be made precise below.Then, for θ ∈ C given by (7.2) above we define where |v| 1 = d j=1 |v j | denotes the ℓ 1 norm of a generic vector v ∈ R d , for some d.Define the events We will argue below that: Fact 2. P(E c 0 ) → 0 and P(E c 1 ) → 0.
Assume that Fact 1 and Fact 2 hold.Then, if Fact 1 holds, both terms in the left hand side of display (7.5) are positive.Thus, in particular, display (7.5) yields on the event Recall now that we have assumed that max j∈I * |β * j | ≤ B, |δ * | ≤ B, for some positive constant B and that δ ∈ C, and so | δ| ≤ L log 2 2r .Thus, on the event Then, for our choice of r, under the assumption that rk * → 0 and for the choice of ǫ given below in (7.11), the right hand side of the above display converges to zero with n.Therefore, for any α > 0 we have the desired result: (7.6) P(∆( θ, θ * ) > α) ≤ P(E c 0 ) + P(E c 1 ) −→ 0, provided Fact 2 holds.We argue in what follows that the two facts hold.
Proof of Fact 1. Recall the definition (1.3) of f 0 (x), f 1 (x), q(x) and p 1 (x), and notice that we have f 0 (x) + f 1 (x) = 2q(x).Let θ′ x be an intermediate point between θ * ′ x and θ ′ x.Then, a second order Taylor expansion gives: which shows that the left hand side is positive.
Proof of Fact 2. Define the re-scaled empirical processes and, and notice that (7.8) The proof of Fact 2 therefore relies on the control of G 0 n and G 1 n .For this, we use the bounded difference inequality, see e.g.Theorem 2.2, page 8 in [9].To apply it we need to evaluate by how much G 0 n and G 1 n change if we change the i-th variable X 0 i , X 1 i , respectively, while keeping the others fixed.Recall that is the empirical measure putting mass 1/n at each observation X i .Let P 0′ n be the empirical measure where the inequality follows immediately by a first order Taylor expansion and the assumption that all X variables are bounded by L > 1.The calculations involving G 1 n are identical, and yield the bound 8L n .Therefore, we can apply the bounded difference inequality to obtain that 8L 2 , (7.10) We will use Lemma 3 in [27] to obtain bounds on E 0 G 0 n and E 1 G 1 n .We re-state a version of it here, for ease of reference.
Let J n be an integer such that 2 Jn ≥ n and 0 < ǫ ≤ log 2 r Then, if the functions l 0 and l 1 defined above are Lipschitz in θ ′ x and the components of x are bounded by L, with probability one, then both , where C 1 , C 2 are positive constants depending on the respective Lipschitz constants and L.
Notice that l 0 and l 1 are Lipschitz in t = θ ′ x, with respective constants 1 and 2. Also, inspection of the chaining argument used in the proof of this lemma shows that we can take J n = (M ∨ n) and for r given in (3.3), which we also recall here Then, making the constants precise in the lemma above and taking (7.12) u = 4L 2 log 1 δ n , display (7.10) yields for any δ > 0, in particular for any δ = δ n → 0. This display in combination with (7.8) above gives the desired result.This completes the proof of this theorem.
Proof of Theorem 3.3.Recall that in (7.7) above we showed that with θ′ x being an intermediate point between θ * ′ x and θ ′ x.
Let γ > 0 be arbitrarily close to zero, fixed.Let A γ = {sup x | θ ′ x−θ * ′ x| ≤ γ}.By (7.13) above we have P(A γ ) → 1.On A γ we have θ * ′ x ≥ θ′ x − γ, for all x, and therefore exp( θ′ x) for all x and with w arbitrarily close to one.Recall that f (x) = πf 0 (x) + (1 − π)f 1 (x), for P(Y = 1) = π.Then, on the set A γ we have where the last inequality follows from Condition H.Then, adding and subtracting r| δ − δ * | to both sides in (7.5) and using (7.14) above we obtain where we obtained the last line by using the Cauchy-Schwarz inequality followed by an inequality of the type 2uv ≤ au 2 + v 2 /a, for any a > 1.For clarity of exposition, we absorbed the term rǫ, which is of order 1/2 M ∨n , into the first term, which is much larger.Therefore on the set A γ , with P(A γ ) → 1, which is the desired result.
Proof of Theorem 4.1.Since P(I * = Î) ≥ 1 − P(I * ⊆ Î) − P( I ⊆ I * ), it is enough to control the two probabilities in the right hand side of this inequality separately.
Recall that we denoted the cardinality of I * by k * .Then, by the definition of the sets I and I * we have and let L n (θ) = log L pros (θ).By standard results in convex analysis, if Let k be arbitray, fixed.Define Recall that by convention X i0 = 1, for all i, and for compactness of notation we will write θ 0 = δ and θ * 0 = δ.
Assuming without loss of generality that the data has been scaled such that 1 2n n i=1 X 2 ik = 1 for all k, we therefore obtain, for every k ∈ I * , that In what follows we bound each of the above terms individually.

Bound on (III).
Let θ′ X i be a point between θ ′ X i and θ * ′ X i , for each i.A first order Taylor expansion yields where the last line follows from the first by dividing and multiplying the summands by p 1 (X i )p 0 (X i ) defined in (1.3) above.Let γ be arbitrarily close to zero, fixed.Let D γ be the set on which sup x | θ ′ x − θ * ′ x| ≤ γ.Corollary 3.2 above guarantees that P(D c γ ) → 0. Since γ is arbitrarily close to zero, the difference  for any δ = δ n → 0, as n → ∞.

Bound on (I).
First notice that, for any k we have Let Z ik = X 0 ik p 1 (X 0 i ) + X 1 ik p 1 (X 1 i ) − X 1 ik , and notice that EZ ik = 0, for all i and k.Since |Z ik | ≤ 3L, we use Hoeffding's inequality to obtain and this is bounded by δ/M for any v . Thus, by Assumption 2 and our choice of r given in (4.1), we have that (I) ≤ δ.
Collecting now the bounds on (I), (II) and (III) we therefore obtain P(I * ⊆ Î) ≤ 9δ → 0, for any δ = δ n → 0, as n → ∞, which completes the first part of the proof.
Control of P( I ⊆ I * ).The control of this quantity will essentially involve the usage of the same probability bounds as above.We will however need a H(µ), (7.18) where by convention we denote the first component of µ by δ.Let, by abuse of notation, µ ∈ R M +1 be the vector that has the first component δ, the other components of µ in positions corresponding to the index set I * , and is zero otherwise.Define the set By standard results in convex analysis it follows that, on the set B 1 , µ is a solution of (2.2).Recall that θ is a solution of (2.2) by construction.Then, using simple convex analysis arguments as in Proposition 4. where T n and S n have been defined in (7.15) above.We notice that the display (7.19) is almost identical to display (7.16) above, and so it can be bounded in a similar fashion.The only difference is in invoking versions of Theorem 3.3 and Corollary 3.1 corresponding to θ replaced by µ, which hold under the same assumptions and for the same tuning parameters.Consequently P( Î ⊆ I * ) ≤ 9δ → 0, which concludes the proof of this theorem. .

(
II) For pen(β) = λ M j=1 |β j |, we obtain estimators Î of I * and dimension reduced estimators β of β * by optimizing L pros (δ, β) + pen(β).Then: (a) The behavior of P( I = I * ), analyzed under the sampling scheme (1.1) is essentially the same as the behavior of P( I = I * ), evaluated under the prospective sampling schemes.(b)The estimator β, analyzed under the sampling scheme (1.1), adapts to the unknown sparsity of β * , which parallels the same property that can be established for β, under the prospective sampling schemes.

Figure 1 .
Figure1.Approximate regularization path for normal iid data, 2n = 300, M = 15 and k * = 3. Left: the regularization path sketch of β, obtained using GBM.Right: regularization path of β using a grid search with the same computational complexity as the GBM.
Given: a dataset D partitioned into p disjoint subsets: D = D 1 ∪ . . .∪ D p , each subset containing the same percentage of cases and controls.Let D −j = D \ D j for all j. 1.For each 1 ≤ k ≤ M and each fold j of the partition, 1 ≤ j ≤ p: Use the GBM to find r j k and I j k such n(r j k ) = |I j k | = k on D −j .Compute L j k =: log L pros ( δj , βj k ), as defined above.2. For each 1 ≤ k ≤ M : Compute L k =:

Figure 4 .Table 2 .
Figure 4.The percentage of times Î = I * (top row) and I * ⊆ Î(bottom row) vs number M of predictors.Left column: k * = 3, 2n = 300, right column: k * = 10, 2n = 1000.We present, in Figure5below, graphs of the percentage of times I * = Î and, respectively, I * ⊆ Î, for M = 2000, as we varied n, for k * = 3 and k * = 10.For ease of reference, we summarized in Table2the sample sizes needed to identify the true model at least 90% of the time.

XD|||
with ϑ arbitrarily close to 0, for each i, on the set D γ .Therefore we haveP |S n − T n | ≥ |β * k ij X ik 1 − p 1 (X i )p 0 (X i ) 4 ( θ j − θ * j ) ni X ij X ik ( θ j − θ * j ) a) + (b) + P(D c γ ).To bound (a) we first invoke Conditions 1 and 2 to obtain that1 2n M j=0 2n i=1 X ij X ik 1 − p 1 (X i )p 0 (X i ) 4 ( θ j − θ * j ) ≤ C k * M j=0 | θ j − θ * j |,where C is a positive constant, independent of n and depending only on the constant d of Conditions 1 and 2. Combining this with the assumption thatmin k∈I * |β * k | > 4r we arrive at (a) ≤ P θ j − θ * j | ≥ ck * r   ,for some positive constant c depending only on d and not on n.To bound (b) we recall that all X variables are bounded by L and that |D ni | is bounded by ϑ.Note that we can always choose ϑ = 1 k * , as we can always choose the corresponding γ.Therefore, the resulting bound is(b) ≤ P   M j=0 | θ j − θ * j | ≥ c 1 k * r   ,for some c 1 > 0 depending on L and not on n.Collecting the bounds above we thus have(III) ≤ M P θ j − θ * j | ≥ c 1 k * r   + M P(D c γ ) ≤ 3M (P(E 0 ) + P(E 1 )),where the last line follows from the proofs of Theorem 3.3 and Corollary 3.2, for sets E 0 , E 1 defined as in (7.4) above, and with r replaced by the slightly larger value given in (4.1) above.Then, as in the proof of Fact 2 of Theorem 3.1 and for this value of r, we obtain(III) ≤ 3M 2δ M ≤ 6δ −→ 0,for any δ = δ n → 0, as n → ∞.Bound on (II).We reason exactly as above, using now Assumption 2 and Condition 1 to obtain, for some constant c > 0, that (II) ≤ M P θ j − θ * j | ≥ ck * r   ≤ M (P(E 0 ) + P(E 1 )) ≤ M 2δ M ≤ 2δ −→ 0, (7.17 ) different intermediate argument.Let µ ∈ R k * +1, where we denote the first component of µ by δ.Define

Table 1 .
Percentage of times the I * was present in different approximate regularization paths for the Normal dataset NOR IID, M = 250.Grid has the same computational complexity as the GBM, while Grid×10 and Grid×50 are finer grids that are 10 respectively 50 times more expensive than the GBM.
k and computed on D −j by ( δj , βj k ).With log L pros (δ, β) defined in (2.2) above, let L j k =: log L pros ( δj , βj k ), computed on D j .With this notation, the procedure becomes: Variable Selection Procedure.