Electronic Journal of Statistics

Dimension reduction and variable selection in case control studies via regularized likelihood optimization

Florentina Bunea and Adrian Barbu

Full-text: Open access

Abstract

Dimension reduction and variable selection are performed routinely in case-control studies, but the literature on the theoretical aspects of the resulting estimates is scarce. We contribute to this literature by studying estimators obtained via ℓ1 penalized likelihood optimization. We show that the optimizers of the ℓ1 penalized retrospective likelihood coincide with the optimizers of the ℓ1 penalized prospective likelihood. This extends the results of Prentice and Pyke (1979), obtained for non-regularized likelihoods. We establish both the sup-norm consistency of the odds ratio, after model selection, and the consistency of subset selection of our estimators. The novelty of our theoretical results consists in the study of these properties under the case-control sampling scheme. Our results hold for selection performed over a large collection of candidate variables, with cardinality allowed to depend on, and be greater than, the sample size. We complement our theoretical results with a novel approach to determining data-driven tuning parameters, based on the bisection method. The resulting procedure offers significant computational savings when compared with grid-search-based methods. All our numerical experiments strongly support our theoretical findings.
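The abstract does not spell out the bisection rule, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes the tuning parameter is chosen so that a prescribed number of coefficients is nonzero, and that the number of selected variables is roughly non-increasing in the penalty level. The data are simulated retrospectively (covariates drawn given case or control status) and then fitted with the prospective ℓ1 penalized logistic likelihood, in the spirit of the equivalence stated above. All function names, the simulation design, and the proximal gradient solver are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(z, t):
    # Proximal operator of t * |.|; returns exact zeros when |z| <= t.
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_l1_logistic(X, y, lam, n_iter=150):
    """Proximal gradient (ISTA) for mean logistic loss + lam * ||beta||_1.
    The intercept b0 is left unpenalized."""
    n, p = len(X), len(X[0])
    # Step size 1/L, where L is a trace bound on the Hessian of the loss.
    L = sum(1.0 + sum(v * v for v in row) for row in X) / (4.0 * n)
    step = 1.0 / L
    ybar = sum(y) / n
    b0 = math.log(ybar / (1.0 - ybar))  # null-model intercept
    beta = [0.0] * p
    for _ in range(n_iter):
        resid = [sigmoid(b0 + sum(b * v for b, v in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        g0 = sum(resid) / n
        g = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        b0 -= step * g0
        beta = [soft_threshold(b - step * gj, step * lam)
                for b, gj in zip(beta, g)]
    return b0, beta


def lambda_max(X, y):
    # Penalty level at which beta = 0 is optimal (null-model gradient).
    n, p = len(X), len(X[0])
    ybar = sum(y) / n
    return max(abs(sum((yi - ybar) * row[j] for row, yi in zip(X, y))) / n
               for j in range(p))


def bisect_lambda(X, y, target, max_steps=15):
    """Bisection on lam, stopping once `target` coefficients are nonzero
    or after max_steps fits (hypothetical stopping rule)."""
    lo, hi = 0.0, lambda_max(X, y)
    for _ in range(max_steps):
        lam = 0.5 * (lo + hi)
        b0, beta = fit_l1_logistic(X, y, lam)
        k = sum(1 for b in beta if b != 0.0)
        if k == target:
            break
        if k > target:
            lo = lam  # too many variables selected: penalize more
        else:
            hi = lam  # too few variables selected: penalize less
    return lam, b0, beta, k


def simulate_case_control(n_cases, n_controls, mu, rng):
    # Retrospective sampling: covariates are drawn given disease status.
    X, y = [], []
    for label, m in ((1, n_cases), (0, n_controls)):
        for _ in range(m):
            X.append([rng.gauss(label * mj, 1.0) for mj in mu])
            y.append(label)
    return X, y


rng = random.Random(0)
mu = [1.5, -1.5, 0.0, 0.0, 0.0, 0.0]  # only the first two covariates matter
X, y = simulate_case_control(40, 40, mu, rng)
lam, b0, beta, k = bisect_lambda(X, y, target=2)
print("lambda:", lam, "selected:", k, "beta:", beta)
```

Each bisection step costs one model fit and halves the search interval, so the number of fits grows with the logarithm of the desired resolution in the tuning parameter, whereas a grid of the same resolution needs one fit per grid point. The sketch returns the last penalty level tried if the target support size is not reached within `max_steps`.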

Article information

Source
Electron. J. Statist., Volume 3 (2009), 1257-1287.

Dates
First available in Project Euclid: 4 December 2009

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1259944246

Digital Object Identifier
doi:10.1214/09-EJS537

Mathematical Reviews number (MathSciNet)
MR2566187

Zentralblatt MATH identifier
1326.62161

Subjects
Primary: 62J12: Generalized linear models
Secondary: 62J07: Ridge regression; shrinkage estimators
62K99: None of the above, but in this section

Keywords
Case-control studies, model selection, dimension reduction, logistic regression, lasso, regularization, prospective sampling, retrospective sampling, bisection method

Citation

Bunea, Florentina; Barbu, Adrian. Dimension reduction and variable selection in case control studies via regularized likelihood optimization. Electron. J. Statist. 3 (2009), 1257–1287. doi:10.1214/09-EJS537. https://projecteuclid.org/euclid.ejs/1259944246


References

  • [1] Anderson, J. A. (1972) Separate sample logistic discrimination. Biometrika, 59, 19–35.
  • [2] Barron, A., Birgé, L. and Massart, P. (1999) Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113, 301–413.
  • [3] Breslow, N. E., Robins, J. M. and Wellner, J. A. (2000) On the semi-parametric efficiency of logistic regression under case-control sampling. Bernoulli, 6(3), 447–455.
  • [4] Bunea, F. (2008) Honest variable selection in linear and logistic models via ℓ1 and ℓ1 + ℓ2 penalization. Electronic Journal of Statistics, 2, 1153–1194.
  • [5] Bunea, F. and McKeague, I. (2005) Covariate Selection for Semiparametric Hazard Function Regression Models. Journal of Multivariate Analysis, 92, 186–204.
  • [6] Burden, R. L. and Faires, J. D. (2001) Numerical analysis, 7th ed. Pacific Grove, CA: Brooks/Cole.
  • [7] Carroll, R. J., Wang, S. and Wang, C. Y. (1995) Prospective Analysis of Logistic Case-Control Studies. Journal of the American Statistical Association, 90(429), 157–169.
  • [8] Chen, Y-H., Chatterjee, N. and Carroll, R. J. (2009) Shrinkage Estimators for Robust and Efficient Inference in Haplotype-Based Case-Control Studies. Journal of the American Statistical Association, 104(485), 220–233.
  • [9] Devroye, L. and Lugosi, G. (2001) Combinatorial methods in density estimation. Springer-Verlag.
  • [10] Farewell, V. T. (1979) Some results on the estimation of logistic models based on retrospective data. Biometrika, 66, 27–32.
  • [11] Gill, R. D., Vardi, Y. and Wellner, J. A. (1988) Large sample theory of empirical distributions in biased sampling models. Ann. Statist., 16, 1069–1112.
  • [12] Hastie, T., Rosset, S., Tibshirani, R. and Zhu, J. (2004) The Entire Regularization Path for the Support Vector Machine. J. Mach. Learn. Res., 5, 1391–1415.
  • [13] Koh, K., Kim, S.-J. and Boyd, S. (2007) An Interior-Point Method for Large-Scale L1-Regularized Logistic Regression. J. Mach. Learn. Res., 8, 1519–1555.
  • [14] Leng, C., Lin, Y. and Wahba, G. (2006) A note on the lasso and related procedures in model selection. Statistica Sinica, 16, 1273–1284.
  • [15] Meier, L., van de Geer, S. and Bühlmann, P. (2009) High-dimensional additive modeling. Ann. Statist., to appear.
  • [16] Murphy, S. A. and van der Vaart, A. W. (2001) Semiparametric mixtures in case-control studies. Journal of Multivariate Analysis, 79, 1–32.
  • [17] Osius, G. (2009) Asymptotic inference for semiparametric association models. Ann. Statist., 37(1), 459–489.
  • [18] Park, M. Y. and Hastie, T. (2006) Regularization path algorithms for detecting gene interactions. Manuscript. Available from www-stat.stanford.edu/hastie/Papers/glasso.pdf.
  • [19] Park, M. Y. and Hastie, T. (2007) L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B, 69(4), 659–677.
  • [20] Prentice, R. L. and Pyke, R. (1979) Logistic disease incidence models and case-control studies. Biometrika, 66(3), 403–411.
  • [21] Qin, J. and Zhang, B. (1997) A goodness-of-fit test for logistic regression models based on case-control data. Biometrika, 84(3), 609–618.
  • [22] Ravikumar, P., Wainwright, M. J. and Lafferty, J. (2008) High-dimensional graphical model selection using ℓ1-regularized logistic regression. Technical Report, UC Berkeley, Dept of Statistics.
  • [23] Rosset, S. (2005) Tracking curved regularized optimization solution paths. Advances in Neural Information Processing Systems 17.
  • [24] Rosset, S. and Zhu, J. (2007) Piecewise linear regularized solution paths. Ann. Statist., 35(3), 1012–1030.
  • [25] Shi, W., Lee, K. E. and Wahba, G. (2007) Detecting disease-causing genes by LASSO-Patternsearch algorithm. BMC Proceedings, 1(Suppl 1), S60.
  • [26] van de Geer, S. (2008) High-dimensional generalized linear models and the Lasso. Ann. Statist., 36(2), 614–645.
  • [27] Wegkamp, M. H. (2007) Lasso type classifiers with a reject option. Electronic Journal of Statistics, 1, 155–168.
  • [28] Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E. and Lange, K. (2009) Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6), 714–721.