The Annals of Applied Statistics

Variable selection and prediction with incomplete high-dimensional data

Ying Liu, Yuanjia Wang, Yang Feng, and Melanie M. Wall



We propose a Multiple Imputation Random Lasso (MIRL) method to select important variables and predict the outcome in an epidemiological study of Eating and Activity in Teens. In this study, $80\%$ of individuals have at least one variable missing, so applying variable selection methods developed for complete data after listwise deletion substantially reduces prediction power. Recent work on prediction models in the presence of incomplete data cannot adequately account for large numbers of variables with arbitrary missing patterns. MIRL combines penalized regression techniques with multiple imputation and stability selection. Extensive simulation studies compare MIRL with several alternatives. MIRL outperforms other methods in high-dimensional scenarios in terms of both reduced prediction error and improved variable selection performance, and its advantage is greater when the correlation among variables is high and the missing proportion is high. In the Eating and Activity in Teens study, MIRL also shows improved performance compared with other applicable methods when applied to boys and girls separately, and to a subgroup of low socioeconomic status (SES) Asian boys who are at high risk of developing obesity.
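The general idea of combining multiple imputation, penalized regression, and stability selection can be illustrated with a minimal sketch. This is not the authors' MIRL algorithm: the imputation step is a deliberately crude stand-in (random draws from each column's observed distribution rather than a proper imputation model), the penalized fit is a plain lasso rather than a random lasso, and every function name, parameter value, and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Plain lasso via cyclic coordinate descent (illustrative, not random lasso)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]          # remove j-th contribution
            rho = X[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * beta[j]          # add updated contribution back
    return beta

def impute_once(X, rng):
    """Stand-in stochastic imputation (assumption): fill each missing entry
    with a normal draw matched to the observed mean and SD of its column."""
    X = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        obs = X[~miss, j]
        X[miss, j] = rng.normal(obs.mean(), obs.std(), miss.sum())
    return X

def selection_frequencies(X, y, lam=0.3, n_imputations=5, n_bootstrap=20, seed=0):
    """Fraction of (imputation, bootstrap) fits in which each variable
    receives a nonzero lasso coefficient."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_imputations):
        Xi = impute_once(X, rng)
        for _ in range(n_bootstrap):
            idx = rng.integers(0, n, n)         # bootstrap resample of rows
            Xb, yb = Xi[idx], y[idx]
            Xb = (Xb - Xb.mean(axis=0)) / Xb.std(axis=0)
            yb = yb - yb.mean()
            counts += lasso_cd(Xb, yb, lam) != 0
    return counts / (n_imputations * n_bootstrap)

# Simulated example (hypothetical data): only the first two variables matter,
# and about 20% of the covariate entries are missing completely at random.
rng = np.random.default_rng(1)
X_full = rng.normal(size=(200, 10))
y = 2.0 * X_full[:, 0] - 1.5 * X_full[:, 1] + rng.normal(size=200)
X_miss = X_full.copy()
X_miss[rng.random(X_full.shape) < 0.2] = np.nan

freq = selection_frequencies(X_miss, y)
print(np.round(freq, 2))
```

Variables whose selection frequency exceeds a chosen threshold would be retained. Aggregating over both imputed data sets and bootstrap resamples is what lets the ranking reflect uncertainty from the missing values as well as from sampling.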

Article information

Ann. Appl. Stat., Volume 10, Number 1 (2016), 418-450.

Received: July 2014
Revised: December 2015
First available in Project Euclid: 25 March 2016


Keywords: Missing data; random lasso; multiple imputation; variable selection; stability selection; variable ranking


Liu, Ying; Wang, Yuanjia; Feng, Yang; Wall, Melanie M. Variable selection and prediction with incomplete high-dimensional data. Ann. Appl. Stat. 10 (2016), no. 1, 418–450. doi:10.1214/15-AOAS899.


