Annals of Applied Statistics

Imputation and post-selection inference in models with missing data: An application to colorectal cancer surveillance guidelines

Lin Liu, Yuqi Qiu, Loki Natarajan, and Karen Messer


It is common to encounter missing data among the potential predictor variables in the setting of model selection. For example, in a recent study we attempted to improve the US guidelines for risk stratification after screening colonoscopy (Cancer Causes Control 27 (2016) 1175–1185), with the aim of helping to reduce both overuse and underuse of follow-on surveillance colonoscopy. The goal was to incorporate selected additional informative variables into a neoplasia risk-prediction model, going beyond the three currently established risk factors, using a large dataset pooled from seven different prospective studies in North America. Unfortunately, not all candidate variables were collected in all studies, so that one or more important potential predictors were missing for over half of the subjects. Thus, while variable selection was a main focus of the study, it was necessary to address the substantial amount of missing data. Multiple imputation can effectively address missing data, and there are also good approaches to incorporate the variable selection process into model-based confidence intervals. However, there is no consensus on appropriate methods of inference that address both issues simultaneously. Our goal here is to study the properties of model-based confidence intervals in the setting of imputation for missing data followed by variable selection. We use both simulation and theory to compare three approaches to such post-imputation-selection inference: a multiple-imputation approach based on Rubin’s Rules for variance estimation (Comput. Statist. Data Anal. 71 (2014) 758–770); a single imputation-selection followed by bootstrap percentile confidence intervals; and a new bootstrap model-averaging approach presented here, following Efron (J. Amer. Statist. Assoc. 109 (2014) 991–1007). We investigate the relative strengths and weaknesses of each method. The “Rubin’s Rules” multiple imputation estimator can have severe undercoverage, and is not recommended. The imputation-selection estimator with bootstrap percentile confidence intervals works well. The bootstrap-model-averaged estimator, with the “Efron’s Rules” estimated variance, may be preferred if the true effect sizes are moderate. We apply these results to the colorectal neoplasia risk-prediction problem which motivated the present work.
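
As a rough illustration of two of the building blocks named above, the following Python sketch shows the textbook form of Rubin's Rules for pooling one coefficient across multiply imputed datasets, and a bootstrap percentile interval. It is a minimal sketch and not the authors' implementation (the supplement provides their R code). The function names `rubin_pool` and `percentile_interval` are hypothetical, and how the per-imputation estimates are obtained, including how a coefficient is treated when its variable is not selected, is left to the caller as an assumption.

```python
import numpy as np


def rubin_pool(estimates, variances):
    """Pool M completed-data estimates of one coefficient (Rubin's Rules).

    estimates: the M point estimates, one per imputed dataset.
    variances: the M corresponding squared standard errors.
    Returns the pooled estimate and its total variance.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                 # pooled point estimate
    within = u.mean()                # within-imputation variance
    between = q.var(ddof=1)          # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total


def percentile_interval(boot_estimates, level=0.95):
    """Bootstrap percentile interval from replicate estimates.

    boot_estimates: one estimate per bootstrap resample, where each
    resample is assumed to have been imputed, and then had variables
    selected, before the estimate was computed.
    """
    alpha = 1.0 - level
    lower, upper = np.percentile(
        boot_estimates, [100.0 * alpha / 2.0, 100.0 * (1.0 - alpha / 2.0)]
    )
    return float(lower), float(upper)
```

For example, `rubin_pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])` combines a within-imputation variance of 0.5 with a between-imputation variance of 1.0 inflated by the factor 1 + 1/3. The percentile interval, by contrast, uses no variance formula at all: it reads the interval endpoints directly off the empirical distribution of the bootstrap replicates.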

Article information

Ann. Appl. Stat., Volume 13, Number 3 (2019), 1370–1396.

Received: April 2018
Revised: December 2018
First available in Project Euclid: 17 October 2019

Keywords: Post-selection inference; missing data; multiple imputation; model selection; model averaging


Liu, Lin; Qiu, Yuqi; Natarajan, Loki; Messer, Karen. Imputation and post-selection inference in models with missing data: An application to colorectal cancer surveillance guidelines. Ann. Appl. Stat. 13 (2019), no. 3, 1370–1396. doi:10.1214/19-AOAS1239.

References


  • Carroll, R. J., Ruppert, D., Stefanski, L. A. and Crainiceanu, C. M. (2006). Measurement Error in Nonlinear Models: A Modern Perspective, 2nd ed. Monographs on Statistics and Applied Probability 105. Chapman & Hall/CRC, Boca Raton, FL.
  • Chatterjee, A. and Lahiri, S. N. (2010). Asymptotic properties of the residual bootstrap for Lasso estimators. Proc. Amer. Math. Soc. 138 4497–4509.
  • Chatterjee, A. and Lahiri, S. N. (2011). Bootstrapping lasso estimators. J. Amer. Statist. Assoc. 106 608–625.
  • Claeskens, G. (2016). Statistical model choice. Annu. Rev. Stat. Appl. 3 233–256.
  • Claeskens, G. and Consentino, F. (2008). Variable selection with incomplete covariate data. Biometrics 64 1062–1069.
  • Claeskens, G. and Hjort, N. L. (2003). The focused information criterion. J. Amer. Statist. Assoc. 98 900–945.
  • Claeskens, G. and Hjort, N. L. (2008a). Minimizing average risk in regression models. Econometric Theory 24 493–527.
  • Claeskens, G. and Hjort, N. L. (2008b). Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic Mathematics 27. Cambridge Univ. Press, Cambridge.
  • Efron, B. (2014). Estimation and accuracy after model selection. J. Amer. Statist. Assoc. 109 991–1007.
  • Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. J. Amer. Statist. Assoc. 102 359–378.
  • Heymans, M. W., van Buuren, S., Knol, D. L., van Mechelen, W. and de Vet, H. C. W. (2007). Variable selection under multiple imputation using the bootstrap in a prognostic study. BMC Med. Res. Methodol. 7.
  • Hjort, N. L. (2014). Comment [MR3265671]. J. Amer. Statist. Assoc. 109 1017–1020.
  • Hjort, N. L. and Claeskens, G. (2003). Frequentist model average estimators. J. Amer. Statist. Assoc. 98 879–899.
  • Hosmer, D. W. and Lemeshow, S. (1989). Applied Logistic Regression. Wiley-Interscience, New York.
  • Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. J. Amer. Statist. Assoc. 91 222–230.
  • Lachenbruch, P. A. (2011). Variable selection when missing values are present: A case study. Stat. Methods Med. Res. 20 429–444.
  • Lieberman, D. A., Rex, D. K., Winawer, S. J., Giardiello, F. M., Johnson, D. A. and Levin, T. R. (2012). Guidelines for colonoscopy surveillance after screening and polypectomy: A consensus update by the US multi-society task force on colorectal cancer. Gastroenterology 143 844–857.
  • Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data, 2nd ed. Wiley Series in Probability and Statistics. Wiley-Interscience, Hoboken, NJ.
  • Liu, L., Messer, K., Baron, J. A., Lieberman, D. A., Jacobs, E. T., Cross, A. J., Murphy, G., Martinez, M. E. and Gupta, S. (2016). A prognostic model for advanced colorectal neoplasia recurrence. Cancer Causes Control 27 1175–1185. DOI:10.1007/s10552-016-0795-5.
  • Liu, L., Qiu, Y., Natarajan, L. and Messer, K. (2019). Supplement to “Imputation and post-selection inference in models with missing data: An application to colorectal cancer surveillance guidelines.” DOI:10.1214/19-AOAS1239SUPP.
  • Long, Q. and Johnson, B. A. (2015). Variable selection in the presence of missing data: Resampling and imputation. Biostatistics 16 596–610.
  • Martinez, M. E., Thompson, P., Messer, K. et al. (2012). One-year risk of advanced colorectal neoplasia: United States vs. United Kingdom risk-stratification guidelines. Ann. Intern. Med. 12 856–864.
  • Meinshausen, N. and Bühlmann, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 417–473.
  • Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley, New York.
  • Schomaker, M. and Heumann, C. (2014). Model selection and model averaging after multiple imputation. Comput. Statist. Data Anal. 71 758–770.
  • Schomaker, M. and Heumann, C. (2018). Bootstrap inference when using multiple imputation. Stat. Med. 37 2252–2266.
  • Siegel, R. L., Miller, K. D. and Jemal, A. (2015). Cancer statistics. CA Cancer J. Clin. 65 5–29.
  • Tanner, M. A. and Wong, W. H. (1987). An application of imputation to an estimation problem in grouped lifetime analysis. Technometrics 29 23–32.
  • Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. Springer Series in Statistics. Springer, New York.
  • van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 45 1–67.
  • van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics 3. Cambridge Univ. Press, Cambridge.
  • Wood, A. M., White, I. R. and Royston, P. (2008). How should variable selection be performed with multiply imputed data? Stat. Med. 27 3227–3246.

Supplemental materials

  • Supplement to “Imputation and post-selection inference in models with missing data: An application to colorectal cancer surveillance guidelines”. We provide the proofs of Theorem 3.1 and Corollary 3.1, together with the R code for the simulation results in Tables 1 and 2.