Annals of Applied Statistics
- Ann. Appl. Stat.
- Volume 13, Number 3 (2019), 1370-1396.
Imputation and post-selection inference in models with missing data: An application to colorectal cancer surveillance guidelines
It is common to encounter missing data among the potential predictor variables in the setting of model selection. For example, in a recent study we attempted to improve the US guidelines for risk stratification after screening colonoscopy (Cancer Causes Control 27 (2016) 1175–1185), with the aim of helping to reduce both overuse and underuse of follow-on surveillance colonoscopy. The goal was to incorporate selected additional informative variables into a neoplasia risk-prediction model, going beyond the three currently established risk factors, using a large dataset pooled from seven different prospective studies in North America. Unfortunately, not all candidate variables were collected in all studies, so that one or more important potential predictors were missing for over half of the subjects. Thus, while variable selection was a main focus of the study, it was necessary to address the substantial amount of missing data. Multiple imputation can effectively address missing data, and there are also good approaches to incorporate the variable selection process into model-based confidence intervals. However, there is no consensus on appropriate methods of inference that address both issues simultaneously. Our goal here is to study the properties of model-based confidence intervals in the setting of imputation for missing data followed by variable selection. We use both simulation and theory to compare three approaches to such post-imputation-selection inference: a multiple-imputation approach based on Rubin’s Rules for variance estimation (Comput. Statist. Data Anal. 71 (2014) 758–770); a single imputation-selection followed by bootstrap percentile confidence intervals; and a new bootstrap model-averaging approach presented here, following Efron (J. Amer. Statist. Assoc. 109 (2014) 991–1007). We investigate the relative strengths and weaknesses of each method. The “Rubin’s Rules” multiple imputation estimator can have severe undercoverage, and is not recommended. The imputation-selection estimator with bootstrap percentile confidence intervals works well. The bootstrap-model-averaged estimator, with the “Efron’s Rules” estimated variance, may be preferred if the true effect sizes are moderate. We apply these results to the colorectal neoplasia risk-prediction problem which motivated the present work.
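For readers unfamiliar with the first of the three approaches, the sketch below shows the standard Rubin's Rules pooling step for a single coefficient: the pooled estimate is the mean of the completed-data estimates, and the total variance is the mean within-imputation variance plus (1 + 1/M) times the between-imputation variance. It is only an illustration of the classical formula. The function name `pool_rubin` and the numbers are hypothetical, and the sketch does not reproduce how the paper or the cited reference treats coefficients that are dropped by variable selection in some imputed datasets.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool M completed-data estimates of one coefficient with Rubin's Rules.

    estimates        -- point estimates, one per imputed dataset
    within_variances -- squared standard errors, one per imputed dataset
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)          # average of the M point estimates
    within = mean(within_variances)   # average within-imputation variance
    between = variance(estimates)     # sample variance across imputations (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical values for illustration only (not taken from the paper).
estimates = [0.50, 0.70, 0.60]
within_variances = [0.04, 0.04, 0.04]
pooled, total = pool_rubin(estimates, within_variances)
```

The abstract's warning about undercoverage concerns intervals built from this kind of pooled variance after variable selection; the pooling arithmetic itself is standard.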
Received: April 2018
Revised: December 2018
First available in Project Euclid: 17 October 2019
Liu, Lin; Qiu, Yuqi; Natarajan, Loki; Messer, Karen. Imputation and post-selection inference in models with missing data: An application to colorectal cancer surveillance guidelines. Ann. Appl. Stat. 13 (2019), no. 3, 1370-1396. doi:10.1214/19-AOAS1239. https://projecteuclid.org/euclid.aoas/1571277757
- Supplement to “Imputation and post-selection inference in models with missing data: An application to colorectal cancer surveillance guidelines”. We provide the proofs of Theorem 3.1 and Corollary 3.1, as well as the R code for the simulation results in Tables 1 and 2.