Statistical Science

Multiple Imputation: A Review of Practical and Theoretical Findings

Jared S. Murray

Abstract

Multiple imputation is a straightforward method for handling missing data in a principled fashion. This paper presents an overview of multiple imputation, including important theoretical results and their practical implications for generating and using multiple imputations. A review of strategies for generating imputations follows, including recent developments in flexible joint modeling and sequential regression/chained equations/fully conditional specification approaches. Finally, we compare and contrast different methods for generating imputations on a range of criteria before identifying promising avenues for future research.

Article information

Source
Statist. Sci., Volume 33, Number 2 (2018), 142-159.

Dates
First available in Project Euclid: 3 May 2018

Permanent link to this document
https://projecteuclid.org/euclid.ss/1525313139

Digital Object Identifier
doi:10.1214/18-STS644

Mathematical Reviews number (MathSciNet)
MR3797707

Zentralblatt MATH identifier
1397.62052

Keywords
Missing data; proper imputation; congeniality; chained equations; fully conditional specification; sequential regression; multivariate imputation

Citation

Murray, Jared S. Multiple Imputation: A Review of Practical and Theoretical Findings. Statist. Sci. 33 (2018), no. 2, 142–159. doi:10.1214/18-STS644. https://projecteuclid.org/euclid.ss/1525313139


