Statistical Science

Multiple Imputation for Multilevel Data with Continuous and Binary Variables

Vincent Audigier, Ian R. White, Shahab Jolani, Thomas P. A. Debray, Matteo Quartagno, James Carpenter, Stef van Buuren, and Matthieu Resche-Rigon

Full-text: Open access

Abstract

We present and compare multiple imputation methods for multilevel continuous and binary data where variables are systematically and sporadically missing. The methods are compared from a theoretical point of view and through an extensive simulation study motivated by a real dataset comprising multiple studies. The comparisons show why these multiple imputation methods are the most appropriate to handle missing values in a multilevel setting and why their relative performances can vary according to the missing data pattern, the multilevel structure and the type of missing variables. This study shows that valid inferences can only be obtained if the dataset includes a large number of clusters. In addition, it highlights that heteroscedastic multiple imputation methods provide more accurate inferences than homoscedastic methods, which should be reserved for data with few individuals per cluster. Finally, guidelines are given to choose the most suitable multiple imputation method according to the structure of the data.

Article information

Source
Statist. Sci., Volume 33, Number 2 (2018), 160-183.

Dates
First available in Project Euclid: 3 May 2018

Permanent link to this document
https://projecteuclid.org/euclid.ss/1525313140

Digital Object Identifier
doi:10.1214/18-STS646

Mathematical Reviews number (MathSciNet)
MR3797708

Keywords
Missing data; systematically missing values; multilevel data; mixed data; multiple imputation; joint modelling; fully conditional specification

Citation

Audigier, Vincent; White, Ian R.; Jolani, Shahab; Debray, Thomas P. A.; Quartagno, Matteo; Carpenter, James; van Buuren, Stef; Resche-Rigon, Matthieu. Multiple Imputation for Multilevel Data with Continuous and Binary Variables. Statist. Sci. 33 (2018), no. 2, 160–183. doi:10.1214/18-STS646. https://projecteuclid.org/euclid.ss/1525313140


Supplemental materials

  • Supplement to “Multiple Imputation for Multilevel Data with Continuous and Binary Variables”. Technical details on the posterior distributions of imputation model parameters and inference results for all configurations that have not been discussed in detail in the main text.
  • Supplement to “Multiple Imputation for Multilevel Data with Continuous and Binary Variables”. R code for the simulation study.