Brazilian Journal of Probability and Statistics

Multiple imputation of unordered categorical missing data: A comparison of the multivariate normal imputation and multiple imputation by chained equations

Innocent Karangwa, Danelle Kotze, and Renette Blignaut


Missing data are common in survey data sets: enrolled subjects often do not have data recorded for all variables of interest. Inappropriate handling of missing values may negatively affect the inferences drawn, so special attention is needed when analysing incomplete data. Multivariate normal imputation (MVNI) and multiple imputation by chained equations (MICE) have emerged as the best techniques to deal with missing data. The former assumes a normal distribution for the variables in the imputation model, while the latter fills in missing values taking into account the distributional form of each variable to be imputed. This study examines the performance of these methods when data are missing at random on unordered categorical variables treated as predictors in regression models. First, a survey data set with no missing values is used to generate a data set with missing-at-random observations on unordered categorical variables. Then, the two methods are used separately to impute the missing values of the generated data set. Their performance is compared in terms of the bias and standard errors of the estimates from regression models that determine the association between a woman’s contraceptive method use status and her marital status, controlling for region of origin. The baseline data are from the 2007 Demographic and Health Survey (DHS) of the Democratic Republic of Congo. The findings indicate that, although MVNI relies on parametric statistical theory, it produces more accurate estimates than MICE for unordered categorical variables.
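The first step described in the abstract, deleting values from a complete data set so that they are missing at random, can be illustrated with a minimal Python sketch. This is not the authors' procedure: the records, the variable names (`region`, `marital`), the category labels, and the deletion probabilities are all hypothetical. The only property the sketch is meant to show is that the chance of deletion depends on a fully observed variable and not on the value being deleted.

```python
import random


def make_mar(records, target, driver, miss_prob, seed=0):
    """Return a copy of `records` with `target` set to None at random.

    The deletion probability depends only on the fully observed `driver`
    variable, never on the value of `target` itself, so the resulting
    missingness is missing at random (MAR).
    """
    rng = random.Random(seed)
    out = []
    for rec in records:
        rec = dict(rec)  # copy, so the complete data set stays untouched
        if rng.random() < miss_prob[rec[driver]]:
            rec[target] = None
        out.append(rec)
    return out


# Hypothetical complete data: region is always observed; marital status is
# the unordered categorical variable that will receive missing values.
data_rng = random.Random(42)
regions = ["urban", "rural"]
statuses = ["single", "married", "widowed", "divorced"]
complete = [
    {"region": data_rng.choice(regions), "marital": data_rng.choice(statuses)}
    for _ in range(2000)
]

# Assumed deletion probabilities by region (illustrative values only).
miss_prob = {"urban": 0.10, "rural": 0.40}
incomplete = make_mar(complete, target="marital", driver="region",
                      miss_prob=miss_prob)


def missing_rate(records, region):
    """Share of records in `region` whose marital status is missing."""
    rows = [r for r in records if r["region"] == region]
    return sum(r["marital"] is None for r in rows) / len(rows)


for region in regions:
    print(region, round(missing_rate(incomplete, region), 3))
```

Because the complete data are kept alongside the incomplete copy, estimates obtained after imputation can be compared with those from the complete data, which is how bias is assessed in a design of this kind.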

Article information

Braz. J. Probab. Stat., Volume 30, Number 4 (2016), 521-539.

Received: September 2014
Accepted: April 2015
First available in Project Euclid: 13 December 2016

Keywords: Missing data; missing at random; multiple imputation; multivariate normal imputation; multiple imputation by chained equations; categorical data


Karangwa, Innocent; Kotze, Danelle; Blignaut, Renette. Multiple imputation of unordered categorical missing data: A comparison of the multivariate normal imputation and multiple imputation by chained equations. Braz. J. Probab. Stat. 30 (2016), no. 4, 521–539. doi:10.1214/15-BJPS292.


