The Annals of Applied Statistics

Calibrated imputation of numerical data under linear edit restrictions

Jeroen Pannekoek, Natalie Shlomo, and Ton De Waal


Abstract

A common problem faced by statistical institutes is that data may be missing from collected data sets. The typical way to overcome this problem is to impute the missing data. The problem of imputing missing data is complicated by the fact that statistical data often have to satisfy certain edit rules and that values of variables across units sometimes have to sum up to known totals. For numerical data, edit rules are most often formulated as linear restrictions on the variables. For example, for data on enterprises, edit rules could be that the profit and costs of an enterprise should sum up to its turnover and that the turnover should be at least zero. The totals of some variables across units may already be known from administrative data (e.g., turnover from a tax register) or estimated from other sources. Standard imputation methods for numerical data as described in the literature generally do not take such edit rules and totals into account. In this article, we describe algorithms for imputing missing numerical data that take edit restrictions into account and ensure that sums are calibrated to known totals. These algorithms are based on a sequential regression approach that uses regression predictions to impute the variables one by one. To assess the performance of the imputation methods, we carry out a simulation study and an evaluation study based on a real data set.
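To make the two kinds of constraints concrete, the following toy sketch checks the example edit rules from the abstract (profit + costs = turnover, turnover ≥ 0) and rescales imputed turnover values so that the column sums to a known total. It is an illustration only, not the sequential regression algorithms of the article: the records, the function names `satisfies_edits` and `calibrate_turnover`, the simple ratio adjustment, and the choice to let profit absorb the change are all assumptions made for this example.

```python
# Toy illustration only: hypothetical records and a simple ratio adjustment,
# NOT the sequential regression algorithms described in the article.

def satisfies_edits(record, tol=1e-9):
    """Check the two example edit rules from the abstract for one enterprise."""
    balance_ok = abs(record["profit"] + record["costs"] - record["turnover"]) <= tol
    nonneg_ok = record["turnover"] >= 0
    return balance_ok and nonneg_ok


def calibrate_turnover(records, imputed_ids, known_total):
    """Rescale imputed turnover so that turnover over all records sums to known_total."""
    observed_sum = sum(r["turnover"] for i, r in enumerate(records) if i not in imputed_ids)
    imputed_sum = sum(records[i]["turnover"] for i in imputed_ids)
    if imputed_sum == 0:
        raise ValueError("cannot rescale: imputed turnover sums to zero")
    factor = (known_total - observed_sum) / imputed_sum
    result = [dict(r) for r in records]  # work on copies, leave the input untouched
    for i in imputed_ids:
        result[i]["turnover"] = records[i]["turnover"] * factor
        # Assumption: profit absorbs the change so the balance edit still holds.
        result[i]["profit"] = result[i]["turnover"] - result[i]["costs"]
    return result


# Record 0 is fully observed; records 1 and 2 carry preliminary imputations.
records = [
    {"profit": 20.0, "costs": 80.0, "turnover": 100.0},
    {"profit": 10.0, "costs": 40.0, "turnover": 50.0},
    {"profit": 5.0, "costs": 25.0, "turnover": 30.0},
]
calibrated = calibrate_turnover(records, imputed_ids={1, 2}, known_total=200.0)
total_turnover = sum(r["turnover"] for r in calibrated)
all_edits_ok = all(satisfies_edits(r) for r in calibrated)
```

A plain ratio adjustment like this one can violate inequality edits (for example, when the known total is smaller than the observed sum, the factor turns negative), which is one reason a method that handles edit restrictions and totals jointly is needed.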

Article information

Source
Ann. Appl. Stat., Volume 7, Number 4 (2013), 1983-2006.

Dates
First available in Project Euclid: 23 December 2013

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1387823307

Digital Object Identifier
doi:10.1214/13-AOAS664

Mathematical Reviews number (MathSciNet)
MR3161710

Zentralblatt MATH identifier
1283.62166

Keywords
Linear edit restrictions; sequential regression imputation; Fourier–Motzkin elimination; benchmarking

Citation

Pannekoek, Jeroen; Shlomo, Natalie; De Waal, Ton. Calibrated imputation of numerical data under linear edit restrictions. Ann. Appl. Stat. 7 (2013), no. 4, 1983–2006. doi:10.1214/13-AOAS664. https://projecteuclid.org/euclid.aoas/1387823307


