International Statistical Review

Automatic Editing for Business Surveys: An Assessment of Selected Algorithms

Ton de Waal and Wieger Coutinho

Abstract

Statistical offices are responsible for publishing accurate statistical information about many different aspects of society. This task is complicated considerably by the fact that data collected by statistical offices generally contain errors. These errors have to be corrected before reliable statistical information can be published. This correction process is referred to as statistical data editing. Traditionally, data editing was mainly an interactive activity aimed at correcting all data in every detail. For that reason the data editing process was both expensive and time-consuming. The editing process can be partly automated to improve its efficiency. One often divides the statistical data editing process into the error localisation step and the imputation step. In this article we restrict ourselves to discussing the former step, and provide an assessment, based on personal experience, of several selected algorithms for automatically solving the error localisation problem for numerical (continuous) data. Our article can be seen as an extension of the overview article by Liepins, Garfinkel & Kunnathur (1982). All algorithms we discuss are based on the (generalised) Fellegi-Holt paradigm, which states that the data of a record should be made to satisfy all edits by changing the fewest possible (weighted) number of fields. The error localisation problem may have several optimal solutions for a record. In contrast to what is common in the literature, most of the algorithms we describe aim to find all optimal solutions rather than just one. As numerical data mostly occur in business surveys, the described algorithms are mainly suitable for business surveys and less so for social surveys. For four algorithms we compare the computing times on six realistic data sets as well as their complexity.
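To make the (generalised) Fellegi-Holt paradigm concrete, the following is a minimal sketch of error localisation for one numerical record. It is a brute-force enumeration and not one of the algorithms assessed in the article. It assumes that edits are linear inequalities of the form coeffs · x <= rhs, and it uses Fourier-Motzkin elimination only as a feasibility check. The field names (turnover, costs, profit), the edits, the record values and the function names are all hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def eliminate(rows, k):
    """One Fourier-Motzkin step: project variable k out of the system.

    Each row is (coeffs, rhs) and stands for coeffs . x <= rhs.
    """
    pos = [r for r in rows if r[0][k] > 0]
    neg = [r for r in rows if r[0][k] < 0]
    out = [r for r in rows if r[0][k] == 0]
    for pc, pb in pos:
        for nc, nb in neg:
            s, t = 1 / pc[k], -1 / nc[k]  # positive multipliers; x_k cancels
            coeffs = [s * a + t * b for a, b in zip(pc, nc)]
            out.append((coeffs, s * pb + t * nb))
    return out


def feasible(edits, record, free):
    """Can all edits be satisfied by changing only the fields in `free`?"""
    rows = []
    for coeffs, rhs in edits:
        # Fields that are not free keep their observed value.
        fixed = sum(Fraction(c) * Fraction(record[j])
                    for j, c in enumerate(coeffs) if j not in free)
        row = [Fraction(c) if j in free else Fraction(0)
               for j, c in enumerate(coeffs)]
        rows.append((row, Fraction(rhs) - fixed))
    for k in free:
        rows = eliminate(rows, k)
    # Only constant rows "0 <= rhs" remain after eliminating all free fields.
    return all(rhs >= 0 for _, rhs in rows)


def localise(edits, record, weights):
    """Return (minimum weight, all minimum-weight sets of fields to change)."""
    n = len(record)

    def weight(fields):
        return sum(weights[j] for j in fields)

    subsets = (set(s) for r in range(n + 1)
               for s in combinations(range(n), r))
    best, solutions = None, []
    for free in sorted(subsets, key=weight):
        if best is not None and weight(free) > best:
            break  # every remaining subset is heavier than the optimum
        if feasible(edits, record, free):
            best = weight(free)
            solutions.append(sorted(free))
    return best, solutions


# Hypothetical fields: 0 = turnover, 1 = costs, 2 = profit.
EDITS = [
    ([1, -1, -1], 0),   # turnover - costs - profit <= 0
    ([-1, 1, 1], 0),    # costs + profit - turnover <= 0 (balance edit)
    ([-1, 0, 0], 0),    # turnover >= 0
    ([0, -1, 0], 0),    # costs >= 0
    ([0, 0, -1], 0),    # profit >= 0
]
RECORD = [100, 60, 70]  # violates the balance edit: 60 + 70 != 100

print(localise(EDITS, RECORD, [1, 1, 1]))  # (1, [[0], [1], [2]])
print(localise(EDITS, RECORD, [2, 1, 1]))  # (1, [[1], [2]])
```

With unit weights, changing any single field repairs the record, so there are three optimal solutions. Giving turnover a higher reliability weight leaves only costs and profit as candidates. The enumeration visits up to 2^n subsets, so it is only practical for records with a handful of fields.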

Article information

Source
Internat. Statist. Rev., Volume 73, Number 1 (2005), 73-102.

Dates
First available in Project Euclid: 31 March 2005

Permanent link to this document
https://projecteuclid.org/euclid.isr/1112304813

Zentralblatt MATH identifier
1104.62128

Keywords
Branch-and-bound; Cutting planes; Error localisation; Fellegi-Holt method; Fellegi-Holt paradigm; Fourier-Motzkin elimination; Integer programming; Statistical data editing; Vertex generation

Citation

de Waal, Ton; Coutinho, Wieger. Automatic Editing for Business Surveys: An Assessment of Selected Algorithms. Internat. Statist. Rev. 73 (2005), no. 1, 73-102. https://projecteuclid.org/euclid.isr/1112304813


References

  • [1] Atkinson, A.C. (1994). Fast Very Robust Methods for the Detection of Multiple Outliers. Journal of the American Statistical Association, 89, 1329-1339.
  • [2] Austin, J. & Lees, K. (2000). A Search Engine Based on Neural Correlation Matrix Memories. Neurocomputing, 35, 55-72.
  • [3] Bankier, M., Poirier, P., Lachance, M. & Mason, P. (2000). A Generic Implementation of the Nearest-Neighbour Imputation Methodology (NIM). Proceedings of the Second International Conference on Establishment Surveys, Buffalo, pp. 571-578.
  • [4] Barcaroli, G., Ceccarelli, C., Luzi, O., Manzari, A., Riccini, E. & Silvestri, F. (1995). The Methodology of Editing and Imputation of Qualitative Variables Implemented in SCIA. Internal Report, Istituto Nazionale di Statistica, Rome.
  • [5] Barnett, V. & Lewis, T. (1994). Outliers in Statistical Data. New York: John Wiley & Sons.
  • [6] Béguin, C. & Hulliger, B. (2004). Multivariate Outlier Detection in Incomplete Survey Data: The Epidemic Algorithm and Transformed Rank Correlation. Journal of the Royal Statistical Society A, 167, 275-294.
  • [7] Billor, N., Hadi, A.S. & Velleman, P.F. (2000). BACON: Blocked Adaptive Computationally Efficient Outlier Nominators. Computational Statistics and Data Analysis, 34, 279-298.
  • [8] Bishop, C.M. (1995). Neural Networks for Pattern Recognition. Oxford: Clarendon Press.
  • [9] Boskovitz, A., Goré, R. & Hegland, M. (2003). A Logical Formalisation of the Fellegi-Holt Method of Data Cleaning. Report, Research School of Information Sciences and Engineering, Australian National University, Canberra.
  • [10] Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Wadsworth, Pacific Grove.
  • [11] Bruni, R., Reale, A. & Torelli, R. (2001). Optimization Techniques for Edit Validation and Data Imputation. Proceedings of Statistics Canada Symposium 2001 "Achieving Data Quality in a Statistical Agency: a Methodological Perspective", XVIII-th International Symposium on Methodological Issues.
  • [12] Bruni, R. & Sassano, A. (2001). Logic and Optimization Techniques for an Error Free Data Collecting. Report, University of Rome "La Sapienza".
  • [13] Casado Valero, C., Del Castillo Cuervo-Arango, F., Mateo Ayerra, J. & De Santos Ballesteros, A. (1996). Quantitative Data Editing: Quadratic Programming Method. Presented at the COMPSTAT 1996 Conference, Barcelona.
  • [14] Central Statistical Office (2000). Editing and Calibration in Survey Processing. Report SMD-37, Ireland.
  • [15] Chambers, R. (2004). Methods Investigated in the EUREDIT Project. In Methods and Experimental Results from the EUREDIT Project, Ed. J.R.H. Charlton. (http://www.cs.york.ac.uk/euredit/).
  • [16] Chambers, R., Hentges, A. & Zhao, X. (2004). Robust Automatic Methods for Outlier and Error Detection. Journal of the Royal Statistical Society A, 167, 323-339.
  • [17] Chernikova, N.V. (1964). Algorithm for Finding a General Formula for the Non-Negative Solutions of a System of Linear Equations. USSR Computational Mathematics and Mathematical Physics, 4, 151-158.
  • [18] Chernikova, N.V. (1965). Algorithm for Finding a General Formula for the Non-Negative Solutions of a System of Linear Inequalities. USSR Computational Mathematics and Mathematical Physics, 5, 228-233.
  • [19] Chvátal, V. (1983). Linear Programming. New York: W.H. Freeman and Company.
  • [20] Cormen, T.H., Leiserson, C.E. & Rivest, R.L. (1990). Introduction to Algorithms. Cambridge, MA: The MIT Press/McGraw-Hill Book Company.
  • [21] De Jong, A. (2002). Unit-Edit: Standardized Processing of Structural Business Statistics in the Netherlands. UN/ECE Work Session on Statistical Data Editing, Helsinki.
  • [22] De Waal, T. (1996). CherryPi: A Computer Program for Automatic Edit and Imputation. UN/ECE Work Session on Statistical Data Editing, Voorburg.
  • [23] De Waal, T. (2003a). Processing of Erroneous and Unsafe Data. Ph.D. Thesis, Erasmus University, Rotterdam.
  • [24] De Waal, T. (2003b). Solving the Error Localization Problem by Means of Vertex Generation. Survey Methodology, 29, 71-79.
  • [25] De Waal, T. & Quere, R. (2003). A Fast and Simple Algorithm for Automatic Editing of Mixed Data. Journal of Official Statistics, 19, 383-402.
  • [26] De Waal, T., Renssen, R. & Van de Pol, F. (2000). Graphical Macro-Editing: Possibilities and Pitfalls. Proceedings of the Second International Conference on Establishment Surveys, Buffalo, pp. 579-588.
  • [27] Duffin, R.J. (1974). On Fourier's Analysis of Linear Inequality Systems. Mathematical Programming Studies, 1, 71-95.
  • [28] Fellegi, I.P. & Holt, D. (1976). A Systematic Approach to Automatic Edit and Imputation. Journal of the American Statistical Association, 71, 17-35.
  • [29] Ferguson, D.P. (1994). An Introduction to the Data Editing Process. Statistical Data Editing (Volume 1); Methods and Techniques. Geneva: United Nations.
  • [30] Fillion, J.M. & Schiopu-Kratina, I. (1993). On the Use of Chernikova's Algorithm for Error Localisation. Report, Statistics Canada.
  • [31] Fine, T.L. (1999). Feedforward Neural Network Methodology. New York: Springer-Verlag.
  • [32] Freund, R.J. & Hartley, H.O. (1967). A Procedure for Automatic Data Editing. Journal of the American Statistical Association, 62, 341-352.
  • [33] Garfinkel, R.S., Kunnathur, A.S. & Liepins, G.E. (1986). Optimal Imputation of Erroneous Data: Categorical Data, General Edits. Operations Research, 34, 744-751.
  • [34] Garfinkel, R.S., Kunnathur, A.S. & Liepins, G.E. (1988). Error Localization for Erroneous Data: Continuous Data, Linear Constraints. SIAM Journal on Scientific and Statistical Computing, 9, 922-931.
  • [35] Ghosh-Dastidar, B. & Schafer, J.L. (2003). Multiple Edit/Multiple Imputation for Multivariate Continuous Data. Journal of the American Statistical Association, 98, 807-817.
  • [36] Granquist, L. (1990). A Review of Some Macro-Editing Methods for Rationalizing the Editing Process. Proceedings of the Statistics Canada Symposium, pp. 225-234.
  • [37] Granquist, L. (1995). Improving the Traditional Editing Process. In Business Survey Methods, Eds. Cox, Binder, Chinnappa, Christianson and Kott, pp. 385-401. New York: John Wiley & Sons.
  • [38] Granquist, L. (1997). The New View on Editing. International Statistical Review, 65, 381-387.
  • [39] Granquist, L. & Kovar, J. (1997). Editing of Survey Data: How Much is Enough? In Survey Measurement and Process Quality, Eds. Lyberg, Biemer, Collins, De Leeuw, Dippo, Schwartz and Trewin, pp. 415-435. New York: John Wiley & Sons.
  • [40] Hadi, A.S. & Simonoff, J.F. (1993). Procedures for the Identification of Multiple Outliers in Linear Models. Journal of the Royal Statistical Society B, 56, 393-396.
  • [41] Hedlin, D. (2003). Score Functions to Reduce Business Survey Editing at the U.K. Office for National Statistics. Journal of Official Statistics, 19, 177-199.
  • [42] Hoogland, J. (2002). Selective Editing by Means of Plausibility Indicators. UN/ECE Work Session on Statistical Data Editing, Helsinki.
  • [43] Hoogland, J. & Van der Pijll, E. (2003). Evaluation of Automatic Versus Manual Editing of Production Statistics 2000 Trade & Transport. UN/ECE Work Session on Statistical Data Editing, Madrid.
  • [44] Houbiers, M., Quere, R. & De Waal, T. (1999). Automatically Editing the 1997 Survey on Environmental Costs. Internal report (BPA number: 4917-99-RSM), Statistics Netherlands, Voorburg.
  • [45] ILOG CPLEX 7.5 Reference Manual (2001). ILOG, France.
  • [46] Kalton, G. & Kasprzyk, D. (1986). The Treatment of Missing Survey Data. Survey Methodology, 12, 1-16.
  • [47] Koikkalainen, P. & Oja, E. (1990). Self-Organizing Hierarchical Feature Maps. In Proceedings of the International Joint Conference on Neural Networks II, pp. 279-285. Piscataway, NJ: IEEE Press.
  • [48] Kosinski, A.S. (1999). A Procedure for the Detection of Multivariate Outliers. Computational Statistics & Data Analysis, 29, 145-161.
  • [49] Kovar, J. & Whitridge, P. (1990). Generalized Edit and Imputation System; Overview and Applications. Revista Brasileira de Estadística, 51, 85-100.
  • [50] Kovar, J. & Whitridge, P. (1995). Imputation of Business Survey Data. In Business Survey Methods, Eds. Cox, Binder, Chinnappa, Christianson and Kott, pp. 403-423. New York: John Wiley & Sons.
  • [51] Kovar, J. & Winkler, W.E. (1996). Editing Economic Data. UN/ECE Work Session on Statistical Data Editing, Voorburg.
  • [52] Larsen, B.S. & Madsen, B. (1999). Error Identification and Imputations with Neural Networks. UN/ECE Work Session on Statistical Data Editing, Rome.
  • [53] Lawrence, D. & McKenzie, R. (2000). The General Application of Significance Editing. Journal of Official Statistics, 16, 243-253.
  • [54] Liepins, G.E., Garfinkel, R.S. & Kunnathur, A.S. (1982). Error Localization for Erroneous Data: A Survey. TIMS/Studies in the Management Sciences, 19, 205-219.
  • [55] Little, R.J.A. & Rubin, D.B. (2002). Statistical Analysis with Missing Data. New York: John Wiley & Sons.
  • [56] Little, R.J.A. & Smith, P.J. (1987). Editing and Imputation of Quantitative Survey Data. Journal of the American Statistical Association, 82, 58-68.
  • [57] Manzari, A. (2004). Combining Editing and Imputation Methods: An Experimental Application on Population Census Data. Journal of the Royal Statistical Society A, 167, 295-307.
  • [58] McKeown, P.G. (1984). A Mathematical Programming Approach to Editing of Continuous Survey Data. SIAM Journal on Scientific and Statistical Computing, 5, 784-797.
  • [59] Nemhauser, G.L. & Wolsey, L.A. (1988). Integer and Combinatorial Optimisation. New York: John Wiley & Sons.
  • [60] Pannekoek, J. & De Waal, T. (2005). Automatic Editing and Imputation for Business Surveys: The Dutch Contribution to the EUREDIT Project. Journal of Official Statistics, forthcoming.
  • [61] Ragsdale, C.T. & McKeown, P.G. (1996). On Solving the Continuous Data Editing Problem. Computers & Operations Research, 23, 263-273.
  • [62] Riani, M. & Atkinson, A.C. (2000). Robust Diagnostic Data Analysis: Transformations in Regression. Technometrics, 42, 384-398.
  • [63] Riera-Ledesma, J. & Salazar-González, J.J. (2003). New Algorithms for the Editing and Imputation Problem. UN/ECE Work Session on Statistical Data Editing, Madrid.
  • [64] Rocke, D.M. & Woodruff, D.L. (1993). Computation of Robust Estimates of Multivariate Location and Shape. Statistica Neerlandica, 47, 27-42.
  • [65] Rocke, D.M. & Woodruff, D.L. (1996). Identification of Outliers in Multivariate Data. Journal of the American Statistical Association, 91, 1047-1061.
  • [66] Rousseeuw, P.J. & Leroy, A.M. (1987). Robust Regression & Outlier Detection. New York: John Wiley & Sons.
  • [67] Rubin, D.S. (1975). Vertex Generation and Cardinality Constrained Linear Programs. Operations Research, 23, 555-565.
  • [68] Rubin, D.S. (1977). Vertex Generation Methods for Problems with Logical Constraints. Annals of Discrete Mathematics, 1, 457-466.
  • [69] Sande, G. (1978). An Algorithm for the Fields to Impute Problems of Numerical and Coded Data. Technical report, Statistics Canada.
  • [70] Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
  • [71] Schaffer, J. (1987). Procedure for Solving the Data-Editing Problem with Both Continuous and Discrete Data Types. Naval Research Logistics, 34, 879-890.
  • [72] Schiopu-Kratina, I. & Kovar, J.G. (1989). Use of Chernikova's Algorithm in the Generalized Edit and Imputation System. Methodology Branch Working Paper BSMD 89-001E, Statistics Canada.
  • [73] Sedgewick, R. & Flajolet, P. (1996). An Introduction to the Analysis of Algorithms. New York: Addison-Wesley Publishing Company.
  • [74] Stoop, J.R. (2003). The Best Piece of CherryPie (in Dutch). Internal report (BPA number: 2098-03-TMO). Statistics Netherlands, Voorburg.
  • [75] Todaro, T.A. (1999). Overview and Evaluation of the AGGIES Automated Edit and Imputation System. UN/ECE Work Session on Statistical Data Editing, Rome.
  • [76] Van de Pol, F., Bakker, F. & De Waal, T. (1997). On Principles for Automatic Editing of Numerical Data with Equality Checks. Report (BPA number: 7141-97-TMO), Statistics Netherlands, Voorburg.
  • [77] Van Riessen, P. (2002). Automatic Editing by Means of CPLEX (in Dutch). Internal report (BPA number: 975-02-TMO), Statistics Netherlands, Voorburg.
  • [78] Winkler, W.E. (1995). Editing Discrete Data. UN/ECE Work Session on Statistical Data Editing, Athens.
  • [79] Winkler, W.E. (1996). State of Statistical Data Editing and Current Research Problems. UN/ECE Work Session on Statistical Data Editing, Rome.
  • [80] Winkler, W.E. (1998). Set-Covering and Editing Discrete Data. Statistical Research Division Report 98/01, US Bureau of the Census, Washington, D.C.
  • [81] Winkler, W.E. & Draper, L.A. (1997). The SPEER Edit System. Statistical Data Editing (Volume 2); Methods and Techniques, United Nations, Geneva.
  • [82] Winkler, W.E. & Petkunas, T.F. (1997). The DISCRETE Edit System. Statistical Data Editing (Volume 2); Methods and Techniques, United Nations, Geneva.
  • [83] Woodruff, D.L. & Rocke, D.M. (1994). Computable Robust Estimation of Multivariate Location and Shape in High Dimension Using Compound Estimators. Journal of the American Statistical Association, 89, 888-896.