Statistical Science

How to Lie with Bad Data

Richard D. De Veaux and David J. Hand

Abstract

As Huff’s landmark book made clear, lying with statistics can be accomplished in many ways. Distorting graphics, manipulating data or using biased samples are just a few of the tried and true methods. Failing to use the correct statistical procedure or failing to check the conditions for when the selected method is appropriate can distort results as well, whether the motives of the analyst are honorable or not. Even when the statistical procedure and motives are correct, bad data can produce results that have no validity at all. This article provides some examples of how bad data can arise, what kinds of bad data exist, how to detect and measure bad data, and how to improve the quality of data that have already been collected.

Article information

Source
Statist. Sci. Volume 20, Number 3 (2005), 231-238.

Dates
First available in Project Euclid: 24 August 2005

Permanent link to this document
https://projecteuclid.org/euclid.ss/1124891289

Digital Object Identifier
doi:10.1214/088342305000000269

Mathematical Reviews number (MathSciNet)
MR2189000

Zentralblatt MATH identifier
1100.62533

Keywords
Data quality; data profiling; data rectification; data consistency; accuracy; distortion; missing values; record linkage; data warehousing; data mining

Citation

De Veaux, Richard D.; Hand, David J. How to Lie with Bad Data. Statist. Sci. 20 (2005), no. 3, 231--238. doi:10.1214/088342305000000269. https://projecteuclid.org/euclid.ss/1124891289.


References

  • Baggerly, K. A., Morris, J. S. and Coombes, K. R. (2004). Reproducibility of SELDI-TOF protein patterns in serum: Comparing datasets from different experiments. Bioinformatics 20 777--785.
  • Brunskill, A. J. (1990). Some sources of error in the coding of birth weight. American J. Public Health 80 72--73.
  • Check, E. (2004). Proteomics and cancer: Running before we can walk? Nature 429 496--497.
  • Coale, A. J. and Stephan, F. F. (1962). The case of the Indians and the teen-age widows. J. Amer. Statist. Assoc. 57 338--347.
  • De Veaux, R. D. (2002). Data mining: A view from down in the pit. Stats (34) 3--9.
  • De Veaux, R. D., Donahue, R. and Small, R. D. (2002). Using data mining techniques to harvest information in clinical trials. Presentation at Joint Statistical Meetings, New York.
  • De Veaux, R. D., Gordon, A., Comiso, J. and Bacherer, N. E. (1993). Modeling of topographic effects on Antarctic sea-ice using multivariate adaptive regression splines. J. Geophysical Research: Oceans 98 20,307--20,320.
  • Hand, D. J. (2001). Reject inference in credit operations. In Handbook of Credit Scoring (E. Mays, ed.) 225--240. Glenlake Publishing, Chicago.
  • Hand, D. J. (2004a). Academic obsessions and classification realities: Ignoring practicalities in supervised classification. In Classification, Clustering and Data Mining Applications (D. Banks, L. House, F. R. McMorris, P. Arabie and W. Gaul, eds.) 209--232. Springer, Berlin.
  • Hand, D. J. (2004b). Measurement Theory and Practice: The World Through Quantification. Arnold, London.
  • Hand, D. J., Blunt, G., Kelly, M. G. and Adams, N. M. (2000). Data mining for fun and profit (with discussion). Statist. Sci. 15 111--131.
  • Hand, D. J. and Henley, W. E. (1993). Can reject inference ever work? IMA J. of Mathematics Applied in Business and Industry 5(4) 45--55.
  • Huff, D. (1954). How to Lie with Statistics. Norton, New York.
  • Jones, P. D. and Wigley, T. M. L. (1990). Global warming trends. Scientific American 263(2) 84--91.
  • Kim, W., Choi, B.-J., Hong, E.-K., Kim, S.-K. and Lee, D. (2003). A taxonomy of dirty data. Data Mining and Knowledge Discovery 7 81--99.
  • Klein, B. D. (1998). Data quality in the practice of consumer product management: Evidence from the field. Data Quality 4(1).
  • Kruskal, W. (1981). Statistics in society: Problems unsolved and unformulated. J. Amer. Statist. Assoc. 76 505--515.
  • Laudon, K. C. (1986). Data quality and due process in large interorganizational record systems. Communications of the ACM 29 4--11.
  • Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York.
  • Loshin, D. (2001). Enterprise Knowledge Management: The Data Quality Approach. Morgan Kaufmann, San Francisco.
  • Madnick, S. E. and Wang, R. Y. (1992). Introduction to the TDQM research program. Working Paper 92-01, Total Data Quality Management Research Program.
  • Morey, R. C. (1982). Estimating and improving the quality of information in a MIS. Communications of the ACM 25 337--342.
  • Percy, T. (1986). My data, right or wrong. Datamation 32(11) 123--124.
  • Petricoin, E. F., III, Ardekani, A. M., Hitt, B. A., Levine, P. J., Fusaro, V. A., Steinberg, S. M., Mills, G. B., Simone, C., Fishman, D. A., Kohn, E. C. and Liotta, L. A. (2002). Use of proteomic patterns in serum to identify ovarian cancer. The Lancet 359 572--577.
  • Pierce, E. (1997). Modeling database error rates. Data Quality 3(1). Available at www.dataquality.com/dqsep97.htm.
  • PricewaterhouseCoopers (2004). The Tech Spotlight 22. Available at www.pwc.com/extweb/manissue.nsf/docid/2D6E2F57E06E022F85256B8F006F389A.
  • Redman, T. C. (1992). Data Quality: Management and Technology. Bantam, New York.
  • Strayhorn, J. M. (1990). Estimating the errors remaining in a data set: Techniques for quality control. Amer. Statist. 44 14--18.
  • Wainer, H. (2004). Curbstoning IQ and the 2000 presidential election. Chance 17(4) 43--46.
  • West, M. and Winkler, R. L. (1991). Data base error trapping and prediction. J. Amer. Statist. Assoc. 86 987--996.
  • Willenborg, L. and de Waal, T. (2001). Elements of Statistical Disclosure Control. Springer, New York.
  • Wolins, L. (1962). Responsibility for raw data. American Psychologist 17 657--658.