Statistical Science

Data Mining for Fun and Profit

Niall M. Adams, Gordon Blunt, David J. Hand, and Mark G. Kelly

Full-text: Open access

Abstract

Data mining is defined as the process of seeking interesting or valuable information within large data sets. This presents novel challenges and problems, distinct from those typically arising in the allied areas of statistics, machine learning, pattern recognition or database science. A distinction is drawn between the two data mining activities of model building and pattern detection. Even though statisticians are familiar with the former, the large data sets involved in data mining mean that novel problems do arise. The second of the activities, pattern detection, presents entirely new classes of challenges, some arising, again, as a consequence of the large sizes of the data sets. Data quality is a particularly troublesome issue in data mining applications, and this is examined. The discussion is illustrated with a variety of real examples.

Article information

Source
Statist. Sci., Volume 15, Number 2 (2000), 111-131.

Dates
First available in Project Euclid: 24 December 2001

Permanent link to this document
https://projecteuclid.org/euclid.ss/1009212753

Digital Object Identifier
doi:10.1214/ss/1009212753

Keywords
Data mining; knowledge discovery; large data sets; computers; databases

Citation

Hand, David J.; Blunt, Gordon; Kelly, Mark G.; Adams, Niall M. Data Mining for Fun and Profit. Statist. Sci. 15 (2000), no. 2, 111--131. doi:10.1214/ss/1009212753. https://projecteuclid.org/euclid.ss/1009212753


References

  • Adams, N. M. and Hand, D. J. (1999). Mining for unusual patterns in data. Working paper, Dept. Mathematics, Imperial College, London.
  • Agrawal, R., Stolorz, P. and Piatetsky-Shapiro, G. (eds.) (1998). Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA.
  • Babcock, C. (1994). Parallel processing mines retail data. Computer World. Sept. 6.
  • Bartholomew, D. J. (1995). What is statistics? J. Roy. Statist. Soc. Ser. A 158 1-20.
  • Blunt, G. and Hand, D. J. (1999). Credit card petrol purchases: an example of data mining in practice. Working paper, Dept. Mathematics, Imperial College, London.
  • Chambers, J. M. (1993). Greater or lesser statistics: a choice for future research. Statist. Comput. 3 182-184.
  • Cooper, C., Shah, S., Hand, D. J., Compston, J., Davie, M. and Woolf, A. (1991). Screening for vertebral osteoporosis using individual risk factors. Osteoporosis International 2 48-53.
  • Copas, J. B. and Li, H. G. (1997). Inference for non-random samples. J. Roy. Statist. Soc. Ser. B 59 55-95.
  • Cortes, C. and Pregibon, D. (1997). Mega-monitoring. Paper presented at the Univ. Washington/Microsoft Summer Research Institute on Data Mining, July 6-11.
  • Cox, K. C., Eick, S. G., Wills, G. J. and Brachman, R. J. (1997). Visual data mining: recognizing telephone calling fraud. Data Mining and Knowledge Discovery 1 225-231.
  • Derthick, M., Kolojejchick, J. and Roth, S. F. (1997). An interactive visualisation environment for data exploration. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (D. Heckerman, H. Mannila, D. Pregibon and R. Uthurusamy, eds.) 2-9. AAAI Press, Menlo Park, CA.
  • Elder, J. IV and Pregibon, D. (1996). A statistical perspective on knowledge discovery in databases. In Advances in Knowledge Discovery and Data Mining (U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, eds.) 83-113. AAAI Press, Menlo Park, CA.
  • Fayyad, U. M., Piatetsky-Shapiro, G. and Smyth, P. (1996). From data mining to knowledge discovery: an overview. In Advances in Knowledge Discovery and Data Mining (U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, eds.) 1-34. AAAI Press, Menlo Park, CA.
  • Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R. (eds.) (1996). Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA.
  • Gifi, A. (1990). Nonlinear Multivariate Analysis. Wiley, Chichester.
  • Glymour, C., Madigan, D., Pregibon, D. and Smyth, P. (1997). Statistical themes and lessons for data mining. Data Mining and Knowledge Discovery 1 11-28.
  • Hand, D. J. (1998a). Data mining: statistics and more? Amer. Statist. 52 112-118.
  • Hand, D. J. (1998b). Data mining - reaching beyond statistics. Res. Official Statist. 2 5-17.
  • Hand, D. J. (1999). Statistics and data mining: intersecting disciplines. SIGKDD Explorations 1 16-19.
  • Hand, D. J., Mannila, H. and Smyth, P. (2000). Principles of Data Mining. MIT Press.
  • Harrison, D. (1993). Backing up. Neural Computation 98-104.
  • Heckerman, D., Mannila, H., Pregibon, D. and Uthurusamy, R. (eds.) (1997). Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA.
  • Klösgen, W. (1998). Analysing databases with knowledge discovery methods. Res. Official Statist. 1 9-35.
  • Leighton, G. and McKinlay, P. L. (1930). Milk Consumption and the Growth of School Children. H.M. Stationery Office, Edinburgh.
  • Mannila, H., Toivonen, H. and Verkamo, A. I. (1997). Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery 1 259-289.
  • Mihalisin, T. and Timlin, J. (1997). Fast robust visual data mining. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (D. Heckerman, H. Mannila, D. Pregibon and R. Uthurusamy, eds.) 231-234. AAAI Press, Menlo Park, CA.
  • Scott, D. W. (1992). Multivariate Density Estimation. Wiley, New York.
  • Oct., Nov., 1999). However, while I appreciate the importance of data mining, in practice the profit it brings has turned out to be surprisingly limited in many key businesses. W. Edwards Deming began his 1943 statistics book, The Statistical Adjustment of Data, with "The purpose of collecting data is to provide a basis for action." It is in this spirit, the spirit of helping businesses make better decisions, that I wish to discuss an extension of Hand et al. beyond analysis of large data sets to the active collection of useful data.
  • Stanghellini, McConway and Hand (1999). On the other hand, we feel that Kahn may be a little unrealistic when he argues, under Brainstorming, that "it is thus crucial to have a view of everything a business could do. This involves explicitly knowing everything the company is trying, and has tried, both recently and long ago. Furthermore, knowledge of everything competitors are doing is obviously key." In general, while we should clearly strive to improve the information on which we base our decisions, it will inevitably be at best partial. The comments under Experimental Design and Implementation also ring true. In various contexts, we have had considerable difficulty convincing business of the merits of accepting a small sample of
  • Deming, W. E. (1943). Statistical Adjustment of Data. Wiley, NY. (Republished in 1964 by Dover, New York.)
  • Hand, D. J., McConway, K. J. and Stanghellini, E. (1997). Graphical models of applicants for credit. IMA J. Math. Appl. Bus. Indust. 8 143-155.
  • New York Times, 6 October 1999, page 4.
  • New York Times, 18 November 1999, page 4.
  • Stanghellini, E., McConway, K. J. and Hand, D. J. (1999). A chain graph for applicants for bank credit. J. Roy. Statist. Soc. Ser. C 48 239-251.