Statistical Science

Data Mining in Electronic Commerce

David L. Banks and Yasmin H. Said
Source: Statist. Sci. Volume 21, Number 2 (2006), 234-246.

Abstract

Modern business is rushing toward e-commerce. If the transition is done properly, it enables better management, new services, lower transaction costs and better customer relations. Success depends on skilled information technologists, among whom are statisticians. This paper focuses on some of the contributions that statisticians are making to help change the business world, especially through the development and application of data mining methods. This is a very large area, and the topics we cover are chosen to avoid overlap with other papers in this special issue, as well as to respect the limitations of our expertise. Inevitably, electronic commerce has raised and is raising fresh research problems in a very wide range of statistical areas, and we try to emphasize those challenges.

Full-text: Open access
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.ss/1154979824
Digital Object Identifier: doi:10.1214/088342306000000204
Mathematical Reviews number (MathSciNet): MR2324083
Zentralblatt MATH identifier: 05191863

