Statistical Science

Privacy and Confidentiality in an e-Commerce World: Data Mining, Data Warehousing, Matching and Disclosure Limitation

Stephen E. Fienberg

Abstract

The growing expanse of e-commerce and the widespread availability of online databases raise many fears regarding loss of privacy and many statistical challenges. Even with encryption and other nominal forms of protection for individual databases, we still need to protect against the violation of privacy through linkages across multiple databases. These issues parallel those that have arisen and received some attention in the context of homeland security. Following the events of September 11, 2001, there has been heightened attention in the United States and elsewhere to the use of multiple government and private databases for the identification of possible perpetrators of future attacks, as well as an unprecedented expansion of federal government data mining activities, many involving databases containing personal information. We present an overview of some proposals that have surfaced for the search of multiple databases which supposedly do not compromise possible pledges of confidentiality to the individuals whose data are included. We also explore their link to the related literature on privacy-preserving data mining. In particular, we focus on the matching problem across databases and the concept of “selective revelation” and their confidentiality implications.

Article information

Source
Statist. Sci., Volume 21, Number 2 (2006), 143-154.

Dates
First available in Project Euclid: 7 August 2006

Permanent link to this document
https://projecteuclid.org/euclid.ss/1154979817

Digital Object Identifier
doi:10.1214/088342306000000240

Mathematical Reviews number (MathSciNet)
MR2324074

Zentralblatt MATH identifier
05191856

Keywords
Encryption; multiparty computation; privacy-preserving data mining; record linkage; R–U confidentiality map; selective revelation

Citation

Fienberg, Stephen E. Privacy and Confidentiality in an e-Commerce World: Data Mining, Data Warehousing, Matching and Disclosure Limitation. Statist. Sci. 21 (2006), no. 2, 143--154. doi:10.1214/088342306000000240. https://projecteuclid.org/euclid.ss/1154979817

