The Annals of Applied Statistics

Multiple imputation for sharing precise geographies in public use data

Hao Wang and Jerome P. Reiter

Full-text: Open access

Abstract

When releasing data to the public, data stewards are ethically and often legally obligated to protect the confidentiality of data subjects’ identities and sensitive attributes. They also strive to release data that are informative for a wide range of secondary analyses. Achieving both objectives is particularly challenging when data stewards seek to release highly resolved geographical information. We present an approach for protecting the confidentiality of data with geographic identifiers based on multiple imputation. The basic idea is to convert geography to latitude and longitude, estimate a bivariate response model conditional on attributes, and simulate new latitude and longitude values from this model. We illustrate the proposed methods using data describing causes of death in Durham, North Carolina. In the context of the application, we present a straightforward tool for generating simulated geographies and attributes based on regression trees, and we present methods for assessing disclosure risks with such simulated data.
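
The basic idea in the abstract (model latitude and longitude jointly given attributes, then simulate replacement locations several times) can be illustrated with a minimal sketch. This is not the authors' regression-tree synthesizer: as a simplifying assumption it fits a separate bivariate normal to the locations within each attribute cell and draws from the fitted distribution without propagating parameter uncertainty. The function names, the attribute coding, and the toy coordinates are all hypothetical and made up for illustration.

```python
import math
import random
from collections import defaultdict


def fit_cell_models(records):
    """Fit a bivariate normal to (lat, lon) within each attribute cell.

    records: list of (attributes_tuple, lat, lon).
    Returns {attributes: (mean_lat, mean_lon, var_lat, var_lon, cov)}.
    """
    cells = defaultdict(list)
    for attrs, lat, lon in records:
        cells[attrs].append((lat, lon))

    models = {}
    for attrs, pts in cells.items():
        n = len(pts)
        m_lat = sum(p[0] for p in pts) / n
        m_lon = sum(p[1] for p in pts) / n
        # Divide-by-n moments. A cell with a single record gets zero
        # variance and would reproduce the true location, so a real
        # application would need to pool or coarsen such cells.
        v_lat = sum((p[0] - m_lat) ** 2 for p in pts) / n
        v_lon = sum((p[1] - m_lon) ** 2 for p in pts) / n
        cov = sum((p[0] - m_lat) * (p[1] - m_lon) for p in pts) / n
        models[attrs] = (m_lat, m_lon, v_lat, v_lon, cov)
    return models


def draw_location(model, rng):
    """Draw one (lat, lon) pair from a fitted bivariate normal."""
    m_lat, m_lon, v_lat, v_lon, cov = model
    # Cholesky factor of the 2x2 covariance matrix, written out by hand.
    a = math.sqrt(v_lat)
    b = cov / a if a > 0 else 0.0
    c = math.sqrt(max(v_lon - b * b, 0.0))
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return m_lat + a * z1, m_lon + b * z1 + c * z2


def synthesize(records, m=5, seed=0):
    """Return m data sets with attributes kept and locations simulated."""
    rng = random.Random(seed)
    models = fit_cell_models(records)
    return [
        [(attrs, *draw_location(models[attrs], rng)) for attrs, _, _ in records]
        for _ in range(m)
    ]


# Toy records with invented attributes and coordinates (not real data).
toy = [
    (("F", "65+"), 36.00, -78.90),
    (("F", "65+"), 36.02, -78.92),
    (("F", "65+"), 35.98, -78.94),
    (("M", "<65"), 35.95, -78.88),
    (("M", "<65"), 35.97, -78.86),
]
datasets = synthesize(toy, m=3, seed=1)
```

Each element of `datasets` keeps the original attributes but carries simulated coordinates, mirroring the release of several imputed data sets. A cell-wise normal is only the simplest stand-in for a model "conditional on attributes"; it ignores fine spatial structure that a more flexible model would aim to preserve.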

Article information

Source
Ann. Appl. Stat., Volume 6, Number 1 (2012), 229-252.

Dates
First available in Project Euclid: 6 March 2012

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1331043395

Digital Object Identifier
doi:10.1214/11-AOAS506

Mathematical Reviews number (MathSciNet)
MR2951536

Zentralblatt MATH identifier
1236.86015

Keywords
Confidentiality, disclosure, dissemination, spatial, synthetic, tree

Citation

Wang, Hao; Reiter, Jerome P. Multiple imputation for sharing precise geographies in public use data. Ann. Appl. Stat. 6 (2012), no. 1, 229–252. doi:10.1214/11-AOAS506. https://projecteuclid.org/euclid.aoas/1331043395


Supplemental materials

  • Supplementary material: Computational details and further results. Computational details for geography disclosure and identification risks in Sections 3.2.1 and 3.2.2; further analytical validity results; and results based on genuine cause of death.