The Annals of Applied Statistics

Bayesian nonparametric disclosure risk estimation via mixed effects log-linear models

Cinzia Carota, Maurizio Filippone, Roberto Leombruni, and Silvia Polettini

Full-text: Open access

Abstract

Statistical agencies and other institutions collect data under the promise to protect the confidentiality of respondents. When releasing microdata samples, the risk that records can be identified must be assessed. To this end, a widely adopted approach is to isolate the categorical variables that are key to identification and to analyze multi-way contingency tables of these variables. Common disclosure risk measures focus on sample-unique cells in these tables and adopt parametric log-linear models as the standard statistical tools for the problem. Such models often have to deal with large and extremely sparse tables, which pose a number of challenges to risk estimation. This paper proposes to overcome these problems by studying nonparametric alternatives based on Dirichlet process random effects. The main finding is that including such random effects considerably reduces the number of fixed effects required to achieve reliable risk estimates. This is studied in applications to real data, which suggest, in particular, that our mixed models with main effects only produce estimates roughly equivalent to those of the all two-way interactions models and are effective in defusing potential shortcomings of traditional log-linear models. This paper adopts a fully Bayesian approach that accounts for all sources of uncertainty, including uncertainty about the population frequencies, and supplies unconditional (posterior) variances and credible intervals.
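As a rough illustration of the setting described in the abstract, the sketch below cross-classifies a toy microdata sample by its key variables and lists the sample-unique cells. It is not the authors' model: the records, the variable names, and the helper `poisson_risks` are hypothetical. The per-cell risk formulas assume that the population frequency of a cell is Poisson with mean `lam` and that records are sampled independently with fraction `pi`; in practice `lam` would come from a fitted log-linear model, whereas here it is simply supplied by hand.

```python
from collections import Counter
import math

# Hypothetical toy microdata: each record is a tuple of key-variable
# categories (sex, age band, region). The values are illustrative only.
sample = [
    ("F", "30-39", "North"),
    ("F", "30-39", "North"),
    ("M", "30-39", "North"),
    ("M", "40-49", "South"),
    ("F", "40-49", "South"),
    ("F", "40-49", "South"),
    ("M", "50-59", "South"),
]


def cell_counts(records):
    """Cross-classify records by key-variable combination (sparse table)."""
    return Counter(records)


def sample_uniques(counts):
    """Return, in sorted order, the cells with sample frequency f_k = 1."""
    return sorted(cell for cell, f in counts.items() if f == 1)


def poisson_risks(lam, pi):
    """Risks for one sample-unique cell under the assumed Poisson model.

    Assumes F_k ~ Poisson(lam) and sampling fraction pi, so that
    F_k - 1 given f_k = 1 is Poisson with mean mu = lam * (1 - pi).
    Returns (P(F_k = 1 | f_k = 1), E[1 / F_k | f_k = 1]).
    """
    mu = lam * (1.0 - pi)
    if mu == 0.0:
        # No unsampled units expected: the cell is population unique.
        return 1.0, 1.0
    return math.exp(-mu), (1.0 - math.exp(-mu)) / mu


counts = cell_counts(sample)
uniques = sample_uniques(counts)
for cell in uniques:
    # lam = 2.0 and pi = 0.5 are placeholder values, not estimates.
    p_unique, expected_inverse = poisson_risks(lam=2.0, pi=0.5)
    print(cell, round(p_unique, 4), round(expected_inverse, 4))
```

Summing the two per-cell quantities over all sample-unique cells gives two global summaries under this assumed model: the expected number of sample uniques that are also population unique, and the expected number of correct matches for sample uniques.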

Article information

Source
Ann. Appl. Stat., Volume 9, Number 1 (2015), 525-546.

Dates
First available in Project Euclid: 28 April 2015

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1430226103

Digital Object Identifier
doi:10.1214/15-AOAS807

Mathematical Reviews number (MathSciNet)
MR3341126

Zentralblatt MATH identifier
06446579

Keywords
Bayesian nonparametric models; confidentiality; disclosure risk; Dirichlet process; log-linear models; mixed effects models

Citation

Carota, Cinzia; Filippone, Maurizio; Leombruni, Roberto; Polettini, Silvia. Bayesian nonparametric disclosure risk estimation via mixed effects log-linear models. Ann. Appl. Stat. 9 (2015), no. 1, 525–546. doi:10.1214/15-AOAS807. https://projecteuclid.org/euclid.aoas/1430226103


