Bayesian Analysis

Dirichlet Process Mixture Models for Modeling and Generating Synthetic Versions of Nested Categorical Data

Jingchen Hu, Jerome P. Reiter, and Quanli Wang

Full-text: Open access

Abstract

We present a Bayesian model for estimating the joint distribution of multivariate categorical data when units are nested within groups. Such data arise frequently in social science settings, for example, people living in households. The model assumes that (i) each group is a member of a group-level latent class, and (ii) each unit is a member of a unit-level latent class nested within its group-level latent class. This structure allows the model to capture dependence among units in the same group. It also facilitates simultaneous modeling of variables at both group and unit levels. We develop a version of the model that assigns zero probability to groups and units with physically impossible combinations of variables. We apply the model to estimate multivariate relationships in a subset of the American Community Survey. Using the estimated model, we generate synthetic household data that could be disseminated as redacted public use files. Supplementary materials (Hu et al., 2017) for this article are available online.
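The two-level latent class structure described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses truncated stick-breaking weights as a finite stand-in for the Dirichlet process priors, treats household sizes as given inputs, and omits the structural-zero constraints on impossible combinations. All function names, parameter names, and dimensions (`simulate_households`, `n_group_classes`, `n_unit_classes`, `group_levels`, `unit_levels`) are hypothetical choices made for illustration.

```python
import numpy as np


def stick_breaking(concentration, n_classes, rng):
    """Truncated stick-breaking weights; the last stick absorbs the remainder."""
    v = rng.beta(1.0, concentration, size=n_classes)
    v[-1] = 1.0  # truncation so the weights sum to one
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining


def simulate_households(group_sizes, n_group_classes=3, n_unit_classes=2,
                        group_levels=(3,), unit_levels=(2, 4), seed=0):
    """Draw nested categorical data from a two-level latent class model.

    group_sizes  : number of units in each group (assumed known here)
    group_levels : number of categories of each group-level variable
    unit_levels  : number of categories of each unit-level variable
    """
    rng = np.random.default_rng(seed)
    F, S = n_group_classes, n_unit_classes

    # Probability of each group-level class.
    pi = stick_breaking(1.0, F, rng)
    # Probability of each unit-level class, nested within each group-level class.
    omega = np.stack([stick_breaking(1.0, S, rng) for _ in range(F)])
    # Category probabilities: group variables depend on the group class,
    # unit variables depend on the (group class, unit class) pair.
    lam = [rng.dirichlet(np.ones(d), size=F) for d in group_levels]
    phi = [rng.dirichlet(np.ones(d), size=(F, S)) for d in unit_levels]

    households = []
    for size in group_sizes:
        g = int(rng.choice(F, p=pi))  # (i) group-level latent class
        group_vars = [int(rng.choice(p.shape[-1], p=p[g])) for p in lam]
        units = []
        for _ in range(size):
            m = int(rng.choice(S, p=omega[g]))  # (ii) unit class within g
            units.append([int(rng.choice(p.shape[-1], p=p[g, m])) for p in phi])
        households.append({"group_class": g,
                           "group_vars": group_vars,
                           "units": units})
    return households


if __name__ == "__main__":
    for household in simulate_households([2, 3, 1]):
        print(household)
```

Because every unit in a household shares the same group-level class, its unit-level variables are drawn from related distributions, which is how this kind of structure induces within-group dependence. A version that respects structural zeros would additionally need to exclude impossible households, for example by rejecting and redrawing them.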

Article information

Source
Bayesian Anal. Volume 13, Number 1 (2018), 183-200.

Dates
First available in Project Euclid: 24 January 2017

Permanent link to this document
https://projecteuclid.org/euclid.ba/1485227030

Digital Object Identifier
doi:10.1214/16-BA1047

Keywords
confidentiality, disclosure, latent, multinomial, synthetic

Rights
Creative Commons Attribution 4.0 International License.

Citation

Hu, Jingchen; Reiter, Jerome P.; Wang, Quanli. Dirichlet Process Mixture Models for Modeling and Generating Synthetic Versions of Nested Categorical Data. Bayesian Anal. 13 (2018), no. 1, 183–200. doi:10.1214/16-BA1047. https://projecteuclid.org/euclid.ba/1485227030



References

  • Abowd, J., Stinson, M., and Benedetto, G. (2006). “Final Report to the Social Security Administration on the SIPP/SSA/IRS Public Use File Project.” Technical report, U.S. Census Bureau Longitudinal Employer-Household Dynamics Program. Available at http://www.census.gov/sipp/synth_data.html.
  • Albert, J. H. and Chib, S. (1993). “Bayesian analysis of binary and polychotomous response data.” Journal of the American Statistical Association, 88: 669–679.
  • Bennink, M., Croon, M. A., Kroon, B., and Vermunt, J. K. (2016). “Micro-macro multilevel latent class models with multiple discrete individual-level variables.” Advances in Data Analysis and Classification, 10(2): 139–154.
  • Dunson, D. B. and Xing, C. (2009). “Nonparametric Bayes modeling of multivariate categorical data.” Journal of the American Statistical Association, 104: 1042–1051.
  • Fellegi, I. P. and Holt, D. (1976). “A systematic approach to automatic edit and imputation.” Journal of the American Statistical Association, 71: 17–35.
  • Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian Data Analysis. London: Chapman & Hall.
  • Goodman, L. A. (1974). “Exploratory latent structure analysis using both identifiable and unidentifiable models.” Biometrika, 61: 215–231.
  • Hawala, S. (2008). “Producing partially synthetic data to avoid disclosure.” In Proceedings of the Joint Statistical Meetings. Alexandria, VA: American Statistical Association.
  • Hoff, P. D. (2009). A First Course in Bayesian Statistical Methods. New York: Springer.
  • Hu, J., Reiter, J. P., and Wang, Q. (2014). “Disclosure risk evaluation for fully synthetic categorical data.” In Domingo-Ferrer, J. (ed.), Privacy in Statistical Databases, 185–199. Springer.
  • Hu, J., Reiter, J. P., and Wang, Q. (2017). “Supplementary Materials for “Dirichlet Process Mixture Models for Modeling and Generating Synthetic Versions of Nested Categorical Data”.” Bayesian Analysis.
  • Ishwaran, H. and James, L. F. (2001). “Gibbs sampling methods for stick-breaking priors.” Journal of the American Statistical Association, 161–173.
  • Jain, S. and Neal, R. M. (2007). “Splitting and merging components of a nonconjugate Dirichlet process mixture model.” Bayesian Analysis, 2: 445–472.
  • Kim, H. J., Cox, L. H., Karr, A. F., Reiter, J. P., and Wang, Q. (2015). “Simultaneous editing and imputation for continuous data.” Journal of the American Statistical Association, 110: 987–999.
  • Kinney, S., Reiter, J. P., Reznek, A. P., Miranda, J., Jarmin, R. S., and Abowd, J. M. (2011). “Towards unrestricted public use business microdata: The synthetic Longitudinal Business Database.” International Statistical Review, 79: 363–384.
  • Kunihama, T., Herring, A. H., Halpern, C. T., and Dunson, D. B. (2014). “Nonparametric Bayes modeling with sample survey weights.” arXiv:1409.5914.
  • Little, R. J. A. (1993). “Statistical analysis of masked data.” Journal of Official Statistics, 9: 407–426.
  • Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., and Vilhuber, L. (2008). “Privacy: Theory meets practice on the map.” In IEEE 24th International Conference on Data Engineering, 277–286.
  • Manrique-Vallier, D. and Reiter, J. P. (2014). “Bayesian estimation of discrete multivariate latent structure models with structural zeros.” Journal of Computational and Graphical Statistics, 23: 1061–1079.
  • Manrique-Vallier, D. and Reiter, J. P. (forthcoming). “Bayesian simultaneous edit and imputation for multivariate categorical data.” Journal of the American Statistical Association, to appear.
  • Murray, J. S. and Reiter, J. P. (forthcoming). “Multiple imputation of missing categorical and continuous values via Bayesian mixture models with local dependence.” Journal of the American Statistical Association, to appear.
  • Raghunathan, T. E., Reiter, J. P., and Rubin, D. B. (2003). “Multiple imputation for statistical disclosure limitation.” Journal of Official Statistics, 19: 1–16.
  • Reiter, J. and Raghunathan, T. E. (2007). “The multiple adaptations of multiple imputation.” Journal of the American Statistical Association, 102: 1462–1471.
  • Reiter, J. P. (2003). “Inference for partially synthetic, public use microdata sets.” Survey Methodology, 29: 181–189.
  • Reiter, J. P. (2005). “Releasing multiply-imputed, synthetic public use microdata: An illustration and empirical study.” Journal of the Royal Statistical Society, Series A, 168: 185–205.
  • Rodriguez, A., Dunson, D. B., and Gelfand, A. E. (2008). “The nested Dirichlet process.” Journal of the American Statistical Association, 103: 1131–1154.
  • Rubin, D. B. (1993). “Discussion: Statistical disclosure limitation.” Journal of Official Statistics, 9: 462–468.
  • Ruggles, S., Alexander, J. T., Genadek, K., Goeken, R., Schroeder, M. B., and Sobek, M. (2010). “Integrated Public Use Microdata Series: Version 5.0 [Machine-readable database].” Minneapolis: University of Minnesota.
  • Schifeling, T. and Reiter, J. P. (2016). “Incorporating marginal prior information in latent class models.” Bayesian Analysis, 2: 499–518.
  • Sethuraman, J. (1994). “A constructive definition of Dirichlet priors.” Statistica Sinica, 4: 639–650.
  • Si, Y. and Reiter, J. P. (2013). “Nonparametric Bayesian multiple imputation for incomplete categorical variables in large-scale assessment surveys.” Journal of Educational and Behavioral Statistics, 38: 499–521.
  • Vermunt, J. K. (2003). “Multilevel latent class models.” Sociological Methodology, 213–239.
  • Vermunt, J. K. (2008). “Latent class and finite mixture models for multilevel data sets.” Statistical Methods in Medical Research, 33–51.
  • Wade, S., Mongelluzzo, S., and Petrone, S. (2011). “An enriched conjugate prior for Bayesian nonparametric inference.” Bayesian Analysis, 6: 359–385.
