The Annals of Applied Statistics

Providing accurate models across private partitioned data: Secure maximum likelihood estimation

Joshua Snoke, Timothy R. Brick, Aleksandra Slavković, and Michael D. Hunter

Abstract

This paper focuses on the privacy paradigm in which researchers are given remote access to carry out analyses on sensitive data stored behind separate firewalls. We address the situation where the analysis demands data from multiple physically separate databases that cannot be combined. Motivating this work is a real model based on research data on kinship foster placement that came from multiple sources and could only be combined through a lengthy process with a trusted research network. We develop and demonstrate a method for accurately calculating the multivariate normal likelihood for a set of parameters given the partitioned data; this likelihood can then be maximized to obtain estimates. These estimates are obtained without sharing any data or any true intermediate statistics of the data across firewalls. We show that, under a certain set of assumptions, our method for estimation across these partitions achieves results identical to estimation with the full data. Privacy is maintained by adding noise at each partition. This ensures that each party receives noisy statistics, and the noise cannot be removed until the last step, which yields a single value: the true total log likelihood. Potential applications include all methods that estimate parameters by maximizing the multivariate normal likelihood. We give detailed algorithms, along with available software, present simulations, and analyze the kinship foster placement data by estimating structural equation models (SEMs) with partitioned data.
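
The idea of masking each partition's contribution so that only the total log likelihood is recoverable can be illustrated with a small sketch. This is not the authors' algorithm: it assumes the simplest case of horizontally partitioned data (each party holds complete rows), a bivariate normal model, and a generic zero-sum pairwise masking scheme. The function names `bivariate_normal_loglik` and `masked_contributions`, the mask scale, and the toy data are all hypothetical.

```python
import math
import random


def bivariate_normal_loglik(rows, mu, sigma):
    """Sum of log N(x; mu, sigma) over 2-D rows; sigma is a 2x2 covariance."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("sigma must be positive definite")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    total = 0.0
    for x in rows:
        z0, z1 = x[0] - mu[0], x[1] - mu[1]
        quad = (z0 * (inv[0][0] * z0 + inv[0][1] * z1)
                + z1 * (inv[1][0] * z0 + inv[1][1] * z1))
        total += -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)
    return total


def masked_contributions(local_logliks, rng, scale=1e6):
    """Add pairwise masks that cancel in the sum (illustrative scheme only).

    Each pair of parties (i, j) shares one random mask: party i adds it and
    party j subtracts it. Each released value is noisy on its own, but the
    masks cancel when all released values are summed.
    """
    n = len(local_logliks)
    released = list(local_logliks)
    for i in range(n):
        for j in range(i + 1, n):
            m = rng.uniform(-scale, scale)  # known to parties i and j only
            released[i] += m
            released[j] -= m
    return released


# Toy data: three parties, each holding its own rows (assumed layout).
partitions = [
    [(0.1, 0.2), (1.0, 0.8)],
    [(-0.5, 0.3)],
    [(0.7, -0.2), (0.0, 0.1), (1.2, 1.1)],
]
mu = (0.4, 0.4)
sigma = ((1.0, 0.3), (0.3, 1.0))

local = [bivariate_normal_loglik(p, mu, sigma) for p in partitions]
released = masked_contributions(local, random.Random(0))

secure_total = sum(released)  # masks cancel up to floating-point rounding
pooled_total = bivariate_normal_loglik(
    [x for p in partitions for x in p], mu, sigma
)
print(abs(secure_total - pooled_total) < 1e-6)
```

In a full estimation routine, an optimizer would request such a total at each candidate value of the parameters. Data partitioned by variable rather than by row would need additional machinery that this sketch does not attempt.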

Article information

Source
Ann. Appl. Stat., Volume 12, Number 2 (2018), 877-914.

Dates
Received: November 2017
Revised: April 2018
First available in Project Euclid: 28 July 2018

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1532743480

Digital Object Identifier
doi:10.1214/18-AOAS1171

Keywords
Partitioned data; privacy; secure multiparty computation; structural equation models; distributed maximum likelihood estimation

Citation

Snoke, Joshua; Brick, Timothy R.; Slavković, Aleksandra; Hunter, Michael D. Providing accurate models across private partitioned data: Secure maximum likelihood estimation. Ann. Appl. Stat. 12 (2018), no. 2, 877–914. doi:10.1214/18-AOAS1171. https://projecteuclid.org/euclid.aoas/1532743480


