Bayesian Analysis

Entity Resolution with Empirically Motivated Priors

Rebecca C. Steorts

Full-text: Open access

Abstract

Databases often contain corrupted, degraded, and noisy data with duplicate entries across and within each database. Such problems arise in citations, medical databases, genetics, human rights databases, and a variety of other applied settings. The target of statistical inference can be viewed as an unsupervised problem of determining the edges of a bipartite graph that links the observed records to unobserved latent entities. Bayesian approaches provide attractive benefits, naturally providing uncertainty quantification via posterior probabilities. We propose a novel record linkage approach based on empirical Bayesian principles. Specifically, the empirical Bayesian-type step consists of taking the empirical distribution function of the data as the prior for the latent entities. This approach improves on the earlier hierarchical Bayesian (HB) approach not only by avoiding the prior specification problem but also by allowing both categorical and string-valued variables. Our extension to string-valued variables also involves the proposal of a new probabilistic mechanism by which observed record values for string fields can deviate from the values of their associated latent entities. Categorical fields that deviate from their corresponding true value are simply drawn from the empirical distribution function. We apply our proposed methodology to a simulated data set of German names and an Italian household survey on income and wealth, showing that our method performs favorably compared to several standard methods in the literature. We also consider the robustness of our methods to changes in the hyperparameters.
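The categorical part of the mechanism described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy field values, and the distortion probability `beta` are hypothetical, and the string-field mechanism is not covered. The sketch only shows (i) using the empirical distribution of an observed field as the prior for a latent entity's value, and (ii) redrawing a distorted categorical field from that same empirical distribution.

```python
import random
from collections import Counter


def empirical_distribution(values):
    """Return {value: relative frequency} for one categorical field."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}


def draw(dist, rng):
    """Draw one value from a {value: probability} mapping."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def generate_record_field(latent_value, dist, beta, rng):
    """Copy the latent value with probability 1 - beta; otherwise
    (a distorted field) redraw from the empirical distribution.
    `beta` is a hypothetical distortion probability."""
    if rng.random() < beta:
        return draw(dist, rng)
    return latent_value


# Toy observed values for a single categorical field (illustrative only).
observed = ["A", "A", "B", "C", "A", "B"]
dist = empirical_distribution(observed)

rng = random.Random(0)
latent = draw(dist, rng)  # latent entity value drawn from the empirical prior
record = generate_record_field(latent, dist, beta=0.1, rng=rng)
```

Because the prior is built directly from the observed data, no separate prior over field values has to be specified, which is the point the abstract makes about avoiding the prior specification problem.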

Article information

Source
Bayesian Anal. Volume 10, Number 4 (2015), 849-875.

Dates
First available in Project Euclid: 9 September 2015

Permanent link to this document
https://projecteuclid.org/euclid.ba/1441790411

Digital Object Identifier
doi:10.1214/15-BA965SI

Mathematical Reviews number (MathSciNet)
MR3432242

Zentralblatt MATH identifier
1335.62023

Citation

Steorts, Rebecca C. Entity Resolution with Empirically Motivated Priors. Bayesian Anal. 10 (2015), no. 4, 849–875. doi:10.1214/15-BA965SI. https://projecteuclid.org/euclid.ba/1441790411.


