The Annals of Applied Statistics

A hierarchical Bayesian approach to record linkage and population size problems

Andrea Tancredi and Brunero Liseo

Abstract

We propose and illustrate a hierarchical Bayesian approach for matching statistical records observed on different occasions. We show how this model can be applied both to record linkage problems and to capture–recapture setups, where the size of a finite population is the real object of interest. The proposed model-based approach differs from current practice in record linkage in at least two important ways. First, the statistical model is built directly on the observed categorical variables, with no reduction of the available information to 0–1 comparisons. Second, the hierarchical structure of the model allows uncertainty to propagate in both directions between the parameter estimation step and the matching procedure, so that no plug-in estimates are used and uncertainty is correctly accounted for both in estimating the population size and in performing the record linkage. We illustrate and motivate our proposal through a real data example and simulations.

Article information

Source
Ann. Appl. Stat. Volume 5, Number 2B (2011), 1553-1585.

Dates
First available: 13 July 2011

Permanent link to this document
http://projecteuclid.org/euclid.aoas/1310562733

Digital Object Identifier
doi:10.1214/10-AOAS447

Mathematical Reviews number (MathSciNet)
MR2849786

Zentralblatt MATH identifier
1223.62015

Citation

Tancredi, Andrea; Liseo, Brunero. A hierarchical Bayesian approach to record linkage and population size problems. The Annals of Applied Statistics 5 (2011), no. 2B, 1553–1585. doi:10.1214/10-AOAS447. http://projecteuclid.org/euclid.aoas/1310562733.


Supplemental materials

  • Supplementary material: Data files and code. The files exampleA.dat, exampleB.dat and exampleV.dat contain the data used in Section 5. The files B.Cat.matching.example.R, example.R, functions.r and gibbs.c contain the code. The file supplementary_figure.pdf shows the trace plots for the application described in Section 5.