The Annals of Applied Statistics

Bayesian propagation of record linkage uncertainty into population size estimation of human rights violations

Mauricio Sadinle

Abstract

Multiple-systems estimation, also known as capture–recapture estimation, is widely used for population size estimation, particularly in the quantitative study of human rights violations. It relies on multiple samples from the population, along with information on which individuals appear in which samples. The goal of record linkage techniques is to identify unique individuals across samples based on the information collected on them. Linkage decisions are subject to uncertainty when such information contains errors and missingness, and when different individuals have very similar characteristics. Uncertainty in the linkage should be propagated into the stage of population size estimation. We propose an approach called linkage-averaging to propagate linkage uncertainty, as quantified by some Bayesian record linkage methodologies, into a subsequent stage of population size estimation. Linkage-averaging is a two-stage approach in which the results from the record linkage stage are fed into the population size estimation stage. We show that under some conditions the results of this approach correspond to those of a proper Bayesian joint model for both record linkage and population size estimation. The two-stage nature of linkage-averaging allows us to combine different record linkage models with different capture–recapture models, which facilitates model exploration. We present a case study from the Salvadoran civil war, where we are interested in estimating the total number of civilian killings using lists of witnesses’ reports collected by different organizations. These lists contain duplicates, typographical and spelling errors, missingness, and other inaccuracies that lead to uncertainty in the linkage. We show how linkage-averaging can be used to transfer the uncertainty in the linkage of these lists into different models for population size estimation.
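
The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that stage one has already produced posterior draws of the linkage between two lists, and that stage two is a deliberately simple model chosen only for illustration: two independent lists, uniform priors on the capture probabilities, and a prior on the population size N proportional to 1/N. The function names, the list sizes, and the toy linkage draws are all hypothetical. The population size posterior is computed once per linkage draw and the results are averaged.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def posterior_n(n_a, n_b, n_both, n_max):
    """Posterior pmf of N given one fixed linkage of two lists.

    Illustrative model (an assumption, not the paper's): independent lists,
    Beta(1, 1) priors on capture probabilities, prior p(N) proportional to 1/N,
    truncated at n_max.
    """
    n_obs = n_a + n_b - n_both  # unique individuals implied by this linkage
    grid = list(range(n_obs, n_max + 1))
    log_w = [
        -log(n) + lgamma(n + 1) - lgamma(n - n_obs + 1)
        + log_beta(n_a + 1, n - n_a + 1)
        + log_beta(n_b + 1, n - n_b + 1)
        for n in grid
    ]
    top = max(log_w)  # subtract the maximum for numerical stability
    w = [exp(v - top) for v in log_w]
    total = sum(w)
    return {n: v / total for n, v in zip(grid, w)}


def linkage_average(draws, n_a, n_b, n_max):
    """Average the stage-two posterior of N over stage-one linkage draws."""
    avg = {}
    for matches in draws:
        pmf = posterior_n(n_a, n_b, len(matches), n_max)
        for n, p in pmf.items():
            avg[n] = avg.get(n, 0.0) + p / len(draws)
    return avg


# Hypothetical input: list sizes and four posterior linkage draws, each a set
# of (record in list A, record in list B) pairs declared to be the same person.
N_A, N_B = 60, 50
linkage_draws = [{(i, i) for i in range(k)} for k in (28, 30, 30, 32)]

avg = linkage_average(linkage_draws, N_A, N_B, n_max=1000)
post_mean = sum(n * p for n, p in avg.items())
print(round(post_mean, 1))
```

Because the two stages only communicate through the linkage draws, `posterior_n` could be swapped for a different capture–recapture model without touching the linkage stage, which is the kind of model exploration the abstract refers to.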

Article information

Source
Ann. Appl. Stat., Volume 12, Number 2 (2018), 1013-1038.

Dates
Received: November 2017
Revised: April 2018
First available in Project Euclid: 28 July 2018

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1532743484

Digital Object Identifier
doi:10.1214/18-AOAS1178

Mathematical Reviews number (MathSciNet)
MR3834293

Keywords
Capture–recapture; counting casualties; data linkage; decomposable graphical model; duplicate detection; entity resolution; multiple-systems estimation; multiple record linkage

Citation

Sadinle, Mauricio. Bayesian propagation of record linkage uncertainty into population size estimation of human rights violations. Ann. Appl. Stat. 12 (2018), no. 2, 1013–1038. doi:10.1214/18-AOAS1178. https://projecteuclid.org/euclid.aoas/1532743484

