Bayesian Analysis
- Bayesian Anal.
- Volume 15, Number 2 (2020), 633-682.
A Unified Framework for De-Duplication and Population Size Estimation (with Discussion)
Andrea Tancredi, Rebecca Steorts, and Brunero Liseo
Abstract
Data de-duplication is the process of detecting records in one or more datasets that refer to the same entity. In this paper we tackle the de-duplication process via a latent entity model, where the observed data are perturbed versions of a set of key variables drawn from a finite population of different entities. The main novelty of our approach is to consider the population size as an unknown model parameter. As a result, a salient feature of the proposed method is the capability of the model to account for the de-duplication uncertainty in the population size estimation. As by-products of our approach, we illustrate the relationships between de-duplication problems and capture-recapture models, and we obtain a more adequate prior distribution on the linkage structure. Moreover, we propose a novel simulation algorithm for the posterior distribution of the matching configuration based on the marginalization of the key variables at population level. We apply our method to two synthetic datasets comprising German names. In addition, we illustrate a real data application, where we match records from two lists which report information about people killed in the recent Syrian conflict.
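To make the link between record matching and capture-recapture concrete, the following minimal sketch uses the classical two-list Lincoln-Petersen estimator and Chapman's bias-corrected variant. It is not the Bayesian latent entity model of the paper, only the textbook estimator that such models generalize. The list sizes and match counts are hypothetical numbers chosen for illustration, and the function names are our own.

```python
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Classical two-list estimate of population size.

    n1, n2: number of distinct entities on each list (after de-duplication).
    m: number of entities found on both lists (cross-list matches).
    """
    if m <= 0:
        raise ValueError("at least one cross-list match is required")
    return n1 * n2 / m


def chapman(n1: int, n2: int, m: int) -> float:
    """Chapman's bias-corrected variant; defined even when m == 0."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical counts, for illustration only.
n1, n2 = 200, 150

# If the number of matches is uncertain (say anywhere from 50 to 70),
# the implied population size changes substantially.
for m in (50, 60, 70):
    print(m, round(lincoln_petersen(n1, n2, m), 1), round(chapman(n1, n2, m), 1))
```

With these made-up counts, moving the match count from 50 to 70 shifts the Lincoln-Petersen estimate from 600 down to roughly 429, which is why treating the matching configuration as fixed can understate the uncertainty in the population size.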
Note
BA Webinar: https://www.youtube.com/watch?v=eLZwDOUAqc8&t=752s.
Article information
Source
Bayesian Anal., Volume 15, Number 2 (2020), 633-682.
Dates
First available in Project Euclid: 7 March 2019
Permanent link to this document
https://projecteuclid.org/euclid.ba/1551949260
Digital Object Identifier
doi:10.1214/19-BA1146
Mathematical Reviews number (MathSciNet)
MR4122517
Keywords
cluster analysis; entity resolution; partition models; record linkage
Rights
Creative Commons Attribution 4.0 International License.
Citation
Tancredi, Andrea; Steorts, Rebecca; Liseo, Brunero. A Unified Framework for De-Duplication and Population Size Estimation (with Discussion). Bayesian Anal. 15 (2020), no. 2, 633-682. doi:10.1214/19-BA1146. https://projecteuclid.org/euclid.ba/1551949260
Supplemental materials
- Supplementary Material for “A Unified Framework for De-Duplication and Population Size Estimation”. Digital Object Identifier: doi:10.1214/19-BA1146SUPP