## The Annals of Applied Statistics

### Unique entity estimation with application to the Syrian conflict

#### Abstract

Entity resolution identifies and removes duplicate entities in large, noisy databases and has grown in both usage and new developments as a result of increased data availability. Nevertheless, entity resolution has tradeoffs regarding assumptions of the data generation process, error rates, and computational scalability that make it a difficult task for real applications. In this paper, we focus on a related problem of unique entity estimation, which is the task of estimating the number of unique entities and associated standard errors in a data set with duplicate entities. Unique entity estimation shares many fundamental challenges of entity resolution, namely, that the computational cost of all-to-all entity comparisons is intractable for large databases. To circumvent this computational barrier, we propose an efficient (near-linear time) estimation algorithm based on locality sensitive hashing. Our estimator, under realistic assumptions, is unbiased and has provably low variance compared to existing random sampling based approaches. In addition, we empirically show its superiority over the state-of-the-art estimators on three real applications. The motivation for our work is to derive an accurate estimate of the documented, identifiable deaths in the ongoing Syrian conflict. Our methodology, when applied to the Syrian data set, provides an estimate of $191\text{,}874\pm 1\text{,}772$ documented, identifiable deaths, which is very close to the Human Rights Data Analysis Group (HRDAG) estimate of 191,369. Our work provides an example of challenges and efforts involved in solving a real, noisy, challenging problem where modeling assumptions may not hold.
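To illustrate why locality sensitive hashing avoids all-to-all comparisons, the following is a minimal Python sketch, not the estimator proposed in the paper. It assumes records are plain strings, uses MinHash over character bigrams with a banding scheme to generate candidate pairs, and counts connected components with union-find. The function names, the parameters (`bands`, `rows`, `threshold`) and the toy records are hypothetical choices made for illustration. The paper's estimator additionally provides unbiasedness guarantees and standard errors, which this sketch does not attempt.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # modulus for the illustrative hash family


def shingles(text, k=2):
    """Return the set of character k-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def minhash_signature(tokens, params):
    """One min-wise hash value per (a, b) pair in params."""
    base = [zlib.crc32(t.encode("utf-8")) for t in tokens]
    return tuple(min((a * h + b) % PRIME for h in base) for a, b in params)


def count_unique_entities(records, bands=20, rows=2, threshold=0.5, seed=0):
    """Count connected components over LSH candidate pairs (illustrative only)."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(bands * rows)]
    sets = [shingles(r) for r in records]
    sigs = [minhash_signature(s, params) for s in sets]

    # Records sharing any band of their signature become candidate pairs,
    # so only records within the same bucket are ever compared.
    buckets = defaultdict(list)
    for idx, sig in enumerate(sigs):
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)

    # Union-find over candidate pairs that pass the similarity threshold.
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members in buckets.values():
        for i, j in combinations(members, 2):
            if jaccard(sets[i], sets[j]) >= threshold:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(len(records))})


# Toy records (hypothetical): two exact duplicates and one singleton.
toy = ["john smith", "john smith", "maria garcia", "maria garcia", "wei chen"]
unique_count = count_unique_entities(toy)
```

Exact duplicates always share every bucket, so they are always merged; near-duplicates are merged only with some probability that depends on `bands` and `rows`, which is one reason a principled estimator with quantified error is needed in practice.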

#### Article information

Source
Ann. Appl. Stat., Volume 12, Number 2 (2018), 1039-1067.

Dates
Revised: March 2018
First available in Project Euclid: 28 July 2018

https://projecteuclid.org/euclid.aoas/1532743485

Digital Object Identifier
doi:10.1214/18-AOAS1163

Mathematical Reviews number (MathSciNet)
MR3834294

#### Citation

Chen, Beidi; Shrivastava, Anshumali; Steorts, Rebecca C. Unique entity estimation with application to the Syrian conflict. Ann. Appl. Stat. 12 (2018), no. 2, 1039–1067. doi:10.1214/18-AOAS1163. https://projecteuclid.org/euclid.aoas/1532743485

#### References

• Aleksandrov, P. S. (1947). Combinatorial Topology 1. Courier Corporation.
• Andoni, A. and Indyk, P. (2004). E2lsh: Exact Euclidean locality sensitive hashing. Technical report.
• Baxter, R., Christen, P., Churches, T. et al. (2003). A comparison of fast blocking methods for record linkage. In ACM SIGKDD 3 25–27.
• Bhattacharya, I. and Getoor, L. (2006). A latent Dirichlet model for unsupervised entity resolution. In Proceedings of the Sixth SIAM International Conference on Data Mining 47–58. SIAM, Philadelphia, PA.
• Broder, A. Z. (1997a). On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences 1997 (SEQUENCES’97) 21–29. IEEE Computer Society, Washington, DC.
• Broder, A. Z. (1997b). On the resemblance and containment of documents. In The Compression and Complexity of Sequences 21–29.
• Chazelle, B., Rubinfeld, R. and Trevisan, L. (2005). Approximating the minimum spanning tree weight in sublinear time. SIAM J. Comput. 34 1370–1379.
• Chen, B., Shrivastava, A. and Steorts, R. C. (2018). Supplement to “Unique entity estimation with application to the Syrian conflict.” DOI:10.1214/18-AOAS1163SUPP.
• Chen, B., Xu, Y. and Shrivastava, A. (2018). LSH sampling breaks the computational chicken-and-egg loop in adaptive stochastic gradient estimation.
• Christen, P. (2012). A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. 24 1537–1555.
• Christen, P. (2014). Preparation of a real voter data set for record linkage and duplicate detection research. Tech. report.
• Deming, W. E. and Glasser, G. J. (1959). On the problem of matching lists by samples. J. Amer. Statist. Assoc. 54 403–415.
• Erdős, P. and Rényi, A. (1960). On the evolution of random graphs. Magy. Tud. Akad. Mat. Kut. Intéz. Közl. 5 17–61.
• Fellegi, I. and Sunter, A. (1969). A theory for record linkage. J. Amer. Statist. Assoc. 64 1183–1210.
• Frank, O. (1978). Estimation of the number of connected components in a graph by using a sampled subgraph. Scand. J. Stat. 5 177–188.
• Gionis, A., Indyk, P., Motwani, R. et al. (1999). Similarity search in high dimensions via hashing. In Very Large Data Bases (VLDB) 99 518–529.
• Grillo, C. (2016). Judges in Habre trial cite HRDAG analysis.
• Gutman, R., Afendulis, C. C. and Zaslavsky, A. M. (2013). A Bayesian procedure for file linking to analyze end-of-life medical costs. J. Amer. Statist. Assoc. 108 34–47.
• Indyk, P. and Motwani, R. (1999). Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC’98 (Dallas, TX) 604–613. ACM, New York.
• Liang, H., Wang, Y., Christen, P. and Gayler, R. (2014). Noise-tolerant approximate blocking for dynamic real-time entity resolution. In Pacific-Asia Conference on Knowledge Discovery and Data Mining 449–460. Springer, Berlin.
• Liseo, B. and Tancredi, A. (2013). Some advances on Bayesian record linkage and inference for linked data. Available at https://pdfs.semanticscholar.org/8926/9690219564cddf7d0b91ec5f692fef13b9a9.pdf.
• Luo, C. and Shrivastava, A. (2017). Arrays of (locality-sensitive) count estimators (ACE): High-speed anomaly detection via cache lookups. Preprint. Available at arXiv:1706.06664.
• Luo, C. and Shrivastava, A. (2018). Scaling-up split-merge MCMC with Locality Sensitive Sampling (LSS). Preprint. Available at arXiv:1802.07444.
• McCallum, A., Nigam, K. and Ungar, L. H. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 169–178. ACM, New York.
• McCallum, A. and Wellner, B. (2004). Conditional models of identity uncertainty with application to noun coreference. In Advances in Neural Information Processing Systems (NIPS’04) 905–912. MIT Press, Cambridge, MA.
• Paulevé, L., Jégou, H. and Amsaleg, L. (2010). Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recogn. Lett. 31 1348–1358.
• Price, M., Klingner, J., Qtiesh, A. and Ball, P. (2014). Updated statistical analysis of documentation of killings in the Syrian Arab Republic. United Nations Office of the UN High Commissioner for Human Rights.
• Provan, J. S. and Ball, M. O. (1983). The complexity of counting cuts and of computing the probability that a graph is connected. SIAM J. Comput. 12 777–788.
• Rajaraman, A. and Ullman, J. D. (2012). Mining of Massive Datasets. Cambridge Univ. Press, Cambridge, MA.
• Sadinle, M. (2014). Detecting duplicates in a homicide registry using a Bayesian partitioning approach. Ann. Appl. Stat. 8 2404–2434.
• Sadosky, P., Shrivastava, A., Price, M. and Steorts, R. C. (2015). Blocking methods applied to casualty records from the Syrian conflict. ArXiv preprint. Available at arXiv:1510.07714.
• Shrivastava, A. and Li, P. (2014a). Densifying one permutation hashing via rotation for fast near neighbor search. In Proceedings of the 31st International Conference on Machine Learning 557–565.
• Shrivastava, A. and Li, P. (2014b). Improved densification of one permutation hashing. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence.
• Shrivastava, A. and Li, P. (2014c). In defense of Minhash over Simhash. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics 886–894.
• Spring, R. and Shrivastava, A. (2017a). A new unbiased and efficient class of LSH-based samplers and estimators for partition function computation in log-linear models. Preprint. Available at arXiv:1703.05160.
• Spring, R. and Shrivastava, A. (2017b). Scalable and sustainable deep learning via randomized hashing. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 445–454. ACM, New York.
• Steorts, R. C. (2015). Entity resolution with empirically motivated priors. Bayesian Anal. 10 849–875.
• Steorts, R. C., Hall, R. and Fienberg, S. E. (2014). SMERED: A Bayesian approach to graphical record linkage and de-duplication. J. Mach. Learn. Res. 33 922–930.
• Steorts, R. C., Hall, R. and Fienberg, S. E. (2016). A Bayesian approach to graphical record linkage and deduplication. J. Amer. Statist. Assoc. 111 1660–1672.
• Steorts, R. C., Ventura, S. L., Sadinle, M. and Fienberg, S. E. (2014). A comparison of blocking methods for record linkage. In International Conference on Privacy in Statistical Databases 253–268.
• Tancredi, A. and Liseo, B. (2011). A hierarchical Bayesian approach to record linkage and population size problems. Ann. Appl. Stat. 5 1553–1585.
• Vatsalan, D., Christen, P., O’Keefe, C. M. and Verykios, V. S. (2014). An evaluation framework for privacy-preserving record linkage. J. Priv. Confident. 6 3.
• Wang, Y., Shrivastava, A. and Ryu, J. (2017). FLASH: Randomized algorithms accelerated over CPU-GPU for ultra-high dimensional similarity search. ArXiv preprint. Available at arXiv:1709.01190.
• Winkler, W. E. (2005). Approximate string comparator search strategies for very large administrative lists. Proceedings of the Section on Survey Research Methods, American Statistical Association.
• Winkler, W. E. (2006). Overview of record linkage and current research directions. In U.S. Bureau of the Census. Washington, DC. Available at https://www.census.gov/srd/papers/pdf/rrs2006-02.pdf.
• Zanella, G., Betancourt, B., Miller, J. W., Wallach, H., Zaidi, A. and Steorts, R. (2016). Flexible models for microclustering with application to entity resolution. In Advances in Neural Information Processing Systems 1417–1425.

#### Supplemental materials

• Supplementary Material for “Unique entity estimation with application to the Syrian conflict”. This supplement consists of two parts, offering more details about (A) the Syrian data set and (B) our unique entity estimation proofs. In (A), we give details regarding the Syrian data set and the training data that is used. In (B), we give detailed proofs that our proposed estimator is unbiased and has provably low variance compared to random sampling. Refer to Chen, Shrivastava and Steorts (2018) for details.