A Unified Framework for De-Duplication and Population Size Estimation

Data de-duplication is the process of detecting records in one or more datasets which refer to the same entity. In this paper we tackle the de-duplication process via a latent entity model, where the observed data are perturbed versions of a set of key variables drawn from a finite population of N different entities. The main novelty of our approach is to consider the population size N as an unknown model parameter. As a result, a salient feature of the proposed method is the capability of the model to account for the de-duplication uncertainty in the population size estimation. As by-products of our approach we illustrate the relationships between de-duplication problems and capture-recapture models and we obtain a more adequate prior distribution on the linkage structure. Moreover we propose a novel simulation algorithm for the posterior distribution of the matching configuration based on the marginalization of the key variables at population level. We apply our method to two synthetic data sets comprising German names. In addition we illustrate a real data application, where we match records from two lists which report information about people killed in the recent Syrian conflict.

Identifiers are rarely available and the researcher must deal with the uncertainty related to the linking step. The problem of how to account for the matching uncertainty has spurred an active line of recent research across the statistics, machine learning, and computer science communities. In fact, in practical applications of record linkage procedures, the concrete possibility of making wrong matching decisions should be accounted for, especially when the result of the linking step, namely the fused data set, will be used for further statistical analyses, such as regression, capture-recapture methods or small area estimation: see for example Liseo (2011, 2015), Briscolini et al. (2018), and Sadinle (2018).
The classical record linkage approach with two data sets was formalized by Jaro (1989), following the seminal paper by Fellegi and Sunter (1969). This standard method is based on the comparison vectors, that is, the data vectors obtained by comparing the common fields, also known as key variables, for each pair of records. Since the distribution of the comparison vectors depends on the unknown match or non-match status of the record pairs, a mixture model fitted to the entire collection of comparison vectors can be used to classify all the pairs into two or more sets according to their matching status (Belin and Rubin, 1995; Larsen and Rubin, 2001). Recently, Sadinle and Fienberg (2013) extended the Fellegi-Sunter approach to situations with three or more files, while also preserving transitive closures.
To our knowledge, Fortini et al. (2001) proposed the first Bayesian approach to record linkage, where the likelihood function provided by the set of multiple comparison vectors was used to estimate the matching configuration through the use of Markov chain Monte Carlo (MCMC) methods. This approach, together with Larsen (2005) and Sadinle (2017), can be interpreted as a Bayesian version of the classical Fellegi-Sunter record linkage approach. Note that these papers do not assume the presence of "within file" duplications. That is, it is only possible to match a record in a file to a single record of another file and vice versa. A clear advantage of the Bayesian approach is that one can naturally account for this constraint by simply selecting appropriate prior distributions on the matching status to incorporate this assumption. Tancredi and Liseo (2011) recently proposed a Bayesian record linkage method that is well suited for categorical data. The authors deviate from the Fellegi-Sunter approach in two major ways: they do not work with comparison data, and they allow for record linkage uncertainty to be accounted for in population size estimation. To handle the former, they explicitly model the fully observed records through a particular measurement error model, inspired by the so-called "hit-and-miss" strategy proposed by Copas and Hilton (1990). The latter is naturally handled through the joint estimation of the record linkage model and the capture-recapture model used for population size estimation. In the same spirit, Liseo and Tancredi (2011) introduced a record linkage model for continuous data based on a multivariate normal model with measurement error. The de-duplication problem in a single-list framework has been tackled from a Bayesian perspective in Sadinle (2014) by using the information provided by the comparison data. Steorts et al. (2014, 2016) were the first to perform simultaneous record linkage and de-duplication on multiple files through the use of the fully observed records, creating a scalable record linkage algorithm. Steorts (2015) extended this work further to the case of string and categorical data, where arbitrary distance metrics between strings have been considered.
In this paper we extend both the work of Tancredi and Liseo (2011) and Steorts et al. (2016), developing a unified framework for population size estimation using multiple files that require both linkage and de-duplication. In fact, the former paper considered only the case of two files without duplications inside each single list, while the latter assumed a generating population with a fixed and known size.
The rest of the paper proceeds as follows. Section 2 introduces the basic framework of our generalized Bayesian record linkage and de-duplication model and specifies the measurement error model for the key variables, namely the hit-and-miss model. Section 3 illustrates how the task of estimating a population size can be rephrased in terms of the partition associated with the observed records. Moreover, we provide new insights about the prior modeling of the matching configuration in a de-duplication problem and show some connections between our prior partition modeling and capture-recapture models with non-homogeneous capture probabilities and duplication rates. Section 4 shows how to simplify the model by integrating out the unknown population values. Section 5 discusses the computational aspects of our proposed model. In particular, in contrast to Steorts et al. (2016), we propose a novel simulation algorithm for the posterior distribution based on the marginalization of the record values at the population level. Section 6 illustrates the results of our unified model for de-duplication and population size estimation applied to the synthetic data sets RLdata500 and RLdata10000 from the RecordLinkage package in R, presenting an intensive sensitivity analysis with respect to all model hyperparameters. In Section 7 we fit the model to a real data set reporting the names of victims of the recent Syrian conflict. Finally, Section 8 provides a brief discussion of our work.

The key variables model
We first introduce the methodological framework of the record linkage process. Assume L lists F_1, F_2, ..., F_L, whose records relate to statistical units (e.g. individuals, firms, etc.) of partially overlapping samples. The records in the lists consist of several categorical variables which may contain corruptions, noise, and errors. Moreover, we do not handle missing fields across lists, and assume that all lists have p fields in common, representing the key variables. For example, in lists regarding individuals, the common fields might be surname, name, age, and sex. Denoting the j-th record of file F_i as (i, j), the main goal of a standard record linkage procedure is to identify all pairs of records, say (i_1, j_1) and (i_2, j_2) with i_1 ≠ i_2, that actually refer to the same unit, by using the key variables of the observed records of the L lists. An additional difficulty arises when some records in the same file, say (i, j_1), ..., (i, j_n), refer to a single entity; detecting these cases is known as duplicate detection.
Assume that the L sets of records have been collected from a given population with N entities, that is, Ũ_N = {ũ_1, ũ_2, ..., ũ_N} where N < ∞, and that the lists are independent, that is, population entities occur independently across the lists, in the same framework as Steorts et al. (2016). Assign to each member of the population the label j resulting from its position in the ordered list Ũ_N, so that j = 1, ..., N. We assume that N is unknown; thus, knowing the labels of the entities observed in the data sets would produce strong information about N, if only because N should be greater than the maximum observed label. However, these labels cannot be observed, nor can they be estimated from the information provided by the L lists of records. In fact, we anticipate that the data can be informative only on how many distinct population entities have been observed at the sample level and on which sample records gather around each one of them. The former information will be used to estimate N, the latter to perform the matching process.
Let ṽ_j = (ṽ_{j1}, ..., ṽ_{jp}) be the vector of the p categorical key variables for the population individual j, and denote by ṽ = (ṽ_1, ..., ṽ_N) the entire set of population records. Assume that the population records ṽ are generated independently, for j = 1, ..., N, from a vector of independent categorical variables Ṽ = (Ṽ_1, ..., Ṽ_ℓ, ..., Ṽ_p), where Ṽ_ℓ takes the value v ∈ {1, ..., M_ℓ} with probability θ_{ℓv} and M_ℓ is the number of categorical values for the ℓ-th field. Note that, here and later, to simplify notation we let the arguments define the density and mass functions. Hence, the model for the population records can be written as

p(ṽ | θ) = ∏_{j=1}^N ∏_{ℓ=1}^p θ_{ℓ ṽ_{jℓ}}.    (2.1)

At the sample level we assume that one does not observe the population "true" values, due to measurement and reporting errors. In fact, each set of observed records, which is a list of size n_i, i = 1, ..., L, comprises contaminated versions of subsets of the vectors ṽ_j. Let v_{ij} = (v_{ij1}, ..., v_{ijp}) denote the observed values for the j-th record of the i-th file, with i = 1, ..., L and j = 1, ..., n_i. Moreover, denote with v = (v_{11}, ..., v_{1n_1}, ..., v_{L1}, ..., v_{Ln_L}) the entire set of observed records across the L lists.
Let λ_{ij} ∈ {1, 2, ..., N}, j = 1, ..., n_i, i = 1, ..., L, be the unknown population labels of the sample units. This way λ = (λ_{11}, ..., λ_{1n_1}, ..., λ_{L1}, ..., λ_{Ln_L}) denotes the unknown matching pattern between the observed records v and the population records ṽ, where λ_{ij} = j′ indicates that the observed record v_{ij} is a version of the population record ṽ_{j′}. The relation λ_{ij_1} = λ_{ij_2}, with j_1 ≠ j_2, implies that records j_1 and j_2 of the i-th list refer to the same population record. This is an instance of duplicate detection within the same list. Instead, when λ_{i_1j_1} = λ_{i_2j_2}, with i_1 ≠ i_2, one has the usual record linkage framework, with the same individual appearing in two different lists.
Let us now formalize the generative distortion mechanism when the population records are observed in the L lists. In particular, we assume the hit-and-miss model proposed by Copas and Hilton (1990) and also adopted in Steorts et al. (2014, 2016) and Steorts (2015). Let V_{ij} be the random variable generating v_{ij}. Assume that V_{ijℓ} ∈ V_ℓ, that is, V_{ijℓ} has the same support as Ṽ_ℓ. Moreover, set δ(a, b) = 1 if a = b and δ(a, b) = 0 if a ≠ b, let α_j = (α_{j1}, ..., α_{jℓ}, ..., α_{jp}) be the vector with the measurement error probabilities of the p key variables for the population individual j, and denote by α = (α_1, ..., α_j, ..., α_N) the entire set of distortion probabilities. We first assume

p(v_{ijℓ} | λ_{ij} = j′, ṽ, α, θ) = (1 − α_{j′ℓ}) δ(v_{ijℓ}, ṽ_{j′ℓ}) + α_{j′ℓ} θ_{ℓ v_{ijℓ}}.    (2.2)

This way, for the ℓ-th key variable, the true population value of the individual j′ generating the record ij is observed with probability 1 − α_{j′ℓ}, while, with probability α_{j′ℓ}, we observe a value drawn from the random variable Ṽ_ℓ generating the population values.
Finally, assuming conditional independence among all the sample records and all the key variables given their respective unobserved population counterparts, one obtains

p(v | λ, ṽ, α, θ) = ∏_{i=1}^L ∏_{j=1}^{n_i} ∏_{ℓ=1}^p [(1 − α_{λ_{ij}ℓ}) δ(v_{ijℓ}, ṽ_{λ_{ij}ℓ}) + α_{λ_{ij}ℓ} θ_{ℓ v_{ijℓ}}].    (2.3)

The model summarized by equations (2.1), (2.2) and (2.3) can be viewed as part of a hierarchical model where N unobserved population records ṽ_j, drawn from a superpopulation model parametrized by the probability vectors θ_ℓ, generate the observed records v_{ij}, with the vectors α_j acting as record distortion parameters. The key variable probabilities θ_ℓ and the distortion probabilities α_j are unknown quantities. For the probability vectors θ_ℓ we assume independent Dirichlet priors, for ℓ = 1, ..., p. An exchangeable prior will be assumed for the distortion probabilities α_j, for j = 1, ..., N.
In particular, the logit transformation of α_{jℓ}, that is β_{jℓ} = log(α_{jℓ}/(1 − α_{jℓ})), will be Normal with mean β_{0ℓ} and variance s², for j = 1, ..., N, and β_{0ℓ} will be Normal with mean m_0 and variance s_0². Note also that distortion probabilities for different key variables are assumed independent.
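As a quick numerical check of this prior specification, the sketch below simulates the implied marginal prior on a distortion probability α_{jℓ} = logit⁻¹(β_{jℓ}); marginally β_{jℓ} is Normal with mean m_0 and variance s² + s_0². The hyperparameter values are those used later for the RLdata500 experiment in Section 6 (m_0 = logit(0.01), s² = 0.5, s_0² = 0.1); the function name is ours, not from the paper.

```python
import math
import random

def alpha_prior_summary(m0, s2, s02, draws=200_000, seed=1):
    """Monte Carlo summary of the implied prior on a distortion probability
    alpha = logistic(beta), where beta | beta0 ~ N(beta0, s2) and
    beta0 ~ N(m0, s02); marginally beta ~ N(m0, s2 + s02)."""
    rng = random.Random(seed)
    sd = math.sqrt(s2 + s02)
    alphas = sorted(1.0 / (1.0 + math.exp(-(m0 + sd * rng.gauss(0, 1))))
                    for _ in range(draws))
    mean = sum(alphas) / draws
    q99 = alphas[int(0.99 * draws)]  # 0.99 prior quantile
    return mean, q99

m0 = math.log(0.01 / 0.99)  # logit(0.01)
mean, q99 = alpha_prior_summary(m0, s2=0.5, s02=0.1)
```

With these values the prior mean and 0.99 quantile come out near 0.013 and 0.058, matching the summaries reported for this prior in Section 6.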

The prior for the records partition and the population size
The interpretation and the prior specification of the labeling variables λ are more challenging than for all other model variables and parameters. One interpretation of λ is that its values are drawn from a known and specific sampling design which generates the labels, allowing for duplications within each list. Consider the simplest situation, where L independent simple random samples are drawn with replacement from a population of size N < ∞. It follows that

p(λ | N) = 1/N^n,    (3.1)

where n = Σ_{i=1}^L n_i. Therefore, for fixed N, one has a uniform prior over the set of all possible configurations of the λ values. This is exactly the prior used in Steorts et al. (2016), and we will call this distribution the uniform prior on the label space. Note that a similar scheme was also considered by Tancredi and Liseo (2011) in the context of two-file record linkage without duplication. There, their matching matrix prior distribution was based on the assumption that the lists were two simple random samples without replacement.
We now investigate an alternative aspect of the uniform prior distribution of λ given N. Let Z = Z(λ) denote the partition of the n records determined by λ. For example, assuming N = 3, L = 1, n_1 = n = 3 and λ = (1, 2, 2), we have the partition Z = 1|23, indicating that the second and third sample units share the same population label, which differs from that of the first sample unit. Note that, in this case, λ may assume 27 different vectors, all with equal probability, producing the five different partitions of the n = 3 records, namely {123, 1|23, 13|2, 12|3, 1|2|3}. Moreover, the partition 1|23 is obtained when λ is one of the vectors (1, 2, 2), (1, 3, 3), (2, 1, 1), (2, 3, 3), (3, 1, 1), (3, 2, 2). Thus the probability of the partition 1|23 given N = 3 is 6/27. When N = 4, λ may assume 64 different vectors, and it is simple to verify that the probability of the partition 1|23 is now 12/64. Thus the distribution on the sample labels λ given N induces a distribution on the partition space which depends on N. This means that knowledge of the partition of the sample records alone carries information on the population size N. Furthermore, matches and duplicates are completely specified given the knowledge of Z. Thus estimating the partition permits, at the same time, producing inference on N and estimating the linkage structure of the data at hand.
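The counting argument above is easy to verify by brute-force enumeration of all N^n label vectors; a minimal sketch (helper names are ours):

```python
from itertools import product

def partition_of(labels):
    """Canonical partition of record indices induced by a label vector."""
    blocks = {}
    for idx, lab in enumerate(labels):
        blocks.setdefault(lab, []).append(idx)
    return frozenset(frozenset(b) for b in blocks.values())

def partition_counts(target, n, N):
    """Number of label vectors (out of N^n) inducing the target partition."""
    hits = sum(partition_of(lam) == target
               for lam in product(range(N), repeat=n))
    return hits, N ** n

target = partition_of((1, 2, 2))  # the partition 1|23
```

Running `partition_counts(target, 3, 3)` and `partition_counts(target, 3, 4)` reproduces the probabilities 6/27 and 12/64 quoted above.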
In the following we will denote by P the set containing all the possible partitions of the n observed records and by z ∈ Z a single block of the partition Z. Moreover, let u_z(λ) be the label identifying the block z in the vector λ, and let U = U(λ) = {u_z(λ), z ∈ Z} be the set of the block labels, ordered according to the sequence z ∈ Z. Hence λ = (3, 5, 1, 5) and λ = (5, 3, 1, 3) produce the same partition Z = 1|24|3 but different label vectors, U = (3, 5, 1) and U = (5, 3, 1). Note that (Z, U) and λ are in one-to-one correspondence, thus p(Z, U | N) = p(λ | N).
We now obtain the prior distribution on the partition space P, for a given N, resulting from the uniform prior on the label space. Let k = k(Z) be the observed number of blocks of the partition Z. The number of label vectors λ producing the partition Z is the falling factorial N_{(k)} = N!/(N − k)!. In fact, we have (N choose k) ways to select the unordered labels for the blocks of Z and, for each of them, k! ordered labellings U. Thus

p(Z | N) = N_{(k)} / N^n.    (3.2)

Note also that N^n = Σ_{k=0}^n N_{(k)} S(n, k), where S(n, k) is the Stirling number of the second kind, that is, the number of possible partitions of the n records into k non-empty sets, so we have

p(Z | N) = N_{(k)} / Σ_{j=0}^n N_{(j)} S(n, j).    (3.3)

Following Pitman (2006), equation (3.3) defines a special case of Gibbs partitions. Moreover, the distribution of the random number of blocks K is given by

p(K = k | N) = S(n, k) N_{(k)} / N^n,  k = 1, ..., min(n, N).    (3.4)

The mean and the variance of K are easily obtained as

E(K | N) = N (1 − (1 − 1/N)^n)

and

Var(K | N) = N (1 − 1/N)^n + N(N − 1)(1 − 2/N)^n − N² (1 − 1/N)^{2n}

(see Appendix A in Supplementary Material, Tancredi, Steorts, and Liseo, 2019). For fixed N, as the number of records n → ∞, the distribution of K|N concentrates on N, since E(K|N) tends to N and the variance vanishes. Also observe that, for a fixed number of records n and large values of N, the distribution of the number of distinct entities K|N concentrates on n. That is, the prior probability of observing links or duplicates approaches 0 in the limit, as intuition suggests.
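These quantities can be checked numerically for small n and N; the sketch below builds the distribution of the number of blocks from Stirling numbers of the second kind and compares its mean with the closed form N(1 − (1 − 1/N)^n), a standard occupancy identity (function names are ours):

```python
import math

def stirling2(n, k):
    """Stirling number of the second kind, via inclusion-exclusion."""
    return sum((-1) ** j * math.comb(k, j) * (k - j) ** n
               for j in range(k + 1)) // math.factorial(k)

def block_count_pmf(n, N):
    """p(K = k | N) = S(n, k) * N!/(N-k)! / N^n for k = 1, ..., min(n, N)."""
    falling = lambda N, k: math.factorial(N) // math.factorial(N - k)
    return {k: stirling2(n, k) * falling(N, k) / N ** n
            for k in range(1, min(n, N) + 1)}

pmf = block_count_pmf(6, 4)
mean = sum(k * p for k, p in pmf.items())
closed_form = 4 * (1 - (1 - 1 / 4) ** 6)  # E(K | N) for n = 6, N = 4
```

The probabilities sum to one, which also verifies the Stirling identity N^n = Σ_k N_{(k)} S(n, k).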
To complete the prior modeling of the linkage structure we need to specify the prior for the population size N. Throughout this paper we assume

p(N) ∝ 1/N^g,  N = 1, 2, ...    (3.5)

Note that the use of heavy-tailed priors p(N) ∝ 1/N^g as non-informative distributions is quite widespread in Bayesian population size estimation; see for example George and Robert (1992) or Wang et al. (2007). Straightforward calculations (see Appendix A in Supplementary Material, Tancredi, Steorts, and Liseo, 2019) provide the marginal prior mean for K under this class of priors. Notice that, as g → 1, E(K) converges to n, which is the upper end point of the support of K; hence when g approaches 1 the whole distribution of K concentrates on n.
The left part of Table 1 reports, for different values of g, the mean and the standard deviation of K when the total number of records is n = 500, as in the first application illustrated in this paper. Such summaries are obtained by simulating 10^7 draws from p(N, λ), via the accept-reject algorithm for p(N) proposed in Devroye (1986), §10.6, and by direct simulation of p(λ|N). Note that even for values of g close to 1, the standard deviation of K is quite high. Thus, such values of g serve to induce a priori a high number of clusters with few observations per cluster, i.e. the microclustering effect (see for example Zanella et al. (2016) and Johndrow et al. (2018)), without being overly informative. The right part of Table 1 reports the mean and the standard deviation of K when we use the uniform prior for λ with fixed values of N. Note that the assumption of a uniform distribution on the label space conditioned on the value of N might not be adequate in real applications of record linkage and de-duplication, even when we are only interested in the linkage structure and do not need to make inference on N. In fact, the resulting distribution on the number of distinct entities K will generally be too concentrated, as illustrated by the extremely low standard deviations.

Estimation for the population size N when the partition Z is known
When the partition Z of the n records is known and the model generating the partition is given by (3.2), inference on N can be conducted via the posterior distribution

p(N | Z) ∝ p(N) (N!/(N − k)!) N^{−n} I_{{N ≥ k}},    (3.7)

where I_{{N ≥ k}} denotes the indicator function of the set N ≥ k. Notice that the distribution (3.7) is exactly the posterior for N obtained from a T-stage homogeneous capture-recapture model when T = n, we observe k different individuals across the samples, and we condition on one capture in each occasion; see for example Marin and Robert (2014), §5. Note also that, assuming the prior (3.5), the posterior (3.7) is proper for all g ≥ 0 when k < n − 1.
It is also interesting to observe that the mode of the posterior for N when g = 0 is approximated by the moment estimator of N obtained from the expression (3.4). In fact, by approximating the logarithm of p(Z|N) via the Stirling formula, we have that

log p(Z | N) ≈ N log N − (N − k) log(N − k) − k − n log N + const,

and the mode of the posterior distribution p(N|Z) when g = 0, i.e. p(N) ∝ c, is approximately given by the solution of the equation k = N(1 − e^{−n/N}), which can be further approximated by solving

k = N (1 − (1 − 1/N)^n),

that is, the equation providing the expected value of K as a function of N.
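The sketch below illustrates this approximation numerically for n = 500 and k = 450 (the setting of the first application): the exact posterior mode under a flat prior is found by direct search, and compared with the root of k = N(1 − e^{−n/N}) obtained by bisection. Function names are ours.

```python
import math

def log_pZ(N, n, k):
    """log p(Z | N) = log N_(k) - n log N, up to a constant (flat prior, g = 0)."""
    return math.lgamma(N + 1) - math.lgamma(N - k + 1) - n * math.log(N)

def posterior_mode(n, k, N_max=10_000):
    """Exact mode of p(N | Z) under a flat prior, by direct search."""
    return max(range(k, N_max), key=lambda N: log_pZ(N, n, k))

def moment_root(n, k):
    """Solve k = N (1 - exp(-n/N)) for N by bisection (f is increasing in N)."""
    lo, hi = float(k), 1e7
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * (1 - math.exp(-n / mid)) > k:
            hi = mid
        else:
            lo = mid
    return lo

n, k = 500, 450
mode, approx = posterior_mode(n, k), moment_root(n, k)
```

For these values the two estimates essentially coincide, both landing near N ≈ 2300.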

Connections with capture-recapture models with non homogeneous capture probabilities and duplication rates
Now suppose that, in order to form the i-th list, each one of the N population units is subject to being captured a random number of times. That is, for each label j and for each list i there are T_{ji} attempts to capture the population unit ũ_j and, for each attempt, the capture probability is p_i. Moreover, assume that the random variables T_{ji} are independent Poisson with mean δ_i. Hence δ_1, ..., δ_L are list-dependent parameters providing the "within-list" duplication rates, while p_1, ..., p_L are the list-specific capture probabilities.
Now let X_{jit}, for t = 1, ..., T_{ji}, j = 1, ..., N and i = 1, ..., L, be independent Bernoulli variables with probability p_i indicating whether, in list i, unit ũ_j has been captured at attempt t, and let X_{ji} = Σ_{t=1}^{T_{ji}} X_{jit} be the number of times that ũ_j has been captured in list i. Note that X_{ji} is Poisson distributed with mean δ_i p_i, being the sum of a Poisson number of Bernoulli variables. Now let n_i be the list size, for i = 1, ..., L, and observe that n_i = Σ_{j=1}^N X_{ji} is Poisson distributed with mean N δ_i p_i; the conditional distribution of X_{ji} | n_i is Binomial(n_i, 1/N), and the vector (X_{1i}, ..., X_{Ni}) given n_i is multinomial with equal cell probabilities 1/N. Moreover, each label sequence of the i-th list, that is, the vector λ_i = (λ_{i1}, ..., λ_{in_i}), has probability

p(λ_i | n_i) = 1/N^{n_i}.

Assuming duplication and capture independence across the lists, we also have that p(λ | n_1, ..., n_L) = 1/N^n. Hence, conditioning on the list sizes has eliminated the duplication rates and the capture probabilities, providing a conditional likelihood for N which depends on the non-identifiable population labels and which, in turn, provides the likelihood function (3.2) for N given the observable partition Z. In summary, the proposed prior (3.1) for λ exactly embeds the sampling information, conditional on list sizes, provided by a capture-recapture model with non-homogeneous capture probabilities and duplication rates.
Notice that the elimination of the capture probabilities and the duplication rate parameters from the prior model for λ automatically implies that two records of the same list and two records from two different lists have the same prior probability of being duplicates. Such an assumption, which follows directly from the prior (3.1), is admittedly unlikely to be true in practice. We simply consider it a convenient and operative starting point for performing matching estimation.

The hit-miss marginal model for record clustering
A convenient property of the hit-miss model illustrated in Section 2 is that one can integrate out the unknown population values ṽ to directly obtain the distribution p(v | Z, U, N, α, θ), as illustrated below. The resulting marginal distribution is the product of within-block distributions. In fact, records belonging to different blocks are independent, because they refer to different and independent population records, while records within the same block are dependent, since they are observations on the same population individual. Clustering approaches based on similar dependence structures are discussed in Booth et al. (2008) and McCullagh and Yang (2008).
Let z ∈ Z be a partition block, let v_z = (v_{ij} : ij ∈ z) denote the corresponding cluster of records, and let v_{zℓ} = (v_{ijℓ} : ij ∈ z) denote the cluster of observed values for the ℓ-th key variable. Also let u_z denote the label in U corresponding to the block z, and let ṽ_U = (ṽ_{u_z}, z ∈ Z) and α_U = (α_{u_z}, z ∈ Z) be the corresponding sets of population records and distortion probabilities.
Hence, observing that p(ṽ_U | Z, U, N, α, θ) = ∏_{z∈Z} p(ṽ_{u_z} | θ), and marginalizing out the true values ṽ_U, one obtains

p(v | Z, U, N, α, θ) = ∏_{z∈Z} p(v_z | Z, U, N, α, θ).

Now, let us consider a block with only a single record, i.e., z = {(ij)}. Then the marginal distribution of the observed value for the ℓ-th field of this record is

p(v_{zℓ} | Z, U, N, α, θ) = Σ_{ṽ} θ_{ℓṽ} [(1 − α_{u_zℓ}) δ(v_{ijℓ}, ṽ) + α_{u_zℓ} θ_{ℓ v_{ijℓ}}] = θ_{ℓ v_{ijℓ}}.

Since we have assumed conditional independence among the key variables, one has

p(v_z | Z, U, N, α, θ) = ∏_{ℓ=1}^p θ_{ℓ v_{ijℓ}}.

After simple algebra, an analytical expression can also be found for a cluster z = {(i_1 j_1), (i_2 j_2)} with two records, that is,

p(v_{zℓ} | Z, U, N, α, θ) = (1 − α_{u_zℓ})² δ(v_{i_1j_1ℓ}, v_{i_2j_2ℓ}) θ_{ℓ v_{i_1j_1ℓ}} + α_{u_zℓ} (2 − α_{u_zℓ}) θ_{ℓ v_{i_1j_1ℓ}} θ_{ℓ v_{i_2j_2ℓ}}.

Furthermore, it is straightforward (see Appendix B in Supplementary Material, Tancredi, Steorts, and Liseo, 2019) to obtain a general and recursive formula for the marginal distribution of a cluster with n records, z = {(i_1 j_1), ..., (i_n j_n)}:

p(v_{zℓ} | Z, U, N, α, θ) = α_{u_zℓ} θ_{ℓ v_{i_nj_nℓ}} p(v_{z\(i_nj_n)ℓ} | Z, U, N, α, θ) + (1 − α_{u_zℓ}) θ_{ℓ v_{i_nj_nℓ}} ∏_{t=1}^{n−1} [(1 − α_{u_zℓ}) δ(v_{i_tj_tℓ}, v_{i_nj_nℓ}) + α_{u_zℓ} θ_{ℓ v_{i_tj_tℓ}}],    (4.1)

where v_{z\(i_nj_n)ℓ} indicates the cluster values for the ℓ-th key variable excluding those observed on the record (i_n, j_n).
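The recursive computation can be validated against direct summation over the latent population value; a minimal sketch for a single key variable with distortion probability alpha shared within the cluster (helper names are ours):

```python
def marginal_bruteforce(vs, alpha, theta):
    """p(v_z) by summing over the latent population value vtilde:
    sum_vtilde theta[vtilde] * prod_t [(1-a) 1{v_t = vtilde} + a theta[v_t]]."""
    total = 0.0
    for vtilde, p in enumerate(theta):
        w = p
        for v in vs:
            w *= (1 - alpha) * (v == vtilde) + alpha * theta[v]
        total += w
    return total

def marginal_recursive(vs, alpha, theta):
    """Recursive marginal of a cluster, adding one record at a time:
    p(v_1..v_t) = a*th[v_t]*p(v_1..v_{t-1})
                + (1-a)*th[v_t]*prod_{s<t}[(1-a) 1{v_s = v_t} + a*th[v_s]]."""
    p = 1.0
    for t, v in enumerate(vs):
        prod = 1.0
        for s in range(t):
            prod *= (1 - alpha) * (vs[s] == v) + alpha * theta[vs[s]]
        p = alpha * theta[v] * p + (1 - alpha) * theta[v] * prod
    return p

theta = [0.2, 0.3, 0.5]  # hypothetical key-variable probabilities
```

Note that for a singleton cluster the recursion returns exactly theta[v], as in the single-record case above.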
As a final note, observe that, for all z, p(v_z | Z, U, N, α, θ) depends on α and on the partition block z along with the corresponding label u_z. Then p(v | λ, N, α, θ) = p(v | Z, U, α, θ); that is, the distribution of the observed data depends on Z, U, α, θ and not on the population size N.

Posterior simulation
De-duplication and population size inference can be carried out by simulating from the posterior p(Z, N | v), that is, the marginal distribution of p(λ, N, β, β_0, θ | v), where β is the vector with the logit transformations of the distortion probabilities of the N population entities, β_0 is the vector with their means for each key variable, and

p(λ, N, β, β_0, θ | v) ∝ p(v | Z, U, β, θ) p(λ | N) p(N) p(β | β_0) p(β_0) p(θ).    (5.1)

Note that the marginal posterior p(Z, N, β_0, θ | v) is

p(Z, N, β_0, θ | v) ∝ p(Z | N) p(N) p(β_0) p(θ) ∏_{z∈Z} [ ∫ p(v_z | β_{u_z}, θ) p(β_{u_z} | β_0) dβ_{u_z} ].

By integrating out the measurement error parameters β_{u_z}, the integrals inside the square brackets in the last expression do not depend on the population labels {u_z, z ∈ Z}. Hence we have that

p(Z, N, β_0, θ | v) ∝ p(Z | N) p(N) p(β_0) p(θ) ∏_{z∈Z} q(v_z | β_0, θ),    (5.2)

where q(v_z | β_0, θ) is the marginal distribution of the block z given β_0 and θ. Now let η be an alternative set of labels for the sample records, where η_{ij} ∈ {1, ..., n} for all ij. Let Z be the partition generated by η and U the set of labels assigned by η to the blocks z ∈ Z; note that η ↔ (Z, U). Assume that p(Z | N) = N_{(k)}/N^n, as for the random partition generated by λ, while p(U | Z, N) = 1/[(n choose k) k!], so that

p(η | N) = p(Z | N) p(U | Z, N) = N_{(k)} / (N^n n_{(k)}),

where n_{(k)} = n!/(n − k)!. Moreover, let β_j, for j = 1, ..., n, be a vector of measurement error parameters with the same prior model as the original vectors β_j. Then the posterior (5.2) can also be seen as the marginal, with respect to U and β, of the distribution

p(η, N, β, β_0, θ | v) ∝ p(η | N) p(N) p(β_0) p(θ) ∏_{z∈Z} p(v_z | β_{u_z}, θ) p(β_{u_z} | β_0).    (5.3)

Note that simulating from the distribution (5.3) instead of (5.1) may imply a considerable saving of computing time, since the label indicators η_{ij} vary in {1, ..., n} and no longer in {1, ..., N}, without any loss of information for the de-duplication and population size inference. Draws from the distribution (5.3) can be obtained by updating the elements η, β, β_0, N and θ via the following Gibbs sampler algorithm.
In particular, the updating of the vector η, which leads to the consequent updating of both Z and U, is the most critical step of the algorithm. Denote by η_{(−ij)} the vector η without the element η_{ij}. Moreover, let z\(ij) denote a partition block deprived of the record ij, and let z_q be the partition block such that u_{z_q} = q. Then the full conditional distribution of η_{ij} can be written as

p(η_{ij} = q | η_{(−ij)}, N, β, β_0, θ, v) ∝ p(η_{ij} = q, η_{(−ij)} | N) p(v_{z_q} | β_q, θ) / p(v_{z_q\(ij)} | β_q, θ),    (5.4)

where z_q is the block labeled q when η_{ij} = q. This occurs because, in equation (5.4), setting η_{ij} = q, one has z = z\(ij) for all z ≠ z_q, so that the factors of the blocks not containing the record ij cancel out. Equation (5.4) shows that the conditional posterior probability p(η_{ij} = q | η_{(−ij)}, N, β, β_0, θ, v) depends on the ratio between the probability of the cluster of records referring to the label q, considering η_{(−ij)} and η_{ij} = q, and the probability of the same cluster with the exclusion of the record ij.
When the label q identifies an already existing block given η_{(−ij)}, exploiting the recursive formula (4.1), the above ratio can also be written as

∏_{ℓ=1}^p θ_{ℓ v_{ijℓ}} [ α_{qℓ} + (1 − α_{qℓ}) ∏_{rs ∈ z_q\(ij)} {(1 − α_{qℓ}) δ(v_{rsℓ}, v_{ijℓ}) + α_{qℓ} θ_{ℓ v_{rsℓ}}} / p(v_{z_q\(ij)ℓ} | β_q, θ) ];

however, it simplifies to

∏_{ℓ=1}^p θ_{ℓ v_{ijℓ}}

when the label q identifies a new block.
Thus we can update η_{ij} by sampling q from the distribution

p(η_{ij} = q | η_{(−ij)}, N, β, β_0, θ, v) ∝ p(v_{z_q} | β_q, θ) / p(v_{z_q\(ij)} | β_q, θ)

when q labels an existing block of η_{(−ij)}, and

p(η_{ij} = q | η_{(−ij)}, N, β, β_0, θ, v) ∝ [(N − k)/(n − k)] ∏_{ℓ=1}^p θ_{ℓ v_{ijℓ}}

when q labels a new block, where k denotes the number of blocks of η_{(−ij)} and the factor (N − k)/(n − k) is the ratio of p(η | N) evaluated with k + 1 and with k blocks. This way of updating the cluster composition is a standard approach for mixture models when the marginal likelihood of the cluster observations is known or can be easily calculated, as in our case via the recursive formula (4.1); see for example MacEachern (1994) and Neal (2000).
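A sketch of this assignment step for a single key variable, computing the unnormalised weight of each existing block via the marginal-likelihood ratio and the weight of a new block via the prior factor (N − k)/(n − k); the helper names and toy numbers are ours:

```python
def marginal_recursive(vs, alpha, theta):
    """Marginal likelihood of a cluster of values for one key variable
    under the hit-and-miss model (recursive formula)."""
    p = 1.0
    for t, v in enumerate(vs):
        prod = 1.0
        for s in range(t):
            prod *= (1 - alpha) * (vs[s] == v) + alpha * theta[vs[s]]
        p = alpha * theta[v] * p + (1 - alpha) * theta[v] * prod
    return p

def assignment_weights(v_new, clusters, theta, n, N):
    """Unnormalised full-conditional weights for re-assigning one record:
    one weight per existing block (marginal-likelihood ratio), plus one
    weight for opening a new block, whose prior factor is (N-k)/(n-k)
    under the uniform label prior. clusters: list of (values, alpha)."""
    k = len(clusters)
    w = [marginal_recursive(vals + [v_new], a, theta) /
         marginal_recursive(vals, a, theta) for vals, a in clusters]
    w.append((N - k) / (n - k) * theta[v_new])  # new-block weight
    return w

theta = [0.2, 0.3, 0.5]
# record with value 1, two existing clusters, n = 4 records in total
weights = assignment_weights(1, [([1, 1], 0.05), ([2], 0.05)], theta, n=4, N=1000)
```

As expected, the record is weighted much more heavily toward the cluster whose values it matches.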

The full conditional distribution of β_{jℓ},

p(β_{jℓ} | η, β_0, θ, v) ∝ p(v_{z_jℓ} | β_j, θ) p(β_{jℓ} | β_{0ℓ}),

where z_j is the block labeled j, can be updated using a Metropolis step when j labels a record cluster, or directly from the prior distribution p(β_{jℓ} | β_{0ℓ}) when j does not identify any cluster. A Metropolis step can also be used to update the parameters β_{0ℓ}, whose conditional distribution is

p(β_{0ℓ} | β, v) ∝ p(β_{0ℓ}) ∏_{j=1}^n p(β_{jℓ} | β_{0ℓ}).

To improve the mixing of the chain, we have adopted a non-centered parameterization (Papaspiliopoulos et al., 2003) for β_j, updating the differential effects β_{jℓ} − β_{0ℓ} and slightly modifying the Metropolis steps for β_{jℓ} and β_{0ℓ}.
The full conditional distribution of N is given by

p(N | η, v) ∝ p(N) N_{(k)} / N^n,  N ≥ k,

and an exact Gibbs step, truncating N at a very large integer, or a Metropolis step with integer proposals can be easily implemented. Lastly, note that the full conditional distribution of the probability vector θ_ℓ is

p(θ_ℓ | η, β, v) ∝ p(θ_ℓ) ∏_{z∈Z} p(v_{zℓ} | β_{u_z}, θ_ℓ),

which can be updated using a Metropolis-Hastings step with a Dirichlet proposal distribution. Finally, note that having all n records from a single list or from L > 1 lists makes no difference to the proposed algorithm. This is a direct consequence of the use of the uniform prior distribution p(λ | N), which, although based on overly restrictive assumptions, has the advantage of simplifying the computation of the posterior distribution. In fact, more elaborate prior distributions for λ would require more complex posterior sampling schemes.

Experiments with synthetic data
To investigate the performance of our proposed methodology we first consider the RLdata500 data set from the RecordLinkage package in R. This synthetic data set consists of 500 records, each comprising first and last name and full date of birth. The data set contains 50 records that are intentionally constructed as "duplicates" of other records. Hence the true value of k is 450, and the true partition consists of 400 clusters of size one and 50 clusters of size two. In order to apply a model with categorical variables only, we partially modify the data set by transforming names and surnames via the English soundex algorithm. This way we obtain records with 14 fields: 4 of them are produced by the name, 4 come from the surname, and the last 6 are obtained from the date of birth (4 given by the year, 1 by the month, and 1 by the day). Table 2 shows the first 6 records of the transformed data set.
     name fields   surname fields   date of birth fields
                                    year      month  day
1    C 6 2 3       M 6 0 0          1 9 4 9   7      22
2    G 6 3 0       B 6 0 0          1 9 6 8   7      27
3    R 1 6 3       H 6 3 5          1 9 3 0   4      30
4    S 3 1 5       W 4 1 0          1 9 5 7   9      2
5    R 4 1 0       K 6 2 6          1 9 6 6   1      13
6    J 6 2 5       F 6 5 2          1 9 2 9   7      4

We fit our de-duplication and size estimation model to the modified RLdata500 data set by taking p(N) ∝ 1/N^g with g = 1.02. Note that, with this choice, as reported in Table 1, the prior mean for K is approximately 450, that is, the true number of clusters for this file, and the dispersion is quite large, as we can also see from the upper left panel of Figure 2, where the prior for K is plotted. The probability vectors θ_ℓ are uniform on the simplex. The prior variance of the logit transformations β_j of the distortion probabilities is s² = 0.5, while the mean and the variance of their common mean β_0 are m_0 = logit(0.01) and s_0² = 0.1. Such a prior specification leads to a prior mean and a 0.99 prior quantile for α_{jℓ} respectively equal to 0.013 and 0.058, indicating strong belief in low distortion probabilities. We observe that this is a condition that facilitates the micro-clustering effect, since larger distortion probabilities would allow more records to gather into the same cluster even when they do not refer to the same entity. Instead, with low values of α_{jℓ} we force all the clusters to have a reduced within-cluster variability and a greater between-cluster separation. In this regard, Johndrow et al. (2018) show, from a more general and theoretical point of view, that, in order to be effective, entity resolution via micro-cluster identification requires that the measurement errors go to zero as the number of entities increases. Such a condition practically states the infeasibility of cluster-based approaches for high-dimensional record linkage problems without introducing further information that may facilitate the correct aggregation into microclusters, as our informative prior on α_{jℓ} tries to do.
The Metropolis-within-Gibbs algorithm described in Section 5 was run for 50,000 iterations. Figure 2 reports the posterior distributions for K and N and the performance of the record linkage procedure measured in terms of the posterior distributions of the false negative rate (FNR) and the false discovery rate (FDR) (third and fourth rows). For a review of false negative and false discovery rates in the context of record linkage we refer to Steorts (2015). In the single-list framework, these rates are obtained by setting Δ_{j1 j2} = I(η_{j1} = η_{j2}) and calculating Σ_{j1<j2} Δ_{j1 j2} across the MCMC simulation.
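As a generic illustration (not the paper's code) of how the FNR and FDR can be computed by comparing the matched pairs of an estimated partition with those of the true one:

```python
from itertools import combinations

def matched_pairs(labels):
    """All record pairs (i, j), i < j, assigned to the same cluster."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

def fnr_fdr(est_labels, true_labels):
    """FNR: share of true matched pairs missed by the estimate.
    FDR: share of estimated matched pairs that are not true matches."""
    est, true = matched_pairs(est_labels), matched_pairs(true_labels)
    fnr = len(true - est) / len(true) if true else 0.0
    fdr = len(est - true) / len(est) if est else 0.0
    return fnr, fdr

# Toy example: 5 records, true clusters {0,1}, {2}, {3,4}; the estimate
# recovers the pair (0,1) but misses (3,4) and wrongly links (2,3).
print(fnr_fdr([1, 1, 2, 2, 3], [1, 1, 2, 3, 3]))  # (0.5, 0.5)
```

Averaging these two rates over the MCMC draws of the partition yields the posterior distributions displayed in Figure 2.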
Note that the posterior means for K and N are equal to 446.6 and 2209, while the 95% posterior intervals are [443, 449] and [1710, 2854] respectively. Hence we have a considerable uncertainty reduction with respect to the prior specification for these quantities. The low posterior mean for the FNR, equal to 0.015, indicates that almost all the true matches are correctly linked in the same cluster. In addition, the posterior mean for the FDR, equal to 0.080, suggests that the model produces a limited number of false links. Hence the performance of the de-duplication process is quite satisfactory, also considering the information lost in the data set transformation via the soundex algorithm and the diffuse prior specification of N and K. Table 3 shows the result of a sensitivity analysis with respect to the hyperparameters controlling the prior for the β_jl's, i.e. s², m_0 and s_0², and with respect to the hyperparameter g regulating the tail behavior of the prior for N. In particular we show the posterior means for K, N, the FNR and the FDR obtained when logit⁻¹(m_0) = 0.01, 0.1, 0.2, s² = 0.1, 0.5, 1, s_0² = 0.1, 0.5, 1 and g = 1.01, 1.02, 1.05, 1.1, 1.5, 2. For each value of g, the results are ordered with respect to increasing values of logit⁻¹(m_0), then by the variance s² + s_0² of the β_jl's, and finally by the covariance between β_jl and β_j'l'. As expected, increasing a priori the mean and the variance of the distortion probabilities leads to larger cluster sizes, as we can see from the reduced values of E(K|y). In fact, the posterior mean of K switches dramatically from the correct value of about 450 micro-clusters to inconsistent values of less than 200 clusters, confirming the theoretical findings of Johndrow et al. (2018) regarding the necessity of introducing external information to obtain micro-clusters via a mixture model based approach.
The small FNR and the high FDR when the micro-clustering effect does not occur confirm that records of the same entity are gathered into the same cluster, although together with other records generated by entities without list duplications. Note also that, for the same variance values, micro-clustering is more likely to occur with a lower covariance between β_jl and β_j'l'. Finally, notice that the effect of g is practically negligible, with higher values slightly reducing the posterior mean of N. Table 4 shows the posterior means for K, the FNR and the FDR obtained by conditioning on a grid of known values for N varying from 250 to 10000, with the hyperparameter values s², s_0² and logit⁻¹(m_0) equal to 0.5, 0.1 and 0.01. Note that also by fixing the value of N we regulate the micro-clustering effect, with larger values producing the desired effect. However, we observe a greater sensitivity of the results when we vary N than when we vary g. In fact, for N ≥ 1000 the posterior means of K vary from 443 to 451, while when we vary g the posterior means of K are always 446.5, despite a wider range of prior means in this setting.
To increase the difficulty of the de-duplication problem in a situation where we know the exact matching configuration, we have also considered the RLdata10000 data set. Figure 3 shows the box-plots of the posterior distributions of K, N, the FNR and the FDR for ten blocks of size 1000, each with approximately 800 singleton clusters and 100 two-element clusters. The hyperparameter values are s² = 0.01, s_0² = 0.001, logit⁻¹(m_0) = 0.01 and g = 1.02. Note that the true value of K (represented by a triangle) is always covered by the corresponding posterior draws, except for one block. Moreover, the posterior distributions of N partially overlap even when the related posteriors for K are well separated, confirming the robustness of the population size inference when we account for matching uncertainty. Finally, the record linkage performances are quite satisfactory, with posterior medians for the FNR and the FDR respectively less than 0.07 and 0.15, except for one block.

Application with Syrian data
As a real application we now face the problem of matching records from two publicly available data sets reporting different numbers of recorded victims killed in the recent Syrian conflict, along with available identifying information including first and family names, date of death, and death location. A more detailed application can be found in Chen et al. (2018). Here we consider the data provided by the Violations Documentation Center in Syria (VDC) and the Syrian Center for Statistics and Research (CSR), and we focus on the killings in the province of Raqqa from the beginning of the conflict until March 2017, since the CSR data set does not report records after this date.
The VDC data set provides directly the English equivalents of the Arabic names while, for the CSR list, the English equivalents have been obtained by software transliteration of the reported Arabic names, causing additional noise. Several records of the VDC data set represent unidentified victims and report only the date of death, or lack the first name and report only the relationship with the head of the family. All these records have been eliminated, and the resulting VDC data set comprises 1694 records. The CSR list presents only completely identified victims, for a total size of 1003 records. As in the previous experiments, first and family names have been transformed by the English version of the soundex algorithm, and the resulting fields have been considered as key variables together with the year, month and day of death, for a total of 11 variables.
We show the results obtained with the same hyperparameters set for the RLdata10000 data set, considering three different analyses. In the first case, which we call the separated lists analysis, we investigate only the within-list de-duplication problem; hence we fit our model to the single lists one by one. Note that the identification of true within-list duplicates is a very challenging problem with these data, since most attacks killed whole families, producing records that differ only in the first name and may easily be confused as duplicates. Nevertheless, the number of record pairs that exceed a 0.5 posterior probability of being duplicates, p(η_{ij1} = η_{ij2}|v), is small: we have 51 pairs in the first data set and 43 in the second one, hence visual inspection of these pairs may eventually confirm their matching status. Table 5 reports the distribution of the cluster sizes averaged across the MCMC simulations, showing the micro-clustering effect for both lists.
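The posterior pair probabilities p(η_{ij1} = η_{ij2}|v) can be estimated from the MCMC output as the fraction of draws in which two records share the same cluster label; a generic sketch (not the paper's code):

```python
from itertools import combinations

def pair_match_probs(label_draws, threshold=0.5):
    """label_draws: list of MCMC draws, each a list of cluster labels,
    one label per record. Returns the pairs (j1, j2) whose estimated
    posterior probability of sharing a cluster exceeds `threshold`."""
    T = len(label_draws)
    n = len(label_draws[0])
    probs = {}
    for j1, j2 in combinations(range(n), 2):
        # Fraction of draws in which records j1 and j2 share a label
        p = sum(draw[j1] == draw[j2] for draw in label_draws) / T
        if p > threshold:
            probs[(j1, j2)] = p
    return probs

# Toy chain with 4 draws over 4 records: records 0 and 1 share a
# cluster in 3 of the 4 draws, every other pair in at most 1 draw.
draws = [[1, 1, 2, 3], [1, 1, 2, 2], [1, 1, 3, 4], [1, 2, 3, 4]]
print(pair_match_probs(draws))  # {(0, 1): 0.75}
```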
In the second analysis we consider both within- and between-list de-duplication, which is the natural scenario for our model, where the two lists are joined into a single data set. The total number of pairs with p(η_{i1j1} = η_{i2j2}|v) > 0.5 is 617, of which 481 are between-list duplicates while 84 and 52 are within the first and the second list respectively. Hence about 78% of the duplicates link the same victim across the two lists. Table 5 shows the distribution of the cluster sizes for the joined lists, but also within the two lists separately. Note that the cluster size distributions within the two lists are quite similar to the previous case, where the lists are analyzed separately.
In the third analysis we exclude the within-list duplications and consider only the record linkage problem across the two lists. One way to adapt our proposed model to this particular case is to modify the prior distribution on the λ's so that η_{ij1} ≠ η_{ij2} for all j1 ≠ j2 and for i = 1, 2. Note that, in this case, clusters consist of at most two elements, so that the distribution of the observed records v, conditional on η and α, can be calculated analytically without exploiting the recursive formula. Moreover, the above conditioning is equivalent to assuming that the two lists are two simple random samples without replacement from a population of N units. This is the same situation described in Tancredi and Liseo (2011). From a computational perspective, this scenario does not imply substantial changes. In fact, we can arbitrarily fix the labels of the first file, for example by assuming that η_{1j} = j for j = 1, . . . , n_1, and update only the labels of the second file. In particular, indicating with m_q the size of the cluster identified by the label q without the record (i, j), we can use the Gibbs step provided by equation (5.5). The number of pairs with p(η_{1j1} = η_{2j2}|v) > 0.5 in the record linkage framework is 423, and the posterior mean of the number of matches, that is, the frequency of the two-element clusters, is 431.52. The smaller number of matches between the lists with respect to the previous case is due to the larger estimate of the measurement error obtained when within-list duplications are taken into account, which also increases the estimated number of between-list duplications.
Finally, Figure 4 shows the posterior distribution for N provided by the three different analyses described above. Notice that, since we eliminated records with missing information from the first list, here N represents the size of a smaller population than that of all the victims killed in the province of Raqqa until March 2017: we may say that N represents the number of victims with recordable information about first and last name. The posterior mean for N when accounting for duplications both within and between the lists is equal to 5350, while when accounting only for between-list duplications it is equal to 7507. When considering the two lists separately, the posterior mean of N increases considerably, to 24832 with the VDC list and to 11116 with the CSR list. However, the former two estimates are more reliable because of the additional informative content obtained by joining the two lists. Note also that our estimates depend on the information retrieved from the original records via the soundex algorithm, and that adding other key variables or using the full Arabic names with suitable string distances may lead to different estimates. Moreover, population size estimates are strongly dependent on the capture-recapture model specification; hence introducing heterogeneous and/or dependent captures may also produce different estimates. Nevertheless, our estimates can be seen as a starting point for future comparisons.

Discussion
In this paper we have shown how population size estimation can be performed when records related to population units have been sampled and duplicated across multiple files and the matching reconstruction, within the same file and across different files, is uncertain. In particular, through the prior specification of the matching process, we assumed that the observed lists are obtained as independent simple random samples with replacement from a closed population of unknown size N. The hit-and-miss model (Copas and Hilton, 1990) has been used as a measurement error model in order to interpret differences between the sample records and the population records.
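In a common parameterization of the hit-and-miss model, an observed field copies the true value with probability 1 − α and is otherwise redrawn from the population distribution θ, giving p(v | ṽ) = (1 − α) 1[v = ṽ] + α θ_v. A minimal sketch under this parameterization (the paper's exact formulation may differ; the soundex codes and probabilities below are hypothetical):

```python
import random

def hit_and_miss_sample(true_value, alpha, theta, rng):
    """With prob. 1 - alpha copy the true field; with prob. alpha redraw
    from the population distribution theta (dict: category -> prob.),
    so the true value can still be 'hit' by chance."""
    if rng.random() < alpha:
        cats, probs = zip(*theta.items())
        return rng.choices(cats, weights=probs)[0]
    return true_value

def hit_and_miss_lik(obs, true_value, alpha, theta):
    """p(v | v~) = (1 - alpha) * 1[v = v~] + alpha * theta_v."""
    return (1 - alpha) * (obs == true_value) + alpha * theta[obs]

# Hypothetical distribution over three soundex codes.
theta = {"M460": 0.2, "S530": 0.5, "J525": 0.3}
rng = random.Random(1)
print(hit_and_miss_lik("M460", "M460", 0.05, theta))  # 0.95 + 0.05*0.2 = 0.96
print(hit_and_miss_lik("S530", "M460", 0.05, theta))  # 0.05*0.5 = 0.025
```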
As a by-product of this approach, we obtained a more adequate prior distribution for the matching pattern, which can also be used when population size estimation is not the primary task of the de-duplication process. However, more sophisticated prior distributions could be used to incorporate more realistic sampling designs. For example, it would be important to extend our approach by introducing both heterogeneity and dependence in the sampling probabilities of the population units, as in usual capture-recapture models. In particular, the independence among the L lists is a very strong assumption which rarely holds in real applications. Note also that, in the de-duplication framework, the problem is even more involved, because we may have different degrees of dependence among captures and duplications across the lists. Moreover, from a theoretical perspective, it would also be worthwhile to investigate the role that different prior distributions on the partition space, such as the one induced by the Pitman-Yor process, may play in facilitating the micro-clustering effect.
Other specific assumptions that we made throughout the paper concern the independence of the key variables at the population level and the conditional independence of the measurement error mechanism. Here too, more sophisticated versions of the hit-and-miss model, together with an appropriate model for the key variables, could be used to account for more realistic scenarios. Nevertheless, we are confident that our framework may provide a basis for all these kinds of extensions.