Bayesian Causal Inference with Bipartite Record Linkage

In some scenarios, the observational data needed for causal inferences are spread over two data files. In particular, we consider scenarios where one file includes covariates and the treatment measured on a set of individuals, and a second file includes responses measured on another, partially overlapping set of individuals. In the absence of error-free direct identifiers like social security numbers, straightforward merging of separate files is not feasible, so that records must be linked using error-prone variables such as names, birth dates, and demographic characteristics. Typical practice in such situations generally follows a two-stage procedure: first link the two files using a probabilistic linkage technique, then make causal inferences with the linked dataset. This does not propagate uncertainty due to imperfect linkages to the causal inference, nor does it leverage relationships among the study variables to improve the quality of the linkages. We propose a joint model for simultaneous Bayesian inference on probabilistic linkage and causal effects that addresses these deficiencies. Using simulation studies and theoretical arguments, we show that the joint model can improve the accuracy of estimated treatment effects, as well as the record linkages, compared to the twostage modeling option. We illustrate the joint model using a constructed causal study of the effects of debit card possession on household spending.


Introduction
In some scenarios, researchers seek to make causal inferences from variables spread over two datasets. For example, a social scientist seeks to link records from a survey and an administrative database to assess the effect of some policy on economic outcomes. Similarly, a health researcher seeks to link patients' electronic health records and Medicare claims data to assess the effect of some medical intervention. As a final example, a researcher seeks to link records from a study done in the past to records in a current database to make inferences about long-term effects of a treatment, without having to incur the substantial costs of collecting new primary data.
When perfectly measured, unique identifiers like social security numbers or Medicare patient IDs are available in the two files, it is reasonably straightforward to link individuals across the files (based on these identifiers). However, in some circumstances, direct identifiers may be missing from one or more files, or may not be made available due to privacy restrictions. In such situations, data files have to be linked based on indirect identifiers, such as individuals' names, birth dates, addresses and demographic information. These are inherently imperfect, e.g., they could be recorded differently on the files. This introduces uncertainty in linkages that should be propagated to the causal inferences.
Historically, record linkage and causal inference have been carried out as a twostage process. The researcher first links records using a probabilistic record linkage model based on indirect identifiers, not taking into account available information on the outcome, covariate or treatment status. Subsequently, the researcher uses the set of linked records in a causal inference procedure. This two-stage approach suffers from two drawbacks. First, it does not propagate uncertainty from imperfect linkages. Second, it does not take advantage of relationships among the study variables that could enhance the accuracy of the linkages.
In this article, we propose a Bayesian joint modeling framework for simultaneous causal inference and record linkage in observational studies. In particular, we consider scenarios where one file includes the treatment indicator and causally relevant covariates measured on a set of individuals, and the other file includes outcomes measured on a partially overlapping set of individuals. We follow the Bayesian paradigm for causal inference and posit models for the missing potential outcomes, conditional on the linking status and known covariates. We couple these outcome models with a probabilistic model for the unknown linkage statuses, i.e., which record pairs are links and which are not. For the outcome models, we consider both parametric and semi-parametric forms, with the latter based on a regression of the outcome on a flexible function of the propensity scores (Rosenbaum and Rubin, 1983). For the record linkage model, we use the Bayesian version of the Fellegi and Sunter (1969) model proposed by Sadinle (2017). As part of the model estimation, we generate plausible values of the missing potential outcomes, which we then use to estimate posterior distributions of causal effects.
Our work adds to a body of literature that uses Bayesian methods for simultaneous record linkage and statistical inference, including regression modeling Dalzell and Reiter, 2018) and population size estimation (Domingo-Ferrer, 2011;Tancredi and Liseo, 2011;Sadinle et al., 2018;Tancredi et al., 2018). It also adds to the literature on non-Bayesian methods for simultaneous record linkage and estimation (e.g., Scheuren and Winkler, 1991;Lahiri and Larsen, 2005;Chipperfield et al., 2011;Solomon and O'Brien, 2019). None of these Bayesian and non-Bayesian works consider causal inference as the analysis goal. Wortman and Reiter (2018) introduced the concept of allowing the causal model to inform the linkage model. Their (non-Bayesian) approach uses point estimates of average causal effects to determine the thresholds at which record pairs are declared links in a Fellegi and Sunter (1969) algorithm. It does not use the causal estimates to determine the record pairs to consider as possible links in the first place, which our Bayesian approach does. Further, their approach does not provide uncertainty quantification.
The remainder of the article proceeds as follows. In Section 2 we discuss the background, notations and the formulation of the Bayesian joint model. We also present theoretical results arguing for improved inference on record linkage from a joint model compared to a two-stage model. In Section 3 we describe posterior computation for the model. In Section 4, we provide results from simulation studies used to assess the effectiveness of the joint model, both for causal inference and for linkage quality. In Section 5 we apply the joint model to data from an Italian household survey to link records between different files and assess the effect of debit card possession on household spending. Finally, in Section 6 we conclude with an eye towards future work.

Model and Prior Formulation
We define a few key concepts and assumptions related to causal inference in Section 2.1, and describe probabilistic record linkage in Section 2.2. We propose the Bayesian joint modeling approach in Section 2.3.

Background and Notation for Bayesian Causal Inference
We assume a binary treatment, w i ∈ {0, 1}, with w i = 1 and w i = 0 indicating treatment and control assignment to individual i, respectively. Let x i be the p × 1 covariate vector and y i be the (continuous) outcome for individual i. Each individual is assumed to have two potential outcomes (Rubin, 1974), one under each value of the treatment. We denote y i (1) and y i (0) as the potential outcomes for individual i when w i = 1 or w i = 0, respectively. The treatment effect for the ith individual is given by Other treatment effects can be defined as well, such as y i (1)/y i (0), although here we consider effects in the form of T i .
In reality, for each individual i, we can observe only one of y i (1) and y i (0); that is, we can observe y i = w i y i (1) + (1 − w i )y i (0). Bayesian approaches to causal inference essentially treat the unobserved potential outcomes as missing data (Rubin, 2005;Hill, 2011;Ding et al., 2018). One can impute the missing values repeatedly by sampling from posterior predictive distributions, and use the resulting draws of each T i to make statements about causal effects. For example, one can compute the posterior distribution of the average of the causal effects for the n individuals in the study,T = i T i /n.
Following convention, we make the following assumptions to facilitate causal inferences.
1. Stable unit treatment value assumption (SUTVA): The SUTVA contains two subassumptions, no interference between units (i.e., the treatment applied to one unit does not affect the outcome for another unit) and no different versions of any treatment (Rubin, 1974).
2. Strong ignorability: Strong ignorability stipulates that (y i (0), y i (1)) ⊥ w i |x i for all i, which means that there is no unobserved confounding, and that 0 < P (w i = 1|x i ) < 1.
We also make use of propensity scores (Rosenbaum and Rubin, 1983). For any individual i, let the propensity score e(x i ) = P (w i = 1|x i ), i.e., the probability of being assigned to treatment given the covariate x i . Rosenbaum and Rubin (1983) show that the treatment assignment is independent of x i given e(x i ) under strong ignorability. Typically, propensity scores are estimated using binary regressions of the treatment on causally-relevant covariates (Rosenbaum and Rubin, 1983). Analysts can use the resulting estimate in a variety of ways, for example, to create subsets of matched treated and control records (Stuart, 2010).

Background and Notation for Probabilistic Record Linkage
We consider the scenario where we seek to link two files, File A and File B, comprising n A and n B records, respectively. Without loss of generality, we assume that n A ≥ n B . We suppose each individual or entity is recorded at most once within each file, i.e., each file contains no duplicates. Under this setting, the goal of record linkage is to identify which records in File A and File B refer to the same subject. This setting is known as bipartite record linkage (Sadinle, 2017).
A corollary to the no-duplicates assumption comes in the form of a maximum oneto-one restriction in the linkage, i.e., a record in one file can be linked with a maximum of one record in the other file. Most commonly, the one-to-one linkage is enforced as a post-processing step after identifying a set of potentially many-to-one links (e.g., Fellegi and Sunter, 1969;Jaro, 1989;Winkler, 1993;Belin and Rubin, 1995;Larsen and Rubin, 2001;Herzog et al., 2007). Alternatively, one can embed the bipartite matching constraint into a Bayesian model (Fortini et al., 2002;Tancredi and Liseo, 2011;Larsen, 2010;Sadinle, 2017;Dalzell and Reiter, 2018), as we do here.
Following Sadinle (2017), we introduce z = (z 1 , . . . , z n B ) for the records in File B to encode a particular linking status between the two files. Specifically, let if record j in File B has no link in File A.
In the context of bipartite matching, one enforces z j = z j whenever j = j .
Suppose the two files include F variables in common that can be used to link records across the files. We call these the linking variables or fields. For each pair of records (i, j) in File A × File B, we define a F -dimensional vector γ ij = (γ 1,ij , . . . , γ F,ij ) , where γ f,ij is a score reflecting the similarity in field f for the record pair. In this article, for categorical variables (e.g., birth year, birth date), we set γ f,ij = 1 when the values of field f for records i and j are equal, and set γ f,ij = 0 otherwise. For string fields (e.g., names) we take into account partial agreement using the normalized Levenshtein Similarity (LS) metric (Winkler, 1990). This metric ranges between 0 (no agreement) and 1 (full agreement). We obtain it using the "levenshteinSim" function in the RecordLinkage package in R. We convert the LS metric into a binary γ f,ij by setting γ f,ij = 1 when the LS metric exceeds a predetermined threshold -we use 0.95 in the simulations -and γ f,ij = 0 otherwise. One can convert the LS metric into a multinomial variable for more refined comparisons (Sadinle et al., 2018;Wortman and Reiter, 2018).
Following Fellegi and Sunter (1969) and related literature, we assume that γ ij is a random realization from a mixture of two distributions, one for true links and the other for nonlinks. We have where θ m = (θ 1,m , . . . , θ F,m ) and θ u = (θ 1,u , . . . , θ F,u ) comprise probabilities of agreement for each field specific to each mixture component. For computational convenience we assume conditional independence across fields, so that As a prior distribution on the set of z j , such that z j = z j for any j = j , we follow a construct used in the bipartite record linkage literature, including Fortini et al. (2002), Larsen (2010) and Sadinle (2017). Specifically, let I(z j ≤ n A ) ∼ Ber(π), where π represents the proportion of matches expected a priori as a fraction of the smaller file. Here and throughout, I(E) = 1 when its argument E is true, and I(E) = 0 otherwise. We assume π is distributed according to a Beta(α π , β π ) a priori. Marginalizing over π, the total number of links between File A and File B, given by n AB (z) = n B j=1 I(z j ≤ n A ), is distributed according to a Beta-binomial (n B , α π , β π ) distribution. Conditioning on the number of records in File B with a link, all possible bipartite pairings are taken as equally likely. The final form of the prior distribution of z, marginalizing over π, is given by (2. 3) The choice of the hyper-parameters α π and β π provides prior information on the number of overlapping records between the two files. We discuss the specific choices of α π and β π in Section 3. Finally, the parameters θ f,m and θ f,u follow i.i.d. Beta(a = 1, b = 1) distributions for all f = 1, . . . , F .

Joint Model for Bayesian Causal Inference and Record Linkage
We now present the joint model for Bayesian causal inference and record linkage for the setting where the outcomes y are in File A, and the covariates x and the treatment status w are in File B. We presume that the sets of covariates and linking variables are disjoint. This is the case, for example, when the linking variables comprise only string variables like names, addresses, or phone numbers. We discuss relaxing this assumption after presenting the model.
With this setting, we must specify the distribution of y i in File A depending on whether or not it is linked to any record's covariates and treatment in File B. For linked records, we specify the conditional distribution of y i |(x j , w j ) using a regression of our choice. For records without a link, we specify a model for the marginal distribution of y i . We couple these with the model for record linkage in (2.1)-(2.3).
More specifically, the contribution to the likelihood function from the ith record in File A is f 1 (y i | x j , w j , θ c ) when z j = i, and is f 2 (y i | θ d ) when z j = i, for any j. Here, θ c and θ d represent parameters in the regression and in the marginal model for outcomes, respectively. Let y = (y 1 , . . . , y n A ) and w = (w 1 , . . . , w n B ) be the n A × 1 vector of outcomes in File A and n B × 1 vector of treatment statuses in File B, respectively, and X = [x 1 : · · · : x n B ] be a n B × p dimensional matrix of covariates obtained from File B. The joint likelihood is given by Here, as frequently done in record linkage models, we assume that x does not affect the distribution for z. When this is not plausible, analysts may want to condition on features of x when modeling z in (2.2). Additionally, analysts may want to include a model for x in the likelihood when the covariates suffer from missing values; otherwise, it is not necessary to specify a model for x. More broadly, we assume that the distributions in (2.2) are the same across individuals, e.g., mismatches in the linking variables across files occur randomly. This is typically done for record linkage algorithms (Christen, 2012). When this is not the case, the joint model may not perform as effectively.
We can adapt the model to settings where analysts use some variables as both linkage variables and covariates. When such variables are recorded identically across the two files, this presents no issue for the modeling framework. In such cases we can view them as blocking variables-i.e., records are required to have the same values of these variables, or else they are not linked-rather than treat them as linkage variables having some γ f,ij . We then replace f 2 with a regression of y on this subset of variables. When these variables are not recorded identically across files, one can use the values in both files as linking variables and select one set, presumably those in File B with the rest of x, to include among the covariates in the model for y. We illustrate this approach in the debit card analysis as well as in a simulation in the supplementary material (Guha et al., 2022). For the remainder of this section, we assume that the covariates and linking variables are disjoint.
To illustrate the potential benefit of joint modeling over two-stage modeling, we examine the likelihood ratio that any pair of records is linked versus not linked. Under the joint model, the likelihood ratio of i ∼ j (i.e., record i is linked to record j) and i ∼ j is given by Notably, (2.5) depends on a contribution to the likelihood from the outcome model. In contrast, the likelihood ratio for linking records in the traditional two-stage model only involves the likelihood from the assumed probabilistic record linkage model. For this model, we have (2.6) Theorem 2.1 offers insight into the behavior of Ratio Joint and Ratio 2Stage .
is bounded away from 0 and ∞ in its support, we have Proof. The likelihood ratio under the joint model, Ratio Joint , can be expressed as (2.7) The likelihood ratio under the two-stage model, Ratio 2Stage , is given by (2.6), which we abbreviate as h(θ f,m , θ f,u ). Thus, log(Ratio Joint ) = log(h(θ f,m , θ f,u )) + log f1 (yi|xj ,wj ,θc) f2(yi|θ d ) , and log(Ratio 2Stage ) = log(h(θ f,m , θ f,u )). Therefore, we have as a consequence of this expression being a Kullback-Leibler divergence. And, we have where the last inequality follows from the fact that the expression is (−1) times the Kullback-Leibler divergence between the two densities f 1 and f 2 .
Theorem 2.1 indicates that the likelihood ratio for the joint model is more extreme than the likelihood ratio for the two stage model, which facilitates more accurate identification of a link or no link between records i and j.

Outcome Models
Naturally, one should specify f 1 (y i |w j , x j , θ c ) and f 2 (y i |θ d ) to describe the distribution of outcomes as faithfully as possible. In this article, we specify models for y i ∈ R and assume f 2 (y i |θ d ) = N (y i |μ 2 , σ 2 2 ); setting a more complicated distributional form for f 2 or extending to a categorical y i is relatively straightforward. For f 1 (y i |w j , x j , θ c ), we use a general mean-zero additive error form, ( 2.8) The specification for m(x j , w j ) could be a simple linear form, although often in observational studies it is advantageous to use more flexible modeling (Hill, 2011).
We use a computationally favorable yet flexible specification for m(x j , w j ). In particular, we assume whereê(x j ) is the estimated propensity score. Here, one can determineê(x j ) using the covariates and treatment statuses for all the records in File B. Alternatively, one can determineê(x j ) for just linked records, so as to facilitate modeling over the specific region of the propensity score distribution relevant for (2.8). We illustrate the second strategy in the simulations of Section 4, withê(x j ) = ρ −1 (x jη ), where ρ(·) is the logit link function andη is the maximum likelihood estimate of η obtained by fitting a logistic regression of w j on x j for all j ∈ B = {j : z j = i, for some i, 1 ≤ i ≤ n A }. We note that, in our simulations and illustration with the debit card study, estimating propensity scores with all of File B did not change results meaningfully.
We suggest placing a large number of knots to estimate the semi-parametric functions accurately. We recommend and use a fixed set of knots chosen on a regular grid that remains unchanged for all iterations of the Markov chain Monte Carlo (MCMC) sampler. Even a moderately large choice of m may result in model over-fitting. We therefore regularize the spline coefficients, β s+1 , . . . , β s+m and γ s+1 , . . . , γ s+m , using Bayesian Lasso shrinkage priors. Following Park and Casella (2008), a scale-mixture representation of the Bayesian Lasso shrinkage prior is given by ∼ N (0, 1) priors. We also assign β 0 ∼ N (0, 1) and σ 2 ∼ IG(a σ , b σ ) priors. The prior specification is completed by setting prior distributions on θ d = (μ 2 , σ 2 2 ) as μ 2 ∼ N (0, 1) and σ 2 2 ∼ IG(a σ2 , b σ2 ) a priori. We discuss the choice of hyper-parameters further in Section 3.

Estimation of the Causal Effect
As the causal effect of interest, we define the average treatment effect for the linked records, which we abbreviate as ATEL.  (Hill, 2011). When seeking to generalize results beyond the linked records, analysts should examine the distributions of x in the linked and not linked cases. When substantial differences exist, it is prudent to be cautious in generalizing the ATEL to the study population in File B. One special case where the ATEL readily generalizes is when the individual treatment effect y i (1) − y i (0) = T for all i in File B, i.e., a constant treatment effect. Then, the ATEL is an average treatment effect for the study population in File B.
As an extension of the Bayesian joint model, analysts could impute plausible values of (y i (0), y i (1)) for cases in File B that do not have links. This would allow analysts to compute average treatment effects for the population of records in File B. We leave comparison of this estimator to the ATEL for future research.

Posterior Computation
Incorporating the prior information, the full posterior for the model with the semiparametric outcome regression is proportional to Exp(τ 2 1,k |λ 2 1 ) × Exp(τ 2 2,k |λ 2 2 ) × Gamma(λ 2 1 |r 1 , δ 1 ) × Gamma(λ 2 2 |r 2 , δ 2 ) Similarly, the full posterior for the model with the parametric outcome regression is proportional to Summaries of these posterior distributions cannot be computed in closed form. Thus, posterior computation proceeds through MCMC algorithms. In each iteration, we update the outcome regression parameters using the current set of model-determined links.
We also re-estimate propensity scores based only on those records in File B that have been linked to File A in that iteration. The full posterior conditionals can be found in the supplementary material.
For either model, we let the MCMC chain run until apparent convergence (2000 iterations in our simulations) and discard an appropriate burn-in. Let z (1) j , . . . , z (L) j be the L post burn-in MCMC iterates of z j , where j = 1, . . . , n B . For each j, we empirically estimate P (z j = q|−) using the proportion of post burn-in samples where z j takes the value q, i.e.,P (z j = q|−) = #{l : z (l) j = q}/L, for q ∈ J j = {1, . . . , n A , n A +j}. The most likely link for record j in File B is the record q satisfying 1 ≤ q * = arg max q∈JjP (z j = q|−) ≤ n A . We denote this record asẑ j = q. When q * = n A + j, we declare it most likely that record j does not have a link in File A. The posterior distributions of each z j characterize the uncertainties associated with the links.
To draw posterior inferences on the ATEL, define y miss,i = (1 − w i )y i (1) + w i y i (0) as the counterfactual outcome for the ith record in File A, where i = 1, . . . , n A . At the l-th post burn-in iteration, we impute the counterfactual outcomes y miss,i using a treatment indicator that is opposite of what is observed for its linked record, i.e., we set w i = (1− w j ). We obtain the l-th post burn-in iterate for the ATEL using (2.12) with (y i , y (l) miss,i ) for all i ∈ A (l) . In the simulations in Section 4 and analyses in Section 5, we choose the values of the hyperparameters as a σ = 1, b σ = 1, α π = 1, β π = 1, a σ2 = 1, b σ2 = 1, r 1 = r 2 = δ 1 = δ 2 = 1. Moderate perturbations of the hyperparameter values lead to practically indistinguishable results.

Simulation Studies
We assess the performance of the Bayesian joint model using simulation studies. We consider scenarios in which we vary (a) the proportion of records in common between the two files and (b) the data generation model for the outcomes. Within these, we consider scenarios with the correctly specified and a mis-specified outcome regression model. Finally, we present results from a simulation with missing outcome values.

Simulated Data Generation
We work with the RLdata10000 data from the R package, RecordLinkage (Sariyar and Borg, 2010). These data comprise an artificial population of 10000 records with first names, last names, birth years and birth dates, which we use as linking variables. Among these records, there are 1000 individuals whose values of these variables have been duplicated and then randomly perturbed, introducing errors into these potential linking variables; see Sariyar and Borg (2010) for details. The RLdata10000 dataset is widely used as benchmark data for record linkage methods (Binette and Steorts, 2020).
The RLdata10000 data do not include covariates, treatments, or outcomes. Thus, we generate values of these for each of the 9000 unique individuals in the RLdata10000 file. For each individual j, we generate p = 2 covariates, x 1,j and x 2,j , sampled i.i.d. from standard normal distributions. We generate each individual's binary treatment assignment w j from a Bernoulli distribution with probability given by where (α 0 , α 1 , α 2 ) = (1, 1.5, −1). We generate each individual's outcome y j from where the superscript 0 indicates the true data generating mechanism. We examine results for two choices of (m 0 1 , m 0 2 ). The first uses linear functions in e(x j ): m 0 1 (x j ) = 1 + 2e(x j ) and m 0 2 (x j ) = 4. We call this Scheme L. The second uses nonlinear functions in e(x j ): m 0 1 (x j ) = 0.58 − 1.5 e(x j ) and m 0 2 (x j ) = exp(−0.8 + 2.6 e(x j )). We call this Scheme N. In both schemes, the true regression function can be viewed as a nonlinear function of the covariates.
We construct File A and File B by putting subsets of these records into two files. For any record, File A includes the outcome information, while File B includes the covariate and treatment information; both files include the imperfect linking variables. For ease of simulation, we set the sizes of File A and File B to be n A = n B = 1000.
In any simulation, we randomly sample a subset of the 1000 individuals with duplicates. We put these records in File A and their duplicates in File B. The number of these overlapping individuals is denoted by O AB , which is varied to be 100, 500, or 900. For the remaining (n A − O AB ) records in File A, we randomly choose (n A − O AB ) records from the 8000 individuals without duplicates, discarding their treatments and covariates and keeping their outcomes and the linking variables. To ensure that the non-overlapping records of File A and File B correspond to different individuals, we set aside these (n A − O AB ) records from the 8000 records. To add the remaining (n B − O AB ) records to File B, we randomly choose (n B −O AB ) records from the remaining (8000−n A +O AB ) records, discarding their outcomes and keeping the treatments, covariates, and linking variables. This method of selection results in the same distributions of all variables for linked and not linked records. This matches an implied assumption of the model and is a condition for the validity of typical estimation tasks with linked data (Christen, 2012).
For comparisons, we consider the performance of two alternatives. The first is a twostage model in which we fit the bipartite Bayesian record linkage method as described in Section 2.2, without using the covariates, treatments, or outcomes. For each of the L post burn-in MCMC samples of z, we estimate the outcome model using a Bayesian regression. Using this model, we take a sample from the posterior predictive distribution of the missing potential outcomes. Thus, we have L draws of each missing potential outcome, which we use for inferences about the ATEL. Comparisons with this approach reveal if the sharing of information between the record linkage and outcome models offers any inferential advantages. We also consider using the known links, that is, we make causal inferences with the true links. Although this approach is not feasible in practice, as one does not know the true links in genuine scenarios, we consider it a benchmark for the best we can do in these simulation scenarios.
We compare the performances of the joint model and two-stage model in terms of record linkage as well as causal inference. For the former, we examine the positive predictive value (PPV) and the negative predictive value (NPV), defined as follows. After running the record linkage procedure on the two files, e.g., as we do for each simulation replication, letẑ be the posterior mode of z. Let A 1,j = {ẑ j = z j , z j ≤ n A } and A 2,j = {ẑ j = z j , z j = n A + j}. Let I(A k,j ) be the indicator function corresponding to set A k,j , where k = 1, 2 and j = 1, . . . , n B . The PPV is the proportion of links that are actual matches, that is, The NPV is the proportion of non-links that are actual non-matches; that is, A perfect record linkage procedure would result in PPV = NPV = 1.
To assess the quality of causal inference for all three methods in any simulation replication, we use the mean squared error (MSE) of the L post burn-in causal effects, i.e., MSE = L l=1 (ATEL (l) − ATEL 0 ) 2 /L, where ATEL 0 is the true causal effect and ATEL (l) is the lth post burn-in estimate of ATEL.

Results
We begin with results with no missing outcomes and with correct outcome model specifications. That is, we specify parametric or semi-parametric outcome regressions that match the choices of m 1 (·) and m 2 (·) in the data generation models in Section 4.1. We estimate the joint model and the two-stage model using four linking variables: first name, last name, birth date and birth year. We treat birth dates and years as categorical variables, setting each γ f,ij = 1 when they match across files and each γ f,ij = 0 otherwise. We compute LS metrics for first and last names, and for each set γ f,ij = 1 when the LS metrics are at least 0.95 and γ f,ij = 0 otherwise. We note that, in these and all other simulation studies, the effective sample sizes for all regression parameters are greater than 400, among the L = 500 post burn-in samples. Convergence happens quickly for both models, producing practically uncorrelated post burn-in samples. Table 1 summarizes the average PPVs and NPVs for the joint model and twostage model over 100 independent replications-enough to generate sufficiently small Monte Carlo errors-when using the correct outcome model specifications. For both data generation schemes, the PPVs and the NPVs of the joint model decrease as the percentage of overlap between File A and File B decrease. The two-stage model follows a similar pattern. Comparing the two models, we see that the joint model tends to have a higher PPV than the two-stage model. The improved performance of the joint model becomes increasingly apparent as the amount of overlap decreases. The joint model also tends to have a higher NPV than the two-stage model, although the differences are negligible in the scenario with 10% overlap under Scheme N. The improvements in the linkages when using the joint model have benefits for the estimation of the causal effect. As evident in Table 2, the joint model performs significantly better on MSE than the two-stage model. Because the two-stage method makes fewer correct linkages than the joint model, the posterior distributions of its parameter estimates tend to be centered away from the true values, which results in less accurate imputations of the missing potential outcomes. Additionally, the variability of the estimated mean regression function tends to be larger for the two-stage method than the joint method. Among these two reasons, the bias tends to represent a larger portion of the MSE. For example, under the two-stage model under scheme L and with 10% overlap, the squared bias of the posterior mean is around 10.9, whereas the variance is around 0.7. The corresponding squared bias and variance for the joint model are 1.53 and 0.19. We also observe that the performance gap in MSE becomes more substantial as the percentage of overlap decreases, especially under Scheme L. Notably, the results from the joint model are similar to those from the gold-standard Known Link model in the 50% and 90% overlap scenarios.
We next examine performance when the outcome model does not exactly match the data generating model. In particular, we fit an outcome regression that is linear in the propensity score even though the outcomes are generated using Scheme N; and, we fit an outcome regression that uses the penalized splines even though the outcomes are generated using Scheme L. Here, we only consider the scenario with 90% overlap, which gives both methods the best chance to perform well. As evident in Table 3, not surprisingly performances of both models deteriorate substantially compared to the results in Table 1 and Table 2. We are imputing potential outcomes from misspecified models, after all. When fitting the semi-parametric model to data generated under Scheme L, we observe higher PPV and NPV, as well as a lower MSE, for the    Table 2. joint model compared to the two-stage model. This is also the case when fitting the parametric model to data generated under Scheme N; however, in this scenario the differences are practically modest. Taken together, these results suggest that, even with some model mis-specification, it may be advantageous to use the joint model over the two-stage model.
Finally, we examine the performance of the joint model and two-stage model in the presence of missing outcomes in File A. We blank either 5% or 10% of the values of y i in File A using a missing completely at random mechanism. We examine the cases of correct model specifications with 90% overlap of records between File A and File B. To handle the missing values in the joint model, we sample the missing observations from their posterior predictive distributions in each MCMC iteration. For the two-stage model, based on the posterior modes of z, we impute the missing values from their  Table 4: PPV, NPV, and MSE for the joint and the two-stage models with 5% and 10% missing outcomes in File A. Results based on 90% overlap of records between File A and File B. The true causal effects averaged over 100 replications for Scheme L and Scheme N for 90% overlap are given in Table 2.
posterior predictive distributions after fitting the outcome model on the linked dataset. Table 4 summarizes results over 100 independent simulation runs. The performance of the joint model worsens as the percentage of missing data increases, although not by much in these scenarios.
A similar trend is observed for the two-stage model. We continue to see advantages of the joint model over the two-stage model.
In the supplementary material, we describe results from additional simulation scenarios. In particular, we find that the relative performances of the joint model and two-stage model remain qualitatively unchanged when using correlated (rather than independent) covariates (Supplement A). We lower the signal to noise ratio by increasing the regression variance (Supplement B). Not surprisingly, the performance gap between the joint model and the two-stage model closes as the variance increases. We study the performance of the two models when a subset of the predictors are also used as linking variables (Supplement C). The joint model continues to offer improved performance over the two-stage model. Finally, we present a simulation study when the true regression function is not entirely a function of e(x j ) and the fitted spline model also adjusts for covariates along with the propensity score (Supplement D). Using the covariate adjustments, the joint model offers lower MSEs than the two-stage model. We next turn to demonstrating the joint model with data from an observational study, which concerns the causal effect of possession of debit cards on household consumption. As we describe, we have true links in these data. Thus, the analysis serves primarily to illustrate the methodology.

Causal Study of Debit Cards
As context, we begin with background on the motivation for the causal question of interest. The past few decades have seen a steadily increasing global trend in the use of non cash payment instruments like credit, debit and prepaid cards. Thaler (1985) and Thaler (1999) argue that the form of payment instruments can have a significant impact on consumer decisions via mental accounting, a set of cognitive operations used by individuals and households to keep track of financial activities. Indeed, there is evidence that consumers who have cards would spend more than ones who do not (Cole, 1998). A comprehensive causal study carried out by Mercatanti et al. (2014) in this regard focuses on the effect of debit cards on spending. Mercatanti et al. (2014) argue that debit cards, unlike credit cards, do not allow consumers to incorporate additional long-term sources of funds in their spending decisions, thus eliminating any confounding intertemporal reallocations of wealth from the psychological effects on spending (Soman, 2001), and hence are more appropriate to look at for this kind of a causal study.

Data Description and Background
We use data from the Italy Survey on Household Income and Wealth (SHIW). The SHIW is a nationally representative survey, run by the Bank of Italy once in every two years since 1965, with the only exception being that the 1997 survey was delayed to 1998. The purpose of this survey is to collect information on several aspects of Italian households' economic and financial behavior. Since the data contain information related to household characteristics, spending and payment instruments, the SHIW can provide a useful opportunity to evaluate the causal effect of debit card possession on spending in Italian households.
We link two files comprising data collected during the years 1995 and 1998. A number of the same households participated in both years. In particular, our target population is the set of households having at least one current bank account but no debit cards before 1995. The treatment w = 1 if the household (all members combined) possesses one and only one debit card at 1998, and w = 0 if the household does not possess any debit cards at 1998. Households with more than one debit card are excluded from our sample. Ideally, we would perform the analysis with units being individuals that possess debit cards, because debit cards are typically issued to individuals. However, the SHIW survey only has this information at the household level. Our strategy to limit the sample of treated units to households possessing only one debit card ensures that a possible effect on household spending will be due to a certain individual possessing this card. Though we do not have exact information on the ownership of the card, we make the (reasonable) assumption that the head of the household has possession of the sole debit card.
The outcome on which we evaluate the treatment effect is the monthly average spending of the household on all consumer goods, measured in the latter survey (1998). For data quality control, we delete 15 observations which have either negative values of the outcome (monthly spending) or unusually high ratios (greater than 5 and going up to 900) of monthly spending to monthly income. Upon implementing such data quality control measures, the data file corresponding to 1995 contains 589 observations with information on the treatment (debit card possession) and covariates, while the data file corresponding to 1998 (3919 observations) contains information on the outcome (monthly average household spending).
Both files contain a common set of imperfect linking variables, including the geographical area of residence of the household, the number of inhabitants in the town of the household, the number of members in the household, and the gender, birth year, marital status, region of birth and highest educational qualification of the head of the household. Fortunately, we also have a unique ID that we can use to perfectly link households across years. We use this ID variable to assess how well our model has linked observations in the two files, based on the other imperfect linking variables noted above. Using the unique matching ID, we observe that the file contains 191 observations in the treatment group (who possess a debit card) and 398 observations in the control group. An initial check on the spending distribution for the treatment and control groups (see Figure 1(a)) hints at a positive effect of acquiring a debit card on household spending.
The covariates (possible confounders) we consider in this study are all measured in the initial survey (1995), and consist of the monthly average spending of the household on consumer goods in the initial survey year (lagged outcome), the net wealth of the household, the household net disposable income, the monthly average cash inventory held by the household, the number of members in the household, the marital status of the head of the household, the average interest rate and the number of banks in the municipality where the household is located. The choice of these confounders appears to satisfy the strong ignorability condition, as we discuss in the online supplement (Supplement F). The lagged outcome is generally indicated in the economic literature as a fundamental confounder (Angrist and Pischke, 2009;Frölich and Sperlich, 2019). The cash inventory held by the household was introduced in the specific context by Mercatanti et al. (2014). The net wealth and the net disposable income are important indicators of the household economic condition. The last two covariates have been suggested by Attanasio et al. (2002), who have shown in non-causal contexts that the interest rate and the number of banks in the municipality where the household lives had a significant contribution to the probability of acquiring debit card in Italy. Moreover, the number of banks is a good indicator of the size of the municipality. We note that the number of members in the household and the marital status of the household head are used both as covariates and linking variables.

Results
We implement the joint model with the semi-parametric outcome regression specification discussed in Section 2.3. We also include the two-stage model and the results using the known-links for comparisons. We use the same prior hyperparameter values as in the simulation studies; moderate perturbation of them leads to practically indistinguishable results. For the outcome models, we re-estimate the propensity scores using the linked records at each MCMC iteration. Results based on outcome models with propensity scores computed from all records in the 1995 data are presented in Supplement I; they are essentially identical to what we present here. We let the MCMC chains run for 2000 iterations and discard the first 1500 as burn-in, and draw inferences on both the ATEL and record linkage based on the post burn-in iterates. The effective sample sizes averaged over all parameters are 411 and 428, for the 500 post burn-in iterates corresponding to the joint and two-stage models, respectively. The post burn-in iterates  Figure 1(b) shows the distribution of the ATEL for the joint, two-stage and the knownlink models. The numbers are in per thousand Italian Liras.  Table 5 presents the PPV and NPV values, along with the posterior median and 95% credible intervals of the estimated ATEL (in thousand Italian Liras) for all models. Consistent with the simulation results, the joint model offers a noticeably better PPV and NPV than the two-stage model. Using the results from the known links as a benchmark, we find that the posterior inferences for the joint model seem more plausible than those from the two-stage model. First, the posterior medians for the joint model and the known link model are more similar to one another than are the posterior medians for the known-link model and two-stage model. Second, the 95% credible interval for the joint model is wider than the interval for the known-links model, which is sensible in that it reflects additional uncertainty from imperfect linkages. On the other hand, the 95% credible interval from the two-stage model actually is practically the same length as the interval for the known-links model, effectively portraying no propagation of uncertainty from imprecise linkages. Figure 1(b) displays the posterior distributions of the ATEL. The results suggest that, on average, the effect of possession of a single debit card for a household leads to more monthly consumption than households that do not possess any debit card during the study period. Our analysis largely eliminates any potential confounding effect of intertemporal reallocation of wealth, since debit cards do not allow for longterm fund sources (Soman and Cheema, 2002). Hence, the significant estimated effects of debit card possession on spending may be attributed to psychological reasons (increased perceived amount of money) (Soman, 2001) and easier accessibility to financial resources (Morewedge et al., 2007). The estimated ATEL is higher than the average treatment effect on the treated (∼ 200 thousand Italian Liras) in Mercatanti et al. (2014). This result is interpretable in the light of some recent economic models for the use of debit cards (e.g., see Kim and Lee, 2010 and references therein), which imply that people with less economic resources adopt debit cards later than the rest of the population. This is confirmed by Mercatanti et al. (2014) who show that households with debit cards generally have members with higher levels of income, wealth and education compared to households without debit cards. Therefore, our estimated ATEL values indicate larger psychological effects on spending for people in disadvantageous social and economic conditions.

Discussion and Future Work
The Bayesian approach to causal inference and record linkage offers interesting future directions. For example, some data applications have predictors and treatment status residing in different files. This requires significant modifications of the approach presented here, as one needs a model for the covariates as well as the outcomes. Another important future direction is to extend this approach to other flexible outcome models. In fact, to address outcome model mis-specification, it may be possible to adopt a doubly robust approach to causal inference (Graham et al., 2016;Saarela et al., 2016).
When one or more files have large numbers of records, Bayesian probabilistic record linkage techniques, including our joint modeling approaches, are likely to face computational challenges. These models may require comparing very large numbers of record pairs in each MCMC iteration (Binette and Steorts, 2020). In such cases, it may be possible to leverage techniques such as blocking, filtering and indexing (Murray, 2016;Steorts et al., 2014) to reduce computational burdens. Nonetheless, efficient computation with joint models is an important topic for research.
We conclude with a connection to the philosophy of causal inference. Performing causal inference and record linkage simultaneously allows the values of the outcome variables to influence which records are used in the causal estimator. This is in conflict with the often followed advice that the design of the observational study should proceed separately from the analysis (Imbens and Rubin, 2015). As suggested by Wortman and Reiter (2018), if one seeks the potential gains in accuracy from using the relationships among the variables, this is the price to pay for working with imperfect linkages.

Supplementary Material
Supplementary Material: Bayesian Causal Inference with Bipartite Record Linkage (DOI: 10.1214/21-BA1297SUPP; .pdf). Supplement A includes additional simulations where the predictors are correlated. Supplement B includes additional simulations by varying the signal to noise ratio. Supplement C includes additional simulations that incorporate overlap between linking variables and covariates. Supplement D includes additional simulations with a different data generation scheme, and with the spline model adjusting for covariates along with the propensity score. Supplement E contains full conditional distributions for implementing the Gibbs sampler. Supplement F examines the plausibility of the strong ignorability assumption for the debit card usage data. Supplement G looks at convergence diagnostics for the MCMC chains for the competing methods. Supplement H provides additional descriptive analysis for the debit card study. Supplement I provides additional analysis for the debit card study, where the propensity score is computed only once from the full 1995 file, rather than in every iteration. Supplement J re-analyzes the real data using the ratio of spending to income as the outcome.