Causal inference under mis-specification: adjustment based on the propensity score

We study Bayesian approaches to causal inference via propensity score regression. Much of the Bayesian literature on propensity score methods has relied on approaches that cannot be viewed as fully Bayesian in the context of conventional `likelihood times prior' posterior inference; in addition, most methods rely on parametric and distributional assumptions, and presumed correct specification. We emphasize that causal inference is typically carried out in settings of mis-specification, and develop strategies for fully Bayesian inference that reflect this. We focus on methods based on decision-theoretic arguments, and show how inference based on loss-minimization can give valid and fully Bayesian inference. We propose a computational approach to inference based on the Bayesian bootstrap which has good Bayesian and frequentist properties.


Introduction
In the study of the causal relationship between an exposure (or treatment) and an outcome, bias in the estimation of the exposure effect may occur due to confounding if the exposure is not an experimental intervention. Confounding exists whenever the exposure assignment is dependent on predictors that also influence the outcome. If the dependence of outcome on exposure and predictors is modelled correctly, standard regression is adequate to obtain correct inference about the exposure effect. When correct specification cannot be guaranteed, the propensity score can be used to break the dependence between confounders and exposure, to create balance in the distribution of confounders across exposure groups, and facilitate correct inference. This paper studies how the propensity score can be deployed in a Bayesian causal analysis.
Adjustment via the propensity score can be carried out using regression, inverse weighting, stratification or matching. In regression settings, parametric models are proposed to represent the propensity score and the (expected) outcome given the propensity score. In frequentist approaches, adjustment is carried out by estimating parameters in the propensity score and the outcome models separately. In a fully Bayesian framework, such a two-step analysis is uncommon; it would be more natural to fit a single joint model for the treatment and outcome. This has led to discussion as to how Bayesian methods can be used in the causal setting, and even whether Bayesian methods are valid. There is a growing literature on sophisticated procedures for performing Bayesian causal analysis, but in a fully Bayesian framework, some aspects of the methodologies deployed appear non-standard and not justified via Bayesian logic.
We address these issues in this paper. Section 2 recaps the regression approach to causal estimation, and section 3 describes how valid Bayesian causal inference rests on the assumption of exchangeability of the observable quantities to be modelled, formalized through de Finetti's representation, and then reviews Bayesian adjustments using the propensity score. Section 4 describes Bayesian decision-theoretic inference, which gives the framework for inference under mis-specification, and section 5 gives the non-parametric computational strategy that we deploy. Section 6 recasts the conventional Bayesian approach in the decision-theoretic framework. We provide simulation studies in section 7, more complicated inference settings in section 8, and conclude with a discussion in section 9.
We note here that Bayesian methods that do not rely on the propensity score are also quite widely used: these methods utilize flexible parametric or non-parametric procedures to represent the outcome model as a function of the treatment and other predictors, and attempt to avoid mis-specification. These methods are certainly useful, and the inferential theory supporting such one-stage analyses is more straightforward. However, such flexible outcome regression models cannot estimate the causal effect of interest in all cases, such as those where a more general target of inference is defined. These models are not the primary focus of this paper. Similarly, we will not discuss Bayesian matching methods in detail, although some comments are given in section 9.

Background
To formulate causal estimation problems, potential or counterfactual outcomes are often used. Potential outcomes, {Y (z)} for z in some putative treatment set, represent the outcomes that would be observed if treatment level Z were set to z. If exposure Z takes two levels labelled {0, 1}, the potential outcomes represent values of outcome Y that would be observed had exposure been set by intervention to z = 0, 1 respectively (Neyman, 1923; Rubin, 1974; Holland, 1986). We consider n subjects, and for the i-th subject, let Y i be the outcome of interest, Z i be the exposure, and X i = (X i1 , X i2 , . . ., X ip ) ⊤ be a p-dimensional vector of confounders. We denote by X the data observation space.

The average treatment effect (ATE)
Under an assumption of no unmeasured confounding or ignorability, so that {Y (z)} ⊥ Z | X, the Average Treatment Effect (ATE) for a binary treatment is defined by

τ = E[Y (1) − Y (0)]. (1)

If Z is assigned independently of covariates X, Z ⊥ X, the ATE, τ, can equivalently be written

τ = E[Y | Z = 1] − E[Y | Z = 0]. (2)

This definition differs notationally from the formulation via counterfactual or potential outcomes (Rubin, 1974) or the do-operator (Pearl, 2009), but under the independence assumption is equivalent. Equation (2) defines a marginal (over X) estimand, although conditional (subset-specific) estimands may also be defined. The calculation in (2) can be mimicked in the observed data to yield the estimate

τ̂ = (1/n) Σ_{i=1}^n { E Y |X,Z [Y | X = x i , Z = 1] − E Y |X,Z [Y | X = x i , Z = 0] }, (3)

but this requires knowledge of the conditional expectation E Y |X,Z [Y | X = x, Z = z] for all (x, z). Typically, this expectation would be represented using a regression model; if this model is mis-specified, inference about τ will be incorrect in the presence of confounding, that is, when the independence assumption does not hold and X is also associated with Y.
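As a concrete illustration of the contrast in (2) and its empirical regression-based counterpart, the following sketch simulates confounded data and compares the naive mean difference with the standardized estimate. The data generating choices (linear outcome model, logistic treatment assignment, τ = 1.5) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Simulated confounded data: X influences both the treatment Z and the outcome Y.
X = rng.normal(size=n)
Z = rng.binomial(1, 1 / (1 + np.exp(-X)))      # logistic treatment assignment
Y = 2.0 * X + 1.5 * Z + rng.normal(size=n)     # true ATE: tau = 1.5

# The naive contrast E[Y|Z=1] - E[Y|Z=0] is biased because Z and X are dependent.
naive = Y[Z == 1].mean() - Y[Z == 0].mean()

# Mimic (2) in the observed data: fit E[Y|X,Z] by least squares on (1, X, Z),
# then average the fitted Z = 1 vs Z = 0 contrast over the empirical
# distribution of X.
D = np.column_stack([np.ones(n), X, Z])
beta = np.linalg.lstsq(D, Y, rcond=None)[0]
D1 = np.column_stack([np.ones(n), X, np.ones(n)])    # everyone treated
D0 = np.column_stack([np.ones(n), X, np.zeros(n)])   # everyone untreated
tau_hat = (D1 @ beta - D0 @ beta).mean()
```

In this simulation the outcome regression is correctly specified, so the standardized estimate is close to the true value while the naive contrast overstates it; the material that follows addresses what happens when correct specification cannot be guaranteed.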

The role of the propensity score
To estimate the ATE in the presence of confounding, Rosenbaum and Rubin (1983) showed that if the exposure assignment is ignorable and b(X) is a balancing score, defined so that X ⊥ Z | b(X), the ATE can be evaluated by averaging analyses carried out conditional on b(X). If Z is binary, a typical choice for the balancing score is the propensity score, b(X) = Pr[Z = 1 | X]. Conditioning on the propensity score allows estimation of τ in the presence of confounding, when the conditional model for Y given X and Z is not correctly specified, by breaking the dependence between X and Z. Propensity score regression represents the expected outcome conditional on the exposure, confounders and propensity score. As b(X) is a balancing score, the ATE τ from (2) can be evaluated based on a model for the outcome given (Z, b(X)), using a modified version of (3); see for example Rosenbaum and Rubin (1983).
Typically b(·) is represented using a parametric model, b(x) ≡ b(x; γ), with γ estimated from the observed Z and X data. However, the balancing result X ⊥ Z | B only holds when b(X; γ) correctly characterizes the probability that Z = 1 for any given X; this corresponds to the existence of a true value γ0 of γ which defines the function precisely. For γ ≠ γ0, the method of proof of Rosenbaum and Rubin (1983) does not work to establish balance; see Appendix section A for a summary of the argument. Therefore, in a correctly specified parametric formulation of the propensity score, to yield balance, we must identify a single point in the parameter space, and use that to define the propensity score. If γ0 is not known, we must resort to substituting a consistent estimator γ̂ for γ0, and then the required balancing result will hold asymptotically.

An illustrative model
Suppose the observed outcome data are generated according to the structural model

Y i = X 0i ξ + Z i τ + ε i , (5)

where, for p-dimensional parameter ξ, the term X 0i ξ defines the true treatment-free mean model, and τ defines the ATE. If a regression model matching this specification is fitted using least squares, then the resulting estimator for τ is consistent. Similarly, if Z is assigned independently of X, then the estimator for τ is consistent even if the treatment-free mean model is mis-specified. However, if the model is mis-specified and Z and X are not independent, then the estimator of τ is in general inconsistent due to confounding. As demonstrated by Robins et al. (1992), the regression model

Y i = b(X i ) φ + Z i τ + ε i , (6)

where φ is a scalar parameter, yields a consistent estimator of τ, albeit one whose variance is at least as large as the variance of the estimator arising from the correctly specified model. An 'augmented' model that contains an additional 'prognostic' linear predictor term X i β involving nuisance parameter β, that is,

Y i = b(X i ) φ + Z i τ + X i β + ε i , (7)

can be fitted as an attempt to reduce the variance for τ; note, however, that the inclusion of this augmenting term is not necessary for consistent estimation of τ provided the propensity score model is correctly specified.
Least squares then provides a semiparametric estimation approach.
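A minimal sketch of this estimator under an assumed data generating mechanism: the fitted treatment-free mean model is deliberately mis-specified (the confounder X is omitted entirely), yet regressing Y on the fitted propensity score and Z recovers the ATE, provided the propensity score model is correct. All numerical choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Illustrative data generating mechanism (not taken from the paper):
X = rng.normal(size=n)
Z = rng.binomial(1, 1 / (1 + np.exp(-X)))      # propensity score model is logistic
Y = 2.0 * X + 1.5 * Z + rng.normal(size=n)     # true ATE: tau = 1.5

# Deliberately mis-specified treatment-free mean model: X omitted, so the fit
# of Y on (1, Z) alone is confounded.
tau_naive = Y[Z == 1].mean() - Y[Z == 0].mean()

# Fit the propensity score b(x; gamma) by logistic regression (Newton-Raphson).
G = np.column_stack([np.ones(n), X])
gamma = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-G @ gamma))
    gamma += np.linalg.solve(G.T @ ((p * (1 - p))[:, None] * G), G.T @ (Z - p))
b = 1 / (1 + np.exp(-G @ gamma))

# Propensity score regression: least squares of Y on (1, b(X), Z). Although the
# treatment-free mean is still mis-specified, the coefficient on Z is
# consistent for tau, because E[Z | b] = b for binary Z.
D = np.column_stack([np.ones(n), b, Z])
theta = np.linalg.lstsq(D, Y, rcond=None)[0]
tau_ps = theta[2]
```

Here `tau_naive` is badly biased while `tau_ps` is close to 1.5, illustrating the consistency claim for the feasible plug-in estimator.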
If a parametric model b(x) = b(x; γ) is used, then the parameter γ must be consistently estimated for adequate adjustment. A plug-in procedure, where γ is replaced by an estimate γ̂ and the regression utilizes b(x; γ̂), is typically used, and corresponds to the 'feasible' E-estimator of Robins et al. (1992). It is justified in part by the asymptotic independence of γ̂ and (φ̂, τ̂) (Henmi and Eguchi, 2004). The extended model (7) has the advantage of additional inferential robustness: if the X i β component is correctly specified (i.e. reflects the data-generating mechanism), the estimator of τ will be consistent even if the propensity score is not correctly specified. This is known as double robustness. If the data generating structural model contains a more general treatment effect structure, the propensity score regression approach must be modified. For example, if the model takes the form

Y i = X 0i ξ + Z i M 0i ψ + ε i ,

where ψ is a q × 1 vector parameter and M 0i is a 1 × q vector of predictors, the ATE is E[M 0i ]ψ. This quantity (and the parameters ψ) can be consistently estimated using the propensity score regression approach based on the model

Y i = b(X i ) M 0i φ + Z i M 0i ψ + ε i ,

where now φ is a q × 1 parameter; that is, the model includes an interaction term involving the propensity score configured to match the treatment effect model. This construction is necessary to ensure that confounding via the open paths that involve the interaction terms is also removed by conditioning on the propensity score. Further modifications are necessary if the structural model is extended beyond the linear; see section 8.

Bayesian inference under exchangeability
The key construction for any Bayesian inference problem to be solved under an assumption of exchangeability of the observable quantities is de Finetti's representation, which leads to the standard definitions of likelihood, prior, parameters and the notion of 'correct specification'. If {O i , i = 1, 2, . . .} is a sequence of exchangeable observable quantities, where each O i takes values in X , the de Finetti representation of the joint density of any collection of size n ≥ 1 of the observables is

f (o 1:n ) = ∫ Θ { ∏_{i=1}^n f O (o i ; θ) } π 0 (θ) dθ,

where π 0 (θ) is the prior distribution on parameter θ, presumed to take values in parameter space Θ. Bayesian inference about θ is made via the posterior distribution

π n (θ) ∝ { ∏_{i=1}^n f O (o i ; θ) } π 0 (θ).

If o 1:n = (x 1:n , y 1:n , z 1:n ), the components of θ characterize the joint distribution. We have the decomposition

f O (o; θ) = f Y |X,Z (y|x, z; η) f Z|X (z|x; γ) f X (x; ζ). (12)

Decomposing θ = (η, γ, ζ), and assuming an independent prior structure, we require that the three components in (12) each admit a de Finetti representation based on what we term conditional exchangeability assumptions (Saarela et al., 2022). For n ≥ 1, the triples (X i , Y i , Z i ), i = 1, . . ., n, are assumed to be conditionally independent given θ, and the Bayesian specification is completed after defining a probability distribution for the observables of this form. This formulation proposes that in the data generating model the Y i s are conditionally independent given the (X i , Z i ) pairs and parameter η. This is a standard assumption in the frequentist parametric sequel, and would hold in any conventional regression model. The full probability model for observables and unobservables can be decomposed as

f (o 1:n , θ) = { ∏_{i=1}^n f O (o i ; θ) } π 0 (η) π 0 (γ) π 0 (ζ),

with the usual conditional independence decompositions of the 'likelihood' terms. The prior independence assumption is natural in light of the conditional exchangeability formulation in (12). This leads to the posterior distribution π n (η, γ, ζ) in the usual way. Under standard assumptions, the posterior distribution converges as n −→ ∞ to a unique degenerate limit at a single point (η 0 , γ 0 , ζ 0 ), and the data generating model is in fact factorized as

f O (o; θ 0 ) = f Y |X,Z (y|x, z; η 0 ) f Z|X (z|x; γ 0 ) f X (x; ζ 0 ).

The Bayesian model is considered correctly specified if this limiting behaviour holds.
The formulation above is parametric, but extensions to the non-parametric case where θ is infinite dimensional are straightforward. We regard a valid Bayesian approach as one which relies on the de Finetti representation (or equivalent) for observable quantities in the data generating model, with inference following a decision-theoretic argument, as outlined in section 4. Note that under exchangeability, the de Finetti representation defines (up to the choice of the prior) the complete probabilistic specification for the model, whether or not we opt to depend on it for inference. Furthermore, it determines the frequentist characteristics of Bayesian inference procedures.

Existing approaches to Bayesian causal inference
A parametric Bayesian analysis based on the true model (5) or proposed model (6) would proceed in a standard fashion. The marginal posterior distributions for τ derived from (5) and (6) are in general different. However, model (5) is essentially an 'oracle' model to which we do not have access. In this case, it is relatively straightforward to show that as n increases, the posterior distribution for τ derived from (6) becomes concentrated at the true (data generating) value of the ATE present in the structural model, despite the mis-specification present in (6). This asymptotic calculation hypothesizes an increasingly large sample of data drawn from the same probability model. These arguments hold also for the extended model (7).
If b(x) is replaced by b(x; γ) in (6) or (7), and γ is treated as an unknown parameter, the question arises whether this log-likelihood, coupled with the log-likelihood for γ itself, should be used as the basis of a three-parameter posterior in the parameters (φ, τ, γ). It is not evident on first inspection whether this posterior, or the bivariate posterior based on (φ, τ) for some plug-in value γ̂ as in the frequentist approach, is justified in a formal Bayesian inference setting. Zigler (2016) summarizes the most commonly used approaches, and describes directions in which the Bayesian formulation may be developed productively. We summarize some of the key elements below.
Cutting feedback: The joint model based on (15) does not create the required balance, or correct appropriately for confounding, due to the presence of what is termed feedback, and the marginal posterior for τ does not concentrate at the true value. To overcome this, McCandless et al. (2010) proposed that the full conditional distribution of γ should be made independent of the rightmost term of the likelihood in equation (15). This is known as the cutting feedback approach, which can be implemented as follows: a sample of size L is produced from π n (γ), and then used to construct sampled propensity score values b i (l) = f Z|X (1|x i ; γ (l) ), where γ (l) denotes the l-th sample from π n (γ). Then, a sample of size L is obtained for the outcome parameters, with the l-th sample, for l = 1, . . ., L, being generated conditionally on γ (l).

Two-step inference: A two-step procedure (Zigler et al., 2013) assumes complete separation between the exposure and outcome models. First, a point estimate γ̂ of γ is obtained from π n (γ) computed via (16). This point estimate is then used to construct an estimate of the propensity score, b̂ i = f Z|X (1|x i ; γ̂), which is then plugged into the outcome model, from which a posterior sample is obtained.

In the cutting feedback and two-step approaches, it is not immediately clear how the inferential uncertainty concerning γ in the estimation of τ should be handled. Several methods to evaluate the variance of the posterior distribution of τ have been proposed; see for example Kaplan and Chen (2012). The cutting feedback approach attempts to account for the uncertainty in the estimation of γ by direct sampling from π n (γ) in (16), with posterior computation for the remaining parameters being carried out conditionally on each sampled value of γ; the two-step approach as described above ignores the uncertainty in γ, but an adjustment based on Taylor expansions around γ̂ can be implemented (Graham et al., 2016).
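The two-step procedure can be sketched numerically as follows. For simplicity the sketch uses a Normal linear outcome model with a conjugate flat-prior posterior, treats the error variance as known at its estimate, and uses an illustrative data generating mechanism; these are simplifying assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10000

# Illustrative data (linear outcome, logistic treatment; tau = 1.5).
X = rng.normal(size=n)
Z = rng.binomial(1, 1 / (1 + np.exp(-X)))
Y = 2.0 * X + 1.5 * Z + rng.normal(size=n)

# Step 1: point estimate of gamma by maximum likelihood (Newton iterations),
# giving the plug-in propensity score estimates b_i.
G = np.column_stack([np.ones(n), X])
gamma = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-G @ gamma))
    gamma += np.linalg.solve(G.T @ ((p * (1 - p))[:, None] * G), G.T @ (Z - p))
b = 1 / (1 + np.exp(-G @ gamma))

# Step 2: conditional on the plugged-in b, sample the outcome parameters
# (intercept, phi, tau) from the conjugate Normal posterior under a flat
# prior, with the error variance fixed at its estimate for simplicity.
D = np.column_stack([np.ones(n), b, Z])
DtD_inv = np.linalg.inv(D.T @ D)
beta_hat = DtD_inv @ (D.T @ Y)
resid = Y - D @ beta_hat
sigma2 = resid @ resid / (n - 3)
L = 500
draws = rng.multivariate_normal(beta_hat, sigma2 * DtD_inv, size=L)
tau_draws = draws[:, 2]    # posterior draws for tau, ignoring uncertainty in gamma
```

As noted above, this construction ignores the uncertainty in γ; replacing the point estimate in Step 1 with draws γ (l) and repeating Step 2 for each draw gives the cutting feedback variant.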

Current literature
It is not universally accepted that fully Bayesian inference is possible using the fitted propensity score in a regression as in Robins et al. (1992), or via other methods such as inverse probability weighting (see section 8.1), as such methods involve a plug-in strategy that is not fully Bayesian; see the discussion of Saarela et al. (2015). For example, it is contended that if the propensity model is unknown and must be estimated, the plug-in estimation of b(x) is contrary to conventional Bayesian inference based on a 'likelihood times prior' formulation. This issue can be resolved using more general Bayesian decision-theoretic logic, and a Bayesian analysis under model mis-specification.
Despite such objections, there has been a marked increase in research on Bayesian methods for causal quantities based on propensity score adjustment (see, for example, Adhikari et al., 2019; Comment et al., 2019; Geneletti et al., 2019; Samartsidis et al., 2020; Nethery et al., 2020; Liu et al., 2020). While sharing a common goal of adjusting for bias due to confounding with a Bayesian lens, it is clear that consensus has not been reached on how to perform inference with propensity score-based approaches. For instance, Comment et al. (2019), Nethery et al. (2020), and Liao and Zigler (2020) all use an approach that succeeds in cutting feedback, using the propensity score to create a matched sample; these authors view the matching step as part of a 'design' rather than analytic phase of the analysis. Bornn et al. (2019) use a form of joint modelling of the treatment and outcome, as do Ray and van der Vaart (2020). Two-step approaches are widely used, although there is no agreement in the literature on whether to plug in fixed quantities (such as a posterior mean or mode) or random quantities (draws from the posterior). For instance, Vegetabile et al. (2020) use a Bayesian non-parametric approach to estimate the propensity score, which is then plugged into a standard (frequentist) estimator of the average treatment effect. Wang and Rosner (2019) use propensity score regression, conditioning on the expected value of the propensity score. In contrast, Xu et al. (2018) take a propensity regression approach to estimate the quantile (rather than average) treatment effect, conditioning on draws from the posterior distribution of the propensity score. Hahn et al. (2020) sample from the posterior distribution of the estimated propensity score, incorporating the samples into a nonlinear regression model for the outcome (including heterogeneous treatment effects) using additive regression trees. Liu et al. (2020) use inverse weighting in a two-step procedure and propagate uncertainty using the Bayesian bootstrap; see also Graham et al. (2016). Other authors have combined aspects of Bayesian and frequentist modelling to address complex models. Davis et al. (2019) use approximate Bayesian methods to estimate both a propensity score and an outcome model, and then combine predictions from these into a frequentist doubly-robust estimator in a spatial modelling context. Antonelli et al. (2020) consider the high-dimensional case, also using Bayesian methods to estimate both a propensity score and an outcome model, and computing a doubly-robust estimator by averaging over draws from the posterior distribution of the parameters of these models.
Models (6) or (7) are simple compared to some of the approaches described above, but serve to illustrate the relevant theoretical issues. Flexible models that attempt to model the outcome directly can be extremely useful in capturing the causal relationship by overcoming issues of mis-specification. Similar models are also widely used to represent the treatment-confounder relationship in a flexible model for the propensity score and, despite some drawbacks, such models can be effective. The methods described in this paper are relevant to any form of propensity score modelling.

Bayesian decision-theoretic inference
The Bayes estimate is a function of the observed data that minimizes the Bayes risk, or the posterior expected loss, for some loss function ℓ(t, θ):

θ̂(o 1:n ) = arg min_t ∫ Θ ℓ(t, θ) π n (θ) dθ.

If the loss function can be written

ℓ(t, θ) = ∫ X u(s, t) f O (s; θ) ds

for some function u(s, t) : X × Θ −→ R + , then the estimation problem can be rewritten

θ̂(o 1:n ) = arg min_t ∫ X u(s, t) p n (s) ds, (19)

where p n (s) = ∫ Θ f O (s; θ) π n (θ) dθ is the posterior predictive distribution implied by the Bayesian specification. For example, if, for t ∈ Θ, u(s, t) = − log f O (s; t) (see Bernardo (1979)), we have that

θ̂(o 1:n ) = arg max_t ∫ X log f O (s; t) p n (s) ds. (20)

For example, in the Normal model with f O (s; t) ≡ Normal(t, 1), the calculation becomes

arg min_t { − ∫ X log φ(s − t) p n (s) ds },

where φ(.) is the standard Normal pdf; that is, the estimate is the posterior mean. Equation (20) indicates that Bayesian parameter estimation can be formulated as a prediction problem if an appropriate loss function is defined. Equation (19) depends on an integral over a single variable s that can be taken to be a single 'future' variate drawn from f O (. ; θ), but the formulation extends to m independent 'future' variates, and can be expressed via the m-fold posterior predictive.
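The Normal example can be checked numerically: with the log-density utility, minimizing a Monte Carlo estimate of the expected loss over posterior predictive draws recovers the posterior mean. The conjugate prior and all numerical values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Conjugate Normal model (illustrative values): o_i ~ N(theta, 1) with prior
# theta ~ N(0, 100).
o = rng.normal(1.7, 1.0, size=200)
n = len(o)
post_var = 1.0 / (1.0 / 100.0 + n)
post_mean = post_var * o.sum()

# Draw from the posterior predictive p_n: theta from the posterior, then
# s ~ N(theta, 1).
theta = rng.normal(post_mean, np.sqrt(post_var), size=5000)
s = rng.normal(theta, 1.0)

# With u(s, t) = -log f_O(s; t), the expected loss is (up to a constant)
# the mean of (s - t)^2 / 2; minimize its Monte Carlo estimate over a grid.
grid = np.linspace(0.0, 3.0, 301)
exp_loss = ((s[:, None] - grid[None, :]) ** 2 / 2).mean(axis=0)
t_hat = grid[np.argmin(exp_loss)]    # close to the posterior mean
```

The grid minimizer coincides (up to Monte Carlo and grid error) with the mean of the predictive draws, which in this conjugate model is the posterior mean of θ.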

The Gibbs posterior
The standard Bayesian posterior distribution can be justified (Zhang, 2006; Jiang and Tanner, 2008; Bissiri et al., 2016) as the solution to the variational expected loss minimization problem

min_{μ ∈ M π0} { ∫ Θ ℓ(o 1:n , θ) μ(dθ) + K(μ, π 0 ) }, (22)

where M π0 is the space of probability measures that are absolutely continuous with respect to the prior (measure) π 0 , K(μ, π 0 ) is the Kullback-Leibler divergence between measure μ and π 0 , and ℓ(o 1:n , θ) is a loss function measuring the value of o 1:n for learning about θ (see Bernardo (1979)). It follows that the minimizing measure satisfies

μ*(dθ) ∝ exp{−ℓ(o 1:n , θ)} π 0 (dθ),

which yields the conventional posterior π n (θ), by properties of the Kullback-Leibler divergence, when ℓ(o 1:n , θ) is minus the log-likelihood. More generally, if the loss function is not specified as minus a log density, the solution to the loss minimization problem has been termed the Gibbs posterior. For the log-density specification for ℓ(., .), this method is equivalent to the de Finetti formulation, but more general specifications are also possible. Equation (22) thus provides an alternative but also fully Bayesian decision-theoretic solution.
For bivariate data, the same logic can be applied. Suppose that o i = (x i , y i ), and the model is specified by two parameters (θ, φ), with a loss function that is additive in the two components,

ℓ(o 1:n , (θ, φ)) = ℓ 1 (o 1:n , θ) + ℓ 2 (o 1:n , φ). (23)

The variational formulation applied with (23) leads to the joint posterior π n (θ, φ), and under an independent prior specification, the two parameters are a posteriori independent.

Bayesian inference under mis-specification
Broadly, mis-specification of a Bayesian model arises either if the 'likelihood' model (the conditional density of the observables given the parameters) does not match f O , or if the true value θ 0 does not lie in the support of the prior. In such cases, there is no guarantee of reliable statistical behaviour. However, certain mis-specified models can have utility; for example, the model in (6) is not the data generating model, and yet can provide consistent frequentist inference provided the propensity score model is correctly specified. In this section we examine some aspects of mis-specification.
Suppose initially we retain the data generating likelihood model f O (. ; θ 0 ), but consider the implications for inference in a second model with density f having support X , parameterized by ϑ ∈ Θ′. That is, while assuming the data are generated by f O , we wish to perform inference for ϑ, acknowledging that f is mis-specified. Conventional Bayesian inference for ϑ can be performed using a likelihood based on f , but it is difficult to justify the resulting posterior as the focus of inference since the model is mis-specified; see, for example, Walker (2013) and its discussion. The Bayesian decision-theoretic framework can be deployed, however. Define the loss function

ℓ(t′, θ) = ∫ X u θ (s, t′) f O (s; θ) ds,

where u θ (s, t′) = log (f O (s; θ)/f (s; t′)), which extends the calculation in (19) to allow the function u(., .) to depend on θ; note that the resulting optimization over t′ may still not depend on θ. By arguments equivalent to those leading to (21), we have that

ϑ̂ = arg max_{t′} ∫ X log f (s; t′) p n (s) ds. (25)

To compute the posterior distribution for ϑ, we may use a simulation-based strategy; if a single sampled variate θ (l) is generated from π n (θ), then we may convert this into a sampled variate ϑ (l) from the posterior for ϑ by performing the transformation

ϑ (l) = arg max_{t′} ∫ X log f (s; t′) f O (s; θ (l) ) ds, (26)

and then replicate this for l = 1, . . ., L. In each of the expressions, the integral with respect to s may not be analytically tractable, but can be approximated using Monte Carlo by sampling s k , k = 1, . . ., N from f O (. ; θ), and computing

ϑ (l) ≈ arg max_{t′} (1/N) Σ_{k=1}^N log f (s k ; t′).

Standard Bayesian theory is used to compute the posterior for θ, and the posterior for ϑ is computed (using the relevant integral forms) via deterministic transformation.
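A sketch of the transformation in a toy case where the argmax is available in closed form: we take f_O to be a Gamma model and the approximating model f to be exponential, so that the maximizer of the Monte Carlo average of log f(s_k; t′) is the sample mean, and the KL-minimizing pseudo-true value is 3θ. The posterior draws for θ are stand-in values, since no data model is fitted here; all numerical choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy setting: data generating model f_O(s; theta) = Gamma(shape=3, scale=theta);
# approximating model f(s; t') = exponential with mean t'. The KL-minimizing
# pseudo-true value is t' = 3 * theta (mean matching).

# Stand-in posterior draws theta^(l) from pi_n(theta).
theta_draws = rng.normal(2.0, 0.05, size=200)

N = 50000
vartheta_draws = np.empty_like(theta_draws)
for l, th in enumerate(theta_draws):
    s = rng.gamma(3.0, th, size=N)      # s_k ~ f_O(.; theta^(l))
    vartheta_draws[l] = s.mean()        # argmax_t' of (1/N) sum_k log f(s_k; t')
```

Each posterior draw for θ is converted deterministically (up to Monte Carlo error) into a draw for ϑ, concentrated around 3θ as predicted.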
The Kullback-Leibler loss can be modified to reflect quantitative statements about ϑ in the approximating model. For example, we may replace the utility by

u θ (s, t′) + u 0 (t′), (27)

for some non-negative function u 0 (.) with domain Θ′ that does not depend on θ or s. This additional term essentially functions as (minus) a log prior distribution on ϑ, although as we explicitly acknowledge that the model f is mis-specified, and ϑ has no real-world interpretation, this interpretation may be problematic for some Bayesians. In any case, the maximizations leading to the estimate ϑ̂ in (25) and sampled variate ϑ (l) in (26) can be modified accordingly.

Conscious mis-specification and modularization
The formulation of inference under mis-specification is inspired by the reasoning that inference concerning an approximating model may be of interest in its own right (for example, simplicity of interpretation). In addition, note that the calculation in (20) does not require explicit computation of the posterior π n (θ), so in principle a representation of, approximation to, or samples drawn directly from p n (s) can be used to compute the estimate or posterior sample for ϑ. Such a strategy would be useful if complex models such as flexible Bayesian models or artificial neural networks were used to construct prediction techniques. In the causal inference setting, the parameters of interest are not defined in the actual data generating model, but rather are quantities defined with respect to some hypothetical data generating process where confounding is not present. It is possible to construct examples where even a correctly specified regression model, say, cannot yield consistent estimators of the causal effect of interest, although these examples typically need to have more complex structural forms than those in (5), involving multiple treatments. We return to these examples in section 8.
Such 'conscious' mis-specification has direct relevance in the causal setting, but it has also been argued that similar calculations, where the data generating model does not correspond to the inference model, may be relevant in Bayesian calculations more generally. Bayarri et al. (2009) argue for a form of Bayesian inference based on 'modularization' of the model, where a form of stagewise analysis in complex models is used. Motivated by formulations based on Bayesian mis-specification, Jacob et al. (2017) provide extensive evidence that such modularized inference can be advantageous in Bayesian settings, including a study of the empirical properties of propensity score regression estimators using the methods from section 3.1.

Connection to estimating equations
If the utility function in (20) or u θ (s, t′) is differentiable with respect to its second argument, with derivative u̇ θ (s, t′), the optimization problem can be re-stated as a root-finding problem in which we must solve

∫ X u̇ θ (s, ϑ) p n (s) ds = 0

for ϑ to obtain the estimate or sampled variate, as in the calculation described in section 4.2. In the Monte Carlo version, we sample s k , k = 1, . . ., N from the posterior predictive p n (.), and solve the (Bayesian) estimating equation

Σ_{k=1}^N u̇ θ (s k , ϑ) = 0. (29)

Note that if the utility function is specified as minus a log density, then a scale or dispersion parameter, λ say, may be present, but may be irrelevant to the estimation of ϑ in (29). If λ is estimated as a nuisance parameter, the utility optimization can typically be carried out for the parameters of interest and nuisance parameter separately in two sub-problems. In this case, the parameters may still exhibit posterior dependence due to their common dependence on the posterior predictive distribution or sampled values. If the estimation of λ is to be included, the utility function must be chosen with some care; we might require that the (joint) utility is not monotonic in λ.
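The estimating-equation version can be sketched in the same toy setting used earlier: for an exponential approximating model the derivative of the utility gives a scalar root-finding problem whose solution is the mean of the posterior predictive draws. The predictive draws, the bisection solver and all numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in posterior predictive draws s_k (illustrative Gamma predictive).
s = rng.gamma(3.0, 2.0, size=20000)
N = len(s)

# For the exponential approximating model, u(s, t) = log t + s / t, so its
# derivative is udot(s, t) = 1 / t - s / t**2; sum over the predictive draws.
def udot_sum(t):
    return N / t - s.sum() / t ** 2

# Solve sum_k udot(s_k, t) = 0 by bisection; the root is the sample mean.
lo, hi = 1e-3, 100.0
for _ in range(80):
    mid = 0.5 * (lo + hi)
    if udot_sum(mid) < 0.0:    # root lies above mid
        lo = mid
    else:
        hi = mid
t_root = 0.5 * (lo + hi)
```

The root coincides with the minimizer obtained from the utility directly, illustrating the equivalence of the optimization and root-finding formulations.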
Note: It is tempting for a Bayesian analysis to mimic the frequentist approach to estimating equations, adopting a general form of (29) and performing root-finding to produce the estimate. Again this approach needs careful implementation. Consider for example the loss-based approach to defining a standard posterior as in (22); for a specified loss ℓ(., .), the Gibbs posterior is automatically defined as being proportional to exp{−ℓ(o 1:n , θ)}π 0 (θ). For example, if θ is scalar and ℓ(o, θ) = |o − θ|, the procedure is immediately equivalent to using a double exponential likelihood model with known scale parameter. This equivalence illustrates the potential for loss-based derivation of the posterior to be quite restrictive. This cautionary note is also relevant to Bayesian estimation for mis-specified models described in this section: u θ (s, ϑ) must be a true, well-calibrated expression of the utility of specifying the approximating parameter as ϑ for generic datum s when the data generating model is f O (. ; θ).

The Dirichlet process model
In order to weaken the parametric assumption concerning f O , we allow θ to become an infinite-dimensional parameter describing the distribution of O. Suppose that F O (.) parameterizes the unknown distribution function of the data, with true value F 0 , such that in reality O 1 , . . ., O n ∼ F 0 (.) are independent; this interpretation is consistent with the de Finetti formulation, with F 0 (.) interpreted as the limiting empirical cdf derived from the exchangeable sequence. The Dirichlet process model DP(α, G) is a probability measure on the set of distribution functions with countable support, with probabilities ω j at locations x j ∈ X , j = 1, 2, . . .; the DP(α, G) model induces randomness by drawing the ω j s via a probabilistic algorithm that depends on α (commonly the so-called 'stick-breaking' algorithm is used) and the x j independently from G. In the most common form of Bayesian non-parametric analysis, the Dirichlet process acts as a prior for the parameter F O ; hyperparameter α > 0 acts as a concentration parameter, and G(.) is a prior (base) distribution with domain X .
In light of data o 1:n , the resulting posterior distribution is also a Dirichlet process, DP(α n , G n ), where α n = α + n and

G n (.) = w n G(.) + (1 − w n ) F̂ n (.),

with w n = α/(α + n) and F̂ n (.) the empirical measure derived from o 1:n .
It is straightforward to generate samples from DP(α n , G n ) (that is, randomly generated distributions that represent sampled versions of the 'parameter' F O ), and also from the implied model for the observable quantities in light of the data (that is, a randomly generated posterior predictive distribution). Furthermore, the Dirichlet process posterior becomes concentrated at the data generating model F 0 in the limit as n −→ ∞ (Ghosal and van der Vaart, 2017, section 4.7), and provides a consistent estimation procedure.
With this relaxation of the parametric assumption about the data generating model, the calculations from section 4.2 can be reproduced. The Bayes estimate again results from a minimum loss calculation based on the posterior predictive distribution. When the posterior distribution is the DP(α_n, G_n) distribution, we have, for example, replicates ϑ^(l), l = 1, . . . , L, sampled from the posterior for ϑ, where {ω_j^(l), j = 1, 2, . . .} are a sample of probabilities drawn by, say, stick-breaking with parameter α_n, and {s_j^(l), j = 1, 2, . . .} are drawn independently from G_n. In practice, the infinite sum is truncated at machine accuracy, as the ω_j values decrease in expectation as j increases. The {ω_j} may also be drawn so that they are decreasing in magnitude, rendering the truncation straightforward to implement.
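As a concrete sketch (ours; the data, α, and base measure G are hypothetical choices), the following draws one random distribution from DP(α_n, G_n) by truncated stick-breaking, with atoms sampled from the posterior base measure G_n, which mixes the prior base G and the empirical distribution F_n.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dp_posterior(data, alpha, base_sampler, n_atoms=5000):
    """Draw one distribution F ~ DP(alpha_n, G_n), truncated at n_atoms atoms."""
    n = len(data)
    alpha_n = alpha + n
    # Stick-breaking weights: omega_j = beta_j * prod_{k<j} (1 - beta_k)
    betas = rng.beta(1.0, alpha_n, size=n_atoms)
    weights = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights /= weights.sum()  # renormalize the truncated weights
    # Atoms from G_n: the prior base G with prob alpha/(alpha+n), else a data point
    from_prior = rng.random(n_atoms) < alpha / alpha_n
    atoms = np.where(from_prior, base_sampler(n_atoms), rng.choice(data, size=n_atoms))
    return atoms, weights

# Hypothetical data and base measure, for illustration only
data = rng.normal(2.0, 1.0, size=200)
atoms, w = sample_dp_posterior(data, alpha=5.0,
                               base_sampler=lambda m: rng.normal(0.0, 3.0, m))
theta_draw = np.sum(w * atoms)  # e.g. the mean functional of the sampled F
```

Repeating this for l = 1, . . . , L yields the replicates ϑ^(l) described above; any functional of the sampled distribution can stand in for the mean used here.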
In Rubin (1981), the Bayesian bootstrap is proposed as a heuristic strategy, but its theoretical properties have since been widely studied; see, for example, Lo (1987), Cheng and Huang (2010), and Ghosal and van der Vaart (2017).
The argument confirming that this strategy in fact produces approximate Bayesian posterior statements was formalized by Newton and Raftery (1994). The Newton & Raftery algorithm is central to the procedures used in the Bayesian causal settings in Saarela et al. (2015) and Saarela et al. (2016): in those papers, the utility argument is made explicit, and the log-density utility is justified by considering a hypothetical experimental data generating mechanism that is explicitly mis-specified (compared to the observational data generating model). See also Chamberlain and Imbens (2003) and Graham et al. (2016) for examples, and Lyddon et al. (2019) for some generalizations.
The Bayesian bootstrap arises as the consequence of a Dirichlet process specification for the probability model that generated data o_{1:n}, in the limiting case α −→ 0. Sampling from the posterior predictive coincides with the Bayesian bootstrap; if ω = (ω_1, . . . , ω_n) ∼ Dirichlet(1, 1, . . . , 1), (31) yields the estimation procedure with ϑ now being a random quantity, as ω is random. The summation in this expression is a deterministic function of ω for every fixed t′; therefore the corresponding ϑ is also a deterministic function of ω. Hence, once we have sampled the weights in the Dirichlet process formulation, a transformation yields ϑ, and thus ϑ is simply a functional of the Dirichlet process posterior on F_O. Therefore the posterior sample formed by repeatedly sampling the Dirichlet weights to yield ω^(1), . . . , ω^(L), with subsequent transformations to yield ϑ^(1), . . . , ϑ^(L), is an exact sample from the posterior distribution for ϑ. A proper prior π_0(ϑ) can be incorporated by modifying the specified utility function as in (27).
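The following minimal sketch (ours; the simulated data are a stand-in for o_{1:n}) illustrates the mechanics: each posterior draw ϑ^(l) is the minimizer of a Dirichlet(1, . . . , 1)-weighted loss, which for squared-error loss is simply the weighted sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
o = rng.normal(3.0, 1.0, size=100)   # observed data o_{1:n}, simulated here
n, L = len(o), 2000

theta_draws = np.empty(L)
for l in range(L):
    # One draw of the Bayesian bootstrap weights (DP posterior with alpha -> 0)
    w = rng.dirichlet(np.ones(n))
    # argmin_theta sum_j w_j (o_j - theta)^2 is the weighted mean
    theta_draws[l] = np.sum(w * o)

# theta_draws is a posterior sample (exact up to Monte Carlo error)
# for the mean functional of F_O
```

Each ϑ^(l) is a deterministic transformation of its weight vector ω^(l), matching the argument in the text.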
Such inference is a fully Bayesian expression of posterior beliefs concerning the target of inference under the Bayesian non-parametric formulation. As for any MCMC-based analysis, inference is only exact up to Monte Carlo sampling error; that is, we can only compute the distribution of ϑ by sampling from the Dirichlet process, not analytically. The calculation based on the formulation equivalent to (28) involves solving the corresponding weighted equation, where (ω_j), j = 1, 2, . . . define a random draw from the Dirichlet process posterior.

Bayesian inference for the structured causal model
For the causal inference problem with observed data o_{1:n} = (x_{1:n}, y_{1:n}, z_{1:n}), in a parametric analysis we may compute the posterior distribution for θ = (η, γ, ζ) using a factorization of the full model as in (13). We can also define the approximating model to respect the entire factorization, or to target some component of interest. For example, a conditional model for Y given (X, Z) might be targeted, with u_θ(o, ϑ) = − log f(y|x, z; ϑ) for some conditional density f(.|x, z; ϑ). Then, by sampling the posterior for θ, or the posterior predictive distribution, the method of section 4 can be deployed.
For the illustrative model of section 2.3, let θ = (ξ, τ) and ϑ = (φ, τ) be the parameters in the data generating and approximating models respectively. In this parametric setting, assuming Normally distributed residual errors in both models, π_n(θ) is readily computable, and using the methods described in section 4 we can obtain a sample from the posterior distribution, and an estimate, for ϑ in the approximating model; the required quantities follow directly from the model (6). In this case the parameter of interest τ is identical in the two models, and the posterior computed for π_n(θ) yields correct inference under the presumed correct specification of the conditional model. The posterior for τ as a component of ϑ would still concentrate at the true value τ_0, but in finite samples the posterior variance would be larger than that computed from the correctly specified model that led to π_n(θ).
To relax the assumption of Normal residual errors in the data generating model, we may use the Bayesian bootstrap, and obtain a sampled variate from the posterior as (φ^(l), τ^(l)), the arg min of the corresponding weighted objective (34), for which the minimization can be achieved analytically, for l = 1, . . . , L.
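For a linear outcome model of this type, each weighted minimization is a weighted least-squares problem with a closed-form solution. The sketch below (ours; the design with intercept, propensity score, and treatment is a hypothetical stand-in for model (6), with the propensity score taken as known) makes this explicit.

```python
import numpy as np

rng = np.random.default_rng(2)
n, L = 300, 500

# Hypothetical data: the propensity score b(x) is treated as known here
x = rng.normal(size=n)
b = 1.0 / (1.0 + np.exp(-x))
z = rng.binomial(1, b)
y = 1.0 + 1.5 * b + 2.0 * z + rng.normal(scale=0.5, size=n)   # tau = 2

D = np.column_stack([np.ones(n), b, z])   # intercept, propensity score, treatment

tau_draws = np.empty(L)
for l in range(L):
    w = rng.dirichlet(np.ones(n))
    # Weighted least squares in closed form: (D'WD)^{-1} D'Wy
    WD = D * w[:, None]
    tau_draws[l] = np.linalg.solve(D.T @ WD, WD.T @ y)[2]
```

The L solved systems deliver the posterior sample (φ^(l), τ^(l)); here only the treatment coefficient is retained.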
In (34), the Bayesian bootstrap is being used to sample the Dirichlet process posterior for the entire unknown joint distribution of the observables, but in the approximating parametric model only the conditional distribution for Y given X and Z is studied; the joint distribution does, of course, imply a conditional distribution. This possibility of partial specification of the model of interest is an advantage of the formulation of section 4.
In addition, the utility may be modified to a sum of two log-density terms for proposed conditional densities f_1 and f_2. Estimation or posterior sampling of ϑ_1 and ϑ_2 using the parametric or non-parametric algorithms can proceed by the obvious extension, and with this separable loss function the two optimizations can be carried out separately. However, in the inference problem for (6) with the propensity score unknown, a modification of the loss function is required for optimal inference: suppose the second component is evaluated at ϑ_2^OPT, the loss-minimizing value of ϑ_2 obtained by considering the second term only. This utility reflects the estimation task in the causal problem based on (6); the outcome model based on f_1 is adjusted using the fitted propensity score computed using the best estimate of the data generating parameter in the model f_2.
Taking (6) or (7) as the approximating model, inference for ϑ_1 = (β, φ, τ) will be correct (specifically, consistent for, and with the posterior concentrated at, the true value τ_0) provided the propensity score model encapsulated in f_2 is itself correctly specified with ϑ_2 ≡ γ, so that the estimated propensity score based on the posterior mode of f_2(z|x; ϑ_2) consistently estimates the true propensity score.

Conventional Bayesian propensity score adjustment
Underlying our concept of a valid Bayesian approach is one which relies on the de Finetti representation for observable quantities in the data generating model as in section 3, with inference following a decision-theoretic argument as in section 4.2. It is common, however, to apply Bayesian logic to procedures such as those indicated in section 3.1. Such procedures can also be assessed as to whether they are fully Bayesian by reference to the decision-theoretic formulation of section 4.

Joint estimation
Estimation using the joint Bayesian model in (15) can be justified using either conventional Bayesian logic or the arguments in section 4 leading to the Gibbs posterior formulation and (23). However, the resulting posterior does not concentrate at the correct ATE due to 'feedback', which arises because the outcome depends on the parameters associated with the exposure model. A graphical model argument supports this: feedback is present because of a 'backdoor' path (Galles and Pearl, 1995) from γ to (β, φ, τ) via B in the graph describing the joint distribution of parameters and observables if the dependence of Y on the confounders is mis-specified; B is a 'collider' on this path, so conditioning on it opens the path. As a result, the propensity score estimated in this way will not have the balancing property, even as n increases.
A Bayesian analysis based on (36) may, of course, still be carried out, and in finite samples the performance of the resulting Bayesian inference summaries may be acceptable; for example, the resulting estimators may have low variance. However, as the sample size grows, it is clear from classical arguments that the Bayesian estimator of τ will be inconsistent.

Cutting feedback and two-step estimation
As noted in section 2.2, B should be constructed as b(X; γ_0), and if γ_0 is unknown, it should be estimated using the observed X and Z values only. The conventional Bayesian analysis should therefore be based on the posteriors (16)-(18), where γ_0 is the degenerate limiting value of π_n(γ) referred to in section 3. We first compute the posterior for γ from (16), then compute a Bayesian estimate γ̂ and fitted values b̂_i = b(x_i; γ̂) for i = 1, . . . , n. The posterior distribution is computed via (18), and we can marginalize to obtain π_n(τ). Because of the conditioning on a specific γ value, there is in fact no 'feedback'. The use of a plug-in estimate γ̂ may lead to imperfect adjustment for confounding in finite samples, but this is not due to feedback in the sense described above.
From the decision-theoretic perspective, a fully Bayesian justification via the variational formulation and (23) is obtained using a loss function in which the propensity score component is evaluated at γ^OPT, where γ^OPT is itself a loss-minimizing quantity, say the posterior mode or mean, derived using the variational solution from (22), which under an independent prior specification is the conventional posterior for γ. The formulation in (39) evidently leads to a form of 'modularized' inference, as advocated by Bayarri et al. (2009), Zigler (2016) and Jacob et al. (2017). In this setting, however, due to the requirement to use a 'best estimate' of γ in order to produce consistent estimation of τ, the modularization is a necessary step rather than a choice the Bayesian analyst may opt to make.
The cut-feedback approach is an attempt to account for the uncertainty in estimating γ. In this approach, L samples γ^(1), . . . , γ^(L) from the posterior distribution π_n(γ) are drawn, and each is used to compute a set of propensity score values, leading to L parallel analyses, each involving a single draw from (17), the posterior computed using the lth sampled γ value. However, recall that only if γ = γ_0 do we achieve the required balance. Thus, when the propensity score values b^(l)_{1:n} are computed using γ^(l), they can be interpreted as error-corrupted versions of the true balancing scores b_i = b(x_i; γ_0), and hence will not induce balance. This can be seen using a Taylor expansion of b(x_i; γ^(l)) around γ_0, where ḃ(x; γ) is the partial derivative of b(x; γ) taken with respect to γ. Hence, when b^(l)_i is used in the propensity score regression approach, we should regard it as an error-corrupted version of the balance-inducing (but unknown) value b_i, where the error has variance proportional to the (posterior) variance of the sampled values γ^(l). It is well known that the presence of such error in the regressors of a regression model typically leads to bias in the estimation of regression coefficients even if the functional form of the model is correct, although unlike the commonly cited setting that leads to attenuation, here the measurement errors u^(l)_i(x_i) depend on the observed x_i, and thus have different variances. A numerical example to illustrate the bias induced by the cut-feedback procedure is given in Appendix B.1.
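A stylized illustration of the error-corruption (our construction; an idealized Normal distribution stands in for π_n(γ), and a single scalar confounder is assumed): regressing the outcome on b(x; γ^(l)) rather than on b(x; γ_0) perturbs the fitted coefficient from draw to draw, so the L parallel analyses do not simply reproduce the correctly-adjusted fit.

```python
import numpy as np

rng = np.random.default_rng(3)
n, L = 500, 400

x = rng.normal(size=n)
gamma0 = 1.0
b0 = 1.0 / (1.0 + np.exp(-gamma0 * x))            # true balancing score b(x; gamma0)
y = 2.0 + 3.0 * b0 + rng.normal(scale=0.3, size=n)

sd_gamma = 0.3   # stand-in for the posterior sd of gamma
slope_draws = np.empty(L)
for l in range(L):
    gamma_l = gamma0 + sd_gamma * rng.normal()    # gamma^(l), idealized posterior draw
    b_l = 1.0 / (1.0 + np.exp(-gamma_l * x))      # error-corrupted regressor b^(l)
    D = np.column_stack([np.ones(n), b_l])
    slope_draws[l] = np.linalg.lstsq(D, y, rcond=None)[0][1]

# The draw-to-draw spread of the coefficient reflects the regressor error,
# whose magnitude depends on x through the derivative of b(x; gamma)
```

Note that the bias here need not be attenuation towards zero, consistent with the remark in the text that the errors depend on x.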

Frequentist assessment of the conventional Bayesian estimators
The presence of γ_0 in (38) in practice requires the use of a Bayesian estimate to facilitate computation. A natural estimator is the posterior mean or mode derived from (37), and plugging the corresponding estimate into (38) allows posterior inference to proceed. Even if no account is taken of the estimation of γ, the analysis of the parameters in the outcome model is still being performed in a standard Bayesian fashion. It is evident from (37) and (38) that γ and (β, φ, τ) are a posteriori independent. Therefore plugging in an estimate γ̂, a deterministic function of the (x, z) data, derived from π_n(γ) has no impact on inference for (β, φ, τ), provided the treatment model is correctly specified.
There are two things to note about this procedure. First, in finite samples, the posterior variance for τ is smaller when using an estimate of γ than when using the true value, were it known; this is analogous to results in the frequentist literature from Hirano et al. (2003) and Henmi and Eguchi (2004); see the results in Appendix Table B1 and related discussion. Secondly, if the plug-in approach is adopted, the resulting Bayesian inference exhibits relatively poor frequency properties: across replicate data sets of the same size, coverage properties of Bayesian credible intervals derived from π_n(β, φ, τ) with γ_0 estimated by γ̂ are below the nominal level. This phenomenon arises from the fact that the model for the data generating process is mis-specified, and therefore frequentist behaviour (across replicate data sets) is not adequate. If inference is made using the posterior distribution conditioned on the observed data, standard Bayesian inference methods under exchangeability and correct specification (that is, those that follow the de Finetti representation) will have the expected frequentist behaviour. However, if the presumed data generating process is mis-specified, then we have no such guarantees. This issue is overcome by the use of the Bayesian non-parametric model and the Bayesian bootstrap. See the simulation study in Appendix B.2.

Simulation studies
We compare the performance of the conventional Bayesian computational methods described in section 3.1 with that of the decision-theoretic and non-parametric methods from sections 4 and 5.

Conventional Bayesian methods
We fitted several parametric models under the assumption of Normal errors. In the cutting-feedback models, b̂_i = x_i γ^(l), with γ^(l) the sampled value of γ in a Gibbs sampler procedure; in the two-step models, b̂_i = x_i γ̂, where γ̂ is the Bayesian estimator of γ obtained from the fitted exposure model.
Table 1 contains the estimated bias and root mean square error (RMSE) of the posterior estimates (means), and the coverage of the 95% credible interval, for τ. The unadjusted and joint models perform poorly, as theory suggests.
Estimation based on cutting feedback yields a small amount of bias, which decreases as the sample size n increases.
The two-step approaches yield unbiased estimators. However, in all cases the coverage of the Bayesian credible intervals is inadequate when the outcome model is mis-specified, even though coverage at the nominal level can be obtained under a correct specification.

Estimation via the Bayesian bootstrap
The results demonstrate that model mis-specification disrupts parametric Bayesian inference. We repeated the analysis using the Bayesian bootstrap approach, restricting attention to the cutting-feedback and two-step estimation procedures. To implement the cutting-feedback procedure, recall that the Bayesian bootstrap produces a sample from the posterior for a target parameter. In our analysis, we assume correct specification for the treatment assignment model, and so for the posterior π_n(γ) we may use either the exact posterior computed under a Normal assumption or the Bayesian bootstrap. Having obtained a sample of size L from this posterior, we then use the Bayesian bootstrap to generate L posterior samples for τ, the lth conditioning on the fitted values b̂_i = x_i γ^(l).
For the two-step method, we may proceed in the same fashion, but instead use b̂_i = x_i γ̂, where γ̂ is the posterior mean derived from π_n(γ).
These methods follow the conventional approach of separating the posteriors for the two parts of the model. However, following the argument leading to (35), the correct Bayesian approach retains the linkage of the two models via the common Dirichlet weights noted in (32); that is, a single draw of weights ω is used in the optimization over γ and the consequent optimization over (β, φ, τ). This linkage reflects a Bayesian non-parametric assumption concerning the full joint distribution of the observables.
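A sketch of the linked procedure (ours; the logistic propensity model and linear outcome model are hypothetical stand-ins): a single Dirichlet draw ω weights both the propensity score optimization, here solved by Newton iterations, and the subsequent outcome optimization.

```python
import numpy as np

rng = np.random.default_rng(4)
n, L = 400, 300

x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
b_true = 1.0 / (1.0 + np.exp(-(0.2 + 0.8 * x)))
z = rng.binomial(1, b_true)
y = 1.0 + 1.5 * b_true + 2.0 * z + rng.normal(scale=0.5, size=n)   # tau = 2

def weighted_logistic(X, z, w, iters=15):
    """Maximize the w-weighted logistic log-likelihood by Newton's method."""
    g = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ g)))
        H = X.T @ (X * (w * p * (1.0 - p))[:, None])
        g = g + np.linalg.solve(H, X.T @ (w * (z - p)))
    return g

tau_draws = np.empty(L)
for l in range(L):
    w = rng.dirichlet(np.ones(n))             # ONE draw, shared by both steps
    g = weighted_logistic(X, z, w)            # weighted propensity score fit
    b_hat = 1.0 / (1.0 + np.exp(-(X @ g)))    # fitted propensity scores
    D = np.column_stack([np.ones(n), b_hat, z])
    WD = D * w[:, None]
    tau_draws[l] = np.linalg.solve(D.T @ WD, WD.T @ y)[2]
```

An Unlinked version would draw a second, independent weight vector for the outcome step; the Linked version above is the one that reflects a single non-parametric posterior draw for the joint distribution.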
For the treatment assignment model, we carry out the analysis using the True propensity score, and then compute π_n(γ) using a Parametric (logistic regression) analysis, using the Bayesian bootstrap in an Unlinked fashion (via independent Dirichlet weights in the two components of expression (35)), and in a Linked fashion using a single Dirichlet draw. For the outcome model, we use a least-squares optimization for the Bayesian bootstrap sampling of (β, φ, τ). The analyses were conducted on 1000 replicate data sets, using 1000 Bayesian bootstrap draws for each replicate. For each data replicate, we compute the RMSE of the Bayesian posterior estimates; coverage rates were computed by constructing, for each replicate data set, posterior sample quantiles. The results are reported in Table 2. All of the methods were unbiased in large samples, although the CF method showed a small bias, as discussed in section 6.2, when n was small, and also larger variability. In terms of RMSE, the two-step methods generally performed best. Coverage at the nominal level was recovered for the two-step method in a Linked analysis, as suggested by the theory studied in section 5.
The results in Table 3 suggest that the distribution of the propensity score affects the performance of all methods.

Estimation via the Bayesian bootstrap
For the Bayesian bootstrap procedure, we use the same simulation design and L = 2000 Bayesian bootstrap draws for each data set. Table 4 shows the RMSE and coverage rates. The results largely agree with those observed in Example 1, where the Bayesian bootstrap is also used. Overall, the CF and 2S approaches show similar values of RMSE and coverage, and for coverage in particular the general performance of the bootstrap methods appears to improve on the results for the conventional analyses from Table 3.

Example 3: Comparison with Bayesian Causal Forests
In this section, we compare results from the Bayesian approaches described in this paper with results obtained from the Bayesian Causal Forests (BCF) method of Hahn et al. (2020). The BCF approach is an example of flexible modelling based on Bayesian additive regression trees, fitted using MCMC, to infer potentially heterogeneous treatment effects. The BCF model for binary treatment is based on the linear predictor in (41), with assumed homoscedastic Normal errors, and with the functions μ(., .) and τ(., .) estimated via flexible Bayesian modelling. The propensity score b(x) in (41) is typically estimated as part of a separate Bayesian model. The approach is implemented efficiently in the R package 'bcf'. The BCF method allows for more flexibility than models such as (6) or (7) that are typically used; recall that standard implementations require correct specification of the treatment effect model τ(x_i, b(x_i)). We would therefore anticipate better performance of the standard implementations if the correct specification assumption holds. Nevertheless, a comparison is potentially enlightening.
With predictors X_1, X_2 ∼ Normal(1, 1) and X_3, X_4 ∼ Normal(−1, 1), we simulate a binary treatment Z ∼ Bernoulli(p) with logit(p) = 0.45x_1 + 0.9x_2 + 1.35x_3 + 1.8x_4, together with a treatment-free outcome model. For the treatment effect model, we assume that in the data generating model version of (41) we have τ(x) = ψ_0 + ψ_1 x_1 with (ψ_0, ψ_1) = (1, 2), which yields an average treatment effect of ψ_0 + ψ_1 E[X_1] = 3. We compare the BCF approach with the two-step approach with a correctly specified treatment-effect model, with the mean model fitted using the Linked Bayesian bootstrap. For the fitted BCF model we assume the more general structure Y = μ(x, b(x)) + τ(x_1, b(x))z + ε, where ε ∼ Normal(0, σ²). In both analyses, the propensity score model is estimated under correct specification. The bcf package outputs individual-level posterior contrasts, which can be converted into population average quantities via the sample average, computed for each posterior sample.
The results of this analysis are presented in Table 5. The BCF method displays a small amount of bias for smaller sample sizes, but typically has a smaller variance, and therefore ultimately a lower RMSE. The two-step method using the Linked Bayesian bootstrap gives coverage at the target level, but the coverage of the BCF method is below the target level. Again, it should be stressed that the comparison is not entirely fair, as the BCF method does not assume a known functional form for the treatment effect model, and therefore is robust to mis-specification of the treatment effect model. It is surprising that the variance of the BCF estimator is lower than that derived from the two-step method, but this phenomenon appears to persist in other settings (see Appendix section C). We note, however, that the BCF approach, or any flexible outcome regression model, can also be included within a Bayesian bootstrap, and that because of the properties of the non-parametric procedure, good frequency properties can be recovered. On average, the BCF method required three times the computational expenditure of a non-parallelized version of the Bayesian bootstrap approach.
Beyond regression adjustment in the Normal model
In this section, we identify a number of extensions to the causal adjustment approach based on regression, including situations where flexible modelling of the expected outcome conditional on treatment and confounders cannot recover the causal effect.

Inverse probability weighting
Inverse probability weighting (IPW) is an alternative procedure for making causal adjustment based on the propensity score. Inverse weighting breaks the confounding by converting the original sample into a pseudo-sample in which confounder imbalance is removed. The loss/utility specification for this adjustment method takes the form u_θ(o, ϑ) = −w(x, z; ϑ_2^OPT) log f_1(y|x, z; ϑ_1) − log f_2(z|x; ϑ_2), where w(x, z; ϑ_2^OPT) = 1/f_2(z|x; ϑ_2^OPT) is a weight that depends on the proposed treatment model. Note that in the linear mean-model case with no treatment effect modification, IPW methods recover marginal parameters such as the ATE, which coincide with conditional parameters such as those that appear in (6), but this correspondence between marginal and conditional parameters does not hold in more general models.
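A minimal Bayesian bootstrap sketch of the weighting (ours; the true propensity score is used for clarity, and the ratio-normalized form of the weighted means is a common implementation choice):

```python
import numpy as np

rng = np.random.default_rng(5)
n, L = 500, 400

x = rng.normal(size=n)
b = 1.0 / (1.0 + np.exp(-0.8 * x))       # propensity score f_2(z=1|x), known here
z = rng.binomial(1, b)
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(scale=0.5, size=n)   # ATE = 3

ipw = np.where(z == 1, 1.0 / b, 1.0 / (1.0 - b))   # w(x, z) = 1/f_2(z|x)

ate_draws = np.empty(L)
for l in range(L):
    w = rng.dirichlet(np.ones(n))
    wt = w * ipw
    mu1 = np.sum(wt * z * y) / np.sum(wt * z)              # weighted treated mean
    mu0 = np.sum(wt * (1 - z) * y) / np.sum(wt * (1 - z))  # weighted untreated mean
    ate_draws[l] = mu1 - mu0
```

In a full analysis, the propensity score would itself be fitted within each bootstrap draw, as in the linked procedure of section 7.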

Doubly robust procedures for non-linear models
Doubly robust procedures provide correct inference even when one of the component models is mis-specified. In the linear case, (7) yields a doubly robust procedure provided the treatment effect model is correctly specified. The same conclusion follows for the inverse probability weighting method of section 8.1, even under a slight relaxation of assumptions concerning the treatment effect model: if the propensity model is correctly specified, the ATE can be correctly estimated even if the treatment effect model is mis-specified. Beyond the linear case, the situation is more complicated: in the log-linear equivalent to (6) or (7), parameters in the conditional model are not equivalent to marginal parameters, and ensuring double robustness is not as straightforward. If the mis-specified outcome model has, say, linear predictor Xβ + Zψ, the parameter ψ, and hence the ATE, cannot be recovered using a standard parametric analysis. The log-likelihood derived from this mis-specified outcome model, via its score function, will lead to inconsistent inference for ψ, and the posterior distribution will concentrate at the wrong location. Unlike in the linear case, this cannot be rectified by the inclusion of the fitted propensity score in the mean model. The solution to this problem, first proposed by Robins et al. (1992), is to modify the likelihood-based score equation to incorporate the fitted propensity score b(x_i; γ̂); the result can be shown to be a doubly robust estimating equation.
It is important to note that no likelihood model corresponds to the estimating equation in (42), and consequently no conventional Bayesian analysis can be carried out in a doubly robust fashion. However, the methods outlined in section 5, based on the Dirichlet process/Bayesian bootstrap, can be implemented using the connection to estimating equations described in section 4.4, with the derivative of the utility/loss function chosen to match the form in (42), and with computation of the posterior samples following (33). IPW methods following the ideas in section 8.1 can also be used to estimate the ATE.

Average treatment effect on the treated
In the binary treatment case, it is sometimes required to estimate the average treatment effect on the treated (ATT), that is, the causal effect of treatment in the subgroup of individuals in the sampled population who actually received treatment. Using counterfactual notation, the ATT is defined as the difference E[Y(1) − Y(0)|Z = 1]. Using conventional random variable notation, this quantity is less straightforward to define, which would be problematic for conventional Bayesian analysis. However, we may posit a new binary random variable V that is assigned independently of X given Z; V can be considered a re-randomization indicator used to define two hypothetical subgroups of the treated group. We can write the ATT in terms of expectations conditional on V and use this to define an estimator based on a weighting procedure. Crucially, the variable V does not need to be observed for inference, and we can estimate the ATT from the observed data: in the simplest formulation, E[Y(1)|Z = 1] is estimated directly from the treated individuals, while E[Y(0)|Z = 1] is estimated from the untreated individuals reweighted by the case weight w(X) = b(X)/(1 − b(X)), with extension to a doubly robust estimator straightforward by augmentation. The Bayesian bootstrap procedure from section 5 can be used to compute a fully Bayesian posterior distribution for the ATT by using a utility function incorporating this weighting function (see Moodie et al. (2018)).
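A minimal sketch of the ATT weighting (ours; a constant treatment effect and a known propensity score are simulation conveniences): untreated individuals receive the odds weight w(X) = b(X)/(1 − b(X)), and treated individuals receive weight 1.

```python
import numpy as np

rng = np.random.default_rng(6)
n, L = 500, 400

x = rng.normal(size=n)
b = 1.0 / (1.0 + np.exp(-x))             # propensity score, known here
z = rng.binomial(1, b)
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(scale=0.5, size=n)   # constant effect: ATT = 3

att_w = np.where(z == 1, 1.0, b / (1.0 - b))   # odds weights for the untreated

att_draws = np.empty(L)
for l in range(L):
    w = rng.dirichlet(np.ones(n))
    mu1 = np.sum(w * z * y) / np.sum(w * z)
    mu0 = np.sum(w * (1 - z) * att_w * y) / np.sum(w * (1 - z) * att_w)
    att_draws[l] = mu1 - mu0
```

The odds reweighting shifts the untreated sample so that its confounder distribution matches that of the treated, which is what makes mu0 estimate E[Y(0)|Z = 1].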
As for other weighting settings, it is not straightforward to estimate the ATT by simply modelling the dependence of Y on X and Z in an outcome regression model without relying on an assumption of correct specification.

Multiple treatments and the marginal structural model
If causal inference is required for multiple treatments, then there are data generating mechanisms for which the causal effect cannot be inferred by modelling the outcome as a function of the treatments and confounders, no matter how complex this model is. For a simple illustration, consider two binary treatments (Z_1, Z_2) generated by a structural data generating model with X_1 ∼ Normal(1, 1) and Z_1 ∼ Bernoulli(expit(−2 + X_1)) at the first stage, and with (X_2, Z_2) generated at the second stage conditionally on (X_1, Z_1). A correctly specified outcome model, however, consistently estimates the coefficients of (X_1, Z_1, X_2, Z_2) as (1, 1, 1, 1) via ordinary least squares (or any standard Bayesian method), and therefore the counterfactual outcomes are inconsistently estimated if the standard plug-in approach is used. The issue arises due to the confounding present in the data generating model, but also due to mediation of the effect of Z_1 through X_2.
The inverse weighting approach provides a solution to this problem: with a utility weighted by the product of the inverse stagewise treatment probabilities, if the stagewise treatment models f_21(z_1|x_1; ϑ_2) and f_22(z_2|x_1, z_1, x_2; ϑ_2) are correctly specified, then the counterfactual quantities, and the associated average treatment effects, can be correctly inferred using the method of section 5; see Saarela et al. (2015).

Summary
When causal inference is the aim of a statistical analysis, control of confounding is an essential consideration. If an outcome model can be correctly specified or flexibly approximated, causal inferences may follow with or without the use of propensity score methods. However, when it is not possible to correctly capture the outcome process, propensity score methods can be very valuable, particularly when the treatment allocation process is easier to characterize. A joint modelling approach to the estimation of the propensity score and outcome model parameters can result in feedback from the outcome into the propensity score, which prevents the estimated propensity score from providing balance, thus resulting in biased estimators of the treatment effect. Techniques aimed at cutting feedback have been suggested; we recap the reasoning as to why a Bayesian two-step approach, rather than one that cuts feedback, is the correct approach to pursue, even if, in large samples, a cut-feedback approach can provide adequate results. We demonstrated that the standard Bayesian two-step estimator has poor frequentist performance, but showed that this can be rectified by using the Bayesian bootstrap with linkage between the two component models, yielding a fully Bayesian procedure with good frequentist properties.
Our argument is based on the realization that the causal analysis is carried out under conscious mis-specification of the Bayesian model, and we develop a framework that brings the literature on Bayesian analysis under mis-specification (Walker, 2013) to bear on the causal problem. The causal setting gives a concrete example where inference under a mis-specified model, that is, where the target of inference is not a parameter in the data generating model, is actually the objective. Methods that posit the capability of recovering the correct components of the outcome model using flexible modelling without reference to the propensity score also provide valid routes to inference about this target, but these methods often carry a heavier computational burden. There are also links to modularized Bayesian inference (Bayarri et al., 2009; Jacob et al., 2017), which also depends on a 'conscious mis-specification' formulation, and in the causal setting (the main examples and the examples in section 8) existing frequentist semiparametric theory can give insight into the operating characteristics of such Bayesian analyses; see Pompe and Jacob (2021) for initial explorations in this direction.
The Bayesian bootstrap described in section 5 relies on the limiting Dirichlet process specification with α −→ 0, although equation (30) indicates that a more general model with α > 0 can be deployed. In the inference methodology described in section 4.2, the requirement is simply to be able to sample independently from the posterior predictive distribution, where that distribution is consistent for the data generating process; this can be achieved by statistical procedures beyond those based on the Dirichlet process.
In this paper, we have not discussed propensity score matching methods in detail. Such methods have been deployed successfully (Liao and Zigler, 2020) by using the propensity score to create a matched sample of treated and untreated individuals. The principles outlined in this paper suggest that matching on an estimated propensity score, rather than averaging over the posterior distribution of the propensity score parameters, would provide superior inference, although this would arguably depend on the matching criterion used. This is an interesting direction for future research.
In the two-step analysis, we assign Normal(0, 10²) priors for the elements of γ, and non-informative priors for the parameters in the outcome model. We compare the results for the two-step method and the Bayesian Causal Forests (BCF) method for sample sizes n = 200, 500, 1000 and 2000.
Consider a data generating model based on a Poisson assumption, so that conditionally the outcome Y is Poisson distributed with log E[Y|X, Z] = X_0ξ + Zψ. Here ψ captures the effect of treatment in the conditional model, but is not itself the ATE. The ATE can be measured on the additive scale, as in the linear case, E[Y(1)] − E[Y(0)] = E[exp{X_0ξ + ψ}] − E[exp{X_0ξ}], or on the multiplicative scale, for example E[Y(1)]/E[Y(0)] = E[exp{X_0ξ + ψ}]/E[exp{X_0ξ}] = exp{ψ}. However, if the fitted Poisson regression model is mis-specified in the treatment-free component, ψ, and hence the ATE, cannot be consistently estimated by standard methods.

Figure C2: Simulated example 2: Scenarios considered for the exposure model in the true data generating mechanism with binary exposure.

Table 1: Simulated Example 1: Summary of the conventional Bayesian estimates of τ under a normal exposure. The rows correspond to the mean bias of the point estimates, the RMSE, and the coverage rates of the posterior 95% credible intervals of τ. Results over 1000 replicate data sets.

Table 2: Simulated Example 1: Summary of the estimates of τ under a normal exposure using the Bayesian bootstrap in the outcome model, and different approaches to the posterior for the propensity score model parameters: True indicates that the true value of γ is used; Parametric indicates that a parametric Normal model is used; Unlinked indicates that the posteriors for γ and (β, φ, τ) were computed using separate Bayesian bootstrap computations and different Dirichlet weights (Unlinked Bayesian bootstrap, UBB); Linked (LBB) indicates that common Dirichlet weights were used in the two model components. Rows correspond to the RMSE and the coverage rates of the posterior 95% credible intervals. Results over 1000 replicate data sets.

Table 3: Simulated example 2: RMSE of the estimates of τ, and coverage of the 95% credible interval, for a binary exposure model. Rows correspond to the averages of the posterior means and variances, and the coverage rates of τ, for the three settings of the propensity score model given in section 7.2. Results from 1000 replicate data sets.

Table 5: Simulated example 3: Comparison of results for the two-step method fitted using the Linked Bayesian bootstrap, and the BCF method. Summary of 2000 Bayesian estimates and credible intervals. Rows correspond to the bias, RMSE, and coverage rates.

Table C3: Additional simulated example: Comparison of results for the two-step method fitted using the Linked Bayesian bootstrap, and the BCF method, for different propensity score settings. Summary of 2000 Bayesian estimates and credible intervals. Rows correspond to the bias, RMSE, and coverage rates for Scenario 1, Scenario 2 and Scenario 3.

Table C4: Additional simulated example (no treatment effect): Comparison of results for the two-step method fitted using the Linked Bayesian bootstrap, and the BCF method, for different propensity score settings. Summary of 2000 Bayesian estimates and credible intervals. Rows correspond to the bias, RMSE, and coverage rates for Scenario 1, Scenario 2 and Scenario 3.