Reproducible Parameter Inference Using Bagged Posteriors

Under model misspecification, it is known that Bayesian posteriors often do not properly quantify uncertainty about true or pseudo-true parameters. Even more fundamentally, misspecification leads to a lack of reproducibility in the sense that the same model will yield contradictory posteriors on independent data sets from the true distribution. To define a criterion for reproducible uncertainty quantification under misspecification, we consider the probability that two confidence sets constructed from independent data sets have nonempty overlap, and we establish a lower bound on this overlap probability that holds for any valid confidence sets. We prove that credible sets from the standard posterior can strongly violate this bound, particularly in high-dimensional settings (i.e., with dimension increasing with sample size), indicating that it is not internally coherent under misspecification. To improve reproducibility in an easy-to-use and widely applicable way, we propose to apply bagging to the Bayesian posterior ("BayesBag"); that is, to use the average of posterior distributions conditioned on bootstrapped datasets. We motivate BayesBag from first principles based on Jeffrey conditionalization and show that the bagged posterior typically satisfies the overlap lower bound. Further, we prove a Bernstein-von Mises theorem for the bagged posterior, establishing its asymptotic normal distribution. We demonstrate the benefits of BayesBag via simulation experiments and an application to crime rate prediction.


Introduction
It is widely acknowledged that statistical models are usually not exactly correct in practice (Box, 1979, 1980; Cox, 1990; Lehmann, 1990). This model misspecification is known to lead to unreliable inferences; in particular, Bayesian posteriors can be unstable and poorly calibrated under misspecification (Greco, Racugno and Ventura, 2008; Jewson, Smith and Holmes, 2018; Kleijn and van der Vaart, 2012). Unfortunately, this leads to a lack of reproducibility, even when using the same method on a replicate data set from the same distribution (Huggins and Miller, 2023; Yang and Zhu, 2018). In this paper, we propose a criterion for reproducible uncertainty quantification and a general technique for achieving it.
Defining valid uncertainty quantification in misspecified models presents a conceptual problem since there is no "true parameter" that indexes a model with distribution equal to the data-generating distribution. The usual solution is to focus on a pseudo-true parameter, typically defined as the asymptotically optimal parameter in terms of Kullback-Leibler (KL) divergence (Grünwald, 2012; Kleijn and van der Vaart, 2012; Müller, 2013; Walker and Hjort, 2001). However, depending on the objectives of the analysis, it might not be desirable to concentrate at the KL-optimal parameter (Bissiri, Holmes and Walker, 2016; Jewson, Smith and Holmes, 2018; Miller and Dunson, 2018). Thus, rather than adopting a particular definition of pseudo-truth, we introduce a truth-agnostic approach to assessing reproducibility. Specifically, we consider the probability that two credible sets constructed from independent data sets have nonempty overlap, and we establish a simple lower bound on this overlap probability that holds for any valid confidence sets. Under misspecification, we show that credible sets from the standard posterior can strongly violate this bound, particularly when the dimension grows with the number of observations, indicating that the standard posterior exhibits poor reproducibility.
To improve the reproducibility of Bayesian inference under misspecification, we propose to use BayesBag (Bühlmann, 2014; Douady et al., 2003; Waddell, Kishino and Ota, 2002). The idea of BayesBag is to apply bagging (Breiman, 1996) to the Bayesian posterior. More precisely, the bagged posterior π*(θ | x) is defined by taking bootstrapped copies x* := (x*_1, . . ., x*_M) of the original dataset x := (x_1, . . ., x_N) and averaging over the posteriors obtained by treating each bootstrap dataset as the observed data:

π*(θ | x) := N^{-M} Σ_{x*} π(θ | x*),   (1)

where π(θ | x*) ∝ π_0(θ) ∏_{m=1}^{M} p_θ(x*_m) is the standard posterior density given data x*, and the sum is over all N^M possible bootstrap datasets of M samples drawn with replacement from the original dataset. In this work, we focus on parameter inference and prediction, complementing our work on BayesBag for model selection (Huggins and Miller, 2023). In theory and experiments, we consider both the case of fixed finite-dimensional parameters as well as high-dimensional cases where the dimension grows with the sample size.
We motivate the bagged posterior from first principles using Jeffrey conditionalization and show that bagged posterior credible sets typically satisfy our lower bound on the overlap probability, indicating that the bagged posterior quantifies uncertainty in a more reproducible way. These results illustrate how the bagged posterior integrates the attractive features of Bayesian inference, such as flexible hierarchical modeling, automatic integration over nuisance parameters, and the use of prior information, with the distributional robustness of frequentist methods, nonparametrically accounting for sampling variability and model misspecification. Simulation experiments validate our theory and demonstrate that the bagged posterior is particularly important for stability in high-dimensional settings. An application to crime rate prediction, using a Poisson regression model with a horseshoe prior to induce approximate sparsity, demonstrates that a BayesBag-based analysis can also lead to different conclusions, and better predictions, than a standard Bayesian analysis.
In practice, we suggest approximating π*(θ | x) by generating B independent bootstrap datasets x*(1), . . ., x*(B), where x*(b) consists of M samples drawn with replacement from x, yielding the approximation

π*(θ | x) ≈ B^{-1} Σ_{b=1}^{B} π(θ | x*(b)).   (2)

Since the bagged posterior is just the average of standard Bayesian posteriors, one can use any algorithm for standard posterior inference to compute each of the B posteriors and then aggregate across them. While this requires B times as much computation as a single posterior, it is trivial to parallelize the computation of the B posteriors. Since Eq. (2) is a simple Monte Carlo approximation, the error of this approximation can easily be estimated in order to choose B appropriately (Huggins and Miller, 2023).
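As a concrete sketch of this approximation, the following code pools samples from B posteriors, each conditioned on a bootstrap dataset; the conjugate Gaussian location model (our choice, purely for illustration, with assumed variance 1 and prior θ ∼ N(0, 1)) makes each posterior available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

def bayesbag_samples(x, posterior_sampler, B=50, M=None, S=1000):
    """Pool S samples from each of B posteriors, each conditioned on a
    bootstrap dataset of size M, approximating the bagged posterior."""
    N = len(x)
    M = N if M is None else M
    pooled = [posterior_sampler(x[rng.integers(0, N, size=M)], S)
              for _ in range(B)]
    return np.concatenate(pooled)

# Conjugate model: x_n ~ N(theta, 1) with prior theta ~ N(0, 1), so the
# posterior given n observations is N(sum(x) / (n + 1), 1 / (n + 1)).
def gaussian_posterior_sampler(x, S):
    n = len(x)
    post_var = 1.0 / (1.0 + n)
    post_mean = post_var * x.sum()
    return rng.normal(post_mean, np.sqrt(post_var), size=S)

x = rng.normal(2.0, 1.0, size=100)
theta_bag = bayesbag_samples(x, gaussian_posterior_sampler, B=50)
```

Each of the B posteriors could equally be computed by MCMC or variational inference, and in parallel; pooling an equal number of draws from each implements the Monte Carlo average over posteriors.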
Despite its many attractive features, there has been little practical or theoretical investigation of bagged posteriors prior to Huggins and Miller (2023). In the only previous work of which we are aware, Bühlmann (2014) presented some simulation results for a simple Gaussian location model, while Waddell, Kishino and Ota (2002) and Douady et al. (2003) used bagged posteriors for phylogenetic tree inference in papers focused primarily on speeding up model selection and comparing Bayesian inference versus the bootstrap.
The article is organized as follows. In Section 2, we motivate the use of BayesBag for reproducible uncertainty quantification in terms of our overlap criterion as well as a Jeffrey conditionalization derivation. In Section 3, we prove that the standard posterior often fails to satisfy the overlap criterion, whereas the bagged posterior typically satisfies it, focusing on Gaussian location models, regular finite-dimensional models, and linear regression. In Section 4, we prove a general Bernstein-von Mises theorem establishing the asymptotic normal distribution of the bagged posterior, which is employed in the overlap theory of the preceding section. Section 5 evaluates the performance of the bagged posterior in simulation studies, and Section 6 illustrates with an application to crime rate prediction using Poisson regression. We close with a discussion in Section 7.

Motivation
When misspecified, a Bayesian model can be so unstable that it contradicts itself. Specifically, given two independent data sets from the same distribution, the resulting two posteriors, for the same model, can place nearly all their mass on disjoint sets. Figure 1a provides a simple illustration of the problem. Intuitively, it seems clear that this must violate some principle of coherent uncertainty quantification. But if there is no true parameter for which the model is correct, then what is a posterior quantifying uncertainty about? In most previous work, this question is dealt with by focusing on the pseudo-true parameter, that is, the model parameter value that is closest in Kullback-Leibler divergence to the true distribution (De Blasi and Walker, 2013; Hoff and Wakefield, 2012; Kleijn and van der Vaart, 2012; Walker, 2013). However, this choice, or any choice of pseudo-truth, is somewhat arbitrary and entails implicit assumptions about the goal of the analysis, such as minimizing a certain loss function.
In this section, we instead formulate a criterion for reproducible uncertainty quantification that does not require any assumptions about what is true in terms of models or parameters. The basic idea is that two valid confidence sets constructed from independent data sets must intersect with a certain minimal probability. We prove a simple lower bound on this overlap probability that holds for any valid confidence sets, for any definition of pseudo-truth, and for any data distribution. We then use this criterion to motivate the use of BayesBag via Jeffrey conditionalization. Figure 1b illustrates how the bagged posterior does not suffer from the instability exhibited by the standard posterior.

Overlap criterion for reproducible uncertainty quantification
Suppose x → A_x is a method of constructing confidence sets that takes data x and produces a set A_x intended to provide coverage of some unknown quantity of interest, η. For any fixed value of η, let X | η be a random data set. Here, η does not have to be a model parameter; rather, it is simply some quantity that X depends on.

Definition 2.1. We say that x → A_x has coverage 1 − α with respect to X | η if for all η, we have P(η ∈ A_X | η) ≥ 1 − α.

This definition is agnostic to any assumptions about what is true in terms of models or parameters.
Proposition 2.2. Let X and Y be independent data sets, given η. If x → A_x and y → B_y have coverage 1 − α and 1 − α′ with respect to X | η and Y | η, respectively, then

P(A_X ∩ B_Y ≠ ∅ | η) ≥ (1 − α)(1 − α′).   (3)

This provides a lower bound on the probability that two valid confidence sets intersect: by independence, both sets contain η, and hence intersect, with probability at least (1 − α)(1 − α′). For example, if the coverage is 1 − α = 1 − α′ = 0.95, then the lower bound on the probability of intersection is 0.9025. We refer to P(A_X ∩ B_Y ≠ ∅ | η) as the overlap probability, and satisfying the bound is referred to as the overlap criterion. Failing to satisfy this criterion indicates a lack of stability and reproducibility across plausible datasets. While satisfying this bound is a necessary condition for coherent uncertainty quantification, it is not sufficient. For example, choosing A_x = A, a constant, would satisfy the bound but would clearly be an ineffective method for inference.
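Proposition 2.2 is easy to check numerically. A minimal sketch, in a toy setting of our choosing: exact z-intervals for a Gaussian mean with known variance, which have exact 1 − α coverage, so their overlap frequency should be at least (1 − α)²:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, N, R = 0.05, 50, 4000
z = 1.959964  # standard normal quantile for alpha = 0.05

hits = 0
for _ in range(R):
    # Two independent datasets with the same eta = 0 and known variance 1.
    x = rng.normal(0.0, 1.0, size=N)
    y = rng.normal(0.0, 1.0, size=N)
    half = z / np.sqrt(N)  # half-width of a valid 95% confidence interval
    # The intervals [xbar - half, xbar + half] and [ybar - half, ybar + half]
    # intersect iff their centers differ by at most 2 * half.
    hits += abs(x.mean() - y.mean()) <= 2 * half

overlap_freq = hits / R  # should exceed (1 - alpha)^2 = 0.9025
```

In this well-specified setting the empirical overlap frequency comfortably exceeds the 0.9025 bound; the point of the paper is that credible intervals from a misspecified standard posterior need not.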

Jeffrey conditionalization for reproducibility leads to BayesBag
For reproducibility, one needs to represent uncertainty across data sets from the true distribution. A natural way to do this is via Jeffrey conditionalization, which turns out to lead to the bagged posterior. This interpretation elegantly unifies the Bayesian and frequentist elements of the bagged posterior, which might otherwise seem challenging to interpret together in a principled way.
To explain, suppose we have a model p(x, y) of two variables x and y. In the absence of any other data or knowledge, we would quantify our uncertainty in x and y via the marginal distributions p(x) = ∫ p(x | y)p(y) dy and p(y) = ∫ p(y | x)p(x) dx, respectively. Now, suppose we are informed that the true distribution of x is p•(x), but we are not given any samples of x or y. We would then quantify our uncertainty in x via p•(x), and a natural way to quantify our uncertainty in y is via q(y) := ∫ p(y | x)p•(x) dx. The idea is that q(x, y) := p(y | x)p•(x) updates the model to have the correct distribution of x, while remaining as close as possible to the original model p(x, y). This is referred to as Jeffrey conditionalization (Diaconis and Zabell, 1982; Jeffrey, 1968, 1990).
Suppose x = x_{1:N} := (x_1, . . ., x_N) is the data and y = θ is a parameter, so that p(x, y) = p(x_{1:N}, θ) is the joint distribution of the data and the parameter. If we are informed that the true distribution of the data is p•N(x_{1:N}), then the Jeffrey conditionalization approach is to quantify our uncertainty in θ by

q(θ) := ∫ p(θ | x_{1:N}) p•N(x_{1:N}) dx_{1:N}.   (4)

Now, suppose we are not informed of the true distribution exactly, but we are given data X_1, . . ., X_N i.i.d. ∼ p•. Since the empirical distribution P_N := N^{-1} Σ_{n=1}^{N} δ_{X_n} is a consistent estimator of p•, a natural approximation is to substitute P_N for p• in Eq. (4). Doing so, we arrive at the bagged posterior π*(θ | x) from Eq. (1), in the case of M = N:

π*(θ | x) = E{π(θ | X*_{1:N}) | X_{1:N}},

where X*_1, . . ., X*_N i.i.d. ∼ P_N given X_{1:N}. Thus, the bagged posterior represents uncertainty in θ, integrating over data sets drawn from an approximation to the true distribution. Hence, the bagged posterior naturally improves reproducibility across data sets.

BayesBag combines Bayesian and frequentist uncertainty
In Eq. (4), p(θ | x_{1:N}) represents Bayesian model-based uncertainty, and integrating with respect to p•N(x_{1:N}) represents frequentist sampling uncertainty. Remarkably, these two sources of uncertainty combine additively in the bagged posterior whenever θ ∈ R^D.
To see this, let X* | x be a random bootstrap dataset given data x, and let ϑ* | X* ∼ π(θ | X*) be distributed according to the standard posterior given data X*. Marginalizing out X*, we have ϑ* | x ∼ π*(θ | x). Define θ̂(x) := ∫ θ π(θ | x) dθ to be the standard posterior mean given x. By the law of total expectation, the mean of the bagged posterior is

E(ϑ* | x) = E{θ̂(X*) | x}.

By the law of total covariance, the covariance matrix of the bagged posterior is

Cov(ϑ* | x) = E{Σ̂(X*) | x} + Cov{θ̂(X*) | x},

where Σ̂(x) := ∫ {θ − θ̂(x)}{θ − θ̂(x)}⊤ π(θ | x) dθ is the standard posterior covariance. In this decomposition of Cov(ϑ* | x), the first term approximates the mean of the posterior covariance matrix under the sampling distribution, and the second term approximates the covariance of the posterior mean under the sampling distribution. Thus, the first term reflects Bayesian model-based uncertainty averaged with respect to frequentist sampling variability, and the second term reflects frequentist sampling-based uncertainty of a Bayesian model-based point estimate.
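The additive decomposition can be verified numerically. A minimal sketch, under assumptions of our choosing: a one-dimensional location model with a flat prior and assumed variance 1 (so the posterior given any size-N dataset x* is N(mean(x*), 1/N)), fit to data whose true variance is larger:

```python
import numpy as np

rng = np.random.default_rng(2)
N, B = 200, 2000
x = rng.normal(0.0, 3.0, size=N)  # true sd is 3, but the model assumes sd 1

post_means = np.empty(B)          # posterior mean for each bootstrap dataset
post_var = 1.0 / N                # model-based posterior variance (constant here)
for b in range(B):
    x_star = x[rng.integers(0, N, size=N)]
    post_means[b] = x_star.mean()

# Law of total variance:
#   bagged variance = E[posterior variance] + Var[posterior mean].
bagged_var = post_var + post_means.var()
# First term: Bayesian model-based uncertainty (constant 1/N in this model).
# Second term: frequentist sampling variability of the point estimate,
# approximately (sample variance of x) / N.
```

Because the model understates the noise, the second (frequentist) term is roughly nine times the first, so the bagged variance corrects most of the standard posterior's overconfidence.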

Reproducibility using overlap probability
We now investigate if and when the standard and bagged posteriors satisfy the overlap criterion for reproducible uncertainty quantification. We focus on Gaussian location models, regular finite-dimensional models, and linear regression as representative cases, and consider settings where the dimension is fixed or growing with the sample size. We show that under misspecification, the bagged posterior typically satisfies the overlap criterion whereas the standard posterior does not. For correctly specified models, both the standard and bagged posteriors usually satisfy the criterion.
First, however, as a check on the reasonableness of our criterion, we establish that for any correctly specified Bayesian model, the overlap criterion is satisfied in expectation with respect to the prior.
Theorem 3.1. Consider any model for data X | ϑ and any prior π on ϑ. Suppose x → A_x is a 100(1 − α)% posterior credible set for ϑ under this model and prior, that is, Π(ϑ ∈ A_x | x) ≥ 1 − α for all x. Then, letting X and Y be independent data sets given ϑ with ϑ ∼ π, we have E{P(A_X ∩ A_Y ≠ ∅ | ϑ)} ≥ (1 − α)².

Theorem 3.1 is a direct analogue of the classical result that posterior credible sets have correct frequentist coverage in expectation under the assumed prior. All proofs are in Appendix D.

Gaussian location model
We first consider the simple Gaussian location model in which observations x_n are modeled as i.i.d. N(θ, V) with fixed positive definite covariance matrix V, and assume a conjugate prior, θ ∼ N(0, V_0). Given data x_{1:N}, the standard posterior is N(μ_N, V_N), where V_N := (V_0^{-1} + N V^{-1})^{-1} and μ_N := V_N V^{-1} Σ_{n=1}^{N} x_n. The bagged posterior mean and covariance are

E(ϑ* | x) = M V_M V^{-1} x̄_N,   Cov(ϑ* | x) = V_M + M V_M V^{-1} Σ̂_N V^{-1} V_M,

where V_M := (V_0^{-1} + M V^{-1})^{-1}, x̄_N := N^{-1} Σ_{n=1}^{N} x_n, and Σ̂_N := N^{-1} Σ_{n=1}^{N} (x_n − x̄_N)(x_n − x̄_N)⊤ is the empirical covariance. Unlike the standard posterior, which simply assumes the data have covariance V, the bagged posterior accounts for the true covariance of the data through the inclusion of the term involving Σ̂_N.

Overlap probability for Gaussian location model with fixed dimension
Consider the Gaussian location model above. Fix α ∈ (0, 1) and u ∈ R^D \ {0}, and let A_{x_{1:N}} be a 100(1 − α)% central credible interval for u⊤θ given data x_{1:N}. For BayesBag, let A*_{x_{1:N}} denote the 100(1 − α)% central interval for the normal distribution matching the mean and variance of the bagged posterior distribution of u⊤θ given x_{1:N}. For readability, we abbreviate the overlap probability of two such intervals constructed from independent data sets as p(overlap) for the standard posterior and p*(overlap) for the bagged posterior. Theorem 3.2 characterizes the limiting overlap probabilities in this setting, where Σ• denotes the true covariance of the data. If the model is correct then V = Σ•, so the standard and bagged posteriors have the same asymptotic behavior when M = 2N; specifically, the overlap probability converges to P(|W| ≤ √2 z_{α/2}), where W ∼ N(0, 1) and z_{α/2} is the upper α/2 standard normal quantile. However, in misspecified cases where u⊤V u < u⊤Σ•u, the overlap probability for the standard posterior can be arbitrarily small. On the other hand, when M/N → c with 0 < c ≤ 2, the bagged posterior satisfies lim inf_{N→∞} p*(overlap) ≥ (1 − α)². Thus, BayesBag is guaranteed to satisfy the overlap criterion necessary for reproducible uncertainty quantification (Eq. (3)) when 0 < c ≤ 2, while standard Bayes is not.

Overlap probability for Gaussian location model with growing dimension
To study the case of growing dimension D in the Gaussian location model, we establish finite-sample expressions for the overlap probability in the special case of V = I and a flat prior (V_0^{-1} = 0), assuming Gaussian data.
Theorem 3.3. Consider the same setup as in Theorem 3.2. Suppose P• = N(0, Σ•), V = I, V_0^{-1} = 0, and ∥u∥ = 1. Then for the standard posterior,

p(overlap) = P(|W| ≤ √2 z_{α/2} / (u⊤Σ•u)^{1/2}),   (5)

where W ∼ N(0, 1), and for the BayesBag posterior, when N ≥ 2, p*(overlap) admits a lower bound, Eq. (6), that depends on M and N but not on D.

Note that the right-hand side of Eq. (5) does not depend on N, and the right-hand side of Eq. (6) does not depend on D. Eq. (5) can be arbitrarily small as D grows, since u⊤Σ•u can be arbitrarily large; for instance, this will often be the case when Σ• has order D² nonnegligible entries. Thus, as the dimension D grows, the standard posterior can severely violate the overlap criterion. Meanwhile, if M/N → 1 as N → ∞, then the lower bound in Eq. (6) converges to a limit exceeding (1 − α)². Therefore, for all N sufficiently large, for all D, BayesBag satisfies the overlap criterion.

Regular finite-dimensional models
Asymptotically, sufficiently regular finite-dimensional models behave like the Gaussian location model. We have θ | x_{1:N} ≈ N(θ̂_N, N^{-1} J_{θ•}^{-1}) by the Bernstein-von Mises theorem, and N^{1/2}(θ̂_N − θ•) ⇝ N(0, J_{θ•}^{-1} I_{θ•} J_{θ•}^{-1}) by classical theory, where θ̂_N is the maximum likelihood estimator, θ• is the Kullback-Leibler optimal parameter, and J_{θ•}, I_{θ•} are information matrices; see Section 4 for details. In Section 4, we prove that for the bagged posterior, ϑ* | x_{1:N} ≈ N(θ̂_N, N^{-1} c^{-1}{J_{θ•}^{-1} + J_{θ•}^{-1} I_{θ•} J_{θ•}^{-1}}) when M/N → c ∈ (0, ∞). Fix u ∈ R^D \ {0} and α ∈ (0, 1). Let p_∞(overlap) and p*_∞(overlap) denote the asymptotic overlap probabilities of 100(1 − α)% central credible intervals for u⊤θ under these asymptotic normal distributions for the standard and bagged posteriors, respectively, assuming J_{θ•} and I_{θ•} are positive definite.
Theorem 3.4. Let W ∼ N(0, 1). For the standard posterior,

p_∞(overlap) = P(|W| ≤ √2 z_{α/2} {u⊤J_{θ•}^{-1}u / (u⊤J_{θ•}^{-1}I_{θ•}J_{θ•}^{-1}u)}^{1/2}),

and for the bagged posterior,

p*_∞(overlap) = P(|W| ≤ √2 z_{α/2} {c^{-1}(u⊤J_{θ•}^{-1}u + u⊤J_{θ•}^{-1}I_{θ•}J_{θ•}^{-1}u) / (u⊤J_{θ•}^{-1}I_{θ•}J_{θ•}^{-1}u)}^{1/2}).

In general, the ratio u⊤J_{θ•}^{-1}I_{θ•}J_{θ•}^{-1}u / (u⊤J_{θ•}^{-1}u) can be arbitrarily large or small. In particular, p_∞(overlap) can be arbitrarily small, implying that the asymptotic standard posterior can strongly violate the overlap criterion in Eq. (3). On the other hand, as long as c ≤ 2, we have p*_∞(overlap) ≥ 1 − α, implying that the asymptotic bagged posterior satisfies the overlap criterion.

Linear regression model
Consider data consisting of regressors Z_n ∈ R^D and outcomes Y_n ∈ R (n = 1, . . ., N), and let Z ∈ R^{N×D} denote the complete design matrix and Y ∈ R^N the vector of outcomes. We analyze the linear regression model Y | Z ∼ N(Zβ, σ²I), where β ∈ R^D is the vector of coefficients and σ² > 0 is the outcome variance. To simplify the analysis, assume Z⊤Z is invertible, σ² is fixed but possibly unknown, and β is given a flat prior. For any u ∈ R^D \ {0}, the resulting posterior on u⊤β is N(u⊤β̂, σ² u⊤(Z⊤Z)^{-1}u), where β̂ := (Z⊤Z)^{-1}Z⊤Y. Assume the true distribution of the outcomes is Y | Z ∼ N(μ†, Σ†), where μ† and Σ† are functions of Z, say, μ† = m(Z) and Σ† = K(Z). Note that the model is correctly specified when m(Z) = Zβ and K(Z) = σ²I.
Our theory gives the overlap probability p(overlap | Z, Z̃) for intervals constructed from two datasets with design matrices Z and Z̃ in three cases: (1) if the model is correctly specified, Eq. (7); (2) if Z = Z̃ but we make no assumptions on the form of m(Z) or K(Z), Eq. (8); and (3) if K(Z) = σ†²I but we make no assumptions on the form of m(Z), Eq. (9). Eq. (7) shows that if the linear regression model is correctly specified, then the standard posterior satisfies the overlap criterion (Eq. (3)), since σ = σ̃ = σ† and therefore the overlap probability meets the (1 − α)² bound. If the model is correct but the variance is unknown, and consistent estimators of σ†² are plugged in for σ² and σ̃², then the overlap criterion is satisfied for all N sufficiently large.
However, when either the covariance K(Z) or the mean function m(Z) is misspecified, the standard posterior can violate the overlap criterion. Consider the case where Z = Z̃, that is, the design matrix is the same across replicates; we refer to this as a fixed design setting. Eq. (8) shows that p(overlap | Z, Z̃) does not depend on m(Z), so misspecification of the mean function has no effect on the overlap probability in this case. Nonetheless, Eq. (8) shows that p(overlap | Z, Z̃) can be arbitrarily small when the covariance is misspecified, because the ratio (σ + σ̃)∥v∥ / (v⊤K(Z)v)^{1/2} can be arbitrarily small. Clearly, this ratio will be small if σ² and σ̃² are blindly set too low, but it can also be small if these variances are estimated from the data. For instance, if the true distribution exhibits heteroskedasticity (that is, K(Z) has a nonconstant diagonal), then the standard posterior can violate the overlap criterion even when σ² and σ̃² are estimated; see Section 5.
Finally, consider the case where Z and Z̃ are not necessarily equal and we make no assumptions on m(Z). To avoid trivial failure modes in which the choice of Z and Z̃ leads to a nonnegligible differential bias v⊤m(Z) − ṽ⊤m(Z̃) as N grows, assume a random design setting where the rows of Z and Z̃ are independent and identically distributed. Then even if K(Z) = σ†²I, so that there is no heteroskedasticity and no correlation among outcomes, the overlap criterion can still be violated. As before, p(overlap | Z, Z̃) can be arbitrarily small if σ² and σ̃² are blindly set too low, but it can also be small if these variances are estimated. By Eq. (9), p(overlap | Z, Z̃) will be small if the magnitude of the differential bias term in Eq. (10) is large relative to (σ² + σ̃²)^{1/2}, where Z⁺ = (Z⊤Z)^{-1}Z⊤ is the pseudoinverse. A trivial way this can occur is if the entries of β† are large. More interestingly, however, Eq. (10) can be large if the dimension D grows with N, even if each entry of β† has fixed magnitude. Specifically, in Section 5 we present experiments demonstrating this when β† consists of the first D entries of a fixed sequence β†,1, β†,2, . . .
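The contrast between the standard and bagged variances in this regression setting is easy to see numerically. A sketch under assumptions of our own choosing (flat prior, assumed σ² = 1 while the true noise variance is 4, i.e., K(Z) misspecified):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 300, 5
Z = rng.normal(size=(N, D))
beta = np.ones(D)
Y = Z @ beta + rng.normal(scale=2.0, size=N)  # true noise sd 2; model assumes 1
u = np.eye(D)[0]                              # functional of interest: beta_1

def posterior_moments(Z, Y, sigma2=1.0):
    """Flat-prior posterior of u'beta: N(u'beta_hat, sigma2 u'(Z'Z)^-1 u)."""
    G = np.linalg.inv(Z.T @ Z)
    beta_hat = G @ Z.T @ Y
    return u @ beta_hat, sigma2 * (u @ G @ u)

m_std, v_std = posterior_moments(Z, Y)

# Bagged posterior moments via B bootstrap resamples of the (Z_n, Y_n) pairs.
B = 200
means, variances = np.empty(B), np.empty(B)
for b in range(B):
    idx = rng.integers(0, N, size=N)
    means[b], variances[b] = posterior_moments(Z[idx], Y[idx])
v_bag = variances.mean() + means.var()  # law of total variance
```

Because the assumed variance understates the true noise, v_std is too small, while the bootstrap spread of the posterior means inflates v_bag toward the correct sampling variability of u⊤β̂.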

Asymptotic normality of the bagged posterior
In this section, we establish a Bernstein-von Mises theorem for the bagged posterior under sufficiently regular finite-dimensional models (Theorem 4.1). In particular, we show that while the standard posterior may be arbitrarily under- or over-confident when the model is misspecified, the bagged posterior avoids overconfident uncertainty quantification by accounting for sampling variability.
More formally, consider a model {P_θ : θ ∈ Θ} for independent and identically distributed (i.i.d.) data x_1, . . ., x_N, where x_n ∈ X and Θ ⊆ R^D is open. Suppose p_θ is the density of P_θ with respect to some reference measure. The standard Bayesian posterior distribution given x_{1:N} is

Π_N(dθ) := Π(dθ | x_{1:N}) ∝ Π_0(dθ) ∏_{n=1}^{N} p_θ(x_n),

where Π_0(dθ) is the prior distribution. Assume the observed data X_1, . . ., X_N is generated i.i.d. from some unknown distribution P•. Suppose there is a unique parameter θ• that minimizes the Kullback-Leibler divergence from P• to the model, or equivalently, θ• = arg max_{θ∈Θ} E{log p_θ(X_1)}. Under regularity conditions, the maximum likelihood estimator θ̂_N := arg max_θ Σ_{n=1}^{N} log p_θ(X_n) is asymptotically normal in the sense that

N^{1/2}(θ̂_N − θ•) ⇝ N(0, J_{θ•}^{-1} I_{θ•} J_{θ•}^{-1}),   (11)

where J_θ := −E{∇²_θ log p_θ(X_1)} and I_θ := Cov{∇_θ log p_θ(X_1)} (White, 1982). Under mild conditions, the Bernstein-von Mises theorem (van der Vaart, 1998, Ch. 10; Kleijn and van der Vaart, 2012) guarantees that for ϑ ∼ Π_N,

N^{1/2}(ϑ − θ̂_N) ⇝ N(0, J_{θ•}^{-1})   (12)

in probability. Hence, the standard posterior is correctly calibrated, asymptotically, if the covariance matrices of the Gaussian distributions in Eqs. (11) and (12) coincide, that is, if I_{θ•} = J_{θ•}. In that case, Bayesian credible sets are (asymptotically) valid confidence sets in the frequentist sense: sets of posterior probability 1 − α contain the true parameter with P•^∞-probability 1 − α, under mild conditions. If the model is well-specified, that is, if P• = P_{θ†} for some parameter θ† ∈ Θ (and thus θ• = θ† by the uniqueness assumption), then I_{θ•} = J_{θ•} under very mild conditions. On the other hand, if the model is misspecified, that is, if P• ≠ P_θ for all θ ∈ Θ, then although Eq. (12) still holds, typically I_{θ•} ≠ J_{θ•}. Then the standard posterior is not correctly calibrated, and in fact, asymptotic Bayesian credible sets may be arbitrarily over- or under-confident.
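The gap between I_{θ•} and J_{θ•} is easy to exhibit numerically. A sketch with an example of our choosing: a Poisson(θ) model fit to overdispersed negative-binomial data, where J_{θ•} = 1/θ• and I_{θ•} = Var(X)/θ•², so the sandwich J^{-1}IJ^{-1} is approximately Var(X) while the model-based variance J^{-1} is approximately E(X):

```python
import numpy as np

rng = np.random.default_rng(4)
# Negative binomial with mean 5 and variance 10: overdispersed for a Poisson fit.
x = rng.negative_binomial(n=5, p=0.5, size=20000)

theta_hat = x.mean()              # Poisson MLE; the KL-optimal rate is E(X)
score = x / theta_hat - 1.0       # score d/dtheta log p_theta(x) at theta_hat
J_hat = x.mean() / theta_hat**2   # -E[second derivative] = E(X)/theta^2
I_hat = score.var()               # covariance of the score
sandwich_var = I_hat / J_hat**2   # J^-1 I J^-1; close to Var(X) = 10
model_var = 1.0 / J_hat           # what the posterior uses; close to E(X) = 5
```

Here model_var understates the sampling variance of theta_hat by roughly a factor of two, which is precisely the kind of overconfidence the bagged posterior's covariance (incorporating both J^{-1} and J^{-1}IJ^{-1}) corrects.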
Our Bernstein-von Mises theorem shows that the bagged posterior does not suffer from the overconfidence of the standard posterior. Let X*_{1:M} denote a bootstrapped copy of X_{1:N} with M observations; that is, each observation X_n is replicated K_n times in X*_{1:M}, where K_{1:N} ∼ Multi(M, 1/N) is a multinomial-distributed count vector of length N. We formally define the bagged posterior by

Π*(A | X_{1:N}) := E{Π(A | X*_{1:M}) | X_{1:N}}

for all measurable A ⊆ Θ; this is equivalent to the informal definition in Eq. (1). To avoid notational clutter, we suppress the dependence of Π*(· | X_{1:N}) on M. We use the shorthand notation Π*_N := Π*(· | X_{1:N}), and we let ϑ* | X_{1:N} ∼ Π*_N denote a random variable distributed according to the bagged posterior. We assume Π_N and Π*_N have densities π_N and π*_N, respectively, with respect to Lebesgue measure; note that π*_N exists if π_N exists. For a measure ν and function f, we use the shorthand ν(f) := ∫ f dν. Let X_{1:∞} denote the infinite sequence (X_1, X_2, . . .), and abbreviate ℓ_θ := log p_θ.
Theorem 4.1. Suppose X_1, X_2, . . . i.i.d. ∼ P•, assume that regularity conditions (i)-(v) hold (these are essentially the misspecified Bernstein-von Mises conditions of Kleijn and van der Vaart (2012); see Appendix D), and assume M/N → c ∈ (0, ∞). Then, letting ϑ* ∼ Π*_N, we have that conditionally on X_{1:∞}, for almost every X_{1:∞},

N^{1/2}(ϑ* − θ̂_N) ⇝ N(0, c^{-1}{J_{θ•}^{-1} + J_{θ•}^{-1} I_{θ•} J_{θ•}^{-1}}),

where θ̂_N is the maximum likelihood estimator based on X_{1:N}. The result also holds in the regression setting with random regressors, where the data take the form X_n = (Y_n, Z_n) and the models p_θ(y | z) are conditional, so ℓ_θ(x) := log p_θ(y | z).
The proof of Theorem 4.1 is in Appendix D. Theorem C.1 is a simpler version of the same result for the univariate Gaussian location model, for which the statement and our proof technique are more transparent. Our technical assumptions are essentially the same as those used by Kleijn and van der Vaart (2012) to prove the Bernstein-von Mises theorem under misspecification for the standard posterior. Of particular note, Kleijn and van der Vaart (2012) require that (and give conditions under which) the posterior concentrates at rate N^{-1/2}, in the sense that for every sequence of constants M_N → ∞, E{Π_N(N^{1/2}∥θ − θ•∥ > M_N)} → 0. We conjecture that under reasonable regularity assumptions, this expected posterior concentration condition implies our condition (v).
To interpret this result, it is helpful to compare it to the behavior of the standard posterior. Under the conditions of Theorem 4.1, the standard posterior satisfies N^{1/2}(ϑ − θ̂_N) ⇝ N(0, J_{θ•}^{-1}) in probability by Kleijn and van der Vaart (2012, Theorem 2.1 and Lemma 2.1). Thus, the bagged posterior and the standard posterior for N^{1/2}(θ − θ•) have the same asymptotic mean, Δ_N := N^{1/2}(θ̂_N − θ•), but the bagged posterior has asymptotic covariance c^{-1}(J_{θ•}^{-1} + J_{θ•}^{-1} I_{θ•} J_{θ•}^{-1}) rather than J_{θ•}^{-1}. Hence, asymptotically, the bagged posterior is never overconfident if c = 1 (for instance, if M = N), and by Theorem 3.4, we expect 100(1 − α)% credible sets of the bagged posterior to have overlap probability of at least 1 − α when 0 < c ≤ 2.

Simulations
In this section, we validate our theoretical results through a simulation study with a linear regression model, which is ideal for investigating the properties of BayesBag since all computations of posterior quantities can be done in closed form. The setup is similar to the linear regression model from Section 3.3, except we place proper priors on the regression coefficients and the outcome variance σ². The data consist of regressors Z_n ∈ R^D and outcomes Y_n ∈ R (n = 1, . . ., N), and the parameter is θ = (θ_0, . . ., θ_D) = (log σ², β_1, . . ., β_D) ∈ R^{D+1}. Using conjugate priors, the assumed model is

σ² ∼ InvGamma(a_0, b_0),   β | σ² ∼ N(0, σ² λ^{-1} I),   Y_n | Z_n, β, σ² indep. ∼ N(Z_n⊤β, σ²),

where a_0 = 2, b_0 = 1, and λ = 1 are fixed hyperparameters.
Data generating distribution. We simulated data for a random design scenario by generating, for n = 1, . . ., N, regressors Z_n i.i.d. ∼ G and outcomes Y_n = f(Z_n)⊤β† + ε_n with ε_n i.i.d. ∼ N(0, 1), where β†,d = 4/√d for d = 1, . . ., D, and we used two settings for each of f and G.
• Regression function f. By default, we used the identity function f(z) = z to simulate data for the well-specified setting. Alternatively, we used the nonlinear function f(z) = (z_1³, . . ., z_D³)⊤ for a misspecified setting.
• Regressor distribution G. By default, we used G = N(0, I) to simulate data; we refer to this as the uncorrelated setting. Alternatively, we used a correlated-κ setting where, for h = 10, Z ∼ G was defined by generating ξ ∼ χ²(h) and using it to rescale the odd-indexed components. The motivation for the correlated-κ sampling procedure is to generate correlated regressors that have different tail behaviors while still having the same first two moments, since regressors are typically standardized to have mean 0 and variance 1. Note that, marginally, Z_1, Z_3, . . . are each rescaled t-distributed random variables with h degrees of freedom such that Var(Z_1) = 1, and Z_2, Z_4, . . . are standard normal.

Overlap probabilities. The primary objective in these experiments is to validate that the BayesBag posterior does not violate the overlap probability lower bounds while the Bayesian posterior sometimes does. Thus, for each data-generating distribution of interest, we estimate overlap probabilities by generating R pairs of independent datasets. For each test point Z_i^test, i ∈ {1, . . ., 100}, we estimate the probability of overlap as the fraction of the R pairs for which the two credible intervals at Z_i^test have nonempty intersection. For all experiments we use R = 100. Figure 2 shows that for nonlinear-uncorrelated data, BayesBag never violates the overlap lower bounds, while Bayes always or often violates the lower bounds, depending on the value of 1 − α (larger 1 − α leads to more violations); for BayesBag, the proportion of test points satisfying the lower bound is 1 in all cases.
As shown in Fig. 3, the problem becomes more severe as N and D jointly increase, but improves or stays the same if D is fixed and N increases. These results emphasize how the misspecified high-dimensional regime is particularly problematic for the reproducibility of the standard posterior. We find similar results in the case of a fixed design matrix with heteroskedastic noise (see Appendix B.2 in the Supplementary Material).
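A stripped-down version of this overlap experiment reproduces the violation without any regression machinery. In this simplification of our own devising, a one-dimensional location model assumes variance 1 while the data have true variance 25, so the credible intervals are far too narrow:

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, N, R = 0.05, 100, 2000
z = 1.959964
true_sd = 5.0  # the model assumes sd 1: badly misspecified

hits = 0
for _ in range(R):
    x = rng.normal(0.0, true_sd, size=N)
    y = rng.normal(0.0, true_sd, size=N)
    # Standard posterior (flat prior): N(mean, 1/N), so the 95% credible
    # interval is mean +/- z/sqrt(N). Two intervals overlap iff their
    # centers differ by at most 2z/sqrt(N), but the centers actually
    # fluctuate on the much larger scale true_sd * sqrt(2/N).
    hits += abs(x.mean() - y.mean()) <= 2 * z / np.sqrt(N)

overlap_freq = hits / R
# The estimated overlap frequency falls far below the required
# (1 - alpha)^2 = 0.9025 reproducibility bound.
```

Bagging the posterior (or, here, simply using the sample variance) widens the intervals to reflect the true sampling variability and restores the bound.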
Predictive performance.To complement our overlap probability analysis, we also computed the mean log predictive densities at the same test points.Figure 4 shows that while in well-specified linear settings the standard posterior can slightly outperform BayesBag (by roughly 0.2 nats or less), in the misspecified nonlinear settings BayesBag can be far superior (by 0.2 to nearly 10 nats).

Application
We next consider an application to community-level crime data from the United States using a Poisson regression model with log link function and the regularized horseshoe prior proposed by Piironen and Vehtari (2017). The data consist of N = 1994 observations containing 100 community-level covariates, including demographic summaries and local law enforcement statistics such as the number of police officers per capita. The goal is to predict the number of violent crimes per 100,000 persons in the population. We chose M = N and used B = 50 bootstrap samples to approximate the bagged posterior. Nearly identical results were obtained with B = 25, indicating that B = 50 was sufficiently large.
To compute overlap probabilities, we held out 20% of the observations as test points and randomly split the remaining observations into two equally sized data sets, from which we computed two posteriors to compare.We generated R = 50 replicate experiments in this way, and followed the procedure in Section 5 to approximate the overlap probability for each replicate.
Figure 5 validates our theoretical results: the standard posterior is unstable across datasets, with overlap probabilities below (1 − α)² for 1 − α ∈ {0.8, 0.9, 0.95} in the vast majority of replicates. The bagged posteriors, on the other hand, have overlap greater than (1 − α)² in all replicates. Moreover, BayesBag has superior predictive performance: the mean log predictive densities for the standard and bagged posteriors are, respectively, −5.4 and −4.3, with a 99% confidence interval for the difference of (1.043, 1.093) (paired t interval).
To explore how using the bagged rather than the standard posterior might result in different conclusions, we compared the posterior marginals of the regression coefficients, with some representative results shown in Figs. 6 and 7. In all cases, the bagged posteriors were more diffuse, as would be expected. In several cases, however, the BayesBag results are qualitatively different from the standard posterior results. The standard posterior for the coefficient of Upper Quartile Rent is symmetric and concentrated below zero, while the bagged posterior has a sharp peak at zero and is skewed left (Fig. 6). Similarly, the standard posteriors for the coefficients of covariates related to the percentages of different racial and ethnic groups are symmetric (Fig. 7), whereas the bagged posteriors for the coefficients of Percent Asian and Percent Hispanic are multimodal and have significantly more mass centered at zero. These examples illustrate how the bagged and standard posteriors may yield substantively different results in practice: BayesBag is not merely inflating the posterior uncertainty.

Discussion
We conclude by first situating BayesBag in the wider literature on robust Bayesian inference and then, with that additional context in place, highlighting the strengths of our approach and suggesting fruitful directions for future development.

Bayesian bagging
Despite the similar-sounding names, BayesBag is very different from Bayesian bagging (Clyde and Lee, 2001; Lee and Clyde, 2004). Bayesian bagging consists of applying the Bayesian bootstrap to a point estimator of a classification or regression model, such as ordinary least squares. In other words, it is a slight variant of traditional bagging where, instead of multinomial weights, one uses continuous weights drawn uniformly from the probability simplex.
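For concreteness, Bayesian bagging in this sense can be sketched with OLS as the point estimator. This is an illustrative implementation, not Clyde and Lee's code; note that weights drawn uniformly from the simplex are Dirichlet(1, . . ., 1), obtained by normalizing i.i.d. Exp(1) draws:

```python
import numpy as np

def bayesian_bootstrap_ols(x, y, n_draws=1000, rng=None):
    """Bayesian bootstrap applied to the OLS point estimator: each draw
    reweights the data with continuous Dirichlet(1, ..., 1) weights and
    refits weighted least squares."""
    rng = np.random.default_rng(rng)
    n = len(y)
    estimates = []
    for _ in range(n_draws):
        w = rng.exponential(size=n)
        w /= w.sum()                      # uniform on the probability simplex
        sw = np.sqrt(w)[:, None]
        # weighted least squares via the rescaled design sqrt(w) * x
        beta, *_ = np.linalg.lstsq(sw * x, sw.ravel() * y, rcond=None)
        estimates.append(beta)
    return np.array(estimates)
```

The output is a sample of point estimates, not a posterior over an arbitrary Bayesian model, which is precisely the contrast with BayesBag drawn below.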
In contrast, BayesBag uses traditional bagging on the posterior of an arbitrary Bayesian model. In short, Bayesian bagging performs bagging using Bayes, whereas BayesBag performs Bayes using bagging. Relatedly, in the same way that bagging expands the model space for a classification or regression method (Domingos, 1997), BayesBag expands the posterior space for a Bayesian model.
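By contrast, a minimal BayesBag sketch conditions a posterior sampler on bootstrap resamples and pools the draws. The function names and the conjugate Gaussian example are illustrative assumptions:

```python
import numpy as np

def bagged_posterior_draws(x, posterior_sampler, B=50, M=None,
                           n_draws=200, rng=None):
    """Minimal BayesBag sketch: average posteriors over bootstrapped
    datasets by pooling an equal number of draws from the posterior
    conditioned on each bootstrap resample. `posterior_sampler` is any
    routine mapping (data, n_draws, rng) to posterior draws."""
    rng = np.random.default_rng(rng)
    M = len(x) if M is None else M        # M = N is the recommended default
    draws = []
    for _ in range(B):
        boot = rng.choice(x, size=M, replace=True)   # multinomial bootstrap
        draws.append(posterior_sampler(boot, n_draws, rng))
    return np.concatenate(draws)          # mixture of the B posteriors

def gaussian_location_sampler(x, n_draws, rng):
    # N(mu, 1) likelihood with a flat prior: mu | x ~ N(xbar, 1/n)
    return rng.normal(x.mean(), 1 / np.sqrt(len(x)), size=n_draws)
```

Because each bootstrapped posterior is computed with standard machinery, the sampler can be any existing MCMC or conjugate routine, which is what makes the approach general-purpose.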

Bayesian uncertainty quantification with the bootstrap
The bootstrap has previously been employed to perform uncertainty quantification in Bayesian settings. See Laird and Louis (1987) and references therein for uses of the bootstrap to adjust for underestimated uncertainties when using empirical Bayesian methods. Similar in spirit to the present work, Efron (2015) develops a variety of methods for obtaining frequentist uncertainty quantification of Bayesian point estimates, including some that rely on bootstrapping.

Robust Bayesian inference
Two common themes emerge when surveying existing methods for robust Bayesian inference. First, many methods require choosing a free parameter, and the proposals for choosing free parameters tend to be either (a) heuristic, (b) strongly dependent on being in the asymptotic regime, or (c) computationally prohibitive for most real-world problems. Second, those methods without a free parameter lose key parts of what makes the Bayesian approach attractive: for example, they strongly rely on asymptotic assumptions, make a Gaussian assumption, or do not incorporate a prior distribution. The power posterior is perhaps the most widely studied method for making the posterior robust to model misspecification (Grünwald, 2012; Grünwald and van Ommen, 2017; Holmes and Walker, 2017; Lyddon, Holmes and Walker, 2019; Miller and Dunson, 2018; Syring and Martin, 2019). For a likelihood function L(θ), prior distribution Π_0, and any ζ ≥ 0, the ζ-power posterior is defined as Π^(ζ)(dθ) ∝ L(θ)^ζ Π_0(dθ). Hence, Π^(1) is equal to the standard posterior and Π^(0) is equal to the prior. Typically, ζ is set to a value between these two extremes, as there is significant theoretical support for the use of power posteriors with ζ ∈ (0, 1) (Bhattacharya, Pati and Yang, 2019; Grünwald, 2012; Miller and Dunson, 2018; Royall and Tsou, 2003; Walker and Hjort, 2001). However, there are two significant methodological challenges. First, computing the power posterior often requires new computational methods or additional approximations, particularly in latent variable models (Antoniano-Villalobos and Walker, 2013; Miller and Dunson, 2018). Second, choosing an appropriate value of ζ can be difficult. Grünwald (2012) proposes SafeBayes, a theoretically sound method which is evaluated empirically in Grünwald and van Ommen (2017) and de Heide et al. (2019). However, SafeBayes is computationally prohibitive except with simple models and very small datasets. In addition, the underlying theory relies on strong assumptions on the model class. Many other methods for choosing ζ have been suggested, but they are either heuristic or rely on strong asymptotic assumptions such as the accuracy of the plug-in estimator for the sandwich covariance (Holmes and Walker, 2017; Lyddon, Holmes and Walker, 2019; Miller and Dunson, 2018; Royall and Tsou, 2003; Syring and Martin, 2019).
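In the conjugate Gaussian location model, the ζ-power posterior is available in closed form, which makes the role of ζ transparent: raising the likelihood to the power ζ is equivalent to scaling the effective sample size by ζ. An illustrative sketch (the prior hyperparameters are placeholders):

```python
import numpy as np

def power_posterior_gaussian(x, zeta, mu0=0.0, tau0=10.0):
    """zeta-power posterior Pi^(zeta)(d theta) ∝ L(theta)^zeta Pi_0(d theta)
    for the N(mu, 1) location model with a N(mu0, tau0^2) prior.
    Returns the posterior mean and standard deviation."""
    n = len(x)
    prec = 1 / tau0**2 + zeta * n          # posterior precision
    mean = (mu0 / tau0**2 + zeta * n * np.mean(x)) / prec
    return mean, 1 / np.sqrt(prec)
```

Setting ζ = 1 recovers the standard posterior and ζ = 0 the prior; the difficulty the text describes is that no such closed form guides the choice of ζ in realistic models.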
More in the spirit of BayesBag are a number of bootstrapped point estimation approaches (Chamberlain and Imbens, 2003; Lyddon, Holmes and Walker, 2019; Lyddon, Walker and Holmes, 2018; Newton and Raftery, 1994; Rubin, 1981). However, unlike BayesBag, these methods compute a collection of maximum a posteriori (MAP) or maximum likelihood (ML) estimates. The weighted likelihood bootstrap of Newton and Raftery (1994) and a generalization proposed by Lyddon, Holmes and Walker (2019) do not incorporate a prior, and therefore lose many of the benefits of Bayesian inference. The related approach of Lyddon, Walker and Holmes (2018), which includes the weighted likelihood bootstrap and standard Bayesian inference as limiting cases, draws the bootstrap samples partially from the posterior and partially from the empirical distribution. Unfortunately, there is no accompanying theory to guide how much the empirical distribution and posterior distribution should be weighted relative to each other, nor are there rigorous robustness guarantees. Moreover, bootstrapped point estimation methods can behave poorly when the MAP and ML estimates are not well-behaved, for example, due to the likelihood being peaked (or even tending to infinity) in a region of low posterior probability. Müller (2013) suggests replacing the standard posterior by a Gaussian distribution with covariance proportional to a plug-in estimate of the sandwich covariance. A benefit of our approach is that it does not rely on a Gaussian approximation and does not require estimation of the sandwich covariance, making it suitable for small-sample settings. While our theory does focus on Gaussian or asymptotically Gaussian posteriors, in practice BayesBag is applicable in non-asymptotic regimes where the posterior is highly non-Gaussian, as shown by the application in Section 6.

The benefits of BayesBag
In view of previous work, the BayesBag approach has a number of attractive features that make it flexible, easy to use, and widely applicable. From a methodological perspective, BayesBag is general-purpose. It relies only on carrying out standard posterior inference, it is applicable to a wide range of models, and it can make full use of modern probabilistic programming tools; the only added requirement is the design of a bootstrapping scheme. Although this paper focuses on using BayesBag with independent observations, future work can draw on the large literature devoted to adapting the bootstrap to more complex models such as those involving time-series and spatial data. BayesBag is also general-purpose in the sense that it is useful no matter whether the ultimate goal of Bayesian inference is parameter estimation, prediction, or model selection; see Huggins and Miller (2023) for how to use BayesBag for model selection.
Another appeal of BayesBag as a methodology is that its only hyperparameter, the bootstrap dataset size M, is straightforward to set. Specifically, M = N is a natural, theoretically well-justified choice that, while slightly conservative, yields reproducible inferences.
In terms of computation, when using the approximation in Eq. (2), there is an additional cost due to the need to compute the posterior for each bootstrapped dataset. However, it is trivial to compute the bootstrapped posteriors in parallel. As described in Appendix A.1, validating that the number of bootstrap datasets B is sufficiently large only requires computing simple Monte Carlo error bounds. Moreover, defaulting to B = 50 or 100 appears to be an empirically sound choice across a range of problems. Nonetheless, speeding up BayesBag with more specialized computational methods could be worthwhile in some applications. For example, in Appendix A.2, we suggest one simple approach to speeding up Markov chain Monte Carlo (MCMC) runs when using BayesBag. Pierre Jacob has proposed using more advanced unbiased MCMC techniques for potentially even greater computational efficiency.¹ Another benefit of BayesBag is that it incorporates robustness features of frequentist methods into Bayesian inference without sacrificing the core benefits of the Bayesian approach, such as flexible modeling, straightforward integration over nuisance parameters, and the use of prior information. Further, our Jeffrey conditionalization interpretation establishes solid epistemological foundations for using BayesBag. Thus, it provides an appealing and philosophically coherent synthesis of Bayesian and frequentist approaches without introducing difficult-to-choose tuning parameters and without sacrificing the most useful parts of Bayesian inference.
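As one illustrative version of such a Monte Carlo check (not necessarily the exact Appendix A.1 procedure), the Monte Carlo standard error of a bagged-posterior functional can be bounded from the spread of the per-bootstrap posterior means, since the bagged estimate is an average over B independent bootstrap posteriors:

```python
import numpy as np

def bootstrap_mc_error(per_bootstrap_means):
    """Monte Carlo standard error of a bagged-posterior mean: the
    bagged estimate averages B per-bootstrap posterior means, so its
    standard error is their sample standard deviation over sqrt(B)."""
    b = len(per_bootstrap_means)
    return np.std(per_bootstrap_means, ddof=1) / np.sqrt(b)
```

If this error is small relative to the bagged posterior's standard deviation, increasing B further would change the reported summaries negligibly.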

B.2. Fixed design linear regression simulations
To simulate data for a fixed design scenario, we set z_n0 = 1 to include an intercept, set covariates z_n1 and z_n2 to be a uniform grid on [−2, 2] × [−2, 2], and generate the remaining covariates as i.i.d. N(0, 1). We use the (well-specified) linear regression function f(z) = z and, to introduce misspecification, generate the outcomes as in Eq. (13) but with heteroskedastic noise given by ϵ_n | z_n ∼ N(0, 1 + z_n1² + z_n2²), independently across n. Figure B.4 shows that standard Bayes exhibits poor overlap behavior, similar to the case of nonlinear-correlated-2 data (Fig. B.1), whereas BayesBag has overlap probability very close to 1 at every test point. BayesBag also has superior predictive performance, with 99% confidence intervals for the difference in mean log predictive densities of (0.49, 0.69) and (1.00, 1.38) for N = D = 256 and 400, respectively.
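The fixed-design generation just described can be sketched as follows. The coefficient vector `beta` is a placeholder, since its value is specified elsewhere in the paper:

```python
import numpy as np

def simulate_fixed_design(n_grid=16, d=5, rng=None):
    """Sketch of the fixed-design simulation: intercept, (z1, z2) on a
    uniform grid over [-2, 2]^2, remaining covariates i.i.d. N(0, 1),
    linear mean, and heteroskedastic noise with variance
    1 + z1^2 + z2^2."""
    rng = np.random.default_rng(rng)
    g = np.linspace(-2, 2, n_grid)
    z1, z2 = map(np.ravel, np.meshgrid(g, g))
    n = n_grid**2
    z = np.column_stack([np.ones(n), z1, z2,
                         rng.normal(size=(n, d - 3))])
    beta = np.ones(d)                              # placeholder coefficients
    noise_sd = np.sqrt(1 + z1**2 + z2**2)          # heteroskedastic scale
    y = z @ beta + noise_sd * rng.normal(size=n)
    return z, y
```

The noise variance grows with the grid coordinates, so a homoskedastic linear model fit to these data is misspecified even though its mean function is correct.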
Proof of Theorem 3.1. Since X and X̃ are independent and identically distributed given ϑ,
P(A_X ∩ A_X̃ ≠ ∅) ≥ P(ϑ ∈ A_X, ϑ ∈ A_X̃) = E[P(ϑ ∈ A_X | ϑ) P(ϑ ∈ A_X̃ | ϑ)] = E[P(ϑ ∈ A_X | ϑ)²] ≥ E[P(ϑ ∈ A_X | ϑ)]² = P(ϑ ∈ A_X)² = E[P(ϑ ∈ A_X | X)]² ≥ (1 − α)²,
where the second inequality is Jensen's, and in the last step we use that P(ϑ ∈ A_x | x) ≥ 1 − α for all x.
Proof of Theorem 3.2. To handle both the standard Bayes and BayesBag cases simultaneously, consider a multivariate normal posterior on θ with mean R_M X̄_N ∈ R^D and covariance matrix V_M + bM⁻¹ R_M Σ̂_N R_M; then standard Bayes is the case of M = N and b = 0, while BayesBag is the case of b = 1. The posterior of u⊤θ is then N(u⊤R_M X̄_N, σ²_{X_{1:N}}), where σ²_{X_{1:N}} := u⊤V_M u + bM⁻¹ u⊤R_M Σ̂_N R_M u. Thus, a 100(1 − α)% credible interval for u⊤θ is given by A^b_{X_{1:N}} = u⊤R_M X̄_N ± z_{α/2} σ_{X_{1:N}}. Letting X_{1:N} and Y_{1:N} be independent data sets drawn i.i.d. from P•, we have
P(A^b_{X_{1:N}} ∩ A^b_{Y_{1:N}} ≠ ∅) = P(|u⊤R_M(X̄_N − Ȳ_N)| ≤ z_{α/2}(σ_{X_{1:N}} + σ_{Y_{1:N}})).
By the central limit theorem, N^{1/2}(X̄_N − Ȳ_N) →D N(0, 2Σ•). By assumption, M/N → c > 0 as N → ∞, which implies that M → ∞. Recalling that R_M = (V₀⁻¹V/M + I)⁻¹ and V_M = (V₀⁻¹ + MV⁻¹)⁻¹, we have R_M → I and NV_M → V/c as N → ∞. Thus, by the strong law of large numbers, N^{1/2}σ_{X_{1:N}} → (u⊤Vu/c + bu⊤Σ•u/c)^{1/2} almost surely as N → ∞, and likewise for N^{1/2}σ_{Y_{1:N}}. Therefore, by Slutsky's theorem, as N → ∞,
P(A^b_{X_{1:N}} ∩ A^b_{Y_{1:N}} ≠ ∅) → P(|W| ≤ z_{α/2} (2(u⊤Vu + bu⊤Σ•u)/(c u⊤Σ•u))^{1/2}),
where W ∼ N(0, 1). This proves the theorem.

Fig 1:
Fig 1: Standard and bagged posterior distributions of the mean for a Gaussian location model assuming the data are i.i.d. N(µ, 1), when the data are actually i.i.d. N(0, 5²). Posteriors for six randomly generated data sets of size N = 100 are shown. (a) Many pairs of posterior distributions have essentially no overlap with each other, and 5 out of 6 do not contain the true mean in their 95% central credible sets. (b) All pairs of bagged posterior distributions have significant overlap, and 6 out of 6 contain the true mean in their 95% central credible sets.
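The setting of Fig 1 is easy to reproduce in a few lines. This is a sketch using the flat-prior N(µ, 1) model; B and the draw counts are illustrative choices:

```python
import numpy as np

def location_posteriors(x, B=100, rng=None):
    """For the N(mu, 1) model with a flat prior, the standard posterior
    is N(xbar, 1/N); the bagged posterior pools draws from N(xbar_b, 1/N)
    over bootstrap resamples b. When the data are really N(0, 5^2), the
    bagged posterior is appropriately wider."""
    rng = np.random.default_rng(rng)
    n = len(x)
    standard = rng.normal(x.mean(), 1 / np.sqrt(n), size=2000)
    bagged = []
    for _ in range(B):
        xb = rng.choice(x, size=n, replace=True)
        bagged.append(rng.normal(xb.mean(), 1 / np.sqrt(n), size=20))
    return standard, np.concatenate(bagged)

rng = np.random.default_rng(1)
data = rng.normal(0, 5, 100)          # model assumes variance 1: misspecified
std_draws, bag_draws = location_posteriors(data, rng=1)
```

Here the standard posterior's spread reflects the (wrong) unit variance assumption, while the bagged posterior's spread additionally reflects the sampling variability of the bootstrap means, which is what restores overlap across replicate data sets.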

Fig 3:
Fig 3: Proportion of test points Z_i^test for which the Bayes overlap probability satisfies the lower bound.

Fig 4:
Fig 4: 99% confidence intervals for the difference in the mean log predictive densities of the standard and BayesBag posteriors (paired t intervals), with values greater than zero indicating superior performance by BayesBag. Note the different scales for linear versus nonlinear.

Fig 5:
Fig 5: For crime data using a sparse Poisson regression model, shown are histograms of the overlap probability for Z⊤β, where Z is drawn from a held-out test set. For most replicates, the overlap probabilities for the standard posteriors are below (1 − α)² for 1 − α ∈ {0.8, 0.9, 0.95}. Meanwhile, for all replicates, the overlap probabilities for the bagged posteriors are greater than (1 − α)².