Reproducible Model Selection Using Bagged Posteriors

Bayesian model selection is premised on the assumption that the data are generated from one of the postulated models. However, in many applications, all of these models are incorrect (that is, there is misspecification). When the models are misspecified, two or more models can provide a nearly equally good fit to the data, in which case Bayesian model selection can be highly unstable, potentially leading to self-contradictory findings. To remedy this instability, we propose to use bagging on the posterior distribution ("BayesBag") -- that is, to average the posterior model probabilities over many bootstrapped datasets. We provide theoretical results characterizing the asymptotic behavior of the posterior and the bagged posterior in the (misspecified) model selection setting. We empirically assess the BayesBag approach on synthetic and real-world data in (i) feature selection for linear regression and (ii) phylogenetic tree reconstruction. Our theory and experiments show that, when all models are misspecified, BayesBag (a) provides greater reproducibility and (b) places posterior mass on optimal models more reliably, compared to the usual Bayesian posterior; on the other hand, under correct specification, BayesBag is slightly more conservative than the usual posterior, in the sense that BayesBag posterior probabilities tend to be slightly farther from the extremes of zero and one. Overall, our results demonstrate that BayesBag provides an easy-to-use and widely applicable approach that improves upon Bayesian model selection by making it more stable and reproducible.


Introduction
In Bayesian statistics, the usual method of quantifying uncertainty in the choice of model is simply to use the posterior distribution over models. An implicit assumption of this approach is that one of the assumed models is exactly correct. But it is widely recognized that in practice, this assumption is typically unrealistic. When all of the models are incorrect (that is, they are misspecified ), the posterior concentrates on the model that provides the best fit in terms of Kullback-Leibler divergence (Berk, 1966). However, when two or more models can explain the data almost equally well, the posterior becomes unstable and can yield contradictory results when seemingly inconsequential changes are where the sum is over all possible N M bootstrap datasets of M samples drawn with replacement from the original dataset. The BayesBag approach is to use Q * (m | x) to quantify uncertainty in the model m. In practice, we can approximate Q * (m | x) by generating B bootstrap datasets x * (1) , . . . , x * (B) , where each x * (b) consists of M samples drawn with replacement from x, yielding the approximation Hence, BayesBag is easy to use since the bagged posterior model probability is simply an average over Bayesian model probabilities. No additional algorithmic tools are needed beyond what a data analyst would normally use for posterior inference. Implementing BayesBag via (2) does require more computation since one must approximate B posteriors (one for each bootstrap dataset), where typically B ≈ 100. However, this drawback is minimized by the fact that each posterior can be approximated in parallel, which is ideal for modern cluster-based high-performance computing environments.
Despite its attractive features, there has been limited methodological or theoretical work on BayesBag prior to the present paper. Bühlmann (2014) and Huggins and Miller (2019) consider BayesBag in the parameter inference and prediction setting. In this paper, we focus on the use of BayesBag for model selection, which has been explored empirically in an application to phylogenetic tree reconstruction (Douady et al., 2003;Waddell, Kishino and Ota, 2002). Building off this previous work, our primary contributions are: selection is unstable, while BayesBag model selection remains stable. Our analysis quantifies the effects of the relevant factors such as the mean and variance of the log-likelihood ratios and the correlation structure of the log-likelihoods.
2. We provide concrete guidance on selecting the bootstrap dataset size M and, via our theory, we clarify the effect of M on the stability of BayesBag model selection.
3. We verify through numerical experiments on synthetic and real data that, when all of the models are misspecified, BayesBag model selection leads to more stable inferences across datasets and small model changes, while Bayesian model selection is unstable. When one of the models is correctly specified, BayesBag is slightly more conservative than Bayesian model selection, in the sense that the bagged posterior probabilities tend to be slightly farther from zero and one.
In short, we find that in the presence of misspecification, model selection with the bagged posterior has appealing statistical properties while also being easy to use and computationally tractable on practical problems.
The paper is organized as follows. Section 2 provides an overview of our theory, methodology, and experiments, and how they relate to previous work. In Section 3, we present our theoretical results, illustrate the theory graphically, discuss the use of BayesBag for model criticism, and outline our recommended workflow. Section 4 contains a simulation study using BayesBag for feature selection in linear regression. In Section 5, we evaluate BayesBag on real-world data in applications involving (i) feature selection for linear regression and (ii) phylogenetic tree reconstruction. We conclude in Section 6 with a discussion of current limitations and future directions. All data and code for the results in this paper are available at https://github.com/TARPS-group/ bayesbag-model-selection-code.

Theory
It has long been known that when the best fit to the data distribution is attained by more than one model, the posterior typically does not converge on a single model (Berk, 1966). In Theorems 3.1 and 3.2, we characterize the asymptotic distribution of the posterior on models in this setting, for both the usual posterior ("Bayes") and the bagged posterior ("BayesBag"). More generally, our theory covers the case of multiple misspecified models with approximately equally good fit.
Suppose the observed data x 1 , . . . , x N are realizations of independent and identically distributed (i.i.d.) random variables X 1 , . . . , X N ∈ X, and denote X 1:N = (X 1 , . . . , X N ). First, consider the special case of two distinct models, M = {m 1 , m 2 }. Assume these models are asymptotically equally misspecified in the sense that lim N →∞ Then under mild conditions, Theorem 3.1 (part 1) shows that the Bayes posterior mass on model m 1 converges in distribution to a Bern(1/2) random variable: In other words, when N is large, with probability 1/2 model m 1 has posterior probability ≈ 1 and otherwise it has posterior probability ≈ 0. Since, asymptotically, both models provide equally good fit to the true data-generating distribution, one might hope that Q(m 1 | X 1:N ) → 1/2. However, (3) describes the opposite behavior: a single arbitrary model has posterior probability 1.
We show that BayesBag model selection does not suffer from this pathological behavior (Theorem 3.1, part 2). In the special case above (two models with asymptotically equally good fit), when M = N , the bagged posterior probability of model m 1 converges in distribution to a uniform random variable on the interval from 0 to 1: Alternatively, if we choose M such that M/N → 0 and M/N 1/2 → ∞, then the bagged posterior mass on model m 1 has the appealing behavior of converging to 1/2: This is not simply due to the bagged posterior reverting to the prior; this result holds for any prior giving positive mass to both models. Theorem 3.2 extends Theorem 3.1 to the case of more than two models, in which case the asymptotic distribution depends on the covariance structure of the log marginal likelihoods of the models. Corollary 3.3 extends Theorem 3.1 to the case of models with non-trivial parameter spaces.
In practice, it is unlikely that two models would fit the true data-generating distribution exactly equally well. However, even if, say, model m 1 has posterior probability tending to 1 asymptotically, for a finite sample size it may be that N −1/2 E{log p(X 1:N | m 1 )− log p(X 1:N | m 2 )} ≈ 0, such that with probability ≈ 1/2, model m 2 has posterior probability near 1. Indeed, the analysis of Yang and Zhu (2018) was motivated by observations of this phenomenon in Bayesian phylogenetic tree reconstruction (Alfaro, Zoller and Lutzoni, 2003;Douady et al., 2003;Wilcox et al., 2002), though it occurs more generally (Meng and Dunson, 2020), such as in neuroscience and economic modeling (Oelrich et al., 2020).
To understand this kind of finite-sample behavior via an asymptotic analysis, Theorems 3.1 and 3.2 are formulated for sequences of models for N = 1, 2, . . . that are not exactly equally good, but are asymptotically comparable in the sense that the expected log-likelihood ratios between models are O(N 1/2 ). In this way, our results provide insight into cases where the models are not dependent on N but the sample size is not yet large enough for the posterior to concentrate at the best fitting model(s).

Methodology
BayesBag requires the choice of a bootstrap dataset size M and the number of bootstrap datasets B. The choice of B controls the accuracy of the Monte Carlo approximation to the bagged posterior; see (1) and (2). It is straightforward to empirically estimate the error using the standard formula for the variance of a Monte Carlo approximation (Huggins and Miller, 2019). If M = N , we have found B = 100 to be sufficient in all of the applications we have considered. For M < N, the following result provides a natural lower bound on B to ensure all available data are used with high probability.
Proposition 2.1. For N > 1, if B ≥ (N − 1/2) log(N/δ)/M then the probability that all observations are included in at least one bootstrap sample is greater than 1 − δ.
While Proposition 2.1 offers a minimum value for B, we still recommend checking that the Monte Carlo standard error is sufficiently small for the application at hand.
For the choice of M , our theoretical and empirical results indicate that M = N 0.95 is a good default choice that will behave fairly well for model selection, both in cases where one model is correctly specified and, at the opposite extreme, when multiple misspecified models explain the data-generating distribution equally well. If significant misspecification is likely and there is a sufficient amount of data, a more aggressive choice such as M = N 0.75 could be appropriate. A recommended workflow is in Section 3.3.

Experiments
We validate our theory and proposed methods through simulations on feature selection for linear regression, and we evaluate the performance of BayesBag on real-data applications involving feature selection and phylogenetic tree reconstruction. Overall, our empirical results demonstrate that in the presence of significant misspecification, the bagged posterior produces more stable inferences and puts significant mass on optimal models more reliably than the usual Bayes posterior; on the other hand, when one of the models is correctly specified, the bagged posterior with N 0.95 ≤ M ≤ N is slightly more conservative than the posterior. Thus, BayesBag leads to more stable model selection results that are robust to minor changes in the model or representation of the data.

Related work
First, we discuss previous work in the parameter inference and prediction setting, with a model smoothly parameterized by θ ∈ Θ ⊆ R D . In a short discussion paper, Bühlmann (2014) introduced the name "BayesBag" to refer to bagging the posterior, and he presented a few simulation results in a simple Gaussian location model. However, Bühlmann (2014) employed a parametric bootstrap, which does not provide much benefit in a misspecified setting. In contrast, in recent work (Huggins and Miller, 2019), we found that using the nonparametric bootstrap to implement BayesBag yielded significant benefits for parameter inference and prediction under misspecification. In that work, we developed asymptotic theory for uncertainty quantification of the Kullback-Leibler optimal parameter, providing insight into how to choose the bootstrap dataset size (M = 2N if the model is correctly specified, and M = N if the model is misspecified). Neither paper considers model selection, which raises fundamentally different issues because it involves a discrete space where smoothness does not play a role. Notably, our recommendation in this paper to take M = o(N ) for model selection is very different from our recommendations for parameter inference and prediction.
The previous work most closely related to the present work is a mix of empirical investigation (Douady et al., 2003;Oelrich et al., 2020;Waddell, Kishino and Ota, 2002) and theoretical work (Bühlmann and Yu, 2002;Oelrich et al., 2020;Yang and Zhu, 2018). The purely empirical papers undertake limited investigations in the setting of phylogenetic tree inference: Waddell, Kishino and Ota (2002) focus primarily on speeding up model selection and Douady et al. (2003) mainly aim to compare Bayesian inference to the bootstrap. Our Theorem 3.1 is similar in spirit to the bagging result of Bühlmann and Yu (2002, Proposition 2.1). However, the Bühlmann and Yu (2002) result is not applicable in the model selection setting since it would require assigning probability 1 to whichever model has the larger marginal likelihood-which does not correspond to Bayesian model selection-and then applying bagging to this selection procedure. Our other results (Theorem 3.2 and Corollary 3.3) go well beyond the scope of the Bühlmann and Yu (2002) result, covering three or more models as well as nontrivial parameter spaces.
Regarding the behavior of Bayesian model selection under the usual posterior, Yang and Zhu (2018) prove a result similar to (3) but more limited than our general versions in part 1 of Theorems 3.1 and 3.2. Finally, Oelrich et al. (2020) provide complementary results to our own: they study additional real-world examples of overconfident model selection and, in the feature selection setting, analyze the mean and variance of the log marginal likelihood ratio for a particular type of linear regression model with known variance, offering a more precise characterization of the posterior in that particular setting. However, they do not analyze or consider using the bagged posterior.

Theory and methodology
In this section, we present our theoretical results, illustrate the theory with plots comparing the asymptotics of BayesBag versus Bayes (Section 3.1), discuss the use of BayesBag for model criticism (Section 3.2), and provide a recommended workflow (Section 3.3).

Asymptotic analysis
In Bayesian model selection, we have a countable set of models M. Assume that model m ∈ M has prior probability Q 0 (m) > 0 and marginal likelihood where θ m ∈ Θ m is an element of a model-specific parameter space with prior distribution Π 0 (dθ m | m). Assume X 1 , X 2 , . . . are i.i.d. from some unknown distribution P • . Further, for each m ∈ M, assume there is a unique parameter Two models with degenerate parameter spaces We first state our asymptotic theory in the case of two misspecified models, M = {m 1 , m 2 }, since the results are more intuitive in this special case. For the moment, we also assume that each model contains a single parameter value (that is, |Θ m | = 1). On the other hand, we allow the observation model p N (X n | m) to depend on the number of observations N , so that p(X 1: denote the log-likelihood ratio for each observation. To perform an asymptotic analysis that captures the behavior of the nonasymptotic regime in which the mean of Z N is comparable to its standard deviation, we assume that μ ∞ := lim N →∞ N 1/2 E(Z N 1 ) and σ 2 ∞ := lim N →∞ Var(Z N 1 ) exist. Thus, when N is large, E(Z N ) ≈ N 1/2 μ ∞ and Std(Z N ) ≈ N 1/2 σ ∞ . Consequently, E(Z N ) does not overwhelm Std(Z N ), even in the asymptotic regime. The asymptotic effect size η ∞ := μ ∞ /σ ∞ quantifies the amount of evidence in favor of m 1 under the true distribution P • . If η ∞ > 0, then m 1 is favored, whereas m 2 is favored if η ∞ < 0.
Our first result shows that (1) the posterior probability of model m 1 converges to a Bernoulli random variable with parameter depending on η ∞ , and (2) the bagged posterior probability of model m 1 converges to a continuous random variable on [0, 1] with a distribution that depends on η ∞ and lim N →∞ M/N . For μ ∈ R and σ 2 > 0, let N (μ, σ 2 ) denote the normal distribution with mean μ and variance σ 2 , and let Φ(t) denote the cumulative distribution function of the standard normal distribution N (0, 1).
In particular, for the usual posterior, if η ∞ = 0 then Q(m 1 | X 1:N ) Theorem 3.1 will follow as an immediate corollary of Theorem 3.2 below. Note that in part 2, when c > 0, the cumulative distribution function of the random variable Thus, by differentiating this function, we find that the density of U * is, for u ∈ (0, 1), Figure 1 illustrates how Theorem 3.1 establishes the greater stability of BayesBag versus Bayes for model selection. Even for effect sizes η ∞ > 1, which should strongly favor model m 1 , the Bayes posterior overwhelmingly favors model m 2 with non-negligible probability -that is, P{Q(m 1 | X 1:N ) ≈ 0} ≈ 0. On the other hand, the probability that the BayesBag posterior strongly favors model m 2 goes to zero more rapidly as η ∞ increases -that is, P{Q * (m 1 | X 1:N ) ≈ 0} → 0 more rapidly as η ∞ grows. For example, when η ∞ = 2 and c = 1, P(U = 0) > 0.02 whereas P(U * < 0.1) < 7 × 10 −5 . Thus, in this example, Bayes will overwhelmingly favor the "wrong" model in approximately 1 out of 50 experiments, whereas BayesBag will strongly favor the wrong model in only approximately 7 out of 100,000 experiments.
Extension to three or more models In the case of three or more models, the behavior of the posteriors is more complicated because there is dependence on both the correlation structure and the relative variances of the log-likelihood ratios between each pair of models. Consider the case of K < ∞ models and enumerate them from 1 to K, so that , and η ∞ is the asymptotic effect size in favor of m 1 (see Theorem 3.1). (a) U = 0 represents the event that the Bayes posterior overwhelmingly favors the wrong model (or equally good model, if η ∞ = 0) -that is, the model with lower (or equal) expected log-likelihood under the true distribution. Likewise, U * < 0.1 is the event that the BayesBag posterior strongly favors the wrong (or equivalent) model.

Without loss of generality, consider the probability of
The proof is in Section S.3.2 of the Supplementary Material (Huggins and Miller, 2022). Figure 2 shows how Theorem 3.2 establishes that across a range of mean and covariance structures of the log-likelihoods, BayesBag is more stable than Bayes. Indeed, both methods behave fairly consistently as the covariance structure varies.
Extension to non-degenerate parameter spaces We now extend Theorem 3.1 to nondegenerate parameter spaces Θ 1 ⊂ R D1 and Θ 2 ⊂ R D2 and we integrate over θ m ∈ Θ m for each model m. To avoid tedious arguments, we only consider the case where μ ∞ = 0. For m ∈ {m 1 , m 2 }, let m,θm (X n ) := log p θm (X n | m) and recall that the optimal parameter is θ m . We will assume that conditionally on X 1:∞ , for almost every X 1:∞ , where X * 1:M is bootstrapped from X 1:N and O P + (1) denotes a random quantity which is bounded in (outer) probability. It is well known that (4) holds with X 1:N in place of X * 1:M , under standard regularity assumptions (Clarke and Barron, 1990). Thus, we expect (4) to hold under similar but slightly stronger conditions, since we must consider a triangular array rather than a sequence of random variables.
The posterior distribution given X 1:N and m is The bagged posterior Π * (· | X 1:N , m) given X 1:N and m is defined such that θm m,θm (X 1 )} denote the Fisher information matrix. Finally, for a measure ν and function f , we use the shorthand notation ν(f ) := f dν.
2 ) as θ m → θ m• ; (iv) J θm• is an invertible matrix; and X 1:N , m), it holds that conditionally on X 1:∞ , for almost every X 1:∞ , for every sequence of constants C N → ∞, Further, assume that (4) holds, Then the conclusions of Theorem 3.1 apply in the case of η ∞ = 0.
The proof is in Section S.3.3 of the Supplementary Material.
Extension to dependent data A further extension, which we will not pursue in detail, is to non-independent data such as those encountered in time-series and spatial data analysis. In principle the generalization to, for example, time-series using the block bootstrap (or another nonparametric estimator such as a Gaussian process) is straightforward. However, the accompanying theory is much less straightforward since we must (A) determine the asymptotic distribution of rescaled log marginal likelihoods N −κ log p(m | X 1:N ) and (B) show that a nonparametric estimator has the same asymptotic distribution. More concretely, consider the two-model scenario and define W (X 1: where L(ξ) denotes the law of a random variable ξ and the metric d C is defined in Section S.3.2 of the Supplementary Material. Then, for (B) we must show that for data X * 1:M distributed according to the nonparametric estimator, We leave a thorough investigation of models for dependent data to future work.
where NA is short for "not available." The interpretation is as follows: I(f ) ≈ 0 indicates no evidence of mismatch; I(f ) > 0 (respectively, I(f ) < 0) indicates the Bayes posterior is overconfident (respectively, under-confident); I(f ) = NA indicates that either there is severe model-data mismatch or the required asymptotic assumptions do not hold (for example, due to multimodality in the posterior or small sample size). We refer the interested reader to Huggins and Miller (2019) for more justification and description of a non-asymptotic version of I.
The reference model should be chosen such that if any model m ∈ M is well-specified, then the reference model is well-specified. One common case is a finite set of models with partial order ≺ based on inclusion such that there exists a unique maximal model; in this case, the maximal model can be used as the reference model. More precisely, let P m := {p θm (· | m) : θ m ∈ Θ m }. Then for models m, m ∈ M, m ≺ m if and only if P m ⊆ P m , and m is the unique maximal model if m ≺ m for all m ∈ M. Feature selection (Section 4) is an example of this type, where the maximal model includes all features. Another common situation is when all models have a set of shared, interpretable parameters, in which case we can define the reference model to be the disjoint union of all models m ∈ M. Phylogenetic tree reconstruction (Section 5) is an example of this type.
When there is more than one univariate quantity of inferential interest, we consider a collection of functions f ∈ F and suggest taking the most pessimistic mismatch value: I(F) := sup f ∈F I(f ). In general, F can be chosen to reflect the quantities of interest in the application at hand. When θ ∈ R D , two natural choices for the collection F are F 1 := {θ → w θ : w 2 = 1} and F proj = {θ → θ d : d = 1, . . . , D}. In our experiments, we use the latter and therefore adopt the shorthand notation I := I(F proj ).

Recommended workflow
Algorithm 1 outlines our recommended workflow for using BayesBag; here, dim(Θ m ) is the dimensionality of the parameter space of model m. In steps 1-3, we suggest computing the mismatch index with M = N since the definition of the mismatch index is based on asymptotics, and thus, it is desirable to make M large in order to improve the accuracy of this asymptotic approximation. The mismatch index assesses the fit of the usual posterior, so there is no reason to use the same value of M that is used for robust inference with BayesBag. For very large datasets, it may be preferable to compute the mismatch index using M < N in order to reduce the computation required. In steps 4-7, our recommendations for when to use M = N α with α = 0.95 versus α = 0.75, and these particular values of α, should be taken as rough guidelines. The condition m∈M dim(Θ m ) > N 0.75 is meant to capture being in the "small-data" regime, where using very small bootstrap dataset sizes may result in unsatisfactory estimation accuracy. However, this precise condition may not always be appropriate; for example, it could be more appropriate to instead use max m∈M dim(Θ m ) > N 0.75 when the models are nested.

Simulation study
To validate our theory and assess the performance of BayesBag for model selection, we carry out a simulation study in the setting of feature selection for linear regression.
Model The data consist of regressors Z n ∈ R D and observations Y n ∈ R for n = 1, . . . , N, and the goal is to predict Y n given Z n . For each γ ∈ {0, 1} D , define a model such that the dth regressor is included in the linear regression if and only if γ d = 1. Letting D γ := D d=1 γ d denote the number of regressors in model γ and k ∈ {1, . . . , D} denote the maximum number of regressors to include, we consider a collection of models denote the matrix with the nth row equal to Z n and let Z γ denote the submatrix of Z that includes the dth column if and only if γ d = 1. Conditional on γ ∈ M k , the assumed model is We parameterize the model as θ = (θ 0 , . . . , θ Dγ ) = (log σ 2 , β 1 , . . . , β Dγ ) ∈ Θ γ = R Dγ +1 . To perform posterior inference for γ, we analytically compute the marginal likelihood for each γ ∈ M k , integrating out σ 2 and β; specifically, for Y := (Y 1 , . . . , Y N ) , we use where q 0 ∈ (0, 1) is the prior inclusion probability of each component. Thus, the posterior probability of model γ is and the posterior inclusion probability of the dth regressor is ∼ N(0, 1), and for n = 1, . . . , N, with the regressor distribution G, the regression function f , and the coefficient vector β † ∈ R D as described next. Using the linear regression function f (z) = z results in well-specified data. To generate misspecified data, we use the nonlinear component-wise cubic function f (z) = (z 3 1 , . . . , z 3 D ) . We choose G and β † in the spirit of genome-wide association study fine-mapping (Schaid, Chen and Larson, 2018) to simulate a scenario with many highly correlated regressors, of which only a few regressors are actually employed in the data-generating process. For k ∈ {1, 2}, we use a k-sparse vector (that is, a vector with k non-zero components) defined by setting β †d = 1 if d ∈ { j(D + 1 2 )/(k + 1) | j = 1, . . . , k} and β †d = 0 otherwise. For h > 2 and ψ > 0, if d is odd and ξ d = 1 otherwise. The motivation for this data simulation procedure is to generate correlated regressors that have different tail behaviors while still having the same first two moments, since regressors are typically standardized to have mean 0 and variance 1. Note that, marginally, Z 1 , Z 3 , . . . are each rescaled t-distributed random variables with h degrees of freedom such that Var(Z 1 ) = 1, and Z 2 , Z 4 , . . . are N (0, 1).

Results
We are interested in verifying the theory of Section 3 in the finite-sample regime, which suggests that when the model is misspecified, similar models may be assigned wildly varying probabilities under the usual posterior (Bayes), while the bagged posterior (BayesBag) probabilities will tend to be more balanced. Figures 3, 4, and S.1 to S.3 in the Supplementary Material show the Bayes and BayesBag posterior inclusion probabilities for each component, for all 50 replications. First, Figure 3 shows that when the model is correctly specified, the Bayes and BayesBag posteriors with α ≥ 0.95 behave similarly. However, BayesBag can be more stable even in this well-specified setting, exhibiting fewer outlier posterior inclusion probabilities. As α decreases, BayesBag yields substantially more conservative inferences, in the sense that the posterior inclusion probabilities tend to shrink toward the prior inclusion probability.
In the misspecified setting, the results are more interesting and subtle (Figures 4, S.1-S.3 in the Supplementary Material). Due to the misspecification and correlated regressors, it no longer holds in general that the components that were actually nonnull in the data-generating process will be selected (see Section S.2 of the Supplementary  Figure 3: Simulation results for feature selection in linear regression with k = 2 when the model contains the true distribution. The BayesBag posterior inclusion probabilities are similar to the Bayes posterior inclusion probabilities, but tend to shrink toward the prior inclusion probability (lower horizontal dotted line). The data was generated from the assumed model, Y n = Z n β † + n for n = 1, . . . , N, where n ∼ N (0, σ 2 ) with σ 2 = 1, Z n ∈ R D is a vector of covariates, and β † ∈ R D is a k-sparse vector, that is, β † has k nonzero components. The prior on inclusion vectors γ ∈ {0, 1} D is proportional to q A conjugate prior is placed on the coefficients and σ 2 given γ. The posterior inclusion probabilities were computed by analytically integrating out the parameters and summing over all binary inclusion vectors γ. The figure shows results for simulations using (a) D = 10, N = 50, k = 1, (b) D = 10, N = 5,000, k = 1, (c) D = 20, N = 100, k = 2, and (d) D = 20, N = 1,000, k = 2. For each of these settings, 50 replicate datasets were generated, and the resulting posterior inclusion probabilities are shown. Components that were actually nonzero when generating the data are enclosed by red rectangles. Figure 4: Simulation results for feature selection in linear regression with 1-sparsenonlinear data (so the model is misspecified) and k = 2. Everything is the same as in Figure 3(a, b), except that the data was generated using Y n = f (Z n ) β † + n , where f (z) = (z 3 1 , . . . , z 3 D ) . Results are shown for (a) N = 50, (b) N = 500, (c) N = 5,000, and (d) N = 50,000. See the caption of Figure 3 for further explanation. The Bayes posterior inclusion probabilities show considerable instability both (i) across datasets with N fixed and (ii) as N increases. Meanwhile, the BayesBag probabilities are much more stable, particularly for M = N α with α ≤ 0.75. The component that was actually nonzero when generating the data is enclosed by a red rectangle; see the text for interpretation.
Material and Buja et al., 2019a,b). For the 1-sparse-nonlinear data, when k = 1, the Bayes and BayesBag posteriors behave quite similarly and concentrate on component 5, which is asymptotically optimal; see Figure S.1 in the Supplementary Material. However, when k = 2, two models are asymptotically optimal and equivalent, namely, the models with supp ( For the k-sparse-linear data, the overall mismatch indices were either near zero or were NA, reflecting that the model is correctly specified but there are some issues with poor identifiability. For the k-sparse-nonlinear data, the mismatch indices were nearly all NA, reflecting that the model is misspecified and there may also be identifiability issues.
Summary Overall, the simulation results are in agreement with our asymptotic theory from Section 3: the behavior of Bayes can vary dramatically with the dataset size and the degree of misspecification, whereas BayesBag is much more stable. Additionally, the simulations provide insight into the behavior of the bagged posterior when M is sublinear in N . Of particular note is that M = N 0.95 yields noticeably improved stability with little loss of statistical efficiency. Meanwhile, for settings with substantial misspecification, taking M = N α with α ∈ [0.55, 0.75] may be preferable -with the caveat that inferences will tend to be more conservative.

Feature selection for linear regression
We compare Bayesian model selection and BayesBag model selection for linear regression on four real-world datasets, summarized in Table 1. Based on our findings in Section 4, for BayesBag we consider M = N α with α ∈ {1.0, 0.95, 0.75}. We use a prior inclusion probability of q 0 = 3/D and use k = D for the maximum number of nonzero components, except on the residential building dataset, where for computational tractability we use k = 3. We set the model hyperparameters to a 0 = 2, b 0 = 1, and λ = 1.
We expect the parameters to be well-identified for all datasets except the residential building dataset, since the residential building dataset requires only 58 out of 104 principal components to explain 99% of the variance, whereas for the other three datasets,  Figure 4) with (c) N = 100 and (d) N = 10,000. We only display two components of β since the I values follow fairly similar distributions for all components. The results show that in the well-specified setting, when N is sufficiently large (panel (b)), I tends to be near zero, indicating correct specification as expected. An exception is that the I value for β 11 is closer to −1, indicating that the Bayes posterior on β 11 may be somewhat underconfident. When N is small (panel (a)), I is often NA in these simulations, reflecting the poor identifiability of the coefficients due to strong correlation in the regressors. Meanwhile, in the misspecified setting (panels (c) and (d)), I is typically NA for the coefficients, reflecting that the model is misspecified and there may also be identifiability issues.
D out of D principal components are needed to explain 99% of the variance. The model mismatch indices (for the reference model with γ d = 1 for all d = 1, . . . , D) are in agreement with expectations, since only the residential building dataset has a model mismatch index of NA. For the other datasets, the mismatch indices are 1.00 (California housing), 0.62 (Boston housing), and 0.03 (Diabetes), which suggests that the model is misspecified for the two housing datasets. Figure 6 shows the posterior inclusion probabilities for all four datasets. To compare the reliability of the methods, we also run each method on subsets of the data obtained by randomly dividing each dataset into roughly equally sized splits ( Figure 6). We use three splits for all datasets except for California housing, for which we use five splits since N is substantially larger. Generally, across splits, BayesBag produced lower-variance, more conservative posterior inclusion probabilities that are more consistent with the

Phylogenetic tree reconstruction
Finally, we investigate the use of BayesBag for reconstructing the phylogenetic tree of a collection of species based on their observed characteristics. This is an important model selection problem due to the widespread use of phylogeny reconstruction algorithms.
Systematists have exhaustively documented that Bayesian model selection of phylogenetic trees can behave poorly. In particular, the posterior can provide contradictory results depending on what characteristics are used (for example, coding deoxyribonucleic acid [DNA] or amino acid sequences), what evolutionary model is used, or which outgroups are included (Alfaro, Zoller and Lutzoni, 2003;Buckley, 2002;Douady et al., 2003;Huelsenbeck and Rannala, 2004;Lemmon and Moriarty, 2004;Waddell, Kishino and Ota, 2002;Wilcox et al., 2002;Yang, 2007). We illustrate how BayesBag model selection provides reasonable inferences that are significantly more robust to the choice of data and model.

Models and data
We use the whale dataset from Yang (2008), consisting of mitochondrial coding DNA from 13 whale species and the hippopotamus ( Table 1). The hippopotamus is included as an "outgroup" species to identify the root of the tree, because the assumed evolutionary models are time-reversible and hence the trees are modeled as unrooted. We consider four DNA models (JC, HKY+C+Γ 5 , GTR+Γ+I, and mixed+Γ 5 ) and one amino acid model (mtmam+Γ 5 ); see Yang (2008) for more details on these models and an explanation of the acronyms. For brevity, we refer to the models as, respectively, JC, HKY, GTR, mixed, and mtmam. To approximate the usual posterior (Bayes) and the bagged posterior (BayesBag), we use MrBayes 3.2 (Ronquist et al., 2012) with 2 independent runs, each with 4 coupled chains run for 1,000,000 total iterations (discarding the first quarter as burn-in). We confirm acceptable mixing Figure 6: Application to feature selection on four real-world datasets; see Table 1 for dataset details. The assumed model is the same as in Section 4 (see also the caption of Figure 3), using a prior inclusion probability of q 0 = 3/D (lower horizontal dotted line), where D is the number of regressors. To assess reproducibility, we randomly split each dataset into roughly equally sized parts, and computed the posterior inclusion probabilities for each split separately (indicated with a •) as well as for the full dataset (indicated with a ). As before, the posterior inclusion probabilities are computed by analytically integrating out the parameters and summing out all possible binary inclusion vectors. For computational tractability, we constrain the model for the residential building dataset to only allow up to k = 3 nonzero components. For visual readability, we only display the components with at least one posterior inclusion probability greater than min(0.25, 3q 0 ). The BayesBag posterior inclusion probabilities exhibit greater reproducibility, in that (i) the between-split differences tend to be smaller and (ii) the differences between the split posterior inclusion probabilities and the full data posterior inclusion probabilities also tend to be smaller for BayesBag than Bayes.
using the built-in convergence diagnostics for MrBayes. For BayesBag, we take B = 100 in all experiments and, since the number of models is very large, we only consider M ∈ {N, N 0.95 }.
Evaluation Our goal is to investigate whether BayesBag avoids the self-contradictory inferences that Bayes produces. To this end, we compare the output of different configurations of the data, model, and inference method, as follows. We compute the set of trees in the 99% highest posterior probability (HPP) regions for each data, model, inference method configuration. For selected pairs of configurations, we then compute the overlap of the two 99% HPP regions in terms of (a) probability mass and (b) number of trees. Since the BayesBag posterior is approximated via Monte Carlo as in (2), we quantify the uncertainty in each overlap by reporting an 80% confidence interval for the overlapping mass. We compute these intervals using standard bootstrap methodology for a Monte Carlo estimate.
Results First, we look at the overlap between pairs of models. As shown in Figure 7(a) and However, the good overlap between BayesBag posteriors does not necessarily mean that it is performing well, since it could simply be producing posteriors that are too diffuse, spreading the posterior mass over a very large number of trees. Notably (as expected), BayesBag with M = N 0.95 leads to a more diffuse posterior with 5-15 overlapping trees compared to 3-11 trees when M = N . To further investigate the possibility of the BayesBag posteriors being too diffuse, we consider the overlap of the BayesBag posterior for each model and the Bayes posterior for mixed, which is the most complex of the DNA models. As shown in Figure 7(b) and Table S.2 in the Supplementary Material, all of the BayesBag posteriors (with the exception of mtmam) put substantial posterior probability on the 99% HPP region of the Bayes mixed posterior. Moreover, all but BayesBag mtmam has two trees in the overlap, which is the maximum possible since the Bayes mixed 99% HPP region only contains two trees. Finally, using BayesBag with M = N 0.95 results in fairly small decreases (relative to BayesBag with M = N ) in the mass on the two trees in the Bayes mixed 99% HPP region.
Next, we perform intra-model comparisons by considering three datasets: the complete whale dataset (denoted all) and two additional datasets formed by splitting the genomic data for each species in half (denoted S1 and S2). Ideally, for each model, we hope to see substantial overlap when comparing the results across these three datasets (all, S1, and S2). However, when using the Bayes posterior, there is little to no overlap in many cases, particularly for the simpler JC model and mtmam; see Figure 7(c) and Table S.3 in the Supplementary Material. Meanwhile, the BayesBag posteriors typically exhibit overlaps of between 21% and 56%, with less (though still nonzero) overlap with mtmam. These results suggest that BayesBag exhibits superior reproducibility in terms of uncertainty quantification. Figure 7: Application to phylogenetic tree inference on a whale genetics dataset. To assess reproducibility, we computed the posterior under five different evolutionary models (JC, HKY, GTR, mixed, mtmam). We quantify the similarity of posteriors by computing the overlapping probability mass of 99% highest posterior probability regions. To quantify uncertainty in the overlap due to Monte Carlo error, 80% confidence intervals are shown for the overlaps involving BayesBag. Panel (a) shows the posterior overlap for each pair of models. The usual posterior (Bayes) is quite sensitive to the choice of model, exhibiting ≈ 0% overlap in many cases, for instance, between JC and the other models. Meanwhile, the bagged posterior (BayesBag) is more robust, exhibiting overlaps in a reasonable range. Panel (b) shows the overlap between the Bayes posterior for the mixed model, which is the most flexible of the DNA models, and the Bayes or BayesBag posterior for each other model. Panel (c) shows the overlap when using the same model on different subsets of the data -specifically, splitting the genomic data for each species into two halves (S1, S2) or using the complete data (all).
Finally, we compute the mismatch index for each model on the complete whale dataset, obtaining 0.23 (JC), NA (HKY), 0.47 (GTR), 0.84 (mixed), and 0.34 (mtmam). These mismatch indices suggest significant amounts of model misspecification, with the

Discussion
In this paper, we have developed an approach to overcome the instability of Bayesian model selection when the models are all misspecified. This type of misspecification is common in scientific settings where idealized but interpretable models are commonly used (such as in systematics, population and cancer genetics, and economics). Our bagged posterior approach is theoretically justified, easy to use, and widely applicable. However, we see three potential limitations in practice. The first is that bagged posterior model selection tends to be more conservative, with posterior model probabilities farther from the extremes of zero and one. The recommended workflow discussed in Section 3.3 is designed to at least partially ameliorate this issue, however, this conservative behavior may be a necessary price for greater stability and reliability. The second limitation is the additional computational cost required for the naive estimation of the bagged model probabilities. The development of more computationally efficient alternatives is an important direction for future work. A final limitation is that our asymptotic theory only covers cases where the observations are independent. Extending the theory to cover important structured models like time-series and spatial models is another valuable direction for future work.