Bayesian hierarchical stacking: Some models are (somewhere) useful

Stacking is a widely used model averaging technique that asymptotically yields optimal predictions among linear averages. We show that stacking is most effective when model predictive performance is heterogeneous in inputs, and we can further improve the stacked mixture with a hierarchical model. We generalize stacking to Bayesian hierarchical stacking. The model weights are varying as a function of data, partially-pooled, and inferred using Bayesian inference. We further incorporate discrete and continuous inputs, other structured priors, and time series and longitudinal data. To verify the performance gain of the proposed method, we derive theory bounds, and demonstrate on several applied problems.


Introduction
Statistical inference is conditional on the model, and a general challenge is how to make full use of multiple candidate models. Consider data D = (y i ∈ Y, x i ∈ X ) n i=1 , and K models M 1 , . . . , M k , each having its own parameter vector θ k ∈ Θ k , likelihood, and prior. We fit each model and obtain posterior predictive distributions, The model fit is judged by its expected predictive utility of future (out-of-sample) data (ỹ,x) ∈ Y × X , which generally have an unknown true joint density p t (ỹ,x). Model selection seeks the best model with the highest utility when averaged over p t (ỹ,x). Model averaging assigns models with weight w 1 , . . . , w K subject to a simplex constraint w ∈ S K = {w : K k=1 w k = 1; w k ∈ [0, 1], ∀k}, and the future prediction is a linear mixture from individual models: p(ỹ|x, w, model averaging) = K k=1 w k p(ỹ|x, M k ), w ∈ S K .
(2) joint density p t (ỹ,x). The next step is to determine the vector 1 of weights w = (w 1 , . . . , w K ) that optimize the average log score of the stacked prediction, However, the linear mixture (2) restricts an identical set of weights for all input x. We will later label this solution (3) as complete-pooling stacking. The present paper proposes hierarchical stacking, an approach that goes further in three ways: 1. Framing the estimation of the stacking weights as a Bayesian inference problem rather than a pure optimization problem. This in itself does not make much difference in the completepooling estimate (3) but is helpful in the later development.
2. Expanding to a hierarchical model in which the stacking weights can vary over the population. If the model predictors x take on J different values in the data, we can use Bayesian inference to estimate a J ×K matrix of weights that partially pools the data both in row and column.
3. Further expanding to allow weights to vary as a function of continuous predictors. This idea generalizes the feature-weighted linear stacking (Sill et al., 2009) with a more flexible form and Bayesian hierarchical shrinkage. There are two reasons we would like to consider input-dependent model weights. First, the scoring rule measures the expected predictive performance averaged overx andỹ, as the objective function in (3) divided by n is a consistent estimate of Ex ,ỹ log K k=1 w k p(ỹ|x, D, M k ) . But an overall good model fit does not ensure a good conditional prediction at a given locationx =x 0 , or under covariate shift when the distribution of input x in the observations differs from the population of interest. More importantly, different models can be good at explaining different regions in the input-response space, which is why model averaging can be a better solution to model selection. Even if we are only interested in the average performance, we can further improve model averaging by learning where a model is good so as to locally inflate its weight.
In Section 2, we develop detailed implementation of hierarchical stacking. We explain why it is legitimate to convert an optimization problem into a formal Bayesian model. With hierarchical shrinkage, we partially pool the stacking weights across data. By varying priors, hierarchical stacking includes classic stacking and selection as special cases. We generalize this approach to continuous input variables, other structured priors, and time-series and longitudinal data. In Section 3, we turn heuristics from the previous paragraph into a rigorous learning bound, indicating the benefit from model selection to model averaging, and from complete-pooling model averaging to a local averaging that allows the model weights to vary in the population. We outline related work in Section 4. In Section 5, we evaluate the proposed method in several simulated and real-data examples, including a U.S. presidential election forecast. This paper makes two main contributions: • Hierarchical stacking provides a Bayesian recipe for model averaging with input-dependent weights and hierarchical regularization. It is beneficial for both improving the overall model fit, and the conditional local fit in small and new areas.
• Our theoretical results characterize how the model list should be locally separated to be useful in model averaging and local model averaging.

Hierarchical stacking
The present paper generalizes the linear model averaging (2) to pointwise model averaging. The goal is to construct an input-dependent model weight function w(x) = (w 1 (x), . . . , w K (x)) : X → S K , and combine the predictive densities pointwisely by p(ỹ|x, w(·), pointwise averaging) = K k=1 w k (x)p(ỹ|x, M k ), such that w(·) ∈ S X K .
If the input is discrete and has finite categories, one naïve estimation of the pointwise optimal weight is to run complete-pooling stacking (3) separately on each category, which we will label no-pooling stacking. The no-pooling procedure generally has a larger variance and overfits the data. From a Bayesian perspective, it is natural to compromise between unpooled and completely pooled procedures by a hierarchical model. Given some hierarchical prior p prior (·), we define the posterior distribution of the stacking weights w ∈ S X K through the usual likelihood-prior protocol: The final estimate of the pointwise stacking weight used in (4) is then the posterior mean from this joint density E(w(·)|D). We call this approach hierarchical stacking.

Complete-pooling and no-pooling stacking
For notational consistency, we rewrite the input variables into two groups (x, z), where x are variables on which the model weight w(x) depends during model averaging (4), and z are all remaining input variables.
To start, we consider x to be discrete and has J < ∞ categories, x = 1, . . . J. We will extend to continuous and hybrid x later. The input varying stacking weight function is parameterized by a J ×K matrix {w jk } ∈ S J K : Each row of the matrix is an element of the length-K simplex. The k-th model in cell j has the weight w k (x i ) = w jk , ∀x i = j. We fit each individual model M k to all observed data D = (x i , z i , y i ) n i=1 and obtain pointwise leave-one-out cross-validated log predictive densities: p k,−i := Θ k p(y i |θ k , x i , z i , M k )p(θ k |{(x l , y l , z l ) : l = i}, M k )dθ k .
Same as in complete-pooling stacking, here we avoid refitting each model n times, and instead use the Pareto smoothed importance sampling (PSIS, Vehtari et al., 2017Vehtari et al., , 2019 to approximate {p k,−i } n i=1 from one-time-fit posterior draws p(θ k |M k , D). The cost of such approximate leave-one-out cross validation is often negligible compared with individual model fitting.
To optimize the expected predictive performance of the pointwisely combined model averaging, we can maximize the leave-one-out predictive density On one extreme, the complete-pooling stacking (3) solves optimization (7) subject to a constant constraint w k (x) = w k (x ), ∀k, x, x . On the other extreme, no-pooling stacking maximizes this objective function (7) without extra constraint other than the row-simplex-condition, which amounts to separately solving complete-pooling stacking (3) on each input cell D j = {(x i , z i , y i ) : x i = j}.
If there are a large number of repeated measurements in each cell, n j := ||{i : x i = j}|| → ∞, then 1 n j i:x i =j log K k=1 w k (j)p k,−i becomes a reasonable estimate of the conditional log predictive density Y p t (ỹ|x = j) log K k=1 w k (j)p(ỹ|j, M k )dỹ, with convergence rate √ n j , and therefore, nopooling stacking becomes asymptotically optimal among all cell-wise combination weights. For finite sample size, because the cell size is smaller than total sample size, we would expect a larger variance in no-pooling stacking than in complete-pooling stacking. Moreover, the cell sizes are often not balanced, which entails a large noise of no-pooling stacking weight in small cells.

Bayesian inference for stacking weights
Vanilla (optimization-based) stacking (3) is justified by Bayesian decision theory: the expected log predictive density of the combined model Eỹ log The point optimum asymptotically maximizes the expected utility (Le and Clarke, 2017), hence is an M * -optimal decision in terms of Vehtari and Ojanen (2012).
To fold stacking into a Bayesian inference problem, we want to treat the objective function in (7) as a log likelihood with parameter w. After integrating out individual-model-specific parameters θ k such that p(y|x, M k ) is given, the outcomes y i at input location x i in the combined model have densities p(y . But this procedure has used data twice-in other practices, data are often used twice to pick the prior, whereas here data are used twice to pick the likelihood.
We use a two-stage estimation procedure to avoid reusing data. Assuming a hypothetically provided holdout dataset D of the same size and identical distribution as observations In the second stage we plug in the observed y i , x i , and obtain the pointwise full likelihood p(y i | w, D , . Now in lack of holdout data D , the leave-i-th-observation-out predictive density p k,−i is a consistent estimate of the pointwise out-of-sample predictive density E D (p(y i |x i , M k , D )). By plugging it into the two-stage log likelihood and integrating out the unobserved holdout data D , we get a profile likelihood This log likelihood coincides with the no-pooling optimization objective function (7).
Integrating out the hypothetical data D is related to the idea of marginal data augmentation (Meng and van Dyk, 1999). Polson and Scott (2011) took a similar approach to convert the optimization-based support vector machine into a Bayesian inference.

Hierarchical stacking: discrete inputs
The log posterior density of hierarchical stacking model (5) contains the log likelihood defined above n i=1 log K k=1 w k (x i )p k,−i , and a prior distribution on the weight matrix w = {w jk } ∈ S J K , which we specify in the following.
We first take a softmax transformation that bijectively converts the simplex matrix space S J K to unconstrained space R J(K−1) : α jk ∈ R is interpreted as the log odds ratio of model k with reference to M K in cell j.
We propose a normal hierarchical prior on the unconstrained model weights (α jk ) K−1 k=1 conditional on hyperparameters µ ∈ R K−1 and σ ∈ R K−1 The prior partially pools unconstrained weights toward the shared mean (µ 1 , . . . , µ K−1 ). The shrinkage effect depends on both the cell sample size n j (how strong the likelihood is in cell j), and the model-specific σ k (how much across-cell discrepancy is allowed in model k). If µ and σ are given constants, and if the posterior distribution is summarized by its mode, then hierarchical stacking contains two special cases: • no-pooling stacking by a flat prior σ k → ∞, k = 1, . . . , K − 1.
Instead of choosing fixed values, we view µ and σ as hyperparameters and aim for a full Bayesian solution: to describe the uncertainty of all parameters by their joint posterior distribution p(α, µ, σ|D), letting the data to tell how much regularization is desired.
To accomplish this Bayesian inference, we assign a hyperprior to (µ, σ): where normal + (0, τ σ ) stands for the half-normal distribution supported on [0, ∞) with scale parameter τ σ . Putting the pieces (5), (9), (10) together, up to a normalization constant that has been omitted, we attain a joint posterior density of all free parameters α ∈ R J×K , µ ∈ R K−1 , σ ∈ R K−1 Unlike complete and no-pooling stacking, which are typically solved by optimization, the maximum a posteriori (MAP) estimate of (11) is not meaningful: the mode is attained at the completepooling subspace α jk = µ k , σ k = 0, ∀j, k, on which the joint density is positive infinity. Instead, we sample (α, µ, σ) from this joint density (11) using Markov chain Monte Carlo (MCMC) methods and compute the Monte Carlo mean of posterior draws w jk , which we will call hierarchical stacking weights.
The final posterior predictive density of outcomeỹ at any input location (x,z) is Using a point estimate w jk is not a waste of the joint simulation draws. Because equation (12) is a linear expression on w k , and because of the linearity of expectation, using w is as good as using all simulation draws. Nonetheless, for the purpose of post-processing, approximate cross validation, and extra model check and comparison, we will use all posterior simulation draws; see discussion in Section 6.3.

Hierarchical stacking: continuous and hybrid inputs
The next step is to include more structure in the weights, which could correspond to regression for continuous predictors, nonexchangeable models for nested or crossed grouping factors, nonparametric prior, or combinations of these.

Additive model
Hierarchical stacking is not limited to discrete cell-divider x. When the input x is continuous or hybrid, one extension is to model the unconstrained weights additively: where {f m : X → R} are M distinct features. Here we have already extracted the prior mean µ k , representing the "average" weight of model k in the unconstrained space. The discrete model (9) is now equivalent to letting f m (x) = 1(x = m) for m = 1, . . . , J. We may still use the basic prior (9) and hyperprior (10): We provide Stan (Stan Development Team, 2020) code for this additive model and discuss practical hyperparameter choice in Appendix C.
Because the main motivation of our paper is to convert the one-fit-all model-averaging algorithm into open-ended Bayesian modeling, the basic shrinkage prior above should be viewed as a starting point for model building and improvement. Without trying to exhaust all possible variants, we list a few useful prior structures: • Grouped hierarchical prior. The basic model (14) is limited to have a same regularization σ k for all α mk . When the features f m (x) are grouped (e.g., f m are dummy variables from two discrete inputs; states are grouped in regions), we achieve group specific shrinkage by replacing (14) by where g[m] = 1, . . . G is the group index of feature m.
• Feature-model decomposition. Alternatively we can learn feature-dependent regularization by • Prior correlation. For discrete cells, we would like to incorporate prior knowledge of the groupcorrelation. For example in election forecast (Section 5.3), we have a rough sense of some states being demographically close, and would expect a similar model weights therein. To this end, we calculate a prior correlation matrix Ω J×J from various sources of state level historical data, and replace the independent prior (9) by a multivariate normal (MVN) distribution, The prior correlation is especially useful to stabilize stacking weights in small cells.
• Crude approximation of input density. When applying the basic model (13) to continuous inputs x = (x 1 , . . . , x D ) ∈ R D , instead of a direct linear regression f d (x) = x d , we recommend a coordinate-wise ReLU-typed transformation: where med(x d ) is the sample median of x d . The pointwise model predictive performance typically relies on the training density P train X (x): The more training data seen nearby, the better predictions. The feature (16) is designed to be a crude approximation of log marginal input densities.

Choice of features and exploratory data analysis
Choosing the division of (x, z) in discrete inputs is now a more general problem on how to construct features f m (x), or a variable selection problem in a regression (13). In ordinary statistical modeling, we often start variable selection by exploratory data analysis. Here we cannot directly associate model weights w ki with observable quantities. Nevertheless, we can use the paired pointwise log predictive density difference ∆ ki = (log p k,−i − log p K,−i ) as an exploratory approximation to the trend of α k (x i ). A scatter plot of ∆ ki against x may suggest which margin of x is likely important. For example, the dependence of ∆ ki on whether x i is in the bulk or tail is an evidence for our previous recommendation of the rectified features.
As more variables x are allowed to vary in the stacking model, model averaging is more prone to over-fitting. Pointwise stacking typically has a large noise-to-signal ratio not only due to model similarity, but also a high variance of pointwise model evaluation: the approximate leave-one-out cross validation possesses Monte Carlo errors; even if we run exact leave-one-out, or use an independent validation set in lieu of leave-one-out, we only observe one y i for one x i (if x is continuous) such that log p k,−i is at best an one-sample-estimate of Eỹ |x i (log p(ỹ|x i , M k )) with non-vanishing variance. If f m (·) is flexible enough, then the sample optimum of no-pooling stacking (7) always degenerates to pointwise model selection that pointwisely picks the model that "best" fits current realization of y i : w arg max k p k,−i (x i ) = 1, which is purely over-fitting.
Even in companion with hierarchical priors, we do not expect to include too many features on which stacking weights depend on. In our experiments, an additive model with discrete variables and rectified continuous variables without interaction is often adequate. After standardizing all features such that Var(f m (x)) = 1, we typically use a generic informative prior setting τ µ = τ σ = 1 in experiments. With a moderate or large number of features/cells, M , it is sensible to scale the hyperprior τ σ = O( 1/M ), or adopt other feature-wise shrinkage priors such as horseshoe for better regularization.

Gaussian process prior
An alternative way to generalize both the discrete prior in Section 2.3 and the prior correlation (15) is Gaussian process priors. To this end we need K − 1 covariance kernels K 1 , . . . , K K−1 , and place priors on the unconstrained weight α k (x), viewed as an X → R function: α k (x) ∼ GP(µ k , K k (x)). The discrete prior is a special case of a Gaussian process via a zero-one kernel Due to the previously discussed measurement error and the preference on stronger regularization for continuous x, we recommend simple exponentiated quadratic kernels with an informative hyperprior that avoids too small or too big length-scale ρ k , and too big a k . We present an example in Section 5.2.

Time series and longitudinal data
Hierarchical stacking can easily extend to time series and longitudinal data. Consider a time series dataset where outcomes y i come sequentially in time 0 ≤ t i ≤ T . The joint likelihood is not exchangeable, but still factorizable via p(y 1:n |θ) = n i=1 p(y i |θ, y 1:(i−1) ). Therefore, assuming some stationary condition, we can approximate the expected log predictive densities of the next-unit unseen outcome by historical average of one-unit-ahead log predictive densities, defined by p k,−i := Θ k p(y i |x i , y 1:(i−1) , x 1:(i−1) , θ k , M k )p(θ k |y 1:(n−1) , x 1:(n−1) )dθ k .
In hierarchical stacking, we only need to replace the regular leave-one-out predictive density (6) by this redefined p k,−i , and run hierarchical stacking (11) as usual. Using importance sampling based approximation , we also make efficient computation without the need to fit each model n times.
If we worry about time series being non-stationary, we can reweight the likelihood in (11) by a non-decreasing sequence π i : n n i=1 π i log K k=1 w k (x i )p k,−i / n i=1 π i , so as to emphasize more recent dates. For example, π i = 1 + γ − (1 − t i /T ) 2 , where a fixed parameter γ > 0 determines how much influence early data has. By appending x := (x, t), the stacking weight can vary across the time variable, too.
In Section 5.3, we present an election example with longitudinal polling data (40 weeks × 50 states). For the i-th poll (already ordered by date), we encode state index into input x i = 1, . . . , 50, all other poll-specific variables z i , data t i , and poll outcome y i . We compute the oneweek-ahead predictive density p k,−i : contains polls from all states up to one week before date t i .

Why model averaging works and why hierarchical stacking can work better
The consistency of leave-one-out cross validation ensures that complete-pooling stacking (3) is asymptotically no worse than model selection in predictions (Clarke, 2003;Le and Clarke, 2017), hence justified by Bayesian decision theory. The theorems we establish in Section 3.2 go a step further, providing lower bounds on the utility gain of stacking and pointwise stacking. In short, model averaging is more pronounced when the model predictive performances are locally separable, but in the same situation, we can improve the linear mixture model by learning locally which model is better, so that the stacking is a step toward model improvement rather than an end to itself. We illustrate with a theoretical example in Appendix A and provide proofs in Appendix B.

All models are wrong, but some are somewhere useful
With an M-closed view (Bernardo and Smith, 1994), one of the candidate models is the true data generating process, whereas in the more realistic M-open scenario, none of the candidate models is completely correct, hence models are evaluated to the extent that they interpret the data.
The expectation of a strictly proper scoring rule, such as the expected log predictive density (elpd), is maximized at the correct data generating process. However, the extent to which a model is "true" is contingent on the input information we have collected. Consider an input-outcome pair (x, y) generated by x ∈ [0, 1], y ∈ {0, 1}, x ∼ uniform(0, 1), Pr(y = 1|x) = x.
If the input x is not observed or is omitted in the analysis, then M 1 : y ∼ Bernoulli(0.5) is the only correct model and is optimal among all probabilistic predictions of y unconditioning on x. But this marginally true model is strictly worse than a misspecified conditional prediction, M 2 : Pr(y = 1|x) = √ x, since the expected log predictive densities are log(0.5) = −0.69 and − 7 12 = −0.58 respectively after averaged over x and y. The former model is true purely because it ignores some predictors. This wronger-model-does-better example does not contradict the log score being strictly proper, as we are changing the decision space from measures on y to conditional measures on y|x. But this example does underline two properties of model evaluation and averaging. First, we have little interest in a binary model check. The hypothesis testing based model-being-true-or-false depends on what variables to condition on and is not necessarily related to model fit or prediction accuracy. In a non-quantum scheme, a really "everywhere true" model that has exhausted all potentially unobserved inputs contains no aleatory uncertainty. Second, the model fits typically vary across the input space. In the Bernoulli example, despite its larger overall error, M 1 is more desired near x ≈ .5, and is optimal at x = .5.
For theoretical interest, we define the conditional (onx) expected (onỹ|x) log predictive density in the k-th model, are known, we can divide the input space X into K disjoint sets based on which model has the locally best fit (When there is a tie, the point is assigned the smallest index, and I stands for "input"): In this Bernoulli example, I 1 = [0.25, 0.67].

The gain from stacking, and what can be gained more
In this subsection, we focus on the oracle expressiveness power of model selection and averaging, and their input-dependent version. w stacking,cp refers to the complete-pooling stacking weight in the population: Apart from the heuristic that model averaging is likely to be more useful when candidate models are more "dissimilar" or "distinct" (Breiman, 1996;Clarke, 2003), we are not aware of rigorous theories that characterize this "diversity" regarding the effectiveness of stacking. It seems tempting to use some divergence measure between posterior predictions from each model as a metric of how close these models are, but this is irrelevant to the true data generating process.
We define a more relevant metric on how individual predictive distributions can be pointwisely separated. The description of a forecast being good is probabilistic on bothx andỹ: an overall bad forecast may be lucky at an one-time realization of outcomeỹ and covariatex. We consider the input-output product space X × Y and divide it into K disjoints subsets (J stands for "joint"): In this framework, we call a family of predictive densities {p(ỹ|M k ,x)} K k=1 to be locally separable with a constant pair L > 0 and 0 Stacking is sometimes criticized for being a black box. The next two theorems link stacking weight to a probabilistic explanation. Unlike Bayesian model averaging (Hoeting et al., 1999) that computes the probability of a model being "true", stacking is more related to Pr(J k ): the probability of a model being the locally "best" fit, with respect to the true joint measure p t (ỹ,x).
Theorem 1. When the separation condition (19) holds, the complete pooling stacking weight is approximately the probability of the model being the locally best fit: in the sense that the objective function is nearly optimal: Further, a model is only ignored by stacking if its winning probability is low.
Theorem 2. When the separation condition (19) holds, and if the k-th model has zero weight in stacking, w stacking,cp k = 0, then the probability of its winning region is bounded by: The right-hand side can be further upper-bounded by exp(−L) + .
The separation condition (19) trivially holds for = 1 and an arbitrary L, or for L = 0 and an arbitrary , though in those cases the bounds (21) and (22) are too loose. To be clear, we only use the closed form approximation (20) for theoretical assessment.
The next theorem bounds the utility gain from shifting model selection to stacking: Theorem 3. Under the separation condition (19), let ρ = sup k Pr(J k ), and a deterministic function g(L, K, ρ, ) = L(1 − ρ)(1 − ) − log K, then the utility gain of stacking is lower-bounded by Evaluating J k requires access toỹ|x andx. Though both terms are unknown, the roles ofx andỹ are not symmetric: we could bespoke the model in preparation for a future prediction at a givenx, but cannot be tailored for a realization ofỹ. To be more tractable, we consider the case when the variation onx predominates the uncertainty of model comparison, such that J k ≈ I k × Y, single model selection (1) complete-pooling stacking (2) pointwise selection (24) no-pooling stacking (7) hierarchical stacking (11) quicker finite-sample convergence higher asymptotic expressiveness Thm.3 Thm.4 Thm.3 Figure 1: Evolution of methods. First row from left to right: the methods have a higher degree of freedom to ensure a higher asymptotic predictive accuracy, the gain of which is bounded by the labeled theorems. Meanwhile, complex methods come with a slower convergence rate. The hierarchical stacking is a generalization of all remaining methods by assigning various structured priors, and adapts to the complexity-expressiveness tradeoff by hierarchical modeling.
where I k is defined in (17). More precisely, we define a strong local separation condition with a distance-probability pair (L, ): We define ρ X = sup k Pr(I k ). Under condition (23), ρ X and ρ will be close. If we know the input space division {I k }, we can select model M k for and only for x ∈ I k , which we call pointwise selection. The predictive density is As per Theorem 3, for a given pair of L and , the smaller is ρ, the higher improvement (K(1 − )(1 − ρ)) can stacking achieve against model selection: the situation in which no model always predominates. Thus, the effectiveness of stacking can indicate heterogeneity of model fitting. Next, we show that the heterogeneity of model fitting provides an additional utility gain if we shift from stacking to pointwise selection: Theorem 4. Under the strong separation condition (23), and if the divisions {I k } are known exactly, then the extra utility gain of pointwise selection has a lower bound, For a given input location x 0 ∈ X , the pointwise no-pooling optimum w(x 0 ) ∈ S K in the population is same as the complete-pooling solution restricted to the slice {x 0 }×Y. Hence, applying Theorem 3 to each slice will bound the advantage of pointwise averaging (4) against pointwise selection (24).
The potential utility gain from Theorems 3 and 4 is the motivation behind the input-varying model averaging. Despite this asymptotic expressiveness, the finite sample estimate remains challenging. (a) We do not know I k or J k . We may use leave-one-out cross validation to estimate the overall model fit elpd k , but in the pointwise version, we want to assess conditional model performance. Further, the more data coming in, the more input locations need to assess. (b) The asymptotic expressiveness comes with increasing complexity. The free parameters in single model selection, complete-pooling stacking, pointwise selection, and no-pooling stacking are a single model index, a length-K simplex, a vector of pointwise model selection index {1, 2, . . . , K} X , and a matrix of pointwise weight (S K ) X . To handle this complexity-expressiveness tradeoff, it is natural to apply the hierarchical shrinkage prior.

Immunity to covariate shift
So far we have adopted an IID view: the training and out-of-sample data are from the same distribution. Yet another appealing property of hierarchical stacking is its immunity to covariate shift (Shimodaira, 2000), a ubiquitous problem in non-representative sample survey, data-dependent collection, causal inference, and many other areas.
If the distribution of inputs x in the training sample, p train X (·), differs from these predictors' distribution in the population of interest, p pop X (·) (p pop X is absolutely continuous with respect to p train X ), and if p(z|x) and p(y|x, z) remain invariant, then we do not need to adjust weight estimate from (11), because it has already aimed at pointwise fit.
By contrast, complete-pooling stacking targets the average risk. Under covariate shift, the sample mean of leave-one-out score in the k-th model, 1 n n i=1 log p(ỹ|x, M k ), is no longer a consistent estimate of population elpd. To adjust, we can run importance sampling (Sugiyama and Müller, 2005;Sugiyama et al., 2007;Yao et al., 2018) and reweight the i-th term in the objective (3) proportional to the inverse probability ratio p pop Even in the ideal situation when both p pop X and p train X are known, the importance weighted sum has in general larger or even infinite variance (Vehtari et al., 2019), thereby decreasing the effective sample size and convergence rate in complete-pooling stacking (toward its optimum (18)). When p train X is unknown, the covariate reweighting is more complex while hierarchical stacking circumvents the need of explicit modeling of p train X . When we are interested only at one fixed input location p pop X (x) = δ(x = x 0 ), hierarchical stacking is ready for conditional predictions, whereas no-pooling stacking and reweighted-completepooling stacking effectively discard all x i = x 0 training data in their objectives, especially a drawback when x 0 is rarely observed in the sample.

Related literature
Stacking (Wolpert, 1992;Breiman, 1996;LeBlanc and Tibshirani, 1996), or what we call completepooling stacking in this paper has long been a popular method to combine learning algorithms, and has been advocated for averaging Bayesian models (Clarke, 2003;Clyde and Iversen, 2013;Le and Clarke, 2017;Yao et al., 2018). Stacking is applied in various areas such as recommendation systems, epidemiology (Bhatt et al., 2017), network modeling (Ghasemian et al., 2020), and postprocessing in Monte Carlo computation (Tracey and Wolpert, 2016;Yao et al., 2020). Stacking can be equipped with any scoring rules, while the present paper focuses on the logarithm score by default.
Our theory investigation in Section 3.2 is inspired by the discussion of how to choose candidate models by Clarke (2003) and Le and Clarke (2017). In L 2 loss stacking, they recommended "independent" models in terms of posterior point predictions (E(ỹ|x, M 1 ), . . . , E(ỹ|x, M K )) being independent. When combining Bayesian predictive distributions, the correlations of the posterior predictive mean is not enough to summarize the relation between predictive distributions (Pirš and Štrumbelj, 2019), hence we consider the local separation condition instead.
Allowing a heterogeneous stacking model weight that changes with input x is not a new idea. Feature-weighted linear stacking (Sill et al., 2009) constructs data-varying model weights of the k-th model by w k (x) = M m=1 α km f m (x), and α km optimizes the L 2 loss of the point predictions of the weighted model. This is similar to the likelihood term of our additive model specification in Section 2.4, except we model the unconstrained weights. The direct least-squares optimization solution from feature-weighted linear stacking is what we label no-pooling stacking.
It is also not a new idea to add regularization and optimize the penalized loss function. For L 2 loss stacking, Breiman (1996) advocated non-negative constraints. In the context of combining Bayesian predictive densities, a simplex constraint is necessary. Reid and Grudic (2009) investigated to add L 1 or L 2 penalty, −λ||w|| 1 or −λ||w|| 2 , into complete-pooling stacking objective (3). Yao et al. (2020) assigned a Dirichlet(λ), λ > 1 prior distribution to the complete-pooling stacking weight vector w to ensure strict concavity of the objective function. Sill et al. (2009) mentioned the use of L 2 penalization in feature-weighted linear stacking, which is equivalent to setting a fixed prior for all free parameters α km ∼ normal(0, τ ), ∀k, m, whose solution path connects between uniform weighing and no-pooling stacking by tuning τ . All of these schemes are shown to reduce over-fitting with an appropriate amount of regularization, while the tuning is computation intensive. In particular, each stacking run is built upon one layer of cross validation to compute the expected pointwise score in each model p k,−i , and this extra tuning would require to fit each model n(n − 1) times for each tuning parameter value evaluation if both done in exact leave-one-out way. Fushiki (2020) approximated this double cross validation for L 2 loss complete-pooling stacking with L 2 penalty on w, beyond which there was no general efficient approximation.
Hierarchical stacking treats {µ k } and {σ k } as parameters and sample them from the joint density. Such hierarchy could be approximated by using L 2 penalized point estimate with a different tuning parameter in each model, and tune all parameters ({σ k } K−1 k=1 for the basic model, or for the product model). But then this intensive tuning is the same as finding the Type-II MAP of hierarchical stacking in an inefficient grid search (in contrast to gradient-based MCMC).
Another popular family of regularization in stacking enforces sparse weights (e.g., Zhang and Zhou, 2011;Şen and Erdogan, 2013;Yang and Dunson, 2014), which include sparse and grouped sparse priors on the unconstrained weights, and sparse Dirichlet prior on simplex weights. The goal is that only a limited number of models are expressed. From our discussion in Section 3, all models are somewhere useful, hence we are not aimed for model sparsity-The concavity of log scoring rules implicitly resists sparsity; The posterior mean of hierarchical stacking weights w jk is, in general, never sparse. Nevertheless, when sparsity is of concern for memory saving or interpretability, we can run hierarchical stacking first and then apply projection predictive variable selection (Piironen and Vehtari, 2017) afterwards to the posterior draws from the stacking model (11) and pick a sparse (or cell-wise sparse) solution.
In contrast to fitting individual models in parallel before model averaging, an alternative approach is to fit all models jointly in a bigger mixture model. Kamary et al. (2019) proposed a Bayesian hypothesis testing by fitting an encompassing model p(y| w, θ) = K k=1 w k p(y|θ k , M k ). The mixture model requires to simultaneously fit model parameters and model weights p(w 1,...,K , θ 1,...,K |y), of which the computation burden is a concern when K is big. Yao et al. (2018) illustrated that (complete-pooling) stacking is often more stable than fullmixture, especially when the sample size is small and some models are similar. Nevertheless, our formulation of hierarchical stacking agrees with Kamary et al. (2019) in terms of sampling from the posterior marginal distribution of p(w |y) in a full-Bayesian model. A jointly-inferred model is related to the "mixture of experts" (Jacobs et al., 1991;Waterhouse et al., 1996) and "hierarchical mixture of experts" (Jordan and Jacobs, 1994;Svensën and Bishop, 2003), where w k (·) and p(·|x, M k ) are parameterized by neural networks and trained jointly in the bigger mixture model. Hierarchical stacking differs from mixture modeling in two aspects. First, its separate inference of individual models p(θ 1 |y, M 1 ), . . . , p(θ K |y, M k ) and weights greatly reduces computation burden, making exact Bayes affordable. Second, the built-in leave-one-out likelihood helps prevent overfitting. Both the mixture model and stacking have limitations. If the true data generating process is truly a mixture model, then fitting individual component p(θ k |y) separately is wrong and stacking cannot remedy it. On the other hand, stacking and hierarchical stacking are more suitable when each model has already been developed to fit the data on their own. Put it in another way, rather than to compete with a mixture-of-experts on combining weak learners, hierarchical stacking is more recommended to combine a mixture-of-experts with other sophisticated models. Lastly, our full-Bayesian formulation makes hierarchical stacking directly applicable to complex priors and complex data structures, such as time series or panel data, while these extensions are not straightforward in the mixture of experts.

Examples
We present three examples. The well-switching example demonstrates an automated hierarchical stacking implementation with both continuous and categorical inputs. The Gaussian process example highlights the benefit of hierarchical stacking when individual models are already highly expressive. The election forecast illustrates a real-world classification task with a complex data structure. We evaluate the proposed method on several metrics, including the mean log predictive density on holdout data, conditional log predictive densities, and the calibration error.

Well-switching in Bangladesh
We work with a dataset used by Vehtari et al. (2017) to demonstrate cross validation. A survey with a size of n = 3020 was conducted on residents from a small area in Bangladesh that was affected by arsenic in drinking water. Households with elevated arsenic levels in their wells were asked whether or not they were interested in switching to a neighbor's well, denoted by y. Well-switching behavior can be predicted by a set of household-level variables x, including the detected arsenic concentration value in the well, the distance to the closest known safe well, the education level of the head of household, and whether any household members are in community organizations. The first two inputs are continuous and the remaining two are categorical variables.
We fit a series of logistic regressions, starting with an additive model including all covariates x in model 1. In model 2, we replace one input-well arsenic level-by its logarithm. In models 3 and 4, we add cubic spline basis functions with ten knots of well arsenic level and distance, respectively in input variables. In model 5 we replace the categorical education variable with a continuous measure of years of schooling.
Using the additive model specification (13) and default prior (14), we model the unconstrained weight α k (x) by a linear regression of all categorical inputs and all rectified continuous inputs (16). In this example the categorical input has eight distinct levels based on the product of education (four levels) and community participation (binary).
For comparison, we consider three alternative approaches: (a) complete-pooling stacking (b) no-pooling stacking: the maximum likelihood estimate of (13), and (c) model selection that picks model with the highest leave-one-out log predictive densities. We split data into a training set (n train = 2000) and an independent holdout test set.
The leftmost panel in Figure 2 displays the pointwise difference of leave-one-out log scores for models 1 and 2 against log arsenic values in training data. Intuitively, model 1 fits poorly for data with high arsenic. In line with this evidence, hierarchical stacking assigns model 1 an overall low weight, and especially low for the right end of the arsenic levels. The second panel shows the pointwise posterior mean of unconstrained weight difference between model 1 and 2, α 2 (x) − α 1 (x), against the arsenic values in training data. The no-pooling stacking reveals a similar direction that Figure 2: (1) Pointwise difference of leave-one-out log scores between models 1 and 2, plotted against log arsenic. Model 1 poorly fits points with high arsenic. (2) Posterior mean of pointwise unconstrained weight difference between models 1 and 2, α 2 (x) − α 1 (x) in hierarchical stacking. (3) Pointwise log weight difference between models 1 and 2 in no-pooling stacking. (4) Posterior mean of w 4 (x), the weight assigned to model 4, in hierarchical stacking, displayed against log arsenic and education levels. There are few samples with high school education and above, whose effect on model weights is pooled toward the shared mean. The blue line is the complete-pooling stacking. (5) The unconstrained weight of model 4, α 4 (x), in no-pooling stacking. The "high school" effect stands out and the resulting model weights w are nearly all zeroes and ones.
model 1's weight should be lower with a higher arsenic value, but for lack of hierarchical prior regularization, the fitted α 2 (x) − α 1 (x) is orders of magnitude larger (the third panel). As a result, the realized pointwise weights w are nearly either zero or one.
The rightmost two columns in Figure 2 display the fitted pointwise weights of model 4 against log arsenic values and education level in test data. Because only a small proportion (7%) of respondents had high school education and above, the no-pooling stacking weight for this category is largely determined by small sample variation. Hierarchical stacking partially pools this "high school" effect toward the shared posterior mean of all educational levels, and the realized hierarchical stacking model weights do not clearly depend on education levels.
We evaluate model fit on the following three metrics. To reduce randomness, we evaluate all these metrics averaging over 50 random training-test splits.
(a) The log predictive densities averaged over test data. In the first panel of Figure 3, we set hierarchical stacking as a baseline and all other methods attain lower predictive densities.
(b) The L 1 calibration error. We set 20 equally spaced bins between 0 and 1. For each bin and each learning algorithm, we collect test data points whose model-predicted positive probability falling in that bin, and compute the absolute discrepancy between the realized proportion of positives in test data and the model-predicted probabilities. The middle panel in Figure 3 displays the resulted calibration error averaged over 20 bins. The proposed hierarchical stacking has the lowest error. No-pooling stacking has the highest calibration error despite its higher overall log predictive densities than model selection, suggesting prediction overconfidence.
(c) We compute the average log predictive densities of four methods among the n worst most shocking test data points (the ones with lowest predictive densities conditioning on a given method) for n worst varying from 10 to 200 and the total test data has size 1020. As exhibited in the last panel in Figure 3, the proposed hierarchical stacking consistently outperforms all other approaches for all n worst : a robust performance in the worst-case scenario. Figure 4 presents the same comparisons of four methods while the training sample size n train varies from 100 to 1200 (averaged over 50 random training-test splits). In agreement with the  heuristic in Figure 1, the most complex method-no-pooling stacking-performs especially poorly with a small sample size. By contrast, the simplest method, model selection, reaches its peak elpd quickly with a moderate sample size but cannot keep improving as training data size grows. The proposed hierarchical stacking performs the best in this setting under all metrics.

Gaussian process regression weighted by another Gaussian process
The local model averaging (12) tangles a x-dependent weight w(x) and x-dependent individual prediction p(y|x, M k ). If the individual model y|x, M k is already big enough to have exhausted "all" variability in input x, is there still a room for improvement by modeling local model weights w(x)? The next example suggests a positive answer.
Consider a regression problem with observations {y i } n i=1 at one-dimensional input locations {x i } n i=1 . To the data we fit a Gaussian process regression on the latent function f with zero mean and squared exponential covariance, and independent noise :   Figure 5: From left to right, Column 1: posterior density at σ = 0.25. At least two modes exist. Column 2: predictive distribution of y from two modes. Column 3: the pointwise companion of log predictive density of the Laplace approximations at two modes, and the hierarchical stacking weight of mode 1. Column 4: the test data mean predictive densities of the weighted model, where individual components in the final model consists of either the MAP, Laplace approximation, or importance sampling around the two modes, and the weighting methods include hierarchical stacking, completepooling stacking, mode heights and importance weighing.
We adopt training data from Neal (1998). They were generated such that the posterior distribution of hyperparameters θ = (a, ρ, σ) contains at least two isolated modes (see the first panel in Figure 5). We consider three mode-based approximate inference of θ|y: (a) Type-II MAP, where we pick the local modes of hyperparameters that maximizes the marginal densityθ = arg max p(θ|y), and further draw local variables f |θ, y, (b) Laplace approximation of θ|y around the mode, and (c) importance resampling where we draw uniform samples near the mode and keep sample with probability proportional to p (θ|y). In the existence of two local modesθ 1 ,θ 2 , we either obtain two MAPs or two nearly-nonoverlapped draws, further leading to two predictive distributions. Yao et al. (2020) suggests using complete-pooling stacking to combine two predictions, which shows advantages over other ad-hoc weighting strategies such as mode heights or importance weighting.
Visually, mode 1 has smaller length scale, more wiggling and attracted by training data. Because of a better overall fit, it receives higher complete-stacking weights. However, the wiggling tail makes its extrapolation less robust. We now run hierarchical stacking with x-dependent weight w k (x) for mode k = 1, 2 by placing another Gaussian process prior on unconstrained weight logit(w 1 (x)) with squared exponential covariance, Despite using the same GP prior, this is not related to the training regression model (25). To evaluate how good the weighted ensemble is, we generate independent holdout test data (x i ,ỹ i ). Both training and test inputs, x andx, are distributed from normal(0, 1). As presented in the rightmost panel in Figure 5, for all three approximate inferences, hierarchical stacking always has a higher mean test log predictive density than complete pooling stacking and other weighting schemes.
In this dataset, exact MCMC is able to explore both posterior modes in model (25) Figure 6: Compare hierarchical stacking with (left) stacking of two Laplace approximations or (right) a long-chain exact Bayes from the true model. We compare the binned test log predictive densities over 10 equally spaced bins on (−3, 3). A positive value means hierarchical stacking has a better fit than the counterpart.
long enough sampling. Gaussian process regression equipped with exact Bayesian inference can be regarded as the "always true" model here. Hierarchical stacking achieves a similar average test data fit by combining two Laplace approximations. Furthermore, hierarchical stacking has better predictive performance under covariate shift. To examine local model fit, we generate another independent holdout test data, with results shown in Figure 6. This time the test inputsx are from uniform(−3, 3). We divide the test data into 10 equally spaced bins and compute the mean test data log predictive density inside each bin. Compared with exact inference, hierarchical stacking has comparable performance in the bulk region of x, while it yields higher predictive densities in the tail, suggesting a more reliable extrapolation.

U.S. presidential election forecast
We explore the use of hierarchical stacking on a practical example of forecasting polls for the 2016 United States presidential election. Since the polling data are naturally divided into states, it provides a suitable platform for hierarchical stacking in which model weights vary on states.
To create a pool of candidate models, we first concisely describe the model of Heidemanns et al. (2020), an updated dynamic Bayesian forecasting model (Linzer, 2013) for the presidential election, and then follow up with different variations of it. Let i be the index of an individual poll, y i the number of respondents that support the Democratic candidate, and n i the number of respondents who support either the Democratic or the Republican candidate in the poll. Let s[i] and t[i] denote the state and time of poll i respectively. The model is expressed by where superscripts denote parameter names, and subscripts their indexes. The term µ b is the underlying support for the Democratic candidate, and α i , ζ, and ξ represent different bias terms. α i is further decomposed into  Figure 7: Left: pointwise differences in 7-day running mean log predictive densities on one-weekahead test data, where we set the hierarchical stacking as benchmark 0. Right: pointwise differences in cumulative average predictive log density by date. The advantage of hierarchical stacking is most noticeable toward the beginning, where there are fewer polls available.
where µ c is the house effect, µ r polling population effect, µ m polling mode effect, and an adjustment term for non-response bias. Furthermore, an autoregressive (AR(1)) prior is given to the µ b : where Σ b is the estimated state-covariance matrix and µ b T is the estimate from the fundamentals.
Although we believe this model reasonably fits data, there is always room for improvement. Our pool of candidates consists of eight models. M 1 : The fundamentals-based model of Abramowitz (2008). M 2 : The model of Heidemanns et al. (2020). M 3 : M 2 without the fundamentals prior, We equip hierarchical stacking with either the basic independent prior (9) or the state-correlated prior (15). The prior correlation Ω is estimated using a pool of state-level macro variables (election results in the past, racial makeup, educational attainment, etc.), and has already been used in some of the individual models to partially pool state-level polling. We plug this pre-estimated prior correlation in the correlated stacking prior (15) and refer to it as "hierarchical stacking with correlation" in later comparisons.
Since the data are longitudinal, we evaluate different pooling approaches using a one-week-ahead forecast with an expanding window for each conducted poll. We extract the fitted one-week-ahead predictions from each individual model, and train hierarchical stacking, complete-pooling, and nopooling stacking, and evaluate the combined models by computing their mean log predictive densities on the unseen data next week. To account for the non-stationarity discussed in Section 2.5, we only use the last four weeks prior to prediction day for training model averaging. In the end we obtain a trajectory of this back-testing performance of hierarchical stacking, complete-pooling stacking, no-pooling stacking, and single model selection.
The left-hand side of Figure 7 shows the seven-day running average of the one-week-ahead backtest log predictive density from models combined with various approaches. The right-hand side of Figure 7 shows the overall cumulative one-week-ahead back-test log predictive density. We set the uncorrelated hierarchical stacking to be a constant zero for reference. Hierarchical stacking  Figure 8: Mean test log predictive densities with 50% and 95% confidence intervals, among subsets of states with few, moderate, and many number of state polls, and among all states. Correlated hierarchical stacking is set as reference 0. It is better than independent hierarchical stacking when data are scarce. Complete-pooling stacking is close to hierarchical stacking in small states but worse in bigger states.
performs the best, followed by stacking, no-pooling stacking, and model selection respectively. The advantage of hierarchical stacking is highest at the beginning and slowly decreases the closer we get to election day. As we move closer to the election, more polls become available, so the candidate models become better and also more similar since some models only differ in priors. As a result, all combination methods eventually become more similar. No-pooling stacking has high variance and hence performs the worst out of all combination methods. Hierarchical stacking with correlated prior performs similarly to the independent approach, with a minor advantage at the beginning of the year, where the prior correlation stabilizes the state weights where the data are scarce, and later we see this advantage more discernible in individual states.
To examine small area estimates, we divided states into three categories based on how many state polls were conducted. Figure 8 shows the overall mean pointwise differences in test log predictive densities divided by these categories, along with a fourth panel over all states. No-pooling stacking performs the worst in all panels. An explanation for that could be that we are using a four-week moving window to tackle non-stationarity, which might not contain enough data for the no-pooling method. The variance of the no-pooling is amended by the hierarchical approach, which performs on par with stacking with scarcer data and outperforms it otherwise. Figure 14 in Appendix E shows the state-level cumulative log predictive density by time. With a large number of state polls available, for example, close to election day in Florida and North Carolina, no-pooling stacking performs well. In states with fewer polls, no-pooling stacking is unstable. Hierarchical stacking alleviates this instability while retaining enough flexibility for a good performance when large data come in. Figure 9 illustrates how cell size affects the pooling effect. The first panel shows the hierarchical stacking state-wise weights for the first candidate model w 1j as a function of date. For either earlydate forecasts or states with few polls, hierarchical stacking weights are more pooled toward the shared nationwide mean. The middle and right panels compare the difference between state-wise hierarchical stacking weights and the nationwide mean, or with no-pooling weights, against the total number of respondents for each state and prediction date. The cells with more observed data are less pooled and closer to their no-pooling optimums, and vice versa. States with smaller sample sizes are more pooled to the mean. Right: absolute differences between hierarchical stacking and no-pooling stacking weights, generally decreasing with bigger sample sizes.

Robustness in small areas
The input-varying model averaging is designed to improve both the overall averaged prediction Eỹ ,x (log p(ỹ|x)) and conditional prediction Eỹ |x=x 0 (log p(ỹ|x)), whereas these two tasks are subject to a trade-off in complete-pooling stacking. In addition, the partial pooling prior (9) borrows information from other cells, which stabilizes model weights in small cells where there are not enough data for no-pooling stacking. For a crude mean-field approximation, the likelihood in the discrete model (11) is approximately j,k normal(α mode jk , λ jk ), where α mode = arg max α n i=1 log K k=1 w k (x i )p k,−i is the unconstrained no-pooling stacking weight, and −λ −2 jk = ∂ 2 ∂α 2 jk | mode n i=1 log K k=1 w k (x i )p k,−i is the diagonal element of the Hessian. Because α jk appears in n j terms of the summation, λ jk = O(n −1/2 j ) for a given k. Combined with the prior α jk ∼ normal(µ k , σ k ), the conditional posterior mean of the k-th model weight in the j-th cell is the usual precision-weighed average of the no-pooling optimum and the shared mean: α post Hence for a given model k, |α mode jk − α post jk | = O(n −1 j ). Larger pooling usually occurs in smaller cells. This pooling factor is in line with Figure 9 and general ideas in hierarchical modeling (Gelman and Pardoe, 2006). Our full-Bayesian solution also integrates out µ k and σ k , which further partially pools across models.
The possibility of partial pooling across cells encourages open-ended data gathering. In the election polling example, even if a pollster is only interested in the forecast of one state, they could gather polling data from everywhere else, fit multiple models, evaluate models on each state, and use hierarchical stacking to construct model averaging, which is especially applicable when the state of interest does not have enough polls to conduct a meaningful model evaluation individually. In this context swing states naturally have more state polls, so that the small-area estimation may not be crucial, but in general, we conjecture that the hierarchical techniques can be useful for model evaluation and averaging in a more general domain adaptation setting. Without going into extra details, hierarchical models are as useful for making inferences from a subset of data (smallarea estimation) as to generalize data to a new area (extrapolation). When the latter task is the focus, hierarchical stacking only needs to redefine the leave-one-data-out predictive density (6) by leave-one-cell-out p k,−i :

Using hierarchical stacking to understand local model fit
We use hierarchical stacking not only as a tool for optimizing predictions but also as a way to understand problems with fitted models. The fact that hierarchical stacking is being used is already an implicit recognition that we have different models that perform better or worse in different subsets of data, and it can valuable to explore the conditions under which different models are fitting poorly, reveal potential problems in the data or data processing, and point to directions for individual-model improvement. Vehtari et al. (2017) and Gelman et al. (2020) suggested to examine the pointwise cross-validated log score log p k,−i as a function of x i , and see if there is a pattern or explanation for why some observations are harder to fit than others. For example, the first panel of Figure 2 seems to indicate that Model 1 is incapable of fitting the rightmost 10-15 non-switchers. However, log p k,−i contains a non-vanishing variance since y i is a single realization from p t (y|x i ). Despite its merit in exploratory data analysis, it is hard to tell from the raw cross validation scores whether Model 1 is incapable of fitting high arsenic or is merely unlucky for these few points. The hierarchical stacking weight w(x) provides a smoothed summary of how each model fits locally in x and comes with built-in Bayesian uncertainty estimation. For example, in Figure 5, log p 1,−i − log p 2,−i has a slightly inflated right tail, but this small bump is smoothed by stacking, and the local weight therein is close to (0.5, 0.5).

Retrieving a formal likelihood from an optimization objective
The implication of hierarchical stacking (11) being a formal Bayesian model is that we can evaluate its posterior distribution as with a regular Bayesian model. For example, we can run (approximate) leave-one-out cross validation of the the stacking posterior p(w |D −i ) ∝ p(w |D)/p(y i |x i , w) = p(w |D)/ K k=1 w k (x i )p k,−i . In practice, we only need to fit the stacking model (11) once, collect a size-S MCMC sample of stacking parameters from the full posterior p(w |D), denoted by {(w k1 (x i ), . . . , w kS (x i ))} i,k , compute the PSIS-stabilized importance ratio of each draw r is ≈ K k=1 w ks (x i )p k,−i −1 , and then compute the mean leave-one-out cross validated log predictive density to evaluate the overall out-of-sample fit of the final stacked model: As discussed in Section 4, the same task of out-of-sample prediction evaluation in an optimizationbased stacking requires double cross validation (refit the model n(n−1) times if using leave-one-out), but now becomes almost computationally free by post-processing posterior draws of stacking. The Bayesian justification above applies to log-score stacking. In general, we cannot convert an arbitrary objective function into a log density-its exponential is not necessarily integrable, and, even if it is, the resulted density does not necessarily correspond to a relevant model. Take linear regression for example, the ordinary least square estimate arg min β n i=1 (y i − x T i β) 2 is identical to the maximum likelihood estimate of β from a probabilistic model y i |x i , β, σ ∼ normal(x T i β, σ) with flat priors. But the directly adapted "log posterior density" from the negative L 2 loss, log p(β|y) = − n i=1 (y i − x T i β) 2 + C, differs from the Bayesian inference of the latter probabilistic model unless σ ≡ 1. The hierarchical stacking framework may still apply to other scoring rules, while we leave their Bayesian calibration for future research.

Statistical workflow for black box algorithms
Unlike our previous work (Yao et al., 2018) that merely applied stacking to Bayesian models, the present paper converts optimization-based stacking itself into a formal Bayesian model, analogous to reformulating a least-squares estimate into a normal-error regression. Breiman (2001) distinguished between two cultures in statistics: the generative modeling culture assumes that data come from a given stochastic model, whereas the algorithmic modeling treats the data mechanism unknown and advocates black box learning for the goal of predictive accuracy. As a method that Breiman himself introduced (along with Wolpert, 1992), stacking is arguably closer to the algorithmic end of the spectrum, while our hierarchical Bayesian formulation pulls it toward the generative modeling end.
Such a full-Bayesian formulation is appealing for two reasons. First, the generative modeling language facilitates flexible data inclusion during model averaging. For example, the election forecast model contains various outcomes on state polls and national polls from several pollsters, and pollster-, state-and national-level fundamental predictors, and prior state-level correlations. It is not clear how methods like bagging or boosting can include all of them. Data do not have to conveniently arrive in independent (x i , y i ) pairs and compliantly await an algorithm to train upon. Second, instead of a static algorithm, hierarchical stacking is now part of a statistical workflow . It then enjoys all the flexibility of Bayesian model building, fitting, and checking-we can incorporate other Bayesian shrinkage priors as add-on components without reinventing them; we can run a posterior predictive check or approximate leave-one-out cross validation (28) to assess the out-of-sample performance of the final stacking model; we may even further select, stack, or hierarchically stack a sequence of hierarchical stacking model with various priors and parametric forms. Looking ahead, the success of this work encourages more use of generative Bayesian modeling to improve other black box prediction algorithms.
At the pointwise level, stacking behaves as a plurality voting system: as long a model "wins" a sub-region (subject to a prefixed threshold L in condition (19)), the winner take all and its winning margin no longer matters.
By contrast, likelihood-based model averaging techniques such as Bayesian model averaging (BMA, Hoeting et al., 1999) and pseudo-Bayesian model averaging (Yao et al., 2018) are analogies of proportional representation: every count of the winning margin matters. For illustration, we vary the slab probability δ in Model 1 and 2: The left column in Figure 11 visualizes the predictive densities from these two models at δ = 0.2, 0.33, and 0.45.
When the slab probability δ increases from 0 to 0.5, these two models are closer and closer to each other, measured by a smaller KL(M 1 , M 2 ). The (0.5, 1) counterpart is similar, though not exactly symmetric. We compute stacking weight and the expected pseudo-BMA weight with sample size n: w BMA 1 (n, δ) = 1 + exp n E y|δ log p 2 (y) − n E y|δ log p 1 (y) −1 .
Interestingly, pseudo-BMA weight w BMA 1 (n, δ) is strictly decreasing as a function of δ ∈ (0, 1). This is because when δ → 0 + , log predictive density of model 2 in the left part log(δ/4) → ∞ can be arbitrarily small, and the influence of this bad region dominates the overall performance of model 2. By contrast, stacking weight is monotonic non-increasing on (0, 0.5) (strictly decreasing on (0, 1/3), and remains flat afterwards)-the opposite direction of BMA. Stacking simply recognizes model 1 winning the [−3, 0] interval and does not haggle over how much it wins.
The first two panels in Figure 12 show the KL divergence from model 1 or from model 2 to the data generating process and the KL divergence between model 1 and model 2. The third panel is the largest separation constant L for which the separation condition (19) holds. The last two panels show the stacking epld gain (compared with the best individual model) as a function of δ and L. This constructive example reflects the worst case for it matches the theoretical lower bound g * (L, K, ρ, ) = log(ρ) + (1 − ρ)(log(1 − ρ) − log(K − 1)) (here L = L, K = 2, ρ = 1/4, = 0) in Theorem 3. When δ ∈ [1/3, 1/2), Model 2 still wins on the interval J 2 = (0, 2] with the separation constant = 0 and L ≤ log 2 (the winning margin is maximized at δ = 1/3). Nevertheless, a zero stacking weight and a non-zero winning area do not contradict Theorem 1. Indeed, Theorem 2 precisely bounds the mass of the winning region when stacking weight is zero. We provide self-contained theorem proofs in the next section.
Loosely speaking, BMA computes the probability of a model being true (if one model has to be true), while stacking (through the approximation Pr(J k )) computes the probability of a model being the best.

B. Proofs of theorems
For briefly, in later proofs we will use the abbreviation for the posterior pointwise conditional predictive density from the k-th model: This subscript index k should not be confused with the notation t as in p t (ỹ|x) or p t (ỹ,x): the unknown conditional or joint density of the true data generating process. The subscript letter t is always reserved for "true".
Recall that in this section w stacking refers to the complete-pooling stacking in the population: Theorem 1. We call K predictive densities {p(ỹ = ·|x = ·, M k )} K k=1 to be locally separable with a constant pair L > 0 and 0 < < 1 with respect to the true data generating process For a small and a large L, the stacking weights that solve (3) is approximately the proportion of the model being the locally best model: in the sense that the objective function is nearly optimal: |elpd(w approx ) − elpd(w stacking )| ≤ O( + exp(−L)).
Proof. The expected log predictive density of the weighted prediction k w k p k (·|x) (as a function of w) is The expression is legit for any simplex vector w ∈ S K that does not contain zeros. We will treat zeros later. For now we only consider a dense weight: {w ∈ S K : w k > 0, k = 1, . . . K}.
Consider a surrogate objective function (the first term in the integral above): (Pr(J k ) log w k ) + constant.
Ignoring the constant term above (the expected cross-entropy between each conditional prediction and the true DG), to maximize the surrogate objective function is equivalent to maximizing K k=1 Pr(J k ) log w k , we call this function elbo(w), the evidence lower bound. To optimize elpd surrogate is equivalent to optimizing elbo. We show that this elbo function has a closed form optimum. Using Jensen's inequality, Pr(J k ) log Pr(J k ).
The equality is attained at w k = Pr(J k ), k = 1, . . . , K, which reaches our definition of w approx in Theorem 1.
Next, we deal with w stacking 1 = 0, w stacking k = 0, ∀k = 1. If w approx 1 = 0, too, then we have solved in the previous paragraph. If not, Theorem 2 shows that w approx 1 has to be a small order term: We leave the proof of this inequality in Theorem 2.
The contribution of the first model in the surrogate model is at most Pr(J 1 ) log Pr(J 1 ). After we remove the first model from the model list, with the surrogate model elpd changes by at most a small order term, not affecting the final bound. Because the separation condition with constant ( , L) applies to model 1, . . . , K, and due to lack of a competition source, the same separation condition applies to model 2, . . . , K and the same bound applies.
Theorem 2. When the separation condition (19) holds, and if the k-th model has zero weight in stacking, w stacking Theorem 3. Let ρ = sup 1≤k≤K Pr(J k ), and two deterministic functions g and g * by Assuming the separation condition (19) holds for all k = 1, . . . , K, then the utility gain of stacking is further lower-bounded by elpd stacking − elpd k ≥ max (g * (L, K, ρ) + O(exp(−L) + ), 0) .
Proof. As before, we consider the approximate weights: w approx k = Pr(J k ), and the surrogate elpd The last inequality comes from the fact that, under the constraint of max k Pr(J k ) = ρ, the entropy K k=1 Pr(J k ) log Pr(J k ) attains its minimal when each of the Pr(J k ) term equals (1 − ρ)/(K − 1) except for the largest term ρ. This inequality is due to the convexity of x log x.
Finally, using the proof of Theorem 1, the error from the surrogate is bounded, Hence the overall utility is bounded, Because selection is always a specials case of averaging, the utility is further bounded below by 0.
From the constrictive example (Appendix A), g * (·) is a tight bound. We use the looser bound g(·) in the main paper for its simpler form.
Theorem 4. Under the strong separation assumption K k=1 x∈I k ỹ∈Y 1 log p(ỹ|M k , x) < log p(ỹ|M k , x) + L, ∀k = k * (x) p t (ỹ|x, D)dỹdx ≤ , and if the sets {I k } are known exactly, then we can construct pointwise selection Its utility gain is bounded from below by Finally, from the proof of Theorem 1, |elpd surrogate (w approx ) − elpd stacking (w stacking )| ≤ O(exp(−L) + ).
We close this section with two remarks. First, Yao (2019) approximates the probabilistic stacking weights under the strong separation condition (23). The result therein can be viewed as a special case of Theorem 4 in the present paper as Pr(J k ) ≈ Pr(I k ) under assumption (23).
Second, most proofs only use the concavity of the log scoring rule. Therefore, some proprieties of stacking weights could be extended to other concave scoring rules, too.

C. Software implementation in Stan
We summarize our formulation of hierarchical stacking by pseudo code 1.
Algorithm 1: Hierarchical stacking Data: y: outcomes; x: input on which the stacking weights vary, z: other inputs; p k,−i : approximate leave-one-out predictive densities of the k-th model and i-th data.
To code the basic additive model, we prepare the input covariate X = (X discrete , X continuous ), where X discrete is discrete dummy variable, and X continuous are remaining features (already rectified as in (16)). The dimension of these two parts are d continuous and d discrete .
Here we use the "grouped hierarchical priors" (Section 6.3) with only two groups, distinguishing between continuous and discrete variables. We discuss more on the hyper prior choice in the next section.

D. Prior recommendations
We believe the prior specification should follow the general principle of the weakly-informative prior 2 . In the context of the additive model (Section 2.4), some weakly-informative prior heuristics imply • We would like to use a half-normal instead of a too wide half-Cauchy or inverse-gamma for the model-wise scale parameter (i.e., σ k ∼ normal + (0, τ σ )). This is not only because generally, we prefer half-normal for its lighter right tail in hierarchical models, but also because we know that the complete-pooling stacking (σ k ≡ 0) is often a rational solution in many problems, to begin with.
On the contrary, a wide σ 2 k ∼ InvGamma(10 −2 , 10 −4 ) seems a popular choice in the mixture of experts, which we do not recommend.
• When the number of features M is large, it is sensible to first standardize feature such that Var(f m (x)) = 1, 1 ≤ m ≤ M , and scale the hyper-parameter to control Var( M m=1 α mk f m (x)). With independent inputs, it leads to τ σ = O( 1/M ).
• When there are a small number of features and no extra information to incorporate, we often first standardize all features and use a half-normal(0, 1) prior on model-wise scale σ k (i.e., τ σ := 1). The half-normal(0, 1) has been used as a default informative prior for group-level scale in some applied regression tasks.
• The structure of the prior matters more than the scale of the prior. Hierarchical stacking is typically not sensitive to the difference between a half-normal(0, 1) or half-normal(0, 2) hyperprior on σ k , although this sensitivity can be checked. But it would be sensitive to the structure of priors, such as feature-model decomposition, correlated priors, and horseshoe priors, as we have discussed in Section 2.4.
Second, instead of recommending a static default prior, we would rather adopt the attitude that the prior is part of the model and can be checked and improved. Because of our full-Bayesian formulation of hypercritical stacking, we do not have to reinvent model checking tools. When there are concerns on the prior specification, we would like to run prior predictive checks, sensitivity analysis by influence function or importance sampling, and select, stacking, or hierarchically stack a sequence of priors based upon an extra layer of (approximate) leave-one-out cross validation (28).

E. Experiment details
The replication code for experiments is available at https://github.com/yao-yl/hierarchical-stacking-code.
And place a default prior on parameters and hyper-parameters.
Gaussian process regression. We use training data {x i , y i } from Neal (1998) (file odata.txt in our repo). Yao et al. (2020) use same setting to explain the benefit of complete-pooling stacking. The training size is n = 100. We generate additional test data for model evaluation. The univariate input x is distributed normal(0, 1), and the corresponding outcome y is also Gaussian. The true but unknown conditional mean is E true (y|x) = f true (x) = 0.3 + 0.4x + 0.5 sin(2.7x) + 1.1/(1 + x 2 ).
In the data generating process, with probability 0.95, y is a realization from y|f true = normal(mean = f true , sd = 0.1). With probability 0.05, y is considered an outlier and the standard deviation is inflated to 1: y|f true = normal(f true , 1). This outlier probability is independent of location x, and the observational noises are mutually independent.
The posterior multimodality relies on the particular realization of x and y. We have tried other randomly generated training datasets, among which only Neal (1998)'s original data realization can give rise to two distinct modes. We then consider three standard mode-based approximate inference: • Type-II MAP: The valueθ that maximizes the marginal posterior distribution. We further draw f |θ, y.
• Laplace approximation. First compute Σ: the inverse of the negative Hessian matrix of the log posterior density at the local modeθ, draw z from MVN(0, I 3 ), and use θ(z) =θ + VΛ 1/2 z as the approximate posterior samples around the modeθ, where the matrices V, Λ are from the eigen-decomposition Σ = VΛ 1/2 V T .
To get the elpd of a state, we take the average of all elpds in that state, for example elpd NY = 1 N NY i∈A NY elpd i , where N NY is the number of polls conducted in New York, and A NY is the set of indexes of polls in New York. Figure 14 displays the state-level log predictive density of the combined model in six representative states.  Figure 14: Log predictive density of the combined model in three small states (in the number of available state polls n, RI, SD, WV) and three swing states (GA, NC, FL). We fix the uncorrelated hierarchical stacking to be constant zero as a reference. The number in the bracket is the total number of polls in that state. With a large number of state polls available, for example, close to election day in Florida and North Carolina, no-pooling stacking performs well. With fewer polls, no-pooling stacking is unstable, as can be seen in Rhode Island, South Dakota, West Virginia, and the early part of Georgia plots. Hierarchical stacking alleviates this instability, while retaining enough flexibility for a good performance with large data come in.