Bayesian Analysis

Using Stacking to Average Bayesian Predictive Distributions (with Discussion)

Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman


Bayesian model averaging is flawed in the M-open setting, in which the true data-generating process is not one of the candidate models being fit. We take the idea of stacking from the point estimation literature and generalize it to the combination of predictive distributions. We extend the utility function to any proper scoring rule and use Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions. We compare stacking of predictive distributions to several alternatives: stacking of means, Bayesian model averaging (BMA), Pseudo-BMA, and a variant of Pseudo-BMA that is stabilized using the Bayesian bootstrap. Based on simulations and real-data applications, we recommend stacking of predictive distributions, with bootstrapped-Pseudo-BMA as an approximate alternative when computational cost is an issue.
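
As a rough illustration of the idea in the abstract, the sketch below computes stacking weights under the logarithmic score and compares them with plain Pseudo-BMA weights (without the Bayesian bootstrap). It assumes the pointwise leave-one-out log predictive densities are already available, for example from Pareto smoothed importance sampling; the numbers in loo_lpd are made up for illustration. The function names and the EM-style fixed-point update are choices made for this sketch and are not necessarily the optimizer used by the authors.

```python
import math

def em_stacking_weights(lpd, iters=2000):
    """Simplex weights maximizing sum_i log(sum_k w_k * exp(lpd[i][k])).

    lpd[i][k] is the leave-one-out log predictive density of point i
    under model k. Uses the EM update for mixture weights with fixed
    component densities (an assumption of this sketch).
    """
    n, k = len(lpd), len(lpd[0])
    # Shift each row by its maximum for numerical stability; the update
    # below depends only on within-row ratios, so the shift is harmless.
    dens = [[math.exp(v - max(row)) for v in row] for row in lpd]
    w = [1.0 / k] * k
    for _ in range(iters):
        new = [0.0] * k
        for row in dens:
            mix = sum(wj * pj for wj, pj in zip(w, row))
            for j in range(k):
                new[j] += w[j] * row[j] / mix
        w = [v / n for v in new]
    return w

def pseudo_bma_weights(lpd):
    """Weights proportional to exp(elpd_k), where elpd_k = sum_i lpd[i][k]."""
    k = len(lpd[0])
    elpd = [sum(row[j] for row in lpd) for j in range(k)]
    m = max(elpd)
    e = [math.exp(v - m) for v in elpd]
    total = sum(e)
    return [v / total for v in e]

def log_score(lpd, w):
    """Sum over points of the log density of the weighted combination."""
    return sum(
        math.log(sum(wj * math.exp(v) for wj, v in zip(w, row)))
        for row in lpd
    )

# Hypothetical LOO predictive densities for 4 points and 2 models:
# model 0 predicts the first three points well, model 1 the last one.
loo_lpd = [
    [math.log(0.8), math.log(0.2)],
    [math.log(0.8), math.log(0.2)],
    [math.log(0.8), math.log(0.2)],
    [math.log(0.1), math.log(0.9)],
]

w_stack = em_stacking_weights(loo_lpd)
w_pbma = pseudo_bma_weights(loo_lpd)
print("stacking weights:  ", w_stack)
print("Pseudo-BMA weights:", w_pbma)
print("log score, stacking:  ", log_score(loo_lpd, w_stack))
print("log score, Pseudo-BMA:", log_score(loo_lpd, w_pbma))
```

In this toy example stacking keeps more weight on the second model than Pseudo-BMA does, because the weights are chosen to maximize the score of the combined predictive distribution rather than to reflect each model's total score on its own.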

Article information

Bayesian Anal., Volume 13, Number 3 (2018), 917-1007.

First available in Project Euclid: 16 January 2018

Keywords: Bayesian model averaging; model combination; proper scoring rule; predictive distribution; stacking; Stan

Creative Commons Attribution 4.0 International License.


Yao, Yuling; Vehtari, Aki; Simpson, Daniel; Gelman, Andrew. Using Stacking to Average Bayesian Predictive Distributions (with Discussion). Bayesian Anal. 13 (2018), no. 3, 917--1007. doi:10.1214/17-BA1091.

