## Bayesian Analysis

### Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It

#### Abstract

We empirically show that Bayesian inference can be inconsistent under misspecification in simple linear regression problems, both in a model averaging/selection and in a Bayesian ridge regression setting. We use the standard linear model, which assumes homoskedasticity, whereas the data are heteroskedastic (though, significantly, there are no outliers). As sample size increases, the posterior puts its mass on worse and worse models of ever higher dimension. This is caused by hypercompression, the phenomenon that the posterior puts its mass on distributions that have much larger KL divergence from the ground truth than their average, i.e. the Bayes predictive distribution. To remedy the problem, we equip the likelihood in Bayes’ theorem with an exponent called the learning rate, and we propose the SafeBayesian method to learn the learning rate from the data. SafeBayes tends to select small learning rates, and regularizes more, as soon as hypercompression takes place. Its results on our data are quite encouraging.
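The learning-rate idea in the abstract can be illustrated with a minimal sketch. It assumes a deliberately simplified setting that is not taken from the paper: a linear model with known noise variance `sigma2`, a zero-mean Gaussian prior with variance `tau2` on the coefficients, and a small finite grid of candidate learning rates. Raising the Gaussian likelihood to the power `eta` keeps the posterior Gaussian, with the data term of the precision scaled by `eta`. The selection step picks the `eta` with the smallest cumulative posterior-expected log loss when predicting each outcome from the preceding ones, in the spirit of SafeBayes. All function names are hypothetical, and the paper's own algorithm may differ in its details.

```python
import numpy as np

def generalized_posterior(X, y, eta, sigma2=1.0, tau2=1.0):
    """Posterior over coefficients when the likelihood is raised to the power eta.

    Assumes y ~ N(X beta, sigma2 * I) and prior beta ~ N(0, tau2 * I).
    eta = 1 gives the standard Bayesian ridge posterior; eta = 0 returns the prior.
    """
    d = X.shape[1]
    precision = eta * (X.T @ X) / sigma2 + np.eye(d) / tau2
    cov = np.linalg.inv(precision)
    mean = cov @ (eta * (X.T @ y) / sigma2)
    return mean, cov

def cumulative_expected_log_loss(X, y, eta, sigma2=1.0, tau2=1.0):
    """Sum over i of the posterior-expected log loss on point i,
    using the eta-posterior fitted on points 0..i-1 only."""
    total = 0.0
    for i in range(len(y)):
        mean, cov = generalized_posterior(X[:i], y[:i], eta, sigma2, tau2)
        x = X[i]
        # E[(y_i - x'beta)^2] under beta ~ N(mean, cov)
        expected_sq_err = (y[i] - x @ mean) ** 2 + x @ cov @ x
        total += 0.5 * np.log(2 * np.pi * sigma2) + expected_sq_err / (2 * sigma2)
    return total

def select_learning_rate(X, y, etas, sigma2=1.0, tau2=1.0):
    """Return the candidate eta with the smallest cumulative expected log loss."""
    losses = [cumulative_expected_log_loss(X, y, eta, sigma2, tau2) for eta in etas]
    return etas[int(np.argmin(losses))]

# Usage on synthetic data (illustrative only; not the paper's experiments).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -0.5, 0.0]) + rng.normal(size=50)
eta_hat = select_learning_rate(X, y, etas=[1.0, 0.5, 0.25, 0.125])
mean_hat, cov_hat = generalized_posterior(X, y, eta_hat)
```

A smaller `eta` shrinks the influence of the data relative to the prior, which is why selecting a small learning rate acts as additional regularization.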

#### Article information

Source
Bayesian Anal., Volume 12, Number 4 (2017), 1069–1103.

Dates
First available in Project Euclid: 18 November 2017

Permanent link to this document
https://projecteuclid.org/euclid.ba/1510974325

Digital Object Identifier
doi:10.1214/17-BA1085

#### Citation

Grünwald, Peter; van Ommen, Thijs. Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It. Bayesian Anal. 12 (2017), no. 4, 1069–1103. doi:10.1214/17-BA1085. https://projecteuclid.org/euclid.ba/1510974325

#### Supplemental materials

• Supplementary material of “Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It”. In this paper, we described a problem for Bayesian inference under misspecification and proposed the SafeBayes algorithm for solving it. The main appendix, Appendix B, places SafeBayes in proper context by giving a six-point overview of what can go wrong in Bayesian inference from a frequentist point of view, and what can be done about it, in both the well-specified and the misspecified case. Specifically, we clarify the one other problem with Bayes under misspecification (interest in non-KL-associated tasks) and its relation to Gibbs posteriors. The remainder of the supplement is devoted to discussing these six points in great detail, explicitly stating several open problems, related work, and ideas for a general Bayesian misspecification theory as we go along. We also provide further details on SafeBayes (Appendix C) and additional experiments (Appendix G), and refine and explain in more detail the notion of bad misspecification and hypercompression (Appendix D).