Bayesian Analysis

Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression

Tim Salimans and David A. Knowles

Full-text: Open access


We propose a general algorithm for approximating nonstandard Bayesian posterior distributions. The algorithm minimizes the Kullback-Leibler divergence of an approximating distribution to the intractable posterior distribution. Our method can be used to approximate any posterior distribution, provided that it is given in closed form up to the proportionality constant. The approximation can be any distribution in the exponential family or any mixture of such distributions, which means that it can be made arbitrarily precise. Several examples illustrate the speed and accuracy of our approximation method in practice.
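To give a rough flavor of the idea (this is a hedged sketch, not the authors' code): for an exponential-family approximation q_η(x) ∝ exp(η·T(x)), the KL-optimal natural parameters solve a linear regression of log p(x) on the sufficient statistics T(x) under q, and this regression can be estimated by stochastic approximation while q is updated. The toy target density and all step-size choices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unnormalized log-posterior (illustrative, not from the paper).
def log_p(x):
    return 0.3 * x - 0.5 * x**2 - 0.1 * x**4  # mildly non-Gaussian 1D target

# Fit a Gaussian q(x) = N(m, s^2) by stochastically regressing log_p(x)
# on the regressors g(x) = (1, x, x^2): the coefficients of x and x^2
# are the natural parameters of q. Step size and iteration count are ad hoc.
m, s = 0.0, 1.0
C = np.eye(3)        # running estimate of E_q[g g^T]
d = np.zeros(3)      # running estimate of E_q[g * log_p(x)]
w = 0.05             # weight for the exponentially decaying averages

for t in range(5000):
    x = rng.normal(m, s)                  # sample from the current q
    g = np.array([1.0, x, x * x])
    C = (1 - w) * C + w * np.outer(g, g)
    d = (1 - w) * d + w * g * log_p(x)
    beta = np.linalg.solve(C, d)          # regression coefficients
    eta1, eta2 = beta[1], beta[2]         # natural parameters: eta1*x + eta2*x^2
    if eta2 < 0:                          # only update while q stays proper
        s = np.sqrt(-0.5 / eta2)          # eta2 = -1/(2 s^2)
        m = eta1 * s * s                  # eta1 = m / s^2

print(m, s)  # approximate posterior mean and standard deviation
```

The mapping back from (eta1, eta2) to (m, s) is the standard Gaussian natural parameterization; the paper's actual algorithm refines this basic scheme (e.g. averaging the regression statistics over the second half of the iterations) and extends it to mixtures.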

Article information

Bayesian Anal. Volume 8, Number 4 (2013), 837-882.

First available in Project Euclid: 4 December 2013

Keywords: variational Bayes; approximate inference; stochastic approximation


Salimans, Tim; Knowles, David A. Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression. Bayesian Anal. 8 (2013), no. 4, 837--882. doi:10.1214/13-BA858.



  • Albert, J. (2009). Bayesian Computation with R. Springer Science, New York. Second edition.
  • Amari, S. (1997). “Neural Learning in Structured Parameter Spaces - Natural Riemannian Gradient.” In Advances in Neural Information Processing Systems, 127–133. MIT Press.
  • Attias, H. (2000). “A variational Bayesian framework for graphical models.” In Advances in Neural Information Processing Systems (NIPS) 12, 209–215.
  • Beal, M. J. and Ghahramani, Z. (2002). “The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures.” In Bayesian Statistics 7: Proceedings of the 7th Valencia International Meeting, 453–463.
  • — (2006). “Variational Bayesian learning of directed graphical models with hidden variables.” Bayesian Analysis, 1(4): 793–832.
  • Bishop, C. M. (2006). Pattern recognition and machine learning, volume 1. Springer New York.
  • Bottou, L. (2010). “Large-Scale Machine Learning with Stochastic Gradient Descent.” In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010), 177–187. Springer.
  • de Freitas, N., Højen-Sørensen, P., Jordan, M. I., and Russell, S. (2001). “Variational MCMC.” In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, UAI’01, 120–127. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
  • Durbin, J. and Koopman, S. (2001). Time Series Analysis by State Space Methods. Oxford University Press.
  • Geweke, J. (2005). Contemporary Bayesian Econometrics and Statistics. Wiley-Interscience.
  • Gilks, W., Thomas, A., and Spiegelhalter, D. (1994). “A language and program for complex Bayesian modelling.” The Statistician, 169–177.
  • Girolami, M. and Calderhead, B. (2011). “Riemann manifold Langevin and Hamiltonian Monte Carlo methods.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2): 123–214.
  • Hoffman, M., Blei, D., and Bach, F. (2010). “Online learning for latent Dirichlet allocation.” Advances in Neural Information Processing Systems, 23.
  • Hoffman, M., Blei, D., Wang, C., and Paisley, J. (2012). “Stochastic Variational Inference.” arXiv preprint arXiv:1206.7051.
  • Honkela, A., Raiko, T., Kuusela, M., Tornio, M., and Karhunen, J. (2010). “Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes.” Journal of Machine Learning Research, 3235–3268.
  • Hoogerheide, L., Opschoor, A., and van Dijk, H. K. (2012). “A class of adaptive importance sampling weighted EM algorithms for efficient and robust posterior and predictive simulation.” Journal of Econometrics, 171(2): 101–120.
  • Jordan, M., Ghahramani, Z., Jaakkola, T., and Saul, L. (1999). “An introduction to variational methods for graphical models.” Machine learning, 37(2): 183–233.
  • Kim, S., Shephard, N., and Chib, S. (1998). “Stochastic Volatility: Likelihood Inference and Comparison with ARCH Models.” The Review of Economic Studies, 65(3): 361–393.
  • Knowles, D. A. and Minka, T. P. (2011). “Non-conjugate Variational Message Passing for Multinomial and Binary Regression.” In Advances in Neural Information Processing Systems (NIPS), 25.
  • Liesenfeld, R. and Richard, J.-F. (2008). “Improving MCMC, using efficient importance sampling.” Computational Statistics and Data Analysis, 53(2): 272–288.
  • Lovell, M. (2008). “A Simple Proof of the FWL Theorem.” The Journal of Economic Education, 39(1): 88–91.
  • Minka, T. (2005). “Divergence measures and message passing.” Technical Report MSR-TR-2005-173, Microsoft Research.
  • Minka, T. P. (2001). “A family of algorithms for approximate Bayesian inference.” Ph.D. thesis, MIT.
  • Minka, T. P., Winn, J. M., Guiver, J. P., and Knowles, D. A. (2010). “Infer.NET 2.4.”
  • Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). “Robust Stochastic Approximation Approach to Stochastic Programming.” SIAM Journal on Optimization, 19(4): 1574–1609.
  • Nickisch, H. and Rasmussen, C. E. (2008). “Approximations for Binary Gaussian Process Classification.” Journal of Machine Learning Research, 9: 2035–2078.
  • Nott, D., Tan, S., Villani, M., and Kohn, R. (2012). “Regression density estimation with variational methods and stochastic approximation.” Journal of Computational and Graphical Statistics, 21(3): 797–820.
  • Opper, M. and Archambeau, C. (2009). “The Variational Gaussian Approximation Revisited.” Neural Computation, 21(3): 786–792.
  • Ormerod, J. T. and Wand, M. P. (2010). “Explaining Variational Approximations.” The American Statistician, 64(2): 140–153.
  • Paisley, J., Blei, D., and Jordan, M. (2012). “Variational Bayesian Inference with Stochastic Search.” In International Conference on Machine Learning 2012.
  • Porteous, I., Newman, D., Ihler, A., Asuncion, A., Smyth, P., and Welling, M. (2008). “Fast collapsed Gibbs sampling for latent Dirichlet allocation.” In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 569–577.
  • Richard, J.-F. and Zhang, W. (2007). “Efficient high-dimensional importance sampling.” Journal of Econometrics, 141(2): 1385–1411.
  • Robbins, H. and Monro, S. (1951). “A Stochastic Approximation Method.” The Annals of Mathematical Statistics, 22(3): 400–407.
  • Saul, L. and Jordan, M. (1996). “Exploiting tractable substructures in intractable networks.” Advances in Neural Information Processing Systems, 486–492.
  • Stern, D. H., Herbrich, R., and Graepel, T. (2009). “Matchbox: large scale online Bayesian recommendations.” In Proceedings of the 18th International Conference on World Wide Web, 111–120.
  • Storkey, A. J. (2000). “Dynamic Trees: A Structured Variational Method Giving Efficient Propagation Rules.” In Conference on Uncertainty in Artificial Intelligence (UAI).
  • Teh, Y., Newman, D., and Welling, M. (2006). “A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation.” Advances in Neural Information Processing Systems, 19: 1353–1360.
  • Turner, R. E., Berkes, P., and Sahani, M. (2008). “Two problems with variational expectation maximisation for time-series models.” In Proceedings of the Workshop on Inference and Estimation in Probabilistic Time-Series Models, 107–115.
  • Wainwright, M. J. and Jordan, M. I. (2008). “Graphical models, exponential families, and variational inference.” Foundations and Trends® in Machine Learning, 1(1-2): 1–305.
  • Winn, J. and Bishop, C. M. (2006). “Variational message passing.” Journal of Machine Learning Research, 6(1): 661.