Bayesian Analysis

A Stochastic Variational Framework for Fitting and Diagnosing Generalized Linear Mixed Models

Linda S. L. Tan and David J. Nott

Full-text: Open access

Abstract

In stochastic variational inference, the variational Bayes objective function is optimized using stochastic gradient approximation, where gradients computed on small random subsets of data are used to approximate the true gradient over the whole data set. This enables complex models to be fit to large data sets as data can be processed in mini-batches. In this article, we extend stochastic variational inference for conjugate-exponential models to nonconjugate models and present a stochastic nonconjugate variational message passing algorithm for fitting generalized linear mixed models that is scalable to large data sets. In addition, we show that diagnostics for prior-likelihood conflict, which are useful for Bayesian model criticism, can be obtained from nonconjugate variational message passing automatically, as an alternative to simulation-based Markov chain Monte Carlo methods. Finally, we demonstrate that for moderate-sized data sets, convergence can be accelerated by using the stochastic version of nonconjugate variational message passing in the initial stage of optimization before switching to the standard version.
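
To make the mini-batch scheme concrete, the sketch below (in Python with numpy) runs a generic stochastic gradient ascent on a variational Bayes objective for a toy Bayesian logistic regression. It illustrates only the generic idea described above, not the authors' nonconjugate variational message passing updates: the factorized Gaussian approximation, the reparameterization-trick gradient estimator, the prior scale, and the step-size constants are all illustrative assumptions.

    # A minimal sketch of stochastic variational inference via mini-batch
    # gradients; not the authors' algorithm. All constants are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: Bayesian logistic regression, y_i ~ Bernoulli(sigmoid(x_i' beta)).
    N, p = 5000, 3
    X = rng.normal(size=(N, p))
    beta_true = np.array([1.0, -2.0, 0.5])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

    # Factorized Gaussian approximation q(beta) = N(mu, diag(exp(2 * lam))).
    mu = np.zeros(p)
    lam = np.zeros(p)    # log standard deviations
    s0 = 10.0            # assumed prior scale: beta ~ N(0, s0^2 I)
    batch = 50

    for t in range(1, 2001):
        rho = 1.0 / (N * (t + 10.0) ** 0.7)   # Robbins-Monro step size (ad hoc tuning)
        idx = rng.choice(N, batch, replace=False)
        eps = rng.normal(size=p)
        b = mu + np.exp(lam) * eps            # reparameterization trick

        # Mini-batch log-likelihood gradient, rescaled by N/|S| so that it is
        # an unbiased estimate of the gradient over the whole data set.
        pr = 1.0 / (1.0 + np.exp(-X[idx] @ b))
        g = (N / batch) * (X[idx].T @ (y[idx] - pr)) - b / s0**2  # + log-prior term

        mu += rho * g                                  # d(ELBO)/d(mu)
        lam += rho * (g * np.exp(lam) * eps + 1.0)     # chain rule + entropy gradient

    print("variational posterior mean:", mu.round(2))

Rescaling the mini-batch sum by N/|S| makes the noisy gradient an unbiased estimate of the full-data gradient, and step sizes decaying like (t + tau)^(-kappa) with kappa in (0.5, 1] satisfy the classical Robbins and Monro (1951) conditions.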

Article information

Source
Bayesian Anal., Volume 9, Number 4 (2014), 963–1004.

Dates
First available in Project Euclid: 21 November 2014

Permanent link to this document
https://projecteuclid.org/euclid.ba/1416579187

Digital Object Identifier
doi:10.1214/14-BA885

Mathematical Reviews number (MathSciNet)
MR3293964

Zentralblatt MATH identifier
1327.62167

Keywords
Variational Bayes; stochastic approximation; nonconjugate variational message passing; conflict diagnostics; hierarchical models; identify divergent units

Citation

Tan, Linda S. L.; Nott, David J. A Stochastic Variational Framework for Fitting and Diagnosing Generalized Linear Mixed Models. Bayesian Anal. 9 (2014), no. 4, 963–1004. doi:10.1214/14-BA885. https://projecteuclid.org/euclid.ba/1416579187


References

  • Amari, S. (1998). “Natural gradient works efficiently in learning.” Neural Computation, 10: 251–276.
  • Attias, H. (1999). “Inferring parameters and structure of latent variable models by variational Bayes.” In Laskey, K. and Prade, H. (eds.), Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, 21–30. San Francisco, CA: Morgan Kaufmann.
  • Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
  • Booth, J. G. and Hobert, J. P. (1999). “Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm.” Journal of the Royal Statistical Society: Series B, 61: 265–285.
  • Bottou, L. and Le Cun, Y. (2005). “On-line learning for very large data sets.” Applied Stochastic Models in Business and Industry, 21: 137–151.
  • Bottou, L. and Bousquet, O. (2008). “The trade-offs of large scale learning.” In Platt, J. C., Koller, D., Singer, Y. and Roweis, S. (eds.), Advances in Neural Information Processing Systems 20, 161–168. Red Hook, NY: Curran Associates, Inc.
  • Box, G. E. P. and Tiao, G. C. (1973). Bayesian inference in statistical analysis. MA: Addison-Wesley.
  • Breslow, N. E. and Clayton, D. G. (1993). “Approximate inference in generalized linear mixed models.” Journal of the American Statistical Association, 88: 9–25.
  • Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C. and Jordan, M. I. (2013). “Streaming variational Bayes.” In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z. and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 26, 1727–1735. Red Hook, NY: Curran Associates, Inc.
  • Diggle, P. J., Heagerty, P., Liang, K. and Zeger, S. L. (2002). Analysis of longitudinal data. UK: Oxford University Press, 2nd edition.
  • Donohue, M. C., Overholser, R., Xu, R. and Vaida, F. (2011). “Conditional Akaike information under generalized linear and proportional hazards mixed models.” Biometrika, 98: 685–700.
  • Evans, M. and Moshonov, H. (2006). “Checking for prior-data conflict.” Bayesian Analysis, 1: 893–914.
  • Farrell, P. J., Groshen, S., MacGibbon, B. and Tomberlin, T. J. (2010). “Outlier detection for a hierarchical Bayes model in a study of hospital variation in surgical procedures.” Statistical Methods in Medical Research, 19: 601–619.
  • Fong, Y., Rue, H. and Wakefield, J. (2010). “Bayesian inference for generalised linear mixed models.” Biostatistics, 11: 397–412.
  • Gelfand, A. E., Sahu, S. K. and Carlin, B. P. (1995). “Efficient parametrisations for normal linear mixed models.” Biometrika, 82: 479–488.
  • ––– (1996). “Efficient parametrizations for generalized linear mixed models.” In Bernardo, J. M., Berger, J. O., Dawid, A. P. and Smith, A. F. M. (eds.), Bayesian Statistics 5, 165–180. Oxford: Clarendon Press.
  • Ghahramani, Z. and Beal, M. J. (2001). “Propagation algorithms for variational Bayesian learning.” In Leen, T. K., Dietterich, T. G. and Tresp, V. (eds.), Advances in Neural Information Processing Systems 13, 507–513. Cambridge, MA: MIT Press.
  • Greenberg, E. R., Baron, J. A., Stevens, M. M., Stukel, T. A., Mandel, J. S., Spencer, S. K., Elias, P. M., Lowe, N., Nierenberg, D. N., Bayrd, G. and Vance, J. C. (1989). “The skin cancer prevention study: design of a clinical trial of beta-carotene among persons at high risk for nonmelanoma skin cancer.” Controlled Clinical Trials, 10: 153–166.
  • Hoffman, M. D., Blei, D. M. and Bach, F. (2010). “Online learning for latent Dirichlet allocation.” In Lafferty, J., Williams, C., Shawe-Taylor, J., Zemel, R. and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, 856–864. Red Hook, NY: Curran Associates, Inc.
  • Hoffman, M. D., Blei, D. M., Wang, C. and Paisley, J. (2013). “Stochastic variational inference.” Journal of Machine Learning Research, 14: 1303–1347.
  • Honkela, A., Tornio, M., Raiko, T. and Karhunen, J. (2008). “Natural conjugate gradient in variational inference.” In Ishikawa, M., Doya, K., Miyamoto, H. and Yamakawa, T. (eds.), Neural Information Processing, 305–314. Berlin: Springer-Verlag.
  • Hosmer, D. W., Lemeshow, S. and Sturdivant, R. X. (2013). Applied Logistic Regression. Hoboken, New Jersey: John Wiley & Sons Inc., 3rd edition.
  • Huang, A. and Wand, M. P. (2013). “Simple marginally noninformative prior distributions for covariance matrices.” Bayesian Analysis, 8: 439–452.
  • Ibrahim, J. G. and Laud, P. W. (1991). “On Bayesian analysis of generalized linear models using Jeffreys’s prior.” Journal of the American Statistical Association, 86: 981–986.
  • Jank, W. (2006). “Implementing and diagnosing the stochastic approximation EM algorithm.” Journal of Computational and Graphical Statistics, 15: 803–829.
  • Ji, C., Shen, H. and West, M. (2010). “Bounded approximations for marginal likelihoods.” Available at http://ftp.stat.duke.edu/WorkingPapers/10-05.pdf.
  • Kass, R. E. and Natarajan, R. (2006). “A default conjugate prior for variance components in generalized linear mixed models (Comment on article by Browne and Draper).” Bayesian Analysis, 1: 535–542.
  • Knowles, D. A. and Minka, T. P. (2011). “Non-conjugate variational message passing for multinomial and binary regression.” In Shawe-Taylor, J., Zemel, R. S., Bartlett, P., Pereira, F. and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, 1701–1709. Red Hook, NY: Curran Associates, Inc.
  • Liang, F., Cheng, Y., Song, Q., Park, J. and Yang, P. (2013). “A resampling-based stochastic approximation method for analysis of large geostatistical data.” Journal of the American Statistical Association, 108: 325–339.
  • Liu, Q. and Pierce, D. A. (1994). “A note on Gauss-Hermite quadrature.” Biometrika, 81: 624–629.
  • Lunn, D., Spiegelhalter, D., Thomas, A. and Best, N. (2009). “The BUGS project: Evolution, critique and future directions.” Statistics in Medicine, 28: 3049–3067.
  • Luts, J., Broderick, T. and Wand, M. P. (2013). “Real-time semiparametric regression.” Journal of Computational and Graphical Statistics (to appear).
  • Magnus, J. R. and Neudecker, H. (1988). Matrix differential calculus with applications in statistics and econometrics. Chichester, UK: Wiley.
  • Marshall, E. C. and Spiegelhalter, D. J. (2007). “Identifying outliers in Bayesian hierarchical models: a simulation-based approach.” Bayesian Analysis, 2: 409–444.
  • Nott, D. J., Tan, S. L., Villani, M. and Kohn, R. (2012). “Regression density estimation with variational methods and stochastic approximation.” Journal of Computational and Graphical Statistics, 21: 797–820.
  • Nott, D. J., Tran, M.-N., Kuk, A. Y. C. and Kohn, R. (2013). “Efficient variational inference for generalized linear mixed models with large datasets.” arXiv:1307.7963.
  • Ormerod, J. T. and Wand, M. P. (2010). “Explaining variational approximations.” The American Statistician, 64: 140–153.
  • ––– (2012). “Gaussian variational approximate inference for generalized linear mixed models.” Journal of Computational and Graphical Statistics, 21: 2–17.
  • Overstall, A. M. and Forster, J. J. (2010). “Default Bayesian model determination methods for generalised linear mixed models.” Computational Statistics and Data Analysis, 54: 3269–3288.
  • Paisley, J., Blei, D. M. and Jordan, M. I. (2012). “Variational Bayesian inference with stochastic search.” In Langford, J. and Pineau, J. (eds.), Proceedings of the 29th International Conference on Machine Learning, 1367–1374. Madison, WI: Omnipress.
  • Papaspiliopoulos, O., Roberts, G. O. and Sköld, M. (2003). “Non-centered parametrizations for hierarchical models and data augmentation.” In Bernardo, J. M., Bayarri, M. J., Berger, J. O., Dawid, A. P., Heckerman, D. A., Smith, A. F. M. and West, M. (eds.), Bayesian Statistics 7, 307–326. New York: Oxford University Press.
  • ––– (2007). “A general framework for the parametrization of hierarchical models.” Statistical Science, 22: 59–73.
  • Petris, G. and Tardella, L. (2003). “A geometric approach to transdimensional Markov chain Monte Carlo.” The Canadian Journal of Statistics, 31: 469–482.
  • Polyak, B. T. and Juditsky, A. B. (1992). “Acceleration of stochastic approximation by averaging.” SIAM Journal on Control and Optimization, 30: 838–855.
  • Presanis, A. M., Ohlssen, D., Spiegelhalter, D. J. and De Angelis, D. (2013). “Conflict diagnostics in directed acyclic graphs, with applications in Bayesian evidence synthesis.” Statistical Science, 28: 376–397.
  • Ranganath, R., Wang, C., Blei, D. M. and Xing, E. P. (2013). “An adaptive learning rate for stochastic variational inference.” In Dasgupta, S. and McAllester, D. (eds.), JMLR W&CP: Proceedings of the 30th International Conference on Machine Learning, 28: 298–306.
  • Raudenbush, S. W., Yang, M. L. and Yosef, M. (2000). “Maximum likelihood for generalized linear models with nested random effects via high-order, multivariate Laplace approximation.” Journal of Computational and Graphical Statistics, 9: 141–157.
  • Robbins, H. and Monro, S. (1951). “A stochastic approximation method.” Annals of Mathematical Statistics, 22: 400–407.
  • Roux, N. L., Schmidt, M. and Bach, F. (2012). “A stochastic gradient method with an exponential convergence rate for finite training sets.” In Pereira, F., Burges, C. J. C., Bottou, L. and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 25, 2663–2671. Red Hook, NY: Curran Associates, Inc.
  • Salimans, T. and Knowles, D. A. (2013). “Fixed-form variational posterior approximation through stochastic linear regression.” Bayesian Analysis, 8: 837–882.
  • Sato, M. (2001). “Online model selection based on the variational Bayes.” Neural Computation, 13: 1649–1681.
  • Scheel, I., Green, P. J. and Rougier, J. C. (2011). “A graphical diagnostic for identifying influential model choices in Bayesian hierarchical models.” Scandinavian Journal of Statistics, 38: 529–550.
  • Spall, J. C. (2003). Introduction to stochastic search and optimization: estimation, simulation and control. New Jersey: Wiley.
  • Sturtz, S., Ligges, U., and Gelman, A. (2005). “R2WinBUGS: A package for running WinBUGS from R.” Journal of Statistical Software, 12: 1–16.
  • Tan, L. S. L. and Nott, D. J. (2013). “Variational inference for generalized linear mixed models using partially noncentered parametrizations.” Statistical Science, 28: 168–188.
  • Thall, P. F. and Vail, S. C. (1990). “Some covariance models for longitudinal count data with overdispersion.” Biometrics, 46: 657–671.
  • Thara, R., Henrietta, M., Joseph, A., Rajkumar, S. and Eaton, W. (1994). “Ten year course of schizophrenia - the Madras longitudinal study.” Acta Psychiatrica Scandinavica, 90: 329–336.
  • Tseng, P. (1998). “An incremental gradient(-projection) method with momentum term and adaptive stepsize rule.” SIAM Journal on Optimization, 8: 506–531.
  • Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. NY: Springer, 4th edition.
  • Wand, M. P. (2013). “Fully simplified multivariate normal updates in non-conjugate variational message passing.” Available at http://www.uow.edu.au/~mwand/fsupap.pdf.
  • Wang, C., Paisley, J. and Blei, D. M. (2011). “Online variational inference for the hierarchical Dirichlet process.” In Gordon, G., Dunson, D. and Dudik, M. (eds.), JMLR W&CP: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 15: 752–760.
  • Wang, B. and Titterington, D. M. (2005). “Inadequacy of interval estimates corresponding to variational Bayesian approximations.” In Cowell, R. G. and Ghahramani, Z. (eds.), Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 373–380. Society for Artificial Intelligence and Statistics.
  • Winn, J. and Bishop, C. M. (2005). “Variational message passing.” Journal of Machine Learning Research, 6: 661–694.
  • Xiao, L. (2010). “Dual averaging methods for regularized stochastic learning and online optimization.” Journal of Machine Learning Research, 11: 2543–2596.
  • Zhao, H. and Marriott, P. (2013). “Diagnostics for variational Bayes approximations.” arXiv:1309.5117.
  • Zhu, H. T. and Lee, S. Y. (2002). “Analysis of generalized linear mixed models via a stochastic approximation algorithm with Markov chain Monte Carlo method.” Statistics and Computing, 12: 175–183.