## Electronic Journal of Statistics

### Consistency of variational Bayes inference for estimation and model selection in mixtures

#### Abstract

Mixture models are widely used in Bayesian statistics and machine learning, with applications in computational biology, natural language processing and many other fields. Variational inference, a technique that approximates intractable posteriors by means of optimization algorithms, is extremely popular in practice for complex models such as mixtures. The contribution of this paper is two-fold. First, we study the concentration of variational approximations of posteriors, which is still an open problem for general mixtures, and we derive consistency and rates of convergence. Second, we tackle the problem of selecting the number of components: we study the approach already used in practice, which consists of maximizing a numerical criterion (the Evidence Lower Bound). We prove that this strategy indeed leads to strong oracle inequalities. We illustrate our theoretical results with applications to Gaussian and multinomial mixtures.
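For readers unfamiliar with the criterion named in the abstract, the following is a minimal sketch of the Evidence Lower Bound in its standard textbook form, not necessarily the exact criterion analyzed in the paper (which may, for instance, involve a tempered likelihood or a prior weight on the number of components). All notation here is assumed for illustration: $X_1,\dots,X_n$ are the observations, $p_\theta$ is the likelihood of a $K$-component mixture with parameter $\theta$, $\pi_K$ is the prior on $\theta$, $\mathcal{F}_K$ is the variational family, and $K_{\max}$ is a hypothetical upper limit on the number of components considered.

$$
\mathrm{ELBO}(K) \;=\; \sup_{\rho \in \mathcal{F}_K} \left\{ \int \log p_\theta(X_1,\dots,X_n)\,\rho(\mathrm{d}\theta) \;-\; \mathrm{KL}(\rho \,\|\, \pi_K) \right\} \;\le\; \log \int p_\theta(X_1,\dots,X_n)\,\pi_K(\mathrm{d}\theta),
$$

$$
\hat{K} \;\in\; \operatorname*{arg\,max}_{1 \le K \le K_{\max}} \mathrm{ELBO}(K).
$$

The inequality explains the name: for every $K$, the criterion is a lower bound on the log marginal likelihood (the evidence), so maximizing it over $K$ acts as a tractable surrogate for comparing the evidence of the candidate models.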

#### Article information

Source
Electron. J. Statist., Volume 12, Number 2 (2018), 2995-3035.

Dates
First available in Project Euclid: 19 September 2018

https://projecteuclid.org/euclid.ejs/1537344604

Digital Object Identifier
doi:10.1214/18-EJS1475

Mathematical Reviews number (MathSciNet)
MR3855643

Zentralblatt MATH identifier
06942964

#### Citation

Chérief-Abdellatif, Badr-Eddine; Alquier, Pierre. Consistency of variational Bayes inference for estimation and model selection in mixtures. Electron. J. Statist. 12 (2018), no. 2, 2995–3035. doi:10.1214/18-EJS1475. https://projecteuclid.org/euclid.ejs/1537344604

#### Supplemental materials

• Supplement to “Consistency of variational Bayes inference for estimation and model selection in mixtures”. The supplementary material zip contains the description of a short simulation study (supplement.pdf) and the notebook used for the simulation study (supplement.ipynb).