Electronic Journal of Statistics

Simultaneous variable selection and component selection for regression density estimation with mixtures of heteroscedastic experts

Minh-Ngoc Tran, David J. Nott, and Robert Kohn


This paper is concerned with the problem of flexibly estimating the conditional density of a response variable given covariates. In our approach the density is modeled as a mixture of heteroscedastic normals, with the means, variances and mixing probabilities all varying smoothly as functions of the covariates. We use the variational Bayes approach and propose a novel fast algorithm for simultaneous covariate selection, component selection and parameter estimation. Our method mitigates the local-maxima problem inherent in mixture model fitting and is applicable to high-dimensional settings where the number of covariates can exceed the sample size. In the special case of the classical regression model, the proposed algorithm reduces to a procedure similar to currently used greedy algorithms, while retaining many attractive properties and remaining efficient in high-dimensional problems. The methodology is demonstrated through simulated and real examples.
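The conditional density described above can be sketched numerically. The parameterisation below (linear expert means, log-linear variances, multinomial-logit gating) is an illustrative assumption rather than the paper's exact specification, and the arrays `beta`, `gamma`, `delta` are hypothetical parameters introduced only for this sketch:

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over the last axis."""
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def conditional_density(y, x, beta, gamma, delta):
    """Evaluate p(y | x) for a K-component mixture of heteroscedastic
    normal experts.  Illustrative parameterisation (assumed, not the
    paper's exact model): for component k,
        mean      mu_k(x)     = x @ beta[k]
        variance  sigma2_k(x) = exp(x @ gamma[k])
        weight    pi_k(x)     = softmax_k(x @ delta[k])
    so means, variances and mixing probabilities all vary with x.
    """
    mu = x @ beta.T                 # (K,) component means at x
    sigma2 = np.exp(x @ gamma.T)    # (K,) component variances at x
    pi = softmax(x @ delta.T)       # (K,) mixing probabilities at x
    dens = np.exp(-0.5 * (y - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return float(pi @ dens)

# Usage: two components, covariate vector x = (intercept, x1).
x = np.array([1.0, 0.5])
beta = np.array([[0.0, 1.0], [2.0, -1.0]])   # means 0.5 and 1.5 at this x
gamma = np.zeros((2, 2))                      # unit variances at this x
delta = np.zeros((2, 2))                      # equal mixing weights (0.5, 0.5)
p = conditional_density(0.5, x, beta, gamma, delta)  # ≈ 0.3205
```

Covariate selection in this setting amounts to deciding which columns of `x` enter each of `beta`, `gamma` and `delta`, and component selection to choosing K; the paper's variational algorithm does both jointly.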

Article information

Electron. J. Statist. Volume 6 (2012), 1170-1199.

First available in Project Euclid: 29 June 2012


Primary: 62G07: Density estimation
Secondary: 62G08: Nonparametric regression

Keywords: Bayesian model selection; heteroscedasticity; mixture of normals; variational approximation


Tran, Minh-Ngoc; Nott, David J.; Kohn, Robert. Simultaneous variable selection and component selection for regression density estimation with mixtures of heteroscedastic experts. Electron. J. Statist. 6 (2012), 1170--1199. doi:10.1214/12-EJS705. http://projecteuclid.org/euclid.ejs/1340974140.


