The Annals of Statistics

Hierarchical mixtures-of-experts for exponential family regression models: approximation and maximum likelihood estimation

Wenxin Jiang and Martin A. Tanner


We consider hierarchical mixtures-of-experts (HME) models where exponential family regression models with generalized linear mean functions of the form $\psi(\alpha + \mathbf{x}^T \boldsymbol{\beta})$ are mixed. Here $\psi(\cdot)$ is the inverse link function. Suppose the true response $y$ follows an exponential family regression model with mean function belonging to a class of smooth functions of the form $\psi(h(\mathbf{x}))$ where $h(\cdot)\in W_{2; K_0}^{\infty}$ (a Sobolev class over $[0, 1]^s$). It is shown that the HME probability density functions can approximate the true density, at a rate of $O(m^{-2/s})$ in Hellinger distance and at a rate of $O(m^{-4/s})$ in Kullback–Leibler divergence, where $m$ is the number of experts, and $s$ is the dimension of the predictor $\mathbf{x}$. We also provide conditions under which the mean-square error of the estimated mean response obtained from the maximum likelihood method converges to zero, as the sample size and the number of experts both increase.
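The model class in the abstract mixes generalized linear experts with means $\psi(\alpha_j + \mathbf{x}^T \boldsymbol{\beta}_j)$ using input-dependent mixing weights. As a rough illustration (not the paper's exact hierarchical formulation), a one-level mixture-of-experts mean response can be sketched as follows; the function and parameter names here are hypothetical, and a softmax gate with a log link (Poisson-type expert, $\psi = \exp$) is assumed:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_mean(x, gate_w, gate_b, alpha, beta, psi=np.exp):
    """One-level mixture-of-experts mean response (illustrative sketch).

    x      : (s,)   predictor
    gate_w : (m, s) gating weights;  gate_b : (m,) gating intercepts
    alpha  : (m,)   expert intercepts;  beta : (m, s) expert slopes
    psi    : inverse link function (exp corresponds to a log link)
    """
    g = softmax(gate_w @ x + gate_b)   # mixing proportions g_j(x), sum to 1
    mu = psi(alpha + beta @ x)         # expert means psi(alpha_j + x^T beta_j)
    return float(g @ mu)               # overall mean: sum_j g_j(x) * mu_j(x)
```

Because the gate outputs a convex combination, the overall mean always lies between the smallest and largest expert means at each $\mathbf{x}$; the HME architecture nests such gates hierarchically, and $m$ in the rates above counts the experts at the leaves.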

Article information

Ann. Statist., Volume 27, Number 3 (1999), 987-1011.

First available in Project Euclid: 5 April 2002


Primary: 62G07: Density estimation
Secondary: 41A25: Rate of convergence, degree of approximation

Keywords: Approximation rate; exponential family; generalized linear models; Hellinger distance; hierarchical mixtures-of-experts; Kullback–Leibler divergence; maximum likelihood estimation; mean square error


Jiang, Wenxin; Tanner, Martin A. Hierarchical mixtures-of-experts for exponential family regression models: approximation and maximum likelihood estimation. Ann. Statist. 27 (1999), no. 3, 987--1011. doi:10.1214/aos/1018031265.



  • Bickel, P. J. and Doksum, K. A. (1977). Mathematical Statistics. Prentice-Hall, Englewood Cliffs, NJ.
  • Bishop, C.M. (1995). Neural Networks for Pattern Recognition. Oxford Univ. Press.
  • Cacciatore, T. W. and Nowlan, S. J. (1994). Mixtures of controllers for jump linear and nonlinear plants. In Advances in Neural Information Processing Systems 6 (G. Tesauro, D. S. Touretzky and T. K. Leen, eds.). Morgan Kaufmann, San Mateo, CA.
  • Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation: The L1 View. Wiley, New York.
  • Fritsch, J., Finke, M. and Waibel, A. (1997). Adaptively growing hierarchical mixtures of experts. In Advances in Neural Information Processing Systems 9 (M. C. Mozer, M. I. Jordan and T. Petsche, eds.). MIT Press.
  • Ghahramani, Z. and Hinton, G. E. (1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, Dept. Computer Science, Univ. Toronto.
  • Haussler, D. and Opper, M. (1995). General bounds on the mutual information between a parameter and n conditionally independent observations. In Proceedings of the Eighth Annual Computational Learning Theory Conference (COLT), 1995, Santa Cruz, CA. ACM Press, New York.
  • Haykin, S. (1994). Neural Networks. Macmillan, New York.
  • Jaakkola, T. S. and Jordan, M. I. (1998). Improving the mean field approximation via the use of mixture distributions. In Learning in Graphical Models (M. I. Jordan, ed.). Kluwer, Dordrecht.
  • Jacobs, R. A., Jordan, M. I., Nowlan, S. J. and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Comp. 3 79-87.
  • Jennrich, R. I. (1969). Asymptotic properties of nonlinear least squares estimators. Ann. Math. Statist. 40 633-643.
  • Jiang, W. and Tanner, M. A. (1998). Hierarchical mixtures-of-experts for exponential family regression models: approximation and maximum likelihood estimation. Technical report, Dept. Statistics, Northwestern Univ., Evanston, IL.
  • Jiang, W. and Tanner, M. A. (1999). On the approximation rate of hierarchical mixtures-of-experts for generalized linear models. Neural Comp. To appear.
  • Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Comp. 6 181-214.
  • Jordan, M. I. and Xu, L. (1995). Convergence results for the EM approach to mixtures-of-experts architectures. Neural Networks 8 1409-1431.
  • Lehmann, E. L. (1991). Theory of Point Estimation. Wadsworth, Monterey, CA.
  • McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman and Hall, London.
  • Meilă, M. and Jordan, M. I. (1995). Learning fine motion by Markov mixtures of experts. A.I. Memo 1567, Artificial Intelligence Lab., Massachusetts Institute of Technology.
  • Mhaskar, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Comp. 8 164-177.
  • Peng, F., Jacobs, R. A. and Tanner, M. A. (1996). Bayesian inference in mixtures-of-experts and hierarchical mixtures-of-experts models with an application to speech recognition. J. Amer. Statist. Assoc. 91 953-960.
  • Tipping, M. E. and Bishop, C. M. (1997). Mixtures of probabilistic principal component analysers. Technical Report NCRG-97-003, Dept. Computer Science and Applied Mathematics, Aston Univ., Birmingham, UK.
  • Titterington, D. M., Smith, A. F. M. and Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley, New York.
  • White, H. (1994). Estimation, Inference and Specification Analysis. Cambridge Univ. Press.
  • Zeevi, A. and Meir, R. (1997). Density estimation through convex combinations: approximation and estimation bounds. Neural Networks 10 99-106.
  • Zeevi, A., Meir, R. and Maiorov, V. (1998). Error bounds for functional approximation and estimation using mixtures of experts. IEEE Trans. Information Theory 44 1010-1025.