## The Annals of Statistics

### Hierarchical mixtures-of-experts for exponential family regression models: approximation and maximum likelihood estimation

#### Abstract

We consider hierarchical mixtures-of-experts (HME) models in which exponential family regression models with generalized linear mean functions of the form $\psi(\alpha + \mathbf{x}^T \boldsymbol{\beta})$ are mixed, where $\psi(\cdot)$ is the inverse link function. Suppose the true response $y$ follows an exponential family regression model whose mean function belongs to a class of smooth functions of the form $\psi(h(\mathbf{x}))$, where $h(\cdot) \in W_{2; K_0}^{\infty}$ (a Sobolev class over $[0, 1]^s$). It is shown that the HME probability density functions can approximate the true density at a rate of $O(m^{-2/s})$ in Hellinger distance and at a rate of $O(m^{-4/s})$ in Kullback–Leibler divergence, where $m$ is the number of experts and $s$ is the dimension of the predictor $\mathbf{x}$. We also provide conditions under which the mean-square error of the estimated mean response obtained from the maximum likelihood method converges to zero as the sample size and the number of experts both increase.
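To fix ideas, the one-level (non-hierarchical) member of this family mixes $m$ generalized linear experts through softmax gating weights. The display below is a minimal sketch of that case, with the hierarchical version obtained by nesting such gates; the gating parameterization shown (with illustrative parameters $v_{j0}$, $\mathbf{v}_j$) is the standard one from the mixtures-of-experts literature, not necessarily the paper's exact notation. Here $\pi(y; h)$ denotes a one-parameter exponential family density with mean $h$:

$$
f(y \mid \mathbf{x}) = \sum_{j=1}^{m} g_j(\mathbf{x})\, \pi\bigl(y;\, \psi(\alpha_j + \mathbf{x}^T \boldsymbol{\beta}_j)\bigr),
\qquad
g_j(\mathbf{x}) = \frac{\exp(v_{j0} + \mathbf{x}^T \mathbf{v}_j)}{\sum_{k=1}^{m} \exp(v_{k0} + \mathbf{x}^T \mathbf{v}_k)}.
$$

For reference, the two discrepancy measures appearing in the approximation rates are, for densities $f$ and $g$ (up to the usual variations in normalizing constants),

$$
D_H(f, g) = \left\{ \int \bigl( \sqrt{f} - \sqrt{g} \bigr)^2 \right\}^{1/2},
\qquad
D_{KL}(f \,\|\, g) = \int f \log \frac{f}{g}.
$$

The two rates are mutually consistent, since the squared Hellinger distance is bounded above by the Kullback–Leibler divergence: a rate of $O(m^{-4/s})$ in $D_{KL}$ implies a rate of $O(m^{-2/s})$ in $D_H$.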

#### Article information

**Source**
Ann. Statist., Volume 27, Number 3 (1999), 987–1011.

**Dates**
First available in Project Euclid: 5 April 2002

**Permanent link to this document**
https://projecteuclid.org/euclid.aos/1018031265

**Digital Object Identifier**
doi:10.1214/aos/1018031265

**Mathematical Reviews number (MathSciNet)**
MR1724038

**Zentralblatt MATH identifier**
0957.62032

**Subjects**
Primary: 62G07: Density estimation
Secondary: 41A25: Rate of convergence, degree of approximation

#### Citation

Jiang, Wenxin; Tanner, Martin A. Hierarchical mixtures-of-experts for exponential family regression models: approximation and maximum likelihood estimation. Ann. Statist. 27 (1999), no. 3, 987–1011. doi:10.1214/aos/1018031265. https://projecteuclid.org/euclid.aos/1018031265
