## Statistical Science

### Bayesian Methods for Neural Networks and Related Models

D. M. Titterington

#### Abstract

Models such as feed-forward neural networks and certain other structures investigated in the computer science literature are not amenable to closed-form Bayesian analysis. The paper reviews the various approaches taken to overcome this difficulty, involving the use of Gaussian approximations, Markov chain Monte Carlo simulation routines and a class of non-Gaussian but “deterministic” approximations called variational approximations.
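The first of the three approaches mentioned, the Gaussian (Laplace) approximation, can be illustrated with a minimal sketch: for a toy one-weight Bayesian logistic regression (the data and model below are hypothetical, not from the paper), the posterior over the weight is approximated by a normal distribution centred at the posterior mode, with variance given by the inverse negative Hessian of the log-posterior there.

```python
import math

# Hypothetical toy data: (x, y) pairs for a 1-D logistic regression.
data = [(-2.0, 0), (-1.0, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_post_grad_hess(w, prior_var=1.0):
    """Gradient and Hessian of the log-posterior for weight w,
    with a N(0, prior_var) prior and a Bernoulli likelihood."""
    g = -w / prior_var           # gradient of the Gaussian log-prior
    h = -1.0 / prior_var         # Hessian of the Gaussian log-prior
    for x, y in data:
        p = sigmoid(w * x)
        g += (y - p) * x         # gradient of the log-likelihood
        h -= p * (1.0 - p) * x * x
    return g, h

def laplace_approximation(w=0.0, iters=50):
    """Newton iterations to the MAP estimate w_MAP, then return
    the Gaussian approximation N(w_MAP, -1/H) to the posterior."""
    for _ in range(iters):
        g, h = log_post_grad_hess(w)
        w -= g / h
    _, h = log_post_grad_hess(w)
    return w, -1.0 / h           # approximate posterior mean and variance

w_map, var = laplace_approximation()
```

In neural-network settings the same construction is applied to the full weight vector, with the Hessian of the log-posterior evaluated at the mode; MCMC and variational methods replace this single-Gaussian summary when it is too crude.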

#### Article information

Source
Statist. Sci., Volume 19, Number 1 (2004), 128–139.

Dates
First available in Project Euclid: 14 July 2004

https://projecteuclid.org/euclid.ss/1089808278

Digital Object Identifier
doi:10.1214/088342304000000099

Mathematical Reviews number (MathSciNet)
MR2082152

Zentralblatt MATH identifier
1057.62078

#### Citation

Titterington, D. M. Bayesian Methods for Neural Networks and Related Models. Statist. Sci. 19 (2004), no. 1, 128--139. doi:10.1214/088342304000000099. https://projecteuclid.org/euclid.ss/1089808278
