Electronic Journal of Statistics

Online natural gradient as a Kalman filter

Yann Ollivier

Full-text: Open access

Abstract

We cast Amari’s natural gradient in statistical learning as a specific case of Kalman filtering. Namely, applying an extended Kalman filter to estimate a fixed unknown parameter of a probabilistic model from a series of observations is rigorously equivalent to estimating this parameter via an online stochastic natural gradient descent on the log-likelihood of the observations.
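
For concreteness, the online natural gradient step referred to here takes the following standard form (a sketch in generic notation, assuming the usual definitions; the paper's precise statement and notation may differ). Writing the online log-loss as $\ell_t(\theta) = -\ln p(y_t \mid \theta)$,

    \theta_{t+1} = \theta_t - \eta_t \, J_t^{-1} \nabla_\theta \ell_t(\theta_t),
    \qquad
    J_t \approx \mathbb{E}_{y \sim p(\cdot \mid \theta_t)} \left[ \nabla_\theta \ln p(y \mid \theta_t) \, \nabla_\theta \ln p(y \mid \theta_t)^{\top} \right],

where $\eta_t$ is a step size and $J_t$ an online estimate of the Fisher information matrix. In the information-filter phrasing of the extended Kalman filter (mentioned below), the inverse posterior covariance plays the role of such an accumulated Fisher estimate.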

In the i.i.d. case, this relation is a consequence of the “information filter” phrasing of the extended Kalman filter. In the recurrent (state space, non-i.i.d.) case, we prove that the joint Kalman filter over states and parameters is a natural gradient on top of real-time recurrent learning (RTRL), a classical algorithm to train recurrent models.
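
As a reminder of the second ingredient, in generic notation only (an illustrative sketch; the paper's setting may differ): for a recurrent model with state $s_t = F(s_{t-1}, x_t, \theta)$ and a loss $\ell_t$ depending on $\theta$ only through $s_t$, RTRL maintains the Jacobian $G_t = \partial s_t / \partial \theta$ online via

    G_t = \frac{\partial F}{\partial s}(s_{t-1}, x_t, \theta)\, G_{t-1} + \frac{\partial F}{\partial \theta}(s_{t-1}, x_t, \theta),
    \qquad
    \nabla_\theta \ell_t = \frac{\partial \ell_t}{\partial s_t}\, G_t,

so that parameter gradients are available at every time step without backpropagating through time. The statement above says that the joint Kalman filter amounts to applying a natural-gradient preconditioning to this online gradient.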

This exact algebraic correspondence provides relevant interpretations for natural gradient hyperparameters such as learning rates or initialization and regularization of the Fisher information matrix.
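
To make these hyperparameters concrete, the following Python sketch runs a generic online natural gradient loop. It is illustrative only, not the paper's algorithm: the empirical rank-one Fisher update and the names grad_log_lik, fisher_init, damping and fisher_decay are assumptions introduced here for the example.

    import numpy as np

    def online_natural_gradient(grad_log_lik, observations, theta0,
                                lr=1.0, fisher_init=1.0, damping=1e-8,
                                fisher_decay=0.0):
        """Illustrative online natural gradient loop (not the paper's exact algorithm).

        grad_log_lik(theta, y): gradient of ln p(y | theta) with respect to theta.
        lr          : learning rate multiplying each natural gradient step.
        fisher_init : scale of the initial Fisher estimate J_0 = fisher_init * I.
        damping     : ridge regularization keeping the Fisher estimate invertible.
        fisher_decay: forgetting factor discounting old observations (0 = none).
        """
        theta = np.asarray(theta0, dtype=float)
        dim = theta.size
        J = fisher_init * np.eye(dim)            # running Fisher estimate
        for y in observations:
            g = grad_log_lik(theta, y)           # score of the current observation
            # Rank-one (empirical) Fisher update; the decay discounts older terms.
            J = (1.0 - fisher_decay) * J + np.outer(g, g)
            # Natural gradient ascent step on the log-likelihood.
            theta = theta + lr * np.linalg.solve(J + damping * np.eye(dim), g)
        return theta

In this reading, the Fisher initialization fisher_init * I acts like the inverse covariance of a prior on the parameter, the damping term plays a similar regularizing role, and since J accumulates one rank-one term per observation the effective step size decays roughly like 1/t; these are the kinds of interpretations the correspondence makes precise.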

Article information

Source
Electron. J. Statist., Volume 12, Number 2 (2018), 2930-2961.

Dates
Received: June 2017
First available in Project Euclid: 18 September 2018

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1537257630

Digital Object Identifier
doi:10.1214/18-EJS1468

Subjects
Primary: 68T05: Learning and adaptive systems [See also 68Q32, 91E40]; 65K10: Optimization and variational techniques [See also 49Mxx, 93B40]
Secondary: 93E35: Stochastic learning and adaptive control; 90C26: Nonconvex programming, global optimization; 93E11: Filtering [See also 60G35]; 49M15: Newton-type methods

Keywords
Statistical learning, natural gradient, Kalman filter, stochastic gradient descent

Rights
Creative Commons Attribution 4.0 International License.

Citation

Ollivier, Yann. Online natural gradient as a Kalman filter. Electron. J. Statist. 12 (2018), no. 2, 2930–2961. doi:10.1214/18-EJS1468. https://projecteuclid.org/euclid.ejs/1537257630


References

  • [Ama98] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Comput., 10:251–276, February 1998.
  • [AN00] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191 of Translations of Mathematical Monographs. American Mathematical Society, Providence, RI, 2000. Translated from the 1993 Japanese original by Daishi Harada.
  • [APF00] Shun-ichi Amari, Hyeyoung Park, and Kenji Fukumizu. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12(6):1399–1409, 2000.
  • [Ber96] Dimitri P. Bertsekas. Incremental least squares methods and the extended Kalman filter. SIAM Journal on Optimization, 6(3):807–822, 1996.
  • [Bis06] Christopher M. Bishop. Pattern recognition and machine learning. Springer, 2006.
  • [BL03] Léon Bottou and Yann LeCun. Large scale online learning. In NIPS, volume 30, page 77, 2003.
  • [BRD97] M. Boutayeb, H. Rafaralahy, and M. Darouach. Convergence analysis of the extended Kalman filter used as an observer for nonlinear deterministic discrete-time systems. IEEE Transactions on Automatic Control, 42(4):581–586, 1997.
  • [BV04] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
  • [dFNG00] João F. G. de Freitas, Mahesan Niranjan, and Andrew H. Gee. Hierarchical Bayesian models for regularization in sequential learning. Neural Computation, 12(4):933–953, 2000.
  • [GA15] Mohinder S. Grewal and Angus P. Andrews. Kalman filtering: Theory and practice using MATLAB. Wiley, 4th edition, 2015.
  • [GHL87] S. Gallot, D. Hulin, and J. Lafontaine. Riemannian geometry. Universitext. Springer-Verlag, Berlin, 1987.
  • [GS15] Roger B. Grosse and Ruslan Salakhutdinov. Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In ICML, pages 2304–2313, 2015.
  • [Hay01] Simon Haykin. Kalman filtering and neural networks. John Wiley & Sons, 2001.
  • [Jae02] Herbert Jaeger. Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the “echo state network” approach. Technical Report 159, German National Research Center for Information Technology, 2002.
  • [Jaz70] Andrew H. Jazwinski. Stochastic processes and filtering theory. Academic Press, 1970.
  • [Kul97] Solomon Kullback. Information theory and statistics. Dover Publications Inc., Mineola, NY, 1997. Reprint of the second (1968) edition.
  • [LCL+17] Yubo Li, Yongqiang Cheng, Xiang Li, Xiaoqiang Hua, and Yuliang Qin. Information geometric approach to recursive update in nonlinear filtering. Entropy, 19(2):54, 2017.
  • [LMB07] Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems 20: Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3–6, 2007, pages 849–856, 2007.
  • [LS83] Lennart Ljung and Torsten Söderström. Theory and Practice of Recursive Identification. MIT Press, 1983.
  • [Mar14] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
  • [MCO16] Gaétan Marceau-Caron and Yann Ollivier. Practical Riemannian neural networks. arXiv preprint arXiv:1602.08007, 2016.
  • [MG15] James Martens and Roger B. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, pages 2408–2417, 2015.
  • [OAAH17] Yann Ollivier, Ludovic Arnold, Anne Auger, and Nikolaus Hansen. Information-geometric optimization algorithms: A unifying picture via invariance principles. Journal of Machine Learning Research, 18(18):1–65, 2017.
  • [Oll15] Yann Ollivier. Riemannian metrics for neural networks I: feedforward networks. Information and Inference, 4(2):108–153, 2015.
  • [OTC15] Yann Ollivier, Corentin Tallec, and Guillaume Charpiat. Training recurrent networks online without backtracking. arXiv preprint arXiv:1507.07680, 2015.
  • [Pat16] Vivak Patel. Kalman-based stochastic gradient method with stop condition and insensitivity to conditioning. SIAM Journal on Optimization, 26(4):2620–2648, 2016.
  • [PB13] Razvan Pascanu and Yoshua Bengio. Natural gradient revisited. CoRR, abs/1301.3584, 2013.
  • [PJ92] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
  • [RRK+92] Dennis W. Ruck, Steven K. Rogers, Matthew Kabrisky, Peter S. Maybeck, and Mark E. Oxley. Comparative analysis of backpropagation and the extended Kalman filter for training multilayer perceptrons. IEEE Transactions on Pattern Analysis & Machine Intelligence, (6):686–691, 1992.
  • [Sim06] Dan Simon. Optimal state estimation: Kalman, $H_\infty$, and nonlinear approaches. John Wiley & Sons, 2006.
  • [ŠKT01] Miroslav Šimandl, Jakub Královec, and Petr Tichavský. Filtering, predictive, and smoothing Cramér–Rao bounds for discrete-time nonlinear dynamic systems. Automatica, 37(11):1703–1716, 2001.
  • [SW88] Sharad Singhal and Lance Wu. Training multilayer perceptrons with the extended Kalman algorithm. In NIPS, pages 133–140, 1988.
  • [Sä13] Simo Särkkä. Bayesian filtering and smoothing. Cambridge University Press, 2013.
  • [TO17] Corentin Tallec and Yann Ollivier. Unbiased online recurrent optimization. arXiv preprint arXiv:1702.05043, 2017.
  • [vdV00] A. W. van der Vaart. Asymptotic statistics. Cambridge University Press, 2000.
  • [Wil92] Ronald J. Williams. Training recurrent networks using the extended Kalman filter. In International Joint Conference on Neural Networks (IJCNN 1992), volume 4, pages 241–246. IEEE, 1992.
  • [WN96] Eric A. Wan and Alex T. Nelson. Dual Kalman filtering methods for nonlinear prediction, smoothing and estimation. In NIPS, pages 793–799, 1996.