Electronic Journal of Statistics

Stochastic heavy ball

Abstract

This paper deals with a natural stochastic optimization procedure derived from the so-called Heavy-ball method differential equation, which was introduced by Polyak in the 1960s with his seminal contribution [Pol64]. The Heavy-ball method is a second-order dynamics that was investigated to minimize convex functions $f$. The family of second-order methods recently received a large amount of attention, until the famous contribution of Nesterov [Nes83], leading to the explosion of large-scale optimization problems. This work provides an in-depth description of the stochastic heavy-ball method, which is an adaptation of the deterministic one when only unbiased evalutions of the gradient are available and used throughout the iterations of the algorithm. We first describe some almost sure convergence results in the case of general non-convex coercive functions $f$. We then examine the situation of convex and strongly convex potentials and derive some non-asymptotic results about the stochastic heavy-ball method. We end our study with limit theorems on several rescaled algorithms.

Article information

Source
Electron. J. Statist., Volume 12, Number 1 (2018), 461-529.

Dates
First available in Project Euclid: 19 February 2018

https://projecteuclid.org/euclid.ejs/1519030878

Digital Object Identifier
doi:10.1214/18-EJS1395

Mathematical Reviews number (MathSciNet)
MR3765604

Zentralblatt MATH identifier
06841011

Citation

Gadat, Sébastien; Panloup, Fabien; Saadane, Sofiane. Stochastic heavy ball. Electron. J. Statist. 12 (2018), no. 1, 461--529. doi:10.1214/18-EJS1395. https://projecteuclid.org/euclid.ejs/1519030878

References

• [1] F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression., Journal of Machine Learning Research, 15:595–627, 2014.
• [2] O. Brandière and M. Duflo. Les algorithmes stochastiques contournent-ils les pièges ?, Annales de l’I.H.P. Probabilités et Statistiques, 32:395–427, 1996.
• [3] M. Benaïm., Dynamics of stochastic approximation algorithms. Lecture Notes in Mathematics, Séminaire de Probabilités XXXIII. Springer-Verlag, 2006. Characterization and convergence.
• [4] M. Benaïm and M.W. Hirsh. Asymptotic pseudotrajectories and chain recurrent flows, with applications., J. Dynam. Differential Equations, 8:141–176, 1996.
• [5] P. Billingsley., Convergence of Probability Measures. Wiley series in Probability & Statistics, New York, 1995.
• [6] S. Boucheron, G. Lugosi, and P. Massart., Concentration inequalities. Oxford University Press, Oxford, 2013. A nonasymptotic theory of independence, With a foreword by M. Ledoux.
• [7] M. Benaïm, M. Ledoux, and O. Raimond. Self-interacting diffusions., Probab. Theory Related Fields, 122:1–41, 2002.
• [8] S. Burer and R. Monteiro. Local minima and convergence in low-rank semidefinite programming., Math. Program., 103(3, Ser. A):427–444, 2005.
• [9] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning., Advances in Neural Information Processing Systems (NIPS), 2011.
• [10] Vivek S. Borkar. Stochastic approximation with two time scales., Systems Control Lett., 29(5):291–294, 1997.
• [11] Vivek S. Borkar., Stochastic approximation. Cambridge University Press, Cambridge; Hindustan Book Agency, New Delhi, 2008. A dynamical systems viewpoint.
• [12] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In, Proceedings of COMPSTAT ’2010, pages 177–186. Physica-Verlag/Springer, Heidelberg, 2010.
• [13] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems., SIAM J. Imaging Sci., 2(1):183–202, 2009.
• [14] S. Boyd and L. Vandenberghe., Convex optimization. Cambridge University Press, Cambridge, 2004.
• [15] A. Cabot, H. Engler, and S. Gadat. On the long time behavior of second order differential equations with asymptotically small dissipation., Trans. Amer. Math. Soc., 361(11) :5983–6017, 2009.
• [16] A. Cabot, H. Engler, and S. Gadat. Second-order differential equations with asymptotically small dissipation and piecewise flat potentials. In, Proceedings of the Seventh Mississippi State–UAB Conference on Differential Equations and Computational Simulations, volume 17 of Electron. J. Differ. Equ. Conf., pages 33–38. Southwest Texas State Univ., San Marcos, TX, 2009.
• [17] M. Duflo. Random iterative models, adaptive algorithms and stochastic approximations,., Applications of Mathematics (New York). Springer-Verlag, Berlin., 22, 1997.
• [18] S. Ethier and T. Kurtz., Markov Processes. John Willey and Sons, New York, 1986.
• [19] M. Frank and P. Wolfe. An algorithm for quadratic programming., Naval Res. Logist. Quart., 3:95–110, 1956.
• [20] M. Freidlin and A. Wentzell., Random perturbations of dynamical systems, volume 260 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer, Heidelberg, third edition, 1984. Translated from the 1979 Russian original by Joseph Szücs.
• [21] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming., SIAM Journal on Optimization, 23 :2341–2368, 2013.
• [22] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming., Math. Program., 156:59–99, 2016.
• [23] S. Gadat, L. Miclo, and F. Panloup. A stochastic model for speculative bubbles., Alea: Latin American journal of probability and mathematical statistics, 12:491–532, 2015.
• [24] S. Gadat and F. Panloup. Long time behaviour and stationary regime of memory gradient diffusions., Annales Institut Henri Poincaré (B), 50:564–601, 2014.
• [25] S. Gadat and F. Panloup. Optimal non-asymptotic bound of the Ruppert-Polyak averaging without strong convexity, Preprint, 2018.
• [26] S. Gadat and L. Younes. A stochastic algorithm for feature selection in pattern recognition., Journal of Machine Learning Research, 8:509–547, 2007.
• [27] P. Hartman., Ordinary Differential Equations. Classic in Applied Mathematics. Wiley, 1982.
• [28] A. Haraux., Systèmes dynamiques dissipatifs et applications. R.M.A. Masson, Paris, 1991.
• [29] C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning., In Advances in Neural Information Processing Systems, 2009.
• [30] P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Accelerating Stochastic Gradient Descent., ArXiv e-prints, April 2017.
• [31] C. Jin, S. M. Kakade, and P. Netrapalli. Provable efficient online matrix completion via non-convex stochastic gradient descent., NIPS, 2016.
• [32] C. Jin, P. Netrapalli, and M. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent., Preprint, 2017.
• [33] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function., Ann. Math. Statist., 23:462–466, 1952.
• [34] H. J. Kushner and G. Yin. Stochastic approximation and recursive algorithms and applications., Springer-Verlag, second edition,, 2003.
• [35] G. Lan. An optimal method for stochastic composite optimization., Math. Program, 133(1-2, Ser.A):365–397, 2012.
• [36] V. Lemaire. An adaptive scheme for the approximation of dissipative systems., Stochastic Processes and their Applications, 117(10) :1491–1518, 2007.
• [37] J. Lee, M. Simchowitz, M. Jordan, and B. Recht. Gradient descent converges to minimizers., Preprint, 2016.
• [38] P. Mertikopoulos and M. Staudigl. On the convergence of gradient-like flows with noisy gradient input., SIAM Journal on Optimization, to appear, 2017.
• [39] J. C. Mattingly, A. M. Stuart, and D. J. Higham. Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise., Stochastic Process. Appl., 101(2):185–232, 2002.
• [40] S. Meyn and R. Tweedie. Stability of Markovian processes. III. Foster-Lyapunov criteria for continuous-time processes., Adv. in Appl. Probab., 25(3):518–548, 1993.
• [41] Y. Nesterov. A method of solving a convex programming problem with convergence rate $o(1/k^2)$., Soviet Mathematics Doklady, 27(2):372–376, 1983.
• [42] Y. Nesterov., Introductory lectures on convex optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004. A basic course.
• [43] A. Nitanda. Accelerated Stochastic Gradient Descent for Minimizing Finite Sums., ArXiv e-prints, June 2015.
• [44] A. Nemirovski and D. Yudin. Problem complexity and method efficiency in optimization., Wiley-Interscience Series in Discrete Mathematics., John Wiley, XV, 1983.
• [45] R. Pemantle. Non-convergence to unstable points in urn models and stochastic approximations., Annals of Probability, 18:698–712, 1990.
• [46] B. T. Polyak and A. Juditsky. Acceleration of stochastic approximation by averaging., SIAM Journal on Control and Optimization, 30:838–855, 1992.
• [47] H. Poincaré. Mémoire sur les courbes définies par une équation différentielle (iv)., Journal de Mathématiques Pures et Appliquées, 4:151–217, 1886.
• [48] B. T. Polyak. Some methods of speeding up the convergence of iteration methods., USSR Computational Mathematics and Mathematical Physics, 4:1–17, 1964.
• [49] H. Robbins and S. Monro. A stochastic approximation method., Ann. Math. Statist., 22:400–407, 1951.
• [50] D. Ruppert. Efficient estimations from a slowly convergent robbins-monro process., Technical Report, 781, Cornell University Operations Research and Industrial Engineering, 1988.
• [51] K.R. Stromberg., Probability for Analysts. Chapman & Hall, CRC, New York, 1994.
• [52] D. W. Stroock and S. R. S. Varadhan., Multidimensional diffusion processes. Classics in Mathematics. Springer-Verlag, Berlin, 2006. Reprint of the 1997 edition.
• [53] C. Villani. Hypocoercivity., Mem. Amer. Math. Soc., 202(950), 2009.
• [54] S. Boyd W. Su and E. J. Candes. A differential equation for modeling nesterov’s accelerated gradient method: theory and insights., Journal of Machine Learning Research, to appear, 2016.
• [55] T. Yang, Q. Lin, and Z. Li. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization., Preprint, 2016.