## The Annals of Statistics

### Asymptotic and finite-sample properties of estimators based on stochastic gradients

#### Abstract

Stochastic gradient descent procedures have gained popularity for parameter estimation from large data sets. However, their statistical properties are not well understood in theory, and in practice avoiding numerical instability requires careful tuning of key parameters. Here, we introduce implicit stochastic gradient descent procedures, which involve parameter updates that are implicitly defined. Intuitively, implicit updates shrink standard stochastic gradient descent updates. The amount of shrinkage depends on the observed Fisher information matrix, which does not need to be explicitly computed; thus, implicit procedures increase stability without increasing the computational burden. Our theoretical analysis provides the first full characterization of the asymptotic behavior of both standard and implicit stochastic gradient descent-based estimators, including finite-sample error bounds. Importantly, analytical expressions for the variances of these stochastic gradient-based estimators reveal their exact loss of efficiency. We also develop new algorithms to compute implicit stochastic gradient descent-based estimators for generalized linear models, Cox proportional hazards models and M-estimation problems in practice, and perform extensive experiments. Our results suggest that implicit stochastic gradient descent procedures are poised to become a workhorse for approximate inference from large data sets.
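To make the implicit update concrete: where standard SGD sets $\theta_n = \theta_{n-1} + \gamma_n \nabla \log f(y_n; x_n, \theta_{n-1})$, the implicit procedure sets $\theta_n = \theta_{n-1} + \gamma_n \nabla \log f(y_n; x_n, \theta_n)$, with $\theta_n$ on both sides. For generalized linear models the update direction is $x_n$, so the fixed point reduces to a one-dimensional root-finding problem. The sketch below is illustrative only, not the authors' implementation: the function name, the bisection solver, and the learning-rate schedule $\gamma_n = \gamma_0/(\alpha + n)$ are all assumptions made for the example, shown here for logistic regression.

```python
import math

def implicit_sgd_logistic(data, gamma0=1.0, alpha=1.0, n_passes=1):
    """Illustrative implicit SGD for logistic regression.

    Each update solves theta_n = theta_{n-1} + gamma_n * grad(theta_n).
    Because the gradient is (y - h(x'theta)) * x, the update is
    theta_n = theta_{n-1} + xi * x, where the scalar xi solves
        xi = gamma * (y - h(x'theta_{n-1} + xi * x'x)).
    No Fisher information matrix is ever formed explicitly.
    """
    h = lambda z: 1.0 / (1.0 + math.exp(-z))  # logistic mean function
    p = len(data[0][0])
    theta = [0.0] * p
    n = 0
    for _ in range(n_passes):
        for x, y in data:
            n += 1
            gamma = gamma0 / (alpha + n)                 # decaying step size
            xtx = sum(xi * xi for xi in x)               # x'x
            eta = sum(t * xi for t, xi in zip(theta, x)) # x'theta_{n-1}
            # g(xi) = gamma*(y - h(eta + xi*xtx)) - xi is decreasing in xi,
            # and its root lies between 0 and the explicit step r0.
            r0 = gamma * (y - h(eta))
            lo, hi = min(0.0, r0), max(0.0, r0)
            for _ in range(50):                          # bisection
                mid = 0.5 * (lo + hi)
                if gamma * (y - h(eta + mid * xtx)) - mid > 0:
                    lo = mid
                else:
                    hi = mid
            xi = 0.5 * (lo + hi)
            theta = [t + xi * xj for t, xj in zip(theta, x)]
    return theta
```

Because $|\xi| \le |r_0|$, where $r_0$ is the explicit SGD step, each implicit update is a shrunken version of the standard one, which is the source of the stability described in the abstract.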

#### Article information

Source
Ann. Statist., Volume 45, Number 4 (2017), 1694–1727.

Dates
Revised: August 2016
First available in Project Euclid: 28 June 2017

Permanent link to this document
https://projecteuclid.org/euclid.aos/1498636871

Digital Object Identifier
doi:10.1214/16-AOS1506

Mathematical Reviews number (MathSciNet)
MR3670193

Zentralblatt MATH identifier
1378.62046

#### Citation

Toulis, Panos; Airoldi, Edoardo M. Asymptotic and finite-sample properties of estimators based on stochastic gradients. Ann. Statist. 45 (2017), no. 4, 1694--1727. doi:10.1214/16-AOS1506. https://projecteuclid.org/euclid.aos/1498636871

#### References

• Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Comput. 10 251–276.
• Amari, S.-I., Park, H. and Fukumizu, K. (2000). Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Comput. 12 1399–1409.
• Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate $O(1/n)$. In Advances in Neural Information Processing Systems 773–781.
• Bather, J. A. (1989). Stochastic approximation: A generalisation of the Robbins–Monro procedure. In Proceedings of the Fourth Prague Symposium on Asymptotic Statistics (Prague, 1988) 13–27. Charles Univ., Prague.
• Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2 183–202.
• Benveniste, A., Métivier, M. and Priouret, P. (1990). Adaptive Algorithms and Stochastic Approximations. Springer, Berlin.
• Bertsekas, D. P. (2011). Incremental proximal methods for large scale convex optimization. Math. Program. 129 163–195.
• Bordes, A., Bottou, L. and Gallinari, P. (2009). SGD-QN: Careful quasi-Newton stochastic gradient descent. J. Mach. Learn. Res. 10 1737–1754.
• Borkar, V. S. (2008). Stochastic Approximation. Cambridge Univ. Press, Cambridge.
• Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010 177–186. Springer, Heidelberg.
• Byrd, R. H., Hansen, S. L., Nocedal, J. and Singer, Y. (2016). A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26 1008–1031.
• Cappé, O. and Moulines, E. (2009). On-line expectation-maximization algorithm for latent data models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 71 593–613.
• Chen, H. F., Lei, G. and Gao, A. J. (1988). Convergence and robustness of the Robbins–Monro algorithm truncated at randomly varying bounds. Stochastic Process. Appl. 27 217–231.
• Chung, K. L. (1954). On a stochastic approximation method. Ann. Math. Stat. 25 463–483.
• Cox, D. R. (1972). Regression models and life-tables. J. Roy. Statist. Soc. Ser. B 34 187–220.
• Davison, A. C. (2003). Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics 11. Cambridge Univ. Press, Cambridge.
• Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V. et al. (2012). Large scale distributed deep networks. In Advances in Neural Information Processing Systems 1223–1231.
• Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1–38.
• Dominici, F., Daniels, M., Zeger, S. L. and Samet, J. M. (2002). Air pollution and mortality: Estimating regional and national dose-response relationships. J. Amer. Statist. Assoc. 97 100–111.
• Donoho, D. and Montanari, A. (2016). High dimensional robust M-estimation: Asymptotic variance via approximate message passing. Probab. Theory Related Fields 166 935–969.
• Duchi, J., Hazan, E. and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12 2121–2159.
• Duchi, J. and Singer, Y. (2009). Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10 2899–2934.
• El Karoui, N. (2008). Spectrum estimation for large dimensional covariance matrices using random matrix theory. Ann. Statist. 36 2757–2790.
• Fabian, V. (1968). On asymptotic normality in stochastic approximation. Ann. Math. Stat. 39 1327–1332.
• Fabian, V. (1978). On asymptotically efficient recursive estimation. Ann. Statist. 6 854–866.
• Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222 309–368.
• Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh.
• Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33 1–22.
• George, A. P. and Powell, W. B. (2006). Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Mach. Learn. 65 167–198.
• Green, P. J. (1984). Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J. Roy. Statist. Soc. Ser. B 46 149–192.
• Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York.
• Hennig, P. and Kiefel, M. (2013). Quasi-Newton methods: A new direction. J. Mach. Learn. Res. 14 843–865.
• Hoffman, J. D. and Frankel, S. (2001). Numerical Methods for Engineers and Scientists. CRC Press, Boca Raton, FL.
• Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Stat. 35 73–101.
• Jain, P., Tewari, A. and Kar, P. (2014). On iterative hard thresholding methods for high-dimensional M-estimation. In Advances in Neural Information Processing Systems 685–693.
• Klein, J. P. and Moeschberger, M. L. (2003). Survival Analysis: Techniques for Censored and Truncated Data. Springer, New York.
• Krakowski, K. A., Mahony, R. E., Williamson, R. C. and Warmuth, M. K. (2007). A geometric view of non-linear on-line stochastic gradient descent. Available at https://users.soe.ucsc.edu/manfred/pubs/T3.pdf.
• Lange, K. (2010). Numerical Analysis for Statisticians, 2nd ed. Springer, New York.
• Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation, 2nd ed. Springer Texts in Statistics 31. Springer, New York.
• Lions, P.-L. and Mercier, B. (1979). Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16 964–979.
• Ljung, L., Pflug, G. and Walk, H. (1992). Stochastic Approximation and Optimization of Random Systems. DMV Seminar 17. Birkhäuser, Basel.
• Marshall, A. W. and Olkin, I. (1960). Multivariate Chebyshev inequalities. Ann. Math. Stat. 31 1001–1014.
• Martin, R. D. and Masreliez, C. J. (1975). Robust estimation via stochastic approximation. IEEE Trans. Inform. Theory IT-21 263–271.
• Mestre, X. (2008). Improved estimation of eigenvalues and eigenvectors of covariance matrices using their sample estimates. IEEE Trans. Inform. Theory 54 5113–5129.
• Miller, A. J. (1992). Algorithm AS 274: Least squares routines to supplement those of Gentleman. Applied Statistics 458–478.
• Moulines, E. and Bach, F. R. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 451–459.
• Nagumo, J.-I. and Noda, A. (1967). A learning method for system identification. IEEE Trans. Automat. Control 12 282–287.
• National Research Council (2013). Frontiers in Massive Data Analysis. The National Academies Press, Washington, DC.
• Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. J. Roy. Statist. Soc. Ser. A 135 370–384.
• Nemirovski, A., Juditsky, A., Lan, G. and Shapiro, A. (2008). Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19 1574–1609.
• Nemirovsky, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley, New York.
• Nevelson, M. B. and Khasminskiĭ, R. Z. (1973). Stochastic Approximation and Recursive Estimation 47. Amer. Math. Society, Washington.
• Parikh, N. and Boyd, S. (2013). Proximal algorithms. Found. Trends Optim. 1 123–231.
• Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30 838–855.
• Poljak, B. T. and Tsypkin, Ja. Z. (1980). Robust identification. Automatica J. IFAC 16 53–63.
• Polyak, B. T. and Tsypkin, Ya. Z. (1979). Adaptive estimation algorithms (convergence, optimality, stability). Avtomat. i Telemekh. 3 71–84.
• Robbins, H. and Monro, S. (1951). A stochastic approximation method. Ann. Math. Stat. 22 400–407.
• Rockafellar, R. T. (1976). Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14 877–898.
• Rosasco, L., Villa, S. and Vũ, B. C. (2014). Convergence of stochastic proximal gradient algorithm. Preprint. Available at arXiv:1403.5074.
• Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins–Monro process. Technical report, Dept. Operations Research and Industrial Engineering, Cornell Univ., Ithaca, NY.
• Sacks, J. (1958). Asymptotic distribution of stochastic approximation procedures. Ann. Math. Stat. 29 373–405.
• Sakrison, D. J. (1965). Efficient recursive estimation; application to estimating the parameters of a covariance function. Internat. J. Engrg. Sci. 3 461–483.
• Samet, J. M., Zeger, S. L., Dominici, F., Curriero, F., Coursac, I., Dockery, D. W., Schwartz, J. and Zanobetti, A. (2000). The national morbidity, mortality, and air pollution study. Part II: Morbidity and mortality from air pollution in the United States. Res. Rep. Health Eff. Inst. 94 5–79.
• Schmidt, M., Le Roux, N. and Bach, F. (2013). Minimizing finite sums with the stochastic average gradient. Technical report, HAL 00860051.
• Shamir, O. and Zhang, T. (2012). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. Preprint. Available at arXiv:1212.1824.
• Simon, N., Friedman, J., Hastie, T. and Tibshirani, R. (2011). Regularization paths for Cox's proportional hazards model via coordinate descent. J. Stat. Softw. 39 1–13.
• Singer, Y. and Duchi, J. C. (2009). Efficient learning using forward-backward splitting. In Advances in Neural Information Processing Systems 495–503.
• Slock, D. T. M. (1993). On the convergence behavior of the LMS and the normalized LMS algorithms. IEEE Trans. Signal Process. 41 2811–2825.
• Toulis, P. and Airoldi, E. M. (2015a). Scalable estimation strategies based on stochastic approximations: Classical results and new insights. Stat. Comput. 25 781–795.
• Toulis, P. and Airoldi, E. M. (2015b). Implicit stochastic approximation. Preprint. Available at arXiv:1510.00967.
• Toulis, P. and Airoldi, E. M. (2017). Supplement to “Asymptotic and finite-sample properties of estimators based on stochastic gradients.” DOI:10.1214/16-AOS1506SUPP.
• Toulis, P., Airoldi, E. M. and Rennie, J. (2014). Statistical analysis of stochastic gradient methods for generalized linear models. J. Mach. Learn. Res. W&CP 32 (ICML) 667–675.
• Toulis, P., Tran, D. and Airoldi, E. M. (2016). Towards stability and optimality in stochastic gradient descent. J. Mach. Learn. Res. W&CP 51 (AISTATS).
• Tran, D., Toulis, P. and Airoldi, E. M. (2015). Stochastic gradient descent methods for estimation with large data sets. Preprint. Available at arXiv:1509.06459.
• Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) 681–688.
• Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. Defense Technical Information Center.
• Wood, S. N., Goude, Y. and Shaw, S. (2015). Generalized additive models for large data sets. J. R. Stat. Soc. Ser. C. Appl. Stat. 64 139–155.
• Xiao, L. and Zhang, T. (2014). A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24 2057–2075.
• Xu, W. (2011). Towards optimal one pass large scale learning with averaged stochastic gradient descent. Preprint. Available at arXiv:1107.2490.
• Zhang, T. (2004). Solving large scale linear prediction problems using gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning 116. ACM, New York.

#### Supplemental materials

• Supplement to “Asymptotic and finite-sample properties of estimators based on stochastic gradients”. The proofs of all technical results are provided in an online supplement [Toulis and Airoldi (2017)]. There, we also provide numerical results that extend the results in Section 4 of this article—referred to as the “main paper” in the supplement.