The Annals of Statistics

Nonparametric stochastic approximation with large step-sizes

Abstract

We consider the random-design least-squares regression problem within the reproducing kernel Hilbert space (RKHS) framework. Given a stream of independent and identically distributed input/output data, we aim to learn a regression function within an RKHS $\mathcal{H}$, even if the optimal predictor (i.e., the conditional expectation) is not in $\mathcal{H}$. In a stochastic approximation framework where the estimator is updated after each observation, we show that the averaged unregularized least-mean-square algorithm (a form of stochastic gradient descent), given a sufficiently large step-size, attains optimal rates of convergence across a variety of smoothness regimes for the optimal prediction function and the functions in $\mathcal{H}$. Our results apply as well in the usual finite-dimensional setting of parametric least-squares regression, showing adaptivity of our estimator to the spectral decay of the covariance matrix of the covariates.
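The algorithm analyzed in the abstract can be sketched in a few lines: a single pass of unregularized stochastic gradient descent on the squared loss in the RKHS, followed by Polyak–Ruppert averaging of the iterates. The sketch below is illustrative only (it is not the paper's code): it assumes a Gaussian kernel on the real line, and the names `kernel_lms_averaged`, `gamma` (step-size) and `bandwidth` are hypothetical. Each iterate adds one kernel section $k(x_i,\cdot)$, whose coefficient is then fixed, so the averaged estimator is obtained by simply reweighting the coefficients.

```python
import numpy as np

def kernel_lms_averaged(xs, ys, gamma=0.5, bandwidth=0.1):
    """Unregularized kernel least-mean-squares with Polyak-Ruppert averaging.

    One pass over the stream (xs, ys). Each SGD step
        g_i = g_{i-1} - gamma * (g_{i-1}(x_i) - y_i) * k(x_i, .)
    adds one kernel section with a coefficient that never changes afterwards.
    Returns the support points and the coefficients of the averaged iterate
    (1/n) * sum_{m=1}^n g_m in the span of the k(x_i, .).
    """
    n = len(xs)
    alpha = np.zeros(n)  # alpha[i]: coefficient of k(x_i, .) in the last iterate
    for i in range(n):
        if i:
            kvec = np.exp(-((xs[:i] - xs[i]) ** 2) / (2 * bandwidth ** 2))
            pred = alpha[:i] @ kvec  # g_{i-1}(x_i)
        else:
            pred = 0.0               # g_0 = 0
        alpha[i] = -gamma * (pred - ys[i])
    # alpha[i] enters iterates g_{i+1}, ..., g_n, i.e. (n - i) of the n
    # averaged iterates, so averaging just reweights each coefficient.
    weights = (n - np.arange(n)) / n
    return xs, alpha * weights

def predict(points, coeffs, xq, bandwidth=0.1):
    """Evaluate the averaged estimator at a query point xq."""
    return coeffs @ np.exp(-((points - xq) ** 2) / (2 * bandwidth ** 2))
```

A quick usage check on noisy samples of a smooth target (here $\sin(2\pi x)$) shows the averaged one-pass estimator tracking the regression function without any explicit regularization, the step-size playing the role of the regularization parameter.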

Article information

Source
Ann. Statist., Volume 44, Number 4 (2016), 1363-1399.

Dates
Revised: July 2015
First available in Project Euclid: 7 July 2016

https://projecteuclid.org/euclid.aos/1467894702

Digital Object Identifier
doi:10.1214/15-AOS1391

Mathematical Reviews number (MathSciNet)
MR3519927

Zentralblatt MATH identifier
1346.60041

Citation

Dieuleveut, Aymeric; Bach, Francis. Nonparametric stochastic approximation with large step-sizes. Ann. Statist. 44 (2016), no. 4, 1363--1399. doi:10.1214/15-AOS1391. https://projecteuclid.org/euclid.aos/1467894702

References

• [1] Abramowitz, M. and Stegun, I. (1964). Handbook of Mathematical Functions. Dover, New York.
• [2] Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 337–404.
• [3] Bach, F. (2013). Sharp analysis of low-rank kernel matrix approximations. In Proceedings of the International Conference on Learning Theory (COLT).
• [4] Bach, F. and Moulines, E. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Adv. NIPS. Curran Associates, Inc., Granada, Spain.
• [5] Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate $O(1/n)$. In Advances in Neural Information Processing Systems (NIPS). Curran Associates, Inc., Lake Tahoe, CA.
• [6] Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic, Boston, MA.
• [7] Blanchard, G. and Krämer, N. (2010). Optimal learning rates for kernel conjugate gradient regression. In Advances in Neural Inf. Proc. Systems (NIPS) 226–234. Curran Associates, Inc., Vancouver, Canada.
• [8] Bordes, A., Ertekin, S., Weston, J. and Bottou, L. (2005). Fast kernel classifiers with online and active learning. J. Mach. Learn. Res. 6 1579–1619.
• [9] Brezis, H. (1983). Analyse Fonctionnelle, Théorie et Applications. Masson, Paris.
• [10] Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 7 331–368.
• [11] Cucker, F. and Smale, S. (2002). On the mathematical foundations of learning. Bull. Amer. Math. Soc. (N.S.) 39 1–49 (electronic).
• [12] Cucker, F. and Smale, S. (2002). Best choices for regularization parameters in learning theory: On the bias-variance problem. Found. Comput. Math. 2 413–428.
• [13] Dekel, O., Shalev-Shwartz, S. and Singer, Y. (2005). The Forgetron: A kernel-based perceptron on a fixed budget. In Adv. NIPS. Curran Associates, Inc., Vancouver, Canada.
• [14] De Vito, E., Caponnetto, A. and Rosasco, L. (2005). Model selection for regularized least-squares algorithm in learning theory. Found. Comput. Math. 5 59–85.
• [15] Dieuleveut, A. and Bach, F. (2014). Nonparametric stochastic approximation with large step sizes. Preprint. Available at arXiv:1408.0361.
• [16] Engl, H. W., Hanke, M. and Neubauer, A. (1996). Regularization of Inverse Problems. Mathematics and Its Applications 375. Kluwer Academic, Dordrecht.
• [17] Flammarion, N. and Bach, F. (2015). From averaging to acceleration, there is only a step-size. In Proceedings of the International Conference on Learning Theory (COLT). Microtome Publishing, Paris, France.
• [18] Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations, 3rd ed. Johns Hopkins Univ. Press, Baltimore, MD.
• [19] Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: An optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the International Conference on Learning Theory (COLT). Microtome Publishing, Budapest, Hungary.
• [20] Hsu, D., Kakade, S. M. and Zhang, T. (2014). Random design analysis of ridge regression. Found. Comput. Math. 14 569–600.
• [21] Johnstone, I. M. (1994). Minimax Bayes, asymptotic minimax and sparse wavelet priors. In Statistical Decision Theory and Related Topics, V (West Lafayette, IN, 1992) 303–326. Springer, New York.
• [22] Kimeldorf, G. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 33 82–95.
• [23] Kivinen, J., Smola, A. J. and Williamson, R. C. (2004). Online learning with kernels. IEEE Trans. Signal Process. 52 2165–2176.
• [24] Kolmogorov, A. N. and Fomin, S. V. (1999). Elements of the Theory of Functions and Functional Analysis, Vol. 1. Dover, New York.
• [25] Lacoste-Julien, S., Schmidt, M. and Bach, F. (2012). A simpler approach to obtaining an O(1/t) rate for the stochastic projected subgradient method. Preprint. Available at arXiv:1212.2002.
• [26] Mahoney, M. W. (2011). Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3 123–224.
• [27] Micchelli, C. A., Xu, Y. and Zhang, H. (2006). Universal kernels. J. Mach. Learn. Res. 7 2651–2667.
• [28] Mikusinski, P. and Weiss, E. (2014). The Bochner integral. Preprint. Available at arXiv:1403.5209.
• [29] Nemirovski, A., Juditsky, A., Lan, G. and Shapiro, A. (2008). Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19 1574–1609.
• [30] Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Early stopping for nonparametric regression: An optimal data-dependent stopping rule. In 49th Annual Allerton Conference on Communication, Control, and Computing. IEEE, Monticello, IL, USA.
• [31] Robbins, H. and Monro, S. (1951). A stochastic approximation method. Ann. Math. Stat. 22 400–407.
• [32] Rosasco, L., Tacchetti, A. and Villa, S. (2014). Regularization by early stopping for online learning algorithms. Preprint. Available at arXiv:1405.0042.
• [33] Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press, Cambridge, MA.
• [34] Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning 4 107–194.
• [35] Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge Univ. Press, Cambridge.
• [36] Smale, S. and Zhou, D.-X. (2007). Learning theory estimates via integral operators and their approximations. Constr. Approx. 26 153–172.
• [37] Sriperumbudur, B. K., Fukumizu, K. and Lanckriet, G. R. G. (2011). Universality, characteristic kernels and RKHS embedding of measures. J. Mach. Learn. Res. 12 2389–2410.
• [38] Steinwart, I., Hush, D. and Scovel, C. (2009). Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference in Learning Theory. Microtome Publishing, Montreal, Canada.
• [39] Tarrès, P. and Yao, Y. (2011). Online learning as stochastic approximation of regularization paths. Preprint. Available at arXiv:1103.5538.
• [40] Thomson, B., Bruckner, J. and Bruckner, A. M. (2000). Elementary Real Analysis. Pearson Education, Upper Saddle River, NJ.
• [41] Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation, 1st ed. Springer, Berlin.
• [42] Vert, J. (2014). Kernel methods. Unpublished manuscript.
• [43] Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia, PA.
• [44] Williams, C. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13 (T. K. Leen, T. G. Dietterich and V. Tresp, eds.). MIT Press, Cambridge, MA.
• [45] Yao, Y. (2006). A dynamic theory of learning. Ph.D. thesis, Univ. of California, Berkeley.
• [46] Yao, Y., Rosasco, L. and Caponnetto, A. (2007). On early stopping in gradient descent learning. Constr. Approx. 26 289–315.
• [47] Ying, Y. and Pontil, M. (2008). Online gradient descent learning algorithms. Found. Comput. Math. 8 561–596.
• [48] Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML). ACM, Banff, Canada.
• [49] Zhang, T. and Yu, B. (2005). Boosting with early stopping: Convergence and consistency. Ann. Statist. 33 1538–1579.