The Annals of Statistics

Regularization in kernel learning

Shahar Mendelson and Joseph Neeman

Full-text: Open access

Abstract

Under mild assumptions on the kernel, we obtain the best known error rates in a regularized learning scenario taking place in the corresponding reproducing kernel Hilbert space (RKHS). The main novelty in the analysis is a proof that one can use a regularization term that grows significantly slower than the standard quadratic growth in the RKHS norm.
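To make the claim concrete, here is a sketch of the standard regularized least-squares setup the abstract refers to; the exponent p below is illustrative (an assumption for exposition), standing in for the "slower than quadratic" regularization term:

```latex
% Given a sample (X_1, Y_1), ..., (X_n, Y_n) and an RKHS H with kernel K,
% the standard regularized least-squares estimator uses a quadratic penalty:
\hat{f} = \operatorname*{argmin}_{f \in H}
  \frac{1}{n} \sum_{i=1}^{n} \bigl( f(X_i) - Y_i \bigr)^2
  + \lambda \, \| f \|_H^2 .
% The abstract's main novelty: the quadratic penalty \|f\|_H^2 can be
% replaced by one that grows significantly more slowly in \|f\|_H
% (schematically, \|f\|_H^p with p < 2) while still yielding the
% error rates obtained in the paper.
```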

Article information

Source
Ann. Statist., Volume 38, Number 1 (2010), 526-565.

Dates
First available in Project Euclid: 31 December 2009

Permanent link to this document
https://projecteuclid.org/euclid.aos/1262271623

Digital Object Identifier
doi:10.1214/09-AOS728

Mathematical Reviews number (MathSciNet)
MR2590050

Zentralblatt MATH identifier
1191.68356

Subjects
Primary: 68Q32: Computational learning theory [See also 68T05]
Secondary: 60G99: None of the above, but in this section

Keywords
Regression; reproducing kernel Hilbert space; regularization; least-squares; model selection

Citation

Mendelson, Shahar; Neeman, Joseph. Regularization in kernel learning. Ann. Statist. 38 (2010), no. 1, 526--565. doi:10.1214/09-AOS728. https://projecteuclid.org/euclid.aos/1262271623



References

  • [1] Bartlett, P. L. (2008). Fast rates for estimation error and oracle inequalities for model selection. Econometric Theory 24 545–552.
  • [2] Bartlett, P. L., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497–1537.
  • [3] Bartlett, P. L. and Mendelson, S. (2006). Empirical minimization. Probab. Theory Related Fields 135 311–344.
  • [4] Bartlett, P. L., Mendelson, S. and Neeman, J. (2009). ℓ1-regularized linear regression: Persistence and oracle inequalities. Submitted.
  • [5] Blanchard, G., Bousquet, O. and Massart, P. (2008). Statistical performance of support vector machines. Ann. Statist. 36 489–531.
  • [6] Birman, M. Š. and Solomyak, M. Z. (1977). Estimates of singular numbers of integral operators. Uspehi Mat. Nauk 32 15–89.
  • [7] Caponnetto, A. and de Vito, E. (2007). Optimal rates for regularized least-squares algorithm. Found. Comput. Math. 7 331–368.
  • [8] Cucker, F. and Smale, S. (2002). On the mathematical foundations of learning. Bull. Amer. Math. Soc. (N.S.) 39 1–49.
  • [9] Cucker, F. and Smale, S. (2002). Best choices for regularization parameters in learning theory: On the Bias-variance problem. Found. Comput. Math. 2 413–428.
  • [10] Cucker, F. and Zhou, D. X. (2007). Learning Theory: An Approximation Theory Viewpoint. Cambridge Univ. Press, Cambridge.
  • [11] Dudley, R. M. (1999). Uniform Central Limit Theorems. Cambridge Studies in Advanced Mathematics 63. Cambridge Univ. Press, Cambridge.
  • [12] Fernique, X. (1975). Régularité des trajectoires des fonctions aléatoires gaussiennes. In Ecole d’Eté de Probabilités de St-Flour 1974. Lecture Notes in Mathematics 480 1–96. Springer, Berlin.
  • [13] Giné, E. and Zinn, J. (1984). Some limit theorems for empirical processes. Ann. Probab. 12 929–989.
  • [14] Guédon, O., Mendelson, S., Pajor, A. and Tomczak-Jaegermann, N. (2007). Subspaces and orthogonal decompositions generated by bounded orthogonal systems. Positivity 11 269–283.
  • [15] Guédon, O. and Rudelson, M. (2007). Lp moments of random vectors via majorizing measures. Adv. Math. 208 798–823.
  • [16] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593–2656.
  • [17] König, H. (1986). Eigenvalue Distribution of Compact Operators. Birkhäuser, Basel.
  • [18] Ledoux, M. (2001). The Concentration of Measure Phenomenon. Amer. Math. Soc., Providence, RI.
  • [19] Lee, W. S., Bartlett, P. L. and Williamson, R. C. (1996). The importance of convexity in learning with squared loss. IEEE Trans. Inform. Theory 44 1974–1980.
  • [20] Massart, P. (2000). About the constants in Talagrand’s concentration inequality for empirical processes. Ann. Probab. 28 863–884.
  • [21] Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Mathematics 1896. Springer, Berlin.
  • [22] Mendelson, S. (2003). Estimating the performance of kernel classes. J. Mach. Learn. Res. 4 759–771.
  • [23] Mendelson, S., Pajor, A. and Tomczak-Jaegermann, N. (2007). Reconstruction and subgaussian operators in asymptotic geometric analysis. Geom. Funct. Anal. 17 1248–1282.
  • [24] Mendelson, S. (2008). Obtaining fast error rates in nonconvex situations. J. Complexity 24 380–397.
  • [25] Mendelson, S. (2008). On weakly bounded empirical processes. Math. Ann. 340 293–314.
  • [26] Pajor, A. and Tomczak-Jaegermann, N. (1985). Remarques sur les nombres d’entropie d’un opérateur et de son transposé. C. R. Acad. Sci. Paris Ser. I Math. 301 743–746.
  • [27] Pisier, G. (1989). The Volume of Convex Bodies and Banach Space Geometry. Cambridge Univ. Press, Cambridge.
  • [28] Rudelson, M. (1999). Random vectors in the isotropic position. J. Funct. Anal. 164 60–72.
  • [29] Steinwart, I. and Scovel, C. (2007). Fast rates for support vector machines using Gaussian kernels. Ann. Statist. 35 575–607.
  • [30] Smale, S. and Zhou, D. X. (2003). Estimating the approximation error in learning theory. Anal. Appl. 1 17–41.
  • [31] Smale, S. and Zhou, D. X. (2007). Learning theory estimates via integral operators and their approximations. Constr. Approx. 26 153–172.
  • [32] Talagrand, M. (1987). Regularity of Gaussian processes. Acta Math. 159 99–149.
  • [33] Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Ann. Probab. 22 28–76.
  • [34] Talagrand, M. (2005). The Generic Chaining. Springer, Berlin.
  • [35] Williamson, R. C., Smola, A. J. and Schölkopf, B. (2001). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. IEEE Trans. Inform. Theory 47 2516–2532.
  • [36] Wu, Q., Ying, Y. and Zhou, D. (2006). Learning rates of least-square regularized regression. Found. Comput. Math. 6 171–192.
  • [37] Zhou, D. (2002). The covering number in learning theory. J. Complexity 18 739–767.