Electronic Journal of Statistics

Penalized empirical risk minimization over Besov spaces

Sébastien Loustau

Source: Electron. J. Statist. Volume 3 (2009), 824-850.

Abstract

Kernel methods are closely related to the notion of a reproducing kernel Hilbert space (RKHS). A kernel machine minimizes the sum of an empirical cost and a stabilizer (usually the norm in the RKHS). In this paper we propose to use Besov spaces as alternative hypothesis spaces. We study the statistical performance of penalized empirical risk minimization for classification when the stabilizer is a Besov norm. More precisely, we establish fast rates of convergence to the Bayes rule. These rates are adaptive with respect to the regularity of the Bayes rule.
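
As an illustration only, the generic criterion described in the abstract can be written as follows. The notation is assumed here and is not taken from the abstract: a sample $(X_i, Y_i)_{i=1}^n$ with labels $Y_i \in \{-1, +1\}$, a loss $\ell$ (for instance the hinge loss of support vector machines), a smoothing parameter $\alpha_n > 0$, and a Besov space $B^s_{p,q}$ with norm $\|\cdot\|_{B^s_{p,q}}$. The exponent on the stabilizer is likewise an assumption.

```latex
% Classical kernel machine: empirical cost plus RKHS-norm stabilizer
\hat f_n \in \arg\min_{f \in H_K}
  \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(Y_i, f(X_i)\bigr) + \alpha_n \|f\|_{H_K}^{2},
\qquad \text{e.g. } \ell(y, t) = (1 - y t)_{+} .

% Same criterion with a Besov space as hypothesis space
% (the square on the norm is an assumption of this sketch)
\hat f_n \in \arg\min_{f \in B^{s}_{p,q}}
  \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(Y_i, f(X_i)\bigr) + \alpha_n \|f\|_{B^{s}_{p,q}}^{2} .
```

In both cases the classifier is the sign of $\hat f_n$, and its excess risk is measured against the Bayes rule $f^*(x) = \operatorname{sign}\bigl(2\,\mathbb{P}(Y = 1 \mid X = x) - 1\bigr)$.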

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.ejs/1250880017
Digital Object Identifier: doi:10.1214/08-EJS316

2009 © Institute of Mathematical Statistics