The Annals of Statistics

Statistical performance of support vector machines

Gilles Blanchard, Olivier Bousquet, and Pascal Massart

Source: Ann. Statist. Volume 36, Number 2 (2008), 489-531.

Abstract

The support vector machine (SVM) algorithm is well known to the computer learning community for its very good practical results. The goal of the present paper is to study this algorithm from a statistical perspective, using tools of concentration theory and empirical processes.

Our main result builds on the observation made by other authors that the SVM can be viewed as a statistical regularization procedure. From this point of view, it can also be interpreted as a model selection principle using a penalized criterion. It is then possible to adapt general methods related to model selection in this framework to study two important points: (1) what is the minimum penalty and how does it compare to the penalty actually used in the SVM algorithm; (2) is it possible to obtain “oracle inequalities” in that setting, for the specific loss function used in the SVM algorithm? We show that the answer to the latter question is positive and provides relevant insight to the former. Our result shows that it is possible to obtain fast rates of convergence for SVMs.
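
As a rough illustration of the regularization view described above, the sketch below minimizes an empirical hinge risk plus a penalty term. It assumes a linear decision function, a squared-norm penalty lam * ||w||^2, and plain subgradient descent. These choices, and the names hinge, penalized_risk and fit, are illustrative assumptions and are not taken from the paper.

```python
# Sketch only: the linear decision function, the squared-norm penalty and the
# subgradient-descent solver are illustrative assumptions, not the paper's method.

def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def hinge(margin):
    """Hinge loss max(0, 1 - margin), the loss function used by the SVM."""
    return max(0.0, 1.0 - margin)


def penalized_risk(w, xs, ys, lam):
    """Empirical hinge risk of w plus the penalty lam * ||w||^2."""
    n = len(xs)
    empirical = sum(hinge(y * dot(w, x)) for x, y in zip(xs, ys)) / n
    return empirical + lam * dot(w, w)


def fit(xs, ys, lam, steps=500, lr=0.1):
    """Approximately minimize penalized_risk by subgradient descent."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for t in range(1, steps + 1):
        # Subgradient of the penalty term lam * ||w||^2.
        g = [2.0 * lam * wj for wj in w]
        # Subgradient of the hinge term: only points with margin < 1 contribute.
        for x, y in zip(xs, ys):
            if y * dot(w, x) < 1.0:
                for j in range(d):
                    g[j] -= y * x[j] / n
        step = lr / t ** 0.5  # decreasing step size
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return w


if __name__ == "__main__":
    # Tiny hypothetical data set with labels in {-1, +1}.
    xs = [[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.5, -0.5]]
    ys = [1, 1, -1, -1]
    w = fit(xs, ys, lam=0.01)
    print("weights:", w)
    print("penalized risk:", penalized_risk(w, xs, ys, 0.01))
```

In this sketch the parameter lam sets the weight of the penalty relative to the empirical risk: a larger value favors smaller-norm solutions, which is the trade-off a penalized model selection criterion controls.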

Primary Subjects: 62G05, 62G20
Keywords: Classification; support vector machine; model selection; oracle inequality

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aos/1205420509
Digital Object Identifier: doi:10.1214/009053607000000839
Mathematical Reviews number (MathSciNet): MR2396805
Zentralblatt MATH identifier: 1133.62044

2009 © Institute of Mathematical Statistics