The Annals of Statistics

Statistical performance of support vector machines

Gilles Blanchard, Olivier Bousquet, and Pascal Massart

Abstract

The support vector machine (SVM) algorithm is well known to the machine learning community for its very good practical results. The goal of the present paper is to study this algorithm from a statistical perspective, using tools from concentration inequalities and empirical process theory.

Our main result builds on the observation, made by other authors, that the SVM can be viewed as a statistical regularization procedure. From this point of view, it can also be interpreted as a model selection principle based on a penalized criterion. It is then possible to adapt general model selection methods to this framework in order to study two important points: (1) what is the minimal penalty, and how does it compare to the penalty actually used in the SVM algorithm; (2) is it possible to obtain “oracle inequalities” in this setting for the specific loss function used by the SVM? We show that the answer to the latter question is positive and that it provides relevant insight into the former. Our result also shows that it is possible to obtain fast rates of convergence for SVMs.
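To make the regularization viewpoint concrete, the following schematic display (in notation of our own choosing, not the paper's exact statement) recalls the standard soft-margin SVM formulation over a reproducing kernel Hilbert space H and the generic form that an oracle inequality takes:

    \hat{f}_n = \operatorname{arg\,min}_{f \in H} \Big\{ \frac{1}{n} \sum_{i=1}^{n} \big( 1 - Y_i f(X_i) \big)_+ \; + \; \lambda \, \| f \|_H^2 \Big\},

    L(\hat{f}_n) - L^\ast \;\le\; C \, \inf_{f \in H} \Big\{ L(f) - L^\ast + \mathrm{pen}_n(f) \Big\} \; + \; r_n \qquad \text{(with high probability)},

where (1 - y f(x))_+ is the hinge loss, L denotes the corresponding risk, L^\ast its infimum, \mathrm{pen}_n(f) a penalty term and r_n a residual term. “Fast rates” then means that the right-hand side can decrease faster than the usual n^{-1/2} rate. The display above is only an illustrative template; the precise penalty, constants and conditions under which such an inequality holds are those established in the paper.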

Article information

Source
Ann. Statist. Volume 36, Number 2 (2008), 489–531.

Dates
First available in Project Euclid: 13 March 2008

Permanent link to this document
https://projecteuclid.org/euclid.aos/1205420509

Digital Object Identifier
doi:10.1214/009053607000000839

Mathematical Reviews number (MathSciNet)
MR2396805

Zentralblatt MATH identifier
1133.62044

Subjects
Primary: 62G05 (Estimation), 62G20 (Asymptotic properties)

Keywords
Classification; support vector machine; model selection; oracle inequality

Citation

Blanchard, Gilles; Bousquet, Olivier; Massart, Pascal. Statistical performance of support vector machines. Ann. Statist. 36 (2008), no. 2, 489–531. doi:10.1214/009053607000000839. https://projecteuclid.org/euclid.aos/1205420509



References

  • Alon, N., Ben-David, S., Cesa-Bianchi, N. and Haussler, D. (1997). Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM 44 615–631.
  • Bartlett, P. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory 44 525–536.
  • Bartlett, P., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497–1537.
  • Bartlett, P., Jordan, M. and McAuliffe, J. (2006). Convexity, classification, and risk bounds. J. Amer. Statist. Assoc. 101 138–156.
  • Bartlett, P. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. J. Machine Learning Research 3 463–482.
  • Bartlett, P. and Mendelson, S. (2006). Empirical minimization. Probab. Theory Related Fields 135 311–334.
  • Birman, M. S. and Solomyak, M. Z. (1967). Piecewise polynomial approximations of functions of the class W_p^α. Mat. USSR Sb. 73 295–317.
  • Blanchard, G., Bousquet, O. and Zwald, L. (2007). Statistical properties of kernel principal component analysis. Machine Learning 66 259–294.
  • Blanchard, G., Lugosi, G. and Vayatis, N. (2003). On the rate of convergence of regularized boosting classifiers. J. Machine Learning Research 4 861–894.
  • Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Acad. Sci. Paris Ser. I 334 495–500.
  • Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. J. Machine Learning Research 2 499–526.
  • Chen, D.-R., Wu, Q., Ying, Y. and Zhou, D.-X. (2004). Support vector machine soft margin classifiers: Error analysis. J. Machine Learning Research 5 1143–1175.
  • Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge Univ. Press.
  • Edmunds, D. E. and Triebel, H. (1996). Function Spaces, Entropy Numbers, Differential Operators. Cambridge Univ. Press.
  • Evgeniou, T., Pontil, M. and Poggio, T. (2000). Regularization networks and support vector machines. Adv. Comput. Math. 13 1–50.
  • Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593–2656.
  • Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30 1–50.
  • Lecué, G. (2007). Simultaneous adaptation to the margin and to complexity in classification. Ann. Statist. 35 1698–1721.
  • Lin, Y. (2002). Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery 6 259–275.
  • Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random penalties. Ann. Statist. 32 1679–1697.
  • Massart, P. (2000). About the constants in Talagrand’s inequality for empirical processes. Ann. Probab. 28 863–884.
  • Massart, P. (2000). Some applications of concentration inequalities in statistics. Ann. Fac. Sci. Toulouse Math. 9 245–303.
  • Massart, P. and Nédelec, E. (2006). Risk bounds for statistical learning. Ann. Statist. 34 2326–2366.
  • Massart, P. (2007). Concentration Inequalities and Model Selection. Lectures on Probability Theory and Statistics. Ecole d’Été de Probabilités de Saint-Flour XXXIII—2003. Lecture Notes in Math. 1896. Springer, Berlin.
  • Mendelson, S. (2003). Estimating the performance of kernel classes. J. Machine Learning Research 4 759–771.
  • De Vito, E., Rosasco, L., Caponnetto, A., De Giovannini, U. and Odone, F. (2005). Learning from examples as an inverse problem. J. Machine Learning Research 6 883–904.
  • Schölkopf, B., Smola, A. J. and Williamson, R. C. (2001). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. IEEE Trans. Inform. Theory 47 2516–2532.
  • Shawe-Taylor, J., Williams, C., Cristianini, N. and Kandola, J. (2005). On the eigenspectrum of the Gram matrix and the generalisation error of kernel PCA. IEEE Trans. Inform. Theory 51 2510–2522.
  • Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press.
  • Smola, A. and Schölkopf, B. (1998). From regularization operators to support vector kernels. In Advances in Neural Information Processing Systems 10 (M. I. Jordan, M. J. Kearns and S. A. Solla, eds.) 343–349. MIT Press.
  • Steinwart, I. (2002). Support vector machines are universally consistent. J. Complexity 18 768–791.
  • Steinwart, I. and Scovel, C. (2007). Fast rates for support vector machines using Gaussian kernels. Ann. Statist. 35 575–607.
  • Steinwart, I., Hush, D. and Scovel, C. (2006). A new concentration result for regularized risk minimizers. In High-Dimensional Probability IV 260–275. IMS Lecture Notes—Monograph Series 51.
  • Tarigan, B. and van de Geer, S. (2006). Adaptivity of support vector machines with ℓ1 penalty. Bernoulli 12 1045–1076.
  • Tsybakov, A. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166.
  • Tsybakov, A. and van de Geer, S. (2005). Square root penalty: Adaptation to the margin in classification and in edge estimation. Ann. Statist. 33 1203–1224.
  • Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.
  • Vapnik, V. and Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16 264–280.
  • Yang, Y. (1999). Minimax nonparametric classification. I. Rates of convergence. IEEE Trans. Inform. Theory 45 2271–2284.
  • Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist. 32 56–85.
  • Zhou, D.-X. (2003). Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Inform. Theory 49 1743–1752.