The Annals of Statistics

Statistical behavior and consistency of classification methods based on convex risk minimization

Tong Zhang

Source: Ann. Statist. Volume 32, Number 1 (2004), 56-85.

Abstract

We study how closely the optimal Bayes error rate can be approximately reached using a classification algorithm that computes a classifier by minimizing a convex upper bound of the classification error function. The measurement of closeness is characterized by the loss function used in the estimation. We show that such a classification scheme can be generally regarded as a (nonmaximum-likelihood) conditional in-class probability estimate, and we use this analysis to compare various convex loss functions that have appeared in the literature. Furthermore, the theoretical insight allows us to design good loss functions with desirable properties. Another aspect of our analysis is to demonstrate the consistency of certain classification methods using convex risk minimization. This study sheds light on the good performance of some recently proposed linear classification methods including boosting and support vector machines. It also shows their limitations and suggests possible improvements.
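
The reading of convex risk minimization as conditional in-class probability estimation can be illustrated numerically. The sketch below is not taken from the paper. It assumes labels y in {-1, +1}, writes eta = P(y = +1 | x), and minimizes the pointwise risk eta * phi(f) + (1 - eta) * phi(-f) for three commonly used convex losses. The names `LOSSES`, `INVERSE_LINKS`, `minimize_convex` and `recovered_probability` are hypothetical helpers introduced only for this illustration. The inverse links used are the standard closed forms for these three losses.

```python
import math

# Convex surrogate losses phi(v), evaluated at the margin v = y * f(x), y in {-1, +1}.
LOSSES = {
    "exponential": lambda v: math.exp(-v),
    "logistic": lambda v: math.log1p(math.exp(-v)),
    "least_squares": lambda v: (1.0 - v) ** 2,
}

# Inverse links mapping the pointwise risk minimizer f* back to eta = P(y = +1 | x).
INVERSE_LINKS = {
    "exponential": lambda f: 1.0 / (1.0 + math.exp(-2.0 * f)),
    "logistic": lambda f: 1.0 / (1.0 + math.exp(-f)),
    "least_squares": lambda f: (f + 1.0) / 2.0,
}


def conditional_risk(phi, eta, f):
    """Expected loss of predicting f at a point where P(y = +1 | x) = eta."""
    return eta * phi(f) + (1.0 - eta) * phi(-f)


def minimize_convex(g, lo=-20.0, hi=20.0, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if g(m1) < g(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def recovered_probability(name, eta):
    """Minimize the pointwise risk for the named loss, then invert the link."""
    phi = LOSSES[name]
    f_star = minimize_convex(lambda f: conditional_risk(phi, eta, f))
    return INVERSE_LINKS[name](f_star)
```

For example, `recovered_probability("logistic", 0.8)` should return a value close to 0.8, and the same holds for the other two losses. Each loss reaches the same conditional probability through a different link function, which is one way to see that the choice of loss determines how closeness to the Bayes rule is measured.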

Primary Subjects: 62G05, 62H30, 68T05
Keywords: Classification; consistency; boosting; large margin methods; kernel methods

Full-text: Open access

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aos/1079120130
Digital Object Identifier: doi:10.1214/aos/1079120130
Mathematical Reviews number (MathSciNet): MR2051001
Zentralblatt MATH identifier: 02113743

2010 © Institute of Mathematical Statistics