## The Annals of Statistics

### Optimal aggregation of classifiers in statistical learning

Alexander B. Tsybakov

#### Abstract

Classification can be considered as nonparametric estimation of sets, where the risk is defined by means of a specific distance between sets associated with misclassification error. It is shown that the rates of convergence of classifiers depend on two parameters: the complexity of the class of candidate sets and the margin parameter. The dependence is explicitly given, indicating that optimal fast rates approaching $O(n^{-1})$ can be attained, where $n$ is the sample size, and that the proposed classifiers have the property of robustness to the margin. The main result of the paper concerns optimal aggregation of classifiers: we suggest a classifier that automatically adapts both to the complexity and to the margin, and attains the optimal fast rates, up to a logarithmic factor.
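To make the abstract's claim about "two parameters" concrete, the sketch below restates one common formulation of the margin and complexity assumptions used in this line of work (see Mammen and Tsybakov (1999) in the references), together with the resulting rate. The notation ($\kappa$ for the margin parameter, $\rho$ for the complexity exponent, $G^*$ for the Bayes set, $d_\Delta$ for the measure of the symmetric difference) is recalled from memory rather than quoted; the precise statements and constants are in the article itself.

```latex
% A sketch of the two assumptions behind the rates, recalled from this
% line of work; consult the article for the exact statements.

% Margin assumption, with parameter \kappa \ge 1: for every candidate
% set G, the distance to the Bayes set G^* is controlled by the excess risk,
\[
  d_{\Delta}(G, G^{*}) \;\le\; c\,\bigl(R(G) - R(G^{*})\bigr)^{1/\kappa}.
\]

% Complexity assumption, with exponent 0 < \rho < 1: the class of
% candidate sets has bracketing entropy bounded by
\[
  H_{B}(\varepsilon) \;\le\; A\,\varepsilon^{-\rho},
  \qquad 0 < \varepsilon \le 1.
\]

% Under these two assumptions, empirical risk minimization attains the
% excess-risk rate
\[
  \mathbf{E}\,R(\hat{G}_{n}) - R(G^{*})
  \;=\; O\!\bigl(n^{-\kappa/(2\kappa + \rho - 1)}\bigr),
\]
% which approaches the fast rate n^{-1} as \kappa \to 1 and \rho \to 0,
% and degrades toward the classical n^{-1/2} as \kappa \to \infty.
```

The aggregation idea can likewise be pictured schematically. The sketch below is an illustrative split-sample selection procedure, not the paper's exact construction: an ERM classifier is fitted on the first half of the sample for each candidate class, and the second half selects among the fitted classifiers by empirical risk. All names here (`aggregate`, `fit_stump`, the stump grid) are hypothetical.

```python
# Schematic sketch (not the paper's construction): split-sample
# aggregation over a finite collection of candidate classes.
import numpy as np

def misclassification(clf, X, y):
    """Empirical misclassification risk of classifier clf on (X, y)."""
    return np.mean(clf(X) != y)

def aggregate(classes, X, y, fit_erm):
    """Fit an ERM classifier per candidate class on the first half of
    the sample, then select the empirical-risk minimizer on the
    held-out second half."""
    n = len(y)
    X1, y1 = X[: n // 2], y[: n // 2]   # half used for ERM fitting
    X2, y2 = X[n // 2 :], y[n // 2 :]   # half used for selection
    fitted = [fit_erm(c, X1, y1) for c in classes]
    risks = [misclassification(f, X2, y2) for f in fitted]
    return fitted[int(np.argmin(risks))]

# Toy candidate "classes": decision stumps thresholding one coordinate;
# fit_stump performs exhaustive ERM over thresholds and orientations.
def fit_stump(coord, X, y):
    best, best_risk = None, np.inf
    for t in np.unique(X[:, coord]):
        for sign in (1.0, -1.0):
            clf = lambda Z, c=coord, t=t, s=sign: (s * (Z[:, c] - t) > 0).astype(int)
            risk = misclassification(clf, X, y)
            if risk < best_risk:
                best, best_risk = clf, risk
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
chosen = aggregate(classes=[0, 1], X=X, y=y, fit_erm=fit_stump)
print("risk of selected stump:", misclassification(chosen, X, y))
```

The selection step is what makes adaptation possible in such schemes: the chosen classifier's risk is close to the best candidate's without knowing in advance which class has the right complexity or margin.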

#### Article information

**Source:** Ann. Statist., Volume 32, Number 1 (2004), 135--166.

**Dates:** First available in Project Euclid: 12 March 2004

**Permanent link:** https://projecteuclid.org/euclid.aos/1079120131

**Digital Object Identifier:** doi:10.1214/aos/1079120131

**Mathematical Reviews number (MathSciNet):** MR2051002

**Zentralblatt MATH identifier:** 1105.62353

#### Citation

Tsybakov, Alexander B. Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 (2004), no. 1, 135--166. doi:10.1214/aos/1079120131. https://projecteuclid.org/euclid.aos/1079120131

#### References

• Aizerman, M. A., Braverman, E. M. and Rozonoer, L. I. (1970). Method of Potential Functions in the Theory of Learning Machines. Nauka, Moscow (in Russian).
• Alexander, K. S. (1984). Probability inequalities for empirical processes and a law of the iterated logarithm. Ann. Probab. 12 1041--1067. [Correction (1987) 15 428--430.]
• Alexander, K. S. (1987). Rates of growth and sample moduli for weighted empirical processes indexed by sets. Probab. Theory Related Fields 75 379--423.
• Anthony, M. and Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge Univ. Press.
• Barron, A. (1991). Complexity regularization with application to artificial neural networks. In Nonparametric Functional Estimation and Related Topics (G. Roussas, ed.) 561--576. Kluwer, Dordrecht.
• Bartlett, P. L., Boucheron, S. and Lugosi, G. (2002). Model selection and error estimation. Machine Learning 48 85--113.
• Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields 97 113--150.
• Breiman, L. (1996). Bagging predictors. Machine Learning 24 123--140.
• Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
• Bühlmann, P. and Yu, B. (2002). Analyzing bagging. Ann. Statist. 30 927--961.
• Catoni, O. (2001). Randomized estimators and empirical complexity for pattern recognition and least square regression. Prépublication 677, Laboratoire de Probabilités et Modèles Aléatoires, Univ. Paris 6/7. Available at www.proba.jussieu.fr.
• Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge Univ. Press.
• Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.
• Dudley, R. M. (1974). Metric entropy of some classes of sets with differentiable boundaries. J. Approximation Theory 10 227--236.
• Horváth, M. and Lugosi, G. (1998). Scale-sensitive dimensions and skeleton estimates for classification. Discrete Appl. Math. 86 37--61.
• Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory 47 1902--1914.
• Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30 1--50.
• Korostelev, A. P. and Tsybakov, A. B. (1993). Minimax Theory of Image Reconstruction. Lecture Notes in Statist. 82. Springer, New York.
• Lepski, O. V. (1990). A problem of adaptive estimation in Gaussian white noise. Theory Probab. Appl. 35 454--466.
• Lugosi, G. and Nobel, A. (1999). Adaptive model selection using empirical complexities. Ann. Statist. 27 1830--1864.
• Mammen, E. and Tsybakov, A. B. (1995). Asymptotical minimax recovery of sets with smooth boundaries. Ann. Statist. 23 502--524.
• Mammen, E. and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808--1829.
• Massart, P. (2000). Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse Math. 9 245--303.
• Schapire, R. E., Freund, Y., Bartlett, P. L. and Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651--1686.
• Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press.
• Tsybakov, A. B. (2002). Discussion of "Random rates in anisotropic regression," by M. Hoffmann and O. Lepskii. Ann. Statist. 30 379--385.
• van de Geer, S. (2000). Applications of Empirical Process Theory. Cambridge Univ. Press.
• van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
• Vapnik, V. N. (1982). Estimation of Dependencies Based on Empirical Data. Springer, New York.
• Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York.
• Vapnik, V. N. and Chervonenkis, A. Ya. (1974). Theory of Pattern Recognition. Nauka, Moscow (in Russian).
• Yang, Y. (1999). Minimax nonparametric classification. I. Rates of convergence. II. Model selection for adaptation. IEEE Trans. Inform. Theory 45 2271--2292.