The Annals of Statistics

Optimal aggregation of classifiers in statistical learning

Alexander B. Tsybakov

Abstract

Classification can be considered as nonparametric estimation of sets, where the risk is defined by means of a specific distance between sets associated with misclassification error. It is shown that the rates of convergence of classifiers depend on two parameters: the complexity of the class of candidate sets and the margin parameter. The dependence is explicitly given, indicating that optimal fast rates approaching $O(n^{-1})$ can be attained, where n is the sample size, and that the proposed classifiers have the property of robustness to the margin. The main result of the paper concerns optimal aggregation of classifiers: we suggest a classifier that automatically adapts both to the complexity and to the margin, and attains the optimal fast rates, up to a logarithmic factor.
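For orientation, a hedged sketch of the kind of rate involved (the symbols $\kappa$ and $\rho$ are assumed here for illustration, in the spirit of Mammen and Tsybakov (1999), and are not quoted from the abstract): if $\kappa \ge 1$ denotes the margin parameter and $\rho > 0$ the complexity exponent of the class of candidate sets, the excess risk behaves as
$$n^{-\kappa/(2\kappa + \rho - 1)},$$
so that as the margin becomes favorable ($\kappa \to 1$) and the class becomes simple ($\rho \to 0$), the exponent tends to $1$, giving the fast rate $O(n^{-1})$ mentioned above.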

Article information

Source
Ann. Statist., Volume 32, Number 1 (2004), 135-166.

Dates
First available in Project Euclid: 12 March 2004

Permanent link to this document
https://projecteuclid.org/euclid.aos/1079120131

Digital Object Identifier
doi:10.1214/aos/1079120131

Mathematical Reviews number (MathSciNet)
MR2051002

Zentralblatt MATH identifier
1105.62353

Subjects
Primary: 62G07: Density estimation
Secondary: 62G08: Nonparametric regression; 62H30: Classification and discrimination; cluster analysis [See also 68T10, 91C20]; 68T10: Pattern recognition, speech recognition {For cluster analysis, see 62H30}

Keywords
Classification; statistical learning; aggregation of classifiers; optimal rates; empirical processes; margins; complexity of classes of sets

Citation

Tsybakov, Alexander B. Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 (2004), no. 1, 135--166. doi:10.1214/aos/1079120131. https://projecteuclid.org/euclid.aos/1079120131


References

  • Aizerman, M. A., Braverman, E. M. and Rozonoer, L. I. (1970). Method of Potential Functions in the Theory of Learning Machines. Nauka, Moscow (in Russian).
  • Alexander, K. S. (1984). Probability inequalities for empirical processes and a law of the iterated logarithm. Ann. Probab. 12 1041--1067. [Correction (1987) 15 428--430.]
  • Alexander, K. S. (1987). Rates of growth and sample moduli for weighted empirical processes indexed by sets. Probab. Theory Related Fields 75 379--423.
  • Anthony, M. and Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge Univ. Press.
  • Barron, A. (1991). Complexity regularization with application to artificial neural networks. In Nonparametric Functional Estimation and Related Topics (G. Roussas, ed.) 561--576. Kluwer, Dordrecht.
  • Bartlett, P. L., Boucheron, S. and Lugosi, G. (2002). Model selection and error estimation. Machine Learning 48 85--113.
  • Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields 97 113--150.
  • Breiman, L. (1996). Bagging predictors. Machine Learning 24 123--140.
  • Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
  • Bühlmann, P. and Yu, B. (2002). Analyzing bagging. Ann. Statist. 30 927--961.
  • Catoni, O. (2001). Randomized estimators and empirical complexity for pattern recognition and least square regression. Prépublication 677, Laboratoire de Probabilités et Modèles Aléatoires, Univ. Paris 6/7. Available at www.proba.jussieu.fr.
  • Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge Univ. Press.
  • Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.
  • Dudley, R. M. (1974). Metric entropy of some classes of sets with differentiable boundaries. J. Approximation Theory 10 227--236.
  • Horváth, M. and Lugosi, G. (1998). Scale-sensitive dimensions and skeleton estimates for classification. Discrete Appl. Math. 86 37--61.
  • Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory 47 1902--1914.
  • Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30 1--50.
  • Korostelev, A. P. and Tsybakov, A. B. (1993). Minimax Theory of Image Reconstruction. Lecture Notes in Statist. 82. Springer, New York.
  • Lepski, O. V. (1990). A problem of adaptive estimation in Gaussian white noise. Theory Probab. Appl. 35 454--466.
  • Lugosi, G. and Nobel, A. (1999). Adaptive model selection using empirical complexities. Ann. Statist. 27 1830--1864.
  • Mammen, E. and Tsybakov, A. B. (1995). Asymptotical minimax recovery of sets with smooth boundaries. Ann. Statist. 23 502--524.
  • Mammen, E. and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808--1829.
  • Massart, P. (2000). Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse Math. 9 245--303.
  • Schapire, R. E., Freund, Y., Bartlett, P. L. and Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651--1686.
  • Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press.
  • Tsybakov, A. B. (2002). Discussion of "Random rates in anisotropic regression," by M. Hoffmann and O. Lepskii. Ann. Statist. 30 379--385.
  • van de Geer, S. (2000). Applications of Empirical Process Theory. Cambridge Univ. Press.
  • van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
  • Vapnik, V. N. (1982). Estimation of Dependencies Based on Empirical Data. Springer, New York.
  • Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York.
  • Vapnik, V. N. and Chervonenkis, A. Ya. (1974). Theory of Pattern Recognition. Nauka, Moscow (in Russian).
  • Yang, Y. (1999). Minimax nonparametric classification. I. Rates of convergence. II. Model selection for adaptation. IEEE Trans. Inform. Theory 45 2271--2292.