Source: Ann. Statist. Volume 32, Number 1 (2004), 135-166.
Classification can be considered as nonparametric estimation of sets, where the risk is defined by means of a specific distance between sets associated with misclassification error. It is shown that the rates of convergence of classifiers depend on two parameters: the complexity of the class of candidate sets and the margin parameter. The dependence is explicitly given, indicating that optimal fast rates approaching $O(n^{-1})$ can be attained, where $n$ is the sample size, and that the proposed classifiers have the property of robustness to the margin. The main result of the paper concerns optimal aggregation of classifiers: we suggest a classifier that automatically adapts both to the complexity and to the margin, and attains the optimal fast rates, up to a logarithmic factor.
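As an illustration of the stated dependence, here is a sketch in notation that the abstract itself does not define, so the symbols are assumptions: $\kappa \ge 1$ denotes the margin parameter, $0 < \rho < 1$ the complexity parameter of the class of candidate sets, $\hat{G}_n$ the estimated set, and $G^*$ the optimal (Bayes) set. Under these assumptions the excess risk is expected to behave as

$$R(\hat{G}_n) - R(G^*) = O\left(n^{-\kappa/(2\kappa+\rho-1)}\right).$$

In the most favorable margin case $\kappa = 1$ the exponent becomes $1/(1+\rho)$, which approaches $1$ as $\rho \to 0$; this is the sense in which the rates approach $O(n^{-1})$. As $\kappa \to \infty$ the exponent tends to $1/2$, giving the slower rate $O(n^{-1/2})$.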