The Annals of Statistics

Empirical Margin Distributions and Bounding the Generalization Error of Combined Classifiers

V. Koltchinskii and D. Panchenko


Abstract

We prove new probabilistic upper bounds on the generalization error of complex classifiers that are combinations of simple classifiers. Such combinations can be implemented by neural networks or by voting methods, such as boosting and bagging. The bounds are in terms of the empirical distribution of the margin of the combined classifier. They are based on methods from the theory of Gaussian and empirical processes (comparison inequalities, the symmetrization method, concentration inequalities), and they improve previous results of Bartlett (1998), bounding the generalization error of neural networks in terms of the $\ell_1$-norms of the weights of the neurons, and of Schapire, Freund, Bartlett and Lee (1998), bounding the generalization error of boosting. We also obtain rates of convergence, in the Lévy distance, of the empirical margin distribution to the true margin distribution, uniformly over classes of classifiers, and prove the optimality of these rates.
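To fix notation (these are standard definitions; the displays below give only the schematic form of such bounds, with constants and lower-order terms suppressed, and are not the paper's exact statements): for a convex combination $f(x) = \sum_j w_j h_j(x)$ of base classifiers $h_j$ with values in $[-1, 1]$, where $w_j \ge 0$ and $\sum_j w_j = 1$, the margin of a labeled example $(x, y)$ with $y \in \{-1, 1\}$ is $y f(x)$, and the empirical margin distribution of a sample $(x_1, y_1), \dots, (x_n, y_n)$ is

$$F_n(\delta) = \frac{1}{n} \sum_{i=1}^{n} I\{y_i f(x_i) \le \delta\}.$$

Bounds of the type proved in the paper then read: with probability at least $1 - \varepsilon$, uniformly over all such convex combinations $f$,

$$P\{y f(x) \le 0\} \;\le\; \inf_{\delta \in (0, 1]} \Big[ F_n(\delta) + \frac{C}{\delta}\, R_n(\mathcal{H}) + \sqrt{\frac{\log(1/\varepsilon)}{n}} \Big],$$

where $R_n(\mathcal{H})$ is the Rademacher complexity of the base class $\mathcal{H}$ and $C$ is an absolute constant; for a base class of VC dimension $V$, $R_n(\mathcal{H})$ is of order $\sqrt{V/n}$. Because the bound holds uniformly in $\delta$, the margin scale may be chosen after seeing the data, which is why the infimum appears inside the probability statement.

A minimal Python sketch of the empirical margin distribution of a weighted-vote classifier (the function name and toy data are illustrative, not from the paper):

    import numpy as np

    def empirical_margin_cdf(base_preds, weights, y, deltas):
        """F_n(delta) for a convex combination of base classifiers.
        base_preds: (m, n) outputs of m base classifiers, in [-1, 1];
        weights: (m,) convex weights (nonnegative, summing to 1);
        y: (n,) labels in {-1, +1}; deltas: thresholds to evaluate."""
        f = weights @ base_preds      # combined outputs f(x_i), shape (n,)
        margins = y * f               # margins y_i * f(x_i)
        # F_n(delta) = fraction of sample points with margin <= delta
        return np.array([(margins <= d).mean() for d in deltas])

    # Toy usage: three +/-1-valued base classifiers on five examples.
    rng = np.random.default_rng(0)
    base_preds = np.sign(rng.standard_normal((3, 5)))
    weights = np.array([0.5, 0.3, 0.2])
    y = np.array([1, -1, 1, 1, -1])
    print(empirical_margin_cdf(base_preds, weights, y, deltas=[0.0, 0.25, 0.5]))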

Article information

Source
Ann. Statist., Volume 30, Number 1 (2002), 1-50.

Dates
First available in Project Euclid: 5 March 2002

Permanent link to this document
https://projecteuclid.org/euclid.aos/1015362183

Digital Object Identifier
doi:10.1214/aos/1015362183

Mathematical Reviews number (MathSciNet)
MR1892654

Zentralblatt MATH identifier
1012.62004

Subjects
Primary: 62G05: Estimation
Secondary: 62G20: Asymptotic properties; 60F15: Strong theorems

Keywords
Generalization error; combined classifier; margin; empirical process; Rademacher process; Gaussian process; neural network; boosting; concentration inequalities

Citation

Koltchinskii, V.; Panchenko, D. Empirical Margin Distributions and Bounding the Generalization Error of Combined Classifiers. Ann. Statist. 30 (2002), no. 1, 1--50. doi:10.1214/aos/1015362183. https://projecteuclid.org/euclid.aos/1015362183



References

  • ANTHONY, M. and BARTLETT, P. (1999). Neural Network Learning: Theoretical Foundations. Cambridge Univ. Press.
  • BARRON, A. (1991a). Complexity regularization with applications to artificial neural networks. In Nonparametric Functional Estimation and Related Topics (G. Roussas, ed.) 561-576. Kluwer, Dordrecht.
  • BARRON, A. (1991b). Approximation and estimation bounds for artificial neural networks. In Proceedings of the Fourth Annual Workshop on Computational Learning Theory 243-249. Morgan Kaufmann, San Francisco.
  • BARRON, A., BIRGÉ, L. and MASSART, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301-413.
  • BARTLETT, P. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory 44 525-536.
  • BARTLETT, P. and SHAWE-TAYLOR, J. (1999). Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods: Support Vector Learning (B. Schölkopf, C. Burges and A. J. Smola, eds.) 43-54. MIT Press.
  • BIRGÉ, L. and MASSART, P. (1997). From model selection to adaptive estimation. In Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics (D. Pollard, E. Torgersen and G. Yang, eds.) 55-87. Springer, New York.
  • BLAKE, C. and MERZ, C. (1998). UCI repository of machine learning databases. Available at www.ics.uci.edu/~mlearn/MLRepository.html.
  • BREIMAN, L. (1996). Bagging predictors. Machine Learning 24 123-140.
  • CORTES, C. and VAPNIK, V. (1995). Support-vector networks. Machine Learning 20 273-297.
  • DEVROYE, L., GYÖRFI, L. and LUGOSI, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.
  • DUDLEY, R. M. (1999). Uniform Central Limit Theorems. Cambridge Univ. Press.
  • FELLER, W. (1950). An Introduction to Probability Theory and Its Applications 1. Wiley, New York.
  • FINE, T. L. (1999). Feedforward Neural Network Methodology. Springer, New York.
  • FREUND, Y. (1995). Boosting a weak learning algorithm by majority. Inform. and Comput. 121 256-285.
  • FREUND, Y. (1999). An adaptive version of the boost by majority algorithm. Preprint.
  • FRIEDMAN, J. (1999). Greedy function approximation: a gradient boosting machine. Technical report, Dept. Statistics, Stanford Univ.
  • FRIEDMAN, J., HASTIE, T. and TIBSHIRANI, R. (2000). Additive logistic regression: a statistical view of boosting. Ann. Statist. 28 337-374.
  • JOHNSTONE, I. M. (1998). Oracle inequalities and nonparametric function estimation. In Doc. Math. (Extra Volume) 3 267-278.
  • KOLTCHINSKII, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory. To appear.
  • KOLTCHINSKII, V. and PANCHENKO, D. (2000). Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II (E. Giné, D. Mason and J. Wellner, eds.) 444-459. Birkhäuser, Boston.
  • KOLTCHINSKII, V., PANCHENKO, D. and LOZANO, F. (2000a). Bounding the generalization error of convex combinations of classifiers: balancing the dimensionality and the margins. Preprint. Available at www.boosting.org/.
  • KOLTCHINSKII, V., PANCHENKO, D. and LOZANO, F. (2000b). Some new bounds on the generalization error of combined classifiers. Advances in Neural Information Processing Systems. Available at www.boosting.org/.
  • LEDOUX, M. and TALAGRAND, M. (1991). Probability in Banach Spaces. Springer, New York.
  • MASON, L., BARTLETT, P. and BAXTER, J. (1999). Improved generalization through explicit optimization of margins. Machine Learning 38 243-255.
  • MASON, L., BAXTER, J., BARTLETT, P. and FREAN, M. (2000). Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers (A. J. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans, eds.). MIT Press.
  • MASSART, P. (2000). About the constants in Talagrand's concentration inequalities for empirical processes. Ann. Probab. 28 863-885.
  • SAZONOV, V. V. (1963). On the Glivenko-Cantelli theorem. Theory Probab. Appl. 8 282-285.
  • SCHAPIRE, R., FREUND, Y., BARTLETT, P. and LEE, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651-1686.
  • TALAGRAND, M. (1996a). A new look at independence. Ann. Probab. 24 1-34.
  • TALAGRAND, M. (1996b). New concentration inequalities in product spaces. Invent. Math. 126 505-563.
  • TALAGRAND, M. (1998). Rigorous results for the Hopfield model with many patterns. Probab. Theory Related Fields 110 177-276.
  • TOPSØE, F., DUDLEY, R. and HOFFMANN-JØRGENSEN, J. (1976). Two examples concerning uniform convergence of measures w.r.t. balls in Banach spaces. In Empirical Distributions and Processes. Lecture Notes in Math. 566 141-146. Springer, Berlin.
  • VAN DER VAART, A. W. and WELLNER, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York.
  • VAPNIK, V. (1998). Statistical Learning Theory. Wiley, New York.
  • VIDYASAGAR, M. (1997). A Theory of Learning and Generalization. Springer, New York.
  • YUKICH, J., STINCHCOMBE, M. and WHITE, H. (1995). Sup-norm approximation bounds for networks through probabilistic methods. IEEE Trans. Inform. Theory 41 1021-1027.

University of New Mexico
Albuquerque, New Mexico 87131-1141
E-mail: vlad@math.unm.edu
panchenk@math.unm.edu