The Annals of Statistics

Empirical Margin Distributions and Bounding the Generalization Error of Combined Classifiers

V. Koltchinskii and D. Panchenko



We prove new probabilistic upper bounds on the generalization error of complex classifiers that are combinations of simple classifiers. Such combinations could be implemented by neural networks or by voting methods of combining the classifiers, such as boosting and bagging. The bounds are in terms of the empirical distribution of the margin of the combined classifier. They are based on methods from the theory of Gaussian and empirical processes (comparison inequalities, the symmetrization method, concentration inequalities), and they improve previous results of Bartlett (1998) on bounding the generalization error of neural networks in terms of $\ell_1$-norms of the weights of neurons, and of Schapire, Freund, Bartlett and Lee (1998) on bounding the generalization error of boosting. We also obtain rates of convergence in the Lévy distance of the empirical margin distribution to the true margin distribution, uniformly over classes of classifiers, and prove the optimality of these rates.
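To make the central object concrete: for a convex combination of base classifiers, the margin of an example $(x, y)$ is $y f(x)$, where $f$ is the weighted vote, and the empirical margin distribution evaluated at $\theta$ is the fraction of sample points whose margin is at most $\theta$. The sketch below (not from the paper; an illustrative assumption is that base classifiers output $\pm 1$ and labels lie in $\{-1, +1\}$) computes this distribution for a toy voting classifier.

```python
import numpy as np

def empirical_margin_distribution(base_preds, weights, y, thresholds):
    """Empirical margin distribution of a voting (convex-combination) classifier.

    base_preds: (n_samples, n_classifiers) array of base predictions in {-1, +1}
    weights:    nonnegative voting weights (normalized below to sum to 1)
    y:          true labels in {-1, +1}
    thresholds: margin levels theta at which to evaluate the distribution

    Returns, for each theta, the fraction of examples whose margin
    y * f(x) is at most theta, where f is the weighted vote.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()            # convex combination: weights sum to 1
    f = base_preds @ w         # combined classifier f(x), values in [-1, 1]
    margins = y * f            # margin of each example; <= 0 means misclassified
    return np.array([(margins <= t).mean() for t in thresholds])

# Toy example: three base classifiers voting on four points.
preds = np.array([[ 1,  1, -1],
                  [ 1, -1,  1],
                  [-1, -1, -1],
                  [ 1,  1,  1]])
y = np.array([1, 1, -1, -1])
dist = empirical_margin_distribution(preds, [1, 1, 1], y, thresholds=[0.0, 0.5, 1.0])
# -> [0.25, 0.75, 1.0]: one point misclassified, two with margin 1/3, one with margin 1
```

The value of the distribution at $\theta = 0$ is the training error; the bounds of the paper trade off this quantity at larger $\theta$ against a complexity term that shrinks as $\theta$ grows.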

Article information

Ann. Statist., Volume 30, Number 1 (2002), 1-50.

First available in Project Euclid: 5 March 2002

Primary: 62G05: Estimation
Secondary: 62G20: Asymptotic properties; 60F15

Keywords: generalization error; combined classifier; margin; empirical process; Rademacher process; Gaussian process; neural network; boosting; concentration inequalities


Koltchinskii, V.; Panchenko, D. Empirical Margin Distributions and Bounding the Generalization Error of Combined Classifiers. Ann. Statist. 30 (2002), no. 1, 1--50. doi:10.1214/aos/1015362183.



  • ANTHONY, M. and BARTLETT, P. (1999). Neural Network Learning: Theoretical Foundations. Cambridge Univ. Press.
  • BARRON, A. (1991a). Complexity regularization with applications to artificial neural networks. In Nonparametric Functional Estimation and Related Topics (G. Roussas, ed.) 561-576. Kluwer, Dordrecht.
  • BARRON, A. (1991b). Approximation and estimation bounds for artificial neural networks. In Proceedings of the Fourth Annual Workshop on Computational Learning Theory 243-249. Morgan Kaufmann, San Francisco.
  • BARRON, A., BIRGÉ, L. and MASSART, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301-413.
  • BARTLETT, P. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory 44 525-536.
  • BARTLETT, P. and SHAWE-TAYLOR, J. (1999). Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods: Support Vector Learning (B. Schölkopf, C. Burges and A. J. Smola, eds.) 43-54. MIT Press.
  • BIRGÉ, L. and MASSART, P. (1997). From model selection to adaptive estimation. In Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics (D. Pollard, E. Torgersen and G. Yang, eds.) 55-87. Springer, New York.
  • BLAKE, C. and MERZ, C. (1998). UCI repository of machine learning databases. Available at mlearn/MLRepository.html.
  • BREIMAN, L. (1996). Bagging predictors. Machine Learning 24 123-140.
  • CORTES, C. and VAPNIK, V. (1995). Support vector networks. Machine Learning 20 273-297.
  • DEVROYE, L., GYÖRFI, L. and LUGOSI, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.
  • DUDLEY, R. M. (1999). Uniform Central Limit Theorems. Cambridge Univ. Press.
  • FELLER, W. (1950). An Introduction to Probability Theory and Its Applications 1. Wiley, New York.
  • FINE, T. L. (1999). Feedforward Neural Network Methodology. Springer, New York.
  • FREUND, Y. (1995). Boosting a weak learning algorithm by majority. Inform. and Comput. 121 256-285.
  • FREUND, Y. (1999). An adaptive version of the boost by majority algorithm. Preprint.
  • FRIEDMAN, J. (1999). Greedy function approximation: a gradient boosting machine. Technical report, Dept. Statistics, Stanford Univ.
  • FRIEDMAN, J., HASTIE, T. and TIBSHIRANI, R. (2000). Additive logistic regression: a statistical view of boosting. Ann. Statist. 28 337-374.
  • JOHNSTONE, I. M. (1998). Oracle inequalities and nonparametric function estimation. In Doc. Math. (Extra Volume) 3 267-278.
  • KOLTCHINSKII, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory. To appear.
  • KOLTCHINSKII, V. and PANCHENKO, D. (2000). Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II (E. Giné, D. Mason and J. Wellner, eds.) 444-459. Birkhäuser, Boston.
  • KOLTCHINSKII, V., PANCHENKO, D. and LOZANO, F. (2000a). Bounding the generalization error of convex combinations of classifiers: balancing the dimensionality and the margins. Preprint.
  • KOLTCHINSKII, V., PANCHENKO, D. and LOZANO, F. (2000b). Some new bounds on the generalization error of combined classifiers. In Advances in Neural Information Processing Systems.
  • LEDOUX, M. and TALAGRAND, M. (1991). Probability in Banach Spaces. Springer, New York.
  • MASON, L., BARTLETT, P. and BAXTER, J. (1999). Improved generalization through explicit optimization of margins. Machine Learning 38 243-255.
  • MASON, L., BAXTER, J., BARTLETT, P. and FREAN, M. (2000). Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers (A. J. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans, eds.). MIT Press.
  • MASSART, P. (2000). About the constants in Talagrand's concentration inequalities for empirical processes. Ann. Probab. 28 863-885.
  • SAZONOV, V. V. (1963). On the Glivenko-Cantelli theorem. Theory Probab. Appl. 8 282-285.
  • SCHAPIRE, R., FREUND, Y., BARTLETT, P. and LEE, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651-1686.
  • TALAGRAND, M. (1996a). A new look at independence. Ann. Probab. 24 1-34.
  • TALAGRAND, M. (1996b). New concentration inequalities in product spaces. Invent. Math. 126 505-563.
  • TALAGRAND, M. (1998). Rigorous results for the Hopfield model with many patterns. Probab. Theory Related Fields 110 177-276.
  • TOPSØE, F., DUDLEY, R. and HOFFMANN-JØRGENSEN, J. (1976). Two examples concerning uniform convergence of measures w.r.t. balls in Banach spaces. Empirical Distributions and Processes. Lecture Notes in Math. 566 141-146. Springer, Berlin.
  • VAN DER VAART, A. W. and WELLNER, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York.
  • VAPNIK, V. (1998). Statistical Learning Theory. Wiley, New York.
  • VIDYASAGAR, M. (1997). A Theory of Learning and Generalization. Springer, New York.
  • YUKICH, J., STINCHCOMBE, M. and WHITE, H. (1995). Sup-norm approximation bounds for networks through probabilistic methods. IEEE Trans. Inform. Theory 41 1021-1027.