The Annals of Statistics

Boosting the margin: a new explanation for the effectiveness of voting methods

Peter Bartlett, Yoav Freund, Wee Sun Lee, and Robert E. Schapire


Abstract

One of the surprising recurring phenomena observed in experiments with boosting is that the test error of the generated classifier usually does not increase as its size becomes very large, and often is observed to decrease even after the training error reaches zero. In this paper, we show that this phenomenon is related to the distribution of margins of the training examples with respect to the generated voting classification rule, where the margin of an example is simply the difference between the number of correct votes and the maximum number of votes received by any incorrect label. We show that techniques used in the analysis of Vapnik’s support vector classifiers and of neural networks with small weights can be applied to voting methods to relate the margin distribution to the test error. We also show theoretically and experimentally that boosting is especially effective at increasing the margins of the training examples. Finally, we compare our explanation to those based on the bias-variance decomposition.
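
For concreteness, here is a sketch of the margin the abstract refers to, stated in the paper's setting (assumption: base classifiers h_1, ..., h_T are combined by a weighted vote with nonnegative weights a_1, ..., a_T normalized to sum to one; for an unweighted vote, take each a_t = 1/T). For a labeled example (x, y),

    margin(x, y) = \sum_{t : h_t(x) = y} a_t - \max_{y' \neq y} \sum_{t : h_t(x) = y'} a_t,

so the margin lies in [-1, 1] and is positive if and only if the weighted vote classifies x correctly.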

Article information

Source
Ann. Statist. Volume 26, Number 5 (1998), 1651-1686.

Dates
First available in Project Euclid: 21 June 2002

Permanent link to this document
http://projecteuclid.org/euclid.aos/1024691352

Mathematical Reviews number (MathSciNet)
MR1673273

Digital Object Identifier
doi:10.1214/aos/1024691352

Zentralblatt MATH identifier
0929.62069

Subjects
Primary: 62H30: Classification and discrimination; cluster analysis [See also 68T10, 91C20]

Keywords
Ensemble methods; decision trees; neural networks; bagging; boosting; error-correcting output coding; Markov chain Monte Carlo

Citation

Schapire, Robert E.; Freund, Yoav; Bartlett, Peter; Lee, Wee Sun. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics 26 (1998), no. 5, 1651--1686. doi:10.1214/aos/1024691352. http://projecteuclid.org/euclid.aos/1024691352.



References

  • 1 BARRON, A. R. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39 930–945.
  • 2 BARTLETT, P. L. 1998. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory 44 525–536.
  • 3 BAUER, E. and KOHAVI, R. 1997. An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Unpublished manuscript.
  • 4 BAUM, E. B. and HAUSSLER, D. 1989. What size net gives valid generalization? Neural Computation 1 151–160.
  • 5 BOSER, B. E., GUYON, I. M. and VAPNIK, V. N. 1992. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory 144–152. ACM, New York.
  • 6 BREIMAN, L. 1996. Bagging predictors. Machine Learning 24 123–140.
  • 7 BREIMAN, L. 1997. Prediction games and arcing classifiers. Technical Report 504, Dept. Statistics, Univ. California, Berkeley.
  • 8 BREIMAN, L. 1998. Arcing classifiers (with discussion). Ann. Statist. 26 801–849.
  • 9 BREIMAN, L., FRIEDMAN, J. H., OLSHEN, R. A. and STONE, C. J. 1984. Classification and Regression Trees. Wadsworth, Belmont, CA.
  • 10 CORTES, C. and VAPNIK, V. 1995. Support-vector networks. Machine Learning 20 273–297.
  • 11 DEVROYE, L. 1982. Bounds for the uniform deviation of empirical measures. J. Multivariate Anal. 12 72–79.
  • 12 DIETTERICH, T. G. 1998. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Unpublished manuscript.
  • 13 DIETTERICH, T. G. and BAKIRI, G. 1995. Solving multiclass learning problems via error-correcting output codes. J. Artificial Intelligence Res. 2 263–286.
  • 14 DONAHUE, M. J., GURVITS, L., DARKEN, C. and SONTAG, E. 1997. Rates of convex approximation in non-Hilbert spaces. Constr. Approx. 13 187–220.
  • 15 DRUCKER, H. 1997. Improving regressors using boosting techniques. In Machine Learning: Proceedings of the Fourteenth International Conference 107–115. Morgan Kaufmann, San Francisco.
  • 16 DRUCKER, H. and CORTES, C. 1996. Boosting decision trees. Advances in Neural Information Processing Systems 8 479–485.
  • 17 FREUND, Y. 1995. Boosting a weak learning algorithm by majority. Inform. Comput. 121 256–285.
  • 18 FREUND, Y. and SCHAPIRE, R. E. 1996. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148–156. Morgan Kaufmann, San Francisco.
  • 19 FREUND, Y. and SCHAPIRE, R. E. 1996. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory 325–332.
  • 20 FREUND, Y. and SCHAPIRE, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119–139.
  • 21 FREUND, Y. and SCHAPIRE, R. E. 1998. Adaptive game playing using multiplicative weights. Games and Economic Behavior. To appear.
  • 22 FRIEDMAN, J. H. 1998. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Available at http://stat.stanford.edu/~jhf.
  • 23 GROVE, A. J. and SCHUURMANS, D. 1998. Boosting in the limit: maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. AAAI Press, Menlo Park, CA.
  • 24 JONES, L. K. 1992. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist. 20 608–613.
  • 25 KOHAVI, R. and WOLPERT, D. H. 1996. Bias plus variance decomposition for zero-one loss functions. In Machine Learning: Proceedings of the Thirteenth International Conference 275–283. Morgan Kaufmann, San Francisco.
  • 26 KONG, E. B. and DIETTERICH, T. G. 1995. Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning 313–321. Morgan Kaufmann, San Francisco.
  • 27 LEE, W. S., BARTLETT, P. L. and WILLIAMSON, R. C. 1996. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans. Inform. Theory 42 2118–2132.
  • 28 LEE, W. S., BARTLETT, P. L. and WILLIAMSON, R. C. 1998. The importance of convexity in learning with squared loss. IEEE Trans. Inform. Theory. To appear.
  • 29 MACLIN, R. and OPITZ, D. 1997. An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence 546–551.
  • 30 MERZ, C. J. and MURPHY, P. M. 1998. UCI repository of machine learning databases. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html.
  • 31 QUINLAN, J. R. 1996. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence 725–730.
  • 32 QUINLAN, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo.
  • 33 SAUER, N. 1972. On the density of families of sets. J. Combin. Theory Ser. A 13 145–147.
  • 34 SCHAPIRE, R. E. 1990. The strength of weak learnability. Machine Learning 5 197–227.
  • 35 SCHAPIRE, R. E. 1997. Using output codes to boost multiclass learning problems. In Machine Learning: Proceedings of the Fourteenth International Conference 313–321. Morgan Kaufmann, San Francisco.
  • 36 SCHAPIRE, R. E. and SINGER, Y. 1998. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory.
  • 37 SCHWENK, H. and BENGIO, Y. 1998. Training methods for adaptive boosting of neural networks for character recognition. Advances in Neural Information Processing Systems 10 647–653. MIT Press.
  • 40 TIBSHIRANI, R. 1996. Bias, variance and prediction error for classification rules. Technical Report, Univ. Toronto.
  • 41 VAPNIK, V. N. 1995. The Nature of Statistical Learning Theory. Springer, New York.
  • 42 VAPNIK, V. N. and CHERVONENKIS, A. YA. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16 264–280.
Robert E. Schapire, Yoav Freund
Florham Park, New Jersey 07932-0971
E-mail: schapire@research.att.com; yoav@research.att.com

Peter Bartlett
RSISE, Australian National University
Canberra, ACT 0200, Australia
E-mail: Peter.Bartlett@anu.edu.au

Wee Sun Lee
University College UNSW, Australian Defence Force Academy
Canberra, ACT 2600, Australia
E-mail: w-lee@ee.adfa.oz.au