Source: Ann. Statist. Volume 26, Number 5 (1998)
By R. E. Schapire, Y. Freund, P. L. Bartlett and W. S. Lee
One of the surprising recurring phenomena observed in experiments
with boosting is that the test error of the generated classifier usually does
not increase as its size becomes very large, and often is observed to decrease
even after the training error reaches zero. In this paper, we show that this
phenomenon is related to the distribution of margins of the training
examples with respect to the generated voting classification rule, where the
margin of an example is simply the difference between the number of correct
votes and the maximum number of votes received by any incorrect label. We show
that techniques used in the analysis of Vapnik’s support vector
classifiers and of neural networks with small weights can be applied to voting
methods to relate the margin distribution to the test error. We also show
theoretically and experimentally that boosting is especially effective at
increasing the margins of the training examples. Finally, we compare our
explanation to those based on the bias-variance decomposition.
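The margin definition above is concrete enough to compute directly. The sketch below illustrates it for a toy weighted-vote ensemble: the margin of an example is the normalized vote weight assigned to its correct label minus the largest weight assigned to any incorrect label, so margins lie in [-1, 1] and are positive exactly when the vote classifies the example correctly. The base classifiers and weights here are hypothetical illustrations, not taken from the paper's experiments.

```python
from collections import defaultdict

def margin(example, label, hypotheses, weights):
    """Margin of (example, label) under a weighted vote of hypotheses:
    normalized weight for the correct label minus the maximum normalized
    weight received by any incorrect label."""
    total = sum(weights)
    votes = defaultdict(float)
    for h, w in zip(hypotheses, weights):
        votes[h(example)] += w / total
    # Largest vote for a wrong label (0.0 if every hypothesis is correct).
    wrong = max((v for lbl, v in votes.items() if lbl != label), default=0.0)
    return votes[label] - wrong

# Toy base classifiers on integers, with made-up voting weights:
h1 = lambda x: 1 if x > 0 else 0
h2 = lambda x: 1 if x > 2 else 0
h3 = lambda x: 1
hyps, wts = [h1, h2, h3], [0.5, 0.3, 0.2]

print(margin(3, 1, hyps, wts))    # unanimous vote: margin 1
print(margin(1, 1, hyps, wts))    # h2 dissents: margin shrinks
print(margin(-1, 0, hyps, wts))   # correct, but h3 votes against
```

Boosting's effect, in the paper's account, is to push this whole distribution of training-set margins toward larger values, which in turn tightens the margin-based test-error bounds even after training error reaches zero.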
1 BARRON, A. R. 1993. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39 930–945.
2 BARTLETT, P. L. 1998. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory 44 525–536.
3 BAUER, E. and KOHAVI, R. 1997. An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Unpublished manuscript.
4 BAUM, E. B. and HAUSSLER, D. 1989. What size net gives valid generalization? Neural Computation 1 151–160.
5 BOSER, B. E., GUYON, I. M. and VAPNIK, V. N. 1992. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory 144–152. ACM, New York.
6 BREIMAN, L. 1996. Bagging predictors. Machine Learning 24 123–140.
7 BREIMAN, L. 1997. Prediction games and arcing classifiers. Technical Report 504, Dept. Statistics, Univ. California, Berkeley.
8 BREIMAN, L. 1998. Arcing classifiers (with discussion). Ann. Statist. 26 801–849.
9 BREIMAN, L., FRIEDMAN, J. H., OLSHEN, R. A. and STONE, C. J. 1984. Classification and Regression Trees. Wadsworth, Belmont, CA.
10 CORTES, C. and VAPNIK, V. 1995. Support-vector networks. Machine Learning 20 273–297.
11 DEVROYE, L. 1982. Bounds for the uniform deviation of empirical measures. J. Multivariate Anal. 12 72–79.
12 DIETTERICH, T. G. 1998. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Unpublished manuscript.
13 DIETTERICH, T. G. and BAKIRI, G. 1995. Solving multiclass learning problems via error-correcting output codes. J. Artificial Intelligence Res. 2 263–286.
14 DONAHUE, M. J., GURVITS, L., DARKEN, C. and SONTAG, E. 1997. Rates of convex approximation in non-Hilbert spaces. Constr. Approx. 13 187–220.
15 DRUCKER, H. 1997. Improving regressors using boosting techniques. In Machine Learning: Proceedings of the Fourteenth International Conference 107–115. Morgan Kaufmann, San Francisco.
16 DRUCKER, H. and CORTES, C. 1996. Boosting decision trees. Advances in Neural Information Processing Systems 8 479–485.
17 FREUND, Y. 1995. Boosting a weak learning algorithm by majority. Inform. Comput. 121 256–285.
18 FREUND, Y. and SCHAPIRE, R. E. 1996. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148–156. Morgan Kaufmann, San Francisco.
19 FREUND, Y. and SCHAPIRE, R. E. 1996. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory 325–332.
20 FREUND, Y. and SCHAPIRE, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119–139.
21 FREUND, Y. and SCHAPIRE, R. E. 1998. Adaptive game playing using multiplicative weights. Games and Economic Behavior. To appear.
22 FRIEDMAN, J. H. 1998. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Available at http://stat.stanford.edu/~jhf.
23 GROVE, A. J. and SCHUURMANS, D. 1998. Boosting in the limit: maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. AAAI Press, Menlo Park, CA.
24 JONES, L. K. 1992. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist. 20 608–613.
25 KOHAVI, R. and WOLPERT, D. H. 1996. Bias plus variance decomposition for zero-one loss functions. In Machine Learning: Proceedings of the Thirteenth International Conference 275–283. Morgan Kaufmann, San Francisco.
26 KONG, E. B. and DIETTERICH, T. G. 1995. Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning 313–321. Morgan Kaufmann, San Francisco.
27 LEE, W. S., BARTLETT, P. L. and WILLIAMSON, R. C. 1996. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans. Inform. Theory 42 2118–2132.
28 LEE, W. S., BARTLETT, P. L. and WILLIAMSON, R. C. 1998. The importance of convexity in learning with squared loss. IEEE Trans. Inform. Theory. To appear.
29 MACLIN, R. and OPITZ, D. 1997. An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence 546–551.
30 MERZ, C. J. and MURPHY, P. M. 1998. UCI repository of machine learning databases. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html.
31 QUINLAN, J. R. 1996. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence 725–730.
32 QUINLAN, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo.
33 SAUER, N. 1972. On the density of families of sets. J. Combin. Theory Ser. A 13 145–147.
34 SCHAPIRE, R. E. 1990. The strength of weak learnability. Machine Learning 5 197–227.
35 SCHAPIRE, R. E. 1997. Using output codes to boost multiclass learning problems. In Machine Learning: Proceedings of the Fourteenth International Conference 313–321. Morgan Kaufmann, San Francisco.
36 SCHAPIRE, R. E. and SINGER, Y. 1998. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory.
37 SCHWENK, H. and BENGIO, Y. 1998. Training methods for adaptive boosting of neural networks for character recognition. Advances in Neural Information Processing Systems 10 647–653. MIT Press.
40 TIBSHIRANI, R. 1996. Bias, variance and prediction error for classification rules. Technical Report, Univ. Toronto.
41 VAPNIK, V. N. 1995. The Nature of Statistical Learning Theory. Springer, New York.
42 VAPNIK, V. N. and CHERVONENKIS, A. YA. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16 264–280.
FLORHAM PARK, NEW JERSEY 07932-0971
E-MAIL: firstname.lastname@example.org
FLORHAM PARK, NEW JERSEY 07932-0971
E-MAIL: email@example.com
RSISE, AUSTRALIAN NATIONAL UNIVERSITY
CANBERRA, ACT 0200, AUSTRALIA
E-MAIL: Peter.Bartlett@anu.edu.au
UNIVERSITY COLLEGE UNSW
AUSTRALIAN DEFENCE FORCE ACADEMY
CANBERRA, ACT 2600, AUSTRALIA
E-MAIL: firstname.lastname@example.org