The Annals of Statistics

Arcing classifier (with discussion and a rejoinder by the author)

Leo Breiman

Abstract

Recent work has shown that combining multiple versions of unstable classifiers such as trees or neural nets results in reduced test set error. One of the more effective methods is bagging. Here, modified training sets are formed by resampling from the original training set, classifiers are constructed using these training sets, and the classifiers are then combined by voting. Freund and Schapire propose an algorithm whose basis is to adaptively resample and combine (hence the acronym "arcing"), so that the weights in the resampling are increased for those cases most often misclassified and the combining is done by weighted voting. Arcing is more successful than bagging in reducing test set error. We explore two arcing algorithms, compare them to each other and to bagging, and try to understand how arcing works. We introduce definitions of bias and variance for a classifier as components of the test set error. Unstable classifiers can have low bias on a large range of data sets; their problem is high variance. Combining multiple versions, either through bagging or arcing, reduces variance significantly.
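
The abstract describes bagging and arcing only in words. As a concrete illustration, the sketch below builds both kinds of ensembles around scikit-learn decision trees as the unstable base classifier: bagging grows each tree on a bootstrap sample and combines by unweighted voting, while the arcing variant upweights the training cases a tree misclassifies and combines by weighted voting. This is a minimal sketch of the general idea, not Breiman's arc-x4 or the exact Freund-Schapire procedure; it reweights the training set directly rather than resampling it, and the round count, depth limit, and helper names are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def bagging(X, y, n_rounds=50, seed=0):
    """Bagging: grow each tree on a bootstrap sample, combine by unweighted voting."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_rounds):
        idx = rng.integers(0, n, size=n)              # resample with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return lambda X_new: _weighted_vote(trees, np.ones(len(trees)), X_new)


def arcing(X, y, n_rounds=50, max_depth=3):
    """Arcing (AdaBoost-style): upweight misclassified cases, combine by weighted voting."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    trees, alphas = [], []
    for _ in range(n_rounds):
        tree = DecisionTreeClassifier(max_depth=max_depth).fit(X, y, sample_weight=w)
        miss = tree.predict(X) != y
        err = np.clip(w @ miss, 1e-10, 1.0 - 1e-10)   # weighted training error
        alpha = np.log((1.0 - err) / err)             # vote weight for this tree
        w = w * np.exp(alpha * miss)                  # increase weight on misclassified cases
        w = w / w.sum()
        trees.append(tree)
        alphas.append(alpha)
    return lambda X_new: _weighted_vote(trees, np.array(alphas), X_new)


def _weighted_vote(trees, weights, X_new):
    """Return the class with the largest total vote weight at each point."""
    preds = np.array([t.predict(X_new) for t in trees])          # (n_trees, n_points)
    classes = np.unique(preds)
    votes = np.array([(weights[:, None] * (preds == c)).sum(axis=0) for c in classes])
    return classes[votes.argmax(axis=0)]
```

Given training arrays X_train, y_train and a held-out test set, `bagging(X_train, y_train)(X_test)` and `arcing(X_train, y_train)(X_test)` return the two ensembles' predicted labels, which can be compared against the test labels to reproduce the kind of test set error comparison the paper reports.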

Article information

Source
Ann. Statist. Volume 26, Number 3 (1998), 801-849.

Dates
First available: 21 June 2002

Permanent link to this document
http://projecteuclid.org/euclid.aos/1024691079

Mathematical Reviews number (MathSciNet)
MR1635406

Digital Object Identifier
doi:10.1214/aos/1024691079

Subjects
Primary: 62H30

Keywords
Ensemble methods, decision trees, neural networks, bagging, boosting, error-correcting output coding, Markov chain Monte Carlo

Citation

Breiman, Leo. Arcing classifier (with discussion and a rejoinder by the author). The Annals of Statistics 26 (1998), no. 3, 801--849. doi:10.1214/aos/1024691079. http://projecteuclid.org/euclid.aos/1024691079.


References

  • ALI, K. 1995. Learning probabilistic relational concept descriptions. Ph.D. dissertation, Dept. Computer Science, Univ. California, Irvine.
  • BREIMAN, L. 1996a. Bagging predictors. Machine Learning 26 123-140.
  • BREIMAN, L. 1996b. The heuristics of instability in model selection. Ann. Statist. 24 2350-2383.
  • BREIMAN, L., FRIEDMAN, J., OLSHEN, R. and STONE, C. 1984. Classification and Regression Trees. Chapman and Hall, London.
  • DRUCKER, H. and CORTES, C. 1996. Boosting decision trees. Advances in Neural Information Processing Systems 8 479-485.
  • FREUND, Y. and SCHAPIRE, R. 1996. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference (L. Saitta, ed.) 148-156. Morgan Kaufmann, San Francisco.
  • FREUND, Y. and SCHAPIRE, R. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119-139.
  • FRIEDMAN, J. H. 1996. On bias, variance, 0/1-loss, and the curse of dimensionality. Journal of Knowledge Discovery and Data Mining. To appear.
  • GEMAN, S., BIENENSTOCK, E. and DOURSAT, R. 1992. Neural networks and the bias/variance dilemma. Neural Computation 4 1-58.
  • HASTIE, T. and TIBSHIRANI, R. 1994. Handwritten digit recognition via deformable prototypes. Unpublished manuscript. Available at ftp://stat.stanford.edu/pub/hastie/zip.ps.Z.
  • KEARNS, M. and VALIANT, L. G. 1988. Learning Boolean formulae or finite automata is as hard as factoring. Technical Report TR-14-88, Aiken Computation Laboratory, Harvard Univ.
  • KEARNS, M. and VALIANT, L. G. 1989. Cryptographic limitations on learning Boolean formulae and finite automata. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing 433-444. ACM Press, New York.
  • KOHAVI, R. and WOLPERT, D. H. 1996. Bias plus variance decomposition for zero-one loss functions. In Machine Learning: Proceedings of the Thirteenth International Conference (L. Saitta, ed.) 275-283. Morgan Kaufmann, San Francisco.
  • KONG, E. B. and DIETTERICH, T. G. 1995. Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning (A. Prieditis and S. Russell, eds.) 313-321. Morgan Kaufmann, San Francisco.
  • LE CUN, Y., BOSER, B., DENKER, J., HENDERSON, D., HOWARD, R., HUBBARD, W. and JACKEL, L. 1990. Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems 2 396-404.
  • MICHIE, D., SPIEGELHALTER, D. and TAYLOR, C. 1994. Machine Learning, Neural and Statistical Classification. Ellis Horwood, London.
  • QUINLAN, J. R. 1996. Bagging, boosting, and C4.5. In Proceedings of AAAI '96 National Conference on Artificial Intelligence 725-730.
  • SCHAPIRE, R. 1990. The strength of weak learnability. Machine Learning 5 197-227.
  • SIMARD, P., LE CUN, Y. and DENKER, J. 1993. Efficient pattern recognition using a new transformation distance. Advances in Neural Information Processing Systems 5 50-58.
  • TIBSHIRANI, R. 1996. Bias, variance, and prediction error for classification rules. Technical Report, Dept. Statistics, Univ. Toronto.
  • VAPNIK, V. 1995. The Nature of Statistical Learning Theory. Springer, New York.
  • BERKELEY, CALIFORNIA 94720-3860. E-MAIL: leo@stat.berkeley.edu
  • 1. BREIMAN, L. 1996. Bagging predictors. Machine Learning 26 123-140.
  • 2. BREIMAN, L. 1996. The heuristics of instability in model selection. Ann. Statist. 24 2350-2383.
  • 3. DRUCKER, H. and CORTES, C. 1996. Boosting decision trees. Advances in Neural Information Processing Systems 8 479-485.
  • 4. FLOYD, S. and WARMUTH, M. 1995. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine Learning 21 269-304.
  • 5. FREUND, Y. 1995. Boosting a weak learning algorithm by majority. Inform. and Comput. 121 256-285.
  • 6. FREUND, Y. and SCHAPIRE, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119-139.
  • 7. KEARNS, M. and VALIANT, L. G. 1994. Cryptographic limitations on learning Boolean formulae and finite automata. J. Assoc. Comput. Mach. 41 67-95.
  • 8. KOHAVI, R. and WOLPERT, D. H. 1996. Bias plus variance decomposition for zero-one loss functions. In Machine Learning: Proceedings of the Thirteenth International Conference (L. Saitta, ed.) 275-283. Morgan Kaufmann, San Francisco.
  • 9. KONG, E. B. and DIETTERICH, T. G. 1995. Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning (A. Prieditis and S. Russell, eds.) 313-321. Morgan Kaufmann, San Francisco.
  • 10. QUINLAN, J. R. 1996. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence 725-730.
  • 11. QUINLAN, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco.
  • 12. SCHAPIRE, R. E. 1990. The strength of weak learnability. Machine Learning 5 197-227.
  • 13. SCHAPIRE, R. E., FREUND, Y., BARTLETT, P. and LEE, W. S. 1998. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist. 26. To appear.
  • 14. TIBSHIRANI, R. 1996. Bias, variance and prediction error for classification rules. Technical Report, Univ. Toronto.
  • 15. VALIANT, L. G. 1984. A theory of the learnable. Communications of the ACM 27 1134-1142.
  • 16. VAPNIK, V. N. 1982. Estimation of Dependences Based on Empirical Data. Springer, New York.
  • FLORHAM PARK, NEW JERSEY 07932-0971. E-MAIL: yoav@research.att.com; schapire@research.att.com
  • …queries Q_i of the form 1{X_m <= c}, where X_m is one component of a fixed-length feature vector and c is a constant. One could of course use multivariate functions or "transgenerated features" [7]. Suppose we examine some of the queries by constructing a single binary tree T by the usual data-driven induction method: stepwise entropy reduction estimated from a training set L. Since we cannot entertain all possible splits at each node, we exploit a natural partial ordering on the set Q and examine only a tiny fraction of them. Basically we incrementally grow the geometric arrangements as we proceed down the tree. The classifier based on T is then C(Q, L) = arg max_j P(Y = j | T). If the depths of the leaves of T are far smaller than M, then evidently C(Q, L) is not the Bayes classifier. However, for depths on the order of hundreds or thousands we could expect that P(Y = j | T) is close to P(Y = j | Q), the difference in some appropriate norm being a kind of "approximation error." Of course, we cannot actually create or store a tree of such depth, and the best classification rate we obtained with a single tree of average depth around ten was about 90% on test sets similar to the one discussed by Leo Breiman.
  • …the covariances between P(Y = k | T_n) and P(Y = j | T_m) for pairs of trees T_n, T_m, 1 <= n, m <= N. Naturally, small covariances lead to small errors. More generally, if the trees are produced with some sampling mechanism from the population of trees, involving either resampling from L or random restrictions on the queries, then the quantities above can be analyzed by taking expectations relative to the space of trees. In regard to arcing, probably the deterministic reweighting on misclassified examples produces new trees which are quite different from the previous ones. Moreover, the errors induced on data points which were correctly classified by the existing trees are sufficiently randomized to avoid any systematic deterioration. (A small illustrative sketch of averaging per-tree estimates over randomized trees follows the reference lists below.)
  • 1. AMIT, Y. and GEMAN, D. 1994. Randomized inquiries about shape; an application to handwritten digit recognition. Technical Report 401, Univ. Chicago.
  • 2. AMIT, Y. and GEMAN, D. 1997. Shape quantization and recognition with randomized trees. Neural Computation 9 1545-1588.
  • 3. AMIT, Y., GEMAN, D. and JEDYNAK, B. 1998. Efficient focusing and face detection. In Face Recognition: From Theory to Applications (H. Wechsler and J. Phillips, eds.). Springer, Berlin.
  • 4. AMIT, Y., GEMAN, D. and WILDER, K. 1997. Joint induction of shape features and tree classifiers. IEEE Trans. PAMI 19 1300-1306.
  • 5. BREIMAN, L., FRIEDMAN, J., OLSHEN, R. and STONE, C. 1984. Classification and Regression Trees. Wadsworth, Belmont, CA.
  • 6. FREUND, Y. and SCHAPIRE, R. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119-139.
  • 7. FRIEDMAN, J. H. 1973. A recursive partitioning decision rule for nonparametric classification. IEEE Trans. Comput. 26 404-408.
  • 8. GEMAN, S., BIENENSTOCK, E. and DOURSAT, R. 1992. Neural networks and the bias/variance dilemma. Neural Computation 4 1-58.
  • 9. SHLIEN, S. 1990. Multiple binary decision tree classifiers. Pattern Recognition 23 757-763.
  • 10. WILDER, K. 1998. Decision tree algorithms for handwritten digit recognition. Ph.D. dissertation, Univ. Massachusetts, Amherst.
  • CHICAGO, ILLINOIS 60637. E-MAIL: amit@galton.uchicago.edu; UNIVERSITY OF MASSACHUSETTS, AMHERST, MASSACHUSETTS 01003. E-MAIL: geman@math.umass.edu
  • ALI, K. M. and PAZZANI, M. J. 1996. Error reduction through learning multiple descriptions. Machine Learning 24 173-202.
  • CHERKAUER, K. J. 1996. Human expert-level performance on a scientific image analysis task by a system using combined artificial neural networks. In Working Notes of the AAAI Workshop on Integrating Multiple Learned Models (P. Chan, ed.) 15-21. AAAI Press, Menlo Park, CA.
  • DIETTERICH, T. G. and BAKIRI, G. 1995. Solving multiclass learning problems via error-correcting output codes. J. Artificial Intelligence Res. 2 263-286.
  • DIETTERICH, T. G. and KONG, E. B. 1995. Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical Report, Dept. Computer Science, Oregon State Univ., Corvallis, Oregon. Available at ftp://ftp.cs.orst.edu/pub/tgd/papers/tr-bias.ps.gz.
  • FREUND, Y. and SCHAPIRE, R. E. 1996. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning (L. Saitta, ed.) 148-156. Morgan Kaufmann, San Francisco.
  • HASHEM, S. 1993. Optimal linear combinations of neural networks. Ph.D. dissertation, School of Industrial Engineering, Purdue Univ., Lafayette, IN.
  • KONG, E. B. and DIETTERICH, T. G. 1995. Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning (A. Prieditis and S. Russell, eds.) 313-321. Morgan Kaufmann, San Francisco.
  • MACKAY, D. 1992. A practical Bayesian framework for backpropagation networks. Neural Computation 4 448-472.
  • NEAL, R. 1993. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. Computer Science, Univ. Toronto.
  • PERRONE, M. P. and COOPER, L. N. 1993. When networks disagree: ensemble methods for hybrid neural networks. In Neural Networks for Speech and Image Processing (R. J. Mammone, ed.) 126-142. Chapman and Hall, London.
  • QUINLAN, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco.
  • SCHAPIRE, R. E. 1997. Using output codes to boost multiclass learning problems. In Proceedings of the Fourteenth International Conference on Machine Learning 313-321. Morgan Kaufmann, San Francisco.
  • CORVALLIS, OREGON 97331-3202. E-MAIL: tgd@cs.orst.edu
  • AMIT, Y. and GEMAN, D. 1997. Shape quantization and recognition with randomized trees. Neural Computation 9 1545-1588.
  • BREIMAN, L. 1996c. Out-of-bag estimation. Available at ftp.stat.berkeley.edu/users/breiman/OOBestimation.
  • BREIMAN, L. 1997. Prediction games and arcing algorithms. Technical Report 504, Dept. Statistics, Univ. California, Berkeley. Available at www.stat.berkeley.edu.
  • DIETTERICH, T. 1998. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization. Machine Learning 1-22.
  • FREUND, Y. and SCHAPIRE, R. 1996. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference (L. Saitta, ed.) 148-156. Morgan Kaufmann, San Francisco.
  • JI, C. and MA, S. 1997. Combinations of weak classifiers. IEEE Trans. Neural Networks 8 32-42.
  • KONG, E. B. and DIETTERICH, T. G. 1995. Error-correcting output coding corrects bias and variance. In Proceedings of the Twelfth International Conference on Machine Learning (A. Prieditis and S. Russell, eds.) 313-321. Morgan Kaufmann, San Francisco.
  • SCHAPIRE, R., FREUND, Y., BARTLETT, P. and LEE, W. S. 1998. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist. 26. To appear.
  • VAPNIK, V. N. 1995. The Nature of Statistical Learning Theory. Springer, New York.
  • BERKELEY, CALIFORNIA 94720-3860. E-MAIL: leo@stat.berkeley.edu
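
The discussion excerpt above classifies by arg max_j P(Y = j | T) and argues that averaging such per-tree estimates over trees produced by a random mechanism works well when the per-tree estimates have low covariance. The sketch below is one hedged reading of that idea, again using scikit-learn trees: each tree is grown on a perturbed training set with a random restriction on the candidate splits (the `max_features` option stands in for the discussants' randomized queries, which is an assumption, not their procedure), and prediction takes the arg max of the averaged class-posterior estimates.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def randomized_tree_ensemble(X, y, n_trees=25, seed=0):
    """Grow trees under random perturbations and average their posterior estimates."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)              # perturb the training set (bootstrap)
        tree = DecisionTreeClassifier(
            max_features="sqrt",                      # random restriction on candidate splits
            random_state=int(rng.integers(1 << 31)),
        )
        trees.append(tree.fit(X[idx], y[idx]))        # assumes every class appears in each sample

    def predict(X_new):
        # Average P(Y = j | T_n) over the trees; the weaker the dependence between
        # the per-tree estimates, the less variable the average.
        avg_posterior = np.mean([t.predict_proba(X_new) for t in trees], axis=0)
        return trees[0].classes_[avg_posterior.argmax(axis=1)]

    return predict
```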