The Annals of Statistics

Additive logistic regression: a statistical view of boosting (With discussion and a rejoinder by the authors)

Jerome Friedman, Trevor Hastie, and Robert Tibshirani

Full-text: Open access

Abstract

Boosting is one of the most important recent developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. We show that this seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multiclass generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multiclass generalizations of boosting in most situations, and far superior in some. We suggest a minor modification to boosting that can reduce computation, often by factors of 10 to 50. Finally, we apply these insights to produce an alternative formulation of boosting decision trees. This approach, based on best-first truncated tree induction, often leads to better performance, and can provide interpretable descriptions of the aggregate decision rule. It is also much faster computationally, making it more suitable to large-scale data mining applications.
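
To make the abstract's description concrete, the following is a minimal sketch of the reweighting-and-weighted-vote scheme in one standard Discrete AdaBoost formulation, with decision stumps as the base classifier. It is an illustration only, assuming NumPy and scikit-learn are available; the function names, the clipping constant, and the use of sklearn stumps are choices made here, not code from the paper.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    # Discrete AdaBoost sketch: X is an (n_samples, n_features) array,
    # y holds class labels coded as -1 or +1.
    n = len(y)
    w = np.full(n, 1.0 / n)                       # observation weights, initially uniform
    ensemble = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)          # fit the weak learner to the reweighted data
        pred = stump.predict(X)
        err = np.clip(w @ (pred != y), 1e-12, 1 - 1e-12)  # weighted training error
        alpha = 0.5 * np.log((1.0 - err) / err)   # vote weight for this round's classifier
        w *= np.exp(-alpha * y * pred)            # up-weight the misclassified observations
        w /= w.sum()
        ensemble.append((stump, alpha))
    return ensemble

def adaboost_predict(ensemble, X):
    # Weighted majority vote over the sequence of fitted classifiers.
    score = sum(alpha * clf.predict(X) for clf, alpha in ensemble)
    return np.where(score >= 0, 1, -1)

The paper's central observation is that this procedure fits an additive model F(x) = sum_m alpha_m f_m(x) by stagewise minimization of the criterion E[exp(-y F(x))], whose population minimizer is one half the log-odds of P(y = 1 | x); the LogitBoost algorithm developed in the paper instead maximizes the Bernoulli log-likelihood on the same additive logistic scale.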

Article information

Source
Ann. Statist. Volume 28, Number 2 (2000), 337-407.

Dates
First available in Project Euclid: 15 March 2002

Permanent link to this document
http://projecteuclid.org/euclid.aos/1016218223

Mathematical Reviews number (MathSciNet)
MR1790002

Digital Object Identifier
doi:10.1214/aos/1016218223

Zentralblatt MATH identifier
01828945

Subjects
Primary: 62G05: Estimation; 62G07: Density estimation; 68T10: Pattern recognition, speech recognition {For cluster analysis, see 62H30}; 68T05: Learning and adaptive systems [See also 68Q32, 91E40]

Keywords
classification tree; nonparametric estimation; stagewise fitting; machine learning

Citation

Friedman, Jerome; Hastie, Trevor; Tibshirani, Robert. Additive logistic regression: a statistical view of boosting (With discussion and a rejoinder by the authors). The Annals of Statistics 28 (2000), no. 2, 337--407. doi:10.1214/aos/1016218223. http://projecteuclid.org/euclid.aos/1016218223.



References

  • Breiman, L. (1996). Bagging predictors. Machine Learning 24 123-140.
  • Breiman, L. (1997). Prediction games and arcing algorithms. Technical Report 504, Dept. Statistics, Univ. California, Berkeley.
  • Breiman, L. (1998a). Arcing classifiers (with discussion). Ann. Statist. 26 801-849.
  • Breiman, L. (1998b). Combining predictors. Technical report, Dept. Statistics, Univ. California, Berkeley.
  • Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
  • Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive models (with discussion). Ann. Statist. 17 453-555.
  • Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Inform. Theory 13 21-27.
  • Dietterich, T. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning 40 139-158.
  • Freund, Y. (1995). Boosting a weak learning algorithm by majority. Inform. and Comput. 121 256-285.
  • Freund, Y. and Schapire, R. (1996a). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory 325-332.
  • Freund, Y. and Schapire, R. E. (1996b). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148-156. Morgan Kaufmann, San Francisco.
  • Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119-139.
  • Friedman, J. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist. 19 1-141.
  • Friedman, J. (1996). Another approach to polychotomous classification. Technical report, Stanford Univ.
  • Friedman, J. (1999). Greedy function approximation: the gradient boosting machine. Technical report, Stanford Univ.
  • Friedman, J. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Statist. Assoc. 76 817-823.
  • Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, London.
  • Hastie, T. and Tibshirani, R. (1998). Classification by pairwise coupling. Ann. Statist. 26 451-471.
  • Hastie, T., Tibshirani, R. and Buja, A. (1994). Flexible discriminant analysis by optimal scoring. J. Amer. Statist. Assoc. 89 1255-1270.
  • Holte, R. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning 11 63-90.
  • Huber, P. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35 73-101.
  • Kearns, M. and Vazirani, U. (1994). An Introduction to Computational Learning Theory. MIT Press.
  • Mallat, S. and Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Processing 41 3397-3415.
  • McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, London.
  • Schapire, R. (1997). Using output codes to boost multiclass learning problems. In Proceedings of the Fourteenth International Conference on Machine Learning 313-321. Morgan Kaufmann, San Francisco.
  • Schapire, R. E. (1990). The strength of weak learnability. Machine Learning 5 197-227.
  • Schapire, R. E. and Singer, Y. (1998). Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory.
  • Schapire, R., Freund, Y., Bartlett, P. and Lee, W. (1998). Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651-1686.
  • Valiant, L. G. (1984). A theory of the learnable. Comm. ACM 27 1134-1142.
  • Amit, Y. and Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Comput. 9 1545-1588.
  • Breiman, L. (1999a). Random forests. Technical report, available at www.stat.berkeley.edu.
  • Breiman, L. (1999b). Prediction games and arcing algorithms. Neural Comput. 11 1493-1517.
  • Wheway, V. (1999). Variance reduction trends on 'boosted' classifiers. Available from virg@cse.unsw.edu.au.
  • Bühlmann, P. and Yu, B. (2000). Explaining bagging. Preprint.
  • Gill, P. E., Murray, W. and Wright, M. H. (1981). Practical Optimization. Academic Press, New York.
  • Breiman, L. (1998). Arcing classifiers. Ann. Statist. 26 801-849.
  • Freund, Y. (1999). An adaptive version of the boost by majority algorithm. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory.
  • Freund, Y. and Mason, L. (1999). The alternating decision tree learning algorithm. In Machine Learning: Proceedings of the Sixteenth International Conference 124-133.
  • Bauer, E. and Kohavi, R. (1999). An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning 36 105-139.
  • Breiman, L. (1999). Using adaptive bagging to debias regressions. Technical Report 547, Dept. Statistics, Univ. California, Berkeley.
  • Chipman, H., George, E. and McCulloch, R. (1998). Bayesian CART model search (with discussion). J. Amer. Statist. Assoc. 93 935-960.
  • Denison, D. (2000). Boosting with Bayesian stumps. Technical report, Dept. Mathematics, Imperial College.
  • Denison, D., Mallick, B. and Smith, A. (1996). Bayesian CART. Technical report, Dept. Mathematics, Imperial College.
  • DiMatteo, I., Genovese, C. and Kass, R. (1999). Bayesian curve fitting with free-knot splines. Technical report, Carnegie Mellon Univ.
  • Drucker, H. and Cortes, C. (1996). Boosting decision trees. In Advances in Neural Information Processing Systems 8 479-485. MIT Press.
  • Elkan, C. (1997). Boosting and naive Bayes learning. Technical Report CS97-557, Univ. California, San Diego.
  • Hastie, T. and Tibshirani, R. (1998). Bayesian backfitting. Technical report, Stanford Univ.
  • Heikkinen, J. (1998). Curve and surface estimation using dynamic step functions. In Practical Nonparametric and Semiparametric Bayesian Statistics (D. Dey, P. Müller and D. Sinha, eds.) 255-272. Springer, New York.
  • Lee, J. (1999). A computer program that plays a hunch. New York Times August 17.
  • Quinlan, J. (1996). Bagging, boosting, and C4.5. In Proceedings Thirteenth American Association for Artificial Intelligence National Conference on Artificial Intelligence 725-730. AAAI Press, Menlo Park, CA.
  • Ridgeway, G. (1999). The state of boosting. In Computing Science and Statistics 31 (K. Berk and M. Pourahmadi, eds.) 172-181. Interface Foundation of North America, Fairfax, VA.
  • Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Ann. Statist. 19 1-141.
  • Friedman, J. H. (1999a). Greedy function approximation: a gradient boosting machine. Ann. Statist. To appear.
  • Friedman, J. H. (1999b). Stochastic gradient boosting. Technical report, Dept. Statistics, Stanford Univ.
  • Friedman, J. H. and Hall, P. (1999). On bagging and nonlinear estimation. J. Comput. Graph. Statist. To appear.
  • Grove, A. and Schuurmans, D. (1998). Boosting in the limit: maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence.
  • Quinlan, J. (1996). Boosting first order learning. In Proceedings of the Seventh International Workshop on Algorithmic Learning Theory (S. Arikawa and A. Sharma, eds.) Lecture Notes in Artificial Intelligence 1160 143-155. Springer, Berlin.
  • Rätsch, G. (1998). Ensemble learning methods for classification. Master's thesis, Dept. Computer Science, Univ. Potsdam.
  • Rätsch, G., Onoda, T. and Müller, K.-R. (2000). Soft margins for AdaBoost. Machine Learning 1-35.