The Annals of Statistics

Boosting with early stopping: Convergence and consistency

Tong Zhang and Bin Yu

Abstract

Boosting is one of the most significant advances in machine learning for classification and regression. In its original and computationally flexible version, boosting seeks to empirically minimize a loss function in a greedy fashion. The resulting estimator takes an additive function form and is built iteratively by applying a base estimator (or learner) to updated samples that depend on the previous iterations. An unusual regularization technique, early stopping, is employed based on cross-validation (CV) or a test set.
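
As a concrete illustration of this loop (a minimal sketch, not the paper's algorithm: the squared loss, the decision-stump dictionary, the fixed small step size and the patience rule below are all assumptions made for the example), greedy boosting with early stopping on a held-out set can be written as follows.

import numpy as np

def make_stumps(X):
    """Enumerate a small dictionary of stump basis functions
    h(x) = s * sign(x[j] - t) over features j, thresholds t and signs s."""
    stumps = []
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            for s in (-1.0, 1.0):
                stumps.append((j, t, s))
    return stumps

def eval_stump(stump, X):
    j, t, s = stump
    return s * np.sign(X[:, j] - t)

def boost_early_stop(X, y, X_val, y_val, step=0.1, max_iter=500, patience=20):
    """Greedy stagewise least-squares boosting over the stump dictionary;
    stop once the held-out loss has not improved for `patience` steps."""
    stumps = make_stumps(X)
    F = np.zeros(len(y))           # additive fit on the training sample
    F_val = np.zeros(len(y_val))   # same fit evaluated on the held-out sample
    best_val, best_iter, since_best = np.inf, 0, 0
    for it in range(1, max_iter + 1):
        resid = y - F              # negative gradient of the squared loss
        # greedy step: pick the basis function most correlated with the residual
        scores = [abs(eval_stump(st, X) @ resid) for st in stumps]
        chosen = stumps[int(np.argmax(scores))]
        F += step * eval_stump(chosen, X)        # restricted (small) step size
        F_val += step * eval_stump(chosen, X_val)
        val_loss = np.mean((y_val - F_val) ** 2)
        if val_loss < best_val:
            best_val, best_iter, since_best = val_loss, it, 0
        else:
            since_best += 1
            if since_best >= patience:           # early stopping
                break
    return best_iter, best_val

The patience rule is only a practical stand-in for the early-stopping strategies studied in the paper, which tie the stopping time to the sample size.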

This paper studies the numerical convergence, consistency and statistical rates of convergence of boosting with early stopping, when it is carried out over the linear span of a family of basis functions. For general loss functions, we prove the convergence of boosting’s greedy optimization to the infimum of the loss function over the linear span. Using the numerical convergence result, we find early-stopping strategies under which boosting is shown to be consistent based on i.i.d. samples, and we obtain bounds on the rates of convergence for boosting estimators. Simulation studies are also presented to illustrate the relevance of our theoretical results for providing insight into practical aspects of boosting.
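
In symbols, with notation chosen here for illustration rather than taken from the paper, let $S$ be the family of basis functions, $\phi$ the loss, $\hat L_n(f) = n^{-1}\sum_{i=1}^n \phi(f(X_i), Y_i)$ the empirical risk and $L(f) = \mathbb{E}\,\phi(f(X), Y)$ the population risk. The numerical convergence result concerns the first display below; consistency of the early-stopped estimator is the second:

\[
\lim_{k \to \infty} \hat L_n(\hat f_k) \;=\; \inf_{f \in \operatorname{span}(S)} \hat L_n(f),
\qquad
L(\hat f_{\hat k_n}) \;\to\; \inf_{f \in \operatorname{span}(S)} L(f) \quad \text{in probability as } n \to \infty,
\]

where $\hat f_k$ denotes the boosting estimator after $k$ greedy steps and $\hat k_n$ a data-dependent stopping time.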

As a side product, these results also reveal the importance of restricting the greedy search step-sizes, as known in practice through the work of Friedman and others. Moreover, our results lead to a rigorous proof that for a linearly separable problem, AdaBoost with ɛ→0 step-size becomes an L1-margin maximizer when left to run to convergence.
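
For reference, the $L_1$-margin statement can be written as follows (the notation is illustrative, not the paper's): for a dictionary $\{h_j\}$ and data $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, the $L_1$-margin of a weight vector $w$ and the optimal margin are

\[
\gamma_1(w) \;=\; \min_{1 \le i \le n} \frac{y_i \sum_j w_j h_j(x_i)}{\|w\|_1},
\qquad
\gamma_1^{*} \;=\; \sup_{w \neq 0} \gamma_1(w),
\]

and the result says that on linearly separable data ($\gamma_1^{*} > 0$), the normalized AdaBoost combination with step size ɛ→0, when left to run to convergence, attains margin $\gamma_1^{*}$.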

Article information

Source
Ann. Statist. Volume 33, Number 4 (2005), 1538–1579.

Dates
First available in Project Euclid: 5 August 2005

Permanent link to this document
https://projecteuclid.org/euclid.aos/1123250222

Digital Object Identifier
doi:10.1214/009053605000000255

Mathematical Reviews number (MathSciNet)
MR2166555

Zentralblatt MATH identifier
1078.62038

Subjects
Primary: 62G05: Estimation; 62G08: Nonparametric regression

Keywords
Boosting; greedy optimization; matching pursuit; early stopping; consistency

Citation

Zhang, Tong; Yu, Bin. Boosting with early stopping: Convergence and consistency. Ann. Statist. 33 (2005), no. 4, 1538–1579. doi:10.1214/009053605000000255. https://projecteuclid.org/euclid.aos/1123250222


References

  • Barron, A. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39 930–945.
  • Bartlett, P. L., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497–1537.
  • Bartlett, P. L., Jordan, M. and McAuliffe, J. (2005). Convexity, classification, and risk bounds. J. Amer. Statist. Assoc. To appear.
  • Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 3 463–482.
  • Blanchard, G., Lugosi, G. and Vayatis, N. (2004). On the rate of convergence of regularized boosting classifiers. J. Mach. Learn. Res. 4 861–894.
  • Bousquet, O., Koltchinskii, V. and Panchenko, D. (2002). Some local measures of complexity of convex hulls and generalization bounds. Computational Learning Theory. Lecture Notes in Artificial Intelligence 2375 59–73. Springer, Berlin.
  • Breiman, L. (1998). Arcing classifiers (with discussion). Ann. Statist. 26 801–849.
  • Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation 11 1493–1517.
  • Breiman, L. (2004). Population theory for boosting ensembles. Ann. Statist. 32 1–11.
  • Bühlmann, P. (2002). Consistency for $L_2$ boosting and matching pursuit with trees and tree-type basis functions. Technical report, ETH Zürich.
  • Bühlmann, P. and Yu, B. (2003). Boosting with the $L_2$ loss: Regression and classification. J. Amer. Statist. Assoc. 98 324–339.
  • Collins, M., Schapire, R. E. and Singer, Y. (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning 48 253–285.
  • Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119–139.
  • Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Ann. Statist. 29 1189–1232.
  • Friedman, J. H., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion). Ann. Statist. 28 337–407.
  • Grove, A. and Schuurmans, D. (1998). Boosting in the limit: Maximizing the margin of learned ensembles. In Proc. Fifteenth National Conference on Artificial Intelligence 692–699. AAAI Press, Menlo Park, CA.
  • Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman and Hall, London.
  • Hastie, T., Tibshirani, R. and Friedman, J. H. (2001). The Elements of Statistical Learning. Springer, New York.
  • Jiang, W. (2004). Process consistency for AdaBoost. Ann. Statist. 32 13–29.
  • Jones, L. (1992). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist. 20 608–613.
  • Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30 1–50.
  • Koltchinskii, V. and Panchenko, D. (2005). Complexities of convex combinations and bounding the generalization error in classification. Ann. Statist. 33 1455–1496.
  • Koltchinskii, V., Panchenko, D. and Lozano, F. (2001). Further explanation of the effectiveness of voting methods: The game between margins and weights. Computational Learning Theory. Lecture Notes in Artificial Intelligence 2111 241–255. Springer, Berlin.
  • Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin.
  • Lee, W., Bartlett, P. and Williamson, R. (1996). Efficient agnostic learning of neural networks with bounded fan-in. IEEE Trans. Inform. Theory 42 2118–2132.
  • Leshno, M., Lin, Y. V., Pinkus, A. and Schocken, S. (1993). Multilayer feedforward networks with a non-polynomial activation function can approximate any function. Neural Networks 6 861–867.
  • Li, F. and Yang, Y. (2003). A loss function analysis for classification methods in text categorization. In Proc. 20th International Conference on Machine Learning 2 472–479. AAAI Press, Menlo Park, CA.
  • Lugosi, G. and Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods. Ann. Statist. 32 30–55.
  • Mallat, S. and Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41 3397–3415.
  • Mannor, S., Meir, R. and Zhang, T. (2003). Greedy algorithms for classification – consistency, convergence rates, and adaptivity. J. Mach. Learn. Res. 4 713–742.
  • Mason, L., Baxter, J., Bartlett, P. L. and Frean, M. (2000). Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers (A. J. Smola, P. L. Bartlett, B. Schölkopf and D. Schuurmans, eds.) 221–246. MIT Press.
  • Meir, R. and Zhang, T. (2003). Generalization error bounds for Bayesian mixture algorithms. J. Mach. Learn. Res. 4 839–860.
  • Schapire, R., Freund, Y., Bartlett, P. L. and Lee, W. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651–1686.
  • Schapire, R. and Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning 37 297–336.
  • van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. With Applications to Statistics. Springer, New York.
  • Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.
  • Zhang, T. (2003). Sequential greedy approximation for certain convex optimization problems. IEEE Trans. Inform. Theory 49 682–691.
  • Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist. 32 56–85.
  • Zhang, T. and Oles, F. J. (2001). Text categorization based on regularized linear classification methods. Information Retrieval 4 5–31.