Statistical Science

Boosting Algorithms: Regularization, Prediction and Model Fitting

Peter Bühlmann and Torsten Hothorn


Abstract

We present a statistical perspective on boosting. Special emphasis is given to estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis. Concepts of degrees of freedom and corresponding Akaike or Bayesian information criteria, particularly useful for regularization and variable selection in high-dimensional covariate spaces, are discussed as well.

The practical aspects of boosting procedures for fitting statistical models are illustrated by means of the dedicated open-source software package mboost. This package implements functions which can be used for model fitting, prediction and variable selection. It is flexible, allowing for the implementation of new boosting algorithms optimizing user-specified loss functions.
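As a rough sketch of the workflow the package supports (not code taken from the article itself), the following R snippet fits a linear model by componentwise L2-boosting with glmboost() and chooses the number of boosting iterations by the corrected AIC. It assumes the bodyfat example data shipped with early versions of mboost (later versions provide it through the companion package TH.data), and defaults such as the step length may differ across package versions.

    ## minimal sketch: componentwise L2-boosting for a linear model,
    ## with the stopping iteration selected by the corrected AIC
    library("mboost")
    data("bodyfat", package = "mboost")   # newer versions: package "TH.data"

    ## fit by componentwise L2-boosting; step length nu = 0.1
    bf_glm <- glmboost(DEXfat ~ ., data = bodyfat,
                       control = boost_control(mstop = 500, nu = 0.1))

    ## corrected AIC as a function of the boosting iteration;
    ## mstop() returns the iteration minimizing it
    aic <- AIC(bf_glm, method = "corrected")
    bf_glm <- bf_glm[mstop(aic)]

    coef(bf_glm)                              # selected covariates and coefficients
    predict(bf_glm, newdata = bodyfat[1:5, ]) # predictions for new observations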

Article information

Source
Statist. Sci. Volume 22, Number 4 (2007), 477–505.

Dates
First available in Project Euclid: 7 April 2008

Permanent link to this document
http://projecteuclid.org/euclid.ss/1207580163

Digital Object Identifier
doi:10.1214/07-STS242

Mathematical Reviews number (MathSciNet)
MR2420454

Citation

Bühlmann, Peter; Hothorn, Torsten. Boosting Algorithms: Regularization, Prediction and Model Fitting. Statist. Sci. 22 (2007), no. 4, 477–505. doi:10.1214/07-STS242. http://projecteuclid.org/euclid.ss/1207580163.


