The Annals of Statistics

Generalization bounds for averaged classifiers

Yoav Freund, Yishay Mansour, and Robert E. Schapire

Source: Ann. Statist. Volume 32, Number 4 (2004), 1698-1722.

Abstract

We study a simple learning algorithm for binary classification. Instead of predicting with the best hypothesis in the hypothesis class, that is, the hypothesis that minimizes the training error, our algorithm predicts with a weighted average of all hypotheses, weighted exponentially with respect to their training error. We show that the prediction of this algorithm is much more stable than the prediction of an algorithm that predicts with the best hypothesis. By allowing the algorithm to abstain from predicting on some examples, we show that the predictions it makes when it does not abstain are very reliable. Finally, we show that the probability that the algorithm abstains is comparable to the generalization error of the best hypothesis in the class.
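
As a rough illustration of the idea described above, the following minimal sketch weights every hypothesis exponentially in its training error, forms a normalized weighted vote, and abstains when the vote is too close to zero. It is one plausible reading of the abstract, not the paper's exact prediction rule. The names `averaged_predict`, `stump`, `eta` and `threshold`, the default parameter values, and the toy data are all assumptions made for this example.

```python
import math


def training_error(h, examples):
    """Fraction of labeled examples (x, y) that hypothesis h misclassifies."""
    return sum(1 for x, y in examples if h(x) != y) / len(examples)


def averaged_predict(hypotheses, examples, x, eta=10.0, threshold=0.5):
    """Return +1 or -1, or 0 to abstain.

    eta and threshold are hypothetical tuning parameters chosen for
    illustration; they are not values taken from the paper.
    """
    # Weight each hypothesis exponentially in its training error.
    weights = [math.exp(-eta * training_error(h, examples)) for h in hypotheses]
    # Normalized weighted vote, always in [-1, 1].
    vote = sum(w * h(x) for w, h in zip(weights, hypotheses)) / sum(weights)
    if abs(vote) < threshold:
        return 0  # abstain: the weighted hypotheses disagree too much
    return 1 if vote > 0 else -1


def stump(t):
    """Toy hypothesis (assumed for this example): +1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1


# Toy hypothesis class and training sample, invented for illustration.
hypotheses = [stump(t) for t in (0.0, 1.0, 2.0, 3.0, 4.0)]
examples = [(0.5, -1), (1.5, -1), (2.5, 1), (3.5, 1)]
```

On this toy data, `averaged_predict(hypotheses, examples, 3.5)` returns 1 and `averaged_predict(hypotheses, examples, 0.5)` returns -1. With a flatter weighting, `averaged_predict(hypotheses, examples, 2.0, eta=1.0)` returns 0, because the point lies near the decision boundary and the vote falls below the assumed threshold.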

Primary Subjects: 62C12
Keywords: Classification; ensemble methods; averaging; Bayesian methods; generalization bounds

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aos/1091626184
Digital Object Identifier: doi:10.1214/009053604000000058
Mathematical Reviews number (MathSciNet): MR2089139
Zentralblatt MATH identifier: 1045.62056

2010 © Institute of Mathematical Statistics