We study a simple learning algorithm for binary classification. Instead of predicting with the best hypothesis in the hypothesis class, that is, the hypothesis that minimizes the training error, our algorithm predicts with a weighted average of all hypotheses, weighted exponentially with respect to their training error. We show that the prediction of this algorithm is much more stable than the prediction of an algorithm that predicts with the best hypothesis. By allowing the algorithm to abstain from predicting on some examples, we show that the predictions it makes when it does not abstain are very reliable. Finally, we show that the probability that the algorithm abstains is comparable to the generalization error of the best hypothesis in the class.
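The averaging-with-abstention idea described above can be illustrated with a minimal sketch. It assumes a small finite hypothesis class with labels in {-1, +1}. The names `eta` (how sharply weights decay with training error) and `threshold` (the abstention margin), along with the normalized-vote abstention rule, are illustrative assumptions and not the paper's exact definitions.

```python
import math
from typing import Callable, List, Sequence

Hypothesis = Callable[[float], int]  # maps an input to a label in {-1, +1}


def training_error(h: Hypothesis, xs: Sequence[float], ys: Sequence[int]) -> float:
    """Fraction of training examples that h misclassifies."""
    return sum(1 for x, y in zip(xs, ys) if h(x) != y) / len(xs)


def averaged_predict(hypotheses: List[Hypothesis], xs: Sequence[float],
                     ys: Sequence[int], x: float,
                     eta: float = 10.0, threshold: float = 0.2) -> int:
    """Return +1 or -1, or 0 to abstain.

    Each hypothesis gets weight exp(-eta * training_error). The prediction is
    the sign of the normalized weighted vote; if the vote's magnitude is at
    most `threshold` (an assumed abstention rule), the predictor abstains.
    """
    errors = [training_error(h, xs, ys) for h in hypotheses]
    best = min(errors)
    # Subtracting the best error rescales all weights equally (numerical safety).
    weights = [math.exp(-eta * (e - best)) for e in errors]
    score = sum(w * h(x) for w, h in zip(weights, hypotheses)) / sum(weights)
    if abs(score) <= threshold:
        return 0  # abstain: the well-fitting hypotheses disagree on x
    return 1 if score > 0 else -1


def make_stump(t: float) -> Hypothesis:
    """Hypothetical threshold classifier: +1 if x >= t, else -1."""
    return lambda x: 1 if x >= t else -1


# Toy data: four stumps (t = 1, 2, 3, 4) fit the two training points perfectly.
hypotheses = [make_stump(t) for t in (0.0, 1.0, 2.0, 3.0, 4.0, 5.0)]
xs, ys = [0.5, 4.5], [-1, 1]

for query in (0.7, 2.5, 4.2):
    print(query, averaged_predict(hypotheses, xs, ys, query))
```

In this toy setup the predictor commits near the training points, where almost all of the weight agrees, and abstains at 2.5, where the hypotheses with zero training error split evenly. A predictor that picks a single best hypothesis would have to break that tie arbitrarily.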