The Annals of Statistics

Fast learning rates in statistical inference through aggregation

Jean-Yves Audibert
Source: Ann. Statist. Volume 37, Number 4 (2009), 1591-1646.

Abstract

We develop minimax optimal risk bounds for the general learning task of predicting as well as the best function in a reference set $\mathcal{G}$, up to the smallest possible additive term, called the convergence rate. When the reference set is finite and $n$ denotes the size of the training data, we provide minimax convergence rates of the form $C(\frac{\log|\mathcal{G}|}{n})^{v}$, with tight evaluation of the positive constant $C$ and with exact $0<v\le 1$, the latter value depending on the convexity of the loss function and on the level of noise in the output distribution.
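
As a reading aid, the rate statement above can be written as a single display. The symbols $R$ (the risk of a prediction function), $\hat g$ (the function returned by the learning procedure) and the expectation over the training sample are notational assumptions made here for illustration; the abstract itself only fixes $\mathcal{G}$, $n$, $C$ and $v$.

```latex
\[
\mathbb{E}\, R(\hat g) \;-\; \min_{g \in \mathcal{G}} R(g)
\;\le\; C \left( \frac{\log |\mathcal{G}|}{n} \right)^{v},
\qquad 0 < v \le 1 .
\]
```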

The risk upper bounds are based on a sequential randomized algorithm, which at each step concentrates on functions having both low risk and low variance with respect to the previous step prediction function. Our analysis puts forward the links between the probabilistic and worst-case viewpoints, and allows one to obtain risk bounds unachievable with the standard statistical learning approach. One of the key ideas of this work is to use probabilistic inequalities with respect to appropriate (Gibbs) distributions on the prediction function space instead of using them with respect to the distribution generating the data.
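
To make the role of Gibbs distributions over a finite reference set concrete, here is a minimal Python sketch of a classical progressive mixture with exponential (Gibbs) weights followed by a random draw. It is not the algorithm of the paper: it omits the variance term relative to the previous step prediction function that the abstract describes. The function names, the temperature parameter `lam` and the toy data are all assumptions introduced for illustration.

```python
import math
import random


def gibbs_weights(cum_losses, lam):
    """Gibbs distribution: weight of g proportional to exp(-lam * cumulative loss of g)."""
    m = min(cum_losses)  # subtract the minimum for numerical stability
    w = [math.exp(-lam * (c - m)) for c in cum_losses]
    total = sum(w)
    return [x / total for x in w]


def progressive_gibbs_mixture(functions, data, loss, lam):
    """Average the Gibbs weights computed on the first i examples, i = 0..n."""
    cum = [0.0] * len(functions)   # cumulative empirical loss of each function
    avg = [0.0] * len(functions)   # running average of the Gibbs weights
    steps = len(data) + 1
    for i in range(steps):
        w = gibbs_weights(cum, lam)
        avg = [a + wi / steps for a, wi in zip(avg, w)]
        if i < len(data):
            x, y = data[i]
            cum = [c + loss(g(x), y) for c, g in zip(cum, functions)]
    return avg


def draw_function(functions, weights, rng):
    """Randomized prediction: draw one function according to the averaged weights."""
    return rng.choices(functions, weights=weights, k=1)[0]


# Toy usage (hypothetical data): two candidate functions, squared loss.
candidates = [lambda x: 0.0, lambda x: x]
sample = [(float(x), float(x)) for x in range(1, 6)]
squared = lambda pred, y: (pred - y) ** 2
weights = progressive_gibbs_mixture(candidates, sample, squared, lam=1.0)
chosen = draw_function(candidates, weights, random.Random(0))
```

In this toy run the second candidate fits the data exactly, so the averaged weights put most of their mass on it; the first step still contributes uniform weights, which is why the mass on the first candidate is not zero.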

The risk lower bounds are based on refinements of the Assouad lemma that take particular account of the properties of the loss function. Our key example to illustrate the upper and lower bounds is the $L_q$-regression setting, for which an exhaustive analysis of the convergence rates is given as $q$ ranges over $[1, +\infty)$.

Primary Subjects: 62G08
Secondary Subjects: 62H05, 68T10
Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.aos/1245332827
Digital Object Identifier: doi:10.1214/08-AOS623
Zentralblatt MATH identifier: 05582005
Mathematical Reviews number (MathSciNet): MR2533466


2012 © Institute of Mathematical Statistics
