The Annals of Statistics

Fast learning rates in statistical inference through aggregation

Jean-Yves Audibert


Abstract

We develop minimax optimal risk bounds for the general learning task of predicting as well as the best function in a reference set $\mathcal{G}$, up to the smallest possible additive term, called the convergence rate. When the reference set is finite and $n$ denotes the size of the training data, we provide minimax convergence rates of the form $C\left(\frac{\log|\mathcal{G}|}{n}\right)^{v}$ with a tight evaluation of the positive constant $C$ and with the exact exponent $0<v\leq 1$, the latter depending on the convexity of the loss function and on the level of noise in the output distribution.
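In display form, the bound described above reads as follows, where $R$ denotes the risk and $\hat{g}_n$ the prediction function built from the $n$ training examples; this is a paraphrase of the abstract rather than the paper's exact statement:

    \[
      \mathbb{E}\, R(\hat{g}_n) \;\le\; \min_{g \in \mathcal{G}} R(g) \;+\; C \left( \frac{\log |\mathcal{G}|}{n} \right)^{v},
      \qquad 0 < v \le 1 .
    \]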

The risk upper bounds are based on a sequential randomized algorithm which, at each step, concentrates on functions having both low risk and low variance with respect to the prediction function of the previous step. Our analysis puts forward the links between the probabilistic and worst-case viewpoints, and makes it possible to obtain risk bounds that are unachievable with the standard statistical learning approach. One of the key ideas of this work is to use probabilistic inequalities with respect to appropriate (Gibbs) distributions on the space of prediction functions, instead of using them with respect to the distribution generating the data.
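As a rough illustration of the kind of procedure involved, here is a minimal sketch of sequential exponentially weighted (Gibbs) aggregation over a finite reference set. It is a simplification written under our own assumptions (the function names, the vectorized loss argument and the inverse temperature eta are ours), not the paper's exact algorithm, which in addition favors functions with low variance relative to the previous step's prediction.

    import numpy as np

    def gibbs_aggregate(predictors, X, Y, loss, eta=1.0):
        """Sequential Gibbs (exponentially weighted) aggregation over a
        finite reference set of prediction functions -- illustrative sketch."""
        cum_loss = np.zeros(len(predictors))   # cumulative loss of each reference function
        weights_history = []
        for t in range(len(Y)):
            # Gibbs distribution on the reference set given the data seen so far
            w = np.exp(-eta * (cum_loss - cum_loss.min()))
            w /= w.sum()
            weights_history.append(w)
            # observe the t-th example and update the cumulative losses
            preds_t = np.array([g(X[t]) for g in predictors])
            cum_loss += loss(preds_t, Y[t])
        # final aggregate: average of the Gibbs mixtures over all steps
        w_bar = np.mean(weights_history, axis=0)
        return lambda x: float(np.dot(w_bar, [g(x) for g in predictors]))

For instance, with predictors = [lambda x: 0.0, lambda x: x] and loss = lambda p, y: (p - y) ** 2, the returned function is a convex combination of the reference predictions. Averaging the Gibbs mixtures over the successive steps, as in the last lines, is the progressive-mixture idea common in the aggregation literature; the choice of eta governs the trade-off between concentrating on the empirically best function and keeping the mixture spread out.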

The risk lower bounds are based on refinements of the Assouad lemma that take into account, in particular, the properties of the loss function. Our key example illustrating the upper and lower bounds is the $L_q$-regression setting, for which an exhaustive analysis of the convergence rates is given as $q$ ranges over $[1,+\infty)$.
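For concreteness, the $L_q$-regression risk in this setting can be written as follows (a standard definition, stated here as an assumption about the notation):

    \[
      R(g) \;=\; \mathbb{E}\, |Y - g(X)|^{q}, \qquad q \in [1, +\infty),
    \]

the goal being to make the excess risk $\mathbb{E}\,R(\hat{g}_n) - \min_{g\in\mathcal{G}} R(g)$ as small as possible.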

Article information

Source
Ann. Statist., Volume 37, Number 4 (2009), 1591-1646.

Dates
First available in Project Euclid: 18 June 2009

Permanent link to this document
https://projecteuclid.org/euclid.aos/1245332827

Digital Object Identifier
doi:10.1214/08-AOS623

Mathematical Reviews number (MathSciNet)
MR2533466

Zentralblatt MATH identifier
1360.62167

Subjects
Primary: 62G08: Nonparametric regression
Secondary: 62H05: Characterization and structure theory 68T10: Pattern recognition, speech recognition {For cluster analysis, see 62H30}

Keywords
Statistical learning; fast rates of convergence; aggregation; $L_q$-regression; lower bounds in VC-classes; excess risk; convex loss; minimax lower bounds

Citation

Audibert, Jean-Yves. Fast learning rates in statistical inference through aggregation. Ann. Statist. 37 (2009), no. 4, 1591--1646. doi:10.1214/08-AOS623. https://projecteuclid.org/euclid.aos/1245332827


