## The Annals of Statistics

### Optimal learning with Q-aggregation

#### Abstract

We consider a general supervised learning problem with strongly convex and Lipschitz loss and study the problem of model selection aggregation. In particular, given a finite dictionary of functions (learners) together with a prior, we generalize the results obtained by Dai, Rigollet and Zhang [Ann. Statist. 40 (2012) 1878–1905] for Gaussian regression with squared loss and fixed design to this learning setup. Specifically, we prove that the $Q$-aggregation procedure outputs an estimator that satisfies optimal oracle inequalities both in expectation and with high probability. Our proof techniques depart somewhat from traditional proofs in that most of the standard arguments are carried out on the Laplace transform of the empirical process to be controlled.
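
For orientation, the $Q$-aggregation criterion referred to above can be sketched as follows. The notation is an assumption introduced here for illustration and does not appear on this page: $f_1,\dots,f_M$ is the dictionary, $\pi=(\pi_1,\dots,\pi_M)$ the prior, $R_n$ the empirical risk over $n$ observations, $f_\theta=\sum_{j}\theta_j f_j$ a convex combination with $\theta$ in the simplex $\Lambda$, and $\nu\in(0,1)$, $\beta>0$ are tuning parameters.

$$
\hat\theta \in \operatorname*{argmin}_{\theta\in\Lambda}\Big\{(1-\nu)\,R_n(f_\theta)+\nu\sum_{j=1}^{M}\theta_j R_n(f_j)+\frac{\beta}{n}\sum_{j=1}^{M}\theta_j\log\frac{1}{\pi_j}\Big\}
$$

In this sketch, the first term is the empirical risk of the aggregate, the second is the weighted average of the empirical risks of the individual learners, and the third penalizes weight placed on learners with small prior mass. An oracle inequality for model selection aggregation then compares the risk of $f_{\hat\theta}$ with that of the best single learner in the dictionary, up to a remainder of order $\log(1/\pi_j)/n$.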

#### Article information

Source
Ann. Statist., Volume 42, Number 1 (2014), 211–224.

Dates
First available in Project Euclid: 18 February 2014

Permanent link to this document
https://projecteuclid.org/euclid.aos/1392733186

Digital Object Identifier
doi:10.1214/13-AOS1190

Mathematical Reviews number (MathSciNet)
MR3178462

Zentralblatt MATH identifier
1286.68255

Subjects
Secondary: 62G08: Nonparametric regression; 62G05: Estimation

#### Citation

Lecué, Guillaume; Rigollet, Philippe. Optimal learning with $Q$-aggregation. Ann. Statist. 42 (2014), no. 1, 211–224. doi:10.1214/13-AOS1190. https://projecteuclid.org/euclid.aos/1392733186

#### References

• [1] Alquier, P. and Lounici, K. (2011). PAC-Bayesian bounds for sparse regression estimation with exponential weights. Electron. J. Stat. 5 127–145.
• [2] Audibert, J.-Y. (2007). Progressive mixture rules are deviation suboptimal. In Advances in Neural Information Processing Systems. MIT Press, Cambridge, MA.
• [3] Audibert, J.-Y. (2009). Fast learning rates in statistical inference through aggregation. Ann. Statist. 37 1591–1646.
• [4] Bartlett, P. L., Jordan, M. I. and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. J. Amer. Statist. Assoc. 101 138–156.
• [5] Boucheron, S., Lugosi, G. and Massart, P. (2012). Concentration Inequalities with Applications. Clarendon Press, Oxford.
• [6] Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697.
• [7] Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. Lecture Notes in Math. 1851. Springer, Berlin. Lecture notes from the 31st Summer School on Probability Theory held in Saint-Flour, July 8–25, 2001.
• [8] Catoni, O. (2007). PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics Lecture Notes—Monograph Series 56. IMS, Beachwood, OH.
• [9] Dai, D., Rigollet, P. and Zhang, T. (2012). Deviation optimal learning using greedy $Q$-aggregation. Ann. Statist. 40 1878–1905.
• [10] Dalalyan, A. S., Ingster, Y. and Tsybakov, A. (2014). Statistical inference in compound functional models. Probab. Theory Related Fields. To appear.
• [11] Dalalyan, A. S. and Salmon, J. (2012). Sharp oracle inequalities for aggregation of affine estimators. Ann. Statist. 40 2327–2355.
• [12] Dalalyan, A. S. and Tsybakov, A. B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. In Learning Theory. Lecture Notes in Computer Science 4539 97–111. Springer, Berlin.
• [13] Dalalyan, A. S. and Tsybakov, A. B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Mach. Learn. 72 39–61.
• [14] Dalalyan, A. S. and Tsybakov, A. B. (2012). Mirror averaging with sparsity priors. Bernoulli 18 914–944.
• [15] Dalalyan, A. S. and Tsybakov, A. B. (2012). Sparse regression learning by aggregation and Langevin Monte-Carlo. J. Comput. System Sci. 78 1423–1443.
• [16] Emery, M., Nemirovski, A. and Voiculescu, D. (2000). Lectures on Probability Theory and Statistics. Lecture Notes in Math. 1738. Springer, Berlin.
• [17] Hiriart-Urruty, J.-B. and Lemaréchal, C. (2001). Fundamentals of Convex Analysis. Grundlehren Text Editions. Springer, Berlin. Abridged version of Convex Analysis and Minimization Algorithms. I [Springer, Berlin, 1993; MR1261420 (95m:90001)] and II [ibid.; MR1295240 (95m:90002)].
• [18] Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681–712.
• [19] Juditsky, A., Rigollet, P. and Tsybakov, A. B. (2008). Learning by mirror averaging. Ann. Statist. 36 2183–2206.
• [20] Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in Math. 2033. Springer, Heidelberg.
• [21] Lecué, G. (2007). Optimal rates of aggregation in classification under low noise assumption. Bernoulli 13 1000–1022.
• [22] Lecué, G. (2007). Suboptimality of penalized empirical risk minimization in classification. In Learning Theory. Lecture Notes in Computer Science 4539 142–156. Springer, Berlin.
• [23] Lecué, G. and Mendelson, S. (2009). Aggregation via empirical risk minimization. Probab. Theory Related Fields 145 591–613.
• [24] Lecué, G. and Mendelson, S. (2010). Sharper lower bounds on the performance of the empirical risk minimization algorithm. Bernoulli 16 605–613.
• [25] Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Ergebnisse der Mathematik und Ihrer Grenzgebiete (3) [Results in Mathematics and Related Areas (3)] 23. Springer, Berlin.
• [26] Lee, W. S., Bartlett, P. L. and Williamson, R. C. (1996). The importance of convexity in learning with squared loss. In Proceedings of the Ninth Annual Conference on Computational Learning Theory 140–146. ACM Press, New York.
• [27] Rigollet, P. (2012). Kullback–Leibler aggregation and misspecified generalized linear models. Ann. Statist. 40 639–665.
• [28] Rigollet, P. and Tsybakov, A. (2011). Exponential screening and optimal rates of sparse estimation. Ann. Statist. 39 731–771.
• [29] Rigollet, P. and Tsybakov, A. B. (2012). Sparse estimation by exponential weighting. Statist. Sci. 27 558–575.
• [30] Tsybakov, A. B. (2003). Optimal rate of aggregation. In Computational Learning Theory and Kernel Machines (COLT-2003). Lecture Notes in Artificial Intelligence 2777 303–313. Springer, Heidelberg.
• [31] Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166.
• [32] Yang, Y. (2000). Combining different procedures for adaptive regression. J. Multivariate Anal. 74 135–161.
• [33] Yang, Y. (2000). Mixing strategies for density estimation. Ann. Statist. 28 75–87.