The Annals of Statistics

Deviation optimal learning using greedy $Q$-aggregation

Dong Dai, Philippe Rigollet, and Tong Zhang


Given a finite family of functions, the goal of model selection aggregation is to construct a procedure that mimics the function from this family that is the closest to an unknown regression function. More precisely, we consider a general regression model with fixed design and measure the distance between functions by the mean squared error at the design points. While procedures based on exponential weights are known to solve the problem of model selection aggregation in expectation, they are, surprisingly, sub-optimal in deviation. We propose a new formulation called $Q$-aggregation that addresses this limitation; namely, its solution leads to sharp oracle inequalities that are optimal in a minimax sense. Moreover, based on the new formulation, we design greedy $Q$-aggregation procedures that produce sparse aggregation models achieving the optimal rate. The convergence and performance of these greedy procedures are illustrated and compared with other standard methods on simulated examples.
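To make the abstract concrete, here is a minimal sketch of a greedy aggregation step in the spirit the paper describes: the weight vector lives on the simplex, each iteration moves toward a single vertex (so the aggregate stays sparse), and the criterion being minimized is a Q-type penalized least-squares objective. The exact criterion Q, the parameter `nu`, the step size, and the initialization below are illustrative assumptions, not the paper's precise procedure.

```python
import numpy as np

def greedy_q_aggregation(Y, F, nu=0.5, steps=10):
    """Greedy (Frank-Wolfe-style) minimization over the simplex of a
    Q-type criterion.  Illustrative sketch only: nu, the step size and
    the stopping rule are assumptions, not the paper's exact recipe.

    Y : (n,) observed responses at the design points
    F : (n, M) matrix whose columns are the candidate functions
        evaluated at the design points
    """
    n, M = F.shape

    def q_value(lam):
        f_lam = F @ lam
        # Q(lam) = |Y - f_lam|^2/n + nu * sum_j lam_j |f_j - f_lam|^2/n
        return (np.sum((Y - f_lam) ** 2)
                + nu * lam @ np.sum((F - f_lam[:, None]) ** 2, axis=0)) / n

    # start from the best single candidate (a 1-sparse vertex)
    lam = np.zeros(M)
    lam[np.argmin(np.sum((Y - F.T) ** 2, axis=1))] = 1.0

    for k in range(2, steps + 2):
        alpha = 2.0 / (k + 1)              # standard Frank-Wolfe step size
        best_j, best_q = None, np.inf
        for j in range(M):                 # try moving toward each vertex
            trial = (1 - alpha) * lam
            trial[j] += alpha
            q = q_value(trial)
            if q < best_q:
                best_q, best_j = q, j
        lam = (1 - alpha) * lam
        lam[best_j] += alpha               # at most one new nonzero per step
    return lam
```

Because each iteration mixes the current weights with a single vertex of the simplex, after k steps the aggregate is supported on at most k + 1 candidates; this is the sense in which greedy procedures produce sparse aggregation models.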

Article information

Ann. Statist., Volume 40, Number 3 (2012), 1878-1905.

First available in Project Euclid: 16 October 2012


Primary: 62G08 (Nonparametric regression)
Secondary: 90C52 (Methods of reduced gradient type); 62G05 (Estimation); 62G20 (Asymptotic properties)

Keywords: Regression; model selection; model averaging; greedy algorithm; exponential weights; oracle inequalities; deviation bounds; lower bounds; deviation suboptimality


Dai, Dong; Rigollet, Philippe; Zhang, Tong. Deviation optimal learning using greedy $Q$-aggregation. Ann. Statist. 40 (2012), no. 3, 1878--1905. doi:10.1214/12-AOS1025.


References

  • Audibert, J.-Y. (2004). Aggregated estimators and empirical complexity for least square regression. Ann. Inst. Henri Poincaré Probab. Stat. 40 685–736.
  • Audibert, J.-Y. (2008). Progressive mixture rules are deviation suboptimal. In Advances in Neural Information Processing Systems 20 (J. C. Platt, D. Koller, Y. Singer and S. Roweis, eds.) 41–48. MIT Press, Cambridge, MA.
  • Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39 930–945.
  • Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2 183–202.
  • Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge Univ. Press, Cambridge.
  • Catoni, O. (1999). Universal aggregation rules with exact bias bounds. Technical report, Laboratoire de Probabilités et Modèles Aléatoires, Paris.
  • Clarkson, K. L. (2008). Coresets, sparse greedy approximation, and the Frank–Wolfe algorithm. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms 922–931. ACM, New York.
  • Dai, D. and Zhang, T. (2011). Greedy model averaging. In Advances in Neural Information Processing Systems 24 (J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira and K. Q. Weinberger, eds.) 1242–1250. MIT Press, Cambridge, MA.
  • Dalalyan, A. S. and Salmon, J. (2011). Sharp oracle inequalities for aggregation of affine estimators. Available at arXiv:1104.3969.
  • Dalalyan, A. S. and Tsybakov, A. B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. In Learning Theory. Lecture Notes in Computer Science 4539 97–111. Springer, Berlin.
  • Dalalyan, A. and Tsybakov, A. B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning 72 39–61.
  • Frank, M. and Wolfe, P. (1956). An algorithm for quadratic programming. Naval Res. Logist. Quart. 3 95–110.
  • Gaïffas, S. and Lecué, G. (2011). Hyper-sparse optimal aggregation. J. Mach. Learn. Res. 12 1813–1833.
  • Jaggi, M. (2011). Convex optimization without projection steps. Available at arXiv:1108.1170v6.
  • Jones, L. K. (1992). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist. 20 608–613.
  • Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681–712.
  • Juditsky, A., Rigollet, P. and Tsybakov, A. B. (2008). Learning by mirror averaging. Ann. Statist. 36 2183–2206.
  • Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Ann. Statist. 28 1302–1338.
  • Lecué, G. and Mendelson, S. (2009). Aggregation via empirical risk minimization. Probab. Theory Related Fields 145 591–613.
  • Lecué, G. and Mendelson, S. (2012). On the optimality of the aggregate with exponential weights for low temperatures. Bernoulli. To appear.
  • Nemirovski, A. (2000). Topics in non-parametric statistics. In Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Math. 1738 85–277. Springer, Berlin.
  • Rigollet, P. (2012). Kullback–Leibler aggregation and misspecified generalized linear models. Ann. Statist. 40 639–665.
  • Rigollet, P. and Tsybakov, A. (2011). Exponential screening and optimal rates of sparse estimation. Ann. Statist. 39 731–771.
  • Rigollet, P. and Tsybakov, A. (2012). Sparse estimation by exponential weighting. Statist. Sci. To appear. Available at arXiv:1108.5116.
  • Shalev-Shwartz, S., Srebro, N. and Zhang, T. (2010). Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM J. Optim. 20 2807–2832.
  • Tsybakov, A. B. (2003). Optimal rates of aggregation. In COLT (B. Schölkopf and M. K. Warmuth, eds.). Lecture Notes in Computer Science 2777 303–313. Springer, Berlin.
  • Yang, Y. (1999). Model selection for nonparametric regression. Statist. Sinica 9 475–499.