The Annals of Statistics

Sharp oracle inequalities for aggregation of affine estimators

Arnak S. Dalalyan and Joseph Salmon



We consider the problem of combining a (possibly uncountably infinite) set of affine estimators in the nonparametric regression model with heteroscedastic Gaussian noise. Focusing on the exponentially weighted aggregate, we prove a PAC-Bayesian-type inequality that leads to sharp oracle inequalities in discrete as well as continuous settings. The framework is general enough to cover combinations of various procedures, such as least squares regression, kernel ridge regression, shrinkage estimators and many other estimators used in the literature on statistical inverse problems. As a consequence, we show that the proposed aggregate provides an adaptive estimator in the exact minimax sense without discretizing the range of tuning parameters or splitting the set of observations. We also illustrate numerically the good performance achieved by the exponentially weighted aggregate.
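The procedure described in the abstract can be illustrated with a minimal sketch: a family of linear (affine with zero intercept) smoothers, an unbiased Stein-type risk estimate for each, and exponential weights proportional to exp(-risk/beta). Everything below is illustrative and not taken from the paper — the cosine-projection family, the noise model, and the temperature choice are assumptions for the toy example; the paper gives the precise conditions on the temperature and the general affine setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy heteroscedastic regression: y = f + noise (illustrative setup)
n = 200
x = np.linspace(0.0, 1.0, n)
f = np.sin(2 * np.pi * x)
sigma = 0.3 + 0.2 * x                  # heteroscedastic noise levels
y = f + sigma * rng.normal(size=n)

def projection_matrix(k):
    """Projection onto the span of the first k cosine basis functions."""
    basis = np.cos(np.pi * np.outer(np.arange(k), x))   # k x n
    q, _ = np.linalg.qr(basis.T)                        # orthonormal columns
    return q @ q.T                                      # n x n projection

# Candidate linear smoothers A_k y, for a grid of dimensions k
ks = [2, 4, 8, 16, 32]
estimates, risks = [], []
for k in ks:
    A = projection_matrix(k)
    fhat = A @ y
    # Stein-type unbiased risk estimate for a linear estimator A y
    # with independent N(0, sigma_i^2) noise:
    #   ||y - A y||^2 + 2 tr(A diag(sigma^2)) - sum(sigma^2)
    ure = (np.sum((y - fhat) ** 2)
           + 2 * np.sum(np.diag(A) * sigma ** 2)
           - np.sum(sigma ** 2))
    estimates.append(fhat)
    risks.append(ure)

# Exponential weights with uniform prior; the temperature here is a
# heuristic on the scale of the noise variance, not the paper's constant.
beta = 8 * np.max(sigma) ** 2
risks = np.array(risks)
w = np.exp(-(risks - risks.min()) / beta)
w /= w.sum()

# Exponentially weighted aggregate
ewa = sum(wi * est for wi, est in zip(w, estimates))

print("weights over k:", dict(zip(ks, np.round(w, 3))))
print("MSE of aggregate:", round(float(np.mean((ewa - f) ** 2)), 4))
```

Note that no tuning parameter k is selected and no data splitting is performed: the aggregate averages all candidates with data-driven weights, which is the mechanism behind the adaptivity claimed in the abstract.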

Article information

Ann. Statist., Volume 40, Number 4 (2012), 2327-2355.

First available in Project Euclid: 23 January 2013


Primary: 62G08: Nonparametric regression
Secondary: 62C20: Minimax procedures; 62G05: Estimation; 62G20: Asymptotic properties

Keywords: aggregation; regression; oracle inequalities; model selection; minimax risk; exponentially weighted aggregation


Dalalyan, Arnak S.; Salmon, Joseph. Sharp oracle inequalities for aggregation of affine estimators. Ann. Statist. 40 (2012), no. 4, 2327--2355. doi:10.1214/12-AOS1038.



  • [1] Alquier, P. and Lounici, K. (2011). PAC-Bayesian bounds for sparse regression estimation with exponential weights. Electron. J. Stat. 5 127–145.
  • [2] Amit, Y. and Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Comput. 9 1545–1588.
  • [3] Arlot, S. and Bach, F. (2009). Data-driven calibration of linear estimators with minimal penalties. In NIPS 46–54. MIT Press, Vancouver.
  • [4] Audibert, J.-Y. (2007). Progressive mixture rules are deviation suboptimal. In NIPS 41–48. MIT Press, Vancouver.
  • [5] Baraud, Y., Giraud, C. and Huet, S. (2010). Estimator selection in the Gaussian setting. Unpublished manuscript.
  • [6] Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301–413.
  • [7] Breiman, L. (1996). Bagging predictors. Mach. Learn. 24 123–140.
  • [8] Buades, A., Coll, B. and Morel, J. M. (2005). A review of image denoising algorithms, with a new one. Multiscale Model. Simul. 4 490–530.
  • [9] Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697.
  • [10] Cai, T. T. (1999). Adaptive wavelet estimation: A block thresholding and oracle inequality approach. Ann. Statist. 27 898–924.
  • [11] Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. Lecture Notes in Math. 1851. Springer, Berlin.
  • [12] Cavalier, L. (2008). Nonparametric statistical inverse problems. Inverse Problems 24 19.
  • [13] Cavalier, L., Golubev, G. K., Picard, D. and Tsybakov, A. B. (2002). Oracle inequalities for inverse problems. Ann. Statist. 30 843–874.
  • [14] Cavalier, L. and Tsybakov, A. (2002). Sharp adaptation for inverse problems with random noise. Probab. Theory Related Fields 123 323–354.
  • [15] Cavalier, L. and Tsybakov, A. B. (2001). Penalized blockwise Stein’s method, monotone oracles and sharp adaptive estimation. Math. Methods Statist. 10 247–282.
  • [16] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge Univ. Press, Cambridge.
  • [17] Dai, D., Rigollet, P. and Zhang, T. (2012). Deviation optimal learning using greedy $Q$-aggregation. Ann. Statist. To appear. Available at arXiv:1203.2507.
  • [18] Dai, D. and Zhang, T. (2011). Greedy model averaging. In NIPS 1242–1250. MIT Press, Granada.
  • [19] Dalalyan, A. S. and Salmon, J. (2011). Competing against the best nearest neighbor filter in regression. In ALT. Lecture Notes in Computer Science 6925 129–143. Springer, Berlin.
  • [20] Dalalyan, A. S. and Salmon, J. (2012). Supplement to “Sharp oracle inequalities for aggregation of affine estimators.” DOI:10.1214/12-AOS1038SUPP.
  • [21] Dalalyan, A. S. and Tsybakov, A. B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. In Learning Theory. Lecture Notes in Computer Science 4539 97–111. Springer, Berlin.
  • [22] Dalalyan, A. S. and Tsybakov, A. B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Mach. Learn. 72 39–61.
  • [23] Dalalyan, A. S. and Tsybakov, A. B. (2012). Sparse regression learning by aggregation and Langevin Monte-Carlo. J. Comput. System Sci. 78 1423–1443.
  • [24] Dalalyan, A. S. and Tsybakov, A. B. (2012). Mirror averaging with sparsity priors. Bernoulli 18 914–944.
  • [25] Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 425–455.
  • [26] Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90 1200–1224.
  • [27] Donoho, D. L., Liu, R. C. and MacGibbon, B. (1990). Minimax risk over hyperrectangles, and implications. Ann. Statist. 18 1416–1437.
  • [28] Efromovich, S. and Pinsker, M. (1996). Sharp-optimal and adaptive estimation for heteroscedastic nonparametric regression. Statist. Sinica 6 925–942.
  • [29] Efroĭmovich, S. Y. and Pinsker, M. S. (1984). A self-training algorithm for nonparametric filtering. Avtomat. i Telemekh. 11 58–65.
  • [30] Freund, Y. (1990). Boosting a weak learning algorithm by majority. In COLT 202–216. Morgan Kaufmann, Rochester.
  • [31] Gaïffas, S. and Lecué, G. (2011). Hyper-sparse optimal aggregation. J. Mach. Learn. Res. 12 1813–1833.
  • [32] George, E. I. (1986). Minimax multiple shrinkage estimation. Ann. Statist. 14 188–205.
  • [33] Gerchinovitz, S. (2011). Sparsity regret bounds for individual sequences in online linear regression. J. Mach. Learn. Res. 19 377–396.
  • [34] Giraud, C. (2008). Mixing least-squares estimators when the variance is unknown. Bernoulli 14 1089–1107.
  • [35] Goldenshluger, A. and Lepski, O. (2008). Universal pointwise selection rule in multivariate function estimation. Bernoulli 14 1150–1190.
  • [36] Golubev, Y. (2010). On universal oracle inequalities related to high-dimensional linear models. Ann. Statist. 38 2751–2780.
  • [37] Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681–712.
  • [38] Juditsky, A. and Nemirovski, A. (2009). Nonparametric denoising of signals with unknown local structure. I. Oracle inequalities. Appl. Comput. Harmon. Anal. 27 157–179.
  • [39] Kivinen, J. and Warmuth, M. K. (1999). Averaging expert predictions. In Computational Learning Theory (Nordkirchen, 1999). Lecture Notes in Computer Science 1572 153–167. Springer, Berlin.
  • [40] Kneip, A. (1994). Ordered linear smoothers. Ann. Statist. 22 835–866.
  • [41] Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L. and Jordan, M. I. (2003/04). Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res. 5 27–72.
  • [42] Langford, J. and Shawe-Taylor, J. (2002). PAC-Bayes & margins. In NIPS 423–430. MIT Press, Vancouver.
  • [43] Lecué, G. and Mendelson, S. (2012). On the optimality of the aggregate with exponential weights for low temperatures. Bernoulli. To appear.
  • [44] Leung, G. (2004). Information theory and mixing least squares regression. Ph.D. thesis, Yale Univ.
  • [45] Leung, G. and Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Trans. Inform. Theory 52 3396–3410.
  • [46] Lounici, K. (2007). Generalized mirror averaging and $D$-convex aggregation. Math. Methods Statist. 16 246–259.
  • [47] McAllester, D. A. (1998). Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (Madison, WI, 1998) 230–234 (electronic). ACM, New York.
  • [48] Nemirovski, A. (2000). Topics in non-parametric statistics. In Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Math. 1738 85–277. Springer, Berlin.
  • [49] Pinsker, M. S. (1980). Optimal filtration of square-integrable signals in Gaussian noise. Probl. Peredachi Inf. 16 52–68.
  • [50] Polzehl, J. and Spokoiny, V. G. (2000). Adaptive weights smoothing with applications to image restoration. J. R. Stat. Soc. Ser. B Stat. Methodol. 62 335–354.
  • [51] Rigollet, P. (2012). Kullback–Leibler aggregation and misspecified generalized linear models. Ann. Statist. 40 639–665.
  • [52] Rigollet, P. and Tsybakov, A. (2011). Exponential screening and optimal rates of sparse estimation. Ann. Statist. 39 731–771.
  • [53] Rigollet, P. and Tsybakov, A. B. (2007). Linear and convex aggregation of density estimators. Math. Methods Statist. 16 260–280.
  • [54] Rigollet, P. and Tsybakov, A. B. (2011). Sparse estimation by exponential weighting. Unpublished manuscript.
  • [55] Salmon, J. and Dalalyan, A. S. (2011). Optimal aggregation of affine estimators. J. Mach. Learn. Res. 19 635–660.
  • [56] Salmon, J. and Le Pennec, E. (2009). NL-Means and aggregation procedures. In ICIP 2977–2980. IEEE, Cairo.
  • [57] Seeger, M. (2003). PAC-Bayesian generalisation error bounds for Gaussian process classification. J. Mach. Learn. Res. 3 233–269.
  • [58] Shawe-Taylor, J. and Cristianini, N. (2000). An Introduction to Support Vector Machines: And Other Kernel-Based Learning Methods. Cambridge Univ. Press, Cambridge.
  • [59] Stein, C. M. (1973). Estimation of the mean of a multivariate distribution. In Proc. Prague Symp. Asymptotic Statist. Charles Univ., Prague.
  • [60] Tsybakov, A. B. (2003). Optimal rates of aggregation. In COLT 303–313. Springer, Washington, DC.
  • [61] Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer, New York.
  • [62] Wang, Z., Paterlini, S., Gao, F. and Yang, Y. (2012). Adaptive minimax estimation over sparse $l_q$-hulls. Technical report. Available at arXiv:1108.1961v4 [math.ST].
  • [63] Yang, Y. (2000). Combining different procedures for adaptive regression. J. Multivariate Anal. 74 135–161.
  • [64] Yang, Y. (2003). Regression with multiple candidate models: Selecting or mixing? Statist. Sinica 13 783–809.
  • [65] Yang, Y. (2004). Aggregating regression procedures to improve performance. Bernoulli 10 25–47.
  • [66] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67.
