## The Annals of Statistics

### Optimal bounds for aggregation of affine estimators

Pierre C. Bellec

#### Abstract

We study the problem of aggregation of estimators when the estimators are not independent of the data used for aggregation and no sample splitting is allowed. If the estimators are deterministic vectors, it is well known that the minimax rate of aggregation is of order $\log(M)$, where $M$ is the number of estimators to aggregate. It is proved that for affine estimators, the minimax rate of aggregation is unchanged: it is possible to handle the linear dependence between the affine estimators and the data used for aggregation at no extra cost. The minimax rate is impacted neither by the variance of the affine estimators nor by any other measure of their statistical complexity. The minimax rate is attained with a penalized procedure over the convex hull of the estimators, for a penalty that is inspired by the $Q$-aggregation procedure. The results follow from the interplay between the penalty, strong convexity and concentration.
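To make the aggregation setup concrete, the following is a minimal numerical sketch of a $Q$-aggregation-style criterion for the simpler case of deterministic (frozen) estimators: the weight vector is chosen on the simplex by minimizing a convex combination of the aggregate's residual, the weighted individual residuals, and an entropy penalty. The function name, the solver (exponentiated-gradient mirror descent), and the constants are illustrative assumptions; this is not the paper's exact penalized procedure for affine estimators.

```python
import numpy as np

def q_aggregate(y, F, nu=0.5, temperature=4.0, n_iter=2000, lr=0.01):
    """Minimize a Q-aggregation-style criterion over the simplex (sketch).

    Q(theta) = (1 - nu) * ||y - F theta||^2
             + nu * sum_j theta_j * ||y - f_j||^2
             + temperature * sum_j theta_j * log(M * theta_j)

    y : observation vector of shape (n,).
    F : (n, M) matrix whose columns f_j are the frozen estimators.
    Solved by exponentiated-gradient (mirror) descent, which keeps
    theta positive and normalized to the simplex at every step.
    """
    n, M = F.shape
    losses = np.sum((y[:, None] - F) ** 2, axis=0)  # ||y - f_j||^2 for each j
    theta = np.full(M, 1.0 / M)                     # start at the uniform weights
    for _ in range(n_iter):
        resid = F @ theta - y
        # Gradient of the three terms of Q(theta) with respect to theta.
        grad = (1 - nu) * 2.0 * (F.T @ resid) \
             + nu * losses \
             + temperature * (np.log(M * theta) + 1.0)
        theta = theta * np.exp(-lr * grad)          # multiplicative update
        theta /= theta.sum()                        # project back to the simplex
    return theta

# Usage: three candidate estimators of a mean vector mu (one good, two bad).
rng = np.random.default_rng(0)
mu = np.sin(np.linspace(0, 3, 50))
y = mu + 0.3 * rng.standard_normal(50)
F = np.column_stack([mu, np.zeros(50), 2 * mu])
theta = q_aggregate(y, F)
```

With these (illustrative) constants, the weight vector concentrates on the first column, the candidate closest to the truth, while the entropy term keeps all weights strictly positive.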

#### Article information

Source
Ann. Statist., Volume 46, Number 1 (2018), 30–59.

Dates
Revised: December 2016
First available in Project Euclid: 22 February 2018

https://projecteuclid.org/euclid.aos/1519268423

Digital Object Identifier
doi:10.1214/17-AOS1540

Mathematical Reviews number (MathSciNet)
MR3766945

Zentralblatt MATH identifier
06865104

Subjects
Primary: 62G05: Estimation
Secondary: 62J07: Ridge regression; shrinkage estimators

#### Citation

Bellec, Pierre C. Optimal bounds for aggregation of affine estimators. Ann. Statist. 46 (2018), no. 1, 30–59. doi:10.1214/17-AOS1540. https://projecteuclid.org/euclid.aos/1519268423
