## The Annals of Statistics

### Convergence rates of least squares regression estimators with heavy-tailed errors

#### Abstract

We study the performance of the least squares estimator (LSE) in a general nonparametric regression model, when the errors are independent of the covariates but may only have a $p$th moment ($p\geq1$). In such a heavy-tailed regression setting, we show that if the model satisfies a standard “entropy condition” with exponent $\alpha\in(0,2)$, then the $L_{2}$ loss of the LSE converges at a rate

$\mathcal{O}_{\mathbf{P}}\bigl(n^{-\frac{1}{2+\alpha}}\vee n^{-\frac{1}{2}+\frac{1}{2p}}\bigr).$ Such a rate cannot be improved under the entropy condition alone.

This rate quantifies both positive and negative aspects of the LSE in a heavy-tailed regression setting. On the positive side, as long as the errors have $p\geq1+2/\alpha$ moments, the $L_{2}$ loss of the LSE converges at the same rate as if the errors were Gaussian. On the negative side, if $p<1+2/\alpha$, there are (many) hard models at any entropy level $\alpha$ for which the $L_{2}$ loss of the LSE converges at a strictly slower rate than that of other robust estimators.
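The threshold $p=1+2/\alpha$ comes from equating the two exponents in the rate: the entropy-driven term $n^{-1/(2+\alpha)}$ dominates exactly when $1/(2+\alpha)\leq 1/2-1/(2p)$, which rearranges to $p\geq(2+\alpha)/\alpha=1+2/\alpha$. The following short sketch (not from the paper; the function name is our own) evaluates both exponents and reports the effective rate:

```python
def lse_rate_exponent(alpha, p):
    """Exponent r such that the L2 loss of the LSE decays like n^{-r}.

    The rate from the abstract is n^{-1/(2+alpha)} v n^{-(1/2 - 1/(2p))},
    so the effective exponent is the smaller of the two candidates.
    """
    entropy_term = 1.0 / (2.0 + alpha)        # rate driven by the entropy exponent
    moment_term = 0.5 - 1.0 / (2.0 * p)       # rate driven by the pth moment
    return min(entropy_term, moment_term)

# Threshold: for p >= 1 + 2/alpha the Gaussian-type rate prevails.
alpha = 1.0
print(lse_rate_exponent(alpha, p=3.0))  # p = 1 + 2/alpha: both terms are ~1/3
print(lse_rate_exponent(alpha, p=2.0))  # below threshold: slower exponent 1/4
```

For instance, with entropy exponent $\alpha=1$ the LSE needs $p\geq3$ moments to attain the Gaussian-case rate $n^{-1/3}$; with only $p=2$ moments the guaranteed rate degrades to $n^{-1/4}$.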

The validity of the above rate relies crucially on the independence of the covariates and the errors. In fact, the $L_{2}$ loss of the LSE can converge arbitrarily slowly when this independence fails.

The key technical ingredient is a new multiplier inequality that gives sharp bounds for the “multiplier empirical process” associated with the LSE. We further give an application to the sparse linear regression model with heavy-tailed covariates and errors to demonstrate the scope of this new inequality.

#### Article information

**Source:** Ann. Statist., Volume 47, Number 4 (2019), 2286–2319.

**Dates:** Revised: May 2018. First available in Project Euclid: 21 May 2019.

https://projecteuclid.org/euclid.aos/1558425646

**Digital Object Identifier:** doi:10.1214/18-AOS1748

**Mathematical Reviews number (MathSciNet):** MR3953452

**Subjects:** Primary: 60E15 (Inequalities; stochastic orderings). Secondary: 62G05 (Estimation).

#### Citation

Han, Qiyang; Wellner, Jon A. Convergence rates of least squares regression estimators with heavy-tailed errors. Ann. Statist. 47 (2019), no. 4, 2286--2319. doi:10.1214/18-AOS1748. https://projecteuclid.org/euclid.aos/1558425646

#### References

• [1] Alexander, K. S. (1985). The nonexistence of a universal multiplier moment for the central limit theorem. In Probability in Banach Spaces, V (Medford, Mass., 1984). Lecture Notes in Math. 1153 15–16. Springer, Berlin.
• [2] Andersen, N. T., Giné, E. and Zinn, J. (1988). The central limit theorem for empirical processes under local conditions: The case of Radon infinitely divisible limits without Gaussian component. Trans. Amer. Math. Soc. 308 603–635.
• [3] Arias-Castro, E., Donoho, D. L. and Huo, X. (2005). Near-optimal detection of geometric objects by fast multiscale methods. IEEE Trans. Inform. Theory 51 2402–2425.
• [4] Audibert, J.-Y. and Catoni, O. (2011). Robust linear least squares regression. Ann. Statist. 39 2766–2794.
• [5] Bai, Z. D., Silverstein, J. W. and Yin, Y. Q. (1988). A note on the largest eigenvalue of a large-dimensional sample covariance matrix. J. Multivariate Anal. 26 166–168.
• [6] Bai, Z. D. and Yin, Y. Q. (1993). Limit of the smallest eigenvalue of a large-dimensional sample covariance matrix. Ann. Probab. 21 1275–1294.
• [7] Baraud, Y., Birgé, L. and Sart, M. (2017). A new method for estimation and model selection: $\rho$-estimation. Invent. Math. 207 425–517.
• [8] Bartlett, P. L., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497–1537.
• [9] Bartlett, P. L. and Mendelson, S. (2006). Empirical minimization. Probab. Theory Related Fields 135 311–334.
• [10] Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields 97 113–150.
• [11] Boysen, L., Kempe, A., Liebscher, V., Munk, A. and Wittich, O. (2009). Consistencies and rates of convergence of jump-penalized least squares estimators. Ann. Statist. 37 157–183.
• [12] Brownlees, C., Joly, E. and Lugosi, G. (2015). Empirical risk minimization for heavy-tailed losses. Ann. Statist. 43 2507–2536.
• [13] Bubeck, S., Cesa-Bianchi, N. and Lugosi, G. (2013). Bandits with heavy tail. IEEE Trans. Inform. Theory 59 7711–7717.
• [14] Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg.
• [15] Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study. Ann. Inst. Henri Poincaré Probab. Stat. 48 1148–1185.
• [16] Catoni, O. (2016). PAC-Bayesian bounds for the Gram matrix and least squares regression with a random design. Preprint. Available at arXiv:1603.05229.
• [17] Chernozhukov, V., Chetverikov, D. and Kato, K. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann. Statist. 41 2786–2819.
• [18] Devroye, L., Lerasle, M., Lugosi, G. and Oliveira, R. I. (2016). Sub-Gaussian mean estimators. Ann. Statist. 44 2695–2725.
• [19] Dudley, R. M. (2014). Uniform Central Limit Theorems, 2nd ed. Cambridge Studies in Advanced Mathematics 142. Cambridge Univ. Press, New York.
• [20] Gao, X. and Huang, J. (2010). Asymptotic analysis of high-dimensional LAD regression with Lasso. Statist. Sinica 20 1485–1506.
• [21] Giné, E. and Koltchinskii, V. (2006). Concentration inequalities and asymptotic results for ratio type empirical processes. Ann. Probab. 34 1143–1216.
• [22] Giné, E., Latała, R. and Zinn, J. (2000). Exponential and moment inequalities for $U$-statistics. In High Dimensional Probability, II (Seattle, WA, 1999). Progress in Probability 47 13–38. Birkhäuser, Boston, MA.
• [23] Giné, E. and Nickl, R. (2016). Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics 40. Cambridge Univ. Press, New York.
• [24] Giné, E. and Zinn, J. (1984). Some limit theorems for empirical processes. Ann. Probab. 12 929–998.
• [25] Giné, E. and Zinn, J. (1986). Lectures on the central limit theorem for empirical processes. In Probability and Banach Spaces (Zaragoza, 1985). Lecture Notes in Math. 1221 50–113. Springer, Berlin.
• [26] Han, Q. and Wellner, J. A. (2019). Supplement to “Convergence rates of least squares regression estimators with heavy-tailed errors.” DOI:10.1214/18-AOS1748SUPP.
• [27] Han, Q. and Wellner, J. A. (2018). Robustness of shape-restricted regression estimators: An envelope perspective. Preprint. Available at arXiv:1805.02542.
• [28] Hsu, D. and Sabato, S. (2016). Loss minimization and parameter estimation with heavy tails. J. Mach. Learn. Res. 17 Paper No. 18, 40.
• [29] Joly, E., Lugosi, G. and Oliveira, R. I. (2017). On the estimation of the mean of a random vector. Electron. J. Stat. 11 440–451.
• [30] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593–2656.
• [31] Koltchinskii, V. and Panchenko, D. (2000). Rademacher processes and bounding the risk of function learning. In High Dimensional Probability, II (Seattle, WA, 1999). Progress in Probability 47 443–457. Birkhäuser, Boston, MA.
• [32] Korostelëv, A. P. and Tsybakov, A. B. (1992). Asymptotically minimax image reconstruction problems. In Topics in Nonparametric Estimation. Adv. Soviet Math. 12 45–86. Amer. Math. Soc., Providence, RI.
• [33] Korostelëv, A. P. and Tsybakov, A. B. (1993). Minimax Theory of Image Reconstruction. Lecture Notes in Statistics 82. Springer, New York.
• [34] Lecué, G. and Mendelson, S. (2017). Sparse recovery under weak moment assumptions. J. Eur. Math. Soc. (JEMS) 19 881–904.
• [35] Lecué, G. and Mendelson, S. (2018). Regularization and the small-ball method I: Sparse recovery. Ann. Statist. 46 611–641.
• [36] Ledoux, M. and Talagrand, M. (1986). Conditions d’intégrabilité pour les multiplicateurs dans le TLC banachique. Ann. Probab. 14 916–921.
• [37] Ledoux, M. and Talagrand, M. (2011). Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin.
• [38] Lugosi, G. and Mendelson, S. (2018). Regularization, sparse recovery, and median-of-means tournaments. Bernoulli. To appear.
• [39] Lugosi, G. and Mendelson, S. (2018). Risk minimization by median-of-means tournaments. J. Eur. Math. Soc. (JEMS). To appear.
• [40] Lugosi, G. and Mendelson, S. (2019). Sub-Gaussian estimators of the mean of a random vector. Ann. Statist. 47 783–794.
• [41] Mason, D. M. (1983). The asymptotic distribution of weighted empirical distribution functions. Stochastic Process. Appl. 15 99–109.
• [42] Massart, P. and Nédélec, É. (2006). Risk bounds for statistical learning. Ann. Statist. 34 2326–2366.
• [43] Massart, P. and Rio, E. (1998). A uniform Marcinkiewicz–Zygmund strong law of large numbers for empirical processes. In Asymptotic Methods in Probability and Statistics (Ottawa, ON, 1997) 199–211. North-Holland, Amsterdam.
• [44] Mendelson, S. (2015). Learning without concentration. J. ACM 62 Art. 21, 25.
• [45] Mendelson, S. (2016). Upper bounds on product and multiplier empirical processes. Stochastic Process. Appl. 126 3652–3680.
• [46] Mendelson, S. (2017). “Local” vs. “global” parameters—breaking the Gaussian complexity barrier. Ann. Statist. 45 1835–1862.
• [47] Mendelson, S. (2017). On aggregation for heavy-tailed classes. Probab. Theory Related Fields 168 641–674.
• [48] Mendelson, S. (2017). Extending the small-ball method. Preprint. Available at arXiv:1709.00843.
• [49] Mendelson, S. (2017). On multiplier processes under weak moment assumptions. In Geometric Aspects of Functional Analysis. Lecture Notes in Math. 2169 301–318. Springer, Cham.
• [50] Minsker, S. (2015). Geometric median and robust estimation in Banach spaces. Bernoulli 21 2308–2335.
• [51] Nickl, R. and van de Geer, S. (2013). Confidence sets in sparse regression. Ann. Statist. 41 2852–2876.
• [52] Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory 7 186–199.
• [53] Portnoy, S. and Koenker, R. (1997). The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators. Statist. Sci. 12 279–300.
• [54] Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell_{q}$-balls. IEEE Trans. Inform. Theory 57 6976–6994.
• [55] Shorack, G. R. and Wellner, J. A. (2009). Empirical Processes with Applications to Statistics. Classics in Applied Mathematics 59. SIAM, Philadelphia, PA.
• [56] Sivakumar, V., Banerjee, A. and Ravikumar, P. K. (2015). Beyond sub-Gaussian measurements: High-dimensional structured estimation with sub-exponential designs. In Advances in Neural Information Processing Systems 2206–2214.
• [57] Strassen, V. and Dudley, R. M. (1969). The central limit theorem and $\varepsilon$-entropy. In Probability and Information Theory (Proc. Internat. Sympos., McMaster Univ., Hamilton, Ont., 1968) 224–231. Springer, Berlin.
• [58] Talagrand, M. (2014). Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems. Ergebnisse der Mathematik und Ihrer Grenzgebiete. 3. Folge. A Series of Modern Surveys in Mathematics 60. Springer, Heidelberg.
• [59] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
• [60] van de Geer, S. (1987). A new approach to least-squares estimation, with applications. Ann. Statist. 15 587–602.
• [61] van de Geer, S. (1990). Estimating a regression function. Ann. Statist. 18 907–924.
• [62] van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
• [63] van de Geer, S. and Wegkamp, M. (1996). Consistency for the least squares estimator in nonparametric regression. Ann. Statist. 24 2513–2523.
• [64] van de Geer, S. A. (2000). Applications of Empirical Process Theory. Cambridge Series in Statistical and Probabilistic Mathematics 6. Cambridge Univ. Press, Cambridge.
• [65] van der Vaart, A. and Wellner, J. A. (2011). A local maximal inequality under uniform entropy. Electron. J. Stat. 5 192–203.
• [66] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
• [67] Yang, Y. (2001). Nonparametric regression with dependent errors. Bernoulli 7 633–655.
• [68] Yang, Y. and Barron, A. (1999). Information-theoretic determination of minimax rates of convergence. Ann. Statist. 27 1564–1599.
• [69] Yao, Q. W. (1993). Tests for change-points with epidemic alternatives. Biometrika 80 179–191.
• [70] Zhang, C.-H. (2002). Risk bounds in isotonic regression. Ann. Statist. 30 528–555.

#### Supplemental materials

• Supplement: Additional proofs. In the supplement [26], we provide detailed proofs for (i) Theorem 4, (ii) the impossibility results Propositions 1 and 3 and (iii) all remaining lemmas.