The Annals of Statistics

Convergence rates of least squares regression estimators with heavy-tailed errors

Qiyang Han and Jon A. Wellner



We study the performance of the least squares estimator (LSE) in a general nonparametric regression model, when the errors are independent of the covariates but may only have a $p$th moment ($p\geq1$). In such a heavy-tailed regression setting, we show that if the model satisfies a standard “entropy condition” with exponent $\alpha\in(0,2)$, then the $L_{2}$ loss of the LSE converges at a rate

\[\mathcal{O}_{\mathbf{P}}\bigl(n^{-\frac{1}{2+\alpha}}\vee n^{-\frac{1}{2}+\frac{1}{2p}}\bigr).\]

Such a rate cannot be improved under the entropy condition alone.
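As a rough numerical illustration of the rate stated in the abstract, the sketch below computes the exponent $r$ for which the bound reads $n^{-r}$, using the fact that the maximum of $n^{-a}$ and $n^{-b}$ equals $n^{-\min(a,b)}$. The function name `lse_rate_exponent` and its argument checks are assumptions made for this example and do not come from the paper.

```python
def lse_rate_exponent(alpha: float, p: float) -> float:
    """Return r such that the rate n^{-1/(2+alpha)} v n^{-1/2+1/(2p)} equals n^{-r}.

    alpha: entropy exponent, assumed to lie in (0, 2) as in the abstract.
    p: number of finite moments of the errors, assumed p >= 1.
    (Hypothetical helper written for illustration only.)
    """
    if not 0.0 < alpha < 2.0:
        raise ValueError("alpha must lie in (0, 2)")
    if p < 1.0:
        raise ValueError("p must be at least 1")
    entropy_term = 1.0 / (2.0 + alpha)      # exponent driven by model complexity
    moment_term = 0.5 - 1.0 / (2.0 * p)     # exponent driven by the error tails
    # The slower of the two rates dominates, i.e. the smaller exponent.
    return min(entropy_term, moment_term)


if __name__ == "__main__":
    for p in (1.0, 2.0, 3.0, 10.0):
        print(p, lse_rate_exponent(alpha=1.0, p=p))
```

With $\alpha=1$, the moment term governs the exponent for small $p$, and the entropy term $1/(2+\alpha)$ takes over once $p$ is large enough.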

This rate quantifies both positive and negative aspects of the LSE in a heavy-tailed regression setting. On the positive side, as long as the errors have $p\geq1+2/\alpha$ moments, the $L_{2}$ loss of the LSE converges at the same rate as if the errors were Gaussian. On the negative side, if $p<1+2/\alpha$, there are (many) hard models at any entropy level $\alpha$ for which the $L_{2}$ loss of the LSE converges at a strictly slower rate than that of other, robust estimators.
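The threshold $p=1+2/\alpha$ can be read off by equating the two exponents in the rate; the short calculation below is a worked check of that statement and is not taken from the paper itself:

\[\frac{1}{2+\alpha}=\frac{1}{2}-\frac{1}{2p}\quad\Longleftrightarrow\quad\frac{1}{2p}=\frac{\alpha}{2(2+\alpha)}\quad\Longleftrightarrow\quad p=\frac{2+\alpha}{\alpha}=1+\frac{2}{\alpha}.\]

For $p\geq1+2/\alpha$ the term $n^{-\frac{1}{2+\alpha}}$ is the larger of the two, so the rate matches the Gaussian-error case; for $p<1+2/\alpha$ the term $n^{-\frac{1}{2}+\frac{1}{2p}}$ dominates. For example, with $\alpha=1$ the threshold is $p=3$.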

The validity of the above rate relies crucially on the independence of the covariates and the errors. In fact, the $L_{2}$ loss of the LSE can converge arbitrarily slowly when the independence fails.

The key technical ingredient is a new multiplier inequality that gives sharp bounds for the “multiplier empirical process” associated with the LSE. We further give an application to the sparse linear regression model with heavy-tailed covariates and errors to demonstrate the scope of this new inequality.

Article information

Ann. Statist., Volume 47, Number 4 (2019), 2286-2319.

Received: February 2018
Revised: May 2018
First available in Project Euclid: 21 May 2019


Primary: 60E15: Inequalities; stochastic orderings
Secondary: 62G05: Estimation

Keywords: Multiplier empirical process; multiplier inequality; nonparametric regression; least squares estimation; sparse linear regression; heavy-tailed errors


Han, Qiyang; Wellner, Jon A. Convergence rates of least squares regression estimators with heavy-tailed errors. Ann. Statist. 47 (2019), no. 4, 2286--2319. doi:10.1214/18-AOS1748.



  • [1] Alexander, K. S. (1985). The nonexistence of a universal multiplier moment for the central limit theorem. In Probability in Banach Spaces, V (Medford, Mass., 1984). Lecture Notes in Math. 1153 15–16. Springer, Berlin.
  • [2] Andersen, N. T., Giné, E. and Zinn, J. (1988). The central limit theorem for empirical processes under local conditions: The case of Radon infinitely divisible limits without Gaussian component. Trans. Amer. Math. Soc. 308 603–635.
  • [3] Arias-Castro, E., Donoho, D. L. and Huo, X. (2005). Near-optimal detection of geometric objects by fast multiscale methods. IEEE Trans. Inform. Theory 51 2402–2425.
  • [4] Audibert, J.-Y. and Catoni, O. (2011). Robust linear least squares regression. Ann. Statist. 39 2766–2794.
  • [5] Bai, Z. D., Silverstein, J. W. and Yin, Y. Q. (1988). A note on the largest eigenvalue of a large-dimensional sample covariance matrix. J. Multivariate Anal. 26 166–168.
  • [6] Bai, Z. D. and Yin, Y. Q. (1993). Limit of the smallest eigenvalue of a large-dimensional sample covariance matrix. Ann. Probab. 21 1275–1294.
  • [7] Baraud, Y., Birgé, L. and Sart, M. (2017). A new method for estimation and model selection: $\rho$-estimation. Invent. Math. 207 425–517.
  • [8] Bartlett, P. L., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497–1537.
  • [9] Bartlett, P. L. and Mendelson, S. (2006). Empirical minimization. Probab. Theory Related Fields 135 311–334.
  • [10] Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields 97 113–150.
  • [11] Boysen, L., Kempe, A., Liebscher, V., Munk, A. and Wittich, O. (2009). Consistencies and rates of convergence of jump-penalized least squares estimators. Ann. Statist. 37 157–183.
  • [12] Brownlees, C., Joly, E. and Lugosi, G. (2015). Empirical risk minimization for heavy-tailed losses. Ann. Statist. 43 2507–2536.
  • [13] Bubeck, S., Cesa-Bianchi, N. and Lugosi, G. (2013). Bandits with heavy tail. IEEE Trans. Inform. Theory 59 7711–7717.
  • [14] Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg.
  • [15] Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study. Ann. Inst. Henri Poincaré Probab. Stat. 48 1148–1185.
  • [16] Catoni, O. (2016). PAC-Bayesian bounds for the Gram matrix and least squares regression with a random design. Preprint. Available at arXiv:1603.05229.
  • [17] Chernozhukov, V., Chetverikov, D. and Kato, K. (2013). Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann. Statist. 41 2786–2819.
  • [18] Devroye, L., Lerasle, M., Lugosi, G. and Oliveira, R. I. (2016). Sub-Gaussian mean estimators. Ann. Statist. 44 2695–2725.
  • [19] Dudley, R. M. (2014). Uniform Central Limit Theorems, 2nd ed. Cambridge Studies in Advanced Mathematics 142. Cambridge Univ. Press, New York.
  • [20] Gao, X. and Huang, J. (2010). Asymptotic analysis of high-dimensional LAD regression with Lasso. Statist. Sinica 20 1485–1506.
  • [21] Giné, E. and Koltchinskii, V. (2006). Concentration inequalities and asymptotic results for ratio type empirical processes. Ann. Probab. 34 1143–1216.
  • [22] Giné, E., Latała, R. and Zinn, J. (2000). Exponential and moment inequalities for $U$-statistics. In High Dimensional Probability, II (Seattle, WA, 1999). Progress in Probability 47 13–38. Birkhäuser, Boston, MA.
  • [23] Giné, E. and Nickl, R. (2016). Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics 40. Cambridge Univ. Press, New York.
  • [24] Giné, E. and Zinn, J. (1984). Some limit theorems for empirical processes. Ann. Probab. 12 929–998.
  • [25] Giné, E. and Zinn, J. (1986). Lectures on the central limit theorem for empirical processes. In Probability and Banach Spaces (Zaragoza, 1985). Lecture Notes in Math. 1221 50–113. Springer, Berlin.
  • [26] Han, Q. and Wellner, J. A. (2019). Supplement to “Convergence rates of least squares regression estimators with heavy-tailed errors.” DOI:10.1214/18-AOS1748SUPP.
  • [27] Han, Q. and Wellner, J. A. (2018). Robustness of shape-restricted regression estimators: An envelope perspective. Preprint. Available at arXiv:1805.02542.
  • [28] Hsu, D. and Sabato, S. (2016). Loss minimization and parameter estimation with heavy tails. J. Mach. Learn. Res. 17 Paper No. 18, 40.
  • [29] Joly, E., Lugosi, G. and Oliveira, R. I. (2017). On the estimation of the mean of a random vector. Electron. J. Stat. 11 440–451.
  • [30] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593–2656.
  • [31] Koltchinskii, V. and Panchenko, D. (2000). Rademacher processes and bounding the risk of function learning. In High Dimensional Probability, II (Seattle, WA, 1999). Progress in Probability 47 443–457. Birkhäuser, Boston, MA.
  • [32] Korostelëv, A. P. and Tsybakov, A. B. (1992). Asymptotically minimax image reconstruction problems. In Topics in Nonparametric Estimation. Adv. Soviet Math. 12 45–86. Amer. Math. Soc., Providence, RI.
  • [33] Korostelëv, A. P. and Tsybakov, A. B. (1993). Minimax Theory of Image Reconstruction. Lecture Notes in Statistics 82. Springer, New York.
  • [34] Lecué, G. and Mendelson, S. (2017). Sparse recovery under weak moment assumptions. J. Eur. Math. Soc. (JEMS) 19 881–904.
  • [35] Lecué, G. and Mendelson, S. (2018). Regularization and the small-ball method I: Sparse recovery. Ann. Statist. 46 611–641.
  • [36] Ledoux, M. and Talagrand, M. (1986). Conditions d’intégrabilité pour les multiplicateurs dans le TLC banachique. Ann. Probab. 14 916–921.
  • [37] Ledoux, M. and Talagrand, M. (2011). Probability in Banach Spaces: Isoperimetry and Processes. Springer, Berlin.
  • [38] Lugosi, G. and Mendelson, S. (2018). Regularization, sparse recovery, and median-of-means tournaments. Bernoulli. To appear.
  • [39] Lugosi, G. and Mendelson, S. (2018). Risk minimization by median-of-means tournaments. J. Eur. Math. Soc. (JEMS). To appear.
  • [40] Lugosi, G. and Mendelson, S. (2019). Sub-Gaussian estimators of the mean of a random vector. Ann. Statist. 47 783–794.
  • [41] Mason, D. M. (1983). The asymptotic distribution of weighted empirical distribution functions. Stochastic Process. Appl. 15 99–109.
  • [42] Massart, P. and Nédélec, É. (2006). Risk bounds for statistical learning. Ann. Statist. 34 2326–2366.
  • [43] Massart, P. and Rio, E. (1998). A uniform Marcinkiewicz–Zygmund strong law of large numbers for empirical processes. In Asymptotic Methods in Probability and Statistics (Ottawa, ON, 1997) 199–211. North-Holland, Amsterdam.
  • [44] Mendelson, S. (2015). Learning without concentration. J. ACM 62 Art. 21, 25.
  • [45] Mendelson, S. (2016). Upper bounds on product and multiplier empirical processes. Stochastic Process. Appl. 126 3652–3680.
  • [46] Mendelson, S. (2017). “Local” vs. “global” parameters—breaking the Gaussian complexity barrier. Ann. Statist. 45 1835–1862.
  • [47] Mendelson, S. (2017). On aggregation for heavy-tailed classes. Probab. Theory Related Fields 168 641–674.
  • [48] Mendelson, S. (2017). Extending the small-ball method. Preprint. Available at arXiv:1709.00843.
  • [49] Mendelson, S. (2017). On multiplier processes under weak moment assumptions. In Geometric Aspects of Functional Analysis. Lecture Notes in Math. 2169 301–318. Springer, Cham.
  • [50] Minsker, S. (2015). Geometric median and robust estimation in Banach spaces. Bernoulli 21 2308–2335.
  • [51] Nickl, R. and van de Geer, S. (2013). Confidence sets in sparse regression. Ann. Statist. 41 2852–2876.
  • [52] Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory 7 186–199.
  • [53] Portnoy, S. and Koenker, R. (1997). The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators. Statist. Sci. 12 279–300.
  • [54] Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for high-dimensional linear regression over $\ell_{q}$-balls. IEEE Trans. Inform. Theory 57 6976–6994.
  • [55] Shorack, G. R. and Wellner, J. A. (2009). Empirical Processes with Applications to Statistics. Classics in Applied Mathematics 59. SIAM, Philadelphia, PA.
  • [56] Sivakumar, V., Banerjee, A. and Ravikumar, P. K. (2015). Beyond sub-Gaussian measurements: High-dimensional structured estimation with sub-exponential designs. In Advances in Neural Information Processing Systems 2206–2214.
  • [57] Strassen, V. and Dudley, R. M. (1969). The central limit theorem and $\varepsilon $-entropy. In Probability and Information Theory (Proc. Internat. Sympos., McMaster Univ., Hamilton, Ont., 1968) 224–231. Springer, Berlin.
  • [58] Talagrand, M. (2014). Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems. Ergebnisse der Mathematik und Ihrer Grenzgebiete. 3. Folge. A Series of Modern Surveys in Mathematics 60. Springer, Heidelberg.
  • [59] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
  • [60] van de Geer, S. (1987). A new approach to least-squares estimation, with applications. Ann. Statist. 15 587–602.
  • [61] van de Geer, S. (1990). Estimating a regression function. Ann. Statist. 18 907–924.
  • [62] van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
  • [63] van de Geer, S. and Wegkamp, M. (1996). Consistency for the least squares estimator in nonparametric regression. Ann. Statist. 24 2513–2523.
  • [64] van de Geer, S. A. (2000). Applications of Empirical Process Theory. Cambridge Series in Statistical and Probabilistic Mathematics 6. Cambridge Univ. Press, Cambridge.
  • [65] van der Vaart, A. and Wellner, J. A. (2011). A local maximal inequality under uniform entropy. Electron. J. Stat. 5 192–203.
  • [66] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer, New York.
  • [67] Yang, Y. (2001). Nonparametric regression with dependent errors. Bernoulli 7 633–655.
  • [68] Yang, Y. and Barron, A. (1999). Information-theoretic determination of minimax rates of convergence. Ann. Statist. 27 1564–1599.
  • [69] Yao, Q. W. (1993). Tests for change-points with epidemic alternatives. Biometrika 80 179–191.
  • [70] Zhang, C.-H. (2002). Risk bounds in isotonic regression. Ann. Statist. 30 528–555.

Supplemental materials

  • Supplement: Additional proofs. In the supplement [26], we provide detailed proofs for (i) Theorem 4, (ii) the impossibility results Propositions 1 and 3 and (iii) all remaining lemmas.