## Annals of Statistics

### Robust machine learning by median-of-means: Theory and practice

#### Abstract

Median-of-means (MOM) based procedures have recently been introduced in learning theory (Lugosi and Mendelson (2019); Lecué and Lerasle (2017)). These estimators outperform classical least-squares estimators when the data are heavy-tailed and/or corrupted. However, none of these procedures is computationally tractable, which is the major issue with current MOM procedures (Ann. Statist. 47 (2019) 783–794).
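For intuition, the basic median-of-means idea can be sketched in a few lines: split the sample into K equal-sized blocks, average within each block, and return the median of the block means. This is a minimal illustration of the general principle, not the estimators studied in the paper (the function name and the equal-block splitting convention are illustrative choices):

```python
import numpy as np

def mom_mean(x, K):
    """Median-of-means estimator of a univariate mean.

    Splits x into K equal-sized blocks (dropping any remainder),
    averages within each block, and returns the median of the K means.
    """
    x = np.asarray(x, dtype=float)
    n = (len(x) // K) * K          # drop the remainder so blocks are equal-sized
    blocks = x[:n].reshape(K, -1)
    return float(np.median(blocks.mean(axis=1)))
```

A single gross outlier can corrupt at most one block, so it moves the plain empirical mean arbitrarily but leaves the median of block means essentially unchanged: `mom_mean(list(range(1, 10)) + [1000], K=5)` returns `5.5`, whereas the plain mean of the same sample is `104.5`.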

In this paper, we introduce minmax MOM estimators and show that they achieve the same sub-Gaussian deviation bounds as the alternatives (Lugosi and Mendelson (2019); Lecué and Lerasle (2017)), both in small and high-dimensional statistics. In particular, these estimators are efficient under moment assumptions on data that may have been corrupted by a few outliers.

Besides these theoretical guarantees, the definition of minmax MOM estimators suggests simple and systematic modifications of standard algorithms used to approximate least-squares estimators and their regularized versions. As a proof of concept, we perform an extensive simulation study of these algorithms for robust versions of the LASSO.
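The kind of systematic modification the abstract alludes to can be sketched as follows: in an ISTA-style proximal gradient loop for the LASSO, each gradient step is computed on the block whose empirical quadratic loss is the median over K randomly drawn blocks, rather than on the full sample. This is only a rough sketch under stated assumptions (fresh random blocks at each iteration, a fixed step size, and the `lam`/`step` parameters are illustrative), not the authors' exact algorithm:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def mom_lasso(X, y, K, lam, step=0.05, n_iter=500, rng=None):
    """MOM-flavored ISTA sketch for the LASSO.

    At each iteration: draw K random equal-sized blocks, evaluate the
    quadratic loss of the current iterate on each block, and take a
    proximal gradient step using only the block with the median loss.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    beta = np.zeros(d)
    m = n // K                                     # block size
    for _ in range(n_iter):
        blocks = rng.permutation(n)[: m * K].reshape(K, m)
        losses = [np.mean((y[b] - X[b] @ beta) ** 2) for b in blocks]
        med = blocks[np.argsort(losses)[K // 2]]   # indices of the median block
        grad = -2.0 / m * X[med].T @ (y[med] - X[med] @ beta)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```

Because outliers concentrate in a few blocks with atypically large losses, the median block is (with high probability) outlier-free, so each gradient step is computed on clean data; this is the mechanism that makes the modified algorithm robust.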

#### Article information

Source
Ann. Statist., Volume 48, Number 2 (2020), 906–931.

Dates
Revised: February 2019
First available in Project Euclid: 26 May 2020

https://projecteuclid.org/euclid.aos/1590480039

Digital Object Identifier
doi:10.1214/19-AOS1828

Mathematical Reviews number (MathSciNet)
MR4102681

#### Citation

Lecué, Guillaume; Lerasle, Matthieu. Robust machine learning by median-of-means: Theory and practice. Ann. Statist. 48 (2020), no. 2, 906--931. doi:10.1214/19-AOS1828. https://projecteuclid.org/euclid.aos/1590480039

#### References

• [1] Alon, N., Matias, Y. and Szegedy, M. (1999). The space complexity of approximating the frequency moments. J. Comput. System Sci. 58 137–147.
• [2] Alquier, P., Cottet, V. and Lecué, G. (2019). Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions. Ann. Statist. 47 2117–2144.
• [3] Arlot, S. and Celisse, A. (2010). A survey of cross-validation procedures for model selection. Stat. Surv. 4 40–79.
• [4] Arlot, S. and Celisse, A. (2011). Segmentation of the mean of heteroscedastic data via cross-validation. Stat. Comput. 21 613–632.
• [5] Arlot, S. and Lerasle, M. (2016). Choice of $V$ for $V$-fold cross-validation in least-squares density estimation. J. Mach. Learn. Res. 17 Paper No. 208, 50.
• [6] Audibert, J.-Y. and Catoni, O. (2011). Robust linear least squares regression. Ann. Statist. 39 2766–2794.
• [7] Bach, F. R. (2010). Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a Meeting Held 6–9 December 2010 118–126, Vancouver, BC.
• [8] Baraud, Y. (2011). Estimator selection with respect to Hellinger-type risks. Probab. Theory Related Fields 151 353–401.
• [9] Baraud, Y. and Birgé, L. (2016). Rho-estimators for shape restricted density estimation. Stochastic Process. Appl. 126 3888–3912.
• [10] Baraud, Y., Birgé, L. and Sart, M. (2017). A new method for estimation and model selection: $\rho$-estimation. Invent. Math. 207 425–517.
• [11] Bellec, P. C., Lecué, G. and Tsybakov, A. B. (2016). Slope meets lasso: Improved oracle bounds and optimality. Technical report, CREST, CNRS, Université Paris Saclay.
• [12] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705–1732.
• [13] Birgé, L. (1984). Stabilité et instabilité du risque minimax pour des variables indépendantes équidistribuées. Ann. Inst. Henri Poincaré Probab. Stat. 20 201–223.
• [14] Birgé, L. (2006). Model selection via testing: An alternative to (penalized) maximum likelihood estimators. Ann. Inst. Henri Poincaré Probab. Stat. 42 273–325.
• [15] Blanchard, G., Bousquet, O. and Massart, P. (2008). Statistical performance of support vector machines. Ann. Statist. 36 489–531.
• [16] Bogdan, M., van den Berg, E., Sabatti, C., Su, W. and Candès, E. J. (2015). SLOPE—adaptive variable selection via convex optimization. Ann. Appl. Stat. 9 1103–1140.
• [17] Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, Heidelberg.
• [18] Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study. Ann. Inst. Henri Poincaré Probab. Stat. 48 1148–1185.
• [19] Dean, J. and Ghemawat, S. (2010). Mapreduce: A flexible data processing tool. Commun. ACM 53 72–77.
• [20] Devroye, L., Lerasle, M., Lugosi, G. and Oliveira, R. I. (2016). Sub-Gaussian mean estimators. Ann. Statist. 44 2695–2725.
• [21] Elsener, A. and van de Geer, S. (2018). Robust low-rank matrix estimation. Ann. Statist. 46 3481–3509.
• [22] Fan, J., Li, Q. and Wang, Y. (2017). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. J. R. Stat. Soc. Ser. B. Stat. Methodol. 79 247–265.
• [23] Giraud, C. (2015). Introduction to High-Dimensional Statistics. Monographs on Statistics and Applied Probability 139. CRC Press, Boca Raton, FL.
• [24] Hampel, F. R. (1971). A general qualitative definition of robustness. Ann. Math. Stat. 42 1887–1896.
• [25] Hampel, F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc. 69 383–393.
• [26] Han, Q. and Wellner, J. A. (2017). A sharp multiplier inequality with applications to heavy-tailed regression problems. Available at arXiv:1706.02410.
• [27] Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Stat. 35 73–101.
• [28] Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proc. Fifth Berkeley Sympos. Math. Statist. and Probability (Berkeley, Calif., 1965/66), Vol. I: Statistics 221–233. Univ. California Press, Berkeley, CA.
• [29] Huber, P. J. and Ronchetti, E. M. (2009). Robust Statistics, 2nd ed. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ.
• [30] Jerrum, M. R., Valiant, L. G. and Vazirani, V. V. (1986). Random generation of combinatorial structures from a uniform distribution. Theoret. Comput. Sci. 43 169–188.
• [31] Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Lecture Notes in Math. 2033. Springer, Heidelberg.
• [32] Koltchinskii, V. and Mendelson, S. (2015). Bounding the smallest singular value of a random matrix without concentration. Int. Math. Res. Not. IMRN 23 12991–13008.
• [33] Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer Series in Statistics. Springer, New York.
• [34] Le Cam, L. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist. 1 38–53.
• [35] Lecué, G. and Lerasle, M. (2017). Learning from MOM's principles: Le Cam's approach. Technical report, CNRS, ENSAE, Université Paris-Sud. To appear in Stochastic Process. Appl.
• [36] Lecué, G. and Lerasle, M. (2020). Supplement to “Robust machine learning by median-of-means: Theory and practice.” https://doi.org/10.1214/19-AOS1828SUPP.
• [37] Lecué, G. and Mendelson, S. (2013). Learning subgaussian classes: Upper and minimax bounds. Technical report, CNRS, École Polytechnique and Technion.
• [38] Lecué, G. and Mendelson, S. (2017). Regularization and the small-ball method II: Complexity dependent error rates. J. Mach. Learn. Res. 18 Paper No. 146, 48.
• [39] Lecué, G. and Mendelson, S. (2018). Regularization and the small-ball method I: Sparse recovery. Ann. Statist. 46 611–641.
• [40] Lerasle, M. and Oliveira, R. I. (2011). Robust empirical mean estimators. Available at arXiv:1112.3914.
• [41] Lounici, K. (2008). Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electron. J. Stat. 2 90–102.
• [42] Lugosi, G. and Mendelson, S. (2019). Regularization, sparse recovery, and median-of-means tournaments. Bernoulli 25 2075–2106.
• [43] Lugosi, G. and Mendelson, S. (2019). Risk minimization by median-of-means tournaments. J. Eur. Math. Soc. (JEMS). To appear.
• [44] Massart, P. and Nédélec, É. (2006). Risk bounds for statistical learning. Ann. Statist. 34 2326–2366.
• [45] Massias, M., Fercoq, O., Gramfort, A. and Salmon, J. (2017). Generalized concomitant multi-task lasso for sparse multimodal regression.
• [46] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462.
• [47] Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist. 37 246–270.
• [48] Mendelson, S. (2014). Learning without concentration. In Proceedings of the 27th annual conference on Learning Theory COLT14, 25–39.
• [49] Mendelson, S. (2016). On multiplier processes under weak moment assumptions. Technical report, Technion.
• [50] Minsker, S. (2015). Geometric median and robust estimation in Banach spaces. Bernoulli 21 2308–2335.
• [51] Minsker, S. and Strawn, N. (2019). Distributed statistical estimation and rates of convergence in normal approximation. Electron. J. Statist. 13 5213–5252.
• [52] Negahban, S. N., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. Statist. Sci. 27 538–557.
• [53] Nemirovsky, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience Publication. Wiley, New York.
• [54] Nickl, R. and van de Geer, S. (2013). Confidence sets in sparse regression. Ann. Statist. 41 2852–2876.
• [55] Notebook. Available at https://github.com/lecueguillaume/MOMpower.
• [56] Saumard, A. (2018). On optimality of empirical risk minimization in linear aggregation. Bernoulli 24 2176–2203.
• [57] Su, W. and Candès, E. (2016). SLOPE is adaptive to unknown sparsity and asymptotically minimax. Ann. Statist. 44 1038–1068.
• [58] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
• [59] Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In Contributions to Probability and Statistics 448–485. Stanford Univ. Press, Stanford, CA.
• [60] Tukey, J. W. (1962). The future of data analysis. Ann. Math. Stat. 33 1–67.
• [61] van de Geer, S. (2014). Weakly decomposable regularization penalties and structured sparsity. Scand. J. Stat. 41 72–86.
• [62] van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202.
• [63] van de Geer, S. A. (2007). The deterministic lasso. Technical report, ETH Zürich. Available at http://www.stat.math.ethz.ch/~geer/lasso.pdf.
• [64] van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36 614–645.
• [65] Vapnik, V. N. (1998). Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. Wiley, New York.
• [66] Wang, T. E., Gu, Y., Mehta, D., Zhao, X. and Bernal, E. A. (2018). Towards robust deep neural networks. Preprint. Available at arXiv:1810.11726.
• [67] Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 217–242.
• [68] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563.
• [69] Zhou, W.-X., Sun, Q. and Fan, J. (2017). Adaptive huber regression: Optimality and phase transition. Preprint. Available at arXiv:1706.06991.

#### Supplemental materials

• Supplementary material to “Robust machine learning by median-of-means: Theory and practice”. Section 6 gives the proofs of the main results. These main results focus on the regularized versions of the MOM estimators of increments presented in the Introduction, which are well suited to high-dimensional learning frameworks. We complete these results in Section 7, providing results for the basic estimators without regularization in small dimension. Finally, Section 8 provides minimax optimality results for our procedures.