ERM and RERM are optimal estimators for regression problems when malicious outliers corrupt the labels

We study Empirical Risk Minimizers (ERM) and Regularized Empirical Risk Minimizers (RERM) for regression problems with convex and $L$-Lipschitz loss functions. We consider a setting where $|\mathcal O|$ malicious outliers may contaminate the labels. In that case, we show that the $L_2$-error rate is bounded by $r_N + L |\mathcal O|/N$, where $N$ is the total number of observations and $r_N$ is the $L_2$-error rate in the non-contaminated setting. When $r_N$ is minimax-rate-optimal in a non-contaminated setting, the rate $r_N + L|\mathcal O|/N$ is also minimax-rate-optimal when $|\mathcal O|$ outliers contaminate the labels. The main results of the paper can be used for many non-regularized and regularized procedures under weak assumptions on the noise. For instance, we present results for Huber's M-estimators (without penalization or regularized by the $\ell_1$-norm) and for general regularized learning problems in reproducing kernel Hilbert spaces.


Introduction
Let $(X_i, Y_i)_{i=1,\dots,N}$ be random variables taking values in $\mathcal X \times \mathbb R$, where $\mathcal X$ is a measurable space. Given a new input $X \in \mathcal X$, one wants to predict its associated label $Y \in \mathbb R$. To proceed, we consider $(X, Y)$ as a random variable valued in $\mathcal X \times \mathbb R$ and, given a class $F$ of predictors $f : \mathcal X \to \mathbb R$, the goal is to predict/approximate the oracle $f^*$ defined as $f^* \in \operatorname{argmin}_{f \in F} \mathbb E\,\ell(f(X), Y)$, where $\ell(f(X), Y)$ measures the error of predicting $f(X)$ while the true label is $Y$. To estimate/approximate the function $f^*$, we use the dataset $(X_i, Y_i)_{i=1,\dots,N}$. Regularized empirical risk minimization is the most widespread strategy in machine learning to estimate $f^*$. There exists an extensive literature on its generalization capabilities [55, 29, 28, 34, 15]. However, in recent years, many papers have highlighted its severe limitations. One main drawback is that a single outlier $(X_o, Y_o)$ (in the sense that nothing is assumed on $(X_o, Y_o)$) may deteriorate the performance of RERM. Consequently, RERM is, in general, not robust to outliers. However, what happens if only the labels $(Y_i)_{i=1,\dots,N}$ are contaminated? In [17], the authors raised the question whether it is possible to attain optimal rates of convergence in outlier-robust sparse regression using regularized empirical risk minimization. They consider the model $Y_i = \langle X_i, t^*\rangle + \epsilon_i$, where $X_i$ is a Gaussian random vector in $\mathbb R^p$ with a covariance matrix satisfying the Restricted Eigenvalue condition [54] and $t^*$ is $s$-sparse. For non-contaminated data they suppose that $\epsilon_i \sim \mathcal N(0, \sigma^2)$, while the noise can be anything when malicious outliers contaminate the sample. The authors prove that the $\ell_1$-penalized empirical risk minimizer based on Huber's loss function has an error rate of the order $\sqrt{s\log(p)/N} + |\mathcal O|/N$, where $|\mathcal O|$ is the number of outliers contaminating the labels. Consequently, they showed that RERM associated with the Huber loss function is minimax-rate-optimal when $|\mathcal O|$ malicious outliers corrupt the labels.

Setting
Let $(\Omega, \mathcal A, \mathbb P)$ be a probability space where $\Omega = \mathcal X \times \mathcal Y$. Here $\mathcal X$ denotes the measurable space of the inputs and $\mathcal Y \subset \mathbb R$ the measurable space of the outputs. Let $(X, Y)$ be a random variable taking values in $\Omega$ with joint distribution $P$ and let $\mu$ be the marginal distribution of $X$. Let $F$ denote a class of functions $f : \mathcal X \to \mathcal Y$. A function $f$ in $F$ is called a predictor. The function $\ell : \mathcal Y \times \mathcal Y \to \mathbb R_+$ is a loss function such that $\ell(f(x), y)$ measures the quality of predicting $f(x)$ while the true answer is $y$. For any function $f$ in $F$ we write $\ell_f(x, y) := \ell(f(x), y)$. For any distribution $Q$ on $\Omega$ and any function $f : \mathcal X \times \mathcal Y \to \mathbb R$ we write $Qf = \mathbb E_{(X,Y)\sim Q}[f(X, Y)]$. For $f \in F$, the risk of $f$ is defined as $R(f) = P\ell_f$. A prediction function with minimal risk is called an oracle and is defined as $f^* \in \operatorname{argmin}_{f \in F} P\ell_f$. For the sake of simplicity, it is assumed that the oracle $f^*$ exists and is unique. The joint distribution $P$ of $(X, Y)$ being unknown, computing $f^*$ is impossible. Instead one is given a dataset $\mathcal D = (X_i, Y_i)_{i=1}^N$ of $N$ random variables taking values in $\mathcal X \times \mathcal Y$. In this paper, we consider a setup where $|\mathcal O|$ outputs may be contaminated. More precisely, let $\mathcal I \cup \mathcal O$ denote an unknown partition of $\{1, \dots, N\}$, where $\mathcal I$ is the set of informative data and $\mathcal O$ the set of outliers. It is assumed that:

Assumption 1. $(X_i, Y_i)_{i \in \mathcal I}$ are i.i.d. with common distribution $P$. The random variables $(X_i)_{i=1}^N$ are i.i.d. with law $\mu$. Nothing is assumed on the labels $(Y_i)_{i \in \mathcal O}$; they can even be adversarial outliers making the learning as hard as possible.

The goal is, without knowing the partition $\mathcal I \cup \mathcal O$, to use the informative data $(X_i, Y_i)_{i \in \mathcal I}$ to construct an estimator $\hat f$ that approximates/estimates the oracle $f^*$. A way of measuring the quality of an estimator is via the error rate $\|\hat f - f^*\|_{L_2(\mu)}$ or the excess risk $P\mathcal L_{\hat f} := P\ell_{\hat f} - P\ell_{f^*}$. We assume the following:

Assumption 2. The class $F$ is convex.
A natural idea for constructing robust estimators when the labels might be contaminated is to consider Lipschitz loss functions [24, 23]. Moreover, for computational purposes we also focus on convex loss functions [52].

Assumption 3. There exists $L > 0$ such that, for any $y \in \mathcal Y$, $\ell(\cdot, y)$ is $L$-Lipschitz and convex.
Recall that the Empirical Risk Minimizer (ERM) and the Regularized Empirical Risk Minimizer (RERM) are respectively defined as
$$\hat f_N \in \operatorname*{argmin}_{f \in F} \frac 1N \sum_{i=1}^N \ell_f(X_i, Y_i) \qquad \text{and} \qquad \hat f^\lambda_N \in \operatorname*{argmin}_{f \in F} \frac 1N \sum_{i=1}^N \ell_f(X_i, Y_i) + \lambda \|f\|,$$
where $\lambda > 0$ is a tuning parameter and $\|\cdot\|$ is a norm. Under Assumptions 2 and 3, the ERM and the RERM are computable using tools from convex optimization.
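The computability claim can be illustrated with a minimal sketch: plain subgradient descent on the empirical risk for the linear class with the absolute loss, which is convex and 1-Lipschitz (Assumption 3 with $L = 1$). All names, sizes and step-size choices below are illustrative, not taken from the paper.

```python
import numpy as np

def erm_absolute_loss(X, y, n_iter=2000, lr=0.5):
    """Illustrative ERM over linear predictors f_t(x) = <x, t> with the
    absolute loss, which is convex and 1-Lipschitz (Assumption 3, L = 1).
    Plain subgradient descent with step size decaying as 1/sqrt(k)."""
    n, p = X.shape
    t = np.zeros(p)
    for k in range(1, n_iter + 1):
        residual = y - X @ t
        # Subgradient of (1/n) sum_i |y_i - <x_i, t>| with respect to t.
        g = -(X.T @ np.sign(residual)) / n
        t -= (lr / np.sqrt(k)) * g
    return t

rng = np.random.default_rng(0)
N, p = 500, 3
t_star = np.array([1.0, -2.0, 0.5])   # hypothetical ground truth
X = rng.standard_normal((N, p))
y = X @ t_star + 0.1 * rng.standard_normal(N)
t_hat = erm_absolute_loss(X, y)
print(t_hat)
```

Any off-the-shelf convex solver would do equally well; subgradient descent is used here only because it needs nothing beyond numpy.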

Our contributions
As exposed in [17], in a setting where $|\mathcal O|$ outliers contaminate only the labels, RERM with the Huber loss function is minimax-optimal for the sparse-regression problem when the noise and the design of the non-contaminated data are both Gaussian. This leads to the following question:

1. Is RERM optimal for other loss functions and for regression problems other than the sparse-regression problem when malicious outliers corrupt the labels?
Based on previous works [15, 13, 14, 1], we study ERM and RERM for regression problems when the penalization is a norm and the loss function is simultaneously convex and Lipschitz, and show the following: in a framework where $|\mathcal O|$ outliers may contaminate the labels, under weak assumptions on the noise, the excess risk and the square of the error rate for both ERM and RERM can be bounded by
$$\Big(r_N + L\,\frac{|\mathcal O|}{N}\Big)^2, \qquad (2)$$
where $N$ is the total number of observations, $L$ is the Lipschitz constant from Assumption 3 and $r_N$ is the error rate in a non-contaminated setting.
When the proportion of outliers $|\mathcal O|/N$ is smaller than the error rate normalized by the Lipschitz constant, $r_N/L$, both ERM and RERM behave as if there were no contamination. The result holds for any loss function that is simultaneously convex and Lipschitz, not only for the Huber loss function. We obtain theorems that can be used for many well-known regression problems, including structured high-dimensional regression (see Section 3.3), non-parametric regression (see Section 3.4) and matrix trace regression (using the results from [1]).
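The flat-then-linear behaviour of the bound can be made concrete with purely hypothetical numbers ($N$, $L$ and $r_N$ below are illustrative placeholders, not values from the paper):

```python
# The contaminated rate behaves like max(r_N, L * |O| / N): constant until
# the proportion of outliers reaches r_N / L, then linear in |O| / N.
N, L, r_N = 10_000, 1.0, 0.05   # hypothetical sample size, Lipschitz constant, clean rate

def rate_bound(n_outliers, r_N=r_N, L=L, N=N):
    """Order of the contaminated error rate as a function of |O|."""
    return max(r_N, L * n_outliers / N)

threshold = int(N * r_N / L)    # below this many outliers, contamination is free
print(threshold, rate_bound(0), rate_bound(2 * threshold))
```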
The next question one may ask is the following:

2. Is the general bound (2) minimax-rate-optimal when $|\mathcal O|$ malicious outliers may corrupt the labels?
To answer question 2, we use the results from [10], where the authors established a general minimax theory for the ε-contamination model. In Section B, we show that the minimax lower bounds for regression problems in the ε-contamination model are the same when:
• Both the design $X$ and the response variable $Y$ are contaminated.
• Only the response variable Y is contaminated.
Moreover, it is clear that a lower bound on the risk in the ε-contamination model implies a lower bound when $|\mathcal O| = \varepsilon N$ arbitrary outliers contaminate the dataset, since in our setting outliers do not necessarily share the same distribution $Q$. As a consequence, for regression problems, minimax-rate-optimal bounds in the ε-contamination model are also optimal when $N\varepsilon$ malicious outliers corrupt the labels.
When the bound (2) is minimax-rate-optimal for regression problems in the ε-contamination model with $\varepsilon = |\mathcal O|/N$, it is also minimax-rate-optimal when $|\mathcal O|$ malicious outliers corrupt the labels.
In particular, we recover and generalize the results from [17] when the noise of non-contaminated data is not necessarily Gaussian but may be heavy-tailed.
The results are derived under the local Bernstein condition introduced in [15]. This condition makes it possible to obtain fast rates of convergence when the noise is heavy-tailed. As a proof of concept, we study Huber's M-estimators in $\mathbb R^p$ (non-penalized or regularized by the $\ell_1$-norm) when the noise may be heavy-tailed. In these cases, the error rates are respectively $\sqrt{\operatorname{Tr}(\Sigma)/N} + |\mathcal O|/N$ and $\sqrt{s\log(p)/N} + |\mathcal O|/N$, where $\Sigma$ is the covariance matrix of the design $X$. We also study learning problems in general Reproducing Kernel Hilbert Spaces (RKHS). We derive error rates depending on the spectrum of the integral operator as in [44, 42, 7], without assumption on the design and when the noise has heavy tails (see Section 3.4).

Related Literature
Regression problems with possibly heavy-tailed data or outliers cannot be handled by classical least-squares estimators. This lack of robustness of least-squares estimators gave birth to the theory of robust statistics developed by Peter Huber [24, 23, 25], John Tukey [50, 51] and Frank Hampel [20, 21]. The most classical alternatives to least-squares estimators are M-estimators, which replace the quadratic loss function by another one that is less sensitive to outliers [40, 57]. Robust statistics has attracted a lot of attention in the past few years, both in the computer science and the statistics communities. For example, although estimating the mean of a random vector in $\mathbb R^p$ is one of the oldest and most fundamental problems in robust statistics, it is still a very active research area. Surprisingly, optimal bounds for heavy-tailed data have been obtained only recently [38]. The estimator in [38] cannot be computed in practice. Using SDP, [22] obtained optimal bounds achievable in polynomial time. In recent works, still using SDP, [30] designed an algorithm computable in nearly linear time, while [36] developed the first tractable optimal algorithm not based on SDP. In the meantime, another recent trend in robust statistics is to focus on finite-sample risk bounds that are minimax-rate-optimal when $|\mathcal O|$ outliers contaminate the dataset. For example, for the problem of mean estimation, when $|\mathcal O|$ malicious outliers contaminate the dataset and the non-contaminated data are assumed to be sub-Gaussian, the optimal rate of the estimation error measured in the Euclidean norm scales as $\sqrt{p/N} + |\mathcal O|/N$. In [10], the authors developed a general analysis for the ε-contamination model. In [9], the same authors proposed an optimal estimator when $|\mathcal O|$ outliers with the same distribution contaminate the data. In [19], the authors focused on the problem of high-dimensional linear regression in a robust model where an ε-fraction of the samples can be adversarially corrupted. Robust regression problems
have also been studied in [12, 18, 37, 5]. The above-mentioned articles assume corruption both in the design and in the labels. In such a corruption setting, ERM and RERM are known to be poor estimators. In [17], the authors raised the question whether it is possible to attain optimal rates of convergence in sparse regression using regularized empirical risk minimization when a proportion of malicious outliers contaminate only the labels. They studied $\ell_1$-penalized Huber's M-estimators. This work is the closest to our setting and reveals that when only the labels are contaminated, simple procedures, such as penalized Huber's M-estimators, still perform well and are minimax-rate-optimal. Their proofs rely on the fact that the non-contaminated data are Gaussian. Our approach is different and more general. Other alternatives that are robust both to heavy-tailed data and to outliers in regression have been proposed in the literature, such as Median-Of-Means (MOM) based methods [31, 32, 15]. However, such estimators are difficult to compute in practice and can lead to sub-optimal rates. For instance, for sparse linear regression in $\mathbb R^p$ with a sub-Gaussian design, MOM-based estimators have an error rate of the order $\sqrt{s\log(p)/N} + L\sqrt{|\mathcal O|/N}$ (see [15]), while the optimal dependence with respect to the number of outliers is $\sqrt{s\log(p)/N} + L|\mathcal O|/N$. Finally, there has been a recent interest in robust iterative algorithms. It was shown that the robustness of stochastic approximation algorithms can be enhanced by using robust stochastic gradients. For example, based on the geometric median [43], [11] designed a robust gradient descent scheme. More recently, [26] showed that a simple truncation of the gradient enhances the robustness of the stochastic mirror descent algorithm.
The paper is organized as follows. In Section 2, we present general results for non-regularized procedures, with a focus on the example of Huber's M-estimator in $\mathbb R^p$. Section 3 gives general results for RERM, which we apply to $\ell_1$-penalized Huber's M-estimators with isotropic design and to regularized learning in RKHS. Section A presents simple simulations to illustrate our theoretical findings. In Section B, we show that the minimax lower bounds for regression problems in the ε-contamination model are the same when 1) both the design $X$ and the labels are contaminated and 2) only the labels are contaminated. Section C shows that we can extend the results for the $\ell_1$-penalized Huber's M-estimator when the covariance matrix of the design $X$ satisfies a Restricted Eigenvalue condition. Finally, the proofs of the main theorems are presented in Section D.
Notations. All along the paper, for any $f$ in $F$, $\|f\|_{L_2}$ will be written instead of $\|f\|_{L_2(\mu)}$, where $\|f\|^2_{L_2(\mu)} = \int f^2\,d\mu$. The letter $c$ will denote an absolute constant. For a set $T$, its cardinality is denoted $|T|$. For two real numbers $a, b$, $a \vee b$ and $a \wedge b$ denote respectively $\max(a, b)$ and $\min(a, b)$. For any set $H$ for which it makes sense, let $f^* + H = \{f^* + h : h \in H\}$.

Non-regularized procedures
In this section we study the Empirical Risk Minimizer (ERM), whose definition we recall:
$$\hat f_N \in \operatorname*{argmin}_{f \in F} \frac 1N \sum_{i=1}^N \ell_f(X_i, Y_i). \qquad (3)$$
We establish bounds on the error rate $\|\hat f_N - f^*\|_{L_2}$ and the excess risk $P\mathcal L_{\hat f_N} := P\ell_{\hat f_N} - P\ell_{f^*}$ in two different settings: 1) when $F - f^*$ is sub-Gaussian, and 2) when $F - f^*$ is locally bounded. We derive fast rates of convergence under very weak assumptions.

General results in the sub-Gaussian framework
The ERM performs well when the empirical excess risk $f \mapsto P_N \mathcal L_f$ uniformly concentrates around its expectation $f \mapsto P\mathcal L_f$. Thus, it is necessary to impose a strong concentration assumption on the class $\{\mathcal L_f(X, Y), f \in F\}$. By Assumption 3, this is implied by a concentration (sub-Gaussian) assumption on the class $F - f^*$ (Assumption 4, with sub-Gaussian constant $B$). See [33] for many examples of sub-Gaussian classes. In this context, we use the Gaussian mean-width as a measure of the complexity of the function class $F$, which we introduce here.

Definition 1. Let $H \subset L_2(\mu)$ and let $(G_h)_{h \in H}$ be the canonical centered Gaussian process indexed by $H$ (in particular, the covariance structure of $(G_h)_{h \in H}$ is given by $\mathbb E(G_{h_1} - G_{h_2})^2 = \|h_1 - h_2\|^2_{L_2(\mu)}$). The Gaussian mean-width of $H$ is $w(H) = \mathbb E \sup_{h \in H} G_h$.

For example, when $F = \{\langle t, \cdot\rangle, t \in T\}$ and the covariance matrix of $X$ is $\Sigma$, we have $w(F) = \mathbb E \sup_{t \in T} \langle t, G\rangle$, where $G \sim \mathcal N(0, \Sigma)$. Similarly to [34, 15, 14, 1], the error rate and the excess risk are driven by fixed-point solutions of a Gaussian mean-width:

Definition 2. Let $B_{L_2}$ denote the unit ball induced by $L_2(\mu)$. The complexity parameter $r_I(\cdot)$ is defined as
$$r_I(A) = \inf\Big\{ r > 0 : cALB\, w\big(F \cap (f^* + r B_{L_2})\big) \le r^2 \sqrt{|I|} \Big\},$$
where $c > 0$ denotes an absolute constant, $L$ is the Lipschitz constant from Assumption 3 and $B$ is the sub-Gaussian constant from Assumption 4.
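As a sanity check on Definition 1, the Gaussian mean-width of a simple set can be estimated by Monte Carlo. The example below takes the Euclidean unit ball with an identity covariance, where $w(B_2^p) = \mathbb E\|G\|_2 \approx \sqrt p$; the sample sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_mc = 400, 2000

# Monte Carlo estimate of the Gaussian mean-width of the Euclidean unit
# ball: w(B_2^p) = E sup_{||t||_2 <= 1} <t, G> = E ||G||_2 with G ~ N(0, I_p).
G = rng.standard_normal((n_mc, p))
w_hat = np.linalg.norm(G, axis=1).mean()
print(w_hat, np.sqrt(p))   # E||G||_2 is close to sqrt(p) for large p
```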
To obtain fast rates of convergence it is necessary to impose assumptions on the distribution $P$. For instance, the margin assumptions [39, 49, 53] and the Bernstein conditions from [3] have been widely used in statistics and learning theory to prove fast convergence rates for the ERM. In the spirit of [15], we introduce a weaker, local Bernstein assumption.

Assumption 5. Let $r(\cdot)$ be a complexity parameter such that for all $A > 0$, $r(A) \ge r_I(A)$. There exists a constant $A > 0$ such that for all $f \in F$ with $\|f - f^*\|_{L_2} \le r(A)$, we have $\|f - f^*\|^2_{L_2} \le A\, P\mathcal L_f$.

Note that Assumption 5 holds only locally around the oracle $f^*$. The smallest admissible radius corresponds to $r_I(A)$; the larger $r(\cdot)$, the stronger Assumption 5 is. Assumption 5 has been extensively studied in [15, 14] for different Lipschitz and convex loss functions. For the sake of brevity, in applications we will only focus on the Huber loss function in this paper.
We are now in position to state the main theorem for the ERM. Theorem 1 holds if the local Bernstein condition (Assumption 5) is satisfied for all functions $f$ in $F$ such that $\|f - f^*\|_{L_2} = cAL(r_I(A) + |\mathcal O|/N)$, that is, on an $L_2$-sphere with radius equal to the rate of convergence. The bound on the error rate can be decomposed as the sum of the error rate in the non-contaminated setting and the proportion of outliers $|\mathcal O|/N$. As long as the proportion of outliers is smaller than the error rate in the non-contaminated setting, the error rate remains constant. On the other hand, when the proportion of outliers exceeds the error rate in the non-contaminated setting, the error rate in the contaminated setting becomes linear with respect to the proportion of outliers. When $r_I$ is minimax optimal in a non-contaminated setting, we obtain that the ERM is minimax optimal when fewer than $N r_I$ outliers contaminate the labels. In Section 2.3, we show that this dependence with respect to the number of outliers is minimax optimal for linear regression in $\mathbb R^p$.

General results in the bounded framework
In Section 2.1 we considered sub-Gaussian classes of functions to derive fast rates of convergence. In this section, we derive a general result when the localized class $F - f^*$ is bounded (localized around the oracle $f^*$ with respect to the $L_2(\mu)$-norm, see Assumption 6). Since the Gaussian mean-width no longer appears naturally, it is necessary to define a new measure of the complexity of the class $F$. A way to measure the complexity of a function class $F$ is via Rademacher complexities [29, 28].

Definition 3. The complexity parameter in the bounded setting $r^b_I(\cdot)$ is defined as
$$r^b_I(A) = \inf\Big\{ r > 0 : cAL\, \mathbb E \sup_{f \in F \cap (f^* + r B_{L_2})} \Big|\sum_{i \in I} \sigma_i (f - f^*)(X_i)\Big| \le r^2 |I| \Big\},$$
where $(\sigma_i)_{i \in I}$ are i.i.d. Rademacher random variables independent of $(X_i)_{i \in I}$, $L$ is the Lipschitz constant from Assumption 3 and $B_{L_2}$ denotes the unit ball with respect to $L_2(\mu)$.
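For the linear class, the Rademacher complexity of Definition 3 has a simple closed form that can be checked by Monte Carlo; the design, sizes and seed below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, n_mc = 200, 50, 500

# For the class {x -> <x, t> : ||t||_2 <= 1}, the (normalized) Rademacher
# complexity is
#   E_sigma sup_t (1/n) sum_i sigma_i <X_i, t> = E_sigma ||(1/n) sum_i sigma_i X_i||_2,
# which is of order sqrt(p/n) for an isotropic design.
X = rng.standard_normal((n, p))                 # one fixed design
sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))  # i.i.d. Rademacher signs
rad = np.linalg.norm(sigma @ X / n, axis=1).mean()
print(rad, np.sqrt(p / n))
```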
To obtain fast rates, we need to adapt the local Bernstein condition to this new complexity parameter and to introduce a local boundedness assumption.

Assumption 6. Let $r^b(\cdot)$ be a complexity parameter such that for every $A > 0$, $r^b(A) \ge r^b_I(A)$. There exist $A, M > 0$ such that for all $f \in F$ with $\|f - f^*\|_{L_2} \le r^b(A)$,
$$\|f - f^*\|^2_{L_2} \le A\,P\mathcal L_f \quad \text{and} \quad \|f - f^*\|_{L_\infty} \le M. \qquad (4)$$

The second part of Equation (4) requires $L_\infty$-boundedness only in the $L_2$-neighborhood around the oracle $f^*$ whose radius is proportional to the rate of convergence $r^b(A)$. For example, consider the class of linear functionals with a bounded design. Without loss of generality we can assume that $M \ge 1$. Simple computations (see [29]) show that when $r^b(A) = r^b_I(A)$, the complexity parameter $r^b(A)$ is of the order $\sqrt{p/|I|}$, and the boundedness condition becomes $\|x\|_2^2 \le (M|I|)/(pL)$. The more informative data we have, the larger the Euclidean radius of $X$ can be.

Theorem 2. Grant Assumptions 1, 2, 3 and 6. As long as $|\mathcal O| < (|I|r^b(A))/(2AL)$, with probability larger than $1 - 2\exp\big(-c|I|(r^b)^2(A)/((L + 1)^2 A^2)\big)$, the estimator $\hat f_N$ defined in Equation (3) satisfies bounds of the same form as Theorem 1, with $r^b$ in place of $r_I$.

As in the sub-Gaussian setting, there is a trade-off between confidence and accuracy. When the number of outliers is smaller than $N r^b_I(A)$, confidence and accuracy are constant. When $|\mathcal O|$ becomes larger than the threshold $N r^b_I(A)$, the confidence is improved while the accuracy deteriorates. The conclusion is the same as in the sub-Gaussian case: the error rate in the contaminated setting is the maximum of the error rate in the non-contaminated setting and the proportion of outliers.

A concrete example: the class of linear functionals in $\mathbb R^p$ with the Huber loss function
To put into perspective the results obtained in Section 2.1, we apply Theorem 1 to linear regression in $\mathbb R^p$. For the sake of brevity we do not present the analogous application of Theorem 2. In the vocabulary of Section 1, the class $F$ of predictors is defined as $F = \{\langle t, \cdot\rangle, t \in \mathbb R^p\}$. Let $(X_i, Y_i)_{i=1}^N$ be random variables defined by the following linear model:
$$Y_i = \langle X_i, t^*\rangle + \epsilon_i, \qquad (5)$$
where $(X_i)_{i=1}^N$ are i.i.d. Gaussian random vectors in $\mathbb R^p$ with zero mean and covariance matrix $\Sigma$. The random variables $(\epsilon_i)_{i \in I}$ are centered and independent of $X_i$; for the moment, nothing more is assumed on $(\epsilon_i)_{i \in I}$. It is clear that Assumption 1 holds. The Empirical Risk Minimizer with the Huber loss function is defined as
$$\hat t^\delta_N \in \operatorname*{argmin}_{t \in \mathbb R^p} \frac 1N \sum_{i=1}^N \ell_\delta(\langle X_i, t\rangle, Y_i), \qquad (6)$$
where $\ell_\delta(\cdot, \cdot)$ is the Huber loss function defined for any $\delta > 0$ and $u, y \in \mathcal Y = \mathbb R$ by
$$\ell_\delta(u, y) = \begin{cases} \tfrac 12 (y - u)^2 & \text{if } |y - u| \le \delta, \\ \delta |y - u| - \tfrac{\delta^2}{2} & \text{otherwise,} \end{cases}$$
which satisfies Assumption 3 with $L = \delta$. All along this section, $\delta$ will be considered as a constant (i.e. independent of the sample size $N$ and the dimension $p$). The class $F - f^*$ is sub-Gaussian and Assumption 4 holds with $B = 1$.
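The Lipschitz property of the Huber loss (Assumption 3 with $L = \delta$) can be checked numerically on a grid; the value of $\delta$ and the grid below are arbitrary.

```python
import numpy as np

def huber_loss(u, y, delta):
    """Huber loss l_delta(u, y): quadratic for small residuals,
    linear (slope delta) for large ones; convex and delta-Lipschitz in u."""
    r = y - u
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * np.abs(r) - 0.5 * delta ** 2)

# Check that all secant slopes of u -> l_delta(u, y) stay below delta.
delta, y = 1.5, 0.0
u = np.linspace(-10, 10, 20001)
slopes = np.abs(np.diff(huber_loss(u, y, delta)) / np.diff(u))
print(slopes.max())   # never exceeds delta (up to floating-point error)
```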
To apply Theorem 1, it remains to study the local Bernstein assumption for the Huber loss function.We recall the result from [15].Let us introduce the following assumption.
b) Let $\varepsilon, C$ be the constants defined in a). There exists $\alpha > 0$ such that, for all $x \in \mathbb R^p$ and all $z$ in a neighborhood of $\langle x, t^*\rangle$, the noise puts mass at least $\alpha$ on an interval of length $\delta$ around $z - \langle x, t^*\rangle$ (see [15], Theorem 7). Grant Assumption 7. The Huber loss function with parameter $\delta > 0$ satisfies the Bernstein condition with $A = 4/\alpha$; point a) holds with $C = 3$. Moreover, from the model (5), point b) can be rewritten as: for all $x \in \mathbb R^p$ and all $z \in \mathbb R$ such that $|z - \langle x, t^*\rangle| \le 18r$,
$$F(z - \langle x, t^*\rangle + \delta/2) - F(z - \langle x, t^*\rangle - \delta/2) \ge \alpha, \qquad (7)$$
where $F$ denotes the cumulative distribution function of $\epsilon$ distributed as $\epsilon_i$ for any $i \in I$.
The sufficient condition (7) implies that the noise puts enough mass around zero.
To finish, we need to compute the complexity parameter $r_I(4/\alpha)$. For an absolute constant $c > 0$, well-known computations (see [47]) give $r_I(4/\alpha) \le (c\delta/\alpha)\sqrt{\operatorname{Tr}(\Sigma)/N}$, where we used the fact that $|I| \ge N/2$ and $L = \delta$.
We are now in position to apply Theorem 1 to Huber's M-estimator in $\mathbb R^p$.

Theorem 3. Let $(X_i, Y_i)_{i=1}^N$ be generated by the linear model (5), where $(\epsilon_i)_{i \in I}$ are i.i.d. centered random variables independent of $(X_i)_{i \in I}$ for which there exists $\alpha > 0$ such that, for all $z \in \mathbb R$ with $|z|$ smaller than a constant multiple of the rate of convergence below,
$$F(z + \delta/2) - F(z - \delta/2) \ge \alpha, \qquad (8)$$
where $F$ denotes the cdf of $\epsilon$ distributed as $\epsilon_i$ for $i \in I$ and $\delta$ is the hyperparameter of the Huber loss function. Nothing is assumed on $(\epsilon_i)_{i \in \mathcal O}$. Then, with exponentially large probability, the estimator $\hat t^\delta_N$ defined in Equation (6) satisfies
$$\|\hat t^\delta_N - t^*\|_2 \le \frac{c\delta}{\alpha}\Big(\sqrt{\frac{\operatorname{Tr}(\Sigma)}{N}} + \frac{|\mathcal O|}{N}\Big).$$

In Theorem 3 there is no assumption on $|\mathcal O|$ other than $|\mathcal O| \le |I|$. There are two situations: 1) the number of outliers $|\mathcal O|$ is smaller than $\sqrt{\operatorname{Tr}(\Sigma) N}$; we then obtain the optimal rate of convergence $\sqrt{\operatorname{Tr}(\Sigma)/N}$ for linear regression in $\mathbb R^p$ with an exponentially large probability. 2) The number of outliers exceeds $\sqrt{\operatorname{Tr}(\Sigma) N}$; in this case, the error rate and the excess risk deteriorate but the confidence is improved. According to [10], this rate is minimax optimal in the ε-contamination model for $\varepsilon = |\mathcal O|/N$. It follows that Theorem 3 is minimax-optimal for the problem of linear regression in $\mathbb R^p$ when malicious outliers contaminate the labels [10]. In Section A, we run simple simulations to illustrate the linear dependence between the error rate and the proportion of outliers.
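A small simulation in the spirit of Theorem 3 (this is not the paper's experiment of Section A): Huber's M-estimator, fitted by gradient descent, against least squares when 10% of the labels are replaced by malicious values. All sizes, seeds and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, delta = 1000, 5, 1.0
t_star = rng.standard_normal(p)
X = rng.standard_normal((N, p))
y = X @ t_star + 0.5 * rng.standard_normal(N)

# Corrupt 10% of the labels with large malicious values (the set O).
n_out = N // 10
y[:n_out] = 100.0

def huber_erm(X, y, delta, n_iter=500, lr=1.0):
    """Gradient descent on the smooth, convex empirical Huber risk."""
    t = np.zeros(X.shape[1])
    for _ in range(n_iter):
        psi = np.clip(y - X @ t, -delta, delta)   # derivative of the Huber loss
        grad = -(X.T @ psi) / len(y)              # gradient of the empirical risk
        t -= lr * grad
    return t

t_ols = np.linalg.lstsq(X, y, rcond=None)[0]
t_hub = huber_erm(X, y, delta)
ols_err = float(np.linalg.norm(t_ols - t_star))
hub_err = float(np.linalg.norm(t_hub - t_star))
print(ols_err, hub_err)   # least squares is wrecked, Huber barely moves
```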
Theorem 3 handles many different distributions for the noise, as long as Equation (8) is satisfied. It is not necessary for the noise to be sub-Gaussian, nor even integrable. For instance, when $\epsilon \sim \mathcal C(1)$ is a standard Cauchy distribution, for all $t \in \mathbb R$ we have $F(t) = 1/2 + \arctan(t)/\pi$. With straightforward computations, Equation (7) can be rewritten in terms of differences of arctangents (Equation (10)), and from Equation (10) one deduces when Equation (8) is satisfied. Let us fix $\delta > 0$ to be a quantity independent of the dimension $p$ and the number of observations $N$, and take $\alpha = 2\arctan(\delta/2)/\pi$. When $\sqrt N \ge c\sqrt p(1 + \delta)/\alpha$ and $|\mathcal O| \le c\alpha N$, the condition defined in Equation (8) holds and the local Bernstein condition (Assumption 5) is verified for $A = 4/\alpha$. We get the following corollary.
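For the standard Cauchy case, the mass condition can be evaluated in closed form; the interval $[-\delta/2, \delta/2]$ below matches the choice $\alpha = 2\arctan(\delta/2)/\pi$ used in the text, and the value of $\delta$ is arbitrary.

```python
import math

def cauchy_cdf(t):
    """CDF of the standard Cauchy distribution C(1)."""
    return 0.5 + math.atan(t) / math.pi

delta = 1.0   # arbitrary Huber parameter
# Mass the Cauchy noise puts on [-delta/2, delta/2]; despite having no
# finite mean, it satisfies the condition with alpha = 2*arctan(delta/2)/pi.
alpha = cauchy_cdf(delta / 2) - cauchy_cdf(-delta / 2)
print(alpha)
```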

High dimensional setting
In Section 2 we studied non-regularized procedures. If the class of predictors $F$ is too small, there is no hope of approximating $Y$ with $f^*(X)$. It is thus necessary to consider large classes of functions, which leads to a large error rate unless some extra low-dimensional structure is expected on $f^*$. Adding a regularization term to the empirical loss is a widespread method to induce this low-dimensional structure.
The regularization term reflects the belief the statistician may have about the oracle $f^*$. More formally, let $F \subset E \subset L_2(\mu)$ and let $\|\cdot\| : E \to \mathbb R_+$ be a norm defined on the linear space $E$. For any $\lambda > 0$, the regularized empirical risk minimizer (RERM) is defined as
$$\hat f^\lambda_N \in \operatorname*{argmin}_{f \in F} \frac 1N \sum_{i=1}^N \ell_f(X_i, Y_i) + \lambda\|f\|. \qquad (12)$$
In high-dimensional statistics, it is possible to impose a low-dimensional structure. For instance, the use of the $\ell_1$-norm promotes sparsity [48] for regression and classification problems in $\mathbb R^p$, while the 1-Schatten norm promotes low-rank solutions for matrix reconstruction. Up to some technicalities, the main result for the RERM is the same as the one in Section 2: the excess risk and the square of the error rate are of the order
$$\Big(r_N + L\frac{|\mathcal O|}{N}\Big)^2,$$
where $r_N$ denotes the (sparse or low-dimensional) error rate in the non-contaminated setting. As long as the proportion of outliers is smaller than the error rate, the RERM behaves as if there were no contamination.
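A minimal sketch of the RERM of Equation (12) for the $\ell_1$-norm and the Huber loss, solved by proximal gradient descent (ISTA). The step size, $\lambda$ and problem sizes are illustrative, not the tuning prescribed by the theory.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l1_huber_rerm(X, y, lam, delta=1.0, n_iter=1500, lr=0.4):
    """ISTA on the l1-penalized empirical Huber risk
       (1/N) sum_i l_delta(<x_i, t>, y_i) + lam * ||t||_1."""
    t = np.zeros(X.shape[1])
    for _ in range(n_iter):
        psi = np.clip(y - X @ t, -delta, delta)   # Huber score
        grad = -(X.T @ psi) / len(y)              # gradient of the smooth part
        t = soft_threshold(t - lr * grad, lr * lam)
    return t

rng = np.random.default_rng(4)
N, p, s = 300, 60, 3
t_star = np.zeros(p)
t_star[:s] = 2.0                                  # s-sparse oracle
X = rng.standard_normal((N, p))
y = X @ t_star + 0.3 * rng.standard_normal(N)
t_hat = l1_huber_rerm(X, y, lam=0.1)
print(t_hat[:5])
```

The soft-thresholding step is exactly what makes the iterates sparse, mirroring the role of the $\ell_1$-penalty in the analysis.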

General result in the sub-Gaussian framework
To analyze regularized procedures, we first need to redefine the complexity parameter.
Definition 4. Let $B$ be the unit ball induced by the regularization norm $\|\cdot\|$. The complexity parameter $\tilde r_I(\cdot, \cdot)$ is defined as
$$\tilde r_I(A, \rho) = \inf\Big\{ r > 0 : cALB\, w\big(F \cap (f^* + rB_{L_2} \cap \rho B)\big) \le r^2\sqrt{|I|} \Big\},$$
where $c > 0$ denotes an absolute constant, $L$ is the Lipschitz constant from Assumption 3 and $B$ the sub-Gaussian constant from Assumption 4.
The main difference between $r_I(A)$ from Definition 2 and $\tilde r_I(A, \rho)$ is that $\tilde r_I(A, \rho)$ measures the local complexity of $F \cap (f^* + \rho B)$, whereas $r_I(A)$ measures the local complexity of the entire set $F$ around $f^*$. The regularization shifts the estimator towards a neighborhood of the oracle $f^*$ with respect to the regularization norm. To deal with the regularization part, we use the tools from [34]. The idea is the following: the $\ell_1$ norm induces sparsity properties because it has large subdifferentials at sparse vectors. Therefore, to obtain "sparsity dependent bounds", i.e. bounds depending on the unknown sparsity of the oracle $f^*$, a natural tool is to look at the size of the subdifferential of $\|\cdot\|$ at $f^*$, where we recall that the subdifferential of $\|\cdot\|$ at $f$ is defined as
$$\partial\|\cdot\|(f) = \{z^* \in E^* : \|f + h\| - \|f\| \ge z^*(h) \text{ for every } h \in E\},$$
where $E^*$ is the dual space of the normed space $(E, \|\cdot\|)$. The subdifferential can also be written, when $f \ne 0$, as
$$\partial\|\cdot\|(f) = \{z^* \in S^* : z^*(f) = \|f\|\},$$
where $B^*$ is the unit ball of the dual norm associated with $\|\cdot\|$, i.e. $z^* \in E^* \mapsto \|z^*\|^* = \sup_{\|f\| \le 1} z^*(f)$, and $S^*$ is its unit sphere. In other words, when $f \ne 0$, the subdifferential of $\|\cdot\|$ at $f$ is the set of all vectors $z^*$ in the unit dual sphere $S^*$ which are norming for $f$. For any $\rho > 0$, let
$$\Gamma_{f^*}(\rho) = \bigcup_{f \in f^* + (\rho/20)B} \partial\|\cdot\|(f).$$
Instead of looking at the subdifferential of $\|\cdot\|$ exactly at $f^*$, we consider subdifferentials of functions $f \in F$ "close enough" to the oracle $f^*$. This makes it possible to handle oracles $f^*$ that are not exactly sparse but approximately sparse. The main technical tool to analyze regularized procedures is the following sparsity equation [34].

Definition 5. For $A, \rho > 0$, set
$$\Delta(\rho, A) = \inf_{\substack{f \in F:\ \|f - f^*\| = \rho \\ \|f - f^*\|_{L_2} \le \tilde r(A, \rho)}} \ \sup_{z^* \in \Gamma_{f^*}(\rho)} z^*(f - f^*).$$
We say that $\rho^*$ satisfies the $A$, $\tilde r$-sparsity equation if $\Delta(\rho^*, A) \ge \frac{4}{5}\rho^*$.
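The subdifferential description can be made concrete for the $\ell_1$-norm in $\mathbb R^p$, where the dual ball is the $\ell_\infty$-ball; the helper below is illustrative.

```python
import numpy as np

def in_l1_subdifferential(z, f, tol=1e-10):
    """Check that z belongs to the subdifferential of ||.||_1 at f:
    z_i = sign(f_i) on the support of f, and |z_i| <= 1 elsewhere."""
    on = f != 0
    return bool(np.all(np.abs(z[on] - np.sign(f[on])) <= tol)
                and np.all(np.abs(z[~on]) <= 1 + tol))

f = np.array([2.0, 0.0, -1.0, 0.0])
# At a sparse vector the subdifferential is large: every zero coordinate
# is free to take any value in [-1, 1].
print(in_l1_subdifferential(np.array([1.0, 0.3, -1.0, -0.9]), f))  # True
print(in_l1_subdifferential(np.array([1.0, 1.5, -1.0, 0.0]), f))   # False
```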
The constant $4/5$ in Definition 5 could be replaced by any constant in $(0, 1)$. The sparsity equation is a very general and powerful tool allowing one to derive "sparsity dependent bounds" by taking $\rho^*$ as a function of the unknown sparsity (see Section 3.3 for a more explicit example, or [14, 34] for many other illustrations). Finally, we adapt the local Bernstein assumption to this new framework.
Assumption 8. There exist $A > 0$ and $\rho^*$ satisfying the $A$, $\tilde r$-sparsity equation from Definition 5 such that for all $f \in F$ with $\|f - f^*\|_{L_2} \le \tilde r(A, \rho^*)$ and $\|f - f^*\| \le \rho^*$, we have $\|f - f^*\|^2_{L_2} \le A\,P\mathcal L_f$.

We are now in position to state the main theorem of this section.
Theorem 4. Let $\tilde r(\cdot, \cdot)$ be such that for all $A, \rho > 0$, $\tilde r(A, \rho) \ge \tilde r_I(A, \rho)$. Grant Assumptions 1, 2, 3 and 4. Suppose that Assumption 8 holds with $\rho = \rho^*$ satisfying the $A$, $\tilde r$-sparsity equation from Definition 5, and set the regularization parameter $\lambda$ of the order of $\tilde r^2(A, \rho^*)/\rho^*$. As long as $|\mathcal O| < c|I|\tilde r(A, \rho^*)/(AL)$, with probability larger than $1 - 2\exp\big(-c|I|\tilde r^2(A, \rho^*)/(AL)^2\big)$, the estimator $\hat f^\lambda_N$ satisfies $\|\hat f^\lambda_N - f^*\|_{L_2} \le \tilde r(A, \rho^*)$.

By taking $\tilde r(A, \rho^*) = c\max(\tilde r_I(A, \rho^*), AL|\mathcal O|/|I|)$, the condition $|\mathcal O| < c|I|\tilde r(A, \rho^*)/(AL)$ is necessarily satisfied and, with exponentially large probability, we get
$$\|\hat f^\lambda_N - f^*\|_{L_2} \le c\Big(\tilde r_I(A, \rho^*) + \frac{AL|\mathcal O|}{N}\Big).$$
The error rate can be decomposed as the sum of the error rate in the non-contaminated setting and the proportion of outliers $|\mathcal O|/N$. Theorem 4 is a "meta" theorem in the sense that it can be used for many practical problems. We use Theorem 4 for the $\ell_1$-penalized Huber's M-estimator in Section 3.3. It is also possible to use Theorem 4 for many other convex and Lipschitz loss functions and regularization norms, as is done in [14]. It can also be used for matrix reconstruction problems by penalizing with the 1-Schatten norm [34].
General routine to apply Theorem 4. This short paragraph explains how Theorem 4 can be used in practice.
2. Compute the localized Gaussian mean-width $w\big(F \cap (f^* + rB_{L_2} \cap \rho B)\big)$ for any $r, \rho > 0$ and deduce the value of $\tilde r_I(A, \rho)$ for any $A, \rho > 0$.
3. Choose a new complexity parameter such that for every $A, \rho > 0$, $\tilde r(A, \rho) \ge \tilde r_I(A, \rho)$. For instance, to derive results in the contaminated setting we take $\tilde r(A, \rho) = c\max(\tilde r_I(A, \rho), AL|\mathcal O|/N)$. From the computation of $\tilde r_I(A, \rho)$, deduce the closed form of $\tilde r(A, \rho)$.
4. For a fixed constant $A > 0$, find $\rho^* > 0$ satisfying the $A$, $\tilde r$-sparsity equation, where $\tilde r(\cdot, \cdot)$ is the complexity parameter chosen in the previous step.
5. From the value of $\rho^*$, compute $\tilde r(A, \rho^*)$ for any $A > 0$.
6. Find a constant $A > 0$ verifying Assumption 8.

General result in the locally bounded framework
In Section 3.1, we established a meta theorem for the analysis of the RERM when the class $F - f^*$ is sub-Gaussian. In this section, we provide another meta theorem when the class $F - f^*$ is locally bounded. Contrary to the main result in the non-regularized case, the neighborhood is now defined with respect to both the $L_2(\mu)$ norm and the regularization norm.

Definition 6. Let $B$ be the unit ball induced by the regularization norm $\|\cdot\|$. The complexity parameter $\tilde r^b_I(\cdot, \cdot)$ is defined as
$$\tilde r^b_I(A, \rho) = \inf\Big\{ r > 0 : cAL\, \mathbb E \sup_{f \in F \cap (f^* + rB_{L_2} \cap \rho B)} \Big|\sum_{i \in I} \sigma_i (f - f^*)(X_i)\Big| \le r^2 |I| \Big\},$$
where $(\sigma_i)_{i \in I}$ are i.i.d. Rademacher random variables independent of $(X_i)_{i \in I}$, $c > 0$ denotes an absolute constant and $L$ is the Lipschitz constant from Assumption 3. We now adapt the sparsity equation and the local Bernstein condition to this new complexity parameter.
Finally, the following assumption imposes boundedness and a Bernstein condition in a small neighborhood around the oracle $f^*$.

Assumption 9. Let $\tilde r^b(\cdot, \cdot)$ be such that for all $A, \rho > 0$, $\tilde r^b(A, \rho) \ge \tilde r^b_I(A, \rho)$. There exist $A, M > 0$ and $\rho^*$ satisfying the $A$, $M$, $\tilde r^b$-sparsity equation from Definition 7 such that for all $f \in F$ with $\|f - f^*\|_{L_2} \le \tilde r^b(A, \rho^*)$ and $\|f - f^*\| \le \rho^*$, we have $\|f - f^*\|^2_{L_2} \le A\,P\mathcal L_f$ and $\|f - f^*\|_{L_\infty} \le M$.

Assumption 9 generalizes the local Bernstein condition and the local boundedness assumption to the regularized case. In this setting, the neighborhood around the oracle $f^*$ can be much smaller than in the non-regularized setting. In particular, in Section 3.4, the localization with respect to the norm in the RKHS imposes local boundedness of $F - f^*$. We are now in position to state the main theorem of this section.
Theorem 5. As long as $|\mathcal O| < c|I|\tilde r^b(A, \rho^*)/(AL)$, with exponentially large probability, the estimator $\hat f^\lambda_N$ defined in Equation (12) satisfies $\|\hat f^\lambda_N - f^*\|_{L_2} \le \tilde r^b(A, \rho^*)$.

By taking $\tilde r^b(A, \rho^*) = c\max(\tilde r^b_I(A, \rho^*), AL|\mathcal O|/|I|)$, the condition $|\mathcal O| < c|I|\tilde r^b(A, \rho^*)/(AL)$ is necessarily satisfied and we get
$$\|\hat f^\lambda_N - f^*\|_{L_2} \le c\Big(\tilde r^b_I(A, \rho^*) + \frac{AL|\mathcal O|}{N}\Big).$$
The error rate can be decomposed as the sum of the error rate in the non-contaminated setting and the proportion of outliers $|\mathcal O|/N$. Theorem 5 is a "meta" theorem in the sense that it can be used for many practical problems.
General routine to apply Theorem 5. This short paragraph explains how Theorem 5 can be used in practice.
The main difference with the application of Theorem 4 in the sub-Gaussian setting is that Assumption 4 is no longer available. Instead, it is necessary to verify that the class $F - f^*$ is locally bounded by a constant $M$.

Application to $\ell_1$-penalized Huber's M-estimator with sub-Gaussian design
In this section we apply the routine of Theorem 4 to the study of the $\ell_1$-penalized Huber's M-estimator when the design $X$ is assumed to be Gaussian.
Step 1: Under such assumptions, it is clear that Assumptions 1, 2, 3 (with $L = \delta$) and 4 (with $B = 1$) are verified. All along this section, $\delta$ will be considered as a constant.
Step 2: Let us turn to the second step, i.e. the computation of the local Gaussian mean-width. Since $X$ is isotropic, i.e. $\mathbb E\langle X, t\rangle^2 = \|t\|_2^2$ for every $t \in \mathbb R^p$, we have $w\big(F \cap (f^* + rB_{L_2} \cap \rho B)\big) = w(rB_2^p \cap \rho B_1^p)$ for every $r, \rho > 0$, where $B_q^p$ denotes the $\ell_q$ ball in $\mathbb R^p$ for $q > 0$. Well-known computations (see [56] for example) bound this width by $c\big(\rho\sqrt{\log p} \wedge r\sqrt p\big)$, and a closed form for $\tilde r_I(A, \rho)$ follows.

Step 3: For any $A, \rho > 0$, let us define $\tilde r(A, \rho) = c\max(\tilde r_I(A, \rho), A\delta|\mathcal O|/|I|)$.
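The $\sqrt{\log p}$ factor in this localized width comes from the $\ell_1$-ball, whose Gaussian mean-width $w(B_1^p) = \mathbb E \max_i |G_i|$ is of order $\sqrt{2\log p}$. A Monte Carlo check (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n_mc = 10_000, 400

# w(B_1^p) = E sup_{||t||_1 <= 1} <t, G> = E max_i |G_i|, which grows like
# sqrt(2 log p) -- the logarithmic factor in the localized width above.
G = rng.standard_normal((n_mc, p))
w_hat = np.abs(G).max(axis=1).mean()
print(w_hat, np.sqrt(2 * np.log(p)))
```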
From Step 2, since |I| ≥ N/2, we easily get: Step 4: To verify the A, r̃-sparsity equation from Definition 5 for the ℓ1-norm and to compute ρ*, we use a result from [34].
where F_ε denotes the cdf of ε, a random variable distributed as ε_i for i ∈ I.
We are now in position to state the main result for the ℓ1-penalized Huber estimator.
where t* is s-sparse and (ε_i)_{i∈I} are i.i.d. centered random variables independent of (X_i)_{i∈I} such that there exists α > 0 such that where F_ε denotes the cdf of ε, a random variable distributed as ε_i for i ∈ I, and δ is the hyperparameter of the Huber loss function. Nothing is assumed on (ε_i)_{i∈O}. Set Then with probability larger than Let us analyze the two different cases. 1) When the number of outliers |O| is smaller than s log(p)N, the regularization parameter λ does not depend on the unknown sparsity. We obtain the (nearly) minimax-optimal rate in sparse linear regression in R^p with exponentially large probability [4,34,16]. Using more involved computations and taking a regularization parameter λ depending on the unknown sparsity, we can get the exact minimax rate of convergence s log(p/s)/N. 2) When the number of outliers exceeds s log(p)N, the value of λ depends on the unknown quantities |O| and s. The error rate deteriorates (but the confidence improves) and becomes linear in the proportion of outliers |O|/N. From [10], this error rate is minimax optimal (up to a logarithmic term) in the ε-contamination problem when ε = |O|/N. It follows that Theorem 6 is minimax-optimal (up to a logarithmic term) when |O| malicious outliers contaminate the labels. In Section A, we run simple simulations to illustrate the linear dependence between the error rate and the proportion of outliers.
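A small simulation sketch of this robustness (with arbitrary dimensions, noise level and tuning values δ, λ, not the theoretical choices of Theorem 6): the ℓ1-penalized Huber estimator, computed here by plain proximal gradient descent, remains accurate when 10% of the labels are replaced by gross outliers.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, s, delta, lam = 200, 50, 5, 1.0, 0.15

t_star = np.zeros(p); t_star[:s] = 1.0
X = rng.normal(size=(N, p))
y = X @ t_star + 0.5 * rng.normal(size=N)
y[:20] = 1e3                       # 20 malicious labels (|O|/N = 10%)

def psi(u):                        # derivative of the Huber loss
    return np.clip(u, -delta, delta)

step = 1.0 / np.linalg.eigvalsh(X.T @ X / N).max()
t = np.zeros(p)
for _ in range(1000):              # proximal gradient (ISTA)
    grad = -X.T @ psi(y - X @ t) / N
    z = t - step * grad
    t = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold

err = np.linalg.norm(t - t_star)
print(round(err, 3))               # stays bounded despite the gross outliers
```

The Huber score clips the residuals of the corrupted points at δ, so their total pull on the gradient is of order δ|O|/N, which is exactly the extra term appearing in the error rate.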
Remark 2. In Theorem 6 we assumed that µ = N(0, I_p) in order to apply Lemma 1 and compute the local Gaussian mean-width. The result can be generalized to Gaussian random vectors with a covariance matrix Σ satisfying RE(s, 9) [54], where s is the sparsity of t*. Recall that a matrix Σ is said to satisfy the restricted eigenvalue condition RE(s, c0) with constant κ > 0 if ‖Σ^{1/2}v‖_2 ≥ κ‖v_J‖_2 for any vector v ∈ R^p and any set J ⊂ {1, …, p} such that |J| ≤ s and ‖v_{J^c}‖_1 ≤ c0‖v_J‖_1. When Σ satisfies the RE(s, 9) condition with κ > 0, we get the same conclusion as in Theorem 6 up to an extra factor 1/κ in front of the error rate (see Section C for a precise result).
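Certifying RE(s, c0) exactly is hard in general, but the condition can be sanity-checked on random vectors drawn from the cone it constrains. The AR(1)-type covariance and the sampling scheme below are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
p, s, c0 = 50, 5, 9.0
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # toy AR(1) covariance
Sroot = np.linalg.cholesky(Sigma)   # Sigma = Sroot @ Sroot.T

ratios = []
for _ in range(2000):
    v = rng.normal(size=p)
    J = np.argsort(-np.abs(v))[:s]              # take J = top-s coordinates
    vJ = np.zeros(p); vJ[J] = v[J]
    off = v - vJ
    # rescale the off-support part so the cone constraint ||v_{J^c}||_1 <= c0 ||v_J||_1 holds
    bound = c0 * np.abs(vJ).sum()
    if np.abs(off).sum() > bound:
        off *= bound / np.abs(off).sum()
    w = vJ + off
    # ||Sigma^{1/2} w||_2 / ||w_J||_2 lower-bounds candidate values of kappa
    ratios.append(np.linalg.norm(Sroot.T @ w) / np.linalg.norm(vJ))

print(min(ratios))   # empirical estimate of (an upper bound on) the RE constant kappa
```

Such a check can only refute RE (a ratio near zero would), not prove it, since the cone is sampled rather than exhausted.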

Application to RKHS with the Huber loss function
This section is mainly inspired by the work [1]. We present another example of application of our main results. In particular, we use the routine associated with Theorem 5 for the problem of learning in a Reproducing Kernel Hilbert Space (RKHS) H_K [45] associated with a positive definite kernel K. We improve the results of [1] on two points: 1) we can take F = H_K, while in [1] the authors restrict themselves to the case F = R B_{H_K} for R > 0, where B_{H_K} denotes the unit ball of H_K; and 2) the Bayes rule (i.e. the minimizer of the risk over all measurable functions) does not have to belong to R B_{H_K}, and no margin assumption [2] is required.
We are given N pairs (X_i, Y_i)_{i=1}^N of random variables, where the X_i's take their values in some measurable space X and Y_i ∈ R. We introduce a kernel K : X × X → R measuring the similarity between elements of X, i.e. K(x1, x2) is small if x1, x2 ∈ X are "similar". The main idea of kernel methods is to transport the design data X_i from the set X to a certain Hilbert space via the map x ↦ K(x, ·) := K_x(·) and to construct a statistical procedure in this "transported" and structured space. The kernel K is used to generate a Hilbert space known as a Reproducing Kernel Hilbert Space (RKHS). Recall that if K is a positive definite function, i.e. for all n ∈ N, all x_1, …, x_n ∈ X and all c_1, …, c_n ∈ R, Σ_{i,j=1}^n c_i c_j K(x_i, x_j) ≥ 0, then by Mercer's theorem there exists an orthonormal basis, where (λ_i)_{i=1}^∞ is the sequence of eigenvalues (arranged in non-increasing order) of T_K and φ_i is the eigenvector corresponding to λ_i, where The Reproducing Kernel Hilbert Space H_K is the set of all functions of the form An alternative way to define an RKHS is via the feature map Φ: since (φ_k)_k is an orthogonal basis of H_K, it is easy to see that the unit ball of H_K can be expressed as where ⟨·, ·⟩_2 is the standard inner product in the Hilbert space ℓ_2. In other words, the feature map Φ can then be used to define an isometry between the two Hilbert spaces H_K and ℓ_2. The RKHS H_K is therefore a convex class of functions from X to R that can be used as a learning class F. Let us assume that Y_i = f*(X_i) + ε_i, where (X_i)_{i=1}^N are i.i.d. random variables taking values in X. The random variables (ε_i)_{i∈I} are symmetric i.i.d. random variables independent of (X_i)_{i∈I}, and f* is assumed to belong to H_K. It follows that the oracle f* is also defined as where ℓ_δ is the Huber loss function. Let f be in H_K; by the reproducing property and the Cauchy-Schwarz inequality we have, for all x, y in X, From Equation (23), it is clear that the norm of a function in the RKHS controls how fast the function varies over X with respect
to the geometry defined by the kernel (f is Lipschitz with constant ‖f‖_{H_K}). As a consequence, the regularization norm ‖·‖_{H_K} is related to the degree of smoothness of f with respect to the metric defined by the kernel on X. The estimator f̂δ,λN we study in this section is defined as We obtain error rates depending on the spectrum (λ_i)_{i=1}^∞ of the integral operator T_K.
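The Lipschitz bound |f(x) − f(y)| ≤ ‖f‖_{H_K} (K(x,x) − 2K(x,y) + K(y,y))^{1/2} from Equation (23) can be verified numerically; the Gaussian kernel and the random finite-rank function below are illustrative assumptions, not objects from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

def K(a, b):                       # Gaussian (RBF) kernel
    return np.exp(-0.5 * (a - b) ** 2)

anchors = rng.normal(size=8)       # f = sum_i c_i K(x_i, .) lies in H_K
c = rng.normal(size=8)

def f(x):
    return sum(ci * K(xi, x) for ci, xi in zip(c, anchors))

Gram = K(anchors[:, None], anchors[None, :])
norm_f = np.sqrt(c @ Gram @ c)     # ||f||_{H_K}^2 = c^T Gram c

# check |f(x) - f(y)| <= ||f||_{H_K} * ||K_x - K_y||_{H_K} on random pairs
ok = True
for _ in range(1000):
    x, y = rng.normal(size=2) * 3
    lhs = abs(f(x) - f(y))
    rhs = norm_f * np.sqrt(max(K(x, x) - 2 * K(x, y) + K(y, y), 0.0))
    ok = ok and lhs <= rhs + 1e-9
print(ok)
```

The inequality is a direct consequence of Cauchy-Schwarz in H_K, so it holds for every pair; the loop is only a numerical confirmation.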
Assumption 10. The eigenvalues (λ_i)_{i=1}^∞ of the integral operator T_K satisfy λ_n ≤ c n^{-1/p} for some 0 < p < 1, where c > 0 is an absolute constant.
In Assumption 10, the value of p is related to the smoothness of the space H_K. Other kinds of spectra could be analyzed; this would only change the computation of the complexity fixed points. For the sake of simplicity we focus on this example, as it has also been studied in [7,42] to obtain fast rates of convergence.
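The spectrum (λ_i) of T_K can be approximated by the eigenvalues of the empirical Gram matrix divided by N; the Gaussian kernel, design distribution and sample size below are arbitrary, and the experiment only illustrates the kind of fast decay Assumption 10 quantifies.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 500
Xs = rng.normal(size=N)

Gram = np.exp(-0.5 * (Xs[:, None] - Xs[None, :]) ** 2)
# eigenvalues of Gram/N approximate the spectrum (lambda_i) of the integral operator T_K
lams = np.sort(np.linalg.eigvalsh(Gram / N))[::-1]

print(lams[:5])        # rapid decay for a smooth (Gaussian) kernel
```

The faster this decay, the smaller the complexity fixed points and the faster the rates of convergence obtained below.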
Let us use the routine to apply Theorem 5.
Step 1: Since every Reproducing Kernel Hilbert Space is convex, it is clear that Assumptions 1, 2 and 3 with L = δ are verified.
Step 2: From Theorem 2.1 in [41], if K is a bounded kernel, then for all ρ, r > 0 .
Step 5: From Step 4, we easily get Let ε, C be the constants defined in a). There exists α > 0 such that, for all x ∈ R^p and all z ∈ R satisfying The only difference with Assumption 7 is that point a) is only required for functions f ∈ F such that ‖f − f*‖ ≤ ρ.
Proposition 2. Grant Assumption 7. The Huber loss function with parameter δ > 0 satisfies the Bernstein condition for A = 4/α: Therefore, point a) holds with C = (ρ‖K‖_∞/r)^{ε/(2+ε)}. Let us turn to point b) of Assumption 11. From the fact that C = (ρ‖K‖_∞/r)^{ε/(2+ε)}, we have (√2 C)^{(2+ε)/ε} r = 2^{(2+ε)/(2ε)} ρ‖K‖_∞, and point b) can be rewritten as: there exists α > 0 such that where F_ε denotes the cdf of ε, a random variable distributed as ε_i for i ∈ I. Equation (25) simply means that the noise puts enough mass around 0. In our problem we have ρ = ρ* = c‖f*‖_{H_K}, and Equation (25) becomes We are now in position to state our main theorem for regularized learning in RKHS with the Huber loss function.
Theorem 7. Let H_K be a Reproducing Kernel Hilbert Space associated with a bounded kernel K.
where f* belongs to H_K and (ε_i)_{i∈I} are i.i.d. symmetric random variables independent of (X_i)_{i∈I} such that there exists α > 0 such that where F_ε denotes the cdf of ε, a random variable distributed as ε_i for i ∈ I, and δ is the hyperparameter of the Huber loss function. Nothing is assumed on (ε_i)_{i∈O}. Grant Assumption 10 and let Then with probability larger than Theorem 7 holds with no assumption on the design X. When |O| ≤ (δ/α)N r̃bI(4α, ‖f*‖_{H_K}) we recover the same rates as [44,42], even when the target Y is heavy-tailed. In [44,42] the authors assume that Y is bounded, while in [7] the noise is assumed to be light-tailed. When |O| ≥ (δ/α)N r̃bI(4α, ‖f*‖_{H_K}) the error rate deteriorates and becomes linear in the proportion of outliers. It is assumed that the noise is symmetric and satisfies Equation (26). When the noise is a standard Cauchy random variable, Equation (26) can be rewritten as which holds for δ = c‖f*‖_{H_K}‖K‖_∞ and α = arctan(δ/2). When δ, ‖K‖_∞ and ‖f*‖_{H_K} are seen as constants, the error rate is of order N^{-1/(p+1)}. Depending on the value of p, we obtain fast rates of convergence for regularized kernel methods: the faster the spectrum of T_K decreases, the faster the rates of convergence.
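For standard Cauchy noise the small-ball quantity behind Equation (26) has a closed form, F(q) − F(−q) = 2 arctan(q)/π, which is where an arctan(δ/2)-type constant comes from; a quick numerical check (the value of δ is arbitrary):

```python
import math

# standard Cauchy cdf: F(x) = 1/2 + arctan(x)/pi, so F(q) - F(-q) = 2*arctan(q)/pi
def cauchy_mass(q):
    return 2 * math.atan(q) / math.pi

delta = 2.0
print(cauchy_mass(delta / 2))  # mass near 0 is positive although the noise has no mean
```

This illustrates why the result applies to non-integrable noise: only the mass the noise puts around 0 matters, not its moments.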

Conclusion and perspectives
We have presented general analyses of ERM and RERM when a number |O| of outliers may contaminate the labels, when 1) the class F − f* is sub-Gaussian or 2) the class F − f* is locally bounded. We use these "meta theorems" to study Huber's M-estimator, either without regularization or penalized by the ℓ1-norm. Under a very weak assumption on the noise (it need not even be integrable), we obtain minimax-optimal rates of convergence for these two examples when |O| malicious outliers corrupt the labels. We also obtain fast rates for regularized learning problems in RKHS when the target Y is unbounded and heavy-tailed. For the sake of simplicity, we have presented only two examples of applications; many procedures can be analyzed as has been done in [14], such as the Group Lasso, Fused Lasso, SLOPE, etc. The results can easily be extended when the sub-Gaussian assumption on F − f* is relaxed: it would only degrade the confidence in the main theorems (assuming for example that the class is sub-exponential), and the conclusion would be similar. As long as the proportion of outliers is smaller than the rate of convergence, both ERM and RERM behave as if there were no contamination. However, in such settings ERM and RERM are known to be sub-optimal, which is why such results have not been presented in this paper.
Note that other loss functions could be considered, such as the absolute loss or, more generally, any quantile loss function. According to Theorem 3, we have where c > 0 is an absolute constant. We add malicious outliers following a uniform distribution over [−10^5, 10^5]. We expect the error rate to be proportional to the proportion of outliers |O|/N. We ran our simulations with N = 1000 and p = 50. The only hyperparameter of the problem is δ; for the sake of simplicity we took δ = 1 for all our simulations. We see in Figure 1 that, no matter the noise, the error rate is proportional to the proportion of outliers, in agreement with our theoretical findings. In a second experiment, we study the ℓ1-penalized Huber M-estimator defined as t̂λ,δN ∈ argmin where ℓδ : R × R → R+ is the Huber loss function and λ > 0 is a hyperparameter. According to Theorem 6 we have where c > 0 is an absolute constant. We ran our simulations with N = 1000, p = 1000 and s = 50. The hyperparameters of the problem are δ and λ; for the sake of simplicity we took δ = 1 and λ = 10^{-3} for all our simulations. We see in Figure 2 that, no matter the noise, the error rate is proportional to the proportion of outliers, in agreement with our theoretical findings. The fact that the error rate may be large comes from the fact that we did not optimize the value of λ. This section builds on the work [10], where the authors establish a general minimax theory for the ε-contamination model defined as P(ε,θ,Q) = (1 − ε)Pθ + εQ, given a general statistical experiment {Pθ, θ ∈ Θ}: a proportion ε of outliers with the same distribution Q contaminates Pθ. Given a loss function L(θ1, θ2), the minimax rate for the class {P(ε,θ,Q), θ ∈ Θ, Q} depends on the modulus of continuity defined as: where TV(Pθ1, Pθ2) denotes the total variation distance between Pθ1 and Pθ2, defined as TV(Pθ1, Pθ2) = sup_{A∈F} |Pθ1(A) − Pθ2(A)|, for F the sigma-algebra on which Pθ1 and Pθ2 are defined.
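The modulus of continuity above is driven by total variation distances; for two unit-variance Gaussians TV has the closed form 2Φ(|µ0 − µ1|/2) − 1, which the grid computation below reproduces (the means are chosen arbitrarily for the check).

```python
import numpy as np
from math import erf, sqrt

def tv_gauss(mu0, mu1):
    # TV(N(mu0,1), N(mu1,1)) = (1/2) * integral of |p0 - p1|, computed on a fine grid
    grid = np.linspace(-20.0, 20.0, 400_001)
    dx = grid[1] - grid[0]
    p0 = np.exp(-0.5 * (grid - mu0) ** 2) / np.sqrt(2 * np.pi)
    p1 = np.exp(-0.5 * (grid - mu1) ** 2) / np.sqrt(2 * np.pi)
    return 0.5 * np.sum(np.abs(p0 - p1)) * dx

# closed form: TV = 2 * Phi(|mu0 - mu1| / 2) - 1
closed = 2 * 0.5 * (1 + erf((1.0 / 2) / sqrt(2))) - 1
print(tv_gauss(0.0, 1.0), closed)
```

Two parameters whose distributions are close in TV cannot be distinguished under ε-contamination, which is exactly why w(ε, Θ) appears in the minimax rate.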
w(ε, Θ) is the price to pay in the minimax rate when a proportion ε of the sample is contaminated. To illustrate Theorem 8, let us consider the linear regression model: where, without contamination, X_i ∼ N(0, Σ) and ε_i ∼ N(0, σ²) are independent.
In [9], the authors consider a setting where both the design X and the response variable in the model can be contaminated, i.e. (X_1, Y_1), …, (X_N, Y_N) ∼ (1 − ε)Pθ + εQ, with Pθ = P(X)P(Y|X), P(X) = N(0, Σ) and P(Y|X) = N(X^T θ, σ²). They establish that the minimax optimal risk over the class of s-sparse vectors for the metric L(θ1, θ2) = ‖θ1 − θ2‖_2 is given by The question of main interest in our setting is the following: does the minimax risk for regression problems in the ε-contamination model remain the same when only the labels are contaminated?
The following theorem answers the above question.
Suppose there is some M(0) such that (29) holds for ε = 0. Then, for any ε ∈ [0, 1], (29) holds for M(ε) = c (M(0) ∨ w(ε, Θ)). Theorem 9 states that the minimax optimal rates for regression problems in the ε-contamination model are the same when • Both the design X and the response variable Y are contaminated.
• Only the response variable Y is contaminated.

Straightforward computations give
and the sparsity equation is satisfied. Now let us turn to the case where ‖P_{I^c} w‖_1 ≤ 9‖P_I w‖_1. From the RE(s, 9) condition we have ‖P_I w‖_2 ≤ ‖Σ^{1/2} w‖_2/κ, and it follows Now let us turn to the computation of the Gaussian mean-width when the design X is not isotropic. To do so we use the following proposition.
When F = {⟨t, ·⟩, t ∈ R^p} and the covariance matrix of X is Σ, for every r, ρ > 0 we have where G ∼ N(0, I_p). If Σ is assumed to be invertible, we get where T := Σ^{1/2} B^p_1 is the convex hull of (±Σ^{1/2} e_i)_{i=1}^p. To apply Proposition 3 it is necessary to assume that Σ^{1/2} e_i ∈ B^p_2 for every i = 1, …, p, which holds when Σ_{i,i} ≤ 1, and we get Proposition 4. Let F = {⟨t, ·⟩, t ∈ R^p} and assume that Σ, the covariance matrix of X, is invertible and satisfies Σ_{i,i} ≤ 1 for every i = 1, …, p. Then, for every r, ρ > 0 Straightforward computations (see [34] for instance) show that Steps 3, 4, 5, 6 in Section 3.3 are not modified, and the following theorem extends Theorem 6 to a non-isotropic design. Stochastic arguments. First we identify the stochastic event on which the proof easily follows. Let, where for any Lemma 3. Grant Assumptions 1, 3, 2, 4 and 5 with r(·). Then there exists an absolute constant c > 0 such that the event Ω holds with probability larger than The proof of Lemma 7 requires several tools from sub-Gaussian random variables that we now introduce. Let ψ2(u) = exp(u²) − 1. The Orlicz space L_{ψ2} associated with ψ2 is defined as the set of all random variables Z on a probability space (Ω, A, P) such that ‖Z‖_{ψ2} < ∞, where Let (X_t)_{t∈T} denote a stochastic process indexed by a pseudo-metric space (T, d) satisfying the following Lipschitz condition: for all t, s ∈ T, ‖X_t − X_s‖_{ψ2} ≤ d(t, s). For such a process it is possible to control the deviation of sup_{t∈T} X_t in terms of the geometry of (T, d) through Talagrand's γ-functionals.
Theorem 11 ([35], Theorem 11.13). Let (X_t)_{t∈T} be a random process in L_1(Ω, A, P) indexed by a pseudo-metric space (T, d) such that, for all measurable sets A in Ω, Then there exists an absolute constant c > 0 such that for all u > 0 where γ2 is the majorizing measure integral γ(T, d, ψ2) and D(T) is the diameter of (T, d).
First note that Equation (36) implies Equation (37): by Jensen's inequality and the definition of ‖·‖_{ψ2} we get Moreover, from the Majorizing Measure Theorem [46, Theorem 2.1.1], when T is a subset of L2(µ) and d(s, t) = (E(X_s − X_t)²)^{1/2}, we have c1 w(T) ≤ γ2(T) ≤ c2 w(T) for two absolute constants c1, c2 > 0, where w(T) is the Gaussian mean-width of T defined in Definition 1. The corollary follows: Then, for any u ≥ log(2), with probability larger than 1 − exp(−u²), sup_{f,g∈} where c > 0 is an absolute constant, w(F̃) is the Gaussian mean-width of F̃ and D_{L2(µ)}(F̃) its L2(µ)-diameter.
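For a standard Gaussian the ψ2 (Orlicz) norm can be computed in closed form from the chi-square moment generating function E exp(Z²/c²) = (1 − 2/c²)^{−1/2} (valid for c² > 2): under the convention "smallest c with E exp(Z²/c²) ≤ 2" it equals √(8/3). A quick check:

```python
import math

# For Z ~ N(0,1):  E exp(Z^2/c^2) = (1 - 2/c^2)^(-1/2)  for c^2 > 2,
# so ||Z||_{psi_2} solves (1 - 2/c^2)^(-1/2) = 2, i.e. c = sqrt(8/3)
def moment(c):
    return (1 - 2 / c ** 2) ** -0.5

c_star = math.sqrt(8 / 3)
print(c_star, moment(c_star))
```

The exact constant depends on the normalization chosen for ψ2, which varies across references; the computation only illustrates how the Orlicz norm is pinned down by the Laplace-transform-type bound used in Lemma 5.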
The following lemma allows one to control the ψ2-norm of a sum of independent centered random variables. Lemma 4 ([8], Theorem 1.2.1). Let X_1, …, X_N be independent real random variables such that for all The following lemma connects ψ2-bounded random variables with the control of their Laplace transform. Lemma 5 ([8], Theorem 1.1.5). Let Z be a real-valued random variable. The following assertions are equivalent We are now in position to prove Lemma 7.
Proof. First we prove that Ω_I holds with probability larger than 1 − exp(−c|I|r²(A)/(ALB(1+ Let us assume that for any f, g in F̃ the following condition holds then, from Corollary 2, for any u ≥ log(2), there exists an absolute constant c > 0 such that, with probability larger than 1 − exp(−u²), sup To finish the proof it remains to show that Equations (39) and (40) hold. From Lemma 4 we get Thus, it remains to show that ‖(f − g)(X, Y) − E(f − g)(X, Y)‖_{ψ2} ≤ cLB‖f − g‖_{L2(µ)} for an absolute constant c > 0. To do so, we use Lemma 5. Let λ ≥ cLB/‖f − g‖_{L2(µ)}. From the symmetrization principle (Lemma 6.3 in [35]) and the contraction principle (Theorem 2.2 in [28]) we get where σ is a Rademacher random variable independent of (X, Y). From Assumption 4 we get where we used the fact that P_N L_{f̂N} ≤ 0, that we work on Ω_O, and the inequality |O| < (1/(2AL))|I|r(A).

D.2. Proof Theorem 2
The proof is very similar to that of Theorem 1. We present only the stochastic argument; the deterministic argument can be obtained by reproducing line by line the proof of Theorem 1.
Theorem 12 (Theorem 2.6, [27]). Let F be a class of functions bounded by M. For all t > 0, with probability larger than 1 − exp(−t), Proof. Let F̃ = {f ∈ F : ‖f − f*‖_{L2} ≤ max(1, √(LM)) r^b(A)}. Let (σ_i)_{i=1}^N be i.i.d. Rademacher random variables independent of (X_i, Y_i)_{i=1}^N; from the symmetrization and contraction lemmas (see [35]) we get where we used the definition Since A, L ≥ 1, taking t = |I|(r^b(A))²/(36A²(L + 1)²) concludes the proof for the informative data I. For the outliers O, we use the same arguments, since from Assumption 6 any function f in F satisfies |f(x) − f*(x)| ≤ M for all x ∈ X.

D.3. Proof Theorem 4
Let r̃(·, ·) be such that for all A, ρ > 0, r̃(A, ρ) ≥ r̃I(A, ρ), and let ρ* satisfy the A, r̃-sparsity equation, with A verifying Assumption 8. The proof is split into two parts and is very similar to that of Theorem 1.
First we identify a stochastic argument holding with large probability. Then we show that, on that event, ‖f̂λN − f*‖_{L2(µ)} ≤ r̃(A, ρ*) and ‖f̂λN − f*‖ ≤ ρ*. At the very end of the proof we control the excess risk P L_{f̂λN}, where f̂λN is defined in Equation (12). Let us fix λ = 41r̃²(A, ρ*)/(112Aρ*).

Stochastic arguments
The stochastic part is the same as in the proof of Theorem 1, with an added localization with respect to the regularization norm. First we identify the stochastic event on which the proof easily follows. Let, Moreover, by the triangle inequality we obtain
Assumption 6 is local around the oracle f*. The smallest radius corresponds to max(1, √(LM)) r^b_I(A). The bigger r^b(·), the stronger Assumption 6 is. We are now in position to state the main theorem for ERM in the bounded setting. Theorem 2. Let I ∪ O be a partition of {1, …, N} with |O| ≤ |I|. Let r^b(·) be a complexity parameter such that for all A > 0, r^b(A) ≥ r^b_I(A). Grant Assumptions 1, 3 with L ≥ 1, 2, and 6 with r^b(·) for A ≥ 1 and M > 0.

Assumption 7 .
Let F_{Y|X=x} be the conditional cumulative distribution function of Y given X = x. Let us assume that the following holds. a) There exist ε, C > 0 such that, for all

|O|/N. Step 6:
In Assumption 9 there are two conditions to verify: 1) the local Bernstein condition and 2) the local boundedness. Let us begin with the local Bernstein condition; we use the localized version of Theorem 1. Assumption 11. Let F_{Y|X=x} be the conditional cumulative distribution function of Y given X = x. Let us assume that the following holds. a) There exist ε, C > 0 such that, for all

Lemma 7. Let
Ω_I = {∀f ∈ F ∩ (f* + ρ*B ∩ r̃(A, ρ*)B_{L2(µ)}) : (P − P_I)(f − f*) ≤ r̃²(A, ρ*)/(4A(1 + L))},   (43)
Ω_O = {∀f ∈ F ∩ (f* + ρ*B ∩ r̃(A, ρ*)B_{L2(µ)}) : (P − P_O)|f − f*| ≤ (|I|/|O|) r̃²(A, ρ*)/(4A(1 + L))},   (44)
where we recall that B is the unit ball induced by the regularization norm ‖·‖. Finally, set Ω = Ω_I ∩ Ω_O. Grant Assumptions 1, 3, 2, 4 and 8 with r̃(·, ·). Then the event Ω holds with probability larger than
1 − 2 exp(−c|I|r̃²(A, ρ*)/(LBA(L + 1))).   (45)
Proof. The proof is exactly the same as in the non-regularized setup, with an added localization with respect to the regularization norm. It is enough to adapt the proof to the definition of r̃I(A, ρ) from Equation (4).
Deterministic argument. In this paragraph we place ourselves on the event Ω. Let us recall that for any function f in F,
P_N L^λ_f = P_N(f − f*) + λ(‖f‖ − ‖f*‖).   (46)
Let B̃ = ρ*B ∩ r̃(A, ρ*)B_{L2(µ)}. From the definition of f̂λN, we have P_N L^λ_{f̂λN} ≤ 0. To show that f̂λN ∈ F ∩ (f* + B̃), it is sufficient to show that for all functions f ∈ F \ (f* + B̃) we have P_N L^λ_f > 0. Let f be in F \ (f* + B̃). By convexity of F there exist a function f1 in F and α ≥ 1 such that α(f1 − f*) = f − f* and f1 ∈ ∂(f* + B̃), where ∂(f* + B̃) denotes the boundary of f* + B̃. Using the same convexity argument as in the proof of Theorem 1, we obtain: