Concentration study of M-estimators using the influence function

We present a new finite-sample analysis of M-estimators of location in $\mathbb{R}^d$ using the tool of the influence function. In particular, we show that the deviations of an M-estimator can be controlled via its influence function (or its score function); we then use concentration inequalities on M-estimators to investigate the robust estimation of the mean in high dimension in an adversarial corruption setting, for both bounded and unbounded score functions. For a sample of size $n$ and covariance matrix $\Sigma$, we attain the minimax rate $\sqrt{Tr(\Sigma)/n}+\sqrt{\|\Sigma\|_{op}\log(1/\delta)/n}$ with probability larger than $1-\delta$ in a heavy-tailed setting. One of the major advantages of our approach compared to recent alternatives is that our estimator is tractable and fast to compute even in very high dimension, with a complexity of $O(nd\log(Tr(\Sigma)))$ where $n$ is the sample size and $\Sigma$ is the covariance matrix of the inliers. In practice, the code that we make available for this article proves to be very fast.


Introduction
Recently, there has been an increase in the quantity and dimensionality of the data one has to handle in Machine Learning tasks. Big datasets are difficult to vet because it is no longer possible to check gigabytes or terabytes of data by hand in search of abnormal points, and the task is even harder in high dimension. One answer to this problem is robust statistics; in particular, because mean estimators are used everywhere in Machine Learning (empirical risk minimization, cross-validation, data preprocessing, feature engineering...), it has become critical to understand robust mean estimators in depth.
In this article, we study one type of mean estimator: M-estimators. We show that they are nearly optimal in the heavy-tailed setting and we exhibit an algorithm that computes these estimators in a linear number of steps. Furthermore, the algorithm is very fast in practice; see Section 7.3 for illustrations.
The analysis of M-estimators for mean estimation proceeds in two steps: first we analyse the deviations of the M-estimator around $T(P)$, a location parameter meant to capture a central tendency of the data, and then we bound the bias $\|T(P) - \mathbb{E}[X]\|$. Let $X \sim P$ for some probability distribution $P$ on a Hilbert space $H$ and let $\rho$ be an increasing function from $\mathbb{R}_+$ to $\mathbb{R}_+$. We are interested in the location parameter $T(P)$ defined by
$$T(P) = \operatorname*{argmin}_{\theta \in H} \mathbb{E}_P\left[\rho(\|X - \theta\|)\right], \qquad (1)$$
where $\|\cdot\|$ is a norm associated with a scalar product. Alternatively, if $\rho$ is smooth enough (which will be the case in this article), $T(P)$ is defined as the solution of
$$\mathbb{E}_P\left[\frac{X - T(P)}{\|X - T(P)\|}\,\psi(\|X - T(P)\|)\right] = 0, \qquad (2)$$
where $\psi = \rho'$ is called the score function. The estimator obtained by plugging the empirical distribution $\widehat{P}_n$ into equation (2) is called the M-estimator associated with $\psi$; it is denoted $T(X_1^n)$ and computed from a sample $X_1, \dots, X_n$ as the solution of
$$\sum_{i=1}^n \frac{X_i - T(X_1^n)}{\|X_i - T(X_1^n)\|}\,\psi(\|X_i - T(X_1^n)\|) = 0. \qquad (3)$$
A particular case of $T(P)$ is obtained when choosing $\psi(x) = x$, in which case $T(P) = \mathbb{E}[X]$ and $T(X_1^n) = \frac{1}{n}\sum_{i=1}^n X_i$; however, it is well known that the empirical mean is not robust. A careful choice of the function $\psi$ yields estimators that are more robust to outliers and to heavy-tailed data (see [5]).
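Equation (3) has no closed form in general, but it can be solved by iteratively reweighted means: each step averages the sample with weights $w_i = \psi(\|X_i - \theta\|)/\|X_i - \theta\|$. The sketch below is hypothetical illustration code, not the article's implementation; `huber_psi` and `m_estimate` are invented names, and the Huber form $\psi(x) = \min(x, \beta)$ is an assumption used only for the demonstration.

```python
import math

def huber_psi(r, beta):
    # Huber-type score (assumed form): identity near 0, capped at beta
    return min(r, beta)

def m_estimate(points, beta, n_iter=100):
    # Solve sum_i psi(||X_i - theta||)(X_i - theta)/||X_i - theta|| = 0
    # by the fixed point theta = sum_i w_i X_i / sum_i w_i
    # with weights w_i = psi(||X_i - theta||)/||X_i - theta||.
    d = len(points[0])
    theta = [sum(p[k] for p in points) / len(points) for k in range(d)]
    for _ in range(n_iter):
        total_w, num = 0.0, [0.0] * d
        for p in points:
            r = math.sqrt(sum((p[k] - theta[k]) ** 2 for k in range(d)))
            w = huber_psi(r, beta) / r if r > 0 else 1.0  # psi'(0) = 1 limit
            total_w += w
            for k in range(d):
                num[k] += w * p[k]
        theta = [num[k] / total_w for k in range(d)]
    return theta

# One far outlier barely moves the estimate, unlike the empirical mean.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
est = m_estimate(pts, beta=2.0)
```

With $\psi(x) = x$ (no capping) the same iteration returns the empirical mean, recovering the non-robust special case mentioned above.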
The subsequent problem is to see how the properties of $\psi$ impact the robustness and efficiency of $T(X_1^n)$ when estimating $T(P)$ or $\mathbb{E}[X]$. To study the robustness of $T(X_1^n)$ we use the influence function, a common tool in robust statistics used to quantify the robustness of an estimator; see for example [19,20,27,36], in which properties such as the asymptotic variance or the breakdown point of $T(X_1^n)$ are derived from the influence function. The influence function is the Gâteaux derivative of $T$ evaluated at the Dirac distribution in a point $x \in H$; for M-estimators in $\mathbb{R}^d$, from [20, Eq 4.2.9 in Section 4.2C.], it takes the following simple form:
$$IF(x, T, P) = M_{P,T}^{-1}\,\frac{x - T(P)}{\|x - T(P)\|}\,\psi(\|x - T(P)\|), \qquad (4)$$
where $M_{P,T}$ is a non-singular matrix whose explicit formula is not important for most of our applications because of our choice of $\psi$ function (an explicit formula can however be found in the proof of Theorem 3).
The general idea is that, if the estimator is smooth enough (for example Fréchet or Hadamard differentiable, see [17]), then one can write the expansion
$$T(Q) = T(P) + \int IF(x, T, P)\, d(Q - P)(x) + R(P, Q), \qquad (5)$$
where the remainder term $R(P,Q)$ is controlled. For example, if we apply equation (5) to $Q = \widehat{P}_n$ the empirical distribution, the influence function provides a first-order approximation of the difference between the estimator $T(\widehat{P}_n) = T(X_1^n)$ and its limit $T(P)$. This technique of approximating the estimator by its influence function is also linked to the Bahadur decomposition; see [2] and [22] for applications to M-estimators. The influence function of M-estimators is usually chosen bounded in robust statistics: in particular, from [18,27], if $\psi$ is bounded then the influence function is bounded and $T$ is qualitatively robust (i.e. the estimator $T(X_1^n)$ is equi-continuous, c.f. [27]) with asymptotic breakdown point $1/2$. On the other hand, if $\psi$ is unbounded, then $T(X_1^n)$ is not qualitatively robust, the influence function is not bounded and the asymptotic breakdown point is zero. From Hampel's Theorem [27, Theorem 2.21] we also have that $\psi$ is bounded if and only if $T$ is a continuous functional with respect to the Lévy metric. More generally, the influence function has been used in many works on asymptotic robustness, see [20,27] or [19,36].
The influence function has also been used recently in the Machine Learning literature as a model-selection tool specialized in robustness, see for example [11], [28] and the closely related tool of leave-one-out error [16]. The field of robustness in Machine Learning has been very active in the last few years, in particular after several works by Olivier Catoni and co-authors [5,6]; the goal is to prove non-asymptotic deviation bounds when the data are more heavy-tailed than what is usually considered in classical Machine Learning. This line of thought has been continued in a number of articles; in particular, [13] introduced a general concept of sub-Gaussian estimators that has since been used successfully in other applications, see [9,14,5,31,39,35]. See also the comprehensive lecture notes [30].
It is interesting to note that, contrary to works from the classical robust theory of the 70's, the influence functions of the M-estimators used by Catoni are not necessarily bounded. In this article, we initiate the analysis of the effect of an unbounded influence function on the robustness of M-estimators. Huber [26] tells us that the influence function must be bounded, while Catoni uses an unbounded influence function and still proves robustness properties for this type of estimator; the difference lies in their respective visions of what a robust estimator is.
Our analysis has three parts. First, we extend Catoni's non-asymptotic analysis of M-estimators to more general influence functions and to a multivariate setting, using the properties of the influence function to link the deviations of the influence function and those of $T(X_1^n)$. Second, we apply our theory to three specific M-estimators for which we show tight upper bounds on the rate of convergence to the mean. Finally, we investigate an algorithm to compute $T(X_1^n)$ and we show that this algorithm converges in a reasonable number of steps.
More precisely, in Section 3, we show that concentration inequalities for M-estimators derive from concentration inequalities on the influence function, by showing roughly that
$$\mathbb{P}\left(\|T(X_1^n) - T(P)\| \geq \lambda\right) \lesssim \mathbb{P}\left(\left\|\frac{1}{n}\sum_{i=1}^n \frac{X_i - T(P)}{\|X_i - T(P)\|}\,\psi(\|X_i - T(P)\|)\right\| \geq c\lambda\right) \qquad (6)$$
for some constant $c$ depending on $\psi$. By equation (4), the right-hand side of equation (6) can be interpreted as the deviation of the influence function, and it can be controlled under classical assumptions. For example in $\mathbb{R}$, if $\psi$ is bounded by $\beta > 0$ (Huber estimator), we can use Hoeffding's or Bernstein's inequality to control $\|T(X_1^n) - T(P)\|$. Using Hoeffding's inequality, we obtain a concentration rate similar to that of the empirical mean on Gaussian data.
Remark that Bernstein's inequality does not require $\psi$ to be bounded; hence it gives us a means to show sub-Gaussian rates for M-estimators with unbounded $\psi$ functions.
In Section 4, we provide bounds on the bias $\|T(P) - \mathbb{E}[X]\|$ and on the variance terms appearing in the concentration inequality from Section 3. Bounding the bias has often been a problem in robust statistics: if the distribution is skewed and the bias is not controlled, we can only say that we estimate a quantity meant to quantify a central tendency of $P$, but we do not estimate $\mathbb{E}[X]$ directly. In statistical learning, however, estimating the mean is not an arbitrary choice: we do not want to estimate some central tendency of the dataset, we want its mean. In this article, we give explicit bounds on the bias and we use them (in Section 5) to give concentration results on $T(X_1^n)$ around $\mathbb{E}[X]$ in the context of heavy-tailed and adversarially corrupted datasets, even beyond the $L^2$ case. Indeed, we show that the $L^2$ assumption is needed only to handle the bias $\|T(P) - \mathbb{E}[X]\|$; if the distribution is symmetric, $T(P) = \mathbb{E}[X]$ and we can obtain sharp deviation bounds even in the case of $L^1$ distributions. Bounds on the bias already exist for the specific case of the Huber estimator in regression [37]. Our bound on the bias has the same flavor as [37, Proposition 1], but applied to the multivariate location parameter of more general M-estimators. Similarly to the discussion in [37], our analysis leads to a bias-variance trade-off for M-estimators depending on the value of $\beta$.
In this context, in Section 5, we show that $T(X_1^n)$ is suitable to estimate the mean in high dimension in a heavy-tailed and corrupted setting (even though our estimators are not minimax in the corrupted setting). In the literature, some estimators have strong theoretical guarantees but are intractable, for example estimators based on the aggregation of one-dimensional estimators (in the spirit of projection pursuit), see [30, Theorem 44] and references therein, see also [32]; there are also estimators based on depth, for example Tukey's median [7]. On the other hand, there are tractable algorithms with non-minimax-optimal rates of convergence, for example the coordinate-wise median or the geometric median [33,6]. Our work belongs to this second type of method: our estimator is easily computable and, even though the obtained error bounds are much better than those of the coordinate-wise median, it is not minimax, at least in the corrupted setting. Recently there have been several proposals of algorithms aiming to be simultaneously tractable and minimax, see [15,40,23,12,24,8]; however, these algorithms are often hard to implement and in practice their complexity makes them intractable for high-dimensional problems.
In a corrupted setting where the inliers have a finite $q$-th moment, $q > 2$, we control the deviations of M-estimators. Suppose that the data form an adversarial $\varepsilon_n$-corrupted dataset (see Assumption 1). Then, under some assumptions, we obtain, informally, with probability at least $1 - \delta$,
$$\|T(X_1^n) - \mathbb{E}[X]\| \lesssim \sqrt{\frac{Tr(\Sigma)}{n}} + \sqrt{\frac{\|\Sigma\|_{op}\log(1/\delta)}{n}} + \varepsilon_n^{1-1/q}\,\mathbb{E}\left[\|X - \mathbb{E}[X]\|^q\right]^{1/q}.$$
See Proposition 2 below for the formal and more precise statement in the case of Huber's estimator, and see Section 5 for other examples, in particular for M-estimators with unbounded score functions. In the heavy-tailed setting where $\varepsilon_n = 0$, the bound is almost minimax optimal, the difference being that the confidence level $t$ is only allowed to be at most of order $n$. This estimator constitutes one of the few tractable and efficient estimators of the multivariate mean in the heavy-tailed setting.
The error due to corruption is of order $\varepsilon_n^{1-1/q}\,\mathbb{E}[\|X - \mathbb{E}[X]\|^q]^{1/q}$. We cannot avoid this dimension dependence for our estimator, contrary to [12,29], in which the authors show that the optimal error due to corruption is $O(\varepsilon\sqrt{\|\Sigma\|_{op}})$ in the case of Gaussian distributions. Hence our estimator is not optimal with respect to corruption.
Finally, in Section 7, we exhibit an algorithm to compute $T(X_1^n)$ and we show that it converges in a finite number of steps, resulting in a complexity of order $O(nd\log(Tr(\Sigma)))$, where $n$ is the sample size and $\Sigma$ is the covariance matrix of the inliers (our analysis is only valid for $d \ll e^n$). This algorithm is efficient in practice; it is illustrated in Section 7.3, and we invite the interested reader to check out the github repository https://github.com/TimotheeMathieu/RobustMeanEstimator for the python code.

Setting
Throughout the article, we consider $X_1, \dots, X_n$ in a Hilbert space $H$ that have been corrupted by an adversary using the following process. Assumption 1. There exist $X'_1, \dots, X'_n \in H$ i.i.d. following a law $P$ that have been modified by an "adversary" to obtain $X_1, \dots, X_n$. The adversary can modify at most $|O|$ points, where $I, O$ form a partition of $\{1, \dots, n\}$, with $I \cup O = \{1, \dots, n\}$ and $|O| < |I|$.
The points $(X_i)_{i \in O}$ are arbitrary and are called outliers; the points $(X_i)_{i \in I}$ are called inliers. Remark that the statistician knows neither $I$ nor $O$, and that although $X_i = X'_i$ for every $i \in I$, the sample $(X_i)_{i \in I}$ is in general not i.i.d., because the adversary can choose which data are corrupted using knowledge of the inliers. For instance, the adversary may corrupt the $|O|$ points of $X'_1, \dots, X'_n$ that are closest to the theoretical mean, in which case the $(X_i)_{i \in I}$ are not independent.
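The mechanism just described can be made concrete with a small sketch (hypothetical helper; `corrupt` is an invented name, and the "closest to the mean" attack is just one possible adversarial choice):

```python
def corrupt(sample, n_out, outlier_value):
    # The adversary replaces the n_out points closest to the mean of the
    # original i.i.d. sample; the surviving inliers are then biased away
    # from the center, hence no longer an i.i.d. draw from P.
    mu = sum(sample) / len(sample)
    targets = sorted(range(len(sample)), key=lambda i: abs(sample[i] - mu))[:n_out]
    corrupted = list(sample)
    for i in targets:
        corrupted[i] = outlier_value
    return corrupted, set(targets)

xs = [0.1, -0.2, 0.05, 1.5, -1.4, 0.0]
ys, O = corrupt(xs, n_out=2, outlier_value=50.0)
```

Here the indices in `O` are exactly the inner points of the sample, so the remaining inliers over-represent the tails of $P$.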
We consider the functional $T$ defined by
$$\mathbb{E}_P\left[\frac{X - T(P)}{\|X - T(P)\|}\,\psi(\|X - T(P)\|)\right] = 0 \qquad (7)$$
for some $\psi : \mathbb{R}_+ \to \mathbb{R}_+$; the existence and uniqueness of $T(P)$ are discussed in Lemma 2. We are interested in the behavior of the associated M-estimator $T(X_1^n)$, defined as the solution of
$$\sum_{i=1}^n \frac{X_i - T(X_1^n)}{\|X_i - T(X_1^n)\|}\,\psi(\|X_i - T(X_1^n)\|) = 0, \qquad (8)$$
where $\psi$ is a function that satisfies the following properties.
When we want to emphasize the dependency in $\beta$, we will use the notation $\psi_1$, defined so that $\psi(x) = \beta\psi_1(x/\beta)$ for all $x \ge 0$.
Because $\psi$ is concave, non-decreasing and not identically zero, there is always a couple of positive constants $\beta, \gamma$ such that Assumptions 2 hold. For our results to hold, we will ask that $\beta$ and $\gamma$ are not too small. A first consequence of Assumptions 2, together with some additional assumptions, is that our problem is well defined; this comes from the fact that the problem is convex. We have the following lemma, whose proof is in Section C.1.
Lemma 1. Let $\psi$ satisfy Assumptions 2, let $u \in S$ and $\theta \in H$, where $\mathrm{Jac}$ denotes the Jacobian operator.
Using the previous lemma, we can prove, under additional hypotheses, that the problem is well defined.
Lemma 2. Under suitable moment and calibration hypotheses on $P$ and $\beta$, $T(P)$ defined by equation (7) exists and is unique.

The proof of this lemma is postponed to the appendix. Note that a hypothesis on $Tr(\Sigma)$ is much stronger than a first-moment assumption, because it supposes a finite second moment. In the whole article we will suppose that $T(P)$ is unique; we do not necessarily suppose that the assumptions of Lemma 2 hold, because they are not minimal assumptions for the existence and uniqueness of $T(P)$. For simplicity, however, we will suppose that the following condition is verified.
Assumption 3 means that $\beta$ should be large enough to encompass most of the weight of $X$ around its expectation. When $\psi$ is bounded, this is only a first-moment assumption. This assumption is due to limitations in our methods and we believe it is not necessary. One can show that $\rho(x) \ge \psi(x)^2/2$, which may be used to simplify Assumption 3.
Assumptions 2 and 3 will be supposed to hold throughout. The behavior of $\psi$ at $0$ allows us to control the deviations of the estimator using the influence function (see Section 3) and is also important to control the bias of the resulting estimator (see Section 4). On the other hand, the growth rate of $\psi$ at $+\infty$ is central to derive concentration bounds on $T(X_1^n)$, as will become clear throughout Sections 3 and 5. Assumptions 2 do not apply to all M-estimators; for example, the sample median is not an estimator derived from a function $\psi$ satisfying these assumptions. We now provide three examples of score functions satisfying Assumptions 2, with three different growth rates as $x$ goes to infinity.
Huber's estimator. Let $\beta > 0$. For all $x \ge 0$, let
$$\psi_H(x) = \min(x, \beta).$$
In dimension 1, the M-estimator constructed from this score function is called Huber's estimator [26]. Catoni's estimator. Let $\beta > 0$. For all $x \ge 0$, let
$$\psi_C(x) = \beta\log\left(1 + \frac{x}{\beta} + \frac{x^2}{2\beta^2}\right).$$
The associated M-estimator is one of the estimators considered by Catoni in [5]; we call the resulting M-estimator Catoni's estimator. Polynomial estimator. Let $p \in \mathbb{N}^*$, $\beta > 0$. For all $x \ge 0$, let $\psi_P(x)$ be a concave score function with polynomial growth of order $x^{1/p}$ at infinity. We call Polynomial estimator the M-estimator obtained using this score function. The following result shows that the score functions from the previous three examples satisfy Assumptions 2.
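The first two score functions can be compared numerically. In the snippet below (illustrative; the forms $\psi_H(x)=\min(x,\beta)$ and $\psi_C(x)=\beta\log(1+x/\beta+x^2/(2\beta^2))$ are the ones assumed above), finite differences check the common normalization $\psi'(0)=1$, domination by the identity, and the bounded vs. logarithmic growth regimes:

```python
import math

def psi_huber(x, beta):
    # bounded score: linear up to beta, then constant
    return min(x, beta)

def psi_catoni(x, beta):
    # unbounded score with logarithmic growth at infinity
    return beta * math.log(1.0 + x / beta + x * x / (2.0 * beta * beta))

beta = 2.0
h = 1e-6
slope_h = psi_huber(h, beta) / h    # ~ psi_H'(0) = 1
slope_c = psi_catoni(h, beta) / h   # ~ psi_C'(0) = 1
# both scores stay below the identity psi(x) = x (the non-robust case),
# since log(1 + u + u^2/2) <= u for u >= 0
dominated = all(psi_huber(x, beta) <= x and psi_catoni(x, beta) <= x + 1e-12
                for x in [0.5, 1.0, 5.0, 50.0])
```

The growth at infinity is what separates the examples: $\psi_H$ is constant past $\beta$, while $\psi_C$ keeps growing, but only logarithmically.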
The proof of Lemma 3 is postponed to Section C.3.

Notations
Let $\mathcal{P}$ denote the set of probability distributions on $H$ and $S = \{x \in H : \|x\| = 1\}$, where $\|\cdot\|$ is a norm associated with a scalar product $\langle\cdot,\cdot\rangle$ on $H$. For any $\psi : \mathbb{R}_+ \to \mathbb{R}_+$, let $\mathcal{P}_\psi = \{P \in \mathcal{P} : \mathbb{E}_P[\psi(\|X\|)] < \infty\}$. We write $a \lesssim b$ if there exists a numerical constant $C > 0$ such that $a \le Cb$.
Let $T_H$, $T_C$ and $T_P$ denote the functionals associated, via equation (7), to the score functions $\psi_H$, $\psi_C$ and $\psi_P$ respectively. Define the variance terms
$$V_\psi = \mathbb{E}\left[\psi(\|X - T(P)\|)^2\right] \quad \text{and} \quad v_\psi = \sup_{u \in S}\, \mathbb{E}\left[\psi(\|X - T(P)\|)^2\left\langle \frac{X - T(P)}{\|X - T(P)\|}, u\right\rangle^2\right].$$
These variance terms are to be compared with $Tr(\Sigma)$ and $\|\Sigma\|_{op}$ in the multivariate Gaussian setting: Hanson-Wright's inequality (see equation (18)) tells us that $Tr(\Sigma)$ and $\|\Sigma\|_{op}$ describe the spread of the empirical mean in high dimension. Here we are not in a Gaussian setting, and for example in the case of Huber's estimator, $V_{\psi_H}$ and $v_{\psi_H}$ describe the spread of the influence function of Huber's estimator. See Section 4.2 for a more formal study of the link between $V_{\psi_H}$, $v_{\psi_H}$ and $Tr(\Sigma)$, $\|\Sigma\|_{op}$.

Tail probabilities of M-estimator and Influence function
The main result of the paper compares the tail probabilities of $\|T(X_1^n) - T(P)\|$ with those of its influence function. Definition 1. We call $t_T$ and $t_{IF}$ the tail probability functions defined for all $\lambda > 0$ and $\theta \in H$ by
$$t_T(\lambda; \theta, X_1^n) := \mathbb{P}\left(\|T(X_1^n) - \theta\| \ge \lambda\right) \quad \text{and} \quad t_{IF}(\lambda; \theta, X_1^n) := \mathbb{P}\left(\left\|\frac{1}{n}\sum_{i=1}^n \frac{X_i - \theta}{\|X_i - \theta\|}\,\psi(\|X_i - \theta\|)\right\| \ge \lambda\right).$$
The main theorem of Section 3 is the following.
Theorem 1. Suppose that Assumptions 1, 2 and 3 hold. If moreover $V_\theta = \mathbb{E}_P[\psi(\|X' - \theta\|)^2] \le \psi(\beta/2)^2/2 < \infty$ and $|O| \le n\gamma/8$, then for all $\lambda \in (0, \beta/2)$ and all $\theta \in H$, $t_T$ is controlled by $t_{IF}$. The proof of this result is given in Section A.1. In the heavy-tailed setting, we will use Theorem 1 with $\theta = T(P)$ in order to have a small value of $t_{IF}(\lambda; \theta, X_1^n)$. In a corrupted setting, $\theta$ will be set to $T(P)$ where $P$ is the law of the inliers. For now, there is no hypothesis on the outliers in $O$; in what follows, we will see that if $\psi$ is unbounded, we need some hypothesis on $X_i$, $i \in O$, in order to control the value of $t_{IF}$.
Remark that although M-estimators with bounded $\psi$ are known to have a breakdown point of $1/2$ (see [27]), our result is only valid for a proportion of outliers $|O|/n \le \gamma/8$. This is an artifact of the proof: with more stringent conditions on $\beta$, we could allow a higher breakdown point at the cost of a more complicated analysis. Remark, however, that when the proportion of corruption is close to $1/2$, the error due to corruption is the principal source of error, and a looser deviation analysis could then be tolerated in exchange for a higher breakdown point.
Example: Huber's estimator in dimension $d = 1$. In the case of Huber's estimator, $\psi$ is bounded and, from Lemma 3, $\gamma = 1$. Because $\psi_H$ is bounded by $\beta$, Bernstein's inequality directly gives, for all $t > 0$, a control of order $\sqrt{2V_{\psi_H}t/n} + \beta t/n$ on the empirical mean of the score. Then, by Theorem 1, if $V_{\psi_H} \le \beta^2/8$, we obtain a deviation bound of the same order on $|T_H(X_1^n) - T_H(P)|$ for all $t > 0$ such that $4\sqrt{2V_{\psi_H}t/n} + 4\beta t/n \le \beta/2$. Remark that choosing $\beta = \sqrt{V_{\psi_H}n}$ gives a sub-Gaussian concentration around $T_H(P)$; this is similar to the concentration inequalities introduced in [13], except that we concentrate around $T_H(P)$ instead of $\mathbb{E}[X]$. Remark also that the condition $V_{\psi_H} \le \beta^2/8$ is rather weak: we already have $V_{\psi_H} \le \beta^2$, and the condition asks that there is enough weight in the interval $[-\beta, \beta]$.
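The bounded-score regime can be checked on simulated heavy-tailed data. The sketch below is illustrative code under assumptions ($\psi_H(x)=\min(x,\beta)$, an iterative-reweighting solver, a symmetrized Pareto(2.5) sample centered at 5); it is not the article's benchmark code:

```python
import random

def huber_location(xs, beta, n_iter=200):
    # 1-d Huber M-estimator via iterative reweighting, started at the median
    theta = sorted(xs)[len(xs) // 2]
    for _ in range(n_iter):
        num = den = 0.0
        for x in xs:
            r = abs(x - theta)
            w = min(r, beta) / r if r > 0 else 1.0
            num += w * x
            den += w
        theta = num / den
    return theta

random.seed(0)
# symmetric heavy-tailed sample around 5: Pareto(2.5)-type tails on both sides
xs = [5.0 + random.choice([-1.0, 1.0]) * (random.random() ** -0.4 - 1.0)
      for _ in range(2000)]
est = huber_location(xs, beta=3.0)
```

In this symmetric example $T_H(P) = 5$, so the estimate targets the mean directly; in practice $\beta$ would be scaled as $\sqrt{V_{\psi_H}n}$ as discussed above.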

Bias and variance of M-estimators when considered as estimators of the mean
If $P$ is symmetric, then we can avoid the problem of the bias since $T(P) = \mathbb{E}[X]$; unfortunately, for skewed distributions the bias $\|T(P) - \mathbb{E}[X]\|$ can be very large, and the choice of $\beta$ determines how large it is. In this section, we show how the bias behaves as $\beta$ grows, and we also provide bounds on the variance terms defined in Section 2.2; these bounds will be useful to derive concentration inequalities on $T(X_1^n)$. We will use the notation $\psi(x) = \beta\psi_1(x/\beta)$ when we want to emphasize the dependency on $\beta$.
We introduce the function
$$Z_\beta(\theta) = \left\|\mathbb{E}\left[\frac{X - \theta}{\|X - \theta\|}\,\psi(\|X - \theta\|)\right]\right\|.$$
The following theorem links $Z_\beta$ with the distance between $T(P)$ and $\mathbb{E}[X]$.
Theorem 2. Let $X$ be a random vector in $H$, $X \sim P$ with finite expectation, and suppose Assumptions 2 and 3 hold; then $\|T(P) - \mathbb{E}[X]\|$ is controlled by $Z_\beta(\mathbb{E}[X])$. We postpone the proof to Section A.2. From Theorem 2, it is thus sufficient to upper bound $Z_\beta(\mathbb{E}[X])$ to get a bound on the bias.

Bias of M-estimators
We begin with the bias of Huber's estimator, obtained from equation (3) with $\psi = \psi_H$. The following lemma bounds the bias of Huber's estimator for a distribution with a finite number of finite moments.
The proof is in Section C.4. Lemma 4 is not exactly tight, as can be seen by doing the computation for the Pareto distribution in $d = 1$: with shape parameter $\alpha = 2$ there is only one finite moment, but one can show that in this parametric case Huber's estimator achieves a rate of order $1/\beta$.
The choice of $\beta$ is a very important problem when estimating $\mathbb{E}[X]$ with $T(X_1^n)$; in particular, we will need to choose $\beta$ carefully as a function of $n$ in order for $T(X_1^n)$ to converge to $\mathbb{E}[X]$, and this choice entails a sort of bias-variance trade-off. Remark that we do not need a finite second moment for our analysis to work: we only need $\mathbb{E}[\rho(\|X - T(P)\|/\beta)] < \infty$, which translates into a finite first moment in the case $\psi = \psi_H$.
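The bias side of this trade-off can be made concrete by computing $T_H(P)$ numerically for a skewed law. The sketch below is hypothetical code: it assumes $\psi_H(x)=\min(x,\beta)$ and uses the inverse CDF $X = U^{-1/3}$ of a Pareto distribution with shape 3 (median $2^{1/3}$, mean $3/2$), locating $T_H(P)$ by bisection on the population score:

```python
def pareto_score(theta, beta, n_grid=20000):
    # E[sign(X - theta) * min(|X - theta|, beta)] for X Pareto(shape 3, scale 1),
    # via the inverse CDF X = U^(-1/3): E[g(X)] = integral_0^1 g(u^(-1/3)) du
    s = 0.0
    for i in range(n_grid):
        u = (i + 0.5) / n_grid
        x = u ** (-1.0 / 3.0)
        r = x - theta
        s += (1.0 if r > 0 else -1.0) * min(abs(r), beta)
    return s / n_grid

def huber_T(beta):
    # bisection: the population score is decreasing in theta
    lo, hi = 1.0, 3.0
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if pareto_score(mid, beta) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

t_small, t_large = huber_T(0.1), huber_T(5.0)  # small beta ~ median, large ~ mean
```

As $\beta$ grows, $T_H(P)$ moves from (roughly) the median toward the mean, so the bias $|T_H(P) - \mathbb{E}[X]|$ shrinks, which is exactly the bias half of the trade-off.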
In addition to Lemma 4, we could also show an exponential bound on the bias when the random variable $X$ is sub-exponential; however, because the primary use of Huber's estimator is in robust statistics, we only state the result for a finite number of finite moments, as this is what will interest us. The interested reader can adapt the proof to lighter-tailed distributions.
For a ψ function that is not Huber's score function, the bias also depends strongly on the behavior of ψ near 0.
Lemma 5. Suppose that $\psi$ is $k$ times differentiable with bounded $k$-th derivative, that Assumptions 2 and 3 hold, that $\psi'(0) = 1$ and that $\psi^{(j)}(0) = 0$ for $2 \le j \le k-1$. Let $X$ be a random variable such that $\mathbb{E}[\|X\|^k] < \infty$; then the bias is of order $1/\beta^{k-1}$. Moreover, if $X$ follows a Bernoulli distribution of parameter $p$, this bound is tight in its dependency in $\beta$ as $\beta \to \infty$. This lemma is proven in Section C.5. For example, for Catoni's score function $\psi_1(x) = \log(1 + x + x^2/2)$, the second derivative is
$$\psi_1''(x) = -\frac{x + x^2/2}{(1 + x + x^2/2)^2},$$
which vanishes at $0$, and then the bias of Catoni's estimator is in general of order $1/\beta^2$. Lemma 5 shows that the bias depends on the smoothness of the function near $0$ and also on the number of finite moments.
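The application to Catoni's score can be sanity-checked numerically: with $\psi_1(x) = \log(1+x+x^2/2)$, central finite differences confirm $\psi_1'(0)=1$ and $\psi_1''(0)=0$, i.e. the hypotheses of Lemma 5 with $k=3$ and hence a bias of order $1/\beta^2$ (an illustrative check, not part of the article's proofs):

```python
import math

def psi1(x):
    # Catoni's normalized score; the formula extends smoothly to a
    # neighborhood of 0, which central differences need
    return math.log(1.0 + x + x * x / 2.0)

h = 1e-4
d1 = (psi1(h) - psi1(-h)) / (2 * h)               # ~ psi1'(0)
d2 = (psi1(h) - 2 * psi1(0.0) + psi1(-h)) / h**2  # ~ psi1''(0)
```

The first derivative comes out as 1 and the second as numerically 0, matching the closed-form computation of $\psi_1''$ above.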

Bound on the variance of M-estimators
First, we need to control the variability of $T(X_1^n)$ in order to control its deviations. The following lemma gives an upper bound on both $V_\psi$ and $v_\psi$ defined in Section 2.2.
Lemma 6. Suppose that Assumptions 2 and 3 are satisfied and that $X$ has a finite second moment with covariance operator $\Sigma$; then $V_\psi$ and $v_\psi$ are controlled in terms of $\Sigma$. Lemma 6 (proven in Section C.6) gives a control on $V_\psi$ and $v_\psi$ using the properties of $X$. Next, we show that Lemma 6 is tight in the case of Huber's estimator as long as $X$ is sufficiently concentrated, via the following lemma, whose proof is provided in Section C.7.

Lemma 7. Suppose that Assumptions 2 and 3 are satisfied and that
Lemmas 6 and 7 imply that if $X$ has enough moments, say 4 finite moments, and if $\beta$ is sufficiently large, then the variance term behaves like the variance term of the empirical mean. On the other hand, if $X$ is not very concentrated, Lemma 6 can be a very rough bound: in the case of Huber's estimator, if $X$ has only a finite first moment and no finite variance, then $V_{\psi_H}$ and $v_{\psi_H}$ are finite even though $Tr(\Sigma) = \|\Sigma\|_{op} = \infty$.

Application to the concentration of M-estimators around the mean in Corrupted Datasets
In this section, we investigate the concentration of the three M-estimators taken as examples in this article in a corrupted, heavy-tailed setting. The goal is to recover deviations similar to those we would have in a Gaussian setting, but when the data are not Gaussian. The gold standard in this context is the deviation of the empirical mean in a Gaussian setting (see [4]): if $X_1, \dots, X_n$ are i.i.d. from $\mathcal{N}(\mu, \sigma^2)$ for some $\mu \in \mathbb{R}$ and $\sigma > 0$, then for all $t > 0$, with probability at least $1 - 2e^{-t}$,
$$\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right| \le \sigma\sqrt{\frac{2t}{n}}. \qquad (17)$$
An equivalent in the multi-dimensional setting is Hanson-Wright's inequality [21]: let $X \sim \mathcal{N}(\mu, \Sigma)$ for $\Sigma$ a positive definite matrix and $\mu \in \mathbb{R}^d$; then, for any $t > 0$, with probability at least $1 - e^{-t}$,
$$\left\|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right\| \le \sqrt{\frac{Tr(\Sigma)}{n}} + \sqrt{\frac{2\|\Sigma\|_{op}t}{n}}. \qquad (18)$$
This form of Hanson-Wright's inequality can be found for example in [30]. Our aim is to obtain deviations similar to those in equations (17) and (18), but in a non-Gaussian setting.
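The Gaussian benchmark of the form (18) can be checked by simulation. The following sketch is illustrative (it assumes $\Sigma = I_d$, so $Tr(\Sigma) = d$ and $\|\Sigma\|_{op} = 1$) and estimates how often the empirical mean of a Gaussian sample exceeds the bound at level $\delta = 0.05$:

```python
import math, random

random.seed(1)
d, n, trials = 5, 200, 500
t = math.log(1 / 0.05)                           # confidence level, delta = 0.05
bound = math.sqrt(d / n) + math.sqrt(2 * t / n)  # sqrt(Tr/n) + sqrt(2*||.||op*t/n)

exceed = 0
for _ in range(trials):
    xbar = [sum(random.gauss(0.0, 1.0) for _ in range(n)) / n for _ in range(d)]
    if math.sqrt(sum(c * c for c in xbar)) > bound:
        exceed += 1
```

The empirical failure frequency stays below $\delta$ (in fact far below, since the bound is conservative for independent coordinates).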
The results we show in this section are not optimal: they are nearly optimal in the heavy-tailed setting, but the effect due to corruption is sub-optimal compared to [12]. One of our goals is to illustrate the use of the influence function, and particularly of Theorem 1, for an easy derivation of concentration inequalities for M-estimators. We also illustrate an interesting phenomenon derived from Theorem 1: the concentration of $T(X_1^n)$ around $T(P)$ can be much faster than the concentration of $T(X_1^n)$ around $\mathbb{E}[X]$, because the variance term is not $Tr(\Sigma)$ but $V_\psi$, which can be much smaller than $Tr(\Sigma)$ (for instance when $Tr(\Sigma)$ is not finite). It also shows that $T(X_1^n)$ is never arbitrarily bad as long as $\mathbb{E}[\psi(\|X\|)] < \infty$, contrary to the empirical mean. We use the following corollary of [1, Theorem 4], recalled in Section D; this lemma is proven in Section C.8.
Lemma 8. Let $Y_1, \dots, Y_n$ be i.i.d. random variables taking values in $H$, centered, with covariance operator $\Sigma$ and finite Orlicz norm. Then there exists a universal constant $C > 0$ such that, for all $t \ge 0$, the deviations of $\frac{1}{n}\sum_{i=1}^n Y_i$ are controlled as in equation (19). The last term in equation (19) can be handled using [38, Lemma 2.2.2], from which we get that there exists an absolute constant $K > 0$ bounding the Orlicz norm. However, note that Hanson-Wright's inequality for Gaussian random variables shows that this logarithmic factor is not optimal. This extra logarithmic factor can be removed if $Y$ is bounded, which will be the case when we apply this result to Huber's estimator, but not for Catoni's estimator.
In the rest of the section, we prove concentration inequalities for the estimators featured in Section 2, using Lemma 8 applied to $Y = \frac{X - T(P)}{\|X - T(P)\|}\,\psi(\|X - T(P)\|)$ and using the bounds on the bias from Section 4. For simplicity, we will not keep track of all the constants: we give the names $C_1, C_2, C_3$ to numerical constants that do not depend on any of the parameters of the model (in particular, they do not depend on $P$ or $\beta$).
Proposition 1. Suppose that Assumptions 1 and 3 are verified with $|O| \le n/32$. Then there exist some numerical constants $C_1, C_2, C_3 > 0$ such that the stated deviation bound holds, where $\varepsilon_n = |O|/n$ is the proportion of outliers.
Remark that the condition on $\beta$ can be simplified, if needed, using Lemma 6, to $\beta^2 \ge 8\,Tr(\Sigma)$. The second step is to choose the value of $\beta$; this choice is a trade-off between the bias term from Lemma 4 and the concentration bound of Proposition 1. Proposition 2. Suppose the same assumptions as in Proposition 1; then, for a suitable choice of $\beta$, with probability larger than $1 - 4e^{-t} - e^{-n/32}$, we have the following bound on the deviations.
When $\varepsilon_n = 0$, the previous proposition guarantees an optimal sub-Gaussian rate. Notice that $\varepsilon_n$ is multiplied by a quantity that in general increases with the dimension; this bound is not minimax, see [12], which achieves a sharper bound in the Gaussian setting. The dependency on $\varepsilon_n$ is $O(\varepsilon_n^{1-1/q})$; this type of bound is already present for example in [25], and the power $1 - 1/q$ is optimal (see [34, Lemma 5.4]). However, the $\mathbb{E}[\|X - \mathbb{E}[X]\|^q]^{1/q}$ factor in front of it is not optimal; we show in Section 6 that this factor is unavoidable for M-estimators. On the other hand, when $\varepsilon_n = 0$, we obtain sub-Gaussian rates of convergence as soon as $t \lesssim n$.
For equation (21) to hold, we must have a condition on $t$ that depends on the dimension; we could have stated a similar deviation bound under the alternative condition $t \le O\left(n^{\frac{q-2}{2q-2}}\right)$, avoiding the dimension dependence at the price of a worse dependency on $n$. For simplicity, we did not state it in the proposition. Remark that the dimension dependence is of order $\sqrt{d}$: by Jensen's inequality, denoting by $X^{(i)}$ the $i$-th coordinate of $X$, the moment factor is similarly of order $\sqrt{d}$. Finally, we present the symmetric case, for which there is no need for Lemma 4 because $T_H(P) = \mathbb{E}_P[X]$ holds right away, which simplifies the computations. In particular, we only need a finite first moment, and we can directly pick the minimal value of $\beta$ in Proposition 1 to get the following. Proposition 3. Suppose the same assumptions as in Proposition 1 and, moreover, suppose that $P$ is symmetric with $\mathbb{E}_P[\|X\|] < \infty$. Then there exist some numerical constants $C_1, C_2, C_3 > 0$ such that, for any $t \le C_1 n$, with probability larger than $1 - 4e^{-t} - e^{-n/32}$, the stated deviation bound holds, where $\varepsilon_n = |O|/n$ is the proportion of outliers. We see with Proposition 3 that we can relax the Gaussian-inliers assumption made in [12] to inliers that are symmetric, or inliers with an infinite number of finite moments, and still obtain a linear dependency in $\varepsilon_n$. We also see that this bound does not need a finite second moment: the law only needs to be symmetric with a finite first moment.
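Proposition 3's message, that symmetry replaces the second-moment requirement, can be illustrated on a symmetric law with infinite variance (tail index 1.5). The sketch below is hypothetical code (it reuses an iterative-reweighting solver for the assumed score $\psi_H(x)=\min(x,\beta)$, not the article's implementation):

```python
import random

def huber_location(xs, beta, n_iter=200):
    # 1-d Huber location via iterative reweighting (illustrative sketch)
    theta = sorted(xs)[len(xs) // 2]
    for _ in range(n_iter):
        num = den = 0.0
        for x in xs:
            r = abs(x - theta)
            w = min(r, beta) / r if r > 0 else 1.0
            num += w * x
            den += w
        theta = num / den
    return theta

random.seed(2)
# symmetric Pareto(1.5)-type tails: finite first moment, infinite variance
xs = [5.0 + random.choice([-1.0, 1.0]) * (random.random() ** (-2.0 / 3.0) - 1.0)
      for _ in range(4000)]
est = huber_location(xs, beta=3.0)
```

Despite $Tr(\Sigma) = \infty$ here, the estimator concentrates around the mean 5, because symmetry makes $T_H(P) = \mathbb{E}[X]$ and the score variance $V_{\psi_H}$ stays finite.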

Catoni's estimator
In the case of Catoni's estimator and of the Polynomial estimator, we will only prove a proposition similar to Proposition 2, but the equivalents of Propositions 1 and 3 could be stated using the same reasoning.
Let $\beta > 0$ and let $\psi_C$ be Catoni's score function introduced in Section 2. From Lemma 3, $\psi_C$ satisfies Assumptions 2 with $\gamma = 4/5$. Lemma 8, in addition to Theorem 1, can be used to obtain the following proposition.
Proposition 4. Fix the value of $\beta$ appropriately; then there exist some numerical constants $C_1, C_2, C_3 > 0$ such that the stated deviation bound holds. Proposition 4 gives results similar to those obtained for Huber's estimator in Proposition 2, under stronger assumptions on the corruption. If $\mathbb{E}[\|X\|^3]$ is finite, similar arguments show a bound of order $O(\varepsilon_n^{2/3})$; the interested reader can adapt the proof of Proposition 2. Contrary to Proposition 2, our method does not allow us to go beyond the third order because of the bias bound from Lemma 5.
The condition on the outliers is not very unusual. Indeed, if we suppose that the outliers are i.i.d. with law $P_O$ such that $\mathbb{E}_{P_O}[\psi_C(\|X - T_C(P)\|)] < \infty$, we can use Chebyshev's inequality, and then choose $C_O$ such that the right-hand side is strictly positive. Remark that we only suppose a finite moment for $\psi_C(\|X - T_C(P)\|)$, which is a logarithmic moment; this is a rather mild requirement on the outliers. On the other hand, if there is a fixed number of outliers, i.e. $\varepsilon_n = C/n$, we deduce from Proposition 4 that the speed of convergence does not deteriorate if the outliers are bounded almost surely by $\exp(\sqrt{nV_{\psi_C}})$, in which case $\delta_O = 0$. This is also a rather mild requirement on the outliers.

Polynomial estimator
Finally, we look at the Polynomial estimator defined for $p, \beta > 0$ in Section 2. Proposition 5. Suppose that Assumptions 1 and 3 are verified with a law $P$ satisfying $\mathbb{E}_P[\|X\|^2] < \infty$, with covariance matrix denoted $\Sigma$, and suppose the stated condition on the outliers. We fix the value of $\beta$ to $\beta^2 = Tr(\Sigma)$. Then there exist some numerical constants such that the stated deviation bound holds, where $\varepsilon_n = |O|/n$ is the proportion of outliers, supposed positive, and the parameter $p$ is fixed to $p = C_3 t$.
The proof of this proposition can be found in Section B.5. In this proposition we obtain weaker guarantees for the polynomial estimator: the t in the right-hand side of the bound is multiplied by Tr(Σ), compared to the smaller ‖Σ‖_op factor that we had before, hence this bound is not optimal. Nonetheless, the result is valid with very high probability. This may surprise the reader, but recall that p is a parameter tuned using the level t, so that in fact we use a function ψ_P that gets very close to a bounded function when t gets large.

Lower bound in corruption bias for M-estimators
In this section we suppose ψ bounded and H = R^d. Define P_ε = (1 − ε)P + εQ for some outlier probability Q and some ε ∈ (0, 1/2). The goal is to estimate the error we incur if we estimate the expectation of P using data from P_ε. Remark that this setting is more restrictive than the adversarial corruption setting used until now; indeed, we can see P_ε as a corrupted setting in which the adversary chooses to modify a random number of randomly chosen points.
Theorem 3 (proven in Section A.3) gives us a lower bound on the bias due to the corruption. Remark that a similar result already existed for the case of the geometric median, see [29, Proposition 2.1]. Our result extends the result of [29] to more general ψ functions using an alternative proof.
To make the link with the corrupted setting from Assumption 1, remark that if an adversary chooses ⌊(1 − ε)n⌋ inliers i.i.d. from P and ⌈εn⌉ outliers i.i.d. from Q, then the resulting empirical distribution converges to P_ε and, by continuity of T with respect to the weak topology (see Hampel's theorem in [27]), we have T(X_1^n) → T(P_ε). Then, remark that we have
In the right-hand side of (23), the term T(P_ε) − T(P) = T(P_ε) − E[X] stays bounded from below for any value of β by Theorem 3, and the term T(X_1^n) − T(P_ε) goes to 0 as n goes to infinity. More precisely, we have the following corollary of Theorem 3, a consequence of the consistency of the plug-in estimator T(P̂_n) = T(X_1^n) for any distribution P (see [27]).
Corollary 1. Suppose P = N(µ, σ²I_d). There exist a distribution Q and ε_max > 0 such that for all ε ∈ (0, ε_max), if an adversary chooses ⌊(1 − ε)n⌋ inliers i.i.d. from P and ⌈εn⌉ outliers i.i.d. from Q to form a sample X_1, . . . , X_n, we have
It is possible to quantify the rates in Corollary 1, but this is not really necessary: this already proves that we cannot hope to achieve a corruption error that does not depend on d, i.e. we cannot attain the minimax rate of convergence, which is in this case of order O(σε).
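The dimension dependence in Corollary 1 is easy to observe numerically. The following sketch is our own illustration, not taken from the paper: the helper `corruption_bias`, the choice of Q as a point mass far along e_1, and the Weiszfeld iteration for the geometric median are all illustrative assumptions. For Gaussian inliers N(0, I_d) and an ε-fraction of distant outliers, the shift of the geometric median away from the true mean 0 grows with d.

```python
import numpy as np

def geometric_median(X, n_iter=200):
    """Weiszfeld iterations (standard algorithm for the geometric median)."""
    theta = np.median(X, axis=0)
    for _ in range(n_iter):
        r = np.maximum(np.linalg.norm(X - theta, axis=1), 1e-12)
        w = 1.0 / r
        theta = (w[:, None] * X).sum(axis=0) / w.sum()
    return theta

def corruption_bias(d, eps=0.1, n=5000, seed=0):
    """Distance between the geometric median of an eps-corrupted
    N(0, I_d) sample and the true mean 0."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))     # inliers, E[X] = 0
    X[: int(eps * n), 0] = 1e6      # outliers placed far away along e_1
    return float(np.linalg.norm(geometric_median(X)))
```

In this experiment the measured bias grows with the dimension (roughly like εσ√d, in line with the lower bound), illustrating that a dimension-free corruption error cannot be achieved by such bounded-ψ estimators.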

M-estimators in practice
In this section, we give results for H = R^d, but they could be extended to more general Hilbert spaces provided that one uses a sufficiently accurate initialization instead of the coordinate-wise median used here.

Algorithm and convergence using iterative re-weighting
To compute T(X_1^n), we use an iterative re-weighting algorithm. This algorithm is well known for computing M-estimators, see [27, Section 7], and it has already been extensively studied. The principle is to rewrite the definition of T(X_1^n) from Equation (3), which gives an expression of T(X_1^n) as a weighted sum:
The weights w_i depend on T(X_1^n), and the principle of the algorithm is as follows. Initialize θ^(0) with the coordinate-wise median and iterate the following
We show that this algorithm allows us to find a minimizer of
Let r_n, δ > 0 be such that
for instance, one can use the bound given in Section 5. We have the following theorem.
Theorem 4. Let X_1, . . . , X_n be in the I ∪ O setting with (X_i)_{i∈I} i.i.d. with law P whose variance is finite and whose covariance matrix is denoted Σ. Suppose also that |O| ≤ n/8 and β ≥ 2√(2Tr(Σ)) + r_n + ψ^{−1}(2V_ψ). Then, for all N ∈ N, with probability larger than 1 − (d + 4)e^{−n/8} − δ, we have
Said differently, the iterative re-weighting algorithm is such that for any ε we have ‖θ^(m) − T(X_1^n)‖ ≤ ε after a number of iterations
The proof of Theorem 4 can be found in Section A.4. To prove this theorem, we use techniques similar to those used to prove the convergence of Weiszfeld's method (see [3]).
We obtain an exponential rate of convergence. Remark that because the objective function is convex, even if the initialization were not as good as the coordinate-wise median, or if β were not large enough, we would still converge, but with a linear rate of convergence (similar to the convergence rates in [3]) until we are close enough to T(X_1^n) for the rate to become exponential.
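As a concrete sketch, the iteration above can be written in a few lines of Python. This is a hypothetical re-implementation, not the code released with the paper: we assume Huber's score function, for which the re-weighting takes the closed form w_i = min(1, β/‖X_i − θ‖), and the function name and defaults are ours.

```python
import numpy as np

def huber_m_estimator(X, beta, n_iter=50, tol=1e-8):
    """Iterative re-weighting for an M-estimator of location.

    Sketch of the scheme of this section under the assumption that psi is
    Huber's score, so the weights are w_i = min(1, beta / ||X_i - theta||).
    """
    theta = np.median(X, axis=0)           # coordinate-wise median init
    for _ in range(n_iter):
        r = np.linalg.norm(X - theta, axis=1)
        r = np.maximum(r, 1e-12)           # guard against division by zero
        w = np.minimum(1.0, beta / r)      # Huber weights
        theta_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```

Each iteration costs O(nd), and with β large enough the iterates contract toward T(X_1^n) at an exponential rate (Theorem 4), so only a logarithmic number of iterations is needed for a given precision.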

Discussion on the choice of β
The choice of β is a frequent problem when using Huber's estimator. One solution is Lepski's method, but it is computationally expensive and not always efficient. Another, often used, approach is a heuristic based on the median absolute deviation, arguing that β must be of order σ, the standard deviation of the inliers. However, in Section 5 we see that this choice is very conservative, and β would often be too small if estimated using the median absolute deviation.
In view of Section 5, where we see that depending on the number of finite moments we may want to choose β between √(Tr(Σ)) and √(Tr(Σ)n), we propose to choose β in the interval [0, MAD√n], where MAD = Med(‖X_i − GMed(X_1^n)‖), GMed being the geometric median.
We propose the following heuristic to choose β:
where C_ψ is a constant that depends on ψ, as given by the bounds on the bias (Section 4: C_{ψ_H} = 1, C_{ψ_C} = 5/32 and C_{ψ_P} = 1/16). This is a bias-variance trade-off: the first term converges to the asymptotic variance of T_{ψ_β}(X_1^n), the second term is a bound on the squared bias, and the third term is a bound on the corruption bias if we suppose that ε_n ≤ 0.05 (in the robust statistics literature, it is often said that there are less than 5 or 10 percent of outliers).
Remark that the objective function may not be convex, and hence there can be local minima; we restrict the search to [0, MAD√n] in order to be able to choose β efficiently using a grid search.
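The grid construction can be sketched as follows. This is our own hedged sketch, not the paper's code: `geometric_median` is a standard Weiszfeld iteration, and the bias-variance objective of the heuristic above is left as a user-supplied `criterion` function, since its exact terms and constants depend on ψ.

```python
import numpy as np

def geometric_median(X, n_iter=100):
    """Weiszfeld iterations for the geometric median (standard algorithm)."""
    theta = np.median(X, axis=0)
    for _ in range(n_iter):
        r = np.maximum(np.linalg.norm(X - theta, axis=1), 1e-12)
        w = 1.0 / r
        theta = (w[:, None] * X).sum(axis=0) / w.sum()
    return theta

def beta_grid(X, n_points=50):
    """Grid over (0, MAD * sqrt(n)] with MAD = Med ||X_i - GMed(X_1^n)||."""
    n = len(X)
    mad = np.median(np.linalg.norm(X - geometric_median(X), axis=1))
    return np.linspace(mad / n_points, mad * np.sqrt(n), n_points)

def choose_beta(X, criterion, n_points=50):
    """Minimize a (possibly non-convex) criterion in beta by grid search."""
    grid = beta_grid(X, n_points)
    scores = [criterion(X, beta) for beta in grid]
    return grid[int(np.argmin(scores))]
```

Because the search is a plain grid search over a bounded interval, non-convexity of the criterion is harmless: the cost is one evaluation of the criterion per grid point.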

Illustrations
To illustrate the behavior of M-estimators in a heavy-tailed setting, we consider a multivariate Pareto law, for which the coordinates are drawn independently of each other from a Pareto distribution, and a multivariate Student distribution. All datasets have a finite variance but an infinite third moment. In these datasets, we consider n = 1000 samples and the dimension is d = 100. We consider five estimators: Huber's and Catoni's estimators with β chosen as in Section 7.2, the polynomial estimator with p = 5 and β chosen as in Section 7.2, the geometric median (gmed), and the geometric median of means described in [33] with k = 9 blocks. The results are represented in Figure 2. In Figure 2, M-estimators outperform the geometric median by a wide margin, because the geometric median is very biased when estimating the mean of an asymmetric distribution. The geometric median of means is closer in performance to the M-estimators but not as good, perhaps because there is no adaptivity: the number of blocks has been fixed to 9.
Remark that the multivariate robust mean estimators described in [15,12,24,8] are not presented here because they are either too computationally intensive or too hard to implement for the purpose of comparison. Remark also that the choice of β from Section 7.2 is a heuristic. In a learning setting, one may prefer to tune β by cross-validation, directly on the learning criterion.
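A minimal version of the experiment above can be reproduced as follows. This sketch is ours, not the released code: we use the assumed Huber weights w_i = min(1, β/‖X_i − θ‖) with a crude choice of β rather than the heuristic of Section 7.2, and NumPy's `pareto` sampler, which draws from the Lomax (Pareto II) distribution with mean 1/(a − 1) for tail index a > 1.

```python
import numpy as np

def huber_mean(X, beta, n_iter=50):
    # iterative re-weighting with (assumed) Huber weights min(1, beta/r)
    theta = np.median(X, axis=0)
    for _ in range(n_iter):
        r = np.maximum(np.linalg.norm(X - theta, axis=1), 1e-12)
        w = np.minimum(1.0, beta / r)
        theta = (w[:, None] * X).sum(axis=0) / w.sum()
    return theta

rng = np.random.default_rng(0)
n, d, a = 1000, 100, 2.5            # a in (2, 3]: finite variance, infinite 3rd moment
X = rng.pareto(a, size=(n, d))      # coordinates i.i.d. Lomax(a)
mu = np.full(d, 1.0 / (a - 1.0))    # true mean vector of the Lomax(a) coordinates
err_mean = np.linalg.norm(X.mean(axis=0) - mu)
err_huber = np.linalg.norm(huber_mean(X, beta=5.0 * np.sqrt(d)) - mu)
```

On such heavy-tailed draws, the down-weighting of samples with large radius typically stabilizes the estimate; the figure in the paper compares the full set of estimators under the tuned β.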

A.2. Proof of Theorem 2
Step 1. We have
Proof: The function θ → Z_β(θ) is differentiable and, by the mean value theorem, we have
where Jac denotes the Jacobian operator. From Lemma 1 and Assumption 2-(iv), we get
Hence, for all t ∈ [0, 1], inject this in Equation (30) to get the result.
because ρ is increasing and super-additive on R_+ (ρ is increasing because ψ(0) = 0 and ψ is non-decreasing since ψ′ ≥ 0, hence ψ = ρ′ ≥ 0). Hence, by Markov's inequality and using the hypothesis,
Step
Hence, taking x such that ψ(‖x‖) ≥ ‖ψ‖_∞/2 (this fixes the norm of x) and ‖M^{−1}x‖ = ‖M^{−1}‖_op‖x‖ (this fixes the direction of x), we have
where ‖M‖_op is the operator norm of M. Let us control this operator norm. We have for all u ∈ S, where S is the sphere in R^d,
Hence,
Then, use that X ∼ N(0, σ²I_d), so that σ²/‖X‖² follows an inverse-χ² law with parameter d. Hence,
Inject this in Equation (31) to conclude that ‖IF(x)‖ ≥ √(d − 2)/6. Now, use that ε → T(P_ε) is continuous (it is in fact Lipschitz continuous because the influence function is bounded) to conclude.

A.4. Proof of Theorem 4
This proof mimics the proof of convergence of the EM algorithm, or of the algorithm used to compute the geometric median. Let
we are searching for the argmin of J_n. First, we show that the initialization is not too far from the optimum.
Lemma 10. Suppose Assumption 1 is verified, Σ is the covariance matrix of P and |O| ≤ n/8. Then, with probability larger than 1 − de^{−n/8} − δ,
The proof is provided in Section C.10. Then, we note that J_n is strongly convex with high probability: using Lemma 1, for all u ∈ S and all θ ∈ R^d,
This allows us to show that the sequence of iterates is equivalent to minimizing a convex majorant of J.
The proof is provided in Section C.11. From Lemma 11, we get that the direction from θ (m) to θ (m+1) is a proper descent direction through the following lemma proved in Section C.12.
Lemma 12. For all θ ∈ R^d and for all m ∈ N, we have
Hence, using the fact that T(X_1^n) is a minimizer of J, we get
This allows us to restrict ourselves to a bounded domain. From Lemma 10, with probability larger than 1 − de^{−n/8} − δ, θ^(0) is in
and then Equation (33) ensures that we stay in Θ for the subsequent iterations: ∀m ∈ N, θ^(m) ∈ Θ. The following lemma is proven in Section C.13.
Lemma 13. Let θ ∈ Θ. If X_1, . . . , X_n are corrupted by an adversary, with X′_1, . . . , X′_n i.i.d. with law P whose variance is finite and whose covariance matrix is denoted Σ, and if moreover |O| ≤ n/8 and β ≥ 2√(2Tr(Σ)) + r_n + ψ_1^{−1}(2V_ψ), then with probability greater than 1 − e^{−n/32},
This allows us to quantify the speed at which ‖θ^(m) − T(X_1^n)‖ decreases. Indeed, from Lemma 12 with θ = T(X_1^n), we have
and, by Lemma 13, Lemma 10 and the convexity equation (32), we have by Taylor's theorem that, with probability larger than 1 − (d + 4)e^{−n/8} − δ (because e^{−n/32} ≤ 4e^{−n/8}),
Hence,
Solve this inequality for ‖θ^(m+1) − T(X_1^n)‖ to obtain
then use that Σ_{j=1}^n w_j(θ^(m)) ≤ n and the fact that the right-hand side of Equation (34) is increasing in Σ_{j=1}^n w_j(θ^(m)) to get, for all m ∈ N,
This implies directly that, with probability larger than 1 − (d + 4)e^{−n/8} − δ, we have

B.1. Proof of Proposition 1
Hypotheses in this proof.
Let us find an upper bound of t_{IF}(λ/4; T_H(P), X_1^n).
Then, by Lemma 8, for all t > 0, with probability larger than 1 − 4e^{−t},
inject this in Equation (37).
Step 2. There exists a constant C_2 > 0 such that the condition λ_t ≤ β/2 is verified for any t ≤ C_2 n.
Proof: Use Hypothesis (i) to get that λ_t ≤ β/2 is implied by
Then, this is implied by the following system of equations, with C_1, C_2, C_3 > 0 three numerical constants,
The last implication comes from Hypothesis (ii): β² ≥ 8V_{ψ_H} ≥ 8v_{ψ_H}. Inject this in Equation (36) to get the desired result.

B.2. Proof of Proposition 2
Hypotheses in this proof.
Since ε_n ≥ 0, we can bound the chosen β by

then there exists C > 0 such that the hypothesis on β is verified if
This condition on t is verified in Hypothesis (ii).
Step 2. For any t ≤ C_2 n, we have
Proof: From the previous step, we can apply Proposition 1, and by Lemma 4 and Hypothesis (i), if β² ≥ V_{ψ_H} max(8, C_1/n), then with probability larger than 1 − 4e^{−t} − e^{−n/32},
To simplify, we do not take into account the effect of β on V_{ψ_H} and v_{ψ_H} when choosing β; we then choose β such that
inject this in Equation (38) to get
The result follows because 2^{1/q}(1 + 1/(q−1)) ≤ 4.
Step 3. For any t ≤ C_2 n, by sub-linearity of the square root we obtain
combine this with Step 2 to get

B.4. Proof of Proposition 4
Hypotheses in this proof.
The proof of Lemma 14 is postponed to Section C.14. By Step 3 and the fact that β² ≥ Tr(Σ), we have s ≤ max(e, log(1 + 1 + 1/2)) ≤ e. Then, by Lemma 14, for any q ∈ N*,
Then, using the power series expansion of the exponential function, we get that, for all t > βe,
Choosing t = 2βe shows that ‖ψ_C(‖X − T_C(P)‖)‖_{ψ_1} ≤ 2eβ. Then, using Lemma 8, we get for all t > 0, with probability larger than 1 − 4e^{−t},
Hence, from Step 2, we have
Step 5. The condition λ_t ≤ β/2 is implied by t ≤ C_3 n/log(n) for some absolute constant C_3.
Proof: We use that ε_n C_O ≤ 1/20 to get that λ_t ≤ β/2 is implied by
Then, this is implied by the following system of equations, with C_1, C_2, C_3 > 0 three numerical constants,
The last implication is because β² ≥ 8V_{ψ_C} ≥ 8v_{ψ_C}. The first inequality is necessarily verified because β² ≥ 10Tr(Σ), hence the only remaining condition is t ≤ C_3 n/log(n) for some absolute constant C_3.
Step 6. There exists a numerical constant C_2 > 0 such that, with probability larger than 1 − 4e^{−t} − δ_O,
Proof: From Lemma 5, with 3 finite moments, we have
Then, from Step 4, with probability larger than 1 − 4e^{−t} − e^{−n/50},
Then, there exists a constant C_4 > 0 such that
To simplify, we also use the bounds on the variance from Lemma 6, V_{ψ_C} ≤ Tr(Σ), and also
By sub-linearity of the square root we obtain
inject this in Equation (45) to get

2√(Tr(Σ)/n) + 10√(t‖Σ‖_op/n).
Since t ≲ n, we get that there exists a numerical constant C_2 > 0 such that
Step 7. Using the sub-linearity of x → x^{2/3} and using that, for C_2 small enough in Equation (46), we have
which finishes the proof.

B.5. Proof of Proposition 5
Hypotheses in this proof.
The following lemma applies.
Lemma 15. Let n ∈ N* and suppose X_1, . . . , X_n are i.i.d. Let q ∈ N* and suppose E_P[‖X‖^q] < ∞. Then there exists an absolute constant K > 0 such that

The proof is postponed to Section C.15. Take q = 2 and λ = √(E_P[‖X − T_P(P)‖²]t/n); we get for all t > 0,
Take p = t(4K²e)^{−1}; then there is a constant C_1 > 0 such that, with probability larger than 1 − e^{−C_1 t}, we have
Now, we take care of the outliers in a similar way as in Propositions 1 and 4. We have, from Hypothesis (i),
Hence, from Equations (48) and (49), with
we have, with probability larger than 1 − e^{−C_1 t} − e^{−n/512} − δ_O,
Step 2. There exists a constant C′_1 > 0 such that for any t ≤ C′_1 n, we have λ_t ≤ β/2.
Proof: The condition λ_t ≤ β/2 is implied by
using the fact that ε_n ≤ 1/(64C_O). This simplifies when t ≲ nβ²/E_P[‖X − T_P(P)‖²], and then, by Hypothesis (ii), when t ≲ n.
Moreover, by the Cauchy-Schwarz inequality, we also have 1 − ⟨(X − θ)/‖X − θ‖, u⟩² ≥ 0. Hence, combining these two inequalities, we get
Inject this in (51) to get the result: for any u ∈ S,

C.2. Proof of Lemma 2
Hypotheses in this proof.
Step 1.
Proof: First, notice that from Assumptions 2-(i) and 2-(iv), ρ is twice differentiable and increasing on R_+. Hence, J is differentiable and its gradient is
Then, by definition of T(P) (Equation (7)), T(P) verifies ∇J(T(P)) = 0, i.e. T(P) is a critical point of J.
Step 2. J is convex.
Proof: Let Hess(J) denote the Hessian of J. From Lemma 1, for any θ ∈ H and u ∈ S,
Hence, J is convex because its Hessian is positive semi-definite.
Step 3. The minimum of J exists.
Proof: J(θ) → ∞ as ‖θ‖ → ∞. Hence, J is coercive and, as J is also convex (Step 2), its minimum T(P) exists.
Step 4. J is strictly convex at T(P).
Proof: From Assumption 2-(iv), and because ρ is increasing,
Hence, by Markov's inequality,
By Step 3, T(P) is a minimizer of J, hence

Inject this in Equation (52) to get
Then, using Hypothesis (i), we get E[ψ′(‖X − T(P)‖)] > 0, which implies from Lemma 1 that for all u ∈ H, u ≠ 0, uᵀ Hess(J)(T(P)) u > 0. Hence J is strictly convex at T(P), the minimizer of J.
Step 5. T(P), as defined by Equation (7), exists and is unique.
Proof: As T(P) is a minimizer of J, it is also a root of Equation (7); existence and uniqueness were proven in Steps 3 and 4.

C.3. Proof of Lemma 3
Huber's score function: the equality for Huber's score function is immediate by differentiating ψ_H. Catoni's score function: ψ_C is differentiable, and we have for all x ≥ 0,
This function is decreasing on R_+, positive and even, hence
Polynomial score function: ψ_P is differentiable, and we have for all x ≥ 0,
As in the case of Catoni's score function, this function is decreasing over R_+, positive and even. Then, we get

C.4. Proof of Lemma 4
From Theorem 2, we only need to control Z_β(E[X]). We have
Hence, by the triangle inequality,
because ψ_1 is 1-Lipschitz and ψ_1(0) = 0. Then, by integration by parts,
Until now, the proof was valid for any ψ_1; for the specific case of Huber's score function, using Theorem 2 we get that
Then, use Markov's inequality,

C.5. Proof of Lemma 5
We have
Then, by Taylor expansion,
which proves the first part of the lemma using Theorem 2. In the case of the Bernoulli distribution, the result follows from a Taylor expansion:

C.6. Proof of Lemma 6
Step 1. For any ψ that satisfies Assumption 2, V_ψ ≤ Tr(Σ).
Proof: First, remark that we have for all
and, because ψ′_1 ≤ 1 and ψ_1(0) = 0, we get that h is decreasing; the fact that h(0) = 0 then implies that for all x ∈ R_+, ψ_1²(x) ≤ 2ρ_1(x). Then,
Define J(θ) = E[ρ_1(‖X − θ‖/β)]; by Definition 1, T(P) is the minimum of J and, by Equation (53),
Finally, use that, by integration of ψ′_1 ≤ 1, we have ρ_1(x) ≤ x²/2, hence the result.

C.7. Proof of Lemma 7
Step 1. If X has q > 2 finite moments, then
Then, by Hölder's inequality, we get the announced result.
Step 2. We use the following lemma; see Section C.16 for the proof.
Then, for Y = X − T_H(P) and from Step 1, we get
Use the fact that (a + b)^q ≤ 2^{q−1}(a^q + b^q),
And finally, because E[X] is the minimizer of the quadratic loss,
Step 3. Similarly, if X has q > 2 finite moments, we have a bound involving ( … + β^{2q})^{1−2/q}.
Proof: We proceed in the same manner for the bound on v_{ψ_H}. We have
Then, use the Cauchy-Schwarz inequality,
Then, use the same reasoning as for the bound on V_{ψ_H} to conclude.

C.8. Proof of Lemma 8
From [1, Theorem 4] and because ‖Y‖ = sup_{‖u‖=1}⟨Y, u⟩, there exists an absolute constant C_1 such that, for all t ≥ 0,
where σ² = n sup_{u∈S} E[⟨X, u⟩²]. Remark that σ² can be rewritten
Then, by the Cauchy-Schwarz inequality,

C.9. Proof of Lemma 9
ρ is convex because ρ″ = ψ′ ≥ 0, and it is increasing because ψ = ρ′ ≥ 0 (ψ(0) = 0 and ψ is increasing). Then, from the triangle inequality and Jensen's inequality, we have
By definition, T(P) is a minimizer of θ → E[ρ(‖X − θ‖/β)], hence
then, use the hypothesis to upper bound the right-hand side by ρ(1/3) to get
Finally, because ρ is non-decreasing on R_+ (its derivative is non-negative), we get the result.

C.10. Proof of Lemma 10
First, let us begin with d = 1. We have that for all t > 0,
By Hoeffding's inequality, we have
and, by Chebyshev's inequality, for the choice t = 2√2 σ we have P(X − E_P[X] > t) ≤ 1/8. Then, from Equation (55),
and then, because |O| ≤ n/8,
Now, for dimension d, we use the one-dimensional result on each coordinate, and by a union bound we have that, with probability larger than 1 − de^{−n/8}, for all
and then, by taking the sum of the squares, we get ‖θ^(0) − E_P[X]‖² ≤ 8 Σ_{j=1}^d σ_j² = 8Tr(Σ). We conclude using that ‖T(X_1^n) − E_P[X]‖ ≤ r_n with probability larger than 1 − δ.

C.11. Proof of Lemma 11
The proof is derived from the proof of the iterative re-weighting algorithm for regression found in [27, Section 7.8].
First point. U_{θ^(m)} is a convex function; let us take its gradient to find its minimum,
Hence, the minimum is attained at θ = Σ_{i=1}^n w_i(θ^(m)) X_i / Σ_{j=1}^n w_j(θ^(m)) = θ^(m+1).
Second point. For all i ∈ {1, . . . , n},
g_i is differentiable and
This proves that g_i is a majorant of ρ_1 and, by taking the sum, that U_κ is a majorant of J_n.
Third point. Define
we have, by definition of w_i(κ),
and moreover h_i is a differentiable function whose gradient is
and we can verify that
Let us show that ∇h_κ is Lipschitz. We have
Then, use that θ → ρ_1(‖X_i − θ‖) is convex, hence its Hessian is positive semi-definite, and we have for all u ∈ S,
This shows that ∇h_{i,κ} is (w_i/β²)-Lipschitz continuous and hence, by summing over i, ∇h_κ is Lipschitz continuous with Lipschitz constant L = Σ_{i=1}^n w_i/β².
Fourth point: already verified using Equations (56) and (57).

C.14. Proof of Lemma 14
For q ∈ N*, let g_q : x ↦ q^q x/(e^q − 1) if x ∈ [0, e^q − 1], and g_q : x ↦ log(1 + x)^q if x > e^q − 1.
Step 1. g_q is a concave function over R_+.
Proof: g_q is continuous at e^q − 1, where the left and right limits are both equal to q^q. g_q is differentiable on [0, e^q − 1) and on (e^q − 1, ∞), and its derivative is non-increasing on both intervals. At e^q − 1, the left derivative is q^q(e^q − 1)^{−1} while the right derivative is q^q e^{−q}; thus the left derivative at e^q − 1 is larger than the right derivative. Hence the derivative is non-increasing on R_+, and g_q is concave on R_+.
This last function is clearly smaller than x ↦ q^q x/(e^q − 1). Hence, x ↦ log(1 + x)^q is smaller than g_q: we have found a concave upper bound of x ↦ log(1 + x)^q. Since g_q is concave (Step 1), by Jensen's inequality, for any positive random variable Z such that E[Z] < ∞, we have E[log(1 + Z)^q] ≤ E[g_q(Z)] ≤ g_q(E[Z]).
Proof: From Step 1, Step 2 and Equation (60), we get
From Equations (59) and (61), re-injecting the definition of the Y_i's, we get t_{IF}(λ; T_P(P), X_1^n) ≤ E[ψ_P(‖X − T_P(P)‖)^{pq}]
Then, use that ψ_P(‖x‖) ≤ ‖x‖^{1/p} β^{1−1/p} to get

Appendix D: Technical tools
We remind the reader of Bernstein's inequality, a classical concentration inequality; this form is borrowed from [4, Theorem 2.10].
Theorem 5. Let X_1, . . . , X_n be independent real-valued random variables. Assume that there exist positive numbers v and c such that
where x_+ = max(0, x). Then, for all t > 0,
The following theorem, borrowed from [1, Theorem 4], is a concentration inequality for suprema of sums of independent random variables.
Theorem 6. Let X_1, . . . , X_n be independent random variables with values in a measurable space (S, B), and let F be a countable class of measurable functions f : S → R. Assume that for every f ∈ F and every i, E[f(X_i)] = 0, and that for any α ∈ (0, 1] and all i, sup_f ‖f(X_i)‖_{ψ_α} < ∞. Let
Define moreover
Then, for all 0 < η < 1 and δ > 0, there exists a constant C = C(α, η, δ) > 0 such that for all t ≥ 0,