Robust and efficient mean estimation: an approach based on the properties of self-normalized sums

Let $X$ be a random variable with unknown mean and finite variance. We present a new estimator of the mean of $X$ that is robust with respect to the possible presence of outliers in the sample, provides tight sub-Gaussian deviation guarantees without any additional assumptions on the shape or tails of the distribution, and moreover is asymptotically efficient. This is the first estimator that provably combines all these qualities in one package. Our construction is inspired by robustness properties possessed by the self-normalized sums. Theoretical findings are supplemented by numerical simulations highlighting strong performance of the proposed estimator in comparison with previously known techniques.

Let X be a random variable with mean EX = µ and variance Var(X) = σ², where both µ and σ² are unknown; in what follows, P will denote the distribution of X and P_{2,σ} the class of all distributions possessing two finite moments and having variance σ². We will be interested in robust estimators µ̂ of µ constructed from the data X₁, ..., X_N generated as follows: an initial non-corrupted sample of independent, identically distributed copies of X is merged with a set of O < N outliers that are independent of the initial sample, and the combined sample of cardinality N is given as an input to the algorithm responsible for construction of the estimator. This contamination framework is more general than Huber's contamination model [Hub64, CGR16], where the outliers are assumed to be identically distributed, but weaker than the framework allowing adversarial outliers [KL93, Val85] that may, for instance, depend on the initial sample. Robustness will be quantified by two properties. First, in the situation when O = 0, the estimators should admit tight non-asymptotic deviation bounds of the form |µ̂ − µ| ≤ Cσ √(s/N) (1.1) with probability at least 1 − 2e^{−s}, where C > 0 is an absolute constant. In particular, we will be interested in estimators that attain such deviation guarantees uniformly over 0 < s < ψ_P(N), where ψ_P(N) is an increasing function that may depend on the law of X₁; guarantees of type (1.1) can be informally labeled as "robustness to heavy tails." Second, the estimators of interest should perform optimally with respect to the degree of outlier contamination characterized by the quantity ε := O/N. Another important property that we focus on is asymptotic efficiency.
Informally speaking, efficiency measures how "wasteful" an estimator is: an efficient estimator captures all the information available in the sample; alternatively, in many cases one can conclude that confidence intervals centered at an efficient estimator have (at least asymptotically) the smallest possible diameter. It is difficult to quantify efficiency using only finite-sample guarantees of type (1.1), as the constants in these bounds are rarely sharp, at least for practical purposes; a common approach is therefore to take an asymptotic viewpoint. Specifically, we will be looking for estimators that are asymptotically normal, √N (µ̂ − µ) →_d N(0, ν²), where →_d denotes convergence in distribution, and whose asymptotic variance ν² := ν²(µ̂, P) is as small as possible in the minimax sense, that is, sup_{P ∈ P_{2,σ}} ν²(µ̂, P) = inf_{µ̂} sup_{P ∈ P_{2,σ}} ν²(µ̂, P).
Here, the infimum is taken over all asymptotically normal (after rescaling by √N) estimators µ̂ of µ. It is easy to see that inf_{µ̂} sup_{P ∈ P_{2,σ}} ν²(µ̂, P) = σ² (for the reader's convenience, the proof of this simple fact is given in Lemma 6.5); therefore, it suffices to find a robust estimator that satisfies √N (µ̂ − µ) →_d N(0, σ²) for all P ∈ P_{2,σ}. For instance, the sample mean is an example of an estimator with the required asymptotic properties that is not robust, while the popular median-of-means estimator [NY83] is robust but not asymptotically efficient [Min19].
In this paper we construct the first, to the best of our knowledge, example of an estimator of the mean that provably (a) is robust to the heavy tails of the data-generating distribution P; (b) admits optimal error bounds with respect to the outlier contamination proportion ε = O/N; (c) is asymptotically efficient; and (d) is almost tuning-free, meaning that it does not require knowledge of any parameters of the distribution besides an upper bound on the contamination proportion ε. We also show how to make our procedure fully adaptive. Our construction is novel and is inspired by the properties of self-normalized sums.
The rest of the paper is organized as follows: section 2 introduces the estimator and explains the main ideas behind its construction; the key results are presented in section 3, while a comparison of our estimator with existing robust estimation techniques in the context of properties (a)-(d) is presented in section 3.4. Finally, a fully adaptive procedure is outlined in section 4, while the supporting numerical simulations are included in section 5. The proofs of the main results are contained in section 6. All notation and auxiliary results will be introduced on demand.

Construction of the estimator.
We restrict our attention to estimators obtained via aggregating the sample means evaluated over disjoint subsets (also referred to as "blocks") of the data. Specifically, assume that {1, ..., N} = ∪_{j=1}^k G_j, where G_i ∩ G_j = ∅ for i ≠ j and |G_j| = n = N/k is an integer, and let µ̂_j := (1/|G_j|) Σ_{i∈G_j} X_i be the sample mean of the observations indexed by G_j. We consider estimators µ̂_N of the form µ̂_N := Σ_{j=1}^k α_j µ̂_j (2.1) for some (possibly random and data-dependent) nonnegative weights α₁, ..., α_k such that Σ_{j=1}^k α_j = 1. For example, the well-known median-of-means estimator [NY83, AMS96, LO11] corresponds to the case α_j = 1 for the j such that µ̂_j = median(µ̂₁, ..., µ̂_k) and α_j = 0 otherwise. The construction proposed in this paper starts with the observation that it is natural to choose weights inversely proportional to an increasing function of the standard deviation of each block. Indeed, the estimation error of the sample mean µ̂_j in each block of the data is essentially controlled by the corresponding sample standard deviation σ̂_j := ((1/|G_j|) Σ_{i∈G_j} (X_i − µ̂_j)²)^{1/2}. To understand why, consider the obvious identity µ̂_j − µ = σ̂_j · (µ̂_j − µ)/σ̂_j. The random variable (µ̂_j − µ)/σ̂_j, which is equal up to normalization to Student's t-statistic, is known to be tightly concentrated around 0: namely, it is bounded by t/√n with probability at least 1 − e^{−ct²} for t ≤ c′√n, where c, c′ are positive constants, even if the data are heavy-tailed (a more precise version of this fact is stated below). Therefore, |µ̂_j − µ| is bounded by a multiple of σ̂_j/√n with high probability. And, while the error |µ̂_j − µ| is unknown, the quantity σ̂_j is fully data-dependent. This motivates the choice of weights of the form α_j := σ̂_j^{−p} / Σ_{i=1}^k σ̂_i^{−p} (2.2) for some p ≥ 1; in what follows, the estimator (2.1) with weights (2.2) will be denoted µ̂_{N,p}. When we need to emphasize the specific value of k used in the construction, we will write µ̂_{N,p}(k).
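As a concrete illustration, the construction just described (block partition, block means, block standard deviations, and weights inversely proportional to σ̂_j^p) can be sketched in a few lines; the function name and the sequential block partition are our own choices, not part of the paper:

```python
import numpy as np

def blockwise_estimator(x, k, p=1):
    """Sketch of the estimator mu_{N,p}: a convex combination of block
    means with weights inversely proportional to the p-th power of the
    block sample standard deviations (sequential blocks for simplicity)."""
    x = np.asarray(x, dtype=float)
    n = len(x) // k                      # block size; assumes k divides N
    blocks = x[: n * k].reshape(k, n)    # blocks G_1, ..., G_k
    means = blocks.mean(axis=1)          # block means
    sigmas = blocks.std(axis=1)          # block standard deviations
    w = sigmas ** (-float(p))            # unnormalized weights
    alpha = w / w.sum()                  # normalized so the weights sum to 1
    return float(np.dot(alpha, means))
```

Replacing the weighted sum by the median of the block means recovers the median-of-means estimator mentioned above.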
Observe that when p = 1, the estimation error satisfies µ̂_{N,1} − µ = (Σ_{j=1}^k (µ̂_j − µ)/σ̂_j) / (Σ_{j=1}^k σ̂_j^{−1}), (2.3) whose numerator is proportional to the average of t-statistics evaluated over k independent subsamples. It is therefore natural to expect that µ̂_{N,1} − µ satisfies strong deviation guarantees. Let us now present an example where the weights corresponding to p = 2 arise naturally. Observe that one can model outliers by assuming that the variances of the data differ across the k groups, with large variance corresponding to a corrupted subsample: X_i, i ∈ G_j ∼ N(µ, σ_j²) for some µ ∈ R and positive but unknown σ₁, ..., σ_k. The maximum likelihood estimator µ̂ in this model is easily seen to satisfy µ̂ = argmin_{z∈R} Σ_{j=1}^k |G_j| log((1/|G_j|) Σ_{i∈G_j} (X_i − z)²). An approximate solution can be obtained via minimizing the first-order approximation of the loss, the function z ↦ Σ_{j=1}^k |G_j| ((µ̂_j − z)/σ̂_j)², which attains its minimum at the point z = Σ_{j=1}^k σ̂_j^{−2} µ̂_j / Σ_{j=1}^k σ̂_j^{−2}; this is the estimator (2.1) with weights defined in (2.2) for p = 2. In the following sections we will present non-asymptotic deviation bounds for the estimator µ̂_{N,p} for all values of p ≥ 1 and will establish its asymptotic efficiency in the absence of outliers.

Main results.
The goal of this section is to prove a deviation inequality for the estimation error µ̂_{N,p} − µ for any p ≥ 1, where the estimator µ̂_{N,p} corresponds to the weights defined by (2.2).

Preliminaries.
In this section, we consider the simple framework of i.i.d. data without outliers. We will start with a brief review of concentration inequalities for self-normalized sums. It is known (see, for example, the book [PLS08]) that the properties of the t-statistics evaluated over the subsamples indexed by G₁, ..., G_k are closely related to the behavior of the self-normalized sums Q_j := (µ̂_j − µ)/V_j, where V_j² := (1/n) Σ_{i∈G_j} (X_i − µ)². The following inequality is well known (cf. Theorem 2.16 in [PLS08]): for any j = 1, ..., k and any x > 0, |Q_j| ≤ C x/√n (3.2) with probability at least 1 − 4e^{−x²/2}, as long as E(X − µ)² < ∞, where C > 0 is an absolute constant. In order to deduce a non-random upper bound from (3.2), it suffices to control the ratio 1/V_j. To this end, define ζ(X) := inf{t ≥ 1 : E((X − µ)² 1{|X − µ| ≥ tσ}) ≤ σ²/2}. As long as Var(X) is finite, it is clear that ζ(X) < ∞. Combining the resulting lower bound on V_j with the bound (3.2), we deduce that, for any 1 ≤ j ≤ k and any x > 0, the corresponding deterministic bound holds with probability at least 1 − 4e^{−x²/2} − e^{−n/(40ζ²(X)∨6)}. If moreover x ≤ √n/18, then the relation T_j = f(Q_j) immediately implies that the t-statistics T_j satisfy the analogous bound with probability at least 1 − 4e^{−x²/2} − e^{−n/(40ζ²(X)∨6)} for each 1 ≤ j ≤ k. Alternatively, the previous argument also implies that the random variables T_j 1{|Q_j| ≤ 1/2, V_j ≥ σ/2} satisfy a similar deviation inequality. Therefore, we conclude that the random variable T_j, truncated at the right level, behaves like a sub-Gaussian random variable. This fact is formalized in Lemma 6.1 and is one of the key ingredients used to show that the proposed estimators have sub-Gaussian deviations.
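The concentration phenomenon described above is easy to observe numerically; note also that |Q_j| ≤ 1 always, by the Cauchy-Schwarz inequality. A Monte Carlo sketch with a Pareto-type heavy-tailed distribution of our own choosing (finite variance, infinite third moment):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 5000
mu = 0.5                                    # mean of the Lomax(3) distribution
x = rng.pareto(3.0, size=(reps, n))         # heavy-tailed, finite variance
centered = x - mu
V = np.sqrt((centered ** 2).mean(axis=1))   # V^2 = (1/n) sum (X_i - mu)^2
Q = centered.mean(axis=1) / V               # self-normalized statistic Q
# |Q| <= 1 deterministically, and Q concentrates at scale 1/sqrt(n)
# despite the heavy tails of the summands
print(np.abs(Q).max(), np.mean(np.abs(Q) > 0.5))
```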

Non-asymptotic deviation inequalities.
In the simplest case p = 1, equation (2.3) suggests that, in order to bound the estimation error µ̂_{N,p} − µ, it suffices to control the average (1/k) Σ_{j=1}^k T_j. In what follows, we will always assume that O ≤ Ck for some C < 1, where O is the number of outliers in the sample. Define the event E_p via (3.5): E_p holds whenever the harmonic mean of the (powers of the) sample variances does not exceed the corresponding power of the true variance σ² by too much. In particular, in the absence of outliers, we can replace C by 0 in the definition of this event.
Informally speaking, the harmonic mean of a set of numbers is controlled by its smallest elements; therefore, it is natural to expect that the event E_p holds with overwhelming probability. This claim is formalized in the following lemma, whose proof is deferred to Section 6.3.
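The insensitivity of the harmonic mean to a few very large elements, which is the mechanism behind the event E_p, can be checked directly; the tolerance constant 2 below is an arbitrary stand-in for the constant implicit in (3.5):

```python
import numpy as np

def harmonic_mean_ok(sigmas, sigma, p=1, tol=2.0):
    """Check an event of the E_p type: the harmonic mean of the p-th powers
    of the block standard deviations does not exceed tol * sigma^p."""
    sigmas = np.asarray(sigmas, dtype=float)
    hmean = len(sigmas) / np.sum(sigmas ** (-float(p)))
    return bool(hmean <= tol * sigma ** p)

# ten corrupted blocks with huge standard deviations barely move the
# harmonic mean, although they would destroy the arithmetic mean
sigmas = np.concatenate([np.ones(90), np.full(10, 1e6)])
print(harmonic_mean_ok(sigmas, sigma=1.0))   # prints True
```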
Lemma 3.2. Recall the contamination framework defined in Section 1. Suppose that E|X − µ|^{1+δ} < ∞ for some 1 ≤ δ ≤ 2 and that O ≤ Ck for some C < 1. Then Pr(E_p^c) ≤ e^{−ck} for some constant c > 0 that depends on δ and E|X − µ|^{1+δ}. Moreover, if X has a sub-Gaussian distribution, then the same conclusion holds with a constant c > 0 that depends on the distribution of X.
Note that the condition O ≤ Ck only requires C to be smaller than 1: it means that, for our technique to reliably estimate the true mean, it suffices that any constant positive fraction of the subsamples indexed by G₁, ..., G_k be free of outliers, while the popular median-of-means estimator requires at least 50% of the subsamples to be "clean." In practical applications, this difference can be substantial, and our simulation results (see section 5) confirm this observation.
Our first result presents non-asymptotic deviation bounds for the case when the sample does not contain outliers.
Combining Theorem 3.1 with Lemma 3.2 readily implies that µ̂_{N,p} admits sub-Gaussian deviation guarantees for s of order k, with k up to order N/log N. Indeed, in that case we get, with probability at least 1 − 3e^{−k}, that |µ̂_{N,p} − µ| ≤ Cσ √(k/N). (3.6) As we explain in the remark below, if k is chosen appropriately, this statement can often be strengthened to yield uniform deviation guarantees holding in the range 0 ≤ s ≤ k.
Remark 3.1. The dependence of the constant c on ζ(X) is inherited from Lemma 3.1. The constant ζ(X) can be arbitrarily large; therefore, the inequality of Theorem 3.1 does not hold with overwhelming probability uniformly over the class of distributions P_{2,σ}. To achieve uniformity, we need to assume slightly more about the distribution of X: for example, one may impose the "small ball" condition Q(u) := Pr(|X − EX| ≥ u) ≥ c̃ > 0 for a fixed u > 0, or the equivalence of moments of order 2 and 2 + δ for some δ > 0, namely that E|X − EX|^{2+δ} ≤ C̃ (E|X − EX|²)^{1+δ/2} for some fixed C̃ > 0. Then our bounds will depend on the constant c̃ or C̃ instead, and the dependence on ζ(X) can be suppressed: for instance, when the moments of order 2 and 2 + δ are equivalent, we have E(|X − µ|² 1{|X − µ| ≥ σ(2C̃)^{1/δ}}) ≤ σ²/2 in view of Markov's inequality, and thus ζ(X) ≤ (2C̃)^{1/δ}. This justifies the claim that assuming ζ(X) to be "small" is a relatively mild requirement. In simple terms, we ask that the distributions in question assign non-trivial mass to a fixed neighborhood of their means. It is also interesting to take the viewpoint in which the distribution of X is fixed while the parameters n, k → ∞: in this case, one can establish stronger claims about mean estimation; for instance, the deviations in (1.1) can be shown to be uniform over a range of values of the parameter s.
Remark 3.2. A more precise bound for the "bias term" φ(δ, n) has the form φ(δ, n) = o(n^{−δ/2}) for δ < 2 and φ(δ, n) = O(n^{−1}) for δ = 2. It is therefore easy to see that, whenever √N φ(δ, n) = O(√k), the sub-Gaussian deviation guarantees (3.6) hold uniformly over s up to the order of k (the latter restriction appears due to the fact that the probability of the event E_p depends on k as e^{−ck}).
In the case when δ = 1, φ(δ, n) = o(√(k/N)), so that the sub-Gaussian deviation guarantees hold for s up to the order of min(k, n). However, if k is large enough, we can still achieve the situation where φ(δ, n) = O(N^{−1/2}); in this case, the deviation guarantees hold uniformly over s up to the order of k. The price that we have to pay, however, is that k may be forced to grow arbitrarily slowly as a function of N, but this is unavoidable in general, as shown in [DLLO16].
Next, we discuss the more general contamination framework described in the introduction. For each block G_j, we denote by W_j the number of outliers in G_j and by µ̂_j^I (respectively, µ̂_j^C) the sample mean of the inliers (respectively, the outliers) within G_j. For every set of outliers O, we define a quantity α(O); informally speaking, α(O) can be viewed as a proxy for the magnitude of the outliers. The following extension of Theorem 3.1 holds.
One may notice that α(O)^{−(p−1)/2} ≤ 1 and that this quantity gets smaller as p grows, suggesting that the estimator µ̂_{N,p} becomes more robust to outliers of large magnitude as p increases. Next, let us discuss the term O/(k√n) that quantifies the dependence of the estimation error on the fraction of outliers ε = O/N. It is easy to see that the "best" choice of k, for which the terms φ(δ, n) and O/(k√n) are of the same order, is k ∝ N ε^{2/(1+δ)}, yielding the error rate ε^{δ/(1+δ)}, which is known to be optimal with respect to δ (e.g., see section 1.2 in [SCV17] or Lemma 5.4 in [Min18]). However, as the upper bound depends explicitly on the magnitude of the outliers through α(O), in some scenarios it can be much smaller than the worst case given by O/(k√n).
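The balancing act behind the choice k ∝ N ε^{2/(1+δ)}, namely equating the bias term of order n^{−δ/2} with the contamination term ε√n = O/(k√n), can be verified numerically; the sample size and contamination level below are hypothetical:

```python
def optimal_blocks(N, eps, delta):
    """Number of blocks balancing the bias term n^{-delta/2} against the
    contamination term eps * sqrt(n), where n = N / k is the block size:
    k proportional to N * eps^{2/(1+delta)}."""
    return max(1, int(N * eps ** (2.0 / (1.0 + delta))))

N, eps, delta = 10_000, 0.01, 1.0
k = optimal_blocks(N, eps, delta)
n = N // k
# both error contributions are then of order eps^{delta/(1+delta)}
print(k, n, n ** (-delta / 2), eps * n ** 0.5)
```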

Asymptotic efficiency.
The following result establishes the asymptotic efficiency (in the sense defined in section 1) of the estimator µ̂_{N,p} for any p ≥ 1 in the absence of outliers, implying that the estimator cannot be uniformly improved in general.
Assume that k_j, n_j → ∞ as j → ∞ in such a way that √N_j φ(δ, n_j) = o(1), where N_j := k_j n_j and φ(δ, n) was defined in Remark 3.2. Then for any p ≥ 1, √N_j (µ̂_{N_j,p} − µ) converges in distribution to N(0, σ²). The condition √N_j φ(δ, n_j) = o(1) is essentially a requirement that the bias of the estimator µ̂_{N_j,p} be asymptotically of order o(N_j^{−1/2}). It is not difficult to check that sequences {k_j}_{j≥1}, {n_j}_{j≥1} with the required properties exist for any distribution P ∈ P_{2,σ}; see Remark 3.2 for the details. For example, if E|X − µ|³ < ∞, it suffices to require that k_j = o(n_j).
Together, the results of section 3 imply that the estimator µ̂_{N,p} can be viewed as a true robust alternative to the sample mean: it preserves desirable properties such as asymptotic efficiency while being robust at the same time.

Comparison with existing techniques.
One of the most well-known consistent, robust estimators of the mean in the class P_{2,σ} is the median-of-means estimator [NY83, AMS96, LO11]. While it is robust to heavy tails and adversarial contamination, and is tuning-free, it is not asymptotically efficient: indeed, according to Theorem 4 in [Min19], the asymptotic variance of the median-of-means estimator is (π/2)σ². This fact is illustrated in our numerical experiments in section 5. Another family of estimators belonging to the broad class defined via equation (2.1) is discussed in section 2.4 of [Min19] and is defined as the minimizer over z ∈ R of Σ_{j=1}^k ρ(√n (µ̂_j − z)/∆), where ρ is Huber's loss ρ(z) = min(z²/2, |z| − 1/2) and ∆ > 0. The asymptotic variance of this estimator can be made arbitrarily close to σ²; however, achieving this requires σ² to be known. The construction of Catoni's estimator [Cat12] again requires knowledge of σ² (or a tight upper bound on it); moreover, it is not robust to adversarial contamination. Finally, the deviation bounds for the trimmed mean estimator obtained in [LM19b] are not uniform with respect to the confidence parameter s (meaning that different choices of s require the estimator to be re-computed), and its asymptotic efficiency, while plausible, has not been formally established. Moreover, the construction employed in [LM19b] requires sample splitting. Recently, Lee and Valiant [LV20] showed that it is possible to construct a mean estimator that achieves sub-Gaussian guarantees with essentially optimal constants; however, their estimator explicitly depends on the desired confidence level, and its asymptotic behavior is not discussed.
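For reference, the median-of-means baseline discussed here is a one-liner; a sketch (with a sequential block partition of our own choosing) together with a toy contamination check:

```python
import numpy as np

def median_of_means(x, k):
    """Median-of-means: the median of the k block means (sequential
    blocks for simplicity)."""
    x = np.asarray(x, dtype=float)
    n = len(x) // k
    return float(np.median(x[: n * k].reshape(k, n).mean(axis=1)))

rng = np.random.default_rng(7)
sample = np.concatenate([rng.standard_normal(990), np.full(10, 1e3)])
rng.shuffle(sample)
print(median_of_means(sample, 25))   # robust to the 10 planted outliers
print(float(sample.mean()))          # the sample mean is destroyed by them
```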
The only other robust, tuning-free estimator that is asymptotically efficient, albeit only for a subclass of P_{2,σ}, is a permutation-invariant version of the median-of-means estimator (which is also the higher-order Hodges-Lehmann estimator). It is defined as follows: let A_N^{(n)} be the collection of all subsets of {1, ..., N} of cardinality n, and let µ̂_U be the median of the sample means evaluated over all J ∈ A_N^{(n)}. We note that Card(A_N^{(n)}) equals the binomial coefficient "N choose n," so that for large N and n the exact evaluation of µ̂_U is not computationally feasible. The following result was established recently in [DR20]: assume that N_j = n_j k_j is the sample size, where n_j, k_j → ∞ as j → ∞ in such a way that n_j = o(√N_j); moreover, suppose that X is normally distributed with mean µ and variance σ². Then √N_j (µ̂_U − µ) converges in distribution to N(0, σ²). While it is likely that this result still holds for other symmetric distributions, the condition n_j = o(√N_j) is restrictive: for example, for a non-symmetric distribution possessing 3 finite moments, the bias of the estimator µ̂_U is of order n_j^{−1}, and the requirement n_j = o(√N_j) implies that this bias is asymptotically larger than N_j^{−1/2}. Finally, there is a growing body of literature related to sub-Gaussian mean estimators in R^d; see, for example, the papers [DM20, LM19a, Hop20] and references therein. These works are mainly concerned with rate optimality, and questions related to asymptotic efficiency have not been investigated in detail.
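Since evaluating µ̂_U over all "N choose n" subsets is infeasible, in practice one resorts to a Monte Carlo approximation over randomly drawn subsets; a sketch of one such scheme (the subset count B and the use of a plain median over the drawn subsets are our own simplifications, not the procedure analyzed in [DR20]):

```python
import numpy as np

def mom_u_approx(x, n, B=2000, seed=0):
    """Monte Carlo approximation of the permutation-invariant
    median-of-means: the median of the means of B random n-subsets."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    means = np.array([x[rng.choice(len(x), size=n, replace=False)].mean()
                      for _ in range(B)])
    return float(np.median(means))

rng = np.random.default_rng(5)
print(mom_u_approx(rng.standard_normal(400), n=20, B=300))
```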

Adaptation to the contamination proportion ε.
The number of outliers O is usually unknown in practice; therefore, it is desirable to have a procedure that can adapt to this unknown quantity. Fortunately, the proposed method admits a natural adaptive version. This extension is based on the following observation: assume that p = 1, and consider the estimation error µ̂_{N,1} − µ = (Σ_{j=1}^k (µ̂_j − µ)/σ̂_j) / (Σ_{j=1}^k σ̂_j^{−1}). The numerator of this expression admits an upper bound that holds for all choices of k with probability at least 1 − 2e^{−s} − ke^{−cn}, as shown in the proof of Theorem 3.2. Therefore, it suffices to choose k such that the harmonic mean k/(Σ_{j=1}^k σ̂_j^{−1}) is a good, in a relative sense, estimator of σ. Fortunately, the harmonic mean of the standard deviations is a fully data-dependent quantity that can be evaluated for any k; similar intuition holds for other values of p as well.
Based on the previous observation, we propose an adaptive estimator µ̂_p defined as follows. We choose k as the smallest integer, on a logarithmic scale, which guarantees that the harmonic mean of the σ̂_j^p is not too large compared to σ^p, in the sense defined by (3.5). To this end, we only need a good preliminary estimator of σ to which the harmonic means can be compared. Assume that we are given an estimator σ̂ satisfying 1/20 ≤ σ̂/σ ≤ 4 (4.1) with high probability. This assumption is not restrictive since, as we show in section 6.8, one can construct σ̂ such that (4.1) holds with probability at least 1 − e^{−cN} for some absolute c > 0, under mild conditions. Next, for each positive integer k, let E_p(k) be the analogue of the event (3.5) with the sample split into k blocks and with σ replaced by σ̂. Finally, define k̂ via log₂ k̂ := inf{i ∈ {1, ..., log₂ N} : E_p(2^i) holds} ∨ 1, and set the corresponding estimator µ̂_p(s) := µ̂_{N,p}(k̂ ∨ (s + 1)). The following bound is the main result of this section; essentially, it states that µ̂_p(s) is a robust estimator that is fully adaptive and provides sub-Gaussian deviation guarantees.
where c > 0 depends only on the distribution of X and C p > 0 depends only on p.
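The doubling search over k described in this section can be sketched as follows; the acceptance test below is a simplified stand-in for the event E_p(2^i), with the tolerance constant chosen arbitrarily:

```python
import numpy as np

def adaptive_k(x, sigma_hat, p=1, tol=2.0):
    """Smallest k on the dyadic grid 2, 4, 8, ... such that the harmonic
    mean of the block sigma_j^p stays below tol * sigma_hat^p (a
    simplified stand-in for the smallest i such that E_p(2^i) holds)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    i = 1
    while 2 ** i <= N // 2:                   # at least two points per block
        k = 2 ** i
        n = N // k
        sig = x[: n * k].reshape(k, n).std(axis=1)
        if np.all(sig > 0):
            hmean = k / np.sum(sig ** (-float(p)))
            if hmean <= tol * sigma_hat ** p:
                return k
        i += 1
    return 2                                   # fallback: coarsest split
```

On clean data the test typically passes already at k = 2, while spread-out outliers force k̂ up to roughly the scale of O, in line with the requirement that a constant fraction of the blocks be outlier-free.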

Numerical simulation results.
The goal of this section is to compare the performance of the estimators µ̂_{N,p} for different values of p ≥ 1, as well as to evaluate them against benchmarks given by other popular techniques such as the median-of-means estimator and the "oracle" trimmed mean estimator (labeled "trim" in the figures), which takes the contamination proportion ε as its input.
Our simulation setup was defined as follows: N = 2500 observations were sampled from the half-t distribution with 4 degrees of freedom (d.f.). This distribution is asymmetric; therefore, the results allow us to evaluate the degree to which bias affects the performance of different robust estimators. A linear transformation was applied so that the mean and variance of the generated data are 0 and 1, respectively. Next, O ∈ {0, 50, 100, 150} randomly selected observations were replaced by outliers given by a point mass at x₀ = 10³; this type of outliers appears to be the most challenging for the trimmed mean estimator, as it creates bias due to "inliers" being removed from only one of the tails of the distribution. We compared 4 estimators: the median-of-means (MOM) estimator defined after equation (2.1), the estimators µ̂_{N,1} and µ̂_{N,2} corresponding to the choice of weights (2.2) with p = 1 and p = 2, and the "oracle" trimmed mean estimator [LM19b] that knows the number of outliers. Specifically, the trimmed mean was computed by removing the smallest εN + 5 as well as the largest εN + 5 observations, where 5 was added to account for the "outliers" due to the heavy tails, and averaging the rest. The estimators µ̂_{N,1}, µ̂_{N,2}, and MOM were evaluated for the values of the parameter k ∈ {25, 50, 75, 100, 125, 150, 175, 200} that controls the number of subgroups.
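The setup just described is easy to reproduce; a condensed sketch (standardizing the half-t samples via the facts that E|t₄| = 1 and Var|t₄| = 1, with our own seed and with O = 150):

```python
import numpy as np

rng = np.random.default_rng(3)
N, O = 2500, 150
# half-t with 4 d.f.: |T| has mean 1 and variance 1, so |T| - 1 has
# mean 0 and variance 1
data = np.abs(rng.standard_t(4, size=N)) - 1.0
idx = rng.choice(N, size=O, replace=False)
data[idx] = 1e3                            # point-mass outliers at x0 = 10^3

k = 100
n = N // k
blocks = data[: k * n].reshape(k, n)
means, sigmas = blocks.mean(axis=1), blocks.std(axis=1)
mom = float(np.median(means))              # median-of-means benchmark
w = sigmas ** (-2.0)
mu_n2 = float(np.dot(w / w.sum(), means))  # the estimator mu_{N,2}
print(mom, mu_n2)    # most blocks are contaminated, so MOM breaks down,
                     # while the corrupted blocks carry negligible weight
```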
For each combination of values of O and k, the simulation was repeated 1000 times; we present 3 summary statistics in the plots below: the average error (Figure 1), the standard deviation (Figure 2), and the maximal (over 1000 repetitions) absolute error (Figure 3).
Overall, the numerical experiments confirm our theoretical findings. Here is a summary of the simulation results: 1. In the setup with no contamination (O = 0), all estimators showed good performance, with µ̂_{N,1} slightly but consistently beating µ̂_{N,2} on average, while µ̂_{N,2} had the smallest maximal error among all estimators; the empirical standard deviations of µ̂_{N,1} and µ̂_{N,2} were consistent with the theory-predicted values. 2. As O increased, µ̂_{N,2} performed better than µ̂_{N,1}, while both estimators were significantly better than MOM. 3. Both µ̂_{N,1} and µ̂_{N,2} showed consistent performance as the number of blocks k increased; moreover, unlike MOM, these estimators performed well even in the challenging setup where O is comparable to k.

Proofs.
This section contains detailed proofs of the main results of the paper.
6.1. Results related to the deviations of self-normalized sums.

Bounds for the moment generating function of the t-statistic.
Recall that T_j := (µ̂_j − µ)/σ̂_j and Q_j := (µ̂_j − µ)/V_j. Lemma 6.1. There exists c_p > 0 such that, for all λ ∈ R, we have E(e^{λw₁}) ≤ e^{c_p λ²/(2n)}.

It follows from Jensen's inequality that E e^{λw₁} ≤ E e^{λ(w₁ − w₁′)}, where w₁′ is an independent copy of w₁. Finally, it is well known that, in view of (6.1), the latter is bounded by e^{λ² c_p/(2n)} for some c_p > 0 depending only on p (for instance, this follows from Proposition 2.5.2 in [Ver18]).

Auxiliary technical results.
Lemma 6.2. Let p ≥ 1 and δ ≥ 1, and assume that E(|X − µ|^{1+δ}) < ∞. Then for any 1 ≤ j ≤ k, the first bound below holds for δ < 2, while the second holds for δ ≥ 2. Proof. Due to homogeneity, we can assume that σ = 1 without loss of generality. We will also assume that µ = 0; otherwise, X_i should be replaced by X_i − µ for all i. Observe that the quantity of interest can be expressed via Z, where Z² := Σ_{i=2}^n X_i². Consider the event O₁ = {Z² ≥ n/4}, and recall that, in view of Lemma 3.1, Pr(O₁^c) ≤ e^{−cn} for some c = c(P) > 0 that depends on the distribution P of X. Consider next the event O₂, defined via the sequence {α_n}_{n≥1}, which is chosen as follows: consider the non-increasing function g(u) = E(|X|^{1+δ} 1{|X| ≥ u}), and observe that lim_{u→∞} g(u) = 0. Therefore, taking α_n := g(n^{1/4})^{1/(2+δ)} ∨ n^{−1/4}, we get that α_n → 0 and, moreover, the required bound holds. It is easy to see that

Indeed, Markov's inequality, combined with Hölder's inequality, implies the stated bound. Next, we reduce the problem to the case where X and Z are bounded. Define the event O := O₁ ∩ O₂. Letting F be the distribution function of X, we deduce the corresponding bound conditionally on Z. In the derivation above, we used the elementary inequality with 0 < t := x²/Z², together with the fact that E|X|^{1+δ} < ∞. Combining (6.4) with (6.2) and (6.3), we obtain the claim whenever δ < 2; for δ = 2, all the terms are of order o(n^{−1}) except C(p) n^{p/2} E|X|^{1+δ} α_n^{2−δ} / n^{(p+δ)/2}, which is O(n^{−1}). Remark 6.1. It follows from the previous argument that the term o(n^{−δ/2}) admits an explicit form. Remark 6.2. The key quantity of interest in the previous proof is the expression that was then estimated from above. Let us present a counterexample showing that one cannot improve the result of Lemma 6.2 when δ ≥ 2 for p = 1. To this end, let X be a random variable such that Pr(X = a) = 1/(1 + a²) and Pr(X = −1/a) = a²/(a² + 1) for some 1 < a² ≤ 2, and assume that n ≥ 8.
Observe that X is a.s. bounded by a, centered, and has variance 1. Given x, y > 0, we write x ≍ y when c ≤ x/y ≤ C for some absolute constants c, C > 0. Let E_Z denote the conditional expectation with respect to Z. It is easy to check that the stated relation holds on the event A := {Z² ≥ n/4}, where we have used that on A both a² and 1/a² are smaller than Z², and that Z² ≍ n. Since a does not depend on n, X is a.s. bounded by an absolute constant, and Pr(A) ≥ 1 − e^{−cn} for some absolute constant c > 0. It follows that the stated bound holds for p = 1. Although X admits infinitely many finite moments, the previous bound cannot be improved beyond three moments due to the asymmetry of the distribution of X.
Proof. Again, we can assume without loss of generality that σ² = 1 and that EX = 0. Observe that the claimed convergence holds: indeed, 1{X₁² + Z² ≥ n/4} → 1 in probability in view of Lemma 3.1, while ((X₁² + Z²)/n)^p → 1 in probability by the Law of Large Numbers.
As in the proof of Lemma 6.2, we deduce that Pr(O₂^c) = o(n^{−1}) and that the corresponding estimate holds. Next, we reduce the problem to the case where X and Z are bounded, so that (6.7) follows. Then, letting F be the distribution function of X, we deduce the analogous bound conditionally on (Y, Z). In the derivation above, we used a bound whose last inequality follows from the elementary bound (6.5), together with (6.6). Therefore, we obtain the claim, concluding the proof.
where c > 0 depends only on ζ(X), φ(δ, n) = o(n^{−δ/2}) for δ < 2, and φ(δ, n) = O(n^{−1}) otherwise. Moreover, if Var(X) < ∞, then the second claim below holds as well. Proof. We will prove the two claims separately. Recall the algebraic identity σ̂₁ = V₁ √(1 − Q₁²). To deduce the first inequality, we use the elementary inequality (6.8), which holds for all 0 ≤ t ≤ 1/2. Taking (3.2) into account, we get the bound (6.9) for an absolute constant C > 0; indeed, it follows directly from an inequality valid for all x ≥ 0. As a consequence, and using (6.9) once more, we conclude from Lemma 6.2, together with (6.10) and (6.11), that (6.12) holds. The first claim is a consequence of (6.10) and (6.12), since n^{−3/2} is always smaller than φ(δ, n).
Next, we establish the second claim of the lemma. Since, due to the first inequality of the lemma, the contribution of the complement of O vanishes as n goes to infinity, it is enough to prove that the second moment converges to 1. We follow the same steps as in the first part, using (6.8) in the first inequality and (6.9) in the second. Moreover, using (6.9) again and combining (6.13) with (6.14), we obtain the claim; the conclusion then follows immediately from Lemma 6.3.

Proof of Lemma 3.2.
We will first consider the outlier-free case, meaning that O = 0. It is then easy to see that Bennett's inequality applies and yields the bound for some absolute constant c > 0, where π := Pr(σ̂₁² ≥ 4σ²) ≤ Pr(V₁² ≥ 4σ²) ≤ 1/4. Alternatively, if X possesses more than two finite moments, we can apply the von Bahr-Esseen inequality [vBE+65] for any δ ≥ 1; it yields the claim for c > 0 depending only on the ratio E|X − µ|^{1+δ}/σ^{1+δ}. When X has a sub-Gaussian distribution, we instead use the Hanson-Wright inequality [HW71] and deduce the corresponding bound, where c > 0 is an absolute constant and ‖X‖_{ψ₂} is the ψ₂-norm of X. In this case, (6.15) yields the conclusion for an absolute constant c₁ > 0.
Next, we consider the case O > 0. Let σ̂_(1), ..., σ̂_(k) be the increasing order statistics corresponding to σ̂₁, ..., σ̂_k. If O ≤ Ck for C < 1, then at least a (1 − C) fraction of the data buckets is outlier-free. Let J be the index set of these buckets, so that Card(J) ≥ (1 − C)k.
Hence, we get the stated bound. The final result follows from (6.16) and (6.17) upon replacing k by (1 − C)k.
Lemma 6.6. Let σ̂_n be such that σ̂_n = V_n √(1 − Q_n²), and let O := {|Q_n| ≤ 1/2} ∩ {V_n² ≥ σ²/4}, in the previous notation. Then the stated convergence holds. Proof. We have Q_n ≤ 1/2 and σ̂_n² ≥ (3/4)V_n² ≥ (3/16)σ² on O. Therefore, the first bound follows, where we used that the corresponding elementary inequality holds for x ≥ 3y/16 > 0. Moreover, the second bound follows, where we employed inequality (6.9) and Lemma 3.1. Observe that the random variable ((V_n² − σ²)/(V_n² + σ²)) 1{O} converges to 0 in probability (in view of the Law of Large Numbers) and is bounded; hence the convergence also holds in L¹. This completes the proof.
Let p ≥ 1. Denote µ̂ := µ̂_{N,p} and consider the events (6.18). Using Lemma 3.1 and inequality (3.3), we obtain the corresponding estimate for some constant c > 0 depending on the distribution of X; therefore, the analogous bound holds for all t > 0. Recall the definition (3.1) of the t-statistics T₁, ..., T_k. The following chain of inequalities then holds for some c_p > 0 depending only on p. Choosing t appropriately, we obtain the desired bound, where we used the Chernoff bound in the last step. Combining the display above with Lemma 6.4, we conclude that, for all s > 0, the stated inequality holds for some C_p > 0 depending only on p. When ke^{−cn} ≥ 1, the previous bound is trivial. It follows that the claim holds for all s > 0.
The proof follows steps similar to the argument used to establish Theorem 3.1. We will first show that, with high probability, the proportion of outliers in each bucket of observations is less than 1/2. Indeed, letting W_j denote the number of outliers in the subsample indexed by G_j, it is straightforward to see that Σ_{j=1}^k W_j = O and that the random variables {W_j, j = 1, ..., k} are negatively correlated. Consider the event E₂. Recall that W_j = Σ_{i∈G_j} 1{i ∈ O_j}. Since Σ_{j=1}^k W_j = O, the random variables (1{i ∈ O_j})_{i∈G_j} are 1-negatively correlated for each j = 1, ..., k, as a subsequence of a 1-negatively correlated sequence of random variables. Applying the Chernoff bound for negatively correlated random variables (see [Doe20, section 1.10.2.2 and Theorem 1.10.23] for the definitions and the required version of the Chernoff bound), we obtain the desired bound as long as O ≤ N/4. Hence, in what follows, we can restrict our attention to the event E₂. We use the superscript I to denote the "clean" part of the sample and the superscript C the "corrupted" part. Notice that µ̂_j decomposes accordingly, where µ̂_j^C and µ̂_j^I are, respectively, the empirical means of the corrupted and clean parts of the subsample indexed by G_j. We also have the analogous decomposition of the variance, where (σ̂_j^C)² and (σ̂_j^I)² are, respectively, the empirical variances of the corrupted and clean parts of G_j. Observe that σ̂_j² ≥ (σ̂_j^I)²/2; therefore, as in the previous proof, we deduce that the weights α_j given by (2.2) cannot be too large even when outliers are present in the sample. Consider the events O_j. Using Lemma 3.1 and inequality (3.3), we obtain the corresponding bound for some constant c > 0 that depends only on the distribution of X. In the rest of the proof, we assume that the event E ∩ E_p holds, with E_p defined in (3.5). On this event, the error decomposes into the terms (A), (B), and (C), which we proceed to estimate separately.
Control of (A): Using (6.19), observe that on O j we have for some absolute constant C > 0. It comes out that . Hence it follows from Cauchy- As a consequence, Observe that the previous statement holds pointwise, and is not probabilistic in nature. It also suggests that the worst scenario occurs whenever all buckets are corrupted. Control of (B): Since σ 2 j ≥ σ I j 2 /2 under O j , we have that Hence we can show, as in Lemma 6.1, that the random variable is sub-Gaussian. Following the same arguments as in Theorem 3.1, this leads to the bound that holds with probability at least 1 − 2e −s for some C p > 0. Control of (C): As for the "bias term," it is enough to observe that for uncorrupted buckets, σ j = σ I j , and the bias can be upper bounded exactly as in Theorem 3.1. Hence At the same time, for the corrupted part of the bias term, we have on O j that , for some C p > 0 depending only on p. Hence for some C p > 0, where we have used inequality (6.9) and the fact that This concludes the proof of the fact that with probability at least 1 − 2e −s − ke −cn − Pr(E c p ), (6.20) 6.6. Proof of Theorem 3.3.
Using the definition of φ, it is easy to see that √N_j φ(δ, n_j) = o(1) implies that k_j = o(n_j), which in turn implies that k_j = o(e^{c n_j}) for any constant c > 0. We recall the relevant notation, and use the following decomposition, which holds on the event E defined in (6.18) in the proof of Theorem 3.1.
where H := 1{E}. Using Lemma 6.6, we obtain the convergence of the first term; moreover, using Lemma 6.4, we control the second. For any integer m, we denote by E_p(m) the event E_p defined via (3.5) with m blocks, and for every event A, A^c denotes its complement. Observe that, as long as 1/20 ≤ σ̂/σ ≤ 4, the stated inclusions hold. Finally, we recall that when E_p holds and k ≤ 3(O ∨ s), the deviation bound holds with probability at least 1 − 2e^{−s} − ke^{−cN/(O∨s)}, as shown in (6.20). Combining the previous results, we conclude that the desired bound holds, so that the event E_p holds.
6.8. Construction of a robust estimator of σ.
Note that Lemma 6.7 requires the additional condition E|X − µ| ≥ σ/2. This condition is mild and can be viewed as an equivalence between the first absolute moment and the second moment, which is less restrictive than the equivalence between centered moments of order 2 and 2 + δ. It may also be seen as the price to pay for adaptation under just two finite moments.