A MOM-based ensemble method for robustness, subsampling and hyperparameter tuning

Hyperparameter tuning and model selection are important steps in machine learning. Unfortunately, classical hyperparameter calibration and model selection procedures are sensitive to outliers and heavy-tailed data. In this work, we construct a selection procedure based on a median-of-means principle, which can be seen as a robust alternative to cross-validation. Using this procedure, we also build an ensemble method which, given a collection of algorithms and a corrupted, heavy-tailed dataset, selects an algorithm, trains it on a large uncorrupted subsample and automatically tunes its hyperparameters. The construction relies on a divide-and-conquer methodology, making the method easily scalable for autoML on a corrupted database. The method is tested with the LASSO, which is known to be highly sensitive to outliers.


Introduction
Robustness has become an important subject of interest in the machine learning community over the last few years because large datasets are very likely to be corrupted. This may happen due to hardware, storage or transmission issues, for instance, or as a result of (human) reporting errors. As can be seen, for instance, in Figure 1 in [29] and Figures 1 and 5 in [30], many learning algorithms based on empirical risk minimization (including the LASSO) may be completely misled by a single corrupted example.
Robust alternatives to empirical risk minimizers and their penalized/regularized versions have been studied in density estimation [5] and least-squares regression [4,36,20,48,53]. Various robust descent algorithms have also recently been considered [44,42,43,23,22]. Despite this important progress, the final steps of a data scientist's routine, which are estimator selection and hyperparameter tuning [8,11,7], are yet to receive a proper treatment. In practice, practitioners usually have several algorithms at their disposal, each requiring one or several parameters to be tuned. An alternative to estimator selection is aggregation (a.k.a. ensemble methods) [16,51,40], which outputs e.g. a linear or convex combination of the candidate estimators; classical examples include binning, boosting, bagging and stacking.
The most common procedure used to select or aggregate candidate estimators is (cross-)validation: the dataset is partitioned (several times in the case of cross-validation) into a training sample used to build candidate estimators and a test sample used to estimate their risks. The final estimator is either the candidate with lowest estimated risk, or a linear combination of the candidates with coefficients depending on the estimated risks. Even if some candidate estimators are robust, outliers from the test set may mislead the selection/aggregation step, resulting in a poor final estimator. This raises the question of a robust selection/aggregation procedure, which is addressed in the present work.
There exist many data-driven methods to tune hyperparameters or to select an estimator from a collection of candidates. Among these, one can mention the SURE method [47], model selection [8,11,12,35,9,38], where penalization methods are used to select among candidates built with the same data as those used to build the original estimators, selection, convex or linear aggregation [51,45,52], cross-validation [2,3], or the Lepski [33] and Goldenshluger-Lepski [21] methods, to name a few. To the best of our knowledge, all these techniques either use a classical non-robust validation principle or estimate the risk with the non-robust empirical risk. A notable exception is the estimator selection procedure of [6,7], which is robust in general settings [6] and extremely efficient in Gaussian linear regression [7]. Its main drawback is that it requires robust tests in Hellinger distance that may be hard to compute for general learning problems where one does not specify statistical models with bounded complexities.
The first contribution of this paper is a general and robust estimator selection procedure with provable theoretical guarantees, which can be viewed as a robust alternative to cross-validation. Roughly stated, the procedure uses a median-of-means principle [1,24,41] to build robust pairwise comparisons between candidates, and the final estimator is then selected by a minmax procedure in the spirit of [4,28] or the Goldenshluger-Lepski method [21], see Section 3 for details. The method is easily implementable. We here focus on least-squares regression and refer to [34] for other examples including density estimation and classification.
The second contribution is the definition of an ensemble method based on this selection procedure and a subsampling strategy. Two of the main ideas behind this method are that subsampling can provide robustness by avoiding outliers, and that the choice of the subsample can itself be seen as a hyperparameter to be tuned. Estimator selection procedures can then be used to simultaneously select the best algorithm, an uncorrupted subsample and the best hyperparameters. Moreover, the method is computationally attractive.
Subsampling is usually used in machine learning for computational reasons: some algorithms require breaking large datasets into smaller pieces [25], for instance in supervised learning [17, for classification and regression] and [37, for matrix factorization]. A natural way to divide-and-conquer corresponds to the older idea of subagging [15, subsample aggregating], a variant of bagging [14]: one randomly chooses several small subsets of data, builds an estimator from each subsample, and aggregates them into a single estimator. For instance, the bag of little bootstraps [26] builds confidence intervals in this way. Subagging is also used for large-scale sparse regression [13].
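As an illustration, subagging can be sketched in a few lines of Python. The base learner, subsample fraction and number of subsamples below are illustrative choices of ours, not taken from the works cited above:

```python
import numpy as np

rng = np.random.default_rng(0)

def subag_predict(X, y, X_new, fit, n_sub=10, frac=0.2):
    """Subagging: fit a base learner on several random subsamples
    (without replacement) and average the resulting predictions."""
    n = len(y)
    preds = []
    for _ in range(n_sub):
        idx = rng.choice(n, size=max(1, int(frac * n)), replace=False)
        model = fit(X[idx], y[idx])   # `fit` returns a predict-callable
        preds.append(model(X_new))
    return np.mean(preds, axis=0)

# toy base learner: ordinary least squares, returned as a closure
def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda Z: Z @ beta

X = rng.normal(size=(200, 3))
beta_star = np.array([1.0, -2.0, 0.5])
y = X @ beta_star + 0.1 * rng.normal(size=200)
yhat = subag_predict(X, y, X[:5], fit=ols)
```

Each subsample here is drawn independently; the ensemble method of Section 4 instead organizes subsamples into nested partitions and replaces averaging by a robust selection step.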
The paper is organized as follows. Section 2 presents the general prediction setting we consider, and Section 3 introduces the robust estimator selection procedure. Theoretical guarantees for the latter are given in Theorem 3.2. The ensemble method is defined in Section 4 and applied to the LASSO in Section 5. Applications to the ERM in linear aggregation are presented in Appendix A. Numerical experiments are presented in Section 6. The proofs are deferred to Appendices B and C.
Setting

Let X be a measurable space, let P be a probability distribution on X × R, and let (X, Y) ∼ P. For any probability measure Q on X × R and any measurable function g : X × R → R, write Q[g] := E_{(X,Y)∼Q}[g(X, Y)]. Let F be a linear subspace of L²(P_X). We call estimator any element of F. For f ∈ F, let γ(f) : X × R → R denote the square-loss function associated with f, defined for all (x, y) ∈ X × R by γ(f)(x, y) = (y − f(x))². For any function f : X → R in L²(P_X), let R(f) denote its risk, R(f) := P[γ(f)], and let f* be the oracle: f* := arg min_{f∈F} R(f). Let ℓ denote the excess risk with respect to f*: ℓ(f) := R(f) − R(f*) = P[(f − f*)²]. The second equality holds since F is a linear space (f* is then the L²(P_X)-projection onto F). A learning algorithm is a measurable map G : ∪_{n⩾1} (X × R)^n → F which takes a dataset of any (finite) size as input and outputs an estimator in F.
Assumption 1. Let χ, σ > 0 be such that, for every f ∈ F, (P[f⁴])^{1/4} ⩽ χ (P[f²])^{1/2} and P[(Y − f*(X))² f(X)²] ⩽ σ² P[f²].
This assumption only involves second and fourth moments. The first condition, (P[f⁴])^{1/4} ⩽ χ (P[f²])^{1/2}, is satisfied for instance by linear functions f(·) = ⟨·, t⟩ for t ∈ R^d when X is a d-dimensional vector with independent entries admitting a fourth moment [39]. It therefore covers heavy-tailed cases beyond classical L∞-boundedness or subgaussian assumptions. The second condition holds when the noise ζ := Y − f*(X) is independent of X and has a second moment, which is a very standard statistical modeling assumption when Y = f*(X) + ζ with ζ independent of X. It also holds when Y − f*(X) has a fourth moment, by the Cauchy-Schwarz inequality.
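To illustrate the first condition of Assumption 1, the following sketch (dimensions and sample size are our illustrative choices) estimates the L⁴/L² ratio of a linear function under a standard Gaussian design, for which the exact ratio is 3^{1/4} ≈ 1.316, comfortably below any χ ⩾ 3^{1/4}:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 200_000
X = rng.normal(size=(n, d))        # standard Gaussian design (illustration)
t = rng.normal(size=d)
f = X @ t                          # f(X_i) = <X_i, t>
# empirical (E f^4)^{1/4} / (E f^2)^{1/2}
ratio = np.mean(f ** 4) ** 0.25 / np.mean(f ** 2) ** 0.5
```

For a Gaussian f with variance v, E f⁴ = 3v², hence the exact value 3^{1/4}; heavier-tailed designs give larger but still finite ratios as long as fourth moments exist.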
Minmax-MOM selection procedure

Let N ⩾ 1 be the size of the dataset (X_i, Y_i)_{i∈[N]}, which is partitioned into informative data and outliers: [N] = O ⊔ I. Informative data (X_i, Y_i)_{i∈I} are assumed independent and identically distributed (i.i.d.) with common distribution P. No assumption is made on the outliers (X_i, Y_i)_{i∈O}. Of course, the partition O ⊔ I is unknown to the learner. We call subsample any nonempty subset B ⊂ [N] (or the corresponding data (X_i, Y_i)_{i∈B}) and, for any measurable function g : X × R → R, denote P_B[g] := (1/|B|) Σ_{i∈B} g(X_i, Y_i).
Let (f̂_m)_{m∈M} be a finite collection of estimators. An oracle choice would be the estimator f̂_{m_o} with minimal excess risk:
m_o := arg min_{m∈M} ℓ(f̂_m). (1)
The excess risks are unknown: let us construct a robust estimator of those quantities. For each pair m, m′ ∈ M, let (B_v^{(m,m′)})_{v∈[V]} be a partition into V blocks of a subset of the data not used to train f̂_m and f̂_{m′}, and define the median-of-means comparison T(m, m′) := med_{v∈[V]} P_{B_v^{(m,m′)}}[γ(f̂_m) − γ(f̂_{m′})]. The selection of the final estimator is obtained by plugging these median-of-means (MOM) estimators into equation (1). In other words, we select f̂_m̂, where
m̂ := arg min_{m∈M} max_{m′∈M} T(m, m′). (2)
Thanks to the median-of-means operator, the risk of the selected estimator f̂_m̂ is expected to be close to the risk of the best estimator f̂_{m_o}, even for heavy-tailed and corrupted data, because T(m, m′) is an outlier-robust, subgaussian estimator of P[γ(f̂_m) − γ(f̂_{m′})], even for heavy-tailed data [34].
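A minimal sketch of the selection rule (2). The block-splitting and the loss-matrix layout are implementation choices of ours, not prescribed by the text:

```python
import numpy as np

def minmax_mom_select(losses, V, rng):
    """losses: (M, N) array, losses[m, i] = squared loss of candidate m on
    data point i. Returns the index m_hat minimizing over m the maximum over
    m' of the median, over V random blocks, of the block-averaged loss
    difference of candidates m and m' (the comparison T(m, m'))."""
    M, N = losses.shape
    blocks = np.array_split(rng.permutation(N), V)
    # block means of each candidate's losses: shape (M, V)
    bm = np.stack([[losses[m, b].mean() for b in blocks] for m in range(M)])
    # T[m, m'] = median over v of (bm[m, v] - bm[m', v])
    T = np.median(bm[:, None, :] - bm[None, :, :], axis=2)
    return int(np.argmin(T.max(axis=1)))
```

With a few corrupted points inflating the better candidate's losses, selection by plain empirical means picks the worse candidate, while the median over V blocks ignores the (at most few) contaminated blocks.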
Median-of-means estimators were introduced in [1,24,41]. Median-of-means pairwise comparisons have been used to build robust estimators in [36,27]. Minmax strategies have been used in [4,28] for least-squares regression and in [5] for density estimation. Finally, the minmax principle has been used for (non-robust) selection of estimators in [21].
Remark 3.1 (Minmax-MOM selection for divide-and-conquer). It is classical to use divide-and-conquer approaches [25] to deal with large databases: the database is divided into small batches, algorithms are run on each batch and the results are "aggregated". The minmax-MOM selection procedure (2) can perform this kind of aggregation. Denote by B_m the block of data hosted on server m ∈ M and train the estimators f̂_m for all m ∈ M. Then, for all m, m′ ∈ M, compute the V real numbers P_{B_v^{(m,m′)}}[γ(f̂_m) − γ(f̂_{m′})], v ∈ [V], and take their median. Finally, compute the minmax-MOM estimator (if there are too many pairs, choose m and m′ at random in M). Following the map-reduce terminology [18], the mapper is the training of the procedure itself and the V evaluations; the reducer is the computation of the (|M| choose 2) medians of differences of empirical risks and the minmax-MOM selection (2).
Theorem 3.2 (Robust oracle inequality). Grant Assumption 1 and assume V ∈ [3|O|, N/8]. Then, with probability larger than 1 − |M|² e^{−V/48}, the estimator f̂_m̂, where m̂ is selected by the minmax-MOM selection procedure (2), satisfies, for all ε > 0, an oracle inequality of the form
(1 − a_{ε,V}) ℓ(f̂_m̂) ⩽ (1 + a_{ε,V}) min_{m∈M} ℓ(f̂_m) + b_{ε,V}.
The proof of Theorem 3.2 is postponed to Section C.1. Roughly speaking, Theorem 3.2 states that, with exponentially large probability, the selected estimator (2) has the excess risk of the best estimator in the collection (f̂_m)_{m∈M}. Following [19], this result is called an oracle inequality. We call it robust because it holds under moment assumptions on the linear space F (see Assumption 1) and for a dataset that may contain outliers. The residual term b_{ε,V} is of order V/N. If log|M| ≳ |O| and V ≍ log|M|, the residual term is of order log|M|/N, which is minimax optimal [50]. The oracle inequality is informative when a_{ε,V} < 1, which holds if χ ≲ √(N/V). The "constant" χ in Assumption 1 may therefore grow with the dimension of F, as in the examples of [46], without breaking the results.

An ensemble method to induce robustness, subsampling and hyperparameters tuning
In this section, we define an ensemble method which takes one or several (not necessarily robust) algorithms as input and outputs an estimator. The method is robust to the presence of outliers, has subsampling capabilities, and automatically tunes hyperparameters. The main ideas behind the construction are: using subsampling as a way of achieving robustness (by avoiding outliers), viewing the choice of the subsample as a hyperparameter to be tuned, and using the robust selection procedure from Section 3 to select the final estimator. Performance guarantees are established in Corollary 4.1.

Definition of the method
Let (G_λ)_{λ∈Λ} be a finite collection of learning algorithms which output estimators in F. The collection may in fact correspond to a single algorithm with several combinations of hyperparameter values, or to several different algorithms, each with several combinations of hyperparameter values.
We now construct the set B of subsamples to be considered by the method. Assume N ⩾ 8. Let K_min and K_max be integers such that 3 ⩽ K_min ⩽ K_max ⩽ log₂ N (these parameters specify the range of subsample cardinalities). For each K ∈ {K_min, ..., K_max}, consider a partition (B_k^{(K)})_{k∈[2^K]} of [N] into 2^K blocks, and let
B := {B_k^{(K)} : K_min ⩽ K ⩽ K_max, k ∈ [2^K]} and M := Λ × B. (3)
Since K_min ⩾ 3, each subsample in B has cardinality at most N/4. Then, as in Section 3, for each m = (λ, B) ∈ M, let f̂_m be the estimator trained by algorithm G_λ using subsample B, in other words f̂_m := G_λ((X_i, Y_i)_{i∈B}). Finally, using the minmax-MOM selection procedure (2) from Section 3, we select from the collection (f̂_m)_{m∈M} the final estimator f̂_m̂.
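The construction of B and M = Λ × B can be sketched as follows. N, Λ and the use of a single random permutation are illustrative choices of ours; sharing one permutation across levels also makes the partitions nested, in the spirit of Section 4.3:

```python
import numpy as np
from itertools import product

def build_subsamples(N, K_min, K_max, rng):
    """For each K in {K_min, ..., K_max}, split a (shared) random permutation
    of range(N) into 2**K consecutive blocks; return the list of all blocks,
    i.e. the set B of candidate subsamples."""
    perm = rng.permutation(N)
    B = []
    for K in range(K_min, K_max + 1):
        B.extend([np.sort(b) for b in np.array_split(perm, 2 ** K)])
    return B

rng = np.random.default_rng(0)
B = build_subsamples(N=1024, K_min=3, K_max=4, rng=rng)
lambdas = [0.01, 0.1, 1.0]                       # illustrative hyperparameter grid
M = list(product(lambdas, range(len(B))))        # M = Lambda x B (subsample index)
```

With K_min = 3 and K_max = 4, B contains 8 + 16 = 24 subsamples of sizes N/8 and N/16, all at most N/4 as required.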

Theoretical guarantees
The following result states that if risk bounds hold for the algorithms (G_λ)_{λ∈Λ} in a context with no outliers, then f̂_m̂ essentially satisfies the best of those risk bounds, even in the presence of outliers.
Corollary 4.1 is proved in Section C.2. Let us stress some important aspects.
The estimators f̂_{λ,B} for (λ, B) ∈ M are assumed to satisfy an excess risk bound with rate ρ(λ, |B|) only with constant probability (the constant 1 − exp(−1/48) chosen here has nothing special), when B is large enough and only contains informative data. For example, this condition is met by the ERM when informative data satisfy moment assumptions; see Propositions 5.1 and A.1. With these arguably weak requirements, the above statement claims that the ensemble method achieves the best of the bounds ρ(λ, |B|) over (λ, B) ∈ M. The upper bound ρ(λ, |B|) on the excess risk of f̂_{λ,B} depends on λ and on the size |B| of the subsample; it improves with the sample size by the monotonicity assumption on ρ. Finally, the function λ → ν(λ) is introduced to handle situations where the risk bound holds only when the sample size is larger than ν(λ).

An efficient partition scheme of the dataset
The partitions (B_k^{(K)})_{k∈[2^K]}, K_min ⩽ K ⩽ K_max, can be constructed in many different ways. This section presents a specific choice for those partitions which yields a computational advantage by significantly reducing the number of empirical risks P_B[γ(f̂_m)] to be computed (by making many of them redundant). This complexity reduction makes the computations from Section 6 possible in a reasonable amount of time.
In the worst case, the minmax-MOM selection procedure (2) requires the computation of V|M|² empirical risks. By comparison, the construction presented here only requires the computation of 8V|M|/3 empirical risks.
Each subsample then has cardinality at most N/4, as required. Moreover, the following key property holds. For every pair m = (λ, B_{k₁}^{(K₁)}) and m′ = (λ′, B_{k₂}^{(K₂)}) in M, the blocks of the 2^{K₀}-partition having empty intersection with both B_m and B_{m′} form a sub-collection of the 2^{K₀}-partition which, according to Lemma 4.3, contains at least V sets. We can thus define (T_v^{(m,m′)})_{v∈[V]} as a sub-collection of size exactly V. Moreover, we have a lower bound on the cardinality of its sets, which is required (see Section 3). Consequently, to compute the minmax-MOM selection procedure in the context of the ensemble method defined in Section 4.1, the empirical risk of each estimator f̂_m has to be computed on the blocks of the 2^{K₀}-partition only, which thanks to (6) means the computation of at most 2^{K₀}|M| ⩽ 8V|M|/3 empirical risks, as advertised.
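The complexity saving can be sketched as follows: the block means of each candidate's losses are computed once on the finest 2^{K₀}-partition and reused for every pairwise comparison, giving at most 2^{K₀}|M| mean computations instead of V|M|². The data layout is an assumption of ours:

```python
import numpy as np

def finest_block_means(losses, perm, K0):
    """Precompute, for each candidate m, the mean loss on each block of the
    2**K0-partition of the permuted indices; every pairwise MOM comparison
    then only takes medians of differences of these cached values."""
    n_cand, N = losses.shape
    blocks = np.array_split(perm, 2 ** K0)
    return np.stack([[losses[m, b].mean() for b in blocks]
                     for m in range(n_cand)])
```

For a comparison of m and m′, one selects V cached columns whose blocks avoid B_m ∪ B_{m′} and takes the median of their differences; no further passes over the data are needed.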

Application to fine-tuning the regularization parameter of the LASSO
This section applies the ensemble method from Section 4 with the LASSO as input algorithm. Consider here X = R^d and denote by β̂_{λ,B} the LASSO estimator trained with regularization parameter λ and subsample B:
β̂_{λ,B} := arg min_{β∈R^d} ( P_B[(Y − ⟨X, β⟩)²] + λ ∥β∥₁ ). (7)
Statistical guarantees for the LASSO, which we recall below, have been obtained in Theorem 1.4 in [32] with a regularization parameter λ ≍ ∥ζ∥_{L_q} √(s log(ed/s)/N) instead of ∥ζ∥_{L_q} √(s log(ed)/N). This choice is valid under the following assumption.
Proposition 5.1 states an (exact) oracle inequality with optimal residual term [10]. It is satisfied by the LASSO with constant probability when trained on a set of informative data, for an optimal choice of regularization parameter λ ≍ ∥ζ∥_{L_q} √(s₀ log(ed/s₀)/N). This regularization parameter requires knowledge of the sparsity s₀. Proposition 5.1 only guarantees the risk bound with constant probability because the noise is only assumed to have a finite L_q-moment. Finally, the LASSO has to be trained with uncorrupted data: a single outlier completely breaks down its statistical properties; see Figure 1 in [28].
Let us now combine Corollary 4.1 and Proposition 5.1 to apply the ensemble method to this example. Let V, K_min, K_max satisfy the assumptions from Section 4 and Corollary 4.1. Denote by s* the largest integer s such that N/max(8V, 2^{K_min+1}) ⩾ s log(ed/s), and assume 1 ⩽ ∥β*∥₀ ⩽ s* (where ∥·∥₀ denotes the number of nonzero coefficients). Consider the set of subsamples B defined as in Section 4.1. While Proposition 5.1 gives a statistical guarantee holding only with constant probability for the estimators β̂_{λ,B} trained on uncorrupted data, Corollary 5.2 shows that the ensemble method improves the constant probability into an exponential one, tolerates |O| outliers as long as V ⩾ 3|O|, and selects the best hyperparameter λ. The proof is given in Appendix C.6. A similar application to the ERM in linear aggregation is given in Appendix A.
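Training the candidates β̂_{λ,B} of (7) can be sketched with scikit-learn (an assumption of ours; note that sklearn's `Lasso` minimizes (1/(2n))‖y − Xβ‖² + α‖β‖₁, so its `alpha` matches λ only up to this factor-of-two loss scaling):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_on_subsample(X, y, lam, B):
    """Train the LASSO on the rows indexed by subsample B with
    regularization parameter lam (sklearn `alpha` convention)."""
    model = Lasso(alpha=lam, max_iter=10_000)
    model.fit(X[B], y[B])
    return model.coef_
```

In the ensemble method, this function would be called once per pair (λ, B) ∈ M, then the minmax-MOM selection of Section 3 would pick among the resulting coefficient vectors.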
6 Numerical experiments with the LASSO

Presentation
In this section, the ensemble method from Section 4 is implemented and fed with the LASSO algorithm, as in Section 5. Numerical experiments are performed with various amounts and types of outliers in order to investigate their effects on the output estimator β̂_m̂ and the corresponding parameter (λ_m̂, B_m̂).
We consider two types of outliers. The first type, which we call hard outliers, is designed to simulate corruption due, for instance, to hardware issues: X_i = (1, ..., 1) ∈ R^2000 and Y_i = 10000 for i ∈ O₁. The second type, which we call heavy-tail outliers, is constructed as X_i ∼ N(0, Σ) and Y_i = ⟨X_i, β*⟩ + ζ_i, where ζ_i is a noise independent of X_i and distributed according to Student's t-distribution with 2 degrees of freedom. Informative data are drawn according to X_i ∼ N(0, Σ) and Y_i = ⟨X_i, β*⟩ + ζ_i, where ζ_i (i ∈ I) is a standard Gaussian noise independent of X_i. On the one hand, a hard outlier, if contained in the training sample of an estimator, is likely to significantly deteriorate its performance. On the other hand, heavy-tail outliers only differ from informative data in the distribution of the noise, and should not deteriorate too much the performance of the affected estimators. Nevertheless, we expect the informative data to be preferred over the type 2 outliers in the selected subsample B_m̂ (this is indeed the case in Figure 1c).
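A sketch of this simulation design in Python. We use an identity design covariance, a reduced dimension and an arbitrary sparse β* for illustration; these specifics are assumptions of ours, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 600, 50                          # reduced from the paper's d = 2000
beta_star = np.zeros(d)
beta_star[:5] = 1.0                     # illustrative sparse target

# informative data: Gaussian design (identity covariance), Gaussian noise
X = rng.normal(size=(N, d))
y = X @ beta_star + rng.normal(size=N)

n_hard, n_heavy = 10, 10
# hard outliers: constant design row and a huge response
X[:n_hard] = 1.0
y[:n_hard] = 10_000.0
# heavy-tail outliers: same model, Student t(2) noise instead of Gaussian
idx = slice(n_hard, n_hard + n_heavy)
X[idx] = rng.normal(size=(n_heavy, d))
y[idx] = X[idx] @ beta_star + rng.standard_t(df=2, size=n_heavy)
```

As in the discussion above, the heavy-tail rows share the informative data's oracle β*, whereas the hard rows carry no usable signal.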
We fix a finite grid Λ of values for the regularization parameter of the LASSO. We implement the ensemble method from Section 4 with parameters V = 40, K_min = 3 and K_max = 4. The set B of subsamples is constructed as in Section 4.3 and we set M = Λ × B. For each m = (λ, B) ∈ M, we train the LASSO estimator β̂_m with hyperparameter λ and subsample B (see (7)). We then compute the output estimator β̂_m̂, which uses the partitions (T_v^{(m,m′)})_{v∈[V]} from Section 4.3. As a benchmark, we track the best estimator in the collection, β̂_m̄ with m̄ := arg min_{m∈M} R(β̂_m), where R is the true risk function, which is not known, so that m̄ cannot be computed using only the data. For comparison, we also compute the LASSO estimators β̂_{λ,[N]} trained with the whole dataset, which we call basic estimators, and let β̂_{λ̄,[N]} be the best among those, so that λ̄ := arg min_{λ∈Λ} R(β̂_{λ,[N]}).

On the choices of V and K max
The choices of V and K_max have an impact both on the performance of the output estimator and on the computation time. The larger V, the larger the number of outliers that the MOM selection procedure (2) can handle; indeed, Theorem 3.2 requires V ⩾ 3|O|. However, larger values of V increase the computation time and deteriorate the statistical guarantee (through the values of a_{ε,V} and b_{ε,V} in Theorem 3.2). Here we choose V = 40, so we can expect the minmax-MOM selection procedure to perform well at least up to ⌊40/3⌋ = 13 outliers. The number of considered subsamples increases with K_max, and high values of K_max increase the computation time. Moreover, we do not want to take the maximum value K_max = ⌈log₂ N⌉, which would imply training estimators on subsamples of size 2, which is irrelevant. We therefore want a low value of K_max, while keeping at least one subsample which contains no outlier; this is necessarily the case when 2^{K_max} > |O| (by the pigeonhole principle, at least one block of the finest partition is then outlier-free). Since the choice V = 40 lets us hope for a good selection performance up to |O| = 13 outliers, we choose K_max = 4, which indeed satisfies 2^{K_max} > 13.
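The arithmetic above can be packaged as a small helper (a convenience function of ours, not part of the paper):

```python
import math

def choose_parameters(n_outliers, N):
    """Pick V and K_max for a target outlier budget: V >= 3*|O| as required
    by Theorem 3.2 (with V <= N/8), and 2**K_max > |O| so that, by
    pigeonhole, at least one finest-partition subsample is outlier-free."""
    V = 3 * n_outliers
    if n_outliers > 0:
        K_max = max(3, math.floor(math.log2(n_outliers)) + 1)
    else:
        K_max = 3
    assert V <= N / 8, "Theorem 3.2 requires V <= N/8"
    return V, K_max
```

For the budget |O| = 13 discussed above and a dataset of size N = 1000, this returns V = 39 and K_max = 4, matching the choices V = 40, K_max = 4 made in the experiments.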

Results and discussion
The plots presented in Figure 1 are averaged over 100 experiments. Figure 1a shows estimation errors (with 95% confidence intervals) against the number |O| of outliers in the dataset: 1) for the output estimator β̂_m̂ of the ensemble method, 2) for β̂_m̄, the best estimator among (β̂_m)_{m∈M}, and 3) for the best basic estimator β̂_{λ̄,[N]}. As soon as the dataset contains outliers, the basic estimators (β̂_{λ,[N]})_{λ∈Λ} have larger errors than β̂_m̄ (the best estimator computed on a subsample). For |O| ⩽ 48, the ensemble method has the same error as the best estimator among (β̂_m)_{m∈M}.
For a given value of the parameter V, the minmax-MOM selection procedure (2) is expected to fail at some point as the number of outliers increases, but it seems here to withstand a much larger number of outliers than predicted by the theory. Theorem 3.2 holds for |O| ⩽ V/3, that is |O| ⩽ 13 here, while the minmax-MOM selection procedure performs satisfactorily for |O| ⩽ 48 and even selects a reasonably good estimator for |O| ⩽ 56. Figure 1c shows the number of each type of outliers in the selected subsample B_m̂. The method manages to rule out hard outliers when |O| ⩽ 48, and in these cases the output estimator β̂_m̂ has minimal risk, like the best estimator β̂_m̄. Figure 1b also shows that almost all subsamples contain outliers when |O| ⩾ 48. Besides, the selected subsample B_m̂ contains heavy-tail outliers even for small values of |O|. As heavy-tail outliers and informative data define the same oracle, these heavy-tail outliers are actually informative for the learning task, and the minmax-MOM selection procedure uses this extra information automatically. In particular, the ensemble method distinguishes between non-informative hard outliers and possibly informative heavy-tail outliers.
Overall, our method shows very strong robustness to the presence of outliers and outputs an estimator with the best possible performance among the given class of estimators.

A Application to ERM and linear aggregation
This section applies Corollary 4.1 with (non-robust) linear aggregation as input algorithms. Let (F_λ)_{λ∈Λ} be a finite collection of subspaces of F, typically spanned by previously built estimators. For each λ ∈ Λ, denote by d_λ the dimension of F_λ and by f*_λ an oracle in F_λ, meaning f*_λ := arg min_{f∈F_λ} R(f). Denote by f̂_{λ,B} the empirical risk minimizer (ERM) on F_λ trained with subsample B:
f̂_{λ,B} := arg min_{f∈F_λ} P_B[γ(f)]. (9)
The performance of ERM in linear aggregation such as f̂_{λ,B}, under an L⁴/L² assumption such as Assumption 1, has been obtained in [31].
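Over a finite-dimensional linear span, the ERM (9) reduces to ordinary least squares on the feature matrix of the subspace; a minimal sketch (the feature matrix and data below are illustrative):

```python
import numpy as np

def erm_linear(Phi, y):
    """ERM over the linear span of the columns of Phi: the minimizer of the
    empirical square loss is the least-squares coefficient vector theta,
    giving f_hat(x) = phi(x) @ theta."""
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta
```

In the ensemble method, Phi would be the matrix (f_j(X_i))_{i∈B, j⩽d_λ} of a basis of F_λ evaluated on the subsample B.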
Proposition A.1 states that, for every x > 0, with probability larger than 1 − exp(−|B|/(64 χ_λ⁸)) − 1/x, the ERM f̂_{λ,B} defined in (9) satisfies an (exact) oracle inequality over F_λ with residual term of order x σ_λ² d_λ/|B|. This guarantees an optimal residual term of order σ_λ² d_λ/N only when the deviation parameter x is constant. This may seem weak, but it cannot be improved in general; see Proposition 1.5 in [31]: ERM is not robust to "stochastic outliers" in general.
We can now combine these algorithms with our ensemble method. Let M = Λ × B be as in (3) and V , K min and K max satisfy the assumptions from Section 4 and Corollary 4.1. Consider the output estimatorf m from (2). The following result combines Corollary 4.1 and Proposition A.1.

B Lemmas and proofs
Lemma B.1. Let m, m′ ∈ M and v ∈ V^{(m,m′)}. Conditionally on the random variables (X_i, Y_i)_{i∈B_m∪B_{m′}}, the variance of P_{T_v^{(m,m′)}}[γ(f̂_m) − γ(f̂_{m′})] is bounded from above in terms of σ, χ, ℓ(f̂_m) and ℓ(f̂_{m′}).
Proof. By assumption, the random variables (X_i, Y_i)_{i∈I} are independent. In particular, conditionally on (X_i, Y_i)_{i∈B_m∪B_{m′}}, the random variables indexed by T_v^{(m,m′)} ∩ I are i.i.d. with distribution P.
⊂ I, and let us bound from above each variance term from the latter expression: where we used the basic inequality (x + y)² ⩽ 2(x² + y²) in the last inequality. Let us bound the first term from above; the second term is handled similarly. We use a quadratic/multiplier decomposition of the excess loss: By Assumption 1, it follows that Likewise, Assumption 1 yields The result follows from combining these pieces. Proof. Fix m, m′ ∈ M and v ∈ V^{(m,m′)}. Conditionally on (X_i, Y_i)_{i∈B_m∪B_{m′}}, it follows from Chebyshev's inequality and Lemma B.1 that, with probability larger than 1 − 1/8, As the probability estimate does not depend on (X_i, Y_i)_{i∈B_m∪B_{m′}}, the above also holds unconditionally and, with probability larger than 1 − 1/8, Denote by Ω_v^{(m,m′)} the event defined by (11), and apply Hoeffding's inequality to the corresponding indicator variables. Then, on Ω^{(m,m′)}, using (10) and the assumption V ⩾ 3|O|, inequalities (11) hold for more than half of the indices v ∈ [V]. Therefore, on the event Ω^{(m,m′)}, the same inequality holds for the median over v ∈ [V]. Proof. The claim follows from the definition of C_{m,m′} and elementary inequalities between square roots. Lemma B.4. With probability at least 1 − |M|² e^{−(V−|O|)/32}, for all m, m′ ∈ M and ε > 0: Proof. Let Ω be the event defined by Lemma B.4. Since V ⩾ 3|O|, it follows from Lemma B.4 and (12) that, on Ω, Then, by definition of m̂ and using Lemma B.4, on Ω, where we used 1 − a_{ε,V} ⩾ 0 and the definition of m_o. Plugging this into (13) yields the result.
where the first and second inequalities follow from the definition of K₀, and the third inequality from the assumption V ⩾ 3|O|.
Consider the events Ω₁ and Ω₂. From now on, assume that Ω₁ ∩ Ω₂ holds; the aim is to establish an upper bound on min_{m∈M} ℓ(f̂_m). Write
Here the second inequality comes from the definition of Ω₂ and the last inequality from (14) combined with ρ being nonincreasing in its second variable. Combining the above with the definition of Ω₁ yields the desired inequality. To conclude the proof, let us bound from below the probability of Ω₁ ∩ Ω₂. By Theorem 3.2, that ρ is nonincreasing in its second variable and that B The third inequality follows from the excess risk bound on f̂_{λ₀,B_k^{(K₀)}}, which holds as soon as the subsample is large enough; this is indeed the case because, using (14): where the penultimate inequality follows from the assumption N ⩾ ν_max max(8V, 2^{K_min+1}). Therefore,

C.3 Proof of Lemma 4.2
Start with (i). Let k ∈ [2^K] and i ∈ B_k^{(K)}. We can bound the block index from below as follows: Similarly, the upper bound is obtained as follows: Therefore, this means i ∈ B_{k′}^{(K′)}. The proof of (ii) proceeds as follows. Let k′ ∈ [2^{K′}] and i ∈ B_{k′}^{(K′)}. This can be rewritten as Combining the two previous displays implies that (k′ − 1) 2^{K−K′} < k ⩽ k′ 2^{K−K′}, meaning that i ∈ B_k^{(K)}. Using (i) from Lemma 4.2, we have, with k′₂ = ⌊(k₂ − 1) 2^{3−K₂}⌋ + 1.
The above inequality also holds for s < ∥β*∥₀, since the right-hand side is then infinite. We can now apply Corollary 4.1, which gives that, with probability larger than 1 − ((s*)² N² + 1) e^{−V/48}, it holds that: