Aggregated hold-out for sparse linear regression with a robust loss function

Sparse linear regression methods generally have a free hyperparameter which controls the amount of sparsity, and is subject to a bias-variance tradeoff. This article considers the use of aggregated hold-out to aggregate over values of this hyperparameter, in the context of linear regression with the Huber loss function. Aggregated hold-out (Agghoo) is a procedure which averages estimators selected by hold-out (cross-validation with a single split). In the theoretical part of the article, it is proved that Agghoo satisfies a non-asymptotic oracle inequality when it is applied to sparse estimators which are parametrized by their zero-norm. In particular, this includes a variant of the Lasso introduced by Zou, Hastie and Tibshirani. Simulations are used to compare Agghoo with cross-validation. They show that Agghoo performs better than CV when the intrinsic dimension is high and when there are confounders correlated with the predictive covariates.


Introduction
From the statistical learning point of view, linear regression is a risk-minimization problem wherein the aim is to minimize the average prediction error φ(Y − θ^T X) on a new, independent data point (X, Y), as measured by a loss function φ. When φ(x) = x², this yields classical least-squares regression; however, Lipschitz-continuous loss functions have better robustness properties and are therefore preferred in the presence of heavy-tailed noise, since they require fewer moment assumptions on Y [8,20]. Similarly to the L_2 norm in the least-squares case, measures of performance for estimators can be derived from robust loss functions by subtracting the risk of the (distribution-dependent) optimal predictor, yielding the so-called excess risk.
In the high-dimensional setting, where X ∈ R^d with potentially d > n, full linear regression cannot be achieved in general: the minimax excess risk is bounded below by a positive function of d/n (Proposition 2.2). Stronger assumptions on the regression coefficient θ are needed in order to estimate it consistently.
A popular approach is to suppose that only a small number k* of covariates are relevant to the prediction of Y, so that θ may be sought among the sparse vectors with at most k* non-zero components. Estimators which target such problems include the Lasso [36], least-angle regression [11] (a similar, but not identical method [16, Section 3.4.4]), and stepwise regression [16, Section 3.3.2]. In the robust setting, variants of the Lasso with robust loss functions have been investigated by a number of authors [22,34,6,44].
Such methods generally introduce a free hyperparameter which regulates the "sparsity" of the estimator; sometimes this is directly the number of nonzero components, as in stepwise procedures, sometimes not, as in the case of the Lasso, which uses a regularization parameter λ. In any case, the user is left with the problem of calibrating this hyperparameter.
Several goals are conceivable for a hyperparameter selection method, such as support recovery (finding the "predictive" covariates) or estimation of a "true" underlying regression coefficient with respect to some norm on R^d. From a prediction perspective, hyperparameters should be chosen so as to minimize the risk, and a good method should approach this minimum. As a consequence, the proposed data-driven choice of hyperparameter should allow the estimator to attain all known convergence rates without any a priori knowledge, effectively adapting to the difficulty of the problem.
For the Lasso and some variants, such as the fused Lasso, Zou, Wang, Tibshirani and coauthors have proposed [49] and investigated [43,38] a method based on Mallows' C_p and estimation of the "degrees of freedom of the Lasso". However, consistency of this method has only been proven [43] in an asymptotic setting where the dimension is fixed while n grows, hence not the setting considered here. Moreover, the method depends on specific properties of the Lasso, and may not be readily applicable to other sparse regression procedures.
A much more widely applicable procedure is to choose the hyperparameter by cross-validation. For the Lasso, this approach has been recommended by Tibshirani [37], van de Geer and Lederer [39] and Greenshtein [13], among many others. More generally, cross-validation is the default method for calibrating hyperparameters in practice. For example, R implementations of the elastic net [12] (package glmnet), LARS [11] (package lars) and the huberized lasso [48] (package hqreg) all incorporate a cross-validation subroutine to automatically choose the hyperparameter.
Theoretically, cross-validation has been shown to perform well in a variety of settings [1]. For cross-validation with one split, also known as the hold-out, and for a bagged variant of V-fold cross-validation [23], some general oracle inequalities are available in least-squares regression [26, Corollary 8.8] [46] [23]. However, they rely on uniform boundedness assumptions on the estimators which may not hold in high-dimensional linear regression. For the more popular V-fold procedure, results are only available in specific settings. Of particular interest here is the article [32], which proves oracle inequalities for linear model selection in least-squares regression, since linear model selection is very similar to sparse regression (the main difference being that in sparse regression, the "models" are not fixed a priori but depend on the data). This suggests that similar results could hold for sparse regression.
However, in the case of the Lasso at least, no such general theoretical guarantees exist, to the best of our knowledge. Some oracle inequalities [23,30] and also fast rates [17, Theorem 1] have been obtained, but only under strong distributional assumptions: [23] assumes that X has a log-concave distribution, [30] that X is a Gaussian vector, and [17, Theorem 1] assumes that there is a true model and that the variance-covariance matrix is diagonally dominant. Recently, Chetverikov et al. [7] have obtained fast rates (up to log terms) for a certain class of conditional distributions (of Y given X) which are smooth transformations of Gaussian distributions. In contrast, there are also theorems [5] [17, Theorem 2] which make much weaker distributional assumptions but only prove convergence of the (in-sample) error at the "slow" rate O(r √(log(d)/n)) or slower. Though this rate is basically minimax [33] for the model, a hyperparameter selection method should also adapt to the favorable cases where the Lasso converges faster ([21, Theorem 14]); these results do not show that CV has this property. Thus, the theoretical justification for the use of standard CV, which selects a single hyperparameter by minimizing the CV risk estimator, is somewhat lacking. In fact, two of the articles mentioned above introduce variants of CV which modify the final hyperparameter selection step: a bagged CV in [23] and the aggregation of two hold-out predictors in [5]. In practice too, there is reason to consider alternatives to hyperparameter selection in sparse regression: sparse estimators are unstable, and selecting only one estimator can result in arbitrarily ignoring certain variables among a correlated group with similar predictive power [47]. For the Lasso, these difficulties have motivated researchers to introduce several aggregation schemes, such as the Bolasso [3], stability selection [27], the lasso-zero [9] and the random lasso [45], which are shown to have some better properties than the standard Lasso.
Since aggregating the Lasso appears to be advantageous, it is logical to consider aggregation rather than selection to handle the free hyperparameters. In this article, we consider the application to sparse regression of the aggregated hold-out procedure. Aggregated hold-out (Agghoo) is a general aggregation method which mixes cross-validation with bagging. It is an alternative to cross-validation, with a comparable level of generality. In a previous article with Sylvain Arlot and Matthieu Lerasle [25], we formally defined and studied Agghoo, and showed empirically that it can improve on cross-validation when calibrating the level of regularization for kernel regression. Though we came up with the name and the general mathematical definition, Agghoo had already appeared in the applied literature in combination with sparse regression procedures [18], among others [42], under the name "CV + averaging" in this case.
In the present article, the aim is to study the application of Agghoo to sparse regression with a robust loss function. Theoretically, assuming an L_∞ − L_2 norm inequality to hold on the set of sparse linear predictors, it is proven that Agghoo satisfies an asymptotically optimal oracle inequality. This result also applies to cross-validation with one split (the so-called hold-out), yielding a new oracle inequality which allows the norms of the sparse linear predictors to grow polynomially with the sample size. Empirically, Agghoo is compared to cross-validation in a number of simulations, which investigate the impact of correlations in the design matrix and of the sparsity of the ground truth on the performance of aggregated hold-out and cross-validation. Agghoo appears to perform better than cross-validation when the number of non-zero coefficients to be estimated is not much smaller than the sample size. The presence of confounders correlated to the predictive variables also favours Agghoo relative to cross-validation.

Setting and Definitions
The problem of non-parametric regression is to infer a predictor t : X → R from a dataset (X_i, Y_i)_{1≤i≤n} of pairs, where X_i ∈ X and Y_i ∈ R. The pairs are assumed to be i.i.d., with joint distribution P. The prediction error made at a point (x, y) ∈ X × R is measured using a non-negative function of the residual, φ(y − t(x)). The global performance of a predictor is assessed on a new, independent data point (X, Y) drawn from the same distribution P using the risk L(t) = E[φ(Y − t(X))]. The optimal predictors s are characterized by s(x) ∈ argmin_u E[φ(Y − u)|X = x] a.s. The risk of any optimal predictor is (in general) a non-zero quantity which characterizes the intrinsic amount of "noise" in Y unaccounted for by the knowledge of X. A predictor t can be compared with this benchmark by using the excess risk ℓ(s, t) = L(t) − L(s). Taking φ(x) = x² yields the usual least-squares regression, where s(x) = E[Y|X = x] and ℓ(s, t) = ‖(s − t)(X)‖²_{L_2}. However, the least-squares approach is known to suffer from a lack of robustness [20, Chapter 7]. For this reason, in the field of robust statistics, a number of alternative loss functions are used. One popular choice was introduced by Huber [19].
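For reference, a standard form of the Huber loss with parameter c > 0, consistent with the derivative φ'_c(x) = sgn(x)(|x| ∧ c) used in the proofs of Appendix B, is

\[
\varphi_c(x) \;=\;
\begin{cases}
\dfrac{x^2}{2} & \text{if } |x| \le c,\\[4pt]
c\,|x| - \dfrac{c^2}{2} & \text{if } |x| > c.
\end{cases}
\]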
When c → +∞, φ_c converges to the least-squares loss. When c → 0, (1/c)φ_c converges to the absolute-value loss x ↦ |x| of median regression. Thus, the parameter c allows a trade-off between robustness and approximation of the least-squares loss.
The rest of the article will focus on sparse linear regression with the loss function φ c . Thus, notations s, ℓ(s, t) and L are to be understood with respect to φ c .

Sparse linear regression
With finite data, it is impossible to solve the optimization problem min L(t) over the set of all predictors t. Some modeling assumptions must be made to make the problem tractable. A popular approach is to build a finite set of features (ψ_j(X))_{1≤j≤d} and to consider predictors that are linear in these features: ∃θ ∈ R^d, ∀x ∈ X, t(x) = Σ_{j=1}^{d} θ_j ψ_j(x). This is equivalent to replacing X ∈ X with X̃ = (ψ_j(X))_{1≤j≤d} ∈ R^d and regressing Y on X̃. For theoretical purposes, it is thus equivalent to assume that X = R^d for some d and that predictors are linear. As the aim is to reduce the risk L(t), a logical way to choose θ is by empirical risk minimization. Empirical risk minimization works well when d ≪ n but leads to overfitting in large dimensions [41]. Indeed, if d is too large, no estimator can succeed at minimizing the risk over R^d, as the following proposition shows.

Proposition 2.2 Let σ > 0 and let Σ be a positive definite matrix of dimension d. For any θ ∈ R^d, let P_θ denote the distribution such that (X, Y) ∼ P_θ if and only if, almost surely, Y = ⟨θ, X⟩ + σε, where X ∼ N(0, Σ), ε ∼ N(0, 1) and ε, X are independent. Then, for any n > d, where inf_{θ̂} denotes the infimum over all estimators θ̂ and θ^T denotes the linear functional x ↦ ⟨θ, x⟩.
Proposition 2.2 is proved in Appendix A. With respect to σ, the lower bound of Proposition 2.2 scales as σ² when σ ≪ c and as cσ when σ ≫ c, as could be expected from the definition of the Huber loss (Definition 2.1). With respect to d and n, it scales as d/n when d ≪ n. Moreover, there is a positive lower bound on the minimax risk when d is of order n. Thus, for such large values of d, consistent risk minimization cannot be achieved uniformly over the whole of R^d.
Sparse regression attempts instead to locate a "good" subset of variables in order to optimize the risk for a given model dimension. The Lasso [37] is now a standard method for achieving sparsity. The specific version of the Lasso which we consider here is given by the following definition.

Definition 2.3 Let n ∈ N and let D_n = (X_i, Y_i)_{1≤i≤n} be a dataset such that X_i ∈ R^d and Y_i ∈ R for all i ∈ [|1; n|] and some d ∈ N. Let φ_c be the Huber loss defined in Definition 2.1. For any r ≥ 0, let

Now let
The intercept q is left unconstrained in Definition 2.3, as is usually the case in practice [48]. Equation (2) is a tie-breaking rule which simplifies the theoretical analysis.
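As a hedged sketch (not necessarily the exact display of Definition 2.3), one standard formulation compatible with the parametrization by r ≥ 0 and with the unconstrained intercept is the ℓ_1-constrained huberized Lasso

\[
(\hat q(r), \hat\theta(r)) \;\in\; \operatorname*{argmin}_{q \in \mathbb{R},\ \|\theta\|_1 \le r}\ \frac{1}{n}\sum_{i=1}^{n} \varphi_c\bigl(Y_i - q - \langle \theta, X_i \rangle\bigr),
\]

with equation (2) breaking ties among minimizers.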

Hyperparameter tuning
The zero-norm of a vector θ is the integer ‖θ‖_0 = |{i : θ_i ≠ 0}|. Many sparse estimators, such as best subset or forward stepwise [16, Section 3.3], are directly parametrized by their desired zero-norm, which must be chosen by the practitioner. It controls the "complexity" of the estimator, and hence the bias-variance tradeoff. In the case of the standard Lasso (Definition 2.3 with φ(x) = x²), Zou, Hastie and Tibshirani [49] showed that ‖θ̂(λ)‖_0 is an unbiased estimator of the "degrees of freedom" of the estimator A(λ). As a consequence, [49] suggests reparametrizing the Lasso by its zero-norm. Applying their definition to the present setting yields the following.

Definition 2.4
For any dataset D_n, let (q̂, θ̂) be given by Definition 2.3, equation (2). Let M ∈ N and let (r_m)_{1≤m≤M} be the finite increasing sequence of values at which the sets {i : θ̂(r)_i = 0} change. Let r_0 = 0. For any k ∈ N, let with the convention max ∅ = 0. Let then Let A^lasso_k = A^lasso_{k,+∞} denote the unconstrained sequence (corresponding to [49]'s original definition).
The (optional) constraint ‖θ̂(r_m)‖_{ℓ_1} ≤ r_m ≤ R has some potential practical and theoretical benefits. From the practical viewpoint, it makes it possible to reduce the computational complexity by excluding Lasso solutions with excessively large ℓ_1 norm, which may be expected to perform poorly anyway. From a theoretical viewpoint, it helps control the L_p norms of the predictor ⟨θ̂(r_m), X⟩, thus avoiding inconsistency issues encountered by the empirical risk minimizer for some pathological designs [31].
More generally, consider any sequence (A_k)_{k∈N} of learning rules which output linear predictors A_k(D_n) : x ↦ q̂_k(D_n) + ⟨θ̂_k(D_n), x⟩. To prove the main theoretical result of this article (Theorem 3.2), we make the following assumptions on the collection (A_k)_{k∈N}.

Hypothesis 2.1 For any n ∈ N, let D_n ∼ P^{⊗n} denote a dataset of size n. Assume that

For the reparametrized Lasso given by Definitions 2.3 and 2.4, hypothesis 2.1 holds by construction.
Moreover, condition 1 is naturally satisfied by such sparse regression methods as forward stepwise and best subset [16,Section 3.3]. Condition 3 states that the intercept q is chosen by empirical risk minimization, with a specific tie-breaking rule in case the minimum is not unique.

Aggregated hold out applied to the zero-norm parameter
The tuning of the zero-norm k is important to ensure good prediction performance by optimizing the bias-variance tradeoff. Practitioners may also want more or less sparsity, depending on their requirements in terms of computational load or interpretability. For this reason, we consider the problem of selecting the zero-norm among the set {1, . . . , K}, for some K ∈ N which may depend on the sample size. This article investigates the use of Agghoo in this context, as an alternative to cross-validation. Agghoo is a general hyperparameter aggregation method which was defined in [25], in a general statistical learning context. Let us briefly recall its definition in the present setting. For a more detailed introductory discussion of this procedure, we refer the reader to [25]. To simplify notations, fix a collection (q̂_k, θ̂_k)_{1≤k≤K} of linear regression estimators. First, we need to define hold-out selection of the zero-norm parameter.
Using the hyperparameter k̂_T(D_n) together with the dataset D_n^T to train a linear regressor yields the hold-out predictor. Aggregation of hold-out predictors is performed in the following manner.
Agghoo outputs the linear predictor: Thus, Agghoo also yields a linear predictor, which means that it can be efficiently evaluated on new data. If the θ̂_{k̂_T(D_n)}(D_n^T) have similar supports, θ̂^ag_T will also be sparse: this will happen if the hold-out reliably identifies a true model. On the other hand, if the supports have little overlap, the Agghoo coefficient will lose sparsity, but it can be expected to be more stable and to perform better. The linear predictors x ↦ q̂_{k̂_T(D_n)}(D_n^T) + ⟨θ̂_{k̂_T(D_n)}(D_n^T), x⟩ aggregated by Agghoo are only trained on part of the data. This subsampling (typically) decreases the performance of each individual estimator, but, combined with aggregation, it may stabilize an unstable procedure and improve its performance, similarly to bagging.
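The procedure can be summarized by the following minimal sketch in R, assuming a hypothetical fitting routine fit_sparse(X, y, k) which returns a list with an intercept q and a coefficient vector theta having at most k non-zero entries (for instance, the reparametrized huberized Lasso of Definition 2.4); the sketch is illustrative only and omits the tie-breaking rules of the formal definitions.

huber <- function(r, c = 2) ifelse(abs(r) <= c, r^2 / 2, c * abs(r) - c^2 / 2)

agghoo <- function(X, y, K, V = 10, tau = 0.8, fit_sparse, c = 2) {
  n <- nrow(X)
  n_t <- floor(tau * n)
  intercepts <- numeric(V)
  coefs <- matrix(0, V, ncol(X))
  for (v in seq_len(V)) {
    T_v <- sample(n, n_t)                              # Monte-Carlo training subset
    ho_risk <- sapply(seq_len(K), function(k) {        # hold-out risk of each zero-norm k
      fit <- fit_sparse(X[T_v, , drop = FALSE], y[T_v], k)
      mean(huber(y[-T_v] - fit$q - X[-T_v, , drop = FALSE] %*% fit$theta, c))
    })
    k_hat <- which.min(ho_risk)                        # hold-out selected zero-norm
    fit <- fit_sparse(X[T_v, , drop = FALSE], y[T_v], k_hat)
    intercepts[v] <- fit$q
    coefs[v, ] <- fit$theta
  }
  # Agghoo averages the V hold-out predictors, i.e. their intercepts and coefficients
  list(q = mean(intercepts), theta = colMeans(coefs))
}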
An alternative would be to retrain each regressor on the whole dataset D_n, yielding the following procedure, which we call "aggregated cross-validation" (Agcv).
The output of Agcv is the linear predictor: Agghoo is easier to study theoretically than Agcv, due to the conditional independence (θ̂_k(D_n^T))_{1≤k≤K} ⊥⊥ k̂_T(D_n) given D_n^T. For this reason, the theoretical section focuses on Agghoo, while in the simulation study, both Agghoo and Agcv are considered.
In comparison to Agghoo and Agcv, consider the following definition of a general cross-validation method.
This makes clear the difference between cross-validation and Agghoo (or Agcv): cross-validation averages the hold-out risk estimates (and selects a single linear predictor), whereas Agghoo and Agcv aggregate the selected predictors (q̂_{k̂_{T_i}}, θ̂_{k̂_{T_i}}). If the parameter k̂^cv_T is used instead of the k̂_{T_i} in Definition 2.6, this yields the bagged CV method of Lecué and Mitchell [23]. This method applies bagging to the individual estimators q̂_k, θ̂_k, whereas Agghoo also bags the estimator selection step. When there is a single, clearly established optimal model of small dimension, the advantages of a more accurate model selection step (as in CV and its bagged version) may outweigh the gains due to aggregation. In contrast, when there are many different sparse linear predictors with close to optimal performance, model selection will be unstable and aggregation should provide benefits relative to selection of a single parameter k.
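In code, the distinction amounts to the order of the averaging and minimization steps. Given a hypothetical V × K matrix risks of hold-out risk estimates (risks[v, k] being the validation risk of A_k on split v), a sketch reads:

k_cv <- which.min(colMeans(risks))   # CV: average the risk estimates over splits, then select a single k
k_ho <- apply(risks, 1, which.min)   # Agghoo / Agcv: select one k per split ...
# ... and then average the V corresponding fitted predictors, as in the agghoo() sketch above.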

Computational complexity
There are two types of computational cost to take into account when considering a (sparse) linear predictor such as f̂^ag_T(D_n): the cost of computing the parameters q̂^ag_T(D_n), θ̂^ag_T(D_n) at training time, and the cost of making a prediction on new data, i.e., computing f̂^ag_T(D_n)(x) for some x. In this section, Agghoo, Agcv and cross-validation are compared with respect to these two types of complexity.
Let (q̂_k, θ̂_k)_{1≤k≤K} be a finite collection of sparse linear regression estimators. Let S(n) = E[max_{1≤k≤K} ‖θ̂_k(D_n)‖_0] denote the expected maximal number of non-zero coefficients. In particular, under point 1 of hypothesis 2.1, S(n) ≤ K. Let V = |T| and n_v = n − n_t, where n_t is given by hypothesis (Reg-T) below (equation (4)).
Computational complexity at training time Agghoo, Agcv and cross-validation must all compute the hold-out risk estimator for each subset in T and each k ∈ {1, . . . , K}. Let Ĉ_hos denote the number of operations needed for this.
For a given subset T_i, the estimators q̂_k(D_n^{T_i}), θ̂_k(D_n^{T_i}) must be computed for all k, which may be more or less expensive depending on the method. In the case of the Lasso, the whole path can be computed efficiently using the LARS-Lasso algorithm [11].
Then, the empirical risk of all estimators must be calculated on the test set. On average, this takes at least S(n_t)n_v operations to compute the risk of the least sparse θ̂_k (n_v scalar products involving an average of S(n_t) non-zero coefficients) and at most O(K S(n_t) n_v) operations in general. In a next step, Agghoo and Agcv compute the minima of V vectors of length K, whereas cross-validation averages these vectors and calculates the argmin of the average. Both operations have complexity of order V K.
It is in their final step that the three methods differ slightly. Agghoo uses the θ̂_{k̂_{T_i}}(D_n^{T_i}) which have been computed in a previous step, whereas Agcv and cross-validation must compute θ̂_{k̂_{T_i}}(D_n) and θ̂_{k̂^cv_T}(D_n), respectively. The complexity of this step depends on the method, but can be expected to be small compared to Ĉ_hos, as there is only one estimator to fit instead of K.
Finally, Agghoo and Agcv must aggregate V vectors drawn from the θ̂_k(D_n^{T_i}) and θ̂_k(D_n), with respective complexity O(V S(n_t)) and O(V S(n)), provided that a suitably "sparse" representation is used for the θ̂_k. Assuming S(n) ≈ S(n_t), this is negligible compared to E[Ĉ_hos].
All in all, Agghoo, Agcv and cross-validation have a similar complexity at training time, of order E[Ĉ_hos] + V K, with E[Ĉ_hos] most likely being the dominant term.
Evaluation on new data Given new data x, the complexity of evaluating q + ⟨θ, x⟩ is proportional to ‖θ‖_0. If the sparse estimators θ̂_k perform as intended and consistently identify similar subsets of predictive variables, then Agghoo and Agcv should not lose much sparsity compared to CV, since the aggregated θ̂_{k̂_{T_i}} then have similar supports, both k̂^cv_T and the k̂_{T_i} optimizing the same bias-variance tradeoff with respect to the "complexity parameter" k. When, on the contrary, the selected supports differ substantially across splits, Agghoo loses sparsity; however, that situation is one in which the hold-out is very unstable, so Agghoo can be expected to yield significant improvements in exchange for the increased computational cost. The same argument applies to Agcv.

Theoretical results
Let n ∈ N and let D_n = (X_i, Y_i)_{1≤i≤n} denote an i.i.d. dataset with common distribution P. Let (q̂_k, θ̂_k)_{1≤k≤K} be a collection of linear regressors which satisfies hypothesis 2.1. Let T be a collection of subsets of {1, . . . , n}. In this section, we give bounds on the risk of the Agghoo estimator f̂^ag_T (Definition 2.6) built from the collection (q̂_k, θ̂_k)_{1≤k≤K}.

Hypotheses
To state and prove our theoretical results, a number of hypotheses are required. First, the collection of subsets T, chosen by the practitioner, should satisfy the following two conditions.
(Reg-T) There exists an integer n_t such that max(3, n/2) ≤ n_t < n, such that T is independent of D_n, and such that |T| = n_t for every T ∈ T. Let also n_v = n − n_t denote the size of the validation sets.
Independence of T from D_n ensures that for T ∈ T, D_n^T is also i.i.d. with distribution P. The assumption that T = (T_1, . . . , T_V) contains sets of equal size ensures that the pairs (q̂_{k̂_{T_i}}(D_n^{T_i}), θ̂_{k̂_{T_i}}(D_n^{T_i})) are identically distributed. Most of the data partitioning procedures used for cross-validation satisfy hypothesis (Reg-T), including leave-p-out, V-fold cross-validation (with n − n_t = n_v = n/V) and Monte Carlo cross-validation [1].
To state an upper bound on ℓ(s, f̂^ag_T), we also need to quantify the amount of noise in the distribution of Y given X, in a way appropriate to the Huber loss φ_c. That is the purpose of the following assumption.
(Lcs) Assume that there exist s and a positive real number η such that where c denotes the parameter of the Huber loss.
Equation (5) is specific to the Huber loss: it requires the conditional distribution of the residual Y − s(X) to put sufficient mass in a region where the Huber function φ_c is quadratic. For example, assume that Y = s(X) + σε, where ε is independent from X and has a continuous, positive density q in a neighbourhood of 0. If the Huber parameter c is proportional to or larger than σ, then a constant value of η can be chosen, independently of σ. On the other hand, if c ≪ σ, the optimal value of η satisfies η = η(σ) → 0 as c/σ → 0.

Finally, some hypotheses are needed to deal with pathological design distributions, which can in general lead to inconsistency of empirical risk minimization [31]. To illustrate the problem as it applies to the hold-out, consider a distribution P such that 0 < P(X ∈ H) < 1 for some vector subspace H, as in [31]. Assume, to simplify, that Y = ⟨θ*, X⟩ + ε. Let p_H denote the orthogonal projection on H. With small, but positive, probability, X_i ∈ H for all i ∈ {1, . . . , n}. On this event, it is clearly impossible to estimate θ* − p_H(θ*). Likewise, the hold-out cannot correctly assess the impact of the orthogonal components θ̂_k − p_H(θ̂_k) of the estimators θ̂_k on the risk, since ⟨θ̂_k, X_i⟩ only depends on p_H(θ̂_k), whereas out-of-sample predictions ⟨θ̂_k, X⟩ may depend on θ̂_k − p_H(θ̂_k) (since P(X ∈ H) < 1). This means that the hold-out-selected predictors f̂^ho_{T_i} may be arbitrarily far from optimal in general.
To avoid this issue, two sets of assumptions have been made in the literature. First, there are boundedness assumptions: for example, if the predictors q̂_k + ⟨θ̂_k, X⟩ and the variable Y are uniformly bounded, this clearly limits the impact of low-probability events such as {∀i ∈ {1, . . . , n}, X_i ∈ H} on the risk. Such hypotheses have been used to prove general oracle inequalities for the hold-out [14, Chapter 8] [26, Corollary 8.8] and cross-validation [40]. Alternatively, pathological designs can be excluded from consideration by assuming an L_p − L_q norm inequality or a "small ball" type condition [28,29]: this approach has been used to study empirical risk minimization over linear models [31,2].
In this article, a combination of both approaches is used. First, we assume a weak uniform upper bound on the L_1 norms of the predictors (hypothesis (Uub)). The bound is allowed to grow with n_t at an arbitrary polynomial rate.
where n_t is given by hypothesis (Reg-T). Let X ∼ X_1 be independent from D_{n_t}. There exist real numbers L, α such that . This is the case if the components of X have variance 1 and d is polynomial in n, or if the components of X are sub-exponential with constant 1 and log d is polynomial in n.
Hypothesis (Uub) is much weaker than the boundedness assumptions usually made in the literature, where typically the L_∞ norm is used instead of the L_1 norm, and the bound is a constant rather than a polynomial function of n_t. Point 1 of hypothesis (Uub) is natural in the sense that an estimator θ̂_k which violates it cannot perform well anyway: assuming that P(|Y|) < +∞, by definition of φ_c, for any (q, θ), Thus, if E[|⟨θ̂_k(D_{n_t}), X − PX⟩|] grows faster than n_t^α, then so do the expected risk and expected excess risk of A_k(D_{n_t}). Point 2 of hypothesis (Uub) can be seen as an "empirical version" of point 1, wherein the independent variable X is replaced by the elements of D_{n_t}. The lack of independence between θ̂_k and the X_i makes this condition less straightforward than point 1. However, by the Cauchy–Schwarz inequality, it is always the case that Thus, it is enough to suppose that d and E[⟨θ̂_k, X − PX⟩²] are bounded by L n_t^α for some α > 0.

Together with the weak uniform bound (Uub), we assume that for sparse linear predictors x ↦ ⟨θ, x − EX⟩ with ‖θ‖_0 ≤ K, the L_2 norm is equivalent to the stronger "Orlicz norm" defined below.
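For completeness, here is a standard definition of the L_{ψ_1} Orlicz norm of a real random variable Z (the convention under which P(|Z| ≥ x‖Z‖_{L_{ψ_1}}) ≤ 2e^{−x}, as used in the proofs of Appendix B):

\[
\|Z\|_{L_{\psi_1}} \;=\; \inf\bigl\{\, t > 0 \;:\; \mathbb{E}\bigl[\exp(|Z|/t)\bigr] \le 2 \,\bigr\}.
\]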
The constant relating ‖·‖_{L_{ψ_1}} and ‖·‖_{L_2} is allowed to depend on n_t in the following way.
(Ni) There exists a constant ν_0 such that

The interpretation of this hypothesis is not obvious. Note first that κ(K) is a non-decreasing function of K, and in particular κ(K) ≤ κ(d). Unlike κ(K), κ(d) is invariant under linear transformations of X: in other words, it only depends on the linear space V spanned by the components of X.
In particular, κ(d) does not depend on the covariance matrix of X, provided that it is non-degenerate. The inequality ‖⟨θ, X − EX⟩‖_{L_{ψ_1}} ≤ κ(d) ‖⟨θ, X − EX⟩‖_{L_2} can be interpreted as an effective, scale-invariant version of sub-exponentiality: it states that the tail of ⟨θ, X − EX⟩ is sub-exponential, with a scale parameter which is not too large compared to its standard deviation. In sections 3.3, 3.4 and 3.5, we give examples where simple bounds can be proved for κ(K) or κ(d).

Main Theorem
When Agghoo is used on a collection (A_k)_{1≤k≤K} of linear regression estimators satisfying hypothesis 2.1, such as the Lasso parametrized by the number of non-zero coefficients, as in Definition 2.4, the following theorem applies.
where n_t is given by assumption (Reg-T). Let c denote the Huber loss parameter from Definition 2.1.
Let K be an integer such that 3 ≤ K ≤ e^{√n_v} and let (A_k)_{1≤k≤K} be a collection of linear regression estimators which satisfies hypothesis 2.1. Assume that hypotheses (Ni) and (Uub) hold. There exist numerical constants µ_1 > 0, µ_2 ≥ 1 such that, for any θ ∈ R such that (µ_2 √(α+3) ν_0)/η ≤ θ < 1,

Theorem 3.2 is proved in Appendix B. Theorem 3.2 compares the excess risk of Agghoo to that of the best linear predictor in the collection (A_k(D_{n_t}))_{1≤k≤K}, trained on a subset of the data of size n_t. Taking |T| = 1 in Theorem 3.2 yields an oracle inequality for the hold-out, which is also cross-validation with one split. It is, to the best of our knowledge, the first theoretical guarantee on hyperparameter aggregation (or selection) for the huberized Lasso. That n_t appears in the oracle instead of n is a limitation, but it is logical, since the estimators aggregated by Agghoo are only trained on samples of size n_t. Typically, the excess risk increases at most by a constant factor when a dataset of size n is replaced by a subset of size τn, and this constant tends to 1 as τ → 1. This makes it possible to take n_v of order n (n_v = (1 − τ)n), while losing only a constant factor in the oracle term.
In addition to the oracle term E[min_{1≤k≤K} ℓ(s, A_k(D_{n_t}))], the right-hand side of equation (9) contains two remainder terms. Since K ≤ n_t, the second of these terms is always negligible with respect to the first as n_v, n_t → +∞ for fixed L, c. Assuming that n_v, n_t are both of order n, the first remainder term is O(log n / n) with respect to n. In comparison, the minimax risk for prediction in the model Y = ⟨θ*, X⟩ + ε, ‖θ*‖_0 ≤ k*, ε ∼ N(0, 1) is greater than a constant times k*/n by Proposition 2.2. Thus, if more than log n independent components of X are required for the prediction of Y, the remainder term can be expected to be negligible compared to the oracle as a function of n.
As a function of a scale parameter σ in a model Y = s(X) + σε, where ε is distributed symmetrically around 0, the remainder term scales as c²/η, where η depends only on σ and on the fixed distribution of ε. When σ/c is lower bounded and ε is sufficiently regular, then c²/η = O(cσ) (see the discussion of hypothesis (Lcs)). In that case, the rate cσ is the same as in the minimax lower bound of Proposition 2.2, and can therefore be considered correct. When σ/c → 0, c²/η ∼ c² is suboptimal for Gaussian distributions σε, where the correct scaling is σ² (by Proposition 2.2 and a simple comparison with least squares). However, Theorem 3.2 makes no moment assumptions whatsoever on the residual Y − s(X); thus, it is logical that the parameter c, which controls the robustness of the Huber loss, should appear in the bound.
In equation (9), there is a tradeoff between the oracle and the remainder terms, governed by the tuning parameter θ ∈ (0; 1]. θ must be larger than a positive constant depending on α, ν_0 and η; as a result, Theorem 3.2 only yields a nontrivial result when ν_0 < η/(µ_2 √(α+3)). Note that hypothesis (Ni), which defines ν_0, allows ν_0 to decrease with n as fast as √(log n / n) when κ(K) is a constant, as is the case when X is Gaussian (see section 3.3 below). Assuming only that ν_0 = ν_0(n) → 0 and that the remainder term is negligible compared to the oracle, equation (9) proves that E[ℓ(s, f̂^ag_T)] ∼ E[min_{1≤k≤K} ℓ(s, A_k(D_{n_t}))] by taking θ = θ_n → 0 slowly enough: an "optimal" oracle inequality.

Gaussian design
In the case where X ∈ R^d is a Gaussian vector, ⟨θ, X − EX⟩ follows a centered normal distribution. As a result, κ(K), defined in equation (7), is a fixed numerical constant, equal to max(‖Z‖_{L_{ψ_1}}, 1/log 2), where Z ∼ N(0, 1). It follows that, for any fixed ν_0, hypothesis (Ni) holds as soon as n_v/log(n_t ∨ K) is large enough.
Moreover, for Gaussian design, it is possible to show that the Lasso estimators of Definition 2.4 satisfy hypothesis (Uub) for any R ≥ 0 (including R = +∞), as long as Y has some moments and K is not too large. More precisely, hypothesis (Uub) then holds with L, α independent from R. This leads to the following corollary.

Corollary 3.3 Assume that X ∈ R^d is a Gaussian vector, that Y ∈ L_{1+u} for some u ∈ (0; 1], and that hypothesis (Lcs) holds. Let R ∈ R ∪ {+∞} and let f̂^ag_T be the Agghoo estimator built from the collection (A^lasso_{k,R})_{1≤k≤K}. Assume that n_t ≥ 13 + 6/u and There exist numerical constants µ_5, µ_8 such that for all θ ∈ [(µ_5/η)√(log n_t / n_v); 1] and all q ∈ R,

Corollary 3.3 allows θ → 0 at any rate slower than √(log n_t / n_v), so that the asymptotic constant in front of the oracle is 1. The constraint (10) imposed on K by Corollary 3.3 is mild, since there are strong practical and theoretical reasons to take k much smaller than n_t/log n_t anyway: this enforces sparsity, minimizing computational complexity and improving interpretability, and allows better control of the minimax risk (Proposition 2.2). Equation (10) serves only to prove that θ̂^lasso_{k,R} satisfies hypothesis (Uub), hence it could be replaced by a polynomial bound on R and on X − EX, as explained in the discussion of hypothesis (Uub).

Nonparametric bases
Given real random variables U ∈ [a, b] and Y ∈ R, a linear model may be a poor approximation to the actual regression function s_0(U). A popular technique to obtain a more flexible model is to replace the one-dimensional variable U with a vector X = (ψ_j(U))_{1≤j≤d_n}, where (ψ_j)_{1≤j≤d_n} spans a space of functions W_{d_n} known for its good approximation properties, such as trigonometric polynomials, wavelets or splines [16, Chapter 5]. In practice, d_n is almost always allowed to tend to +∞ as n grows, to make sure that the approximation error of s by functions in W_{d_n} = Span((ψ_j)_{1≤j≤d_n}) converges to 0. In this section, we discuss conditions under which Theorem 3.2 applies to such models.
It turns out that most of the classical function spaces satisfy such an inequality. In particular, if W_{d_n} contains the constant functions, which is the case for splines, wavelets and trigonometric polynomials, then equation (7) holds with κ(d_n) of order √(d_n). Thus, equation (30) of hypothesis (Ni) holds under the assumption that d_n ≤ µ ν_0 n_v / log n_t for some constant µ. Assuming that n_v and n_t are both of order n (for example, a V-fold split with fixed V), this assumption is mild: as a consequence of [14, Theorem 11.3] and of approximation-theoretic properties of the spaces W_{d_n} [10], taking d_n ≤ n/log² n, for example, is sufficient to attain minimax convergence rates [35] [14, Theorem 3.2] over standard classes of smooth functions.
Note that even though κ(d_n) ≈ √(d_n), this does not in general imply that κ(K) = O(√K): it fails, for example, in the case of regular histograms on [0, 1]. The property κ(K) = O(√K) does, however, hold in the case of the Fourier basis: as a result, d_n may be arbitrarily large, and only bounds on K (the maximal zero-norm of the estimators) are required. We examine this case in detail in the following section.

The Fourier basis
Suppose that real variables (U, Y) are given, and that we wish to find the best predictor of Y among 1-periodic functions of U. Let s_per denote the minimizer of the risk E[φ_c(Y − t(U))] among all measurable 1-periodic functions t on R. For all k ∈ N, let ψ_{2k}(x) = √2 cos(2πkx) and ψ_{2k−1}(x) = √2 sin(2πkx). Let X = (ψ_j(U))_{1≤j≤d}, where d ∈ N and d ≥ 2. One can easily show that s_per(U) = s(X), where s minimizes P[φ_c(Y − t(X))] among measurable functions t on R^d. By taking d large and using sparse methods, it is possible to approximate functions s_per which have only a small number of non-zero Fourier coefficients, but potentially at high frequencies, as is commonly the case in practice [15].
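As an illustration, the feature map X = (ψ_j(U))_{1≤j≤d} can be computed as follows (a small R helper written for this presentation; d is assumed even):

fourier_features <- function(u, d) {
  stopifnot(d >= 2, d %% 2 == 0)
  k <- seq_len(d / 2)
  S <- sqrt(2) * sin(2 * pi * outer(u, k))   # psi_{2k-1}(u)
  C <- sqrt(2) * cos(2 * pi * outer(u, k))   # psi_{2k}(u)
  X <- cbind(S, C)
  X[, order(c(2 * k - 1, 2 * k)), drop = FALSE]   # reorder columns so that column j is psi_j(u)
}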
Let (q̂_k, θ̂_k)_{1≤k≤K} be a collection of sparse linear regression estimators satisfying hypothesis 2.1 and let t̂_k denote the predictor t̂_k : x ↦ q̂_k(D_{n_t}) + ⟨θ̂_k(D_{n_t}), x⟩. Given this initial collection of linear predictors, Definition 3.4 below constructs a second collection (q̃_k, θ̃_k)_{1≤k≤K} which also satisfies hypothesis (Uub) under an appropriate distributional assumption (Corollary 3.6, equation (14)). For any k, let t̃_k : x ↦ q̃_k(D_{n_t}) + ⟨θ̃_k(D_{n_t}), x⟩.
By construction, (q̃_k, θ̃_k) also satisfies hypothesis 2.1. Replacing (q̂_k, θ̂_k) by (q̃_k, θ̃_k) may improve performance and cannot significantly degrade it, as Proposition 3.5 below makes clear.
Proposition 3.5 Assume that Y ∈ L_α for some α ∈ (0, 1] and let q* ∈ R. If, for some numerical constant µ_10 ≥ 0,

Theorem 3.2 can be applied to the collection (q̃_k, θ̃_k)_{1≤k≤K}, which yields the following corollary.
Corollary 3.6 Assume that U has a density p_U such that Assume that there exists η > 0 such that, almost surely, There exists a constant µ_9 ≥ √8 such that, if for some θ ∈ (0; 1], then

If the 1-periodicity of s_per represents (say) a yearly cycle, then equation (14) states that each "time of year" u ∈ [0; 1] is sampled with a positive density, i.e., that the density of U − ⌊U⌋ is lower bounded by a positive constant p_0 on [0; 1]. This ensures that equation (7) holds with κ(K) of order √(K/p_0), so that hypothesis (8) reduces to K ≤ p_0 (θη/µ_9)² n_v / log n_t. In particular, if θ is constant and n_v is of order n, then K is allowed to grow with n at rate n/log n. This is a reasonable restriction, as by Proposition 2.2, one cannot expect to estimate more than n/log n coefficients with reasonable accuracy (a 1/log n convergence rate being too slow for most practical purposes). Corollary 3.6 then yields an oracle inequality with leading constant (1+θ)/(1−θ) (arbitrarily close to 1) and a remainder term of order c² log n/(ηn), which is typically negligible in the non-parametric setting of this corollary. For this reason, Corollary 3.6 can be said to be optimal, at least up to constants.

Effect of V
The upper bound given by Theorem 3.2 only depends on T through n_v and n_t. The purpose of this section is to show that, for a given value of n_v, increasing V = |T| always decreases the risk. This is proved in the case of the Monte Carlo subset generation defined below.

Definition 3.7 For τ ∈ [1/n; 1) and V ∈ N*, let T^mc_{τ,V} be generated independently of the data D_n by drawing V elements independently and uniformly in the set {T ⊂ [|1; n|] : |T| = ⌊τn⌋}.
For fixed τ , the excess risk of Agghoo is a non-increasing function of V .
Proposition 3.8 Let U ≤ V be two non-zero integers. Let τ ∈ [1/n; 1). Then:

It follows by convexity of f ↦ ℓ(s, f) that For any I ∈ I, (T_i)_{i∈I} ∼ T^mc_{τ,U} and is independent of D_n, therefore This yields the result.
It can be seen from the proof that the proposition also holds for Agcv. Thus, increasing V can only improve the performance of these methods. The same argument does not apply to CV, because CV takes an argmin after averaging, and the argmin operation is neither linear nor convex. Indeed, no comparable theoretical guarantee has been proven for CV, to the best of our knowledge, even though increasing the number of CV splits (for a given τ) generally improves performance in practice. Proposition 3.8 does not quantify the gain due to aggregation. This gain depends on the properties of the convex functional t ↦ ℓ(s, t), in particular on its modulus of strong convexity in a neighbourhood of the target s (assuming that at least some estimators in the collection are close to s). Moreover, as for any loss function, the gain due to aggregation depends on the diversity of the collection (f̂^ho_{T_i})_{1≤i≤V}: the more the hold-out estimators f̂^ho_T vary with respect to T, the greater the effect of aggregation.
More precisely, under hypothesis (Lcs), we can prove the following improvement on Proposition 3.8.

Proposition 3.9 Let (X, Y) ∼ P be independent from D_n. Assume that P satisfies hypothesis (Lcs). For any where Med[Y] denotes the largest median of a random variable Y.
Proposition 3.9 is proved in Appendix C.3. It quantifies the gain due to aggregation in terms of the parameter c of the Huber loss, the constant η given by hypothesis (Lcs) and the distance between two hold-out estimators that are close enough to s. Taking c → +∞ recovers the least-squares case, where η = 1 and there are no constraints on f̂^ho_{T_i} − s. Only two indices 1, 2 appear in the right-hand side of equation (17): that is a consequence of the exchangeability of the collection (f̂^ho_{T_i})_{1≤i≤V} for Monte Carlo subset generation. The same result also applies to V-fold Agghoo, since it also yields an exchangeable collection. For arbitrary T, all distinct pairs of indices would have to be considered.
Going beyond Proposition 3.9 requires nontrivial lower bounds on the distance between hold-out estimators, which is no easy task, given the complex dependencies involved. Results in this direction have only recently been obtained in the setting of least-squares density estimation [24].
A few general heuristics apply. First, if there is one learning rule A_{k*} in the collection which is much better than the others, the hold-out can be expected to select it most of the time: in that case, Agghoo reduces to bagging, and potential gains depend on the stability of A_{k*}. In contrast, if there are many rules A_k which are close to optimal while being distant from each other, then the gains from aggregation can be expected to be large, even if the individual rules A_k are stable.

Simulation study
This section focuses on hyperparameter selection for the Lasso with the Huber loss, either using a fixed grid or using the reparametrization from Definition 2.4. The methods considered for this task are aggregated hold-out, given by Definition 2.6, aggregated cross-validation, given by Definition 2.7, and standard cross-validation, given by Definition 2.8. In all cases, the subsamples are generated independently from the data and uniformly among subsets of a given size τn, as in Definition 3.7. Thus, all three methods share the same two hyperparameters: τ, the fraction of data used for training the Lasso, and V, the number of subsets used by the method.
For the huberized Lasso with a fixed grid, the hqreg_raw function from the R package hqreg [48] is used with a fixed grid designed to emulate the default choice: a geometrically decreasing sequence of length 100, with maximum value λ_max and minimum value λ_min = 0.05 λ_max. The fixed value of λ_max is obtained by averaging the (data-dependent) default value chosen by hqreg_raw over 10 independent datasets. To compute the reparametrization given by Definition 2.4, we implemented the LARS-based algorithm described by Rosset and Zhu [34], which allows the whole regularization path to be computed.
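For concreteness, such a grid can be built as follows (a sketch; lambda_max stands for the value obtained by the averaging procedure described above):

lambda_max  <- 1   # placeholder: in the paper, the average of hqreg_raw's default over 10 datasets
lambda_grid <- lambda_max * 0.05^(seq(0, 1, length.out = 100))   # geometric grid from lambda_max down to 0.05 * lambda_max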
I.i.d. training samples of size n = 100 are generated according to the distribution of a pair (X, Y), where X ∈ R^1000 and Y = w_*^T X + ε, with ε independent from X. To illustrate the robustness of the estimators, Cauchy noise is used: ε ∼ Cauchy(0, σ). The performance of Agghoo and cross-validation may depend on the presence of correlations between the covariates X and on the sparsity of the ground truth w_*. To investigate these effects, three parametric families of distributions are considered for X, in sections 4.1, 4.2 and 4.3.
The risk of each method is evaluated on an independent test set of size 500, and results are averaged over 1000 repetitions of the simulation. More precisely, 1000 training sets D_j of size n = 100 are generated, along with 1000 test sets (X'_{i,j}, Y'_{i,j})_{1≤i≤500}, each of size 500. For each simulation j and each learning rule A_{τ,V} among the six obtained by combining Agghoo, Monte Carlo CV and Agcv with either a fixed grid or the zero-norm parametrization, the average excess risk is computed on the test set for all values of V ∈ {1, 2, 5, 10} and τ ∈ {i/10 : 1 ≤ i ≤ 9}.

Experimental setup 1
X is generated using the formula where the Z_j are independent standard Gaussian random variables, u_i = I_{|i|≤cor} e^{−2.33² i²/(2 cor²)} and cor ∈ N is a parameter regulating the strength of the correlations. The regression coefficient has a support of size r = 3k drawn at random from [|1; 1000|], and is defined by w_{*,j} = u_{*,g(j)}, where g is a uniform random permutation, u_{*,j} = b if 1 ≤ j ≤ k and u_{*,j} = b/4 if 2k + 1 ≤ j ≤ 3k, with b calibrated so that ‖Xw_*‖_{L_2} = 1. The noise parameter is σ = 0.08, while the Huber loss parameter c is set to 2, a suboptimal choice in this setting, but convenient for computing the huberized Lasso regularization path.
Choice of τ parameter For all methods, in most cases the optimal value of τ is 0.8 or 0.9, similarly to what was observed in the RKHS case [25], where τ = 0.8 was recommended. Table 1 displays the quantity where Sd denotes the (empirical) standard deviation and τ* the optimal choice of τ, τ* = argmin_{τ ∈ {0.1, . . . , 0.9}}.

Choice of V For all methods considered, performance is expected to improve when V is increased, but by how much? If the performance increase is too slight, it may not be worth the additional computational cost. In Figure 1, the mean excess risk for the optimal value of τ is displayed as a function of V, with error bars corresponding to one standard deviation. The scale used for the vertical axis in each graph is the average excess risk of the oracle with respect to the fixed grid over the λ parameter. Quantifying performance as a percentage of the oracle risk, when cor = 15, Agghoo improves by roughly 20% from V = 1 to V = 2, by roughly 10% from V = 2 to V = 5 and by a few percent more from V = 5 to V = 10. CV with the standard grid behaves similarly in these two simulations, while CV with the zero-norm parametrization shows much less improvement when V is increased. Thus, taking V ≥ 5 is advantageous, but there are clearly diminishing returns to choosing V much larger than this. For CV with the zero-norm parametrization, V = 2 seems sufficient in these simulations.
Comparison between methods From Figure 1, it appears that grid agcv is a very poor choice, being worse than both grid agghoo and grid cv for all values of V when r = 150, cor = 15, and being the worst of all the methods for V ≥ 2 when r = 24, as well as highly unstable, as the size of the error bars clearly shows.
Interestingly, 0−norm agcv behaves much better, being the second best method when cor = 1, and very close to the best when r = 24 and cor = 15.
Generally speaking, of the two types of parametrization of the Lasso, the zero-norm parametrization appears to perform better than the standard grid when correlations are small (cor = 1), while the performance is significantly worse when r = 150 and cor = 15.
Comparing now Agghoo and CV, Agghoo appears to be better than CV when V ≥ 2 in situations where r is larger (r = 150). This seems to hold for both the standard parametrization (grid agghoo) and the zero-norm one (0-norm agghoo). The relation is reversed for small r, with CV performing better than Agghoo for all values of V when r = 24.

Further studies The previous simulations suggest that Agghoo performs better than CV in the case of high intrinsic dimension. This behaviour is logical, since the cross-validated Lasso will ignore some predictive variables when there are too many of them, and randomized aggregation may help recover more of the support. However, the effect of correlations is unclear. Experimental setup 1 mixes different types of correlations: correlations between predictive variables, correlations between predictive and non-predictive variables, and correlations among non-predictive variables. It is possible that one type of correlation favours Agghoo while another favours CV.
To gain a more accurate idea of when Agghoo is advantageous over CV, two more settings are studied, considering separately correlations among predictive variables, and between predictive and non-predictive variables. Since previous simulations showed that τ = 0.8, 0.9 and V = 10 were the optimal parameters, only those parameters will be considered in the following.
Since the choice of Lasso parametrization did not seem to affect the relative performance of Agghoo and CV, we only consider the standard parametrization, as it is more popular and also easier to use in our simulations. Agcv is not considered either, since it proved unreliable in the previous simulations.

Experimental setup 2: correlations between predictive and noise variables
Let r be the number of predictive variables and let each predictive covariate have s "noise" covariates which are correlated with it at level ρ = 0.8. Assume that rs ≤ d, where d is the total number of variables. Let (Z⁰_j)_{1≤j≤r}, (Z_{j,i})_{1≤j≤r, 1≤i≤s} and (W_k)_{1≤k≤d−rs} be independent standard Gaussian variables. For any j ∈ [|0; r − 1|] and any i ∈ [|1; s|], let X_{js+i} = √0.8 Z⁰_{j+1} + √0.2 Z_{j+1,i}, and for rs < i ≤ d, let X_i = W_{i−rs}. The regression coefficient is w_* = 3u/‖Xu‖_{L_2}, where u = (I_{s|(j−1)} I_{j≤rs})_{1≤j≤d}. Let then Y be distributed, conditionally on X, as Cauchy(⟨w_*, X⟩, 0.3). The loss function used here is φ_c with c = 2.
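A sketch of this data-generating mechanism in R (illustrative only; the normalization of w_* uses an empirical stand-in for the L_2 norm):

gen_setup2 <- function(n, d = 1000, r = 25, s = 5) {
  stopifnot(r * s <= d)
  X <- matrix(rnorm(n * d), n, d)                     # independent standard Gaussian noise covariates by default
  for (j in 0:(r - 1)) {
    Z0 <- rnorm(n)                                    # factor shared by the j-th group
    for (i in 1:s) {
      X[, j * s + i] <- sqrt(0.8) * Z0 + sqrt(0.2) * rnorm(n)   # correlation 0.8 within the group
    }
  }
  u <- as.numeric((seq_len(d) - 1) %% s == 0 & seq_len(d) <= r * s)   # one predictive covariate per group
  w <- 3 * u / sqrt(mean((X %*% u)^2))                # empirical stand-in for the L_2 normalization of Xu
  y <- rcauchy(n, location = drop(X %*% w), scale = 0.3)
  list(X = X, y = y, w_star = w)
}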
Results Figure 2 shows a bar plot of the average excess risk of CV and Agghoo as a fraction of the average risk of the oracle. 90% error bars were estimated using a normal approximation. The parameters used for Agghoo and CV were τ = 0.9 and V = 10 (τ = 0.8 yields similar results).
Overall, Agghoo's risk relative to the oracle significantly decreases as the zero-norm of w_* increases from r = 10 to r = 50, as was observed in section 4.1. For r = 25 and r = 50 separately, the risk relative to the oracle significantly decreases as s increases from 2 to 10. For r = 10, this trend is unclear due to the random errors.
In contrast, CV's performance relative to the oracle shows no statistically significant trend either as a function of r or as a function of s.
As a result of these trends, Agghoo performs significantly worse than CV for r = 10 and significantly better when r = 50, especially when s ≥ 5. When r = 25, CV performs significantly better than Agghoo for s = 2 and s = 5 and they perform similarly when s = 10 and s = 20.

Experimental setup 3: correlations between predictive variables
We now consider predictive covariates which are correlated with each other, and independent from the non-predictive covariates. As above, let r denote the number of predictive variables and ρ > 0 the level of correlation. Let Z_0, (Z_i)_{1≤i≤r} and (W_i)_{1≤i≤d−r} be independent standard Gaussian random variables. The random variable X is then defined by X_i = √ρ Z_0 + √(1 − ρ) Z_i for 1 ≤ i ≤ r and X_{r+i} = W_i for 1 ≤ i ≤ d − r. As in section 4.2, the regression coefficient w_* is a constant vector of the form 3u/‖Xu‖_{L_2}, where this time u = (I_{1≤i≤r})_{1≤i≤d}. Y is distributed, conditionally on X, as Cauchy(⟨X, w_*⟩, 0.3) and the loss function used is the Huber loss φ_2.
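A corresponding sketch of the design in R (under the shared-factor reading above, which gives pairwise correlation ρ between the predictive covariates):

gen_setup3_X <- function(n, d = 1000, r = 100, rho = 0.5) {
  Z0 <- rnorm(n)                                           # factor shared by the r predictive covariates
  X_pred  <- sqrt(rho) * Z0 + sqrt(1 - rho) * matrix(rnorm(n * r), n, r)
  X_noise <- matrix(rnorm(n * (d - r)), n, d - r)          # independent non-predictive covariates
  cbind(X_pred, X_noise)
}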
Results Figure 3 shows a barplot generated in the same way as in section 4.2. Parameters used for Agghoo and CV were V = 10 and τ = 0.8, which is optimal in this case for both Agghoo and CV.
As in previous simulations, Agghoo's performance relative to the oracle improves significantly when the intrinsic dimension r grows from 25 to 200, for a given value of ρ. The decrease in relative risk is faster for small values of ρ. As a result, Agghoo performs best, relative to the oracle, when ρ = 0.2 for r = 200, whereas best performance seems to occur at ρ = 0.5 for smaller values of r, up to random errors.
For cross-validation, the relative risk seems more or less unaffected by the dimension r, but shows an increasing trend as a function of ρ for all values of r.

Conclusion
Aggregated hold-out (Agghoo) satisfies an oracle inequality (Theorem 3.2) in sparse linear regression with the Huber loss. This oracle inequality is asymptotically optimal in the non-parametric case where the intrinsic dimension tends to +∞ with the sample size n, provided that an L_{ψ_1} − L_2 norm inequality holds on the set of sparse linear predictors. This condition holds for Gaussian vectors and for classical approximation spaces used in non-parametric regression. In the case of the trigonometric basis, this approach yields an oracle inequality in which the total dimension d does not appear.
When Monte-Carlo subsampling is used (Definition 3.7), Agghoo has two parameters, τ and V . Theoretically, it is shown that Agghoo's performance always improves when V grows for a fixed τ . Simulations show a large improvement from V = 1 to V = 5 in some cases, but diminishing returns for V > 5. With respect to τ , simulations show that τ = 0.8 or τ = 0.9 is optimal or near optimal in most cases. In particular, a default choice of V = 10, τ = 0.8 seems reasonable.
Compared to cross-validation with the same number of splits V, simulations show that Agghoo performs better when the intrinsic dimension r is large enough (r = 150 in section 4.1, r = 50 in section 4.2 and r = 100 in section 4.3) for n = 100 observations and d = 1000 covariates. Correlations between predictive and non-predictive covariates, which increase the number of covariates correlated with the response Y, clearly favour Agghoo relative to CV and the oracle, whereas the effect of correlations between predictive covariates is ambiguous.

Acknowledgements
While finishing the writing of this article, the author (Guillaume Maillard) has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No 811017.

A Proof of Proposition 2.2
The proof follows the same lines as the proof of [31, Theorem 1], with some differences due to the non-quadratic risk.
Since θ̂ is allowed to depend on Σ, which is positive definite by assumption, we can always replace the X_i by Σ^{−1/2} X_i. Thus, it can be assumed without loss of generality that Σ = I_d. Using the notation of Proposition 2.2, where ε, X are assumed to be independent from the sample D_n. Since ε, X are independent, centered normal variables, σε + ⟨θ_* − θ, X⟩ is centered normal, with variance σ² + ‖θ_* − θ‖²_2. It follows that Let also g_{c,σ}(r) = g_c(√(r² + σ²)) − g_c(σ), so that ℓ(θ_*^T, θ^T) = g_{c,σ}(‖θ_* − θ‖_2). Consider the prior Π_λ = N(0, (σ²/(λn)) I_d) on θ_*. Then a classical computation [31] shows that the posterior π̂_n = Π_λ(·|D_n) is Gaussian and centered at the ridge estimator θ̂_{λ,n}, where Σ̂_n is the empirical covariance matrix. Fix a sample D_n and let θ̃ ∼ π̂_n be independent from ε, X. Notice that Set now θ = θ̂_{λ,n}. Since θ̃ ∼ π̂_n, knowing X, ⟨θ̃ − θ̂_{λ,n}, X⟩ is centered normal and independent from ε, which is also centered normal. It follows that E[φ'_c(σε + ⟨θ̃ − θ, X⟩)|X] = 0, since φ'_c is an odd function. This shows that θ̂_{λ,n} is a Bayes estimator with respect to the prior Π_λ and the loss function ℓ.
Thus, for any estimator θ̂, ℓ(θ_*^T, θ^T) = g_{c,σ}(‖θ_* − θ‖_2), so by convexity of ℓ(θ_*^T, ·), g_{c,σ} must be convex. Hence, by Jensen's inequality, Since d < n, Σ̂_n is almost surely non-degenerate. It follows that Since the ε_i are i.i.d. normal N(0, 1) and independent from the X_i, conditionally on X_1, . . . , X_n, r̂_n is centered normal, with covariance matrix It follows by Lemma A.1 that By convexity of the function M ↦ Tr(M^{−1}) on the positive definite matrices (Lemma A.2), Since g_c is non-decreasing and convex, this proves the proposition.
Thus, the lemma is equivalent to where D is diagonal and Q is orthogonal. Let λ 1 , . . . , λ d be the diagonal coefficients of D (that is to say, the eigenvalues of Σ 0 ). Then

The coefficients λ_i are positive (since Σ_0 is positive definite) and sum to 1 (since Tr(Σ_0) = 1 by construction). It follows by Jensen's inequality that This proves the lemma. For any positive real a > 0, For any two matrices A, B, let ⟨A, B⟩ := Tr(M^{−1/2} A B^T M^{−1/2}). It is easy to see that this defines a scalar product. Thus, by the Cauchy–Schwarz inequality, Thus, By equation (19), this proves that the Hessian of f at M is non-negative definite.
In this proof, we adopt the following notational conventions. The notation P, E is reserved for probabilities and expectations which involve the sample D_{n_t} (or D_n). For a (possibly random) function f : R^d × R → R, P(f) = P(f(X, Y)) denotes the expectation taken with respect to (X, Y) ∼ P only (ignoring the potential randomness in the construction of f). The notation E is used for any other expectation. Moreover, for any measurable function t : R^d → R, we denote For a random function t̂ : ω ↦ (x ↦ t̂(ω)(x)), let ‖t̂‖_{α,P} = ‖t̂(X)‖_{α,P} : ω ↦ ‖t̂(ω)‖_{α,P}, with a similar definition for ‖t̂‖_{ψ_1,P}. Fix a dataset D_{n_t}, K ∈ {1, . . . , n_t} and, for any k ∈ [|1; K|], let t̂_k = A_k(D_{n_t}) : x ↦ q̂_k(D_{n_t}) + ⟨θ̂_k(D_{n_t}), x⟩. More precisely, to apply [25, Theorem 17], one must show inequalities of the form H(w_1, w_2, (t̂_k)_{1≤k≤K}): for all r ≥ 2, where w_1, w_2 are non-decreasing functions. Since φ_c is Lipschitz, it is enough to control ‖t̂_k − t̂_l‖_{ψ_1,P} and ‖t̂_k − t̂_l‖_{2,P} by functions of ℓ(s, t̂_k) and ℓ(s, t̂_l).

B.1 A few lemmas
Lemma B.1 Let X be a non-negative random variable such that where a ≥ 1. Let g ∈ L_1(R_+, e^{−x} dx) be an increasing, differentiable function. Then, for all b ∈ R_+,

Lemma B.2 Let Z be a random variable. Then, for all r > 2, In particular, if ‖Z‖_{L_r} ≤ κ_r ‖Z‖_{L_2} for some r > 2, κ_r > 0, then Assume now that ‖Z‖_{L_r} ≤ κ_r ‖Z‖_{L_2}. Then

Lemma B.3 Let Z be a ψ_1 random variable. Then, for all r ∈ N,

Proof By definition of ‖Z‖_{L_{ψ_1}} and Markov's inequality, It follows by Lemma B.1 that ≤ 2r! (moment of an exponential distribution).
Then, for all integers r ≥ 2,

Proof Since 2 > 1, the statement is true for r = 2. Consider now r ≥ 3. Let b > 1 be a real number to be determined later. Then By definition of ‖Z‖_{L_{ψ_1}} and a Chernoff bound, the variable Y = Z/‖Z‖_{L_{ψ_1}} satisfies P(Y ≥ x) ≤ 2e^{−x} for all x, therefore by Lemma B.1, An easy induction argument shows that It follows that Let b = 4 + 4 log κ ≥ 4 + 2 log 2. Then, for all r ≥ 3, As a result, for all r ≥ 3, We now prove that, for all t ≥ b, t ≥ 2 log t + 2 log κ. For all t ≥ 4, It follows that for all t > 4 + 4 log(κ) = b, t > 2 log t + 2 log(κ). In particular,

Lemma B.5 There exists a constant µ_0 such that, for any sub-exponential random variable Z and any κ ≥ √2,

Proof By Lemmas B.3 and B.2, for all r ≥ 3, The conclusion follows, since by assumption log κ ≥ log(√2) > 0.
B.2 Controlling the ψ_1 norm ‖t̂_k − t̂_l‖_{ψ_1,P}

First, let us bound the supremum norm by the L_2 norm.
Claim B.5.1 For any k ∈ {1, . . . , K}, recall that t̂_k = A_k(D_{n_t}). Then:

Proof Let X be independent from D_n and observe that for any k, where b̂_k = q̂_k + θ̂_k^T(PX) (using the notations of hypothesis 2.1). Note that ‖1‖_{ψ_1,P} = 1/log 2. Hence, by the triangle inequality, The definition of κ (equation (7)) implies that

A uniform bound on the Orlicz norm is also required.
It can be bounded as follows.
Claim B.6.1 Assume that hypotheses (Reg-T) and (Uub) hold and that, for some λ > 0, κ(K) log(κ(K)) ≤ λ√n_t. Then

Proof Let (k, l) ∈ {1, . . . , K}². Defining X̄_i = X_i − (1/n_t) Σ_{i=1}^{n_t} X_i and changing variables in hypothesis 2.1 from (q, θ) to (b, θ) with b = q + ⟨θ, (1/n_t) Σ_{i=1}^{n_t} X_i⟩, we can rewrite t̂_k as Then, for ε ∈ [0; b/2], (23) contradicts the minimality of |b̂_k|. Thus, (21) leads to a contradiction. Let i be such that b̂_k + θ̂_k^T X̄_i ≥ b̂_l + θ̂_l^T X̄_i. Then Exchanging k and l yields Let X ∼ X_1 be independent from D_{n_t}. For any k, l, As X is independent from D_{n_t}, conditionally on D_{n_t}, by hypothesis 2, Hence, by Lemma B.5, Thus, by the hypotheses of Claim B.6.1, The result follows, since for all n_t ≥ 3, (6L/log 2) n_t^α ≤ (2L/log 2) n_t^{1+α}.

B.3 Proving hypotheses H
The following lemmas will be useful.
s is a risk minimizer. Under hypothesis (Lcs), almost surely, for any u ∈ R, As a result, for any u ∈ R,

Proof Recall that Then φ'_c(x) = sgn(x)(|x| ∧ c) and φ''_c(x) = I_{|x|≤c}. By differentiating under the expectation, almost surely, for any u such that |u − s(X)| ≤ c/2, This proves the first equation. Since s(X) is a global minimum, it follows that, for any u ∈ [s(X) − c/2; s(X) + c/2], Because ℓ_X(·) is convex, for any u such that u ≥ s(X) + c/2, Similarly, for u < s(X) − c/2, ℓ_X(u) − ℓ_X(s(X)) ≥ (ηc/4)(s(X) − u). This proves the lemma.
We now relate the L 2 norm to the excess risk in the following Proposition.

B.4 Conclusion of the proof
where w A is defined in Proposition B.8. It remains to apply [25,Theorem 17] and to express the remainder term as a simple function of c, n v , n t , κ, L, K and α. We recall here the definition of the operator δ used in the statement of that theorem.
Definition B.9 For any function h : R_+ → R_+ and any ξ > 0, let

The following lemma will facilitate the computation of δ(w_A, ·).

Proof To find δ(h_{r,s}, ξ), notice that, given the definition of δ(h_{r,s}, ξ), the condition s ≤ ξ is obviously necessary for the infimum to be finite. Assume now that ξ ≥ s. For any u ≥

then applying Agghoo to the collection (A_k)_{1≤k≤K} yields the following oracle inequality.
C Applications of Theorem 3.2

C.1 Gaussian vectors
Proof For any θ ∈ R^d, Z = ⟨θ, X − PX⟩ is a centered Gaussian random variable. By homogeneity of norms, the quotient ‖Z‖_{L_{ψ_1}}/‖Z‖_{L_2} does not depend on the scale parameter σ; it is therefore a numerical constant; moreover, one can check that for Z ∼ N(0; 1), Thus, we can choose κ(K) = 1/log 2, so that κ(K) log(κ(K)) < 0.6.
Lemma C.1 There exists a constant µ_6 such that, for any subset I with |I| ≤ min(n_t/log n_t, (2/5)(n_t − 1)) and for all ε ∈ (0; 1], Moreover, if in addition |I| ≤ n_t/log d, then for all ε ∈ (0; 1], P(γ ≤ ε) ≤ 2e As a result, for any r ∈ (0, (n_t − 1)/3], ‖1/γ‖_{L_r} ≤ 2µ_6 e² 2(1 + 2e

Proof By restricting to a subspace, we can always assume that M(θ) = √(θ^T Σ θ) is a norm. Let S_Σ = {θ ∈ E_I : √(θ^T Σ θ) = 1} be the unit sphere in the norm M. Let ε > 0. By changing coordinates, it is easy to see that the metric entropy of S_Σ in the norm M is the same as that of the Euclidean sphere S in the Euclidean norm. Therefore, for any δ > 0, there exists a finite set S_{Σ,δ}, of cardinality less than (6/δ)^d, such that for any u ∈ S_Σ, there exists v ∈ S_{Σ,δ} such that M(u − v) ≤ δ/2. Therefore, On the other hand, for any θ ∈ S_Σ, ⟨θ, X_i − PX_i⟩ is standard normal, therefore