Robust Regression through the Huber’s criterion and adaptive lasso penalty

Abstract: The Huber's criterion is a useful method for robust regression. The adaptive least absolute shrinkage and selection operator (lasso) is a popular technique for simultaneous estimation and variable selection, whose adaptive weights allow the estimator to enjoy the oracle properties. In this paper we propose to combine the Huber's criterion with an adaptive lasso penalty. The resulting regression technique is resistant to heavy-tailed errors or outliers in the response. Furthermore, we show that the estimator associated with this procedure enjoys the oracle properties. This approach is compared with the LAD-lasso, which is based on the least absolute deviation loss with an adaptive lasso penalty. Extensive simulation studies demonstrate satisfactory finite-sample performance of this procedure. A real example is analyzed for illustration purposes.


Introduction
Data subject to heavy-tailed errors or outliers are commonly encountered in applications; such contamination may appear either in the response variables or in the predictors. We consider here the regression problem with responses subject to heavy-tailed errors or outliers. In this case, the Ordinary Least Squares (OLS) estimator is known to be inefficient. To overcome this problem, the least absolute deviation (LAD) or Huber-type estimators, for instance, can be useful. On the other hand, an important topic in linear regression analysis is variable selection. Variable selection is particularly important when the true underlying model has a sparse representation. To enhance the prediction performance of the fitted model and obtain an easy interpretation, we need to identify the significant predictors. Scientists prefer a simpler model because it sheds more light on the relationship between the response and the covariates. We consider the important problem of robust model selection.
The lasso penalty is a regularization technique for simultaneous estimation and variable selection ([32]). It consists in adding an ℓ1 penalty to the least squares criterion; this penalty shrinks some coefficients exactly to zero. In [4], the authors show that, since the lasso uses the same tuning parameter for all the regression coefficients, the resulting estimators may suffer an appreciable bias. Recently, [20, 18, 38] and [39] showed that the underlying model must satisfy a nontrivial condition for the lasso estimator to be consistent in variable selection. Consequently, in some cases, the lasso estimator cannot be consistent in variable selection. In a first attempt to avoid this, [4] proposes the SCAD penalty and shows its consistency in variable selection. The main drawback of the SCAD penalty is its non-convexity: it typically leads to optimisation problems suffering from the local minima problem. In a second attempt, [39] provides a convex optimisation problem leading to an estimator that is consistent in variable selection. He assigns adaptive weights for penalizing the coefficients differently in the ℓ1 penalty and calls this new penalty the adaptive lasso. Owing to the convexity of this penalty, it typically leads to convex optimisation problems, which therefore do not suffer from the local minima problem. These adaptive weights in the penalty allow the estimator to enjoy the oracle properties. Moreover, the adaptive lasso can be computed by the same efficient algorithm (LARS) used to solve the lasso (see [39]).
In [36], the authors propose to treat the problem of robust model selection by combining the LAD loss and the adaptive lasso penalty. They thus obtain an estimator which is robust against outliers and also enjoys a sparse representation. Unfortunately, the LAD loss (ℓ1 criterion) is not well adapted to small errors: it penalizes small residuals strongly. In particular, when the error has no heavy tail and does not suffer from outliers, this estimator is expected to be less efficient than the OLS estimator with an adaptive lasso penalty. In practice, we do not know which situation we are in, so it is important to consider methods having good performance in both situations.
That is why one may prefer to consider Huber's criterion with concomitant scale (see [12]). The Huber's criterion is a hybrid of the squared error for relatively small errors and the absolute error for relatively large ones. In this paper, we propose to combine Huber's criterion with concomitant scale and the adaptive lasso penalty. We show that the resulting estimators enjoy the oracle properties. This approach is compared with the LAD-lasso, based on the least absolute deviation loss with an adaptive lasso penalty. Extensive simulation studies demonstrate satisfactory finite-sample performance of this procedure.
The rest of the article is organized as follows. In Section 2, we recall the lasso-type methods and introduce the Huber's criterion with adaptive lasso penalty. In Section 3, we give its statistical properties. Section 4 is devoted to simulations. This study compares the Huber's criterion with adaptive lasso to two other methods: the least squares criterion with adaptive lasso and the LAD-lasso approach.
In Section 5, we analyze Chinese stock market data for illustration purposes. We relegate technical proofs to the Appendix.

Lasso-type estimator
Let us consider the linear regression model

y_i = α* + x_i^T β* + σ ǫ_i ,  i = 1, . . ., n,  (2.1)

where x_i = (x_i1, . . ., x_ip)^T is the p-dimensional centered covariable (that is, ∑_{i=1}^n x_i = 0), α* is the constant (intercept) parameter and β* = (β*_1, . . ., β*_p)^T is the vector of associated regression coefficients. We suppose that σ > 0 and that the ǫ_i are independent and identically distributed random errors with mean 0 and variance 1 when it exists; indeed, in the sequel we do not need the variance to exist.
Let A = {1 ≤ j ≤ p : β*_j ≠ 0} and p_0 = |A|. In the variable selection context, we usually assume that β*_j ≠ 0 for j ≤ p_0 and β*_j = 0 for j > p_0, for some p_0 ≥ 0. In this case the correct model has p_0 significant regression variables. We denote by β_A the vector given by the coordinates of β whose indices are in A.
When p_0 = p, the unknown parameters in the model (2.1) are usually estimated by minimizing the ordinary least squares criterion. To shrink unnecessary coefficients to 0, [32] proposed to add a constraint on the ℓ1-norm of the coefficients. This leads to the primal formulation of the lasso. The link with the following dual criterion is studied in [21]:

Q_lasso(α, β) = ∑_{i=1}^n (y_i − α − x_i^T β)^2 + λ_n ∑_{j=1}^p |β_j| ,

where λ_n > 0 is the tuning parameter. Notice that the intercept α does not appear in the penalty term since it seems unreasonable to constrain it. Fan and Li [4] studied a class of penalization methods including the lasso. They showed that the lasso method leads to estimators that may suffer an appreciable bias. Furthermore, they conjectured that the oracle properties do not hold for the lasso. Hence Zou [39] proposes to consider the following modified lasso criterion, called the adaptive lasso,

Q_adl(α, β) = ∑_{i=1}^n (y_i − α − x_i^T β)^2 + λ_n ∑_{j=1}^p ŵ^adl_j |β_j| ,

where ŵ^adl = (ŵ^adl_1, . . ., ŵ^adl_p) is a known weights vector. We refer to Subsection 2.4 for the hyperparameter choice. This modification allows the procedure to produce sparse solutions more effectively than the lasso. Precisely, Zou [39] shows that, with a proper choice of λ_n and of ŵ^adl, the adaptive lasso enjoys the oracle properties.
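To make the construction concrete, here is a minimal numerical sketch of the adaptive lasso computed through the usual rescaling trick (divide each column by its weight, run a plain lasso, rescale the coefficients back). The toy data, the choice of scikit-learn's Lasso solver and the fixed value of the tuning parameter are illustrative assumptions, not the implementation used in this paper.

```python
# Adaptive lasso via the rescaling trick: a minimal sketch, not the authors' code.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.normal(size=(n, p))
beta_star = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y = 1.0 + X @ beta_star + rng.normal(size=n)

# Pilot (root-n-consistent) estimator: here OLS, as suggested for the adaptive lasso.
pilot = LinearRegression().fit(X, y)
gamma = 1.0
w = 1.0 / np.maximum(np.abs(pilot.coef_), 1e-10) ** gamma   # adaptive weights

# Minimizing ||y - a - Xb||^2 + lam * sum_j w_j |b_j| is equivalent to a plain lasso
# on the rescaled design X / w, followed by rescaling the coefficients back.
# (scikit-learn's Lasso uses a 1/(2n) normalization of the squared loss, so its
# alpha corresponds to lambda_n only up to that factor.)
lam = 0.1
X_tilde = X / w                        # column j divided by w_j
fit = Lasso(alpha=lam, fit_intercept=True, max_iter=50_000).fit(X_tilde, y)
beta_hat = fit.coef_ / w               # back to the original parametrization
print("estimated coefficients:", np.round(beta_hat, 3))
```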

Robust lasso-type estimator: The LAD-lasso
When the regression error has very heavy tails or suffers from outliers, the finite-sample performance of the lasso can be poor. A first attempt to solve this problem was made in [36]. That paper provides a procedure inspired by the convex criterion

Q_ladl(α, β) = ∑_{i=1}^n |y_i − α − x_i^T β| + λ_n ∑_{j=1}^p ŵ^ladl_j |β_j| .

Note that the intercept α is not included in the study of [36], but to compare the methods fairly we consider this intercept term in this paper. As in [39], the authors show that, with a proper choice of λ_n and of ŵ^ladl = (ŵ^ladl_1, . . ., ŵ^ladl_p), the adaptive LAD-lasso enjoys the oracle properties. We again refer to Subsection 2.4 for the hyperparameter choice. Moreover, the obtained estimator is robust to heavy-tailed errors since the squared loss has been replaced by the ℓ1 loss. Unfortunately, this loss is not well adapted to small errors: it penalizes small residuals strongly. In particular, when the error has no heavy tail and does not suffer from outliers, this estimator is expected to be less efficient than the adaptive lasso. That is why one may prefer to consider criteria like Huber's.
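In practice, the LAD-lasso objective can be minimized with any median-regression routine by appending one pseudo-observation per penalized coefficient, since λ_n ŵ_j |β_j| = |0 − λ_n ŵ_j β_j|. The sketch below uses statsmodels' QuantReg at the median as the LAD solver; the toy data, the unit weights and the λ_n value are illustrative assumptions, not the implementation of [36].

```python
# LAD-lasso through data augmentation: a minimal sketch, not the code of [36].
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 100, 8
X = rng.normal(size=(n, p))
beta_star = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y = 1.0 + X @ beta_star + rng.standard_t(df=3, size=n)   # heavy-tailed errors

lam = np.log(n)          # the lambda_n = log(n) choice mentioned in Subsection 2.4
w = np.ones(p)           # adaptive weights would come from a pilot LAD fit

# Design with an unpenalized intercept column.
X_int = np.column_stack([np.ones(n), X])
# One pseudo-observation per coefficient: response 0, design row lam * w_j * e_j,
# with a zero in the intercept column so that the intercept stays unpenalized.
X_pen = np.column_stack([np.zeros(p), np.diag(lam * w)])
X_aug = np.vstack([X_int, X_pen])
y_aug = np.concatenate([y, np.zeros(p)])

# Median regression on the augmented data minimizes
#   sum_i |y_i - a - x_i'b| + lam * sum_j w_j |b_j|  (up to a constant factor).
res = sm.QuantReg(y_aug, X_aug).fit(q=0.5)
print("intercept:", round(res.params[0], 3))
print("coefficients:", np.round(res.params[1:], 3))
```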

The Huber's Criterion with adaptive lasso
To be robust to heavy-tailed errors or outliers in the response, another possibility is to use the Huber's criterion as loss function, as introduced in [12]. For any positive real M, let us introduce the function

H_M(z) = z^2 if |z| ≤ M,  and  H_M(z) = 2M|z| − M^2 if |z| > M.

This function is quadratic for small values of z but grows linearly for large values of z. The parameter M describes where the transition from quadratic to linear takes place. The Huber's criterion is then obtained by applying H_M to the standardized residuals (y_i − α − x_i^T β)/s, where s > 0 is a scale parameter for the distribution. That is, if each y_i is replaced by cy_i for c > 0, then an estimate ŝ should be replaced by cŝ. Usually, the parameter s is denoted by σ. To avoid confusion, we adopt here another notation since one can choose σ as scale parameter, but various other choices are possible: for example, any multiple of σ is a scale parameter, and these are not the only ones.
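A small sketch of the function H_M just described (quadratic below M, linear beyond), useful to keep in mind for the rest of the section; the vectorized implementation and the test values are ours.

```python
# The Huber-type function H_M: quadratic for |z| <= M, linear with slope 2M beyond.
# A minimal sketch consistent with the description in the text; not the authors' code.
import numpy as np

def huber_H(z, M=1.345):
    z = np.asarray(z, dtype=float)
    quad = z ** 2
    lin = 2.0 * M * np.abs(z) - M ** 2
    return np.where(np.abs(z) <= M, quad, lin)

# The two branches agree at |z| = M, and the derivative is continuous there.
print(huber_H([0.5, 1.345, 3.0]))   # -> [0.25, 1.809025, 6.260975]
```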
Let us remark that with this loss function errors smaller than sM get squared, while larger errors increase the criterion only linearly. In [12], the parameter M is viewed as a shape parameter chosen to control the amount of robustness. The Huber's criterion becomes more similar to least squares for larger values of M, while it becomes more similar to the LAD criterion for small values of M. In the latter case, the Huber's method is more robust against outliers (like the LAD method) but less efficient for normally distributed data. In [12], Huber proposes to fix M = 1.345 to obtain as much robustness as possible while remaining efficient for normally distributed data. Even if we adopt this approach, there remains the scale parameter to estimate.
As far as we know, all the algorithms of the literature designed with a penalty first estimate the unknown parameter σ defined in (2.1) and plug it in as the scale s in the Huber's criterion. Among all the possible estimations of the standard deviation σ of the data, there is no rule for choosing one. A popular choice (see e.g. [29, 25, 14, 28]) is the Median Absolute Deviation (MAD). It has a simple explicit formula, needs little computational time and is very robust, as witnessed by its bounded influence function and its 50% breakdown point. Note that it has a low (37%) gaussian efficiency ([33]). However, this kind of approach is criticizable (see e.g. [10]). Indeed, the Huber's criterion is designed to work with a scale parameter s which, ideally, is not the standard deviation σ of the data (as shown by Lemma 1 below). Moreover, σ is only a nuisance parameter: the location parameter is generally more important. So, focusing attention on a good estimation of σ introduces difficulties in a superfluous step of the procedure: Huber's criterion only needs a well-designed scale s. In [12], Huber proposed to estimate s jointly with the parameters of the model in several ways. The common property shared by these methods is that they do not need an estimation of the parameter σ. We retain the Huber's criterion with concomitant scale, defined by

L_H(α, β, s) = ∑_{i=1}^n ( s + s H_M( (y_i − α − x_i^T β)/s ) ) for s > 0, and L_H(α, β, 0) = 2M ∑_{i=1}^n |y_i − α − x_i^T β| ,  (2.2)

which is to be minimized with respect to s ≥ 0, α and β. To our knowledge, the statistical properties of this loss function have never been studied. Theorem 3.1 below shows that, as for the MAD, the provided scale estimation is a robust transformation of the residuals of a location estimate. However, in the case of the MAD, the location estimate is the OLS estimator, which can have very poor quality, especially if collinearity is involved; in that case, the corresponding residuals are irrelevant. On the contrary, the scale used by the Huber's criterion with concomitant scale is obtained from more relevant residuals. Let us define, for s > 0,

F(s) = E[ s + s H_M(σǫ/s) ] ,  and let s* denote a minimizer of F over ]0, +∞[.  (2.3)

We have the following lemma (its proof is given in Appendix 6.3).
Lemma 1. If M > 1 and (N2) (defined below) holds, then there exists a unique s* > 0 satisfying (2.3). Moreover, it satisfies

E[ 1 + H_M(σǫ/s*) − (σǫ/s*) H'_M(σǫ/s*) ] = 0 .  (2.4)

Consequently, the ŝ obtained by the minimisation of the Huber loss function with concomitant scale is an estimation of the scale parameter s*. Generally, it is a poor estimation of the standard deviation σ of the data. As explained previously, the algorithms of the literature use an estimation of σ as scale parameter, which is not necessarily well suited to the Huber criterion. It is noticeable that the scale parameter s* equals the standard deviation σ of the noise only if the loss is quadratic (i.e. M = +∞).
Let us now briefly comment on the way we estimate the intercept α* of the model. In practice, it is usual to center the y's and the x's and to remove the intercept from the optimization problem. This procedure is equivalent to minimizing the quadratic loss over α. However, since we use the Huber loss (and not the quadratic loss), this procedure is no longer equivalent to minimizing the loss function, so we minimize the Huber's loss over α. Consequently, in our procedure, the intercept α* is no longer estimated by a mean.
In this paper, we want to combine the Huber's criterion and an adaptive lasso penalty. In particular, this allows the estimator to enjoy the oracle properties (see Section 3). So, we consider the following criterion

Q_Hadl(α, β, s) = L_H(α, β, s) + λ_n ∑_{j=1}^p ŵ^Hadl_j |β_j| ,

where ŵ^Hadl = (ŵ^Hadl_1, . . ., ŵ^Hadl_p) is a known weights vector which will be defined in Section 2.4. As can be seen, the criterion Q_Hadl combines the Huber's criterion and the adaptive lasso penalty. Hence, the resulting estimator is expected to be robust against outliers and also to enjoy a sparse representation. Let us remark that Q_Hadl(α, β, s) is a convex function, and thus the optimization problem does not suffer from the multiple local minima issue: its global minimizer can be computed efficiently. We give an algorithm in Appendix 6.4.
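The sketch below assembles the pieces numerically: the concomitant-scale Huber loss of the previous paragraphs plus the weighted ℓ1 penalty, minimized jointly over (α, β, s) with a generic derivative-free optimizer. It is only a low-dimensional illustration under our own assumptions (toy data, unit weights, a fixed λ_n); it is not the algorithm of Appendix 6.4.

```python
# Huber's criterion with concomitant scale plus an adaptive-lasso penalty:
# a low-dimensional numerical sketch, not the algorithm of Appendix 6.4.
import numpy as np
from scipy.optimize import minimize

def huber_H(z, M=1.345):
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= M, z ** 2, 2.0 * M * np.abs(z) - M ** 2)

def Q_Hadl(theta, y, X, w, lam, M=1.345):
    """theta = (alpha, beta_1..beta_p, s); returns the penalized objective."""
    alpha, beta, s = theta[0], theta[1:-1], theta[-1]
    r = y - alpha - X @ beta
    if s <= 0:                       # limit of the loss as s -> 0 is 2M * sum|r|
        loss = 2.0 * M * np.sum(np.abs(r))
    else:
        loss = np.sum(s + s * huber_H(r / s, M))
    return loss + lam * np.sum(w * np.abs(beta))

rng = np.random.default_rng(2)
n, p = 80, 3
X = rng.normal(size=(n, p))
beta_star = np.array([2.0, 0.0, -1.0])
y = 0.5 + X @ beta_star + rng.standard_t(df=3, size=n)

w = np.ones(p)                       # adaptive weights would come from a pilot Huber fit
lam = 2.0                            # illustrative tuning parameter
theta0 = np.concatenate([[np.median(y)], np.zeros(p), [np.std(y)]])
res = minimize(Q_Hadl, theta0, args=(y, X, w, lam),
               method="Nelder-Mead",
               options={"maxiter": 50_000, "maxfev": 50_000,
                        "xatol": 1e-8, "fatol": 1e-8})
alpha_hat, beta_hat, s_hat = res.x[0], res.x[1:-1], res.x[-1]
print("alpha:", round(alpha_hat, 3), "beta:", np.round(beta_hat, 3), "s:", round(s_hat, 3))
```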

Tuning parameter estimation
We now consider the problem of tuning parameter estimation. For the adaptive lasso method, Zou [39] proposes the following estimation of the weights vector. Let β̂ be a root-n-consistent estimator of β*; for instance, one can use the OLS estimate β̂^ols. Let γ > 0 be a constant to be determined. He defines the weights vector estimation as ŵ^adl_j = |β̂^ols_j|^{−γ}, j = 1, . . ., p. Then he uses two-dimensional cross-validation to find an optimal pair (γ, λ_n).
For the LAD-lasso, Wang et al. [36] consider a similar estimation of the weights vector. Let β̂^lad be the unpenalized LAD estimator of β*. They propose to use the weights vector estimation ŵ^ladl_j = |β̂^lad_j|^{−1}, j = 1, . . ., p, and fix λ_n = log(n). For the Huber's criterion with concomitant scale and adaptive penalty, we proceed in a similar way. We denote by β̂^H the unpenalized Huber's estimator with concomitant scale. The weights vector is then estimated by ŵ^Hadl_j = |β̂^H_j|^{−γ}, j = 1, . . ., p. It remains to evaluate the constants λ_n and M. As in [12], we fix M = 1.345. Next, we can use cross-validation to find optimal values of λ_n. Let us note that the theoretical part is given for these forms of weights vector, and for the numerical results we fix γ equal to 1.
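As an illustration of this weighting scheme, the snippet below builds ŵ_j = |β̃_j|^{−γ} from a pilot robust fit. scikit-learn's HuberRegressor (whose epsilon plays the role of M) is used here as a convenient stand-in for the unpenalized Huber's estimator with concomitant scale, and the small floor on |β̃_j| is our own safeguard against division by zero.

```python
# Adaptive weights from a pilot fit: a sketch. HuberRegressor is a stand-in for the
# unpenalized Huber estimator with concomitant scale, not the estimator of the paper.
import numpy as np
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(3)
n, p = 100, 8
X = rng.normal(size=(n, p))
beta_star = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y = 1.0 + X @ beta_star + rng.standard_t(df=3, size=n)

pilot = HuberRegressor(epsilon=1.345, alpha=0.0, fit_intercept=True).fit(X, y)
gamma = 1.0
w_hat = 1.0 / np.maximum(np.abs(pilot.coef_), 1e-10) ** gamma
print("pilot coefficients:", np.round(pilot.coef_, 2))
print("adaptive weights  :", np.round(w_hat, 2))
```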
Following the remarks of an anonymous referee, for these three methods a BIC-type model selection procedure has been designed to choose the regularization parameter λ_n. Let us now describe precisely the BIC criteria we used for each method. The collections of estimators obtained using the adaptive-lasso or LAD-lasso methods are naturally log-likelihood estimators. The corresponding collections of models (containing probability density functions) are nested. We use the classical BIC criteria. In [30], relying on the Kullback-Leibler divergence, it is recommended to select the λ^adl_n minimizing, over λ_n, the logarithm of the sum of squared residuals ∑_{i=1}^n (y_i − α̂_{λ_n} − x_i^T β̂^adl_{λ_n})^2 plus a k_{λ_n} log(n) penalty term for the adaptive-lasso, and the analogous criterion based on the absolute residuals ∑_{i=1}^n |y_i − α̂_{λ_n} − x_i^T β̂^ladl_{λ_n}| for the LAD-lasso. Let us note that k_{λ_n} denotes the model dimension.
Following [35] and [37], we use for k_{λ_n} the number of non-zero coefficients of β̂^adl_{λ_n} (resp. β̂^ladl_{λ_n}) for the adaptive-lasso (resp. LAD-lasso). In the underlying models all the residuals are supposed to have the same distribution: gaussian or double exponential.
Let us now design a BIC-type procedure taking advantage of the two previous ones. The flexibility of BIC criteria makes it possible to gather the collections of estimators obtained by the adaptive-lasso and the LAD-lasso. The corresponding collections of models are no longer nested, but BIC criteria have been designed to work in this framework (see [30]). Thus one easily gets a BIC criterion to select the final estimator in this augmented collection of estimators. However, real datasets are likely to contain some outliers. We thus propose a BIC-type procedure relying on the Huber's loss. In this way, the weight associated with each residual is adapted: the residuals are not all treated in the same way. By analogy with the two previous criteria, we propose to select λ^Hadl_n by plugging the Huber's loss of the Huber-lasso within the logarithm of the BIC criterion.
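The following sketch illustrates the kind of BIC-type selection described above: for each λ_n on a grid, fit the penalized estimator, count the non-zero coefficients k_λ, and keep the λ_n minimizing the logarithm of the fitted loss plus a k_λ log(n)/n term. The `fit_at` and `loss_at` functions and the toy soft-thresholded fit are hypothetical placeholders, and the exact normalization of the criterion used in the paper (following [30]) may differ from the generic form shown here.

```python
# A generic BIC-type selector over a grid of tuning parameters: a sketch only.
# fit_at() and loss_at() are hypothetical placeholders for the penalized fit and its
# fitted loss (squared, absolute or Huber residuals depending on the method).
import numpy as np

def bic_select(lambdas, fit_at, loss_at, n):
    scores = []
    for lam in lambdas:
        coef = fit_at(lam)                      # fitted (alpha, beta) at this lambda
        k = np.sum(np.abs(coef[1:]) > 1e-5)     # model dimension = non-zero betas
        scores.append(np.log(loss_at(coef) / n) + k * np.log(n) / n)
    return lambdas[int(np.argmin(scores))]

# Toy usage with a quadratic loss and a crude soft-thresholded OLS stand-in fit.
rng = np.random.default_rng(4)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, 0.0, 1.0]) + rng.normal(size=n)

def fit_at(lam):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    b = np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)
    return np.concatenate([[0.0], b])

def loss_at(coef):
    return np.sum((y - coef[0] - X @ coef[1:]) ** 2)

grid = np.concatenate([[0.0], np.logspace(np.log10(0.01), np.log10(1400.0), 100)])
print("selected lambda:", bic_select(grid, fit_at, loss_at, n))
```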

Some remarks on scale invariance
An estimator ê(y_1, . . ., y_n) calculated from the data (x_i, y_i)_{1≤i≤n} is said to be scale invariant if ê(cy_1, . . ., cy_n) = c ê(y_1, . . ., y_n) for all c > 0. Note that, in this definition, the design is fixed. This means that if each y_i is replaced by cy_i for c > 0, then the estimate ê should be replaced by cê.
It is important to consider location estimators (α̂, β̂) satisfying this property since they provide a coherent interpretation of the results. Indeed, if we change the scale of our measurements y by an arbitrary change of units, the selected variables are the same and the predictions change accordingly. Note that other equivariance notions have also been introduced in [17].
The following easy property can be used to assert that an estimator is scale invariant. Let A be a cone of R^q (this means that cx ∈ A whenever c > 0 and x ∈ A). If ê(y_1, . . ., y_n) = argmin_{γ∈A} Q(y_1, . . ., y_n, γ) with Q(cy_1, . . ., cy_n, cγ) = g(c) Q(y_1, . . ., y_n, γ) for some g(c) ≥ 0, then ê(y_1, . . ., y_n) is scale invariant. Let us note that the lasso procedure (OLS with lasso penalty) is not scale invariant. On the other hand, the LAD criterion or the Huber's one with concomitant scale is always scale invariant when combined with the lasso penalty. But when an adaptive penalty is introduced in the previous criteria (even with γ = 1), the scale invariance property is lost. On the other hand, if we consider the whole procedure (choice of λ_n by a BIC-type criterion or a cross-validation technique, together with the estimation method), the adaptive methods are all scale invariant.
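For instance, the LAD criterion with a (non-adaptive) lasso penalty satisfies the homogeneity relation above with g(c) = c, which is why the corresponding estimator is scale invariant:

```latex
\[
Q(cy_1,\dots,cy_n,\,c\alpha,\,c\beta)
  = \sum_{i=1}^n \bigl|cy_i - c\alpha - x_i^T (c\beta)\bigr|
    + \lambda_n \sum_{j=1}^p |c\beta_j|
  = c\, Q(y_1,\dots,y_n,\alpha,\beta), \qquad c>0 .
\]
```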

Theoretical properties
Let X denote the design matrix, i.e. the n × p matrix whose i-th row is x_i^T. We will use some of the following assumptions on this design matrix.
(D2) X^T X / n converges, as n → ∞, to a positive definite matrix V. We denote by V_{1,1} the submatrix of V corresponding to the covariables associated with non-zero coefficients.
Assumptions (D1) and (D2) are classical. For instance, (D1) is assumed in theorem 4.1 of [17] to ensure the asymptotic normality of the regression quantile estimates. It can also be seen as a compactness assumption: it is satisfied if the covariables are bounded. In [39], the author needs only assumption (D2), since a least squares criterion is used as loss function.
Let us denote by ǫ a variable with the same law as the ǫ_i, i = 1, . . ., n. The following assumptions on the errors are used in the sequel: (N0) the distribution of the errors does not charge the points ±Ms*, that is, P[σǫ ∈ {−Ms*, Ms*}] = 0; (N1) the distribution of ǫ is symmetric about 0; (N2) P[|ǫ| ≤ a] > 0 for every a > 0. Note that (N0) holds if ǫ is absolutely continuous with respect to the Lebesgue measure, and (N2) is satisfied if, moreover, the density is continuous and strictly positive at the origin (which is assumption A of [36]). Condition (N1) is natural without prior knowledge on the distribution of the errors, and (N2) ensures that the noise is not degenerate. It is noticeable that no integrability condition is assumed on the errors ǫ. The theorems ensuring the convergence of penalized least squares estimators (e.g. [16] and [39]) usually assume that ǫ has a finite variance.
Theorem 3.1 below states that the estimation of the scale provided by the Huber criterion with concomitant scale is robust to large residuals. The robustness comes from the fact that only the smallest residuals are taken into account to obtain the scale estimation. For (α, β) fixed, let us sort the absolute values of the residuals: |r_{(1)}(α, β)| ≤ |r_{(2)}(α, β)| ≤ . . . ≤ |r_{(n)}(α, β)|. For a real number x, ⌈x⌉ denotes the smallest integer larger than x. We then have the following theorem (its proof is postponed to Appendix 6.1).
Theorem 3.1. When M ≠ 1, there exists a unique ŝ^Hadl. Moreover, if M ≤ 1 then ŝ^Hadl = 0 and (α̂^Hadl, β̂^Hadl) is obtained by minimising the penalised ℓ1 loss (as for the LAD). If M > 1, ŝ^Hadl is given by (3.1), where the number of residuals entering the estimation is determined by (3.2). Note that the criterion (3.2) determines which residuals are small enough to be used in the estimation (3.1) of the scale: it relies on the energy of the smallest residuals. This way of being robust differs from the one used by a MAD-type estimation of the standard deviation σ, where the median (of the residuals) is used. Note that when M = +∞, k = n and ŝ^Hadl is the classical arithmetic mean of the squared residuals. This is the maximum likelihood estimator of σ² in the gaussian case. Unfortunately, in this case, the likelihood is not concave; the Huber loss criterion with M = +∞ provides a convex objective function whose minimum leads to the same estimation. Moreover, for any (α, β), the loss admits the decomposition (3.3), which shows that the loss function linearly penalizes the largest residuals. We draw the reader's attention to the fact that the small residuals are put together through an ℓ2 norm, and not the classical squared ℓ2 norm. This is natural since we consider a scale invariant procedure. Consequently, the objective function Q_Hadl with M = +∞ is not equal to Q_adl.
In the following theorem we show that, with a proper choice of λ_n, the proposed estimator enjoys the oracle properties. Its proof is postponed to Appendix 6.2. Theorem 3.2. Suppose that λ_n/√n → 0 and λ_n n^{(γ−1)/2} → ∞. Let us also assume that M > 1, p_0 > 0, and that conditions (N0), (N1), (N2), (D1) and (D2) hold. Moreover, for j = 1, . . ., p, the weights in Q_Hadl are ŵ^Hadl_j = 1/|β̂_j|^γ, where β̂ is a root-n-consistent estimator of β*. Then any minimizer (α̂^Hadl, β̂^Hadl, ŝ^Hadl) of Q_Hadl is consistent in variable selection and asymptotically normal, with an asymptotic covariance matrix Σ² that is block diagonal. The asymptotic normality of any minimizer of the Huber loss, without concomitant scale nor penalty, has already been studied in [3]. It is noticeable that the previous result holds under the same assumptions: the introduction of the concomitant scale in the criterion does not lead to supplementary hypotheses.
Unlike plug-in methods for the scale, this result provides a simultaneous convergence of the location and scale estimations. Moreover, the asymptotic variance Σ² is block diagonal, which means that these estimations are asymptotically independent.
In [39], the author has already treated the case where the loss is quadratic. Theorem 3.2 generalizes theorem 2 of [39] to deal with a robust loss (i.e. the case M < +∞). It is noticeable that, in the quadratic case where M = +∞, the corresponding block of the asymptotic variance matrix reduces to σ² V_{1,1}^{−1} and we recover the asymptotic variance of theorem 2 of [39].
Let us also remark that the assumptions on λ_n are the same as in [39]; they are used in the control of the penalty term. The data-dependent ŵ^Hadl is the key in Theorem 3.2. As the sample size grows, the weights associated with zero coefficients get inflated (to infinity), whereas they converge to a finite constant for non-zero coefficients. Consequently, as explained in [39], we can simultaneously estimate the large coefficients (asymptotically) unbiasedly and remove the zero coefficients.
For simplicity, Theorem 3.2 is stated assuming that the preliminary estimator β̂ is a root-n-consistent estimator of β*. Using the results on the loss function given in the proof of Theorem 3.2, we can prove that the unpenalized Huber's estimator β̂^H satisfies this property and, consequently, can be used to determine the weights ŵ^Hadl. As noticed in [39], by examining the provided proof carefully, this assumption on β̂ can be greatly weakened.
The same kind of results can be proved in the random design setting using similar techniques.

Simulation results
In this section, the algorithm minimising the objective function Q_adl (resp. Q_ladl and Q_Hadl) is called ad-lasso (resp. LAD-ad-lasso and Huber-ad-lasso). The adaptive weights are obtained from the corresponding unpenalized estimator, and γ = 1. Our aim is to compare the finite-sample performances of these procedures. The purpose of these simulations is to see how our estimator performs in the absence of outliers (where robustness is not essential) and in their presence (where robustness is needed), in comparison with other robust (LAD-ad-lasso) or non-robust (ad-lasso) methods. Paragraph 4.1 presents the studied models. The way the simulations are conducted is described in Paragraph 4.2, and an overview of the conclusions is provided in Paragraph 4.3.

Models used for simulations
The models used to compare the performances of the algorithms are inspired by those presented in [32], [39] and [36]. They all have the form y = 1_n + Xβ* + σǫ, where 1_n denotes the vector of R^n composed of ones and y (resp. ǫ) represents the response (resp. error) vector (y_1, . . ., y_n)^T (resp. (ǫ_1, . . ., ǫ_n)^T). The vector of true coefficients is β* = (3, 1.5, 0, 0, 2, 0, 0, 0)^T. Compared with (2.1), this means that the intercept of the model is α* = 1 and the number of variables (without the intercept) is p = 8. The number of influencing variables is p_0 = 3. The design matrix X is constructed as follows. The rows of X are given by n independent gaussian vectors N_8(0, Σ_r). They are normalized such that the corresponding p-dimensional covariables are centered (as assumed in (2.1)). For some r ≥ 0, the variance matrix of the variables is defined by Σ_{r,i,j} = r^{|i−j|} for 1 ≤ i, j ≤ p.
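A sketch of this data-generating mechanism is given below. The error distribution is left as a parameter because the four specific error models of the study (light-tailed for Models 1-2, heavy-tailed for Models 3-4) are not reproduced here; the standard normal default, the value of r and σ in the demo call are placeholders.

```python
# Simulation design of the study: y = alpha* 1_n + X beta* + sigma * eps, with
# Sigma_{r,ij} = r^{|i-j|}. The error generator is a placeholder (the paper's four
# error models are not reproduced here); a sketch, not the authors' code.
import numpy as np

def make_dataset(n, r=0.5, sigma=1.0, error_gen=None, seed=0):
    rng = np.random.default_rng(seed)
    beta_star = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    p = beta_star.size
    cov = np.fromfunction(lambda i, j: r ** np.abs(i - j), (p, p))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    X -= X.mean(axis=0)                       # center the covariables, as in (2.1)
    eps = rng.standard_normal(n) if error_gen is None else error_gen(rng, n)
    y = 1.0 + X @ beta_star + sigma * eps     # intercept alpha* = 1
    return X, y

X, y = make_dataset(n=100, r=0.5)
print(X.shape, y.shape)
```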
These four models can be divided into two types. The first type contains light-tailed error models (1 and 2), whereas the second type is composed of heavy-tailed error models (3 and 4). Models 1 and 2 allow us to quantify the deterioration of the performances of the robust methods LAD-ad-lasso and Huber-ad-lasso in the absence of outliers. From a maximum likelihood point of view, the loss used in ad-lasso (resp. Huber-ad-lasso, LAD-ad-lasso) is well designed for Models 1 and 2 (resp. Model 3, Model 4); see [12] for a justification of this claim for Huber-ad-lasso.

Prediction accuracy
The following procedure has been designed to compare the performances of the various algorithms in the fixed design setting. The performances are measured both by the prediction errors and by the model selection ability. For each of the considered underlying models, the distribution of the design matrix is given; it is used to generate the covariates of the training and test datasets. So we generate a first set of n training designs (x_1, . . ., x_n) and a second set of m = 10 000 test designs (x_{n+1}, . . ., x_{n+m}). These two sets are centered in mean to stick to the theoretical definition (2.1) of the model. For the ad-lasso, [39] recommends to standardize the design in mean (i.e. to ensure that ∑_{i=1}^n x_i = 0), which only leads to α̂ = ȳ and does not affect the value of β̂ (since the intercept is not penalized and the squared loss is used). Concerning the LAD-ad-lasso, the intercept is not included in the study of [36] and no recommendations are provided for normalizing the data beforehand; moreover, their simulations are performed on stochastically independent covariables. If (α̂^ladl, β̂^ladl) denotes the LAD-ad-lasso estimator, α̂^ladl is the median of the n residuals y_i − x_i^T β̂^ladl, i = 1, . . ., n. In general, for the LAD-ad-lasso or the Huber-ad-lasso, such a normalization has some effect on β̂, but it is not so clear how it works (since the loss function is no longer the squared loss). In this paper, we follow the normalization of the design in mean provided in [39] for the three procedures.
Since the theoretical results are established in the fixed design framework, the training and test designs are fixed once and for all: they are used for all the data generations. 100 training sets of size n are generated according to the definition (2.1) of the model. All the algorithms have been run on the 100 training sets of size n = 50, 100, 200 and their prediction capacity has been evaluated on the test design set of size m = 10 000.
In order to compare the prediction accuracy, the Relative Prediction Errors (RPEs) already considered in [39] are computed. Let us now recall the precise definition of this index. In our model, the true regression function is m(x) = α* + x^T β* for any x ∈ R^p. Consequently, any estimator (α̂, β̂) of (α*, β*) leads to the estimation m̂_n(x) = α̂ + x^T β̂ of the regression function. The following decomposition of the excess risk is classical when we are interested in regression functions (see e.g. [9]): E_{x,y}(y − m̂_n(x))² = E_{x,y}(y − m(x))² + E_x(m̂_n(x) − m(x))². Since in our model E_{x,y}(y − m(x))² = σ², the relative prediction error E_x(m̂_n(x) − m(x))²/σ² is reported. In our simulations, it is estimated by the corresponding mean ∑_{j=1}^m (m̂_n(x_{n+j}) − m(x_{n+j}))²/(mσ²) over the test sample. Tables 1 and 2 provide the mean and the standard deviation of the 100 obtained RPEs. Figures 1 and 2 provide the corresponding boxplots.
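In code, the RPE estimate over the test design reads as follows (a sketch; the estimate producing alpha_hat and beta_hat is arbitrary here):

```python
# Relative prediction error estimated on a fixed test design: a minimal sketch.
import numpy as np

def relative_prediction_error(alpha_hat, beta_hat, alpha_star, beta_star,
                              X_test, sigma):
    m_hat = alpha_hat + X_test @ beta_hat        # estimated regression function
    m_true = alpha_star + X_test @ beta_star     # true regression function
    return np.mean((m_hat - m_true) ** 2) / sigma ** 2

# Toy usage with an arbitrary estimate.
rng = np.random.default_rng(5)
X_test = rng.normal(size=(10_000, 8))
beta_star = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
beta_hat = beta_star + 0.1 * rng.normal(size=8)
print(relative_prediction_error(1.05, beta_hat, 1.0, beta_star, X_test, sigma=1.0))
```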

Selection ability
The model selection ability of the algorithms is reported in the same manner as in [36], [32] and [4], in Tables 3, 4, 5 and 6 for cross-validation and in Tables 7, 8, 9 and 10 for BIC. In order to compute the indicators defined below, a coefficient is considered to be zero if its absolute value is strictly less than 10^{−5} (i.e. its first five decimals vanish). In all cases, amongst the 100 obtained estimators, the first column (C) counts the number of correctly chosen models, i.e. the cases where the first, second and fifth coordinates of β̂ are non-zero and the third, fourth, sixth, seventh and eighth ones are zero. To go further in the model selection ability analysis, we consider other measurements. The second column (O) reports the number of overfitting models, i.e. those selecting all the non-zero coefficients and at least one zero coefficient. The third column (U) reports the number of underfitting models, i.e. those not selecting at least one non-zero coefficient. In this way, all 100 models are counted exactly once. Columns (O) and (U) aim to explain the results obtained in (C). The fourth column (SV) reports the median number of selected variables. The fifth column (MM) reports the minimal and maximal values of the estimated coefficients of the non-influencing variables. For each n and each column, the best performance is indicated in bold. The sixth (resp. seventh) column (IZ) (resp. (CZ)) provides the average number of correctly (resp. mistakenly) estimated 0's. The last column (Z) is the average number of estimated 0's. Model selection ability is closely related to the accuracy of the coefficient estimations. This fact is illustrated by column (MM) and by the boxplots of the coefficient estimations in Figures 3 and 4.
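The indicators above can be computed as follows; the thresholding at 10^{-5} and the index sets follow the description in the text, while the stacking of the 100 replications into one array is our own convention.

```python
# Model selection indicators (C, O, U, SV, IZ, CZ, Z) for a stack of estimated
# coefficient vectors: a sketch following the definitions given in the text.
import numpy as np

def selection_summary(betas, true_nonzero=(0, 1, 4), tol=1e-5):
    betas = np.asarray(betas)                    # shape (n_replications, p)
    p = betas.shape[1]
    nonzero = np.abs(betas) >= tol
    true_mask = np.zeros(p, dtype=bool)
    true_mask[list(true_nonzero)] = True

    correct = np.all(nonzero == true_mask, axis=1)               # exactly the true model
    all_signal = np.all(nonzero[:, true_mask], axis=1)
    over = all_signal & np.any(nonzero[:, ~true_mask], axis=1)   # true model plus extras
    under = ~all_signal                                          # misses a true variable
    return {
        "C": int(correct.sum()), "O": int(over.sum()), "U": int(under.sum()),
        "SV": float(np.median(nonzero.sum(axis=1))),
        "IZ": float((~nonzero[:, ~true_mask]).sum(axis=1).mean()),  # correct zeros
        "CZ": float((~nonzero[:, true_mask]).sum(axis=1).mean()),   # mistaken zeros
        "Z": float((~nonzero).sum(axis=1).mean()),
    }

# Toy usage with two fake replications.
print(selection_summary([[3.0, 1.4, 0.0, 0.0, 2.1, 0.0, 0.0, 0.0],
                         [2.9, 0.0, 0.0, 0.0, 1.9, 0.2, 0.0, 0.0]]))
```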

Hyperparameter choices
Concerning the hyperparameter choices, the regularization parameters are chosen either by the BIC criterion or by a 5-fold cross-validation on each of the 100 training sets. The same grid has always been used: it is composed of 0 and of 100 points log-linearly spaced between 0.01 and 1400. For the Huber-ad-lasso, the simulation studies report the performances obtained with M = 1.345, the value recommended by Huber in [12]. Let us remark that it is possible to choose the parameter M from the data (for example by cross-validation, simultaneously with the tuning parameter), but in practice we did not observe any improvement in making it data adaptive. It is also noticeable that, within each of the four considered models, the two model selection procedures have been tested on the same simulated datasets.
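The grid and the cross-validation loop can be sketched as follows; `fit_at` and `predict_loss` are hypothetical placeholders for the penalized fit and the validation loss of whichever method is being tuned.

```python
# The lambda grid used in the experiments (0 plus 100 log-linearly spaced points
# between 0.01 and 1400) and a generic 5-fold cross-validation loop: a sketch.
import numpy as np
from sklearn.model_selection import KFold

lambda_grid = np.concatenate([[0.0],
                              np.logspace(np.log10(0.01), np.log10(1400.0), 100)])

def cv_select(X, y, fit_at, predict_loss, grid=lambda_grid, n_splits=5, seed=0):
    """fit_at(X_tr, y_tr, lam) -> model; predict_loss(model, X_va, y_va) -> float.
    Both are hypothetical placeholders for the method being tuned."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    cv_error = np.zeros(len(grid))
    for tr, va in kf.split(X):
        for k, lam in enumerate(grid):
            model = fit_at(X[tr], y[tr], lam)
            cv_error[k] += predict_loss(model, X[va], y[va])
    return grid[int(np.argmin(cv_error))]
```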

Comparison results
To begin with, let us point out some facts observed both with cross-validation and with BIC model selection. The gaussian-tailed error models (1 and 2) emphasize that the use of the Huber loss with concomitant scale instead of the squared loss does not lead to a significant loss of performance in the absence of outliers. Indeed, the relative prediction errors and the model selection ability of Huber-ad-lasso are close to those of ad-lasso (see Tables 1, 2, 3 and 7). It is noticeable that, as expected, the results of LAD-ad-lasso are the worst from the prediction and model selection points of view. Let us also stress that in the case of large correlations between the covariables (Model 2), the use of the Huber loss does not solve the poor behavior of the quadratic loss: in this case, in comparison with the low correlated case, we even observe a slightly more marked deterioration of the performances of Huber-ad-lasso with respect to ad-lasso. To settle this, one has to change the penalty: this is a point for future work. The heavy-tailed error models (3 and 4) emphasize better performances of Huber-ad-lasso with respect to ad-lasso. More precisely, the quadratic loss is particularly unsuited in the presence of a small quantity of potentially large outliers (Model 3). Indeed, the relative prediction errors of ad-lasso are around ten times as big as those of Huber-ad-lasso in this Model 3 (see Tables 1 and 2). The standard deviation is also far bigger (see Figures 1 and 2). In this case, ad-lasso leads to a poor estimation of the coefficients (see Figures 3 and 4). In the presence of noticeable outliers (Model 4), the quadratic loss gets worse results than the Huber loss, but the gap is smaller. Let us now compare the Huber-ad-lasso with a robustly designed algorithm: the LAD-ad-lasso. In the gaussian-tailed error models (1 and 2), Huber-ad-lasso always gets better results than the LAD-ad-lasso from the prediction error and model selection ability points of view (see Tables 1, 2, 3, 7, 4 and 8). Tables 4 and 8 emphasize that the benefit decreases when the correlations between variables increase. Next, let us point out some differences due to the model selection procedure. These differences occur in the comparison of Huber-ad-lasso and LAD-ad-lasso in the heavy-tailed models (Models 3 and 4). When cross-validation is used, in Model 3, Huber-ad-lasso and LAD-ad-lasso get similar performances in terms of prediction error and model selection ability (see Tables 1 and 5), while in Model 4 Huber-ad-lasso has a better model selection ability than LAD-ad-lasso. On the contrary, when a BIC-type criterion is used, the results are less surprising. Indeed, the nature of the noise directly determines the more performant algorithm (from the prediction error and model selection ability points of view). Precisely, as expected, the mixture of gaussians (resp. double exponential) noise of Model 3 (resp. 4) favours Huber-ad-lasso (resp. LAD-ad-lasso) over LAD-ad-lasso (resp. Huber-ad-lasso).
[Figure 3 caption: For n = 100, estimations of influencing and non-influencing coefficients by Huber-ad-lasso (H), LAD-ad-lasso (L) and ad-lasso (A) with cross-validation.]
[Figure 4 caption: For n = 100, estimations of influencing and non-influencing coefficients by Huber-ad-lasso (H), LAD-ad-lasso (L) and ad-lasso (A) with BIC.]
Finally, let us provide some general concluding remarks highlighted by these experiments. If the model selection is performed by cross-validation, it is preferable to use Huber-ad-lasso rather than ad-lasso or LAD-ad-lasso. However, it is noticeable that, for each of the four models, each value of n and each method, the performances in terms of model selection ability are better with a BIC-type criterion than with cross-validation. Moreover, roughly speaking, the BIC-type procedures are five times faster than the corresponding cross-validation procedures.

A real example: The Chinese stock market data
In this section we consider, for illustration purposes, the Chinese stock market data analyzed in [36]. This data set is derived from CCER China Stock, which was partially developed by the China Center for Economic Research (CCER) at Peking University. It contains a total of 2 247 records; each record corresponds to one yearly observation of one company. In this data set, 1 092 records come from year 2002 and the rest come from year 2003. As in [36], we use the year 2002 records as the training data and the year 2003 records as the testing data. The response variable is the return on equity (denoted by ROE) of the following year (denoted by ROE_{t+1}). The explanatory variables, all measured at year t, include the ROE of the current year (denoted by ROE_t), the asset turnover ratio (ATO), the profit margin (PM), the debt-to-asset ratio or leverage (LEV), the sales growth rate (GROWTH), the price-to-book ratio (PB), account receivables/revenues (ARR), inventory/asset (INV), and the logarithm of total assets (ASSET).
The reliability of the usual OLS-based estimation and model selection methods (e.g. the lasso) is severely challenged for this data set (see [36]), whereas robust selection methods (such as LAD- or Huber-based methods) become more attractive. We run the three methods ad-lasso, LAD-ad-lasso and Huber-ad-lasso with both hyperparameter choice methods (cross-validation and the BIC criterion). Each method is used to select the best model based on the training dataset of 2002. The prediction accuracies of these methods are measured by the mean absolute prediction error (MAPE) and the corresponding standard deviation (STD), evaluated on the testing data for 2003. The STD denotes the standard deviation of the absolute values of the prediction errors divided by the square root of the number of testing data. Let us notice that, contrary to [36], we do not penalize the intercept term. For comparison purposes, the results of the full model based on the LAD, OLS and Huber estimators are also reported (see Table 11). We find results similar to those of [36]. The MAPE of the OLS (with or without variable selection) is large and substantially worse than that of all the robust methods (LAD- or Huber-based), which justifies the use of robust methods for this data set. According to the reported standard error of the MAPE estimate, the differences between the robust methods are not statistically significant. Based on a substantially simplified model, the prediction accuracy of the robust lasso estimators remains very satisfactory. In particular, the Huber-ad-lasso leads to the same set of selected variables for both hyperparameter choice methods, whereas this is not the case for the LAD-ad-lasso.
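For reference, the two reported accuracy measures can be computed as below (a sketch; the predictions come from whichever fitted model is being evaluated):

```python
# Mean absolute prediction error and its standard error on the test data: a sketch.
import numpy as np

def mape_and_std(y_test, y_pred):
    abs_err = np.abs(np.asarray(y_test) - np.asarray(y_pred))
    mape = abs_err.mean()
    std = abs_err.std(ddof=1) / np.sqrt(abs_err.size)   # standard error of the MAPE
    return mape, std

print(mape_and_std([0.10, -0.05, 0.20], [0.08, 0.00, 0.15]))
```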

Proof of Theorem 3.1
We have ŝ^Hadl = argmin_{s≥0} L_H(α̂^Hadl, β̂^Hadl, s). The continuity and convexity of the objective function defining the estimators with respect to the involved variables imply that the new objective function f : s → L_H(α̂^Hadl, β̂^Hadl, s) is continuous and convex on R_+. From now on, we denote r_{(i)} = r_{(i)}(α̂^Hadl, β̂^Hadl) for i = 1, . . ., n. The set R_+ can be cut up into intervals such that, on each interval, the function f takes the form a/s + bs + c for various constants a, b and c (case 1: s = 0; case 2: 0 < s ≤ |r_{(1)}|/M; and so on, the breakpoints being given by the ordered absolute residuals divided by M). When M < 1, all the b coefficients are strictly positive, but all the functions s → a/s + bs + c are considered on their increasing part. By continuity of f on R_+, the function f is increasing in this case. Thus, ŝ^Hadl = 0.
When M = 1, the function f is constant on [0, |r_{(1)}|] and increasing on [|r_{(1)}|, +∞[. Thus, if r_{(1)} ≠ 0, the argmin is not unique, but we can choose ŝ^Hadl = 0. Thus, if M ≤ 1, ŝ^Hadl = 0. The definition of (α̂^Hadl, β̂^Hadl, ŝ^Hadl) and (2.2) then imply that (α̂^Hadl, β̂^Hadl) is an argmin of the penalized loss 2M ∑_{i=1}^n |y_i − α − x_i^T β| + λ_n ∑_{j=1}^p ŵ^Hadl_j |β_j|. The first point of Theorem 3.1 is established. Next, we suppose that M > 1. For the intervals having k ≤ n(1 − 1/M²), the b coefficients are non-positive (b ≤ 0). Thus, by continuity of f on R_+, the function f is decreasing on these intervals. The interval corresponding to k = ⌈n(1 − 1/M²)⌉ is thus the first one on which the function f can attain its minimum. By convexity of f, among all the intervals there is only one on which f has a valley. The expression and the unicity of ŝ^Hadl come from the expression of f on this interval.

Proof of Theorem 3.2
In Step 1, we prove the asymptotic normality of this estimator and, in Step 2, its consistency in variable selection.
Step 1. Let us first prove the asymptotic normality. This proof is an adaptation to our case of the proof given in [39]; the only difference concerns the treatment of the loss function. In the following, we therefore use notations similar to those of [39] and point out the differences between the two proofs.
Let us define U_n(u) = Q_Hadl((α*, β*, s*) + u/√n) − Q_Hadl(α*, β*, s*) and û^(n) = argmin_u U_n(u). The principle of the proof of [39] is to study the epi-limit of U_n. The notion of epi-convergence has been introduced to ensure variational properties: it provides the natural conditions, minimal in some sense, under which, if U_n epi-converges to U, one can guarantee the convergence of the corresponding minimization problems. See theorem 1.10, section 2.2 and page 39 of [2] for a precise definition in the case of deterministic functions. As noticed by [6], if we restrict ourselves to lower semicontinuous extended real-valued functions from R^q with non-empty epigraphs, epi-convergence is induced by an explicit distance. Consequently, [6] defines epi-convergence in distribution (denoted →_{e−d}) for random lower semicontinuous extended real-valued functions on R^q as the convergence in distribution for this distance. We refer to [15] and [23] for other equivalent definitions. It has already been used in statistics to obtain expedient proofs of asymptotic normality theorems; for example, it has been used in [16], and theorem 4.1 of [17] (asymptotic normality of the regression quantile estimates) is proved using epi-convergence. In these cases, it leads to a simpler argument than the ones used previously. The function U_n(u) can be decomposed as the sum of two terms: one due to the loss function, given by J_n(u) = L_H((α*, β*, s*)^T + u/√n) − L_H(α*, β*, s*), and one due to the penalty term, given by P_n(u) = λ_n ∑_{j=1}^p ŵ^Hadl_j (|β*_j + u_j/√n| − |β*_j|). This penalty term is the same as in [39]. Let P(u) = 0 if u is such that u_j = 0 for all j ∉ A, and P(u) = +∞ otherwise. [39] shows that, under the assumptions λ_n/√n → 0, λ_n n^{(γ−1)/2} → ∞ and β̂ a root-n-consistent estimator of β*, P_n epi-converges to P.
Concerning the term J_n, we need the following lemma. Let u_{1:p} be the vector defined by (u_1, . . ., u_p)^T. We denote by →_{f−d} the finite-dimensional convergence in distribution.
Using the definition of [6], the notion of epi-convergence in distribution of convex lower semicontinuous random functions is a particular case of the weak convergence of a net, as stated in definition 1.33 of [34]. Consequently, we can use Slutsky's theorem (page 32 and example 1.4.7 of [34]) to ensure that the epi-convergences of J_n and P_n imply that (J_n, P_n) →_{e−d} (J, P), since P is deterministic. However, the topology involved in epi-convergence does not behave well with addition: the sum of epi-limits is not necessarily the epi-limit of the sum (see [19] and [27] for counter-examples). Consequently, the epi-convergence of the couple (J_n, P_n) is not enough to ensure the epi-convergence of the sum; one has to consider a stronger convergence for one of the coordinates. Let →_{u−d} denote the convergence in distribution with respect to the topology of uniform convergence on compact sets. Since J_n →_{f−d} J and J_n, J are (for n sufficiently large) finite convex functions, [15], page 13, ensures that J_n →_{u−d} J. Gathering (J_n, P_n) →_{e−d} (J, P), J_n →_{u−d} J and the continuity of J, theorem 4 of [15] ensures that U_n = J_n + P_n →_{e−d} U, where U(u) = J(u) if u_j = 0 for all j ∉ A, j = 1, . . ., p, and U(u) = +∞ otherwise. Since U_n and U are convex lower semicontinuous functions defined on the whole set R^{p+2}, gathering the previous epi-convergence with the definition of û^(n) as an argmin, theorem 5 of [15] ensures that û^(n) converges in distribution to the argmin of U (6.1). Indeed, V_{1,1} is supposed positive definite in assumption (D2) and Theorem 3.2 assumes that the noise satisfies (N2); consequently, U has a unique argmin and the asymptotic normality part is proved.
Step 2. Let us now show the consistency in variable selection. A finite intersection of random sets whose probabilities tend to 1 as n tends to infinity also has probability tending to 1 as n tends to infinity. Consequently, it suffices to show that P[A ⊂ A_n] → 1 and P[A^c ⊂ A_n^c] → 1 as n tends to infinity.
The first claim is an easy consequence of (6.1). Indeed, (6.1) implies that, for all j ∈ A, β̂^Hadl_j →_P β*_j with β*_j ≠ 0. Thus P[A ⊂ A_n] → 1 and the first claim is proved. Let j be such that β*_j = 0. To prove the second claim, we have to show that P[β̂^Hadl_j ≠ 0] → 0 as n tends to infinity. If β̂^Hadl_j ≠ 0 and ŝ^Hadl > 0, the objective function Q_Hadl has a partial derivative with respect to β_j at β̂^Hadl_j. Moreover, this function is minimal at (α̂^Hadl, β̂^Hadl, ŝ^Hadl), so the limit of the Newton difference quotient defining this partial derivative is null. Now, the probability of this event is given by P[β̂^Hadl_j ≠ 0, ŝ^Hadl = 0] + P[β̂^Hadl_j ≠ 0, ŝ^Hadl > 0]. Let us now prove (6.4). Using the definition (2.1) of the model and of û^(n), equation (6.1) ensures that û^(n) = O_P(1), since it converges in distribution to a random variable that is finite everywhere (theorem 2.4 of [33]). Combined with ŝ^Hadl →_P s* > 0, this leads to the next step. Let us denote, for 1 ≤ i ≤ n, Z_i = 1 + H_M(σǫ_i/s*) − (σǫ_i/s*) H'_M(σǫ_i/s*). From Lemma 3, we have F_n(u) →_P f(u), and equation (6.6) follows. Due to the convexity of F_n and f, the pointwise convergence in probability of Lemma 3 ensures the convergence in probability, uniformly over any compact set, of F_n and of its derivative. This result is available in a deterministic framework in theorems 10.8 and 25.7 of [26]; its generalization to convergence in probability is done in theorem II.1 (appendix II) of [1] for F_n (see also [24], section 6, for a direct proof). In [1], the convergence in probability is shown by extracting from any subsequence of F_n a subsequence converging almost surely; the subsequence is extracted using a countable dense set of points in R^{p+2} and a diagonal argument, and theorem 10.8 of [26] ensures the almost sure convergence of the extracted subsequence. The same argument can be used to ensure the convergence in probability, uniformly over any compact set, of ∇F_n, using theorem 25.7 of [26] instead of theorem 10.8 (see [5], page 73, for a detailed treatment of the involved diagonal argument). This means that the supremum over any compact set of ‖∇F_n(u) − ∇f(u)‖ tends to 0 in probability. Next, tr(X^T X)/n converges, and condition (D2) thus ensures that (tr(Var(H_n)))_{n≥1} is a convergent, and hence bounded, sequence. Moreover, condition (N1) implies that E[H_n] = 0. Since Markov's inequality bounds P[‖H_n‖ ≥ M] in terms of tr(Var(H_n))/M², we get that sup_{n≥1} P[‖H_n‖ ≥ M] → 0 as M tends to infinity, which is the definition of H_n = O_P(1). Since û^(n)_{1:p} = O_P(1) (a consequence of (6.1)) and H_n = O_P(1), (6.7) leads to (6.4).

Proof of Lemma 1
Let us first show that s* is characterized by equation (2.4). Convexity of H_M(·) and theorem 1.1.6, page 5, of [11] imply that, for all t ∈ R, the function f_t : s → s + s H_M(σt/s) is convex on ]0, +∞[. Since the law of ǫ defines a positive measure, this implies that the function F : s → E[f_ǫ(s)] is convex on ]0, +∞[. Consequently, the following uniform bound holds: for all s > 0, |f'_t(s)| ≤ 1. This allows us to use the dominated convergence theorem, implying that F is of class C¹ on ]0, +∞[ and that s* > 0 satisfies (2.3) if and only if it satisfies (2.4). In order to finish the proof of Lemma 1, it suffices to show that the function F' has a unique zero on ]0, +∞[. The function g is continuous and such that g(x) → 1 − M² as x → +∞ and g(0) = 1. Thus, justifying the use of the dominated convergence theorem by the same uniform bound as previously, we get F'(s) → 1 − M² as s tends to 0 and F'(s) → 1 as s tends to infinity. The continuity of F' and the intermediate value theorem imply that F' has a zero on ]0, +∞[ if M > 1. To see that this zero is unique, we show that the function F' is strictly increasing. Let s_1 > s_2 > 0. The following decomposition holds since the function g is constant on ]−∞, −M] ∪ [M, +∞[; under condition (N2), the last quantity is strictly positive. This equality also shows that condition (N2) is necessary to ensure that the function F' is strictly increasing. Indeed, suppose that (N2) does not hold, i.e. that there exists a > 0 such that P[|ǫ| ≤ a] = 0. If we consider s_1 = aσ/M and s_2 = aσ/(2M), we have s_1 > s_2. Moreover, the last equality implies that F'(s_1) = F'(s_2), since the integral of a function over a null set is equal to 0. Consequently, if (N2) does not hold, the function F' is not strictly increasing. This concludes the proof.

Proof of Lemma 2
First, we need the following lemma.
Gathering this lemma and Lemma 4 yields the finite-dimensional convergence of J_n. In order to show that the asymptotic variance Σ² involved in Lemma 4 is a diagonal matrix, one has to note that Cov(δ, W) = E[δW] = 0. This holds since the function x → H'_M(x) g(x) is odd (H'_M(·) is odd and g(·) is even) and (N1) is assumed.
To conclude the proof of Lemma 2: since J is a convex, continuous and everywhere finite random function, part (a) of theorem 5 of [15] implies that J_n →_{e−d} J, since the finite-dimensional convergence holds.

Proof of Lemma 3
The conditions M > 1 and (N2) ensure that s* > 0 through Lemma 1. Thus, for n sufficiently large (with respect to a non-random bound depending on u_{p+1}), u_{p+1}/√n + s* > 0. Consequently, the definitions (2.1) of the model and (2.2) of the function L_H imply, for n sufficiently large (with respect to a non-random bound depending on u_{p+1}), the decomposition into the terms d_{n,i}. Simple calculations lead to an explicit expression of the derivative of d_{n,i} at 0, for 1 ≤ i ≤ n. We have to show the convergence in probability of the sum of the terms d_{n,i}(u) − d_{n,i}(0) − u^T (∂d_{n,i}/∂u)(0), since this quantity is equal (for n sufficiently large) to the left-hand side quantity in (6.8). The structure of the rest of the proof is inspired by [3].
Step 1. To begin with, we show that Var(D_n) → 0 as n → +∞. Using that the errors (ǫ_i)_{1≤i≤n} are independent random variables, Var(D_n) is the sum of the variances of the individual terms. Moreover, convexity of H_M(·) combined with proposition 2.2.1, page 160, of [11] implies that, for all 1 ≤ i ≤ n, d_{n,i} is a convex function. As such, it lies above all of its tangents: 0 ≤ d_{n,i}(u) − d_{n,i}(0) − u^T (∂d_{n,i}/∂u)(0) ≤ u^T ( (∂d_{n,i}/∂u)(u) − (∂d_{n,i}/∂u)(0) ).
Furthermore, easy calculations imply that the previous upper bound is equal to the sum of three terms I_1, I_2 and I_3. In order to show that each term I_i (1 ≤ i ≤ 3) tends to 0 as n tends to infinity, we notice that they all have the same general form. Let ϕ denote a Lipschitz continuous function from R to R with Lipschitz constant L such that, for some constants K, a suitable bound holds. Assumption (N0) ensures that the distribution function of σǫ is continuous at ±Ms*. Consequently, Γ is continuous at 0 with Γ(0) = 0: as (h, h') tends to 0, Γ(h, h') → 0. Moreover, assumption (D1) ensures that the corresponding maximum over the design tends to 0 as n tends to infinity. Consequently, the resulting assertion directly entails that I_1 and I_2 tend to 0 as n goes to infinity; combined with assumption (D2), it also implies I_3 → 0. Consequently, Var(D_n) → 0.
Step 2. Now, we show that E[D_n] → A_{s*}(u_0² + u_{1:p}^T V u_{1:p}) + D_{s*} u_{p+1}² as n → +∞. Since the function H'_M(·) is odd, the explicit expression of the derivative of d_{n,i} (given at the beginning of the proof), combined with assumption (N1) and property (2.4) of s*, implies that E[(∂d_{n,i}/∂u)(0)] = 0 (note that Z_i = g(σǫ_i/s*)). Consequently, the definition of D_n leads to E[D_n] = ∑_{i=1}^n E[d_{n,i}(u) − d_{n,i}(0)]. Moreover, the random variables (σǫ_i)_{1≤i≤n} are identically distributed, thus E[D_n] = n E[d_{n,1}(u) − d_{n,1}(0)]. The function H_M(·) is not twice differentiable at ±M. In the case of the Huber loss without concomitant scale, [3] already solves this technical difficulty by an argument of approximation of convex functions. In the case of the Huber loss with concomitant scale, roughly speaking, one has to deal with a supplementary variable (s). However, the function L_H(α, β, s) is still convex with respect to (α, β, s). Thus, following readily the proof of Lemma 1 of [3], we get the desired expansion of Ψ, in return for a bit more work to deal with the supplementary terms. We notice that the proofs obtained in this way rely on the convexity of the involved functions. Now, we claim that this expansion has nothing to do with convexity. Using successively the dominated convergence theorem, we show that the function Ψ is twice differentiable at 0. The rigorous reasoning relies on the fact that the derivatives with respect to h and h' of the integrated function are related to the functions g and H'_M(·). Precisely, using that these functions are Lipschitz continuous and bounded, we can derive the uniform bounds justifying the uses of the dominated convergence theorem. Consequently, the classical Taylor-Young theorem is available to get the desired Taylor expansion. Moreover, the coefficients A_{s*}, B_{s*}, C_{s*}, D_{s*}, E_{s*} are now explicit. Note that the proof relying on the dominated convergence theorem only supposes that (N0) holds to provide satisfactory coefficients. Under the supplementary hypothesis (N1), it is clear that E_{s*} = 0 and B_{s*} = 0, since H'_M(·) is odd. Since we also suppose M > 1 and that (N2) holds, s* satisfies (2.4), which is equivalent to C_{s*} = 0. Finally, under assumptions (N0), (N1) and (N2), we get the announced limit of E[D_n].

Table 1
Means of the RPE (standard deviation of the 100 RPEs) with cross-validation. Best mean indicated in italics

Table 2
Means of the RPE (standard deviation of the 100 RPEs) with BIC. Best mean indicated in italics

Table 3
Model selection ability on Model 1, based on 100 replications and cross-validation

Table 4
Model selection ability on Model 2, based on 100 replications and cross-validation

Table 5
Model selection ability on Model 3, based on 100 replications and cross-validation

Table 6
Model selection ability on Model 4, based on 100 replications and cross-validation

Table 7
Model selection ability on Model 1, based on 100 replications and BIC model selection

Table 8
Model selection ability on Model 2, based on 100 replications and BIC model selection

Table 9
Model selection ability on Model 3, based on 100 replications and BIC model selection