The Smooth-Lasso and other $\ell_1 + \ell_2$-penalized methods

We consider a linear regression problem in a high-dimensional setting where the number of covariates $p$ can be much larger than the sample size $n$. In such a situation, one often assumes sparsity of the regression vector, i.e., that the regression vector contains many zero components. We propose a Lasso-type estimator $\hat\beta^{Quad}$ (where 'Quad' stands for quadratic) which is based on two penalty terms. The first one is the $\ell_1$ norm of the regression coefficients, used to exploit the sparsity of the regression vector as done by the Lasso estimator, whereas the second is a quadratic penalty term introduced to capture some additional information on the setting of the problem. We detail two special cases: the Elastic-Net $\hat\beta^{EN}$ introduced in [42], which deals with sparse problems where correlations between variables may exist; and the Smooth-Lasso $\hat\beta^{SL}$, which responds to sparse problems where successive regression coefficients are known to vary slowly (in some situations, this can also be interpreted in terms of correlations between successive variables). From a theoretical point of view, we establish variable selection consistency results and show that $\hat\beta^{Quad}$ achieves a Sparsity Inequality, i.e., a bound in terms of the number of non-zero components of the 'true' regression vector. These results are provided under a weaker assumption on the Gram matrix than the one used by the Lasso. In some situations this guarantees a significant improvement over the Lasso. Furthermore, a simulation study is conducted and shows that the S-Lasso $\hat\beta^{SL}$ performs better than known methods such as the Lasso, the Elastic-Net $\hat\beta^{EN}$, and the Fused-Lasso (introduced in [30]) with respect to estimation accuracy. This is especially the case when the regression vector is 'smooth', i.e., when the variations between successive coefficients of the unknown regression parameter are small. The study also reveals that the theoretical calibration of the tuning parameters and the one based on 10-fold cross-validation yield two S-Lasso solutions with close performance.


Introduction
We focus on the usual linear regression model
$$ y_i = x_i \beta^* + \varepsilon_i, \qquad i = 1, \dots, n, \tag{1} $$
where the design $x_i = (x_{i,1}, \dots, x_{i,p}) \in \mathbb{R}^p$ is deterministic, $\beta^* = (\beta^*_1, \dots, \beta^*_p)' \in \mathbb{R}^p$ is the unknown parameter, and $\varepsilon_1, \dots, \varepsilon_n$ are independent, identically distributed (i.i.d.) centered Gaussian random variables with known variance $\sigma^2$. We aim at estimating $\beta^*$ in the sparse case, that is, when many of its unknown components are zero. Thus only a subset of the design covariates $(X_j)_j$ is truly of interest, where $X_j = (x_{1,j}, \dots, x_{n,j})'$, $j = 1, \dots, p$. Moreover, we are interested in the high-dimensional problem where $p \gg n$, and we consider $p$ depending on $n$. In such a framework, two main problems arise: the interpretability of the prediction and the control of the variance in the estimation. To tackle these problems we use regularized selection-type procedures of the form
$$ \hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_n^2 + \mathrm{pen}(\beta) \right\}, \tag{2} $$
where $X = (x_1', \dots, x_n')'$, $Y = (y_1, \dots, y_n)'$ and $\mathrm{pen} : \mathbb{R}^p \to \mathbb{R}$ is a positive convex function called the penalty. For any vector $a = (a_1, \dots, a_n)'$, we have adopted the notation $\|a\|_n^2 = n^{-1} \sum_{i=1}^n |a_i|^2$ and we denote by $\langle \cdot, \cdot \rangle_n$ the corresponding inner product in $\mathbb{R}^n$. The choice of the penalty appears to be crucial. On the one hand, although well suited for variable selection, concave-type penalties (see for example [9,13,32]) are often computationally hard to optimize. On the other hand, Lasso-type procedures (modifications of the $\ell_1$-penalized least squares (Lasso) estimator introduced in [29]) have been extensively studied during the last few years. See for example [3,4,7,40] and references therein. Such procedures are suitable for our purposes as they perform both regression parameter estimation and variable selection at low computational cost. We will explore this type of procedure in our study.
In this paper, we propose a novel estimator, denoted by $\hat\beta^{Quad}$, which is a modification of the Lasso. It is defined as the solution of the optimization problem (2) for a combination of the Lasso penalty (i.e., $\sum_{j=1}^p |\beta_j|$) and the quadratic penalty $\beta' J' J \beta$ for some $m \times p$ matrix $J$ ($m \in \mathbb{N}^*$).
The matrix $J$ typically reflects some underlying geometry or structure in the true signal. More generally, the matrix $J$ can be chosen so that sparsity of $\beta^*$ translates to some other desired behavior depending on the context. There is a wide variety of interesting applications, and what we present below is not meant to be an exhaustive list but rather a small set of illustrative examples that motivated our work on this problem. We add this second term to the Lasso procedure for two main reasons. First, we exploit this second penalty to take into account some prior information on the data or the regression vector (such as correlation between variables or a specified structure of the regression vector). Second, the quadratic penalty is introduced to overcome (or to reduce) theoretical problems observed for the Lasso estimator. Indeed, strong conditions are required to guarantee good performance in prediction, estimation or variable selection for the Lasso procedure (see for example [3,4,18,21,34,37,40,41]). See also [33] for an overview of the conditions used to establish theoretical results for the Lasso. It was shown that the Lasso does not always ensure good performance when high correlations exist between the covariates. In this paper, we establish theoretical results showing good performance of $\hat\beta^{Quad}$ under a weaker assumption than the one needed for the Lasso estimator. The improvement is especially visible when the Lasso achieves only poor results. Two particular cases of the estimator $\hat\beta^{Quad}$ are mainly considered. The first is the Elastic-Net, introduced in [42] to deal with problems where correlations between variables exist. It is defined with the quadratic penalty term $\sum_{j=1}^p \beta_j^2$. The second and novel procedure is called the Smooth-Lasso (S-Lasso) estimator. It is defined with the $\ell_2$-fusion penalty, that is, $\sum_{j=2}^p (\beta_j - \beta_{j-1})^2$. The $\ell_2$-fusion penalty was first introduced in [17]. This term helps to tackle situations where the regression vector is structured such that its coefficients vary slowly. Let us call the regression vector 'smooth' in this case. Note, however, that our theoretical study covers a wide range of procedures, such as the closely related 'Weighted Fusion' introduced in [10]. This is detailed in Remark 1.
The main contribution of this paper is the introduction of the Smooth-Lasso estimator, which significantly improves (both in theory and in practice) on the performance of the Lasso and the Elastic-Net in some situations. However, the method is a special case of the estimator $\hat\beta^{Quad}$. This type of estimator aims at
• capturing the sparsity and some other structure (smoothness in the case of the S-Lasso);
• reducing the assumptions on the Gram matrix and providing theoretical guarantees in situations that are not suitable for the Lasso (correlations between successive covariates in the case of the S-Lasso).
From a practical point of view, some problems are also encountered when we solve the Lasso criterion (for instance with the LARS algorithm [12]). Indeed, this algorithm fails to select a complete group of correlated covariates. The Lasso thus suffers from two disadvantages here: it is consistent neither in variable selection nor in estimation (poor reconstruction of $\beta^*$). In this paper, we focus on the estimation issue. We consider the case where the regression vector $\beta^*$ is structured. We invoke the S-Lasso estimator to respond to such problems where the covariates are ranked so that the regression vector is 'smooth' (that is, the vector $\beta^*$ has only small variations in its successive components). We will see with the help of simulations that such situations support the use of the S-Lasso estimator. This estimator is inspired by the Fused-Lasso [30]. Both the S-Lasso and the Fused-Lasso combine an $\ell_1$ penalty with a fusion term [17]. The fusion term is designed to make successive coefficients as close as possible to each other. The main difference between these two procedures is that we use the $\ell_2$ distance between the successive coefficients (that is, the $\ell_2$-fusion penalty $\sum_{j=2}^p (\beta_j - \beta_{j-1})^2$), whereas the Fused-Lasso uses the $\ell_1$ distance (that is, the $\ell_1$-fusion penalty $\sum_{j=2}^p |\beta_j - \beta_{j-1}|$). Hence, compared to the Fused-Lasso, we sacrifice sparsity in the changes between successive coefficients in the estimation of $\beta^*$ for an easier optimization due to the strict convexity of the $\ell_2$ distance. This implies a large reduction in computational cost. Sparsity is nonetheless ensured by the Lasso penalty, while the $\ell_2$-fusion penalty helps to provide 'smooth' solutions. Consequently, even if there is no perfect match between successive coefficients, our results are still interpretable. From a theoretical point of view, the $\ell_2$ distance also helps us to provide theoretical properties for the S-Lasso, which in some situations appears to outperform the Lasso and the Elastic-Net (cf. [42]). Let us mention that variable selection consistency of the Fused-Lasso and the corresponding Fused adaptive Lasso has also been studied in [27], but in a different context from the one in the present paper. The results obtained in [27] are established not only under the sparsity assumption; the model is also supposed to be piecewise constant, that is, the non-zero coefficients are arranged in blocks with equal values inside each block.
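To make the contrast between the two fusion terms concrete, here is a minimal NumPy sketch that evaluates the $\ell_2$-fusion and $\ell_1$-fusion penalties on two toy coefficient vectors. The function names and the two vectors are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import numpy as np

def l2_fusion_penalty(beta):
    # sum_{j=2}^p (beta_j - beta_{j-1})^2, the S-Lasso fusion term
    return np.sum(np.diff(beta) ** 2)

def l1_fusion_penalty(beta):
    # sum_{j=2}^p |beta_j - beta_{j-1}|, the Fused-Lasso fusion term
    return np.sum(np.abs(np.diff(beta)))

# Two hypothetical sparse vectors with the same peak value:
# one varies slowly, the other is piecewise constant.
beta_smooth = np.array([0.0, 0.0, 1.0, 1.5, 2.0, 1.5, 1.0, 0.0])
beta_block = np.array([0.0, 0.0, 2.0, 2.0, 2.0, 2.0, 0.0, 0.0])

for name, b in [("smooth", beta_smooth), ("block", beta_block)]:
    print(name, l2_fusion_penalty(b), l1_fusion_penalty(b))
```

On these two vectors the $\ell_1$-fusion term takes the same value, whereas the $\ell_2$-fusion term is smaller for the slowly varying vector, which is consistent with the remark that the $\ell_2$ term favors 'smooth' rather than piecewise-constant solutions.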
Many techniques have been proposed to address the weaknesses of the Lasso. The Fused-Lasso procedure is one of them. We give here some of the most popular alternative methods. The Adaptive Lasso was introduced in [41]. It is similar to the Lasso but uses adaptive weights to penalize each regression coefficient separately. Under certain (strong) conditions, this procedure achieves Oracle Properties (that is, consistency in variable selection and asymptotic normality, see [41]). Another approach is the Relaxed Lasso (see [20]), which aims at double-controlling the Lasso estimate: one parameter controls variable selection and another controls the shrinkage of the selected coefficients. To overcome the problem due to the correlation between covariates, group variable selection has been proposed in [36] with the Group-Lasso procedure, which selects groups of correlated covariates instead of single covariates at each step. A first step towards the study of variable selection consistency has been proposed in [1], and Sparsity Inequalities were given in [8,19]. In [42], another choice of penalty has been proposed with the Elastic-Net. This penalty has also been studied for example in [5,15,43].
The rest of the paper is organized as follows. In the next section, we introduce the estimator $\hat\beta^{Quad}$ defined with the Lasso penalty together with a quadratic penalty. In particular, we define the S-Lasso estimator and a notion of smoothness. We also provide a way to solve the $\hat\beta^{Quad}$ problem with the attractive property of piecewise linearity of its regularization path. Consistency in estimation and variable selection in the high-dimensional case are considered in Section 3. We moreover provide some examples in favor of the Elastic-Net and the S-Lasso in Sections 3.1.2 and 3.1.3, and technical issues in Section 3.3. We finally give experimental results in Section 4 which show the S-Lasso performance compared to some popular methods. All proofs are postponed to the Appendix.

The S-Lasso procedure
In many applications, for example in macroeconomics, financial time series analysis, and the biological and medical sciences, one often deals with data with given complex attributes and a 'smooth' solution. This is, for instance, the case in trend filtering (see [16] for a nice survey).
As a start, let us provide a definition of a 'smooth' vector: In the applications mentioned above, the regression vector $\beta^*$ is smooth. Hence, it is important to consider estimation methods which can reflect this aspect of the problem. It is often useful to assume that the regression vector is also sparse in order to be able to treat data such as spectrometry or some genomic data, where both smoothness and sparsity appear simultaneously. For these reasons, it is worth introducing and analyzing a method which can reconstruct sparse and smooth regression vectors. Hence, we define the S-Lasso estimator $\hat\beta^{SL}$ as the solution of the optimization problem (2) with the penalty
$$ \mathrm{pen}(\beta) = \lambda |\beta|_1 + \mu \sum_{j=2}^p (\beta_j - \beta_{j-1})^2, \tag{3} $$
where $\lambda$ and $\mu$ are two positive parameters that control the sparsity of our estimator and its smoothness. For any vector $a = (a_1, \dots, a_p)'$ and integer $q$, we have used the notation $|a|_q^q = \sum_{j=1}^p |a_j|^q$. Note that the Lasso estimator is a special case of the S-Lasso with $\mu = 0$. More generally, we consider the following penalty
$$ \mathrm{pen}(\beta) = \lambda |\beta|_1 + \mu\, \beta' J' J \beta, \tag{4} $$
where $J$ is a given $m \times p$ matrix ($m \in \mathbb{N}^*$). This penalty is a combination of the Lasso penalty and a quadratic penalty. The matrix $J$ typically reflects some underlying geometry or structure in the true signal (we refer to [31] for similar ideas). Let us call $\hat\beta^{Quad}$ the solution of the minimization problem (2) with the penalty (4). The S-Lasso penalty is a particular case of the penalty (4) with $J$ chosen as in (5), that is, such that $\beta' J' J \beta = \sum_{j=2}^p (\beta_j - \beta_{j-1})^2$. The Elastic-Net corresponds to the case where $J$ is the identity matrix.
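As a rough illustration of the quadratic penalty $\beta' J' J \beta$, the following NumPy sketch builds a first-order difference matrix and checks that it reproduces the $\ell_2$-fusion term, while the identity matrix reproduces the Elastic-Net term. The $(p-1) \times p$ layout and the name `first_difference_matrix` are assumptions made for this sketch; the matrix used in the paper may be laid out differently (for example with an additional zero row), which would leave $J'J$ unchanged.

```python
import numpy as np

def first_difference_matrix(p):
    # Hypothetical (p-1) x p layout: row i maps beta to beta_{i+1} - beta_i.
    J = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    J[idx, idx] = -1.0
    J[idx, idx + 1] = 1.0
    return J

def quadratic_penalty(beta, J):
    # beta' J' J beta, written as the squared Euclidean norm of J beta
    return np.sum((J @ beta) ** 2)

p = 5
beta = np.array([0.0, 1.0, 1.5, 1.0, 0.0])

J_sl = first_difference_matrix(p)   # Smooth-Lasso choice
J_en = np.eye(p)                    # Elastic-Net choice
J_tilde = J_sl.T @ J_sl             # p x p matrix J'J

print(quadratic_penalty(beta, J_sl), np.sum(np.diff(beta) ** 2))
print(quadratic_penalty(beta, J_en), np.sum(beta ** 2))
print(J_tilde)
```

The printed matrix $J'J$ is tridiagonal with off-diagonal entries equal to $-1$, which is the structure exploited later for the S-Lasso.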
Remark 1. For any $j, k \in \{1, \dots, p\}$, denote by $s_{j,k}$ the sign of the sample correlation between predictor variables $j$ and $k$. Denote also by $w_{j,k} \ge 0$ some weights driven by the correlations between predictors. Given this notation, the Weighted Fusion introduced in [10] corresponds to the case where the $k$-th diagonal term of $J$ equals $w_{k,k}$ and $(J)_{k,j} = (J)_{j,k} = -s_{j,k} w_{j,k}$ for $j \neq k$.
Now we deal with the solution $\hat\beta^{Quad}$ of (2) with the penalty (4) and its computational cost. The following lemma shows that $\hat\beta^{Quad}$ can be expressed as a Lasso solution by expanding the data artificially.
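Lemma 1. Given the dataset $(X, Y)$ and the tuning parameters $(\lambda, \mu)$, define the extended dataset $(\tilde X, \tilde Y)$ and $\tilde\varepsilon$ by
$$ \tilde X = \begin{pmatrix} X \\ \sqrt{n\mu}\, J \end{pmatrix}, \qquad \tilde Y = \begin{pmatrix} Y \\ \mathbf{0} \end{pmatrix}, \qquad \tilde\varepsilon = \begin{pmatrix} \varepsilon \\ -\sqrt{n\mu}\, J \beta^* \end{pmatrix}, $$
where $\mathbf{0}$ is a vector of size $m$ containing only zeros, $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)'$ is the noise vector and $J$ is the $m \times p$ matrix given by the penalty (4). Then, we have $\tilde Y = \tilde X \beta^* + \tilde\varepsilon$, and the estimator $\hat\beta^{Quad}$, defined as the solution of the minimization problem (2) with the penalty given by (4), is also the minimizer of the Lasso criterion
$$ \frac{1}{n} |\tilde Y - \tilde X \beta|_2^2 + \lambda |\beta|_1 . $$
This result is a consequence of simple algebra, since $n^{-1} |\tilde Y - \tilde X \beta|_2^2 = \|Y - X\beta\|_n^2 + \mu\, \beta' J' J \beta$. It motivates the following comments on the estimator $\hat\beta^{Quad}$.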
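The data-augmentation identity behind Lemma 1 can be checked numerically. The sketch below assumes the scaling $\sqrt{n\mu}$ for the appended rows (the scaling for which $n^{-1}\tilde X'\tilde X = \Psi_n + \mu J'J$) and uses a first-order difference matrix for $J$; the function names and the simulated data are hypothetical.

```python
import numpy as np

def augment(X, Y, J, mu):
    # Stack sqrt(n * mu) * J under X and m zeros under Y.
    n = X.shape[0]
    X_tilde = np.vstack([X, np.sqrt(n * mu) * J])
    Y_tilde = np.concatenate([Y, np.zeros(J.shape[0])])
    return X_tilde, Y_tilde

def quad_criterion(beta, X, Y, J, lam, mu):
    # ||Y - X beta||_n^2 + lam * |beta|_1 + mu * beta' J' J beta
    n = X.shape[0]
    fit = np.sum((Y - X @ beta) ** 2) / n
    return fit + lam * np.sum(np.abs(beta)) + mu * np.sum((J @ beta) ** 2)

def lasso_criterion(beta, X_tilde, Y_tilde, n, lam):
    # Lasso criterion on the augmented data, still normalized by the original n
    fit = np.sum((Y_tilde - X_tilde @ beta) ** 2) / n
    return fit + lam * np.sum(np.abs(beta))

rng = np.random.default_rng(0)
n, p, lam, mu = 15, 8, 0.3, 0.7
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)
J = np.diff(np.eye(p), axis=0)      # (p-1) x p first-order differences
X_tilde, Y_tilde = augment(X, Y, J, mu)

beta = rng.standard_normal(p)       # arbitrary test point
lhs = quad_criterion(beta, X, Y, J, lam, mu)
rhs = lasso_criterion(beta, X_tilde, Y_tilde, n, lam)
print(lhs, rhs)
```

Because the two criteria agree at every $\beta$, any Lasso solver applied to the augmented data returns a minimizer of the original $\ell_1 + \ell_2$ problem.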
Remark 2 (Regularization paths). LARS is an iterative algorithm introduced in [12]. A modification of LARS can be used to construct $\hat\beta^{Quad}$. For a fixed $\mu$, it constructs at each step an estimator based on the correlation between the covariates and the current residual. Each step corresponds to a value of $\lambda$. Then, for a fixed $\mu$, we obtain the evolution of the coefficient values of $\hat\beta^{Quad}$ as $\lambda$ varies. This evolution describes the regularization paths of $\hat\beta^{Quad}$, which are piecewise linear (see [28]). This property implies that (again for fixed $\mu$) the problem (2) with the penalty (4) can be solved using the LARS algorithm with the same computational cost as the ordinary least squares (OLS) estimate.
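For readers who want to experiment with the estimator at a single pair $(\lambda, \mu)$, here is a small sketch that solves the augmented Lasso problem by cyclic coordinate descent. Note that this is a swapped-in technique and not the LARS modification discussed in the remark: it returns one solution rather than the whole piecewise-linear path. The function name `quad_lasso_cd`, the number of sweeps and the simulated data are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * max(abs(x) - t, 0.0)

def quad_lasso_cd(X, Y, J, lam, mu, n_sweeps=500):
    # Minimizes (1/n)|Y~ - X~ b|_2^2 + lam |b|_1 on the augmented data.
    n, p = X.shape
    Xa = np.vstack([X, np.sqrt(n * mu) * J])
    Ya = np.concatenate([Y, np.zeros(J.shape[0])])
    col_sq = np.sum(Xa ** 2, axis=0) / n
    beta = np.zeros(p)
    resid = Ya.copy()                      # residual Y~ - X~ beta
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual
            rho = Xa[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid += Xa[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

rng = np.random.default_rng(0)
n, p, lam, mu = 30, 10, 0.1, 0.5
X = rng.standard_normal((n, p))
beta_true = np.array([0.0, 0.0, 1.0, 1.5, 2.0, 1.5, 1.0, 0.0, 0.0, 0.0])
Y = X @ beta_true + 0.1 * rng.standard_normal(n)
J = np.diff(np.eye(p), axis=0)             # first-order differences
beta_hat = quad_lasso_cd(X, Y, J, lam, mu)
print(np.round(beta_hat, 2))
```

The factor `lam / 2.0` in the soft-thresholding step comes from the normalization of the criterion, whose quadratic part is not halved.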

Theoretical results in the high dimensional setting
In this section, we study the performance of the estimator $\hat\beta^{Quad}$ in the high-dimensional case. In particular, we provide a non-asymptotic bound for the squared risk. We also provide a bound for the $\ell_2$ estimation error of $\hat\beta^{Quad}$. Let $\tilde J = J' J$ be the $p \times p$ matrix where $J$ is the matrix appearing in the quadratic penalty (4). Since our main interest is the study of the S-Lasso estimator, we first focus on the case where the matrix $\tilde J$ is sparse. We refer the reader to Section 3.3, where we address several technical points, for example the study of the case where the matrix $\tilde J$ is general. All the results of this section are proved in Section 6. These theoretical contributions rely partly on Lemma 1. Let us finally mention that the tuning parameters $\lambda$ and $\mu$ will actually be chosen depending on the sample size $n$. We emphasize this dependency by adding a subscript $n$ to these parameters.

Sparsity Inequality when J is sparse
Now we establish a Sparsity Inequality (SI) achieved by the estimator βQuad , that is, a bound on the squared risk that takes into account the sparsity of the regression vector β * .More precisely, we prove that the rate of convergence of βQuad is max( , where A * is the sparsity set A * = {j : β * j = 0}.This rate depends not only on the sparsity index |A * | but also on | Jβ * |.In the case of the S-Lasso, this last quantity is related to the smoothness of the vector β * .Let us first present the assumptions needed, and the setup of this contribution.Let η ∈ (0, 1) be given ((1 − η) will be a confidence bound, see Theorem 1).We define the tuning parameter λ n as For now, we leave the calibration of µ n free.We discuss later (see Corollary 1 and Section 3.1.1for example) the choice for this parameter.Our assumption on the Gram matrix Ψ n := n −1 X ′ X involves the symmetric p × p matrix K n defined by Given the expanded dataset defined in Lemma 1, we note that K n = n −1 X ′ X can be seen as an expanded Gram matrix.Let Θ ⊂ {1, . . ., p} be a set of indices.Using this notation, we formulate the following assumption: Assumption B(Θ): Let K n be the matrix given by (8) and let There is a constant φ µn > 0 such that, for any ∆ ∈ R p that satisfies Here are some comments about this assumption: • first of all, Assumption B(Θ) is inspired by the Restricted Eigenvalue (RE) Assumption introduced in [3].The RE assumption is widely used in the literature and requires somehow that the restriction of the matrix K n to the rows and columns in Θ is invertible (when K n is invertible, the condition ( 9) is always satisfied with φ µn at least as large as the smallest eigenvalue of K n ) .We refer to [3,33] for more details on this assumption.The main difference with the assumption we use is that in [3] the authors consider the case where K n = Ψ n , which matches with the Lasso estimator (that is µ n = 0 in our setting).
In the sequel, let φ 0 denote φ µn for µ n = 0, that is, the case of the Lasso estimator; • another difference to [3] is that the set on which the assumption should hold is larger in Assumption B(Θ) than in the RE Assumption.Indeed, in Assumption B(Θ), the considered vectors ∆ should be such that j / ∈Θ |∆ j | ≤ ̺ n j∈Θ ∆ 2 j , whereas in [3] the authors only need to consider vectors ∆ such that j / ∈Θ |∆ j | ≤ cst • j∈Θ |∆ j | (see also [33]).We make this set larger to allow large values of the tuning parameter µ n .We will explain later why this is desirable; • in the case of the Elastic-Net, Θ = A * in Assumption B(Θ).Hence, the assumption above is close to Condition Stabil in [5, page 4] for the Elastic-Net.We will consider precisely the difference between both assumptions in Section 3.1.2.However, let us mention here that in Condition Stabil the condition ( 9) is replaced by for a constant φ CS µn > µ n ; • only small subsets B of indices Θ will be considered in Assumption B(Θ).More precisely, let B ⊂ {1, . . ., p} be a set of indices such including the true sparsity set A * .We will consider a set depending on J and on A * , and the sparser J, the smaller B. For instance, in the case of the Elastic-Net, B = A * , and in the case of the S-Lasso (that we will detail later), the set B is such that |B| ≤ 3|A * |.Thanks to the sparsity of J, we will see that we can assume that there exists a constant c J ≥ 1 such that |B| ≤ c J |A * | (see Sections 3.1.2and 3.1.3).
Theorem 1 below holds for general matrices $J$. However, we emphasize the sparse case here since Assumption B(B) with large sets B is more stringent (with $\phi_{\mu_n}$ close or equal to zero). Hence, in the general case, another assumption presented in Section 3.3.1 may be more attractive. We also mention that Theorem 1 is formulated as generally as possible. We refer to Corollary 1 below for a special case illustrating the superiority of $\hat\beta^{Quad}$ over the Lasso.
Theorem 1 ( J sparse).Let A * be the sparsity set.Let the tuning parameter λ n be defined as in (7).Suppose that Assumption B(B) is satisfied with a set B ⊃ A * such that |B| ≤ c J |A * | for a given constant c J ≥ 1.Then, with probability greater than 1 − η, we have and Theorem 1 states that βQuad achieves a SI which also brings the quantity | J β * | 2 into play.A first glance at the bounds above would suggest that µ n = 0 provides the best rates.However, it is worth noting that φ µn , one of the main terms of the bounds, also depends on µ n and increases with this parameter since J is positive semidefinite.Calibration of µ n captures the tradeoff between slowing down the rate of convergence and being able to address situations where the Lasso fails.For instance, the Smooth-Lasso with a large µ n is devoted to problems with large correlations between successive variables.In Section 3.1.1,we further discuss the importance of a good calibration of µ n and the interest of using βQuad (with µ n different from zero) instead of the Lasso estimator.These considerations lead to the following Corollary 1.It points out that the estimator βQuad is particularly useful when the assumptions on the Gram matrix Ψ n are so restrictive that the Lasso error fails to be well controlled.
Assume furthermore that the Gram matrix $\Psi_n$ is such that $\phi_0 < \lambda_n^2 |A^*|$ and that the extended Gram matrix $K_n$ is such that $\phi_{\mu_n} \ge \mu_n$. Then the bound on the Lasso (obtained by setting $\mu_n = 0$ above) does not guarantee any control on the errors. In contrast, $\hat\beta^{Quad}$ satisfies with probability greater than $1 - \eta$.
The above bounds are even better when $|\tilde J \beta^*|_2$ is small. One illustration of this corollary can be found in the example included in Section 3.1.3. Moreover, we refer to Section 3.3 for other choices of $\mu_n$ which are more suitable when we deal with a general (non-sparse) matrix $\tilde J$.
In our simulation study we focus on the particular choice of $\mu_n$ given in the first part of Corollary 1. However, in real applications, since the parameters $\lambda_n$ and $\mu_n$ depend on the unknown regression vector $\beta^*$, we tune them with the help of a two-dimensional ten-fold cross-validation over a grid.
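The two-dimensional ten-fold cross-validation over a grid can be sketched as follows. This is only an illustration under stated assumptions: the solver is a plain coordinate descent on the augmented data (not the LARS-based procedure), and the grids, the simulated data and all function names are hypothetical.

```python
import numpy as np

def quad_lasso_cd(X, Y, J, lam, mu, n_sweeps=100):
    # Coordinate descent for (1/n)|Y~ - X~ b|_2^2 + lam |b|_1 (augmented data).
    n, p = X.shape
    Xa = np.vstack([X, np.sqrt(n * mu) * J])
    Ya = np.concatenate([Y, np.zeros(J.shape[0])])
    col_sq = np.sum(Xa ** 2, axis=0) / n
    beta, resid = np.zeros(p), Ya.copy()
    for _ in range(n_sweeps):
        for j in range(p):
            rho = Xa[:, j] @ resid / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_sq[j]
            resid += Xa[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def cv_error(X, Y, J, lam, mu, folds):
    # Average held-out squared prediction error over the folds.
    errs = []
    for test_idx in folds:
        train = np.setdiff1d(np.arange(len(Y)), test_idx)
        beta = quad_lasso_cd(X[train], Y[train], J, lam, mu)
        errs.append(np.mean((Y[test_idx] - X[test_idx] @ beta) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(0)
n, p = 50, 12
X = rng.standard_normal((n, p))
beta_true = np.array([0, 0, 1, 1.5, 2, 1.5, 1, 0, 0, 0, 0, 0], dtype=float)
Y = X @ beta_true + 0.5 * rng.standard_normal(n)
J = np.diff(np.eye(p), axis=0)

folds = np.array_split(rng.permutation(n), 10)   # ten folds
lam_grid = [0.01, 0.1, 1.0]                      # hypothetical grids
mu_grid = [0.0, 0.1, 1.0]
cv = np.array([[cv_error(X, Y, J, lam, mu, folds) for mu in mu_grid]
               for lam in lam_grid])
i, j = np.unravel_index(np.argmin(cv), cv.shape)
best_lam, best_mu = lam_grid[i], mu_grid[j]
print(best_lam, best_mu)
```

In practice the grids would be finer and centered on the theoretical calibration, but the structure of the search is the same.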

Discussion around µ n and the rate of convergence
In this paragraph, we highlight the cases when using βQuad is useful in the sense of Theorem 1.We mainly consider two aspects.The first one deals with situations (or conditions on the Gram matrix Ψ n ) where φ µn is much larger than φ 0 , that is, the settings where the introduction of the additional penalty enables the estimator βQuad to consider problems that cannot be treated by the Lasso.The second one is the fact that µ n | J β * | 2 should be dominated by For the first point, and to make things more understandable, let us restrict ourselves to the above prediction error bound (10) and consider the particular case of the Elastic-Net where the matrix J is the identity.Because of the definition of φ µn (in the particular case of the Elastic-Net), we have φ µn ≥ µ n .We now discuss the rates of convergence of the Lasso (with φ 0 ) and the Elastic-Net (with φ µn ) in different situations.We present the cases in an asymptotic setting with n tending to infinity.The results provided in Theorem 1 suggest essentially three regimes: • when φ 0 is a constant: in this case, the rate |A * | log(p)/n is optimal (up to a logarithmic factor; cf.[6,Theorem 5.1]).This rate is reached by the Lasso (set µ n = 0 in the above Theorem 1) and as a consequence the Elastic-Net (and more generally βQuad ) does not help a lot.Indeed, whatever µ n > 0, the value of φ µn does not significantly vary from φ 0 (although φ µn > φ 0 ); • when φ 0 depends on n but with µ n ≤ φ 0 < 1: in this case, φ µn (and φ 0 as well) is an influencing term that should be taken into account in the rate of convergence.The rate of the Lasso is worse than |A * | log(p)/n.But, since µ n < φ 0 , the Elastic-Net does not cause a big improvement in this case neither; • when φ 0 depends on n and µ n > φ 0 : clearly here, φ µn > φ 0 .Then when φ 0 is small (or even very small), the rate of convergence of the Lasso is bad (or even the Lasso error is not controlled when φ 0 < λ 2 n |A * |), whereas the Elastic-Net is 
guarantied to reach the worst case rate φ −1 µn |A * | log(p)/n (cf.Corollary 1 for a bound independent on the second term in the LHS of (10)).This can lead to a big improvement.For instance, Section 3.1.3gives an illustrating example pointing out the advantage of using the Smooth-Lasso estimator.
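The effect of the quadratic penalty on the Gram matrix in the Elastic-Net case can be seen on a small simulated example with $p > n$. The sketch below uses the smallest eigenvalue as a crude proxy: as noted above, when $K_n$ is invertible the constant $\phi_{\mu_n}$ is at least as large as the smallest eigenvalue of $K_n$, so this gives a lower bound only. The dimensions and the value of $\mu_n$ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, mu = 20, 50, 0.1                  # p > n, so Psi_n is singular
X = rng.standard_normal((n, p))

Psi = X.T @ X / n                       # Gram matrix (Lasso case, mu_n = 0)
K_en = Psi + mu * np.eye(p)             # expanded Gram matrix, Elastic-Net

min_eig_lasso = np.linalg.eigvalsh(Psi)[0]
min_eig_en = np.linalg.eigvalsh(K_en)[0]
print(min_eig_lasso, min_eig_en)
```

The smallest eigenvalue of $\Psi_n$ is numerically zero, while that of $\Psi_n + \mu_n I_p$ is bounded below by $\mu_n$, in line with the inequality $\phi_{\mu_n} \ge \mu_n$ used for the Elastic-Net.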
The above remarks recommend large values of µ n due to the fact that φ µn grows with µ n .However the RHS of (10) depends on µ n also through µ n | J β * | 2 .Then one may choose the largest µ n such that the second term in the RHS of (10) remains reasonable compared to the first one.That is the choice of µ n should make a tradeoff between increasing φ µn and increasing µ n | J β * | 2 in the bound.To make things clearer, let us focus on the prediction error (the same reasoning is true for the other errors).The rate of convergence is Then, the term µ n | Jβ * | 2 induces an alteration on the rate of convergence when In other words, the rate of convergence is worse when we add the quadratic penalty unless if All these explanations encourage the compromise stated in Corollary 1 above for the calibration of µ n .In the next two paragraphs we provide a more detailed study in the special cases of the Elastic-Net and the S-Lasso estimators.

Elastic-Net
The Elastic-Net corresponds to the case where $J$ equals the identity matrix. Then $B = A^*$ in the above theorem and corollary. The theoretical performance of the Elastic-Net has already been considered, for example in [5,15]. In [15], the authors considered a version of the Irrepresentable Condition to establish their consistency results. This necessary and (almost) sufficient assumption for the variable selection task is harder to interpret than ours. The results in the present paper about the Elastic-Net (and particularly those in Section 3.3.1) are quite close to those in [5]. A comparison between the results obtained here and those stated in [5] is postponed to Section 3.3.1.
When compared to the Lasso, we essentially note two differences: first, as mentioned before Theorem 1, the Lasso brings into play a set of linear inequalities (that is, vectors [3,33]), whereas we need in Theorem 1 a bigger set induced by a quadratic set of inequalities (that is, ∆ such that Even though this difference is small, let us mention that we will establish in Section 3.3.1 theoretical guaranties which also require the same linear set as in the Lasso case; second, the main difference pertains to the values of φ µn and φ 0 .Since φ µn > φ 0 , the Elastic-Net is useful in situations that preclude the use of the Lasso because φ 0 is close to zero.This was discussed in Section 3.1.1.For instance, when the correlations are high between variables, the Lasso fails, whereas the Elastic-Net achieves satisfying performance (see Corollary 1).
Finally, we observe that in the case of the Elastic-Net, Equation (11) is nothing but a SI on the $\ell_2$ estimation error $|\beta^* - \hat\beta^{Quad}|_2^2$. Note, however, that the rate $\lambda_n |A^*|$, when $\mu_n$ is defined as in Corollary 1, is not optimal (it can be sharper under more restrictive assumptions) but has the advantage of only requiring Assumption B($A^*$). By imposing Assumption B(B) with B larger than $A^*$, a better rate of convergence can be reached (see Proposition 1). We refer to [35, Theorem 1] for lower bounds on the $\ell_q$ estimation error of order $|A^*|^{1/q} \sqrt{\log(1 + p/|A^*|)/n}$. See also [25,26].

Smooth-Lasso
The S-Lasso corresponds to the case where J = J ′ J with J given by ( 5).This estimator can deal with problems where the regression vector is expected to be α-smooth in the sense of Definition 2.1.As a consequence, we have the worst case relation the constant 7 comes from some rough computations and is not accurate).Note also that in this case Assumption B(Θ) is satisfied with a set Θ = B whose size is less than 3|A * |.This set can be expressed as and Theorem 1 holds with c J = 3.Moreover, Equation ( 11) can be seen as a control on the 'smoothness error' p j=2 (δ j − δ j−1 ) 2 , where δ j is the components difference β * j − βQuad j .
The S-Lasso is designed to provide a smooth and sparse solution. This is true whatever the correlations between variables. However, it is interesting to remark that the smoothness interacts closely with the correlations between successive variables. Indeed, when we deal with the S-Lasso estimator, the matrix $\tilde J$ is tridiagonal with its off-diagonal terms equal to $-1$. If we do not consider the diagonal terms, we remark that $\Psi_n$ and $K_n$ differ only in the terms on the second diagonals (that is, $(K_n)_{j-1,j} \neq (\Psi_n)_{j-1,j}$ for $j = 2, \dots, p$ as soon as $\mu_n \neq 0$). Terms in the second diagonals of $\Psi_n$ correspond to correlations between successive covariates.
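The relation between $\Psi_n$ and $K_n$ in the S-Lasso case can be made explicit on a toy example. The sketch below assumes the expanded Gram matrix $K_n = \Psi_n + \mu_n J'J$ (that is, $n^{-1}\tilde X'\tilde X$ for the extended design of Lemma 1), a $(p-1) \times p$ first-order difference matrix $J$, and a hypothetical tridiagonal $\Psi_n$ with correlation `eps` between successive covariates.

```python
import numpy as np

p, eps, mu = 6, 0.4, 0.4

# Hypothetical normalized Gram matrix with correlation eps between neighbors
Psi = np.eye(p) + eps * (np.eye(p, k=1) + np.eye(p, k=-1))

J = np.diff(np.eye(p), axis=0)      # first-order differences
J_tilde = J.T @ J                   # tridiagonal, off-diagonal terms -1
K = Psi + mu * J_tilde              # expanded Gram matrix for the S-Lasso

print(np.diag(Psi, 1))              # second diagonal of Psi_n: eps
print(np.diag(K, 1))                # second diagonal of K_n: eps - mu
print(np.diag(K))                   # diagonal of K_n: 1 + mu * (1, 2, ..., 2, 1)
```

With `mu` equal to `eps` the second diagonals of $K_n$ vanish in this toy setting, which gives some intuition for why a suitable $\mu_n$ can compensate for correlations between successive covariates.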
When high correlations exist between successive covariates, a suitable choice of $\mu_n$ fulfills Assumption B(B). Hence, the S-Lasso estimator is particularly useful in situations where we expect that the variables are ranked, such that not only is the regression vector 'smooth', but successive covariates are also highly correlated. Indeed, on the one hand, Assumption B(B) is a weaker assumption for 'smooth' regression vectors. On the other hand, this 'smoothness' makes the prediction and the estimation errors sharper (as $\phi_{\mu_n}$ depends on $|J\beta^*|_2$).
In the next paragraph, we present an illustrative example of Corollary 1 (or Theorem 1) where we show the importance of using the Smooth-Lasso in certain situations where the Lasso and the Elastic-Net do not provide good control on the different errors. In particular, we present a case where correlations between variables exist (and where the Lasso is not suitable). Moreover, since the influence of the quadratic penalty in the definition of $\hat\beta^{Quad}$ decreases when $|\tilde J \beta^*|_2$ is large (see the definition of $\mu_n$ in Corollary 1), we consider a smooth regression vector with large singular coefficient values such that $|\tilde J \beta^*|_2$ is small when $\tilde J$ is the matrix corresponding to the Smooth-Lasso, and large when $\tilde J$ is the identity matrix associated with the Elastic-Net. Due to this difference in the value of $|\tilde J \beta^*|_2$, the Smooth-Lasso outperforms the Elastic-Net.
Example.Let J be the matrix defined on (5).Assume that n/4 is an integer.First of all, let us define a smooth regression vector β * with n/2 non-zero components such that This regression vector is chosen piecewise linear (a particular case of smoothness) to clarify the idea and for simplicity of computations.The vector β * is such that Then, we can set the smoothness parameter α = 4/ √ n in Definition 2.1.
Let us now consider the design matrix Ψ n .Let ǫ > 0 be a real number.Let Ψ n be a tridiagonal Gram matrix with diagonal elements equal to 1 (that is, normalized) and such that Ψ n j,j−1 = Ψ n k,k+1 = ǫ for j = 2, . . ., p and k = 1, . . ., p − 1.In such a case, the spectrum of the Gram matrix lies in [1 − 2ǫ, 1 + 2ǫ].Then, φ 0 ≥ 1 − 2ǫ (the φ µn corresponding to the Lasso estimate, that is, when µ n = 0).However, we do not know how far φ 0 is from 1 − 2ǫ so that we can only say the the prediction error of the Lasso βL is such that with high probability with the choice ǫ = 1 2 − log(p/η) 2n .Actually, the above bound does not provide any control on the prediction error of the Lasso estimator.
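The claim on the spectrum of the tridiagonal Gram matrix can be verified numerically. The sketch below uses the classical closed form for the eigenvalues of a symmetric tridiagonal Toeplitz matrix, $1 + 2\epsilon \cos(k\pi/(p+1))$ for $k = 1, \dots, p$; the values of `p` and `eps` are hypothetical and do not reproduce the specific choice of $\epsilon$ made in the example.

```python
import numpy as np

def tridiagonal_gram(p, eps):
    # Diagonal 1 (normalized covariates), first off-diagonals eps
    return np.eye(p) + eps * (np.eye(p, k=1) + np.eye(p, k=-1))

p, eps = 50, 0.45
Psi = tridiagonal_gram(p, eps)

eigs = np.linalg.eigvalsh(Psi)                       # ascending order
k = np.arange(1, p + 1)
closed_form = np.sort(1 + 2 * eps * np.cos(k * np.pi / (p + 1)))

print(eigs.min(), 1 - 2 * eps)   # smallest eigenvalue vs lower end of the interval
print(eigs.max(), 1 + 2 * eps)   # largest eigenvalue vs upper end of the interval
```

As $\epsilon$ approaches $1/2$ the lower end $1 - 2\epsilon$ of the interval approaches zero, which is why the resulting bound on $\phi_0$ carries little information for the Lasso in this example.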
Let us now focus on the Elastic-Net estimate βEN .According to Assumption B(A * ), we have to consider the spectrum of the matrix K EN n = Ψ n + µ n I p , where I p is the identity matrix in R p .This spectrum lies in Given the values of ǫ and of |β * | 2 , we get the control where we used the definition of µ n provided in Corollary 1.Let us mention that choosing a different value for µ n does not imply an improvement in the bound.Hence, in this case the Elastic-Net estimator does not control the prediction error neither.
Next, in the case of the S-Lasso βSL the eigenvalues of the matrix We refer to [38] for more details on the eigenvalues of tridiagonal matrices.This interval is of the same order as the one of the Elastic-Net.By the sequel, we have the following control for the S-Lasso estimator (when ǫ > µ n , otherwise the control is even better) where here again, we considered the value of µ n given in Corollary 1.In this 'smooth context', the S-Lasso is obviously the best method (compared with the Lasso and the Elastic-Net).Note that this last rate is better than the minimax rate under the sparsity assumption ).This is due to the fact that we also imposed a smoothness assumption which is nicely exploited by the S-Lasso estimator.Thus, the above minimax rates cannot be applied anymore.
Let us conclude with the following remarks. In the above situation, we assume that the regression vector is smooth and also that the successive covariates are correlated. This is the best context for the Smooth-Lasso. In the case where the regression vector is smooth, but we do not have a particular structure in the Gram matrix (say the variables are independent and φ 0 is a fixed positive constant), the Lasso and the Elastic-Net (for instance with the value of µ n given in Corollary 1) reach the rate σ 2 log(p)|A * |/n. Compared to the bounds for the Elastic-Net, there are improved bounds for the S-Lasso for suitable values of µ n (note that µ n depends on α). Here again, if we consider the same regression vector β * as in the above example, the rate is of order O σ √ Consequently, we get better performance than the Elastic-Net and the Lasso. Finally, when the regression vector is not smooth (say, |β * | 2 and |Jβ * | 2 are constants) and the design matrix is for instance as in the above example, the Lasso is not suitable. In this case, both the Elastic-Net and the S-Lasso have comparable performance and their bound is of order O( log(p)|A * |/n), which is much better than the bounds for the Lasso (even if not optimal).

The above discussion dealt with the prediction and the estimation performance. In the next section we consider the variable selection power of β̂Quad.

Variable selection
Let us first mention that the estimator β̂Quad, with the Smooth-Lasso as a particular case, has not been introduced for such an objective. Indeed, it is designed to deal with the estimation criterion or, more precisely, with structural questions. However, in some problems β̂Quad may induce better variable selection properties than the Lasso.
A large amount of work has been done on the topic of variable selection for Lasso-type methods. One important observation is that one has to make a compromise between not identifying a low signal level (that is, small coefficients β * j , j ∈ A * , in absolute value) and imposing a strong restriction on the Gram matrix Ψ n, which sometimes seems unrealistic. Moreover, the question of the identifiability of β * also has to be considered. Since we tackle problems where we expect correlations between variables, we take the middle path, that is, we impose less restrictive assumptions on the Gram matrix that permit us to recover a reasonably low signal level. For this purpose, we first provide a bound on the sup-norm |β * − β̂Quad | ∞ , based on a control on the ℓ 2 estimation error.
To this end, we use Assumption B(Θ) on the Gram matrix. However, the set Θ should be larger than the one required in Theorem 1. To define it, let us denote by C the index-set of the m largest components in absolute value of β * − β̂Quad outside B. Here B is the set introduced in Theorem 1. In this setting m is an integer such that m + |B| < p.
Assumption B ′ (B ∪ C): Let K n be the matrix given by (8) and let There is a constant φ µn > 0 such that, for any ∆ ∈ R p that satisfies

The above assumption differs from Assumption B(Θ) only in that the vectors ∆ ∈ R p are restricted to a different set than the one used in Condition (12). Obviously, Assumption B ′ (B ∪ C) implies Assumption B(B).
Proposition 1. Let us consider the same setting as in Theorem 1 with the only difference ,

One can exploit the control provided in Proposition 1 to construct a hard-thresholded version of β̂Quad which is consistent in variable selection. Such a construction has already been considered in several papers for the Lasso estimate. The methodology closest to ours is the one developed in [23]. Consider and zero otherwise, where c is given in Proposition 1. This estimator consists of β̂Quad with its small coefficients set to zero. We thereby enforce the selection property of β̂Quad. Variable selection consistency of this estimator is established under one more restriction on the regression vector, given now.
Assumption C: The true regression vector β * is such that , ) is from Proposition 1, and φ µn is the term appearing in Assumption B ′ (B ∪ C).
Here again, we observe how important the quantity φ µn is. We want it to be as large as possible.
This assumption bounds from below the smallest regression coefficient in β *. This is a common assumption to provide sign consistency in the high dimensional case. See for example [4,18,23,34,39,40]. We refer to [18] for a longer discussion on how these works are related in terms of restrictions related to the threshold or the assumption on the Gram matrix. Now, we can state the following sign consistency result. Note that all the remarks established in Sections 3.1.2 and 3.1.3 remain valid also for this variable selection result.
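The hard-thresholding construction discussed in this section can be sketched as follows. The threshold of the paper is not reproduced here: `tau` is a hypothetical user-supplied level, the strict inequality is an assumption, and the function names are illustrative.

```python
import numpy as np

def hard_threshold(beta_hat, tau):
    """Keep the coefficients whose absolute value exceeds tau; set the others to zero."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    return np.where(np.abs(beta_hat) > tau, beta_hat, 0.0)

def sign_pattern(beta):
    """Vector of signs in {-1, 0, 1}, the object compared in sign-consistency results."""
    return np.sign(beta).astype(int)

# Toy estimate with two small spurious coefficients (made-up numbers).
beta_hat = np.array([2.1, -0.05, 0.0, -1.4, 0.08])
beta_thresholded = hard_threshold(beta_hat, tau=0.1)
print(beta_thresholded)
print(sign_pattern(beta_thresholded))
```

The thresholded vector keeps the large coefficients untouched, so the estimation error on the retained components is unchanged while the small, possibly spurious, components are removed.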

Technical advances
We devote this paragraph to several technical considerations. First, we consider the case of a general matrix J. Then, we establish the variable selection consistency of a non-thresholded version of β̂Quad. Finally, we provide a relaxation of the assumption on the noise. The reader who is not interested in these studies can skip them without consequences for the readability of the paper.

General matrices J
Theorem 1 is particularly interesting when J̃ = J ′ J is sparse. In that statement, Assumption B(B) was needed with a set B ⊃ A * which depends on J. More precisely, B contains the indices of components which interfere in the sparse product β * ′ Ju for a given u ∈ R p (see the proof for more details). This set is not too large compared to A * when we consider the case where J is sparse. This way to solve the problem allows us to choose . In what follows, we consider p × p matrices J (including the sparse case) for which we only need an (adapted) RE Assumption. Contrary to the results provided in Section 3.1, µ n is here, for technical reasons, no longer a free parameter and is fixed in advance (see (13) below). This value is smaller than the one given in Corollary 1.
Let us first establish the assumptions needed and the setup of this contribution. Let η ∈ (0, 1). We define the regularization parameters λ n and µ n in the following way: We now state the adapted RE Assumption, which differs from the usual one introduced in [3] only by the matrix to which we apply the assumption (K n instead of Ψ n ):

Assumption RE: There is a constant φ µn > 0 such that, for any ∆ ∈ R p that satisfies This assumption involves a set of linear inequalities. Then, we clearly have φ µn ≥ φ 0 (where φ 0 is the φ µn corresponding to the Lasso, that is, when µ n = 0). With this setting, we obtain the following result for a general matrix J.
Theorem 3 (General J). Let A * be the sparsity set and let the tuning parameters (λ n , µ n ) be defined as in (13). If Assumption RE holds, then with probability greater than 1 − η we have

Similar bounds were provided for the Lasso estimator in [3]. Let us mention that the constants are not optimal. We focused our attention on the dependency on n (and thus on p and |A * |). It turns out that our results are near optimal. For instance, for the ℓ 2 risk, the S-Lasso estimator nearly reaches the optimal rate (|A * |/n) log(p/|A * | + 1), up to a logarithmic factor (see [6, Theorem 5.1]). Moreover, Theorem 3 states a control on an error which is linked to the expected prior information which suggested the use of the estimator β̂Quad.
The results provided in Theorem 1, and more precisely in Corollary 1, differ from those established in Theorem 3 in a few points. First, the value of µ n is larger in the sparse case. Indeed, in Corollary 1 and Theorem 3 respectively. The former value can be much larger for some regression vectors β *. Second, these values of µ n have an influence on the error bounds through φ µn. As a consequence, the bounds in Corollary 1 are better than those in Theorem 3. Third, apart from the considerations on the quantity φ µn, we observe a modification of the bound on (β * − β̂Quad ) ′ J(β * − β̂Quad ). Indeed, in Theorem 1, it involves the term appears, which is obviously larger. We then have a better control on this error using the sparsity of the matrix J. Finally, we remark that the constant factor in the definition of the tuning parameter λ n in Corollary 1 is smaller than the corresponding constant in Theorem 3.

One should however mention that for a fixed φ µn (that is, a fixed µ n ), the set of feasible vectors ∆ in Assumption RE is larger than the one in Assumption B(B). In this sense, Assumption RE is less restrictive than Assumption B(B). Nevertheless, this difference does not necessarily mean that the φ µn resulting from Assumption RE is larger than the one arising from Assumption B(B). Indeed, when ∆ is in the feasible set of both assumptions, φ µn is the same in both conditions.
A result close to Theorem 3 has been established by Bunea in [5] in the particular case of the Elastic-Net. It is worth briefly pointing out here the differences and the similarities between our work and [5] when we deal with the Elastic-Net. For any vector b ∈ R p and subset Θ ⊂ {1, . . ., p}, let b Θ be the vector in R p such that (b Θ ) j = b j if j ∈ Θ and zero otherwise. In [5], Bunea provided a Sparsity Inequality (SI) close to the one established in Theorem 3. This inequality holds under the Condition Stabil defined in [5, page 4] by where φ CS µn > µ n and, similarly to vectors in Assumption RE, ∆ is such that Σ j∉A * |∆ j | ≤ 4 Σ j∈A * |∆ j |. The above equation is the analogue of the condition (9) in Assumption RE, and to make the comparison easier, let us write (9) as follows Since the bounds in the Sparsity Inequalities stated in [5] and in the present paper are the same up to constants, it seems that the only difference is the value of φ µn. Indeed, according to Inequality (14), φ µn can be much larger than φ CS µn (given in Condition Stabil), as we subtract the term 2 in (14), which can be large thanks to µ n (we expect however |∆ (A * ) c | 2 2 to be small). It is worth adding that the Elastic-Net corresponds to a case where the matrix J is sparse (as J is the identity). Therefore, it is more convenient to use the setting of Section 3.1 since the value of µ n is larger there.
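The restriction b Θ and the cone constraint on ∆ used in this comparison are easy to state in code. The sketch below is illustrative only: the function names are assumptions, indices are 0-based (the paper uses 1-based indices), and the factor 4 is the one appearing in the constraint quoted above.

```python
import numpy as np

def restrict(b, theta):
    """Return b_Theta: the entries of b on the index set theta, zeros elsewhere."""
    b = np.asarray(b, dtype=float)
    out = np.zeros_like(b)
    idx = np.asarray(sorted(theta), dtype=int)
    out[idx] = b[idx]
    return out

def in_cone(delta, support, factor=4.0):
    """Check sum_{j not in support} |delta_j| <= factor * sum_{j in support} |delta_j|."""
    delta = np.asarray(delta, dtype=float)
    on_support = restrict(delta, support)
    off_support = delta - on_support
    return bool(np.abs(off_support).sum() <= factor * np.abs(on_support).sum())

delta_ok = np.array([1.0, -2.0, 0.5, 0.1])   # mass concentrated on the support {0, 1}
delta_bad = np.array([0.1, 0.0, 5.0, 5.0])   # mass concentrated off the support {0}
print(in_cone(delta_ok, {0, 1}), in_cone(delta_bad, {0}))
```

Only vectors passing this check need to satisfy the eigenvalue-type lower bound, which is what makes such assumptions weaker than a bound over all of R p.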

Non-thresholded variable selector
In Section 3.2, we established variable selection consistency for a thresholded version of β̂Quad when J is sparse. In this section, we state a comparable result for a non-thresholded version. Indeed, paying the price of a more restrictive assumption, we provide in Theorem 4 below a variable selection consistency result directly for β̂Quad when using a different calibration of the tuning parameters. This result can be applied to general matrices J. The approach to prove the result is also different. We first provide a bound on the sup-norm This can be done easily using the theorem stated in Section 3.3.1 for the ℓ 1 estimation error |β * − β̂Quad | 1. However, this would imply that only 'high' levels of the signal can be recovered, that is, coefficients where φ µn is the constant appearing in Assumption RE. Moreover, if min

Proposition 2 is a trivial consequence of Theorem 3. A short proof is given in the Appendix. This proposition emphasizes directly that under Assumption RE all non-zero components of β * are detected by β̂Quad with high probability. Actually, in the setting of Proposition 2, β̂Quad may contain too many non-zero components. More restrictions are needed in order to ensure the variable selection consistency of β̂Quad. Here is an additional assumption on the Gram matrix which controls the correlations between the truly relevant variables and those which are not.
Assumption D: We assume that where t is a positive term smaller than 64 . This assumption is quite close to the Mutual Coherence assumption, which involves the Gram matrix Ψ n instead of K n. In addition, the Mutual Coherence assumption restricts correlations between all covariates.

To prove the first claim, we use some arguments from [5]. The second point is a consequence of the first one and of Proposition 2. There are essentially two differences between the settings in Theorem 4 and Proposition 2. First, we need for this last result a more restrictive assumption on the correlations between variables. However, this restriction is only between relevant variables and irrelevant covariates. This is 'quite' a reasonable assumption to identify the relevant variables, that is, the non-zero components of the vector β *. Second, the minimal value of λ n is larger in this last theorem. This suggests that we need a larger value of this tuning parameter to set to zero the irrelevant components. Note that we established the variable selection consistency of β̂Quad but with a value of the tuning parameter µ n smaller than the one used in the thresholded version.
Remark 3. The results of Theorem 4 can also be obtained under the more restrictive Mutual Coherence assumption: max j∈A * max k∈{1,...,p} , where t is a small positive constant. Here, even the correlations between relevant variables are restricted, but this restriction makes it possible to recover even smaller signals. That is, we can detect coefficients of β * such that |β * j | ≥ cst • log(p)/n. See for instance [5] in the case of the Elastic-Net.

Non Gaussian noise with finite variance
Most of the results established for Lasso-type methods assume Gaussian or sub-Gaussian type noise [3,5,15,34,39]. Noise with exponential moment is studied in [4,23]. Only a few references consider other types of noise. Noise with moment of order 2k, where k ≥ 1 is an integer, is considered in [40], whereas in the paper [18], the author presents the case where the noise admits zero mean and finite variance. It is in the spirit of this last reference that we consider this relaxation on the noise. Regarding the Elastic-Net, noise with moment of order 2k + δ, where k ≥ 1 is an integer and δ is a positive constant, is considered in [43], but the authors treated only the case where p = O(n).
We assume that the noise random variables ε 1 , . . ., ε n are independent and admit zero mean and finite variance. That is, E ε i = 0 and E ε 2 i ≤ σ 2 for i = 1, . . ., n with σ 2 < ∞. In this generalization we also use a revisited version of Nemirovski's Inequality established in [11]. One more restriction is needed on the sample points.
Assumption E: There exists a positive constant L < ∞ such that

Theorem 5 below extends the results in Corollary 1 of Section 3 to the non-Gaussian noise case. However, one is able to generalize all the results of that section in the same way.
Theorem 5. Let us consider the linear regression model (1) where the ε i 's are independent random variables with zero mean such that E ε 2 i ≤ σ 2 for i = 1, . . ., n with σ 2 < ∞. Denote by K Nem the quantity K Nem = inf q∈[2,∞) (q − 1) p 2/q , and let . Assume also that Assumption B(B) (where ϱ n = 6 |A * |) and Assumption E hold. Then, with probability greater than 1 − η we have

Let us mention that 2e log(p) − 3e < K Nem < 2e log(p) − e. As a consequence, the rate of convergence in Theorem 5 is of the same order as in Corollary 1. However, the constant factor seems to be worse in the non-Gaussian case since it brings into play the constant L, which can be large. This is the price to pay to adapt to the non-Gaussian noise.

Remark 4. In the above theorem, η is fixed. However, one can set η depending on p (or on n) in such a way that it decreases to zero as p → ∞ (or n → ∞). It is interesting to note that in this case, we lose a small power of log(p) (or log(n)) in the rate of convergence when we consider non-Gaussian noise compared to the Gaussian case.
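The two-sided bound on K Nem can be illustrated numerically. The sketch below approximates the infimum by a grid search over q; the function name `k_nem`, the grid limits and the tested values of p are assumptions chosen for illustration (the upper bound is what one obtains by plugging in q = 2 log(p), which requires log(p) ≥ 1).

```python
import numpy as np

def k_nem(p, q_max=200.0, num=200_000):
    """Grid approximation of inf over q in [2, q_max] of (q - 1) * p**(2 / q)."""
    q = np.linspace(2.0, q_max, num)
    return float(np.min((q - 1.0) * p ** (2.0 / q)))

for p in (10, 100, 1000, 10_000):
    lower = 2 * np.e * np.log(p) - 3 * np.e
    upper = 2 * np.e * np.log(p) - np.e
    print(p, lower, k_nem(p), upper)
```

The approximate infimum sits close to the upper bound, which is consistent with K Nem growing like 2e log(p), that is, at the same logarithmic order as in the Gaussian case.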
Using similar reasoning as in Theorem 5 (cf. proof of Theorem 5), there is no major difficulty in extending the variable selection results established in Section 3.2 with Gaussian noise to the case where the noise is defined as above. This can be done using Lemma 3 instead of Lemma 2 of Section 6 in all the proofs.

Experimental Results
In this section, we present the experimental performance of the estimator β̂Quad. In particular, we focus on two special cases: the Elastic-Net and the S-Lasso, defined respectively with the penalties pen_EN(β) = λ|β|_1 + µ|β|_2^2 and pen_SL(β) = λ|β|_1 + µ Σ_{j=2}^{p} (β_j − β_{j−1})^2. The Elastic-Net is useful when high correlations between variables appear, whereas the S-Lasso is devoted to problems where the regression vector β * is 'smooth' (small variations in the values of the successive components of β * ). We are essentially interested in the performance of these estimators w.r.t. their estimation accuracy, that is, in terms of the estimation error |β̂ − β * | 2 , when β * is known (simulated data). Indeed, the introduction of β̂Quad is motivated by a priori knowledge on the structure of the parameter β * , or on the correlation between variables, and the purpose here is to see how this information can be taken into account to improve the reconstruction of the vector β *. As benchmarks, we use the Lasso and the Fused-Lasso estimators, since the first is the reference method and the second is close in spirit to the S-Lasso estimator. Indeed, the Fused-Lasso produces solutions with equal successive components ('piecewise constant') [30]. Note also that in the pioneering paper of the Elastic-Net, a 'corrected' version of this estimator is proposed [42]. There is as yet no theoretical support for this method. Moreover, it outperforms the 'non-corrected' Elastic-Net (this 'non-corrected' Elastic-Net is denoted by naive in [42]) in only a very few of the situations we consider in this paper. We omitted the results for these 'corrected' versions to avoid digressions. Except for the Fused-Lasso solution, all of the Lasso, the S-Lasso and the Elastic-Net solutions can be computed with the LARS algorithm (cf. Lemma 1). However, we will not use the LARS algorithm in this study. In order to be fair to all the methods, we used the same algorithm for all of them. We use an algorithm provided 
by J. Mairal 2 which is an implementation of a general algorithm given in [24]. In all our experiments, the tuning parameters are chosen based on the 10 fold cross validation criterion (for the Fused-Lasso, the Elastic-Net and the S-Lasso, the cross validation is performed on a two-dimensional grid), but we also display the results obtained based on the theoretical values. Note that for the Fused-Lasso, we consider the same theoretical values of the tuning parameters as for the S-Lasso, as they are both motivated by similar applications (this choice seems arbitrary, but to our knowledge no precise study has been made for the Fused-Lasso in the context we consider). On the other hand, both the Elastic-Net and the S-Lasso involve a sparse matrix J in the definition of the estimator β̂Quad. Then, the theoretical values of the tuning parameters are λ = 2 √ 2σ log(p)/n and µ = λ √ A * /2| J β * | 2 , in accordance with Corollary 1 and Proposition 1. These quantities depend on unknown parameters. They can be used only in the simulation study; otherwise one needs to estimate | Jβ * | 2. The different methods are applied to several simulation examples. They have also been applied to a pseudo-real dataset generated from the riboflavin dataset.
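Since both estimators can be computed with a Lasso program applied to expanded data (cf. Lemma 1), the following sketch checks the underlying identity numerically. It is not the implementation used in the experiments: it assumes a squared loss normalized by 1/n, a row block sqrt(n µ) J appended to the design, and J equal to the identity (Elastic-Net) or the first-difference matrix (S-Lasso). All function names and numerical values are illustrative.

```python
import numpy as np

def pen_en(beta, lam, mu):
    """Elastic-Net penalty: lam * |beta|_1 + mu * |beta|_2^2."""
    return lam * np.abs(beta).sum() + mu * np.sum(beta ** 2)

def pen_sl(beta, lam, mu):
    """S-Lasso penalty: lam * |beta|_1 + mu * sum_j (beta_j - beta_{j-1})^2."""
    return lam * np.abs(beta).sum() + mu * np.sum(np.diff(beta) ** 2)

def augment(X, y, mu, J):
    """Append sqrt(n * mu) * J to X and zeros to y (assumed 1/n loss scaling)."""
    n = X.shape[0]
    X_aug = np.vstack([X, np.sqrt(n * mu) * J])
    y_aug = np.concatenate([y, np.zeros(J.shape[0])])
    return X_aug, y_aug

def lasso_objective(X, y, beta, lam, n):
    """Lasso criterion with the squared loss divided by the ORIGINAL sample size n."""
    return np.sum((y - X @ beta) ** 2) / n + lam * np.abs(beta).sum()

rng = np.random.default_rng(0)
n, p, lam, mu = 20, 8, 0.3, 0.7  # made-up sizes and tuning parameters
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)
loss = np.sum((y - X @ beta) ** 2) / n

X_sl, y_sl = augment(X, y, mu, np.diff(np.eye(p), axis=0))  # S-Lasso
X_en, y_en = augment(X, y, mu, np.eye(p))                   # Elastic-Net

print(np.isclose(lasso_objective(X_sl, y_sl, beta, lam, n), loss + pen_sl(beta, lam, mu)))
print(np.isclose(lasso_objective(X_en, y_en, beta, lam, n), loss + pen_en(beta, lam, mu)))
```

Because the two criteria coincide for every β, any Lasso solver run on the expanded data returns a minimizer of the penalized criterion, at the price of a design with more rows; with a different loss normalization the scaling factor of J changes accordingly.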

Synthetic data
There are several parameters: the dimension p, the sample size n and the level of noise σ. They will be specified in the experimental settings (that is, in the different tables and figures captions). The first example is classical and was introduced in the original paper of the Lasso [29]. The second simulation, where we are interested in observing the performance of the procedures when groups of variables appear, comes from [42]. The last two studies aim at determining the behavior of the methods when the regression vector is 'smooth'.

Example (b) [p/n/σ]: Groups. We have β * j = 3 for j ∈ {1, . . ., 15} and zero otherwise. We construct three groups of correlated variables: Ψ j,j = 1 for every j ∈ {1, . . ., p}; for j ≠ k, Ψ j,k ≈ 1 (actually Ψ j,k = 1/(1 + 0.01), due to an extra noise variable) when (j, k) belongs to {1, . . ., 5}², {6, . . ., 10}² or {11, . . ., 15}², and zero otherwise.
Except when p = 500, where we ran only 100 replications, we based all the experiments on 500 replications.
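As an illustration of Example (b), here is one possible way to simulate such data. It is a sketch under assumptions: each group is built from a shared standard Gaussian factor plus independent noise of variance 0.01 (the grouping device of [42]), columns are rescaled to unit variance so that within-group correlations equal 1/(1 + 0.01), and the remaining covariates are independent standard Gaussians. The function name and signature are hypothetical.

```python
import numpy as np

def example_b(n, p, sigma, rng):
    """Simulate (X, y, beta_star) with three groups of five highly correlated covariates."""
    if p < 15:
        raise ValueError("Example (b) needs at least 15 covariates")
    beta_star = np.zeros(p)
    beta_star[:15] = 3.0
    X = rng.standard_normal((n, p))  # covariates 16..p: independent noise variables
    for g in range(3):
        z = rng.standard_normal(n)   # latent factor shared by the group
        cols = slice(5 * g, 5 * g + 5)
        # factor + noise of variance 0.01, rescaled to unit variance
        X[:, cols] = (z[:, None] + 0.1 * rng.standard_normal((n, 5))) / np.sqrt(1.01)
    y = X @ beta_star + sigma * rng.standard_normal(n)
    return X, y, beta_star

X, y, beta_star = example_b(n=50, p=40, sigma=3.0, rng=np.random.default_rng(0))
print(X.shape, y.shape, int(np.count_nonzero(beta_star)))
```

With this construction the population Gram matrix has the block structure described above; the empirical correlations only approach it as n grows.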
Results. The performance of the estimator β̂ (which can be the Lasso, the S-Lasso, the Elastic-Net or the Fused-Lasso) in terms of the prediction error ‖Y test − X test β̂‖ 2 n (on a test set (Y test , X test ) of size n, that is, a set with the same size as the training set) and the ℓ 2 estimation error |β̂ − β * | 2 is illustrated by boxplots in Figures 1 to 4. For some of these experiments, the corresponding computational costs (in seconds) of each method are reported in Table 1. In what follows, we first compare the methods to each other in terms of their accuracy. Then, we compare them in terms of their computational costs. Finally, we provide some numerical justifications for the theoretical calibration of the tuning parameters of the S-Lasso procedure.
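The two criteria reported in the figures can be computed as follows. This is a minimal sketch assuming that the prediction error is the mean squared residual on the test set (squared empirical norm) and that the estimation error is the unsquared Euclidean norm; function names and the toy numbers are illustrative.

```python
import numpy as np

def prediction_error(y_test, X_test, beta_hat):
    """Mean squared residual on the test set: (1/n) * sum_i (y_i - x_i' beta_hat)^2."""
    residuals = np.asarray(y_test) - np.asarray(X_test) @ np.asarray(beta_hat)
    return float(np.mean(residuals ** 2))

def estimation_error(beta_hat, beta_star):
    """Euclidean distance between the estimate and the true regression vector."""
    return float(np.linalg.norm(np.asarray(beta_hat) - np.asarray(beta_star)))

# Toy check with hand-computable values.
pred = prediction_error(np.array([1.0, 2.0]), np.eye(2), np.zeros(2))
est = estimation_error(np.array([3.0, 4.0]), np.zeros(2))
print(pred, est)
```

The estimation error requires the true vector and is therefore available only on simulated data, whereas the prediction error can be evaluated on any held-out sample.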

Methods comparison in terms of performance: Let us consider the different examples separately.
− Example (a): when we consider the procedures induced by the cross validation criterion (for the choice of the tuning parameter), we notice that none of them outperforms the others even when ρ = 0.9 (quite large correlation between successive variables). This is observed for both prediction and estimation errors. This is essentially due to the good behavior of the Lasso in such a situation, where the regression vector is sparse but without any particular structure. Actually, this conclusion holds in almost all the cases, even when the tuning parameters are chosen based on the theoretical study. However, two observations can be made. First, when both ρ and σ are small, the Lasso estimator performs slightly better than the other methods. Moreover, when ρ is large, a small improvement can be observed using the Fused-Lasso, the Elastic-Net and the S-Lasso methods when we care about the estimation error. This is illustrated in Figure 1 (left and right respectively), where we display the performance of the methods in terms of the prediction error in Example (a) [1/0.1] (left) and in terms of the estimation error in Example (a) [3/0.9] (right). For this example, the Lasso seems to be the best method since it involves only one tuning parameter. It moreover has a lower (mean) computational cost, equal to 0.18 seconds (based on the cross validation criterion), as displayed in Table 1. The S-Lasso, the Elastic-Net and the Fused-Lasso computational costs are respectively 3.7, 3.6 and 4.2 seconds.

− Example (b): together with Example (a), this is the least favorable example for the S-Lasso. Indeed, here the first fifteen coefficients equal 3. Then the value of the coefficients drops directly to 0. 
There is a breakpoint in the 'smoothness' in the true regression vector. Figure 5 displays the best reconstruction of the regression vector β * using the S-Lasso solution (which minimizes the ℓ 2 estimation error since β * is known). We observe the edge effects (breakpoint in the 'smoothness') that the S-Lasso cannot solve due to the ℓ 2 fusion penalty term. However, even in this case, it seems that all the procedures perform in a similar way when the tuning parameters are chosen by cross validation. When the noise level is large (σ = 15), let us nevertheless mention a (very) small improvement using the corrected versions of the S-Lasso and the Elastic-Net. Figure 2 (right) illustrates the performance of the methods in terms of the estimation error when they are applied to Example (b) [40/50/15]. The Fused-Lasso outperforms the other methods slightly in this example (with σ = 15) when we deal with the estimation performance. On the other hand, when the methods are based on the theoretical calibration of the tuning parameters, two observations can be made regardless of the noise level (1 ≤ σ ≤ 15): the S-Lasso and the Lasso perform better than the other methods in terms of the prediction error; the S-Lasso and the Elastic-Net provide good results whereas the Lasso has poor performance in terms of estimation error. This is illustrated in Figure 2 (left and center respectively) when the methods are applied to Example (b) [40/50/3]. Note moreover that similar results are also obtained when p = 100 and n = 40. In this case, the behavior of the different methods seems to be stable with respect to the parameters p, n and σ. This example is quite interesting since it corroborates that a good method for the prediction objective can be less efficient for the estimation objective (see the performance of the Lasso and the Elastic-Net).

− Example (c): we consider several values of the sample size n and the dimension p. It turns out that here again, when p < n, all the methods behave in the same way 
when the tuning parameters are chosen by cross validation (the S-Lasso induces just a small improvement). However, when p > n the S-Lasso is by far better than the other methods. This is illustrated by Figure 3 (left), where the ℓ 2 estimation error of each method applied to Example (c) [100/30/3] is displayed. The same plot is obtained for the prediction error. Moreover, when the tuning parameters are calibrated according to the theoretical study, the S-Lasso performs the best and the Fused-Lasso the worst. This appears to be true whatever the values of the parameters p, n and σ. See for instance Figure 3 (right), where the different methods are applied to Example (c) [100/30/3] for the estimation task (the same is obtained for the prediction objective). Note that in this example, the Fused-Lasso and the Elastic-Net appear to be useless.

− Example (d): together with Example (c), this is the most favorable situation for the S-Lasso estimator, where the regression vector is 'smooth' with a large number of non-zero components. The S-Lasso estimator seems to dominate its opponents in all the cases and regardless of the sample size n, the dimension p, or the noise level σ. This observation holds for the ℓ 2 estimation and the prediction errors. Note that when the tuning parameters are chosen by cross validation, the Lasso, the Fused-Lasso and the Elastic-Net have quite close performance. Figure 4 illustrates this fact when p < n for the estimation error (left: cross validation; center-left: theory). Moreover, Figure 4 (center-right and right) displays the performance of the methods when p > n in the case where the tuning parameters are based on the theoretical study (note that the ranking of the methods does not change from the case p < n when the tuning parameters are chosen by cross validation). In addition, an interesting observation follows from the experiments on Example (d) [100/30/3] (Figure 4-left). Indeed, here the sparsity index |A * | = 40 and it is then larger than the sample size n = 
30. In this case, the Lasso has poor performance. However, the S-Lasso is still good. Moreover, there even exists a pair (λ, µ) (the pair minimizing the ℓ 2 estimation error since β * is known) such that we have a good reconstruction of the regression vector β * (see Figure 5-right).
Methods comparison in terms of computational costs: Table 1 displays the computational cost (in seconds) of each method on several examples. First note that the Fused-Lasso has the largest computational cost in all the simulations whereas the Lasso has the smallest. The Elastic-Net and the S-Lasso have intermediate computational costs but are still reasonable compared to the Fused-Lasso. More precisely, when the tuning parameters are chosen by cross validation, we remark that the computational costs for the S-Lasso and the Elastic-Net are about 30 times larger than for the Lasso. This can partly be explained by the number of values explored for the tuning parameter µ (a grid with 20 elements). Actually, since the S-Lasso and the Elastic-Net are obtained with a Lasso program applied to expanded data (cf. Lemma 1), it turns out that even for fixed λ and µ, the computational cost of the Lasso is (a bit) smaller than the computational costs of the S-Lasso and the Elastic-Net. This is observed for example when we consider the solutions computed when the tuning parameters are chosen based on the theoretical study. Except for Example (a), where the increase of computational cost using the S-Lasso and the Elastic-Net is not justified (since the improvement obtained with these methods is quite small), in most of the considered situations it is quite interesting to use the Elastic-Net and even more interesting to use the S-Lasso estimator. This is due to the 'smoothness' of the true regression vector.
Table 1: Computational costs in seconds for the Lasso (L), the S-Lasso (SL), the Fused-Lasso (FL) and the Elastic-Net (EN) in several examples illustrated in the above figures. We chose either Tuning = Th or Tuning = Cv, depending on whether we consider the methods with the tuning parameters based on the theoretical study or on the 10 fold cross validation.

Finally, the Fused-Lasso has a large computational cost due to the ℓ 1 -fusion penalty, which admits a singularity. Moreover, it does not significantly improve on the Lasso estimator in the situations we considered in this paper (as observed in the previous part).
In view of the computational costs related to Example (a) (the first two columns in Table 1), let us finally remark that these costs increase with ρ, the correlation level between variables, and σ, the noise level.We observe for example that the mean computational cost of the Lasso estimator (when the tuning parameter is chosen by cross validation) is 1.1 seconds when ρ = 0.1 and σ = 1 and increases to 8 seconds when ρ = 0.9 and σ = 3.
S-Lasso; theory vs. cross validation: in what follows, we compare both versions of the S-Lasso. That is, we compare the S-Lasso when the tuning parameters are chosen by cross validation and when the tuning parameters are chosen based on the theoretical study:

• first, we compare these two methods in terms of their performance. Figure 6 summarizes the comparison between the S-Lasso based on a theoretical choice of the tuning parameters (denoted in this part by S-Lasso Th ) and the S-Lasso where the tuning parameters are based on 10 fold cross validation (denoted here by S-Lasso Cv ). First we can observe that the performances of S-Lasso Th and S-Lasso Cv are close. Moreover, given the results in the part 'Methods comparison in terms of performance', they both perform well. However, it seems that S-Lasso Cv outperforms S-Lasso Th when we deal with the prediction task. This seems quite intuitive since, by definition, the cross validation criterion attempts to provide a good estimator for the prediction objective. Regarding the ℓ 2 estimation goal, we cannot conclude that one of the estimators is superior to the other. Nevertheless, in the high dimensional setting Example (d) [500/100/σ], it seems that S-Lasso Cv begins to become better.
In any case, the theoretical choice for µ (µ ) provides good performance both in terms of ℓ 2 estimation error and test error. This performance is often close to that of the S-Lasso estimator based on the cross validation criterion. This is quite interesting since the computational cost of S-Lasso Th is much smaller than that of S-Lasso Cv. This study is actually more a verification of our theoretical choices of the tuning parameters than a rule to apply in practice. Indeed, since the theoretical choice of µ depends on β * , the corresponding estimator S-Lasso Th is unusable in real data problems;

• second, we evaluate the values of the tuning parameters in both cases. Table 2 displays the values of the tuning parameters (λ, µ) of the S-Lasso, when they are chosen by cross validation (λ Cv , µ Cv ) and based on the theoretical values (λ Th , µ Th ). We compare them to the values of the parameters (λ Est , µ Est ) that minimize the ℓ 2 estimation error.

Figure 6: performance of the S-Lasso based on 500 replications. For each subplot: Left: the tuning parameters are chosen by 10 fold cross validation. Right: the tuning parameters are chosen based on the theoretical study. We refer to Table 2 for an evaluation of these tuning parameters.
A first remark is that the values of the tuning parameters calibrated based on the theoretical study are always larger than those chosen by cross validation. This is not surprising since the theoretical calibration of the tuning parameters is fixed to capture smoothness with a large value of µ n. It then turns out that the theoretical considerations lead to 'smoother' solutions than the cross validation. Note however that λ Th > λ Cv does not imply that the solution based on the theoretical study is sparser, since a larger µ usually implies that the solution is less sparse. Regarding the best solution (where the tuning parameters minimize the ℓ 2 estimation error), there are two cases. When the true regression vector is not smooth, it seems that these 'best' tuning parameters are closer to the ones chosen by cross validation. When the true regression vector is smooth, they are closer to the tuning parameters calibrated based on theory. To sum up, one can say that the best λ is close to the one chosen by cross validation, whereas the best µ is closer to the one based on theory;

• finally, we compare both methods in terms of their estimation accuracy of |Jβ * | 2 and |A * |. Table 3 summarizes the results. The first four rows display the median values of |J β̂| 2 when β̂ denotes the S-Lasso estimator. We compare the three ways to calibrate the tuning parameters. We observe that the S-Lasso based on cross validation (S-Lasso Cv ) provides satisfying estimations of |Jβ * | 2. We also note that the S-Lasso based on the theoretical values of the tuning parameters (S-Lasso Th ) is particularly good in Examples (c) and (d). This is not surprising since the regression vector in these examples is smooth. It behaves similarly to the best S-Lasso solution (in terms of the minimization of the ℓ 2 estimation error). Since λ Th and µ Th depend on |Jβ * | 2 and |A * | (cf. Corollary 1), one can try to use S-Lasso Cv to estimate these two quantities. In this way, one would be able to compute S-Lasso Th even in real dataset experiments. However, our experiments reveal that S-Lasso Cv may overestimate the number of nonzero components, as illustrated by the four last rows of Table 3 (this is also a well-known fact). Nevertheless, we do not exclude this approach, which can be helpful to provide closer performance to those of S-Lasso Est.

Table 2: Median values of the tuning parameters (λ, µ) of the S-Lasso for different ways of calibration: 'Cv' for cross validation; 'Th' for theoretical choice; 'Est' for ℓ 2 estimation error minimizers. The tuning parameters displayed here correspond to the experiments illustrated in Figure 6.

Conclusion of the experimental results
The S-Lasso performs well when the regression vector is 'smooth' (Examples (c) and (d)). Nevertheless, even in situations designed to favor the Elastic-Net and the Fused-Lasso (Example (b)), the S-Lasso performs similarly to the other methods when the tuning parameters are chosen by the cross validation criterion. The S-Lasso is even better in these examples when the methods are constructed from the theoretical considerations.
The results for the procedures whose tuning parameters are chosen from the theoretical perspective are somewhat unfair to the Fused-Lasso. Indeed, the rates of the tuning parameters have been calibrated based on a study made for the estimator $\hat\beta^{Quad}$ (the Elastic-Net and the S-Lasso are two particular cases of this estimator). For the Lasso estimator, we also used the usual rate for $\lambda$. Even though the Fused-Lasso seems close to the S-Lasso, it turns out that similar choices for the tuning parameters lead to worse results for the Fused-Lasso.
Based on the results for Examples (c) and (d), it seems that the Fused-Lasso and the Elastic-Net incur a large bias for large values of $\mu$ when the regression vector is smooth (also observed in [10]). They do not significantly improve the performance of the Lasso estimator in such situations. Even the 'corrected' Elastic-Net does not provide better results, since the artificial correction seems to work only for a small number of pairs $(\lambda, \mu)$ that have to be chosen very carefully.
One can think of two-stage methods to obtain better performance for the Fused-Lasso and the Elastic-Net (and also for the S-Lasso and the Lasso), where for instance an ordinary least squares estimator is fitted on the estimated support. This technique of course reduces the bias of the procedures, and we refer to [2] for a nice theoretical study of such procedures. However, we aim here to examine the performance of the (one-stage) methods and to observe how well the S-Lasso approaches the true regression vector.
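The two-stage idea mentioned above can be sketched as follows. This is a minimal illustration, not the procedure studied in [2]: the function name `refit_on_support`, the tolerance `tol`, and the synthetic data are assumptions made for the example, and the first-stage estimate is a stand-in rather than the output of an actual Lasso-type fit.

```python
import numpy as np

def refit_on_support(X, y, beta_hat, tol=1e-10):
    """Second stage: ordinary least squares restricted to the estimated support."""
    support = np.flatnonzero(np.abs(beta_hat) > tol)
    beta_refit = np.zeros_like(beta_hat, dtype=float)
    if support.size > 0:
        # lstsq returns a least squares solution even if the columns are collinear
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta_refit[support] = coef
    return beta_refit

# Synthetic illustration (all values are hypothetical)
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:5] = 2.0
y = X @ beta_star + 0.1 * rng.standard_normal(n)

# Stand-in for a first-stage estimate: correct support, shrunken (biased) values
beta_first = 0.5 * beta_star
beta_two_stage = refit_on_support(X, y, beta_first)

print("first-stage l2 error:", np.linalg.norm(beta_first - beta_star))
print("two-stage  l2 error:", np.linalg.norm(beta_two_stage - beta_star))
```

The refit removes the shrinkage bias on the selected coordinates, but it inherits any error made when estimating the support in the first stage.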

Pseudo-real dataset
We apply all the methods previously studied to datasets artificially generated from the riboflavin data. These data concern riboflavin (vitamin B2) production by Bacillus subtilis. They have kindly been provided to us by DSM Nutritional Products (Switzerland). In the original data, the real-valued response variable is the logarithm of the riboflavin production rate, and there are p = 4088 covariates measuring the logarithm of the expression level of 4088 genes that cover essentially the whole genome of Bacillus subtilis. The sample size is n = 71.
Here, we are not interested in the riboflavin production, but only in the covariate matrix $X$ coming from this application. We use this design matrix to generate an artificial response vector with a 'smooth' regression vector as in Equation (1). Let us mention that this trick to generate pseudo-real datasets has already been used in [22]. In what follows, we consider two different applications based on the real covariate matrix provided by the riboflavin dataset.
In the first application, say Application 1, let us define $X$ as the first 1023 covariates of the riboflavin dataset. Moreover, let us define the regression vector $\beta^*$ such that $\beta^*_j = 10 \cdot \exp\left(-\frac{1}{1-((j-125)/125.1)^2}\right)$ for $j = 1, \ldots, 250$ (cf. Figure 8), and the noise level $\sigma = 3$. Hence $n = 71$ and $p = 1023$, so this is a high-dimensional setting with $p \gg n$ where the number of non-zero components (the sparsity index $|A^*|$) is larger than the sample size $n$. In the second application, say Application 2, we restrict $X$ to the first 300 covariates of the riboflavin dataset. The regression vector $\beta^*$ is such that $\beta^*_j = 10 \cdot \exp(-\,[\ldots])$ for $j = 1, \ldots, 50$ (cf. Figure 8), and the noise level is $\sigma = 3$. This is a more common high-dimensional case where the sparsity index $|A^*|$ is smaller than the sample size $n$.
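As an illustration of the construction for Application 1, the following sketch builds the bump-shaped regression vector and a response of the form $Y = X\beta^* + \varepsilon$. Several points are assumptions: the riboflavin covariates are not reproduced here, so a Gaussian matrix is used as a placeholder for $X$; the noise is taken to be Gaussian; $\beta^*_j$ is set to zero for $j > 250$; and the helper name `bump_vector` is hypothetical.

```python
import numpy as np

def bump_vector(p, support_size, center, scale, height=10.0):
    """beta_j = height * exp(-1 / (1 - ((j - center) / scale)^2)) for j = 1..support_size,
    and 0 for the remaining coordinates (assumed)."""
    beta = np.zeros(p)
    j = np.arange(1, support_size + 1)      # 1-based indices, as in the text
    u = (j - center) / scale                # |u| < 1 on the whole support
    beta[:support_size] = height * np.exp(-1.0 / (1.0 - u ** 2))
    return beta

n, p, sigma = 71, 1023, 3.0
beta_star = bump_vector(p, support_size=250, center=125.0, scale=125.1)

rng = np.random.default_rng(0)
X = rng.standard_normal((n, p))             # placeholder for the real covariate matrix
y = X @ beta_star + sigma * rng.standard_normal(n)

print("sparsity index:", np.count_nonzero(beta_star), "sample size:", n)
print("largest coefficient:", beta_star.max())
```

The vector peaks at $j = 125$ with value $10 e^{-1}$ and decays smoothly to (numerically tiny) values at both ends of the support, which is what makes it 'smooth' in the sense used in this section.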
Let us now detail the results obtained in the different experiments. First, we mention that, with the exception of the S-Lasso, all the methods provide an estimate of the regression vector characterized by large variations between successive components when $\mu$ is small (for the Elastic-Net and the Fused-Lasso) and by a large bias when $\mu$ is large. Hence, we focus here on the S-Lasso estimator. Nevertheless, Figure 7 compares the accuracy of all the methods on Application 2. Even though the S-Lasso estimator is outperformed when the tuning parameters are chosen by cross validation (by the Fused-Lasso for the estimation error and by all the methods for the prediction error; cf. Figure 7, left and center-left), we can find an S-Lasso solution which performs better than the other methods, as displayed in Figure 7 (center-right and right). One of the best S-Lasso solutions in Application 2 can also be seen in Figure 8 (left). We observe how the S-Lasso succeeds in reconstructing the 'smooth' regression vector $\beta^*$. Before considering Application 1, we point out one more fact: in both the center-right and right plots of Figure 7, the tuning parameters minimize the $\ell_2$ estimation error. This can explain the poor performance of the Lasso in terms of prediction error (right plot). It also explains the big discrepancy between the Lasso based on cross validation (center-left plot) and the one corresponding to the right plot. Finally, let us consider Application 1, and recall that here the sparsity index is larger than the sample size. Figure 8 (right) displays the best reconstruction of the regression vector on this very difficult problem. We observe that the S-Lasso succeeds only partly in reconstructing the true regression vector. In the simulation study, we met a similar situation with Example (d) [100/30/3] (cf. Figure 5), where the S-Lasso perfectly estimated $\beta^*$. However, the situation here is even more difficult, since the sparsity index is much larger than the sample size and since many high and negative correlations between the covariates appear in the riboflavin dataset.
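To make the reconstructions discussed above more concrete, here is a minimal proximal-gradient (ISTA) sketch of an $\ell_1$ + quadratic penalized fit. It is not the algorithm used for the experiments reported here. The objective $\frac{1}{n}\|Y - X\beta\|_2^2 + \lambda|\beta|_1 + \mu\,\beta'\tilde J\beta$, the choice of $\tilde J$ as the Gram matrix of first differences (so that the quadratic term penalizes $\sum_j(\beta_j - \beta_{j-1})^2$), and all function names and numerical values are assumptions made for illustration.

```python
import numpy as np

def first_difference_gram(p):
    """Matrix J_tilde with beta' J_tilde beta = sum_{j=2}^p (beta_j - beta_{j-1})^2."""
    D = np.diff(np.eye(p), axis=0)          # (p-1) x p first-difference matrix
    return D.T @ D

def objective(X, y, beta, lam, mu, J_tilde):
    n = X.shape[0]
    return (np.sum((y - X @ beta) ** 2) / n
            + lam * np.abs(beta).sum()
            + mu * beta @ J_tilde @ beta)

def quad_lasso_ista(X, y, lam, mu, J_tilde, n_iter=500):
    """Proximal gradient descent on the assumed objective, started at zero."""
    n, p = X.shape
    gram = X.T @ X / n + mu * J_tilde
    # The smooth part has gradient 2 * (gram @ beta - X'y / n); its Lipschitz
    # constant is twice the largest eigenvalue of gram.
    step = 1.0 / (2.0 * np.linalg.eigvalsh(gram)[-1])
    xty = X.T @ y / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - step * 2.0 * (gram @ beta - xty)
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return beta

# Synthetic illustration with a smooth block of nonzero coefficients
rng = np.random.default_rng(0)
n, p = 40, 60
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:20] = np.sin(np.linspace(0.0, np.pi, 20))
y = X @ beta_star + 0.5 * rng.standard_normal(n)

J_tilde = first_difference_gram(p)
beta_hat = quad_lasso_ista(X, y, lam=0.1, mu=0.5, J_tilde=J_tilde)
print("l2 estimation error:", np.linalg.norm(beta_hat - beta_star))
```

With a step size equal to the inverse Lipschitz constant, each iteration does not increase the objective, so the sketch is safe to run with a fixed iteration budget; a stopping rule on the change in $\beta$ would be a natural refinement.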

Conclusion
In this paper, we introduced the Lasso-type estimator $\hat\beta^{Quad}$, which is based on two penalty terms: an $\ell_1$ penalty term which ensures sparsity and a quadratic penalty term which captures some structure in the regression vector. We showed that this estimator has good theoretical properties, specifically in situations where the Lasso estimator might fail. As special cases we considered the Elastic-Net and the S-Lasso. These methods are of particular interest when correlations between variables exist or when the regression vector is 'smooth'. We illustrated this in a specific setting and with an example where $\hat\beta^{Quad}$ performs better than the Lasso. In an experimental study, we compared the performance of the S-Lasso estimator to that of the Lasso, the Elastic-Net and the Fused-Lasso in terms of prediction and estimation accuracy. We found the S-Lasso to be superior in several simulation experiments where the regression vector has a particular structure. We also observed that the theoretical calibration of the tuning parameters and the one obtained by 10-fold cross validation provide similar performance. The methods have also been applied to pseudo-real examples based on the riboflavin dataset. Finally, we pointed out in several simulation studies (see Example (d) [100/30/σ]) the ability of the S-Lasso to recover smooth vectors even in difficult situations where the sparsity index is larger than the sample size.
Lemma 3. Let $\eta \in (0, 1)$ and let $0 < \tau \le 1$ be a real number. Denote by $L$ the constant such that $n^{-1}\sum_{i=1}^{n}\max_{j=1,\ldots,p} x_{i,j}^2 \le L$. Let $\Lambda_{n,p}$ be the random event defined by $\Lambda_{n,p} = \{\max_{j=1,\ldots,p} 2|V_j| \le \tau\lambda_n\}$, where $V_j = n^{-1}\sum_{i=1}^{n} x_{i,j}\varepsilon_i$ is such that, for any $i = 1, \ldots, n$, $x_{i,j}^2 \le L$ and the $\varepsilon_i$'s are independent random variables with zero mean and finite variance $E\varepsilon_i^2 \le \sigma^2$. Denote by $K_{Nem}$ the quantity $K_{Nem} = \inf_{q \in [2,\infty] \cap R}\,(q-1)\,p^{2/q}$. Then for [...]

Proof. This inequality uses an inequality on the expectation of the supremum of the square of a sum of independent random variables that can be found in [11]: [...] where we used the definition of [...] in the last inequality. Theorem 2.2 in [11] is used to obtain (15).
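The constant $K_{Nem}$ has no closed form in general, but it is easy to evaluate numerically. The sketch below approximates the infimum on a grid and then evaluates a tuning parameter of the form $\lambda_n = \frac{2\sigma}{\tau}\sqrt{K_{Nem} L/(n\eta)}$. This expression for $\lambda_n$ is our reading of the value quoted in the proof of Theorem 5 and should be treated as an assumption, as should the function names, the grid range, and the value $L = 1$ used in the example.

```python
import numpy as np

def k_nem(p, grid_size=20001):
    """Grid approximation of inf over q in [2, inf) of (q - 1) * p**(2 / q)."""
    log_p = np.log(p)
    # The minimizer lies below 2 * log(p) when p is large, so 4 * log(p) is a safe upper end.
    q = np.linspace(2.0, max(4.0, 4.0 * log_p), grid_size)
    return float(np.min((q - 1.0) * np.exp(2.0 * log_p / q)))

def lambda_finite_variance(n, p, sigma, L, eta, tau=0.5):
    """Assumed form: lambda_n = (2 * sigma / tau) * sqrt(K_Nem * L / (n * eta))."""
    return (2.0 * sigma / tau) * np.sqrt(k_nem(p) * L / (n * eta))

n, p, sigma, eta = 71, 1023, 3.0, 0.1
print("K_Nem approx:", k_nem(p))
print("upper bound e * (2 log p - 1):", np.e * (2.0 * np.log(p) - 1.0))
print("lambda_n:", lambda_finite_variance(n, p, sigma, L=1.0, eta=eta))
```

Taking $q = 2\log p$ in the definition shows that $K_{Nem} \le e(2\log p - 1)$, so the constant grows only logarithmically in $p$; the price of the weaker noise assumption is the $1/\sqrt{\eta}$ dependence on the confidence level.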
Proof of Theorem 1. We first provide a result which may help the legibility of the paper. It states that the squared risk and the $\ell_1$ estimation error are controlled by the restricted $\ell_2$ estimation error.

Proposition 3. Let $\hat\beta^{Quad}$ be the estimator defined by (2)-(4) with tuning parameters $\lambda_n$ and $\mu_n$. Let $0 < \tau \le 1$ be a real number. On the event $\Lambda_{n,p} = \{\max_{j=1,\ldots,p} 2|V_j| \le \tau\lambda_n\}$ with [...] where [...]

Proof. First let $\tilde X$, $\tilde Y$ and $\tilde\varepsilon$ be the augmented dataset defined by [...] where $0$ is a vector of size $p$ containing only zeros and $J$ is the $p \times p$ matrix given by (5). Then we have $\tilde Y = \tilde X\beta^* + \tilde\varepsilon$, and the estimator $\hat\beta^{Quad}$, solution of the minimization problem (2) with the penalty given by (4), is also the minimizer of [...] Hence, by definition of the estimator $\hat\beta^{Quad}$ we can write [...] Let us now consider the term $\frac{2}{n}\tilde\varepsilon'\tilde X(\beta^* - \hat\beta^{Quad})$. By the definition of $\tilde X$ and $\tilde\varepsilon$, we have the decomposition [...] The first term in this decomposition is quite common in the literature and we treat it using arguments which can be found for instance in [7]. We then need to adapt those arguments in order to deal, at the same time, with the second term of the decomposition, $\mu_n\beta^{*\prime}J'J(\beta^* - \hat\beta^{Quad})$. Recall that $A^* = \{j : \beta^*_j \neq 0\}$ and that $J'J = \tilde J$. Let $0 < \tau \le 1$ be a real number. Then, on the event $\Lambda_{n,p} = \{\max_{j=1,\ldots,p} 2|V_j| \le \tau\lambda_n\}$, [...] The remainder of this proof is linked to the way we choose to treat the term $\mu_n\beta^{*\prime}\tilde J(\beta^* - \hat\beta^{Quad})$, and in particular to the way we choose to link the RHS of Inequality (17) to the quantity [...] We obviously can write [...] where $B$ is the smallest set of indices such that the first equality holds. Note that the set $B$ includes $A^*$, the true sparsity set, and is not much larger due to the sparsity of $J$. Now let $\tau = 1/2$ in (17), and add $2^{-1}\lambda_n|\beta^* - \hat\beta^{Quad}|_1$ to both sides of this inequality. We then get [...] where [...] In the second inequality above, we used the fact that $|\beta^*_j - \hat\beta^{Quad}_j|$ [...] for any $j \notin A$, together with the triangle inequality. This is the claim of Proposition 3 when $J$ is sparse.
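The augmented-dataset device used in this proof can be checked numerically. The displayed definition is not reproduced above, so the sketch below assumes the standard construction for a criterion normalized by $1/n$: the rows $\sqrt{n\mu_n}\,J$ are stacked under $X$ and $p$ zeros are appended to $Y$. With this choice, the augmented least squares criterion equals the original one plus $\mu_n|J\beta|_2^2 = \mu_n\beta'J'J\beta$. The function name `augment`, the scaling, and the example matrix `J` are assumptions.

```python
import numpy as np

def augment(X, y, J, mu):
    """Stack sqrt(n * mu) * J under X and zeros under y (assumed normalization)."""
    n = X.shape[0]
    X_aug = np.vstack([X, np.sqrt(n * mu) * J])
    y_aug = np.concatenate([y, np.zeros(J.shape[0])])
    return X_aug, y_aug

rng = np.random.default_rng(0)
n, p, mu = 20, 8, 0.7
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)

# Hypothetical p x p matrix J: a zero first row, then first differences
J = np.zeros((p, p))
J[1:] = np.diff(np.eye(p), axis=0)

X_aug, y_aug = augment(X, y, J, mu)
lhs = np.sum((y_aug - X_aug @ beta) ** 2) / n
rhs = np.sum((y - X @ beta) ** 2) / n + mu * np.sum((J @ beta) ** 2)
print(lhs, rhs)   # the two values agree up to rounding
```

This identity is what allows an $\ell_1$ + quadratic penalized estimator to be viewed as a Lasso on the augmented data, so that arguments developed for the Lasso can be adapted, with the extra term in $\mu_n$ handled separately.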
Let us now prove the main theorem. Thanks to Inequality (16) in Proposition 3, we easily obtain that [...] where [...] Then the vector $\beta^* - \hat\beta^{Quad}$ is an admissible vector $\Delta$ in Assumption B(B). As a consequence, using this assumption in Equation (16), we get on the one hand [...] and a simplification leads to the first part of the result [...] On the other hand, Inequality (19), combined with Assumption B(B) and Inequality (20), implies [...] which is the desired bound on the $\ell_1$ estimation error given in Theorem 1. The proof is completed by using Lemma 2 with $\tau = 1/2$ to control the probability of the event $\Lambda_{n,p}$.
Proof of Proposition 1. We first provide a bound on [...] Theorem 1 states bounds on the prediction error and on the $\ell_1$ estimation error under Assumption B(B). Here we do not care about the $\ell_1$ estimation error. Then one can observe that in the intermediate step between (17) and (18) [...] where we recall that $\Theta = B \cup C$, with $|B| = m$. Then using the last display with (19) yields [...] where [...] Combining this last inequality with (21) implies [...] Since $|\delta|_\infty \le |\delta|_2$, we obtain the desired control on the sup-norm of $\beta^* - \hat\beta^{Quad}$.
Proof of Theorem 2. This result is quite natural since it is a direct consequence of Proposition 1. We refer the reader to the proof of Theorem 2 in [18] for instance.
Proof of Theorem 3. We now consider the case of general matrices $J$. Most of the proof is similar to the sparse case (proof of Theorem 1 above). The same reasoning leads to (17), and the only difference occurs when we deal with the term $-\mu_n\beta^{*\prime}\tilde J(\beta^* - \hat\beta^{Quad})$. We have here [...] Then, if we set $\tau =$ [...] since here $\tau$ becomes equal to $1/2$ in Lemma 2. This completes the proof of the first part of the Proposition.
[...] Note that by assumption, we have $|\beta^*_j| > U$ for all $j \in A^*$. Then, distinguishing the case $\beta^*_j > 0$ from the case $\beta^*_j < 0$, we easily conclude that $\beta^*_j > 0$ implies $\hat\beta^{Quad}_j > 0$ and $\beta^*_j < 0$ implies $\hat\beta^{Quad}_j < 0$. This enables us to write $P(\mathrm{Sgn}(\hat\beta^{Quad}_{A^*}) = \mathrm{Sgn}(\beta^*_{A^*})) \ge P(|\hat\beta^{Quad}$ [...] and this naturally implies that $A^* \subset \hat A$ with high probability.

Proof of Theorem 4. We now show that $\hat A \subset A^*$ with high probability. This proof is largely inspired by the one by Bunea [5]. First of all, note that we can write the KKT conditions of the minimization problem (6) as [...] Then all the solutions of the criterion (6) share the same active set $\hat A = \{j \in \{1, \ldots, p\} :$ [...] That is, all these solutions have non-zero components at the same positions. We now use this property to show that the estimator $\hat\beta^{Quad}$ has non-zero components at the same positions as a well-controlled (but uncomputable) estimator on an event which occurs with high probability.
For this purpose, let us consider the criterion [...] where we recall that for any $p$-dimensional vector $a$ and any set $\Theta \subset \{1, \ldots, p\}$, the notation $a_\Theta$ means that $(a_\Theta)_j = a_j$ for all $j \in \Theta$ and $0$ otherwise. Moreover, $J_{A^*}$ is such that $(J_{A^*})_{j,k} = J_{j,k}$ if $j, k \in A^*$ and $0$ otherwise. Define the estimator $b = \operatorname{argmin}_{b \in R^p :\, b_{(A^*)^c} = 0_p} F(b)$, where $0_p$ is the zero of $R^p$. Since we restricted $b$ to be zero where $\beta^*$ is zero, and since this is information we do not have access to, the vector $b$ is not computable. Let us denote by $\Omega$ the following event [...] Observe how the event $\Omega$ is inspired by the KKT conditions (23). Actually, on the event $\Omega$, the components $b_k$ with $k \notin A^*$ equal zero as they do not saturate the KKT conditions. This makes the minimization of $F(b)$ over $\{b \in R^p : b_{(A^*)^c} = 0_p\}$ coincide with the minimization of the criterion (6) on $\Omega$. That is, the estimator $b$ turns out to be also a solution of the original criterion (6) on $\Omega$. But $\hat\beta^{Quad}$ is also a solution of (6) and then, as already pointed out, this implies that on $\Omega$ both $\hat\beta^{Quad}$ and $b$ have non-zero components at the same positions, so that $b$ has its non-zero components at the positions $j \in \hat A$. Add the fact that by construction $b_{(A^*)^c} = 0_p$; then $\hat A \subset A^*$ on the event $\Omega$. It then remains to prove that the event $\Omega$ occurs with high probability. We have [...] where we used the fact that for real numbers $a$ and $b$ we have $|a| + |b| \ge |a + b|$ in the third inequality, and the fact that $\mu_n = \lambda_n$ [...] By definition of $b$, we just have to repeat the proof of Theorem 3 but with $b$ instead of $\hat\beta^{Quad}$ and only on the true sparsity set $A^*$. We get that on the event $\Lambda_{n,A^*} = \{\max_{j \in A^*} |X_j'\varepsilon| \le \lambda_n/8\}$, which is the same as $\Lambda_{n,p}$ but using $A^*$ instead of $\{1, \ldots, p\}$, [...]

Proof of Theorem 5. This proof is almost the same as the one of Theorem 1. The only difference is the way to control the event $\Lambda_{n,p} = \{\max_{j=1,\ldots,p} 2|V_j| \le \tau\lambda_n\}$, where $V_j = n^{-1}\sum_{i=1}^{n} x_{i,j}\varepsilon_i$, when the noise only has zero mean and finite variance. In this case we do not use the concentration inequality provided in Lemma 2 for Gaussian noise, but an analogous concentration inequality better adapted to this type of noise. This concentration inequality is given by Lemma

Theorem 2.
Let us consider the thresholded estimator $\hat\beta^{Th-Quad}$ as described above. In the same setting as in Proposition 1, and under Assumption B′(B ∪ C) and Assumption C, $P\big(\mathrm{sign}(\hat\beta^{Th-Quad}) = \mathrm{sign}(\beta^*)\big) \ge 1 - \eta$.
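The sign-recovery statement can be illustrated with a small sketch of thresholding followed by a sign comparison. The precise definition of the thresholded estimator and of its threshold level is given earlier in the paper and is not reproduced here, so the hard-thresholding rule, the level `0.3`, and the numerical values below are purely hypothetical.

```python
import numpy as np

def hard_threshold(beta_hat, level):
    """Set to zero every component whose magnitude does not exceed `level`."""
    return np.where(np.abs(beta_hat) > level, beta_hat, 0.0)

def same_sign_pattern(a, b):
    """True when the sign vectors (with values in {-1, 0, 1}) coincide."""
    return bool(np.array_equal(np.sign(a), np.sign(b)))

beta_star = np.array([2.0, -1.5, 0.0, 0.0])
beta_hat = np.array([1.8, -1.2, 0.1, -0.05])   # small spurious components

beta_thresholded = hard_threshold(beta_hat, level=0.3)
print(same_sign_pattern(beta_hat, beta_star))          # False: noise variables are selected
print(same_sign_pattern(beta_thresholded, beta_star))  # True: thresholding removes them
```

The example shows why a control on the sup-norm of the estimation error is the natural ingredient: if every component of $\hat\beta^{Quad}$ is within the threshold level of $\beta^*$ and the non-zero components of $\beta^*$ are large enough, thresholding recovers the sign pattern exactly.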

Proposition 2.
Therefore, we prefer to exploit here again a control on the $\ell_2$ estimation error $|\beta^* - \hat\beta^{Quad}|_2$ instead, which in the sequel enables us to recover signals with $|\beta^*_j| \ge \mathrm{cst} \cdot \lambda_n|A^*|$ under the same assumption on the matrix $K_n$. Let us mention that $\lambda_n|A^*|$ is not the best level which can be recovered. One can also get rid of the extra term $|A^*|$ through a quite restrictive assumption on the correlations between variables (see Remark 3). Proposition 2 below is a first step toward a variable selection result. It states that $\hat\beta^{Quad}$ enables us at least to detect the relevant variables (and maybe also some noise variables): Let us consider the same setting as in Theorem 3, with the only difference that $\lambda_n = 4\sqrt{2}\,\sigma\sqrt{\log(p/\eta)/n}$ and $\mu_n = \lambda_n/(4|\tilde J\beta^*|_\infty)$ with $0 < \eta < 1$. Under Assumption RE, and with probability larger than $1 - \eta$, we have [...]

Figure 1: Performance of the Lasso (L), the S-Lasso (SL), the Fused-Lasso (FL), and the Elastic-Net (EN) applied to Example (a) and based on 500 replications. The tuning parameters are chosen based on the theoretical study. Left: Evaluation of the prediction error $\|Y_{test} - X_{test}\hat\beta\|_n^2$ in comparison with the performance of the truth (T), that is, $\|Y_{test} - X_{test}\beta^*\|_n^2$. Right: Evaluation of the $\ell_2$ estimation error $|\hat\beta - \beta^*|_2$.

Figure 2: Performance of the Lasso (L), the S-Lasso (SL), the Fused-Lasso (FL) and the Elastic-Net (EN) applied to Example (b) and based on 500 replications. The tuning parameters are chosen based on the theoretical study in the first two plots and by 10-fold cross validation in the third. Left: Evaluation of the prediction error $\|Y_{test} - X_{test}\hat\beta\|_n^2$, in comparison with the performance of the truth (T), i.e., $\|Y_{test} - X_{test}\beta^*\|_n^2$. Center and Right: Evaluation of the $\ell_2$ estimation error $|\hat\beta - \beta^*|_2$.

Figure 3: Evaluation of the $\ell_2$ estimation error $|\hat\beta - \beta^*|_2$ of the Lasso (L), the S-Lasso (SL), the Fused-Lasso (FL) and the Elastic-Net (EN) applied to Example (c) and based on 500 replications. Left: The tuning parameters are chosen by 10-fold cross validation. Right: The tuning parameters are chosen based on the theoretical study.

Figure 4: Evaluation of the $\ell_2$ estimation error $|\hat\beta - \beta^*|_2$ of the Lasso (L), the S-Lasso (SL), the Fused-Lasso (FL) and the Elastic-Net (EN) applied to Example (d) and based on 500 replications. Left: The tuning parameters are chosen by 10-fold cross validation. Center-left; Center-right; Right: The tuning parameters are chosen based on the theoretical study.

Figure 6: Evaluation of the $\ell_2$ estimation error $|\hat\beta - \beta^*|_2$ (top) and the prediction error $\|Y_{test} - X_{test}\hat\beta\|_n^2$ (bottom) of the S-Lasso based on 500 replications. For each subplot: Left: The tuning parameters are chosen by 10-fold cross validation. Right: The tuning parameters are chosen based on the theoretical study. We refer to Table 2 for an evaluation of these tuning parameters.

Figure 7: Evaluation of the $\ell_2$ estimation error $|\hat\beta - \beta^*|_2$ and the prediction error $\|Y_{test} - X_{test}\hat\beta\|_n^2$ of the Lasso (L), the S-Lasso (SL), the Fused-Lasso (FL) and the Elastic-Net (EN) applied to the pseudo-real data, and based on 20 replications of Application 2. Left; Center-left: The tuning parameters are chosen by 10-fold cross validation. Center-right; Right: The tuning parameters minimize the $\ell_2$ estimation error.
[...] and $B$ is a set including $A^*$.

$\frac{1}{n}\|X\beta^* - X\hat\beta^{Quad}\|_2^2 + \frac{\lambda_n}{2}|\beta^* - \hat\beta^{Quad}|_1 \le 2\lambda_n|A^*|\,|\beta^*_{A^*} - \hat\beta^{Quad}_{A^*}|_2$. This intermediate result is the analog of Proposition 3 in the case where $J$ is general. That is, we get a similar bound, but depending on $|\beta^*_{A^*} - \hat\beta^{Quad}_{A^*}|_2$ instead of $|\beta^*_B - \hat\beta^{Quad}_B|_2$ and with $r_n = 2\lambda_n|A^*|$. Note also that (19) is replaced by the following linear inequality: $|\beta^* - \hat\beta^{Quad}|_1 \le 4|\beta^*_A - \hat\beta^{Quad}_A$ [...] Taking this change into account, we can use Assumption RE instead of Assumption B(B), and then a reasoning similar to the proof of Theorem 1 leads to the desired results.

Proof of Proposition 2. Using exactly the same reasoning as in the proof of Proposition 1, but based on Theorem 3 instead of Theorem 1, we obtain with probability at least $1 - \eta$: $|\beta^*_{A^*} - \hat\beta^{Quad}_{A^*}|_2 \le 2\phi_{\mu_n}^{-1}\lambda_n|A^*|$, [...]

$b \in R^p : b_{(A^*)^c} = 0_p$, $F(b)$,
3 and we get $P\big(\max_{j=1,\ldots,p} 2|V_j| \le \tau\lambda_n\big) \ge 1 - \eta$, for a value of $\lambda_n = \frac{2\sigma}{\tau}\sqrt{\frac{K_{Nem} L}{n\eta}}$. Then we set $\tau = 1/2$ and we plug this new value of the tuning parameter $\lambda_n$, instead of the one used to establish the previous results, into Theorem 1. We finish the proof by using the fact that $\mu_n = \frac{\lambda_n\sqrt{|A^*|}}{2|\tilde J\beta^*|_2}$, and we obtain the analog of Corollary 1.

Table 3: Median value of $|J\hat\beta|_2$ (first four rows) and median number of nonzero components $|\hat A|$ (last four rows) of the S-Lasso for different ways of calibrating the tuning parameters: 'Cv' for cross validation; 'Th' for theoretical choice; 'Est' for $\ell_2$ estimation error minimizers. The third quartiles are displayed in brackets. The values in this table correspond to the experiments illustrated in Figure 6 (and in Table 2 as well).