Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators

We derive the $l_{\infty}$ convergence rate simultaneously for Lasso and Dantzig estimators in a high-dimensional linear regression model under a mutual coherence assumption on the Gram matrix of the design and two different assumptions on the noise: Gaussian noise and general noise with finite variance. We then prove that, with a properly chosen threshold, the thresholded Lasso and Dantzig estimators simultaneously enjoy a sign concentration property, provided that the non-zero components of the target vector are not too small.


Introduction
The Lasso is an $l_1$-penalized least squares estimator in linear regression models proposed by Tibshirani [17]. The Lasso enjoys two important properties. First, it is naturally sparse, i.e., it has a large number of zero components. Second, it is computationally feasible even for high-dimensional data (Efron et al. [8], Osborne et al. [16]), whereas classical procedures such as BIC are not feasible when the number of parameters becomes large. The first property raises the question of model selection consistency of the Lasso, i.e., of identification of the subset of non-zero parameters. A closely related problem is sign consistency, i.e., identification of the non-zero parameters and their signs (cf. Bunea [2], Meinshausen and Bühlmann [13], Meinshausen and Yu [14], Wainwright [20], Zhao and Yu [22] and the references cited in these papers).
Zou [23] has proved estimation and variable selection results for the adaptive Lasso: a variant of the Lasso where the weights on the different components in the $l_1$ penalty vary and are data dependent. We mention also work on the convergence of the Lasso estimator under the prediction loss: Bickel, Ritov and Tsybakov [1], Bunea, Tsybakov and Wegkamp [3], Greenshtein and Ritov [9], Koltchinskii [11; 12], van de Geer [18; 19]. Knight and Fu [10] have proved the estimation consistency of the Lasso estimator in the case where the number of parameters is fixed and smaller than the sample size. The $l_2$ consistency of the Lasso with a convergence rate has been proved in Bickel, Ritov and Tsybakov [1], Meinshausen and Yu [14], Zhang and Huang [21]. These results trivially imply the $l_p$ consistency for $2\le p\le\infty$, however with a suboptimal rate (cf., e.g., Theorem 3 in [21]). Bickel, Ritov and Tsybakov [1] have proved that the Dantzig selector of Candès and Tao [6] shares many properties with the Lasso. In particular, they have shown simultaneous $l_p$ consistency with rates of the Lasso and Dantzig estimators for $1\le p\le2$. To our knowledge, there is no result on the $l_\infty$ convergence rate and sign consistency of the Dantzig estimator.
The notions of $l_\infty$ and sign consistency must be properly defined when the number of parameters is larger than the sample size. Indeed, there may be infinitely many possible target vectors and solutions to the Lasso and Dantzig minimization problems. This difficulty is not discussed in [2; 13; 14; 20; 21], where either the target vector or the Lasso estimator or both are assumed to be unique. We show that under a sparsity scenario it is possible to derive $l_\infty$ and sign consistency results even when the number of parameters is larger than the sample size. We refer to Theorem 6.3 and Remark 1, p. 21, in [1], which suggest a way to clarify the difficulty mentioned above.
In this paper, we consider a high-dimensional linear regression model where the number of parameters can be much greater than the sample size. We show that under a mutual coherence assumption on the Gram matrix of the design, the target vector, which has few non-zero components, is unique. We do not assume the Lasso or Dantzig estimators to be unique. We establish the $l_\infty$ convergence rate of all the Lasso and Dantzig estimators simultaneously under two different assumptions on the noise. The rate that we get improves upon those obtained for the Lasso in previous works. Then we show a sign concentration property of all the thresholded Lasso and Dantzig estimators simultaneously, for a proper choice of the threshold, if we assume that the non-zero components of the sparse target vector are large enough. Our condition on the size of the non-zero components of the target vector is less restrictive than in [20; 21; 22]. In addition, we prove analogous results for the Dantzig estimator, which to our knowledge was not done before.
The paper is organized as follows. In Section 2 we present the Gaussian linear regression model and our assumptions, state the results, and compare them with existing results in the literature. In Section 3 we consider a general noise with zero mean and finite variance and show that the results remain essentially the same, up to a slight modification of the convergence rate. In Section 4 we provide the proofs of the results.

Model and Results
Consider the linear regression model
$$Y = X\theta^* + W, \qquad (1)$$
where $X$ is an $n\times M$ deterministic design matrix, $\theta^*\in\mathbb{R}^M$ is the unknown target vector and $W=(W_1,\dots,W_n)^\top\in\mathbb{R}^n$ is a random noise vector.
The Lasso and Dantzig estimators $\hat\theta^L$, $\hat\theta^D$ solve respectively the minimization problems
$$\hat\theta^L \in \operatorname*{arg\,min}_{\theta\in\mathbb{R}^M}\ \frac1n|Y-X\theta|_2^2 + 2r|\theta|_1 \qquad (2)$$
and
$$\hat\theta^D \in \operatorname*{arg\,min}_{\theta\in\mathbb{R}^M}\Big\{|\theta|_1:\ \Big|\frac1n X^\top(Y-X\theta)\Big|_\infty\le r\Big\}, \qquad (3)$$
where $r > 0$ is a constant. A convenient choice in our context will be $r = A\sigma\sqrt{(\log M)/n}$, for some $A > 0$. We denote respectively by $\hat\Theta^L$ and $\hat\Theta^D$ the sets of solutions to the Lasso and Dantzig minimization problems (2) and (3). The definition of the Lasso minimization problem we use here is not the same as the one in [17], where it is defined as
$$\min_{\theta\in\mathbb{R}^M}\Big\{\frac1n|Y-X\theta|_2^2:\ |\theta|_1\le t\Big\}$$
for some $t > 0$. However, these minimization problems are strongly related, cf. [5]. The Dantzig estimator was introduced and studied in [6]. Define $\Phi(\theta) = \frac1n|Y - X\theta|_2^2 + 2r|\theta|_1$. A necessary and sufficient condition for a vector $\theta$ to minimize $\Phi$ is that the zero vector in $\mathbb{R}^M$ belongs to the subdifferential of $\Phi$ at the point $\theta$, i.e.,
$$\frac1n\big(X^\top(Y-X\theta)\big)_j = r\,\mathrm{sign}(\theta_j)\ \ \text{if}\ \theta_j\neq0, \qquad \Big|\frac1n\big(X^\top(Y-X\theta)\big)_j\Big|\le r\ \ \text{if}\ \theta_j=0.$$
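As a concrete numerical illustration of the Lasso problem and the Dantzig-type constraint on its solutions, the following sketch (all sizes, seeds and constants are illustrative) fits the Lasso with scikit-learn, whose objective $\frac{1}{2n}|Y-X\theta|_2^2+\alpha|\theta|_1$ equals $\Phi(\theta)/2$ when $\alpha=r$, and checks that the fitted vector satisfies $|\frac1n X^\top(Y-X\theta)|_\infty\le r$ up to numerical tolerance.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative sparse regression instance: n samples, M > n features, s-sparse target.
rng = np.random.default_rng(0)
n, M, s = 50, 100, 3
X = rng.standard_normal((n, M))
theta_star = np.zeros(M)
theta_star[:s] = [2.0, -1.5, 1.0]
sigma = 0.5
Y = X @ theta_star + sigma * rng.standard_normal(n)

A = 4.0
r = A * sigma * np.sqrt(np.log(M) / n)   # tuning parameter r = A*sigma*sqrt(log M / n)

# sklearn's Lasso minimizes (1/(2n))|Y - X theta|_2^2 + alpha*|theta|_1,
# i.e., Phi(theta)/2 with alpha = r, so the minimizers coincide.
lasso = Lasso(alpha=r, fit_intercept=False, max_iter=50000).fit(X, Y)
theta_L = lasso.coef_

# Any Lasso solution satisfies the Dantzig constraint |X^T(Y - X theta)/n|_inf <= r;
# we allow a small numerical slack for the iterative solver.
residual_corr = np.abs(X.T @ (Y - X @ theta_L)) / n
print(bool(residual_corr.max() <= 1.05 * r))
```

The factor 1.05 only absorbs the solver tolerance; the exact minimizer satisfies the constraint with no slack.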

Karim Lounici/Sup-norm rate, sign concentration of Lasso and Dantzig estimators 93
Thus, any vector $\theta\in\hat\Theta^L$ satisfies the Dantzig constraint
$$\Big|\frac1n X^\top(Y-X\theta)\Big|_\infty\le r. \qquad (4)$$
The Lasso estimator is unique if $M < n$, since in this case $\Phi(\theta)$ is strongly convex. However, for $M > n$ it is not necessarily unique. The uniqueness of the Dantzig estimator is not granted either. From now on, we set $\hat\Theta=\hat\Theta^L$ or $\hat\Theta^D$, and $\hat\theta$ denotes an element of $\hat\Theta$. Now we state the assumptions on our model. The first assumption concerns the noise variables.
Assumption 1. The random variables $W_1,\dots,W_n$ are i.i.d. Gaussian $\mathcal N(0,\sigma^2)$.
We also need assumptions on the Gram matrix $\Psi=\frac1n X^\top X$.
Assumption 2. All diagonal elements of $\Psi$ are equal to 1, and
$$\max_{i\neq j}|\Psi_{i,j}|\le\frac{1}{\alpha(1+2c_0)s} \qquad (5)$$
for some integer $s\ge1$ and some constant $\alpha>1$, where $c_0=1$ if we consider the Dantzig estimator, and $c_0=3$ if we consider the Lasso estimator.
The notion of mutual coherence was introduced in [7], where the authors required that $\max_{i\neq j}|\Psi_{i,j}|$ be sufficiently small. Assumption 2 is stated in a slightly weaker form in [1]–[4].
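The mutual coherence quantity is easy to compute; the following sketch (design sizes and normalization are illustrative, and the displayed check is not the exact constant of Assumption 2) evaluates $\Psi$ and its maximal off-diagonal entry for a random design.

```python
import numpy as np

# Illustrative check of mutual coherence for Psi = X^T X / n (sizes are arbitrary).
rng = np.random.default_rng(1)
n, M = 200, 50
X = rng.standard_normal((n, M))
X /= np.sqrt((X ** 2).mean(axis=0))        # rescale columns so that Psi_{j,j} = 1

Psi = X.T @ X / n
coherence = np.abs(Psi - np.eye(M)).max()  # = max_{i != j} |Psi_{i,j}| since the diagonal is 1
print(bool(coherence < 1.0))
```

For i.i.d. Gaussian designs the coherence is typically of order $\sqrt{(\log M)/n}$, which is why conditions like (5) can hold for moderately sparse targets.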
Theorem 1 states that in high dimensions $M$ the set of estimators $\hat\Theta$ is necessarily well concentrated around the vector $\theta^*$. A similar phenomenon was already observed in [1], cf. Remark 1, page 21, for concentration in $l_p$ norms, $1\le p\le2$. Note that $c_2$ in Theorem 1 is an absolute constant. Using Theorem 1, we can easily prove the consistency of the Lasso and Dantzig estimators simultaneously as $n\to\infty$. We allow the quantities $s$, $M$, $\hat\Theta$, $\theta^*$ to vary with $n$. In particular, we assume that $M\to\infty$ and $(\log M)/n\to0$ as $n\to\infty$, and that Assumptions 1 and 2 hold true for any $n$. Then we have
$$\sup_{\hat\theta\in\hat\Theta}|\hat\theta-\theta^*|_\infty\to0$$
in probability as $n\to\infty$. The condition $(\log M)/n\to0$ means that the number of parameters cannot grow arbitrarily fast as $n\to\infty$. We have the restriction $M=o(\exp(n))$, which is natural in this context. A result on $l_\infty$ consistency of the Lasso has been previously stated in Theorem 3 of [21], where $\hat\theta^L$ was assumed to be unique, under another assumption on the matrix $\Psi$. It is not directly related to our Assumption 2, but can be deduced from a restricted version of Assumption 2 where $\alpha$ is taken substantially larger than 1. The result in [21] is a trivial consequence of the $l_2$ consistency, and has therefore the rate $|\hat\theta^L-\theta^*|_\infty=O_P(s^{1/2}r)$, which is slower than the correct rate given in Theorem 1. In fact, the rate in [21] depends on the unknown sparsity $s$, which is not the case in Theorem 1. Note also that Theorem 3 in [21] concerns the Lasso only, whereas our result covers the Lasso and Dantzig estimators simultaneously.
We now study the sign consistency. We make the following assumption on $\rho=\min\{|\theta^*_j|:\theta^*_j\neq0\}$, the minimal magnitude of the non-zero components of $\theta^*$.
Assumption 3. We have $\rho > c_1 r$.
We will take $r=A\sigma\sqrt{(\log M)/n}$. Similar assumptions on $\rho$ can be found in the works on sign consistency of the Lasso estimator mentioned above. More precisely, the lower bound on $\rho$ is of order $s^{1/4}r^{1/2}$ in [14], $n^{-\delta/2}$ with $0<\delta<1$ in [20; 22], $\sqrt{\log(Mn)/n}$ in [2] and $\sqrt{s}\,r$ in [21]. Note that our assumption is the least restrictive.
We now introduce the thresholded Lasso and Dantzig estimators. For any $\hat\theta\in\hat\Theta$, the associated thresholded estimator $\tilde\theta\in\mathbb{R}^M$ is defined by setting to zero all components of $\hat\theta$ whose absolute value falls below a threshold chosen of the order of $r$. Denote by $\tilde\Theta$ the set of all such $\tilde\theta$. We first have the following non-asymptotic result, which we call the sign concentration property.
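The thresholding step can be sketched as follows; the estimate, the threshold level and the constants below are made up for illustration and are not the exact quantities of the theorem.

```python
import numpy as np

# Componentwise thresholding: keep a coefficient only if its magnitude exceeds `level`.
def threshold(theta_hat, level):
    out = theta_hat.copy()
    out[np.abs(out) < level] = 0.0
    return out

theta_star = np.array([2.0, -1.5, 0.0, 0.0, 1.0])
theta_hat = np.array([1.9, -1.4, 0.05, -0.08, 0.9])  # estimate with small spurious entries
r = 0.2                                              # illustrative threshold level
theta_tilde = threshold(theta_hat, r)

# Sign concentration: after thresholding, sign(theta_tilde) = sign(theta_star).
print(np.array_equal(np.sign(theta_tilde), np.sign(theta_star)))  # True
```

The point is that thresholding removes the small spurious components, which is exactly why the non-zero components of $\theta^*$ must dominate the threshold.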
Theorem 2 guarantees that every vector $\tilde\theta\in\tilde\Theta$ shares the same signs as $\theta^*$ with high probability. Letting $n$ and $M$ tend to $\infty$, we can deduce from Theorem 2 an asymptotic result under the following additional assumption. Then the following asymptotic result, called sign consistency, follows immediately from Theorem 2. The sign consistency of the Lasso was proved in [13; 22] under the Strong Irrepresentable Condition on the matrix $\Psi$, which is somewhat different from our assumption. Papers [13; 22] assume a lower bound on $\rho$ of order $n^{-\delta/2}$ with $0<\delta<1$, whereas our Assumption 3 is less restrictive. Note also that these papers assume $\hat\theta^L$ to be unique. Wainwright [20] does not assume $\hat\theta^L$ to be unique and discusses sign consistency of the Lasso under a mutual coherence assumption on the matrix $\Psi$ and the following condition on the lower bound: $\sqrt{(\log M)/n}=o(\rho)$ as $n\to\infty$, which is more restrictive than our Assumption 3. In particular, Proposition 1 in [20] states that, if the sequence of $\theta^*$ satisfies the above condition for all $n$ large enough, then
$$P\big(\exists\,\hat\theta^L\in\hat\Theta^L\ \text{s.t.}\ \mathrm{sign}(\hat\theta^L)=\mathrm{sign}(\theta^*)\big)\to1, \quad n\to\infty.$$
This result does not guarantee sign consistency for all the estimators $\hat\theta^L\in\hat\Theta^L$, but only for some unspecified subsequence that is not necessarily the one chosen in practice. By contrast, Corollary 1 guarantees that all the thresholded Lasso and Dantzig estimators share the same sign vector as $\theta^*$ asymptotically. It follows from this result that any solution selected by the minimization algorithm is covered, and that the case $M>n$, where the set $\hat\Theta$ is not necessarily reduced to a unique estimator, can still be treated. We note also that the papers mentioned above treat sign consistency for the Lasso only, whereas we prove it simultaneously for the Lasso and Dantzig estimators. The improvement in the conditions that we obtain is probably due to the fact that we consider thresholded Lasso and Dantzig estimators. In addition, note that not only the consistency results, but also exact non-asymptotic bounds are provided by Theorems 1 and 2.

Convergence rate and sign consistency under a general noise
In the literature on Lasso and Dantzig estimators, the noise is usually assumed to be Gaussian [1; 6; 13; 20; 21] or to admit a finite exponential moment [2; 14]. The exception is the paper by Zhao and Yu [22], who proved the sign consistency of the Lasso when the noise admits a finite moment of order $2k$, where $k\ge1$ is an integer. An interesting question is to determine whether the results of the previous section remain valid under less restrictive assumptions on the noise. In this section, we only assume that the random variables $W_i$, $i=1,\dots,n$, are independent with zero mean and finite variance $E[W_i^2]\le\sigma^2$. We show that the results remain similar. We need the following assumption.
Assumption 5. The matrix $X$ is such that
$$\frac1n\sum_{i=1}^n\max_{1\le j\le M}X_{i,j}^2\le c'$$
for some constant $c'>0$.
For example, if all $X_{i,j}$ are bounded in absolute value by a constant uniformly in $i,j$, then Assumption 5 is satisfied. The next theorem gives the $l_\infty$ rate of convergence of the Lasso and Dantzig estimators under this mild noise assumption.
Theorem 3. Take $r=\sigma\sqrt{(\log M)^{1+\delta}/n}$, with $\delta>0$, and let Assumptions 2 and 5 be satisfied. Then
$$P\Big(\sup_{\hat\theta\in\hat\Theta}|\hat\theta-\theta^*|_\infty\le c_2\,r\Big)\ \ge\ 1-\frac{c}{(\log M)^{\delta}},$$
where $c_2$ is defined in Theorem 1, and $c>0$ is a constant depending only on $c'$. Therefore, under the bounded second moment noise assumption, the $l_\infty$ convergence rate is only slightly slower than the one obtained under the Gaussian noise assumption, and the concentration phenomenon is less pronounced. If we assume that $(\log M)^{1+\delta}/n\to0$ as $n\to\infty$ and that Assumptions 2, 3 and 5 hold true for any $n$ with $r=\sigma\sqrt{(\log M)^{1+\delta}/n}$, then the sign consistency of the thresholded Lasso and Dantzig estimators follows from our Theorem 3 in the same way as we have proved Theorem 2 and Corollary 1. Zhao and Yu [22] stated in their Theorem 3 a result on the sign consistency of the Lasso under the finite variance assumption on the noise. They assumed $\hat\theta^L$ to be unique and the matrix $X$ to satisfy the condition $\max_{1\le i\le n}\big(\sum_{j=1}^M X_{i,j}^2\big)/n\to0$ as $n\to\infty$. This condition is rather strong. It does not hold if $M>n$ and all the $X_{i,j}$ are bounded in absolute value by a constant. In addition, [22] assumes that the dimension $M=O(n^{\delta})$ with $0<\delta<1$, whereas we only need $M=o(\exp(n^{1/(1+\delta)}))$ with $\delta>0$. Note also that [22] proves the sign consistency for the Lasso only, whereas we prove it for the thresholded Lasso and Dantzig estimators.
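A small Monte Carlo sketch of the phenomenon behind this section (the Student-$t$ noise, sizes and replication counts are arbitrary illustrative choices): even with heavy-tailed, finite-variance noise, the stochastic term $\max_j|Z_j|$ with $Z_j=\frac1n\sum_i X_{i,j}W_i$ shrinks as $n$ grows.

```python
import numpy as np

# Mean of max_j |Z_j| over repetitions, for a random Gaussian design and
# Student-t noise with 3 degrees of freedom (heavy tails, finite variance).
rng = np.random.default_rng(2)
M, reps = 100, 200

def mean_max_corr(n):
    vals = []
    for _ in range(reps):
        X = rng.standard_normal((n, M))
        W = rng.standard_t(df=3, size=n)
        Z = X.T @ W / n                  # Z_j = (1/n) sum_i X_ij W_i
        vals.append(np.abs(Z).max())
    return float(np.mean(vals))

m_small, m_large = mean_max_corr(100), mean_max_corr(400)
print(m_large < m_small)                 # larger n gives a smaller sup-norm noise term
```

The observed decay is roughly $\sqrt{(\log M)/n}$ per repetition on average, in line with the (slightly slower) rate of Theorem 3.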

Proofs
We begin by stating and proving two preliminary lemmas. The first lemma originates from Lemma 1 of [3] and Lemma 2 of [1].
Lemma 1. Let Assumption 1 and condition (5) of Assumption 2 be satisfied, and take $r=A\sigma\sqrt{(\log M)/n}$. Here $\hat\Theta$ denotes either $\hat\Theta^L$ or $\hat\Theta^D$. Then we have, on an event of probability at least $1-M^{1-A^2/8}$, that
$$\sup_{\hat\theta\in\hat\Theta}\Big|\frac1n X^\top X(\hat\theta-\theta^*)\Big|_\infty\le\frac{3r}{2}$$
and, for all $\hat\theta\in\hat\Theta$,
$$|\Delta_{J^c}|_1\le c_0|\Delta_J|_1,$$
where $\Delta=\hat\theta-\theta^*$, $J=\{j:\theta^*_j\neq0\}$, $c_0=1$ for the Dantzig estimator and $c_0=3$ for the Lasso.
Proof. Define the random variables $Z_j=\frac1n\sum_{i=1}^n X_{i,j}W_i$, $1\le j\le M$, and the event
$$A=\Big\{\max_{1\le j\le M}|Z_j|\le\frac r2\Big\}.$$
Since $\Psi_{j,j}=1$, each $Z_j$ is Gaussian $\mathcal N(0,\sigma^2/n)$. Standard inequalities on the tail of Gaussian variables yield
$$P(A^c)\le\sum_{j=1}^M P\big(|Z_j|>r/2\big)\le M\exp\Big(-\frac{nr^2}{8\sigma^2}\Big)=M^{1-A^2/8}.$$
On the event $A$, we have $\big|\frac1n X^\top(Y-X\theta^*)\big|_\infty=\max_{1\le j\le M}|Z_j|\le r/2$, so that $\theta^*$ satisfies the Dantzig constraint. Any vector $\hat\theta$ in $\hat\Theta^L$ or $\hat\Theta^D$ satisfies the Dantzig constraint (4). Thus we have on $A$ that
$$\sup_{\hat\theta\in\hat\Theta}\Big|\frac1n X^\top X(\hat\theta-\theta^*)\Big|_\infty\le\Big|\frac1n X^\top(Y-X\hat\theta)\Big|_\infty+\Big|\frac1n X^\top(Y-X\theta^*)\Big|_\infty\le r+\frac r2=\frac{3r}{2}.$$
Now we prove the second inequality. Set $J=\{j:\theta^*_j\neq0\}$. For any $\hat\theta^D\in\hat\Theta^D$, we have by definition $|\hat\theta^D|_1\le|\theta^*|_1$; thus, with $\Delta=\hat\theta^D-\theta^*$,
$$|\Delta_{J^c}|_1=|\hat\theta^D_{J^c}|_1\le|\theta^*_J|_1-|\hat\theta^D_J|_1\le|\Delta_J|_1.$$
Consider now the Lasso estimators. By definition, we have for any $\hat\theta^L\in\hat\Theta^L$
$$\frac1n|Y-X\hat\theta^L|_2^2+2r|\hat\theta^L|_1\le\frac1n|Y-X\theta^*|_2^2+2r|\theta^*|_1.$$
Developing the left-hand side of the above inequality, we get
$$\frac1n|X(\hat\theta^L-\theta^*)|_2^2\le2\sum_{j=1}^M Z_j(\hat\theta^L_j-\theta^*_j)+2r|\theta^*|_1-2r|\hat\theta^L|_1.$$
On the event $A$, we have for any $\hat\theta^L\in\hat\Theta^L$
$$0\le\frac1n|X(\hat\theta^L-\theta^*)|_2^2\le r|\hat\theta^L-\theta^*|_1+2r|\theta^*|_1-2r|\hat\theta^L|_1.$$
Adding $r|\hat\theta^L-\theta^*|_1$ on both sides, we get for any $\hat\theta^L\in\hat\Theta^L$, with $\Delta=\hat\theta^L-\theta^*$,
$$r|\Delta|_1\le2r|\Delta|_1+2r|\theta^*|_1-2r|\hat\theta^L|_1,$$
which, together with $|\Delta_{J^c}|_1=|\hat\theta^L_{J^c}|_1$ and $|\theta^*_J|_1-|\hat\theta^L_J|_1\le|\Delta_J|_1$, yields $|\Delta_{J^c}|_1\le3|\Delta_J|_1$.
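The probability of the event $A$ can be probed numerically. In this sketch we take $A=\{\max_j|Z_j|\le r/2\}$ (our reading of the proof) and compare the empirical frequency of $A$ under Gaussian noise with the union bound $1-M^{1-A^2/8}$; all sizes and constants are illustrative.

```python
import numpy as np

# Empirical frequency of the event {max_j |Z_j| <= r/2} versus the Gaussian
# union bound 1 - M^{1 - A^2/8}; the constant A of the paper is A_const here.
rng = np.random.default_rng(3)
n, M, sigma, A_const = 100, 50, 1.0, 4.0
r = A_const * sigma * np.sqrt(np.log(M) / n)

reps, hits = 200, 0
for _ in range(reps):
    X = rng.standard_normal((n, M))
    W = sigma * rng.standard_normal(n)
    Z = X.T @ W / n                      # Z_j = (1/n) sum_i X_ij W_i
    hits += np.abs(Z).max() <= r / 2

empirical = hits / reps
lower_bound = 1 - M ** (1 - A_const ** 2 / 8)   # = 1 - 1/M here, since A^2/8 = 2
print(bool(empirical >= lower_bound))
```

With $A=4$ the bound predicts the event fails with probability at most $1/M$, and the simulation typically sees even fewer failures, since the union bound is conservative.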
Lemma 2. Let Assumption 2 be satisfied. Then, for any subset $J$ of $\{1,\dots,M\}$ such that $|J|\le s$ and any $\lambda\in\mathbb{R}^M$ such that $|\lambda_{J^c}|_1\le c_0|\lambda_J|_1$, we have
$$\frac1n|X\lambda|_2^2\ \ge\ \Big(1-\frac{(1+c_0)^2}{\alpha(1+2c_0)}\Big)|\lambda_J|_2^2.$$
Proof. For any subset $J$ of $\{1,\dots,M\}$ such that $|J|\le s$ and $\lambda\in\mathbb{R}^M$ such that $|\lambda_{J^c}|_1\le c_0|\lambda_J|_1$, we have
$$\frac1n|X\lambda|_2^2=\lambda^\top\Psi\lambda\ \ge\ |\lambda|_2^2-\max_{i\neq j}|\Psi_{i,j}|\,|\lambda|_1^2\ \ge\ |\lambda_J|_2^2-\frac{(1+c_0)^2}{\alpha(1+2c_0)s}|\lambda_J|_1^2,$$
where $\lambda_J=(\lambda_j)_{j\in J}$ denotes the components of the vector $\lambda$ on $J$. This yields
$$\frac1n|X\lambda|_2^2\ \ge\ \Big(1-\frac{(1+c_0)^2}{\alpha(1+2c_0)}\Big)|\lambda_J|_2^2.$$

We have used Assumption 2 in the second line, the inequality $|\lambda_{J^c}|_1\le c_0|\lambda_J|_1$ in the third line, and the fact that $|\lambda_J|_1\le\sqrt{|J|}\,|\lambda_J|_2\le\sqrt{s}\,|\lambda_J|_2$ in the last line.
Proof of Theorem 1. For all $1\le j\le M$ and $\hat\theta\in\hat\Theta$ we have
Proof of Theorem 3. The proof of Theorem 3 is similar to that of Theorem 1, up to a modification of the bound on $P(A^c)$ in Lemma 1. Recall that $Z_j=n^{-1}\sum_{i=1}^n X_{i,j}W_i$, $1\le j\le M$, and that the event $A$ is defined by $A=\{\max_{1\le j\le M}|Z_j|\le r/2\}$. The Markov inequality yields
$$P(A^c)\le\frac{4}{r^2}\,E\Big[\max_{1\le j\le M}Z_j^2\Big].$$
Then we use Lemma 3 given below with $p=\infty$ and the random vectors $Y_i=(X_{i,1}W_i/n,\dots,X_{i,M}W_i/n)\in\mathbb{R}^M$, $i=1,\dots,n$. We get
$$E\Big[\max_{1\le j\le M}Z_j^2\Big]\le\bar c\,\log M\sum_{i=1}^n E\Big[\max_{1\le j\le M}\frac{X_{i,j}^2W_i^2}{n^2}\Big]\le\frac{\bar c\,\sigma^2\log M}{n^2}\sum_{i=1}^n\max_{1\le j\le M}X_{i,j}^2,$$
where $\bar c>0$ is an absolute constant. Taking $r=\sigma\sqrt{(\log M)^{1+\delta}/n}$ and using Assumption 5 yields
$$P(A^c)\le\frac{c}{(\log M)^{\delta}},$$
where $c>0$ is a constant depending only on $c'$.
The following result is Lemma 5.2.2, page 188, of [15].
Lemma 3. Let $Y_1,\dots,Y_n\in\mathbb{R}^M$ be independent random vectors with zero means and finite variance, and let $M\ge3$. Then for every $p\in[2,\infty]$ we have
$$E\Big|\sum_{i=1}^n Y_i\Big|_p^2\ \le\ \bar c\,\min(p,\log M)\sum_{i=1}^n E|Y_i|_p^2,$$
where $\bar c>0$ is an absolute constant.