Honest variable selection in linear and logistic regression models via $\ell_1$ and $\ell_1+\ell_2$ penalization

This paper investigates correct variable selection in finite samples via $\ell_1$ and $\ell_1+\ell_2$ type penalization schemes. The asymptotic consistency of variable selection immediately follows from this analysis. We focus on logistic and linear regression models. The following questions are central to our paper: given a level of confidence $1-\delta$, under which assumptions on the design matrix, for which strength of the signal and for what values of the tuning parameters can we identify the true model at the given level of confidence? Formally, if $\widehat{I}$ is an estimate of the true variable set $I^*$, we study conditions under which $\mathbb{P}(\widehat{I}=I^*)\geq 1-\delta$, for a given sample size $n$, number of parameters $M$ and confidence $1-\delta$. We show that in identifiable models, both methods can recover coefficients of size $\frac{1}{\sqrt{n}}$, up to small multiplicative constants and logarithmic factors in $M$ and $\frac{1}{\delta}$. The advantage of the $\ell_1+\ell_2$ penalization over the $\ell_1$ is minor for the variable selection problem, for the models we consider here. Whereas the former estimates are unique, and become more stable for highly correlated data matrices as one increases the tuning parameter of the $\ell_2$ part, too large an increase in this parameter value may preclude variable selection.


Introduction
The literature on various theoretical aspects of $\ell_1$ empirical risk minimization has enjoyed substantial growth over the last few years, partly as a necessity to complement the flourishing field of convex optimization. The main attraction, from both theoretical and computational perspectives, is the proven ability of such methods to recover sparse approximations of the true underlying model when the number of parameters is large relative to the sample size. The principal theoretical topics of interest are therefore focused on optimality properties that involve the notion of sparsity. The theoretical properties of the $\ell_1+\ell_2$ penalized estimates, sometimes referred to as elastic net estimates, a phrase introduced by [23] in linear models, have not been investigated for the models we consider. In contrast, the properties of the $\ell_1$ penalized estimates, typically referred to as Lasso-type estimates, have received considerable attention. The topics studied range from finite sample results concerning sparsity oracle inequalities for the risk of the estimators, in regression and classification, e.g., [4,5,19,26,20,2,11], to the asymptotic behavior of the estimates, including the consistency of subset selection, e.g., [9,10,13,22,21,25,6,3,17,12,14].
This work is motivated by the emergence of a large number of variations and improvements of the $\ell_1$ penalization schemes in regression and classification. To appreciate the need for such variations, it is important to investigate the limitations of the original method. When the number of variables $M$ is large relative to $n$, an asymptotic analysis of the variable selection problem may obscure issues that arise in finite samples. In this paper we investigate the finite sample accuracy of variable selection via the $\ell_1$ and the closely related $\ell_1+\ell_2$ penalization schemes in regression models. We also discuss asymptotic alternatives and asymptotic consequences of our results. Our goal is to review existing results, and to offer a self-contained, back-to-back analysis of these important models and respective penalization schemes.
Formally, let $(X_i, Y_i)$, $1 \le i \le n$, be i.i.d. pairs distributed as $(X, Y)$ with probability measure $\mathbb{P}$, where $Y \in \{0, 1\}$ or $Y \in \mathbb{R}$ and $X = (X_1, \ldots, X_M) \in \mathbb{R}^M$. We assume that $\mathbb{E}(Y|X = x) = g\big(\sum_{j\in I^*} \beta^*_j x_j\big)$, where $I^* \subseteq \{1, \ldots, M\}$ is an unknown subset and $g$ is a known link function. In our analysis, $M$ is allowed to depend on, and be larger than, the sample size $n$, and the size of $I^*$ may depend on $n$. The goal of this paper is to provide an understanding of the merits and possible limitations of variable selection via these two penalization schemes when used to answer the following central questions: given a level of confidence $1-\delta$, given the number of variables $M$ and the sample size $n$, under which assumptions on the design matrix, for which strength of the signal and for what values of the tuning parameters do we identify the true model at the given level of confidence? Formally, if $\widehat{I}$ is an estimate of $I^*$, we study conditions under which $\mathbb{P}(\widehat{I} = I^*) \ge 1-\delta$.
We will focus on variable selection in logistic regression, corresponding to the link function $g(z) = e^z/(1 + e^z)$, and also present a full analysis of the problem for linear models, corresponding to $g(z) = z$, to facilitate the comparison of the results. We will conduct separate analyses of the corresponding estimates, as different arguments are needed for models with possibly unbounded response, such as the linear model.
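To make the sampling model concrete, the following is a minimal simulation sketch of $\mathbb{E}(Y|X=x) = g(\sum_{j\in I^*}\beta^*_j x_j)$ for the two links considered here. The function name `simulate`, the uniform design on $[-1,1]$ (so that the covariates are bounded) and the Gaussian noise in the linear case are illustrative assumptions, not part of the paper.

```python
import numpy as np

def simulate(n, M, support, beta_vals, link="logistic", noise_sd=1.0, seed=0):
    """Draw (X_i, Y_i) with E(Y | X = x) = g(sum_{j in I*} beta*_j x_j)."""
    rng = np.random.default_rng(seed)
    # Bounded covariates, |X_ij| <= 1 (the uniform design is an assumption).
    X = rng.uniform(-1.0, 1.0, size=(n, M))
    # beta* is zero outside the true set I* (here: `support`).
    beta_star = np.zeros(M)
    beta_star[list(support)] = beta_vals
    eta = X @ beta_star
    if link == "logistic":
        # g(z) = e^z / (1 + e^z), Y in {0, 1}
        p = 1.0 / (1.0 + np.exp(-eta))
        Y = rng.binomial(1, p).astype(float)
    elif link == "linear":
        # g(z) = z, Y real valued (Gaussian noise is an assumption)
        Y = eta + noise_sd * rng.standard_normal(n)
    else:
        raise ValueError("link must be 'logistic' or 'linear'")
    return X, Y, beta_star

# Example with M larger than n and k* = 3 true variables.
X, Y, beta_star = simulate(n=50, M=200, support=[0, 3, 7],
                           beta_vals=[1.0, -2.0, 1.5])
```

The example call uses $M = 200 > n = 50$, the regime in which the questions above are of interest.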
We denote by $\beta^*$ the vector in $\mathbb{R}^M$ with components $\beta^*_j$ for $j \in I^*$ and zero otherwise. We begin our analysis in Section 2 by establishing upper bounds on the $\ell_1$ distance between the Lasso and elastic net estimators, respectively, and the parameter $\beta^*$. These results are connected with the sparsity oracle inequalities recently obtained for the Lasso estimators in [4] and [2], in linear regression models, and [19], in generalized linear regression models. The focus in these works is on the predictive performance of the estimators, rather than on the accuracy of variable selection, as considered here. For us, these results are an intermediate, albeit essential, step in discussing the conditions under which an estimate $\widehat{I}$ of the set $I^*$ satisfies $\mathbb{P}(\widehat{I} = I^*) \ge 1-\delta$. It is intuitively clear that if the estimates $\widehat{\beta}$ are too far from $\beta^*$, we cannot hope to recover the true coefficient set $I^*$ with high probability. It is interesting to note, however, that under some conditions on the design matrix, we can still estimate the true subset correctly even if the distance between $\beta^*$ and the estimates is not close to zero, but can still be controlled as in Section 2. Although this may appear surprising, it is this phenomenon that sets the variable selection problem apart from the problem of estimating $\beta^*$ itself well: here we aim at identifying a nonzero coefficient. Even if the estimate of this coefficient is relatively far from the true value, what matters is only whether it is different from zero, not whether it is very close to the truth.
The rest of the paper is organized as follows. In Section 2.1 we revisit the conditions on the design matrix under which sparsity oracle inequalities for the Lasso estimates have been previously established and discuss weaker conditions. In Sections 2.2 and 2.3 we show that these results continue to hold under the weaker conditions. If one considers a slight modification of the $\ell_1$ penalty that consists in the addition of a properly scaled $\ell_2$ term, one can further weaken the requirements on the design matrix, while maintaining the sparsity of the resulting estimator. This motivates the study of the $\ell_1+\ell_2$ estimates, which have not been, to the best of our knowledge, investigated theoretically from this perspective in these models. Section 2.3 also contains an alternative asymptotic analysis of the $\ell_1$ norm of the difference $\widehat{\beta}-\beta^*$ for estimates in logistic regression, motivated by the presence of possibly large constants in the finite sample oracle bounds. Under weak conditions on a weighted version of the design matrix, we obtain improved oracle bounds that hold with probability converging to one.
In Section 3, which is central to our paper, we discuss in detail when the Lasso and the elastic net methods can provide accurate variable selection, in linear and logistic regression models. We show that obtaining results of the type $\mathbb{P}(I^* = \widehat{I}) \ge 1-\delta$ depends crucially on a combination of conditions on the design matrix and the signal strength. This analysis complements the existing asymptotic results for Lasso estimates in linear regression models, and shows that similar phenomena occur in generalized linear models, for which the variable selection problem has not been investigated from this perspective; we refer to the very recent work in [17] for related results in binary graphical models. Moreover, we provide the parallel study of the elastic net estimates, and investigate to what extent they can be used for variable selection. We note that in a nonasymptotic framework, the study of $\mathbb{P}(I^* = \widehat{I})$ is well posed only if $\widehat{I}$ is unique. Since the elastic net estimates of $\beta^*$ are unique, as shown in Appendix B, so is the corresponding $\widehat{I}$. Recall that, in contrast, the Lasso-type estimators of $\beta^*$ may not be unique. However, in that case, the problem studied here is still well posed: even when the Lasso estimates of $\beta^*$ are not unique, the corresponding $\widehat{I}$ is. This property has been used implicitly in [15], and then in [13], for linear models, without an explicit proof, and has not been investigated outside linear models. For completeness, we present a proof of this result in Appendix B.
In the Conclusions section we summarize our findings and discuss the relative merits of the Lasso and elastic net estimates. The proofs of our main results are in Appendix A. Additional technical results are collected in Appendix B.

Notation
In the following sections we will denote by $\widehat{\beta}$ the penalized least squares estimates, for both the $\ell_1$ and $\ell_1+\ell_2$ penalties and, similarly, by $\widehat{\beta}$ the penalized logistic regression estimates, for either penalty. The estimates are of course different, but we opted for the same notation to keep the exposition simple. It will always be clear from the context to which combination of model and penalty they correspond. In the same way, $\widehat{I}$ will always denote the set of selected variables, and $I^*$ will denote the set of truly associated variables. We denote by $k^*$ the cardinality of $I^*$. For simplicity, we assume that the observations on the $X$ variables are normalized and centered, that is, $\frac{1}{n}\sum_{i=1}^n X_{ij}^2 = 1$ and $\frac{1}{n}\sum_{i=1}^n X_{ij} = 0$, for all $j$. This is in no way crucial, but it allows for cleaner results and easier interpretation of the assumptions. We will also assume that for all $i$ and $j$ the variables $X_{ij}$ are bounded by a common constant $L > 0$, with probability 1. For any vector $a \in \mathbb{R}^M$ we denote by $|a|_1 = \sum_{j=1}^M |a_j|$ the $\ell_1$ norm of $a$.

2. Sparse balls for the $\ell_1$ and $\ell_1+\ell_2$ penalized estimates
In this section we establish upper bounds on the $\ell_1$ distance $|\widehat{\beta}-\beta^*|_1$ for the Lasso and elastic net estimates, in linear and logistic regression, respectively. We show that these bounds are, up to constants that we make precise below, of the form $k^* r$, where $r$ is the tuning parameter corresponding to the $\ell_1$ penalty and $k^*$ is the number of non-zero components of $\beta^*$. Since the $\ell_1$ norm is a sum of $M$ terms, but the bound only involves the unknown and possibly much lower dimension $k^*$, we call the corresponding balls sparse.

Conditions on the design matrix
In [5] and [19] it was shown that the Lasso-type estimates belong to sparse $\ell_1$ balls centered at the true parameter, in linear models and generalized linear models, respectively. These results were established under variants of a condition on the design matrix typically referred to as the mutual coherence condition, introduced in [8]. We state below a mild version of this condition, which we will also use in Section 3 of this paper. Let

Condition Identif: We assume that there exists a constant $0 < d \le 1$ such that

This condition guarantees separation of the variables in the true set $I^*$ from one another and from the rest, where the degree of separation is measured in terms of the size of the correlation coefficients. We regard it here as an identifiability condition. It will be used as a sufficient condition for correct variable selection in Section 3 below. However, it is not needed for sparse oracle inequalities, as we detail below. In Sections 2.2 and 2.3 below we show that Condition Identif can be relaxed if one is only interested in prediction or in the global behavior of the estimates measured, as in these sections, by the $\ell_1$ distance to the truth. To formulate the weaker condition let $\alpha > 0$, $\epsilon \ge 0$ be given. Define the set

Let $\Sigma$ be the $M \times M$ matrix with entries $\rho_{kj}$.

Condition Stabil. Let $\alpha > 0$ and $\epsilon \ge 0$ be given. There exists $0 < b \le 1$ such that

Remark. We denote generically one of the estimates of $\beta^*$ studied below by $\widehat{\beta}$. We will motivate the definition of the set $V_{\alpha,\epsilon}$ by showing, in the course of the proofs of Theorems 2.2-2.7, that $\widehat{\beta}-\beta^* \in V_{\alpha,\epsilon}$, with high probability, for specific parameters $\alpha$ and $\epsilon$. For instance, we will show that $\alpha$ is either 3, for the $\ell_1$ penalized estimates, or 4, for the $\ell_1+\ell_2$ penalized estimates. The parameter $\epsilon$ will be either zero, for the least squares estimates, or exponentially small, for each $M$ and $n$, in the case of the logistic regression estimates. The term $\epsilon$ in the definition of $V_{\alpha,\epsilon}$ is needed for purely technical reasons, and does not affect the results or their interpretation. Condition Stabil corresponding to $\alpha = 3$ and $\epsilon = 0$ has been introduced, for an analysis similar to the one we conduct here, by [2], for a comparative study of the predictive performance of the Dantzig and Lasso estimators in linear models.
One possible intuitive interpretation of Condition Stabil is as follows. If $\epsilon = 0$, Condition Stabil is immediately implied by $\mathbb{P}(\Sigma - bD \ge 0) = 1$, where $D$ is the $M \times M$ matrix containing the $k^* \times k^*$ identity matrix corresponding to indices in $I^*$, and with zero elements otherwise. This asserts that the correlation matrix remains positive semidefinite if we decrease the diagonal elements corresponding to the true variables slightly, and leave all other entries unchanged. Since this modification affects only $k^*$ of the $M^2$ entries, it can be regarded as a stability requirement on the correlation structure. Condition Stabil is even milder than $\mathbb{P}(\Sigma - bD \ge 0) = 1$, since it is only required to hold for $v \in V_{\alpha,\epsilon}$, for some given $\alpha$ and $\epsilon$.
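For a given data matrix, the sufficient requirement $\Sigma - bD \ge 0$ can be checked numerically through the smallest eigenvalue of $\Sigma - bD$. The sketch below is only an illustration: the function name `stabil_sufficient` and the numerical tolerance are hypothetical, and $\Sigma$ is taken to be the empirical correlation matrix of the centered and normalized columns, which is an assumption about how the $\rho_{kj}$ are computed.

```python
import numpy as np

def stabil_sufficient(X, support, b, tol=1e-10):
    """Check whether Sigma - b*D is positive semidefinite.

    Sigma: empirical correlation matrix of the columns of X (assumption).
    D:     M x M matrix with ones on the diagonal entries indexed by the
           true set I* (`support`) and zeros elsewhere.
    Assumes no column of X is constant.
    """
    n, M = X.shape
    Xc = X - X.mean(axis=0)                       # center each column
    Xs = Xc / np.sqrt((Xc ** 2).mean(axis=0))     # normalize each column
    Sigma = Xs.T @ Xs / n
    D = np.zeros((M, M))
    idx = np.array(sorted(support))
    D[idx, idx] = 1.0
    min_eig = float(np.linalg.eigvalsh(Sigma - b * D).min())
    return bool(min_eig >= -tol), min_eig

# Orthogonal toy design: Sigma is the identity, so any 0 < b <= 1 passes.
X = np.array([[1., 1.], [-1., 1.], [1., -1.], [-1., -1.]])
ok, min_eig = stabil_sufficient(X, support=[0], b=0.5)
```

Since Condition Stabil only needs the quadratic form to be controlled on $V_{\alpha,\epsilon}$, a failure of this eigenvalue check does not by itself mean that Condition Stabil fails.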
The following lemma establishes the relationship between the two conditions, and shows that Condition Stabil is less restrictive. A brief argument establishing this link is also offered in [2], for $\alpha = 3$ and $\epsilon = 0$; we include a full proof here for the general case, for completeness.

Proof. Let $\Sigma^*$ be the $k^* \times k^*$ matrix with entries $\rho_{kj}$, $k, j \in I^*$. For any $v \in \mathbb{R}^M$ denote by $v^*$ the vector in $\mathbb{R}^{k^*}$ obtained from $v$ by retaining only the components corresponding to $I^*$. Then, by Cauchy-Schwarz,

The last inequality follows from Condition Identif, which also implies that

Thus, for instance, for the study of the Lasso estimates in linear models, we have $\alpha = 3$ and $\epsilon = 0$ and so if Condition Identif holds for some $d$, then Condition Stabil holds for $0 < b \le 1 - 7d$, which imposes the restriction $0 < d < \frac{1}{7}$. The results of Sections 2.2 and 2.3 below will be established directly under the less restrictive Condition Stabil, which requires the specification of a constant $b$. Notice that if $b$ is very small, the condition is almost a tautology, as $\Sigma \ge 0$ by construction. However, as will become apparent from the results established below, a very small value of $b$ will increase the radius of the $\ell_1$ balls covering the estimator. This motivates the parallel study of the elastic net estimates. We show that they are less affected by potentially small values of $b$.

Sparse $\ell_1$ balls for estimates in linear regression models
Throughout all sections on linear regression in this paper we assume that the model generating the data is $\mathbb{E}(Y|X = x) = \sum_{j\in I^*} \beta^*_j x_j$, for $X \in \mathbb{R}^M$ and $I^* \subseteq \{1, \ldots, M\}$. This is the most popular model for regression with unbounded response $Y$. It is also becoming increasingly common in regression models with $Y \in \{0, 1\}$, when the data supports it. Its usage in this context dates back to [1].

An $\ell_1$ penalized least squares estimator
We estimate $\beta^*$ by

where $r =: r_{n,M}(\delta)$ is a tuning sequence depending on $n$, $M$ and a user specified parameter $\delta$. In what follows we determine $r$ such that

and we make $C > 0$ precise.
In the following theorem we will use Condition Stabil corresponding to the set $V_{\alpha,\epsilon}$ defined in (2.1) for $\epsilon = 0$ and $\alpha = 3$. Let $\sigma^2 = \mathrm{Var}(Y)$ and recall that $L$ denotes a common bound on

for $\widehat{\beta}$ given in (2.2).
Remark 1. In practice one can replace $\sigma$ in the tuning sequence by an estimator, as discussed in detail in [4].
Remark 2. It is interesting to note that although the results above indicate that the radius of the $\ell_1$ ball is small if $k^* r \le 1$, the proofs make no use of this restriction on $k^*$; in particular $k^* > \sqrt{n}$ is allowed. It is clear that in this case the bounds are large but, perhaps surprisingly, this does not affect the validity of variable selection, for some design matrices. We discuss this in detail in the next section. Theorem 2.2 above shows that the bound on $|\widehat{\beta}-\beta^*|_1$ becomes large if Condition Stabil is satisfied only for very small values of $b$. One remedy is provided by a slightly modified estimator, which retains the sparsity properties of the Lasso estimates, but is less affected by small values of $b$. The modified estimate will be penalized least squares with a combined $\ell_1$ and $\ell_2$ penalty and we discuss it in the next subsection.
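As a computational illustration of an $\ell_1$ penalized least squares estimate, the following sketch uses proximal gradient descent (ISTA). It assumes the criterion has the common form $\frac{1}{n}\sum_i (Y_i - \beta' X_i)^2 + 2r|\beta|_1$; this normalization, the function names `soft_threshold` and `lasso_ista`, and the fixed iteration count are assumptions made for the example, and the algorithm is a generic solver rather than anything prescribed in the paper.

```python
import numpy as np

def soft_threshold(z, t):
    """Componentwise soft thresholding: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, Y, r, n_iter=500):
    """Minimize (1/n)||Y - X beta||^2 + 2 r |beta|_1 (assumed criterion)."""
    n, M = X.shape
    # Lipschitz constant of the gradient of the squared-error term.
    lip = 2.0 * np.linalg.eigvalsh(X.T @ X / n).max()
    step = 1.0 / lip
    beta = np.zeros(M)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - Y) / n
        beta = soft_threshold(beta - step * grad, 2.0 * r * step)
    return beta

# Orthogonal toy design with a single true variable (index 0).
X = np.array([[1., 1.], [-1., 1.], [1., -1.], [-1., -1.]])
Y = X @ np.array([2.0, 0.0])
beta_hat = lasso_ista(X, Y, r=0.5)
selected = set(np.flatnonzero(beta_hat).tolist())   # the estimated set
```

In this orthogonal example the minimizer is the soft-thresholded vector $(2 - r, 0) = (1.5, 0)$: the nonzero coefficient is shrunk away from its true value 2, yet the selected set is still $\{0\}$.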

An $\ell_1+\ell_2$ penalized least squares estimator
We now estimate $\beta^*$ by $\widehat{\beta}$, where

As before, the goal is to find $r =: r_{n,M}(\delta)$ and $c =: c_{n,M}(\delta)$ for which we can construct sparse balls for the estimates.
In the following theorem we will use Condition Stabil corresponding to the set $V_{\alpha,\epsilon}$ defined in (2.1) for $\epsilon = 0$ and $\alpha = 4$.
for some $B > 0$, independent of $n$, and if

for $\widehat{\beta}$ given in (2.3).
Remark. The result above shows that even if Condition Stabil holds with $b$ very close to 0, the bound on $|\widehat{\beta}-\beta^*|_1$ stays finite, for any given $M$ and $n$. Note that it may still be large, since $c$ is restricted to take relatively small values, dictated by the sizes of $r$ and $B$. However, we cannot choose a much larger value for $c$: in that case the $\ell_2$ penalty would become prevalent, and no estimates will be set to zero in finite samples.
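A combined $\ell_1+\ell_2$ penalty can be handled by the same proximal gradient scheme, with the $\ell_2$ term moved into the smooth part of the criterion. The sketch below assumes the criterion $\frac{1}{n}\sum_i (Y_i - \beta' X_i)^2 + 2r|\beta|_1 + c\sum_j \beta_j^2$; this form, the function name `elastic_net_ista` and the iteration count are assumptions for illustration only.

```python
import numpy as np

def soft_threshold(z, t):
    """Componentwise soft thresholding: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_ista(X, Y, r, c, n_iter=500):
    """Minimize (1/n)||Y - X beta||^2 + 2 r |beta|_1 + c ||beta||_2^2
    (assumed criterion) by proximal gradient descent."""
    n, M = X.shape
    # The l2 term adds 2c to the curvature of the smooth part, which is
    # therefore strictly convex for c > 0 even when X'X is singular.
    lip = 2.0 * np.linalg.eigvalsh(X.T @ X / n).max() + 2.0 * c
    step = 1.0 / lip
    beta = np.zeros(M)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - Y) / n + 2.0 * c * beta
        beta = soft_threshold(beta - step * grad, 2.0 * r * step)
    return beta

# Two perfectly correlated columns: X'X is singular, but the criterion
# with c > 0 is strictly convex, so its minimizer is unique.
x = np.array([1., -1., 1., -1.])
X_dup = np.column_stack([x, x])
beta_dup = elastic_net_ista(X_dup, 2.0 * x, r=0.5, c=1.0)
```

With the duplicated column, the signal is split evenly between the two copies, which illustrates the added stability for highly correlated designs under the assumed criterion.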

Sparse $\ell_1$ balls for estimates in logistic regression models
We denote the logistic loss function by

and denote by $Pl(\beta) = \mathbb{E}\,l(\beta; Y, X)$ the associated risk. Define

Throughout all sections on logistic regression we will assume that

(2.4)

An $\ell_1$ penalized logistic regression estimator
We estimate $\beta^*$ by

(2.5)

We will determine the tuning sequence $r = r_{n,M}(\delta)$, different from the one above, for which we can construct sparse balls for these estimators. We will analyze the estimates under the assumption that $p(x)$ is bounded away from zero and one for all $x$. This is implied by:

in combination with the assumption that all $X$ variables are bounded by $L$.
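For illustration, an $\ell_1$ penalized logistic regression estimate can be computed by proximal gradient descent on the averaged logistic loss. The sketch assumes the standard loss $l(\beta; y, x) = -y\beta'x + \log(1 + e^{\beta'x})$ and the criterion $\frac{1}{n}\sum_i l(\beta; Y_i, X_i) + 2r|\beta|_1$; the normalization of the penalty, the function names and the iteration count are assumptions, not taken from the paper.

```python
import numpy as np

def soft_threshold(z, t):
    """Componentwise soft thresholding: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def logistic_objective(beta, X, Y, r):
    """Average logistic loss plus 2 r |beta|_1 (assumed criterion)."""
    eta = X @ beta
    # log(1 + e^eta) computed stably as logaddexp(0, eta)
    loss = np.mean(np.logaddexp(0.0, eta) - Y * eta)
    return loss + 2.0 * r * np.abs(beta).sum()

def logistic_lasso_ista(X, Y, r, n_iter=2000):
    """Proximal gradient descent for the l1 penalized logistic criterion."""
    n, M = X.shape
    # The Hessian of the average loss is bounded by X'X / (4n).
    lip = np.linalg.eigvalsh(X.T @ X / n).max() / 4.0
    step = 1.0 / lip
    beta = np.zeros(M)
    for _ in range(n_iter):
        p = 0.5 * (1.0 + np.tanh(0.5 * (X @ beta)))  # stable e^z / (1 + e^z)
        grad = X.T @ (p - Y) / n
        beta = soft_threshold(beta - step * grad, 2.0 * r * step)
    return beta

# Toy data with bounded covariates and one relevant variable (index 0).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 10))
p_true = 1.0 / (1.0 + np.exp(-3.0 * X[:, 0]))
Y = rng.binomial(1, p_true).astype(float)
beta_hat = logistic_lasso_ista(X, Y, r=0.02)
```

Under the assumed criterion, all coefficients are set to zero whenever $2r$ exceeds the largest absolute component of the gradient of the average loss at $\beta = 0$, which gives a simple upper end for a grid of $r$ values.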
In the following theorem we will use Condition Stabil corresponding to the set $V_{\alpha,\epsilon}$ defined for

with $r$ given below, and for $\alpha = 3$. Also, let $s$ be a constant depending on $L$ and $D$, which decreases with $D$.
Remark. Notice that the term $(1 + \frac{1}{r})\epsilon$ is roughly $\frac{n}{2^{M \vee n}}$ and therefore negligible. As noted above, the bound on $|\widehat{\beta}-\beta^*|_1$ becomes large for very small values of $b$, which motivates the study of the $\ell_1+\ell_2$ penalized estimators in the next section. Also, the constant $1/s$, the exact form of which is given in the course of the proof of this theorem, can be very large for large $D$; similar results, based on different arguments and slightly more restrictive assumptions on $\Sigma$, have also been obtained in [19]. However, if we content ourselves with asymptotic statements, we can obtain an improved bound on $|\widehat{\beta}-\beta^*|_1$, with $1/s$ replaced by a quantity arbitrarily close to 1. These results will hold with probability converging to 1, and under more stringent requirements on the design. They are based on the following fact, which is of independent interest: it establishes the sup-norm consistency of $\widehat{\beta}'x$ as an estimate of $\beta^{*\prime}x$.

Proposition 2.5. Let $\delta =: \delta_n$ be any sequence converging to zero with $n$. If $\max_{j\in I^*} |\beta^*_j| \le B$, for some $B > 0$, independent of $n$, and $k^* r \to 0$, then for any $\eta > 0$ we have

Remark. It is interesting to note that this result holds independently of the assumptions on the design.
The next result is obtained under a condition similar to Condition Stabil, but required to hold for a weighted version of the matrix $\Sigma$. Let $g(z) = e^z/(1+e^z)$. Let $\Sigma_1$ be the $M \times M$ matrix with entries

for some $B > 0$ independent of $n$, and $k^* r \to 0$, then

for $\widehat{\beta}$ given by (2.5) and for a constant $w$ arbitrarily close to one.

An $\ell_1+\ell_2$ penalized logistic regression estimator
In this section we obtain similar results for estimators of $\beta^*$ given by

In the following theorem we will use Condition Stabil corresponding to the set $V_{\alpha,\epsilon}$ defined in (2.1) for

for $r$ given below, and $\alpha = 4$. Let $s > 0$ be the constant given in Theorem 2.4 above.
Theorem 2.7. Assume that Assumption A holds and that Condition Stabil is satisfied for some $0 < b < 1$. Let $B > 0$ be such that $\max_{j\in I^*} |\beta^*_j| \le B$ and take

for $\widehat{\beta}$ given by (2.8).
The comments and remarks of Section 2.2.1 apply here with no change: the $\ell_1+\ell_2$ penalized estimates are more stable, in that the radius of the $\ell_1$ ball covering the estimate is less affected by small values of $b$ and $s$. However, care must be exercised not to choose too large a $c$, as in that case the sparsity properties will be lost. We can also derive, in a similar manner, versions of Proposition 2.5 and Theorem 2.6 for the $\ell_1+\ell_2$ penalized estimate. Since the results are nearly identical we do not include them here.

Correct subset selection
The asymptotic properties of subset selection via the Lasso in linear models have been studied recently by a number of authors: [13] studied selection in Gaussian graphical models, [21] investigated subset selection in linear regression for what were termed incoherent design matrices, [3] studied approximating regression models under design matrices satisfying Condition Identif introduced in Section 2 above and previously discussed in [4], and [25] investigated a three stage procedure in linear models. A nice overview of the connections between incoherent design matrices and matrices satisfying conditions similar to our Condition Identif is given in [14]. An interesting asymptotic analysis, in which one studies the interplay between the sample size $n$, the sparsity level $k^*$ and the number of variables $M$ for average asymptotic consistency in linear regression models with Gaussian design, is presented in [24]. There the coefficient set $I^*$ is assumed to have been selected uniformly at random from $\{1, \ldots, M\}$, and one studies asymptotically the average error probability, where one averages over all possible choices of $I^*$. We refer to the work of [6] for a non-asymptotic investigation of the accuracy of model selection via the Lasso in linear models, but under model assumptions different from ours: the coefficient set $I^*$ is again assumed to have been selected uniformly at random from $\{1, \ldots, M\}$, and conditionally on $I^*$ the signs of $\beta_j$, $j \in I^*$, are assumed to be equally likely to be 1 or -1.
The properties of the Lasso-type estimates used for correct subset selection in logistic regression have not been investigated from the perspective considered here. After finishing this paper, we learned of the recent work of [17], which investigates the closely related topic of asymptotic model selection consistency in binary graphs; we comment on connections with our work in Section 3.2.2.
The finite sample properties of variable selection via the elastic net have not been investigated in either of the models considered here. For a discussion of its usage in linear regression models with different target parameters than those considered here we refer to [23].
We study in this section the non-asymptotic merits of the Lasso and elastic net estimates when used for variable selection. We conclude the section with the asymptotic implications of these results.
All estimates of $\beta^*$ analyzed in Theorems 2.2-2.7 have zero coefficients. These theorems, however, do not necessarily guarantee that the corresponding set of the non-zero coefficients of these estimates is exactly equal to $I^*$, with high probability: we can either omit some of the true variables or include variables that do not belong to $I^*$ while still being able to control the radii of the $\ell_1$ balls. In this section we find estimates of $\beta^*$ that have the properties discussed in Section 2 and for which, in addition, we have $\mathbb{P}(I^* = \widehat{I}) \ge 1-\gamma$, for some given small $\gamma > 0$. Since $\mathbb{P}(I^* = \widehat{I}) \ge 1 - \mathbb{P}(I^* \not\subseteq \widehat{I}) - \mathbb{P}(\widehat{I} \not\subseteq I^*)$, we find the subset $\widehat{I}$ such that

Correct inclusion of all true variables in the selected set
In this section we discuss conditions under which we can obtain results of the type

for some given $\gamma_1 > 0$, for estimates having the properties discussed in Section 2 above. Lemma 3.1 below shows what governs the size of $\mathbb{P}(I^* \subseteq \widehat{I})$. We discuss in detail to what extent we can use the results of Section 2 directly for this study. Recall that the cardinality of $I^*$ is $k^*$.
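The event $\{\widehat{I} = I^*\}$ fails only if a true variable is missed or a spurious one is included, which is what allows the two inclusion events to be studied separately. The sketch below computes the three empirical frequencies over a collection of selected sets, for instance from repeated simulations; the function name `selection_events` and the toy sets are hypothetical.

```python
def selection_events(selected_sets, true_set):
    """Empirical frequencies of exact recovery, of missing a true variable
    (I* not contained in I-hat), and of a false inclusion (I-hat not
    contained in I*)."""
    true_set = frozenset(true_set)
    n_rep = len(selected_sets)
    exact = miss = extra = 0
    for S in selected_sets:
        S = frozenset(S)
        exact += int(S == true_set)
        miss += int(not true_set <= S)    # some true variable was omitted
        extra += int(not S <= true_set)   # some spurious variable was kept
    return exact / n_rep, miss / n_rep, extra / n_rep

# Four hypothetical selected sets for the true set I* = {0, 1}.
runs = [{0, 1}, {0}, {0, 1, 2}, {1, 2}]
p_exact, p_miss, p_extra = selection_events(runs, {0, 1})
# The union bound holds for the empirical frequencies as well.
assert p_exact >= 1.0 - p_miss - p_extra
```

The last run, $\{1, 2\}$, both misses a true variable and includes a spurious one, so it is counted in both error frequencies; this is why the bound is an inequality rather than an identity.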
Lemma 3.1. Let $\beta^*$ and $\widehat{\beta}$ be a combination parameter/estimator from Section 2. Let $\widehat{I}$ be the index set of the non-zero components of $\widehat{\beta}$. Then

Proof. The following display follows directly from the definitions of $\widehat{I}$ and $I^*$.

Detection of large signals
The purpose of this subsection is to point out that the study of $\mathbb{P}(I^* \subseteq \widehat{I})$ via a direct application of the sparse oracle bounds derived in the previous section may lead to suboptimal results.

Remark. The lower bounds on the minimum size of the true coefficients stated in Corollary 3.2 are all of the type

possibly up to the small additive term $\epsilon$ defined in the previous section.
For stable design matrices, when the constant $C$ is close to 1, and if the true model is supported on a space of dimension $k^*$, with very low $k^*$ satisfying $rk^* < 1$, then such lower bounds imply that we can detect moderate sized signals. Clearly, for large $k^*$, the lower bounds on the coefficient size are too conservative, especially since the constant $C$ may also be large. We discuss below when one can weaken this requirement.

Detection of weaker signals
Propositions 3.3 and 3.4 below show that the lower bounds on the signal strength can be significantly weakened under further conditions on the design matrix. The intuition is the following: if the signal is very weak and the true variables are highly correlated with one another and with the rest, one cannot hope to recover the true model with high probability. We will therefore work, for the remainder of this paper, under the assumption that the true model is identifiable, as quantified in Condition Identif stated in Section 2 above. Recall that this condition only requires that the true variables be separated from one another and from the rest, and it does not impose any restrictions on the variables placed outside the true set.
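The kind of separation described above can be inspected on data by computing the largest absolute correlation between a true variable and any other variable. The sketch below returns this quantity only and does not encode the threshold of Condition Identif, which is as stated in Section 2. The function name `max_true_correlation` is hypothetical, and the $\rho_{kj}$ are taken to be empirical correlations of centered, normalized columns, which is an assumption.

```python
import numpy as np

def max_true_correlation(X, support):
    """Largest |rho_kj| over j in the true set I* and k != j.

    Pairs (k, j) with both indices outside I* are ignored, mirroring the
    fact that no restriction is placed on variables outside the true set.
    Assumes no column of X is constant.
    """
    n, M = X.shape
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.sqrt((Xc ** 2).mean(axis=0))
    Sigma = Xs.T @ Xs / n                 # empirical correlation matrix
    worst = 0.0
    for j in support:
        row = np.abs(Sigma[j]).copy()
        row[j] = 0.0                      # exclude the diagonal entry
        worst = max(worst, float(row.max()))
    return worst

# Columns 0 and 2 are identical; column 1 is orthogonal to both.
X = np.array([[1., 1., 1.], [-1., 1., -1.], [1., -1., 1.], [-1., -1., -1.]])
print(max_true_correlation(X, support=[1]))  # true variable is well separated
print(max_true_correlation(X, support=[0]))  # true variable has an exact copy
```

In the first call the two perfectly correlated columns both lie outside the true set and do not affect the result; in the second call one of them is a true variable, so the separation is lost.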
Detection of weak signals via $\ell_1$ and $\ell_1+\ell_2$ penalized least squares in linear models.
We show below that if the identifiability condition is met, then we can recover coefficients with sizes above the noise level $n^{-1/2}$. The following result shows that, if the identification is to be performed at some given confidence level $\delta$, the size of the signal will also depend on $\delta$. Moreover, it will depend on $M$, via a logarithmic term: this is the price to pay for simultaneous identification of the true variables, among all $M$ possibilities. In what follows we will use the following tuning parameters, depending on whether $Y \in \{0, 1\}$ or $Y \in \mathbb{R}$. Let $0 < \delta < 1$ be fixed. Let $K$ be an upper bound on $k^*$. Since $k^*$ is unknown, one can always use the conservative bound $M$. However, if in practical situations $K$ is known, one can use it instead of the larger bound $M$. Consider

or

Proposition 3.3. For $r$ given above we assume that

(1) If Condition Identif is satisfied for $d \le \frac{1}{15}$ and $\widehat{I}$ corresponds to the $\ell_1$ penalized least squares estimate, then

(2) We assume, in addition, that $\max_{j\in I^*} |\beta^*_j| \le B$ for some $B > 0$. We choose $c = \frac{r}{2B}$. If Condition Identif is satisfied for $d \le \frac{1+c}{17.5}$ and $\widehat{I}$ corresponds to the $\ell_1+\ell_2$ penalized least squares estimate, then

Remark. This is a substantial relaxation of the lower bound on the signal strength, which no longer depends on either the possibly large $k^*$ or the possibly small $b$. Similar relaxations of the requirements on $\min_{j\in I^*} |\beta^*_j|$ have also been obtained by [24] and [6], but for models in which $I^*$ is assumed to be random, as discussed at the beginning of Section 3.
Proposition 3.3 above allows an immediate comparison between the selection properties of the Lasso and the elastic net. Their behavior is almost the same; the only difference is in the restriction on the constant $d$: slightly larger values of $d$ can be allowed for the elastic net estimate. In other words, if the correlations between the true variables, and between the true variables and the rest, are slightly larger than what is allowed for the Lasso, then the $\ell_1+\ell_2$ penalized estimate may provide an alternative. However, as we noted in Section 2, although it would be tempting to increase the value of $c$ in order to allow for a larger degree of correlation, this would result in not setting any components of the estimate to zero.
The identifiability condition needed for linear models needs to be adjusted to the nature of the logistic regression model, in a manner similar to that of replacing Condition Stabil by Condition LStabil. We impose below a new condition: a weighted correlation matrix should exhibit the same type of separation we required of the correlation matrix of the data. The weights depend on the link function. This perhaps comes as little surprise: the correlation matrix appears explicitly in the expression of the least squares estimates in linear models, and this is not typically the case for other models and estimates. We formalize this below. For a given $0 < \delta < 1$, $M$ and $n$, let

where we recall that $L$ is a common bound on the $X_j$'s. Let $d$ be as required by Condition Identif. Recall that for such $0 < d < 1$ there exists a $0 < b < 1$ for which Condition Stabil holds, as specified in Lemma 2.1. For this $b$ define

for $s > 0$ given in Theorem 2.4. The definition of $U$ is justified by the properties of the estimates $\widehat{\beta}$ discussed in Section 2, which have been proved under Condition Stabil and Assumption A. Let $g(z) = e^z/(1 + e^z)$.
Condition Lidentif. Let $d$ be the constant required by Condition Identif. We assume that

Remark 1. We give a formal justification of this condition in the course of the proof of Proposition 3.4 in Appendix A below. It is a natural condition that appears via a linearization of the likelihood function. The term containing $\epsilon$ in the definition of $U$ is exponentially small, and can be essentially ignored for practical purposes; its role is purely technical.
and $\widehat{I}$ corresponds to the $\ell_1+\ell_2$ penalized logistic regression estimate then

Remark 2. Notice that if $g(x) = x$ is the linear link, Condition Lidentif becomes Condition Identif. Since $\epsilon$ is exponentially small, the requirement on the minimum size of the coefficients is essentially

As discussed in the remark following Proposition 3.3 above, Corollary 3.2 shows that $\mathbb{P}(I^* \subseteq \widehat{I})$ can also be controlled under the less restrictive Condition Stabil, but in that case we can only recover sets $I^*$ corresponding to the large signal strength $\min_{j\in I^*} |\beta^*_j| \ge \frac{4}{sb} rk^* + (1 + \frac{1}{r})\epsilon$. In contrast, Proposition 3.4 shows that we can detect weaker signals; however, the correlation structure needs to follow the more restrictive Conditions Lidentif and Identif. As discussed before, similar properties are valid for the elastic net estimate, for an appropriate choice of the tuning sequence $c$. Refinements of this result that replace the possibly small constant $s$ by a term close to 1 are possible if, instead of statements that hold with probability larger than $1-\delta$, we consider statements that hold with probability converging to one. For this, one can use Proposition 2.5 and Theorem 2.6. Since these results are very similar to those above, we do not include them here, for brevity.

Correct selection via the Lasso and the elastic net in linear regression models
Theorem 3.5. Let $K$ be an upper bound on $k^*$ and take (1) Assume that Condition Identif is met for $d \le 1/15$. If $\widehat I$ corresponds to the $\ell_1$ penalized least squares estimator, then (2) Assume, in addition, that $\max_{j\in I^*}|\beta^*_j| \le B$ for some $B > 0$ and choose $c = r/(2B)$. If Condition Identif is met for $d = (1+c)/17.5$ and $\widehat I$ corresponds to the $\ell_1 + \ell_2$ penalized least squares estimator, then
Remark. Since $k^*$ is unknown, one can always take $K = M$. However, if in some instances one has a rough idea of the order of magnitude of $k^*$, one can use that value instead of the conservative bound $M$. The remarks on the relative merits of the Lasso versus the elastic net from the previous sections apply here with no change.
Recall that the Lasso parameter estimates $\widehat\beta$ may not be unique. However, the set estimates $\widehat I$ are unique, for each given tuning sequence $r$. This result, which we prove in Appendix B, is needed throughout the paper to ensure that the problem is well posed. We mention it again here, since it will be used constructively in the proof of Theorem 3.5 in Appendix A.
Theorem 3.5 has immediate asymptotic implications. It guarantees that $I^*$ will be consistently estimated by $\widehat I$ if $M$, the number of candidate variables, is polynomial in $n$, i.e., $M = O(n^\zeta)$ for some $\zeta \ge 0$. To obtain this result it suffices to replace $\delta$ by any sequence converging to zero with $n$. For instance, choosing $\delta = 1/n$ and restating the value of $r$ in terms of order of magnitude we have the following corollary.

Correct variable selection via ℓ 1 or ℓ 1 + ℓ 2 penalized logistic regression
In this subsection we show that the type of results that hold for ℓ 1 or ℓ 1 + ℓ 2 penalized least squares continue to hold for penalized logistic regression, under requirements on the correlation matrix that are tailored to this type of loss function.
Theorem 3.7. Under the assumptions of Proposition 3.4 we have: (1) If $\widehat I$ corresponds to the $\ell_1$ penalized logistic regression estimate then (2) If $\widehat I$ corresponds to the $\ell_1 + \ell_2$ penalized logistic regression estimate then
The asymptotic implications of Theorem 3.7 are again immediate. If $M$ is polynomial in $n$ and for $\delta = 1/n$ we therefore obtain: on the size of the true model, for some positive constant $C$. In this context, similar results, based on different arguments, have been independently obtained by [17], under the slightly more stringent requirements $k^* \le C(n/\log n)^{1/3}$ and $\min_{k\in I^*}|\beta^*_k| \ge 1/k^*$, but under slightly more relaxed conditions on the weighted matrix of the design.

Conclusions
The aim of this paper is to offer finite sample, non-asymptotic benchmarks on the performance of the Lasso and the closely related elastic net methods for variable selection in logistic and linear regression models. We showed that the methods can be used for correct variable selection in identifiable models, where we defined identifiability via Condition Identif and Condition Lidentif. The added requirement for correct selection, versus good prediction, is on the size of the signal strength: we can detect coefficients larger than a small constant multiplied by the tuning parameter of the $\ell_1$ penalty. This tuning parameter is a function of $n$, $M$ and the level of confidence, $\delta$. The size of the tuning parameter has to be larger than the noise level, typically of order $1/\sqrt{n}$, up to factors that are logarithmic in $M$ and $1/\delta$. Our contribution can be detailed as follows.
Lasso and the elastic net in linear regression. The properties of the $\ell_1$ penalized least squares in regression models are becoming well understood, while those of the $\ell_1 + \ell_2$ penalized least squares have not been investigated from this perspective. We complemented the existing results on the Lasso estimates by providing a refinement of assumptions. We showed in Section 2 that the $\ell_1$ penalized estimates belong to sparse $\ell_1$ balls under Condition Stabil, also proposed in [2]. We included a full proof of this result to facilitate the comparison with the elastic net estimates, which allow for a slightly higher degree of correlation between the $X$ variables than the one permitted by the Lasso estimate. We discussed in Section 2 the precise interplay between this degree of correlation and the choice of the tuning parameters. If the tuning parameter of the $\ell_2$ term is smaller than the tuning parameter of the $\ell_1$ term, this estimator is also sparse: it belongs to a sparse $\ell_1$ ball centered at the true value and can be used to recover the true coefficient set $I^*$ with high probability. However, care must be taken when using this estimate: if the tuning sequence accompanying the extra $\ell_2$ term is too large we would essentially have a ridge regression estimate, and no variable selection will be performed.
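To make the comparison between the two estimators concrete, the following minimal sketch computes both by coordinate descent on simulated data and reads off the selected sets. It assumes the objective $\frac{1}{n}\sum_i (Y_i - \beta' X_i)^2 + 2r\sum_j|\beta_j| + c\sum_j \beta_j^2$, which may differ from the paper's displays by constants. The function names, the simulated design, the noise level and the value $r = 0.25$ are all illustrative assumptions; only the choice $c = r/(2B)$ is taken from the text.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def penalized_ls(X, y, r, c=0.0, n_iter=200):
    """Coordinate descent for (1/n)||y - X b||^2 + 2 r |b|_1 + c |b|_2^2.

    c = 0 gives the l1 (Lasso) estimate, c > 0 the l1 + l2 estimate.
    The parametrization is an assumption made for this sketch.
    """
    n, M = X.shape
    beta = np.zeros(M)
    col_sq = (X ** 2).sum(axis=0) / n       # (1/n) * sum_i X_ij^2
    resid = y.astype(float).copy()          # y - X @ beta with beta = 0
    for _ in range(n_iter):
        for j in range(M):
            resid += X[:, j] * beta[j]      # remove coordinate j from the fit
            rho = X[:, j] @ resid / n       # partial correlation with residual
            beta[j] = soft_threshold(rho, r) / (col_sq[j] + c)
            resid -= X[:, j] * beta[j]      # put the updated coordinate back
    return beta

def selected_set(beta, tol=1e-10):
    """Indices of the non-zero components of an estimate."""
    return set(np.flatnonzero(np.abs(beta) > tol).tolist())

# Simulated example (hypothetical): three active variables out of twenty.
rng = np.random.default_rng(0)
n, M = 400, 20
X = rng.standard_normal((n, M))
beta_star = np.zeros(M)
beta_star[:3] = [1.0, -1.0, 0.8]
y = X @ beta_star + 0.5 * rng.standard_normal(n)

r = 0.25                  # illustrative l1 tuning parameter
B = 1.0                   # bound on max |beta*_j| in this simulation
c = r / (2 * B)           # l2 tuning parameter, as chosen in the text

lasso_set = selected_set(penalized_ls(X, y, r))
enet_set = selected_set(penalized_ls(X, y, r, c=c))
print(lasso_set, enet_set)
```

In this parametrization the $\ell_2$ term only rescales each coordinate update by $1/(\text{col\_sq}_j + c)$, which is one way to see why a very large $c$ pushes the estimate toward a heavily shrunk, ridge-like solution.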
In Section 3 we provided a non-asymptotic analysis of the subset selection problem in linear models, which complements the existing asymptotic results. We showed that the signal detection boundaries suggested by previous asymptotic analyses can be relaxed. In the works of [21] and [3], which investigate aspects of selection consistency, the minimal signal strength is required to be $n^{-1/2+\theta}$, for some $\theta > 0$, up to unspecified and possibly large constants. The work in [3] requires Condition Identif from Section 2 above. In [21] a less restrictive assumption on the design matrix is imposed, namely the irrepresentable design condition, which is almost necessary and sufficient for the sign consistency of the estimators, which implies consistent subset selection. The work of [14] uses a coherence-type condition similar to our Condition Identif, which is shown to be a sufficient condition for the sign consistency of a further thresholded Lasso estimator. The price to pay is a stronger requirement on the minimum size of the detectable coefficients: this size depends on sequences involved in the definition of their coherence condition and $k^*$. These requirements are similar in spirit to those discussed in our Corollary 3.2 above, and share similar drawbacks.
We showed here that if one concentrates directly on the study of $\mathbb{P}(\widehat I = I^*)$, instead of sign consistency, and studies the original (untruncated) Lasso estimator under Condition Identif, one can relax the requirement on $\min_{j\in I^*}|\beta^*_j|$. We showed in Theorem 3.5 that one only needs $\min_{j\in I^*}|\beta^*_j|$ be larger than , up to small constants independent of the design. For $M$ polynomial in $n$ and the choice $\delta = 1/n$ one can therefore detect, with the untruncated Lasso, coefficients of order O( log n n ).
Lasso and the elastic net in logistic regression models.We showed in this article that the ℓ 1 and ℓ 1 +ℓ 2 penalized logistic regression estimators have features that are similar to ℓ 1 and ℓ 1 + ℓ 2 penalized least squares estimators, but the study of the estimates depends on conditions on a weighted correlation matrix of the data.
The predictive performance and adaptation to unknown sparsity of the Lasso penalized estimates in generalized linear models have received very little attention, with the notable exceptions of [19,26] and [11] in regression and classification.
Here we revisited some of these issues, and showed that the $\ell_1$ penalized logistic regression estimators, as well as the elastic net estimates, belong to sparse $\ell_1$ balls under the weaker Condition Stabil. The size of the radii of these balls can be improved asymptotically under Condition LStabil. We also showed that the $\ell_1 + \ell_2$ penalized logistic regression estimators, which have not yet been investigated, exhibit the same adaptation to unknown sparsity as the Lasso estimates, for appropriate choices of the tuning parameters given in Section 2.3. We showed in Theorem 3.7 that, similar to linear models, $\ell_1$ or $\ell_1 + \ell_2$ penalized logistic regression can be used to estimate $I^*$ with very high probability. The difference is in the conditions on the correlation matrix, which need to be adapted to the nature of this model, as in Condition Lidentif. The size of the coefficients that are detectable via this method is also of the order O( log n n ), where the constants involved in this bound are independent of the design or sparsity level.

Appendix A
Proof of Theorem 2.2. Let $X_i$ be the $M$ dimensional vector with entries $X_{ij}$, $1 \le j \le M$. For ease of notation, let $r_{n,M}(\delta) = r$. By the definition of the estimator, and with Define the event Notice that on the event $A$ display (4.1) yields, via simple algebra, that Therefore, on the set $A$ we have $\widehat\beta - \beta^* \in V$, with $V$ defined in (2.1), for $\epsilon = 0$ and $\alpha = 3$. Adding $r|\widehat\beta - \beta^*|_1$ to both sides of (4.1) and re-arranging the terms we also have Using the Cauchy-Schwarz inequality in the right hand side of the inequality above, followed by an inequality of the type $2uv \le au^2 + v^2/a$, for any $a > 1$, we further obtain Since $\widehat\beta - \beta^* \in V$ we can invoke Condition Stabil and, by taking $a = 1/b$, we obtain, on the set $A$, that To conclude the proof we determine now $r = r_{n,M}(\delta)$ such that $\mathbb{P}(A^c) \le \delta$. If $Y \in \{0, 1\}$ we use Hoeffding's inequality to obtain and the choice ) , and the choice guarantees that $\mathbb{P}(A^c) \le \delta$. This concludes the proof.
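The last step of the proof, choosing $r$ so that $\mathbb{P}(A^c) \le \delta$, follows a generic pattern: apply Hoeffding's inequality to each of the $M$ coordinates and combine with the union bound. The sketch below illustrates that pattern for averages of independent, mean-zero variables bounded by `bound` in absolute value. The function names are hypothetical, and the constants are those of the textbook inequality; they are not meant to reproduce the exact constants in the paper's choice of $r$.

```python
import math

def union_tail_bound(n, M, t, bound):
    """Hoeffding plus union bound over M coordinates.

    For independent mean-zero variables with |Z_i| <= bound, each average
    exceeds t in absolute value with probability at most
    2 * exp(-n t^2 / (2 bound^2)); the union bound multiplies this by M.
    """
    return 2.0 * M * math.exp(-n * t * t / (2.0 * bound ** 2))

def hoeffding_union_threshold(n, M, delta, bound):
    """Smallest t for which union_tail_bound(n, M, t, bound) <= delta."""
    return bound * math.sqrt(2.0 * math.log(2.0 * M / delta) / n)

# Illustration with delta = 1/n and M polynomial in n (here M = n^2):
for n in (100, 1000, 10000):
    t = hoeffding_union_threshold(n, M=n ** 2, delta=1.0 / n, bound=1.0)
    print(n, round(t, 4))
```

The resulting threshold is of order $\sqrt{\log(M/\delta)/n}$, which matches the statement that the tuning parameter is of order $1/\sqrt{n}$ up to factors logarithmic in $M$ and $1/\delta$.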
Proof of Theorem 2.3. Using the definition of the estimator, the fact that $\max_{j\in I^*}|\beta^*_j| \le B$ and our choice of $c = r/(2B)$ we obtain, on the event $A$, that Therefore, on the set $A$ we have $\widehat\beta - \beta^* \in V$, with $V$ defined in (2.1), for $\epsilon = 0$ and $\alpha = 4$. We use the same reasoning as in Theorem 2.2, and invoke Condition Stabil to obtain the analogue of display (4.3). The only difference is that we complete the square generated by the $\ell_2$ part of the penalty: and so, under the assumption that $\max_{j\in I^*}|\beta^*_j| \le B$ and our choice of $c = r/(2B)$ we obtain, for any $a > 1$ and the remaining part of the proof is identical to that of Theorem 2.2, if we now choose $a = 1/(b+c)$.
Proof of Theorem 2.4. Recall that we denoted the logistic loss function by and the associated risk by $Pl(\beta) = \mathbb{E}\, l(\beta; Y, X)$. We also denote the empirical risk by With this notation and letting $r = r_{n,M}(\delta)$, the estimator satisfies, by definition By adding and subtracting to both sides and rearranging terms we obtain Notice first that if we change the $i$th pair $(X_i, Y_i)$ while keeping the others fixed, the value of $L_n$ changes by at most $4L/n$, where $L$ is a common bound on all $X_{ij}$. To see why, recall that $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i,Y_i}$ is the empirical measure putting mass $1/n$ at each observation $(X_i, Y_i)$. Let be the empirical measure corresponding to changing the pair $(X_l, Y_l)$ to $(X'_l, Y'_l)$. Then where the inequality follows immediately by a first order Taylor expansion and the assumption that all $X$ variables are bounded by $L$. Therefore we can apply the bounded difference inequality (e.g. Theorem 2.2, page 8 in [7]) to obtain that Thus, if we take we have We will use Lemma 3 in [26] to obtain a bound on $\mathbb{E}L_n$. We re-state it here for ease of reference, adapting it to our notation.
Lemma 3, [26]. Let $J_n$ be an integer such that $2^{J_n} \ge n$. Then, if $L_n$ defined above corresponds to a Lipschitz loss and the components of $X$ are bounded by $L$, with probability one, then , where $C_1$, $C_2$ are positive constants depending on the Lipschitz constant and $L$.
Our loss is Lipschitz in $t = \beta' x$, with constant 2. Also, inspection of the chaining argument used in the proof of the Lemma shows that we can take $J_n = (M \vee n)$ and $\epsilon = \frac{\log 2}{2^{(M\vee n)+1}} \times \frac{1}{r}$. Therefore, by making the constants precise we obtain Define the event From the previous displays we then conclude that if where we used the possibly conservative bound $D$ on $\epsilon$ to keep the exposition clear. By Example 4.5 in [18] we have Pl( and is the $L_2$ norm with respect to the distribution of $X$. A first order Taylor expansion gives g , where $f_\beta(x) = \beta' x$ and $\bar\beta' x$ is an intermediate point between $\beta' x$ and $\beta^{*\prime} x$. Let $A = 6LD$, and let $s = (1 + e^A)^{-4}$. Then, since Assumption A and (4.11) hold, we have Thus, on the event $E$, display (4.8) further yields Via simple algebra, display (4.12) yields on the set $E$. Therefore $\widehat\beta - \beta^* \in V$, for the set $V$ given by (2.1) of Section 2, with $\alpha = 3$ and $\epsilon$ as in the statement of the theorem. Let $\gamma_{kj} = \mathbb{E}X_k X_j$, for $k, j \in \{1, \ldots, M\}$ and let $\Gamma$ be the $M \times M$ matrix with entries $\gamma_{kj}$. Notice that $\|f_\beta - f_{\beta^*}\|^2 = (\beta^* - \beta)'\Gamma(\beta^* - \beta)$ and so, using a reasoning identical to the one used in display (4.3) of Theorem 2.2, we further obtain Since on the set $E$ we have Taking $a = 1/(sb)$ we obtain, on the set $E$, that Since we have shown above that $\mathbb{P}(E) \ge 1 - \delta$, the proof is complete.
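The bounded difference step in the proof of Theorem 2.4 can be spelled out as follows. This is a sketch based on the standard form of the inequality, with the per-coordinate change $c_i = 4L/n$ taken from the proof; the deviation term $t$ obtained at the end is a textbook consequence and is not claimed to coincide with the constants in the statement of the theorem.

$$
\mathbb{P}\bigl(L_n - \mathbb{E}L_n \ge t\bigr)
\;\le\; \exp\!\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right)
\;=\; \exp\!\left(-\frac{n t^2}{8L^2}\right),
\qquad c_i = \frac{4L}{n},
$$

since $\sum_{i=1}^n c_i^2 = 16L^2/n$. Setting the right-hand side equal to $\delta$ gives $t = 2L\sqrt{2\log(1/\delta)/n}$, so that $L_n \le \mathbb{E}L_n + 2L\sqrt{2\log(1/\delta)/n}$ with probability at least $1 - \delta$.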
Proof of Proposition 2.5. First notice that on the event $E$ defined in the previous theorem, display (4.8) yields Then, the assumptions of this proposition imply that the right-hand side of the above display converges to zero with $n$ and so we have that for any $\vartheta > 0$ since $\mathbb{P}(E^c) \le \delta_n \to 0$.
Observe that $Pl(\beta) = P_X P_{Y|X}\, l(\beta)$, where we regard the expectations as being taken with respect to a pair $(X, Y)$ independent of the sample. By the definitions of the loss $l$ and $p(x)$ we have Let $\theta > 0$ be arbitrary, fixed. Simple algebra shows that if sup ), for all $x$, and so there exists $\vartheta_\theta > 0$ such that where the convergence to zero follows by (4.14) above. This concludes the proof of this proposition.
Proof of Theorem 2.6. The proof differs from the proof of Theorem 2.4 above only in the way we obtain the lower bound on $P(l(\widehat\beta) - l(\beta^*))$. For quantities defined in the discussion immediately following display (4.11) above we write $\exp(\bar\beta' x)$ where we recall that $\bar\beta' x$ is an intermediate point between $\beta' x$ and $\beta^{*\prime} x$ and that we defined $p(x) = \exp(\beta^{*\prime} x)/(1 + \exp(\beta^{*\prime} x))$. Let $\theta > 0$ be arbitrarily close to zero. Let $A_\theta$ be the set for which sup and recall that, by Proposition 2.5, we have $\mathbb{P}(A_\theta) \to 1$. On the set $A_\theta$ we have 2 ≥ $e^{-2\theta} =: w$, for all $x$ and with $w$ arbitrarily close to 1. Let $\Gamma_1$ be the matrix with entries $\mathbb{E}\,p(X)(1 - p(X))X_k X_j$, for $k, j \in \{1, \ldots, M\}$. Therefore, on $A_\theta$ we have Invoking Condition LStabil we obtain 2 and the rest of the proof carries on unchanged, with results holding now on the set $A_\theta \cap E$.
Proof of Theorem 2.7. The proof is identical to the one of Theorem 2.4 above, up to the following display To arrive at this display we observe that the elastic net satisfies and so $\widehat\beta - \beta^* \in V$, for the set $V$ given by (2.1) of Section 2, with $\alpha = 4$ and $\epsilon$ as in the statement of the theorem. Therefore the use of Condition Stabil in the derivations above is valid.
For the remainder of the proof we reason as in Theorem 2.3 above. We complete the square in the left hand side of the inequality above and invoke the assumption $\max_{j\in I^*}|\beta^*_j| \le B$ to obtain which immediately implies, by choosing $c$ such that $2cB = r$, that Then, we use again the Cauchy-Schwarz inequality followed by $2xy \le ax^2 + y^2/a$ to obtain Choosing now $a = 1/(sb+c)$ gives, on the event $E$ defined in (4.10) above Since we showed in the proof of Theorem 2.4 that $\mathbb{P}(E^c) \le \delta$, this completes the proof.
We argue exactly as in the course of the proof of Theorem 2.2 to bound the probabilities above. We use either Hoeffding's inequality, for $Y \in \{0, 1\}$, or Bernstein's inequality, for $Y \in \mathbb{R}$, to bound the first term by $k^*\frac{\delta}{KM} \le \frac{\delta}{M}$, for $r$ given by (3.3). Similarly, for this choice of $r$, we have We establish now similar results for the $\ell_1 + \ell_2$ penalized least squares estimator. By the characterization of a zero component of the solution, given in Lemma 4.3 in Appendix B below, we also have and so the proof is identical to the one above. The only modification is in terms of constants: in this case Condition Identif implies Condition Stabil with $b = 1 - 9d$. From Theorem 2.3 we obtain for the choice of $r$ given by (3.3) that As above, we note that $1/(2d) \ge 4.25/(b + c)$ for $d \le (1+c)/17.5$. Invoking now Theorem 2.3 with these constants concludes the proof.
Proof of Proposition 3.4. As in the previous proof, recall that we denoted the cardinality of $I^*$ by $k^*$ and that We begin by establishing the result for the $\ell_1$ penalized estimator. By Lemma 4.1 in the Appendix below it follows that if $\widehat\beta_k = 0$ is a component of the solution as in Theorem 2.4, if For this, let $g(z) = e^z/(1 + e^z)$ and notice that Taylor's formula gives $g(u) - g(v) = g'(a)(u - v)$, for a point $a$ between $u$ and $v$, where $0 < g'(a) < 1$. Therefore where $a_i$ is a point between $\sum_{j=1}^M \widehat\beta_j X_{ij}$ and $\sum_{j=1}^M \beta^*_j X_{ij}$, for each $i$, and so Therefore, by Theorem 2.4, for $b$ chosen as in the discussion following display (4.16) above, we have $\mathbb{P}(G_n^c) \le \delta$. Notice that on the event $G_n$ we have This justifies the definition of the set $U$ in Condition Lidentif.
Combining the results above with (4.17) we obtain Note that if Condition Identif and Lidentif both hold for $d/2$ then Thus, if $d \le \frac{s}{16+2s(7+\epsilon)}$, and with $b$ chosen as in the discussion following display (4.16) above we have Therefore, collecting the bounds above, we obtain The result for the $\ell_1 + \ell_2$ penalized estimator follows in an identical manner. By Lemma 4.3 in Appendix B below, if $\widehat\beta_k = 0$ is a component of the solution Therefore the remainder of the proof is identical to the proof above, if we invoke Theorem 2.7 instead of Theorem 2.4.
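The bound $0 < g'(a) < 1$ used in the Taylor step of the proof of Proposition 3.4 can be checked directly from the definition of the logistic link. The short computation below is a standard fact about $g(z) = e^z/(1+e^z)$, included only to make that step explicit; it also shows that the bound can be sharpened to $1/4$.

$$
g'(z) \;=\; \frac{e^z}{(1+e^z)^2} \;=\; g(z)\bigl(1 - g(z)\bigr),
\qquad
0 \;<\; g'(z) \;\le\; \frac{1}{4} \quad \text{for all } z \in \mathbb{R},
$$

because $0 < g(z) < 1$ and $t(1-t) \le 1/4$ on $(0,1)$, with equality only at $t = 1/2$, that is, at $z = 0$. Consequently $|g(u) - g(v)| \le \frac{1}{4}|u - v|$ for all $u, v$.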
Proof of Theorem 3.5. In light of Proposition 3.3, it is enough to show that $\mathbb{P}(\widehat I \subseteq I^*) \ge 1 - 2\delta$, for both the $\ell_1$ and $\ell_1 + \ell_2$ penalized least squares estimators. We begin by showing that $\mathbb{P}(\widehat I \subseteq I^*) \ge 1 - 2\delta$ for the $\ell_1$ penalized estimate. Let and define Let Let, by abuse of notation, $\mu \in \mathbb{R}^M$ be the vector that has the components of $\mu$ in positions corresponding to the index set $I^*$ and components equal to zero otherwise. By standard results in convex analysis, e.g. Lemma 4.1 in Appendix B below, it follows that, on the set $B$, $\mu$ is a solution of (2.2). Recall that $\widehat\beta$ is a solution of (2.2) by construction. By definition $\widehat\beta_k \ne 0$ for $k \in \widehat I$. By construction, $\mu_k \ne 0$ for $k \in S \subseteq I^*$, for some subset $S$. By Proposition 4.2 in Appendix B, any two solutions have non-zero elements in the same positions, therefore $\widehat I = S \subseteq I^*$ on $B$. Hence where we used Condition Identif to obtain the last inequality. Recall now that if $Y \in \{0, 1\}$ the choice guarantees, as in display (4.4) of the proof of Theorem 2.2, that Repeating now the proof of Theorem 2.2, with $\widehat\beta$ replaced by $\mu$ and using only the variables corresponding to $I^*$, we obtain By Hoeffding's inequality and the choice Here we used again the fact that, by Lemma 2. The same conclusion holds if $Y \in \mathbb{R}$, by invoking Bernstein's inequality as in (4.5) and the corresponding value of $r$ from the statement of this proposition, instead of Hoeffding's inequality.
Of course, the choice in (4.19) is not implementable, as $k^*$ is not known in practice, but we can always replace it by a known upper bound, or by the conservative bound $M$. This completes the proof for this part of the proposition.
It remains to show that $\mathbb{P}(\widehat I \subseteq I^*) \ge 1 - 2\delta$ for the $\ell_1 + \ell_2$ penalized estimate. We reason as above and let Let, by abuse of notation, $\mu \in \mathbb{R}^M$ be the vector that has the components of $\mu$ in positions corresponding to the index set $I^*$ and components equal to zero otherwise. By Lemma 4.3 in Appendix B below it follows that, on the set $B_1$, $\mu$ is a solution of (2.2). Recall that $\widehat\beta$ is a solution of (2.2) by construction. Also by Lemma 4.3 in Appendix B, the solution is unique. Since, on the set $B_1$, $\widehat\beta_k = 0$ for $k \in I^{*c}$, we conclude that $\widehat I \subseteq I^*$ on the set $B_1$. Therefore the remainder of the proof is identical to the one above.

Properties of ℓ 1 penalized least squares and logistic regression solutions
The solution of the $\ell_1$ penalized optimization problem may not be unique. However, in this case, all solutions have zero elements in the same positions, as we show below. We denote by $Y = (Y_1, \ldots, Y_n)$ the vector of responses and by $X$ the $n \times M$ matrix with entries $X_{ij}$. We let $L(\beta) = L(X, Y; \beta)$ be a function depending on the data and a parameter $\beta \in \mathbb{R}^M$. Let where $\nabla L(\beta)$ is the $M$-dimensional vector having $\partial L(\beta)/\partial\beta_j$ as components and $v \in \mathbb{R}^M$ is such that By standard results in convex analysis, $\beta \in \mathbb{R}^M$ is a point of local minimum for a convex function $f$ if and only if $0 \in D_\beta$, where $0 \in \mathbb{R}^M$. Therefore, $\widehat\beta$ satisfies (4.24) if and only if $|\partial L(\widehat\beta)/\partial\beta_j| = \lambda|v_j|$, for all $1 \le j \le M$, and so the index set $S$ of non-zero components of a solution is given by Therefore, if (4.25) holds, $S$ is the same for all solutions.
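The optimality condition $0 \in D_{\widehat\beta}$ invoked above can be written coordinate by coordinate. The display below is a sketch under the assumption that the criterion in (4.24) has the form $f(\beta) = L(\beta) + \lambda\sum_{j=1}^M|\beta_j|$ with $L$ convex and differentiable, which is consistent with the form $w = \nabla L(\beta) + \lambda v$ of the subdifferential; the exact normalization of (4.24) may differ.

$$
0 \in D_{\widehat\beta}
\iff
\begin{cases}
\dfrac{\partial L(\widehat\beta)}{\partial \beta_j} = -\lambda\,\mathrm{sign}(\widehat\beta_j), & \text{if } \widehat\beta_j \neq 0,\\[2ex]
\left|\dfrac{\partial L(\widehat\beta)}{\partial \beta_j}\right| \le \lambda, & \text{if } \widehat\beta_j = 0,
\end{cases}
\qquad 1 \le j \le M,
$$

where the vector $v$ has components $v_j = \mathrm{sign}(\widehat\beta_j)$ when $\widehat\beta_j \neq 0$ and $v_j \in [-1, 1]$ otherwise. In particular, every non-zero coordinate of a solution satisfies $|\partial L(\widehat\beta)/\partial\beta_j| = \lambda$, so a characterization of the support in terms of the gradient depends on the solution only through $\nabla L(\widehat\beta)$, which is where an assumption such as (4.25) enters.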

Corollary 3.8. Let r = O( log n n ) and assume that $\min_{j\in I^*}|\beta^*_j|$ = O( log n n ). Then, under Assumption A and the assumptions on the design required for (1) or (2), respectively, of Theorem 3.7 we have $\lim_{n\to\infty}\mathbb{P}(\widehat I = I^*) = 1$, for $\widehat I$ corresponding to either the $\ell_1$ or the $\ell_1 + \ell_2$ penalized logistic regression estimator.
Remark. In the proofs of Corollary 3.8 and Theorem 3.7 above we invoked Theorem 2.4 of Section 2.3.1 and therefore used its hypotheses. If instead one invoked the asymptotic result of Theorem 2.6, one could obtain a version of Corollary 3.8 with Assumption A replaced by the conditions $\max_{k\in I^*}|\beta^*_k| \le B$ and $rk^* \to 0$. For polynomially large $M$, the order of our tuning sequence is $r = O(\log n/\sqrt{n})$. The condition $rk^* \to 0$ therefore places the restriction $k^* \le C\sqrt{n}/(\log n)^2$

Proof of Proposition 3.3. Recall that we denoted the cardinality of $I^*$ by $k^*$. First observe that by the definitions of $\widehat I$ and $I^*$ and by the union bound we have $\mathbb{P}(I^* \not\subseteq \widehat I) \le \mathbb{P}(k \notin \widehat I \text{ for some } k \in I^*) \le \mathbb{P}(\widehat\beta_k = 0 \text{ and } \beta^*_k \ne 0, \text{ for some } k \in I^*) \le k^* \max_{k\in I^*}\mathbb{P}(\widehat\beta_k = 0 \text{ and } \beta^*_k \ne 0)$. We first show that $\mathbb{P}(I^* \subseteq \widehat I) \ge 1 - \delta - \frac{\delta}{M}$ for the $\ell_1$ penalized least squares estimator. It follows immediately from Lemma 4.1 in Appendix B below that if where the penultimate inequality follows by the triangle inequality $|a + b + c| \ge |c| - |a| - |b|$. Under Condition Identif and since $\min_{j\in I^*}|\beta^*_j| \ge 2r$ we further obtain for a constant $b$ for which Condition Stabil holds. By Lemma 2.1 in Section 2, Condition Identif implies Condition Stabil with $b = 1 - 7d$. Notice that for this value of $b$ we have $1/(2d) \ge 4/b$ for $d \le 1/15$, as required in the statement of this theorem. Therefore, combining these results we obtain
Lemma 4.3 in the Appendix B, b = ( µ, 0), where 0 is a vector corresponding to indices in $I^{*c}$, is a solution of (2.3) on the set

3) by construction, and that by Lemma 4.3 in Appendix B, the solution is unique. Since, on the set $B$, $b_k = 0$ for $k \in I^{*c}$, by construction, and $\widehat\beta_k = 0$ on $\widehat I^c$, by definition, we conclude that $\widehat I \subseteq I^*$ on the set $B$. Therefore the proof is identical to the one above, where we now invoke Condition Identif with $d \le (1+c)/17.5$ and the analogue of the proof of Theorem 2.3.
Proof of Theorem 3.7. By Proposition 3.4, it is enough to show that $\mathbb{P}(\widehat I \subseteq I^*) \ge 1 - 2\delta$ for both the $\ell_1$ and $\ell_1 + \ell_2$ penalized estimates. We begin by showing that $\mathbb{P}(\widehat I \subseteq I^*) \ge 1 - 2\delta$ for the $\ell_1$ penalized logistic regression estimate. Let i µ ′ X i + log(1 + exp µ ′ X i )} + 2r

Lemma 4.1. $\lambda > 0$. Let $S$ be the set of indices corresponding to the non-zero components of a solution $\widehat\beta$: $S = \{k : \widehat\beta_k \ne 0,\ 1 \le k \le M\}$. If $L$ is differentiable in $\beta$ and if for any two minima $\widehat\beta^{(1)}$, $\widehat\beta^{(2)}$ we have $\partial L(\widehat\beta^{(1)})/\partial\beta_j = \partial L(\widehat\beta^{(2)})/\partial\beta_j$, for all $1 \le j \le M$, (4.25) then all $\widehat\beta$ satisfying (4.24) have non-zero components in the same positions.
Proof. We recall that for any convex function $f : \mathbb{R}^M \to \mathbb{R}$ the subdifferential of $f$ at a point $\beta$ is the set $D_\beta = \{w \in \mathbb{R}^M : f(u) - f(\beta) \ge \langle w, u - \beta\rangle \text{ for all } u \in \mathbb{R}^M\}$. For the function $f$ defined in (4.24) this becomes $D_\beta = \{w \in \mathbb{R}^M : w = \nabla L(\beta) + \lambda v\}$,
We argue this in what follows. Inequality (4.15) in the proof of Lemma 3.1 above makes it clear that the rate at which $\mathbb{P}(I^* \not\subseteq \widehat I)$ decays is governed by how small we can make the probability of estimating a non-zero component of $\beta^*$ by zero. However, if we further bound this probability and arrive at (3.1), we can use Theorems 2.2-2.7 of the previous section directly. We thus arrive at the following corollary.
Notice that although Corollary 3.2 is established under the weaker Condition Stabil, it only guarantees that $\mathbb{P}(I^* \subseteq \widehat I)$ for the collection $I^*$ for which $\min_{j\in I^*}|\beta^*_j| \ge \frac{4}{b}\, r k^*$. In contrast, if Condition Identif holds, we can detect variables corresponding to the set $I^*$ for which $\min_{j\in I^*}|\beta^*_j| \ge 2r$ 1 2d ≥ 4 sb , with $b$ given by Condition Stabil. By Lemma 2.1, Condition Identif implies Condition Stabil with $b = 1 - d(7 + \epsilon)$, and so the restriction on $d$ is d ≤ 1, Condition Identif implies Condition Stabil and then reasoned as in Proposition 3.3 to conclude that the analogue of Theorem 2.2 can be used, for $d \le 1/15$.