Sparsity oracle inequalities for the Lasso

This paper studies oracle properties of $\ell_1$-penalized least squares in a nonparametric regression setting with random design. We show that the penalized least squares estimator satisfies sparsity oracle inequalities, i.e., bounds in terms of the number of non-zero components of the oracle vector. The results are valid even when the dimension of the model is (much) larger than the sample size and the regression matrix is not positive definite. They can be applied to high-dimensional linear regression, to nonparametric adaptive regression estimation and to the problem of aggregation of arbitrary estimators.


Background
The need for easily implementable methods for regression problems with a large number of variables has given rise to an extensive, and growing, literature over the last decade. Penalized least squares with $\ell_1$-type penalties is among the most popular techniques in this area. This method is closely related to restricted least squares minimization under an $\ell_1$-restriction on the regression coefficients, which is called the Lasso method, following [24]. We refer to both methods as Lasso-type methods. Within the linear regression framework these methods became most popular. Let $(Z_1, Y_1), \dots, (Z_n, Y_n)$ be a sample of independent random pairs, with $Z_i = (Z_{1i}, \dots, Z_{Mi})$ and
$$ Y_i = \lambda_1 Z_{1i} + \dots + \lambda_M Z_{Mi} + W_i, \qquad i = 1, \dots, n, $$
where the $W_i$ are independent error terms. Then, for a given $T > 0$, the Lasso estimate of $\lambda \in \mathbb{R}^M$ is
$$ \widehat\lambda^{\,\mathrm{lasso}} = \arg\min_{\lambda:\,|\lambda|_1 \le T} \frac{1}{n}\sum_{i=1}^n \Big( Y_i - \sum_{j=1}^M \lambda_j Z_{ji} \Big)^2, \tag{1.1} $$
where $|\lambda|_1 = \sum_{j=1}^M |\lambda_j|$. For a given tuning parameter $\gamma > 0$, the penalized estimate of $\lambda \in \mathbb{R}^M$ is
$$ \widehat\lambda^{\,\mathrm{pen}} = \arg\min_{\lambda \in \mathbb{R}^M} \Big\{ \frac{1}{n}\sum_{i=1}^n \Big( Y_i - \sum_{j=1}^M \lambda_j Z_{ji} \Big)^2 + \gamma |\lambda|_1 \Big\}. \tag{1.2} $$
Lasso-type methods can also be applied in the nonparametric regression model $Y = f(X) + W$, where $f$ is the unknown regression function and $W$ is an error term. They can be used to create estimates of $f$ that are linear combinations of basis functions $\phi_1(X), \dots, \phi_M(X)$ (wavelets, splines, trigonometric polynomials, etc.). The vectors of linear coefficients are given by either $\widehat\lambda^{\,\mathrm{pen}}$ or $\widehat\lambda^{\,\mathrm{lasso}}$ above, obtained by replacing $Z_{ji}$ by $\phi_j(X_i)$.
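The penalized criterion (1.2) is convex and can be minimized by simple proximal-gradient (soft-thresholding) iterations. The sketch below is purely illustrative and not part of the paper's results; the solver, the simulated data and the value of $\gamma$ are hypothetical choices.

```python
import numpy as np

def lasso_pen(Z, y, gamma, n_iter=1000):
    """Minimize (1/n) * ||y - Z @ lam||^2 + gamma * |lam|_1 by ISTA
    (proximal gradient descent with soft thresholding)."""
    n, M = Z.shape
    lam = np.zeros(M)
    step = n / (2.0 * np.linalg.norm(Z, 2) ** 2)   # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = -(2.0 / n) * Z.T @ (y - Z @ lam)    # gradient of the quadratic part
        v = lam - step * grad
        lam = np.sign(v) * np.maximum(np.abs(v) - step * gamma, 0.0)  # soft threshold
    return lam

rng = np.random.default_rng(0)
n, M = 200, 50
Z = rng.standard_normal((n, M))
lam_true = np.zeros(M)
lam_true[:3] = [2.0, -1.5, 1.0]                    # sparse ground truth
y = Z @ lam_true + 0.1 * rng.standard_normal(n)
lam_hat = lasso_pen(Z, y, gamma=0.1)
print(np.count_nonzero(lam_hat) < M)               # many coefficients exactly zero
```

The soft-thresholding step is what produces exact zeros, the subset-selection property discussed later in the Introduction.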
In this paper we analyze $\ell_1$-penalized least squares procedures in a more general framework. Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be a sample of independent random pairs distributed as $(X, Y) \in (\mathcal{X}, \mathbb{R})$, where $\mathcal{X}$ is a Borel subset of $\mathbb{R}^d$; we denote the probability measure of $X$ by $\mu$. Let $f(X) = \mathbb{E}(Y \mid X)$ be the unknown regression function and $\mathcal{F}_M = \{f_1, \dots, f_M\}$ be a finite dictionary of real-valued functions $f_j$ that are defined on $\mathcal{X}$. Depending on the statistical target, the dictionary $\mathcal{F}_M$ can be of different nature. The main examples are: (I) a collection $\mathcal{F}_M$ of basis functions used to approximate $f$ in the nonparametric regression model as discussed above; these functions need not be orthonormal; (II) a vector of $M$ one-dimensional random variables $Z = (f_1(X), \dots, f_M(X))$ as in linear regression; (III) a collection $\mathcal{F}_M$ of $M$ arbitrary estimators of $f$. Case (III) corresponds to the aggregation problem: the estimates can arise, for instance, from $M$ different methods; they can also correspond to $M$ different values of the tuning parameter of the same method; or they can be computed on $M$ different data sets generated from the distribution of $(X, Y)$. Without much loss of generality, we treat these estimates $f_j$ as fixed functions; otherwise, one can regard our results conditionally on the data set on which they have been obtained.
Within this framework, we use a data-dependent $\ell_1$-penalty that differs from the one described in (1.2) in that the tuning parameter $\gamma$ changes with $j$, as in [5,6]. Formally, for any $\lambda = (\lambda_1, \dots, \lambda_M) \in \mathbb{R}^M$, define $f_\lambda(x) = \sum_{j=1}^M \lambda_j f_j(x)$. Then the penalized least squares estimator of $\lambda$ is
$$ \widehat\lambda = \arg\min_{\lambda \in \mathbb{R}^M} \Big\{ \frac{1}{n}\sum_{i=1}^n \big( Y_i - f_\lambda(X_i) \big)^2 + \mathrm{pen}(\lambda) \Big\}, \tag{1.3} $$
where
$$ \mathrm{pen}(\lambda) = 2\sum_{j=1}^M \omega_{n,j} |\lambda_j| \quad \text{with} \quad \omega_{n,j} = r_{n,M} \|f_j\|_n, \tag{1.4} $$
and where we write $\|g\|_n^2 = n^{-1}\sum_{i=1}^n g^2(X_i)$ for the squared empirical $L_2$ norm of any function $g : \mathcal{X} \to \mathbb{R}$. The corresponding estimate of $f$ is $\widehat f = \sum_{j=1}^M \widehat\lambda_j f_j$. The choice of the tuning sequence $r_{n,M} > 0$ will be discussed in Section 2. Following the terminology used in the machine learning literature (see, e.g., [21]), we call $\widehat f$ the aggregate and the optimization procedure $\ell_1$-aggregation.
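The data-dependent weights $\omega_{n,j}$ are easy to compute, and the weighted penalty reduces to a plain $\ell_1$ penalty after rescaling each dictionary column by its empirical norm. A minimal sketch; the cosine dictionary and the constant $A = 1$ in $r_{n,M}$ are illustrative assumptions, not choices made in the paper:

```python
import numpy as np

def penalty_weights(F, r_nM):
    """omega_{n,j} = r_{n,M} * ||f_j||_n, where ||g||_n^2 = (1/n) sum_i g(X_i)^2
    and F[i, j] = f_j(X_i) is the dictionary evaluated at the design points."""
    return r_nM * np.sqrt((F ** 2).mean(axis=0))

rng = np.random.default_rng(1)
n, M = 100, 10
X = rng.uniform(size=n)
F = np.column_stack([np.cos(np.pi * j * X) for j in range(1, M + 1)])
r_nM = np.sqrt(np.log(M) / n)     # the choice r_{n,M} = A * sqrt(log(M)/n) with A = 1
omega = penalty_weights(F, r_nM)
# Substituting mu_j = ||f_j||_n * lambda_j turns pen(lambda) = 2 sum_j omega_j |lambda_j|
# into the uniform penalty 2 * r_{n,M} * |mu|_1 over a column-normalized design:
F_unit = F / np.sqrt((F ** 2).mean(axis=0))
print(omega.min() > 0, np.allclose((F_unit ** 2).mean(axis=0), 1.0))
```

The rescaling trick means any off-the-shelf Lasso solver can handle the per-coefficient weights.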
An attractive feature of $\ell_1$-aggregation is computational feasibility. Because the criterion in (1.3) is convex in $\lambda$, we can use a convex optimization procedure to compute $\widehat\lambda$. We refer to [10,26] for detailed analyses of these optimization problems and fast algorithms.
Whereas the literature on efficient algorithms is growing very fast, the literature on the theoretical aspects of the estimates is still emerging. Most of the existing theoretical results have been derived in the particular cases of either linear or nonparametric regression.
In the linear parametric regression model most results are asymptotic. We refer to [16] for the asymptotic distribution of $\widehat\lambda^{\,\mathrm{pen}}$ in deterministic design regression, when $M$ is fixed and $n \to \infty$. In the same framework, [28,29] state conditions for subset selection consistency of $\widehat\lambda^{\,\mathrm{pen}}$. For random design Gaussian regression, with $M = M(n)$ possibly larger than $n$, we refer to [20] for consistent variable selection based on $\widehat\lambda^{\,\mathrm{pen}}$. For similar assumptions on $M$ and $n$, but for random pairs $(Y_i, Z_i)$ that do not necessarily satisfy the linear model assumption, we refer to [12] for the consistency of the risk of $\widehat\lambda^{\,\mathrm{lasso}}$.
Lasso-type methods have also been extensively used in fixed design nonparametric regression. When the design matrix is the identity matrix, (1.2) leads to soft thresholding. For soft thresholding in the case of Gaussian errors, the literature dates back to [9]. We refer to [2] for a bibliography of the intermediate years and for a discussion of the connections between Lasso-type and thresholding methods, with emphasis on estimation within wavelet bases. For general bases, further results and bibliography, we refer to [19], where, under a proper choice of $\gamma$, optimal rates of convergence over Besov spaces, up to logarithmic factors, are obtained. These results apply to models where the functions $f_j$ are orthonormal with respect to the scalar product induced by the empirical norm. For possible departures from the orthonormality assumption we refer to [5,6]. These two papers establish finite sample oracle inequalities for the empirical error $\|\widehat f - f\|_n^2$ and for the $\ell_1$-loss $|\widehat\lambda - \lambda|_1$. Lasso-type estimators in random design nonparametric regression have received very little attention. The first results on this subject seem to be [14,21]. In the aggregation framework described above they established oracle inequalities on the mean risk of $\widehat f$, for $\widehat\lambda^{\,\mathrm{lasso}}$ corresponding to $T = 1$ and when $M$ can be larger than $n$. However, this gives an approximation of the oracle risk with the slow rate $\sqrt{(\log M)/n}$, which cannot be improved if $\widehat\lambda^{\,\mathrm{lasso}}$ with fixed $T$ is considered [14,21]. Oracle inequalities for the empirical error $\|\widehat f - f\|_n^2$ and for the $\ell_1$-loss $|\widehat\lambda - \lambda|_1$ with faster rates are obtained for $\widehat\lambda = \widehat\lambda^{\,\mathrm{pen}}$ in [6], but they are operational only when $M < \sqrt{n}$. The paper [15] studies somewhat different estimators involving the $\ell_1$-norms of the coefficients. For a specific choice of basis functions $f_j$ and with $M < \sqrt{n}$, it proves optimal (up to a logarithmic factor) rates of convergence of $\widehat f$ on Besov classes, without establishing oracle inequalities. Finally, we mention the papers [17,18,27] that analyze, in the same spirit as we do below, the sparsity issue for estimators that differ from $\widehat\lambda^{\,\mathrm{pen}}$ in that the goodness-of-fit term in the minimized criterion is not the residual sum of squares.
In the present paper we extend the results of [6] in several ways; in particular, we cover sizes $M$ of the dictionary that can be larger than $n$. To our knowledge, theoretical results for $\widehat\lambda^{\,\mathrm{pen}}$ and the corresponding $\widehat f$ when $M$ can be larger than $n$ have not been established for random design in either the nonparametric regression or the aggregation framework. Our considerations are related to a remarkable feature of $\ell_1$-aggregation: $\widehat\lambda^{\,\mathrm{pen}}$, for an appropriate choice of the tuning sequence $r_{n,M}$, has components exactly equal to zero, thereby realizing subset selection. In contrast, for penalties proportional to the squared $\ell_2$-norm $\sum_{j=1}^M \lambda_j^2$, no estimated coefficients will be set to zero in finite samples; see, e.g., [22] for a discussion. The purpose of this paper is to investigate and quantify when $\ell_1$-aggregation can be used as a dimension reduction technique. We address this by answering the following two questions: "When does $\widehat\lambda \in \mathbb{R}^M$, the minimizer of (1.3), behave like an estimate in a dimension that is possibly much lower than $M$?" and "When does the aggregate $\widehat f$ behave like a linear approximation of $f$ by a smaller number of functions?" We make these questions precise in the following subsection.
To motivate and introduce our notion of sparsity we first consider the simple case of linear regression. The standard assumption used in the literature on linear models is $\mathbb{E}(Y \mid X) = f(X) = \lambda_0' X$, where $\lambda_0 \in \mathbb{R}^M$ has non-zero coefficients only for $j \in J(\lambda_0)$; here and in what follows, $J(\lambda) = \{j : \lambda_j \neq 0\}$ denotes the set of non-zero coordinates of $\lambda$ and $M(\lambda) = \mathrm{card}\, J(\lambda)$ its cardinality. Clearly, the $\ell_1$-norm $|\widehat\lambda^{\,\mathrm{OLS}} - \lambda_0|_1$ is of order $M/\sqrt{n}$, in probability, if $\widehat\lambda^{\,\mathrm{OLS}}$ is the ordinary least squares estimator of $\lambda_0$ based on all $M$ variables. In contrast, the general results of Theorems 2.1, 2.2 and 2.3 below show that $|\widehat\lambda - \lambda_0|_1$ is bounded, up to known constants and logarithms, by $M(\lambda_0)\sqrt{(\log M)/n}$, for $\widehat\lambda$ given by (1.3), if in the penalty term (1.4) we take $r_{n,M} = A\sqrt{(\log M)/n}$. This means that the estimator $\widehat\lambda$ of the parameter $\lambda_0$ adapts to the sparsity of the problem: its estimation error is smaller when the vector $\lambda_0$ is sparser. In other words, we reduce the effective dimension of the problem from $M$ to $M(\lambda_0)$ without any prior knowledge about the set $J(\lambda_0)$ or the value $M(\lambda_0)$. The improvement is particularly important if $M(\lambda_0) \ll M$. Since in general $f$ cannot be represented exactly by a linear combination of the given elements $f_j$, we introduce two ways in which $f$ can be close to such a linear combination. The first one expresses the belief that, for some $\lambda^* \in \mathbb{R}^M$, the squared distance from $f$ to $f_{\lambda^*}$ can be controlled, up to logarithmic factors, by $M(\lambda^*)/n$. We call this "weak sparsity". The second one does not involve $M(\lambda^*)$ and states that, for some $\lambda^* \in \mathbb{R}^M$, the squared distance from $f$ to $f_{\lambda^*}$ can be controlled, up to logarithmic factors, by $n^{-1/2}$. We call this "weak approximation".
We now define weak sparsity. Let $C_f > 0$ be a constant depending only on $f$, and define the oracle set
$$ \Lambda = \big\{ \lambda \in \mathbb{R}^M : \|f_\lambda - f\|^2 \le C_f\, r_{n,M}^2\, M(\lambda) \big\}. \tag{1.5} $$
Here and later we denote by $\|\cdot\|$ the $L_2(\mu)$-norm,
$$ \|g\|^2 = \int_{\mathcal{X}} g^2 \, d\mu, $$
and by $\langle f, g \rangle$ the corresponding scalar product, for any $f, g \in L_2(\mu)$.
If $\Lambda$ is non-empty, we say that $f$ has the weak sparsity property relative to the dictionary $\{f_1, \dots, f_M\}$. We do not need $\Lambda$ to be a large set: $\mathrm{card}(\Lambda) = 1$ would suffice. In fact, under the weak sparsity assumption, our targets are $\lambda^*$ and $f^* = f_{\lambda^*}$, with
$$ \lambda^* = \arg\min_{\lambda \in \Lambda} M(\lambda), $$
where $k^* = M(\lambda^*)$ is the effective or oracle dimension. All three quantities, $\lambda^*$, $f^*$ and $k^*$, can be considered as oracles. Weak sparsity can be viewed as a milder version of the strong sparsity (or simply sparsity) property, which commonly means that $f$ admits the exact representation $f = f_{\lambda_0}$ for some $\lambda_0 \in \mathbb{R}^M$, with hopefully small $M(\lambda_0)$.
To illustrate the definition of weak sparsity, we consider the framework (I). Then $\|f_\lambda - f\|$ is the approximation error relative to $f_\lambda$, which can be viewed as a "bias term". For many traditional bases $\{f_j\}$ there exist vectors $\lambda$ with the first $M(\lambda)$ coefficients non-zero and the other coefficients zero, such that $\|f_\lambda - f\| \le C (M(\lambda))^{-s}$ for some constant $C > 0$, provided that $f$ is a smooth function with $s$ bounded derivatives. The corresponding variance term is typically of order $M(\lambda)/n$, so the inequality $\|f_\lambda - f\|^2 \le C_f r_{n,M}^2 M(\lambda)$ with $r_{n,M}^2 \sim 1/n$ can be viewed as the bias-variance balance, realized for $M(\lambda) \sim n^{1/(2s+1)}$. We choose $r_{n,M}$ slightly larger, but this does not essentially affect the interpretation of $\Lambda$. In this example, the fact that $\Lambda$ is non-void means that there exists $\lambda \in \mathbb{R}^M$ that approximately (up to logarithms) realizes the bias-variance balance, or at least undersmoothes $f$ (indeed, we have only an inequality between the squared bias and the variance in the definition of $\Lambda$). Note that, in general, for instance if $f$ is not smooth, the bias-variance balance can be realized on very bad, even inconsistent, estimators.
We now define another oracle set,
$$ \Lambda' = \big\{ \lambda \in \mathbb{R}^M : \|f_\lambda - f\|^2 \le C'_f\, r_{n,M} \big\}, $$
where $C'_f > 0$ is a constant depending only on $f$. If $\Lambda'$ is non-empty, we say that $f$ has the weak approximation property relative to the dictionary $\{f_1, \dots, f_M\}$. For instance, in the framework (III) related to aggregation, $\Lambda'$ is non-empty if we consider functions $f$ that admit $n^{-1/4}$-consistent estimators in the set of linear combinations $f_\lambda$, for example, if at least one of the $f_j$'s is $n^{-1/4}$-consistent. This is a modest rate, and such an assumption is quite natural if we work with standard regression estimators $f_j$ and functions $f$ that are not extremely non-smooth. We will use the notion of weak approximation only in the mutual coherence setting that allows for mild correlation among the $f_j$'s and is considered in Section 2.2 below. Standard assumptions that make our finite sample results work in the asymptotic setting, when $n \to \infty$ and $M \to \infty$, are $r_{n,M} = A\sqrt{(\log M)/n}$ for some sufficiently large $A$ and $M(\lambda) \le A'\sqrt{n/\log M}$ for some sufficiently small $A'$, in which case all $\lambda \in \Lambda$ satisfy $C_f r_{n,M}^2 M(\lambda) \le C'_f r_{n,M}$ for a constant $C'_f > 0$ depending only on $f$, and weak approximation follows from weak sparsity. However, in general, $r_{n,M}$ and $C_f r_{n,M}^2 M(\lambda)$ are not comparable, so it is not true that weak sparsity implies weak approximation or vice versa; in particular, weak sparsity implies weak approximation only when $M(\lambda)$ is smaller in order than $\sqrt{n/\log M}$, for our choice of $r_{n,M}$.

General assumptions
We begin by listing and commenting on the assumptions used throughout the paper.

The first assumption refers to the error terms
Assumption (A1). The random variables $X_1, \dots, X_n$ are independent, identically distributed random variables with probability measure $\mu$. The random variables $W_i$ are independently distributed, with $\mathbb{E}(W_i \mid X_1, \dots, X_n) = 0$ and $\mathbb{E}(|W_i|^m \mid X_1, \dots, X_n) \le \frac{m!}{2} b^{m-2}\sigma^2$ for all integers $m \ge 2$ and some constants $b > 0$ and $\sigma^2 > 0$.

We also impose mild conditions on $f$ and on the functions $f_j$. Let $\|g\|_\infty = \sup_{x \in \mathcal{X}} |g(x)|$ for any bounded function $g$ on $\mathcal{X}$.

Assumption (A2). (a) There exists
Remark 1. We note that (a) trivially implies (c). However, as the implied bound may be too large, we opted for stating (c) separately. Note also that (a) and (d) imply the following: for any fixed $\lambda \in \mathbb{R}^M$, there exists a positive constant $L(\lambda)$, depending on $\lambda$, such that $\|f - f_\lambda\|_\infty \le L(\lambda)$.

Sparsity oracle inequalities
In this section we state our results. They have the form of sparsity oracle inequalities that involve the value $M(\lambda)$ in the bounds for the risk of the estimators. All the theorems are valid for arbitrary fixed $n \ge 1$, $M \ge 2$ and $r_{n,M} > 0$.

Weak sparsity and positive definite inner product matrix
The further analysis of the $\ell_1$-aggregate depends crucially on the behavior of the $M \times M$ matrix
$$ \Psi_M = \big( \langle f_i, f_j \rangle \big)_{1 \le i, j \le M}. $$
In this subsection we consider the following assumption.

Assumption (A3). For any $M \ge 2$ there exists a constant $\kappa_M > 0$ such that the matrix $\Psi_M - \kappa_M\, \mathrm{diag}(\Psi_M)$ is positive semi-definite.
Note that $0 < \kappa_M \le 1$. We will always use Assumption (A3) coupled with (A2). Clearly, Assumption (A3) and part (b) of (A2) imply that the matrix $\Psi_M$ is positive definite, with the minimal eigenvalue $\tau$ bounded from below by $c_0 \kappa_M$. Nevertheless, we prefer to state both assumptions separately, because this allows us to make more transparent the role of the (potentially small) constants $c_0$ and $\kappa_M$ in the bounds, rather than working with $\tau$, which can be as small as their product.
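For a concrete dictionary, the Gram matrix $\Psi_M$ and its minimal eigenvalue can be examined numerically. The sketch below uses a hypothetical design density and a small trigonometric dictionary; the quadrature is just one convenient way to approximate the $L_2(\mu)$ inner products, not a construction from the paper.

```python
import numpy as np

def gram_matrix(basis, x, mu_density):
    """Approximate Psi_M[i, j] = <f_i, f_j>_{L2(mu)} by trapezoidal quadrature
    on a uniform grid x over [0, 1]."""
    F = np.column_stack([f(x) for f in basis])   # F[k, j] = f_j(x_k)
    w = mu_density(x) * (x[1] - x[0])            # quadrature weights
    w[0] *= 0.5
    w[-1] *= 0.5                                 # trapezoid end corrections
    return F.T @ (F * w[:, None])

# A small trigonometric dictionary and a design density bounded below by 0.5
basis = [
    lambda x: np.ones_like(x),
    lambda x: np.sqrt(2.0) * np.cos(2.0 * np.pi * x),
    lambda x: np.sqrt(2.0) * np.sin(2.0 * np.pi * x),
]
mu_density = lambda x: 0.5 + x                   # a density on [0, 1]: integrates to 1
x = np.linspace(0.0, 1.0, 20001)
Psi = gram_matrix(basis, x, mu_density)
lam_min = float(np.linalg.eigvalsh(Psi).min())
print(lam_min > 0.5)   # min eigenvalue >= mu_min for a Lebesgue-orthonormal system
```

Since the dictionary is orthonormal in $L_2[0,1]$ and the density is bounded below by $\mu_{\min} = 0.5$, the quadratic form $u^\top \Psi_M u = \int (\sum_j u_j f_j)^2 \, d\mu$ is at least $\mu_{\min} |u|^2$, which the computation confirms.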
Theorem 2.1. Assume that (A1)-(A3) hold. Then, for all $\lambda \in \Lambda$ we have, where $B_1 > 0$ and $B_2 > 0$ are constants depending on $c_0$ and $C_f$ only, and for some positive constants $c_1$, $c_2$ depending on $c_0$, $C_f$ and $b$ only. Since we favored readable results and proofs over optimal constants, not too much attention should be paid to the values of the constants involved. More details about the constants can be found in Section 4.

Weak sparsity and mutual coherence
The results of the previous subsection hold uniformly over $\lambda \in \Lambda$ when the approximating functions satisfy Assumption (A3). We recall that implicit in the definition of $\Lambda$ is the fact that $f$ is well approximated by a smaller number of the given functions $f_j$. Assumption (A3) on the matrix $\Psi_M$ is, however, independent of $f$.
A refinement of our sparsity results can be obtained for $\lambda$ in a set $\Lambda_1$ that combines the requirements for $\Lambda$, while replacing (A3) by a condition on $\Psi_M$ that also depends on $M(\lambda)$. Following the terminology of [8], we consider now matrices $\Psi_M$ with the mutual coherence property. Denote by
$$ \rho_M(i, j) = \frac{\langle f_i, f_j \rangle}{\|f_i\|\,\|f_j\|}, \qquad i \neq j, $$
the correlation between elements $i \neq j$. We will assume that this correlation is relatively small for $i \in J(\lambda)$. Our condition is somewhat weaker than the mutual coherence property defined in [8], where all the correlations for $i \neq j$ are supposed to be small. In our setting the correlations $\rho_M(i, j)$ with $i, j \notin J(\lambda)$ can be arbitrarily close to 1 or to $-1$. Note that such $\rho_M(i, j)$ constitute the overwhelming majority of the elements of the correlation matrix if $J(\lambda)$ is a set of small cardinality. With $\Lambda$ given by (1.5), define
$$ \rho(\lambda) = \max_{i \in J(\lambda)} \max_{j \neq i} |\rho_M(i, j)| \quad \text{and} \quad \Lambda_1 = \{ \lambda \in \Lambda : \rho(\lambda) M(\lambda) \le 1/45 \}. \tag{2.1} $$

Theorem 2.2. Assume that (A1) and (A2) hold. Then, for all $\lambda \in \Lambda_1$ we have, with probability at least $1 - \pi_{n,M}(\lambda)$, where $C > 0$ is a constant depending only on $c_0$ and $C_f$. Note that in Theorem 2.2 we do not assume positive definiteness of the matrix $\Psi_M$. However, it is not hard to see that the condition $\rho(\lambda) M(\lambda) \le 1/45$ implies positive definiteness of the "small" $M(\lambda) \times M(\lambda)$ submatrix $(\langle f_i, f_j \rangle)_{i, j \in J(\lambda)}$ of $\Psi_M$. The numerical constant $1/45$ is not optimal. It can be multiplied at least by a factor close to 4 by taking constant factors close to 1 in the definition of the set $E_2$ in Section 4; the price to pay is a smaller value of the constant $c_1$ in the probability $\pi_{n,M}(\lambda)$.
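The restricted notion of coherence is easy to compute. In this sketch (the matrix and index set are invented for illustration), correlations are checked only for pairs with one index in $J(\lambda)$, so strong correlation among the remaining elements does not violate the condition $\rho(\lambda) M(\lambda) \le 1/45$:

```python
import numpy as np

def restricted_coherence(Psi, J):
    """rho(lambda): largest |correlation| rho_M(i, j) over i in J and j != i.
    Correlations between pairs entirely outside J are deliberately ignored."""
    d = np.sqrt(np.diag(Psi))
    R = Psi / np.outer(d, d)          # correlation matrix rho_M(i, j)
    rho = 0.0
    for i in J:
        for j in range(len(d)):
            if j != i:
                rho = max(rho, abs(R[i, j]))
    return rho

Psi = np.eye(6)
Psi[3, 4] = Psi[4, 3] = 0.95          # strong correlation OUTSIDE J is allowed
Psi[0, 2] = Psi[2, 0] = 0.002         # mild correlation touching J
J = [0, 1]                            # J(lambda), so M(lambda) = 2
rho = restricted_coherence(Psi, J)
print(rho * len(J) <= 1 / 45)         # condition of Theorem 2.2 holds
```

Had the strongly correlated pair (3, 4) intersected $J(\lambda)$, the product $\rho(\lambda) M(\lambda)$ would have been far above $1/45$.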

Weak approximation and mutual coherence
For $\Lambda'$ given in the Introduction, define
$$ \Lambda_2 = \{ \lambda \in \Lambda' : \rho(\lambda) M(\lambda) \le 1/45 \}. $$

Theorem 2.3. Assume that (A1) and (A2) hold. Then, for all $\lambda \in \Lambda_2$, we have, where $C' > 0$ is a constant depending only on $c_0$ and $C'_f$, and for some constants $c'_1$, $c'_2$ depending on $c_0$, $C'_f$ and $b$ only.
Theorems 2.1-2.3 are non-asymptotic results valid for any $r_{n,M} > 0$. If we study asymptotics when $n \to \infty$, or when both $n$ and $M$ tend to $\infty$, the optimal choice of $r_{n,M}$ becomes a meaningful question. It is desirable to choose the smallest $r_{n,M}$ such that the probabilities $\pi_{n,M}$, $\tilde\pi_{n,M}$, $\pi'_{n,M}$ tend to 0 (or tend to 0 at a given rate, if such a rate is specified in advance). A typical application is the case where $n \to \infty$, $M = M_n \to \infty$, and $\kappa_M$ (when using Theorem 2.1), $L_0$, $L$, $L(\lambda^*)$ are independent of $n$ and $M$. In this case the probabilities $\pi_{n,M}$, $\tilde\pi_{n,M}$, $\pi'_{n,M}$ tend to 0 as $n \to \infty$ if we choose $r_{n,M} = A\sqrt{(\log M)/n}$ for some sufficiently large $A > 0$. Condition (2.3) is rather mild. It implies, however, that $M$ cannot grow faster than an exponent of $n$ and that $M(\lambda^*) = o(\sqrt{n})$.

High-dimensional linear regression
The simplest example of an application of our results is linear parametric regression, where the number of covariates $M$ can be much larger than the sample size $n$. In our notation, linear regression corresponds to the case where there exists $\lambda^* \in \mathbb{R}^M$ such that $f = f_{\lambda^*}$. Then the weak sparsity and weak approximation assumptions hold in an obvious way with $C_f = C'_f = 0$, whereas $L(\lambda^*) = 0$, so that we easily obtain the following corollary of Theorems 2.1 and 2.2.

Corollary 1. Assume that (A1) and items (a)-(c) of (A2) hold. (i) If (A3) is satisfied, then, where $B_1 > 0$ and $B_2 > 0$ are constants depending on $c_0$ only, and for a positive constant $c_1$ depending on $c_0$ and $b$ only.
Result (3.2) can be compared to [7], which gives a control on the $\ell_2$ (not $\ell_1$) deviation between $\widehat\lambda$ and $\lambda^*$ in the linear parametric regression setting when $M$ can be larger than $n$, for a different estimator than ours. Our analysis is in several respects more involved than that in [7], because we treat the regression model with random design and do not assume that the errors $W_i$ are Gaussian. This is reflected in the structure of the probabilities $\pi^*_{n,M}$. For the case of Gaussian errors and fixed design considered in [7], sharper bounds can be obtained (cf. [5]).

Nonparametric regression and orthonormal dictionaries
Assume that the regression function $f$ belongs to a class of functions $\mathcal{F}$ described by some smoothness or other regularity conditions arising in nonparametric estimation. Let $\mathcal{F}_M = \{f_1, \dots, f_M\}$ be the first $M$ functions of an orthonormal basis $\{f_j\}_{j=1}^\infty$. Then $\widehat f$ is an estimator of $f$ obtained by an expansion w.r.t. this basis with data-dependent coefficients. Previously known methods of obtaining reasonable estimators of this type for regression with random design mainly have the form of least squares procedures on $\mathcal{F}$ or on a suitable sieve (these methods are not adaptive, since $\mathcal{F}$ should be known), or of two-stage adaptive procedures where, in the first stage, least squares estimators are computed on suitable subsets of the dictionary $\mathcal{F}_M$ and then, in the second stage, a subset is selected in a data-dependent way by minimizing a penalized criterion with the penalty proportional to the dimension of the subset. For an overview of these methods in random design regression we refer to [3], to the book [13] and to the more recent papers [4,15], where some other methods are suggested. Note that penalizing by the dimension of the subset, as discussed above, is not always computationally feasible. In particular, if we need to scan all the subsets of a huge dictionary, or at least all its subsets of large enough size, the computational problem becomes NP-hard. In contrast, the $\ell_1$-penalized procedure that we consider here is computationally feasible. We cover, for example, the case where the $\mathcal{F}$'s are the $L_0(\cdot)$ classes (see below). The results of Section 2 imply that an $\ell_1$-penalized procedure is adaptive on the scale of such classes. This can be viewed as an extension, to the more realistic random design regression model, of Gaussian sequence space results in [1,11]. However, unlike some results obtained in those papers, we do not establish sharp asymptotics of the risks.
To give precise statements, assume that the distribution $\mu$ of $X$ admits a density w.r.t. the Lebesgue measure that is bounded away from zero by $\mu_{\min} > 0$ and bounded from above by $\mu_{\max} < \infty$. Assume that $\mathcal{F}_M = \{f_1, \dots, f_M\}$ is an orthonormal system in $L_2(\mathcal{X}, dx)$. Clearly, item (b) of Assumption (A2) holds with $c_0 = \mu_{\min}$, the matrix $\Psi_M$ is positive definite, and Assumption (A3) is satisfied with $\kappa_M$ independent of $n$ and $M$. Therefore, we can apply Theorem 2.1. Furthermore, Theorem 2.1 remains valid if we replace there $\|\cdot\|$ by $\|\cdot\|_{\mathrm{Leb}}$, the norm in $L_2(\mathcal{X}, dx)$. In this context, it is convenient to redefine the oracle $\lambda^*$ in the equivalent form
$$ \lambda^* = \arg\min_{\lambda:\, M(\lambda) \le k^*} \|f_\lambda - f\|_{\mathrm{Leb}}, \tag{3.3} $$
with $k^*$ as before. It is straightforward to see that the oracle (3.3) can be written explicitly in terms of the Fourier coefficients $\langle f, f_j \rangle_{\mathrm{Leb}}$: its non-zero components are the $k^*$ coefficients of $f$ that are largest in absolute value. In the remainder of this section we consider the special case where $\{f_j\}_{j=1}^\infty$ is the Fourier basis in $L_2[0,1]$, defined by $f_1(x) \equiv 1$, $f_{2k}(x) = \sqrt{2}\cos(2\pi k x)$ and $f_{2k+1}(x) = \sqrt{2}\sin(2\pi k x)$ for $k = 1, 2, \dots$, and where $f$ is a linear combination of the first $k$ of these functions, where $k$ is an unknown integer.
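The orthonormality in $L_2[0,1]$ of a standard trigonometric system (the constant function plus scaled sines and cosines) can be checked numerically; the grid-based quadrature below is only an illustrative sketch.

```python
import numpy as np

def fourier_basis(M):
    """First M functions of the trigonometric basis, orthonormal in L2[0, 1]:
    f_1 = 1, f_{2k} = sqrt(2) cos(2 pi k x), f_{2k+1} = sqrt(2) sin(2 pi k x)."""
    fs = [lambda x: np.ones_like(x)]
    for k in range(1, (M + 1) // 2 + 1):
        fs.append(lambda x, k=k: np.sqrt(2.0) * np.cos(2.0 * np.pi * k * x))
        fs.append(lambda x, k=k: np.sqrt(2.0) * np.sin(2.0 * np.pi * k * x))
    return fs[:M]

x = np.linspace(0.0, 1.0, 100001)
F = np.column_stack([f(x) for f in fourier_basis(5)])
G = F.T @ F * (x[1] - x[0])   # Gram matrix w.r.t. the Lebesgue measure on [0, 1]
print(np.allclose(G, np.eye(5), atol=1e-3))   # orthonormal system
```

Note the `k=k` default-argument idiom, which pins the frequency inside each lambda.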
Corollary 2. Let Assumption (A1) and the assumptions of this subsection hold. Let $\gamma < 1/2$ be a given number and $M \le n^s$ for some $s > 0$. Then, for $r_{n,M} = A\sqrt{(\log n)/n}$ with $A > 0$ large enough, the estimator $\widehat f$ satisfies, where $b_1 > 0$ is a constant depending on $\mu_{\min}$ and $\mu_{\max}$ only, and $b_2 > 0$ is a constant depending also on $A$, $\gamma$ and $s$.
The proof of this corollary consists in an application of Theorem 2.1 with $M(\lambda^*) = k$ and $L(\lambda^*) = 0$, where the oracle $\lambda^*$ is defined in (3.3).
We finally give another corollary of Theorem 2.1, resulting, in particular, in classical nonparametric rates of convergence, up to logarithmic factors. Consider the class of functions where $L > 0$ is a fixed constant. This is a very large class of functions. It contains, for example, all the periodic Hölder functions on $[0,1]$ and all the Sobolev classes of functions.

Corollary 3. Let Assumption (A1) and the assumptions of this subsection hold. Let $M \le n^s$ for some $s > 0$. Then, for $r_{n,M} = A\sqrt{(\log n)/n}$ with $A > 0$ large enough, the estimator $\widehat f$ satisfies, where $\lambda^*$ is defined in (3.3), $b_3 > 0$ is a constant depending on $\mu_{\min}$ and $\mu_{\max}$ only, and the constants $b_4 > 0$ and $b_5 > 0$ depend only on $\mu_{\min}$, $\mu_{\max}$, $A$, $L$ and $s$.
This corollary implies, in particular, that the estimator $\widehat f$ adapts to unknown smoothness, up to logarithmic factors, simultaneously on the Hölder and Sobolev classes. In fact, it is not hard to see that, for example, when $f \in \mathcal{F}_\beta$ with $\beta > 1/2$ we have $M(\lambda^*) \le M_n$, where $M_n \sim (n/\log n)^{1/(2\beta+1)}$. Therefore, Corollary 3 implies that $\widehat f$ converges to $f$ at the rate $(n/\log n)^{-\beta/(2\beta+1)}$, whatever the value of $\beta > 1/2$, thus realizing adaptation to the unknown smoothness $\beta$. Similar reasoning works for the Hölder classes.
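The rate calculus in the preceding paragraph is elementary and can be sketched numerically; the sample size below is an arbitrary illustrative choice.

```python
import math

def oracle_dimension(n, beta):
    """M_n ~ (n / log n)^{1/(2*beta + 1)}: effective dimension for smoothness beta."""
    return (n / math.log(n)) ** (1.0 / (2.0 * beta + 1.0))

def adaptive_rate(n, beta):
    """(n / log n)^{-beta/(2*beta + 1)}: rate attained without knowing beta."""
    return (n / math.log(n)) ** (-beta / (2.0 * beta + 1.0))

n = 10_000
for beta in (1.0, 2.0, 4.0):
    print(beta, round(oracle_dimension(n, beta), 1), round(adaptive_rate(n, beta), 4))
```

Smoother functions (larger $\beta$) need fewer basis functions and converge faster, which is exactly the adaptation phenomenon described above.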

Proof of Theorem 2.1
Throughout this proof, $\lambda$ is an arbitrary, fixed element of $\Lambda$ given in (1.5). Recall the notation $f_\lambda = \sum_{j=1}^M \lambda_j f_j$. We begin by proving two lemmas. The first one is an elementary consequence of the definition of $\widehat\lambda$. Define the random variables and the event. We find that, on $E_1$, by the triangle inequality and the fact that $\lambda_j = 0$ for $j \notin J(\lambda)$.
The following lemma is crucial for the proof of Theorem 2.1.

Lemma 2. Assume that (A1) -(A3) hold. Define the events
Proof. Observe that Assumption (A3) implies that, on the set $E_2$, the sum over $j \in J(\lambda)$ can be bounded as above. Applying the Cauchy-Schwarz inequality to the last term on the right-hand side of (4.1) and using the inequality above, we obtain the stated bound on that set. Intersect with $E_3(\lambda)$ and use the fact that $\omega_{n,j} \ge c_0 r_{n,M}/\sqrt{2}$ on $E_2$ to derive the claim.
Proof of Theorem 2.1. Recall that $\lambda$ is an arbitrary fixed element of $\Lambda$ given in (1.5). Define the set and the event as follows. We prove that the statement of the theorem holds on the event $E(\lambda)$, and we bound $\mathbb{P}\{E(\lambda)^C\}$ by $\pi_{n,M}(\lambda)$ in Lemmas 5, 6 and 7 below.
First we observe that, on this event, the above holds. Consequently, we find further that, on the same event, the next bound holds. To finish the proof, we now show that the same conclusions hold on the event defined above (by definition of $E_4(\lambda)$). This and a reasoning similar to the one used in (4.3) yield the claim. Also, invoking again Lemma 2 in connection with (4.5), we obtain the result. The conclusion of the theorem follows from the bounds on the probabilities of the complements of the events $E_1$, $E_2$, $E_3(\lambda)$ and $E_4(\lambda)$, as proved in Lemmas 4, 5, 6 and 7 below.
The following results will make repeated use of a version of Bernstein's inequality, which we state here for ease of reference.
Lemma 3 (Bernstein's inequality). Let $\zeta_1, \dots, \zeta_n$ be independent random variables such that
$$ \frac{1}{n}\sum_{i=1}^n \mathbb{E}|\zeta_i|^m \le \frac{m!}{2} w^2 d^{m-2} $$
for some positive constants $w$ and $d$ and for all integers $m \ge 2$. Then, for any $\varepsilon > 0$ we have
$$ \mathbb{P}\Big\{ \Big| \frac{1}{n}\sum_{i=1}^n (\zeta_i - \mathbb{E}\zeta_i) \Big| \ge \varepsilon \Big\} \le 2\exp\Big( -\frac{n\varepsilon^2}{2(w^2 + d\varepsilon)} \Big). $$

Lemma 4. Assume that (A1) and (A2) hold. Then, for all $n \ge 1$, $M \ge 2$,

Proof. The proof follows from a simple application of the union bound and Bernstein's inequality, where we apply Bernstein's inequality with $w^2 = \|f_j\|^2 L^2$ and $d = L^2$, and with $\varepsilon = \frac{1}{2}\|f_j\|^2$ for the first probability and $\varepsilon = \|f_j\|^2$ for the second one.
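The tail bound of Lemma 3 can be sanity-checked by simulation. In the sketch below, the distribution (uniform on $[-1,1]$) and the constants $w^2 = \operatorname{Var}(\zeta_i) = 1/3$, $d = 1$ are illustrative choices that satisfy the moment condition:

```python
import numpy as np

# Monte Carlo check of the Bernstein bound:
#   P(|(1/n) sum (zeta_i - E zeta_i)| >= eps) <= 2 exp(-n eps^2 / (2 (w^2 + d eps)))
# For zeta_i ~ Uniform[-1, 1]: E|zeta|^m = 1/(m+1) <= (m!/2) w^2 d^{m-2}
# with w^2 = 1/3 (the variance) and d = 1, so the moment condition holds.
rng = np.random.default_rng(2)
n, reps, eps = 500, 2000, 0.05
zeta = rng.uniform(-1.0, 1.0, size=(reps, n))
w2, d = 1.0 / 3.0, 1.0
emp = float(np.mean(np.abs(zeta.mean(axis=1)) >= eps))
bound = float(2.0 * np.exp(-n * eps ** 2 / (2.0 * (w2 + d * eps))))
print(emp <= bound)   # the bound dominates the empirical tail probability
```

The bound is conservative here (as expected for a finite-sample exponential inequality), but it holds uniformly in $n$ and $\varepsilon$.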

Lemma 7. Assume (A1) -(A3). Then
Let $\psi_M(i,j)$ and $\psi_{n,M}(i,j)$ denote the $(i,j)$th entries of the matrices $\Psi_M$ and $\Psi_{n,M}$, respectively. Then, for every $\mu \in U(\lambda)$ we have the inequality above. Using the last display and the union bound, we find for each $\lambda \in \Lambda$ that the corresponding bound holds. Now, for each $(i,j)$, the value $\psi_M(i,j) - \psi_{n,M}(i,j)$ is a sum of $n$ i.i.d. zero-mean random variables. We can therefore apply Bernstein's inequality, with the constants as above, together with inequality (4.8), to obtain the result.

Proof of Theorem 2.2
Let $\lambda$ be an arbitrary fixed element of $\Lambda_1$ given in (2.1). The proof of this theorem is similar to that of Theorem 2.1. The only difference is that we now show that the result holds on the event defined below, where the set $\widetilde E_4(\lambda)$ is given as above. We bounded $\mathbb{P}\{(E_1 \cap E_2)^C\}$ and $\mathbb{P}\{E_3(\lambda)^C\}$ in Lemmas 5 and 6 above.
The bound for $\mathbb{P}\{\widetilde E_4(\lambda)^C\}$ is obtained exactly as in Lemma 7, but now with the modified constants, so that we obtain the corresponding bound.
The proof of Theorem 2.2 on the set $\widetilde E(\lambda)$ proceeds as follows. By Lemma 1, on $E_1$ we have the basic inequality. Here, summing over $j \in J(\lambda)$, we used the fact that $\sum_{i, j \in J(\lambda)} \langle f_i, f_j \rangle u_i u_j \ge 0$. Combining this with the second inequality in (4.11) yields the next bound. Combining this with the fact that $r_{n,M}\|f_j\| \le \sqrt{2}\,\omega_{n,j}$ on $E_2$ and $\rho(\lambda)M(\lambda) \le 1/45$, we find the required estimate. Intersect with $E_3(\lambda)$ and use the fact that $\omega_{n,j} \ge c_0 r_{n,M}/\sqrt{2}$ on $E_2$ to derive the claim, with probability greater than $1 - \tilde\pi_{n,M}(\lambda)$.

Case (b).
In this case it is sufficient to show that, for a constant $C' > 0$, the bound (4.13) holds on some event $E'(\lambda)$ with $\mathbb{P}\{E'(\lambda)\} \ge 1 - \pi'_{n,M}(\lambda)$. We proceed as follows. Define the set and the event as above. We prove the result by considering two cases separately. Recall that being in Case (b) means that $\|f_\lambda - f\|^2 > r_{n,M}^2 M(\lambda)$. This, coupled with (4.14) and with the inequality $\|\widehat f - f_\lambda\| \le \|f_\lambda - f\|$, shows that the right-hand side of (4.9) in Lemma 8 can be bounded, up to multiplicative constants, by $\|f_\lambda - f\|^2$. Thus, on the event $E'(\lambda)$, for some constant $C > 0$. Combining this with (4.14) we get (4.13), as desired. Let now $\|\widehat f - f_\lambda\| > \|f_\lambda - f\|$. Then, by Lemma 8, we get that $\widehat\lambda - \lambda \in U'(\lambda)$ on $E_1 \cap E_2 \cap E_3(\lambda)$. Using this fact and the definition of $E_5(\lambda)$, we find the bound on this event. Repeating the argument in (4.4), with the only difference that we now use Lemma 8 instead of Lemma 2, and recalling that $\|f_\lambda - f\|^2 > r_{n,M}^2 M(\lambda)$ since we are in Case (b), we get, for some constants $C > 0$, $C'' > 0$. Therefore,

Note that (4.15) and (4.16) have the same form (up to multiplicative constants) as the condition $\|\widehat f - f_\lambda\| \le \|f_\lambda - f\|$ and the inequality (4.14), respectively. Hence, we can use the reasoning following (4.14) to conclude that on $E'(\lambda) \cap \{\|\widehat f - f_\lambda\| > \|f_\lambda - f\|\}$ inequality (4.13) holds true. The result of the theorem now follows from the bound $\mathbb{P}\{E'(\lambda)^C\} \le \pi'_{n,M}(\lambda)$, which is a consequence of Lemmas 5, 6 and of the next Lemma 9.
We follow again the argument of Theorem 2.1, first invoking Lemma 8 given below to argue that $\widehat\lambda - \lambda \in \widetilde U(\lambda)$ (this lemma plays the same role as Lemma 2 in the proof of Theorem 2.1), and then reasoning exactly as in (4.4).