Adaptive spectral regularizations of high dimensional linear models

This paper focuses on recovering an unknown vector $\beta$ from the noisy data $Y=X\beta +\sigma\xi$, where $X$ is a known $n\times p$-matrix, $\xi $ is a standard white Gaussian noise, and $\sigma$ is an unknown noise level. In order to estimate $\beta$, a spectral regularization method is used, and our goal is to choose its regularization parameter with the help of the data $Y$. In this paper, we deal solely with regularization methods based on the so-called ordered smoothers and provide some oracle inequalities in the case, where the noise level is unknown.

In spite of its simplicity, this mathematical model plays an important role in solving practical inverse problems like gravity problems (see, e.g.[3]), tomography inverse problems [12], and many others.As a rule, in inverse problems n and p are very large and therefore the main goal in this paper is to propose an approach suitable for n = ∞, p = ∞, severely ill-posed matrices X ⊤ X, and the unknown noise level.
We begin with the standard maximum likelihood estimate β0 = arg min where z 2 = n k=1 z 2 (k).It is well known and easy to check that and thus, the mean square risk of β0 is computed as follows: where λ(k) and ψ k ∈ R n here and below are the eigenvalues and the eigenvectors of X ⊤ X X ⊤ Xψ k = λ(k)ψ k .
In this paper, it is assumed solely that λ(1) ≥ λ(2) ≥ • • • ≥ λ(p).This assumption together with (1.1) reveals the main difficulty in β0 : its risk may be very large when p is large or when X ⊤ X has a large condition number.
The natural idea to improve β0 is to suppress large λ −1 (k) in (1.1) with the help of a linear smoother.Therefore we make use of the following family of linear estimates βα = H α β0 , α ∈ (0, α • ], where H α , α ∈ (0, α • ] is a family of p × p-smoothing matrices.
In what follows, we deal with the smoothing matrices admitting the following representation where H α (λ) : In the literature (see, e.g., [9]), this method is called spectral regularization.It covers widely used regularizations methods such as the Tikhonov-Phillips regularization [20] known in the statistical literature as ridge regression, Landweber's iterations [14], the µ-method (see e.g.[9]), and many others.
Summarizing, β is estimated with the help of the family of linear estimates βα , α ∈ (0, α • ] defined by (1.2) and our goal is to find based on the data at hand the best estimator within this family.Notice that for given α, the mean square risk of βα is computed as follows: where It is easily seen from (1.3) that the variance of βα is always smaller than that one of β0 , but βα has a non-zero bias and therefore adjusting α we may improve the risk of β0 .This improvement may be very significant when β, ψ k 2 are small for large k.In practice, a good choice of the regularizing matrix family H α is a delicate problem related to the computational complexity of βα .For details, we refer interested readers to [9].
As a rule, practical spectral regularization methods (the spectral cut-off, the Tikhonov-Phillips regularization, Landweber's iterations) represent the so-called ordered smoothers [13].This means that the family of functions {H α (λ), α ∈ (0, α • ]} is ordered in the following sense: The next important question usually arising in practice is related to the data-driven choice of the regularization parameter α.In statistical literature, one can find several general approaches to this problem.We cite here, for instance, the Lepski method which has been adopted to inverse problems in [17], [2], [5], and the model selection technique which was used in [15]. The approach proposed in this paper is a slight modification of the unbiased risk estimation.To make the presentation simpler, we begin with the case, where the noise level σ 2 is known.Intuitively, a good data-driven regularization parameter should minimize in some sense the risk L α (β) (see (1.3)).Obviously, the best regularization parameter minimizing L α (β) cannot be used since it depends on the unknown parameter of interest β.However, the idea of minimization of L α (β) may be put into practice with the help of the empirical risk minimization principle defining the regularization parameter as follows: where and P en(α) : (0, α • ] → R + is a given function called penalty.The main idea in this approach is to link L α (β) and R σ α [Y, P en].Heuristically, we want to find a minimal penalty P en(α) that ensures the following inequality where C is a random variable that doesn't depend on α.It is convenient to define this constant as follows: The traditional approach to solving (1.5) is based on the minimization of the unbiased risk estimate.In this method, the penalty is computed as a root of the equation One can check with a simple algebra that The idea of this penalty goes back to [1] and [7] provides some oracle inequalities related to this approach.Another well-known and widely used approach to the data-driven choice of α is related to the cross validation technique [8].In the framework of our statistical model, this method prompts a data-driven regularization parameter which is close to It is well-known (see e.g.[13]) that if the risk of β is measured by E X β − Xβ 2 , then this penalty is nearly optimal and it works always well.However, the question Does αCV works well when the risk is measured by E β − β 2 ?has a delicate answer depending on the spectrum of X ⊤ X.To the best of our knowledge there are no oracle inequalities controlling the risk of β αCV uniformly in β.Notice, however, that one can show with the help of the method for computing minimal penalties in [4], that if λ(k) ≤ exp(−κk), then the risk of this method blows up starting from some κ > 0.
The similar effect takes place in the unbiased risk estimation.This happens because the standard deviation of R σ α [Y, P en u ] + C may be very large with respect to the mean E R σ α [Y, P en u ] + C and therefore (1.5) may fail with a high probability.
To improve the above mentioned drawbacks of the unbiased risk estimation, we define, following [11], the penalty as a minimal root of the equation where [x] + = max{0, x} and C 1 > 1 is a constant.Heuristic motivation behind this approach is rather transparent.We are looking for the minimal penalty that balances the excess risks corresponding to all possible α ∈ (0, α • ].Recall that the excess risk is defined by the difference between the risk of the estimate and its penalized empirical risk.Note that in view of (1.5), we can deal solely with the positive part of the excess risk.
In order to explain heuristically how Equation (1.7) may be solved, we begin with the spectral representation of the underlying statistical problem.One can check easily that where ξ ′ (k) are i.i.d.N (0, 1).With these notations, βα admits the following representation where β(k) = β, ψ k , and therefore ( In what follows, it is assumed that the penalty has the following structure where γ is a small positive number and Q(α), α > 0 is a positive function of α to be defined later on.Recall that the first term at the right-hand side is obtained from the unbiased risk estimation (see Equation (1.6)).With P en(α) we can rewrite the excess risk as follows: ( The first idea in solving (1.7) is based on the the fact that the cross term is typically small with respect to E R σ α [Y, P en] + C (see for more details Lemma 9 in [11]).With this in mind, omitting the cross term, Equation (1.7) can be rewritten in the following nearly equivalent form (1.10) where and . Now we are in a position to compute an approximation of the minimal root for (1.10).It is clear that (1.11) To find a feasible solution to (1.11), we make use of the exponential Chebychev inequality resulting in where η is a random variable, Γ(•) is the gamma function, and λ > 0. Therefore we define Q + (α) as a root of equation It is easy to check with a simple algebra that where µ α is a root of the equation and (1.15) The next result (see also Theorem 1 in [11]) shows that Q + (α) is a nearly optimal solution to (1.10).

Proposition 1 For any
where here and throughout the paper C denotes a generic constant.
Let us now turn to the case, where σ is unknown.To compute the datadriven regularization parameter in this situation, we replace Thus we arrive at the following approximation of the empirical risk and the data-driven regularization parameter is computed now as follows: Notice that in contrast to the case of known σ, it assumed that α is bounded from below by α • .This constraint ensures that with a hight probability Unfortunately, when this inequality fails we cannot control correctly the risk of βα since it may blow up (see [4] for similar phenomenon in the model selection).So, to avoid the blowup, we need a relatively good estimate of σ, or equivalently, large 1 − H α 2 .Stress also that since α • cannot depend on σ, we would like to have α • as small as possible to be sure that the methods works for small noise levels.From a mathematical viewpoint, this means that we need a relatively good upper bound for E|σ − σ2 α|P en(α).Roughly speaking, we have to check that with a hight probability The main difficulty in proving this equation is related to the fact that the random variables σ 2 − σ2 α and P en(α) are dependent.To overcome this difficulty we make use of the law of the iterated logarithm for σ 2 − σ2 α combined with a generalization of the Hölder inequality (see Lemmas 4 and 3 below).To carry out this approach, we need the following additional condition: there exists a positive constant C 2 such that for all α ∈ (0, where Denote for brevity The following theorem controls the risk of βα via the penalized oracle risk defined by r(β) (1.13-1.15)and suppose (1.16-1.17)hold.Then, uniformly in where R(x) = x/log(x).
Notice that Equation 1.18 can be rewritten in the following concise form where The statistical sense of (1. 19) is rather transparent: this equation shows that in typical nonparametric situations the method works like the ideal penalized oracle with the risk r(β).The typical nonparametric situation means that • the vector ( β, ψ 1 , . . .β, ψ p ) ⊤ contains many significant components, and thus r(β) ≫ σ 2 D(α • ).
These assumptions are typical in the minimax estimation, where it is assumed that β belongs to an ellipsoid.Notice that with the help of (1.19-1.20)one can check relatively easily that for a proper chosen spectral regularization, βα is the asymptotically minimax estimate up to a constant (see for details [11] and [18]).
We finish this section with a short discussion of Conditions (1.16-1.17).Equation (1.16) means that h α (k) vanishes rather rapidly for large k.This is always true for the spectral cut-off method (h and it is seen easily that (1.17) is fulfilled.Assume now that X ⊤ X is severely ill-posed, i.e., λ(k) ≍ exp(−κk) with κ > 0. Then Therefore (1.17) holds with

Ordered processes and their basic properties
The main results in this paper are based on a general fact which is similar to Dudley's entropy bound (see, e.g., [21]).Let ζ t be a separable zero mean random process on R + .Denote for brevity The following fact (see Lemma 1 in [11]) plays a cornerstone role in the proof of Proposition 1 and Theorem 1.
u , u ∈ R + , be a continuous strictly increasing function with v 2 0 = 0. Then for any λ > 0, The next two propositions (see Lemmas 2 and 3 in [11]) show that the ordered process ζ t can be controlled by the deterministic function v t .
Proposition 3 Let ζ t be an ordered process with ζ 0 = 0. Then there exists a constant C(q ′ , q) such that for all 1 < q ′ , q ≤ 2, uniformly in z > 0 , where [x] + = max(0, x).Proposition 4 Assume that there exists a monotone function v t , t ≥ 0 such that a random process for any z > 0 and some q ′ ≥ 1, q > 1.Then there exists a constant C ′ such that for any random variable τ ∈ R + the following inequality holds In what follows, we focus on typical ordered processes related to the empirical risk.The following two propositions (see Lemmas 4 and 5 in [11]) are essential in controlling the cross term in the case, where α is a random variable depending on ξ ′ (k), k = 1, . . ., p.
Proposition 5 For any given ᾱ > 0 and any z > 0, Proposition 6 Let ᾱ be a given smoothing parameter.Then for any p ∈ [1, 2), there exists a constant C(p) so that for any data-driven smoothing parameter α, .
In order to obtain oracle inequalities in the case, where the noise variance is unknown, we will need the following lemma generalizing Proposition 3.
Proof.Since h α (•), α ≥ 0, is the family of ordered functions, it is not difficult to check that Indeed, we can rewrite (2.1) in the following equivalent form , . . ., p, and we get thus proving (2.1).Since ζ α (b) is a Gaussian process, we obtain by (2.1) We may assume without loss of generality that σ α is a continuous function in α ∈ R + .Then let us fix some ǫ > 0 and define , the set of α l is always finite but it may be empty.Let α * be a root of the equation v 2 α * (b) = 1.Then by Proposition 2 and (2.2) we obtain This equation with ǫ = 0.5 completes the proof.

Recovering the noise variance
In this section, we focus on basic probabilistic properties of the variance estimator in the case, where α is a data-driven smoothing parameter.We begin with a simple auxiliary fact.
Proof.Consider the following function where x * is a root of the equation Therefore x * ≤ x * , where x * is a root of the following linear equation Since q > 1, with a little algebra we get thus arriving at the following upper bound F (z, y) ≤ y (2 − q) q log q 1 + qy z .
Now we are in a position to finish the proof.Notice that for any λ > 0 and therefore Next, substituting in the above equation , we obtain (2.4) Finally, applying the following inequality + 2 q−1 log q (1 + y), x, y > 0, we get and combining this equation with (2.4), we finish the proof of (2.3).
Lemma 3 Let η be a nonnegative sub-Gaussian random variable, i.e., such that for all λ > 0 and some S > 0
Let F (x) = x log q (1 + x).It is clear that is increasing in x and therefore F (x) is convex.Therefore by Jensen's inequality Eη ′q log q 1 + η ′q Eη ′q ≥ log(2)Eη ′q , and thus, we arrive at (2.6).

Lemma 4
Let and Then for any s ∈ (1, 2], Proof.For some ǫ > 0 define α k , k ≥ 0, as roots of equations Then, denoting for brevity we obtain where f (ǫ) will be chosen later on.
Since log(1 + x) ≥ x − x 2 /2, x ≥ 0, then for any λ > 0 and by the exponential Tchebychev inequality we get (2.9) To bound from above the last term in Equation (2.7), we make use of that 2h α (k) − h 2 α (k) is a family of ordered functions, and thus (see (2.1)) Similarly to (2.8) Therefore with the help of Proposition 2 and the exponential Tchebychev inequality we obtain (2.10) Now we chose f (ǫ) to balance the exponents at the right-hand sides in (2.9) and (2.10), thus arriving at following equation for this function This yields With this f (ǫ) and with (2.7-2.10)we get Finally, choosing ǫ as a root of f (ǫ) = (s − 1)/2, we finish the proof.We summarize the main properties of the variance estimator in the following lemma.

Proof of Theorem 1
The following proposition (see Lemma 7 in [11]) summarizes some basic properties of the penalty defined by (1.13-1.15).Proposition 7 We begin the proof of Theorem 1 with a simple generalization of Proposition 3. Consider the following random process where , where Proof.It is based on the following fact.Let S(x) = x 1/(q−1) , x ∈ R + .Then where S −1 (x) = x q−1 denotes the inverse function to S(x).
To prove this inequality, define α k , k = 0, 1, 2, . . .as roots of the following equations Then, noticing that η ǫ α − η ǫ α k is an ordered process, we obtain by (1.12) and Proposition 2 Next we get by the Laplace method x q/(q−1) e −x 2 /2 dx ≤ C q/(q−1) 1 z 1/(q−1) exp q 2(q − 1) log q q − 1 . (2.15) To finish the proof, denote for brevity Then by (2.15) we obtain with a simple algebra The following important lemma provides an upper bound for
Proof.In view of the definition of α, for any given smoothing parameter ᾱ, R α[Y, P en] ≤ R ᾱ[Y, P en].It is easy to check with the help of (1.8) that this inequality is equivalent to the following one where hα (k) = 2h α (k) − h 2 α (k).We can rewrite (2.16) as follows: (2.17) Since ᾱ is given, we get by Jensen's inequality The third line in (2.17) is bounded by Proposition 1 as follows: where γ ≤ 1/ √ 2. The upper bound for the fourth line in (2.17) is a little bit more tricky.Since hα (•), α ∈ (0, α • ] is a family of ordered functions, we obtain by Proposition 5 that for any ǫ > 0 and given To continue this inequality, notice that if α ≥ ᾱ, then and therefore (2.21) (2.22) So, combining (2.21-2.22)with Young's inequality .
(2.23) Thus, by (2.20) and (2.23) we obtain Therefore, substituting in the above equation q ′ = 2/3 and (2.24) Now we proceed with the last line in Equation (2.17).Since ᾱ is given, we have by (2.11) and therefore (2.25) The last term in (2.17) can be bounded by Lemma 5 and (1.16) as follows: The next idea in the proof of Theorem 1 is that the data-driven parameter α defined by (1.4) cannot be very small, or equivalently, that the ratio D(α)/D(α • ) cannot be very large.
With this inequality we obtain With the help of (2.32) we continue (2.28) as follows: To control from below the left-hand side in the above equation, notice that . (2.34) To finish the proof, let us consider the function f (x) = x log 1+γ/4 (x), x ≥ 1. Computing its second order derivative, one can easily check that f (x) is convex for all x ≥ exp(1) = e.So, f (x + e − 1) is convex for x ≥ 1.It is easily seen there exists a constant C > 0 such that for all x ≥ 1 Therefore by (2.34) and Jensen's inequality, It is easy to check that the inverse function ψ −1 (x) satisfies the following inequality ψ −1 (x) ≤ (x + e − 1) log −1−γ/4 (x + e − 1).
Now we are ready to proceed with the proof of Theorem 1.Let ǫ > 0 be a small given number to be defined later on.By (1.8) and (1.9), the following equation for the skewed excess risk holds.
We proceed with the second line from below at the right-hand side of this display.By Lemmas 6 and 8, we obtain (2.37) The next step is to bound the third line from below at the right-hand side of (2.36  (2.39) Let us consider the function With (2.40) we continue (2.39) as follows: Therefore, substituting this equation in we get (2.41) Hence, to finish the proof of the theorem, it remains to check that ) = (x + e − 1) log 1+γ/4 (x + e − 1).