Ordered Smoothers With Exponential Weighting

The main goal of this paper is to propose a new method for deriving oracle inequalities related to the exponential weighting method. For the sake of simplicity, we focus on recovering an unknown vector from noisy data with the help of a family of ordered smoothers. The estimators within this family are aggregated using exponential weighting, and the aim is to control the risk of the aggregated estimate. Based on simple probabilistic properties of the unbiased risk estimate, we derive new oracle inequalities and show that exponential weighting makes it possible to improve Kneip's oracle inequality.


Introduction and main results
This paper deals with the simplest linear model

Y_i = µ_i + σ ξ_i,  i = 1, …, n,    (1.1)

where ξ_i is a standard white Gaussian noise. For the sake of simplicity it is assumed that the noise level σ > 0 is known. The goal is to estimate an unknown vector µ ∈ R^n based on the data Y = (Y_1, …, Y_n)^⊤. In this paper, µ is recovered with the help of linear estimates

μ^h(Y) = h • Y,  i.e.  μ^h_i(Y) = h_i Y_i,  i = 1, …, n,  h ∈ H,    (1.2)

where H is a finite set of vectors in R^n which will be described later on.
In what follows, the risk of an estimate μ(Y) = (μ_1(Y), …, μ_n(Y))^⊤ is measured by

R(μ, µ) = E_µ ‖μ(Y) − µ‖²,

where E_µ is the expectation with respect to the measure P_µ generated by the observations from (1.1), and ‖·‖, ⟨·, ·⟩ stand for the norm and the inner product in R^n:

‖x‖² = Σ_{i=1}^n x_i²,  ⟨x, y⟩ = Σ_{i=1}^n x_i y_i.
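To fix ideas, here is a minimal numerical sketch of the sequence model (1.1) and of a linear estimate (1.2); the mean vector, the multipliers, and all numerical choices below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100, 1.0

# An illustrative "smooth" mean vector: coefficients decaying polynomially.
mu = 5.0 / (1.0 + np.arange(1, n + 1)) ** 1.5

# Sequence model (1.1): Y_i = mu_i + sigma * xi_i with white Gaussian noise.
Y = mu + sigma * rng.standard_normal(n)

# A linear estimate (1.2): componentwise shrinkage by a multiplier vector h.
h = 1.0 / (1.0 + 0.05 * np.arange(1, n + 1) ** 2)   # illustrative multipliers
mu_hat = h * Y

# Monte Carlo approximation of the risk R(mu^h, mu) = E ||h*Y - mu||^2.
reps = 2000
losses = [np.sum((h * (mu + sigma * rng.standard_normal(n)) - mu) ** 2)
          for _ in range(reps)]
print("empirical risk:", np.mean(losses))
# Exact risk of a linear estimate: sum((1-h_i)^2 mu_i^2) + sigma^2 sum(h_i^2).
print("exact risk:   ", np.sum((1 - h) ** 2 * mu ** 2) + sigma ** 2 * np.sum(h ** 2))
```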
Since the mean square risk of μ^h(Y),

R(μ^h, µ) = E_µ ‖μ^h(Y) − µ‖²,

depends on h ∈ H, one can minimize it by choosing h ∈ H properly. Very often the minimal risk r_H(µ) = min_{h∈H} R(μ^h, µ) is called the oracle risk. Obviously, one cannot make use of the oracle estimate µ*(Y) = h* • Y, h* = arg min_{h∈H} R(μ^h, µ), because it depends on the underlying vector. However, one could try to construct an estimator μ^H(Y) based on the family of linear estimates μ^h(Y), h ∈ H, with a risk mimicking the oracle risk. This idea means that the risk of μ^H(Y) should be bounded by the so-called oracle inequality

R(μ^H, µ) ≤ r_H(µ) + ∆_H(µ),

which holds uniformly in µ ∈ R^n. Heuristically, this inequality is meaningful when the remainder term ∆_H(µ) is smaller than the oracle risk for all µ. In general, such an estimator does not exist, but for certain statistical models it is possible to construct an estimator μ^H(Y) (see, e.g., Theorem 1.1 below) such that ∆_H(µ) ≤ C r_H(µ) for all µ ∈ R^n, where C > 1 is a constant.
It is well known that one can find an estimator with the above properties provided that H is not very rich (see, e.g., [2]). In particular, as shown in [10], this can be done for the so-called ordered smoothers. This is why this paper deals with H containing solely ordered multipliers, defined as follows.

Definition 1.1. H is a set of ordered multipliers if, whenever h_k < g_k for some integer k and some h, g ∈ H, it follows that h_i ≤ g_i for all i = 1, …, n.
The last condition means that the vectors in H may be naturally ordered: for any h, g ∈ H there are only two possibilities, h_i ≤ g_i for all i = 1, …, n or h_i ≥ g_i for all i = 1, …, n. Therefore the estimators from (1.2) are often called ordered smoothers [10].
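The ordering property in Definition 1.1 is easy to verify numerically. The sketch below checks it for the projection (spectral cut-off) family, a standard example of ordered multipliers; the family itself is our illustration, not one fixed by the paper.

```python
import numpy as np

def is_ordered_family(H):
    """Check Definition 1.1: for any two multipliers h, g in H,
    either h_i <= g_i for all i or h_i >= g_i for all i."""
    for h in H:
        for g in H:
            if not (np.all(h <= g) or np.all(h >= g)):
                return False
    return True

n = 10
# Projection ("spectral cut-off") multipliers: h^(k) = (1,...,1,0,...,0).
H_proj = [np.concatenate([np.ones(k), np.zeros(n - k)]) for k in range(n + 1)]
print(is_ordered_family(H_proj))   # True: projections are ordered smoothers

# A non-example: two multiplier vectors that cross are not ordered.
H_bad = [np.linspace(1, 0, n), np.linspace(0, 1, n)]
print(is_ordered_family(H_bad))    # False
```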
Notice that ordered smoothers are common in statistics (see, e.g., [10]). Below we give two basic examples where these smoothers appear naturally.
Smoothing splines. They are usually used in recovering a smooth regression function f(x), x ∈ [0, 1], given the noisy observations

Z_i = f(x_i) + σ ξ'_i,  i = 1, …, n,    (1.3)

where x_i ∈ (0, 1) and ξ'_i are i.i.d. Gaussian random variables with zero mean and unit variance. It is well known that the smoothing spline is defined by

f̂_α(x, Z) = arg min_f { (1/n) Σ_{i=1}^n [Z_i − f(x_i)]² + α ∫_0^1 [f^{(m)}(x)]² dx },    (1.4)

where f^{(m)}(·) denotes the derivative of order m and α > 0 is a smoothing parameter, usually chosen with the help of Generalized Cross-Validation (see, e.g., [20]).
To transform this model into the sequence space model (1.1), consider the Demmler-Reinsch basis [5]: functions ψ_k, k = 1, …, n, which are orthonormal with respect to the empirical inner product, ⟨ψ_j, ψ_k⟩_n = δ_{jk}, and satisfy ∫_0^1 ψ_j^{(m)}(x) ψ_k^{(m)}(x) dx = λ_k δ_{jk}, where here and below ⟨u, v⟩_n stands for the inner product

⟨u, v⟩_n = (1/n) Σ_{i=1}^n u(x_i) v(x_i).

It is assumed for definiteness that the eigenvalues λ_k are sorted in ascending order, λ_1 ≤ … ≤ λ_n.
With this basis we can represent the underlying function as follows:

f(x) = Σ_{k=1}^n c_k ψ_k(x),    (1.5)

and from (1.3) we get the sequence space model for the empirical coefficients Y_k = ⟨Z, ψ_k⟩_n. Next, substituting (1.5) in (1.4), we arrive at

f̂_α(x, Z) = Σ_{k=1}^n (1 + αλ_k)^{-1} Y_k ψ_k(x),

so that in the sequence space the smoothing spline acts as a linear estimate with ordered multipliers h_k(α) = (1 + αλ_k)^{-1}. Notice that a similar equivalence takes place in the minimax estimation of smooth regression functions from Sobolev balls [17]. The Demmler-Reinsch basis is a very useful tool for the statistical analysis of spline methods. In practice, this basis is rarely used, since there are very fast algorithms for computing smoothing splines (see, e.g., [8] and [20]).
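A minimal sketch of this sequence-space picture, assuming the classical Demmler-Reinsch eigenvalue growth λ_k ≍ (πk)^{2m}; the coefficients, the noise level, and the grid of smoothing parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, sigma = 200, 2, 0.05
k = np.arange(1, n + 1)
lam = (np.pi * k) ** (2 * m)        # assumed Demmler-Reinsch-type eigenvalues
c = 3.0 / k ** 2                    # illustrative coefficients of a smooth f
Y = c + sigma * rng.standard_normal(n)   # sequence space observations

# Smoothing spline multipliers h_k(alpha) = 1 / (1 + alpha * lam_k):
# increasing alpha shrinks every coefficient more, so the family is ordered.
for alpha in [1e-12, 1e-9, 1e-6]:
    h = 1.0 / (1.0 + alpha * lam)
    print(f"alpha={alpha:.0e}  loss={np.sum((h * Y - c) ** 2):.4f}")
```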
Spectral regularizations of large linear models. Very often in linear models we are interested in estimating Xµ ∈ R^n based on the observations

Z = Xµ + σξ,    (1.6)

where X is a known n × p matrix and ξ is a standard white Gaussian noise.
It is well known that if X^⊤X has a large condition number or p is large, then the standard maximum likelihood estimate Xμ^0(Z), where

μ^0(Z) = arg min_{µ ∈ R^p} ‖Z − Xµ‖²,

may result in a large risk. More precisely, if X^⊤X > 0, then

E_µ ‖Xμ^0(Z) − Xµ‖² = σ²p.

Usually the risk of Xμ^0(Z) may be improved with the help of some regularization. For instance, one can use the Phillips-Tikhonov regularization [19]

μ^α(Z) = arg min_{µ ∈ R^p} { ‖Z − Xµ‖² + α‖µ‖² },

where α > 0 is a smoothing parameter. It is easily seen that

μ^α(Z) = (X^⊤X + αI)^{-1} X^⊤Z.

This formula is a particular case of the so-called spectral regularizations, defined as follows (see, e.g., [6]):

μ^α(Z) = H_α(X^⊤X) μ^0(Z),

where H_α(·) : R_+ → [0, 1] is a function depending on a smoothing parameter α ∈ R_+. The matrix H_α(X^⊤X) may be easily defined when H_α(λ), λ ∈ R_+, admits a Taylor expansion H_α(λ) = Σ_{k≥0} c_k(α) λ^k, in which case H_α(X^⊤X) = c_0(α) I + Σ_{k≥1} c_k(α) (X^⊤X)^k, where I is the identity matrix.
Notice that for the Phillips-Tikhonov method we have

H_α(λ) = λ/(λ + α),

and it is clear that this family of functions is ordered in the sense of Definition 1.1. Along with the Phillips-Tikhonov regularization, the spectral cut-off and Landweber iterations (see, e.g., [6] for details) are typical examples of ordered smoothers. The standard way to construct an equivalent model for the spectral regularizations is to make use of the SVD. Let e_k, k = 1, …, p, and λ_1 ≤ λ_2 ≤ … ≤ λ_p be the eigenvectors and eigenvalues of X^⊤X. It is easy to check that

ψ_k = λ_k^{-1/2} X e_k,  k = 1, …, p,

is an orthonormal system in R^n. Therefore Z defined by (1.6) can be represented in the following equivalent form:

Z_k = ⟨Z, ψ_k⟩ = λ_k^{1/2} ⟨µ, e_k⟩ + σ ξ*_k,  k = 1, …, p,    (1.7)

where ξ*_k are i.i.d. N(0, 1). Notice also that ⟨μ^α(Z), e_k⟩ = H_α(λ_k) ⟨μ^0(Z), e_k⟩, and hence

⟨μ^α(Z), e_k⟩ = H_α(λ_k) λ_k^{-1/2} Z_k.    (1.8)

In view of (1.7) and (1.8), we see that the spectral regularization methods are equivalent to the statistical model defined by (1.1) and (1.2).
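A numerical sketch of this equivalence under assumed data: it applies two spectral filters through the eigendecomposition of X^⊤X and checks that the Tikhonov filter reproduces the closed-form estimate above. All data and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 50, 20, 0.1
X = rng.standard_normal((n, p))
mu = rng.standard_normal(p)
Z = X @ mu + sigma * rng.standard_normal(n)

# Spectral regularization mu^alpha = H_alpha(X^T X) mu^0 via eigendecomposition.
lam, E = np.linalg.eigh(X.T @ X)      # eigenvalues/eigenvectors of X^T X
mu0 = E @ (E.T @ (X.T @ Z) / lam)     # maximum likelihood estimate

def spectral_estimate(H_alpha):
    """Apply a spectral filter H_alpha(lambda) to the ML estimate."""
    return E @ (H_alpha(lam) * (E.T @ mu0))

alpha = 1.0
tikhonov = spectral_estimate(lambda l: l / (l + alpha))            # Phillips-Tikhonov
cutoff = spectral_estimate(lambda l: (l >= alpha).astype(float))   # spectral cut-off

# Sanity check: Tikhonov filter reproduces (X^T X + alpha I)^{-1} X^T Z.
direct = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ Z)
print(np.allclose(tikhonov, direct))                 # True
print(np.sum((cutoff - mu) ** 2), np.sum((tikhonov - mu) ** 2))
```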
Nowadays there are a lot of approaches aimed at constructing estimates mimicking the oracle risk. To the best of our knowledge, the principal idea in obtaining such estimates goes back to [1] and [13] and is related to the method of unbiased risk estimation [18]. The literature on this approach is so vast that it would be impractical to cite it here. We mention solely the following result by Kneip [10], since it plays an important role in our presentation. Denote by r(Y, μ^h) the unbiased risk estimate of μ^h(Y),

r(Y, μ^h) = ‖Y − μ^h(Y)‖² + σ² Σ_{i=1}^n (2h_i − 1).    (1.9)
Theorem 1.1. Let ĥ = arg min_{h∈H} r(Y, μ^h) be the minimizer of the unbiased risk estimate. Then, uniformly in µ ∈ R^n,

E_µ ‖μ^ĥ(Y) − µ‖² ≤ r_H(µ) + Kσ √(σ² + r_H(µ)),    (1.10)

where K is a universal constant.
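Kneip's selector is straightforward to implement once the unbiased risk estimate is available. The sketch below assumes the form of r(Y, μ^h) reconstructed in (1.9) and an illustrative Tikhonov-type family; it is not the paper's experimental setup.

```python
import numpy as np

def unbiased_risk(Y, h, sigma):
    """Unbiased risk estimate (1.9): ||Y - h*Y||^2 + sigma^2 * sum(2h - 1)."""
    return np.sum(((1 - h) * Y) ** 2) + sigma ** 2 * np.sum(2 * h - 1)

rng = np.random.default_rng(3)
n, sigma = 100, 1.0
mu = 5.0 / np.arange(1, n + 1)
Y = mu + sigma * rng.standard_normal(n)

# A finite ordered family: Tikhonov-type multipliers on a grid of alphas.
lam = np.arange(1, n + 1) ** 2.0
H = [1.0 / (1.0 + a * lam) for a in np.logspace(-3, 3, 61)]

# Kneip's procedure: pick the minimizer of the unbiased risk estimate.
h_hat = min(H, key=lambda h: unbiased_risk(Y, h, sigma))
print("selected risk estimate:", unbiased_risk(Y, h_hat, sigma))
print("actual loss:           ", np.sum((h_hat * Y - mu) ** 2))
```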
Another idea to construct a good estimator based on the family μ^h, h ∈ H, is to aggregate the estimates within this family using a held-out sample. Apparently, this approach was first developed by Nemirovsky in [14] and independently by Catoni (see [3] for a summary). Later, the method was extended to several statistical models (see, e.g., [21], [15], [11]).
To overcome the well-known drawbacks of sample splitting, one would like to aggregate estimators using the same observations both for constructing the estimators and for performing the aggregation. This can be done, for instance, with the help of exponential weighting. The motivation for this method is related to the problem of functional aggregation; see [16]. It has been shown that this method yields rather good oracle inequalities for certain statistical models [12], [4], [16].
In the context of the considered statistical model, the exponential weighting estimate is defined as follows:

μ̄(Y) = Σ_{h∈H} w_h(Y) μ^h(Y),

where

w_h(Y) = π_h exp{−r(Y, μ^h)/(βσ²)} / Σ_{g∈H} π_g exp{−r(Y, μ^g)/(βσ²)},

π_h are given prior weights, β > 0 is a temperature parameter, and r(Y, μ^h) is the unbiased risk estimate of μ^h(Y) defined by (1.9). It has been shown in [4] that for this method the following oracle inequality holds for β ≥ 4:

R(μ̄, µ) ≤ min_{h∈H} { R(μ^h, µ) + βσ² log(1/π_h) }.    (1.11)

Notice that for projection methods (h_k ∈ {0, 1}) this theorem holds for β ≥ 2; see [12].
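A runnable sketch of the aggregation step, using a uniform prior π_h = 1/#H for simplicity (the paper's special prior (1.12) differs) and the unbiased risk estimate from (1.9); data and family are illustrative.

```python
import numpy as np

def exponential_weighting(Y, H, sigma, beta=4.0):
    """Aggregate linear estimates h*Y with weights proportional to
    pi_h * exp(-r(Y, mu^h) / (beta * sigma^2)); uniform prior pi_h."""
    r = np.array([np.sum(((1 - h) * Y) ** 2) + sigma ** 2 * np.sum(2 * h - 1)
                  for h in H])
    logw = -r / (beta * sigma ** 2)
    w = np.exp(logw - logw.max())          # stabilize before normalizing
    w /= w.sum()
    return sum(wi * (h * Y) for wi, h in zip(w, H)), w

rng = np.random.default_rng(4)
n, sigma = 100, 1.0
mu = 5.0 / np.arange(1, n + 1)
Y = mu + sigma * rng.standard_normal(n)
H = [np.concatenate([np.ones(k), np.zeros(n - k)]) for k in range(n + 1)]

mu_bar, w = exponential_weighting(Y, H, sigma)
print("aggregated loss:", np.sum((mu_bar - mu) ** 2))
```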
It is clear that if we want to derive from (1.11) an oracle inequality similar to (1.10), then we have to choose π_h = (#H)^{-1}, where #H denotes the cardinality of H, and thus we arrive at

R(μ̄, µ) ≤ r_H(µ) + βσ² log(#H).

This oracle inequality is good only when the cardinality of H is not very large. If we deal with a nearly continuous H, like the one related to spline smoothing with a continuous smoothing parameter, this inequality is not satisfactory. To some extent, this situation may be improved; see Proposition 2 in [4]. However, looking at the oracle inequality in that proposition, one unfortunately cannot say that it is better than (1.10).
The main goal of this paper is to show that for exponential weighting one can obtain oracle inequalities with smaller remainder terms than the one in Theorem 1.1, Equation (1.10).
In order to attain this goal and to cover H of both low and very high cardinality, we make use of the special prior weights π_h defined in (1.12). Here h_+ = min{g ∈ H : g > h}, π_{h_max} = 1, where h_max is the maximal multiplier in H, and ‖·‖₁ stands for the ℓ₁-norm in R^n, i.e.,

‖x‖₁ = Σ_{i=1}^n |x_i|.

Along with these weights we will also need Condition 1.1; see (1.13)-(1.14). The next theorem (Theorem 1.3), yielding the upper bound (1.15) for the mean square risk of μ̄(Y), is the main result of this paper.
We finish this section with a short discussion concerning this theorem.
Remark 1. The condition β ≥ 4 may be improved when the multipliers h ∈ H take only two values, 0 and 1. In this case it is sufficient to assume that β ≥ 2 (see [9]).

Remark 2. In contrast to Proposition 2 in [4], the remainder term in (1.15) depends neither on the cardinality of H nor on n. It has the same structure as Kneip's oracle inequality in Theorem 1.1.

Remark 3. Comparing (1.15) with (1.10), we see that when r_H(µ)/σ² ≈ 1, the remainder terms in (1.10) and (1.15) have the same order, namely σ²; for large r_H(µ)/σ² the remainder term in (1.15) is smaller, thus showing that the upper bound for the remainder term in the oracle inequality related to the exponential weighting is better than the one in Theorem 1.1.
Remark 4. We carried out numerous simulations to compare numerically the remainder terms in (1.15) and (1.10) and to find out which β is optimal from a practical viewpoint. Below we summarize what we obtained for smoothing splines (a toy version of such a comparison is sketched after this list).
• The nearly optimal β is close to 1; unfortunately, good oracle inequalities are not available for this case.
• There is no big difference between exponential weighting with β = 1 and the classical unbiased risk estimation: both methods demonstrate almost similar statistical performance. However, when r_H(µ)/σ² is close to 1, exponential weighting usually works better.
• It seems to us that the remainder term in the oracle inequality (1.10) is too large: we could not see the square-root behavior in the simulations. On the other hand, the remainder term in (1.15) seems adequate to the simulation results.
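The following sketch mimics such an experiment on a toy spline-like family, comparing unbiased risk minimization with exponential weighting at β = 1 and a uniform prior; every numerical choice here is an illustrative assumption, not the paper's setup.

```python
import numpy as np

def ure(Y, h, sigma):
    """Unbiased risk estimate (1.9) for a linear estimate h*Y."""
    return np.sum(((1 - h) * Y) ** 2) + sigma ** 2 * np.sum(2 * h - 1)

rng = np.random.default_rng(6)
n, sigma, reps = 100, 1.0, 500
k = np.arange(1, n + 1)
mu = 3.0 / k ** 1.5                        # illustrative smooth signal
lam = k ** 4.0                             # spline-like eigenvalues (m = 2)
H = [1.0 / (1.0 + a * lam) for a in np.logspace(-10, 2, 80)]

loss_ure, loss_ew = [], []
for _ in range(reps):
    Y = mu + sigma * rng.standard_normal(n)
    r = np.array([ure(Y, h, sigma) for h in H])
    # Unbiased risk minimization (Theorem 1.1).
    loss_ure.append(np.sum((H[int(np.argmin(r))] * Y - mu) ** 2))
    # Exponential weighting with beta = 1 and uniform prior.
    w = np.exp(-(r - r.min()) / (1.0 * sigma ** 2))
    w /= w.sum()
    mu_bar = sum(wi * (h * Y) for wi, h in zip(w, H))
    loss_ew.append(np.sum((mu_bar - mu) ** 2))

print("URE:", np.mean(loss_ure), " EW(beta=1):", np.mean(loss_ew))
```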

Proofs
The approach in the proof of Theorem 1.3 is based on a combination of the methods for deriving oracle inequalities proposed in [12] and [9]. The cornerstone idea is to make use of the following property of the unbiased risk estimate: if ĥ = arg min_{h∈H} r(Y, μ^h) is the minimizer of the unbiased risk estimate, then for any sufficiently small ǫ < 1 there exists ĥǫ ≥ ĥ such that, with probability 1, r(Y, μ^h) − r(Y, μ^ĥ) grows at least linearly in ‖h‖² for all h ≥ ĥǫ. This property means that the weights w_h(Y) are exponentially decreasing for large h, and therefore we can obtain an entropy bound for them (see Lemma 2.3 below). Here and in what follows, C denotes a generic constant. Next, we prove an upper bound with the help of Lemma 2 in [7] (see Lemma 2.5 below). Finally, we combine these facts following the main lines of the proof of Theorem 5 in [12].

Auxiliary facts
The next lemma collects some useful facts about the prior weights π_h defined by (1.12).
Lemma 2.1. Under Condition 1.1, for any h ∈ H, the following assertions hold:
• there exists a constant C• such that (2.1) holds;
• there exist constants π• and π• such that (2.2) and (2.3) hold.

Proof. Denote the corresponding sums for brevity by S_h. Then, in view of the definition of π_h, it is clear that if S_{h_max} = 1, then S_h = S_{h_+}, thus proving (2.1).
To prove (2.2), notice that the required bound follows from Conditions (1.13) and (1.14). In order to check (2.3), consider the corresponding subset G_h of H and let g_h be the maximal element in G_h. Then there are two possibilities. In the first case we have, by (1.12), the required lower bound. In the case where ‖h‖² + 1/2 ≤ ‖g_h‖² ≤ ‖h‖² + 1, we make use of the fact that, by the Taylor expansion, the analogous bound holds for any g < g_h. This, together with (2.4), guarantees the existence of π•. The proof of the inverse inequality Σ_{g∈G_h} π_g ≤ π• is quite similar to that of (2.2).
The following lemma is a cornerstone in the proof of Theorem 1.3.
Lemma 2.2. For β ≥ 4, the risk of μ̄(Y) is bounded from above as follows.

Proof. The proof is based essentially on [12]. Recall that the unbiased risk estimates for μ̄(Y) and μ^h(Y) are computed as in (2.5) (see, e.g., [18]). Since Σ_{h∈H} w_h = 1, we have (2.6). From the definition of μ̄(Y), combining with (2.6) (see also (2.5)), we arrive at (2.7); in deriving this equation it was used that Σ_{h∈H} w_h(Y) = 1. To control the second sum on the right-hand side of (2.7), we make use of the identity obtained by substituting (1.9), which yields (2.8). Finally, combining the resulting identity with (2.6)-(2.8), we finish the proof.
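The unbiasedness property E_µ r(Y, μ^h) = R(μ^h, µ) that drives this computation is easy to confirm by simulation; the sketch assumes the form of r reconstructed in (1.9), with illustrative µ and h.

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma, reps = 50, 1.0, 20000
mu = 2.0 / np.arange(1, n + 1)
h = 1.0 / (1.0 + 0.1 * np.arange(1, n + 1))

r_vals, losses = [], []
for _ in range(reps):
    Y = mu + sigma * rng.standard_normal(n)
    # Unbiased risk estimate (1.9) and the actual squared-error loss.
    r_vals.append(np.sum(((1 - h) * Y) ** 2) + sigma ** 2 * np.sum(2 * h - 1))
    losses.append(np.sum((h * Y - mu) ** 2))

# Both averages approximate the same risk R(mu^h, mu): r is unbiased.
print(np.mean(r_vals), np.mean(losses))
```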
Lemma 2.3. Suppose that the above conditions hold for some h* with ‖h*‖² ≤ ‖h‖². Then the entropy bound used in the proof of Theorem 1.3 holds.

Proof. Decompose H into two subsets and denote for brevity by P and Q the corresponding sums of π_h q_h over these subsets.

By the convexity of log(x),

H(W_h) = (P/(P + Q)) Σ_{h∈P} (π_h q_h/P) log[((P + Q)/P)/(q_h/P)].

Next, notice that x log(1/x) is an increasing function for x ∈ [0, e^{-1}]. Therefore, using that 1 − exp(−ǫ) > (1 + ǫ)^{-1} ǫ, we get the corresponding bound from Condition (1.13) and (2.3). To continue this inequality, we make use of (1.12). In order to bound from above the right-hand side of this equation, consider the set {g ∈ H : g ≥ h}. We may assume that the elements {h^k, k = 1, …} of this set are ordered so that ‖h^{k+1}‖₁ ≥ ‖h^k‖₁, and denote the corresponding sums for brevity by S_k. With these notations, let us check that the maximum over S_k, k ≥ 1, is bounded, where [x]_+ = max(0, x). Solving the corresponding extremal equation, we obtain with simple algebra, and summing up these equations, we get (2.12). Similar arguments may be applied to prove the analogous bounds (2.14) and (2.16). Then, with (2.14) and (2.16), we arrive at (2.17). It is easily seen that the minimizer x* of the right-hand side of (2.17) solves the corresponding equation, and the bound follows.
Therefore, from (2.17), the assertion of Lemma 2.3 follows.

Lemma 2.4. Let ξ_i be i.i.d. N(0, 1) and let G be a set of ordered sequences. Then for any α > 0 the corresponding maximal inequality holds.

Lemma 2.5. Let ĥǫ be defined by (2.20). Then (2.21) holds.

Proof. By the definition of r(Y, μ^h) (see (1.9) and (2.20)), we get the basic identity. Let us fix some γ ∈ (0, 1); then we can rewrite this identity accordingly. Next, bounding the max and min in this equation with the help of Lemma 2.4, and choosing γ = 3βǫ, we get (2.22).
To control the expectation on the right-hand side of (2.22), notice that for any given g ∈ H the corresponding inequality holds. So, for any γ ∈ (0, 1), we get with this inequality and Lemma 2.4 an upper bound; minimizing its right-hand side over g ∈ H, choosing γ = βǫ, and substituting the inequality thus obtained in (2.22), we get (2.21).

Proof of Theorem 1.3. Next, notice that for all h such that ‖ĥ‖² ≤ ‖h‖² ≤ ‖ĥ‖² + 1, we have by Condition (1.13) the bound (2.25).
Our next step is to bound from above the second term on the right-hand side of Equation (2.23); Lemmas 2.3 and 2.5 help us in solving this problem. Let ĥǫ be defined by (2.20). Then for all h > ĥǫ,

r(Y, μ^h) − r_H(Y) ≥ 2βǫσ²(‖h‖² − ‖ĥ‖²) + 2βσ²,

and in view of (2.24) we obtain, with Lemma 2.3 and (2.2), the desired bound. To finish the proof of the theorem, it remains to minimize the right-hand side of this bound in ǫ. Assuming that ǫ ≤ 1/(5β), we obtain the required bound (1.15).