Oracle inequalities for cross-validation type procedures

We prove oracle inequalities for three different types of adaptation procedures inspired by cross-validation and aggregation. These procedures are then applied to the construction of Lasso estimators and aggregation with exponential weights with data-driven regularization and temperature parameters, respectively. We also prove oracle inequalities for the cross-validation procedure itself under some convexity assumptions. AMS 2000 subject classifications: Primary 62G99.


Introduction
In this paper, we construct adaptation procedures inspired by cross-validation. Adaptation procedures are of particular interest when one wants to adapt to an unknown parameter. Such a parameter can appear in statistical procedures for two reasons: either it is an unknown parameter of the model (complexity parameter, "concentration" parameter, geometric parameter, variance of the noise, ...), or the construction of the procedure requires fitting a parameter that no theory is able to determine (regularization parameter, smoothing parameter, threshold, ...). Thus it is very useful to have at hand a statistical procedure which can choose these unknown parameters in a data-dependent way. The construction of adaptation procedures has been one of the main topics in nonparametric statistics for the last two decades. Retracing the entire bibliography here is not possible. Nevertheless, we would like to refer the reader to some classical, and now pioneering, steps in this field such as the model selection approach (cf., e.g., Barron et al. (1999) and Massart (2007)), aggregation methods (cf., e.g., Nemirovski (2000), Catoni (2004) and Catoni (2007)), empirical risk minimization (cf., e.g., Vapnik (1982), Koltchinskii (2006) and Bartlett & Mendelson (2006)) or Lepskii's adaptation method in Lepskii (1990, 1992). Of course, many other approaches have been developed in particular setups. But one of the most popular and universal strategies for fitting unknown parameters, or more generally for selecting algorithms, is cross-validation (CV). Cross-validation is a very important and widely applied family of model/estimator/parameter selection methods. Among others, the CV procedure was studied for the selection of the bandwidth in kernel density estimation in Hall (1983) and Stone (1984), for the regression model in Stone (1974), and in classification in Devroye et al. (1979). Many other authors have studied or used this method, and we refer the reader to the survey of CV methods in model selection by Arlot et al. (2010), the PhD thesis of Cornec (2009), Shao (1993) or van der Vaart et al. (2006) for more bibliographical references on this topic. The aim of this paper is to present and to study three procedures inspired by the CV procedure in the following general framework.
Let $(\mathcal{Z}, \mathcal{T})$ be a measurable space and $\mathcal{F}$ be a class of measurable functions from $\mathcal{Z}$ to $\mathbb{R}$. On a very general level, our aim is to minimize a risk function $R : \mathcal{F} \to \mathbb{R}$ over its domain $\mathcal{F}$. This risk function is assumed to exist, but is unknown to us. To obtain information about it, we assume that it also appears as the expectation of a quantity we can sample from. Let $Z$ be a random variable with values in $\mathcal{Z}$ and denote its probability measure by $\pi$. Assume that there exists a "contrast" or loss function $Q : \mathcal{Z} \times \mathcal{F} \to \mathbb{R}$ such that the risk of any $f \in \mathcal{F}$ can be written in the form $R(f) = \mathbb{E}[Q(Z, f)]$, and that there exists a sequence $(Z_i)_{i \in \mathbb{N}}$ of i.i.d. random variables distributed according to $\pi$. For the purpose of estimation, we shall use a finite amount, say $n$, of data from this sequence.
The problem of risk minimization is a general formulation for many different kinds of statistical problems, and we shall introduce all our examples using this form. If the infimum over all $f$ in $\mathcal{F}$ is achieved by at least one function, we write $f^*$ for some choice of such a minimizer in $\mathcal{F}$. In this paper, we will assume that $\inf_{f \in \mathcal{F}} R(f)$ is achieved; otherwise we can replace $f^*$ by $f^*_n$, an element of $\mathcal{F}$ satisfying $R(f^*_n) \le \inf_{f \in \mathcal{F}} R(f) + n^{-1}$, and still obtain the same results. This model is best illustrated by its three key examples: regression, density estimation and classification.
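Regression: Take $\mathcal{Z} = \mathcal{X} \times \mathbb{R}$, where $(\mathcal{X}, \mathcal{A})$ is a measurable space, and let $Z = (X, Y)$ be a random pair on $\mathcal{Z}$. In the regression framework, we would like to estimate the regression function $f^*(x) = \mathbb{E}[Y | X = x]$, $\forall x \in \mathcal{X}$. Take $\mathcal{F} = L_2(\mathcal{X}, \mathcal{A}, P_X)$, where $P_X$ is the distribution of $X$. Consider the contrast function $Q((x, y), f) = (y - f(x))^2$ defined for any $(x, y) \in \mathcal{X} \times \mathbb{R}$ and $f \in \mathcal{F}$. We have $R(f) = \mathbb{E}(f^*(X) - f(X))^2 + \mathbb{E}\zeta^2$, where $\zeta = Y - f^*(X)$ is usually called the noise. Thus $f^*$ is a minimizer of $R(\cdot)$ and the minimum achievable risk is $R(f^*) = \mathbb{E}\zeta^2$.
Density estimation: Let $(\mathcal{Z}, \mathcal{T}, \mu)$ be a measure space, and take $Z$ to be a random variable with values in $\mathcal{Z}$. We assume that the probability distribution $\pi$ of $Z$ is absolutely continuous with respect to $\mu$ and denote by $f^*$ one version of its density. Consider $\mathcal{F}$ the set of all density functions on $(\mathcal{Z}, \mathcal{T}, \mu)$, i.e. the set of all $\mathcal{T}$-measurable functions $f : \mathcal{Z} \to \mathbb{R}_+$ that integrate to 1. We consider the contrast function $Q(z, f) = -\log f(z)$ for any $z \in \mathcal{Z}$ and $f \in \mathcal{F}$. The corresponding risk computes as $R(f) = K(f^*|f) - \int_{\mathcal{Z}} \log(f^*(z)) f^*(z) \, d\mu(z)$, where $K(f^*|f) = \int_{\mathcal{Z}} \log(f^*(z)/f(z)) f^*(z) \, d\mu(z)$ is the Kullback-Leibler divergence between $f^*$ and $f$, so that $f^*$ is a minimizer of $R$ over $\mathcal{F}$. Instead of using the Kullback-Leibler loss, one can use the quadratic loss. The corresponding contrast function is $Q(z, f) = \int_{\mathcal{Z}} f^2 \, d\mu - 2 f(z)$, whose risk satisfies $R(f) = \|f^* - f\|^2_{L_2(\mu)} - \|f^*\|^2_{L_2(\mu)}$.
Classification framework: Let $(\mathcal{X}, \mathcal{A})$ be a measurable space. We assume that the space $\mathcal{Z} = \mathcal{X} \times \{-1, 1\}$ is endowed with an unknown probability measure $\pi$, and consider a random pair $Z = (X, Y)$ which takes values in $\mathcal{Z}$ and whose probability distribution is $\pi$. Denote by $\mathcal{F}$ the set of all measurable functions from $\mathcal{X}$ to $\mathbb{R}$, and furthermore let $\phi$ be a function from $\mathbb{R}$ to $\mathbb{R}$. For any $f \in \mathcal{F}$ consider the $\phi$-risk $R(f) = \mathbb{E}[\phi(Y f(X))]$, where the contrast function is given by $Q((x, y), f) = \phi(y f(x))$ for any $(x, y) \in \mathcal{X} \times \{-1, 1\}$. In many situations, a minimizer $f^*$ of the $\phi$-risk $R$ over $\mathcal{F}$ (or the sign of $f^*$, if the latter takes arbitrary real values) is equal to the Bayes rule $f^*(x) = \mathrm{Sign}(2\eta(x) - 1)$, $\forall x \in \mathcal{X}$, where $\eta(x) = P(Y = 1 | X = x)$ (cf. Zhang (2004) and Bartlett, Jordan & McAuliffe (2006)).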
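To make the three contrasts concrete, the following minimal Python sketch writes each $Q$ as a function and approximates the risk by a sample average. The function names, the toy sample and the hinge choice for $\phi$ are illustrative assumptions and are not taken from the paper.

```python
import math


def squared_contrast(z, f):
    """Regression contrast Q((x, y), f) = (y - f(x))^2."""
    x, y = z
    return (y - f(x)) ** 2


def log_contrast(z, f):
    """Density-estimation contrast Q(z, f) = -log f(z)."""
    return -math.log(f(z))


def phi_contrast(z, f, phi):
    """Classification contrast Q((x, y), f) = phi(y * f(x)), with y in {-1, +1}."""
    x, y = z
    return phi(y * f(x))


def empirical_risk(contrast, sample, f):
    """Sample average of the contrast, the empirical counterpart of R(f) = E[Q(Z, f)]."""
    return sum(contrast(z, f) for z in sample) / len(sample)


def hinge(t):
    """One possible choice of phi (an assumption made for this example)."""
    return max(0.0, 1.0 - t)


# Toy regression sample of pairs (x, y) and the candidate f(x) = x.
sample = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2)]
risk = empirical_risk(squared_contrast, sample, lambda x: x)
```

The same `empirical_risk` helper works for the other two contrasts once `phi` is bound, for instance with `lambda z, f: phi_contrast(z, f, hinge)`.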
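We say that a statistic is a sequence of functions $\hat f = (\hat f^{(n)})_{n \in \mathbb{N}}$, where each $\hat f^{(n)}$ maps a data set $D^{(n)} = (Z_1, \dots, Z_n) \in \mathcal{Z}^n$ to an element $\hat f^{(n)}(D^{(n)})$ of $\mathcal{F}$. We assume that we know how to construct some statistics $\hat f_\lambda$ for $\lambda$ in a set of indexes $\Lambda$. The aim of this work is to construct procedures $\bar f := (\bar f^{(n)})_{n \in \mathbb{N}}$ satisfying oracle inequalities, that is, inequalities of the form, for any sample size $n$, $\mathbb{E}[R(\bar f^{(n)}(D^{(n)})) - R(f^*)] \le C \inf_{\lambda \in \Lambda} \mathbb{E}[R(\hat f_\lambda^{(n)}(D^{(n)})) - R(f^*)] + r(n, \Lambda)$, where $C \ge 1$ is a constant and $r(n, \Lambda)$ is a residue term which we would like to keep as small as possible. Controlling this residue will depend on some complexity parameter of the excess loss function class $\{Q(\cdot, \hat f_\lambda) - Q(\cdot, f^*) : \lambda \in \Lambda\}$ as well as on a margin parameter that limits the behaviour of the contrast function around the risk minimizer (cf. Assumption (A) in Section 2).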
The paper is organized as follows. In Section 2 we construct the adaptation procedures and prove that they satisfy oracle inequalities in the finite case $|\Lambda| = p$. Section 3 is devoted to the study of the general non-finite case. In Section 4, we apply our adaptation procedures to the construction of Lasso estimators with a data-driven regularization parameter and of aggregates with exponential weights with a data-driven temperature parameter. Finally, the main proofs are provided in Section 5.

Procedures and Oracle inequalities
In this section we provide oracle inequalities for several procedures that select or aggregate estimators: first two modified versions of the cross-validation procedure, then the cross-validation procedure itself, and finally aggregation with multiple splits.

Classical Cross-validation procedures
The key feature of the CV procedure, the use of multiple splits to train and test the candidate estimator, renders it somewhat more difficult to handle in a theoretical way. Nevertheless, we shall show that a carefully crafted risk inequality opens the door to oracle inequalities for cross-validation too. In this section, we have to pay careful attention to the exact choice of the splits of our data, especially when retraining the selected model to obtain our final estimator(s).
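First we shall introduce some notation. Let $n$ be an integer, and $V$ a divisor of $n$. We split the data set $D^{(n)} = (Z_1, \dots, Z_n)$ into $V$ disjoint blocks $B_1, \dots, B_V$ of equal size $n/V$ (2.1), which shall be test sets, and their complements $D_k = D^{(n)} \setminus B_k$, $k = 1, \dots, V$ (2.2), which shall be training sets of size $n_V = n - n/V$. Let $Q(Z, f)$ be a contrast function whose arguments are a data point $Z$ and a parameter $f \in \mathcal{F}$. For a statistic $\hat f = (\hat f^{(n)})_n$, we define the V-fold CV empirical risk by $R_{n,V}(\hat f) = \frac{1}{V} \sum_{k=1}^{V} \frac{V}{n} \sum_{i : Z_i \in B_k} Q(Z_i, \hat f^{(n_V)}(D_k))$ (2.3). Let $p$ statistics $\hat f_1, \dots, \hat f_p$ be given. The V-fold CV procedure is the procedure $\bar f^{(n)}_{VCV}(D^{(n)}) = \hat f^{(n)}_{\hat\jmath(D^{(n)})}(D^{(n)})$, where $\hat\jmath(D^{(n)}) \in \arg\min_{1 \le j \le p} R_{n,V}(\hat f_j)$ (2.4). Perhaps the oldest, and certainly the most frequently studied, cross-validation scheme is n-fold or leave-one-out cross-validation. It forms the intersection between the class of V-fold cross-validation schemes and the class of leave-m-out CV schemes, defined by $\bar f^{(n)}_{lmo}(D^{(n)}) = \hat f^{(n)}_{\hat\jmath(D^{(n)})}(D^{(n)})$ with $\hat\jmath(D^{(n)}) \in \arg\min_{1 \le j \le p} R_{n,-m}(\hat f_j)$ (2.5), where $R_{n,-m}$ is defined as $R_{n,-m}(\hat f) = \binom{n}{m}^{-1} \sum_{C \subset \{1, \dots, n\}, |C| = m} \frac{1}{m} \sum_{i \in C} Q(Z_i, \hat f^{(n-m)}((Z_l)_{l \notin C}))$ (2.6). This method becomes computationally impractical as soon as $m$ is no longer 1, as there are far too many subsets of $\{1, \dots, n\}$ to average over. One possible solution is balanced incomplete cross-validation, where cross-validation is treated as a block design and the available pieces of data are all used equally often for training, and equally often for testing. Alternatively, we could use Monte Carlo cross-validation, where the training and testing subsets are drawn randomly, without replacement, from the available data. See Shao (1993) for a discussion of all these methods.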
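We can place all of these cross-validation schemes into one general framework as follows. For any subset $C \subset \{1, \dots, n\}$ of indices, write $D^{(C)}$ for $\{Z_i : i \in C\}$ and $D^{(C^c)}$ for $\{Z_i : i \notin C\}$. Assume that a fixed value $n_C$ is given (the size of the test sets), and define $n_V = n - n_C$. Let $C_1, \dots, C_{N_C}$ be $N_C$ subsets of $\{1, \dots, n\}$, each of size $n_V$. Now for any statistic $\hat f$ define the CV risk by $R_{n_C}(\hat f) = \frac{1}{N_C} \sum_{k=1}^{N_C} \frac{1}{n_C} \sum_{i \notin C_k} Q(Z_i, \hat f^{(n_V)}(D^{(C_k)}))$ and its minimizer by $\hat\jmath(D^{(n)}) \in \arg\min_{1 \le j \le p} R_{n_C}(\hat f_j)$ (2.7).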
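The selection step of V-fold cross-validation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: blocks are taken to be consecutive index ranges, a statistic is modelled as a function from a training list to a fitted parameter, and the names `v_fold_splits`, `cv_risk` and `cv_select` are hypothetical.

```python
def v_fold_splits(n, V):
    """Split indices 0..n-1 into V consecutive test blocks of size n/V,
    each paired with its complement (the training indices)."""
    if n % V != 0:
        raise ValueError("V must divide n")
    size = n // V
    splits = []
    for k in range(V):
        test = list(range(k * size, (k + 1) * size))
        train = [i for i in range(n) if i < k * size or i >= (k + 1) * size]
        splits.append((test, train))
    return splits


def cv_risk(statistic, contrast, data, splits):
    """Average over the splits of the mean test contrast of the statistic
    fitted on the corresponding training set."""
    total = 0.0
    for test, train in splits:
        fitted = statistic([data[i] for i in train])
        total += sum(contrast(data[i], fitted) for i in test) / len(test)
    return total / len(splits)


def cv_select(statistics, contrast, data, splits):
    """Index of the statistic with the smallest CV risk."""
    risks = [cv_risk(s, contrast, data, splits) for s in statistics]
    return min(range(len(statistics)), key=risks.__getitem__)


# Toy location problem: contrast (z - f)^2, candidates "sample mean" and "always 0".
def sample_mean(train):
    return sum(train) / len(train)


def always_zero(train):
    return 0.0


def squared_gap(z, f):
    return (z - f) ** 2


data = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0]
selected = cv_select([sample_mean, always_zero], squared_gap, data, v_fold_splits(len(data), 3))
```

Passing an arbitrary list of `(test, train)` index pairs instead of `v_fold_splits` gives the more general scheme with $N_C$ splits described above.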

The modified CV procedure and its average version
In this subsection, we introduce the selecting procedures that we will be studying later.We use the notations introduced in the previous subsection.
To introduce the modified CV procedure, we consider some integer $V$ and we assume that $V$ divides $n$. We consider the splits $(B_1, D_1), \dots, (B_V, D_V)$ of the data introduced in (2.1) and (2.2). We define the modified CV procedure (mCV) by $\bar f^{(n)}_{mCV}(D^{(n)}) = \hat f^{(n_V)}_{\hat\jmath(D^{(n)})}(D^{(n_V)})$ (2.8), where $\hat\jmath(D^{(n)}) \in \arg\min_{1 \le j \le p} R_{n,V}(\hat f_j)$ and $D^{(n_V)}$ denotes the first $n_V$ observations. For the average version of the mCV procedure, we do not have to split the data in the same "organized" way as in (2.1) and (2.2). We can consider the more general partition scheme introduced in the second part of the previous subsection, which we recall now for the reader's convenience: Let $N_C$ and $1 \le n_C < n$ be two integers and set $n_V = n - n_C$. Let $C_1, \dots, C_{N_C}$ be subsets of $\{1, \dots, n\}$, each of size $n_V$. We define the averaged version of the modified CV procedure (amCV) by $\bar f^{(n)}_{amCV}(D^{(n)}) = \frac{1}{N_C} \sum_{k=1}^{N_C} \hat f^{(n_V)}_{\hat\jmath(D^{(n)})}(D^{(C_k)})$ (2.9), where $\hat\jmath(D^{(n)}) \in \arg\min_{1 \le j \le p} R_{n_C}(\hat f_j)$ for the CV-risk $R_{n_C}$ defined in (2.7). We did not consider the same partition scheme of the data for the two procedures. The one considered for the amCV is more general, but to obtain oracle inequalities for the amCV we will need the convexity of the risk. For the mCV, in contrast, the partition scheme is the one used for the VCV method and will only require a weak assumption on the basis statistics $\hat f_1, \dots, \hat f_p$. For each one of our results, we will consider two different setups depending on the procedure that we want to study and the assumptions of the problem.
Note that the difference between the classical VCV procedure defined in (2.4) and our mCV procedure is that the final VCV estimator $\hat f^{(n)}_{\hat\jmath(D^{(n)})}(D^{(n)})$ is retrained on the whole sample, whereas the mCV estimator $\hat f^{(n_V)}_{\hat\jmath(D^{(n)})}(D^{(n_V)})$ is trained on only $n_V$ observations. Therefore, under some extra "regularity" assumptions on the basis statistics $\hat f_1, \dots, \hat f_p$ saying that, for every $j$, $\hat f^{(n)}_j$ is somehow more efficient as $n$ increases (cf., for instance, the "stability" assumption in Bousquet et al. (2002)), the VCV procedure should outperform our mCV procedure. Nevertheless, we will not explore this kind of regularity assumption and will require only weak assumptions on the basis estimators. Under these weak assumptions, the mCV (as well as the amCV) will, in fact, outperform the classical VCV and CV procedures (cf. Theorem 2.4 and Example 2.6 below).
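The two modified procedures differ from classical cross-validation only in how the final estimator is built once the index has been selected. The Python sketch below illustrates this for real-valued parameters, so that averaging estimators is plain arithmetic. It assumes that mCV trains the selected statistic on the first $n_V$ observations and that amCV averages the selected statistic over the training sets $D^{(C_k)}$; the helper names are hypothetical.

```python
def cv_index(statistics, contrast, data, train_sets):
    """Index minimizing the CV risk: each statistic is fitted on D^(C_k),
    evaluated on the held-out points, and averaged over the splits."""
    n = len(data)

    def cv_risk(statistic):
        total = 0.0
        for train in train_sets:
            in_train = set(train)
            held_out = [i for i in range(n) if i not in in_train]
            fitted = statistic([data[i] for i in train])
            total += sum(contrast(data[i], fitted) for i in held_out) / len(held_out)
        return total / len(train_sets)

    risks = [cv_risk(s) for s in statistics]
    return min(range(len(statistics)), key=risks.__getitem__)


def mcv(statistics, contrast, data, train_sets):
    """Selected statistic trained on the first n_V observations only (assumed form)."""
    j = cv_index(statistics, contrast, data, train_sets)
    n_v = len(train_sets[0])
    return statistics[j](data[:n_v])


def amcv(statistics, contrast, data, train_sets):
    """Average over the splits of the selected statistic trained on each D^(C_k)."""
    j = cv_index(statistics, contrast, data, train_sets)
    fits = [statistics[j]([data[i] for i in train]) for train in train_sets]
    return sum(fits) / len(fits)


def sample_mean(train):
    return sum(train) / len(train)


def always_zero(train):
    return 0.0


def squared_gap(z, f):
    return (z - f) ** 2


data = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0]
train_sets = [[2, 3, 4, 5], [0, 1, 4, 5], [0, 1, 2, 3]]  # three training sets, n_V = 4
candidates = [sample_mean, always_zero]
mcv_fit = mcv(candidates, squared_gap, data, train_sets)
amcv_fit = amcv(candidates, squared_gap, data, train_sets)
```

Neither function ever calls a statistic on all $n$ observations, which is the point of the modification: the selection step and the final estimator both use training size $n_V$.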

Assumptions
A significant part of our analysis is based on concentration properties of sums of random variables that belong to an Orlicz space. These spaces are useful for the unbounded setup we have in mind. We say that a function $\psi : \mathbb{R}_+ \to \mathbb{R}$ is a Young function (cf. van der Vaart and Wellner (1996)) when it is convex, non-decreasing, and satisfies $\psi(0) = 0$ and $\psi(\infty) = \infty$. Each Young function gives rise to a norm on a suitable class of random variables as follows: Definition 2.1 For a Young function $\psi$ and a random variable $f$, the $\psi$-norm of $f$ is $\|f\|_\psi = \inf\{C > 0 : \mathbb{E}\psi(|f|/C) \le 1\}$. The Orlicz space associated with $\psi$ is then the space of random variables with finite $\psi$-norm.
For instance, when we consider $\psi_\alpha(x) = \exp(x^\alpha) - 1$ for $\alpha \ge 1$, the $\psi_\alpha$-norm measures the exponential tail behavior of a random variable. Indeed, one can show that for every $u \ge 0$, $P(|f| > u) \le 2 \exp(-c \, u^\alpha / \|f\|^\alpha_{\psi_\alpha})$, where $c$ is an absolute constant independent of $f$ (see, for example, van der Vaart and Wellner (1996)). Note that the Orlicz space associated with the Young function $\psi(x) = x^p$ is the classical $L_p$ space.
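Since $C \mapsto \mathbb{E}\psi(|f|/C)$ is non-increasing, the $\psi$-norm of Definition 2.1 can be approximated numerically by bisection. The sketch below does this for an empirical distribution given as a list of values; the function names and the tolerance are illustrative assumptions.

```python
import math


def orlicz_norm(values, psi, tol=1e-10):
    """Empirical psi-norm: the smallest C > 0 with mean(psi(|v| / C)) <= 1,
    located by bisection (psi is assumed non-decreasing)."""
    abs_vals = [abs(v) for v in values]
    if max(abs_vals) == 0.0:
        return 0.0

    def feasible(c):
        try:
            return sum(psi(a / c) for a in abs_vals) / len(abs_vals) <= 1.0
        except OverflowError:
            # psi blew up, so c is far too small.
            return False

    hi = max(abs_vals)
    while not feasible(hi):  # grow until the constraint holds
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi


def psi_alpha(alpha):
    """Young function psi_alpha(x) = exp(x^alpha) - 1."""
    return lambda x: math.exp(x ** alpha) - 1.0


# For a variable that is constantly 1, exp(1 / C) - 1 <= 1 gives C = 1 / log 2.
norm_const = orlicz_norm([1.0], psi_alpha(1))
# With psi(x) = x^2 the psi-norm is the usual L_2 norm.
norm_l2 = orlicz_norm([3.0, 4.0], lambda x: x ** 2)
```

The second call recovers the $L_2$ norm of the empirical distribution, in line with the remark that $\psi(x) = x^p$ yields the classical $L_p$ space.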
We shall use the following assumptions on the tail behavior and the "margin" (cf.Mammen & Tsybakov (1999) and Tsybakov (2004)) of the excess loss function of an estimator f .
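(A) There exist $\kappa \ge 1$ and $K_0, K_1 > 0$ such that the following holds. For any $m \in \mathbb{N}$ and any data set $D^{(m)}$: 1. $\|Q(Z, \hat f^{(m)}(D^{(m)})) - Q(Z, f^*)\|_{L_{\psi_1}(\pi)} \le K_0$; 2. $\|Q(Z, \hat f^{(m)}(D^{(m)})) - Q(Z, f^*)\|_{L_2(\pi)} \le K_1 \big(R(\hat f^{(m)}(D^{(m)})) - R(f^*)\big)^{1/(2\kappa)}$. The first point allows us to handle unbounded loss functions and unbounded estimators. This is a crucial point when one wants to consider the regression problem with unbounded noise or when one wants to aggregate unbounded estimators.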
The second point is the classical "margin assumption" (cf. Mammen & Tsybakov (1999)). It means that the $L_2$-diameter of the set of almost oracles is controlled by their excess risks. The idea behind this assumption is that, for procedures based on empirical risk minimization, the $L_2$-diameter of the set of almost minimizers of the empirical risk will be small with high probability. This leads to a smaller complexity of the set within which we are looking for the oracle. Moreover, a side effect of this kind of assumption is that the concentration of the empirical risk around the risk is improved. The margin condition is linked to the convexity of the underlying loss $Q$. In density and regression estimation it is naturally satisfied with the best margin parameter ($\kappa = 1$), but for non-convex losses (for instance in classification), this assumption does not hold naturally (cf. Lecué (2007) for a discussion of the margin assumption and for examples of such losses).

Oracle inequalities for the modified CV procedures (mCV) and its average version (amCV)
In this section, we shall not yet introduce any conditions on how a candidate statistic $\hat f$ behaves when its training sample size changes, i.e. on the relationship between $\hat f^{(m)}$ and $\hat f^{(n)}$ for $m \ne n$. As the usual application of cross-validation involves retraining the selected model on all the available data to obtain a final estimator, such assumptions are crucial for avoiding pathological "counter-examples" such as the one in Example 2.6 below. As we shall only introduce such conditions in Section 2.5, we first treat a simpler case: the case where, even after a selection step involving estimation with training size $n_V$, we still only use training samples of size $n_V$ to build the final estimator. The case where we retrain on all available data will then be handled in Section 2.5. We will require some simple (fixed sample size) properties of the estimators $\hat f_1, \dots, \hat f_p$ to obtain an oracle inequality for the modified CV procedure.
Definition 2.2 We say that a statistic $\hat f = (\hat f^{(n)})_n$ is exchangeable when, for any integer $n$, any permutation $\phi : \{1, \dots, n\} \to \{1, \dots, n\}$ and $\pi^{\otimes n}$-almost every vector $(z_1, \dots, z_n) \in \mathcal{Z}^n$, we have $\hat f^{(n)}(z_1, \dots, z_n) = \hat f^{(n)}(z_{\phi(1)}, \dots, z_{\phi(n)})$. Note that most statistics in the batch setup (the setup of this paper) satisfy this property. On the other hand, statistics coming from the on-line setup are likely to be non-exchangeable.
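Exchangeability, read here as invariance of the fitted value under any reordering of the sample, can be checked by brute force on a small sample. The Python sketch below is only a sanity check on one fixed data set and not a proof; the function names and the two toy statistics are illustrative assumptions.

```python
from itertools import permutations


def is_exchangeable_on(statistic, sample):
    """Return True when `statistic` gives the same value on every
    reordering of this particular sample."""
    reference = statistic(list(sample))
    return all(statistic(list(p)) == reference for p in permutations(sample))


def sample_median(data):
    """A batch statistic: depends on the data only through its sorted values."""
    ordered = sorted(data)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return 0.5 * (ordered[mid - 1] + ordered[mid])


def last_observation(data):
    """An order-dependent statistic, in the spirit of an on-line procedure."""
    return data[-1]


sample = [0.3, 1.7, 2.5, 0.9]
median_ok = is_exchangeable_on(sample_median, sample)
last_ok = is_exchangeable_on(last_observation, sample)
```

The median passes because sorting removes any dependence on the order of the observations, whereas the order-dependent statistic fails as soon as two sample values differ.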
The following lemma shows that in all of these cases, supremum bounds on the "shifted" empirical process for the "trained" estimates $\hat f_j^{(n_V)}(D^{(n_V)})$ are sufficient for deriving oracle inequalities for the corresponding amCV and mCV procedures:
Lemma 2.3 We have two different setups, depending on the procedure that we want to study. Assume that one of the two following conditions holds:
1. The risk function $f \mapsto R(f)$ is convex and our estimator is the averaged version of the modified CV procedure (cf. (2.9)), with $N_C$ arbitrary deterministic splits of $n$ pieces of data into $n_V$ pieces of training and $n_C$ pieces of test data.
2. The statistics $\hat f_1, \dots, \hat f_p$ are exchangeable and our estimator is the modified CV procedure defined in Equation (2.8) using the splits of the data defined in Equations (2.1) and (2.2).
Then for any constant $a \ge 0$, the following inequality holds:
Now, combining Lemma 2.3 with the maximal inequality of Lemma 5.3 below for the shifted empirical process appearing in Lemma 2.3, we are in a position to obtain the following oracle inequality for the amCV and the mCV procedures.
Theorem 2.4 Let $\hat f_1, \dots, \hat f_p$ be $p$ statistics satisfying Assumption (A). We have two different setups depending on the procedure that we want to study. Assume that one of the two conditions holds:
1. The risk function $f \mapsto R(f)$ is convex and our procedure is the amCV introduced in (2.9).
2. The statistics $\hat f_1, \dots, \hat f_p$ are exchangeable and our procedure is the modified CV procedure introduced in (2.8).
Then for any $a > 0$, there exists a constant $c = c(a, \kappa)$ such that

Oracle inequalities for cross-validation itself
In Part 1 of Theorem 2.4, we make the assumption that the risk $R(\cdot)$ is convex (for which, e.g., the convexity of the contrast function $f \mapsto Q(z, f)$ for all $z$ would suffice), and in Part 2 we assume that our candidate statistics are exchangeable. To derive a result for a CV estimator retrained on the full data $D^{(n)}$ (instead of only the data $D^{(n_V)}$, as in (2.8) and (2.9)), we shall combine and strengthen these two assumptions. Consider the mCV procedure, whose final estimator is trained on the first $n_V$ pieces of data. For symmetry reasons, Part 2 of Theorem 2.4 remains true for any $k = 1, \dots, V$ if we replace this estimator by $\bar f^{(n)}_{mCV,k}(D^{(n)})$, obtained using the training set $D_k$ from the $k$-th split. Now assume that $\mathcal{Z} = \mathbb{R}$ and that the statistics $\hat f_1, \dots, \hat f_p$ can all be written as functionals of the empirical cumulative distribution function of the data, i.e. that there exist functionals $G_1, \dots, G_p$ such that $\hat f_j^{(n)}(D^{(n)}) = G_j(F_n)$ (2.10), where $F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{Z_i \le x\}$. (This assumption automatically implies the exchangeability of the statistics. In particular, all M-estimators, such as the mean or median, have such a functional form.) Obviously, $F_n$ is the average over $k = 1, \dots, V$ of the empirical distribution functions of the training sets $D_k$. If $R$ is convex, and all the compositions $R \circ G_j$ too, then we can combine the upper bounds for the estimators $\bar f^{(n)}_{mCV,k}(D^{(n)})$ obtained in Part 2 of Theorem 2.4 to derive a bound for the VCV procedure (2.4) as follows: and thus it easily follows from Part 2 of Theorem 2.4 the result: Theorem 2.5 Let $\hat f_1, \dots, \hat f_p$ be $p$ statistics that can be written as functionals $G_1, \dots, G_p$ as in (2.10) and which satisfy Assumption (A), and assume that all the compositions $R \circ G_1, \dots, R \circ G_p$ are convex, as is the risk function $R(\cdot)$.
Then for the V-fold cross-validation procedure, we have the oracle inequality Note. The "functional convexity condition" on the $R \circ G_j$ is a strong one, but need not be exactly fulfilled: it suffices for it to hold up to a summand that converges to zero no slower than the residual term in Theorem 2.4, and versions of it averaged over the training data may also suffice. In most practical cases, the only straightforward way of showing the convexity of the $R \circ G_j$ (with high certainty) is by simulation. In the standard example of least-squares regression with an underlying Gaussian linear model, for instance, $R \circ G_j$ is convex for the fixed-design setup, regardless of other parameters, but for the random-design setup we need additional conditions such as a reasonable signal-to-noise ratio or a large enough sample size (indicating that such a convexity condition does in fact hold up to a quickly decaying extra summand). Simulations of a straightforward sparse Lasso example with 100-dimensional Gaussian covariates and Gaussian noise have shown that the necessary functional convexity condition for 10-fold cross-validation holds from a sample size of 40 and a signal-to-noise ratio of 2.0 upwards, for a range of penalty tuning parameters. However, discussing this issue at length is beyond the scope of this paper.
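The convexity argument relies on a simple averaging fact: with V-fold splits, the empirical distribution function of the full sample is the average of the empirical distribution functions of the $V$ training sets, because every observation belongs to exactly $V - 1$ of them. The Python sketch below checks this numerically on a toy sample; the consecutive-block layout and the helper names are assumptions made for the illustration.

```python
def ecdf(sample):
    """Empirical cumulative distribution function of a list of reals."""
    n = len(sample)

    def F(x):
        return sum(1 for z in sample if z <= x) / n

    return F


def training_folds(data, V):
    """Training sets of V-fold CV: the data with the k-th consecutive block removed."""
    size = len(data) // V
    return [data[:k * size] + data[(k + 1) * size:] for k in range(V)]


data = [0.4, 1.3, -0.2, 2.1, 0.9, 1.7]
F_full = ecdf(data)
F_folds = [ecdf(fold) for fold in training_folds(data, 3)]

# Largest discrepancy between F_n and the average of the fold ECDFs,
# evaluated at the sample points (both are step functions jumping there).
gap = max(
    abs(F_full(x) - sum(F(x) for F in F_folds) / len(F_folds))
    for x in data
)
```

For a statistic of the form $G_j(F_n)$, this identity is what allows a convexity inequality for $R \circ G_j$ to compare the estimator retrained on all data with the estimators trained on the individual folds.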
The reason why we need extra assumptions such as the functional form of the candidate statistics is that the computation of the index $\hat\jmath(D^{(n)})$ only involves the performance of the estimators trained on $n_V$ observations ($R_{n_C}(\hat f)$ depends only on $\hat f^{(n_V)}$). Without extra assumptions, it is thus easy to contrive counter-examples for which $\hat f^{(n_V)}$ performs well and $\hat f^{(n)}$ performs badly: Example 2.6 Fix an integer $V$ and a sample size $n > 1$ that is a multiple of $V$. We will construct a set $F_n = \{\hat f_1, \hat f_2\}$ of two estimators (which are functionals of the training data) for which V-fold cross-validation does not satisfy the oracle inequality from Theorem 2.5.
We consider the classification problem with 0−1 loss It is easy to see that (D (n) ) = arg min j∈{1,2} R n,V ( fj ) is always equal to 2.
Thus the V -fold CV procedure is As we can do this for arbitrarily high sample sizes n, V-fold cross-validation is not even risk-consistent at this level of generality -and certainly does not satisfy any meaningful oracle inequalities.

Aggregation with multiple splits
Let a dictionary F = {f 1 , . . ., f p } be given and assume that f = ( f (n) ) n is an aggregation method satisfying the following oracle inequality under a margin assumption with margin parameter κ ≥ 1: where K agg ≥ 1 is the leading constant.For instance, both the empirical risk minimization algorithm and the aggregate with exponential weights and temperature parameter T > 0, , (2.12) satisfy an oracle inequality of the form (2.11) (cf.Lecué (2006)).Let f1 , . . ., fp be p statistics.Assume that a fixed value n C be given (the size of test sets), and define n n) ) an aggregation procedure where the weights have been constructed on the data set D (C k ) and for the dictionary )}, for instance, when the ERM aggregation procedure is chosen for the basic aggregation procedure, Then we average all these aggregates over the N C different splits of We define the aggregate with multiple splits (2.13) Theorem 2.7 Let f1 , . . ., fp be p statistics satisfying Assumption (A).Assume that the risk function f −→ R(f ) is convex.Consider an aggregation procedure satisfying (2.11).The aggregate with multiple splits (defined in (2.13)) associated with this aggregation procedure and the p statistics f1 , . . ., fp satisfies the inequality Proof.By the convexity of the risk, we have Note that when we chose an optimal aggregation procedure (cf. the progressive mixture of Catoni (2004) or Yang (2000), or the aggregation via empirical risk minimization of Lecué & Mendelson (2009)) for the basic aggregation procedure, we can take K agg = 1.
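As an illustration of the basic aggregation step, the following Python sketch computes exponential weights from the empirical risks of the dictionary elements and forms the aggregated prediction. The exact formula (2.12) is not reproduced here: the scaling $w_j \propto \exp(-n R_n(f_j)/T)$ is an assumption, as are the function names.

```python
import math


def exponential_weights(empirical_risks, n, temperature):
    """Weights w_j proportional to exp(-n * R_n(f_j) / T) (assumed scaling),
    computed with a shift for numerical stability."""
    scores = [-n * r / temperature for r in empirical_risks]
    shift = max(scores)
    unnormalized = [math.exp(s - shift) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def aggregate(weights, predictions):
    """Convex combination of the dictionary predictions at one point."""
    return sum(w * p for w, p in zip(weights, predictions))


# Equal empirical risks give uniform weights.
uniform = exponential_weights([0.5, 0.5], n=10, temperature=1.0)
# A very low temperature concentrates the weight on the empirical risk minimizer.
peaked = exponential_weights([0.1, 0.9], n=100, temperature=0.01)
prediction = aggregate(peaked, [2.0, -1.0])
```

Averaging such aggregates over several training and test splits, as in the aggregate with multiple splits, then amounts to averaging the resulting predictions over the splits.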

Continuous case
We consider Λ a set of indexes and F = { fλ : λ ∈ Λ} a set of statistics indexed by Λ.In the previous part of this paper, we have explored the case Λ = {1, . . ., p}.In this section, we need not assume Λ to be finite.
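We consider the notation introduced in Section 2, and define the continuous version of the modified CV procedure by $\bar f^{(n)}_{mCV}(D^{(n)}) = \hat f^{(n_V)}_{\hat\lambda(D^{(n)})}(D^{(n_V)})$, where $\hat\lambda(D^{(n)}) \in \arg\min_{\lambda \in \Lambda} R_{n,V}(\hat f_\lambda)$ (3.1), and the continuous version of the averaged version of the modified CV procedure by $\bar f^{(n)}_{amCV}(D^{(n)}) = \frac{1}{N_C} \sum_{k=1}^{N_C} \hat f^{(n_V)}_{\hat\lambda(D^{(n)})}(D^{(C_k)})$, where $\hat\lambda(D^{(n)}) \in \arg\min_{\lambda \in \Lambda} R_{n_C}(\hat f_\lambda)$ (3.2). Note that we assume that the infima of $\lambda \mapsto R_{n,V}(\hat f_\lambda)$ and $\lambda \mapsto R_{n_C}(\hat f_\lambda)$ are achieved. We also denote the two minimizers by the same symbol, but there will be no ambiguity since we will use them in two clearly separated setups.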
Following the line of Lemma 2.3, it is easy to obtain the following result.
2. If the statistics fλ , λ ∈ Λ are exchangeable, then the modified V -fold CV procedure defined in Equation (3.1) for the splits of the data defined in Equation (2.1) and (2.2) with 1 ≤ V ≤ n satisfies the following oracle inequality with for any constant a ≥ 0, we have the following inequality where To control the expectation of the supremum of the "shifted" empirical process appearing in Lemma 3.1, we need some results from empirical process theory (the proof is provided in Section 5).
Lemma 3.2 Let a > 0 and Q := {Q λ : λ ∈ Λ} be a set of measurable functions defined on (Z, T ).Let Z, Z 1 , . . ., Z m be i.i.d.random variables with values in (Z, T ) such that ∀Q ∈ Q, EQ(Z) ≥ 0. Suppose that there exists some constants c, L, min > 0 such that for all ≥ min and all u ≥ 1, with probability greater than 1 − L exp(−cu) sup Q∈Q:P Q≤ where J is a strictly increasing function such that J −1 is strictly convex.Let ψ be the convex conjugate of J −1 defined by ψ(u) = sup v>0 (uv − J −1 (v)), ∀u > 0.
Assume that for some r ≥ 1, x > 0 −→ ψ(x)/x r decreases and define for q > 1 and u ≥ 1, Then, there exists a constant L 1 (depending only on L) such that for every u ≥ 1, with probability greater than 1 − L 1 exp(−cu) sup Moreover, assume that ψ increases such that ψ(∞) = ∞, then there exists a constant c 1 depending only on L and c such that The function −→ sup Q∈Q:P Q≤ (P − P m )Q, appearing in Equation (3.3), is a classical measure of the complexity of the set of functions Q (cf.for instance van de Geer (2000), Bartlett & Mendelson (2006), Koltchinskii (2006) and references therein).A common way to upper bound this function is to use some metric complexity measure like the Dudley entropy integral (cf.for instance van der Vaart and Wellner (1996)) coming out of the chaining argument.In this paper, we use the γ function of Talagrand (cf. Talagrand (2005)) as a metric complexity measure of Q.We recall here the definition.
Let $(T, d)$ be a metric space. An admissible sequence of $T$ is a collection $\{T_s : s \in \mathbb{N}\}$ of subsets of $T$ such that $|T_0| = 1$ and $|T_s| \le 2^{2^s}$ for all $s \ge 1$. Definition 3.3 (Talagrand (2005)) For a metric space $(T, d)$ and $\alpha > 0$ define $\gamma_\alpha(T, d) = \inf \sup_{t \in T} \sum_{s=0}^{\infty} 2^{s/\alpha} d(t, T_s)$, where the infimum is taken over all admissible sequences of $T$.
The generic chaining mechanism can be used to show (cf.theorem 1.2.7 in Talagrand (2005)) that if {X t : t ∈ T } (where T is a set provided with two distances d 1 and d 2 ) is such that EX t = 0 and , ∀s, t ∈ T, u > 0 then, there exists some absolute constant L, c > 0 such that for all u ≥ 1, sup s,t∈T with probability at least 1 − L exp(−cu).Note that one choice for the sets T s that constitutes a potential (yet, usually suboptimal) admissible sequence are s -covers of T , where each s is such that the entropy number N (T, s , d) is less than 2 2 s .Then, an easy computation (cf.Talagrand (2005)) shows that (3.5) Lemma 3.4 Let Q := {Q λ : λ ∈ Λ} be a set of measurable functions defined on (Z, T ).Let Z, Z 1 , . . ., Z m be i.i.d.random variables with values in (Z, T ).
Grant that there exists C 1 > 0 and an increasing function G(•) such that Then, there exists some absolute constant L, c > 0 such that for all > 0 and for all u ≥ 1, with probability at least 1 − L exp(−cu), sup Q∈Q:P Q≤ where The proof of Lemma 3.4 is provided in Section 5. Now, combining Lemma 3.2 and Lemma 3.4, we obtain a continuous version of Theorem 2.4.Theorem 3.5 Let Λ a set of indexes and F = { fλ : λ ∈ Λ} a set of statistics indexed by Λ. Fix n V ≤ n the size of the validation sample and define the set of excess loss functions associated with We assume that the tail behavior of the statistics in F and the complexity of F satisfy the following assumptions: Any statistic f in F satisfies (A) and there exist min and a strictly increasing function J such that J −1 is strictly convex, the convex conjugate ψ of J −1 increases, ψ(∞) = ∞ and there exists r ≥ 1 such that x → ψ(x)/x r decreases and We consider two different setups depending on the procedure we want to study.Assume that one of the two condition holds: 2. The statistics f1 , . . 
., fp are exchangeable and our procedure is the mCV procedure Then, for every a > 0 and q > 1, the following inequality holds where we set q (u) = ψ 2q r+1 (1+a)u Note that Theorem 3.5 generalizes Theorem 2.4 to a continuous family of estimators.Indeed, it is easy to verify that, in the finite case |Λ| = p, we obtain the same result as in Theorem 2.4.For instance, under the assumptions of Theorem 2.4 by using Equation 3.5, we have, for any > 0, Thus, q (1/q) is, up to some constant depending only on K 0 and κ, of the same order as the residue of the oracle inequality of Theorem 2.4.Furthermore, the same reasoning used for Theorem 2.5 can also be applied here in sufficiently convex setups where the full data set is used for retraining.Nevertheless, from a technical point of view, there is a major difference between the finite and the continuous cases.In the finite case, it is only a side effect of the margin assumption (cf.second point of Assumption (A)) that is actually used, namely a better concentration of the empirical risk to the actual risk.Whereas in the continuous case, all the strength of the margin assumption is used: a reduction of the L 2 diameter of the set of potential almost oracle.This control on the diameter can be easily seen in the Dudley's entropy integral, where this diameter appears in the upper bound of integration.

Applications
In this section, we will be interested in two procedures which initially are nonadaptive to one unknown parameter of the model or to one parameter for which we have no canonical choice: First, the Lasso procedure where theoretical results have been obtained under the assumption that the variance of the noise is known (we will provide a procedure with a data dependent regularization parameter).Second, aggregation with exponential weights, which depends on a temperature parameter.We could just as well have applied this adaptation procedure to other problems, like the choice of the regularization parameter for penalized empirical risk minimization, or the choice of the threshold constant in wavelet methods.

Adaptive choice of the regularization parameter for the Lasso
We consider the linear regression model Y = X, β * + σ , where Y ∈ R is a random vector, X ∈ R p is a random vector and ∈ R is a random variable (the noise) independent of X such that E = 0 and E 2 = 1.We have n i.i.d.observations in this model, and the total dataset consists of Y = (Y 1 , . . ., Y n ) t and X = (X 1 , . . ., X n ) t .We consider the function Φ : Given a regularization parameter λ, the Lasso estimator fλ is defined by We consider the regularization parameter λ to be normalized so as to lie in [0, 1].Such a normalization is possible, since for λ max := 2 max i | X i , Y |, the zero vector is a minimizer of Φ(β, λ max ); that is, the Lasso penalty is always able to shrink the coefficient estimate for β down to zero.Thus the dictionary of estimators that we consider is a finite set { fλ : λ ∈ G} where G is a finite grid of [0, 1].Now, we construct the mCV procedure (cf.(2.8)) in this setup.Let (B 1 , D 1 ), . . ., (B V , D V ) be the family of splits of D (n) defined in (2.1) and (2.2) for some 1 ≤ V ≤ n dividing n.For any Lasso estimator fλ the r-V-fold CV empirical risk, for r > 0, is defined by The mCV procedure is defined in this context by where λr (D n,V ( fλ ).Now, we construct the amCV (cf.(2.9)) and the Agg (cf.(2.13)) procedures using the subsets C 1 , . . ., C N C of {1, . . ., n} each of size n V : the mCV is defined, in this context, by where λr ( n C is the r-CV risk.Finally the Agg procedure (with respect to the aggregate with exponential weights as a based aggregation procedure) is defined by and From a theoretical point of view, of course, we should have minimized the r-CV risk over λ ∈ [0, 1] (for the mCV and the amCV).But we have in mind to perform the Lasso procedure by means of the LARS algorithm.This algorithm provides a family of regularization parameters 0 = λ (0) < λ (1) < . . .< λ (N ) , where N may differ from n, and the corresponding Lasso estimators fλ (j) , j = 1, . . 
., N .Thus we believe that using the LARS algorithm combined with the mCV, amCV or Agg procedures with a grid G ⊂ {λ (0) , . . ., λ (N ) } will prove to be efficient.
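The normalization of the regularization parameter can be illustrated with a small coordinate-descent Lasso in pure Python. This is a sketch under stated assumptions: the objective is taken to be $\|Y - X\beta\|_2^2 + \lambda \|\beta\|_1$, which is consistent with $\lambda_{\max} = 2 \max_j |\langle X_j, Y \rangle|$ when $X_j$ denotes the $j$-th column of the design, and coordinate descent stands in for the LARS algorithm. All function names are hypothetical.

```python
def soft_threshold(x, t):
    """Soft-thresholding operator sign(x) * max(|x| - t, 0)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for ||y - X b||_2^2 + lam * ||b||_1 (assumed objective)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the residual that ignores coordinate j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def lambda_max(X, y):
    """Smallest penalty level at which the zero vector minimizes the objective."""
    n, p = len(X), len(X[0])
    return 2.0 * max(abs(sum(X[i][j] * y[i] for i in range(n))) for j in range(p))


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
lam_max = lambda_max(X, y)
# Normalized grid in [0, 1], rescaled by lam_max before fitting.
path = {t: lasso_cd(X, y, t * lam_max) for t in (0.0, 0.5, 1.0)}
```

Each entry of `path` is one element of the dictionary of Lasso estimators; a selection rule such as the mCV or amCV procedure would then pick one grid point from the data.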
Note that for values of $r$ close to 0, the Lasso vector $\hat\beta^{(n)}_{\mathrm{mCV}}$ constructed with a data-driven choice of the regularization parameter $\hat\lambda_r(D^{(n)})$ is likely to enjoy some model selection (or sign consistency) properties. Nevertheless, from a theoretical point of view, we will obtain results only for the prediction problem with respect to the $L_2$-risk.
We would like to apply Theorem 2.4 and Theorem 2.7 to the three procedures that we have introduced here. To this end, we have to check assumption (A) for the elements of the dictionary $F := \{\hat f_\lambda : \lambda \in G\}$, and so the design vector $X$ has to enjoy some properties.
Definition 4.1 Let $X$ be a random vector of $\mathbb{R}^p$ and denote by $\mu$ its probability distribution. We say that $X$ is log-concave when for all nonempty measurable sets $A, B \subset \mathbb{R}^p$ and every $\alpha \in [0, 1]$, $\mu(\alpha A + (1 - \alpha) B) \ge \mu(A)^{\alpha} \mu(B)^{1 - \alpha}$.
Many natural measures are log-concave. Among the examples are measures that have a log-concave density, the volume measure of a convex body, and many others. A well-known fact about a log-concave random vector $X$ of $\mathbb{R}^p$ follows from Borell's inequality (cf. Milman & Schechtman (1986)): for every $x \in \mathbb{R}^p$, $\|\langle X, x \rangle\|_{\psi_1} \le c \|\langle X, x \rangle\|_{L_1}$ where $c$ is an absolute constant. In particular, the moments of linear functionals satisfy, for all $p \ge 1$, $\|\langle X, x \rangle\|_{L_p} \le c p \|\langle X, x \rangle\|_{L_1}$.
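The moment comparison above can be checked by hand in one dimension. The sketch below uses a standard Laplace variable, whose density $\frac{1}{2}e^{-|x|}$ is log-concave and for which $\mathbb{E}|X|^q = \Gamma(q+1)$; this choice of example and the helper name `laplace_lp_norm` are assumptions made for illustration, and the exponent is written `q` to avoid a clash with the dimension.

```python
import math

def laplace_lp_norm(q):
    """L_q norm of a standard Laplace variable: Gamma(q + 1) ** (1 / q)."""
    return math.exp(math.lgamma(q + 1.0) / q)

# Ratio ||X||_{L_q} / (q * ||X||_{L_1}); it stays bounded, here by 1.
ratios = {q: laplace_lp_norm(q) / (q * laplace_lp_norm(1)) for q in (1, 2, 4, 8, 16, 32)}
for q, r in ratios.items():
    print(q, round(r, 4))
```

For this particular distribution the comparison holds with constant 1, since $q! \le q^q$; the general statement only guarantees some absolute constant $c$.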
In the following we assume that $X$ is a $\psi_2$, log-concave vector and the noise $\varepsilon$ is $\psi_2$. Let $m \in \mathbb{N}$, $\hat\beta := \hat\beta(\lambda)^{(m)}(D^{(m)})$ be fixed for the moment, and let $\mathcal{L}_{\hat\beta}(X, Y) = (Y - \langle X, \hat\beta \rangle)^2 - (Y - \langle X, \beta^* \rangle)^2$ be the corresponding excess loss function. We need to bound the $\psi_1$-norm of $\mathcal{L}_{\hat\beta}$ and to check the margin condition. For the second task, we use the log-concavity of $X$ to obtain
This proves that the dictionary $F$ satisfies the margin assumption with $\kappa = 1$. For the first task, we use the fact that $X$ is $\psi_2$ to get
Now for the construction of the dictionary, we threshold all the Lasso vectors $\hat\beta(\lambda^{(j)})$ provided by the LARS algorithm, in such a way that the $\ell_2$-norm of these vectors is smaller than a constant $K_0$. Then the dictionary $F$ satisfies Assumption (A) (with $K_0 := K_0 + \|\beta^*\|_2$). Note that under assumption (A), the aggregate with exponential weights satisfies the oracle inequality (2.11) (up to a $\log n$ factor when $\kappa = 1$, cf. Lecué (2006)). Thus, we are now in a position to apply Theorem 2.4 and Theorem 2.7.
Let $\hat\beta$ be either
where $\tau$ is a thresholding function such that $\|\tau(\beta)\|_2 \le K_0$ for all $\beta \in \mathbb{R}^p$.
This proves that the adaptation procedures provided in Section 2 optimize the prediction task of the Lasso thanks to a data-driven choice of the regularization parameter.
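The text only requires that $\tau$ maps every vector into the $\ell_2$-ball of radius $K_0$. One possible choice, given here purely as an assumption, is the radial projection onto that ball; the name `l2_threshold` is hypothetical.

```python
import numpy as np

def l2_threshold(beta, K0):
    """Radial projection onto the l2-ball of radius K0 (one possible tau)."""
    beta = np.asarray(beta, dtype=float)
    norm = np.linalg.norm(beta)
    if norm <= K0:
        return beta.copy()        # already inside the ball: left unchanged
    return beta * (K0 / norm)     # rescaled onto the sphere of radius K0

print(l2_threshold([3.0, 4.0], 1.0))   # a vector of norm 5 rescaled to norm 1
print(l2_threshold([0.3, 0.4], 1.0))   # a vector of norm 0.5, kept as is
```

Any other map with the same norm bound, for instance a coordinatewise clipping with a suitably chosen level, would serve the same purpose.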

Adaptive choice of the temperature parameter for aggregation with exponential weights
In the aggregation setup, one is given a set of data $D^{(n)}$ and a finite set $F_0$ of $M$ functions $f_1, \ldots, f_M$. The problem is to construct a procedure whose risk is as close as possible to the risk of the oracle, the best element in $F_0$. A common aggregation procedure is the aggregation procedure with exponential weights (AEW for short) defined in Equation (2.12); this procedure is defined up to a free parameter called the temperature parameter. There is some empirical evidence (cf. Gaïffas & Lecué (2007)) that an optimal temperature parameter exists which minimizes the risk of the AEW procedures. In this subsection, we use the adaptive procedures introduced in the previous section to choose the temperature parameter.
Let $(B_1, D_1), \ldots, (B_V, D_V)$ be the family of splits of $D^{(n)}$ defined in (2.1) for some $1 \le V \le n$. For any AEW procedure $f^{(T)}$ (where $T \ge 0$ is the temperature parameter) the $V$-fold CV empirical risk is defined by
We consider the following data-driven temperature
and the mCV procedure
where $G$ is a subset of $(0, +\infty)$. We want to apply Theorem 3.5 to the procedure $f^{(n)}(D^{(n)})$. We consider the bounded regression model $Y = f^*(X) + \sigma\varepsilon$ with respect to the quadratic loss function $Q((x, y), f) = (y - f(x))^2$. We consider a finite dictionary $F_0$ (constructed with a previous sample that we suppose fixed). We assume that
For every $T > 0$, we construct the aggregate with exponential weights $f^{(T)}$ associated with the dictionary $F_0$ (cf. (2.12)). Fix $A > 0$ and construct the infinite dictionary $F := \{f^{(T)} : T \ge A\}$. It is easy to check that the elements of the dictionary $F$ satisfy Assumption (A) with margin parameter $\kappa = 1$. Moreover, for every pair $T_1, T_2 > 0$ of temperature parameters, for any $n$, and any data set $D^{(n)}$, we have
Thus, the exchangeability of the AEW being obvious, Theorem 3.5 yields the following oracle inequality
Thus the procedure $f^{(n)}(D^{(n)})$ is optimal amongst all the AEW procedures with temperature parameter $T \ge A$ for a given $A$.
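The following Python sketch illustrates a data-driven choice of the temperature on a finite grid. It assumes the common form of exponential weights, $w_j \propto \exp(-n R_n(f_j)/T)$ with $R_n$ the empirical quadratic risk; the exact definition (2.12) is not reproduced here and may differ in its normalization. Plain V-fold cross-validation stands in for the procedure of the text, and the names `aew_weights` and `cv_temperature` as well as the toy dictionary are hypothetical.

```python
import numpy as np

def aew_weights(emp_risks, n, T):
    """Weights proportional to exp(-n * R_n(f_j) / T) (assumed form of the AEW)."""
    scores = -n * np.asarray(emp_risks, dtype=float) / T
    scores -= scores.max()          # stabilise before exponentiating
    w = np.exp(scores)
    return w / w.sum()

def cv_temperature(preds, y, grid, V):
    """preds[i, j] = f_j(X_i). Return the grid temperature with smallest V-fold CV risk."""
    n = len(y)
    folds = np.array_split(np.arange(n), V)
    cv_risks = []
    for T in grid:
        total = 0.0
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            # empirical quadratic risk of each dictionary element on the training part
            emp = ((preds[train_idx] - y[train_idx, None]) ** 2).mean(axis=0)
            w = aew_weights(emp, len(train_idx), T)
            # quadratic risk of the aggregate on the held-out block
            total += ((preds[test_idx] @ w - y[test_idx]) ** 2).mean()
        cv_risks.append(total / V)
    best = int(np.argmin(cv_risks))
    return grid[best], cv_risks

# Toy bounded regression problem with a fixed dictionary of five functions.
rng = np.random.default_rng(1)
n = 80
x = rng.uniform(-1.0, 1.0, n)
y = np.sin(3 * x) + 0.3 * rng.standard_normal(n)
preds = np.column_stack([np.sin(3 * x), x, x ** 2, np.zeros(n), np.cos(x)])

grid = np.array([0.1, 1.0, 10.0, 100.0])
T_hat, cv_risks = cv_temperature(preds, y, grid, V=4)
print("selected temperature:", T_hat)
```

With this form of the weights, a very small temperature concentrates the aggregate on the empirical risk minimizer, while a very large one approaches the uniform average of the dictionary.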

Preliminaries from empirical process
We start with the following lemma, which is a $\psi_1$ version of Bernstein's inequality (see, for instance, van der Vaart and Wellner (1996), Chapter 2.2).
Lemma 5.1 Let $Y, Y_1, \ldots, Y_m$ be i.i.d. mean zero random variables with $\|Y\|_{\psi_1} < \infty$. Then for any $u > 0$,
Nevertheless, it appears that Lemma 5.1 does not suit the analysis we have in mind. Indeed, most of the models worked out here satisfy a margin condition; that is, a relation of the form $\mathbb{E}Y^2 \le K(\mathbb{E}Y)^{1/(2\kappa)}$. In Lemma 5.1, the sub-Gaussian and the sub-exponential (or Poisson) behaviour of $\frac{1}{m}\sum_{i=1}^m Y_i$ are given with respect to the $\psi_1$-norm of $Y$ without reference to the term $\mathbb{E}Y^2$. According to the Central Limit Theorem, we would expect sub-Gaussian behaviour of the sum $\frac{1}{m}\sum_{i=1}^m Y_i$ with respect to the $L_2$-norm of $Y$. That is the objective of the following result. The price that one pays for replacing the $\psi_1$-norm by the $L_2$-norm under sub-Gaussian behaviour is, in general, an extra factor $\log m$ in the sub-exponential behaviour.
Proposition 5.2 There exists an absolute constant $c > 0$ such that the following holds. Let $Y, Y_1, \ldots, Y_m$ be i.i.d. mean zero random variables such that $\|\max_{i=1,\ldots,m} Y_i\|_{\psi_1} < \infty$. Then for any $u > 0$,
Proof. We follow the line of Adamczak (2008). Let $\rho > 0$ be the truncation level to be chosen later. For every $i = 1, \ldots, m$ we define
For every $u > 0$, we have
To bound the first term of (5.1), we use the classical Bernstein inequality for bounded variables together with the inequality
Now take $\rho := 8\,\mathbb{E}\max_{1 \le i \le m} |Y_i|$. For the second term of (5.1), we note that, by Chebyshev's inequality,
Thus, by Hoffmann-Jørgensen's inequality (cf. Proposition 6.8 in Ledoux & Talagrand (1991)), we have
Consequently, since $\mathbb{E}|X| \le K\|X\|_{\psi_1}$ for any random variable $X$,
Now, we use Theorem 6.21 of Ledoux & Talagrand (1991)
Combining the last result and Equation (5.3), we get
In particular, we have
(5.4)
We obtain the result by using the last inequality together with Equation (5.2) in Equation (5.1) and noting that $\rho$
To obtain oracle inequalities for the mCV and amCV procedures, we need to control the suprema of some empirical processes. The next lemma is precisely such a bound for a (shifted) empirical process, and its conditions are formulated in terms of a general risk bound and a margin condition.
Note that one of the main advantages of the set of assumptions of Lemma 5.3 is that we are allowed to use unbounded random variables. In the bounded case $\max_{Q \in \mathcal{Q}} \|Q(Z)\|_{\infty} \le b_0$, we recover the classical Bernstein inequality since $\max_{Q \in \mathcal{Q}} \|\max_i Q(Z_i)\|_{\psi_1} \le b_0$. But, if we only have a $\psi_1$ control of the type $\max_{Q \in \mathcal{Q}} \|Q(Z)\|_{\psi_1} \le b_0$, then by using the following classical result due to Pisier (cf. for instance van der Vaart and Wellner (1996))
Thus, by using this approach a (maybe extra) $\log m$ term may appear in the upper bound of the shifted process in Lemma 5.3 when the margin parameter $\kappa$ equals 1. If $\kappa > 1$, then we obtain the same upper bound for both $L_\infty$ and $L_{\psi_1}$ control.
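To see why a $\log m$ factor is natural when passing from a $\psi_1$ control of each variable to a control of their maximum, consider the simplest sub-exponential example. For i.i.d. standard exponential variables, the expected maximum of $m$ of them equals the harmonic number $H_m$, which grows like $\log m$. The sketch below only illustrates this growth; it is not a statement of Pisier's inequality, and the helper name is hypothetical.

```python
import math

def expected_max_exponential(m):
    """E max(Y_1, ..., Y_m) for i.i.d. Exp(1) variables, i.e. the harmonic number H_m."""
    return sum(1.0 / k for k in range(1, m + 1))

for m in (1, 10, 100, 1000):
    h = expected_max_exponential(m)
    # H_m is sandwiched between log(m) and 1 + log(m)
    print(m, round(h, 4), round(math.log(m), 4))
```

Each variable has the same $\psi_1$-norm whatever $m$ is, yet the size of the maximum grows logarithmically in $m$.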

Proof of Lemma 2.3
We first prove the result for $f_{\mathrm{amCV}}$. By definition of $(D^{(n)})$, we have, for any $j \in \{1, \ldots, p\}$,
(5.6)
Using inequality (5.6), we have the following basic inequality for all $j$ and any set of data $D$
Since the $Z_i$'s are i.i.d., it follows that the expectation of the first term in (5.7) is such that for every $j$,
and, by using the convexity of the risk, the expectation of the second term in (5.7) is such that
which now gives the desired result.
We can follow the same lines to obtain the oracle inequality for $f^{(n)} = f^{(n)}_{\mathrm{mCV}}$. But instead of using the convexity of the risk in the second line of the last computation, we use the exchangeability and the "organized" partition scheme of the data provided by (2.1) and (2.2). Indeed, for this partition scheme,  satisfies some exchangeability properties under particular permutations of the data: for any $k = 1, \ldots, V$, we introduce the permutation $\phi_k(j) = j + k n_C \ [n]$ (where $[n]$ means modulo $n$). By using the exchangeability of the statistics $\hat f_1, \ldots, \hat f_p$, it is easy to see that for any $k = 1, \ldots, V$ and $j = 1, \ldots, p$
and for $\phi_k(B_p) := \{\phi_k((p - 1) n_C + 1), \ldots, \phi_k(p n_C)\}$ and $\phi_k(D_p) := \{Z_i : i \notin \phi_k(B_p)\}$, and thus that $(Z_{\phi_k(1)}, \ldots, Z_{\phi_k(n)}) = (D^{(n)})$. Moreover, for each $k = 1, \ldots, V$, $\phi_k(D^{(n_V)}) = \phi_k(D_{V-1}) = D_k$, so we have

Proof of Lemma 3.4
Let $\epsilon > 0$ and take $Q_0 \in \mathcal{Q}$ such that $P Q_0 \le \epsilon$. We have
For every Q 1 , Q 2 ∈ Q L 2 , Proposition 5.2 and Pisier's inequality yield for any $u \ge 1$
where

Lemma 3.1
We have two different setups, depending on the procedure that we want to study:
1. If the risk function $f \longmapsto R(f)$ is convex, then the averaged version of the modified CV (cf. (3.2)) with $N_C$ arbitrary deterministic splits of $n$ pieces of data into $n_V$ pieces of training and $n_C$ pieces of test data satisfies the following oracle inequality with f