Weak transport inequalities and applications to exponential inequalities and oracle inequalities

We extend the dimension free Talagrand inequalities for the convex distance \cite{talagrand:1995} using an extension of Marton's weak transport \cite{marton:1996a} to metrics other than the Hamming distance. We study the dual form of these weak transport inequalities for the Euclidean norm and prove that it implies sub-Gaussianity and the convex Poincar\'e inequality \cite{bobkov:gotze:1999a}. We obtain new weak transport inequalities for non-product measures, extending the results of Samson \cite{samson:2000}. Many examples are provided to show that the Euclidean norm is an appropriate metric for classical time series. Our approach, based on trajectory coupling, is more efficient for obtaining dimension free concentration than existing contractive assumptions \cite{djellout:guillin:wu:2004,marton:2004}. Expressing the concentration properties of the ordinary least squares estimator as a conditional weak transport problem, we derive new oracle inequalities with fast rates of convergence in dependent settings.

1. Introduction. In his remarkable paper [49], Talagrand proves that convex distances have dimension free concentration properties. Since the seminal work of Marton [38], transport inequalities are known to efficiently yield such dimension free concentration inequalities. Using a duality argument, Bobkov and Götze [10] even proved that transport inequalities are equivalent to some concentration inequalities. Our references on the subject are the monograph of Villani [51], the survey of Gozlan and Léonard [25] and the textbook of Boucheron et al. [16] for the statistical applications. In dependent settings, transport inequalities appear as a nice alternative to the modified log-Sobolev approach of Massart [42] for obtaining the dimension free concentration inequalities that lead to fast rates of convergence in mathematical statistics. This article develops new kinds of transport inequalities, new exponential inequalities and new oracle inequalities with fast rates of convergence in dependent settings.
In the case of product measures with common margin (iid case), the modified log-Sobolev approach developed in [42] leads to optimal dimension free concentration inequalities of Bernstein's type. However, for non-product measures, such inequalities do not hold in their optimal form in many situations. The reason is the following: in the bounded iid case, Bernstein's inequality yields Gaussian behavior for deviations less than a bound depending on the essential supremum. In many bounded Markovian cases, there exists a unique regeneration scheme of iid cycles with random length. Then the variance terms in the Bernstein's type inequalities are perturbed by the concentration properties of the random length, see [8]. This yields an additional term, at least logarithmic, which cannot be removed, see [1]. It is a drawback for statistical applications, for which the variance term of the iid case is essential. To bypass this problem, many authors assume contractive conditions on the kernel of Markov chains, see Marton [39] under geometric ergodicity and Lezaud [36] under a spectral gap condition. For symmetric Markov processes, the spectral gap condition is more general than uniform ergodicity and is also necessary for Bernstein's inequality, see [28].
Many classical models in time series analysis do not satisfy such contractive conditions. Fortunately, the classical Bernstein's inequality also holds for non Markovian processes under $\tilde\Gamma$-weakly dependent conditions, closely related to the uniform mixing condition, see [48]. This result yields fast convergence rates of order $n^{-1}$ in oracle inequalities (comparable to those in the iid case) in dependent settings, see [2]. However, this approach relies on the maximal coupling properties of the Hamming distance and cannot be extended to other metrics, see [19]. For other metrics, contractive conditions are used by Marton [41] and Djellout et al. [20] to extend classical dimension free transport inequalities $T_{2,d}(C)$ to a dependent context for metrics $d$ different from the Hamming one. If the "constant" $C$ in the transport inequality is sufficiently close to the variance term then Bernstein's inequality is recovered and fast convergence rates are achieved, see [31]. Otherwise, a tradeoff must be made between the estimates of the variance terms and the accuracy of the contractive coupling schemes, see [52] for details. The fast rates of convergence in oracle inequalities are not achieved because the variance terms of the Bernstein's type inequalities do not have the same order as in the iid case. On the contrary, the Hoeffding inequality is easily extended to very general dependent cases using the bounded difference method, see [44,46,20]. Unfortunately, the Hoeffding inequality, equivalent to the $T_1$ transport inequality, is not dimension free and yields oracle inequalities with slow rates of convergence of order $n^{-1/2}$, see [3].
In this paper we develop weak transport inequalities to obtain dimension free exponential inequalities and thus fast convergence rates in oracle inequalities. Let $E$ be a Polish space and $d$ be a lower semi-continuous metric on $E$. With the notation $P[h] = \int h\,dP$ for any probability measure $P$ and any measurable function $h$, we say that $P$ satisfies the weak transport inequality $\tilde T_{p,d}(C)$ for $C > 0$ and $1 \le p \le q$ if for any measure $Q$
$$\sup_{\alpha}\inf_{\pi}\frac{\pi[\alpha(Y)\,d(X,Y)]}{Q[\alpha(Y)^q]^{1/q}} \le \sqrt{2CK(Q|P)}$$
with $1/p + 1/q = 1$ and the convention $+\infty/+\infty = 0/0 = 0$. Here $\alpha$ is any non-negative measurable function, $\pi$ is any coupling scheme of $(X, Y)$ with margins $(P, Q)$ and $K(Q|P)$ is the relative entropy $Q[\log(dQ/dP)]$ (also called the Kullback-Leibler divergence). As the roles of $P$ and $Q$ are not the same, we also introduce $\tilde T^{(i)}_{p,d}(C)$ where $P$ and $Q$ are interchanged in the left hand side term. An application of Sion's minimax theorem shows that the weak transport inequalities are extensions of the Marton transport inequalities [40]
$$\inf_{\pi} Q\big[\pi[d(X,Y)\mid Y]^p\big]^{1/p} \le \sqrt{2CK(Q|P)}.$$
These inequalities are weakened forms of the classical $T_{p,d}(C)$ transport inequality
$$W_{p,d}(P,Q) := \inf_{\pi}\pi[d^p(X,Y)]^{1/p} \le \sqrt{2CK(Q|P)}.$$
Contrary to the classical $T_{p,d}(C)$ transport inequalities, any compactly supported measure $P$ satisfies the weak $\tilde T_{p,d}(C)$ transport inequalities for any $1 \le p \le 2$. Moreover, the weak transport inequalities extend nicely to non-product, non-contractive measures $P$ on $E^n$, $n \ge 1$. Using a new Markov trajectories coupling scheme, our main result in Theorem 3.8 states that there exists $C' > 0$ such that $P$ satisfies $\tilde T_{p,d}(C' n^{2/p-1})$. The main assumptions bear on the conditional laws $P_{|x^{(i)}}$ of the trajectory $(X_{i+1}, \dots, X_n)$ given that $(X_i, \dots, X_0) = x^{(i)} = (x_i, \dots, x_0)$. Fix a lower semi-continuous auxiliary metric $d'$ satisfying $d \le M d'$ for some $M > 0$. We assume the existence of a trajectory coupling scheme $\pi_{|i}$ of the $E^{n-i}$-supported measures $(P_{|x^{(i)}}, P_{|y_i,x^{(i-1)}})$ such that
$$\pi_{|i}[d^p(X_k,Y_k)]^{1/p} \le \gamma_{k,i}(p)\,d'(x_i,y_i).$$
The existence of such coefficients $\gamma_{k,i}(p)$ for any $1 \le i < k \le n$ is called the $\Gamma(p)$-weakly dependent condition. When $d = d'$ is the Hamming distance, the $\Gamma(2)$-weak dependence coincides with the weak dependence already studied by Samson [48] and we recover his results. We keep the notation of [48] and denote by $\tilde\Gamma(p)$ the weak dependence when $d$ is the Hamming distance. However, to deal with more general and classical time series, we prefer to choose $d$ as the Euclidean norm, see Section 4. When $p = 1$, the $\Gamma(1)$-weak dependence coincides by definition with the one of [46] when $d'$ is the Hamming distance and with the one of [20] when $d' = d$. Thus we recover the Hoeffding inequalities of [46,20]. They are not dimension free because $n^{2/p-1} = n$ when $p = 1$, and we prefer to focus our study on the case $p = 2$. We then prove dimension free concentration for ARMA processes under the minimal dependence assumption that the stationary distribution exists.
Our approach improves on the existing ones based on contractive arguments [20,41] for classical time series. For instance, considering the Markov chain $(X_t, \xi_t)$ formed by an ARMA(1,1) process $X_t = \phi X_{t-1} + \xi_t + \theta\xi_{t-1}$, the contraction condition is $\phi^2 + \theta^2 < 1$ whereas our condition is $|\phi| < 1$. Thus, we give a positive answer to a crucial question raised in Remark 3.6 in [20].
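The gap between the two ARMA(1,1) conditions can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the innovations are taken standard Gaussian for convenience, and the coupling simply drives two trajectories with the same innovations from different starting points. It is not the coupling scheme of Theorem 3.8 itself.

```python
import random


def contraction_condition(phi, theta):
    """Contraction condition quoted in the text for the chain (X_t, xi_t)."""
    return phi ** 2 + theta ** 2 < 1


def weak_dependence_condition(phi):
    """Condition quoted in the text for the trajectory coupling approach."""
    return abs(phi) < 1


def coupled_gaps(phi, theta, x0, y0, n, seed=0):
    """Run two ARMA(1,1) paths from x0 and y0 with the SAME innovations
    (an illustrative coupling) and return |X_t - Y_t| for t = 1..n."""
    rng = random.Random(seed)
    x, y, prev_xi = x0, y0, 0.0
    gaps = []
    for _ in range(n):
        xi = rng.gauss(0.0, 1.0)  # Gaussian innovations: an assumption
        x = phi * x + xi + theta * prev_xi
        y = phi * y + xi + theta * prev_xi
        prev_xi = xi
        gaps.append(abs(x - y))
    return gaps


phi, theta = 0.9, 0.9
gaps = coupled_gaps(phi, theta, x0=1.0, y0=0.0, n=50)
print("contraction condition holds:", contraction_condition(phi, theta))
print("|phi| < 1 holds:", weak_dependence_condition(phi))
print("sum of coupled gaps:", sum(gaps), "geometric bound:", phi / (1 - phi))
```

With $\phi = \theta = 0.9$ the contraction condition fails, yet the coupled gap at time $t$ equals $\phi^t |x_0 - y_0|$ whatever the value of $\theta$, so the gaps remain summable as soon as $|\phi| < 1$.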
Weakening transport inequalities does not deteriorate the concentration properties used in mathematical statistics. More precisely, we show that $\tilde T_{2,d}(C)$ and $\tilde T^{(i)}_{2,d}(C)$ yield the dimension free convex distance estimate due to Talagrand, valid for any measurable set $A$; see Propositions 3.4 and 3.7.
Here $d_c(x, A)$ is the convex distance of Talagrand [49] when $d$ is the Hamming distance, and the Euclidean distance to the convex hull of $A$ as in Maurey [37] when $d$ is the Euclidean norm. Following Bobkov and Ledoux [9], we obtain this result by analyzing the dual form of the weak transport inequality. If $P$ satisfies $\tilde T_{2,d}(C)$ on $E$, then an exponential inequality, referred to as (1.2), holds for any function $f$ whose increments are controlled through $d$ by auxiliary functions $L_j$. When $d$ is the Hamming distance, inequality (1.2) yields the Bernstein inequality, see Ledoux [34] in the independent setting and Samson [48] in the $\tilde\Gamma(2)$-weakly dependent setting. When the function $f$ is convex, the condition above is automatically satisfied with $L_j = \partial_j f$ (the sub-gradients) and $d$ the Euclidean norm. The inequality (1.2) coincides with generalizations of the Tsirel'son inequality discovered in [50] for Gaussian measures, see [12]. Using the dual form (1.2), we also prove that $\tilde T_2$ implies sub-Gaussianity and the convex Poincaré inequality [11]. Hence the weak transport approach provides dimension free concentration properties of ARMA processes under minimal assumptions and is sufficient for statistical applications.
As the transport inequalities yield concentration of measure via relative entropy, we couple them with the statistical PAC-Bayesian paradigm, which also describes the accuracy of estimators in terms of relative entropy, see [43]. Thus, instead of using the approach based on the supremum of the empirical process [42], we introduce conditional weak transport inequalities that provide sharp oracle inequalities. We apply this new approach to the Ordinary Least Square (OLS) estimator $\hat\theta$ in the linear regression context (other interesting statistical issues will be investigated in the future). Denoting by $R$ the risk of prediction, an oracle inequality states with high probability that $R(\hat\theta) \le (1 + \eta)R(\bar\theta) + \Delta_n \eta^{-1_{\eta \ne 0}}$, where $\eta \ge 0$, $\bar\theta$ is the oracle defined by $R(\bar\theta) \le R(\theta)$ for all $\theta$ and $\Delta_n$ is the rate of convergence. If $\eta = 0$ then the oracle inequality is said to be exact and otherwise it is nonexact, see [33]. The dimension free concentration properties obtained from the weak transport inequalities with $p = 2$ yield fast rates of convergence $\Delta_n \propto n^{-1}$. When $d$ is the Hamming distance, we recover in the conditional weak transport inequalities the variance terms of the iid case. These variance terms play a crucial role through Bernstein's condition [7] to obtain exact oracle inequalities with fast rates of convergence. Thus, when $d$ is the Hamming distance, we obtain new exact oracle inequalities with fast convergence rates for the OLS $\hat\theta$ in the $\tilde\Gamma(2)$-weakly dependent case. However, in more general cases, Bernstein's condition cannot hold and the variance terms cannot be nicely estimated. There, we emphasize that the Tsirel'son inequality should be preferred to the Bernstein one. Hence, using the Euclidean metric, we obtain new nonexact oracle inequalities for the OLS $\hat\theta$ for $\Gamma(2)$-weakly dependent time series. The efficiency of the OLS is proved for many models such as classical ARMA models.
The paper is organized as follows: Section 2 develops the preliminaries used in the proof of our main result, a weak transport inequality for non-product measures stated in Section 3. In Section 3 we also study the dual form of the weak transport inequalities, the Tsirel'son inequality that is satisfied and the connection with Talagrand's inequalities. Section 4 is devoted to some examples of $\tilde\Gamma$- and $\Gamma$-weakly dependent processes. Finally, new oracle inequalities with fast rates of convergence are given in Section 5.
2.1. Weak transport costs on $E$. Let $M(F)$ denote the set of probability measures on some space $F$, $M^+(F)$ the set of lower semi-continuous non-negative measurable functions and $\tilde M(P, Q)$ the set of coupling measures $\pi_{x,y}$, i.e. $\pi_{x,y} \in M(E^2)$ with margins $\pi_x = P$ and $\pi_y = Q$. Let $(p, q)$ be real numbers satisfying $1 \le p \le 2$ and $1/p + 1/q = 1$. Let us define the weak transport cost as
$$\tilde W_{p,d}(P,Q) = \sup_{\alpha \in M^+(E)}\ \inf_{\pi \in \tilde M(P,Q)} \frac{\pi[\alpha(Y)\,d(X,Y)]}{Q[\alpha(Y)^q]^{1/q}}.$$
Note that $\tilde W_{p,d}$ is not symmetric and that $\tilde W_{p,d}(P,Q) \le W_{p,d}(P,Q)$ by Hölder's inequality. Note also that $\alpha \in M^+$ and $d$ are assumed to be lower semi-continuous so that the optimal transport in the weak transport cost definition exists, see for example [25]. Now let us show that the weak transport cost satisfies the triangular inequality. It is a simple consequence of the second assertion of the following version of the gluing Lemma:
Lemma 2.1. For any couplings $\pi_{x,y} \in \tilde M(P, Q)$ and $\pi_{y,z} \in \tilde M(Q, R)$ there exists a distribution $\pi_{x,y,z}$ with corresponding margins and such that $X$ and $Z$ are independent conditional on $Y$, i.e. $\pi_{x,z|y} = \pi_{x|y}\pi_{z|y}$.
Proof. From the classical gluing Lemma, see for example Villani's textbook [51], we can choose $\pi_{x,y,z}$ such that $\pi_{x,y,z} = \pi_{x|y}\pi_{z|y}\pi_y$ as the margins correspond: $\pi_{x|y}\pi_y = \pi_{x,y}$ and $\pi_{z|y}\pi_y = \pi_{y,z}$. The conditional independence follows from the specific form of $\pi_{x,y,z}$ as $\pi_{x,z|y} = \pi_{x,y,z}/\pi_y$ by definition.
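The conditional independence in the gluing Lemma 2.1 is the main ingredient to prove the triangular inequality on $\tilde W_{p,d}$:
Lemma 2.2. For any $P, Q, R$ we have $\tilde W_{p,d}(P,R) \le \tilde W_{p,d}(P,Q) + \tilde W_{p,d}(Q,R)$.
imsart-aop ver. 2012/04/10 file: WeakTRevAx.tex date: May 5, 2014
Proof. Let us fix $\alpha \in M^+(E)$ such that $R[\alpha^q] < \infty$. By the triangular inequality on $d$, for any $\pi_{x,y,z}$ with margins $\pi_{x,y} \in \tilde M(P,Q)$ and $\pi_{y,z} \in \tilde M(Q,R)$ we have
$$\pi_{x,y,z}[\alpha(Z)d(X,Z)] \le \pi_{x,y,z}[\alpha(Z)d(X,Y)] + \pi_{y,z}[\alpha(Z)d(Y,Z)].$$
Let us choose $\pi^*_{y,z}$ satisfying
$$\pi^*_{y,z}[\alpha(Z)d(Y,Z)] = \inf_{\pi \in \tilde M(Q,R)} \pi[\alpha(Z)d(Y,Z)] \le \tilde W_{p,d}(Q,R)\,R[\alpha^q]^{1/q}.$$
By conditional independence in Lemma 2.1, we also have $\pi_{x,y,z}[\alpha(Z)d(X,Y)] = \pi_{x,y}[\tilde\alpha(Y)d(X,Y)]$ with $\tilde\alpha = \pi^*_{y,z}[\alpha(Z) \mid Y = \cdot\,]$. Let us choose $\pi^*_{x,y}$ satisfying
$$\pi^*_{x,y}[\tilde\alpha(Y)d(X,Y)] = \inf_{\pi \in \tilde M(P,Q)} \pi[\tilde\alpha(Y)d(X,Y)] \le \tilde W_{p,d}(P,Q)\,Q[\tilde\alpha^q]^{1/q} \le \tilde W_{p,d}(P,Q)\,R[\alpha^q]^{1/q},$$
using Jensen's inequality. Let us denote by $\pi^* = \pi^*_{x,y,z}$ the distribution obtained by the gluing Lemma 2.1 from $\pi^*_{x,y}$ and $\pi^*_{y,z}$. Collecting all these bounds we have
$$\inf_{\pi \in \tilde M(P,R)} \pi[\alpha(Z)d(X,Z)] \le \pi^*[\alpha(Z)d(X,Z)] \le \big(\tilde W_{p,d}(P,Q) + \tilde W_{p,d}(Q,R)\big)R[\alpha^q]^{1/q},$$
and taking the supremum on $\alpha$ the desired result follows from the definition of $\tilde W_{p,d}(P,R)$.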

2.2. Markov couplings.
In this section, we only consider Markov couplings on the product space $E^n$ with $n = 2$, the cases $n > 2$ following by a simple induction.
The terminology of Markov couplings was introduced by Rüschendorf in [47]. Similar couplings are used by Marton in [40]. The property of conditional independence in the gluing Lemma 2.1 is nicely compatible with Markov couplings:
Lemma 2.3. For any Markov couplings $\pi_{x,y} \in \tilde M(P, Q)$ and $\pi_{y,z} \in \tilde M(Q, R)$ with $P, Q, R \in M(E^2)$ there exists a distribution $\pi_{x,y,z}$ with corresponding margins and such that $X = (X_1, X_2)$ and $Z = (Z_1, Z_2)$ are independent conditional on $Y = (Y_1, Y_2)$.
The same reasoning shows that the second margin is also the correct one.
2.3. Weak transport costs on $E^n$, $n \ge 2$. We extend the definition of $\tilde W$ to the product space $E^n$ for $n \ge 2$. Let $P, Q \in M(E^n)$ we define for any fixed $\alpha = (\alpha_j)_{1 \le j \le n} \in M^+(E^n)$. Considering Markov couplings, we use the conditional independence in the gluing Lemma 2.3 to assert that the weak transport cost on $E^n$ also satisfies the triangular inequality. More usefully, $\tilde W_{\alpha,d}$ satisfies an inequality similar to the triangular one:
Remark 2.1. As a consequence of Lemma 2.4, we obtain the triangular inequality for $\tilde W$ by taking the supremum on $\alpha$ on both sides of (2.8) and using the relation
Proof. Let us fix $\alpha \in M^+(E^n)$ such that $R[\alpha_j^q] < \infty$ for all $1 \le j \le n$. Define recursively the couplings $\pi^*_{y,z}$ and $\pi^*_{x,y} \in \tilde M(E^2)$ such that where we use Jensen's inequality. Let us denote by $\pi^* = \pi^*_{x,y,z}$ the distribution obtained by the gluing Lemma 2.3 from $\pi^*_{x,y}$ and $\pi^*_{y,z}$. Then The inequality (2.8) follows from (2.10) taking $\tilde\alpha_j = \pi^*_{y,z}[\alpha_j(Z) \mid Y = \cdot\,]$ and noticing that the relation holds by an application of Jensen's inequality.

3. Weak transport inequalities.
3.1. Weak transport inequalities and dual forms. Let us say that the probability measure P on E n , n ≥ 1, satisfies the weak transport inequalitỹ T p,d (C) when for all distribution Q on E n we have Let us say that P satisfies the inverted weak transport inequalityT By an application of Jensen's inequality, P satisfiesT p,d (C) andT Thus, when d is not specified, we consider the case n = 1 only with no loss of generality. We haveT 1,d Following [10], we investigate the dual form of the weak transport. Denote and C b the set of continuous bounded functions with values in R, we have the following dual forms of the weak transport inequalities: Proof. As their proofs are similar, we prove the first dual form only. From the dual form ofW α,d for α ∈ M + (E) fixed we havẽ Then a measure P satisfiesT p,d (C) if for any α ∈ M + (E) and any probability measure Q sup From the variational identity we get for all λ > 0: We can rewrite it as From the Young inequality Then the desired result follows from the variational formula of the entropy.
In the case $p = 1$, we have $q = \infty$ and the dual forms (3.3) and (3.4) only depend on $\alpha$ through the fact that $\alpha(y) \le 1$, $y \in E$. Then one can consider $\alpha = 1$ and $f_{\alpha,d}(y) = \inf_{x \in E}\{d(x, y) + f(x)\}$, which forces us to consider Lipschitz functions, and we recognize the dual form of the transport inequality $T_{1,d}$, that is, the Hoeffding inequality: $P[\exp(\lambda(f - P[f]))] \le \exp(C\lambda^2/2)$ for any $f \in \mathrm{Lip}_1$ and $\lambda > 0$. Here $\mathrm{Lip}_1$ is the set of 1-Lipschitz functions $f$ with respect to $d$.
To obtain similar results when $p = 2$, it is crucial to identify the map $f \mapsto f_{\alpha,d}$. In the sequel, we focus on the cases where $d$ is the Hamming distance or the Euclidean norm in $\mathbb{R}^n$.
As the difference f α,1 − f is unchanged when adding a constant on f , we can take inf f = 0 with no loss of generality and But for any X > 0 we have X − X 2 /2 ≤ log(1 + X) and thus T 2,1 (1) follows by taking X = λf . For the inverted weak transport, we apply the inequality and we obtain Remark 3.1. Alternative proofs of the Corollary 3.2 are given in [40] and a stronger version is given in [48].
Proof. Let c * be the weights that achieves the supremum in d T . We have Then, by the convex inequality 3.3. The specific case d = N , the euclidian metric. Next we consider the case of E = R n equipped with the euclidian norm · that we denote d = N . We obtain   Remark 3.3. The inequality (3.6) is called the Tsirels'on inequality who discovered it for independent Gaussian random variables with the optimal constant C = 1. Corollary 6.1 in [12] states that it holds for any measures satisfying the log-Sobolev inequality. In particular,T 2,1 2 (C) holds for any log-concave measure dP/dx = e −V with C-strongly convex function V .
Proof. We consider only the case n = 1 as the extension to n ≥ 1 follows the same reasoning. First, we note that one can let α take non positive values in (3.3) and (3.4). A simple computation of the minimizer in the definition of f α,d for sufficiently smooth f gives the following identities: As α is a free parameter, it is also the case for f ′ (x) and thus x. Thus we can restrict ourselves to convex function f and we distinguish two cases: either x = y or not. If x varies, noticing that the dual form of the weak transport inequality (3.3) only depends on x through we derive by x the function g = λf and we obtain As g is convex, the solution Cg ′ (x) = y − x is excluded and thus worst g are affine functions g(x) = ax + b for some a, b ∈ R. We obtain the condition implied by (3.6). If x = y, then we also obtain (3.6). In any case (3.6) is a necessary and sufficient condition. Now let us prove the equivalence for T (i) 1,d (C). We first notice that, when n = 1, the dual form (3.4) is equivalent to where g α (y) = sup x {g(x) − α(y)|x − y|}. Following the same reasoning than above, one can consider only the cases where g(x) − g ′ (x)(x − y) ≥ g(y) and g ′ (x) 2 = α(y) 2 . Denoting λg = f , if one can let x vary, we optimize the term by taking f an affine function. Otherwise, x = y and we obtain that any concave function g satisfies (3.7).
We relate weak transport inequalities to more classical notions of concentration. Recall that a measure P on E = R n is sub-gaussian if there exists c > 0 such that This property is equivalent to T 1,N (C) for some C > 0, see [20,13], and it is a very common assumption in statistics. We say that P satisfies the convex Poincaré inequality if for any separately convex function g Remark 3.4. The convex Poincaré inequality on E = R has been studied in [11]. It is satisfied for X standard normal or in [0, 1] with C = 1. It also holds with the same constant C for the product measure on R n , n > 1.
Notice that the convex Poincaré inequality is equivalent to a concave Poincaré inequality. We obtain
Theorem 3.6. The weak transport inequality $\tilde T_{2,N}(C)$ or $\tilde T^{(i)}_{2,N}(C)$ implies sub-Gaussianity and the convex Poincaré inequality.
Proof. The arguments developed in this proof are classical, see [35]. We detail the case of $\tilde T_{2,N}$ when $n = 1$ because the proof for $n > 1$ and $\tilde T^{(i)}_{2,N}$ follows the same reasoning. Assume that $P$ satisfies $\tilde T_{2,d}(C)$ or $\tilde T^{(i)}_{2,d}(C)$ and apply (3.6) to $g(x) = \lambda x$: $P[\exp(\lambda(X - P[X]))] \le \exp(C\lambda^2/2)$, $\lambda > 0$. Then $P$ must be sub-Gaussian. Now, applying (3.6) or (3.7) to $tg$ with $t \to 0$ we obtain the convex Poincaré inequality in both cases.
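The sub-Gaussian bound $P[\exp(\lambda(X - P[X]))] \le \exp(C\lambda^2/2)$ appearing in the proof can be checked on a simple example. The sketch below assumes $X$ is Bernoulli(1/2), a choice made here only for illustration, for which the centered moment generating function is $\cosh(\lambda/2)$ and Hoeffding's lemma gives the constant $C = 1/4$. The function names are hypothetical, and the check covers a finite grid of $\lambda$ only; it says nothing about the transport inequality itself.

```python
import math


def centered_bernoulli_mgf(lam):
    """P[exp(lam * (X - 1/2))] for X ~ Bernoulli(1/2), equal to cosh(lam / 2)."""
    return math.cosh(lam / 2.0)


def subgaussian_bound(lam, C):
    """Right-hand side exp(C * lam^2 / 2) of the sub-Gaussian bound."""
    return math.exp(C * lam ** 2 / 2.0)


def satisfies_bound(C, lams):
    """Check the bound for every lam on the given finite grid."""
    return all(centered_bernoulli_mgf(l) <= subgaussian_bound(l, C) for l in lams)


lams = [0.1 * k for k in range(1, 101)]  # grid of lam in (0, 10]
print("C = 1/4 works on the grid:", satisfies_bound(0.25, lams))
print("C = 1/5 works on the grid:", satisfies_bound(0.20, lams))
```

The smaller constant fails for small $\lambda$, where $\cosh(\lambda/2) \approx 1 + \lambda^2/8$ exceeds $\exp(\lambda^2/10) \approx 1 + \lambda^2/10$.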
Tsirel'son inequality (3.6) quantifies the concentration of "self bounding" functions with respect to the Euclidean norm, i.e. convex functions $f$ such that $\|\nabla f\|^2 \le f$. Let $A$ be a measurable set of $\mathbb{R}^n$ and $B$ its convex hull, then $d_N(x, B) = \inf_{y \in B}\|x - y\|^2/4$ is a self bounding function. Following the same reasoning as in the proof of Proposition 3.4, we obtain the Euclidean version of Talagrand's concentration inequality [37]:
Proposition 3.7. If the law $P$ of $X = (X_1, \dots, X_n)$ satisfies $\tilde T_{2,N}(C)$ and $\tilde T^{(i)}_{2,N}(C)$, then the Euclidean version of Talagrand's concentration inequality holds for any measurable set $A$.
Remark 3.6. The result is proved for independent $X_j$s on $[0, 1]$ or standard normal with the optimal constant $C = 1$ in [37] via the convex property $(\tau)$.

3.4. Coupling trajectories. Dual forms provided in Theorem 3.1 are particularly powerful to derive weak transport inequalities when $n = 1$. In order to obtain concentration for measures on $E^n$, $n > 1$, we prove that the weak transport inequalities hold for non-product measures. To obtain constants as sharp as possible, we couple trajectories via the new notion of $\Gamma_{d,d'}(p)$-weak dependence.
To any law $P$ on $E^n$, add artificially time 0 and put $X_0 = Y_0 = x_0 = y_0$ for a fixed point $y_0 \in E$. Denote $x^{(i)} = (x_i, \dots, x_0)$ for $i \ge 0$ and $P_{|x^{(i)}}$ the conditional laws of $(X_{i+1}, \dots, X_n)$ given that $(X_i, \dots, X_0) = x^{(i)} = (x_i, \dots, x_0)$. Let $d$ and $d'$ be two lower semi-continuous distances on $E$ such that $d \le M d'$ for some $M > 0$. Let us work under the following weak dependence assumption: there exist a coupling scheme $\pi_{|i}$ of $(P_{|x^{(i)}}, P_{|x^{(i-1)},y_i})$ and coefficients $\gamma_{k,i}(p) \ge 0$ such that
$$\pi_{|i}[d^p(X_k,Y_k)]^{1/p} \le \gamma_{k,i}(p)\,d'(x_i,y_i), \qquad i < k \le n. \tag{3.8}$$
Let us denote by $\Gamma(p)$ the matrix formed by the coefficients $\gamma_{k,i}(p)$. The matrix $\Gamma(p)$ has $n$ rows and $n$ columns. We equip $\mathbb{R}^n$ with the $\ell_p$-norm and the set of matrices of size $n \times n$ with the subordinated norm, both denoted $\|\cdot\|_p$ for any $1 \le p \le \infty$.
Remark 3.7. When the process (X t ) is stationary, we have Proof. The proofs of the two assertions are similar as the weak dependence condition (3.8) is symmetric in x i and y i . Thus the proof of the second assertion is omitted.
Let us fix $\alpha \in M^+(E^n)$ such that $Q[\alpha_j^q] < \infty$ for all $1 \le j \le n$. As preliminaries, we recall the following result on the existence of the optimal Markov coupling due to Rüschendorf [47], and a simple and useful consequence of this result stated in Lemma 3.10.
Theorem 3.9 (Theorem 3 in [47]). We have the equivalence between Denote α (i) j denotes the section of α j in y (i) as α (i) j (y i+1 , . . . , y n ) = α j (y) and α (i) = (α (i) j ) j>i . A simple corollary of this Theorem is the following result: Lemma 3.10. Let P , Q ∈ M (E n ) be decomposed as P = P 1 P |X 1 and Q = Q 1 Q |Y 1 for P 1 , Q 1 ∈ M (E) and P |x 1 , Q |y 1 ∈ M (E n−1 ). Then for any α ∈ M + (E n ) and any coupling π 1 ∈M (P 1 , Q 1 ) we have Proof. Let us assume that for almost all x 1 , y 1 ∈ E we haveW α (1) ,d (P |x 1 , Q |y 1 ) < ∞. Then, by lower semi-continuity, it exists π * |x 1 ,y 1 such that: Thus the desired result follows from Theorem 3.9 remarking that for any x 1 , by definition of Markov couplings.
Let us consider now the following coupling scheme denotedπ defined recursively asπ =π n|n−1 · · ·π 2|1π1|0 ∈M (E n ) whereπ j|j−1 =π x j ,y j |x (j−1) ,y (j−1) is determined such that We are now ready to prove the result iterating several times the same reasoning. Let us detail the case j = 1 when considering probabilities conditional on y 0 . Applying (3.9) and (2.8) we havẽ To bound the last term, we use the definition of the Γ d,d ′ (p)-weak dependence: Lemma 3.11. For any α k ∈ M + (E) for all j < k ≤ n and any Γ d,d ′ (p)weakly dependent probability measure P we have Proof. Assume that Q[α q k ] < ∞ for j < k ≤ n. Then, applying the Holder inequality and the definition of Markov couplings, we havẽ W α (j) ,d (P |x j ,y (j−1) , P |y (j) ) = inf because the Γ(p)-weak dependence condition ensures the existence of a coupling scheme satisfying Denoting γ i,i = M for all 1 ≤ i ≤ n, note that by assumption we have the relationd( Collecting the bounds (3.11) and (3.12) we obtaiñ Let us do the same reasoning than above for any 1 ≤ j ≤ n conditional on y (j) onW α (j−1) ,d (P |y (j) , Q |y (j) ). For any 1 ≤ j ≤ n, we obtain: For the specific Markov coupling considered here, the identity (3.10) holds andW where the last inequality follows from the concavity of x → x 1/q and Jensen's inequality. Applying an inductive argument, we obtaiñ the second inequality following from Hölder's and Jensen's inequalities and the last one from the assumption P x j |y (j−1) ∈T p,d (C). Let us denote Q the row vector (Q[α q k ] 1/q ) 1≤k≤n and W the column vector (Q[2CK(P x j |Y (j−1) |Q y j |Y (j−1) ) p/2 ] 1/p ) ′ 1≤j≤n . With <; > denoting the scalar product, we obtaiñ Note that we have the identities As p/2 ≤ 1, successive applications of Jensen's and Hölder's inequalities yield Finally, we obtain The desired result follows by taking the supremum over all α ∈ M + (E n ).

Examples of
is the Hoeffding inequality (which is not dimension free). We then recover concentration results that have been proved using the bounded difference approach of [44]. Applying the Kantorovich-Rubinstein inequality, we obtain an explicit expression of $\gamma_{k,i}(1)$: In the bounded case $d \le M$ and $d' = 1$, the $\Gamma_{d,1}(1)$-weak dependence condition coincides with the one introduced by Rio [46]. As the conditional probabilities $P_{x_j|x^{(j-1)}}$ automatically satisfy Pinsker's inequality $\tilde T_{1,1}(1/4)$, Theorem 3.8 recovers the Hoeffding inequality of [46]. The context of $\Gamma_{N,1}(1)$-weak dependence is extensive and we refer the reader to Section 7 of [18] for a detailed study of many examples in this case, including causal functions of stationary sequences, iterated random functions, Markov kernels and expanding maps. When $d = d'$ the $\Gamma_{d,d}$-weak dependence condition coincides with the condition $(C_1)'$ of [20]: for any $f \in \mathrm{Lip}_1(d_1)$ it holds From Remark 3.7 we have $\|\Gamma(1)\|_1 \le 1 + S$ and thus Theorem 3.8 recovers the Hoeffding inequality of [20]. Examples of $\Gamma_{d,d}(1)$-weakly dependent time series are given in [20]. In particular, ARMA processes with sub-Gaussian innovations satisfy the conditions of Theorem 3.8 for $p = 1$, $d = d' = N$. Thus they satisfy Hoeffding's inequality (which is not dimension free).
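Pinsker's inequality, used above in the form $\tilde T_{1,1}(1/4)$, can be checked on finite distributions. The sketch below assumes that for $p = 1$ and the Hamming distance the transport cost reduces to the total variation distance (the standard maximal coupling identity), so that the inequality reads $\mathrm{TV}(P,Q) \le \sqrt{2 \cdot (1/4) \cdot K(Q|P)} = \sqrt{K(Q|P)/2}$. The function names and the example distributions are hypothetical.

```python
import math


def relative_entropy(q, p):
    """K(Q|P) = sum_i q_i log(q_i / p_i), with the convention 0 log 0 = 0."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi > 0.0:
            if pi == 0.0:
                return math.inf  # Q is not absolutely continuous w.r.t. P
            total += qi * math.log(qi / pi)
    return max(total, 0.0)  # guard against tiny negative rounding errors


def total_variation(p, q):
    """Total variation distance, i.e. the minimal value of pi[X != Y]."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def pinsker_holds(p, q):
    """Check TV(P, Q) <= sqrt(K(Q|P) / 2)."""
    return total_variation(p, q) <= math.sqrt(relative_entropy(q, p) / 2.0)


P = [0.5, 0.5]
Q = [0.9, 0.1]
print("TV:", total_variation(P, Q))
print("sqrt(K/2):", math.sqrt(relative_entropy(Q, P) / 2.0))
print("Pinsker holds:", pinsker_holds(P, Q))
```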
For Markov chains, the $\tilde\Gamma(p)$-weakly dependent condition is equivalent to the uniform ergodicity condition. In the stationary case, $\tilde\gamma^p_{k,i} \le 2\phi_{k-i}$ where $\phi$ is the uniform mixing coefficient introduced by Ibragimov [29]. For $p = 2$, we recover the transport inequality obtained by Samson [48] as any $P_{x_j|x^{(j-1)}}$ satisfies $\tilde T_{2,1}(1)$ and by an application of Theorem 3.8 we obtain Notice that we use here the minimax theorem of Sion and Proposition 1 of [40] to obtain the identity In the stationary case, any $\phi$-mixing process is $\tilde\Gamma(p)$-weakly dependent with $\|\tilde\Gamma(p)\|_p \le 1 + \sum_{i=1}^n (2\phi_i)^{1/p}$ for any $1 \le p \le 2$, see [48]. But the $\tilde\Gamma(p)$-weak dependence is also satisfied for non stationary sequences, see [32]. However, when $E$ is a real vector space, the choice of the Hamming distance is not natural and the resulting weakly dependent conditions are often too restrictive.
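The bound $1 + \sum_{i=1}^n (2\phi_i)^{1/p}$ on the norm of the dependence matrix is easy to evaluate once the mixing coefficients are given. The sketch below assumes geometrically decaying coefficients $\phi_i = \rho^i$, a hypothetical choice made only for illustration, and shows that the bound then stays bounded in $n$. The function names are not from the source.

```python
def gamma_norm_bound(phi, p):
    """Evaluate 1 + sum_i (2 * phi_i)^(1/p) for a list of mixing coefficients."""
    return 1.0 + sum((2.0 * phi_i) ** (1.0 / p) for phi_i in phi)


def geometric_mixing(rho, n):
    """Hypothetical mixing coefficients phi_i = rho^i for i = 1..n."""
    return [rho ** i for i in range(1, n + 1)]


for n in (10, 100, 1000):
    print(n, gamma_norm_bound(geometric_mixing(0.25, n), p=2))
```

For $\rho = 1/4$ and $p = 2$ each term equals $\sqrt{2}\,2^{-i}$, so the bound increases to $1 + \sqrt{2}$ as $n$ grows and never exceeds it.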

$\Gamma_{N,d'}(2)$-weakly dependent examples.
In what follows, we show that the choice $d = N$ is natural in many examples in $E = \mathbb{R}^k$. We focus on two generic examples: the Stochastic Recurrent Equations, treated in [20] when $p = 1$ only, and the chains with infinite memory [21]. As an explicit expression of these coefficients is not available, we use the natural coupling provided by the structure of the model to estimate the coefficients $\gamma_{k,i}$.
Example 4.1 (Stochastic Recurrent Equations (SREs)). Consider the SRE (also called Iterated Random Functions in [22] and Random Dynamical Systems in [20]) where $(\psi_t)$ is a sequence of iid random maps. We denote by $P$ the probability of the whole process $(\psi_t)_{t \ge 1}$. Assume in the next proposition that $d$ and $d'$ are any lower semi-continuous metrics satisfying $d \le M d'$ for some $M > 0$.
for any $x \in E$ and if there exists some $S > 0$ satisfying then for any $x \in E$ we have that the law $P^n_x$ of $(X_t(x))_{1 \le t \le n}$ on $E^n$ satisfies $\tilde T_{p,d}(C(M + S)^2 n^{2/p-1})$ or $\tilde T^{(i)}_{p,d}(C(M + S)^2 n^{2/p-1})$.
Proof. The result is proved by an application of Theorem 3.8. The condition of $\Gamma_{d,d'}$-weak dependence is satisfied because the joint law of $(X_t(x), X_t(x'))_{t \ge 1}$ is a natural coupling scheme $\pi_{|0}$ of the law of $(X_t)_{t \ge 1}$ given that $(X_0, X_{-1}, X_{-2}, \dots) = (x, x_{-1}, x_{-2}, \dots)$ and $(X_0, X_{-1}, X_{-2}, \dots) = (x', x_{-1}, x_{-2}, \dots)$. We obtain similarly natural coupling schemes $\pi_{|i}$ for any $i \ge 0$ and the coefficients $\gamma_{k,i}$ satisfy the relation $\sum_{k > i}\gamma_{k,i}(p) \le S$. The fact that the relation $\sum_{k < n}\gamma_{n,k}(p) \le S$ also holds for any $n \ge 2$ follows from the exchangeability of $(\psi_1, \dots, \psi_n)$. Using arguments similar to those of Remark 3.7, we obtain that $\|\Gamma_{d,d'}(p)\|_p \le M + S$. The result is proved as, by the Markov property, $P_{x_j|x^{(j-1)}}$ is the law Let us detail two classical SREs, the ARMA models and the general affine processes when $d = d' = N$. The first two examples cannot be treated optimally by existing results in [20,41] that use contractive conditions.
Example 4.2 (ARMA models). Consider the ARMA model The notion of $\Gamma_{N,N}(2)$-weak dependence is more general than the usual mixing ones. For instance, the solution of $X_{t+1} = X_t/2 + \xi_{t+1}$ with $\xi_1 \sim B(1/2)$ is $\Gamma_{N,N}(2)$-weakly dependent but not mixing, see [4].

Example 4.3 (General affine processes). Consider now the specific SRE
and the noise $(\xi_t)$ is a sequence of iid random vectors of $\mathbb{R}^{k'}$ such that its distribution $P_\xi$ is centered. Fix $p = 2$ and assume that: $K > 0$, $\|\cdot\|$ denoting also the operator norm on $M_{k,k'}$ associated with the Euclidean norms on $\mathbb{R}^k$ and $\mathbb{R}^{k'}$; (c) the Lyapunov exponent in $L^2$ satisfies Using a version of Lemma 2.1 in [20] we obtain that conditions (1) and (2) imply that $P_{x_i|x_{i-1}} \in \tilde T_{2,\|\cdot\|}(CK^2)$ or $\tilde T^{(i)}_{2,\|\cdot\|}(CK^2)$. Moreover condition (4.2) is satisfied for some $S > 0$ and thus P n
Example 4.4 (Chains with Infinite Memory). Here assume that $d = d'$ is any lower semi-continuous distance. Consider chains with infinite memory defined in [21] for any function $F : E^{\mathbb{N}} \times X \to E$ by the relation: for any sequence $x = (x_{-t})_{t \ge 0} \in E^{\mathbb{N}}$ and any iid innovations $\xi_t$ on some measurable space $X$. This model does not satisfy the Markov property. However, there still exists a natural coupling scheme of the law of $(X_t)_{t \ge 1}$ given that where the innovations $(\xi_t)_{t \ge 1}$ are the same as in (4.3). Then the natural coupling scheme $\pi_{|0}$ is the distribution of $(X_t(x), Y_t)_{t \ge 1}$. Denote by $P$ the law of the innovations process $(\xi_t)$ and by $P^n_x$ the law of $(X_t(x))_{1 \le t \le n}$ on $E^n$, we have
Proposition 4.2. Assume there exists a sequence of non negative numbers $(a_i)$ such that $\sum_{i \ge 1} a_i = a < 1$, $\sum_{i \ge 1} i\log(i)a_i < \infty$ and for any $x = (x_1, x_2, \dots)$ and y = (y 1 ,
Proof. Let us compute a bound for the coefficients $P[d^p(X_t(x), Y_t)]^{1/p}$ for all $t \ge 0$ that are estimates of the coefficients $\gamma_{t,0}$. Fix Because $a < 1$ then $\gamma_{i+t,i}$ is decreasing with $t \ge 1$ for any $i \ge 0$. We have Arguments similar to those in the proof of Theorem 3.1 in [21] yield The desired result follows by choosing $p = cr/\log(r)$ such that $\sum_{t \ge 1}\gamma_{i+t,i} < \infty$.
Example 4.5 (AR(∞) models). As an example of chains with infinite memory in $E = \mathbb{R}$, consider $(X_t)$ the stationary solution to the autoregressive equation [...] where the real numbers $a_i$ are such that $\sum_{i\ge 1}|a_i| < 1$ and $\sum_{i\ge 1} i\log$ [...] p,N (C), then the distribution of $(X_1, \ldots, X_n)$ satisfies $\tilde T_{2,N}(C')$ or $\tilde T^{(i)}_{p,N}(C')$, $C' > 0$, for any $n \ge 1$.
Example 4.6 (General affine processes with infinite memory). Consider the process on $E = \mathbb{R}^k$ defined as the solution of [...], $\forall t \ge 1$, where $f$ and $M$ are Lipschitz continuous functions with values in $\mathbb{R}^k$ and $M(k, k')$ respectively. These general affine models include classical econometric models and are estimated in a parametric setting by the quasi maximum likelihood estimator in [6]. Denote for $\Psi = f$ and $\Psi = M$ the Lipschitz coefficients [...]
In this section, we use the weak transport inequality to obtain new nonexact oracle inequalities in the $\Gamma(2)$-weakly dependent setting and new exact oracle inequalities in the $\tilde\Gamma(2)$-weakly dependent setting. Instead of using the Talagrand concentration inequalities given in Propositions 3.4 and 3.7, we prefer a more direct approach based on conditional weak transport inequalities.
5.1. The statistical setting. We focus on oracle inequalities for the ordinary least square estimator. Let us consider the case of the linear regression where $X = (Y, Z) = (Y, Z^{(1)}, \ldots, Z^{(d)})$ and $E = \mathbb{R}^{d+1}$. The empirical risk is denoted [...] are the observations and $\theta \in \mathbb{R}^d$ is a parameter that has to be estimated. In our context, these observations are not necessarily independent nor identically distributed and we denote by $P$ their distribution. The risk of prediction is denoted [...] The aim is to estimate the value $\bar\theta \in \mathbb{R}^d$ such that $R(\bar\theta) \le R(\theta)$, $\forall\theta \in \mathbb{R}^d$. We consider the Ordinary Least Square (OLS) estimator $\hat\theta$ of $\bar\theta$ such that $r(\hat\theta) \le r(\theta)$ for all $\theta \in \mathbb{R}^d$. We denote the excess of risk $\bar R(\theta) = R(\theta) - R(\bar\theta) \ge 0$, $\bar r$ its empirical counterpart, $Z = (Z_i)_{1\le i\le n}$ the $n \times d$ matrix of the design, $\|Z\|_n^2 = n^{-1}\sum_{i=1}^n \|Z_i\|^2$ and $G = P[Z^T Z]$ its corresponding Gram matrix. Assume that $G$ is a positive definite matrix and denote $\rho = \max(1, \rho_{sp}(G^{-1}))$.

All the results of this section are given for probability measures $P$ satisfying $\tilde T_{2,d}(C)$ and $\tilde T^{(i)}_{2,d}(C)$ for some $C > 0$ on $E^n$, $n \ge 1$, with $d = N$ or 1 2. In view of Theorem 3.8 we focus on the case $p = 2$ to get dimension free concentration. The constant $C$ in the weak transport inequality has to be estimated in each specific statistical case via the $\Gamma_{N,N}(2)$ or $\tilde\Gamma(2)$-weak dependence properties of $(Y_i, Z_i)_{1\le i\le n}$.

The case of possibly non linear autoregression is of special interest. There, the vector $Z_i$ is a function of the past values $\varphi(Y_{i-1}, \ldots, Y_1)$ where $\varphi$ can be chosen as the projection on the last coordinates (case of linear autoregression), functions on a Fourier basis or wavelets, etc. If the maximal order of delay $\ell \ge 1$ is fixed, we have $\gamma^X_{k,0}(2) \le \gamma^Y_{\lceil k/\ell\rceil,0}(2)$ and, in the nonlinear autoregressive case, $\tilde\gamma^X_{k,0}(2) \le \|\varphi\|_\infty\,\tilde\gamma^Y_{\lceil k/\ell\rceil,0}$.
Under conditions on the dependence of $Y$, the $\tilde\gamma(2)$ coefficients of $X$ are nicely estimated for any bounded measurable function $\varphi$, whereas the $\gamma(2)$ coefficients also depend on the regularity of $\varphi$. Thus, there is a tradeoff: $d = N$ is restricted to regular functions $\varphi$ and some sub-gaussian margins, but the dependence structure of the observations can be very general. Conversely, $d = 1$ corresponds to general functions $\varphi$ with no assumptions on the margins, but for observations that are nearly independent.
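The statistical setting above can be made concrete with a small sketch in the linear autoregressive case. The following is an illustration under explicit assumptions that are not taken from the text: the empirical risk is taken to be the average squared loss $n^{-1}\sum_i (Y_i - Z_i\theta)^2$, the data come from a hypothetical AR(2) recursion with gaussian noise, and the helper names (`empirical_risk`, `solve_linear`, `ols`) are invented for the example. The design row $Z_i$ is the projection on the last two past values.

```python
import random


def empirical_risk(theta, Y, Z):
    """Average squared prediction error (assumed form of the empirical risk)."""
    n = len(Y)
    total = 0.0
    for y, z in zip(Y, Z):
        prediction = sum(zj * tj for zj, tj in zip(z, theta))
        total += (y - prediction) ** 2
    return total / n


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, d):
            factor = M[r][col] / M[col][col]
            for c in range(col, d + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * d
    for i in reversed(range(d)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, d))
        x[i] = (M[i][d] - tail) / M[i][i]
    return x


def ols(Y, Z):
    """Minimise the empirical risk by solving the normal equations."""
    n, d = len(Y), len(Z[0])
    # Empirical Gram matrix and empirical cross moments of (Y, Z).
    gram = [[sum(z[i] * z[j] for z in Z) / n for j in range(d)] for i in range(d)]
    cross = [sum(y * z[i] for y, z in zip(Y, Z)) / n for i in range(d)]
    return solve_linear(gram, cross)


# Hypothetical dependent observations: an AR(2) series with gaussian noise.
rng = random.Random(1)
series = [0.0, 0.0]
for _ in range(200):
    series.append(0.5 * series[-1] - 0.2 * series[-2] + rng.gauss(0.0, 1.0))

Y = series[2:]
Z = [[series[t - 1], series[t - 2]] for t in range(2, len(series))]
theta_hat = ols(Y, Z)
```

The observations $(Y_i, Z_i)$ built this way are neither independent nor identically distributed at small times, which is the situation the weak dependence coefficients are meant to handle.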

5.2. Conditional weak transport inequalities. We recall the classical approach based on the concentration of the empirical process to motivate the introduction of our new approach. Following [42], oracle inequalities will follow from the concentration properties of $r(\hat\theta)$. However, as the distribution of $\hat\theta$ is difficult to deal with, one studies the concentration of the supremum of the empirical process [...] Likely, $f$ is a self-bounding function (for $d$ = 1 2) and one can use the weak transport to extend Bernstein inequalities on $f$ to dependent settings, see [48]. To obtain oracle inequalities, it remains to study the expected value of the supremum of the empirical process. In an independent context, a classical solution consists in using a chaining argument and an entropy metric approach, see Chapter 13 of [16]. In dependent settings, this is not an easy task because the entropy metric depends on the mixing properties, see [45].
Here we take another route that avoids the study of the supremum of the empirical process. Following the PAC-bayesian approach [43,17], the idea is to consider probability measures $\rho_\theta$ centered on $\theta$. Then the concentration properties of $r(\hat\theta)$ will follow from the transport properties of the measure $P\rho_{\hat\theta}$ on $E^n \times \Theta$ equipped with some metric $d$. Notice that $\rho_{\hat\theta}$ is a probability measure defined conditionally on the observations $(X_i)$. Thus, the properties of the measure $P\rho_{\hat\theta}$ are not simple to handle directly. The PAC-bayesian approach consists in artificially introducing the measure $\rho_{\bar\theta}$, called a priori because it does not depend on the observations $(X_i)$. The conditional weak transport approach then extends the transport properties of $P$ for the metric $d_\theta(x, y) = d((x, \theta), (y, \theta))$ to $\rho_{\bar\theta}P$ for $d$. Then we transport $P\rho_{\hat\theta}$ to $Q\rho_{\hat\theta}$, for any $Q$. Finally, considering $f = r - R$ or $f = \bar r - \bar R$, and noticing that $f$ has nice "self-bounding" properties for $d_\theta$ = 1 2 or $d = N$ respectively, we obtain oracle inequalities via $Q\rho_{\hat\theta}[f]$ because $\rho_{\bar\theta}P[f] = 0$, see (5.2) for $d_\theta = N$ and (5.3) for $d_\theta$ = 1 2. More formally, for any metric space $\Theta$, we have [...], $\theta \in \Theta$, then for any $\mu \in M_+(\Theta)$ the measure $\mu \otimes P$ satisfies $\tilde T_{p,d}$ or $\tilde T$ [...]

Proof. Write any measure on $E^n \times \Theta$ as $Q\nu$, where $Q \in M_+(E^n)$ and $\nu$ is defined conditionally on $X$. From the proof of Theorem 3.1, $\tilde T_{p,d}$ is equivalent to the linear inequality [...] Denote by $Q_\theta$ the conditional probability measure such that $\mu Q_\theta = Q\nu$. From $\tilde T_{p,d_\theta}(C_\theta)$, we infer that [...] We obtain the desired result by linearity, integrating in $\mu$ and remarking that $K(Q\nu|\mu \otimes P) = K(\mu Q_\theta|\mu \otimes P) = \mu[K(Q_\theta|P)]$.

5.3. Nonexact oracle inequality for $\Gamma_{N,N}(2)$-weakly dependent sequences. Our first result is a bound on the excess of risk of the OLS estimator for $\Gamma_{N,N}(2)$-weakly dependent observations $X$. Let us first give an oracle inequality that follows from the conditional weak transport described above:

Theorem 5.2. Assume that $X$ satisfies $\tilde T_{2,N}(C)$ and $\tilde T$ [...] where K := 4 d β [...]

Proof.
Considering the change of variable $(Z, \theta) \to (ZG^{-1/2}, G^{1/2}\theta)$, we assume that the Gram matrix $G$ is the identity matrix. This change of variable is a $\rho$-Lipschitz function. Thus $ZG^{-1/2}$ satisfies $\tilde T_{2,N}(\rho C)$ and $\tilde T^{(i)}_{2,N}(\rho C)$, using arguments similar to those in Lemma 2.1 in [20]. In all the sequel, we thus consider $G = I_d$ and $Z$ satisfying $\tilde T_{2,N}(\rho C)$ and $\tilde T^{(i)}_{2,N}(\rho C)$. With this notation, $P[\|Z\|_n^2] = d$ and $\|\theta - \bar\theta\|^2 = R(\theta) - R(\bar\theta)$. We first study the self-bounding properties of $f = r$: using the inequality [...] We apply the conditional weak transport approach. By definition of $\tilde W_2$ and using the Cauchy-Schwarz inequality, we obtain for any $Q_\theta$ conditional on $\theta$ that [...] As $P$ satisfies $\tilde T_{2,N}(\rho C)$ and $\tilde T^{(i)}_{2,N}(\rho C)$, and using the Cauchy-Schwarz inequality, we obtain [...] and [...] To end the proof, let us compute $\rho_\theta[(1 + \|\theta\|^2)R(\theta)]$ using the following identity [...] Let us decompose the last term: [...] where $Y = (Y_1, \ldots, Y_n)$. Simple computations on gaussian random variables give [...] The desired result follows by collecting all these bounds and noticing that $4P[YZ]\theta \le 2nR(\theta)$.
In the proof above, we obtain the more general result: for any probability measures $\mu$ and $\nu$ such that there exists $Q_\theta$ satisfying $Q\mu = \nu Q_\theta$ we have: [...] Let us discuss the choices $\mu = \rho_{\hat\theta}$ and $\nu = \rho_{\bar\theta}$ made above. As soon as $\mu$ is centered in $\hat\theta$, Jensen's inequality yields $Q\mu[R(\theta)] \ge Q[R(\hat\theta)]$. Next, if µ is sufficiently concentrated around $\hat\theta$ then Qµ[r(θ) − r(θ)] is small as r(θ) − r(θ) < 0. Choosing $\mu$ as the Dirac mass in $\hat\theta$ is excluded by the condition of existence of some measure $Q_\theta$ satisfying $\nu Q_\theta = Q\mu$. The fact that the support of $\mu$ cannot depend on the observations $(X_i)$ constrains us to choose measures supported on the whole space $\mathbb{R}^d$ in the absence of a priori information on $\hat\theta$. The term Qµ[r(θ) − r(θ)] can be seen as an alternative to the classical entropy metric and VC-dimension approach, see McAllester [43]. The measure $\mu$ should be chosen in order to bound this term (and the entropy $K(\mu|\nu)$). It leads to Gibbs estimators that are nice alternatives to classical estimators, see Chapter 4 of the textbook of Catoni [17] in the iid case, and [3,2] in weakly dependent settings. Here we choose the gaussian measures $\mu = \rho_{\hat\theta}$ and $\nu = \rho_{\bar\theta}$ as in Audibert and Catoni [5] for simplicity, because $K(\nu|\mu) = \frac{\beta}{2}\|\hat\theta - \bar\theta\|^2$. This choice leads to estimating the term Qµ[r(θ) − r(θ)] by $Q[\|Z\|_n^2]/\beta$. This term can easily be estimated by the sum of $d/\beta$ and a concentration term involving the entropy $K(Q|P)$. Thus we obtain nonexact oracle inequalities:

Corollary 5.3. For any $0 < \varepsilon < 1$ and any $(d + 2)/n < \eta < 1$ we have with probability $1 - \varepsilon$: [...] $B_2 = 2(5 + \|\theta\|^2)$, [...]

Remark 5.2. This result extends nonexact oracle inequalities as developed in [33] to a dependent context, but for the OLS only. Instead of the Bernstein inequality used in [33], only Tsirel'son's inequality is used through the choice $d = N$, and thus the nonexact oracle inequality holds without any constraint on $\theta \in \mathbb{R}^d$ and for $\Gamma_{N,N}(2)$-dependent sequences with nice margins.
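The two gaussian computations invoked in this discussion can be written out. The following is a sketch under assumptions that the text does not state in full: $\rho_a$ is taken to be the gaussian law with mean $a$ and covariance $\beta^{-1}I_d$, and the empirical risk is taken to be the squared loss $r(\theta) = n^{-1}\sum_{i=1}^n (Y_i - Z_i\theta)^2$. The symbols $a$ and $b$ are generic centers introduced only for this illustration.

```latex
% Assumptions (illustrative): \rho_a = N(a, \beta^{-1} I_d) on R^d,
% r(\theta) = n^{-1} \sum_i (Y_i - Z_i \theta)^2.
\begin{align*}
K(\rho_a|\rho_b)
  &= \rho_a\Big[\tfrac{\beta}{2}\|\theta-b\|^2 - \tfrac{\beta}{2}\|\theta-a\|^2\Big]
   = \tfrac{\beta}{2}\Big(\|a-b\|^2 + \tfrac{d}{\beta} - \tfrac{d}{\beta}\Big)
   = \tfrac{\beta}{2}\|a-b\|^2, \\
\rho_a[r(\theta)] - r(a)
  &= -\frac{2}{n}\sum_{i=1}^n (Y_i - Z_i a)\, Z_i\,\rho_a[\theta - a]
     + \frac{1}{n}\sum_{i=1}^n \rho_a\big[(Z_i(\theta-a))^2\big] \\
  &= 0 + \frac{1}{n}\sum_{i=1}^n \frac{\|Z_i\|^2}{\beta}
   = \frac{\|Z\|_n^2}{\beta}.
\end{align*}
```

The first line is symmetric in $a$ and $b$ because both measures share the same covariance, and the second line holds for any center $a$, in particular for a center that depends on the observations.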
Proof. As for any $a, b > 0$ we have $2\sqrt{ab} \le a\lambda + b/\lambda$ for any $\lambda > 0$, from (5.1) we obtain [...] Notice that by definition of $K$ we have [...] By arguments similar to those in the proof of Theorem 5.2, we have $Q[r(\theta)] - R(\theta) \le 2\sqrt{2\rho C R(\theta) n^{-1} K(Q|P)}$.
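For completeness, the elementary inequality used at the start of this proof follows from expanding a square; the short derivation below uses only the symbols $a$, $b$ and $\lambda$ of the text.

```latex
% For a, b > 0 and \lambda > 0:
\begin{align*}
0 \le \Big(\sqrt{a\lambda} - \sqrt{b/\lambda}\Big)^2
  = a\lambda + \frac{b}{\lambda} - 2\sqrt{ab}
\quad\Longrightarrow\quad
2\sqrt{ab} \le a\lambda + \frac{b}{\lambda},
\end{align*}
% with equality if and only if \lambda = \sqrt{b/a}.
```

This is what turns a square-root term, such as the one produced by a transport inequality, into a sum of a term proportional to the risk and a term proportional to the entropy, at the price of the free parameter $\lambda$.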
Using again that $2\sqrt{ab} \le a\lambda + b/\lambda$, choosing $\beta = \lambda = n\eta$ and by definition of $B_1$, $B_2$ and $B_3$, we have [...] Choose $Q$ as the probability $P$ restricted to the complement of the event corresponding to the desired oracle inequality, which we denote $A$. Then [...] Combining these two inequalities, we assert that for this specific $Q$ we have $-\log(\varepsilon) \le K(Q|P)$. The relative entropy is computed explicitly, $K(Q|P) = -\log(1 - P(A))$, and thus the desired result follows.

5.4. Exact oracle inequality for $\tilde\Gamma(2)$-weakly-dependent sequences. Let us now give an equivalent of (5.2) when $d = d' = 1$. Then any $f$ has the self-bounding property $f(x) - f(y) \le |f(x)|1_{x\ne y} + |f(y)|1_{x\ne y}$. Following the lines of the proof of (5.2) with f = r we easily obtain [...] where $\|Z\|_n^4 = n^{-1}\sum_{i=1}^n \|Z_i\|^4$. The quantities Q[ Z 2 n r(θ)] and $Q[\|Z\|_n^4]$ can be difficult to estimate for probability measures $Q$. Let us work under a Bernstein assumption that estimates the variance of the excess of risk with its expectation [7]. It links the set of parameters $\Theta \subseteq \mathbb{R}^d$ and the support of $P$: there exists some finite constant $B > 0$ such that
It leads to the following equivalent of Theorem 5.2: [...] In the above estimate, the terms involving r(θ) are nuisance terms without additional conditions on θ. However, if this term is bounded, then the last term of the inequality is proportional to the excess risk Q[R(θ)]. Similarly, in the classical approach [7], the excess risk also appears in the concentration under the Bernstein condition that controls this variance term by R(θ). This is the major advantage of considering the Hamming distance (Bernstein inequality) compared with the Euclidian distance (Tsirel'son's inequality), where instead of $Q[\bar R(\hat\theta)]$ the term $Q[R(\hat\theta)]$ appeared because $\bar r$ is self-bounding for 1 2 but only $r$ for $N$. As $Q[\bar R(\hat\theta)]$ is the quantity of interest, we obtain

Corollary 5.5. If condition (5.4) holds, for any $0 < \varepsilon < 1$ and any $M > 0$ we have with probability $1 - \varepsilon$: R(θ) ≤ R(θ) + 160 B 2 + 4BM n × × Bd + 8ρC(log(ε −1 ) − log P (r(θ) > M )) + d(R(θ) + M ) 10B + 40M + 8(Bd) 2 n .
Remark 5.3. Apart from condition (5.4), the exact oracle inequality holds for any $\tilde\Gamma(2)$-weakly dependent sequence without assumptions on the margins (because any probability measure satisfies $\tilde T_{2,1}(1)$). These oracle inequalities are new, even in the iid case. We refer the reader to [5] for estimates of the term log P (r(θ) > M ) in the iid case under finite moments of order 4 only.
Because $P$ satisfies $\tilde T_2(\rho C)$ and $\tilde T$ [...] We conclude as in the proof of Corollary 5.3, choosing $Q$ as $P$ restricted to the complement of the event corresponding to the desired oracle inequality.
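The last step of both proofs, passing from an inequality valid for every $Q$ to a bound in probability, rests on an explicit entropy computation. The sketch below spells it out under the assumptions that $A$ denotes the event on which the desired oracle inequality holds, that $P(A^c) > 0$ (otherwise there is nothing to prove), and that "restricted" means conditioned, that is $Q = P(\,\cdot\,|A^c)$.

```latex
% Q = P( . | A^c), with P(A^c) > 0, so that dQ/dP = 1_{A^c} / P(A^c).
\begin{align*}
K(Q|P) = Q\Big[\log\frac{dQ}{dP}\Big]
       = Q\big[-\log P(A^c)\big]
       = -\log\big(1 - P(A)\big).
\end{align*}
% If the argument yields -\log(\varepsilon) \le K(Q|P) for this Q, then
\begin{align*}
-\log(\varepsilon) \le -\log\big(1 - P(A)\big)
\quad\Longleftrightarrow\quad
P(A) \ge 1 - \varepsilon .
\end{align*}
```

In words, the entropy of the conditioned measure measures how unlikely the bad event is, so a lower bound on this entropy is exactly an upper bound $\varepsilon$ on the probability that the oracle inequality fails.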