Concentration Bounds for Stochastic Approximations

We obtain non-asymptotic concentration bounds for two kinds of stochastic approximations. We first consider the deviations between the expectation of a given function of the Euler scheme of some diffusion process at a fixed deterministic time and its empirical mean obtained by the Monte-Carlo procedure. We then give some estimates concerning the deviation between the value at a given time step of a stochastic approximation algorithm and its target. Under suitable assumptions, both concentration bounds turn out to be Gaussian. The key tool consists in accurately exploiting the concentration properties of the increments of the schemes. For the first case, as opposed to the previous work of Lemaire and Menozzi (EJP, 2010), we do not have any systematic bias in our estimates. Also, no specific non-degeneracy conditions are assumed.


Statement of the Problem
Let us consider a d-dimensional stochastic evolution scheme of the form

$$X_{n+1} = X_n + \gamma_{n+1} F(n, X_n, Y_{n+1}), \quad n \ge 0, \qquad (1.1)$$

where (γ_n)_{n≥1} is a deterministic positive sequence of time steps, the function F : N × R^d × R^q → R^d is a measurable function satisfying some assumptions that will be specified later on, and the (Y_i)_{i∈N*} are i.i.d. R^q-valued random variables defined on some probability space (Ω, F, P) whose law satisfies a Gaussian concentration property. That is, there exists α > 0 s.t. for every real-valued 1-Lipschitz function f defined on R^q and for all λ ≥ 0:

$$E\big[\exp\big(\lambda\,(f(Y_1) - E[f(Y_1)])\big)\big] \le \exp\Big(\frac{\alpha \lambda^2}{4}\Big). \qquad (GC(\alpha))$$

From the Markov exponential inequality and (GC(α)), one derives, setting D(f, r) := P[f(Y_1) − E[f(Y_1)] ≥ r],

$$D(f, r) \le \exp\Big(-\lambda r + \frac{\alpha \lambda^2}{4}\Big), \quad \forall \lambda, r \ge 0.$$

An optimization over λ (taking λ = 2r/α) gives that D(f, r) has sub-Gaussian tails bounded by exp(−r²/α). A practical criterion for (GC(α)) to hold is given by Bolley and Villani [6]: if there exists ε > 0 s.t. E[exp(ε|Y_1|²)] < +∞, then the law of Y_1 satisfies (GC(α)) with α := α(ε). The two claims are actually equivalent. In the following, (GC(α)) is the only crucial property we require on the innovations (Y_i)_{i∈N*}. In particular we do not assume any absolute continuity of the law of Y_1 w.r.t. the Lebesgue measure.
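The Chernoff optimization above can be sanity-checked numerically. The sketch below (illustrative Python, with helper names of our choosing, not code from the paper) uses the fact that the standard Gaussian law satisfies (GC(α)) with α = 2, so the optimized bound exp(−r²/α) must dominate the exact tail of the 1-Lipschitz function f(y) = y.

```python
import math

def chernoff_tail_bound(r, alpha):
    """inf over lam >= 0 of exp(-lam*r + alpha*lam**2/4),
    attained at lam = 2r/alpha, which gives exp(-r**2/alpha)."""
    return math.exp(-r**2 / alpha)

def gaussian_tail(r):
    # Exact tail P(Z >= r) for Z ~ N(0, 1); f(y) = y is 1-Lipschitz.
    return 0.5 * math.erfc(r / math.sqrt(2.0))

# N(0, 1) satisfies (GC(alpha)) with alpha = 2, so exp(-r^2/2) must dominate.
for r in (0.0, 0.5, 1.0, 2.0, 4.0):
    assert gaussian_tail(r) <= chernoff_tail_bound(r, 2.0)
```

The comparison is deterministic: the sub-Gaussian bound holds with equality in the exponent up to the polynomial prefactor of the Gaussian tail.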
We are interested in giving non-asymptotic concentration bounds for two specific problems related to evolutions of type (1.1). We first want to control the deviations of the empirical mean associated to a function of an Euler like discretization scheme of a diffusion process at a fixed deterministic time from the real mean. Secondly, we want to derive deviation estimates between the value of a Robbins-Monro type stochastic algorithm taken at a fixed time step and its target. Under some mild assumptions, we show that the Gaussian concentration property of the innovations transfers to the scheme. Concerning stochastic algorithms, our deviation results are, to the best of our knowledge, the first of this nature.

Euler like Scheme of a Diffusion Process
Let (Ω, F, (F_t)_{t≥0}, P) be a filtered probability space satisfying the usual conditions and (W_t)_{t≥0} be a q-dimensional (F_t)_{t≥0} Brownian motion. Let us consider a d-dimensional diffusion process (X_t)_{t≥0} with dynamics:

$$dX_t = b(t, X_t)\,dt + \sigma(t, X_t)\,dW_t, \quad X_0 = x \in \mathbb{R}^d, \qquad (1.2)$$

where the coefficients b, σ are assumed to be uniformly Lipschitz continuous in space and measurable in time.
For a given Lipschitz continuous function f and a fixed deterministic time horizon T, quantities like E_x[f(X_T)] appear in many applications. In mathematical finance, this quantity represents the price of a European option with maturity T when the dynamics of the underlying asset are given by (1.2). Under suitable assumptions on the function f and the coefficients b, σ, namely smoothness or non-degeneracy, it can also be related to the Feynman-Kac representation of the heat equation associated to the generator of X. Two steps are needed to approximate E_x[f(X_T)]:
- The first step consists in approximating the dynamics by a discretization scheme that can be simulated. For a given time step ∆ = T/N, N ∈ N*, setting for all i ∈ N, t_i := i∆, we consider an Euler like scheme of the form:

$$X^\Delta_{t_0} = x, \qquad X^\Delta_{t_{i+1}} = X^\Delta_{t_i} + b(t_i, X^\Delta_{t_i})\,\Delta + \sigma(t_i, X^\Delta_{t_i})\,\sqrt{\Delta}\, Y_{i+1}, \qquad (1.3)$$

where the (Y_i)_{i∈N*} are R^q-valued i.i.d. random variables whose law satisfies (GC(α)) for some α > 0. We also assume E[Y_1] = 0_q and E[Y_1 Y_1^*] = I_q, where Y_1^* stands for the transpose of the column vector Y_1 and 0_q, I_q respectively stand for the zero vector of R^q and the identity matrix of R^q ⊗ R^q. The previous assumptions include the case of the standard Euler scheme, corresponding to Y_1 ~ N(0, I_q), which yields (GC(α)) with α = 2, and the symmetric Bernoulli law, whose coordinates are i.i.d. and take the values ±1 with probability 1/2. This latter choice can turn out to be useful, in terms of computational effort, to approximate (1.2) when the dimension is large.
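A minimal sketch of one path of such an Euler like scheme, with either Gaussian or symmetric Bernoulli innovations (illustrative Python; the function names and the test dynamics are ours, not the paper's):

```python
import numpy as np

def euler_scheme(x0, b, sigma, T, N, rng, law="gaussian"):
    """One path of an Euler-like scheme as in (1.3).
    b(t, x) returns shape (d,), sigma(t, x) returns shape (d, q);
    both innovation laws are centered with identity covariance, as required."""
    dt = T / N
    x = np.asarray(x0, dtype=float)
    q = sigma(0.0, x).shape[1]
    for i in range(N):
        t = i * dt
        if law == "gaussian":
            y = rng.standard_normal(q)           # Y_1 ~ N(0, I_q): (GC(alpha)) with alpha = 2
        else:
            y = rng.choice([-1.0, 1.0], size=q)  # symmetric Bernoulli coordinates
        x = x + b(t, x) * dt + sigma(t, x) @ (np.sqrt(dt) * y)
    return x

# Example: 2-dimensional mean-reverting dynamics with bounded, Lipschitz sigma.
rng = np.random.default_rng(0)
b = lambda t, x: -x
sigma = lambda t, x: np.tanh(1.0 + t) * np.eye(2)
x_T = euler_scheme(np.zeros(2), b, sigma, T=1.0, N=100, rng=rng)
```

Swapping `law="bernoulli"` keeps the first two moments of the innovations and hence the weak-order properties discussed below, while avoiding Gaussian sampling.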
- The second step consists in approximating the expectation E_x[f(X^∆_T)] involving the scheme (1.3) by a Monte-Carlo estimator:

$$E^\Delta_M(x, T, f) := \frac{1}{M} \sum_{j=1}^M f\big((X^{\Delta,0,x}_T)^j\big),$$

where the ((X^{∆,0,x}_T)^j)_{j∈[[1,M]]} are independent copies of the scheme (1.3) starting at x at time 0 and evaluated at time T. The global error between E_x[f(X_T)], the quantity to estimate, and its implementable approximation E^∆_M(x, T, f) can be decomposed as follows:

$$E_x[f(X_T)] - E^\Delta_M(x,T,f) = \big(E_x[f(X_T)] - E_x[f(X^\Delta_T)]\big) + \big(E_x[f(X^\Delta_T)] - E^\Delta_M(x,T,f)\big) =: E_D(\Delta,x,T,f) + E_S(\Delta,M,x,T,f). \qquad (1.4)$$

The term E_D(∆, x, T, f) corresponds to the discretization error and has been widely investigated in the literature since the seminal work of Talay and Tubaro [16]. For the standard Euler scheme, this contribution usually yields an error of order ∆, provided the coefficients b, σ and the function f are sufficiently smooth, which are the assumptions required in [16], or provided b, σ satisfy some non-degeneracy assumptions which allow one to weaken the smoothness assumptions on f. This is for instance the case in Bally and Talay [1], who obtain the expected order for a bounded measurable f and smooth coefficients satisfying a (possibly weak) hypoellipticity condition. Their proof relies on Malliavin calculus. When the diffusion coefficient is uniformly elliptic and bounded, and b, σ are also assumed to be three times continuously differentiable, the control at order ∆ for E_D(∆, x, T, f) can be derived from Konakov and Mammen [10], who use a more direct parametrix approach. When the Gaussian increments of the standard Euler scheme are replaced by more general (possibly discrete) random variables (Y_i)_{i≥1} having the same covariance matrix and the same odd moments up to order 5 as the standard Gaussian vector of R^q, it can be checked that the error expansion at order ∆ of [16] still holds for b, σ, f smooth enough. In that framework we also mention the works of Konakov and Mammen [8], [9] concerning local limit theorems for the difference between (1.2) and the scheme (1.3). As in [10], the coefficients are supposed to be smooth and σ uniformly elliptic. The associated error is then of order ∆^{1/2}, the speed of the Gaussian local limit theorem, see Bhattacharya and Rao [2].
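The two-step approximation can be illustrated on a toy case where both terms of (1.4) are measurable exactly (an illustrative sketch, not from the paper): for dX_t = dW_t and f(x) = x², one has E_x[f(X_T)] = T, the Euler scheme is exact, so E_D = 0 and only the statistical error E_S remains.

```python
import numpy as np

def mc_estimator(f, simulate_one, M, rng):
    """Monte-Carlo estimator E^Delta_M(x, T, f): empirical mean of f over
    M independent copies of the scheme, as in the decomposition (1.4)."""
    return np.mean([f(simulate_one(rng)) for _ in range(M)])

T, N, M = 1.0, 50, 20_000
dt = T / N

def simulate_one(rng):
    # Euler scheme for dX = dW starting at 0: exact in law, X_T ~ N(0, T).
    return np.sqrt(dt) * rng.standard_normal(N).sum()

rng = np.random.default_rng(1)
est = mc_estimator(lambda x: x * x, simulate_one, M, rng)
e_s = T - est   # statistical error E_S; O(M^{-1/2}) by the CLT
```

Here the deviation `e_s` is pure Monte-Carlo noise, which is exactly the quantity the non-asymptotic bounds below control.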
The term E_S(∆, M, x, T, f) in (1.4) corresponds to the statistical error. Under some usual integrability conditions, i.e. f(X^∆_T) ∈ L²(P), it is asymptotically controlled by the central limit theorem. A first non-asymptotic result is given by the Berry-Esseen theorem provided f(X^∆_T) ∈ L³(P), but for practical purposes, the crucial quantity to control non-asymptotically is the deviation between the empirical mean E^∆_M(x, T, f) and the real one E_x[f(X^∆_T)]. Precisely, for a fixed M and a given threshold r > 0, one would like to give bounds on the quantity

$$P\big[|E_S(\Delta, M, x, T, f)| \ge r\big].$$

In the ergodic framework and for a constant diffusion coefficient, Gaussian controls have been obtained by Malrieu and Talay [14]. In the current context and for the standard Euler scheme, a first attempt to establish two-sided Gaussian bounds for E_S(∆, M, x, T, f) can be found in [13], under some non-degeneracy conditions and up to a systematic bias independent of M.
In the current work we assume that the coefficients satisfy the mild smoothness condition: (A) The coefficients b, σ are uniformly Lipschitz continuous in space, uniformly in time, and σ is bounded.
Note that we do not assume any non-degeneracy condition on σ in (A).
We next show that when the innovations satisfy (GC(α)), the Gaussian concentration property transfers to the statistical error E_S(∆, M, x, T, f). In particular we get rid of the systematic bias in [13]. The key tool consists in writing the deviation using the same kind of decompositions that are exploited in [16] for the analysis of the discretization error. Denote by X^{∆,t_i,x}_T the value at time T of the scheme (1.3) starting from x ∈ R^d at time t_i, i ∈ [[0, N]], and by F_i := σ(Y_j, j ≤ i) the filtration generated by the innovations. We write

$$f(X^{\Delta,0,x}_T) - E_x[f(X^\Delta_T)] = \sum_{i=1}^N \Big( E[f(X^{\Delta,0,x}_T) \,|\, \mathcal{F}_i] - E[f(X^{\Delta,0,x}_T) \,|\, \mathcal{F}_{i-1}] \Big) = \sum_{i=1}^N \Big( E[f(X^{\Delta,t_i,y}_T)]\big|_{y = X^{\Delta,0,x}_{t_i}} - E[f(X^{\Delta,t_{i-1},y}_T)]\big|_{y = X^{\Delta,0,x}_{t_{i-1}}} \Big),$$

using the Markov property for the last equality. Introducing the function v^∆(t_i, x) := E[f(X^{∆,t_i,x}_T)], this reads

$$f(X^{\Delta,0,x}_T) - E_x[f(X^\Delta_T)] = \sum_{i=1}^N \Big( v^\Delta(t_i, X^{\Delta,0,x}_{t_i}) - E\big[v^\Delta(t_i, X^{\Delta,0,x}_{t_i}) \,\big|\, \mathcal{F}_{i-1}\big] \Big). \qquad (1.5)$$

The definition of v^∆ now yields:

$$f(X^{\Delta,0,x}_T) - E_x[f(X^\Delta_T)] = \sum_{i=1}^N \Big( f^\Delta_i(X^{\Delta,0,x}_{t_{i-1}}, Y_i) - E\big[f^\Delta_i(X^{\Delta,0,x}_{t_{i-1}}, Y_i) \,\big|\, \mathcal{F}_{i-1}\big] \Big), \qquad (1.6)$$

where

$$f^\Delta_i(x, y) := v^\Delta\big(t_i, x + b(t_{i-1}, x)\,\Delta + \sigma(t_{i-1}, x)\,\sqrt{\Delta}\, y\big).$$

The decomposition (1.5) is similar to the first step of the analysis of the discretization error. In that case, v^∆(t_i, x) is replaced by v(t_i, x) := E[f(X^{t_i,x}_T)], that is the expectation involving the diffusion at time T starting from the current value of the scheme at t_i, see [16]. Under some non-degeneracy assumptions or smoothness of the coefficients, v is smooth and Itô-Taylor expansions lead to the previously mentioned first order error for E_D(∆, x, T, f).
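The telescoping martingale decomposition can be checked numerically on a case where the conditional expectations are explicit (an illustrative sketch, not from the paper): for the scheme X_{t_{i+1}} = X_{t_i} + √∆ Y_{i+1} with Gaussian innovations and f(x) = x², one has v(t_i, z) = z² + (T − t_i), so every increment is computable in closed form and the increments must sum to f(X^∆_T) − E_x[f(X^∆_T)].

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 1.0, 32
dt = T / N
x = 0.0
increments = []
for i in range(1, N + 1):
    y = rng.standard_normal()
    x_new = x + np.sqrt(dt) * y
    # v(t_i, X_{t_i}) - E[v(t_i, X_{t_i}) | F_{i-1}],
    # with v(t_i, z) = z^2 + (T - t_i) and E[X_{t_i}^2 | F_{i-1}] = X_{t_{i-1}}^2 + dt.
    increments.append((x_new**2 + (T - i * dt)) - (x**2 + dt + (T - i * dt)))
    x = x_new

# The martingale increments sum to f(X_T^Delta) - E_x[f(X_T^Delta)] = X_T^2 - T.
assert abs(sum(increments) - (x**2 - T)) < 1e-12
```

Each increment is centered given the past, which is exactly the structure the concentration argument exploits recursively.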
To analyze the statistical error, the key idea is to exploit recursively from (1.6) that the increments of the scheme (1.3) satisfy (GC(α)). The Gaussian concentration property then readily follows provided the f^∆_i are Lipschitz in the variable y. Under (A), this smoothness is actually derived from direct stability arguments using flow techniques, see Proposition 4.1 and its proof in Section A.1.
Let us here mention the work of Blower and Bolley [3], who obtained Gaussian concentration properties for the joint law of the first n positions of stochastic processes (possibly non Markov) with values in general separable metric spaces. This result is in some sense much stronger than ours, since it can for instance yield non-asymptotic controls of the Monte-Carlo error for smooth functionals of the path, such as the maximum. However, some continuity assumptions in Wasserstein metric are made on the transition measures of the process, see e.g. condition (ii) in their Theorems 1.2 and 2.1. This is required by the coupling techniques used in the proof. Checking this kind of continuity can be hard in practice; in [3] the authors give some sufficient conditions that require the transition laws to be absolutely continuous and smooth, see their Proposition 2.2. In the current work we only need the property (GC(α)) for the innovations, which can in particular hold for discrete laws.
Also, we want to stress that, even if the concentration results coincide when the innovations (Y_i)_{i∈N*} have a smooth density, the nature of the proofs is different. Blower and Bolley exploit optimal transportation techniques, whereas our approach consists in adapting the PDE arguments used for the analysis of the discretization error to the current setting. It is actually striking that a similar error decomposition can be used for investigating both the discretization and the statistical error.
We conclude by mentioning some works related to the deviations of the 1-Wasserstein distance between a reference measure and its empirical version. In the i.i.d. case, such results were first obtained for different concentration regimes by Bolley, Guillin and Villani [5], relying on a non-asymptotic version of Sanov's Theorem. Some of these results have also been derived by Boissard [4] using concentration inequalities, and extended to ergodic Markov chains up to some contractivity assumptions in the Wasserstein metric on the transition kernel. In the i.i.d. case and the Gaussian concentration regime, these results lead to the following type of estimate:

$$P\Big[\sup_{[f]_1 \le 1} \Big| \frac{1}{M} \sum_{j=1}^M f(Z_j) - E[f(Z)] \Big| \ge r \Big] \le C(r) \exp(-K M r^2),$$

where the (Z_j)_{j∈N*} are i.i.d. with the same distribution as Z and the constants C(r) and K may be explicitly, though tediously, computed. Such uniform deviation bounds are of practical interest in statistics and numerical probability. They can indeed lead to deviation bounds for the estimation of the density of the invariant measure of a Markov chain, see [5]. However, the (possibly large) constant C(r) is the trade-off for obtaining uniform deviations over all Lipschitz functions. We do not intend to develop these aspects, but similar bounds could be established in our context.

Robbins-Monro Stochastic Approximation Algorithm
Besides our considerations for the Euler scheme, we derive non-asymptotic bounds for stochastic approximation algorithms of Robbins-Monro type. These recursive algorithms aim at finding a zero of a continuous function h : R^d → R^d which cannot be directly computed but only estimated through simulation. Such procedures are commonly used in a convex optimization framework, since minimizing a function amounts to finding a zero of its gradient. Precisely, the goal is to find a solution θ* to h(θ) := E[H(θ, Y)] = 0, where H : R^d × R^q → R^d is a Borel function and Y is a given R^q-valued random variable. Even though h(θ) cannot be directly computed, it is assumed that the random variable Y can be easily simulated (at least at a reasonable cost), and also that H(θ, y) can be easily computed for any couple (θ, y) ∈ R^d × R^q. The Robbins-Monro algorithm is the recursive scheme

$$\theta_{n+1} = \theta_n - \gamma_{n+1} H(\theta_n, Y_{n+1}), \quad n \ge 0, \qquad (1.7)$$

where (Y_n)_{n≥1} is an i.i.d. R^q-valued sequence of random variables defined on a probability space (Ω, F, P) and (γ_n)_{n≥1} is a sequence of non-negative deterministic steps satisfying the usual assumption

$$\sum_{n \ge 1} \gamma_n = +\infty \quad \text{and} \quad \sum_{n \ge 1} \gamma_n^2 < +\infty. \qquad (1.8)$$

When the function h is the gradient of a potential, the iterative scheme (1.7) can be viewed as a stochastic gradient algorithm. Indeed, replacing H(θ_n, Y_{n+1}) by h(θ_n) in (1.7) leads to the usual deterministic gradient method. One of the ideas in (1.7) is to take advantage of an averaging effect along the scheme due to the specific form of h(θ) := E[H(θ, Y)]. This makes it possible to avoid the explicit computation or estimation of h. We refer to [7], [11] for some general convergence results of the sequence (θ_n)_{n≥0} defined by (1.7) towards its target θ* under the existence of a so-called Lyapunov function, i.e. a continuously differentiable function L : R^d → R_+ such that ∇L is Lipschitz, |∇L|² ≤ C(1 + L) for some positive constant C, and ⟨∇L, h⟩ ≥ 0.
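A minimal sketch of the procedure (1.7) on a toy quadratic problem (illustrative; the problem and names are ours, not the paper's): with H(θ, y) = θ − y and Y ~ N(m, I_q), one has h(θ) = θ − m, so θ* = m, and (1.7) is a stochastic gradient descent on θ ↦ E[|θ − Y|²]/2. With the steps γ_n = 1/n, the iterate is exactly the empirical mean of the innovations.

```python
import numpy as np

def robbins_monro(H, theta0, gammas, innovations):
    """Recursive scheme (1.7): theta_{n+1} = theta_n - gamma_{n+1} H(theta_n, Y_{n+1})."""
    theta = np.asarray(theta0, dtype=float)
    for g, y in zip(gammas, innovations):
        theta = theta - g * H(theta, y)
    return theta

rng = np.random.default_rng(3)
m = np.array([1.0, -2.0])                          # target theta* = m
n_steps = 50_000
gammas = [1.0 / (k + 1) for k in range(n_steps)]   # sum = +inf, sum of squares < +inf
Ys = rng.standard_normal((n_steps, 2)) + m         # Gaussian innovations: (GC(2))
theta_n = robbins_monro(lambda th, y: th - y, np.zeros(2), gammas, Ys)
```

The output `theta_n` concentrates around m at the CLT rate, which is the regime the non-asymptotic bounds below quantify.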
See also [12] for a convergence theorem under the existence of a pathwise Lyapunov function. In the sequel, it is assumed that θ* is the unique solution of the equation h(θ) = 0 and that (θ_n) defined by (1.7) converges a.s. towards θ*. We assume that the innovations (Y_i)_{i∈N*} satisfy (GC(α)) for some α > 0 and that the following conditions on the function H and the step sequence (γ_n)_{n≥1} in (1.7) are in force: (HL), a Lipschitz continuity condition on H, and (HUA), a uniform attractivity condition on the Jacobian matrix of h. In order to derive a Central Limit Theorem for the sequence (θ_n)_{n≥1} as described in [7] or [11], it is commonly assumed that the matrix Dh(θ*) is uniformly attractive. In our current framework, this local condition on the Jacobian matrix of h at the equilibrium is replaced by the uniform assumption (HUA). This allows us to derive non-asymptotic concentration bounds uniformly w.r.t. the starting point θ_0.
Note that under (HUA) and the linear growth assumption on H (which is satisfied if (HL) holds and Y ∈ L²(P)), the function L : θ ↦ ½|θ − θ*|² is a Lyapunov function for the recursive procedure defined by (1.7), so that one easily deduces that θ_n → θ*, a.s. as n → +∞.
As for the Euler scheme, we decompose the global error between the stochastic approximation procedure θ_n at a given time step n and its target θ* as follows:

$$|\theta_n - \theta^*| = \big(|\theta_n - \theta^*| - E[|\theta_n - \theta^*|]\big) + E[|\theta_n - \theta^*|] =: E_{Emp}(\gamma, n, H, \lambda, \alpha) + \delta_n. \qquad (1.9)$$

The term E_Emp(γ, n, H, λ, α), corresponding to the difference between the absolute value of the error at time n and its mean, can be viewed as an empirical error. As for the Euler scheme, the Gaussian concentration property transfers to this quantity under (HL) and (HUA). The strategy consists in introducing again a telescopic sum of conditional expectations. Denoting for all i ∈ N, F_i := σ(Y_j, j ≤ i) (i.e. (F_i)_{i∈N} is the natural filtration of the algorithm), we write for all n ∈ N*:

$$|\theta_n - \theta^*| - E[|\theta_n - \theta^*|] = \sum_{i=1}^n \Big( E[|\theta_n - \theta^*| \,|\, \mathcal{F}_i] - E[|\theta_n - \theta^*| \,|\, \mathcal{F}_{i-1}] \Big) = \sum_{i=1}^n \Big( f^\gamma_i(\theta_{i-1}, Y_i) - E[f^\gamma_i(\theta_{i-1}, Y_i) \,|\, \mathcal{F}_{i-1}] \Big),$$

where we used the Markov property for the second equality and we introduced the notations v_i(θ) := E[|θ^{i,θ}_n − θ*|], θ^{i,θ}_n standing for the value at step n of the algorithm (1.7) started from θ at step i, and f^γ_i(θ, y) := v_i(θ − γ_i H(θ, y)). The stability of the Gaussian concentration property is then derived using that the f^γ_i are Lipschitz in the variable y, see Proposition 5.1. The term δ_n in (1.9) corresponds to the bias of the sequence (θ_n)_{n≥0} with respect to its target θ*. This contribution strongly depends on the choice of the step sequence (γ_n)_{n≥1} and of the initial point θ_0. Under (HL) and (HUA), we analyze this quantity in Proposition 5.2.

Deviations on the Euler Scheme

Theorem 2.1 (Concentration Bounds for the Euler scheme). Denote by X^∆_T the value at time T of the scheme (1.3) associated to the diffusion (1.2). Assume that the innovations (Y_i)_{i∈N*} in (1.3) satisfy (GC(α)) for some α > 0 and that the coefficients b, σ satisfy (A). Let f be a real-valued uniformly Lipschitz continuous function on R^d. For all M ∈ N* and all r ≥ 0, one has

$$P\big[|E_S(\Delta, M, x, T, f)| \ge r\big] \le 2 \exp\Big( - \frac{M r^2}{\alpha\, \Psi(T, f, b, \sigma, q)} \Big),$$

where q is the dimension of the underlying Brownian motion in (1.2), Ψ(T, f, b, σ, q) involves a constant c := c(q) and depends on f only through its Lipschitz constant.
Note that in the above theorem, we do not need any non-degeneracy condition on the diffusion coefficient. As developed in Section 1.1, see (1.6), to handle the previous quantity we rewrite f(X^{∆,0,x}_T) − E_x[f(X^∆_T)] as a sum of martingale increments involving the functions f^∆_i. If at some point along the time discretization the process has a degenerate diffusion term, the corresponding difference simply gives no additional contribution to the global deviation. With respect to the previous work [13], we get rid of the systematic bias. However, the concentration constants now depend on the Lipschitz constant of the function v^∆(0, ·) : x ↦ E[f(X^{∆,0,x}_T)], which has order Ψ(T, f, b, σ, q)^{1/2}. This magnitude corresponds to the product of the Lipschitz constant of the final function f and the mean of the Lipschitz constant of the flow of the scheme, which gives the exponential dependence in time, see Proposition 4.1 and its proof for details.
Remark 2.1 (Extension to smooth functionals of the path). We point out that the previous concentration results could be extended to some smooth functionals of the path, such as the maximum for a scalar scheme. Indeed, introducing in that case the additional state variable M^∆_{t_i} := max_{0≤j≤i} X^∆_{t_j}, the couple (X^∆_{t_i}, M^∆_{t_i})_{i∈N} is Markovian, and the flow arguments of Proposition 4.1 could be extended to this couple for functions that are Lipschitz in both variables.
Remark 2.2 (Linear SDEs and concentration). Observe that it is the boundedness of σ that gives the Gaussian concentration regime. However, in many popular models in finance, the diffusion coefficient is linear, see e.g. Black-Scholes like dynamics of the form dX_t = X_t(b dt + σ dW_t). For the estimation of E_x[f(X^∆_T)] for the associated Euler scheme, if f is bounded then the Gaussian concentration holds for the statistical error, by the Bolley and Villani criterion applied to f(X^∆_T). However, for a general Lipschitz function, the expected concentration is the log-normal one.

Deviations for Robbins-Monro algorithms
Theorem 2.2 (Concentration Bounds for Robbins-Monro algorithms). Assume that the function H of the recursive procedure (θ_n)_{n≥0} (with starting point θ_0 ∈ R^d) defined by (1.7) satisfies (HL) and (HUA), and that the step sequence (γ_n)_{n≥1} satisfies (1.8). Suppose that the law of the innovations satisfies (GC(α)), α > 0. Then, for all N ∈ N* and all r ≥ 0,

$$P\Big[\big||\theta_N - \theta^*| - E[|\theta_N - \theta^*|]\big| \ge r\Big] \le 2 \exp\Big( - \frac{r^2}{\alpha \sum_{i=1}^N [f^\gamma_i]_1^2} \Big),$$

where [f^γ_i]_1 denotes the Lipschitz constant in y of the function f^γ_i introduced in Section 1.2, controlled in Proposition 5.1. Concerning the choice of the step sequence (γ_n)_{n≥1} and its impact on the concentration rate and the bias, we obtain the following results: for γ_n = c/n with c > 1/(2λ), a comparison between the series and the integral yields Π_N Σ_{k=1}^N γ_k²/Π_k = O(N^{-1}), where Π_N := Π_{k=1}^N exp(−2λγ_k). Let us notice that we find the same critical level for the constant c as in the Central Limit Theorem for stochastic algorithms. Indeed, if c > 1/(2Re(λ_min)), where λ_min denotes the eigenvalue of Dh(θ*) with the smallest real part, then a Central Limit Theorem holds for (θ_n)_{n≥1} (see e.g. [7]). However, this local condition on the Jacobian matrix of h at the equilibrium is replaced by a uniform assumption in our framework. This is quite natural since we want to derive non-asymptotic bounds for the stochastic approximation (1.7).
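The rate Π_N Σ_{k=1}^N γ_k²/Π_k = O(N^{−1}) for γ_n = c/n, c > 1/(2λ), can be checked numerically (an illustrative sketch; here λ = 1, c = 1, and Π_k := Π_{j≤k} exp(−2λγ_j), one natural reading of the notation):

```python
import math

lam, c = 1.0, 1.0          # gamma_n = c/n with c > 1/(2*lam)

def rate(N):
    # Pi_k := prod_{j<=k} exp(-2*lam*gamma_j), tracked in log scale.
    log_pi = [0.0]
    for k in range(1, N + 1):
        log_pi.append(log_pi[-1] - 2.0 * lam * c / k)
    s = sum((c / k) ** 2 * math.exp(-log_pi[k]) for k in range(1, N + 1))
    return math.exp(log_pi[N]) * s    # Pi_N * sum_k gamma_k^2 / Pi_k

# N * rate(N) should stay bounded (here it tends to c^2/(2*lam*c - 1) = 1).
vals = [N * rate(N) for N in (100, 1000, 10_000)]
```

The boundedness of `N * rate(N)` is exactly the O(N^{-1}) claim; taking c below the critical level 1/(2λ) would make the sum diverge relative to 1/N.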
Concerning the bias, for the choice γ_n = c/n with c > 1/(2λ), the control of Proposition 5.2 then gives δ_n = O(n^{−1/2}). For steps of the form γ_n = c n^{−ρ}, ρ ∈ (0, 1), for all ǫ ∈ (0, 1 − ρ) the contribution of the initial condition in Proposition 5.2 is bounded by exp(−λc n^{1−ρ−ǫ}) |θ_0 − θ*| for n large enough. Since each step is bigger than in the case γ_n = c/n, the impact of the initial difference |θ_0 − θ*| is exponentially small.

Abstract concentration properties for a general evolution scheme
In this section we assume that (Y_i)_{i∈N*} is a sequence of i.i.d. R^q-valued random variables whose law µ satisfies the Gaussian concentration property (GC(α)) for a given α > 0.

Proposition 3.1 (Gaussian concentration for a stochastic evolution scheme). Let (X_n)_{n≥0} be given by (1.1) and let (γ_n)_{n≥1} be a given sequence of time steps. Let (f_i)_{i≥1} be functions that are Lipschitz continuous in the variable y, uniformly in x, with Lipschitz constants [f_i]_1, and set

$$S_n := \sum_{i=1}^n \Big( f_i(X_{i-1}, Y_i) - E[f_i(X_{i-1}, Y_i) \,|\, \mathcal{F}_{i-1}] \Big).$$

For all r ≥ 0, we have:

$$P[S_n \ge r] \le \exp\Big( - \frac{r^2}{\alpha \sum_{i=1}^n [f_i]_1^2} \Big).$$

Proof. Set P(r) := P[S_n ≥ r]. For λ ≥ 0 to be specified later on, the Tchebychev exponential inequality yields:

$$P(r) \le \exp(-\lambda r)\, E[\exp(\lambda S_n)]. \qquad (3.1)$$

Observe now that, working with regular conditional expectations, we have

$$E[\exp(\lambda S_n)] = E\Big[ \exp(\lambda S_{n-1})\; E\big[ \exp\big(\lambda (f_n(x, Y) - E[f_n(x, Y)])\big) \big]\Big|_{x = X_{n-1}} \Big],$$

where Y is a random variable with law µ. From (GC(α)), applied to the [f_n]_1-Lipschitz function y ↦ f_n(x, y), we derive

$$E\big[ \exp\big(\lambda (f_n(x, Y) - E[f_n(x, Y)])\big) \big] \le \exp\Big( \frac{\alpha \lambda^2 [f_n]_1^2}{4} \Big).$$

Plugging this estimate in (3.1) and iterating the procedure, we derive

$$P(r) \le \exp\Big( -\lambda r + \frac{\alpha \lambda^2}{4} \sum_{i=1}^n [f_i]_1^2 \Big),$$

and optimizing w.r.t. λ, i.e. taking λ = 2r/(α Σ_{i=1}^n [f_i]_1²), we obtain:

$$P(r) \le \exp\Big( - \frac{r^2}{\alpha \sum_{i=1}^n [f_i]_1^2} \Big).$$
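The bound of Proposition 3.1 can be tested empirically on a toy martingale where the law of S_n is known (an illustrative sketch, not from the paper): with linear functions f_i(x, y) = a_i y, so [f_i]_1 = a_i, and standard Gaussian innovations (α = 2), S_n ~ N(0, Σ a_i²) and the sub-Gaussian tail must dominate the empirical one.

```python
import numpy as np

rng = np.random.default_rng(4)
a = np.array([0.5, 1.0, 2.0, 0.25])            # Lipschitz constants [f_i]_1
n_mc = 200_000
S = rng.standard_normal((n_mc, a.size)) @ a    # n_mc independent copies of S_n
var = float(np.sum(a**2))                      # sum of squared Lipschitz constants

# Empirical tails must sit below exp(-r^2 / (alpha * sum [f_i]_1^2)), alpha = 2.
for r in (1.0, 2.0, 4.0):
    assert np.mean(S >= r) <= np.exp(-r**2 / (2.0 * var)) + 0.01
```

The small additive slack only accounts for Monte-Carlo noise; the inequality itself holds for the exact Gaussian tail.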

Euler Scheme: Proof of the Main Results
In order to apply Proposition 3.1 to the decomposition (1.6), all we need is a control on the Lipschitz modulus in the variable y of the functions f^∆_i(x, y), uniformly in x. Under the current assumptions of Theorem 2.1, we have the following proposition, which is proved in Section A.1.

Proposition 4.1 (Controls of the Lipschitz constants). The functions f^∆_i introduced after (1.6) are uniformly Lipschitz continuous in the variable y, uniformly in x, and there exists c := c(q) (q being the dimension of the underlying Brownian motion) entering the constant Ψ(T, f, b, σ, q) s.t.

$$\sum_{i=1}^N \sup_{x \in \mathbb{R}^d} [f^\Delta_i(x, \cdot)]_1^2 \le \Psi(T, f, b, \sigma, q),$$

where [f]_1 stands for the Lipschitz constant of the function f, on which Ψ depends only through [f]_1².
Combining Proposition 3.1, applied to the decomposition (1.6), with the controls of Proposition 4.1, an optimization in λ gives the result.

Robbins-Monro Algorithm: Proof of the Main Results
With the notations of Section 1.2, in order to apply Proposition 3.1 we have to control the Lipschitz constants in y of the functions f^γ_i. Under the assumptions of Theorem 2.2, the following control holds.

Proposition 5.1 (Controls of the Lipschitz constants). For all i ∈ [[1, n]], the function f^γ_i is Lipschitz continuous in the variable y, uniformly in θ, with a Lipschitz constant [f^γ_i]_1 controlled in terms of γ_i and of the constants appearing in (HL) and (HUA).

Recalling that the random variables (Y_i)_{i∈N*} satisfy (GC(α)), we obtain from Proposition 3.1 that for all r ≥ 0:

$$P\Big[\big||\theta_n - \theta^*| - E[|\theta_n - \theta^*|]\big| \ge r\Big] \le 2 \exp\Big( - \frac{r^2}{\alpha \sum_{i=1}^n [f^\gamma_i]_1^2} \Big).$$

Contrary to the result concerning the Euler scheme, a bias appears in the non-asymptotic bound for the stochastic approximation algorithm. Consequently, it is crucial to have a control on it. At step n of the algorithm, it is equal to δ_n = E[|θ_n − θ*|]. Under the current assumptions (HL) of Lipschitz continuity of H and (HUA) of uniform attractivity, we have the following proposition.

Proposition 5.2 (Control of the bias). For all n ≥ 1, we have

$$\delta_n^2 \le E[|\theta_n - \theta^*|^2] \le \Pi_n |\theta_0 - \theta^*|^2 + C\, \Pi_n \sum_{k=1}^n \frac{\gamma_k^2}{\Pi_k}, \qquad \Pi_n := \prod_{k=1}^n \exp(-2\lambda \gamma_k).$$

Proof. With the notations of Section 1.2, we define for all n ≥ 1 the martingale increment ∆M_n := H(θ_{n−1}, Y_n) − h(θ_{n−1}). From the dynamics (1.7), write now for all n ∈ N,

$$\theta_{n+1} - \theta^* = \theta_n - \theta^* - \gamma_{n+1} h(\theta_n) - \gamma_{n+1} \Delta M_{n+1} = (I - \gamma_{n+1} J_n)(\theta_n - \theta^*) - \gamma_{n+1} \Delta M_{n+1},$$

where we used that h(θ*) = 0 for the last equality. Setting

$$J_n := \int_0^1 Dh\big(\theta^* + t(\theta_n - \theta^*)\big)\, dt,$$

take now the square of the L²-norm in the previous equality. Recalling that ∆M_{n+1} is a martingale increment, we derive:

$$E[|\theta_{n+1} - \theta^*|^2] = E\big[|(I - \gamma_{n+1} J_n)(\theta_n - \theta^*)|^2\big] + \gamma_{n+1}^2 E[|\Delta M_{n+1}|^2] \le \exp(-2\lambda \gamma_{n+1})\, E[|\theta_n - \theta^*|^2] + C\, \gamma_{n+1}^2.$$

For the last inequality we used, exploiting assumption (HUA) (uniform attractivity of the Jacobian matrix of h),

$$\|I - \gamma_{n+1} J_n\| \le \exp(-\lambda \gamma_{n+1}),$$

‖·‖ standing for the matrix norm on R^d ⊗ R^d, and the inequality E[|∆M_{n+1}|²] ≤ C, which follows from (HL) and the integrability of Y_1 granted by (GC(α)). A direct induction yields for all n ≥ 1:

$$E[|\theta_n - \theta^*|^2] \le \Pi_n |\theta_0 - \theta^*|^2 + C\, \Pi_n \sum_{k=1}^n \frac{\gamma_k^2}{\Pi_k},$$

which completes the proof.
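The induction closing the proof can be checked numerically on the simplest attractive case (an illustrative sketch; h(θ) = θ − m in dimension 1, so λ = 1, with E[|∆M|²] = 1): the exact one-step recursion for E[|θ_n − θ*|²] is dominated, step by step, by the bound recursion, since (1 − g)² ≤ exp(−2g) for g ∈ [0, 1].

```python
import math

c, e0 = 1.0, 9.0        # gamma_n = c/n, |theta_0 - theta*|^2 = 9, lambda = 1, C = 1
exact, bound = e0, e0
for n in range(1, 5000):
    g = c / n
    # Exact second-moment recursion for h(theta) = theta - m, E[|Delta M|^2] = 1:
    exact = (1.0 - g) ** 2 * exact + g**2
    # Recursion behind Proposition 5.2, using (1 - g)^2 <= exp(-2g):
    bound = math.exp(-2.0 * g) * bound + g**2
    assert exact <= bound + 1e-12
```

Unrolling the bound recursion gives precisely Π_n |θ_0 − θ*|² + C Π_n Σ_{k≤n} γ_k²/Π_k, the quantity that decays at rate O(1/n) for γ_n = c/n with c > 1/(2λ).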