(Non)-penalized Multilevel methods for non-uniformly log-concave distributions

We study and develop multilevel methods for the numerical approximation of a log-concave probability $\pi$ on $\mathbb{R}^d$, based on the (over-damped) Langevin diffusion. Following \cite{art:egeapanloup2021multilevel}, which concentrated on the uniformly log-concave setting, we here study the procedure in the absence of the uniformity assumption. More precisely, we first adapt an idea of \cite{art:DalalyanRiouKaragulyan} by adding a penalization term to the potential to recover the uniformly convex setting. Such an approach leads to an \textit{$\varepsilon$-complexity} of the order $\varepsilon^{-5} \pi(|.|^2)^{3} d$ (up to logarithmic terms). Then, in the spirit of \cite{art:gadat2020cost}, we explore the robustness of the method in a weakly convex parametric setting where the lowest eigenvalue of the Hessian of the potential $U$ is controlled by the function $U(x)^{-r}$ for $r \in (0,1)$. In this intermediary framework between the strongly convex setting ($r=0$) and the ``Laplace case'' ($r=1$), we show that, with the help of a control of the exponential moments of the Euler scheme, we can adapt the fundamental properties needed for the efficiency of the method. In the ``best'' setting, where $U$ is ${\mathcal{C}}^3$ and $U(x)^{-r}$ also controls the largest eigenvalue of the Hessian, we obtain an $\varepsilon$-complexity of the order $c_{\rho,\delta}\varepsilon^{-2-\rho} d^{1+\frac{\rho}{2}+(4-\rho+\delta) r}$ for any $\rho>0$ (but with a constant $c_{\rho,\delta}$ which increases as $\rho$ and $\delta$ go to $0$).


Introduction
In this paper, we are interested in the sampling of probability distributions known as Gibbs measures, whose density is π(dx) = (1/Z) e^{−2U(x)/σ²} λ(dx), where λ is the Lebesgue measure, Z = ∫_{R^d} e^{−2U(x)/σ²} λ(dx) and U : R^d → R is a coercive function. Many applications require the computation of these measures on high-dimensional state spaces, including for example machine learning, Bayesian estimation and statistical physics. The methods studied in this paper are based on the discretization of the over-damped Langevin stochastic differential equation (SDE)

dX_t = −∇U(X_t) dt + σ dB_t,    (1)

where (B_t)_{t≥0} is a d-dimensional Brownian motion and σ ∈ R₊*. These methods have received a lot of attention in the last few years, in particular when U is strongly convex (in the sense that, on the whole space, the smallest eigenvalue of its Hessian is lower-bounded by a positive α). This assumption can certainly be constraining in view of applications. This is the reason why, in this paper, we suppose that U is not strongly convex but only weakly convex. More precisely, we will assume that the potential U is a convex, twice differentiable function with Lipschitz gradient. Under these assumptions, strong existence and uniqueness of a solution (X_t)_{t≥0} classically hold, and the solution to (1) is an ergodic Markov process whose invariant distribution is exactly the Gibbs distribution π ∝ e^{−2U/σ²} dλ (for background, see e.g. [MT93], [KS91], [Kha12], [Hai10]).
We respectively denote by (P_t)_{t≥0} and L the related semi-group and infinitesimal generator. We recall that for a twice differentiable function f : R^d → R,

Lf = −⟨∇U, ∇f⟩ + (σ²/2) ∆f.

It is also well known that in this log-concave setting, the distribution π satisfies the Poincaré inequality (see e.g. [BBCG08]) and that convergence to equilibrium holds in distribution and in "pathwise average": for any starting point x ∈ R^d, the occupation measure converges to π in the following sense: for every continuous function f ∈ L²(π),

(1/t) ∫₀ᵗ f(X_s) ds → π(f)  a.s. as t → +∞.    (2)

In the continuity of [EP21], our multilevel methods will be based on discretized adaptations of (2). More precisely, we first choose to approximate the stochastic process (X_t)_{t≥0} by the classical Euler-Maruyama scheme. When the related step size γ is constant, this discretization scheme is defined by X̄₀ = x ∈ R^d and

X̄_{(n+1)γ} = X̄_{nγ} − γ ∇U(X̄_{nγ}) + σ√γ Z_{n+1},

where (Z_n)_{n∈N} is an i.i.d. sequence of d-dimensional standard Gaussian random variables. In the long-time setting, these schemes and their convergence properties to equilibrium were first studied in the nineties by [Tal90] and [RT96]. Then, some decreasing-step Euler schemes were investigated by [LP03] (see also [Lem05]) in order to manage, at the same time, the discretization and long-time errors. Here, we choose to keep the constant step size point of view in order to avoid some additional technicalities, but our ideas could probably be adapted to this setting.
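To fix ideas, the constant-step Euler-Maruyama scheme described above can be sketched as follows (an illustration only; the function name and interface are ours, not the paper's):

```python
import numpy as np

def euler_maruyama(grad_U, x0, gamma, n_steps, sigma=np.sqrt(2.0), rng=None):
    """Constant-step Euler-Maruyama scheme for dX = -grad U(X) dt + sigma dB."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)                     # Z_{n+1} ~ N(0, I_d)
        x = x - gamma * grad_U(x) + sigma * np.sqrt(gamma) * z
        path.append(x.copy())
    return np.array(path)
```

With σ = √2, the invariant measure of the limiting diffusion is π ∝ e^{−U}, so for U(x) = |x|²/2 the long-run samples should be approximately standard Gaussian.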
This "pseudo-diffusion" form is usually convenient for proofs, but it is worth noting that the procedure is only based on the discrete-time Euler scheme. If no confusion arises, we will sometimes write t instead of t_γ, and X̄_t or X̄^γ_t instead of X̄^{γ,x₀}_t, in order to alleviate the notations. We now mimic (2) with the Euler scheme to approximate the target measure π. Thus, consider the following occupation measure (for background see [Tal90]), for N ∈ N*:

ν_N(f) = (1/N) Σ_{k=1}^{N} f(X̄_{kγ}).
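The occupation-measure estimator can then be sketched as follows (a hedged illustration: the exact weighting, starting index and warm-start convention used in the paper may differ):

```python
import numpy as np

def occupation_average(grad_U, f, x0, gamma, N, tau=0, sigma=np.sqrt(2.0), rng=None):
    """Approximate pi(f) by the ergodic average of f along one Euler path,
    discarding the first `tau` iterations as a warm-up."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    total, count = 0.0, 0
    for k in range(1, N + 1):
        x = x - gamma * grad_U(x) + sigma * np.sqrt(gamma) * rng.standard_normal(x.shape)
        if k > tau:
            total += f(x)
            count += 1
    return total / count
```

A single long trajectory thus replaces the many independent samples of a classical Monte-Carlo estimator.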

Multilevel methods
Multilevel methods were introduced by M. Giles in [Gil08]. These methods, initially used for the approximation of E[f(X_T)], are now widely exploited in many settings. The rough idea is the following: assume that the target E[X] is the expectation of a random variable that cannot be sampled (at a reasonable cost) and consider a family of random variables (X_j)_j approximating X, with a cost of simulation and a precision which typically increase with j. The principle of the multilevel approach is to stack correcting layers with low variance onto a coarse approximation X₀ of the target. More precisely, writing

E[X_J] = E[X₀] + Σ_{j=1}^{J} E[X_j − X_{j−1}],    (4)

the multilevel method consists in building a procedure based on the addition of Monte-Carlo approximations of E[X₀] and of E[X_j − X_{j−1}], j = 1, . . ., J. Then, if the random variables X_j − X_{j−1} have low variance, the approximation of E[X_j − X_{j−1}] requires few simulations and, in view of (4), we can obtain a procedure which has the bias related to X_J but with a cost which may be much lower than the one generated by a standard Monte-Carlo method applied to estimate E[X_J].
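The telescoping decomposition (4) translates into the following generic estimator sketch (the coupling mechanism is left abstract; names and interfaces are ours):

```python
import numpy as np

def mlmc_estimate(sample_level, n_samples):
    """Generic multilevel estimator: E[X_J] ~ MC(X_0) + sum_j MC(X_j - X_{j-1}).

    `sample_level(j, rng)` must return a coupled pair (X_j, X_{j-1}) built from
    the same randomness (the second component is ignored at level 0).
    `n_samples[j]` is the Monte-Carlo sample size at level j."""
    rng = np.random.default_rng(0)   # internal seed, for a reproducible sketch
    est = 0.0
    for j, n in enumerate(n_samples):
        draws = [sample_level(j, rng) for _ in range(n)]
        if j == 0:
            est += np.mean([fine for fine, _ in draws])
        else:
            est += np.mean([fine - coarse for fine, coarse in draws])
    return est
```

Because the corrections X_j − X_{j−1} have small variance, the sample sizes n_samples[j] can decrease sharply with j, which is the source of the cost reduction.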
In the discretization setting, the family of random variables (X_j)_j is a sequence of Euler schemes (X̄^{γ_j})_j where (γ_j)_j is a family of decreasing time steps. Following the heuristic (4), the (independent) correcting layers are built by coupling Euler schemes with steps γ_{j−1} and γ_j. Note that in view of the simulation of the (synchronous) coupling, we need γ_{j−1} to be a multiple of γ_j (in this paper, we will assume that γ_j = γ₀ 2^{−j}).
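A synchronous coupling of two Euler schemes with γ_{j−1} = 2γ_j can be sketched as follows: the coarse scheme consumes the sum of two consecutive fine Brownian increments, so both paths are driven by the same Brownian motion (an illustration under our own naming conventions):

```python
import numpy as np

def coupled_euler_pair(grad_U, x0, gamma_fine, n_fine, sigma=np.sqrt(2.0), rng=None):
    """One synchronously coupled (fine, coarse) Euler pair with
    gamma_coarse = 2 * gamma_fine, sharing the same Brownian increments."""
    rng = np.random.default_rng() if rng is None else rng
    xf = np.array(x0, dtype=float)   # fine scheme, step gamma_fine
    xc = np.array(x0, dtype=float)   # coarse scheme, step 2 * gamma_fine
    for _ in range(n_fine // 2):
        dW1 = np.sqrt(gamma_fine) * rng.standard_normal(xf.shape)
        dW2 = np.sqrt(gamma_fine) * rng.standard_normal(xf.shape)
        xf = xf - gamma_fine * grad_U(xf) + sigma * dW1
        xf = xf - gamma_fine * grad_U(xf) + sigma * dW2
        xc = xc - 2 * gamma_fine * grad_U(xc) + sigma * (dW1 + dW2)
    return xf, xc
```

For a contractive drift, the two paths remain close, which is exactly what keeps the variance of the correcting layers small.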
Multilevel methods have already been studied in the literature for the approximation of the invariant distribution of the Langevin diffusion. In [GMS+20], the authors take advantage of the convergence in distribution to equilibrium. Thus, the classical Monte-Carlo point of view is adopted: the approximation of π(f) is obtained by sampling a large number of Euler schemes for each level. In [PP18] and [EP21], the point of view is to take advantage of the convergence of the occupation measure. Thus, each level is based on only one path of the Euler scheme, or of the couple of Euler schemes, whose length decreases (since the variance of the correcting layers decreases) and whose discretization step increases. All these papers show that, in the uniformly strongly convex setting, the invariant distribution can be approximated (along Lipschitz continuous functions) with a precision ε (in an L²-sense) using a multilevel procedure whose complexity is of order ε^{−2} or ε^{−2} log^p(ε) with p ∈ [1, 3]. Moreover, in [EP21], particular attention is paid to the dependency in the dimension. In this case, it is shown that one can build a multilevel procedure that produces an ε-approximation of the target for a complexity cost proportional to dε^{−2} (with an explicit expression of the dependence in the Lipschitz constant L and the contraction parameter α).
The more involved weakly convex case seems to be less explored in the multilevel paradigm but, in view of applications (for instance the Bayesian Lasso), it is natural to ask about the robustness of these methods when one relaxes the contraction assumption.

Contributions and plan of the paper
The main goal of this paper is thus to extend the multilevel Langevin algorithm for Gibbs sampling to the weakly convex setting and, if possible, to obtain some quantitative bounds on the complexity related to the computation of an ε-approximation of the target (see Section 1.4 for a definition of ε-approximation).
We first investigate the penalized multilevel method: in the continuity of [DK19] and [DKRD22], we build a multilevel procedure based on the following observation: consider a new equation with another potential U_α(x) := U(x) + (α/2)|x|². This new equation has an invariant distribution, denoted π_α, which converges to π in Wasserstein metric when α tends to 0. The idea is that this new invariant distribution is easier to sample from, because of the uniform convexity of the potential U_α. In Section 2.1, Theorem 2.1 combines the benefits of the penalized approach and of the multilevel methods. For a Lipschitz-continuous function f : R^d → R and a C²-potential U, the multilevel procedure performs an ε-approximation of π(f) with a complexity cost proportional to π(|.|²)³ d ε^{−5}. As in [DKRD22], our result depends on the generally unknown constant π(|.|²), which is at least proportional to d (see Remark 2.1 for details and comparisons with [DKRD22]).
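The key point of the penalization, namely that U_α is α-strongly convex as soon as U is convex, can be checked directly: adding (α/2)|x|² shifts every eigenvalue of the Hessian up by α. A small numerical illustration (`hess_U` is a user-supplied Hessian, not an object from the paper):

```python
import numpy as np

def hessian_penalized(hess_U, x, alpha):
    """Hessian of U_alpha(x) = U(x) + (alpha/2)|x|^2.

    Since D^2 U_alpha = D^2 U + alpha * I_d, every eigenvalue of the
    penalized Hessian is the corresponding eigenvalue of D^2 U plus alpha."""
    d = x.shape[0]
    return hess_U(x) + alpha * np.eye(d)
```

In particular, even if D²U has a zero eigenvalue (convex but not strongly convex), the smallest eigenvalue of D²U_α is at least α, which is what makes the Bakry-Émery criterion applicable to π_α.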
In view of the above remarks, we chose in a second part to develop some tools which tackle the weakly convex setting from a dynamic point of view and which can improve the complexity in terms of ε. More precisely, in the spirit of [GPP20], we study an intermediary framework (called the weakly parametric convex setting in the sequel). We assume that the eigenvalues of the Hessian matrix of U vanish as |x| goes to +∞, but with a rate controlled by the function x ↦ U^{−r}(x) with r ∈ [0, 1) (see Assumption (H¹_r)). The parameter r characterizes the "lack of strong convexity", the case r = 1 referring to the "Laplace case" whereas r = 0 corresponds to the uniformly convex setting. Under such an assumption, one can get some bounds on the exponential moments of the Euler scheme (at the price of technicalities). One is also able to preserve some confluence properties, i.e. two paths of the Euler scheme have a tendency to stick together at infinity. Finally, in this setting, it is also possible to control the distance between diffusion paths and the related Euler schemes. These three ingredients (obtained with a lower quality than in the strongly convex setting) allow us to tackle the multilevel procedure in this framework.
The related main contribution is Theorem 2.2. In this result, we provide a series of statements under different sets of assumptions: when U is only C² or when U is C³, and under (H¹_r) only or under (H¹_r) and (H²_r), where (H²_r) denotes an additional assumption which requires the highest eigenvalue to be also controlled by the function x ↦ U^{−r}(x) (we could roughly say that under (H¹_r) and (H²_r), the potential is uniformly weakly convex, in the sense that "the decrease of the contraction is uniform"). In each statement, we provide a multilevel procedure adapted to the assumptions. The related complexity is exhibited in terms of d and ε, but also in terms of the contraction parameter and the Lipschitz constant. Without going too much into the details, when U is only C², the complexity is of the order ε^{−3}, whereas when U is C³, we can obtain a rate of the order ε^{−2−ρ} for any ρ > 0 and thus approach the "optimal" complexity ε^{−2}. In terms of the dimension, when U is C³ and (H¹_r) and (H²_r) hold, the complexity is of the order ε^{−2−ρ} d^{1+ρ/2+4r} for any ρ > 0. With respect to the paper [GPP20], our multilevel procedure improves the dependence in ε and is broadly comparable in terms of the dimension. Note that when only (H¹_r) holds, the dependency in the dimension dramatically increases with r, whereas, when the potential is uniformly weakly convex, the dependence in the dimension does not explode when r → 1 (see Theorem 2.2 for more details).
Plan of the paper. As detailed in the previous paragraphs, Sections 2.1 and 2.2 are respectively devoted to the statement of the main theorems for the penalized multilevel method and for the parametric weakly convex case. Then, Section 3 is dedicated to the proof of the first main theorem (Theorem 2.1). In Proposition 3.2, we obtain a Wasserstein bound on the bias induced by the penalization on the invariant distribution. The proof of Theorem 2.1 is then an adaptation of [EP21, Theo 2.2]. From Section 4 on, we focus on the proof of Theorem 2.2. In Section 4, we prove some preliminary results on the diffusion and its Euler scheme under (H¹_r): we begin with some controls of the exponential moments (Proposition 4.1 and Proposition 4.2) which in turn lead to some bounds on the polynomial moments (Proposition 4.3). In this section, we also show that the discretization error can be controlled in long time (Proposition 4.4) and finally obtain an integrable rate of convergence to equilibrium for the Euler scheme (Proposition 4.5). With the help of these fundamental bricks, in Section 5, we obtain some bounds on the bias (Proposition 4.7) and on the variance of the procedure (Proposition 5.2), which in turn allow us to finally prove Theorem 2.2.

Design of the algorithm
We now build the multilevel procedure. Let x ∈ R^d be the initialization of the procedure, J ∈ N be a number of levels, (γ_j)_{0≤j≤J} be a sequence of time steps and (T_j)_{0≤j≤J} be a sequence of final times. Denote by Y(J, (γ_j)_j, τ, (T_j)_j, x, ·) the multilevel occupation measure defined, for all f : R^d → R, by Y(J, (γ_j)_j, τ, (T_j)_j, x, α, f), where the trajectories on each level are coupled through the same Brownian motion. To ease notation, we abbreviate Y(J, (γ_j)_j, τ, (T_j)_j, x, f) to Y(f). The parameter τ ≥ 0 is the time at which we begin the average. Indeed, this "warm-start trick" may improve the precision of the estimation in some cases (we refer to [EP21] for more details), but it can be set to 0 when the gain is not significant, for example in the second part of our main result.
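Putting the pieces together, the multilevel occupation measure Y(f) can be sketched as follows. This is only an illustration of the structure (level 0 is a plain ergodic average, each level j ≥ 1 averages f(fine) − f(coarse) along a synchronously coupled pair after a warm-up time τ); the exact weights and time grids of the paper may differ:

```python
import numpy as np

SIGMA = np.sqrt(2.0)

def _euler_step(x, grad_U, gamma, dW):
    # One Euler-Maruyama step of dX = -grad U(X) dt + SIGMA dB
    return x - gamma * grad_U(x) + SIGMA * dW

def multilevel_occupation(grad_U, f, x0, gamma0, T_list, tau=0.0, rng=None):
    """Sketch of the multilevel occupation measure Y(f) with gamma_j = gamma0 * 2**-j
    and trajectory length T_list[j] at level j."""
    rng = np.random.default_rng() if rng is None else rng
    est = 0.0
    for j, T in enumerate(T_list):
        gj = gamma0 * 2.0 ** (-j)
        if j == 0:
            x = np.array(x0, dtype=float)
            acc, cnt = 0.0, 0
            for k in range(1, int(T / gj) + 1):
                dW = np.sqrt(gj) * rng.standard_normal(x.shape)
                x = _euler_step(x, grad_U, gj, dW)
                if k * gj > tau:
                    acc += f(x); cnt += 1
        else:
            xf = np.array(x0, dtype=float)
            xc = np.array(x0, dtype=float)
            acc, cnt = 0.0, 0
            for m in range(1, int(T / (2 * gj)) + 1):
                dW1 = np.sqrt(gj) * rng.standard_normal(xf.shape)
                dW2 = np.sqrt(gj) * rng.standard_normal(xf.shape)
                xf = _euler_step(xf, grad_U, gj, dW1)
                xf = _euler_step(xf, grad_U, gj, dW2)
                xc = _euler_step(xc, grad_U, 2 * gj, dW1 + dW2)  # same Brownian motion
                if m * 2 * gj > tau:
                    acc += f(xf) - f(xc); cnt += 1
        est += acc / max(cnt, 1)
    return est
```

Each level uses a fresh (independent) Brownian path, and the correcting levels can be much shorter than level 0 since their variance is small.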

Notations
The usual scalar product on R^d and the induced Euclidean norm are respectively denoted by ⟨·, ·⟩ and |·|. A function f : R^d → R is said to be C^k if all its partial derivatives are well-defined and continuous up to order k. The gradient and Hessian matrix of f are respectively denoted by ∇f and D²f. The probability space is denoted by (Ω, F, P). The Laplace operator is denoted by ∆: ∆f = Σ_{i=1}^d ∂²_{i,i} f. The L^p-norm on (Ω, F, P) is denoted by ‖·‖_p. For two probability measures µ and ν, we define the Wasserstein distance of order p by W_p(µ, ν) = (inf_{π∈Π(µ,ν)} ∫ |x − y|^p π(dx, dy))^{1/p}, where Π(µ, ν) is the set of couplings of µ and ν.
• ≲_P and ≲_uc: For two positive real numbers a and b and a set of parameters P, one writes a ≲_P b if a ≤ C_P b, where C_P is a positive constant which only depends on the parameters P. When a ≤ Cb where C is a universal constant, we write a ≲_uc b.
• ε-approximation: We say that Y is an ε-approximation of a (for the L²-norm) if ‖Y − a‖₂ ≤ ε. Equivalently, Y is said to be an ε-approximation of a if the related Mean-Squared Error (MSE) is lower than ε².
• Complexity/ε-complexity: For a random variable Y built with some iterations of a standard Euler scheme, we denote by C(Y) the number of iterations of the Euler scheme which is needed to compute Y. For instance, C(X̄^γ_{nγ}) = n. We call ε-complexity of an algorithm the complexity of the procedure which produces an ε-approximation.

The penalized approach
In this section, we develop a penalized multilevel method to sample a non-strongly log-concave probability distribution π. The idea is based on [DKRD22] and [DK20], where the authors consider the potential U_α(x) := U(x) + (α/2)|x|² with α > 0, which is called the penalized version of U. We here assume that U satisfies the following assumption:

WC_L: U is a non-zero C²-function and there exists L > 0 such that 0 ≤ D²U(x) ≤ L I_d for all x ∈ R^d, the inequalities being taken in the sense of symmetric matrices.

Denote by π_α the invariant measure of the diffusion process (X^α_t)_{t≥0}, solution of the stochastic differential equation

dX^α_t = −∇U_α(X^α_t) dt + σ dB_t.

It appears that π_α satisfies the Bakry-Émery criterion, so we can apply our multilevel method, which requires strong convexity, to approximate π_α. But since our target is π, we have to control the distance (in a Wasserstein sense) between π and π_α. To this end, the results of [BV05] and [DKRD22] ensure the convergence of π_α when α goes to 0, with a bound in terms of the Kullback-Leibler divergence. This leads to the following theorem:

Theorem 2.1. Assume that WC_L holds. Let ε > 0, let f : R^d → R be a Lipschitz function and x ∈ R^d. For ε ≥ 0, let ... with a complexity satisfying ..., and this implies that the complexity is of the order d⁴ε^{−5}.
Let us now compare with [DKRD22]: note that in that paper, the cost is not explicitly written. As usual in the Monte-Carlo literature, the authors control the number of iterations of the Euler scheme which is necessary to draw a random variable whose distance to the target is lower than ε, instead of giving the real cost. In the Langevin Monte-Carlo case, when U is C², they obtain a number of iterations which is, up to logarithmic terms, of order π(|.|²)dε^{−4}. Normalizing ε (i.e. replacing ε by ε/√π(|.|²)) leads to a number of iterations of order π(|.|²)³dε^{−4}. But to compare with our work, we need to include the Monte-Carlo cost, i.e. the number of simulations which is necessary to make the variance lower than ε². Then, we have to multiply the previous number of iterations by Var_π(f)ε^{−2}, which can be reasonably bounded by π(|.|²)ε^{−2} (when f is Lipschitz). This means that the complexity of the penalized Langevin Monte-Carlo in [DKRD22] is of the order π(|.|²)⁴dε^{−6}. In consequence, the multilevel method allows us to improve the result of [DKRD22]. Note that the authors also provide other algorithms such as the Kinetic-LMC, where the bound in ε is improved (it seems that our result meets the complexity given for this algorithm).
About decreasing penalization. In the above result, we propose a multilevel strategy based on a fixed penalization. A natural question arises: could we take advantage of the multilevel strategy by keeping the same penalization for the highest level and progressively reducing it on the lower layers? Indeed, this is precisely what we do with the discretization bias, so we may wonder about the effect of such a strategy for the penalization. To this end, let us introduce a decreasing sequence (α_j)_{0≤j≤J} of penalization levels such that the bias induced by the distance between π_{α_J} and π is small with respect to the required precision ε.
More specifically, we want to replace (4) with the following telescoping series. However, the above decomposition requires several long-time bounds on the underlying dynamics in order to be an efficient multilevel procedure. In particular, for the control of the variance generated by each level, we need to control the pathwise distance between two paths of the dynamics related to penalization levels α_j and α_{j−1}. But, in contrast to the constant penalization case, where we can obtain some confluence properties, we can observe that for two trajectories computed with two different degrees of penalization α and α′, there is a "lack of confluence" quantified by the following inequality, where the bound dramatically depends on α. In this inequality we deliberately treated the continuous case to ease the readability, but an analogous discrete result can be shown. We refer to Section 3 for the proof of this result. This inequality is certainly related to the shape of the penalization sequence (α_j)_{0≤j≤J}.
In fact, this result must be considered in addition to the error induced by the difference between two Euler schemes with different time steps. Up to a universal constant, we have (see [EP21, Prop 5.1]) ... Then, so that the additional variance generated by the decrease of the penalization does not have an impact on the results, we have to impose that γ d σ²/α_j² ≳ (α_j − α_{j−1}) d σ²/α_{j−1}. In particular, the sequence (α_j)_{0≤j≤J} cannot decrease too fast. Going further into the computations, it seems that we cannot expect a significant gain with this approach.

Parametric weakly convex setting
The purpose of this section is to study the non-penalized multilevel procedure in the weakly convex setting. Instead of penalizing the dynamics, it is indeed natural to ask about the robustness of the "standard" multilevel method in this case. To answer this question, we have to prove a series of properties in the spirit of the assumptions H_i (i ∈ {1, 2, 3, 4}) in [EP21]. These assumptions include the convergence to equilibrium of the Euler scheme with a quantitative rate, the long-time control of the L²-distance between the Euler scheme and the diffusion, the control of the Wasserstein distance between π_γ and π, and the control of the moments. Some of these properties (especially the long-time control of the L²-distance) seem hard to check in a general convex setting. We thus propose to work in the parametric weakly convex setting used in [EP21], by introducing (H¹_r) (see below), where we assume that the contraction vanishes at ∞ but with a rate controlled by U^{−r}.
Let us now introduce our assumptions, depending on a parameter r ∈ [0, 1):

(H¹_r): U is positive and there exist L and c > 0 such that ∇U is L-Lipschitz and, for all x ∈ R^d, λ_{D²U}(x) ≥ c U^{−r}(x), where λ_{D²U}(x) denotes the lowest eigenvalue of D²U(x).

The lower bound can be seen as the "lack of uniform strong convexity" of the potential. Indeed, if r = 0 we recover strong convexity, while r = 1 corresponds to the weakest convexity case, where the gradient is flat at infinity. The fact that ∇U is L-Lipschitz implies that x ↦ λ̄_{D²U}(x) is upper-bounded by L. In order to improve the dependence in the dimension, we also introduce an additional assumption that deals with the case where the largest eigenvalue decreases at infinity with an intensity that is of the same order as the lowest eigenvalue:

(H²_r): There exists a positive c̄ such that for all x ∈ R^d, λ̄_{D²U}(x) ≤ c̄ U^{−r}(x).
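As a sanity check of (H¹_r) and (H²_r), one can test them on a radial power-type potential. We take U(x) = (1 + |x|²)^{1/(1+r)} as a plausible candidate (our guess at the usual example; the exponent is chosen so that the Hessian eigenvalues scale like U^{−r} at infinity) and compute its exact Hessian eigenvalues:

```python
import numpy as np

def example_potential(r):
    """Candidate potential U(x) = (1 + |x|^2)^(1/(1+r)) (hypothetical example).

    Writing U = g(|x|^2), D^2 U = 2 g'(s) I + 4 g''(s) x x^T with s = |x|^2,
    so the eigenvalues are 2 g'(s) (tangential, multiplicity d-1) and
    2 g'(s) + 4 g''(s) s (radial)."""
    p = 1.0 / (1.0 + r)
    def U(x):
        return (1.0 + np.dot(x, x)) ** p
    def hess_eigs(x):
        s = np.dot(x, x)
        g1 = p * (1.0 + s) ** (p - 1.0)               # g'(s)
        g2 = p * (p - 1.0) * (1.0 + s) ** (p - 2.0)   # g''(s), negative for p < 1
        radial = 2.0 * g1 + 4.0 * g2 * s
        tangential = 2.0 * g1
        return radial, tangential
    return U, hess_eigs
```

For r = 1/2 (so p = 2/3), a direct computation shows that both eigenvalues stay between (4/9) U^{−1/2} and (4/3) U^{−1/2}, i.e. both (H¹_r) and (H²_r) hold with c = 4/9 and c̄ = 4/3.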
For instance, it can be checked that power-type potentials of the form U(x) = (1 + |x|²)^p, for a suitable exponent p depending on r, satisfy these assumptions. Let us finally define the couple (γ⋆, Ψ), where c_r is a constant which only depends on r: γ⋆ will denote the largest admissible value for γ₀, whereas Ψ controls the moments of U(X̄_{nγ}) (see Proposition 4.3 for details). It is worth noting that, on the one hand, γ⋆ does not depend on d and, on the other hand, that Ψ ∝ d^{1/(1−r)} when only (H¹_r) holds and Ψ ∝ d when (H¹_r) and (H²_r) hold true. This means that in the first case, the dependence in d dramatically increases with r, whereas in the second case, it does not depend on r. In the next result, the reader will have to keep in mind that the definition of these parameters depends on the assumptions. In particular, even if (H²_r) does not appear in the statement, it is hidden in the values of the parameters γ⋆ and Ψ.
We are now ready to state our main theorem in this setting:

Theorem 2.2. Assume (H¹_r) and let x ∈ R^d be such that U(x) ≲_r Ψ, γ₀ ∈ (0, γ⋆], δ ∈ (0, 1/4], and let f be a Lipschitz-continuous function. For an integer J ≥ 1, set γ_j = γ₀ 2^{−j} for all j ∈ {1, . . ., J}. Then, for δ small enough, Y(J, (γ_j)_j, τ, (T_j)_j, x, f) is an ε-approximation of π(f), with the complexity cost stated below.

Remark 2.2. This technical result deserves several comments:

⊲ Complexity in terms of ε. If we only consider the dependence in ε, we obtain ε^{−3} when U is only C² and ε^{−2−ρ} for any ρ > 0 when U is C³ and an additional (but reasonable; see Remark 4.8) assumption on ∆(∇U) is satisfied. We can thus theoretically approach the complexity ε^{−2}. However, it is worth noting that the non-explicit constants depending on ρ and δ go to ∞ (independently of the other parameters) when ρ and δ go to 0. The fact that we "only" obtain a complexity in ε^{−3} when U is C² is due to the fact that, in this case, our bound on the 1-Wasserstein distance between π_γ and π is of the order √γ. When U is C³, the bound on the 1-Wasserstein distance between π_γ and π is of order γ^{1−δ}. This allows us to clearly improve the complexity, but it can be noted that we do not retrieve the ε^{−2}-bound of the uniformly convex case. This is due to the rate of convergence to equilibrium: our rate is polynomial and not exponential, which in turn implies a "slight cost" in the dependence in ε. In fact, we could get some (sub)-exponential rates, but without controlling the dependence in the dimension, which is of first importance for applications.
⊲ Complexity in terms of the dimension. The dependence in the dimension strongly varies with the assumptions. In the "worst" case, where U is only C² and only (H¹_r) holds, the dependence in the dimension seriously worsens when r is close to 1. We retrieve this same phenomenon when U is C³ and only (H¹_r) holds, but with a better bound, for any positive ρ and δ. This bad behavior when r goes to 1 is due to the fact that, when only (H¹_r) holds, the bounds on the exponential moments sup_{t≥0} E[e^{U(X̄_t)}] are of the order exp(d^{1/(1−r)}). Adding (H²_r) dramatically improves this exponential bound since, in this case, we are able to prove that it is of the order e^d (this implies that sup_{t≥0} E[U^p(X̄_t)] is of the order d^p; see Propositions 4.3 and 4.2 for details). It is worth noting that in this case, the dependence in the dimension does not explode when r goes to 1, being of the order d^{1+ρ/2+(4−ρ+δ)r} for any ρ > 0. Remark that when r = 0, we formally approach the rate dε^{−2} of the uniformly convex case obtained in [EP21].
⊲ Comparison with the literature. In this setting, the only paper with which we may reasonably compare is [GPP20], since we use similar assumptions. Compared with this paper, our multilevel procedure certainly improves the dependence in ε, replacing ε^{−4} by ε^{−3} when U is C² and ε^{−3} by ε^{−2−ρ} when U is C³. In terms of the dimension, our approach slightly increases the dependence. For instance, when (H¹_r) and (H²_r) hold, [GPP20] obtain a bound in d^{1+4r} when U is C² or C³. We here retrieve a dependence which is somewhat similar when U is C³, but when U is C², our bound in d^{3/2+(9/2+δ)r} is clearly worse.
⊲ About the parameters. In applications, the dependence in the parameters L, c and c̄ may be of importance (think for instance of applications to Bayesian estimation, where these parameters can strongly depend on the number of observations). This is why we chose to keep all these dependencies in the main result, even if it sometimes adds technicalities in the proofs.

Proof of Theorem 2.1
This section is devoted to the proof of the first main result. We first quantify the bias induced by the approximation of π by π_α. To this end, we use the Talagrand concentration inequality, which controls the Wasserstein distance between these two measures by their Kullback-Leibler divergence.
Proposition 3.1. Assume that WC_L holds. Then for all α ≥ 0, there is a universal constant C such that ...

We refer to [BV05, Cor 2.4] for a proof of this result. In addition, in [DKRD22] the authors show that C ≤ 2E_π[|X|²] (page 24). It remains to compute the Kullback-Leibler divergence of π from π_α in order to bound the bias induced by the penalization.

Proposition 3.2. Assume that WC_L holds. Then for all α ≥ 0, ...

Proof. The Kullback-Leibler divergence is defined by ... By definition of π and π_α, ... Using the inequality e^{−x} ≤ 1 − x + x²/2 for x ≥ 0, this leads to ..., and by the inequality log(1 − x) ≤ −x for x ≤ 1 we get ... Proposition 3.1 applied to π_α implies the result.

Now we switch to the proof of the main theorem. With the two previous propositions, we control the bias induced by the penalization; it then remains to compute the error and the complexity of a multilevel procedure in a uniformly convex setting. To this end, we use [EP21, Theorem 2.2], which gives parameters performing an ε-approximation of the invariant distribution with an explicit complexity in terms of the parameters, especially in terms of the contraction parameter. Here, this is exactly our penalization parameter α, and we will thus optimize its choice in the proof.
Proof of Theorem 2.1. Let ε be a positive number and f : R^d → R be a Lipschitz continuous function. By the bias/variance decomposition, the triangle inequality and the Monge-Kantorovich duality, we have ... The second term, denoted by P₂, is the mean squared error of a multilevel procedure for the approximation of π_α. This penalized measure is invariant for the diffusion process defined with the potential U_α. By assumption WC_L, U_α satisfies the following property ... With the following parameters ..., we have ... with a complexity satisfying ... It remains to calibrate the penalization parameter α. Proposition 3.2 implies ... Plugging this α into (15) and (16) yields the result.

Precisions about the "decreasing penalization": For α > α′ > 0 and x, y ∈ R^d, consider the couple (X^{x,α}_t, X^{x,α′}_t)_{t≥0} defined by ...

Proof. By the Itô formula, we have ... We now use the fact that for all x ∈ R: ... By the strong convexity property of U_α, and since, up to a universal constant, the moment of order two of the diffusion process under the strong convexity hypothesis is bounded by σ²d/α (see [EP21, Lem 5.1]), we get ... The result follows.
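As a reading aid, the bias/variance decomposition invoked at the beginning of the proof can be reconstructed as follows (a standard argument; the exact constants used in the paper may differ):

```latex
\mathbb{E}\big[(\mathcal{Y}(f)-\pi(f))^2\big]
 \;\le\; 2\big(\pi(f)-\pi_\alpha(f)\big)^2
 \;+\; 2\,\mathbb{E}\big[(\mathcal{Y}(f)-\pi_\alpha(f))^2\big],
\qquad
\big|\pi(f)-\pi_\alpha(f)\big| \;\le\; [f]_{\mathrm{Lip}}\,\mathcal{W}_1(\pi,\pi_\alpha),
```

where the second bound is the Monge-Kantorovich duality for Lipschitz f. The first term is the penalization bias, controlled by Propositions 3.1 and 3.2, while the second is the mean squared error P₂ of the multilevel procedure for the strongly log-concave target π_α.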
4 Preliminary bounds under (H¹_r) and (H²_r). From now on, we switch to the proof of the second part of the main results, i.e. we consider the weakly convex case under the parametric assumptions (H¹_r) and (H²_r). As mentioned before, these hypotheses deal with the behavior of the lowest and highest eigenvalues of the Hessian matrix of U. In some sense, (H¹_r) quantifies the strict convexity of the potential, which in turn implies the contraction of the dynamics. Note that such an assumption also appears in [CFG22], where the authors obtain exponential rates to equilibrium under this parametric assumption.
In this preliminary section, we state a series of results related to the diffusion and its Euler scheme under Assumption (H¹_r). For the upper bounds on the eigenvalues of D²U, we distinguish two cases: the first one where we assume a uniform upper bound by L (in other words, that ∇U is L-Lipschitz), and the second one where we add Assumption (H²_r), i.e. the largest eigenvalues also decrease at infinity with a rate which is comparable to that of the lowest eigenvalues. In fact, in the second case, we will see that we are able to preserve a dependency of the moments in the dimension which is linear, whereas, without this assumption, the dependency is of order d^{1/(1−r)}. In the second part, we state a result about the long-time pathwise control of the distance between the diffusion and its Euler scheme. Third, we study the convergence to equilibrium of the Euler scheme. Finally, we quantify the bias induced by the discretization with some results on the 1-Wasserstein distance between π and π_γ (the invariant measure of the Euler scheme).

Bounds on the exponential moment
In order to study the confluence between the continuous-time process and its Euler scheme, let us start this section with a control of the moments of the continuous-time and discrete-time processes when the potential U is assumed to be convex. We first state a result on the control of the exponential moments of the continuous-time process.

Proposition 4.1. For all x ∈ R^d and t > 0, ... under (H¹_r) and (H²_r).

We preface the proof by a technical lemma.
In particular, C_M is a compact set (since it is included in a level set of a coercive function).
Proof. Denote by y the solution of the ordinary differential equation y′(t) = −∇U(y(t)) starting from y(0) = x. Define the function f : t ↦ |∇U(y(t))|²; by the chain rule, for all t ∈ R₊, ... Since lim_{t→+∞} y(t) = x⋆, we get by integration ... Therefore, ... Since ∆U = Tr(D²U) ≤ d λ̄_{D²U} ≤ dL (where λ̄_A stands for the largest eigenvalue of a symmetric matrix A) and U(x⋆) = 1, it follows that ... so that ... If we now consider the case where (H²_r) also holds, we use that ∆U ≤ c̄ U^{−r}(x) to obtain: ... To ensure that the right-hand member is lower-bounded by M, it is enough to ensure that ... This concludes the proof.
Proof. (of Proposition 4.1) Let θ ∈ (0, 1) (to be chosen later) and, for all x ∈ R d , define We show that f θ is a Lyapunov function for the dynamics L : where C M is defined in Lemma 4.1. In the proof of this lemma we showed that C M is included in a level set Finally, f θ is a Lyapunov function for the dynamics: i.e., for all x ∈ R d , Hence, by a Gronwall argument we get, which under (H 1 r ) and (H 2 r ) leads to the result. We now state an analogous result for the Euler scheme.
(ii) Assume (H 1 r ) and (H 2 r ). If γ ∈ (0, (1−r)/(4c ∨ L)] and θ ∈ [0, 1/(8σ 2 ) ∧ 1], then for all x ∈ R d and n ∈ N we have where c denotes a constant independent of the parameters and c r a constant which only depends on r.
The proof of this proposition is postponed to Appendix A.
Remark 4.1. The reader will find more explicit (but more technical) bounds in the proof of the second case. It is worth noting that we can preserve a condition on γ which does not depend on d (as in the strongly convex setting). This is of first importance in our multilevel setting, where the method is much more efficient if the rough layers can be implemented with step sequences of large size. The proof is very close to [GPP20] but the bounds are refined. In particular, compared to this paper, we precisely do not require that the step size decreases with d.
Thanks to the two previous results, we are now able to control the moments of the continuous-time and the discrete-time processes. where under (H 1 r ) and (H 2 r ), (ii) Let γ ∈ (0, γ ⋆ ] with γ ⋆ defined by (11). Then, where Ψ is defined by (11). (iii) In particular, and Proof. The proof, similar to [GPP20, Prop B.4], is postponed to Appendix B.
Remark 4.2. In order to avoid the distinction between cases in all the proofs, we choose to adopt only one notation for Ψ and Ψ̄, but the reader has to keep in mind that the definition of these quantities depends on whether (H 2 r ) is satisfied or not. Let us also recall that the notation ≲ r,p means that the constant only depends on r and p. These constants are certainly locally bounded: for any compact subset K of [0, 1) × [1, +∞), there exists a universal constant c such that for any (r, p) ∈ K, the underlying constant c r,p related to ≲ r,p is bounded by c. Finally, note that we chose to keep all the dependencies in the other parameters.
Remark 4.3. The control of the L 2 -distance between the diffusion and its Euler scheme is a fundamental property for the efficiency of the multilevel method. Actually, it allows us to control the variance of each level. The fact that we are able to obtain such a property in this (semi-)weakly convex setting is new.
We start with two technical lemmas.
Lemma 4.2. Assume (H 1 r ); then for all x ∈ R d we have Proof. First, one can check that for all x ∈ R d and for every eigenvalue of the Hessian we have where in the last inequality we have used Assumption (H 1 r ). By the Taylor formula, where ξ λ = λx + (1 − λ)x ⋆ . By (23) and the fact that ∇U 1+r (x ⋆ ) = 0, we get This concludes the proof.
The next lemma is a bound on the moments of the increment of the Euler scheme (with the notations γ ⋆ and Ψ introduced in Proposition 4.3).
Then for all t > 0 and k ∈ N ∗ , Proof. By the definition of the Euler scheme, we have Then, by the Minkowski inequality, where in the last line we used the L-Lipschitz continuity of ∇U and the fact that B t − B t̲ ∼ √(t − t̲) Z with Z ∼ N (0, I d ). Finally, by Lemma 4.2 and Proposition 4.3, we obtain We are now ready to prove Proposition 4.4.
Proof. (of Proposition 4.4) Let x ∈ R d and consider the following process: To control E 1 we use a Taylor expansion and obtain, where For the second term E 2 , using the inequality ⟨a, b⟩ ≤ (ξ t /2)|a| 2 + (1/(2ξ t ))|b| 2 and the fact that ∇U is L-Lipschitz, we get Thus, and a Gronwall argument leads to Taking the expectation, Fubini's theorem implies
Now let φ be a real nonnegative function. Using (H 1 r ), the convexity of x ↦ U (x) and of t ↦ t −r , and the Jensen inequality, we have Thus, by the inequality (a + b) p ≤ a p + b p for a, b ≥ 0 and p ∈ [0, 1], By the Cauchy-Schwarz inequality, Proposition 4.3(iii) and Lemma 4.3, we get where in the last line we used that dσ 2 ≤ Ψ and γL ≤ 1.
For the second term, use the Cauchy-Schwarz inequality.
With the help of Inequality (25) and Proposition 4.3, we have For the third term of this product, let κ be a positive number and use the Markov inequality: The function x ↦ x −κ being convex on (0, +∞), it follows from the Jensen inequality that Using again Inequality (25) and Proposition 4.3, Finally, we get Now let φ(s) = (t − s)/(t + 1 − s) 1−δ with a > 0, δ ∈ (0, 1) and κ = 2(1 + δ)/(1 − δ); we have As a consequence, since γL ≤ 1 and dσ 2 ≤ Ψ, we obtain Going back to (24), we deduce from (26) and from the above inequality that The result follows.

Convergence to equilibrium for the Euler scheme under (H 1 r ):
We now proceed to establish the weak error between the discrete semi-group and its invariant measure, denoted by π γ . The proof of this result is based on the control of the so-called tangent process T x t := ∇X x t .
Remark 4.4. ⊲ In order to lighten the exposition, the result is stated under the assumption that the initial condition x satisfies U (x) ≲ r Ψ, but the reader will find some bounds without this assumption in the proofs.
⊲ The function h φ,κ plays the role of a convergence rate to equilibrium. In this setting where the Hessian is not lower-bounded, we adopt a strategy which consists in separating the space into two parts. On the first one, we assume that we have some good contraction properties parametrized by the function φ, and on the other one, we simply try to control the probability that such a good contraction does not occur. This leads to a balance between two terms depending on φ and κ. In the following, we will choose φ and κ so that h φ,κ is summable with the smallest impact on the dependence in the dimension.
Note that in [CFG22], some exponential rates are exhibited under similar assumptions in the continuous case (with the help of concentration inequalities). However, this exponential rate depends on some constants whose control seems to be difficult to obtain (typically, when the starting distribution is absolutely continuous with respect to the invariant distribution, the constants involve the L 2 -moment of the related density). Probably, some ideas could be adapted to the Euler scheme (starting from a deterministic point) but with technicalities that would carry us too far for this paper.
We preface the proof of Proposition 4.5 with a lemma about the shape of the first variation process of the continuous-time Euler scheme, T x t = ∇ x X̄ x t .
Lemma 4.4. For all n ∈ N, x ∈ R d and γ ∈ [0, 1), Proof. (of Lemma 4.4) First, observe that for all n ∈ N, and by the definition of the Euler scheme and the chain rule, Then we get, and the proof follows by a simple induction.
Consider two paths defined with the same Brownian motion and different starting points x, y ∈ R d . The following proposition shows that there is a pathwise confluence, i.e. the two trajectories get closer when n goes to infinity.

Proposition 4.6. Assume (H 1 r ) and let x, y ∈ R d , γ ∈ (0, γ ⋆ ], κ > 0. Let φ : R + → R be a positive function. Then, where, with γ ⋆ and Ψ given by Proposition 4.3.

Remark 4.5. In the sequel, this property is typically applied with a polynomial function φ, which leads to polynomial rates to equilibrium. It is worth noting that the proof could be adapted to provide exponential rates (the idea would be to consider an exponentially decreasing convex function instead of x ↦ x −κ in the proof below). However, with our method, such rates would lead to an exponential dependence in the dimension. This is why we do not give such bounds here.
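The pathwise confluence is easy to observe numerically: driving two Euler schemes with the same Gaussian increments (a synchronous coupling) makes the distance between them non-increasing as soon as U is convex with L-Lipschitz gradient and γ ≤ 2/L, by the co-coercivity inequality ⟨∇U(x) − ∇U(y), x − y⟩ ≥ L^{−1}|∇U(x) − ∇U(y)|². A hedged Python sketch, with a weakly convex toy potential U(x) = (1+|x|²)^{p/2} of our own choosing:

```python
import numpy as np

def coupled_euler(grad_U, x0, y0, gamma, n_steps, sigma=1.0, seed=0):
    """Two Euler paths driven by the SAME Brownian increments.
    For convex U with L-Lipschitz gradient and gamma <= 2/L, the drift map
    x -> x - gamma*grad_U(x) is nonexpansive, so |X_n - Y_n| never increases."""
    rng = np.random.default_rng(seed)
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    dist = [np.linalg.norm(x - y)]
    for _ in range(n_steps):
        z = sigma * np.sqrt(gamma) * rng.standard_normal(x.shape)
        x = x - gamma * grad_U(x) + z  # same noise z for both paths
        y = y - gamma * grad_U(y) + z
        dist.append(np.linalg.norm(x - y))
    return dist

p = 1.5  # here the Lipschitz constant of grad U is at most p, so gamma <= 2/p works
grad_U = lambda x: p * x * (1.0 + np.dot(x, x)) ** (p / 2 - 1)
d = coupled_euler(grad_U, np.full(3, 5.0), np.full(3, -5.0), gamma=0.05, n_steps=500)
```

Under strong convexity the decay would be geometric; here it is only polynomial, which is exactly the degradation that the function h φ,κ quantifies.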
Proof. (of Proposition 4.6) For x, y ∈ R d and n ∈ N, let us start with a Taylor expansion of the function where ∥ · ∥ is the operator norm associated with the Euclidean norm. By the Jensen inequality and Lemma 4.4, The operator norm associated with the Euclidean norm of a symmetric matrix is equal to its spectral radius, so we get with, For a given real nonnegative function φ : R → R we have For a positive number κ we have and, using the Markov inequality, The function x ↦ x −κ is convex on (0, +∞); then, by the Jensen inequality, Observe that Assumption (H 1 r ) implies, By Proposition 4.3 and the convexity of U , this implies that for any γ ∈ (0, γ ⋆ ], Thanks to this confluence property, we are now able to prove the convergence to equilibrium of the Euler scheme and to give the rate of this convergence.

Proof. (of Proposition 4.5) Since π γ is invariant for (X̄ nγ ) n∈N , we deduce from Fubini's theorem and the Jensen inequality that The Lipschitz property of f implies that where [f ] 1 is the Lipschitz constant of f . Proposition 4.6 implies With the help of the Young inequality, we also have Plugging these controls into (28) yields To conclude, we now use the bound (22) of Proposition 4.3(iii) and the assumption U (x) ≲ r Ψ.

Bias induced by the discretization under (H 1 r ):
We now need to provide estimates of W 1 (π, π γ ). We provide two results: Lemma 4.5, where we directly derive from Proposition 4.4 a bound in O( √ γ) which "only" requires the potential U to be C 2 . However, such a bound has a serious impact on the dependency in ε of the complexity. Thus, we propose a second result when U is C 3 , where we recover a bound in O(γ).

A first bound in O( √ γ)
As mentioned before, a first estimate can be directly deduced from Proposition 4.4. Actually, since in this result the L 2 -error between the process and its discretization is controlled uniformly in time, letting t go to ∞ leads to a similar bound for W 1 (π, π γ ). More precisely,

Proof. Owing to the stationarity of π, we have for every n ≥ 0, which tends to 0 by Proposition 4.5 (more precisely, this property can be deduced from an integration of (29) with respect to π and from the fact that π(U p ) < +∞ for any p by Proposition 4.3). Now, integrating with respect to π the bound of Proposition 4.4 and using that π(U p ) ≲ r Ψ p by Proposition 4.3 leads to the result.

A second bound in O(γ)
Even if the above bound is quite explicit in terms of its dependency with respect to L, c, c̄ and d, the fact that it is in O( √ γ) dramatically impacts the complexity in terms of ε (at least).
In fact, it is possible to get a 1-Wasserstein error of the order γ by using a combination of the control of the rate of convergence to equilibrium of the continuous process and of the finite-time weak error (between the process and its discretization). Such a strategy is used in several papers: in [PPar], this idea is developed in a multiplicative setting with a so-called "domino" approach for the control of the 1-Wasserstein and TV distances between the process and its discretization, uniformly in time. For the control of W 1 (π, π γ ) itself, our approach follows [DE21], which provides a series of bounds in many models and sets of assumptions, mainly based on the following principle (see Lemma 1 of [DE21]). Taking advantage of the stationarity of π γ , for any p ≥ 1, for any t > 0, so that if we assume that We thus propose to estimate ε 1 (t) and ε 2 (t) under Assumption (H 1 r ) (with or without (H 2 r )). This is the purpose of Lemmas 4.6 and 4.7 respectively. These two estimates lead to the following proposition.

Proposition 4.7. Assume (H 1 r ) and let δ ∈ (0, 1). Assume that U is C 3 with ∥∆(∇U )∥ 2 2,∞ ≲ r σ −4 L 3 Ψ (with ∥∆(∇U )∥ 2,∞ defined in Lemma 4.7). Then, a constant c r,δ (depending only on r and δ) exists such that for all γ ∈ (0, γ ⋆ ],

Remark 4.6. ⊲ Note that this result is clearly in the spirit of [DE21, Theorem 6]. However, there are several differences. First, we need here to adapt our proof to a setting where we only have polynomial convergence to equilibrium (instead of exponential convergence). Second, under our assumptions on U , which are more restrictive than those of [DE21, Theorem 6], we can improve the constants (and in particular avoid some exponential dependence in the Lipschitz constant L).
⊲ Compared with Lemma 4.5, this result improves the dependence in γ, but it is worth noting that the bound is also better with respect to Ψ (and thus to the dimension).
⊲ As mentioned in Remark 4.5, the proof can be adapted to provide exponential rates, but unfortunately our method would then lead to an exponential dependence in the dimension. For this section, the lack of exponential rate does not have a serious impact on the bounds. Nevertheless, if we needed to improve our bounds, an idea would be to apply [CFG22, Theorem 5.6]. In this result, the authors provide exponential rates under assumptions which are similar to ours. However, the related constant depends on the density of the semi-group, and it would be necessary to control it with respect to the parameters of the model.
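Schematically, the principle of [DE21, Lemma 1] invoked above can be written as follows (our own rendering, with P_t and P̄^γ_t denoting the semi-groups of the diffusion and of the Euler scheme, and t = nγ a multiple of the step, so that π = πP_t and π_γ = π_γ P̄^γ_{nγ}):

```latex
\mathcal{W}_1(\pi,\pi_\gamma)
  = \mathcal{W}_1\big(\pi,\ \pi_\gamma \bar P^{\gamma}_{n\gamma}\big)
  \le \underbrace{\mathcal{W}_1\big(\pi,\ \pi_\gamma P_{n\gamma}\big)}_{=:\,\varepsilon_1(n\gamma)\ \text{(convergence to equilibrium)}}
  \;+\; \underbrace{\mathcal{W}_1\big(\pi_\gamma P_{n\gamma},\ \pi_\gamma \bar P^{\gamma}_{n\gamma}\big)}_{=:\,\varepsilon_2(n\gamma)\ \text{(finite-time weak error)}} .
```

Choosing n so that both terms are of order γ then yields the announced O(γ) bound.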
Proof. Denoting by T x t = ∂ x X x t the first variation process related to (X x t ) t≥0 , we have: Thus, where ∥ · ∥ stands for the operator norm associated with the Euclidean norm. Since T x is the solution to Thus,
Following the arguments of Proposition 4.6, we get for any positive function φ,
By (H 1 r ), one deduces that where in the second line we used Proposition 4.3 and the convexity of U . Let now ν be a coupling of π and π γ . We have Taking the infimum over the set of couplings ν of π and π γ and using again Proposition 4.3, this yields where Then, a constant c r exists such that for all γ ∈ (0, γ ⋆ ] and for all λ ∈ (0, 1],

Remark 4.8. The assumption on ∆(∇U ) is calibrated so that its contribution is controlled by L 3 Ψ. This simplifies the exposition, and we could keep track of its specific contribution at the price of technicalities. However, this assumption is not really restrictive: denoting by A(x) the d × d-matrix defined by A i,j (x) = D 3 i,j,j U (x), one easily checks that ∥∆(∇U )∥ 2 2,∞ = sup the second inequality coming from a classical inequality related to the Frobenius norm. Since L ≥ sup x∈R d λD 2 U (x) , the assumption is for instance true if sup To conclude, note that Ψ σ 2 dL is well controlled: for instance, under (H 1 r ) and (H 2 r ), Ψ

Remark 4.9. The calibration of the parameter λ is of first importance in the proof of Proposition 4.7 in order to avoid an exponential dependence in the dimension.
Proof. The proof is an adaptation of Lemma 5.2 and Proposition 5.3 of [EP21], but with the viewpoint that U is only convex. More precisely, we start with a one-step control of the error between the diffusion and its Euler scheme by setting Then, setting b = −∇U , we have where in the last line we used the convexity of U , which implies that ⟨∇U (x) − ∇U (y), x − y⟩ ≥ 0.
We then write Let λ > 0. For the right-hand side of (33), we use the elementary inequality |uv| ≤ λ For (34), the Itô formula applied to On the one hand, setting ∆b = (∆b i ) d i=1 , On the other hand, using the fact that M defined by M t = ∫ t 0 ⟨∇b(y + σB s ), dB s ⟩ is a martingale (we refer to [EP21] for the details), we get Finally, from what precedes, we deduce that A standard Gronwall argument then leads to (ii) Iterating the above inequality, we obtain for each n ≥ 1, Integrating the initial condition with respect to π γ , we get where in the second line we used the stationarity property of π γ . Now, under (H 1 r ), |b| 2 = |∇U | 2 ≤ 2LU (with the same idea as the one which leads to (18)), so that by Proposition 4.3(iii), π γ (|b| 2 ) ≲ r L Ψ. On the other hand, by the Itô formula and the fact that ∆U ≤ dL, Again, with the help of Proposition 4.3(iii), Thus, using that γ ≤ L −1 , Since for a symmetric d × d-matrix A, ∥A∥ F ≤ √ d λA , one deduces that ∥∇b∥ 2,∞ = ∥D 2 U ∥ 2,∞ ≤ √ d L. It easily follows that σλL ∥∇b∥ 2,∞ ( √ Ψ + σ √ dL) ≤ L 3 Ψ (using that L ≥ 1 and Ψ ≥ d). The result follows.
5 Proof of Theorem 2.2

Following the bias-variance decomposition of the MSE, we successively study the bias and the variance contributions and end the section with the proof of Theorem 2.2.

Step 1: Bias of the procedure
In the sequel, Y(J, (γ j ) j , τ, (T j ) j , f ) is usually written Y for the sake of simplicity. We start with a telescopic-type decomposition: Let us now study the bias generated by the first and second terms of the right-hand side of (36).
Lemma 5.1. Assume (H 1 r ) and γ 0 ∈ (0, γ ⋆ ]. Let x ∈ R d be such that U (x) ≲ r Ψ. Then, for any r ∈ [0, 1) and δ ∈ (0, 1/2], there exists a constant c r,δ (depending only on r and δ) such that for all T ≥ 1 and for all Lipschitz continuous functions f : R d → R, Proof. Let us apply Proposition 4.5 with where in the last line we used standard comparison arguments between series and integrals. The result follows.
We are now ready to state a proposition about the control of the bias of the procedure.
Proposition 5.1. Assume (H 1 r ) and γ 0 ∈ (0, γ ⋆ ]. Let x ∈ R d be such that U (x) ≲ r Ψ. Let δ ∈ (0, 1/2] and let f be a continuous Lipschitz function with (ii) If the assumptions of Proposition 4.7 are fulfilled, Proof. (i) Taking the expectation in (36), we obtain: For the first three terms, we apply Lemma 5.1 and for the last one, Lemma 4.5. The result follows.
(ii) It is the same proof, using Proposition 4.7 to control the last term (instead of Lemma 4.5).

Step 2: Control of the variance
Now we have to control the variance of our estimator. Owing to the independence between the layers, Var(Y(J, (γ j ) j , (T j ) j , f )) where for some given γ > 0 and s > 0, Before going further, let us recall that for the multilevel method to be efficient, the correcting layers must have a small variance. In the long-time setting, this requires being able to control the L 2 -distance between couplings of Euler schemes with steps γ and γ/2. By Proposition 4.4, this is still possible under (H 1 r ), and such a property allows us to obtain the following result:

Lemma 5.2. Assume (H 1 r ) and γ 0 ∈ (0, γ ⋆ ]. Let x ∈ R d be such that U (x) ≲ r Ψ. Let δ ∈ (0, 1/2] and κ > 2/(1 − δ). Let f be a continuous Lipschitz function. Then, for all T > 0, where

Remark 5.1. In the uniformly convex case, the variance is controlled by γ log(1/γ)/T whereas, here, we are only able to obtain γ 1−1/κ /T . This difference is due to the lack of exponential convergence to equilibrium under our assumptions. Note that if we let κ go to ∞, we move ever closer to the uniformly convex bound. However, the constant depends on κ and explodes when κ → +∞. The interesting point is that the exponent of Ψ remains bounded when κ → +∞, which means that the dependence in the dimension is only slightly impacted by the choice of κ.
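To fix ideas, the multilevel construction with coupled Euler schemes of steps γ and γ/2 can be sketched as follows. This is a simplified illustration of the general idea, not the paper's estimator: all horizons T_j are taken equal, and the toy potential is our own choice; the actual estimator Y(J, (γ_j)_j, (T_j)_j, f) also tunes the horizons and sample sizes per level.

```python
import numpy as np

def coupled_fine_coarse(grad_U, x0, gamma, n_coarse, sigma, rng):
    """One trajectory pair: coarse Euler with step gamma and fine Euler with
    step gamma/2, driven by the same Brownian motion (each coarse increment
    is the sum of the two fine ones)."""
    xc = np.array(x0, dtype=float)  # coarse path, step gamma
    xf = np.array(x0, dtype=float)  # fine path, step gamma/2
    h = gamma / 2.0
    for _ in range(n_coarse):
        z1 = sigma * np.sqrt(h) * rng.standard_normal(xc.shape)
        z2 = sigma * np.sqrt(h) * rng.standard_normal(xc.shape)
        xf = xf - h * grad_U(xf) + z1
        xf = xf - h * grad_U(xf) + z2
        xc = xc - gamma * grad_U(xc) + (z1 + z2)
    return xf, xc

def multilevel_estimator(grad_U, f, x0, gamma0, T, J, N, sigma=1.0, seed=0):
    """Crude level with step gamma0 plus J correcting layers with steps
    gamma0/2^j, each layer averaged over N[j] i.i.d. coupled pairs."""
    rng = np.random.default_rng(seed)
    n0 = int(round(T / gamma0))
    # level 0: plain Euler scheme with step gamma0 (coarse component only)
    est = np.mean([f(coupled_fine_coarse(grad_U, x0, gamma0, n0, sigma, rng)[1])
                   for _ in range(N[0])])
    # correcting layers j = 1..J: E[f(step gamma0/2^j)] - E[f(step gamma0/2^(j-1))]
    for j in range(1, J + 1):
        g = gamma0 / 2 ** (j - 1)
        pairs = (coupled_fine_coarse(grad_U, x0, g, int(round(T / g)), sigma, rng)
                 for _ in range(N[j]))
        est += np.mean([f(xf) - f(xc) for xf, xc in pairs])
    return float(est)

p = 1.5
grad_U = lambda x: p * x * (1.0 + np.dot(x, x)) ** (p / 2 - 1)
est = multilevel_estimator(grad_U, lambda x: float(np.dot(x, x)), np.zeros(2),
                           gamma0=0.1, T=5.0, J=2, N=[20, 10, 10])
```

The synchronous coupling inside each layer is exactly what makes f(fine) − f(coarse) have a small variance, which is the content of Lemma 5.2.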
First, at the price of replacing f by f /[f ] 1 , we can assume in the sequel that [f ] 1 ≤ 1. Then, By Proposition 4.4 and the fact that U (x) ≲ Ψ, we deduce that for every δ ∈ (0, 1], This yields a first bound for Cov(G γ s , G γ u ): Hence, for any t 0 > 0, We now want to take advantage of the convergence to equilibrium to get a second bound when s − u ≥ t 0 : since G γ u is F u γ -measurable, we have for any s ≥ u γ , Setting F (γ, t, x) = E[f (X̄ γ,x t )] − π γ (f ), we deduce from the Markov property that and hence, On the other hand, As a consequence, Let us study the two right-hand terms successively. For (40), the Cauchy-Schwarz inequality and (38) yield: By Proposition 4.5, or more precisely by (29) combined with Proposition 4.3(iii) applied with φ and for any κ > 2/(1 − δ), we deduce that for any t 0 ≥ 2 and for any γ ∈ (0, For (41), using that s γ ≥ s γ − u γ , we remark that we can obtain the same bound, so that: In view of the above bound and of the one obtained in (39), we now optimize the choice of t 0 by taking t 0 solution to: (γL) Plugging this value of t 0 into (39) leads to: for any κ > The result follows.
In the next proposition, we bound the variance of the multilevel procedure.
To conclude, we need to distinguish two situations. If γθcd ≤ 1, then the inequality e x ≤ 1 + 4x for x ∈ [0, 1] leads to sup where in the last inequality we used that 4γc ≤ 1 ≤ c c . This concludes the proof.
(iii) Noting that Ψ ≲ r Ψ̄, the first bound is obvious. For the second one, it is enough to note that under (H 1 r ), (X t ) t≥0 and (X̄ nγ ) n≥0 converge in distribution to π and π γ respectively, so that, with a uniform integrability argument combined with the first bound of (iii), the convergence holds along the functions U p for any p > 0.

B Proof of Proposition 4.3
The idea is to use the Jensen inequality to derive controls of the polynomial moments from the exponential moments. To this end, we begin with the following lemma. Conditions (i.a), (ii.a), (ii.b) are satisfied (up to a constant depending on κ, δ and r only) if

Lemma B.1. Let V denote a non-negative random variable which satisfies E[e θV ] < e a + ρe b for positive θ, a, ρ and b. Then, for any p ≥ 1, E[V p ] ≤ θ −p (p − 1 + a + b + log(2ρ)) p . (56)

Notation: ⟨•, •⟩ and | • | denote the Euclidean scalar product and norm. The set M d,d refers to the set of d × d real matrices; we denote by ∥ • ∥ the operator norm associated with the Euclidean norm. For a symmetric matrix A, we denote respectively by λ A and λ̄ A its lowest and highest eigenvalues. The Frobenius norm of A ∈ M d,d is denoted by ∥A∥ F .
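As a quick numerical sanity check of (56), one can test the lemma on a toy variable of our own choosing (a Monte Carlo experiment, not part of the proof): for V ~ Exp(1) and θ = 1/2 we have E[e^{θV}] = 2, so the hypothesis holds with, e.g., a = log 2, ρ = 1, b = 0.01 (all positive).

```python
import numpy as np

# Toy check of Lemma B.1: V ~ Exp(1), theta = 1/2, so E[e^{theta V}] = 2.
# With a = log 2, rho = 1, b = 0.01, the hypothesis E[e^{theta V}] <= e^a + rho e^b
# holds, and the lemma predicts E[V^p] <= theta^{-p} (p - 1 + a + b + log(2 rho))^p.
rng = np.random.default_rng(1)
V = rng.exponential(1.0, size=200_000)
theta, a, rho, b, p = 0.5, np.log(2.0), 1.0, 0.01, 2.0

emp_mgf = np.mean(np.exp(theta * V))       # close to 1/(1 - theta) = 2
lemma_bound = theta ** (-p) * (p - 1 + a + b + np.log(2 * rho)) ** p
emp_moment = np.mean(V ** p)               # close to E[V^2] = 2
```

Here lemma_bound is roughly 23, comfortably above the true second moment 2; the lemma trades sharpness for a bound that only involves θ, a, ρ and b.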