High-dimensional MCMC with a standard splitting scheme for the underdamped Langevin diffusion

The efficiency of a Markov sampler based on the underdamped Langevin diffusion is studied for high-dimensional targets with convex and smooth potentials. We consider a classical second-order integrator which requires only one gradient computation per iteration. Contrary to previous works on similar samplers, a dimension-free contraction of Wasserstein distances and a convergence rate for the total variation distance are proven for the discrete-time chain itself. Non-asymptotic Wasserstein and total variation efficiency bounds and concentration inequalities are obtained for both the Metropolis-adjusted and unadjusted chains. In particular, for the unadjusted chain, in terms of the dimension $d$ and the desired accuracy $\varepsilon$, the Wasserstein efficiency bounds are of order $\sqrt d / \varepsilon$ in the general case, $\sqrt{d/\varepsilon}$ if the Hessian of the potential is Lipschitz, and $d^{1/4}/\sqrt\varepsilon$ in the case of a separable target, in accordance with known results for other kinetic Langevin or HMC schemes.


Introduction
The Langevin diffusion (also called underdamped or kinetic Langevin diffusion) is the Markov process on $\mathbb R^d \times \mathbb R^d$ that solves the SDE
$$\begin{cases} \mathrm d X_t = V_t\,\mathrm d t,\\ \mathrm d V_t = -\nabla U(X_t)\,\mathrm d t - \gamma V_t\,\mathrm d t + \sqrt{2\gamma}\,\mathrm d B_t, \end{cases} \tag{1}$$
where $B$ is a standard $d$-dimensional Brownian motion, $U \in \mathcal C^2(\mathbb R^d)$ is called the potential (or log-likelihood) and $\gamma > 0$ is a friction (or damping) parameter. Under general assumptions, it is known to be ergodic with respect to the Gibbs measure with Hamiltonian $H(x, v) = U(x) + |v|^2/2$, namely the probability measure $\pi$ on $\mathbb R^{2d}$ with density proportional to $\exp(-H(x, v))$.
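As a consistency check (a standard computation, sketched here, not specific to this paper), invariance of $\pi$ under (1) can be read on the generator: its Hamiltonian transport part and its Ornstein-Uhlenbeck part in $v$ each separately leave $\pi$ invariant, by integration by parts against $e^{-H}$:

```latex
\int_{\mathbb R^{2d}} \big( v\cdot\nabla_x \varphi - \nabla U(x)\cdot\nabla_v \varphi \big)\, e^{-H(x,v)}\,\mathrm dx\,\mathrm dv = 0,
\qquad
\int_{\mathbb R^{2d}} \gamma \big( \Delta_v \varphi - v\cdot\nabla_v \varphi \big)\, e^{-H(x,v)}\,\mathrm dx\,\mathrm dv = 0,
```

for all smooth compactly supported $\varphi$, so that the full generator integrates to zero against $\pi$.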
It is a central subject of study in a wide variety of domains, at the intersection of probability, PDEs and statistical physics, modeling e.g. plasmas or stellar systems. In this work we focus on the use of the Langevin diffusion in Markov Chain Monte Carlo (MCMC) algorithms in order to estimate expectations with respect to π. This process has been used for decades in molecular dynamics (MD), in particular due to physical motivations related to the motion of particles in classical physics (see [25,38,55,30,32,31,51,8,7,6] and references therein). In this context, the objective is not necessarily to compute only statistical averages, but also dynamical properties, like diffusion coefficients, transition times, reaction paths, etc., so that it is crucial to sample a trajectory of (a discrete-time approximation of) the Langevin diffusion, and not of another process with the same statistical equilibrium. More recently, the Langevin diffusion has gained some interest in the computational statistics and learning community [12,11,33,14,59], in which case there is no preferred underlying dynamics and only statistical averages with respect to π matter. The present work follows this viewpoint. Besides, in this context, the original question is to compute expectations with respect to π₁ ∝ exp(−U), the first d-dimensional marginal of π, and the velocity V is added as an auxiliary variable in order to enhance the sampling. This is called a lifted MCMC method.
We will focus on the case where U is m-convex and L-smooth (see Assumptions (∇Lip) and (Conv) below). These are usual conditions in the statistics community to compare MCMC methods, the objective being then to obtain non-asymptotic explicit estimates in terms of the dimension d, see e.g. [18,12,14]. These conditions are far from natural for MD applications where, except for very particular harmonic toy models, the potentials are highly non-convex and singular due to the strong short-range repulsion of nuclei. On the other hand, and contrary to the recent works in computational statistics, we will study a discretization scheme of (1) obtained by a splitting procedure, as commonly used in MD [8,30,7,6,51] and implemented in all MD codes.
The study, for MCMC purposes, of a discretization of the Langevin diffusion for a convex and smooth potential has recently been conducted in [12,14,33,59]. Concerning Metropolis-adjusted chains, some non-asymptotic results have recently been established in [10,20,36,37,5] for HMC samplers or other Metropolis-adjusted algorithms. Let us give an informal summary of our contribution with respect to these works (a more detailed and quantitative discussion is provided in Section 3.4, after our results have been rigorously stated).
• As far as non-asymptotic efficiency bounds are concerned, standard splitting discretisation schemes (such as the OBABO chain studied in the present work, see Section 2.1) had not been studied yet. Contrary to the first-order scheme of [12,14,11,59], the OBABO sampler is a second-order scheme (in the time-step). Like classical schemes, it only requires one computation of ∇U per step, contrary to the second-order scheme of [14,13] that also requires a computation of ∇²U. It belongs to the class of splitting schemes considered in [30], but [30] only provides a formal expansion of the invariant measure in the time-step, and only for two particular splittings (and not the one studied here). A second-order scheme with the same computational complexity is also introduced in [33], and we obtain similar results in terms of efficiency. Contrary to the schemes studied in [33,30], the OBABO scheme is naturally connected to HMC-type samplers, and in particular the discretization bias at equilibrium is only due to the deterministic approximation of the Hamiltonian dynamics by a Verlet (or leapfrog) scheme, which has some nice consequences in the analysis.
• Contrary to previous works on other schemes based on the underdamped Langevin diffusion, instead of establishing long-time convergence estimates for the continuous-time process and then conducting an analysis of the discretization error, we directly prove a Wasserstein contraction for the discrete-time Markov chain itself. This classically yields nice concentration results (and thus non-asymptotic confidence intervals) for empirical averages, see Theorem 6, which is not the case in previous works. Then, we study the bias at equilibrium separately. Notice that, ultimately, gathering these two parts yields efficiency bounds that are similar to those obtained with the method relying only on the convergence rate of the continuous-time process; nevertheless, this work was initially motivated by the theoretical question, considered in [48], of obtaining convergence rates for (discrete-time) MCMC samplers in high dimension. This issue is not addressed by the basic method. Finally, with our method, we obtain that empirical averages of the discrete-time chain have a (biased) long-time limit (again, this is Theorem 6), which is a first step toward using Romberg interpolation techniques, see e.g. [19,1] (this issue is postponed to a future work, see [39]).
• We establish a Wasserstein/total variation regularization property. To our knowledge, this is the first result of this kind for a Markov chain based on a hypoelliptic non-elliptic diffusion. As a consequence, contrary to [12,14,59], we obtain results for the total variation distance. However, we acknowledge that schemes based on the exact solving of the SDE (1) but with ∇U(X_t) replaced by ∇U(X_{δ⌊t/δ⌋}), where δ > 0 is the time-step (which is the case in [12,14,33,59] but not of our scheme), have Gaussian transitions and thus enjoy a similar regularization property, although it has not been stated in previous works. In [33], bounds are obtained for the Kullback-Leibler divergence, which is stronger, but with hypocoercive PDE techniques, while we use an elementary coupling approach. Adapting their method to our case is not straightforward, as one iteration of our scheme is not based on solving (1) with fixed forces.
• In order to see that second-order schemes are more efficient than first-order ones, the gradient Lipschitz assumption is not sufficient; some information is required on higher-order derivatives of U. For this reason, in [14,33], the assumption that ∇²U is Lipschitz is enforced. We work under a weaker condition (see Assumption (∇²pol(ℓ)) below), which holds for instance if ‖∇^k U‖_∞ < +∞ for some k > 2. When ∇²U is Lipschitz, we obtain efficiency bounds similar to [14,33].
• We establish non-asymptotic efficiency bounds and confidence intervals for the Metropolis-adjusted OBABO scheme, which is an HMC sampler with a particular choice of parameters. These are the first results of this kind for a Metropolis-adjusted sampler based on the (kinetic) Langevin equation. By comparison, the recent works [10,36,37,5] are concerned with the classical HMC sampler, with complete refreshments of the velocities separated by long runs of the Hamiltonian dynamics (and do not provide confidence intervals). Also, contrary to [10,20,36,37], we do not study the case of a so-called warm start for the initial distribution, and we obtain non-asymptotic bounds that involve only some moments of the initial distribution (similarly to the unadjusted case).
We are mainly interested in the dependency of the explicit estimates on the dimension d, the time-step δ and the accuracy ε. Concerning the dependency on other parameters like m or L, in practice, all our estimates could be improved by considering a preconditioned algorithm as in [59] and/or by explicitly taking into account a Gaussian part of U, i.e. decomposing $U(x) = x \cdot S^{-1} x/2 + \tilde U(x)$ with some covariance matrix S and potential $\tilde U$, see e.g. [47].
The article is organized as follows. The OBABO sampler and its Metropolis-adjusted counterpart are presented in Section 2, together with some general considerations on Wasserstein distances. Our main results are stated in Section 3, decomposed as: results on the long-time behavior of the OBABO chain (Section 3.1); on the equilibrium bias (Section 3.2); on the Metropolis-adjusted sampler (Section 3.3). The results are discussed and related to other works in Section 3.4. Section 4 contains the proofs for the long-time behaviour of the OBABO sampler, Section 5 the proofs for the equilibrium bias and Section 6 the proofs for the Metropolis-adjusted chain.

A second-order scheme for the Langevin diffusion
We present the Markov chain which is the main object of study of this work, obtained as a time-discretization of the Langevin diffusion (already considered by Horowitz in [25], see also [8] and references therein). First, we simply define the transition of the Markov chain, and the rest of the section is devoted to the motivation and discussion of this definition. Let δ, γ > 0 be respectively the time-step and the friction coefficient, and denote η = e^{−δγ/2}. We consider the time-homogeneous Markov chain (x_n, v_n)_{n∈N} on R^d × R^d with transitions given by
$$\begin{aligned}
v_{n+1/2} &= \eta v_n + \sqrt{1-\eta^2}\, G_n - \frac\delta2 \nabla U(x_n),\\
x_{n+1} &= x_n + \delta\, v_{n+1/2},\\
v_{n+1} &= \eta\left( v_{n+1/2} - \frac\delta2 \nabla U(x_{n+1})\right) + \sqrt{1-\eta^2}\, G'_n,
\end{aligned}$$
where G_n and G'_n are two independent standard (mean 0, variance I_d) d-dimensional Gaussian random variables. Denote by P the Markov transition operator associated to the chain, i.e. Pϕ(x₀, v₀) = E(ϕ(x₁, v₁)) for all measurable bounded ϕ.
A first very important remark is that, although it seems that, in one iteration, ∇U has to be computed for two different values of x, this is in fact only true for the first iteration; afterwards, ∇U(x_n) is already known from the previous iteration and only ∇U(x_{n+1}) has to be computed. A second remark is that, contrary to the schemes considered in e.g. [12,14,33,59] (where the transition is obtained by following for a time δ the continuous-time process (1) but with constant forces ∇U(x_t) = ∇U(x₀)), the law of (x₁, v₁) given (x₀, v₀) is not a Gaussian distribution, because of the term ∇U(x₁) in v₁ (except in the particular case where U is quadratic).
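For concreteness, here is a minimal one-dimensional sketch of the OBABO transition described above (an illustration, not code from the paper; in an optimized implementation the last gradient would be cached for the next iteration rather than recomputed):

```python
import math
import random

def obabo_step(x, v, grad_U, delta, gamma, rng):
    """One OBABO transition for a 1-d target with potential U."""
    eta = math.exp(-delta * gamma / 2.0)  # half-step damping factor
    sigma = math.sqrt(1.0 - eta * eta)
    # O: partial refreshment of the velocity (half step)
    v = eta * v + sigma * rng.gauss(0.0, 1.0)
    # B: half kick with the current gradient
    v -= 0.5 * delta * grad_U(x)
    # A: free transport
    x += delta * v
    # B: half kick with the new gradient (reusable at the next iteration)
    v -= 0.5 * delta * grad_U(x)
    # O: second partial refreshment (half step)
    v = eta * v + sigma * rng.gauss(0.0, 1.0)
    return x, v

# Toy run on U(x) = x^2/2, whose position marginal under pi is a standard
# Gaussian; the empirical second moment of x should be close to 1, up to the
# O(delta^2) equilibrium bias and Monte Carlo error.
rng = random.Random(0)
x, v = 0.0, 0.0
second_moment, count = 0.0, 0
for i in range(100_000):
    x, v = obabo_step(x, v, lambda t: t, 0.05, 2.0, rng)
    if i >= 1_000:  # discard burn-in
        second_moment += x * x
        count += 1
print(second_moment / count)  # typically close to 1
```

The parameter choices (δ = 0.05, γ = 2) are arbitrary illustrative values satisfying the stability constraints discussed below for this toy target.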
The scheme makes more sense when it is decomposed as the following successive steps:
$$\text{O:}\ \tilde v_n = \eta v_n + \sqrt{1-\eta^2}\, G_n,\qquad
\text{B:}\ v_{n+1/2} = \tilde v_n - \frac\delta2 \nabla U(x_n),\qquad
\text{A:}\ x_{n+1} = x_n + \delta v_{n+1/2},$$
$$\text{B:}\ \tilde v_{n+1} = v_{n+1/2} - \frac\delta2 \nabla U(x_{n+1}),\qquad
\text{O:}\ v_{n+1} = \eta \tilde v_{n+1} + \sqrt{1-\eta^2}\, G'_n.$$
Here, we use the notations O, B and A of [30], referring respectively to the damping (or friction/dissipation, or partial resampling), the acceleration and the free transport parts of the dynamics. So, in the rest of the article, we will refer to the Markov chain (x_n, v_n)_{n∈N} as the OBABO sampler which, contrary to some of its other names like second order Langevin or midpoint Euler-Verlet-midpoint Euler [8,25,31], has the advantage of giving a compact yet explicit description of the specific scheme (in particular by comparison with other second-order schemes for the underdamped Langevin diffusion). Note that the BAB part of the scheme is the classical velocity Verlet algorithm for the Hamiltonian dynamics. More generally, the Verlet, OBABO and BAOAB (from [30]) algorithms all share a common palindromic form, since they are obtained from a Strang/Trotter splitting of the generator of the Langevin diffusion. We now give a brief and informal presentation of the latter, and refer to [30,31] or [44, Section 5] for more details. The transition semigroup P_t associated to (1) is informally of the form e^{tL}, where the infinitesimal generator L can be decomposed as
$$L = L_A + L_B + L_O, \qquad L_A = v\cdot\nabla_x,\quad L_B = -\nabla U(x)\cdot\nabla_v,\quad L_O = \gamma\left(\Delta_v - v\cdot\nabla_v\right).$$
Using that $e^{\delta(a+b)} = e^{(\delta/2)a}\, e^{\delta b}\, e^{(\delta/2)a} + o(\delta^2)$, we formally get a second-order approximation of P_δ by
$$e^{(\delta/2) L_O}\, e^{(\delta/2) L_B}\, e^{\delta L_A}\, e^{(\delta/2) L_B}\, e^{(\delta/2) L_O}.$$
The operator on the right-hand side is exactly the transition kernel of the OBABO sampler. At least formally, the fact that the transition kernel of the chain approximates the transition semigroup of the continuous process up to second order in δ can be shown to yield a similar order for the error on the invariant measure, see [30,44] for general discussions and more details, and Proposition 12 below for a rigorous and quantitative statement in the particular OBABO case.
More precisely, in Proposition 12, we get an error term on the invariant measure of order δ² for the Wasserstein and total variation distances, to be compared to e.g. [14, Theorem 2] or the proof of [12, Theorem 1], where the error is of order δ for the Wasserstein distance (see Section 3.4 for the consequence of this on the efficiency bounds). For convex and smooth potentials, for a suitable choice of γ, the continuous-time process (1) converges to equilibrium at a rate that is independent of the dimension, so that the error estimates obtained in [12,14,33,59] depend on the dimension only because of the error on the invariant measure (which requires the time-step to be small enough). By replacing a first-order discretization scheme by a second-order one, we expect to improve the estimates accordingly, which is indeed what is observed in [14,33], where other second-order schemes are considered, and in the present work (again, we refer to Section 3.4).
The reason why we chose to study the OBABO algorithm rather than the BAOAB one or other similar schemes is the following. The OBABO scheme contains at its core a (deterministic) Verlet part BAB, which is time-reversible (in the physicist sense, i.e. up to a reflection of the velocity). This fact classically leads to a natural Metropolized version of the algorithm, of Hamiltonian (or Hybrid) Monte Carlo (HMC) type, presented below. This Metropolis-adjusted algorithm is interesting in itself, and we use it as an auxiliary tool in the study of the bias of the OBABO chain. Besides, an unintended benefit of our choice of integrator is that, for the OBABO scheme, establishing a Wasserstein/total variation regularization property (Proposition 3) is rather straightforward because of the particular location of the noise in the scheme. That being said, it should be possible to adapt most (if not all) of our arguments to the BAOAB or other similar second-order schemes like the one in [33]. Finally, remark that the BAOAB scheme is promoted in [30] because, in the overdamped limit γ → +∞, the invariant measure is correctly sampled at fourth order in the time-step. However, as γ → +∞, the convergence rate of the chain goes to zero, so it is not clear that this argument is decisive in the choice of the integrator (after a suitable scaling of time, the BAOAB scheme converges as γ → +∞ to a discretization scheme of the overdamped Langevin diffusion which has a correct equilibrium up to second order).
Recall that the goal of the algorithm is to compute averages with respect to π₁, the first d-dimensional marginal of π. As such, we are not constrained in the choice of the equilibrium of the velocities, and typically we could choose a symmetric Gaussian distribution with a variance different from 1. Considering the Hamiltonian H(x, v) = U(x) + |v|²/(2σ²), we would have a new parameter σ to optimize, which could improve the results. Nevertheless, in that case, we can always take σ = 1 by the change of variables v ← v/σ, δ ← δσ and γ ← γ/σ. As a consequence, without loss of generality, we take σ = 1.
As mentioned above, the particular structure of the scheme has some nice implications for the theoretical study. Of course, in practice, if we are only interested in computing averages that only involve the position x_n, then the scheme is equivalent to alternating a Verlet step BAB and a partial refreshment O₂ (which is simply O but with a full rather than half time-step, i.e. η is replaced by e^{−δγ}), since (OBABO)ⁿ = O(BABO₂)ⁿ⁻¹BABO and the first and last O have no effect on the position.
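The merging of two consecutive half O steps into the full-step refreshment O₂ is a direct Gaussian computation: with G, G′ independent standard Gaussians,

```latex
\eta\big(\eta v + \sqrt{1-\eta^2}\, G\big) + \sqrt{1-\eta^2}\, G'
\;\sim\; \mathcal N\big(\eta^2 v,\ (1-\eta^2)(1+\eta^2)\, I_d\big)
\;=\; \mathcal N\big(e^{-\delta\gamma}\, v,\ (1-e^{-2\delta\gamma})\, I_d\big),
```

which is exactly an O update with η replaced by e^{−δγ}.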
Finally, let us mention that our analysis can be straightforwardly adapted to get results on the O(BAB)ᵏO scheme for some fixed k ∈ N*, independent of δ (i.e. k Verlet steps are performed between the partial refreshments). Indeed, as δ vanishes, the corresponding chain still converges toward the continuous-time process (1), which is the main ingredient of our results concerning long-time convergence, and similarly there is no particular difficulty in adapting the results concerning discretization errors. For fixed parameters k and γ, this extension does not yield any improvement in terms of scaling in δ and d (also, note that one iteration of O(BAB)ᵏO requires k computations of ∇U), and thus we do not consider it, to avoid an unnecessary inflation of notations.

Metropolis-adjusted algorithm
The O parts of the scheme leave invariant the standard Gaussian distribution for the velocities, hence the target π. This is not the case of the Verlet part BAB, because of the numerical approximation. As a consequence, π is not invariant for the OBABO chain. Nevertheless, this is classically fixed by adding a Metropolis accept/reject step on this part of the dynamics, which gives an HMC algorithm [25,45]. We now detail this. In the following, we denote by Φ_V the Verlet map
$$\Phi_V(x, v) = \left( x + \delta v - \frac{\delta^2}2 \nabla U(x),\ v - \frac\delta2\left(\nabla U(x) + \nabla U\left(x + \delta v - \frac{\delta^2}2\nabla U(x)\right)\right)\right),$$
so that the steps BAB read (x₁, ṽ₁) = Φ_V(x₀, ṽ₀). Since Φ_V composed with the velocity reflection (x, v) ↦ (x, −v) is an involution, the proposal kernel q((x, v), ·) = δ_{(x', −v')}, where (x', v') = Φ_V(x, v), is symmetric. Denote by P_MH the Markov transition operator of the Metropolis-Hastings algorithm with proposal kernel q. A transition of the corresponding chain from (x₀, v₀) is given by (x₁, v₁) = (x', −v') with (x', v') = Φ_V(x₀, v₀) if W ≤ exp(H(x₀, v₀) − H(x', v')), and (x₁, v₁) = (x₀, v₀) otherwise, where W is uniformly distributed over [0, 1]. We used that H(x', v') = H(x', −v'). By construction of the Metropolis-Hastings algorithm, P_MH is reversible (in the probabilistic sense) with respect to π. Moreover, π is also invariant by the transformation v ↔ −v, namely by the operator P_R given by P_R ϕ(x, v) = ϕ(x, −v).
In other words, P_O is given for any bounded observable ϕ by
$$P_O \varphi(x, v) = \mathbb E\left( \varphi\big(x,\ \eta v + \sqrt{1-\eta^2}\, G\big)\right),$$
where G is a standard d-dimensional Gaussian variable. The Metropolis-adjusted version of the OBABO algorithm, introduced in [25], is the Markov chain with transition operator
$$P_M = P_O P_{MH} P_R P_O.$$
By construction, π is invariant for P_M. Although π is reversible for P_O, P_R and P_MH, this property is not conserved by composition and π is not reversible with respect to P_M. Besides, using that P_R P_O P_R = P_O and P_R² = Id, we see that the adjoint operator of P_M in L²(π) is given by P*_M = P_R P_M P_R, i.e. the process is reversible in the physics sense, up to a reflection of the velocity. We will not use this property in the article, but we note that it allows one to get an L² spectral gap from a contraction in another distance, like a Wasserstein distance or the H¹ norm, as in the reversible case, see e.g. [15].
The operator P_M corresponds to the following transition: the velocity is partially refreshed (step O), a Verlet step is proposed and accepted or rejected according to the Metropolis rule (with a velocity flip compensating the reflection in the proposal), and the velocity is partially refreshed again. Remark that, in case of acceptance (namely if W ≤ α(x₀, ṽ₀), where ṽ₀ denotes the velocity after the first O step), this is exactly the OBABO transition. Similar Metropolis-adjusted discretizations of the Langevin diffusion have been studied in [51,7,6,47], see also references therein.
Due to the particular scaling of the partial refreshment mechanism, this HMC chain converges as δ vanishes to the Langevin diffusion, rather than to the continuous-time Randomized HMC process (see [15] and references therein), where dissipativity in the velocity is ensured by a (possibly partial) resampling of the velocities at a constant rate rather than by an Ornstein-Uhlenbeck diffusion.
Following motivations similar to the unadjusted case, we call this specific Metropolis-adjusted splitting scheme the OM(BAB)O chain (the M(·) standing for Metropolis).
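A matching one-dimensional sketch of one OM(BAB)O transition (again an illustration of the description above, not code from the paper): partial refreshment, a Metropolized Verlet proposal with a velocity flip on rejection, and a second partial refreshment. Since the Verlet step is accepted or rejected on the Hamiltonian, the chain targets π exactly.

```python
import math
import random

def ombabo_step(x, v, U, grad_U, delta, gamma, rng):
    """One Metropolis-adjusted OM(BAB)O transition for a 1-d target."""
    eta = math.exp(-delta * gamma / 2.0)
    sigma = math.sqrt(1.0 - eta * eta)
    # O: partial refreshment (half step)
    v = eta * v + sigma * rng.gauss(0.0, 1.0)
    # Verlet proposal (BAB)
    v_new = v - 0.5 * delta * grad_U(x)
    x_new = x + delta * v_new
    v_new -= 0.5 * delta * grad_U(x_new)
    # Metropolis step on H(x, v) = U(x) + v^2/2; a rejection flips the velocity
    dH = (U(x_new) + 0.5 * v_new ** 2) - (U(x) + 0.5 * v ** 2)
    if rng.random() < math.exp(min(0.0, -dH)):
        x, v = x_new, v_new
    else:
        v = -v
    # O: second partial refreshment (half step)
    v = eta * v + sigma * rng.gauss(0.0, 1.0)
    return x, v
```

On U(x) = x²/2 with a moderate time-step, almost all proposals are accepted and the chain reproduces the exact standard Gaussian position marginal, without the O(δ²) bias of the unadjusted scheme.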

Wasserstein distances
Let ρ be a distance on R^d. For p ∈ [1, ∞), denote by P_{p,ρ}(R^d) the set of probability distributions on R^d having a finite p-th moment for ρ, i.e. ν ∈ P_{p,ρ}(R^d) if ∫ ρ^p(x₀, x) ν(dx) < ∞ for some (equivalently, any) x₀ ∈ R^d. The Wasserstein distance of order p associated with ρ is then defined by
$$\mathcal W_{p,\rho}(\nu, \mu) = \inf_{\eta \in \Gamma(\nu, \mu)} \left( \int_{\mathbb R^d \times \mathbb R^d} \rho^p(x, y)\, \eta(\mathrm dx, \mathrm dy)\right)^{1/p},$$
where Γ(ν, µ) is the set of transference plans of ν and µ, namely the set of probability distributions on R^d × R^d with first marginal ν and second marginal µ. In other words,
$$\mathcal W_{p,\rho}(\nu, \mu) = \inf\left\{ \left(\mathbb E\, \rho^p(X, Y)\right)^{1/p} :\ X \sim \nu,\ Y \sim \mu \right\},$$
although this is only a formal definition since the infimum is not taken over a proper set. If (X, Y) is a random variable with law η ∈ Γ(ν, µ), we say that (X, Y) is a coupling of ν and µ. Implicitly, W_ρ means W_{1,ρ}, and W_p means W_{p,ρ} with ρ(x, y) = |x − y|. In fact, unless otherwise specified, in R^d, the name Wasserstein distance usually refers to cases where ρ is equivalent to the Euclidean metric. Besides, another interesting case is the discrete metric ρ(x, y) = 2·1_{x ≠ y}, in which case the associated W₁ distance is the total variation norm
$$\|\nu - \mu\|_{TV} = 2 \sup_{A} |\nu(A) - \mu(A)|.$$
Given a Markov operator Q acting on P_p(R^d), it is easily seen by conditioning on the initial condition that, for C > 0, the property that W_{p,ρ}(δ_x Q, δ_y Q) ≤ C ρ(x, y) for all x, y ∈ R^d implies that W_{p,ρ}(νQ, µQ) ≤ C W_{p,ρ}(ν, µ) for all ν, µ ∈ P_{p,ρ}(R^d).
Besides, Jensen's inequality implies that W_{p,ρ} ≤ W_{q,ρ} whenever p ≤ q, which together with the previous remark implies that a contraction of the W_{q,ρ} distance for some q induces a contraction of W_{p,ρ} for all p ≤ q.
For a distance ρ on R^d that is equivalent to the Euclidean metric, from [58, Corollary 5.22 and Theorem 6.18], there always exists an optimal coupling for W_{p,ρ}, i.e. for ν, µ ∈ P_p(R^d) there exists a coupling (X, Y) of ν and µ on some probability space Ω such that W_{p,ρ}^p(ν, µ) = E(ρ^p(X, Y)), and moreover (P_p(R^d), W_{p,ρ}) is a complete metric space.
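As a concrete illustration of these definitions (not taken from the paper): in dimension one with ρ(x, y) = |x − y|, the optimal coupling of two empirical measures with the same number of atoms is the monotone rearrangement, so W_p reduces to sorting:

```python
def wasserstein_p_1d(xs, ys, p=1):
    """W_p between two empirical measures (1/n) sum_i delta_{x_i} with the
    same number of atoms; in 1-d, the sorted (monotone) coupling is optimal."""
    if len(xs) != len(ys):
        raise ValueError("expected the same number of atoms")
    n = len(xs)
    cost = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys)))
    return (cost / n) ** (1.0 / p)

# Two pairs of atoms shifted by 2: the optimal plan matches them in order.
print(wasserstein_p_1d([0.0, 1.0], [2.0, 3.0]))       # 2.0
# W_1 <= W_2 on the same pair of measures, as implied by Jensen's inequality.
print(wasserstein_p_1d([0.0, 4.0], [1.0, 1.0], p=1))  # 2.0
print(wasserstein_p_1d([0.0, 4.0], [1.0, 1.0], p=2))  # sqrt(5)
```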

Main results
As already mentioned, we are principally interested in the smooth and convex case, corresponding to the following conditions.

Assumption (∇Lip). There exists L > 0 such that for all x, y ∈ R^d,
$$|\nabla U(x) - \nabla U(y)| \le L\, |x - y|.$$

Assumption (Conv). There exists m > 0 such that for all x, y ∈ R^d,
$$\left(\nabla U(x) - \nabla U(y)\right)\cdot(x - y) \ge m\, |x - y|^2.$$

Under Assumptions (∇Lip) and (Conv), U admits a unique global minimum x⋆ and, for all x ∈ R^d,
$$\frac m2 |x - x_\star|^2 \le U(x) - U(x_\star) \le \frac L2 |x - x_\star|^2.$$
The two conditions do not have the same status. The condition (∇Lip) will be enforced throughout the article. It classically implies the stability of the OBABO scheme for δ small enough (this will be a consequence of our results). Approximating diffusions with non-Lipschitz coefficients leads to various difficulties and can require truncated algorithms, see e.g. [7] and references therein. The condition (Conv) gives a simple framework to obtain nice explicit convergence rates, and will not be used in most of the bias error analysis.
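For instance (an illustrative example, not taken from the paper), the perturbed quadratic potential $U(x) = |x|^2/2 + \varepsilon \sum_{i=1}^d \cos(x_i)$ with $\varepsilon \in (0, 1)$ satisfies both assumptions, since

```latex
\nabla^2 U(x) = I_d - \varepsilon\, \mathrm{diag}\big(\cos(x_1), \dots, \cos(x_d)\big),
\qquad
(1-\varepsilon)\, I_d \preceq \nabla^2 U(x) \preceq (1+\varepsilon)\, I_d,
```

so that (Conv) holds with m = 1 − ε and (∇Lip) holds with L = 1 + ε.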

Dimension free convergence rates
Recall P is the Markov operator corresponding to the OBABO chain introduced in Section 2.1.
In this section, we state our results on the long-time behavior and regularization of this chain by itself, i.e. without referring to the continuous-time limit. Under the convex/smooth assumption, and for a suitable choice of the damping parameter, the Langevin diffusion is contractive, in the sense that its deterministic drift contracts distances [4,12]. Hence, starting from two different initial states, the parallel coupling (i.e. considering the same Brownian noise) of two processes yields a deterministic contraction. The problem is considerably more involved in the non-convex case, where one has to take advantage of the noise to overcome potential barriers; in that case, a combination of parallel and mirror (using reflected Brownian noises) couplings and a suitable concave transformation of the distance can give a contraction in some modified W₁ distance, see [21,11,13]. An alternative argument in the non-convex case to get a W₂ hypocoercive contraction is to combine an entropic hypocoercive contraction with a Wasserstein/entropy regularization, as in [22, Theorem 2] (this gives sharper results than coupling arguments in the low temperature regime, see [42]).
In the present convex case, the OBABO chain being an approximation of the Langevin diffusion, a simple parallel coupling does the job, and we get the following.

Theorem 1. Under Assumptions (∇Lip) and (Conv), suppose moreover that γ ≥ 2√L and that δ ≤ m/(33γ³). Then there exist a distance ρ on R^{2d}, equivalent to the Euclidean one, and an explicit rate κ > 0 of order m/γ such that, for all p ≥ 1, n ∈ N and all ν, µ ∈ P_p(R^{2d}),
$$\mathcal W_{p,\rho}\left(\nu P^n, \mu P^n\right) \le \left(1 - \kappa\delta\right)^n\, \mathcal W_{p,\rho}\left(\nu, \mu\right).$$
Moreover, P admits a unique invariant probability distribution π_δ, and π_δ ∈ P_p(R^{2d}) for all p ≥ 1.
This is proven in Section 4.1, see Corollary 20.
Remark 2. The restriction on γ is consistent with all the works that studied the contraction of the parallel coupling for the continuous-time Langevin diffusion [4,12,14,59] (see also [15] for the Randomized HMC algorithm, since this is exactly the same computation after taking expectations). In fact, as proven in [43, Proposition 4], for the continuous-time Langevin process, a Wasserstein contraction as in the proof of Theorem 1 holds for all potentials U satisfying (∇Lip) and (Conv) if and only if L − m < γ(√L + √m). When m ≪ L, our condition γ ≥ 2√L is thus similar to the optimal one up to a factor 2. In fact, it is clear that, by combining the proofs of [43, Proposition 4] and of Theorem 1, a Wasserstein contraction holds for the OBABO chain under the same optimal condition on γ as in the time-continuous case, but then κ and the condition on δ are less nice, and this does not improve the convergence rate in terms of the dependency on m and L.
The choice γ = 2√L maximizes our bound on the contraction rate. In [14], the optimal rate for the continuous-time process is proven to be m/γ, obtained with the choice γ = √(m + L), which means that in the regime m ≪ L we are simply missing a factor 6 from the optimal bound. Again, this factor 6 can of course be reduced, to get arbitrarily close to the optimal bound, simply by following the proof and keeping sharper constants, at the cost of a stronger condition on δ, and without changing the order in terms of m and L.
Besides, as γ → +∞, we get a convergence rate of order 1/γ, which is sharp and well known in the case of the continuous-time Langevin diffusion.
It is clear that a condition on δ is needed, since the process is not even stable when δ is too large. Here the condition δ ≤ m/(33γ³) is not sharp; it is used to get a simple expression. Anyway, we have in mind the regime where m, L, γ are independent of the dimension and, as we will see in the next section, δ should vanish as d → +∞, so that the condition on δ is always satisfied in high dimension, which is why we chose simplicity over sharpness when stating this condition.
As a hypoelliptic (non-elliptic) diffusion, the continuous-time Langevin diffusion (1) enjoys many nice regularization properties. In particular, under Assumption (∇Lip), [23, Corollary 4.7(2) and Remark 4.1] implies an instantaneous W₂/relative entropy regularization, which together with the Pinsker inequality implies that, for some C > 0, for all ν, µ ∈ P₂(R^{2d}) and t > 0,
$$\|\nu P_t - \mu P_t\|_{TV} \le \frac{C}{\min(t, 1)^{3/2}}\, \mathcal W_2(\nu, \mu),$$
where (P_t)_{t≥0} is the transition semigroup associated to (1). For elliptic diffusions, the result would typically hold with √t instead of t^{3/2}, but for the Langevin diffusion there is no direct noise in the position, and thus t^{N+1/2} is to be expected, where N is the number of times one has to consider Lie brackets to fulfill Hörmander's condition (here N = 1).
We now state a similar result, but at the level of the discrete-time chain P. Convexity is not assumed. We are not aware of any similar result for a Markov chain based on the discretization of a hypoelliptic non-elliptic diffusion (although, as mentioned in the introduction, it holds indeed in many cases).

Proposition 3. Under Assumption (∇Lip), for δ small enough, there exists K₂ > 0 such that for all ν, µ ∈ P₂(R^{2d}),
$$\|\nu P - \mu P\|_{TV} \le \frac{K_2}{\delta^{3/2}}\, \mathcal W_2(\nu, \mu).$$
This is proven in Section 4.2, see Proposition 22.
Remark 4. We retrieve as δ → 0 the expected δ^{3/2} scaling. In fact, more precisely, as can be seen in Proposition 22 below, we get the scaling δ^{3/2} for the positions and √δ for the velocities.
Combining the two previous results naturally yields a convergence rate for the total variation distance. Following [50,48], we consider the total variation convergence rate of P, namely the infimum r_*(P) of the set of r ∈ (0, 1] such that, for all initial conditions z, ‖Pⁿ(z, ·) − π_δ‖_TV ≤ M(z) rⁿ for some finite M(z) and all n ∈ N.
An immediate consequence of Theorem 1 and of Proposition 3 is the following.
The most classical setting to obtain bounds on the total variation convergence rate of a Markov chain is the combination of a Lyapunov (drift) condition and a local Doeblin (minorization) condition. Nevertheless, the recent [48] rigorously establishes that, in a variety of cases, no drift/minorization condition can yield a reasonable bound on r_*(P) (here only one-step minorization conditions are concerned, of the form inf_{z∈C} P(z, ·) ≥ εν for some probability ν and some ε > 0). More precisely, it is clear that the results of [48] apply to the OBABO sampler (essentially by adapting the proof of [48, Proposition 15] for the MALA sampler), which shows that it is impossible to prove, simply with a drift/minorization condition, that if δ scales polynomially with the dimension then r_*(P) is bounded away from 1 polynomially in the dimension. In Corollary 5, we have used the combination of a Wasserstein convergence and a Wasserstein/total variation regularization, called a one-shot coupling in [48,49,34], which is indeed presented in [48] as an alternative to drift/minorization arguments.
We finish this section with a concentration result implied by the Wasserstein contraction of Theorem 1, or more precisely by the deterministic contraction at the core of the proof of this result (see Proposition 17 below). We say that a probability distribution ν on R^d satisfies a logarithmic Sobolev (or simply log-Sobolev) inequality with constant C if, for all bounded Lipschitz ϕ,
$$\nu\left(\varphi^2 \ln \varphi^2\right) - \nu\left(\varphi^2\right) \ln \nu\left(\varphi^2\right) \le C\, \nu\left(|\nabla \varphi|^2\right).$$
Such an inequality yields much useful information, in particular related to concentration of measure, see e.g. [2,28]. As a consequence of the Brascamp-Lieb inequality or of the Bakry-Émery curvature criterion [2], it is well known that, under the condition (Conv), the Gibbs measure π satisfies such an inequality with constant 2 max(1, 1/m). However, the invariant measure of P is not explicit. Nevertheless, the positive Wasserstein curvature (in the sense of [46,26]) obtained in Theorem 1 (or more precisely in Proposition 17) yields a discrete-time Bakry-Émery condition, which in turn yields the following.
Theorem 6. Under the conditions of Theorem 1, π_δ satisfies a log-Sobolev inequality with an explicit constant. Moreover, if µ₀ ∈ P(R^{2d}) satisfies a log-Sobolev inequality with constant C′ > 0 and if (Z_k)_{k∈N} is an OBABO chain with Z₀ ∼ µ₀, then, for all 1-Lipschitz functions ϕ on R^{2d}, n ≥ 1 and u ≥ 0, the probability that the empirical average (1/n) ∑_{k=1}^n ϕ(Z_k) deviates from its long-time limit by more than u is bounded by a Gaussian term of order exp(−c n u²) for an explicit c > 0.

Remark 7.
If µ 0 is a Dirac measure, it satisfies a log-Sobolev inequality with constant C ′ = 0.
Remark 8. In fact, a W₁ contraction (which does not necessarily imply a log-Sobolev inequality) is sufficient to get a similar result, see [16, Corollary 2.6] and [26].
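For context, the route from a log-Sobolev inequality with constant C to Gaussian concentration, as in Theorem 6, is the classical Herbst argument (a standard sketch, not specific to this paper): applying the inequality to $e^{\lambda\varphi/2}$ for a 1-Lipschitz $\varphi$ and setting $\Lambda(\lambda) = \ln \nu(e^{\lambda\varphi})$ gives

```latex
\frac{\mathrm d}{\mathrm d\lambda}\left(\frac{\Lambda(\lambda)}{\lambda}\right) \le \frac{C}{4}
\quad\Longrightarrow\quad
\nu\big(e^{\lambda\varphi}\big) \le e^{\lambda\, \nu(\varphi) + C\lambda^2/4}
\quad\Longrightarrow\quad
\nu\big(\varphi \ge \nu(\varphi) + u\big) \le e^{-u^2/C},
```

the last step by optimizing the Chernoff bound in λ.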

Equilibrium bias and efficiency
The invariant measure π_δ of P is not explicit. Since the OBABO sampler converges to the Langevin diffusion as δ vanishes, we expect π_δ to converge to π. There are many classical ways to quantify this; in view of the previous long-time convergence results, in this work we are naturally interested in an estimation of the Wasserstein and total variation distances between π_δ and π.
Recall the definition of the Verlet map Φ_V in Section 2.2, and consider P_V the corresponding transition operator, i.e. P_V ϕ = ϕ ∘ Φ_V. From the contraction of Wasserstein distances and the Wasserstein/total variation regularization property, it is not difficult to get the following.
and for all n ∈ N*. This is proven in Section 5.
Using that the Verlet map Φ_V preserves the Lebesgue measure on R^{2d} and that it is reversible up to a reflection of the velocity (which preserves the Hamiltonian H), we see that πP_V is the probability law with density proportional to exp(−H ∘ Φ_V⁻¹). This leads us to the study of the Verlet step. In order to control the distance from πP_V to π, a natural way is to consider a Markov operator Q that leaves π invariant and to control the distance between πP_V and πQ = π by constructing a coupling of a transition of P_V with a transition of Q, starting from the same initial condition distributed according to π. If Q is close to P_V, then we will be able to do so while keeping the two chains close. Since P_V has been obtained as the discretization of the Hamiltonian dynamics, a natural candidate for Q is Q_δ, where (Q_t)_{t≥0} is the transition semigroup of this deterministic flow, which indeed leaves π invariant. Comparing the discrete-time Markov chain with its continuous-time limit is indeed the main argument in similar works such as [12,11,18,14]. However, we remark that another candidate is given by the Metropolis-adjusted Verlet transition P_MV = P_R P_MH introduced in Section 2.2. Indeed, π is invariant for P_MV, and P_V and P_MV only differ by an accept/reject step. Comparing P_V with Q_δ or with P_MV leads to two different kinds of results. On the one hand, P_V and Q_δ are deterministic, so there is no real coupling here, only a deterministic numerical error; in particular, it behaves well with W_p for all p ≥ 1. On the other hand, it is not suitable to bound the total variation distance, since the probability that the deterministic flow and the numerical integrator coincide after one step is in general zero. On the contrary, by design, a transition of P_V and of P_MV gives exactly the same result provided the step is accepted in P_MV, which enables one to get information on the total variation distance.
Nevertheless, in case of rejection, because of the velocity reflection, although the initial condition is the same for the two chains, the distance instantaneously gets large. This makes the comparison with P_V^M less and less efficient for bounding W_p as p increases. In a word, when comparing P_V with Q_δ, we get two points that are very close (but distinct) with probability 1 while, when comparing P_V with P_V^M, we get two points that are equal with large probability but distant otherwise.
This analysis is conducted in Section 5. We now present its results.
A first remark is that the main difference between the OBABO sampler and the Markov chain studied in [12] is that the former is based on a second-order integrator, at least formally. Nevertheless, a Taylor expansion of the Verlet algorithm to an order larger than one requires information on derivatives of U of order higher than two. As a consequence, relying only on Assumption (∇Lip), we can only get results similar to [12] for the Wasserstein distances.
Proposition 11. Under the conditions of Theorem 1, for all p ≥ 1, where K_2, K_3, K_4 are given in Propositions 3 and 31 (in particular, K_3 and K_4 converge to L as δ vanishes).
Proposition 11 is a corollary of Propositions 9 and 31, as proven at the end of Section 5.
In order to see the benefits of the second-order approximation, some conditions on higher derivatives of U have to be enforced. Ideally, we would like conditions that are general, simple, and under which the OBABO sampler proves to be efficient. Of course, these three criteria are contradictory and, as a compromise, we study several conditions to get a partial picture of the behavior of the algorithm. We could state a technical ad hoc condition containing exactly what is used in the proof, but then such a condition would have to be checked on particular cases. We will rather focus on two simple and general conditions, which provide a simple proof that there exist situations for which the bias of the chain is of second order in δ, with possibly a nice dependence on the dimension.
The simplest condition one can think of, considered in other works like [18,14,33], is that the Hessian of U is Lipschitz. In fact, we will work under a possibly weaker generalization of this condition: Assumption (∇²pol(ℓ)). There exist ℓ ≥ 2, L_ℓ > 0 and x⋆ ∈ R^d such that, for all x, y ∈ R^d. Note that, up to a change of the constant L_ℓ, if Assumption (∇²pol(ℓ)) holds for some x⋆ then it holds with any other value of x⋆. As a consequence, implicitly, whenever Assumptions (Conv) and (∇²pol(ℓ)) are simultaneously satisfied, we choose x⋆ in (∇²pol(ℓ)) to be the global minimum of U given by (Conv).
As we will see, we still get a second-order error in δ under Assumption (∇²pol(ℓ)) for any ℓ ≥ 2. Nevertheless, for ℓ > 2, due to the presence of additional moments, the dependence on the dimension worsens. However, as illustrated by the next condition, there are cases where, even though the local Lipschitz constant of ∇²U is polynomial, the OBABO sampler is super-efficient.
Assumption (⊥⊥). There exist ℓ ∈ N, L_ℓ > 0, x⋆ = (x⋆,1, …, x⋆,d) ∈ R^d, an orthogonal matrix Q ∈ O(d) and, for i ∈ ⟦1, d⟧, potentials U_i ∈ C²(R) such that U(Qx) = Σ_{i=1}^d U_i(x_i) and, for all i ∈ ⟦1, d⟧ and x, y ∈ R. Similarly to Assumption (∇²pol(ℓ)), when Assumptions (⊥⊥) and (Conv) are both satisfied, x⋆ is implicitly the minimum of U. Assumption (⊥⊥) means that the target distribution is separable, namely that, in a suitable orthonormal basis, the coordinates of a random variable distributed according to π are independent. It is important to note that we do not need to know Q. Indeed, contrary to e.g. the Zig-Zag sampler [3] or, more generally, Gibbs algorithms, in the OBABO sampler the basis does not play a particular role, i.e. if (x_n, v_n)_{n∈N} is an OBABO chain for some potential U then (Q^{-1}x_n, Q^{-1}v_n)_{n∈N} is an OBABO chain with potential x ↦ U(Qx). As a particular case, Assumption (⊥⊥) is satisfied for all Gaussian distributions (with ℓ = 2 and L_ℓ = 0).
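The basis-invariance claim above can be checked numerically. The following sketch (our illustrative code; the names `verlet`, `grad_U`, etc. are ours) verifies, for the deterministic Verlet part of the transition, that a step for the potential x ↦ U(Qx) is exactly the Q-rotation of a step for U; for the full OBABO step the same holds in distribution, since the law N(0, I_d) of the Gaussian inputs is rotation invariant.

```python
import numpy as np

rng = np.random.default_rng(0)
d, delta = 4, 0.1

# A separable potential U(x) = sum_i (x_i^4/4 + x_i^2/2) and its gradient.
grad_U = lambda x: x**3 + x

# A random orthogonal matrix Q (Q-factor of a Gaussian matrix).
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
# Gradient of the rotated potential x -> U(Qx) is Q^T grad_U(Qx).
grad_U_tilde = lambda x: Q.T @ grad_U(Q @ x)

def verlet(x, v, grad, h):
    """One BAB (velocity Verlet) step for the potential with gradient `grad`."""
    v = v - 0.5 * h * grad(x)
    x = x + h * v
    v = v - 0.5 * h * grad(x)
    return x, v

x, v = rng.normal(size=d), rng.normal(size=d)
x1, v1 = verlet(Q @ x, Q @ v, grad_U, delta)   # step in the original basis
x2, v2 = verlet(x, v, grad_U_tilde, delta)     # step in the rotated basis
assert np.allclose(Q.T @ x1, x2) and np.allclose(Q.T @ v1, v2)
```

The same computation, carried out symbolically, is the commutation property Q^{-1}Φ_V = Φ_V Q^{-1} used later in the proofs.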
Although the independence assumption is very restrictive, it is the classical condition under which scaling limits are established for MCMC algorithms, see e.g. [15] and references therein. Non-asymptotic bounds for the HMC algorithm for a separable target have been established in [36], and under a similar but weaker condition in [37].
Remark 13. In the Gaussian case, π_δ is an explicit Gaussian measure and thus W_2(π, π_δ) is known. More precisely, in that case, we get W_2(π, π_δ) ≃ δ²√d L/4 as δ → 0 (we refer to the upcoming work [39] for the detailed analysis of the Gaussian case), which means that the δ²√d dependency under Assumption (⊥⊥) is sharp.
On the contrary, for the dependency on d of the total variation bound, the results of [33] give a bound of order √d under condition (∇²pol(ℓ)) with ℓ = 2, while we only get d^{3/4}. We discuss this in more detail in Section 3.4.
Although one can design academic examples where (∇²pol(ℓ)) holds for some ℓ > 2 but not for ℓ = 2 while (∇Lip) holds, this does not correspond to practical cases of interest. The reason why we decided to work in a framework allowing ℓ > 2 is the following. For potentials that behave like U(x) = |x|^α with α > 2 at infinity, (∇²pol(ℓ)) holds with ℓ > 2 but not with ℓ = 2. On the other hand, (∇Lip) does not hold, but it can be replaced by a condition similar to (∇²pol(ℓ)) bearing on ∇U rather than ∇²U. Moreover, the OBABO integrator becomes unstable and should be replaced by an adaptive time-step/tamed version. Adapting our analysis to this case, which is similar but more involved, is beyond the scope of the present work, but we wanted to highlight that working with such local Lipschitz-continuity conditions is possible and possibly leads to a higher dependency on the dimension of the bias estimates, due to additional moments, but not necessarily, as illustrated by the separable case.

The Metropolis-adjusted chain
The continuous-time Langevin diffusion has, in the smooth and convex case, a convergence rate that is independent of the dimension and, as seen in Theorem 1, the OBABO chain has a convergence rate that depends on the dimension only, possibly, through the time step δ. Hence, the number of steps of the chain required in order to sample the target measure is of order 1/δ (up to logarithmic terms). On the other hand, the rejection probability in the OM(BAB)O scheme is of order δ². This means that, as δ vanishes, there is a high probability that no rejection occurs during the whole simulation. Consequently, results obtained for the unadjusted OBABO chain are straightforwardly transferred to the OM(BAB)O chain. For simplicity, we focus on the total variation distance and confidence intervals, but similar adaptations could be conducted for Wasserstein distances.
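To fix ideas, one OM(BAB)O transition can be sketched as follows (our illustrative code, in one dimension; the function names are ours). The accept/reject rule min(1, e^{−ΔH}) with velocity flip on rejection is the standard Metropolis adjustment of the Verlet proposal, matching the description in Section 2.2:

```python
import math, random

def om_bab_o_step(x, v, grad_U, H, delta, gamma, rng):
    eta = math.exp(-gamma * delta / 2.0)
    # O: partial velocity refreshment (damping + Gaussian noise).
    v = eta * v + math.sqrt(1.0 - eta * eta) * rng.gauss(0.0, 1.0)
    # BAB: velocity Verlet proposal.
    v_half = v - 0.5 * delta * grad_U(x)
    x_new = x + delta * v_half
    v_new = v_half - 0.5 * delta * grad_U(x_new)
    # M: accept with probability min(1, exp(-(H_new - H_old))); flip v otherwise.
    if rng.random() < math.exp(min(0.0, H(x, v) - H(x_new, v_new))):
        x, v = x_new, v_new
    else:
        v = -v
    # O: second half of the velocity refreshment.
    v = eta * v + math.sqrt(1.0 - eta * eta) * rng.gauss(0.0, 1.0)
    return x, v

# Usage on the standard Gaussian target, U(x) = x^2/2.
rng = random.Random(1)
grad_U = lambda x: x
H = lambda x, v: 0.5 * (x * x + v * v)
x, v, xs = 3.0, 0.0, []
for _ in range(20000):
    x, v = om_bab_o_step(x, v, grad_U, H, delta=0.5, gamma=2.0, rng=rng)
    xs.append(x)
mean = sum(xs) / len(xs)            # should be close to 0
m2 = sum(s * s for s in xs) / len(xs)   # should be close to 1
```

Since the Metropolis step makes π exactly invariant, the empirical moments match those of the target even for a fairly large step size.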
2. If condition (∇²pol(ℓ)) holds and ν ∈ P_{ℓ+1}(R^{2d}), considering K_{10} as in Corollary 35,
3. If condition (⊥⊥) holds and ν ∈ P_{ℓ+1}(R^{2d}), considering K_{10} as in Corollary 35 and, for i ∈ ⟦1, d⟧, denoting by
This is proven at the end of Section 6. To fix ideas, taking first the limit as δ → 0 and then the leading terms as L, L_ℓ → +∞ and m → 0 (with L_ℓ ≪ L²), we get
Applied with µ = π (so that µP_M^n = π) and in combination with Corollary 5, this gives a non-asymptotic bound on ‖νP_M^n − π‖_TV.
Remark 15. Contrary to Corollary 5 for the OBABO chain, Proposition 14 does not provide a bound on the total variation convergence rate of the OM(BAB)O chain. Nevertheless, combining Proposition 14 with Corollary 5 yields local coupling estimates, i.e. for all R > 0, a suitable choice of δ and n ensures that
Contrary to the one-step Doeblin condition considered in [48], these n-step coupling bounds have a reasonably nice scaling with δ and d.
Similarly, the non-asymptotic confidence intervals of Theorem 6 can be transferred to the Metropolis-adjusted chain.
Theorem 16. Under the conditions of Theorem 1, if µ_0 ∈ P(R^{2d}) satisfies a log-Sobolev inequality with constant C′ > 0 and if (Z_k)_{k∈N} is an OM(BAB)O chain with Z_0 ∼ µ_0, then for all 1-Lipschitz functions ϕ on R^{2d}, n ≥ 1 and u ≥ 0, where C_lS is given in Theorem 6 and E(ν, δ, n, d) is as in Proposition 14.
This is proven at the end of Section 6.

Conclusion and related works
Let us summarize the previous results, focusing on the behavior as δ → 0 and d → +∞ for fixed values of p, m, L, L_ℓ, ℓ and γ = 2√L. We suppose that the initial condition µ_0 is such that, for p ≥ 1, W_p(µ_0, π) + M_{µ_0,p}^{1/p} ≤ K̃_p √d for some K̃_p that does not depend on d, which is feasible in practice (see Remark 7 and Lemma 30). For the dependency on L, m, L_ℓ, we only write the leading terms as L, L_ℓ → +∞ and m → 0.
Choosing those as non-asymptotic criteria for efficiency is debatable, see the discussion in [13], in particular for a different scaling for the Wasserstein distances; nevertheless, we stick to these definitions for comparison with previous works. Table 1 presents the discretization bias and Table 2 the efficiency bounds that are obtained from our various results. For simplicity, in Table 2, concerning the dependency on L, m and L_ℓ, we only consider the case where the term with L_ℓ is negligible with respect to the other one (as in the Gaussian case where L_ℓ = 0) and we do not write the logarithmic terms in these variables. Let us now compare our results to previous works.
First, we mention that, going back at least to the work of Talay [54] (who was also already concerned with the discretized chain), there has been a substantial number of works in the last two decades on obtaining quantitative convergence rates for the continuous-time (underdamped) Langevin diffusion (1), leading to the theory of hypocoercivity [57,17,24]. In the convex and smooth case, the dimension-free convergence is established by the author in [40] for mean-field particles using entropic hypocoercivity techniques (implying the Wasserstein convergence, see e.g. [22] for details), and later by [12,14,59] by direct coupling arguments in the context of MCMC. The Wasserstein contraction induced by the parallel coupling in the smooth and convex case was first established in [4] (but without the explicit dependency on the dimension). In the non-convex case, quantitative results are obtained through an explicit combination of reflection and parallel coupling in [21,13,11]. Using hypocoercive techniques, one of the first results on underdamped Langevin algorithms motivated by stochastic algorithms (MCMC and simulated annealing) is established in [42], focusing on low-temperature rather than high-dimensional estimates. In the recent [33], a similar result is established, focusing on high-dimensional MCMC (in both [33] and [42], the convergence is obtained in terms of the log-Sobolev constant of the target distribution; the main difference is thus how the dependency on the parameters of interest is presented).
By contrast, Theorem 1 and Corollary 5 give explicit quantitative convergence rates for a discrete-time Markov chain obtained as a discretization of (1). As mentioned in the introduction, this leads to the concentration inequalities of Theorem 6 and was motivated by the work [48]. That being said, in [12,14,59,13,11,33], a discretization error analysis is conducted which, together with the convergence rate of the continuous-time process, ultimately provides non-asymptotic bounds for the discrete-time algorithm. In the smooth and convex case, up to logarithmic terms, W_2 efficiency bounds of order √d/ε are obtained in [12,59] with a first-order approximation of (1), and of order √(d/ε) in [14] with a second-order scheme (which requires the computation of ∇²U at each iteration) under the Lipschitz Hessian condition (which corresponds to (∇²pol(ℓ)) with ℓ = 2). These scalings are thus similar to our results on the OBABO chain under similar conditions. In [33], using a second-order scheme with a complexity similar to that of the OBABO chain, efficiency bounds of order d/ε are obtained for the relative entropy (Kullback-Leibler divergence) under the Lipschitz Hessian condition. Through Pinsker's and Talagrand's inequalities, this yields W_2 and total variation efficiency bounds of order √d/ε (since the results have to be applied with ε replaced by ε²), to be compared to our respective results of √(d/ε) and d^{3/4}/√ε for these distances under (∇²pol(ℓ)) with ℓ = 2. Thus, for the unadjusted process, under the same conditions as [33], we have a better dependency in terms of ε for both the Wasserstein and total variation distances, the same dependency on the dimension for the Wasserstein distance, but a worse dependency on the dimension (d^{3/4} versus √d) for the total variation distance. In the separable case, we recover the √d scaling for the total variation distance and improve the scaling to d^{1/4} for the Wasserstein one.
Our method is very different from that of [33]. Comparing the discrete scheme with the continuous process, they get a bound on the relative entropy that scales as d, which gives √d for the total variation. We get a bound on the total variation by comparing the unadjusted process with the Metropolis-adjusted one. We can see that the bad dependency on d for the total variation in our case stems from the fact that we bound the expected Metropolis rejection probability by a cubic term (Lemma 28 with ℓ = 2) whose expectation is of order d^{3/2}. The bounds of Lemma 28 seem to be sharp, so it is the method of comparing the unadjusted and adjusted processes which leads to a non-optimal scaling. In some sense, by requiring the two processes to stay equal, we do not use the regularization properties of the process. By working with the relative entropy, the law of the unadjusted process can be compared with the law of the continuous-time process: the processes (driven by the same Brownian motion) do not stay equal but remain close which, together with the regularization properties of the process, is in fact sufficient to control the relative entropy of the laws (roughly speaking), hence the total variation.
Similarly to the results on the convergence rate of the Langevin diffusion, several works are concerned with the convergence rates of idealized HMC algorithms, i.e. algorithms that rely on the exact simulation of the continuous-time Hamiltonian dynamics [52,29,15,36,5]. Non-asymptotic bounds on the unadjusted HMC are obtained in [36], of order √d/ε for the W_1 distance in the smooth and convex case with a first-order scheme, of order d^{1/4}/√ε in the separable case (and later in [37] under a weaker condition), and similarly for the Metropolis-adjusted algorithm. Again, this is similar to our rates for the OBABO chain. The works [20,10] are concerned with Metropolis-adjusted algorithms (the Metropolis-adjusted overdamped Langevin algorithm, MALA, and HMC), and establish total variation efficiency bounds that are logarithmic in the accuracy ε. Their method, based on conductance bounds, is quite different from ours. In terms of the number of gradient computations, the scalings are d for the MALA algorithm and d^{11/12} for the HMC one in the smooth, convex, Lipschitz Hessian case. In terms of dependency on d, for the Metropolis-adjusted case, we only get d in this case, and √d in the separable case. It would be interesting to use the methods of [20,10] for the OM(BAB)O chain (see also Remark 15).
For more considerations on the family of samplers based, like the HMC process and the Langevin diffusion, on Hamiltonian dynamics, we also refer to the recent preprint [53] and references therein.

Study of the OBABO chain

Wasserstein contraction in the convex case
For z = (x, v) ∈ R^{2d} and a, b > 0 with b² < a, we consider the Euclidean norm and Θ = (Θ_1, Θ_2). The transition (2) of the OBABO chain is thus given by (x_1, v_1) = Θ(x_0, v_0, G, G′) with independent G, G′ ∼ N(0, I_d). We consider two initial conditions (x_0, v_0), (y_0, w_0) ∈ R^d × R^d, two independent sequences of independent standard Gaussian variables (G_n)_{n∈N} and (G′_n)_{n∈N}, and the parallel coupling of two OBABO chains given by (x_{n+1}, v_{n+1}) = Θ(x_n, v_n, G_n, G′_n) and (y_{n+1}, w_{n+1}) = Θ(y_n, w_n, G_n, G′_n). Then b² ≤ a/4 and, for all (x_0, v_0), (y_0, w_0) ∈ R^{2d}, the parallel coupling given by (5) is such that, almost surely, for all n ∈ N,
Remark 18. It is possible to improve the condition on γ and the other estimates by assuming that U(x) = x · S^{-1}x + Ũ(x) for some symmetric positive matrix S and then taking S into account in the definition of the modified norm with more care. In particular, in the Gaussian case (Ũ = 0) there is no condition on γ: it is always possible to design a Euclidean norm that is contracted by the coupling for sufficiently small δ, see e.g. [43]. Nevertheless, in this work we are mainly concerned with the fact that κ is independent of the dimension.
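The parallel coupling can be illustrated numerically. The following sketch (our code; for simplicity it monitors the plain Euclidean distance rather than the modified norm ‖·‖_{a,b}) runs two OBABO chains driven by the same Gaussians on a quadratic potential, where the noise cancels in the difference of the two chains, so the contraction is deterministic, in line with Remark 19:

```python
import math, random

def obabo_step(x, v, grad_U, delta, gamma, g, g_prime):
    eta = math.exp(-gamma * delta / 2.0)
    v = eta * v + math.sqrt(1.0 - eta * eta) * g        # O half-step
    v = v - 0.5 * delta * grad_U(x)                     # B
    x = x + delta * v                                   # A
    v = v - 0.5 * delta * grad_U(x)                     # B
    v = eta * v + math.sqrt(1.0 - eta * eta) * g_prime  # O half-step
    return x, v

rng = random.Random(0)
grad_U = lambda x: x          # U(x) = x^2/2, so m = L = 1
delta, gamma = 0.1, 2.0       # gamma = 2 sqrt(L)

(x, v), (y, w) = (5.0, -1.0), (-3.0, 2.0)
d0 = math.hypot(x - y, v - w)
for _ in range(200):
    g, gp = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    # Same (g, gp) for both chains: this is the parallel coupling.
    x, v = obabo_step(x, v, grad_U, delta, gamma, g, gp)
    y, w = obabo_step(y, w, grad_U, delta, gamma, g, gp)
d_final = math.hypot(x - y, v - w)
assert d_final < 0.05 * d0    # the two copies have (almost) merged
```

For a non-quadratic convex potential the difference is no longer deterministic, but the almost-sure contraction of Proposition 17 still holds in the norm ‖·‖_{a,b}.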

Remark 19.
The contraction is almost sure, and not only in expectation. In fact, as will be clear in the proof, the particular law of (G, G′) plays no role: it is sufficient that it does not depend on the position (x, v).
Proof. Let us show that, for this choice of parameters, for all x, v, y, w, g, g′ ∈ R^d, denoting (x′, v′) = Θ(x, v, g, g′) and (y′, w′) = Θ(y, w, g, g′), which will immediately yield the result. As a first step, we identify the leading terms (in terms of δ) of the evolution of the norm. The contraction will be proven thanks to these terms only, the higher-order ones being treated as a small perturbation. As a consequence, our argument works for other Markov chains that are an approximation of order at least one (in δ) of the continuous-time Langevin diffusion.
The equivalence of ‖·‖_{a,b} with the Euclidean norm implies the equivalence of the corresponding Wasserstein distances. From this equivalence and Proposition 17, it is then straightforward to obtain the following, which proves Theorem 1.

Corollary 20.
Under the assumptions and with the notations of Proposition 17, for all p ≥ 1, all ν, µ ∈ P_p(R^{2d}) and all n ∈ N, Moreover, P admits a unique invariant probability distribution π_δ, and π_δ ∈ P_p(R^{2d}) for all p ≥ 1.
Remark 21. The one-step contraction (8) implies that P has a positive Wasserstein curvature δκ/2 in the sense of [46,26] with respect to the metric ‖·‖_{a,b}.

Wasserstein/total variation regularization
While the proof of [23, Corollary 4.7] for the continuous-time process relies on functional-inequality arguments, our proof is a direct application of optimal couplings of Gaussian variables, in the spirit of [18,34]. As can be seen by conditioning with respect to the initial condition (distributed according to an optimal W_1 coupling), Proposition 3 is an immediate corollary of the following.

Proposition 22. Under Assumption
Proof. As already mentioned, for a fixed initial condition, the transition law is not Gaussian. Nevertheless, we are going to use successively two Gaussian couplings, coupling first the positions thanks to the random variable G and then the velocities with the random variable G′. In particular, if the first coupling is a success, then the non-Gaussian part ∇U(x′) − ∇U(y′) vanishes. This two-step coupling (position first, then velocity) is natural in view of the hypoelliptic structure of the Langevin diffusion. Note that the choice of the OBABO sampler is particularly convenient here, since the damping part is split into two half-steps, contrary to e.g. the BAOAB scheme [30] (although one would just have to consider two transitions of the Markov chain). Recall the following result on the optimal coupling of Gaussian distributions, for x, y ∈ R^d and σ > 0. Indeed, orthogonal coordinates being independent, we just have to couple the projections on x − y, so that the d-dimensional case boils down to the one-dimensional case, which is [34, Lemma 15].
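This optimal Gaussian coupling can be implemented as follows (our illustrative sketch): sample X from the first law, accept Y = X with probability min(1, p_y(X)/p_x(X)), and otherwise reflect X about the midpoint (x + y)/2, which preserves the second marginal; the coupling then meets with probability 1 − ‖N(x, σ²) − N(y, σ²)‖_TV.

```python
import math, random

def coupled_gaussians(x, y, sigma, rng):
    """Maximal reflection coupling of N(x, sigma^2) and N(y, sigma^2)."""
    X = rng.gauss(x, sigma)
    # log of the density ratio p_y(X) / p_x(X)
    log_ratio = ((X - x) ** 2 - (X - y) ** 2) / (2.0 * sigma ** 2)
    if math.log(rng.random()) <= log_ratio:
        return X, X          # coupling succeeds: X = Y
    return X, x + y - X      # reflection about (x + y)/2 keeps the marginal

# Empirical check of the meeting probability against 2 Phi(-|x - y| / (2 sigma)).
rng = random.Random(0)
x, y, sigma, n = 0.0, 1.0, 1.0, 200000
meet = sum(a == b for a, b in (coupled_gaussians(x, y, sigma, rng)
                               for _ in range(n)))
phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # normal CDF
theory = 2.0 * phi(-abs(x - y) / (2.0 * sigma))
assert abs(meet / n - theory) < 0.01
```

In the proof this one-dimensional construction is applied along the direction x − y, the orthogonal coordinates being coupled identically.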
In other words, G_1 and G_2 are both standard Gaussian random variables and such that and Conditionally on (x′_1, x′_2), denote Set The fact that, conditionally on (x′_1, v′_1), G′_1 is a standard Gaussian random variable implies that G′_1 is independent of (G_1, G_2) (and similarly for G′_2). As a consequence, (9) and (12) with G_1 and G′_1 (resp. G_2 and G′_2), which are two independent standard Gaussian variables, prove that ((2) corresponding to P). In particular, We now bound this probability. Remark that Hence, under the event Since As a conclusion, using (10) and (11),

Concentration inequality
We first state the following general result (proven in [ Suppose moreover that there exists C > 0 such that R(z, ·) satisfies a log-Sobolev inequality with constant C for all z ∈ R^d. Then: 1. If µ ∈ P(R^d) satisfies a log-Sobolev inequality with constant C′, then µR satisfies a log-Sobolev inequality with constant r²C′ + C.
3. If r < 1, µ ∈ P(R^d) satisfies a log-Sobolev inequality with constant C′ and (Z_n)_{n∈N} is a Markov chain associated to R with initial condition Z_0 ∼ µ, then for all n ∈ N*, all u ≥ 0 and all 1-Lipschitz functions ϕ,
Remark 24. The first point generalizes the fact that log-Sobolev inequalities are transported by Lipschitz transformations, which corresponds to the case Rϕ = ϕ ∘ Ψ with Ψ an r-Lipschitz function (in that case, C = 0).
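The Gaussian moment bound on which the third claim relies is the classical Herbst argument; a sketch, under the normalization Ent_µ(f²) ≤ C′ µ(|∇f|²) of the log-Sobolev inequality (our convention for this sketch):

```latex
% Take f^2 = e^{\lambda\varphi} with \varphi 1-Lipschitz, so that
% |\nabla e^{\lambda\varphi/2}|^2 \le (\lambda^2/4)\, e^{\lambda\varphi}:
\operatorname{Ent}_\mu\!\big(e^{\lambda\varphi}\big)
  \;\le\; \frac{C'\lambda^2}{4}\,\mu\!\big(e^{\lambda\varphi}\big).
% With H(\lambda) = \mu(e^{\lambda\varphi}), the left-hand side equals
% \lambda H'(\lambda) - H(\lambda)\log H(\lambda), hence
\frac{\mathrm{d}}{\mathrm{d}\lambda}\left(\frac{\log H(\lambda)}{\lambda}\right)
  \;\le\; \frac{C'}{4},
\qquad
\lim_{\lambda\to 0}\frac{\log H(\lambda)}{\lambda} = \mu(\varphi),
% and integrating in \lambda yields the bound used in the proof below:
\mu\!\big(e^{\lambda\varphi}\big)
  \;\le\; e^{C'\lambda^2/4}\, e^{\lambda\mu(\varphi)}.
```

For a β-Lipschitz ϕ, applying this to ϕ/β gives the form with the factor β² appearing in the proof.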
Proof. By density, we can consider a smooth, positive, Lipschitz f. Denoting g = Rf², we decompose Now, by the Cauchy-Schwarz inequality, which concludes the proof of the first claim. The second claim immediately follows by induction and the weak convergence of µR^n toward ν, which follows from the Wasserstein contraction implied by the gradient/operator subcommutation, see [27, Theorem 2.2].
Concerning the third claim, if µ ∈ P(R^d) satisfies a log-Sobolev inequality with constant C′ > 0 then, for all β-Lipschitz ϕ and all λ ∈ R, µ(e^{λϕ}) ≤ e^{C′β²λ²/4} e^{λµ(ϕ)}, see e.g. [28]. Using successively Chernoff's inequality, the Markov property and (13) together with the local log-Sobolev inequality satisfied by R, given any 1-Lipschitz ϕ, λ > 0 and t ∈ R, Using that ϕ + Rϕ is (1 + r)-Lipschitz, (13) and then a straightforward induction, we get Applying this with t = Σ_{k=1}^n µR^k ϕ + u for u ≥ 0 reads Using that ν is invariant by R, that Σ_{k=1}^n R^k ϕ is an r/(1 − r)-Lipschitz function for all n ≥ 1 and the dual representation of W_1 (see e.g. [58]),
In the rest of this section, considering a, b given by Proposition 17, we consider the matrix We also denote by P_{a,b} the transition operator of (S(x_n, v_n))_{n∈N}, where (x_n, v_n)_{n∈N} is an OBABO chain. This change of variables is meant for simplicity, in order to work with the classical Euclidean metric and gradient in the following.
Moreover, for all z ∈ R^{2d}, P_{a,b}(z, ·) satisfies a log-Sobolev inequality with constant
Proof. Using the notations of Proposition 17, the latter means that, setting z_n = (x_n, v_n) and z′_n = (y_n, w_n), almost surely, for all z_0, z′_0. This implies that P_{a,b} contracts the W_∞ distance (for the classical Euclidean metric) by a factor √(1 − δκ), which implies the claimed gradient/Markov operator subcommutation, as proven in [27, Theorem 2.2].

Bias analysis
First, we prove Proposition 9, which relates the equilibrium error between π and π δ to the one-step error between π and πP V .
Proof of Proposition 9. We use the notations of Proposition 17. Using the invariance of π_δ by P and Corollary 20,
W_{p,a,b}(π_δ, π) ≤ W_{p,a,b}(π_δ P, πP) + W_{p,a,b}(πP, π) ≤ (1 − δκ/2) W_{p,a,b}(π_δ, π) + W_{p,a,b}(πP, π),
where we used that which, together with the equivalence of |·| and ‖·‖_{a,b} (see (7)), gives where K_1 is given in Theorem 1. Now, decomposing P = P_O P_V P_O, where P_O has been introduced in Section 2.2, we remark that P_O does not increase W_p, as can be seen with a parallel coupling. Indeed, let ( Moreover, π is invariant by P_O, so that For the total variation distance, similarly, for n ≥ 1, where we used Corollary 5 and the fact that the total variation is decreased by any Markov operator. Using again that π is invariant by P_O, which concludes the proof.
As mentioned in Section 3.2, in the following we compare P_V to two reference transitions. The first one is P_V^M, the Metropolis-adjusted Verlet transition introduced in Section 2.2, with acceptance probability α. The second is Q_δ, where (Q_t)_{t≥0} is the deterministic transition operator of the Hamiltonian dynamics: given (x, v) ∈ R^{2d} and considering (x_t, v_t)_{t≥0} the solution of
Lemma 26. For ν ∈ P(R^{2d}) and (X, V) ∼ ν, If, moreover, ν ∈ P_p(R^{2d}) for some p ≥ 1, then
Proof. The only difference between P_V and P_V^M is the accept/reject step. Considering (X, V) ∼ ν and, independently, W a random variable uniformly distributed over [0, 1], set Similarly, Finally, the last claim follows from the fact that (Φ_V(X, V), Φ_HD(X, V)) is a coupling of νP_V and νQ_δ.
When comparing a Metropolis-adjusted Verlet step with an unadjusted one, in case of rejection, the distance between the two steps does not vanish with δ.
1. Under Assumption (∇Lip), for all (x, v) ∈ R^{2d}, writing Z = |v| + (δ/2)|∇U(x)|,
2. Under Assumptions (∇Lip) and (∇²pol(ℓ)), for all (x, v) ∈ R^{2d}, writing Z = |v| + (δ/2)|∇U(x)|,
Proof. Fix (x, v) ∈ R^{2d}. Assumption (∇Lip) holds in the whole proof. Consider (x_t, v_t)_{t≥0} given by (14) and (y_t, w_t)_{t≥0} the solution of In other words, Bounding (t − u) by t and applying Grönwall's lemma, we get, for all t ∈ [0, δ], Similarly, for all t ∈ [0, δ], As a consequence, The conclusion of the proof of the first claim follows from the previous estimates and Note that |v_δ − w_δ| yields a first-order error (in δ). Now, if an additional assumption is available, in order to see that the Verlet scheme is in fact a second-order integrator, we should rather directly bound As in the proof of Lemma 28 (more precisely, (15)), denoting g(s) = ∇U(x_s), Under Assumption (∇²pol(ℓ)), Gathering the previous bounds,
Recall that π_1 denotes the first marginal of π, namely the probability law on R^d with density ∝ exp(−U). Similarly, denote by π_2 the second marginal of π, which is simply the standard Gaussian distribution on R^d.
Lemma 30. Under Assumptions (∇Lip) and (Conv), for all β ≥ 0,
Proof. Consider L = −∇U · ∇ + ∆, the generator of the overdamped Langevin diffusion, which leaves π_1 invariant. For ζ ≥ 1 and ϕ(x) = |x|^{2ζ}, The invariance of π_1 implies that E_{π_1}(Lϕ) = 0, so that, from Jensen's inequality, and thus This concludes the case β = 2ζ ≥ 2. The case β ∈ [0, 2) is obtained by Jensen's inequality.
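The second-order character of the Verlet scheme used above can be observed on a toy example (our code): for the harmonic potential U(x) = x²/2 the Hamiltonian flow is an explicit rotation, and halving the step size divides the one-step error by roughly 2³ = 8, consistent with a local error of order δ³:

```python
import math

def verlet(x, v, h):
    """One BAB (velocity Verlet) step for U(x) = x^2/2, i.e. grad U(x) = x."""
    v -= 0.5 * h * x
    x += h * v
    v -= 0.5 * h * x
    return x, v

def one_step_error(h, x=1.0, v=0.5):
    """Distance between one Verlet step and the exact Hamiltonian flow."""
    xv, vv = verlet(x, v, h)
    # Exact flow of the harmonic oscillator: a rotation of angle h.
    xe = x * math.cos(h) + v * math.sin(h)
    ve = -x * math.sin(h) + v * math.cos(h)
    return math.hypot(xv - xe, vv - ve)

e1, e2 = one_step_error(0.1), one_step_error(0.05)
assert 6.0 < e1 / e2 < 10.0   # ratio close to 2^3 = 8: local error O(h^3)
```

Over a time horizon of order 1/δ steps, these local errors accumulate into the global second-order (in δ) bias quantified above.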
Applying this result to the potential U(x) = |x|²/2 (so that m = 1) bounds the moments of the velocity.
We can now gather the results of the last four lemmas to get the following.
1. Under Assumptions (∇Lip) and (Conv), for all p ≥ 1,
2. Under Assumptions (∇Lip), (Conv) and (∇²pol(ℓ)), for all p ≥ 1,
Proof. Both cases are similar, so we only detail the first one. For h ≥ 1, denote and For p ≥ 1, using Lemmas 26 and 29 and the invariance of π by Q_δ, where we used Lemma 30. Similarly, from Lemmas 26, 28 and 30, Remark that, in the computations of the second case, to simplify the last term involved in the bound on |Φ_V − Φ_HD| in Lemma 29, we use that Using then Lemma 30 to bound yields the claimed expression for K_6. Similarly, to bound the last term given in Lemma 28, we use that and proceed similarly to K_6 to get the expression of K_8.
The case of a separable target ensues from the following result.
Lemma 32. Let p ≥ 2 and, for i ∈ ⟦1, d⟧, let ν_i, µ_i ∈ P_p(R), and set ν = ⊗_{i=1}^d ν_i and µ = ⊗_{i=1}^d µ_i. Then
Proof. For all i ∈ ⟦1, d⟧, let (X_i, Y_i) be an optimal W_p coupling of ν_i and µ_i, with (X_i, Y_i) independent of (X_j, Y_j) for j ≠ i. Then X = (X_1, …, X_d) and Y = (Y_1, …, Y_d) are respectively distributed according to ν and µ so that, for p ≥ 2, Alternatively, if the (X_i, Y_i)'s are optimal couplings of ν_i and µ_i for the total variation distance, then
As a conclusion of the analysis of the equilibrium bias:
Proof of Propositions 11 and 12. Simply combine Propositions 9 and 31. More precisely, for the total variation, to prove Proposition 12, apply Proposition 9 with
n := 1 + ⌈ |ln(δ³d)| / |ln(1 − δκ)| ⌉ ≤ 2 + |ln(δ³d)| / (δκ),
so that (1 − δκ)^{(n−1)/2} ≤ δ^{3/2} √d.
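The tensorization in Lemma 32, for the total variation part, can be checked exactly on a small discrete example (our code; Bernoulli marginals stand in for ν_i, µ_i, and the product total variation is computed exhaustively):

```python
import itertools

def tv(p, q):
    """Total variation between two finite distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def product_law(margs):
    """Exact product distribution of a list of finite 1D laws."""
    law = {}
    for outcome in itertools.product(*[m.items() for m in margs]):
        key = tuple(k for k, _ in outcome)
        prob = 1.0
        for _, pr in outcome:
            prob *= pr
        law[key] = prob
    return law

nus = [{0: 0.7, 1: 0.3}, {0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}]
mus = [{0: 0.6, 1: 0.4}, {0: 0.2, 1: 0.8}, {0: 0.9, 1: 0.1}]
lhs = tv(product_law(nus), product_law(mus))       # exact TV of the products
rhs = sum(tv(n, m) for n, m in zip(nus, mus))      # sum of marginal TVs
assert lhs <= rhs + 1e-12
# The bound is strict here: the identical third marginals contribute zero
# to the sum, while dependence-free cancellations reduce the product TV.
```

Running the example gives lhs = 0.3 against rhs = 0.4, illustrating that the coordinatewise coupling of the proof is an upper bound, not an identity, for the total variation.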
The second claim is obtained simply by noting that an OBABO chain and an OM(BAB)O chain starting at the same point and sampled with the same random variables remain equal up to the first rejection of the OM(BAB)O chain. More precisely, let (G_n, G′_n, W_n)_{n∈N} be an i.i.d. sequence of random variables such that, for all n ∈ N, G_n, G′_n and W_n are independent, G_n, G′_n ∼ N(0, I_d) and W_n is uniformly distributed over [0, 1]. Let Z_0 ∼ ν be independent of these variables. We construct (Z_k)_{k∈N} and (Z̃_k)_{k∈N} as the OBABO and OM(BAB)O chains with initial condition Z_0 and whose transitions at step n are given respectively in Section 2.1 with variables G_n, G′_n or in Section 2.2 with variables G_n, G′_n, W_n. In other words, the damping parts O in both chains use the same Gaussian variables, and the only difference between Z = (X, V) and Z̃ = (X̃, Ṽ) is that the proposal given by the Verlet step BAB is always accepted for Z, and is accepted iff W_n ≤ α(Z̃′_n) for Z̃, where Z̃′_n = (X_n, ηṼ_n + √(1 − η²) G_n). In particular, the two chains are equal up to the first rejection; in other words, for all n ≥ 1, P(Z_k = Z̃_k ∀k ∈ ⟦1, n⟧) = P(W_k ≤ α(X_k, ηV_k + √(1 − η²) G_k) ∀k ∈ ⟦0, n − 1⟧).
Bounding the probability of the union by the sum of the probabilities and using that (X_k, ηV_k + √(1 − η²) G_k) ∼ νP^k P_O for all k ∈ N, we get that A := P(∃k ∈ ⟦1, n⟧ : Z_k ≠ Z̃_k) and, for u ≥ 0, we bound
P( |(1/n) Σ_{k=1}^n ϕ(Z̃_k) − π(ϕ)| ≥ u ) ≤ A + P( |(1/n) Σ_{k=1}^n ϕ(Z_k) − π(ϕ)| ≥ u ).
In the separable case (condition (⊥⊥)), denoting (y, w) = (Q^{-1}x, Q^{-1}v) and H_i(y_i, w_i) = U_i(y_i) + |w_i|²/2 for i ∈ ⟦1, d⟧, using that H(x, v) = Σ_{i=1}^d H_i(y_i, w_i) and that Q^{-1}Φ_V = Φ_V Q^{-1}, we get that where α_i is the acceptance probability of a one-dimensional OM(BAB)O chain with potential U_i. Bounding the probability of the union by the sum of the probabilities, we get that 1 − α(x, v) ≤ Σ_{i=1}^d (1 − α_i(y_i, w_i)).
Proof of Proposition 14. Simply combine Lemma 33 and Corollary 35.
Proof of Theorem 16. This follows from the concentration bound for the unadjusted chain (Theorem 6), the bound between the adjusted and unadjusted processes obtained by combining Lemma 33 and Corollary 35, and the dual representation of the W_1 distance (see e.g. [58]), which implies that |π(ϕ) − π_δ(ϕ)| ≤ W_1(π, π_δ) for all 1-Lipschitz functions ϕ.