Higher Order Langevin Monte Carlo Algorithm

A new (unadjusted) Langevin Monte Carlo (LMC) algorithm with improved rates in total variation and in Wasserstein distance is presented. All these are obtained in the context of sampling from a target distribution $\pi$ that has a density on $\mathbb{R}^d$ known up to a normalizing constant. Crucially, the Langevin SDE associated with the target distribution $\pi$ is assumed to have a locally Lipschitz drift coefficient such that its second derivative is locally H\"{o}lder continuous with exponent $\beta \in (0,1]$. Non-asymptotic bounds are obtained for the convergence to stationarity of the new sampling method with convergence rate $1+ \beta/2$ in Wasserstein distance, while it is shown that the rate is 1 in total variation even in the absence of convexity. Finally, in the case of a Lipschitz gradient, explicit constants are provided.


Introduction
In Bayesian statistics and machine learning, one challenge that has attracted substantial attention in recent years, due to its importance in data-driven applications, is the design of algorithms that can efficiently sample from a high-dimensional target probability distribution $\pi$. In particular, its smooth version assumes that there exists a density on $\mathbb{R}^d$, also denoted by $\pi$, such that
$$\pi(x) = e^{-U(x)} \Big/ \int_{\mathbb{R}^d} e^{-U(y)}\,dy, \qquad \text{with } \int_{\mathbb{R}^d} e^{-U(y)}\,dy < \infty,$$
where $U$ is typically continuously differentiable. Within such a setting, consider a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \ge 0}, \mathbb{P})$. The Langevin SDE associated with $\pi$ is then defined by
$$dx_t = -\nabla U(x_t)\,dt + \sqrt{2}\,dw_t, \qquad (1)$$
where $(w_t)_{t \ge 0}$ is a $d$-dimensional Brownian motion. It is a classical result that, under mild conditions, the SDE (1) admits $\pi$ as its unique invariant measure. The numerical scheme for the Langevin SDE obtained by the Euler-Maruyama (Milstein) method yields the unadjusted Langevin algorithm (ULA), also known as the Langevin Monte Carlo (LMC) algorithm, which has been well studied in the literature. For a globally Lipschitz $\nabla U$, non-asymptotic bounds in total variation and in Wasserstein distance between the $n$-th iterate of the ULA algorithm and $\pi$ are provided in [11], [12] and [10]. As for the case of a superlinearly growing $\nabla U$, the difficulty arises from the fact that ULA is unstable (see [23]), and its Metropolis-adjusted version, MALA, loses some of its appealing properties, as discussed in [7] and demonstrated numerically in [2]. However, recent research has developed new types of explicit numerical schemes for SDEs with superlinear coefficients, and it has been shown in [15], [17], [16], [18], [13], [19] that the tamed Euler (Milstein) scheme converges to the true solution of the SDE (1) in $L^p$ on any given finite time horizon with optimal rate.
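For concreteness, the ULA iteration discussed above can be sketched as follows. This is an illustrative sketch only, not part of the paper's analysis; the standard Gaussian target is a hypothetical choice made so that the example is self-contained.

```python
import numpy as np

def ula_step(theta, grad_U, gamma, rng):
    """One ULA (unadjusted Langevin) update:
    theta_{n+1} = theta_n - gamma * grad_U(theta_n) + sqrt(2 * gamma) * Z_{n+1},
    the Euler-Maruyama discretisation of the Langevin SDE (1)."""
    z = rng.standard_normal(theta.shape)
    return theta - gamma * grad_U(theta) + np.sqrt(2.0 * gamma) * z

# Illustration on a standard Gaussian target, U(x) = |x|^2 / 2, grad U(x) = x.
rng = np.random.default_rng(0)
gamma = 0.01
theta = np.zeros(2)
samples = []
for n in range(20000):
    theta = ula_step(theta, lambda x: x, gamma, rng)
    if n >= 10000:  # discard burn-in
        samples.append(theta.copy())
samples = np.asarray(samples)
# For this linear recursion the stationary marginal variance is
# 1 / (1 - gamma / 2), so the empirical variance should be close to 1.
```

The step-size-dependent bias visible in the stationary variance ($1/(1-\gamma/2)$ instead of $1$) is exactly the kind of discretisation error that the non-asymptotic bounds of [11], [12] and [10] quantify.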
This progress led to the creation of the tamed unadjusted Langevin algorithm (TULA) in [2], where the aforementioned convergence results are extended to an infinite time horizon and, moreover, rate-of-convergence results in total variation and in Wasserstein distance are obtained. In this article, a new higher-order LMC algorithm is constructed, which is based on the order 1.5 scheme (3) of the SDE (1). One observes that the scheme (2) coincides with (3) in distribution, and the latter is obtained using an Itô-Taylor expansion (see [22]). By extending the techniques used in [2] and [20], it can be shown that the scheme (2) has a unique invariant measure $\pi_\gamma$, where $\gamma \in (0,1)$ is the step size of the scheme, and one obtains improved convergence results between $\pi_\gamma$ and the target distribution $\pi$. More precisely, assume the potential $U$ is three times differentiable and its third derivative is locally Hölder continuous with exponent $\beta \in (0,1]$. Then, under certain conditions (specified in Section 2), Theorems 1 and 2 state that the rate of convergence between the $n$-th iterate of the new algorithm and the target measure $\pi$ is $1 + \beta/2$ in Wasserstein distance, whereas the rate is 1 in total variation. One notes that these results are obtained in the context of a superlinearly growing $\nabla U$. To the best of the authors' knowledge, these are the first such results which provide a higher rate of convergence in Wasserstein distance compared to the existing literature. As for the total variation distance, [11] proves that the rate of convergence is 1 in the case of a strongly convex $U$, whereas our result yields the same convergence rate without assuming convexity.
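The taming idea behind TULA can be sketched as follows. This is a hedged illustration using the taming of [2], $\nabla U_\gamma(x) = \nabla U(x)/(1 + \gamma |\nabla U(x)|)$; the present paper's taming differs in its exact form (it is chosen so that (7) holds), but it serves the same purpose of keeping the drift growth of order $\gamma^{-1}$. The double-well-type potential used here is a hypothetical example.

```python
import numpy as np

def tamed_grad(grad_U, gamma):
    """TULA-style taming of the drift, as in [2] (the present paper's taming
    is different in detail): the tamed drift has norm at most 1/gamma, which
    stabilises the explicit scheme when grad U grows superlinearly."""
    def g(x):
        gu = grad_U(x)
        return gu / (1.0 + gamma * np.linalg.norm(gu))
    return g

# Quartic potential U(x) = |x|^4 / 4, with superlinear gradient |x|^2 x.
grad_U = lambda x: np.linalg.norm(x) ** 2 * x
gamma = 0.1
g = tamed_grad(grad_U, gamma)
x = np.array([100.0, 0.0])
# The untamed drift has norm 1e6 here; the tamed one stays below 1/gamma = 10,
# which is what prevents the explosion of ULA observed in [23].
assert np.linalg.norm(g(x)) < 1.0 / gamma
```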
An important feature of the newly proposed higher-order LMC algorithm is its computational efficiency. This can be seen in key paradigms of the area, such as sampling from a high-dimensional Gaussian distribution: with limited additional computational effort, the distribution of the algorithm converges faster to the target distribution $\pi$ than existing algorithms (see [9], [11] and references therein). In other words, it takes fewer steps for the error to fall below a given $\varepsilon$, where the error is measured by the 2-Wasserstein distance between the $n$-th iterate of the algorithm (3) and $\pi$. To illustrate this point, consider a multivariate Gaussian target distribution $\pi = \mathcal{N}(0, \Sigma)$, where $\Sigma \in \mathbb{R}^{d \times d}$ is the covariance matrix. Then, for any $x \in \mathbb{R}^d$,
$$U(x) = \tfrac{1}{2} x^T \Sigma^{-1} x, \qquad \nabla U(x) = \Sigma^{-1} x, \qquad \nabla^2 U(x) = \Sigma^{-1}.$$
As the third derivative of $U$ is zero in this case, the proposed algorithm (3) can be obtained by simply adding two more terms to the ULA algorithm (the corresponding scheme (2) is obtained accordingly), and these terms are easily computed. If the precision level at the $n$-th iteration is set to $\varepsilon$, i.e. $W_2(\delta_x R_\gamma^n, \pi) \le \varepsilon$, then according to Corollary 1, for a suitable choice of the step size, one obtains
$$n \ge \frac{(2\tilde C)^{1/3}}{m \varepsilon^{2/3}} \log\big(4(|x - x^\star|^2 + d/m)/\varepsilon^2\big),$$
where $\tilde C > 0$ is a constant proportional to $d$ and $x^\star$ is the unique minimizer of $U$. Thus, the algorithm (3) requires far fewer steps to reach a given precision level than the results in [9] and [11]. However, one notes that in the case of a Lipschitz gradient with nonzero $L_2$ and $L$ in assumptions H5 and H6 (see below), the dependence of the Wasserstein bound on the dimension increases from $d$ to $d^2$.
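For the Gaussian target, one step of an order-1.5 scheme can be sketched explicitly. The correction term $(\gamma^2/2)\nabla^2 U\,\nabla U$ and the modified noise covariance below are a reconstruction from the Itô-Taylor expansion of (1) (using that $\nabla^3 U = 0$ here), not the paper's verbatim scheme (2)/(3); the diagonal $\Sigma^{-1}$ is a hypothetical example.

```python
import numpy as np

def higher_order_step_gaussian(x, Sigma_inv, gamma, rng):
    """One step of an order-1.5 Langevin scheme for the Gaussian target
    U(x) = x^T Sigma^{-1} x / 2, whose third derivative vanishes.
    The noise covariance is that of sqrt(2) * (w_gamma - H int_0^gamma w_s ds)
    with H = Hess(U); this is an Ito-Taylor reconstruction, offered as a
    sketch rather than the paper's exact formulas."""
    d = x.size
    grad = Sigma_inv @ x
    drift = -gamma * grad + 0.5 * gamma ** 2 * (Sigma_inv @ grad)
    cov = (2.0 * gamma * np.eye(d)
           - 2.0 * gamma ** 2 * Sigma_inv
           + (2.0 * gamma ** 3 / 3.0) * (Sigma_inv @ Sigma_inv))
    return x + drift + rng.multivariate_normal(np.zeros(d), cov)

rng = np.random.default_rng(1)
Sigma_inv = np.diag([1.0, 4.0])   # target N(0, diag(1, 0.25))
x = np.array([5.0, -5.0])
gamma = 0.05
samples = []
for n in range(20000):
    x = higher_order_step_gaussian(x, Sigma_inv, gamma, rng)
    if n >= 5000:  # discard burn-in
        samples.append(x.copy())
samples = np.asarray(samples)
# Empirical marginal variances should be close to 1 and 0.25 respectively;
# the order-1.5 correction makes the stationary bias O(gamma^2) rather
# than the O(gamma) bias of plain ULA.
```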
We conclude this section by introducing some notation. The Euclidean norm of a vector $b \in \mathbb{R}^d$ and the Hilbert-Schmidt norm of a matrix $\sigma \in \mathbb{R}^{d \times m}$ are denoted by $|b|$ and $|\sigma|$ respectively. $\sigma^T$ is the transpose of $\sigma$ and $I_d$ is the $d \times d$ identity matrix. The $i$-th element of $b$ and the $(i,j)$-th element of $\sigma$ are denoted by $b^{(i)}$ and $\sigma^{(i,j)}$ respectively, for every $i = 1, \ldots, d$ and $j = 1, \ldots, m$. In addition, denote by $\lfloor a \rfloor$ the integer part of a positive real number $a$, and $\lceil a \rceil = \lfloor a \rfloor + 1$. The inner product of two vectors $x, y \in \mathbb{R}^d$ is denoted by $xy$. For all $x \in \mathbb{R}^d$ and $M > 0$, denote by $B(x, M)$ (respectively $\bar B(x, M)$) the open (respectively closed) ball centered at $x$ with radius $M$. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a twice continuously differentiable function. Denote by $\nabla f$, $\nabla^2 f$ and $\Delta f$ the gradient, the Hessian and the Laplacian of $f$ respectively. Consider a twice continuously differentiable function $g : \mathbb{R}^d \to \mathbb{R}^d$. The term $\nabla^2 g$ is an array of $d$ Hessian matrices, one for each component of $g$, i.e.
$$\nabla^2 g = \big(\nabla^2 g^{(1)}, \ldots, \nabla^2 g^{(d)}\big).$$
Let $\mu$ be a probability measure on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ and $f$ a $\mu$-integrable function, and define $\mu(f) = \int_{\mathbb{R}^d} f(x)\,\mu(dx)$. For a measurable function $V : \mathbb{R}^d \to [1, \infty)$, define the $V$-norm of a finite signed measure $\mu$ by $\|\mu\|_V = \sup_{|f| \le V} |\mu(f)|$. If $V = 1$, then $\|\cdot\|_V$ is the total variation norm, denoted by $\|\cdot\|_{TV}$. Let $\mu$ and $\nu$ be two probability measures on a state space $\Omega$ with a given $\sigma$-algebra. If $\mu \ll \nu$, we denote by $d\mu/d\nu$ the Radon-Nikodym derivative of $\mu$ w.r.t. $\nu$. Then, the Kullback-Leibler divergence of $\mu$ w.r.t. $\nu$ is given by
$$\mathrm{KL}(\mu \,|\, \nu) = \int_\Omega \log\Big(\frac{d\mu}{d\nu}\Big)\, d\mu.$$
We say that $\zeta$ is a transference plan of $\mu$ and $\nu$ if it is a probability measure on $(\mathbb{R}^d \times \mathbb{R}^d, \mathcal{B}(\mathbb{R}^d \times \mathbb{R}^d))$ such that $\zeta(A \times \mathbb{R}^d) = \mu(A)$ and $\zeta(\mathbb{R}^d \times A) = \nu(A)$ for all $A \in \mathcal{B}(\mathbb{R}^d)$. We denote by $\Pi(\mu, \nu)$ the set of transference plans of $\mu$ and $\nu$. Furthermore, we say that a couple of $\mathbb{R}^d$-valued random variables $(X, Y)$ is a coupling of $\mu$ and $\nu$ if there exists $\zeta \in \Pi(\mu, \nu)$ such that $(X, Y)$ is distributed according to $\zeta$. For two probability measures $\mu$ and $\nu$, the Wasserstein distance of order $p \ge 1$ is defined as
$$W_p(\mu, \nu) = \Big( \inf_{\zeta \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} |x - y|^p \, \zeta(dx\, dy) \Big)^{1/p}.$$
By Theorem 4.1 in [4], for all probability measures $\mu, \nu$ on $\mathbb{R}^d$, there exists a transference plan $\zeta^\star \in \Pi(\mu, \nu)$ such that for any coupling $(X, Y)$ distributed according to $\zeta^\star$, we have $W_p(\mu, \nu) = \mathbb{E}[|X - Y|^p]^{1/p}$.
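For Gaussian laws, the 2-Wasserstein distance used throughout the paper admits a well-known closed form, which is convenient for checking the Gaussian example of the introduction numerically. The following sketch implements that closed form (an illustration, not part of the paper's analysis):

```python
import numpy as np

def psd_sqrt(A):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def w2_gaussian(m1, S1, m2, S2):
    """Closed-form 2-Wasserstein distance between N(m1, S1) and N(m2, S2):
    W2^2 = |m1 - m2|^2 + tr(S1 + S2 - 2 (S2^{1/2} S1 S2^{1/2})^{1/2})."""
    r = psd_sqrt(S2)
    cross = psd_sqrt(r @ S1 @ r)
    val = float(np.dot(m1 - m2, m1 - m2) + np.trace(S1 + S2 - 2.0 * cross))
    return np.sqrt(max(val, 0.0))

I = np.eye(2)
# Equal covariances: the distance reduces to the distance between the means.
print(w2_gaussian(np.zeros(2), I, np.array([3.0, 4.0]), I))  # -> 5.0
```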

Main results
Assume $U$ is three times continuously differentiable. The following conditions are stated:

H1 $\liminf_{|x| \to +\infty} |\nabla U(x)| = +\infty$, and $\liminf_{|x| \to +\infty}$

H2 There exist $L > 0$, $\rho \ge 2$ and $\beta \in (0,1]$ such that, for all $x, y \in \mathbb{R}^d$,

H3 $U$ is strongly convex, i.e. there exists $m > 0$ such that, for all $x, y \in \mathbb{R}^d$,
$$\langle \nabla U(x) - \nabla U(y), x - y \rangle \ge m |x - y|^2.$$

Remark 1. Unless otherwise specified, the constants $C, K > 0$ may take different values at different places, but they are always independent of the step size $\gamma \in (0,1)$.

Remark 2.
Assume H2 holds; then there exists a constant $K > 0$ such that for all $x \in \mathbb{R}^d$, moreover, there exists a constant $K > 0$ such that for all $x, y \in \mathbb{R}^d$, Furthermore, there exists a constant $K > 0$ such that for all $x, y \in \mathbb{R}^d$, For any $x \in \mathbb{R}^d$, denote by The taming is chosen in such a way that (7) holds, which leads to the finiteness of exponential moments of the scheme via the log-Sobolev inequality, while the desired rate of convergence of the tamed drift to the original drift is retained.

Remark 3.
There exists a constant $K > 0$ such that for all $x \in \mathbb{R}^d$, Note that in Remark 3 the upper bound for $|(\nabla^2 U \, \nabla U)_\gamma(x)|$ is $K\gamma^{-1}$, since for $|x| \le 1$ one estimates the term using H2, while for $|x| > 1$ the bound is obtained by a straightforward calculation using the taming defined above.
The new higher-order LMC algorithm, which is the tamed order 1.5 scheme of the SDE (1), also has the following representation, for any $n \in \mathbb{N}$, where $\gamma \in (0,1)$ is the step size, $(Z_n)_{n \in \mathbb{N}}$ are i.i.d. standard $d$-dimensional Gaussian random variables and, for all $x \in \mathbb{R}^d$, Note that, for all $x \in \mathbb{R}^d$ and $k = 1, \ldots, d$, $|\sigma_\gamma(x)| \ge 2/3$, which implies that the covariance matrix $\sigma_\gamma(x)$ is positive definite. The Markov kernel $R_\gamma$ associated with (2) is given, for all $x \in \mathbb{R}^d$ and $A \in \mathcal{B}(\mathbb{R}^d)$, by
One observes that the scheme (3) is Markovian, and $\operatorname{Law}(\bar X_n)$ coincides with $\operatorname{Law}(X_n)$ for any $n \in \mathbb{N}$.
Denote by $(P_t)_{t \ge 0}$ the semigroup associated with (1): for all $x \in \mathbb{R}^d$ and $A \in \mathcal{B}(\mathbb{R}^d)$, $P_t(x, A) = \mathbb{P}(x_t \in A \mid x_0 = x)$. In addition, for all $x \in \mathbb{R}^d$ and $h \in C^2(\mathbb{R}^d)$, the infinitesimal generator $\mathcal{A}$ associated with (1) is defined by
$$\mathcal{A}h(x) = -\langle \nabla U(x), \nabla h(x) \rangle + \Delta h(x).$$
For any $a > 0$, define the Lyapunov function $V_a$: Then, for the locally Lipschitz drift, one obtains the following convergence results.

Theorem 1. Assume H1, H2 and H3 are satisfied. Then, there exist constants $C > 0$ and $\lambda \in (0,1)$ such that for all $x \in \mathbb{R}^d$, $\gamma \in (0,1)$ and $n \in \mathbb{N}$, where $c$ is given in (13) and for all $\gamma \in (0,1)$,

Theorem 2. Assume H1 and H2 are satisfied. There exist $C > 0$ and $\lambda \in (0,1)$ such that for all $x \in \mathbb{R}^d$, $\gamma \in (0,1)$ and $n \in \mathbb{N}$, where $c$ is given in (13) and for all $\gamma \in (0,1)$,

Moreover, the Lipschitz case is considered in order to provide explicit constants for the moment bounds and for the convergence in Wasserstein distance. More precisely, the drift coefficient $\nabla U$ is assumed to be Lipschitz continuous and H2 is replaced by the following assumptions: Then, for all $x \in \mathbb{R}^d$ and $n \in \mathbb{N}$, where the constant $\tilde C$ depends on $d^2$, and its explicit expression is given in the proof.
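For orientation, in comparable analyses of (tamed) Langevin algorithms (e.g. [2] and [7]) the Lyapunov function $V_a$ is typically taken to be of the exponential form below; this is offered as a plausible illustration under that assumption, not necessarily the paper's verbatim definition:

```latex
V_a(x) = \exp\!\big( a (1 + |x|^2)^{1/2} \big), \qquad x \in \mathbb{R}^d, \; a > 0 .
```

A function of this form satisfies $V_a \ge 1$ and grows exponentially in $|x|$, which is what makes exponential moment bounds of the type in Propositions 1 and 2 directly applicable.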
If one considers a multivariate Gaussian as the target distribution, then for all $x \in \mathbb{R}^d$ and $n \in \mathbb{N}$, where the constant $\tilde C$ depends on $d$, and its explicit expression is given in the proof.

Moment bounds
It is a well-known result that, by H1 and H2, the SDE (1) has a unique strong solution. One then needs to obtain moment bounds for the SDE (1) and the numerical scheme (2) before considering the convergence results. By using Foster-Lyapunov conditions, one can obtain exponential moment bounds for the solution of the SDE (1). More concretely, the application of Theorems 1.1 and 6.1 in [7] and Theorem 2.2 in [21] yields the following results.

Proposition 1. Assume H1 and H2 are satisfied. For all $a > 0$, there exists $b_a > 0$ such that for all and sup Moreover, there exist $C_a > 0$ and $\rho_a \in (0,1)$ such that for all $t > 0$ and probability measures

Proof. See Proposition 1 in [2] for the detailed proof.
The proposition below provides a uniform bound for the exponential moments of the Markov chain $(X_k)_{k \ge 0}$.

Proposition 2. Assume H1 and H2 are satisfied. Then, there exist constants $b, c, M > 0$ such that for all $x \in \mathbb{R}^d$ and $\gamma \in (0,1)$, Moreover, this guarantees that the Gaussian kernel $R_\gamma$ has a unique invariant measure $\pi_\gamma$ and that $R_\gamma$ is geometrically ergodic w.r.t. $\pi_\gamma$.
Proof. We use the scheme (2) throughout the proof. First, one observes that, by H1, for $\gamma \in (0,1)$ the following holds. Indeed, by H1, note that $f(x) = x/(1 + x^{3/2})^{2/3}$ is a non-decreasing function for all $x \ge 0$. Then, one obtains (7), since for all

The function $(1 + |x|^2)^{1/2}$ is Lipschitz continuous with Lipschitz constant equal to 1. Let $X_0 = x$; then, for all $x \in \mathbb{R}^d$, applying the log-Sobolev inequality (see Proposition 5.5.1 in [5] and Appendix A for a detailed proof) gives, which using Jensen's inequality yields Note that we have By taking the conditional expectation on both sides, the above equation becomes Then, by inserting (10) into (9), one obtains where Then, expanding the square yields

By (7), there exist $M_1, \kappa_1 > 0$ such that for all $|x| \ge M_1$, Thus, by using Remark 3, for all $|x| \ge \max\{1, M_1\}$, By substituting (12) into (11) and completing the square, one obtains, for $|x| \ge M$, For the case $|x| \le M$, by H2, the following result can be obtained: where $c_3$ is a positive constant (that depends on $d$ and $K$). Then, by using Thus, Then, by induction, for all $n \in \mathbb{N}$ and $x \in \mathbb{R}^d$,

Finally, since any compact set in $\mathbb{R}^d$ is accessible and small for $R_\gamma$, by Section 3.1 in [7] and Theorem 15.0.1 in [8], for all $\gamma \in (0,1)$, $R_\gamma$ has a unique invariant measure $\pi_\gamma$ and is geometrically ergodic w.r.t. $\pi_\gamma$.
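The monotonicity claim for $f(x) = x/(1 + x^{3/2})^{2/3}$ used in the proof above follows by differentiation, since $f'(x) = (1 + x^{3/2})^{-5/3} > 0$ for all $x \ge 0$; a minimal numerical sanity check:

```python
import numpy as np

def f(x):
    """f(x) = x / (1 + x**1.5)**(2/3). Differentiating gives
    f'(x) = (1 + x**1.5)**(-5/3) > 0, so f is strictly increasing on [0, inf)
    and tends to 1 as x -> infinity."""
    return x / (1.0 + x ** 1.5) ** (2.0 / 3.0)

xs = np.linspace(0.0, 50.0, 100001)
vals = f(xs)
# Numerical check of monotonicity on a fine grid:
assert np.all(np.diff(vals) > 0)
```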
The results in Proposition 1 and 2 provide exponential moment bounds for the solution of SDE (1) and the scheme (2), which enable us to consider the total variation and Wasserstein distance between the target distribution π and the n-th iteration of the MCMC algorithm.

Proof of Theorem 1
In order to obtain the Wasserstein bound, assumption H3, i.e. the convexity of $U$, is needed. We consider the linear interpolation of the scheme (3), given by (14). Note that the linear interpolation (14) and the scheme (3) coincide at the grid points, i.e. for any $n \in \mathbb{N}$, $X_n = \bar x_{n\gamma}$. Let $(\mathcal{F}_t)_{t \ge 0}$ be a filtration associated with $(w_t)_{t \ge 0}$. For any $n \in \mathbb{N}$, denote by $\mathbb{E}^{\mathcal{F}_{n\gamma}}[\cdot]$ the expectation conditional on $\mathcal{F}_{n\gamma}$.

Lemma 1. Assume H1 and H2 are satisfied. Then, there exists a constant $C > 0$ such that for all $p > 0$, $\gamma \in (0,1)$, $n \in \mathbb{N}$ and $t \in [n\gamma, (n+1)\gamma)$,

Proof. Consider $f \in C_{\mathrm{poly}}(\mathbb{R}_+, \mathbb{R}_+)$; then there exists a constant $C > 0$ such that for all $x \in \mathbb{R}^d$, $f(|x|) \le C V_c(x)$. For $p > 1$, the first inequality is obtained by applying Hölder's inequality and Remark 2. The second inequality can be proved using similar arguments. For the case $0 < p \le 1$, Jensen's inequality is used to obtain the desired result.
Lemma 2. Assume H1 and H2 are satisfied. Then, there exists a constant $C > 0$ such that for all $p > 0$, $\gamma \in (0,1)$, $n \in \mathbb{N}$ and $t \in [n\gamma, (n+1)\gamma)$,

Proof. For $p > 1$, by using Hölder's inequality, Remark 2 and Lemma 1, we have For the case $0 < p \le 1$, one can use Jensen's inequality to obtain Similarly, for $p > 1$, by using Hölder's inequality, one obtains where the last inequality holds due to Proposition 1. The case $p \in (0,1]$ follows from the application of Jensen's inequality.

Lemma 3. Assume H1 and H2 are satisfied. Then, there exists a constant $C > 0$ such that for all $\gamma \in (0,1)$, $n \in \mathbb{N}$ and $t \in [n\gamma, (n+1)\gamma)$,

Proof. For any $t \in [n\gamma, (n+1)\gamma)$, applying Itô's formula to $\nabla U(x_t)$ gives (15), almost surely. By further calculations, one obtains where Then, squaring both sides of (15) and taking conditional expectations yields By using the Cauchy-Schwarz inequality, Proposition 2, Remark 2 and Lemma 2, one obtains Similarly, by the Cauchy-Schwarz inequality, Proposition 2 and Remark 2, one obtains where the last inequality is obtained by applying Lemma 1. Moreover, applying the Cauchy-Schwarz inequality, Proposition 2, Lemma 2 and Remark 2 yields Furthermore, one obtains by using the Cauchy-Schwarz inequality, Proposition 2, Lemma 2 and H2 that The estimate of $\mathbb{E}^{\mathcal{F}_{n\gamma}}|G_5(t)|^2$ can be obtained by straightforward calculations, and we have

For any $x, \bar x \in \mathbb{R}^d$, denote by $M(x, \bar x)$ a matrix whose $(i,j)$-th entry is $\sum_{k=1}^d$ One then obtains the following results.

Proof of Theorem 2
By applying the following lemma, one can show that, without using H3, the rate of convergence in the total variation norm is of order 1, as stated in Theorem 2.
Lemma 7. Assume H1 and H2 are satisfied. Let $p \in \mathbb{N}$ and let $\nu_0$ be a probability measure on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$. There exists $C > 0$ such that for all $\gamma \in (0,1)$,

Proof. Denote by $\mu_p^y$ and $\bar\mu_p^y$ the laws on $C([0, p\gamma], \mathbb{R}^d)$ of the SDE (1) and of the linear interpolation (14) of the scheme, both started at $y \in \mathbb{R}^d$. Denote by $(\mathcal{F}_t)_{t \ge 0}$ the filtration associated with $(w_t)_{t \ge 0}$, and by $(x_t, \bar x_t)_{t \ge 0}$ the unique strong solution of where $-\nabla \tilde U_\gamma(t, \bar x_{\lfloor t/\gamma \rfloor \gamma})$ is defined in (14). Then, one takes into consideration Definition 7 concerning diffusion-type processes and Lemma 4.9, which refers to their representations, in Section 4.
Note that the assumptions of Theorem 7.19 in [1] are satisfied due to Propositions 1 and 2. By using (27), one obtains where the last inequality holds due to Lemma 3. Then, by Theorem 4.1 in [3], it follows that Finally, applying the tower property yields the desired result.

Proof of Theorem 2. The proof follows along the same lines as the proof of Theorem 4 in [2], but for completeness the details are given below. By Proposition 1, for all $n \in \mathbb{N}$ and $x \in \mathbb{R}^d$, we have Denote by $k_\gamma = \lceil \gamma^{-1} \rceil$, and by $q_\gamma, r_\gamma$ the quotient and the remainder of the Euclidean division of $n$ by $k_\gamma$, i.e. $n = q_\gamma k_\gamma + r_\gamma$. Then, where By applying Lemma 24 in [12] to $I_1$, we have Then, by Proposition 2 and Lemma 7, one obtains where the last inequality holds since $r_\gamma \le k_\gamma \le 1 + \gamma^{-1}$. Furthermore, by Propositions 1 and 2, Substituting (29) and (30) into (28) yields where $\lambda \in (0,1)$. By using similar arguments for $I_2$, one obtains (5). Finally, by sending $n$ to infinity, (6) is obtained.

Lipschitz case
In the context of a Lipschitz gradient, assume H3-H6 hold. Then, by H4 and H5, one obtains, for any $x, y \in \mathbb{R}^d$, Moreover, in the Lipschitz case there is no need to use the tamed coefficients. Thus, the linear interpolation of the scheme becomes One notes that, for any $n \in \mathbb{N}$, $X_n = \bar x_{n\gamma}$.

Proof of Theorem 3
The explicit constants for the second and fourth moments are obtained first; then, by using the following lemmas, one can establish the rate of convergence in Wasserstein distance.
Then, for all n ∈ N, and t ∈ [nγ, (n + 1)γ), Proof. One observes that the first two results can be obtained immediately by using (34) and (37). As for the third result, consider where the last inequality holds by using Theorem 1 in [11].
Then, for all n ∈ N, and t ∈ [nγ, (n + 1)γ), where c 5 , c 6 and c 7 are given explicitly in the proof.
In the following proofs, denote by $M(x, \bar x)$ a matrix whose $(i,j)$-th entry is $\sum_{k=1}^d$

Lemma 11. Assume H6 holds. Then, for any $x, \bar x \in \mathbb{R}^d$ and $i = 1, \ldots, d$,

Proof. The proof follows the same lines as Lemma 4 and is hence omitted.
Then, for all $n \in \mathbb{N}$ and $t \in [n\gamma, (n+1)\gamma)$,

Proof. By using the conditional Itô isometry and Lemma 2, one obtains where a detailed proof of the first inequality, which uses the mean value theorem, is given in Appendix C.
Then, for all n ∈ N, and t ∈ [nγ, (n + 1)γ), where the constants c 8 , c 9 and c 10 are given explicitly in the proof.
Proof. The proof follows the same lines as that of Lemma 6; thus, the main focus here is to provide explicit constants. The second term in (19) can be estimated as where the first inequality holds due to Young's inequality and the fact that, for any $i, j, k = 1, \ldots, d$, while the last inequality holds due to Young's inequality, the mean value theorem, the Cauchy-Schwarz inequality and Lemma 8. By using the Cauchy-Schwarz inequality, (19) becomes Then, to estimate the first term of (19), one applies Itô's formula to $\nabla U(x_r) - \nabla U(x_{n\gamma})$ and $\nabla U(\bar x_r) - \nabla U(\bar x_{n\gamma})$ to obtain, almost surely, where the first inequality holds due to the Cauchy-Schwarz and Young inequalities, the second inequality holds by using (31) and H5, while the last inequality is obtained from Lemmas 8 and 9. Finally, by using Young's inequality and Lemma 12, one obtains where
$$c_8 = 2L_1^4 + 4L_1 L_2^2 + L_2^2, \qquad c_9 = 9L_1^2 L_2^2 + \tfrac{45}{4} L_1 L_2^2 + L_2^2 + 18L_1^4, \qquad c_{10} = d\big(16L_1^3 + 6L_1 L_2^2 + 50L_2^2 + \tfrac{9}{4} L_2 d\big).$$