Uniform-in-time propagation of chaos for kinetic mean field Langevin dynamics

We study the kinetic mean field Langevin dynamics under a functional convexity assumption on the mean field energy functional. Using hypocoercivity, we first establish the exponential convergence of the mean field dynamics and then show that the corresponding $N$-particle system converges exponentially at a rate uniform in $N$, modulo a small error. Finally, we study the short-time regularization effects of the dynamics and prove its uniform-in-time propagation of chaos property in both the Wasserstein and entropic senses. Our results can be applied to the training of two-layer neural networks with momentum, and we include numerical experiments.

Training neural networks by momentum gradient descent has proven to be effective in various applications [38,22,35]. However, despite their excellent performance, the theoretical understanding of these algorithms remains elusive. Recently, extensive research has been conducted to model the loss minimization of neural networks as a mean field optimization problem [28,8,34,18], with most works characterizing gradient descent algorithms as overdamped mean field Langevin (MFL) dynamics. In this paper, we focus on kinetic dynamics instead, which correspond to momentum gradient descent in the context of machine learning [30,23].
Classical studies, such as [40,27], have explored the exponential convergence of linear kinetic Langevin dynamics based on hypocoercivity and functional inequalities. The kinetic MFL dynamics is studied in [21] to model momentum gradient descent for the training of neural networks, and its convergence to the unique invariant measure is proven there without a quantitative rate. The present work studies both the quantitative long-time behavior of the kinetic MFL dynamics and its uniform-in-time propagation of chaos (POC) property, under a functional convexity assumption, aiming to provide a theoretical justification for the momentum algorithm's efficiency in practice.

Settings and main results
We give an informal preview of our settings and main results in this section. Let $F : \mathcal P_2(\mathbb R^d) \to \mathbb R$ be a mean field functional and denote by $D_m F : \mathcal P_2(\mathbb R^d) \times \mathbb R^d \to \mathbb R^d$ its intrinsic derivative. We aim to investigate the long-time behavior of the kinetic MFL defined by
$$dX_t = V_t\,dt, \qquad dV_t = -D_m F(m^x_t, X_t)\,dt - V_t\,dt + \sqrt 2\,dW_t,$$
where $m^x_t = \operatorname{Law}(X_t)$, and its associated $N$-particle system defined by
$$dX^i_t = V^i_t\,dt, \qquad dV^i_t = -D_m F(\mu_{\boldsymbol X_t}, X^i_t)\,dt - V^i_t\,dt + \sqrt 2\,dW^i_t, \qquad i = 1, \ldots, N.$$
Here $W_t$, $(W^i_t)_{i=1}^N$ are independent $d$-dimensional Brownian motions and $\mu_{\boldsymbol X_t}$ denotes the empirical measure of $(X^1_t, \ldots, X^N_t)$. Denote $m_t = \operatorname{Law}(X_t, V_t)$ and $m^N_t = \operatorname{Law}(X^1_t, \ldots, X^N_t, V^1_t, \ldots, V^N_t)$, and suppose the initial conditions $m_0$ and $m^N_0$ have finite second moments. We wish to show the convergence $m^N_t \to m^{\otimes N}_t$ when $N \to +\infty$ in a uniform-in-$t$ way. We assume:
• the mean field functional $F$ is convex in the functional sense;
• its intrinsic derivative $(m, x) \mapsto D_m F(m, x)$ is jointly Lipschitz with respect to the $L^1$-Wasserstein distance;
• for every measure $m \in \mathcal P_2(\mathbb R^d)$, the probability measure proportional to $\exp\bigl(-\frac{\delta F}{\delta m}(m, x)\bigr)\,dx$ satisfies a logarithmic Sobolev inequality (LSI) with a constant uniform in $m$;
• its second and third-order functional derivatives satisfy certain bounds.
Under these assumptions, we are able to obtain:
• when $t \to +\infty$, the mean field flow $m_t$ converges exponentially to the mean field invariant measure $m_\infty$;
• when $t \to +\infty$, the $N$-particle flow $m^N_t$ converges approximately to the $N$-tensorized mean field invariant measure $m^{\otimes N}_\infty$, with an exponential rate uniform in $N$.
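As a concrete sanity check of the first convergence claim, the sketch below simulates the kinetic Langevin dynamics for the simplest convex functional $F(m) = \frac12 \int |x|^2\,dm$ (so $D_m F(m,x) = x$ and there is no actual interaction), with friction and volatility fixed as in the paper; the choice of functional, step size and horizon are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative example: kinetic Langevin dynamics with the toy convex
# functional F(m) = \int |x|^2/2 dm, so D_m F(m, x) = x.  With unit
# friction and volatility sqrt(2), the dynamics is
#   dX = V dt,  dV = -(X + V) dt + sqrt(2) dW,
# whose invariant measure is a standard Gaussian in (x, v); hence, in
# dimension one, E[X^2] and E[V^2] both relax to 1.
rng = np.random.default_rng(0)
N, dt, T = 5000, 0.01, 20.0
x = rng.normal(0.0, 3.0, N)   # deliberately spread-out initial condition
v = rng.normal(0.0, 3.0, N)
for _ in range(int(T / dt)):  # Euler-Maruyama steps
    dw = rng.normal(0.0, np.sqrt(dt), N)
    x, v = x + v * dt, v - (x + v) * dt + np.sqrt(2.0) * dw
print(np.mean(x ** 2), np.mean(v ** 2))  # both close to 1
```

The empirical second moments relax from 9 to the stationary value 1, illustrating the exponential convergence $m_t \to m_\infty$ in this degenerate (noise only in $v$) setting.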

Related works
We give in this section a short review of recent progress on the long-time behavior and the uniform-in-time propagation of chaos property of McKean-Vlasov dynamics, with an emphasis on kinetic ones. We refer readers to [5,6] for a more comprehensive review of propagation of chaos.
Coupling approaches. The coupling approach involves constructing a joint probability measure of the mean field and $N$-particle systems to allow comparisons between them.
The synchronous coupling method is employed in [3], and the uniform-in-time POC is shown by assuming the strong monotonicity of the drift and the smallness of the mean field interaction. The strong monotonicity is then relaxed by the reflection coupling method in [11], and we refer readers to [36,14,21] for further developments. Let us remark that synchronous coupling often gives sharp contraction rates under strong convexity assumptions, while reflection coupling allows us to treat dynamics of a more general type but gives far-from-sharp contraction rates.
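The mechanism behind synchronous coupling can be sketched in a few lines: two copies of the kinetic dynamics are driven by the same Brownian increments, so the noise cancels in their difference, and a strongly convex drift (here a hypothetical quadratic one, not the paper's mean field drift) contracts that difference exponentially.

```python
import numpy as np

# Sketch of synchronous coupling for a hypothetical quadratic drift:
# two copies of dX = V dt, dV = -(X + V) dt + sqrt(2) dW share the SAME
# Brownian increments.  The noise cancels in the difference, which then
# follows the deterministic linear ODE (dx, dv)' = (dv, -dx - dv), whose
# solutions decay with envelope exp(-t/2).
rng = np.random.default_rng(1)
dt, T = 0.001, 10.0
x1, v1 = 1.0, 0.0            # first copy
x2, v2 = 0.0, 0.0            # second copy, different start
d0 = abs(x1 - x2)            # initial distance = 1
for _ in range(int(T / dt)):
    dw = rng.normal(0.0, np.sqrt(dt))   # shared Brownian increment
    x1, v1 = x1 + v1 * dt, v1 - (x1 + v1) * dt + np.sqrt(2.0) * dw
    x2, v2 = x2 + v2 * dt, v2 - (x2 + v2) * dt + np.sqrt(2.0) * dw
dist = np.hypot(x1 - x2, v1 - v2)
print(dist)  # far below the initial distance d0 = 1
```

Without strong convexity the difference ODE no longer contracts, which is precisely the regime where reflection coupling takes over.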
Functional approaches. Another approach to uniform-in-time POC is the functional one, which is also the main approach of this paper. Here, in order to study long-time behaviors and propagation of chaos properties, we construct appropriate (Lyapunov) functionals and investigate the change of their values along the dynamics. The relative entropy is used as the functional in [29] and its follow-up work [16] to study kinetic McKean-Vlasov dynamics with regular interactions. It is worth noting that the relative entropy approach has been successful in handling singular interactions, thanks to the groundbreaking work of Pierre-Emmanuel Jabin and Zhenfu Wang [20], and we refer readers to [13,9,33] for recent developments. However, we are not aware of any works using the relative entropy functional (or its modifications) to study kinetic diffusions with singular interactions.
Comparison to [7]. The present paper is a continuation of our previous work [7], where the overdamped version of the mean field Langevin dynamics is studied, and the two share a number of key features. We show the exponential convergence of the particle system using the same componentwise decomposition of the Fisher information and the same componentwise log-Sobolev inequality. The uniform-in-time propagation of chaos property for both dynamics is then obtained by combining the exponential convergence of the mean field and particle flows. This paper also differs from [7] in a number of aspects. First, as the dynamics is generated by a hypoelliptic operator instead of an elliptic one, we use hypocoercivity to recover the exponential convergence. Second, since we are not able to show hypercontractivity of the kinetic dynamics (let alone reverse hypercontractivity), we prove the entropic propagation of chaos by studying its short-time regularization effects. In this way we no longer restrict the initial condition of the mean field dynamics, but as a trade-off we require higher-order regularity in measure of the energy functional. Finally, following a remark in [7], we use an approximation argument to remove the condition on the higher-order spatial derivatives in this work.

Main contributions
Hypocoercivity for mean field systems. We extend the studies of the linear Fokker-Planck equation in [40] to dynamics with general (but always regular) mean field interactions. In particular, we do not suppose the interaction is in the form of a two-body potential, which stands in contrast with [40, Theorem 56] and [29,16]. Moreover, in the hypocoercive computations, we find that the contributions from the mean field interaction can always be dominated by the "diagonal" terms in the Fisher information, already present in the case of linear dynamics. Hence, using the convexity of the energy, we are able to derive the hypocoercivity without restrictions on the size of the interaction. Furthermore, our assumptions imply a uniform-in-$N$ bound on the operator norm of the second-order derivatives of the effective potential driving the $N$-particle system, and the entropic hypocoercivity is consequently uniform in $N$. This is different from the situation of $L^2$-hypocoercivity, where the condition given by Villani [40, (7.3)] yields dimension-dependent constants and is therefore unsuitable for studies of particle systems, as remarked in [15]. Finally, let us mention that we derive the entropic hypocoercivity under minimal regularity assumptions, made possible by our approximation argument (of functions and of mean field functionals) and the calculus in Wasserstein space developed in [2].
Regularization in short time. We obtain two short-time regularization results for the kinetic mean field dynamics. The first, from Wasserstein to entropy, is a consequence of the logarithmic Harnack inequality, obtained by applying the coupling by change of measure method of Panpan Ren and Feng-Yu Wang [32] to the mean field and $N$-particle diffusions. We remark also that very recently a similar inequality ([19, (3.13)]) was proved for the propagation of chaos of nondegenerate McKean-Vlasov diffusions. The second regularization, from entropy to Fisher information, is obtained by adapting Hérau's functional in [40] to our mean field setting, and it follows from the same hypocoercive computations used to prove the convergence of the mean field flow. We stress that although much stronger regularization phenomena are present, for example from measure initial values to $L^p$ for every $p > 1$ and to $H^k$ for every $k \ge 1$, our results have the advantage of growing at most linearly in the dimension, making them suitable for studying the $N$-particle systems in the limit $N \to +\infty$.
Propagation of chaos. Finally, using the exponential convergence and the short-time regularizations, we derive the propagation of chaos for the kinetic MFL, i.e. bounds on the distances between the particle system and the mean field system. In particular, the initial values of both systems can be arbitrary measures of finite second moments without any further regularity constraints. Moreover, the error terms do not have any dimension dependence. It is noteworthy that our approach does not rely on a uniform-in-time log-Sobolev inequality for the mean field flow, and also that the dynamics considered are realized on the whole space instead of the torus, standing in contrast with previous works, e.g. [13,24,10].

Notations
Let $d$ be a positive integer and $x$, $v$ be elements of $\mathbb R^d$. Let $X$ and $Y$ be random variables. We denote the distribution of $X$ by $\operatorname{Law}(X)$ and say $X \sim m$ if $m = \operatorname{Law}(X)$. The set of couplings between probabilities $\mu$ and $\nu$ is denoted by $\Pi(\mu, \nu)$. Let $N \ge 2$ be an integer. The bold letters $\boldsymbol x_N = (x_1, \ldots, x_N)$, $\boldsymbol v_N = (v_1, \ldots, v_N)$ denote respectively $N$-tuples of elements in $\mathbb R^d$, and $\boldsymbol z_N = (z_1, \ldots, z_N)$ denotes an $N$-tuple of elements in $\mathbb R^{2d}$. We omit the subscript $N$ when there is no ambiguity. Given $\boldsymbol x_N = (x_1, \ldots, x_N) \in \mathbb R^{dN}$, we denote the corresponding empirical measure by
$$\mu_{\boldsymbol x_N} = \frac 1N \sum_{i=1}^N \delta_{x_i}.$$
For $i = 1, \ldots, N$, we define $-i = \{1, \ldots, N\} \setminus \{i\}$, that is, the complementary index set, and we denote the empirical measure formed by the $N-1$ points $(x_j)_{j \ne i}$ by
$$\mu_{\boldsymbol x^{-i}} = \frac{1}{N-1} \sum_{j \ne i} \delta_{x_j}.$$
Let $I \subset \{1, \ldots, N\}$ and let $J = \{1, \ldots, N\} \setminus I$ be the complementary index set. Let $Z$ be an $\mathbb R^{2dN}$-valued random variable and $m^N$ be its distribution, belonging to $\mathcal P(\mathbb R^{2dN})$. We denote the marginal and the (regular) conditional distributions of $m^N$ by $m^{N,I}$ and $m^{N,I|J}(\cdot \,|\, \boldsymbol z_J)$ respectively, where the latter is defined $m^{N,J}$-almost surely and $\boldsymbol z_J$ denotes the tuple $(z_j)_{j \in J}$. We identify $i$ with the singleton $\{i\}$ when working with indices. Whenever a measure $m \in \mathcal P(\mathbb R^d)$ has a density with respect to the $d$-dimensional Lebesgue measure, we denote its density function equally by $m$. The relative entropy $H(\cdot|\cdot)$ between probabilities is always well defined, and the absolute entropy $H(\cdot)$ is also well defined if the measure in the argument has finite second moment. If a measure $m \in \mathcal P(\mathbb R^d)$ has distributional derivative $Dm$ representable by a finite Borel measure and $Dm$ is absolutely continuous with respect to $m$, we define its Fisher information by
$$I(m) = \int \Bigl| \frac{Dm}{m} \Bigr|^2 \, dm,$$
where $\frac{Dm}{m}$ is the Radon-Nikodým derivative. Otherwise we set $I(m) = +\infty$. One can verify that $I(m)$ is finite only if $m \in W^{1,1}(\mathbb R^d)$,¹ and in this case $I(m) = \int \frac{|\nabla m|^2}{m}$, with $m$, $\nabla m$ being the weak derivatives in $L^1(\mathbb R^d; \mathbb R^d)$. The Fisher information defined in this way corresponds to the functional considered in [1, (2.26)]. If $m$ is a measure on $\mathbb R^d$ having finite Fisher information, and if $\gamma$ is another measure on $\mathbb R^d$ having weakly differentiable density with respect to the Lebesgue measure, we define the relative Fisher information by
$$I(m | \gamma) = \int \Bigl| \nabla \log \frac{m}{\gamma} \Bigr|^2 \, dm.$$

¹ We sketch the proof here. Suppose $m$ has finite Fisher information. Set $m_n = m \star \rho_n$ for a mollifying sequence $(\rho_n)_{n \in \mathbb N}$. Then we have $\|m_n\|_{W^{1,1}} \le C$ for all $n \in \mathbb N$. By Gagliardo-Nirenberg, $m_n$ is uniformly bounded in $L^p$ for some $p > 1$, so upon extraction of a subsequence, $(m_n)_{n \in \mathbb N}$ converges weakly to some $m' \in L^p$. But $m_n \to m$ in $\mathcal P$. The two limits coincide, i.e. $m = m'$. Hence $m$ has a density with respect to the Lebesgue measure and so does $Dm$.
For a nonnegative function $f : \mathbb R^d \to [0, +\infty)$ and a reference probability measure $\mu$, we define its entropy by
$$\operatorname{Ent}_\mu(f) = \int f \log f \, d\mu - \Bigl( \int f \, d\mu \Bigr) \log \Bigl( \int f \, d\mu \Bigr),$$
which is well defined in $[0, +\infty]$ by Jensen's inequality.
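The Fisher information defined above admits a classical closed form for Gaussians, $I(\mathcal N(0, \sigma^2)) = 1/\sigma^2$ in dimension one, which gives a quick numerical sanity check of the definition (the grid and tolerance below are arbitrary choices for illustration):

```python
import numpy as np

# Numerical check of I(m) = \int |Dm/m|^2 dm for a one-dimensional
# Gaussian N(0, sigma^2).  Since d/dx log m = -x/sigma^2, the closed
# form is I = 1/sigma^2.
sigma = 1.5
x = np.linspace(-12.0, 12.0, 8001)
dx = x[1] - x[0]
m = np.exp(-x ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
score = np.gradient(np.log(m), x)        # d/dx log m = Dm/m
fisher = np.sum(score ** 2 * m) * dx     # Riemann sum for \int |Dm/m|^2 dm
print(fisher, 1 / sigma ** 2)            # the two values agree closely
```

The same finite-difference recipe applies to any density given on a grid, which is convenient when checking relative Fisher informations $I(m|\gamma)$ numerically.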
Organization of the paper. In Section 2, we introduce our assumptions, define the kinetic mean field Langevin dynamics and the particle system, and state our main results. In Section 3 we provide an exemplary dynamics modeling the training of neural networks and present our numerical experiments. Moving on to the proofs, we first show in Section 4 the exponential convergence of the mean field and particle system dynamics. We then study in Section 5 finite-time propagation of chaos and regularizations of the kinetic MFL, before combining all previous results to show the propagation of chaos theorem in its full form. Finally, several technical results are proved in the appendices.

Assumptions and main results
Assumptions. Let $F : \mathcal P_2(\mathbb R^d) \to \mathbb R$ be a mean field functional. We suppose $F$ is convex in the sense that for every $t \in [0, 1]$ and every $m, m' \in \mathcal P_2(\mathbb R^d)$,
$$F\bigl((1-t)m + t m'\bigr) \le (1-t) F(m) + t F(m'). \tag{2.1}$$
Suppose also that its intrinsic derivative $D_m F : \mathcal P_2(\mathbb R^d) \times \mathbb R^d \to \mathbb R^d$ is jointly Lipschitz:
$$\bigl| D_m F(m, x) - D_m F(m', x') \bigr| \le M^F_{mm} W_1(m, m') + M^F_{mx} |x - x'| \tag{2.2}$$
for some constants $M^F_{mm}, M^F_{mx} \ge 0$. For each $m \in \mathcal P_2(\mathbb R^d)$ we define a probability measure $\Pi^x(m)$ on $\mathbb R^d$ by $\Pi^x(m)(dx) \propto \exp\bigl(-\frac{\delta F}{\delta m}(m, x)\bigr)\,dx$ and suppose $\Pi^x(m)$ satisfies the $\rho_x$-logarithmic Sobolev inequality (LSI) (2.3), uniformly in $m$, for some $\rho_x > 0$. Finally, for some of the results we additionally suppose that $F$ is third-order differentiable in measure, with its second and third-order derivatives in measure bounded in terms of a constant $M^F_{mmm}$ (2.4).
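The uniform LSI assumption (2.3) can be spelled out in the standard entropy–Fisher-information form; the normalization of $\rho_x$ below is one common convention and may differ from the paper's exact display:

```latex
% Uniform log-Sobolev assumption, written in entropy--Fisher form
% (the placement of the constant \rho_x is one standard convention):
\[
  \forall\, m \in \mathcal P_2(\mathbb R^d),\ \forall\, \mu \in \mathcal P_2(\mathbb R^d):
  \qquad
  H\bigl(\mu \,\big|\, \Pi^x(m)\bigr)
  \;\le\; \frac{1}{2\rho_x}\, I\bigl(\mu \,\big|\, \Pi^x(m)\bigr),
  \quad\text{where }
  \Pi^x(m)(\mathrm dx) \propto \exp\Bigl(-\tfrac{\delta F}{\delta m}(m, x)\Bigr)\mathrm dx .
\]
```

The key point for the sequel is that $\rho_x$ does not depend on $m$, which is what makes the resulting hypocoercivity estimates uniform in $N$.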
Definition of $\bar m$ and functional inequalities. For each $m \in \mathcal P(\mathbb R^{2d})$, we define $\bar m$ to be the probability on $\mathbb R^{2d}$ satisfying
$$\bar m(dx\,dv) \propto \exp\Bigl( -\frac{\delta F}{\delta m}(m^x, x) - \frac{|v|^2}{2} \Bigr)\,dx\,dv,$$
where $m^x$ is the spatial marginal of $m$. Sometimes we will abuse the notation and define, for a measure $m'^x \in \mathcal P_2(\mathbb R^d)$, the probability $\overline{m'^x}(dx\,dv) \propto \exp\bigl( -\frac{\delta F}{\delta m}(m'^x, x) - \frac{|v|^2}{2} \bigr)\,dx\,dv$. If $\Pi^x(m)$ satisfies (2.3) with the LSI constant $\rho_x$, then, setting $\rho$ as in (2.6), the $\rho$-LSI holds for $\bar m$. As a consequence, we have the Poincaré inequality and Talagrand's $T_2$ transport inequality: for every $\mu \in \mathcal P_2(\mathbb R^{2d})$,
$$W_2(\mu, \bar m)^2 \le \frac{2}{\rho}\, H(\mu \,|\, \bar m). \tag{2.9}$$
Mean field and particle system. We study the mean field kinetic Langevin dynamics, that is, the following McKean-Vlasov SDE:
$$dX_t = V_t\,dt, \qquad dV_t = -D_m F(m^x_t, X_t)\,dt - V_t\,dt + \sqrt 2\,dW_t. \tag{2.10}$$
Let $N \ge 2$. The corresponding $N$-particle system is defined by
$$dX^i_t = V^i_t\,dt, \qquad dV^i_t = -D_m F(\mu_{\boldsymbol X_t}, X^i_t)\,dt - V^i_t\,dt + \sqrt 2\,dW^i_t, \qquad i = 1, \ldots, N. \tag{2.11}$$
Here $W$ and $W^i$ are standard Brownian motions in $\mathbb R^d$, and $(W^i)_{i=1}^N$ are independent. Their marginal distributions $m_t = \operatorname{Law}(X_t)$ and $m^N_t = \operatorname{Law}(\boldsymbol X_t) = \operatorname{Law}(X^1_t, \ldots, X^N_t)$ solve respectively the Fokker-Planck equations (2.12) and (2.13). The mean field equation (2.12) is non-linear while the $N$-particle equation (2.13) is linear. We will show in Lemma 4.5 the wellposedness of the mean field dynamics (2.12) with initial conditions of finite second moment.
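A minimal sketch of the $N$-particle system (2.11) in action: below we discretize it by Euler-Maruyama for a toy functional whose intrinsic derivative depends on the empirical measure only through its mean, $D_m F(\mu, x) = x + 2\varepsilon\,(x - \operatorname{mean}(\mu))$. This toy interaction is chosen purely for illustration and is not claimed to satisfy the paper's convexity assumption (2.1).

```python
import numpy as np

# Euler-Maruyama sketch of the N-particle system (2.11) for a toy
# mean field drift D_m F(mu, x) = x + 2*eps*(x - mean(mu)).
# Once the empirical mean has relaxed to 0, each particle effectively
# feels the potential (1 + 2*eps)|x|^2 / 2, so E[X^2] relaxes to
# roughly 1 / (1 + 2*eps).
rng = np.random.default_rng(2)
N, dt, T, eps = 5000, 0.01, 20.0, 0.25
x = rng.normal(2.0, 1.0, N)   # particles start off-center
v = np.zeros(N)
for _ in range(int(T / dt)):
    drift = x + 2 * eps * (x - x.mean())   # D_m F evaluated at mu_{X_t}
    dw = rng.normal(0.0, np.sqrt(dt), N)
    x, v = x + v * dt, v - (drift + v) * dt + np.sqrt(2.0) * dw
print(x.mean(), np.mean(x ** 2))  # mean near 0, E[X^2] near 1/(1+2*eps)
```

Note that the only coupling between particles is the scalar `x.mean()`, which is exactly why, as $N \to +\infty$, each particle behaves like an independent copy of the McKean-Vlasov SDE (2.10): this is the propagation of chaos the paper quantifies.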
Remark 2.1. We have fixed the volatility and the friction constants to simplify the computations. In order to apply our results to the diffusion process defined by (2.14) with $\alpha, \gamma, \sigma > 0$, we introduce rescaled variables, define $m'$ to be the push-forward of the measure $m$ under $x \mapsto x'$, and rescale the mean field functional accordingly. The rescaled stochastic process then solves a dynamics of the form (2.10), where $W'_{t'} := \gamma^{1/2} W_t$ is a standard Brownian motion. In the same way we can treat the particle system defined by (2.15).
Free energies and invariant measures. For measures $m \in \mathcal P_2(\mathbb R^{2d})$, $m^N \in \mathcal P_2(\mathbb R^{2dN})$, we introduce the mean field and $N$-particle free energies (2.17). The functionals are well defined with values in $(-\infty, +\infty]$. We will also work with probability measures $m_\infty$ and $m^N_\infty$ having finite exponential moments, i.e. the integrals $\int \exp\bigl(\alpha(|x|+|v|)\bigr)\, m_\infty(dx\,dv)$ and $\int \exp\bigl(\alpha(|x|+|v|)\bigr)\, m^N_\infty(dx\,dv)$ are finite for every $\alpha \ge 0$. We call $m_\infty$, $m^N_\infty$ invariant measures of the dynamics (2.12) and (2.13) respectively. The existence and uniqueness of the invariant measures are guaranteed by our assumptions (2.1) to (2.3), as will be stated in Lemma 4.1.
Main results. Recall that $m_t$ and $m^N_t$ are the respective marginal distributions of the mean field system (2.10) and the $N$-particle system (2.11). We first prove the exponential entropic convergence result for the MFL dynamics (2.10).
Theorem 2.1 (Entropic convergence of MFL). Assume $F$ satisfies (2.1) to (2.3). If $m_0$ has finite second moment, finite entropy and finite Fisher information, then there exist constants such that the exponential decay estimate (2.20) holds. The proof of the theorem is postponed to Section 4.2. We note that the proof relies only on the $W_2$-Lipschitz continuity of $m \mapsto D_m F(m, x)$, rather than the $W_1$-Lipschitz continuity stated in (2.2).
Our second major contribution is the uniform-in-N exponential entropic convergence of the particle systems.
Theorem 2.2 (Entropic convergence of particle systems). Assume $F$ satisfies (2.1) to (2.3). If $m^N_0$ has finite second moment, finite entropy and finite Fisher information for some $N \ge 2$, then there exist constants such that the estimate (2.21) holds. The proof of the theorem is postponed to Section 4.3.
Remark 2.2. Strictly speaking, the result (2.21) does not imply that the particle systems converge uniformly. We only show that $\frac 1N F^N(m^N_t)$ approaches the mean field minimum $F(m_\infty)$ uniformly quickly until the two are $O(N^{-1})$-close to each other.
Remark 2.3. Theorems 2.1 and 2.2 state results concerning the convergence of the respective free energies, which we will also call "convergence of entropy" or "entropic convergence", since in both cases the differences of free energies are related to relative entropies, as shown in Lemmas 4.2 and 4.3.
We now present the main theorem, which establishes the uniform-in-time propagation of chaos in both the Wasserstein distance and the relative entropy. The results are direct consequences of the exponential convergence in Theorems 2.1 and 2.2 and the regularization phenomena to be studied in Section 5.
Theorem 2.3 (Wasserstein and entropic propagation of chaos). Assume $F$ satisfies (2.1) to (2.4). If $m_0$ belongs to $\mathcal P_2(\mathbb R^d)$ and $m^N_0$ belongs to $\mathcal P_2(\mathbb R^{dN})$ for some $N \ge 2$, then there exist constants such that the Wasserstein bound (2.22) holds; moreover, the entropic bound (2.23) holds for every $t$ and $s$ such that $s \le t$. The proof of the theorem is postponed to Section 5.4.
Comments on the assumptions. Compared to our previous work [7], we have removed the technical assumption that $x \mapsto D_m F(m, x)$ has bounded higher-order derivatives, by means of a mollifying procedure for the mean field functional. However, the spatial Lipschitz constant $M^F_{mx}$, appearing in assumption (2.2), contributes to the constants, especially the rate of convergence $\kappa$, in our theorems. Nevertheless, this behavior is expected for kinetic dynamics, as this dependency is already present for the linear Fokker-Planck dynamics in [40]. Finally, we introduce the new condition (2.4) on the second and third-order derivatives in measure of the mean field functional. The condition (2.4) is used to obtain $O(1)$ errors in the propagation of chaos bounds (2.22) and (2.23) in Theorem 2.3, which are stronger than the dimension-dependent errors obtained from the method of Fournier and Guillin [12].

Application: training neural networks by momentum gradient descent
In Section 3 of our previous work [7] we have given several examples of mean field functionals satisfying conditions (2.1) to (2.3) of our theorems, and the only additional condition that remains to be verified is the bound on the higher-order measure derivatives (2.4). In the following we recall the mean field formulation of the loss of two-layer neural networks and its corresponding kinetic dynamics (see [7, Examples 2 and 4]), and verify that it indeed satisfies the additional assumption.

Mean field formulation of neural network
Recall that the structure of a two-layer neural network is determined by its feature map $\Phi(S; \cdot)$, where $S$ is the parameter of a single neuron, $\varphi : \mathbb R \to \mathbb R$ is a non-linear activation function satisfying the squashing condition (see [7, (3.4)]), and $\ell : \mathbb R \to [-L, L]$ is a truncation function with threshold $L \in (0, +\infty)$. Here the action of the truncation is tensorized. Then, given $N$ neurons with respective parameters $\theta_1, \ldots, \theta_N$, the associated network's output reads
$$\Phi_N(\theta_1, \ldots, \theta_N; z) = \frac 1N \sum_{i=1}^N \Phi(\theta_i; z).$$
Here $z$ should be considered as the input of the network, i.e. the feature, and the value $\Phi_N(\theta_1, \ldots, \theta_N; z)$ should correspond to the label. We wish to find the optimal neuron parameters $(\theta_i)_{i=1}^N$ for a possibly unknown distribution of feature-label tuples $\mu \in \mathcal P(\mathbb R^{d+d'})$. In order to quantify the goodness of networks, we define the loss (3.2). It is proposed in [18,7] that, instead of minimizing the original loss (3.2), we consider the mean field output function $\mathbb E_{\Theta \sim m}[\Phi(\Theta; \cdot)]$ and minimize the mean field loss, to which we add a quadratic regularizer with regularization parameter $\lambda > 0$; this yields the final optimization problem. Following the calculations in [7], we can show that if both the truncation and the activation function are bounded and have bounded derivatives up to second order, then the conditions (2.1) to (2.3) are verified. Finally, the third-order derivative $\frac{\delta^3 F}{\delta m^3}$ is constant thanks to the fact that the loss function is quadratic, and therefore the condition (2.4) is satisfied with $M^F_{mmm} = 0$.
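The mean field formulation above can be sketched numerically. The concrete feature map below is a hypothetical instantiation chosen only for illustration: each neuron $\theta = (c, a, b)$ contributes $\Phi(\theta; z) = \ell(c)\,\varphi(a \cdot z + b)$ with $\varphi = \tanh$ (a squashing activation) and $\ell(s) = L \tanh(s/L)$ (a smooth truncation into $[-L, L]$); the paper's exact map may differ.

```python
import numpy as np

# Hypothetical instantiation of the mean field two-layer network:
# neuron theta = (c, a, b), Phi(theta; z) = ell(c) * tanh(a.z + b),
# with the smooth truncation ell(s) = L * tanh(s / L) into [-L, L].
L, lam = 5.0, 0.01
rng = np.random.default_rng(3)

def network_output(thetas, z):
    """Empirical mean field output: average of Phi(theta_i; z) over neurons."""
    c, a, b = thetas[:, 0], thetas[:, 1:-1], thetas[:, -1]
    ell_c = L * np.tanh(c / L)                  # truncated outer weights
    return np.mean(ell_c * np.tanh(a @ z + b))  # average over the N neurons

def regularized_loss(thetas, zs, ys):
    """Quadratic loss plus the quadratic regularizer lam * E|Theta|^2."""
    preds = np.array([network_output(thetas, z) for z in zs])
    return np.mean((preds - ys) ** 2) + lam * np.mean(np.sum(thetas ** 2, 1))

thetas = rng.normal(size=(64, 4))   # 64 neurons, inputs in R^2
z = rng.normal(size=2)
print(abs(network_output(thetas, z)) <= L)  # the output is truncated into [-L, L]
```

Since the loss is a quadratic function of the mean field output, $\frac{\delta^3 F}{\delta m^3}$ indeed vanishes for this kind of functional, which is the observation used above to verify (2.4).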
Remark 3.1. Following [7, Remark 3.6], we recognize that the SDE (2.10) describes the continuous version of the gradient descent algorithm with momentum. Among the various momentum gradient descent methods commonly used to train neural networks, the most prevalent are the RMSProp and Adam algorithms (see [17,22]), where the momentum is accumulated and the step size is adapted along the dynamics. In [26,37,31] the authors studied the convergence of these momentum-based algorithms and compared them to algorithms without momentum based on optimization theory. We note that estimates of the discretization error and optimal parameters can also be found in these studies.

Numerical experiments
We present our numerical experiments in this section. Our experiments are based on the discretized version of the particle system dynamics (2.15). We first explain the optimization problem and the numerical algorithm, and then present our two experiments: the first investigates the convergence behavior as the number of particles tends to infinity, and the second compares the kinetic dynamics to the corresponding overdamped dynamics.
Problem setup and momentum algorithm. We aim to solve a supervised learning problem: our goal is to classify the handwritten digits "4" and "6" by a two-layer neural network. We randomly choose $K = 10^4$ samples from the MNIST dataset [25] and denote by $(z_k)_{k=1}^K$ the figures in $28 \times 28$ pixel format, i.e. each $z_k$ belongs to $\mathbb R^{28 \times 28} = \mathbb R^{784}$, and by $(y_k)_{k=1}^K$ the one-hot vectors for the two classes. The truncation has threshold $L$, and the quadratic regularization parameter is denoted by $\lambda$. Following the arguments of [7] and the preceding section, all the conditions (2.1) to (2.4) of our theorems are satisfied. At the beginning of the training process, the neuron positions and velocities are sampled independently from given initial distributions $m^x_0, m^v_0 \in \mathcal P(\mathbb R^2 \times \mathbb R^{784} \times \mathbb R)$. We update the parameters following the discrete-time version of the underdamped Langevin SDE (2.14) with a fixed set of parameters $(\alpha, \gamma, \sigma)$; that is, we calculate the neurons' evolution by Algorithm 1.
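One natural discrete-time version of the underdamped Langevin SDE (2.14) is the Euler-Maruyama scheme sketched below, i.e. noisy momentum gradient descent; the quadratic stand-in loss and the specific step size are illustrative assumptions, and Algorithm 1 in the paper may discretize the SDE differently.

```python
import numpy as np

# Sketch of one natural Euler-Maruyama discretization of the underdamped
# Langevin SDE (2.14) with parameters (alpha, gamma, sigma): momentum
# gradient descent plus Gaussian noise.  The quadratic loss |x|^2/2 below
# is a stand-in for the neural network loss.
alpha, gamma, sigma = 1.0, 2.0, 0.05
eta = 0.01                        # step size
rng = np.random.default_rng(4)

def grad_loss(x):                 # stand-in gradient (quadratic loss |x|^2/2)
    return x

x = np.full(10, 5.0)              # parameters, far from the minimizer 0
v = np.zeros(10)                  # momentum (velocity) variables
loss0 = 0.5 * np.sum(x ** 2)
for _ in range(2000):
    xi = rng.normal(size=10)
    x = x + eta * alpha * v                                   # position update
    v = v - eta * (grad_loss(x) + gamma * v) \
          + sigma * np.sqrt(eta) * xi                         # momentum update
print(0.5 * np.sum(x ** 2))       # near the noise floor, far below loss0
```

With $\sigma$ small, the iterates settle near the minimizer up to a noise floor of order $\sigma^2$; this is the finite-particle, discrete-time analogue of the entropic convergence stated in Theorem 2.2.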
Convergence when $N \to +\infty$. To study the behavior of the momentum training dynamics when $N \to +\infty$, we conduct independent experiments with an increasing number of particles, $N = 2^P$ for $P = 5, 6, \ldots, 10$, and repeat the experiment 10 times for each $N$. The hyperparameters for this experiment are listed in the second column of Table 1.
To quantify the convergence, we compute the quantities of interest and average them over the 10 repeated runs. The evolutions of $\frac 1N F^N_{\mathrm{NNet}}$ and of the second quantity are plotted in Figure 2 and Figure 3 respectively, and can be characterized by two distinct phases. In the first phase, both quantities decrease, and the second quantity decreases exponentially, for every $N$. We also find that in this phase the convergence rates are almost the same for different $N$, which is coherent with the behavior indicated by our theoretical upper bound (2.21). We also observe that $\frac 1N F^N_{\mathrm{NNet}}$ fluctuates more strongly than the second quantity. In the second phase, both values cease to decrease, but the remnant values differ for different $N$.
To investigate the relationship between the remnant values in the second phase and the number of particles $N$, we compute their average values.
Comparison to the algorithm without momentum. We also investigate the difference between gradient descent algorithms with and without momentum by working with the same set of hyperparameters, listed in the last column of Table 1. It is found that the algorithm with momentum leads to much stronger fluctuations compared to the algorithm without momentum (see Figure 5). Both algorithms cease to decrease after a certain number of training epochs, but the momentum algorithm leads to a better loss in the end. This may be explained by the fact that the presence of momentum helps the particles escape local minima.

Entropic convergence

Collection of known results
Before moving on to the proofs, we first state some elementary results without proof. They are either immediate consequences of the corresponding results in our previous work [7], or easy adaptations thereof.
where $\rho$ is defined by (2.6). Here, the leftmost inequality holds even without the uniform LSI condition (2.3), as long as there exists a measure $m_\infty$ satisfying (2.18) and having finite exponential moments.
Lemma 4.3 (Particle system's entropy inequality). Assume that $F$ satisfies (2.1) and that there exists a measure $m_\infty \in \mathcal P_2(\mathbb R^{2d})$ verifying (2.18). Then for all $m^N \in \mathcal P_2(\mathbb R^{dN})$ of finite entropy, the corresponding entropy inequality holds.
Lemma 4.4 (Information inequalities). Let $X_1, \ldots, X_N$ be measurable spaces and $\mu$ be a probability on the product space $X_1 \times \cdots \times X_N$. Here we set the rightmost term to $+\infty$ if the conditional distribution $\mu^{i|-i}$ does not exist $\mu^{-i}$-a.e.

Mean field system
In this section we study the mean field system described by the Fokker-Planck equation (2.12) and the SDE (2.10). Our aim is to prove Theorem 2.1. To this end, we first show its wellposedness and regularity.
Lemma 4.5. Suppose $F$ satisfies (2.2). Then for every initial value $m_0$ of finite second moment, the equation (2.12) admits a unique solution in $C\bigl([0, \infty); \mathcal P_2(\mathbb R^d)\bigr)$. Moreover, for every $t > 0$, the measure $m_t$ is absolutely continuous with respect to the Lebesgue measure.
Proof. Since the drift $D_m F(\cdot, \cdot)$ of the SDE system (2.10) is jointly Lipschitz in measure and in space by our condition (2.2), the existence and uniqueness of the solution is standard.
To show the existence of the density, we recall Kolmogorov's fundamental solution of the kinetic equation; Duhamel's formula then holds in the sense of distributions (4.4). Since the first moment of $m_t$ is bounded, that is, for every $T > 0$, $\sup_{t \in [0,T]} \int (|v| + |x|)\, m_t(dx\,dv) < +\infty$, we can integrate by parts in the second term of (4.4). By explicit computations on the fundamental solution, the existence of the density follows.
We now introduce a technical condition on the mean field functional: the mapping $x \mapsto D_m F(m, x)$ is fourth-order differentiable with derivatives continuous in measure and in space and uniformly bounded (4.5). This condition will be used to derive some intermediate results in the following studies of the mean field dynamics.
Definition 4.6 (Standard algebra). We define the standard algebra $\mathcal A^+$ to be the set of $C^4$ functions $h : \mathbb R^{2d} \to (0, \infty)$ for which there exists a constant $C$ bounding $h$ and its derivatives in the sense of the definition. The following proposition shows that, when $n \to +\infty$, measures with relative density in $\mathcal A^+$ approximate any measure of finite second moment, entropy and Fisher information.
Proof. Let $\varepsilon$ be an arbitrary positive real number. Define the truncated density $h'_n$ and the associated probability measure $m'_n$; in particular, the second moments of $(m'_n)_{n \in \mathbb N}$ are uniformly bounded. Together with the fact that the density of $m'_n$ converges to that of $m$ pointwise, we have $m'_n \to m$ in $\mathcal P_2$. By the dominated convergence theorem, the entropies of the measures $m'_n$ converge as well. Moreover, we have the convergence of the Fisher information: $I(m'_n)$ converges to $I(m)$ when $n \to +\infty$, where we used the fact that the weak derivatives satisfy $\nabla h'_n = \nabla h \, \mathbb 1_{1/n \le h \le n}$. Hence we may choose $n_0 \in \mathbb N$ such that $m'_{n_0}$ approximates $m$ up to error $\varepsilon$. On the other hand, computing the gradient, we obtain for some constant $C$ a pointwise bound; therefore the quotient $m''_n / m_\infty$ verifies the first condition of $\mathcal A^+$. We now verify the conditions on the derivatives. For each term in the sum expressing the derivatives, we can bound its first part by the same method used to verify the first condition of $\mathcal A^+$. Moreover, since our assumptions (2.2) and (4.5) imply bounds on the higher-order derivatives, the second part of each term of the sum, $m_\infty \nabla^{k-j} m_\infty^{-1}$, is of polynomial growth. The proof is then complete.
We then show the stability of the set $\mathcal A^+$ under the mean field flow. This property will be used to justify the computations in the proof of Theorem 2.1, as is usual in the analysis of PDEs.
In particular, m t is a classical solution to the Fokker-Planck equation.
Proof. In the following, $C$ will denote a constant depending on $M^F_{mx}$, $M^F_{mm}$, the initial value $h_0 = h(0, \cdot) := m_0/m_\infty$, the time interval $T$ and the bounds on the higher-order derivatives $\max_{k=2,3,4} \sup_{m,x} \bigl| \nabla^k D_m F(m, x) \bigr|$, and it may change from line to line. For a given quantity $Q$, we denote by $C_Q$ a constant depending additionally on $Q$.
We also define $h_t(x) = m_t(x)/m_\infty(x)$; the relative density solves the parabolic equation (4.6). We construct, for every $z = (x, v) \in \mathbb R^{2d}$, the stochastic process $Z^{t,z}_s = (X^{t,z}_s, V^{t,z}_s)$ solving the characteristic SDE associated with (4.6).
Regularity of $Z^{t,z}_s$. Set $M^{t,z} = \sup_{s \in [0,t]} |Z^{t,z}_s|$. By Itō's formula and Doob's maximal inequality, the processes satisfy exponential moment bounds for every $\alpha \ge 0$. Thanks to the assumption on the uniform boundedness of the higher-order derivatives (4.5), the mapping $z \mapsto Z^{t,z}_s$ is $C^4$ and the partial derivatives solve the Cauchy-Lipschitz SDEs for $k = 1, 2, 3, 4$, where $B_{k,j}$ is a $(k-j+1)$-variate polynomial and in particular $B_{k,1}(x_1, \ldots, x_k) = x_k$; the initial values of these SDEs read accordingly. The Feynman-Kac formula for the parabolic equation (4.6) reads (4.9). Using the method in the proof of [7, Proposition 4.12], we can prove the moment estimates (4.8). Moreover, thanks to the estimates (4.8), we can apply the dominated convergence theorem to the Feynman-Kac formula (4.9) and obtain that $z \mapsto h(t, z)$ belongs to $C^4$ with partial derivatives expressed through $j$-variate polynomials $P_j$. We conclude by applying these bounds to $f = h(0, \cdot)$.
Remark 4.1. The polynomials appearing in the previous proof belong to the non-commutative free algebras over $\mathbb R$ in the respective numbers of indeterminates, instead of the usual polynomial rings, as the tensor product is not commutative.
After these technical preparations, we prove Theorem 2.1.
Proof of Theorem 2.1. Suppose first that the mean field functional $F$ additionally satisfies (4.5) and that the initial value of the dynamics is such that $m_0/m_\infty$ belongs to $\mathcal A^+$, the standard algebra defined in Definition 4.6. According to Proposition 4.8, the measure $m_t$ belongs to $\mathcal A^+$ uniformly in $t \in [0, T]$, for every $T > 0$. In view of the bounds on $\bar m_t$, the alternative relative density $\eta_t(z) := m_t(z)/\bar m_t(z)$ is $C^4$ in $z$, and there exists a constant $M_T$ such that (4.10) holds for every $(t, z) \in [0, T] \times \mathbb R^{2d}$. The constant $M_T$ may change from line to line in the following.
In the following we adopt the abstract notations introduced by Villani in his seminal work on hypocoercivity [40]. Define the operators $A_t$ and $B_t$ accordingly, and set $L_t = A_t^* A_t + B_t$ and $u_t = \log \eta_t$; the Fokker-Planck equation (2.12) now reads as an abstract evolution equation.
Adding the anisotropic Fisher information. Let $a, b, c$ be positive reals to be determined. We define the hypocoercive Lyapunov functional (4.12), whose first term is the free energy. We also denote the sum of the last three terms in (4.12) by $I_{a,b,c}(m_t | \bar m_t)$, so that $E(m_t) = F(m_t) + I_{a,b,c}(m_t | \bar m_t)$. Thanks to Proposition 4.8, and in particular the bound (4.10), we can show that the quantity $E(m_t)$ is well defined for every $t \ge 0$ and is continuous in $t$. We will show in the following that $t \mapsto E(m_t)$ is in fact absolutely continuous, and we calculate its almost-everywhere derivative. To this end, for every $t > 0$ and every $h \ge -t$, we decompose the increment of $E$ into the contributions $\Delta_1$, $\Delta_2$, $\Delta_3$.
Contributions from $\Delta_1$ and $\Delta_2$. We first calculate the contribution from $\Delta_1$. Using the Fokker-Planck equation (2.12) and the bounds (4.10), one has $|\Delta_1| \le M_T h$ for every $t$, $h$ such that $t$ and $t + h$ belong to $[0, T]$; moreover, by the dominated convergence theorem, the limit $\lim_{h \to 0} \Delta_1/h$ exists for almost every $t > 0$, where the right-hand side is continuous in $t$. The above inequality then holds for every $t > 0$.
Define the $4 \times 4$ matrix $K_1$ and denote the Hilbertian norm accordingly. Then, for almost every $t > 0$, $\lim_{h \to 0} \Delta_1/h = -Y_t^{\mathsf{T}} K_1 Y_t$.

Next we calculate the contribution from $\Delta_2$. Arguing as we did for $\Delta_1$, we again have $|\Delta_2| \le M_T |h|$. Applying the dominated convergence theorem and computing as in the proofs of [40, Lemma 32 and Theorem 18], we obtain that for almost every $t > 0$ the limit $\lim_{h \to 0} \Delta_2/h$ exists and is bounded above by $-Y_t^{\mathsf{T}} K_2 Y_t$, where $K_2$ is given explicitly.

Contributions from $\Delta_3$. We now calculate the last term. For each $z \in \mathbb{R}^{2d}$, $\nabla \log \tilde m_t(z)$ is continuous in $t$, and it is absolutely continuous once $t \mapsto m^x_t$ is absolutely continuous with respect to the $W_2$ distance in the sense of [2, Definition 1.1.1]. Let us show the latter. Integrating out the velocity component in the Fokker-Planck equation (2.12), we obtain the continuity equation (4.14), whose vector field involves the average velocity at the spatial point $x$. The $L^2$ norm of the vector field in the continuity equation (4.14) satisfies a bound whose first inequality is due to Cauchy-Schwarz. Applying [2], we then obtain $|\Delta_3| \le M_T |h|$. Moreover, by the dominated convergence theorem and applying Cauchy-Schwarz again, we obtain the corresponding upper bound for almost every $t > 0$.

Hypocoercivity. Our bounds on $\Delta_k$, $k = 1, 2, 3$, establish that $t \mapsto E(m_t)$ is absolutely continuous (locally Lipschitz, in fact), with its almost-everywhere derivative satisfying the matrix inequality. As at the end of the proof of [40, Theorem 18], we can pick constants $a, b, c > 0$, depending only on $M^F_{mx}$ and $M^F_{mm}$, such that $ac > b^2$ and the matrix $K$ is positive definite. Let $\alpha$ be the smallest eigenvalue of $K$.
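The conclusion then follows a standard Grönwall pattern. A hedged sketch of the final step (here $\kappa$ is an assumed constant absorbing $\alpha$ and the logarithmic Sobolev constant of $m_\infty$; the paper's precise statement is (4.15)):

```latex
\frac{\mathrm{d}}{\mathrm{d}t}\, \mathcal{E}(m_t)
\;\le\; -\, Y_t^{\mathsf{T}} K \, Y_t
\;\le\; -\,\alpha\, |Y_t|^2
\;\le\; -\,\kappa \bigl( \mathcal{E}(m_t) - \mathcal{F}(m_\infty) \bigr),
```

whence $\mathcal{E}(m_t) - \mathcal{F}(m_\infty) \le e^{-\kappa (t-s)} \bigl( \mathcal{E}(m_s) - \mathcal{F}(m_\infty) \bigr)$ for $t \ge s \ge 0$.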
Then we obtain the differential inequality, and hence, for every $t \ge s \ge 0$, the exponential decay (4.15).

Approximation. We now show that inequality (4.15) holds without additional assumptions on the mean field functional $F$ and the initial value $m_0$. First, suppose still that $F$ satisfies (4.5), but no longer that $m_0$ is such that $m_0/m_\infty \in \mathcal{A}_+$. The initial value $m_0$ belongs to $\mathcal{P}_2(\mathbb{R}^{2d})$ and both $H(m_0)$ and $I(m_0)$ are finite, so thanks to Proposition 4.7 we can pick a sequence of measures $(m'_{n,0})_{n \in \mathbb{N}}$, each belonging to $\mathcal{A}_+$, converging appropriately. As proved above, inequality (4.15) holds for each flow $(m'_{n,t})_{t \ge 0}$. By the continuity of the SDE system (2.10) with respect to the initial value, we also have $m'_{n,t} \to m_t$ in the weak topology of $\mathcal{P}_2$. We recall in Lemma A.1 that both the entropy and the Fisher information are lower semicontinuous with respect to the weak topology of $\mathcal{P}_2$. Taking the lower limit on both sides of the inequality above, we obtain (4.15) with $s = 0$ for the original flow $(m_t)_{t \ge 0}$.
Second, we no longer require $F$ to satisfy (4.5), and set $F_k(m) = F(m \star \rho_k)$ for a sequence of smooth, symmetric mollifiers $(\rho_k)_{k \in \mathbb{N}}$ on $\mathbb{R}^d$ with $\operatorname{supp} \rho_k \subset B(0, 1/k)$. The linear derivative of the regularized mean field functional reads accordingly, and here $\tilde m$ should be understood as the Gibbs-type measure defined with $F_k$ instead of $F$. Let $(m''_{k,t})_{t \ge 0}$ be the flow of measures driven by $F_k$ with initial value $m''_{k,0} = m_0$. Our previous result yields, for every $t \ge 0$, the corresponding decay, where $\tilde m''_{k,s}$ is the probability measure proportional to the associated Gibbs density. From the bound (4.16) we deduce that $m''_{k,t} \to m_t$ in $\mathcal{P}_2$ for every $t \ge 0$, by the synchronous coupling result in Lemma 5.1. Taking the lower limit on both sides of the previous inequality, we obtain that inequality (4.15) with $s = 0$ holds for general initial values and general mean field functionals. In particular, for every $t \ge 0$ the measure $m_t$ has finite entropy and finite Fisher information. We then apply the same argument to the flow with initial value $m_s$ and obtain inequality (4.15) for general $s \ge 0$.
Remark 4.2. Our Theorem 2.1 can be compared to [40, Theorem 56], where kinetic mean field Langevin dynamics with two-body interactions are studied and $O(t^{-\infty})$ entropic convergence to equilibrium is shown, under the assumption that the mean field dependence is small. This restriction is lifted by our method, which leverages the functional convexity.
Remark 4.3. The regularized energy functional has bounded derivatives of every order. However, $m \mapsto D_m F_k(m, x)$ remains only Lipschitz continuous, and we are not aware of an approximation argument that would yield differentiability in the measure argument. Consequently, we still use the result from [2] to handle this low regularity.

Particle system
In this section we study the system of particles described by the linear Fokker-Planck equation (2.13) and the SDE (2.11). Since the dynamics is linear, its wellposedness is classical and we omit the proof. We first show that for our model we can construct hypocoercive functionals whose constants are independent of the number of particles.

Lemma 4.9 (Uniform-in-$N$ hypocoercivity). Assume $F$ satisfies (2.2) and there exists a measure $m^N_\infty$ satisfying (2.19) and having finite exponential moments. Let $t \mapsto m^N_t$ be a solution to the $N$-particle Fokker-Planck equation (2.13) in $C([0, T]; \mathcal{P}_2(\mathbb{R}^{2dN}))$ whose initial value $m^N_0$ has finite entropy and finite Fisher information. Then there exist constants $a, b, c, \alpha > 0$, depending only on $M^F_{mx}$ and $M^F_{mm}$, such that $ac > b^2$ and the functional defined below satisfies the decay estimate for every $t \ge s \ge 0$.

Proof. We first show that condition (2.2) implies a bound on the second-order derivatives of $x \mapsto U^N(x) := N F(\mu_x)$. The first-order derivatives satisfy the identity below; summing over $i$, we obtain, for every $\varepsilon > 0$, the corresponding estimate. From the Lipschitz bound we obtain (4.19). Now suppose there exists a constant $M$ such that $U^N$ satisfies (4.20) and that the bound (4.21) holds for every $z \in \mathbb{R}^{2dN}$. We apply Proposition 4.8 to show that, under our assumptions, there exists a constant $M_T$ such that (4.22) holds for every $(t, z) \in [0, T] \times \mathbb{R}^{2dN}$ (in fact, $z \mapsto h^N_t(z)$ remains bounded from below and above, and its derivatives up to fourth order grow at most polynomially).
We denote $u^N_t = \log h^N_t$. In view of the regularity bound (4.22), we have the identity computed in [40]. Denote the Hilbertian norm accordingly and define the four-dimensional vector $Y^N_t$ by (4.23). By Cauchy-Schwarz we have an estimate in which $\|\nabla^2 U^N\|_{\mathrm{op},\infty}$ is bounded by (4.19). We then apply the same argument as in the proof of Theorem 2.1 to pick $a, b, c$ such that $ac > b^2$ and $K$ is positive definite with smallest eigenvalue $\alpha > 0$. The desired inequality (4.18) follows.
We now show that inequality (4.18) holds for a general mean field functional $F$ and initial value $m^N_0$. First, suppose still that $U^N$ satisfies the bound (4.20), but no longer that $m^N_0$ satisfies (4.21). As $m^N_0$ has finite second moment, finite entropy and finite Fisher information, we can find a sequence of measures $(m'^N_{n,0})_{n \in \mathbb{N}}$, each satisfying the bound (4.21), by the procedure in the proof of Proposition 4.7. We have the convergence $m'^N_{n,t} \to m^N_t$ in $\mathcal{P}_2$. Taking the lower limit on both sides of the inequality for $s = 0$, we conclude thanks to the $\mathcal{P}_2$-continuity of $F$ and the $\mathcal{P}_2$-lower semicontinuity of entropy and Fisher information, proved in Lemma A.1. Second, we no longer suppose $U^N$ satisfies the bound (4.20), and mollify $U^N$ by a sequence of smooth mollifiers $(\rho_k)_{k \in \mathbb{N}}$ on $\mathbb{R}^{dN}$. Then $U^N_k$ is $C^4$ and its second- and fourth-order derivatives satisfy the corresponding bounds. For every $t \ge 0$ we take the lower limit on both sides to obtain (4.18) with $s = 0$ for general initial values and general mean field functionals. In particular, this implies that for every $t \ge 0$, $m^N_t$ has finite entropy and finite Fisher information. We then apply the same argument to the flow with $m^N_s$ as initial value and obtain (4.18) for general $s \ge 0$.

Remark 4.5. If we additionally assume a uniform-in-$N$ LSI for $m^N_\infty$, then we can directly establish the exponential decay with a constant $\kappa > 0$ independent of $N$. This approach has been explored in a number of previous works. We do not impose such an assumption, or sufficient conditions for it, as they often require the mean field interaction to be small or (semi-)convex enough, excluding the application to neural networks in Section 3.
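As a concrete illustration of the particle dynamics (2.11) studied in this section, the following minimal sketch simulates a kinetic Langevin system by an Euler-Maruyama scheme. The function name, the friction coefficient `gamma` and the noise normalization `sigma` are our own illustrative choices, not the paper's; `grad_U` stands for the gradient of a confining potential such as $U^N$:

```python
import numpy as np

def kinetic_langevin(grad_U, x0, v0, gamma=1.0, sigma=np.sqrt(2.0),
                     dt=1e-2, n_steps=1000, rng=None):
    """Euler-Maruyama discretization of the kinetic Langevin SDE
        dX_t = V_t dt,
        dV_t = -grad_U(X_t) dt - gamma * V_t dt + sigma * dW_t,
    run for all particles at once (x0, v0 are arrays of shape (N, d))."""
    rng = np.random.default_rng() if rng is None else rng
    x, v = np.array(x0, dtype=float), np.array(v0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(v.shape)
        x = x + v * dt                       # position update
        v = (v - grad_U(x) * dt - gamma * v * dt
             + sigma * np.sqrt(dt) * noise)  # velocity update with friction
    return x, v
```

With `sigma = 0` this reduces to deterministic momentum descent on the potential; with `sigma > 0` the pair $(X, V)$ approximately samples a Gibbs measure of the form $\exp(-U(x) - \frac{1}{2}|v|^2)$, mirroring the invariant measures discussed above.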
We now give the proof of Theorem 2.2. The method is similar to that of [7, Theorem 2.3]; we only need to take into account the additional kinetic terms. We nevertheless give a complete proof for the sake of self-containedness.
Proof of Theorem 2.2. We pick positive constants $a, b, c, \alpha$, depending only on $M^F_{mx}$ and $M^F_{mm}$, such that $ac > b^2$ and (4.18) holds for every $t \ge 0$, according to Lemma 4.9. Then, as in the proof of [7, Theorem 2.3], we establish a lower bound on the relative Fisher information $I_t := I(m^N_t \mid m^N_\infty)$ in order to obtain the desired result.
Regularity of the conditional distribution. By local hypoelliptic positivity (see e.g. [40, Theorem A.19 and Corollary A.21]), we know that for every $t > 0$ the marginal density $m^{N,-i}_t(z^{-i})$ is strictly positive by the local positivity of $m^N_t$ and lower semicontinuous by Fatou's lemma. By Fubini's theorem, together with the lower semicontinuity, we obtain that $m^{N,-i}_t(z^{-i})$ is finite everywhere. We are therefore able to define the conditional probability density $m^{N,i|-i}_t$, which is weakly differentiable in $z^i$ and strictly positive everywhere. We can also define the conditional density for the invariant measure $m^N_\infty$; its regularity follows directly from the explicit expression of $m^N_\infty$.
Decomposing the Fisher information componentwise. Using the conditional distributions, we decompose the relative Fisher information accordingly.

Change of empirical measure and componentwise LSI. We first replace the empirical measure and apply the componentwise LSI to obtain (4.27).

Another change of empirical measure. We next replace $\mu_{x^{-i}}$ by $\mu_x$ in (4.27). Define $\delta^i_2(x; y) := \frac{\delta F}{\delta m}(\mu_{x^{-i}}, y) - \frac{\delta F}{\delta m}(\mu_x, y)$ and the second error term (4.28). Taking expectations on both sides of (4.27), we obtain the corresponding inequality. Thanks to the convexity of $F$, the first term satisfies the tangent inequality; for the second term we apply the information inequality (4.3). Hence, by the definition of the free energies and using (4.26), we obtain the desired estimate.

Bounding the errors $\Delta_1$, $\Delta_2$. We construct a transport plan between $\mu_x$ and $\mu_{x^{-i}}$ and first treat the error $\Delta_1$. Under the $L^2$-optimal transport plan, the first term is controlled directly, and hence the first error satisfies the stated bound. Now we treat the second error $\Delta_2$. The Lipschitz constant of $y \mapsto \delta^i_2(x; y) = \frac{\delta F}{\delta m}(\mu_{x^{-i}}, y) - \frac{\delta F}{\delta m}(\mu_x, y)$ is controlled as above. We use Fubini's theorem to first integrate $z'$ in the definition of the second error (4.28), and let $\tilde Z'_\infty$ be independent from $Z_t$. Using the same method as for $\Delta_1$, we control the first term; for the second term we work again under the $L^2$-optimal plan and let $\tilde Z'_\infty$ remain independent from the other variables. Thanks to the Poincaré inequality (2.8) for $m_\infty = \tilde m_\infty$, its spatial variance is controlled. Using the $T_2$-transport inequality (2.9) for $m^{\otimes N}_\infty$ and the entropy sandwich in Lemma 4.3, we bound the transport cost. In the end we conclude by applying Grönwall's lemma, as at the end of the proof of Theorem 2.1.

Short-time behaviors and propagation of chaos
Our proof of the main theorem on the uniform-in-time propagation of chaos (Theorem 2.3) relies on the exponential convergence in Theorems 2.1 and 2.2, where the initial conditions are required to have finite entropy and finite Fisher information. In this section we demonstrate that the non-linear kinetic Langevin dynamics exhibits the same short-time regularization effects as the linear one, since the contributions from the non-linearity can be controlled. We first show the short-time Wasserstein propagation of chaos using synchronous coupling.
We then adapt the regularization results for the linear dynamics to our setting and show that, for initial values of finite second moment, the entropy and the Fisher information of the flow are finite at every positive time; the short-time Wasserstein propagation of chaos also plays a role here. Finally, we combine all the estimates obtained to derive Theorem 2.3.

Synchronous coupling
We first show a lemma in which synchronous coupling is applied to general McKean-Vlasov diffusions. This lemma is also used to justify the approximation arguments in the proofs of Theorems 2.1 and 2.2. If there exist constants $M_m$, $M_z$ and a progressively measurable process $\delta$ satisfying the drift bound almost surely, then for every $t \in [0, T]$,
$$\mathbb{E}|Z_t - Z'_t|^2 \;\le\; e^{(2M_m + 2M_z)t}\, \mathbb{E}|Z_0 - Z'_0|^2 \;+\; e^{2t} \int_0^t e^{(2M_m + 2M_z)(t-s)}\, \mathbb{E}\,\delta_s^2 \,\mathrm{d}s.$$
Proof. From the uniform Lipschitz continuity of $\beta$ and $\beta'$ we have uniqueness in law and existence of strong solutions for both diffusions. We can therefore construct $(Z_t, Z'_t)_{t \in [0,T]}$ sharing the same Brownian motion. By Itō's formula, the bound (5.1), and Cauchy-Schwarz, we estimate the squared difference; taking expectations on both sides and applying Grönwall's lemma, we obtain the bound from which the desired inequality follows.
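The mechanism of the proof above — driving both diffusions with the same Brownian path so that the noise cancels in the difference process — can be illustrated numerically. A minimal sketch under our own naming conventions (the drifts, step size and seed are illustrative):

```python
import numpy as np

def synchronous_coupling(drift1, drift2, z1_0, z2_0, sigma=1.0,
                         dt=1e-3, n_steps=1000, seed=0):
    """Simulate dZ = drift(Z) dt + sigma dW for two drifts with the SAME
    Brownian increments, recording the gap |Z^1_t - Z^2_t| along the way."""
    rng = np.random.default_rng(seed)
    z1 = np.array(z1_0, dtype=float)
    z2 = np.array(z2_0, dtype=float)
    gaps = [float(np.linalg.norm(z1 - z2))]
    for _ in range(n_steps):
        dW = np.sqrt(dt) * rng.standard_normal(z1.shape)  # shared increment
        z1 = z1 + drift1(z1) * dt + sigma * dW
        z2 = z2 + drift2(z2) * dt + sigma * dW
        gaps.append(float(np.linalg.norm(z1 - z2)))
    return z1, z2, gaps
```

When the drifts and initial points coincide, the two paths agree exactly, and for a contractive drift the gap between distinct initial points decays deterministically, mirroring the Grönwall estimate of the lemma.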
Since the finite-time propagation of chaos does not depend on the gradient structure of the diffusions, we introduce a more general setting. Let $b$ be a mapping that is Lipschitz in space and velocity: there exist positive constants such that (5.2) holds. We suppose also that the functional derivatives $\frac{\delta b}{\delta m}$, $\frac{\delta^2 b}{\delta m^2}$ exist, with the following bounds: there exist positive constants $M^b_m$, $M^b_{mm}$ such that the corresponding estimates hold. We consider the mean field dynamics (5.5) and the corresponding particle system (5.6) for $i = 1, \ldots, N$. In both equations $W_t$, $W^i_t$ are standard Brownian motions and the $(W^i_t)_{i=1}^N$ are mutually independent. The dynamics (5.5) and (5.6) are well defined globally in time thanks to the Lipschitz continuity (5.2), and we denote by $P^*_t$ and $(P^N_t)^*$ the associated semigroups acting on measures.

Proof. Let us first prove the log-Harnack inequality (5.9) for compactly supported $m$ and $m^N$.
Constructing a bridge. Fix $T > 0$ and let $(\tilde X^i_t, \tilde V^i_t)_{i=1}^N$ be $N$ independent copies of the solution to (5.5) with initial condition $\mathrm{Law}(\tilde X^i_0, \tilde V^i_0) = m$ for $i = 1, \ldots, N$. We denote the $N$ independent Brownian motions by $\tilde W^i_t$. Enlarging the underlying probability space, we construct suitably coupled random variables $X_0$, $V_0$. Define for $i = 1, \ldots, N$ the stochastic processes given by (5.13). The difference processes satisfy (5.15); in particular $X^i_T = \tilde X^i_T$ and $V^i_T = \tilde V^i_T$.

Change of measure. Define the exponential local martingale $R_t$ involving a universal constant $C$; in the following, $C$ may change from line to line and depend on the stated constants. Then $(X^i, V^i, W^i)$ solves (5.6). Since $m$ and $m^N$ are both compactly supported, $X^i_0 - \tilde X^i_0$ and $V^i_0 - \tilde V^i_0$ are bounded almost surely. The difference in drift $\delta b^i_t$ has uniform linear growth in $X_t$, $V_t$, and therefore uniform linear growth in $\tilde X_t$, $\tilde V_t$. We then apply Lemma C.1 in the appendix to conclude that $R_\bullet$ is a true martingale. By Girsanov's theorem, the $W^i_t$ are independent Brownian motions under the new probability $\mathbb{Q} = R\,\mathbb{P}$. Since $X_0$, $V_0$, $\tilde X_0$, $\tilde V_0$ are independent from the Brownian motions, we obtain, for measurable functions $f^N : \mathbb{R}^{2dN} \to \mathbb{R}$ that are bounded from above and bounded from below away from $0$, the change-of-measure identity. Arguing as in the proof of Proposition 5.2, the log-Harnack inequality (5.9) follows for compactly supported $m^N$ and $m$.
Approximation. We now treat general $m^N$, $m$ of finite second moment, not necessarily compactly supported. Take two sequences $(m^N_k)_{k \in \mathbb{N}}$, $(m_k)_{k \in \mathbb{N}}$ of compactly supported measures such that $m^N_k \to m^N$ and $m_k \to m$ in the respective topologies of $\mathcal{P}_2$. For continuous $f^N$ such that $\log f^N$ is bounded, we can pass to the limit by the $\mathcal{P}_2$-continuity of $(P^N_t)^*$ and $P^*_t$. So the log-Harnack inequality (5.9) holds for every continuous $f^N$ that is bounded from above and below, and for general $m^N$ and $m$ of finite second moment. For a doubly bounded but not necessarily continuous $f^N$, we take a sequence of continuous, uniformly bounded functions $(f^N_k)_{k \in \mathbb{N}}$ converging to $f^N$ in the $\sigma(L^\infty, L^1)$ topology. The limit passes since both $(P^N_t)^* m^N$ and $P^*_t m$ are absolutely continuous with respect to the Lebesgue measure by Lemma 4.5. So the desired inequality (5.9) holds in full generality. Finally, to obtain (5.10), we define another approximating sequence $(g^N_k)_{k \in \mathbb{N}}$, apply the log-Harnack inequality (5.9) to each $g^N_k$, and take the limit $k \to +\infty$.
Using known results on log-Harnack inequalities, we can also obtain the regularization at the beginning of the dynamics.

Proposition 5.4. Assume $F$ satisfies (2.2) and there exist probability measures $m_\infty$, $m^N_\infty$ satisfying (2.18) and (2.19) respectively and having finite exponential moments. Let $m_0$ (resp. $m^N_0$) be the initial value of the mean field dynamics (2.12) (resp. the particle system dynamics (2.13)), of finite second moment. Then there exists a positive constant $C$, depending on $M^F_{mm}$ and $M^F_{mx}$, such that the stated bounds hold for every $t \in (0, 1]$. For the particle system we apply the classical log-Harnack inequality (which corresponds to the case $M^b_m = M^b_{mm} = 0$ in our Lemma 5.3, i.e. no mean field dependence) and obtain the analogous bound for $t \in (0, 1]$; it is clear from the computations in Lemma 5.3 that the constant $C$ can be chosen to depend only on $M^F_{mx}$ and $M^F_{mm}$.
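To fix ideas, log-Harnack inequalities yield Wasserstein-to-entropy regularization of the following shape. This is a hedged sketch: the short-time rate written here, $t^{-3}$, is the one typical of kinetic (hypoelliptic) dynamics, replacing the elliptic rate $t^{-1}$; the constant $C$ and the precise arguments of the entropy are as in the proposition:

```latex
H\bigl(P_t^* m \,\big|\, P_t^* m'\bigr) \;\le\; \frac{C}{t^{3}}\, W_2^2(m, m'),
\qquad t \in (0, 1],
```

so that a flow started from an initial value of merely finite second moment has finite entropy at every positive time.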

From entropy to Fisher information
We now adapt Hérau's functional to our setting to obtain the regularization from entropy to Fisher information.
Proposition 5.5. Assume that $F$ satisfies (2.1) and (2.2), and that there exist probability measures $m_\infty$, $m^N_\infty$ satisfying (2.18) and (2.19) respectively and having finite exponential moments. Let $m_0$ (resp. $m^N_0$) be the initial value of the mean field dynamics (2.12) (resp. the particle system dynamics (2.13)), of finite second moment and finite entropy. Then there exists a positive constant $C$, depending on $M^F_{mm}$ and $M^F_{mx}$, such that the stated bounds hold for every $t \in (0, 1]$.

Proof. We first derive the bound for the mean field system. We may additionally suppose that $F$ satisfies (4.5) and that $m_0/m_\infty \in \mathcal{A}_+$ without loss of generality, as these assumptions can be removed by the approximation argument at the end of the proof of Theorem 2.1. Let $a, b, c$ be positive constants to be determined. Motivated by [40, Theorem A.18], we define Hérau's Lyapunov functional for mean field measures, where $\eta := m/\tilde m$. From the argument of Theorem 2.1,
and $Y_t$ is defined by (4.13). We then choose the constants $a, b, c$, depending only on $M^F_{mx}$ and $M^F_{mm}$, such that $ac > b^2$ and $K'_t \succeq 0$ for $t \in [0, 1]$. Hence $t \mapsto E(t, m_t)$ is non-increasing on $[0, 1]$ and the Fisher bound follows for every $t \in (0, 1]$. Here, in the second inequality, we use $\mathcal{F}(m_t) \ge \mathcal{F}(m_\infty)$, which is a consequence of Lemma 4.2; note that this inequality relies on the convexity of $F$. For the particle system we additionally suppose that $U^N$ satisfies (4.20) and that $m^N_0/m^N_\infty$ satisfies (4.21), without loss of generality, as these assumptions can be removed by the argument at the end of the proof of Lemma 4.9. We define the corresponding functional, where $h^N := m^N/m^N_\infty$. By the computations in Lemma 4.9, we obtain the analogous derivative bound, with $Y^N_t$ defined by (4.23). We again choose constants $a, b, c$, depending only on $M^F_{mx}$ and $M^F_{mm}$, such that $ac > b^2$ and $t \mapsto H^N(t, m^N_t)$ is non-increasing on $[0, 1]$. Hence the bound follows for every $t \in (0, 1]$. Similarly, we use the fact that $\mathcal{F}^N(m^N_t) - \mathcal{F}^N(m^N_\infty) = H(m^N_t \mid m^N_\infty) \ge 0$ to get the second inequality. The difference here is that the $N$-particle system is linear, so this fact does not rely on the convexity of $F$.
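For the reader's convenience, Hérau-type functionals attach increasing powers of $t$ to the Fisher terms. A hedged sketch of the standard form (the weights follow [40, Theorem A.18]; the precise powers and lower-order terms used in the proof above may differ):

```latex
E(t, m) \;=\; \mathcal{F}(m)
\;+\; a\,t \int |\nabla_v u|^2 \,\mathrm{d}m
\;+\; 2b\,t^2 \int \langle \nabla_x u, \nabla_v u \rangle \,\mathrm{d}m
\;+\; c\,t^3 \int |\nabla_x u|^2 \,\mathrm{d}m,
\qquad u = \log \eta .
```

Monotonicity of $t \mapsto E(t, m_t)$ on $(0, 1]$ then converts a bound on the initial free energy into a bound on the full Fisher information at time $t$, at the price of a factor $t^{-3}$.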

Propagation of chaos
Using the regularization results proved in Sections 5.2 and 5.3, we can finally give the proof of the main theorem.
Proof of Theorem 2.3. Let $m_0$ and $m^N_0$ be the respective initial values of the dynamics (2.12) and (2.13), and suppose they have finite second moment. We first treat the claim (2.22) of the theorem. In the following, $C$ may change from line to line and may additionally depend on the LSI constant $\rho$. Applying the regularization of Proposition 5.5 to the dynamics with $m_{t_1}$ and $m^N_{t_1}$ as respective initial values, and noting that $t_2 - t_1 \le 1$ by definition, we obtain the Fisher bounds, whereas $\mathcal{F}(m_{t_1}) - \mathcal{F}(m_\infty)$ is bounded by the entropy sandwich in Lemma 4.2. Consequently, both measures $m_{t_2}$ and $m^N_{t_2}$ have finite entropy and finite Fisher information, and we can apply Theorems 2.1 and 2.2 respectively to the dynamics with initial values $m_{t_2}$ and $m^N_{t_2}$. Using successively the triangle inequality, Talagrand's inequality (2.9) for $m^{\otimes N}_\infty$ and the entropy inequalities in Lemmas 4.2 and 4.3, we obtain the remaining estimate, and inequality (5.19) is proved by combining the above three inequalities.
A Lower-semicontinuities

B Convergence of non-linear functionals of empirical measures
Let $\phi : \mathcal{P}_2(\mathbb{R}^d) \to \mathbb{R}$ be a (non-linear) mean field functional and let $m$ be a probability measure with finite second moment. We suppose that the first and second-order functional derivatives $\frac{\delta \phi}{\delta m}$, $\frac{\delta^2 \phi}{\delta m^2}$ exist and that $(\phi, m)$ satisfies the required bounds. For the second term we apply the argument of [39, Theorem 4.2.9 (i)] and obtain the corresponding estimate.

Figure 4: Average values of $\frac{1}{N} F^{\mathrm{NNet}} + \frac{1}{N} F^{\mathrm{Kinet}}$ over the last 500 epochs. The means (black squares) and standard deviations (error bars) are calculated from the 10 independent runs. The dashed curve fits the data.

Proposition 4.8 (Stability of $\mathcal{A}_+$ under the flow). Assume that $F$ satisfies (2.2) and (4.5) and that there exists a measure $m_\infty$ satisfying (2.18) and having finite exponential moments. Let $(m_t)_{t \in [0,T]} \in C([0, T]; \mathcal{P}_2(\mathbb{R}^d))$ be a solution in the sense of distributions to the mean field Fokker-Planck equation (2.12).

Remark 4.4. The constants $a, b, c$ are possibly different from those appearing in the proof of Theorem 2.1.
$F^{\mathrm{Kinet}}$ of the last 500 training epochs for each individual run and plot their values in Figure 4. Motivated by the upper bound (2.21) in Theorem 2.2, we fit the remnant values by the curve $C' + \frac{C}{N}$ and find that the values are well fitted by this curve.
4.1 (Existence and uniqueness of invariant measures). If $F$ satisfies (2.1) to (2.3), then there exist unique measures $m_\infty$ and $m^N_\infty$ satisfying (2.18) and (2.19) respectively, and they have finite exponential moments.

Lemma 4.2 (Mean field entropy sandwich). Assume $F$ satisfies (2.1) to (2.3). Then the sandwich inequality holds for every $m \in \mathcal{P}_2(\mathbb{R}^{2d})$.

For a collection of functions $(h_\iota)_{\iota \in I}$ we say $h_\iota \in \mathcal{A}_+$ uniformly for $\iota \in I$, or $(h_\iota)_{\iota \in I} \subset \mathcal{A}_+$ uniformly, if there exists a constant $C$ such that the previous bounds hold for every $h_\iota$, $\iota \in I$.

Proposition 4.7 (Density of $\mathcal{A}_+$). Assume $F$ satisfies (2.2) and (4.5) and there exists a measure $m_\infty$ satisfying (2.18) and having finite exponential moments. Then for every $m \in \mathcal{P}_2(\mathbb{R}^d)$ with finite entropy and finite Fisher information, there exists a sequence of measures $(m_n)_{n \in \mathbb{N}}$ such that $m_n/m_\infty \in \mathcal{A}_+$, where $(\eta_n)_{n \in \mathbb{N}}$ is a sequence of smooth mollifiers supported in the unit ball. We have $m''_n \to m'_{n_0}$ in $\mathcal{P}_2$. By the convexity of entropy and Fisher information we have $H(m''_n) \le H(m'_{n_0})$ and $I(m''_n) \le I(m'_{n_0})$, and by the lower semicontinuities in Lemma A.1 we have $\liminf_{n \to +\infty} H(m''_n) \ge H(m'_{n_0})$ and $\liminf_{n \to +\infty} I(m''_n) \ge I(m'_{n_0})$.

By induction we can obtain the almost sure bound for $k = 1, 2, 3, 4$, and combine it with the exponential moment bound (4.7) to obtain $|\nabla^k h(t, z)| \le \exp(C(1 + |z|))$ for $k = 1, 2, 3, 4$. Finally, the derivatives $\nabla h$, $\nabla^2 h$ exist, and one can show that they are continuous in time by differentiating (4.6) twice in space. So again by equation (4.6), $\partial_t h$ is continuous and therefore exists classically. Thus $m_t$ is a classical solution to the Fokker-Planck equation (2.12).
$t \ge 0$ be the flow of measures driven by the regularized potential $U^N_k$ with the initial value $m''^N_{k,0} = m^N_0$, and denote its invariant measure by $m''^N_{k,\infty}$; that is, $m''^N_{k,\infty}$ is the probability measure proportional to $\exp(-U^N_k(x) - \frac{1}{2}|v|^2)\,\mathrm{d}x\,\mathrm{d}v$. Thanks to the bound (4.24), we can apply the synchronous coupling result in Lemma 5.1 and obtain $m''^N_{k,t} \to m^N_t$ in $\mathcal{P}_2$ for every $t \ge 0$. The result obtained in the previous paragraph then reads accordingly.

Lemma 5.1. Let $T > 0$, let $\beta, \beta' : [0, T] \times \mathcal{P}_2(\mathbb{R}^d) \times \mathbb{R}^d \to \mathbb{R}^d$ be measurable and uniformly Lipschitz continuous in the last two variables, and let $\sigma$ be a $d \times d$ real matrix. Suppose the integral $\int_0^T (|\beta(t, \delta_0, 0)| + |\beta'(t, \delta_0, 0)|)\,\mathrm{d}t$ is finite. Let $(Z_t)_{t \in [0,T]}$, $(Z'_t)_{t \in [0,T]}$ be respective solutions to
$$\mathrm{d}Z_t = \beta\bigl(t, \mathrm{Law}(Z_t), Z_t\bigr)\,\mathrm{d}t + \sigma\,\mathrm{d}W_t, \qquad \mathrm{d}Z'_t = \beta'\bigl(t, \mathrm{Law}(Z'_t), Z'_t\bigr)\,\mathrm{d}t + \sigma\,\mathrm{d}W'_t,$$
where $W$, $W'$ are $d$-dimensional Brownian motions, and suppose there exist constants $M_m$, $M_z$ and a progressively measurable $\delta$ satisfying (5.17).

...the McKean-Vlasov and the linear semigroups corresponding to the SDEs (2.10) and (2.11), respectively. We then apply the log-Harnack inequality for McKean-Vlasov diffusions [32, Proposition 5.1] and obtain the corresponding bound.

We know that $E(t, m_t)$ is well defined and that $t \mapsto E(t, m_t)$ admits a derivative satisfying $\frac{\mathrm{d}}{\mathrm{d}t} E(t, m_t) \le -Y_t^{\mathsf{T}} K'_t Y_t$, where $K'_t$ is given explicitly.

The first claim can be written as two bounds on $W_2^2(m^N_t, m^{\otimes N}_t)$, the first of which follows directly from the finite-time bound in Proposition 5.2. The second claim (2.23) is nothing but Lemma 5.3. It remains to find some $C_2$, $\kappa$, depending only on $\rho_x$, $M^F_{mx}$, $M^F_{mm}$, and prove the uniform-in-time bound. Set $t_1 = \frac{t \wedge 1}{2}$ and $t_2 = t \wedge 1$. By the Wasserstein-to-entropy regularization result in Proposition 5.4, we can find a constant $C$ depending on $M^F_{mx}$ and $M^F_{mm}$ such that the entropy bound holds.