Maximum likelihood estimation of potential energy in interacting particle systems from single-trajectory data

This paper concerns the parameter estimation problem for the quadratic potential energy in interacting particle systems from continuous-time, single-trajectory data. Even though such dynamical systems are high-dimensional, we show that the vanilla maximum likelihood estimator (without regularization) is able to estimate the interaction potential parameter with the optimal rate of convergence simultaneously in the mean-field limit and in long-time dynamics. This to some extent avoids the curse of dimensionality for estimating large dynamical systems under symmetry of the particle interaction.


Introduction
Dynamical systems of interacting particles have a wide range of applications in modeling collective behaviors in physics [5], biology [16,21], social science [17], and more recently in machine learning, as useful tools to understand the stochastic gradient descent (SGD) dynamics of neural networks [15]. Due to the large number of particles, such dynamical systems are high-dimensional even with single-trajectory data from each particle, and statistical learning of lower-dimensional interaction functionals from such data is usually challenging [2,11,13]. In this paper, we propose a likelihood-based inference method to estimate the parameters of the interaction function induced by a potential energy in an N-particle interacting system from the observed (continuous-time) trajectories, and we establish its statistical guarantees.
1.1. Interacting N-particle systems. In statistical mechanics, the microscopic behaviors of a large number of random particles are used to explain macroscopic physical quantities (such as temperature distributions). Specifically, a system of N interacting particles (X^{N,1}_t, …, X^{N,N}_t) can be described by stochastic differential equations (SDEs) of the form

dX^{N,i}_t = (1/N) ∑_{j=1}^N b(X^{N,j}_t − X^{N,i}_t) dt + σ dW^i_t,  i = 1, …, N,  (1)

where b : R^d → R^d is a vector field representing the pairwise interaction between the particles, (W^1_t)_{t≥0}, …, (W^N_t)_{t≥0} are N independent copies of the standard Brownian motion in R^d such that E[W^i_0] = 0, and σ ∈ R_+ is a diffusion parameter which is assumed to be constant and known. The scaling in (1) puts us in the mean-field regime, where the pairwise interaction effect is weak and decays on the order of 1/N as the number of particles N → ∞, so that the total interaction effect remains O(1).
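To make the setup concrete, here is a minimal Euler-Maruyama simulation sketch of system (1). The function name, step size, and parameter values are illustrative choices of ours, not from the paper.

```python
import numpy as np

def simulate_particles(b, N, d, T, n_steps, sigma=1.0, rng=None):
    """Euler-Maruyama simulation of the mean-field system (1).

    b : pairwise interaction field R^d -> R^d, applied entrywise to an
        (N, N, d) array of pairwise differences.
    Returns an array of shape (n_steps + 1, N, d) with the particle paths.
    """
    rng = np.random.default_rng(rng)
    dt = T / n_steps
    X = np.zeros((n_steps + 1, N, d))
    X[0] = rng.standard_normal((N, d))      # mean-zero i.i.d. initialization
    for k in range(n_steps):
        # pairwise differences X^j - X^i, shape (N, N, d)
        diff = X[k][None, :, :] - X[k][:, None, :]
        drift = b(diff).mean(axis=1)        # (1/N) sum_j b(X^j - X^i)
        noise = sigma * np.sqrt(dt) * rng.standard_normal((N, d))
        X[k + 1] = X[k] + drift * dt + noise
    return X

# example: the linear interaction b(x) = x (i.e., Theta = identity)
paths = simulate_particles(lambda D: D, N=50, d=2, T=1.0, n_steps=100, rng=0)
```

The `mean(axis=1)` call implements the 1/N mean-field scaling, so the drift stays O(1) as N grows.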
In this paper, we consider the estimation problem for the interaction function b parametrized by a linear approximation b(x) = Θx for some unknown d × d (symmetric) positive-definite matrix Θ ≻ 0:

dX^{N,i}_t = Θ( X̄^N_t − X^{N,i}_t ) dt + σ dW^i_t,  i = 1, …, N,  (2)

where X̄^N_t = N^{−1} ∑_{j=1}^N X^{N,j}_t. The interaction in the stochastic system (2) corresponds to Hookean behavior capturing linear elasticity, where the interaction force b scales linearly with the deformation distance (due to compression and stretch) in the direction from X^{N,j}_t to X^{N,i}_t. This type of interaction is extensively used to study the large-scale and long-time dynamics of protein folding as an elastic mass-and-spring network of small Cα atoms [7,6].
It is a classical result [20] that, for each N, the N-particle SDE system (1) admits a unique strong solution (X^{N,i}_t)_{t≥0} if b is (globally) Lipschitz, which is the case for the stochastic system (2). Based on the observed continuous-time, single-trajectory data of the particle movements (X^{N,1}_t)_{t≥0}, …, (X^{N,N}_t)_{t≥0} in the interacting particle system (2), our main focus is to estimate the interaction parameter Θ in the potential energy.
Note that the first-order dynamical system (2) evolves as a stochastic gradient flow in R^d:

Ẋ^{N,i}_t = −∇V( X^{N,i}_t − X̄^N_t ) + σ ξ^i_t,  i = 1, …, N,  (3)

which corresponds to the quadratic potential energy V(x) = ½ xᵀΘx, where the ξ^i_t are i.i.d. standard Gaussian (white-noise) vectors in R^d. The left-hand side of (3) is the observed velocity vector of particle i at time t, and the right-hand side of (3) is a linear function of all particle trajectories at time t corrupted by independent additive Gaussian noise. Thus there are Nd SDEs in (3), with N observed trajectories in dimension d, and the estimation problem for Θ can be recast as a high-dimensional linear regression problem in the augmented space R^{Nd} (cf. the equivalent form (7) in Section 2). Nevertheless, the trajectory data are temporally dependent samples since the particles are interacting and dynamic, so theoretical guarantees on estimating structured coefficients in high-dimensional linear models with i.i.d. samples are no longer applicable in our context [4]. Moreover, due to the symmetry of the particles in law, the regression coefficients have a special replicated block-diagonal structure in R^{Nd}, which suggests that regularization techniques may not be needed in our problem. Indeed, we show in this work that a direct likelihood-ratio method suffices to estimate Θ with the optimal rate of convergence.
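The replicated block structure can be checked numerically. The following sketch (with illustrative dimensions and variable names of ours) builds the drift matrix of the stacked system as a Kronecker product and verifies that it reproduces the per-particle drift Θ(X̄ − X^i).

```python
import numpy as np

N, d = 5, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((d, d))
Theta = A @ A.T + d * np.eye(d)                  # a symmetric positive-definite Theta

H = np.ones((N, N)) / N                          # averaging matrix, H^2 = H
Gamma = np.kron(np.eye(N) - H, Theta)            # drift matrix of the stacked system:
                                                 # dX_t = -Gamma X_t dt + dW_t in R^{Nd}

assert np.allclose(H @ H, H)                     # H is a projection
X = rng.standard_normal(N * d)                   # a stacked state (X^1, ..., X^N)
Xbar = X.reshape(N, d).mean(axis=0)
drift_blockwise = np.concatenate([Theta @ (Xbar - Xi) for Xi in X.reshape(N, d)])
assert np.allclose(-Gamma @ X, drift_blockwise)  # matches Theta(Xbar - X^i) per particle
```

Since every diagonal block of Gamma is a copy of Theta, the regression coefficients in R^{Nd} are indeed replicated across particles.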
1.2. Mean-field limit. Let ρ^N_t = N^{−1} ∑_{i=1}^N δ_{X^{N,i}_t} be the empirical measure of the N particles at time t. Then we can alternatively write (1) as

dX^{N,i}_t = B(X^{N,i}_t, ρ^N_t) dt + σ dW^i_t,  where B(x, ρ) = ∫ b(y − x) ρ(dy),  (4)

so the drift coefficient vector depends on the individual state X^{N,i}_t and the distribution ρ^N_t (due to interaction). As N → ∞ (i.e., in the mean-field limit), the interaction contributed by any pair of particles in the N-particle system (1) vanishes, and all particles become (asymptotically) i.i.d. since they have the same drift and diffusion coefficients driven by independent standard Brownian motions in R^d. By the law of large numbers, we see that for any fixed t, ρ^N_t → ρ_t as N → ∞, and the dynamical system (4) becomes

dY^i_t = B(Y^i_t, ρ_t) dt + σ dW^i_t,  (5)

where (ρ_t)_{t≥0} is a non-random measure flow. Since the particles are symmetric in distribution, ρ_t is in fact the limiting law of each particle X^i_t, i = 1, …, N. This defines a system of independent stochastic Vlasov equations (5) with ρ_t = L(Y^i_t), which is a class of Markov processes with nonlinear dynamics [14].
For the quadratic potential V(x) = ½ xᵀΘx, the stochastic system (2) depends on the empirical distribution ρ^N_t only through the empirical mean X̄^N_t, and it can be decoupled and approximated by a system of independent mean-reverting processes. Averaging the N SDEs in (2), we get dX̄^N_t = (σ/√N) dW̄_t, where W̄_t = N^{−1/2} ∑_{i=1}^N W^i_t is a standard Brownian motion; hence, once again by the law of large numbers, the interaction effect X̄^N_t becomes deterministic and is nicely decoupled in the mean-field limit. Thus we expect that the i-th particle process (X^{N,i}_t)_{t≥0} can be (independently) approximated by a limiting process (Y^i_t)_{t≥0} given by

dY^i_t = −Θ Y^i_t dt + σ dW^i_t,  Y^i_0 = X^{N,i}_0.  (6)

Note that the processes (Y^i_t)_{t≥0} in (6) are independent copies of the Ornstein-Uhlenbeck (OU) process, which form a linear dynamical system of N independent particles.
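The decoupling can be illustrated numerically: simulate (2) and the OU analogs (6) with shared Brownian increments and track the maximal pathwise difference, which should shrink as N grows (roughly like N^{−1/2}). This is a one-dimensional sketch with parameter choices of ours.

```python
import numpy as np

def coupled_paths(N, theta=1.0, T=2.0, n_steps=400, seed=0):
    """Simulate the 1-d interacting system (2) and the decoupled OU processes (6)
    with the SAME Brownian increments; return the maximal coupling error."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    X = rng.standard_normal(N)                       # shared mean-zero initialization
    Y = X.copy()
    err = 0.0
    for _ in range(n_steps):
        dW = np.sqrt(dt) * rng.standard_normal(N)
        X = X + theta * (X.mean() - X) * dt + dW     # interacting particles
        Y = Y - theta * Y * dt + dW                  # independent OU analogs
        err = max(err, np.abs(X - Y).max())
    return err

err_N50 = coupled_paths(50)
err_N5000 = coupled_paths(5000)
```

Because the two systems share the noise, the residual error is driven only by the mean process X̄^N, whose scale is N^{−1/2}.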
1.3. Existing literature. Learnability (i.e., identifiability) of interaction functions in interacting particle systems under the coercivity condition was studied in [13,11]. In the noiseless setting, estimation of the interaction kernel, a scalar-valued function of the pairwise distance between particles in the system, was first studied in [2] for single-trajectory data in the mean-field limit, where the rate of convergence is no faster than N^{−1/d}. To alleviate the curse of dimensionality, sparsity-promoting techniques were considered for some structured high-dimensional dynamical systems [3,19]. [13] showed that a least-squares estimator achieves the optimal rate of convergence (in the number of observed trajectories for each particle) for estimating the interaction kernel based on multiple-trajectory data sampled from a deterministic system with random initialization. Estimation of the diffusion parameter for interacting particle systems from noisy trajectory data was studied in [8]. Consistency of maximum likelihood parameter estimation for the general McKean-Vlasov equation was studied in [22]. To the best of our knowledge, no existing work, regularized or not, establishes the optimal rate of convergence for interaction parameter estimation simultaneously in the large-N (mean-field limit) and large-t (long-time dynamics) regimes. This work fills this gap for linear elasticity interacting particle systems.

Notation.
For two generic vectors a, b ∈ R^d, we use a · b = ∑_{j=1}^d a_j b_j to denote the inner product of a and b. We use ‖a‖ = (a · a)^{1/2} to denote the Euclidean norm of a and ‖a‖_∞ = max_{1≤j≤d} |a_j| to denote its maximum norm. For a generic matrix M, we use ‖M‖ to denote its spectral norm. Denote the set of d × d positive-definite matrices by S^{d×d}_+.

Maximum likelihood estimation
We estimate Θ ≻ 0 by a likelihood-based method. Since the diffusion parameter σ is known and constant, we may assume for simplicity that σ = 1. In addition, without loss of generality we assume that X^{N,i}_0 = ξ^i for some mean-zero independent random vectors ξ^i in R^d.
Let X^N_t = (X^{N,1}_t, …, X^{N,N}_t) ∈ R^{Nd} be the stacked observation process. Then system (2) can be rewritten as a higher-dimensional mean-reverting process in the augmented space R^{Nd}:

dX^N_t = −((I_N − H) ⊗ Θ) X^N_t dt + dW_t,  (7)

where H = N^{−1} 1_N 1_N^T and (W_t)_{t≥0} is a standard Brownian motion in R^{Nd}. Note that H is a projection matrix (H² = H), which implies that the interaction effect is homogeneous. Let P be the law of the standard (Nd)-dimensional Brownian motion (W_t)_{t≥0} and Q be the law of the augmented observation process (X^N_t)_{t≥0}. By the multivariate Girsanov theorem for changing measures (cf. Theorem 1.12 in [9]), the likelihood ratio of (X^N_t)_{t≥0} in (2) is

(dQ/dP)(X^N) = exp( −∫_0^t ((I_N − H) ⊗ Θ) X^N_s · dX^N_s − ½ ∫_0^t ‖((I_N − H) ⊗ Θ) X^N_s‖² ds ).  (8)

Then the maximum likelihood estimator (MLE) for Θ is defined as

Θ̂^N_t = argmax_{A ∈ S^{d×d}_+} ℓ^N_t(A).  (9)

Next we explain the intuition for why the MLE (9) works. Note that we can write the log-likelihood as

ℓ^N_t(A) = ∑_{i=1}^N [ ∫_0^t A(X̄^N_s − X^{N,i}_s) · dX^{N,i}_s − ½ ∫_0^t ‖A(X̄^N_s − X^{N,i}_s)‖² ds ].  (10)

As discussed earlier in Section 1, since the interaction among the N particles in the mean-field regime is weak, we expect that those particles can be decoupled into their independent analogs. Let (Y^1_t)_{t≥0}, …, (Y^N_t)_{t≥0} be independent copies of the OU processes defined in (6).
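In discrete time, the first-order optimality condition of the quadratic log-likelihood gives the MLE as a ratio of Riemann-Itô sums: the stochastic integral of dX^{N,i} against X̄^N − X^{N,i}, normalized by the corresponding quadratic term. The following sketch simulates system (2) by Euler-Maruyama and recovers Θ from a single trajectory; discretization and parameter choices are ours.

```python
import numpy as np

def mle_theta(X, dt):
    """Discretized MLE for Theta from stacked paths X of shape (n_steps+1, N, d).

    Approximates [sum_i int dX^i (Xbar - X^i)^T]
                 [sum_i int (Xbar - X^i)(Xbar - X^i)^T dt]^{-1}.
    """
    U = X.mean(axis=1, keepdims=True) - X          # Xbar_t - X^i_t
    dX = np.diff(X, axis=0)                        # increments of X^i
    num = np.einsum('tia,tib->ab', dX, U[:-1])     # sum_i int dX (Xbar - X)^T
    den = np.einsum('tia,tib->ab', U[:-1], U[:-1]) * dt
    return num @ np.linalg.inv(den)

# simulate system (2) with a known Theta and recover it from one trajectory
rng = np.random.default_rng(1)
N, d, T, n = 200, 2, 5.0, 2000
dt = T / n
Theta = np.array([[2.0, 0.5], [0.5, 1.0]])
X = np.zeros((n + 1, N, d))
X[0] = rng.standard_normal((N, d))
for k in range(n):
    U = X[k].mean(axis=0) - X[k]
    X[k + 1] = X[k] + U @ Theta.T * dt + np.sqrt(dt) * rng.standard_normal((N, d))

Theta_hat = mle_theta(X, dt)
```

With N = 200 particles over time horizon T = 5, the error is expected to be of order (NT)^{−1/2} ≈ 0.03, up to discretization bias.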
Concentration bounds (which we develop in Appendix 5.2) allow us to control the decoupling and fluctuation errors around zero. Thus the information useful for estimation comes from the signal part of the log-likelihood, namely its expectation under the decoupled OU processes. Since this population objective is a quadratic form in A with a negative-definite second-order term, its maximizer is A* = Θ. This means that on the population level the MLE equals the true parameter. Combining this with the decoupling and fluctuation error bounds, we can obtain the rate of convergence for the MLE Θ̂^N_t in (9).

Rate of convergence
In this section, we derive the rate of convergence for estimating Θ by the MLE Θ̂^N_t in (9) from the continuous-time, single-trajectory data for each particle. Let θ_1 ≥ … ≥ θ_d > 0 be the ordered eigenvalues of Θ. Below is the main theoretical result of this paper.
Theorem 3.1. If t ≥ 1/θ_d and N ≥ 400, then for any ε ∈ [e^{−N/400}, 1), the error bound (11) on Θ̂^N_t − Θ holds with probability at least 1 − 14ε.

Theorem 3.1 is non-asymptotic and has several appealing features. First, we do not assume that the N-particle processes (X^{N,1}_t)_{t≥0}, …, (X^{N,N}_t)_{t≥0} start from the stationary distribution (which would be a restrictive assumption), and the (continuous) time complexity t ≥ 1/θ_d of the particle trajectories is sharp. Indeed, suppose Θ = diag(θ_1, …, θ_d) is a diagonal matrix, and observe that the decoupled N copies of the OU processes approximating the dynamics of the interacting N-particle system have the equilibrium distribution N(0, (σ²/2) Θ^{−1}). Thus it takes at least Ω(max_{1≤j≤d} θ_j^{−1}) time for (X^{N,1}_t)_{t≥0}, …, (X^{N,N}_t)_{t≥0} starting from zero to mix to the steady state, unless those processes are initialized at t = 0 from the stationary distribution (i.e., τ²_j = σ²/(2θ_j)). In particular, if θ_j is closer to zero, then the log-likelihood ratio becomes flatter, and thus a larger t is necessary to extract the information carried by samples from the stationary distribution. On the other hand, if the processes start from the stationary distribution, then this trajectory-time lower bound is not needed to obtain (11).
Second, the rate of convergence in (11) is parametric (and thus rate-optimal) in both N and t. Specifically, for fixed d, we can obtain from Theorem 3.1 the large-N (mean-field limit) and large-t (long-time dynamics) asymptotics

‖Θ̂^N_t − Θ‖ = O_P((Nt)^{−1/2}),  (12)

provided that 0 < θ_j < ∞ for all j = 1, …, d, and the N particle processes start from a chaotic distribution. We highlight that the long-time dynamic behavior in (12) as t → ∞ cannot be obtained from the classical theory of the propagation of chaos (cf. [20]), where Gronwall's lemma (cf. Appendix 1 in [18]) is typically used to control the decoupling error between the interacting N-particle processes and their independent analogs. In that case, the decoupling error grows exponentially in t, and thus it cannot be used to yield the rate t^{−1/2}. Our argument is tailored to the quadratic structure of the interaction potential V(x) = ½ xᵀΘx, which allows for a far more efficient decoupling strategy (cf. Lemma 5.5).
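As a numerical sanity check of the parametric rate, one can estimate θ in the one-dimensional system for two sample sizes and compare the averaged errors. Below is a small Monte Carlo sketch (parameter choices are ours): increasing Nt by a factor of 16 should shrink the average error by roughly 4.

```python
import numpy as np

def mle_error_1d(N, T, theta=1.0, n_steps=500, seed=0):
    """Simulate the 1-d system (2) and return |theta_hat - theta| for the MLE."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    X = rng.standard_normal(N)
    num = den = 0.0
    for _ in range(n_steps):
        U = X.mean() - X                         # Xbar_t - X^i_t
        dW = np.sqrt(dt) * rng.standard_normal(N)
        dX = theta * U * dt + dW
        num += (dX * U).sum()                    # sum_i int (Xbar - X^i) dX^i
        den += (U * U).sum() * dt                # sum_i int (Xbar - X^i)^2 ds
        X = X + dX
    return abs(num / den - theta)

# average over a few seeds: N*t grows by 16x between the two settings
err_small = np.mean([mle_error_1d(N=25, T=2.0, seed=s) for s in range(20)])
err_large = np.mean([mle_error_1d(N=400, T=2.0, seed=100 + s) for s in range(20)])
```

The averaging over replicates keeps the comparison stable despite the randomness of each single-trajectory estimate.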
Proof of Theorem 3.1. Without loss of generality, we may assume that σ = 1, and by rescaling it suffices to prove that (13) holds with probability at least 1 − 10ε. Since the objective function ℓ^N_t(A) in (10) is quadratic in A, the first-order optimality condition implies that the MLE of Θ is given by

Θ̂^N_t = [ ∑_{i=1}^N ∫_0^t dX^{N,i}_s ⊗ (X̄^N_s − X^{N,i}_s) ] [ ∑_{i=1}^N ∫_0^t (X̄^N_s − X^{N,i}_s) ⊗ (X̄^N_s − X^{N,i}_s) ds ]^{−1},  (14)

where we recall that a ⊗ b = abᵀ denotes the tensor product of two vectors a and b. Because (X^N_t)_{t≥0} in the form of (7) is an R^{Nd}-dimensional OU process with a chaotic Gaussian initialization, we may diagonalize Θ and assume that Θ is a diagonal matrix with diagonal entries θ_1, …, θ_d without changing the distribution of (X^N_t)_{t≥0}. Hence, in the rest of the proof, we assume that Θ = diag(θ), where θ = (θ_1, …, θ_d)ᵀ with θ_1 ≥ … ≥ θ_d > 0, and estimate only the diagonal entries by

θ̂^N_{t,j} = [ ∑_{i=1}^N ∫_0^t (X̄^N_{s,j} − X^{N,i}_{s,j}) dX^{N,i}_{s,j} ] / [ ∑_{i=1}^N ∫_0^t (X̄^N_{s,j} − X^{N,i}_{s,j})² ds ].  (15)

We shall analyze the numerator and the denominator of the error term in (15) separately. Let (Y^i_t)_{t≥0}, i = 1, …, N, be independent copies of the OU process driven by the same Brownian motions (W^i_t)_{t≥0} as in (X^{N,i}_t)_{t≥0}, namely,

dY^i_t = −Θ Y^i_t dt + dW^i_t,  Y^i_0 = X^{N,i}_0.

Below we fix a coordinate j ∈ {1, …, d}.
By (24) in Lemma 5.2 and (26) in Lemma 5.5 with N ≥ 400 and ε ≥ e^{−N/400}, we have with probability at least 1 − 6ε the bound (16), where C(ε, N) is the term defined in (26). By (23) in Lemma 5.1, we have (17). Note that the function x ↦ (1 − e^{−x})/x is positive and strictly decreasing on (0, ∞), and consequently the denominator terms are bounded below. Thus the triangle inequality yields the desired control. Now combining (15), (16), and (17), we obtain that with probability at least 1 − 14ε the per-coordinate bound holds. Summing over j = 1, …, d and applying the union bound, we conclude that (13) holds with probability at least 1 − 14ε.

Discussion
This paper derives a quantitative error bound for the MLE of the interaction parameter of the quadratic potential energy in a large interacting particle system. It turns out that symmetry of the particle interaction is the key structure enabling the optimal rate of convergence for such high-dimensional (linear) dynamical systems without regularization techniques.
Our future plan is to extend the current results from linear interactions to some first-order nonlinear SDEs. A popular nonlinear interacting system is given by the following model:

dX^{N,i}_t = (1/N) ∑_{j=1}^N φ(‖X^{N,j}_t − X^{N,i}_t‖)(X^{N,j}_t − X^{N,i}_t) dt + σ dW^i_t,  i = 1, …, N,  (18)

where φ : R_+ → R is an interaction kernel depending only on the pairwise distance between particles. We highlight that the linear elastic interacting particle system (2) has a different structure from (18): the latter implicitly assumes that the interaction kernel is isotropic and one-dimensional. The stochastic system (18) may capture other types of physical forces of attraction and repulsion that depend solely on the Euclidean distance.
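For experimentation with this nonlinear model, here is a simulation sketch of (18); the kernel φ(r) = 1/(1 + r²) is an arbitrary bounded Lipschitz choice of ours, not from the paper.

```python
import numpy as np

def simulate_kernel_system(phi, N, d, T, n_steps, sigma=1.0, seed=0):
    """Euler-Maruyama sketch of the nonlinear system (18) with distance kernel phi."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    X = rng.standard_normal((N, d))
    for _ in range(n_steps):
        diff = X[None, :, :] - X[:, None, :]          # pairwise X^j - X^i
        dist = np.linalg.norm(diff, axis=2)           # pairwise distances
        drift = (phi(dist)[:, :, None] * diff).mean(axis=1)
        X = X + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal((N, d))
    return X

# example run with a bounded attraction kernel (illustrative)
X_T = simulate_kernel_system(lambda r: 1.0 / (1.0 + r ** 2), N=100, d=2, T=1.0, n_steps=200)
```

The kernel is evaluated on the full distance matrix, so the i = j terms contribute zero force as in (18).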
Nonparametric inference of such interaction kernels based on noiseless trajectory data has been studied in [2,13,12]. In the presence of noise, similarly as in Section 2, we can write (18) in R^{Nd} as

dX^N_t = −∇V_φ(X^N_t) dt + σ dW_t,  (19)

where, for X = (X^1, …, X^N) ∈ R^{Nd}, V_φ is the potential energy induced by the kernel φ. Then the negative log-likelihood function has a form similar to (10) (cf. (20)), and the MLE is defined as φ̂^N_t = argmin_{ψ∈H} ℓ^N_t(ψ) for some reasonably rich class of functions H. Heuristically, the propagation of chaos phenomenon allows us to control the difference between (X^N_t) and its OU process analog (Y^N_t) [20]. Thus for each t, replacing (X^N_t) by (Y^N_t) in the log-likelihood function (20) causes a uniform decoupling error O(N^{−1/2}). In the linear interaction case, the propagation of chaos is implemented non-asymptotically by decoupling concentration bounds (cf. Lemma 5.5). If the gradient of the potential energy V_φ is Lipschitz, then the martingale term ∫_0^t ∇V_φ(Y^N_s) · dW_s corresponding to the noise fluctuation can be upper bounded using Lemma 5.2. On the other hand, the signal distortion is controlled by the curvature of the (negative) log-likelihood function ℓ^N_t(·). For instance, if V_φ is strongly convex, then we expect the signal to be bounded away from zero. Thus the separation between noise and signal allows us to control the estimation error. We leave the derivation of rigorous results as future work.

5.1. Ornstein-Uhlenbeck process. The one-dimensional Ornstein-Uhlenbeck (OU) process (Y_t)_{t≥0} is a mean-reverting stochastic process defined by the stochastic differential equation

dY_t = −θ Y_t dt + σ dW_t,  (21)

where (W_t)_{t≥0} is the standard Brownian motion in R such that E[W_0] = 0, θ > 0 is the mean-reversion intensity parameter, and σ > 0 is the diffusion parameter. The stationary distribution of Y_t is N(0, σ²/(2θ)). Lemma 5.1 below gives the explicit formulas for the mean and variance at any finite time t ∈ (0, ∞).
Lemma 5.1 (Mean and variance of the OU process). Let (Y_t)_{t≥0} be the OU process defined in (21). If Y_0 = W_0 = ξ for some mean-zero random variable ξ with variance τ², then for all t > 0, we have

E[Y_t] = 0,  (22)
E[Y_t²] = τ² e^{−2θt} + (σ²/(2θ))(1 − e^{−2θt}).  (23)

Proof of Lemma 5.1. Since (Y_t)_{t≥0} starts from a mean-zero random variable, taking expectations on both sides of (21) gives dE[Y_t]/dt = −θ E[Y_t] with E[Y_0] = 0, which yields (22). To prove (23), we apply Itô's formula to get

dY_t² = (−2θ Y_t² + σ²) dt + 2σ Y_t dW_t.

Taking expectations on both sides of the last equation kills the martingale term and gives the ODE dE[Y_t²]/dt = −2θ E[Y_t²] + σ², whose solution subject to the initial condition E[Y_0²] = τ² is given by (23).
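The variance formula (23) can be checked against a direct Euler integration of the moment ODE m′(t) = −2θ m(t) + σ² that appears in the proof; function names and parameter values here are illustrative.

```python
import numpy as np

def ou_second_moment(t, theta, sigma, tau2):
    """Closed-form E[Y_t^2] from Lemma 5.1 (with E[Y_0^2] = tau2)."""
    return tau2 * np.exp(-2 * theta * t) + sigma ** 2 / (2 * theta) * (1 - np.exp(-2 * theta * t))

# sanity check: the closed form solves m'(t) = -2*theta*m(t) + sigma^2
theta, sigma, tau2 = 1.5, 0.7, 0.3
dt, T = 1e-4, 2.0
m = tau2
for _ in range(int(T / dt)):
    m += (-2 * theta * m + sigma ** 2) * dt        # forward Euler step of the moment ODE
```

As t → ∞ the formula converges to the stationary variance σ²/(2θ), matching the discussion after (21).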

5.2.
Key supporting lemmas. This section provides the key technical results for bounding the fluctuation of the OU process and the error of decoupling the N-particle system into the associated N independent OU processes.
Lemma 5.2 (Concentration inequalities for the OU process). Let (Y_t)_{t≥0} be the one-dimensional OU process defined in (21) with Y_0 = ξ, where ξ ∼ N(0, τ²), and suppose that (Y^1_t)_{t≥0}, …, (Y^N_t)_{t≥0} are independent copies of (Y_t)_{t≥0}. Then the bound (24) holds for any ε ∈ (0, 1), and the bound (25) holds for any ε ≥ e^{−N/16}.

Proof of Lemma 5.2. First we prove (24). In view of Lemma 5.1, we may assume without loss of generality that Y_0 ∼ N(0, σ²/(2θ)), which by rescaling gives the worst-case error bound in (24). Let λ > 0. By Jensen's inequality, the moment generating function of the averaged process is bounded by the product of the individual ones. Let w = tλσ²/(2Nθ). By independence, we have for 0 < w < 1/2 a moment generating function bound, where the last inequality follows from Lemma 5.3. Combining the last two inequalities, it follows from Lemma 5.4 that an exponential upper-tail bound holds for all x > 0. Likewise, the lower bound follows along similar lines. Choosing x = log(1/ε), we obtain (24). Next we prove (25). Note that the Z^i_t are independent continuous local martingales vanishing at zero, whose quadratic variations [Z^i]_t can be bounded using (23). Choosing x = 2 log(1/ε)/θ, we have the stated bound with probability at least 1 − 2ε. Applying the same argument to −Z^i_t, we get (25).

Lemma 5.3 (Moment generating function of the centered χ² distribution). Let Z ∼ N(0, 1) be a standard Gaussian random variable and let φ denote the logarithm of the Laplace transform of Z² − 1.

Proof of Lemma 5.3. See the proof of Lemma 1 in [10].
Lemma 5.4 (From moment generating function to tail probability). If a random variable Z satisfies log E[exp(uZ)] ≤ vu²/(1 − cu) for some v > 0 and c ≥ 0, then a corresponding exponential tail bound holds for all x > 0.

Proof of Lemma 5.4. See the proof of Lemma 8 in [1].
Lemma 5.5 (Decoupling error bounds for the N-particle system). Suppose that the one-dimensional N-particle system (X^{N,1}_t)_{t≥0}, …, (X^{N,N}_t)_{t≥0} defined in (2) is initialized at X^{N,i}_0 = ξ^i for i.i.d. Gaussian random variables ξ^1, …, ξ^N ∼ N(0, τ²). Then the bound (26) holds for any ε ≥ e^{−N/16}, where C(ε, N) = C_1(ε, N)(2C_1(ε, N) + 8C_2(ε, N)). Moreover, an analogous bound holds for any ε ∈ (0, 1).

Proof of Lemma 5.5. Without loss of generality, we may assume σ = 1. Denote ∆^{N,i}_t = X̄^N_t − X^{N,i}_t + Y^i_t. Recall the N-particle system and the associated approximating N independent OU processes:

dX^{N,i}_t = θ(X̄^N_t − X^{N,i}_t) dt + dW^i_t,  dY^i_t = −θ Y^i_t dt + dW^i_t,  i = 1, …, N,

where (Y^i_t) is driven by the same Brownian motion (W^i_t). Averaging the processes (X^{N,i}_t) over i = 1, …, N, we get dX̄^N_t = N^{−1} ∑_{i=1}^N dW^i_t, i.e., the (rescaled) mean process (√N X̄^N_t) of the N particles has the same law as a standard Brownian motion. Combining the last three expressions, we obtain

d∆^{N,i}_t = −θ ∆^{N,i}_t dt + N^{−1} ∑_{k=1}^N dW^k_t,

which implies that the difference process (∆^{N,i}_t) between the N-particle system and the decoupled OU processes is an OU process with respect to the averaged Brownian motion, with effective diffusion parameter N^{−1/2}. Similarly, E∫_0^t |∆^{N,i}_s|² ds ≤ t/(2Nθ). By a second application of (24) in Lemma 5.2 with σ = N^{−1/2}, we obtain a bound that holds with probability at least 1 − 2ε. Then (26) follows from putting all the pieces together.