Parameter estimation of ODE's via nonparametric estimators

Ordinary differential equations (ODE's) are widespread models in physics, chemistry and biology. In particular, this mathematical formalism is used for describing the sets of interactions and the evolution of complex systems, and it might consist of high-dimensional sets of coupled nonlinear differential equations. In this setting, we propose a general method for estimating the parameters indexing ODE's from time series. Our method is able to alleviate the computational difficulties encountered by the classical parametric methods, which are due to the implicit definition of the model. We propose the use of a nonparametric estimator of regression functions as a first step in the construction of an M-estimator, and we show the consistency of the derived estimator under general conditions. In the case of spline estimators, we provide asymptotic normality, and we derive the rate of convergence, which is not the usual √n-rate for parametric estimators. This rate depends on the smoothness of the differential equation. Some perspectives of refinements of this new family of parametric estimators are given.


Introduction
Ordinary differential equations are used for the modelling of dynamic processes in physics, engineering, chemistry, biology, and so forth. In particular, such a formalism is used for the description of ecosystems (for example competing species in biology), or of cell regulatory systems such as signaling pathways and gene regulatory networks [12]. Usually, the model for the state variables x = (x_1, . . . , x_d)^⊤ consists of an initial value problem ẋ(t) = F(t, x(t), θ), x(0) = x_0, ∀t ∈ [0, 1], (1.1) where F is a time-dependent vector field from R^d to R^d, d ∈ N, and θ ∈ Θ, Θ being a subset of a Euclidean space. When data are available, such as a time series, we are interested in the problem of estimating the coefficients parametrizing the ODE. In principle, this may be done by some classical parametric estimators, usually the least squares estimator [28] or the Maximum Likelihood estimator (MLE). Different estimators have been derived in order to take into account particular features of the differential equation, such as special boundary values (there exists a function g linking the values at the boundary, i.e. g(x(0), x(1)) = 0, instead of the simple initial value problem), or random initial values or random parameters [9]. Otherwise, there may be some variations in the observational process, such as noisy observation times, that necessitate the introduction of appropriate minimization criteria [26]. Despite their satisfactory theoretical properties, the efficiency of these estimators may be dramatically degraded in practice by computational problems that arise from the implicit and nonlinear definition of the model. Indeed, these estimators give rise to nonlinear optimization problems that necessitate the approximation of the solution of the ODE and the exploration of the (usually high-dimensional) parameter space. Hence, we have to face possibly numerous local optima and a huge computation time.
Instead of considering the estimation of θ straightforwardly as a parametric problem, it may be useful to look at it as the estimation of a univariate regression function t → x(t) that belongs to the (finite-dimensional) family of functions satisfying (1.1). Alternative approaches to MLE have been used in applications, such as two-step methods: in a first step, a proxy for the solution of the ODE is obtained by nonparametric methods, and in a second step, estimates of the ODE parameter are derived by minimizing a given distance. Varah [40] initiated this approach by using and differentiating a smooth approximation of the solution based on least-squares splines, as did [38], or Madar et al. [29] with cubic splines (and a well-chosen sequence of knots). In the same spirit, approximations of the solutions of the ODE are provided by smoothing methods such as local polynomial regression [22,30,11], or neural networks [42], within the same two-step estimation scheme. Slight modifications of these approaches are the adaptation of collocation methods to statistical estimation, where the solution is approximated by Lagrange polynomials [20]. This two-step approach has also been considered in the functional data analysis (FDA) framework proposed by Ramsay and Silverman [34], which is based on the transformation of data into functions with smoothing cubic splines. Ramsay proposed Principal Differential Analysis (PDA) [32] for the estimation of linear ODE's. PDA has recently been extended to nonlinear ODE's and improved by repeating the two steps, introducing iterated PDA (iPDA) [31,41]. Recently, the iPDA approach has been extended to a so-called generalized smoothing approach by Ramsay et al. [33]. In this algorithm, the smoothing and the estimation of the ODE parameter are considered jointly, and Ramsay et al. proposed an estimation algorithm inspired by profile likelihood methods.
This paper provides a versatile estimator. We refer to [33] and the subsequent discussions for a lucid account of the current status and open questions concerning the statistical estimation of differential equations, and for a broad view of the difficulties of parameter estimation for ODE's. The use of nonparametric estimators is motivated by the computational simplicity of the estimation method, but an additional motivation is that the functional point of view enables one to use prior knowledge on the solutions of the ODE, such as positivity or boundedness, which is difficult to exploit in the strictly parametric form. Indeed, the parametric form implies that we have a thorough knowledge of the influence of the parameters on the qualitative behavior of the solutions of (1.1), which is rarely the case. In this paper, we study the consistency and asymptotic properties of this general two-step estimator based on nonparametric estimators of the trajectories of the ODE. We focus in particular on the use of spline regression in the first step, as it is one of the estimators used previously in the construction of the proxy, but we also discuss the generality and potential improvements of this approach.
In the next section, we introduce the statistical model and we define the so-called two-step estimator of θ. We show that under broad conditions this estimator is consistent, and we discuss some straightforward extensions of this estimator to different cases. In section 3, we derive an asymptotic representation of the two-step estimator, and we focus on the case of least-squares splines. We also discuss the generality of this result with respect to other types of nonparametric estimators. In the last section, we give some simulation results obtained with the classical Lotka-Volterra population model from biology.

Statistical model
We want to estimate the parameter θ of the ordinary differential equation (1.1) from noisy observations y_i = x*(t_i) + ε_i, i = 1, . . . , n, (2.1) at n points in [0, 1], 0 ≤ t_1 < · · · < t_n ≤ 1, where the ε_i's are i.i.d. centered random variables. The ODE is indexed by a parameter θ ∈ Θ ⊂ R^p and an initial value x_0; the true parameter value is θ* and the corresponding solution of (1.1) is x*. The vector field defining the ODE is a function F : [0, 1] × X × Θ → R^d (X ⊂ R^d) of class C^m w.r.t. x for every θ, with m ≥ 1. It is a Lipschitz function, so that we have existence and uniqueness of a solution x_{θ,x_0} to (1.1) on a neighborhood of 0 for each θ and x_0; and we assume that every corresponding solution can be defined on [0, 1]. Hence, the solutions x_{θ,x_0} belong to C^{m+1}([0, 1], R^d). Moreover, we suppose also that F is a smooth function in θ, so that each solution x_{θ,x_0} depends smoothly on the parameters θ and x_0. Then, we suppose that F is of class C^1 in θ for every x. Let f_Σ be the density of the noise ε; then the log-likelihood in the i.i.d. case is l(θ, x_0, Σ) = ∑_{i=1}^n log f_Σ(y_i − x_{θ,x_0}(t_i)), and the model that we want to identify is parametrized by (θ, x_0, Σ) ∈ Θ × X × S^+, for instance when the noise is centered Gaussian with covariance matrix Σ (S^+ is the set of symmetric positive matrices). An alternative parametrization is (θ, x_{θ,x_0}, Σ) ∈ Θ × F × S^+, with F the set of functions that solve (1.1) for some θ and x_0, thanks to the injective mapping between initial conditions and solutions.
In most applications, we are not really interested in the initial conditions but rather in the parameter θ, so that x_0 or x_{θ,x_0} can be viewed as a nuisance parameter, like the covariance matrix Σ of the noise. We want to define estimators of the "true" parameters (x*, θ*) (x* = x_{θ*,x_0*}) that will be denoted by (x̂_n, θ̂_n). The estimation problem appears as a standard parametric problem that can be dealt with by the classical theory in order to provide estimators with good properties (e.g. √n-consistency), such as the Maximum Likelihood Estimator (MLE). Indeed, from the smoothness properties of F, the log-likelihood l(θ, x_0) is at least C^1 w.r.t. (θ, x_0), so that we can define the score function ∇_{(θ,x_0)} l(θ, x_0), and we can claim under weak conditions (e.g. theorem 5.39 of [39]) that the MLE is an asymptotically efficient estimator. The difficulty of this approach is then essentially practical, because of the implicit dependence of x on the parameter (θ, x_0), which prohibits proper maximization of l(θ, x_0). Indeed, derivative-based methods like Newton-Raphson are then not easy to handle, and evaluation of the likelihood necessitates the integration of the ODE, which becomes a burden when we have to explore a huge parameter space. Moreover, the ODE's proposed for modelling may be expected to give a particular qualitative behavior which can be easily interpreted in terms of systems theory, e.g. convergence to an equilibrium state or oscillations. Typically, these qualitative properties of ODE's are hard to control, involve bifurcation analysis [25], and may necessitate mathematical knowledge which is not always accessible for huge systems. Moreover, boundedness of the solution x* (a ≤ x*(t) ≤ b, with a, b ∈ R^d) may be difficult to use during the estimation via the classical device of a constrained optimization.
Hence, these remarks motivate us to consider the estimation of an ODE as a functional estimation problem and to use flexible methods coming from nonparametric regression, from which we can derive a likely parameter for the ODE.

Principle
We use consistent nonparametric estimators of the solution x* and its derivative ẋ* in order to derive a fitting criterion for the ODE, and subsequently the M-estimator of θ* corresponding to the criterion. We denote by ‖f‖_{q,w} = (∫_0^1 |f(t)|^q w(t) dt)^{1/q}, 0 < q ≤ ∞, the L^q(w) norm on the space of integrable functions on [0, 1] w.r.t. the measure w (and ‖·‖_q is the classical norm with respect to Lebesgue measure). We suppose that w is a continuous and positive function on [0, 1]. The principle of two-step estimators is motivated by the fact that it is rather easy to construct consistent estimators of x* and of its derivative ẋ*. The estimation of a regression function and its derivatives is a widely addressed subject, and several types of estimators can be used, such as smoothing splines (more generally, smoothing in Reproducing Kernel Hilbert Spaces [43,3]), kernel and local polynomial regression [13], and series estimators [10]. So, one can construct a consistent estimator x̂_n of x* that can achieve the optimal rate of convergence with an appropriate choice of regularization parameters. One can also derive from x̂_n an estimator of the derivative ẋ* by directly differentiating the smoothed curve x̂_n, so that the derivative estimator x̂′_n is the derivative of the fitted curve. This simple device provides a consistent estimator of the derivative for Nadaraya-Watson estimators [17], spline regression [45], and smoothing splines [43], but other consistent estimators of the derivative can be obtained from local polynomial regression or wavelet estimators (e.g. [6]). In this section, we simply consider that we have two general nonparametric estimators x̂_n and x̂′_n, such that ‖x̂_n − x*‖_q = o_P(1) and ‖x̂′_n − ẋ*‖_q = o_P(1).
Hence, for the estimation of the parameter θ, we may choose as criterion to minimize the function R^q_{n,w}(θ) = ‖x̂′_n − F(·, x̂_n, θ)‖_{q,w}, from which we derive the two-step estimator θ̂_n = arg min_{θ∈Θ} R^q_{n,w}(θ). Thanks to the previous convergence results, and under additional suitable conditions to be specified below, we can show that R^q_{n,w} converges uniformly to an asymptotic criterion in probability, and that this discrepancy measure enables us to construct a consistent estimator θ̂_n. We are left with three choices of practical and theoretical importance: the choice of q, the choice of w, and the choice of the nonparametric estimators. In this paper, we show consistency in a general situation, but we provide a detailed analysis of the asymptotics of θ̂_n when q = 2, x̂_n is a regression spline (a least-squares spline, with a number of knots depending on the number of observations n), and x̂′_n is the derivative of x̂_n. In that case, the two-step estimation procedure is computationally attractive, as we have to deal with two (relatively simple) least-squares optimization problems (linear regression in the first step and nonlinear regression in the second step). As a consequence, this is a common approach in applications, and we will put emphasis on the importance of the weight function w in the asymptotics. Despite the limitation to regression splines, it is likely that our result can be extended to the classical nonparametric estimators listed above.
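As an illustration of the two steps (with q = 2 and the derivative estimator taken as the derivative of the fitted spline), here is a minimal numerical sketch in Python. The logistic vector field, the weight w(t) = t(1 − t), the sample size, the noise level and the knot sequence are all illustrative assumptions, not choices made in the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.interpolate import make_lsq_spline
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Illustrative ODE: logistic growth x' = theta0 * x * (1 - x / theta1)
theta_true = np.array([3.0, 2.0])
F = lambda t, x, th: th[0] * x * (1.0 - x / th[1])

# Simulated data y_i = x*(t_i) + eps_i on [0, 1]
n = 200
t_obs = np.linspace(0.0, 1.0, n)
x_star = solve_ivp(lambda t, x: F(t, x, theta_true), (0, 1), [0.3],
                   t_eval=t_obs, rtol=1e-9, atol=1e-9).y[0]
y = x_star + 0.02 * rng.standard_normal(n)

# Step 1: regression spline (cubic, 10 interior knots); the derivative
# estimator is simply the derivative of the fitted curve.
interior = np.linspace(0, 1, 12)[1:-1]
knots = np.concatenate(([0.0] * 4, interior, [1.0] * 4))
x_hat = make_lsq_spline(t_obs, y, knots, k=3)
dx_hat = x_hat.derivative()

# Step 2: minimize R^2_{n,w}(theta) with weight w(t) = t(1 - t), w(0) = w(1) = 0
s = np.linspace(0.0, 1.0, 500)
w = s * (1.0 - s)
def crit(th):
    r = dx_hat(s) - F(s, x_hat(s), th)
    return np.mean(w * r**2)          # Riemann approximation of the integral

theta_hat = minimize(crit, x0=np.array([1.0, 1.0]), method="Nelder-Mead").x
print(theta_hat)
```

Note that the second step involves no numerical integration of the ODE: only evaluations of the vector field along the fitted curve.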

Consistency
We show that the minimization of R^q_{n,w}(θ) gives a consistent estimator of θ. We introduce the asymptotic criterion derived from R^q_{n,w}, R^q_w(θ) = ‖F(·, x*, θ) − F(·, x*, θ*)‖_{q,w}, and we make the additional assumption that for all ε > 0, inf_{θ: ‖θ−θ*‖≥ε} R^q_w(θ) > 0, (2.4) which may be viewed as an identifiability criterion for the model.
Proposition 2.1. If x̂_n and x̂′_n are consistent, and x̂_n(t) ∈ K almost surely, then we have sup_{θ∈Θ} |R^q_{n,w}(θ) − R^q_w(θ)| = o_P(1). Moreover, if the identifiability condition (2.4) is fulfilled, the two-step estimator is consistent, i.e. θ̂_n − θ* = o_P(1). Proof. In order to show the convergence of |R^q_{n,w}(θ) − R^q_w(θ)| = |‖x̂′_n − F(·, x̂_n, θ)‖_{q,w} − ‖F(·, x*, θ) − F(·, x*, θ*)‖_{q,w}|, we make the following decomposition (using ẋ* = F(·, x*, θ*) and the triangle inequality): |R^q_{n,w}(θ) − R^q_w(θ)| ≤ ‖x̂′_n − ẋ*‖_{q,w} + ‖F(·, x̂_n, θ) − F(·, x*, θ)‖_{q,w}. (2.5) Since all the solutions x_{θ,x_0}(t) and x̂_n(t) stay in K ⊂ X, and x → F(t, x, θ) is K-Lipschitz uniformly in (t, θ), we obtain for all (t, θ), |F(t, x̂_n(t), θ) − F(t, x*(t), θ)| ≤ K |x̂_n(t) − x*(t)|, hence ‖F(·, x̂_n, θ) − F(·, x*, θ)‖_{q,w} ≤ K M^{1/q} ‖x̂_n − x*‖_q, (2.6) where M is an upper bound for w. Together, (2.5) and (2.6) imply sup_{θ∈Θ} |R^q_{n,w}(θ) − R^q_w(θ)| ≤ M^{1/q} ‖x̂′_n − ẋ*‖_q + K M^{1/q} ‖x̂_n − x*‖_q, and consequently, by the consistency of x̂_n and x̂′_n, sup_{θ∈Θ} |R^q_{n,w}(θ) − R^q_w(θ)| = o_P(1). With the additional identifiability condition (2.4) for the vector field F, Theorem 5.7 of [39] implies that the estimator θ̂_n converges in probability to θ*.
The minimization of a distance between a nonparametric estimator and a parametric family is a classical device for parametric estimation (Minimum Distance Estimators, MDE). For instance, it has been used for density estimation with the Hellinger distance [2], or for regression with the L² distance in the framework of regression model checking [24]. The difference between two-step estimators and MDE's is that two-step estimators do not minimize directly the distance between the regression function and a parametric model, but between the derivative of the regression function and the parametric model. This slight difference causes some differences in the asymptotic behavior of the parametric estimators. Nevertheless, two-step estimators are closer to MDE's than the generalized smoothing approach of Ramsay et al. [33], which also uses parametric and nonparametric estimators. These ingredients are used in a different manner, as the criterion maximized by the generalized smoothing estimator can be seen as the Lagrangian relaxation of a constrained optimization problem in a space of cubic splines (with given knots): the parametric and nonparametric estimators solve (x̂, θ̂) = arg min_{x∈S(ξ,4), θ∈Θ} ∑_{i=1}^n |y_i − x(t_i)|² + λ ∫_0^1 |ẋ(t) − F(t, x(t), θ)|² dt. The smoothing approach finds a curve that approximately solves a differential equation, which is then close to the spline approximation of the solution of the ODE (with parameter θ̂) computed by collocation [7]. Moreover, the optimization problem is solved iteratively, which implies that the nonparametric estimator is computed adaptively with respect to the parametric model and the data are used repeatedly during estimation, whereas two-step estimators use the data once, without exploiting the parametric model.

Remark 2.1 (Local identifiability). The global identifiability condition (2.4)
is difficult to check in practice and in all generality. In the case q = 2, we can focus on a simpler criterion by considering the Hessian J* of R²_w(θ) at θ = θ*: J* = 2 ∫_0^1 D_2F*(t)^⊤ D_2F*(t) w(t) dt, where D_2F* denotes the differential of F w.r.t. θ evaluated along (x*(t), θ*). If J* is nonsingular, the asymptotic criterion behaves like a positive definite quadratic form on a neighborhood V(θ*) of θ*, so the criterion R²_w separates the parameters. It is then critical to use a positive weight function w, or else some part of the path t → x*(t) might be useless for the identification and discrimination of the parameter. Conversely, w might be used to emphasize some differences between parameters. As shown in the next section, the value of w at the boundaries has a dramatic influence on the asymptotics of the two-step estimator.
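To make the remark concrete, the following sketch approximates by quadrature the Hessian of the weighted least-squares asymptotic criterion at θ* (for a quadratic criterion this reduces to twice the weighted Gram matrix of the θ-derivatives of F along the true path) and checks that it is nonsingular. The logistic model, the weight and all numerical values are hypothetical choices for illustration.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative model: logistic x' = th0 * x * (1 - x / th1).
theta_star = np.array([3.0, 2.0])
t = np.linspace(0.0, 1.0, 1000)
x_star = solve_ivp(
    lambda s, x: theta_star[0] * x * (1.0 - x / theta_star[1]),
    (0, 1), [0.3], t_eval=t, rtol=1e-9, atol=1e-9).y[0]

# Analytic partial derivatives of F w.r.t. theta for this model
dF_dth0 = x_star * (1.0 - x_star / theta_star[1])
dF_dth1 = theta_star[0] * x_star**2 / theta_star[1]**2
D2F = np.column_stack([dF_dth0, dF_dth1])           # 1000 x 2

w = t * (1.0 - t)                                   # weight with w(0) = w(1) = 0
# J = 2 * int_0^1 D2F(t)^T D2F(t) w(t) dt, by Riemann quadrature on [0, 1]
J = 2.0 * (D2F * w[:, None]).T @ D2F / len(t)

eigvals = np.linalg.eigvalsh(J)
print(eigvals)
```

Since the two derivative functions are linearly independent along the path, both eigenvalues are positive, so this toy model is locally identifiable in the sense of the remark.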
Remark 2.2 (Random times and errors in times). If the observation times t_1, . . . , t_n are realizations of i.i.d. random variables (T_1, . . . , T_n) with common c.d.f. Q, the nonparametric estimators x̂_n, such as the ones used before, are relevant candidates for the definition of the two-step estimator, since they are still consistent under some additional assumptions on Q.
As in the setting considered by Lalam and Klaassen [26], the observation times may be observed with some random errors τ_i = t_i + η_i, i = 1, . . . , n (the η_i's being white noise), so we have to estimate x from the noisy bivariate measurements (τ_i, y_i). Consistent nonparametric estimators have been proposed for the so-called "errors-in-variables" regression, examples being kernel estimators [14] and spline estimators [23] (in the L² sense). Hence, we can define exactly the same criterion function R²_{n,w} and derive a consistent parametric estimator.

Partially observed ODE
The proposed estimator can easily be extended to cases where several variables are not observed. Indeed, suppose the differential system (1.1) is partially linear in the following sense: the state splits as x = (u, v), where only u is observed, i.e. x(t_i) is replaced by u(t_i) in (2.1) (the noise ε_i being then d_1-dimensional), and the unobserved part v satisfies a linear subsystem v̇ = Av + H(u; η), v(0) = v_0. We want to estimate the parameter θ = (η, A), where H is a nonlinear function and A is a matrix, so we can take advantage of the linearity in v in order to derive an estimator for v. We can derive a nonparametric estimator for v by using û_n and the fact that t → v(t) must be the solution of the non-homogeneous linear ODE v̇ = Av + H(û_n; η), v(0) = v_0. The solution of this ODE is given by Duhamel's formula [19]: for all t ∈ [0, 1], ṽ_n(t) = exp(tA) v_0 + ∫_0^t exp((t − s)A) H(û_n(s); η) ds, which can then be plugged into the criterion R^q_{n,w}(θ). This estimator depends explicitly on the initial condition v_0, which must be estimated at the same time as θ. As previously, if H is uniformly Lipschitz, the integral ∫_0^t exp((t − s)A) H(û_n(s); η) ds converges (uniformly) in probability, in the L^q sense, to ∫_0^t exp((t − s)A) H(u*(s); η) ds as soon as û_n does; hence R^q_{n,w}(θ, v_0) converges also uniformly to an asymptotic criterion R^q_w(θ, v_0). The estimator (θ̂, v̂_0) is consistent as soon as R^q_w(θ, v_0) satisfies the identifiability criterion (2.4).
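The Duhamel plug-in can be checked numerically. The sketch below computes v(t) = exp(tA) v_0 + ∫_0^t exp((t − s)A) H(û_n(s); η) ds by quadrature and compares it with a direct integration of the linear ODE; A, η, v_0 and the stand-in proxy û_n are purely illustrative.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import expm

# Illustrative scalar case: v' = A v + H(u(t); eta), u replaced by a proxy.
A = np.array([[-1.0]])
eta = 2.0
v0 = np.array([0.5])
u_hat = np.sin                         # stands in for the nonparametric proxy of u
H = lambda u, eta: eta * np.atleast_1d(u)

def v_duhamel(t, m=800):
    # v(t) = e^{tA} v0 + int_0^t e^{(t-s)A} H(u_hat(s); eta) ds  (trapezoid rule)
    s = np.linspace(0.0, t, m)
    f = np.array([expm((t - si) * A) @ H(u_hat(si), eta) for si in s])
    h = s[1] - s[0]
    integral = h * (f[1:-1].sum(axis=0) + 0.5 * (f[0] + f[-1]))
    return expm(t * A) @ v0 + integral

# Cross-check against direct integration of the linear ODE
v_ode = solve_ivp(lambda t, v: A @ v + H(u_hat(t), eta), (0, 1), v0,
                  rtol=1e-10, atol=1e-10).y[:, -1]
print(v_duhamel(1.0), v_ode)
```

In practice the quadrature version is convenient because it makes the dependence of ṽ_n on (η, A, v_0) explicit, so the second-step criterion can be evaluated without re-solving the ODE from scratch for each candidate parameter.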

Asymptotics of two-step estimators
We give in this section an analysis of the asymptotics of two-step estimators. We focus on the least-squares criterion R²_{n,w} when the estimator of the derivative is the derivative of x̂_n. In that case, we show that the two-step estimator behaves as the sum of two linear functionals of x̂_n of different nature: a smooth one and a non-smooth one. Under broad conditions, we show that the two-step estimator is asymptotically normal, and we derive the rate of convergence. In particular, we show that this is the case for regression splines. A key result is that the use of a weight function w with boundary conditions w(0) = w(1) = 0 makes the nonsmooth part vanish, implying that the two-step estimator can have a parametric rate of convergence. Otherwise, the two-step estimator has a nonparametric rate.

Asymptotic representation of two-step estimators
We derive an asymptotic representation of θ̂_n by linearization techniques based on a Taylor expansion of R²_{n,w} around θ*. This shows that the two-step estimator behaves as a simple linear functional of x̂_n. We introduce the differentials of F at (x, θ) w.r.t. x and θ, and we denote them D_1F(x, θ) and D_2F(x, θ) respectively. For short, we adopt the notation D_iF*, i = 1, 2, for the functions t → D_iF(x*(t), θ*). Proposition 3.1. We suppose that D_12F exists and w is differentiable. We introduce the two linear operators Γ_{s,w} and Γ_{b,w} defined by Γ_{s,w}(x) = −∫_0^1 [ẇ(t) D_2F*(t) + w(t) (d/dt)D_2F*(t)]^⊤ x(t) dt − ∫_0^1 w(t) D_2F*(t)^⊤ D_1F*(t) x(t) dt and Γ_{b,w}(x) = w(1) D_2F*(1)^⊤ x(1) − w(0) D_2F*(0)^⊤ x(0). If J* is invertible, and x̂_n, x̂′_n are (resp.) consistent estimators of x* and ẋ*, then θ̂_n − θ* = 2 J*^{-1} [Γ_{s,w}(x̂_n − x*) + Γ_{b,w}(x̂_n − x*)] + o_P(‖x̂_n − x*‖_2 + ‖x̂′_n − ẋ*‖_2). Proof. For the sake of simplicity, we suppose that the vector field F does not depend on t, but the proof remains unchanged in the non-autonomous case. We remove the dependence on t and n for notational convenience and write F* = F(x*, θ*). The first order condition ∇_θ R²_{n,w}(θ̂) = 0 implies, after a Taylor expansion around (x*, θ*), that ∫_0^1 D_2F(x̄, θ̄)^⊤ (x̂′ − F(x̂, θ̂)) w dt = 0, with x̄ and θ̄ being random points between x* and x̂, and θ* and θ̂, respectively. We introduce D̄_2F = D_2F(x̄, θ̄), and an asymptotic expression for (θ̂ − θ*) is obtained by expanding F(x̂, θ̂) around (x*, θ*). It suffices then to consider the convergence of the random integral H_n = ∫_0^1 D_2F(x̂, θ*)^⊤ [(x̂′ − ẋ*) − D_1F*(x̂ − x*)] w dt. Since x → D_2F(x, θ*) is a continuous map, we can claim by the continuous mapping theorem that the random functions D̄_2F and t → D_2F(x̂(t), θ*) converge in probability (in the L² sense) to D_2F*. As a consequence, ‖D̄_2F‖_2 converges in probability to ‖D_2F*‖_2, so it is also bounded, and ‖D_2F(x̂, θ*) − D_2F*‖_2 → 0 in probability. This statement is also true for all entries of these (function) matrices, which enables us to claim that all entries of the matrix ∫_0^1 (D̄_2F − D_2F*)^⊤ D_2F* w dt tend to zero in probability (by applying the Cauchy-Schwarz inequality componentwise). Moreover, we have convergence in probability of each entry of ∫_0^1 (D̄_2F)^⊤ D_2F* w dt to the corresponding entry of ∫_0^1 (D_2F*)^⊤ D_2F* w dt. By the same arguments, and by using the fact that D_1F is also Lipschitz in (x, θ), we have convergence of the matrix G_n − H_n to 0 in probability, where G_n is the integral obtained by replacing D_2F(x̂, θ*) with D̄_2F in H_n.
The asymptotic behavior of (θ̂_n − θ*) is then given by the random integral ∫_0^1 D_2F*(t)^⊤ [(x̂′_n − ẋ*)(t) − D_1F*(t)(x̂_n − x*)(t)] w(t) dt. As this functional is linear in (x̂′_n − ẋ*) and (x̂_n − x*), it is equivalent to study the convergence of the functional Γ(x̂_n) to Γ(x*), where Γ(x) = ∫_0^1 w(t) D_2F*(t)^⊤ (ẋ(t) − D_1F*(t) x(t)) dt. Since D_12F and ẇ exist, we can integrate by parts the first term of the right-hand side: ∫_0^1 w D_2F*^⊤ ẋ dt = [w D_2F*^⊤ x]_0^1 − ∫_0^1 (ẇ D_2F* + w (d/dt)D_2F*)^⊤ x dt. As a consequence, Γ is the sum of the linear functionals Γ_{s,w}, Γ_{b,w} introduced before: Γ = Γ_{s,w} + Γ_{b,w}. Finally, by linearity, we can write θ̂_n − θ* = 2 J*^{-1} [Γ_{s,w}(x̂_n − x*) + Γ_{b,w}(x̂_n − x*)] + o_P(1). Proposition 3.1 shows that two-step estimators behave asymptotically as the sum of two plug-in estimators of linear operators: a smooth functional Γ_{s,w} and a non-smooth functional Γ_{b,w}, the latter being a weighted evaluation functional (at the boundaries). It is well known that the asymptotic behaviors of these plug-in estimators are of different nature: Stone [35] showed that the best achievable rate for pointwise estimation over Sobolev classes is n^{−α/(2α+1)} (where α is the number of continuous derivatives of x), whereas integral functionals ∫ h(t) x(t) dt can be estimated at a root-n rate in a wide variety of situations, e.g. Bickel and Ritov [4]. In particular, we will show in the next section that Γ_{s,w}(x*) can be estimated at a root-n rate with regression splines. A direct consequence of this asymptotic representation is that two-step estimators do not achieve the optimal parametric rate in general (for an arbitrary weight function w). Hence, the two-step estimators used in applications, computed with a constant weight function, have degraded asymptotic properties. Moreover, the classical bias-variance analysis of pointwise nonparametric estimation gives additional insight into the behavior of x̂_n. Indeed, it is well known that the quadratic error decomposes as E[(x̂_n(t) − x*(t))²] = b(t)² + v(t), where b is the bias function and v is the variance function. Although Γ_{s,w}(x̂_n) is also a biased estimator of Γ_{s,w}(x*), this bias can converge at a faster rate than root-n (under appropriate conditions on x̂_n), so that the bias of θ̂_n is mainly due to the bias terms b(0) and b(1).
This remark strongly motivates the use of nonparametric estimators with good boundary properties in order to reduce the bias of θ̂: a practical consequence of proposition 3.1 is that local polynomials should be preferred to the classical kernel regression estimator, because local polynomials behave better at the boundaries of the interval [13]. Nevertheless, a neat approach to the efficiency of two-step estimators is to restrict to the case of a weight function w with boundary conditions w(0) = w(1) = 0, so that Γ_{b,w} vanishes and the two-step estimator is then asymptotically equivalent to the smooth functional Γ_{s,w}(x̂_n − x*). Then, it suffices to check that the plug-in estimator Γ_{s,w}(x̂_n) is a root-n rate estimator of Γ_{s,w}(x*), which depends on the definition of x̂_n. We detail the essential properties of regression splines in the next section, and derive the desired plug-in property.

Two-step estimator with least squares splines
A spline is a piecewise polynomial that is smoothly connected at its knots. For a fixed integer k ≥ 2, we denote by S(ξ, k) the space of spline functions of order k with knots ξ = (0 = ξ_0 < ξ_1 < · · · < ξ_{L+1} = 1): a function s in S(ξ, k) is a polynomial of degree k − 1 on each interval [ξ_i, ξ_{i+1}], i = 0, . . . , L, and s is C^{k−2} on [0, 1]. S(ξ, k) is a vector space of dimension L + k, and the most common bases used in applications are the truncated power basis and the B-spline basis. The latter is usually used in statistical applications, as it is a basis of compactly supported functions which are nearly orthogonal. B-splines can be defined recursively from the augmented knot sequence τ = (τ_j, j = 1, . . . , L + 2k), with τ_1 = · · · = τ_k = 0, τ_{j+k} = ξ_j, j = 1, . . . , L, and τ_{L+k+1} = · · · = τ_{L+2k} = 1.
A (univariate) nonparametric estimator x̂_n is then computed by classical least squares (the so-called least-squares splines or regression splines). Most of the time, cubic splines (k = 4) are used, and the essential assumptions are made on the knot sequence ξ in an asymptotic setting: it is supposed that the number of knots L_n tends to infinity with a controlled repartition. The well-posedness of the B-spline basis (and of the corresponding design matrix) is critical for the good behavior of the regression spline, and the material needed for the asymptotic analysis can be found in [44]. We define below the nonparametric estimator that we use.
We have n observations y_1, . . . , y_n corresponding to noisy observations of the solution of the ODE (1.1), and we introduce Q_n, the empirical distribution of the sampling times t_1, . . . , t_n. We suppose that this empirical distribution converges to a distribution function Q (which possesses a density q w.r.t. Lebesgue measure). We construct our estimator x̂_n of x* as a function in the space S(ξ_n, k) of spline functions of order k and knot sequence ξ_n = (ξ_0, ξ_1, . . . , ξ_{L_n+1}) (K_n is the dimension of S(ξ_n, k)). The knot sequence is chosen such that max_{1≤i≤L_n} |h_{i+1} − h_i| = o(L_n^{−1}) and |ξ|/min_i h_i ≤ M, where h_i = ξ_{i+1} − ξ_i and |ξ| = sup_i h_i is the mesh size of ξ. As a consequence, we have |ξ| = O(L_n^{−1}). As in Zhou et al. [44], we suppose that we have convergence of Q_n towards Q at a rate controlled by the mesh size, i.e. sup_{t∈[0,1]} |Q_n(t) − Q(t)| = o(|ξ|). Each component of x̂_n is expanded in the B-spline basis (B_1, . . . , B_{K_n}) of S(ξ_n, k), i.e. x̂_{n,i} = ∑_{j=1}^{K_n} ĉ_{i,j} B_j with coefficients ĉ_i = (ĉ_{i,1}, . . . , ĉ_{i,K_n})^⊤ ∈ R^{K_n}. We stress the fact that all the components x̂_{n,i} are approximated via the same space; although this may be inappropriate in some practical situations, it enables us to keep simple expressions for the estimator. The fact that we look for a function in the vector space spanned by B-splines puts emphasis on the regression interpretation of the first step of our estimating procedure. The estimation of the parameter C_n can be cast into the classical multivariate regression setting Y_n = B_n C_n^⊤ + ε, where Y_n = (Y_1 . . . Y_d) is the n×d matrix of observations, ε is the n×d matrix of errors, C_n^⊤ is the K_n×d matrix of coefficients, and B_n = (B_j(t_i))_{1≤i≤n, 1≤j≤K_n} is the design matrix. We look for a function close to the data in the L² sense, i.e. we estimate the coefficient matrix C_n by least squares: ĉ_{i,n} = arg min_{c∈R^{K_n}} ∑_{j=1}^n (y_{j,i} − c^⊤ B(t_j))², i = 1, . . . , d, where B(t) = (B_1(t), . . . , B_{K_n}(t))^⊤. Finally, we introduce the projection matrix P_{B,n} = B_n (B_n^⊤ B_n)^+ B_n^⊤. We will use the notation x ≲ y to denote that there exists a constant M > 0 such that x ≤ M y.
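The construction of the augmented knot sequence, the design matrix B_n and the least-squares coefficients can be sketched as follows (a hypothetical example using SciPy's B-spline routines; the curve being fitted and all sizes are arbitrary):

```python
import numpy as np
from scipy.interpolate import BSpline

# Augmented knot sequence tau for S(xi, k): k repeats at each boundary
k = 4                                  # order (cubic splines), degree k - 1
L = 8                                  # number of interior knots
xi = np.linspace(0.0, 1.0, L + 2)      # 0 = xi_0 < ... < xi_{L+1} = 1
tau = np.concatenate((np.zeros(k), xi[1:-1], np.ones(k)))
K = L + k                              # dim S(xi, k)

# Design matrix B_n = (B_j(t_i)) and least-squares coefficients c_hat
n = 100
t = np.linspace(0.0, 1.0, n)
B = BSpline.design_matrix(t, tau, k - 1).toarray()   # n x K
assert B.shape == (n, K)

y = np.sin(2 * np.pi * t)              # placeholder "observations" (noiseless)
c_hat, *_ = np.linalg.lstsq(B, y, rcond=None)
x_hat = BSpline(tau, c_hat, k - 1)
print(np.max(np.abs(x_hat(t) - y)))
```

Each row of B_n sums to one (partition of unity of the B-splines), and only k entries per row are nonzero, which is what makes the first-step regression well conditioned and cheap.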
General results given by Huang in [21] ensure that x̂_n → x* in L² in probability for sequences of suitably chosen approximating spaces S(ξ_n, k) with an increasing number of knots. Indeed, corollary 1 in [21] enables us to claim that if the observation times are random with Q(B) ≥ cλ(B) (0 < c ≤ 1 and λ(·) is the Lebesgue measure on [0, 1]), the function x* is in the Besov space B^α_{2,∞} (with k ≥ α − 1), and the dimension grows such that lim_n (K_n log K_n)/n = 0, then ‖x̂_n − x*‖²_2 = O_P(K_n/n + K_n^{−2α}). Moreover, the optimal rate O_P(n^{−2α/(2α+1)}) (given by Stone [36]) is reached for K_n ∼ n^{1/(2α+1)}. For this nonparametric estimator, it is possible to construct a consistent two-step estimator θ̂_n by minimization of R²_{n,w}(θ).
We then need a general result for the asymptotics of linear functionals of spline estimators. This can be seen as a special case of a general result derived by Andrews for series estimators [1]. First, we need a precise picture of the evolution of the basis (B_1, . . . , B_{K_n}) as K_n → ∞, and particularly of the asymptotic behavior of the empirical covariance G_{K_n,n} = (1/n) B_n^⊤ B_n and of the (theoretical) covariance G_{K_n} = ∫_0^1 B(t) B(t)^⊤ dQ(t). From [44], we have that if K_n = o(n), these two matrices are asymptotically equivalent, with eigenvalues of exact order |ξ|. The analysis of section 3.1 motivates interest in the asymptotic behavior of Γ_a(x̂_n), where Γ_a is a linear functional Γ_a(x) = ∫_0^1 a(s)^⊤ x(s) ds, and a is a smooth function. Hence, the asymptotic behavior is derived directly from the asymptotics of Ĉ_n and of the vector γ_n = (Γ_a(B_1), . . . , Γ_a(B_{K_n}))^⊤. By using the results of Andrews [1], we derive the asymptotic normality of this functional. For simplicity, we consider only the case d = 1; the extension to higher dimensions is cumbersome but straightforward. If the variance of the noise is σ², the variance of ĉ_n is σ² (B_n^⊤ B_n)^{−1}, and the variance of the estimator of the functional is V_n = σ² γ_n^⊤ (B_n^⊤ B_n)^{−1} γ_n.
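The variance formula for the plug-in functional can be verified on a toy example: below, γ_n is computed by quadrature, and the predicted variance σ² γ_n^⊤ (B_n^⊤ B_n)^{−1} γ_n is compared with a Monte Carlo estimate over repeated noisy samples. The function a, the true curve and all sizes are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(1)
k, L, n, sigma = 4, 6, 300, 0.1
tau = np.concatenate((np.zeros(k), np.linspace(0, 1, L + 2)[1:-1], np.ones(k)))
K = L + k

t = np.linspace(0.0, 1.0, n)
B = BSpline.design_matrix(t, tau, k - 1).toarray()

# gamma_n[j] = Gamma_a(B_j) = int_0^1 a(s) B_j(s) ds, by Riemann quadrature
a = lambda s: np.cos(np.pi * s)                     # illustrative smooth a
grid = np.linspace(0.0, 1.0, 4000)
gamma = BSpline.design_matrix(grid, tau, k - 1).toarray().T @ a(grid) / len(grid)

# Predicted variance of Gamma_a(x_hat): sigma^2 gamma^T (B^T B)^{-1} gamma
V_pred = sigma**2 * gamma @ np.linalg.solve(B.T @ B, gamma)

# Monte Carlo check over repeated noisy samples of a fixed curve
x_true = np.sin(2 * np.pi * t)
M = np.linalg.solve(B.T @ B, B.T)                   # maps y -> c_hat
vals = [gamma @ (M @ (x_true + sigma * rng.standard_normal(n)))
        for _ in range(3000)]
V_mc = np.var(vals)
print(V_pred, V_mc)
```

The agreement reflects the fact that, for a fixed design, the fitted functional γ_n^⊤ ĉ_n is a linear statistic of the noise, so its variance is exactly σ² γ_n^⊤ (B_n^⊤ B_n)^{−1} γ_n.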

Proposition 3.2 (Convergence of linear functionals of regression splines). Let (ξ_n)_{n≥1} be a sequence of knot sequences of length L_n + 2, and K_n = dim(S(ξ_n, k)), with k ≥ 2. We suppose that L_n → ∞ is such that n^{1/2}|ξ_n| → 0 and n|ξ_n| → ∞. If a : [0, 1] → R is continuously differentiable, then V_n^{−1/2} (Γ_a(x̂_n) − Γ_a(x*)) converges in law to N(0, 1), with V_n ∼ n^{−1}. Proof. In order to prove the asymptotic normality of Γ_a(x̂_n) − Γ_a(x*), we check the assumptions of theorem 2.1 of [1]. Assumption A is satisfied because the ε_i's are i.i.d. with finite variance. For assumption B, since a is C^1, the functional is continuous with respect to the Sobolev norm (or simply the sup norm). From the approximation property of spline spaces [7], it is possible to construct a spline ã = α_n^⊤ B such that ‖a − ã‖_∞ ≲ |ξ|. The quality of approximation of the functional Γ_a is directly derived: |Γ_a(x) − Γ_ã(x)| = |∫_0^1 (a − ã)(s) x(s) ds| ≲ |ξ| ‖x‖_∞. Hence, it suffices to look at the case a = α_n^⊤ B, because Γ_a(x) − Γ_ã(x) will tend to zero at a faster rate than n^{−1/2}. We introduce the vector γ_n = (Γ_a(B_1), . . . , Γ_a(B_{K_n}))^⊤, so we have γ_n^⊤ γ_n = α^⊤ G_{K_n}^⊤ G_{K_n} α ≥ λ²_min(G_{K_n}) ‖α‖²_2. From lemma 6.2 of [44] (giving bounds on the eigenvalues of G_{K_n}), we get γ_n^⊤ γ_n ≳ |ξ| ‖α‖²_2. Lemma 6.1 of [44] (about the equivalence of the L² norm and the Euclidean norm of spline coefficients) ensures that γ_n^⊤ γ_n is bounded away from 0, because |ξ| ‖α‖²_2 ≍ ∫_0^1 a²(s) dQ_n(s); hence lim inf_n γ_n^⊤ γ_n > 0, and assumption B is satisfied. From (3.5), we get the behavior of the diagonal entries of P_{B,n}: (P_{B,n})_{ii} = (1/n) B(t_i)^⊤ G_{K_n,n}^{−1} B(t_i), and ‖B(t)‖²_2 ≤ k (because the B-splines are bounded by 1 and only k of them are strictly positive at any t) ensures that max_i (P_{B,n})_{ii} = O((n|ξ|)^{−1}) → 0. It is clear that B_n is of full rank for n large enough.
Since α ≤ k, there exists a sequence (x_n) ∈ S(ξ_n, k) such that ‖x_n − x*‖_∞ = O(|ξ|^α) [7]; hence n^{1/2} ‖x_n − x*‖_∞ → 0. If we use again the spline approximation of the function a, we derive the expression V_n = (σ²/n) γ_n^⊤ G_{K_n,n}^{−1} γ_n, which remains true when a is any smooth function in C^1. According to Andrews, we can conclude that V_n^{−1/2} (Γ_a(x̂_n) − Γ_a(x*)) converges in law to N(0, 1). We obtain an equivalent of the rate of convergence by the same approximation as above, i.e. V_n ∼ σ² ‖α‖²_2 |ξ|/n, and we obtain finally that V_n ∼ n^{−1}. The technique used by Andrews for his theorem 2.1 also gives the asymptotic normality of x̂_n(t) = B(t)^⊤ ĉ_n. From (3.5), we get V(x̂_n(t)) = (σ²/n) B(t)^⊤ G_{K_n}^{−1} B(t) + o(1/(n|ξ|)), so that V(x̂_n(t)) ∼ C/(n|ξ|) by lemma 6.6 of [44].
For deriving the asymptotics of the two-step estimator when regression splines are used in the first step, we just need to put the results of propositions 3.1 and 3.2 together.
Theorem 3.1 (Asymptotic normality and rates of the two-step estimator). Let F be a C^m vector field w.r.t. (θ, x) (m ≥ 1), such that D_1F and D_2F are Lipschitz w.r.t. (θ, x), and D_12F exists. We suppose that the Hessian J* of the asymptotic criterion R^2_w(θ) evaluated at θ* is nonsingular, and that the conditions of proposition 2.1 are satisfied.
Proof. From proposition 3.1, we have an asymptotic representation of θ̂_n − θ*, so that we just have to apply proposition 3.2 to Γ_{s,w}(x) and Γ_{b,w}(x). We can claim the asymptotic normality of √n (Γ_s(x̂_n) − Γ_s(x*)) and of √(n|ξ_n|) (Γ_{b,w}(x̂_n) − Γ_{b,w}(x*)) (normality is extended from scalar to multidimensional functionals by the Cramér-Wold device). We then have two cases for the rate of convergence of θ̂_n, depending on the values w(0), w(1). When w(0) = w(1) = 0, only the parametric part remains, but when Γ_{b,w} does not vanish the nonparametric part with rate √(n|ξ_n|) remains. We can determine the optimal rate of convergence in the mean-square sense by using the bias-variance decomposition for the evaluation functional: ‖θ̂_n − θ*‖^2 = O_P((E x̂_n(t) − x*(t))^2) + O_P(Var(x̂_n(t))). Theorem 2.1 of [44] gives (E x̂_n(t) − x*(t))^2 = O(|ξ_n|^{2(m+1)}) (because x* is C^{m+1}) and Var(x̂_n(t)) = O_P(n^{-1}|ξ_n|^{-1}), so the optimal rate is reached for |ξ_n| = O(n^{-1/(2m+3)}) and is O(n^{-(2m+2)/(2m+3)}).
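The balancing step at the end of the proof can be made explicit; this is just the arithmetic behind the stated exponents:

```latex
|\xi_n|^{2(m+1)} \asymp \frac{1}{n|\xi_n|}
\iff |\xi_n|^{2m+3} \asymp n^{-1}
\iff |\xi_n| \asymp n^{-1/(2m+3)},
\qquad
\|\hat\theta_n - \theta^*\|^2 = O_P\!\big(n^{-(2m+2)/(2m+3)}\big).
```

For instance, m = 2 gives the mean-square rate n^{-6/7}, which falls just short of the parametric rate n^{-1} obtained when the weight vanishes at the boundaries.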
Remark 3.1 (Random observational times). The asymptotic result given for the deterministic observational times 0 ≤ t_1 < · · · < t_n ≤ 1 remains true when they are replaced by realizations of some random variables T_1, …, T_n, as long as the assumptions of the two previous propositions hold with probability one. Andrews gives some conditions (theorem 2) in order to obtain this. It turns out that in the case of T_1, …, T_n i.i.d. random variables drawn from the distribution Q, it suffices to have K_n^4 = O(n^r) with 0 < r < 1. In particular, as soon as m ≥ 1, the conclusion of proposition 3.1 holds with probability one for the optimal rate K_n = n^{1/(2m+3)}.
We have restricted theorem 3.1 to regression splines in order to have a precise and self-contained statement of conditions under which the asymptotics of the two-step estimator is known. In particular, cubic splines give root-n consistent estimators for smooth differential equations (m ≥ 2) and an appropriate weight function.
The main point of our study is that the asymptotic normality and the parametric rate follow from the behavior of the smooth functional Γ_{s,w}, which can be analyzed for series estimators (polynomials or Fourier series) using [1]. Moreover, the same theorem can be derived for kernel regression (Nadaraya-Watson) by using the results on plug-in estimators of Goldstein and Messer [18]. More generally, the same theorem may be obtained for other nonparametric estimators such as local polynomial regression, orthonormal series, or wavelets.

Experiments
The Lotka-Volterra equation is a standard model for the evolution of prey-predator populations. It is a planar ODE

ẋ = ax + bxy,
ẏ = cy + dxy,    (4.1)

whose behavior is well known [19]. Despite its simplicity, it can exhibit convergence to limit cycles, one of the main features of nonlinear dynamical systems, which usually has a meaningful interpretation. Due to this simplicity and the interpretability of the solution, it is often used in biology (population dynamics or phenomena with competing species), but the statistical estimation of the parameters a, b, c, d in (4.1) has not been extensively addressed. Nevertheless, Varah (1982) illustrates the two-step method with cubic splines and knots chosen by an expert on the same model as (4.1). Froda et al. (2005) [16] have considered another original method exploiting the fact that the orbit may be a closed curve for some values of the parameters. In this section, we consider a slight generalization and reparametrization of the previous model, which consists in adding the squared terms x^2 and y^2:

ẋ = a_1 x^2 + a_2 xy + a_3 x,
ẏ = b_1 xy + b_2 y^2 + b_3 y.

We use the two-step estimator obtained by minimizing R^2_{n,w}(θ) in order to illustrate the consistency and the asymptotic normality of the estimator proved in the previous section. In particular, we consider two estimators: one obtained with a uniform weight function w = 1, and a second one with a weight vanishing at the boundaries. According to theorem 3.1, this gives two different rates of convergence; we then consider the influence of the number of observations n = 20, 30, 50, 100, 200, 500, 1000. We also include small numbers of observations (n = 20, 30), as the nonparametric procedure can give poor results in the small-sample case, and the simulation gives an insight into the expected properties in this practical situation.
As the reconstruction of the solution in the first step is critical, we consider the estimation of an ODE with two different parameter values: θ_1 with a_1 = 0, a_2 = −1.5, a_3 = 1, b_1 = 2, b_2 = 0, b_3 = −1.5, and θ_2 with a_1 = 0, a_2 = −1.5, a_3 = 1, b_1 = 2, b_2 = 1, b_3 = −1.5. In both cases, the parameters of the quadratic terms a_1 and b_2 are supposed to be known, so that 4 parameters have to be estimated from noisy discrete observations of the solution of the ODE. We also suppose that the initial conditions are unknown, so that they are nuisance parameters that are not estimated by our procedure. These two parameter values θ_1, θ_2 give rise to two different qualitative behaviors of the solution, as can be seen in figures 1 and 2, which gives a view of the influence of the shape of the solution on the inference. The shape of the solution has two consequences: the identifiability of the model through the asymptotic criterion R^2_w, and the difficulty of the first step (if the curve is rather wiggly or flat). The data are simulated according to (2.1), where ε is a Gaussian white noise with standard deviation σ = 0.2, and the observation times are deterministic and equal to t_j = 20j/n, j = 0, …, n − 1. We now define exactly the two-step estimator that has been used in the experiments. In the first step, we have used a least-squares cubic spline with an increasing number of knots K_n. The selection of K_n and of the position of the knots is a classical problem in the spline regression literature. The study of the asymptotics gives only the rate of K_n (or |ξ_n|) that optimizes the rate of convergence, but one has to find practical devices in order to compute a correct estimator.
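A simulation of the case-1 data can be sketched as follows. The quadratic form of the vector field is the one written above, but the initial condition is an assumption of ours (it is treated as an unknown nuisance parameter in the paper), so this is only an illustrative reconstruction of the design.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Generalized Lotka-Volterra field with the squared terms x^2 and y^2:
#   x' = a1*x^2 + a2*x*y + a3*x,   y' = b1*x*y + b2*y^2 + b3*y
def field(t, z, a1, a2, a3, b1, b2, b3):
    x, y = z
    return [a1 * x**2 + a2 * x * y + a3 * x,
            b1 * x * y + b2 * y**2 + b3 * y]

theta1 = (0.0, -1.5, 1.0, 2.0, 0.0, -1.5)       # case 1: (a1, a2, a3, b1, b2, b3)

n, sigma = 100, 0.2
t_obs = 20.0 * np.arange(n) / n                 # t_j = 20 j / n, j = 0, ..., n-1
sol = solve_ivp(field, (0.0, 20.0), [1.0, 1.0], # initial condition assumed here
                args=theta1, t_eval=t_obs, rtol=1e-8)

rng = np.random.default_rng(0)
data = sol.y.T + sigma * rng.standard_normal((n, 2))   # noisy observations
```

With b_2 = 0 and a_1 = 0 this reduces to the classical system (4.1), so the case-1 trajectories are the usual periodic prey-predator oscillations.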
For instance, this choice was left to the practitioner in Varah's paper; but due to the intensive simulation framework used for estimating the bias and variance of two-step estimators, we have used an adaptive selection of the knots in order to obtain an automatic first step with reliable and stable approximation properties. This procedure permits us to focus on the quality of the two-step estimator, and not on the first step in particular. We have considered three different bases of splines with uniformly spaced knots ξ_j, and we have selected the knots by applying a variable selection approach to the truncated power basis (instead of the B-spline basis), defined as the set of functions 1, t, …, t^{k−1}, (t − ξ_1)_+^{k−1}, …, (t − ξ_L)_+^{k−1}. The knots are selected by minimizing the Generalized Cross-Validation (GCV) criterion, where K_m is a subset (of size m) of {ξ_j, j = 0, …, K_n} and d(K_m) is the effective number of parameters. As in [15, 27], we use d(K_m) = 3m + 1, together with a heuristic for exploring the various subsets efficiently, based on the elimination of the least informative knots (for GCV) and the addition of the most informative ones in order to minimize the GCV criterion (ElimAdd procedure in [27]). This procedure is not optimal, but it gives a simple and reliable adaptive nonparametric estimator. Other knot-selection procedures could have been based on free-knot splines [8, 37]. Experiments have also been performed in the non-adaptive case [5].
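The GCV display is not reproduced above; a minimal sketch of the selection rule, with the usual form RSS/(n(1 − d/n)^2) and d(K_m) = 3m + 1 as in the text, runs as follows. The candidate knot subsets and the test signal are illustrative choices of ours, not the paper's ElimAdd search.

```python
import numpy as np

def truncated_power_design(t, knots, k=4):
    """Design matrix for the truncated power basis of order k (degree k-1):
    1, t, ..., t^{k-1}, (t - xi_1)_+^{k-1}, ..., (t - xi_L)_+^{k-1}."""
    cols = [t**j for j in range(k)]
    cols += [np.clip(t - xi, 0.0, None) ** (k - 1) for xi in knots]
    return np.column_stack(cols)

def gcv(t, y, knots, k=4):
    """GCV score with effective number of parameters d(K_m) = 3m + 1,
    m being the number of knots in the subset (as in the text)."""
    X = truncated_power_design(t, knots, k)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n, d = len(y), 3 * len(knots) + 1
    return rss / (n * (1.0 - d / n) ** 2)

# toy use: pick the better of two hypothetical knot subsets for a wiggly signal
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 200)
y = np.sin(12 * t) + 0.1 * rng.standard_normal(200)
cand = [np.array([0.5]), np.array([0.25, 0.5, 0.75])]
best = min(cand, key=lambda km: gcv(t, y, km))
```

For a signal this wiggly, the single-knot fit underfits badly, so GCV prefers the richer subset; on flatter signals the penalty term pushes the choice the other way.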
The quality of the nonparametric estimation procedure is measured by the Root Mean Squared Error (RMSE) and is given in tables 2 and 5, so that the MSE of the two-step estimator can be related directly to the quality of the first step. In general, the second step can be addressed with any (nonlinear) optimization procedure. Nevertheless, in the case of the Lotka-Volterra model, it turns out that we have a closed-form expression for θ̂. Indeed, the criterion R^2_{n,w} for the first dimension can be written as a weighted least-squares criterion, where the integral is approximated by a Riemann sum (∆_j is the size of the subdivision, supposed to be uniform) and w̃_j = ∆_j w(s_j) x̂(s_j) is a new weight. Hence, the estimator is the solution of a weighted least-squares program whose solution exists, is unique and has a known expression. This means that the two-step estimator remarkably furnishes a fast and reliable procedure with no problem of local minima, although we deal with a nonlinear regression model. The quality of the two-step estimator has been evaluated by computing its mean and standard deviation through a Monte Carlo study with 1000 independent drawings. The results are shown in tables 1 and 4 for the models with θ_1 and θ_2, and illustrate the consistency of the two-step procedure as n grows to infinity. We compute, with the same nonparametric estimator x̂, estimators based either on the uniform weight function w = 1 or on a weight vanishing at the boundaries; in our experiments the latter is a piecewise linear function with w(0) = 0, w(1) = 1, w(19) = 1 and w(20) = 0 (and w(t) = 1 for t ∈ [1, 19]). We also get a picture of the behavior of the estimator for small sample sizes: the two-step estimator (weighted or unweighted) is biased in cases 1 and 2, and the bias diminishes significantly once n ≥ 100. An important feature is that the weighted estimator is better behaved than the unweighted one, in both experiments and for both small and large n.
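Because the quadratic model is linear in (a_1, a_2, a_3) once x̂ and its derivative are plugged in, the second step reduces to a weighted linear least-squares solve. The sketch below is ours (the helper name and the synthetic sanity check are not from the paper): it checks the mechanism on a pair of curves with a known answer, x(t) = e^t, which solves x' = 1·x, i.e. (a_1, a_2, a_3) = (0, 0, 1).

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

def second_step(xhat, yhat, w, s):
    """Closed-form second step for the first equation, assuming the model
    x' = a1*x^2 + a2*x*y + a3*x: a weighted linear least-squares solve."""
    X, Y = xhat(s), yhat(s)
    dX = xhat.derivative()(s)                  # derivative of the first-step spline
    Z = np.column_stack([X**2, X * Y, X])      # regressors, linear in (a1, a2, a3)
    sw = np.sqrt(w(s) * (s[1] - s[0]))         # Riemann-sum weights Delta_j * w(s_j)
    coef, *_ = np.linalg.lstsq(Z * sw[:, None], dX * sw, rcond=None)
    return coef

# sanity check: x = e^t solves x' = 1*x; y = e^{-t} keeps the regressors independent
t = np.linspace(0.0, 2.0, 400)
knots = np.linspace(0.0, 2.0, 12)[1:-1]
xhat = LSQUnivariateSpline(t, np.exp(t), knots, k=3)
yhat = LSQUnivariateSpline(t, np.exp(-t), knots, k=3)
s = np.linspace(0.05, 1.95, 500)               # stay away from the boundaries
w = lambda u: np.ones_like(u)
a1, a2, a3 = second_step(xhat, yhat, w, s)
```

Since the normal equations of this program always have a unique solution when the regressors are independent, there is indeed no local-minimum issue in the second step.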
Indeed, the standard deviation of θ̂_{1,w} is equal to (in case 1) or smaller than (in case 2) that of θ̂_1, but the main difference between the two estimators comes from the bias term, which induces a bigger RMSE for the unweighted estimator; see tables 2 and 5. This difference comes from the presence of a bias term in the evaluation functionals of the unweighted estimator (as emphasized in theorem 3.1), which can be particularly important at the boundaries. In case 2, this difference is very important, due to the flat part for t ≥ 10.
From the expression of R^2_{n,w}, it is clear that the quality of θ̂ is directly related to the distance ‖x̂ − x*‖_2^2, as one can see in tables 2 and 5, but our experiments show that this is not sufficient. First, one also needs to estimate correctly the derivative of the solution; second, a low minimum of the criterion R^2_{n,w}(θ̂_{n,w}) does not indicate a good estimator. This difficulty arises when we compare the distance of the nonparametric estimator and the value of the criterion (see tables 3 and 6). For instance, we have a better nonparametric estimator and criterion in case 2 than in case 1 (e.g. n = 100), but a lower RMSE in case 1. Hence, case 2 appears to be a more difficult model to estimate, and the shape of the solution has a direct influence on the asymptotic criterion R^2_w and on the ability to approximate it. We have also checked the normality of the two-step estimator with a univariate Kolmogorov-Smirnov test for each of the four parameters. The test was adapted to the case of unknown mean and variance (via the Lilliefors procedure), and the normality assumption cannot be rejected (at the 5% level) as soon as n = 20, for all parameters.
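The Lilliefors adjustment can be sketched via Monte Carlo calibration: the KS statistic is computed after standardizing by the estimated mean and standard deviation, and its null distribution is simulated rather than read off the usual KS table (which is invalid when the parameters are estimated). This is our own minimal implementation, applied here to synthetic samples rather than to the paper's Monte Carlo estimates.

```python
import numpy as np
from scipy import stats

def lilliefors_mc(sample, n_mc=500, seed=0):
    """KS statistic with estimated mean/sd; p-value calibrated by Monte
    Carlo on Gaussian samples of the same size (Lilliefors procedure)."""
    sample = np.asarray(sample)
    n = len(sample)
    def ks_stat(x):
        z = (x - x.mean()) / x.std(ddof=1)
        return stats.kstest(z, "norm").statistic
    d_obs = ks_stat(sample)
    rng = np.random.default_rng(seed)
    d_null = np.array([ks_stat(rng.standard_normal(n)) for _ in range(n_mc)])
    return d_obs, float(np.mean(d_null >= d_obs))

rng = np.random.default_rng(1)
d_norm, p_norm = lilliefors_mc(rng.standard_normal(200))  # Gaussian sample
d_exp, p_exp = lilliefors_mc(rng.exponential(size=200))   # clearly non-normal
```

The non-normal sample is rejected essentially with certainty at n = 200, while the Gaussian sample typically is not, which is the behavior reported for the estimated parameters.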

Conclusion
We have proposed a new family of parametric estimators of ODE's relying on nonparametric estimators, which are simpler to compute than straightforward parametric estimators such as the MLE or LSE. The construction of this parametric estimator puts emphasis on the regression interpretation of the ODE estimation problem, and on the link between a parameter of the ODE and an associated function. By using an intermediate functional proxy, we expect to gain information and precision on the likely values of the parameters. We have not studied the effect of using shape or inequality constraints on the estimator x̂_n, but this might provide valuable information for the inference of complex models, either by shortening the computation time (it gives more suitable initial conditions) or by accelerating the rate of convergence of the estimator thanks to the restriction to smaller sets of admissible parameter values.
We have particularly studied the case of R^2_{n,w}(θ), but other M-estimators, such as the one obtained from R^1_{n,w}(θ), may possess interesting theoretical and practical properties such as robustness. This could be particularly useful in the case of noisy data, which can give oscillating estimates of the derivatives of the function.
We have given a general study of the two-step estimator (consistency, asymptotic expansion), and we have shown that the weight function used in R^2_{n,w} dramatically controls the rate of convergence of the estimator, and that this method can furnish root-n consistent estimators. We have then provided a detailed account of the asymptotic behavior of spline-based estimators. This choice is mainly due to the practical interest and wide use of splines, but the conclusions should remain the same for the usual nonparametric estimators (series estimators, kernel estimators). Indeed, we have shown that the asymptotic behavior of the two-step estimator comes from the behavior of a linear functional of the regression function.
In the experiments, we have illustrated the influence of the weight function, as well as the influence of the solution on the quality of the estimators. In particular, we have shown that the approximation quality of the solution is not sufficient to obtain a good estimator, and that the difficulty of the estimation depends on the shape of the solution.