Optimal rate of direct estimators in systems of ordinary differential equations linear in functions of the parameters

Many processes in biology, chemistry, physics, medicine, and engineering are modeled by a system of differential equations. Such a system is usually characterized by unknown parameters, and estimating their 'true' values is thus required. In this paper we focus on the quite common systems for which the derivatives of the states may be written as sums of products of a function of the states and a function of the parameters. For such a system, linear in functions of the unknown parameters, we present a necessary and sufficient condition for identifiability of the parameters. We develop an estimation approach that bypasses the heavy computational burden of numerical integration and avoids the estimation of the derivatives of the system states, drawbacks from which many classic estimation methods suffer. We also suggest an experimental design for which smoothing can be circumvented. The optimal rate of the proposed estimators, i.e., their √n-consistency, is proved, and simulation results illustrate their excellent finite sample performance and compare it to that of other estimation approaches.


Introduction
Many processes in biology, chemistry, physics, medicine, and engineering are modeled by a system of differential equations. Parameter estimation for such systems is considered the bottleneck in modeling dynamic processes and attracts growing attention in the recent statistical literature. In particular, new estimation methods are developed (e.g., [38, 37]) and quite old techniques are rigorously analyzed (e.g., [49, 21]). Below we review other research as well. Most of it considers systems of ordinary differential equations (ODEs) of the form

x′(t) = F(x(t); ν), t ∈ [0, 1], x(0) = ξ, (1)

where x(t) takes values in R^d, ξ in Ξ ⊂ R^d, and ν ∈ N ⊂ R^q. The seemingly more general nonautonomous system x̃′(t) = F̃(x̃(t), t; ν), t ∈ [0, 1], x̃(0) = ξ̃, may and will be reduced to the autonomous system (1) by the simple substitution x(t) = (x̃^T(t), t)^T, t ∈ [0, 1], ξ = (ξ̃^T, 0)^T. In many applications states and parameters can be separated in the sense that there exist measurable functions g : R^d → R^{d×p} and h : N → R^p such that

F(x(t); ν) = g(x(t)) h(ν) (2)

holds. We write θ = h(ν), θ ∈ Θ = h(N) ⊂ R^p, and call θ the natural parameter, whereas ν is the parameter of interest. The class of ODEs (2) is widely used in practice because of the interpretability of the natural parameters as rate constants. In statistics a similar structure is popular; think of linear regression and, e.g., Cox's proportional hazards model. The following list includes examples of systems in which the function h is the identity, i.e., systems that are linear in the parameters: the Lotka–Volterra system in population dynamics ([15]); models describing HIV dynamics ([36, 32, 33, 47, 16]); models for the blood coagulation process ([27]); problems in chemistry ([40]); gene regulatory networks ([5]); models describing the spread of infectious diseases ([25, 29]); calcium measurements analysis ([39]); pharmacokinetic models ([14]). A well known example where the system is not linear in the parameters but separability of the states and parameters is still possible is the FitzHugh–Nagumo system in neurophysiology ([17, 35]).
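To make the separable structure (2) concrete, here is a minimal Python sketch (with illustrative state and parameter values of our own choosing) that checks the factorization F(x; θ) = g(x)θ for the Lotka–Volterra system, where h is the identity:

```python
def g(x):
    # 2 x 4 matrix g(x) for the Lotka-Volterra system:
    # rows correspond to the two states, columns to the natural parameters
    x1, x2 = x
    return [[x1, -x1 * x2, 0.0, 0.0],
            [0.0, 0.0, x1 * x2, -x2]]

def F(x, theta):
    # the right-hand side of the system, written directly
    x1, x2 = x
    t1, t2, t3, t4 = theta
    return [t1 * x1 - t2 * x1 * x2,
            t3 * x1 * x2 - t4 * x2]

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

x = [1.5, 0.8]                 # illustrative state vector
theta = [0.5, 0.3, 0.2, 0.4]   # illustrative natural parameters
# F(x; theta) = g(x) theta, up to floating-point rounding
assert all(abs(a - b) < 1e-12 for a, b in zip(matvec(g(x), theta), F(x, theta)))
```

The same pattern (a state-dependent matrix times a parameter vector) is what the direct estimators below exploit: the criterion function becomes quadratic in θ.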
The extensive list of applications above suggests that systems for which it is possible to separate the states from the parameters deserve special attention and treatment. However, current methods do not seem to exploit the full potential of such systems, both theoretically and practically. In the present study we attempt to do just this: in Section 2 we discuss identifiability in systems linear in the parameters; in Section 3 we present a general estimation approach for the case where all trajectories of x are observed. In Section 4 we develop two estimators for different experimental setups and derive their √n-consistency, i.e., their optimal rate of convergence. In Section 5 the application of the methods is demonstrated via simulations, and a discussion is presented in Section 6. The proofs are relegated to the Appendix.

Identifiability
A prerequisite for consistent estimation is that the parameter is identifiable. There are several concepts of identifiability (e.g., [2, 11, 31, 48]; see also [34] and references therein). We are concerned with structural identifiability, a property that depends on the mathematical structure of the model but is not affected by the randomness of physical experiments. To be more specific, the identifiability criterion given in Proposition 1 below is stated in terms of a particular solution (i.e., set of trajectories) to the initial value problem. Clearly, a particular solution depends on elements of the experimental setup such as initial conditions and control parameters. Verifying the structural identifiability of a model is usually a difficult task that can be carried out only in models of low dimension (e.g., the models describing HIV dynamics studied in [33, 47] and [34]).
Exploiting linearity in the natural parameter θ we start with the following observation. By integration, (1) and (2) yield the system of integral equations

x(t) = ξ + ∫_0^t g(x(s)) ds θ, t ∈ [0, 1]. (3)

Given the values of ξ and θ the solution of (1), (2), and (3) is denoted by x(t; θ, ξ), t ∈ [0, 1]. In the present context identifiability means that knowledge of a solution x(t), t ∈ [0, 1], of the system (1), (2), and (3) yields the values of the parameters ξ and θ.
For ξ = x(0) this is obviously true, while identifiability of θ means that x(·; θ, ξ) = x(·; θ̃, ξ) implies θ = θ̃. From (3) we see that different values of θ may yield the same solution x(t), t ∈ [0, 1], if and only if the p columns of g(x(t)) are linearly dependent, satisfying a nontrivial linear equation that is the same for Lebesgue almost all t ∈ [0, 1]. This observation is generalized and formulated precisely in the proposition below. For its formulation we need some notation. Let W be a symmetric d × d-matrix of finite signed measures on ([0, 1], B) with B the sigma field of Borel sets, and let x : [0, 1] → R^d and y : [0, 1] → R^d be Borel measurable vector valued functions. We assume that W is chosen in such a way that

⟨x, y⟩_W = ∫_0^1 x^T(t) dW(t) y(t) (5)

is a semidefinite inner product, and ‖x‖_W = ⟨x, x⟩_W^{1/2} is the corresponding seminorm. Note that in (5) the integration with respect to dW(t) includes y(t). To clarify this notation we note the following. Let μ be a finite nonnegative measure on ([0, 1], B) dominating all signed measures in the matrix W (for example, the sum of the variations of the finite signed measures in W will do). Denote by w(·) the d × d-matrix of the Radon–Nikodym derivatives of the signed measures in W with respect to μ. Now (5) may be rewritten as

⟨x, y⟩_W = ∫_0^1 x^T(t) w(t) y(t) dμ(t). (6)

Note that the inner product from (6) introduces equivalence classes of measurable functions in that x : [0, 1] → R^d and y : [0, 1] → R^d are equivalent if and only if ‖x − y‖_W = 0 holds. We shall assume that ‖x‖_W = 0 implies that x_i(t) = 0 for W_ii-almost all t ∈ [0, 1] and for i = 1, …, d. We shall assume also that 0 belongs to the support of W_ii for i = 1, …, d. If x : [0, 1] → R^{d×k} and y : [0, 1] → R^{d×ℓ} are measurable matrix valued functions, then ⟨x, y⟩_W will be interpreted as the k × ℓ matrix of the inner products of the columns of x and of y. Denote the d × d identity matrix by I_d and assume that the matrix ⟨I_d, I_d⟩_W is well-defined with finite entries and positive definite.
Proposition 1. Let x(t), t ∈ [0, 1], satisfy the system (1)–(3), and write G(t) = ∫_0^t g(x(s)) ds. Let W be a symmetric d × d-matrix of signed measures as in (5) satisfying (8) and having the other properties mentioned above. Assume that the d × p- and p × p-matrices ⟨I_d, G⟩_W and ⟨G, G⟩_W are well-defined with finite entries.
A proof of this proposition is given in Appendix A.1, but we note already here that (11) and (12) follow from the fact that at its minimum 0 the derivatives of ‖x − ζ − Gη‖²_W with respect to η and ζ, at θ and ξ respectively, have to vanish. Note that C_W is singular if and only if there exists a p-vector η ≠ 0 satisfying (13). If the W_ii are equivalent to Lebesgue measure on the unit interval, this implies that the p columns of g(x(t)) in R^d satisfy the same nontrivial linear relationship for Lebesgue almost all t ∈ [0, 1]. Conversely, this linear relationship on g(x(t)), t ∈ [0, 1], implies (13) and hence the singularity of C_W.
A careful examination of the proposition above reveals that uniqueness of the solution x(t; θ, ξ) (as a function of t) is not required for identifiability of the natural parameter. Note that uniqueness of solutions was previously assumed in [37, 49], and in [20], who dealt with the fully nonlinear case. According to the Picard–Lindelöf theorem, existence and uniqueness of the solution x(·; θ, ξ) in some neighborhood of 0 is guaranteed if the map g(·) is Lipschitz continuous; see also [1, Chapter 2]. However, for any positive α ≠ 1 one may consider a (one dimensional) initial value problem that admits a solution for every τ ∈ [0, 1). Hence, there are infinitely many solutions of this initial value problem. Nevertheless, the parameter θ is identifiable, which may be verified by calculating with W the uniform distribution. As for identifiability of the parameter of interest ν, we note that part (i) of Proposition 1 may be applied if the measurable parametrization function h : N → Θ is injective. If the natural parameter is not identifiable, the parameter of interest might be. However, we will not study this rather complicated situation here.

Methodological approach
In practice, the values of ξ, θ, and ν are unknown, and one usually observes x(t; θ, ξ), θ = h(ν), with noise and at certain time points only. We denote the n observations by

Y(t_i) = x(t_i; θ, ξ) + ε(t_i), i = 1, …, n, (15)

where ε(t_i) is the unobserved d-dimensional column vector of measurement errors at time t_i. This experimental setup is common (e.g., [38] and [20]), and many methods for estimating parameters in this context have been developed.
For an extensive survey of recent developments in parameter estimation and structure identification of biochemical and genomic systems, see [10]. Since the list of estimation methods is extensive, a detailed review is not feasible; we therefore focus on the two most relevant techniques. The first is the nonlinear least squares (NLS) method, which motivates our study, while the second is the two-step approach, which we adopt. The classical nonlinear least squares method aims at minimizing over η ∈ Θ and ζ ∈ Ξ the function

Σ_{i=1}^n ‖Y(t_i) − x(t_i; η, ζ)‖²,

where ‖·‖ denotes the standard Euclidean norm. Unless an exact solution x(t_i; η, ζ) is at hand, it is approximated via numerical integration, and the minimization of the criterion function is carried out by searching the parameter space for the global minimum. Statistical properties of this method are studied in [49] for the situation where ξ ∈ Ξ is known. However, [45] demonstrate that the need to repeat the numerical integration many times may drive its share of the computational effort up to 95% of the total time required by a gradient based optimization method (even in low dimensional systems).
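The computational pattern that makes NLS expensive can be sketched for a scalar toy system x′(t) = −θx(t) with known initial value: every candidate parameter value requires a fresh numerical integration before the residual sum of squares can be evaluated. A pure-Python illustration (RK4 integrator and a naive grid search, chosen here for transparency; this is not the method analyzed in [49]):

```python
import math, random

def rk4(f, x0, ts):
    # classical 4th-order Runge-Kutta on the grid ts
    xs = [x0]
    for t0, t1 in zip(ts, ts[1:]):
        h, x = t1 - t0, xs[-1]
        k1 = f(t0, x)
        k2 = f(t0 + h / 2, x + h * k1 / 2)
        k3 = f(t0 + h / 2, x + h * k2 / 2)
        k4 = f(t1, x + h * k3)
        xs.append(x + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6)
    return xs

random.seed(1)
theta_true, x0 = 2.0, 1.0
ts = [i / 100 for i in range(101)]
y = [math.exp(-theta_true * t) + random.gauss(0, 0.02) for t in ts]  # noisy data

def rss(eta):
    # each evaluation re-integrates the ODE: the costly step of NLS
    xs = rk4(lambda t, x: -eta * x, x0, ts)
    return sum((yi - xi) ** 2 for yi, xi in zip(y, xs))

grid = [i / 100 for i in range(100, 301)]  # candidate values in [1, 3]
theta_hat = min(grid, key=rss)
assert abs(theta_hat - theta_true) < 0.2
```

The grid search calls the integrator 201 times; gradient based optimizers reduce the number of calls but not the need to re-integrate at every iterate, which is exactly the burden the integral approach below removes.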
In order to bypass the burden of numerical integration, several collocation estimation methods were developed, such as the two-step technique (e.g., [3, 43]) and generalized profiling ([38]). The generalized profiling method is asymptotically efficient ([37]) provided the distribution of the measurement errors is known, and it can handle a variety of problems ([29, 50]). On the other hand, the two-step approach, although requiring the choice of some smoothing parameter, is relatively more straightforward to apply. Thus, a two-step method can serve as a preliminary step in the parameter estimation task, to be followed by more complex methods such as generalized profiling. This type of estimation strategy was successfully demonstrated in [46] for fully observed systems and in [12] for the partially observed case.
The classical two-step approach works as follows. The observations are first smoothed, which results in an estimator x̂_n(·) for the solution x(·; θ, ξ) of the system and, by differentiation, in the estimator x̂′_n(·) for x′(·; θ, ξ). Then the estimator of θ is the minimizer θ̂_n over η ∈ Θ of the smooth criterion function

∫_0^1 ‖x̂′_n(t) − g(x̂_n(t)) η‖² w(t) dt, (16)

where w is an appropriate weight function. By estimating the "true" trajectories of the system and their derivatives, the two-step approach bypasses the need to integrate the system numerically, and as a result the parameter estimates can be computed extremely fast ([7, 30]). Under regularity conditions [20] show that this "smooth and match" estimator (SME) θ̂_n has the √n-rate of convergence to θ. This is an example of the use of nonparametric "plug-in" or substitution estimators (see [19] and [4]). When the system is linear in the parameters, (16) can be minimized straightforwardly, as noted in [5, 16] and [20]. However, their methods are based on estimates of derivatives, and it is well known (see [44] and [10]) that estimating derivatives from noisy and sparse data may be rather inaccurate. Indeed, this problem has attracted some attention ([22, 6]). The methodology developed in the present paper is a two-step approach that does not require the estimation of derivatives. Moreover, we also pay attention to estimation of the initial value ξ.
Let x̂_n(t), t ∈ [0, 1], be an estimator of x(t; θ, ξ) based on the observations (15), and write Ĝ_n(t) = ∫_0^t g(x̂_n(s)) ds. In view of (3) and in analogy to (16) it makes sense to estimate the parameters θ and ξ by minimizing

‖x̂_n − ζ − Ĝ_n η‖²_{W_n} (17)

over ζ ∈ Ξ and η ∈ Θ. Minimizing the criterion function (17) with respect to ζ and η results in the direct estimators ξ̂_n and θ̂_n (cf. (11) and (12)). Note that these estimators are well-defined only if the inverse matrices in (19) and (20) exist. In case the initial value ξ is known, (20) may be used with ξ̂_n replaced by ξ.
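For the scalar system x′(t) = θx(t), (3) reads x(t) = ξ + θ ∫_0^t x(s) ds, so minimizing the integral criterion reduces to ordinary least squares in the two unknowns (ζ, η) once the integral is approximated. A minimal sketch of this closed-form computation (trapezoidal rule for the integral; the exact trajectory stands in for the smoother x̂_n, and the weighting is simply uniform over the grid, both simplifying assumptions):

```python
import math

theta, xi = -1.5, 2.0
ts = [i / 200 for i in range(201)]
x = [xi * math.exp(theta * t) for t in ts]  # exact trajectory, in place of a smoother

# G[i] approximates integral_0^{t_i} x(s) ds via the trapezoidal rule
G, acc = [0.0], 0.0
for i in range(1, len(ts)):
    acc += (ts[i] - ts[i - 1]) * (x[i] + x[i - 1]) / 2
    G.append(acc)

# least squares for x(t_i) = zeta + eta * G(t_i): 2x2 normal equations by hand
n = len(ts)
sG, sGG = sum(G), sum(g * g for g in G)
sx, sGx = sum(x), sum(g * v for g, v in zip(G, x))
det = n * sGG - sG * sG
eta_hat = (n * sGx - sG * sx) / det       # direct estimator of theta
zeta_hat = (sx - eta_hat * sG) / n        # direct estimator of xi
assert abs(eta_hat - theta) < 1e-3 and abs(zeta_hat - xi) < 1e-3
```

No ODE solver and no derivative estimate is involved: once Ĝ_n is tabulated, the estimators are a single linear-algebra solve.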
In order to estimate the parameter of interest ν, we choose a distance function d_n(·, ·) on R^p and choose ν̂_n in such a way that d_n(h(ν̂_n), θ̂_n) comes arbitrarily close to inf_{ν ∈ N} d_n(h(ν), θ̂_n). Of course, if the infimum is attained, we choose ν̂_n as the minimizer.
The idea of an integral-based estimation approach as in (17) appeared already in [26]. These authors chose the d × d-matrix W_n to be a diagonal matrix with each diagonal element a weight function putting all its mass at the observation times t_1, …, t_n. They proposed three specific weight functions, namely equal weights at all time points and two data dependent weight functions. These choices, with all their mass at the observation times, allowed these authors to skip the smoothing step and to use Y(t_i) instead of our x̂_n(t_i). This has the disadvantage that they had to consider multiple versions of their design, which they called runs, in order to obtain a good performance of their estimators. However, they did not derive statistical properties of their estimators. Their method is referred to in the chemical engineering literature as the 'direct integral method', and some papers revisited this idea ([51, 42], and [18]). In the next section we introduce two modifications of the direct integral method. These 'modified integral methods' yield estimators with such desired statistical properties as consistency and the parametric √n rate of convergence. Still, the resulting estimators will not be statistically efficient. By a one-step Newton–Raphson type of modification they can be turned into estimators equivalent to least squares estimators, into efficient estimators when the distribution of the measurement errors is known, and even into semiparametrically efficient estimators when the distribution of the measurement errors is unknown. These modifications are under study (see, e.g., [13]). A possible way to apply our modified integral methods to general ODE systems, which are not necessarily linear in functions of the parameter, is under study as well.

Asymptotic properties
We start with some general asymptotic results for the estimation approach defined above, and then we discuss two specific experimental setups. Comparing our estimators (19) and (20) to (11) and (12), we see that they are consistent if g(·) is continuous and x̂_n(·) is a consistent estimator of x(·) in an appropriate sense. Indeed, with the notation ‖x‖_∞ = sup_{t∈[0,1]} ‖x(t)‖ we have the following result.

Theorem 1. Let the model be defined by (1)–(3) with the map g continuous. Let W be a symmetric d × d-matrix of signed measures as in (5) satisfying the conditions of Proposition 1. Furthermore, let the matrix C_W from (10) be nonsingular, which implies that θ is identifiable via (12). Finally, let x̂_n(·) be a consistent estimator of x(·) = x(·; θ, ξ) in the supnorm, i.e., ‖x̂_n − x‖_∞ → 0 in probability. If the sequence of matrices W_n converges weakly to W in the sense that the elements of W_n converge weakly to the corresponding elements of W, then the estimators ξ̂_n and θ̂_n as presented in (19) and (20) are asymptotically well-defined and consistent. Moreover, if θ = h(ν) holds, d_n(x, y)/‖x − y‖ is bounded away from 0 and infinity for all x, y ∈ R^p with x ≠ y, and h^{−1}(·) is continuous, then ν̂_n as defined via (21) is consistent as well.
Consequently, we have consistency of our estimators at all values of the parameters for which the conditions are satisfied. Note that if the system is not linear in its parameters, then the criterion function as in (16) cannot be minimized directly, and one needs to search the parameter space for the minimum. This procedure requires that the criterion function separates the parameter space well (cf. equation (3.9) in [20]). In our case this condition is immediately satisfied.
In order to get consistency at a certain rate, we need stronger conditions on g(·) and the estimator x̂_n(·).
Theorem 2. Let the model be defined by (1)–(3) with the map g : R^d → R^{d×p} twice continuously differentiable. Fix ξ ∈ Ξ and θ ∈ Θ and let x(·) = x(·; θ, ξ) exist and be bounded on [0, 1]. Assume that θ is identifiable. Let the d × d-matrices W_n and W be as defined in (5), satisfying the conditions of Proposition 1, and assume that the approximation conditions on W_n hold for every differentiable function f and for every bounded measurable function. Then the estimators ξ̂_n and θ̂_n as defined in (19) and (20) are consistent to the stated order as n → ∞. Furthermore, if g : R^d → R^{d×p} is twice differentiable and all second derivatives of all components of g(·) are bounded, then the last condition is not needed in order to obtain (28). Moreover, if θ = h(ν) holds, d_n(x, y)/‖x − y‖ is bounded away from 0 and infinity for all x, y ∈ R^p with x ≠ y, and h^{−1}(·) is Lipschitz continuous, then ν̂_n as defined via (21) is consistent to the corresponding order as well.

With appropriate rates in these conditions, this theorem presents sufficient conditions for the fastest possible rate, i.e., √n-consistency. In the next subsection we present an estimator satisfying these conditions.

Smooth estimator of solution ODE
Our estimators θ̂_n and ξ̂_n are defined by (18)–(20) and are based on an estimator x̂_n(·) of the solution x(·) of the ODE system (1)–(3). Clearly the quality of the estimators θ̂_n and ξ̂_n depends on the properties of the estimator x̂_n(·), as is illustrated by the conditions of Theorem 2. Since the classical kernel estimators are inconsistent at the boundaries of the interval [0, 1], they do not satisfy these conditions. Consequently, we need other estimators of x(·).
Our choice here is to use a local polynomial type of estimator. Under the assumption that all components of the solution x(·) are C^α-functions for some real α ≥ 1, we will approximate them by polynomials of degree ℓ = ⌊α⌋. This works as follows; cf. [41, Section 1.6]. For a given point t_i, i = 1, …, n, and for t sufficiently close to t_i, the d-vector x(t_i) approximately equals ν(t) U((t_i − t)/b), where b = b_n > 0 is a bandwidth, the (ℓ+1)-vector U(u) = (1, u, u²/2!, …, u^ℓ/ℓ!)^T is a column vector, and ν(t) is a d × (ℓ+1)-matrix. Let K(·) be some appropriate kernel function and let ν̂_n(t) minimize

Σ_{i=1}^n ‖Y(t_i) − ν U((t_i − t)/b)‖² K((t_i − t)/b)

over the d × (ℓ+1)-matrices ν. The local polynomial estimator of order ℓ of x(t) is the first column of the d × (ℓ+1)-matrix ν̂_n(t), i.e., x̂_n(t) = ν̂_n(t) U(0). For a fixed t this estimator is just a weighted least squares estimator ([41, Section 1.6]) and it may be written as the linear estimator

x̂_n(t) = Σ_{i=1}^n W_{ni}(t) Y(t_i), (29)

with

W_{ni}(t) = (1/(nb)) U^T(0) B_n^{−1}(t) U((t_i − t)/b) K((t_i − t)/b), B_n(t) = (1/(nb)) Σ_{i=1}^n U((t_i − t)/b) U^T((t_i − t)/b) K((t_i − t)/b).

The following conditions on the kernel K will assure that the matrix B_n(t) is positive definite and the estimator (29) is unique.

Condition K.
(i) The kernel K is symmetric around zero and has compact support, which lies within [−1, 1].

Conditions (i) and (iv) above are typical assumptions in kernel estimation. The Lipschitz property in (ii) is needed when deriving upper bounds for the risk of the estimator with respect to the supremum norm. The lower bound for the kernel function in (iii) is needed to assure that the matrix B_n(t) is positive definite.
Local polynomial estimators are consistent and "automatically" correct for the boundaries. We note that some types of boundary kernel estimators have bias and variance of the same order; however, they usually have a complicated form and are not easy to implement (see [9] for a discussion of this problem). The following theorem assures us that estimating x(·) by a local polynomial estimator fulfills the requirements of Theorem 2. A careful choice of the bandwidth b = b_n will result in a √n-rate for the estimators θ̂_n and ξ̂_n.
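A pure-Python sketch of the local polynomial estimator of order ℓ = 1 (local linear), evaluated at a single interior point with the Epanechnikov kernel used later in the simulations; the 2 × 2 weighted least-squares system is solved by hand, and the noise-free quadratic signal is purely illustrative:

```python
def local_linear(ts, ys, t, b):
    # weighted least squares fit of y ~ c0 + c1 * (s - t) with
    # Epanechnikov weights K(u) = 0.75 * (1 - u^2) on |u| <= 1
    s0 = s1 = s2 = m0 = m1 = 0.0
    for s, y in zip(ts, ys):
        u = (s - t) / b
        if abs(u) > 1:
            continue
        w = 0.75 * (1 - u * u)
        d = s - t
        s0 += w; s1 += w * d; s2 += w * d * d
        m0 += w * y; m1 += w * y * d
    det = s0 * s2 - s1 * s1
    return (s2 * m0 - s1 * m1) / det  # fitted intercept = estimate of x(t)

n = 200
ts = [i / n for i in range(1, n + 1)]       # design t_i = i/n
ys = [t * t for t in ts]                    # noise-free quadratic, for illustration
est = local_linear(ts, ys, 0.5, n ** (-1 / 3))
assert abs(est - 0.25) < 1e-2               # small bias of order b^2
```

Near t = 0 and t = 1 the window becomes one-sided, but the same weighted least-squares solve remains well posed, which is the boundary-correction property referred to above.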
Theorem 3. Let the model be defined by (1)–(3) with the map g : R^d → R^{d×p} twice continuously differentiable. Let the observations be given by (15) with t_i = i/n, i = 1, …, n. Assume that ε_j(t_i), i = 1, …, n, j = 1, …, d, are i.i.d. with mean 0 and finite variance σ²_ε. Let the estimator x̂_n(·) of x(·) be given by (29). Let W be a d × d-matrix of signed measures as in Proposition 1, and let the estimators ξ̂_n and θ̂_n be defined by (19) and (20).

Condition (30) states that the total variation distance between W_n and W should converge to 0 sufficiently fast. Note that (31) is satisfied if the W_{n,hh} have bounded densities with respect to Lebesgue measure on [0, 1] or with respect to 1/n times counting measure on {i/n, i = 1, …, n}. Furthermore, note that for any θ ∈ Θ and ξ ∈ Ξ the component x_j(t; θ, ξ) of the solution is a C^α-function of t in a neighborhood of 0, provided the map g is C^α in its argument ([1, p. 52, Section 7.6, Corollary 4]).
Notice that [20] study systems that are not necessarily linear in the parameters. To prove √n-consistency of their estimator they need Gaussianity or boundedness of the measurement errors; here mean 0 and finite variance suffice.
The method developed above relies on the preliminary step of smoothing the observations. As a result, its performance depends heavily on the choice of the smoothing parameter. This choice is not trivial in practice (see, e.g., [38, 37] and [20]), especially if one deals with a large system or if the underlying system has "fast" and "slow" components; in that case, using different bandwidths for different components makes more sense. However, the proof of Theorem 3 will show that for α ≥ 3/2 the choice b ≍ n^{−1/3} always suffices.

Step function estimator of solution ODE
As mentioned above, choosing the smoothing parameter in practice may not be trivial. This problem can be avoided in situations like the following repeated measures model,

Y_j(t_i) = x(t_i; θ, ξ) + ε_j(t_i), j = 1, …, J_i, i = 1, …, I, (33)

with t_i = i/I, i = 1, …, I. Hence, we observe J_i repeated measures of x(t_i) for each time point t_i, which means that we have n = Σ_{i=1}^I J_i observations in total. This is common practice in many fields and therefore makes a quite reasonable experimental setup.
Within this observation scheme it is natural to estimate x(t_i) by the average

x̄_n(t_i) = (1/J_i) Σ_{j=1}^{J_i} Y_j(t_i), (34)

and even to estimate x(t) by the step function

x̂_n(t) = x̄_n(t_i), t ∈ (t_{i−1}, t_i], i = 1, …, I, (35)

where we complete the definition of x̂_n(t) on [0, 1] by x̂_n(0) = x̂_n(t_1). This definition does not mean that we intend to estimate the initial value x(0; θ, ξ) = ξ by x̂_n(0). The estimator x̂_n(·) is a preliminary estimator of x(·; θ, ξ) that will be used to construct a more accurate estimator of ξ than x̂_n(0). We choose W_n as in Theorem 3. Again, our estimators θ̂_n and ξ̂_n are defined by (18)–(20). This estimator θ̂_n with ξ̂_n replaced by ξ equals the estimator based on the direct integral method of [26] with J_i = J the number of runs and with W_ij = 1 in (5)–(8) of [26], provided the starting values of all runs are the same and known. Both estimators θ̂_n and ξ̂_n are √n-consistent if the number of time points I is of order √n and for most time points t_i the sample size J_i is of order √n too. We formulate this accurately in the following theorem.
Note that var(x̂_n(0)) = σ²_ε/J_1 holds, so that estimating ξ via x̂_n(0) would not yield the best possible rate unless J_1 is of exact order n. Indeed, the √n-rate is achievable by ξ̂_n using the information from all I time points.
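The pointwise-average construction behind the step function estimator can be sketched as follows (the signal, noise level, and number of repeats are illustrative choices, not those of the simulation study; the piecewise-constant extension assigns to t the average at the first grid point to its right):

```python
import bisect, math, random

random.seed(7)
I, J = 30, 40
ts = [(i + 1) / I for i in range(I)]           # t_i = i/I
truth = [math.sin(t) for t in ts]              # illustrative smooth trajectory
Y = [[truth[i] + random.gauss(0, 0.5) for _ in range(J)] for i in range(I)]

xbar = [sum(col) / J for col in Y]             # pointwise averages, as in (34)

def x_step(t):
    # piecewise-constant extension: value at the first grid point >= t,
    # with x_step(0) set to the average at t_1
    if t <= ts[0]:
        return xbar[0]
    return xbar[min(bisect.bisect_left(ts, t), I - 1)]

# with J = 40 repeats, the average at each point has sd 0.5 / sqrt(40)
assert abs(x_step(0.5) - math.sin(0.5)) < 0.3
```

No bandwidth appears anywhere: averaging over repeats replaces smoothing over neighboring time points, which is exactly why this design circumvents the smoothing-parameter choice.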

Simulation study
In our simulation study we report on the finite sample properties of the smooth estimator of Section 4.1 and the step function estimator of Section 4.2. The smooth estimator is tested by comparing its performance to that of the derivative based two-step approach and of the generalized profiling estimator. This comparison is done for the same situations as were used in the simulation studies of these estimators in the literature. The study of the step function estimator focuses on understanding the effect on the estimation accuracy of the number of repeated measures, as well as of different error distributions. In all simulations below, whenever the integral approach is applied, the initial values are considered unknown and therefore are estimated as well.

Smooth estimator
Several researchers have studied the problem of parameter estimation for the FitzHugh–Nagumo model ([17, 35]), and therefore it is a good example to consider. This is a system with two states, proposed as a simplification of the model presented in [28] for studying and simulating the animal nerve axon. Specifically, this model is used in neurophysiology as an approximation of the observed spike potential and takes the form

x′_1(t) = γ (x_1(t) − x_1³(t)/3 + x_2(t)),
x′_2(t) = −(1/γ) (x_1(t) − α + β x_2(t)). (37)

The voltage x_1(t) moving across the cell membrane depends on the recovery variable x_2(t). This system was studied in [30], who applied the derivative based method, and in [38], who used generalized profiling. We will compare the integral based approach to the results in the aforementioned papers. Note that the FitzHugh–Nagumo model was studied also by [8], who pointed out some difficulties in estimating the parameters of this ODE system.

Comparison with the derivative based method
By setting ν = (α, β, γ)^T, the system (37) takes the form (2) with θ = h(ν) = (γ, 1/γ, α/γ, β/γ)^T and the corresponding matrix

g(x) = [ x_1 − x_1³/3 + x_2   0     0    0
         0                  −x_1    1   −x_2 ].

While estimating the ODE parameter vector θ using a derivative based method does not involve the initial value vector ξ, this is not the case with the integral based approach. In any case, we consider the initial values to be unknown and estimate them as well.
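As a quick numerical check (with illustrative state and parameter values), the factorization x′ = g(x)h(ν) with h(ν) = (γ, 1/γ, α/γ, β/γ)^T reproduces the FitzHugh–Nagumo right-hand side:

```python
def g_fhn(x):
    # 2 x 4 matrix from the FitzHugh-Nagumo system in separable form
    x1, x2 = x
    return [[x1 - x1 ** 3 / 3 + x2, 0.0, 0.0, 0.0],
            [0.0, -x1, 1.0, -x2]]

def fhn_rhs(x, alpha, beta, gamma):
    x1, x2 = x
    return [gamma * (x1 - x1 ** 3 / 3 + x2),
            -(x1 - alpha + beta * x2) / gamma]

alpha, beta, gamma = 0.34, 0.2, 3.0
theta = [gamma, 1 / gamma, alpha / gamma, beta / gamma]  # theta = h(nu)
x = [0.7, -0.4]                                          # illustrative state
lhs = [sum(a * b for a, b in zip(row, theta)) for row in g_fhn(x)]
rhs = fhn_rhs(x, alpha, beta, gamma)
assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
```

Here h is not the identity, so the direct estimators deliver θ̂_n in closed form, and ν must afterwards be recovered through (21).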
The experimental setup follows that of [30]. The true parameter vector is set to (α, β, γ)^T = (0.34, 0.2, 3)^T and the initial conditions to ξ = (ξ_1, ξ_2)^T = (0, 0.1)^T. The two signals are first generated by solving the system at 0.1 time units on the interval [0, 20] (n = 201; note that the theory as developed for the time interval [0, 1] in the preceding sections is valid for any bounded interval [0, T], as may be seen by scaling), and then Gaussian measurement errors with zero mean and variances σ²_1, σ²_2, respectively, are added. In particular, here we used local polynomial estimators of order ℓ = 1 for estimating the two components of x(·). The kernel function used for generating the local polynomial estimators was the same as that considered in [30], namely K(t) = (3/4)(1 − t²)1{|t| ≤ 1}, where 1{·} stands for the indicator function. The last choice to be made before proceeding is that of the bandwidth b. As pointed out in Remark 3 of [30], bandwidth selection is critical in local polynomial regression. They used a bandwidth that under-smooths with respect to the optimal bandwidth for estimating x(·). Here we simply choose b = n^{−1/3} (see the proof of Theorem 3).
Once θ̂_n is obtained, we can estimate ν using (21). To be more specific, we take d_n to be the Mahalanobis distance

d_n(x, y) = (x − y)^T Σ̂_n^{−1} (x − y),

where Σ̂_n is the estimated covariance matrix of θ̂_n. Given the observation model (15), it is natural to define a bootstrap procedure for estimating Σ_n as follows (cf. [23]). Repeat B times the following steps:
(i) For each point t_i compute the residuals ε̂_j(t_i) = Y_j(t_i) − x̂_j(t_i).
(ii) Center the residuals: ε̃_j(t_i) = ε̂_j(t_i) − (1/n) Σ_{k=1}^n ε̂_j(t_k).
(iii) Sample n residuals (with replacement) from ε̃_j(t_1), …, ε̃_j(t_n) to obtain the bootstrap residuals ε*_j(t_1), …, ε*_j(t_n), and set Y*_j(t_i) = x̂_j(t_i) + ε*_j(t_i).
We then use the bootstrap sample Y*_j(t_1), …, Y*_j(t_n) and apply the estimation procedure. Denote the estimator of the vector θ in the bth bootstrap sample by θ*_{n,b} and its average over the B bootstrap samples by θ̄*_n. Then we define

Σ̂_n = (1/(B − 1)) Σ_{b=1}^B (θ*_{n,b} − θ̄*_n)(θ*_{n,b} − θ̄*_n)^T.

Then we minimize d_n(h(ν), θ̂_n) over ν using a standard nonlinear optimization procedure (in this case, the function fminsearch in Matlab). As an initial guess for the optimization step we take an arbitrary estimate of ν denoted by ν̂_0 = (α̂_0, β̂_0, γ̂_0)^T; in this case we obtained it as follows. Let θ̂¹_n, θ̂²_n, θ̂³_n, θ̂⁴_n stand for the components of the vector of estimates θ̂_n. Then, inverting the map h componentwise, one may take, e.g., γ̂_0 = θ̂¹_n, α̂_0 = γ̂_0 θ̂³_n, and β̂_0 = γ̂_0 θ̂⁴_n.
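The residual-bootstrap covariance step can be sketched generically as follows (scalar parameter for brevity; `estimate` is a placeholder for whatever map takes a sample to θ̂_n, and the toy data below are purely illustrative):

```python
import random

def bootstrap_cov(Y, fitted, estimate, B=100, seed=0):
    # residual bootstrap: center the residuals, resample them with
    # replacement, refit, and return the empirical variance of the
    # B replicate estimates
    rng = random.Random(seed)
    n = len(Y)
    res = [y - f for y, f in zip(Y, fitted)]
    mean_res = sum(res) / n
    res = [r - mean_res for r in res]                    # step (ii): center
    thetas = []
    for _ in range(B):
        Ystar = [f + rng.choice(res) for f in fitted]    # step (iii): rebuild sample
        thetas.append(estimate(Ystar))
    m = sum(thetas) / B
    return sum((t - m) ** 2 for t in thetas) / (B - 1)   # scalar case of Sigma_hat

# toy use: the "estimator" is simply the sample mean
Y = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [3.0] * 5
var_hat = bootstrap_cov(Y, fitted, lambda ys: sum(ys) / len(ys))
assert var_hat > 0
```

In the vector-valued case the last line becomes the outer-product average over the replicates, giving the p × p matrix Σ̂_n used in the Mahalanobis distance.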

Table 1: Empirical means (standard deviations) for estimating the parameters of the FitzHugh–Nagumo system. The parameters are estimated by (19)–(20) and (38), with (29) a local polynomial estimator of order ℓ = 1 and bandwidth b = n^{−1/3}. Based on 500 Monte Carlo simulations. The true parameter vector is (α, β, γ)^T = (0.34, 0.2, 3)^T.

We conducted M = 500 Monte Carlo simulations as in [30] and set B = 100 for the bootstrap samples. The resulting empirical means and standard deviations of the integral approach are displayed in Table 1, where 36 different variance combinations are considered. The estimation results are substantially better, uniformly over the experimental study, than those reported in Table 1 of [30] for the derivative based approach. Furthermore, another measure of accuracy presented in that paper is the average relative estimation error (ARE).

Table 2: Comparison of the empirical average relative estimation error (ARE) in the FitzHugh–Nagumo system of the integral approach (Integral) with that of the derivative based two-step method (Derivative) of [30]. For the Integral estimator the parameters are estimated by (19)–(20) and (38), with (29) a local polynomial estimator of order ℓ = 1 and bandwidth b = n^{−1/3}. Based on 500 Monte Carlo simulations. The true parameter vector is (α, β, γ)^T = (0.34, 0.2, 3)^T. The two signals are first generated by solving the system at 0.1 time units on the interval [0, 20] (n = 201) and then adding Gaussian measurement errors with zero mean and variances σ²_1, σ²_2.

The ARE of a parameter a is given by

ARE(a) = (1/M) Σ_{m=1}^M |â_m − a| / |a|,

where â_m is the estimate of a in simulation m, and in our case M = 500. Table 2 here presents the ARE of the integral based two-step approach and corresponds to Table 2 of [30]. For convenience, the results of Table 2 of [30] are presented in Table 2 as well, under the title "Derivative", since their method is a derivative based two-step approach. We see that the AREs of the integral approach are substantially better, uniformly over the experimental study, than those of the derivative based approach.

Comparison with generalized profiling
The experimental setup here follows that of [38]. In particular, they consider the FitzHugh-Nagumo model

    x_1'(t) = c (x_1(t) − x_1(t)^3/3 + x_2(t)),
    x_2'(t) = −(x_1(t) − a + b x_2(t))/c.

The true parameter vector is set to θ = (a, b, c)^T = (0.2, 0.2, 3)^T and the initial conditions to ξ = (ξ_1, ξ_2)^T = (−1, 1)^T. The two signals are first generated by solving the system at 0.05 time units on the interval [0, 20] (n = 401), and then we add Gaussian measurement errors with zero mean and variances σ_1^2 = σ_2^2 = 0.5. The integral approach is executed as described above; the initial conditions are estimated as well. The estimation results, based on 500 Monte Carlo simulations, are presented in Table 3. In the table we also present the results of the generalized profiling estimator, chosen to be adapted to the Gaussianity of the measurement errors, as reported in Table 1 of [38]. However, since it is not clear to us which initial guess was used there for the optimization over the parameter space, we also generated one experiment of our own. In particular, we first generate an initial guess in the parameter space as a Gaussian random vector with means the true parameters and a standard deviation of 0.5 (the variable jitter in the original code downloaded from the authors' website). Then we run the 500 Monte Carlo simulations using the same initial guess throughout. The results are similar to those reported in [38], except for the parameter c, for which the variability is higher here. We did not repeat the experiment for other initial guesses since, depending on the distance of the random guess from the true parameter vector, it could take the program about 90 seconds to execute a single one of the 500 required simulations (using an Intel(R) Core(TM) i7-4550U CPU @ 1.50GHz 2.10GHz, 64-bit).

Table 3: Comparison of the empirical means (standard deviations) in the FitzHugh-Nagumo system of the integral approach with those of the generalized profiling of [38]. For the integral estimator the parameters are estimated by (19)-(20) and (38), with (29) a local polynomial estimator of order ℓ = 1 and bandwidth b = n^{−1/3}.

In comparison, using the same hardware, one simulation of computing the integral estimator (including generating the 100 bootstrap samples for estimating the covariance) takes about 17 seconds to conclude. We note that when the system is linear in the parameters there is no need for the bootstrap, and the execution time of the integral estimator drops to less than 0.2 seconds. Also, in calculating the total execution time for the integral estimator we exclude the time needed for constructing the matrix W, the weights of the local polynomials, since this matrix can be constructed before any observations are generated. In summary, the estimated variance of the generalized profiling estimator is smaller than that of the two-step based integral approach. This is not surprising, since the generalized profiling estimator is asymptotically efficient, as it has been chosen to be adapted to the Gaussianity of the measurement errors. However, the generalized profiling approach involves an iterative optimization method (Gauss-Newton), which in turn requires a good initial guess in the parameter space. Otherwise, the resulting estimates and execution time may be very poor. Thus, the integral approach may be used as a preliminary step in the estimation procedure, since it provides theoretical and practical guarantees that the resulting estimates are in the vicinity of the true parameter vector. Such a strategy may substantially improve the execution time of the generalized profiling approach even for systems of small dimensions (see for example Table 3 in [46]).
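The data generating step of this experiment can be sketched as follows in Python (the paper's code is in Matlab). The right-hand side below uses the common (a, b, c) parameterization of the FitzHugh-Nagumo system; the exact form and the fixed-step Runge-Kutta integrator are our assumptions, chosen only to illustrate the design.

```python
import numpy as np

# FitzHugh-Nagumo right-hand side (assumed parameterization):
#   x1' = c*(x1 - x1^3/3 + x2),   x2' = -(x1 - a + b*x2)/c
def fhn(x, a, b, c):
    x1, x2 = x
    return np.array([c * (x1 - x1**3 / 3.0 + x2), -(x1 - a + b * x2) / c])

def rk4_path(f, x0, t_grid, *args):
    """Classical fixed-step Runge-Kutta solution on a uniform grid."""
    x = np.empty((t_grid.size, len(x0)))
    x[0] = x0
    for i in range(t_grid.size - 1):
        h = t_grid[i + 1] - t_grid[i]
        k1 = f(x[i], *args)
        k2 = f(x[i] + 0.5 * h * k1, *args)
        k3 = f(x[i] + 0.5 * h * k2, *args)
        k4 = f(x[i] + h * k3, *args)
        x[i + 1] = x[i] + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

a, b, c = 0.2, 0.2, 3.0                      # true parameter vector
t_obs = np.linspace(0.0, 20.0, 401)          # 0.05 time units, n = 401
path = rk4_path(fhn, np.array([-1.0, 1.0]), t_obs, a, b, c)

# add Gaussian measurement errors with zero mean, variance 0.5
rng = np.random.default_rng(1)
noisy = path + rng.normal(0.0, np.sqrt(0.5), size=path.shape)
print(noisy.shape)
```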

Step function estimator
The goal of the following simulation study is merely to obtain a better understanding of the finite sample behavior of the step function estimator for different repeated measures and noise scenarios. We consider the Lotka-Volterra system, a population dynamics model that describes the evolution over time of the populations of two species, predators and their prey. In mathematical terms the Lotka-Volterra model is described by a system consisting of two equations and depending on the parameter θ = (θ_1, θ_2, θ_3, θ_4)^T. The system takes the form

    x_1'(t) = θ_1 x_1(t) − θ_2 x_1(t) x_2(t),
    x_2'(t) = −θ_3 x_2(t) + θ_4 x_1(t) x_2(t).

Here x_1 represents the size of the prey population and x_2 that of the predator population.

Table 4: Gaussian errors. Empirical means (standard deviations) in the Lotka-Volterra system based on 5000 Monte Carlo simulations. In each simulation the data consist of I = 30 noisy observations of x_1 and x_2 according to measurement error model (33) with σ_ε^2 = 0.5. The samples were taken at 0.5 time units on the interval [0, 14.9] for the first parameter setup and at 1 time unit on the interval [0, 29.9] for the second. At each time point, J repeated measures were generated. The parameters are estimated by (19)-(20) with (34).
In the experiment we set the errors to be i.i.d. Gaussian or Laplace with zero mean and σ_ε^2 = 0.5 for both system states. In Tables 4-5 we present the empirical mean and standard deviation (in parentheses) of the estimators for two different sets of parameters and initial values of the Lotka-Volterra system. Results are based on 5000 Monte Carlo simulations. Both the rate constants θ and the initial values ξ are estimated. In each simulation the data consist of I = 30 noisy observations of x_1 and x_2 according to measurement error model (33). The samples were taken at 0.5 time units on the interval [0, T], T = 14.9, for the first parameter setup and at 1 time unit on the interval [0, T], T = 29.9, for the second. At each time point, J repeated measures were generated. The last two lines in each block correspond to the empirical mean and standard deviation of {∫_0^T ‖x(t; θ_n, ξ_n) − x(t; θ, ξ)‖^2 dt}^{1/2} and ‖x(•; θ_n, ξ_n) − x(•; θ, ξ)‖_∞, respectively. The simulation results suggest that the finite sample behavior of the estimator is similar under both error distributions. Also, as expected, the estimation accuracy grows with the number of repeated measures and is reasonable already when their number is relatively small.

Table 5: Laplace errors. Empirical means (standard deviations) in the Lotka-Volterra system based on 5000 Monte Carlo simulations. In each simulation the data consist of I = 30 noisy observations of x_1 and x_2 according to measurement error model (33) with σ_ε^2 = 0.5. The samples were taken at 0.5 time units on the interval [0, 14.9] for the first parameter setup and at 1 time unit on the interval [0, 29.9] for the second. At each time point, J repeated measures were generated. The parameters are estimated by (19)-(20) with (34).
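The repeated-measures design behind Tables 4-5 can be sketched in a few lines of Python (the paper's code is in Matlab; the parameter values below are illustrative, not those of the tables). The step function estimator is simply the pointwise average of the J repeats at each design point, held piecewise constant between design points.

```python
import numpy as np

# Lotka-Volterra right-hand side, linear in theta:
#   x1' = th1*x1 - th2*x1*x2,   x2' = -th3*x2 + th4*x1*x2
def lv(x, th):
    x1, x2 = x
    return np.array([th[0] * x1 - th[1] * x1 * x2,
                     -th[2] * x2 + th[3] * x1 * x2])

def solve_euler(f, x0, t_grid, th, substeps=100):
    """Crude fixed-step Euler integrator with inner refinement."""
    x = np.empty((t_grid.size, x0.size))
    x[0] = x0
    for i in range(t_grid.size - 1):
        h = (t_grid[i + 1] - t_grid[i]) / substeps
        y = x[i].copy()
        for _ in range(substeps):
            y = y + h * f(y, th)
        x[i + 1] = y
    return x

theta = np.array([0.5, 0.3, 0.6, 0.2])   # illustrative values only
t_obs = np.linspace(0.0, 14.5, 30)       # I = 30 design points, 0.5 apart
truth = solve_euler(lv, np.array([1.0, 2.0]), t_obs, theta)

I, J, d = 30, 10, 2                      # J repeated measures per design point
rng = np.random.default_rng(2)
obs = truth[:, None, :] + rng.normal(0.0, np.sqrt(0.5), size=(I, J, d))

# Step function estimator: average the J repeats at each design point.
# The variance at each point drops from 0.5 to 0.5/J.
x_step = obs.mean(axis=1)
print(x_step.shape)
```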

Discussion
Systems of ordinary differential equations are widely used by scientists for modeling real life phenomena. In this paper we studied systems for which separability of the states and parameters is possible, or more specifically, systems that are linear in functions of the parameters. Such systems are spread over diverse fields such as population dynamics, neurophysiology, HIV dynamics, blood coagulation, chemistry, gene regulatory networks, infectious diseases, calcium measurement analysis and pharmacokinetic models, to mention a few (see references above). We addressed both theoretical and practical aspects.
We characterized a necessary and sufficient condition for identifiability of the parameters. Specifically, we showed that uniqueness of the parameters is not equivalent to uniqueness of the ODE solution; this fact seems not to have been noticed in previous statistical literature. Exploiting the linearity feature of the model, we developed an integral based two-step estimation approach. The method first estimates the function that is modeled as a solution of the system and then estimates the parameters. It results in an estimator that requires no repeated numerical integration of the system. Moreover, it is consistent at a √n-rate, provided the estimator of the function that solves the system is sufficiently accurate.
We have studied two specific, sufficiently accurate estimators of the solution of the system, namely a local polynomial estimator (the smooth estimator) and an estimator based on averages (the step function estimator). We call our estimators "modified integral methods". Although the size of the system in terms of the dimensions d and p does not matter in the theoretical results, in practice it makes a difference, since computing time grows with these dimensions. However, this growth is modest since our modified integral methods do not employ search algorithms. We studied both estimation approaches via numerical simulations. We compared the smooth estimator to the derivative based two-step approach and to the generalized profiling method. The finite sample performance of the integral estimator is substantially better than that of the derivative based method. As expected, the variability of the generalized profiling approach is smaller; however, it requires a complex optimization step that can degrade the estimation results if started too far from the 'true' vector of parameters. Therefore it makes sense to use the integral estimator to generate a preliminary estimate, to be used as an initial guess for the optimization step of other, more complicated, but accurate, estimation approaches. The step function estimator was tested under several experimental scenarios; the numerical results support the theory and suggest that the estimation accuracy is robust with respect to the distribution of the errors. Furthermore, we see that in practice the number of repeated measures may be relatively small without the accuracy being corrupted. All simulations were executed in Matlab. The code for executing these simulations and for implementing the method on user data is provided as supplementary material to this paper.
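For a system linear in the parameters, the core of the integral approach reduces to ordinary least squares, which is why no search algorithm is needed. The following Python sketch illustrates this for the Lotka-Volterra system, with plain unweighted least squares on a grid standing in for the weighted criterion (19)-(20); the parameter values and grid are ours, chosen only for illustration.

```python
import numpy as np

# For the Lotka-Volterra system x1' = th1*x1 - th2*x1*x2,
# x2' = -th3*x2 + th4*x1*x2, separation F(x; theta) = g(x) theta
# holds with the 2 x 4 matrix:
def g(x):
    x1, x2 = x
    return np.array([[x1, -x1 * x2, 0.0, 0.0],
                     [0.0, 0.0, -x2, x1 * x2]])

def integral_estimator(t, xhat):
    """Unweighted least-squares version of the integral two-step idea:
    regress xhat(t_k) on [G(t_k) | I_d], where G(t) = int_0^t g(xhat) ds
    is computed by the trapezoidal rule. Returns (theta_hat, xi_hat)."""
    n, d = xhat.shape
    p = g(xhat[0]).shape[1]
    gvals = np.stack([g(x) for x in xhat])                  # (n, d, p)
    dt = np.diff(t)[:, None, None]
    G = np.concatenate([np.zeros((1, d, p)),
                        np.cumsum(0.5 * dt * (gvals[1:] + gvals[:-1]), axis=0)])
    design = np.concatenate([G, np.tile(np.eye(d), (n, 1, 1))], axis=2)
    A = design.reshape(n * d, p + d)
    b = xhat.reshape(n * d)
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef[:p], coef[p:]

# sanity check on noiseless data from a fine Euler solve
theta = np.array([0.5, 0.3, 0.6, 0.2])
t = np.linspace(0.0, 10.0, 2001)
x = np.empty((t.size, 2)); x[0] = (1.0, 2.0)
for i in range(t.size - 1):
    x[i + 1] = x[i] + (t[1] - t[0]) * g(x[i]) @ theta
theta_hat, xi_hat = integral_estimator(t, x)
print(np.round(theta_hat, 2))
```

Note that the estimator only integrates the smoothed trajectory, so no repeated numerical integration of the system itself is required.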

A.1. Proof of Proposition 1
in the notation of (5) and in view of (8) and (6). So the i-th component of G(t)η = ∫_0^t g(x(s)) ds η is constant, and thus equals 0 for W_ii-almost all t ∈ [0, 1], since 0 belongs to the support of W_ii. Because this holds for all i = 1, ..., d, it follows that C_W η = 0, which contradicts the nonsingularity of C_W.
Equalities (11) and (12) may now be verified by substituting the right hand side of (3) for x(t).
(ii) If C_W were singular, there would exist a nonzero p-vector η with η^T C_W η = 0.

Remark 1. Interestingly, the start of the proof of Proposition 1 may also be formulated via the concept of the Schur complement. Let M be the block matrix whose entries are defined in (10). If C_W is nonsingular, then the Schur complement of C_W with respect to M is well defined ([24]). Taking determinants of both sides of the corresponding factorization, it is immediately clear that if C_W is singular then M is singular. This implies that we can find a vector (x, y) ≠ 0 such that I_d x + B_W y = 0 and B_W^T x + C_W y = 0 (note that y ≠ 0, since otherwise x = 0). Solving the first equation for x and plugging it into the second, we obtain (41) with η = y.

A.2. Proof of Theorem 1

Denote the supnorm of x(•) by ‖x‖_∞ = sup_{t ∈ [0,1]} ‖x(t)‖ and write M = ‖x‖_∞. Since g(•) is continuous, it is bounded and uniformly continuous on the compact ball B_{M+1} = {y ∈ R^d : ‖y‖ ≤ M + 1}.
Fix ε > 0. There exists a δ > 0 such that for all x, y ∈ B_{M+1} with ‖x − y‖ < δ the inequality ‖g(x) − g(y)‖ < ε holds, where the norm ‖•‖ of a matrix equals the square root of the sum of squares of its components. Consequently, ‖x_n − x‖_∞ < δ implies ∫_0^1 ‖g(x_n(t)) − g(x(t))‖ dt < ε. Together with the consistency (22), and applying the weak convergence of (W_n) and dominated convergence, this yields the desired convergence. Since the consistency (22) of x_n(•) also implies P(‖x_n‖_∞ > M + 1) → 0, again by the weak convergence of (W_n) and dominated convergence we obtain the consistency of (19) and (20).
Let c > 0 and C < ∞ be such that for all x, y ∈ R^d the inequalities c ≤ d_n(x, y)/‖x − y‖ ≤ C hold. By the triangle inequality for d_n(•, •) and (21), we obtain the corresponding bounds, and hence convergence as n → ∞. Since h^{−1}(•) is continuous, this implies consistency of ν_n.

A.3. Proof of Theorem 2
First, we collect some properties of the semidefinite inner product (5) that we need.
Lemma 1. Let W be a symmetric d × d-matrix of finite signed measures on ([0, 1], B) such that (5) defines a (nonnegative) semidefinite inner product. Then the diagonal elements of W are nonnegative measures on ([0, 1], B), the Cauchy-Schwarz inequality |⟨x, y⟩_W| ≤ ‖x‖_W ‖y‖_W holds, and, in particular, for all i = 1, ..., d, j = 1, ..., d, and for all x_i(•) and x_j(•) such that ∫ x_i^2 dW_ii, ∫ x_j^2 dW_jj, and ∫ x_i x_j dW_ij are well-defined and finite,

    |∫ x_i x_j dW_ij| ≤ {∫ x_i^2 dW_ii}^{1/2} {∫ x_j^2 dW_jj}^{1/2}    (45)

holds.
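The Cauchy-Schwarz inequality of Lemma 1 is easy to check numerically for discrete matrix-valued measures. In the Python sketch below, W is a sum of point masses carrying a positive semidefinite matrix weight at each atom, which makes (5) a semidefinite inner product; the construction is ours and serves only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d, K = 3, 50                          # dimension and number of grid atoms

# Discrete signed measures W_ij = sum_k Wk[k][i,j] * delta_{t_k}, with each
# atom Wk[k] = A_k A_k^T positive semidefinite, so <.,.>_W is semidefinite.
A = rng.standard_normal((K, d, d))
Wk = A @ A.transpose(0, 2, 1)         # (K, d, d), each PSD

def inner(x, y):
    """<x, y>_W = sum_{i,j} int x_i y_j dW_ij for the discrete W above,
    i.e. sum_k x(t_k)^T Wk[k] y(t_k)."""
    return np.einsum('ki,kij,kj->', x, Wk, y)

x = rng.standard_normal((K, d))
y = rng.standard_normal((K, d))
lhs = abs(inner(x, y))
rhs = np.sqrt(inner(x, x)) * np.sqrt(inner(y, y))
print(bool(lhs <= rhs + 1e-9))        # Cauchy-Schwarz holds
```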
We continue with another lemma that will be used in the sequel.

Lemma 2. Under the conditions of Theorem 2, (46) holds.
Proof. The left hand side of (46) is a p × p-matrix. In view of the Cauchy-Schwarz inequality (45), it suffices to bound each of its entries separately. Denote by ∂g_hi(x)/∂x the d-dimensional row vector of first derivatives of the entry g_hi of the matrix g, and by ∂²g_hi(x)/∂x² the d × d-matrix of its second derivatives, for which the usual Taylor expansion holds. By (23), Lemma 1, and Lemma 2, the first term on the right hand side of (57) is of order O_p(c_n + d_n + v_n). The i-th component of the p-vector that constitutes the second term is a sum of d² terms of a common type. Since G(•) is continuous and hence bounded on [0, 1], we obtain by (23) and (27) that (58) holds, and hence the second term on the right hand side of (57) is of order O_p(c_n + v_n). The third term on the right hand side of (57) is of order O(w_n) in view of (25), where we note that both G(•) and x(•) are differentiable with bounded derivatives. We have obtained (59), and in a similar way we obtain (60). Writing θ_n − θ and ξ_n − ξ as telescoping sums in which sequentially random elements are replaced by the corresponding deterministic ones, and applying (54), (56), (59), and (60) repeatedly, we obtain consistency of order O_p(c_n + d_n + v_n + w_n) for θ_n and ξ_n. Subsequently, the consistency of ν_n to the same order is obtained via (44) and the Lipschitz continuity of h^{−1}(•).

A.4. Proof of Theorem 3
The following lemma assures us that the local polynomial estimator x_n satisfies the conditions required in Theorem 2.

Lemma 3. Let the model be defined by (1)-(3). Suppose that for any j = 1, ..., d the solution x_j(t; θ, ξ) is a C^α-function of t on the interval [0, 1] for some real α ≥ 1.
Let W_n be as in Theorem 3, let the observations be given by (15) with t_i = i/n, i = 1, ..., n, and let the estimator of x(•) be defined in (29). Assume that the errors ε_j(t_i), i = 1, ..., n, j = 1, ..., d, are i.i.d. with mean Eε_j(t_i) = 0 and finite variance Eε_j²(t_i) < ∞.

Proof. Our proof is based on [41, Chapter 1]. In particular, (62) and (63) follow from Proposition 1.13 there. Note that the bounds given in this Proposition 1.13 are uniform over [0, 1], and that (63) needs an application of Fubini's theorem and the boundedness of ∫_0^1 dW_n,hh(t), as guaranteed by (30) and the finiteness of the entries of W.
Lemma 1.5 and (1.70) of [41] show that Condition K(iii) implies that there exist a positive integer n_0 and a positive constant λ_0 such that for all n ≥ n_0, t ∈ [0, 1] and v ∈ R^{ℓ+1} the inequality ‖B_n(t)^{−1} v‖ ≤ ‖v‖ λ_0^{−1} holds, where ‖•‖ stands for the Euclidean norm in R^{ℓ+1}. This, together with U(0) = (1, 0, ..., 0)^T, leads for n ≥ n_0 to the desired bound, where the last equality holds in view of (31), since t → K((t_i − t)/b) is bounded and vanishes outside an interval of length at most 2b. We have proved (65).
To prove Theorem 3, we first note that (25) is satisfied with w_n = 1/√n in view of (30). Applying Theorem 2 together with Lemma 3 completes the proof. Note that the optimal convergence rate for (63) is n^{−2α/(2α+1)}, which is obtained by b = b_n = n^{−1/(2α+1)}. Compared to this, undersmoothing is needed to control the bias in (62).
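The smooth estimator underlying Theorem 3 is an ordinary local polynomial smoother. Below is a minimal Python sketch of the order ℓ = 1 (local linear) case with an Epanechnikov kernel and the undersmoothed bandwidth b = n^{−1/3} used in the simulations; the test function and seed are ours.

```python
import numpy as np

def local_linear(t_eval, t_obs, y, b):
    """Local linear smoother (order ell = 1 local polynomial) with an
    Epanechnikov kernel and bandwidth b; returns the fit at t_eval."""
    out = np.empty(t_eval.size)
    for m, t0 in enumerate(t_eval):
        u = (t_obs - t0) / b
        w = np.maximum(0.75 * (1.0 - u**2), 0.0)     # Epanechnikov weights
        X = np.column_stack([np.ones_like(t_obs), t_obs - t0])
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ y)     # weighted least squares
        out[m] = beta[0]                             # intercept = fit at t0
    return out

n = 400
t_obs = np.arange(1, n + 1) / n                      # design t_i = i/n
rng = np.random.default_rng(4)
y = np.sin(2 * np.pi * t_obs) + 0.1 * rng.standard_normal(n)
b = n ** (-1.0 / 3.0)                                # undersmoothed bandwidth
fit = local_linear(t_obs, t_obs, y, b)
```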

A.5. Proof of Theorem 4
To prove this theorem we apply Theorem 2 again. As in the preceding proof, we first note that (25) is satisfied with w_n = 1/√n in view of (30). Since g(•) is continuous and x(•) is bounded, g(x(•)) is bounded as well. Consequently, we have

which, in view of the criterion (17) of minimizing ‖x_n(t) − ζ − ∫_0^t g(x_n(s)) ds η‖²_{W_n} over η and ζ, where W_n is an appropriate d × d-matrix of signed measures on ([0, 1], B) as in Proposition 1, yields boundedness of G_n(•) on [0, 1] in probability. Using the boundedness and continuity of G(•), the boundedness in probability of G_n(•), and

we obtain the required convergence.

Furthermore, let there exist a constant C such that for any h = 1, ..., d, for any interval I_n of length b_n, and for all n, the corresponding bound holds. Assume in addition that either

1. α ≥ 3/2 and g(•) has continuous second derivatives, or
2. α ≥ 1 and g(•) has bounded second derivatives.

Moreover, if θ = h(ν) holds, the ratio d_n(x, y)/‖x − y‖ is bounded away from 0 and infinity for all x, y ∈ R^d with x ≠ y, and h^{−1}(•) is Lipschitz continuous, then ν_n as defined via (21) is √n-consistent as well.
η^T C_W η = 0, which would imply that the i-th component of G(t)η vanishes for W_ii-almost all t ∈ [0, 1]. Consequently, (3) yields for all α ∈ R and all i = 1, ..., d that the i-th component of the corresponding equation vanishes. The boundedness of ∫_0^1 dW_n,hh(t) is guaranteed by (30) and the finiteness of the entries of W. Similarly, by the Cauchy-Schwarz inequality (45) we obtain

    |∫_0^1 ε_j(t_i) V_n,i(t) dW_n,hj(t)| ≤ |ε_j(t_i)| {∫_0^1 |V_n,i(t)| dW_n,jj(t)}^{1/2} {∫_0^1 |V_n,i(t)| dW_n,hh(t)}^{1/2}.