Nonlinear mixed effects modeling and warping for functional data using B-splines

Abstract: In functional data analysis the interest lies not only in finding a global mean pattern, but also in capturing the individual curve differences in phase and amplitude. This can be done conveniently by building in random effects on two levels: in the warping functions, to account for individual phase variations; and in the linear structure, to deal with individual amplitude variations. Via an appropriate choice of the warping function and B-spline approximations, estimation in the nonlinear mixed effects functional model is feasible, and does not require any prior knowledge of landmarks for the functional data. Sufficient and necessary conditions for identifiability of the flexible model are provided. A theoretical study is conducted: we establish asymptotic normality and consistency of the estimators of the registration and amplitude models, convergence of the iterative process, and consistency of the final estimator provided by the iterative process. The finite-sample performance of the proposed estimation procedure is investigated in a simulation study, which includes comparisons with existing methods. The added value of the developed method is further illustrated via the analysis of a real data example.


Introduction
Functional data are encountered in many fields; a multitude of examples can be found in the books [10,24,25]. When analyzing functional data it is of particular interest to provide answers to the questions: (i) is there a common main (mean) functional pattern to be distinguished?; (ii) can we quantify the significant individual fluctuations with respect to such a mean pattern? While the common functional mean captures main features such as peaks and valleys, differences between individual curves are often exposed via differences in phase and in amplitude of the main features. In Figure 1 the Pinch Force data are depicted. These data were collected as part of an experiment to investigate the force (measured in Newton) exerted by thumb and forefinger when pinching a 6 cm wide force meter. See [26]. Data on 20 recordings of such force measurements, recorded every 2 milliseconds during a time period of 0.3 seconds, are presented in Figure 1. There seems to be a clear maximum for each curve, but the position and the size of this maximum differ considerably from curve to curve. See further Section 5.
Often there is no prior information available regarding the number of important features, and where, in which region, they occur. A flexible method should thus not rely on such information, and should be able to extract a main pattern from the data, as well as information on major individual variations. Aligning the individual curves via individual shift functions is conveniently done via time warping functions, see for example [2,7,8,14,19,33]. One approach towards describing the curve-specific deviations from the mean curve is via random effects. See for example [5,9,15]. An analysis of variance model for functional data, describing the phase variability through time-warping and allowing for inference in the presence of amplitude variability, was introduced by [13]. This approach was further extended to a functional regression setting in [11,12]. A functional mixed effects regression model was used to analyse spike train data in [16]. In [36] the emphasis was on constructing separate boxplot-type displays for the three main components of the observed variation in functional data: the amplitude, the phase, and the vertical translation. A shift-warping method is used in [3] for multivariate functional data, where each of the components may contribute to a shift with its own parameter value. In [35], warping methods are proposed for data from exponential families. The methodology consists of working with principal components analysis (PCA) and using an expectation-maximization (EM) algorithm for parameter estimation. In [17], PCA is used to analyse warping functions. A nonparametric registration method is proposed in [4], based on a local variation measure introduced to provide nonparametric conditions that lead to identifiability. The phase and amplitude are separated in [31] by using a representation of functional data that is based on the Fisher-Rao metric to compute an elastic shape analysis of the curves.
Based on this representation, [37] analyses the phase variation using a principal nested sphere approach. In [28], a constrained elastic shape analysis is used with a landmark representation. While there are Bayesian methods for registration too, see for example [6], these are not considered here. Other papers focus on curve registration and classification or clustering, see [21,27,30,38].
In this paper we use a mixed effects model in which random effects enter on two levels: (1) a warping function with random effects describes the individual phase variability in a flexible manner, and (2) a second random effect is used to model the individual amplitude variability. This follows the approach of [13], but with two major differences: (i) the definition of the warping function does not depend on 'landmarks' (locations of peaks and valleys); (ii) the estimation procedure. A first important advantage of our method is that there is no need to know or estimate landmarks, neither their number nor their positions, which can be time-consuming and/or difficult. Second, our estimation procedure is computationally less demanding than, for example, an EM algorithm as used in [13]. In contrast to [23], we use nonparametric estimation by means of B-splines and avoid a linearization of the mean around the warped values as in their estimation approach. We focus in this paper on homogeneous signals. We prove identifiability of the proposed data registration model under some mild conditions. In addition, the asymptotic properties of the proposed estimation procedure are investigated: convergence of the algorithm, asymptotic normality of the estimator at each step, and consistency of the final estimator. An important contribution of this theory is the study of the algorithm itself, seen as an iterative process, and not only of the estimator that it approximates. The added value of the method is illustrated on the Pinch Force data in Section 5, where our analysis not only provides a mean pattern, but also allows us to describe clearly where most individual differences occur with respect to either phase or amplitude.
The paper is organized as follows. In Section 2 the modeling framework is introduced together with the necessary notations, and the identifiability of the proposed model is obtained. Details about the estimation procedure are provided in Section 3. An estimator for the warping parameters is constructed, and its asymptotic normality is proven. Linear mixed model estimators are proposed for the functional parameters and their asymptotic normality is shown. In Section 3.3 we derive an iterative estimation procedure for which we show the convergence and the consistency of the resulting estimators. The finite-sample performance of the proposed estimation method is investigated in Section 4, which also includes a comparison with four existing methods. The methodology is used to analyse the Pinch Force data in Section 5. The paper concludes with some discussion in Section 6. This paper is accompanied by the R package warpMix. All proofs are given in the Appendices.

The model and its identifiability
Suppose one observes individual curves Y 1 (t), Y 2 (t), . . . , Y n (t) on the interval [0, 1] (without loss of generality), and a first aim is to find a main pattern μ(t) in these individual curves. First, we introduce the various elements of the modeling framework, and provide the identifiability of the model. All the proofs of the results stated in this section can be found in Appendix A.

A functional mixed model with warping function
We consider the following functional mixed model. For i = 1, . . . , n, and for t ∈ [0, 1], we define the process

Y_i(t) = (μ + U_i){w^{-1}(t; θ_i)} + ε_i(t),    (1)

with μ the unknown common mean and where U_i denotes the unknown random effect on the amplitude for observation i. The flexible warping function w : [0, 1] → [0, 1] is strictly increasing and depends on a random variable θ_i ∼ N_r(θ_0, Σ_θ) that describes the individual phase variability. Details about the warping function are provided in Section 2.3. In practice we rather use the discretization of model (1) at time points (t_{i,j}) for j = 1, . . . , T_i; i = 1, . . . , n, where T_i denotes the number of fixed (non-random) time points for individual i:

Y_i(t_{i,j}) = (μ + U_i){w^{-1}(t_{i,j}; θ_i)} + ε_{i,j}.    (2)

We assume that for all i the error vectors ε_i = (ε_{i,1}, . . . , ε_{i,T_i})′ are distributed as N(0, σ²_ε I_{T_i}), meaning in particular that the error terms are independent of t_{i,j} and of the warping effects θ_i.
An analogous model was used in [23], where the warping function appears only in the common mean and not in the individual effect, and in [13], where a group level is added. Starting from this general formulation, several ingredients still have to be specified before estimation of the model is possible. The specificities of our study and its novelty are described in the next two subsections, through the decomposition of the signals in a B-spline basis and the choice of the warping function.

B-spline basis decomposition
The warping function w, the unknown mean function μ and the individual random effect amplitude functions U_i are modeled in a flexible fashion via B-splines. In this paper, we make the following assumption.
Assumption A. We assume that the functions μ, (U i ) i=1,...,n and w belong to the space spanned by the considered spline basis.
Assumption A ensures unbiased estimators for the curves, and avoids having to deal theoretically with a modeling bias. When using a spline basis in practice, the curves are well approximated when utilizing a finite (possibly large) number of knots.
For the mean function μ, we define a sequence of K_μ interior knots

0 = κ^μ_0 < κ^μ_1 < . . . < κ^μ_{K_μ} < κ^μ_{K_μ+1} = 1.

In addition, we put p_μ + 1 boundary knots at 0 as well as at 1, and denote by κ_μ the set of all knots involved in the estimation of μ. The B-spline basis functions B^μ_l of degree p_μ are defined by the usual recursion, for l = −p_μ, . . . , K_μ. With the use of the additional (boundary) knots, this gives precisely m_μ = K_μ + p_μ + 1 basis functions. The function μ is decomposed in the B-spline basis, with coefficient vector α_μ = (α^μ_{−p_μ}, . . . , α^μ_{K_μ})′:

μ(t) = Σ_{l=−p_μ}^{K_μ} α^μ_l B^μ_l(t).    (3)

Note that if μ(·) does not belong to the space spanned by the basis functions, then the equality in (3) should be replaced by an approximation. The induced modeling bias can be controlled by taking a large number of knots. Similarly, for i = 1, . . . , n each individual random function U_i is decomposed in a basis of B-splines of degree p_{U_i}, with knot sequence κ_{U_i}, resulting in m_{U_i} = K_{U_i} + p_{U_i} + 1 basis functions and coefficient vector α_{U_i}, whose covariance matrix Σ_{U_i} is assumed to have a diagonal structure and needs to be estimated. Further we denote α_∼U = (α′_{U_1}, . . . , α′_{U_n})′, collecting all amplitude random effect coefficients, with block-diagonal covariance matrix Σ_∼U.
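To make this decomposition concrete, the following sketch (an illustration only: the degree, the interior knots and the target curve are arbitrary choices, not the paper's) builds the m_μ = K_μ + p_μ + 1 basis functions with repeated boundary knots using scipy, and projects a curve onto the spanned space by least squares.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(t, interior_knots, degree):
    """Design matrix of the m = K + p + 1 B-spline basis functions on [0, 1],
    with p + 1 boundary knots placed at 0 and at 1, as in the text."""
    knots = np.r_[np.zeros(degree + 1), interior_knots, np.ones(degree + 1)]
    m = len(interior_knots) + degree + 1
    return np.column_stack(
        [BSpline(knots, np.eye(m)[l], degree)(t) for l in range(m)]
    )

t = np.linspace(0, 1, 200)
B = bspline_design(t, interior_knots=[0.2, 0.5, 0.7], degree=3)  # m = 7 columns
mu = np.sin(2 * np.pi * t)                          # a curve to approximate
alpha_mu, *_ = np.linalg.lstsq(B, mu, rcond=None)   # coefficient vector
mu_tilde = B @ alpha_mu                             # B-spline approximation of mu
```

The columns of B form a partition of unity on [0, 1], and increasing the number of interior knots shrinks the approximation error, in line with the remark on modeling bias above.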

The warping function
A flexible way to model the warping function is as follows. For every t ∈ [0, 1], we define

w^{-1}(t; θ_i) = ∫_0^t exp{h^{-1}(s; θ_i)} ds / ∫_0^1 exp{h^{-1}(s; θ_i)} ds,    (4)

with h^{-1} as indicated below. Note that w^{-1} (and hence w) is by construction an increasing function. The advantage of using the exponential function is that it warrants the positivity of the integrand. A non-random version of this warping function was introduced in [25] and used in [16,32]. There are many other choices of warping functions that could be made (see for example [20]). In short, the warping function w^{-1} (or w) in (4) satisfies the necessary conditions: it is increasing, and maps [0, 1] onto [0, 1]. To ensure identifiability, the function h^{-1} is decomposed in a basis of centralized B-splines, i.e.

h^{-1}(s; θ_i) = Σ_{l=−p_h}^{K_h} θ_{i,l} B̃^h_l(s).    (5)
The vector of random effects θ_i = (θ_{i,−p_h}, . . . , θ_{i,K_h})′ describes the individual phase variability, for which we assume the linear mixed effects model

θ_i = θ_0 + E_i + ε_i,  with E_i ∼ N_r(0, Σ_E) and ε_i ∼ N_r(0, σ²_ε I_r) independent.    (6)

To ensure identifiability, we assume that σ²_ε is known. In [13] the parameter θ_0 is considered to be a Jupp transform of the landmarks of the mean function μ. In contrast, we avoid the use of landmarks, and θ_0 is a parameter to be estimated.
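A small numerical sketch may help fix ideas about warping functions of this form. The construction below is an illustration under stated assumptions (a grid-based integral, and centering emulated by subtracting each basis function's grid average), not the paper's implementation: it exponentiates a centralized B-spline expansion and normalizes its integral, which yields an increasing map from [0, 1] onto [0, 1].

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.integrate import cumulative_trapezoid

def w_inv(t_grid, theta, interior_knots=(0.2, 0.5, 0.7), degree=3):
    """Warping function in the spirit of (4)-(5), evaluated on a grid."""
    knots = np.r_[np.zeros(degree + 1), interior_knots, np.ones(degree + 1)]
    m = len(interior_knots) + degree + 1          # dimension r of theta
    B = np.column_stack(
        [BSpline(knots, np.eye(m)[l], degree)(t_grid) for l in range(m)]
    )
    B_centered = B - B.mean(axis=0)               # "centralized" B-splines (assumption)
    h = B_centered @ theta                        # h^{-1}(.; theta) on the grid
    num = cumulative_trapezoid(np.exp(h), t_grid, initial=0.0)
    return num / num[-1]                          # normalized: maps [0, 1] onto [0, 1]

t = np.linspace(0, 1, 501)
w = w_inv(t, theta=np.array([0.5, -0.3, 0.2, 0.0, 0.4, -0.2, 0.1]))
```

Since the integrand exp(h) is strictly positive, the resulting function is strictly increasing whatever the value of theta, which is exactly the property needed for a warping function.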
By construction, the warping function is injective, as proved in Lemma 1.

Lemma 1.
The warping function w defined via (4) and (5) is injective with respect to the second argument: w(·; θ) = w(·; θ̃) on [0, 1] implies θ = θ̃.

Further, we assume that the α_{U_i}'s and the θ_i's, the random effects describing the individual amplitude and phase variability, are independent of each other.

The model in matrix form
For further analysis it will be useful to write the model in matrix form, referred to as model (7) below; herein A ⊗ B denotes the Kronecker product of two matrices A and B.
In summary, the unknown parts of the model consist of the parameter vector (α_μ, σ²_ε, Σ_∼U, θ_0, Σ_θ).

Identifiability of the model
In this section, we provide sufficient and necessary conditions to ensure the identifiability of model (7). First, the joint model (7) is identifiable if and only if at least one (approximate) individual model (2) is identifiable. We thus focus on a fixed i, and on the set of parameters (α_μ, σ²_ε, Σ_{U_i}, θ_0, Σ_θ), which consists of the subparts (α_μ, σ²_ε, Σ_{U_i}) and (θ_0, Σ_θ), where the latter is linked to the warping part and the former to the remaining parts. We start by investigating identifiability of each part.

Identifiability of the warped process
Take any i ∈ {1, . . . , n}. For j = 1, . . . , T_i, let X_i be the warped process,

X_i(t_{i,j}) = Y_i{w(t_{i,j}; θ_i)} = μ(t_{i,j}) + U_i(t_{i,j}) + ε_{i,j}.    (8)

In the following theorem, we give sufficient and necessary conditions for model (8) to be identifiable when both variance parameters are unknown. Since, for given (θ_0, Σ_θ), and due to the use of B-spline approximations, the warped process leads to a linear mixed effects model, we can use general results on the identifiability of such models, as obtained by [34]. Theorem 1 is an adaptation of Corollary 4.2 in [34] to the current setting.

Theorem 1. Let i ∈ {1, . . . , n} be given. Model (8) is not identifiable if and only if the following three conditions are simultaneously fulfilled. Consequently, model (8) is identifiable if at least one of the three conditions is not satisfied.

Identifiability of the warping function
Here, we assume that the parameters of the mixed effects model (α_μ, σ²_ε, Σ_{U_i}) are known, and we want to prove the identifiability of the warping part involving the parameters (θ_0, Σ_θ). Sufficient and necessary conditions for identifiability of this part are established in Theorem 2.
Let θ_i and θ̃_i be two warping random effects, and let X_i and X̃_i be the corresponding warped functions; under the conditions of Theorem 2, equality of X_i and X̃_i implies equality of the warping parameters (θ_0, Σ_θ).

Identifiability of the global model
We proved that, when knowing the warping parameters, the functional linear mixed effects model is identifiable, and that, when knowing the functional linear mixed effects model, the warping parameters are identifiable. Then, by iterating between these two identifiable steps until convergence, we obtain an identifiable procedure that leads to the estimation of all the parameters of the model defined in (2). Note that the identifiability conditions are essentially conditions on the underlying B-spline basis structure.

Estimation procedure and asymptotic properties
Recall that the unknown parameters of model (2) are (α_μ, σ²_ε, Σ_∼U, θ_0, Σ_θ). Model (2) is a nonlinear functional mixed effects model due to the composition with the warping function, which is an essential ingredient to describe the individual phase variability. First, we analyse each part of the model, that is, the warping parameters and the linear mixed effects model, by providing an estimator and theoretical guarantees. Then, we propose an iterative estimation procedure: in a first step we fix the warping parameters (θ_0, Σ_θ) and estimate the functional parameters (α_μ, σ²_ε, Σ_∼U); next, starting from these estimated parameters, we estimate the warping parameters. Further, we obtain the convergence of the method and the consistency of the global estimator.
We have access to a sample (Y_i(t_{i,j}))_{j=1,...,T_i; i=1,...,n} of n curves, the i-th curve being evaluated at T_i points. Given are the knot sequences in the B-spline approximations, the degrees of the B-splines, and the dimension of the warping parameters r = K_h + p_h + 1.
Proofs of the results in this section are provided in Appendix B.

Estimators for the parameters of the warping function
Suppose we know the functional parameters (α_μ, Σ_∼U, σ²_ε) and the predictors α_{U_i} for all i = 1, . . . , n. The goal is to estimate (θ_0, Σ_θ). We construct pseudo-observations by minimizing the following empirical L² criterion:

θ̂_i^{T_i} = argmin_θ (1/T_i) Σ_{j=1}^{T_i} [Y_i(t_{i,j}) − (μ̂ + Û_i){w^{-1}(t_{i,j}; θ)}]².

Note that the criterion which is minimized tends, as T_i → ∞, to the L² distance between Y_i and (μ̂ + Û_i) ∘ w^{-1}(·; θ). However, as we want to treat the warping parameter as a random effect, we fit the mixed effects model defined in Eq. (6) on the pseudo-observations:

θ̂_i^{T_i} = θ_0 + E_i + ε̃_i.

The random variables E_i and ε̃_i are independent. As we assume that σ²_ε is known for identifiability reasons, we use the empirical mean of {θ̂_1^{T_1}, . . . , θ̂_n^{T_n}} to estimate θ_0, and the empirical covariance to estimate Σ_θ = Σ_E + σ²_ε I_r. The prediction of E_i is easy to obtain because σ²_ε is known. We consider the following estimators:

θ̂_0 = (1/n) Σ_{i=1}^n θ̂_i^{T_i},   Σ̂_θ = (1/(n−1)) Σ_{i=1}^n (θ̂_i^{T_i} − θ̂_0)(θ̂_i^{T_i} − θ̂_0)′.
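As a toy illustration of this pseudo-observation step, the sketch below replaces the B-spline warping family by a hypothetical one-parameter warp w(t; θ) = t + θ t(1 − t), and recovers θ for a single curve by minimizing an empirical L² criterion between the observed curve and the warped template; all ingredients (template, warp, noise level) are illustrative assumptions, not the paper's specification.

```python
import numpy as np
from scipy.optimize import minimize_scalar

t = np.linspace(0, 1, 300)
template = lambda s: np.sin(np.pi * s) ** 2     # plays the role of mu_hat + U_hat_i

def w(s, theta):
    """Toy one-parameter warp, increasing on [0, 1] for |theta| < 1."""
    return s + theta * s * (1 - s)

theta_true = 0.35
rng = np.random.default_rng(0)
Y = template(w(t, theta_true)) + rng.normal(0, 0.01, t.size)  # one observed curve

# Empirical L2 criterion between the observed curve and the warped template
crit = lambda th: np.sum((Y - template(w(t, th))) ** 2)
theta_hat = minimize_scalar(crit, bounds=(-0.9, 0.9), method="bounded").x
```

With a dense grid and small noise, the minimizer lands very close to the warping coefficient that generated the curve.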

Asymptotic normality of θ̂_i^{T_i}
First, we focus on the distribution of θ̂_i^{T_i} conditional on θ_i. To do so, we rely on the theory of nonlinear least squares estimators developed in [18], which uses the weighted tail product defined as follows.

Definition 1.
Let p be a nonnegative integer and (t_j)_{j=1,...,p} be fixed time points. Let x = (x_j) and y = (y_j) be two sequences of real numbers, and let

(x, y)^π_p = (1/p) Σ_{j=1}^p π(t_j) x_j y_j,

with π a given weight function. If (x, y)^π_p converges to a real number when p → +∞, its limit (x, y)^π is called the weighted tail product of x and y.
Let g and h be two sequence-valued functions on Θ. If (g(α), h(β))^π_p → (g(α), h(β))^π when p → +∞, uniformly for all α and β in Θ, we define [g, h]^π(α, β) = (g(α), h(β))^π. This function is called the weighted tail cross product of g and h.
Then, we define the r × r-matrix a i (θ i ) as follows.

Definition 2.
For l = 1, . . . , r, we denote by g_{i,l}(·; θ) the partial derivative with respect to θ_l of the aligned signal. We define a_{i,T_i}(θ_i) = ((g_{i,l}(·; θ_i), g_{i,l′}(·; θ_i))^π_{T_i})_{l,l′=1,...,r}, the matrix whose coefficients are the weighted tail products between two partial derivatives, and a_i(θ_i) its limit when T_i → +∞.

Let (θ̂_i^{T_i})_{T_i} denote the sequence of weighted least squares estimators of θ_i. We assume that the model is identifiable and satisfies Assumption B.
Under Assumption B, conditional on θ_i, the estimator θ̂_i^{T_i} is asymptotically Gaussian, and a_{i,T_i}(θ̂_i^{T_i}) is a strongly consistent estimator of a_i(θ_i). Let f_{θ̂_i^{T_i}}(·|θ_i) be the conditional density function, and denote by ϕ the Gaussian density function. Theorem 3 implies a Gaussian limit of this conditional density for all η. By the dominated convergence theorem, we obtain the asymptotic marginal distribution m_{θ̂_i^∞}(η) for all η; the last step follows from a computation with Gaussian densities, see Lemma 4 in Appendix C. We discuss two cases where the limiting distribution m_{θ̂_i^∞} can be computed explicitly:
• the case where the noise ε tends to disappear, which makes the theory easier, but also requires a strong assumption for the limit to hold;
• the case where the eigenvalues of the matrix a_i(θ_i) are bounded; under this weaker assumption the limiting distribution will be more complicated (see below).
We next discuss these two cases in more detail.
In the first case we assume that the noise tends to disappear, when the number of points in the time grid increases.
It is important to remark the following. If there is non-negligible noise, the method will warp the observed noisy curve onto some global mean, and the warping parameter will depend on this noise, whereas the true warping parameter does not, as it is based on the denoised data.
We now turn to the second case. Under weaker assumptions, the limiting distribution m_{θ̂_i^∞}(η) is more complicated to describe. Denote by E_α(θ_0, Σ_θ) the ellipsoid associated with the Gaussian distribution N_r(θ_0, Σ_θ) at level α. The following theorem establishes that in this second setting, the limiting distribution m_{θ̂_i^∞}(η) is close to a Gaussian distribution.
In summary, we get that, under the two settings, the distribution of θ̂_i^∞ is close to a Gaussian distribution.
Here W(Σ, p) denotes the Wishart distribution with scale matrix Σ and p degrees of freedom.
Note that this implies that T_i has to go to infinity faster than n goes to infinity, i.e. n = o(min_i T_i).

Functional parameters
Suppose we know the warping parameters (θ_i)_{i=1,...,n}. Then, we warp the observations (Y_i(t_{i,j}))_{j=1,...,T_i; i=1,...,n} into the aligned curves X_{i,j} = Y_i{w(t_{i,j}; θ_i)}, and we fit a functional linear mixed model on (X_i)_{i=1,...,n} as defined in Eq. (8) using maximum likelihood estimation, which leads to estimators (α̂_μ, Σ̂_U, σ̂²_ε) and predictors (α̂_{U_i})_{i=1,...,n}. Following the ideas described in [22, Chapter 3], we need the following assumption.

Assumption F. Existence and positive definiteness of I, the limit of minus the expected Hessian matrix of the log-likelihood function based on the model given in Eq. (8).
Then, we get the asymptotic normality of the estimator, based on the observations at the design points (t_{i,j})_{j=1,...,T_i; i=1,...,n}, assuming that the model is identifiable and satisfies Assumption F.

Global model and iterative estimation procedure
We propose to estimate the nonlinear model directly. Working with the L² distance, we want to fit the model whose coefficients minimize the global L² criterion, leading to the minimizer in (10). Using the steps described previously, we propose an iterative process that approximates this minimizer. Algorithm 1 presents the steps of the iterative procedure. Further details are provided regarding the initialization, the convergence criterion, the theoretical convergence, and the consistency of the resulting estimator.

Details about the initialization
First, we initialize the mean function μ. There exist several ways to define a central curve in functional data analysis. Here we use band depth for functional data as introduced in [29], and take the deepest observed curve as μ̂^(0) = μ_deep. We then deduce α̂_μ^(0), the projection of the function μ̂^(0) onto the considered B-spline basis B_μ. In the initialization step, we do not consider individual amplitude effects, i.e. (α̂_{U_i})^(0) = 0_{m_{U_i}} for all i = 1, . . . , n.
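For concreteness, the following sketch picks the deepest curve of a noisy sample using the modified band depth with bands formed by pairs of curves (j = 2), computed directly from its definition; this is one common variant of band depth, used here as an illustrative stand-in for the depth of [29].

```python
import numpy as np
from itertools import combinations

def modified_band_depth(curves):
    """curves: (n, T) array. For each curve, average over all pairs of curves
    the fraction of time points at which it lies inside the pair's band."""
    n, _ = curves.shape
    depth = np.zeros(n)
    for a, b in combinations(range(n), 2):
        lo = np.minimum(curves[a], curves[b])
        hi = np.maximum(curves[a], curves[b])
        depth += ((curves >= lo) & (curves <= hi)).mean(axis=1)
    return depth / (n * (n - 1) / 2)

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 50)
sample = np.sin(2 * np.pi * t) + rng.normal(0, 0.3, (15, t.size))
mu0 = sample[np.argmax(modified_band_depth(sample))]   # initialization mu^(0)
```

The deepest curve is a robust central candidate, which is why it makes a sensible starting template before any amplitude effects are introduced.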

Convergence of the algorithm
We define C_n as the squared norm of the difference between the estimated parameters in two consecutive iteration steps. Each iteration of Algorithm 1 consists of the following steps:
1. compute the pseudo-observations θ̂_i^{T_i}, i = 1, . . . , n, by minimizing the empirical L² criterion;
2. fit a linear mixed model on the pseudo-observations, yielding θ̂_0 and Σ̂_θ;
3. warp the observed curves according to w^{-1}(·; θ̂_i^{T_i});
4. fit a functional linear mixed model on these warped observations, yielding α̂_μ, Σ̂_U and σ̂²_ε.
The iterations are stopped when C_n < 10^{-4} during five successive iteration steps.
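The alternating structure of the iterative procedure can be sketched as follows. The ingredients are deliberately simplified and hypothetical: a one-parameter toy warp replaces the B-spline warping family, and the template update is a plain cross-sectional mean of the aligned curves instead of the linear mixed model fits; only the alternation between the warping step and the amplitude step, and the stopping rule, mirror the algorithm.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 200)
w = lambda s, th: s + th * s * (1 - s)            # toy warp (increasing, |th| < 1)
w_inv = lambda s, th: np.interp(s, w(t, th), t)   # numerical inverse on the grid
mu = lambda s: np.exp(-60 * (s - 0.5) ** 2)       # true common mean

n = 15
theta_true = rng.uniform(-0.3, 0.3, n)
Y = np.array([mu(w(t, th)) for th in theta_true]) + rng.normal(0, 0.01, (n, t.size))

mu_hat = Y[0].copy()                              # crude initialization of the template
theta_hat = np.zeros(n)
small_steps = 0
for _ in range(50):
    prev = np.r_[mu_hat, theta_hat]
    for i in range(n):                            # step 1: pseudo-observations
        crit = lambda th: np.sum((Y[i] - np.interp(w(t, th), t, mu_hat)) ** 2)
        theta_hat[i] = minimize_scalar(crit, bounds=(-0.6, 0.6), method="bounded").x
    aligned = np.array([np.interp(w_inv(t, th), t, Y[i])   # step 2: align the curves
                        for i, th in enumerate(theta_hat)])
    mu_hat = aligned.mean(axis=0)                 # step 3: update the template
    C_n = np.mean((np.r_[mu_hat, theta_hat] - prev) ** 2)  # change between iterations
    small_steps = small_steps + 1 if C_n < 1e-4 else 0
    if small_steps >= 5:                          # stopping rule as in Algorithm 1
        break
```

After the loop, the spread of the aligned curves around their mean should be markedly smaller than that of the raw curves, which is the point of the registration step.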
Note first of all that the various iterations in Algorithm 1 involve three operations Ψ_1, Ψ_2 and Ψ_3, and that the update function to go from one iteration to the next is the composition Ψ = Ψ_3 ∘ Ψ_2 ∘ Ψ_1. Herein Ψ and its components are defined as follows.
In Theorem 8 we prove that the algorithm is converging. A condition under which this holds is that Ψ 1 is a contraction mapping, as stated in the following assumption.
Assumption G. There exists k_{Ψ_1} < 1 such that, for all (x, y) ∈ (R^{m_μ + n m_U + 1 + n r})²,

‖Ψ_1(x) − Ψ_1(y)‖ ≤ k_{Ψ_1} ‖x − y‖.

Under Assumption G we show the convergence of the algorithm, seen as iterations of Ψ. We denote by (α̂_μ^{(∞)}, (Σ̂_U)^{(∞)}, (σ̂²_ε)^{(∞)}, θ̂_0^{(∞)}, (Σ̂_θ)^{(∞)}) the estimator obtained at the end of the algorithm.
Theorem 8. Suppose that the model is identifiable and satisfies Assumption G. Then the fixed point of the update function Ψ exists and is unique, and the algorithm converges to this solution at a geometric rate with respect to the Euclidean distance.
This theorem gives the pointwise convergence of the algorithm; the randomness is not taken into account, as we focus on the iteration of the steps. Theorem 8 relies on Assumption G, which is rather technical. To get some insight into this assumption, we investigate it in a specific setting in Example 1.

Example 1.
We focus on μ, do not consider U_i, and restrict the family of warping functions to translations: w^{-1}(t; θ_i) = θ_i + t. The global mean is supposed to be a linear function μ(t) = α + βt. Let t_{i,1} = 0 and t_{i,T_i} = 1. Finally, we set θ_0 = 0.
Fix i. Recalling (10), in this case we are looking for

θ̂_i = argmin_θ Σ_{j=1}^{T_i} {Y_i(t_{i,j}) − α − β(t_{i,j} + θ)}².

This is a polynomial function of degree 2 in θ with nonnegative coefficient of the quadratic term, so there exists a unique minimizer:

θ̂_i = (1/(T_i β)) Σ_{j=1}^{T_i} {Y_i(t_{i,j}) − α − β t_{i,j}}.

We know that the Lipschitz constant is bounded by the norm of the differential.
Here, the function we consider is (α, β) ↦ θ̂_i, so we compute its differential, evaluated in (α, β):

∂θ̂_i/∂α = −1/β,   ∂θ̂_i/∂β = −(1/(T_i β²)) Σ_{j=1}^{T_i} {Y_i(t_{i,j}) − α}.

These expressions reveal that for β small the problem is more delicate (as one could expect), since both partial derivatives blow up as β → 0. Note that in this special case Assumption G in fact leads to an assumption on β.
Example 1 also shows that, in some particular settings, Assumption G might be translated into a condition on μ and w.

Consistency of the estimators
To conclude, we provide the statistical consistency of the full procedure. This has the following meaning: when the sample size and the number of time points go to infinity, the parameters estimated by the iterative process converge almost surely to the true parameters. Finally, consistency is deduced for the common mean, seen as a functional parameter.

Suppose that the model is identifiable, and Assumptions A, B, C, F and G hold. Then the parameters estimated by the iterative process converge almost surely to the true parameters.
As a consequence, we get that, from a functional viewpoint, for μ ∈ span(B_μ), the estimated common mean function is consistent as well.

Simulation study
We investigate the finite-sample performance of the proposed estimation method, and we compare it with four state-of-the-art methods, described below. An R package, called warpMix, has been developed for the proposed method and is available at https://cran.r-project.org/web/packages/warpMix/index.html.

Warping functions
The warping process is the same in most of the settings (with the exception of Model M2), and is defined via (4) and (5), with three interior knots {0.2, 0.5, 0.7}. Figure 2 depicts a sample of size 100 of the warping function, and the empirical covariance matrix of (w^{-1}(·; θ_i))_{i=1,...,n} computed on this sample. This highlights the differences between the correlation in (θ_i)_{i=1,...,n} and that in (w^{-1}(·; θ_i))_{i=1,...,n}. Note that due to the random warping structure, there is variability induced over the whole time period.

Elements of the functional model
The elements determining the functional model are the function μ and the individual random effects U_i. For the mean function μ we consider four different functions on [0, 1]; these functions are plotted in Figure 3. The modeling framework in Section 2 assumes that the functions μ and U_i are well approximated using a B-spline basis. In practice this is not always the case, for example when too limited a number of knots is considered in the B-spline bases. In the simulation study we present results on the B-spline approximations of the μ-functions, denoted by μ̃_1, μ̃_2, μ̃_3 and μ̃_4 respectively. To illustrate the impact of modeling bias, we provide for the fourth function simulation results for its B-spline approximation μ̃_4 as well as for the function μ_4 itself. For the individual random effects in model (2), which are harder to fit, we use Σ_U = 0.05 I_10. The variance of ε in the functional linear model equals σ²_ε = 0.02. In the numerical study, we simulate 100 samples from each setting, and report the evaluation criteria based on these 100 simulated samples.

Variability
To generate the data, we first construct a sample of the processes (X_1, . . . , X_n) defined in (8), and then un-warp them via (9) and the warping function described in Section 4.1.1. To understand the variability induced by each modeling aspect, we plot in Figure 4 a warped sample and the un-warped sample for model M_1.
The signal-to-noise ratio (SNR) expresses the ratio of the variability caused by the signal to the variability caused by the noise. To compute the numerator we use the conditional variance formula: for a random variable V seen as a function of two random variables U and θ,

Var(V) = E_θ[Var_U(V | θ)] + Var_θ(E_U[V | θ]).

For each given time point t_{i,j} we compute this SNR function 50 times, to get 50 values of the SNR at each time point. One computation proceeds as follows. For a fixed θ, we compute the empirical conditional variance Var_U(Z|θ) and the conditional expectation E_U(Z|θ) over a sample of size 100. By varying θ 60 times, we compute the global variance. This whole process is then repeated 50 times. In Figure 5 the resulting approximations of the SNR functions for models M_1 and M_2 are plotted. For higher values of the SNR we expect the estimation problem to be somewhat easier. Some caution regarding this interpretation is needed though. In our functional mixed effects model there are several sources of variability in the signal part (the individual effect related to U_i and the warping effect due to θ_i). The SNR criterion does not distinguish between these sources of variability, and just weighs the global signal variability against the error variability. Note from the SNR plots in Figure 5 that the estimation task can be harder in some time regions. At the endpoints of the interval [0, 1] the SNR values for the different models are equal, since the warping does not affect these parts, and the only effect comes from the covariance matrix Σ_U, the noise variance σ²_ε, and their relative contribution.
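The nested Monte Carlo computation just described can be sketched as follows; the model ingredients (a toy warp and a toy amplitude effect) and the loop sizes are illustrative assumptions, but the use of the conditional variance formula Var(Z) = E[Var(Z | θ)] + Var(E[Z | θ]) for the numerator mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 100)
mu = np.exp(-60 * (t - 0.5) ** 2)                 # toy common mean
sigma2_eps = 0.02                                 # noise variance of the model

def signal_draws(theta, n_inner=100):
    """n_inner draws of the signal Z(t) = (mu + U)(w(t; theta)) for fixed theta."""
    wt = t + theta * t * (1 - t)                  # toy warp
    base = np.interp(wt, t, mu)
    U = rng.normal(0, 0.2, (n_inner, 1)) * np.sin(np.pi * wt)  # toy amplitude effect
    return base + U

cond_var, cond_mean = [], []
for theta in rng.uniform(-0.3, 0.3, 60):          # outer loop: vary theta 60 times
    Z = signal_draws(theta)
    cond_var.append(Z.var(axis=0))                # Var_U(Z | theta)
    cond_mean.append(Z.mean(axis=0))              # E_U(Z | theta)

# Conditional variance formula: Var(Z) = E[Var(Z|theta)] + Var(E[Z|theta])
signal_var = np.mean(cond_var, axis=0) + np.var(cond_mean, axis=0)
snr = signal_var / sigma2_eps                     # pointwise signal-to-noise ratio
```

In this toy setup the SNR vanishes at the endpoints of [0, 1], because the warp and the amplitude effect are inactive there, which echoes the discussion of Figure 5.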

Comparison with existing methods and performance criteria
To illustrate the numerical performance of the proposed method, we compare with four methods available in the literature.
Since our nonlinear functional mixed effects model is closely related to that of [13], with major differences as indicated in Section 1, we include a comparison with this method. Some procedure parameters have to be chosen in the method of [13]: we took p = q = 1 for the numbers of components in the Karhunen-Loève decompositions; λ = 1 for the regularization parameter; and τ_0 = {0.3, 0.6} as the set of average landmarks. Their estimation method involves a Monte Carlo approximation part, for which we considered 100 iterations, and an EM algorithm part, in which we also considered at most 100 iterations. Convergence was said to be reached when the difference in norm between estimated parameters in two consecutive iteration steps was less than 10^{-2}. We also note that in our simulation study we use a rewritten Matlab version of the original Fortran code used in [13], since the latter was no longer running properly. The use of the Matlab code can make computations a bit slower.
The elastic square-root slope framework is promising, so we include a comparison with the method developed in [31]. We use the default settings: no elasticity, the Karcher mean, no smoothing of the data, and at most 20 iterations. We use the code available in the R package fdasrvf.
Bayesian methods are also of interest, and we choose to compare with [6], also available in the R package fdasrvf. Here too we considered the default settings: 150000 iterations and a uniform prior distribution.
Finally, we compare the performance with that of the algorithm of [27], available in the R package fdakma, which allows for clustering misaligned data. We assume that there is a single cluster, consider affine alignment, and compute the similarity through the cosine of the angle between two functions.
Since the inference available for those methods does not fully match our modeling framework, we can only report on the comparison related to estimating μ.
To evaluate the estimation performance for the various components of the target (α_μ, σ²_ε, Σ_∼U, θ_0, Σ_θ) we need some criteria. Note that the modeling framework involves two unknown functions, namely the overall mean function μ and the warping function w, unknown matrices Σ_U and Σ_θ, as well as the unknown variance σ²_ε. For each sample we obtain estimates μ̂, ŵ, Σ̂_U, Σ̂_θ and σ̂²_ε. Since in our simulation setting we have the same observational time points for each individual curve, i.e. t_{i,j} = t_j, j = 1, . . . , T_i, with T_i = T, we use the criteria in Table 1 to evaluate the estimation performance in each sample. Herein Tr(A) denotes the trace of a matrix A. For each simulated sample we calculate the estimates, and the corresponding evaluation criteria of Table 1. To report on the bias of an estimator, we compute the empirical mean of a criterion over the 100 simulations. To report on the variance of an estimator, we proceed as follows. For example, when estimating the function μ, we calculate in each point t_j the empirical mean over all 100 estimated values of μ(t_j), denoting this by μ̄(t_j). For each simulated sample we then calculate

Δ̂_μ = (1/T) Σ_{j=1}^T {μ̂(t_j) − μ̄(t_j)}².

The empirical variance of the estimator of μ is then computed by taking the average over the 100 obtained Δ̂_μ values. In a similar way we obtain Δ̂_w, Δ̂_U, Δ̂_θ and Δ̂_ε.
A final remark is that for Δ_θ and Δ̂_θ we use medians rather than means across all simulations as a measure of central position, since estimation of some components of Σ_θ occasionally resulted in large outlying values. Even in those cases, however, the quality of the estimated warping function remained very good, as will be seen from the reported results.

Simulation results for the proposed method
Models M_1 and M_1^HD. Figure 6 depicts the simulation results for estimating μ_1 in M_1 and M_1^HD. In the left panels we depict, for each time point t_j, the boxplots of the estimated values μ̂_1(t_j), whereas in the right panels we use a functional boxplot, as developed in [29]. The true curve μ_1 is presented in all plots as the solid (red) curve. The black solid curve in the centre of the functional boxplots indicates the deepest function among all estimated mean functions.
The quality of the estimation of μ_1 by the proposed method is quite good. Passing from the low-dimensional to the high-dimensional setting (from the top-row to the bottom-row plots), we see that the results improve for larger values of n and T. Note that the largest variability occurs in the region where most variability was also noticed in the SNR plot for model M_1 in Figure 5. Table 2 further summarizes the simulation results for models M_1 and M_1^HD. The results on the estimation of μ are in correspondence with what was observed in Figure 6. Note that in the estimation of Σ_θ there are quite a few extreme estimation results. However, the resulting estimation of the warping function w is still good, as can be seen from the last row of Table 2.
In Figure 7 we present boxplots of the estimation results for the components of Σ_θ, with the true component values indicated as red horizontal lines. Outliers have been excluded from the boxplots for clarity of presentation. To complement these boxplots, we summarize in Table 3 the maximum (across simulations) of the estimated values for each of the seven components of Σ_θ. Note that the most extreme values occur for the first coefficient. For larger n and T there are fewer extreme estimates.
Next, we focus on estimating the individual curve amplitude variability, which is captured by the estimation of the ten diagonal components of Σ_U. In Figure 8 we provide boxplots of the estimation results. The horizontal red line indicates the true value 0.10 for all diagonal components. As can be seen, the estimation results tend to be better for larger values of n and T, as expected.

Models M_2. To study the finite-sample performance of the proposed estimation method when there is a misspecification with respect to the warping function, we consider the model with μ_2 and simulate data under two different warping schemes:
• scheme w_CDG: the warping scheme of Section 2.3;
• scheme w_GC: the warping scheme of [13].
In the scheme w_GC, the θ_i are generated via a linear mixed effects model and then mapped into a set of landmarks using a Jupp transform. This is followed by cubic spline interpolation to obtain the corresponding parameters. Simulations were carried out from the two resulting models. Table 4 summarizes the simulation results for all elements in the functional mixed effects model. The overall conclusions remain as above. Note that the proposed method continues to perform very well, even under the misspecified warping scheme w_GC.

Models M_3 and M_4. Table 5 presents the simulation results for the low- and high-dimensional settings for model M_3. Also in these settings the method performs well. For the fourth model we include simulation results (in the low-dimensional sample setting) when simulating from the unprojected function μ_4, for which the B-spline approximation induces a modeling bias. As can be seen from columns 2–5 in Table 6, there is only a little loss in performance when modeling bias is present.
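A monotone warping function of the kind used in such schemes can be sketched with B-splines: nondecreasing spline coefficients yield a nondecreasing spline, and clamped knots pin down w(0) = 0 and w(1) = 1. The spline order, knots, and the map from the individual effect to the coefficients below are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np
from scipy.interpolate import BSpline

k = 3                                     # cubic B-splines (assumed order)
interior = np.array([0.25, 0.5, 0.75])    # assumed interior knots
knots = np.r_[[0.0] * (k + 1), interior, [1.0] * (k + 1)]
n_coef = len(knots) - k - 1

rng = np.random.default_rng(1)
theta = rng.standard_normal(n_coef - 1)   # toy individual warping effect
coef = np.r_[0.0, np.cumsum(np.exp(theta))]
coef /= coef[-1]                          # increasing coefs: w(0)=0, w(1)=1

w = BSpline(knots, coef, k)               # the warping function w(.; theta)
grid = np.linspace(0.0, 1.0, 201)
wt = w(grid)                              # monotone time transformation
```

Positivity of the increments exp(θ_l) is what guarantees monotonicity; any other positive link would do equally well in this sketch.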

Comparison with available methods.
We compare the four methods introduced in Section 4. The simulation results are summarized in Table 7. Note that the proposed method often has a low, or the lowest, bias, but at the price of a larger estimation variance. On model M_1, the method fdakma performs best, with a very low variance, but it has a performance (in terms of bias) comparable to that of the proposed method. For model M_2, our method has particularly good results in mean, but with a larger variance. As GC's method is very slow (i.e., has a high computational cost) and does not provide very good results even though its modeling is close to the proposed one, we exclude it from further comparisons. For models M_3 and M_4, fdakma performs best among the competing methods, followed by the Bayesian warping method (method (3)), but both are outperformed by the proposed method in terms of bias. Finally, we see that the elastic square-root slope method (method (4) in the table) does not perform well on these simulated datasets.

Real data analysis
We analyze the Pinch Force dataset, available in the R package fda. These data were described and analyzed in [26]. The data consist of measurements, taken every second millisecond, of the force (in Newton) exerted by thumb and forefinger. From the analysis with the proposed method, we obtain the estimated individual warping functions in the left panel of Figure 9, and the warped (aligned) curves for each individual (right panel). The estimated covariance matrix Σ̂_U (respectively Σ̂_θ) is presented in the left (respectively right) panel of Figure 10. From this, we observe that most time variability is induced by the coefficient of the second B-spline basis function in the decomposition of h⁻¹(·; θ), whereas most amplitude variability is caused by the coefficient associated with the third basis function in the decomposition of U; see also Figure 9 (right panel).
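The alignment step, once individual warping functions have been estimated, amounts to evaluating each curve at its inverse-warped time points. The sketch below does this by linear interpolation on synthetic single-peak curves that mimic the pinch-force shape; the shift-type warpings and all names are illustrative stand-ins, not the estimates from the actual dataset.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 151)          # common observation grid
n = 20                                  # number of individual curves
shifts = rng.uniform(-0.05, 0.05, n)    # toy individual phase shifts
curves = np.array([np.exp(-((t - 0.4 - s) ** 2) / 0.005) for s in shifts])

# Suppose the inverse warping w_i^{-1}(t) = clip(t + shift_i) has been
# estimated; evaluate each curve at its inverse-warped times to align peaks.
aligned = np.array([
    np.interp(np.clip(t + s, 0.0, 1.0), t, x)
    for s, x in zip(shifts, curves)
])

# Spread of the peak locations before and after alignment.
peak_spread_raw = np.ptp(t[np.argmax(curves, axis=1)])
peak_spread_aligned = np.ptp(t[np.argmax(aligned, axis=1)])
```

After alignment the peak locations collapse to essentially a single grid point, while the raw curves keep their individual phase spread.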

Conclusion and further discussion
In this paper we considered a nonlinear mixed effects model for functional data. We apply B-spline approximations on three different levels: on the inverse of the warping function, describing the individual phase variability; on the global mean function; and on the individual amplitude random effects. Random effects thus enter to model the individual amplitude as well as the phase variability. The main advantages of the proposed method are that it avoids the (costly) choice of landmarks, and that we can provide important theoretical support for the procedure: (i) convergence of the iterative algorithm to the target function(s); (ii) consistency and asymptotic normality of the estimators.
In this paper we considered the T_i discrete time points to be fixed (non-random). However, the methodology could be generalized fairly easily to random time points. Typically one would then need to assume that the distribution of the random time points is regular enough (meaning that there are no empty regions in the observed pattern of discretized time points). This would require, for example, an adaptation of the criteria used in Sections 3.2 and 3.3.
An analysis of the assumptions, in particular the modeling assumptions on the noise, with the aim of assessing the robustness of the method, would also be of interest. We postpone this analysis to future experimental work.

A.1. Proof of Lemma 1
Let θ_1 and θ_2 be such that t = w(w⁻¹(t; θ_1); θ_2). Since this equation holds for every t ∈ [0, 1], the integrand equals 0 for every u ∈ [0, 1], except possibly at a countable number of points. As B-splines are continuous, the integrand equals 0 for every u ∈ [0, 1]. Since the left-hand side does not depend on u, the right-hand side must equal 0 for all u. As (B^h_l)_l is a B-spline basis, this implies that θ_2 = θ_1.

A.3. Proof of Theorem 2
As X_i(t) is Gaussian distributed, the identifiability of the warping process is equivalent to the identifiability of the mean and the variance of X_i. We thus investigate the expectation and the variance of X_i(t).

B.5. Proof of Theorem 7
We check that our model satisfies the assumptions given in [22, Chapter 3]:
• the U_i are independent and follow a N(0, Σ_{U_i}) distribution, ε_i follows a N_{T_i}(0, σ²_ε I_{T_i}) distribution, and the U_i are independent of ε_i;
• the matrix B^μ_i is of full rank, as it is a functional basis;
• n ≥ m_μ + 1 + 1;
• the concatenated matrix [B^μ_i, B^U_i] has rank greater than m_μ, provided we do not take the same basis for μ and U_i;
• the matrices I_{T_i} and B^U_i (B^U_i)^t are linearly independent;
• lim_{n→+∞} (n − rank(B^U_i))/n = 1.
Hence we obtain the asymptotic normality of the estimator.
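The rank conditions in this list are easy to check numerically for a concrete choice of bases. Below is a sketch with an assumed grid and assumed knot sequences: the B-spline design matrix for μ has full column rank, and concatenating it with a design matrix built from a different basis for U_i raises the rank above the number of columns of the first matrix, as the condition on [B^μ_i, B^U_i] requires.

```python
import numpy as np
from scipy.interpolate import BSpline

T_i = 40
t = np.linspace(0.0, 1.0, T_i)           # assumed common design points

def design(interior, k):
    """Design matrix of a clamped B-spline basis evaluated at t."""
    knots = np.r_[[0.0] * (k + 1), np.asarray(interior), [1.0] * (k + 1)]
    n = len(knots) - k - 1
    # Evaluate each basis function (unit coefficient vector) on the grid.
    return np.column_stack([BSpline(knots, np.eye(n)[j], k)(t)
                            for j in range(n)])

B_mu = design([0.2, 0.4, 0.6, 0.8], 3)   # basis for mu (assumed knots)
B_U = design([1 / 3, 2 / 3], 2)          # a different basis for U_i

rank_mu = np.linalg.matrix_rank(B_mu)
rank_concat = np.linalg.matrix_rank(np.hstack([B_mu, B_U]))
```

With the same basis for μ and U_i the concatenated matrix would gain no rank, which is why the condition excludes that choice.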

B.6. Proof of Theorem 8
Recall the different operator parts of the iterative algorithm in (12). The Banach fixed point theorem, recalled in Theorem B of Appendix C, is used to prove that there is a unique fixed point and that the algorithm converges. To apply this theorem, we work in R^{m_μ + n m_U + 1 + nr} equipped with the Euclidean distance, which is a non-empty complete metric space. The mapping we consider is Ψ, as defined in (12). We need to prove that Ψ is a contraction mapping.
Denote by k_f the Lipschitz constant of a function f. We want to find k_Ψ < 1 such that, for all (x, y) ∈ (R^{m_μ + n m_U + 1 + nr})², ‖Ψ(x) − Ψ(y)‖ ≤ k_Ψ ‖x − y‖.
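The contraction argument can be illustrated on a toy map: for an affine Ψ(x) = Ax + b with spectral norm ‖A‖ < 1, the Lipschitz constant k_Ψ = ‖A‖ satisfies the contraction condition, and iterating Ψ converges to the unique fixed point (I − A)⁻¹ b. This is only a generic illustration of the Banach fixed-point mechanism, not the algorithm's actual operator Ψ from (12).

```python
import numpy as np

A = np.array([[0.3, 0.1],
              [0.0, 0.4]])
b = np.array([1.0, -2.0])
k_psi = np.linalg.norm(A, 2)   # Lipschitz constant of the affine map

def psi(x):
    """Toy contraction mapping Psi(x) = A x + b."""
    return A @ x + b

x = np.zeros(2)                # arbitrary starting point
for _ in range(100):           # fixed-point iteration
    x = psi(x)

x_star = np.linalg.solve(np.eye(2) - A, b)   # the unique fixed point
```

The iterates contract toward x_star at geometric rate k_Ψ, mirroring the convergence guarantee the theorem provides for the estimation algorithm.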