Learning the smoothness of noisy curves with application to online curve estimation

Combining information both within and across trajectories, we propose a simple estimator for the local regularity of the trajectories of a stochastic process. Independent trajectories are measured with errors at randomly sampled time points. Non-asymptotic bounds for the concentration of the estimator are derived. Given the estimate of the local regularity, we build a nearly optimal local polynomial smoother for the curves of a new, possibly very large, sample of noisy trajectories. We derive non-asymptotic pointwise risk bounds uniformly over the new set of curves. Our estimates perform well in simulations. Real data sets illustrate the effectiveness of the new approaches.


Introduction
More and more phenomena in modern society produce observation units in the form of sequences of measurements recorded intermittently at several discrete points in time. Very often the measurements are noisy, and the observation time points are neither regularly spaced nor the same across units. Functional data analysis (FDA) considers such data as values on the trajectories of a stochastic process, recorded with some error at discrete random times. One of the main purposes of FDA is to recover the trajectories, also called curves or functions, at any point in time. See, e.g., [29,21,38,41] for some recent references. Whatever the approach for recovering a curve, the existing literature usually assumes that, for each curve, a certain number of derivatives exist. However, many applications, some of them presented below, indicate that assuming that the curves admit second-, third-, ... order derivatives is not realistic. Assuming that the curves to be reconstructed are smoother than they really are could lead to missing important information carried by the data. In this contribution, we propose a definition of the local regularity of the curves which can easily be estimated from the data and used to estimate the curves.
To formalize the framework, let I ⊂ R be a compact interval of time. We consider N functions X^(1), . . . , X^(n), . . . , X^(N) generated as a random sample of a stochastic process X = (X_t : t ∈ I) with continuous trajectories. For each 1 ≤ n ≤ N, and given a positive integer M_n, let T^(n)_m, 1 ≤ m ≤ M_n, be the random observation times for the curve X^(n). These times are obtained as independent copies of a variable T taking values in I. The integers M_1, . . . , M_N represent an independent sample of an integer-valued random variable M with expectation µ. Thus M_1, . . . , M_N is the N-th line in a triangular array of integer numbers. We assume that the realizations of X, M and T are mutually independent. The observations associated with a curve, or trajectory, X^(n) consist of the pairs (Y^(n)_m, T^(n)_m) ∈ R × I, where Y^(n)_m is given by

Y^(n)_m = X^(n)(T^(n)_m) + ε^(n)_m, 1 ≤ m ≤ M_n, (1)

and the ε^(n)_m are independent copies of a centered error variable ε. For the sake of readability, here and in the following, we use the notation X_t for the value at t of the generic process X and X^(n)(t) for the value at t of the realization X^(n) of X. The N-sample of X is composed of two sub-populations: a learning set of N_0 curves and a set of N_1 curves to be recovered that we call the online set. Thus, 1 ≤ N_0, N_1 < N and N_0 + N_1 = N. Let X^(1), . . . , X^(N_0) denote the curves corresponding to the learning set.
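As a concrete illustration of the sampling scheme (1), the following sketch draws one observation unit. The distributions chosen here for M (an integer-rounded Gaussian truncated below at 9) and for T (uniform on [0, 1]) are illustrative assumptions compatible with the setting, not the paper's prescriptions.

```python
import math
import random

def sample_curve(path, mu, sigma, rng):
    """Draw one observation unit from model (1): a random number of points M,
    i.i.d. observation times T_m, and noisy values Y_m = path(T_m) + eps_m.
    The laws of M and T used here are illustrative assumptions only."""
    M = max(9, round(rng.gauss(mu, math.sqrt(mu))))  # M >= 9, mean roughly mu
    obs = []
    for _ in range(M):
        t = rng.random()                             # T ~ Uniform(0, 1)
        obs.append((t, path(t) + rng.gauss(0.0, sigma)))
    return sorted(obs)                               # pairs ordered in time
```

For instance, `sample_curve(lambda t: math.sin(t), 50, 0.1, random.Random(0))` returns roughly fifty time-ordered noisy pairs from a sine trajectory.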
Our first aim is to define a meaningful, model-free concept of local regularity for the process X and to build an estimator for it. The estimator can be computed easily and rapidly from the observations (Y^(n)_m, T^(n)_m) corresponding to the curves in the learning set, and does not require a very large number N_0 of curves. Moreover, it can easily be updated if more curves are added to the learning set. The problem of estimating the regularity of X is related to the estimation of the Hausdorff, or fractal, dimension of time series. See, for instance, [9,8,16] and the references therein. However, herein, we adopt the FDA point of view and use the so-called replication and regularization features of functional data (see [29], ch. 22). More precisely, we combine information both across and within curves. Thus, drawing strength from the information contained in the whole set of N_0 available time series, we are able to investigate more general situations: X need not be a Gaussian, or a transformed Gaussian, process; it is not necessarily stationary or with stationary increments; it could have a fractal dimension which changes over time; and it is observed with possibly heteroscedastic noise, at random moments in time.
The local regularity we study determines the regularity of the sample paths of X. Sample path regularity determines, for instance, the minimax optimal rate for the nonparametric estimators of the mean and covariance functions. In particular, knowing the local regularity serves to distinguish the so-called sparsely and densely sampled curve cases. See [6,41]. For some widely used examples, the local regularity is also related to the rate of decrease of the eigenvalues of the covariance operator, a property which is commonly used in the FDA literature. In almost all existing contributions, the sample path regularity and the rate of the eigenvalues are supposed given. We propose here a simple method to estimate them.
Based on the regularity estimates, our second objective is to build an adaptive, nearly optimal smoother for a possibly very large set of N_1 new curves. While several smoothers could be used, we focus on local polynomials. Optimal curve reconstruction is an important step in FDA, for instance for computing the median curve or the depth of a curve in order to detect outliers. See, e.g., [25]. It can also serve to compute optimal mean and covariance function estimators in the dense case. See, e.g., [6,41]. Let X^[1] = X^(N_0+1), . . . , X^[N_1] = X^(N) denote the curves from the online set, to be recovered from the corresponding observations (Y^[n_1]_m, T^[n_1]_m). This is a nonparametric estimation problem and, if each curve's regularity were given, nonparametric estimators of the curves X^[1], . . . , X^[N_1] could easily be built, for instance using the local linear smoother or a series estimator. Nevertheless, in applications there is no reason to suppose that the sample paths of the random process X have a known regularity. When it is not reasonable to assume a given regularity for the trajectories, one could use one of the existing data-driven procedures for determining the optimal smoothing parameter. However, the existing procedures, such as cross-validation or the Goldenshluger-Lepski method [17], were designed for the case where one observes only one curve. Thus one has to apply them to each curve separately, which could require large amounts of computational resources.
In Section 2, we define the local regularity and provide concentration bounds for the estimator of the local regularity of the trajectories of X. Our results are new and of non-asymptotic type, in the sense that they hold for any value of the sample size N_0 and of the mean number µ of observation times per curve, provided these values are sufficiently large. In Section 3, we explain the relationship between the probabilistic concept of local regularity for the trajectories of X and the analytic regularity of the curves, which usually determines the optimal risk rate in nonparametric estimation. We also provide insight into the relationship between the local regularity and the rate of decrease of the eigenvalues of the covariance operator. Given the estimate of the local regularity of the trajectories of X, in Section 3 we build adaptive local polynomial estimators and provide a non-asymptotic bound for the pointwise risk of the local polynomial smoother, uniformly over the online set. This uniform bound is obtained using an exponential-type moment bound for the pointwise risk of the local polynomial smoother, a new result of interest in itself. The pointwise risk bound is optimal, in the nonparametric regression estimation sense, up to some logarithmic factors induced by our stochastic curves model, the concentration of the local regularity estimator, and the uniformity over the online set. Assuming that X has a constant regularity over the interval I, we also derive a non-asymptotic bound for the risk of the local polynomial smoother uniformly over I, and uniformly over the online set. In Section 4, we provide some additional guidance for the implementation of the local polynomial smoother and report results from simulations showing that our estimator of the local regularity and the adaptive local polynomial estimator perform well whether or not the trajectories are differentiable.
As a further application of our local regularity estimation approach, we consider the median curve estimation problem for samples of noisy curves. Our median curve is obtained from the smoothed curves with the optimal bandwidth given by the local regularity estimate.
We compare the accuracy of our median curve with that obtained with the curves smoothed by cross-validation. While the accuracy is comparable, the computation time is far shorter when using our approach, and this makes it suitable for embedded systems or for applications with online data. A real data application on vehicle traffic flow analysis illustrates the effectiveness of our approaches. The proofs of our results are postponed to the Appendix. Additional technical aspects, simulation results, and details on traffic flow application are also relegated to the Appendix. To further illustrate the irregularity of the curves in applications, we also report in the Appendix the local regularity estimates for another three functional data sets often analyzed in the literature.

Local regularity estimation
The new local regularity estimator is introduced and studied in this section. After providing some insight into the ideas behind the construction, we provide a concentration result for our estimator under general mild assumptions which do not impose a specific distribution for X. In particular, X could, but need not, be a Gaussian process. The case where the variance of the noise is not constant is also discussed.

The methodology
Let us present the main ideas behind the construction of the regularity estimate. For this, let us introduce some more notation used throughout the paper. Let K_0 be an integer whose value will be specified below, and consider the order statistics of an M-sample T_1, . . . , T_M distributed as T, which admits the density f. Let t_0 ∈ I be such that f(t_0) > 0. We extract the subvector of the K_0 values closest to t_0 and denote these values T_(1) ≤ . . . ≤ T_(K_0). If t_0 = inf(I) then t_0 ≤ T_(1), while if t_0 = sup(I), then T_(K_0) ≤ t_0. When t_0 is an interior point of I, t_0 likely lies between T_(1) and T_(K_0). Next, we define the interval J_µ(t_0), a neighborhood of t_0 whose length shrinks as µ increases; here |I| denotes the length of the interval I and, recall, µ is the expectation of M. In the following, we introduce our conditions using the interval J_µ(t_0), which depends on µ. The theoretical results we derive are non-asymptotic; in particular, they hold for any fixed µ, provided it is sufficiently large. If one is interested in asymptotic results corresponding to the case where µ increases to infinity, then our J_µ(t_0) is eventually contained in any fixed neighborhood of t_0. In this case, one can state all the assumptions on X in a more standard way, using a fixed interval instead of our J_µ(t_0).
We assume that the process X generating the continuous curves X^(1), . . . , X^(N) satisfies

E[(X_u − X_v)²] ≈ L²_{t_0} |u − v|^{2H_{t_0}}, u, v ∈ J_µ(t_0), (2)

for some H_{t_0} ∈ (0, 1] and L_{t_0} > 0 which could both change with t_0. Here and in the following, ≈ means that the left-hand side is equal to the right-hand side times a quantity which tends to 1 as |v − u| → 0. When the trajectories of X are not differentiable, H_{t_0} is what we call the local regularity of the process X at t_0. For now, we focus on this case. When, with probability 1, the trajectories of X admit derivatives of order d ≥ 1 in a neighborhood of t_0, the property (2) will be applied to the derivative of order d of the smooth trajectories. In this smooth case, the local regularity of the process X at t_0 will be d + H_{t_0}. See the comment following Theorem 1. Some commonly used processes have eigenvalues of the covariance operator satisfying, for some ν > 1, λ_j ∼ j^{−ν}, j ≥ 1. Such processes have a constant local regularity and, moreover, d + H_{t_0} = (ν − 1)/2. As an example, the stationary fractional Ornstein-Uhlenbeck process with index ρ ∈ (0, 2) has the covariance function Γ(s, t) = exp(−a|s − t|^ρ), for some a > 0, which yields ν = 1 + ρ, H_{t_0} ≡ ρ/2 and d = 0. Among the nonstationary processes satisfying our condition (2), we can mention the fractional Brownian motion with Hurst exponent H ∈ (0, 1), for which H_{t_0} ≡ H and λ_j ∼ j^{−(1+2H)}. Examples with d > 0 can be obtained by d-times integration of processes with d = 0, such as, for instance, the so-called d-integrated Brownian motion. See, e.g., [24] for more details on these examples.
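For fractional Brownian motion, condition (2) holds exactly with L_{t_0} = 1, since E[(X_u − X_v)²] = |u − v|^{2H}. A quick numerical sketch (assuming only this fBm moment formula) shows how H can be read off a log-ratio of squared-increment moments at two gap sizes, which is the mechanism exploited in the sequel.

```python
import math

def fbm_sq_increment(u, v, H):
    # For fractional Brownian motion with Hurst exponent H,
    # E[(X_u - X_v)^2] = |u - v|^(2H), i.e. (2) holds exactly
    # with H_t0 = H and L_t0 = 1.
    return abs(u - v) ** (2.0 * H)

H = 0.3
t0, delta = 0.5, 0.01
theta_small = fbm_sq_increment(t0, t0 + delta, H)      # gap delta
theta_large = fbm_sq_increment(t0, t0 + 2 * delta, H)  # gap 2 * delta
# Doubling the gap multiplies the moment by 4^H, so H is a log-ratio:
H_recovered = math.log(theta_large / theta_small) / math.log(4.0)
print(round(H_recovered, 10))  # 0.3
```

The same log-ratio survives when a constant 2σ² (from additive noise) is first removed from or differenced out of the two moments.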
To construct our estimator of H_{t_0}, we consider an event B, on which the observation times behave suitably around t_0, and which is expected to be of high probability. Let 1_B denote the indicator of B and let E_B denote the corresponding expectation operator. Using (2) and the independence between X and T, for any 1 ≤ k < l ≤ K_0,

E_B[(X_{T_(l)} − X_{T_(k)})²] ≈ L²_{t_0} E_B[(T_(l) − T_(k))^{2H_{t_0}}].

From this and the moments of the spacing T_(l) − T_(k) given in Lemma 2, we obtain that the left-hand side is approximately proportional to (l − k)^{2H_{t_0}}. Now, for any 1 ≤ k ≤ K_0, let ε_(k) be a generic error term corresponding to the generic realization X_{T_(k)}, and denote Y_(k) = X_{T_(k)} + ε_(k). Let σ² denote the variance of the error term, assumed to be finite. For 2k − 1 ≤ K_0, setting θ_k = E_B[(Y_(2k−1) − Y_(k))²], we then obtain

θ_k ≈ L²_{t_0} E_B[(T_(2k−1) − T_(k))^{2H_{t_0}}] + 2σ², (3)

where the first term on the right-hand side is approximately proportional to (k − 1)^{2H_{t_0}}. We distinguish two situations: the case where σ² is known and the case where it is unknown. In the former case, we suppose that 4k − 3 is also not larger than K_0 and use the relationship (3) twice, with k and 2k − 1, respectively. We deduce

(θ_{2k−1} − 2σ²) / (θ_k − 2σ²) ≈ 4^{H_{t_0}}.

Taking the logarithm on both sides, we obtain the proxy value of the local regularity parameter H_{t_0} when σ² is given:

H_{t_0}(k, σ²) = [log(θ_{2k−1} − 2σ²) − log(θ_k − 2σ²)] / (2 log 2). (4)

In the case where σ² is unknown, assuming that 8k − 7 ≤ K_0, we use the relationship (3) three times, with k, 2k − 1 and 4k − 3, respectively, to obtain

(θ_{4k−3} − θ_{2k−1}) / (θ_{2k−1} − θ_k) ≈ 4^{H_{t_0}}.

A natural proxy of H_{t_0} is then given by

H_{t_0}(k) = [log(θ_{4k−3} − θ_{2k−1}) − log(θ_{2k−1} − θ_k)] / (2 log 2).

Our estimator of the local regularity parameter H_{t_0} is the empirical version of the proxy value H_{t_0}(k), or H_{t_0}(k, σ²), built from a random sample of N_0 trajectories of X, the learning set of curves. Formally, we consider the sequence of events B_n, 1 ≤ n ≤ N_0, defined in (5), each playing for the n-th curve the role played by B above, and we define

θ̂_k = N_0^{−1} Σ_{n=1}^{N_0} (Y^(n)_(2k−1) − Y^(n)_(k))² 1_{B_n}, (6)

where, for any n and k, Y^(n)_(k) denotes the observation associated with T^(n)_(k), the k-th closest observation time to t_0 among those of the n-th curve. When σ² is given, the estimator is

Ĥ_{t_0}(k, σ²) = [log(θ̂_{2k−1} − 2σ²) − log(θ̂_k − 2σ²)] / (2 log 2), (7)

whenever this expression is well defined, and Ĥ_{t_0}(k, σ²) = 1 otherwise. The default value 1 is arbitrary and could be replaced by any number between 0 and 1. When σ² is unknown, the corresponding estimator is

Ĥ_{t_0}(k) = [log(θ̂_{4k−3} − θ̂_{2k−1}) − log(θ̂_{2k−1} − θ̂_k)] / (2 log 2), (8)

with the same default value 1, where θ̂_{4k−3} is obtained from the formula of θ̂_{2k−1} after replacing k by 2k − 1.
It is worth noting that our estimator can easily be updated every time new curves are included in the learning sample, without revisiting the learning set already used. Indeed, one only has to add new terms to the sums defining θ̂_k, θ̂_{2k−1} and θ̂_{4k−3}.
The Associate Editor drew our attention to a large literature related to the estimation of the regularity of nonparametric functions. [15] consider data that are noisy measurements of one sample path of a scaled fractional Brownian motion (fBm) with unknown Hurst parameter and unknown scale. The measurements are sampled on an equidistant grid, and the noise is allowed to be heteroscedastic. The authors derive the optimal rate for estimating the Hurst parameter. Moreover, they provide an estimator achieving the optimal rate. The estimator of [15] is based on the self-similarity property of the fBm process. Our estimator relies on a related property imposed on the second-order moment of the increments; see (2) above. Our condition defines a significantly larger class of processes. By construction, our results are not easily comparable to those of Gloter and Hoffmann, which are more refined but derived in a different, more restrictive context. Our estimator is designed to take advantage of the replication feature of functional data, where sample averages provide simple estimates for the moments of the increments. Another related topic extensively studied in the literature is the construction of confidence or credible sets for a curve of unknown regularity. See, for instance, [14,3,4], or [31] for the Bayesian approach. Such rather elaborate methods, which often require the calibration of some tuning parameters, are not specifically designed for the functional data context. Finally, another related problem is testing the regularity of a signal. See, for instance, [7]. However, such methods, designed to be applied to one signal, do not provide a direct estimator of the regularity.

Concentration bounds for the local regularity estimator
Below, we focus on the more complicated and realistic case with unknown variance. The case with given variance can be treated after obvious adjustments. The results in this section depend on µ, the mean number of observation times T, and on the cardinality N_0 of the learning set of curves. However, they are non-asymptotic in the sense that they hold true for any sufficiently large µ and N_0 satisfying our conditions. For deriving our results, we impose the following mild assumptions.

(H2) The random variable T admits a density f. Moreover, there exist L_f > 0 and 0 < β_f ≤ 1 such that |f(u) − f(v)| ≤ L_f |u − v|^{β_f} for any u, v ∈ I.

(H3) There exist a function φ_{t_0}(·, ·) > 0 and constants L_{t_0}, L_φ > 0 and 0 < β_φ ≤ 1 such that, for any u, v ∈ J_µ(t_0), the approximation (2) holds with a remainder controlled by L_φ |u − v|^{β_φ}.

(H4) Two constants a, A > 0 exist which control the higher-order moments of the increments of X (a sub-Gaussian-type condition).

(H5) The variables ε^(n)_m, n, m ≥ 1, are independent copies of a centered variable ε with finite variance σ², for which constants b ≥ σ² > 0 and B > 0 exist controlling the exponential moments of ε.

(H6) The random variable M is such that M ≥ 9, and a γ_0 > 0 exists such that, for any s > 0, P(|M − µ| > s) ≤ exp(−γ_0 s).
Assumption (H2) imposes a mild condition on the distribution of the random observation points, which provides convenient moment bounds for their spacings. In particular, it implies that, for a sufficiently large µ, the density f stays bounded away from zero on J_µ(t_0). Assumption (H3) is a version of the so-called local stationarity condition. More precisely, (H3) implies that the trajectories of the process X = (X_u : u ∈ I) are Hölder continuous in quadratic mean in the neighborhood of t_0, with exact exponent H_{t_0} and local Hölder constant L_{t_0}. We call H_{t_0} the local regularity of the process X at t_0. Examples include, but are not limited to, stationary processes and stationary-increment processes X. See, e.g., [2] for some examples and references on processes satisfying the mild condition in (H3). Assumptions (H4) and (H5) are needed for deriving exponential bounds for the concentration of our local regularity estimator, and are satisfied by sub-Gaussian random variables. In particular, (H4) is satisfied by Gaussian processes. (H6) is a mild condition for controlling the variability of the number of observation points on the curves. The lower bound on M guarantees that each curve in the learning set has a sufficient number of observation times for building our estimator. For a real number a, let ⌊a⌋ denote the largest integer not exceeding a.

Theorem 1. Let Assumptions (H1)-(H6) hold true. Let K_0 be a positive integer satisfying condition (9), where c is a constant depending only on L_f, β_f, β_φ, f(t_0) and H_{t_0}. Define k = ⌊(K_0 + 7)/8⌋ and let Ĥ_{t_0} = Ĥ_{t_0}(k) be defined as in (8). Then, for a sufficiently large µ, the estimator Ĥ_{t_0} satisfies a concentration inequality in which 𝔣 is a positive constant depending on a, A, b, B and the length of the interval I.
To obtain a non-trivial estimator of H_{t_0}, we need k ≥ 2; thus the upper bound in (9) should be larger than 9, which happens as soon as µ ≥ 80. For the estimator in (7), which requires an estimate of σ², we would only need µ > 35. The exact expressions of the constants c and 𝔣 can be traced in the proof of Theorem 1. The condition imposed on K_0 provides a panel of choices depending on N_0 and µ. As a result, up to constants depending on L_f, β_f, β_φ, f(t_0) and H_{t_0}, the concentration rate one can expect ranges between a negative power of µ and a negative power of log(µ). The best possible concentration of Ĥ_{t_0} is guaranteed as soon as N_0 is larger than some power of µ, while for a concentration as fast as some negative power of log(µ), one only needs a small number N_0 of curves in the learning set, larger than some power of log(µ).
For the purpose of building an adaptive optimal kernel estimator for the trajectories of X, we will impose a concentration level of log^{−2}(µ) and an exponential bound equal to exp(−µ). The following corollary proposes a data-driven choice of K_0 which guarantees these requirements. This choice is guided by the fact that, for any constants a, b > 0, we have log^a(µ) ≤ exp((log log(µ))²) ≤ µ^b, provided µ is sufficiently large.
The conditions in Corollary 1 require µ to be large, but still allow for many cases in the three regimes, non-dense, dense and ultra-dense, as defined by [41].
One could also build Ĥ_{t_0} from only one trajectory of a process X with stationary increments. If the density of T is uniform and sufficiently many measurements are available, it suffices to split the interval [0, 1] into N_0 subintervals of the same length and to apply our methodology, considering the measurement times and the noisy measured values in each block as belonging to a different curve in the learning set. Theorem 1 and Corollary 1 remain valid.
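A sketch of this blocking device follows. The block-local time rescaling is an added convenience not mentioned in the text; it is harmless for the ratio-based estimator, since a common rescaling of the spacings multiplies every θ_j − 2σ² by the same factor.

```python
def split_into_pseudo_curves(points, n_blocks):
    """Split one long noisy trajectory observed at times in [0, 1] into
    n_blocks consecutive blocks, each treated as a separate learning
    curve.  Times are rescaled so every pseudo-curve lives on [0, 1];
    for a stationary-increment process this leaves the local regularity
    unchanged."""
    blocks = [[] for _ in range(n_blocks)]
    for t, y in points:
        idx = min(int(t * n_blocks), n_blocks - 1)   # block containing t
        blocks[idx].append((t * n_blocks - idx, y))  # block-local time
    return [sorted(b) for b in blocks]
```

Each pseudo-curve can then be fed to the learning-set estimator exactly as if it were an independent trajectory.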

The case of conditionally heteroscedastic noise
In some applications, the assumption of constant variance for the error term ε could be unrealistic. We therefore consider the following conditionally heteroscedastic extension of model (1):

Y^(n)_m = X^(n)(T^(n)_m) + σ(X^(n)(T^(n)_m), T^(n)_m) u^(n)_m, 1 ≤ m ≤ M_n, (11)

where σ(·, ·) is some unknown function and the u^(n)_m are independent copies of a centered variable u with unit variance.
Our approach also applies to model (11) under some additional mild conditions. Indeed, assuming the expectations exist, for 1 ≤ k < l ≤ K_0 we have

E_B[(Y_(l) − Y_(k))²] = E_B[(X_{T_(l)} − X_{T_(k)})²] + E_B[σ²(X_{T_(l)}, T_(l))] + E_B[σ²(X_{T_(k)}, T_(k))].

From this identity it is clear that the arguments presented in Section 2.1 remain valid as long as the values of the last two expectations on the right-hand side of the last display do not depend on k. Thus, in this case, even if the conditional variance of the errors is not given, we can consider the same estimator Ĥ_{t_0}. This remark leads us to the following additional assumption.
(E1) The variables u^(n)_m from model (11) satisfy Assumption (H5) with unit variance. Moreover, the function σ(·, ·) is bounded and the map t ↦ E[σ²(X_t, t)], t ∈ I, is constant in a fixed neighborhood of t_0.
Assumption (E1) allows the error term to be conditionally heteroscedastic, but imposes marginal (unconditional) homoscedasticity in a neighborhood of t 0 .
Under Assumption (E1), for any k, the quantity E_B[σ²(X_{T_(k)}, T_(k))] does not depend on k, and thus such terms cancel when considering the differences θ_{4k−3} − θ_{2k−1} and θ_{2k−1} − θ_k. The proof of Corollary 2 follows from the proof of Theorem 1 after obvious modifications, and hence is omitted. It is worth noting that, even if the regularity H_{t_0} is the same at any point t_0, one may not be able to estimate H_{t_0} from only one observed noisy trajectory with conditionally heteroscedastic noise. This is because, intuitively, it might be impossible to identify the oscillations of the signal of interest, that is, to separate the increments of the trajectory of X from the differences of error terms with variable variance. With our approach, based on local observed increments averaged over several curves, the effect of the noise vanishes, provided the expectation of the conditional variance is constant. Hence the oscillations of X are eventually identified, and there is no difference with respect to the case of homoscedastic errors.

The case of differentiable sample paths
The definition of the local regularity extends to the case of differentiable curves. When the curves admit derivatives up to order d > 0, condition (2) has to be stated for the d-th order derivative of the curves. To build an estimate of the local regularity of the d-th derivative, we propose to use a smoothing-based approximation of the d-th derivative. In the Appendix, we derive concentration bounds for the estimator of d + H_{t_0} when d > 0. In particular, we propose an estimator of d. The implementation of the estimator of d + H_{t_0} is described in Section 4.1, and the simulation results we report show that it performs well. However, many of the real data sets we analyzed revealed that, in applications, the sample paths often do not seem differentiable. For this reason, and to save space, we leave the details of learning the smoothness in the general case d ≥ 0 to the Appendix.

Adaptive optimal smoothing
With an estimate of the local regularity ς_{t_0} = d + H_{t_0}, obtained from a learning set of N_0 curves, at hand, we aim at recovering N_1 new noisy trajectories of X from what we call the online dataset. One of the most popular smoothers is the local polynomial estimator; see [11]. This estimator depends on a tuning parameter, the bandwidth, which should ideally be chosen according to the regularity of the target function. Using the local regularity estimate, one could also build optimal smoothers based on alternative approaches, such as splines. Here we focus on local polynomials.
One has to connect the concept of local regularity, which is meaningful from the theory of stochastic processes, to the usual definition of function regularity used in nonparametric curve estimation. Fortunately, in our framework, the parameter ς_{t_0}, understood as the local regularity of the process (X_t : t ∈ J_µ(t_0)) in quadratic mean, see (2), is intrinsically linked to the regularity of the sample paths of the process. Indeed, in many important situations covered by our assumptions, the regularity of the sample paths of a process does not depend on the realization of the process. For example, the regularity of any Brownian path is 1/2, in the sense that for any ε > 0, almost surely the sample path belongs to the Hölder space C^{1/2−ε}(I) and does not belong to C^{1/2+ε}(J) whatever J ⊂ I. Here, for any a > 0, C^a(I) denotes the space of uniformly a-Hölder continuous functions defined on I; see Theorem 2.2 and Corollary 2.6 of [33] for a precise definition. More generally, the regularity of the sample paths of a process is linked to integrated regularities through Kolmogorov's Continuity Theorem [33, Theorem 2.1]. In particular, Assumption (H3) ensures that, with probability 1, the trajectories of the process (X_t : t ∈ J_µ(t_0)) are Hölder continuous with any exponent 0 < a < H_{t_0}.
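For reference, the standard textbook statement of the criterion invoked here (a classical fact, not taken from the source) reads:

```latex
% Kolmogorov's continuity criterion (standard form): if there exist
% constants p, beta, C > 0 such that the moment bound below holds for
% all s, t, then X admits a modification whose sample paths are a.s.
% a-Hölder continuous for every 0 < a < beta / p.
\mathbb{E}\bigl[\,|X_t - X_s|^p\,\bigr] \le C\,|t-s|^{1+\beta}
\quad\Longrightarrow\quad
X \ \text{has a modification with a.s. } a\text{-H\"older paths, for all } 0 < a < \beta/p .
```

Under (2), the second moment alone gives β = 2H_{t_0} − 1, which is informative only for H_{t_0} > 1/2; for Gaussian-type processes all moments scale like |t − s|^{pH_{t_0}}, and letting p grow yields every exponent a < H_{t_0}, which is how (H3), combined with moment conditions of the type of (H4), delivers the claim above.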
Below, we define the local polynomial estimator and derive its theoretical properties. Since our focus of interest is the simultaneous denoising of the additional N_1 curves, we consider the following pointwise risk: for a generic estimator X̂ of the online curves,

R(X̂; t_0) = E[ max_{1 ≤ n ≤ N_1} { X̂^[n](t_0) − X^[n](t_0) }² ].

First, we provide a sharp bound for this risk with N_1 = 1, in the case where a suitable estimator of ς_{t_0} = d + H_{t_0}, computed from another independent sample, is given. Such a result, of interest in itself in nonparametric curve estimation, seems to be new. In this case, the expectation defining the risk R(X̂; t_0) should be understood as the conditional expectation given the estimator of ς_{t_0}. Next, we provide a sharp bound for R(X̂; t_0) in the case where N_1 ≥ 1 and the estimator of ς_{t_0} is obtained using the approach introduced in Section 2.

Local polynomial estimation
We assume that d ≥ 0 is an integer and H_{t_0} ∈ (0, 1). Let d̂ and Ĥ_{t_0} be some generic estimators of d and H_{t_0}, respectively, and let ς̂_{t_0} = d̂ + Ĥ_{t_0} be the corresponding estimator of ς_{t_0} = d + H_{t_0}. We assume that d̂ and Ĥ_{t_0} are independent of the N_1 curves from the online dataset, generated according to (1). The estimator of ς_{t_0} can be used to smooth any curve Y^[n_1] (n_1 = 1, . . . , N_1) from the online dataset. For the sake of readability, we omit the superscript [n_1] and consider a generic curve from the online dataset. For any u ∈ R, consider the vector U(u) = (1, u, . . . , u^{d̂}/d̂!). Let K : R → R be a positive kernel, and define the local polynomial coefficient vector ϑ̂_{M,h} as the minimizer of the associated weighted least-squares criterion, where h is the bandwidth. The vector ϑ̂_{M,h} satisfies the corresponding normal equations; let λ denote the smallest eigenvalue of the matrix A appearing in these equations. Taking into account the expression of the bandwidth minimizing the pointwise mean squared risk for a regression function defined on I, with derivative of order d which is Hölder continuous in a neighborhood of t_0 with exact exponent H_{t_0}, we consider a bandwidth ĥ proportional to M^{−1/(2ς̂_{t_0}+1)}. Our focus is on determining a nearly optimal rate for the bandwidth used to recover the trajectories of X. For applications, one could also be interested in a nearly optimal constant, which in general needs to be estimated. In Section 4.1 we propose a simple way to estimate a suitable constant for the applications.
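For the simplest case d̂ = 0, the local polynomial estimator reduces to the Nadaraya-Watson estimator, and the regularity-driven bandwidth rule can be sketched as follows; the multiplicative constant b stands in for the plug-in constant discussed in Section 4.1 and is an assumption here.

```python
import math

def nw_smooth(curve, t0, H_hat, b=1.0):
    """Nadaraya-Watson (local polynomial of degree d = 0) estimate of
    X(t0) from one noisy curve, with the regularity-driven bandwidth
    h = b * M^(-1 / (2 * H_hat + 1)).  The constant b is a placeholder
    for the plug-in constant of Section 4.1."""
    M = len(curve)
    h = b * M ** (-1.0 / (2.0 * H_hat + 1.0))
    num = den = 0.0
    for t, y in curve:                        # curve: list of (t, y) pairs
        u = (t - t0) / h
        if abs(u) <= 1.0:                     # Epanechnikov kernel support
            w = 0.75 * (1.0 - u * u)
            num += w * y
            den += w
    return num / den if den > 0 else float("nan")
```

A rougher estimated regularity (smaller Ĥ) yields a smaller bandwidth, so the smoother tracks the irregular oscillations instead of flattening them.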
With the bandwidth ĥ at hand, we define the local polynomial estimator X̂_{t_0} of X_{t_0} of order d̂ through the first component of ϑ̂ = ϑ̂_{M,ĥ}, capped at the level τ_{5/12}(M), for a suitable truncation function τ_α(y) defined for y > 1. The upper trimming with τ_{5/12}(M) is a technical device used to control the tails of X̂_{t_0}. It has practically no influence in applications. For deriving our results on X̂_{t_0}, we impose the following mild assumptions.
(LP1) There exist two positive constants a and A such that, for any p ≥ 1, the p-th moments of the process are suitably controlled. Moreover, for d ≥ 0 in Assumption (H3), if ∇^d X denotes the d-th derivative of the process X, a similar moment control holds for the increments of ∇^d X.

(LP2) Almost surely, µ/log(µ) ≤ M ≤ µ log(µ).

(LP3) The estimator Ĥ_{t_0} satisfies P(|Ĥ_{t_0} − H_{t_0}| > 1/log²(µ)) ≤ K_1 exp(−µ), where K_1 is some positive constant.
(LP4) The estimator d̂ satisfies P(d̂ ≠ d) ≤ K_1 exp(−µ).

The first part of Assumption (LP1) provides a suitably tight control on the moments of X_t, while still allowing for unbounded trajectories. The second part of Assumption (LP1) is a technical condition which reinforces Assumption (H4). It allows one to control the analytic regularity of the sample paths. More precisely, it is implicitly used in the definition of the variable Λ_β in (33). Assumption (LP2) is a convenient, but mild, technical condition. It could be relaxed at the price of controlling the probability of the complement of the event {µ/log(µ) ≤ M ≤ µ log(µ)}, for instance using (H6). Assumptions (LP3) and (LP4) are very mild conditions that the generic estimators of the regularity should satisfy. Since µ^{1/log²(µ)} = e^{1/log(µ)} for any µ > 1, concentration of Ĥ_{t_0} at a suitable negative power of log(µ) suffices for the smoothing purposes. For simplicity, and without loss of generality, we consider the same constant K_1 in Assumptions (LP3) and (LP4).
Theorem 2. Assume that Assumptions (H1), (H2), (H5) and (H6) and Assumptions (LP1)-(LP4) hold true, and let K(·) be a kernel satisfying condition (15). There then exists a constant Γ_0 such that, for any µ ≥ 1, an exponential-moment bound holds for |X̂_{t_0} − X_{t_0}|.

The bound on the exp(√x)-moment of |X̂_{t_0} − X_{t_0}| seems to be a new result for local polynomial estimators. For our purposes, it entails a sharp bound for R(X̂; t_0). More precisely, the price for considering a risk measure uniform over the whole online dataset is very low, namely a multiplying factor as large as a power of log(N_1) in the risk bound we derive below. [13] derived sharp bounds for all the moments of |X̂_{t_0} − X_{t_0}|. However, those bounds on the moments would induce a power of N_1 as a multiplying factor in our risk bound, instead of a power of log(N_1).
Theorem 3. Assume that the assumptions of Theorem 2 hold true, and let K be a kernel which satisfies (15). There then exists a positive constant Γ_1 such that the risk R(X̂; t_0) = E[max_{1 ≤ n ≤ N_1} {X̂^[n](t_0) − X^[n](t_0)}²] is bounded by Γ_1 times the rate µ^{−2ς_{t_0}/(2ς_{t_0}+1)}, up to the logarithmic factors discussed below.

If all the trajectories of X were in C^{ς_{t_0}}(J_µ(t_0)), ς_{t_0} were known, and N_1 = 1, the risk bound for R(X̂; t_0) would be of the usual nonparametric rate µ^{−2ς_{t_0}/(2ς_{t_0}+1)}. Let us note that the fact that H_{t_0} is not known has no consequence on the risk bound in Theorem 3. Indeed, since µ^{1/log²(µ)} = e^{1/log(µ)} for any µ, the order of the risk bound does not change as soon as the probability of the event {d̂ = d} ∩ {|Ĥ_{t_0} − H_{t_0}| ≤ 1/log²(µ)} tends to 1. The log(1 + N_1) factor is induced by the maximum over the N_1 curves in the online dataset. The factor {log(µ)}^{2ς_{t_0}/(2ς_{t_0}+1)} is due to the concentration properties of M around its mean µ; this factor would not appear if M/µ were almost surely bounded and bounded away from zero. The factor log²(µ) comes from probability theory: the trajectories of a stochastic process X with local regularity H_{t_0} do not necessarily belong to C^{ς_{t_0}}(J_µ(t_0)), but they almost surely belong to C^a(J_µ(t_0)) for any a < ς_{t_0}. Finally, let us notice that Corollary 1 states that the estimator defined by (8) satisfies (LP3) for d = 0 and any 0 < H_{t_0} < 1. This leads us to the following result.
Finally, we establish the uniform convergence of the recovered trajectories. Below, the uniformity is with respect to t 0 ∈ I and over all the curves in the online sample.
Theorem 4. Assume that the assumptions of Theorem 2 hold true uniformly for any t_0 ∈ I. Let ς = ς_{t_0} be the (constant) global regularity of the process, and let K be a kernel which satisfies (15). Then there exists a positive constant Γ_1 such that In the usual local polynomial (LP) smoothing framework, for a regression function defined on I, given a sample of size M, a bound on the pointwise mean squared error risk is first derived, and this bound is then minimized with respect to the bandwidth h; see for instance [37]. When the regression function admits a derivative of order d which is Hölder continuous in a neighborhood of t_0, with exact exponent H_{t_0} and local Hölder constant L_{t_0}, the optimal bandwidth is with q_1 and q_2 defined on pages 39-40 of [37]. With the Nadaraya-Watson estimator, we take where σ²_{t_0} is the variance of the noise, which could depend on t_0. This yields a refined constant C in this case; details are given in the Appendix. The value f(t_0) can be estimated using all the T_m^{(n)}. Thus our target bandwidth h_opt depends on two more unknown quantities, L_{t_0} and σ²_{t_0}, for which we now propose estimation procedures. The estimation of L_{t_0} can be based on ideas similar to those used for H_{t_0}. For simplicity, we assume d = 0; the extension to the case d ≥ 1 could follow the same pattern as for the estimation of the local regularity, using the trajectories of the derivatives. Using twice the relationship (3), with k and 2k − 1 respectively, we deduce On the other hand, using the approximation of the moments of the spacings, as given in Lemma 2, we have Given an estimator of H_{t_0}, the empirical counterpart of η_k obtained from the learning set of N_0 independent trajectories of X is where B_n is the sequence of events defined in (5). An estimate of η_{2k−1} can be obtained similarly.
These facts lead us to the following estimator of the local Hölder constant L_{t_0}: For the implementation we propose k = (K_0 + 7)/8 with To estimate the variance, we propose where S_n ⊂ {2, 3, . . . , M_n} is a set of indices for the n-th trajectory and |S_n| is the cardinality of S_n. When the variance of the error ε is considered constant, one can take S_n = {2, 3, . . . , M_n}. When it depends on t_0, one can take S_n = m : T , with K_0 defined above; this is the choice we used in our empirical investigation. When the variance of the errors also depends on the realizations X_u, as described in Section 2.3, it is in general no longer possible to consistently estimate σ²(X_{t_0}, t_0); our simulation experiments indicate that the estimate (19) nevertheless remains a reasonable choice. Finally, the constant involved in the definition of the bandwidth can be estimated by Ĉ, obtained by plugging the estimates of the unknown quantities into the definition of C in (17). Concerning the kernel, we use
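As an illustration of the plug-in bandwidth and of a difference-based noise variance estimate, here is a minimal Python sketch. The function names are ours, and the successive-difference estimator below is only a simplified stand-in for the estimator (19), valid when the signal is much smoother than the noise at the sampling resolution; the constant C_hat is an arbitrary illustrative value, not the estimate Ĉ.

```python
import numpy as np

def diff_variance(y):
    """Difference-based noise variance estimate: E[(Y_m - Y_{m-1})^2] is
    approximately 2 * sigma^2 when the signal increments are negligible."""
    return np.mean(np.diff(y) ** 2) / 2.0

def plugin_bandwidth(C_hat, M_n, varsigma_hat):
    """Plug-in bandwidth h_n = (C_hat / M_n)^{1 / (2 * varsigma + 1)}."""
    return (C_hat / M_n) ** (1.0 / (2.0 * varsigma_hat + 1.0))

# Sanity check on a smooth signal observed with noise of variance 0.05.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1000)
y = np.sin(2 * np.pi * t) + rng.normal(scale=np.sqrt(0.05), size=t.size)
sigma2_hat = diff_variance(y)
h = plugin_bandwidth(C_hat=0.1, M_n=t.size, varsigma_hat=0.5)
```

On a grid this fine, the squared signal increments are of order 10^-5 and the estimate is dominated by the noise term, as intended.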

Simulation experiments
We now illustrate the behavior of our local regularity estimator ς̂_{t_0} = d̂ + Ĥ_{t_0}, computed using the learning set of noisy curves, and the performance of the kernel smoother it induces for recovering the curves of the online set from their noisy measurements.
The procedure for calculating ς t 0 is summarized in the following algorithm where LP (d) means local polynomial smoother with degree d ≥ 0. The Nadaraya-Watson smoother corresponds to LP (0).
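The core idea of the algorithm can be illustrated schematically: compare mean squared increments of the curves at two nested scales around t_0. The Python sketch below is ours, omits the presmoothing and bias-correction steps of the actual procedure, and uses noiseless discretized trajectories; all names are illustrative.

```python
import numpy as np

def estimate_H(curves, times, t0, delta):
    """Schematic local regularity estimate at t0.

    With theta(s, t) = E[(X_t - X_s)^2] ~ L^2 |t - s|^{2H}, comparing the
    spacings delta and 2 * delta gives H = log(theta_2 / theta_1) / (2 log 2).
    """
    i0 = np.argmin(np.abs(times - t0))
    i1 = np.argmin(np.abs(times - (t0 + delta)))
    i2 = np.argmin(np.abs(times - (t0 + 2 * delta)))
    theta_1 = np.mean((curves[:, i1] - curves[:, i0]) ** 2)
    theta_2 = np.mean((curves[:, i2] - curves[:, i0]) ** 2)
    return np.log(theta_2 / theta_1) / (2.0 * np.log(2.0))

# Sanity check on standard Brownian motion (H = 1/2) on a fine grid.
rng = np.random.default_rng(0)
times = np.linspace(0.0, 1.0, 1001)
steps = rng.normal(scale=np.sqrt(np.diff(times)), size=(500, 1000))
curves = np.concatenate([np.zeros((500, 1)), np.cumsum(steps, axis=1)], axis=1)
H_hat = estimate_H(curves, times, t0=0.5, delta=0.05)
```

For Brownian motion theta(s, t) = |t − s|, so the ratio is 2 and the estimate should concentrate around 1/2.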
Calculate L̂_{t_0}(1), as in (18), and σ̂²_{t_0}. The curves X^{[n_1]}, 1 ≤ n_1 ≤ N_1, of the online set are then smoothed with LP(d̂), with d̂ delivered by Algorithm 1. The bandwidth is calculated as ĥ_{n_1} = (Ĉ/M_{n_1})^{1/(2ς̂_{t_0}+1)}, 1 ≤ n_1 ≤ N_1, with ς̂_{t_0} obtained from Algorithm 1. The constant estimate Ĉ is the same for all curves in the online set, namely the one obtained with ς̂_{t_0}, L̂_{t_0}(Ĥ_{t_0}) and σ̂²_{t_0}. We compare our approach with the classical cross-validation (CV) (least-squares leave-one-out) method applied to each curve X^{[n_1]} separately. For CV, we use the R package np [19], after rescaling the CV bandwidth to account for its different definition of the Epanechnikov kernel. At this stage, we want to point out that our smoothing method is much faster than any standard, trajectory-by-trajectory approach, such as CV. We report a time comparison in the Appendix; as expected, the ratio between the times needed for CV and for our approach is at least of the same order as N_1. It is worth noting that one cannot follow an ad hoc approach and transfer one CV bandwidth from a curve X^{[n_1]} to another, because H_{t_0} is not known and could even vary with t_0.
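For concreteness, a minimal Nadaraya-Watson (LP(0)) smoother with the Epanechnikov kernel and a bandwidth of the plug-in form (C/M)^{1/(2ς+1)} can be sketched as follows; the value of the constant C and all names here are our illustrative choices, not the paper's estimates.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def nadaraya_watson(t_obs, y_obs, t_new, h):
    """LP(0) smoother: kernel-weighted local average with bandwidth h."""
    w = epanechnikov((t_new[:, None] - t_obs[None, :]) / h)
    return w @ y_obs / np.maximum(w.sum(axis=1), 1e-12)

# One noisy curve with M = 300 uniform sampling points.
rng = np.random.default_rng(1)
M = 300
t_obs = np.sort(rng.uniform(0.0, 1.0, M))
y_obs = np.sin(2 * np.pi * t_obs) + rng.normal(scale=0.1, size=M)
h = (0.2 / M) ** (1.0 / (2 * 0.5 + 1))   # plug-in form with varsigma = 0.5, C = 0.2 (arbitrary)
x_hat = nadaraya_watson(t_obs, y_obs, np.array([0.25, 0.5, 0.75]), h)
```

The smoother is vectorized over evaluation points, which is what makes a single plug-in bandwidth so cheap compared with a per-curve CV search.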
The data are generated from the model (1) using different settings for X, the distribution of T and the variance of the noise, as well as for N_0 and N_1. For X, we consider three types of Gaussian processes: fractional Brownian motion (fBm) with constant Hurst parameter H ∈ (0, 1), fBm with piecewise constant Hurst parameter, and integrated fBm. In the latter case, X_t = ∫_0^t W^H(s) ds, where W^H denotes a fBm with constant Hurst parameter H. The local regularity is constant for the first and the third type, and variable for the second. The third type is an example of X with smooth trajectories. We identify the setting for X by s ∈ {1, 2, 3}. A more detailed description of these processes, as well as plots of their trajectories, is provided in the Appendix. The number M of measuring times of a curve is a Poisson random variable with expectation µ, while for the measuring times T we consider either a uniform distribution (identified by unif) or a deterministic equispaced grid (equi) on the range [0, 1]. For the noise, we consider the Gaussian distribution with both constant and variable variance; the cases are identified by σ², which is a number or a list, respectively. The values of σ² are chosen in such a way that the signal-to-noise variance ratio remains almost unchanged. Thus, one simulation setting is defined by the 7-tuple (s, N_0, N_1, µ, f, H, σ²), with f ∈ {unif, equi} and H, the Hurst parameter, being a list in the case of fBm with piecewise constant local regularity. Below, we present the results for a few settings; complementary results are reported in the Appendix. For each type of experiment, the reported results are obtained from 500 replications of the experiment. Our methodology is implemented in the R package denoisr, available at https: Figure 2 presents the estimation of ς_{t_0} for different settings (3, N_0, 500, µ, equi, 1.7, 0.005).
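The fBm trajectories of these settings can be generated, for instance, by a Cholesky factorization of the fBm covariance. The following Python sketch is our own (not the denoisr implementation) and draws noisy discretized trajectories under the unif design with a Poisson number of measuring times.

```python
import numpy as np

def fbm_paths(n_paths, times, H, rng):
    """Simulate fBm at the given times via a Cholesky factorization of
    cov(s, t) = 0.5 * (s^{2H} + t^{2H} - |t - s|^{2H})."""
    s, t = np.meshgrid(times, times)
    cov = 0.5 * (s ** (2 * H) + t ** (2 * H) - np.abs(t - s) ** (2 * H))
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(times.size))  # tiny jitter
    return rng.normal(size=(n_paths, times.size)) @ chol.T

rng = np.random.default_rng(2)
M = rng.poisson(300)                        # random number of measuring times, mu = 300
times = np.sort(rng.uniform(0.0, 1.0, M))   # 'unif' design
X = fbm_paths(200, times, H=0.5, rng=rng)   # H = 0.5 gives Brownian motion
Y = X + rng.normal(scale=np.sqrt(0.05), size=X.shape)  # noisy measurements
```

Since Var(X_t) = t^{2H} for fBm, the sample variance at the last sampling point offers a quick sanity check of the simulation.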
As expected, our local regularity estimation approach also performs well for smooth trajectories.
Next, we present the results on the risk R(X̂; t_0). Figure 3 presents the boxplots of the risk R(X̂; t_0) defined in (12) in the case of piecewise constant local regularity, with three values of t_0, each located in the middle of one of the intervals on which the regularity is constant. The results are quite good. The parts of the curves with lower regularity are harder to estimate and thus result in higher risks than the more regular parts. It appears that N_0 and µ do not have the same influence on the risk as on the estimation of the local regularity, and this is in line with the risk bound in Theorem 3. Thus, going from 300 to 1000 sampling points leads to a large improvement in terms of risk, whereas going from 250 to 1000 curves in the learning dataset results in little or no improvement. Finally, it seems that the method achieves better results for equispaced sampling points. The same conclusions can be drawn from the results presented in Figure 4, obtained for the simulation experiment defined by the 7-tuple (3, 1000, 500, 1000, equi, 1.7, 0.005).
Finally, we present a comparison with CV. Because of the large amount of computing resources required by CV, we only considered a few cases. Figure 5 presents the results in terms of the risk calculated at t_0 = 0.5 for the setting (1, 1000, 500, 300, equi, 0.5, 0.05). We remark that our method and CV perform similarly, despite the fact that CV uses a specifically tailored bandwidth for each curve in the online set. The homoscedastic setting is favorable to CV, which, for a given curve, uses a global bandwidth at any t_0. Figure 6 presents the setting with piecewise constant regularity, (2, 1000, 500, 1000, equi, (0.4, 0.5, 0.7), 0.05). CV retains good performance when the local regularity varies moderately. Our method shows close performance in this case, slightly better when H_{t_0} = 0.7. Figure 7 presents the results in the setting (3, 1000, 500, 1000, equi, 1.7, 0.005). Again, CV and our method perform quite similarly.
Let us end the presentation of our simulation experiments with results on median curve estimation from noisy curves. We consider a Gaussian simulation design given by Model 4 in [25], page 726. The mean function is g(t) = 3 sin(3π(t + 0.5)) + 2t³ and the covariance function is γ(s, t) = exp(−|t − s|^{0.8}). The curves are measured on a common equidistant grid with M ∈ {300, 1000} points, and the samples of size N ∈ {100, 200} have a proportion of 25% contaminated observations. We use the contamination by peaks with the parameter values used by [25] in their Figure 6. To the simulation design of [25], we add a Gaussian measurement error with constant variance σ². We compute the median curve using the mode depth with the noiseless sample (the benchmark median) and with the noisy samples, without smoothing. Next, we estimate the median curve using the smoothed curves. For smoothing, we use, on the one hand, LP(d̂) with d̂ and the bandwidth delivered by Algorithm 1 and, on the other hand, the R package np with the bandwidth selected by CV for each curve separately. We compare the integrated mean squared error (IMSE) of the smoothing-based median curves with respect to the benchmark. We also compare the IMSE obtained by our approach with that obtained with the median curves computed directly from the noisy samples. Finally, we compare the computing times of our method and of that using CV. The results obtained from 500 replications, with σ² = 1, are reported in Figure 8. The accuracy of the estimates, obtained in a small fraction of a second using our smoothing method, is only slightly lower than that obtained by CV, which requires far more computation time. Ignoring the noise in the sample leads to very large errors for the median curve estimation.
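A simple way to see why the noise matters for a robust location curve is sketched below. For brevity, the pointwise (cross-sectional) median replaces the mode-depth median of [25], iid errors replace the covariance γ(s, t), and the contamination amplitude is our illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 300, 100
t = np.linspace(0.0, 1.0, M)
g = 3.0 * np.sin(3.0 * np.pi * (t + 0.5)) + 2.0 * t ** 3  # mean function of Model 4

# Gaussian sample around g (iid errors stand in for gamma(s, t) = exp(-|t-s|^0.8))
X = g + rng.normal(size=(N, M))
X[: N // 4] += 8.0 * (t > 0.5)                  # 25% 'peak'-type contamination (illustrative)
Y = X + rng.normal(scale=1.0, size=X.shape)     # measurement error, sigma^2 = 1

benchmark = np.median(X, axis=0)    # pointwise median of the noiseless sample
raw = np.median(Y, axis=0)          # pointwise median of the noisy sample, no smoothing
imse_raw = np.mean((raw - benchmark) ** 2)
```

Smoothing each curve of Y before taking the median is what brings the error back toward the benchmark in the experiment reported in Figure 8.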

Real data analysis: the NGSIM Study
In this section, our method is applied to data from the Next Generation Simulation (NGSIM) study, which aims to "describe the interactions of multimodal travelers, vehicles and highway systems"; see [18]. This study is known to be one of the largest publicly available sources of naturalistic driving data. The dataset is widely used in traffic flow studies, from the interpretation of traffic phenomena such as congestion to the validation of trajectory prediction models (see e.g. [10,22,40,26,20] for some recent references). However, these data have been shown to be subject to measurement errors, revealed by physical inconsistencies between the space traveled, the velocity and the acceleration of the vehicles, cf. [28]. Montanino and Punzo [27] developed a trajectory-by-trajectory four-step method to recover the signals from the noisy curves, and their methodology is now considered a benchmark in the traffic flow engineering community for analyzing NGSIM data. The steps, finely tuned for the NGSIM data, are: 1. removing the outliers; 2. cutting off the high- and medium-frequency responses in the speed profile; 3. removing the residual nonphysical acceleration values, preserving the consistency requirements; 4. cutting off the high- and medium-frequency responses generated by step 3. A detailed description of these steps is provided in the Appendix.
To compare our smoothed curves to those of [27], we consider the following ratio: where X̂ denotes our curve estimate while X̃ is that obtained by [27]. A value of the ratio r(X̂, X̃) less than 1 indicates smoothed values closer to the observations. For our illustration, we consider a subset of the NGSIM dataset known as the I-80 dataset. It contains 45 minutes of trajectories for vehicles on the Interstate 80 Freeway in Emeryville, California, segmented into three 15-minute periods (from 4:00 p.m. to 4:15 p.m., from 5:00 p.m. to 5:15 p.m. and from 5:15 p.m. to 5:30 p.m.) on April 13, 2005, corresponding to different traffic conditions (congested, transition between uncongested and congested, and fully congested). In total, the dataset contains trajectories, velocities and accelerations for N_0 = 1714 individual vehicles that passed through this highway during this period, recorded every 0.1 s. The number M_n of measurements for each curve varies from 165 to 946. We focus on the velocity variable and rescale the measurement times of each of the 1714 velocity curves so that the first velocity measurement corresponds to t = 0 and the last one to t = 1. Figure 9 presents a sample of five curves from these data. It can easily be noticed that the velocities are quite erratic and that their variation is not physically realistic, indicating the presence of noise. Moreover, the data have been recorded at a moment of the day when traffic is evolving from fluid to dense. Therefore, we consider that there are three groups in the data: a first group corresponding to fluid (high-speed) traffic, a second one in between fluid and dense traffic, and a third group corresponding to dense (low-speed) traffic. To determine the three clusters, we fit a finite Gaussian mixture model to the vector of numbers of sampling points.
The model is estimated by an EM algorithm initialized by hierarchical model-based agglomerative clustering, as proposed by Fraley and Raftery [12] and implemented in the R package mclust [34]. The optimal model is then selected according to the BIC, yielding the three groups. Figure 11 presents the results of the estimation of ς_{t_0} for values of t_0 from 0.2 to 0.8, for each group. The evolution of ς̂_{t_0} is quite smooth, except for Group 1 (Figure 11b). A possible explanation could be the small number of curves and the small average of M_n in this group, which correspond to low values of N_0 and µ. We also provide the estimation of the regularity using the whole sample of size 1714. The differences we notice between the estimates of ς_{t_0} from the different groups support our preliminary clustering step.
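Since mclust is an R package, the clustering step can be mimicked in Python with a small hand-rolled EM for a one-dimensional Gaussian mixture, selecting the number of components by BIC. The data below are synthetic stand-ins for the numbers of sampling points, and the quantile-based initialization replaces mclust's hierarchical initialization.

```python
import numpy as np

def gmm_em_1d(x, k, n_iter=200):
    """Fit a k-component 1-D Gaussian mixture by EM; return (means, BIC).
    Quantile initialization stands in for mclust's hierarchical start."""
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, np.var(x))
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)            # E-step
        nk = resp.sum(axis=0)                                    # M-step
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-8
        pi = nk / x.size
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    loglik = np.log(dens.sum(axis=1)).sum()
    bic = -2.0 * loglik + (3 * k - 1) * np.log(x.size)           # 3k - 1 free parameters
    return mu, bic

# Synthetic numbers of sampling points with three traffic-like regimes.
rng = np.random.default_rng(4)
m_counts = np.concatenate([rng.normal(250, 20, 300),
                           rng.normal(500, 40, 700),
                           rng.normal(800, 50, 700)])
bics = {k: gmm_em_1d(m_counts, k)[1] for k in (1, 2, 3)}
best_k = min(bics, key=bics.get)   # BIC selects the number of groups
```

With three well-separated regimes, the BIC comparison picks three components, mirroring the three traffic groups of the NGSIM analysis.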
To compute the curve estimates, we adopt a leave-one-curve-out procedure: each curve is smoothed using the local regularity estimates computed from the other curves in its group (or from the other 1713 curves when the data are not split into groups). The densities of the resulting ratios r(X̂, X̃) are plotted in Figure 12. When the traffic is fluid and the speed is high (Group 1), our method performs much better than that of Montanino and Punzo. When the traffic is dense with low speed (Group 3), the smoothed values obtained with the two methods are more similar, though our method still exhibits better performance for the majority of the curves.

A Proof of Theorem 1
The proof of Theorem 1 is based on several lemmas that we present in the following. For these lemmas, we implicitly assume that the conditions of Theorem 1 are satisfied.

Proof of Lemma 3. By the definition in
Note that E(θ̂_k) = θ_k. Moreover, for any p ≥ 2, using Assumptions (H4) and (H5), we have where d and D are defined in the statement of this lemma. Bernstein's inequality implies and the same bound is valid for P(θ̂_k − θ_k ≤ −η). The bound for q_k(η) follows.
Then, for sufficiently large µ, with e defined in Lemma 3.
Proof of Lemma 4. Assume that k and l satisfy the assumptions stated in the Lemma and assume moreover that µ is large enough so that η(θ l −θ k )/2 < 1. Then and the same bound is valid for p − k,l (η). By (22), we have provided µ is sufficiently large. We obtain the bound for max(p + k,l (η), p − k,l (η)) after applying Lemma 3.
Proof of Theorem 1. Let ε > 0. With the notation from (4) and (8), we can write and thus it suffices to bound the terms B and V.
The term B. The study of the set in the indicator function boils down to the study of the convergence of Ĥ_{t_0}(k) to H_{t_0}. Using Lemma 1 with c a constant defined in Lemma 1. Using again Lemma 1, with k replaced by 2k − 1, l = 4k − 3 and a = 2, and taking the difference, we deduce that there exists R_k such that where Similarly, we obtain: Combining (22) and (23), we obtain which leads, using the definition of Ĥ_{t_0}(k) given by (4), to: Note that, for sufficiently large µ, both R_k and R_{2k−1} are greater than −1/2. This implies Thus, since ρ*_{4k−3} ≤ ρ*_{K_0}, the condition Ĥ_{t_0}(k) − H_{t_0} > ε/2 fails and B = 0 as soon as ε > 4 log 2 , that is, as soon as condition (10) is satisfied, provided µ is sufficiently large.

B Proofs of Theorems 2 and 3
The proofs of Theorems 2 and 3 are based on the following lemmas, whose proofs are provided in Appendix E. For the first lemma we consider the matrix Â defined in (13) with the bandwidth ĥ = M^{−1/(2ς̂_{t_0}+1)}. Let λ̂ be the smallest eigenvalue of this matrix. Let A be its population counterpart and let λ_0 denote its smallest eigenvalue. In the following, we assume that K(·) satisfies (15). Then A is positive definite [see 37, for details] and thus λ_0 > 0.
Under Assumptions (LP2), (LP3) and (LP4), the matrix Â defined as in (13), with h = ĥ, is positive semidefinite. Moreover, there exists a positive constant g that depends only on K, d, f(t_0) and λ_0 such that, for M sufficiently large, and, for sufficiently large µ, where Here, K_2 is a universal constant.
Since the dimensions of Â and A are given by d̂, the probability P(·) in Lemma 5 should be understood as the conditional probability given the estimator d̂.

Then, to show that (A) is finite, it suffices to show that
are finite and to apply the Cauchy-Schwarz inequality. To control the stochastic term A_2, remark that: where By Jensen's inequality, it suffices to bound B_4; thus, it remains to control B_p for any p ≥ 4. For such values of p, we use the Marcinkiewicz-Zygmund inequality and obtain: By a version of Lemma 1.3 of [37], for κ defined in (15), Let χ_m = h^{−1} 1{t_0 − h ≤ T_m ≤ t_0 + h}. By the Rosenthal inequality (see [23]), there exists a universal constant C such that, for any q ≥ 1,
Using Assumption (LP2), we deduce: The last line can be obtained using arguments similar to those used to obtain (45) in the proof of Lemma 5. Next, let W̄_m = W_m² / Σ_{j=1}^M W_j². Since the error terms are independent of the T_m's, and using (H5), by Jensen's inequality

Thus we have
, for some constant c 1 . For the last inequality, we use Stirling's formula. This implies that there exists a universal constant c 2 such that Combining (29) with (32) we obtain: The inequality on the last line comes from the fact that B 4 τ (µ) log 2 (µ) is bounded.
To control the bias term A_1, let us define, for any 0 < β < H_{t_0}: where X_u^{(d)} denotes the d-th derivative of the trajectory X_u. Applying Taylor's formula and using the basic properties satisfied by the weights W_m, we obtain: where |ζ_m − t_0| ≤ |T_m − t_0|. Note that this result is obtained using: Since, under E, we have W_m = 0 as soon as |T_m − t_0| > h: The last line follows from (30). Moreover, combining the result obtained by [33, p. 27] with (H4), for any 0 < H_{t_0} − β < β_0, where β_0 is some sufficiently small fixed value, we have: Since, by definition, the random variable Λ_β is independent of Ĥ_{t_0}, M and the T_m's, by the last inequality above and inequality (31), we have: We thus obtain: Note that, on the event F, Taking β = H_{t_0} − log^{−1}µ, since (µ/log µ)^{1/log µ} is bounded, we deduce that, for some constant C > 0, It remains to control (B), (C) and (D). For this purpose, let us first note that, by Assumption (LP1), where for the last inequality we apply Lemma 6 with and ξ = |X_{t_0}|^{1/2}, and thus c_2 = (5A/8)^{1/3}. Now notice that, using Markov's inequality, Since α̃ < 1/2, Assumptions (LP3) and (LP4) imply that, for sufficiently large µ: Moreover, Lemma 5 also implies that, for sufficiently large µ: Finally, if H denotes either E, F or G, we have where C denotes a positive constant. The choices α = 5/12 and α̃ = 9/24 allow us to deduce that E ϕ²τ(µ)Z²P(H) is bounded. This concludes the proof of Theorem 2.

C Proof of Theorem 4
Proof of Theorem 4. Assume, without loss of generality, that is an integer, and let min(I) = s_0 < s_1 < . . . < s_K < s_{K+1} = max(I) be a regular grid of the interval I. For any t ∈ I, let k_t ∈ {1, . . . , K} be such that |t − s_{k_t}| ≤ 1/(2K − 2) =: . We have E max Bound for A. Using arguments similar to those of the proof of Theorem 3, we obtain: where c denotes an absolute positive constant and Ψ is defined by (16).
Bound for B. Note that using [37, p. 45] we obtain that there exists a positive constant such that, for any 1 ≤ n 1 ≤ N 1 , we have almost surely: Using (H5) and (LP1), this implies that there exists a positive constant Y such that: Bound for C. Set 0 < η < H/2 and, for any 1 ≤ n 1 ≤ N 1 , define the random variable: We have: It remains to bound the last expectation. By (H4), we have, for any 1 ≤ n 1 ≤ N 1 and any p ≥ 2: where |I| denotes the length of I and We finally obtain that: Gathering the bounds, we deduce that:

D.1 Main assumptions
In this section we propose an alternative approach to estimate the regularity ς_{t_0} = d + H_{t_0} of the process X without any restriction on d ∈ N. The main idea is to replace the noisy observations in (6) (G1) With probability 1, for any ℓ ∈ {0, . . . , d}, the ℓ-th order derivative ∇^ℓ X_t of X_t exists for all t ∈ O and satisfies: (G2) Two positive constants S_d and β_d exist such that: (G3) Constants a > 0 and A > 0 exist such that, for any ℓ ∈ {0, . . . , d} and any p ≥ 1: The quantity d + H_d is the local regularity of the process on O, while L_d is the Hölder constant of the d-th derivative of the trajectories.
where, for any s, t ∈ Ô: Here, ∇^d X̃ denotes a pilot estimator of the curve ∇^d X^{(n)}, which can be obtained by a presmoothing procedure.

D.3 Concentration properties
The quality of the estimator ς̂_{t_0} depends on the quality of the generic nonparametric estimators ∇^d X̃ of ∇^d X. To quantify their behavior, we consider the local L_p-risk Our method applies with any type of nonparametric estimator ∇^d X̃ (local polynomials, splines, ...) as soon as, for any p ∈ N, its L_p-risk is suitably bounded. The following mild condition is satisfied by common estimators; see for instance Theorem 1 in [13] for the case of local polynomials.
(LP5) There exist two positive constants c and C such that We can now derive an exponential bound for the concentration of all the estimators Ĥ_ℓ, ℓ ∈ {0, . . . , d}. To make this exponential bound useful for deriving optimal rates for our estimators of the mean and covariance functions, we will require the largest risk among R_2(0), . . . , R_2(d) to tend to zero as µ increases to infinity.
Then, for any µ larger than some constant µ_0 depending on B, τ, γ, Γ, H_d, β_d, and for some positive constant f, we have The three quantities ρ(µ), ∆(µ) and ϕ(µ) are required to decrease to zero, as µ tends to infinity, in such a way that ρ(µ)/∆(µ) + ∆(µ)/ϕ(µ) → 0. We propose Γ = 2 and γ = 1/2. The choices of the rates for ρ(µ), ∆(µ) and ϕ(µ) satisfy some additional requirements. First, it will be shown below that, in order to achieve optimal rates of convergence for the mean and covariance estimators, the local regularity has to be estimated with a concentration rate ϕ(µ) faster than log^{−1}(µ). This is a consequence of the identity µ^{1/log(µ)} = e, valid for any µ > 1, and of a mild condition on N and µ, such as lim sup The technical condition (35) matches general situations found in applications. Second, we want to allow for reasonable rates of increase for N_0, the size of the learning set. In Theorem 5, N_0 can increase as fast as an arbitrary positive power of µ. Third, since τ > 0 could be arbitrarily small, the rate imposed on the nonparametric estimators ∇^d X̃ of ∇^d X is a very mild requirement, which can be achieved by the common estimators, with random or fixed design, under mild conditions, in particular on the distribution of the M_i and on the smoothing parameter; see, for instance, [37] and [1]. In particular, the required rate for the ∇^d X̃ can be obtained under general forms of heteroscedasticity.
Using (G2), we have: This implies that, for any d ∈ {0, . . . , d}: The same reasoning can be applied to bound the term p − d (s, t; κ).
the following inequality holds: Proof of Lemma 9. We first have to control the distance between H_d and the proxy value H̃_d defined in (34). To do so, note that, for k = 2, 3, we have where, using (G2) and Lemma 7, This implies We deduce that By simple algebra, and using the definitions of the functions p⁺_d and p⁻_d introduced in Lemma 8, we obtain Note that, using Lemma 7 and (G2), we have for k = 2, 3: Thus, Lemma 8 can be used to write The same reasoning can be applied to bound p⁻_d(t_1, t_k; 1 − 2^{−ε/2}). However, note that in this case 1 − 2^{−ε/2} ≤ ε/4. This implies that: To complete the proof of Lemma 9, it suffices to take f_d = a 2^{d+1} e log²(2)/2^{4+8H_d}.
Proof of Theorem 5. Note that: Now, recall that, for ℓ < d, we have H_ℓ = 1, while H_d < 1. This implies that: Since 1 − H_d > ϕ(µ) for µ sufficiently large, ϕ(µ) can replace ε in Lemma 9, and we have: Theorem 5 is proved.
We study separately the two terms of the right hand side of the above equation.

Let us recall the definitions
with U(u) = (1, u, . . . , u^d/d!). Moreover, λ̂ and λ_0 are the smallest eigenvalues of Â and A, respectively. The matrix A is positive definite and thus λ_0 > 0; see [37]. The following result shows that, with high probability, λ̂ stays away from zero. Let us recall that, in our context, d̂ is a generic estimator of d, independent of the online set of curves. Since the dimensions of the matrices Â and A are given by this estimator, the probability P(·) in Lemma 5 should be understood as the conditional probability given the estimator d̂.
Proof of Lemma 5. Without loss of generality, we can work on the set {d̂ = d}. Moreover, for simplicity, we write h instead of ĥ below in this proof. Note that, using Assumption (LP2), for any 1 ≤ i ≤ j ≤ d, the element Â_{i,j} tends almost surely to the element A_{i,j} as µ goes to infinity. This implies that the matrix Â tends to the matrix A, and also that, for sufficiently large µ, we have λ̂ > 0. More precisely, we have: where ‖·‖_2 denotes the norm induced by the Euclidean norm, whereas ‖·‖_∞ denotes the entrywise sup-norm. Let P̃(·) and Ẽ(·) denote the conditional probability P(·|M) and conditional expectation E(·|M), respectively. Then:

Next, we decompose
Using Assumption (H2) and the fact that K(·) has support [−1, 1], we have: This implies that, for h sufficiently small, that is for M sufficiently large, Let us define By property (15), we have Moreover, for h sufficiently small, that is for M sufficiently large, Applying the Bernstein inequality [see 35, p. 95], we obtain, for any x > 0: Then equation (27) follows if we define: It remains to prove (28). Let us define the events Using (27), we have where the last line comes from Assumption (LP3). Note that Assumption (LP2) implies that, under F and for sufficiently large µ, Thus, we have, for some positive constant K_2 that does not depend on 0 < β ≤ λ_0/2.

F Moment bounds for spacings
We need to find an accurate approximation for moments of the spacings of T^{(1)} ≤ . . . ≤ T^{(K_0)}, defined as in Section 2, that is, the subvector of the K_0 values closest to t_0. We assume that T admits a density f. Such moments will be considered with k and l such that, for some fixed value, k − l is small compared to m + 1, and (k − l)/(m + 1) converges to zero when m → ∞. Herein, for any real number a, ⌊a⌋ denotes the largest integer smaller than or equal to a. These conditions on k and l allow (k − l) to increase more slowly than m.
Let us point out that T^{(1)} ≤ . . . ≤ T^{(K_0)}, defined in Section 2, are not the order statistics of a random sample of T. In fact, T^{(k)}, with 1 ≤ k ≤ K_0, is the (G + k)-th order statistic of the sample T_1, . . . , T_m. Here G is a random variable whose value is determined by the way the subvector of the K_0 values closest to t_0 is built. It is important to notice that G depends on the smallest and the largest values in this subvector, but is independent of the other components of the subvector. In particular, this means that, in the case where T has a uniform distribution, the law of the spacings between T^{(1)} ≤ . . . ≤ T^{(K_0)} coincides with the law of the same type of spacings between the order statistics of a uniform sample of size m on [0, 1]. In particular, in the uniform case, the law of T^{(k)} − T^{(l)} depends only on m and k − l. For this reason, we first consider the case where T has the uniform law. In the general case, we use the transformation by the distribution function in order to get back to the uniform case.

F.1 The uniform case
Let U be a uniform random variable on [0, 1]. Let U_1, . . . , U_m be an independent sample of U and let U_{(1)}, . . . , U_{(m)} be the corresponding order statistics. In the case of a uniform sample, U_{(k)} − U_{(l)} and U_{(k−l)} have the same distribution, namely the beta distribution Beta(k − l, m − (k − l) + 1). Hence, in this case, it is equivalent to study the moments of U_{(r)} with 1 ≤ r = k − l ≤ m − 1; the variable U_{(r)} has a Beta(r, m − r + 1) distribution. It is also worthwhile to notice that U_{(k)} − U_{(l)} and U_{(l)} are independent, and the same is true for U_{(k)} − U_{(l)} and U_{(k)}.
By elementary calculations, we have where B(·, ·) denotes the beta function and Γ(·) the gamma function. To derive the bounds for the moments of interest, we use some existing results on the approximation of the gamma function and of ratios of gamma functions; these results are recalled in Section F.3 below. Let M be a random variable taking positive integer values. In the following lemma we assume that, given the realization of M ≥ K_0, T_1, . . . , T_M is an independent sample with uniform distribution on [0, 1].
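The beta-function identity behind these elementary calculations, E[Z^α] = B(r + α, m − r + 1)/B(r, m − r + 1) for Z ~ Beta(r, m − r + 1), is easy to check numerically. The sketch below is ours; it uses log-gammas to avoid overflow for large m and compares the identity with simulated uniform order statistics.

```python
import math
import numpy as np

def beta_moment(a, b, alpha):
    """E[Z^alpha] for Z ~ Beta(a, b), i.e. B(a + alpha, b) / B(a, b),
    computed with log-gammas for numerical stability."""
    def log_beta(x, y):
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    return math.exp(log_beta(a + alpha, b) - log_beta(a, b))

# U_(k) - U_(l) has the Beta(k - l, m - (k - l) + 1) distribution.
rng = np.random.default_rng(5)
m, k, l, alpha = 200, 12, 5, 2.0
r = k - l
U = np.sort(rng.uniform(size=(20000, m)), axis=1)
empirical = np.mean((U[:, k - 1] - U[:, l - 1]) ** alpha)
exact = beta_moment(r, m - r + 1, alpha)
```

For alpha = 2 the identity reduces to the closed form r(r + 1)/((m + 1)(m + 2)), which the simulated spacings reproduce.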
Lemma 11. Consider 0 < α ≤ 3 and 1 ≤ l < k ≤ m, and let r = k − l. Then, for any m ≥ K_0 in the support of M, Let F (resp. Q) denote the distribution function (resp. the quantile function) of T. We assume that F is strictly increasing on [0, 1]; thus Q is the inverse function of F, and Q is differentiable with Q' = 1/f. Then, given M = m, for any 1 ≤ l < k ≤ m, the joint distribution of the order statistics (T_{(k)}, T_{(l)}) is the same as the joint distribution of (Q(U_{(k)}), Q(U_{(l)})), where U_{(1)} ≤ . . . ≤ U_{(m)} are the order statistics of a uniform sample. Then, for any 0 < α ≤ 3, where C and c_0 are two constants depending only on α, L_f, β_f and f(t_0).
Proof of Lemma 12. In the following, we use several times the following property: for any a, b, α ≥ 0, Next, given M = m, By a first-order Taylor expansion of Q(U_{(k)}) around the point U_{(l)}, we get On the other hand, using Jensen's inequality and the variance of a beta distribution with parameters t_m(m + 1) and (1 − t_m)(m + 1), Gathering facts and using Lemma 11, there exists a constant c such that On the other hand, since U_{(k)} − U_{(l)} is independent of U_{(l)}, from the above and Lemma 11 we deduce that, for any 0 < α' ≤ α ≤ 3, for some constant C depending on L_f, β_f and f(t_0).
Coming back to relationship (48) and taking the power α on both sides of the identity, we can write Since, for any a > −1 and 0 < α ≤ 3, using the bound (49) with α' = α and α' = α/2, for some constant c_R depending on L_f, β_f and f(t_0). It remains to apply Lemma 11 to complete the proof. We then deduce, for 2 ≤ s ≤ 3 and x ≥ 2,

• Setting 3: Integrated fractional Brownian motion. The curves X_t are obtained as the integrals ∫_0^t W^H(s) ds, t ∈ [0, 1], of the paths of a fractional Brownian motion W^H with constant Hurst parameter H. Here, the local regularity of the process is the same at each point, but it is greater than 1, so this setting corresponds to the case of smooth trajectories. Figure 13c illustrates one realization of this setting.

Figure 14a presents the violin plots of the time needed to smooth N_1 = 1000 curves. The results are obtained with the parameters of the simulation (1, 1000, 1000, 300, equi, 0.5, 0.05). They correspond to the total CPU time (system time and user time) needed to estimate the bandwidths ĥ_n and then estimate the curves at their sampling points. We perform these computations on a personal computer equipped with an Intel Core i7-6600U processor (2.60 GHz, 24 GB RAM) and rerun the estimation 10 times. We observe that our smoothing device outperforms cross-validation and plug-in in terms of computation time: it is about 1000 times faster than cross-validation. Let H_n be a set of bandwidths. These differences are explained by the fact that cross-validation computes the estimator for each bandwidth in H_n and each curve X^{(n)} of the sample (N_1 × Card(H_n) calls to the estimation function), while our estimator requires only one estimation of the regularity of the functions and one evaluation of the estimator per curve (N_1 calls to the estimation function).
Similarly, Figure 14b presents the violin plots of the time needed to smooth N_1 = 1000 curves with the simulation parameters (3, 1000, 1000, 1000, equi, 1.7, 0.005). The same personal computer is used and the simulation is again run 10 times. For Setting 3, our procedure is slower than for Setting 1, which is easily explained by the computation of the derivatives of each curve X^(n). However, the computation time for cross-validation remains far larger than ours.

Figure 13: Illustrations of simulated data generated according to the different settings (panel (c): integrated Brownian motion). The curves correspond to the generated trajectories, without noise, that we aim to recover; the grey points correspond to the noisy measurements.

G.4 On the pointwise risk
For technical convenience, in our theoretical study we only considered the case where the regularity estimator ς_{t_0} is applied with an independent sample. If one wants to smooth the curves in the learning set, one can use a leave-one-out method: for each curve, one estimates the local regularity without that curve and smooths the curve with the estimate obtained. Our method for calculating H_{t_0} is very fast, so such a leave-one-curve-out procedure is feasible. This idea was used to analyze the NGSIM data. However, one could also simply smooth the learning set curves using the same local regularity estimates obtained from this dataset. Figure 17 presents the estimation of the risks R(X; 1/6), R(X; 0.5) and R(X; 5/6) for the piecewise fBm, with constant noise variance σ² = 0.05, when the training and test sets are the same. The simulation results indicate that our theoretical results could be extended to the case where the online set is taken equal to the learning set, though the concentration deteriorates. The theoretical investigation of this issue is left for future work. Figure 18 presents the estimation of the risk R(X; 0.5) for the fBm, with constant noise variance σ² = 0.05. Figure 19 presents the estimation of the risks R(X; 1/6), R(X; 0.5) and R(X; 5/6) for the piecewise fBm, with heteroscedastic noise. The conclusions are the same as in the homoscedastic case.
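The leave-one-curve-out idea can be sketched in a few lines. The functions `estimate_regularity` and `smooth_with` are placeholders for the paper's regularity estimator and bandwidth-driven smoother, not actual implementations of them.

```python
def leave_one_curve_out(curves, estimate_regularity, smooth_with):
    """Sketch of leave-one-curve-out smoothing: for each curve, the local
    regularity is estimated from all the OTHER curves, and the curve is then
    smoothed using the estimate obtained without it."""
    smoothed = []
    for i in range(len(curves)):
        others = curves[:i] + curves[i + 1:]
        H_i = estimate_regularity(others)   # regularity estimated without curve i
        smoothed.append(smooth_with(curves[i], H_i))
    return smoothed
```

Since the regularity estimator is cheap, the extra factor N_0 in the number of regularity estimations stays affordable.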

G.5 Details on the constant of the bandwidth h opt
When the regression function admits a derivative of order d which is Hölder continuous in a neighborhood of t_0, with exact exponent H_{t_0} and local Hölder constant L_{t_0}, the optimal bandwidth for local polynomial smoothing proposed by [37] is

h_opt = C_{t_0} (M f(t_0))^{−1/(2ς_{t_0}+1)}, with C = C_{t_0} = ( q_2 / (2 ς_{t_0} q_1²) )^{1/(2ς_{t_0}+1)},

where q_1 = C_* L_{t_0} / ς_{t_0}! and q_2 = σ²_{t_0} C_*². Here, C_* is the constant defined on page 39 of [37]. Let us recall the notation used by [37] for the local polynomial estimator of a regression function r(·), at the point t, using a sample (Y_1, T_1), . . . , (Y_M, T_M). In the case of the Nadaraya–Watson (NW) estimator, a closer look at the proof of Proposition 1.13 of [37] reveals that the absolute value of the bias is bounded by q_1 h^{ς_{t_0}}.

Figure 19: Estimation of the risks R(X; 1/6), R(X; 0.5) and R(X; 5/6) for the piecewise fBm, with non-constant noise variance σ² = 0.04, 0.05 and 0.07.
Meanwhile, the conditional variance of the NW estimator given the T_m can be bounded by q_2 / (M h f(t_0)). Since the data points from all the curves can be pooled, the density f(t_0) of the T_m^(n) can be estimated with high accuracy. We therefore use the true value f(t_0) in our simulations, which in the case of a uniform design is equal to 1.
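A minimal sketch of this plug-in bandwidth, minimizing the bias² + variance bound q_1² h^{2H} + q_2 / (M h f(t_0)). The function name is hypothetical, and the gamma function stands in for the factorial ς_{t_0}! so that non-integer regularities are accepted; this is an illustration of the formula above, not the paper's code.

```python
import math

def plug_in_bandwidth(H, L, sigma2, C_star, M, f_t0=1.0):
    """Bandwidth minimizing the risk bound q1^2 h^(2H) + q2 / (M h f(t0)),
    with q1 = C_star * L / Gamma(H + 1) (Gamma replacing the factorial)
    and q2 = sigma2 * C_star**2."""
    q1 = C_star * L / math.gamma(H + 1)
    q2 = sigma2 * C_star ** 2
    # first-order condition: 2H q1^2 h^(2H-1) = q2 / (M f h^2)
    C = (q2 / (2 * H * q1 ** 2)) ** (1 / (2 * H + 1))
    return C * (M * f_t0) ** (-1 / (2 * H + 1))
```

The returned value is the exact minimizer of the stated bound, as can be checked by perturbing it slightly in either direction.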

H Traffic flow: Montanino and Punzo [27] methodology
Montanino and Punzo [27] present a four-step methodology to make the NGSIM data usable. For a complete description of the steps, we refer the reader to their article [27]; we briefly summarize their method here. The four steps below are applied to each trajectory separately.
Step 1. Removing the outliers. They remove the measurements that lead to unreliable values of the acceleration by cutting all the records above a deterministic threshold of 30 m/s². The missing points are interpolated using a natural cubic spline with 10 reference points before and after the outliers.
Step 2. Cutting off the high- and medium-frequency responses in the speed profile. They remove the noise from the signal by linear smoothing with a low-pass filter, namely a first-order Butterworth filter [5] with a cutoff frequency of 1.25 Hz.
Step 3. Removing the residual unphysical acceleration values while keeping the consistency requirements. They remove residual peaks that exceed defined thresholds (varying with speed levels). To do so, they move the position of the vehicle where the peak in acceleration appears, in order to satisfy the thresholds. To prevent inconsistency, a 5th-degree polynomial interpolation with a constraint on the space traveled, plus minor conditions, is applied on a 1 s window around the peak points.
Step 4. Cutting off the high- and medium-frequency responses generated by Step 3. This step is the same as Step 2 but uses the results of Step 3.
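Steps 1 and 2 above can be sketched as follows. This is a loose re-implementation with assumed choices (outliers flagged via a finite-difference acceleration, NGSIM-like 10 Hz sampling, the function name `clean_speed` is ours), not Montanino and Punzo's exact procedure.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import butter, filtfilt

def clean_speed(t, v, fs=10.0, a_max=30.0, cutoff=1.25):
    """Steps 1-2 (sketch): drop points whose finite-difference acceleration
    exceeds a_max (m/s^2), refill them with a natural cubic spline, then
    low-pass the speed with a first-order Butterworth filter (1.25 Hz)."""
    a = np.gradient(v, t)                       # crude acceleration estimate
    keep = np.abs(a) <= a_max                   # Step 1: flag outliers
    spline = CubicSpline(t[keep], v[keep], bc_type="natural")
    v_filled = np.where(keep, v, spline(t))     # refill dropped points
    b, a_coef = butter(1, cutoff, btype="low", fs=fs)
    return filtfilt(b, a_coef, v_filled)        # Step 2: zero-phase low-pass
```

`filtfilt` applies the filter forward and backward, avoiding the phase shift a single pass would introduce.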
The methodology of [27] is quite specific to the NGSIM dataset, or at least to trajectory datasets of this kind, and therefore cannot easily be applied to other data. Using the algorithm on other trajectory datasets would require some fine-tuning of its parameters.
As explained in the main text, the 1714 observation units from the I-80 dataset, available in the NGSIM study, were recorded at moments of the day when traffic conditions evolve from fluid to dense. Therefore, we consider that there are three groups in the data: a first group corresponding to fluid (high-speed) traffic, a second one in between fluid and dense traffic, and a third group corresponding to dense (low-speed) traffic. Our local regularity approach, and the kernel smoothing it induces, are applied to each group separately. The three-group clustering was performed using a Gaussian mixture model estimated by an EM algorithm initialized by hierarchical model-based agglomerative clustering, as proposed by Fraley and Raftery [12] and implemented in the R package mclust [34]. The optimal model is then selected according to BIC. The three resulting classes contain 239, 869 and 606 velocity trajectories, respectively. Plots of randomly selected subsamples of trajectories from each group are provided in Figure 20.
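The BIC-based selection of the mixture model can be sketched as follows. We used mclust in R; this Python sketch with scikit-learn is only an illustration of the same selection principle, and it replaces mclust's hierarchical model-based initialization with scikit-learn's default (k-means) initialization.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_by_bic(features, candidate_k=(1, 2, 3, 4, 5), seed=0):
    """Fit Gaussian mixtures with several numbers of components and keep
    the one with the lowest BIC (mclust-like selection, sketched with
    sklearn; initialization differs from mclust's hierarchical scheme)."""
    best, best_bic = None, np.inf
    for k in candidate_k:
        gm = GaussianMixture(n_components=k, n_init=5, random_state=seed)
        gm.fit(features)
        bic = gm.bic(features)          # lower BIC = better trade-off
        if bic < best_bic:
            best, best_bic = gm, bic
    return best.predict(features), best
```

In practice `features` would be a low-dimensional summary of each velocity trajectory (e.g. its mean speed).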

I Complements on the real-data applications
In this section, we point out that our methodology is not specific to the traffic flow data: it can be applied to other real datasets as well.

I.1 Canadian weather
The Canadian Weather dataset [30,29] records the daily temperature and precipitation in Canada, averaged over the period from 1960 to 1994. Here, we are interested in the average temperature for each day of the year. The dataset contains the measurements of 35 Canadian stations, so that N_0 = 35 and µ = 365. A sample of five temperature curves is plotted in Figure 21a. Figure 21b presents the estimation of H_{t_0} for different t_0, obtained with K_0 = 25; the estimates vary around 1.

I.2 Household Active Power Consumption
The Household Active Power Consumption dataset is part of the Monash University, UEA, UCR time series regression archive [36] and was sourced from the UCI repository 1. The data measure diverse energy-related features of a house located in Sceaux, near Paris, every minute between December 2006 and November 2010. In total, this represents around 2 million data points. These data are used to predict the daily power consumption of a house. Here, we are only interested in the daily voltage. The dataset contains N_0 = 746 time series of µ = 1440 measurements. Figure 22a presents a sample of five curves from this dataset. The estimation of the local regularity H_{t_0}, plotted in Figure 22b, is around 0.5, obtained with K_0 = 73.

I.3 PPG-Dalia
The PPG-Dalia dataset is also part of the Monash University, UEA, UCR time series regression archive [36] and was also sourced from the UCI repository 2. PPG sensors are widely used in smart wearable devices to measure heart rate [32]; the recordings contain a single-channel PPG signal and 3D accelerometer data. Here, we are interested in the PPG channel. A sample of five curves is plotted in Figure 23a. The estimation of the local regularity H_{t_0} is also around 0.5 (see Figure 23b), with K_0 = 25.
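The local regularity estimates reported in Sections I.1 to I.3 rest on comparing mean squared increments at two scales. The sketch below is a simplified proxy for noiseless, equispaced data, assuming E[(X(t+d) − X(t))²] ≍ d^{2H} near t_0; the estimator studied in the paper is more refined (in particular it accounts for the measurement noise), and the function name is ours.

```python
import numpy as np

def local_regularity(paths, i0, k):
    """Proxy for the local regularity H at sampling index i0: compare mean
    squared increments at lag k and lag 2k across curves (rows of `paths`):
    H ~ log(theta(2k) / theta(k)) / (2 log 2)."""
    left = paths[:, i0] - paths[:, i0 - k]
    right = paths[:, i0 + k] - paths[:, i0]
    theta_k = 0.5 * np.mean(left ** 2 + right ** 2)       # scale k
    theta_2k = np.mean((paths[:, i0 + k] - paths[:, i0 - k]) ** 2)  # scale 2k
    return float(np.log(theta_2k / theta_k) / (2 * np.log(2)))
```

On standard Brownian motion, for instance, theta(2k)/theta(k) = 2 in expectation, so the proxy returns a value close to H = 1/2, consistent with the values found for the voltage and PPG curves.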