Sparsely Observed Functional Time Series: Estimation and Prediction

Functional time series analysis, whether based on time or frequency domain methodology, has traditionally been carried out under the assumption of complete observation of the constituent series of curves, assumed stationary. Nevertheless, as is often the case with independent functional data, it may well happen that the data available to the analyst are not the actual sequence of curves, but relatively few and noisy measurements per curve, potentially at different locations in each curve's domain. Under this sparse sampling regime, neither the established estimators of the time series' dynamics nor their corresponding theoretical analysis will apply. The subject of this paper is to tackle the problem of estimating the dynamics and of recovering the latent process of smooth curves in the sparse regime. Assuming smoothness of the latent curves, we construct a consistent nonparametric estimator of the series' spectral density operator and use it to develop a frequency-domain recovery approach that predicts the latent curve at a given time by borrowing strength from the (estimated) dynamic correlations in the series across time. Further to predicting the latent curves from their noisy point samples, the method fills in gaps in the sequence (curves nowhere sampled), denoises the data, and serves as a basis for forecasting. Means of providing corresponding confidence bands are also investigated. A simulation study interestingly suggests that, in the presence of smoothness, sparse observation over a longer time period may provide better performance than dense observation over a shorter period. The methodology is further illustrated by application to an environmental data set on fair-weather atmospheric electricity, which naturally leads to a sparse functional time series.


Introduction
Functional data analysis constitutes a collection of statistical methods to analyse data comprised of ensembles of random functions: multiple occurrences of random processes evolving continuously in time and/or space, typically over a bounded rectangular domain (Ramsay & Silverman, 2007; Ferraty & Vieu, 2006; Hsing & Eubank, 2015; Wang et al., 2016). The challenges of functional data arise, on the one hand, from their infinite-dimensional nature: this calls upon tools and techniques from functional analysis, while standard inference problems may become ill-posed. On the other hand, the data, though continuous in nature, are seldom observed as such. Instead, finitely sampled versions are available to the statistician. If the sampling is sufficiently dense, the data can often be treated as genuinely functional data, possibly after a pre-smoothing step. The statistical estimators and procedures may then be based on the intrinsically infinite-dimensional inputs and techniques. This approach was popularised by Ramsay & Silverman (2007).
It can very well happen, though, that the data are recorded only at some intermediate locations of their domain, possibly corrupted by measurement error. In this case, it is necessary to regard the underlying functional nature of the data only as a latent process, and additional effort is required to construct adequate statistical methodology. This scenario is often referred to as sparsely observed functional data and usually occurs when the independent realisations of the latent functional process are longitudinal trajectories. In a key paper, Yao et al. (2005a) demonstrated how to estimate the covariance operator of the latent functional process using kernel regression and how to estimate the principal components of the latent process through conditional expectations. See also Yao et al. (2005b) for an application of the proposed methodology in functional linear regression. The rate of convergence of the kernel smoother of Yao et al. (2005a) was later strengthened by Hall et al. (2006) and Li & Hsing (2010). Other methods for sparsely observed functional data minimize a specific convex criterion function and express the estimator within a reproducing kernel Hilbert space; see Cai & Yuan (2010) and Wong & Zhang (2017).
Still, there are many applications where independence of the underlying curves cannot be assumed, for instance when the functional data are naturally ordered into a temporal sequence indexed by discrete time. We then speak of functional time series, and these are usually analysed by assuming stationarity and weak dependence across the time index. Historically, the research focused mostly on generalizing linear processes to functional spaces; see Bosq (2012a) and Blanke & Bosq (2007) for overview publications. More recently, the research has moved beyond the linear structure. Hörmann & Kokoszka (2010) considered the effect of weak dependence on principal component analysis and studied the estimation of the long-run covariance operator. Horváth et al. (2013) provided a central limit theorem for the mean of a stationary weakly dependent sequence and considered the estimation of the long-run covariance operator.
A step further from the estimation of isolated characteristics such as the mean function and the said long-run covariance operator is to estimate the entire second-order structure of the process, without assuming linearity. To this aim, Panaretos & Tavakoli (2013a) introduced the notion of spectral density operators and harmonic principal components, capturing the complete second-order dynamics in the frequency domain, whereas Panaretos & Tavakoli (2013b) showed how to estimate the said spectral density operators by smoothing the operator-valued analogue of the periodogram. They formalised weak dependence by cumulant-type mixing conditions, à la Brillinger (1981). In parallel work, Hörmann et al. (2015) introduced the notion of dynamic principal components, closely related to the harmonic principal components of Panaretos & Tavakoli (2013a), and estimated the spectral density operators by the operator version of Bartlett's estimate (Bartlett, 1950).
Despite the long tradition of functional time series as a driving force behind theoretical and methodological progress in functional data analysis more generally, a surprising fact is that the focus has been almost exclusively on "densely" observed functional time series, where it is assumed that the full functional data are available. Indeed, discrete sampling appears to be a nearly absent consideration, with the exceptions (to our knowledge) being: Panaretos & Tavakoli (2013b), who show the stability of their asymptotics under dense discrete observation but with measurement error of decaying magnitude; and, more recently, Kowal et al. (2017b), who studied functional autoregressive models by means of Bayesian hierarchical Gaussian models. They derived a Gibbs sampler for inference and forecasting, but the paper does not examine the asymptotic behaviour of the method. In particular, in one of their considered sampling regimes, which they call the sparse fixed design, posterior Bayesian concentration would be unattainable. The Bayesian modelling framework was also extended to multivariate dynamic linear models by Kowal et al. (2017a) and to dynamic function-on-scalar regression by Kowal (2018). A related problem was studied by Paul & Peng (2011), who considered correlated sparsely observed functional data with separable covariance structure, but the focus was not on dynamics.
In this article we address this gap (or, rather, chasm) and consider the problem of estimating the complete dynamics, and recovering the latent curves, in a stationary functional time series that is observed sparsely, irregularly, and with measurement errors. The number of observations per curve is assumed to be random, almost surely finite, and not increasing to infinity. Therefore we speak of genuine sparsity, much in the same vein as Yao et al. (2005a). As a first step, we show how to estimate the full second-order dynamics of the functional time series based on sparse noisy data using kernel regression methods. We construct estimators of individual characteristics such as the mean function and the lag autocovariance operators, as an aside, but the main contribution is the kernel-based generalization of Bartlett's estimate of the spectral density operators. By integrating back the spectral density into the time domain we construct a consistent estimator of the entire space-time covariance structure.
Our methodology can also be interpreted in a design context: in certain applications, it might be possible for the scientist to choose how to distribute a given fixed budget of measurements over individual curves and over time. In this case, one might ask how to better estimate the underlying dynamics: whether it is better to sample a functional time series more densely over a shorter time span, or to record fewer observations per curve but over a longer time span. In Section 4 we perform a simulation study to examine this tradeoff, and find that under sufficient smoothness, the sparse sampling regime over a longer period seems preferable.
The second contribution of the article is the establishment of a functional data recovery framework. We show how to predict the unobserved functional data once the space-time dynamics have been estimated. The recovery of the functional data is done by conditioning on all observed data, borrowing strength from the complete dynamics of the process (rather than just the marginal covariance). Simulations show that this approach comprehensively outperforms a naive approach that would ignore dependence and simply employ the methodology of Yao et al. (2005a) using only the lag-zero covariance, as one would do in the i.i.d. case (see e.g. Table 2). When the functional time series is Gaussian, we furthermore show how to construct confidence bands for the latent functional data, with both pointwise and simultaneous coverage. In addition, we show how the functional recovery methodology naturally leads to forecasting.
Functional time series methodology is often useful in analysing continuously measured scalar time series that can be subdivided into segments of an obvious periodicity, usually days. A key benefit of this technique is the separation of the intra-day variability and the temporal dependence among the consecutive days. The approach is especially fruitful in the analysis of environmental or meteorological phenomena, for example, particulate matter atmospheric pollution (Hörmann & Kokoszka, 2010; Hörmann et al., 2015, 2016; Aue et al., 2015). Nonetheless, some meteorological variables cannot be measured continuously and uninterruptedly. A practical motivation of this article comes from the data on atmospheric electricity (Tammet, 2009). The peculiarity of these data is that atmospheric electricity can be reliably measured only in fair-weather conditions. Otherwise, the physical-chemical processes behind the atmospheric electricity are altered and thus a different kind of process is measured. Details of this mechanism are reported in the data analysis in Section 5. Because of this censoring protocol, the considered functional time series is genuinely sparsely observed. We analyse such a dataset using our proposed methods, as a means of illustration.
The rest of the article is organised as follows. In Section 2 we define the functional time-series framework we work with and introduce the estimation and prediction methodology. In Section 3 we formulate the asymptotic theory for the suggested estimators under two different sets of assumptions: the cumulant mixing conditions leading to suboptimal rates, and a stronger set of assumptions including the strong mixing conditions resulting in optimal rates. Section 4 contains the results of numerical experiments designed to probe the finite-sample performance of our methodology. Section 5 illustrates the proposed methodology on the fair-weather atmospheric electricity time series. In Appendix A we comment on some implementation concerns; the formal proofs are included in Appendix B, and some additional results of the numerical experiments are presented in Appendix C.

Functional Time Series Framework
A functional time series is a sequence of random functions defined on the interval [0, 1], denoted {X_t}_{t∈Z} = {X_t(x), x ∈ [0, 1]}_{t∈Z}. We assume that X_t ∈ H = L²([0, 1]) and E‖X_t‖² < ∞. Moreover, we assume that the realisations (paths) of X_t are smooth functions (concrete smoothness assumptions will be introduced in Section 3). This space-time process will be referred to as a functional time series. Assuming second-order stationarity in the time variable t, we may define the (common) mean function of X_t(·) by E(X_t(x)) = µ(x), x ∈ [0, 1], and capture the second-order dynamics of the functional time series by its lag-h autocovariance kernels

R_h(x, y) = Cov(X_{t+h}(x), X_t(y)), x, y ∈ [0, 1], h ∈ Z.

Each kernel R_h(·, ·) introduces a corresponding operator R_h : L²([0, 1]) → L²([0, 1]) defined by right integration,

(R_h g)(x) = ∫₀¹ R_h(x, y) g(y) dy, g ∈ L²([0, 1]).

In addition to the stationarity, we assume weak dependence, in that the autocovariance kernels are summable in the supremum norm (denoted by ‖·‖_∞) and the autocovariance operators summable in the nuclear norm (denoted by ‖·‖₁):

∑_{h∈Z} ‖R_h‖_∞ < ∞, ∑_{h∈Z} ‖R_h‖₁ < ∞. (2.1)

Under these conditions, Panaretos & Tavakoli (2013b) showed that for each ω ∈ (−π, π), the following series converge in the supremum norm and the nuclear norm, respectively:

f_ω(·, ·) = (1/2π) ∑_{h∈Z} R_h(·, ·) exp(−iωh), F_ω = (1/2π) ∑_{h∈Z} R_h exp(−iωh). (2.2)

The kernel f_ω(·, ·) and the operator F_ω are called the spectral density kernel and the spectral density operator at frequency ω, respectively. The lagged autocovariance kernels and operators can be recovered by the inversion formula (Panaretos & Tavakoli, 2013b), which holds in the supremum and the nuclear norm, respectively:

R_h(x, y) = ∫_{−π}^{π} f_ω(x, y) exp(iωh) dω, R_h = ∫_{−π}^{π} F_ω exp(iωh) dω. (2.3)

In particular, the spectral density operator F_ω is a non-negative, self-adjoint trace-class operator for all ω.
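To illustrate the definitions above, the sketch below discretises a functional MA(1) process X_t = ε_t + Ψ ε_{t−1} and evaluates its spectral density (2.2), which for an MA(1) reduces to the finite sum f_ω = (1/2π)(R_0 + R_1 e^{−iω} + R_1* e^{iω}). The Brownian-motion covariance and the exponential operator kernel are illustrative choices, not part of the paper's model:

```python
import numpy as np

# Discretise [0, 1]; C is the covariance kernel of the innovations eps_t
# (Brownian motion, an illustrative assumption), Psi an integral-operator kernel.
n = 50
x = np.linspace(0, 1, n)
dx = 1.0 / n
C = np.minimum.outer(x, x)
Psi = 0.5 * np.exp(-np.abs(np.subtract.outer(x, x)))

# Lag autocovariances of the MA(1): R_0 = C + Psi C Psi*, R_1 = Psi C,
# R_h = 0 for |h| > 1 (operator compositions carry a dx integration factor).
R0 = C + Psi @ C @ Psi.T * dx**2
R1 = Psi @ C * dx

def spec_dens(w):
    """Spectral density operator (2.2) of the MA(1) at frequency w."""
    return (R0 + R1 * np.exp(-1j * w) + R1.T * np.exp(1j * w)) / (2 * np.pi)

# f_w is self-adjoint and non-negative: real eigenvalues, all >= 0 (numerically).
F = spec_dens(0.7)
eig = np.linalg.eigvalsh(F)
```

Checking that `F` is Hermitian with non-negative spectrum confirms numerically the stated properties of F_ω.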

Observation Scheme
We consider a sparse observation scheme with additive independent measurement errors. Let Y_tj be the j-th measurement on the t-th curve at spatial position x_tj ∈ [0, 1], where j = 1, . . . , N_t and N_t is the number of measurements on the curve X_t for t = 1, . . . , T. The additive measurement errors are denoted by ε_tj and are assumed to be independent identically distributed realisations of a random variable with mean 0 and variance σ² > 0, for j = 1, . . . , N_t and t = 1, . . . , T. Furthermore, the measurement errors are assumed to be independent of {X_t}_{t∈Z} as well as of the measurement locations {x_tj}. The observation model can then be written as

Y_tj = X_t(x_tj) + ε_tj, j = 1, . . . , N_t, t = 1, . . . , T. (2.4)

The spatial positions x_tj as well as their number N_t are considered random, and concrete conditions for the asymptotic results are given in Section 3.
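To fix ideas, the observation scheme (2.4) can be simulated as follows. The Fourier-basis construction of the latent curves, the AR(1) coefficient dynamics, the Poisson(3) counts and the noise level are all illustrative assumptions, not part of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200          # number of curves
sigma = 0.5      # measurement-error standard deviation (illustrative)
K = 4            # size of the Fourier basis for the latent curves

def basis(x):
    """Cosine basis evaluated at locations x, shape (K, len(x))."""
    return np.stack([np.ones_like(x)] +
                    [np.sqrt(2) * np.cos(np.pi * k * x) for k in range(1, K)])

# Latent smooth curves: AR(1) dynamics on the basis coefficients.
coef = np.zeros((T, K))
for t in range(1, T):
    coef[t] = 0.6 * coef[t - 1] + rng.normal(0, 1.0 / np.arange(1, K + 1), K)

# Sparse sampling: N_t ~ Poisson(3), uniform locations, additive noise (2.4).
data = []
for t in range(T):
    N_t = rng.poisson(3)
    x_t = rng.uniform(0, 1, N_t)
    y_t = coef[t] @ basis(x_t) + rng.normal(0, sigma, N_t)
    data.append((x_t, y_t))
```

Note that some curves may receive no measurements at all (N_t = 0), matching the sparse regime described in the text.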

Nonparametric Estimation of the Model Dynamics
Given the sparsely observed data {Y tj } generated by the observation scheme (2.4), we wish to estimate the mean function µ and the lag autocovariance kernels R h (·, ·). Thanks to the formulae (2.2) and (2.3), the estimation of the lag autocovariance operators is equivalent to the estimation of the spectral density f ω (·, ·).
In the first step, we estimate the common mean function µ by a local linear smoother; see, for example, Fan & Gijbels (1996). Let K(·) be a one-dimensional symmetric probability density function. Throughout this paper we work with the Epanechnikov kernel K(v) = (3/4)(1 − v²) for v ∈ [−1, 1], and 0 otherwise, but any other usual smoothing kernel would be appropriate. Let B_µ > 0 be the bandwidth parameter. We define the estimator of µ(x) as µ̂(x) = â₀, where (â₀, â₁) minimize the weighted sum of squares

∑_{t=1}^{T} ∑_{j=1}^{N_t} K((x_tj − x)/B_µ) { Y_tj − a₀ − a₁(x_tj − x) }². (2.5)

Then, in a second step, we show how to estimate the second-order characteristics of the functional time series, namely the lag-0 covariance and the lag-h autocovariance kernels. Since the measurement errors ε_tj contribute only to the diagonal of the lag-0 autocovariance kernel,

Cov(Y_{t+h,j}, Y_tk) = R_h(x_{t+h,j}, x_tk) + σ² 1_[h=0,j=k],

where 1_[h=0,j=k] = 1 if the condition in the subscript is satisfied and zero otherwise. Therefore we consider the "raw" covariances

G_{h,t}(x_{t+h,j}, x_tk) = (Y_{t+h,j} − µ̂(x_{t+h,j}))(Y_tk − µ̂(x_tk)), (2.6)

whose expectations are approximately R_h(x_{t+h,j}, x_tk) + σ² 1_[h=0,j=k]. Hence, the diagonal of the raw lag-0 covariances must be removed when estimating the lag-0 covariance kernel. Specifically, to estimate the lag-0 covariance kernel, we employ a local-linear surface smoother on [0, 1]² applied to the raw covariances {G_{0,t}(x_tj, x_tk), t = 1, . . . , T, j ≠ k}. Precisely, we let R̂₀(x, y) = b̂₀, where b̂₀ is obtained by minimizing the following weighted sum of squares:

∑_{t=1}^{T} ∑_{j≠k} K((x_tj − x)/B_R) K((x_tk − y)/B_R) { G_{0,t}(x_tj, x_tk) − b₀ − b₁(x_tj − x) − b₂(x_tk − y) }², (2.7)

and B_R > 0 is the bandwidth parameter.
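The local linear smoother for the mean can be sketched as follows; the data-generating sine curve, sample size and bandwidth are illustrative assumptions:

```python
import numpy as np

def epanechnikov(v):
    """Epanechnikov kernel K(v) = (3/4)(1 - v^2) on [-1, 1]."""
    return np.where(np.abs(v) <= 1, 0.75 * (1 - v**2), 0.0)

def local_linear_mean(x_all, y_all, x0, bw):
    """Local linear estimate of mu(x0) from the pooled points of all curves:
    minimise sum_i w_i (y_i - a0 - a1 (x_i - x0))^2 and return a0_hat."""
    w = epanechnikov((x_all - x0) / bw)
    X = np.column_stack([np.ones_like(x_all), x_all - x0])
    A = X.T @ (w[:, None] * X)       # weighted normal equations
    b = X.T @ (w * y_all)
    return np.linalg.solve(A, b)[0]

# Toy data: pooled measurement locations and noisy values around sin(2 pi x).
rng = np.random.default_rng(1)
x_all = rng.uniform(0, 1, 2000)
y_all = np.sin(2 * np.pi * x_all) + rng.normal(0, 0.3, 2000)
mu_hat = local_linear_mean(x_all, y_all, 0.25, bw=0.1)
```

Since sin(2π · 0.25) = 1, the estimate `mu_hat` should lie close to 1; the same weighted-least-squares structure extends to the surface smoother (2.7) with a product kernel.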
We estimate the measurement error variance σ² using the approach of Yao et al. (2005a). That is, we first estimate V(x) = R₀(x, x) + σ² by smoothing the raw variances on the diagonal. We assign V̂(x) = ĉ₀, where (ĉ₀, ĉ₁) minimize

∑_{t=1}^{T} ∑_{j=1}^{N_t} K((x_tj − x)/B_V) { G_{0,t}(x_tj, x_tj) − c₀ − c₁(x_tj − x) }². (2.8)

Instead of using {R̂₀(x, x) : x ∈ [0, 1]} as the estimator of the diagonal of the lag-0 covariance kernel (without the ridge contamination), Yao et al. (2003, 2005a) opted for a local-quadratic smoother, arguing that the covariance kernel is maximal along the diagonal, and so a local-quadratic smoother is expected to outperform a local linear smoother. This heuristic was also confirmed by our own simulations. Therefore, following Yao et al. (2005a), we fit a local-quadratic smoother along the direction perpendicular to the diagonal. Concretely, the estimator is defined as R̃₀(x) = c̃₀, where c̃₀ is the minimizer of a weighted sum of squares fitted to the off-diagonal raw covariances, with a quadratic term in the direction perpendicular to the diagonal, (2.9) and where P(x_tj, x_tk) is the first coordinate (which is the same as the second one) of the projection of the point (x_tj, x_tk) onto the diagonal of [0, 1]². The measurement error variance is then estimated by

σ̂² = 2 ∫_{1/4}^{3/4} { V̂(x) − R̃₀(x) } dx. (2.10)

Since the estimator (2.10) is based on smoothers, it is not guaranteed to be a positive number. This problem was already commented on by Yao et al. (2005a). In the theoretical part of their paper, the negative estimate is replaced by zero and, in their code, it is replaced by a small positive number. The replacement by a positive number can be seen as a form of regularization. Next, we proceed with the estimation of the lag-h autocovariance kernels for h > 0. We define the estimator R̂_h(x, y) = b̂₀, where b̂₀ minimizes the weighted sum of squares analogous to (2.7), applied to the raw covariances {G_{h,t}(x_{t+h,j}, x_tk)}. (2.11) For h < 0 we set R̂_h(x, y) = R̂_{−h}(y, x). Observe that we did not need to remove the diagonal as in (2.7). Denote the corresponding estimated covariance operators by R̂_h.
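The diagonal-removal step for the lag-0 covariance can be sketched as follows. The toy uses constant latent curves (so the true R₀(x, y) ≡ 1) with the mean assumed known and zero; curve count, sparsity level, noise and bandwidth are all illustrative:

```python
import numpy as np

def epanechnikov(v):
    return np.where(np.abs(v) <= 1, 0.75 * (1 - v**2), 0.0)

def local_linear_surface(xs, ys, gs, x0, y0, bw):
    """Local linear surface smoother at (x0, y0) applied to raw covariances gs
    located at (xs, ys); returns b0_hat, the estimate of R_0(x0, y0)."""
    w = epanechnikov((xs - x0) / bw) * epanechnikov((ys - y0) / bw)
    X = np.column_stack([np.ones_like(xs), xs - x0, ys - y0])
    A = X.T @ (w[:, None] * X)
    b = X.T @ (w * gs)
    return np.linalg.solve(A, b)[0]

def raw_lag0(x_t, y_centred):
    """All off-diagonal raw lag-0 covariances of one curve (j != k removes
    the sigma^2-contaminated diagonal)."""
    j, k = np.meshgrid(np.arange(len(x_t)), np.arange(len(x_t)), indexing="ij")
    off = j != k
    return x_t[j[off]], x_t[k[off]], y_centred[j[off]] * y_centred[k[off]]

rng = np.random.default_rng(2)
XS, YS, GS = [], [], []
for _ in range(500):
    x_t = rng.uniform(0, 1, 5)
    a = rng.normal()                       # constant curve X_t(x) = a
    y_t = a + rng.normal(0, 0.2, 5)        # noisy sparse measurements
    xs, ys, gs = raw_lag0(x_t, y_t)
    XS.append(xs); YS.append(ys); GS.append(gs)

r0_hat = local_linear_surface(np.concatenate(XS), np.concatenate(YS),
                              np.concatenate(GS), 0.5, 0.5, bw=0.3)
```

Because the diagonal pairs j = k are excluded, the smoother targets R₀(0.5, 0.5) = Var(a) = 1 rather than 1 + σ².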

Spectral Density Kernel Estimation
To estimate the spectral density kernels f_ω one has to resort to smoothing or a different sort of regularization at some point. Panaretos & Tavakoli (2013b) performed kernel smoothing of the periodogram in the spectral domain, whereas Hörmann et al. (2015) made use of Bartlett's estimate. Bartlett's estimate involves a weighted average of the lagged autocovariances, with a choice of weights that downweighs higher-order lags. From the theoretical perspective, this approach is equivalent to kernel smoothing of the periodogram (Priestley, 1981, §6.2.3). In fact, the Bartlett weights correspond to the Fourier coefficients of the smoothing kernel, assumed compactly supported.
In this paper, we opt for the Bartlett weights (or triangular window) defined as W_h = (1 − |h|/L) for |h| < L and 0 otherwise, for the Bartlett span parameter L ∈ N, as this seems to be a popular choice (Hörmann & Kokoszka, 2010; Hörmann et al., 2015). It should be noted that other choices of weights are possible (Rice & Shang, 2017) and that the so-called local quadratic windows (Parzen, Bartlett-Priestley, etc.) improve the asymptotic bias. See Priestley (1981, §7.5) for a detailed discussion in the one-dimensional case. The statement seems to hold also for functional time series (van Delft, 2019).
If the full functional observations were available, the spectral density would be estimated by the formula (cf. Hörmann et al., 2015)

F̂_ω = (1/2π) ∑_{|h|<L} W_h R̂_h exp(−iωh), (2.12)

where R̂_h are the standard empirical autocovariance operators. We could use the formula (2.12) and plug in the smoothed autocovariance kernels obtained in Section 2.3, but instead we opt to show how to directly construct a smoother-based estimator of the spectral density kernels. Specifically, we estimate the spectral density kernel at frequency ω ∈ (−π, π) by a local-linear surface smoother applied to the raw covariances multiplied by complex exponentials. The weights for the smoother are based both on the spatial distance from the raw covariances and on the time lag, through the Bartlett weights. Specifically, we estimate the spectral density kernel as f̂_ω(x, y) = d̂₀, where d̂₀ is obtained by minimizing a weighted sum of squares combining the Bartlett weights in lag with the kernel weights in space. (2.13) It turns out that the minimizer of this complex minimization problem can be expressed explicitly. Moreover, the minimizer depends only on a few quantities that are independent of ω and can be pre-calculated. The estimator can thus be constructed for a given ω by multiplying these quantities by complex exponentials and performing a handful of inexpensive arithmetic operations. Consequently, it is computationally feasible to evaluate the estimator (2.13) on a dense grid of frequencies. The explicit form is stated in Section B.2.
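The plug-in variant based on the fully-observed Bartlett formula (2.12) can be sketched as follows, with discretised autocovariance matrices standing in for the operators R̂_h; the grid size and the white-noise toy check are illustrative:

```python
import numpy as np

def bartlett_spectral_density(R_list, omegas, L):
    """Bartlett estimate F_hat_w = (1/2pi) sum_{|h|<L} (1 - |h|/L) R_hat_h e^{-iwh}.
    R_list[h] is the discretised lag-h autocovariance for h = 0, ..., L-1;
    negative lags enter via the transpose, R_{-h}(x, y) = R_h(y, x)."""
    n = R_list[0].shape[0]
    F = np.zeros((len(omegas), n, n), dtype=complex)
    for i, w in enumerate(omegas):
        F[i] = R_list[0].astype(complex)
        for h in range(1, L):
            W_h = 1 - h / L                    # Bartlett (triangular) weight
            F[i] += W_h * (R_list[h] * np.exp(-1j * w * h) +
                           R_list[h].T * np.exp(1j * w * h))
        F[i] /= 2 * np.pi
    return F

# Toy check: white noise (R_h = 0 for h != 0) has a flat spectrum R_0 / (2 pi).
R0 = np.eye(4)
F = bartlett_spectral_density([R0] + [np.zeros((4, 4))] * 9, [0.0, 1.5], L=10)
```

This is the plug-in route the text mentions; the paper's own estimator (2.13) instead smooths the raw covariances directly, which avoids forming each R̂_h first.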
Denote the integral operator corresponding to f̂_ω(·, ·) by F̂_ω. We can go back to the temporal domain by integrating the spectral density, reproducing the estimators of the autocovariance kernels and operators by the formulae

R̂_h(x, y) = ∫_{−π}^{π} f̂_ω(x, y) exp(iωh) dω, (2.14)    R̂_h = ∫_{−π}^{π} F̂_ω exp(iωh) dω. (2.15)

The estimators of the spectral density kernels f̂_ω(·, ·), ω ∈ (−π, π), are obtained by kernel smoothing. Therefore, especially for smaller sample sizes, the operators F̂_ω, ω ∈ (−π, π), might not be strictly non-negative, and may feature some tail negative eigenvalues of small modulus. To ensure numerical stability of the method in the following section, it is recommended to truncate these negative eigenvalues of F̂_ω at each frequency ω ∈ (−π, π).
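A sketch of the inversion (2.15) combined with the recommended negative-eigenvalue truncation, under the simplifying assumption that the estimated spectral density is available on a uniform frequency grid; the flat-spectrum toy input is illustrative:

```python
import numpy as np

def invert_and_truncate(F_grid, omegas, h):
    """Recover R_hat_h = int_{-pi}^{pi} F_hat_w e^{iwh} dw by the trapezoid
    rule, truncating negative eigenvalues of each F_hat_w beforehand."""
    F_pos = np.empty_like(F_grid)
    for i in range(len(omegas)):
        S = 0.5 * (F_grid[i] + F_grid[i].conj().T)      # symmetrise first
        lam, U = np.linalg.eigh(S)
        F_pos[i] = (U * np.clip(lam, 0, None)) @ U.conj().T
    integrand = F_pos * np.exp(1j * omegas * h)[:, None, None]
    dw = omegas[1] - omegas[0]                          # uniform grid assumed
    return (integrand.sum(axis=0) - 0.5 * (integrand[0] + integrand[-1])) * dw

# Flat spectrum F_w = I/(2 pi) should integrate back to R_0 = I and R_1 = 0.
omegas = np.linspace(-np.pi, np.pi, 401)
F_grid = np.tile(np.eye(3) / (2 * np.pi), (401, 1, 1)).astype(complex)
R0 = invert_and_truncate(F_grid, omegas, 0).real
R1 = invert_and_truncate(F_grid, omegas, 1).real
```

The eigenvalue clipping implements the truncation recommended in the text; for a spectral density that is already non-negative it leaves the input unchanged.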
If dimensionality reduction is of interest, one can truncate the spectral density operators F̂_ω at each frequency ω ∈ (−π, π) to an appropriate rank. Such dimensionality reduction is based on the Cramér-Karhunen-Loève expansion and was proven optimal in preserving the functional time series dynamics by Panaretos & Tavakoli (2013a), and independently by Hörmann et al. (2015). Since dimension reduction is not necessary for our theory and methods in the next section, we do not pursue it further.

Periodic Behaviour Identification
As discussed at the beginning of Section 2.4, the choice of Bartlett's span parameter L is related to the bandwidth for smoothing in the frequency domain. To achieve consistent spectral density estimation, the parameter L needs to be kept quite small (cf. condition (B10) and Theorem 2). However, for the purpose of exploratory data analysis, it is useful to explore the data for periodic behaviour in a similar way as a periodogram is used in the case of scalar time series.
When the periodicity examination is indeed of interest, we propose to evaluate the estimator (2.13) for a fairly large value of L. The selection of an adequate value of L is a question of the computational power available, because the computational time to evaluate (2.13) grows linearly in L. In the data analysis of Section 5 we work with L = 1000, which is roughly half the length of the considered time series.
Once the estimator (2.13) has been evaluated for a given value of L, we propose to calculate the trace of the spectral density operator at each frequency ω ∈ (0, π) and to plot it against ω. Peaks in this plot indicate periodic behaviour of the functional time series. The existence of periodicity is not only a useful insight into the nature of the data but may prompt us to approach the periodic behaviour in a different way, for example by modelling the periodicity in a deterministic way, as we do in the data analysis carried out in Section 5.
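The trace plot can be sketched as follows. The toy process has autocovariances R_h = cos(2πh/7) C, i.e. a weekly periodicity; the identity kernel C, the value L = 200 and the frequency grid are illustrative assumptions:

```python
import numpy as np

# Bartlett-weighted trace of the spectral density operator on a frequency grid.
C = np.eye(5)                      # illustrative covariance kernel
L = 200                            # large Bartlett span, as advocated in the text
omegas = np.linspace(0.01, np.pi, 400)
trace = np.empty(len(omegas))
for i, w in enumerate(omegas):
    F = C.astype(complex)
    for h in range(1, L):
        # R_h = cos(2 pi h / 7) C: a process with period 7 in the daily index
        F = F + 2 * (1 - h / L) * np.cos(2 * np.pi * h / 7) * np.cos(w * h) * C
    trace[i] = np.real(np.trace(F)) / (2 * np.pi)

peak = omegas[np.argmax(trace)]    # should sit near the weekly frequency 2 pi / 7
```

The maximiser of the trace curve recovers the weekly frequency 2π/7 ≈ 0.898, mirroring how peaks in the empirical trace plot flag periodic behaviour.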

Functional Data Recovery Framework and Confidence Bands
We now consider the problem of recovering the latent functional data {X t (x) : x ∈ [0, 1]} given the sparse noisy samples {Y tj }, and provide corresponding confidence bands.
Consider the random element X_T = [X_1, . . . , X_T] ∈ H^T composed of "stacked" functional data (formally, it is an element of the product Hilbert space H^T). Note that

E X_T = [µ, . . . , µ] =: M_T, (2.16)    Var(X_T) = S_T, where the operator blocks satisfy (S_T)_{t,t'} = R_{t−t'}, t, t' = 1, . . . , T. (2.17)

Now define the stacked observables Y_T = [Y_11, . . . , Y_{1N_1}, . . . , Y_{T1}, . . . , Y_{TN_T}] ∈ R^{N¹_T}, where N¹_T = ∑_{t=1}^{T} N_t is the total number of observations up to time T. By analogy to Y_T, stack the measurement errors {ε_tj} and denote this vector E_T ∈ R^{N¹_T}. Note that Var(E_T) = σ² I_{N¹_T}. Further define the evaluation operators H_t : H → R^{N_t}, g ↦ (g(x_t1), . . . , g(x_tN_t)) for each t = 1, . . . , T, and the stacked censor operator H_T : H^T → R^{N¹_T}, [g_1, . . . , g_T] ↦ (H_1 g_1, . . . , H_T g_T). Finally, define the projection operator P_t : H^T → H, [g_1, . . . , g_T] ↦ g_t for t = 1, . . . , T. In this notation we can rewrite the observation scheme (2.4) as

Y_T = H_T X_T + E_T.

The best linear unbiased predictor of X_T given Y_T, which we denote by X̂_T(Y_T), is given by the formula

X̂_T(Y_T) = M_T + S_T H*_T (H_T S_T H*_T + σ² I_{N¹_T})^{−1} (Y_T − H_T M_T), (2.18)

where * denotes the adjoint operator. The term H_T S_T H*_T is in fact a positive semi-definite matrix. Owing to the fact that σ² > 0, the matrix H_T S_T H*_T + σ² I_{N¹_T} is always invertible. Now fix s ∈ {1, . . . , T}. The best linear unbiased predictor of the functional datum X_s, which we denote by X̂_s(Y_T), is given by

X̂_s(Y_T) = µ + P_s S_T H*_T (H_T S_T H*_T + σ² I_{N¹_T})^{−1} (Y_T − H_T M_T). (2.19)

Hence the recovery of X_s by the formula (2.19) uses the observed data across all t = 1, . . . , T, borrowing strength across all the observations. In practice, however, we need to replace the unknown parameters involved in the construction of the predictor by their estimates. Define M̂_T and Ŝ_T by substituting µ̂ and R̂_h for their theoretical counterparts in formulae (2.16) and (2.17) respectively. Now replace M_T, S_T, σ² by M̂_T, Ŝ_T and σ̂², respectively, in formulae (2.18) and (2.19). The resulting predictors are denoted by X̃_T(Y_T) (2.20) and X̃_s(Y_T) (2.21). In order to construct confidence bands for the unobservable paths, we work under the Gaussian assumption: (A1) The functional time series {X_t}_t as well as the measurement errors {ε_tj}_tj are Gaussian processes.
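A minimal discretised sketch of the predictor (2.18): grids replace integrals, selection matrices replace the evaluation operators H_t, and the covariance choices (exponential kernel, geometric lag structure R_h = ρ^|h| C) are illustrative assumptions rather than the paper's model:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, sigma2, rho = 20, 4, 0.25, 0.6
x = np.linspace(0, 1, n)
C = np.exp(-np.abs(np.subtract.outer(x, x)) / 0.3)   # illustrative R_0

# Stacked covariance S_T with blocks (S_T)_{t,t'} = R_{t-t'} = rho^|t-t'| C.
S = np.block([[rho ** abs(t - s) * C for s in range(T)] for t in range(T)])

# Each curve is observed at a few random grid points (sparse sampling);
# H plays the role of the stacked evaluation operator H_T.
obs = [rng.choice(n, size=3, replace=False) for _ in range(T)]
rows = np.concatenate([t * n + idx for t, idx in enumerate(obs)])
H = np.zeros((len(rows), T * n))
H[np.arange(len(rows)), rows] = 1.0

# Simulate X_T ~ N(0, S), observe Y_T = H X_T + noise, then apply (2.18)
# with M_T = 0: X_hat = S H' (H S H' + sigma^2 I)^{-1} Y.
X = np.linalg.cholesky(S + 1e-9 * np.eye(T * n)) @ rng.normal(size=T * n)
Y = H @ X + rng.normal(0, np.sqrt(sigma2), len(rows))
X_hat = S @ H.T @ np.linalg.solve(H @ S @ H.T + sigma2 * np.eye(len(rows)), Y)
X_s = X_hat[:n]                                      # recovered first curve
```

Because S couples the blocks across time lags, the recovery of any single curve uses measurements from all T curves, which is precisely the "borrowing strength" described in the text.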
Thanks to the Gaussian assumption (A1), the predictors of X_T and X_s given by formulae (2.18) and (2.19) are in fact given by conditional expectations and are the best predictors among all predictors. Furthermore, we can calculate the exact conditional distribution of X_T given Y_T by the formula

X_T | Y_T ∼ N(M_{X_T|Y_T}, S_{X_T|Y_T}), (2.22)

where

M_{X_T|Y_T} = M_T + S_T H*_T (H_T S_T H*_T + σ² I_{N¹_T})^{−1} (Y_T − H_T M_T), (2.23)
S_{X_T|Y_T} = S_T − S_T H*_T (H_T S_T H*_T + σ² I_{N¹_T})^{−1} H_T S_T. (2.24)

From (2.22) we can access the conditional distribution of X_s for fixed s = 1, . . . , T, by writing

X_s | Y_T ∼ N(M_{X_s|Y_T}, S_{X_s|Y_T}), (2.25)    where M_{X_s|Y_T} = P_s M_{X_T|Y_T} and S_{X_s|Y_T} = P_s S_{X_T|Y_T} P*_s. (2.26)

To construct a band for X_s with pointwise coverage, we construct a confidence interval for X_s(x) at each x ∈ [0, 1]; as we will see, the endpoints of these intervals are continuous functions of x, and so automatically define a confidence band. In practice, one constructs bands for a dense collection of locations in [0, 1] and interpolates. Given the conditional distribution (2.25), the confidence interval for X_s(x) with pointwise coverage 1 − α is

( M_{X_s|Y_T}(x) − Φ^{−1}(1 − α/2) √(S_{X_s|Y_T}(x, x)), M_{X_s|Y_T}(x) + Φ^{−1}(1 − α/2) √(S_{X_s|Y_T}(x, x)) ). (2.27)

In practice, when we do not know the true dynamics of the functional time series, we have to use the estimates of µ(·) and R_h(·, ·). We define M̂_{X_T|Y_T}, Ŝ_{X_T|Y_T}, M̂_{X_s|Y_T} and Ŝ_{X_s|Y_T} by replacing M_T and S_T with M̂_T and Ŝ_T in the formulae (2.23), (2.24), (2.26) respectively. The asymptotic confidence interval for X_s(x) is then obtained by rewriting (2.27) using these empirical counterparts. (2.28)

For the construction of the simultaneous band we use the method introduced by Degras (2011). Fix s = 1, . . . , T. We derived the conditional distribution of X_s given Y_T in formula (2.25). Define the conditional correlation kernel

ρ_{X_s|Y_T}(x, y) = S_{X_s|Y_T}(x, y) / √(S_{X_s|Y_T}(x, x) S_{X_s|Y_T}(y, y)) whenever the denominator is positive, and 0 otherwise. (2.29)

Then, the collection of intervals

( M_{X_s|Y_T}(x) − z_{α,ρ} √(S_{X_s|Y_T}(x, x)), M_{X_s|Y_T}(x) + z_{α,ρ} √(S_{X_s|Y_T}(x, x)) ), x ∈ [0, 1], (2.30)

forms a (continuous) confidence band with simultaneous coverage probability (1 − α) over x ∈ [0, 1]. Here z_{α,ρ} is the (1 − α)-quantile of the law of sup_{x∈[0,1]} |Z(x)|, where {Z(x), x ∈ [0, 1]} is a zero-mean Gaussian process with covariance kernel ρ_{X_s|Y_T}. The definition of the quantile specifically requires that P(sup_{x∈[0,1]} |Z(x)| ≤ z_{α,ρ}) = 1 − α. Degras (2011) explains how to calculate this quantile numerically.
In practice, we replace the population-level quantities in (2.30) by their estimated counterparts and define the asymptotic simultaneous confidence band (2.31) accordingly, where M̂_{X_s|Y_T}(x) and Ŝ_{X_s|Y_T}(x, x) are as above and the quantile z_{α,ρ̂} is calculated for the correlation structure ρ̂_{X_s|Y_T}, defined as the empirical counterpart of (2.29). Note that Φ^{−1}(1 − α/2) < z_{α,ρ} for any correlation kernel ρ (Degras, 2011). Therefore, as expected, the pointwise confidence bands are enveloped by the simultaneous band. Once again, in practice, one evaluates the band limits defining (2.31) on a dense grid of [0, 1] and interpolates.
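The quantile z_{α,ρ} can be approximated by simulation, as a Monte Carlo stand-in for the numerical computation in Degras (2011); the grid size and number of draws are arbitrary choices:

```python
import numpy as np

def sup_gaussian_quantile(rho, alpha=0.05, n_sim=20000, rng=None):
    """Monte Carlo approximation of z_{alpha,rho}: the (1 - alpha)-quantile of
    sup_x |Z(x)| for a mean-zero Gaussian process with correlation matrix rho
    evaluated on a grid."""
    rng = rng or np.random.default_rng(4)
    lam, U = np.linalg.eigh(rho)                 # factor rho = root @ root.T
    root = U * np.sqrt(np.clip(lam, 0, None))
    Z = rng.normal(size=(n_sim, rho.shape[0])) @ root.T
    return np.quantile(np.abs(Z).max(axis=1), 1 - alpha)

# Sanity check: for a perfectly correlated process the supremum reduces to a
# single |N(0, 1)| variable, so z should approach Phi^{-1}(0.975) ~ 1.96.
z = sup_gaussian_quantile(np.ones((10, 10)))
```

For any non-degenerate correlation structure the simulated quantile exceeds Φ^{−1}(1 − α/2), consistent with the envelopment of the pointwise bands by the simultaneous band.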
The functional recovery framework proposed in this section can be easily extended to forecasting, i.e. prediction of functional curves beyond the time horizon T. For the details, see Section A.4.

On the Choice of Mixing Conditions
In Sections 3.2 and 3.3 we develop asymptotic theory for our methodology under two different sets of assumptions. Firstly, in Section 3.2 we establish the asymptotic behaviour of the estimators under Brillinger-type cumulant mixing conditions. The corresponding Theorems 1 and 2 are in a sense canonical, in that their proofs rely on generalisations of the techniques of Yao et al. (2005a). Nevertheless, the resulting convergence rates for one-dimensional smoothing and surface smoothing are O_P(1/(√T B_µ)) and O_P(1/(√T B_R²)), respectively, which are not optimal.
The optimal rates for one-dimensional smoothing and surface smoothing are known to be O_P(√(log T/(T B_µ))) and O_P(√(log T/(T B_R²))), respectively. Recovering such rates using local-regression methods for time-series data relies heavily on the employed measure of weak dependence, namely strong mixing conditions (Hansen, 2008; Liebscher, 1996; Masry, 1996; Fan & Yao, 2008, Thm 6.5), geometric strong mixing conditions (Bosq, 2012b, Thm. 2.2 and Cor. 2.2), and ρ-mixing conditions (Peligrad, 1992). In Section 3.3 and Theorems 3 and 4, we make use of techniques developed by Hansen (2008) to obtain the optimal rates under strong mixing.
Since these two sets of rates rest on qualitatively different conditions, we have chosen to include both results in the article.

Asymptotic Results under Cumulant Mixing Conditions
In order to establish the consistency and the convergence rates of the estimators introduced in Section 2, we will make use of the following further assumptions on the model (2.4):

(B1) The numbers of measurements N_t at time t are independent random variables with law N_t ∼ N, where N ≥ 0, E(N) < ∞ and P(N > 1) > 0.

(B2) The measurement locations x_tj, j = 1, . . . , N_t, t = 1, . . . , T, are independent random variables generated from the density g(·) and are independent of the numbers of measurements (N_t)_{t=1,...,T}. The density g(·) is assumed to be twice continuously differentiable and strictly positive on [0, 1].
We allow the event {N_t = 0} to potentially have positive probability. This corresponds to the situation where no measurements are available at time t, for example when we additionally have data missing at random. We also need to impose smoothness conditions on the unknown functional parameters:

(B3) The common mean function µ(·) is twice continuously differentiable on [0, 1].

(B4) Each autocovariance kernel R_h(·, ·) is twice continuously differentiable on [0, 1]², and the partial derivative ∂^{α₁}_x ∂^{α₂}_y R_h(x, y) is uniformly bounded in h for all combinations of α₁, α₂ ∈ N₀ with α₁ + α₂ = 2.
To prove the consistency of the autocovariance kernel estimators R̂_h(·, ·) we need to further assume some mixing conditions in the time domain. The smoothing estimators are essentially moment-based; therefore it is natural to consider cumulant-type summability conditions. For an introduction to the cumulants of real random variables see Rosenblatt (1985), and for the definitions and properties of the cumulant kernels and cumulant operators see Panaretos & Tavakoli (2013b).
We will also need to strengthen the summability assumption (2.1).
The last two conditions correspond to conditions C(1,2) and C(0,4) in Panaretos & Tavakoli (2013b), respectively. Finally, we impose assumptions on the decay rates of the bandwidth parameters and on the growth rate of the Bartlett span parameter L. We may now state our asymptotic results on uniform consistency and convergence rates. Theorem 1. Under the assumptions (B1)-(B5) and (B7)-(B9), for fixed lag h ∈ Z, the estimators µ̂ and R̂_h(·, ·) are uniformly consistent, converging at the rates O_P(1/(√T B_µ)) and O_P(1/(√T B_R²)) stated in Section 3.1. Theorem 2. Under the assumptions (B1)-(B5) and (B7)-(B10), the spectral density is estimated consistently: sup_{ω∈(−π,π)} ‖f̂_ω − f_ω‖_∞ → 0 in probability. If we further assume condition (B6), we can additionally obtain the convergence rate. As a consequence of Theorem 2, we obtain the consistency and convergence rate of the estimator of the entire space-time covariance structure (2.15), i.e. rates uniform in both the time index and the spatial argument, and, assuming further (B6), the corresponding rate (3.6).

Asymptotic Results Under Strong Mixing Conditions
We begin by listing the assumptions leading to the optimal convergence rates. Besides imposing the key assumption of strong mixing, we need to strengthen some of the other assumptions as well. We require additional regularity conditions on the smoothing kernel K(·), which until now was assumed only to be a bounded probability density function. The condition is formulated for a generic multivariate kernel k(·) because we will require more than just the smoothing kernel K(·) to satisfy it.

(C1) The function k : R^d → R is bounded and integrable, and for some Λ_1 < ∞ and L < ∞, either k(u) = 0 for |u| > L and k satisfies a Lipschitz condition with constant Λ_1, or k is differentiable with a bounded and suitably decaying derivative.

The following conditions impose further requirements on the functional time series model.
(D1) The functional time series {X_t}_{t∈Z} is strictly stationary and strong mixing, with mixing coefficients α_m satisfying α_m ≤ A m^{−β} for some A < ∞ and β > 0, and with finite moments of order s for some s > 2.

For the estimation of the mean function µ(·), the following assumptions are required:

(D6) The smoothing kernel K(·) satisfies ∫ |u|^4 K(u) du < ∞, and the functions u ↦ K(u), u ↦ uK(u), and u ↦ u²K(u) satisfy assumption (C1).

(D7) The coefficients β and s from Assumption (D1) satisfy a compatibility condition with the bandwidth rates.

(D9) The functions g(·) and g(·)µ(·) are twice continuously differentiable on [0, 1].
The rates for the lag-h autocovariance kernel estimators will require a further set of assumptions, and additional conditions will be required for the rates concerning spectral density estimation.
(D14) Assume summability, in the supremum norm, of the fourth-order cumulant kernel of {X_t}.

We can now state the main consistency and convergence results under the strong mixing conditions.

Functional Data Recovery and Confidence Bands
In this section we turn our attention to developing asymptotic theory for the recovered functional data and the associated confidence bands, in particular, the asymptotic behaviour of the plug-in estimator (2.21) vis-à-vis its theoretical counterpart (2.19).
First of all, we need to clarify what asymptotic result we can hope to establish. Before venturing into functional time series, let us comment on the asymptotic results for independent identically distributed functional data (Yao et al., 2005a). As the number of sparsely observed functional data grows to infinity, one can consistently estimate the second-order structure of the stochastic process (which in this case reduces to the lag-zero autocovariance, due to independence). This is then used in the plug-in prediction of a given functional datum, say X_s(·), given the sparse measurements on this datum. In the limit, this prediction is as good as if we knew the true lag-zero covariance of the stochastic process (Yao et al., 2005a, Theorem 3). Because the predictor uses the estimate of the lag-zero covariance based on all the observed data, Yao et al. (2005a) refer to this trait as borrowing strength from the entire sample.
In the time series setting of the current paper, one can expand the concept of borrowing strength from the entire sample. As the number of sparsely observed functional data (i.e. the time horizon T) grows to infinity, one can not only estimate the dynamics of the functional time series consistently (Theorem 2 and Corollary 1), but also exploit the fact that neighbouring data are correlated to further improve the recovery. Because of the weak dependence, the influence of the observations decreases as we move away from time s. We therefore fix a span of times 1, …, S, where s < S ∈ N, and consider the prediction of X_s given the data in this span. To be precise, we are going to prove that the prediction of X_s from the data in the local span, based on the dynamics estimated from the complete dataset, is, in the limit, as good as the prediction based on the true (unknown) dynamics. In our case, then, we are borrowing strength across the sample in a twofold sense: first for the estimation of the functional time series dynamics, and then for the prediction of the functional datum X_s.
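The structure of such a linear predictor can be sketched generically (our own notation, paraphrasing the role of the predictors of Section 2.6; the precise definitions are in Section 2):

```latex
\widehat{X}_s(x) \;=\; \mu(x) \;+\; \mathbf{c}_s(x)^{\top}\, \boldsymbol{\Sigma}^{-1}\,
\bigl(\mathbf{Y} - \mathbb{E}\,\mathbf{Y}\bigr),
```

where **Y** stacks the measurements Y_tj for times t = 1, …, S in the span; the vector **c**_s(x) collects the cross-covariances cov(X_s(x), Y_tj) = R_{s−t}(x, x_tj); and **Σ** has entries R_{t−t′}(x_tj, x_{t′j′}) + σ² 1{(t, j) = (t′, j′)}. It is the matrix **Σ** that must be inverted, which is why long spans S become numerically demanding.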
The span S can in principle be chosen as large as one wishes, but it is held fixed with respect to T. This is justified by the weak dependence assumption. In practice, one must also keep numerical considerations in mind and not choose S exceedingly large, since the evaluation of the predictors (2.19) and (2.21) based on longer spans requires the inversion of a large matrix.
We formulate Theorems 5 and 6 under the cumulant mixing conditions required for Theorems 1 and 2. Nevertheless, the conclusions also hold under the strong mixing regime of Theorems 3 and 4 since, as is apparent from the proofs, the only requirement coming into play is the consistency of the spectral density operator estimators in the sense of (3.4) or (3.7).

Simulation Setting
In this section, we present a simulation study to assess the finite-sample performance of our methodology. To this aim, we simulate realisations of functional linear processes, namely functional moving average processes and functional autoregressive processes. These provide a good framework for investigating our methods since their spectral density operators can be calculated explicitly in closed form. Specifically, we consider:

• Functional moving average process

The (Gaussian) functional moving average process of order q is given by (Bosq, 2012a)

X_t = µ + E_t + B_1 E_{t−1} + ⋯ + B_q E_{t−q},

where µ ∈ H is the mean function, B_j, j = 1, …, q, are bounded linear operators on H, and {E_t} is zero-mean Gaussian noise with a trace-class covariance operator S. The functional moving average process is a stationary linear process (Bosq, 2012a) and clearly satisfies assumption (2.1) in the nuclear norm, and thus admits a spectral density in the operator sense. Though the calculation of the spectral density of the functional moving average process is straightforward, we are not aware of its having been stated elsewhere in this functional form.
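As an illustration of how such a process can be simulated on a grid, the following sketch reuses the covariance kernel S of Section 4 but picks its own illustrative moving-average kernels B_j (these are not the paper's choices):

```python
import numpy as np

def simulate_fma(T, q, grid_pts=21, seed=0):
    """Simulate an FMA(q) process X_t = mu + eps_t + sum_j B_j eps_{t-j}
    on an equidistant grid; the B_j kernels are illustrative choices."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, grid_pts)
    dx = x[1] - x[0]
    mu = 4.0 * np.sin(1.5 * np.pi * x)
    # illustrative smooth MA operator kernels B_j(x, y), geometrically damped
    B = [0.8 ** j * np.exp(-(x[:, None] - x[None, :]) ** 2) for j in range(1, q + 1)]
    # noise covariance S(x, y) as in Section 4 (rank two, hence trace class)
    S = (1.4 * np.outer(np.sin(2 * np.pi * x), np.sin(2 * np.pi * x))
         + 0.6 * np.outer(np.cos(2 * np.pi * x), np.cos(2 * np.pi * x)))
    L = np.linalg.cholesky(S + 1e-10 * np.eye(grid_pts))   # jitter: S is rank 2
    eps = (L @ rng.standard_normal((grid_pts, T + q))).T   # innovations, (T+q, grid)
    X = np.empty((T, grid_pts))
    for t in range(T):
        # integral operators applied by the rectangle rule: int B_j(x,y) eps_{t-j}(y) dy
        X[t] = mu + eps[t + q] + sum(
            Bj @ eps[t + q - j] * dx for j, Bj in enumerate(B, start=1))
    return x, X
```

The operators act by quadrature (matrix–vector product times the grid spacing), which is the standard finite-dimensional surrogate for an integral operator on L²([0, 1]).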
Proposition 1. The functional moving average process defined above admits the spectral density in the operator sense (2.2). Moreover, if the kernels corresponding to the operators B 1 , . . . , B q are smooth, the spectral density exists also in the kernel sense (2.2) and the process satisfies the assumptions (B4), (B5), (B6). If the mean function µ(·) is smooth, the process satisfies also (B3).

• Functional autoregressive process
The (Gaussian) functional autoregressive process of order 1, thoroughly reviewed in Bosq (2012a), is defined by the iteration

X_t − µ = A(X_{t−1} − µ) + E_t,

where {X_t} is a functional time series in the Hilbert space H = L²([0, 1]), µ ∈ H is the mean function, A is a bounded linear operator on H, and {E_t} is zero-mean Gaussian noise with a trace-class covariance operator S. Bosq (2012a) showed that if the transition operator satisfies ‖A‖ < 1 (in the operator norm on H), then there exists a unique Gaussian stationary solution to equation (4.3). The formula for the spectral density of the functional autoregressive process is analogous to the finite-dimensional vector autoregression case (cf. Priestley (1981, §9.4)), but its extension to the functional setting appears to be a novel contribution.

Proposition 2. The functional autoregressive process of order 1 solving equation (4.3) with ‖A‖ < 1 satisfies assumption (2.1) in the operator sense, and admits the spectral density in the operator sense (2.2). Moreover, if the kernels corresponding to the operators A and S are smooth, the spectral density exists also in the kernel sense (2.2) and the process satisfies assumptions (B4), (B5), (B6). If the mean function µ(·) is smooth, the process satisfies also (B3).
For our simulations we choose µ(x) = 4 sin(1.5πx). The autoregressive operator A = A_c is the integral operator with kernel A_c(x, y) = κ_c exp(−(x + 2y)²), where the scaling constant κ_c is chosen so that ‖A_c‖ = c. We vary c to control the degree of temporal dependence, taking c ∈ {0.7, 0.9}. The covariance operator S is the integral operator with kernel S(x, y) = 1.4 sin(2πx) sin(2πy) + 0.6 cos(2πx) cos(2πy). In the simulation results we denote the resulting two processes by FAR(1)_{0.7} and FAR(1)_{0.9} for c = 0.7 and c = 0.9, respectively.
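A sketch of this construction on a finite grid (our own discretization; the operator norm is approximated by the largest singular value of the quadrature matrix, and a burn-in period stands in for starting from the stationary distribution):

```python
import numpy as np

def simulate_far1(T, c=0.7, grid_pts=21, burn=200, seed=0):
    """Simulate the FAR(1) process X_t - mu = A(X_{t-1} - mu) + eps_t with
    A_c(x, y) = kappa_c * exp(-(x + 2y)^2), kappa_c scaled so that the
    discretized operator norm of A_c equals c."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, grid_pts)
    dx = x[1] - x[0]
    mu = 4.0 * np.sin(1.5 * np.pi * x)
    A = np.exp(-(x[:, None] + 2.0 * x[None, :]) ** 2)
    # the integral operator acts as f -> (A f) * dx; rescale its spectral norm to c
    A *= c / np.linalg.norm(A * dx, ord=2)
    S = (1.4 * np.outer(np.sin(2 * np.pi * x), np.sin(2 * np.pi * x))
         + 0.6 * np.outer(np.cos(2 * np.pi * x), np.cos(2 * np.pi * x)))
    L = np.linalg.cholesky(S + 1e-10 * np.eye(grid_pts))
    Xc = np.zeros(grid_pts)                     # centred state X_t - mu
    out = np.empty((T, grid_pts))
    for t in range(burn + T):                   # burn-in towards stationarity
        Xc = (A @ Xc) * dx + L @ rng.standard_normal(grid_pts)
        if t >= burn:
            out[t - burn] = mu + Xc
    return x, out
```

Since ‖A‖ < 1 after the rescaling, the iteration is a contraction and the burn-in drives the state close to the stationary Gaussian law.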
The simulations must, of course, be performed in finite dimension. We performed them in the third-order B-spline basis created by placing 20 equidistant knots on the interval [0, 1]; the basis thus comprises 21 elements. The B-spline basis is efficient in expressing smooth functions (Ramsay & Silverman, 2007).
The sparse observations are then obtained by the following process. We set a maximum number of locations to be sampled, N_max ∈ {5, 10, 20, 30, 40}. For each t = 1, …, T, a random integer N_t is independently drawn from the uniform distribution on {0, 1, …, N_max}. Next, for each t = 1, …, T, we independently draw N_t random locations x_tj, j = 1, …, N_t, from the uniform distribution on [0, 1]. At each location, an independent Gaussian measurement error ε_tj ∼ N(0, σ²) is added, and the ensemble Y_tj = X_t(x_tj) + ε_tj, j = 1, …, N_t, t = 1, …, T, is used as the dataset for the estimation procedure. This observation protocol therefore satisfies assumptions (B1) and (B2).
The measurement error variance is chosen so that the ratio tr(R_0)/σ², which we interpret as a basic signal-to-noise ratio, equals 20. The same signal-to-noise ratio was used in the simulation study of Yao et al. (2005a). Further simulation results of ours, not reported here, indicate that moderate variations of the signal-to-noise ratio do not change the conclusions of this simulation study.
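The full observation protocol, including the choice of σ² from the signal-to-noise ratio, can be sketched as follows (the trace of R_0 is approximated empirically from the simulated curves; curve values between grid points are obtained by linear interpolation, an implementation shortcut of ours):

```python
import numpy as np

def sparsify(x_grid, X, N_max=10, snr=20.0, seed=1):
    """Observation protocol of Section 4: N_t ~ Uniform{0,...,N_max} locations
    per curve, uniform on [0, 1], with additive N(0, sigma^2) noise where
    sigma^2 = tr(R_0) / snr (snr = 20 in the paper)."""
    rng = np.random.default_rng(seed)
    T = X.shape[0]
    dx = x_grid[1] - x_grid[0]
    # tr(R_0) = int R_0(x, x) dx, approximated from the sample of curves
    sigma2 = np.sum(X.var(axis=0)) * dx / snr
    data = []                                   # tuples (t, x_tj, Y_tj)
    for t in range(T):
        N_t = rng.integers(0, N_max + 1)        # inclusive of N_max
        locs = rng.uniform(0.0, 1.0, size=N_t)
        vals = np.interp(locs, x_grid, X[t]) + rng.normal(0.0, np.sqrt(sigma2), N_t)
        data.extend(zip([t] * N_t, locs, vals))
    return data, sigma2
```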

Estimation of the Spectral Density
In this subsection we quantify the estimation error of the spectral density estimator (2.13) in our simulation setting. In particular, we explore the dependence of the estimation error on the length T of the time series and on N_max, which governs the average number of measurements per curve.
For each of the considered processes and for each pair of sample size parameters T ∈ {150, 300, 450, 600, 900, 1200} and N_max ∈ {5, 10, 20, 30, 40}, we simulated 100 independent realisations and ran the estimation procedure introduced in Sections 2.3 and 2.4. In each case, the tuning parameters B_µ, B_R, and B_V are selected by K-fold cross-validation as explained in Section A.1. The selection of Bartlett's span parameter is discussed in Section A.2. Based on the results of the simulation study, we introduce a simple selection rule that works well for spectral density estimation. The optimal L clearly depends on the (unknown) dynamics of the functional time series. As a compromise across the simulated processes, we propose the selection rule (4.5), where N̄ is the average number of measurements per curve and ⌊·⌋ denotes the integer part of a real number. The rule (4.5) was hand-picked for the considered range of T and N_max and should not be extrapolated, in particular not to dense observation schemes.

We measure the quality of the spectral density estimation by the relative mean square error defined in (4.6), where f̂_ω(·, ·) and f_ω(·, ·) are respectively the estimated and the true spectral density kernels at frequency ω ∈ (−π, π). Due to space constraints, Table 1 presents the results only for the functional moving average process of order 4, FMA(4); the results for the remaining considered processes are reported in Section C.

Table 1: Average relative mean square errors (defined in (4.6)) of the spectral density estimators for the functional moving average process of order 4 (FMA(4)) and varying sample sizes. The numbers in parentheses are the standard deviations of the relative mean square error. Each cell of the table (each error and its standard deviation) is the result of 100 independent simulations. Bartlett's span parameter L was selected by the rule (4.5).
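A discretized version of this error metric, with the spectral densities evaluated on uniform grids of frequencies and spatial points, can be computed as follows (our reading of (4.6); on uniform grids the quadrature weights cancel in the ratio):

```python
import numpy as np

def relative_mse(f_hat, f_true):
    """Relative mean square error between an estimated and a true spectral
    density, each given as a list of kernel matrices indexed by a common
    uniform frequency grid: integrated squared error of the kernels divided
    by the integrated squared norm of the truth."""
    num = sum(np.sum(np.abs(fh - ft) ** 2) for fh, ft in zip(f_hat, f_true))
    den = sum(np.sum(np.abs(ft) ** 2) for ft in f_true)
    return num / den
```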
Concerning the results of Table 1, one can raise an interesting design question: provided one has a fixed budget for the total number of measurements, should one opt to record fewer spatial measurements over a longer time interval (a long but sparsely observed time series), or rather record dense spatial measurements over a shorter time period (a short but densely observed time series)?
In order to answer this question, we define a simple linear model to assess the dependence of the relative mean square error on the sample size parameters T and N_max. For each of the considered processes we fit the linear model (4.7), where RMSE(N_max, T) is the average relative mean square error for the parameters T and N_max, (β_0, β_1, β_2) are the regression parameters, and e is a homoskedastic model error.
The least squares fit of (4.7) yields (β̂_0, β̂_1, β̂_2) = (1.98, −0.32, −0.57). The coefficient β̂_2 is larger than β̂_1 in absolute value; therefore a relative increase of the time horizon T has a stronger effect in reducing the relative mean square error of the estimated spectral density than the same relative increase in the number of points per curve. The apparent conclusion is that, in order to estimate the spectral density of a smooth functional time series, the better strategy is to invest in a longer time horizon T rather than a denser sampling regime.
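Such a fit can be carried out by ordinary least squares; the log-log specification below is our assumption, chosen because it matches the "relative increase" interpretation of the coefficients given above:

```python
import numpy as np

def fit_design_model(rmse, n_max, t_len):
    """Least-squares fit of log RMSE = b0 + b1*log(N_max) + b2*log(T) + e,
    our assumed form of the design model (4.7)."""
    Z = np.column_stack([np.ones(len(rmse)), np.log(n_max), np.log(t_len)])
    beta, *_ = np.linalg.lstsq(Z, np.log(rmse), rcond=None)
    return beta
```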

Recovery of Functional Data from Sparse Observations
In this section, we examine the performance of the functional recovery procedure proposed in Section 2.6. We compare the recovery performance of our dynamic predictor (2.21), henceforth called the dynamic recovery, with its static version, which relies only on the lag-zero covariance and hence does not exploit the temporal dependence; we call this predictor the static recovery. The static recovery is in fact the predictor (2.21) with Bartlett's span parameter L set to 1.
We simulate 100 independent realisations for each of the considered functional moving average processes FMA(2), FMA(4), FMA(8) and functional autoregressive processes FAR(1)_{0.7}, FAR(1)_{0.9} (defined in Section 4.2), and for each combination of the sample size parameters T ∈ {150, 300, 450, 600, 900, 1200} and N_max ∈ {5, 10, 20, 30, 40}. Again, due to space constraints, we state here the results only for the functional moving average process of order 4, FMA(4); the results for the other considered processes are stated in Section C.
For each dataset we run the estimation procedure from Sections 2.3 and 2.4. The tuning parameters B µ , B R , and B V are selected by K-fold cross-validation as explained in Section A.1. The parameter L is selected again by the rule (4.5).
We define the functional recovery (either dynamic or static) relative mean square error analogously to (4.6), where X̂_t is the recovered functional curve at t = 1, …, T, obtained either dynamically or statically, and X_t is the true (unobserved) functional datum.
A key factor in the quality of the functional recovery is the estimate σ̂² of the additive measurement error variance σ². A very small value of σ̂² can make the matrix to be inverted in (2.20) ill-conditioned, resulting in a defective recovery of the functional data. Because such occasional failures distort the mean of the relative mean square errors, we report the median of the relative mean square errors as a more robust indicator of the typical recovery error.
We calculate the relative gain (4.9), where RMSE(static) is the median relative mean square error of the static recovery and RMSE(dynamic) is the median relative mean square error of the dynamic recovery. Table 2 summarizes the relative gains of the dynamic recovery over the static recovery. Unsurprisingly, the relative gain is strikingly large for sparser designs. This can be explained by the fact that in sparse designs there is not enough information to interpolate the functional curves themselves, so the observed data in neighbouring curves become crucial for the recovery. That being said, even when the number of points sampled per curve is as high as 40, the improvement remains substantial, demonstrating that the new methodology should be preferred over methods designed for the i.i.d. case whenever dependence is present.

Table 2: Relative gain (4.9) between the median relative mean square error of the dynamic recovery and that of the static recovery. A positive percentage signifies that the dynamic recovery has the smaller error. Simulations from the functional moving average process of order 4, FMA(4). Each cell of the table is based on 100 independent simulations.

Data Analysis: Fair-Weather Atmospheric Electricity
The atmosphere is weakly conductive due to the ionization of molecules, and this conductivity can be continuously measured through a variable called atmospheric electricity (Tammet, 2009). The ionization is the outcome of complicated physical-chemical processes that depend on the current weather conditions. Since weather conditions other than fair affect and alter these processes (Israelsson & Tammet, 2001), climatologists are interested in analysing the atmospheric electricity variable only under fair weather conditions (the definition of fair weather is given later). Analyses under fair weather conditions are of particular interest because the fair-weather electricity variable is a valuable source of information in global climate research (Tammet, 2009) as well as with regard to air pollution (Israelsson & Tammet, 2001). Tammet (2009) published an open-access database of atmospheric electricity time series accompanied by some meteorological variables. Most of the data come from weather stations across the former Soviet Union states, and their quality is assessed as high (Tammet, 2009). In this article, we analyse the time series of one weather station, namely that measured near Tashkent, Uzbekistan. The atmospheric electricity was recorded between the years 1989 and 1993 in the form of hourly averages. Besides the atmospheric electricity, a number of other meteorological variables were measured, of which we use two: the wind speed and the total cloudiness.
The definition of the fair-weather criteria is not simple and can often be relatively subjective (Xu et al., 2013). Inspired by criteria in the climatology literature (Xu et al., 2013; Israelsson & Tammet, 2001), we define the weather conditions as fair if the particular hourly measurement satisfies all of the following conditions:

• the wind speed is less than 20 km/h,
• the sky is clear (the total cloudiness variable is equal to 0),
• the atmospheric electricity E satisfies 0 < E < 250 V/m.

Because of these fair-weather criteria (and some genuinely missing data in the database), the resulting fair-weather electricity time series is in fact an unevenly sampled time series. Nevertheless, we assume there exists an underlying continuous truth, corresponding to the atmospheric electricity had the weather been fair. This latent process of fair-weather atmospheric electricity is considered smooth, and its values are observed only under fair-weather conditions, possibly with a deviation from the truth (noise). Based on the natural mechanisms discussed above, we regard as justified the assumption that the censoring protocol is independent of the underlying fair-weather atmospheric electricity process.
The underlying fair-weather atmospheric electricity process is a continuous scalar time series. Previous research (Hörmann & Kokoszka, 2010; Hörmann et al., 2015, 2016; Aue et al., 2015) has demonstrated the usefulness of segmenting a continuous scalar time series into segments of an obvious periodicity, usually days, thus constructing a functional time series. A key benefit of this practice is the separation of the intra-day variability from the temporal dependence across days, while preserving a fully nonparametric model. We use the same approach in our analysis. We segment the (latent) continuous time series into days and consider each day as an unobserved (latent) functional datum defined on [0, 24]. We place the hourly observations in the middle of each hour interval, i.e. at 0.5, 1.5, 2.5, …, 23.5. Because of the fair-weather criteria above, the constructed fair-weather atmospheric electricity time series falls into the sparsely observed functional time series framework defined in Section 2.2. Figure 2 presents an overview of the considered fair-weather atmospheric electricity time series, accompanied by monthly and yearly means. Figure 3 provides a zoomed-in view of a stretch of data over 4 consecutive days.
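The filtering and segmentation steps can be sketched as follows (a minimal sketch of our own; the variable names are assumptions, and hours are counted from the start of the record):

```python
import numpy as np

def fair_weather_days(hours, electricity, wind, cloud):
    """Apply the fair-weather criteria to hourly records and segment the
    surviving measurements into daily sparse functional data:
    day index -> list of (hour-of-day + 0.5, E), locations at mid-hours."""
    hours = np.asarray(hours, dtype=float)
    electricity = np.asarray(electricity, dtype=float)
    wind = np.asarray(wind, dtype=float)
    cloud = np.asarray(cloud, dtype=float)
    # fair weather: wind < 20 km/h, clear sky, 0 < E < 250 V/m
    fair = (wind < 20) & (cloud == 0) & (electricity > 0) & (electricity < 250)
    days = {}
    for h, e in zip(hours[fair], electricity[fair]):
        t = int(h // 24)                                  # day index
        days.setdefault(t, []).append((h % 24 + 0.5, e))  # mid-hour location
    return days
```

Days in which no hourly record passes the criteria simply do not appear in the output, which is exactly the {N_t = 0} case allowed by assumption (B1).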
In summary, the fair-weather atmospheric electricity functional time series has the following features:

• the data are recorded over 5 years, so the time horizon of the functional time series is T = 1826 (days),
• the measurements are sparse within each day, among the days with at least one measurement.
The statistical question raised is the following. Benefiting from the separation of intra-day variability and temporal dependence across the days, can we fit an interpretable model of the process dynamics? Additionally, we aim to recover the latent functional data, fill in the gaps in the data, remove the noise, and construct confidence bands.
We analyse the fair-weather atmospheric electricity data by the methods of Section 2. First, after removing the intra-day mean structure by subtracting the estimate μ̂(·), we inspect the periodicity identification chart introduced in Section 2.5. Specifically, we construct the said chart with L = 1000 and plot the trace of the estimated spectral density operator against frequencies ω ∈ (0, π). We identify the peaks of this plot as suggesting the presence of periodicities at the corresponding frequencies.
The largest peak in Fig. 4 clearly corresponds to a yearly periodicity, together with a half-year harmonic. The peak is not located exactly at 365 days because of a combination of the following factors: the discretisation of the frequency grid, numerical rounding, and, most likely, the slight smoothing induced by L = 1000.
Once the yearly periodicity is discovered, we opt to model it deterministically, as is usual in (scalar) time series analysis. We therefore propose the model

Y_tj = µ(x_tj) + s_t + X_t(x_tj) + ε_tj,    (5.1)

where Y_tj are the measurements observed at locations x_tj, µ(·) is the intra-day mean, s_t is the yearly seasonality adjustment, and the "residual" process X_t(·) is a zero-mean stationary weakly dependent functional time series. The assumptions of an additive relation between µ(·) and s_t, as well as the stationarity of X_t(·), were justified by exploratory analysis. We fit the model (5.1) in the following order. First, we estimate µ(·) by a local-linear smoother. Since we expect the mean function to be periodic, we assume µ(0) = µ(24) and modify the estimator (2.5) to measure the distance between x and x_tj as if the endpoints of the interval [0, 24] were joined. Having estimated μ̂(·), we estimate the yearly periodic seasonality adjustment s_t, again by a local-linear smoother and again assuming continuity between the first and last day of the year. The smoothing parameter was chosen by leave-one-year-out cross-validation. Figure 5 presents the estimates μ̂(·) and ŝ_t. We observe that the intra-day mean exhibits two peaks, at around 4 a.m. and 3 p.m. The yearly seasonality is almost sinusoidal, with low values in spring and summer and high values in autumn and winter.
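The periodic modification of the local-linear smoother amounts to wrapping the distance around the circle of circumference 24; a minimal sketch (the Epanechnikov kernel is an illustrative choice of ours):

```python
import numpy as np

def circular_loclin(x_eval, x_obs, y_obs, bandwidth, period=24.0):
    """Local-linear smoother with a circular distance on [0, period], so that
    the fit is continuous across the endpoints (mu(0) = mu(24))."""
    x_obs = np.asarray(x_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    out = np.empty(len(x_eval))
    for i, x0 in enumerate(x_eval):
        # signed distance wrapped to (-period/2, period/2]
        d = (x_obs - x0 + period / 2) % period - period / 2
        u = d / bandwidth
        w = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)  # Epanechnikov
        Z = np.column_stack([np.ones_like(d), d])               # local line in d
        WZ = Z * w[:, None]
        # weighted least squares normal equations; intercept = local fit
        beta, *_ = np.linalg.lstsq(Z.T @ WZ, WZ.T @ y_obs, rcond=None)
        out[i] = beta[0]
    return out
```

Near x = 0, observations recorded just before x = 24 receive full weight through the wrapped distance, which is exactly the continuity-across-midnight behaviour described above.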
Once the first-order structure given by µ(·) and s_t is estimated, we calculate the raw covariance (2.6) by subtracting both μ̂(x) and ŝ_t. The lag-0 covariance kernel R_0(·, ·) is estimated by (2.7). For the estimation of the components of (2.10), namely V̂(·) and R̂_0(·), we use the same periodicity adjustment as for μ̂(·), because we expect the marginal variance (with and without the ridge contamination) to be continuous across midnight. For illustration and interpretation purposes we also estimate the lag-1 autocovariance R_1(·, ·) by (2.11). Figure 6 shows the surface plots of these estimates. An interesting element of the estimated lag-0 covariance kernel is the peak at afternoon hours, signifying a higher marginal variance of the fair-weather atmospheric electricity in the afternoon. The estimated lag-0 correlation kernel demonstrates that observations measured close to each other are highly correlated, with the correlation diminishing as the distance grows. The estimated lag-1 autocovariance and autocorrelation kernels show that the correlation between two consecutive days is positive. The lag-1 autocorrelation kernel features a surface lifted up to correlation 1 in the eastern corner of the surface plot. The clear interpretation is that the late hours of one day are strongly correlated with the early morning hours of the following day.

Figure 6: Top-left: R̂_0(·, ·), the estimated lag-0 covariance kernel R_0(·, ·) (surface), with the ridge contamination by the measurement error (red). Top-right: the correlation kernel corresponding to R̂_0(·, ·). Bottom-left: R̂_1(·, ·), the estimated lag-1 autocovariance kernel R_1(·, ·). Bottom-right: the correlation kernel corresponding to R̂_1(·, ·).
In order to estimate the spectral density consistently, we need to select a moderate value of Bartlett's span parameter L. Plugging the size of the dataset into formula (4.5), we set L = 19. Figure 7 presents several views of the estimated spectral density kernels.
Once the spectral density is estimated, we apply the functional recovery method of Section 2.6 and estimate the unobserved functional data. The method produces estimates of the intra-day profiles of fair-weather atmospheric electricity, which can be interpreted as the predicted atmospheric electricity had the weather been fair at the given time, free of the modelled noise. As a by-product, the method fills in the gaps in the data (the stretches of days without any measurement). A further output is the construction of confidence bands (under the Gaussianity assumption). Figure 8 presents 4 consecutive days with the estimated (noiseless) fair-weather atmospheric electricity together with 95%-simultaneous confidence bands. It is important to note that these bands are designed to cover the assumed smooth underlying functional data, not the observed data produced by adding measurement errors to the smooth underlying process.

Figure 8: Fair-weather atmospheric electricity hourly measurements (red points) over 4 consecutive days; functional recovery of the latent smooth fair-weather atmospheric electricity process (blue); 95%-simultaneous confidence bands for the functional data of the said latent process (yellow).

Appendix A: Practical Implementation Concerns
A.1. Selection of bandwidths B_µ, B_R, and B_V

Our estimation methodology involves three bandwidth parameters, B_µ, B_R, and B_V, that need to be selected by some data-driven criterion. To reduce the computational cost, we choose to perform the selection of the parameters in a successive fashion. The selection of a bandwidth parameter in kernel smoothing has been extensively studied in the literature on local polynomial regression. The classical selector of Ruppert et al. (1995) calculates the asymptotic mean square error and plugs in estimated quantities. However, their methodology applies to the independent case, which is distinctly different from the setting of this paper, and hence we opt for a cross-validation procedure. Selection of the smoothing parameters by cross-validation was already implemented by Yao et al. (2005a); here we use a similar approach.
To further reduce the computational requirements, we opt for a K-fold cross-validation strategy instead of the leave-one-curve-out cross-validation originally suggested by Rice & Silverman (1991). For the K-fold cross-validation, we work with K = 10 partitions, as follows. We randomly split the functional curves into K partitions and denote the corresponding sets of time indices by T_1, …, T_K. For each k ∈ {1, …, K}, denote by μ̂^{(−k), B⁰_µ} the estimate of the common mean function µ calculated by the smoother (2.5) from the data without partition k, using the candidate smoothing parameter B⁰_µ. We select the smoothing parameter B_µ by minimizing the cross-validated loss (A.1). Once B_µ is chosen, we estimate the function μ̂ from the full data and use it in the second step, the selection of B_R and B_V for smoothing the covariance kernels. We select these smoothing parameters while smoothing only the lag-zero covariance: we expect the same smoothness for higher-order lags, and selecting the parameters on a single covariance kernel reduces the computational cost, which would otherwise become substantial. We again employ K-fold cross-validation.
Denote by R̂_0^{(−k), B⁰_R} the estimate of R_0 obtained by the smoother (2.7), calculated from the data without partition k and using the candidate smoothing parameter B⁰_R. The smoothing parameter B_R is selected by minimizing the loss (A.2). To select the smoothing parameter B_V, denote by V̂^{(−k), B⁰_V} the estimate of the diagonal of R_0(·, ·), including the ridge contamination, computed from the data without partition k and using the candidate smoothing parameter B⁰_V. The parameter B_V is selected by minimizing the loss (A.3). Once the minimizers B_R and B_V have been found, we construct the estimate of the lag-zero covariance kernel R̂_0 and of the measurement error variance σ² from the full data. The bandwidth parameter B_R is also used for the estimation of the spectral density, because we expect the same degree of spatial smoothness for the spectral density kernels over all frequencies.
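The common structure of all three selections is the same K-fold criterion; a generic template (the `fit` and `loss` callables stand in for the paper's smoothers and the losses (A.1)-(A.3), and a plain grid over candidates replaces the optimiser actually used):

```python
import numpy as np

def select_bandwidth(candidates, folds, fit, loss):
    """K-fold cross-validation template for choosing a bandwidth:
    fit(train, b) returns an estimate from the training folds at candidate
    bandwidth b; loss(estimate, test_fold) is the held-out loss."""
    scores = []
    for b in candidates:
        cv = 0.0
        for k in range(len(folds)):
            # pool all folds except the k-th into the training set
            train = [obs for j, fold in enumerate(folds) if j != k for obs in fold]
            cv += loss(fit(train, b), folds[k])
        scores.append(cv)
    return candidates[int(np.argmin(scores))]
```

In the paper the minimisation over candidate bandwidths is carried out by Bayesian optimisation rather than a grid; the cross-validated criterion itself is unchanged.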
To numerically solve the optimization problems (A.1), (A.2), and (A.3), we use MATLAB's implementation of the Bayesian optimisation algorithm (BayesOpt). A review of this algorithm can be found, for example, in Mockus (2012).

A.2. Selection of Bartlett's span parameter L
The selection of the parameter L, i.e. the number of lags taken into account when estimating the dynamics, is a challenging problem in general. Selection rules for the bandwidth parameter for smoothing in the frequency domain, which is equivalent to Bartlett's estimate as explained in Subsection 2.4, are reviewed in Fan & Yao (2008) for one-dimensional time series. The selection of the parameter L, or equivalently of the bandwidth parameter for frequency-domain smoothing, has nevertheless not been explored for functional time series. Neither Panaretos & Tavakoli (2013b) nor Hörmann et al. (2015) provide data-dependent criteria; instead, they rely on a priori choices based on asymptotic considerations.
The selection of the tuning parameter L is better studied in a related problem: the estimation of the long-run covariance, which is in fact the value of the spectral density at frequency ω = 0. The long-run covariance can be estimated by Bartlett's formula (2.12) at frequency ω = 0. Data-adaptive selection procedures for the tuning parameter L have been suggested in this context by Rice & Shang (2017) and Horváth et al. (2016).
However, it is unclear how to incorporate the sparse sampling scheme into the above-cited rules. To address this issue, we ran a number of numerical experiments, simulating datasets from several smooth functional time series and estimating the spectral density with varying values of the parameter L. By investigating the estimation error, we propose guidelines for selecting L in the form of a rule of thumb. The details of the simulation study are reported in Section 4.2, the results are recorded in Section C.1, and the proposed rule of thumb is stated in formula (4.5).

A.3. Representation of Functional Data
In classical functional data analysis, one typically works with functional data expressed with respect to a given finite (but possibly large) fixed basis. Usual choices are B-splines, the Fourier basis, or wavelets. Throughout this article (in the simulations and the data analysis) we work with the B-spline basis of order 3, because B-splines are efficient in expressing smooth functions (Ramsay & Silverman, 2007).
A useful feature of the B-spline basis is its interpolation capability (Ramsay & Silverman, 2007), which we exploit. The smoother-based estimators introduced in Sections 2.3 and 2.4 require smoothing at every point of [0, 1] or [0, 1]^2, so one has to choose a grid on which the smoother is to be calculated. To keep the computational time manageable, we avoid executing the smoother on a very dense grid and instead evaluate it on a grid with a moderate number of points: specifically, an equidistant grid with 21 points for functions and 21 × 21 points for two-dimensional kernels. Once the smoothing estimator is evaluated on this grid, the functional counterparts, as functions on [0, 1] and kernels on [0, 1]^2, are retrieved by B-spline interpolation. This is in contrast to Yao et al. (2005a), who evaluate the smoother on an equidistant grid of size 51 × 51 and treat the covariance kernel as a 51 × 51 matrix and the functional data as vectors. Our simulations (not reported here) suggest that the two approaches have essentially the same statistical performance for smooth functional data: the stochastic estimation error dominates the numerical approximation error of the fully functional quantities. From the implementation point of view, the B-spline interpolation approach shortens the computational time, reduces the dimension of the data to be stored, and directly expresses the functional quantities with respect to a basis.
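The interpolation step can be sketched as follows using SciPy's spline routines. The grid values below are stand-ins for the smoother output, not the paper's estimators; only the 21-point grid and the cubic (order 3) splines reflect the text.

```python
import numpy as np
from scipy.interpolate import make_interp_spline, RectBivariateSpline

grid = np.linspace(0.0, 1.0, 21)  # the moderate equidistant grid

# Hypothetical smoother output: values of an estimated mean on the grid.
mu_hat_on_grid = np.sin(2.0 * np.pi * grid)

# Cubic B-spline interpolant: a genuine function on all of [0, 1].
mu_hat = make_interp_spline(grid, mu_hat_on_grid, k=3)

# Two-dimensional analogue for a kernel on the 21 x 21 grid.
R_on_grid = np.cos(2.0 * np.pi * np.subtract.outer(grid, grid))
R_hat = RectBivariateSpline(grid, grid, R_on_grid, kx=3, ky=3)

val = mu_hat(0.37)                # evaluate anywhere in [0, 1]
kval = R_hat(0.37, 0.62)[0, 0]    # evaluate anywhere in [0, 1]^2
```

For smooth inputs the interpolation error on a 21-point grid is orders of magnitude below typical stochastic estimation error, which is the point made in the text.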
Once the smoother-based estimates of the model dynamics are expressed in the B-spline basis, we assume that the functional data themselves are expressed within the same fixed finite B-spline basis. Of course, the functional data are not directly observed, and we therefore treat the unknown basis coefficients as latent variables to be retrieved. Using the calculus for functions and operators expressed with respect to a basis (Ramsay & Silverman, 2007), the functional recovery formulae of Section 2.6 can be rewritten so that their evaluation reduces to vector and matrix manipulations, albeit in a much lower-dimensional setting.

A.4. Forecasting
A natural next step to consider, and indeed one of the main reasons for interest in recovering the functional time-series dynamics, is forecasting. In this section, we comment on how the forecasting problem naturally fits into the functional data recovery framework introduced in Section 2.6.
Assume that we are given sparse data {Y tj : 1 ≤ j ≤ N t , 1 ≤ t ≤ T} and we wish to forecast the functional datum X T+r for r ∈ N, as well as to quantify the uncertainty of the forecast. We define the random element X T+r = [X 1 , . . . , X T , X T+1 , . . . , X T+r ] ∈ H T+r . If the forecasts of the intermediate data X T+1 , . . . , X T+r−1 are not of interest, we may delete these elements and alter the method explained below accordingly. Nevertheless, we opt to explain the approach for forecasting up to time T + r simultaneously.
We utilize the notation introduced in Subsection 2.6. By extending the formulae (2.16) and (2.17) to t = 1, . . . , T + r, we obtain the law of X T+r , i.e. the joint law of X 1 , . . . , X T+r , and can calculate the conditional distribution given the observed data Y T . In particular, by taking s = T + r in equations (2.19), (2.27), and (2.30) we obtain the forecast, the pointwise confidence band, and the simultaneous confidence band, respectively, for the functional datum X T+r . In practice, we substitute the unknown population-level quantities by their empirical estimators. Therefore, by taking s = T + r in equations (2.21), (2.28), and (2.31) we obtain the forecast, the (asymptotic) pointwise confidence band, and the (asymptotic) simultaneous confidence band for X T+r .
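Under the Gaussian model, both recovery and forecasting are instances of conditioning one block of a jointly Gaussian vector on another. The following generic sketch shows the conditioning computation that the formulas above perform once population quantities are replaced by estimates; the toy covariance blocks and observations are hypothetical, not from the paper's model.

```python
import numpy as np

def gaussian_conditional(mu_x, mu_y, S_xx, S_xy, S_yy, y):
    """Conditional mean and covariance of X given Y = y when (X, Y) is
    jointly Gaussian with mean (mu_x, mu_y) and covariance blocks
    S_xx, S_xy, S_yy."""
    K = S_xy @ np.linalg.inv(S_yy)          # gain: S_xy S_yy^{-1}
    cond_mean = mu_x + K @ (y - mu_y)       # E[X | Y = y]
    cond_cov = S_xx - K @ S_xy.T            # Var[X | Y]
    return cond_mean, cond_cov

# Toy example: scalar X predicted from two noisy observations Y.
mu_x, mu_y = np.zeros(1), np.zeros(2)
S_xx = np.array([[1.0]])
S_xy = np.array([[0.5, 0.2]])
S_yy = np.array([[1.0, 0.3], [0.3, 1.0]])
cond_mean, cond_cov = gaussian_conditional(
    mu_x, mu_y, S_xx, S_xy, S_yy, np.array([1.0, -1.0])
)
```

In the forecasting setting, Y collects all sparse noisy observations up to time T and X the latent curve values at time T + r; the conditional covariance then drives the width of the confidence bands.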

B.1. Proof of Theorem 1
We start with the smoother for the common mean function µ(·). Its estimator μ̂(x), the minimizer of (2.5), can be written out explicitly. All of the quantities involved are functions of x ∈ [0, 1] and all operations, including division, are to be understood in the pointwise sense. In Lemma 1 and Lemma 2 we determine the asymptotic behaviour of S r and Q r , respectively. For the stochastic term, it will be useful to employ the Fourier transform. The inverse Fourier transform of the function u → K(u)u^r is defined as ζ r (t) = ∫ e^{−iut} K(u)u^r du. Thanks to the independence of {N t } and {x tj } we can bound the variance of φ Sr (x). The proof is concluded by combining (B.2) and (B.4), and by the observation that E|Z n | = O(a n ) implies Z n = O P (a n ) for an arbitrary sequence of random variables Z n and a sequence of constants a n .
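For a concrete picture of the object being analysed, a minimal local-linear smoother of the pooled observations can be sketched as follows. This is a generic sketch of the type of estimator minimising (2.5); the Epanechnikov kernel, the data, and the bandwidth are illustrative assumptions, not the paper's tuned choices.

```python
import numpy as np

def local_linear_mean(xs, ys, x0, bandwidth):
    """Local-linear estimate of mu(x0): a weighted least-squares fit of
    a line centred at x0, with Epanechnikov kernel weights."""
    u = (xs - x0) / bandwidth
    w = np.maximum(0.0, 0.75 * (1.0 - u**2))          # Epanechnikov kernel
    X = np.column_stack([np.ones_like(xs), xs - x0])  # intercept and slope
    WX = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, WX.T @ ys)       # normal equations
    return beta[0]                                    # intercept = mu-hat(x0)

# Sanity check: a local-linear smoother reproduces linear trends exactly.
xs = np.linspace(0.0, 1.0, 101)
ys = 2.0 + 3.0 * xs
est = local_linear_mean(xs, ys, 0.5, bandwidth=0.2)   # equals 3.5
```

In the sparse functional setting, xs and ys would pool the locations x_tj and responses Y_tj across all curves, which is exactly why the temporal dependence enters the variance calculations of Lemma 2.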
Proof. The proof of Lemma 2 follows the same ideas as that of Lemma 1. We use the bias-variance decomposition and a Taylor expansion of order 2 to derive results analogous to (B.2), as well as the formulae for M [Q0] (x) and M [Q1] (x). We then define ϕ r in analogy to (B.3). It remains to bound the variance of (B.5); however, the temporal dependence among the Y tj must now be taken into account. First, remark that for an arbitrary stationary time-series {Z t } with a summable autocovariance function ρ Z (·), one has the bound (B.6). The sequence under consideration constitutes a stationary time-series of real random variables. By conditioning on N t and x tj , and applying the law of total covariance, we can bound the autocovariance of {Z t } by |ρ Z (h)| ≤ max x,y |R h (x, y)| for h ≠ 0. For h = 0, the bound is augmented by σ² due to the measurement error, but this changes nothing as to the summability. The autocovariance function is summable thanks to assumption (2.1), and we conclude that Var ϕ r (v) = O(1/T). Repeating the same steps as in (B.4) bounds the expected supremum, which completes the proof.
Proof of the first part of Theorem 1. By combining Lemma 1, Lemma 2, the formula (B.1), and the uniform version of Slutsky's theorem, we obtain the rate (3.1).
Now we turn our attention to the estimation of the lag-0 covariance and lag-h autocovariance kernels. We include the proof only for h ≠ 0; for h = 0 one has to exclude the diagonal to evade the measurement errors, but the proof is essentially the same. It is possible to express the minimizer of (2.11) explicitly (cf. Li & Hsing (2010)). The general principles of the derivation of the explicit formula are also commented on for the case of spectral density estimation in Section B.2, which uses similar derivation steps as for the estimator of the lagged autocovariance kernels. The explicit formula yields the estimator for |h| < T. All of the above terms are functions of (x, y) ∈ [0, 1]^2 and all operations are understood in the pointwise sense, including the pointwise inversion. We assess the uniform asymptotic behaviour of S, where the constant U is uniform for 0 ≤ p + q ≤ 2, T ∈ N, |h| < T, and where c h = (EN)² for h ≠ 0 and c 0 = E{N(N − 1)}. Moreover, the convergence (B.9) is uniform in h.
Proof. Note the bias-variance decomposition of the estimation error. Considering a Taylor expansion of order 2, it is easy to show that the formulae (B.10) hold and that the second term of (B.8) is of order O(B R ²) uniformly in h and T. Taking steps analogous to those in the proof of Lemma 1, while using the Fourier transform of the function (u, v) → K(u)K(v)u^p v^q, one can show that the first term on the right-hand side of (B.11) is bounded by 1/(T − |h|). Now assume for the moment that the common mean function µ(·) is known, and formally define Q̃ accordingly. We analyse the asymptotics of Q̃, where the constant U in the resulting bound is uniform for 0 ≤ p + q ≤ 2, T ∈ N, |h| < T, and where c h = (EN)² for h ≠ 0 and c 0 = E{N(N − 1)}. Moreover, the convergence (B.14) is uniform in h.

Proof. Again, the bias-variance decomposition yields
the bound on the expected supremum over x, y ∈ [0, 1] of the estimation error. By taking a Taylor expansion of order 2, it is again straightforward to show that the formulae (B.15) hold and that the second term of (B.13) is of order O(B R ²) uniformly in h and T. To treat the first term on the right-hand side of (B.13), we define the Fourier transform of the function (α, β) → K(α)α^p K(β)β^q as ζ pq (u, v) = ∫∫ e^{−i(uα+vβ)} K(α)α^p K(β)β^q dα dβ. Analogously to (B.4), it now remains to analyse the variance of ϕ. As in the proof of Lemma 2, we want to bound the sum of the autocovariance function Σ_{ξ∈Z} |ρ Z(h) (ξ)|, but the bound must be uniform in h. By conditioning on N t and x tj , and applying the law of total covariance, the lag-ξ autocovariance ρ Z(h) (ξ) can be bounded as in (B.16). For ξ ∈ {−h, 0, h}, the bound is augmented by σ², but this changes nothing as to the summability with respect to ξ ∈ Z.
Using the formula for the fourth-order cumulant of centred random variables (Rosenblatt, 1985, p. 36), we express the covariance on the right-hand side of (B.16) as in (B.17). Taking the absolute value and the supremum, the sum of (B.16) with respect to ξ is bounded, thanks to the fact that the cumulant on the right-hand side of (B.17) is summable by (B5) and the autocovariances are summable by (2.1). Moreover, the sum is bounded uniformly in h. This concludes the proof of the bound (B.13).
In the following lemma we modify the previous result for the raw covariances G h,t instead of G̃ h,t .
Lemma 5. Under (B1) – (B5), (B7) and (B8), for h ∈ Z and 0 ≤ p + q ≤ 2, p, q ∈ N 0 , the stated bound holds. Proof. We follow the lines of the discussion at the end of the proof of Yao et al. (2005a, Theorem 1). Consider a generic raw covariance G h,t (x, y) = (X t+h (x) − μ̂(x))(X t (y) − μ̂(y)) and its counterpart G̃ h,t (x, y) = (X t+h (x) − µ(x))(X t (y) − µ(y)). They can be related to each other by an expansion. Proof of the second part of Theorem 1. Combining the results of Lemma 2 and Lemma 5, we obtain the following uniform convergence rates. The numerator of the ratio (B.7) exhibits the corresponding uniform convergence, and therefore we have proven the convergence rate for the autocovariance kernel estimator uniformly in x, y ∈ [0, 1].

Finally, we turn to the estimation of the measurement error variance σ². The minimizer of the local quadratic smoother (2.9) can be expressed explicitly. All of the above quantities are understood as functions of x ∈ [0, 1] and all operations are considered pointwise, including the pointwise inversion B^{−1} = (B(x))^{−1}.
The following corollary is a direct consequence of Lemma 6 and Lemma 7, and the formula (B.18).

Lemma 8. Under (B1) -(B5) and (B7) -(B9),
Proof. The proof is similar to the proofs of the above lemmas. An explicit formula for the minimizer of (2.8) can be found analogously.
Proof of the last part of Theorem 1. Combining Lemma 6, Lemma 7, and Lemma 8 yields the rate (3.3). See also the proof of Li & Hsing (2010, Theorem 3.4), where the proof with local-linear smoothing of the diagonal is written out in detail.

B.2. Proof of Theorem 2
Firstly, we note that the minimizer of (2.14), and hence the estimator (2.13), can be expressed explicitly as in (B.19). All of the above quantities are understood as functions of (x, y) ∈ [0, 1]^2 and all operations are considered in the pointwise sense, including the pointwise inversion B^{−1} = (B(x, y))^{−1}. To see why the minimizer has the form (B.19), we simplify the notation of the complex minimisation problem (2.14), writing it with A j ∈ C representing the raw covariances multiplied by the complex exponential and v j ≥ 0 the spatial and Bartlett's weights. The sum of squares can be rewritten in matrix notation as a minimisation over (d 0 , d 1 , d 2 ), where † denotes the complex conjugate and A = (A 1 , . . . , A n )^T. Since X and V are real, the real and imaginary parts of the minimisation can be separated. We solve both minimisation problems by the classical normal-equations formula for weighted least squares. We can then calculate the first element of (X^T VX)^{−1} X^T VA by Cramér's rule. After switching back to the quadruple summation (2.14), we arrive at the formula (B.19).
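The separation of the real and imaginary parts can be checked numerically: with a real design matrix X and real nonnegative weights v, the complex weighted least-squares solution coincides with solving the real and imaginary problems independently. This is a generic sketch with randomly generated A, X, and v, not the specific quantities of (2.14).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))                        # real design matrix
v = rng.uniform(0.5, 1.5, size=n)                  # real nonnegative weights
A = rng.normal(size=n) + 1j * rng.normal(size=n)   # complex responses

# Direct complex weighted least squares via the normal equations.
XtV = X.T * v                                      # X^T V
d_complex = np.linalg.solve(XtV @ X, XtV @ A)

# Separated: real and imaginary parts solved independently,
# which is valid precisely because X and V are real.
d_re = np.linalg.solve(XtV @ X, XtV @ A.real)
d_im = np.linalg.solve(XtV @ X, XtV @ A.imag)
```

The two solutions agree to machine precision, which is the content of the separation argument used above.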
To investigate the asymptotic behaviour of the estimator (2.13), we need to analyse the asymptotics of the terms in the formula (B.19). We now assess the asymptotic behaviour of S pq and Q ω pq . Lemma 9. Under the assumptions (B1), (B2), and (B8), for any p, q ∈ N 0 such that 0 ≤ p + q ≤ 2, the stated rates hold for h = −L, . . . , L, t = 1, . . . , T − h, j = 1, . . . , N t+h , k = 1, . . . , N t , with j ≠ k if h = 0. Because L = o(T) we may assume (and we do) throughout the proof that L ≤ T/2. Noting that L^{−1} Σ_{h=−L}^{L} W h = 1, we start with the decomposition (B.20). The second term on the right-hand side of (B.20) is of order O P (L). The third term is bounded via its variance, using a bound UT valid for |h| ≤ T/2, where the constant U is independent of T and h but may depend on the distribution of N; thus the third term is of order O P (T^{−1/2}). The fourth term on the right-hand side is of order O P (T^{−1/2}) because N̄ = EN + O P (T^{−1/2}).
The first term on the right-hand side of (B.20) is decomposed as in (B.21). The second term on the right-hand side of (B.21) is of the stated order uniformly. The first term on the right-hand side of (B.21) is treated using similar steps as in the proof of Lemma 3; therefore, for a constant U independent of B R , T and |h| < T/2, the resulting bound establishes its order. Lemma 10. Let p, q ∈ N 0 be such that 0 ≤ p + q ≤ 2. Then we have: 1. under (B1) – (B5), (B7), (B8) and (B10),

2. under (B1) – (B8) and (B10), the corresponding rate for Q̃ ω ,
where all convergences are uniform in ω ∈ [−π, π] and x, y ∈ [0, 1]. Analogously to Lemma 4, we first assume that µ(·) is known and define Q̃ accordingly. Under assumption (B5), the second and third terms on the right-hand side of (B.22) converge to zero uniformly in x, y ∈ [0, 1] by Kronecker's lemma. Assuming further (B6), these terms are in fact of order O(L^{−1}) uniformly in x, y ∈ [0, 1]. The first term on the right-hand side of (B.22) is treated similarly as in the proof of Lemma 5.
The second term on the right-hand side of (B.23) is of order O(LB R ²) uniformly in x, y ∈ [0, 1]. The first term on the right-hand side of (B.23) is treated analogously to the proof of Lemma 5: there exists a constant U, independent of B R , T and |h| < T/2, such that the corresponding bound holds. This yields the rate o P (1), and O P (LT^{−1/2} B R ^{−2}) under assumption (B6). The proof is completed by repeating the steps in the proof of Lemma 5, switching to the O P notation, and noting that the difference between Q̃ ω pq and Q ω pq is asymptotically negligible. Proof of Theorem 2. Combining the results derived in Lemmas 9 and 10, we are ready to establish the asymptotic behaviour of the terms entering the formula (B.19).

B.3. Proof of Corollary 1
Proof of Corollary 1. Note that the displayed identity holds for h ∈ Z and x, y ∈ [0, 1]. Assuming further (B6), proving the statement (3.6) is analogous to the previous display.
Under the assumptions (D1) – (D9), the time-series {Ỹ i , X̃ i } i is strictly stationary and strongly mixing, with mixing coefficients α̃(m) ≤ Ã m^{−β}, and satisfies the conditions (2) – (7) of Hansen (2008). The conditions (10) – (13) follow directly from our assumptions. Next we turn to the estimation of the lag-h autocovariance kernels R h (·, ·). Fix h ∈ Z. For simplicity consider h ≠ 0; the proof for h = 0 is essentially the same, only the diagonal "raw" covariances must be removed. For the moment, assume that the mean function µ(·) is known, and work with the "raw" covariances G̃ h,t (x t+h,j , x tk ) as defined in (B.12). Similarly as in the first part of this proof, define the three-dimensional time-series {Ỹ i , X̃ i } i composed of the "raw" covariances and their locations. We again make use of Hansen (2008, Thm 10). Under the assumptions (D1) – (D13), it is easy to verify (analogously as in the first part of this proof) that this time-series satisfies the conditions (2) – (7) of Hansen (2008). The conditions (10) – (13) also follow directly from our assumptions. It remains to repeat the discussion from the proof of Lemma 5 to conclude that the difference between G̃ h,t (x t+h,j , x tk ) and G h,t (x t+h,j , x tk ) is asymptotically negligible with respect to the rate below. Therefore, by Hansen (2008, Thm 10), the stated rate holds for fixed h ∈ Z.

B.5. Proof of Theorem 4
The proof of Theorem 4 is more involved. Rather than making direct use of the results of Hansen (2008), we shall need to modify the proof techniques therein in order to construct our proof. We express the spectral density kernel estimator (2.13) in a similar way as in the proof of Theorem 2.
All of the above quantities are understood as functions of (x, y) ∈ [0, 1]^2 and all operations are considered in the pointwise sense, including the pointwise inversion B^{−1} = (B(x, y))^{−1}. Similarly as in the proof of Theorem 3, define the "raw" covariances for h ∈ Z. Let k(·) : R² → R be a function satisfying assumption (C1), and for r = 0, 1 and h ∈ Z define Ψ̂ h,r . Lemma 11. Under the assumptions (D1) – (D13), for N h > 0, the stated variance bound holds, where the constant Θ is uniform in h ∈ Z, x, y ∈ [0, 1], and r = 0, 1.
Proof. Note that the sequence {Z h,r i (x, y)} i is a stationary scalar time-series; denote its autocovariance function at lag ξ by ρ(ξ). We therefore have the bound (B.6). Conditioning on N h yields (B.29). The sum on the right-hand side of (B.29) can be bounded by (B.30). The bound (B.30) is uniform in h and constitutes the constant Θ in the statement of Lemma 11.
The key tool for our proof is an exponential-type inequality for strongly mixing random sequences. This inequality was given by Liebscher (1996, Thm 2.1), whose result was derived from Rio (1995, Thm 5).
Lemma 12 (Liebscher/Rio). Let Z i be a stationary zero-mean real-valued process such that |Z i | ≤ b, with strong mixing coefficients α m . Then for each positive integer m ≤ n with m < b/4, the stated exponential inequality holds. Assuming further assumption (B6), and similarly as in the proof of Lemma 10, decompose as in (B.33). Under assumption (D14), the last two terms on the right-hand side of (B.33) converge to zero uniformly in x, y ∈ [0, 1] by Kronecker's lemma. Assuming further (B6), these terms are in fact of order O(L^{−1}) uniformly in x, y ∈ [0, 1]. The first term on the right-hand side of (B.33) is of order O P (LT^{−1/2}) uniformly in x, y ∈ [0, 1]. The bias term, the third term on the right-hand side of (B.33), is of order O P (LB R ²), which is shown exactly as in the proof of Lemma 4.
It remains to treat the second term on the right-hand side of (B.33), for which we start with the observation (B.34). Denote a T = {log T/(T B R ²)}^{1/2}. To show that the right-hand side of (B.34) is of order O P (L a T ), we investigate, for some M > 0, the probabilities in (B.35). We bound the probabilities on the right-hand side of (B.35) using the proof techniques of Hansen (2008, Thm 2). To simplify the notation and the proof, we shall assume that the numbers of observation locations are deterministic and constant, as in (B.36). Without this assumption, all bounds must be conditioned on these counts, and the unconditional statements follow from the fact that (1/T)N h = (EN)² + O P (T^{−1/2}) for h ≠ 0 and (1/T)N 0 = E{N(N − 1)} + O P (T^{−1/2}), where the convergences are uniform in |h| < T/3. Under the technical assumption (B.36), N h = (T − |h|)N² for h ≠ 0 and N 0 = T N(N − 1). By assumption (D15) we may take T sufficiently large so that the required inequalities hold. Our proof follows essentially the same steps as Hansen (2008, Thm 2); the only difference is that we need to keep track of the uniformity in h and adjust the convergence rate for the growing L. The proof consists of three steps. Firstly, we truncate, setting τ T = a T ^{−1/(s−1)}. Secondly, we replace the supremum over x̃ ≡ (x, y) ∈ [0, 1]² with a maximisation over a finite N g -point grid. And finally, with the help of the exponential inequality from Lemma 12, we bound the remainder. Following the same steps as in the proof of Hansen (2008, Thm 2), we establish the bound (B.37) uniformly in |h| < T/3. Thus replacing Ỹ i with Ỹ i 1[|Ỹ i | ≤ τ T ] yields only an error of order O P (a T ), and we may therefore assume for the rest of the proof that |Ỹ i | ≤ τ T .
The second step of the proof introduces a discretization of the square [0, 1]², which can be covered by a regular grid of N g = 2B R ^{−2} a T ^{−2} points such that for each (x, y) ∈ [0, 1]², the closest grid point x̃ j ≡ (x j , y j ) lies at a distance of at most B R a T . Denote this discretization as A j ⊂ [0, 1]², j = 1, . . . , N g .
Thanks to assumption (D11), for all x̃ 1 , x̃ 2 ∈ [0, 1]² satisfying ‖x̃ 1 − x̃ 2 ‖ ≤ δ ≤ L̄, we have the bound (B.38), where k* : R² → R is a bounded integrable function. Indeed, if k(·) satisfies the compact support condition of (C1) and is Lipschitz, then k*(u) = Λ 1 1[‖u‖ ≤ 2L̄]. If, on the other hand, k(·) satisfies the differentiability condition of (C1), then we may define k* accordingly. The inequality (B.38) implies that if a T ≤ L̄ then for x̃ ∈ A j we have ‖x̃ − x̃ j ‖/B R ≤ a T and, for T large enough such that a T ≤ L̄, the corresponding bound holds for a modification of Ψ̂ h,r in which k(·) is replaced by k*(·). Note that by assumptions (D4) and (D10), E Ψ̂ h,r (x̃) is bounded uniformly in h ∈ Z and r = 0, 1. Following the steps in the proof of Hansen (2008, Thm 2), we conclude the bounds (B.39) and (B.40). The terms (B.39) and (B.40) are bounded alike, because the only difference between them is the presence of k(·) versus k*(·). Next we show how to bound (B.39).
By the definition (B.28) of Z h,r i (x̃), we notice that |Z h,r i | is bounded, where k̄ denotes an upper bound of the bounded function k(·). Therefore, by Lemma 11, for m sufficiently large we have, uniformly in |h| < m/3, the bound (B.41). Put m = (a T τ T )^{−1}; we conclude that m < T and m < b T /4 for T sufficiently large. Therefore, by Lemma 12, for any x̃ ∈ [0, 1]², the chain of inequalities holds, where the second inequality comes from the fact that the time-series {Z h,r i (x̃)} is strongly mixing with coefficients α m ≤ Ã(m − |h|)^{−β} for m ≥ |h|, the third inequality is due to (B.37), and the final one follows by taking M > Θ. Since N g ≤ 2B R ^{−2} a T ^{−2}, we obtain from the above inequality, together with (B.39) and (B.40), the bound on the supremum probability. Returning to the inequalities (B.34) and (B.35), we conclude the claimed rate for M sufficiently large, by the assumptions (D15) and (D16). Proof. We decompose the estimation error as in (B.44). The first term on the right-hand side of (B.44) is of order O(T^{−1/2}), uniformly in x, y ∈ [0, 1]. The third term on the right-hand side of (B.44) is of order O(B R ²), uniformly in x, y ∈ [0, 1]; this is shown identically as in the proof of Lemma 3.
The second term on the right-hand side of (B.44) is of the stated order, uniformly in x, y ∈ [0, 1] and |h| < L. This is shown analogously to the proof of Lemma 13; the difference is that the normalising factor 1/L improves the rate to (log T/(T B R ²))^{1/2}, as opposed to L(log T/(T B R ²))^{1/2} in Lemma 13.
Proof of Theorem 4. We start by assuming that the mean function µ(·) is known. Combining the results of Lemmas 13 and 14 with the formula (B.24) provides the rate (3.7), and the rate (3.8) if (B6) is assumed.
The proof is completed by noting that the difference between the "raw" covariances with and without µ(·) is negligible.

B.6. Proof of Theorem 5
The following lemma ensures the convergence of M̂ Xs|Y S and Ŝ Xs|Y S to their population-level counterparts (2.26). We investigate the convergence without the Gaussianity assumption.
Proof. We start with M̂ Xs|Y S . Decompose the difference as in (B.45). The first term J 1 on the right-hand side of (B.45) tends to zero, uniformly in x, as T → ∞ by Theorem 1. The second term J 2 and the third term J 3 can be rewritten in terms of Cov(X s (x), Y S ), a random vector in R^{N S } whose elements are of the form {R h k (x, x t k ,j k )} for some lags h k and locations x t k ,j k , and Var(Y S ), a random matrix in R^{N S × N S }. The terms Ĉov(X s (x), Y S ) and V̂ar(Y S ) are defined analogously, using the estimated autocovariance kernels.
To treat the term J 2 , note that V̂ar(Y S )^{−1} − Var(Y S )^{−1} → 0 as T → ∞ by Corollary 1. The term Ĉov(X s (x), Y S ) is bounded uniformly in x due to its convergence to Cov(X s (x), Y S ), uniformly in x, by Corollary 1. The term J 3 is treated similarly: Ĉov(X s (x), Y S ) − Cov(X s (x), Y S ) → 0, uniformly in x, by Corollary 1. The formula for the variance Ŝ Xs|Y S (x, y) can be written out explicitly; its convergence, uniform in (x, y) ∈ [0, 1]², is treated similarly as above by Corollary 1.
Proof of Theorem 5. The first statement of Lemma 15 is the statement of Theorem 5.

B.7. Proof of Theorem 6
Proof of Theorem 6. We start with the pointwise confidence band. Fix x ∈ [0, 1]. The claim follows from (A1), the form of the conditional distribution, and the convergence established in Lemma 15. Now we turn our attention to the simultaneous confidence band, which is defined through the conditional distribution and the construction of Degras (2011), reviewed in Section 2.6. Define the correlation kernel ρ Xs|Y T (x, y) as in (2.29), and assume for simplicity of the proof that ρ Xs|Y T (x, x) > 0 for all x ∈ [0, 1]; the square root and the division below are understood pointwise. Denote by W ρ the law of the supremum appearing in the band construction, for a correlation kernel ρ. Note also that if ρ n → ρ uniformly, then N(0, ρ n ) → N(0, ρ) weakly, W ρ n → W ρ weakly, and therefore z α,ρ n → z α,ρ .
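In practice, a critical value of the form z α,ρ can be approximated by Monte Carlo: simulate a zero-mean Gaussian process with correlation kernel ρ on a grid and take the empirical quantile of the supremum of its absolute value. The following is a generic sketch under that description; the squared-exponential correlation kernel, the grid size, and the simulation budget are hypothetical choices, not the paper's.

```python
import numpy as np

def sup_gaussian_quantile(rho, grid, alpha=0.05, n_sim=20000, seed=0):
    """Monte Carlo estimate of the (1 - alpha)-quantile of
    sup_x |Z(x)| for a zero-mean Gaussian process with correlation
    kernel rho, approximated on a finite grid."""
    R = rho(grid[:, None], grid[None, :])          # correlation matrix
    w, V = np.linalg.eigh(R)                       # symmetric square root
    root = V * np.sqrt(np.clip(w, 0.0, None))
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n_sim, len(grid))) @ root.T   # rows ~ N(0, R)
    sups = np.abs(Z).max(axis=1)
    return np.quantile(sups, 1.0 - alpha)

grid = np.linspace(0.0, 1.0, 51)
# Hypothetical squared-exponential correlation kernel.
z = sup_gaussian_quantile(lambda x, y: np.exp(-(x - y) ** 2 / 0.1), grid)
```

As expected, the simultaneous critical value exceeds the pointwise Gaussian quantile 1.96, and it decreases toward it as the correlation kernel approaches perfect dependence.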

B.8. Proof of Proposition 1 and Proposition 2
Proof of Proposition 1. The formula (4.2) is verified by calculating the autocovariance operators of the functional moving average process, which are non-zero only at a finite number of lags. The assumptions (B3), (B4), (B5), and (B6) are easily verified using the smoothness of the kernels and the exponential decay of the norms of the autocovariance operators. Verifying the condition (2.1) in the supremum sense yields the existence of the spectral density in the kernel sense (2.2).
Proof of Proposition 2. The existence, the uniqueness, and the stationarity are treated by Bosq (2012a). The Gaussianity is also immediate. We now verify the formula (4.4). We can write the inversions on the right-hand side of (4.4) as a Neumann series (B.46). Fix h ≥ 0. Expanding the sums on the right-hand side of (B.46), in order to obtain the term with e^{−iωh} one has to sum up the series (B.47), whose value involves the lag-0 covariance operator of the process (Bosq, 2012a). Checking the analogue of (B.47) for h < 0 yields the formula (4.4). The discussion of the assumptions is analogous to that for the functional moving average process.
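The Neumann-series computation behind (4.4) can be sanity-checked in the scalar special case: for a stationary AR(1) with |a| < 1, expanding (1 − a e^{−iω})^{−1} = Σ_k a^k e^{−iωk} and truncating recovers the closed-form spectral density. The values of a, σ², and ω below are arbitrary illustrative choices, a scalar stand-in for the operator identity.

```python
import numpy as np

a, sigma2 = 0.6, 1.0     # AR(1) coefficient and innovation variance
omega = 1.3              # frequency at which to evaluate the density

# Closed form: f(omega) = sigma^2 / (2*pi) * |1 - a e^{-i omega}|^{-2}.
closed = sigma2 / (2.0 * np.pi) / abs(1.0 - a * np.exp(-1j * omega)) ** 2

# Neumann series: (1 - a e^{-i omega})^{-1} = sum_k (a e^{-i omega})^k,
# convergent since |a| < 1; truncate at K terms.
K = 200
neumann = sum((a * np.exp(-1j * omega)) ** k for k in range(K))
series = sigma2 / (2.0 * np.pi) * abs(neumann) ** 2
```

With |a| < 1 the truncation error decays geometrically, mirroring the convergence of the operator Neumann series when the autoregressive operator has norm below one.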
C.1. Selection of Bartlett's Span Parameter L
We run a simulation study across the considered functional moving average processes FMA(2), FMA(4), and FMA(8), and the functional autoregressive processes FAR(1) 0.7 and FAR(1) 0.9 ; for their definitions refer to Subsection 4.1. We simulated 25 independent realizations of each process for each pair of the considered sample size parameters T ∈ {150, 300, 450, 600, 900, 1200} and N max ∈ {5, 10, 20, 30, 40}. For each realization we selected the bandwidth parameters B µ , B R , and B V for the smoothing estimators by the K-fold cross-validation suggested in Section A.1. Then we estimated the spectral density by the estimator (2.13) with a varying value of Bartlett's span parameter L, to identify which value is optimal for the estimation of the spectral density with respect to the relative mean square error (4.6). The first five parts of Table 3 present the optimal values of L for the considered processes and sample sizes.
The optimal value of L depends quite substantially on the dynamics of the functional time-series. Especially striking is the case of the autoregressive process FAR(1) 0.9 , which features a higher degree of temporal dependence than the other processes. Based on the results in the first five parts of Table 3, we suggest the selection rule (4.5) as a compromise among the considered processes.
The bottom-right part of Table 3 presents the evaluations of the rule (4.5) for the considered sample sizes. For the evaluation we set the average number of points per curve, N̄, to the expectation of the number of points, N max /2.
Table 3: The best L to minimize the relative mean square error (4.6) of the spectral density estimation for the functional moving average processes FMA(2), FMA(4), and FMA(8), and the functional autoregressive processes FAR(1) 0.7 and FAR(1) 0.9 .
Table 4 states the average relative mean square error (4.6) for the considered functional moving average processes FMA(2), FMA(4), FMA(8), and the functional autoregressive processes FAR(1) 0.7 and FAR(1) 0.9 . The results for the functional moving average process of order 4, FMA(4), were already stated in Table 1 in Section 4.2, without the standard deviations. Figure 9 displays the fitted regression surfaces for the model (4.7) for the functional moving average processes FMA(2), FMA(4), FMA(8), and the functional autoregressive processes FAR(1) 0.7 and FAR(1) 0.9 , respectively. The conclusion of Section 4.2, favouring observation over a longer time span, therefore remains valid.
Table 4: Average relative mean square errors (defined in (4.6)) of the spectral density estimators for the considered functional time-series. The numbers in parentheses are the standard deviations of the relative mean square error. Each cell of the table (each error and its standard deviation) is the result of 100 independent simulations. Bartlett's span parameter L was selected by the rule (4.5).

C.2. Spectral Density Estimation
Table 5, Table 6, Table 7, Table 8, and Table 9 summarize the performance of the dynamic and static recovery methods. For the reasons explained in Section 4.3, the relative mean square error is sensitive to poor estimation of the measurement error variance parameter σ². Therefore we take into account only those simulations where σ̂ > 0.05, and we report the median relative mean square error and the corresponding inter-quartile range instead of the mean of the errors and their standard deviation. The column Relative gain of Table 6, for the functional moving average process of order 4, FMA(4), corresponds to the data in Table 2.
Table 5: Median relative mean square error (4.8) of the dynamic and static recovery and the relative gain (4.9) between them. Each row of the