Predictive, finite-sample model choice for time series under stationarity and non-stationarity

In statistical research there usually exists a choice between structurally simpler or more complex models. We argue that, even if a more complex, locally stationary time series model were true, then a simple, stationary time series model may be advantageous to work with under parameter uncertainty. We present a new model choice methodology, where one of two competing approaches is chosen based on its empirical, finite-sample performance with respect to prediction, in a manner that ensures interpretability. A rigorous, theoretical analysis of the procedure is provided. As an important side result we prove, for possibly diverging model order, that the localised Yule-Walker estimator is strongly, uniformly consistent under local stationarity. An R package, forecastSNSTS, is provided and used to apply the methodology to financial and meteorological data in empirical examples. We further provide an extensive simulation study and discuss when it is preferable to base forecasts on the more volatile time-varying estimates and when it is advantageous to forecast as if the data were from a stationary process, even though they might not be.


Introduction
A well-trodden path in applied statistical research is to propose a model believed to be a good approximation to the data-generating process, and then to estimate the model parameters with a view to performing a specific task, for example prediction. However, even if the analyst were 'lucky' and chose the right model family, thereby reducing modelling bias, the resulting parameter estimators could be so variable that the selected model might well be sub-optimal from the point of view of the task in question. Choosing a slightly wrong model but with less variable parameter estimates may well lead to superior performance in, for example, prediction. This paper explores how this unsurprising but interesting phenomenon could and should affect model choice in time series analysis.
Choosing between stationary and non-stationary modelling is, typically, an important step in the analysis of time series data. Stationarity, which assumes that certain probabilistic properties of the time series model do not evolve over time, is a key assumption in time series analysis, and several excellent monographs focus on stationary modelling; see, e. g., Brillinger (1975), Brockwell and Davis (1991) or Priestley (1981). However, in practice, many time series are deemed to be better-suited for non-stationary modelling; this judgement can be based on diverse factors, such as, for example, visual inspection, formal tests of stationarity, or the observation that the data have been collected in a time-evolving environment and therefore are unlikely to have come from a stationary model.
One popular framework for the rigorous description of non-stationary time series models is that of local stationarity (Dahlhaus, 1997, 2012), in which the data are modelled locally as approximately stationary. We illustrate the main idea of the paper using a simple example of a locally stationary time series model, the time-varying autoregressive model (of order 1), $X_{t,T} = a(t/T)X_{t-1,T} + Z_t$, $t = 1, \ldots, T$, with $T$ denoting the sample size, $a : [0,1] \to (-1,1)$ being some suitable function and $Z_t$ being an i. i. d. sequence with mean zero and variance one. Typically, to forecast future observations, one would require an estimate of $a(1)$; see, e. g., Chen et al. (2010). Before constructing a suitable estimator, some analysts would wish to test whether $a$ was indeed time-varying, and there exists a vast array of techniques to validate the assumption of a constant second-order structure in this framework; see von Sachs and Neumann (2000), Paparoditis (2009), Dwivedi and Subba Rao (2010), Paparoditis (2010), Dette et al. (2011), Nason (2013), Preuß et al. (2013) or Vogt and Dette (2015). If the process was found to be non-stationary, it would be tempting to estimate $a(1)$ by a localised estimate based on the most recent observations of $X_{t,T}$. This localisation would most likely reduce the bias of the estimator if the true dependency structure was indeed time-varying, but it would also increase its variance. However, if, for example, the function $a$ was varying only slowly over time, this estimation procedure might result in sub-optimal estimation from the point of view of the mean squared prediction error, yielding inferior forecasts compared to the classical stationary AR(1) model. This would be particularly likely if the test of stationarity employed at the start was not constructed with the same performance measure in mind (i. e., mean squared prediction error) and was therefore 'detached' from the task in question (i. e., prediction). One of the findings of this paper is that even if the function $a$ varied over time, one should in some cases treat it as constant in order to obtain smaller prediction errors, or, in other words, 'prefer the wrong model' from the point of view of prediction.
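The following R sketch illustrates this bias-variance trade-off for the tvAR(1) example. All names, the coefficient function and the segment length N = 50 are illustrative choices of ours; they are not part of the paper or of the forecastSNSTS package.

```r
set.seed(1)
T <- 500
a <- function(u) 0.3 + 0.3 * u           # slowly varying AR(1) coefficient
x <- numeric(T)
z <- rnorm(T)
x[1] <- z[1]
for (t in 2:T) x[t] <- a(t / T) * x[t - 1] + z[t]

# Yule-Walker estimate of the lag-1 coefficient from a segment of observations
yw1 <- function(seg) {
  g0 <- mean(seg^2)                       # empirical autocovariance at lag 0
  g1 <- mean(seg[-1] * seg[-length(seg)]) # empirical autocovariance at lag 1
  g1 / g0
}

yw1(x)                  # global ('stationary') estimate: biased for a(1), low variance
yw1(x[(T - 49):T])      # localised estimate with N = 50: less bias, more variance
a(1)                    # the target value
```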
The main aim of this paper is to propose an alternative model choice methodology in time series analysis that avoids the pitfalls of the above-mentioned process of testing followed by model choice. More precisely, our work has the following objectives:
• To propose a generic procedure for finite-sample model choice which avoids the path of hypothesis testing but instead chooses the model that offers better empirical finite-sample performance in terms of prediction on a validation set, with associated performance guarantees for the test set of yet unobserved data. Although the procedure is proposed and analysed theoretically in the framework of choice between stationarity and local stationarity and in the context of prediction, it is applicable more generally whenever a decision needs to be made between two competing approaches, and can therefore be viewed as model- and problem-free.
• To suggest 'rules of thumb' indicating when the (wrong) stationary model may be preferred in a time-varying, locally stationary situation from the point of view of forecasting; and when a time-varying model should be preferred.
Our procedure validates and puts on a solid footing the possibly counter-intuitive observation that it is sometimes beneficial to choose the 'wrong' (but possibly simpler) model in time series analysis, if that model relies on more reliable estimators of its parameters than the right (but possibly more complex) model. While we stop short of conveying the message that simplicity in time series should always be preferred, part of our aim is to draw time series analysts' attention to the fact that particularly complex time series models may well appear attractive at first glance, as they have the potential to capture features of the data well, but on the other hand can be so hard to estimate that this makes them inferior to simple and easy-to-estimate alternative models, even if the latter are wrong.
We now briefly describe related recent literature. The work of Xia and Tong (2011), who, while discussing time series prediction, select the model based on the minimisation of up to m-step ahead prediction errors (rather than the usual 1-step ahead ones), also appears to carry the general message that different models may be preferred for the same dataset depending on the task in question, or, in the language of the authors, on the 'features to be matched'. Besides similarities in this general outlook, our model-fitting methodology and the context in which it is proposed are entirely different. Forecasting in the presence of structural changes is a widely studied topic in the econometrics literature; see, e. g., the comprehensive review by Rossi (2013) and the references therein. In particular, Giraitis et al. (2013) also use the minimisation of the 1-step ahead prediction error as a basis for model choice under non-stationarity, but, unlike us, do not consider the question of how this may lead to the preference for the 'wrong' model in finite samples. Instead of pursuing the cross-validation approach as in our work and in Giraitis et al. (2013), McDonald et al. (2016) evaluate the upper bound on the generalisation error in time series forecasting, and use its heuristically estimated version to guide model choice. We note, however, that, unlike cross-validation-based approaches, this approach requires the estimation of parameters that may themselves be difficult to estimate.
The empirical mean squared prediction error (MSPE) which we will employ in our method is closely related to the population MSPE under parameter uncertainty. The strand of literature discussing this population quantity includes Baillie (1979) and Reinsel (1980), where approximating expressions were derived for stationary VAR time series. Further, Lewis and Reinsel (1988) compare the MSPEs of the optimal h-step predictor obtained in general r-dimensional linear processes and a wrongly assumed VARMA(p,q) model. They also determine an asymptotic approximation for the MSPE in a VAR(p) model when the parameters are estimated from the data of a general r-dimensional linear process. Note that the above references' results are restricted to stationary time series. For locally stationary tvMA(∞) processes, Palma et al. (2013) discuss optimal h-step ahead forecasting, in terms of the true model characteristics. Yet, they do not take parameter uncertainty into account.
While the main question we are concerned with is whether a stationary or a time-varying autoregressive model should be used for prediction, a nested question is what order the stationary or non-stationary model should have. Traditionally, order selection is done via minimisation of an information criterion; see, e. g., Brockwell and Davis (1991), p. 301. For stationary AR models Shibata (1980) and Bhansali (1996) discuss how to select the model order efficiently, which here means that asymptotically a lower bound is achieved for the MSPE. Zhang and Koreisha (2015) develop an adaptive criterion for model selection based on predictive risk. Peña and Sánchez (2007) derive and compare MSPEs for univariate and multivariate predictors when the parameters are known. They then define and estimate a criterion (a measure of predictability) to choose between these two prediction options. Their approach is similar to ours in spirit, but, firstly, it chooses between univariate and multivariate models while we consider stationary and non-stationary models and, secondly, their methodology works with the population MSPE (which moves the focus away from the observed data to the postulated model), while we work with the corresponding empirical quantity directly. This difference in approaching the problem also holds for another, more general class of special-purpose criteria: the focused information criteria (FIC), which were introduced in Claeskens and Hjort (2003). The FIC methodology with the focus on choosing the model best suited for prediction was then applied in the field of time series analysis in Claeskens et al. (2007), where the best AR(p) model for prediction is chosen, in Rohan and Ramanathan (2011), where the best ARMA(p,q) model for this purpose is chosen, and in Brownlees and Gallo (2008), where models for volatility forecasting are chosen. The idea of the FIC is that the model which minimises the asymptotic MSPE is the best one, and the FIC is then based on an estimator of that asymptotic MSPE. Contrary to this, our approach is based on the empirical MSPE directly, which we believe to be the more relevant quantity in many applications. Contrary to the FIC, which is based on the large-sample theory of the estimators involved, we provide finite-sample exponential bounds that imply a performance guarantee for our method. This approach is advantageous when it is desirable that the model choice also depend on the size of the sample, which in our view should be a natural requirement.
Our paper is organised as follows. In Section 2 we provide a simple motivating example. In Sections 3.1 and 3.2 we introduce and comment on our new time series model choice methodology. The statistical properties of our procedure are discussed in Section 3.3, where also the performance guarantee (Theorem 3.1) is provided. The results of a simulation study and the analysis of three empirical examples can be found in Sections 4 and 5. In Section 6 we discuss statistical properties of the local Yule-Walker estimator and prove its strong uniform consistency under local stationarity (Corollary 6.2). We conclude with a summary in Section 7. In Appendix A technical conditions for the results from Sections 3.3 and 6 are listed and the main proofs are collected. Technical details of the proofs and tables and figures from the simulations section are gathered in an Online Supplement (Appendices B-I).

Motivating example
We consider the time-varying autoregressive (tvAR) model of order 2: $X_{t,T} = a_1(t/T)X_{t-1,T} + a_2(t/T)X_{t-2,T} + Z_t$, $t = 1, \ldots, T$, where $a_1(u) := 0.15 + 0.15u$, $a_2(u) := 0.25 - 0.15u$, and $Z_t$ is Gaussian white noise. $X_{t,T}$ is a non-stationary process which lies in the locally stationary class of Dahlhaus (1997). We will now compare different forecasting procedures for $X_{0.9T,T}$, where $T \in \{50, 500, 5000\}$. The predictor that minimises the mean squared prediction error is $\hat X_{0.9T,T} = a_1(0.9)X_{0.9T-1,T} + a_2(0.9)X_{0.9T-2,T}$. Yet, since in practice the underlying model is unknown, the analyst needs to (1) make assumptions regarding the model, and (2) estimate the assumed model's parameters.
For the purpose of this illustration, we discuss four possible models. In the first two models we falsely assume that the data were stationary and model $X_{t,T}$ to satisfy a traditional, autoregressive (AR) equation.
• In the first of the two cases we assume an AR(1) model and
• in the second case we assume the model to be an AR(2) model.
We further discuss cases three and four, where the correct class of models (tvAR) is assumed. Yet,
• in case three, we falsely assume a tvAR(1) model, before
• in case four, we correctly assume the model to be a tvAR(2) model.
Note that the true model, the tvAR(2) model, is the most complex one of the four choices. In each of the models we estimate the parameters by solving the empirical Yule-Walker equations. In the case of the tvAR models we localise by using only the segment of the $N$ most recent observations; we write $\hat X^{1,N}_{0.9T,T}$ for the predictor corresponding to the models of order 1 and $\hat X^{2,N}_{0.9T,T}$ for the predictor corresponding to the models of order 2. The segment length $N$ will be chosen as $0.9T - 1$ in the AR models and strictly smaller than this in the tvAR models.
In Figure 1, we observe that the predictors associated with the simpler, stationary AR models perform better or similarly well if T = 50 or T = 500. If T = 5000 the predictor associated with the locally stationary tvAR model performs visibly better in terms of its MSPE when the segment size N is chosen appropriately. In conclusion, this example illustrates how it can be advantageous to assume a wrong, but structurally simpler model when only a short time series is available. In particular, the model chosen should depend on the task at hand (here: prediction) and on the amount of data available. For T = 50 the best result is obtained by assuming the AR(1) model, which is the simplest of the four candidates. When T = 500 the more complex AR(2) model becomes advantageous. Note that this model is more complex than the AR(1) model and thus provides a better approximation to the true tvAR(2) mechanism, but it is still simplifying, because it does not take the time-varying characteristics into account at all. Only when even more data (here: T = 5000) are available is the variability of the parameter estimates of the tvAR(2) model small enough not to overshadow the modelling bias, which in this example is rather small.
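A minimal Monte Carlo sketch of the comparison behind Figure 1 is given below. The function name, the number of replications and the segment lengths are illustrative choices and not the exact settings used to produce the figure.

```r
set.seed(1)
mspe_tvar2 <- function(T, N_loc, nrep = 2000) {
  a1 <- function(u) 0.15 + 0.15 * u
  a2 <- function(u) 0.25 - 0.15 * u
  err <- matrix(NA_real_, nrep, 4,
                dimnames = list(NULL, c("AR(1)", "AR(2)", "tvAR(1)", "tvAR(2)")))
  t0 <- floor(0.9 * T)                       # time point to be predicted
  for (r in 1:nrep) {
    x <- numeric(T); z <- rnorm(T)
    x[1] <- z[1]; x[2] <- a1(2 / T) * x[1] + z[2]
    for (t in 3:T) x[t] <- a1(t / T) * x[t - 1] + a2(t / T) * x[t - 2] + z[t]
    yw <- function(seg, p) {                 # Yule-Walker coefficients of order p
      g <- acf(seg, lag.max = p, type = "covariance",
               plot = FALSE, demean = FALSE)$acf[, 1, 1]
      solve(toeplitz(g[1:p]), g[2:(p + 1)])
    }
    past_all <- x[1:(t0 - 1)]                # 'stationary' fit: all past observations
    past_loc <- x[(t0 - N_loc):(t0 - 1)]     # 'locally stationary' fit: last N_loc obs.
    for (p in 1:2) {
      err[r, p]     <- (x[t0] - sum(yw(past_all, p) * x[(t0 - 1):(t0 - p)]))^2
      err[r, p + 2] <- (x[t0] - sum(yw(past_loc, p) * x[(t0 - 1):(t0 - p)]))^2
    }
  }
  colMeans(err)                              # empirical MSPE of the four predictors
}
mspe_tvar2(T = 50,  N_loc = 25)              # the simple AR models tend to win here
mspe_tvar2(T = 500, N_loc = 200)
```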
The important observation to be made here, thus, is that finding the 'right' model may not always be a suitable way of proceeding when it comes to the prediction of future observations. We point out that this observation has been made in other contexts of time series analysis. For example, basic exponential smoothing is a widely used forecasting and trend extrapolation technique, and although it is well known that it corresponds to standard Box-Jenkins forecasting in the ARIMA(0, 1, 1) model, it is also frequently used for data that do not follow this model.
This paper investigates the question of what is the best model in terms of forecasting performance in the context of the choice between stationarity and non-stationarity. Asking this question explicitly, instead of applying a test for stationarity, is important since the smallest sample size T needed to reject the null hypothesis of stationarity may be smaller than the sample size needed to obtain an improvement in terms of our task of interest, namely forecasting. In the following section, we will elaborate more on this question. Further, in Section 4, we present the results of a simulation study showing under which conditions using the true model is advantageous and when it becomes disadvantageous.
When (not) to use locally stationary models under local stationarity: the new model choice methodology

Precise description of the procedure
We work in the framework of general locally stationary time series (a rigorous definition is deferred to Appendix A), in which the available data are a finite stretch $X_{1,T}, \ldots, X_{T,T}$ from an array $(X_{t,T})_{t \in \mathbb{Z}, T \in \mathbb{N}^*}$ of random variables with mean zero and finite variances. Our aim is to determine a linear predictor for the unobserved $X_{T+h,T}$ from the observed $X_{1,T}, \ldots, X_{T,T}$.
Our proposal is to compare candidate h-step ahead predictors in terms of their empirical mean squared prediction error and choose the predictor with the best forecasting performance. To this end, we proceed as follows:
Step 1. Separate the final 2m observations from the T available observations. The observations with indices $M_0 := \{1, \ldots, T - 2m\}$, $M_1 := \{T - 2m + 1, \ldots, T - m\}$ and $M_2 := \{T - m + 1, \ldots, T\}$ will be referred to as the (main part of the) training set, the final part of the training set and the validation set, respectively. The set of unobserved data with indices $M_3 := \{T + 1, \ldots, T + m\}$ will be referred to as the test set. The size m of the separated sets will be small in comparison to the sample size T (and hence also to the training set).
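In R, the index sets of Step 1 can be written down directly; T and m below are placeholders.

```r
# Illustrative split of T observations into training, validation and test indices
T <- 1000; m <- 50
M0 <- 1:(T - 2 * m)            # main part of the training set
M1 <- (T - 2 * m + 1):(T - m)  # final part of the training set
M2 <- (T - m + 1):T            # validation set
M3 <- (T + 1):(T + m)          # test set (not yet observed)
```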
Step 2. Compute the linear 1-step ahead prediction coefficients $\hat a^{(p)}_{N,T}(t)$, defined in (1), by solving the empirical Yule-Walker equations based on the localised autocovariance estimates of (3), computed from the segment of $N$ observations ending at time $t$, for lags $k = 0, \ldots, p_{\max}$. The parameters $p_{\max} \ge 1$ and $\mathcal{N} \subset \{p_{\max} + 1, \ldots, T - m - h\}$ are to be specified by the user. Comments on how they are to be chosen are deferred to Section 3.2.
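The sketch below is a simplified stand-in for the estimator (1): it computes localised Yule-Walker coefficients of order p from the N observations ending at time t, using empirical autocovariances as in (3). The function and argument names are ours and do not reflect the interface of forecastSNSTS.

```r
# Localised Yule-Walker coefficients of order p from the segment X_{t-N+1}, ..., X_t.
local_yw <- function(x, t, N, p) {
  seg <- x[(t - N + 1):t]
  gam <- sapply(0:p, function(k)                    # local autocovariances, cf. (3)
    mean(seg[(1 + k):N] * seg[1:(N - k)]))
  as.numeric(solve(toeplitz(gam[1:p]), gam[2:(p + 1)]))
}

# Example: coefficients at the end of a simulated AR(2) series
set.seed(1)
x <- arima.sim(model = list(ar = c(0.3, 0.2)), n = 1000)
local_yw(x, t = 1000, N = 250, p = 2)
```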
Note that $\hat f^{\,\mathrm{ls}}_{t,h}$ and $\hat f^{\,\mathrm{s}}_{t,h}$ are the forecasts of type (5) and (6) that minimise the empirical MSPE (on $M_1$) within the classes of tvAR and AR models of orders $p = 0, 1, \ldots, p_{\max}$, respectively.
Step 6. Use $\hat f^{\,\mathrm{ls}}_{t,h}$ as the h-step ahead forecast of $X_{t+h}$, with $t + h > T$, if the decision rule (7), $\hat R^{\,\mathrm{s,ls}}_{T,j}(h) := \mathrm{MSPE}^{\mathrm{s}}_{T,j}(h) / \mathrm{MSPE}^{\mathrm{ls}}_{T,j}(h) \ge 1 + \delta$, holds for $j = 2$, and $\hat f^{\,\mathrm{s}}_{t,h}$ otherwise. Here $\mathrm{MSPE}^{*}_{T,j}(h)$, defined in (8), denotes the empirical mean squared error of h-step ahead prediction on the set $M_j$, with $*$ indicating the corresponding model (we write 'ls' for the locally stationary approach and 's' for the stationary model), and $\delta \ge 0$ is a parameter by which the user of the procedure specifies which degree of superiority of the more complex procedure is required before it is preferred over the simpler alternative (cf. the end of Section 3.2).
By Theorem 3.1 we have that, with an appropriately chosen δ, the decision rule of type (7) will, with high probability, prefer the same models on the validation and on the test sets.
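Given the empirical MSPEs of the two selected predictors on the validation set, the decision rule of Step 6 amounts to comparing their ratio with 1 + δ, as in the following sketch (names and numbers are illustrative):

```r
# Decision rule (7): prefer the locally stationary forecast only if it beats the
# stationary one by a factor of at least 1 + delta on the validation set.
choose_approach <- function(mspe_s, mspe_ls, delta = 0) {
  if (mspe_s / mspe_ls >= 1 + delta) "locally stationary" else "stationary"
}
choose_approach(mspe_s = 1.10, mspe_ls = 1.00, delta = 0.05)  # "locally stationary"
choose_approach(mspe_s = 1.03, mspe_ls = 1.00, delta = 0.05)  # "stationary"
```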

Remarks on the procedure
Some further explanations regarding the procedure are in order now. Our comments are organised according to the steps of the previous section.
Step 2. The coefficients (1) are estimates of the coefficient functions $a_1(t/T), \ldots, a_p(t/T)$ in the tvAR(p) model $X_{t,T} = \sum_{j=1}^{p} a_j(t/T) X_{t-j,T} + \sigma(t/T) Z_t$, $t = 1, \ldots, T$, (9) (see, for example, Dahlhaus and Giraitis (1998)). Recall that $Z_t$ is usually assumed to be white noise and that $X_{t,T}$ is non-stationary if at least one of the functions $a_j$, $j = 1, \ldots, p$, or $\sigma$ is non-constant.
We are interested in linear forecasts that will perform well for time series possessing a general dependency structure. The tvAR(p) model (9) is a natural choice to approximate the linear dynamics of the observed, non-stationary time series. This is the case because the coefficient functions at time $t/T$ coincide with the 1-step prediction coefficients (of order p) which define the best linear predictor. In Section 6, we show that $\hat a^{(p)}_{N,T}(t)$ consistently estimates the coefficients $(a_1, \ldots, a_p)$ minimising the local prediction error $E\big(X_{t,T} - \sum_{j=1}^{p} a_j X_{t-j,T}\big)^2$, also when the observations do not satisfy (9). A forecasting procedure derived within the tvAR(p) model can therefore be expected to behave reasonably, irrespective of whether the tvAR(p) model is true or 'just' an approximation to the truth. Note that we use the tvAR(p) model to approximate the dynamic structure of the data in Section 3.2 and most of our examples in Section 4 are of this kind, but we do not assume that the data actually satisfy it.
Step 3. Linear h-step ahead predictors can either be obtained by iterating the model equation (9) or by using a separate model for each h in which the indices of the sum on the right hand side run from j = h, . . . , p + h − 1. These approaches have been referred to as the plug-in method and the direct method, respectively. A comparison of the two approaches can, for example, be found in Bhansali (1996), where results for a class of linear, stationary processes were derived. We employ the plug-in method.
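For an AR(p) (or, locally, a tvAR(p)) fit, the plug-in h-step prediction coefficients can be obtained as the first row of the h-th power of the companion matrix, as sketched below (illustrative code, assuming p ≥ 2; this is not the package's own routine):

```r
# h-step ahead prediction coefficients via the plug-in method:
# first row of the h-th power of the AR(p) companion matrix.
plugin_coef <- function(a, h) {            # a = (a_1, ..., a_p), 1-step coefficients
  p <- length(a)
  A <- rbind(a, cbind(diag(p - 1), 0))     # companion matrix (assumes p >= 2)
  Ah <- diag(p)
  for (i in seq_len(h)) Ah <- Ah %*% A
  Ah[1, ]                                  # coefficients of X_t, ..., X_{t-p+1}
}
plugin_coef(a = c(0.5, 0.2), h = 1)        # reproduces (0.5, 0.2)
plugin_coef(a = c(0.5, 0.2), h = 3)        # 3-step ahead coefficients
```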
Step 4. From the previous comments it can be seen how the predictors f ls t,h;p,N and f s t,h;p relate to the choice of modelling the time series' dynamics by a tvAR(p) or AR(p) model, respectively. In each of these model classes, increasing the order p will give a better approximation of the dynamics, but increase the complexity of the model, and make it more difficult to deal with under parameter uncertainty.
The parameters $p_{\max}$ and $\mathcal{N} \subset \{p_{\max} + 1, \ldots, T - m - h\}$ are to be chosen by the user and should depend on T. An element $N \in \mathcal{N}$ determines the degree of locality in the estimation of the coefficients. The parameters $p = 1, \ldots, p_{\max}$ and $N \in \mathcal{N}$ will influence the degree of bias and variance of the predictor. Our selection mechanism will balance them implicitly.
The parameter p is the order of the tvAR(p) model used to approximate the dynamics. A larger p will give a better approximation, but will also be more difficult to estimate. A larger N will most likely reduce the variance, but, apart from exceptional cases, inflict a bias if non-stationarity is present.
Traditional choice of N. It is obvious that the variance of the estimator can decrease when a larger segment is used, but the non-stationarity will potentially inflict an additional bias that increases with N. Under the condition that $N/T + T/N^2 = o(1)$, Dahlhaus and Giraitis (1998) derive asymptotic expansions for the local Yule-Walker estimator's bias and variance. They conclude that N should be chosen of the order $T^{4/5}$, with the constant depending on the second derivatives of the true model quantities, which are unknown and difficult to estimate. The choice of $\mathcal{N}$ should thus, ideally, be such that N is of the order $T^{4/5}$, for all $N \in \mathcal{N}$. In practice, since the true model parameters are unknown, this rate provides very little guidance to the user of the method. We therefore recommend adhering to two principles: the lower and upper bounds of $\mathcal{N}$ should be bounded away from 0 and T, respectively. In other words, we recommend choosing $\mathcal{N}$ with $\min \mathcal{N}$ 'large enough', for the performance guarantee to be valid (cf. Theorem 3.1), and $\max \mathcal{N}$ 'visibly smaller' than T, to ensure that there is a clear boundary between the locally stationary and the stationary approach.
Traditional choice of p. As described at the beginning of this section, we use the tvAR(p) model to approximate the dynamic structure of the data. Intuitively, the larger the order p, the better the approximation to the true dynamic structure. In contrast to the previously discussed question of how to choose the segment length N, here a smaller p will inflict a modelling bias, while a larger p will typically be accompanied by an inflation of the variance of the estimation, because more parameters need to be estimated. Traditionally, the model order is chosen by minimising information criteria such as, for example, the AIC or BIC. Claeskens et al. (2007) propose to use a version of the focused information criterion (FIC, see Claeskens and Hjort (2003)) to select the model order of a stationary AR(p) model optimal with respect to forecasting when the true model is known to be AR(∞). However, as mentioned in the introduction, the FIC-based methods employ an estimator of the asymptotic MSPE, while our approach is based on the empirical MSPE, which facilitates our focus on the finite-sample performance.
Steps 5 and 6. Our procedure performs two stages of selection via a training step followed by a validation step, as is common in machine learning. Firstly (in Step 5), it selects the model order p and, for the locally stationary approach, the segment length N by comparing predictors within each class of models under consideration (i. e., time-varying or non-time-varying autoregressive models). The parameters p and N are chosen such that the empirical MSPE (predicting $M_1$) is minimised. Secondly (in Step 6), a final competition of the winners is performed to select among the two classes of models. The procedure that minimises the empirical MSPE (predicting $M_2$) is selected and used for forecasting the test set ($M_3$).
In our theoretical analysis of the next section (see, in particular, Theorem 3.1) we show that the proposed procedure will, with high probability, choose the same class of models on the validation as on the test set, implying that the procedure with the best empirical performance will be selected.
The parameter δ. We choose the predictor derived from the more complex, locally stationary model only if it performs at least δ · 100% better in forecasting the observations from the validation set than the simpler, stationary model. The user-selected parameter δ ≥ 0 specifies how much, in percentage terms, the user wishes to gain before deciding to use the more involved locally stationary approach. δ = 0 corresponds to the decision rule whereby the time-varying model is chosen if it performs better or equally well on the validation set. If we choose δ such that 1 + δ is larger (or smaller) than the true superiority of the locally stationary procedure over the stationary approach in terms of MSPE, then our procedure will, with high probability, choose the predictor with the best empirical performance on the test set (cf. Section 3.3).

Theoretical results
In this section, we establish theoretical results that will facilitate our analysis of the model choice suggested by decision rule (7). We show that the probability of choosing different models on the validation and the test set decays to zero at an exponential rate, which can be viewed as a "performance guarantee" of our model choice methodology.
Our results hold under reasonable assumptions on the observed data. We now provide the reader with an idea of the generality of the result by commenting briefly on each of the five main assumptions, the rigorous statement of which and further, more detailed comments are deferred to Appendix A so as not to impair the flow of the paper.
Heuristics of the main assumptions. By Assumption 1 the data are required to be locally stationary. The notion of local stationarity we impose goes beyond that of locally stationary linear processes and, in particular, we do not require the data to be tvAR. The constant C ≥ 0 in Assumption 1 is associated with the degree of non-stationarity and vanishes for second-order stationary processes. Assumption 2 is that the data be α-mixing at an exponential rate, which we require to prove that the probability in our result vanishes at an exponential rate as well. In Assumptions 3 and 4 we require that the local spectral density $f(u, \lambda)$ be bounded from below, by $m_f > 0$, and from above, by $M_f < \infty$, and that $f(\cdot, \lambda)$ be continuously differentiable with $|\partial f(u, \lambda)/\partial u|$ being bounded, respectively. Finally, in Assumption 5, we require the moments to satisfy a Bernstein-type condition involving the constants c and d. The assumptions are non-restrictive in the sense that many popular and widely used time series models (tvARMA, tvGARCH, etc.) satisfy the full set of assumptions.
We are now ready to state the main result that guarantees that our procedure will, with high probability, choose the predictor that achieves the best empirical performance on the test set.
In condition (11), we require the sample size T to be above a threshold. This requirement is for technical reasons in our proof and, in fact, our simulation results in Section 4 suggest that the probability bounded in Theorem 3.1 will be large also for T smaller than the threshold, as long as δ is chosen appropriately. More precisely, we require that $f(\delta)$ be non-zero; the larger it is, the larger the probability will be. The quantity $f(\delta)$ is constructed to measure the difference between the MSPEs of the stationary predictors for different $p_1$ and the MSPEs of the locally stationary predictors for different $(p_2, N)$, scaled by a factor of $1 + \delta$. It is obvious that a condition like this is required for consistency of the procedure, because if there is no difference in performance either approach may equally likely be chosen. Note that in the situation where both approaches perform equally well we do not need the selection to be consistent. Further, note that the condition in our result is slightly stronger than necessary, as we do not only require the best performing procedures to perform differently for the $p_1$ and $(p_2, N)$ that yield the best result, but we require it for any combination. This is due to our method of proof.
The quantity $f(\delta)$ depends on the model under consideration and, as $p_{\max}$ and $|\mathcal{N}|$ get larger, may potentially tend to zero. Thus, to employ Theorem 3.1 in practice, one has to analyse whether the bound, which scales with $p_{\max}^2 |\mathcal{N}|$, will vanish and, if so, at what rate this happens. We expect that, if $p_{\max}$ is kept fixed, $\min \mathcal{N} \to \infty$ and $\max \mathcal{N} = o(T)$, then, in many cases, $f(\delta)$ can be bounded away from zero. Numerical illustrations which suggest this are provided in Section 4.
To illustrate the usefulness of Theorem 3.1, we derive two corollaries of the result. The corollaries consider a simplified version of our procedure in which the model order is always chosen as p = 1, for both the stationary and the locally stationary approach. Restricting ourselves to p = 1 allows us to formulate the equivalent of Theorem 3.1 that rephrases assumption (10) in more explicit terms. To make the modified procedure precise: we now consider the procedure where, after Steps 1-4, as described in Section 3.1, we continue with Steps 5' and 6', as follows:
Step 5'. Amongst the predictors (5) select $\hat f^{\,\mathrm{ls}}_{t,1} := f^{\,\mathrm{ls}}_{t,1;\hat p_{\mathrm{ls}},\hat N}$, with $\hat p_{\mathrm{ls}} := 1$ and $\hat N$ chosen as in Step 5, and, amongst the predictors (6), select $\hat f^{\,\mathrm{s}}_{t,1} := f^{\,\mathrm{s}}_{t,1;\hat p_{\mathrm{s}}}$, with $\hat p_{\mathrm{s}} := 1$. Note that $\hat f^{\,\mathrm{ls}}_{t,1}$ and $\hat f^{\,\mathrm{s}}_{t,1}$ are the forecasts of type (5) and (6) that minimise the empirical MSPE (on $M_1$) within the classes of tvAR(1) and AR(1) models, respectively.
Step 6'. Use $\hat f^{\,\mathrm{ls}}_{t,1}$ as the 1-step ahead forecast of $X_{t+1}$, with $t + 1 > T$, if $\hat R^{\,\mathrm{s,ls}}_{T,j}(1) \ge 1 + \delta$ holds for $j = 2$, and $\hat f^{\,\mathrm{s}}_{t,1}$ otherwise. The modified procedure, based on Steps 1-4 and 5'-6', is then consistent (a) if δ is chosen large enough or (b), in case the true model is non-stationary, if δ is chosen small enough.
To make the non-stationarity requirement in the second case precise, we define $D_{\sup}$ and $D_{\inf}$ as the maximum and minimum difference between the ratio of the local auto-covariances $\gamma_0(u)$, $\gamma_1(u)$, from Assumption 1, at any point u of the validation set, and the ratio of the averaged auto-covariances (averaged across the training set). Note that $D_{\sup} \le 2$ and that $D_{\inf}$ is a measure of the non-stationarity of the training set. In particular, it will vanish if the data stem from a stationary process. We will further require that the local autocorrelations of order 1 for the validation set be bounded, and to this end define the bound $\rho$ in (14). It is important to observe that, while the quantity $f(\delta)$ in Theorem 3.1 depends on $\mathcal{N}$, the D's only depend on m and T and the first and second order local moments ($\gamma_1$ and $\gamma_0$). The definition does not require that the data actually stem from a tvAR(1) process, but if the true model were indeed a tvAR(1) with coefficient function a, then $\gamma_k(u) = a(u)^k/(1 - a(u)^2)$. In this important class of models the integrals in the first term of the D's are thus explicit in terms of the function a.

We now state two results about the modified procedure of 1-step ahead forecasting. The first result illustrates that the procedure will always be consistent if δ is chosen large enough:

Corollary 3.3 Let $(X_{t,T})_{t \in \mathbb{Z}, T \in \mathbb{N}^*}$ satisfy Assumptions 1-5 (Appendix A) and $EX_{t,T} = 0$, and let $\rho$ be defined as in (14), with $\gamma_k(u)$, $k = 0, 1$, from Assumption 1, $m_f$ from Assumption 3 and $M_f$ from Assumption 4. Then, for T large enough to satisfy condition (11) with $p_{\max} = h = 1$ and $f(\delta) := \delta \pi m_f (1 - \rho^2)$, with $\hat R^{\,\mathrm{s,ls}}_{T,i}(1)$, $i = 1, 2$, defined in (12), the procedure chooses, with high probability, the same model on the validation and on the test set; the exponential bound is of the same form as in Theorem 3.1.

Furthermore, we have as a second result that, if the true model is non-stationary in the sense that the quantity $D_{\inf}$ is large compared to N/T for all $N \in \mathcal{N}$, then we also have consistency for δ's that are small enough:

Corollary 3.4 Let $(X_{t,T})_{t \in \mathbb{Z}, T \in \mathbb{N}^*}$ satisfy Assumptions 1-5 (Appendix A), $EX_{t,T} = 0$, and $D_{\inf} > 0$, with $D_{\inf}$ defined in (15). Let $m \in \mathbb{N}^*$, $\mathcal{N} \subset \{2, \ldots, T - m - 1\}$, and $\delta \le \tfrac{1}{8} D_{\inf}^2$. Then, with $m_f$ from Assumption 3 and $M_f$ from Assumption 4, and provided T is large enough to satisfy condition (11) with $p_{\max} = h = 1$ and $f(\delta) := \pi D_{\inf}^2 m_f / 2$, with $\hat R^{\,\mathrm{s,ls}}_{T,i}(h)$, $i = 1, 2$, defined in (12), the same conclusion holds.

Note that, in Corollaries 3.3 and 3.4, to show that the ε of $P^{(1,1)}_{m, \min \mathcal{N}}(\varepsilon)$ is bounded away from zero, it suffices to bound $D_{\inf}$, and Remark 3.2 then fully applies, yielding the exponential bound.

Simulations
In this section we discuss finite-sample properties of the estimates $\hat R^{\,\mathrm{s,ls}}_{T,i}(h)$, defined in (7), and their population counterparts $R^{\,\mathrm{s,ls}}_{T,j}(h) := E(\mathrm{MSPE}^{\mathrm{s}}_{T,j}(h)) / E(\mathrm{MSPE}^{\mathrm{ls}}_{T,j}(h))$. The simulation study was conducted with the R package forecastSNSTS (R Core Team, 2016; Kley et al., 2016), available from The Comprehensive R Archive Network (CRAN). In particular, we investigate the performance of decision rule (7). To this end, we have considered 15 different tvAR models. Three of the models are stationary, the other 12 are non-stationary. Amongst the non-stationary processes we have some where the covariance structure changes quickly and some where the covariances change slowly. Further, we have examples where the processes given by the parameters at some local time u are almost unit root and some where they are not.
For each of the models we proceed as follows. We simulate sequences of length $T + m = n \in \{100, 200, 500, 1000, 2000, 4000, 6000, 8000, 10000\}$. The T + m observations, with T and m as in Section 3, contain the training, validation and test sets. We separate the test and validation sets of length $m := \lfloor n^{0.85}/4 \rfloor$. Thus, $n_i := n - (3 - i)\lfloor n^{0.85}/4 \rfloor$, $i = 0, \ldots, 3$, mark the end indices of the main part of the training set, the final part of the training set, the validation set and the test set, respectively. We have chosen m as a function of n in such a way that m = o(n) and m → ∞, as n → ∞. The sizes of the three sets therefore are 12, 22, 49, 88, 159, 288, 406, 519, and 627 for the different sequence lengths, respectively.
As described in Section 3.1 we then, for every $h = 1, \ldots, H := 10$, determine linear h-step ahead predictions for $X_{t+h,T}$ with $t + h \in \{n_0 + 1, \ldots, n_1\}$. We determine the 'stationary predictions' using the coefficients $\hat v^{(p,h)}_{t,T}(t)$ from Step 3 of the procedure, estimated, for a given $p = 0, \ldots, p_{\max} := 7$, from $X_{1,T}, \ldots, X_{t,T}$. For simplicity, we have chosen the same $p_{\max}$ for every T. We further determine 'locally stationary predictions' where the coefficients $\hat v^{(p,h)}_{N,T}(t)$ are used, for $p = 0, \ldots, p_{\max}$ and $N \in \mathcal{N}$, with $N_{\min} := (n/2)^{4/5}$ and $N_{\max} := n^{4/5}$. Note that $\mathcal{N}$ always contains 25 (or fewer) equally spaced elements, the smallest being $N_{\min}$. We limit the number of elements in $\mathcal{N}$ to reduce computation time and point out that the results did not change significantly when a larger number of elements was used. We then compare the predictors with respect to their empirical mean squared prediction error (MSPE) on the final part of the training set and, according to Step 5 of the procedure, choose the stationary predictor with $\hat p_{\mathrm{s}}$ that minimises the MSPE on $M_1$ amongst all stationary predictors and the locally stationary predictor with $(\hat p_{\mathrm{ls}}, \hat N_{\mathrm{ls}})$ that minimises the MSPE on $M_1$ amongst all locally stationary predictors.
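The set sizes and the grid of segment lengths used here can be reproduced as follows; the exact rounding of the grid points is an assumption on our part.

```r
n <- c(100, 200, 500, 1000, 2000, 4000, 6000, 8000, 10000)
m <- floor(n^0.85 / 4)   # sizes of the validation and test sets: 12, 22, 49, 88, ...
m

# Grid of segment lengths for the locally stationary predictors (shown for n = 1000);
# 25 equally spaced values between (n/2)^{4/5} and n^{4/5}, rounding assumed by us.
n1     <- 1000
N_grid <- unique(round(seq((n1 / 2)^(4 / 5), n1^(4 / 5), length.out = 25)))
N_grid
```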
For those two predictors we then determine the empirical mean squared prediction errors $\mathrm{MSPE}^{*}_{T,2}$ and $\mathrm{MSPE}^{*}_{T,3}$, defined in (8), on the validation and test set, respectively. We record seven pieces of information: $\hat p_{\mathrm{s}}$, $\hat p_{\mathrm{ls}}$, $\hat N_{\mathrm{ls}}$, $\mathrm{MSPE}^{\mathrm{s}}_{T,2}$, $\mathrm{MSPE}^{\mathrm{ls}}_{T,2}$, $\mathrm{MSPE}^{\mathrm{s}}_{T,3}$, and $\mathrm{MSPE}^{\mathrm{ls}}_{T,3}$. We replicate the experiment 10000 times.

Now we define the models; the innovations $Z_t$ are i. i. d. Gaussian white noise in all 15 models. The first two tvAR(1) models, (17) and (18), are defined by two periodic coefficient functions. We then look at six tvAR(1) models, (19)-(24), where the AR coefficient increases linearly. Further, we consider the stationary AR(1) model with coefficient −0.6, model (25), and two models with independent observations, one without and one with heteroscedasticity, models (26) and (27), respectively. We also consider the tvAR(2) model from Dahlhaus (1997), model (28), and one of its tangent processes, model (29) (cf. the comment after Assumption 1). Finally, we consider two tvAR(1) models, (30) and (31), where the coefficient decreases linearly. We now discuss our analysis of process (17), the first of the models, in detail and comment briefly on the outcome for the other processes, with tables and figures for the other models being deferred to Appendix I (Online Supplement).
In Figure 2, note that, since in the numerator we have the MSPE for the best stationary predictor and in the denominator the MSPE for the best locally stationary predictor, a ratio above 1 corresponds to the situation where the best locally stationary predictor outperforms the best stationary predictor. From the figure it can be seen whether this happens on average, while in Table 2 we can see the proportion of simulated cases in which this happens. In Figure 2, we thus observe that, for n = 100, the stationary approach performs better on average across all values of h on both the test and validation set. For n = 200 the locally stationary approach performs better for 3 ≤ h ≤ 6 on the test set, while the stationary approach still excels for all h on the validation set. For n ≥ 500 the locally stationary approach is better across all values of h on the test set and for 2 ≤ h ≤ 4 it outperforms the stationary approach on the validation set. For n ≥ 1000 the locally stationary approach is always at least as good as the stationary approach for all h. It is striking that, for this particular model and for the larger n's, as h gets larger the two approaches (stationary and locally stationary) perform almost equally well on average, which can be seen from the lines in Figure 2 being close to one. Another important observation is that, as n gets larger and m/n gets smaller, the lines for the validation and test set converge, which is in line with what Theorem 3.1 suggests should happen.
We now briefly compare the outcome of model (17) to that of model (18); details are shown in Appendix I (Online Supplement). Note that in model (17) the coefficient function ranges from 0.61 to 0.99, placing some of its tangent processes close to the unit root. In model (18) the coefficient function ranges from 0.11 to 0.49. Thus, the two models have the same variation of the coefficient function, but in model (18) the tangent processes are further away from the unit root. In Figure 9, it can be seen that the stationary approach is preferred over the locally stationary approach for sequences up to length n = 1000. Further, we observe that the advantage of using the locally stationary approach for sequences of length n ≥ 4000 is minuscule and visible only for 1-step ahead forecasting. For the other models we can make similar observations.

Rules of thumb. The locally stationary approach outperforms the stationary approach only if either the sequence is long, or the coefficient function varies a lot, or the tangent processes are close to the unit root. In any other case the stationary approach can be chosen without (a large) loss.
It is interesting to compare the observed proportions with the corresponding value of $f(\delta)$, which we provide in Table 4. We see that a larger proportion typically goes along with a larger value of $f(\delta)$, indicating the relevance of condition (10). To make it more precise: the tables are concerned with the proportion of runs for which the decision rule (7) yields the same result no matter whether we take i = 2 or i = 3, i. e. we count the proportion of runs satisfying either ($\hat R^{\,\mathrm{s,ls}}_{T,2}(h) \ge 1 + \delta$ and $\hat R^{\,\mathrm{s,ls}}_{T,3}(h) \ge 1 + \delta$) or ($\hat R^{\,\mathrm{s,ls}}_{T,2}(h) < 1 + \delta$ and $\hat R^{\,\mathrm{s,ls}}_{T,3}(h) < 1 + \delta$).
We see that if δ is chosen appropriately the probability of the event (32) approaches 1, as T and m increase. More precisely, this is the case if δ is chosen smaller than the ratio of MSPEs depicted in Figure 2 on both the validation and test set, or larger than both those ratios. This is as expected from Corollaries 3.3 and 3.4. A more detailed analysis is possible, employing the information provided in Table 1. In the third row of tables we see, for example, that for n = 10000 and δ = 0 the procedure will consistently choose the locally stationary approach on both the test and validation set for 1-step ahead forecasting. For n = 10000 and δ = 0.05, on the other hand, we see that the procedure almost consistently chooses the locally stationary approach on the validation set while it is rather undecided (50%-50%) on the test set. For δ = 0.1 the procedure almost consistently chooses the stationary approach on the test set and is to some degree undecided (70%-30%) on the validation set. Finally, if δ = 0.15, we see that the stationary approach gets chosen almost consistently on both validation and test sets. This is just what we would expect, as a smaller δ must lead to the locally stationary approach being preferred more often, since the more complex locally stationary approach only gets selected if the empirical MSPE of the stationary approach is at least (1 + δ) times the empirical MSPE of the locally stationary approach.
We now briefly summarise our findings for the other models and refer to Appendix I (Online Supplement) for further details.
For the different models we observe that for n = 100 the best stationary predictor usually (in all models except the strongly non-stationary (30) and (31)) outperforms the locally stationary approach, which can be seen from the figures, where the lines are below 1, and also in the proportions from the tables. In 11 of the 15 models the 1-step ahead stationary predictor yields the better performance in more than 65% of the cases. Even in the highly non-stationary models (17), (23), (30), and (31), where the tangent processes for the observations to be predicted are close to the unit root, the stationary approach still yields the better performance in more than 47% of the cases. As a general conclusion we thus see that locally stationary modelling might not be advantageous for forecasting short time series. Further, for those models we can see that the locally stationary approach results in a significantly reduced MSPE as the sample size increases. For example, with model (23) and n = 1000 the 1-step ahead locally stationary approach outperforms the stationary approach by about 20% on average and in almost 92% of the cases.
We further observe that in all of the figures the plots in the bottom row, corresponding to n = 4000, 6000, 8000 and 10000, are of a generally similar shape (for model (28) we see this only for n ≥ 6000). For an explanation note that, as n grows larger, m/n tends to zero and consequently the data in the validation and test set behave more similarly. For n = 100, for example, the validation and test set make up 24% of the data, while for n = 10000 it is only 12.54%. We observe two different kinds of shapes: (a) for some of the models, the locally stationary and the stationary approach appear to perform similarly well as h increases, and (b) in other models we find that the locally stationary approach outperforms the stationary approach even more as h increases. More precisely, for models (17)-(20), (28), (30), and (31), we see that the locally stationary approach excels for small h and both approaches are almost equally good as h gets larger, while for models (25)-(27) the stationary approach is advantageous for smaller h and the two approaches are of similar performance for larger h. In model (29) the stationary procedure is always better (uniformly with respect to h), but as T gets larger the advantage is less pronounced. Note that models (25), (26), and (29) are stationary and (26) and (27) are the models with independent observations. It is thus not very surprising that the stationary approach outperforms the locally stationary one there. Finally, in models (21)-(24), where the tangent processes for the observations from the validation and test set are close to the unit root, we see that as h gets larger the ratio of MSPEs first increases and then (slowly) decreases. These observations corroborate the rules of thumb we have given before.

London Housing Prices
We analyse average housing prices from the UK House Price Index (HPI). The HPI is updated monthly with data from the Land Registry, the Registers of Scotland, and the Land and Property Services Northern Ireland. The data are combined by the Office for National Statistics using hedonic regression. The sequence we used for the analysis contains 252 monthly index values from 1995 to 2015. The data are depicted in the left panel of Figure 3. For the analysis we consider T + m = 251 monthly changes (in percent), centred by subtracting the arithmetic mean prior to the analysis. In the right panel of Figure 3 we clearly see autocorrelation at lags less than or equal to 4 and at lag 12.
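Assuming the raw index values are available as a numeric vector, the centred monthly changes underlying the analysis can be computed as in the following sketch; placeholder data are used here in place of the actual HPI values.

```r
# hpi: 252 monthly UK HPI values (1995-2015); replaced here by placeholder data
set.seed(1)
hpi <- cumprod(c(100, 1 + rnorm(251, mean = 0.004, sd = 0.01)))

pct_change <- 100 * diff(hpi) / head(hpi, -1)   # 251 monthly changes in percent
x <- pct_change - mean(pct_change)              # centre by the arithmetic mean
acf(x, lag.max = 24)                            # inspect the serial dependence
```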
We then compute the 1-step to 6-step prediction coefficients, defined in (4), with which we can predict an observation $X_{t+h}$ from $X_t, \ldots, X_{t-p+1}$, where $X_{t+h}$ is an observation made either in 2013, 2014 or 2015, respectively. We choose $p = 0, 1, \ldots, 18$, where p = 0 means that we predict by 0 (i. e., by the overall mean, after centring). Note that the maximum p was chosen larger than 12, as we are dealing with monthly data and dependence at lag 12 can be seen from the autocorrelation function. We consider the stationary predictors as well as locally stationary predictors with $N = 35, 36, \ldots, 84$, where $84 = \lceil 251^{4/5} \rceil$.
Interestingly, in Figure 4, we observe that the MSPEs of the locally stationary forecasts are typically larger than the corresponding ones of the stationary forecasts. Further, note that, amongst the stationary forecasts, the mean squared prediction errors for both the 1-step and 6-step ahead predictors with p = 0 (shown as a grey solid horizontal line) are smaller than those for 0 < p < 12, but larger than those with 12 ≤ p ≤ 15. Amongst the locally stationary estimators this pattern is not visible.
As described in our procedure we now determine the $\hat p_{\mathrm{s}}$, $\hat p_{\mathrm{ls}}$, and $\hat N$ that minimise the MSPE within each class of predictors. For 1-step ahead prediction we find $\hat p_{\mathrm{s}} = 13$, $\hat p_{\mathrm{ls}} = 16$, and $\hat N = 56$. For 6-step ahead prediction we find $\hat p_{\mathrm{s}} = 14$, $\hat p_{\mathrm{ls}} = 13$, and $\hat N = 51$. The numbers are summarised in Table 5.
In conclusion, our analysis has revealed that, from the point of view of prediction of the 2015 observations, treating the data as stationary does not have a negative effect. We were able to see that using the estimates from the (stationary) AR(13) or AR(14) models (for month and half-year ahead forecasts, respectively) gave us better predictions than using the (locally stationary) estimates based on segments of roughly 4.5 years (56 and 51 months, respectively). In particular, the impact of, for example, the 2008-2009 financial crisis on the stationary estimates is not, as one might naively expect, profound enough to worsen the predictors' performance.

Temperatures Hohenpeißenberg
In this example, we analyse seasonally adjusted, daily temperature data collected at the meteorological observatory in Hohenpeißenberg (Germany). More precisely, we use n = T + m = 11315 observations of daily mean temperatures that were recorded between 1985 and 2015. The data are shown in the left panel of Figure 5. To eliminate the clearly visible trend and seasonality, we have fitted a harmonic linear regression model capturing the trend and annual variation. The red curve in the left panel of Figure 5 is the prediction of the fitted model. We then consider the residuals of this model, which are shown in the middle panel of Figure 5. The right panel of Figure 5 shows the autocorrelation function, which clearly indicates that serial dependence is present. Birr et al. (2016) analyse the same data set and fit a stationary ARMA(3,1) model to capture the serial dependence.
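A sketch of such a harmonic regression fit is given below. The placeholder data and the particular specification (a linear trend plus a single annual harmonic) are assumptions on our part; the exact model used in the paper is not reproduced here.

```r
# temp: daily mean temperatures at Hohenpeissenberg; day: day index 1, ..., n.
# Placeholder data are generated for illustration only.
set.seed(1)
n    <- 11315
day  <- seq_len(n)
temp <- 8 + 10 * sin(2 * pi * day / 365.25) + rnorm(n, sd = 3)

fit <- lm(temp ~ day + sin(2 * pi * day / 365.25) + cos(2 * pi * day / 365.25))
resid_temp <- residuals(fit)   # the residual series analysed in the text
```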
In Figure 6, the MSPEs are presented in the same manner as in Section 5.1. In this example we have chosen $p_{\max} = 10$, $\mathcal{N} := \{875, 876, \ldots, 1749\}$ (ranging from roughly $n^{4/5}/2$ to $n^{4/5}$) and m := 365. The MSPE corresponding to p = 0 is 116 in this example and therefore not visible in the plot. Note further that for p = 1 the 1-step ahead forecasts perform far worse than those for p ≥ 2.
By minimising the empirical MSPE on the final part of the training set, the procedure chooses, for the stationary approach, $\hat p_{\mathrm{s}} = 5$ and $\hat p_{\mathrm{s}} = 6$ for h = 1 and h = 2, respectively. For the locally stationary approach the procedure chooses $(\hat p_{\mathrm{ls}}, \hat N) = (5, 1255)$ and $(\hat p_{\mathrm{ls}}, \hat N) = (5, 1235)$ for h = 1 and h = 2, respectively. Empirical MSPEs for other values of p and N are shown in Figure 6. The numbers are summarised in Table 6.
For 1-step ahead forecasting on the validation set this yields, given the $\hat p_{\mathrm{s}}$ chosen by the procedure, $\mathrm{MSPE}^{\mathrm{s}}_{10950,2}(1) = 7.21$ and, given the $\hat p_{\mathrm{ls}}$ and $\hat N$ chosen by the procedure, $\mathrm{MSPE}^{\mathrm{ls}}_{10950,2}(1) = 7.34$. Similarly, for 2-step ahead forecasting, we have $\mathrm{MSPE}^{\mathrm{s}}_{10950,2}(2) = 13.02$ and $\mathrm{MSPE}^{\mathrm{ls}}_{10950,2}(2) = 13.26$. Because the respective ratios are both smaller than 1, the procedure chooses the stationary approach over the locally stationary approach. Obviously, the stationary approach would continue to be chosen for any larger δ selected by the user.
On the test set we have $\mathrm{MSPE}^{\mathrm{s}}_{10950,3}(1) = 8.02$ and $\mathrm{MSPE}^{\mathrm{ls}}_{10950,3}(1) = 8.09$ for 1-step ahead forecasting. Likewise, for 2-step ahead forecasting, we have $\mathrm{MSPE}^{\mathrm{s}}_{10950,3}(2) = 14.77$ and $\mathrm{MSPE}^{\mathrm{ls}}_{10950,3}(2) = 14.93$. Thus, again, the stationary approach is superior for 1-step and 2-step ahead forecasting, and we see that the approach that our procedure chose, the one performing best on the validation set, is also performing best on the test set.
In conclusion, in this example we have provided clear evidence that the temperature data collected at the Hohenpeißenberg observatory should, from the point of view of prediction, be treated as if they were stationary. We see that using the estimates related to an AR(5), AR(6), or AR(1) model (for h = 1, h = 2, 3, 4, or h = 5, respectively) yielded the better forecasts on the validation set in all cases. In particular, using the estimates localised to the segment suggested by the procedure (the past roughly 3.5-4 years; 1234-1512 days), the performance was always worse. This observation is remarkable in the sense that, in 30 years of data, an analyst might typically expect non-stationarity (e. g., changes due to global warming). Further, our procedure consistently chose the approach with the better performance on the test set, except for the case h = 5, in which both approaches perform almost equally well. In this case (h = 5), choosing the stationary approach, though it simplifies the analysis, results in a performance loss of only 0.2%.

Volatility around the time of the EU referendum in the UK, 2016
This example is about forecasting the volatility of the FTSE 100 stock index. More precisely, we consider a sequence of n = T + m = 468 (daily) opening prices $p_{\mathrm{open}}$ and closing prices $p_{\mathrm{close}}$, dated from 2 January 2015 to 4 November 2016. The analysis is then based on the sequence $((p_{\mathrm{close}} - p_{\mathrm{open}})/p_{\mathrm{close}})^2$, centred by subtracting the arithmetic mean of this sequence. The data are shown in Figure 7. We first consider the empirical MSPEs of forecasting the 15 observations from the end of the training set, as a function of N. We can see, for the 1-step, 2-step and 3-step ahead forecasts, that the lines have a characteristic shape: as N increases from 30 to about 45-50 the MSPE is constant or slightly decreasing (for each p at a different level), from 45-50 to about 55-60 it rises, sharply and 'in turmoil', to a significantly higher level, and then it slowly decreases from around 65 onwards.
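The centred volatility proxy is straightforward to compute; in the sketch below the price vectors are placeholders standing in for the actual FTSE 100 data.

```r
# p_open, p_close: daily opening and closing prices of the FTSE 100 (n = 468);
# placeholder price paths are generated here for illustration only.
set.seed(1)
n       <- 468
p_close <- cumprod(c(6500, 1 + rnorm(n - 1, sd = 0.01)))
p_open  <- p_close * (1 + rnorm(n, sd = 0.005))

vol <- ((p_close - p_open) / p_close)^2   # squared relative intraday change
x   <- vol - mean(vol)                    # centred series used for forecasting
```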
Note that observations 1 through to 373 were recorded from 2 January 2015 to 23 June 2016 (the day of the EU referendum) and observations 374 through to 468 were recorded from 24 June 2016 to 4 November 2016. This implies that the final 95 observations were recorded after the EU referendum, meaning that there are 50 observations between the EU referendum and the observations to be forecast in the first step. Thus, the increase of the lines when N is roughly between 50 and 60 corresponds to an increased bias of the Yule-Walker estimator due to non-stationarity, when pre-referendum data start to be used for the estimation of the prediction coefficients. Another important observation is that, in the post-referendum part of the diagram (25 ≤ N ≤ 45), the data do not show a great degree of non-stationarity for the purpose of forecasting; in terms of the observed MSPE, we cannot distinguish them from stationary data. This can be seen from the fact that the MSPE is roughly constant or decreasing rather than 'U-shaped'. For all h the minimum empirical MSPE for forecasting the data from the end of the training set is obtained for N somewhere between around 30 and 50. More precisely, we choose $(\hat p_{\mathrm{ls}}, \hat N_{\mathrm{ls}})$ as (6, 35), (7, 47) and (7, 46) for h = 1, 2, 3, respectively, with corresponding minimal empirical MSPEs of $6.40 \times 10^{-9}$, $5.14 \times 10^{-9}$ and $5.08 \times 10^{-9}$. When using the stationary approach we choose $\hat p_{\mathrm{s}} = 6$ for h = 1, 2, 3, with corresponding empirical MSPEs of $9.35 \times 10^{-9}$, $8.55 \times 10^{-9}$ and $8.65 \times 10^{-9}$. The numbers are summarised in Table 7.
Using these predictors to forecast the 15 observations from the validation set we see that, for h = 1, the MSPE of the stationary approach is $4.25 \times 10^{-9}$ and the MSPE of the locally stationary approach is $3.77 \times 10^{-9}$. For h = 2, the MSPE of the stationary approach is $4.35 \times 10^{-9}$ and the MSPE of the locally stationary approach is $3.56 \times 10^{-9}$. For h = 3, the MSPE of the stationary approach is $4.59 \times 10^{-9}$ and the MSPE of the locally stationary approach is $3.46 \times 10^{-9}$. Thus, for all three h = 1, 2, 3 we choose the locally stationary approach, as it performs better on the validation set. Now, forecasting the final 15 observations (the ones from the test set) we see that, for h = 1, the MSPE of the stationary approach is $3.10 \times 10^{-9}$ and the MSPE of the locally stationary approach is $3.32 \times 10^{-9}$. For h = 2, the MSPE of the stationary approach is $3.64 \times 10^{-9}$ and the MSPE of the locally stationary approach is $3.31 \times 10^{-9}$. For h = 3, the MSPE of the stationary approach is $4.20 \times 10^{-9}$ and the MSPE of the locally stationary approach is $3.55 \times 10^{-9}$. Thus, for h = 1, choosing the stationary approach would have yielded a slightly better result in forecasting the observations from the test set, but for h = 2, 3 we chose the approach with the better performance.
In this example, we see that, for h = 2, . . . , 5, our procedure chose the better-performing approach consistently, while for h = 1 it did not. We see that the 'mistakenly' chosen locally stationary approach, for h = 1, yielded only a 6.5% worse performance (in terms of the MSPE on the test set), while for h = 2, 3, 4, 5 it chose the approach that is substantially better (9.9% to more than 36% improvement). An evaluation of the application of the method should also take into account that the recent changes are ongoing and that less than 5 months of observations after the EU referendum in the UK were available, which required that m be chosen rather small. For this example, we further conducted a sensitivity analysis by varying the parameters m and $p_{\max}$. Selected results, which show how the outcomes change, are presented in Appendix H (Online Supplement).

Analysis of the localised Yule-Walker estimator under general conditions and local stationarity
In this section we discuss the probabilistic properties of the localised Yule-Walker estimator â^{(p)}_{N,T}(t), defined in (1). We believe the results to be of independent interest and therefore present them in this separate section. They are also key ingredients for the proofs of the results in Section 3.3. Our results hold under the five main assumptions discussed in Section 3.3 and rigorously stated in Appendix A. The assumptions are not restrictive: the concentration result in this section holds for a broad class of locally stationary processes and, in particular, does not require that the data come from a tvAR(p) model. Further, we allow for any 1 + p ≤ N ≤ T and, in particular, allow for a diverging model order p, as T → ∞. Unlike, for example, Dahlhaus and Giraitis (1998), we do not require that N = o(T).
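As a small numerical illustration of the estimator studied in this section, the following R sketch (a toy example, not the paper's code; the coefficient function a and the choices of T, N and p are assumptions) computes localised Yule-Walker estimates from the N most recent observations of a simulated tvAR(1) process and compares them with the true coefficient a(t/T).

set.seed(1)
T <- 2000; N <- 250; p <- 1
a <- function(u) 0.8 - 0.6 * u               # time-varying AR(1) coefficient, |a(u)| < 1
Z <- rnorm(T); X <- numeric(T)
for (t in 2:T) X[t] <- a(t / T) * X[t - 1] + Z[t]

local_yw <- function(x, t, N, p) {
  # localised Yule-Walker estimate from the segment x[(t - N + 1):t]
  seg <- x[(t - N + 1):t]
  g <- acf(seg, lag.max = p, type = "covariance", plot = FALSE, demean = FALSE)$acf[, 1, 1]
  solve(toeplitz(g[1:p]), g[2:(p + 1)])
}

t_grid <- seq(N + 1, T, by = 250)
est <- sapply(t_grid, function(t) local_yw(X, t, N, p))
cbind(time = t_grid, estimate = round(est, 3), truth = round(a(t_grid / T), 3))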
Our main result (Theorem 6.1) provides a non-asymptotic bound for the distance of â^{(p)}_{N,T}(t) to a population counterpart, ā^{(p)}_{N,T}(t), discussed below. The Yule-Walker estimator is widely used in practice, and â^{(p)}_{N,T}(t) and its properties have been studied in detail under various conditions. Bercu et al. (1997, 2000) and Bercu (2001) derive large deviation principles for Gaussian AR processes when the model order is 1. A simple exponential inequality, also for model order 1, is given in Section 5.2 of Bercu and Touati (2008). Yu and Si (2009) prove a large deviation principle for general, but fixed, model order. The cited results all require that the underlying process is stationary. Dahlhaus and Giraitis (1998) analyse the bias and variance of the localised Yule-Walker estimator in the framework of local stationarity. They do not, however, provide an exponential inequality, and, as far as we are aware, no result like the one we provide below is available at present.
The exponential inequality in Theorem 6.1, which we now state, is explicit in terms of all parameters and constants. We make use of the explicitness to derive Corollary 6.2, by which the localised Yule-Walker estimator is strongly, uniformly consistent, even when the model order is diverging as the sample size grows.
Theorem 6.1 states that, for every T ≥ 2C_1 p^2, N ≥ 1 + p ≥ 2 and ε > 0, an exponential bound holds, where C_0, C_1 and C_{1,1}, C_{2,1} are defined in (43) and (85), respectively. Theorem 6.1 is a key ingredient in the proof of Lemma A.2, which is essential for the proof of the performance-guarantee result (Theorem 3.1) of our procedure. Further, it implies the following corollary.

Corollary 6.2 Let (X_{t,T})_{t∈Z,T∈N*} satisfy Assumptions 1-5 (Appendix A) and EX_{t,T} = 0, and let P = P_T and N = N_T be sequences of integers that satisfy 2 ≤ 1 + P ≤ N ≤ T. Assume that P = o(N^{(1+2d)/(4+6d)}) and N → ∞, as T → ∞. Further, assume that there exists a sequence R_T with 0 ≤ R_T → ∞ and R_T log(T) = o((N/P)^{1/(3+4d)}), as T → ∞, where d is the constant from Assumption 5. Then the localised Yule-Walker estimator is strongly, uniformly consistent; that is, the estimation error vanishes uniformly in t, almost surely, as T → ∞.

Remark 6.3 For any stationary AR(p) model, ā^{(p)}_{N,T}(u) corresponds to the vector of coefficients. This can be seen from Lemma 6.5, below, and the fact that C_1 = 0 if the model is stationary. Thus, choosing N_T = T and P_T = p, our result yields the same rate as Theorem 1 in Lai and Wei (1982), by which the (least squares) estimator is strongly consistent with rate (log(T)/T)^{1/2}.
We now discuss the quantity ā^{(p)}_{N,T}(t). The following two lemmas (Lemmas 6.4 and 6.5) discuss approximation properties of ā^{(p)}_{N,T}(t); in both, C_0 and C_1 are defined as in (43). The proofs of Lemmas 6.4 and 6.5 can be found in Appendix D (Online Supplement). Note that, if p^2 = o(T), as T → ∞, then we have, by Lemma 6.4, that ã^{(p)}_T(t) coincides with the vector of coefficients (a_1(t/T), . . . , a_p(t/T)), as is evident from the Yule-Walker equations.
It is worth mentioning that in case of a stationary process, where C 1 = 0, Lemmas 6.4 and 6.5 imply that the time-varying versions of the quantities and their local approximation coincide for all T (as they should).

Roueff and Sanchez-Perez (2016) prove a similar bound (their Lemma 3), valid for T ≥ T_0 := Cp^{3/2}/(πm_f). Note that (for larger p) their constant D_1 can be substantially larger than the constant in Lemma 6.4, which is largely due to a different representation of a^{(p)}_0(u). We will also need the following result, which bounds the norm of a^{(p)}_∆(u).

Lemma 6.6 Let (X_{t,T})_{t∈Z,T∈N*} satisfy Assumptions 1, 3, and 4, from Appendix A, and EX_{t,T} = 0. Then, for u ∈ R and ∆ ≥ 0, we have ‖a^{(p)}_∆(u)‖ ≤ C_0.

By Lemma 2 in Roueff and Sanchez-Perez (2016) we have ‖a^{(p)}_0(u)‖ ≤ 2^p. Their proof adapts arguments from Lemma 4.2 in Dahlhaus and Giraitis (1998), where ‖â^{(p)}_0(u)‖ ≤ 2^p almost surely is proven. We choose to work with the bound from Lemma 6.6, because it has the advantage that it does not depend on p. Further, note that neither of the bounds is sharp, as by the Cauchy-Schwarz inequality we clearly have |a^{(1)}_0(u)| ≤ 1.

Conclusion
In this paper we have presented a method to choose between different forecasting procedures, based on the empirical mean squared prediction errors that the procedures achieve. Using the empirical rather than the asymptotic mean squared prediction error, our procedure automatically takes into account that different models should be preferred depending on the amount of data available, which is an important difference from the Focused Information Criterion of Claeskens and Hjort (2003). Working in the general framework of locally stationary time series, we choose from two classes of forecasts that are motivated by approximating the serial dependence of the time series by time-varying or traditional autoregressive models. The procedure implicitly balances the modelling bias (which is lower if the model is more complex) and the variance of estimation (which increases for more complex models). Our two-step procedure automatically chooses the number of forecasting coefficients to be used and the segment size from which the forecasting coefficients are estimated.
In a comprehensive simulation study we have illustrated that it is often advisable to use a forecasting procedure derived from a simpler model when only a limited amount of data is available. In particular, in the tvAR-models of our simulations, if the variation over time is not very pronounced and the tangent processes are not close to having a unit root, it is advisable to work with the simpler stationary model, even when the data are non-stationary.
As an important side result of our rigorous theoretical analysis of the method, we have shown that the localised Yule-Walker estimator is strongly, uniformly consistent under local stationarity.

A Assumptions, technical details and proofs
In this section we provide proofs of the main results in the paper. We keep the proofs as short as possible, by deferring technical results to the Online Supplement.
We use the usual symbols to denote sets of numbers. Further, for a vector x ∈ R^d we denote by ‖x‖ its Euclidean norm and by ‖x‖_∞ its sup-norm. For an m × n matrix A = (a_{i,j})_{i=1,...,m, j=1,...,n} ∈ R^{m×n} we denote by ‖A‖ := sup{‖Ax‖/‖x‖ : x ≠ 0} its spectral norm, by ‖A‖_1 := max_{j=1,...,n} Σ_{i=1}^{m} |a_{i,j}| and ‖A‖_∞ := max_{i=1,...,m} Σ_{j=1}^{n} |a_{i,j}| the matrix norms induced by the corresponding vector norms, and by ‖A‖_F := (Σ_{i=1}^{m} Σ_{j=1}^{n} |a_{i,j}|^2)^{1/2} = ‖vec A‖ the Frobenius norm.
In some of the technical lemmas we denote by ‖·‖_M or ‖·‖_v an arbitrary matrix or vector norm, respectively. Special properties we require include submultiplicativity of a matrix norm, and compatibility of a matrix norm with a vector norm. A matrix norm which satisfies ‖AB‖_M ≤ ‖A‖_M ‖B‖_M for all square matrices (m = n) is said to be submultiplicative. A matrix norm ‖·‖_M and a vector norm ‖·‖_v are said to be compatible if ‖Ax‖_v ≤ ‖A‖_M ‖x‖_v for all square matrices A and vectors x (of sizes that allow for the matrix product).
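For concreteness, the following short R snippet (purely illustrative; the example matrix is an arbitrary assumption) evaluates the four matrix norms just defined with the base function norm().

A <- matrix(c(1, -2, 0.5, 3, 0, -1), nrow = 2)   # a 2 x 3 example matrix
spec <- norm(A, type = "2")                      # spectral norm
one  <- norm(A, type = "O")                      # induced 1-norm: maximum column sum
inf  <- norm(A, type = "I")                      # induced sup-norm: maximum row sum
frob <- norm(A, type = "F")                      # Frobenius norm
c(spectral = spec, one = one, sup = inf, frobenius = frob)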
To rigorously prove the results, some technical assumptions are in order. We assume the time series X_{1,T}, . . . , X_{T,T} to be from a locally stationary process as described in Roueff and Sanchez-Perez (2016); see Assumption 1, below. More precisely, we will work with the doubly indexed process (X_{t,T})_{t∈Z,T∈N*} of random variables with finite variances and use the following definitions. The time-varying covariance function is defined, for all t ∈ Z, T ∈ N* and k ∈ Z, as the covariance of X_{t,T} and X_{t+k,T}. A local spectral density f is an R^2 → R_+ function, (2π)-periodic and locally integrable with respect to the second variable. The local covariance function γ associated with the time-varying spectral density f is defined on R × Z by γ_k(u) := ∫_{−π}^{π} f(u, λ) e^{iλk} dλ.

Assumption 1 (Locally stationary processes) (X_{t,T})_{t∈Z,T∈N*} is an array of random variables with finite variances. We say that (X_{t,T})_{t∈Z,T∈N*} is locally stationary with local spectral density f if the time-varying covariance function of (X_{t,T})_{t∈Z,T∈N*} and the local covariance function associated with f differ by at most C/T, uniformly in t, T and k, where C ≥ 0 is a constant.
Many processes considered in the literature fulfil this definition. Examples include any second order stationary process (then we have C = 0), the general (linear) locally stationary process first considered by Dahlhaus (1997), but also non-linear processes as elaborated by Roueff and Sanchez-Perez (2016) [see their Example 3].
In the tvAR-model that we used to motivate our prediction approach in Section 3.2 (see (9)), and in the examples of Section 4, the local covariances are naturally those of the stationary AR-process obtained when the parameter u of the coefficient functions is fixed at t/T. We will refer to these AR-processes as the tangent processes of the tvAR-process.
We recall from Section 3.2 that the tvAR(p) model is used to approximate the linear dynamic structure of the data, but that we do not assume that the data actually satisfy it. Thus our results are more general.
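The tangent-process interpretation can be checked numerically. The following R sketch (a toy illustration, not taken from the paper; the coefficient function a and the choices of T, t and the number of replications are assumptions) compares a Monte Carlo estimate of Cov(X_{t,T}, X_{t+1,T}) for a simulated tvAR(1) process with the lag-one autocovariance a(u)/(1 − a(u)^2) of its stationary AR(1) tangent process at u = t/T.

set.seed(2)
T <- 500; t0 <- 300; u <- t0 / T
a <- function(u) 0.9 - 0.5 * u

sim_tvar1 <- function(T, a) {
  # simulate one path of X_{t,T} = a(t/T) X_{t-1,T} + Z_t with i.i.d. N(0, 1) innovations
  X <- numeric(T); Z <- rnorm(T)
  for (t in 2:T) X[t] <- a(t / T) * X[t - 1] + Z[t]
  X
}

reps <- replicate(2000, { X <- sim_tvar1(T, a); c(X[t0], X[t0 + 1]) })
mc_cov      <- cov(reps[1, ], reps[2, ])   # Monte Carlo estimate of Cov(X_{t0,T}, X_{t0+1,T})
tangent_cov <- a(u) / (1 - a(u)^2)         # lag-1 autocovariance of the tangent AR(1) at u
c(monte_carlo = mc_cov, tangent = tangent_cov)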
We will further make an assumption with respect to the process's range of dependence, in terms of its strong mixing coefficients. More specifically, we will assume that the underlying process satisfies the following.

Assumption 2 (Geometric α-mixing) There exist constants K > 0 and ρ > 1 such that the strong mixing coefficients of the process decay geometrically, being bounded by Kρ^{−n} at lag n.

Assumption 2 is satisfied for a broad class of locally stationary time series models (linear and non-linear); see, for example, Fryzlewicz and Subba Rao (2011) or Vogt (2012).
It is further common to make assumptions with respect to the local spectral density.

Assumption 3 The local spectral density f is bounded from above and below: 0 < m_f ≤ f(u, λ) ≤ M_f < ∞, uniformly in u and λ.

As an immediate consequence of (35), (36), and (37) we have that the local autocovariances are uniformly bounded, with 2πm_f ≤ γ_0(u) and |γ_k(u)| ≤ 2πM_f. We further assume that f(u, λ) is smooth with respect to the local time u:

Assumption 4 The local spectral density f is continuously differentiable with respect to the first argument and the partial derivative is uniformly bounded; more precisely, |∂f(u, λ)/∂u| ≤ M'_f for all u and λ.

Note that, by assuming the existence of a continuous partial derivative, we have, by Leibniz's integral rule, that γ'_k(u) exists and is given by γ'_k(u) = ∫_{−π}^{π} ∂f(u, λ)/∂u · e^{iλk} dλ, which, in particular, implies that |γ'_k(u)| ≤ 2πM'_f.
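The following R snippet gives a numerical sanity check of this relation (an illustration only; it assumes the convention γ_k(u) = ∫_{−π}^{π} f(u, λ) e^{iλk} dλ used above, a tvAR(1) tangent process with unit innovation variance, and an arbitrary coefficient function a): the local autocovariances are obtained by numerical integration of the local spectral density and compared with the closed-form AR(1) autocovariances a(u)^k / (1 − a(u)^2).

a <- function(u) 0.9 - 0.5 * u
f <- function(u, lambda) (1 / (2 * pi)) / (1 - 2 * a(u) * cos(lambda) + a(u)^2)

gamma_k <- function(u, k) {
  # gamma_k(u) = integral of f(u, lambda) * cos(k * lambda); the imaginary part vanishes
  integrate(function(l) f(u, l) * cos(k * l), -pi, pi)$value
}

u <- 0.5                                   # local time
sapply(0:3, function(k) gamma_k(u, k))     # numerical gamma_0(u), ..., gamma_3(u)
a(u)^(0:3) / (1 - a(u)^2)                  # closed form for the stationary AR(1) tangent process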
Finally, to prove the exponential inequalities in Lemma A.2, we make assumptions regarding moments of the marginal distributions. More precisely, we will assume the following generalised version of Bernstein's condition (cf. Saulis and Statulevičus (1991)).

Assumption 5 (Bernstein-type condition) There exist constants c > 0 and d ≥ 1/2 such that the moments of the X_{t,T} satisfy a Bernstein-type bound in terms of c and d.

By the following lemma, many Gaussian processes will satisfy Assumption 5 with c := 6πM_f and d := 1/2.
Proof. Deferred to Appendix B (Online Supplement).
We now introduce two constants, C_0 and C_1, that combine constants from the assumptions; they are defined in (43). Stating the results in terms of these constants helps to interpret the bounds and significantly shortens otherwise complicated expressions. The constant C_0 can be interpreted in terms of the strength of serial correlation: it will be smaller if there is little variation (uniformly in local time) of the spectral density with respect to frequency, and it will be minimal if the spectral density is constant, which corresponds to the case of white noise. The constant C_1 can be interpreted as divergence from stationarity. In particular, note the meaning of the two summands of the first factor: the constant M'_f corresponds to the rapidity of changes in the second-order structure and vanishes in the case of stationarity, and the constant C corresponds to the quality of locally approximating the correlation structure by that of a stationary process; it, too, vanishes if the underlying process is stationary.
Before we state the proof of Theorem 3.1, we derive auxiliary results that are concerned with a general mean squared prediction error, defined in (44), with v̂^{(p,h)}_{i;N,T}(t) defined in (4). The first auxiliary result (Lemma A.2) entails that the quantity defined in (44) is, with high probability, close to a deterministic counterpart defined in (45) in terms of the second-order properties of the process. Here e_1 denotes the first canonical unit vector of dimension p and H denotes a p × p Jordan block with all eigenvalues equal to zero, see (47); further, g^{(p,h)}_∆(u) and, for every u ∈ R and ∆ ≥ 0, a^{(p)}_∆(u) are defined in (49) and (51), respectively. Some comments on the quantities defined in (45) are in order. Note that the definition of a^{(p)}_∆(u) in (51) resembles the solution to the Yule-Walker equations. In fact, for ∆ = 0 we obtain the solution to the Yule-Walker equations with the locally approximating stationary covariance structure, at local time u. We allow for ∆ > 0 to capture the evolving second moments of the process with indices from [u − ∆, u]. This averaging is necessary to take the bias from the changing second moments into account, as we do not require N/T = o(1) for our results. The quantity g^{(p,h)}_∆(u) is related to the quantity in (5). Note that the expectation of the empirical mean squared prediction error (44) we are considering is naturally an average of these quantities. We now state the result that the quantities defined in (44) and (45) are close, with high probability.
As detailed in Remark 3.2 of the main text, P^{(p,h)}_{m,N}(ε) → 0, as T → ∞, at an exponential rate if p, h, ε are fixed and m, N → ∞ faster than logarithmically.
We now state the result that the quantities defined in (45) and (48) are close. The quality of the approximation depends on the parameters T, p and h, but is uniform with respect to s, m and N; see Lemma A.3. The proofs of Lemmas A.2 and A.3 are long and technical. We therefore defer them to Appendix B (Online Supplement).
A few comments about Lemma A.3 are in order. Note that the approximation error is zero in the case of a stationary time series, as then 2πM'_f + C = 0. Note further that the approximation will be good if h and p are small compared to T. More precisely, if h(2C_0^2)^h p^2 = o(T), then the difference will vanish asymptotically. In particular, if h = O(1), then it suffices to assume that p = o(T^{1/2}) for the approximation error to vanish.
We now provide the proofs of the results from Section 3.3 and conclude with the proofs of the results from Section 6.
Proof of Theorem 3.1.
We first compact the notation by introducing suitable abbreviations. We then bound the part of the right-hand side of (53) that involves the quantity A, which yields the first inequality in (54). For the inequality in (54) we have used the definition of f(δ) and the fact that 1/4 > 1/10. For the first inequality in (55) we have used that ε := f(δ)/(20(1 + δ)), by assumption, and thus Lemmas A.2 and A.3 yield the required bounds. For the second inequality in (55) we have used a further elementary bound. We now bound the part of the right-hand side of (53) that involves the quantity B.
Note that the first two terms can be bounded by an application of (56), and that the indicator function vanishes for all T satisfying the condition of the theorem, because the relevant MSPE quantity is bounded accordingly; Lemma C.4(iv) from the Online Supplement was employed for the last inequality. This completes the proof of the theorem. The preceding relation involving MSPE^{(p,h)}_{s,m,0,T}, in particular, implies that we may interpret MSPE^{(p,h)}_{∆_1,∆_2}(u) as an approximation of the expectation of the empirical MSPE for an h-step ahead linear forecast of order p, where observations up to (local) time u have been made. The quantities ∆_1 and ∆_2 are (localised) lengths which are related to the segment of observations used for the estimation of the forecasting coefficients and the segment from which the observations X_{t+h,T} that are being forecast are taken, respectively.

Proofs of Corollaries 3.3 and 3.4.
Following the proof of Theorem 3.1, we have that Corollary 3.3 holds if we establish a suitable lower bound for the minimum over N ∈ N of the MSPE^{(1,1)} quantities involved. Likewise, to show Corollary 3.4, we bound the left-hand side, with πm_f D^2_inf/2 on the right-hand side. To find the lower bound we want, it therefore suffices to prove lower bounds, for every w, for (58) or for −1 × (58). For notational convenience we omit the w's in what follows. In Appendix B (Online Supplement) we show (59). By (15), we have |F − γ_1 γ_0| ≥ D_inf and, by (14), the corresponding bound holds uniformly with respect to w, which can be seen as follows: first, note that 2πm_f ≤ γ_0(w; ∆) := ∫_0^1 γ_0(w + ∆(x − 1)) dx and |γ_1|/γ_0 ≤ 1. Employing (59), we have now brought the tools together to prove Corollary 3.3. For the first inequality we have used the fact that (γ_1/γ_0 − E)^2 ≥ 0 and the definitions of ρ and D_sup. For the second inequality we have used the condition imposed on δ.
Finally, employing (59) again, we prove Corollary 3.4, where in the first inequality we have used (16). For the second inequality we have used the definition of D_inf and assumption (60) of the corollary. Finally, for the third inequality we have used that, by assumption in the corollary, 2δ ≤ D^2_inf/4.
We now proceed with the proofs of the results from Section 6.
Proof of Theorem 6.1.
We first deduce that M_0 is invertible for T ≥ 2p^2 C_1, because it is positive definite with smallest eigenvalue larger than or equal to m_f/2. An application of Lemma F.6, with the spectral norm as the matrix norm and the Euclidean norm as the vector norm, yields the required bound, where we have used Lemma E.3(ii-c) again to bound 1/‖M_0^{-1}‖. In the last step we employed Hölder's inequality, the corresponding bound for the Euclidean norm and, finally, Corollary E.2(iii) together with a further result from Appendix E, where the second inequality holds for T ≥ 2 8C_0, and the third line follows from the fact that, for any two integers N and p with N ≥ 1 + p ≥ 2, the sequence appearing there, indexed by h = 0, 1, . . . , p, is increasing; this is easy to see.

Proof of Corollary 6.2.
Note the fact that, if R n ≥ 0 is a sequence with R n → ∞, as n → ∞, then Thus, employing the Borel-Cantelli lemma, it suffices to show that, for any given ε > 0 and sequence 0 ≤ This follows, since we have for T large enough. In the second inequality we have used the fact that, due to P = o(N ), there exists aC > 0 such that 1/N ≥C/(N − P ), for T large enough. Note that we have 4+6d) ) and d ≥ 1/2, such that, in the third inequality, Theorem 6.1 can be applied, where we have also used the fact that, under the assumptions made implying that, for T large enough, we have This completes the proof.  (2011)), if X is a b-sub-Gaussian random variable, then where we have used (40) for the first inequality and for the second inequality assumed that without loss of generality 3πM f ≥ 1 and we have used that, for p ∈ N * , we have which implies Γ(p/2 + 1) 2 ≤ p!, for p ≥ 3, which finishes the proof, as EX t,T = 0. Proof of Lemma A.2. We prove the first equation in full detail and comment on how the proof needs to be adapted for the second inequality in the end. First, we consider the case when p ≥ 1. In the end of the proof we will comment on the (easier) case p = 0. Denotev (p,h) N,T (t) =: (v 1,t , . . . ,v p,t ) and note that, omitting the second index of X t,T for the sake of brevity, we have and an analogous equation for MSPE With this notation we have We now use Lemma F.5 to disentangle (62) and (63). More precisely, for the ith addend (i = 1, . . . , p) in (62) we have where we have used the Cauchy-Schwarz inequality, Assumption 5 and (40), which holds as T ≥ C πm f , to (uniformly) bound the second moments Note that we have used m 2 2 := 3πM f c 2 24 d in the application of Lemma F.5. For the (i 1 , i 2 )th addend (i 1 , i 2 = 1, . . . , p) in (63) we analogously have 2 (6πM f c 2 24 d ) 2 + ε 2 /(p + 1) 4 1/4 where we bounded sup t=s,...,s+n−1 E(X 2 t−i 1 +1 X 2 t−i 2 +1 ) by m 2 2 as before.
We now use Lemma F.4 to bound (67) by a sum of two quantities resembling those from (64). Applying Lemmas F.4 and C.3(i) (for the second we have required that T ≥ 5C_1 p^2) yields that (67) can be bounded by (71) and (72), where the subscripts at the equation numbers indicate that the respective expressions depend on i or i_1, i_2. We now discuss how to bound (71) and (72).
To bound (71) we note that (61), (65), (66), (68), and (69) are all of the type to which Lemma E.5 can be applied. We do so with the quantities b, a t , α, h, ε of Lemma E.5 chosen as in Table 8.
Note that (i) the bound from Lemma E.5 is independent of b, (ii) increasing A, α or h* will increase the bound, and (iii) decreasing ε will also increase the bound. Comment (ii), with respect to α, holds because we have enlarged the constants C_{1,α} and C_{2,α} to C̃_{1,α} and C̃_{2,α}.
Lastly, for the case where p = 0, note that it suffices to bound (61).
For r^{(2)}_{p,h,N,T}(t) we obtain a bound in which the second inequality holds by employing Lemmas C.1(ii) and C.3, both of which can be applied since we assumed T ≥ 3h2^h C_1 p^2, and 3h2^h C_1 p^2 ≥ 5C_1 p, as h ≥ 1. Also note that ‖γ^{(p,h)}((t + h)/T)‖ ≤ ‖γ^{(p+h)}_0((t + h)/T)‖, to which we apply Lemma E.3(i-b). We also employed the bound on γ_{i+h−1}(u) from (42), below, and used the mean value theorem.
Here the last inequality is due to applying Lemmas C.3(ii), C.2(i), E.3(i-a), C.1(ii), and C.3(ii) to the respective terms. Note that Lemma C.3(ii) can be applied, as we have assumed T ≥ 3h2^h C_1 p^2, which also implies the condition for Lemma C.1(ii), as h ≥ 1.
For the first inequality we have used an elementary bound, and for the very last inequality we have used 9/(2π)^{1/2} + 4 ≤ 8.
Substituting these four results into (77) completes the proof.
Proof of (59). With the notation from the proof of Corollary 3.3, we obtain a chain of inequalities in which, in the last step, the (reverse) triangle inequality was used. The inequality then follows.

C Lemmas regarding v, g and MSPE
For the proofs we will employ (Lemma C.1(i) and (ii)) bounds involving v̄^{(p,h)}_{N,T}(t). More precisely, the following lemma, which is constructed analogously to Lemma 6.5, but for the h-step ahead coefficients, holds.

Lemma C.1 Let (X_{t,T})_{t∈Z,T∈N*} satisfy Assumptions 1, 3, and 4 (Appendix A) and EX_{t,T} = 0. Define C_0 and C_1 as in (43). Then, for v̄^{(p,h)}_{N,T}(t) defined in (46), the bounds stated in parts (i) and (ii) hold.

Further, for the norms of u ↦ v^{(p,h)}_∆(u) and ∆ ↦ v^{(p,h)}_∆(u) and their derivatives, we have the following bounds.

Lemma C.2 Let (X_{t,T})_{t∈Z,T∈N*} satisfy Assumptions 1, 3, and 4 (Appendix A) and EX_{t,T} = 0. Define C_0 as in (43) and let m_f, M_f, M'_f be the constants from the assumptions. Then the bounds stated in parts (i)-(iii) hold for v^{(p,h)}_∆(u).

Lemma C.2 also holds for h = 1. Part (i) thus extends Lemma 6.6.
Finally, we use Lemmas 6.4, 6.5 and 6.6 to bound the norm of ā^{(p)}_{N,T}(t).

Lemma C.3 Let (X_{t,T})_{t∈Z,T∈N*} satisfy Assumptions 1, 3, and 4 (Appendix A) and EX_{t,T} = 0. Define C_0 and C_1 as in (43). Then the bounds stated in parts (i) and (ii) hold.

The bound in Lemma C.3(ii) is tighter, but requires T to be larger.
An important observation is that, as a function of u, MSPE^{(p,h)}_{N/T,n/T}(u) is differentiable with bounded derivative.

Lemma C.4 Let (X_{t,T})_{t∈Z,T∈N*} satisfy Assumptions 1, 3, and 4 (Appendix A) and EX_{t,T} = 0. Define C_0 as in (43) and the other constants as in the assumptions. Then the function g^{(p,h)}_∆, defined in (49), is continuously differentiable and its derivatives are bounded, as made precise in parts (i)-(vi) of the lemma.

Proof of Lemma C.1. By the definitions of Ā and the related quantities we obtain a chain of inequalities in which the first inequality is due to the submultiplicativity of the spectral norm, the second inequality is due to Lemma F.3, and the third inequality is due to the facts that ‖A − A_0‖ ≤ ‖A − A_0‖_F and ‖A_0‖ = ‖e_1 a^{(p)}(t/T)′‖, because adding/subtracting the Jordan block doesn't change the eigenvalues (see also the proof of Lemma C.2).
For (ii) we have, by a similar argument as above that v (p) where the second inequality holds if T ≥ T 0,1 := 5C 0 C 1 p 2 C −1 0 . Note that T ≥ T 0,1 implies T ≥ 2C 1 p 2 , which is the condition required to apply Lemma 6.5(ii).
Proof of Lemma C.2. Let e 1 and H be defined as in (47). We derive a compact expression for dv where v(u) is short for By Weyl's inequality we have that the eigenvalues of µ 1 ≤ . . . ≤ µ p of e 1 a(u) + H fulfil λ 1 + ρ i ≤ µ i ≤ ρ i + λ p , where ρ 1 ≤ . . . ≤ ρ p denote the eigenvalues of e 1 a(u) and λ 1 ≤ . . . ≤ λ p denote the eigenvalues of H. Note further, that the eigenvalues of H are λ 1 = . . . = λ p = 0 (it is a Jordan block). Therefore: µ i = ρ i . In particular: e 1 a(u) + H = e 1 a(u) ≤ a(u) ≤ C 0 , where the last inequality follows from Lemma 6.6. This observation, obviously, implies that v = e 1 X h ≤ X h ≤ a h , which yields (i).
For the proof of (ii) recall that the notation reflects that v, a, Γ, and γ are functions of the variable u (for a fixed ∆ ≥ 0). By elementary rule (15), p. 148 and Theorem 3, p. 151 from Magnus and Neudecker (1988), we have, that Iterating elementary rule (15), p. 148, we get that, for every square matrix function X and h = 1, 2, . . . that Obviously we have dX = e 1 (da) , which yields For da note that, by (78) and employing Lemma E.3(i), we have Here we have used that, by the assumed continuity of ∂ ∂u f (u, λ), we have (dΓ such that an application of Lemma 4.1 from Gray (2009) we have by Bessel's inequality (see, again, the proof of Lemma E.3(i)) that dγ For the proof of (iii) consider ∆ to be the argument of the functions (instead of u). Let this be reflected by changing the notation to v(∆) := v ∆ (u) (for a fixed u ∈ R). Note that all previous arguments remain the same, but for the derivation of dΓ and dγ. To this end, denote f λ (u) := f (λ, u) and note that we have Proof of Lemma C.3. This follows, because The second inequality follows by application of Lemmas 6.6 and 6.5(ii). Note that Lemma 6.5(ii) can be applied since T ≥ 2C 1 p 2 by assumption. The third inequality follows, if T ≥ 5C 1 p 2 , which is what we assumed.
For the second inequality note that, similar to the definition ofv Step 3 of Section 3.1 and (4), we have the following recursive relationship In other terms, we havev (p,1) i,N,T (t) =ā (p) i,N,T (t) and, for every h = 2, 3, . . . , we havē Employing the first part of this lemma we thus have For the proof of (ii), as in the proof of the bound for ā where, for the second inequality, we have employed Lemma C.2(i) and C.1(ii). The third inequality holds, if which, for all h = 1, 2, . . ., implies the condition required to apply Lemma C.1(ii).
Proof of Lemma C.4. For the sake of brevity, we omit sub and superscripts that are not needed for this particular argument. More precisely, in this section we denote g  For (i) note that we have, by (36) together with Assumption 3, by Lemma C.2(i), and by Lemma E.3(i), We now compute the derivatives of g (p,h) ∆ (u). We have: Note that, for ∆ = 0 and h = 1, we have v = a and by the definition of a, it follows that Γv − γ = 0, such that the second term of the derivative will only show if h ≥ 2.
In consequence, for (i) we bound dg p,h as follows: In the above, for the second inequality, we have used (42) together with Assumption 4, Lemma C.2(i)+(ii), and Lemma E.3(i).
For the proof of (ii) use the partial derivatives of γ and Γ with respect to ∆ that were derived in the proof of Lemma C.2(iii). For the proof of (iii) use Lemma C.2(iii) instead of Lemma C.2(ii) and use the bounds for dΓ and dγ that were derived in the end of the proof of Lemma C.2(iii). For the proof of (iv) note that the partial derivative of g is continuous by (i) and employ Leibniz's integral rule. (v) follows analogously from (ii). For the proof of (vi) we note that, by the argument in the proof of Lemma C.2(iii), we have (vi) then follows from (i) of this lemma.

D Proof of auxiliary results from Section 6
Proof of Lemma 6.6. Note that a (p)

The assertion follows from Lemma E.3(i).
Outline for the proof of Lemma 6.4 and Lemma 6.5.
Note that the norm to be bounded is, in all cases, of a difference of quantities of the form a 1 := Γ −1 1 γ 1 and a 2 := Γ −1 2 γ 2 , where the norms of the components of a 2 and norms of the differences of the components can be bounded using results from Section E.1. We denote these bounds by Γ −1 2 ≤ K, γ 2 ≤ κ, Γ 1 − Γ 2 ≤ D, and γ 1 − γ 2 ≤ δ.
For the bound of the norm of interest note that For each of the inequalities to be proven, we will provide a T 0,1 such that Under this condition, we have, by Lemma F.2, that Finally, we will provide a T 0,2 such that Under this condition, we have γ 1 ≤ κ + δ ≤ 2κ, by the triangle inequality. It is implied that, for T ≥ max{T 0,1 , T 0,2 }, the norm a 1 − a 2 does not exceed For the proof of the individual cases it thus suffices to refer to Section E.1 where appropriate bounds (i. e., K, κ, D, and δ) can be found, from which T 0,1 , T 0,2 , and (83) are obtained.
Proof of Lemma 6.4.
We proceed as outlined and chooseã T (t) to be a 1 and a (p) 0 (t/T ) as a 2 . Lemma E.3(i) and Corollary E.2(i) provide the needed bounds. We have (83), we have the following bound ã (p) The assertion follows, as we have p 1/2 C/m f ≤ C 1 p 2 and C −1 0 < 1.
Proof of Lemma 6.5. We proceed as outlined. For (i) we chooseā N,T (t) to be a 1 and a (p) 0 (t/T ) as a 2 . Lemma E.3(i) and Corollary E.2(ii) provide the needed bounds. We have (83), we have the following bound We chose to state the result to hold for T ≥ N C 1 max{4p, p 1/2 C −1 0 }, which is more restrictive, but allows for the more compact expression, as C 0 > 1.

E Properties of empirical localised auto-covariances E.1 Approximations for moments
Lemma E.1 Let (X t,T ) t∈Z,T ∈N * satisfy Assumptions 1 and 4 (Appendix A), EX t,T = 0.
Then, withγ k;N,T (t) defined in (3), f (u, λ) and C from Assumption 1, and M f from Assumption 4, we have: (i) Proof of Lemma E.1. We have by Lemma F.8, where the differentiability (with respect to u) is guaranteed by Assumption 4.
For the second equation in the lemma, note that (for fixed t, N, k, u) we have where Thus, by Assumption 4 and because u ∈ [0, 1], we have For the proof of the third part of the Lemma, note that (for fixed t, N, k, u) we have, in analogy to (84) Thus, by Assumption 4 and because u ∈ [0, 1], we have which completes the proof of the lemma.

Corollary E.2 Under the conditions of Lemma E.1, withΓ
(p) and (iii) For (i), note that, letting Γ (p) where the first part of the first inequality is due to (37) and the second part is due to (41) and the mean value theorem.
We bound the norms of ∆ := δ j,k (t, T ) j,k=1,...,p Then, we have, by Hölder's inequality and we get the second part of (i) by bounding the Euclidean norm by p 1/2 times the maximum norm.
For (ii) note that, by Lemma E.1(iii), we have We bound the spectral norm of ∆ by its Frobenius norm: For the second part of (ii) note, that For the proof of (iii) note that, letting where the first inequality is due to Lemma E.1(ii) and for the second inequality we have used π(p + 1) ≤ 2πp and Ap 2 + Bp ≤ (A + B)p 2 which holds as A, B ≥ 0, and p ≥ 1. By Hölder's inequality this implies For the other bound note Let (X t,T ) t∈Z,T ∈N * satisfy Assumptions 1, 3, and 4 (Appendix A) and EX t,T = 0. Then, we have: (i-a) the matrices Γ (p) (u) and Γ (i-b) the norms of the respective vectors are uniformly bounded: (ii-a) The largest eigenvalue of EΓ (p) N,T (t) satisfies the following bound: N,T (t) is positive definite, and the smallest eigenvalue satisfies the following bound: Proof of Lemma E.3. We use the notation from Gray (2009). For every function f defined on [0, 2π], with a Fourier series that has absolutely summable Fourier coefficients and for every n ∈ N * , he defines, in (4.8), the p × p Toeplitz matrix T n (p) := π −π f (λ)e −i(k−j)λ dλ : k, j = 0, 1, . . . , p − 1 .
We will apply Lemma 4.1 from Gray (2009), according to which, if T_p(f) is Hermitian (which is the case if and only if f is real-valued), its eigenvalues lie between the essential infimum and the essential supremum of f.

by Assumption 3. (i-b) follows from
Bessel's inequality, as we have For the proof of (ii-a) apply the triangle inequality: N/T (t/T ) and note that the first term on the right hand side can be bounded by (i-a) of this lemma and the second part can be bounded by Corollary E.2(iii).
Note that the right hand side does not depend on t.
Proof of Lemma E.4. We state and prove a more general version, which entails Lemma E.4 as a special case (let a t := 1, α = 1, b := t − N + 1, and n := N − |h|).
Note the important fact that the bounds in the inequality do not depend on b.
Proof of Lemma E.5.
We proceed along the lines of the proof of Lemma 2 in Goldenshluger and Zeevi (2001) and apply the two results from Saulis and Statulevičus (1991) which are cited in Section G. As Goldenshluger and Zeevi (2001) pointed out, several other exponential-type inequalities are available in the literature; see, e.g., Doukhan (1994) and Bosq (1998). These results could be used to derive the type of bound we need, but under our conditions the results from Saulis and Statulevičus (1991) seem most suitable here.
We show this by employing Lemma G.2 with Y ,T := n −1 a t X α +b−1,T X α +b−1+h,T − E(X α +b−1,T X α +b−1+h,T ) , = 1, . . . , n. The first step in the application of Lemma G.2 is to show that (94) holds with appropriately chosen H 1 and γ 1 . The second step will be to show that the bound we get from Lemma G.2 can again be bounded by the right hand side in (93) with H,∆, and γ as defined before.
For the first step, observe that for any = 1, . . . , n and k = 2, 3, . . . we have For the first part of (86) we have used that In the above we have used Cauchy-Schwarz inequality, Assumption 5, Assumption 3, which, if T ≥ C/(πm f ), implies (40), and the elementary inequality (Lk)! ≤ (L!) k (k!) L , which holds for L, k ∈ N * . For the second part of (86) we have used that where we have employed arguments as before and 1 ≤ (k!) 2αd .
For the second step we turn our attention to ∆ n (ν) defined in Lemma G.2. Note that where α T and α are the functions from Assumption 2. Further, note that the trivial inequality α(t − s − |h|) ≤ 1 holds for t ≤ s + |h|. Thus, adopting the argument from step 3 in the proof of Lemma 2 in Goldenshluger and Zeevi (2001), we have where the last inequality follows from exp(x) − 1 ≥ x, for x ≥ 0.
Thus, for δ = 1 and k = 2, 3, . . ., we obtain the required bound, where we have used the fact that p^p ≤ p! e^p, for p ∈ N.

F Technical Results
In the proofs we use the following general results, which are not restricted to locally stationary processes.
Lemma F.1 Let A ∈ R^{p×p} be an invertible matrix and ∆ ∈ R^{p×p} be a matrix with ‖A^{-1}‖_M · ‖∆‖_M < 1 for a submultiplicative matrix norm ‖·‖_M. Then, the matrix A + ∆ is invertible and the bound stated in the lemma holds.

Proof of Lemma F.1. Lemma 2.1 in Demmel (1997) can be applied, which asserts that I − X = A^{-1}(A + ∆) is invertible and provides a bound for the norm of its inverse. We deduce that A + ∆ is invertible, because A is invertible by assumption and I − X is invertible by Lemma 2.1 in Demmel (1997). Finally, the bound on the matrix norm of the inverse follows from the assumed submultiplicativity. This finishes the proof of the lemma.
An important corollary to the above lemma is the following.

Lemma F.2 Let A ∈ R^{p×p} be an invertible matrix and ∆ ∈ R^{p×p} be a matrix with ‖A^{-1}‖_M · ‖∆‖_M ≤ c < 1 for a submultiplicative matrix norm ‖·‖_M. Then, the matrix A + ∆ is invertible and the bound stated in the lemma holds.

Proof of Lemma F.2. Note that Lemma F.1 can be applied, which yields that A + ∆ is invertible as an immediate consequence. Employing the submultiplicativity of the norm and the inequality from Lemma F.1 then yields the assertion, because we have ‖−∆‖_M = ‖∆‖_M and, by assumption, ‖A^{-1}‖_M · ‖∆‖_M ≤ c.

Lemma F.3 Let A and A_0 be two square matrices and ‖·‖_M be a submultiplicative matrix norm. Then, for any h ∈ N, the bound stated in the lemma holds.

Proof of Lemma F.3. The statement for h ∈ {0, 1} is obvious. For h = 2, 3, . . . note that, by the binomial theorem, we have

This obviously implies the assertion of the lemma.
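A quick numerical check of the perturbation bounds of Lemmas F.1 and F.2 can be carried out in R. The sketch below is an illustration only; it assumes that Lemma F.1 yields the standard bound ‖(A + ∆)^{-1}‖_M ≤ ‖A^{-1}‖_M / (1 − c) with c := ‖A^{-1}‖_M ‖∆‖_M < 1, and it uses the spectral norm as the submultiplicative matrix norm.

set.seed(3)
p <- 5
A <- crossprod(matrix(rnorm(p * p), p)) + diag(p)   # symmetric positive definite, hence invertible
Delta <- matrix(rnorm(p * p), p)
Delta <- Delta * 0.1 / norm(Delta, type = "2")      # rescale so that the condition below holds

c_val <- norm(solve(A), type = "2") * norm(Delta, type = "2")
stopifnot(c_val < 1)                                # condition of Lemmas F.1 and F.2

lhs <- norm(solve(A + Delta), type = "2")           # norm of the perturbed inverse
rhs <- norm(solve(A), type = "2") / (1 - c_val)     # bound assumed to be stated in Lemma F.1
c(lhs = lhs, bound = rhs, condition = c_val)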
Lemma F.4 Let u and v be two real-valued random variables. Further, let u 0 and v 0 be two real numbers. Then, for all ε > 0 where the first inequality is due to the triangle inequality and the second is due to Young's inequality. We now bound the first term and note that the second term can be handled analogously. Note that The assertion then follows, because The equality in the above holds due to the mean value theorem, for some ξ ∈ [|v 0 | 2 , |v 0 | 2 + ε]. The inequality is due to ξ ≤ |v 0 | 2 + ε.
We will further use the following lemmas: Lemma F.6 Let M ∈ R p×p be a random p × p matrix with existing expectation M 0 := EM, which is assumed to be invertible. Further, let v be a R p -valued random vector v with existing expectation Ev := v 0 . Then, for every submultiplicative matrix norm · M that is compatible with the (vector) norm · v , we have: for every ε > 0 where for ( * ) we have used the fact that any conditional probability is ≤ 1 for the first part and the following argument for the second part: In the above inequalities we have used Lemma F.1, by which we have: The assertion then follows by choosing δ = Lemma F.7 Let x = (x 1 , . . . , x p ) be a random vector and x 0 = (x 0,1 , . . . , x 0,p ) be a deterministic vector. Define two p × p matrices Proof of Lemma F.7. Note that, for h = 1 we have v = x and v 0 = x 0 . Thus, the assertion holds as an equality. For h = 2, 3, . . ., we have Thus, with this and by Lemma F.3, we have where the second inequality uses A − A 0 ≤ A − A 0 F = x − x 0 and the fact that A 0 = e 1 x 0 ≤ x 0 . For A 0 = e 1 x 0 note that subtracting the Jordan block doesn't change the eigenvalues (see also the argument regarding the norm of X in the proof of Lemma C.2). For the rest of the derivation denote y := x − x 0 1/(h−1) . Then, we have where y min := min{|y j | : j = 1, . . . , h} is the minimum of the absolute values of the (complex) roots y 1 , . . . , y h of the polynomial function p(y) := y h + x 0 y − (ε/h) 1/(h−1) . The inequality in the above is due to the fact that, for all y ≥ 0, we have which we now prove. Note that p is continuous and p(0) < 0. For a proof by contradiction, we assume that y ≤ y min . We find that p(y) ≤ 0 for y = y min , because (a) it is either a root of the polynomial and then we have p(y) = 0, or (b) we have p(y) = 0, in which case it has to be negative: we cannot have p(y) > 0, because then the intermediate value theorem would implies that p(ξ) = 0, for some ξ ∈ [0, y min ), which is not possible, because it would have |ξ| < y min . In conclusion, we have shown that p(y) ≤ 0 for all |y| ≤ y min , which finishes the proof of (89).
By Rouché's theorem, we have

This implies
We report a bound that is larger, to have an expression that is valid for all h. Note further, that for ε ≤ h max{ x 0 , 1} h−1 , we have Lemma F.8 Let f : [0, 1] → R be continuous and differentiable on (0, 1). Then, for every A, B = 0, . . . , T , T ∈ N * , A < B, we have Proof of Lemma F.8.
By the mean value theorem and the assumed conditions we have that for every ∈ Z there exists ξ ∈ [( − 1)/T, /T ] and ζ ∈ (ξ , /T ) such that Thus, for A, B = 0, . . . , T we have In the above we have used that Integration by substitution yields that Substitute (92) into (90) and multiply by T /(B − A), then the assertion follows from (91).
G Results from Saulis and Statulevičus (1991) In this section we cite two results of Saulis and Statulevičus (1991) that are needed for the proofs in the previous sections.