Modelling Time-Varying First and Second-Order Structure of Time Series via Wavelets and Differencing

Most time series observed in practice exhibit time-varying trend (first-order) and autocovariance (second-order) behaviour. Differencing is a commonly-used technique to remove the trend in such series, in order to estimate the time-varying second-order structure (of the differenced series). However, often we require inference on the second-order behaviour of the original series, for example, when performing trend estimation. In this article, we propose a method, using differencing, to jointly estimate the time-varying trend and second-order structure of a nonstationary time series, within the locally stationary wavelet modelling framework. We develop a wavelet-based estimator of the second-order structure of the original time series based on the differenced estimate, and show how this can be incorporated into the estimation of the trend of the time series. We perform a simulation study to investigate the performance of the methodology, and demonstrate the utility of the method by analysing data examples from environmental and biomedical science.


Introduction
Time series data can often possess complex and dynamic characteristics. Most commonly encountered time series in practice are nonstationary: the mean and autocovariance of the series vary over time. Modelling how these properties change over time is crucial for making inference on the data. Nonstationary time series arise in a variety of applications, for example in environmental sciences (Hu et al., 2019), climatology (Das and Politis, 2020), and financial time series (Roueff et al., 2019). In this article, we consider a time series model of the form X_{t,T} = µ(t/T) + ε_{t,T}, for 0 ≤ t < T, (1.1), where µ : [0, 1] → R is a non-parametric deterministic trend function, and ε_{t,T} is a locally stationary wavelet (LSW) process with E(ε_{t,T}) = 0. This model accounts for nonstationarity in both the first and second-order structure of the time series: the time-varying mean function is encapsulated in the trend term, while the time-varying second-order behaviour is described by the LSW process term. A thorough explanation of Model (1.1), including necessary background on LSW processes, is given in Section 2.
First and second-order estimation of a time series are most commonly considered separately, rather than jointly within one common methodological framework. However, there is an emerging literature in which the mean function is estimated alongside parameters responsible for second-order nonstationarity. For example, Tunyavetchakit (2010) considers time-varying AR(p) processes where the mean function and AR coefficients are estimated jointly within the same procedure. Other methods, such as Dahlhaus and Neumann (2001), which focuses on the semiparametric setting, consider results where the mean function is time-varying and/or estimated. Khismatullina and Vogt (2020) test for increases and decreases in trend in the presence of stationary time series errors, while Dette and Wu (2020) consider the problem of prediction in locally stationary time series.
The problem of performing inference on the mean function in the presence of nonstationary second-order structure is a highly challenging one. In von Sachs and MacGibbon (2000), for example, the authors describe a method using wavelet thresholding; however, the threshold used is data-driven and depends on the nonstationary second-order structure, which is ultimately unknown. Vogt (2012) employs kernel-based methods to estimate a smooth non-parametric trend function in the presence of locally stationary errors. Dette et al. (2019) test for relevant changes in the mean of nonstationary processes. Similarly, less attention has been paid in the literature to nonstationary second-order estimation in the presence of a non-zero mean function.
In order to estimate nonstationary second-order structure, a zero-mean process is often assumed. One of the most well-known methods for removing the trend in a time series is differencing: see for example Chan et al. (1977) and Shumway and Stoffer (2010). The time series {X_t} can be, for example, first-differenced to obtain a new time series, {∇X_t = X_t − X_{t−1}}. Differencing aims to remove a trend without the need to estimate any parameters (whose estimation often involves an assumption of stationary errors). In the time series literature, it is often the case that inference is made on the properties of the differenced time series. In contrast, this article proposes a method to jointly estimate the time-varying trend and second-order structure of the original time series, by employing the commonly-used strategy of differencing a time series in order to remove the trend.
Our approach to this problem can be summarised as follows. By differencing the time series to remove the trend, we can estimate the appropriate second-order quantities of interest of the locally stationary wavelet part of Model (1.1). This is achieved by considering the effect of differencing on the second-order properties of the series in order to develop an appropriate estimation procedure. Expanding upon the rigorous theory developed in Nason et al. (2000), we obtain results on the consistent estimation of the second-order structure using our modified estimation strategy for the original time series. Using this estimate, we discuss a wavelet thresholding technique to estimate the trend function µ(t/T) in a principled manner, by taking into account the time-varying second-order behaviour. Our methodology thus enables the joint estimation of the first and second-order structure of nonstationary time series. The applications demonstrate the added utility that estimating the second-order structure of the original time series brings.
The rest of the article is organised as follows. In Section 2 we introduce the time series model which we focus on in this article, describe key assumptions, and discuss necessary background to LSW processes. In Section 3, we analyse the effect that differencing has on the spectral structure of a time series, and explain the intuition behind our methodology. Furthermore, we describe the methodology for consistent estimation of the second-order structure in the presence of trend, and in Section 4 we use this estimate in order to estimate the trend of the series. Simulation studies assessing the method's performance are given in Section 5. In Section 6, we apply our framework to a data example, demonstrating the utility of the method, while concluding remarks are given in Section 7. All proofs and additional simulation and data application results are contained within the supplementary material.

Wavelets and Model Formulation
In this section we introduce the modelling paradigm that we will use, as well as explaining the necessary background concepts.
In the LSW framework, wavelets act as building blocks, analogous to the Fourier exponentials in the classical Cramér representation for stationary processes. Briefly, wavelets are oscillatory basis functions which provide efficient (sparse) multiscale representations of signals. Wavelets are useful in estimating time-varying quantities, especially nonstationary characteristics, due to these attractive properties. For an overview of wavelet techniques, see for example Nason (2008) or Vidakovic (2009).
For example, for a function f ∈ L²(R), we have the expansion f(x) = ∑_{k∈Z} c_{J,k} φ_{J,k}(x) + ∑_{j≤J} ∑_{k∈Z} d_{j,k} ψ_{j,k}(x), where the wavelet ψ_{j,k}(x) = 2^{−j/2} ψ(2^{−j}x − k) is a dilated and translated version of a (mother) wavelet ψ(x), and similarly for the father wavelet φ(x). The coefficients d_{j,k} at location k and scale j represent the oscillatory behaviour of the signal f at a particular frequency, whereas the coefficients c_{j,k} give information about the mean behaviour of the signal at different scales j.
Next, we define the discrete wavelets that make up the building blocks of the LSW model. Let ψ be a compactly supported wavelet, for example any within the Daubechies family (Daubechies, 1992). Denote by {h_k, g_k} the low-/high-pass filter pair associated with ψ. Let N_h be the number of non-zero values of {h_k}, and define L_j = (2^{−j} − 1)(N_h − 1) + 1. The discrete wavelets at a given scale j ∈ Z⁻, as discussed in Nason et al. (2000), are defined as the vectors ψ_j = (ψ_{j,0}, ..., ψ_{j,L_j−1}), constructed via the recursion ψ_{−1,n} = ∑_k g_{n−2k} δ_{0,k} = g_n, and ψ_{j−1,n} = ∑_k h_{n−2k} ψ_{j,k} for n = 0, ..., L_{j−1} − 1, where δ_{0,k} is the Kronecker delta. The discrete father wavelet is defined similarly using the associated low-pass filter {h_k}.
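The recursion above is straightforward to implement. The sketch below is our own illustration (the function name `discrete_wavelets` is not from the paper), assuming the Haar filter pair h = (1/√2, 1/√2), g = (1/√2, −1/√2); it reproduces the lengths L_j and the familiar scale −2 Haar vector (1/2, 1/2, −1/2, −1/2).

```python
import numpy as np

def discrete_wavelets(h, g, num_scales):
    """Discrete wavelets psi_j for scales j = -1, ..., -num_scales,
    via psi_{-1,n} = g_n and psi_{j-1,n} = sum_k h_{n-2k} psi_{j,k}."""
    h = np.asarray(h, float)
    psis = [np.asarray(g, float)]
    for _ in range(num_scales - 1):
        up = np.zeros(2 * len(psis[-1]) - 1)
        up[::2] = psis[-1]                 # upsample the previous scale by two
        psis.append(np.convolve(h, up))    # apply the low-pass filter
    return psis

# Haar filter pair
h = [2 ** -0.5, 2 ** -0.5]
g = [2 ** -0.5, -(2 ** -0.5)]
psis = discrete_wavelets(h, g, 3)
```

Each vector has unit norm, and the lengths 2, 4, 8 agree with L_j = 2^{−j} for Haar (N_h = 2).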

Model Definition
Below, we define the Trend Locally Stationary Wavelet (T-LSW) model, which is composed of a deterministic Lipschitz continuous trend component and a locally stationary wavelet component.
Our T-LSW model, building on the theory of locally stationary wavelet processes of Nason et al. (2000), allows for simultaneous inference on the time-varying mean and autocovariance of a time series.
Definition 2.1. A trend locally stationary wavelet (T-LSW) process {X_{t,T}}, t = 0, ..., T − 1, with T = 2^J ≥ 1 for J ∈ N, is a doubly-indexed stochastic process having the following representation in the mean-square sense:

X_{t,T} = µ(t/T) + ∑_{j=−∞}^{−1} ∑_{k∈Z} w_{j,k;T} ψ_{j,k−t} ξ_{j,k}, (2.1)

where {ξ_{j,k}} is a random, uncorrelated, zero-mean orthonormal increment sequence, {ψ_{j,k−t}}_{j,k} is a set of discrete non-decimated wavelets, and {w_{j,k;T}} is a set of amplitudes. The quantities in representation (2.1) possess the following properties:

1. E(ξ_{j,k}) = 0 and Cov(ξ_{j,k}, ξ_{l,m}) = δ_{jl} δ_{km}.

2. There exists, for each j ≤ −1, a Lipschitz continuous function W_j(z) for z ∈ (0, 1) which satisfies the following properties: ∑_{j=−∞}^{−1} |W_j(z)|² < ∞ uniformly in z; the Lipschitz constants L_j are uniformly bounded in j and ∑_{j=−∞}^{−1} 2^{−j} L_j < ∞. There exists a sequence of constants C_j such that for each T, sup_k |w_{j,k;T} − W_j(k/T)| ≤ C_j/T, where for each j ≤ −1 the supremum is over k = 0, ..., T − 1, and where the sequence {C_j} satisfies ∑_{j=−∞}^{−1} C_j < ∞.

The model imposes the same assumptions on the LSW component as in Nason et al. (2000), allowing for locally stationary second-order structure, while also permitting nonstationary first-order behaviour by incorporating a smooth mean function µ. Imposing a Lipschitz continuous trend assumption is not overly restrictive, given that trend functions are generally assumed to be smooth and slowly-evolving. In particular, polynomials (restricted to the interval [0, 1]) are Lipschitz continuous, as are sinusoids. Such an assumption is commonly made in the literature; see for example Vogt (2012) and Khismatullina and Vogt (2020). Furthermore, in Section 3.6 we will discuss the case when the trend is not Lipschitz.

Background to LSW Processes
In the original work of Nason et al. (2000), the trend function µ(t/T) in Equation (2.1) is assumed to be everywhere zero, which forces the process mean E(X_{t,T}) to be equal to zero for all t. Consequently, within the original LSW framework it is only possible to estimate time-varying second-order structure when the time series does not exhibit a trend. In order to discuss our proposed methodology in the setting where a trend is present, we dedicate the rest of this section to recalling a number of definitions and results from Nason et al. (2000), which we will expand upon and adapt to the setting where a trend is present.
The second-order structure of a LSW process is uniquely determined by its spectrum. The evolutionary wavelet spectrum (EWS) of a LSW process is defined as S j (z) := |W j (z)| 2 for rescaled time z = k/T ∈ (0, 1), and measures the contribution to variance at a particular rescaled time z and scale j. Since the W j are assumed to be Lipschitz continuous, the spectrum S j is also Lipschitz continuous, which ensures it evolves slowly over time. Alterations to the LSW model that use different assumptions on the EWS can be found in Fryzlewicz and Nason (2006) and Van Bellegem et al. (2008), which assume respectively that the EWS is piecewise constant and of bounded total variation.
For ease of notation, we now drop the dependence on T in the subscripts of the model quantities. The EWS is estimated via the empirical wavelet coefficients of the time series, given by d_{j,k} := ⟨X_t, ψ_{j,k−t}⟩ = ∑_t X_t ψ_{j,k−t}. As with Fourier approaches, the raw wavelet periodogram I^j_k := |d_{j,k}|² is a biased, inconsistent estimator of the EWS (Nason et al. (2000), Proposition 4): E(I^j_k) = ∑_l A_{jl} S_l(k/T) + O(T^{−1}), where the operator A = (A_{jl})_{j,l<0} is given by A_{jl} := ⟨Ψ_j, Ψ_l⟩ = ∑_τ Ψ_j(τ) Ψ_l(τ), and the autocorrelation wavelets are defined by Ψ_j(τ) := ∑_{k∈Z} ψ_{j,k} ψ_{j,k−τ}, for j < 0, τ ∈ Z. Hence, the raw wavelet periodogram must be corrected using the inverse of the operator A to obtain an asymptotically unbiased estimate of the spectrum at a rescaled location z = k/T ∈ (0, 1). The local autocovariance (LACV), c(z, τ), of a LSW process is defined as c(z, τ) := ∑_{j=−∞}^{−1} S_j(z) Ψ_j(τ), for τ ∈ Z, z ∈ (0, 1). The LACV can be thought of as a decomposition of the autocovariance of a process over scales and rescaled time locations. The LACV is estimated by plugging the smoothed, corrected estimate of the EWS into the definition of the LACV. Using wavelet thresholding of the EWS estimator, it is shown in Nason et al. (2000) that the LACV estimator is consistent. The next section addresses the consistent estimation of these second-order quantities in the presence of first-order nonstationarity.
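The autocorrelation wavelets and the operator A can be computed directly from the discrete wavelets. The sketch below is our own illustration under the Haar wavelet (the helper names `acw` and `A_matrix` are ours); the entries it produces match the well-known Haar values A_{−1,−1} = 3/2, A_{−2,−2} = 7/4 and A_{−1,−2} = 3/4.

```python
import numpy as np

def haar_discrete_wavelets(num_scales):
    # psi_{-1} = (1, -1)/sqrt(2); psi_{j-1,n} = sum_k h_{n-2k} psi_{j,k}
    h = np.array([1.0, 1.0]) / np.sqrt(2)
    psis = [np.array([1.0, -1.0]) / np.sqrt(2)]
    for _ in range(num_scales - 1):
        up = np.zeros(2 * len(psis[-1]) - 1)
        up[::2] = psis[-1]
        psis.append(np.convolve(h, up))
    return psis

def acw(psi):
    # autocorrelation wavelet Psi_j(tau), tau = -(L_j - 1), ..., L_j - 1
    return np.correlate(psi, psi, mode="full")

def A_matrix(psis):
    # A_{jl} = <Psi_j, Psi_l>, both sequences aligned at lag zero
    Psi = [acw(p) for p in psis]
    J = len(psis)
    A = np.zeros((J, J))
    for j in range(J):
        for l in range(J):
            cj, cl = len(Psi[j]) // 2, len(Psi[l]) // 2
            m = min(cj, cl)
            A[j, l] = Psi[j][cj - m:cj + m + 1] @ Psi[l][cl - m:cl + m + 1]
    return A

A = A_matrix(haar_discrete_wavelets(4))
```

The matrix is symmetric by construction, with entries growing towards the coarser scales.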

LSW Estimation in the Presence of Trend via Differencing
In this section, we discuss methodology for estimation of the evolutionary wavelet spectrum and local autocovariance function of a trend-LSW process. In order to estimate these quantities, we employ the ubiquitous practice of differencing to remove the trend, but crucially correct for the effect of this on the spectrum.

Using Differencing to Detrend a Time Series
One of the most common methods for removing the trend in a time series is differencing; see for example Chan et al. (1977) and Shumway and Stoffer (2010). The time series {X_t} can be, for example, first-differenced to obtain a new time series, {∇X_t = X_t − X_{t−1}}, upon which inference is then performed. One advantage of differencing is that no parameters are estimated in the differencing operation, which is not the case for detrending achieved using an estimator of the trend. An n-th order difference is capable of removing a polynomial trend of degree n from the data.
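For instance, first differences reduce a linear trend to a constant, and second differences do the same for a quadratic. A minimal numerical illustration using `numpy.diff` (the particular coefficients are arbitrary):

```python
import numpy as np

t = np.arange(64, dtype=float)
linear = 2.0 + 0.5 * t                 # degree-1 polynomial trend
quadratic = 1.0 + t + 0.25 * t ** 2    # degree-2 polynomial trend

d1 = np.diff(linear, n=1)       # constant: the time-varying trend is removed
d2 = np.diff(quadratic, n=2)    # constant for the quadratic trend
d2_lin = np.diff(linear, n=2)   # identically zero
```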
One of the consequences of differencing is that the second-order statistical properties of the time series in Model (2.1) will change, sometimes quite drastically. Therefore, it is potentially problematic to directly use the differenced process for inference on the original process. In the context of ARIMA modelling, differencing is performed in order to induce stationarity, and estimation is then performed on the differenced series. However, this reasoning does not hold within our setting: if we difference a nonstationary LSW process, the result will still be a nonstationary process. Due to the potentially complex structure of the LSW process, properties of the differenced series are not necessarily representative of the original series. However, in order to estimate the trend component in (2.1), we require an estimate of the second-order structure of the original time series, not the differenced series.
It is straightforward to produce an example where spectral behaviour can be completely altered as a consequence of detrending using, for example, first-differencing. Consider the zero-mean LSW process of length T = 2^{10} = 1024 defined by the spectrum S_j(z) = 1 for j = −1, and S_j(z) = 0 otherwise. This time series is referred to as the scale −1 Haar moving average process, and can be written as X_t = ∑_k ψ_{−1,k−t} ξ_{−1,k}. Computing the expectation of the raw wavelet periodogram of the differenced time series, we find that E(I^{−1}_k) = 5 and, for j < −1, E(I^j_k) = 3 × 2^{j+1}. Suppose we nevertheless apply the standard correction: for all z ∈ (0, 1), the expected value of the LSW estimator at time z is E(L(z)) = E(A^{−1}_{10} I(z)). Having differenced the time series, a problem arises, since E(I^j_k) ≠ ∑_l A_{jl} S_l(k/T) + O(T^{−1}). In particular, the expectation of the EWS estimate at scale −2 at any time point is given by E(∑_l (A^{−1})_{−2,l} I^l_k) = −0.79 < 0. Therefore, the expectation of our estimate of the spectrum at level −2 is −0.79, while in the original time series we had S_{−2}(z) = 0 for all z ∈ (0, 1). Figure 3.1 shows a plot of the original spectrum (left), alongside the expectation of the corrected periodogram estimate (right), showing a clear discrepancy.
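The expectations quoted in this example can be checked numerically. The sketch below is our own verification, assuming Haar wavelets with J = 10 scales: it computes E(Ĩ^j_k) from the autocovariance of the differenced process, then applies the naive A^{−1} correction, whose scale −2 entry indeed comes out negative.

```python
import numpy as np

def haar_discrete_wavelets(J):
    h = np.array([1.0, 1.0]) / np.sqrt(2)
    psis = [np.array([1.0, -1.0]) / np.sqrt(2)]
    for _ in range(J - 1):
        up = np.zeros(2 * len(psis[-1]) - 1)
        up[::2] = psis[-1]
        psis.append(np.convolve(h, up))
    return psis

J = 10
Psi = [np.correlate(p, p, mode="full") for p in haar_discrete_wavelets(J)]

def at_lag(seq, tau):
    c = len(seq) // 2
    return seq[c + tau] if abs(tau) <= c else 0.0

def c_orig(tau):
    # autocovariance of the scale -1 Haar MA process: c(tau) = Psi_{-1}(tau)
    return at_lag(Psi[0], tau)

def c_diff(tau):
    # autocovariance of the first-differenced process
    return 2 * c_orig(tau) - c_orig(tau - 1) - c_orig(tau + 1)

# E(I~_j) = sum_tau Psi_j(tau) c_diff(tau): gives 5 at scale -1, 3 * 2^(j+1) below
EI = np.array([sum(at_lag(P, tau) * c_diff(tau)
                   for tau in range(-(len(P) // 2), len(P) // 2 + 1))
               for P in Psi])

# naive correction with A^{-1}: the scale -2 entry is negative
A = np.zeros((J, J))
for j in range(J):
    for l in range(J):
        cj, cl = len(Psi[j]) // 2, len(Psi[l]) // 2
        m = min(cj, cl)
        A[j, l] = Psi[j][cj - m:cj + m + 1] @ Psi[l][cl - m:cl + m + 1]
corrected = np.linalg.solve(A, EI)
```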
Differencing can also vastly alter the sparsity structure of the spectrum of the error process. The white noise process Y_t = ε_t, where the ε_t are IID random variables, can be represented as an LSW process using Haar wavelets and a spectrum defined by S_j(z) = 2^j for all z ∈ (0, 1).
Differencing this time series gives ∇Y_t = Y_t − Y_{t−1} = ε_t − ε_{t−1}, which is a Haar LSW process with spectrum S_j(z) = 2 for j = −1, and S_j(z) = 0 otherwise. This induces autocorrelation in the time series at lag 1, similar to what is observed when over-differencing in classical stationary time series analysis. Therefore, it can be seen that we must take into account the differencing of the time series if we wish to say something meaningful about the original series.
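This lag-1 effect is easily seen empirically: differenced white noise has sample lag-1 autocorrelation near −1/2 and negligible autocorrelation at higher lags. A small simulation sketch (the seed and series length are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(42)
Y = rng.standard_normal(2 ** 17)   # IID noise
dY = np.diff(Y)                    # eps_t - eps_{t-1}: an MA(1) process

dY = dY - dY.mean()
denom = np.dot(dY, dY)
lag1 = np.dot(dY[1:], dY[:-1]) / denom   # theoretical value: -1/2
lag2 = np.dot(dY[2:], dY[:-2]) / denom   # theoretical value: 0
```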

Intuition Behind Estimation Procedure
Denote the LSW component of Model (2.1) by ε_t. Given that the trend component of Model (2.1) is Lipschitz continuous, first-differencing the time series yields ∇X_t = µ(t/T) − µ((t − 1)/T) + ε_t − ε_{t−1} = O(T^{−1}) + ∇ε_t. Hence, differencing the original series results in a differenced locally stationary wavelet process that is asymptotically zero-mean. We wish to estimate the evolutionary wavelet spectrum of the original time series {X_t} using the differenced series {∇X_t}. Proceeding with the standard estimation procedure of Nason et al. (2000), taking the squared wavelet coefficients and premultiplying by the inverse of the A matrix as in Equation (2.4), is not appropriate, as we have seen in Section 3.1. Denote the empirical non-decimated wavelet coefficients of the first-differenced series by d̃_{j,k} := ∑_t ∇X_t ψ_{j,k−t}. We can compute the expectation of the raw wavelet periodogram Ĩ^j_k := |d̃_{j,k}|², which yields a result analogous to Equations (2.2) and (2.3). Hence, our estimation strategy will follow that of Nason et al. (2000). However, the correction of the raw wavelet periodogram to achieve unbiasedness will require a different correction matrix, to account for the fact that the raw wavelet periodogram is calculated on the differenced series.
Note that, while wavelets themselves are differencing operators, coarser wavelet scales accumulate large bias due to the trend and large filter length. Bias in the coarse scales will also affect the bias in the finer scales, due to the necessity for correcting the estimate using a correction matrix. This can be particularly noticeable for trends that display cusps. Differencing first removes the trend immediately, ensuring that the wavelet transform does not accumulate large bias.
Remark 1. We can write the differenced process, in the mean-square sense, as ∇X_t = ∑_{j=−∞}^{−1} ∑_{k∈Z} w_{j,k} ψ_{j,k−t} ξ̃_{j,k}, where ξ̃_{j,k} = ξ_{j,k} − ξ_{j,k−1}. This process satisfies all of the required properties of a standard LSW process except one: the increments are no longer orthonormal. Instead, we have that Cov(ξ̃_{j,k}, ξ̃_{l,m}) = 2 for j = l and k = m; −1 for j = l and k = m + 1 or k + 1 = m; and 0 otherwise.
Therefore, there is a distinction between the observed time series, which we assume to have LSW errors, and the differenced one, which we do not. This is in contrast to existing approaches utilising differencing for second-order estimation, such as ARIMA models, which instead assume that the differenced series follows the model form rather than the original series.

Asymptotic Behaviour of the Differenced Raw Wavelet Periodogram
As motivated by the discussion in the previous section, we estimate the spectrum by calculating the raw wavelet periodogram of the first-differenced time series. We return to the case of general n-th differencing in Section 3.6. For the purpose of theoretical results, we assume that the {ξ j,k } in Model (2.1) are Gaussian. In practice, this assumption is not required, and in the simulation study in Section 5 we also investigate the case where the {ξ j,k } are exponential random variables.
Before obtaining the result for the expectation of the raw wavelet periodogram, we define two operators which involve inner products of the autocorrelation wavelets at lag 1: the operator A^1 with entries A^1_{jl} := ∑_τ Ψ_j(τ) Ψ_l(τ + 1), and the operator D^1 := 2(A − A^1). We write D^1_J for the matrix formed by restricting D^1 to the finest J scales.
Proposition 3.2. The matrix D^1_J is invertible.
Intuitively, it is not surprising that the quantity D^1 appears when we calculate the expectation of the squared wavelet coefficients of the first-differenced series. Indeed, we have the following result for the asymptotic behaviour of the raw wavelet periodogram of the first-differenced time series.
Proposition 3.3. Let Ĩ^j_k = |d̃_{j,k}|² be the wavelet periodogram of the first-differenced time series. Under the assumptions of Model (2.1), E(Ĩ^j_k) = ∑_l D^1_{jl} S_l(k/T) + O(T^{−1}). Therefore, for the vector of periodograms Ĩ(z) := {Ĩ^l_{⌊zT⌋}}_{l=−1,...,−J} and the vector of corrected periodograms L(z) := (D^1_J)^{−1} Ĩ(z), we have E(L(z)) = S(z) + O(T^{−1}), where S(z) := {S_j(z)}_{j=−1,...,−J} is the EWS of the original process. Thus, we can bias-correct the raw wavelet periodogram of the first-differenced time series by using the inverse of D^1_J instead of the inverse of A_J from Nason et al. (2000).
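The correction can be made concrete numerically. The sketch below is our own illustration under Haar wavelets, forming D^1 as 2(A − A^1) consistently with the differenced-process expectations of Section 3.1; note that its first column reproduces the values 5 and 3 × 2^{j+1} from the scale −1 Haar MA example, and that applying (D^1_J)^{−1} recovers a spectrum from the corresponding expected periodogram.

```python
import numpy as np

def haar_discrete_wavelets(J):
    h = np.array([1.0, 1.0]) / np.sqrt(2)
    psis = [np.array([1.0, -1.0]) / np.sqrt(2)]
    for _ in range(J - 1):
        up = np.zeros(2 * len(psis[-1]) - 1)
        up[::2] = psis[-1]
        psis.append(np.convolve(h, up))
    return psis

def lag_inner(Pj, Pl, lag):
    # sum_tau Pj(tau) * Pl(tau + lag), sequences centred at lag zero
    cj, cl = len(Pj) // 2, len(Pl) // 2
    return sum(Pj[cj + t] * Pl[cl + t + lag]
               for t in range(-cj, cj + 1) if abs(t + lag) <= cl)

J = 6
Psi = [np.correlate(p, p, mode="full") for p in haar_discrete_wavelets(J)]
A = np.array([[lag_inner(Psi[j], Psi[l], 0) for l in range(J)] for j in range(J)])
A1 = np.array([[lag_inner(Psi[j], Psi[l], 1) for l in range(J)] for j in range(J)])
D1 = 2 * (A - A1)

# if E(I~) = D1 @ S, the spectrum is recovered via the inverse of D1
S = 2.0 ** -np.arange(1, J + 1)        # e.g. white noise spectrum S_j = 2^j
EI = D1 @ S
S_recovered = np.linalg.solve(D1, EI)
```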
Remark 2. Proposition 3.3 can be generalised to the lag L differenced time series {∇_L X_t := X_t − X_{t−L}}_{t=L+1}^{T} for any fixed integer L. The expectation of the raw wavelet periodogram of the lag L differenced series at location k is given by ∑_l D^L_{jl} S_l(k/T) + O(T^{−1}), where D^L := 2(A − A^L) and A^L_{jl} := ∑_τ Ψ_j(τ) Ψ_l(τ + L). The matrix D^L_J is invertible by the same argument used to show invertibility of D^1_J. This enables the raw wavelet periodogram of the L-differenced series to be corrected using the matrix (D^L_J)^{−1}. This technique can be used to extend the method to allow for seasonal trends, as shown empirically in Section 5.3.
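The seasonal use of lag-L differencing is easy to illustrate: a deterministic cycle whose period divides L is removed exactly. A minimal sketch (the period and series length are arbitrary choices of ours):

```python
import numpy as np

T, L = 240, 12
t = np.arange(T, dtype=float)
x = 5.0 + np.sin(2 * np.pi * t / L)    # constant level plus a period-12 cycle

dx = x[L:] - x[:-L]                    # lag-L difference, nabla_L X_t
```

The constant level and the seasonal component both cancel, leaving a series that is identically zero.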

Bounded Invertibility of Haar and Shannon Operators
In order to achieve an asymptotically unbiased estimator of the EWS, we require boundedness of the inverse operator used when performing the bias correction step in the estimation procedure.
In the original LSW work of Nason et al. (2000), boundedness of the inverse of the operator A was shown in the case of the Haar and Shannon wavelets. The members of the Daubechies compactly supported family of wavelets are characterised by their number of vanishing moments. The Haar wavelet is equivalently the Daubechies extremal phase wavelet with one vanishing moment, while the Shannon wavelet is the limiting wavelet in the Daubechies compactly supported family (Chui, 1997) as the number of vanishing moments tends to infinity. As noted in von Sachs et al. (1997), the Haar and Shannon wavelets can therefore be viewed as the lower and upper "extremes" of the family of Daubechies compactly supported wavelets. Therefore, proving bounded invertibility in these two cases intuitively suggests that bounded invertibility should hold for all Daubechies compactly supported wavelets. It is for these two wavelets that we prove bounded invertibility here; extensions to other wavelets are left for future work.
Proposition 3.2 shows that the matrix D^1_J is invertible with bounded inverse. Asymptotically, however, the inverse operator is unbounded, which mirrors a result in the setting of locally stationary Fourier time series (Roueff and Von Sachs, 2011, Equation 5). Intuitively this is expected, since the differencing operator itself is asymptotically non-invertible. We can interpret the expectation result in Proposition 3.3 as a quantification of the effect of differencing on the spectral structure of the time series.
We can account for this theoretical issue by proving bounded invertibility for a related, rescaled operator P : ℓ²(N) → ℓ²(N), whose entries are given by P_{jl} = 2^{−j/2} D^1_{jl} 2^{−l/2}. We work with P because the inverse of the operator D^1 is unbounded, whereas P can be shown to possess a bounded inverse. Showing that P possesses a bounded inverse enables us to establish theoretical consistency properties of the wavelet periodogram-based estimator. Hence, we show for the Haar and Shannon wavelet families that P possesses a bounded inverse, which enables consistent estimation of the EWS for Model (2.1).
Note that, in the practical implementation of the methodology, we still use the inverse of the matrix D^1_J for correcting the raw wavelet periodogram, as it is an invertible matrix (Proposition 3.2).
Theorem 3.4. Let λ_min(P) denote the smallest eigenvalue of P, where the entries of P are given by P_{jl} = 2^{−j/2} D^1_{jl} 2^{−l/2}. Then, for the Haar and Shannon wavelet families, there exists δ > 0 such that λ_min(P) ≥ δ, and hence ||P^{−1}|| < ∞. That is, P is positive-definite and has a bounded inverse.
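A finite-J numerical check consistent with Theorem 3.4 in the Haar case: forming the J × J section of P for increasing J, its smallest eigenvalue stays strictly positive (and, by eigenvalue interlacing, is non-increasing in J). This is only an illustration for small J, not a proof; all function names are ours.

```python
import numpy as np

def haar_discrete_wavelets(J):
    h = np.array([1.0, 1.0]) / np.sqrt(2)
    psis = [np.array([1.0, -1.0]) / np.sqrt(2)]
    for _ in range(J - 1):
        up = np.zeros(2 * len(psis[-1]) - 1)
        up[::2] = psis[-1]
        psis.append(np.convolve(h, up))
    return psis

def lag_inner(Pj, Pl, lag):
    cj, cl = len(Pj) // 2, len(Pl) // 2
    return sum(Pj[cj + t] * Pl[cl + t + lag]
               for t in range(-cj, cj + 1) if abs(t + lag) <= cl)

def P_matrix(J):
    Psi = [np.correlate(p, p, mode="full") for p in haar_discrete_wavelets(J)]
    D1 = 2 * np.array([[lag_inner(Psi[j], Psi[l], 0) - lag_inner(Psi[j], Psi[l], 1)
                        for l in range(J)] for j in range(J)])
    # row index i corresponds to scale j = -i, so 2^{-j/2} = 2^{i/2}
    w = 2.0 ** (np.arange(1, J + 1) / 2)
    return D1 * np.outer(w, w)

min_eigs = [np.linalg.eigvalsh(P_matrix(J)).min() for J in range(2, 9)]
```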

Smoothing and Estimation Theory
As in the original LSW model, the wavelet periodogram is not a consistent estimator and must be smoothed. Smoothing to achieve consistency can be performed, for example, via wavelet thresholding or a running mean. For brevity, we provide theoretical results for wavelet thresholding, building on known results in the literature.
In order to utilise the result on boundedness of the operator P^{−1} in Theorem 3.4, we rewrite the formula for the expectation of the wavelet periodogram. We can express the expectation in terms of P and a scaled version of S_j, given by S̃_j = 2^{j/2} S_j, by rescaling the periodogram appropriately. Concretely, consider the auxiliary process ε̃_t = ∑_{j,k} w̃_{j,k} ψ̃_{j,k−t} ξ_{j,k}, where w̃_{j,k} = 2^{j/4} w_{j,k} and ψ̃_{j,k−t} = 2^{−j/4} ψ_{j,k−t}. Then, the expectation of the raw wavelet periodogram (computed with respect to the rescaled wavelets ψ̃_{j,k−t}) is given by E(Ĩ^j_k) = ∑_l P_{jl} S̃_l(k/T) + O(T^{−1}), where S̃_j(k/T) = 2^{j/2} S_j(k/T) and P_{jl} = 2^{−j/2} D^1_{jl} 2^{−l/2}. To achieve consistency, we take a similar approach to Nason et al. (2000). For each fixed scale j, the rescaled periodogram Ĩ^j_k of a Gaussian LSW process (which is scaled χ²-distributed) is smoothed as a function of z = k/T using, for example, discrete wavelet transform (DWT) shrinkage or the translation invariant (TI) denoising of Coifman and Donoho (1995). Using the DWT, smoothing is performed with respect to an orthonormal wavelet basis, where the "true" wavelet coefficients are given by v^j_{r,s} = ∫_0^1 Ĩ^j_{⌊zT⌋} ψ_{r,s}(z) dz. As in von Sachs et al. (1997), we employ a slight abuse of notation, with ψ_{r_0−1,s} = φ_{r_0,s}, in order to include the scaling coefficient at the coarsest scale r_0 of the second wavelet scheme. The empirical analogues of the wavelet coefficients are given by v̂^j_{r,s} = T^{−1} ∑_k Ĩ^j_k ψ_{r,s}(k/T). Denoising is then achieved by applying non-linear hard wavelet thresholding to the coefficients v̂^j_{r,s}. The resulting estimator is obtained by inverting the wavelet transform using only the coefficients which remain after thresholding, with ṽ^j_{r,s} = v̂^j_{r,s} I(|v̂^j_{r,s}| > λ) denoting the hard-thresholded wavelet coefficients with threshold λ. The particular threshold is specified in Theorem 3.5, as is the appropriate set of indices r over which to perform the summation.
The theoretical argument to consistently estimate S_j(k/T) is thus as follows. First, smooth the rescaled wavelet periodogram at each scale j using wavelet thresholding. Next, use the (bounded) inverse of P to correct the smoothed, rescaled periodogram. Finally, multiply the estimate at each scale by 2^{−j/2}, since S_j(k/T) = 2^{−j/2} S̃_j(k/T), yielding the final estimator. Using the hard threshold λ(j, r, s, T)² = Var(v̂^j_{r,s}) log²(T) when smoothing the periodogram via wavelet thresholding, we can show that the smoothed, corrected estimate Ŝ_j(z) is consistent with respect to the L² error.
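The smoothing step can be sketched with a hand-rolled orthonormal Haar DWT and hard thresholding; this is our own minimal illustration (the function names, and the choice of Haar for the second wavelet scheme, are ours), not the paper's implementation.

```python
import numpy as np

def haar_dwt(x):
    # orthonormal Haar DWT of a length 2^m signal
    coeffs, approx = [], np.asarray(x, float)
    while len(approx) > 1:
        even, odd = approx[::2], approx[1::2]
        coeffs.append((even - odd) / np.sqrt(2))   # detail coefficients
        approx = (even + odd) / np.sqrt(2)         # approximation coefficients
    return coeffs, approx

def haar_idwt(coeffs, approx):
    # invert the transform, coarsest scale first
    for d in reversed(coeffs):
        out = np.empty(2 * len(approx))
        out[::2] = (approx + d) / np.sqrt(2)
        out[1::2] = (approx - d) / np.sqrt(2)
        approx = out
    return approx

def hard_threshold_smooth(I_j, lam):
    # smooth one scale of the raw periodogram by hard-thresholding its DWT
    coeffs, approx = haar_dwt(I_j)
    coeffs = [np.where(np.abs(d) > lam, d, 0.0) for d in coeffs]
    return haar_idwt(coeffs, approx)
```

With threshold zero the transform round-trips exactly; with a very large threshold every detail is discarded and the reconstruction collapses to the signal mean.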
Theorem 3.5. Let ψ be a wavelet of bounded variation, with 2^r = o(T) for the wavelet coefficients v̂^j_{r,s}. For a Gaussian trend-LSW process, using the threshold λ²(j, r, s, T) = Var(v̂^j_{r,s}) log²(T), the estimator Ŝ_j(z) is mean-square consistent for each fixed j, with the rate given in Equation (3.2). The rate obtained in Equation (3.2) is a consequence of known results on wavelet thresholding estimators, utilised in Neumann and Von Sachs (1995), together with the multiplication by 2^{−j/2} that occurs in the estimation procedure. The rate highlights the fact that differencing the time series results in an "information loss", with spectral estimation in coarser scales attaining slower rates of convergence. As a consequence, we can only consistently estimate the wavelet spectrum for a proportion of the finest scales j. In particular, the scale j must be suitably restricted relative to T (for some δ > 0) in order for the mean-squared error of the EWS estimator to be o(1). Next, we tackle estimation of the local autocovariance via the EWS estimate.
Proposition 3.6. Define ĉ(z, τ) by replacing S_j(z) by Ŝ_j(z) in the equation for the local autocovariance, and replacing the lower limit in the sum from j = −∞ to j = −J_0, i.e. ĉ(z, τ) := ∑_{j=−J_0}^{−1} Ŝ_j(z) Ψ_j(τ).
Let T → ∞ and let J_0 = α log_2 T for α ∈ (0, 1). Assume that S_j(z) ≤ D 2^{γj} for some positive constant D, where γ = (3α)^{−1} − 1/2. Then, ĉ(z, τ) is a mean-square consistent estimator of c(z, τ). We reiterate that the rescaling argument used to achieve consistency in Theorem 3.5 and Proposition 3.6 is performed purely for theoretical reasons; practical considerations for estimation are discussed at the end of the section. The results in Theorem 3.5 and Proposition 3.6 show that, in the case where the trend is Lipschitz continuous, we can consistently estimate both the EWS and LACV of the original process using first-order differences and a modified bias correction.
The assumption placed on the decay rate of the EWS in Proposition 3.6 is a purely technical one, utilised in order to ensure mean square consistency of the estimator. The assumption controls for the fact that the local autocovariance is estimated using the finest J 0 scales, instead of across infinite scales which are not available in practice. The specific form of the decay rate is calculated in order to balance with the error rate of the wavelet thresholding procedure.

n-th Order Differencing
In some cases, a first difference may not be enough to remove a trend. Further differencing can be performed, although it is usually only necessary to difference a time series at most twice (Brockwell et al., 1991). If we assume that the (n − 1)-th derivative of µ is Lipschitz, then the n-th difference of the time series will be (asymptotically) free of trend. We denote the n-th difference of a time series by {∇^n X_t}.
To calculate the expectation of the squared non-decimated wavelet coefficients of the n-th differenced series, we can argue in a similar fashion to the case of first-differencing. Denote by A^n_J the matrix whose entries are given by the inner product of the autocorrelation wavelets at lag n, A^n_{jl} := ∑_τ Ψ_j(τ) Ψ_l(τ + n). If we difference a time series n times, then the expectation of the squared wavelet coefficients will involve linear combinations of the inner product autocorrelation wavelet matrices from lag 0 (i.e. the standard A matrix) up to lag n (the matrix A^n). The result for the expectation of the wavelet periodogram of the n-th differenced time series mirrors that of the first-differenced series, and is given by the following proposition.
Proposition 3.7. Let d̃_{j,k} = ∑_t ∇^n X_t ψ_{j,k−t} be the non-decimated wavelet coefficients of ∇^n X_t, and let Ĩ^j_k := |d̃_{j,k}|². If the (n − 1)-th derivative of µ is Lipschitz, then E(Ĩ^j_k) = ∑_l D^n_{jl} S_l(k/T) + O(T^{−1}). Corollary 3.8. For second differences, E(Ĩ^j_k) = ∑_l (6A_{jl} − 8A^1_{jl} + 2A^2_{jl}) S_l(k/T) + O(T^{−1}). Note that Proposition 3.7 generalises Proposition 4 in Nason et al. (2000), in which n = 0.
As in the case of first differences, the bias operator D^n does not possess a bounded inverse. Intuitively, for higher order differences, the eigenvalues of D^n decay to 0 at increasingly faster rates. As such, correcting the estimate in a similar fashion to that described in Theorem 3.5 yields much slower rates of convergence.
For the remainder of the section, we focus on the situation where a second difference suffices to remove the trend. We can show that using first differences suffices to obtain a consistent spectral estimate, even though the trend has not been fully removed. The first differences of the trend are Lipschitz continuous, and as such the magnitude of the wavelet coefficients can be bounded using the wavelet characterisation of Hölder spaces (Daubechies, 1992). Using this bound, we can bound the error of the first-differenced estimator in terms of the error due to estimation and the error due to the squared wavelet coefficients of the first-differenced trend. Using this argument, we are able to show consistency of both the spectrum and local autocovariance estimators as follows.
Theorem 3.9. Assume that the first derivative of $\mu$ is Lipschitz, and let $J_1 = \beta \log_2 T$ for $\beta \in (0, 1)$. Further assume that the smoothed raw wavelet periodogram is corrected across the finest $J_1$ scales only. Under the same conditions as Theorem 3.5, for each fixed $j$, $\hat{S}_j(z)$ is a consistent estimator of $S_j(z)$, provided that $T^{7\beta - 4} \to 0$ as $T \to \infty$, since
Define $\hat{c}(z, \tau)$ by replacing $S_j(z)$ by $\hat{S}_j(z)$ in the equation for the local autocovariance and changing the lower limit of the sum from $j = -\infty$ to $j = -J_0$, i.e.
where $J_0 = \alpha \log_2 T$ for $\alpha < \beta \in (0, 1)$. Under the assumptions of Proposition 3.6, provided
Hence, we can still consistently estimate the EWS and LACV via first-order differences when the trend of the time series has a Lipschitz continuous first derivative. Therefore, we argue that in most practical scenarios it is sufficient to perform only one difference in order to estimate the evolutionary wavelet spectrum and local autocovariance in the presence of a trend.
In practice, we set J 0 = J 1 when performing EWS and LACV estimation. That is, we use J 1 scales for estimating the EWS, and the same number of scales for estimating the LACV.
Performing estimation in this fashion ensures that the spectral estimate is well-behaved, and is an approach commonly adopted in the LSW literature. We suggest using $J_1 = \beta \log_2(T)$, with $\beta = 7/10$, motivated by extensive numerical results given in Appendix A.5. Furthermore, this choice is in agreement with other discussion in the literature; see, for example, Sanderson et al. (2010). Finally, an algorithmic description of the spectral estimation procedure, where smoothing is carried out either via wavelet thresholding or using a running mean, is detailed in Algorithm 1.
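As an illustration, the main steps of the spectral estimation procedure (first-difference the series, compute the squared non-decimated wavelet coefficients, smooth with a running mean) can be sketched as below. This is a minimal sketch, not the paper's implementation: the Haar filter, the simulated quadratic trend, the seed, and the helper names are all illustrative assumptions.

```python
import numpy as np

def haar_nondec_coeffs(x, j):
    """Non-decimated Haar wavelet coefficients at scale j = 1, 2, ...
    (the filter at scale j has length 2**j)."""
    L = 2 ** (j - 1)
    h = np.concatenate([np.ones(L), -np.ones(L)]) / np.sqrt(2 ** j)
    return np.convolve(x, h, mode="same")

def raw_wavelet_periodogram(x, n_scales):
    """Squared non-decimated wavelet coefficients of the first-differenced series."""
    dx = np.diff(x)  # first difference approximately removes a smooth trend
    return np.array([haar_nondec_coeffs(dx, j) ** 2 for j in range(1, n_scales + 1)])

def running_mean(v, W):
    """Smooth each scale of the periodogram with a running mean of bin width W."""
    kernel = np.ones(W) / W
    return np.apply_along_axis(lambda row: np.convolve(row, kernel, mode="same"), 1, v)

T = 1024
beta = 7 / 10
J1 = int(beta * np.log2(T))  # number of finest scales analysed; 7 for T = 1024
rng = np.random.default_rng(1)
x = np.linspace(0, 5, T) ** 2 / 5 + rng.standard_normal(T)  # trend plus noise
I_hat = running_mean(raw_wavelet_periodogram(x, J1), W=128)
```

In the full procedure the smoothed periodogram would then be bias-corrected using the appropriate inverse matrix, which is omitted here.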

Trend Estimation Using the Spectral Estimate
In this section, we discuss a wavelet thresholding approach for the estimation of the trend component of Model (2.1). If a first (or second) difference is capable of removing the trend from a time series, we have shown that we can consistently estimate the time-varying evolutionary wavelet spectrum using the smoothed, corrected raw wavelet periodogram of the differenced time series. We now wish to use this estimate in order to estimate the trend of the time series.

Wavelet Thresholding Estimator
The approach we take is to use the spectral estimate directly within a wavelet thresholding estimation procedure. In von Sachs and MacGibbon (2000), the authors describe a wavelet thresholding methodology for consistent curve estimation in the presence of locally stationary errors, subject to mild regularity conditions. The authors propose to use a local median absolute deviation pre-estimate for the variance of the wavelet coefficients, which is used in the threshold.

Algorithm 1: Spectral Estimation Procedure
Input: Data $\{X_t\}_{t=1}^n$, spectral estimation wavelet $\psi^0$, maximum scale $J_1$, smoothing wavelet $\psi$ or smoothing bin width parameter $W$
1. Compute the wavelet periodogram of the first-differenced time series.
2. Smooth the raw wavelet periodogram, using either:
(A) Wavelet thresholding: where $\tilde{v}^j_{rs} = \hat{v}^j_{rs} I(\hat{v}^j_{rs} > \lambda(j, r, s, T))$, with $\hat{v}^j_{rs}$ given in Equation (3.1), and $\lambda(j, r, s, T) = \hat{\sigma}_j \log(T)$ with $\hat{\sigma}_j$ computed as the median absolute deviation estimator of the $\hat{v}^j_{rs}$.
(B) Running mean.
3. Compute the corrected, smoothed wavelet periodogram.
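The MAD-based thresholding used to smooth the raw wavelet periodogram in Algorithm 1 can be sketched as follows; the coefficient array, seed, and the single injected "signal" coefficient are illustrative assumptions.

```python
import numpy as np

def mad_sigma(v):
    """Median absolute deviation estimate of the noise level, scaled for Gaussianity."""
    return np.median(np.abs(v - np.median(v))) / 0.6745

def hard_threshold(v, lam):
    """Keep coefficients whose magnitude exceeds the threshold; zero the rest."""
    return np.where(np.abs(v) > lam, v, 0.0)

T = 1024
rng = np.random.default_rng(0)
v = rng.standard_normal(200)
v[10] = 8.0                      # one genuine "signal" coefficient
lam = mad_sigma(v) * np.log(T)   # threshold sigma_hat * log(T), as in step 2(A)
vt = hard_threshold(v, lam)
```

Because the MAD is robust, the one large coefficient barely perturbs the noise-level estimate, so the threshold kills the noise while retaining the signal.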
We instead incorporate the consistent spectral estimate into the thresholding procedure. This yields an analogous version of Theorem 1 in von Sachs and MacGibbon (2000), for the specific case of Lipschitz continuous trend functions.
With a slight abuse of notation, denote the estimated wavelet coefficients of the time series by $\hat{v}_{rs}$. The results in this section apply to the commonly used soft and hard thresholding rules, where $\lambda = \lambda(r, s, T)$ is the threshold and $I$ is the indicator function. Asymptotic normality of the empirical wavelet coefficients, established in von Sachs and MacGibbon (2000), permits the use of a coefficient-dependent universal threshold $\lambda(r, s, T) = \sigma_{rs}\sqrt{2\log(T)}$, where $\sigma^2_{rs}$ is the variance of the wavelet coefficients. This yields the following result for the wavelet thresholding estimator $\hat{\mu}$ obtained using the DWT and the threshold $\lambda(r, s, T)$, with either soft or hard thresholding, calculated in the same way as described in Section 3.5. We note, however, that the result also holds for TI wavelet denoising (outlined in Section 3.5), since the non-decimated wavelet coefficients can be seen as a set of DWT coefficients computed from cyclic shifts of the data.
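The soft and hard rules, together with the universal threshold $\sigma_{rs}\sqrt{2\log T}$, can be written compactly as below. This is a generic sketch with illustrative coefficient values, not the full coefficient-dependent procedure.

```python
import numpy as np

def hard(d, lam):
    # hard rule: d * I(|d| > lam)
    return d * (np.abs(d) > lam)

def soft(d, lam):
    # soft rule: sign(d) * max(|d| - lam, 0)
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)

sigma_rs, T = 1.0, 1024
lam = sigma_rs * np.sqrt(2 * np.log(T))   # universal threshold, about 3.72 here
d = np.array([-5.0, -1.0, 0.5, 4.0])
```

Hard thresholding keeps surviving coefficients unchanged, whereas soft thresholding shrinks them towards zero by the threshold amount.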
Proposition 4.1. Let $\psi$ be a wavelet of bounded variation, with $2^r = o(T)$ for wavelet coefficients $\hat{v}_{rs}$. For a trend-LSW process with Lipschitz continuous trend, and using the threshold $\lambda(r, s, T) = \sigma_{rs}\sqrt{2\log(T)}$, the estimator $\hat{\mu}$ satisfies
Note that this result is subject to mild regularity assumptions on the LSW component of the model, all of which are satisfied. The innovations $\{\xi_{j,k}\}$ are not restricted to be Gaussian, and can, for example, be exponential, gamma, or inverse Gaussian distributed. To estimate the variance $\sigma^2_{rs}$, which is needed to choose the threshold $\lambda(r, s, T)$, we use an estimate of the variance of the empirical wavelet coefficients. If the LSW process is generated by wavelet $\psi^0$, and the wavelet used for thresholding is denoted $\psi^1$, the variance of the empirical wavelet coefficients $d_{j,k} := \sum_t X_t \psi^1_{j,k-t}$ is given by
where $\Psi^0_j(\tau)$ and $\Psi^1_j(\tau)$ are the autocorrelation wavelets with respect to wavelets $\psi^0$ and $\psi^1$. Plugging the estimate $\hat{S}_j(z)$, obtained in Section 3.5, into Expression (4.1) yields the universal-type threshold $\lambda(r, s, T) = \hat{\sigma}_{r,s}\sqrt{2\log(T)}$, where

Practical Considerations
In alignment with the discussion in von Sachs and MacGibbon (2000), in practice we analyse approximately the finest 7/10 of the scales of the time series, as in the spectral estimation procedure.
In practice, we have found that applying hard thresholding yields better performance. We also recommend the use of translation-invariant (TI) thresholding over a standard discrete wavelet transform, as we have found that it offers stronger practical performance in terms of the mean squared error of the estimator. As noted in Nason (2008), use of a non-decimated transform ensures that there is "more chance" of the wavelet coefficients picking up the signal of the time series.
Note that it is possible to obtain negative estimates of the variance of the wavelet coefficients, although based upon our empirical analyses this is rare. In this case we replace the negative values by the nearest neighbouring positive value, which was found to have no discernible impact on the trend estimation procedure. We recommend the use of the Daubechies Least Asymmetric wavelet with 4 vanishing moments, as we have found empirically that it works well for estimation purposes and also helps to minimise the number of negative variance estimates.
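The nearest-positive-neighbour replacement described above might be implemented as in the following sketch; the function name and the tie-breaking rule (preferring the earlier neighbour) are our own choices.

```python
import numpy as np

def fill_negatives_nearest(v):
    """Replace negative variance estimates with the nearest positive value."""
    v = np.asarray(v, dtype=float).copy()
    pos = np.flatnonzero(v > 0)
    if pos.size == 0:
        raise ValueError("no positive estimates to borrow from")
    for i in np.flatnonzero(v < 0):
        j = pos[np.argmin(np.abs(pos - i))]   # index of nearest positive neighbour
        v[i] = v[j]
    return v
```

For example, `fill_negatives_nearest([4.0, -1.0, 2.0, -3.0])` replaces each negative entry with its closest positive neighbour.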
Note that Proposition 4.1 holds for non-Gaussian time series, while theoretical results concerning the second-order estimation require an assumption of normality. In practice, our approach still performs well in the presence of non-Gaussian noise, as we show in the simulation study.
To complete the discussion, we provide algorithmic pseudo code for the trend estimation procedure using TI thresholding in Algorithm 2.

Simulation Study
In this section we illustrate the ability of our proposed methodology to jointly estimate the mean and EWS of a trend-LSW process by performing a simulation study. For each set of simulations, we use the three EWS shown in Figure 5.1, which represent spectra with distinct characteristics. The spectra are explicitly defined in the supplementary material. Spectrum S 1 , studied in Nason (2008), displays coarse-scale, slowly-evolving sinusoidal behaviour with a fine-scale burst in power at time point 800. Spectrum S 2 is a concatenation of moving average processes and contains power moving from fine to coarser scales, and was examined in Nason et al. (2000). Spectrum S 3 contains slowly-evolving power at fine scales.
We simulate 100 realisations of time series $\{X_t\}_{t=0}^{T-1}$ of length $T = 2^{10} = 1024$ from LSW processes with these spectra, with different additive trend functions. The trends used in the simulation study are a linear, sinusoidal, logistic, and piecewise quadratic trend.

Spectral Estimation Performance
In this simulation we show that by first-differencing to remove the trend, we can obtain an unbiased EWS estimate, which in turn can be used to obtain a trend estimate that performs well. For each realisation, the un-smoothed estimate of the EWS was calculated, which was then used to obtain an averaged estimate for the EWS across the 100 realisations. In alignment with the discussion in Section 3.5, we correct the raw wavelet periodogram across the finest 7 scales.
We report the mean squared error of the averaged spectrum compared with the true spectrum in Table 1. We compare these values with the mean squared error obtained by using the standard LSW estimation procedure of Nason et al. (2000). In this case, no trend is present, estimation is performed using the original A-matrix inverse for bias correction, and no differencing is performed. This is calculated using the ewspec3 command in the locits R package. This is reported in the "None" row in Table 1, and represents the 'best-case' performance with which to compare. We see that, despite the presence of a trend, differencing enables us to approximately remove it in order to accurately estimate the spectrum. The estimation error is smaller using our methodology than in the "None" case for spectrum 2, while it is marginally worse for the other two spectra. Visual inspection of the resulting estimators shows they are generally satisfactory, with examples given in Figure 5.3. Note that, while the spectral estimation could also have been performed using a second difference instead of a first, this over-differencing results in higher estimation error. A numerical comparison between first and second differences, showing the effect of over-differencing, is given in Appendix A.3.

Trend Estimation Performance
We next assess the trend estimation procedure. The trend estimate is computed using the Daubechies Least Asymmetric wavelet with 4 vanishing moments, analysing the finest 6 scales.
The spectrum estimate used in the thresholding procedure is smoothed with a running mean of bin width 128. We compute the average mean squared error across the 100 realisations. We compare our method to two other wavelet-based trend estimation methods; in order to make a fair comparison, we use the same hyper-parameters where applicable. Firstly, we compare to the standard wavelet thresholding method based upon a global, time-independent universal threshold $\hat{\sigma}_r\sqrt{2\log(1024)}$, computed using the wavethresh package in R. This estimator assumes that the error process is second-order stationary, and is referred to as stationary wavelet thresholding (SWT) in the tables. Secondly, we compare to the wavelet-based trend estimation procedure in von Sachs and MacGibbon (2000). No code is publicly available for this method, so we have implemented it using wavethresh, following the description of the computation of the threshold in Section 2.5 of von Sachs and MacGibbon (2000). In the tables, the abbreviation LSWT refers to our method of locally stationary wavelet thresholding, and MVSWT refers to the wavelet-based method of von Sachs and MacGibbon (2000).
We repeat this simulation twice; firstly using Gaussian innovations in the LSW process, and secondly using exponentially distributed innovations. Further results examining the performance in other, non-LSW error scenarios can be found in the supplementary material. The results of the simulation for Gaussian LSW processes are reported in Table 2, while the results for exponentially distributed innovations are shown in Table 3. Values in bold are the lowest values for each trend and spectrum value across the three methods.
From the tables we observe that the performance of our estimator in the presence of exponentially distributed random innovations is comparable to the Gaussian case. The error is consistent across the various time series scenarios, and the standard deviation is low. We see that our method outperforms the two other wavelet-based methods across all trend and spectrum scenarios.
Table 3: Average mean squared error and standard deviation (in brackets) of the trend estimate over 100 realisations generated using exponential innovations.

Extension to Seasonal Time Series
Our approach can be extended empirically to time series that display an additive deterministic seasonal component. From the discussion in Remark 2, if the time series possesses a (known) number of seasons K, lag K-differencing can be performed to remove the seasonality in order to estimate the EWS of the original time series. To illustrate this, we investigated the performance of the method in the presence of seasonal and smooth trends.
We simulated 100 realisations of time series according to the various trend and spectrum scenarios described in Section 5.1. To each of these scenarios, we added a (monthly) stationary seasonal component with K = 12, where the value of each of the 12 seasonal components $K_1, \ldots, K_{12}$ is generated from a uniform random variable on the interval [0, 10]. That is, the time series is generated by $X_{t,T} = \mu(t/T) + s_t + \epsilon_{t,T}$, where $\mu$ is the smooth trend function, the seasonal component satisfies $s_t = K_1$ for $t \bmod 12 = 1, \ldots, s_t = K_{12}$ for $t \bmod 12 = 0$, and $\epsilon_{t,T}$ is the LSW component.
Table 4: Mean squared error comparison for the averaged spectrum estimate, multiplied by $10^3$, across the spectrum, trend, and seasonality scenarios.
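To see why lag-K differencing removes such a component, note that $s_t$ has period 12, so $s_t - s_{t-12} = 0$ exactly, and the lag-12 difference contains only the (differenced) trend and noise. A minimal sketch, where the quadratic trend, noise level, and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
T, K = 1024, 12
Kvals = rng.uniform(0, 10, size=K)       # seasonal components K_1, ..., K_12
t = np.arange(T)
s = Kvals[t % K]                         # periodic seasonal component s_t
x = (t / T) ** 2 + s + 0.1 * rng.standard_normal(T)  # trend + season + noise
dx = x[K:] - x[:-K]                      # lag-12 difference: seasonal term cancels
```

The differenced series `dx` has far smaller variance than `x`, since the dominant seasonal variation has been removed exactly.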
The results of these experiments are given in Table 4, which can be compared to the results for the lag 1 differences in Table 1. In particular, the performance is actually slightly better for spectrum 3, the smoothest spectrum. The other two spectra yield worse results, however: the performance drop-off is most noticeable for spectrum 2, which contains piecewise stationary components. This is due to the increased bias incurred in estimating the time-varying spectra using lag 12 differences instead of lag 1 differences. Examples of the averaged spectrum estimates are given in Figure 5.4, qualitatively showing strong similarities to the examples in Figure 5.3.

Data Application: Baby Electrocardiogram Data
Next, we discuss a data application that highlights the benefits of our proposed methodology.
We re-analyse the baby electrocardiogram (ECG) data set first examined in Nason et al. (2000).
Furthermore, in the supplementary material, we analyse a wave height data set collected from a buoy in the Atlantic Ocean that was studied in Killick et al. (2012). In Figure 6.1 left, we see a time series for the ECG reading of a 66-day-old infant. The series is available in the wavethresh package and was collected by the Institute of Child Health at the University of Bristol. The series can be seen to exhibit both nonstationary trend and second-order behaviour. As Nason et al. (2000) note, one reason for the nonstationarity is that the ECG varies considerably over time and changes significantly between periods of sleep and waking. In Nason et al. (2000), the authors first-difference the series to remove the trend, as shown in Figure 6.1 right, resulting in an approximately zero-mean time series. Then, LSW analysis is performed on the differenced series. We instead perform analysis on the original series, and note the similarities and differences to the original analysis in Nason et al. (2000). We use the Daubechies Least Asymmetric wavelet with 10 vanishing moments to analyse both series and the same bin width of size 128 for the running mean smoother, but, as discussed, we use a different matrix to correct for bias. In Figure 6.2 left, we see our analysis of the original series. The top plot is unscaled, while the bottom plot is individually scaled at each level. On the right, we see (our version of) the original analysis of Nason et al. (2000), again with the top plot unscaled and the bottom plot scaled individually. In line with previous discussion, only the finest 7 scales were used for analysis. From the top plots, we can see the effect that differencing has had on the resulting spectral estimate. In our estimate based on the original time series, we see that the power is spread fairly evenly across scales.
However, in the spectral estimate of the differenced series, power is concentrated in the finest scales. From the bottom plots, we see that both estimates exhibit similar overall behaviour, with both evolving over time in a similar fashion.
Due to the difference in magnitude between the estimated spectra at different scales, some features of the original series can be identified more easily in our analysis. In Nason et al. (2000), an accompanying data set, the sleep state of the observed infant, is also used in the analysis. The sleep state is judged by a trained human observer, and is recorded as either quiet (1), between quiet and active (2), active (3), or awake (4). A strong association between sleep state and spectral value is observed for scales −1 to −5 in Nason et al. (2000). However, using the original series instead of the differenced series can allow this association to become more apparent: in particular, consider the spectrum estimate at scale −4, enlarged in Figure 6.3. The solid line shows the spectral estimate of the original series, the dashed line shows the spectral estimate for the differenced series, and the dotted line shows the sleep state of the infant. Our estimate correlates strongly with the sleep state, especially in highlighting periods of being awake (sleep state 4). In general, the "signal" of variability appears stronger in our analysis, since differencing lowers the level of autocorrelation within the series.
Furthermore, our estimate contains fewer negative spectral values, perhaps suggesting that the original time series is better represented as an LSW process as opposed to the first-differenced series.
We perform TI wavelet thresholding using the Daubechies Least Asymmetric wavelet with 4 vanishing moments to estimate the trend of the Baby ECG series. We analyse the finest 7 scales of the series, using a hard universal threshold of $\hat{\sigma}_{r,s}\sqrt{2\log(2048)}$, where $\hat{\sigma}_{r,s}$ is calculated using the spectral estimate in Figure 6.2 left. The trend estimate is shown in Figure 6.4 as the solid line. We see that in general the estimate is quite smooth, with more rapid changes in mean occurring at approximately 23:00, 03:00, and 06:00, corresponding to changes in sleep state. The estimate again correlates with the underlying sleep state, highlighting the benefit of performing a joint first and second-order analysis of the data.

A.1 Spectra and Trend Functions Used
In this section we provide additional information for the discussion in Section 5 of the main text.

A.2 Results for Daubechies LA10 Wavelet LSW Processes
In this section we report the results of the simulations when the Daubechies Least Asymmetric wavelet with 10 vanishing moments (LA10) is used to simulate the LSW processes. Table 5 reports the mean squared error comparison for the averaged spectral estimates, where the LSW process has been generated using the LA10 wavelet. The estimation error for spectrum 1 is almost identical to the "None" case when using our methodology with a trend present, while for spectra 2 and 3 there is only a marginal increase. The results show that the method performs well irrespective of the wavelet that generates the LSW process. Table 6 reports the average mean squared error and standard deviation for the trend estimate where the 100 LSW processes have been generated using Gaussian innovations and the LA10 wavelet. Lastly, Table 7 reports the average mean squared error and standard deviation for the trend estimate where the 100 LSW processes have been generated using exponential innovations and the LA10 wavelet. For both cases, a hard threshold of $\hat{\sigma}_{r,s}\sqrt{2\log(1024)}$ was used.
Table 7: Average mean squared error and standard deviation (in brackets) of the trend estimate over 100 realisations generated using exponential innovations and the Daubechies LA10 wavelet.

A.3 Effect of Over-Differencing
In the main text, we recommended using a first difference to remove the trend. Here, we compare the estimation performance of the spectral estimate computed using first- and second-differences, to highlight the effect of over-differencing the time series. We perform the same simulations as in Section 5.1 of the main text, and compute the averaged spectral estimate using both first- and second-differences. To compare the two results, we compute the relative mean squared error (RMSE) of the first-differenced estimate and second-differenced estimate in each of the trend and spectrum scenarios. To control for boundary effects, we only compute the RMSE on the non-boundary wavelet coefficients. The results are reported in Table 8. We see that over-differencing yields a higher MSE for spectral estimation: the RMSE is above 1 in all scenarios, showing that the first-difference estimator achieves a uniformly stronger performance.

A.4 Further Trend Estimation Simulations
Next, we assess the performance of the trend estimation procedure in the presence of non-LSW-type error structures, to highlight the versatility of our methodology. In particular, we simulate 100 realisations of time series using the previously defined trends, with (Gaussian) errors simulated from the following models: (A) Time-varying AR(2) model with parameters $\phi_1(z) = 0.8\cos(1.5 - \cos(4\pi z))$, $\phi_2(z) = -0.2 + 0.4z$, $z \in (0, 1)$.
Table 9: Average mean squared error and standard deviation (in brackets) of the trend estimate over 100 realisations generated from Models A-D.
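Model A can be simulated directly by evaluating the autoregressive parameters on rescaled time; the sketch below is our own minimal implementation, with the series length, seed, and zero initial conditions as arbitrary choices.

```python
import numpy as np

def tvar2(T, rng):
    """Simulate time-varying AR(2) errors as in Model A, on rescaled time z = t/T."""
    x = np.zeros(T)
    e = rng.standard_normal(T)
    for t in range(2, T):
        z = t / T
        phi1 = 0.8 * np.cos(1.5 - np.cos(4 * np.pi * z))  # phi_1(z)
        phi2 = -0.2 + 0.4 * z                             # phi_2(z)
        x[t] = phi1 * x[t - 1] + phi2 * x[t - 2] + e[t]
    return x

eps = tvar2(1024, np.random.default_rng(0))
```

The parameter curves stay inside the AR(2) stationarity triangle for all z, so each local model is stable even though the process as a whole is nonstationary.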
process studied in von Sachs and MacGibbon (2000). Model C represents a piecewise stationary AR(1) process with a single changepoint in the autoregressive parameter. The results of these simulations are given in Table 9, again with bold values showing the lowest reported mean squared error. We see that our methodology is able to perform well across the four scenarios, working well in both stationary and nonstationary second-order settings. It outperforms the other three methods in all of the trend and error scenarios. Overall, we see that our proposed methodology offers strong practical performance across a variety of time series models.

A.5 Choice of α and β Parameters
In this section we investigate the effect that the choice of the parameters α and β has on estimation quality, in order to inform the best choice of the parameters in practice. Recall that β determines the proportion of scales used when estimating the spectrum, whilst α determines the proportion of scales used when estimating the LACV.
Firstly, the choice of β was investigated. We simulated 1000 realisations of LSW processes for three wavelets: the Haar, Daubechies EP4, and Daubechies EP10 wavelets. For each realisation the spectrum was given by $S_j = U_j 2^{\gamma j}$, where $U_j \sim \mathrm{Uniform}(0, 5)$ and $\gamma > 0$ is varied.
This provides a more complete picture of the impact of the choice of β than only using the three previously defined spectra. The decaying spectral structure is used in order to control for the fact that estimation of the EWS is more difficult in coarse scales, and mirrors the assumption in Proposition 3.6 in the main text. In addition, we use a global mean to smooth the raw wavelet periodogram at each level, in order to focus on the effect of β, rather than the bin width parameter.
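The random decaying spectra described above can be generated as in the following sketch; the scale-indexing convention, with j = −1 the finest scale, follows the LSW setup, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
J, gamma = 10, 1 / 4                 # T = 2**J = 1024
j = -np.arange(1, J + 1)             # scales j = -1 (finest) down to -J (coarsest)
U = rng.uniform(0, 5, size=J)        # U_j ~ Uniform(0, 5)
S = U * 2.0 ** (gamma * j)           # S_j = U_j * 2**(gamma * j)
```

Since j is negative, the factor $2^{\gamma j}$ shrinks towards the coarse scales, giving the decaying spectral structure used to control for the difficulty of coarse-scale estimation.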
We calculated the smoothed EWS estimate, while varying the number of scales used to perform EWS correction, for $\gamma = 1/2, 1/4, 1/8$ and time series lengths $T = 512, 1024, 2048$. We found that, in general, the higher the number of vanishing moments, the higher the optimal choice of β. Across all 27 scenarios, the mean value of β that produces the lowest average mean squared error is 0.721. This motivates setting $J_1 = \frac{7}{10}\log_2(T)$ as a flexible rule-of-thumb. In practice, different values of β can also be compared visually to assess the fit.
Next, given the above choice of $J_1 = \frac{7}{10}\log_2(T)$, we perform the same set of simulations as above, this time varying α, the proportion of scales used to estimate the LACV function.

A.6 Canadian Wave Height Data
Here we examine publicly available wave height data for a location in the North Atlantic. For the spectral estimate, we use the Daubechies Least Asymmetric wavelet with 10 vanishing moments. For simplicity, the periodogram is smoothed using a running mean with bin width 512, corresponding to a time span of roughly three weeks. The estimate is shown in Figure A.4, where each level in the plot is scaled individually for clarity. We can see strong nonstationarity in the spectrum, with more variability during the winter months and less in the summer, as found in Killick et al. (2012) using a changepoint approach.
To estimate the trend of the series, we perform TI wavelet thresholding using the Daubechies Least Asymmetric wavelet with 4 vanishing moments. In line with previous discussion, we analyse the finest 9 scales of the series. We use a soft universal threshold of $\hat{\sigma}_{r,s}\sqrt{2\log(8192)}$ to account for likely non-Gaussianity, where $\hat{\sigma}_{r,s}$ is calculated using the spectral estimate in Figure A.4. The trend estimate is shown as the solid line in Figure A.5. We see that the estimated trend function is relatively smooth with occasional sharp changes, with the mean wave height larger during the winter and smaller in the summer. There are several locations where the wavelet coefficients were perhaps not thresholded correctly, leading to the sharp changes. Using a scale-dependent smoothing of the raw wavelet periodogram is one possible way to remedy this.
Using the spectral estimate in Figure A.4, we calculated the local autocovariance estimate of the series. In Figure A.6, the solid line shows the estimate obtained using our methodology for the local variance function of the time series. The nonstationary nature of the variance is clear to see, with the summer months containing low variability, while winter months display much higher variability. As one way of assessing the performance of our method, we can compare the variance estimate with the estimate obtained by using the detrended wave height data, obtained by subtracting the wavelet thresholding trend estimate from the data. We then perform standard LSW inference on this series, using the same parameters as in our approach. The dashed line shows the local variance estimate obtained in this way. We see that the two estimates agree, which is reassuring on two counts: first, that our spectral estimate obtained using the differenced data is accurate, and secondly, that the trend estimate obtained was accurate. We see that in general the series exhibits strong autocorrelation, which we may expect due to the observations being recorded at hourly intervals. Furthermore, we note that the shape of the autocorrelation function changes across the four months plotted.

B.1 Proof of Proposition 3.2
The matrix $D^1_J$ is symmetric and positive semi-definite, since it can be expressed as a Gram matrix of vector inner products:
where $\Psi_j$ is the autocorrelation wavelet vector at scale $j$, and $\nabla$ is the first-differencing matrix with diagonal entries equal to 1, entries above the main diagonal equal to $-1$, and all other entries equal to zero. The matrix $\nabla$ is invertible, with inverse given by the upper triangular matrix with all non-zero entries equal to 1. By Theorem 1 of Nason et al. (2000), the family of vectors $\{\Psi_j\}_{j=-J}^{-1}$ is linearly independent. This implies that $\{\nabla\Psi_j\}_{j=-J}^{-1}$ is also linearly independent, since this family is the image of a family of linearly independent vectors under an invertible matrix transform. Therefore, $D^1_J$ is invertible, since it is the Gram matrix of a set of linearly independent vectors.
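The two facts used in this proof (the stated inverse of the differencing matrix, and positive definiteness of a Gram matrix built from linearly independent vectors) can be checked numerically; the dimension is arbitrary, and random vectors stand in for the autocorrelation wavelets.

```python
import numpy as np

J = 6
# First-differencing matrix: 1 on the diagonal, -1 just above it, zeros elsewhere
nabla = np.eye(J) - np.diag(np.ones(J - 1), k=1)
# Claimed inverse: upper triangular with all non-zero entries equal to 1
nabla_inv = np.triu(np.ones((J, J)))

# Gram matrix of a linearly independent family transformed by the invertible nabla
rng = np.random.default_rng(0)
V = rng.standard_normal((J, J))       # columns: stand-ins for the Psi_j vectors
G = (nabla @ V).T @ (nabla @ V)       # Gram matrix, analogous to D^1_J
min_eig = np.linalg.eigvalsh(G).min() # positive iff the transformed family is independent
```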

B.2 Proof of Proposition 3.3
The expectation is given by where the remainder term can come out of the inner bracket since, for fixed j, the sum is finite as the wavelet is compactly supported. Now we evaluate each expectation individually. The term I is simply the expectation of the raw wavelet periodogram of the original LSW model. Hence, Finally, substituting r = u − v, we obtain Hence, we obtain For the variance part, we follow the same argument as in the proof of Proposition 4 of Nason et al. (2000). The wavelet coefficients of the differenced series are asymptotically Gaussian. Hence, the wavelet periodograms are asymptotically scaled χ 2 -distributed, and hence the variance is asymptotically proportional to the expectation squared.

B.3 Proof of Theorem 3.4
(Proof for the Haar wavelet). The proof follows the same strategy as that of the proof of Theorem 2 in Nason et al. (2000). We show that there exists $\delta > 0$ such that $\lambda_{\min}(P) \ge \delta$, by using the following property from Toeplitz matrix theory. Let $T$ be Toeplitz (and Hermitian) with elements $\{t_0, t_1, \ldots\}$. Let $f(z) = \sum_{n=-\infty}^{\infty} t_n z^n$ for $z \in \mathbb{C}$ be the symbol of the operator associated with $T$. If $\sum_n |t_n| < \infty$, then $\lambda_{\min}(T) \ge \min_{|z|=1} f(z)$ (Reichel and Trefethen (1992), Theorem 3.1 (i)). For ease of notation, indices will now run from 1 to ∞ instead of −1 to −∞. Using straightforward algebra, we can derive explicit formulae for the entries of $P$, which are given by the following lemma.
Lemma B.1. In the case of the Haar wavelet, the elements of the matrix $P$ are given by $P_{jj} = 10$ and $P_{j,j+m} = 6 \times 2^{-m/2}$ for $m > 0$.
We must therefore calculate the elements of $A^1$, from which we can obtain the elements of $D^1$, and hence $P$. Note that replacing the $(\tau - 1)$ terms with $(\tau + 1)$ results in an equivalent definition of the operators $A^1$ and $D^1$. When $l = j$, we have that
Hence $D^1_{jj} = 2A_{jj} - 2A^1_{jj} = 10 \times 2^{-j}$, and therefore $P_{jj} = 10$. When $j \ne l$, without loss of generality assume $l > j$. We have that
since the value of $|\tau + 1|$ is always less than or equal to 1/2, and so we use the $1 - 3|\tau|$ part of the autocorrelation wavelet formula. The first sum is equal to
The second sum is given by
From this, we finally obtain that
Therefore, we have that $D^1_{jl} = 2A_{jl} - 2A^1_{jl} = 6 \times 2^{-l}$, from which the required form for $P$ follows.
Returning to the main proof, we have that $P$ is symmetric; the formula above refers to the upper triangular portion of the matrix. Now, $P$ is a Toeplitz matrix with $p_0 = 10$ and $p_m = 6 \times 2^{-m/2}$, so we can show that $\lambda_{\min}(P) \ge \delta > 0$. Substituting into the formula for the symbol of the symmetric Toeplitz matrix $P$, we obtain
$$\min_{|z|=1} f(z) = \min_{|z|=1}\left\{10 + 2\,\mathrm{Re}\sum_{n=1}^{\infty} 6 \times 2^{-n/2} z^n\right\} = 10 + 12 \min_{|z|=1} \mathrm{Re}\left\{\frac{2^{-1/2}z}{1 - 2^{-1/2}z}\right\}.$$
This function is strictly monotonically increasing on $-1 \le \mathrm{Re}(z) \le 1$, and therefore the minimum is attained at $z = -1$, giving $\min_{|z|=1} f(z) = 10 - 12(\sqrt{2} - 1) \approx 5.03 > 0$, as required.
(Proof for the Shannon wavelet). Note that the indices now run over the negative integers. We can compute the entries of $P$ using a variant of the Fourier domain formula shown in Equation (B.1) below, which is a consequence of Parseval's relation.
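The eigenvalue bound can be checked numerically on a finite section of $P$, built from $p_0 = 10$ and $p_m = 6 \times 2^{-m/2}$; the truncation size is an arbitrary choice, and the value $10 - 12(\sqrt{2} - 1)$ is the minimum of the symbol computed above.

```python
import numpy as np

n = 12                                    # truncation size (arbitrary)
first = np.array([10.0] + [6 * 2 ** (-m / 2) for m in range(1, n)])
P = np.array([[first[abs(i - k)] for k in range(n)] for i in range(n)])  # Toeplitz section
lam_min = np.linalg.eigvalsh(P).min()
delta = 10 - 12 * (np.sqrt(2) - 1)        # lower bound from the symbol, about 5.03
```

The smallest eigenvalue of every finite Hermitian Toeplitz section is bounded below by the minimum of the symbol, so `lam_min` stays above `delta` for any truncation size.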
where $\hat\Psi_j(\omega)$ denotes the Fourier transform of $\Psi_j(\tau)$, which is equal to the squared modulus of the Fourier transform of the non-decimated wavelet coefficients, $\hat\psi_j(\omega)$. Explicitly,
where $m_0(\omega) = 2^{-1/2}\sum_k h_k \exp(-i\omega k)$ is the transfer function; $\{h_k\}$ is the low-pass quadrature mirror filter, with $\sum_k h_k^2 = 1$ and $\sum_k h_k = 2^{1/2}$; and $|m_1(\omega)|^2 = 1 - |m_0(\omega)|^2$. The formula for the Fourier transform of the non-decimated wavelets given in Equation (B.2), and the corresponding formulae for $m_0(\omega)$ and $m_1(\omega)$ for the Shannon wavelet, can be found using the Fourier transforms of the continuous-time mother and father wavelets, which are given in Chui (1997), pages 46 and 64. Define the set $C_j$, for $j < 0$, to be
As in Nason et al. (2000), the Fourier transform of the non-decimated Shannon wavelets is given by $\hat\psi_j(\omega) = -2^{-j/2}\exp(-2^{-j-1}i\omega)I_{C_j}(\omega)$, where $I_{C_j}$ is the indicator function on the set $C_j$. From this, the Fourier transform of the autocorrelation wavelets can be obtained as
Using Parseval's relation and the shifting property of the Fourier transform, we have that
The first term in the sum is exactly the entries of the original A-matrix, which are given by $A_{jj} = 2^{-j}$ for $j < 0$, and $A_{jl} = 0$ for $j \ne l$. When $j \ne l$, the second term is equal to zero, since the supports of the different $\hat\Psi_j(\omega)$ do not overlap. Thus, the matrix is diagonal, and when $j = l$, the second term is given by
Now, expanding the complex exponential terms into trigonometric functions, we obtain
$$-\sum_\tau \Psi_j(\tau)\Psi_j(\tau+1) = -\frac{2^{-2j-1}i}{\pi}\Big[\cos(2^j\pi) + i\sin(2^j\pi) - \cos(2^{j+1}\pi) - i\sin(2^{j+1}\pi) + \cos(-2^{j+1}\pi) + i\sin(-2^{j+1}\pi) - \cos(-2^j\pi) - i\sin(-2^j\pi)\Big].$$
Next, approximate the diagonal terms using a Maclaurin series: substituting $\sin x = x - x^3/6 + O(x^5)$ shows that $P_{jj} \to 7\pi^2/3 \approx 23.03$ as $j \to -\infty$. Therefore, the matrix is diagonal, with all diagonal entries being uniformly bounded away from 0, with $P_{jj} \geq P_{-1,-1} = 8 + 16/\pi \approx 13.09$ for all $j$. Hence, we have shown that there exists some $\delta > 0$ such that $\lambda_{\min}(P) \geq \delta$.
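These diagonal entries can be checked numerically. The closed form below is our reconstruction of the derivation above (combining the $A$-matrix term with the lag-one term), offered as an illustrative sketch rather than part of the proof:

```python
import numpy as np

def shannon_P_diag(j):
    """Diagonal entry P_jj for the Shannon wavelet at scale j < 0, using
    P_jj = 2^{-j} * 2 * (2^{-j} + (2^{-2j}/pi) * (sin(2^j pi) - sin(2^{j+1} pi))).
    This closed form is a reconstruction of the text's derivation."""
    a_jj = 2.0 ** (-j)  # A-matrix diagonal entry
    lag1 = (2.0 ** (-2 * j) / np.pi) * (
        np.sin(2.0 ** j * np.pi) - np.sin(2.0 ** (j + 1) * np.pi)
    )
    return 2.0 ** (-j) * 2.0 * (a_jj + lag1)

vals = np.array([shannon_P_diag(j) for j in range(-1, -21, -1)])
print(vals[0])   # P_{-1,-1} = 8 + 16/pi ≈ 13.09, the minimum over j
print(vals[-1])  # approaches 7*pi^2/3 ≈ 23.03 as j -> -infinity
```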
B.4 Proof of Theorem 3.5

As $T \to \infty$, the eigenvalues of $D^1$ tend to zero and hence, when viewed as an operator acting on the sequence space $\ell^2(\mathbb{N})$, its inverse is unbounded. In the locally stationary Fourier time series setting, a similar relationship is found, as given in Equation (5) of Roueff and von Sachs (2011). Loosely speaking, the original spectrum at frequency $\omega$ is related to the differenced one through multiplication by the factor $|1 - e^{-i\omega}|^{-2}$. As $\omega \to 0$ (corresponding to low frequencies), this factor diverges. This mirrors our scenario: as the correction matrix grows in size, and we consider coarse-scale (low frequency) behaviour, the norm of the inverse matrix becomes larger.
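To see the rate of this divergence concretely (an illustrative calculation, not part of the proof): since $|1 - e^{-i\omega}|^2 = 2(1 - \cos\omega) \approx \omega^2$ for small $\omega$, the multiplier grows like $\omega^{-2}$.

```python
import numpy as np

omega = np.array([1e-1, 1e-2, 1e-3])
multiplier = np.abs(1 - np.exp(-1j * omega)) ** (-2)

# |1 - e^{-iw}|^2 = 2(1 - cos w) ~ w^2, so each factor-of-10 decrease
# in omega inflates the multiplier by roughly a factor of 100.
print(multiplier)
```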
We can account for the unboundedness of the inverse of $D^1$ by using a rescaling of the LSW process itself. Consider the auxiliary process $\tilde{\epsilon}_t = \sum_{j,k} \tilde{w}_{jk} \tilde{\psi}_{j,k-t} \xi_{jk}$, where $\tilde{w}_{jk} = 2^{j/4} w_{jk}$ and $\tilde{\psi}_{j,k-t} = 2^{-j/4} \psi_{j,k-t}$. Then, the expectation of the raw wavelet periodogram (with respect to the rescaled wavelet) can be expressed in terms of $\tilde{S}_j(k/T) = 2^{j/2} S_j(k/T)$ and $P_{jl} = 2^{-j/2} D^1_{jl} 2^{-l/2}$. We can therefore use the (bounded) inverse of $P$ to correct the smoothed, rescaled periodogram, and then multiply by $2^{-j/2}$, since $S_j(k/T) = 2^{-j/2} \tilde{S}_j(k/T)$. To determine the appropriate threshold for the wavelet-based estimator, we use the following lemma.

Lemma B.2. For a Gaussian trend LSW process and using a wavelet $\psi$ of bounded variation, the wavelet coefficients $\hat{v}^j_{rs}$, with $2^r = o(T)$, obey, uniformly in $s$, $\mathbb{E}(\hat{v}^j_{rs}) - \cdots$

Lemma B.2 is analogous to Theorem 3 of Nason et al. (2000). The result of mean square consistency follows from a combination of Equation (B.3) and Theorem 4 of Nason et al. (2000). The mean squared error of the smoothed, corrected wavelet periodogram decomposes into two terms, where $\hat{I}^l_{zT}$ is the smoothed estimate of the rescaled raw wavelet periodogram and $\Lambda_l(z) = \sum_n P_{nl} \tilde{S}_n(z)$. The first term can be bounded using the facts that $(P^{-1})_{jl}$ is bounded, $D^1_{ln} = O(2^l)$, $\sum_n S_n(z) < \infty$, and $T = 2^J$. Hence, the mean squared error is of order $O\big( 2^{-j} T^{-2/3} \log^2(T) \big)$.
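The correction step can be sketched numerically. Here the correction matrix $P$ is a stand-in (we reuse the illustrative Toeplitz values from the Haar calculation in B.3), and the "true" rescaled spectrum is hypothetical; the point is only the mechanics of inverting $P$ and undoing the rescaling by $2^{-j/2}$:

```python
import numpy as np

J = 8  # number of scales; scales are j = -1, ..., -J
# Stand-in for the rescaled correction matrix P (illustrative values only).
m = np.abs(np.subtract.outer(np.arange(J), np.arange(J)))
P = np.where(m == 0, 10.0, 6.0 * 2.0 ** (-m / 2.0))

# A hypothetical "true" rescaled spectrum S-tilde at one time point.
S_tilde = 2.0 ** (-np.arange(1, J + 1) / 2.0)

# The expected rescaled periodogram mixes scales through P ...
beta = P @ S_tilde
# ... so we invert P to correct it, then undo the rescaling:
S_tilde_hat = np.linalg.solve(P, beta)
j = -np.arange(1, J + 1)                  # scale indices j = -1, ..., -J
S_hat = 2.0 ** (-j / 2.0) * S_tilde_hat   # S_j = 2^{-j/2} * S-tilde_j
```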

B.5 Proof of Proposition 3.6
We can write the mean squared error as a sum of a leading term and a remainder $R_{J_0}$, which can be bounded appropriately. For the first term, we obtain the required bound by Equation (6) from the main text, using that $|\Psi_j(\tau)| \leq 1$ for all $j$ and $\tau$. Hence, provided that $T^{\alpha - 2/3} \log^2(T) \to 0$ as $T \to \infty$, the estimator is mean square consistent.
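The consistency condition can be illustrated numerically (a quick sketch with an illustrative choice of $\alpha < 2/3$):

```python
import numpy as np

alpha = 0.5  # illustrative; any alpha < 2/3 satisfies the condition
T = 2.0 ** np.arange(20, 61, 10)  # large dyadic sample sizes
vals = T ** (alpha - 2 / 3) * np.log(T) ** 2
# The log^2 factor delays the decay, but the power T^{alpha - 2/3}
# dominates eventually, so vals decreases towards zero.
print(vals)
```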

B.6 Proof of Proposition 3.7
In order to derive the formula for the expectation of the squared wavelet coefficients of a general $n$-th difference, we require the formula for the $n$-th difference itself, which is a well-known result.
Denote by $\nabla^n X_t$ the $n$-th difference of the time series $\{X_t\}$ at time $t$. Then,
\[ \nabla^n X_t = \sum_{k=0}^{n} (-1)^k \binom{n}{k} X_{t-k} = \sum_{k=0}^{n} (-1)^k \binom{n}{k} \epsilon_{t-k} + O(T^{-1}), \tag{B.4} \]
which follows from the differentiability assumption on the trend $\mu$. Now, observe that, for an $n$-th difference, the expectation will involve the sum of the spectrum over all scales, multiplied by a linear combination of lagged inner product $A$-matrix entries, denoted $A_\tau$, from lag $0$ to lag $n$. To calculate the coefficient in front of each of the $A_\tau$, we simply sum the coefficients of the $\{\epsilon_s\}$ products at each particular lag in the expansion obtained by squaring the $n$-th difference of the time series, for which we can use Equation (B.4).
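The binomial form of the $n$-th difference is easy to check directly (a quick sketch; `np.diff` computes repeated first differences):

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
n = 3

# n-th difference via repeated first differencing
d_repeated = np.diff(x, n=n)

# n-th difference via the binomial formula:
# (nabla^n x)_t = sum_k (-1)^k C(n, k) x_{t-k}
d_binomial = np.array([
    sum((-1) ** k * comb(n, k) * x[t - k] for k in range(n + 1))
    for t in range(n, len(x))
])
```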
For example, to calculate the coefficient of $A_0$, we add together the coefficients of the squared terms in the square of the differenced series, i.e. the coefficients of $\epsilon_t^2, \epsilon_{t-1}^2, \ldots, \epsilon_{t-n}^2$. For the coefficient of $A_1$, we add together the coefficients of the terms in the square of the difference whose indices differ by $1$, i.e. the coefficients of $\epsilon_t \epsilon_{t-1}, \epsilon_{t-1} \epsilon_{t-2}, \ldots, \epsilon_{t-n+1} \epsilon_{t-n}$, together with their transposed counterparts $\epsilon_{t-1} \epsilon_t$, and so on. In particular, if we were interested in the third difference, then the coefficient in front of $A_0$ would be $1 + 9 + 9 + 1 = 20$, and the coefficient in front of $A_1$ would be $-3 - 3 - 9 - 9 - 3 - 3 = -30$, and so on. Hence the formula for the coefficient in front of $A_0$ is given by
\[ \sum_{r=0}^{n} \binom{n}{r}^2 = \binom{2n}{n}, \]
and similarly, the formula for the coefficient in front of $A_\tau$, for $\tau \geq 1$, is given by
\[ 2(-1)^\tau \sum_{r=0}^{n-\tau} \binom{n}{r} \binom{n}{r+\tau} = 2(-1)^\tau \binom{2n}{n+\tau}, \]
where the multiplication by $2$ arises due to symmetry (for example, we must add the coefficients of both $\epsilon_{t-1} \epsilon_{t-2}$ and $\epsilon_{t-2} \epsilon_{t-1}$). The equality on the right follows from a counting argument.
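These coefficient formulas can be verified by brute-force expansion of the squared difference (an illustrative check):

```python
from math import comb

def a_coeff(n, tau):
    """Coefficient of A_tau in the expectation: expand the square of
    sum_k (-1)^k C(n, k) eps_{t-k} and collect products at lag tau."""
    c = [(-1) ** k * comb(n, k) for k in range(n + 1)]
    return sum(c[i] * c[j] for i in range(n + 1) for j in range(n + 1)
               if abs(i - j) == tau)

def closed_form(n, tau):
    """The binomial closed forms stated in the text."""
    return comb(2 * n, n) if tau == 0 else 2 * (-1) ** tau * comb(2 * n, n + tau)

print(a_coeff(3, 0), a_coeff(3, 1))  # 20 and -30, as in the third-difference example
```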
The number of ways to choose $n + \tau$ objects from $2n$ is the same as the number of ways of choosing $n - r$ objects from the first $n$ and $r + \tau$ objects from the remaining $n$, summed over $0 \leq r \leq n - \tau$; this is the Vandermonde identity, together with $\binom{n}{r} = \binom{n}{n-r}$. The expectation is accurate up to order $O(T^{-1})$ by the same argument as in the proof of Proposition 3.3. Hence, the formula for the expectation of the squared wavelet coefficients of the $n$-th differenced series follows.

B.7 Proof of Theorem 3.9

By the Daubechies characterisation of Hölder spaces, the rescaling of the LSW process, and Proposition 3.7, we obtain Equation (B.5). Hence, the mean squared error of the smoothed wavelet periodogram decomposes into two terms, where $\hat{I}^l_{zT}$ is the smoothed estimate of the raw wavelet periodogram and $\Lambda_l(z) = \sum_n P_{nl} \tilde{S}_n(z)$. The first term can be bounded using the fact that $P$ possesses a bounded inverse with exponentially decaying entries, together with Equation (B.5). The second term can be bounded in the same fashion as in the proof of Theorem 3.5, and is of order $O(2^{-j} T^{-\beta})$, which gives the stated consistency result. Similarly, the mean squared error of the LACV estimator decomposes in the same way, where $R_{J_0}$ is asymptotically negligible by the argument in Proposition 3.6. For the first term, the mean squared error rate is obtained by using Theorem 1 of von Sachs and MacGibbon (2000), in the specific case of a Lipschitz continuous trend.