Method of moments estimators for the extremal index of a stationary time series

The extremal index $\theta$, a number in the interval $[0,1]$, is known to be a measure of primary importance for analyzing the extremes of a stationary time series. New rank-based estimators for $\theta$ are proposed which rely on the construction of approximate samples from the exponential distribution with parameter $\theta$, to which the method of moments is then applied. The new estimators are analyzed both theoretically and empirically through a large-scale simulation study. In specific scenarios, in particular for time series models with $\theta \approx 1$, they are found to be superior to recent competitors from the literature.


Introduction
The statistical analysis of the extremal behavior of a stationary time series is important in many fields of application, such as hydrology, meteorology, finance or actuarial science (Beirlant et al., 2004). Such an analysis typically consists of two steps: (1) assessing the tail of the marginal law and (2) assessing the serial dependence of the extremes, that is, the tendency of extreme observations to occur in clusters. The present work is concerned with step (2). The most common and simplest mathematical object capturing the serial dependence between the extremes is the extremal index θ ∈ [0, 1]. In a suitable asymptotic framework, the extremal index can be interpreted as the reciprocal of the expected size of a cluster of extreme observations. The underlying probabilistic theory was worked out in Leadbetter (1983); Leadbetter et al. (1983); O'Brien (1987); Hsing et al. (1988); Leadbetter and Rootzén (1988).
Estimating the extremal index based on a finite stretch of observations from the time series has been studied extensively in the literature. An early overview is provided in Section 10.3.4 of Beirlant et al. (2004), where the estimators are classified into three groups: estimators based on the blocks method, the runs method or the inter-exceedance time method. Respective references are Hsing (1993); Smith and Weissman (1994); Ferro and Segers (2003); Süveges (2007); Robert (2009); Northrop (2015); Cai (2019), among many others. The proposed estimators typically depend on two or, arguably preferably, one parameter to be chosen by the statistician. The present paper is concerned with a class of method of moments estimators (based on the blocks method), which improves upon a recent estimator proposed in Northrop (2015) and analyzed theoretically in Berghaus and Bücher (2018).
Some notation and assumptions are necessary to motivate the new class of estimators. Throughout the paper, $X_1, X_2, \ldots$ denotes a stationary sequence of real-valued random variables with continuous cumulative distribution function (c.d.f.) $F$. The sequence is assumed to have an extremal index $\theta \in (0, 1]$, i.e., for any $\tau > 0$, there exists a sequence $u_b = u_b(\tau)$, $b \in \mathbb{N}$, such that
$$\lim_{b \to \infty} b \bar F(u_b) = \tau \quad \text{and} \quad \lim_{b \to \infty} \Pr(M_{1:b} \le u_b) = e^{-\theta \tau}, \qquad (1.1)$$
where $\bar F = 1 - F$ and $M_{1:b} = \max\{X_1, \ldots, X_b\}$. Next, define a sequence of standard uniform random variables by $U_s = F(X_s)$ and let
$$Y_{1:b} = -b \log F(M_{1:b}) = -b \log \max\{U_1, \ldots, U_b\}. \qquad (1.2)$$
Since $b \bar F\{F^{\leftarrow}(e^{-x/b})\} = b(1 - e^{-x/b}) \to x$ for $b \to \infty$, it follows from (1.1) that, for any $x > 0$,
$$\lim_{b \to \infty} \Pr(Y_{1:b} \le x) = 1 - e^{-\theta x}. \qquad (1.3)$$
In other words, for large block length $b$, $Y_{1:b}$ approximately follows an exponential distribution with parameter $\theta$, denoted by $\mathrm{Exp}(\theta)$ throughout. This inspired Northrop (2015) and Berghaus and Bücher (2018) to estimate $\theta$ by the maximum likelihood estimator for the exponential distribution; see Section 2 below for details on how to arrive at an observable (rank-based) approximate sample from the $\mathrm{Exp}(\theta)$-distribution based on an observed stretch of length $n$ from the time series $(X_s)_{s \in \mathbb{N}}$. The idea of transforming observations into a sample of exponentially distributed observations is actually not new within extreme value statistics: it is also, among many others, the main motivation for the Pickands estimator in multivariate extremes (Pickands, 1981; Genest and Segers, 2009). More precisely, if $(X, Y)$ is a bivariate random vector from a multivariate extreme value distribution with Pickands function $A = (A(w))_{w \in [0,1]}$, then $\xi(w) = \min\{-\log F_X(X)/(1-w), -\log F_Y(Y)/w\}$ is exponentially distributed with parameter $A(w)$. Given a sample of size $n$ from $(X, Y)$, we may replace $F_X$ and $F_Y$ by their empirical counterparts and arrive at an approximate sample of size $n$ from the $\mathrm{Exp}(A(w))$-distribution, which can, for instance, be fitted by the maximum likelihood method.
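For illustration, relation (1.3) can be checked by simulation in the simplest case of serially independent data, where θ = 1 and the relation even holds with equality for every b. A minimal sketch, assuming only NumPy (the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
b, k = 64, 5000                    # block length, number of blocks
U = rng.uniform(size=(k, b))       # i.i.d. case: U_s = F(X_s), extremal index theta = 1
M = U.max(axis=1)                  # block maxima of the uniforms
Y = -b * np.log(M)                 # Y_{1:b} = -b log F(M_{1:b})

# For i.i.d. data, P(Y > x) = P(M < e^{-x/b}) = e^{-x}, so Y is exactly Exp(1).
print(round(float(Y.mean()), 2))
```

With serial dependence, the same transform yields an approximate Exp(θ) sample, which is the basis of the estimators constructed in Section 2.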
The present paper is now motivated by the following observation: while the maximum likelihood estimator is asymptotically efficient in the ideal situation of observing an i.i.d. sample from the exponential distribution, it was shown in Genest and Segers (2009) for rank-based estimators of the Pickands function that it is in fact more efficient to consider alternative estimators based on the method of moments, such as a rank-based version of the CFG-estimator (Capéraà et al., 1997). Given that Northrop's blocks estimator is also rank-based, the main motivation of this work is to consider CFG-type estimators for the extremal index θ. Alongside, we will also investigate other moment-based estimators, among which is one that is closely connected to the madogram (Naveau et al., 2009). We will show that the new estimators may exhibit a substantially smaller asymptotic variance than Northrop's maximum likelihood estimator, in particular the CFG-type estimator in the case where θ is close to one.
The remainder of this paper is organized as follows: in Section 2, we collect some results about certain useful moments of the exponential distribution and use those to introduce the new estimators for θ. Regularity assumptions needed to prove asymptotic results are summarized and discussed in Section 3. The paper's main results are then presented in Section 4, alongside a discussion of certain aspects of the derived asymptotic variance formulas. Section 5 is about a particular time series model, for which we show that all regularity conditions imposed in Section 3 are met. The finite-sample performance of the new estimators is investigated in a Monte Carlo simulation study in Section 6. Finally, all proofs are postponed to Section 7.

Definition of estimators
Recall the definition of $Y_{1:b}$ in (1.2), where $b \in \mathbb{N}$. Similarly, let
$$Z_{1:b} = b\{1 - F(M_{1:b})\} = b\,(1 - \max\{U_1, \ldots, U_b\})$$
and note that, as $b \to \infty$ and for any $x > 0$,
$$\lim_{b \to \infty} \Pr(Z_{1:b} \le x) = 1 - e^{-\theta x} \qquad (2.1)$$
by similar arguments as for $Y_{1:b}$. The convergence relations in (1.3) and (2.1) serve as a basis for the method of moments estimators defined below. Subsequently, let $X_1, \ldots, X_n$ denote a finite stretch of observations from the stationary sequence $(X_s)_{s \in \mathbb{N}}$. In Sections 2.1 and 2.2, we start by using (1.3) and (2.1) to derive some observable, approximate samples from the $\mathrm{Exp}(\theta)$-distribution. In Section 2.3, we collect some moment equations for the exponential distribution, which will then be used to motivate new estimators for the extremal index in Section 2.4.

Two approximate Exp(θ)-samples based on disjoint blocks maxima
Divide the sample $X_1, \ldots, X_n$ into $k_n$ successive blocks of size $b_n$, and for simplicity assume that $n = b_n k_n$ (otherwise, the last block of fewer than $b_n$ observations is deleted). For $i = 1, \ldots, k_n$, let $M_{ni} = \max\{X_{(i-1)b_n+1}, \ldots, X_{ib_n}\}$ denote the maximum of the $X_s$ in the $i$th block of observations and let
$$Y_{ni} = -b_n \log F(M_{ni}), \qquad Z_{ni} = b_n\{1 - F(M_{ni})\}.$$
Due to relations (1.3) and (2.1), if the block size $b = b_n$ is sufficiently large, the (unobservable) random variables $Y_{ni}$ and $Z_{ni}$ are approximately exponentially distributed with parameter $\theta$. Observable counterparts are obtained by replacing $F$ by the (slightly adjusted) empirical c.d.f. $\hat F_n(x) = (n+1)^{-1} \sum_{s=1}^n \mathbf{1}(X_s \le x)$, giving rise to the definitions
$$\hat Y_{ni} = -b_n \log \hat F_n(M_{ni}), \qquad \hat Z_{ni} = b_n\{1 - \hat F_n(M_{ni})\}.$$
Both the samples $\mathcal{Y}^{db}_n = \{\hat Y_{ni} : i = 1, \ldots, k_n\}$ and $\mathcal{Z}^{db}_n = \{\hat Z_{ni} : i = 1, \ldots, k_n\}$ will be used later to define disjoint blocks estimators for $\theta$ (note that both samples are dependent over $i$ due to the use of $\hat F_n$, which complicates the asymptotic analysis).
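The rank-based construction above can be sketched in code (assuming NumPy; `z_disjoint` is our name, not from the paper):

```python
import numpy as np

def z_disjoint(x, b):
    """Rank-based disjoint blocks sample {Z_hat_ni}: block maxima are passed
    through the adjusted empirical c.d.f. F_n(t) = (n+1)^{-1} #{s : X_s <= t}."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // b) * b                          # drop an incomplete last block
    x = x[:n]
    M = x.reshape(-1, b).max(axis=1)               # disjoint block maxima M_ni
    Fn_at_M = np.searchsorted(np.sort(x), M, side="right") / (n + 1.0)
    return b * (1.0 - Fn_at_M)                     # Z_hat_ni = b (1 - F_n(M_ni))

# i.i.d. example (theta = 1): the sample should be approximately Exp(1)
Z = z_disjoint(np.random.default_rng(1).uniform(size=50000), 50)
```

The dependence over $i$ induced by the common empirical c.d.f. is visible here: every entry of `Z` involves the full sorted sample.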

Two approximate Exp(θ)-samples based on sliding blocks maxima
As in the previous section, let $n$ denote the sample size and $b_n$ a block length parameter (the assumption that $k_n = n/b_n \in \mathbb{N}$ is not needed; no discarding is necessary). For $t = 1, \ldots, n - b_n + 1$, let $M^{sb}_{nt} = M_{t:t+b_n-1} = \max\{X_t, \ldots, X_{t+b_n-1}\}$ denote the maximum of the $X_s$ in the block of length $b_n$ starting at observation $t$. Define
$$\hat Y^{sb}_{nt} = -b_n \log \hat F_n(M^{sb}_{nt}), \qquad \hat Z^{sb}_{nt} = b_n\{1 - \hat F_n(M^{sb}_{nt})\}.$$
By the same heuristics as before, the observable samples $\mathcal{Y}^{sb}_n = \{\hat Y^{sb}_{nt} : t = 1, \ldots, n - b_n + 1\}$ and $\mathcal{Z}^{sb}_n = \{\hat Z^{sb}_{nt} : t = 1, \ldots, n - b_n + 1\}$ are approximate samples from the exponential distribution and will be used later to define sliding blocks estimators for $\theta$ (both samples are heavily dependent over $t$ due to the use of $\hat F_n$ and the use of overlapping blocks).
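The sliding blocks analogue differs only in how the maxima are collected; a sketch assuming NumPy (`z_sliding` is our name):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def z_sliding(x, b):
    """Rank-based sliding blocks sample {Z_hat^sb_nt}: one value per window
    M^sb_t = max(X_t, ..., X_{t+b-1}), t = 1, ..., n - b + 1."""
    x = np.asarray(x, dtype=float)
    M = sliding_window_view(x, b).max(axis=1)      # all n - b + 1 sliding maxima
    Fn_at_M = np.searchsorted(np.sort(x), M, side="right") / (len(x) + 1.0)
    return b * (1.0 - Fn_at_M)                     # Z_hat^sb_nt = b (1 - F_n(M^sb_t))

# i.i.d. example (theta = 1)
Z = z_sliding(np.random.default_rng(2).uniform(size=20000), 64)
```

Consecutive windows share $b - 1$ observations, which is the heavy over-$t$ dependence mentioned above.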

Preliminaries on the exponential distribution
We collect some important moment equations, valid for an $\mathrm{Exp}(\theta)$-distributed random variable $\xi$. First,
$$\mathrm{E}[\log \xi] = -\gamma - \log \theta, \qquad \text{(CFG)}$$
where $\gamma = -\int_0^\infty \log(x)\, e^{-x}\, dx \approx 0.577$ denotes the Euler–Mascheroni constant. Equation (CFG) is the basis for motivating the CFG-estimator, see Capéraà et al. (1997); Genest and Segers (2009) and the details in Section 1. Next, note that
$$\mathrm{E}[e^{-\xi}] = \frac{\theta}{\theta + 1}, \qquad \text{(MAD)}$$
which serves as a basis for the madogram, see Naveau et al. (2009). A further choice, including (CFG) as a limit, is provided by
$$\mathrm{E}[\xi^{1/p}] = \Gamma(1 + 1/p)\, \theta^{-1/p}, \qquad \text{(Root)}$$
where $p > 0$. Given a generic approximate sample $\chi_m = \{\chi_1, \ldots, \chi_m\}$ from the $\mathrm{Exp}(\theta)$-distribution, solving the empirical versions of these equations for $\theta$ yields the moment estimators
$$\hat\theta_{\mathrm{CFG}}(\chi_m) = \exp\Bigl(-\gamma - \frac{1}{m}\sum_{i=1}^m \log \chi_i\Bigr), \quad \hat\theta_{\mathrm{MAD}}(\chi_m) = \frac{\bar\mu_m}{1 - \bar\mu_m}, \quad \hat\theta_{R,p}(\chi_m) = \Bigl\{\frac{\Gamma(1 + 1/p)}{m^{-1}\sum_{i=1}^m \chi_i^{1/p}}\Bigr\}^p,$$
where $\bar\mu_m = m^{-1}\sum_{i=1}^m e^{-\chi_i}$. It may be verified that $\lim_{p \to \infty} \hat\theta_{R,p}(\chi_m) = \hat\theta_{\mathrm{CFG}}(\chi_m)$; see also (2.2) for another relationship between the two estimators. Next, replacing $\chi_m$ by any of the four samples $\mathcal{Y}^{db}_n$, $\mathcal{Z}^{db}_n$, $\mathcal{Y}^{sb}_n$ or $\mathcal{Z}^{sb}_n$ defined in Sections 2.1 and 2.2, we finally arrive at 12 method of moments estimators for $\theta$. We use the suggestive notation $\hat\theta^{y_n}_{db,\mathrm{CFG}} = \hat\theta_{\mathrm{CFG}}(\mathcal{Y}^{db}_n)$ and $\hat\theta^{z_n}_{sb,\mathrm{MAD}} = \hat\theta_{\mathrm{MAD}}(\mathcal{Z}^{sb}_n)$ to, e.g., denote the disjoint blocks CFG-estimator based on the $\hat Y_{ni}$ and the sliding blocks madogram-estimator based on the $\hat Z^{sb}_{nt}$, respectively. Note that the four estimators of the form $\hat\theta^{y_n}_{m,R,1}$, $\hat\theta^{z_n}_{m,R,1}$, $m \in \{db, sb\}$, are the (pseudo) maximum likelihood (PML) estimators considered in Berghaus and Bücher (2018).
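In code, the three moment maps read as follows; the moment identities are exact for the exponential distribution, while the function names are ours:

```python
import numpy as np
from math import exp, gamma

EULER_GAMMA = 0.5772156649015329

def theta_cfg(chi):
    """CFG-type estimator: solve E[log xi] = -gamma - log(theta) for theta."""
    return exp(-EULER_GAMMA - float(np.mean(np.log(chi))))

def theta_mad(chi):
    """Madogram-type estimator: solve E[exp(-xi)] = theta / (theta + 1)."""
    mu = float(np.mean(np.exp(-np.asarray(chi))))
    return mu / (1.0 - mu)

def theta_root(chi, p):
    """Root estimator: solve E[xi^(1/p)] = Gamma(1 + 1/p) theta^(-1/p);
    p = 1 recovers the PML estimator 1 / mean(chi)."""
    return (gamma(1.0 + 1.0 / p) / float(np.mean(np.asarray(chi) ** (1.0 / p)))) ** p

# sanity check on an ideal Exp(theta) sample with theta = 0.5
chi = np.random.default_rng(3).exponential(scale=2.0, size=200000)
```

On ideal exponential samples all three maps recover θ; applied to the four samples of Sections 2.1 and 2.2, they yield the 12 estimators discussed above.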

Mathematical preliminaries
Further mathematical details are necessary before we can state asymptotic results about the estimators defined in the previous section. The asymptotic framework and the conditions are mostly the same as in Section 2 of Berghaus and Bücher (2018), but will be repeated here for the sake of completeness.
The serial dependence of the time series $(X_s)_{s \in \mathbb{N}}$ will be controlled via mixing coefficients. For two sigma-fields $\mathcal{F}_1, \mathcal{F}_2$ on a probability space $(\Omega, \mathcal{F}, \Pr)$, let
$$\alpha(\mathcal{F}_1, \mathcal{F}_2) = \sup_{A \in \mathcal{F}_1,\, B \in \mathcal{F}_2} |\Pr(A \cap B) - \Pr(A)\Pr(B)|.$$
In time series extremes, one usually imposes assumptions on the decay of the mixing coefficients between sigma-fields generated by $\{X_s \mathbf{1}(X_s > F^{\leftarrow}(1-\varepsilon_n)) : s \le \ell\}$ and $\{X_s \mathbf{1}(X_s > F^{\leftarrow}(1-\varepsilon_n)) : s \ge \ell + k\}$, where $\varepsilon_n \to 0$ is some sequence, reflecting the fact that only the dependence in the tail needs to be restricted (see, e.g., Rootzén, 2009). As in Berghaus and Bücher (2018), we need a slightly stronger condition that also controls the dependence between the smallest of all block maxima. More precisely, for $-\infty \le p < q \le \infty$ and $\varepsilon \in (0, 1]$, let $\mathcal{B}^{\varepsilon}_{p:q}$ denote the sigma algebra generated by $U^{\varepsilon}_s := U_s \mathbf{1}(U_s > 1 - \varepsilon)$ with $s \in \{p, \ldots, q\}$ and define, for $\ell \ge 1$,
$$\alpha^{\varepsilon}(\ell) = \sup_{k \ge 1}\, \alpha\bigl(\mathcal{B}^{\varepsilon}_{1:k}, \mathcal{B}^{\varepsilon}_{k+\ell:\infty}\bigr).$$
In Condition 3.1(iii) below, we will impose a condition on the decay of these mixing coefficients for small values of $\varepsilon$. Note that the coefficients are bounded by the standard alpha-mixing coefficients of the sequence $(U_s)_s$, which can be retrieved for $\varepsilon = 1$. The extremes of a time series may be conveniently described by the point process of normalized exceedances. The latter is defined, for a Borel set $A \subset E := (0, 1]$ and a number $x \in [0, \infty)$, by
$$N^{(x)}_n(A) = \sum_{s=1}^n \mathbf{1}(s/n \in A,\; U_s > 1 - x/n).$$
Note that $N^{(x)}_n(E) = 0$ iff $\max\{U_1, \ldots, U_n\} \le 1 - x/n$; the probability of that event converges to $e^{-\theta x}$ under the assumption of the existence of the extremal index $\theta$.
Fix $m \ge 1$ and $x_1 > \cdots > x_m > 0$. For $1 \le p < q \le n$, let $\mathcal{F}^{(x_1,\ldots,x_m)}_{p:q,n}$ denote the sigma-algebra generated by the events $\{U_s > 1 - x_j/n\}$, $s \in \{p, \ldots, q\}$, $j \in \{1, \ldots, m\}$, and let
$$\alpha_{n,\ell}(x_1, \ldots, x_m) = \sup\bigl\{|\Pr(A \cap B) - \Pr(A)\Pr(B)| : A \in \mathcal{F}^{(x_1,\ldots,x_m)}_{1:p,n},\; B \in \mathcal{F}^{(x_1,\ldots,x_m)}_{p+\ell:n,n},\; 1 \le p \le n - \ell\bigr\}.$$
The condition $\Delta_n(\{u_n(x_j)\}_{1 \le j \le m})$ is said to hold if there exists a sequence $(\ell_n)_n$ with $\ell_n = o(n)$ such that $\alpha_{n,\ell_n}(x_1, \ldots, x_m) = o(1)$ as $n \to \infty$. A sequence $(q_n)_n$ with $q_n = o(n)$ is said to be $\Delta_n(\{u_n(x_j)\}_{1 \le j \le m})$-separating if there exists a sequence $(\ell_n)_n$ with $\ell_n = o(q_n)$ such that $n q_n^{-1} \alpha_{n,\ell_n}(x_1, \ldots, x_m) = o(1)$ as $n \to \infty$. If $\Delta_n(\{u_n(x_j)\}_{1 \le j \le m})$ is met, then such a sequence always exists; simply take $q_n = \lceil \max\{n \alpha_{n,\ell_n}^{1/2}, (n \ell_n)^{1/2}\} \rceil$. By Theorems 4.1 and 4.2 in Hsing et al. (1988), if the extremal index exists and the $\Delta(u_n(x))$-condition is met ($m = 1$), then a necessary and sufficient condition for weak convergence of $N^{(x)}_n$ is convergence of the conditional distribution of $N^{(x)}_n(B_n)$ with $B_n = (0, q_n/n]$, given that there is at least one exceedance of $1 - x/n$ in $\{1, \ldots, q_n\}$, to a probability distribution $\pi$ on $\mathbb{N}$, that is,
$$\lim_{n \to \infty} \Pr\bigl(N^{(x)}_n(B_n) = j \mid N^{(x)}_n(B_n) > 0\bigr) = \pi(j), \qquad j \in \mathbb{N},$$
where $q_n$ is some $\Delta(u_n(x))$-separating sequence. Moreover, in that case, the convergence in the last display holds for any $\Delta(u_n(x))$-separating sequence $q_n$. If the $\Delta(u_n(x))$-condition holds for any $x > 0$, then $\pi$ does not depend on $x$ (Hsing et al., 1988, Theorem 5.1). A multivariate version of the latter results is stated in Perfekt (1994); see also the summary in Robert (2009), page 278, and the thesis Hsing (1984). Suppose that the extremal index exists and that the $\Delta(u_n(x_1), u_n(x_2))$-condition is met for any $x_1 \ge x_2 \ge 0$, $x_1 \ne 0$. Moreover, assume that there exists a family of probability measures $\pi_2$ on $\mathbb{N}^2$ such that
$$\lim_{n \to \infty} \Pr\bigl(N^{(x_1)}_n(B_n) = i,\, N^{(x_2)}_n(B_n) = j \mid N^{(x_1)}_n(B_n) > 0\bigr) = \pi_2(i, j),$$
where $q_n$ is some $\Delta(u_n(x_1), u_n(x_2))$-separating sequence. In that case, the two-level point process $N^{(x_1,x_2)}_n = (N^{(x_1)}_n, N^{(x_2)}_n)$ converges in distribution to a point process with characterizing Laplace transform explicitly stated in Robert (2009); note that, for $x_2 = 0$, one has $\pi_2(i, j) = \pi(i)\mathbf{1}(j = 0)$.
The following set of conditions will be imposed to establish asymptotic normality of the estimators.
Condition 3.4 (Technical Condition for the CFG-type estimator).
(i) For some $q > 1/2$, we have $b_n = O(k_n^q)$ as $n \to \infty$. (ii) For some $\tau \in (0, 1/2)$, we have, as $n \to \infty$, weak convergence of the tail empirical process $e_n$ defined in (3.1) to a centered Gaussian process $e$ with continuous sample paths and covariance as given in Lemma 7.3. (iii) For any $c > 0$, a corresponding negligibility condition holds as $n \to \infty$. (iv) For any $c > 0$ and some $\mu \in (1/2, 1/\{2(1-\tau)\})$ with $\tau$ from (ii), a corresponding lower-tail condition holds as $n \to \infty$. The items of Condition 3.1 are the same as Condition 2.1(i)-(v) and (2.2) in Berghaus and Bücher (2018) and are discussed in great detail in that reference. Condition 3.2 is needed for uniform integrability of the sequences $Z_{n1}^{2/p}$ and $\log^2 Z_{n1}$, respectively. It implies $\lim_{n\to\infty} \mathrm{Var}(Z_{n1}^{1/p}) = \mathrm{Var}(\xi^{1/p})$ and $\lim_{n\to\infty} \mathrm{Var}(\log Z_{n1}) = \mathrm{Var}(\log \xi)$, respectively, where $\xi$ denotes an exponentially distributed random variable with parameter $\theta$. Condition 3.3 is a bias condition requiring the approximation of the first moment of $f(Z_{n1})$ by $\mathrm{E}[f(\xi)]$ to be sufficiently accurate, where $f(x) \in \{x^{1/p}, \exp(-x), \log x\}$. Condition 3.4 is a technical condition which is only needed for deriving the asymptotics of the CFG-estimator. Condition 3.4(i) requires $b_n$ to be not too large. Sufficient conditions for Condition 3.4(ii) in terms of beta mixing coefficients can be found in Drees (2000). A sufficient condition for Condition 3.4(iii) is, for instance, strong mixing with polynomial rate $\alpha_1(n) = O(n^{-(1+\sqrt{2})-\varepsilon})$, $n \to \infty$, for some $\varepsilon > 0$, together with Condition 3.4(i) being met with $q < 1/(\sqrt{2}-1) \approx 2.41$. Indeed, for any $x \ge c$ and $\eta > 0$, one can bound the relevant expression accordingly. By Theorem 2.2 in Shao and Yu (1996), we have $\sup_{x \ge 0} |U_{n,\eta}(1 - x/b_n)| = O_{\Pr}(1)$ for all $\eta \le 1 - 2^{-1/2} \approx 0.29$. Hence, by Condition 3.4(i), the expression on the right-hand side is $o_{\Pr}(1)$ if we choose $\eta \in (1/2 - 1/\{2q\}, 1 - 2^{-1/2}]$; note that the latter interval is non-empty since $q < 1/(\sqrt{2}-1)$.
Finally, Condition 3.4(iv) is another technical condition requiring the approximation of the law of $Z_{n1}$ by the exponential distribution to be sufficiently accurate in the lower tail.

Asymptotic Results
We present asymptotic results for all estimators defined in Section 2. For simplicity, all results are stated and proved for the $\hat Z_{ni}$-versions only. As in Theorem 3.1 in Berghaus and Bücher (2018), it may be verified that the respective versions based on $\hat Y_{ni}$ show the same asymptotic behavior as the $\hat Z_{ni}$-versions. Theorem 4.1. Under Conditions 3.1, 3.2(i), 3.3(i) and 3.4, we have, for $m \in \{db, sb\}$ and as $n \to \infty$,
$$\sqrt{k_n}\,\bigl(\hat\theta^{z_n}_{m,\mathrm{CFG}} - \theta\bigr) \rightsquigarrow \mathcal{N}\bigl(0, \sigma^2_{m,\mathrm{C}}\bigr),$$
where the limiting variances $\sigma^2_{db,\mathrm{C}}$ and $\sigma^2_{sb,\mathrm{C}}$ are given explicitly in terms of $\theta$ and the cluster size distributions $\pi$ and $\pi_2$. Theorem 4.2 provides the analogous statement for the Madogram-estimator: for $m \in \{db, sb\}$ and as $n \to \infty$,
$$\sqrt{k_n}\,\bigl(\hat\theta^{z_n}_{m,\mathrm{MAD}} - \theta\bigr) \rightsquigarrow \mathcal{N}\bigl(0, \sigma^2_{m,\mathrm{M}}\bigr).$$
Theorem 4.3. Fix $p > 0$. Under Conditions 3.1, 3.2(ii) and 3.3(iii), for $m \in \{db, sb\}$ and as $n \to \infty$,
$$\sqrt{k_n}\,\bigl(\hat\theta^{z_n}_{m,R,p} - \theta\bigr) \rightsquigarrow \mathcal{N}\bigl(0, \sigma^2_{m,p}\bigr).$$
The proofs are provided in Section 7 and bear some similarities with that of Theorem 3.2 in Berghaus and Bücher (2018). In particular, they rely on the delta method, Wichura's theorem and empirical process theory to adequately handle the asymptotic contribution of the rank transformation. The most sophisticated proof is that of Theorem 4.1, which is essentially due to the fact that $\mathrm{E}[\log \xi] = \int_0^\infty \log(t)\,\theta e^{-\theta t}\,dt$ is an improper integral both at zero and at infinity (see also Genest and Segers, 2009, for similar technical difficulties with the CFG-estimator for the Pickands dependence function in multivariate extremes).
It is worth mentioning that the difference $(\sigma^2_{db,\mathrm{C}} - \sigma^2_{sb,\mathrm{C}})/\theta^2$ is a universal constant, independent of any properties of the observed time series. The same holds true for the Root-estimator, with a constant depending in a complicated way on the parameter $p$ (the graph of $p \mapsto (\sigma^2_{db,p} - \sigma^2_{sb,p})/\theta^2$ is depicted in Figure 1, with a value of approximately 0.2274 for the PML-estimator). For the Madogram-estimator, this difference depends on $\theta$ (see Figure 1 for the graph of $\theta \mapsto (\sigma^2_{db,\mathrm{M}} - \sigma^2_{sb,\mathrm{M}})/\theta^2$); it is non-negative and decreasing, with value $1/12 \approx 0.083$ for $\theta \to 0$ and approximately 0.0079 for $\theta = 1$. In that regard, the use of sliding blocks over disjoint blocks is least beneficial for the Madogram-estimator. Example 4.4. In the case that the time series is serially independent, the cluster size distributions take an explicit form, with $\pi(i) = \mathbf{1}(i = 1)$. It can be seen that these formulas hold true whenever $\theta = 1$. Consequently, the limiting variances in Theorems 4.1 and 4.2 can be computed explicitly. It is remarkable that the asymptotic variances are substantially smaller than those of the maximum likelihood estimator, see Example 3.1 in Berghaus and Bücher (2018), which are equal to 1/2 and 0.2726 for the disjoint and sliding blocks versions, respectively. The limiting variance in the case of the Root-estimator can be computed analogously.

It can further be shown that $\lim_{p \to \infty} \sigma^2_{m,p} = \sigma^2_{m,\mathrm{C}}$ for $m \in \{db, sb\}$. Remark 4.5. Instead of working with $\hat F_n$ in the definition of $\hat Z_{ni}$, one may work with a slightly modified version of the empirical c.d.f. This modification has been motivated as a bias reduction scheme in Northrop (2015). Some simple calculations show that, for instance for the CFG-estimator, the modification is asymptotically negligible. It is, however, beneficial in finite-sample situations, whence it has been applied throughout the finite-sample situations considered in Section 6. Obviously, similar adaptations can be applied to the sliding blocks version and the other moment-based estimators.
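The disjoint-versus-sliding variance comparison of this section can also be seen in a small Monte Carlo experiment; the following sketch (our code, assuming NumPy) contrasts the empirical variances of the two PML variants ($p = 1$) in the i.i.d. case, where $\theta = 1$:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pml_disjoint(x, b):
    """Disjoint blocks PML estimate 1 / mean(Z_hat_ni)."""
    n = (len(x) // b) * b
    M = np.asarray(x[:n]).reshape(-1, b).max(axis=1)
    Z = b * (1.0 - np.searchsorted(np.sort(x), M, side="right") / (len(x) + 1.0))
    return 1.0 / Z.mean()

def pml_sliding(x, b):
    """Sliding blocks PML estimate based on all n - b + 1 overlapping blocks."""
    M = sliding_window_view(np.asarray(x), b).max(axis=1)
    Z = b * (1.0 - np.searchsorted(np.sort(x), M, side="right") / (len(x) + 1.0))
    return 1.0 / Z.mean()

rng = np.random.default_rng(7)
est_db, est_sb = [], []
for _ in range(300):                        # i.i.d. data: theta = 1
    x = rng.uniform(size=2000)
    est_db.append(pml_disjoint(x, 50))
    est_sb.append(pml_sliding(x, 50))
var_db, var_sb = float(np.var(est_db)), float(np.var(est_sb))
```

The sliding blocks variant should exhibit the smaller empirical variance, in line with the asymptotic difference of approximately $0.2274\,\theta^2/k_n$ for the PML-estimator.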

Example: max-autoregressive process
In this section, we exemplarily discuss the new estimators when applied to a max-autoregressive process, defined by the recursion
$$X_s = \max\{\alpha X_{s-1},\, (1-\alpha) Z_s\}, \qquad s \in \mathbb{Z},$$
where $\alpha \in [0, 1)$ and where $(Z_s)_{s \in \mathbb{Z}}$ is an i.i.d. sequence of Fréchet(1)-distributed random variables. A stationary solution of the above recursion is given by
$$X_s = \max_{j \ge 0} \alpha^j (1-\alpha) Z_{s-j},$$
such that the stationary solution is again Fréchet(1)-distributed. Note that a model with an arbitrary continuous stationary c.d.f. $F$ may be obtained by considering $\tilde X_s = F^{\leftarrow}\{\exp(-1/X_s)\}$ and that all subsequent results are also valid for $(\tilde X_s)_s$. We start by explicitly calculating the asymptotic variances of the estimators in Section 5.1, and then show in Section 5.2 that all regularity conditions from Section 3 are met.
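A minimal simulation sketch for this model (our code, assuming NumPy); the extremal index of the ARMAX process is known to equal $\theta = 1 - \alpha$, which a disjoint blocks PML estimate should roughly recover:

```python
import numpy as np

def armax(n, alpha, rng):
    """Simulate X_s = max(alpha X_{s-1}, (1 - alpha) Z_s) with i.i.d.
    Frechet(1) innovations Z_s; a Frechet(1) start keeps the chain stationary."""
    z = -1.0 / np.log(rng.uniform(size=n + 1))   # Frechet(1): F(x) = exp(-1/x)
    x = np.empty(n + 1)
    x[0] = z[0]
    for s in range(1, n + 1):
        x[s] = max(alpha * x[s - 1], (1.0 - alpha) * z[s])
    return x[1:]

rng = np.random.default_rng(42)
x = armax(20000, 0.5, rng)                       # extremal index theta = 0.5

# disjoint blocks PML estimate 1 / mean(Z_hat_ni) as a rough sanity check
n, b = len(x), 100
M = x[: (n // b) * b].reshape(-1, b).max(axis=1)
Z = b * (1.0 - np.searchsorted(np.sort(x), M, side="right") / (n + 1.0))
theta_hat = 1.0 / float(Z.mean())
```

The marginal law can be verified via $\Pr(X_s \le 1) = e^{-1}$, and `theta_hat` should lie in a neighborhood of $1 - \alpha = 0.5$.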

Regularity Conditions for the ARMAX-model
Recall that $X_s$ is Fréchet(1)-distributed, i.e., the stationary c.d.f. $F$ is given by $F(x) = \exp(-1/x)$ for $x > 0$. The assumptions in Condition 3.1 are satisfied as shown in Berghaus and Bücher (2018), page 2322, provided $b_n$ and $k_n$ are chosen to satisfy the conditions in Item (iii). Next, by induction,
$$\Pr(M_{1:b} \le u) = F(u)^{1 + (1-\alpha)(b-1)}, \qquad u > 0. \qquad (5.1)$$
A tedious but straightforward calculation then shows that the assumptions in Conditions 3.2 and 3.3 are met, provided $k_n/b_n^2 = o(1)$, cf. Condition 3.1(iii). Condition 3.4(i) is a condition on the choice of $b_n$ that is under the control of the statistician. Conditions 3.4(ii) and 3.4(iii) are consequences of mixing properties of $(X_s)_s$, as argued at the end of Section 3. It remains to be shown that Condition 3.4(iv) is satisfied. By (5.1) and with $\xi \sim \mathrm{Exp}(\theta)$, the required estimate follows for any $\mu > 1/2$, where the final step is due to Taylor's theorem and Condition 3.4(i).
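The induction behind (5.1) can be spelled out in a single step (our derivation; it also identifies the extremal index $\theta = 1 - \alpha$):

```latex
% Given X_{s-1} <= u, the event X_s <= u reduces to (1-\alpha) Z_s <= u,
% since \alpha X_{s-1} <= \alpha u <= u holds automatically. Hence
\begin{align*}
  \Pr(M_{1:b} \le u)
    &= \Pr\bigl(X_1 \le u,\ (1-\alpha) Z_s \le u \ \text{for } s = 2, \dots, b\bigr) \\
    &= e^{-1/u} \bigl(e^{-(1-\alpha)/u}\bigr)^{b-1}
     = F(u)^{1 + (1-\alpha)(b-1)}.
\end{align*}
% With u_b = u_b(\tau) such that b(1 - F(u_b)) \to \tau, this yields
% \Pr(M_{1:b} \le u_b) \to e^{-(1-\alpha)\tau}, i.e., \theta = 1 - \alpha in (1.1).
```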
• The Markovian Copula-model (Darsow et al., 1992): Here, $F^{\leftarrow}$ is the left-continuous quantile function of some arbitrary continuous c.d.f. $F$, $(U_s)_s$ is a stationary Markovian time series of order 1 and $C_\vartheta$ denotes the Survival Clayton Copula with parameter $\vartheta > 0$. We consider the choices $\vartheta = 0.23, 0.41, 0.68, 1.06, 1.90$ such that (approximately) $\theta = 0.2, 0.4, 0.6, 0.8, 0.95$, as in Berghaus and Bücher (2018), and fix $F$ as the standard uniform c.d.f. (the results are independent of this choice, as the estimators are rank-based). Algorithm 2 in Rémillard et al. (2012) allows one to simulate from this model.
In each case, the sample size is fixed to $n = 2^{13} = 8192$ and the block size is chosen from $b = b_n \in \{2^2, \ldots, 2^9\}$. The performance is assessed based on $N = 3000$ simulation runs each.

Comparison of the introduced estimators
We start by comparing the finite-sample properties of the proposed sliding blocks estimators $\hat\theta^x_{sb,\mathrm{CFG}}$, $\hat\theta^x_{sb,\mathrm{MAD}}$ and $\hat\theta^x_{sb,R,p}$ for $p \in \{0.5, 0.75, 1, 2, 4, 8, 16\}$ and for $x \in \{z_n, y_n\}$. Respective results for the corresponding disjoint blocks versions are omitted, as the latter are always outperformed by their sliding blocks counterparts. Subsequently, the index $sb$ is therefore omitted.
As the simulation results are mostly similar among the different models and estimators, they are only partially reported, with a particular view on highlighting interesting qualitative features. We begin with a detailed investigation of the variance, the squared bias and the mean squared error (MSE) as a function of the block size parameter $b$. For illustrative purposes, we restrict the presentation to the $z_n$-versions and the ARCH-model. The corresponding results are depicted in Figure 3 (for the CFG-, the Madogram- and three selected Root-estimators).
In general, as is to be expected from the underlying theory, the variance curves are increasing in $b$, while the squared bias curves are (mostly) decreasing in $b$, resulting in a typical U-shape for the MSE curves. The hierarchy of the estimators with regard to the considered performance measures is similar among the considered values of $\theta$. In terms of the MSE, up to an intermediate block size, the CFG- and Madogram-estimators are superior to the other estimators (especially to the PML-estimator), while for large block sizes the Madogram-estimator has a relatively high MSE, but the CFG-estimator partly remains superior. The Root-estimators are, as expected, ordered in $p$ and located between the PML- and CFG-estimators.
Next, a comparison between the $z_n$- and $y_n$-versions of the estimators is drawn in Figure 4; for illustrative purposes, attention is restricted to six different models and two estimators. Remarkably, there are many models, especially for smaller values of $\theta$, in which the MSE-curves of the $y_n$-versions lie uniformly below those of the $z_n$-versions. In the remaining models, neither version can be said to be strictly preferable. Furthermore, it is remarkable that, for $\theta$ close to one, the MSE-curves of the $y_n$-versions are often no longer U-shaped, but increasing in the block size instead. The latter behavior may be explained by the proximity to the i.i.d. case: for serially independent data we have $\Pr(M_{1:b} \le u) = F(u)^b$ for all $u$ and $b$, such that there is exact equality in relation (1.3), resulting in a vanishing bias.
Next, we investigate the dependence of the performance of the Root-estimators on the parameter $p$; recall that $p = 1$ yields the PML-estimator, while '$p = \infty$' yields the CFG-estimator. In Figure 5, the MSE-curves are depicted as a function of $p$ for various fixed block sizes and for three selected models. It can be seen that choices of $p < 1$ lead to poor behavior of the corresponding estimators. At the same time, the results do not allow us to identify an 'optimal' choice of $p \ge 1$ which is valid uniformly over all models. A similar conclusion can be drawn from Table 1, which presents, for the ARCH- and ARMAX-models and every block size $b$, the value of $p$ for which the Root-estimator attains the minimal MSE ($p = \infty$ corresponds to the CFG-estimator). One can see that most values of $p$ are represented, with $p = \infty$ appearing most often, but that there is no choice of $p$ which is optimal universally over all models.

Comparison with other estimators for the extremal index
In this section, we compare the performance of the introduced estimators with the following estimators: the bias-reduced sliding blocks estimator from Robert et al. (2009) (with a data-driven choice of the threshold as outlined in Section 7.1 of that paper), the integrated version of the blocks estimator from Robert (2009), the intervals estimator from Ferro and Segers (2003) and the ML-estimator from Süveges (2007). The parameters $\sigma$ and $\varphi$ for the Robert-estimator (cf. page 276 of Robert, 2009) are chosen as $\sigma = 0.7$ and $\varphi = 1.3$. In the case of the intervals- and Süveges-estimators, the choice of a threshold $u$ is required, which is here chosen as the $1 - 1/b_n$ empirical quantile of the observed data. With regard to our estimators, we present results for the sliding blocks, bias-reduced and $z_n$-versions, if not indicated otherwise. In Figure 6, we depict the MSE as a function of the block size $b$. For most models, the MSE-curves of the estimators from the literature are again U-shaped, due to the bias-variance tradeoff already described in Section 6.1. It can further be seen that no estimator is uniformly best in any model under consideration. The method-of-moments estimators do, however, compare quite well to the competitors.
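For reference, the intervals estimator admits a compact implementation; the following is our transcription of the formulas in Ferro and Segers (2003), not the authors' code:

```python
import numpy as np

def intervals_estimator(x, u):
    """Intervals estimator of Ferro and Segers (2003) based on the
    inter-exceedance times T_i of the threshold u; the bias-corrected
    branch is used whenever some T_i exceeds 2."""
    s = np.flatnonzero(np.asarray(x) > u)        # exceedance times
    if len(s) < 2:
        return float("nan")
    t = np.diff(s).astype(float)                 # inter-exceedance times T_1, ..., T_{N-1}
    if t.max() <= 2:
        est = 2.0 * t.sum() ** 2 / (len(t) * (t ** 2).sum())
    else:
        est = 2.0 * ((t - 1.0).sum()) ** 2 / (len(t) * ((t - 1.0) * (t - 2.0)).sum())
    return min(1.0, float(est))

# i.i.d. example (theta = 1) with the 0.95 empirical quantile as threshold
x = np.random.default_rng(11).uniform(size=20000)
theta_hat = intervals_estimator(x, np.quantile(x, 0.95))
```

In practice, the bias-corrected branch is almost always taken, since some inter-exceedance time typically exceeds 2.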
The minimum values of the MSE-curves in Figure 6 are of particular interest. Due to the large number of estimators and models under consideration (in total 26 estimators and 17 models), we try to simplify possible comparisons by the following aggregation, summarized in Table 2. Overall, the CFG-estimator and the Root-estimators for $p \in \{8, 16\}$ are the best performing estimators.

Proofs of Theorems 4.1-4.3
The proofs of Theorems 4.1-4.3 are actually quite similar in that each proof will be decomposed into a sequence of similar intermediate lemmas. Occasionally, those lemmas will be hardest to prove for Theorem 4.1 and easiest for Theorem 4.2; this is also reflected by the larger number of conditions required for the proof of Theorem 4.1. The proof of Theorem 4.3 in turn is quite similar to the one in Berghaus and Bücher (2018), and of intermediate difficulty. For the above reasons, we will carry out the proof of Theorem 4.1 in great detail, and skip parts of the technical arguments needed for Theorems 4.2 and 4.3 where possible.
All convergences are for n → ∞ if not stated otherwise.

Proof of Theorem 4.1
The following notation will be used throughout: for $\xi \sim \mathrm{Exp}(\theta)$, let $\varphi^{(\mathrm{C})}(\theta) = \mathrm{E}[\log \xi] = -\gamma - \log \theta$. Observing that $(\varphi^{(\mathrm{C})})^{-1}\{\varphi^{(\mathrm{C})}(\theta)\} = \theta$, the two assertions of the theorem are a consequence of the delta method and Propositions 7.1 and 7.2, respectively.
Proposition 7.1. Under Conditions 3.1, 3.2(i), 3.3(i) and 3.4, we have
$$\sqrt{k_n}\Bigl(\frac{1}{k_n}\sum_{i=1}^{k_n} \log \hat Z_{ni} - \varphi^{(\mathrm{C})}(\theta)\Bigr) \rightsquigarrow \mathcal{N}\bigl(0, \sigma^2_{db,\mathrm{C}}/\theta^2\bigr).$$
Proof. We may decompose the left-hand side into three terms, say $A_n + B_n + C_n$. We have $C_n = o(1)$ by Condition 3.3(i). For the treatment of $A_n$, recall the tail empirical process $e_n$ defined in (3.1). Further, let $\tilde N_{ni} = (n+1)/n \times N_{ni}$, and let $\hat H_{k_n}$ denote the empirical c.d.f. of $Z_{n1}, \ldots, Z_{nk_n}$. By Equation (7.1), we obtain an expression for $A_n$ in terms of $\hat H_{k_n}$ and the process $W_n$. Heuristically, $\hat H_{k_n}(x) \approx 1 - \exp(-\theta x)$ and $W_n(x) \approx e(x)/x$ (where $e$ denotes the limit of the tail empirical process), whence the tentative limit of $A_n$ can be identified. For a rigorous treatment of $A_n + B_n$, let $E_n$ and $E_{n,m}$ be defined accordingly and let $B$ be defined as in Lemma 7.3 below. As shown above, $A_n = E_n + o(1)$. The proposition is hence a consequence of Wichura's theorem (Billingsley, 1979, Theorem 25.5) and the following items: the assertion in (i) is proven in Lemma 7.6; the assertion in (ii) follows from the fact that $E_m + B$ is normally distributed with variance $\tau^2_m$ as specified in Lemma 7.6, and the fact that $\tau^2_m \to \sigma^2_{db,\mathrm{C}}/\theta^2$ as $m \to \infty$ by Lemma 7.7; finally, Lemma 7.8 proves (iii). Proposition 7.2. Under Conditions 3.1, 3.2(i), 3.3(i) and 3.4, the analogous statement holds for the sliding blocks version, with limiting variance $\sigma^2_{sb,\mathrm{C}}/\theta^2$. Proof. The proof is very similar to that of Proposition 7.1. Decompose the statistic as $A^{sb}_n + B^{sb}_n + C^{sb}_n$. Again, we have $C^{sb}_n = o(1)$ by Condition 3.3(i). A similar calculation as in (7.3) in the case of the disjoint blocks shows that $A^{sb}_n$ can be written in terms of the empirical c.d.f. $\hat H^{sb}_n$ of $Z^{sb}_{n1}, \ldots, Z^{sb}_{n,n-b_n+1}$. We may now treat $A^{sb}_n + B^{sb}_n$ exactly as $A_n + B_n$ in the proof of Proposition 7.1, with $E_n$, $E_{n,m}$ and Lemmas 7.6, 7.7 and 7.8 replaced by their sliding blocks counterparts and Lemmas 7.12, 7.13 and 7.14, respectively.

Auxiliary lemmas for the proof of Theorem 4.1 - Disjoint blocks
Throughout this section, we assume that Conditions 3.1, 3.2(i) and 3.3(i) are met. Lemma 7.3. For any $m \in \mathbb{N}$, the vector $(e_n(x_1), \ldots, e_n(x_m), B_n)$ converges weakly, where $(e(x_1), \ldots, e(x_m), B) \sim \mathcal{N}_{m+1}(0, \Sigma_{m+1})$. Here, $r(0, 0) = 0$ and, for $x \ge y \ge 0$ with $x \ne 0$, the covariance $r(x, y)$ is given explicitly. Lemma 7.4. For any $m \in \mathbb{N}$, we have the corresponding functional weak convergence, where $(e, B)$ is a centered Gaussian process with continuous sample paths and with covariance functional as specified in Lemma 7.3.
Lemma 7.5. For any $m \in \mathbb{N}$, we have $E_n = E_{n,m} + o_{\Pr}(1)$ on the relevant range. Lemma 7.6. For any $m \in \mathbb{N}$, we have weak convergence of $E_{n,m} + B_n$, where, with $r$ and $f$ defined as in Lemma 7.3, the limiting variance $\tau^2_m$ is given explicitly. Lemma 7.7. As $m \to \infty$, $\tau^2_m \to \sigma^2_{db,\mathrm{C}}/\theta^2$, where $\sigma^2_{db,\mathrm{C}}$ is specified in Theorem 4.1. Lemma 7.8. If, in addition to Conditions 3.1, 3.2(i) and 3.3(i), Condition 3.4 holds, then, for all $\delta > 0$, the remainder term is asymptotically negligible. Proof of Lemma 7.3. We proceed similarly as in the proof of Lemma 9.3 in Berghaus and Bücher (2018). Weak convergence of the vector $(e_n(x_1), \ldots, e_n(x_m))$ is a consequence of Theorem 4.1 in Robert (2009). For the treatment of the joint convergence with $B_n$, we only consider the case $m = 1$ and set $x_1 = x$; the general case can be treated analogously. For $i = 1, \ldots, k_n$, we decompose a block $I_i = \{(i-1)b_n + 1, \ldots, ib_n\}$ into a big block $I^+_i$ and a small block $I^-_i$, recalling $\ell_n$ from Condition 3.1(iii). It can further be shown, by the same arguments as in the proof of Lemma 9.3 in Berghaus and Bücher (2018), that $B^-_n := B_n - B^+_n = o_{\Pr}(1)$. Finally, for $\varepsilon \in (0, c_1 \wedge c_2)$, define $A^+_n = \{\min_{i=1,\ldots,k_n} N^+_{ni} > 1 - \varepsilon\}$, and note that $\Pr(A^+_n) \to 1$ by Condition 3.1(v). As a consequence of the previous three statements, it suffices to show, using the Cramér–Wold device, the joint convergence in (7.4) for any $\lambda_1, \lambda_2 \in \mathbb{R}$. Now, the left-hand side of (7.4) can be written as $k_n^{-1/2} \sum_{i=1}^{k_n} \tilde g_{i,n}$, where $\tilde g_{i,n} = g_{i,n} \mathbf{1}(Z^+_{ni} < \varepsilon b_n)$. Note that $\tilde g_{i,n}$ only depends on the block $I^+_i$ and is $\mathcal{B}^{\varepsilon}_{(i-1)b_n+1 : ib_n - \ell_n}$-measurable. In particular, the $(\tilde g_{i,n})_{i=1,\ldots,k_n}$ are each separated by a small block of length $\ell_n$. A standard argument based on characteristic functions and the assumption on alpha mixing may then be used to show that the weak limit of $k_n^{-1/2} \sum_{i=1}^{k_n} \tilde g_{i,n}$ is the same as if the $\tilde g_{i,n}$ were independent. Next, we show that Lyapunov's condition (Billingsley, 1979, Theorem 27.3) is satisfied. By Minkowski's inequality, for any $p \in (2, 2+\delta)$, $C_\infty = \sup_{n \in \mathbb{N}} \mathrm{E}[|g_{1,n}|^p] < \infty$ by Conditions 3.1(ii) and 3.2(i).
Further, by stationarity and independence, the Lyapunov ratio converges to 0, provided $\lim_{n \to \infty} \mathrm{E}[g^2_{1,n}]$ exists; hence Lyapunov's condition is met. As a consequence, $k_n^{-1/2} \sum_{i=1}^{k_n} \tilde g_{i,n}$ converges weakly to a centered normal distribution with variance $\lim_{n \to \infty} \mathrm{E}[\tilde g^2_{1,n}]$. Finally, since $\lim_{n \to \infty} \mathrm{E}[\tilde g^2_{1,n}] = \lim_{n \to \infty} \mathrm{E}[g^2_{1,n}]$, it remains to be shown that
$$\lim_{n \to \infty} \mathrm{E}[g^2_{1,n}] = \lambda_1^2\, r(x, x) + 2\lambda_1\lambda_2\, h(x) + \lambda_2^2\, \pi^2/6.$$
Since we may replace $I^+_1$ by $I_1$ and then $b_n$ by $n$, this in turn is a consequence of the convergences (7.5), (7.6) and (7.7). The assertion in (7.5) follows from Theorem 4.1 in Robert (2009). Further, since $Z_{1:n} \stackrel{d}{\to} \xi \sim \mathrm{Exp}(\theta)$ and since $|\log(Z_{1:n})|^2$ is uniformly integrable by Condition 3.2(i), we have
$$\lim_{n \to \infty} \mathrm{Var}\{\log(Z_{1:n})\} = \mathrm{Var}\{\log(\xi)\} = \frac{\pi^2}{6},$$
which is (7.7). Finally, note that $\mathrm{E}[N^{(x)}_n] = x$ and $\mathrm{E}[\log(Z_{1:n})] \to \varphi^{(\mathrm{C})}(\theta)$ by similar arguments as given above. As a consequence, (7.6) follows from $\lim_{n \to \infty} \mathrm{E}[N^{(x)}_n \log(Z_{1:n})] = h(x)$. The latter in turn can be seen as follows: first, the expected value on the right-hand side of (7.8) can be expressed via the cluster size distributions of Perfekt (1994) and Robert (2009). By uniform integrability, we obtain that the expected value on the right-hand side of (7.8) converges to the required limit. The proof is finished.
Proof of Lemma 7.4. For fixed $x > 0$, consider the function

For $z_n \to z$, one has $f_n(z_n) \to e(z)/z$. Hence, since $(e_n(x_1), \ldots, e_n(x_m), B_n) \xrightarrow{d} (e(x_1), \ldots, e(x_m), B)$ for any $x_1, \ldots, x_m > 0$ and $m \in \mathbb{N}$ by Lemma 7.3, we can apply the extended continuous mapping theorem (Theorem 18.11 in van der Vaart, 1998). This establishes the fidi-convergence needed to prove Lemma 7.4. Asymptotic tightness follows directly from the asymptotic tightness of $e_n$ and $B_n$.
Proof of Lemma 7.5. The assertion follows from Lemma 7.4, Lemma C.8 in Berghaus and Bücher (2017) and the continuous mapping theorem.
Proof of Lemma 7.6. As a consequence of Lemma 7.5, Lemma 7.4 and the continuous mapping theorem, we have

where the variance $\tau_m^2$ is given by

as asserted.
Proof of Lemma 7.7. By the definition of $\tau_m^2$ in Lemma 7.6,

Hence, applying the transformation $z = y/x$, the first summand on the right-hand side of (7.9) can be written as

For the second summand on the right-hand side of (7.9), note that

see Formula (A.7) in the proof of Lemma 9.6 in Berghaus and Bücher (2018) and Robert (2009). Therefore, we can rewrite $h$ from Lemma 7.3 as follows:

where we have used the transformation $z = e^y/x$. For $0 < x \le 1$, the first integral is zero and we obtain

while for $x > 1$,

As a consequence, writing $g(z) = E_{\xi}(z)$,

Next, some tedious calculations based on Fubini's theorem allow us to rewrite the sum of the last three double integrals as

dz.
Using the fact that g(z) dz.
Finally, one can show

such that, assembling terms and recalling the definition of $f$,

The lemma is now an immediate consequence of (7.9), (7.10) and (7.12).

Auxiliary lemmas for the proof of Theorem 4.1 - Sliding blocks
Throughout this section, we assume that Conditions 3.1, 3.2(i) and 3.3(i) are met.

Lemma 7.9. For any $m \in \mathbb{N}$, we have

where $(e(x_1), \ldots, e(x_m), B^{sb})$ is a centered Gaussian vector. Here, the functions $r$ and $f$ are defined as in Lemma 7.3.
Lemma 7.10. For any $m \in \mathbb{N}$, we have

where $(e, B^{sb})$ is a centered Gaussian process with continuous sample paths and with covariance functional as specified in Lemma 7.9.
Lemma 7.11. For any $m \in \mathbb{N}$, we have

where $E_{n,m} = \int_{1/m}^{m} W_n(x)\,\theta e^{-\theta x}\,dx$ is as in Lemma 7.5.

Lemma 7.12. For any $m \in \mathbb{N}$, we have

where, with $r$ and $f$ defined as in Lemma 7.3,

Lemma 7.13. As $m \to \infty$, $\tau_{sb,m}^2 \to \sigma^2_{sb,(C)}/\theta^2$, where $\sigma^2_{sb,(C)}$ is specified in Theorem 4.1.

Lemma 7.14. If, in addition, Condition 3.4 holds, then, for all $\delta > 0$,

Proof of Lemma 7.9. As in the proof of Lemma 7.3, we only show joint weak convergence of $(e_n(x), B^{sb}_n)$ for some fixed $x > 0$; the general case can be shown analogously. For given $\varepsilon \in (0, c_1 \wedge c_2)$, let $A_n = \{\min_{t=1,\ldots,n-b_n+1} N^{sb}_{nt} > 1 - \varepsilon\}$, such that $P(A_n) \to 1$ by Condition 3.1(v). By the Cramér-Wold device, it suffices to prove weak convergence of

for some arbitrary $\lambda_1, \lambda_2 \in \mathbb{R}$, where the negligible term stems from omitting a negligible number of summands. We are going to apply a big-block/small-block argument, based on a suitable 'blocking of blocks' to take care of the serial dependence introduced through the use of sliding blocks. For that purpose, let $k_n^* < k_n$ be an integer sequence with $k_n^* \to \infty$ and $k_n^* = o(k_n^{\delta/(2(1+\delta))})$, where $\delta$ is from Condition 3.1(ii). For $q_n^* = k_n/(k_n^* + 2)$ and $j = 1, \ldots, q_n^*$, define

Thus we have $q_n^*$ big blocks $J_j^+$ of size $k_n^* b_n$, which are separated by small blocks $J_j^-$ of size $2b_n$, just as in the construction in the proof of Lemma 10.3 in Berghaus and Bücher (2018). Consequently, we have $\lambda_1 e_n(x) + \lambda_2 B^{sb}_n = L_n^+ + L_n^- + o_P(1)$, where

for $j = 1, \ldots, q_n^*$. In the following, we show that, on the one hand, $L_n^- \mathbf{1}_{A_n} = o_P(1)$ and that, on the other hand, $L_n^+ \mathbf{1}_{A_n}$ converges to the claimed normal distribution.

First, we cover $L_n^- \mathbf{1}_{A_n}$. As in the proof of Lemma 7.3, we have

We proceed by showing that $\mathrm{Var}[L_n^-] = o(1)$. By stationarity, one has

for which it suffices to show that $\|W^{\varepsilon-}_{n1}\|_p = o(1)$ for some $p \in (2, 2+\delta)$. By Minkowski's inequality, one has

by Conditions 3.1(ii) and 3.2(i).
It remains to treat the sum over the covariances. Since $W^{\varepsilon-}_{nj}$ is $\mathcal{B}^{\varepsilon}_{\{(j(k_n^*+2)-2)b_n+1\}:\{j(k_n^*+2)b_n\}}$-measurable, we may apply Lemma 3.11 in Dehling and Philipp (2002). By Condition 3.1(iii), the sum $\sum_{j=2}^{q_n^*} \alpha_{c_2}(j k_n^* b_n)^{1-2/p}$ converges to zero; hence $\|W^{\varepsilon-}_{n1}\|_p = o(1)$ as asserted.
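For completeness, the covariance inequality being applied here is of Davydov type (cf. Lemma 3.11 in Dehling and Philipp, 2002; the precise multiplicative constant is immaterial for the argument): for sigma-fields $\mathcal{A}, \mathcal{B}$ and random variables $\xi \in L^p(\mathcal{A})$, $\eta \in L^q(\mathcal{B})$ with $1/p + 1/q < 1$,

```latex
\bigl|\operatorname{Cov}(\xi, \eta)\bigr|
  \;\le\; C\,\alpha(\mathcal{A}, \mathcal{B})^{\,1 - 1/p - 1/q}\,
  \|\xi\|_{p}\,\|\eta\|_{q},
```

where $\alpha(\mathcal{A}, \mathcal{B})$ denotes the alpha-mixing coefficient between the two sigma-fields and $C$ is a universal constant. Summing these bounds over pairs of small blocks produces the displayed mixing sum.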
Let us now treat the term $L_n^+ \mathbf{1}_{A_n}$ and show weak convergence to the asserted normal distribution. One can write

A standard argument based on characteristic functions shows that the weak limit of $q_n^{*-1/2} \sum_{j=1}^{q_n^*} \tilde W^+_{nj}$ is the same as if the summands were independent. By arguments as before, we may also pass back to an independent sample $W^+_{nj}$, $j = 1, \ldots, q_n^*$. The assertion then follows from Lyapunov's central limit theorem, once we have shown the Lyapunov condition.
For that purpose, note that $\|W^+_{nj}\|_{2+\delta} = O(\sqrt{k_n/q_n^*}) = O(\sqrt{k_n^*})$ by similar arguments as in (7.20), such that $E[|W^+_{nj}|^{2+\delta}] = O(k_n^{*(2+\delta)/2})$. As a consequence, the Lyapunov condition is satisfied by the construction of $k_n^*$, provided that the limit of $E[|W^+_{n1}|^2]$ exists. If it does, we can conclude that $L_n^+ \xrightarrow{d} N(0, \lim_{n\to\infty} E[|W^+_{n1}|^2])$, and it suffices to show that

To this end, note that $W^+_{n1} = \lambda_1 e_{n^*}(x) + \lambda_2 B^{sb}_{n^*} + o_P(1)$, where $e_{n^*}$ and $B^{sb}_{n^*}$ are defined as $e_n$ and $B^{sb}_n$ with $n$ replaced by $n^* = k_n^* b_n$ and $k_n$ by $k_n^*$; our general conditions continue to hold with this replacement. The result follows from Lemma 7.15, Lemma 7.16 and the proof of Theorem 4.1 in Robert (2009).
Proof of Lemma 7.10. Up to notation, the proof is exactly the same as the one of Lemma 7.4 in the disjoint blocks case.
Proof of Lemma 7.11. The result follows immediately from the argument in the proof of Lemma 7.5 and the proof of Lemma 10.2 in Berghaus and Bücher (2018).
Proof of Lemma 7.12. Up to notation, the proof is exactly the same as the one of Lemma 7.6 in the disjoint blocks case.
Proof of Lemma 7.14. The proof is similar to the one of Lemma 7.8, which is why we keep it short.
where $\tilde e_n(x) = e_n(x) + k_n^{-1/2}$. For some $\varepsilon > 0$, define the event

such that $P(B_n) \to 1$ by Condition 3.4(iii). As in the proof of Lemma 7.8, with $f$ defined in (7.14), we can write

such that $V_{n2}$ can be bounded as in the proof of Lemma 7.8 as follows:

This expression can be handled as in the proof of Lemma 10.1 in Berghaus and Bücher (2018), such that $\lim_{m\to\infty}\limsup_{n\to\infty} P(|V_{n2}\mathbf{1}_{B_n}\mathbf{1}_{C_n}| > \delta) = 0$.
The remaining term $|V_{n1}|$ can be treated analogously to the corresponding term in the proof of Lemma 7.8.
Proof. (a) We assume that both $U_s$ and $Z^{sb}_{nt}$ are measurable with respect to the appropriate $\mathcal{B}^{\varepsilon}$ sigma-algebra; the general case can be treated by multiplying with suitable indicator functions as in the proof of Lemma 7.9. Let $A_j = \sum_{s \in I_j} \mathbf{1}(U_s > 1 - x/b_n)$ and $D_j = \sum_{s \in I_j} \log(Z^{sb}_{ns})$. Then

The second sum is asymptotically negligible, since $\|A_j\|_2 = \|N^{(x)}_{b_n}(E)\|_2 = O(1)$ and $\|\log(Z^{sb}_{n,n-b_n+1})\|_2 = O(1)$ by Conditions 3.1(ii) and 3.2(i). Next, following the argument in the proof of Lemma B.1 in Berghaus and Bücher (2018), we may write

Note that $\lim_{n\to\infty} E[\log(Z^{sb}_{n1})] = \varphi_{(C)}(\theta)$ by uniform integrability of $\log(Z_{1:n})$, and that $\sup_{n\in\mathbb{N}} \|f_n\|_\infty + \|g_n\|_\infty < \infty$ as a consequence of $\|\sum_{s\in I_1} \mathbf{1}(U_s > 1 - x/b_n)\|_2 \times \|\log(Z^{sb}_{n1})\|_2 < \infty$ by Conditions 3.1(ii) and 3.2(i). Hence, the lemma is proven if we show that, for any $\xi \in (0,1)$,

Since the proof for $f_n(1-\xi)$ is similar, we only treat $g_n(\xi)$, which can be written as

Let us next show joint weak convergence of $\sum_{s\in I_2} \mathbf{1}(U_s > 1 - x/b_n)$ and $\log(Z^{sb}_{n,(1+\xi)b_n+1})$. For that purpose, note that

coincides with $F_n(i, e^y)$ in the proof of Lemma B.1 in Berghaus and Bücher (2018). Hence, by that proof, we have

As a consequence of the previous three displays, and since weak convergence and uniform integrability imply convergence of moments, we have

as asserted, which implies part (a) of the lemma.

(b) In the proof of Lemma B.3 in Berghaus and Bücher (2018) it is shown that, for $y \le \log(x)$,

Equation (7.11) then allows us to rewrite

As a consequence, further noting that

Then, by Fubini's theorem,

The assertion now follows from the fact that

after assembling terms, where $\mathrm{Ei}(x) = -\int_{-x}^{\infty} e^{-t}/t \, dt$ for $x > 0$ is the exponential integral.

Lemma 7.16. One has $\lim_{n\to\infty} \mathrm{Var}(B^{sb}_n) = 8\log(2) - 4 \approx 1.545$.
Proof. As in the proof of Lemma 7.15, we assume that the $Z^{sb}_{nt}$ are measurable with respect to the appropriate $\mathcal{B}^{\varepsilon}$ sigma-algebra. We may then argue as in that proof to obtain

whence convergence of the integral over $f_n$ in (7.21) may be concluded from the dominated convergence theorem, once we have shown pointwise convergence of $f_n$. To this end, we show that, for any fixed $\xi \in (0,1)$,

for some random vector $(X^{(\xi)}, Y^{(\xi)})$. This in turn will imply

by Condition 3.2(i), and therefore

For the proof of (7.22), define, for $x, y \in \mathbb{R}$,

which converges to

by the proof of Lemma B.2 in Berghaus and Bücher (2018). Hence, (7.22) holds, where the random vector $(X^{(\xi)}, Y^{(\xi)})$ has joint c.d.f.
We are left with calculating the right-hand side of (7.23). By Lemma 7.26, we have

We start with the first summand $A$. Recall the exponential integral $\mathrm{Ei}(x) = -\int_{-x}^{\infty} e^{-t}/t\,dt$ for $x > 0$, and note that

Next, invoke the substitution $z = \theta e^y$ to obtain that

7.4. Proof of Theorem 4.2

The following notation will be used throughout:

Note that $\hat\theta^{zn}_{db,MAD} = \varphi_{(M)}^{-1}(\hat S_n)$ and $\hat\theta^{zn}_{sb,MAD} = \varphi_{(M)}^{-1}(\hat S^{sb}_n)$, where $\varphi_{(M)}(x) = x/(1+x)$. The assertion follows from the delta-method and Propositions 7.17 and 7.19.

The term $C_n$ is asymptotically negligible by Condition 3.3(ii). A straightforward calculation shows that the summand $A_n$ can be written in terms of the tail empirical process $e_n$ as

where $\hat H_{k_n}$ is the empirical c.d.f. of $Z_{n1}, \ldots, Z_{nk_n}$; see (7.2). The asymptotic normality of $A_n + B_n$ can now be shown as in the proof of Proposition 7.1. The corresponding key result is given by Lemma 7.18, whose proof is similar to (but easier than) that for the CFG-estimator (Lemma 7.3) and is omitted for the sake of brevity.
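For concreteness, the delta-method step with $\varphi_{(M)}(x) = x/(1+x)$ can be sketched as follows; here $\sigma^2$ stands for the limiting variance of $\sqrt{k_n}(\hat S_n - \varphi_{(M)}(\theta))$, whose existence is the content of Proposition 7.17:

```latex
\varphi_{(M)}(x) = \frac{x}{1+x}, \qquad
\varphi_{(M)}^{-1}(s) = \frac{s}{1-s}, \qquad
\bigl(\varphi_{(M)}^{-1}\bigr)'(s) = \frac{1}{(1-s)^{2}} .
```

Evaluating the derivative at $s = \varphi_{(M)}(\theta) = \theta/(1+\theta)$, where $1 - s = 1/(1+\theta)$, gives $(\varphi_{(M)}^{-1})'(\varphi_{(M)}(\theta)) = (1+\theta)^2$, so that $\sqrt{k_n}(\hat\theta - \theta) \rightsquigarrow N(0, (1+\theta)^4 \sigma^2)$.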
By Condition 3.3(iii), the term $C_n$ converges to zero. A straightforward calculation shows that the term $A_n$ can be written as

The asymptotic normality of $A_n + B_n$ can be shown as in the proof of Proposition 7.1 by an application of Wichura's theorem. Here, Lemma 7.3 needs to be replaced by Lemma 7.22, whose proof is similar but easier and therefore omitted for the sake of brevity.

$\frac{A(x) - \tilde A_r(x)}{x}\, x^{\eta} \to 0$ for $r \to \infty$.
Thus, the limit superior (for $n \to \infty$) of the expression on the right-hand side of (7.28) can be made arbitrarily small by increasing $r$. Finally, we can bound $|I_{n3}|$ as follows:
$$|I_{n3}| \le \sum_{k=1}^{r} \frac{|A(k/r)|}{k/r}\, |B_n(x)|,$$
which converges to zero by assumption.
Lemma 7.26. Let $X$ and $Y$ be real-valued random variables such that $XY$ is integrable. Then,

Proof. This is a standard calculation based on Fubini's theorem.
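The Fubini calculation alluded to is, presumably, Hoeffding's covariance identity, $\operatorname{Cov}(X,Y) = \int\!\!\int \{P(X \le x, Y \le y) - P(X \le x)P(Y \le y)\}\,dx\,dy$. As an independent sanity check of that identity (not of the lemma's exact display, which is not reproduced here), the sketch below evaluates the double integral numerically for $X \sim \mathrm{Uniform}(0,1)$ and $Y = X^2$, a case where $\operatorname{Cov}(X,Y) = E[X^3] - E[X]E[X^2] = 1/4 - 1/6 = 1/12$ exactly.

```python
# Numerical check of Hoeffding's covariance identity:
#   Cov(X, Y) = integral of [F_{X,Y}(x,y) - F_X(x) F_Y(y)] over dx dy,
# for X ~ Uniform(0,1), Y = X^2, where Cov(X, Y) = 1/12 exactly.
import math

def joint_cdf(x, y):
    # P(X <= x, X^2 <= y) = P(X <= min(x, sqrt(y))) on the unit square
    return min(x, math.sqrt(y))

n = 400
h = 1.0 / n
integral = 0.0
for i in range(n):
    x = (i + 0.5) * h          # midpoint rule in x
    for j in range(n):
        y = (j + 0.5) * h      # midpoint rule in y
        # F_X(x) = x and F_Y(y) = sqrt(y) on [0, 1]
        integral += (joint_cdf(x, y) - x * math.sqrt(y)) * h * h

print(integral)  # approximately 1/12
```

The integrand vanishes outside the unit square here, which is why the grid can be confined to $[0,1]^2$; for unbounded variables the integral runs over all of $\mathbb{R}^2$.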