Trimmed extreme value estimators for censored heavy-tailed data

We consider estimation of the extreme value index and extreme quantiles for heavy-tailed data that are right-censored. We study a general procedure of removing low-importance observations in tail estimators. This trimming procedure is applied to the state-of-the-art tail estimators for randomly right-censored data. Through an averaging procedure over the amount of trimming we derive new kernel-type estimators. Extensive simulation suggests that one of the newly considered kernels leads to an estimator that is highly competitive against virtually any other available alternative in this framework. Moreover, we propose an adaptive selection method for the amount of top data used in estimation, based on the trimming procedure, minimizing the asymptotic mean squared error. We also provide an illustration of this approach on simulated as well as real-world MTPL insurance data.


Introduction
In recent years the problem of tail estimation with right-censored data has received considerable attention, starting with Beirlant et al. (2007) and Einmahl et al. (2008), who considered the problem for different domains of attraction. Most efforts, however, have been dedicated to heavy-tailed distributions, with several papers being motivated by heavy-tailed insurance claim data with long development times of the claims; see e.g. Beirlant et al. (2010, 2016, 2018, 2019), Worms and Worms (2014, 2016) and Ndao et al. (2014). The underlying model assumption here is that the random variable of interest $X$ has a Pareto-type distribution function
$$F(x) = 1 - x^{-1/\xi}\,\ell(x), \qquad \xi > 0,$$
where $\ell$ is slowly varying at infinity: $\lim_{x\to\infty} \ell(tx)/\ell(x) = 1$ for every $t > 1$. Worms and Worms (2019) discuss the analogous problem for Weibull-type distributions. Right-censoring for heavy-tailed distributions in a regression setting was discussed in Ndao et al. (2016), Dierckx et al. (2019), Goegebeur et al. (2019a) and Stupfler (2016), while Goegebeur et al. (2019b) treat a bivariate extension. Stupfler (2019) discusses the case of dependent censoring.
Here we revisit the case of heavy-tailed data. More specifically, we consider the random right-censoring model, where the independent and identically distributed (i.i.d.) observations $X_1, \dots, X_n$ of $X$ may be preceded by censoring variables $C_1, \dots, C_n$, and it is known whether that happens. One then observes
$$Z_i = \min(X_i, C_i), \qquad e_i = 1\{X_i \le C_i\}, \qquad i = 1, \dots, n,$$
where $C_1, \dots, C_n$ is an i.i.d. sequence of censoring random variables, independent of the observations $X_i$. Here $1\{X_i \le C_i\}$ denotes the indicator of the event $\{X_i \le C_i\}$. In order to obtain non-degenerate and tractable identities, one assumes that the censoring variables are also Pareto-type distributed, with distribution function
$$F_C(x) = 1 - x^{-1/\xi_c}\,\ell_c(x), \qquad \xi_c > 0,$$
with $\ell_c$ another slowly varying function at infinity. This choice also motivates the asymptotic Pareto behaviour to be introduced in the sequel. Then we have that
$$P(Z > x) = x^{-1/\xi_z}\,\ell_z(x), \qquad \xi_z = \frac{\xi\,\xi_c}{\xi + \xi_c},$$
for $x > 1$, where $\ell_z(x) = \ell(x)\,\ell_c(x)$. As explained in Einmahl et al. (2008), the parameter $p = \xi_z/\xi = \xi_c/(\xi + \xi_c)$ is the limit of $p(z) := P(e_1 = 1 \mid Z_1 = z)$ as $z \to \infty$, and can be interpreted as the non-censoring probability in the limit, or the tail limiting proportion of non-censored data. In the exact Pareto setting (i.e. $\ell$ and $\ell_c$ being constant) the censoring indicators $e_1, \dots, e_n$ turn out to be i.i.d. Bernoulli($p$) random variables, independent of $Z_1, \dots, Z_n$.
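The interpretation of $p$ as the limiting non-censoring probability can be checked numerically in the exact Pareto setting. The following is a minimal sketch, not part of the original analysis: the function name `simulate_censored_pareto` and the parameter values $\xi = 0.5$, $\xi_c = 1$ are illustrative assumptions, and both slowly varying functions are taken to be constant.

```python
import random

def simulate_censored_pareto(n, xi, xi_c, seed=0):
    """Draw n pairs (Z_i, e_i) with Z_i = min(X_i, C_i) and e_i = 1{X_i <= C_i}.

    Hypothetical helper: X and C are exact Pareto with P(X > x) = x**(-1/xi)
    and P(C > x) = x**(-1/xi_c) for x >= 1, generated by inverse transform.
    """
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = (1.0 - rng.random()) ** (-xi)      # 1 - U lies in (0, 1]
        c = (1.0 - rng.random()) ** (-xi_c)
        sample.append((min(x, c), 1 if x <= c else 0))
    return sample

xi, xi_c = 0.5, 1.0                 # arbitrary illustrative values
p = xi_c / (xi + xi_c)              # limiting non-censoring probability
data = simulate_censored_pareto(20000, xi, xi_c)
p_hat = sum(e for _, e in data) / len(data)
print("theoretical p:", p, "empirical proportion:", round(p_hat, 3))
```

In this exact Pareto sketch the indicators are Bernoulli($p$) at every level of $Z$, so the overall proportion of non-censored observations already approximates $p$; for general Pareto-type data this holds only in the tail.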
Within this censoring and regularly varying context, Beirlant et al. (2007) proposed a first estimator of $\xi$ in the spirit of the classical Hill estimator. Concretely, define the order statistics of the observed sample as $Z_{1,n} \le \dots \le Z_{n,n}$, and let $e_{i,n}$ denote the corresponding censoring indicators, $i = 1, \dots, n$. Then the Hill estimator adapted for censoring is given by
$$H_k = \frac{H^Z_k}{\hat p_k}, \tag{2}$$
with $\hat p_k = \frac{1}{k}\sum_{i=1}^{k} e_{n-i+1,n}$ the fraction of non-censored data in the top $k$ observations, and
$$H^Z_k = \frac{1}{k}\sum_{i=1}^{k} \log\frac{Z_{n-i+1,n}}{Z_{n-k,n}}$$
the classical Hill estimator (Hill, 1975) based on the top $k$ observations. Einmahl et al. (2008) showed that, under some regularity assumptions, $H_k$ is consistent whatever the value of $p \in (0, 1)$: $H_k \to \xi$ in probability as $k, n \to \infty$ and $k/n \to 0$. Moreover, Einmahl et al. (2008) derived the asymptotic normality of $H_k$ under general conditions. Worms and Worms (2014) proposed an alternative generalization of the Hill estimator based on the fact that
$$\lim_{t\to\infty} \frac{1}{\bar F(t)} \int_t^\infty \frac{\bar F(u)}{u}\,du = \xi,$$
where $\bar F(x) = 1 - F(x)$. In the exact Pareto case, the above limit is an equality. Replacing $\bar F$ with the Kaplan-Meier estimator
$$1 - \hat F_n(x) = \prod_{Z_{i,n} \le x} \left(\frac{n-i}{n-i+1}\right)^{e_{i,n}} \tag{4}$$
for $\bar F(x)$ yields the estimator
$$H^W_k = \sum_{i=1}^{k} \frac{1 - \hat F_n(Z_{n-i,n})}{1 - \hat F_n(Z_{n-k,n})} \log\frac{Z_{n-i+1,n}}{Z_{n-i,n}}, \tag{5}$$
which was shown to be consistent in Worms and Worms (2014). Observe that both (2) and (5) reduce to $H^Z_k$ when there is no censoring. Based on simulation studies, see e.g. Beirlant et al. (2018), the estimator $H^W_k$ is known to exhibit superior behaviour in comparison with $H_k$, especially with respect to bias. The mathematical treatment of this estimator turns out to be difficult and the asymptotic normality of $H^W_k$ has only been derived in Beirlant et al. (2019) under light censoring, i.e. $\xi < \xi_c$ or equivalently $p > 1/2$, and some regularity conditions. Here we will propose a novel family of estimators on the basis of a trimming procedure that will exhibit competitive behaviour and for which the mathematical treatment is simpler. This then also yields a generalization to the censoring case of the classical class of kernel estimators introduced in Csörgő et al. (1985).
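As an illustration of the censoring-adapted Hill estimator described above (the classical Hill estimator of the observed $Z$-sample divided by the fraction of non-censored data among the top $k$ observations), the following minimal sketch may help. The function name `censored_hill` and the simulation parameters are illustrative assumptions, not taken from the original.

```python
import math
import random

def censored_hill(z, e, k):
    """Return (H_k, H^Z_k, p_hat_k) computed from the k largest observations."""
    n = len(z)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    pairs = sorted(zip(z, e))            # ascending in Z; indicators follow
    base = pairs[n - k - 1][0]           # baseline order statistic Z_{n-k,n}
    top = pairs[n - k:]                  # Z_{n-k+1,n}, ..., Z_{n,n}
    hill_z = sum(math.log(zi / base) for zi, _ in top) / k
    p_hat = sum(ei for _, ei in top) / k
    if p_hat == 0:
        raise ValueError("no uncensored observation among the top k")
    return hill_z / p_hat, hill_z, p_hat

# Simulated exact-Pareto data; xi = 0.5 and xi_c = 1 are arbitrary choices.
rng = random.Random(1)
xi, xi_c, n, k = 0.5, 1.0, 5000, 500
x = [(1.0 - rng.random()) ** (-xi) for _ in range(n)]
c = [(1.0 - rng.random()) ** (-xi_c) for _ in range(n)]
z = [min(a, b) for a, b in zip(x, c)]
e = [1 if a <= b else 0 for a, b in zip(x, c)]
h_k, hill_z, p_hat = censored_hill(z, e, k)
print("H_k:", round(h_k, 3), "H^Z_k:", round(hill_z, 3), "p_hat_k:", round(p_hat, 3))
```

In this sketch the uncorrected Hill estimator targets $\xi_z = \xi\xi_c/(\xi+\xi_c)$ rather than $\xi$; dividing by the observed non-censoring fraction moves the estimate back towards $\xi$.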
Before introducing the trimming procedure, we first propose simplified versions of $H^W_k$ that will be more amenable to the approach in the sequel. To this end, note that one can write
$$H^W_k = \sum_{i=1}^{k} \Big[\prod_{j=i+1}^{k} (1 - 1/j)^{e_{n-j+1,n}}\Big] \log(Z_{n-i+1,n}/Z_{n-i,n}).$$
In the exact Pareto case one has $p(z) = p$ and the term in the square bracket has expectation
$$\prod_{j=i+1}^{k} \left(1 - \frac{p}{j}\right),$$
where we have used the fact that the $e_{n-j+1,n}$ are i.i.d. Bernoulli random variables with the same law as the $e_j$. Notice that for Pareto-type variables, this only holds asymptotically, as $k, n/k \to \infty$. The Rényi representation also entails for the exact Pareto case that (7) and (8) and using the approximations $1 - \hat p_k/j \approx \exp(-\hat p_k/j)$ and $\sum_{j=i+1}^{k} j^{-1} \approx \log((k+1)/i)$, we propose the following estimator which is closely related to $H^W_k$: where the log-spacings in the sum are all taken with respect to the same baseline order statistic $Z_{n-k,n}$. The latter will allow us to apply the trimming operation of removing low-importance observations in the tail estimation, developed in Bladt et al. (2020b) for the non-censoring case, to the present situation with censoring.
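The product representation of the Worms-type weights and the two approximations quoted above can be sketched numerically. The code below is a hedged illustration rather than the authors' implementation: it assumes the weight attached to the $i$-th log-spacing is $\prod_{j=i+1}^{k}(1-1/j)^{e_{n-j+1,n}}$, and it replaces this product by $(i/(k+1))^{\hat p_k}$, which is what the two approximations give when combined. The function names are hypothetical, and the second function shows only this intermediate approximation, not necessarily the final form of $H^A_k$.

```python
import math
import random

def hill(zs, k):
    """Classical Hill estimator; zs must be sorted in ascending order."""
    n = len(zs)
    base = zs[n - k - 1]
    return sum(math.log(zs[n - i] / base) for i in range(1, k + 1)) / k

def worms_product_form(zs, es, k):
    """sum_i [prod_{j=i+1}^k (1 - 1/j)^{e_{n-j+1,n}}] * log(Z_{n-i+1,n}/Z_{n-i,n})."""
    n = len(zs)
    total, weight = 0.0, 1.0             # empty product for i = k
    for i in range(k, 0, -1):
        total += weight * math.log(zs[n - i] / zs[n - i - 1])
        if es[n - i] == 1:               # factor for j = i, used from i - 1 on
            weight *= 1.0 - 1.0 / i
    return total

def power_weight_approximation(zs, es, k):
    """Same sum with the product replaced by (i / (k + 1)) ** p_hat_k."""
    n = len(zs)
    p_hat = sum(es[n - i] for i in range(1, k + 1)) / k
    return sum((i / (k + 1)) ** p_hat * math.log(zs[n - i] / zs[n - i - 1])
               for i in range(1, k + 1))

# Simulated exact-Pareto data with arbitrary parameters xi = 0.5, xi_c = 1.
rng = random.Random(2)
n, k = 2000, 200
pairs = []
for _ in range(n):
    x = (1.0 - rng.random()) ** (-0.5)
    c = (1.0 - rng.random()) ** (-1.0)
    pairs.append((min(x, c), 1 if x <= c else 0))
pairs.sort()
zs = [zi for zi, _ in pairs]
es = [ei for _, ei in pairs]
print("product form:", round(worms_product_form(zs, es, k), 3))
print("power-weight approximation:", round(power_weight_approximation(zs, es, k), 3))
```

Without censoring (all indicators equal to one) the product collapses to $i/k$ and the first function returns the classical Hill estimator, which is a convenient sanity check for any implementation.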
In this paper, we extend the trimming method proposed in Bladt et al. (2020b) to the case of random right censoring, both for $H_k$ and $H^A_k$. Averaging the trimmed statistics over the amount of trimming then leads to new estimators which belong to a general family of kernel estimators comprising $H_k$ and $H^A_k$. This family turns out to be closed under the proposed averaging operation after trimming. We then study the basic asymptotic properties of the kernel estimators in Section 3, even in the case of heavy censoring, i.e. $p \le 1/2$. Trajectories of the trimmed statistics as a function of the amount of trimming turn out to be quite flat near the optimal threshold value minimizing the mean squared error (MSE). Based on this, in Section 4 we propose, for the first time in this setting, an adaptive selection method for the amount of top data used in tail estimation. In Sections 5 and 6, we show through simulations and a case study from insurance that the new kernel estimators and the threshold selection method exhibit promising properties.
2. Trimmed estimators for ξ
2.1. Trimming tail estimators. In Bladt et al. (2020b), lower trimming of the classical Hill estimator was shown to be an effective strategy to obtain Hill-type plots with lower variance arising from the changes of the baseline order statistic, which aids in the visual selection of a horizontal part of the trajectory. Here, we extend this approach to the censored case, and consider lower trimming of the estimators $H_k$ and $H^A_k$, deleting the smallest $k - b$ ($b \le k$) peaks over thresholds $Z_{n-i+1,n}/Z_{n-k,n}$, $i = b + 1, \dots, k$: trimming $H^Z_k$ as in Bladt et al. (2020b) one obtains a trimmed version of $H_k$ while for $H^A_k$ we propose since, when $\hat p_k$ is replaced by the exact value $p$, using (8) the expected value
2.2. Averaging and kernels. The above trimming procedure naturally leads to new estimators when considering the empirical mean of the trimmed estimators across $b = 1, \dots, k$: For instance, in the case of $H^A_{b,k}$ this is asymptotically equivalent to as can be seen using partial summation and a simple Riemann sum approximation. In fact $H_k$, $H^A_k$ and the averaged version of $H^A_k$ can all be put into a kernel framework, by defining where $K$ is a positive kernel function satisfying In particular, we get Note that $H^W_k$ does not fall into this framework, but its simplified version $H^A_k$ does.
Also notice that, when trimming any kernel estimator $H^K_k$ to obtain the averaging operation $\frac{1}{k}\sum_{b=1}^{k} H^K_{b,k}$ leads to an associated kernel estimator Rewriting the kernel estimators $H^K_k$ from (13) in terms of the random variables $V_j$ ($j = 1, \dots, k$) has some theoretical advantage, since for the exact Pareto case these are independent and exponentially distributed with mean $\xi_z$ thanks to the Rényi representation: Using a Riemann approximation, one can also propose to use that also satisfies the norming $\int_0^1 K(u, p)\,du = 1/p$. The class of estimators $H^K_k$ can be considered as generalizations of the kernel estimators proposed in Csörgő et al. (1985) from the non-censoring to the censoring case.
2.3. Quantile estimation. Following the approach of Weissman (1978), it is possible to construct quantile estimators as a function of the sample size as follows. Let the quantile function of a regularly varying tail be $Q(p)$. The regular variation property implies that the ratio of increasingly large quantiles satisfies as discussed in Section 4.3 in de Haan and Ferreira (2007). This then leads to a quantile estimator based on $k$ order statistics and the kernel $K$ as where $\hat Q^{KM}$ is the quantile function derived from the Kaplan-Meier estimator $\hat F_n$ defined in (4).
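A Weissman-type extrapolation under censoring can be sketched as follows. The displayed formula is not recoverable from the text, so this is only an assumption-laden illustration: it takes the Kaplan-Meier quantile at level $1 - k/n$ as the base and multiplies it by $(k/(n\,q))^{\hat\xi}$ for a small tail probability $q$, which is the classical Weissman form with the empirical base quantile swapped for a Kaplan-Meier one. All function names are hypothetical, the data are assumed sorted and free of ties, and any estimate of $\xi$ (for example a kernel estimate) can be passed in as `xi_hat`.

```python
def kaplan_meier_survival(zs, es):
    """Kaplan-Meier survival at each order statistic; zs ascending, no ties assumed."""
    n = len(zs)
    surv, s = [], 1.0
    for j in range(1, n + 1):
        if es[j - 1] == 1:               # only uncensored points reduce survival
            s *= (n - j) / (n - j + 1)
        surv.append(s)
    return surv

def km_quantile(zs, es, level):
    """Smallest order statistic whose Kaplan-Meier distribution value reaches level."""
    for z, s in zip(zs, kaplan_meier_survival(zs, es)):
        if 1.0 - s >= level - 1e-12:     # small tolerance against rounding
            return z
    return zs[-1]

def weissman_type_quantile(zs, es, k, xi_hat, tail_prob):
    """Extrapolate from the Kaplan-Meier quantile at level 1 - k/n (assumed form)."""
    n = len(zs)
    base = km_quantile(zs, es, 1.0 - k / n)
    return base * (k / (n * tail_prob)) ** xi_hat

# Toy, fully observed sample: the base quantile is then Z_{n-k,n}.
zs = [float(i) for i in range(1, 11)]
es = [1] * 10
q = weissman_type_quantile(zs, es, k=2, xi_hat=0.5, tail_prob=0.05)
print("extrapolated quantile:", q)
```

With no censoring the Kaplan-Meier quantile at level $1 - k/n$ coincides with $Z_{n-k,n}$, so the sketch reduces to the classical Weissman estimator.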

Asymptotic representations
In this section we derive the asymptotic distributions of the kernel estimators and their trimmed counterparts as introduced in the preceding section. In Einmahl et al. (2008) the asymptotics for $H_k = H^{K_0}_k$ were discussed in detail. Beirlant et al. (2019) provided an asymptotic normality result for $H^W_k$ when $p > 1/2$, but that estimator is not in the current kernel framework. Here we provide asymptotic representations for the class of kernel estimators of the form $H^K_k$. To this end, we make use of second-order assumptions which were first proposed in Hall and Welsh (1985) and which have been widely used in papers on the estimation of the extreme value index for Pareto-type distributions, both in the non-censoring case, such as Csörgő et al. (1985), and in the censoring case, as in Beirlant et al. (2019): where $\beta, \beta_c, C, C_c$ are positive constants and $D, D_c$ are real constants. It now follows that Concerning the scaled spacings $V_j$, $j = 1, \dots, k$, one then has the following expansion as $n, k \to \infty$ and $k/n \to 0$ as given in Theorem 4.1 in Beirlant et al. (2004): with $E_j$ standard exponential random variables, independent for each $n$, and $\sum_{j=i}^{k} R_{j,n}/j = o_p(Q_{0,z}(n/k)) \max(\log((k+1)/i), 1)$. Next, from Einmahl et al. (2008) and Beirlant et al. (2016) where $N \sim N(0, 1)$ can be chosen appropriately independent of $\{V_1, V_2, \dots\}$, and (20) and (21) we now derive that Using the mean value theorem, we have from (21) that Next, using (20), Concerning the second last term in (24) we find that From (23) and (24) we can now state an asymptotic expansion for $H^K_k - \xi$.
Theorem 3.1. Under (19) we have as $k, n \to \infty$ and $k/n \to 0$ where $N$ and $\{E_j, j \ge 1\}$ are introduced respectively in (21) and (20).
In order to select an optimal $k$, we then minimize the following asymptotic mean squared error of $H^K_k$: with v k,p = 1 k
Remark 3.2. Note that the first two terms in (25) concern the estimation of $p$ and $\xi$, respectively. The third term can be regarded as a bias term arising from the second-order assumption, which commonly appears in classical extreme value theory. The fourth and final term is proportional to $\xi_z$ and can be regarded as a discretization error term, since the sum inside the parentheses is a Riemann approximation to $\int_0^1 K_k(u, p)\,du = 1/p$. In general, this error is small but nonzero.
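The size of such a discretization error can be illustrated numerically. The sketch below is not the paper's computation: it uses the illustrative kernel $K(u, p) = u^{p-1}$, chosen only because it satisfies the norming $\int_0^1 K(u, p)\,du = 1/p$, and it assumes the grid points $i/(k+1)$, $i = 1, \dots, k$. Both choices, as well as the value $p = 0.7$, are assumptions made for the example.

```python
def kernel(u, p):
    """Illustrative kernel K(u, p) = u**(p - 1); it integrates to 1/p on (0, 1)."""
    return u ** (p - 1.0)

def riemann_sum(k, p):
    """(1/k) * sum_{i=1}^k K(i/(k+1), p), a discrete stand-in for the integral."""
    return sum(kernel(i / (k + 1), p) for i in range(1, k + 1)) / k

p = 0.7                                   # arbitrary non-censoring probability
errors = {k: riemann_sum(k, p) - 1.0 / p for k in (10, 100, 1000)}
for k, err in errors.items():
    print("k =", k, "Riemann sum minus 1/p =", err)
```

For this kernel the error shrinks as $k$ grows but does not vanish at any finite $k$, in line with the remark that the term is small but nonzero.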

Optimal choice of k when estimating ξ
Denoting the trimmed version of the Hill estimator in the fully observed case by for a universal constant $K$ and a specific function $f$. Here $k_{opt}(H^Z_{b,k})$ is the optimal sample fraction minimizing the expectation of the empirical variance On the other hand, from (27) we obtain under $\beta \le \beta_c$ that This means that the optimal $k$ for the estimator $H_k$ with respect to minimization of the AMSE is linked to the optimal $k$ of its trimmed versions for the minimization of the expected empirical variance in the non-censored case. A consequence of the above formula is that That is, a larger percentage of censoring leads to a higher threshold, when compared to the non-censored case. This can already be seen from the expression of the AMSE given in (27), where a smaller $p$ leads to more weight being given to the bias term. From an intuitive point of view, when dealing with censored datasets, two sources of bias have to be accounted for, and hence a smaller sample fraction $k$ is needed to control them.
In practice, for a given sample, one finds an estimate $\hat k_0 = \hat k_{opt}(H^Z_{b,k})$ of $k_{opt}(H^Z_{b,k})$ through minimization of $S^2_k$ over $k$, from which an adaptive choice of $k$ is found through using an estimate $\hat\rho_z$ of $\rho_z$ and replacing $p$ by $\hat p_{\hat k_0}$. Estimators of the second-order parameter $\rho_z$ have been proposed for instance in Fraga Alves et al. (2003). These estimators exhibit high variability, and many authors consider the use of a fixed value for $\rho_z$ such as $\rho_z = -1$. In the next section we use the choices $\rho_z = -1, -1/2, -3/2$, but the results are not very sensitive to this parameter, and we here also propose to stick to the choice $\rho_z = -1$.
Remark 4.1. For kernels different from $K_0$, if we restrict to the case $\beta \le \beta_c$ and $p > 1/2$, and consider the expressions $k v_{k,p} =: \tilde v_{k,p}$ and $b_{k,p}$ taken from (26) as constant in $k$ (i.e. converging fast enough to their limits for $k \to \infty$), we find from (26) that the optimal $k_{opt}$ for any kernel equals This is then basically the same formula as for $H_k$, (29), but one has to calculate $\tilde v_{k,p}$ and $b_{k,p}$ at every $k$ with $p$ estimated by $\hat p_{\hat k_0}$. In heavy censoring cases the rates of convergence of the estimators are different and the approach becomes more involved. This makes the procedure significantly more computer-intensive. In practical applications, however, the difference between the optima across kernels and across heavy and light censoring cases is rather small and does not justify the extra calculations.

Simulations
We performed simulations using the following distributions.
The results are based on 200 simulations of sample size $n = 200$ each. The first-order tail-determining parameters were chosen such that $1/\xi = 3/2, 2, 4$, which seem to be realistic magnitudes for insurance applications, as will be illustrated in the next section. The remaining parameters were chosen in order to satisfy inequalities such as $p > 1/2$ or $p \le 1/2$ but are otherwise arbitrary.
In Figures 1, 2 and 3 we plot the bias, variance and mean squared error as a function of $k$ of the various estimators considered above. Observe that the plots are in logarithmic scale for display purposes. For the bias term this means that there is an asymptote corresponding to $\lim_{t \downarrow 0} \log(t)$. Note that the MSE characteristics of the estimator $H^{K_2}_k$ are quite comparable to those of $H^W_k$ in the Burr and Fréchet cases, and are even better for the log-gamma model. The corresponding analysis for the quantile estimators is given in Figures 4, 5 and 6, which are in agreement with the former plots.
In Figures 7, 8 and 9 we provide violin plots for the $K_0$- and $K_2$-based estimators at the threshold selected according to the automatic procedure given in the previous section. We have taken $\rho_z = -1, -3/2, -1/2$ respectively for the three distributions that we consider. These values were permuted (the resulting plots are omitted) and the results were not very sensitive to the choice of $\rho_z$. To avoid degeneracies, a cutoff of 1/5 of the size of the data set was used for the empirical variance estimates. We also add the results of the parameter estimates when taking $k$ fixed at the theoretical optimal value. The latter theoretical optimum is only available in the light-censoring cases and for distributions properly belonging to the Fréchet domain of attraction (not the log-gamma case). A general observation is that the violin plots bundle together close to zero, as is desired. Looking closer, we observe that in the regularly varying setup, the adaptive selection of $k$ together with the use of the kernels $K_0$ and $K_2$ comes very close to the performance of the Hill estimator evaluated at the theoretical optimum (essentially an oracle estimator, since we input the parameters of the simulated data into this theoretical optimum). As rough guidelines, we observe that the heavy censoring case has significantly worse behaviour than the light censoring case, and that the $K_2$-based estimator has the best behaviour, agreeing with the conclusions from Figures 1, 2 and 3. We believe that this regime-shift between light and heavy censoring is responsible for the increased number of outliers in cases where $p > 1/2$.

6. Insurance Application: censored claims data vs ultimates
We now proceed to analyze an insurance dataset consisting of 837 motor third-party liability (MTPL) insurance claims from 1995 till 2010.
The data exhibit right-censoring, that is, a claim size is partially observed whenever the development of the claim payment is ongoing and the claim is not yet closed. Closed claim sizes are thus considered as observed data points, and open claims are considered as right-censored observations. In Bladt et al. (2020a) it was argued that the assumption of random censoring and heavy-tailedness is adequate. In the top panel of Figure 10 we have the three kinds of data that are available (open claims, closed claims and ultimates), and in the bottom panel the survival function of the open and closed claims using the Kaplan-Meier estimator together with the empirical survival function of the ultimates. We observe that the tail of the latter under-estimates the tail index that is suggested by using survival-analysis techniques.
We now apply the censored tail estimators introduced in this paper to the data. Using the same mechanism as for the simulation study (and $\rho_z = -1$) we find that $k = 35$ is the optimal threshold for the estimator using the kernel $K_0$ (see Figure 11). As observed in the simulations, it is perfectly reasonable to consider the other estimators ($H^{K_1}_k$, $H^{K_2}_k$ and the Worms estimator $H^W_k$) at this value as well. Of the four resulting estimates, the latter two correspond to the kernel $K_2$ and to the Worms estimator $H^W_k$. Notice that they are quite close, and although the simulations suggest that the two last values are the best performing, the 95% confidence interval for the first of these estimators is given by [0.415, 1.267], which amply accommodates all four estimates. Hence, in practice, with only one sample available, it is not possible to make overly conclusive claims regarding the superiority of these point estimates, since they are not statistically distinguishable. The corresponding 99.5% quantile estimators are given by 14957214, 12093195, 9129021, 9355831, illustrating how small changes in tail estimation can lead to large differences in the quantile scale. Note however that the quantile estimates based on $H^W_k$ and $H^{K_2}_k$ are quite stable for $k \le 80$. A more refined estimation of $\rho_z$ is known to be unstable, but can be routinely carried out (for instance using the mop function from the R package evt0). For the Hill estimator of the $Z_i$ variates (ignoring censoring) this gives the estimate $\hat\rho_z = -0.616$, which is relatively close to our choice of $-1$, given the high variability of second-order parameter estimators. Repeating the above analysis with this value has small quantitative influence and provides no additional qualitative insight, and is thus omitted.

Figure 7. Violin plots for the simulation results in case of Burr distributions under different non-censoring asymptotic probabilities. We take the difference between estimated and theoretical ξ. The cases 2p < 1 and 2p ≥ 1 correspond to heavy and light censoring, respectively. The labels kern0 and kern2 correspond to the use of the estimators $H^{K_0}_k$ and $H^{K_2}_k$. The specified k = 34 is the theoretical optimal sample fraction.

Conclusion
In this paper we developed novel extreme value estimators under right-censoring in a kernel framework. The latter class is closed (in the asymptotic sense) under the averaging operation of their trimmed versions, by a simple replacement of kernel. The asymptotic behaviour is given for arbitrary kernels, which allows us to compute, for instance, the expression for the MSE as a function of $k$. The choice of the optimal threshold with respect to MSE is explored in connection with the empirical variance of the trimmed trajectories, which leads to an automated way of selecting a threshold. As for the non-censored case, the idea of selecting a threshold by exploiting this link circumvents the usual estimation difficulties and instabilities which arise in previous approaches in the literature, which typically require the estimation of the second-order parameter $D$, and of $\xi$ itself. Despite its simplicity, simulation studies suggest that the method is also efficient. In fact, when compared with the theoretically optimal value, the latter sometimes is too small to be of any practical relevance, and then our adaptive estimator is superior. In the other cases, when the theoretically optimal value is sensible, our estimator also performs well against it. We finally apply the procedure to a well-understood insurance dataset, and the simulation studies suggest that the instances where $K_0$ has been used in the literature (either alone, or in combination with expert information) to analyze these data could very possibly be improved by considering $K_2$ instead. Interesting directions for further research include trimming the kernel estimators from above, to remove outliers from data, and applying combined tail information using censored data and expert information with the new kernels, improving the previous methods. Finally, it will be interesting to consider optimality criteria for the choice of $k$ for any kernel, and to work out criteria for the selection of the optimal kernel from a purely mathematical point of view.

Figure 8. Violin plots for the simulation results in case of Fréchet distributions under different non-censoring asymptotic probabilities. We take the difference between estimated and theoretical ξ. The cases 2p < 1 and 2p ≥ 1 correspond to heavy and light censoring, respectively. The labels kern0 and kern2 correspond to the use of the estimators $H^{K_0}_k$ and $H^{K_2}_k$. The specified k = 30, 15 are the theoretical optimal sample fractions.

Figure 9. Violin plots for the simulation results in case of log-gamma distributions under different non-censoring asymptotic probabilities. We take the difference between estimated and theoretical ξ. The cases 2p < 1 and 2p ≥ 1 correspond to heavy and light censoring, respectively. The labels kern0 and kern2 correspond to the use of the estimators $H^{K_0}_k$ and $H^{K_2}_k$.