Why scoring functions cannot assess tail properties

Motivated by the growing interest in sound forecast evaluation techniques with an emphasis on distribution tails rather than average behaviour, we investigate a fundamental question arising in this context: Can statistical features of distribution tails be elicitable, i.e. be the unique minimizer of an expected score? We demonstrate that expected scores are not suitable to distinguish genuine tail properties in a very strong sense. Specifically, we introduce the class of max-functionals, which contains key characteristics from extreme value theory, for instance the extreme value index. We show that its members fail to be elicitable and that their elicitation complexity is in fact infinite under mild regularity assumptions. Further we prove that, even if the information of a max-functional is reported via the entire distribution function, a proper scoring rule cannot separate max-functional values. These findings highlight the caution needed in forecast evaluation and statistical inference if relevant information is encoded by such functionals.


Introduction
Many of our day-to-day decisions rely on our ability to produce reasonable forecasts for quantities of interest. For example, production planning involves forecasts on consumer demand, decisions in farming depend on information about the likely weather conditions and financial risk management uses statistical features of portfolio losses. Usually, such quantities are modelled via a random variable Y having an unknown probability distribution and the reasonable actions of a decision maker depend on the properties of this distribution. Forecasts can encode such properties via real numbers, e.g. means or quantiles of the distribution, via sets, e.g. a confidence interval, or by a report of the whole distribution function.
When several competing forecasts are available, a crucial problem is to determine which one is most valuable. A principled approach to this task is to compare the forecasts to a set of realizations of Y via a scoring rule or a scoring function, see e.g. Gneiting and Raftery (2007) and Gneiting (2011). A scoring function assigns a real-valued score based on a forecast and a realizing observation. If a functional, i.e. a statistical property, of a distribution is the unique minimizer of the expected score with respect to this distribution, it is called elicitable. Elicitability is a desirable property for comparative forecast evaluation, where it can be used to incentivize risk-neutral forecasters to report their beliefs (Gneiting, 2011). Moreover, elicitable functionals enable regression and M-estimation (Fissler and Ziegel, 2016; Gneiting, 2011) and are central to various machine learning algorithms (Frongillo and Kash, 2018; Steinwart et al., 2014). Recent theoretical advances on scoring functions and elicitability in the real-valued case can be found in Lambert et al. (2008), Gneiting (2011) and Steinwart et al. (2014). More general vector-valued functionals are treated in Frongillo and Kash (2015, 2018) and Fissler and Ziegel (2016, 2019).
Many statistical functionals such as expectations, quantiles, and expectiles are elicitable and there exist convenient characterizations of the corresponding classes of consistent scoring functions, cf. Gneiting (2011) and the references therein. On the other hand, several widely considered functionals fail to be elicitable, for instance the variance, the mode (Heinrich, 2014) and the prominent financial risk measure Expected Shortfall (ES) (Gneiting, 2011; Weber, 2006). The non-elicitability of the latter functional can be addressed via more general notions of elicitability: Fissler and Ziegel (2016) show that ES is jointly elicitable with the risk measure Value at Risk (VaR), where the latter is simply an extreme quantile. In other words, ES has elicitation complexity equal to two in the sense of Frongillo and Kash (2018). In this particular instance the elicitability problems associated with ES can be resolved, at the cost of considering a higher dimensional problem.
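To make the elicitability of quantiles concrete, the following minimal sketch checks numerically that an empirical quantile minimizes the average quantile (pinball) score. The sample, the level `alpha` and all function names are illustrative assumptions and are not taken from the cited references.

```python
def pinball_loss(x, y, alpha):
    """Quantile score S(x, y) = (1{y <= x} - alpha) * (x - y)."""
    indicator = 1.0 if y <= x else 0.0
    return (indicator - alpha) * (x - y)


def mean_score(x, sample, alpha):
    """Average pinball score of the report x over the sample."""
    return sum(pinball_loss(x, y, alpha) for y in sample) / len(sample)


def best_report(sample, alpha):
    # The average score is piecewise linear in x with kinks at the sample
    # points, so it suffices to search over the sample itself.
    return min(sample, key=lambda x: mean_score(x, sample, alpha))


# Hypothetical data: the integers 0, ..., 100 with equal weight.
sample = list(range(101))
alpha = 0.9
report = best_report(sample, alpha)
print(report)
```

For this sample the empirical 0.9-quantile, i.e. the smallest point at which the empirical distribution function reaches 0.9, is 90, and the search returns the same value.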
More generally, there is a recent growing interest in sound forecast evaluation techniques with an emphasis on distribution tails rather than average behaviour. For instance, Friederichs and Thorarinsdottir (2012) investigate the use of scoring rules for distribution classes central to extreme value theory, and Diks et al. (2011), Lerch et al. (2017) as well as Holzmann and Klar (2017) consider weighted scoring rules for forecasts of distribution tails. An event-based approach to evaluate whether exceedances of high thresholds are predicted correctly is pursued by Stephenson et al. (2008) and Ferro and Stephenson (2011). Closely connected is the verification tool of Taillardat et al. (2019), which is based on the asymptotic behaviour of the continuous ranked probability score (CRPS), conditional on high realizations. A fundamental question arising in this context is to what extent, and in which sense, statistical features of distribution tails are elicitable. This problem is the central theme of this manuscript.
In our approach to this question we introduce the concept of max-functionals which naturally arises from a key feature shared by the statistical functionals that are typically considered in extreme value theory. We demonstrate that max-functionals fail to be elicitable in a very strong sense. Consequently, it is natural to ask whether part of the problem can be mitigated by abandoning point forecasts in favor of reports of the entire distribution function. In this regard we generalize a result by Taillardat et al. (2019) and show that it is an inherent property of all proper scoring rules that they cannot perfectly distinguish among different max-functional values.
The manuscript is organized as follows. In Section 2 we review the three notions of elicitability that are used in the recent literature. Section 3 introduces the class of max-functionals and shows that they cannot be elicitable and that their elicitation complexity is infinite under mild assumptions. Section 4 provides examples of widely used max-functionals. In Section 5 we turn to reports of entire distributions. We show that arbitrarily large differences in tail behaviour, either quantified by tail equivalence or max-functionals, can remain undetected by proper scoring rules. Section 6 concludes with a discussion of the results.

Prerequisites: Elicitability and elicitation complexity
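For the reader's convenience this section recalls the central definitions of elicitability and reviews basic findings. A more detailed overview of the existing literature is given in Fissler and Ziegel (2016) and Gneiting (2011), whose notation we follow here. Let $\mathsf{O} \subseteq \mathbb{R}^d$ be a fixed set, called observation domain, equipped with its Borel σ-algebra $\mathcal{O}$. We use $\mathcal{F}$ to denote a collection of probability distributions on $(\mathsf{O}, \mathcal{O})$, whilst also identifying probability distributions with their cumulative distribution functions. A functional will be a mapping $T : \mathcal{F} \to A$, where $A \subseteq \mathbb{R}^n$. A function $h : \mathsf{O} \to \mathbb{R}$ is called $\mathcal{F}$-integrable if it is integrable with respect to all $F \in \mathcal{F}$; likewise, $g : A \times \mathsf{O} \to \mathbb{R}$ is called $\mathcal{F}$-integrable if $g(x, \cdot)$ is integrable with respect to all $F \in \mathcal{F}$ for every $x \in A$. We use the short notation $\bar h(F) := \int h(y)\, dF(y)$ and $\bar g(x, F) := \int g(x, y)\, dF(y)$ for $\mathcal{F}$-integrable functions $h$, $g$ and $x \in A$, $F \in \mathcal{F}$.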
Scoring functions and Elicitability In the following, S : A × O → R denotes a scoring function, i.e. an F-integrable function. The central concepts connecting scoring functions and statistical functionals are consistency and elicitability.
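For orientation, the standard notion of consistency, as used for instance in Gneiting (2011), can be written as follows. This is a sketch of the usual formulation in terms of the expected score $\bar S(x, F) = \int S(x, y)\, dF(y)$; the authors' exact wording and numbering are not reproduced here.

```latex
% S is called F-consistent for T if the true functional value minimizes the expected score:
\bar S(T(F), F) \le \bar S(x, F)
  \qquad \text{for all } x \in A,\ F \in \mathcal{F}.
% S is called strictly F-consistent for T if, in addition,
\bar S(T(F), F) = \bar S(x, F) \ \Longrightarrow\ x = T(F)
  \qquad \text{for all } x \in A,\ F \in \mathcal{F}.
```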
Definition 2.2 ((Joint) elicitability). A functional $T : \mathcal{F} \to A \subseteq \mathbb{R}^n$ is called elicitable if there exists a strictly $\mathcal{F}$-consistent scoring function for $T$. It is called jointly elicitable with the functional $T' : \mathcal{F} \to A' \subseteq \mathbb{R}^k$ if $(T, T')$ is an elicitable functional.
An important necessary condition that a statistical functional needs to satisfy in order to be elicitable is convexity of level sets, which goes back to Osband (1985), cf. for instance Gneiting (2011, Theorem 6) and Lambert et al. (2008, Lemma 1) for a proof.
Theorem 2.3 (Convexity of level sets). Let $T : \mathcal{F} \to A$ be an elicitable functional. If $F_0, F_1 \in \mathcal{F}$ and $\lambda \in (0, 1)$ are such that $F_\lambda := \lambda F_1 + (1 - \lambda) F_0 \in \mathcal{F}$, then $T(F_0) = T(F_1)$ implies $T(F_\lambda) = T(F_0)$.
Example 2.4. The simplest example of an elicitable functional is the mean of a distribution. More precisely, let $g : \mathsf{O} \to \mathbb{R}$ be such that $g$ and $g^2$ are $\mathcal{F}$-integrable and define $T : \mathcal{F} \to \mathbb{R}$ via $T(F) = \bar g(F)$. Then $T$ is elicitable with a strictly $\mathcal{F}$-consistent scoring function given by $S(x, y) = (x - g(y))^2$, the ubiquitous squared error loss. Likewise, the moment functionals defined via $T_k(F) := \int y^k\, dF(y)$ for $k \in \mathbb{N}$ are elicitable.
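The claim that the mean minimizes the expected squared error can be checked on a small discrete distribution. The sketch below takes g to be the identity; the support points, probabilities and candidate grid are illustrative assumptions.

```python
def squared_error(x, y):
    """Squared error loss S(x, y) = (x - y)^2, i.e. g is the identity."""
    return (x - y) ** 2


def expected_score(x, values, probs):
    """Expected score of the report x under a discrete distribution."""
    return sum(p * squared_error(x, y) for y, p in zip(values, probs))


# Hypothetical discrete distribution.
values = [1.0, 2.0, 6.0]
probs = [0.5, 0.25, 0.25]

mean = sum(p * y for y, p in zip(values, probs))
variance = sum(p * (y - mean) ** 2 for y, p in zip(values, probs))

# Search a grid of candidate reports centred at the mean.
candidates = [mean + d / 10 for d in range(-20, 21)]
best = min(candidates, key=lambda x: expected_score(x, values, probs))
print(best, expected_score(best, values, probs))
```

The expected score at a report x equals the variance plus (x - mean)^2, so the grid search returns the mean and the minimal expected score coincides with the variance.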
A simple example of a non-elicitable functional is the variance functional $T_{\mathrm{var}}(F) := T_2(F) - T_1(F)^2$, whose non-elicitability follows directly from Theorem 2.3. Nevertheless, $T_{\mathrm{var}}$ is jointly elicitable since the vector $(T_1, T_{\mathrm{var}})$ can be obtained from the elicitable vector $(T_1, T_2)$ via a bijection and hence it is elicitable, see e.g. Gneiting (2011, Theorem 4). Another notable property is that on every subset of $\mathcal{F}$ where $T_1$ is constant, $T_{\mathrm{var}}$ reduces to a shifted version of the second moment $T_2$ and is thus elicitable on this subset. That is, $T_{\mathrm{var}}$ is conditionally elicitable given $T_1$ in the following sense.
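Definition 2.5 (Conditional elicitability). Let $T : \mathcal{F} \to A \subseteq \mathbb{R}^n$ and $T' : \mathcal{F} \to A' \subseteq \mathbb{R}^k$ be functionals and let $T'$ be elicitable. For any $x \in A'$ define the set $\mathcal{F}_x := \{F \in \mathcal{F} \mid T'(F) = x\}$. Then the functional $T$ is called conditionally elicitable given $T'$ if for any $x \in A'$ its restriction to the class $\mathcal{F}_x$ is elicitable.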
The concept of conditional elicitability was first introduced by Emmer et al. (2015) and motivated by a conditional backtesting approach for Expected Shortfall (ES) forecasts. A slight generalization was given by Fissler and Ziegel (2016). Our definition coincides with the one from Fissler and Ziegel (2016) except that we drop the condition that $T'$ has elicitable components and only require it to be elicitable. This allows for a more convenient presentation of our results below.
Neither joint elicitability nor conditional elicitability implies elicitability, which follows from Example 2.4 with the variance functional serving as a counterexample. If a functional $T$ is jointly elicitable with the functional $T'$, and $T'$ is elicitable, then it is conditionally elicitable given $T'$. Conversely, as discussed in Fissler and Ziegel (2016), it is unclear under which conditions a conditionally elicitable functional is jointly elicitable.
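The failure of convex level sets for the variance, which underlies its non-elicitability via Theorem 2.3, can be verified on two discrete distributions with equal variance. The distributions below are illustrative assumptions, represented as dictionaries mapping support points to probabilities.

```python
def mixture(f0, f1, lam):
    """Return lam * f1 + (1 - lam) * f0 for discrete distributions given as dicts."""
    support = set(f0) | set(f1)
    return {y: lam * f1.get(y, 0.0) + (1 - lam) * f0.get(y, 0.0) for y in support}


def moment(pmf, k):
    """k-th moment of a discrete distribution."""
    return sum(p * y ** k for y, p in pmf.items())


def variance(pmf):
    """Variance as second moment minus squared first moment."""
    return moment(pmf, 2) - moment(pmf, 1) ** 2


# Two hypothetical distributions with the same variance but different means.
f0 = {-1.0: 0.5, 1.0: 0.5}
f1 = {1.0: 0.5, 3.0: 0.5}
f_half = mixture(f0, f1, 0.5)

print(variance(f0), variance(f1), variance(f_half))
```

Both components have variance 1, yet their equal-weight mixture has variance 2, so the level set of the variance at 1 is not convex.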
Elicitation complexity The definitions of joint elicitability and conditional elicitability both require a second elicitable functional $T'$ accompanying the functional of interest. The distinction between both functionals is made more explicit in the concept of elicitation complexity. To illustrate this, recall Example 2.4 and note that the variance functional satisfies $T_{\mathrm{var}} = f(T_1, T_2)$, where $f(x_1, x_2) = x_2 - x_1^2$. Since $T_1$ and $T_2$ are elicitable, we say that the variance functional has complexity 2. In general, $T$ has elicitation complexity at most $k$ if there is an elicitable functional $T' : \mathcal{F} \to A' \subseteq \mathbb{R}^k$ such that $T = f(T')$ holds. Any $f$ and $T'$ satisfying this condition are then called link function and intermediate functional, respectively. The smallest dimension $k$ for which such a representation is feasible is the elicitation complexity.
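Definition 2.6 (Elicitation complexity). For any set of distribution functions $\mathcal{F}$ the set of $\mathbb{R}^k$-valued elicitable functionals defined on $\mathcal{F}$ is denoted via $\mathcal{E}_k(\mathcal{F})$. For a functional $T : \mathcal{F} \to A \subseteq \mathbb{R}$ and sets $\mathcal{C}_k \subseteq \mathcal{E}_k(\mathcal{F})$ the elicitation complexity of $T$ with respect to $(\mathcal{C}_k)_{k \in \mathbb{N}}$ is $\operatorname{elic}(T) := \min\{k \in \mathbb{N} \mid \text{there exist } T' \in \mathcal{C}_k \text{ and a function } f \text{ such that } T = f \circ T'\}$. If the minimum is not attained for any $k \in \mathbb{N}$, the elicitation complexity of $T$ with respect to $(\mathcal{C}_k)_{k \in \mathbb{N}}$ is infinite and we write $\operatorname{elic}(T) = \infty$.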
Elicitation complexity was introduced by Lambert et al. (2008) and further analyzed in Frongillo and Kash (2018), the latter motivated by its role in empirical risk minimization (ERM) algorithms in machine learning. Intuitively speaking, it replaces the question whether a functional is elicitable by the question how complex it is to elicit the functional.
If no regularity conditions are imposed on $f$ or $T'$, this can lead to small complexities without clear benefits in applications. More precisely, if $f$ is arbitrary and $\mathcal{C}_k = \mathcal{E}_k(\mathcal{F})$ is chosen, pathological choices of $f$, like bijections from $\mathbb{R}^k$ to $\mathbb{R}$, cause all functionals to have complexity 1, as demonstrated by Frongillo and Kash (2018, Remark 2). To avoid such problems, it is standard to choose suitable subclasses $\mathcal{C}_k$ of intermediate functionals.
One possible choice, which is used by Frongillo and Kash (2018) as well as Dearborn and Frongillo (2019) Another possibility, implicitly used by Lambert et al. (2008) is to define C k to be a subclass of all functionals which have elicitable components.
Lastly, it is also possible to impose regularity on the link function f , e.g. by requiring differentiability or continuity. Notably, joint elicitability can be understood as a version of elicitation complexity where the link function is the projection on the last component (Frongillo and Kash, 2018).
We need to be cautious when interpreting elicitation complexity, since imposing different regularity conditions via $(\mathcal{C}_k)_{k \in \mathbb{N}}$ can lead to different elicitation complexities for the same functional, see Frongillo and Kash (2018, Subsection 2.2) for an example. In particular, some $\mathbb{R}^k$-valued functional might be elicitable and simultaneously have elicitation complexity strictly greater than $k$. Conversely, a functional can have elicitation complexity 1, although it is not itself elicitable, as illustrated in Frongillo and Kash (2018, Remark 1).
We conclude this section with a lemma which considers the properties of a functional $T$ if it is restricted to some subclass $\mathcal{F}_2 \subseteq \mathcal{F}$. The first statement corresponds to the first part of Lemma 2.11 of Fissler and Ziegel (2015), the second and third statements are simple extensions. Their proofs are straightforward and therefore omitted.
Lemma 2.7. Let $T : \mathcal{F} \to A$ be a functional and let $\mathcal{F}_2 \subseteq \mathcal{F}$ be non-empty.
(a) If $T$ is elicitable, then the restricted functional $T|_{\mathcal{F}_2}$ is elicitable.

The elicitation complexity of max-functionals
This section introduces max-functionals, the central objects of our study, and investigates their elicitability as well as their elicitation complexity. Henceforth, let F always denote a convex class of distributions.
The essential feature of a max-functional is that its value on convex combinations of distributions is determined by the values attained on the extreme points. Analogously, we can also define min-functionals and all results carry over with minor modifications. The constant functional is the simplest max-functional, but we will usually not be interested in this trivial case. Instead, Section 4 collects some non-trivial examples of max-functionals that are routinely considered in extreme value theory. Also note that, by definition, restrictions of max-functionals to a certain set of values are again max-functionals.
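In symbols, the defining property described above can be sketched as follows. The formulation is inferred from this description and from the identity used in the proof of Proposition 4.1; it is meant as a reading aid and does not reproduce the authors' exact definition.

```latex
% Max-functional on a convex class F (sketch):
T\bigl(\lambda F_1 + (1 - \lambda) F_0\bigr) = \max\{T(F_0),\, T(F_1)\}
  \qquad \text{for all } F_0, F_1 \in \mathcal{F},\ \lambda \in (0, 1).
% Min-functionals are obtained by replacing max with min.
```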
Non-elicitability of max-functionals We start by proving that max-functionals cannot be elicitable. As remarked in Section 2 the usual way to show that a functional is not elicitable consists of applying Theorem 2.3, i.e. showing that it fails to have convex level sets. However, any max-functional has convex level sets by definition. So this approach is not feasible, as in the case of the mode functional (Heinrich, 2014). Instead, we employ the following new criterion.
Corollary 3.4. If T : F → R is a non-constant max-functional, then it is not elicitable.
Loosely speaking, Theorem 3.3 states that elicitable functionals cannot be piecewise constant on convex combinations of distributions. It is closely connected to Theorem 2.3, but of independent interest beyond its use to establish non-elicitability for max-functionals. Frongillo and Kash (2018) state that 'no nonconstant finite-valued property is identifiable'. Theorem 3.3 implies the following analogue.
Corollary 3.5. If T : F → R is a non-constant finite-valued functional, then it is not elicitable.
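Elicitation complexity of max-functionals We now turn from elicitability to the elicitation complexity of max-functionals. This question is only meaningful in relation to a family of sets $(\mathcal{C}_k)_{k \in \mathbb{N}}$, where each set $\mathcal{C}_k \subset \mathcal{E}_k(\mathcal{F})$ is a collection of reasonably regular $\mathbb{R}^k$-valued elicitable functionals, cf. Section 2. Our major regularity requirement is mixture-continuity as in Bellini and Bignozzi (2015) and Fissler and Ziegel (2019).
Definition 3.6 (Mixture-continuity). A functional $T : \mathcal{F} \to \mathbb{R}^k$ is called mixture-continuous if for all $F, G \in \mathcal{F}$ the mapping $[0, 1] \to \mathbb{R}^k$, $\lambda \mapsto T(\lambda F + (1 - \lambda) G)$ is a continuous function.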
Many statistical properties are mixture-continuous, e.g. ratios of expectations, quantiles and expectiles, see Fissler and Ziegel (2019) for details. Lambert et al. (2008) consider only continuous functionals and Fissler and Ziegel (2019) and Bellini and Bignozzi (2015) show that under weak assumptions, an elicitable functional $T$ is mixture-continuous if its expected score function $x \mapsto \bar S(x, F)$ is continuous for all $F \in \mathcal{F}$. Therefore, a functional which is not mixture-continuous can have discontinuous expected scores, leading to difficulties in forecast evaluation, estimation and regression.
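Mixture-continuity concerns the map λ ↦ T(λF + (1 − λ)G) on [0, 1]. The following sketch contrasts the mean, which varies continuously along this path, with the upper endpoint of the support, which jumps as soon as λ becomes positive. The two discrete distributions and all names are illustrative assumptions.

```python
def mixture(f, g, lam):
    """Return lam * f + (1 - lam) * g for discrete distributions given as dicts."""
    support = set(f) | set(g)
    return {y: lam * f.get(y, 0.0) + (1 - lam) * g.get(y, 0.0) for y in support}


def mean(pmf):
    """First moment of a discrete distribution."""
    return sum(p * y for y, p in pmf.items())


def upper_endpoint(pmf):
    """Largest support point carrying positive probability."""
    return max(y for y, p in pmf.items() if p > 0)


f = {10.0: 1.0}            # hypothetical point mass at 10
g = {0.0: 0.5, 1.0: 0.5}   # hypothetical two-point distribution

for lam in (0.0, 1e-9, 1e-3):
    mixed = mixture(f, g, lam)
    print(lam, mean(mixed), upper_endpoint(mixed))
```

Along this path the mean moves away from mean(g) only by an amount of order λ, whereas the upper endpoint equals 1 at λ = 0 and 10 for every λ > 0.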
To avoid further degenerate behaviour, we impose a richness assumption on potential intermediate functionals $T'$ in the sense that we require the image $T'(\mathcal{F}) \subseteq \mathbb{R}^k$ to have at least non-empty interior. This assumption is natural for large enough classes $\mathcal{F}$ and was, for instance, used by Fissler and Ziegel (2016, 2019) when establishing results on consistent scoring functions for $T$.
In addition to mixture-continuity, we follow Lambert et al. (2008) and consider only functionals with elicitable components. Summarising, the first family of functionals which we consider in our complexity result is $\mathcal{U}_k := \{T' \in \mathcal{E}_k(\mathcal{F}) \mid T' \text{ mixture-continuous with elicitable components},\ \operatorname{int}(T'(\mathcal{F})) \neq \emptyset\}$, where $\operatorname{int}(B)$ denotes the interior of a set $B \subseteq \mathbb{R}^k$. Alternatively, we require that the image $T'(\mathcal{F})$ of a potential intermediate functional $T'$ has not only non-empty interior, but is itself an open set, i.e. we consider the family $\mathcal{V}_k := \{T' \in \mathcal{E}_k(\mathcal{F}) \mid T' \text{ mixture-continuous with elicitable components},\ T'(\mathcal{F}) \text{ open}\}$. We are now in position to consider the elicitation complexity of max-functionals with respect to these families.
Theorem 3.7. Let $T : \mathcal{F} \to \mathbb{R}$ be a max-functional. Then the following hold true.
(a) $T$ has elicitation complexity ∞ with respect to $(\mathcal{U}_k)_{k \in \mathbb{N}}$ unless $T(\mathcal{F})$ contains its supremum.
(b) $T$ has elicitation complexity ∞ with respect to $(\mathcal{V}_k)_{k \in \mathbb{N}}$ unless $T$ is constant.
Proof. Assume there is a $k \in \mathbb{N}$, a functional $T' : \mathcal{F} \to A'$ in $\mathcal{U}_k$ or $\mathcal{V}_k$ and a function $f : A' \to \mathbb{R}$ such that $T = f \circ T'$. Without loss of generality, $T'$ is surjective, hence its mixture-continuity together with the assumed convexity of $\mathcal{F}$ imply that $A'$ is path-connected. Since it has non-empty interior, we can choose a hyperrectangle $Q := [c_1, d_1] \times \dots \times [c_k, d_k] \subseteq A'$ with $c_i < d_i$ for all $i$ and consider each component of $T'$ isolated on $Q$. To do so, choose a component $j \in \{1, \dots, k\}$ and a $z_i \in [c_i, d_i]$ for all $i \in \{1, \dots, k\} \setminus \{j\}$. We can then obtain $F_{c_j,z}, F_{d_j,z} \in \mathcal{F}$ such that $T'(F_{c_j,z}) = (z_1, \dots, z_{j-1}, c_j, z_{j+1}, \dots, z_k)$ and $T'(F_{d_j,z}) = (z_1, \dots, z_{j-1}, d_j, z_{j+1}, \dots, z_k)$.
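All components of $T'$ are elicitable and thus have convex level sets by Theorem 2.3. Consequently, the $i$-th component, where $i \in \{1, \dots, k\} \setminus \{j\}$, equals $z_i$ for all convex combinations of $F_{c_j,z}$ and $F_{d_j,z}$. If we define $A_{j,z} := \{(z_1, \dots, z_{j-1}, x, z_{j+1}, \dots, z_k) \mid x \in (c_j, d_j)\}$, the fact that the $j$-th component has convex level sets and is mixture-continuous implies that for all $a \in A_{j,z}$ there exists a $\lambda \in (0, 1)$ with $T'(\lambda F_{c_j,z} + (1 - \lambda) F_{d_j,z}) = a$. Since $T$ is a max-functional, this gives $f(z_1, \dots, z_{j-1}, x, z_{j+1}, \dots, z_k) = \max(T(F_{c_j,z}), T(F_{d_j,z}))$ for all $x \in (c_j, d_j)$, implying that $f$ has to be constant on the set $A_{j,z}$. Repeating this argument for any choice of $j \in \{1, \dots, k\}$ and $z_i \in [c_i, d_i]$ with $i \in \{1, \dots, k\} \setminus \{j\}$ shows that there is a $C \in \mathbb{R}$ such that $f(q) = C$ for all $q \in \operatorname{int}(Q)$.
Now fix $x_0 \in \operatorname{int}(Q)$. For any $x_1 \in A'$ we can choose distributions $F_0, F_1 \in \mathcal{F}$ with $T'(F_0) = x_0$ and $T'(F_1) = x_1$. Since $x_0 \in \operatorname{int}(Q)$ and $T'$ is mixture-continuous, there is a small $\mu \in (0, 1)$ such that $T'(\mu F_1 + (1 - \mu) F_0) \in \operatorname{int}(Q)$ holds. We thus obtain $C = f(T'(\mu F_1 + (1 - \mu) F_0)) = T(\mu F_1 + (1 - \mu) F_0) = \max(T(F_0), T(F_1)) = \max(C, f(x_1))$, implying $f(x_1) \le C$. Since $x_1$ was arbitrary, we have $f(x) \le C$ for all $x \in A'$, showing $C = \sup T(\mathcal{F})$ and proving statement (a).
Assume now that $A'$ is open. Then for every $x_1 \in A'$ there is a hyperrectangle $Q_1 \subseteq A'$ such that $x_1 \in \operatorname{int}(Q_1)$. Arguing as in the beginning of the proof gives $f(q) = f(x_1)$ for all $q \in \operatorname{int}(Q_1)$. So letting $T'(F_1) = x_1$ as above we obtain a $\nu \in (0, 1)$ such that $T'(\nu F_0 + (1 - \nu) F_1) \in \operatorname{int}(Q_1)$ and hence $f(x_1) = f(T'(\nu F_0 + (1 - \nu) F_1)) = T(\nu F_0 + (1 - \nu) F_1) = \max(T(F_0), T(F_1)) = \max(C, f(x_1)) = C$. Since $x_1$ was arbitrary, $T$ must be constant, proving part (b).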
Theorem 3.7 implies infinite elicitation complexity of max-functionals in a wide range of natural settings. Ultimately, our main interest lies in understanding the elicitation complexity with respect to the more general family $(\mathcal{U}_k)_{k \in \mathbb{N}}$, which imposes only very weak assumptions on potential intermediate functionals.
Corollary 3.8. Let $T : \mathcal{F} \to \mathbb{R}$ be a max-functional and let one of the following conditions be satisfied.
(i) $T$ is unbounded.
(ii) $T$ is surjective onto an open interval $(a, b)$.
(iii) $T$ is surjective onto a half-open interval $[a, b)$.
Then $T$ has elicitation complexity ∞ with respect to $(\mathcal{U}_k)_{k \in \mathbb{N}}$.
Alternatively, considering elicitation complexity with respect to $(\mathcal{V}_k)_{k \in \mathbb{N}}$ amounts to requiring more regularity for a potential intermediate functional $T'$ and, in this case, all non-constant max-functionals have infinite elicitation complexity. Lemma 2.7 further implies that the infinite elicitation complexity of max-functionals also extends to larger classes than the considered convex family of distribution functions $\mathcal{F}$ and is valid with respect to smaller families contained in $(\mathcal{U}_k)_{k \in \mathbb{N}}$ or $(\mathcal{V}_k)_{k \in \mathbb{N}}$.
Finally, by definition, any functional of finite elicitation complexity is conditionally elicitable, but it is unclear whether the reverse implication holds. We thus conclude by showing that max-functionals with infinite elicitation complexity can neither be conditionally elicitable nor jointly elicitable.
Theorem 3.9. Let $T : \mathcal{F} \to \mathbb{R}$ be a max-functional such that $\operatorname{elic}(T) = \infty$ with respect to a family $(\mathcal{C}_k)_{k \in \mathbb{N}}$. Let $T' : \mathcal{F} \to A'$ be a functional with $T' \in \mathcal{C}_m$ for some $m \in \mathbb{N}$. Then the following hold true.
(a) $T$ is not conditionally elicitable given $T'$.
(b) $T$ is not jointly elicitable with $T'$.
Proof. For the first part assume to the contrary that there is an $m \in \mathbb{N}$ and a functional $T' \in \mathcal{C}_m$ such that $T$ is conditionally elicitable given $T'$. That is, $T$ is elicitable on the subclass $\mathcal{F}_x = \{F \in \mathcal{F} \mid T'(F) = x\}$ for any $x \in A'$. By assumption, there is no link function $f$ such that $T = f \circ T'$ holds. Consequently, there is at least one $z \in A' \subseteq \mathbb{R}^m$ such that $T$ is not constant on $\mathcal{F}_z$. If $z$ defines such a class, then it is convex due to the elicitability of $T'$ and moreover we can find $F_0, F_1 \in \mathcal{F}_z$ such that $T(F_0) \neq T(F_1)$ holds. Theorem 3.3 now implies that the restriction of $T$ to $\mathcal{F}_z$ cannot be elicitable, a contradiction to the conditional elicitability of $T$.
For the second part note that, as remarked in Section 2 and in the discussion of Fissler and Ziegel (2016), the joint elicitability of $T$ with an elicitable functional $T'$ implies that $T$ is conditionally elicitable given $T'$. Consequently, the first part of the proof implies the result.
We conclude this section with a technical remark. In the spirit of Frongillo and Kash (2018), our complexity result (Theorem 3.7) employs regularity assumptions on the possible intermediate functionals. The main assumption is that they possess elicitable components. Why this is essential is illustrated by the use of the hyperrectangle $Q$ in the proof. Intuitively, this assumption can be relaxed at the cost of more technical arguments. The main challenge is then to control the values of $T'$ in a small hyperrectangle (or ball) around some $x_0 \in \operatorname{int}(A')$. However, we did not pursue this approach further, since we believe that our setting covers many functionals of practical interest and at the same time illustrates the irregular behaviour that will be inherent to any link function for a max-functional.

Examples of max-functionals
Prominent examples of max-functionals, to which the results of Section 3 apply, are routinely considered in extreme value theory and are key characteristics for the purpose of inference on the tail of a distribution.
Upper endpoint For a real-valued random variable with distribution function $F$, its upper endpoint is the supremum of its support, $x_F := \sup\{x \in \mathbb{R} \mid F(x) < 1\}$. By definition, the upper endpoint can be interpreted as a real-valued max-functional on the convex class $\{F \in \mathcal{F} \mid x_F < \infty\}$. Bellini and Bignozzi (2015, Example 3.9) discuss the upper endpoint under the name worst-case risk measure and show that it is not elicitable, once further regularity conditions on the admissible scoring functions are imposed. In light of Corollary 3.4 the non-elicitability of the upper endpoint follows without any further assumptions. In addition it has infinite elicitation complexity in the sense of Theorem 3.7 and Corollary 3.8.
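That the upper endpoint of a proper mixture equals the larger of the two component endpoints can be checked by brute force on randomly generated discrete distributions. The sketch below is purely illustrative; the generator `random_pmf`, the seed and the range of mixture weights are assumptions made for the example.

```python
import random


def upper_endpoint(pmf):
    """Largest support point carrying positive probability."""
    return max(y for y, p in pmf.items() if p > 0)


def mixture(f0, f1, lam):
    """Return lam * f1 + (1 - lam) * f0 for discrete distributions given as dicts."""
    support = set(f0) | set(f1)
    return {y: lam * f1.get(y, 0.0) + (1 - lam) * f0.get(y, 0.0) for y in support}


def random_pmf(rng, size=4):
    """Hypothetical generator: `size` distinct integer points with positive weights."""
    points = rng.sample(range(100), size)
    weights = [rng.random() + 0.1 for _ in points]
    total = sum(weights)
    return {float(y): w / total for y, w in zip(points, weights)}


rng = random.Random(0)
checks = []
for _ in range(1000):
    f0, f1 = random_pmf(rng), random_pmf(rng)
    lam = rng.uniform(0.01, 0.99)
    lhs = upper_endpoint(mixture(f0, f1, lam))
    checks.append(lhs == max(upper_endpoint(f0), upper_endpoint(f1)))

all_hold = all(checks)
print(all_hold)
```

The mixture weight is kept strictly inside (0, 1), matching the fact that the max-functional property concerns proper convex combinations only.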

Index of regular variation / Tail index When the upper endpoint is infinite, another key characteristic to describe the tail behaviour of heavy-tailed distributions is the index of regular variation. A strictly positive measurable function $f$ satisfying $\lim_{x \to \infty} f(tx)/f(x) = t^{\rho(f)}$ for $t > 0$ is called regularly varying (at infinity) with index $\rho(f) \in \mathbb{R}$. For a distribution $F$ its index of regular variation is the respective index for its survival function $\bar F := 1 - F$, that is, $T(F) := \rho(\bar F)$. Its inverse $T(F)^{-1}$ is also called tail index in the risk management literature, cf. McNeil et al. (2015, Section 5.1). If the tail $\bar F$ is regularly varying with (a negative) index $\rho$, this means that $\bar F$ decays essentially like a power function with decay rate $1/\rho$. Since $\rho(f + g) = \max(\rho(f), \rho(g))$ (cf. e.g. de Haan and Ferreira (2006, Proposition B.1.9)), the index of regular variation $T$ is naturally a max-functional, while the tail index $T^{-1}$ is a min-functional.
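The identity ρ(f + g) = max(ρ(f), ρ(g)) can be illustrated numerically for a mixture of two Pareto-type survival functions. In the sketch below the exponents 2 and 5, the weight 0.1 and the evaluation point are illustrative assumptions, and `local_index` is merely a finite-x proxy for the limit defining the index.

```python
import math


def mixture_survival(x, lam, alpha_f=2.0, alpha_g=5.0):
    """Survival function of lam * F + (1 - lam) * G with Pareto tails on [1, inf)."""
    return lam * x ** (-alpha_f) + (1 - lam) * x ** (-alpha_g)


def local_index(survival, x, t=2.0):
    """log(survival(t x) / survival(x)) / log(t), which tends to the index as x grows."""
    return math.log(survival(t * x) / survival(x)) / math.log(t)


lam = 0.1
estimate = local_index(lambda x: mixture_survival(x, lam), 1e6)
print(estimate)
```

Although the heavier component carries only weight 0.1, the proxy is numerically indistinguishable from max(-2, -5) = -2 at this evaluation point, in line with the max-functional property of the index.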
Tail-separating functionals More generally, we can deduce that the property of 'being a max-functional' (or min-functional) is in fact inherent to all 'tail-ordering indices'.
To make this precise, let us consider the following natural order on distribution tails. For two distribution functions F and G with upper endpoints x F , x G ∈ R ∪ {∞} we say that G has heavier tail than F and write We say that F and G are tail equivalent and write F ∼ t G if they share the same upper endpoint Note that "< t " defines a strict partial order on any set of distribution functions F and that for tail equivalent F and G neither F < t G nor G < t F can hold. The following proposition shows that a functional which respects the tail order "< t " is a max-functional.
Proposition 4.1. Let T : F → R be a functional that satisfies for all F, G ∈ F Then T is a max-functional.
By symmetry, the case F 1 < t F 0 can be treated analogously. In the remaining case we have neither where the latter follows as the tail of F 0 is not heavier than the tail of F 1 . This implies that neither F 1 < t F λ nor F λ < t F 1 can hold true, which gives T (F λ ) = T (F 1 ) = max(T (F 0 ), T (F 1 )) and concludes the proof.
Another instance of a tail-ordering functional in the sense of Proposition 4.1 is the M-index as introduced in Cadena and Kratz (2016). If it exists, it is the unique ρ ∈ R such that lim x→∞ F (x) x ρ+ε = 0 and lim It is easily seen that the M-index coincides with the index of regular variation for distribution functions F with regularly varying tail function F . As it sorts survival functions according to their power law decay, Proposition 4.1 implies that the M-index is a max-functional.
Extreme value index A central characteristic of extreme value theory is the extreme value index, which classifies the limiting behaviour of rescaled maxima of growing samples from a distribution. More precisely, if there exist suitable location-scale normings $a_n > 0$, $b_n \in \mathbb{R}$ such that the distribution functions $F_n(x) := F^n(a_n x + b_n)$ converge weakly to a non-degenerate distribution function $G$, the limiting distribution function $G$ is necessarily a Generalized Extreme Value Distribution (GEV). This means that up to a location-scale normalization we have $G(x) = G_\gamma(x) := \exp(-(1 + \gamma x)^{-1/\gamma})$ for $1 + \gamma x > 0$ with some $\gamma \in \mathbb{R}$, where the case $\gamma = 0$ is read as $\exp(-e^{-x})$. The distribution $F$ is said to be in the max-domain of attraction of $G = G_\gamma$ and the shape parameter $\gamma(F)$ is the extreme value index (EVI) of $F$, cf. e.g. the monographs Resnick (1987) and de Haan and Ferreira (2006) for further background.
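The convergence of rescaled maxima can be illustrated for a Pareto distribution, for which the standard norming a_n = n^(1/α), b_n = 0 leads to a Fréchet-type limit with extreme value index γ = 1/α. The sketch below uses the usual parametrization G_γ(x) = exp(−(1 + γx)^(−1/γ)); the values of `alpha`, `n` and `x` are illustrative assumptions.

```python
import math


def gev_cdf(x, gamma):
    """G_gamma(x) = exp(-(1 + gamma * x)^(-1/gamma)); Gumbel case for gamma = 0."""
    if gamma == 0.0:
        return math.exp(-math.exp(-x))
    base = 1.0 + gamma * x
    if base <= 0.0:
        # Outside the support: below the lower endpoint (gamma > 0)
        # or above the upper endpoint (gamma < 0).
        return 0.0 if gamma > 0 else 1.0
    return math.exp(-base ** (-1.0 / gamma))


def pareto_cdf(x, alpha):
    """Pareto distribution function 1 - x^(-alpha) on [1, inf)."""
    return 1.0 - x ** (-alpha) if x >= 1.0 else 0.0


alpha, n, x = 2.0, 10 ** 6, 1.5
a_n = n ** (1.0 / alpha)

rescaled_max_cdf = pareto_cdf(a_n * x, alpha) ** n      # F^n(a_n x)
frechet_limit = math.exp(-x ** (-alpha))                # limit exp(-x^(-alpha))
gev_value = gev_cdf(alpha * (x - 1.0), 1.0 / alpha)     # same limit in GEV form

print(rescaled_max_cdf, frechet_limit, gev_value)
```

The last line re-expresses the Fréchet limit as G_γ evaluated at α(x − 1) with γ = 1/α, which is the location-scale normalization mentioned in the text.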
Let $\mathcal{F}$ be the class of distribution functions which are in a max-domain of attraction for some GEV and consider first the EVI on the subclass of heavy-tailed distributions $\mathcal{F}_+ = \{F \in \mathcal{F} \mid \gamma(F) > 0\}$. It is well-known that a distribution $F \in \mathcal{F}$ has EVI $\gamma > 0$ if and only if $\rho(\bar F) = -\gamma^{-1}$, where $\rho$ is the index of regular variation (cf. e.g. Resnick (1987, Proposition 1.11)). Consequently, the EVI $\gamma$ is also a max-functional on $\mathcal{F}_+$.
When considering the class of light-tailed distributions, i.e. the case $\gamma(F) < 0$, we need to specify an upper endpoint first in order to make 'being a max/min-functional' meaningful for the EVI $\gamma$. To this end, let $\mathcal{F}_{x^*} := \{F \in \mathcal{F} \mid \gamma(F) < 0,\ x_F = x^*\}$ for a fixed $x^* \in \mathbb{R}$. Again the EVI behaviour is governed by regular variation, since $\gamma(F) = -\gamma(F^*)$ with $F^*(x) = F(x^* - x^{-1})$ (cf. e.g. Resnick (1987, Proposition 1.13)). This shows that the EVI $\gamma$ is a min-functional on the class $\mathcal{F}_{x^*}$. Note that it is crucial to assume equal upper endpoints, because otherwise it is not the EVI that dominates the tail behaviour, but the upper endpoint itself.
So far, we have looked at statistical indices that classify univariate tail behaviour. However, similar issues arise when we want to quantify joint tail behaviour in higher dimensions. As an example, let us consider the coefficient of tail dependence.
Coefficient of tail dependence In order to quantify the tail behaviour of a bivariate distribution function, Ledford and Tawn (1996, 1997) introduced the coefficient of tail dependence. For a bivariate distribution function $F$ of a random vector $(X_1, X_2)$ let us write $\bar F_i(x) := P(X_i > x)$, $i = 1, 2$, and $\bar F(x) := P(X_1 > x, X_2 > x)$ for the associated survival functions. Suppose there is an $\alpha > 0$ such that both $\bar F_1$ and $\bar F_2$ are regularly varying with index $-\alpha$. If in addition the joint survival function $\bar F$ is regularly varying with index $-\alpha/\eta$ for some $\eta \in (0, 1]$, the coefficient $\eta = \eta(F)$ is called coefficient of tail dependence (CTD) of the bivariate distribution $F$. Let us consider the CTD $\eta$ on the class of bivariate distributions $\mathcal{F}_\alpha := \{F \mid \bar F_1, \bar F_2 \text{ regularly varying with index } -\alpha,\ \bar F \text{ regularly varying with index } -\alpha/\eta(F) \text{ for some } \eta(F) \in (0, 1]\}$. Then it follows for $F, G \in \mathcal{F}_\alpha$ that $\rho(\lambda \bar F + (1 - \lambda) \bar G) = -\alpha/\max(\eta(F), \eta(G))$ by the properties of the index of regular variation. Hence $\eta$ is a max-functional on $\mathcal{F}_\alpha$.

Proper scoring rules and max-functionals
In probabilistic forecasting, the whole distribution function instead of a single value is reported to the decision maker. Analogously to a scoring function, a scoring rule then assigns a score based on the forecasted distribution and a realizing observation. The scoring rule is called proper if its expected score with respect to a distribution is minimized whenever the forecast coincides with this distribution, see e.g. Gneiting and Raftery (2007) or Dawid (2007) for recent reviews.
In light of the results of Section 3, the following approach may seem reasonable to someone seeking information about a max-functional: Instead of single values, distribution functions are reported and evaluated via proper scoring rules. Then the max-functionals are computed from the forecasted distributions.
If the max-functional of interest is a property of the tail, e.g. the extreme value index, one could expect this method to work well as long as the scoring rule shows a good performance in the tails. In order to emphasize specific regions of interest, in particular the tails, Gneiting and Ranjan (2011) and Diks et al. (2011) combined scoring rules with weight functions. Drawbacks and benefits of these weighted proper scoring rules were further studied in Lerch et al. (2017) and Holzmann and Klar (2017), where the latter propose general construction principles. A theoretical problem is pointed out by Taillardat et al. (2019), who show that weighted versions of the continuous ranked probability score (CRPS) cannot detect that two distributions are not tail equivalent.
This section shows that the problems detected by Taillardat et al. (2019) occur also for max-functionals and do not depend on the specific choice of proper scoring rule. Simply put, the expected score difference of two distributions can be arbitrarily small while their values for a max-functional can be large. As previously, $\mathcal{F}$ is a convex set of distribution functions on $\mathsf{O} \subseteq \mathbb{R}^d$. In our notation we follow Gneiting and Raftery (2007) as well as Section 4. For clarity of presentation we require all scoring rules to be $\mathcal{F}$-integrable, while Gneiting and Raftery (2007) only require quasi-integrability. The latter means that the expected score $\bar S(G, F)$ is well-defined (though not necessarily finite) for all $G, F \in \mathcal{F}$. Our assumption of $\mathcal{F}$-integrability is however only a minor restriction, which can be relaxed as discussed below.
A popular choice of scoring rule is the (weighted) continuous ranked probability score, abbreviated by CRPS (wCRPS). For some weight function w : R → [0, ∞) the wCRPS is defined via
$$\mathrm{wCRPS}(F, y) := \int_{-\infty}^{\infty} w(x)\,\bigl(F(x) - \mathbb{1}(y \le x)\bigr)^2 \, dx$$
and the CRPS is obtained in the special case where w is equal to one (Gneiting and Ranjan, 2011; Matheson and Winkler, 1976). In order to emphasize the right tail, the choice w(x) = 1(q ≤ x) for some threshold q ∈ R can be used. Both wCRPS and CRPS are proper scoring rules as long as F contains only distributions with finite first moments. In this case the CRPS is even strictly proper, while the wCRPS is strictly proper only under additional assumptions, see Gneiting and Raftery (2007), Gneiting and Ranjan (2011) and Holzmann and Klar (2017).
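The integral defining the score can be approximated numerically. The following is a minimal sketch, assuming the integral representation wCRPS(F, y) = ∫ w(x) (F(x) − 1(y ≤ x))² dx and a standard normal forecast; the function names, the integration bounds and the grid size are illustrative choices and not part of the original text.

```python
import math


def wcrps(cdf, y, weight=lambda x: 1.0, lo=-10.0, hi=10.0, n=20_000):
    """Midpoint-rule approximation of the integral of
    weight(x) * (cdf(x) - 1{y <= x})^2 over [lo, hi].

    The truncation to [lo, hi] is an assumption; it is harmless only if
    the integrand is negligible outside this interval.
    """
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        indicator = 1.0 if y <= x else 0.0
        total += weight(x) * (cdf(x) - indicator) ** 2
    return total * h


def normal_cdf(x):
    """Distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Unweighted CRPS of a standard normal forecast at the observation y = 0.
crps_value = wcrps(normal_cdf, 0.0)

# Threshold weight w(x) = 1(q <= x) with q = 1 emphasizes the right tail.
tail_value = wcrps(normal_cdf, 0.0, weight=lambda x: 1.0 if x >= 1.0 else 0.0)

# Closed form of the CRPS for a standard normal forecast at y = 0.
closed_form = 2.0 / math.sqrt(2.0 * math.pi) - 1.0 / math.sqrt(math.pi)

print(crps_value, closed_form, tail_value)
```

Since the threshold weight is bounded by one, the weighted score never exceeds the unweighted one in this sketch.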
As demonstrated by Taillardat et al. (2019, Section 2), the wCRPS is not able to clearly distinguish between different tail behavior. More precisely, given a distribution G and ε > 0, it is always possible to construct a distribution F that is not tail equivalent to G and such that
$$\bigl|\mathbb{E}\,\mathrm{wCRPS}(F, Y) - \mathbb{E}\,\mathrm{wCRPS}(G, Y)\bigr| \le \varepsilon,$$
where Y has distribution G. This result shows that for any distribution G the tail can be modified while keeping the expected wCRPS ε-close to its minimum. As put by Taillardat et al. (2019), this means that the wCRPS is not a tail equivalent score.
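For the wCRPS, the closeness of expected scores can be made explicit. The following worked identity is a sketch, assuming the integral representation wCRPS(F, y) = ∫ w(x) (F(x) − 1(y ≤ x))² dx, a bounded weight function w, and finite first moments of F and G, so that all integrals are finite. The mixture F_λ := λF + (1 − λ)G with λ ∈ [0, 1] is introduced here for illustration and need not coincide with the construction used by Taillardat et al. (2019). Throughout, Y has distribution G.

```latex
\begin{align*}
\mathbb{E}\,\mathrm{wCRPS}(F, Y) - \mathbb{E}\,\mathrm{wCRPS}(G, Y)
  &= \int w(x)\,\bigl(F(x)^2 - 2F(x)G(x) + G(x)\bigr)\,dx
     - \int w(x)\,\bigl(G(x) - G(x)^2\bigr)\,dx \\
  &= \int w(x)\,\bigl(F(x) - G(x)\bigr)^2\,dx, \\[4pt]
\mathbb{E}\,\mathrm{wCRPS}(F_\lambda, Y) - \mathbb{E}\,\mathrm{wCRPS}(G, Y)
  &= \lambda^2 \int w(x)\,\bigl(F(x) - G(x)\bigr)^2\,dx .
\end{align*}
```

The first line uses that the indicator 1(Y ≤ x) has expectation G(x); the last line uses F_λ − G = λ(F − G). Hence any λ with λ² ∫ w(x) (F(x) − G(x))² dx ≤ ε gives an ε-close expected score, even when F has a much heavier tail than G.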
In the following we show that all proper scoring rules fail to be tail equivalent in this sense. Moreover, we extend these findings to max-functionals, i.e. we show that no proper scoring rule is max-functional equivalent. Both findings are immediate consequences of the subsequent continuity considerations for scoring rules.
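Proof. We proceed similarly to the proof of Nau (1985, Proposition 3). Let F, G ∈ F and denote F_λ := λF + (1 − λ)G for λ ∈ [0, 1). Since the expected score is linear in its second argument, we obtain the inequality
$$\lambda \bar{S}(F_\lambda, F) + (1-\lambda)\,\bar{S}(F_\lambda, G) = \bar{S}(F_\lambda, F_\lambda) \le \bar{S}(G, F_\lambda) = \lambda \bar{S}(G, F) + (1-\lambda)\,\bar{S}(G, G),$$
since S is a proper scoring rule. Rearranging and using $\bar{S}(F_\lambda, F) \ge \bar{S}(F, F)$ as well as $\bar{S}(F_\lambda, G) \ge \bar{S}(G, G)$ leads to
$$0 \le \bar{S}(F_\lambda, G) - \bar{S}(G, G) \le \frac{\lambda}{1-\lambda}\,\bigl(\bar{S}(G, F) - \bar{S}(F, F)\bigr)$$
for λ ∈ [0, 1) and the right-hand side of this inequality vanishes as λ ↓ 0.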
The argument of the proof of Lemma 5.3 can be extended to quasi-integrable scoring rules as considered in Gneiting and Raftery (2007). The additional requirement is that the expected score S̄(G, F) is finite and that S is regular, i.e. S̄(F, F) ∈ R for all F ∈ F.
We can now turn our attention to the main result of this section. It is motivated by the observation that tail equivalence and max-functionals lead to a similar kind of discontinuity on the convex combinations λF + (1 − λ)G, which intuitively conflicts with the diagonal-continuity of proper scoring rules. This allows for an extension of the results of Taillardat et al. (2019). Recall the tail-ordering from Section 4 and that we assume F to be convex.
Theorem 5.4. Let S : F × R → R be a proper scoring rule and G ∈ F. Then the following are true.
(a) If there is an F ∈ F with heavier tail than G, then for all ε > 0 there is an F_ε ∈ F that is not tail equivalent to G and such that
$$\bigl|\bar{S}(F_\varepsilon, G) - \bar{S}(G, G)\bigr| \le \varepsilon.$$

Proof. Fix G ∈ F and let S be a proper scoring rule. For F ∈ F set F_λ := λF + (1 − λ)G.
Since F is convex, we have F_λ ∈ F for all λ ∈ [0, 1]. Moreover, S is diagonal-continuous at G by Lemma 5.3, implying that for all ε > 0 and F ∈ F we can find a δ ∈ (0, 1] such that |S̄(F_λ, G) − S̄(G, G)| ≤ ε holds for all λ ∈ [0, δ]. Now assume there is an F ∈ F with heavier tail than G. If x_F > x_G, we have x_{F_λ} > x_G for all λ ∈ (0, 1]. If on the other hand x_F = x_G =: x^*, then
$$\frac{1 - F_\lambda(x)}{1 - G(x)} \ge \lambda\,\frac{1 - F(x)}{1 - G(x)}$$
for x < x^* and the right-hand side goes to infinity as x → x^*. Hence, in both cases the distributions F_λ cannot be tail equivalent to G for λ ∈ (0, 1], showing part (a). For the second part, let F ∈ F satisfy T(F) > T(G). Since T is a max-functional, T(F_λ) = T(F) > T(G) holds for λ ∈ (0, 1], proving part (b).
The first part of Theorem 5.4 shows that the lack of tail equivalence is not a flaw of the wCRPS, but inherent to all proper scoring rules (up to integrability assumptions). The second part extends this non-equivalence of proper scoring rules to max-functionals. Loosely speaking, there exist not only pairs of distributions that are not tail equivalent, but also pairs of distributions with arbitrarily different max-functional values, and in both cases the expected scores are almost identical.
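The effect is not specific to the wCRPS and can be illustrated numerically with a different proper scoring rule. The following is a minimal sketch under several assumptions that are not part of the original text: G is the standard exponential distribution, F is a Pareto-type (Lomax) distribution with survival function (1 + x)^(−α) and the hypothetical choice α = 3, and S is the logarithmic score S(F, y) = −log f(y), whose expected score gap under G is the Kullback-Leibler divergence of G from the mixture F_λ = λF + (1 − λ)G. All function names are illustrative.

```python
import math

ALPHA = 3.0  # hypothetical shape of F; its extreme value index is 1 / ALPHA


def g_density(y):
    """Density of the standard exponential distribution G on [0, inf)."""
    return math.exp(-y)


def f_density(y):
    """Density of the Lomax distribution F with survival (1 + y)^(-ALPHA)."""
    return ALPHA * (1.0 + y) ** (-ALPHA - 1.0)


def log_score_gap(lam, upper=60.0, n=60_000):
    """Expected log score of F_lam minus that of G, both under G.

    Equals the integral of g * log(g / f_lam), approximated by the
    midpoint rule on [0, upper]; the truncation at `upper` is an assumption.
    """
    h = upper / n
    total = 0.0
    for i in range(n):
        y = (i + 0.5) * h
        g = g_density(y)
        mix = lam * f_density(y) + (1.0 - lam) * g
        total += g * math.log(g / mix)
    return total * h


def survival_ratio(lam, x):
    """(1 - F_lam(x)) / (1 - G(x)), which is unbounded in x for lam > 0."""
    return lam * (1.0 + x) ** (-ALPHA) * math.exp(x) + (1.0 - lam)


lambdas = (0.5, 0.1, 0.01)
gaps = {lam: log_score_gap(lam) for lam in lambdas}
ratios = {lam: survival_ratio(lam, 40.0) for lam in lambdas}

for lam in lambdas:
    print(lam, gaps[lam], ratios[lam])
```

In this sketch the score gap is bounded by −log(1 − λ), because the mixture density is at least (1 − λ) times the density of G, so it shrinks as λ decreases. At the same time, the survival function of F_λ behaves like λ x^(−α) for large x whenever λ > 0, so the extreme value index of F_λ stays at 1/α, whereas that of G is zero.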

Discussion
Recent research investigates the elicitation properties of widely used statistical functionals. When the emphasis lies on an understanding of tail properties, typical functionals to characterize this behaviour fall into the class of max-functionals. In particular, all functionals that order distribution tails belong to this class (cf. Proposition 4.1). We show here that max-functionals not only fail to be elicitable (Theorem 3.3), but in fact have infinite elicitation complexity in a wide range of settings (Theorem 3.7). This contrasts with situations in which non-elicitability is alleviated by a finite elicitation complexity, as is the case, for instance, for the variance or the Expected Shortfall (Fissler and Ziegel, 2016; Frongillo and Kash, 2018). Rather, it bears resemblance to the mode, which is non-elicitable and has infinite elicitation complexity as well, see Heinrich (2014) and Dearborn and Frongillo (2019). As an alternative to point forecasts, one may allow the max-functional to be reported via the entire distribution function. In principle, such probabilistic forecasts can be compared using proper scoring rules. However, Theorem 5.4 demonstrates that this approach does not lead to a satisfying comparison of the associated max-functional values either, in the sense that the difference of expected scores can be arbitrarily small, although the difference of max-functional values may be large. This complements recent findings of Taillardat et al. (2019) and extends them from the wCRPS to all integrable proper scoring rules.
From an applied viewpoint, our results show that expected scores are not suitable to access tail information for regression, M-estimation or comparative forecast evaluation. Intuitively as well, connecting scoring functions and genuine tail properties is obstructed by the fact that finite samples never contain sufficiently rich information on tail behaviour. What might come to the rescue, though, is that the max-functionals themselves are often not the main concern in practical applications, but rather a tool to guide the extrapolation from intermediate order statistics to the functionals of interest, which may include, for instance, a high quantile. Nevertheless, the problems of proper scoring rules presented in Theorem 5.4 cast doubt on the ability of proper weighted scoring rules to distinguish different tail regimes and provide an alternative theoretical foundation for the limitations of weighted scoring rules described in Lerch et al. (2017) and Holzmann and Klar (2017). Likewise, Friederichs and Thorarinsdottir (2012) experience difficulties in the usage of scoring rules for estimating the shape parameter of generalized extreme value distributions. Our results illustrate that these problems are unavoidable whenever dealing with max-functionals or tail equivalence. We thus anticipate that techniques for the comparative assessment of forecasts in such settings will remain an active area of research.