Forecast Evaluation of Quantiles, Prediction Intervals, and other Set-Valued Functionals

We introduce a theoretical framework for the elicitability and identifiability of set-valued functionals, such as quantiles, prediction intervals, and systemic risk measures. A functional is elicitable if it is the unique minimiser of an expected scoring function, and identifiable if it is the unique zero of an expected identification function; both notions are essential for forecast ranking and validation, and for $M$- and $Z$-estimation. Our framework distinguishes between exhaustive forecasts, which are set-valued and aim at correctly specifying the entire functional, and selective forecasts, which are content with specifying only a single point in the correct functional. We establish a mutual exclusivity result: a set-valued functional can be either selectively elicitable or exhaustively elicitable, or not elicitable at all. Notably, since quantiles are well known to be selectively elicitable, they fail to be exhaustively elicitable. We further show that classes of prediction intervals and Vorob'ev quantiles turn out to be exhaustively elicitable and selectively identifiable. In particular, we provide a mixture representation of elementary exhaustive scores, leading the way to Murphy diagrams. We give possibility and impossibility results for the shortest prediction interval and for prediction intervals specified by an endpoint or a midpoint. We end with a comprehensive literature review on common practice in forecast evaluation of set-valued functionals.


Introduction
Humanity faces the need to make decisions despite uncertainty. This uncertainty has many sources, one of which stems from unknown or random future events. For example, in agriculture the right time for harvesting depends on the weather in the forthcoming days; in business, an investment decision in a production facility depends on future demand; in politics, decisions for travel restrictions or lockdowns heavily depend on the anticipated future development of a pandemic. Predicting uncertain future events is therefore urgent and ubiquitous, dating back at least to the ancient Delphic Oracle and spanning the time up to today's sophisticated quantitative epidemiological models.
The presence of various sorts of forecasts and humanity's reliance on them calls for a careful assessment and evaluation, basically focusing on two complementary aspects: First, are given forecasts good or reliable in absolute terms? And second, how well have certain forecasts performed in comparison to some alternative predictions, assessing their relative quality? Clearly, these questions can only be answered ex post, given observations of the future events in question. Then, the reliability or calibration can be assessed in terms of moment or identification functions. Forecast comparison and ranking, in turn, is commonly performed in terms of loss or scoring functions. To perform forecast evaluation properly, one must specify a certain quality criterion or directive the forecasts should follow. This directive might be given indirectly in terms of a cost, loss, or score, such that good forecasts aim to minimise this criterion (in expectation). This directive can also be formulated directly, such as the whole probability distribution of the uncertain event, capturing the entire inherent uncertainty, or as a summary measure thereof, called functional, such as the mean, variance, or a certain risk measure. When the directive is specified in the latter way, it is crucial that the tools of forecast evaluation, chiefly scoring and identification functions, are in line with this directive. This alignment leads to the notions of strictly consistent scores, which are minimised in expectation by the correctly specified forecasts, and strict identification functions, whose roots in expectation are the correctly specified forecasts.
The literature on forecast evaluation has mainly focused on single-valued functionals T, such as real-valued and vector-valued point forecasts or probabilistic forecasts, where a single correct value is specified for each possible distribution (for technical definitions and an account of the literature, we refer to Section 2.1). Yet set-valued functionals abound; prominent examples include quantiles, the mode, and prediction intervals, which may all be non-unique and therefore set-valued. Applications bring many other examples, such as the set-valued systemic risk measures introduced in Feinstein, Rudloff, and Weber (2017), which specify the entire set of capital allocations adequate to render a financial system's risk acceptable. Set-valued functionals also naturally arise via expectations or quantiles of random sets (Molchanov, 2017), such as in climatology and meteorology (the area affected by a flood), reliability engineering (parts of a machine being affected by extreme heat), or medicine (tumorous tissue in the human body); cf. Bolin and Lindgren (2015). We discuss these applications and several others in Section 6, with a special focus on quantiles of random sets in Section 5.
Techniques developed to evaluate single-valued forecasts do not in general suffice for set-valued functionals. For functionals like the mode or the α-quantile, it is common to restrict to the set of distributions with a unique mode or α-quantile (Fissler & Ziegel, 2019;Heinrich, 2014), yet one may be interested in distributions with multiple modes or quantiles. Moreover, many functionals, such as quantiles of random sets, are inherently set-valued and such a restriction is not available, or too much of a simplification. We therefore need a comprehensive theoretical framework for the assessment of forecasts for set-valued functionals.
It turns out that already for the definition of elicitation (or identification) of a set-valued functional, there are several possibilities, depending on the form the forecasts take. One may ask for an arbitrary element of the functional, an arbitrary subset, or perhaps even the entire set itself. For the case of the α-prediction interval and the uniform distribution on [0, 1], these definitions correspond to any subinterval of [0, 1] of length at least α, any set of such intervals, or the entire set of all such intervals. Moreover, if a set-valued functional is elicitable in one of these corresponding senses, does it continue to be elicitable if one specifies a particular element of the functional to be elicited, such as the shortest α-prediction interval; or, if it is not elicitable in one of the senses above, could such a specification render it elicitable?
In this paper, we present a general theoretical framework to evaluate forecasts of set-valued functionals, which clarifies and expands upon these questions. We begin with a thorough definition of elicitability and identifiability of set-valued functionals (Section 2). In particular, as alluded to in the above questions, we define two types of set-valued elicitation (identification): for the selective type, we follow Lambert and Shoham (2009) and Gneiting (2011a), where a single-valued forecast must be among the set of correct values specified by the functional, as is also typical for quantiles (Koenker, 2005); the exhaustive type more ambitiously asks one to forecast the entire set of correct values, and requires this set to be the unique minimiser (zero) of the expected score (identification function).
The main result of this article, Theorem 3.7, states that the two types of elicitability are mutually exclusive: a set-valued functional is either selectively elicitable or exhaustively elicitable, or not elicitable at all, subject to mild regularity conditions. The proof follows from a refinement of the classical result that convex level sets (CxLS) are necessary for elicitability (Proposition 3.3). This mutual exclusivity result is powerful in its ability to rule out elicitability of one type or the other; for example, quantiles of random variables are known to be selectively elicitable, thus failing to be exhaustively elicitable. Interestingly, any specification of the quantile, such as the lower quantile, or Value at Risk in the risk management literature, is also not elicitable in general; see Proposition 3.12 and the discussion thereafter.
We illustrate our framework with new results for prediction intervals (Section 4) and Vorob'ev quantiles (Section 5). For prediction intervals, we show that the whole class of α-prediction intervals is exhaustively elicitable. While this immediately rules out selective elicitability, we consider certain interesting specifications of prediction intervals. If the midpoint or an endpoint is given by a constant, this specification is indeed elicitable. However, if the midpoint or an endpoint is given via a general identifiable functional, it is in general not elicitable, unless the endpoint is specified via a quantile. We also show that the shortest prediction interval is not elicitable in either sense. This section complements and generalises recent results established independently in Brehmer and Gneiting (2020). We then establish the exhaustive elicitability and selective identifiability of Vorob'ev quantiles of random closed sets. For an application to systemic risk measures, we refer the reader to Fissler, Hlavinová, and Rudloff (2019), where the theoretical framework of this paper has been applied to establish selective identifiability, exhaustive elicitability, and mixture representations of exhaustive scoring functions, leading to Murphy diagrams. We close with a comprehensive literature review on forecast evaluation for set-valued quantities, covering spatial statistics, machine learning, engineering, climatology and meteorology, and philosophy, leaving many interesting avenues for future research. Technical proofs and further results are deferred to the Appendix.

Two types of elicitability and identifiability

2.1. Scoring and identification functions for single-valued functionals
We use the decision-theoretic framework described for example in Gneiting (2011a); cf. Savage (1971), Osband (1985), Lambert, Pennock, and Shoham (2008), Fissler and Ziegel (2016, 2019). Let (Ω, F, P) be some complete, atomless probability space rich enough to accommodate all random elements mentioned in the sequel. With Y we denote an observation of interest, taking values in some measurable space (O, O), called observation domain. Forecasts for Y are denoted by X, taking values in a measurable space (A, A) called action domain. We assume that the directive for an ideal forecast is given in terms of a statistical functional of the (conditional) distribution F of Y (given X). Mathematically, this is a map T : M → A, where M is some class of probability measures or probability distribution functions on (O, O). All functions are tacitly assumed to be measurable. A scoring function is a map S : A × O → R. It is negatively oriented, meaning that a forecast x ∈ A receives the penalty S(x, y) if y ∈ O materialises. Statistically, the relative quality of a prediction-observation sequence (X_t, Y_t), t = 1, . . . , N, is evaluated by S in terms of the realised score

(1/N) Σ_{t=1}^{N} S(X_t, Y_t).   (2.1)

Invoking an expected utility maximisation argument or a suitable law of large numbers, it has been widely argued (Engelberg, Manski, & Williams, 2009; Murphy & Daan, 1985) that a scoring function should incentivise truthful forecasts in that the Bayes act coincides with the given directive. We say that S is incentive compatible or M-consistent for T if

S̄(T(F), F) ≤ S̄(x, F)   (2.2)

for all x ∈ A and F ∈ M, where we implicitly assume that S̄(x, F) := ∫ S(x, y) dF(y) exists. If additionally equality in (2.2) implies that x = T(F), S is strictly M-consistent for T. A functional that admits a strictly consistent scoring function is called elicitable (Lambert et al., 2008; Osband, 1985).
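To make the realised score (2.1) and strict consistency (2.2) concrete, the following minimal sketch compares two point forecasters for the mean functional under the squared error, which is well known to be strictly consistent for the mean; the distribution and forecast values are purely illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observations Y_t ~ N(1, 1); the directive is the mean functional, T(F) = 1.
y = rng.normal(loc=1.0, scale=1.0, size=10_000)

# Squared error: a strictly consistent scoring function for the mean.
def score(x, y):
    return (x - y) ** 2

# Two competing point forecasters: the ideal one reports the true mean,
# a biased one reports 1.5.  The realised score is the counterpart of (2.1).
realised_ideal = score(1.0, y).mean()
realised_biased = score(1.5, y).mean()

# Strict consistency manifests in a lower realised score for the ideal report.
assert realised_ideal < realised_biased
```

The ranking stabilises as N grows, in line with the law-of-large-numbers argument invoked above.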
As such, the elicitability of a functional opens the way to meaningful forecast comparison (Gneiting, 2011a) which is closely related to comparative backtests in finance (Fissler, Ziegel, & Gneiting, 2016;Nolde & Ziegel, 2017). Similarly, it is crucial for M -estimation (Huber, 1967;Huber & Ronchetti, 2009) and regression, such as quantile regression (Koenker, 2005;Koenker & Basset, 1978) or expectile regression (Newey & Powell, 1987).
While scoring functions serve the purpose of forecast comparison and ranking, we employ identification functions when it comes to forecast validation. Similarly to a scoring function, an identification function is a map V : A × O → R^k, where we again make the tacit assumption that V̄(x, F) := ∫ V(x, y) dF(y) exists for all x ∈ A, F ∈ M, with the additional assumption that the expectation be finite. V is an M-identification function for T if V̄(T(F), F) = 0. It is a strict M-identification function for T if additionally V̄(x, F) = 0 implies that x = T(F). In the literature on point-valued functionals, it turned out to be appropriate that k coincides with the dimension of the forecasts (Frongillo & Kash, 2015; Osband, 1985). Since statistical practice demands to evaluate the realised identification function, which is the counterpart of (2.1) upon replacing S by V, V simply needs to map to a real vector space. One can even be more flexible and use an infinite-dimensional space; e.g., in Proposition 4.20 the identification function maps to R^[−1,1], the space of functions from [−1, 1] to R. One can also relax the requirement that the expected identification function attains a 0 at the correctly specified forecast: it may rather attain some predefined particular value(s), the important requirement being that this value be identifiable in the common sense.
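As an illustration of forecast validation via identification functions, the sketch below uses V(x, y) = 1{y ≤ x} − α, a standard strict identification function for the α-quantile on classes of distributions with unique quantiles; the exponential example and sample size are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.8

# V(x, y) = 1{y <= x} - alpha identifies the alpha-quantile: its expectation
# vanishes exactly at the (unique) quantile.
def V(x, y):
    return (y <= x).astype(float) - alpha

y = rng.exponential(scale=1.0, size=200_000)
q = -np.log(1.0 - alpha)   # true 0.8-quantile of Exp(1)

# The realised identification function is close to zero at the true quantile
# and bounded away from zero at a misspecified report.
assert abs(V(q, y).mean()) < 0.01
assert V(q + 0.5, y).mean() > 0.05
```

This mirrors calibration checks in backtesting: a systematically nonzero realised identification function flags a misspecified forecast.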
In statistics and econometrics, identification functions are often called moment functions and give rise to the (generalised) method of moments (Newey & McFadden, 1994) or Z-estimation. For a discussion of identifiability and calibration in the context of backtesting risk measures, we refer the reader to the insightful papers Davis (2016) and Nolde and Ziegel (2017). For a recent general perspective on identification, see Basse and Bojinov (2020).

Selective and exhaustive scoring and identification functions
When T is set-valued, we may write T : M → 2^W, where W is some generic space. As mentioned in the Introduction, we distinguish two types of forecasts with the corresponding notions of scoring and identification functions. In decision-theoretic terms, this translates into two sensible choices for the action domain A: (i) A = A sel ⊆ W : The elements of the action domain A sel representing possible forecasts are points in the space W . Truthful reporting means that there are generally multiple best actions, namely all selections t ∈ T(F) ⊆ A sel for F ∈ M. Mnemonically, we shall refer to A sel as a selective action domain.
(ii) A = A exh ⊆ 2 W : The elements of the action domain A exh representing possible forecasts are subsets of the space W . Truthful reporting means that there is a unique best action, namely the exhaustive functional T (F ) ∈ A exh for F ∈ M. Similarly, we shall refer to A exh as an exhaustive action domain.
The two different choices of action domains lay claim to different levels of precision and ambition of the forecasts. For a certain functional T : M → 2^W, the connection between the choice of the selective action domain A sel ⊆ W and the exhaustive action domain A exh ⊆ 2^W will be specified if needed for a certain result, otherwise remaining unspecified. However, a sensible connection between the two choices we have in mind is A exh ⊆ 2^{A sel}. We continue to use the dichotomy introduced above also for scoring functions evaluating forecasts for some set-valued functional T : M → 2^W. Let M′ ⊆ M be some generic subset.
The exhaustive score S exh : A exh × O → R is M′-consistent for T if

S̄ exh(T(F), F) ≤ S̄ exh(B, F)   (2.4)

for all B ∈ A exh and F ∈ M′. It is strictly M′-consistent for T if it is M′-consistent for T and if equality in (2.4) implies that B = T(F). T is exhaustively elicitable on M′ if there is a strictly M′-consistent exhaustive scoring function for T.
Unless mentioned explicitly otherwise, we tacitly assume that all scoring functions are M-finite in the sense that S̄ sel(x, F) ∈ R for all x ∈ A sel and F ∈ M (respectively S̄ exh(B, F) ∈ R for all B ∈ A exh and F ∈ M). If we merely say that T : M → 2^W is (selectively or exhaustively) elicitable, we mean it is (selectively or exhaustively) elicitable on M. Assuming M-finiteness is convenient in many proofs and is a standard assumption in the literature (Brehmer & Strokorb, 2019; Wang & Wei, 2020; Ziegel, 2016a). Note that without stipulating M-finiteness, the strict M-consistency of a selective (exhaustive) scoring function S sel (S exh) implies that S̄ sel(t, F) ∈ R for all F ∈ M, t ∈ T(F) (respectively S̄ exh(T(F), F) ∈ R for all F ∈ M). Following common usage, we call any two (selective or exhaustive) scoring functions S, S′ : A × O → R equivalent if there is some λ > 0 and some function a : O → R such that S′(x, y) = λS(x, y) + a(y). It is immediate to see that this equivalence relation preserves M-consistency, and also strict M-consistency, subject to a being M-integrable. If there is no risk of confusion, we shall drop the indices "sel" and "exh" indicating the selective and exhaustive interpretations, respectively.
For identification functions, we again make the distinction between selective and exhaustive identification functions to allow for a rigorous treatment of set-valued functionals.
T is exhaustively identifiable on M′ if it possesses a strict exhaustive M′-identification function.
Again, if we merely say that T : M → 2^W is (selectively or exhaustively) identifiable, we mean it is (selectively or exhaustively) identifiable on M.
For single-valued functionals such as the mean, the distinction between selective and exhaustive elicitability is obsolete, since any choice of an action domain leads to a unique best action. Hence, one is actually always in the exhaustive setting, and there is no point in mentioning this fact explicitly. Of course, we could formally identify a point-valued functional T : M → A with the set-valued functional T′ : M → A′ = {{a} | a ∈ A}, where T′(F) = {T(F)}. Then the following lemma holds. Lemma 2.3. Let T : M → A be some point-valued functional. Define the set-valued functional T′(F) := {T(F)}, F ∈ M. Then T′, considered as a map to the power set 2^A, is selectively elicitable (identifiable) if and only if T′, considered as a map to the exhaustive action domain A′ = {{a} | a ∈ A}, is exhaustively elicitable (identifiable). Moreover, the selective elicitability (identifiability) of T′ : M → 2^A is equivalent to the elicitability (identifiability) of T.
While we are aware of contributions to the literature which consider either the selective or the exhaustive interpretation only (see Section 6), one novelty in the present paper is that we thoroughly study and compare these two alternative notions, which is the content of the next section.

Structural results
The structural results presented in this section consist of generalisations of the classical Convex Level Sets (CxLS) property due to Osband (1985) and their immediate implications (Section 3.1), the main result on the mutual exclusivity of selective and exhaustive elicitability (Section 3.2), and implications of certain specifications of set-valued functionals in Section 3.3.

CxLS properties and their implications

Definition 3.1. Let T : M → 2^W be a set-valued functional and M′ ⊆ M.
(i) T has the selective CxLS property on M′ if for all F0, F1 ∈ M′ and for all λ ∈ (0, 1) such that (1 − λ)F0 + λF1 ∈ M′:

T(F0) ∩ T(F1) ⊆ T((1 − λ)F0 + λF1).

(ii) T has the selective CxLS* property on M′ if for all F0, F1 ∈ M′ and for all λ ∈ (0, 1) such that (1 − λ)F0 + λF1 ∈ M′:

T(F0) ∩ T(F1) ≠ ∅ implies T((1 − λ)F0 + λF1) = T(F0) ∩ T(F1).

(iii) T has the exhaustive CxLS property on M′ if for all F0, F1 ∈ M′ and for all λ ∈ (0, 1) such that (1 − λ)F0 + λF1 ∈ M′:

T(F0) = T(F1) implies T((1 − λ)F0 + λF1) = T(F0).

If we omit to mention the class M′ explicitly, we mean that T has the corresponding CxLS property on M. The exhaustive CxLS property is the most common one in the literature, and the one used for point-valued functionals (Bellini & Bignozzi, 2015; Delbaen, Bellini, Bignozzi, & Ziegel, 2016; Steinwart, Pasin, Williamson, & Zhang, 2014; Wang & Wei, 2020). The selective CxLS property follows the one proposed in Gneiting (2011a), while the selective CxLS* property is novel. However, it is noteworthy that the recent paper Brehmer and Strokorb (2019) introduced the notion of max-functionals. Using our notation, a real-valued functional T : M → R is called a max-functional if for any F0, F1 ∈ M and λ ∈ (0, 1):

T((1 − λ)F0 + λF1) = max{T(F0), T(F1)}.

The second point of Lemma 3.2 underpins why the distinction of the CxLS properties is obsolete for the point-valued case.
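The selective CxLS property can be checked numerically for the α-quantile of discrete distributions. The helper below is our own construction: it computes the quantile set {x : P(Y < x) ≤ α ≤ P(Y ≤ x)} on a candidate grid and verifies the inclusion T(F0) ∩ T(F1) ⊆ T((1 − λ)F0 + λF1) for two illustrative two-point distributions:

```python
import numpy as np

alpha = 0.5

def quantile_set(support, probs, candidates):
    """Return the candidates x with P(Y < x) <= alpha <= P(Y <= x)."""
    support, probs = np.asarray(support), np.asarray(probs)
    out = set()
    for x in candidates:
        below = probs[support < x].sum()
        at_most = probs[support <= x].sum()
        if below <= alpha <= at_most:
            out.add(x)
    return out

cand = [0, 0.5, 1, 1.5, 2]
q0 = quantile_set([0, 1], [0.5, 0.5], cand)   # median set of F0 is [0, 1]
q1 = quantile_set([1, 2], [0.5, 0.5], cand)   # median set of F1 is [1, 2]

lam = 0.3
qmix = quantile_set([0, 1, 2], [0.5 * (1 - lam), 0.5, 0.5 * lam], cand)

# Selective CxLS: the intersection {1} survives into the mixture's quantile set.
assert (q0 & q1) <= qmix
```

On the grid, q0 = {0, 0.5, 1} and q1 = {1, 1.5, 2}, so the intersection is {1}, which indeed lies in the quantile set of every mixture.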

It is immediate that a real-valued functional
It is classical knowledge originating from the seminal work of Osband (1985) that the exhaustive CxLS property is necessary for exhaustive elicitability and exhaustive identifiability, and that the selective CxLS property is necessary for selective elicitability and selective identifiability. Under additional regularity assumptions and for real-valued functionals, Steinwart et al. (2014) established that the CxLS property is also sufficient for both elicitability and identifiability. A novelty is the following necessity result.
due to the strict M-consistency of S. The identity in (3.1) stems from the fact that the expected score S̄(·, ·) behaves "linearly" in its second argument, which is the integration measure. Again invoking the strict M-consistency of S, the assertion follows.
Theorem 3.5. If T : M → A exh satisfies the proper-subset property (that is, there are F, G ∈ M with ∅ ≠ T(G) ⊊ T(F) such that (1 − λ)F + λG ∈ M for all sufficiently small λ ∈ (0, 1)) and the selective CxLS* property, it is not exhaustively elicitable.
Proof. Assume S is a strictly M-consistent exhaustive scoring function for T, and let F, G ∈ M be such that ∅ ≠ T(G) ⊊ T(F). Then

S̄(T(F), F) < S̄(T(G), F)

due to the strict M-consistency of S. The proper-subset property implies that there is a sufficiently small λ0 ∈ (0, 1) such that (1 − λ0)F + λ0G ∈ M and, by linearity of the expected score in the mixture,

S̄(T(F), (1 − λ0)F + λ0G) < S̄(T(G), (1 − λ0)F + λ0G).

Exploiting the selective CxLS* property yielding T((1 − λ0)F + λ0G) = T(G), we end up with

S̄(T(G), (1 − λ0)F + λ0G) < S̄(T(F), (1 − λ0)F + λ0G),

which violates the strict M-consistency. Note that the second inequality only holds under the tacit assumption that S is M-finite.
Remark 3.6. Remarkably, the combination of the selective CxLS* property and the proper-subset property implies that there are F, G ∈ M with T(F) ≠ T(G) such that for all λ ∈ (0, 1) it holds that T((1 − λ)F + λG) ∈ {T(F), T(G)}. That means, in our Theorem 3.5 we directly recover the condition of Theorem 3.3 in Brehmer and Strokorb (2019). Even though their result is stated for real-valued functionals only, it immediately generalises to the set-valued case. Hence, the conclusions coincide in both instances, implying that T fails to be (exhaustively) elicitable.

Mutual exclusivity
We now present our main result, which states that, for functionals satisfying the proper-subset property, selective and exhaustive elicitability are mutually exclusive. The proof follows immediately from Proposition 3.3 and Theorem 3.5.
Theorem 3.7 (Mutual exclusivity). Let T : M → A exh ⊆ 2^{A sel} be a set-valued functional with the proper-subset property. Then T cannot be both selectively elicitable and exhaustively elicitable.
This result gives a broad insight into the structure of set-valued elicitability. It basically establishes the following partition of set-valued functionals: (1) the class of selectively elicitable functionals; (2) the class of exhaustively elicitable functionals; (3) the class of functionals which are not elicitable at all.
The result also gives a powerful tool to rule out elicitability without the need for a direct argument, which in some cases may appear quite challenging a priori. We give several example applications of Theorem 3.7 below.
Example 3.8. (i) Any α-quantile, α ∈ (0, 1), is selectively elicitable. If the class M is reasonably large (e.g., it contains all measures with finite support), then the α-quantile clearly satisfies the proper-subset property. Hence, it fails to be exhaustively elicitable on such a class.
(ii) If M is the class of distributions on R with finite support, then the mode functional is selectively elicitable on M with the strictly M-consistent selective scoring function S(x, y) = 1{x ≠ y} (Gneiting, 2017; Heinrich, 2014). Since the mode functional satisfies the proper-subset property on M, it also fails to be exhaustively elicitable on M.
(iii) Any elicitable real-valued functional T : M → R induces trivial set-valued functionals T−(F) := (−∞, T(F)] and T+(F) := [T(F), ∞). Clearly, the elicitability of T is equivalent to the exhaustive elicitability of T− and T+, considered as maps to the corresponding exhaustive action domains of half-lines, e.g. by invoking the revelation principle (Fissler, 2017; Gneiting, 2011a; Osband, 1985). If T is not constant on M, then T− and T+ also satisfy the proper-subset property, which means they violate the selective CxLS* property such that they are not selectively elicitable. Vice versa, if T+ or T− satisfies the selective CxLS* property, then T or −T is a max-functional in the sense of Brehmer and Strokorb (2019), such that T (and −T) is not elicitable unless it is constant, which recovers their Corollary 3.4.
(iv) In Fissler, Hlavinová, and Rudloff (2019), the exhaustive elicitability of the set-valued systemic risk measures defined by Feinstein et al. (2017) has been established. For a random vector Y representing a financial system, a measure of systemic risk is defined as the collection of capital allocations k ∈ R^d such that ρ(Λ(Y + k)) ≤ 0, where ρ is a scalar risk measure and Λ : R^d → R a non-decreasing aggregation function. The cash-invariance property of these risk measures implies that they satisfy the proper-subset property. This means that they cannot be selectively elicitable.
(v) In Section 4, we consider the class of α-prediction intervals, i.e., of intervals a random variable will fall into with a probability of at least α. We show that, on a suitable class of probability distributions M, this class of α-prediction intervals is exhaustively elicitable on M, and in Lemma 4.6 we construct an example that shows that the proper-subset property is satisfied on M. The combination of these results then rules out the selective elicitability of the class of α-prediction intervals on M.
(vi) Theorem 4.16 shows that there are classes of distributions where the collection of all shortest α-prediction intervals fails to be elicitable in either sense, selectively or exhaustively; see Remark 4.18 for details. To the best of our knowledge, this is the first non-degenerate example of a set-valued functional which is not elicitable in either sense. (Clearly, any non-elicitable single-valued functional, such as the variance, would trivially satisfy such a statement by virtue of Lemma 2.3.)
(vii) In Section 5, we establish that Vorob'ev quantiles of random sets are selectively identifiable and exhaustively elicitable. Under the additional mild proper-subset property, which is satisfied in many settings, this means that Vorob'ev quantiles cannot be selectively elicitable.
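Example 3.8 (i) can be illustrated numerically: under a distribution whose median set is the whole interval [0, 1], every selection attains the same minimal expected pinball loss, so no report can be singled out as a unique minimiser, let alone an exhaustive report of the full set. The two-point distribution below is our own illustrative choice:

```python
import numpy as np

alpha = 0.5
support, probs = np.array([0.0, 1.0]), np.array([0.5, 0.5])

def pinball(x, y):
    # Strictly M-consistent selective score for the alpha-quantile.
    return (float(x >= y) - alpha) * (x - y)

def expected_score(x):
    return sum(p * pinball(x, y) for y, p in zip(support, probs))

# Every point of the quantile set [0, 1] attains the minimal expected score,
# while points outside the set do strictly worse.
inside = [expected_score(x) for x in np.linspace(0.0, 1.0, 5)]
assert np.allclose(inside, inside[0])
assert expected_score(-0.5) > inside[0]
assert expected_score(1.5) > inside[0]
```

The flat minimum over [0, 1] is exactly why the full quantile set cannot be the unique minimiser of any expected selective score.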
Remark 3.9. Elicitability and identifiability exhibit structural differences in this context. While Theorem 3.5 carries over to exhaustive identifiability with an easy adaptation of the proof, it does not seem possible to establish an analogue of Proposition 3.3 for selective identifiability due to possible cancellation effects. One can merely establish that selective identifiability implies the selective CxLS property. Therefore, it remains open whether selective and exhaustive identifiability are mutually exclusive in the sense of Theorem 3.7.

Specifications of set-valued functionals
Let us now take a look at specifications of set-valued functionals. For a set W ≠ ∅ and a set-valued functional T : M → 2^W \ {∅}, we call a point-valued functional T̂ : M → W with T̂(F) ∈ T(F) for all F ∈ M a specification of T. We start with a lemma, the proof of which is straightforward.
Clearly, the scoring function S (identification function V) appearing in Lemma 3.10 is only consistent, but in general not strictly consistent, for the specification. This suggests the question as to whether the specification can be elicitable (identifiable) at all, which the following proposition is concerned with.
However, any S ∈ S M fails to be strictly M-consistent for T . Hence, S M = ∅.
A common problem when applying Proposition 3.11 for practical purposes is that most characterisation results concerning the class of strictly consistent scoring functions, if known at all, typically assume regularity conditions on the scoring functions such as continuity or differentiability; cf. Table 1 in Gneiting (2011b) or Osband's principle (Osband, 1985). Interestingly, an argument similar to the one used in the proof of Theorem 3.5 leads to a result which rules out the elicitability of specifications under very weak conditions on the functional. In particular, it dispenses with regularity conditions on scoring functions.
In line with Bellini and Bignozzi (2015), we call a functional from a convex class of distributions to some topological space A mixture-continuous if for any F, G ∈ M the map [0, 1] ∋ λ ↦ T((1 − λ)F + λG) is continuous. Proposition 3.12. Let M be a convex class of distributions and let T : M → 2^A satisfy condition (3.2). Then any specification T̂ of T fails to be identifiable, and fails to be elicitable with an M-finite score. If, moreover, for any a ∈ T(F) and b ∈ T(G) there is an open set U ⊆ A such that a ∈ U and b ∉ U, then any specification T̂ fails to be mixture-continuous.
Proof. Let T̂ : M → A be a specification of T. Together with (3.2), the necessity results of Fissler and Ziegel (2019) imply that T̂ is not identifiable. Assume that there is a strictly M-consistent scoring function S for T̂. Then (3.2) forces the expected score difference S̄(t0, (1 − λ)F + λG) − S̄(t1, (1 − λ)F + λG) to change its sign on (0, 1) without attaining zero. This contradicts the elementary fact that the map [0, 1] ∋ λ ↦ S̄(t0, (1 − λ)F + λG) − S̄(t1, (1 − λ)F + λG) is continuous (where we have exploited the M-finiteness of S), which rules out the elicitability of T̂. Finally, under the stated separation property, the path γ : λ ↦ T̂((1 − λ)F + λG) is not continuous, which shows that T̂ is not mixture-continuous. We would like to emphasise that the mere failure of mixture-continuity of T̂ does not rule out its elicitability. Indeed, Proposition 2.2 in Fissler and Ziegel (2019) (cf. Proposition 3.4 in Bellini and Bignozzi (2015)) only rules out the existence of a continuous strictly consistent scoring function for T̂.
Proposition 3.13. Let M be a convex class of distributions such that (3.2) is satisfied for the α-quantile, α ∈ (0, 1). Then no specification of the α-quantile is elicitable.
Note that (3.2) is satisfied for the α-quantile e.g. if M contains all distributions with finite support. Proposition 3.13 thus rules out the elicitability of the lower quantile or the specification introduced in the recent preprint Aronow and Lee (2018) relative to such classes.
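The failure of mixture-continuity behind Propositions 3.12 and 3.13 can be seen in a small numerical sketch: along the mixture line between two Dirac measures, the lower 0.5-quantile jumps from 0 to 1 without attaining any intermediate value. The discrete helper below is our own construction:

```python
import numpy as np

def lower_quantile(alpha, support, probs):
    """Lower alpha-quantile inf{x : F(x) >= alpha} of a discrete distribution."""
    order = np.argsort(support)
    cdf = np.cumsum(np.asarray(probs)[order])
    # searchsorted with side='left' returns the first index with cdf >= alpha.
    return np.asarray(support)[order][np.searchsorted(cdf, alpha)]

alpha = 0.5
# Mixtures (1 - lam) * delta_0 + lam * delta_1: the lower quantile jumps from
# 0 to 1 as lam crosses 1/2, so this specification is not mixture-continuous.
path = [lower_quantile(alpha, [0.0, 1.0], [1.0 - lam, lam])
        for lam in np.linspace(0.0, 1.0, 11)]

assert path[0] == 0.0 and path[-1] == 1.0
assert 0.5 not in path   # no intermediate value: the path is a pure 0/1 jump
```

Note that the jump is a property of the specification, not of the quantile set itself, which continuously "thickens" to [0, 1] at the crossing point.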
Notably, Proposition 3.12 rules out elicitability with an M-finite score, which also translates to Proposition 3.13. Relaxing this condition and looking at the 0- and 1-quantiles, we arrive at functionals which are both selectively and exhaustively elicitable, which is the content of Subsection 4.4.

Prediction intervals
A common task for the statistical forecaster is to report an interval [a, b] ⊆ R into which future observations of a given real-valued random variable Y will fall with at least a specified coverage probability α ∈ (0, 1], that is, P(Y ∈ [a, b]) ≥ α. Thereby, the inherent uncertainty of the actual outcome is captured. Any such interval will be referred to as an α-prediction interval.
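The defining coverage condition of an α-prediction interval is straightforward to check empirically. The following sketch, with an arbitrary standard normal example of our own choosing, verifies that [−2, 2] is a 0.9-prediction interval:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=50_000)   # observations of Y ~ N(0, 1)
alpha = 0.9

# [a, b] is an alpha-prediction interval iff P(Y in [a, b]) >= alpha.
a, b = -2.0, 2.0              # covers about 95.4% of N(0, 1)
coverage = np.mean((y >= a) & (y <= b))
assert coverage >= alpha
```

Since the coverage requirement is an inequality, many intervals qualify simultaneously, which is exactly the source of the set-valuedness discussed below.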
The literature on evaluating prediction intervals considers reports for these functionals typically in the exhaustive sense, meaning that an interval is reported rather than a single point. Gneiting and Raftery (2007, Sections 6.2 and 9.3) consider consistent exhaustive scores for the central α-prediction interval or 'equal-tailed' α-prediction interval; cf. Greenberg (2018) for a discussion of these scores and Bracher, Ray, Gneiting, and Reich (2020) for a timely application of interval forecasts in the context of epidemiology. This basically amounts to a prediction for a pair of quantiles at the (1 − α)/2- and (1 − (1 − α)/2)-levels. If one fixes a certain coverage of, say, α, this ansatz can be generalised to construct consistent scoring functions for a non-central α-prediction interval whose endpoints are specified in terms of quantiles at levels β and β + α, where β ∈ (0, 1 − α). Schlag and van der Weele (2015) also consider exhaustive scoring functions for interval-valued predictions. However, they start with a certain scoring function of appeal to them and do not thoroughly characterise the functional which is elicited by this scoring function. See Askanazi, Diebold, Schorfheide, and Shin (2018) for an overview of interval forecasts, in which, however, mostly impossibility results are presented.
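The exhaustive scores for central prediction intervals discussed by Gneiting and Raftery (2007) are, up to equivalence, given by the interval score. The sketch below adapts it to the convention of this paper, where α denotes the coverage probability (so the miss penalties scale with 1/(1 − α)); the numerical example is our own:

```python
import numpy as np

def interval_score(l, u, y, alpha):
    """Interval score for a central alpha-prediction interval [l, u]
    (Gneiting & Raftery, 2007), with alpha the nominal coverage level."""
    return (u - l) \
        + (2.0 / (1.0 - alpha)) * np.maximum(l - y, 0.0) \
        + (2.0 / (1.0 - alpha)) * np.maximum(y - u, 0.0)

rng = np.random.default_rng(2)
y = rng.normal(size=100_000)
alpha = 0.8

# Central 80%-prediction interval of N(0, 1): the 0.1- and 0.9-quantiles.
l_true, u_true = -1.2816, 1.2816   # +-1.2816 is approximately the 0.9-quantile

s_true = interval_score(l_true, u_true, y, alpha).mean()
s_wide = interval_score(l_true - 1.0, u_true + 1.0, y, alpha).mean()
s_shift = interval_score(l_true + 0.5, u_true + 0.5, y, alpha).mean()

# The correctly specified central interval achieves the lowest realised score.
assert s_true < s_wide and s_true < s_shift
```

The score trades off sharpness (the interval's width) against the penalties for observations falling outside, which is what makes it consistent for the central interval.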
There is typically a whole class of α-prediction intervals for Y , resulting in a collection of subsets of R. In Section 4.2 we show that this whole class of α-prediction intervals is exhaustively elicitable, subject to sensible conditions on the class of distributions. As a direct consequence of Theorem 3.7, it is not selectively elicitable. This poses a substantial challenge for the sound evaluation of single arbitrary α-prediction intervals without any further restrictions. On the other hand, imposing such further restrictions, it is well known that an α-prediction interval given by two quantiles as its endpoints can be elicited due to the elicitability of the individual quantiles, provided the quantiles are singletons on the respective class of probability distributions. Such an interval is a particular specification of the class of α-prediction intervals, and one might wonder about the elicitability (identifiability) of other specifications. In the second part of this section, we discuss the elicitability and identifiability of several specifications of the class of α-prediction intervals, with largely negative results.
Our results are nicely complemented by the very recent and independently developed preprint Brehmer and Gneiting (2020). They essentially study the subclass of α-prediction intervals with exact coverage α, and show that this subclass fails to be selectively elicitable; see Remark 4.8 for details. Furthermore, they establish properties of homogeneous and translation invariant scores for the central ('equal-tailed') α-prediction interval and show some complementary impossibility results for the shortest α-prediction interval.

Notation
Let M 0 be the class of Borel probability distributions on R, where we deliberately overload notation and identify the corresponding Borel measures with their cumulative distribution functions. Let Ū := {(a, b) ∈ [−∞, ∞]² | a ≤ b} denote the set of vectors of interval endpoints. We introduce the following subclasses of M 0 : Let M inc be the class of strictly increasing distribution functions, M cont the class of continuous distributions, and M inc,cont := M inc ∩ M cont . Naturally, F ∈ M inc implies that the support of F is all of R. However, to allow for the treatment of distributions with bounded support, we define the class of α-pseudo-increasing distributions M α,inc for α ∈ (0, 1]. For any F ∈ M 0 we say that F ∈ M α,inc if and only if #q α (F ) = #q 1−α (F ) = 1, and min{#q β (F ), #q β+α (F )} = 1 for all β ∈ (0, 1 − α), (4.1) where the notation #A denotes the cardinality of a set A. This means that F ∈ M α,inc if and only if its α- and (1 − α)-quantiles are singletons, and if for any β ∈ (0, 1 − α) the β-quantile or the (β + α)-quantile is a singleton.
For any α ∈ (0, 1] and upon identifying any non-empty interval [a, b] ⊆ R with the vector of its endpoints (a, b) ∈ Ū , we formally introduce the class of α-prediction intervals for a distribution F ∈ M 0 as I α (F ) := {(a, b) ∈ Ū | F (b) − F (a−) ≥ α}, which is an upper set with respect to the ordering cone C := (−∞, 0] × [0, ∞). Moreover, I α (F ) is non-empty since F (R) = 1 ≥ α, implying that (−∞, ∞) ∈ I α (F ). Therefore, we introduce the natural maximal exhaustive action domain for reports for I α , with the usual definition of the Minkowski sum. Moreover, for any F ∈ M 0 , we introduce the following functions closely connected to I α (F ): First, the function Γ α (F )(a) := q − F (a−)+α (F ) gives the upper endpoint of the shortest α-prediction interval with lower endpoint a; see Figure 1 for an illustration.
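For a continuous and strictly increasing F, the upper endpoint of the shortest α-prediction interval with lower endpoint a can be evaluated numerically as the quantile at level F(a) + α. The following Python sketch assumes F is the standard normal and uses a bisection-based quantile function; the formula and all names are illustrative choices of ours, not part of the formal development.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_quantile(p, lo=-10.0, hi=10.0, tol=1e-10):
    """Inverse of norm_cdf by bisection; accurate enough for a sketch."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gamma_alpha(a, alpha):
    """Upper endpoint of the shortest alpha-prediction interval with lower
    endpoint a, for the standard normal: the quantile at level F(a) + alpha,
    finite if and only if F(a) + alpha < 1 (illustrative assumption)."""
    p = norm_cdf(a) + alpha
    return norm_quantile(p) if p < 1.0 else math.inf

# [a, gamma_alpha(a, alpha)] carries probability mass exactly alpha:
b = gamma_alpha(-1.0, 0.8)
assert abs((norm_cdf(b) - norm_cdf(-1.0)) - 0.8) < 1e-6
assert gamma_alpha(2.0, 0.9) == math.inf  # not enough mass above a = 2
```

The infinite value in the second assertion mirrors the fact that not every lower endpoint admits a finite α-prediction interval.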
(ii) For all F ∈ M 0 the function Γ α (F ) is increasing and left-continuous.
(iii) For F ∈ M inc, cont the function Γ α (F ) is strictly increasing and continuous.
Proof. (i) follows from the right continuity of F . (ii) is implied by the fact that the functions a → F (a−) and β → q − β (F ) are increasing and left-continuous. For (iii) recall that for F ∈ M inc, cont , both a → F (a−) and β → q − β (F ) are strictly increasing and continuous. Finally, for (iv) note that for which gives (half of) the length of the shortest α-prediction interval of F , centred at m. Again, a continuity argument yields that the infimum is attained such that Note that I α (F ) corresponds to the epigraph of Γ α (F ), given by which also guarantees the (Borel-) measurability of the set I α (F ). We also introduce the graph of Γ α (F ) as Finally, we introduce the subclass U * ⊆ U of sets which can be written in form of epigraphs of left-continuous functions γ :

Elicitability and identifiability of the class of α-prediction intervals
One of the main results of this paper is as follows.
Theorem 4.2. For α ∈ (0, 1] the following assertions hold: (ii) S exh is an M 0 -consistent exhaustive scoring function for I α . (iii) If additionally µ is positive on U , then the restriction of S exh to U * × R is strictly M α,inc -consistent for I α , rendering the class of α-prediction intervals exhaustively elicitable on M α,inc .
Proof. Part (i) follows from Lemma 4.1 and standard arguments. For part (ii) let F ∈ M 0 , A * = I α (F ) and A ∈ U. Then, using a Fubini argument, we obtain the asserted inequality, which is easily established by recalling the definition of V̄ . Note that in part (iii), the fact that q F (x 1 −) (F ) is a singleton whenever #q α+F (x 1 −) (F ) > 1 plays an important role. If this were not the case, we would obtain rectangles of points. This would mean that our exhaustive scoring function fails to distinguish between the correct forecast and one that does not contain some of the points within this rectangle. The reasoning behind why the α- and (1 − α)-quantiles are required to be singletons is similar. For α = 1, note that M α,inc only contains distributions with support R. But in that case, I α is constant, namely {R} for all distributions, and thus not interesting. We will therefore exclude the case α = 1 from further discussion.
It is straightforward to construct a score equivalent to the one in (4.6), given by (4.8). From this stage, one can easily construct a family of elementary scores, S u = S δ u , u ∈ U , given by (4.8). As a consequence of Theorem 4.2, these elementary scores are M α,inc -consistent for I α . Clearly, S µ (A, y) = ∫ S u (A, y) µ(du), which is a mixture representation in the spirit of Ehm, Gneiting, Jordan, and Krüger (2016). This opens the way to the powerful tool of Murphy diagrams u → S u (A, y) discussed there as well. In order to avoid the necessity of choosing a measure µ, one instead considers the elementary scores in (4.8) over different values of the parameter u ∈ U . In the one-dimensional case discussed in Ehm et al. (2016), as well as in the case of the class of α-prediction intervals, one can easily visualise the values of the expected score differences graphically. As the dimensionality of the space the parameter u comes from increases, the illustrative accessibility of this approach becomes more involved. We discuss an example with possibly higher dimension in Section 5. For an illustration of 2-dimensional Murphy diagrams, we refer the reader to .
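As a down-to-earth illustration of Murphy diagrams in the one-dimensional case of Ehm et al. (2016), the following Python sketch compares two quantile forecasts across the family of elementary scores; the specific forecasts, sample, and names are hypothetical choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def elementary_quantile_score(theta, x, y, alpha):
    """Elementary score of Ehm et al. (2016) for the alpha-quantile,
    evaluated at forecast x and observations y."""
    ind_x = float(theta < x)
    ind_y = (theta < y).astype(float)
    return ((y <= x).astype(float) - alpha) * (ind_x - ind_y)

alpha = 0.5
y = rng.standard_normal(50_000)
x_true, x_biased = 0.0, 0.5   # true median vs. a hypothetical biased competitor

thetas = np.linspace(-2.0, 2.0, 41)
diffs = [float(np.mean(elementary_quantile_score(t, x_biased, y, alpha)
                       - elementary_quantile_score(t, x_true, y, alpha)))
         for t in thetas]
# The true median is (weakly) preferred at every elementary level:
assert all(d >= -0.02 for d in diffs)
assert max(diffs) > 0.1   # and strictly preferred for some thetas
```

Plotting `thetas` against `diffs` yields the Murphy diagram of the two forecasts; uniform non-negativity of the curve corresponds to dominance with respect to every consistent scoring function in the mixture representation.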
Intuitively, the class of α-prediction intervals, I α (F ), of a distribution F contains a great deal of information about F itself. So one might wonder if it is possible to recover F knowing I α (F ). If so, this would mean that I α actually constitutes a bijection. Consequently, the exhaustive elicitability of I α would directly follow from the existence of strictly proper scoring rules for probabilistic forecasts, invoking the revelation principle (Gneiting, 2011a; Osband, 1985). The following proposition asserts that I α is not a bijection, which underlines the novelty of Theorem 4.2.
Proof. For a, b ∈ R, a, b > 0, define a % b := a − b⌊a/b⌋, the real analogue of the modulus.
Otherwise, let β = 1%α > 0. For all 0 ≤ a ≤ β/(α − β), define the probability density Thus, f 0 is the uniform density on [0, 1], and for a > 0, f a raises and lowers the density according to where y falls modulo α. Letting F a ∈ M α,inc be the corresponding probability measure, we will show that I α (F a ) = I α (F 0 ) for all a > 0.
We again see that the two classes of prediction intervals coincide.
Remark 4.5. Variants of prediction intervals other than connected intervals might also be natural to consider, e.g., wrapped intervals (allowing intervals of the form (−∞, b] ∪ [a, ∞) where b < a), unions of intervals, and, most generally, any measurable prediction set. In Appendix B, we show that most of these generalisations are indeed in bijection with the underlying distribution F . That means their exhaustive elicitability follows directly from the existence of strictly proper scoring rules for probabilistic reports and the revelation principle. One exception is the case of wrapped intervals when α is rational, as the construction in the first case of Proposition 4.4 applies, and injectivity fails. (When α is irrational, repeatedly wrapping intervals corresponds to an irrational rotation, from which one can compute a dense set of quantiles, such that one can again invoke the revelation principle to obtain exhaustive elicitability.) We claim that the class of wrapped prediction intervals with a rational α is exhaustively elicitable under mild assumptions on the underlying class of distributions, using a similar integral construction as the one in Theorem 4.2.
In order to use Theorems 3.7 and 4.2 to conclude that I α is not selectively elicitable on M α,inc it is essential to show that I α satisfies the proper-subset property on M α,inc .
Proof. This is a direct combination of Theorem 4.2, Theorem 3.7 and Lemma 4.6.

Prediction interval with an endpoint or the midpoint given by an identifiable functional
In some situations, one might be interested in prediction intervals with one endpoint or the midpoint specified as some (identifiable) functional. The simplest situation arises if the midpoint or an endpoint is simply a constant; we defer the discussion to Appendix A. Apart from constants, the most natural such functionals appear to be the mean or the median for the midpoint, while also other quantiles or expectiles might be interesting. If one endpoint is specified in terms of some quantile, the other endpoint must be a quantile itself and the elicitability of the vector is obvious and well known (if the quantiles are both singletons); see e.g.  or Proposition 4.9, which recalls this result for the sake of completeness. On the other hand, we can show that there are no twice continuously differentiable exhaustive scoring functions (see Propositions 4.10 and 4.12) for other functionals under mild conditions. In the case of the midpoint given by an identifiable functional, this even holds for the quantile. This gives rise to the conjecture that such intervals are in general not elicitable. Despite their failure of being (smoothly) elicitable, these functionals are still identifiable, therefore possessing the CxLS property. This leads to the novel observation that, in the multivariate setting, the equivalence of the CxLS property with identifiability and elicitability established for one-dimensional functionals in Steinwart et al. (2014) fails to hold. We only address the case of the left endpoint given by an identifiable functional and remark that the right endpoint case works mutatis mutandis.
(ii) QI α,β is elicitable on any subclass M of M 0 such that the β- and (β + α)-quantiles are singletons for all distributions in M. Any sum of two strictly M-consistent scoring functions for the respective quantiles is a strictly M-consistent scoring function for QI α,β .
Note that, in fact, essentially any strictly consistent scoring function for QI α,β is a sum of two strictly consistent scoring functions for the respective quantiles; see Fissler and Ziegel (2016, Proposition 4.2). Very recently, Brehmer and Gneiting (2020, Theorem 3.1) characterised all translation invariant or positively homogeneous consistent scores for the central α-prediction interval.
Choosing β = 0, reporting QI α,β would boil down to reporting the lower α-quantile, such that the identifiability and elicitability hold if (and only if) the α-quantile is a singleton. For the case β = 1 − α, the second component of QI α,β is the essential supremum. Therefore, we only obtain an identifiability result if all distributions in M are unbounded from above and the (1 − α)-quantiles are singletons. For elicitability results, we refer to Subsection 4.4. Finally, we would like to remark that Proposition 4.9 together with the mutual exclusivity result of Theorem 3.7 implies that there cannot be a scoring function R × R → R such that the expected score is minimised on an interval between two quantiles, subject to very mild conditions on the class of distributions M (such that the proper-subset property is satisfied for QI α,β ).
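A minimal numerical sketch of the construction in Proposition 4.9 (ii): summing two pinball (quantile) losses at levels β and β + α yields a consistent score for the quantile pair. The choice of distribution, grid, and all names below are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def pinball(x, y, level):
    """Quantile (pinball) loss, consistent for the level-quantile."""
    return ((y <= x).astype(float) - level) * (x - y)

def interval_score(a, b, y, alpha, beta):
    """Sum of two pinball losses at levels beta and beta + alpha: a strictly
    consistent score for the quantile pair (q_beta, q_{beta+alpha})."""
    return pinball(a, y, beta) + pinball(b, y, beta + alpha)

alpha, beta = 0.8, 0.1          # central 80% interval of a standard normal
y = rng.standard_normal(50_000)
true_a, true_b = np.quantile(y, [beta, beta + alpha])

# Grid search: the expected score is (nearly) minimised at the quantile pair.
grid = np.linspace(-2.5, 2.5, 26)
best = min((float(np.mean(interval_score(a, b, y, alpha, beta))), a, b)
           for a in grid for b in grid if a <= b)
assert abs(best[1] - true_a) < 0.25 and abs(best[2] - true_b) < 0.25
```

The score decomposes additively, so the two endpoints can also be optimised separately, reflecting that QI α,β is simply a vector of two elicitable quantiles.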
In Proposition 4.9, we ensured the existence of the α-prediction interval by restricting the range of β. Similarly, one has to restrict the class of probability distributions suitably to ensure the existence of an interval with the demanded coverage when the left endpoint is given by some general identifiable functional l : M → R, where M is some subclass of M 0 . To ensure that there is enough mass above l(F ), we write M l = {F ∈ M | F (l(F )−) ≤ 1 − α}. For a midpoint specification in terms of an identifiable functional m : M → R, such a restriction is not necessary. (i) T l is identifiable on M l ∩ M inc,cont with a strict identification function. (ii) Assume that M l is such that (a) M l ∩ M inc,cont is convex; (b) for any z ∈ R × (0, ∞) there are F 1 , F 2 , F 3 ∈ M l ∩ M inc,cont such that 0 is in the interior of the convex hull of the set {V̄ (z, F 1 ), V̄ (z, F 2 ), V̄ (z, F 3 )}. Then there is no strictly M l ∩ M inc,cont -consistent scoring function S for T l such that S̄(·, F ) is twice continuously differentiable on R × (0, ∞) for any F ∈ M l ∩ M inc,cont .
Proof. See Appendix C.
Points (a), (b) and (d) are basically richness assumptions on the class M l ∩ M inc,cont , which are needed to establish necessary conditions on the shape of possible strictly consistent scoring functions via Osband's principle (Fissler & Ziegel, 2016, Theorem 3.2). In particular, (b) and (d) in combination with the convexity stipulated under (a) are surjectivity conditions, where (b) also ensures that the expected identification function can vary enough. (c) is a pure smoothness assumption, which is needed since the proof exploits first- and second-order conditions. In concrete situations, e.g. when M l ∩ M inc,cont is the class of finite Gaussian mixtures and l is the mean functional, these conditions can be verified by straightforward calculations.
(ii) Assume that M is such that assumptions (a), (b) and (c) from Proposition 4.10 hold mutatis mutandis. Moreover, suppose that (d) holds for any (m, …). Then there is no strictly M ∩ M inc,cont -consistent scoring function S for T m such that S̄(·, F ) is twice continuously differentiable on R × (0, ∞) for any F ∈ M ∩ M inc,cont .
Proof. See Appendix C.

Shortest prediction intervals
In the context of probabilistic forecasts, Gneiting, Balabdaoui, and Raftery (2007, p. 243) proposed the paradigm of "maximizing the sharpness of the predictive distribution subject to calibration", continuing: "Calibration refers to the statistical consistency between the distributional forecasts and the observations and is a joint property of the predictions and the events that materialize. Sharpness refers to the concentration of the predictive distributions and is a property of the forecasts only." Following this rationale, a particularly well-motivated restriction of I α is the shortest prediction interval SI α , meaning a prediction interval of minimal length (sharp) subject to achieving a coverage of at least α (calibrated). This is in line with the decision-theoretic derivation of the 'prescriptive optimal interval forecast' given in Askanazi et al. (2018, Section 2.2): "restrict attention to correctly-calibrated intervals, and then pick the shortest (on average)." In this subsection, we will study the elicitability of SI α .
Let us first consider the case α = 1, where the shortest α-prediction interval of F ∈ M is SI 1 (F ) = (ess inf(F ), ess sup(F )), which is possibly of infinite length. Here ess inf and ess sup are the essential infimum and supremum, respectively, defined by sup q 0 and inf q 1 , where q α is the quantile functional. Thus, to understand the elicitability of SI 1 , it suffices to study the elicitability of ess inf and ess sup.
To this end, let g : R → R be an increasing and bounded function, and set g(±∞) = lim x→±∞ g(x). Recall that for α ∈ (0, 1) a consistent selective score for the α-quantile is given by S α (x, y) = (1{y ≤ x} − α)(g(x) − g(y)). If q α is surjective on M in the sense that for any x ∈ R there exists an F ∈ M such that x ∈ q α (F ), then S α becomes strictly M-consistent if and only if g is strictly increasing. Now consider the following generalisations of S α for α ∈ {0, 1}, clearly failing to be M-finite in general: S 0 (x, y) = ∞ · 1{y < x} + g(y) − g(x), (4.9) S 1 (x, y) = ∞ · 1{y > x} + g(x) − g(y). (4.10) Interestingly, if g is constant, S 0 becomes a strictly M 0 -consistent selective scoring function for q 0 , and S 1 for q 1 . On the other hand, if g is strictly increasing, they become strictly M 0 -consistent for the essential infimum and essential supremum, respectively, and the elicitability of SI 1 then follows.
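The score S 1 in (4.10) can be illustrated on a toy discrete distribution; the following Python sketch (with g = arctan as one admissible bounded increasing function, and a hypothetical three-point distribution of our choosing) checks that the essential supremum minimises the expected score.

```python
import math

def g(x):
    """A bounded, strictly increasing function; arctan is one valid choice."""
    return math.atan(x)

def score_ess_sup(x, y):
    """S_1(x, y) = inf * 1{y > x} + g(x) - g(y): observations above the report
    are penalised infinitely; otherwise smaller reports are rewarded."""
    return math.inf if y > x else g(x) - g(y)

# Toy example: Y uniform on {0, 1, 2}, so the essential supremum equals 2.
support = [0, 1, 2]

def expected_score(x):
    return sum(score_ess_sup(x, y) for y in support) / len(support)

reports = [0.0, 1.0, 1.9, 2.0, 2.5, 3.0]
best = min(reports, key=expected_score)
assert best == 2.0  # the essential supremum uniquely minimises the score
```

Any report below 2 incurs an infinite expected score, while the strictly increasing g penalises reports above 2; this is precisely the mechanism that makes the positive result for SI 1 lean on infinite expected scores.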
Proposition 4.13. SI 1 can be elicited on M 0 with non M 0 -finite, strictly M 0 -consistent score S ((a, b) where g : R → R is strictly increasing and bounded.
Proof. We have S̄((a, b), F ) = ∞ if F ([a, b]) < 1, and g(b) − g(a) otherwise. Clearly, this is the sum of the strictly consistent scoring functions for the essential supremum and infimum given in (4.9) and (4.10).
Remark 4.14. Since q 0 = (−∞, sup q 0 ] and q 1 = [inf q 1 , ∞), the scores S 0 and S 1 can be directly used to construct strictly consistent exhaustive scoring functions for q 0 and q 1 , respectively, by invoking the revelation principle. Thus, the 0-quantile and 1-quantile are both selectively and exhaustively elicitable. Theorem 3.7 does not apply here as Theorem 3.5 only holds for scoring functions whose expectation is always finite. Moreover, if we were to impose that all scores be finite in expectation, a common assumption in the literature (Brehmer & Strokorb, 2019;Wang & Wei, 2020), Proposition 3.12 would apply, implying the non-elicitability of ess inf and ess sup, recovering a result established in the proof of Ziegel (2016a, Corollary 4.3). Hence, while the result for SI 1 is positive, it is narrowly so, as it leans heavily on the ability to assign infinite expected scores.
Turning now to the case α ∈ (0, 1), we first observe that the shortest α-prediction interval is necessarily bounded: by a simple continuity argument for the probability measure F ∈ M, there is some C > 0 (depending on F ) such that (−C, C) ∈ I α (F ). For F ∈ M and α ∈ (0, 1), we may therefore define SI α (F ) as the set of α-prediction intervals in I α (F ) of minimal length. For many distributions F , such as the uniform distribution on [0, 1], SI α (F ) contains more than a single element, so we again must formally distinguish between exhaustive and selective reports. Importantly, Lemma 4.15, proven in the Appendix, asserts that SI α (F ) is always non-empty, meaning that each distribution has at least one shortest α-prediction interval.
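Empirically, a shortest α-prediction interval of a sample can be found by sliding a window over the order statistics. The following Python sketch is a heuristic illustration for the standard normal; the estimator, sample size, and tolerances are our own choices and not part of the theory developed here.

```python
import numpy as np

rng = np.random.default_rng(3)

def shortest_interval(sample, alpha):
    """Shortest interval containing at least a fraction alpha of the sample:
    slide a window of ceil(alpha * n) order statistics, keep the narrowest."""
    x = np.sort(sample)
    k = int(np.ceil(alpha * len(x)))          # points the window must contain
    widths = x[k - 1:] - x[: len(x) - k + 1]  # width of each candidate window
    i = int(np.argmin(widths))
    return float(x[i]), float(x[i + k - 1])

alpha = 0.9
y = rng.standard_normal(100_000)
a, b = shortest_interval(y, alpha)
assert np.mean((y >= a) & (y <= b)) >= alpha       # calibrated by construction
# For the standard normal, the shortest 90% interval has length about 2 * 1.645:
assert abs((b - a) - 2 * 1.6449) < 0.15
```

For a symmetric unimodal law the shortest interval is the central one, so the window estimate is close to the equal-tailed interval here; for skewed laws the two differ, which is exactly where the elicitability question becomes delicate.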
The following theorem gives a comprehensive negative result of the elicitability of SI α for α ∈ (0, 1).
Theorem 4.16 (Shortest α-prediction interval). (i) For α ∈ (0, 1), the shortest α-prediction interval SI α is not selectively elicitable on any class M containing (a) all distributions with bounded Lebesgue densities, or (b) all distributions on N 0 which are unimodal with mode k for some k ≥ 1. Moreover, SI α cannot even be selectively elicited with any non M-finite score.
While Theorem 4.16 (i) considers the question of selective elicitability of SI α and is in line with the findings of Brehmer and Gneiting (2020), their work contains no counterpart to part (ii), which is dedicated to the exhaustive elicitability of SI α .
Remark 4.17. The classes M specified in (i) and (ii) of Theorem 4.16 are not contained in M α,inc , which makes it hard to thoroughly compare the results of Theorem 4.16 with Theorem 4.2. The elicitability of SI α on M α,inc remains an open problem, though we conjecture a negative result. This should also be compared with the discussion of Brehmer and Gneiting (2020, Condition 3.7 and Theorem 3.8). The class considered in their Theorem 3.8 is also not contained in M α,inc . In particular, they also leave open the problem of elicitability on classes of distributions with strictly positive Lebesgue densities.
Remark 4.18. On any class M satisfying the conditions of Theorem 4.16 (i) and (ii), SI α fails to be both selectively and exhaustively elicitable. This yields an interesting set-valued functional which fails to be elicitable in either sense.
Remark 4.19. One may also consider general prediction regions rather than merely intervals, in which case a natural object to study is the α-prediction region with smallest Lebesgue measure. One can employ a very similar argument to the one used in the proof of Theorem 4.16 (ii) to rule out the exhaustive elicitability (with an M-finite score) of the class of α-prediction regions with minimal Lebesgue measure, denoted by SR α . Indeed, again writing F λ = (1 − λ)δ 0 + λδ 1 , we obtain the corresponding inequality for β ∈ (0, 1/2] and γ ∈ (1/2, 1]; the rest of the argument follows the lines of the proof. Consider V ((x 1 , x 2 ), y) for (x 1 , x 2 ) ∈ U and y ∈ R. Then for any F ∈ M inc,cont and any (x 1 , x 2 ) ∈ U it holds that (x 1 , x 2 ) ∈ SI α (F ) if and only if condition (4.12) holds. Proof. Let F ∈ M inc,cont and (x 1 , x 2 ) ∈ U such that (4.12) holds. V̄ ((x 1 , x 2 ), F )(0) = 0 implies that (x 1 , x 2 ) ∈ I α (F ) and x 2 = Γ α (F )(x 1 ), such that there is no shorter α-prediction interval for F with a lower endpoint of x 1 . The condition V̄ ((x 1 , x 2 ), F )(a) ≤ 0 for all a ∈ R means that all other intervals [x 1 + a, x 2 + a] of the same length either fail to be an α-prediction interval (corresponding to a strict inequality), or are also an α-prediction interval (corresponding to equality); by the same logic as above, there cannot be a shorter one with the same lower endpoint x 1 + a. Hence, we can conclude that (x 1 , x 2 ) ∈ SI α (F ). Vice versa, if (x 1 , x 2 ) ∈ SI α (F ), then (4.12) is immediate.
In closing, we would like to remark that for multivariate observations, a generalisation from prediction intervals to prediction regions is mandatory. If we do not impose any restrictions other than measurability, say, we can still obtain a selective identifiability result in the spirit of Theorem 4.2 (i). For other similar extensions of our results, Remark 4.19 points in a negative direction for the smallest prediction regions. For considerations analogous to the ones in Subsection 4.3, one would need to impose further restrictions on the shape of the regions (e.g. one might consider balls with a certain centre and radius) to ask sensible questions. This is beyond the scope of the current project. On the other hand, the following section elaborates on a complementary direction of Vorob'ev quantiles, which only become interesting in a multivariate / spatial setting.

Vorob'ev quantiles
As Azzimonti, Ginsbourger, Chevalier, Bect, and Richet (2018) point out, the "problem of estimating the set of inputs that leads a system to a particular behavior is common in many applications", and they explicitly mention the fields of reliability engineering and climatology (see references therein). In such a context, the quantity of interest is a random set Y. This set could specify the region of a blackout in a country, the area affected by an avalanche in the mountains or tumorous tissue in the human body. In many situations such as extreme weather events, e.g. floods, storms or heatwaves, the random set Y is specified in terms of an excursion set {z ∈ R d | ξ z ≥ t}, t ∈ R, of some random field (ξ z ) z∈R d . A main task in mathematical statistics is to construct confidence intervals or confidence regions in R d from a random sample. Consequently, such confidence regions may also be considered as random sets. Functionals of interest are various expectations of Y as described in the comprehensive textbook Molchanov (2017), notably, the Vorob'ev expectation (Chevalier, Ginsbourger, Bect, & Molchanov, 2013), the distance average expectation (Azzimonti, Bect, Chevalier, & Ginsbourger, 2016) and conservative estimates based on Vorob'ev quantiles (Azzimonti et al., 2018).
In this section, we shall focus on Vorob'ev quantiles and shall notably establish exhaustive elicitability results and related selective identifiability results under reasonable conditions. In that respect, it generalises and extends the known result that the symmetric difference in measure is an exhaustive consistent scoring function for the median; see Proposition 2.2.8 in Molchanov (2017) and below for details.
To settle some notation, we work again on some suitable complete, atomless probability space (Ω, F, P). Let E be some generic separable Banach space equipped with its Borel σ-algebra, with the Euclidean space as a leading example. Let µ be some σ-finite non-negative reference measure on E and let U be the family of closed subsets of E. We shall use the convention of denoting any subset of E with a capital Latin letter, with the additional distinction that a random set will be denoted with a bold capital letter.
Definition 5.1 (Random closed set). Y : Ω → U is called a random closed set if {ω ∈ Ω | Y(ω) ∩ K ≠ ∅} ∈ F for all compact K ⊆ E. In decision-theoretic terminology, that means that our observation domain O coincides with U. In line with Definition 5.1 and following Molchanov (2017, Chapter 1), we equip U with the σ-algebra generated by the family B(U) := {U ∈ U : U ∩ K ≠ ∅, K ∈ K}, where K is the collection of all compact subsets of E. Consequently, we shall identify the distribution F Y of a random closed set Y with its capacity functional. That is, we set F Y (K) := P(Y ∩ K ≠ ∅) for K ∈ K. As before, let M denote some generic class of distributions K → [0, 1]. While F Y characterises the whole (joint) distribution of a random closed set Y, its restriction to singletons, in some sense, specifies the marginal distributions of Y. This restriction is called the coverage function p Y : E → [0, 1] and is formally defined as p Y (z) := P(z ∈ Y) for z ∈ E. Finally, we can define the Vorob'ev quantiles of closed random sets.
Definition 5.2 (Vorob'ev quantile). The upper excursion set of p Y at level α ∈ [0, 1], Q α (Y) := {z ∈ E | p Y (z) ≥ α}, is called the Vorob'ev α-quantile. The Vorob'ev α-quantile plays a special role in the context of confidence regions. Suppose Y = g α (Y 1 , . . . , Y n ) where Y 1 , . . . , Y n are i.i.d. random vectors in R m following some parametric distribution F (θ), θ ∈ Θ ⊆ R k , and g α : (R m ) n → U is a measurable map. In this context, E clearly corresponds to Θ. Then one can say that Y constitutes an α-confidence region for the parameter θ if θ ∈ Q α (Y) for all θ ∈ Θ.
As p Y is an upper semicontinuous function (Molchanov, 2017), Q α (Y) is a closed set in E. Therefore, in decision-theoretic terminology, we set the exhaustive action domain to be U and the selective action domain to be E. Moreover, for further reference, define the sets Q > α (Y) := {z ∈ E | p Y (z) > α} and Q = α (Y) := {z ∈ E | p Y (z) = α}. Note that the measurability of these sets is implied by the upper semicontinuity (and thus measurability) of p Y . It goes without saying that the quantities Q α , Q > α and Q = α are law-invariant functionals in that they only depend on the distribution F Y of a random closed set Y, and, a fortiori, on its coverage function p Y . Therefore, we shall consider them as maps defined on some generic specification of distributions M.
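To illustrate the objects just defined, the following Python sketch estimates the coverage function and a Vorob'ev quantile by Monte Carlo for a toy family of random closed sets Y = [−R, R] in E = R; the family, the exponential law of R, and all names are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy random closed sets in E = R: Y = [-R, R] with R ~ Exp(1).
radii = rng.exponential(1.0, 100_000)
grid = np.linspace(0.0, 4.0, 401)

# Coverage function p_Y(z) = P(z in Y) = P(R >= |z|), estimated by Monte Carlo.
p_hat = np.array([np.mean(radii >= z) for z in grid])

# Vorob'ev alpha-quantile: the upper excursion set {z | p_Y(z) >= alpha}.
alpha = 0.25
q_hat = grid[p_hat >= alpha]
# Analytically p_Y(z) = exp(-|z|), so Q_alpha = [-ln(1/alpha), ln(1/alpha)]:
assert abs(q_hat.max() - np.log(1 / alpha)) < 0.05
```

By symmetry only the non-negative half-line is gridded; the quantile is obtained by thresholding the estimated coverage function, exactly as in Definition 5.2.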
Proposition 2.2.8 in Molchanov (2017) establishes that the symmetric difference in measure is an M-consistent exhaustive scoring function for the Vorob'ev median Q 1/2 (Y) : M → U. Other Vorob'ev quantiles solve a restricted minimisation problem; see Proposition 4 in Azzimonti et al. (2018). More precisely, for α ∈ [0, 1] and Q α = Q α (Y), the corresponding inequality holds for all measurable sets M ⊆ E such that µ(M ) = µ(Q α ). To arrive at a consistent scoring function for a general α ∈ [0, 1], we first introduce a strict selective M-identification function for Q = α .
is a strict selective M-identification function for Q = α . Moreover, V α is oriented in the sense that, for all u ∈ E and for all F ∈ M, the corresponding inequality holds. Proof. The proof follows directly from the definition of p Y .
This oriented strict M-identification function for Q = α turns out to be the main building block in the construction of an exhaustive M-consistent scoring function for Q α . The rationale is akin to the one presented for the scalar case by Ziegel (2016b) and Dawid (2016), and for the multivariate case in Fissler et al. (2019, Section 3.2).
is an M-consistent exhaustive scoring function for Q α . More precisely, the equality in (5.3) describes the arg min of the expected score for all F ∈ M. Proof. First note that, if we extend S α to the family of measurable subsets of E with finite measure, it holds for any such D ⊆ E that S α (X, Y ) = S α (D, Y ) whenever µ(X △ D) = 0. Now, let X ∈ U 0 such that µ(X △ D) = 0 for some measurable D. Then, invoking Robbins' Theorem (Molchanov, 2017, Theorem 1.5.16), it holds for any M ∈ U 0 and any F ∈ M that S̄ α (M, F ) ≥ S̄ α (Q α (F ), F ), where the last inequality follows from the orientation of V α . Moreover, the inequality is strict if and only if µ(Q > α (F ) \ M ) + µ(M \ Q α (F )) > 0, which establishes the equality in (5.3). Indeed, for any such M , the set D is measurable and the claimed equality can be easily verified. Proposition 5.4, and in particular the equality in (5.3), exactly quantifies by how much the score S α fails to be strictly consistent for Q α . Moreover, in contrast to the symmetric difference in measure in (5.1), the score S α in (5.2) assumes both negative and positive values in general. Imposing the normalisation condition that S α (Y, Y ) = 0 for all Y ∈ U , which implies non-negativity, yields a score equivalent to S α in (5.2). Moreover, one can see that one really retrieves the symmetric difference in measure for α = 1/2. The following theorem states conditions for the strict consistency of S α . In the sequel we denote the closure of any set M ⊆ E with cl(M ) and its interior with int(M ).
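The role of the symmetric difference in measure for the Vorob'ev median (Molchanov, 2017, Proposition 2.2.8) can be illustrated numerically. The following Python sketch uses toy random sets Y = [−R, R] with R exponential (an assumption of ours), for which the expected Lebesgue symmetric difference against a candidate [−t, t] reduces to E[2|t − R|] and is minimised at the median of R.

```python
import numpy as np

rng = np.random.default_rng(4)

def expected_sym_diff(t, radii):
    """Monte Carlo estimate of E[lambda([-t, t] symdiff [-R, R])] = E[2|t - R|]
    for the toy random sets Y = [-R, R] (Lebesgue reference measure)."""
    return float(np.mean(2.0 * np.abs(t - radii)))

# R ~ Exp(1): the Vorob'ev median of Y = [-R, R] is [-median(R), median(R)].
radii = rng.exponential(1.0, 200_000)
ts = np.linspace(0.1, 3.0, 59)
best_t = float(ts[np.argmin([expected_sym_diff(t, radii) for t in ts])])
assert abs(best_t - np.log(2)) < 0.1  # median of Exp(1) is ln 2
```

The minimiser of the expected symmetric difference coincides with the coverage-level-1/2 excursion set, in line with the observation that the symmetric difference in measure is retrieved for α = 1/2.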
(i) For any u ∈ E the elementary score S α,u : U × U → [0, ∞), is a non-negative exhaustive M-consistent scoring function for Q α .
(ii) Let π be a σ-finite non-negative measure on E. Then the map S α,π : U×U → [0, ∞], is a non-negative exhaustive M-consistent scoring function for Q α .
(iii) If M is such that Q α (F ) = cl(Q > α (F )) and Q α (F ) = cl(int(Q α (F ))) for all F ∈ M, then Q α is exhaustively elicitable on M. Moreover, for any σ-finite positive measure π (that is, π assigns positive mass to all open non-empty sets) on E such that E F [π(Y)] < ∞ and π(Q α (F )) < ∞ for all F ∈ M, the restriction of S α,π defined in (5.5) to the family {U ∈ U | U = cl(int(U ))} is a strictly M-consistent exhaustive scoring function for Q α .
To prove this theorem, we will need an auxiliary result that we introduce now.
Lemma 5.6. If for two sets A, B ⊆ E it holds that A = cl(int(A)) and B = cl(int(B)), then A △ B ≠ ∅ implies int(A △ B) ≠ ∅. Proof. Let A, B be as above and assume that there is some x ∈ A △ B. Without loss of generality assume x ∈ A \ B. Since x ∈ A, there is a sequence (a n ) n∈N ⊆ int(A) converging to x. Moreover, since B is closed and x ∉ B, there is some m ∈ N such that for all n ≥ m it holds that a n ∉ B. Thus, for all n ≥ m we have a n ∈ int(A) \ B. Since the interior of a set is the union of all of its open subsets, the claim follows. Proof of Theorem 5.5. The proof of (i) follows along the lines of the proof of Proposition 5.4 upon setting µ = δ u . Note that with this choice of µ, any set is of finite measure. (ii) is a direct consequence of the non-negativity and consistency of S α,u (X, Y ). For (iii), let F ∈ M such that E F [π(Y)] < ∞ and note that for any M ∈ U with π(M ) = ∞, we have S̄ α,π (M, F ) = ∞. Therefore it suffices to consider M ∈ U with π(M ) < ∞, and one can invoke the equality in (5.3).
For any other closed set M ∈ U we therefore obtain that X △ M ≠ ∅. This implies that int(X △ M ) ≠ ∅ and therefore, since π is positive, π(X △ M ) > 0.
The orientation of the selective identification function V α directly implies order-sensitivity in the sense of Nau (1985) or Fissler and Ziegel (2019) with respect to the partial order induced by the subset relation.
Proposition 5.7. Let $\alpha \in [0, 1]$. Then any exhaustive $M$-consistent scoring function $S_{\alpha,\pi}$ for $Q_\alpha$ of the form (5.5) is order-sensitive. That means, for any $F \in M$ and for any $A, B \in U$ such that $Q_\alpha(F) \subseteq A \subseteq B$ or $B \subseteq A \subseteq Q_\alpha(F)$, it holds that $\bar{S}_{\alpha,\pi}(A, F) \leq \bar{S}_{\alpha,\pi}(B, F)$.
It is worthwhile to explore further connections between the mixture representation of consistent scoring functions established for Vorob'ev quantiles in Theorem 5.5 and the corresponding mixture representation in the one-dimensional case, which was introduced and discussed in Ehm et al. (2016). Indeed, the elementary scores introduced there take the form $S_\theta(x, y) = (\mathbb{1}\{y < x\} - \alpha)(\mathbb{1}\{\theta < x\} - \mathbb{1}\{\theta < y\})$, $\theta \in \mathbb{R}$. Of course, we can identify the reals $x, y$ with the corresponding sets $X = [x, \infty)$ and $Y = [y, \infty)$, which shows that we end up with the form given in (5.4). For further analogy, let $Z$ be a real-valued random variable. This induces a random closed set $\mathcal{Z} = [Z, \infty)$. As discussed in Example 3.8 (iv), the elicitability of $q^-_\alpha(Z)$ is equivalent to the exhaustive elicitability of $[q^-_\alpha(Z), \infty)$. One can easily check that, for a positive measure $H$ on $\mathbb{R}$, the resulting score is a strictly consistent exhaustive scoring function for $[q^-_\alpha(Z), \infty)$ if and only if the $\alpha$-quantile of $Z$ is unique. This retrieves the first condition in part (iii) of Theorem 5.5. Note that in the case of a one-dimensional quantile, the second condition is equivalent to $q^-_\alpha(Z) = q^+_\alpha(Z)$, too. However, in the case of Vorob'ev quantiles it is more involved and does not follow from the first condition in general. This structural difference also highlights the importance of a thorough framework for dealing with set-valued functionals.
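The one-dimensional analogy can be checked numerically. The following Python sketch mixes the elementary quantile scores of Ehm et al. (2016) over a fine grid of thresholds $\theta$ with Lebesgue mixing measure and compares the result to the closed-form piecewise-linear quantile score; the grid bounds and test points are illustrative.

```python
alpha = 0.25   # quantile level (illustrative)

def elementary_score(theta, x, y):
    # elementary quantile score of Ehm et al. (2016):
    # S_theta(x, y) = (1{y < x} - alpha) * (1{theta < x} - 1{theta < y})
    return ((y < x) - alpha) * ((theta < x) - (theta < y))

def mixture_score(x, y, thetas, dtheta):
    # numerical mixture over theta with Lebesgue measure dH(theta) = dtheta
    return sum(elementary_score(t, x, y) for t in thetas) * dtheta

def pinball(x, y):
    # closed form of the Lebesgue mixture: the piecewise-linear quantile score
    return ((y < x) - alpha) * (x - y)

dtheta = 1e-3
thetas = [i * dtheta for i in range(-5000, 5000)]  # grid covering [-5, 5)
for x, y in [(1.0, 2.5), (2.0, -1.0), (0.3, 0.3)]:
    print(round(mixture_score(x, y, thetas, dtheta), 6), round(pinball(x, y), 6))
```

For every pair the numerically mixed elementary scores agree with the closed form, which is the one-dimensional counterpart of the Murphy-diagram decomposition discussed above.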

Connections to forecast evaluation in the literature
We would like to close the paper with a comprehensive literature review of different practices of treating forecasts for set-valued functionals. We think that these various perspectives illustrate the advantage offered by our unified theoretical framework for set-valued forecast evaluation, with its thorough distinction between a selective and an exhaustive mode. At the same time, these perspectives offer numerous starting points for further research projects that would uncover their behaviour in terms of the classification into selectively elicitable functionals, exhaustively elicitable functionals, and functionals failing to be elicitable at all.

Statistical forecast evaluation
While Lambert et al. (2008) only consider real-valued functionals, where the distinction between selective and exhaustive scoring functions is superfluous, the influential paper Gneiting (2011a) treats functionals as potentially set-valued; cf. Bellini and Bignozzi (2015); Lambert and Shoham (2009). However, only the concept of selective scoring functions with the corresponding notions of (strict) consistency and elicitability is given. Presumably, the motivation for doing so was the quantile, one of the most prominent examples of a set-valued functional. To the best of our knowledge, forecasts for the quantile are exclusively considered in the selective sense (Gneiting, 2011b; Koenker, 2005; Komunjer, 2005), in which they are elicitable. The reason for not considering them in the exhaustive sense might lie in the impossibility of establishing corresponding elicitability results, of which the first formal proof, to the best of our knowledge, is given in this paper.
The recent preprint Brehmer and Gneiting (2020) considers elicitability for the class of prediction intervals and certain specifications thereof through the lens of the selective notion.

Statistical theory and risk measurement
Quantiles and expectiles (Newey & Powell, 1987) of univariate distributions are well-known (selectively) elicitable functionals. In the literature on quantitative risk management, they are also common scalar risk measures. There are different competing attempts to generalise them to a multivariate setting. We refer the reader to two recent and insightful papers and the corresponding references therein: Hamel and Kostner (2018).

Spatial statistics
As mentioned at the beginning of Section 5, estimating set-valued quantities is a common endeavour in spatial statistics. In that context, forecasts and estimates are commonly considered from what we call an exhaustive angle. Interesting open theoretical questions besides Vorob'ev quantiles concern other functionals of random sets, notably expectations, as presented in the book Molchanov (2017).
One area of particular interest in spatial statistics is meteorology and climatology. In these disciplines, forecast evaluation is more commonly known under the term forecast verification. We refer the reader to the comprehensive overview paper Dorninger et al. (2018). Besides the direct comparison of a set-valued forecast with a set-valued observation as outlined above, more involved situations are also covered. E.g., acknowledging the spatio-temporal structure of many processes such as precipitation, one might evaluate probabilistic forecasts for the marginal distributions of the random field of interest at certain grid points, using the neighbourhood method (see Dorninger et al. (2018) for references). Assessing the entire joint distribution of the random field seems extremely ambitious, and we are unaware of any verification method for it at the moment.

Regression and Machine Learning
Recent literature on isotonic regression embraces the idea of explicitly modelling functionals as set-valued; see Jordan, Mühlemann, and Ziegel (2019) and Mösching and Dümbgen (2020), both of which consider these functionals in the selective sense. Kivaranovic, Johnson, and Leeb (2019) examine how to obtain prediction intervals with deep neural networks. In the area of machine learning, the recent paper Gao, Chen, Chenthamarakshan, and Witbrock (2019) considers set-valued regression as well, however restricted to finite sets. The observations (or response variables) $Y_t$ are finite subsets of some label space $S$, which is assumed to be at most countably infinite. Denoting the regressors by $X_t \in \mathbb{R}^p$, they are interested in finding a function $m: \mathbb{R}^p \to \{I \mid I \subseteq S, |I| < \infty\}$ such that $m(X_t)$ is reasonably close to $Y_t$. However, they do not explicitly specify the loss function they use for the regression problem. In an orthogonal direction, Zaheer et al. (2017) consider the case of set-valued regressors rather than set-valued responses, which does not lead to the question of an appropriate choice of loss function with set-valued arguments.
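Since Gao et al. (2019) do not specify their loss, the following Python sketch shows one natural candidate for comparing finite label sets, the Jaccard distance; this is purely illustrative and is not claimed to be the loss used in that paper.

```python
def jaccard_distance(pred, obs):
    # Jaccard distance 1 - |A ∩ B| / |A ∪ B| between finite label sets;
    # the distance between two empty sets is 0 by convention
    a, b = set(pred), set(obs)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

print(jaccard_distance({1, 2}, {2, 3}))   # one shared label out of three
print(jaccard_distance({1, 2}, {1, 2}))   # perfect prediction
```

Such a loss could then be minimised empirically over a class of set-valued regression functions $m$; its behaviour under the selective/exhaustive classification above appears to be an open question.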

Philosophy
Within a more philosophical strand of literature about credences, i.e., subjective probabilities or degrees of belief, Mayo-Wilson and Wheeler (2016) argue that imprecise credences about the probability of a binary event can be represented as subsets of the unit interval [0, 1]; cf. Seidenfeld, Schervish, and Kadane (2012). They consider numerical accuracy measures, i.e., functions of the set-valued credence and the binary outcome. In this regard, they consider scoring functions taking sets as arguments. However, this ansatz is distinct from our focus: we consider forecasts for functionals which are inherently set-valued and dispense with a discussion of subjective probabilities, whereas they consider set-valued forecasts for a functional which is actually real-valued, namely the probability of a binary event.
where $\mu$ is a non-negative finite measure on $\mathbb{R}$. If, moreover, $\mu$ is positive, $S$ is also strictly $M_{a,\alpha,\mathrm{inc}}$-consistent.
Proof. The proof follows along the lines of the proof of Theorem 4.2.
Remark A.2. A non-negative equivalent version of the score in (A.1) is given in (A.2), where $h: (-\infty, \infty] \to \mathbb{R}$ is a decreasing function given by $h(t) = \mu([t, \infty))$ such that $h(+\infty) = 0$. Besides the obvious interpretation of the first line of (A.2) in the context of mixture representations, it is remarkable to see the structural similarity to a standard quantile score in the second line of (A.2), which is of the form $(\mathbb{1}\{y \leq x\} - \alpha)(h(y) - h(x))$. In fact, if the whole support of $F$ lies above $a$, the right endpoint of the resulting interval is the $\alpha$-quantile. If $F$ assigns positive mass to $(-\infty, a)$, there is a correction term accounting for the fact that the $\alpha$-quantile alone would no longer achieve the required coverage. Note that without the correction term, one is in the setting of Theorem 5 in Gneiting (2011a) with $w(y) = \mathbb{1}\{y \geq a\}$. This would correspond to forecasting the $\alpha$-quantile of $F$ truncated at $a$, i.e., loosely speaking, the point below which $\alpha \cdot 100\%$ of the mass above $a$ is located. The aim, however, is to report a point such that $\alpha \cdot 100\%$ of the whole mass lies between $a$ and the reported point.
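The difference between the truncated quantile and the intended endpoint can be illustrated numerically. The following Python sketch uses a standard normal $F$ with $a = 0$ and $\alpha = 0.25$ (all choices illustrative) and computes both points directly from the CDF.

```python
import math

def Phi(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p, lo=-10.0, hi=10.0):
    # invert the CDF by bisection (sufficient for an illustration)
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

alpha, a = 0.25, 0.0   # coverage level and left interval endpoint (illustrative)
Fa = Phi(a)            # mass below a; here 0.5

# alpha-quantile of F truncated at a: what the uncorrected weighted score elicits
truncated = Phi_inv(Fa + alpha * (1.0 - Fa))
# point with alpha of the *whole* mass between a and it: the intended endpoint
corrected = Phi_inv(Fa + alpha)

print(round(truncated, 3), round(corrected, 3))
```

Since half the mass lies below $a$ here, the corrected endpoint $F^{-1}(F(a) + \alpha)$ lies strictly to the right of the truncated $\alpha$-quantile $F^{-1}(F(a) + \alpha(1 - F(a)))$; the two coincide exactly when $F$ assigns no mass below $a$.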
Proposition A.3. For some $m \in \mathbb{R}$, consider the functional $b_m: F \mapsto d_\alpha(F)(m) \in [0, \infty)$, specifying half of the length of the shortest $\alpha$-prediction interval with midpoint $m$. Then the following assertions hold: the score is an $M_{\mathrm{inc,cont}}$-consistent scoring function for $b_m$. It is strictly $M_{\mathrm{inc,cont}}$-consistent if $\mu$ is positive.
Proof. The proof follows along the lines of the proof of Theorem 4.2.
Again we see the structural similarity to a standard quantile score in the second line. In particular, we see that b m corresponds to the α-quantile of the distribution of |Y − m|.
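The correspondence between $b_m$ and the $\alpha$-quantile of $|Y - m|$ can be checked by simulation. The following Python sketch (Gaussian $Y$, illustrative parameters) estimates $b_m$ as an empirical quantile of $|Y - m|$ and verifies the coverage of the interval $[m - b_m, m + b_m]$.

```python
import random

random.seed(7)
alpha, m = 0.8, 1.0   # illustrative level and midpoint
ys = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# b_m is the alpha-quantile of |Y - m|: the smallest b with P(|Y - m| <= b) >= alpha
abs_dev = sorted(abs(y - m) for y in ys)
b_m = abs_dev[int(alpha * len(abs_dev))]

# the interval [m - b_m, m + b_m] then has empirical coverage (roughly) alpha
covered = sum(m - b_m <= y <= m + b_m for y in ys) / len(ys)
print(round(b_m, 3), round(covered, 3))
```

The empirical coverage of $[m - b_m, m + b_m]$ matches $\alpha$ up to Monte Carlo error, in line with the identification of $b_m$ as a quantile of $|Y - m|$.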

B. Injectivity Results for Prediction Interval Variants
For simplicity, let us consider the class $M_c \subseteq M_0$ of probability measures with single-valued quantiles in the range (0, 1), i.e., measures supported on an interval (potentially all of $\mathbb{R}$) whose CDFs are strictly increasing on that interval. We first observe that if a functional value $T(F)$ uniquely determines a dense set of quantiles of $F$, then $T$ must be injective.
Lemma B.1. For some set $W$, let $T: M_c \to 2^W$ and let $Q \subseteq (0, 1)$ be dense. If for all $F \in M_c$ the value of $T(F)$ uniquely determines the values of $q_\beta(F)$ for all $\beta \in Q$, then $T$ is injective.
Proof. By definition of $M_c$, we have $q_\beta(F) = F^{-1}(\beta)$ for all $F \in M_c$ and $\beta \in (0, 1)$. As $F$ is continuous and strictly monotone, its inverse is also continuous on (0, 1) and strictly monotone. Thus, specifying the values of $F^{-1}$ on a dense subset of (0, 1) uniquely specifies $F^{-1}$ and thus $F$.
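The argument can be illustrated numerically: knowing $F^{-1}$ on ever finer dyadic grids pins it down everywhere. The following Python sketch (standard normal $F$, bisection-based CDF inversion; all choices illustrative) measures how well values of $F^{-1}$ between known dyadic quantiles are recovered by linear interpolation.

```python
import math

def Phi(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p, lo=-10.0, hi=10.0):
    # invert the CDF by bisection (illustration only)
    for _ in range(100):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if Phi(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

def interp_error(depth):
    # known quantiles on the dyadic grid k/2^depth, k = 1, ..., 2^depth - 1
    ps = [k / 2**depth for k in range(1, 2**depth)]
    qs = [Phi_inv(p) for p in ps]
    # maximal error of linear interpolation of F^{-1} at cell midpoints
    err = 0.0
    for j in range(1, len(ps)):
        p_mid = (ps[j - 1] + ps[j]) / 2.0
        approx = (qs[j - 1] + qs[j]) / 2.0
        err = max(err, abs(approx - Phi_inv(p_mid)))
    return err

print(interp_error(3), interp_error(6))
```

The maximal midpoint error shrinks as the dyadic grid is refined, in line with the continuity argument in the proof.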
For $\alpha \in (0, 1]$ we will now define the collection $C_\alpha(F)$ of all $\alpha$-prediction sets of $F$, as well as unions of two prediction intervals $I^2_\alpha(F)$ and wrapped intervals $I^w_\alpha(F)$. Recall the definition $\bar{U} := \{(a, b) \in \bar{\mathbb{R}}^2 \mid a \leq b\}$, where $\bar{\mathbb{R}} := \mathbb{R} \cup \{-\infty, \infty\}$. In what follows, we will overload notation and interpret $I \in \bar{U}$ as a closed interval, so for example if $I = (a, b)$ we have $F(I) = F([a, b])$.
We first show injectivity of $I^2_\alpha$, and thus of $C_\alpha$. We will routinely rely on the bijection between the above $I^2_\alpha$ and the functional $I^2_{=\alpha}: M_c \to 2^{B(\mathbb{R})}$ defined by $I^2_{=\alpha}(F) = \{(I_1, I_2) \in I^2_\alpha : F(I_1 \cup I_2) = \alpha\}$. It is clear that $I^2_\alpha(F)$ can be constructed from $I^2_{=\alpha}(F)$ and vice versa, by adding or removing nested intervals. In particular, $I^2_\alpha$ is injective if and only if $I^2_{=\alpha}$ is.
Proof. We will instead show injectivity of the functional $I^2_{=\alpha}$. Let $F \in M_c$, and consider first the case $\alpha \leq 1/2$. Given $I^2_{=\alpha}(F)$, we will show how to compute the quantiles $q_{k,n} := F^{-1}(k\alpha/2^n)$ for all $k \in \mathbb{N}$, $n \in \mathbb{N}_0$ such that $k\alpha/2^n \in (0, 1)$, at which point the result will follow from Lemma B.1. We first show the result for $k \leq 2^n$; the other values will follow from the observation that $F((q_{k,n}, q_{k+2^n,n}]) = \alpha$.
We conclude that, for all $0 \leq b < 1/(2\pi n)$, the perturbed distribution yields the same value of $I^w_{=\alpha}$, so $I^w_{=\alpha}$ fails to be injective for rational $\alpha$. Since $I^w_\alpha$ is in turn determined by $I^w_{=\alpha}$, it also fails to be injective for rational $\alpha$.
The existence of $F_1, F_2 \in M_l \cap M_{\mathrm{inc,cont}}$ as in assumption (d) for any $(l^*, b^*) \in \mathbb{R} \times (0, \infty)$, as well as the surjectivity of $T_l$ implicitly implied via (a) and (b), yield that $h_{22} \equiv 0$. Furthermore, the existence of $F_3 \in M_l \cap M_{\mathrm{inc,cont}}$ as in assumption (d) for any $(l^*, b^*) \in \mathbb{R} \times (0, \infty)$ together with the surjectivity of $T$ implies that $h_{12} \equiv 0$. Finally, the existence of $F_4 \in M_l \cap M_{\mathrm{inc,cont}}$ as assumed implies $h_{21} \equiv 0$. Now, for any $x \in \mathbb{R} \times (0, \infty)$ let $F \in M_l \cap M_{\mathrm{inc,cont}}$ be such that $l(F) = x_1$. Since $V_l$ is a strict identification function for $l$, $V_l(x_1, F) = 0$ and we obtain $\partial_2 h_{11}(x) = 0$. Therefore there is a function $g: \mathbb{R} \to \mathbb{R}$ such that $h_{11}(x) = g(x_1)$ for all $x \in \mathbb{R} \times (0, \infty)$. In summary, we obtain a representation which implies that $\bar{S}(\cdot, F)$ is constant in $x_2$, and hence $S$ cannot be strictly consistent for $T_l$.
Proof of Proposition 4.12. Part (i) follows from the fact that $V_m$ is a strict $M$-identification function for $m$ and from the strict monotonicity and continuity of $F \in M \cap M_{\mathrm{inc,cont}}$. For (ii), suppose $S$ is a strictly $M \cap M_{\mathrm{inc,cont}}$-consistent scoring function for $T_m$ such that $\bar{S}(\cdot, F)$ is twice differentiable on $\mathbb{R} \times (0, \infty)$ for all $F \in M \cap M_{\mathrm{inc,cont}}$. Using exactly the same arguments as in the proof of Proposition 4.10, we can derive the existence of a function $h: \mathbb{R} \times (0, \infty) \to \mathbb{R}^{2 \times 2}$ with differentiable components $h_{ij}$, where $V$ is the strict $M \cap M_{\mathrm{inc,cont}}$-identification function from part (i). Again, for any $x \in \mathbb{R} \times (0, \infty)$ and $F \in M \cap M_{\mathrm{inc,cont}}$, the Hessian $\nabla^2_x \bar{S}(x, F)$ must be symmetric and, for $x = T_m(F)$, it must be positive semidefinite. We obtain the corresponding conditions for $i = 1, 2$, and a further identity follows from the symmetry of the Hessian. The existence of $F_1, F_2$ as in assumption (d) for any $(m^*, b^*) \in \mathbb{R} \times (0, \infty)$ and the surjectivity of $T_m$ implicitly given via (a) and (b) imply that $h_{22} \equiv 0$. Furthermore, the existence of $F_3$ as in assumption (d) together with the surjectivity of $T_m$ implies that $h_{12} \equiv 0$. Finally, the existence of $F_4$ as assumed implies $h_{21} \equiv 0$. The rest of the argument follows as in the proof of Proposition 4.10.
Proof of Lemma 4.15. For $\alpha = 1$, note that $SI_1(F) = \{(\mathrm{ess\,inf}(F), \mathrm{ess\,sup}(F))\}$. Now let $\alpha \in (0, 1)$. First note that $SI_\alpha(F) \neq \emptyset$ if and only if the function $h(x) := \Gamma_\alpha(F)(x) - x$ attains its infimum over the interval $P := \{x \in \mathbb{R} \mid F(x-) \leq 1 - \alpha\}$, where we note that $P$ is closed, bounded from above, and unbounded from below. Assume that $\Gamma_\alpha(F)$ is continuous. Then $h$ is continuous, too. Since $\Gamma_\alpha(F)(x) \geq x$, we have $m := \inf_{x \in P} h(x) \geq 0$. The tightness of $F$ implies that $m < \infty$. By definition of the infimum, there is a sequence $(x_n)_{n \in \mathbb{N}} \subseteq P$ with $h(x_n) \to m$. If this sequence is bounded from below, there is a convergent subsequence $(x_{n_k})_{k \in \mathbb{N}}$ with limit $x \in P$, and the continuity of $h$ implies that $h(x_{n_k}) \to h(x)$; thus the infimum is attained. If $(x_n)_{n \in \mathbb{N}}$ is not bounded from below, there is a divergent subsequence $(x_{n_l})_{l \in \mathbb{N}}$. But then Lemma 4.1 (iv) implies that $h(x_{n_l}) \to \infty$, contradicting $h(x_{n_l}) \to m < \infty$. Finally, assume that $\Gamma_\alpha(F)$ fails to be right-continuous, so that $h$ is not continuous either. $h$ is discontinuous at $x$ if and only if $\Gamma_\alpha(F)$ jumps at $x$. Jumps of $\Gamma_\alpha(F)$ can be caused by two situations, namely jumps of $F$ or flat spots of $F$. Both can occur at most countably many times (see, e.g., Theorem 2.1 in Shorack (2006)), which means that $\Gamma_\alpha(F)$ can have at most countably many jumps. Let $I = \{1, 2, \ldots, n_0\}$ for some $n_0 \in \mathbb{N}$ or $I = \mathbb{N}$ be an index set, let $(a_i)_{i \in I}$ be the collection of points where $\Gamma_\alpha(F)$ jumps, and let $(j_i)_{i \in I}$ be the corresponding jump sizes. For all $i \in I$ and any $\varepsilon \in (0, j_i/2]$ it holds that $\Gamma_\alpha(F)(a_i + \varepsilon) \geq \Gamma_\alpha(F)(a_i) + j_i$ and consequently that $h(a_i + \varepsilon) > h(a_i)$. Thus, if $h$ attains its minimum, it is not in any of the intervals $(a_i, a_i + j_i/2)$, $i \in I$. Now define the sequence of functions $h_i$ in the following way: set $h_0 := h$. For any $i \in I$, if $h_{i-1}$ is continuous at $a_i$, set $h_i = h_{i-1}$; otherwise, replace $h_{i-1}$ on $[a_i, a_i + j_i/2)$ by the affine function $x \mapsto u_i x + v_i$, where $u_i = 2(h(a_i + j_i/2) - h(a_i))/j_i$ and $v_i = h(a_i) - u_i a_i$.
It can easily be verified that $h_i$ is continuous on $[a_i, a_i + j_i/2)$ and moreover that $h_i(x) > h_i(a_i)$ for all $x \in (a_i, a_i + j_i/2)$. Therefore, if $h_{i-1}$ attains its infimum at $x^* \in \mathbb{R}$, so does $h_i$, and $h_{i-1}(x^*) = h_i(x^*)$. The pointwise limit $h^*$ of $(h_i)_{i \in \mathbb{N}}$ is a continuous function that, by the earlier argument, attains its infimum over $P$. By the construction of the functions $h_i$, $h$ also attains its infimum over $P$, at the same point as $h^*$.
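For a continuous, strictly increasing $F$ the existence argument becomes a plain minimisation: $\Gamma_\alpha(F)(x) = F^{-1}(F(x) + \alpha)$ and $h(x) = \Gamma_\alpha(F)(x) - x$ is continuous, so a grid search suffices. The following Python sketch (standard normal $F$, $\alpha = 0.9$; all choices illustrative) recovers the well-known symmetric shortest interval.

```python
import math

def Phi(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p, lo=-10.0, hi=10.0):
    # bisection; for p >= 1 this returns (roughly) the upper search bound
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

alpha = 0.9

def h(x):
    # for continuous, strictly increasing F: Gamma_alpha(F)(x) = F^{-1}(F(x) + alpha),
    # so h(x) is the length of the alpha-interval with left endpoint x
    return Phi_inv(Phi(x) + alpha) - x

# grid search over candidate left endpoints (infeasible x get a huge h value)
xs = [-3.0 + i * 1e-3 for i in range(2000)]
left = min(xs, key=h)
print(round(left, 3), round(left + h(left), 3))
```

The minimising left endpoint is approximately $-\Phi^{-1}(0.95) \approx -1.645$, so the shortest 90%-interval is symmetric about zero, as expected for a symmetric unimodal density.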
developed. This work was supported in part by the U.S. National Science Foundation under Grant No. 1657598.