Regression Diagnostics meets Forecast Evaluation: Conditional Calibration, Reliability Diagrams, and Coefficient of Determination

Model diagnostics and forecast evaluation are two sides of the same coin. A common principle is that fitted or predicted distributions ought to be calibrated or reliable, ideally in the sense of auto-calibration, where the outcome is a random draw from the posited distribution. For binary responses, this is the universal concept of reliability. For real-valued outcomes, a general theory of calibration has been elusive, despite a recent surge of interest in distributional regression and machine learning. We develop a framework rooted in probability theory, which gives rise to hierarchies of calibration, and applies to both predictive distributions and stand-alone point forecasts. In a nutshell, a prediction - distributional or single-valued - is conditionally T-calibrated if it can be taken at face value in terms of the functional T. Whenever T is defined via an identification function - as in the cases of threshold (non) exceedance probabilities, quantiles, expectiles, and moments - auto-calibration implies T-calibration. We introduce population versions of T-reliability diagrams and revisit a score decomposition into measures of miscalibration (MCB), discrimination (DSC), and uncertainty (UNC). In empirical settings, stable and efficient estimators of T-reliability diagrams and score components arise via nonparametric isotonic regression and the pool-adjacent-violators algorithm. For in-sample model diagnostics, we propose a universal coefficient of determination, $$\text{R}^\ast = \frac{\text{DSC}-\text{MCB}}{\text{UNC}},$$ that nests and reinterprets the classical $\text{R}^2$ in least squares (mean) regression and its natural analogue $\text{R}^1$ in quantile regression, yet applies to T-regression in general, with MCB $\geq 0$, DSC $\geq 0$, and $\text{R}^\ast \in [0,1]$ under modest conditions.


Introduction
Predictive distributions ought to be calibrated or reliable (Dawid, 1984;Gneiting and Katzfuss, 2014). More generally, statistical models ought to provide plausible probabilistic explanations of observations, be it in-sample or out-of-sample, ideally in the sense of auto-calibration, meaning that the outcomes are indistinguishable from random draws from the posited distributions. For binary outcomes, auto-calibration is the universal standard of reliability. In the general case of linearly ordered, real-valued outcomes, weaker, typically unconditional facets of calibration have been studied, with probabilistic calibration, which corresponds to the uniformity of the probability integral transform (PIT; Dawid, 1984;Diebold et al., 1998), being the most popular notion. Recently, conditional notions have been proposed (Mason et al., 2007;Bentzien and Friederichs, 2014;Strähl and Ziegel, 2017), and there has been a surge of attention to calibration in the machine learning community, where the full conditional distribution of a response, given a feature vector, is of increasing interest, as exemplified by the work of Guo et al. (2017), Kuleshov et al. (2018), Song et al. (2019), Gupta et al. (2020), Zhao et al. (2020), Sahoo et al. (2021) and Roelofs et al. (2022). While in the literature on forecast evaluation predictive performance is judged out-of-sample, calibration is relevant in regression diagnostics as well, where in-sample goodness-of-fit is assessed via test statistics or criteria of R 2 type. In many ways, the complementary perspectives of model diagnostics and forecast evaluation are two sides of the same coin.
In this paper, we strive to develop a theory of calibration for real-valued outcomes that complements the aforementioned strands of literature. Starting from measure theoretic and probabilistic foundations, we develop practical tools for visualizing, diagnosing and testing calibration, for both in-sample and out-of-sample settings, and applying both to full distributions and functionals thereof. Section 2 develops an overarching, rigorous theoretical framework in a general population setting, where we establish hierarchical relations between notions of unconditional and conditional calibration, with Figure 1 summarizing key results. We reduce a posited distribution to a typically single-valued statistical functional, T, and define conditional calibration in terms of said functional. While in general auto-calibration fails to imply calibration in terms of a functional, we prove this implication for functionals defined via an identification function, such as event probabilities, means, quantiles, and generalized quantiles. We plot recalibrated values of the functional against posited values to obtain T-reliability diagrams and revisit extant score decompositions to define nonnegative measures of miscalibration (MCB), discrimination (DSC) and uncertainty (UNC), for which the mean score satisfiesS = MCB − DSC + UNC. These considerations continue to apply when T-regression is studied as an end in itself, such as in mean (least squares) and quantile regression. In this setting, Theorem 2.26 establishes a general link between unconditional calibration and canonical score optimization, which nests classical results in least squares regression and the partitioning inequalities of quantile regression (Koenker and Bassett, 1978, Theorem 3.4).
In Section 3, we turn to empirical settings and statistical inference. We adopt and generalize the approach of  that uses isotonic regression and the pool-adjacentviolators (PAV) algorithm (Ayer et al., 1955) to obtain consistent, optimally binned, reproducible and PAV based (CORP) estimates of T-reliability diagrams and score components, along with uncertainty quantification via resampling. As opposed to extant estimators, the CORP approach yields non-decreasing reliability diagrams and guarantees the nonnegativity of the estimated MCB and DSC components. The regularizing constraint of isotonicity avoids artifacts and overfitting. For in-sample model diagnostics, we introduce a generalized coefficient of determination R * that links to skill scores, and nests both the classical variance explained or R 2 in least squares regression (Kvålseth, 1985), and its natural analogue R 1 in quantile regression (Koenker and Machado, 1999). Subject to modest conditions R * ∈ [0, 1], with values of 0 and 1 indicating uninformative and immaculate fits, respectively.
In forecast evaluation, reliability diagrams and score components serve to diagnose and quantify performance on test samples. The most prominent case arises when T is the mean functional and performance is assessed by the mean squared error (MSE). As a preview of the diagnostic tools developed in this paper, we assess point forecasts by Tredennick et al. (2021) of (log transformed) butterfly population size from a ridge regression and a null model. The CORP mean reliability diagrams and MSE decompositions in Figure 2 show that, while both models are reliable, ridge regression enjoys considerably higher discrimination ability.
The paper closes in Section 4, where we discuss our findings and provide a roadmap for follow-up research. While  introduced the CORP approach in the nested case of probability forecasts for binary outcomes, the setting of real-valued outcomes treated in this paper is far more complex, as it necessitates the consideration of statistical functionals in general. Throughout, we link the traditional case of regression diagnostics and (stand-alone) point forecast evaluation, where functionals such as conditional means, moments, quantiles or expectiles are modeled and predicted, to model diagnostics and forecast evaluation in the fully distributional setting (Gneiting and Katzfuss, 2014;Hothorn et al., 2014). Appendices A-C include material of more specialized or predominantly technical character.
2 Notions of calibration, reliability diagrams, and score decompositions Generally, we use the symbol L to denote a generic conditional or unconditional law or distribution, and we identify distributions with their cumulative distribution functions (CDFs). We write N (m, c 2 ) to denote a normal distribution with mean m and variance c 2 , and we let ϕ and Φ denote the density and the CDF, respectively, of a standard normal variable.

Prediction spaces and prequential principle
We consider the joint law of a posited distribution and the respective outcome in the technical setting of Gneiting and Ranjan (2013). Specifically, let (Ω, A, Q) be a prediction space, i.e., a probability space where the elementary elements ω ∈ Ω correspond to realizations of the random triple (F, Y, U ), where Y is the real-valued outcome, F is a posited distribution for Y in the form of a CDF, and U is uniformly distributed on the unit interval. Statements involving conditional or unconditional distributions, expectations, or probabilities, generally refer to the probability measure Q, which specifies the joint distribution of the forecast F and the outcome Y . The uniform random variable U allows for randomization. Throughout, we assume that U is independent of the σalgebra generated by the random variable Y and the random function F in the technical sense detailed prior to Definition 2.6 in Strähl and Ziegel (2017). Let A 0 ⊆ A denote the forecaster's information basis, i.e., a sub-σ-algebra such that F is measurable with respect to A 0 . Then F is ideal relative to A 0 (Gneiting and Ranjan, 2013) if F (y) = Q (Y ≤ y | A 0 ) almost surely, for all y ∈ R.
If F is ideal relative to some sub σ-algebra A 0 , then it is auto-calibrated (Tsyplakov, 2013) in the sense that F (y) = Q (Y ≤ y | F ) almost surely, for all y ∈ R, which is equivalent to being ideal relative to the information basis σ(F ) ⊆ A 0 . Extensions to prediction spaces with tuples (Y, F 1 , . . . , F k , U ) that allow for multiple CDF-valued forecasts F 1 , . . . , F k with associated information bases A 1 , . . . , A k ⊂ A are straightforward.
Example 2.1 (Gneiting and Ranjan (2013); Pohle (2020)). Conditionally on a standard normal variate µ, let the outcome Y be normal with mean µ and variance 1. Then the perfect forecast F 1 = N (µ, 1) is ideal relative to the information basis A 1 = σ(µ) generated by µ. The unconditional forecast F 2 = N (0, 2) agrees with the marginal distribution of the outcome Y and is ideal relative to the trivial σ-algebra A 2 = {∅, Ω}.
More elaborate notions of prediction spaces are feasible. In particular, one might include a covariate or feature vector Z and consider random tuples of the form (Z, F, Y, U ). Indeed, the transdisciplinary scientific literature has considered reliability relative to covariate information, under labels such as strong (Van Calster et al., 2016) or individual (Chung et al., 2021;Zhao et al., 2020) calibration. We refrain from doing so, as our simple setting adheres to the prequential principle posited by Dawid (1984), according to which predictive performance needs to be evaluated on the basis of the tuple (F, Y ) only, without consideration of the forecast-generating mechanism. The aforementioned extensions become critical in studies of cross-calibration (Strähl and Ziegel, 2017), stratification (Ehm and Ovcharov, 2017;Ferro et al., 2020;Allen, 2021), sensitivity (Fissler and Pesenti, 2022), and fairness (Pleiss et al., 2017;Mitchell et al., 2021).

Traditional notions of unconditional calibration
Let us recall the classical notions of calibration of predictive distributions for real-valued outcomes. In order to do so, we define the probability integral transform (PIT) of the CDF-valued random quantity F , where F (y−) = lim x↑y F (x) denotes the left-hand limit of F at y ∈ R, and the random variable U is standard uniform and independent of F and Y . The PIT of a continuous CDF F is simply Z F = F (Y ). The predictive distribution F is probabilistically calibrated or PIT calibrated if Z F is uniformly distributed on the unit interval. The use of the probabilistic calibration criterion was suggested by Dawid (1984) and popularized by Diebold et al. (1998), who proposed the use of PIT histograms as a diagnostic tool. Importantly, in continuous settings probabilistic calibration implies that prediction intervals bracketed by quantiles capture the outcomes at the respective nominal level. Furthermore, the predictive distribution F is marginally calibrated (Gneiting et al., 2007) if Thus, for a marginally calibrated predictive distribution, the frequency of (not) exceeding a threshold value matches the posited unconditional probability.
Example 2.2. In the setting of Example 2.1, let η attain the values ±η 0 with equal probability, independently of µ and Y , where η 0 > 0. Then the unfocused forecast with CDF is probabilistically calibrated, but fails to be marginally calibrated (Gneiting et al., 2007). Similarly, let δ take the values ±δ 0 with equal probability, independently of µ and Y , where δ 0 ∈ (0, 1). Then the lopsided forecast F with density is marginally calibrated, but fails to be probabilistically calibrated. For details see Appendix A.2.
It is well known that an ideal forecast is both probabilistically calibrated and marginally calibrated (Gneiting and Ranjan, 2013, Theorem 2.8;Song et al., 2019, Theorem 1). Reformulated in terms of auto-calibration the following holds.
Theorem 2.3. Auto-calibration implies marginal and probabilistic calibration.
Auto-calibration thus is a stronger requirement than either marginal or probabilistic calibration, and the latter are logically independent. However, in the special case of a binary outcome, probabilistic calibration and auto-calibration are equivalent (Gneiting and Ranjan, 2013, Theorem 2.11), and auto-calibration serves as a universal notion of calibration. In the case of three or more distinct outcomes, Gneiting and Ranjan (2013) conjectured that auto-calibration is stronger than simultaneous marginal and probabilistic calibration. We resolve and prove their conjecture within the following example.
Example 2.4. We begin by considering continuous CDFs and then discuss a discrete example with three distinct outcomes only.
(b) For a full resolution of the aforementioned conjecture by Gneiting and Ranjan (2013), we fix µ = 0 and replace the intervals by fixed numbers y 1 < y 2 < y 3 . Thus, F assigns mass p j to y j , whereas the event Y = y j realizes with probability q j for j = 1, 2, and 3. The forecast remains probabilistically and marginally calibrated, and fails to be auto-calibrated.

Conditional calibration
While checks for probabilistic calibration have become a cornerstone of predictive distribution evaluation (Dawid, 1984;Diebold et al., 1998;Gneiting et al., 2007), both marginal and probabilistic calibration concern unconditional facets of predictive performance, which is increasingly being considered insufficient (e.g., Levi et al., 2022). Stronger conditional notions of calibration, which condition on facets of the predictive distribution, have emerged in various strands of the scientific literature. For example, Mason et al. (2007) used conditional (non) exceedance probabilities (CEP) to assess the calibration of ensemble weather forecasts. These were used by Held et al. (2010) and Strähl and Ziegel (2017) to derive calibration tests, which operate under the hypothesis that the forecast F is CEP calibrated in the sense that where q − α (F ) = inf{x ∈ R : F (x) ≥ α} denotes the (lower) α-quantile of F . Similarly,  introduced the notion of a threshold calibrated forecast F , which stipulates that Essentially, CEP calibration is a conditional version of probabilistic calibration, and threshold calibration is conditional marginal calibration.
Theorem 2.5. CEP calibration implies probabilistic calibration, and threshold calibration implies marginal calibration.
Proof. Immediate by taking unconditional expectations, as noted by .
Variants of these concepts can be found scattered in the literature. Notably, Sahoo et al. (2021) introduce a notion of calibration for continuous predictive distributions, which requires that Q(Z F ≤ α | F (t)) = α almost surely, for all α ∈ (0, 1), t ∈ R.
As in Figure 1, we refer to this property as strong threshold calibration. The notion is weaker than auto-calibration, but implies both CEP calibration and threshold calibration, subject to conditions that we discuss below. We proceed to the general notion of conditional T-calibration in terms of a statistical functional T as introduced by Arnold (2020) and Ferro et al. (2020). Other authors (Pohle, 2020;Krüger and Ziegel, 2021) refer to this notion or special cases thereof as auto-calibration with respect to T. A statistical functional on some class F of probability measures is a measurable function T : F → T into a (typically, finite-dimensional) space T with Borel-σ-algebra B(T ). Technically, we work in the prediction space setting under a natural measurability condition that is not restrictive (Fissler and Holzmann, 2022).
Assumption 2.6. The class F and the functional T are such that F ∈ F , the mapping T(F ) : (Ω, A) → (T , B(T )) is measurable, and L (Y | T(F )) ∈ F almost surely.
Essentially, under a T-calibrated predictive distribution F , we can take T(F ) at face value. Perhaps surprisingly, an auto-calibrated forecast is not necessarily T-calibrated, as noted by Arnold (2020, Section 3.2). For a simple counterexample, consider the perfect forecast from Example 2.1, which fails to be T-calibrated when T is the variance, the standard deviation, the interquartile range, or a related measure of dispersion.
We proceed to show that this startling issue does not occur with identifiable functionals, i.e., functionals induced by an identification function (Theorem 2.11). Similar to the classical procedure in M -estimation (Huber, 1964), an identification function weighs negative values in the case of underprediction against positive values in the case of overprediction, and the corresponding functional maps to the possibly set-valued argument at which an associated expectation changes sign. Following Jordan et al. (2022), a measurable function V : R × R → R is an identification function if V (·, y) is increasing and left-continuous for all y ∈ R. We operate under Assumption 2.6 with the implicit understanding that V (x, ·) is quasi-integrable with respect to all F ∈ F for all x ∈ R. Then, for any probability measure F in the class F , the functional T(F ) induced by V is defined as where the lower and upper bounds are given by the random variables An identifiable functional T is of singleton type if T(F ) is a singleton for every F ∈ F . Otherwise, T is of interval type. Table 1 lists key examples, such as threshold-defined event probabilities, quantiles, expectiles, and moments. The definition of the Huber functional involves the clipping function κ a,b (t) = max(min(t, b), −a) with parameters a, b > 0 (Taggart, 2022). In the limiting cases as a = b → 0 and a = b → ∞, the Huber functional recovers the α-quantile (q α ) and the α-expectile (e α ), respectively. For identifiable functionals we can define an unconditional notion of T-calibration as well. Note that in contrast to traditional settings, where F is fixed, we work in the prediction space setting, where F is a random CDF. In principle, the subsequent Definitions 2.9 and 2.24 depend on the choice of the identification function. However, as we demonstrate in Appendix A.4, the following condition ensures that the identification function is unique, up to a positive constant, so that ambiguities are avoided. Table 1: Key examples of identifiable functionals with associated parameters, identification function, and generic type. For a similar listing see Table 1 in Jordan et al. (2022).

Functional
Parameters Identification function Type Assumption 2.8. The identification function V induces the functional T on a convex class F 0 ⊇ F of probability measures, which contains the Dirac measures δ y for all y ∈ R. The identification function V is (i) of prediction error form, i.e., there exists an increasing, left-continuous function v : The examples in Table 1 all satisfy Assumption 2.8.
Definition 2.9. Suppose that the functional T is generated by an identification function V , and let Assumptions 2.6 and 2.8 hold. Then the predictive distribution F is unconditionally For the interval-valued α-quantile functional, q α (F ) = [q − α (F ), q + α (F )], condition (8) reduces to the traditional unconditional coverage condition with the latter being equivalent to Q(Y < q + α (F )) ≤ α. Probabilistic calibration implies unconditional α-quantile calibration at every level α ∈ (0, 1), as hinted at by Kuleshov et al. (2018, Section 3.1), and proved in our Appendix A.5. Under technical assumptions, condition (8) simplifies to with the classical unbiasedness condition E Q [m 1 (F )] = E Q [Y ] arising in the case of the mean or expectation functional.
Example 2.10. Let T be the mean functional or a quantile. Then the unfocused forecast from Example 2.2 is unconditionally T-calibrated, but fails to be conditionally T-calibrated. For details see Figure 4 and Appendix A.1.
Importantly, for any identifiable functional auto-calibration implies both unconditional and conditional T-calibration, as we demonstrate now. Note that Assumption 2.6 is a minimal condition, as it is required to define conditional T-calibration in the first place.
Theorem 2.11. Suppose that the functional T is generated by an identification function and Assumption 2.6 holds. Then auto-calibration implies conditional T-calibration, and subject to Assumption 2.8 conditional T-calibration implies unconditional T-calibration.
Proof. The statements in this proof are understood to hold almost surely. By Theorem 4.34 and Proposition 4.35 of Breiman (1992) in concert with auto-calibration, F is a regular conditional distribution of Y given F , and we conclude that Furthermore, a regular conditional distribution F T = L(Y | T(F )) of Y given T(F ) exists, and the tower property of conditional expectation implies that , where the boundaries are random variables. The proof of the first part is complete if we can show that T − (F T ) = T − (F ) and T + (F T ) = T + (F ). Let ε > 0. By the definition of T + (F ), we know that V (T + (F ), y) dF (y) ≤ 0 and V (T + (F ) + ε, y) dF (y) > 0. Using nested conditional expectations as above, the same inequalities hold almost surely when integrating with respect to F T . Hence, by the definition of T + (F T ), we obtain T + (F ) ≤ T + (F T ) < T + (F ) + ε. An analogous argument shows that This completes the proof of the first part and shows that F is conditionally T-calibrated.
Finally, if F is conditionally T-calibrated, unconditional T-calibration follows by taking nested expectations in the terms in the defining inequalities.
An analogous result is easily derived for CEP calibration.
Proof. It holds that F ) almost surely for α ∈ (0, 1). As F is a version of L(Y | F ), the nested expectation equals α almost surely by Proposition 2.1 of Rüschendorf (2009), which implies CEP calibration.
When evaluating full predictive distributions, it is natural to consider families of functionals as in the subsequent definition, where part (a) is compatible with the extant notion in (5).
(b) quantile calibrated if it is conditionally q α -calibrated for all α ∈ (0, 1); (c) expectile calibrated if it is conditionally e α -calibrated for all α ∈ (0, 1); (d) moment calibrated if it is conditionally n-th moment calibrated for all integers n = 1, 2, . . . While CEP, quantile, and threshold calibration are closely related notions, they generally are not equivalent. For illustration, we consider predictive CDFs in the spirit of Example 2.4.
Under the following conditions CEP, quantile, and threshold calibration coincide.
Assumption 2.15. In addition to Assumption 2.6 for quantiles and threshold (non) exceedances at all levels and thresholds, respectively, let the following hold.
(i) The CDFs in the class F are continuous and strictly increasing on a shared support interval.
(ii) There exists a countable set G ⊆ F such that Q(F ∈ G) = 1.
Theorem 2.16. Proof. By Assumption 2.15(i) the CDFs F ∈ F are invertible on the common support with the quantile function α → q − α (F ) as inverse. Hence, for every α ∈ (0, 1) the functional q α is of singleton-type and q α (F ) = {q − α (F )}. In this light, the almost sure identity implies part (a). To prove part (b), let G be as in Assumption 2.15(ii) and assume without loss of generality that Q(F = G) > 0 for all G ∈ G. If α ∈ (0, 1) and t ∈ R are such that Q(F (t) = α) > 0, then where Assumption 2.15(ii) ensures that the events conditioned on have positive probability. Hence, quantile and threshold calibration are equivalent. The remaining implications are immediate from Theorem 2.5.  Figure 3: The equiprobable predictive distribution F picks the piecewise linear, partially (namely, for y ≤ 2) identical CDFs F 1 and F 2 with equal probability. It is jointly CEP, quantile, and threshold calibrated, but fails to be auto-calibrated.
We conjecture that the statement of part (b) holds under Assumption 2.15(i) alone, but are unaware of a measure theoretic argument that serves to generalize the discrete reasoning in our proof. As indicated in panel (b) of Figure 1, we also conjecture that CEP or quantile calibration imply threshold calibration in general, though we have not been able to prove this, nor can we show that CEP or quantile calibration imply marginal calibration in general. Strong threshold calibration as defined in (6) implies both CEP and threshold calibration under Assumption 2.15, by arguments similar to those in the above proof. The following result thus demonstrates that the hierarchies in panel (a) and, with the aforementioned exceptions, in panel (b) of Figure 1 are complete, with the caveat that hierarchies may collapse if the class F is sufficiently small, as exemplified by Theorem 2.11 of Gneiting and Ranjan (2013). Proof. We establish the claims in a series of (counter) examples, starting with part (b), where we present an example based on two equiprobable, partially overlapping CDFs in Figure 3. A similar example based on four equiprobable, partially overlapping CDFs in Appendix A.6 yields part (a). As for part (c), we return to the piecewise uniform forecast in Example 2.4, where for simplicity we fix µ = 0. This forecast is probabilistically and marginally calibrated, but fails to be threshold calibrated, because Q Y ≤ 3 2 | F ( 3 2 ) = 5 8 = 5 10 + 1 2 · 1 10 = 11 20 = 5 8 .
As for parts (d) and (e), we refer to the unfocused and lopsided forecasts from Example 2.2. Clearly, further hierarchical relations are immediate. For example, given that probabilistic calibration does not imply marginal calibration, it does not imply threshold calibration nor auto-calibration. We leave further discussion to future work, but note that moment calibration does not imply probabilistic nor marginal calibration, as follows easily from classical results on the moment problem (e.g., Stoyanov, 2000). For an overview of calibration properties in our examples, see Table 2.

Reliability diagrams
As we proceed to define reliability diagrams, it is useful to restrict attention to single-valued functionals. To this end, if an identifiable functional T is of interval type, we instead consider its single-valued lower or upper bound, T − (F ) or T + (F ), which we call the lower and upper version of T, or simply the lower and upper functional, without explicit reference to the original functional T.
The following result demonstrates that T-calibration implies calibration of the upper and lower functional.
Proposition 2.18. Suppose that the functional T is generated by an identification function V , and let Assumption 2.6 hold. Then conditional T-calibration implies conditional T − -and T + -calibration, and subject to Assumption 2.8 unconditional T-calibration implies unconditional T − -and T + -calibration.
Proof. Suppose that T * is the lower or upper version of a functional T generated by the identification function V . As σ(T * (F )) ⊆ σ(T(F )), we find that ). If T * is the lower functional, the former inequality is strict and hence T − (F ) = min T(L(Y | T − (F ))), whereas if T * is the upper functional, the latter is strict and hence T + (F ) = max T(L(Y | T + (F ))).
Unconditional T * -calibration is an immediate consequence of unconditional T-calibration.
In this light, we restrict attention to single-valued functionals that are lower or upper versions of identifiable functionals, or identifiable functionals of singleton type. Any such functional can  be associated with a random variable X = T(F ), and we call any random variable X rc , for which almost surely, a recalibrated version of X. Clearly, we can also define X rc for a stand-alone point forecast X, based on conceptualized distributions, by resorting to the joint distribution of the random tuple (X, Y ), provided the right-hand side of (11) is well defined and finite almost surely. The point forecast X is conditionally T-calibrated, or simply T-calibrated, if X = X rc almost surely. Subject to Assumption 2.8, X is unconditionally T-calibrated if For recent discussions of the particular cases of the mean or expectation and quantile functionals see, e.g., Nolde and Ziegel (2017, Sections 2.1-2.2). Patton (2020, Proposition 2), Krüger and Ziegel (2021, Definition 3.1) and (Satopää, 2021, Section 2).
To compare the posited functional X with its recalibrated version X rc , we introduce the T-reliability diagram.
Assumption 2.19. The functional T is a lower or upper version of an identifiable functional, or an identifiable functional of singleton type. The point forecast X is a random variable, and the recalibrated forecast X rc = T(L(Y | X)) is well defined and finite almost surely.
Definition 2.20. Under Assumption 2.19, the T-reliability diagram is the graph of a mapping While technically the T-reliability diagram depends on the choice of a regular conditional distribution for the outcome Y , this is not a matter of practical relevance. Evidently, for a T-calibrated forecast the T-reliability diagram is concentrated on the diagonal. Conversely, deviations from the diagonal indicate violations of T-calibration and can be interpreted diagnostically, as illustrated in Figure 4 for threshold, quantile, and moment calibration. For a similar display in the specific case of mean calibration see Figure 1 of Pohle (2020).
In the setting of fully specified predictive distributions, the distinction between unconditional and conditional T-calibration is natural. Perhaps surprisingly, the distinction vanishes in the setting of stand-alone point forecasts if the associated identification function is of prediction error form and the forecast and the residual are independent.
Theorem 2.21. Let Assumption 2.19 hold, and suppose that the underlying identification function V satisfies Assumption 2.8. Suppose furthermore that the point forecast X and the generalized residual T(δ Y ) − X are independent. Then X is conditionally T-calibrated if, and only if, it is unconditionally T-calibrated.
Proof. Given any constant c ∈ R it holds that In view of (11) and (12), conditional and unconditional T-calibration are equivalent.
For quantiles, expectiles, and Huber functionals V is of prediction error form and the generalized residual reduces to the standard residual, X − Y . In particular, this applies in the case of least squares regression, where T is the mean functional, and the forecast and the residual have typically been assumed to be independent in the literature. We discuss the statistical implications of Theorem 2.21 in Appendix B.

Score decompositions
We now revisit a score decomposition into measures of miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) based on consistent scoring functions. Specifically, suppose that S is a consistent loss or scoring function for the functional T on the class F in the sense that and all x ∈ R (Savage, 1971;Gneiting, 2011). If the inequality is strict unless x ∈ T(F ), then S is strictly consistent. Consistent scoring functions serve as all-purpose performance measures that elicit fair and honest assessments and reward the utilization of broad information bases (Holzmann and Eulert, 2014). If the functional T is of interval type, a consistent scoring function S is consistent for both T − and T + , but strict consistency is lost when T is replaced by its lower or upper version and S is strictly consistent for T. For prominent examples of consistent scoring functions, see Table 3.
A functional is elicitable if it admits a strictly consistent scoring function (Gneiting, 2011). Under general conditions, elicitability is equivalent to identifiability (Steinwart et al., 2014, Theorem 5). The respective functionals allow for both principled relative forecast evaluation through the use of consistent scoring functions, and principled absolute forecast evaluation via T-reliability diagrams and score decompositions, as discussed in what follows.
Let L(Y ) denote the unconditional distribution of the outcome and suppose that x 0 = T(L(Y )) is well defined. As before, we operate under Assumption 2.19 and work with X = T(F ), its recalibrated version X rc , and the reference forecast x 0 . Again, the simplified notation accommodates stand-alone point forecasts, and it suffices to consider the joint distribution of the tuple (X, Y ). Following the lead of Dawid (1986) in the case of binary outcomes, and Ehm and Ovcharov (2017) and Pohle (2020) in the setting of point forecasts for real-valued outcomes, we consider the expected scores for the forecast at hand, its recalibrated version, and the marginal reference forecast x 0 , respectively.
Definition 2.22. Let Assumption 2.19 hold, and let x 0 = T(L (Y )) and the expectationsS,S rc , andS mg in (13) be well defined and finite. Then we refer to as miscalibration, discrimination, and uncertainty, respectively.
The following result decomposes the expected scoreS for the forecast at hand into miscalibration (MCB S ), discrimination (DSC S ), and uncertainty (UNC S ) components.
Theorem 2.23 (Dawid (1986), Pohle (2020)). In the setting of Definition 2.22, suppose that the scoring function S is consistent for the functional T. Then it holds that where MCB S ≥ 0 with equality if X is conditionally T-calibrated, and DSC S ≥ 0 with equality if X rc = x 0 almost surely. If S is strictly consistent then MCB S = 0 only if X is conditionally T-calibrated, and DSC S = 0 only if X rc = x 0 almost surely.
A remaining question is what consistent scoring function S ought to be used in practice. To address this issue, we resort to mixture or Choquet representations of consistent loss functions, as introduced by  for quantiles and expectiles and developed in full generality by Dawid (2016), Ziegel (2016) and Jordan et al. (2022). Specifically, we rely on an obvious generalization of Proposition 2.6 of Jordan et al. (2022), as noted at the start of their Section 2. Let T be identifiable with identification function V satisfying Assumption 2.8, and let η ∈ R. Then the elementary loss function S η , given by Moment of order n n = 1, 2, . . .
is consistent for T. As an immediate consequence, any well defined function of the form where H is a locally finite measure on R, is consistent for T. If T is a quantile, an expectile, an event probability or a moment, then the construction includes all consistent scoring functions, subject to standard conditions, and agrees with suitably adapted classes of generalized piecewise linear (GPL) and Bregman functions, respectively (Gneiting, 2011;. We now formalize what Ehm and Ovcharov (2017, p. 477) call the "most prominent" choice, namely, scoring functions for which the mixing measure H in the representation (16) is uniform.
Definition 2.24. Suppose that the functional T is generated by an identification function V satisfying Assumption 2.8, with elementary loss functions S η as defined in (15). Then a loss function S is canonical for T if it is nonnegative and admits a representation of the form where λ is the Lebesgue measure, a > 0 is a constant, and b is a measurable function.
Clearly, any canonical loss function is a consistent scoring function for T. Furthermore, if the identification function is of the prediction error form, then any canonical loss function has score differentials that are invariant under translation in the sense that S(x 1 +c, y+c)−S(x 2 +c, y+c) = S(x 1 , y) − S(x 2 , y). Conversely, we note from Section 5.1 of Savage (1971) that for the mean functional the canonical loss functions are the only consistent scoring functions of this type.
Typically, one chooses the constant a > 0 and the measurable function b(y) in (17) such that the canonical loss admits a concise closed form, as exemplified in Table 3. Since any selection incurs the same point forecast ranking, we refer to the choice in Table 3 as the canonical loss function. The most prominent example arises when T is the mean functional, where the ubiquitous quadratic or squared error scoring function,  Table 4, as functions of η 0 ≥ 0 and δ 0 ∈ (0, 1), respectively.
is canonical. In this case, the UNC component equals the unconditional variance of Y , as x 0 is simply the marginal mean µ Y of Y , and the MCB and DSC components of the general score decomposition (14) are respectively. Note that here and in the following, we drop the subscript S whenever we use a canonical loss. Table 4 and Figure 5 provide explicit examples.
In the nested case of a binary outcome Y , where X and X rc specify event probabilities, the quadratic loss function reduces to the Brier score (Gneiting and Raftery, 2007), and we refer to  and references therein for details on score decompositions. In the case of threshold calibration, the point forecast x = F (t) is induced by a predictive distribution, and the Brier score can be written as For both real-valued and binary outcomes, it is often preferable to use the square root of the miscalibration component (MCB 1/2 ) as a measure of calibration error that can be interpreted on natural scales (e.g., Roelofs et al., 2022).
A canonical loss function for the Huber functional (Table 1) is given by cf. Taggart (2022, Definition 4.2). In the limiting case as a = b → ∞, we recover the canonical loss functions for the α-expectile, which include the quadratic loss in (18). Similarly, if we rescale suitably and take the limit as a = b → 0, we recover the asymmetric piecewise linear or pinball loss, as listed in Table 3, which lies at the heart of quantile regression. We move on to a remarkable property of canonical loss functions. In a nutshell, the point forecast X is unconditionally T-calibrated if, and only if, the expected canonical loss deteriorates under translation. This property, which nests classical results in regression theory, as we demonstrate at the end of Section 3.3, does not hold under consistent scoring functions in general. For a counterexample see Appendix A.8.
Assumption 2.25. The point forecast X, the functional T and the identification function V satisfy Assumptions 2.8 and 2.19, and S is a canonical loss for T. Furthermore, E[S(X + η, Y )] and E[V (X + η, Y )] are well defined and locally bounded as functions of η ∈ R.
Theorem 2.26. Under Assumption 2.25, the point forecast X is unconditionally T-calibrated if, and only if, Proof. If X is unconditionally T-calibrated and c > 0, then is nonnegative by the second part of the unconditional T-calibration criterion (12). Conversely, if the score difference in (20) is nonnegative for all c > 0, then so is Hence, the second part of (12) is satisfied. An analogous argument shows that the score difference (20) is nonnegative for all c < 0 if, and only if, the first inequality in (12) is satisfied.
As a consequence, under a canonical loss function the MCB component in the score decomposition (14) of Theorem 2.23 decomposes into nonnegative unconditional and conditional components MCB u and MCB c , respectively, subject to the mild condition that unconditional recalibration via translation is feasible.
Theorem 2.27. Let Assumption 2.25 hold, and suppose there is a constant c such that X + c is unconditionally T-calibrated. Let X urc = X + c andS urc = E[S(X urc , Y )], and define MCB u =S −S urc and MCB c =S urc −S rc .
where MCB u ≥ 0 with equality if X is unconditionally T-calibrated, and MCB c ≥ 0 with equality if X rc = X urc almost surely. If S is strictly consistent, then MCB u = 0 only if X is unconditionally T-calibrated, and MCB c = 0 only if X rc = X urc almost surely.
Proof. Immediate from Theorems 2.23 and 2.26, and the fact that conditional recalibration of X and X + c yields the same X rc .
In settings that are equivariant under translation, such as for expectiles, quantiles and Huber functionals when both X and Y are supported on the real line, X can always be unconditionally recalibrated by adding a constant. Under any canonical loss function S, the basic decomposition (14) then extends toS For instance, when S(x, y) = (x−y) 2 is the canonical loss for the mean functional, MCB u = c 2 is the squared unconditional bias. The forecasts in Figure 5 and Table 4 are free of unconditional bias, so MCB u = 0 and MCB c = MCB.
In all cases studied thus far, canonical loss functions are strictly consistent , and so MCB u = 0 if and only if the forecast is unconditionally T-calibrated, and MCB c = 0 if and only if X urc = X rc almost surely. While in other settings, such as when the outcomes are bounded, unconditional recalibration by translation might be counterintuitive (in principle) or impossible (in practice), the statement of Theorem 2.27 continues to hold, and the above results can be refined to admit more general forms of unconditional recalibration. We leave these and other ramifications to future work.
3 Empirical reliability diagrams and score decompositions: The CORP approach We turn to empirical settings, where calibration checks, scores, and score decompositions address critical practical problems in both model diagnostics and forecast evaluation. The most direct usage is in the evaluation of out-of-sample predictive performance, where forecasts may either take the form of fully specified predictive distributions, or be single-valued point forecasts that arise, implicitly or explicitly, as functionals of predictive distributions. Similarly, in model diagnostics, where in-sample goodness-of-fit is of interest, the model might supply fully specified, parametric or non-parametric conditional distributions, or single-valued regression output that is interpreted as a functional of an underlying, implicit or explicit, probability distribution. Prominent examples for the latter setting include ordinary least squares regression, where the mean or expectation functional is sought, and quantile regression.
In the case of fully specified predictive distributions, we work with tuples of the form where F i is a posited conditional CDF for the real-valued observation y i for i = 1, . . . , n, which we interpret as a sample from an underlying population Q in the prediction space setting of Section 2. In the case of stand-alone point forecasts or regression output, we assume throughout that the functional T is of the type stated in Assumption 2.19 and work with tuples of the form where x i = T(F i ) ∈ R derives explicitly or implicitly from a predictive distribution F i for i = 1, . . . , n.
In the remainder of the section, we introduce empirical versions of T-reliability diagrams (Definition 2.20) and score components (Definition 2.22) for samples of the form (22) or (23), which allow for both diagnostic checks and inference about an underlying population Q. While practitioners may think of our empirical versions exclusively from diagnostic perspectives, we emphasize that they can be interpreted as estimators of the population quantities and be analyzed as such. A key feature of our approach is the use of nonparametric isotonic regression via the pool-adjacent-violators algorithm, as proposed by  in the particular case of binary outcomes. The generalization that we discuss here is hinted at in the discussion section of their paper.

The T-pool-adjacent-violators (T-PAV) algorithm
Our key tool and workhorse is a very general version of the classical pool-adjacent-violators (PAV) algorithm for nonparametric isotonic regression (Ayer et al., 1955;van Eeden, 1958). Historically, work on the PAV algorithm has focused on the mean functional, as reviewed by Barlow et al. Algorithm 1: General T-PAV algorithm based on data of the form (23) Input: (x 1 , y 1 ), . . . , (x n , y n ) ∈ R 2 where x 1 ≤ · · · ≤ x n Output: T-calibrated valuesx 1 , . . . ,x n partition into groups G 1:1 , . . . , G n:n and letx i = T(δ i ) for i = 1, . . . , n while there are groups G k:i and G (i+1):l such thatx 1 ≤ · · · ≤x i andx i >x i+1 do merge G k:i and G (i+1):l into G k:l and letx i = T(δ k:l ) for i = k, . . . , l end (1972), Robertson and Wright (1980), and de Leeuw et al. (2009), among others. In contrast, Jordan et al. (2022) study the PAV algorithm in very general terms that accommodate our setting.
We rely on their work and describe the T-pool-adjacent-violators algorithm based on tuples (x 1 , y 1 ), . . . , (x n , y n ) of the form (23), where without loss of generality we may assume that x 1 ≤ · · · ≤ x n . Furthermore, we let δ i denote the point measure in the outcome y i . More generally, for 1 ≤ k ≤ l ≤ n we let be the associated empirical measure. Algorithm 1 describes the generation of an increasing sequencex 1 ≤ · · · ≤x n of recalibrated values, which by construction are conditionally T-calibrated with respect to the empirical measure associated with (x 1 , y 1 ), . . . , (x n , y n ). The algorithm rests on partitions of the index set {1, . . . , n} into groups G k:l = {k, . . . , l} of consecutive integers.
The following result summarizes the remarkable properties of the T-PAV algorithm, as proved in Section 3.2 of Jordan et al. (2022).
Theorem 3.1 (Jordan et al. (2022)). Suppose that the functional T is as stated in Assumption 2.19. Then Algorithm 1 generates a sequencex 1 , . . . ,x n such that the empirical measure associated with (x 1 , y 1 ), . . . , (x n , y n ) is conditionally T-calibrated. This sequence is optimal with respect to any scoring function S of the form (16), in that for any non-decreasing sequence t 1 ≤ · · · ≤ t n .
We note that for a functional of interval type, the minimum on the left-hand side of (24) is the same under the lower and upper version, respectively. For customary functionals, such as threshold (non) exceedance probabilities, quantiles, expectiles, and moments, the optimality is universal, as functions of the form (16) exhaust the class of the T-consistent scoring functions subject to mild conditions . While the PAV algorithm has been used extensively for the recalibration of probabilistic classifiers (e.g., Flach, 2012), we are unaware of any extant work that uses Algorithm 1 for forecast recalibration, forecast evaluation, or model diagnostics in non-binary settings.

Empirical T-reliability diagrams
Recently,  introduced the CORP approach for the estimation of reliability diagrams and score decompositions in the case of probability forecasts for binary outcomes. In a nutshell, the acronym CORP refers to an estimator that is Consistent under the assumption of isotonicity for the population recalibration function and Optimal in both finite sample and asymptotic settings, while facilitating Reproducibility, and being based on the PAV algorithm. Here, we extend the CORP approach and employ nonparametric isotonic T-regression via the T-PAV algorithm under Assumption 2.19, where T is the lower or upper version of an identifiable functional, or an identifiable singleton functional.
We begin by defining the empirical T-reliability diagram, which is a sample version of the population diagram in Definition 2.20.
Definition 3.2. Let the functional T be as stated in Assumption 2.19, and suppose that x 1 , . . . ,x n originate from tuples (x 1 , y 1 ), . . . , (x n , y n ) with x 1 ≤ · · · ≤ x n via Algorithm 1. Then the CORP empirical T-reliability diagram is the graph of the piecewise linear function that connects the points (x 1 ,x 1 ), . . . , (x n ,x n ) in the Euclidean plane.  (2015) concern the mean functional. However, none of these papers employ the PAV algorithm, and the resulting diagrams are subject to issues of stability and efficiency, as illustrated by  in the case of binary outcomes.
For the CORP empirical T-reliability diagram to be consistent in the sense of large sample convergence to the population version of Definition 2.20, the assumption of isotonicity of the population recalibration function needs to be invoked. As argued by Roelofs et al. (2022) and , such an assumption is natural, and practitioners tend to dismiss nonisotonic recalibration functions as artifacts. Evidently, these arguments transfer to arbitrary functionals, and any violations of the isotonicity assumption entail horizontal segments in CORP reliability diagrams, thereby indicating a lack of reliability. Large sample theory for CORP estimates of the recalibration function and the T-reliability diagram depends on the functional T, the type -discrete or continuous -of the marginal distribution of the point forecast X, and smoothness conditions. Mösching and Dümbgen (2020) establish rates of uniform convergence in the cases of threshold (non) exceedance and quantile functionals that complement classical theory (Barlow et al., 1972;Casady and Cryer, 1976;Wright, 1984;Robertson et al., 1988;El Barmi and Mukerjee, 2005;Guntuboyina and Sen, 2018).
In the case of binary outcomes, Bröcker and Smith (2007, p. 651) argue that reliability diagrams ought to be supplemented by consistency bars for "immediate visual evaluation as to just how likely the observed relative frequencies are under the assumption that the predicted probabilities are reliable."  develop asymptotic and Monte Carlo based methods for the generation of consistency bands to accompany a CORP reliability diagram for dichotomous outcomes, and provide code in the form of the reliabilitydiag package (Dimitriadis and Jordan, 2021) for R (R Core Team, 2021). The consistency bands quantify and visualize the variability of the empirical reliability diagram under the respective null hypothesis, i.e., they show the pointwise range of the CORP T-reliability diagram that we expect to see under a calibrated forecast. Algorithms 2 and 3 in Appendix B generalize this approach to produce consistency bands from data of the form (22) under the assumption of auto-calibration. In the specific case of threshold calibration, where the induced outcome is dichotomous, the assumptions of auto-calibration (in the binary setting) and T-calibration (for the non-exceedance functional) coincide (Gneiting and Ranjan, 2013, Theorem 2.11), and we use the aforementioned algorithms to generate consistency bands ( Figure 6, top row). Generally, auto-calibration is a strictly stronger assumption than T-calibration, with ensuing issues, which we discuss in Appendix B.1. Furthermore, to generate consistency bands from data of the form (23), we cannot operate under the assumption of auto-calibration.
As a crude yet viable alternative, we propose in Appendix B.2 a Monte Carlo technique for the generation of consistency bands that is based on resampling residuals. As in traditional regression diagnostics, the approach depends on the assumption of independence between point forecasts and residuals. Figure 6 shows examples of T-reliability diagrams with associated residual-based 90% consistency bands for the perfect, unfocused, and lopsided forecasts from Section 2 for the mean functional (middle row) and the lower quantile functional at level 0.10 (bottom row). For further discussion see Appendix B.2. In the case of the mean functional, we add the scatter diagram for the original data of the form (23), whereas in the other two cases, inset histograms visualize the marginal distribution of the point forecast.
We encourage follow-up work on both Monte Carlo and asymptotic methods for the generation of consistency and confidence bands that are tailored to specific functionals of interest, similar to the analysis by  in the basic case of probability forecasts for binary outcomes.

Empirical score decompositions
In this section, we consider data (x 1 , y 1 ), . . . , (x n , y n ) of the form (23), where implicitly or explicitly x i = T(F i ) for a single-valued functional T. Letx 1 , . . . ,x n denote the respective T-PAV recalibrated values, and letx 0 = T(F 0 ), whereF 0 is the empirical CDF of the outcomes y 1 , . . . , y n . Let denote the mean score of the point forecast at hand, the recalibrated point forecast, and the functional T applied to the unconditional, marginal distribution of the outcome, respectively. If all quantities in (25) are finite, we refer to as the miscalibration, discrimination and uncertainty components of the mean score S. Our next result generalizes Theorem 1 of  and decomposes the mean score S into a signed sum of nonnegative, readily interpretable components.
Theorem 3.3. Suppose that the functional T satisfies the conditions in Assumption 2.19. Let the scoring function S be of the form (16), suppose thatx 1 , . . .x n originate from tuples (x 1 , y 1 ), . . . , (x n , y n ) via Algorithm 1, and let all terms in (25) be finite. Then where MCB S ≥ 0 with equality ifx i = x i for i = 1, . . . , n, and DSC S ≥ 0 with equality ifx i =x 0 for i = 1, . . . , n. If S is strictly consistent, then MCB S = 0 only ifx i = x i for i = 1, . . . , n and DSC S = 0 only ifx i =x 0 for i = 1, . . . , n.
Thus, CORP estimates of score components enjoy the same properties as the respective population quantities (Theorem 2.23, eq. (14)). This is not to be taken for granted, as the nonnegativity of the estimated components cannot be guaranteed if approaches other than the T-PAV algorithm are used for recalibration (Dimitriadis et al., 2021, Supplementary Section S5). Recently, the estimation of calibration error has seen a surge of interest in machine learning (Guo et al., 2017;Kuleshov et al., 2018;Kumar et al., 2019;Nixon et al., 2019;Roelofs et al., 2022). Under the natural assumption of isotonicity of the population recalibration function, MCB S is a consistent estimate of the population quantity MCB S , with canonical loss functions being natural choices for S. As noted, it is often preferable to use the square root of the miscalibration component under squared error as a measure of calibration error that can be interpreted on natural scales. Asymptotic distributions for our estimators depend on the functional T, the scoring function S, and regularity conditions. Large sample theory can leverage extant theory for nonparametric isotonic regression, as hinted at in the previous section, though score components might show distinct asymptotic behavior. Further development is beyond the scope of the present paper and strongly encouraged.
In the remainder of the section, we assume that S is a canonical score and drop the subscript in the score components. If there is a constantĉ ∈ R such that the empirical measure in (x 1 +ĉ, y 1 ), . . . , (x n +ĉ, y n ) is unconditionally T-calibrated, let We then refer to MCB u = S − S urc and MCB c = S urc − S rc as the CORP unconditional and conditional miscalibration components of the mean canonical score, respectively. Under mild conditions, these estimates are nonnegative and share properties of the respective population quantities in Theorem 2.27.
Theorem 3.4. Let the conditions of Theorem 3.3 hold, and let S be a canonical loss function for T. Suppose there is a constantĉ ∈ R such that the empirical measure in (x 1 +ĉ, y 1 ), . . . , (x n + c, y n ) is unconditionally T-calibrated, and suppose that all terms in (28) are finite. Then where MCB u ≥ 0 and MCB c ≥ 0.
Proof. Immediate from Theorems 2.26 and 3.1, and the trivial fact that the addition of a constant is a special case of an isotonic mapping.
In the middle row of Figure 6, the extended CORP decomposition, which estimates the population decomposition (21), is applied to the mean squared error (MSE). Likewise, the extended CORP decomposition of the canonical score for quantiles, i.e., the piecewise linear quantile score (QS) from Table 3, is shown in the bottom row. The top row concerns threshold calibration, and we report the standard CORP decomposition (27) of the Brier score (BS) from (19). While the assumptions of Theorem 3.3 are satisfied in this setting, the addition of the constantĉ may yield forecast values outside the unit interval, whence we refrain from considering the refined decomposition in (29). In this context, the distinction between out-of-sample forecast evaluation and in-sample model diagnostics is critical. When evaluating out-of-sample forecasts, both unconditional and conditional miscalibration are relevant. In contrast, in-sample model fits frequently enforce unconditional calibration. For example, if we fit a regression model with intercept by minimizing the canonical loss for a functional T, Theorem 2.26 applied to the associated empirical measure guarantees in-sample unconditional T-calibration. As special cases, this line of reasoning yields classical results in ordinary least squares regression, and the partitioning inequalities of quantile regression in Theorem 3.4 of Koenker and Bassett (1978).

Skill scores and a universal coefficient of determination
Let us revisit the mean scores in (25) under the natural assumption that the terms in S and S mg are finite and that S mg is strictly positive. In out-of-sample forecast evaluation, the quantity is known as skill score (Murphy and Epstein, 1989;Murphy, 1996;Gneiting and Raftery, 2007;Jolliffe and Stephenson, 2012) and may attain both positive and negative values. In particular, when S(x, y) = (x−y) 2 is the canonical loss function for the mean functional, S skill coincides with the popular Nash-Sutcliffe model efficiency coefficient (NSE; Nash and Sutcliffe, 1970;Moriasi et al., 2007). A positive skill score indicates predictive performance better than the simplistic unconditional reference forecast, whereas a negative skill score suggests that we are better off by using the simple reference forecast. Of course, it is possible, and frequently advisable, to base skill scores on reference standards that are more sophisticated than an unconditional, constant point forecast (Hyndman and Koehler, 2006). In contrast, if the goal is in-sample model diagnostics, the quantity in (30) typically is nonnegative. As we demonstrate now, it constitutes a powerful generalization of the coefficient of determination, R 2 , or variance explained in least squares regression, and its close cousin, the R 1 measure in quantile regression (Koenker and Machado, 1999). Specifically, we propose the use of as a universal coefficient of determination. In practice, one takes S to be a canonical loss for the functional T at hand, and we drop the subscripts in this case. The classical R 2 measure arises when S(x, y) = (x − y) 2 is the canonical squared error loss function for the mean functional, and the R 1 measure of Koenker and Machado (1999) emerges when S(x, y) = 2 (½{x ≥ y} − α) (x − y) is the canonical piecewise linear loss under the α-quantile functional. Of course, in the case α = 1 2 of the median, the piecewise linear loss reduces to the absolute error. In Figure 7, we present a numerical example on the toy data from Figure 1 in Kvålseth (1985). The straight lines show the linear (ordinary least squares) mean and linear (Laplace) median regression fits, which Kvålseth (1985) sought to compare. The piecewise linear broken curves illustrate the nonparametric isotonic regression fits, as realized by the T-PAV algorithm, where T is the mean and the lower and the upper median, respectively. As the linear regression fits induce the same ranking of the point forecasts, they yield the same PAV-recalibrated values that enter the terms in the score decomposition (26), and thus they have identical discrimination components in (27), which equal 10.593 under squared error and 2.333 under absolute error, regardless of which isotonic median is used. The uncertainty components, which equal 12.000 under squared error, and 2.889 under absolute error, are identical as well, since they depend on the observations only. Thus, the differences in R 2 respectively R 1 in Figure 7 stem from distinct miscalibration components. Of course, linear mean regression is preferred under squared error, and linear median regression is preferred under absolute error.
Various authors have discussed desiderata for a generally applicable definition of a coefficient of determination (Kvålseth, 1985;Nakagawa and Schielzeth, 2013) for the assessment of in-sample fit. In particular, such a coefficient ought to be dimensionless and take values in the unit interval, with a value of 1 indicating a perfect fit, and a value of 0 representing a complete lack of fit. The universal coefficient of determination R * enjoys these properties under modest conditions.  Figure 1), along with nonparametric isotonic mean and median regression fits. The isotonic median regression fit is not unique and framed by the respective lower and upper functional.
Assumption 3.5. Suppose that the functional T is as stated in Assumption 2.19 with associated identification function V . Let the scoring function S be of the form (16), and suppose that x 1 , . . .x n in (25) originate from tuples (x 1 , y 1 ), . . . , (x n , y n ) via Algorithm 1. Furthermore, let the following hold.
(i) The terms contributing to S and S mg in (25) are finite, and S mg > 0.
(ii) The values x 1 , . . . , x n have been fitted to y 1 , . . . , y n by in-sample empirical loss minimization with respect to S, with any constant fit x 1 = · · · = x n being admissible.
For example, suppose that T is the mean functional and S is the canonical squared error scoring function. Then condition (i) is satisfied with the exception of the trivial case where y 1 = · · · = y n , and condition (ii) is satisfied under linear (ordinary least squares) mean regression with intercept. Similarly, if T is a quantile and S is the canonical piecewise linear loss function, then (i) is satisfied except when y 1 = · · · = y n , and (ii) is satisfied under linear quantile regression with intercept. In this light, the following theorem covers the classical settings for the R 2 and R 1 measures.
Proof. The claim follows from Theorem 3.1, the trivial fact that a constant fit is a special case of an isotonic mapping, and the assumed form (16) of the scoring function.
We emphasize that Assumption 3.5 and Theorem 3.6 are concerned with, and tailored to, in-sample model diagnostics. At the expense of technicalities, the regularity conditions can be relaxed, but the details are tedious and we leave them to subsequent work. The condition that any constant fit x 1 = · · · = x n be admissible is critical and cannot be relaxed.

Empirical examples
We now illustrate the use of reliability diagrams, score decompositions, skill scores, and the coefficient of determination R * for the purposes of forecast evaluation and model diagnostics.
In the basic setting of tuples (x 1 , y 1 ), . . . , (x n , y n ) of the form (23), the point forecast x i represents the functional T of a posited distribution for y i . The most prominent case of the mean functional and canonical squared error loss (18) Figure 6). With a mean squared error (MSE) of 0.224, ridge regression performs much better than the null model with an MSE of 0.262. The CORP score decomposition shown in Figure 2 refines and supports the analysis.
We move on to discuss the more complex setting of tuples (F 1 , y 1 ), . . . , (F n , y n ) of the form (23), where F i is a posited distribution for y i (i = 1, . . . , n). As discussed in Section 2 and visualized in Figure 1, the traditional unconditional notions of calibration, namely, probabilistic and marginal calibration, constitute weak forms of reliability. For this very reason, we recommend that checks for probabilistic and marginal calibration are given priority in this setting, much in line with current practice. Typically, probabilistic calibration is checked by plotting histograms of empirical probability integral transform (PIT) values (Diebold et al., 1998;Gneiting et al., 2007), though this is hindered by the need for binning. In Appendix B.3, we discuss the PIT reliability diagram, a rarely used alternative that avoids binning and retains the spirit of our CORP approach by plotting the CDF of the empirical PIT values. Similarly, as we also discuss in Appendix B.3, the marginal reliability diagram can be used to assess marginal calibration in the spirit of the CORP approach. If the analysis indicates gross violations of probabilistic or marginal calibration, we note from Section 2 and Figure 1 that key notions of conditional calibration must be violated as well. Otherwise, we might proceed to check stronger conditional notions of calibration, such as threshold, mean, and quantile calibration.
To illustrate this process, we consider quarterly Bank of England forecasts of consumer price index (CPI) inflation rates, as issued since 2004. The forecast distributions, for which we give details and refer to extant analyses in Appendix C.2, are two-piece normal distributions that are communicated to the public via fancharts. The forecasts are at prediction horizons up to six quarters ahead in the time series setting, where k step ahead forecasts that are ideal with respect to the canonical filtration show PIT values that are independent at lags ≥ k + 1 in addition to being uniformly distributed (Diebold et al., 1998). However, as discussed in Appendix C.1, independent, uniformly distributed PIT values do not imply auto-calibration, except in a special case. Thus, calibration diagnostics beyond checks of the uniformity and independence of the PIT are warranted.
In Figure 8, we consider forecasts one quarter ahead and show PIT and marginal reliability diagrams, along with empirical autocorrelation functions (ACFs) for the first two moments of the PIT. In part, the PIT reliability diagram and the ACFs lie outside the respective 90% consistency bands. For a closer look, we also plot the threshold reliability diagram at the policy target of 2% and the lower α-quantile reliability diagram for α = 0.75. The deviations from reliability remain minor, in stark contrast to calibration diagnostics at prediction horizons k ≥ 4, for which we refer to Appendix C.2. Figure 9 shows the standard CORP decomposition (27) of the Brier score (BS) for the induced probability forecasts at the 2% target and the extended CORP decomposition (29) of the piecewise linear quantile score for α-quantile forecasts at level α = 0.75 and lead times up to six quarters ahead. In the latter case, the difference between MCB and MCB u equals the MCB c component. Generally, the miscalibration components increase while the discrimination components decrease with the lead time. Related results for the quantile functional can be found in Pohle (2020, Table 5, Figures 7 and 8), where there is a notable increase in the discrimination (resolution) component at the largest two lead times, which is caused by counterintuitive decays  in the recalibration functions. In contrast, the regularizing constraint of isotonicity prevents overfitting in the CORP approach. The coefficient of determination or skill score R * decays with the prediction horizon and becomes negative at lead times k ≥ 4. This suggests that forecasts remain informative at lead times up to at most three quarters ahead, in line with the substantive findings in Pohle (2020) and other extant work, as hinted at in Appendix C.2.

Discussion
We have developed a comprehensive theoretical and methodological framework for the analysis of calibration and reliability, serving the purposes of both (out-of-sample) forecast evaluation and (in-sample) model diagnostics. A common principle is that fitted or predicted distributions ought to be calibrated or reliable, ideally in the sense of auto-calibration, which stipulates that the outcomes are random draws from the posited distributions. For general real-valued outcomes, we have seen that auto-calibration is stronger than both classical unconditional and recently proposed conditional notions of calibration. We have developed hierarchies of calibration in the spirit of Van Calster et al. (2016), as highlighted in Figure 1, and proposed a generic notion of conditional calibration in terms of statistical functionals. Specifically, a posited distribution is conditionally T-calibrated if the induced point forecast for the functional T can be taken at face value. This concept continues to apply when stand-alone point forecasts or regression output in terms of the functional T are to be evaluated, and can be assessed via T-reliability diagrams and associated score decompositions. Importantly, our tools apply regardless of how forecasts are generated, be it through the use of traditional statistical regression models, modern machine learning techniques, or even subjective human judgment.
We have adopted and generalized the nonparametric approach of , who obtained consistent, optimally binned, reproducible and PAV based (CORP) estimators of T-reliability diagrams and score components in the case of probability forecasts for binary outcomes. While our tools apply in the much broader setting of identifiable functionals and real-valued outcomes, the arguments put forth by  continue to apply, in that CORP estimators are bound to, simultaneously, improve statistical efficiency, reproducibility (Stodden et al., 2016), and stability (Yu and Kumbier, 2020). In a nutshell, the CORP approach is flexible, due to its use of nonparametric regression for recalibration, and yet it avoids overfitting, owing to the regularizing constraint of isotonicity. Notably, the CORP score decomposition yields a new, universal coefficient of determination, R * , that nests and generalizes the classical R 2 in ordinary least squares (mean) regression, and its cousin R 1 in quantile regression. In independent work, Allen (2021) also observes the link between skill scores, score decompositions and the coefficient of determination. We have illustrated the CORP approach on Bank of England forecasts of inflation, along with a brief ecological example. Gneiting et al. (2023) provide an in depth review of the particular case of forecasts in the form of (one or multiple) quantiles, accompanied by case studies. Code in R (R Core Team, 2021) for reproducing our results is available at https://github.com/resinj/replication_GR21.
Follow-up work on the CORP approach for specific functionals T is essential, including but not limited to the ubiquitous cases of quantiles and the mean functional, where the newly developed tools can supplement classical approaches to regression diagnostics, as hinted at in the ecological example. In particular, we have applied a crude, all-purpose, residual-based permutation approach to generate consistency bands for T-reliability diagrams under the hypothesis of T-calibration. Clearly, this approach can be refined, and we anticipate vigorous work on consistency and confidence bands, based on either resampling or large sample theory, akin to the developments in  for probability forecasts of binary outcomes. Similarly, CORP estimates of miscalibration components under canonical loss functions are natural candidates for the quantification of calibration error in empirical work. Reliability and discrimination ability are complementary attributes of point forecasts and regression output, and discrimination can be assessed quantitatively via the respective score component. When many forecasts are to be compared with each other, scatter plots of CORP miscalibration (MCB) and discrimination (DSC) components admit succinct visual displays of predictive performance. In this type of display, forecasts with the same score or, equivalently, identical coefficient of determination, R * , gather on lines with unit slope, and multiple facets of forecast quality can be assessed simultaneously, for a general alternative to the widely used Taylor (2001) diagram.
Formal tests of hypotheses of calibration are critical in both specific applications, such as banking regulation (e.g., Nolde and Ziegel, 2017), and in generic tasks, such as the assessment of goodness of fit in regression (Dimitriadis et al., 2021, Section S2). In Appendix B.4, we comment on this problem from the perspective of the theoretical and methodological advances presented here. While specific developments need to be deferred to future work, it is our belief that the progress in our understanding of notions and hierarchies of calibration, paired with the CORP approach to estimating reliability diagrams and score components, can spur a wealth of new and fruitful developments in these directions.

Appendix A Supporting calculations for Section 2
Here, we provide supporting computations and discussion for Examples 2.2 and 2.4, Definitions 2.9 and 2.24, Figures 4 and 5, and Table 4, along with a discussion of the relation between probabilistic calibration and unconditional quantile calibration, and a counterexample hinted at in the main text. For subsequent use, the first three (non-centered) moments of the normal distribution N (µ, σ 2 ) are µ, µ 2 + σ 2 , and µ 3 + 3µσ 2 . As in the main text, we let ϕ and Φ denote the density and the cumulative distribution function (CDF), respectively, of a standard normal variable.

A.4 Identification functions, unconditional calibration, and canonical loss
In this section, we demonstrate that Definitions 2.9 and 2.24 are unambiguous and do not depend on the choice of the identification function, which is essentially unique. To this end, we first contrast the notions of identification functions in Fissler and Ziegel (2016) and Jordan et al. (2022). Fissler and Ziegel (2016) is integrable with respect to all F ∈ F for all x ∈ R. Jordan et al. (2022) additionally require V to be increasing and left-continuous in its first argument. Furthermore, there is a subtle difference in the way that the functional is induced. While Fissler and Ziegel (2016) (7). The approach by Jordan et al. (2022) allows for quantiles to be treated in full generality and ensures that the interval T(F ) coincides with the closure of T 0 (F ) if the latter is nonempty.
In the setting of Fissler and Ziegel (2016), if V is an identification function, then so is (x, y) → h(x)V (x, y) whenever h(x) = 0 for all x ∈ R. If the class F is sufficiently rich, then any two locally bounded identification functions V andṼ that induce a functional T 0 of singleton type relate to each other in the stated form almost everywhere on the interior of T 0 (F ) × R (Dimitriadis et al., 2022, Theorem 4), which implies that increasing identification functions of prediction error form are unique up to a positive constant. The following proposition provides an elementary proof under slightly different conditions that are tailored to our setting. Notably, identification functions of prediction error form induce functionals that are equivariant under translation by Proposition 4.7 of Fissler and Ziegel (2019), a result which can easily be transferred to the setting of Jordan et al. (2022).
Proposition A.1. Let F be a convex class of probability measures such that δ y ∈ F for all y ∈ R. If the functional T is induced on F by an identification function V (x, y) = v(x − y) of prediction error form, where v is increasing and left-continuous with v(−r) < 0 and v(r) > 0 for some r > 0, then any other identification function of the stated form that induces T is a positive multiple of V .
Hence, if we assume an identification function of type (i) in Assumption 2.8, Definitions 2.9 and 2.24 do not depend on the choice of the identification function, as it is unique up to a positive constant. Trivially, the same holds true for type (ii). To complete the argument that the definitions are unambiguous, the following technical argument is needed. Remark A.2. If a functional T of singleton type is identified by both an identification function V (x, y) = v(x − y) of type (i) and an identification functionṼ (x, y) = x − T(δ y ) of type (ii), theñ V is also of type (i). To see this, let z denote the unique value at which the sign of v changes, and note that z = T(δ y ) − y for all y, since V induces the functional T for each Dirac measure δ y . Hence, T(δ y ) = y + z andṼ (x, y) = x − y − z is of type (i).
We close this section with comments on the role of the class F . As expressed by Assumption 2.8, we prefer to work with identification functions that elicit the target functional T on a large, convex class F of probability measures, to avoid unnecessary constraints on forecast(er)s. Furthermore, when evaluating stand-alone point forecasts, the underlying predictive distributions typically are implicit, and assumptions other than the existence of the functional at hand are unwarranted and contradict the prequential principle. Evidently, if the class F is sufficiently restricted, additional identification functions arise. For example, the piecewise constant identification function associated with the median can be used to identify the mean within any class of symmetric distributions.

A.5 Probabilistic calibration and unconditional quantile calibration
As noted in the main text, probabilistic calibration implies the unconditional α-quantile calibration condition (9) at every level α ∈ (0, 1). To see this, note that if F is probabilistically . As Example 2.14(b) demonstrates, the reverse implication does not hold in general. However, Assumption 2.15 ensures the equivalence of probabilistic calibration and unconditional α-quantile calibration at every level α ∈ (0, 1).

A.6 Counterexample (Proposition 2.17(a))
As pointed out by Sahoo et al. (2021, p. 5), strong threshold calibration does not imply autocalibration. Here, we provide a simple example illustrating this, as Sahoo et al. (2021) do not present such. The example is similar in spirit to the continuous forecast of Example 2.14(a) (as c → 0), but with strictly increasing distribution functions satisfying Assumption 2.15.
Let F be a mixture of uniform distributions on the intervals [0, 1], [1,2], [2,3], and [3, 4] with weights p 1 , p 2 , p 3 , and p 4 , respectively, and let Y be from a mixture with weights q 1 , q 2 , q 3 , and q 4 . Furthermore, let the tuple (p 1 , p 2 , p 3 , p 4 ; q 1 , q 2 , q 3 , q 4 ) attain each of the values with equal probability. The equal average of the distribution of the PIT conditional on either forecast from the top row, and either forecast from the bottom row, is uniform. As any nontrivial conditioning in terms of a threshold yields a combination of two forecast cases, one from the top row and one from the bottom row, the forecast F is strongly threshold calibrated.
A.7 Remarks on Figure 4 The root transforms in the moment reliability diagrams in the bottom row of Figure 4 bring the first, second, and third moment to the same scale. The peculiar dent in the reliability curve for the (third root of the) third moment of the piecewise uniform forecast results from the transform, which magnifies small deviations between x = m 3 (F ) and x rc when x is close to zero. For comparison, Figure 10 shows moment reliability diagrams for all three forecast without applying the root transform.

A.8 Counterexample (Theorem 2.26)
The statement in Theorem 2.26 does not hold under consistent scoring functions in general. For a counterexample, consider the empirical distribution of (x 1 , y 1 ), . . . , (x n , y n ), where x i = i and y i = x i + 10 9 for i = 1, . . . , 9, and x 10 = y 10 = −10. The respective mean-forecast X fails to be unconditionally mean calibrated, whereas the shifted version X + 1 is unconditionally mean calibrated. Nonetheless, the expected elementary score (15) for the mean functional (i.e., V (x, y) = x − y) with index η = − 19 2 increases when X gets replaced with X + 1.

Appendix B Consistency resamples and calibration tests
Monte Carlo based consistency bands for T-reliability diagrams can be generated from resamples, at any desired nominal level. The consistency bands then show the pointwise range of the resampled calibration curves. For now, let us assume that we have data (x 1 , y 1 ), . . . , (x n , y n ) of the form (23) along with m resamples at hand, and defer the critical question of how to generate the resamples. Complementary to consistency bands, tests for the assumed type of calibration, as quantified by the functional T and a generic miscalibration measure MCB, can be performed as usual. Specifically, we compute MCB j for each resample j = 1, . . . , m, and if r of the resampled measures MCB 1 , . . . , MCB m are less than or equal to the miscalibration measure computed from the original data, we declare a Monte Carlo p-value of 1 − r m+1 .

B.1 Consistency resamples under the hypothesis of auto-calibration
When working with original data of the form (22), we can generate resamples under the hypothesis of auto-calibration in the obvious way, as follows.
Algorithm 3: Consistency resamples under the hypothesis of auto-calibration Input: (F 1 , y 1 ), . . . , (F n , y n ) Output: resamples (x 1 ,ỹ As noted, in the case of threshold calibration, the induced outcome is binary, whence the assumptions of auto-calibration and T-calibration coincide. For other types of functionals, autocalibration is a strictly stronger assumption than T-calibration, and it is important to note that the resulting inferential procedures may be confounded by forecast attributes other than T-calibration. For illustration, let us return to the setting of Example 2.1 and suppose that, conditionally on a standard normal variate µ, the outcome Y is normal with mean µ and variance 1. Given any fixed σ > 0, the forecast F σ = N (µ, σ 2 ) is auto-calibrated if, and only if, σ = 1. However, if T is the mean or median functional, then F σ is T-calibrated under any σ > 0. Clearly, if we use Algorithm 3 to generate resamples, then the consistency bands generated by Algorithm 2 might be misleading with regard to the assessment of T-calibration. For example, if σ < 1 the confidence bands tend to be narrow and might erroneously suggest a lack of T-calibration, despite the forecast being T-calibrated.

B.2 Consistency resamples under the hypothesis of T-calibration
The issues just described call for an alternative to Algorithm 3. Residual-based approaches can be used to generate resamples under the weaker hypothesis of T-calibration. In developing such a method, we restrict the discussion to single-valued functionals T under which y i = T(δ i ), which covers all cases of key interest, such as the mean functional, lower or upper quantiles, and expectiles. As is standard in regression diagnostics, residual-based approaches operate on the basis of tuples (x 1 , y 1 ), . . . , (x n , y n ) of the form (23) under the assumptions of independence between the point forecast, x i , and the residual, y i − x i , and exchangeability of the residuals. For a discussion in the context of backtests in banking regulation, see Example 3 of Nolde and Ziegel (2017).
Interestingly, Theorem 2.21 demonstrates that under these assumptions a forecast is conditionally T-calibrated if, and only if, it is unconditionally T-calibrated. Thus, we draw resamples in a two-stage procedure. First, we find the constant c from Theorem 2.27 such that the empirical distribution of (x 1 + c, y 1 ), . . . , (x n + c, y n ) or, equivalently, (x 1 , y 1 − c), . . . , (x n , y n − c), is unconditionally T-calibrated, and then we resample from the respective residuals, as follows.

Algorithm 4: Consistency resamples under the joint hypothesis of T-calibration and independence between point forecasts and residuals
Input: (x 1 , y 1 ), . . . , (x n , y n ) Output: resamples (x 1 ,ỹ 1 ), . . . , (x n ,ỹ (j) n ) for j = 1, . . . , m for i = 1, . . . , n do let r i = y i − x i end find c such that (x 1 + c, y 1 ), . . . , (x n + c, y n ) is unconditionally T-calibrated for j ∈ {1, . . . , m} do sampler 1 , . . . ,r n from {r 1 , . . . , r n } with replacement for i = 1, . . . , n do letỹ i = x i +r i − c end end As noted in the main text, the consistency bands for the threshold reliability diagrams in Figures 6 and 8 have been generated by Algorithms 2 and 3. This is nearly the same as the Monte Carlo technique proposed by  that applies in the case of (induced) binary outcomes (only). However, unlike , we do not resample the forecasts themselves. To generate consistency bands for the mean and quantile reliability diagrams in these figures, we apply Algorithm 2 to m = 1000 resamples generated by Algorithm 3. Evidently, this is a crude procedure and relies on classical assumptions. Nonetheless, we believe that in many practical settings, where visual tools for diagnostic checks of calibration are sought, the consistency bands thus generated provide useful guidance.
Further methodological development on consistency and confidence bands needs to be tailored to the specific functional T of interest, and follow-up work on Monte Carlo techniques and large sample theory is strongly encouraged. Extant asymptotic theory for nonparametric isotonic regression, as implemented by Algorithm 1, is available for quantiles and the mean or expectation functional, as developed and reviewed by Barlow et al. (1972), Casady andCryer (1976), Wright (1984), Robertson et al. (1988), El Barmi and Mukerjee (2005), and Mösching and Dümbgen (2020), and can be leveraged, though with hurdles, as rates of convergence depend on distributional assumptions and limit distributions involve nuisance parameters that need to be estimated, whereas the use of bootstrap methods might be impacted by the issues described by Sen et al. (2010).

B.3 Reliability diagrams and consistency bands for probabilistic and marginal calibration
For the classical notions of unconditional calibration in Section 2.2, the CORP approach does not apply directly, but its spirit can be retained and adapted. As for probabilistic calibration, the prevalent practice is to plot histograms of empirical probability integral transform (PIT) values, as proposed by Diebold et al. (1998), Gneiting et al. (2007 and Czado et al. (2009), though this is hindered by the necessity for binning, as analyzed by Heinrich (2021) in the nearly equivalent setting of rank histograms. The population version of our suggested alternative is the PIT reliability diagram, which is simply the graph of the CDF of the PIT Z F in (1). The PIT reliability diagram coincides with the diagonal in the unit square if, and only if, F is probabilistically calibrated. For tuples of the form (22) the empirical PIT reliability diagram shows the empirical CDF of the (potentially randomized) PIT values. This approach does not require binning and can be interpreted in much the same way as a PIT diagram: An inverse S-shape corresponds to a U-shape in histograms and indicates underdispersion of the forecast, as typically encountered in practice. Evidently, this is not a new idea and extant implementations can be found in work by Pinson and Hagedorn (2012) and .
As regards marginal calibration, we define the population version of the marginal reliability diagram as the point set The marginal reliability diagram is concentrated on the diagonal in the unit square if, and only if, F is marginally calibrated. For tuples of the form (22) the empirical marginal reliability diagram is a plot of the empirical non exceedance probability (NEP)F 0 (y) = 1 n n i=1 ½{y ≥ y i } against the average forecast NEPF (y) = 1 n n i=1 F i (y) at the unique values y of the outcomes y 1 , . . . , y n , and interpolated linearly in between. Of course, this is not a new idea either and can be interpreted as a P-P plot.
For marginal calibration diagrams, we obtain consistency bands under the assumption of marginal calibration by drawing resamples y (j) 1 , . . . , y (j) n fromF = 1 n n i=1 F i , computing the respective marginal reliability curve, and repeating over Monte Carlo replicates j = 1, . . . , m. Then we find consistency bands in the spirit of Algorithm 2. For PIT reliability diagrams, a trivial technique applies, as we may obtain consistency bands under the assumption of probabilistic calibration by (re)sampling n independent standard uniform variates, computing the respective empirical CDF, and repeating over Monte Carlo replicates. Evidently, there are alternatives based on empirical process theory (Shorack and Wellner, 2009). Figure 11 illustrates PIT and marginal reliability diagrams on our customary examples, along with 90% consistency bands based on m = 1000 Monte Carlo replicates.

B.4 Testing hypotheses of calibration
While the explicit development of calibration tests exceeds the scope of our paper, we believe that the results and discussion in Section 2 convey an important general message: It is critical that the assessed notion of calibration be carefully and explicitly specified. Throughout, we consider tests under the assumption of independent, identically distributed data from a population. For extensions to dependent samples, we refer to Strähl and Ziegel (2017), who generalized the prediction space concept to allow for serial dependence, and point at methods introduced by, e.g., Corradi and Swanson (2007), Knüppel (2015) and Bröcker and Ben Bouallègue (2020). The most basic case is that of tuples (x 1 , y 1 ), . . . , (x n , y n ) of the form (23), where implicitly or explicitly x i = T(F i ) for a single-valued functional T. We first discuss tests of unconditional calibration. If the simplified condition (10) is sufficient, a two-sided t-test based on v = 1 n n i=1 V (x i , y i ) can be used to test for unconditional calibration. In the general case, two one-sided t-tests can be used along with a Bonferroni correction. In the special case of quantiles, there is no need to resort to the approximate t-tests, and exact binomial tests can be used instead. Essentially, this is the setting of backtests for value-at-risk reports in banking regulation, for which we refer to (Nolde and Ziegel, 2017, Sections 2.1-2.2).
As noted earlier in the section, resamples generated under the hypothesis of conditional Tcalibration can readily be used to perform Monte Carlo tests for the respective hypothesis, based on CORP score components that are computed on the surrogate data. Alternatively, one might leverage extant large sample theory for nonparametric isotonic regression (Barlow et al., 1972;Casady and Cryer, 1976;Wright, 1984;Robertson et al., 1988;El Barmi and Mukerjee, 2005;Mösching and Dümbgen, 2020). Independently of the use of resampling or asymptotic theory, CORP based tests avoid the issues and instabilities incurred by binning (Dimitriadis et al., 2021, Section S2) and may simultaneously improve efficiency and stability. In passing, we hint at relations to the null hypothesis of Mincer-Zarnowitz regression (Krüger and Ziegel, 2021) and tests of predictive content (Galbraith, 2003;Breitung and Knüppel, 2021).
We move on to the case of fully specified distributions, where we work with tuples (F 1 , y 1 ), . . . , (F n , y n ) of the form (22), where F i is a posited conditional CDF for y i (i = 1, . . . , n). Tests for probabilistic calibration then amount to tests for the uniformity of the (potentially, randomized) PIT values. Wallis (2003) and Wilks (2019, p. 769) suggest chi-square tests for this purpose, which depend on binning, and thus are subject to the aforementioned instabilities. To avoid binning, we recommend the use of test statistics that operate on the empirical CDF of the PIT values, such as the classical Kolmogorov-Smirnov (KS) statistic, as suggested and used to test for PIT calibration by Noceti et al. (2003) and Knüppel (2015), or, more generally, tests based on distance measures between the empirical CDF of the PIT values, and the CDF of the standard uniform distribution that arises under the hypothesis of probabilistic calibration. Recently proposed alternatives arise via e-values . Similarly, tests for marginal calibration can be based on resamples and distance measures betweenF andF 0 , or leverage asymptotic theory.
In the distributional setting, arbitrarily many types of reliability can be tested for, and all of the aforementioned tests for unconditional or conditional T-calibration apply. Multiple testing needs to be accounted for properly, and the development of simultaneous tests for various types of calibration would be useful. In this context, let us recall from Theorem 2.16 that, subject to technical conditions, CEP, threshold, and quantile calibration are equivalent and tests for CEP calibration (Held et al., 2010;Strähl and Ziegel, 2017), quantile and threshold calibration assess identical hypotheses.

Appendix C Time series settings and the Bank of England example
In typical time series settings, as exemplified by our analysis of Bank of England forecasts in Section 3, the assumption of independent replicates of forecasts and observations is too restrictive. While the diagnostic methods proposed in our paper continue to apply, statistical inference requires care, as discussed by Corradi and Swanson (2007) and Knüppel (2015), among other authors. Here, we elucidate the role of uniform and independent probability integral transform (PIT) values for calibration in time series settings, and give further details and results for the Bank of England example.

C.1 The role of uniform and independent PITs
In a landmark paper, Diebold et al. (1998, p. 867) showed that a sequence of continuous predictive distributions F t for a sequence Y t of observations at time t = 0, 1, . . . results in a sequence of independent, uniformly distributed PITs if F t is ideal relative to the σ-algebra generated by past observations, A t = σ(Y 0 , Y 1 , . . . , Y t−1 ). This property does not depend on the continuity of F t and continues to hold under general predictive CDFs and the randomized definition (1) of the PIT (Rüschendorf and de Valk, 1993, Theorem 3).
In the case of continuous predictive distributions, Tsyplakov (2011, Section 2) noted without proof that if the forecasts F t are based only on past observations, i.e., if F t is A t -measurable, then the converse holds, namely, uniform and independent PITs arise only if F t is ideal relative to A t . The following result formalizes Tsyplakov's claim and proves it in the general setting, without any assumption of continuity.
Theorem C.1. Let (Y t ) t=0,1,... be a sequence of random variables, and let A t = σ(Y 0 , . . . , Y t−1 ) for t = 0, 1, . . . Furthermore, let (F t ) t=0,1,... be a sequence of CDFs, such that F t is A t -measurable for t = 0, 1, . . . , and let (U t ) t=0,1,... be a sequence of independent, uniformly distributed random variables, independent of the sequence (Y t ). Then the sequence of randomized PITs, (Z t ) = (F t (Y t −) + U t (F t (Y t ) − F t (Y t −))) is an independent sequence of uniform random variables on the unit interval if, and only if, F t is ideal relative to A t , i.e., F t = L(Y t | A t ) almost surely for t = 0, 1, . . .
The proof utilizes the following simple lemma. Proof. Problem 14 of Breiman (1992, Chapter 4), which is proved by Schmidt (2011, Satz 18.2.10), states that for random variables X 1 and X 2 such that σ(Y, X 1 ) is independent of σ(X 2 ), E[Y | X 1 , X 2 ] = E[Y | X 1 ] almost surely. The statement of the lemma follows, as Proof of Theorem C.1. Since F t is measurable with respect to A t , there exists a measurable function f t : R t → F such that F t = f t (Y 0 , . . . , Y t−1 ) for each t by the Doob-Dynkin Lemma (Schmidt, 2011, Satz 7.1.16). 1 We define G t := f t (G −1 0 (Z 0 ), . . . , G −1 t−1 (Z t−1 )) recursively for all t, and show the "only if" assertion by induction.
Evidently, the assumption that no information other than the history of the time series itself has been utilized to construct the forecasts is very limiting. In this light, it is not surprising that, while the "if" part of Theorem C.1 is robust, the "only if" claim fails if F t is allowed to use information beyond the canonical filtration, even if that information is uninformative. A simple counterexample is given by the unfocused forecast from Example 2.2, which is probabilistically calibrated, but fails to be auto-calibrated. Its PIT nevertheless is uniform and independent even for autoregressive variants (Tsyplakov, 2011, Section 6).
1 Note that f 0 is constant, and ft is not a random quantity, but a fixed function that encodes how the predictive distributions are generated from past observations. The σ-algebra on F , which is implicitly used throughout, is given by A F = σ({{F ∈ F : F (x) ∈ B} : x ∈ Q, B ∈ B(R)}), where B(R) denotes the Borel σ-algebra on R. For each x ∈ Q there exists a measurable function fx,t such that Ft(x) = fx,t(Y 0 , . . . , Y t−1 ) by the Doob-Dynkin Lemma, and ft is essentially the countable (and hence measurable) collection (fx,t) x∈Q .

C.2 Details and further results for the Bank of England example
Bank of England forecasts of inflation rates are available within the data accompanying the quarterly Monetary Policy Report (formerly Inflation Report), which is available online at https://www.bankofengland.co.uk/sitemap/monetary-policy-report. The forecasts are visualized and communicated in the form of fan charts that span prediction intervals at increasing forecast horizons, and derive from two-piece normal forecast distributions.
A detailed account of the parametrizations for the two-piece normal distribution used by the Bank of England can be found in Julio (2006), and we have implemented the formulas in this reference. Historical quarterly CPI inflation rates are published by the UK Office for National Statistics (ONS) and available online at https://www.ons.gov.uk/economy/inflationandpriceindices/timeseries/d7g7.
We consider forecasts of consumer price index (CPI) inflation based on market expectations for future interest rates at prediction horizons of zero to six quarters ahead, valid for the third quarter of 2005 up to the first quarter of 2020, for a total of n = 59 quarters. These and earlier Bank of England forecasts of inflation rates have been checked for reliability by Wallis (2003), who considered probabilistic calibration, by Clements (2004) in terms of probabilistic, mean, and threshold calibration, by Galbraith and van Norden (2012), who considered probabilistic and mean calibration, by Strähl and Ziegel (2017) with focus on conditional exceedance probability (CEP) calibration, and by Pohle (2020), who considered quantile calibration. The 2% inflation target is discussed at the Bank of England website at https://www.bankofengland.co.uk/monetary-policy/inflation. Figures 12-17 show calibration diagnostics for inflation forecasts at prediction horizons of k ∈ {0, 2, 3, 4, 5, 6} quarters ahead, in the same format as Figure 8 in the main text, which concerns forecasts at a lead time of one quarter.