Point forecasting and forecast evaluation with generalized Huber loss

Huber loss, its asymmetric variants and their associated functionals (here named Huber functionals) are studied in the context of point forecasting and forecast evaluation. The Huber functional of a distribution is the set of minimizers of the expected (asymmetric) Huber loss, is an intermediary between a quantile and corresponding expectile, and also arises in M-estimation. Each Huber functional is elicitable, generating the precise set of minimizers of an expected score, subject to weak regularity conditions on the class of probability distributions, and has a complete characterization of its consistent scoring functions. Such scoring functions admit a mixture representation as a weighted average of elementary scoring functions. Each elementary score can be interpreted as the relative economic loss of using a particular forecast for a class of investment decisions where profits and losses are capped. The relevance of this theory for comparative assessment of weather forecasts is also discussed.


Introduction
In many fields of human endeavor, it is desirable to make forecasts for an uncertain future. Hence, forecasts should be probabilistic, presented as probability distributions over possible future outcomes (Gneiting and Katzfuss, 2014). Nonetheless, many practical situations require forecasters to issue single-valued point forecasts. In this situation, a directive is required about the specific feature or functional of the predictive distribution that is being sought, or about the loss (or scoring) function that is to be minimized (Gneiting, 2011a;Ehm et al., 2016). Examples of functionals include the mean, median, a quantile or expectile, with the latter recently attracting interest in risk management (Bellini and Di Bernardino, 2017). Examples of scoring functions include the squared error scoring function S(x, y) = (x − y) 2 and absolute error scoring function S(x, y) = |x − y|. In the case that the directive is in the form of a statistical functional, it is critical that any scoring function used is appropriate for the task at hand. Ideally, a point forecast sampled from one's predictive distribution by using the requested functional should also minimize one's expected score. That is, the scoring function should be consistent for the functional (Gneiting, 2011a). It is well-known that the squared error scoring function is consistent for the mean and that the absolute error scoring function is consistent for the median. Within this framework, predictive performance is assessed by computing the mean score over a number of forecast cases.
This paper studies Huber loss, and asymmetric variants of Huber loss (Definition 4.2), as a scoring function of point forecasts, along with its associated functional. The classical Huber loss function (Huber, 1964) with positive tuning parameter a is given by S(x, y) = 1 2 (x − y) 2 , |x − y| ≤ a a|x − y| − 1 2 a 2 , |x − y| > a, (1.1) where x is a point forecast and y the corresponding realization. It applies a quadratic penalty to small errors and a linear penalty to large errors and is an intermediary between the squared error and absolute error scoring functions. Huber loss is used by the Australian Bureau of Meteorology (BoM) to compare predictive performance of temperature and wind speed forecasts with a view to streamlining forecast production (Foley et al., 2019), and is described to weather forecasters in that organization as "a compromise between the absolute error and the squared error, in an attempt to use the benefits of both of these." The motivation for using Huber loss in this applied context is given in Section 2. For a given predictive distribution, we call the set of point forecasts that minimize expected Huber loss the Huber mean of that distribution. The Huber mean is an intermediary between the median and the mean with some appealing properties. It can be described as the midpoint of the 'central interval' of length 2a of a distribution, where a is the tuning parameter of the corresponding Huber loss function (see Equation 3.6 and the accompanying geometric interpretation). The Huber mean, unlike the mean, is not dependent on the behavior of the distribution at its tails. At the same time, it accounts for more behavior in the vicinity of the center of the distribution than the median. It is therefore a robust measure of location for a distribution. More generally, the Huber functional gives the minimizers of expected asymmetric Huber loss, and is an intermediary between some α-quantile and α-expectile, as was also noted by Jones (1994) from the perspective of M-estimation. Basic properties of Huber means and Huber functionals are discussed in Section 3, many of which can be traced to Huber (1964) in the classical symmetric case.
In the context of point forecasting, an essential property of a statistical functional is that it is elicitable; that is, that the functional generates precisely the set of minimizers of some expected score (Lambert et al., 2008). The Huber functional is shown to be elicitable for classes of probability distributions on R (or subintervals of R) under weak regularity assumptions (Theorem 4.5). The class of scoring functions that are consistent for the Huber functional is also characterized, being parameterized by the set of convex functions (also Theorem 4.5). Edge cases of this characterization recover the general form of scoring functions that are consistent for quantiles (Gneiting, 2011b;Thomson, 1979) and expectiles (Gneiting, 2011a).
Determining which consistent scoring function to use is non-trivial since in practice this choice can influence forecast rankings (Murphy, 1977;Schervish, 1989;Merkle and Steyvers, 2013) as illustrated in Section 5.1. In the case of quantiles and expectiles, Ehm et al. (2016) gave clarity to this issue by showing that each consistent scoring function for those functionals admits a mixture representation; that is, can be expressed as a weighted average of elementary scoring functions. Likewise, each consistent scoring function for the Huber functional can be expressed as a weighted average of elementary scoring functions that are consistent for the Huber functional (Theorem 5.3). Again, the analogous results for quantiles and expectiles are recoverable as edge cases of this theorem. The Huber functional and its associated elementary scores arise naturally in optimal decision rules for investment problems with fixed up-front costs, where profits and losses are capped (Section 5.3). Such models are intermediaries between the classical simple cost-loss decision model (e.g. Richardson (2000); Wolfers and Zitzewitz (2008)) on the one hand, and investment decision models with no bounds on profits or losses (Ehm et al., 2016;Bellini and Di Bernardino, 2017) on the other. The mixture representation, along with the economic interpretation of elementary scoring functions and Murphy diagrams, aids interpreting forecast rankings in empirical situations (Ehm et al. (2016); Section 5.4). Applications include, for example, selecting a consistent scoring function that emphasizes predictive performance at the extremes of a variable's range (Taggart, 2021).
Conclusions are presented in Section 6 and proofs of the main results are given in the appendix.

The use of Huber loss for assessing weather forecast quality
The following summarizes the context and initial motivation for using Huber loss to assess predictive performance of forecast systems at the Australian BoM.
The U.S. National Weather Service and the BoM have an operational model where automated gridded weather forecasts are curated and manually adjusted by meteorologists prior to publication for public use. Advances in numerical weather prediction have led to on-going re-evaluation of the role of human forecasters in meteorological service provision (Just and Foley, 2020;Sturrock and Griffiths, 2020). At the BoM, comparative assessment of the predictive performance of official forecasts, which are issued by meteorologists, and automated forecasts from the Operational Consensus Forecasts (OCF) system (Bureau of Meteorology, 2017), was performed initially for forecasts of probability of precipitation. One goal was to advise on the likely impact on forecast quality if automation were adopted. The forecast service was well-defined for these precipitation forecasts and the Brier score was used as a consistent scoring function for ranking the two forecast systems and reporting on the statistical significance of those ranks (Griffiths et al. (2017), c.f. Section 5.1).
However, for some other variables, such as daily maximum temperature, the BoM's forecast service was not clearly defined. Point forecasts (i.e. R-valued forecasts) were requested from forecasters but there was no policy on which point forecast was suitable, either in the form of a directive (e.g. issue the mean of the predictive distribution) or of a scoring (or loss) function to be minimized. This lack of service clarity led a team within the BoM to seek a scoring function that would be suitable for penalizing forecast errors when the forecast user group is the heterogeneous public.
First, scoring functions that penalized over-and under-prediction asymmetrically were not considered. Second, anecdotal evidence suggested that maximum temperature forecasts with small errors (where the forecast x and observation y differed by about 1 • C) were generally viewed favorably by the public whereas those with larger forecast errors (around 4 or 5 • C) were not. Furthermore, the heavy reputational costs associated with egregious errors of comparable size were similar: a 9 • C error was not much worse than an 8 • C error. Hence, for daily maximum temperature forecasts, two clear preferences could be articulated: 1. five 1 • C errors are preferred over four perfect forecasts followed by a 4 • C error; 2. a 9 • C error followed by a perfect forecast is to be preferred over an 8 • C error followed by a 4 • C error.
Two commonly used scoring functions, the absolute error and squared error functions, do not satisfy both requirements. Table 1 shows that the mean absolute error scores for these error sequences are consonant with Preference 2 but not with Preference 1, while the opposite holds for mean squared error. However, the mean scores generated by the Huber loss scoring function of Equation (1.1), with a = 3, are consonant with both preferences. This is the scoring function that was selected. Huber loss has subsequently been used by the BoM to score daily maximum and minimum temperature forecasts, and hourly temperature, dewpoint temperature and wind magnitude forecasts, each with an appropriate choice of tuning parameter a. The remainder of the paper is devoted to developing the theory of Huber loss, and its asymmetric variants, in the context of point forecasting and forecast evaluation, with some applications of the theory illustrated using official BoM forecasts and OCF forecasts.

Quantiles, expectiles and Huber functionals
To begin, we establish some notation. We work in a setting where forecasts are made for some quantity, where the range of possible outcomes belongs to some interval I ⊆ R. Forecasts of the quantity can be in the form of a predictive distribution F on I or of a point forecast x in I. The realization (or observation) of the quantity will usually be denoted by y.
Let F(R) denote the class of probability measures on the Borel-Lebesgue sets of R and F(I) denote the subset of probability measures on I. For simplicity, we do not distinguish between a measure F in F(R) and its associated cumulative density function (CDF) F . For F in F(I), write Y ∼ F to indicate that a random variable Y has distribution F ; that is, P(Y ≤ t) = F (t) whenever t ∈ I. Throughout, the notation E F indicates that the expectation is taken with respect to Y ∼ F .
The power set of a set A will be denoted P(A). For a real-valued quantity X, we denote by X + the quantity max(0, X). The partial derivative with respect to the ith argument of a function g is denoted ∂ i g.
Whenever a, b ∈ [0, ∞], define the 'capping function' κ a,b : R → R by That is, κ a,b (x) is x capped below by −a and above by b. Note that x + = κ 0,∞ (x). In many contexts users or issuers of forecasts want a relevant point summary x of a predictive distribution F . This can be generated by requesting a specific statistical functional of F . Given an interval I ⊆ R and some space F of probability distributions in F(I), a statistical functional (or simply a functional ) T on F(I) is a mapping T : F(I) → P(I) (Horowitz and Manski, 2006;Gneiting, 2011a). Two important examples are quantiles and expectiles.
Example 3.1. Suppose that I ⊆ R and α ∈ (0, 1). The α-quantile functional Q α : F(I) → P(I) is defined by For any F , Q α (F ) is a closed bounded interval of I. The two endpoints only differ when the level set F −1 (α) contains more than one point, so typically the functional is single valued. The median functional Q 1/2 arises when α = 1/2. If q is an α-quantile of F and F is continuous at q then F (q)/(1 − F (q)) = α/(1 − α). Figure 1 illustrates the quantiles Q 1/2 (F ) (the median) and Q 0.7 (F ), where F is the exponential distribution. The aforementioned property is illustrated in the figure via the vertical dashed line segments, whose lengths are in the ratio α : (1−α).
Example 3.2. Given an interval I ⊆ R, let F 1 (I) denote the space of probability measures F(I) with finite first moment. The α-expectile functional E α : F 1 (I) → P(I) is defined by Figure 1: The quantile Q α , expectile E α and Huber quantile H α (where H α = H α a (F ), a = 0.6) when α = 0.5 (top) and α = 0.7 (bottom) for the exponential distribution F (t) = 1 − exp(−t), t ≥ 0. The ratios of the areas of the two shaded regions, of the areas of the two regions bounded by thick dashed lines, and of the lengths of the two dotted line segments, are α : (1 − α).
whenever F ∈ F 1 (I). It can be shown there is a unique solution x to the defining equation, so expectiles are single-valued. Expectiles were introduced by Newey and Powell (1987) in the context of least squares estimation and have recently attracted interest in financial risk management (Bellini and Di Bernardino, 2017). Expectiles share properties of both expectations as well as quantiles, and nests the mean functional E 1/2 . Using integration by parts, one can show that {x} = E α (F ) if and only if The latter equation gives a geometric interpretation of the α-expectile of F . It is the unique point x such that the (1 − α)-weighted area of the region bounded by F and 0 on the interval (−∞, x] ∩ I is equal to the α-weighted area of the region bounded by F and 1 on the interval [x, ∞) ∩ I. Figure 1 illustrates this interpretation, via the areas of the shaded regions, for the expectiles E 1/2 (F ) (i.e. mean) and E 0.7 (F ), where F is the exponential distribution. Equation (3.1) can be re-written as By modifying the parameters of the capping function κ 0,∞ , we introduce another functional.
Definition 3.3. Suppose that a > 0, b > 0, α ∈ (0, 1) and that I ⊆ R is an interval. Then the Huber functional H α a,b : F(I) → P(I) is defined by whenever F ∈ F(I). In the case when a = b, we simplify notation and write H α a (F ) for H α a,a (F ). The special case H 1/2 a (F ) is called a Huber mean.
We have named the Huber functional for Peter Huber, whose loss function h a (u) = 1 2 u 2 , |u| ≤ a a|u| − 1 2 a 2 , |u| > a (3.4) (Huber, 1964) also bears his name. The connection between the Huber functional and Huber loss will be made explicit in Section 4. Since the Huber functional is an example of a generalized quantile (Breckling and Chambers, 1988;Jones, 1994;Bellini et al., 2014), H α a,b (F ) may also be called a Huber quantile of F . We note here that The function V is an identification function (Gneiting, 2011a, Section 2.4) for H α a,b , and will be used to establish important properties of the Huber functional.
As with expectiles, a routine calculation using integration by parts shows that x ∈ H α a,b (F ) if and only if This gives a geometric interpretation of the Huber functional as the set of points x where the (1 − α)weighted area of the region bounded by F and 0 on [x − b, x] ∩ I equals the α-weighted area of the region bounded by F and 1 on [x, x + a] ∩ I. In the case when α = 1/2, the two areas are equal. This is illustrated for the exponential distribution in Figure 1 for H α 0.6 (F ) (when α = 1/2 and α = 0.7) and in Figure 2 for H α a,b (F ) (when α = 0.7, a = 2 and b = 1). In light of the corresponding geometric interpretations of quantiles and expectiles, and also the similarity between Equations (3.2) and (3.3), it should come as no surprise that α-quantiles and α-expectiles are nested as edge cases in the family {H α a } a∈(0,∞) of Huber functionals. The following proposition makes this precise and lists several other basic properties of the Huber functional. In what follows, F −1 (w) denotes the closure in R of the level set F −1 (w), and R(F ) denotes the smallest closed interval of R that contains the support of the measure F .

If F has finite first moment then
6. IfF ∈ F(I) and F (t) =F (t) whenever Part (1) is similar to Proposition 1(a) of Bellini et al. (2014), whilst parts (4), (5) and (6) were noted, in the case of finite discrete distributions when a = b and α = 1/2, by Huber (1964). The proof is given in the appendix.
Part (6) can be interpreted as saying that the Huber functional only depends on the values of the CDF F away from its tails. In situations where the tail of a predictive distribution is difficult to model, but a point summary describing its broad center is desired, this property is useful. In particular, the Huber functional is invariant to the modification of F outside the interval In contrast, modification of the tails of F will generally change its mean and expectile values, whilst quantile values are invariant to modifications of F anywhere apart from at the quantile.
Parts (2) and (3) specify conditions on F for which H α a,b (F ) is multi-valued. Expectiles are always single-valued whereas quantiles can sometimes be multi-valued. Multi-valued quantiles can arise when F has a bi-modal probability density function (PDF), taking values from the interval I that separates the two distinct densities that comprise the PDF. Huber functionals H α a,b (F ) provide a single-valued alternative to multi-valued quantiles by choosing a and b such that a + b is at least the length of I. A precisely stated corollary is that if each level set of F on R(F ) has length not exceeding a + b then H α a,b (F ) is single-valued for every α in (0, 1). It also follows that H α a,b (F ) is single-valued whenever the quantile Q α (F ) is single-valued. Figure 2 illustrates a distribution F for which H 1/2 a (F ) is multi-valued if and only if 0 < a < 3. In this particular case, F has a symmetric bi-modal PDF, and also the property that E 1/2 (F ) ⊆ H 1/2 a (F ) ⊂ Q 1/2 (F ) whenever a > 0.
Note that while H α a (F ) is in some sense an intermediary between Q α (F ) and E α (F ), the right-hand side of Figure 1 illustrates that the Huber quantile does not always lie between the corresponding quantile and expectile.

Scoring functions, consistency and elicitability
In this section we discuss scoring functions and their relationship to point forecasts and functionals. Two key concepts are those of consistency and elicitability. How these concepts relate to the Huber functional is the subject of Theorem 4.5, which is the first major result of this paper.

Scoring functions and Bayes' rules
The scoring function S is said to be regular if (i) for each x ∈ I the function y → S(x, y) is measurable, and (ii) for each y ∈ I the function x → S(x, y) is continuous, with continuous derivative whenever x = y.
The score S(x, y) can be interpreted as the loss or cost accrued when the point forecast x is issued and the observation y realizes. Examples of scoring functions include the squared error scoring function S(x, y) = (x − y) 2 , the absolute error scoring function S(x, y) = |x − y| and the zero-one scoring function S(x, y) = 1 {|x−y|≥k} (x), for some positive k. Only first two of these are regular, whilst the zero-one scoring function fails to be regular on account of its discontinuity when |x − y| = k. The measurability condition (i) is a technical condition that is satisfied by most (if not all) scoring functions that arise in practice.
Huber loss (3.4) gives rise to the regular scoring function S(x, y) = h a (x − y). We introduce a more general version.

The classical Huber loss function given by Equation (3.4) is 2h
1/2 a,a . The same generalization is used by Zhao et al. (2021) for robust expectile regression. Figure 2 shows the graph of h 0.7 2,1 . Note Generalized Huber loss gives rise to the regular scoring function S(x, y) = h α a,b (x − y). Given a scoring function S, a forecast system that generates point forecasts can be assessed by computing its mean scoreS, whereS over a finite set of forecast cases {x 1 , . . . , x n } with corresponding observations {y 1 , . . . , y n }. In this framework, if a number of competing forecast systems are being compared then the one with the lowest mean score is the best performer. Thus, given a scoring function S and predictive distribution F , an optimal point forecast is anyx in I that minimizes the expected score; that is, provided that the expectation exists. A point forecast that is optimal in this sense is also known as a Bayes' rule (Gneiting, 2011a;Ferguson, 1967). It has long been known that the Bayes' rule under the squared error scoring function S(x, y) = (x − y) 2 is the mean of F , and under the absolute error scoring function S(x, y) = |x − y| is any median of F . The Bayes' rule under the asymmetric piecewise linear scoring function is a quantile Q α (F ) (e.g. Ferguson (1967)), whilst the Bayes' rule under the asymmetric quadratic scoring function is the expectile E α (F ) (Newey and Powell, 1987;Gneiting, 2011a).
To find the Bayes' rule under the generalized Huber loss scoring function S( is the identification function given by (3.5). This implies that x ∈ H α a,b (F ). So, at least formally, the Bayes' rule under generalized Huber loss is the corresponding Huber functional of F . A precise statement will be given in the next subsection.

Consistency and elicitability
Whenever a point forecast request specifies what functional of the predictive distribution is being sought, the scoring function used to evaluate the point forecast should be appropriate for that functional.
Definition 4.3. (Gneiting, 2011a;Murphy and Daan, 1985) Suppose that I ⊆ R. A scoring function S : I × I → R is said to be consistent for the functional T relative to a class F of probability distributions on I if for all probability distributions F in F, all t in T(F ) and all x in I. The functional T is said to be strictly consistent relative to the class F if it is consistent relative to the class F and if equality in (4.4) implies that x ∈ T(F ).
Evaluating point forecasts with a strictly consistent scoring function rewards forecasters who give truthful point forecast quotes from carefully considered predictive distributions. This is because the requested functional of the predictive distribution coincides with the optimal point forecast (or Bayes' rule).
The families of consistent scoring functions for quantiles and expectiles each have a standard form. Subject to slight regularity conditions, a scoring function S is consistent for the quantile functional Q α if and only if S is of the form where g is a non-decreasing function (Gneiting, 2011b;Thomson, 1979;Saerens, 2000). Moreover, if g is strictly increasing then S is strictly consistent. The standard asymmetric piecewise linear scoring function (4.2) for quantiles (which includes, up to a multiplicative constant, the absolute error scoring function for the median) is recovered from Equation (4.5) with the choice g(t) = t. Subject to standard regularity conditions, a scoring function S is consistent for the expectile functional E α if and only if S is of the form where φ is a convex function with subgradient φ (Gneiting, 2011a). Moreover, if φ is strictly convex then S is strictly consistent. The standard asymmetric quadratic scoring function (4.3) for expectiles (including, up to a multiplicative constant, the squared error scoring function for the mean) is recovered from (4.6) by taking φ(t) = t 2 . When α = 1/2, the function S of (4.6) is known as a Bregman function.
We will show that consistent scoring functions for the Huber functional also have a standard form. Before doing so, we introduce a critical concept related to the evaluation of point forecasts.
Definition 4.4. (Lambert et al., 2008) A statistical functional T is said to be elicitable relative to a class F of probability distributions if there exists a scoring function S that is strictly consistent for T relative to F.
For example, quantiles are elicitable relative to the class F(R), while expectiles are elicitable relative to the class of distributions in F(R) with finite first moment (Gneiting, 2011a). It is worth noting that some statistical functionals are not elicitable, including the sum of two distinct quantiles and conditional value-at-risk, a risk measure used in finance (Gneiting, 2011a).
We turn now to the Huber functional. The main thrust (subject to appropriate regularity conditions) is that the Huber functional is elicitable, and that S is consistent for H α a,b if and only if S is of the form where φ is a convex function with subgradient φ . Moreover, S is strictly consistent if φ is strictly convex. The generalized Huber loss scoring function S(x, y) = h α a,b (x − y) arises from Equation (4.7) with the choice φ(t) = t 2 . The following gives a precise statement.
1. The Huber functional H α a,b is elicitable relative to the class of probability measures F(I) when I is bounded or semi-infinite, and elicitable relative to the class of probability measures F(I) with finite first moment when I = R.
2. Suppose that φ : I → R is convex on I. Then the function S : I × I → R, defined by Equation (4.7), is a consistent scoring function for the Huber functional H α a,b relative to the class F(I) of probability measures F for which both and are finite. If, additionally, φ is strictly convex then S is strictly consistent for H α a,b relative to the same class of probability measures.
3. Suppose that the scoring function S : I × I → R is regular. If S is consistent for the Huber functional H α a,b relative to the class of probability measures in F(I) with compact support, then S is of the form (4.7) for some convex function φ : I → R. Moreover, if S is strictly consistent then φ is strictly convex.
The proof is given in the appendix. The general form (4.7) for the consistent scoring functions of the Huber functional yields, as edge cases, the general form for the consistent scoring functions of expectiles and quantiles. To be precise, let S H,φ a denote the scoring function S given by (4.7) when a = b, and let S E,φ and S Q,g denote the consistent scoring functions of Equations (4.6) and (4.5) respectively. The relationship between S H,φ a and S E,φ is straightforward via pointwise limit lim a→∞ S H,φ a (x, y) = S E,φ (x, y). (4.8) For the other end of the spectrum we consider the rescaled consistent scoring function S H,φ a /a, and obtain the pointwise limit where φ is nondecreasing because φ is convex. Importantly, the relevant regularity conditions ensure that every non-decreasing function g in the representation (4.5) is the subderivative of some suitable convex φ.
The consistent scoring functions for the Huber functional thus show a mixture of the properties of the consistent scoring functions for quantiles and expectiles. Focusing on the functional H 1/2 a for positive a, the only consistent scoring function (up to a multiplicative constant) on R × R that only depends on the difference x − y between the forecast and observation is the classical Huber loss scoring function (x, y) → h 1/2 a,a (x − y). This is because the only Bregman function (up to a multiplicative constant) that has the same property for E 1/2 is the squared error scoring function (x, y) → (x − y) 2 (Savage, 1971). Hence, apart from multiples of classical Huber loss, other consistent scoring functions for H 1/2 a on R × R penalize under-and over-prediction asymmetrically. One such example is the exponential family (4.10) parameterized by λ ∈ R and obtained from (4.7) via φ(t) = 2 exp(λt)/λ 2 . These are analogous to the exponential family of Bregman functions considered by Patton (2020).

Mixture representations and Murphy diagrams
The main theoretical tool presented in this section is the mixture representation for consistent scoring functions of the Huber functional (Theorem 5.3). Mixture representations were introduced for quantiles and expectiles by Ehm et al. (2016) and have several very useful applications, including providing insight into forecast rankings.

Ranking of forecasts
Recall from Section 4.1 that point forecasts from two competing forecast systems A and B can be ranked by calculating their mean scoresS A n andS B n over a finite number n of forecast cases for some scoring function S. If the forecast cases are independent, a statistical test for equal predictive performance can be based on the statistic t n , where for forecasts {x A i } and {x B i } and corresponding realizations {y i }. Subject to traditional regularity conditions, the statistic t n is standard normal under the null hypothesis of vanishing expected score differentials. Corresponding p-values are computed and if the null hypothesis is rejected then A is preferred if t n < 0 and B is preferred otherwise (Diebold and Mariano, 1995;Gneiting and Katzfuss, 2014). Unfortunately, forecast rankings and the results of hypothesis tests can depend on the choice of consistent scoring function (Ehm et al., 2016, pp. 506, 515-516), as we now illustrate.
Example 5.1. Two forecast systems, OCF and MAN, of the Australian BoM produce point forecasts for the daily maximum temperature at Sydney Observatory Hill. The OCF system is fully automated and generates forecasts from a blend of bias-corrected numerical weather prediction forecasts. The MAN forecast is the official forecast of the BoM and is manually issued by meteorologists who have access to various information sources, including OCF. We consider forecasts for the period July 2018 to June 2020 with a lead time of one day. See Figure 3 for a sample time series of MAN and OCF forecasts with observations.
Suppose that these forecasts are targeting the Huber mean H 1/2 3 , and make the simplifying assumption that successive forecast cases are independent. If the consistent scoring function S(x, y) = 2 h 1/2 3,3 (x − y) is used, then the mean score for MAN is lower than the mean score for OCF, and with a p-value of 6.52 × 10 −4 the null hypothesis of equal predictive performance is rejected at the 5% significance level in favor of MAN forecasts. However, if the consistent scoring function S 2;3 defined by Equation (4.10) is used, then OCF has the lower mean score, albeit with a p-value of 0.333 that does not lead to rejection of the null hypothesis.

Mixture representations
In Section 4.2 it was seen that the class of consistent scoring functions for each quantile, expectile and Huber functional is very large, being parametrized either by the set of nondecreasing functions or by the set of convex functions. The following results show that this apparent multitude can, in a certain sense, be reduced to a one-parameter family of so-called elementary scoring functions.
In general, the choice of function φ in the representations (4.6) and (4.7) is not unique. To facilitate precise mathematical statements, a special version of φ will be chosen. Let I denote the class of all left-continuous non-decreasing functions on R, and let C denote the class of all convex functions φ : R → R with subgradient φ in I. This last condition will be satisfied if φ is chosen to be the left-hand derivative of φ. Denote by S Q α the class of scoring functions S of the form (4.5) such that g ∈ I, by S E α the class of scoring functions S of the form (4.6) such that φ ∈ C, and by S H α,a,b the class of scoring functions S of the form (4.7) such that φ ∈ C. For most practical purposes, S Q α , S E α and S H α,a,b can be identified with the class of consistent scoring functions for the respective functional on R.
The following important result on the representation of scoring functions that are consistent for the quantile and expectile functionals is due to Ehm et al. (2016).
Both integral representations (5.2) and (5.4) hold pointwise. The functions defined by (5.3) and (5.5) are called elementary scoring functions for the quantile and expectile functionals respectively. Thus Theorem 5.2 essentially states that each scoring function that is consistent for a quantile or expectile functional can be expressed as a weighted average of corresponding elementary scoring functions. The analogous result for Huber functionals is new and stated below.
Theorem 5.3. Every member S of the class S H α,a,b has a representation of the form
The proof is given in the appendix and is a simple adaptation of the proof for quantiles and expectiles.
Each function S H α,a,b,θ of Theorem 5.3 is called an elementary scoring function for the Huber functional, and also belongs to S H α,a,b , as can be seen via Equation (4.7) with the choice φ(t) = (t−θ) + and φ (t) = 1 {θ<t} . The mixture representation of Equation (5.6) holds pointwise. Moreover, when a = b, the mixture representations for the consistent scoring functions of expectiles and quantiles emerge as edge cases of Theorem 5.3 by taking limits as a → ∞ and as a ↓ 0 and using the dominated convergence theorem. Details are given in Remark A.1. Ehm et al. (2016) showed how the elementary scoring functions S Q α for quantiles have a natural economic interpretation related to binary betting and the classical simple cost-loss decision model (e.g Richardson (2000); Wolfers and Zitzewitz (2008)). On the other hand, the elementary scoring functions S E α for expectiles arise naturally in simple investment decisions where profits attract taxation and losses tax deduction, possibly at different rates (Ehm et al., 2016;Bellini and Di Bernardino, 2017). The elementary scoring functions S H α,a,b,θ for the Huber functional also admit an economic interpretation. It is the loss, relative to actions based on a perfect forecast, of an investment decision Monetary payoff

Economic interpretation of elementary scoring functions
with fixed costs, possibly differential tax rates for profits versus losses, and where profits and losses are capped. This represents an intermediary position between the interpretation for quantiles (where economic losses, if they occur, are fixed irrespective of how near or far the forecast is to the realization) and that for expectiles (where there is no cap on profits or on losses). To illustrate, we give two examples. The first is an adaptation of the interpretation for the elementary scoring functions of expectiles presented by Ehm et al. (2016), while the second shows how the Huber functional and its elementary functions can arise in the context of investment decisions based on weather forecasts.
Example 5.4. Suppose that Alexandra considers investing a fixed amount θ in a start-up company in exchange for an unknown future amount y of the company's profits or losses. Additionally, Alexandra takes out an option to set a limit b on losses she could incur but which also imposes a limit a on the profits she could receive. Alexandra will make a profit if and only if y > θ, and so adopts the decision rule to invest if and only if her point forecast x of y exceeds θ. Her pay-off structure is as follows: 1. If Alexandra refrains from the deal, her pay-off will be 0, independent of the outcome y.
Here min(θ − y, b) is the monetary loss, bounded by b, and the factor 1 − r L accounts for Alexandra's reduction in income tax with r L ∈ [0, 1) representing the deduction rate.
3. If Alexandra invests and y > θ realizes then her pay-off is positive at (1 − r G ) min(y − θ, a), where r G ∈ [0, 1) denotes the tax rate that applies to her profits.
The top matrix in Table 2 shows Alexandra's pay-off under her decision rule. The positivelyoriented pay-off matrix can be reformulated as a negatively oriented regret matrix, by considering the difference between the pay-off for an (hypothetical) omniscient investor who has access to a perfect forecast and the pay-off for Alexandra. For example, if x ≤ θ and y > θ realizes, then the omniscient investor's pay-off is (1 − r G ) min(y − θ, b) while Alexandra's pay-off is 0, and so Alexandra's regret is (1 − r G ) min(y − θ, b). The bottom matrix of Table 2 is Alexandra's regret matrix, which up to a multiplication factor is the elementary score S H α,a,b,θ (x, y). So to minimize regret, Alexandra should invest if and only if x > θ, where x = H α a,b (F ), F is Alexandra's predictive distribution of the future value of the investment and α = (1 − r G )/(2 − r L − r G ). The point forecast x = H 1/2 a (F ) arises if profits and losses are capped by the same value and if the rates r G and r L are equal.
Example 5.5. Hannah runs a business selling ice creams from a mobile cart at a sports stadium. Historically, there is an approximately linear relationship between the volume of ice cream sales on any given afternoon and the observed daily maximum temperature, so that the profit p from sales is modeled by p = ky + c, where y is the observed daily maximum temperature, k > 0 and c ∈ R.
Additionally, 0 ≤ p ≤ a for some positive a, since total sales are limited by cart capacity, while any unsold units can be sold at a later date. If Hannah chooses to sell ice creams on any given afternoon, she must also pay a fixed cost f (staff wages and stadium fees). If model assumptions are correct, Hannah will make a profit if and only if ky + c > f . So she adopts the decision rule to sell ice creams on any given afternoon if and only if her point forecast x of the maximum temperature exceeds the decision threshold θ, where θ = (f − c)/k. Her pay-off structure is as follows.
1. If Hannah does not sell ice creams then her pay-off is 0.
2. If Hannah sells ice creams and y > θ then her profit after tax is where r G ∈ [0, 1) denotes the tax rate. Her profit can be rewritten as (1 − r G )k min(y − θ, (a − f )/k).
3. If Hannah sells ice creams and y < θ then her loss after tax deductions is (1 − r L ) min(f − (ky + c), f ), where r G ∈ [0, 1) denotes the deduction rate, and losses are capped by f since unsold ice creams go back into storage. Her loss can be rewritten as (1 − r L )k min(θ − y, f /k).
As with Example 5.4, these outcomes can be converted to a regret matrix, which up to a multiplication factor is the elementary score S H α,(a−f )/k,f /k,θ (x, y) where α = (1−r G )/(2−r L −r G ). Consequently, her optimal decision rule is to sell ice creams if and only if The essential features of Example 5.5 also arise in the context of rainfall storage and water trading. Any profits made by selling harvested water are capped by storage capacity. The predicted volume v of water that is collected from any rainfall event can be modeled by v = ky + c, where y is the predicted rainfall at a representative point within the catchment, c is catchment initial loss and k is determined by catchment size and continuing loss.

Forecast dominance, Murphy diagrams and choice of consistent scoring function
We return to the problem of forecast rankings with the notion of forecast dominance (Ehm et al., 2016, Section 3.2). We say that forecast system A dominates forecast system B for point forecasts targeting a specific Huber functional if the expected score of point forecasts from A is not greater than the expected score of point forecasts from B, for every consistent scoring function. In practice this is impossible to check directly because the family of consistent scoring functions, parameterized by φ ∈ C, is very large. However, by the mixture representation of Theorem (5.3), one need only test for dominance over the family, parametrized by θ ∈ R, of elementary functions. In empirical situations, this is further reduced to checking forecast dominance for finitely many θ. In what follows, we consider tuples (x iA , x iB , y i ) consisting of the ith point forecast from systems A and B along with the corresponding observation y i .
Corollary 5.6. Suppose that α ∈ (0, 1), a > 0 and b > 0. The forecast system A empirically dominates B for predictions targeting H α a,b if To see why, note that the score differential θ → d i (θ) for the ith forecast case is piecewise linear and right-continuous, and is zero unless θ lies between x iA and x iB . The only possible discontinuities are at x iA and x iB , and the only possible changes of slope are at y i , y i − a and y i + b.
for each forecast source, computed at each of the points θ of Corollary 5.6. The top left of Figure 3 presents the Murphy diagram for three different forecasts targeting the Huber mean H 1/2 3 of the daily maximum temperature at Sydney Observatory Hill (July 2018 to June 2020). The MAN and OCF forecasts were discussed in Example 5.1. For any given day, the Climate forecast is the Huber mean H 1/2 3 of 46 observations, sampled from the previous 15 days and from a 31 day period this time last year centered on the day in question. A lower mean score is better.
The graph in the top right of Figure 3 represents forecast performance as a skill score (Jolliffe and Stephenson, 2003, Section 2.7) with respect to two reference forecasts: the perfect forecast (skill score = 1) and the Climate forecast (skill score = 0). The difference in mean elementary scores between OCF and MAN forecasts is presented in the bottom left, with pointwise 95% confidence intervals. Neither of these forecasts dominates the other.
Returning to Example 5.5, if Hannah's decision rule is to sell ice creams if and only if the H 1/2 3 point forecast x exceeds 30 • C, then Hannah should base her decisions on the MAN forecast, since its mean elementary score, which is proportional to economic regret, is lowest (see the top left of Figure 3 where θ = 30). But if her fixed investment costs f changed, then so would her decision threshold θ, and the Murphy diagram indicates which forecast system historically performed better at the new threshold. The initial motivation for the BoM using Huber loss was to inform decisions for streamlining forecast production that are broadly consistent with public appraisal of forecast quality. The decision model associated with the elementary scoring functions of H 1/2 3 could be taken as a proxy for the decision model of the average forecast user. Under this assumption, the evidence presented in Figure 3 suggests that using the automated OCF forecast would have negligible impact for the average user who has a decision threshold lower than 28 • C. However, evidence points towards human meteorologists (MAN) producing better forecasts for users with temperature decision thresholds from the high 20s to the mid 30s and possibly beyond. Such information can be used to inform BoM policy regarding when its official forecast can be based on the automated system and when meteorologists should intervene. It also indicates where future improvements to the OCF system are possible.
The mixture representation and Murphy diagram also gives insight into why the two different scoring functions of Example 5.1 lead to different forecast rankings. The classical Huber loss scoring function S(x, y) = 2 h 1/2 3,3 (x − y) is obtained from Equation (4.7) with the choice φ(t) = t 2 . The corresponding mixing measure is dM (θ) = 2 dθ, implying that every elementary scoring function in the mixture representation (5.6) is weighted equally, and also that the area underneath each graph in the Murphy diagram (top left of Figure 3) is twice the mean Huber lossS for that forecast system. On the other hand, the exponential scoring function S 2;3 is obtained from Equation (4.7) with the choice φ(t) = exp(2t)/2. In this case dM (θ) = 2 exp(2θ) dθ and so mean elementary scores in the corresponding mixture representation are weighted heavily for higher values of θ. Hence when scored by S 2;3 , a slight over-forecast of 40.4 • C by MAN on 19 December 2019 (OCF forecast 35.4 • C and the observation was 39.3 • C) was penalized substantially more heavily than the OCF under-forecast, resulting in a higher mean scoreS 2;3 for MAN than OCF.
Finally, we consider the choice of consistent scoring function for Huber quantile point predictions in the situation where the point forecast serves the needs of a diverse community of users. Classical Huber loss, obtained when φ(t) = 2t 2 , applies equal weight to all θ. For everyday use, this choice of φ may be justified by the desire to weight all decision thresholds θ equally. On the other hand, for weather forecasts there may be a desire, from a public risk perspective, to give greater weight to values of θ that lie in the hazardous climatological extremes, so that competing forecast system candidates are evaluated with that in mind. For maximum temperature forecasts, the mixing measure dM (θ) = φ (θ) dθ, where puts increasing weight on decision thresholds below 5 • C and above 35 • C. This yields the convex function 1 6 (θ − 35) 3 + 1 2 θ 2 , θ ≥ 35 . and the corresponding consistent scoring function S can be computed from Equation (4.7). With this S, MAN maximum temperature forecasts for Sydney outperform those of OCF, and with a p-value of 1.48 × 10 −3 the null hypothesis of equal predictive performance is rejected at the 5% significance level.
If comparing predictive performance with emphasis on the extremes is desired, the mixing measure can be designed to concentrate positive weight on the region of interest. For example, choosing dM (θ) = 1 {θ≥35} dθ emphasizes performance for decision thresholds of at least 35 • C. Taggart (2021) discusses evaluation of extremes and decompositions of consistent scoring functions for quantiles, expectiles and Huber means using precisely this approach.

Conclusion
We have defined the Huber functional so that it gives the set of optimal point forecasts for minimizing the expected generalized Huber loss. The Huber functional is an intermediary between quantiles and expectiles, which it nests as edge cases. The Huber functional incorporates more information about a predictive distribution F than quantiles, yet unlike expectiles it is not sensitive to the behavior of F at its tails. We have shown that the Huber functional is elicitable, given a characterization of its consistent scoring functions and stated the mixture representation for those scoring functions. These theoretical results enable the use of the Huber functional and its associated consistent scoring functions within a theoretically sound framework for point forecasting and evaluation (see Gneiting (2011a), Gneiting and Katzfuss (2014), Ehm et al. (2016) and the references therein). Moreover, the Huber functional is shown to arise naturally within decision theory for a broad class of investment problems, and within this context the mixture representation facilitates some justification for the choice of consistent scoring function when point forecasts targeting the Huber functional are utilized by a heterogeneous user group.
Many organizations, including meteorological agencies, have traditionally issued point forecasts that are not well-defined, and that are consumed by a very broad user group. Where there is appetite to clarify forecast definitions, and where it is desirable that point forecasts target some 'middle' point of the predictive distribution, the Huber mean provides a good candidate functional, as it combines the local sensitivity of the expectation with the global robustness of the median. The classical Huber loss scoring function S(x, y) = h 1/2 a,a (x − y) is a natural choice for a consistent scoring function of the Huber mean, as it favors all user-decision thresholds equally (in the sense discussed in Section 5.4). Nonetheless, if it is desirable that forecast performance at some user-decision thresholds is more important than at others, the mixture representation provides a method for generating the appropriate scoring function.

A Proofs
Proof of Proposition 3.4. We first prove the proposition for the case when I = R.
In light of the essential equivalence of Equations (3.3) and (3.6), define G a,b : R → R by Where then is no confusion, we will drop the subscripts and simply be denote the function by G.
Since the CDF F is nonnegative and nondecreasing, it follows that G is a nondecreasing function on R. Moreover, G is also continuous on R.
First we will show that the set of zeroes of G is nonempty and lies in R(F ), which will establish that H α a,b (F ) is a nonempty subset of R(F ). Since G is continuous and nondecreasing on R, it suffices to show that that G takes at least one positive and one negative value in any neighborhood of R(F ), which will also establish that the zero set is bounded.
Suppose that ε > 0. If R(F ) has finite left-endpoint r 0 then Otherwise, let η = αa/((1 − α)b + αa) and note that 0 < η < 1. So there exists v in R(F ) such that F (u) < η whenever u ≤ v. So Similarly, if R(F ) has a finite right-endpoint r 1 then G(r 1 + ε) > 0. Otherwise, there exists w in R(F ) such that F (u) > η whenever u ≥ w. So This shows that G has at least one zero, and since ε is arbitrary and G is nondecreasing and continuous, all the zeroes are contained in the interval R(F ), and the zero set is a closed bounded interval.
To prove parts (2) and (3) when I = R, we note that the zero set of G is [c, d] where c < d only if there is a constant w ∈ (0, 1) such that The closure of this latter set is precisely [c − b, d + a]. Moreover, if u is any such zero of G, (A.1) implies that 0 = G(u) = (1 − α)bw + αaw − αw.
This shows that q 0 − a < min(H α a (F )) ≤ q 0 + a, from which is obtained lim a↓0 min(H α a (F )) = q 0 . Similarly, one can show that G a,a (q 1 + a) > 0 and G a,a (q 1 − a) ≤ 0, which are used to establish lim a↓0 max(H α a (F )) = q 1 . Part (5) follows from the definition of expectiles and the Huber functional. To prove part (6), letG a,b denote the function of the form (A.1) defined usingF in the place of F , and suppose that F =F on the interval The reverse inclusion is obtained similarly.
Finally, for each part, the case when I ⊂ R can be deduced from the case when I = R by considering the natural extension of F ∈ F(I) to F(R), and using the fact that H α a,b (F ) ⊆ R(F ) ⊆ I.
Proof of Theorem 4.5. To prove part (2), we take a similar approach to the proof presented in (Brehmer, 2017, pp. 38-39) (which follows Gneiting (2011a)) of the consistency theorem for expectiles. Suppose that F ∈ F(I) and that the expectations both exist and are finite, which will guarantee the existence of the expectations that follow. Fix a and b in (0, ∞) and α in (0, 1). For convenience, denote κ a,b by κ. Consider t ∈ H α a,b (F ) and let x ∈ R. Suppose that φ is convex and let S be defined by (4.7). We need to show that Define the function g : and note that g is nonnegative by the convexity of φ, and strictly positive if φ is strictly convex.
Define the function f : and note that f (u, v) ≥ 0 whenever u ≥ v by the convexity of φ, with f (u, v) > 0 whenever u > v if φ is strictly convex.
To show (A.3), we break it up into two main cases (either x < t or t < x) and then into several sub-cases. Consider first the case where x < t with subcase t+a]∩I} and A 7 = {Y ∈ (t + a, ∞) ∩ I}. Note that the sets A i are disjoint and that their union is I. Hence We will calculate each term in the series and sum them together at the end. The calculations are: Now when summing these terms together, note that since t ∈ H α a,b (F ), Equation (3.3) implies that and thus the all terms containing f (x, t) vanish. The remaining terms are all nonnegative by the properties of f and g, which establishes (A.3) in this particular subcase and hence that S is consistent for H α a,b . To prove strict consistency in this subcase, suppose that φ is strictly convex and that equality holds in (A.3). So we must have where K can be written as a sum of nonnegative terms, having applied (A.4). Each of the terms in the final expression is nonnegative, so for equality to hold they must all equal 0. Now the terms involving A 3 , A 4 and A 5 are all strictly positive unless P(Y ∈ A i ) = 0 for i = 3, 4, 5. Similarly, the terms involving A 2 and A 6 are positive unless P(Y ∈ A 2 \{x − b}) = P(Y ∈ A 6 \{t + a}) = 0. Together, this implies that P(Y ∈ (x − b, t + a) ∩ I) = 0, or equivalently that F is constant on (x − b, t + a) ∩ I. Combining this with the fact that t ∈ H α a,b (F ) if and only if (3.6) holds, it is easy to see that x ∈ H α a,b (F ). This establishes strict consistency. For the main case x < t, there are four further subcases: The proof of consistency for each subcase proceeds in the same way as the first subcase, and if proceeding in this order most of the calculations in subcases that have already been proved can be used to prove subsequent subcases. The proof of strict consistency also proceeds similarly for the first case, by showing that F is constant on (x − b, t + a) ∩ I. Details are left to the reader.
The case when t < x is proved the same way, but calculations are quicker by exploiting symmetry and anti-symmetry. For example, the subcase t − b < t < t + a ≤ x − b < x < x + a proceeds by switching the roles of t and x in the definitions of A i , and then making the switches −a ↔ b and A i ↔ A 8−i in the calculations for each term. For example, in the case when t > x we have A 6 = {Y ∈ (t, t + a] ∩ I} and while in the case when x < t, after making switches, we have A 2 = {Y ∈ (t − b, t] ∩ I} and All the terms are nonnegative apart from those involving f (x, t), which will vanish when all the terms are summed together. Details are left to the reader. This completes the proof of part (2).
To prove part (1) for the cases when I is bounded or semi-finite, use the result of part (2) with the bounded (on I) strictly convex function φ(t) = e −t (or φ(t) = e t if I is the of the form (−∞, c)). When I = R, use the same approach with φ(t) = t 2 and note that E F [φ(Y ) − φ(Y − a)] and E F [φ(Y ) − φ(Y + b)] exists and is finite if E F Y exists and is finite.
To prove part (3), we apply Osband's principle with the identification function V of Equation (3.5). An argument similar to (Gneiting, 2011a, p. 753, 759) shows that ∂ 1 S(x, y) = h(x)V (x, y) for x, y ∈ I and some function h : I → I. Integration by parts yields the representation (4.7), where the function φ is defined by for some x 0 in I. Now since S(x, y) ≥ 0 for all x, y ∈ I, it follows from (4.7) that (x − y)φ (x) + φ(y) − φ(x) ≥ 0 whenever −a ≤ x − y ≤ b, which in turn implies that φ is convex on I. If S is strictly consistent, then S(x, y) > 0 for all non-identical x and y in I, whence a similar argument shows that φ is strictly convex.
We will show that Φ(x, y) = 2 The case x − y > b is handled analogously. The case −a ≤ x − y < 0 is essentially the same as the proof of the case x < y for expectiles (Ehm et al., 2016, p. 529), and the case 0 < x − y ≤ b is analogous. The final case x = y is trivial. Finally, note that the increments of M are determined by S and so the mixing measure is unique.
Remark A.1. We show how the mixture representations for the consistent scoring functions of quantiles and expectiles (Theorem 5.2) emerge as limiting cases of Theorem 5.3. Consider the case for expectiles first. For fixed x, y and θ we have S E α,θ (x, y) = lim a→∞ S H α,a,a,θ (x, y) .
Using the notation and limits following the statement of Theorem 4.5, S E,φ α (x, y) = lim a→∞ S H,φ α,a (x, y) = lim a→∞ ∞ −∞ S H α,a,a,θ (x, y) dM (θ) where the interchange of limits and integration in the final equality is justified by the dominated convergence theorem and where dM (θ) = dφ (θ). This recovers the mixture representation for expectiles. Turning now to quantiles, for fixed x, y and θ we have lim a↓0 1 a S H α,a,a,θ (x, y) =      1 − α , y < θ < x α , x ≤ θ < y 0 otherwise, and so lim a↓0 1 a S H α,a,a,θ (x, y) = S Q α,θ (x, y) for almost every θ (differing only when θ = y). Hence, using the notation and limits following Theorem 4.5 and the dominated convergence theorem, where dM (θ) = dφ (θ). This recovers the mixture representation for quantiles.