Empirical AUC for evaluating probabilistic forecasts

Scoring functions are used to evaluate and compare partially probabilistic forecasts. We investigate the use of rank-sum functions such as the empirical Area Under the Curve (AUC), a widely used measure of classification performance, as a scoring function for the prediction of probabilities of a set of binary outcomes. It is shown that the AUC is not generally a proper scoring function, that is, under certain circumstances it is possible to improve on the expected AUC by modifying the quoted probabilities from their true values. However, with some restrictions, or with certain modifications, it can be made proper.


Introduction
Predicting the outcomes of multiple binary variables is a common problem across a variety of application domains, such as fraud detection, credit risk evaluation, medical diagnostics and weather forecasting. Such forecasts typically carry some information describing the uncertainty of the forecaster, such as assigning explicit probabilities or some other numerical value to each variable that allows the variables to be ranked in order of relative probability of occurrence.
This paper investigates numerical measures for evaluating and comparing the accuracy of such forecasts. Although such measures have always been important for comparing algorithms, their role has grown with the popularity of prediction competitions, where it is necessary to precisely quantify the performance of participants. In particular, we use the framework of scoring functions, which map the prediction and subsequent observation to a single real number, the score, representing the reward to the forecaster. The aim of the forecaster is then to maximise this reward.
Scoring functions can be viewed as extensions of scoring rules (section 2.1), which require that the forecast be fully probabilistic, providing a full joint probability distribution over the set of all possible outcomes, which can be infeasible and unnecessary in many situations. Scoring functions (section 2.2), on the other hand, can make use of partial probabilistic information such as marginal distributions, or rankings of expected values. One desirable feature of both scoring rules and scoring functions is that they be proper: that the forecaster always has the incentive to be honest, in that the forecast which maximises their expected score matches their true belief.
The focus of this paper is on a class of scoring functions termed rank-sum functions (section 3), the most well-known of which is the area under the curve (AUC), the curve in question being the receiver operating characteristic (ROC). The ROC and AUC describe the usefulness of the forecast in terms of its ability to discriminate between positive and negative outcomes. Note that this paper specifically focuses on the empirical AUC, and not the theoretical quantity that is perhaps more often studied: this distinction is explained in detail in section 3.1.
The main results (section 3.2) identify sufficient conditions for rank-sum scoring functions to be proper for evaluating the accuracy of forecasts of the marginal probabilities of a sequence of binary outcomes. In general, the AUC is not of this class, and a counter-example is provided which demonstrates a case in which the AUC is not a proper scoring function, in that there exist distributions under which the forecaster might improve their expected score by quoting probabilities different from their true belief.
This framework can be further extended to the case where instead of making a direct prediction, the forecaster is required to provide a mapping that indirectly makes predictions from an as-yet unobserved covariate (section 4). In section 5, we discuss some open questions, and problems with extending the framework to a sequential setting.
2 Scoring of forecasts for binary outcomes

Scoring rules
Consider the setting where one is eliciting forecasts about some future outcome $Y$ that takes values in an outcome space $\mathcal{Y}$. A probabilistic forecast is a distribution $Q$ for $Y$ that describes the forecaster's uncertainty about $Y$. We define $\mathcal{F}$ to be the family of distributions over $\mathcal{Y}$ that are under consideration.
After the actual outcome $Y = y$ is observed, the reward to the forecaster is determined by a scoring rule, a function $S : \mathcal{Y} \times \mathcal{F} \to \mathbb{R}$, that maps the quoted $Q$ and observed outcome $y$ to a real number $S(y, Q)$ termed the score. We take scoring rules to be positively oriented, that is, the score represents the reward to the forecaster, who therefore aims to maximise this quantity. In a decision-theoretic context, the negation of the score can be considered a loss function. Mathematically, the problem can be precisely phrased in the form of a game between a Forecaster and Nature (Dawid et al., 2012).
For any $P \in \mathcal{F}$, we can then define the expected score as $\mathbb{E}_P[S(Y, Q)]$, where $Y$ is generated from $P$. A scoring rule $S$ is proper if an optimal strategy for the forecaster is to quote a distribution that matches their actual uncertainty, that is, if for all $Q, P \in \mathcal{F}$,
$$\mathbb{E}_P[S(Y, Q)] \le \mathbb{E}_P[S(Y, P)]. \tag{1}$$
Additionally, $S$ is termed strictly proper if this is the only optimal strategy, i.e. (1) is an equality only if $Q = P$. Proper scoring rules for discrete variables have been extensively studied (e.g. Dawid et al., 2012); common examples include the Brier, spherical and log scores.
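As an illustration of propriety for a single binary outcome, the following minimal Python sketch evaluates the expected Brier and logarithmic scores over a grid of quoted probabilities and locates the maximiser. The function names, the true probability 0.3 and the grid resolution are assumptions made for this sketch only.

```python
import math

def brier_score(y, q):
    """Positively oriented Brier (quadratic) score for one binary outcome."""
    return -(y - q) ** 2

def log_score(y, q):
    """Positively oriented logarithmic score for one binary outcome."""
    return math.log(q if y == 1 else 1.0 - q)

def expected_score(rule, p, q):
    """E_P[S(Y, Q)] when Y ~ Bernoulli(p) and the forecaster quotes q."""
    return p * rule(1, q) + (1.0 - p) * rule(0, q)

p_true = 0.3  # the forecaster's actual belief (illustrative value)
grid = [k / 100 for k in range(1, 100)]  # quoted probabilities 0.01, ..., 0.99

# For a proper scoring rule the expected score is maximised by quoting p_true.
best_brier = max(grid, key=lambda q: expected_score(brier_score, p_true, q))
best_log = max(grid, key=lambda q: expected_score(log_score, p_true, q))
print(best_brier, best_log)
```

On this grid both maximisers coincide with the true probability, as inequality (1) requires.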
In this paper, we will consider the outcome space to be a vector of binary variables, $\mathcal{Y} = \{0, 1\}^n$. In this case, the distribution $Q$ takes values on $\Delta^{2^n - 1}$, the $(2^n - 1)$-dimensional unit simplex. If the family $\mathcal{F}$ is the set of all such distributions, then for large values of $n$ this can place a large burden in terms of time and resources on constructing and communicating the forecast and evaluating its score. This motivates a more flexible framework.

Scoring functions
Suppose that instead of supplying a distribution $Q$ from a family $\mathcal{F}$, we require the forecaster to quote a forecast from an arbitrary set $\mathcal{Z}$, which we will term the prediction space. Then a scoring function is a mapping of the form $s : \mathcal{Y} \times \mathcal{Z} \to \mathbb{R}$. Gneiting (2011) extensively studied scoring functions in the context of point forecasts, where $\mathcal{Z} = \mathcal{Y}$, though as we shall demonstrate, the concept extends directly to a more general context. The price of this generality is that we now need to explicitly specify the aspects of the forecaster's uncertainty that we want to capture. This can be described by a (statistical) functional, a possibly set-valued function, $T : \mathcal{F} \to \mathcal{Z}$ or $T : \mathcal{F} \to \wp\mathcal{Z}$, where $\wp\mathcal{Z}$ denotes the power set of $\mathcal{Z}$.
A scoring function $s$ is then said to be $T$-proper (Gneiting (2011) uses the term consistent) if for all $P \in \mathcal{F}$ and all $u \in \mathcal{Z}$,
$$\mathbb{E}_P[s(Y, u)] \le \mathbb{E}_P[s(Y, T(P))] \tag{2}$$
for a $\mathcal{Z}$-valued functional $T$, or, for a set-valued functional $T$,
$$\mathbb{E}_P[s(Y, u)] \le \mathbb{E}_P[s(Y, t)] \quad \text{for all } t \in T(P). \tag{3}$$
Furthermore, we can define $s$ to be strictly $T$-proper if equality holds only if $u = T(P)$ or $u \in T(P)$, respectively. Note that the condition in (3) implies that for any proper scoring function $s$ of a set-valued functional, the expected score $\mathbb{E}_P[s(Y, t)]$ must be constant for all $t \in T(P)$. As would be expected from the terminology, there is a strong link between scoring functions and scoring rules, in that a (strictly) proper scoring function defines a (strictly) proper scoring rule (Gneiting, 2011, Theorem 3).
In this paper, we focus on two specific classes of functionals for distributions on $\mathcal{Y} = \{0, 1\}^n$.

Marginal scoring
Definition 1 The marginal functional $\mathcal{M}$ maps a joint distribution to the marginal probabilities of each element of $Y$,
$$\mathcal{M}(P) = \mathbb{E}_P[Y] = \big(P[Y_1 = 1], \ldots, P[Y_n = 1]\big).$$
We can easily construct scoring functions for the marginal functional as functions of scoring rules for the individual elements of $Y$.
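Theorem 1 Let $S_i : \{0, 1\} \times [0, 1] \to \mathbb{R}$ be a proper scoring rule for a single binary outcome, such as the logarithmic or quadratic (Brier) score. Then the scoring function
$$s(y, m) = \sum_{i=1}^n S_i(y_i, m_i)$$
is $\mathcal{M}$-proper.
Proof Each $\mathbb{E}_P[S_i(Y_i, m_i)]$ can be maximised independently by choosing $m_i = P[Y_i = 1]$. ✷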
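The separability used in this proof can be seen numerically. The sketch below, in Python, takes a strongly dependent joint distribution on $\{0,1\}^2$ (a hypothetical choice made for this illustration), scores marginal forecasts with a sum of Brier scores, and searches a coarse grid for the forecast with the highest expected score.

```python
from itertools import product

# Hypothetical joint distribution on {0,1}^2 with strong dependence.
joint = {(1, 1): 0.5, (0, 0): 0.3, (1, 0): 0.2, (0, 1): 0.0}

def brier(y, q):
    """Positively oriented Brier score for one binary outcome."""
    return -(y - q) ** 2

def marginal_score(y, m):
    """Sum of per-coordinate scoring rules, as in Theorem 1."""
    return sum(brier(yi, mi) for yi, mi in zip(y, m))

def expected_score(m):
    """Expected score of the quoted marginals m under the joint distribution."""
    return sum(p * marginal_score(y, m) for y, p in joint.items())

true_marginals = [sum(p * y[i] for y, p in joint.items()) for i in range(2)]
grid = [k / 10 for k in range(11)]  # quoted values 0.0, 0.1, ..., 1.0
best = max(product(grid, repeat=2), key=expected_score)
print(true_marginals, best)
```

The maximiser depends on the joint distribution only through its marginals, so the dependence between the two outcomes plays no role.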

Rank scoring
Recall that a total preorder $\preceq$ is a transitive and reflexive relation such that for any pair $i, j$, at least one of $i \preceq j$ or $j \preceq i$ holds. Given such a $\preceq$, we can define $i \sim j$ as the symmetric relation $i \preceq j$ and $j \preceq i$, and $i \prec j$ as the asymmetric relation $j \not\preceq i$ (which, due to totality, implies $i \preceq j$). Note that $\preceq$ also implies a total ordering of the equivalence classes under $\sim$.
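Define $\Xi_n$ to be the set of total preorders on the set of indices $I = \{1, \ldots, n\}$. Then any vector $v \in \mathbb{R}^n$ induces an element $\preceq_v \in \Xi_n$ by
$$i \preceq_v j \iff v_i \le v_j.$$
Definition 2 The exact rank functional $\mathcal{R} : \mathcal{F} \to \Xi_n$ maps a joint distribution to the total preorder induced by the marginal functional $\mathcal{M}$, that is, $\mathcal{R}(P) = {\preceq_{\mathcal{M}(P)}}$.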
The exact rank functional can also be characterised in terms of pairwise comparisons.
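Proposition 1 Let ${\preceq} = \mathcal{R}(P)$ for some distribution $P$ on $\mathcal{Y}$. Then
$$i \preceq j \iff P[Y_i = 1, Y_j = 0] \le P[Y_i = 0, Y_j = 1].$$
Proof By adding $P[Y_i = 1, Y_j = 1]$ to both sides, we have that $P[Y_i = 1] \le P[Y_j = 1]$. ✷
In the case where all the elements of $\mathcal{M}(P)$ are unique, $\mathcal{R}(P)$ is a total order. We define $\Omega_n \subseteq \Xi_n$ to be the set of all total orders on $I$.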
Note that the exact rank functional requires that ties ($\sim$) be identified exactly. We define a weaker notion under which the ties can be ignored.
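Definition 3 The weak rank functional $\mathcal{R}^* : \mathcal{F} \to \wp\Xi_n$ is the set-valued functional that maps a probability distribution to the set of total preorders contained in the exact rank functional:
$$\mathcal{R}^*(P) = \{ {\preceq'} \in \Xi_n : i \preceq' j \Rightarrow i \preceq j \text{ for all } i, j \in I \}, \quad \text{where } {\preceq} = \mathcal{R}(P).$$
As a result, if all elements of $\mathcal{M}(P)$ are unique, then $\mathcal{R}^*(P) = \{\mathcal{R}(P)\}$, and conversely if all the elements of $\mathcal{M}(P)$ are equal, then $\mathcal{R}^*(P) = \Xi_n$.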
Given an $\mathcal{R}^*$-proper scoring function $s$, we can construct an $\mathcal{M}$-proper scoring function $s'$ via $s'(y, m) = s(y, \preceq_m)$. Of course, such a scoring function can never be strictly $\mathcal{M}$-proper, as $\preceq_m$ is preserved under any strictly increasing transformation of $m$.
An advantage of rank-based scoring functions is that they allow the use of more abstract measures of propensity than probability, and make it possible to compare forecasts generated by a wide variety of algorithms, whose outputs need not have a direct probabilistic interpretation. The downside is that we lose the ability to say anything about the calibration of the forecaster.

Rank-sum scoring functions
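We now consider a particular class of rank-based scoring functions. For any total preorder $\preceq$, we define its rank vector $\rho : \Xi_n \to \mathbb{R}^n$ to be the net number of elements that precede each element,
$$\rho_i(\preceq) = \sum_{j=1}^n \left( \mathbb{1}_{j \preceq i} - \mathbb{1}_{i \preceq j} \right).$$
We will consider the class of rank-sum scoring functions, of the form
$$s(y, \preceq) = g(y) + \sum_{i=1}^n \sigma_i(y)\, \rho_i(\preceq) \tag{4}$$
for some functions $g$ and $\sigma = (\sigma_i)_{i=1,\ldots,n}$.
Example 1 (Wilcoxon-Mann-Whitney u) The most well-known example of such a function is the Wilcoxon-Mann-Whitney $u$, commonly used as a nonparametric test statistic for comparing the magnitude of two random variables.
It is defined as the number of times observations where $y_i = 0$ precede observations where $y_j = 1$, with ties counting as half,
$$u(y, \preceq) = \sum_{i,j} (1 - y_i)\, y_j \left( \mathbb{1}_{i \prec j} + \tfrac{1}{2} \mathbb{1}_{i \sim j} \right). \tag{5}$$
The bracketed term inside the summation is equal to $\tfrac{1}{2}[1 + \mathbb{1}_{i \preceq j} - \mathbb{1}_{j \preceq i}]$, and so
$$u(y, \preceq) = \tfrac{1}{2} n_0(y)\, n_1(y) + \tfrac{1}{2} \sum_{i,j} (1 - y_i)\, y_j \left( \mathbb{1}_{i \preceq j} - \mathbb{1}_{j \preceq i} \right),$$
where $n_1(y) = \sum_{i=1}^n y_i$ and $n_0(y) = n - n_1(y)$. By symmetry, we have that $\sum_{i,j} y_i y_j (\mathbb{1}_{i \preceq j} - \mathbb{1}_{j \preceq i}) = 0$, and hence
$$u(y, \preceq) = \tfrac{1}{2} n_0(y)\, n_1(y) + \tfrac{1}{2} \sum_{j=1}^n y_j\, \rho_j(\preceq),$$
which is of the form (4). For a fixed $y$, $u$ takes values on the half-integers $0, \tfrac{1}{2}, 1, \ldots, n_0(y)\, n_1(y)$.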
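The two expressions for $u$ can be compared directly. The following Python sketch represents a preorder by a vector of scores `v` (with $i \prec j$ when `v[i] < v[j]`), computes $u$ by pairwise counting, and compares it with the rank-sum expression $\tfrac{1}{2} n_0 n_1 + \tfrac{1}{2} \sum_j y_j \rho_j$; the function names and the data are illustrative assumptions.

```python
def rank_vector(v):
    """rho_i: number of elements strictly below i minus number strictly above i,
    for the preorder induced by the scores v."""
    n = len(v)
    return [sum((v[j] < v[i]) - (v[i] < v[j]) for j in range(n)) for i in range(n)]

def wmw_u(y, v):
    """Pairwise definition: (negative, positive) pairs in the right order, ties as 1/2."""
    total = 0.0
    for i, yi in enumerate(y):
        for j, yj in enumerate(y):
            if yi == 0 and yj == 1:
                if v[i] < v[j]:
                    total += 1.0
                elif v[i] == v[j]:
                    total += 0.5
    return total

def wmw_u_rank_sum(y, v):
    """Rank-sum expression: n0*n1/2 + (1/2) * sum_j y_j * rho_j."""
    n1 = sum(y)
    n0 = len(y) - n1
    rho = rank_vector(v)
    return 0.5 * n0 * n1 + 0.5 * sum(yj * rj for yj, rj in zip(y, rho))

y = [0, 1, 1, 0, 1]
v = [0.2, 0.7, 0.2, 0.4, 0.9]  # contains one tie between a negative and a positive
print(rank_vector(v))          # [-3, 2, -3, 0, 4]
print(wmw_u(y, v), wmw_u_rank_sum(y, v))  # 4.5 4.5
```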
Example 2 (Area under the curve) The receiver operating characteristic (ROC) describes the trade-off of sensitivity and specificity (or type I and type II error) of a preorder, and is calculated by plotting the true positive rate against the false positive rate that would be obtained by taking different elements of the preorder as the cutoff.
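It can be described as the parametric curve on $[0, 1] \times [0, 1]$, starting at $(1, 1)$, then linearly connecting the points
$$\left( \frac{\sum_j (1 - y_j)\, \mathbb{1}_{i \prec j}}{n_0(y)},\; \frac{\sum_j y_j\, \mathbb{1}_{i \prec j}}{n_1(y)} \right) \tag{6}$$
for each equivalence class of $i$ under $\sim$, in the order of $\prec$.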
The area under the curve (AUC) is then the total area under this curve, which will take values on $[0, 1]$. It is well established (e.g. Hanley and McNeil, 1982) that this is in fact equal to the Wilcoxon-Mann-Whitney $u$, standardised by dividing by $n_0(y)\, n_1(y)$.
Note that if the outcomes are identical (i.e. $y = \mathbf{0}$ or $y = \mathbf{1}$), then the ROC and AUC are not properly defined. For convenience, we can define the AUC to be 1/2 in both these cases; however, the choice of this constant does not affect any of the results other than Theorem 2.
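As a result, we can write
$$\mathrm{AUC}(y, \preceq) = \tfrac{1}{2} + \sum_{i=1}^n \alpha_i(y)\, \rho_i(\preceq), \qquad \alpha_i(y) = \begin{cases} \dfrac{y_i}{2\, n_0(y)\, n_1(y)} & \text{if } 0 < n_1(y) < n, \\ 0 & \text{otherwise.} \end{cases}$$
Also related is the Gini coefficient, $g(y, \preceq) = 2\, \mathrm{AUC}(y, \preceq) - 1$, which is twice the net area of the ROC above the diagonal, and takes values on $[-1, 1]$.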
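The equality between the area under the empirical ROC and the standardised $u$ can be checked on a small example. The Python sketch below builds the ROC points with one cutoff per equivalence class, integrates by the trapezoidal rule, and compares with the pairwise count. It assumes that both outcome classes are present and that the preorder is given by a score vector `v`; all names are illustrative.

```python
def roc_points(y, v):
    """Points (false positive rate, true positive rate), one per equivalence class."""
    n1 = sum(y)
    n0 = len(y) - n1
    points = [(1.0, 1.0)]  # cutoff below every element: everything flagged positive
    for cutoff in sorted(set(v)):
        fp = sum(1 for yi, vi in zip(y, v) if yi == 0 and vi > cutoff)
        tp = sum(1 for yi, vi in zip(y, v) if yi == 1 and vi > cutoff)
        points.append((fp / n0, tp / n1))
    return points

def trapezoid_area(points):
    """Area under the piecewise-linear curve through the given points."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

def standardised_u(y, v):
    """Pairwise count with ties as 1/2, divided by n0 * n1."""
    neg = [vi for yi, vi in zip(y, v) if yi == 0]
    pos = [vi for yi, vi in zip(y, v) if yi == 1]
    wins = sum(1.0 if a < b else 0.5 if a == b else 0.0 for a in neg for b in pos)
    return wins / (len(neg) * len(pos))

y = [0, 1, 1, 0, 1]
v = [0.2, 0.7, 0.2, 0.4, 0.9]
area = trapezoid_area(roc_points(y, v))
print(area, standardised_u(y, v))
print(2 * area - 1)  # the corresponding Gini coefficient
```

Tied elements produce a sloped segment of the curve, which is where the half-weight for ties in $u$ comes from.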

Relation to theoretical AUC
Although the AUC has been widely explored in the literature, much of this work (e.g. Agarwal et al., 2005; Clémençon et al., 2008; Hand, 2009; Flach et al., 2011) focuses on a related but distinct quantity, which we will term the theoretical AUC.
Let $\theta$ be a joint distribution for a random pair $(X_i, Y_i)$, where $X_i$, taking values in some set $\mathcal{X}_\bullet$, is termed the covariate or feature, and $Y_i$ is a single binary response. For some mapping $f : \mathcal{X}_\bullet \to \mathbb{R}$, we define the conditional distributions of $f(X_i)$ given $Y_i = 0$ and given $Y_i = 1$. Then the theoretical ROC replaces the empirical quantities of (6) with their theoretical equivalents, which again describes a curve over $[0, 1] \times [0, 1]$. Similarly, the theoretical AUC, denoted $\mathrm{tAUC}(\theta, f)$, is the area under this curve.
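The theoretical AUC can be rewritten as the conditional expectation (e.g. Clémençon et al., 2008)
$$\mathrm{tAUC}(\theta, f) = \mathbb{E}\left[ \mathbb{1}_{f(X_1) < f(X_2)} + \tfrac{1}{2} \mathbb{1}_{f(X_1) = f(X_2)} \,\middle|\, Y_1 = 0, Y_2 = 1 \right], \tag{7}$$
where the expectation is with respect to the product measure $\theta \times \theta$ for $(X_1, Y_1)$ and $(X_2, Y_2)$.
The relationship between the empirical and theoretical AUCs is well established, though for completeness we clarify the usual presentation (e.g. Agarwal et al., 2005, Lemma 2).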
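Proof For any vector $y \ne \mathbf{0}, \mathbf{1}$, the expectation of (5) conditional on $Y = y$ gives a sum of $n_0(y)\, n_1(y)$ expressions of the form of (7), and hence
$$\mathbb{E}\left[ \mathrm{AUC}(Y, \preceq_{f(X)}) \mid Y = y \right] = \mathrm{tAUC}(\theta, f),$$
where $\preceq_{f(X)}$ denotes the preorder induced by $(f(X_1), \ldots, f(X_n))$. ✷
We emphasise several key differences between the empirical and theoretical AUC. Firstly, the theoretical AUC is a function of the mapping $f$ from $X_i$ that is used to induce a ranking on $Y_i$ (confusingly, this is itself referred to as a "scoring function" in the literature).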
Another distinction is that the distribution $\theta$ is now a hypothetical sampling model for a single pair $(X_i, Y_i)$, whereas the previous distribution $P$ describes the forecaster's uncertainty for a set $(Y_1, \ldots, Y_n)$. We emphasise that these are distinct concepts: whereas the i.i.d. assumption is typically reasonable in a sampling context, it is extremely unrealistic for describing uncertainty, in that it would imply that there is absolutely no information to be gained about $Y_n$ from the other $Y_1, \ldots, Y_{n-1}$.
Additionally, although the negation of tAUC(θ, f ) can still be interpreted as a loss function in the standard decision-theoretic sense (e.g. for deriving minimax procedures), tAUC(θ, f ) cannot be used as a scoring function as θ is typically never observed directly.
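The link between the two quantities can be verified exactly in a small discrete setting. The Python sketch below assumes a covariate taking values in $\{0, 1, 2\}$, takes $f$ to be the identity, and uses hypothetical conditional distributions of $X_i$ given $Y_i$; it computes the theoretical AUC as the pairwise conditional expectation and compares it with the expected empirical AUC given a fixed outcome vector, under i.i.d. sampling of the pairs.

```python
from fractions import Fraction
from itertools import product

# Hypothetical conditional distributions of a discrete covariate X given Y.
px_given_y = {
    0: {0: Fraction(1, 2), 1: Fraction(3, 10), 2: Fraction(1, 5)},
    1: {0: Fraction(1, 10), 1: Fraction(2, 5), 2: Fraction(1, 2)},
}

def pair_term(a, b):
    """1 if the negative is ranked below the positive, 1/2 for a tie, else 0."""
    return Fraction(1) if a < b else Fraction(1, 2) if a == b else Fraction(0)

def theoretical_auc():
    """E[pair_term(X1, X2) | Y1 = 0, Y2 = 1] for independent pairs."""
    return sum(p0 * p1 * pair_term(x0, x1)
               for x0, p0 in px_given_y[0].items()
               for x1, p1 in px_given_y[1].items())

def empirical_auc(y, v):
    """Standardised Wilcoxon-Mann-Whitney u; assumes both classes are present."""
    neg = [vi for yi, vi in zip(y, v) if yi == 0]
    pos = [vi for yi, vi in zip(y, v) if yi == 1]
    return sum(pair_term(a, b) for a in neg for b in pos) / (len(neg) * len(pos))

def conditional_expected_auc(y):
    """E[AUC(y, preorder induced by X) | Y = y], by enumerating covariate vectors."""
    total = Fraction(0)
    for x in product((0, 1, 2), repeat=len(y)):
        prob = Fraction(1)
        for xi, yi in zip(x, y):
            prob *= px_given_y[yi][xi]
        total += prob * empirical_auc(y, x)
    return total

print(theoretical_auc(), conditional_expected_auc((0, 1, 1, 0)))
```

Exact rational arithmetic is used so that the two values can be compared without rounding error.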

Proper rank-sum scoring functions
To determine the propriety of such scoring functions, we utilise the following key lemma.
Lemma 1 For any fixed vector $v \in \mathbb{R}^n$, the quantity
$$\sum_{i=1}^n v_i\, \rho_i(\preceq) \tag{8}$$
is maximised over ${\preceq} \in \Xi_n$ if and only if $\preceq$ is contained in $\preceq_v$, the preorder induced by $v$.
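Proof Firstly, note that if we were to consider only total orders ${\preceq} \in \Omega_n$, then the statement is a direct result of the rearrangement inequality. For any total preorder ${\preceq} \in \Xi_n$, define $A(\preceq)$ to be the set of total orders contained in $\preceq$, that is, $A(\preceq) = \mathcal{R}^*(\preceq) \cap \Omega_n$. Then for any $i, j$, by symmetry we have that
$$\mathbb{1}_{i \preceq j} - \mathbb{1}_{j \preceq i} = \frac{1}{|A(\preceq)|} \sum_{{\preceq'} \in A(\preceq)} \left( \mathbb{1}_{i \preceq' j} - \mathbb{1}_{j \preceq' i} \right).$$
Therefore $\rho(\preceq)$ is the average of all $\rho(\preceq')$ for ${\preceq'} \in A(\preceq)$. It follows then that (8) is maximised if and only if all such $\preceq'$ are themselves contained in $\preceq_v$, which in turn implies that $\preceq$ itself is contained in $\preceq_v$. ✷
This then leads to our main result.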
Theorem 3 A rank-sum scoring function $s$ of the form in (4) is strictly $\mathcal{R}^*$-proper if and only if $\preceq_P^{\sigma}$, the preorder induced by $\mathbb{E}_P[\sigma_i(Y)]$, is an element of $\mathcal{R}^*(P)$ for all $P \in \mathcal{F}$.
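Proof By the linearity of expectation, we have that
$$\mathbb{E}_P[s(Y, \preceq)] = \mathbb{E}_P[g(Y)] + \sum_{i=1}^n \mathbb{E}_P[\sigma_i(Y)]\, \rho_i(\preceq).$$
By Lemma 1, this can be maximised by any $\preceq$ contained in $\preceq_P^{\sigma}$. These are all elements of $\mathcal{R}^*(P)$ if and only if $\preceq_P^{\sigma}$ itself is in $\mathcal{R}^*(P)$. ✷
Consequently, the Wilcoxon-Mann-Whitney $u$ function is a strictly $\mathcal{R}^*$-proper scoring function; however, the same cannot be said of the AUC.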
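Lemma 1, on which this proof rests, can be checked by brute force for small $n$. In the Python sketch below every total preorder on four indices is represented by an integer score vector `w` (so each preorder appears several times), the weights `v` are an arbitrary illustrative choice containing a tie, and "contained in" is tested pair by pair.

```python
from itertools import product

def rank_vector(w):
    """rho_i for the preorder induced by the scores w."""
    n = len(w)
    return [sum((w[j] < w[i]) - (w[i] < w[j]) for j in range(n)) for i in range(n)]

def objective(v, w):
    """The quantity sum_i v_i * rho_i for the preorder induced by w."""
    return sum(vi * ri for vi, ri in zip(v, rank_vector(w)))

def contained_in(w, v):
    """True if w[i] <= w[j] implies v[i] <= v[j] for every pair (i, j)."""
    n = len(v)
    return all(v[i] <= v[j] for i in range(n) for j in range(n) if w[i] <= w[j])

v = [3, 1, 3, 8]  # integer weights keep the arithmetic exact; indices 0 and 2 tie
n = len(v)
candidates = list(product(range(n), repeat=n))  # covers every total preorder on n items
best = max(objective(v, w) for w in candidates)

maximisers_are_contained = all(
    contained_in(w, v) for w in candidates if objective(v, w) == best)
contained_are_maximisers = all(
    objective(v, w) == best for w in candidates if contained_in(w, v))
print(best, maximisers_are_contained, contained_are_maximisers)
```

Both directions of the "if and only if" hold for this choice of `v`; in particular, breaking the tie between indices 0 and 2 either way leaves the objective unchanged.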
Then defining $\alpha$ as in Example 2, we have that

This rather contrived example illustrates how the problem arises: the denominator of $\alpha$ can alter the relative importance of certain outcomes. Nevertheless, there exist certain families $\mathcal{F}$ under which AUC is indeed proper.
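The effect of the denominator can be reproduced numerically. The Python sketch below uses a two-point distribution on $\{0,1\}^4$ that was chosen for this illustration (it is not the counter-example of the paper) and compares the expected AUC and expected $u$ of the ranking induced by the true marginals with those of a ranking that promotes the first outcome.

```python
from fractions import Fraction

def wmw_u(y, v):
    """Pairwise Wilcoxon-Mann-Whitney count, ties as 1/2."""
    total = Fraction(0)
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] == 0 and y[j] == 1:
                if v[i] < v[j]:
                    total += 1
                elif v[i] == v[j]:
                    total += Fraction(1, 2)
    return total

def auc(y, v):
    """Standardised u, set to 1/2 when all outcomes coincide."""
    n1 = sum(y)
    n0 = len(y) - n1
    if n0 == 0 or n1 == 0:
        return Fraction(1, 2)
    return wmw_u(y, v) / (n0 * n1)

# Illustrative distribution: the number of positives differs between the two outcomes.
dist = {(1, 0, 0, 0): Fraction(9, 20), (0, 1, 1, 0): Fraction(11, 20)}

def expected(score, v):
    return sum(p * score(y, v) for y, p in dist.items())

honest = [sum(p * y[i] for y, p in dist.items()) for i in range(4)]  # true marginals
promoted = [Fraction(3, 5), Fraction(11, 20), Fraction(11, 20), Fraction(0)]

print(expected(auc, honest), expected(auc, promoted))      # 7/10 29/40
print(expected(wmw_u, honest), expected(wmw_u, promoted))  # 53/20 49/20
```

The outcome with a single positive has a smaller product $n_0 n_1$, so each correctly ordered pair is worth more under AUC; this is enough to make the dishonest ranking preferable, whereas $u$ still rewards the honest one.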
Theorem 4 If the number of positive outcomes $n_1(Y)$ is almost surely constant for all $P \in \mathcal{F}$, then AUC is a strictly $\mathcal{R}^*$-proper scoring function.
This justifies the use of AUC as a scoring function in cases where the forecaster is informed of the number of positive outcomes beforehand. This means that the forecaster is able to use this information to rule out extreme tail events that might otherwise have provided a windfall score. For example, in the IJCNN Social Network Challenge by Kaggle (https://www.kaggle.com/c/socialNetwork) competitors were required to estimate 8960 binary outcomes (corresponding to presence/absence of an edge), of which they were informed that exactly half were positive.
Theorem 5 If the $Y_i$'s are mutually independent under all $P \in \mathcal{F}$, then AUC is a strictly $\mathcal{R}^*$-proper scoring function.
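Proof Note that if $y_i \ne y_j$, then $n_1(y) = 1 + n_1^{\neg(i,j)}(y)$, where $n_1^{\neg(i,j)}(y) = \sum_{k \ne i,j} y_k$, and similarly for $n_0$. Then
$$\mathbb{E}_P[\alpha_i(Y) - \alpha_j(Y)] = \mathbb{E}_P\left[ \frac{Y_i - Y_j}{2\,\big(1 + n_0^{\neg(i,j)}(Y)\big)\big(1 + n_1^{\neg(i,j)}(Y)\big)} \right],$$
since if $y_i = y_j$, the numerator is zero. Then by mutual independence,
$$\mathbb{E}_P[\alpha_i(Y) - \alpha_j(Y)] = \mathbb{E}_P[Y_i - Y_j]\; \mathbb{E}_P\left[ \frac{1}{2\,\big(1 + n_0^{\neg(i,j)}(Y)\big)\big(1 + n_1^{\neg(i,j)}(Y)\big)} \right].$$
As the latter expectation is strictly positive, it follows that $\mathbb{E}_P[\alpha_i(Y)] \le \mathbb{E}_P[\alpha_j(Y)]$ if and only if $\mathbb{E}_P[Y_i] \le \mathbb{E}_P[Y_j]$. ✷
As noted in section 3.1, mutual independence is a somewhat unrealistic condition for scoring functions. Nevertheless, it can be useful when combined with the following result.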
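The conclusion of this proof can be checked by exact enumeration for a small number of independent outcomes. The Python sketch below assumes the explicit weights $\alpha_i(y) = y_i / (2\, n_0(y)\, n_1(y))$, set to zero when all outcomes coincide, which is the form implied by standardising $u$; the marginal probabilities are illustrative and include one tie.

```python
from fractions import Fraction
from itertools import product

def alpha(y):
    """Assumed AUC weights: y_i / (2 * n0 * n1), or zero if all outcomes coincide."""
    n1 = sum(y)
    n0 = len(y) - n1
    if n0 == 0 or n1 == 0:
        return [Fraction(0)] * len(y)
    return [Fraction(yi, 2 * n0 * n1) for yi in y]

def expected_alpha(p):
    """E[alpha_i(Y)] when the Y_i are independent Bernoulli(p_i)."""
    n = len(p)
    out = [Fraction(0)] * n
    for y in product((0, 1), repeat=n):
        prob = Fraction(1)
        for yi, pi in zip(y, p):
            prob *= pi if yi else 1 - pi
        out = [o + prob * a for o, a in zip(out, alpha(y))]
    return out

def same_preorder(a, b):
    """True if the vectors a and b induce the same total preorder."""
    n = len(a)
    return all((a[i] <= a[j]) == (b[i] <= b[j]) for i in range(n) for j in range(n))

p = [Fraction(1, 10), Fraction(1, 2), Fraction(1, 2), Fraction(9, 10), Fraction(3, 10)]
print(same_preorder(p, expected_alpha(p)))
```

Under independence the expected weights order the outcomes exactly as the marginal probabilities do, ties included, so Theorem 3 applies.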
Theorem 6 Let $\mathcal{F}$ consist of distributions $P$ such that there is a latent variable $Z$ whereby (i) for almost all $Z$, $\mathbb{E}_P[Y \mid Z]$ induces the same preordering as $\mathbb{E}_P[\alpha(Y) \mid Z]$, and (ii) this preordering is the same for almost all $Z$. Then AUC is a strictly proper scoring function for $\mathcal{R}^*$.
Proof Condition (i) implies that and by condition (ii) then,
This provides a means for showing AUC is proper in more general contexts, by combining it with one of the previous two theorems to satisfy condition (i). For example, if $\theta$ is a parameter in a Bayesian model, conditional on which the outcomes are independent (e.g. a logistic regression model), then AUC is proper for the predictive distributions if (ii) holds. However, these conditions can fail if there is significant uncertainty in the ordering of the outcomes, which may arise in problems such as out-of-sample prediction.
Example 4 Suppose that there are two candidate models, A and B, each weighted with probability 1/2, and the forecaster is to rank 100 outcomes, of which 10 have a particular feature U present. Suppose that the forecast probabilities are and that outcomes are independent within each model. Then the resulting marginal probabilities are However, using the induced ranking will result in an expected AUC of 0.496, whereas the opposite ranking will give an expected AUC of 0.504 (see supplementary material).

Scoring functions for mappings
In many forecasting settings, each variable $Y_i$ has a corresponding covariate or feature $X_i$ taking values in some measurable space $\mathcal{X}_\bullet$, which can be used to inform the prediction. In the case where the forecaster is able to observe the covariates directly, we can assume any relevant information is taken into account, and thus no additional consideration is required. However, we can also consider the setting in which the forecaster does not observe the covariates, but is instead required to provide some sort of mapping from the covariate space $\mathcal{X} = (\mathcal{X}_\bullet)^n$ to the original prediction space $\mathcal{Z}$ for $Y$ (we use the term mapping so as to distinguish from scoring functions). In other words, the forecaster is required to make a prediction in the mapping prediction space $\tilde{\mathcal{Z}} = \{f : \mathcal{X} \to \mathcal{Z}\}$.
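Furthermore, any scoring function $s : \mathcal{Y} \times \mathcal{Z} \to \mathbb{R}$ has a corresponding mapping form $\tilde{s} : (\mathcal{X} \times \mathcal{Y}) \times \tilde{\mathcal{Z}} \to \mathbb{R}$, which is simply $s$ evaluated using the mapping applied to the observed covariates,
$$\tilde{s}\big((x, y), f\big) = s\big(y, f(x)\big).$$
Similarly, given any statistical functional $T : \mathcal{F} \to \mathcal{Z}$, we can define the corresponding mapping functional $\tilde{T} : \mathcal{F}_{XY} \to \tilde{\mathcal{Z}}$ as the mapping
$$\tilde{T}(P_{XY}) : x \mapsto T(P_{Y \mid X = x}),$$
where $P_{Y \mid X = x}$ denotes the conditional distribution of $Y$ given $X = x$ under $P$. That is, the optimal mapping should map each $x \in \mathcal{X}$ to the optimal prediction under the conditional distribution $P_{Y \mid X = x}$.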
Theorem 7 Let $s$ be a $T$-proper scoring function for a family $\mathcal{F}$. Then $\tilde{s}$ is a $\tilde{T}$-proper scoring function for $\mathcal{F}_{XY}$ if for each $P_{XY} \in \mathcal{F}_{XY}$, there exists a family of conditional distributions $\{P_{Y \mid X = x}\}_x$ which is a subset of $\mathcal{F}$.
Proof The expected mapping score is
$$\mathbb{E}\big[\tilde{s}((X, Y), f)\big] = \mathbb{E}\Big[ \mathbb{E}\big[ s(Y, f(X)) \mid X \big] \Big].$$
The inner expectation can be maximised for each value $x \in \mathcal{X}$ by choosing $f(x) = \arg\max_z \mathbb{E}[s(Y, z) \mid X = x]$, which, as $s$ is $T$-proper, will be (an element of) $T(P_{Y \mid X = x})$. ✷
However, we typically do not want to consider all possible mappings $f : \mathcal{X} \to \mathcal{Z}$. Instead, we are typically only interested in mappings that can be applied coordinate-wise,
$$f(x) = \big(f_\bullet(x_1), \ldots, f_\bullet(x_n)\big).$$
In other words, we constrain the mapping such that the forecast for each $Y_i$ depends only on its corresponding covariate $X_i$, and require that this mapping be the same for all $i$. Of course, we also need to constrain the family of distributions to ensure that the marginal mapping is coordinate-wise.
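The mapping form of a scoring function and the coordinate-wise restriction can be sketched in a few lines of Python. The logistic mapping, the covariate values and all function names below are hypothetical choices made for this illustration.

```python
import math

def wmw_u(y, v):
    """Pairwise Wilcoxon-Mann-Whitney count, ties as 1/2."""
    total = 0.0
    for i, yi in enumerate(y):
        for j, yj in enumerate(y):
            if yi == 0 and yj == 1:
                total += 1.0 if v[i] < v[j] else 0.5 if v[i] == v[j] else 0.0
    return total

def coordinatewise(f_single):
    """Lift a mapping on a single covariate to a mapping on the covariate vector."""
    return lambda xs: [f_single(x) for x in xs]

def mapping_score(score, x, y, f):
    """Mapping form: the score of the forecast f(x) obtained from the observed covariates."""
    return score(y, f(x))

# Hypothetical coordinate-wise mapping: the same logistic curve for every outcome.
f = coordinatewise(lambda x: 1.0 / (1.0 + math.exp(-x)))

x = [-1.2, 0.4, 2.0, -0.3]  # covariates, revealed only after f has been submitted
y = [0, 1, 1, 0]
print(mapping_score(wmw_u, x, y, f))  # 4.0
```

Because $u$ depends on the forecast only through the induced preorder, any strictly increasing single-covariate mapping would receive the same score here.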
Theorem 8 Let $\mathcal{F}_{XY}$ be the set of distributions for $(X, Y)$ such that (i) each $Y_i$ is conditionally independent of $X$ given $X_i$, and (ii) the distribution of $Y_i \mid X_i$ is the same for all $i$.
Then for any $\mathcal{M}$-proper scoring function $s$ for a family $\mathcal{F}$, $\tilde{s}$ is an $\tilde{\mathcal{M}}$-proper scoring function for the set of coordinate-wise mappings if the conditional distributions $P_{Y \mid X = x}$ are in $\mathcal{F}$.
Proof By (i) we have that $\mathbb{E}[Y_i \mid X] = \mathbb{E}[Y_i \mid X_i]$, and by (ii) it follows that this quantity is the same function of $X_i$ for all $i$. Therefore the mapping $f(x) = \mathcal{M}(P_{Y \mid X = x})$ is coordinate-wise, which, by Theorem 7, implies that $\tilde{s}$ is $\tilde{\mathcal{M}}$-proper. ✷
Consequently $\tilde{u}$, the mapping form of $u$, is $\tilde{\mathcal{M}}$-proper for any $\mathcal{F}_{XY}$ satisfying (i) and (ii). For AUC to be $\tilde{\mathcal{M}}$-proper, additional conditions are required, such as mutual independence of the elements of $Y$ conditional on $X$.

Discussion
Although we have demonstrated that AUC is not generally a proper scoring function, Examples 3 and 4 both exhibit quite extreme dependence between outcomes. Therefore, it might be possible to establish more relaxed criteria for the propriety of AUC, for example, bounds on correlation or other measures of dependence.
We have also only considered the batch prediction setting where the forecaster is required to provide the preordering for all $Y$ before any outcomes have been observed. One alternative is a sequential framework, where at each point in time the forecaster is required to provide a forecast for $Y_{t+1}$, having already observed $Y_1, \ldots, Y_t$. In the ranking case, this requires the forecaster to provide a total preorder $\preceq_{t+1}$ on $I_{t+1}$ that is compatible with the one $\preceq_t$ provided on $I_t$. Unfortunately, rank-sum scoring functions are essentially useless in this setting.
Example 5 Let $s$ be any rank-sum scoring function of the form in (4), where $\sigma_i(y) = \sigma_j(y)$ if $y_i = y_j$, and $\sigma_i(y) \ge \sigma_j(y)$ if $y_i > y_j$ (both $u$ and the AUC satisfy this property). Then in the sequential setting, it is possible to maintain an optimal score by choosing $\preceq_{t+1}$ such that $i \prec_{t+1} t+1 \prec_{t+1} j$ for all $i, j \le t$ with $Y_i = 0$ and $Y_j = 1$. By a straightforward application of induction, it is easy to see that such a sequence exists, and that it will maintain this "perfect separation", in that all $i$ where $Y_i = 1$ will always be ranked above all $j$ where $Y_j = 0$. Therefore, by Lemma 1, this will result in the largest possible score (i.e. an AUC of 1): note that unlike the previous sections, we refer to the actual score, not just the expected score.
In other words, it is possible to construct an optimal procedure with absolutely no information whatsoever about the process of Y t .This problem will persist in the analogous mapping problem, where the forecaster is free to choose the mapping f t : X • → R at each iteration.
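The sequential strategy of Example 5 is easy to simulate. In the Python sketch below the forecaster places each new outcome midway between the highest-ranked observed negative and the lowest-ranked observed positive before seeing its value; the score representation, the initial bounds 0 and 1, and the outcome sequence are assumptions of this illustration.

```python
from fractions import Fraction

def sequential_scores(outcomes):
    """Assign each new item a score strictly between all past negatives and positives.
    The score for item t+1 is fixed before Y_{t+1} is revealed."""
    lo, hi = Fraction(0), Fraction(1)  # bounds: max negative score, min positive score
    scores = []
    for y in outcomes:
        s = (lo + hi) / 2  # chosen without any knowledge of y
        scores.append(s)
        if y == 1:
            hi = s         # the new positive becomes the lowest positive
        else:
            lo = s         # the new negative becomes the highest negative
    return scores

def auc(y, v):
    """Standardised Wilcoxon-Mann-Whitney u; assumes both classes are present."""
    neg = [vi for yi, vi in zip(y, v) if yi == 0]
    pos = [vi for yi, vi in zip(y, v) if yi == 1]
    wins = sum(Fraction(1) if a < b else Fraction(1, 2) if a == b else Fraction(0)
               for a in neg for b in pos)
    return wins / (len(neg) * len(pos))

outcomes = [1, 0, 0, 1, 1, 0, 1, 0]  # any sequence containing both classes
scores = sequential_scores(outcomes)
print(auc(outcomes, scores))  # 1
```

Exact rationals are used because repeated halving would eventually exhaust floating-point precision on long sequences.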
(i) for almost all Z, EP [Y | Z] induces the same preordering as EP [α(Y ) | Z], and (ii) this preordering is the same for almost all Z, then AUC is a strictly proper scoring function for R * .