Classification with Asymmetric Label Noise: Consistency and Maximal Denoising

In many real-world classification problems, the labels of training examples are randomly corrupted. Previous theoretical work on classification with label noise assumes that the two classes are separable, that the label noise is independent of the true class label, or that the noise proportions for each class are known. In this work we give weaker conditions that ensure identifiability of the true class-conditional distributions, while allowing for the classes to be nonseparable and the noise levels to be asymmetric and unknown. Under these conditions, we also establish the existence of a consistent discrimination rule, with associated estimation strategies. The conditions essentially state that most of the observed labels are correct, and that the true class-conditional distributions are "mutually irreducible," a concept we introduce that limits the similarity of the two distributions. For any label noise problem, there is a unique pair of true class-conditional distributions satisfying the proposed conditions, and we argue that this pair corresponds in a certain sense to maximal denoising of the observed distributions. Both our consistency and maximal denoising results are facilitated by a connection to "mixture proportion estimation," which is the problem of estimating the maximal proportion of one distribution that is present in another. This work is motivated by a problem in nuclear particle classification.


Introduction
In binary classification, one observes multiple realizations of two different classes, whose class-conditional distributions P0 and P1 are probability distributions on a Borel space (X, S). The feature vector X_i^y ∈ X denotes the i-th realization from class y ∈ {0, 1}. The general goal is to construct a classifier from this data.
There are several kinds of noise that can affect a classification problem. A first type of noise occurs when P 0 and P 1 have overlapping supports, meaning that the label is not a deterministic function of the feature vector. In this situation, even an optimal classifier makes mistakes. In this work, we consider a second type of noise, label noise, that can occur in addition to the first type of noise. With label noise, some of the labels of the training examples are corrupted. We focus in particular on random label noise, as opposed to feature-dependent or adversarial label noise.
To model label noise, we represent the training data via the contamination models
P̃0 = (1 − π0)P0 + π0P1,  (1)
P̃1 = (1 − π1)P1 + π1P0.  (2)
According to these mixture representations, each "apparent" class-conditional distribution is in fact a contaminated version of the true class-conditional distribution, where the contamination comes from the other class. Thus, P̃0 governs the training data with apparent class label 0. A proportion 1 − π0 of these examples have 0 as their true label, while the remaining π0 have a true label of 1. Similar remarks apply to P̃1. The noise is asymmetric in that π0 need not equal π1. We emphasize that π0 and π1 are unknown. The distributions P0 and P1 are also unknown, and we do not wish to impose models for them. In particular, the supports of P0 and P1 may overlap, so that the classes are not separable. Previous work on classification with random label noise, reviewed below, has not considered the problem in this generality. Our first contribution is to introduce necessary and sufficient conditions on the elements P0, P1, π0, π1 of the contamination models such that these elements are uniquely determined given P̃0 and P̃1. These conditions are the following:
• (Total noise level) π0 + π1 < 1,
• (Mutual irreducibility) It is not possible to write P0 as a nontrivial mixture of P1 and some other distribution, and vice versa.
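To make the model concrete, the generative process behind the contaminated class-conditional distributions can be simulated directly. A minimal sketch, where the Gaussian class-conditionals and noise levels are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_contaminated(n, pi, sample_true, sample_other):
    """Draw n points from the mixture (1 - pi) * P_true + pi * P_other."""
    flipped = rng.random(n) < pi          # which examples carry a wrong label
    return np.where(flipped, sample_other(n), sample_true(n))

# Illustrative choice: P0 = N(0, 1), P1 = N(2, 1), pi0 = 0.2, pi1 = 0.3.
draw_p0 = lambda n: rng.normal(0.0, 1.0, n)
draw_p1 = lambda n: rng.normal(2.0, 1.0, n)

x0_tilde = sample_contaminated(100_000, 0.2, draw_p0, draw_p1)  # apparent class 0
x1_tilde = sample_contaminated(100_000, 0.3, draw_p1, draw_p0)  # apparent class 1

# The apparent class means are pulled toward each other:
# E under P~0 is 0.8*0 + 0.2*2 = 0.4, and E under P~1 is 0.7*2 + 0.3*0 = 1.4.
```

The simulation illustrates why the problem is hard: only the contaminated samples are observed, and nothing in them directly reveals which points were flipped.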
To shed some light on these conditions, we remark that in the absence of any assumption, the solution (P0, P1, π0, π1) to (1)-(2), when the contaminated distributions P̃0, P̃1 are given (i.e., in the limit of infinite sample sizes, or population version of the problem), is non-unique. For example, were the condition on total label noise not required, then for any solution, swapping the roles of classes 0 and 1 would also yield a solution (with complementary contamination proportions), while leaving the apparent labels unchanged. Furthermore, we describe in detail (at the population level) the geometry of the set of all possible solutions (P0, P1, π0, π1) to (1)-(2). We argue that for any pair P̃0 ≠ P̃1, there always exists a unique solution satisfying the above two conditions. Moreover, this solution uniquely corresponds to the maximum possible total label noise level (π0 + π1) compatible with the observed contaminated distributions, and also to the maximum possible total variation separation ‖P1 − P0‖TV under the condition π0 + π1 < 1. In this sense, P0 and P1 satisfying the second condition are maximally denoised versions of the contaminated distributions.
Our second contribution is to develop a discrimination rule that is universally consistent in the sense that for anyP 0 ,P 1 , it consistently estimates the optimal classification performance as defined with respect to the maximally denoised distributions (which are the underlying uncontaminated distributions under the above conditions). A key aspect of our contribution is that the label noise proportions π 0 and π 1 are unknown, in contrast to previous work, and the linchpin of our solution is a method for accurately estimating π 0 and π 1 . We argue that these proportions can be estimated using methods for mixture proportion estimation (MPE), which is the problem of estimating the mixing proportion of one distribution in another. We review previous work on MPE and also establish a new rate of convergence result for MPE that is employed in our analysis.
As a third contribution, we present experimental results indicating that the proposed methodology is practically viable. In particular, we show that π 0 and π 1 can be accurately estimated using the same principles guiding our theory. To illustrate this point, we examine some standard benchmark data sets as well as a real data set from a nuclear particle classification problem that is naturally described by our label noise model.
Portions of this work appeared earlier in Scott et al. [41] and Scott [40]. This longer version integrates those versions and extends them by establishing the necessity of the proposed conditions and a more complete consistency analysis. Classification with one-sided label noise is closely related to semi-supervised novelty detection (SSND); see Blanchard et al. [7] for a review of this literature. In particular, Blanchard et al. [7] develop theory for "mixture proportion estimation" that we leverage in our analysis.
A basic version of multiple instance learning can be reduced to classification with one-sided label noise [see 36]. In multiple instance learning, the learner is presented with bags of instances. In one basic setting, a bag is labeled negative if it contains only negative instances, and positive if it contains at least one positive instance. If one assumes that the instances in positive bags follow a mixture model P̃1 = (1 − π)P1 + πP0, and that the instances are iid according to P0 or P̃1, the setting is that of one-sided label noise.
As mentioned above, classification with label noise is the basis of co-training [8], which is a framework for classifying instances that are represented by two distinct "views." The original analysis of co-training considers the "realizable" case, where labels are a deterministic function of inputs. Our results allow us to state a result for co-training without making this restrictive assumption. This result is presented in Section 8.
There is also a connection between classification with label noise and class probability estimation. As pointed out in our initial technical report [42], there is a simple way to express mutual irreducibility in terms of the class probability function. From this relationship, and given other developments in this paper, it is straightforward to express π 0 and π 1 in terms of the maximum and minimum values of the contaminated class probability function. This suggests an alternative estimation strategy for the label noise proportions, which has recently been investigated by Liu and Tao [25] and Menon et al. [30]. In Section 9, we elaborate on this approach and connections to these works. We also investigate this approach experimentally in Section 11.
As a final connection with existing literature, we note that an alternative way to view the contamination model (1)-(2) is to interpret it as a source separation problem. In the usual source separation setting, the realizations from the different sources are linearly mixed, whereas in the present model, the source probability distributions are mixed (we do not observe a signal superposition, but a signal coming randomly from one source or the other). As a common point with the source separation setting, it is necessary to postulate additional constraints on the sources in order to resolve non-uniqueness of the possible solutions. In independent component analysis, for instance, sources are assumed to be independent. Our assumption of mutual irreducibility between the sources plays a conceptually comparable role here. Similarly, the assumption on the total noise level resolves the ambiguity that the sources would otherwise only be identifiable up to permutation.

Some initial notation
Let f : X → {0, 1} be a classifier. Denote the (uncontaminated) Type I and Type II errors by
R0(f) := P0({x : f(x) = 1}),  R1(f) := P1({x : f(x) = 0}).
These quantities are what define many classification performance measures of interest, such as the so-called minmax criterion, R(f) = max{R0(f), R1(f)}, or the probability of error,
(1 − ν)R0(f) + νR1(f),
where ν is the a priori probability of class 1.
We also define the corresponding contaminated Type I and II errors,
R̃0(f) := P̃0({x : f(x) = 1}) = (1 − π0)R0(f) + π0(1 − R1(f)),  (3)
R̃1(f) := P̃1({x : f(x) = 0}) = (1 − π1)R1(f) + π1(1 − R0(f)).  (4)
These quantities can easily be estimated from the training data by their basic empirical counterparts.
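For instance, given samples drawn from the apparent class-conditional distributions, the contaminated errors of any fixed classifier are plain empirical frequencies. A small sketch with a hypothetical one-dimensional threshold classifier and stand-in Gaussian samples for the apparent classes:

```python
import numpy as np

def contaminated_errors(f, x0_tilde, x1_tilde):
    """Empirical contaminated Type I / Type II errors of classifier f."""
    r0 = float((f(x0_tilde) == 1).mean())  # apparent class 0 predicted as 1
    r1 = float((f(x1_tilde) == 0).mean())  # apparent class 1 predicted as 0
    return r0, r1

f = lambda x: (x > 1.0).astype(int)        # hypothetical threshold classifier
rng = np.random.default_rng(1)
x0 = rng.normal(0.0, 1.0, 50_000)          # stand-in sample from apparent class 0
x1 = rng.normal(2.0, 1.0, 50_000)          # stand-in sample from apparent class 1
r0, r1 = contaminated_errors(f, x0, x1)
```

With these stand-in distributions both error frequencies concentrate near the tail probability of a unit Gaussian beyond one standard deviation (about 0.159).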

Outline
The remainder of the paper is outlined as follows. Section 2 discusses the challenges posed by label noise for classifier design. Section 3 presents an alternate representation of the contamination models that reduces the problem to that of mixture proportion estimation, which is discussed in Section 4. In Section 5 we introduce our proposed identifiability conditions, establish their sufficiency and necessity, and also discuss maximal denoising. A method for mixture proportion estimation is discussed in Section 6, where a novel rate of convergence result is presented and subsequently applied to develop a consistent discrimination rule in Section 7. In Section 8, we apply our label noise results to generalize an earlier result on co-training. Section 9 makes a connection between our label noise framework and the problem of class probability estimation. Algorithm implementations are described in Section 10, and experimental results are provided in Section 11. Shorter proofs tend to be found in the body of the paper, while longer ones are in an appendix.

The challenge of label noise
Before delving into more technical matters, we first offer an overview of the challenges posed by label noise. We focus on the population setting (n0, n1 = ∞) and compare classifier design based on the contaminated distributions, P̃0 and P̃1, versus the true ones, P0 and P1. To begin, we introduce the following condition on the total amount of label noise:
(A) π0 + π1 < 1.
This condition states, in a certain sense, that a majority of the labels are correct on average. It even allows one of the proportions to be very close to one, provided the other proportion is small enough. This condition was previously adopted by [8].
Let p0 and p1 be densities of P0 and P1, respectively, with respect to a common dominating measure. Then
p̃0 = (1 − π0)p0 + π0p1 and p̃1 = (1 − π1)p1 + π1p0
are respective densities of P̃0 and P̃1.

Proposition 1. Assume (A) holds. For all γ ≥ 0, and every x such that p0(x) > 0 and p̃0(x) > 0,
p1(x)/p0(x) > γ ⟺ p̃1(x)/p̃0(x) > λ, where λ := (π1 + (1 − π1)γ)/((1 − π0) + π0γ).  (5)
The proof involves a sequence of simple algebraic steps to transform one likelihood ratio into the other, and the use of (A) to ensure that the direction of the inequality is preserved.
For most performance measures of interest (probability of error, Neyman-Pearson, etc.), it is well-known that the optimal classifier takes the form of a likelihood ratio test (LRT) based on the true densities [24,21]. According to the proposition, every true LRT is identical to a contaminated LRT with a different threshold. As the threshold of one LRT sweeps over its range, so too does the threshold of the other LRT. Equivalently, both LRTs generate the same receiver operating characteristic (ROC).
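The threshold correspondence can be checked numerically. Writing the contaminated densities as p̃0 = (1 − π0)p0 + π0p1 and p̃1 = (1 − π1)p1 + π1p0, the contaminated likelihood ratio is an increasing function of the true one under (A), with threshold map λ(γ) = (π1 + (1 − π1)γ)/((1 − π0) + π0γ); this closed form is derived here from the mixture densities, and the noise levels below are illustrative:

```python
import numpy as np

def lam(gamma, pi0, pi1):
    # Contaminated-LRT threshold equivalent to true-LRT threshold gamma,
    # from p~1/p~0 = ((1 - pi1)*(p1/p0) + pi1) / ((1 - pi0) + pi0*(p1/p0)).
    return (pi1 + (1.0 - pi1) * gamma) / ((1.0 - pi0) + pi0 * gamma)

pi0, pi1 = 0.2, 0.3                 # illustrative noise levels, pi0 + pi1 < 1
gammas = np.linspace(0.0, 10.0, 1001)
lams = lam(gammas, pi0, pi1)

assert np.all(np.diff(lams) > 0)             # strictly increasing under (A)
assert np.isclose(lam(1.0, pi0, pi1), 1.0)   # gamma = 1 always maps to lambda = 1
```

The monotonicity is what makes the two LRT families sweep out the same ROC; the fixed point at threshold 1 is the special case exploited in the probability-of-error discussion below.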
However, if we design a classifier with respect to the contaminated estimates of performance, we will not obtain a classifier that is optimal with respect to the true performance measure, except in very special circumstances. To make this point concrete, we now consider four specific performance measures.
Probability of error. When the feature vector X and label Y are jointly distributed, the probability of misclassification is minimized by an LRT whose threshold γ is given by the ratio of a priori class probabilities. If γ = 1, then the corresponding threshold for the contaminated LRT is also 1, regardless of π0 and π1, which follows directly from (5). Furthermore, assuming π0, π1 > 0, simple algebra shows that λ = γ only if γ = 1. Thus, treating the contaminated data as if it were clean is suboptimal whenever the a priori class probabilities are unequal.
Neyman-Pearson. As noted above, the true and contaminated LRTs have the same ROC. If a point on this ROC is chosen such that R̃0(f) = α, it will generally not be the case that R0(f) = α. This follows because R̃0(f) = (1 − π0)R0(f) + π0(1 − R1(f)). Simple algebra shows that R0(f) = R̃0(f) iff π0 = 0 or R0(f) + R1(f) = 1. The latter condition is not satisfied by an optimal classifier unless P0 = P1, since it corresponds to random guessing. The former case, π0 = 0, means the negative class has no contamination, and is equivalent (after swapping class labels) to learning from positive and unlabeled examples.

Minmax.
The minmax criterion is defined as R(f) := max{R0(f), R1(f)}, and the minmax classifier is the minimizer of this quantity. The minmax classifier corresponds to the point on the (common) ROC of the true and contaminated LRTs where R0(f) = R1(f). Indeed, if R0(f) ≠ R1(f), then max{R0(f), R1(f)} can be decreased by moving along the ROC so that the larger of R0(f), R1(f) is decreased. Designing a classifier with respect to the contaminated distributions, in contrast, yields the point on the optimal ROC where R̃0(f) = R̃1(f). Using equations (3) and (4), simple algebra reveals that R̃0(f) = R̃1(f) and R0(f) = R1(f) hold simultaneously iff π0 = π1 or R0(f) + R1(f) = 1. The first condition is not satisfied for asymmetric label noise, and the latter condition is not true for an optimal classifier unless P0 = P1.
Balanced Error. Menon et al. [30] show that the balanced error, given by (1/2)(R0(f) + R1(f)), is the only performance measure that is a function of R0(f) and R1(f) such that optimizing the corrupted performance measure is equivalent to optimizing the clean performance measure, regardless of the label noise proportions or prior class probabilities.
In summary, a classifier that is optimal with respect to a contaminated performance measure is not optimal for the uncontaminated performance measure except in special cases. Accurate estimation of the true performance measure is thus a critical issue for classifier design. In the next section, we present a technique for estimating performance using estimates of the label noise proportions.

Alternate contamination model
We introduce an alternate contamination model that will later be used to obtain estimates of the label noise proportions, and consequently estimates of classifier performance.
Lemma 2. Assume (A) holds and P0 ≠ P1. Then
P̃0 = (1 − π̃0)P0 + π̃0P̃1,  (6)
P̃1 = (1 − π̃1)P1 + π̃1P̃0,  (7)
where π̃0 := π0/(1 − π1) ∈ [0, 1) and π̃1 := π1/(1 − π0) ∈ [0, 1).
This lemma motivates estimates of the true Type I and Type II errors. For any classifier f, we may express the contaminated Type I and Type II errors as
R̃0(f) = (1 − π̃0)R0(f) + π̃0(1 − R̃1(f)),  (8)
R̃1(f) = (1 − π̃1)R1(f) + π̃1(1 − R̃0(f)),  (9)
where equations (8) and (9) follow from Lemma 2. By solving for R0(f) and R1(f) in (8) and (9), we find
R0(f) = (R̃0(f) − π̃0(1 − R̃1(f)))/(1 − π̃0),  R1(f) = (R̃1(f) − π̃1(1 − R̃0(f)))/(1 − π̃1).
We can estimate R̃0(f) and R̃1(f) from the training data. Therefore, if we can estimate π̃0 and π̃1, then we can estimate R0(f) and R1(f), and thereby design a classifier. This approach was analyzed in Scott et al. [41]. In Sec. 7 we describe another approach to classifier design based on surrogate loss minimization that also relies on estimates of π̃0 and π̃1. In the next section we describe a framework that is used to estimate π̃0 and π̃1. We conclude this section with a converse to Lemma 2:
Lemma 3. Assume that (6)-(7) hold and P̃1 ≠ P̃0. Then P1 ≠ P0, and there exist unique π0, π1 ∈ [0, 1) (namely π0 = π̃0(1 − π̃1)/(1 − π̃0π̃1) and π1 = π̃1(1 − π̃0)/(1 − π̃0π̃1)) such that (1)-(2) hold; furthermore, (A) is satisfied.
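The two parametrizations are easy to convert between numerically: substituting (2) into (1) gives the forward map π̃0 = π0/(1 − π1), π̃1 = π1/(1 − π0), and Lemma 3 gives the inverse. A quick round-trip check with illustrative values:

```python
def to_alternate(pi0, pi1):
    # Forward map to the alternate model's proportions (assumes pi0 + pi1 < 1).
    return pi0 / (1.0 - pi1), pi1 / (1.0 - pi0)

def from_alternate(pt0, pt1):
    # Inverse map from Lemma 3.
    d = 1.0 - pt0 * pt1
    return pt0 * (1.0 - pt1) / d, pt1 * (1.0 - pt0) / d

pi0, pi1 = 0.2, 0.3                 # illustrative noise levels, pi0 + pi1 < 1
pt0, pt1 = to_alternate(pi0, pi1)   # = (2/7, 3/8)
back0, back1 = from_alternate(pt0, pt1)
assert abs(back0 - pi0) < 1e-12 and abs(back1 - pi1) < 1e-12
```

The round trip confirms that nothing is lost in passing to the decoupled representation, which is what licenses estimating π̃0 and π̃1 separately.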
The alternate representations (6)-(7) are decoupled in the sense that (6) does not involve P1, while (7) does not involve P0. This allows us to estimate π̃0 and π̃1 separately, by reducing to the problem of "mixture proportion estimation" (see next section). It further motivates the mutual irreducibility condition on (P0, P1) that, together with (A), ensures that π̃0, π̃1 are identifiable. The decoupling perspective also allows us to address the following question: given the contaminated distributions P̃0, P̃1, while (P0, P1) are unknown, what are the solutions (π0, π1, P0, P1) satisfying the model (1)-(2)? Obviously, (0, 0, P̃0, P̃1) is a trivial solution; we will argue that mutual irreducibility ensures that the solution is unique and non-trivial, and furthermore that the resulting P0, P1 correspond to maximally denoised versions of P̃0, P̃1. These issues are developed in Section 5. In the next section, we review the work of Blanchard et al. [7].

Irreducibility and mixture proportion estimation
Let F, G, and H be distributions on (X, S) such that
F = (1 − κ)G + κH,
where 0 ≤ κ ≤ 1. Mixture proportion estimation is the following problem: given iid realizations from both F and H, estimate κ. This problem was previously addressed by [7], and here we relate the essential definitions and results from that work.
Without additional assumptions, κ is not an identifiable parameter, as noted by Blanchard et al. [7]. In particular, if F = (1 − κ)G + κH holds, then any alternate decomposition of the form
F = (1 − (κ − δ))G_δ + (κ − δ)H, where G_δ := ((1 − κ)G + δH)/(1 − κ + δ) and δ ∈ [0, κ),
is also valid. Because we have no direct knowledge of G, we cannot decide which representation is the correct one. Therefore, to make κ identifiable, some additional condition must be assumed. The following definition will be useful.

Definition 4.
Let G, H be probability distributions. We say that G is irreducible with respect to H if there exists no decomposition of the form G = γH + (1 − γ)F, where F is some probability distribution and 0 < γ ≤ 1. We say that G and H are mutually irreducible if G is irreducible with respect to H and vice versa.
The following was established by Blanchard et al. [7].
Proposition 5 (Blanchard et al. [7]). If F ≠ H, there is a unique κ* ∈ [0, 1) and a unique G such that the decomposition F = (1 − κ*)G + κ*H holds, and such that G is irreducible with respect to H. If we additionally define κ* = 1 when F = H, then in all cases
κ* = max{α ∈ [0, 1] : there exists a distribution G' with F = (1 − α)G' + αH}.
By this result, the following is well-defined.
Definition 6. For probability distributions F and H, denote by κ*(F|H) the quantity κ* given by Proposition 5.
Clearly, G is irreducible with respect to H if and only if κ*(G|H) = 0. It is also interesting to note that 1 − κ*(F|H) is an example of a statistical distance. That is, 1 − κ*(F|H) is always nonnegative, and is equal to zero if and only if F = H, by Proposition 5. Furthermore, Proposition 8 below states that this distance can be expressed in terms of the likelihood ratio, like the Kullback-Leibler and other information divergences. This statistical distance has been studied previously for discrete distributions in the analysis of Markov chains [2], where it is called the "separation distance." In general, κ*(F|H) ≠ κ*(H|F), so this is not actually a metric on distributions.
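For discrete distributions on a common finite support, κ*(F|H) reduces to the minimum of the mass ratio f/h over the support of H (this is the density-ratio characterization discussed below). A small sketch showing that the induced separation distance 1 − κ*(·|·) is not symmetric:

```python
import numpy as np

def kappa_star(f, h):
    """kappa*(F|H) = min_x f(x)/h(x) for discrete distributions sharing
    a finite support (density-ratio characterization)."""
    f, h = np.asarray(f, float), np.asarray(h, float)
    m = h > 0
    return float(np.min(f[m] / h[m]))

F = np.array([0.6, 0.3, 0.1])
H = np.array([0.2, 0.3, 0.5])

assert kappa_star(F, F) == 1.0                  # distance 1 - kappa* is zero iff equal
assert np.isclose(kappa_star(F, H), 0.1 / 0.5)  # = 0.2
assert np.isclose(kappa_star(H, F), 0.2 / 0.6)  # = 1/3, so not symmetric
```

The asymmetry is exactly why mutual irreducibility must be required in both directions.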
To consolidate the above notions, we state the following corollary, which expresses that irreducibility of G with respect to H is sufficient for the mixture proportion to be identifiable.
Corollary 7. Let F = (1 − κ)G + κH, where G is irreducible with respect to H. Then κ = κ*(F|H).
Some intuition for κ* and irreducibility comes from the following result. Part of the result is in terms of the receiver operating characteristic (ROC) for the problem of testing the null hypothesis X ∼ H against the alternative X ∼ F. Given a measurable set S ∈ S, we can think of S as a rejection region (where the null hypothesis is rejected). Then H(S) is the false positive rate and F(S) is the true positive rate, and the optimal ROC is defined as
ROC(α) := sup{F(S) : S ∈ S, H(S) ≤ α}, α ∈ [0, 1].
The ensuing result follows from Theorem 6 of Blanchard et al. [7].
Proposition 8 (Blanchard et al. [7]). If f and h are densities of F and H, respectively, with respect to a common dominating measure, then
κ*(F|H) = inf_{S ∈ S : H(S) > 0} F(S)/H(S) = lim_{α → 1} (1 − ROC(α))/(1 − α) = ess inf_H f/h.
Proof. The first two identities are established by Blanchard et al. [7]; see also [39]. The proof of the first identity is very similar to the proof of the third identity given below. Intuition for the second identity comes from the first identity and the observation that the optimal ROC is concave. To prove the third identity, let γ* := ess inf_H f/h. Since f and h both integrate to one, γ* ≤ 1, and if γ* = 1 then F = H and the claim is immediate, so assume γ* < 1. Define
g := (f − γ*h)/(1 − γ*),
which clearly integrates to one, and is a.s. nonnegative by definition of γ*. Hence g is the density of a distribution G, and we claim that (i) F = (1 − γ*)G + γ*H and (ii) G is irreducible with respect to H; by Corollary 7, these imply κ*(F|H) = γ*. Point (i) is immediate from the definition of g. To see (ii), suppose that g = δh + (1 − δ)q for some δ ∈ (0, 1] and some density q. Then f = (γ* + (1 − γ*)δ)h + (1 − γ*)(1 − δ)q, so that ess inf_H f/h ≥ γ* + (1 − γ*)δ > γ*, which contradicts the definition of γ*.
An alternate proof of the last statement, based on properties of ROC curves of likelihood ratio tests, is given in an appendix.
The first identity, κ*(F|H) = inf_{S} F(S)/H(S), motivates the universally consistent estimator of κ* due to [7], reviewed below in Section 6. The second identity, which states that κ*(F|H) is the slope of the optimal ROC at its right end-point, motivates a more practical estimator discussed in Section 11. Proposition 8 makes it possible to check irreducibility for certain distributions. For example, κ*(G|H) = 0 whenever the support of G does not contain the support of H. Irreducibility is also possible even if G and H have the same support, as in the case where G and H are Gaussian distributions with different means, and the variance of H is at least the variance of G. This follows easily from the density ratio characterization of κ*. Proposition 8 also makes it easy to check mutual irreducibility for various distributions P0 and P1. Indeed, two continuous distributions are mutually irreducible iff the (essential) infimum and supremum of their density ratio are 0 and ∞, respectively. Figure 1 shows three examples where X = R. In the first example, P0 and P1 are such that the support of one is not contained in the support of the other, and therefore mutual irreducibility is satisfied. In the second example, P0 and P1 are Gaussian distributions with equal variances and unequal means. By plugging in the formulas for the Gaussian densities, it is easy to verify that mutual irreducibility again holds. In the third example, P0 and P1 are again Gaussian distributions with unequal means, but this time with unequal variances. In this case, it is again not hard to show that κ*(P0|P1) = 0, but κ*(P1|P0) > 0, where P1 has the larger variance. Thus, mutual irreducibility does not hold in this case. We do note, however, that κ*(P1|P0) tends to zero very fast as the means move apart.
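These Gaussian examples can be verified numerically by approximating the essential infimum of the density ratio on a wide grid; a sketch (grid limits and parameter choices are illustrative):

```python
import numpy as np

def npdf(x, mu, sigma):
    # Gaussian density, computed directly to keep the sketch self-contained.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kappa_grid(num, den):
    """Grid approximation of kappa*(Num|Den) = ess inf of num/den."""
    return float(np.min(num / den))

x = np.linspace(-20, 20, 200_001)       # wide grid standing in for the ess inf
p0, p1 = npdf(x, 0, 1), npdf(x, 2, 1)   # equal variances, unequal means
q1 = npdf(x, 2, 2)                      # same mean as p1, larger variance

assert kappa_grid(p0, p1) < 1e-6        # mutually irreducible
assert kappa_grid(p1, p0) < 1e-6
assert kappa_grid(p0, q1) < 1e-6        # kappa*(P0|P1) = 0 ...
assert kappa_grid(q1, p0) > 0.2         # ... but kappa*(P1|P0) > 0
```

The last assertion reflects that the wider Gaussian cannot be written without a nontrivial component of the narrower one, matching the third example in Figure 1.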

Mutual irreducibility: Sufficiency, necessity, and maximal denoising
We argue that mutual irreducibility of P 0 and P 1 is both necessary and sufficient for identifiability of the elements (π 0 , π 1 , P 0 , P 1 ) of the contamination models, and relate it to the notion of maximal denoising of the contaminated distributions. Since our focus in this section is identifiability and not estimation, our discussion is at the population level.

Sufficiency of mutual irreducibility for identifiability
Recalling the result of Lemma 2, the distributions P̃0 and P̃1 can be written
P̃0 = (1 − π̃0)P0 + π̃0P̃1,  P̃1 = (1 − π̃1)P1 + π̃1P̃0.
By Corollary 7, we can identify π̃0 and π̃1 provided the following condition holds:
(B) P0 is irreducible with respect to P̃1, and P1 is irreducible with respect to P̃0.
We prefer an irreducibility assumption based on the true class-conditional distributions, and so introduce the following: (C) P 0 and P 1 are mutually irreducible.
Note that it follows from assumption (C) that P0 ≠ P1, which is a hypothesis of Lemma 2. We now establish that (C) and (B) are essentially equivalent.
Lemma 9. P0 is irreducible with respect to P̃1 if and only if P0 is irreducible with respect to P1 and π1 < 1. The same statement holds when exchanging the roles of the two classes. In particular, under assumption (A), (C) is equivalent to (B).
Proof. This will be a proof by contraposition. Assume first that P0 is not irreducible with respect to P̃1. Then there exists a probability distribution Q and 0 < γ ≤ 1 such that P0 = γP̃1 + (1 − γ)Q. Now, plugging in Equation (2) for P̃1 yields
P0 = γ(1 − π1)P1 + γπ1P0 + (1 − γ)Q.
If π1 = 1, there is nothing further to prove. Otherwise γπ1 < 1, and solving for P0 produces
P0 = (γ(1 − π1)/(1 − γπ1))P1 + ((1 − γ)/(1 − γπ1))Q,
a decomposition whose first coefficient lies in (0, 1], so that P0 is not irreducible with respect to P1. Conversely, assume that P0 is not irreducible with respect to P1, i.e., there exists a decomposition P0 = γP1 + (1 − γ)Q with 0 < γ ≤ 1 and Q a probability distribution. If π1 < 1, then solving (2) for P1 gives P1 = (1 − π1)^(−1)(P̃1 − π1P0); substituting this above and solving for P0 yields
P0 = (γ/(1 − π1 + γπ1))P̃1 + ((1 − γ)(1 − π1)/(1 − π1 + γπ1))Q,
so that P0 is not irreducible with respect to P̃1. Finally, in the case π1 = 1, we have P̃1 = P0, in which case, trivially, P0 is not irreducible with respect to P̃1 either.
The following corollary summarizes the discussion of sufficiency.

Corollary 10. If (A) and (C) hold, then π̃0 = κ*(P̃0|P̃1) and π̃1 = κ*(P̃1|P̃0).
Thus, π̃0 and π̃1, and hence by Lemma 3 also π0 and π1, are explicit functions of P̃0 and P̃1 under (A) and (C). It follows that P0 and P1 can then be recovered by solving the identities (6)-(7). In fact, using these identities, it is easy to check that a slightly stronger statement holds: for any arbitrary given P̃0 ≠ P̃1, there is a unique solution (π0, π1, P0, P1) of (1)-(2) satisfying (A) and (C). For short, we call this solution the unique mutually irreducible solution of the problem (condition (A) being tacitly required). The uniqueness and various properties of this particular solution will be explored in more detail in Theorem 12 below; in the next section, we first argue that conditions (A) and (C) are necessary for decontamination in a certain sense.
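This identification can be sanity-checked on discrete distributions: contaminating a mutually irreducible pair with known proportions and then computing κ* of the contaminated pair recovers π̃0 = π0/(1 − π1) and π̃1 = π1/(1 − π0). A sketch with illustrative values:

```python
import numpy as np

P0 = np.array([0.7, 0.3, 0.0])    # mutually irreducible pair: each puts
P1 = np.array([0.0, 0.3, 0.7])    # mass where the other has none
pi0, pi1 = 0.2, 0.3               # illustrative noise levels, pi0 + pi1 < 1

Pt0 = (1 - pi0) * P0 + pi0 * P1   # apparent (contaminated) class-conditionals
Pt1 = (1 - pi1) * P1 + pi1 * P0

def kappa_star(f, h):
    # kappa*(F|H) for discrete distributions: min mass ratio on supp(H)
    m = h > 0
    return float(np.min(f[m] / h[m]))

assert np.isclose(kappa_star(Pt0, Pt1), pi0 / (1 - pi1))   # recovers pi~0
assert np.isclose(kappa_star(Pt1, Pt0), pi1 / (1 - pi0))   # recovers pi~1
```

From the recovered π̃'s, the inversion formulas of Lemma 3 then return the original π0 and π1, completing the decontamination at the population level.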

Necessity
As noted earlier, given P̃0 ≠ P̃1, there are in general many (π0, π1, P0, P1) solving equations (1)-(2), so that decontamination is not well-defined in the absence of additional conditions. Requiring mutual irreducibility of (P0, P1) is one way to ensure uniqueness of the solution, and it also has an interpretation in terms of maximal denoising (see Theorem 12 below). But is it in any way a natural assumption? We now argue that this condition is also the only one ensuring some relatively natural properties of the decontamination operation. We introduce some additional notation: let P denote the set of probability distributions on X, and let P²* denote the set of couples (P, Q) ∈ P² with P ≠ Q. We denote by ψ the contamination operator from [0, 1]² × P² to P², with ψ(π0, π1, P0, P1) = (P̃0, P̃1) given by (1)-(2).
Let φ denote a decontamination operator, i.e., a function from a subset of P² to [0, 1]² × P² such that φ(P̃0, P̃1) returns a solution of (1)-(2); in other words, ψ ∘ φ is the identity on the domain of φ. We further write φ = (φπ, φP), where φπ(P̃0, P̃1) are the solution contamination proportions and φP(P̃0, P̃1) are the solution source distributions. Finally, given a decontamination operator φ, call the image of φP the set of φ-sources: this is the set of probability distribution couples that are considered as the uncontaminated sources by the operator φ in at least one configuration of observed contaminated distributions.
The interpretation of this result is that mutual irreducibility is a necessary condition for decontamination if conditions (i) to (iv) are required. Condition (i) states that the decontamination operation should be defined on the full domain of possible (distinct) observed distributions, and can thus be seen as a universality condition. Condition (ii) is a natural symmetry requirement. Condition (iii) is a continuity assumption (changing the mixing weights by an arbitrarily small amount should not result in a "jump" in the returned estimated contamination proportions), and condition (iv) is a stability condition (a couple (P0, P1) identified as a source should still be output as a source by the decontamination operator under small enough mutual mixing proportions).
Remark. Removing any one of the "natural" requirements (i)-(iv) invalidates the conclusion. For example, restricting decontamination to a certain specific model of sources, say Gaussian distributions, could give rise to a non-mutually irreducible decontamination, coherent within that model but forgoing universality (i). If we remove the continuity requirement (iii), we can find a decontamination operator that is not mutually irreducible and satisfies the other conditions by "tiling" the solution space: for any mutually irreducible (P0, P1) and any (π0, π1) such that π0 + π1 < 1, write (P̃0, P̃1) = ψ(π0, π1, P0, P1) and let ki be such that πi ∈ [ki/n, (ki + 1)/n) (n can be chosen arbitrarily); then define the decontamination φ = (φπ, φP) by returning, on each such tile, the solution whose proportions are the tile corner (k0/n, k1/n). It is easy to check that π0 + π1 < 1 implies φπ(P̃0, P̃1) ∈ [0, 1]² and satisfies (A).
Then the above φ satisfies (i), (ii) and (iv), but it is not the mutually irreducible decontamination. Finally, the stability condition (iv) is needed in order to prevent "trivial" decontaminations such as φ(P̃0, P̃1) = (0, 0, P̃0, P̃1), which is obviously continuous. Excluding the everywhere-trivial decontamination is not enough, as a decontamination could also be trivial on only part of the space.

Maximal denoising
To conclude this section, we present a result that rounds out the discussion of the initial and modified contamination models, and mutual irreducibility.
In particular, we describe all possible solutions (π0, π1, P0, P1) to our model equations (1)-(2) when P̃0, P̃1 are given and arbitrary, and give an equivalent characterization of the unique mutually irreducible solution. It can be seen as an analogue of Proposition 5 for the label noise contamination models.

Denoting π̃*0 := κ*(P̃0|P̃1) and π̃*1 := κ*(P̃1|P̃0), and writing Λ for the set of solutions (π0, π1, P0, P1) to (1)-(2) for the given pair (P̃0, P̃1), the theorem establishes in particular the following.
3. The feasible region R for the proportions (π0, π1) (that is, the projection of Λ onto its first two coordinates, which is also one-to-one) is the closed quadrilateral defined by the intersection of the positive quadrant of R² with the half-planes given by
π0 + π̃*0 π1 ≤ π̃*0 and π1 + π̃*1 π0 ≤ π̃*1.  (14)
4. The mutually irreducible solution (π*0, π*1, P*0, P*1) is also equivalently characterized as:
• the unique maximizer of (π0 + π1) over Λ;
• the unique extremal point of Λ where both of the constraints in (14) are active;
• the unique maximizer over Λ of ‖P0 − P1‖TV, the total variation distance between the source distributions.
[Figure 2: the feasible region of solutions to (1)-(2) when the contaminated distributions (P̃0, P̃1) are observed and the true distributions (P0, P1) are unknown; each feasible (π0, π1) corresponds to a single associated solution.]
The proof of the theorem relies on the explicit one-to-one correspondence established in Lemmas 2 and 3 between the solutions of the original decomposition (1)-(2) and its decoupled reformulation (6)-(7). The result of Proposition 5 is applied to the decoupled formulation, then pulled back, via the correspondence, to the original representation. The last statement concerning the total variation norm is based on the relation
P̃1 − P̃0 = (1 − π0 − π1)(P1 − P0),
obtained by subtracting (1) from (2). Therefore, the maximum feasible value of ‖P1 − P0‖TV corresponds to the maximum of (π0 + π1), i.e., to the unique mutually irreducible solution.
The geometrical interpretation of this theorem is visualized in Figure 2. In particular, point 1 of the theorem shows that conditions (A) and (C) do not restrict the class of possible observable contaminated distributions (P̃0, P̃1); rather, they ensure in all cases the identifiability of the mixture model. Point 4 indicates that the unique solution satisfying the mutual irreducibility condition (C) can be characterized as maximizing the possible total label noise level (π0 + π1) or, equivalently, the total variation separation of the source probabilities P0, P1. In this sense, the mutually irreducible solution can also be interpreted as maximal label denoising or maximal source separation of the observed contaminated distributions.
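The total variation relation underlying the maximal denoising interpretation, P̃1 − P̃0 = (1 − π0 − π1)(P1 − P0), is easy to verify for discrete distributions; a small sketch with illustrative values:

```python
import numpy as np

def tv(p, q):
    # total variation distance between discrete distributions
    return 0.5 * float(np.abs(p - q).sum())

P0 = np.array([0.7, 0.3, 0.0])
P1 = np.array([0.0, 0.3, 0.7])
pi0, pi1 = 0.2, 0.3               # illustrative noise levels, pi0 + pi1 < 1

Pt0 = (1 - pi0) * P0 + pi0 * P1   # contaminated class-conditionals
Pt1 = (1 - pi1) * P1 + pi1 * P0

# Subtracting (1) from (2) gives Pt1 - Pt0 = (1 - pi0 - pi1)(P1 - P0),
# so the separation between the sources shrinks by exactly 1 - pi0 - pi1.
assert np.isclose(tv(Pt1, Pt0), (1 - pi0 - pi1) * tv(P1, P0))
```

For a fixed observed separation tv(Pt1, Pt0), making π0 + π1 as large as possible therefore forces the sources P0, P1 to be as far apart as possible, which is the maximal denoising characterization.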

Mixture proportion estimation and a rate of convergence
Blanchard et al. [7] present a universally consistent estimator κ̂ of κ*(F|H), which we review below. They also establish a "no free lunch" result stating that no estimator of κ*(F|H) can converge at a fixed rate for all F and H. In this section we therefore also introduce distributional assumptions under which the estimator of Blanchard et al. [7] converges at a known rate.
We begin by reviewing the universally consistent estimator of κ*(F|H) introduced by Blanchard et al. [7]. Let F and H be probability measures on a Borel space (X, S). Recall from Proposition 8 that κ*(F|H) = inf{F(S)/H(S) : S ∈ S, H(S) > 0}. The basic idea is to replace F and H by empirical estimates and take the infimum over a union of VC classes. Thus, consider a sequence of VC classes of sets (S_k)_{k≥1}, and samples of size n_i for i = 0, 1. By the VC inequality, for any i = 0, 1, δ_i ∈ (0, 1), k ≥ 1, and any distribution Q on X, with probability at least 1 − δ_i over the draw of an i.i.d. sample of size n_i from Q, we have |Q(S) − Q̂(S)| ≤ ε_i(k, δ_i) uniformly over S ∈ S_k, where Q̂ denotes the empirical distribution built on the sample. In MPE the training data are i.i.d. samples from F and H of respective sizes n_0 and n_1. For k ≥ 1, define κ̂(k, δ_0, δ_1) := inf_{S ∈ S_k} (F̂(S) + ε_0(k, δ_0)) / (Ĥ(S) − ε_1(k, δ_1))_+, where (·)_+ is the max of its argument and zero (the ratio is defined to be ∞ if the denominator is zero), and where F̂(S) and Ĥ(S) are the empirical true positive and false positive probabilities associated with the rejection region S. By the VC inequality and Proposition 8, κ̂(k, δ_0, δ_1) is an upper bound on κ*(F|H) with probability at least 1 − δ_0 − δ_1.
By the union bound, κ̂(δ_0, δ_1) := inf_{k≥1} κ̂(k, δ_0 k^{−2}, δ_1 k^{−2}) is also an upper bound on κ*, with probability at least 1 − 2(δ_0 + δ_1), since Σ_{k≥1} k^{−2} = π^2/6 < 2. To ensure that this upper bound approaches κ* as n_0, n_1 → ∞, the sequence (S_k)_{k=1}^∞ is assumed to satisfy the following universal approximation property, which we refer to as (AP1): for any S* ∈ S and any distribution Q, inf_{S ∈ S_k} Q(S Δ S*) → 0 as k → ∞. Finally, κ̂ is defined as κ̂ := κ̂(1/n_0, 1/n_1). Blanchard et al. [7] show the following result, which makes no assumption on the distributions F and H and thus establishes a universally consistent method for MPE.
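As a concrete illustration, the estimator above can be sketched in a few lines for the special case X = R with threshold sets S_t = {x : x > t} (a single VC class of the kind permitted under (AP2) below). The function name, the specific deviation term ε, and the fixed confidence level are illustrative assumptions, not the exact construction of [7]:

```python
import math

def kappa_hat(f_sample, h_sample, thresholds, delta=0.05):
    """Conservative sketch of an estimate of kappa*(F|H) = inf_S F(S)/H(S),
    using threshold sets S_t = {x : x > t} and an illustrative VC-type
    deviation term eps = sqrt(log(2/delta)/n) for each empirical probability."""
    n_f, n_h = len(f_sample), len(h_sample)
    eps_f = math.sqrt(math.log(2 / delta) / n_f)
    eps_h = math.sqrt(math.log(2 / delta) / n_h)
    best = float("inf")
    for t in thresholds:
        f_s = sum(x > t for x in f_sample) / n_f   # empirical F(S_t)
        h_s = sum(x > t for x in h_sample) / n_h   # empirical H(S_t)
        denom = h_s - eps_h
        if denom <= 0:
            continue  # ratio defined to be +infinity; this set cannot help
        best = min(best, (f_s + eps_f) / denom)
    return min(best, 1.0)  # kappa* always lies in [0, 1]
```

Enlarging the numerator and shrinking the denominator makes each candidate ratio an upper bound on F(S)/H(S) with high probability, so the infimum remains an upper bound on κ*, mirroring the conservativeness of κ̂(k, δ_0, δ_1).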
Theorem 13 (Blanchard et al. [7]). κ̂ → κ*(F|H) in probability as n_0, n_1 → ∞. It should be noted that the statement of the consistency result of Blanchard et al. [7] contains a slight error; we present a corrected statement in an appendix (see also Scott [39]). The error and its correction do not affect the present work.
We now introduce an assumption on F and H that ensures a certain rate of convergence for the estimator κ̂ above; this rate will be used in the next section to establish consistency of a discrimination rule. The support of a distribution Q, denoted supp(Q), is the smallest closed set whose complement has Q-measure zero. The assumption supp(H) ⊄ supp(G) clearly implies that G is irreducible with respect to H, and therefore the γ in (D) is equal to κ*(F|H).
In addition, we adopt a modified approximation condition on the sequence (S_k), referred to as (AP2): for all G, H with supp(H) ⊄ supp(G), there exist k ≥ 1 and S ∈ S_k such that G(S) = 0 and H(S) > 0.
Remark. (AP1) requires that the sets in S_k become increasingly complex, so that V_k → ∞; (AP2) does not. For example, if X = R^d and S is the Borel σ-algebra generated by the standard topology on R^d, then (AP2) is satisfied by taking S_1 to be the VC class of all open balls {x : ‖x − c‖ < r}, c ∈ R^d, r > 0, and S_k = ∅ for k ≥ 2. In this case, we could even simplify the estimator of κ* to κ̂ := κ̂(1, 1/n_0, 1/n_1), and the rate of convergence presented below would still hold (the proof requires only minor modifications). However, we elect to work with the definition of κ̂ above to emphasize that the rate of convergence applies to the universally consistent estimator.
In the next section, we assume that κ̂ is defined in terms of VC classes satisfying (AP2).

Consistent classification with unknown label noise proportions
The consistent estimator of κ* just discussed provides a clear path to the design of a consistent discrimination rule when the label noise proportions are unknown. The estimator of κ*, together with Corollary 10, yields consistent estimators of π̃_0 and π̃_1 under assumptions (A) and (C). Plugging these estimators, along with empirical estimates of R̃_0 and R̃_1, into Eqns. (10) and (11) yields estimates of R_0 and R_1 that can be shown to converge, uniformly over a VC class of classifiers, to their true values. By allowing the size of the VC class to grow with the sample size(s), empirical risk minimization can be shown to be a consistent discrimination rule with respect to any performance measure defined in terms of R_0 and R_1. This argument uses standard ideas from learning theory and is illustrated for the minmax criterion in Scott et al. [41].
One drawback of empirical risk minimization over VC classes is that it is computationally intractable for most VC classes of interest. In the remainder of this section we establish a computationally tractable consistent discrimination rule based on surrogate risk minimization.

Problem formulation
Let (X, Y) be a random pair on X × {0, 1}, where X is a Borel space, and let P denote the probability measure governing (X, Y). Let M denote the set of decision functions, i.e., the set of measurable functions X → R. Every f ∈ M induces a classifier x ↦ u(f(x)), where u(t) := 1_{t > 0} is the unit step function. For any f ∈ M, define the cost-insensitive P-risk of f, R_P(f) := P(u(f(X)) ≠ Y), and the cost-insensitive Bayes P-risk R*_P := inf_{f ∈ M} R_P(f). It is well known [15] that for any f ∈ M, the excess P-risk satisfies R_P(f) − R*_P = E[ 1_{u(f(X)) ≠ 1_{η(X) > 1/2}} |2η(X) − 1| ], (20) where η(x) := P(Y = 1 | X = x).
Generalizing the above, for any α ∈ (0, 1) we define the α-cost-sensitive P-risk of any f ∈ M, R_{P,α}(f) := E[ (1 − α) 1_{Y=1} 1_{u(f(X))=0} + α 1_{Y=0} 1_{u(f(X))=1} ]. The corresponding Bayes risk is R*_{P,α} := inf_{f ∈ M} R_{P,α}(f), and the analogue of (20) is [38]: R_{P,α}(f) − R*_{P,α} = E[ 1_{u(f(X)) ≠ 1_{η(X) > α}} |η(X) − α| ]. (21) Observe that (20) corresponds (up to a factor of 2) to the case α = 1/2. With this background, we turn to the problem of classification with label noise. We assume (X, Y, Ỹ) are jointly distributed, where Y is the true but unobserved label, and Ỹ is the observed but noisy label. As in the rest of the paper, we focus on label noise that is independent of the feature vector X, meaning that the conditional distribution of Ỹ given X and Y depends only on Y.
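For an operational view, the α-cost-sensitive risk has a direct empirical analogue: missed detections of class 1 are weighted by (1 − α) and false alarms on class 0 by α. The helper below is a hypothetical sketch, not code from the paper:

```python
def empirical_cost_sensitive_risk(labels, scores, alpha):
    """Empirical analogue of the alpha-cost-sensitive P-risk.
    labels: true labels in {0, 1}; scores: decision function values f(x).
    Illustrative helper: names and conventions are assumptions."""
    u = lambda t: 1 if t > 0 else 0          # unit step classifier
    loss = 0.0
    for y, s in zip(labels, scores):
        pred = u(s)
        if y == 1 and pred == 0:
            loss += (1 - alpha)               # missed class 1 example
        elif y == 0 and pred == 1:
            loss += alpha                     # false alarm on class 0
    return loss / len(labels)
```

At α = 1/2 this returns exactly half the empirical 0-1 risk, matching the factor-of-2 relation with (20).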
We would like to minimize R_P(f), but we only have access to data from P̃, the joint distribution of (X, Ỹ). Natarajan et al. [32] show that minimizing a cost-sensitive P̃-risk is equivalent to minimizing the cost-insensitive P-risk. We state and prove an equivalent result with a simpler proof. In this setting, π̃_i = Pr(Y = 1 − i | Ỹ = i), i = 0, 1. We introduce the following assumption on the amount of label noise, which slightly strengthens (A).

G. Blanchard et al.
The result now follows from (20) and (21). The problem we will address is the construction of a discrimination rule f̂_n that is computationally tractable, does not know α, π_0, or π_1, and is such that R_P(f̂_n) − R*_P → 0 in probability. To achieve this, we develop an algorithm f̂_n based on surrogate risk minimization such that R_{P̃,α}(f̂_n) − R*_{P̃,α} → 0 in probability.

Surrogate losses
A loss is any measurable function L : {0, 1} × R → [0, ∞). For example, the P-risk is defined in terms of the 0-1 loss, L(y, t) = 1_{y ≠ u(t)}. Given a loss L, we define the risk R_{P,L}(f) := E[L(Y, f(X))] and the corresponding optimal risk R*_{P,L} := inf_{f ∈ M} R_{P,L}(f). A surrogate loss is one used as a surrogate for another, such as a loss L that is convex in its second argument in lieu of the 0-1 loss. Surrogate losses are common in machine learning because they can often be optimized efficiently, unlike the 0-1 loss and its cost-sensitive variants. The notion of classification calibration was developed to theoretically justify the use of surrogate losses. A loss L is said to be α-classification calibrated (α-CC) iff there exists an increasing and continuous function θ with θ(0) = 0 such that for all f ∈ M, R_{P,α}(f) − R*_{P,α} ≤ θ(R_{P,L}(f) − R*_{P,L}). An equivalent, more technical characterization of α-CC is provided by [38], but the above definition suffices for our purposes. The point is that driving the surrogate excess risk to zero drives the target excess risk to zero for α-CC losses, and the former can be accomplished by computationally tractable methods such as support vector machines, as shown below.

Corollary 16. Suppose L is
Natarajan et al. [32] consider the setting where π 0 and π 1 are known. Using the above result, they apply Rademacher complexity analysis to establish performance guarantees for a classification strategy based on regularized empirical risk minimization with a surrogate loss L α .

Estimating α
When π_0 and π_1 are unknown, a natural strategy is to base a learning algorithm on a surrogate loss L_α̂, where α̂ is an estimate of α. We propose an estimate α̂ defined in terms of estimates π̂_0 and π̂_1 based on our previously developed results. In particular, suppose we observe noisy training data drawn from the contaminated distributions P̃_0 and P̃_1, with sample sizes n_0 and n_1. One difference to note going forward is that the sample sizes n_0 and n_1 are now random, whereas before they were considered nonrandom; this turns out to be a minor difference (see the proof of Proposition 17 below). Now, let π̃̂_0 and π̃̂_1 be estimates of π̃_0 and π̃_1 obtained by applying the estimator κ̂ of Section 6 twice. The formulas from Lemma 3 lead to the estimates π̂_0 = π̃̂_0(1 − π̃̂_1)/(1 − π̃̂_0 π̃̂_1) and π̂_1 = π̃̂_1(1 − π̃̂_0)/(1 − π̃̂_0 π̃̂_1).
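The conversion from the estimated proportions (π̃_0, π̃_1) to (π_0, π_1) is a two-line computation. A sketch, with illustrative names and an assertion enforcing assumption (A):

```python
def noise_proportions(pi0_tilde, pi1_tilde):
    """Convert estimates of (pi~_0, pi~_1) into estimates of (pi_0, pi_1)
    via the formulas derived from Lemma 3.  Illustrative sketch; requires
    pi~_0 + pi~_1 < 1, i.e., assumption (A)."""
    assert pi0_tilde + pi1_tilde < 1, "total label noise must be < 1"
    denom = 1 - pi0_tilde * pi1_tilde
    pi0 = pi0_tilde * (1 - pi1_tilde) / denom
    pi1 = pi1_tilde * (1 - pi0_tilde) / denom
    return pi0, pi1
```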
This assumption is reasonable in many classification problems. It essentially says that for each of the two (noise-free) classes, there exist patterns belonging to that class that could not possibly be confused with patterns from the other class. We have the following.

Algorithm
We now introduce a consistent classification procedure based on surrogate losses in the case of unknown label noise proportions. The algorithm relies on the framework of reproducing kernel Hilbert spaces (RKHSs). Thus, let H be an RKHS, and let L be a loss for binary classification. We say that L is Lipschitz if L(y, t) is a Lipschitz function of t for each y. The algorithm returns the classifier induced by f̂_n := arg min_{f ∈ H} (1/n) Σ_{i=1}^n L_α̂(Ỹ_i, f(X_i)) + λ_n ‖f‖²_H, (25) where L_α̂ is the α̂-weighted cost-sensitive loss associated with L, as defined in (23). For example, if L(y, t) = max{0, 1 − (2y − 1)t} is the hinge loss, then f̂_n is a cost-sensitive support vector machine.
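To make this kind of estimator concrete, here is a minimal sketch of cost-sensitive regularized hinge-loss minimization by subgradient descent, using a one-dimensional linear decision function f(x) = wx + b in place of a full RKHS. All names and hyperparameters are illustrative, not the paper's algorithm:

```python
def cost_sensitive_svm(xs, ys, alpha, lam=0.1, lr=0.05, epochs=200):
    """Sketch of regularized empirical L_alpha-risk minimization with
    the hinge loss: (1/n) sum c_y * hinge + lam * w^2, minimized by
    subgradient descent.  Labels ys are in {0, 1}."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw, gb = 2 * lam * w, 0.0                  # gradient of lam * w^2
        for x, y in zip(xs, ys):
            margin = (2 * y - 1) * (w * x + b)
            if margin < 1:                          # hinge loss is active
                c = (1 - alpha) if y == 1 else alpha  # cost weight
                gw -= c * (2 * y - 1) * x / n
                gb -= c * (2 * y - 1) / n
        w -= lr * gw
        b -= lr * gb
    return w, b
```

With α = 1/2 this reduces to an ordinary (symmetric) soft-margin SVM objective in one dimension; varying α tilts the decision boundary toward protecting one class, which is the role α̂ plays above.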

First consistency result
We will assume that the reproducing kernel k associated with H is universal and bounded [43]. The former property implies that elements of the RKHS can approximate the Bayes risk arbitrarily well; the latter states that sup_x √k(x, x) =: B < ∞. The Gaussian kernel is an example satisfying both of these properties.

Theorem 18. Assume (A') and (C') hold, that the reproducing kernel associated with H is universal and bounded, and that L is a Lipschitz, 1/2-CC loss. Let λ_n > 0 tend to zero as n → ∞ in such a way that λ_n √(n / log n) → ∞. Then R_P(f̂_n) − R*_P → 0 in probability as n → ∞.

Alternate consistency result with clippable losses
It is possible to establish a consistency theorem without requiring a rate of convergence for α̂ (and thus only requiring the milder condition (C) rather than (C')), at the expense of treating a narrower class of losses.
A loss L is T-clippable (see [43], Section 2.2) if L(y, Clip_T(t)) ≤ L(y, t) for all y and t, where Clip_T(t) := min(T, max(−T, t)). It is shown in [43], Lemma 2.23, that a convex loss is T-clippable iff, for each y ∈ {0, 1}, the function t ↦ L(y, t) attains a minimum at some t ∈ [−T, T]. As a consequence, many common surrogate losses are clippable; for instance, the hinge loss, the squared loss, and the truncated squared loss are 1-clippable. On the other hand, the logistic and exponential losses are not clippable.
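The clipping operation and the clippability property are easy to check numerically for the hinge loss; the helpers below are illustrative sketches:

```python
def clip(t, T=1.0):
    """Clip_T(t) = min(T, max(-T, t))."""
    return min(T, max(-T, t))

def hinge(y, t):
    """Hinge loss with labels y in {0, 1}: max(0, 1 - (2y - 1) t)."""
    return max(0.0, 1.0 - (2 * y - 1) * t)
```

Since the hinge loss attains its minimum at t = ±1, it is 1-clippable: hinge(y, clip(t)) ≤ hinge(y, t) for every y and t.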

Theorem 19. Assume (A') and (C) hold, that the reproducing kernel associated with H is universal and bounded, and that L is a Lipschitz, T-clippable, 1/2-CC loss. Let λ_n > 0 tend to zero as n → ∞ in such a way that λ_n n → ∞. Define f̆_n := Clip_T(f̂_n), where f̂_n is defined by (25). Then R_P(f̆_n) − R*_P → 0 in probability as n → ∞.

A more general analysis of co-training
Co-training is a model for binary classification in which the feature vector can be partitioned into two sets of variables, called "views." The critical assumption of co-training is that the views are conditionally independent, given the class label. We refer to this assumption as the co-training assumption. Blum and Mitchell [8] show that under this assumption, the optimal classifier can be learned from unlabeled data only, provided the learner has access to a "weakly-useful predictor," which is a classifier that, roughly speaking, is at least slightly better than random guessing. The basic idea is to apply the weakly-useful predictor to one of the views to generate noisy labels for the other view. By the co-training assumption, the problem now reduces to classification with label noise. The original analysis assumes that the true label is a deterministic function of either view. Our framework allows us to relax this assumption.
To state our result, we assume that the feature vector X and label Y are jointly distributed with joint distribution Q, and let P_0 and P_1 be the class-conditional distributions of Q. Furthermore, let X be expressed as (X_A, X_B), representing the two views; under the co-training assumption, X_A and X_B are conditionally independent given Y. The unlabeled training data are X_1, . . . , X_n. A weakly-useful classifier is a classifier h such that 0 < Q({x : h(x) = 1}) < 1 and q_0(h) + q_1(h) < 1, where q_i(h) := Q(h(X) = 1 − i | Y = i), i = 0, 1.
Theorem 20. Let h_A be a known weakly-useful classifier based on view A. Assume that the class-conditional distributions of X_B are mutually irreducible, and let X_1, . . . , X_n be i.i.d. Under the co-training assumption, there exists a consistent classification rule based on view B.
Proof. Define the noisy label Ỹ := h_A(X_A). By the co-training assumption, the conditional distribution of Ỹ given X_B and the true label Y does not depend on X_B. Therefore we have the setting of a label noise problem.
As n → ∞, the numbers of examples n_0 and n_1 with each noisy label grow with n. Furthermore, since h_A is weakly-useful, the contamination probabilities satisfy π̃_0 + π̃_1 < 1.
We also have mutual irreducibility for this label noise problem, by assumption. Therefore, a consistent classification rule exists by the construction in Scott et al. [41].
The key point is that this result weakens the assumption of deterministic class labels to a mutual irreducibility assumption. The existence of a weakly-useful classifier could be guaranteed, for example, if a small amount of labeled training data were available.
The previous argument relies on the consistent classification rule from [41]. The consistency result for classifiers based on clippable surrogate losses, from earlier in this paper, could also be employed, provided the definition of a weakly-useful classifier is strengthened to require q_i(h) < 1/2 for each i.
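The reduction at the heart of this argument is mechanically simple: the weakly-useful classifier on view A manufactures noisy labels for view B, and the resulting pairs form a label noise problem of the kind studied above. A sketch, with hypothetical names:

```python
def cotrain_noisy_labels(data, h_A):
    """Sketch of the co-training reduction: apply a weakly-useful
    classifier h_A to view A of each unlabeled example (x_A, x_B) to
    produce a noisy label for view B.  Names are illustrative."""
    return [(x_B, h_A(x_A)) for (x_A, x_B) in data]
```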

Mutual irreducibility and class probability estimation
In this section, we relate mutual irreducibility of P_0 and P_1 to the problem of class probability estimation. Let p_0 and p_1 be densities of P_0 and P_1 with respect to a common dominating measure. Further assume that the feature vector X and label Y are jointly distributed with joint distribution P, and that q := P(Y = 1) ∈ (0, 1). The posterior probability that Y = 1 is denoted η(x) := P(Y = 1 | X = x). The problem of estimating η from data is known as class probability estimation [12, 35]. The most well-known approach to class probability estimation is logistic regression, which posits the model η(x) = 1/(1 + exp(−(⟨w, x⟩ + b))), where w and x have the same dimension and b ∈ R. The parameters w and b are fit to the data by maximum likelihood. More generally, estimates of η commonly have the form η̂(x) = ψ(f̂(x)) for a link function ψ and decision function f̂. The following result connects the posterior class probability to mutual irreducibility.
Proposition 21. With the notation defined above, κ*(P_1|P_0) = ((1 − q)/q) · η_min/(1 − η_min) (26) and κ*(P_0|P_1) = (q/(1 − q)) · (1 − η_max)/η_max, (27) where η_min := ess inf η(X) and η_max := ess sup η(X). Therefore, P_0 and P_1 are mutually irreducible if and only if η_min = 0 and η_max = 1.
Proof. By Bayes' rule, almost everywhere p_1(x)/p_0(x) = ((1 − q)/q) · η(x)/(1 − η(x)). Equation (26) now follows from Proposition 8. Similarly, almost everywhere p_0(x)/p_1(x) = (q/(1 − q)) · (1 − η(x))/η(x).

Now (27) follows from Proposition 8. The final statement follows from (26), (27), and the definition of mutual irreducibility.
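The Bayes-rule identity p_1(x)/p_0(x) = ((1 − q)/q) · η(x)/(1 − η(x)) underlying the proof can be verified numerically; the densities below are illustrative choices, not from the paper:

```python
def eta(x, q, p0, p1):
    """Posterior P(Y = 1 | X = x) for class prior q and densities p0, p1."""
    return q * p1(x) / (q * p1(x) + (1 - q) * p0(x))

# Illustrative densities on (0, 1]: p0 uniform, p1 linear.
p0 = lambda x: 1.0
p1 = lambda x: 2.0 * x
```

Since t ↦ t/(1 − t) is increasing, taking essential infima/suprema on both sides of the identity yields (26) and (27).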
Thus, estimates of κ*(P_0|P_1) and κ*(P_1|P_0) could be used to inform choices about the design of the link function (e.g., its domain) and the model class of decision functions. Proposition 21 also suggests another possible approach to mixture proportion estimation. Suppose η̂ is an estimator of η that is consistent with respect to the supremum norm, and let q̂ be the empirical estimate of q based on a random sample from P. Inverting Equation (26), ((1 − q̂)/q̂) · η̂_min/(1 − η̂_min), with η̂_min the minimum of η̂ over the sample, is a consistent estimate of κ*(P_1|P_0); similar remarks apply to κ*(P_0|P_1). Although this suggests that class probability estimation solves mixture proportion estimation in the binary classification context, we note that sup-norm consistency requires distributional assumptions, and therefore the distribution-free estimator of Blanchard et al. [7] is a more general solution. All of the above observations were present in our original technical report on this topic [42]. Since then, Liu and Tao [25] and Menon et al. [30] have further explored the idea of estimating label noise proportions from the minimum and maximum of the contaminated class probability function. In particular, we note the following.

Corollary 22. Consider the setting of the previous paragraph. If (A) and (C) hold, then
Proof. By Proposition 21 applied to the contaminated distributions, κ*(P̃_1|P̃_0) and κ*(P̃_0|P̃_1) can be expressed in terms of η̃_min, η̃_max, and q̃. The result now follows from these expressions, Corollary 10, and algebra. Note that q̃ is easily estimated by the fraction of contaminated training examples with Ỹ = 1. Therefore, estimates of η̃_max and η̃_min lead directly to estimates of the contamination proportions π_0 and π_1. This approach to estimating the label noise proportions is explored experimentally below, where it is compared with the ROC-based estimator.
Menon et al. [30] adopt the conditions η_max = 1 and η_min = 0, together with (A), as their identifiability conditions for label noise under the contamination model; from the above discussion, these conditions are clearly equivalent to ours. Liu and Tao [25] consider the label flipping model for label noise and an equivalent sufficient condition based on η in that context. Connections with our mutual irreducibility assumption are noted in both of these works.

Implementation of estimators
The ROC characterization from Proposition 8 says that κ * is the minimum slope of any line passing through the point (1, 1) in ROC space and any other point on the optimal ROC. If the optimal ROC happens to be concave, this is the slope of the ROC at its right end-point. See Fig. 3.
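This minimum-slope characterization translates directly into code: given points (FPR, TPR) on an estimated ROC, κ* is estimated by the smallest slope of a segment joining a point to the corner (1, 1). A sketch with illustrative names:

```python
def kappa_from_roc(roc_points):
    """Estimate kappa* as the minimum slope (1 - TPR) / (1 - FPR) of a
    line joining an ROC point (FPR, TPR) to the corner (1, 1).
    roc_points is a list of (fpr, tpr) pairs; points with fpr = 1 are
    skipped, since the corner itself defines no slope."""
    return min((1 - tpr) / (1 - fpr) for fpr, tpr in roc_points if fpr < 1)
```

If the ROC is concave, the minimizing point is adjacent to (1, 1), so this quantity is the slope of the ROC at its right end-point, as noted above.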
Motivated by this idea, we suggest the following practical algorithm for MPE. First, split each of the two samples (1) and (2) into two portions according to a common ratio. Using the first portion of each data set, run a universally consistent classification algorithm that yields a full ROC. In our implementation, we run kernel logistic regression (KLR) with a Gaussian kernel, and vary the threshold on the posterior probability estimate to obtain an ROC. Note that KLR is run on the contaminated data. The bandwidth and regularization parameters of KLR are set using cross-validation.
Using the second portion of each sample, construct conservative estimates (as in Eqn. (18)) of the ROC for a discrete set of thresholds on the KLR posterior probability function. To obtain these conservative estimates, we do not use the empirical error plus or minus a VC bound; instead, we use direct binomial tail inversion (also known as a one-sided exact Clopper-Pearson confidence interval), which is the tightest possible deviation bound for a binomial random variable [22]. Using these conservative estimates, we then compute the minimum slope over all line segments joining points on the ROC to the point (1, 1).
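Direct binomial tail inversion needs no special libraries: the one-sided upper confidence bound is the largest success probability at which the lower tail at the observed count still has mass at least δ, and monotonicity of the tail in p permits a simple bisection. A sketch, with illustrative names and tolerance:

```python
import math

def binom_tail(n, k, p):
    """P(Bin(n, p) <= k), computed directly from binomial coefficients."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson_upper(k, n, delta=0.05, tol=1e-9):
    """One-sided exact (Clopper-Pearson) upper confidence bound for a
    binomial proportion: the largest p with P(Bin(n, p) <= k) >= delta.
    Found by bisection, since the tail is decreasing in p."""
    if k >= n:
        return 1.0
    lo, hi = k / n, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_tail(n, k, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return lo
```

For k = 0 the bound has the closed form 1 − δ^{1/n}, which provides a convenient sanity check.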
We also considered an alternative approach to estimating the label noise proportions, based on class probability estimation as discussed in Section 9. As in the preceding estimator, we split each sample into two portions and used the first portion of each sample to train a KLR estimate of the class probability function η̃. We then used the second portion of each sample to estimate the minimum and maximum values of η̃, which we plugged into the formulas (28)-(29) to obtain estimates of π_0 and π_1. To gain some robustness to outliers, we estimated the maximum and minimum using the 99th and 1st percentiles, respectively, as suggested by [30].
The estimates based on the ROC method are denoted π̂_0^roc and π̂_1^roc, while the estimates based on class probability estimation are denoted π̂_0^cpe and π̂_1^cpe. The former are based on a 20/80 split of each sample, and the latter on an 80/20 split, as these seemed to give the best results; the latter ratio was also employed by Menon et al. [30]. A detailed Matlab implementation, reproducing our results, is available at http://web.eecs.umich.edu/~cscott.

Experiments
To study the performance of the above estimators, we examined the problem of classification with label noise using three data sets. The waveform data set, available from the UCI Repository, consists of three classes of synthetically generated waveforms. The classes are overlapping, as the Bayes risk for this data set is known to be around 10%. We generated data for a binary classification problem (using only two of the classes) with label noise proportions π_0 and π_1 specified as in Table 1, with sample sizes n_0 = n_1 = 1000. We also used the MNIST handwritten digits data set, digits 3 and 8, with a setup similar to that of the waveform data; in this case the sample sizes were n_0 = n_1 = 2000.
A third data set comes from nuclear particle classification, where the training data are realistically described by the label noise model. The data are obtained from organic scintillation detectors, which detect both gamma-rays and neutrons, and associate every detected particle with a digitally sampled pulse-shaped waveform [1]. The goal is to distinguish gamma-ray pulses (class 0) from neutron pulses (class 1); see the discussion in Section 1.1. Training data were obtained by measuring particles emitted from a Cf-252 source, which undergoes spontaneous decay and emits both neutrons and gamma-rays. Data were preprocessed by aligning pulse peaks and by eliminating signals with multiple peaks (corresponding to multiple detected events within a single observation window). Through a special experimental configuration [3], the time of flight (TOF) of each particle hitting the detector was also measured. Since neutrons travel more slowly than gamma-rays, this provides noisy labels by considering only those particles with TOF in a certain window. Gamma-rays travel at the speed of light, so a data set with mostly gamma-ray pulses was obtained by focusing on those particles with TOFs consistent with the speed of light (TOF < 5 ns). However, neutrons can still have TOFs in this window if they were generated by a background event or by another fission event that occurred just an instant before the one being measured. A neutron TOF window was also selected (45 < TOF < 55 ns); as with the other window, this one will also contain some proportion of gamma-ray pulses. We obtained samples of size n_0 = n_1 = 3000 from each window. It is important to keep in mind that in this application the ground truth π_0 and π_1 are unknown, and it can only be assessed whether our estimates of these quantities are reasonable based on physics knowledge. The results are reported in Table 1.
Regarding the ROC method, the results indicate that this method provides reasonably accurate estimates of the label noise proportions in the four experimental settings where the true proportions are known. These results also suggest that mutual irreducibility can be a reasonable assumption in practice. In the nuclear particle classification problem, although ground truth labels are unavailable, the proportions estimated by the ROC method are at least consistent with the expectation that noisy labels should be relatively rare (given the high rate of Cf-252 fission events relative to the expected rate of background events), and also with the knowledge that neutrons are rarer background events than gamma-rays (i.e., π 0 < π 1 ).
With regard to the CPE method, the results indicate that the method is sometimes accurate but at other times incurs considerable error. We also note that for the nuclear data, π_1 is estimated to be smaller than π_0, which is inconsistent with the knowledge that contaminating neutrons are rarer than contaminating gamma-rays. To investigate this issue further, we formed Table 2. The first two columns are the same as in the previous table, restricted to the waveform and digits data, for which ground truth is known. The third and fourth columns show the empirical percentiles of the ground truth values of η̃_min and η̃_max, which should ideally be near 0 and 1. We see that our implementation of the CPE estimator can be both conservative (estimating more noise than is actually present) and overly optimistic (estimating less noise than is present). Indeed, percentiles far from 0 or 1 reflect over-optimism, while percentiles of exactly 0 and 1 (of which there is one instance in Table 2) are quite likely signs of conservatism; in that case, the empirical values of η̃ do not cover the full range [η̃_min, η̃_max].
CPE-based estimators were also studied by Liu and Tao [25] and Menon et al. [30], who report more favorable results for this method. We use KLR to estimate the class probabilities, whereas they employ different techniques; given this discrepancy in findings, the issue warrants further investigation. Two factors may favor the ROC method. First, the ROC method employs uncertainty quantification (on the deviation between true and empirical probabilities), in the form of direct binomial tail inversion, when estimating the slope of the ROC at its right end-point; similar uncertainty quantification would likely benefit the CPE method and make it less overly optimistic. Second, the ROC method leverages the shape constraint that the ROC is typically concave.
To illustrate the importance of accounting for label noise, we further examine nuclear particle classification. As noted in Section 2, training a classifier on contaminated training data generates the same ROC as training with uncontaminated data, and the real impact of accounting for label noise occurs in performance evaluation. In Fig. 4, the solid curve plots the ROC for the nuclear particle data, using contaminated test data to estimate the false positive and true positive rates. The dotted curve relies on Eqns. (10)-(11) to correct these probabilities, revealing that the classifier actually classifies the particles much more accurately than one would expect if label noise were not accounted for. This makes intuitive sense: many of the particles from the contaminated test data that appear to be incorrectly classified are actually correctly classified and simply have erroneous labels.

Conclusion
We argue that consistent classification with label noise is possible if a majority of the labels are correct on average, and the class-conditional distributions P 0 and P 1 are mutually irreducible. Under these conditions, we leverage results of [7] on mixture proportion estimation to design consistent estimators of the noise proportions. These estimators are applied to establish a consistent discrimination rule based on surrogate loss minimization, although other performance measures could be analyzed similarly. Unlike previous theoretical work on this problem, we handle the cases where the supports of P 0 and P 1 may overlap or even be equal, and the noise proportions are asymmetric and unknown.
We also argue that mutual irreducibility is necessary if the decontamination operation at the population level is required to satisfy certain natural conditions (universality, symmetry, continuity, and stability). Additionally, requiring mutual irreducibility can equivalently be seen as aiming at maximal denoising of the contaminated distributions, or maximal separation of the unknown sources P_0, P_1 for given contaminated distributions. Thus, our discrimination rule is universally consistent in the sense that its performance tends to the optimal performance corresponding to the maximally denoised P_0, P_1, regardless of P̃_0, P̃_1.
Finally, we investigate two practical implementations of MPE, one based on the ROC for the contaminated data, and the other based on class probability estimation for the contaminated data. The ROC method exhibits good accuracy in the label noise setting on three different data sets, including the nuclear particle classification problem that originally motivated this work. Our CPE implementation, on the other hand, still requires further development.

Proposition 23.
Assume that the ROC of the likelihood ratio tests
Proof. The slope of the ROC of an LRT with threshold γ is equal to γ wherever the slope is well defined [33, 37]. The right end-point of the ROC corresponds to γ* = ess inf_{x ∈ supp(H)} h(x). That is, for all γ > γ*, the Type I error of the LRT is strictly less than 1, whereas it equals 1 at γ*.

B.5. Proof of Theorem 14
We begin by establishing (19) without the absolute value, which is the more challenging direction. The reverse direction will then follow easily from the first part of Theorem 13.
The same inequality holds with the absolute value by the first part of Theorem 13, which holds on the same event (samples for which the VC bounds hold for all k ≥ 1) used to establish the above inequality.

B.6. Proof of Theorem 18
By Corollary 16, it suffices to show RP ,Lα ( f n ) − R * P ,Lα → 0 in probability. Toward this end we employ Rademacher complexity analysis. In particular, we will leverage the following result.
Theorem 24. Let Z, Z_1, . . . , Z_n be i.i.d. random variables taking values in a set Z. Let σ_1, . . . , σ_n be i.i.d. Rademacher random variables, independent of Z, Z_1, . . . , Z_n. Consider a set of functions G ⊆ [a, b]^Z. For all δ > 0, with probability at least 1 − δ with respect to the draw of Z_1, . . . , Z_n, we have sup_{g ∈ G} ( E[g(Z)] − (1/n) Σ_{i=1}^n g(Z_i) ) ≤ 2 R_n(G) + (b − a) √(log(1/δ)/(2n)), where R_n(G) := E[ sup_{g ∈ G} (1/n) Σ_{i=1}^n σ_i g(Z_i) ].
Let ε > 0, and let f_ε ∈ H be such that R_{P̃,L_α}(f_ε) < R*_{P̃,L_α} + ε/2, which is possible since the reproducing kernel associated with H is universal [43]. The excess risk then decomposes into five terms. The first term can be bounded, with probability at least 1 − 1/n, by the bound of Theorem 24, which gives the first term on the right-hand side; the second term of that bound comes from the observation that functions in G have ranges confined to [0, C_0 + DBM_n]. To see this, recall that losses are by definition nonnegative, that L is Lipschitz in its second argument, and that for any f ∈ B_H(M_n) we have ‖f‖_∞ = sup_{x ∈ X} |⟨f, k(·, x)⟩| ≤ BM_n by the reproducing property and the Cauchy-Schwarz inequality.
The fifth term is bounded similarly, the only additional observation being that f_ε ∈ B_H(M_n) for n sufficiently large.
The middle term can be bounded by λ_n ‖f_ε‖²_H, which tends to zero as n → ∞; this follows from the definition of f̂_n, since the regularized empirical risk of f̂_n is at most that of f_ε. To bound the second term, observe that for any f ∈ B_H(M_n), |R_{P̃,L_α̂}(f) − R_{P̃,L_α}(f)| ≤ |α̂ − α| (C_0 + D ‖f‖_∞), where D is the Lipschitz constant of L. By Cauchy-Schwarz and the reproducing property, ‖f‖_∞ ≤ B ‖f‖_H, where B is the bound on the kernel. Now ‖f̂_n‖_H ≤ √(C_0/λ_n), and so for the second term to go to zero we need |α̂ − α|/λ_n to go to zero. Under (A') and (C'), we know that |α̂ − α| converges at a rate of √(log n / n), and by our assumption on the rate of decay of λ_n, |α̂ − α|/λ_n tends to zero as n → ∞, except on an event of vanishing probability.
The fourth term is handled in a similar manner, where again we observe that f_ε ∈ B_H(M_n) for n sufficiently large.
In summary, we have shown that R_{P̃,L_α}(f̂_n) − R*_{P̃,L_α} ≤ ε with probability tending to one as n (and with it n_0 and n_1) tends to infinity. This concludes the proof.

B.7. Proof of Theorem 19
We start by establishing that L_α is T-clippable and that its clipped version is Lipschitz and bounded, with constants independent of α ∈ (0, 1). The loss L being T-clippable implies by definition that both L_0(t) := L(0, t) and L_1(t) := L(1, t) are clippable. Therefore, L_α(y, t) = (1 − α) 1_{y=1} L_1(t) + α 1_{y=0} L_0(t) is T-clippable (regardless of α ∈ (0, 1)). Denote L̄_α(y, t) := L_α(y, Clip_T(t)) = (1 − α) 1_{y=1} L_1(Clip_T(t)) + α 1_{y=0} L_0(Clip_T(t)), and define C_0 := max{L(0, 0), L(1, 0)}. Since L is assumed Lipschitz with constant D, both L_1 and L_0 are Lipschitz, and since Clip_T is 1-Lipschitz, by composition L̄_α is also a Lipschitz loss (with the same constant D, regardless of α ∈ (0, 1)). Furthermore, since Clip_T(t) ∈ [−T, T], we have L̄_α(y, t) ≤ C_0 + DT for all (y, t) and α.
Let ε > 0, and let f_ε ∈ H be such that R_{P̃,L_α}(f_ε) < R*_{P̃,L_α} + ε/2, which is possible since the reproducing kernel associated with H is universal [43]. The excess risk again decomposes into several terms. The first and last terms can be bounded, with probability at least 1 − 1/n, by 2DBM_n/√n + (C_0 + D max(T, B‖f_ε‖_∞)) √(ln(4n)/(2n)), using Rademacher complexity analysis as in the preceding proof; here D is the Lipschitz constant of L (and thus also of L̄), and B is the bound on the kernel. Note that since a different loss is used for the first and last terms, we use a union bound and thus introduce an additional factor in the log term. The third term equals R_{P̃,L̄_α}(f̂_n) − R_{P̃,L_α}(f̂_n) and is nonpositive by the definition of a T-clippable loss.
The middle (fourth) term can be bounded, as in the preceding proof, by λ_n ‖f_ε‖²_H, which tends to zero as n → ∞.
To bound the second term, observe that for any f, the argument of the preceding proof applies with the clipped loss L̄ in place of L; the fifth term is handled in a similar manner, but with the unclipped loss L instead of L̄. In summary, we have shown that R_{P̃,L_α}(f̆_n) − R*_{P̃,L_α} ≤ ε with probability tending to one as n (and with it n_0 and n_1) tends to infinity. This concludes the proof.