Binary classification with corrupted labels

In a binary classification problem where the goal is to fit an accurate predictor, the presence of corrupted labels in the training data set may create an additional challenge. However, in settings where likelihood maximization is poorly behaved (for example, if positive and negative labels are perfectly separable), a small fraction of corrupted labels can improve performance by ensuring robustness. In this work, we establish that in such settings corruption acts as a form of regularization, and we compute precise upper bounds on estimation error in the presence of corruptions. Our results suggest that corrupted data points are beneficial only up to a small fraction of the total sample, scaling with the square root of the sample size.


Introduction
Consider a classification problem, where our goal is to predict a binary label Y ∈ {±1} using information captured by a feature vector X ∈ R^d. Based on n training data points (X_1, Y_1), …, (X_n, Y_n), the objective is to fit a classifier f : R^d → {±1} to this data, mapping a new test feature vector X to a predicted label +1 or −1.
In many settings, inherent noise in the measurement process can introduce corruption into the observed labels. For example, consider a medical application where features X_i for patient i determine their likelihood of having a particular disease, and Y_i ∈ {±1} indicates presence or absence of the disease. Imperfect diagnostic tests might mean that the observed label differs from the true label Y_i. Writing Ỹ_i ∈ {±1} to denote the observed label, we might have P{Ỹ_i = −1 | Y_i = +1} > 0 (if the diagnostic test has a nonzero rate of false negatives) and similarly P{Ỹ_i = +1 | Y_i = −1} > 0 (indicating false positives).
In this work we model the corruption as P{Ỹ_i = −Y_i | Y_i} = ρ, where ρ denotes the probability that the observed label is corrupted, assumed to be identical across all data points (the "homogeneous noise" setting).
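To make the homogeneous noise model concrete, here is a minimal sketch (not from the paper) of the corruption mechanism: each label is flipped independently with probability ρ.

```python
import numpy as np

def corrupt_labels(y, rho, rng):
    """Flip each label in y (entries +/-1) independently with probability rho."""
    flips = rng.random(len(y)) < rho
    return np.where(flips, -y, y)

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=100_000)
y_tilde = corrupt_labels(y, rho=0.1, rng=rng)
print(np.mean(y_tilde != y))  # empirical flip rate, close to rho = 0.1
```

With a large sample, the empirical flip rate concentrates around ρ, matching the homogeneous noise model above.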
In the classification problem, our goal is to define a classification rule that, given a feature vector x ∈ R^d, outputs a predicted label +1 or −1. Writing η(x) = P{Y = +1 | X = x} for the probability of a positive label, the misclassification rate is minimized by predicting +1 or −1 depending on whether η(x) is above or below 0.5, respectively. In a real data setting where η(x) is unknown, the classification problem is typically addressed by fitting some function f(x) ∈ R and then predicting the label sign(f(x)). We can interpret f(x) as containing information about both our prediction for the label (via the sign) and our confidence in this prediction (via the magnitude; values f(x) ≈ 0 indicate uncertainty).
Given a possible choice of the function f, the misclassification rate on the training data set {(X_i, Y_i) : i = 1, …, n} is given by the empirical 0-1 loss, (1/n) Σ_{i=1}^n 1{Y_i ≠ sign(f(X_i))}, while its corrupted counterpart, (1/n) Σ_{i=1}^n 1{Ỹ_i ≠ sign(f(X_i))}, measures misclassification on the corrupted training data set {(X_i, Ỹ_i) : i = 1, …, n}. Our goal is to ensure a low "true" misclassification rate, P{Y ≠ sign(f(X))}, for predicting the label Y of a new point with features X, where (X, Y) is a new data point drawn from the same distribution as the original training data: X ∼ P_X, and Y | X is a label in {±1} with probabilities determined by η(X).
Since the zero-one loss is challenging to optimize, it is standard to use a surrogate loss function ℓ : R → R_+, typically chosen to be continuous, convex, and monotone nonincreasing. For example, the logistic surrogate loss is given by ℓ(t) = log(1 + e^{−t}), while the hinge loss is given by ℓ(t) = max{1 − t, 0}. Given a sample of n data points, (X_1, Y_1), …, (X_n, Y_n), we then define the empirical risk L_n(f) = (1/n) Σ_{i=1}^n ℓ(Y_i f(X_i)), the average surrogate loss on the data set {(X_i, Y_i) : i = 1, …, n}, and the corrupted empirical risk L^ρ_n(f) = (1/n) Σ_{i=1}^n ℓ(Ỹ_i f(X_i)), the average surrogate loss on the corrupted data set {(X_i, Ỹ_i) : i = 1, …, n}. We will also write the "true" risk of a function f as L(f) = E[ℓ(Y f(X))], with expectation taken over a data point (X, Y) drawn from the same distribution as before, i.e., X ∼ P_X, and label Y | X drawn with probabilities determined by η(X).
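These definitions can be illustrated with a short sketch. Here f is taken to be linear, f(x) = x^T w (the specialization the paper adopts later), and the data-generating model is purely illustrative, not the paper's.

```python
import numpy as np

def logistic_loss(t):
    # numerically stable log(1 + exp(-t))
    return np.logaddexp(0.0, -t)

def hinge_loss(t):
    return np.maximum(1.0 - t, 0.0)

def empirical_risk(w, X, y, loss=logistic_loss):
    """Average surrogate loss (1/n) sum_i loss(Y_i f(X_i)) for linear f(x) = x^T w."""
    return loss(y * (X @ w)).mean()

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 5))       # illustrative data, not the paper's model
w = rng.standard_normal(5)
y = np.sign(X @ w + 0.5 * rng.standard_normal(500))
print(empirical_risk(w, X, y, logistic_loss), empirical_risk(w, X, y, hinge_loss))
```

Replacing `y` by a corrupted vector of labels in the same call gives the corrupted empirical risk.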

Summary of questions and results
The key question of this work is to compare the performance of the empirical risk minimizer, f̂ = argmin_{f∈F} L_n(f), and its corrupted counterpart, f̂^ρ = argmin_{f∈F} L^ρ_n(f), where the minimization is taken over some predefined class of functions F (for example, linear functions of x). That is, how does the presence of corrupted labels affect the performance of the empirical risk minimizer? In particular, we emphasize that the surrogate loss function is unchanged: we do not adjust or attempt to "correct" for the presence of corruption (in contrast to much of the existing literature, which we review below). Our findings can be summarized as follows. First, we find that corruption mimics regularization: for a fixed function f ∈ F, the corrupted empirical risk L^ρ_n(f) is a biased estimate of the true risk L(f), but acts as an unbiased estimate (up to a positive scaling) of a penalized version of this risk, L(f) + λR(f), where λ > 0 is a penalty parameter depending on the corruption level ρ, while the regularization function R(f) is given by the expected loss of the function f under a completely random label. While adding a penalty introduces bias into our estimator, it also serves to reduce variance, and for limited sample size n, this reduction in variance may outweigh the bias. Our second finding is therefore that, in some settings, corruption may lead to reduced risk for finite sample size, since it is effectively acting as a regularizer and can substantially reduce variance.

Prior work
The problem of learning a classifier in the presence of corrupted labels has been studied in many works in the recent literature. Here we give a very brief overview of the settings and types of results considered. Consider the more general model P{Ỹ = −Y | X = x, Y = y} = ρ(x, y). Here η(x) denotes the probability of a positive (true) label as before, while ρ(x, y) denotes the probability that the observed label is corrupted, which now may depend on x and/or y. Frénay et al. [7] and Frénay and Verleysen [8] provide overviews of recent works on this problem. They categorize the existing methods into three types: label noise-robust models, data cleaning methods, and label noise-tolerant learning algorithms.
The homogeneous noise setting assumes that ρ(x, y) ≡ ρ for all x, y; that is, there is a constant probability for each label to be corrupted. This is the setting we study in the present work. Under this setting, Long and Servedio [14] study boosting algorithms and discuss negative consequences of label noise. Van Rooyen, Menon and Williamson [26] consider the empirical risk minimization approach and propose a label noise-robust loss function. Manwani and Sastry [16] discuss the noise-tolerance property of risk minimization. Cannings, Fan and Samworth [5] show that LDA is consistent under this noise model, and Blanco, Japón and Puerto [2] propose robust algorithms that apply relabeling and clustering to SVM.
The class-dependent noise setting assumes that ρ(x, y) = ρ_y for all x, y; that is, the probability of corrupting a positive label (Y = +1 but Ỹ = −1) is constant with respect to the feature vector x, and similarly for a negative label, but these two probabilities may differ. For example, in our earlier medical example, the diagnostic test might have different false positive and false negative rates, but these rates themselves are constant across patients (i.e., independent of features such as age that might be included in the X vector). Liu and Tao [13], Scott, Blanchard and Handy [25], and Blanchard et al. [1] study the consistency of the classifier under corruption, while Reeve et al. [23] focus on the minimax optimal learning rate of the corrupted estimator. Some recent works apply corrections to the loss function or to the observed labels; see Natarajan et al. [19], van Rooyen and Williamson [27], Patrini et al. [21], and Lin and Bradic [12]. Other recent works focus on studying or developing label noise-robust methods; see Natarajan et al. [18], Patrini et al. [20], Reeve and Kabán [24], Bootkrajang and Kabán [3], and Bootkrajang and Kabán [4].
Finally, the general setting, where ρ(x, y) may vary with x, is studied by Cannings, Fan and Samworth [5]. In particular, they examine a setting for k-nearest neighbors where the corrupted labels Ỹ_i are more "clean" than the original labels Y_i, in the sense that the corruption mechanism defined by ρ(x, y) acts to denoise labels near the decision boundary (i.e., η(x) ≈ 0.5). Specifically, suppose that, for values x with η(x) slightly higher than 0.5, we have ρ(x, +1) < ρ(x, −1) (that is, a label Y_i = −1 that "should" instead be positive has a greater chance of being flipped to Ỹ_i = +1), and similarly if η(x) is slightly lower than 0.5 then ρ(x, +1) > ρ(x, −1). In this case, the Ỹ_i's carry strictly more information for estimating the decision boundary than the Y_i's; this setting is therefore fundamentally different from the one we consider here, where homogeneous noise creates strictly noisier labels. Menon, Van Rooyen and Natarajan [17] consider a similar general setting, showing that any algorithm that is consistent in the noise-free setting remains consistent under noisy labels under appropriate assumptions. Recent discussions of noise tolerance and robustness of corrupted classification in this setting can be found in Ghosh, Manwani and Sastry [9] and Cheng et al. [6].

Intuition: corruption acts as regularization
The key idea for studying the corrupted estimator through the framework of regularization is to find a regularizer that matches the expected behavior of the corruption. To do this, we first find a different representation of the corruption variables: define R_i ∼ Bernoulli(2ρ) and Z_i ∼ Unif{±1}, drawn independently from each other and independently of the clean data. Then let Ỹ_i = Y_i if R_i = 0, and Ỹ_i = Z_i if R_i = 1. That is, R_i determines whether the label Y_i will be replaced by a random sign, and Z_i provides this random sign. Examining this construction, we can see that it yields the same distribution of the corrupted labels as the original definition, since P{Ỹ_i = −Y_i} = 2ρ · (1/2) = ρ. We can then write the corrupted loss as L^ρ_n(f) = (1/n) Σ_{i=1}^n ℓ(Ỹ_i f(X_i)). Next, we treat f as fixed, and then condition on the clean data and marginalize over the distribution of the R_i's and Z_i's:

E[L^ρ_n(f) | (X_1, Y_1), …, (X_n, Y_n)] = (1/n) Σ_{i=1}^n [(1 − 2ρ) ℓ(Y_i f(X_i)) + 2ρ · (1/2)(ℓ(f(X_i)) + ℓ(−f(X_i)))].

Recall the definition of the regularizer, R(f) = E[(1/2)(ℓ(f(X)) + ℓ(−f(X)))], the expected loss of f on purely random labels. We can also consider an empirical version, R̂_n(f) = (1/n) Σ_{i=1}^n (1/2)(ℓ(f(X_i)) + ℓ(−f(X_i))). We therefore see that

E[L^ρ_n(f) | (X_1, Y_1), …, (X_n, Y_n)] = (1 − 2ρ)(L_n(f) + λ R̂_n(f)),

where λ = 2ρ/(1 − 2ρ). Finally, for any fixed function f, we have E[L^ρ_n(f)] = (1 − 2ρ)(L(f) + λR(f)) by definition. Therefore, we can view the corrupted empirical risk minimizer f̂^ρ as a sample estimate of the minimizer of the penalized loss L(f) + λR(f). To summarize our findings so far, we have seen that f̂^ρ = argmin_{f∈F} L^ρ_n(f) can be described in two ways: • Fixing the training data {(X_i, Y_i) : i = 1, …, n} and taking an expectation over the corruption mechanism (the R_i's and Z_i's above), we see that L^ρ_n(f) has (conditional) expected value (1 − 2ρ)(L_n(f) + λ R̂_n(f)), a positive multiple of a penalized empirical risk (and the factor 1 − 2ρ does not affect the minimizer).
• Taking expectations over both the original data and the random corruption, L^ρ_n(f) has expected value (1 − 2ρ)(L(f) + λR(f)), a positive multiple of a penalized true risk.
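This identity can be checked numerically. The following sketch (with an arbitrary fixed linear f, the logistic loss, and illustrative parameter values) averages the corrupted empirical risk over many independent draws of the corruption variables R_i, Z_i and compares it with (1 − 2ρ)(L_n(f) + λ R̂_n(f)), where λ = 2ρ/(1 − 2ρ).

```python
import numpy as np

def logistic(t):
    # numerically stable log(1 + exp(-t))
    return np.logaddexp(0.0, -t)

rng = np.random.default_rng(2)
n, d, rho = 200, 5, 0.15
X = rng.standard_normal((n, d))
y = rng.choice([-1, 1], size=n)
w = rng.standard_normal(d)
m = X @ w                      # margins f(X_i) for an arbitrary fixed linear f

# clean empirical risk and the empirical regularizer
L_n = logistic(y * m).mean()
R_n = 0.5 * (logistic(m) + logistic(-m)).mean()
lam = 2 * rho / (1 - 2 * rho)

# average the corrupted empirical risk over many draws of (R_i, Z_i)
reps = 20_000
R = rng.random((reps, n)) < 2 * rho       # R_i ~ Bernoulli(2*rho)
Z = rng.choice([-1, 1], size=(reps, n))   # Z_i ~ Unif{+-1}
y_tilde = np.where(R, Z, y)               # replace label by a random sign when R_i = 1
mc = logistic(y_tilde * m).mean()

print(mc, (1 - 2 * rho) * (L_n + lam * R_n))  # the two values should nearly agree
```

The Monte Carlo average of the corrupted risk matches the penalized (and rescaled) clean risk up to simulation error.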

Results for the linear setting
Next, we will examine the implications of this relationship between corruption and regularization for the goal of minimizing risk. From this point on, we will restrict our discussion to the setting where F consists of linear functions, in order to be able to obtain precise results; consequently, we will shift our notation from functions f to vectors w. Specifically, for each w ∈ R^d we define the population-level loss and regularized loss,

L(w) = E[ℓ(Y X^T w)],  L^ρ(w) = (1 − 2ρ)(L(w) + λR(w)),  where R(w) = E[(1/2)(ℓ(X^T w) + ℓ(−X^T w))],

as well as the empirical loss and empirical corrupted loss,

L_n(w) = (1/n) Σ_{i=1}^n ℓ(Y_i X_i^T w),  L^ρ_n(w) = (1/n) Σ_{i=1}^n ℓ(Ỹ_i X_i^T w).

We will also define the population-level minimizers

w_* ∈ argmin_{w∈R^d} L(w),  w^ρ_* ∈ argmin_{w∈R^d} L^ρ(w),   (1)

and the empirical minimizers

ŵ_n ∈ argmin_{w∈R^d} L_n(w),  ŵ^ρ_n ∈ argmin_{w∈R^d} L^ρ_n(w),   (2)

whenever these minimizers exist. (Note that, in some settings, the loss or its empirical or corrupted counterpart may have no minimizer; for example, under the logistic loss, when the positive and negative labels can be perfectly separated.) For each of the four minimization problems, if the minimizer exists but is not unique, our results apply to any minimizer (e.g., w^ρ_* denotes any element of the set argmin_{w∈R^d} L^ρ(w), etc.).
It is well known that regularization may help reduce risk, even at the cost of increasing bias due to the influence of the regularization function. As discussed earlier, since corruption mimics regularization, in many settings we empirically observe that corruption reduces the risk, i.e., L(ŵ^ρ_n) < L(ŵ_n), even though the corruption introduces bias. We will next study why this phenomenon occurs, by establishing bounds on the loss L(ŵ^ρ_n) of the corrupted estimator.

Theoretical results
We begin by defining our assumptions. First, we require some conditions on the loss function.

Assumption 1. The loss function ℓ is nonnegative, nonincreasing, convex, and L-Lipschitz. Furthermore, ℓ is strictly decreasing on negative values, with ℓ(t) ≥ ℓ(0) + γ|t| for all t ≤ 0 and some γ > 0, and has subexponential decay for positive values, ℓ(t) ≤ c_1 e^{−c_2 t} for all t ≥ 0.

The last two conditions ensure that the loss function enacts a strong penalty if X^T w predicts the sign of Y incorrectly (i.e., ℓ(t) is large for t < 0), but decays quickly if X^T w predicts the sign of Y correctly (i.e., ℓ(t) is small for t > 0). These conditions are satisfied by many well-known examples, for instance: • The logistic loss ℓ(t) = log(1 + e^{−t}) satisfies Assumption 1 with γ = 1/2 and c_1 = c_2 = 1. We will also need some weak assumptions on the distribution of the feature vector X (Assumption 2): moment-type lower and upper bounds on the marginals X^T u, with constants a_0, a_1, a_2, holding for all unit vectors u ∈ S^{d−1}.
For example, this assumption is satisfied by any multivariate Gaussian distribution with mean μ and covariance Σ, with the parameters a 0 , a 1 , a 2 depending on μ and on the largest and smallest eigenvalues of Σ, but not on the dimension d.
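As a sanity check on the logistic example: the exact constants in the displayed conditions are not reproduced above, so the specific inequalities below (a slope of γ = 1/2 on the negative axis, and an e^{−t} decay bound on the positive axis) are one natural reading, verified numerically on a grid.

```python
import numpy as np

def logistic(t):
    return np.logaddexp(0.0, -t)  # log(1 + e^{-t})

t_neg = -np.linspace(0.0, 50.0, 2001)
t_pos = np.linspace(0.0, 50.0, 2001)

# strong penalty on the negative axis: l(t) >= l(0) + (1/2)|t| for t <= 0
assert np.all(logistic(t_neg) >= np.log(2.0) + 0.5 * np.abs(t_neg) - 1e-9)
# subexponential decay on the positive axis: l(t) <= e^{-t} for t >= 0
assert np.all(logistic(t_pos) <= np.exp(-t_pos) + 1e-12)
# 1-Lipschitz: finite-difference slopes never exceed 1 in magnitude
slopes = np.abs(np.diff(logistic(t_pos))) / np.diff(t_pos)
assert np.all(slopes <= 1.0)
print("logistic loss satisfies the stated conditions on this grid")
```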
Under these assumptions, our main result establishes a bound on the loss of the corrupted estimator ŵ^ρ_n.

Theorem 1. Suppose that Assumptions 1 and 2 hold. Let n ≥ 2, fix any α > 0, and fix any ρ ∈ (0, 1/2) with ρ ≥ C · (d log n)/n. Then with probability at least 1 − n^{−α}, the set argmin_{w∈R^d} L^ρ_n(w) is nonempty, and for all ŵ^ρ_n ∈ argmin_{w∈R^d} L^ρ_n(w) it holds that

L(ŵ^ρ_n) ≤ inf_{w∈R^d} L(w) + C′ (ρ^{1/2} + ρ^{−1/2} √((d log n)/n)).

Here C, C′ depend only on α and on the constants in Assumptions 1 and 2, but not on n, d, or ρ.
We can see an immediate tradeoff in the upper bound of Theorem 1. The ρ^{1/2} term acts as an "approximation error": a large corruption proportion ρ leads to a potentially large gap between the loss of the regularized estimator, L(w^ρ_*), and the minimum possible loss without regularization, inf_{w∈R^d} L(w). On the other hand, the ρ^{−1/2} · √((d log n)/n) term is the "estimation error", which is large when the corruption proportion ρ is small (i.e., insufficient regularization). The resulting upper bound on risk is minimized when the corruption level scales as ρ ≍ ((d log n)/n)^{1/2}, leading to an upper bound on excess risk scaling as ((d log n)/n)^{1/4}. This suggests that even a very small fraction of corrupted entries can lead to reduced risk. In contrast, the uncorrupted minimization problem may not behave well under these weak assumptions: for instance, if the labels are perfectly linearly separable (as might be the case if, e.g., Y | X follows a logistic regression model with very high signal strength), then a minimizer does not even exist (i.e., argmin_{w∈R^d} L_n(w) is empty).
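The balancing of the two error terms can be made explicit with a short calculation, writing the bound schematically as g(ρ) = Aρ^{1/2} + Bρ^{−1/2} for constants A, B > 0 (suppressing the constants of Theorem 1):

```latex
\[
g(\rho) = A\rho^{1/2} + B\rho^{-1/2},
\qquad
g'(\rho) = \frac{A}{2}\rho^{-1/2} - \frac{B}{2}\rho^{-3/2} = 0
\;\Longleftrightarrow\;
\rho^{\star} = \frac{B}{A},
\qquad
g(\rho^{\star}) = 2\sqrt{AB}.
\]
\[
\text{With } A \asymp 1,\; B \asymp \Big(\tfrac{d\log n}{n}\Big)^{1/2}:
\qquad
\rho^{\star} \asymp \Big(\tfrac{d\log n}{n}\Big)^{1/2},
\qquad
g(\rho^{\star}) \asymp \Big(\tfrac{d\log n}{n}\Big)^{1/4}.
\]
```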
The assumption that ρ ≥ C · (d log n)/n is not merely an artifact of the proof; in fact, without this type of assumption, we cannot even ensure that argmin_{w∈R^d} L^ρ_n(w) is nonempty. To see why, consider a setting where the population is perfectly separable and ℓ is a strictly decreasing function. In this case, the empirical risk minimizer ŵ_n does not exist (in other words, it diverges). Now, if ρ = 1/n, then with probability (1 − 1/n)^n ≈ e^{−1}, the corrupted dataset is equal to the original dataset, which means that the corrupted data set is also perfectly separable and thus ŵ^ρ_n does not exist. Of course, the result of Theorem 1 is an upper bound on the loss, and may be loose for certain examples; the value of ρ that minimizes the upper bound (i.e., ρ ≍ ((d log n)/n)^{1/2}) might not be the same as the value of ρ that minimizes the loss itself. In particular, the result can be viewed as a "worst case" bound that holds even when the unregularized loss has no minimizer (such as logistic regression with perfectly separable labels, as mentioned above); in problems where this is not the case, regularization is not as critical, and a smaller value of ρ (or even ρ = 0) may perform better.

Proof of Theorem 1
Our first step is to examine some properties of the regularized population minimizer w^ρ_* and its empirical counterpart, the corrupted estimator ŵ^ρ_n.

Lemma 1.
Suppose Assumptions 1 and 2 hold. Fix any ρ ∈ (0, 1/2). Then argmin_{w∈R^d} L^ρ(w) is nonempty, and any w^ρ_* ∈ argmin_{w∈R^d} L^ρ(w) must satisfy ‖w^ρ_*‖ ≤ C_0 ρ^{−1/2} and L(w^ρ_*) ≤ inf_{w∈R^d} L(w) + C_1 ρ^{1/2}. Moreover, for any α > 0, if n ≥ 2 and ρ ≥ C · (d log n)/n, then with probability at least 1 − n^{−α} it holds that argmin_{w∈R^d} L^ρ_n(w) is nonempty, that any ŵ^ρ_n ∈ argmin_{w∈R^d} L^ρ_n(w) must satisfy ‖ŵ^ρ_n‖ ≤ C_0 ρ^{−1/2}, and that

sup_{‖w‖ ≤ C_0 ρ^{−1/2}} |L^ρ_n(w) − L^ρ(w)| ≤ C_2 ρ^{−1/2} √((d log n)/n).

Here C, C_0, C_1, C_2 depend on α and on the constants in Assumptions 1 and 2, but not on n, d, or ρ.
Now we prove the theorem. By Lemma 1, with probability at least 1 − n^{−α}, for any w^ρ_* ∈ argmin_{w∈R^d} L^ρ(w) and all ŵ^ρ_n ∈ argmin_{w∈R^d} L^ρ_n(w), it holds that ‖w^ρ_*‖, ‖ŵ^ρ_n‖ ≤ C_0 ρ^{−1/2} and that sup_{‖w‖ ≤ C_0 ρ^{−1/2}} |L^ρ_n(w) − L^ρ(w)| ≤ C_2 ρ^{−1/2} √((d log n)/n). From now on, we assume that these events all hold. Then we have

L^ρ(ŵ^ρ_n) ≤ L^ρ_n(ŵ^ρ_n) + C_2 ρ^{−1/2} √((d log n)/n) ≤ L^ρ_n(w^ρ_*) + C_2 ρ^{−1/2} √((d log n)/n) ≤ L^ρ(w^ρ_*) + 2C_2 ρ^{−1/2} √((d log n)/n),

where the middle step uses the optimality of ŵ^ρ_n for L^ρ_n, and we set C′ = max{2C_1, 4C_2}. Next, by definition of L^ρ, we have (1 − 2ρ) L(w) ≤ L^ρ(w) for all w, where the last step holds since ρ ≤ 1/2 and R is nonnegative. Therefore, combining these bounds with the guarantee L(w^ρ_*) ≤ inf_{w∈R^d} L(w) + C_1 ρ^{1/2} of Lemma 1 completes the proof of the theorem.

Another perspective on the regularizer
The results above suggest that the main source of possible improvement from corruption is the shrinkage induced by the corruption (or, at the population level, by the regularizer R(w)). In particular, the results of Lemma 1 show that, in the linear setting, the corruption (or the regularizer) leads to an upper bound on ‖w‖. We will now examine this connection more closely.
The following lemma verifies that, up to constants, R(w) is equivalent to ‖w‖. In a sense, then, we can view regularization with R(w) as effectively placing a penalty on ‖w‖.

Lemma 2. Suppose Assumptions 1 and 2 hold. Then it holds that

c_L (1 + ‖w‖) ≤ R(w) ≤ c_U (1 + ‖w‖) for all w ∈ R^d,

where c_L, c_U depend only on the constants in Assumptions 1 and 2.
Proof. In the calculations (3) and (4) appearing in the proof of Lemma 1, we will see that Assumption 2 implies lower and upper bounds on E[|X^T u|] for all unit vectors u ∈ S^{d−1}. For any w ∈ R^d, the lower bound then follows by applying the lower bound of Assumption 1 to the term ℓ(−|X^T w|), together with the convexity of ℓ, which gives (1/2)(ℓ(X^T w) + ℓ(−X^T w)) ≥ ℓ(0). For the upper bound, the L-Lipschitz property of ℓ gives (1/2)(ℓ(X^T w) + ℓ(−X^T w)) ≤ ℓ(0) + L|X^T w|, and taking expectations completes the argument.
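A quick Monte Carlo check of this equivalence (a sketch under an assumed standard Gaussian feature distribution and the logistic loss, neither of which is prescribed by the lemma): estimate R(w) = E[(1/2)(ℓ(X^T w) + ℓ(−X^T w))] at increasing radii and observe the linear growth.

```python
import numpy as np

def logistic(t):
    return np.logaddexp(0.0, -t)

rng = np.random.default_rng(3)
X = rng.standard_normal((200_000, 5))   # assumed feature distribution: N(0, I_5)
u = np.ones(5) / np.sqrt(5.0)           # a fixed unit direction

R_hat = {}
for r in [1.0, 5.0, 25.0, 125.0]:
    m = X @ (r * u)                     # margins X^T w with ||w|| = r
    R_hat[r] = 0.5 * (logistic(m) + logistic(-m)).mean()
    print(r, round(float(R_hat[r]), 3), round(float(R_hat[r]) / (1.0 + r), 3))
```

The ratio R̂(w)/(1 + ‖w‖) stabilizes as the radius grows, consistent with R(w) being sandwiched between constant multiples of 1 + ‖w‖.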

Simulations
Now we empirically investigate the effect of corruption through a simulation. We generate the data {(X_i, Y_i)}_{1≤i≤n} as follows: choosing dimension d = 50, we draw each data point independently for i = 1, …, n, from a model in which the label probability η(X_i) involves a cubic function of the features. We run the experiment at a small and a large sample size, n = 400 and n = 2000, and at a range of values of the corruption probability, ρ ∈ {0, 0.01, 0.02, …, 0.2}. For each sample size n and corruption level ρ, we run 100 independent trials of the experiment; we choose the logistic loss function ℓ(t) = log(1 + e^{−t}), and compute the corrupted empirical minimizer ŵ^ρ_n defined in (2) and the penalized population-level minimizer w^ρ_* as in (1) (which reduce to the uncorrupted empirical minimizer ŵ_n and the unpenalized population-level minimizer w_*, respectively, in the case ρ = 0). Note that the data-generating distribution does not follow the logistic regression model (due to the cubic term), and so the logistic loss simply acts as a surrogate for the 0-1 loss (i.e., it does not correspond to a likelihood for some well-specified model). Figure 1 shows the performance of the corrupted estimator ŵ^ρ_n and its population-level version w^ρ_* across the range of corruption values ρ ∈ {0, 0.01, 0.02, …, 0.2}, at each sample size n ∈ {400, 2000}; the result at ρ = 0 is highlighted in each case, as it corresponds to the uncorrupted estimator ŵ_n and to the corresponding population-level minimizer w_*. Overall, the plots illustrate how corruption acts as regularization: for the smaller sample size n = 400, we see that a small amount of corruption substantially reduces the test risk of the empirical minimizer ŵ^ρ_n, while for the larger sample size n = 2000, the uncorrupted estimator ŵ_n achieves good performance and we no longer see any noticeable improvement from corruption. For the population-level minimizers, on the other hand, increasing regularization always leads to an increase in risk, as expected.
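The experiment can be sketched as follows. Since the display specifying the paper's data-generating model is not reproduced here, the cubic link below (and the smaller dimension d = 10, chosen for speed) is a hypothetical stand-in; the fitting step is plain unregularized logistic regression via gradient descent, with no correction for the corruption.

```python
import numpy as np

def logistic(t):
    return np.logaddexp(0.0, -t)

def fit_logistic(X, y, steps=300, lr=0.5):
    """Unregularized logistic regression fit by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        m = y * (X @ w)
        # gradient of (1/n) sum_i log(1 + e^{-Y_i X_i^T w}); clip avoids overflow warnings
        grad = -(X * (y / (1.0 + np.exp(np.clip(m, -30, 30))))[:, None]).mean(axis=0)
        w -= lr * grad
    return w

rng = np.random.default_rng(4)
d, n, n_test = 10, 400, 20_000          # smaller d than the paper's d = 50, for speed
beta = np.ones(d) / np.sqrt(d)          # hypothetical signal direction

def draw(n):
    X = rng.standard_normal((n, d))
    s = X @ beta
    p = 1.0 / (1.0 + np.exp(-(s + s ** 3)))   # cubic term: not a logistic model
    y = np.where(rng.random(n) < p, 1, -1)
    return X, y

X, y = draw(n)
X_test, y_test = draw(n_test)

for rho in [0.0, 0.05, 0.1]:
    y_rho = np.where(rng.random(n) < rho, -y, y)   # homogeneous label corruption
    w = fit_logistic(X, y_rho)
    test_risk = logistic(y_test * (X_test @ w)).mean()
    print(rho, round(float(test_risk), 4))
```

Averaging such runs over many trials (as in the paper's 100 repetitions) traces out the risk-versus-ρ curves summarized in Figure 1.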

Discussion
In this work, we have shown that the corruption of labels has a regularization-type effect on binary classification problems, which can lead to an improvement in the test risk of the fitted classifier. Unlike many prior works that apply adjustments or corrections to achieve consistency or robustness of the estimator, our result implies that corruption itself can be beneficial without any adjustment to the estimation process; thus, in some cases, it may be better to simply fit to the corrupted dataset without any modification to the method. In particular, this means that we do not need to know or estimate the corruption mechanism, as would be the case for a procedure that corrects for the corruption. For the fitting of linear classifiers using empirical risk minimization under homogeneous noise, Theorem 1 provides an explanation for the possibility of corruption being beneficial, illustrating the tradeoff between approximation error and estimation error.
We can expect a similar tradeoff in more general settings where the noise is not homogeneous, or where different estimation methods are applied; in general, it is intuitive that a small amount of corruption can reduce the chance of overfitting, especially when the inherent noise level is low, and this benefit may outweigh the small bias that is introduced. As an example of a broader setting where this type of phenomenon may be useful, consider a setting where some data points are known to be "clean" while others are potentially corrupted (this setting can be thought of as a special case of transfer learning; for example, see Reeve, Cannings and Samworth [22]). While we might expect that performance could be improved by removing or down-weighting the latter data points in order to avoid or reduce the effect of corruption, our findings instead suggest that the presence of the non-"clean" data might even be beneficial.
The question of corrupted labels, with its possible risks and benefits, is studied only in a very specific setting in our work (i.e., linear prediction rules in low dimensions), and many open questions remain. First, noting that the corrupted loss can be thought of as another surrogate for the 0-1 loss, we may ask how corruption affects the prediction performance of the estimator in terms of the misclassification rate, i.e., the 0-1 risk. Second, do similar phenomena occur in the high-dimensional regime, d ≫ n or d ∝ n? In particular, we have seen that homogeneous corruption mimics an ℓ2 penalty in the low-dimensional setting; however, the same is not immediately true in high dimensions, since these results rely on concentration-type arguments that would no longer hold (and, in particular, for d ≥ n, in general both the uncorrupted data {(X_i, Y_i)}_{1≤i≤n} and the corrupted data {(X_i, Ỹ_i)}_{1≤i≤n} are perfectly linearly separable, so we cannot expect good performance without some additional constraints or regularization). Finally, since the key phenomenon underlying our results is the way that homogeneous corruption mimics ℓ2 regularization (and therefore induces shrinkage in the resulting estimator), our analysis does not explain any potential benefits from corruption for methods such as k-nearest neighbors, or other methods with no notion of shrinkage; is corruption beneficial more broadly, by reducing the chance of overfitting in a more general sense? We leave these questions for future work.

A.1. Proof of Lemma 1
We first verify that L^ρ is β-Lipschitz, where β = L a_1/a_0. For any w ≠ w′ ∈ R^d, we have |L^ρ(w) − L^ρ(w′)| ≤ L · E[|X^T(w − w′)|] ≤ β‖w − w′‖, where the last inequality follows from Assumption 2 via a bound on E[|X^T u|] for unit vectors u. We therefore have that L^ρ is β-Lipschitz. Note that the above argument also holds for ρ = 0, implying that L is also β-Lipschitz.
Define t in terms of c_2, γ, and log 2. We will show that, for any u ∈ S^{d−1}, the value L^ρ(tu) exceeds the infimum of L^ρ. First we calculate a lower bound: the first inequality holds by definition of the distribution of the corrupted label Ỹ (since P{Ỹ = +1 | X} ∈ [ρ, 1 − ρ] holds almost surely), while the second inequality follows from Jensen's inequality together with Assumption 2. Combining these bounds with the lower bound on ℓ from Assumption 1, we conclude that L^ρ(tu) − inf_{w∈R^d} L^ρ(w) > 0, by Assumption 2 and by definition of t.
In particular, this implies that L^ρ(tu) > inf_{w∈R^d} L^ρ(w) for all u ∈ S^{d−1}. Since w ↦ L^ρ(w) is continuous as shown above, this implies that L^ρ(w) attains its infimum, and any w^ρ_* ∈ argmin_{w∈R^d} L^ρ(w) must satisfy ‖w^ρ_*‖ ≤ t. Next we bound L(w^ρ_*) for any w^ρ_* ∈ argmin_{w∈R^d} L^ρ(w). First note that the corrupted risk can be written as

L^ρ(w) = (1 − ρ) L(w) + ρ L(−w).   (5)

Applying (5) with w = w^ρ_* we obtain L^ρ(w^ρ_*) = (1 − ρ) L(w^ρ_*) + ρ L(−w^ρ_*), and similarly applying (5) with w = −w^ρ_* we obtain L^ρ(−w^ρ_*) = (1 − ρ) L(−w^ρ_*) + ρ L(w^ρ_*). Since L^ρ(w^ρ_*) ≤ L^ρ(−w^ρ_*) by optimality of w^ρ_*, and ρ < 1/2 by assumption, this proves that L(w^ρ_*) ≤ L(−w^ρ_*) and therefore, L(w^ρ_*) ≤ L^ρ(w^ρ_*). Next, fix any w ∈ R^d. First consider the case that ‖w‖ ≤ cρ^{−1/2}, where c = c_1 a_2/(2βc_2). Then

L(w^ρ_*) ≤ L^ρ(w^ρ_*) ≤ L^ρ(w) = L(w) + ρ (L(−w) − L(w)) ≤ L(w) + 2βρ‖w‖ ≤ L(w) + 2βc ρ^{1/2},

where the last inequalities hold since L is β-Lipschitz and ‖w‖ ≤ cρ^{−1/2}.
Next consider the case that ‖w‖ > cρ^{−1/2}. Let u = w/‖w‖ and t′ = cρ^{−1/2}; then by the reasoning above, convexity of L^ρ gives a bound on L^ρ(w) in terms of L^ρ(t′u). Next, letting Z_u = X^T u · Y, we can bound the relevant expectations, and therefore, for this second case, we again obtain a bound of the same form. Combining the two cases, we have shown that L(w^ρ_*) ≤ L(w) + C_1 ρ^{1/2} for all w ∈ R^d, which proves the desired inequality with C_1 chosen appropriately. Now we turn to the corrupted estimator ŵ^ρ_n. First we will need a lemma establishing some concentration results: Lemma 3 provides three high-probability bounds, (6), (7), and (8), controlling, respectively, an empirical process over u ∈ S^{d−1}, the quantity sup_{u∈S^{d−1}} (1/n) Σ_{i=1}^n e^{−t|X_i^T u|}, and the uniform deviation sup_{‖w‖≤r} |L^ρ_n(w) − L^ρ(w)|. Here r_1, r_2, r_3, r_4, r_5 > 0 depend only on α and on the constants in Assumptions 1 and 2, and not on n, d, r, or t.
We are now ready to prove the remainder of Lemma 1. First we bound ‖ŵ^ρ_n‖. Define C = 2r_2/r_1 and fix t = C_0 ρ^{−1/2}.

Y. Lee and R. F. Barber
We will show that, for any u ∈ S^{d−1}, L^ρ_n(tu) > inf_{w∈R^d} L^ρ_n(w). Assuming ρ ≥ C · (d log n)/n, the bound (6) in Lemma 3 gives the required control for all u ∈ S^{d−1}. Furthermore, since t = C_0 ρ^{−1/2}, the bound (7) in Lemma 3 (applied with c_2 t/2 in place of t), together with our assumption ρ ≥ C · (d log n)/n, gives the corresponding bound for all u ∈ S^{d−1}. Following identical arguments as in the population case, we obtain L^ρ_n(tu) > inf_{w∈R^d} L^ρ_n(w) for all u ∈ S^{d−1}, where the last step holds by definition of t and of C_0. Since L^ρ_n is continuous (because we have assumed the loss is continuous), as in the population case this again proves that L^ρ_n(w) must attain its infimum, and that any ŵ^ρ_n ∈ argmin_{w∈R^d} L^ρ_n(w) must satisfy ‖ŵ^ρ_n‖ ≤ t. Finally, the bound sup_{‖w‖ ≤ C_0 ρ^{−1/2}} |L^ρ_n(w) − L^ρ(w)| ≤ C_2 ρ^{−1/2} √((d log n)/n) follows immediately from the bound (8) in Lemma 3, by setting C_2 = C_0 r_5.

A.2. Proof of Lemma 3
First, we prove (6). The distribution of (X, Ỹ) can equivalently be represented as follows: Ỹ = Y if R = 0 and Ỹ = Z if R = 1, where R ∼ Bernoulli(2ρ) is generated independently from (X, Y), and Z ∼ Unif{±1} is generated independently from (X, Y, R). Let (X_i, Y_i, R_i, Z_i) generate the n i.i.d. data points, and define the corresponding truncated variables X̄_i. We can verify that, since X̄, R, Z are independent, the relevant expectations factor by definition of their distributions. Furthermore, by Jensen's inequality, together with Assumption 2 and Markov's inequality, the remaining expectation is bounded; rearranging terms and combining everything we have shown so far, it holds deterministically that the quantity of interest is controlled by a remainder term Δ. Now we need to bound Δ with high probability. By the symmetrization inequality (Koltchinskii [10], Theorem 2.1), together with a contraction argument (where the ξ_i ∼ Unif{±1} are drawn independently of the data), the expectation of Δ is controlled. Moreover, it holds deterministically that ‖X̄_i‖ ≤ 4E[‖X‖], while R_i ∼ Bernoulli(2ρ) is independent from X̄_i. Combining everything so far, and applying a concentration bound uniformly over u ∈ S^{d−1} with r chosen appropriately as a function of α, a_0, and a_1, we have shown that the desired bound holds with probability at least 1 − 1/(3n^α), which is sufficient to verify (6) with r_1, r_2 chosen appropriately, since √(ρ · (d log n)/n) ≤ rρ/2 + (d log n)/(2rn) for all r > 0. Next we prove (7). Comparing the two terms in the desired upper bound and noting that 1/t is only dominant if t ≤ n/(d log n), we see that it suffices to prove the result for t ≤ n/(d log n), since t ↦ sup_{u∈S^{d−1}} (1/n) Σ_{i=1}^n e^{−t|X_i^T u|} is monotone nonincreasing in t.
Bounding the supremum, and since we have assumed that t ≤ n, taking ε = n^{−2.5} we obtain a bound that satisfies (7) with r_3, r_4 chosen appropriately, as shown before. Finally we prove (8). We first bound the quantity inside the expected value, which is controlled as long as λ^2 ≤ a_0/d. Following the proof of Kontorovich [11], Theorem 1, since sup_{‖w‖≤r} |L^ρ_n(w) − L^ρ(w)| is an (Lr/n)-Lipschitz function of each data point, a concentration inequality applies. Choosing λ appropriately (a choice which satisfies λ^2 ≤ a_0/d for sufficiently large n), this probability is bounded by 1/(3n^α). (If instead n is not sufficiently large, i.e., λ^2 > a_0/d, then the guarantee (8) holds trivially.) Combining everything, and choosing r_5 appropriately, we have established (8).