Is distribution-free inference possible for binary regression?

For a regression problem with a binary label response, we examine the problem of constructing confidence intervals for the label probability conditional on the features. In a setting where we do not have any information about the underlying distribution, we would ideally like to provide confidence intervals that are distribution-free, that is, valid with no assumptions on the distribution of the data. We establish an explicit lower bound on the length of any distribution-free confidence interval, and construct a procedure that approximately achieves this length. In particular, this lower bound is independent of the sample size and holds for all distributions with no point masses, meaning that no distribution-free procedure can adapt to any type of special structure in the distribution.


Introduction
Consider a regression problem where we would like to model the relationship between a feature vector X ∈ R^d and a response Y ∈ R, based on a sample of n data points, (X_1, Y_1), ..., (X_n, Y_n) iid∼ P. In a high-dimensional setting where d is large, many modern methods are available to build powerful predictive models for Y given X, but relatively little is known about their theoretical properties. For example, if we train a neural network on the n available data points, can we quantify its accuracy on unseen test data, without making strong assumptions on P, the unknown distribution of the data?
If we are willing to assume that the data follow a regression model E_P[Y | X = x] = f(x), where the function f satisfies certain assumptions, then classical statistical results ensure that these questions can be answered using simpler regression methods. For example, if f(x) lies in a parametric family (e.g., linear regression), then we can perform inference within this parametric model. In a more general nonparametric setting, if f(x) is assumed to satisfy some smoothness conditions, classical nonparametric methods such as nearest neighbors will also yield guarantees on the accuracy of our estimate of f(x). However, the reported results will be invalid if the assumptions (the parametric model, or the smoothness conditions) do not hold.
To address this concern, the recent field of distribution-free prediction considers the problem of providing valid predictive inference without any assumptions on the data distribution. The aim of distribution-free prediction is formulated as follows: given a training data set (X_1, Y_1), ..., (X_n, Y_n) ∈ R^d × R, our task is to construct a map C_n, mapping a new data point x ∈ R^d to an interval or set C_n(x) ⊆ R, such that

P{Y_{n+1} ∈ C_n(X_{n+1})} ≥ 1 − α. (1)

Here the probability is taken with respect to (X_i, Y_i) iid∼ P for i = 1, ..., n + 1 (the training and test data are drawn from the same distribution P). The bound is required to hold uniformly over all distributions P, without constraining to, say, distributions that satisfy some notion of smoothness. For example, the conformal inference methodology [Vovk et al., 2005] provides an elegant framework for distribution-free prediction, and can adapt to the favorable properties of the underlying distribution to achieve asymptotically optimal prediction intervals in certain settings (see, e.g., Lei and Wasserman [2014], Lei et al. [2018]).
Distribution-free prediction has also been studied in the context of a binary response Y ∈ {0, 1}, where the output is a set C_n(X_{n+1}) ⊆ {0, 1} (or more generally, in a setting with a finite set of possible labels) [Vovk et al., 2005, Lei, 2014, Sadinle et al., 2019]. For a binary Y, the goal is to satisfy

P{Y_{n+1} ∈ C_n(X_{n+1})} ≥ 1 − α. (2)

For distributions P where the conditional probability π_P(X) = P_P{Y = 1 | X} is typically close to either 0 or 1, given sufficient data the resulting distribution-free predictive set C_n(X_{n+1}) can often be a singleton set, {0} or {1}. If the labels are inherently noisy, however, that is, if π_P(X) is typically bounded away from both 0 and 1, then {0, 1} will often be the only possible set offering guaranteed predictive coverage, even if we were to have oracle knowledge of the distribution P. In other words, predictive coverage (whether distribution-free or not) is not a meaningful target for binary regression problems with noisy labels; we would like to estimate the label probability π_P(X) directly, rather than try to predict the inherently noisy label Y.
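To make the prediction-set formulation concrete, the following is a minimal sketch of the split ("inductive") conformal approach mentioned above, applied to binary labels. The conformity score 1 − π̂_y(x), the helper name split_conformal_sets, and the use of any off-the-shelf classifier are illustrative choices, not the construction studied in this paper.

```python
import numpy as np

def split_conformal_sets(p1_cal, y_cal, p1_test, alpha=0.1):
    """Split-conformal prediction sets C(x) in {0,1} with marginal coverage >= 1 - alpha.

    p1_cal, p1_test: estimated P(Y=1 | X) on calibration and test points,
    produced by any classifier fit on a *separate* split of the training data.
    """
    p1_cal, y_cal, p1_test = map(np.asarray, (p1_cal, y_cal, p1_test))
    # Conformity score: one minus the estimated probability of the observed label.
    scores = 1.0 - np.where(y_cal == 1, p1_cal, 1.0 - p1_cal)
    n = len(scores)
    # Finite-sample-corrected (1 - alpha) quantile of the calibration scores.
    k = int(np.ceil((1.0 - alpha) * (n + 1)))
    qhat = np.inf if k > n else np.sort(scores)[k - 1]
    # Include label y whenever its score falls below the calibrated threshold.
    return [{y for y, py in ((0, 1.0 - p1), (1, p1)) if 1.0 - py <= qhat}
            for p1 in p1_test]
```

For instance, if the estimated probabilities are identically 0.5, every calibration score equals 0.5, so the threshold is at least 0.5 and both labels are included for every test point, matching the discussion above of noisy labels.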

Summary of contributions
In this work, we ask whether the distribution-free framework can be extended beyond the prediction task, in the binary regression setting. We will aim to provide distribution-free inference on the conditional label probability π_P(X) = P_P{Y = 1 | X}. We are particularly interested in scenarios where π_P(X) is typically not close to either 0 or 1 and so meaningful predictive inference would not be possible even if P were known. In this type of setting, is it nonetheless possible to provide a nontrivial confidence interval for π_P(X), and to ensure a distribution-free guarantee of coverage?
Specifically, our goal is to investigate the feasibility of constructing an algorithm that satisfies the following condition:

Definition 1. An algorithm C_n provides a (1 − α)-distribution-free confidence interval for binary regression if it holds that

P_{(X_i,Y_i) iid∼ P}{π_P(X_{n+1}) ∈ C_n(X_{n+1})} ≥ 1 − α for all distributions P on R^d × {0, 1}. (3)

This notion of a valid distribution-free confidence interval was previously studied by Vovk et al. [2005, Section 5.2], under the name "weakly valid probability estimators".
Definition 1 requires a fairly weak form of coverage: we ask that coverage holds on average over the new feature vector X_{n+1}, rather than requiring P{π_P(x) ∈ C_n(x)} ≥ 1 − α to hold uniformly over all x ∈ R^d. Nonetheless, the main results of our work establish that even this weak notion of distribution-free coverage is fundamentally incompatible with the goal of precise inference; with some caveats, the main message of our results is that the property (3) can only be attained by algorithms C_n that return confidence intervals whose length does not vanish with the sample size n. To make this more precise, our main results are the following:
• Distribution-free confidence leads to distribution-free prediction. In Theorem 1, we prove that any algorithm C_n satisfying (3) will inevitably also yield a valid prediction interval for Y_{n+1}, for any nonatomic distribution P, i.e., any P with no point masses. This result is closely related to Vovk et al. [2005, Proposition 5.1], where it is shown that C_n must include the endpoints 0 and/or 1 with large probability. Intuitively, this implies that, in a noisy setting where π_P(X) is not typically close to 0 or 1, any distribution-free confidence interval C_n(X_{n+1}) is likely to be quite wide, since it needs to reach one or both endpoints.
• A lower bound on the length of a distribution-free confidence interval.
In Theorem 2, we formalize the above intuition, establishing a lower bound on the expected length of C_n(X_{n+1}) given by an explicit function of the distribution of π_P(X) (again, for any nonatomic P). Importantly, this lower bound is independent of the sample size n. In other words, for any fixed nonatomic distribution P, the length of our distribution-free confidence intervals cannot go to zero even as n → ∞. This means that distribution-free confidence intervals cannot be adaptive: by requiring coverage to hold for all distributions P, we give up the possibility of providing precise confidence intervals for any particular distribution P, regardless of whether π_P satisfies "nice" conditions such as smoothness.
• A matching upper bound. In Theorem 3 we propose a concrete construction for C_n(X_{n+1}) that satisfies the distribution-free coverage property (3). In particular, Corollary 1 proves that the length of our proposed interval is asymptotically equal to the lower bound established in Theorem 2, for any distribution P for which it is possible to estimate π_P(X) consistently as n → ∞.

Fixed vs. random intervals
In some cases, we may want to allow additional randomness in our construction (formally, we would define C_n as mapping a new feature vector x to a distribution over subsets of R, with C_n(x) denoting a subset drawn from this distribution). In this case, the probability in statements such as (1), (2), and (3) should be interpreted as being taken with respect to the distribution of the data (X_i, Y_i) iid∼ P and the additional randomness in the construction of C_n(X_{n+1}). From this point on, we will assume that probabilities and expectations are taken on average over any randomness in the construction of the relevant prediction or confidence interval, without further comment. In particular, all results proved in this paper apply to both fixed and random intervals.

Related work
As mentioned above, the problem of valid distribution-free confidence intervals was previously studied by Vovk et al. [2005, Section 5.2]. In addition, this problem is closely related to two lines of work in the recent statistical literature: nonparametric inference (specifically, confidence intervals for nonparametric regression), and distribution-free prediction.

Nonparametric confidence intervals
The classical literature on inference for nonparametric regression considers data from a model of the form Y = f(X) + noise, where f is the unknown regression function, and where the noise distribution is constrained (e.g., subgaussian with some bounded variance). In this setting, we may assume that the true regression function f lies in some constrained class; for example, it may be constrained to be Lipschitz, or to have a Lipschitz gradient (corresponding to a smoothness assumption with exponent β = 1 or β = 2, respectively). There is a rich literature on the problems of estimating f(x), and providing inference (e.g., confidence bands) for f(x). If β is known, then the problem is fairly straightforward; for example, a k-nearest neighbors method with k ∼ n^{2β/(2β+d)} yields the optimal estimation error rate O(n^{−β/(2β+d)}), ignoring log factors (and, correspondingly, confidence intervals of this length) [Low, 1997, Györfi et al., 2006]. However, a key question of interest is that of adaptivity: if β is unknown but is assumed to satisfy β ≥ β_0, is it possible to construct confidence intervals that are valid at smoothness level β_0, but that, if applied to data with smoothness β > β_0, would still achieve the optimal rate at that β? This question has been studied extensively in the literature; see, e.g., Low [1997], Genovese and Wasserman [2008], Cai et al. [2014] and the references therein. It turns out that the question of adaptivity is closely tied to how we choose to define coverage: if we require uniform coverage, i.e., that f(x) lies within its confidence interval uniformly over all x, then adaptivity is impossible [Low, 1997], while relaxing to nearly-uniform coverage allows for adaptivity up to β = 2β_0 [Cai et al., 2014]; a bootstrap-based approach for nearly-uniform coverage is studied also by Hall and Horowitz [2013]. A different relaxation of the coverage condition is coverage on average over a random draw of X, studied by Wahba [1983], which is similar to the coverage condition (3) studied in this work. Genovese and Wasserman [2008] propose a different relaxation, providing confidence intervals guaranteed to cover a "surrogate" of the function f (a smoothed version of the regression function). A different notion of coverage is the "confidence ball", guaranteeing a bound on the ℓ_2 error ∫_x (f̂(x) − f(x))^2 dx rather than providing pointwise confidence intervals for f(x) at each x (see, e.g., Cai and Low [2006]).
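As a point of reference for the rates quoted above, here is a minimal sketch (not part of this paper's methodology) of a k-NN estimator with the smoothness-dependent choice k ∼ n^{2β/(2β+d)}; the function name and the use of scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knn_estimate(X_train, y_train, X_test, beta=1.0):
    """k-NN regression with k scaling as n^{2*beta/(2*beta+d)}.

    With this choice of k, the estimation error scales as n^{-beta/(2*beta+d)}
    (up to log factors) when f is beta-smooth; note that the code assumes beta
    is known, which is exactly the assumption the adaptivity question asks to drop.
    """
    n, d = X_train.shape
    k = max(1, int(round(n ** (2 * beta / (2 * beta + d)))))
    model = KNeighborsRegressor(n_neighbors=min(k, n)).fit(X_train, y_train)
    return model.predict(X_test)
```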
Distribution-free prediction The field of distribution-free prediction aims to provide prediction intervals that are uniformly valid over all distributions, without assuming some minimum level of smoothness as in the nonparametric inference literature. As mentioned above, the conformal prediction framework [Vovk et al., 2005] provides methodology for this aim. Alternative methods offering distribution-free predictive guarantees include holdout set methods (also known as "split" or "inductive" conformal prediction; see, e.g., Papadopoulos et al. [2002], Vovk et al. [2005], Papadopoulos [2008], Lei et al. [2018]), and the jackknife+ [Barber et al., 2019b], a variant of the jackknife (i.e., leave-one-out cross-validation). Lei and Wasserman [2014] establish that distribution-free prediction is possible while (approximately) achieving prediction intervals of the minimum possible length for any "nice" (e.g., smooth) distribution P.
The work of Vovk [2012], Lei and Wasserman [2014], Barber et al. [2019a] studies whether a stronger form of predictive coverage can be attained, namely, distribution-free conditional coverage, aiming for a guarantee that holds pointwise at (almost) every x, i.e., P{Y_{n+1} ∈ C_n(X_{n+1}) | X_{n+1} = x} ≥ 1 − α. Distribution-free pointwise coverage is shown to be impossible for any finite-length interval [Vovk, 2012, Lei and Wasserman, 2014]. Barber et al. [2019a] study a weaker notion of conditional coverage, aiming to ensure P{Y_{n+1} ∈ C_n(X_{n+1}) | X_{n+1} ∈ X} ≥ 1 − α for all sufficiently large subsets X ⊆ R^d, and prove lower bounds on the length of any resulting distribution-free interval. Some of the techniques in these lower bounds are related to the proof techniques we use in the present work.
In the setting where the response Y is binary or takes finitely many values, as discussed earlier, Vovk et al. [2005], Lei [2014], Sadinle et al. [2019] apply the conformal prediction framework to the problem of distribution-free classification. If the goal is to estimate label probabilities (rather than output a predictive set), an alternative notion of validity in the binary setting is calibration, where for an estimate π̂(X) of the label probability π_P(X), we require P{Y = 1 | π̂(X)} = π̂(X); this is studied in the distribution-free setting by Vovk and Petej [2014] via the methodology of Venn predictors.

Main results: lower bounds
In this section, we will prove that a distribution-free confidence interval for binary regression cannot provide precise inference about the parameter π_P(X). To do this, we first compare to the problem of predictive inference, and then turn to proving lower bounds on the length of any distribution-free confidence interval.

Confidence vs. prediction
Our first main result proves that, in the binary regression setting, any algorithm providing distribution-free coverage of π_P(X) must necessarily also cover the binary label Y, for every distribution P that is nonatomic (i.e., places zero probability on any single point).
Theorem 1. Let C_n be any algorithm that provides a (1 − α)-distribution-free confidence interval for binary regression, as in (3). Then C_n also satisfies (1 − α) predictive coverage uniformly over all nonatomic distributions P. That is, C_n satisfies

P{Y_{n+1} ∈ C_n(X_{n+1})} ≥ 1 − α for all nonatomic distributions P on R^d × {0, 1}.

For example, consider a distribution P with a constant label probability, π_P(x) ≡ 0.5. Given a large sample size n, we might hope that our algorithm would detect the simple nature of this distribution, and could output a narrow interval, C_n(X_{n+1}) = 0.5 ± o(1). However, Theorem 1 tells us that any distribution-free confidence interval C_n must necessarily include both endpoints 0 and 1 with substantial probability. In particular, this example suggests that, unless π_P(X) is usually close to 0 or 1, any distribution-free confidence interval C_n is unlikely to be precise (i.e., C_n(X_{n+1}) is unlikely to be a short interval). In Theorem 2 below, we will formalize this intuition by establishing a lower bound on the expected length of C_n(X_{n+1}).
We note that Theorem 1 is closely related to a result of Vovk et al. [2005, Proposition 5.1] (see also Nouretdinov et al. [2001, Theorem 7, Corollary 18] for an earlier related result). Their work establishes that if C_n provides a (1 − α)-distribution-free confidence interval for binary regression (3), then there exists some other algorithm C'_n, also satisfying the property (3), whose output C'_n(X_{n+1}) excludes both of the endpoints 0 and 1 with probability bounded by the noncoverage probability of C_n(X_{n+1}). Clearly this last quantity cannot be larger than α (since this would immediately contradict the coverage property (3)), and so this result, like Theorem 1 above, indicates that C_n(X_{n+1}) must often include the endpoints 0 and/or 1. While Vovk et al. [2005]'s result appears different from the conclusion of Theorem 1 above, the construction in their proof in fact suffices to prove Theorem 1 as well.

A key lemma
Rather than proving Theorem 1 directly, we will instead establish a more general result:

Lemma 1. Let C_n be any algorithm that provides a (1 − α)-distribution-free confidence interval for binary regression (3). Let P be any nonatomic distribution on (X, Y) ∈ R^d × {0, 1}, and consider any joint distribution on triples (X, Y, Z) ∈ R^d × {0, 1} × [0, 1] under which (X, Y) ∼ P and Y | X, Z ∼ Bernoulli(Z). Then, for (X_i, Y_i, Z_i), i = 1, ..., n + 1, drawn i.i.d. from this joint distribution, P{Z_{n+1} ∈ C_n(X_{n+1})} ≥ 1 − α.

(Proofs for this lemma and all subsequent theoretical results are deferred to the Appendix.) With this lemma in place, Theorem 1 follows immediately: taking Z_{n+1} = Y_{n+1}, we have proved the theorem. The lemma is substantially more general, however, and we will need its full generality in order to prove our lower bounds on the length of C_n(X_{n+1}) below.

A lower bound on length
Next, we will establish bounds on the length of a distribution-free confidence interval for binary regression. We begin with a few definitions. First, for t ∈ [0, 1/2] and a ∈ [0, 1], we define the function ℓ(t, a), and for t ∈ (1/2, 1] we let ℓ(t, a) = ℓ(1 − t, a). The function ℓ(t, a) is illustrated in Figure 1. To understand the role of this function in our work, we begin with the following lemma:

Lemma 2. Fix any t, a ∈ [0, 1], and let F_{t,a} denote the set of functions f : [0, 1] → [0, 1] satisfying E[f(Z)] ≥ 1 − a for every random variable Z ∈ [0, 1] with E[Z] = t. Then it holds that

ℓ(t, a) = inf_{f ∈ F_{t,a}} ∫_0^1 f(s) ds.

Next, for any distribution Q on [0, 1] and any α ∈ [0, 1], define

L_α(Q) = inf { E_{T∼Q}[ℓ(T, a(T))] : a : [0, 1] → [0, 1] measurable, E_{T∼Q}[a(T)] ≤ α }.

In the following theorem, we will see that this function allows us to explicitly characterize lower and upper bounds for the length of any distribution-free confidence interval.

Theorem 2. Let C_n be any algorithm that provides a (1 − α)-distribution-free confidence interval for binary regression (3). Then for any nonatomic distribution P on R^d × {0, 1},

E[leb(C_n(X_{n+1}))] ≥ L_α(Π_P),

where Π_P is the distribution of the random variable π_P(X) ∈ [0, 1].
Here leb(·) denotes the Lebesgue measure on R.

Proof sketch for Theorem 2
First, for intuition, we can consider a simple case where Π_P = δ_t, a point mass at some t ∈ [0, 1]; that is, the label probability is constant, with π_P(x) = t for all x. Define a function f : [0, 1] → [0, 1] by f(z) = P{z ∈ C_n(X_{n+1})}. By Lemma 1, f satisfies E[f(Z)] ≥ 1 − α for any random variable Z ∈ [0, 1] with E[Z] = t, and so f ∈ F_{t,α}. Applying Lemma 2, we therefore have

E[leb(C_n(X_{n+1}))] = ∫_0^1 P{z ∈ C_n(X_{n+1})} dz = ∫_0^1 f(z) dz ≥ ℓ(t, α)

by Fubini's theorem. We can also verify that L_α(Π_P) = L_α(δ_t) = ℓ(t, α), completing the proof for this simple case.

Now, in general, π_P(x) will not be a constant. Consider any distribution-free confidence interval C_n, and let a_P(t) be the noncoverage rate over test points X_{n+1} conditional on π_P(X_{n+1}) = t, for a particular distribution P:

a_P(t) = P{π_P(X_{n+1}) ∉ C_n(X_{n+1}) | π_P(X_{n+1}) = t}.

We must therefore have E_{T∼Π_P}[a_P(T)] ≤ α, in order to achieve at least 1 − α coverage. By comparing to the constant-probability case, informally we can see that ℓ(t, a_P(t)) must be a lower bound on E[leb(C_n(X_{n+1})) | π_P(X_{n+1}) = t]. Therefore, marginalizing over T ∼ Π_P,

E[leb(C_n(X_{n+1}))] ≥ E_{T∼Π_P}[ℓ(T, a_P(T))] ≥ L_α(Π_P).

The full details of this proof are deferred to the Appendix.

Interpreting the lower bound
As mentioned in the proof sketch above, we have seen that the expected length of any distribution-free confidence interval is lower-bounded by L_α(Π_P). If π_P(X) = t almost surely, then L_α(Π_P) = ℓ(t, α), giving us an exact expression for the lower bound on length in the case where the label probability is constant.
More generally, we can verify that, for any t ∈ [0, 1/2], if π_P(X) ∈ [t, 1 − t] almost surely then L_α(Π_P) ≥ ℓ(t, α). (This bound holds because, for any s ∈ [t, 1 − t], we have ℓ(s, a) ≥ ℓ(t, a) for all a; examining the definition of L_α leads immediately to this lower bound.) This inequality means that, if π_P(X) is bounded away from 0 and 1, then there is a fundamental lower bound on the length of any distribution-free confidence interval regardless of the sample size n, since the lower bound ℓ(t, α) > 0 does not depend on n. In other words, an infinite sample size does not lead to infinite precision.
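To spell out the parenthetical argument, the following short derivation is a sketch: it uses the variational form of L_α given above, the remark after (7) that a ↦ ℓ(t, a) is convex, and the additional assumption (natural, though not restated here) that a ↦ ℓ(t, a) is nonincreasing.

```latex
% Suppose \pi_P(X) \in [t, 1-t] almost surely, so T \sim \Pi_P is supported on [t, 1-t].
% For any feasible noncoverage allocation a(\cdot) with E_{T \sim \Pi_P}[a(T)] \le \alpha:
\begin{align*}
\mathbb{E}_{T \sim \Pi_P}\!\left[\ell\bigl(T, a(T)\bigr)\right]
 &\ge \mathbb{E}_{T \sim \Pi_P}\!\left[\ell\bigl(t, a(T)\bigr)\right]
   && \text{since } \ell(s, a) \ge \ell(t, a) \text{ for all } s \in [t, 1-t], \\
 &\ge \ell\bigl(t, \mathbb{E}[a(T)]\bigr)
   && \text{by Jensen's inequality (convexity of } a \mapsto \ell(t, a)\text{)}, \\
 &\ge \ell(t, \alpha)
   && \text{if } a \mapsto \ell(t, a) \text{ is nonincreasing and } \mathbb{E}[a(T)] \le \alpha.
\end{align*}
% Taking the infimum over feasible a(\cdot) gives L_\alpha(\Pi_P) \ge \ell(t, \alpha).
```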

Comparison to existing results in predictive inference
Vovk et al. [2005, Section 3.4] study distribution-free prediction in a setting where the label Y takes values in a finite set Y, with binary labels Y ∈ {0, 1} as a special case. Their results (specifically, see Vovk et al. [2005, Propositions 3.3-3.5]) characterize the minimum possible expected cardinality of any distribution-free predictive set in terms of the "predictability" of Y given X; in the special case of a binary label, this translates to studying the distribution Π_P of π_P(X). However, the two problems (predictive subsets of {0, 1} versus confidence intervals that are subsets of [0, 1]) are very different in nature, and their results are not directly related to the quantity L_α(Π_P) derived in our work.

Main results: upper bounds
We will next investigate whether the lower bound on confidence interval length, proved in Theorem 2, can in fact be achieved by a distribution-free method. In order to be able to construct confidence intervals based on a finite sample, we will work in a setting where we approximate P via a partition. We begin by defining some notation. First suppose we are given a predefined partition R^d = X_1 ∪ · · · ∪ X_M. (Later on, for a distribution-free algorithm, we will allow the partition to be chosen as a function of the data.) For each x ∈ R^d, we define m(x) to be the index of the region containing x, i.e., X_{m(x)} ∋ x. We also define the probability within each region, p_{P,m} = P_{P_X}{X ∈ X_m}, and the average label probability within each region, π_{P,m} = E[π_P(X) | X ∈ X_m]. We will consider confidence intervals that pool data within each region; in particular, we will construct a confidence interval C_n(X_{n+1}) that depends on X_{n+1} only through m(X_{n+1}). As a result, it is clear that we will only be able to produce a precise confidence interval if π_P(x) is approximately constant over x ∈ X_m, for each m. To capture this notion, we define the "blur" of the partition X_{1:M} as

∆_P(X_{1:M}) = E_{P_X}[ |π_P(X) − π_{P,m(X)}| ].

In other words, ∆_P(X_{1:M}) will be low if the partition is highly informative, separating R^d into regions where π_P(x) is nearly constant. For example, if we have access to a good estimate of the function π_P(x), we can partition R^d by clustering together points with similar estimated probabilities. Of course, the blur ∆_P(X_{1:M}) cannot in general be small unless we choose M to be large. We will discuss this tradeoff later on.
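As a concrete illustration of the blur, the following minimal sketch approximates ∆_P(X_{1:M}) by Monte Carlo in a simulation where π_P is known; the helper names pi_true and assign_region are hypothetical and not part of the paper's construction.

```python
import numpy as np

def empirical_blur(pi_true, assign_region, X_sample, M):
    """Monte Carlo approximation of the blur Delta_P(X_{1:M}) of a partition.

    pi_true: the label probability function pi_P(x) (known only in simulations);
    assign_region: maps x to its region index m(x) in {0, ..., M-1};
    X_sample: a large sample drawn from P_X.
    """
    m_idx = np.array([assign_region(x) for x in X_sample])
    pi_vals = np.array([pi_true(x) for x in X_sample])
    blur_terms = np.zeros_like(pi_vals)
    for m in range(M):
        in_m = (m_idx == m)
        if in_m.any():
            # pi_{P,m}: average label probability within region m.
            blur_terms[in_m] = np.abs(pi_vals[in_m] - pi_vals[in_m].mean())
    return blur_terms.mean()  # approximates E_{P_X} |pi_P(X) - pi_{P,m(X)}|
```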

An oracle algorithm
In order to motivate our distribution-free construction, we begin with a simpler problem. Suppose that we are given a fixed partition R^d = X_1 ∪ · · · ∪ X_M, and are given oracle knowledge of the probabilities p_{P,m} and π_{P,m} defined above. What is the best possible interval length that can be obtained using this oracle knowledge?
As for the lower bound, let us begin by examining the function ℓ(t, a). For any t, a ∈ [0, 1], define a function f_{t,a} : [0, 1] → [0, 1] as follows: if t ∈ [0, 1/2], f_{t,a} is defined explicitly, and for t ∈ (1/2, 1] we define f_{t,a}(s) = f_{1−t,a}(1 − s). In the proof of Lemma 2, we will establish that f_{t,a} satisfies ∫_0^1 f_{t,a}(s) ds = ℓ(t, a), and that E[f_{t,a}(Z)] ≥ 1 − a for any Z with E[Z] = t. In other words, the function f_{t,a} attains the infimum in the statement of that lemma.
Next we will leverage this function to construct a confidence interval, denoted C_{t,a} and defined in (5), using the given partition.
The following lemma, Lemma 3, examines the length and coverage properties of this construction:

If additionally it holds that

for all m, either 0 ≤ π_{P,m} ≤ t_m ≤ 1/2 or 1/2 ≤ t_m ≤ π_{P,m} ≤ 1, (6)

then C_{t,a} additionally satisfies a coverage guarantee, with noncoverage rate controlled by the weighted average Σ_{m=1}^M p_{P,m} a_m. (Note that C_{t,a}(X) is a randomized interval, and so the probability and expectation are taken with respect to the data point X ∼ P_X and the independent random variable U ∼ Unif[0, 1] used in the construction of C_{t,a}(X).)

Next, we define the oracle confidence interval. Suppose we are given oracle knowledge of the p_{P,m}'s and π_{P,m}'s. Define

a*_P = argmin_{a ∈ [0,1]^M} Σ_{m=1}^M p_{P,m} ℓ(π_{P,m}, a_m) subject to Σ_{m=1}^M p_{P,m} a_m ≤ α. (7)

(This is a convex optimization problem, since ℓ(t, a) is convex in a.) Given a test point x ∈ R^d, we define the oracle interval as

C*_P(x) = C_{π_P, a*_P}(x), (8)

where we write π_P = (π_{P,1}, ..., π_{P,M}). To understand this oracle interval, by the results of Lemma 3, we can observe that the constraint Σ_{m=1}^M p_{P,m} a_m ≤ α ensures 1 − α coverage for the distribution P, while minimizing Σ_{m=1}^M p_{P,m} ℓ(π_{P,m}, a_m) ensures the lowest possible length under this coverage constraint.
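For readers who want to experiment with an allocation problem of this form, here is a small numerical sketch using an off-the-shelf solver; the function ℓ is passed in as a callable (its closed form is not reproduced in this excerpt), and the helper name oracle_noncoverage_rates is an illustrative choice rather than anything defined in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def oracle_noncoverage_rates(p, pi, ell, alpha=0.1):
    """Numerically solve: minimize sum_m p_m * ell(pi_m, a_m)
    subject to sum_m p_m * a_m <= alpha and 0 <= a_m <= 1.

    p, pi: per-region probabilities and label probabilities;
    ell(t, a): the length function, assumed convex in a (as noted in the text).
    """
    p, pi = np.asarray(p, float), np.asarray(pi, float)
    M = len(p)
    objective = lambda a: float(np.sum(p * np.array([ell(t, ai) for t, ai in zip(pi, a)])))
    constraints = [{"type": "ineq", "fun": lambda a: alpha - float(np.dot(p, a))}]
    res = minimize(objective, x0=np.full(M, alpha), bounds=[(0.0, 1.0)] * M,
                   constraints=constraints, method="SLSQP")
    return res.x  # per-region noncoverage budget a*
```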
Our next result shows that if the blur ∆_P(X_{1:M}) of the partition is low, then the expected length of C*_P is close to the distribution-free lower bound L_α(Π_P).

Lemma 4. The oracle interval C*_P constructed in (8) satisfies P{π_P(X_{n+1}) ∈ C*_P(X_{n+1})} ≥ 1 − α, and

E[leb(C*_P(X_{n+1}))] ≤ L_α(Π_P) + 2∆_P(X_{1:M})/α.

Of course, C*_P is not a distribution-free confidence interval: its coverage is guaranteed for a specific distribution P (not uniformly over all P), and it assumes that we have information about the distribution, namely, the p_{P,m}'s and π_{P,m}'s. We next extend this construction into the distribution-free setting by using the training sample to estimate these quantities.

A distribution-free algorithm
We are now ready to present our distribution-free algorithm. For now, assume again that we are given a fixed partition R^d = X_1 ∪ · · · ∪ X_M (we will allow for a data-dependent partition later on). After observing a sample (X_1, Y_1), ..., (X_n, Y_n) ∈ R^d × {0, 1}, we first estimate the probability p_{P,m} in each region by the empirical proportion p̂_m, and estimate the corresponding label probability π_{P,m} by the within-region label frequency π̂_m, or set π̂_m = 1/2 if p̂_m = 0. In order to ensure distribution-free coverage we will need to work with slightly more conservative estimates. Let p̃_m be a conservative adjustment of p̂_m, defined in (9) and chosen to ensure that p_{P,m} ≤ p̃_m with high probability, and let

π̃_m = min{ 1/2, π̂_m + π̂_m · 2 log(4Mn/α)/(n p̂_m) } (10)

when π̂_m ≤ 1/2, with the analogous definition (reflected around 1/2) when π̂_m > 1/2. This definition of π̃_m is designed to pull the estimate π̂_m closer to 1/2 (since 1/2 is the most challenging label probability for coverage, this is therefore a more conservative estimate of π_{P,m}).
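The following sketch conveys the kind of conservative adjustment described above: plug-in estimates per region, an inflated p̃_m, and a π̃_m pulled toward 1/2 by a generic √(log/count)-style buffer. The exact buffers and constants in (9)-(10) may differ; this code is only meant to illustrate the idea.

```python
import numpy as np

def conservative_region_estimates(X, Y, assign_region, M, alpha=0.1):
    """Conservative per-region estimates in the spirit of (9)-(10) (illustrative only).

    The empirical p_hat, pi_hat are standard plug-in quantities; the buffer used
    to inflate p_hat and to shrink pi_hat toward 1/2 is an assumption of this
    sketch, not the exact adjustment used in the paper's construction.
    """
    m_idx = np.array([assign_region(x) for x in X])
    n = len(Y)
    p_hat, pi_hat = np.zeros(M), np.full(M, 0.5)
    for m in range(M):
        in_m = (m_idx == m)
        p_hat[m] = in_m.mean()
        if in_m.any():
            pi_hat[m] = Y[in_m].mean()
    buf = np.sqrt(2 * np.log(4 * M * n / alpha) / np.maximum(n * p_hat, 1.0))
    p_tilde = np.minimum(p_hat + buf, 1.0)              # inflate so p_{P,m} <= p_tilde w.h.p.
    pi_tilde = np.where(pi_hat <= 0.5,
                        np.minimum(0.5, pi_hat + buf),   # pull toward 1/2 from below
                        np.maximum(0.5, pi_hat - buf))   # pull toward 1/2 from above
    return p_hat, pi_hat, p_tilde, pi_tilde
```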
From this point on, we use the same construction (5) as for the "oracle" interval C*_P above, but with the conservative empirical estimates p̃_m and π̃_m in place of the unknown true quantities p_{P,m} and π_{P,m}. Define ã as in (11), analogously to (7) but with the conservative estimates in place of p_{P,m} and π_{P,m}, and define

Ĉ_n(X_{n+1}) = C_{π̃,ã}(X_{n+1}), (12)

where we write π̃ = (π̃_1, ..., π̃_M). We now prove that this construction offers a distribution-free confidence interval, and establish an upper bound on its expected length.
Theorem 3. Let n ≥ 2. The confidence interval Ĉ_n constructed in (12) provides a (1 − α)-distribution-free confidence interval for binary regression (3). Furthermore, for all distributions P on R^d × {0, 1}, the expected length of Ĉ_n(X_{n+1}) is bounded by L_α(Π_P) + 2∆_P(X_{1:M})/α plus a remainder term, involving a universal constant c, that vanishes as n → ∞ for any fixed M and α.
The last term vanishes as n → ∞, and thus the blur ∆_P(X_{1:M}) determines the dominant term in the excess length of our constructed interval Ĉ_n, as compared to the lower bound L_α(Π_P) established in Theorem 2.

Sample splitting for a data-dependent partition
Thus far we have assumed that the partition R^d = X_1 ∪ · · · ∪ X_M is fixed (and has low blur ∆_P(X_{1:M}), in order for the upper bound to be meaningful). Of course, in practice, the partition itself would need to be constructed using the data. To do so, we follow a sample splitting strategy: we will use the first half of the training data to estimate the function π_P(x) and define the partition accordingly, and the second half of the training data to construct Ĉ_n based on this partition. Our construction can be paired with any regression algorithm R that maps a training data set of size n to an estimate π̂^R_n(x) of the function π_P(x); for example, we might take R to be logistic regression or k-nearest neighbors. We define the expected error of this regression algorithm as

∆_{n,P}(R) = E[ |π̂^R_n(X_{n+1}) − π_P(X_{n+1})| ],

where the expected value is taken over (X_1, Y_1), ..., (X_n, Y_n) iid∼ P and an independent test point X_{n+1} ∼ P_X. We mention a few special cases:
• In a well-specified parametric model (e.g., Y | X follows a logistic model), we typically have ∆_{n,P}(R) ∼ √(d/n) in a low-dimensional (d < n) setting, or ∆_{n,P}(R) ∼ √((k log d)/n) in a high-dimensional sparse setting, e.g., running ℓ_1-penalized logistic regression when the true model is k-sparse [Negahban et al., 2012].
• In a nonparametric setting, if x → π_P(x) is assumed to be Lipschitz continuous, a k-nearest neighbors (k-NN) method yields ∆_{n,P}(R) ∼ n^{−1/(2+d)}, or ∆_{n,P}(R) ∼ n^{−2/(4+d)} if we make the stronger assumption that x → π_P(x) is smooth (see, e.g., Györfi et al. [2006]). In the special case where the data is supported on a d_0-dimensional manifold in R^d for some d_0 < d, the bound may hold with d_0 in place of d, yielding a faster convergence rate (see, e.g., Jiang [2019] for finite-sample results).
Now we split the sample to construct our distribution-free confidence interval. First, define π̂^R_{n/2}, fitted on data points i = 1, ..., n/2. Fix M = n/log n, and define the partition by binning the estimated probabilities,

X̂^R_m = { x ∈ R^d : π̂^R_{n/2}(x) ∈ ((m − 1)/M, m/M] }, m = 1, ..., M. (13)

With this partition in place, we then define Ĉ_n exactly as before, except that we restrict our sample to the remaining data points i = n/2 + 1, ..., n. Specifically, let p̂_m and π̂_m be the empirical region probabilities and within-region label frequencies computed from these data points, or set π̂_m = 1/2 if p̂_m = 0. Define the p̃_m's and π̃_m's exactly as in (9) and (10), except with n/2 in place of n. We then define ã as in (11) as before, and set Ĉ^R_n(X_{n+1}) = C_{π̃,ã}(X_{n+1}).
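As an illustration of the sample-splitting step, the sketch below fits a probability estimate on the first half of the data and partitions R^d by binning the estimated probabilities; logistic regression and equal-width bins are illustrative stand-ins for the generic algorithm R and the partitioning rule, and the helper name split_and_partition is an assumption of this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def split_and_partition(X, Y, M):
    """Fit a label-probability estimate on the first half of the data, then
    partition the feature space by binning estimated probabilities into M bins."""
    n = len(Y)
    half = n // 2
    model = LogisticRegression().fit(X[:half], Y[:half])

    def assign_region(x):
        # Region index m(x) in {0, ..., M-1}, determined by the estimated probability.
        p = model.predict_proba(np.asarray(x).reshape(1, -1))[0, 1]
        return min(int(p * M), M - 1)

    # The second half of the data is reserved for building the interval itself.
    return assign_region, X[half:], Y[half:]
```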
Corollary 1. Let n ≥ 3. The interval Ĉ^R_n constructed via data splitting with regression algorithm R (13) provides a (1 − α)-distribution-free confidence interval for binary regression (3). Furthermore, for all distributions P on R^d × {0, 1}, the expected length of Ĉ^R_n(X_{n+1}) is bounded by L_α(Π_P) plus excess terms, involving the regression error ∆_{n/2,P}(R), the number of regions M, and a universal constant c, that vanish as n → ∞ whenever π_P(X) can be estimated consistently.
This corollary will follow directly from Theorem 3 by observing that, due to the construction of the random partition X̂^R_{1:M}, we have E[∆_P(X̂^R_{1:M})] ≤ 2∆_{n/2,P}(R) + 1/M.

Discussion
The lower bounds established in this paper prove that, in the distribution-free setting, parameter estimation is fundamentally as imprecise as prediction: confidence intervals for estimating the label probabilities π_P(X) = P{Y = 1 | X} have a lower bound on their length that does not vanish even as the sample size n → ∞. Unlike the classical literature, where these types of results are established for pointwise coverage (i.e., coverage of π_P(x) for all x), our new results prove that this fundamental lower bound holds even when we require coverage to hold only on average over a new point X drawn from the distribution, and we provide an exact calculation of the minimum possible length. These lower bounds imply that, if we wish to maintain the versatility of the distribution-free setting (i.e., avoiding smoothness assumptions), we can only obtain meaningful confidence intervals by substantially relaxing our notion of coverage. In future work, we hope to examine alternatives, for instance, coverage of a surrogate function approximating π_P(X), as in the work of Genovese and Wasserman [2008] for confidence bands in nonparametric regression. We may also ask whether the results here, proved in the setting of binary regression where Y ∈ {0, 1}, may be extended to a more general regression setting; that is, whether it is possible to estimate µ_P(X) = E[Y | X] under a joint distribution P over (X, Y) ∈ R^d × R. An initial exploration suggests that such an extension would not be straightforward. Specifically, if Y is unbounded, then confidence intervals for µ_P(X) must necessarily have infinite expected length. To see why, consider a contamination model where, after drawing (X, Y) ∼ P, with probability ε_n we replace Y with a corrupted value c_n. If we choose ε_n ≪ n^{−1}, we cannot distinguish between P and this contaminated model P̃ on a sample of size n; this means that C_n(X_{n+1}) must cover the mean whether the distribution is P or P̃. But, by choosing c_n so that c_n·ε_n → ∞, the means µ_P(X) and µ_{P̃}(X) are arbitrarily far apart, leading to arbitrarily large length of C_n(X_{n+1}). Therefore we cannot obtain nontrivial results for unbounded Y. If we instead assume that Y is bounded, on the other hand, then the lower bounds established for the binary case no longer apply (for example, if Y = 0.5 almost surely, a distribution-free confidence interval could potentially have vanishing length even if it is constructed only assuming Y ∈ [0, 1]). It is therefore not clear whether there are settings for the general regression problem where we may obtain meaningful upper and lower bounds for distribution-free confidence intervals on the conditional mean µ_P(X), and we leave these questions for future work.
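To make the contamination heuristic explicit, the display below is a sketch: the mixture form of the contaminated mean and the total-variation bound are standard facts, recorded here for clarity rather than taken from any formal result in this paper.

```latex
% Mean under the contaminated distribution \tilde P (replace Y by c_n with prob. \epsilon_n):
\[
\mu_{\tilde P}(x) \;=\; (1-\epsilon_n)\,\mu_P(x) + \epsilon_n c_n ,
\qquad\text{so}\qquad
\mu_{\tilde P}(x) - \mu_P(x) \;=\; \epsilon_n\bigl(c_n - \mu_P(x)\bigr) \;\to\; \infty
\ \text{ when } \epsilon_n c_n \to \infty .
\]
% Indistinguishability of the two models from n samples:
\[
d_{\mathrm{TV}}\bigl(P^{\otimes n},\, \tilde P^{\otimes n}\bigr)
\;\le\; n\, d_{\mathrm{TV}}(P, \tilde P)
\;\le\; n\,\epsilon_n \;\to\; 0
\quad\text{when } \epsilon_n \ll n^{-1}.
\]
```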
A.1 Proof of Lemma 1

The proof of this lemma follows a similar strategy as in Barber et al. [2019a, Lemma 3], and is a generalization of the construction used in the proof of Vovk et al. [2005, Proposition 5.1]. First we embed the variables in the lemma into a distribution on triples (X, Y, Z). Writing P̃_{X,Z} to denote the joint distribution of (X_{n+1}, Z_{n+1}), we define P̃ as the joint distribution of triples obtained by drawing (X, Z) ∼ P̃_{X,Z} and then Y | X, Z ∼ Bernoulli(Z). Note that, marginalizing over Z, the pair (X, Y) follows distribution P. In other words, the joint distribution of (X, Y, Z) ∼ P̃ matches the distribution specified in the lemma. Therefore, it is equivalent to prove the bound

P{Z_{n+1} ∈ C_n(X_{n+1})} ≥ 1 − α (14)

for (X_i, Y_i, Z_i) iid∼ P̃. Next fix any integer M ≥ n + 1, and let (X^{(m)}, Z^{(m)}) iid∼ P̃_{X,Z} for m = 1, ..., M. Let L specify this sequence of M pairs. Next, fixing L, we draw {(X_i, Y_i, Z_i)}_{i=1,...,n+1} as follows (this defines the sampling scheme (15)):
• Sample m_1, ..., m_{n+1} uniformly without replacement from {1, ..., M},
• Set (X_i, Z_i) = (X^{(m_i)}, Z^{(m_i)}) for each i = 1, ..., n + 1,
• Draw Y_i ∼ Bernoulli(Z_i) for each i = 1, ..., n + 1.
Then clearly, after marginalizing over L, the triples (X_i, Y_i, Z_i) are drawn i.i.d. from P̃; in other words, drawing the data in this way is equivalent to drawing (X_i, Y_i, Z_i) iid∼ P̃ (this equivalence is (16)). Now consider any sequence L = {(X^{(m)}, Z^{(m)})}_{m=1,...,M}, and let Q_L be the distribution on (X, Y, Z) defined by sampling (X, Z) uniformly at random from L, then drawing Y ∼ Bernoulli(Z). We can consider drawing (X_i, Y_i, Z_i) iid∼ Q_L for i = 1, ..., n + 1, which is equivalent to modifying the scheme above so that m_1, ..., m_{n+1} are sampled uniformly with replacement from {1, ..., M} (this modified scheme is (17)). A simple calculation shows that, when m_1, ..., m_{n+1} are sampled uniformly with replacement from {1, ..., M}, the probability of the event {m_i = m_j for some i ≠ j} is bounded by n^2/M. Therefore, for any fixed L, the total variation distance between the two sampling schemes (15) and (17) is at most n^2/M; we record the resulting comparison of probabilities as (18). Next we calculate π_{Q_L}(x), i.e., the label probability under Q_L. If X^{(1)}, ..., X^{(M)} are all distinct for this L, then by definition of Q_L, we can see that π_{Q_L}(X^{(m)}) = Z^{(m)} for each m = 1, ..., M, and therefore

P_{(X_i,Y_i,Z_i) iid∼ Q_L}{Z_{n+1} ∈ C_n(X_{n+1})} = P_{(X_i,Y_i,Z_i) iid∼ Q_L}{π_{Q_L}(X_{n+1}) ∈ C_n(X_{n+1})} ≥ 1 − α,

where the last step holds since C_n is assumed to be an algorithm that provides a (1 − α)-distribution-free confidence interval for any distribution (3), and in particular must provide coverage under Q_L. In other words, for any L whose points X^{(1)}, ..., X^{(M)} are distinct, we have proved the coverage bound above. Combining this bound with (16) and (18) establishes that, for any such L,

P{Z_{n+1} ∈ C_n(X_{n+1}) | L} ≥ 1 − α − n^2/M.

Since P is assumed to be nonatomic, the marginal P_X is nonatomic as well, and so the X^{(m)}'s are distinct with probability 1. Therefore, P{Z_{n+1} ∈ C_n(X_{n+1})} ≥ 1 − α − n^2/M. Finally, since M can be taken to be arbitrarily large, this establishes the desired bound (14), and thus completes the proof of the lemma.

A.2 Proof of Lemma 2
We will prove a stronger form of Lemma 2. We define a set of distributions on [0, 1], Q = Q^{(0)} ∪ Q^{(1)}, where each distribution is a mixture of a point mass and a uniform distribution: Q^{(0)} consists of mixtures of the form p·δ_0 + (1 − p)·Unif[0, c], and Q^{(1)} consists of the reflections of these distributions around 1/2. (19)

Lemma 5. For any t, a ∈ [0, 1], define F^+_{t,a} as the set of functions f : [0, 1] → [0, 1] with E[f(Z)] ≥ 1 − a for every random variable Z satisfying either 0 ≤ E[Z] ≤ t ≤ 1/2 or 1/2 ≤ t ≤ E[Z] ≤ 1, define F^*_{t,a} analogously with a weaker requirement, and let F_{t,a} be defined as in the statement of Lemma 2. Then it holds that the infima of ∫_0^1 f(s) ds over F^+_{t,a}, F_{t,a}, and F^*_{t,a} are all equal to ℓ(t, a).

Proof of Lemma 5. Since F^+_{t,a} ⊆ F_{t,a} ⊆ F^*_{t,a}, it clearly holds that

inf_{f ∈ F^*_{t,a}} ∫_0^1 f(s) ds ≤ inf_{f ∈ F_{t,a}} ∫_0^1 f(s) ds ≤ inf_{f ∈ F^+_{t,a}} ∫_0^1 f(s) ds.

We now need to prove two remaining inequalities to establish the lemma:

inf_{f ∈ F^*_{t,a}} ∫_0^1 f(s) ds ≥ ℓ(t, a) (20)

and

inf_{f ∈ F^+_{t,a}} ∫_0^1 f(s) ds ≤ ℓ(t, a). (21)

First we prove (20). Fix any t, a ∈ [0, 1] and any f ∈ F^*_{t,a}. We split into cases:
• If t = 0, then ℓ(t, a) = 0, and the bound holds trivially.
We split into cases; by symmetry, the analogous calculations hold if t > 1/2. Therefore we have established that (22) holds in all cases. Next we check that f_{t,a} ∈ F^+_{t,a}. Let Z be any random variable satisfying either 0 ≤ E[Z] ≤ t ≤ 1/2 or 1/2 ≤ t ≤ E[Z] ≤ 1. We again split into cases.
• If 0 < t ≤ 1/2 ≤ a, then E[Z] ≤ t, and the required bound follows.
• By symmetry, the analogous calculations hold if t > 1/2.
Therefore, the bound E[f_{t,a}(Z)] ≥ 1 − a holds in all cases, and so f_{t,a} ∈ F^+_{t,a} by definition. Therefore,

inf_{f ∈ F^+_{t,a}} ∫_0^1 f(s) ds ≤ ∫_0^1 f_{t,a}(s) ds = ℓ(t, a),

and so (21) holds.

A.3 Proof of Theorem 2
Recall the set of distributions Q defined in (19). Define a function g : [0, 1] × [0, 1] → [0, 1] as g(t, z) = P{z ∈ C_n(X_{n+1}) | π_P(X_{n+1}) = t}, and define another function h on [0, 1] × Q as h(t, Q) = E_{Z∼Q}[g(t, Z)]. For each t ∈ [0, 1], let Q_t denote the set of distributions Q ∈ Q with mean t. By a measurable selection argument, the map t → a(t) = sup_{Q∈Q_t} h(t, Q) is measurable, and furthermore there exists a measurable function t → Q_t ∈ Q_t such that h(t, Q_t) = a(t) = sup_{Q∈Q_t} h(t, Q) for all t ∈ [0, 1], as long as we verify the following conditions:
• Q is a separable metrizable space. To verify this condition, we will use the total variation distance as our metric on Q. Define Q^{(0)}_* and Q^{(1)}_* as the subsets of Q^{(0)} and Q^{(1)} whose mixture parameters lie in Q, the set of rational numbers. Then Q^{(0)}_* ∪ Q^{(1)}_* is a countable set, and is a dense subset of Q (under the total variation distance). Therefore, Q is separable.
• Q_t is compact for all t. To prove this, first consider the case t ≤ 1/2. If t = 0 then Q_t ∩ Q^{(0)} = {δ_0}, which is trivially compact. Now consider 0 < t ≤ 1/2. Writing Q_{p,c} = p·δ_0 + (1 − p)·Unif[0, c], fix any Q_{p,c}, Q_{p',c'} ∈ Q_t ∩ Q^{(0)}. We must have c, c' ≥ 2t and p, p' ≤ 1 − 2t in order to obtain expected value t, and we can calculate that this implies a bound on the total variation distance between Q_{p,c} and Q_{p',c'} in terms of |p − p'| and |c − c'|. This proves that, on Q_t ∩ Q^{(0)}, the topology induced by the total variation distance is the same as the topology induced by the Euclidean distance on (p, c) ∈ [0, 1]^2. Therefore, since Q_t ∩ Q^{(0)} = {Q_{p,c} : (1 − p)·c/2 = t} corresponds to a closed subset of (p, c) ∈ [0, 1]^2, this set is compact in the case t ≤ 1/2. If instead t > 1/2, then Q_t ∩ Q^{(0)} = ∅, which is trivially compact. An analogous argument shows that Q_t ∩ Q^{(1)} is compact. Therefore, Q_t = (Q_t ∩ Q^{(0)}) ∪ (Q_t ∩ Q^{(1)}) is compact.
• t → h(t, Q) is measurable for all Q, which holds since t → g(t, z) is measurable by definition of the conditional expectation, and therefore t → h(t, Q) = E_Q[g(t, Z)] is measurable as well.
Since this holds for all t ∈ [0, 1], we therefore obtain the corresponding bound after marginalizing over t. Finally, applying Fubini's theorem completes the proof.

B.1 Proof of Lemma 3

Next, for each m, in the proof of Lemma 5 we established that f_{t_m,a_m} ∈ F^+_{t_m,a_m}. By definition of this set, this implies that E[f_{t_m,a_m}(π_P(X_{n+1})) | X_{n+1} ∈ X_m] ≥ 1 − a_m, since E[π_P(X_{n+1}) | X_{n+1} ∈ X_m] = π_{P,m}, and π_{P,m} satisfies the assumption (6). Therefore, the claimed coverage bound holds, as desired.

B.2 Proof of Lemma 4
First, the coverage statement follows immediately from Lemma 3, since Σ_{m=1}^M p_{P,m} a*_{P,m} ≤ α by definition of a*_P. Now we consider the expected length. For each x define δ(x) = π_P(x) − π_{P,m(x)}.

If the excess-length term in the bound of the lemma exceeds 1, then the result of the lemma holds trivially, since leb(C*_P(X_{n+1})) ≤ leb([0, 1]) = 1 always. Therefore we can restrict our attention to the case that this term lies in [0, 1]. Now fix any function a : [0, 1] → [0, 1] with E_{T∼Π_P}[a(T)] ≤ α. We can calculate that a corresponding vector a• = (a•_1, ..., a•_M) satisfies the constraint Σ_{m=1}^M p_{P,m} a•_m ≤ α. Therefore, a• is feasible for the optimization problem (7), and so by optimality of a*_P, we must have Σ_{m=1}^M p_{P,m} ℓ(π_{P,m}, a*_{P,m}) ≤ Σ_{m=1}^M p_{P,m} ℓ(π_{P,m}, a•_m).

B.3 Proof of Theorem 3
Before proving the theorem, we state two supporting lemmas. The first is a basic property of the function ℓ(t, a).
The second is a simple consequence of the Chernoff bound on the Binomial distribution.
Lemma 7 verifies that P{E_1} ≥ 1 − α/n for any distribution P. Recall that Ĉ_n(X_{n+1}) = C_{π̃,ã}(X_{n+1}) depends on the training data only through π̃, ã, which themselves are functions of the p̂_m's and π̂_m's. By Lemma 3, on the event E_1, the conditional noncoverage probability of Ĉ_n(X_{n+1}) is at most Σ_{m=1}^M p_{P,m} ã_m ≤ Σ_{m=1}^M p̃_m ã_m ≤ α − α/n, where the last step holds by definition of ã. Therefore,

P{π_P(X_{n+1}) ∉ Ĉ_n(X_{n+1})} ≤ P{π_P(X_{n+1}) ∉ Ĉ_n(X_{n+1}) | E_1} + α/n ≤ α,

which verifies the distribution-free coverage guarantee.
as long as this quantity is at most 1. Furthermore, the assumption that it is at most 1 ensures that log(4Mn/α) ≤ 2 log n, and so plugging in its definition, we have proved that

E[leb(Ĉ_n(X_{n+1}))] ≤ L_α(Π_P) + 2∆_P(X_{1:M})/α + c'·√(M log n/(αn))

for a sufficiently large universal constant c'. If instead this quantity exceeds 1, then we have

E[leb(Ĉ_n(X_{n+1}))] ≤ 1 ≤ c''·√(M log n/(αn))

for a sufficiently large universal constant c''. Taking c = max{c', c''}, we have completed the proof of the theorem.

B.4 Proof of Corollary 1
To help with the proof, we begin by adapting our previous notation to the setting of a data-dependent partition. The partition X̂^R_{1:M} is a function of the first half of the data, i.e., data points {(X_i, Y_i)}_{i=1,...,n/2}. Conditioning on these data points, we define p_{P,m} and π_{P,m} for the regions of X̂^R_{1:M} exactly as before, where we should interpret this to mean that the partition X̂^R_{1:M} is treated as fixed, and the probability and expectation are calculated with respect to an independent draw X ∼ P_X. Similarly, we will write

∆_P(X̂^R_{1:M}) = E_{P_X}[ |π_P(X) − π_{P,m(X)}| | {(X_i, Y_i)}_{i=1,...,n/2} ].
These quantities are now functions of the data points {(X_i, Y_i)}_{i=1,...,n/2}. Next we apply Theorem 3. Specifically, we will condition on the data points {(X_i, Y_i)}_{i=1,...,n/2} used to choose the partition so that the partition can be treated as fixed, and will apply Theorem 3 with n/2 ≥ 2 in place of n (i.e., we apply the theorem to data points i = n/2 + 1, ..., n in place of i = 1, ..., n). This proves that

E[ leb(Ĉ^R_n(X_{n+1})) | {(X_i, Y_i)}_{i=1,...,n/2} ] ≤ L_α(Π_P) + 2∆_P(X̂^R_{1:M})/α + c·√( M log(n/2) / (α·(n/2)) ).