The bias of isotonic regression

We study the bias of the isotonic regression estimator. While there is extensive work characterizing the mean squared error of the isotonic regression estimator, relatively little is known about the bias. In this paper, we provide a sharp characterization, proving that the bias scales as $O(n^{-\beta/3})$ up to log factors, where $1 \leq \beta \leq 2$ is the exponent corresponding to H{\"o}lder smoothness of the underlying mean. Importantly, this result only requires a strictly monotone mean and that the noise distribution has subexponential tails, without relying on symmetric noise or other restrictive assumptions.


Introduction
Isotonic regression involves solving a regression problem under monotonicity constraints. In particular, suppose that we observe data Y ∈ R n with E [Y ] = µ, where the mean µ satisfies a monotonicity constraint, µ 1 ≤ · · · ≤ µ n .
To recover the mean vector µ, the least-squares isotonic regression estimator is given by where y → iso(y) is the projection to the monotonicity constraint, defined on y ∈ R n as iso(y) = arg min x∈R n y − x 2 2 : x 1 ≤ · · · ≤ x n .
Since this constraint defines a convex region in R n (in fact, a convex cone), the resulting optimization problem is convex, and can be solved efficiently with the "pool adjacent violators" (PAVA) algorithm [Barlow et al., 1972].
There is a large literature on isotonic regression dating back to the work of Bartholomew [1959a,b], Brunk [1955], Miles [1959]; for further background see, e.g., Barlow et al. [1972], Robertson et al. [1988]. Isotonic regression also appears in the problem of estimating a monotone density estimation, namely, in the Grenander estimator [Grenander, 1956].
Recently, there has been renewed interest in isotonic regression as one of the most widely-used examples of regression under shape constraints-for example, the work of de Leeuw et al. [2009], Chatterjee et al. [2015], Gao et al. [2017], , Guntuboyina and Sen [2018], Banerjee et al. [2019]. Given the significance of isotonic regression, understanding its statistical properties is of fundamental importance. In this paper, we provide a sharp characterization of the bias of the isotonic regression estimator.

Bias and variance
It is well known that the estimation error µ − µ of isotonic regression decays at a slower rate than for parametric regression, generally with a dependence on n that scales as | µ i − µ i | n −1/3 , rather than the rate n −1/2 that we would expect for parametric problems.
To make this more precise, suppose that the mean vector µ is Lipschitz, satisfying |µ i − µ i+1 | ≤ L 1 /n, and that the noise Z = Y − µ has independent entries Z i with E [Z i ] = 0 and with each Z i assumed to be λ-subgaussian, that is, E e tZ i ≤ e t 2 λ 2 /2 for any t ∈ R. In this setting, many works including Brunk [1970], Cator [2011], Chatterjee et al. [2015] establish the n −1/3 bound on the root-mean-square error of isotonic regression; this scaling can be proved to hold pointwise at each index i (an analogous n −1/3 rate is established by Durot et al. [2012] for the Grenander estimator of a monotone density). For instance, Yang and Barber [2018] prove that, for any δ > 0. where . Furthermore, Chatterjee et al. [2015] establish that this error scaling attains the minimax rate, meaning that we cannot improve on the n −1/3 error rate obtained by fitting µ via least-squares isotonic regression. However, in certain special cases, such as when the mean vector µ is piecewise constant, the error of µ achieves the parametric n −1/2 rate. While these results summarized above give a fairly complete picture of the tail behavior of the error µ − µ in isotonic regression, relatively little is known about its expected value E [ µ] − µ, i.e., the bias of the least-squares estimator. Banerjee et al. [2019] establish an asymptotic result under certain smoothness and strict monotonicity conditions on µ, proving that the bias is at most In the special case where the variance of the noise is constant across the data, with Var (Z i ) = σ 2 for all i, Durot [2002] prove an improved bound of However, as Banerjee et al. [2019] point out, in many settings we would prefer to avoid the assumption of constant variance-for example, if the data is binary with Y i ∼ Bernoulli(µ i ), the variance of the Z i 's will certainly be nonconstant. Furthermore, in the prior literature, it is not clear if n −1/2 is the correct scaling for the bias (whether in the constant-variance or general case), or if this scaling might be possible to improve.

Contributions
In this work, we examine the question of bias more closely, and prove that with no assumption of constant variance, when the underlying mean µ is Lipschitz, strictly increasing, and smooth. This is a faster scaling than what was previously known for both constant and non-constant variance settings. We furthermore establish weaker bounds when µ satisfies only Hölder smoothness, with when µ is Hölder smooth with exponent β, for some 1 ≤ β < 2. (β = 2 corresponds to the smooth case.) Matching lower bounds show that, up to log factors, our results are tight.
In particular, if we only assume µ to be Lipschitz but without smoothness, this corresponds to Hölder smoothness with β = 1. In this case, the bias is lower bounded as n −1/3 up to log factors. Since it is known that the error | µ i − µ i | is bounded on the order of n −1/3 (up to logs) with high probability, we see that without a smoothness assumption, it may be the case that the bias is on the same order as the error itself-that is, bias does not vanish relative to variance. At the other extreme, under smoothness (β = 2), error scales as n −1/3 while bias is bounded as n −2/3 (up to logs), meaning that the bias is vanishing.

Main results
We begin with some definitions and conditions on the distribution of the data. We assume that the mean vector µ is Lipschitz, for some L 1 , and is monotonic, for some L 0 ≥ 0 (note that L 0 > 0 corresponds to a strictly increasing condition). The significance of the strictly increasing assumption is illustrated in Theorem 3. In many settings, we might think of the mean values µ i as evaluations of an underlying function at a sequence of values-for example an evenly spaced grid, i.e., µ i = f (i/n) for i = 1, . . . , n. In this case, the constants L 0 , L 1 are lower and upper bounds on the gradient of the underlying function f . Next, we will also assume that µ is Hölder smooth, satisfying for some M and some exponent β with 1 ≤ β ≤ 2. As before, if µ i = f (i/n) for some underlying function f defining the signal, then (β, M )-Hölder smoothness of the function f , defined as the property that is sufficient to ensure that the vector of mean values µ satisfies this assumption. The exponent β in assumption (7) controls the smoothness, with β = 2 corresponding to a bounded second derivative (or a Lipschitz gradient) while β < 2 denotes a weaker smoothness assumption. In particular, if we were to take β = 1, then this assumption does not in fact imply any smoothness, as it is trivially satisfied with M = L 1 for any signal µ that is L 1 -Lipschitz as in our condition (5).
Next we turn to our assumptions on the noise Z. We assume independent noise with subexponential tails: and E e tZ i ≤ e t 2 λ 2 /2 for all |t| ≤ τ and all i = 1, . . . , n. (9) We will also require that the variances σ 2 i are bounded from below and are Lipschitz along the sequence i = 1, . . . , n, satisfying (Note that we must have σ 2 i ≤ λ 2 by the subexponential tails assumption (9).)

An upper bound for smooth signals
We now present our upper bound on the bias of isotonic regression.
We remark that in the case of Gaussian noise, the upper bound can be tightened to C log n n β/3 for any β ∈ [1, 2], i.e., the extra power of the log term for the case β = 2 no longer appears (in the non-Gaussian case, it is likely an artifact of the proof).
To better understand the result of this theorem, consider the extreme case where we take β = 1 (which, as mentioned above, simply reduces to the Lipschitz assumption). In this case, the upper bound of Theorem 1 results in the bias bound which is, up to a constant, the same as the bound (2) proved by Yang and Barber [2018] to hold with high probability on the maximum entrywise error µ i − µ i . In other words, this suggests that the bias may be as large as the (square root) variance.
On the other hand, if β = 2, then the bias scales as (polylog n/n) 2/3 while the high probability bound on the error is still (log n/n) 1/3 -the bias is vanishing relative to the error of any one draw of the data.

Simulation
To explore this scaling, we conduct a simple simulation 1 to compare the case β = 2 with β = 1. For these two cases, Theorem 1 establishes that, at each index i (bounded away from the endpoints 1 and n), bias is bounded as n −2/3 and n −1/3 , respectively, up to log factors. We will see that this scaling is achieved by our simple examples.
The two mean vectors, µ s for the smooth case and µ ns for the non-smooth case, are illustrated in the top two panels of Figure 1. The smooth mean is defined with a sine wave function, while the non-smooth mean is a piecewise linear "hinge" shape: µ s i = (i/n) + sin(4π · i/n)/16, µ ns i = 0.1(i/n), i ≤ n/2, 1.9(i/n) − 0.9, i > n/2, For both the smooth and non-smooth mean, writing µ to denote either µ s or µ ns as appropriate, we generate a signal Y = µ + N (0, σ 2 I n ) for σ = 0.1, and compute iso(Y ). Our final result is the magnitude of the bias at the midpoint for the non-smooth case, or averaged over several points for the smooth case, where E [iso(Y )] is estimated by averaging 500,000 trials. We run the same procedure for each n = 10000, 12000, . . . , 20000. Our simulation results, shown in the bottom panels of Figure 1, verify that the bias at the selected points indeed appears to scale as n −2/3 for the smooth case, and as n −1/3 for the non-smooth case. We next turn to the question of establishing these lower bounds theoretically.

A matching lower bound for smoothness
We next show that, as suggested by our simulations, our dependence on the smoothness assumption is tight-up to constants and log factors, we cannot improve our dependence on the Hölder exponent β (i.e., the power of n, n −β/3 , appearing in our upper bound, Theorem 1). While our simulated example only verifies this at a constant number of indices i, here we show a stronger result-the lower bound is attained by a constant fraction of the indices i = 1, . . . , n.
Note that this noise distribution, i.e. Z i iid ∼ N (0, σ 2 ), trivially satisfies the assumptions (9) and (10) that we require in our upper bound. In other words, the lower bound result shows that our upper bound is tight for all 1 ≤ β ≤ 2, up to log factors.

Necessity of the strictly increasing mean
Finally, we study the role of the strictly increasing assumption for the mean µ (6). Wright [1981] studied this problem in a different context, considering a signal µ i = f (i/n) where the function f is monotone nondecreasing and satisfies at some point t 0 ∈ (0, 1), for some α ≥ 1. For example, if α is an integer, this is satisfied if f is required to have f (k) (t 0 ) = 0 for all k = 1, . . . , α − 1, and f (α) (t 0 ) > 0, where f (k) denotes the kth derivative of f . In this setting, writing t 0 = i 0 /n, Wright [1981, Theorem 1] establishes that the rescaled error n α/(2α+1) µ i 0 − µ i 0 converges to some fixed distribution. In other words, the error µ i 0 − µ i 0 is on the scale n −α/(2α+1) .
If the function f is twice differentiable, the Hölder smoothness condition (8) is equivalent to requiring that |f (t)| is bounded (for all t), while Wright's condition (11) instead requires that |f (t 0 )| is bounded and additionally f (t 0 ) = 0. In other words, the Hölder condition requires that f is smooth, while Wright's condition requires that f is both smooth and flat.
Our next result establishes that, in the worst case, the bias of the isotonic regression estimator is on the same order up to log factors-that is, for any α ≥ 1, we construct a signal µ satisfying Wright [1981]'s condition (11) Theorem 3. Fix any parameters L 1 > 0, M > 0, and σ 2 > 0. Then for any n ≥ C, α ≥ 1, there exists some µ ∈ R n that is L 1 -Lipschitz (5), monotone nondecreasing (i.e., L 0 -increasing (6) with L 0 = 0), and is equal to µ i = f (i/n) for some function f satisfying the condition (11) at t 0 = 0.5, such that, for Y ∼ N (µ, σ 2 I n ) and µ = iso(Y ), the bias satisfies at the index i 0 = n/2, i.e., the index satisfying t 0 = i 0 /n. (Here C, C depend only on the parameters in our assumptions, and not on n.) Up to constants and log factors, this result matches the upper bound proved by Wright [1981]. We remark that, for α ≥ 2, the signal µ constructed in the proof of the theorem is (β, M )-smooth with β = 2. (For α < 2, the signal µ constructed in the proof is instead (β, M )-smooth with β = α.) Setting α = 2, for example, we obtain a lower bound on bias that scales as n −2/5 , while as α → ∞ the scaling approaches n −1/2 (up to log factors). We can compare this to our results in Theorems 1 and 2, where we establish a faster scaling n −2/3 for the worst-case bias (up to log factors) for signals µ with Hölder exponent β = 2 that are also assumed to be strictly increasing. The gap between these powers of n highlights the role of the strict monotonicity condition in the isotonic regression problem.

Properties of isotonic regression
Before we proceed with our proofs, it will be useful to recall some well-known properties of the isotonic projection, y → iso(y) (for background see, e.g., Barlow et al. [1972, Chapter 1]).
First, the individual entries of iso(y) can be expressed with the useful "min-max" formula [Barlow et al., 1972, (1.9)]: for any y ∈ R n and any index i, where y j:k = 1 k − j + 1 k =j y is the average of the subvector (y j , . . . , y k ) of y, and the max and min are attained at j = j i (y) and k = k i (y) where j i (y) = min{j : iso(y) j = iso(y) i } and k i (y) = max{k : iso(y) k = iso(y) i }, the endpoints of the constant interval of iso(y) containing index i. Second, using the min-max definition in (12), it is trivial to see that isotonic regression commutes with shifts in location and scale: for any y ∈ R n , any a ∈ R, and any b > 0, it holds that [Barlow et al., 1972, (1.14 Next, if we run isotonic regression on a subvector (y a , . . . , y b ) of y, this can only add breakpoints relative to y. Specifically, for any y ∈ R n , and any indices 1 ≤ a (Note that indices i − a + 1, i − a + 2 in the subvectorỹ correspond to indices i, i + 1 in the full vector y.) Finally, on the same subvectorỹ = (y a , . . . , If iso(y) a−1 = iso(y) i (or a = 1) and iso(y That is, truncating the sequence y at a breakpoint of iso(y) will not affect the estimated values. These last two properties (15) and (16) follow from the definition of the "pool adjacent violators" (PAVA) algorithm for computing iso(y) [Barlow et al., 1972, pg. 13-15].

Breakpoint lemma
Before proving our theorems, we first present a result bounding the probability of a breakpoint occurring at any particular location in a Gaussian isotonic regression problem. We will use this lemma to prove both our upper and lower bounds.
where C 3 depends only on C 1 , C 2 and not on m or n.
In other words, if the means µ j and variances σ 2 j are approximately constant near index i, then the probability of a breakpoint at index i is low. To gain intuition for the consequences of this lemma, consider a simple setting where the mean µ is L 1 -Lipschitz (5), and the variances are constant, σ 2 i = σ 2 > 0. In this case, the conditions (17) are satisfied for m n 2/3 /(log n) 1/3 , and the probability of a breakpoint at a given index i with m ≤ i ≤ n − m is therefore (log n) 4/3 /n 2/3 . We can then expect constant segments of iso(Y ) to have length n 2/3 /(log n) 4/3 . If instead the mean is constant with µ 1 = · · · = µ n (and the variance is again constant), then the conditions (17) are satisfied for m n; in this case we can expect iso(Y ) to have constant segments of length n/(log n).
In order to prove this result, we first consider the case that Y is standard Gaussian. We will use the following classical result: Lemma 2 (Andersen [1954]). Let W ∼ N (0 m , I m ) for any m ≥ 1. Then the number of piecewise constant segments of iso(W ), denoted by N (W ), is distributed as where I 1 , . . . , I m are independent Bernoulli random variables with P {I j = 1} = 1/j. This result allows us to prove the breakpoint lemma for a standard Gaussian: . . , m, which again has the distributionỸ ∼ N (0 m , I m ). The relationship between Y andỸ is illustrated in Figure 2. By the property (15) of isotonic regression, we know that which concludes the proof.
Indices ofỸ Figure 2: Illustration of the proof idea for Lemma 3. A break between indices i and i + 1 in the isotonic regression ofỸ , corresponds to a break between indices m and m + 1 for Y .
Finally, the proof of Lemma 1 for the general case is established by comparing the distribution of Y to the distribution of a standard Gaussian random vector, using our assumptions on the low variability in the µ j 's and σ j 's near the index i. The proof is given in Appendix A.1.

Proof of upper bound (Theorem 1)
The proof of our upper bound will follow these steps to bound the bias of iso(Y ) i : • Step 1: We replace the random vector Y with a Gaussian random vectorỸ whose entries have the same means and variances, and bound the change in the bias at index i. • Step 2: From the Gaussian random vectorỸ , we extract a subvector of length n 2/3 (log n) 1/3 centered at index i, and bound the change in the bias at index i. • Step 3: We take an approximation to the new subvector, with linearly increasing means and with constant variance, and bound the change in the bias at index i. We will also see that the new approximation has zero bias due to symmetry.
Then there exists a coupling between Y and where C 1 , C 2 depend only on L 1 , λ, τ, L σ , σ min , and not on n.
This result is proved in Appendix A.2, using the Gaussian coupling theorem of Sakhanenko [1985, Theorem 1] along with our breakpoint lemma (Lemma 1). Now defineỸ ∼ N (µ 1 , . . . , µ n ), diag{σ 2 1 , . . . , σ 2 n } . Applying Lemma 4 together with the triangle inequality, we see that for some C 1 , C 2 depending on L 1 , λ, τ, L σ , σ min . From this point on, then, it suffices to bound the first term in this upper boundthat is, we need to prove the main result, Theorem 1, in the special case that the noise terms Z i are Gaussian. From this point on, we will work withỸ = µ +Z wherẽ Z i ∼ N (0, σ 2 i ).

3.3.2
Step 2: extract a subvector and assume that m ≤ i ≤ n + 1 − m. We will see that E iso(Ỹ ) i is not changed substantially if we instead calculate the isotonic regression of Note that index m + 1 in the subvectorỸ (i) corresponds to index i in the full vector Y . Now we bound the bias of iso(Ỹ ) i in terms of the subvector. We can write Deterministically, we have and therefore, j 's are Gaussian with variances bounded by λ 2 , and the expected maximum of n χ 2 1 's is bounded by 4 log n for any n ≥ 4. Next we bound P iso(Ỹ ) i = iso(Ỹ (i) ) m+1 . By the property (16) of isotonic regression, if iso(Ỹ ) i = iso(Ỹ (i) ) m+1 then it must be the case that either iso(Ỹ ) i = iso(Ỹ ) i−m−1 or iso(Ỹ ) i = iso(Ỹ ) i+m+1 . Now, since µ is L 0 -strictly increasing (6)), we have where the last step holds by our definition of m. Applying the same reasoning to the second case (i.e., iso(Ỹ ) i = iso(Ỹ ) i+m+1 ) and combining the two cases, this yields (20) Now, we recall Yang and Barber [2018]'s result (2)-sinceỸ is L 1 -Lipschitz with λsubgaussian noise, choosing probability δ = 1/n 2 we have P max where j 0 = λn √ 5 log n L 1 2/3 . In order to apply this to bound (20), we need to ensure that j 0 ≤ i − m − 1 and i + m + 1 ≤ n + 1 − j 0 . Plugging in our definition of m, this is equivalent to Combining (21) with (20) we have P iso(Ỹ ) i = iso(Ỹ (i) ) m+1 ≤ 1/n 2 . Returning to our work above, therefore, we obtain 3.3.3 Step 3: take a linear approximation Finally, we will see that E iso(Ỹ (i) ) m+1 is not changed substantially if we replace it with a random vector that has linear means and has constant variance. Defině and finally letY This construction is illustrated in Figure 3.
We will now show that E iso(Ỹ (i) ) m+1 − iso(Y (i) ) i is small, so that we can usě Y (i) to bound the bias of iso(Ỹ ) i . Using the "min-max" formulation of isotonic regression (12), An analogous lower bound holds similarly, and so Next, since the random vectorY (i) is Gaussian with a linear mean and with constant variance, by symmetry we can see that it has zero bias at its midpoint, i.e., Therefore, by the triangle inequality, By definition ofμ and the smoothness assumption (7), we have and therefore Finally, to bound the last term, we have Since theZ j 's are independent zero-mean Gaussians, for each pair of indices j, k, we see that 1 and so since there are at most n 2 /2 many choices of j, k. Combining everything, then, Plugging in our choice of m, this simplifies to for appropriately chosen C 3 .

Combining everything
Combining our three steps, we see that the bounds (18), (23), (24) combine to prove that , log n n 2/3 , (log n) 10/3 n 2/3 , √ log n n , for C chosen appropriately as a function of all the assumption parameters. Since β ≤ 2, the dominant term is log n n β/3 for β < 2 or (log n) 10/3 n 2/3 for β = 2. Examining the assumptions (19) and (22) on the index i, we see that this holds for all i satisfying C (log n) 1/3 n 2/3 ≤ i ≤ n + 1 − C (log n) 1/3 n 2/3 , with C chosen appropriately as a function of all the assumption parameters. This completes the proof of the theorem.
We remark that, if the noise is Gaussian, then Step 1 is not needed, and so the term (log n) 10/3 n 2/3 does not appear in the upper bound. This means that the upper bound scales as log n n β/3 both for the case 1 ≤ β < 2 and the case β = 2, under Gaussian noise.

Proof of lower bound for smoothness (Theorem 2)
Fix any L 1 > L 0 ≥ 0, M > 0, β ∈ [1, 2]. Consider a linear mean vector µ lin , with entries µ lin i = a n · i n , and a mean vector µ that adds an oscillation, This construction is illustrated in Figure 4. We will specify the parameters a n , b n , c n ≥ 0 later on, but for the moment we assume that b n c n ≤ a n and C 1 ≤ c n ≤ C 2 n and C 3 √ n log n ≤ a n ≤ C 4 (c n ) 3/2 (log n) 2 √ n , where C 1 , C 2 , C 3 , C 4 will not depend on n and will be specified later on. The first condition, a n ≥ b n c n , ensures that µ is monotone nondecreasing. The second condition, essentially requiring that 1 c n n, ensures that the oscillations of ∆ are visible on the discrete points i = 1, . . . , n-that is, the sine wave completes many full cycles over the points i = 1, . . . , n, and each cycle contains many indices i. The third condition will be necessary for some calculations later on. Now let Z ∼ N 0 n , σ 2 I n for some σ 2 > 0, and define Y = µ+Z and Y lin = µ lin +Z. We will see that the nonlinear oscillations in Y cannot be fully recovered by isotonic regression-while the gap between Y and Y lin is equal to ±b n at the positive and negative peaks of the oscillation term ∆, the gap between iso(Y ) and iso(Y lin ) at these points will typically be smaller. This is because iso(Y ) i and iso(Y lin ) i are calculated via local averages (according to the min-max formula (12)), and the oscillations of ∆ are therefore smoothed out, shrinking the difference between the two vectors towards zero. The bias at or near the peaks of ∆ will therefore be of order b n , achieving a lower bound.
The remainder of the proof will follow these steps: • Step 1: We will show that µ i ≥ µ lin i + 0.8b n for at least C 5 n many indices i, where C 5 will be specified later. • Step 2: We will show that, for all indices i, • Step 3: We will show that, for all indices i, where C 6 will be specified later. • Step 4: We will show that P k i (Y lin ) ≥ i + C 6 n c n ≥ 0.5 for at least (1 − 0.5C 5 )n many indices i.
Combining Steps 2, 3, and 4, we see that 6b n for at least (1 − 0.5C 5 )n many indices i. (27) Combining this with Step 1, therefore, there must be at least 0.5C 5 n many indices i for which the bounds (26) and (27) both hold. Applying the triangle inequality, we then have for all such i. This means that, for either Y or Y lin , it holds that the bias satisfies the lower bound 0.1b n for at least 0.25C 5 n many indices i. By choosing a n , b n , c n appropriately, this will establish the lower bounds that we need.
We next give the details for the choice of a n , b n , c n to complete the proof of Theorem 2, and then return to proving the bounds in Steps 1, 2, 3, 4 above. We will choose a n = L 1 + L 0 2 , b n = min M 2 , L 1 − L 0 2 · n(log n) 5 −β/3 , c n = n 1/3 (log n) 5/3 .
First we check that the choices of a n , b n , c n satisfy the requirements (25) for the steps of the proof, which is trivial to verify. Next, it's clear that µ, µ lin are both L 1 -Lipschitz (5) and L 0 -strictly increasing (6). Finally we verify the Hölder smoothness condition (7). For µ lin this is trivial since it is a linear mean. For µ, we can write for which we have Now we check that this is bounded by M |t 0 − t 1 | β−1 . If the first term in the minimum is larger, i.e., |t 0 − t 1 | ≥ 2/c n , then by our choice of b n , c n . If instead the second term in the minimum is larger, i.e., |t 0 − t 1 | ≤ 2/c n , then by our choice of b n , c n . Therefore the function f is (β, M )-Hölder smooth, and so this property is inherited by the vector µ. This verifies that µ, µ lin each satisfy the assumptions needed for the theorem. Applying the calculations above, we see that the bias of either Y or Y lin satisfies the lower bound 0.1b n for at least 0.25C 5 n many indices i. Since b n scales as n(log n) 5 −β/3 , this proves the desired result.

Proof of Step 1
The sine wave ∆, given by ∆ i = sin(c n · i/n), has period n/c n , and we recall that we have assumed that C 1 ≤ c n ≤ C 2 n. This means that, for any c ∈ (0, 1), we can trivially show that, if n is sufficiently large, ∆ i ≥ 1 − c for at least c n many indices i ∈ {1, . . . , n} for some c > 0 that only depends on c, C 1 , C 2 and not on n or c n . Choosing c = 0.2, we then set C 5 = c to establish Step 1.

Proof of Step 2
By definition, we have for all i, and therefore, it holds deterministically that for all i (this holds since iso(y) − iso(y ) ∞ ≤ y − y ∞ for any y, y ∈ R n , by Yang and Barber [2018, Lemma 1]).

Proof of Step 3
Next, we fix any index i, and consider the min-max formulation of isotonic regression as in (12) and (13): Therefore, it suffices to show that, for an appropriately chosen C 6 , Since j i (Y ) ≤ i by definition, we can weaken this to the statement ∆ j:k ≤ 0.2 for all indices j, k ∈ {1, . . . , n} with k ≥ j + C 6 n/c n .
To see why this is true, observe that ∆ is a sine wave with period n/c n . Since the mean of the sine wave over one full cycle is zero, and the mean over any partial cycle is bounded by 1, this means that the bound above will hold for sufficiently large C 6 depending only on C 1 , C 2 and not on n or c n .

Proof of Step 4
For this step we will use the breakpoint lemma. We have , and note that m ≥ log n C 2 (C 4 ) 2/3 by our assumptions (25), which satisfies m ≥ 2 for sufficiently large n. Applying Lemma 1 combined with the property (15) of isotonic regression, we have where C 8 depends only on σ 2 . This implies that and the constant C 4 in our assumption (25) is chosen to be sufficiently small. This completes Step 4 as long as C 6 n c n ≤ m 3C 8 log n = 1 C 8 log n n a n √ log n 2/3 and 2m + m 3C 8 log n = 2 + 1 3 + C 8 log n n a n √ log n 2/3 ≤ 0.5C 5 n.
These conditions both hold for a n , c n selected as in the proofs of the two theorems above (recalling condition (25), with C 3 chosen to be sufficiently large and C 4 sufficiently small).

Proof of lower bound for strict monotonicity (Theorem 3)
Our argument for the proof of Theorem 3 is similar to that for the smoothness result, Theorem 2. Define function f 0 , f 1 : [0, 1] → [0, 1] as follows. Let c n , n > 0 be parameters that we will specify later on. First let g : [0, 1] → [0, 1] be defined as and then define and similarly Define µ ( ) i = f (i/n) for each = 0, 1 and each i = 1, . . . , n. This construction is illustrated in Figure 5.
We will see that the difference between Y (0) and Y (1) around the index i 0 = n/2 cannot be fully recovered by isotonic regression. The intuition for this phenomenon is the same as in the proof of Theorem 2; since iso(Y (0) ) i 0 and iso(Y (1) ) i 0 are calculated via local averages near i 0 (according to the min-max formula (12)), while the difference between the vectors Y (0) and Y (1) is constrained to a vanishing region (they are equal everywhere except for indices i ∈ i 0 ± n n ), isotonic regression will often shrink the difference between the two signals at the index i 0 -this will happen whenever the local average is taken over a sufficiently wide range, i.e., in the min-max formula (12), iso(Y ( ) ) i 0 = Y ( ) j:k for k − j sufficiently large. The bias at index i 0 will therefore be of order c n , achieving a lower bound.
The remainder of the proof will follow these steps: • Step 1: We will show that, deterministically, Index j=1,...,n 1 i 0 − nε n i 0 i 0 + nε n n 0 c n µ (0) µ (1) Figure 5: Illustration of the construction for the lower bound result for strict monotonicity (Theorem 3). The figure illustrates the two different signal vectors µ (0) and µ (1) used in the construction for the proof. • Step 2: We will show that • Step 3: We will show that P k i 0 (Y (1) ) ≥ i 0 + 10n n ≥ 0.5.
Combining these steps, we see that Applying the triangle inequality, we then have i 0 + c n . This means that, for either Y (0) or Y (1) , it holds that the bias at i 0 satisfies the lower bound 0.2c n . With c n as specified above, this completes the proof of Theorem 3. Now we turn to proving the bounds in Steps 1, 2, 3 above.

Proof of Step 1
By definition, we have for all i, and therefore, it holds deterministically that iso(Y (0) ) i − iso(Y (1) ) i ≤ c n for all i (this holds since iso(y) − iso(y ) ∞ ≤ y − y ∞ for any y, y ∈ R n , by Yang and Barber [2018, Lemma 1]).

Proof of Step 2
Next, we define and consider the min-max formulation of isotonic regression as in (12) and (13): Therefore, it suffices to show that Since j i 0 (Y (0) ) ≤ i 0 by definition, we can weaken this to the statement ∆ j:k ≤ 0.2 for all indices j, k ∈ {1, . . . , n} with j ≤ i 0 and k ≥ i 0 + 10n n . This is true simply because, by definition of ∆, we have ∆ i ≤ 1 for all i and ∆ i = 0 for all |i − i 0 | ≥ n n .

Proof of Step 3
For this step we will use the breakpoint lemma. We have for any m ≥ 1 and any with m ≤ ≤ n − m. Choose m = C 4 c 2 n log n which satisfies m ≥ 2 for sufficiently large n. Applying Lemma 1, we have where C 5 depends only on σ 2 , C 4 . This implies that P k i 0 (Y (1) ) < i 0 + 10n n = P iso(Y (1) ) = iso(Y (1) ) +1 for some i 0 ≤ ≤ i 0 + 10n n ≤ (10n n + 1)· C 5 log n m , as long as i 0 satisfies By definition of c n , n , m, as long as C 1 , C 2 , C 4 are chosen appropriately, we therefore have P k i 0 (Y (1) ) < i 0 + 10n n ≤ 0.5, which completes the proof of Step 3.

Discussion
In this work, we develop a sharp bound on the bias of isotonic regression in one dimension for strictly increasing signals, establishing (up to log factors) that the bias is no larger than n −2/3 for smooth signals, but may be as large as n −1/3 in the non-smooth case. Many important open questions remain, for instance, whether these bounds on the bias may be extended to the multidimensional setting, or to settings such as the Grenander estimator for a monotone density function.

A Additional proofs
A.1 Proof of breakpoint lemma (Lemma 1) In Section 3.2, we proved Lemma 3, which establishes the desired result for the case of a standard Gaussian. We now need to reduce to this case. First, we reduce to a shorter subvector of length 2m. Definẽ The property (15) of isotonic regression implies that so we now only need to bound this last probability. Next we show that, since the means µ j and variances σ 2 j are nearly constant over j = i − m + 1, . . . , i + m, we can reduce to a standard Gaussian where the means and variances are constant. We first state a trivial result comparing multivariate Gaussians, which we prove in Appendix A.3: Lemma 5. Fix any integer k ≥ 2, and any a 1 , . . . , a k ∈ R and b 1 , . . . , b k > 0. Let Then, for any c > 0 and any measurable A ⊆ R k , where C 3 depends only on c, C 1 , C 2 and not on k.
Now we apply this result toỸ . LetY ∼ N μ1 2m ,σ 2 I 2m , withμ,σ defined as in the statement of Lemma 1. Then applying Lemma 5 with k = 2m, c = 1, and with the set A defined as we see that the conditions of Lemma 5 are satisfied for some C 1 , C 2 depending only on the values C 1 , C 2 in the statement of Lemma 1. So, we have where C 3 depends only on C 1 , C 2 . Finally, consider a standard multivariate Gaussian, Z ∼ N 0 2m , I 2m . Clearly we can writeY =μ1 2m +σZ, and so iso(Y ) =μ1 2m +σ · iso(Z) by the property (14) of isotonic regression. This implies that where the last step applies Lemma 3. We have therefore proved that which completes the proof of Lemma 1 with C 3 chosen appropriately.

A.2 Proof of Gaussian coupling lemma (Lemma 4)
Our lemma is a consequence of Sakhanenko [1985]'s Gaussian coupling result: Theorem 4 (Sakhanenko [1985, Theorem 1]). Suppose that Z 1 , . . . , Z n are independent random variables with E [Z j ] = 0 and Var (Z j ) = σ 2 j , and that for some ν > 0, each Z j satisfies E ν|Z j | 3 e ν|Z j | ≤ σ 2 j . Then there exists a coupling between (Z 1 , . . . , Z n ) and (Z 1 , . . . ,Z n ) ∼ N 0, diag{σ 2 1 , . . . , σ 2 n } that satisfies where c > 0 is a universal constant. Now define Z j = Y j − µ j for j = 1, . . . , n. To apply Theorem 4, we will first work with the sequence Z i , Z i+1 , . . . , Z n instead of Z 1 , . . . , Z n , i.e. we begin at the index i. Since each Z j is (λ, τ )-subexponential, it therefore satisfies the assumption E ν|Z j | 3 e ν|Z j | ≤ σ 2 min ≤ σ 2 j when we choose ν as an appropriate function of λ, τ , and σ min . By Theorem 4, then, we have a coupling between (Z i , . . . , Z n ) and (Z i , . . . ,Z n ) ∼ N 0, diag{σ 2 i , . . . , σ 2 n } such that (where we recall that max j σ j ≤ λ by the subexponential tails assumption (9) on the original noise terms Z j ). In particular, this implies that for all δ > 0, when C is chosen appropriately as a function of λ, τ, σ min . Following an identical argument on the sequence Z i−1 , Z i−2 , . . . , Z 1 , we can construct a coupling of (Z 1 , . . . , Z i−1 ) to a Gaussian vector (Z 1 , . . . , Finally, since (Z 1 , . . . , Z i−1 ) and (Z i , . . . , Z n ) are independent, we can also take (Z 1 , . . . ,Z i−1 ) and (Z i , . . . ,Z n ) to be independent, and thus we have a coupling between Z and a Gaussian random vectorZ ∼ N 0, diag{σ 2 1 , . . . , σ 2 n } such that both (28) and (29) hold, which implies that Let ∆ = Z −Z denote the error in the coupling constructed above, andỸ = µ +Z. We therefore have Y j:k =Ỹ j:k + ∆ j:k for all indices j, k. Now, for each i, let j i (Ỹ ) = min{j : iso(Ỹ ) j = iso(Ỹ ) i } and k i (Ỹ ) = max{k : iso(Ỹ ) k = iso(Ỹ ) i } as in (13), which are the endpoints of the constant segment of the isotonic projection iso(Ỹ ) containing index i. Using the definition of isotonic regression, we calculate Similarly we can show that Therefore, for any c > 0. We can further calculate when c is chosen as an appropriate function of C. Combining what we have so far, then, We will now bound these expected values. For any ≥ 1 with i + ≤ n, it is trivial to see that Therefore, for any max ≥ 1 with i + max ≤ n, Next, we will use the breakpoint lemma (Lemma 1) to bound these probabilities. Fix m = max = n 2/3 (log n) 1/3 . We apply the lemma at the index i ( ) = + i − 1 in place of i. Since we've assumed that C 2 n 2/3 (log n) 1/3 ≤ i ≤ n − C 2 n 2/3 (log n) 1/3 , it's trivial to check that m ≤ i ( ) ≤ n − m for an appropriate choice of C 2 . Next, since the means µ j are L 1 -Lipschitz (5) and the standard deviations σ j are L σ -Lipschitz (10), we can verify that the assumptions (17) are satisfied for C 1 , C 2 depending only on L 1 , L σ (and not on n). Therefore P iso(Ỹ ) i ( ) = iso(Ỹ ) i ( )+1 ≤ C 3 (log n) 4/3 n 2/3 for all = 1, . . . , max , where C 3 depends only on C 1 , C 2 . Returning to the work above, then, C(log n) 4/3 n 2/3 + 1 max ≤ C (log n) 7/3 n 2/3 , where C is chosen appropriately as a function of C 3 . An identical argument holds for bounding E 1 i−j i (Ỹ )+1 . Combining everything, we see that E iso(Y ) i − iso(Ỹ ) i ≤ c · C · (log n) 10/3 n 2/3 + 1 n , which proves the coupling lemma for an appropriately chosen C 1 .

A.3 Proof of Lemma 5
First we define likelihoods, where C 3 will be defined below. Then P {V ∈ A} ≤ P {V ∈ B} + P {V ∈ A\B} , and P {V ∈ A\B} = x∈R k where the first inequality holds by definition of B. Therefore, we now only need to bound P {V ∈ B}. We calculate Next we will bound each term separately. For Term 1, first note that we can assume C 2 √ k log k ≤ 1 2 (because if this does not hold then k is bounded by a constant, and so the conclusion of the lemma holds trivially; i.e., by setting C 3 appropriately, the claim reduces to bounding a probability by 1). Therefore, Next, for Term 2, note that Combining the two, and using the fact that k i=1 r + i − r − i = 0 by definition ofb, this simplifies to Since log k ≤ √ k log k for all integers k ≥ 2, we thus have Turning to Term 3, since V i −a i b i iid ∼ N (0, 1), we have We can weaken this to Plugging in our definitions of C 1 , C 2 , and using k ≥ 2, we then have P Term 3 ≤ 1 + C 2 √ 2 log 2 · 2cC 1 ≥ 1 − k −c .