Nonparametric inference under a monotone hazard ratio order

The ratio of the hazard functions of two populations or two strata of a single population plays an important role in time-to-event analysis. Cox regression is commonly used to estimate the hazard ratio under the assumption that it is constant in time, which is known as the proportional hazards assumption. However, this assumption is often violated in practice, and when it is violated, the parameter estimated by Cox regression is difficult to interpret. The hazard ratio can be estimated in a nonparametric manner using smoothing, but smoothing-based estimators are sensitive to the selection of tuning parameters, and it is often difficult to perform valid inference with such estimators. In some cases, it is known that the hazard ratio function is monotone. In this article, we demonstrate that monotonicity of the hazard ratio function defines an invariant stochastic order, and we study the properties of this order. Furthermore, we introduce an estimator of the hazard ratio function under a monotonicity constraint. We demonstrate that our estimator converges in distribution to a mean-zero limit, and we use this result to construct asymptotically valid confidence intervals. Finally, we conduct numerical studies to assess the finite-sample behavior of our estimator, and we use our methods to estimate the hazard ratio of progression-free survival in pulmonary adenocarcinoma patients treated with gefitinib or carboplatin-paclitaxel.


Introduction

Background and literature review
Time-to-event data are commonplace in many scientific fields, including biomedicine, economics, and engineering. In many circumstances, interest focuses on comparing the distribution of the time it takes for some event to occur, known as the event time, in two populations. For instance, in the medical sciences, patients may be randomly assigned to treatment or control, and followed until an event of interest occurs, such as onset, recurrence, or cure of a disease. In this case, the two populations are patients randomized to treatment and patients randomized to control. While the methods discussed in this paper are applicable to any time-to-event data, we will use "patients" to refer to the units in the population of interest for convenience.
In the analysis of time-to-event data, one common parameter of interest is the cumulative distribution function of the event time, or equivalently, its survival function. However, in many settings, the event time is not observed for all patients in the study because, for example, some patients may prematurely leave the study, or the event may not have occurred before the end of the study period. This is known as right-censoring of the event time. If the censoring process is independent of the event process, the Kaplan-Meier estimator (Kaplan and Meier, 1958) is a consistent nonparametric estimator of the survival function of the event time.
The distribution and survival functions describe cumulative probabilities, but in some cases it is of interest to quantify the instantaneous rate of the event at a point in time among patients who have not yet experienced the event of interest. This is known as the hazard rate. When comparing the distributions of an event time in two populations, the ratio of the hazard rates, known as the hazard ratio, describes the relative event rates over time among patients who have not yet experienced the event in the two populations. Estimating the hazard rate or ratio is more difficult than estimating the survival function because the hazard rate and ratio concern events occurring in an infinitesimal window of time. However, estimation of the hazard ratio is made much simpler by assuming that it is constant in time, which is known as the proportional hazards assumption.
When this assumption holds, Cox proportional hazards regression can be used to estimate the hazard ratio (Cox, 1972). In this case, the hazard ratio for comparing two populations reduces to a single number. The hazard ratio estimated from a simple Cox regression comparing two populations has become one of the most important tools in the analysis of time-to-event data, and in some studies it is the only effect reported (Hernán, 2010).
Despite the widespread use of Cox regression, the proportional hazards assumption underlying it is easily violated. For example, if a treatment only offers short-term benefits over control, then the hazard ratio is unlikely to be constant (Li et al., 2015). In addition, the proportional hazards assumption implies that the survival function of one group can be expressed as the survival function of the other group raised to a constant power. Hence, if the survival curves cross, then the proportional hazards assumption cannot hold (see, e.g., Klein and Moeschberger, 2003). The hazard ratio estimated by a Cox regression in a setting where the proportional hazards assumption is violated is approximately a weighted average of the hazard ratio function over time (Struthers and Kalbfleisch, 1986). However, the weighting function depends on the censoring pattern in the study, which complicates the interpretation of the parameter estimated by such a misspecified Cox model (O'Quigley, 2008; Whitney et al., 2019).
When the proportional hazards assumption is violated, estimating the hazard ratio function is more difficult. One simple approach is to estimate the hazard ratio using the ratio of estimators of the individual hazard rate functions. For example, if correctly specified parametric models for the distributions are available, the hazard rates in the two distributions can be estimated using maximum likelihood estimation (Kalbfleisch and Prentice, 2011). Alternatively, nonparametric methods for estimating hazard functions based on smoothing have also been proposed (Anderson and Senthilselvan, 1980; Müller and Wang, 1994; Rebora et al., 2014). However, estimators based on smoothing are often sensitive to the selection of certain tuning parameters, such as bandwidths, kernel functions, or the number of knots in a spline function. In addition, obtaining valid inference using a smoothing-based estimator can be challenging due to bias in the asymptotic distribution of the estimator (see, e.g., Wasserman, 2013 and Calonico et al., 2018).
In some cases, it may be known that the hazard ratio is monotone as a function of time. In general, the hazard ratio can be expected to be monotone when the relative rate of events in the two groups increases or decreases over time. For example, if the effectiveness of a treatment wanes over time, then the hazard ratio between treated and placebo groups of a randomized trial may be expected to be monotone non-decreasing (Durham et al., 1998). Similarly, harmful exposures can result in a monotone non-decreasing hazard ratio between the exposed and unexposed groups (Sekula et al., 2013). We discuss the motivation and application of monotone hazard ratios more in Section 2.
We are only aware of a few studies concerning monotonicity of the hazard ratio function. Gill and Schumacher (1987) and Deshpande and Sengupta (1995) proposed tests of the proportional hazards assumption against the non-decreasing hazard ratio alternative. Kim et al. (2011) proposed an estimator of a monotone hazard ratio function using a nonparametric Bayesian approach, which we discuss further in Section 3.

Contribution and organization of the article
In this article, we study the situation in which the hazard ratio between two populations is known to be non-decreasing in time. First, we define a new stochastic order called the monotone hazard ratio order, demonstrate that it is an invariant stochastic order in the sense of Lehmann and Rojo (1992), and study the properties of this novel stochastic order. As we will discuss more below, this is important because it gives stability to the monotonicity assumption, and because it connects our new order to the existing literature on stochastic orders. Second, we propose a novel estimator of a hazard ratio function under a monotonicity constraint in the presence of independent right-censoring. Finally, we derive the large-sample properties of our estimator, including convergence in distribution of our estimator at the rate n^{-1/3} to a mean-zero limit, and use this result to construct asymptotically valid pointwise confidence intervals for the hazard ratio function. To the best of our knowledge, we are the first to study the stochastic order defined by monotonicity of the hazard ratio function, and we are also the first to produce asymptotically valid confidence intervals for a monotone hazard ratio function.
The paper proceeds as follows. In Section 2, we define the monotone hazard ratio order and establish properties of this order. In Section 3, we introduce our nonparametric estimator of a monotone hazard ratio function, establish asymptotic theory for our estimator, and use this theory to construct confidence intervals. In Section 4, we present numerical studies evaluating the finite-sample performance of our method. Finally, in Section 5, we use our method to estimate the hazard ratio function comparing the length of progression-free survival of pulmonary adenocarcinoma patients treated with gefitinib or carboplatin-paclitaxel. Proofs of all theorems can be found in the Supplementary Material.

Notation
For an interval I ⊆ R, a function H : I → R, and x ∈ I, we denote by ∂_−H(x) the left derivative of H at x. We also denote the image of H by Im(H) := {u ∈ R : H(x) = u for some x ∈ I}. If H is non-decreasing, we define the support of H as Supp(H) := {x ∈ I : H(u) < H(v) for all u < x < v}. We define the greatest convex minorant (GCM) of H on I, denoted GCM_I(H) : I → R, as the pointwise supremum of all convex functions on I bounded above by H. We say that H is monotone on A ⊆ I if H(x) ≤ H(y) for all x < y with x, y ∈ A, and similarly we say that H is convex on A if H(tx + (1 − t)y) ≤ tH(x) + (1 − t)H(y) for all x, y ∈ A and t ∈ [0, 1] such that tx + (1 − t)y ∈ A as well. We set H^−(u) := inf{t ∈ I : H(t) ≥ u} as the generalized inverse function corresponding to H. The properties of such functions when H is a distribution function (in which case H^− is its quantile function) are summarized in Chapter 21 of van der Vaart (2000). All integrals should be interpreted as Riemann-Stieltjes integrals, and ∫_0^t denotes integration over (0, t] by default.

Monotone hazard ratio ordering

Definition of the monotone hazard ratio order
We now introduce and motivate the monotone hazard ratio order. We let S and T be nonnegative random variables, and we let F_S, F̄_S, F_T, and F̄_T be the distribution and survival functions corresponding to S and T, respectively. If S and T are absolutely continuous with density functions f_S = F_S′ and f_T = F_T′, then λ_S := f_S/F̄_S and λ_T := f_T/F̄_T are the hazard functions corresponding to S and T, respectively. In this case, we say S ≥_MHR T if t ↦ θ(t) := λ_S(t)/λ_T(t) is non-decreasing for t such that f_T(t) > 0 or f_S(t) > 0. On the other hand, if S and T are fully discrete random variables with support contained in a finite or countably infinite set {t_1 < t_2 < ···}, then λ_S(t_j) := f_S(t_j)/F̄_S(t_{j−1}) and λ_T(t_j) := f_T(t_j)/F̄_T(t_{j−1}) are the corresponding hazard functions, where f_S(t) := P(S = t) and f_T(t) := P(T = t) are the corresponding mass functions (and where F̄_S(t_0) = F̄_T(t_0) := 1). In this case, we say S ≥_MHR T if j ↦ θ(t_j) := λ_S(t_j)/λ_T(t_j) is non-decreasing.
We define ≥_MHR in a way that encompasses both of the above cases, as well as more complicated cases where S and T may be mixed discrete-continuous random variables. We let µ be any sigma-finite measure dominating both F_S and F_T, and we define f_S := dF_S/dµ and f_T := dF_T/dµ. We then define the hazard functions relative to µ as λ_S := f_S/F̄_{S−} on the support of f_S, and 0 otherwise, and similarly for λ_T. The hazard ratio function θ : Supp(F_S) ∪ Supp(F_T) → [0, ∞] is then defined as θ := λ_S/λ_T. We note that θ does not depend on the choice of dominating measure µ, that θ = 0 on Supp(F_T) \ Supp(F_S), and that θ = +∞ on Supp(F_S) \ Supp(F_T). We then have the following general definition of the monotone hazard ratio relation.
Definition 1. We say that S ≥_MHR T if θ = λ_S/λ_T is non-decreasing on Supp(F_S) ∪ Supp(F_T).
When both S and T are dominated by Lebesgue measure, we recover the first case discussed above, and when they are both dominated by counting measure on the countable set {t_1 < t_2 < ···}, we recover the second case.
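As a concrete illustration of the discrete case, the following sketch computes the discrete hazards λ(t_j) = f(t_j)/F̄(t_{j−1}) from two mass functions on a common support and checks whether their ratio is non-decreasing. (The sketch is in Python purely for illustration; the paper's implementation is in R, and the function names here are our own.)

```python
def discrete_hazard(pmf):
    """Hazard lambda(t_j) = f(t_j) / Sbar(t_{j-1}) for a mass function given
    as a list of probabilities on an ordered support t_1 < t_2 < ..."""
    surv_prev = 1.0  # survival just before t_1
    haz = []
    for f in pmf:
        haz.append(f / surv_prev)
        surv_prev -= f
    return haz

def is_mhr(pmf_s, pmf_t):
    """Check S >=_MHR T on a common support: the hazard ratio must be non-decreasing."""
    ratio = [ls / lt for ls, lt in zip(discrete_hazard(pmf_s), discrete_hazard(pmf_t))]
    return all(a <= b for a, b in zip(ratio, ratio[1:]))

# Example: S places more mass late, T early; the hazard ratio 0.4, 0.625, 1.0 is increasing.
print(is_mhr([0.2, 0.3, 0.5], [0.5, 0.3, 0.2]))  # True
```

Note that at the last support point the hazard is always 1, since the remaining mass equals the remaining survival probability.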
Monotone hazard ratios abound in the literature because monotonicity of the hazard ratio function can be expected to hold in several general situations. First, if S is the time to an adverse event under treatment and T is the same under control, we can expect S ≥_MHR T if the protective effect of the treatment on those who have not yet experienced the event wanes over time. There are many examples of such treatments, including vaccines (Durham et al., 1998) and blood transfusion (Holcomb et al., 2013). Second, if S is the time to an adverse event under control, and T is the same under exposure to a condition with short-term toxic effects, then we may again expect that S ≥_MHR T. Drug overdose is an example of such a toxic exposure (Hernandez et al., 2018). We note that the individual hazard functions of S and T may not be monotone in the above cases. For instance, there may be underlying time trends (e.g., weekly, monthly, or seasonal trends) unrelated to treatment that induce non-monotonic trends in the hazards. If these trends influence the hazards of S and T equally, then the hazard ratio may still be expected to be monotone.
The statistical model induced by the monotone hazard ratio order is a generalization of the popular proportional hazards model with a time trend, where the time trend is allowed to be any monotone function. Choosing a specific time trend for a proportional hazards model can be difficult, and if the time trend is chosen based on the data, obtaining valid inference for the regression coefficient is challenging (Desquilbet and Meyer, 2005). Hence, the flexibility in permitting any monotone time trend is appealing because it avoids the need to choose a specific trend.
We will see in Section 3 that when it is known that the hazard ratio is monotone, this knowledge can be exploited to obtain a simple nonparametric estimator of the hazard ratio function and asymptotically valid pointwise inference. Furthermore, the estimator and inferential procedure avoid estimating or modelling the individual hazard functions directly, in the same spirit as the proportional hazards estimator, which yields improved robustness over methods that estimate the hazard functions. This will be explored more in numerical studies in Section 4.

Properties of the monotone hazard ratio order
We now establish several important properties of the monotone hazard ratio order. First, we show that the relation defined above is an invariant stochastic order in the sense of Lehmann and Rojo (1992). Intuitively, stochastic orders are ways of defining what it means for one probability distribution to be "larger" than another. Specifically, a stochastic order ≥_S is a relation on the space of probability distributions on some measurable space satisfying the conditions of a preorder: for any probability distributions F, G, and H on the space, (1) F ≥_S F, and (2) G ≥_S F and H ≥_S G implies that H ≥_S F. We focus on distributions on the reals. In this case, a stochastic order is called invariant if it is preserved under strictly increasing transformations of the random variables. For two real-valued random variables S and T with distribution functions F_S and F_T, we say S ≥_S T for a stochastic order ≥_S if F_S ≥_S F_T. We now show that these properties hold for the monotone hazard ratio order defined above. The fact that the monotone hazard ratio relation forms a stochastic order is important due to the stability it provides when comparing the hazard ratios of multiple event times. The fact that it is invariant to monotone transformations is especially important because it means that the order does not depend on the time scale. We also note that S ≥_MHR T and T ≥_MHR S together imply that the hazard ratio is constant, but do not imply that S = T in distribution. Hence, the monotone hazard ratio order is not antisymmetric, and therefore does not induce a partial order.
We now provide two characterizations of the monotone hazard ratio order in the special case where F_S ≪ F_T, i.e., F_S is dominated by F_T. We define Λ_S(t) := ∫_0^t F_S(du)/F̄_{S−}(u) and Λ_T(t) := ∫_0^t F_T(du)/F̄_{T−}(u) as the cumulative hazard functions corresponding to S and T, respectively, and we note that if F_S ≪ F_T, then Λ_S ≪ Λ_T, and θ = dΛ_S/dΛ_T. We also define R := F_S ∘ F_T^− as the ordinal dominance curve corresponding to the distributions of S and T. Lehmann and Rojo (1992) demonstrated that every invariant stochastic order is equivalent to a pre-order on the space of ordinal dominance curves that is closed under composition. In the next result, we provide two characterizations of the monotone hazard ratio order: one in terms of the ordinal dominance curve, and a second in terms of the cumulative hazard functions Λ_S and Λ_T.
Theorem 2. (a) If F_S ≪ F_T and θ is continuous, then the following are equivalent: (i) S ≥_MHR T; (ii) Λ_S ∘ Λ_T^− is convex on Im(Λ_T). (b) If F_S ≪ F_T and S ≥_MHR T, then θ(x) = ∂_−[GCM_I(Λ_S ∘ Λ_T^−)](Λ_T(x)) for all x ∈ Supp(F_T), where I is the smallest closed interval containing Im(Λ_T).
The assumption that F_S ≪ F_T is important for the characterizations in Theorem 2. For example, if F_S is the uniform distribution on [0, 1.5] and F_T is the Bernoulli distribution with probability 1/2, then F_S is not dominated by F_T, but Λ_S ∘ Λ_T^− is convex on Im(Λ_T) and S ≱_MHR T.
This is similar to a counterexample provided in Mösching and Dümbgen (2020) for the likelihood ratio order. If treatment or exposure does not change the set of possible event times, which is the case in many real-world situations, then F_S ≪ F_T can be expected to hold.
The characterization of the monotone hazard ratio order in terms of the ordinal dominance curve provided in Theorem 2 is somewhat more complicated than the characterizations of the other three common invariant stochastic orders discussed below. This is due to the complexity of the general relationship between a hazard function and the corresponding distribution function. In the case of absolutely continuous F_S and F_T, the characterization in terms of the ordinal dominance curve can be stated somewhat more simply. In particular, in the absolutely continuous case the monotone hazard ratio order is equivalent to t ↦ R(1 − e^{−t}) being log-convex on t ∈ [0, ∞). In this case, it is necessary but not sufficient that R be log-convex.
The relationship between the hazard and cumulative hazard functions is analogous to that between a density and a distribution function. Hence, the characterization of the monotone hazard ratio order in terms of the cumulative hazard functions provided in Theorem 2 parallels the characterization of a likelihood ratio order in terms of the distribution functions (Westling et al., 2021; Mösching and Dümbgen, 2020). We will see in Section 3 that part (b) of Theorem 2 suggests a natural estimator of θ.
Theorem 2 can also be used to informally assess the plausibility of the monotone hazard ratio order given data. We note that Λ_S ∘ Λ_T^− is convex on Im(Λ_T) if and only if the parametrized curve t ↦ (Λ_T(t), Λ_S(t)) is convex. Hence, comparing an empirical version of this curve to its GCM gives an informal graphical check of the monotone hazard ratio order. This same procedure was proposed by Gill and Schumacher (1987).
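The graphical check above can be made numerical by measuring how far the curve sits above its GCM. The sketch below (illustrative Python; function names are our own) computes the GCM of a set of points as their lower convex hull and reports the largest vertical gap, which would be applied to the points {(Λ_{T,n}(t_j), Λ_{S,n}(t_j))} in practice.

```python
def gcm(points):
    """Greatest convex minorant of a finite set of points: the lower convex
    hull, computed with a monotone-chain scan over x-sorted points."""
    hull = []
    for x, y in sorted(points):
        # pop the last hull point while it lies on or above the chord from
        # hull[-2] to (x, y), i.e. while consecutive slopes fail to increase
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (y2 - y1) * (x - x1) >= (y - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append((x, y))
    return hull

def max_gap_to_gcm(points):
    """Largest vertical distance from the curve to its GCM; a large gap casts
    doubt on convexity of Lambda_S o Lambda_T^-, i.e. on the MHR order."""
    hull = gcm(points)
    gap = 0.0
    for x, y in points:
        for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
            if x1 <= x <= x2:
                g = y1 + (y2 - y1) * (x - x1) / (x2 - x1) if x2 > x1 else y1
                gap = max(gap, y - g)
                break
    return gap
```

A gap near zero is consistent with the monotone hazard ratio order; this is an informal diagnostic, not a formal test.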

Relationship to other stochastic orders
A variety of stochastic orders have been studied; Shaked and Shanthikumar (2007) contains detailed results and discussion. We briefly review three of the most common stochastic orders used in the context of univariate time-to-event analysis. The usual or uniform stochastic order is defined as G ≥_ST F if Ḡ(t) ≥ F̄(t) for all t ∈ R, where F and G are cumulative distribution functions on R. Estimators under the usual stochastic order were developed by Brunk et al. (1966) and Dykstra (1982), and the corresponding asymptotic properties were derived by Praestgaard and Huang (1996). The hazard rate order is defined as G ≥_HR F if F̄/Ḡ is non-increasing, which is equivalent to f/F̄ ≥ g/Ḡ in the case of absolutely continuous distributions, where f and g are the densities corresponding to F and G. Dykstra et al. (1991) studied estimation and inference under a hazard rate order. Finally, the likelihood ratio order is defined as G ≥_LR F if g/f is non-decreasing.
It is natural to ask where the monotone hazard ratio order fits into the hierarchy of these three common stochastic orders. It turns out that the monotone hazard ratio order does not generally imply, nor is it implied by, any of the three. To show this, we provide continuous and discrete counterexamples for each case. These examples are illustrated in Figure 1.
We note that because our order neither implies nor is implied by these other orders, previously established properties of, and methods for inference under, these orders do not apply to the monotone hazard ratio order.
We first show that the monotone hazard ratio order does not imply the usual stochastic order, which further implies that the monotone hazard ratio order does not imply the hazard rate or likelihood ratio orders. Suppose that S and T have Weibull distributions with shape parameters k_S and k_T and scale parameters σ_S and σ_T, respectively. Then the hazard ratio function is θ(t) = (k_S σ_T^{k_T})/(k_T σ_S^{k_S}) t^{k_S − k_T}, which is non-decreasing whenever k_S ≥ k_T, so that S ≥_MHR T. However, the scale parameters can be chosen so that the survival curves cross, in which case S ≱_ST T, which also implies that S ≱_HR T and S ≱_LR T. Hence, the monotone hazard ratio order does not imply any of these other three common orders in the continuous case (first column of Figure 1). For a counterexample in the discrete case, suppose that F_S follows a geometric distribution with success probability p_S on {1, 2, ...}, so that λ_S(k) = p_S for all k ∈ {1, 2, ...}. Hence, S ≥_MHR T for any T whose hazard function is non-increasing, while S ≱_ST T whenever F̄_S < F̄_T at some point. Both of these are the case, for instance, if T also follows a geometric distribution with success probability p_T < p_S (second column of Figure 1).
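The geometric counterexample can be checked numerically. The sketch below (illustrative Python) verifies that the hazard ratio of two geometric distributions is the constant p_S/p_T, hence non-decreasing, so S ≥_MHR T, while the survival function of S lies strictly below that of T at every support point, so S ≥_ST T fails.

```python
p_S, p_T = 0.5, 0.2  # success probabilities, with p_T < p_S
K = 20               # number of support points to inspect

# Geometric hazards are constant, so the hazard ratio is the constant p_S / p_T.
hazard_ratio = [p_S / p_T] * K
assert all(a <= b for a, b in zip(hazard_ratio, hazard_ratio[1:]))  # non-decreasing

# Survival functions Sbar(k) = (1 - p)^k: S dies faster at every point.
surv_S = [(1 - p_S) ** k for k in range(1, K + 1)]
surv_T = [(1 - p_T) ** k for k in range(1, K + 1)]
assert all(s < t for s, t in zip(surv_S, surv_T))  # usual order S >=_ST T fails
```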
We now show that the likelihood ratio order does not imply the monotone hazard ratio order, which further implies that the hazard rate order and usual stochastic order do not imply the monotone hazard ratio order. For an example in the continuous case, suppose that S and T follow Beta distributions with parameters (α, β_S) and (α, β_T) for β_S < β_T. Then the density ratio is proportional to (1 − t)^{β_S − β_T}, which is strictly increasing, so S ≥_LR T. Furthermore, if α ∈ (0, 1), then one can also show that the hazard ratio function is strictly decreasing, so that S ≱_MHR T. Therefore, the likelihood ratio order does not imply the monotone hazard ratio order in the continuous case, so neither do the hazard rate or usual stochastic orders (third column of Figure 1). For a counterexample in the discrete case, suppose S has a uniform distribution on a finite set {t_1 < ··· < t_K}, and suppose the mass function f_T of T satisfies two conditions, (1) and (2), which can be achieved simultaneously if and only if f_T(t_1) ≥ 1/K. Then the ratio of the mass functions is proportional to f_T, so the likelihood ratio order holds by condition (1). However, we can also show that λ_S(t_{j−1})/λ_T(t_{j−1}) > λ_S(t_j)/λ_T(t_j) for all j = 2, ..., K − 1, so the monotone hazard ratio order cannot hold (last column of Figure 1).
One special case where the monotone hazard ratio order does imply the hazard rate order, and therefore the usual order as well, is when lim_{t→t_max} λ_S(t)/λ_T(t) ≤ 1, where t_max := sup{Supp(F_S) ∪ Supp(F_T)}. This is the case, for instance, when a treatment is known to be non-toxic, or when a harmful exposure is known to never be beneficial. In particular, if F_S and F_T are supported on the same finite discrete set, then λ_S(t_max) = λ_T(t_max) = 1, so this condition holds automatically.

Nonparametric inference with right-censored data

Statistical setting
In this section, we provide an estimator of a monotone hazard ratio function θ using independently right-censored data. We derive the asymptotic distribution of our estimator, and use this result to construct asymptotically valid pointwise confidence intervals for θ.
For each i ∈ {1, ..., n}, we let A_i ∼ Bernoulli(π) indicate the cohort for unit i. For a randomized study, A_i = 0 corresponds to control, and A_i = 1 corresponds to treatment, though the data need not be from a randomized trial. We assume that π ∈ (0, 1). For i such that A_i = 1, we let S_i ∼ F_S be the event time and U_i ∼ F_U be the censoring time. For i such that A_i = 0, we let T_i ∼ F_T be the event time and V_i ∼ F_V be the censoring time. We assume that S_i and U_i are independent and that T_i and V_i are independent for each i; that is, the censoring is independent of the event within each treatment arm. If A_i = 1, we observe the right-censored data Y_i := min{S_i, U_i} and ∆_i := I(S_i ≤ U_i), and if A_i = 0, we observe Y_i := min{T_i, V_i} and ∆_i := I(T_i ≤ V_i). We write O_i := (A_i, Y_i, ∆_i), and we assume that O_1, ..., O_n are IID.
When F_S and F_T are discrete, the hazard ratio function can be estimated using the ratio of the empirical hazard functions within each treatment arm. The empirical hazard functions converge at the rate n^{-1/2} to normal limits, so by the delta method, their ratio does as well. Hence, inference for the hazard ratio function in this case can be obtained using standard methods. Furthermore, monotonicity of the hazard ratio function can be enforced by projecting the empirical estimator onto the space of monotone functions (Westling et al., 2020b). Therefore, here, we focus on the more challenging case where F_S and F_T are absolutely continuous distributions. We make no assumptions about the censoring distributions F_U and F_V.
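The monotone projection in the discrete case can be computed with the pool-adjacent-violators algorithm (PAVA). The sketch below is a generic weighted PAVA in Python, for illustration only; using the at-risk counts as weights is an assumption of this sketch, not necessarily the weighting of Westling et al. (2020b).

```python
def pava(values, weights):
    """Weighted pool-adjacent-violators: the non-decreasing sequence minimizing
    sum_i weights[i] * (values[i] - fit[i])^2."""
    blocks = []  # each block: [weighted mean, total weight, count of pooled points]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # merge backwards while adjacent block means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    fit = []
    for m, _, c in blocks:
        fit.extend([m] * c)
    return fit

# Empirical hazard ratios at successive support points, weighted by at-risk counts:
print([round(v, 6) for v in pava([2.0, 0.5, 1.0, 3.0], [10, 8, 6, 4])])
# [1.25, 1.25, 1.25, 3.0]
```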

Proposed estimator
Our estimator is based on the representation of θ presented in Theorem 2. We recall from Theorem 2 that if F_S ≪ F_T and θ is non-decreasing and continuous on the support of F_T, then we can represent θ in terms of the cumulative hazard functions Λ_S and Λ_T as θ(x) = ∂_−[GCM_I(Λ_S ∘ Λ_T^−)](Λ_T(x)), where I is the smallest closed interval containing Im(Λ_T). Our estimator is defined by replacing the unknown elements in this representation with nonparametric estimators thereof. We let Λ_{S,n} be the stratified Nelson-Aalen estimator (Nelson, 1969; Aalen, 1978) of the cumulative hazard function Λ_S based on the cohort for which A = 1. Similarly, we let Λ_{T,n} be the stratified Nelson-Aalen estimator of Λ_T based on the control cohort for which A = 0. We also define η_n := Λ_{T,n}(γ_n), where γ_n is the minimum of the empirical 1 − r_n quantile of the Y_i's for which A_i = 0 and the empirical 1 − r_n quantile of the Y_i's for which A_i = 1, where r_n > 0 is a non-increasing sequence converging to r ≥ 0. It follows that γ_n converges to γ, the minimum of the (1 − r)th quantile of Y given A = 1 and the (1 − r)th quantile of Y given A = 0. Additional conditions on r_n and practical suggestions for setting r_n are provided below. We then define our estimator θ_n of θ as θ_n(x) := ∂_−[GCM_{[0, η_n]}(Λ_{S,n} ∘ Λ_{T,n}^−)](Λ_{T,n}(x)). It is straightforward to compute θ_n using standard software packages. Specifically, in the statistical computing software R (R Core Team, 2021), the Nelson-Aalen estimators Λ_{S,n} and Λ_{T,n} can be obtained using the package survival (Therneau, 2022), and the slopes of the greatest convex minorant of Λ_{S,n} ∘ Λ_{T,n}^− can be obtained using the package fdrtool (Strimmer, 2008). Code for computing θ_n is provided in the Supplementary Material.
Kim et al. (2011) proposed a nonparametric Bayesian approach to estimating a monotone hazard ratio function. Their model permits either monotone non-decreasing or non-increasing hazard ratio functions, whereas the type of monotonicity must be known a priori for our estimator; we expect, however, that in most cases where monotonicity can be assumed, the direction of monotonicity is also known. Their model also allows for the incorporation of covariates, which we have not explored. However, approximating the posterior distribution in their model is complicated and possibly computationally intensive, in contrast to the simple implementation of our procedure.
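To make the construction concrete, the following is a minimal end-to-end sketch of the estimator in Python, for illustration only: the paper's implementation uses the R packages survival and fdrtool, and this sketch ignores ties and the truncation sequence r_n for simplicity. It computes arm-specific Nelson-Aalen estimators, forms the diagram {(Λ_{T,n}(t), Λ_{S,n}(t))} over the control jump times, and returns the left slope of its GCM at Λ_{T,n}(x).

```python
import bisect
import random

def nelson_aalen(times, events):
    """Nelson-Aalen estimator (no ties assumed): jump times and cumulative hazard."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk, cum = len(times), 0.0
    jump_t, jump_h = [], []
    for i in order:
        if events[i]:
            cum += 1.0 / at_risk
            jump_t.append(times[i])
            jump_h.append(cum)
        at_risk -= 1
    return jump_t, jump_h

def step_eval(ts, hs, x):
    """Evaluate the right-continuous step function with jumps ts -> hs at x."""
    i = bisect.bisect_right(ts, x)
    return hs[i - 1] if i > 0 else 0.0

def hazard_ratio_estimate(x, s_times, s_events, t_times, t_events):
    """theta_n(x): left slope at Lambda_{T,n}(x) of the GCM of the diagram
    {(Lambda_{T,n}(t), Lambda_{S,n}(t))} over the control jump times."""
    tS, hS = nelson_aalen(s_times, s_events)
    tT, hT = nelson_aalen(t_times, t_events)
    pts = [(0.0, 0.0)] + [(h, step_eval(tS, hS, t)) for t, h in zip(tT, hT)]
    hull = []  # greatest convex minorant via a lower-hull scan
    for px, py in pts:
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (y2 - y1) * (px - x1) >= (py - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append((px, py))
    u = step_eval(tT, hT, x)
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 < u <= x2:
            return (y2 - y1) / (x2 - x1)
    return float("nan")

# Sanity check on uncensored data with a constant hazard ratio of 2:
rng = random.Random(1)
s = [rng.expovariate(2.0) for _ in range(2000)]  # S ~ Exp(2)
t = [rng.expovariate(1.0) for _ in range(2000)]  # T ~ Exp(1)
est = hazard_ratio_estimate(0.5, s, [1] * 2000, t, [1] * 2000)  # est near 2
```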

Convergence in distribution
We now demonstrate that n^{1/3}[θ_n(x) − θ(x)] converges in distribution for fixed x to a scaled Chernoff distribution. The (standard) Chernoff distribution is defined as the distribution of the derivative at zero of the GCM of a two-sided Brownian motion plus a parabola; i.e., W := ∂_−[GCM_R(t ↦ B(t) + t²)](0) for B a standard two-sided Brownian motion with B(0) = 0.
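The Chernoff distribution has no simple closed form, but approximate draws can be obtained directly from the definition by discretizing the two-sided Brownian motion. The rough simulation sketch below (illustrative Python; the grid width and range are arbitrary choices of ours) builds B on a grid over [−c, c], adds t², takes the GCM via a lower-hull scan, and returns the slope of the hull segment at zero.

```python
import math
import random

def chernoff_draw(step=0.01, c=2.0, rng=None):
    """One approximate draw of the slope at zero of the GCM of B(t) + t^2,
    using a Gaussian random-walk discretization of B on [-c, c]."""
    rng = rng or random
    n = int(c / step)
    sd = math.sqrt(step)
    right, left = [0.0], [0.0]
    for _ in range(n):
        right.append(right[-1] + rng.gauss(0.0, sd))
        left.append(left[-1] + rng.gauss(0.0, sd))
    ts = [(k - n) * step for k in range(2 * n + 1)]
    bs = left[:0:-1] + right           # B(t) over ts, with B(0) = 0
    ys = [b + t * t for t, b in zip(ts, bs)]
    hull = []                          # greatest convex minorant via a lower-hull scan
    for x, y in zip(ts, ys):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (y2 - y1) * (x - x1) >= (y - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append((x, y))
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= 0.0 < x2:
            return (y2 - y1) / (x2 - x1)
    return 0.0

w = chernoff_draw(rng=random.Random(0))  # one approximate draw
```

The resulting distribution is symmetric about zero; finer grids and a larger window reduce the discretization error.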
Theorem 3. Suppose x ∈ (0, γ) is such that F_S, F_T, and θ are continuously differentiable at x with finite and strictly positive derivatives, and F_S, F_U, F_T, and F_V are strictly less than 1 in a neighborhood of x. Also suppose that there exist ε, C > 0 such that r_n ≥ C(log n)^{2+ε}/n for all n. Then n^{1/3}[θ_n(x) − θ(x)] converges in distribution to τ(x)W, where W follows the Chernoff distribution and τ(x) is a scale parameter depending on θ′(x), λ_S(x), λ_T(x), and the distributions of the observed data.
Due to its connection with GCMs, the Chernoff distribution appears in the asymptotic distribution of summaries of many monotonicity-constrained estimators (e.g., Groeneboom, 1985; Huang and Wellner, 1995; Westling et al., 2021, among many others). The properties of the Chernoff distribution were studied extensively by Groeneboom and Wellner (2001). In particular, common quantiles of the distribution are tabulated therein, which facilitates the construction of asymptotic confidence intervals for θ using Theorem 3, as we discuss below.
Theorem 3 implies that θ_n(x) converges to θ(x) at the rate n^{-1/3}. This is slower than the rate n^{-2/5} achieved by estimators of the hazard function based on kernel smoothing with optimal bandwidth selection (Müller and Wang, 1990; Groeneboom et al., 2010). However, this latter result requires that the hazards possess two continuous derivatives, while Theorem 3 only requires one continuous derivative of the hazard ratio. In addition, asymptotically valid inference using estimators based on kernel smoothing is challenging due to bias arising in the limit distribution (Calonico et al., 2018).
Theorem 3 requires that r_n not converge too quickly to zero, meaning that the upper limit of the region over which the GCM is taken must not converge too quickly to the upper limit of the support of the observed times. This ensures that Λ_{T,n} and Λ_{S,n} are uniformly consistent on the increasing interval [0, γ_n] (Stute, 1994). The requirement is satisfied if, for instance, r_n = r > 0 for all n, or if r_n = C(log n)^{2+ε}/n for some ε, C > 0. In practice, we recommend setting r_n = 0.05 for n < 1000.
Kim et al. (2017) proved that the rate of convergence of the posterior distribution of the nonparametric Bayesian estimator proposed by Kim et al. (2011) is (n/log n)^{-1/3}, which is just a poly-log factor slower than the rate of convergence of our estimator. However, to the best of our knowledge, it is not known whether the posterior distribution of the estimator proposed by Kim et al. (2011) yields asymptotically calibrated confidence intervals for θ(x). In the next section, we use Theorem 3 to construct asymptotically valid pointwise intervals using our estimator.

Construction of confidence intervals
We propose two methods of constructing confidence intervals for θ. The first method is based on the asymptotic distribution provided in Theorem 3: a Wald-type asymptotic 1 − α confidence interval for θ(x) is given by θ_n(x) ± n^{-1/3} τ_n(x) q_{1−α/2}, where τ_n(x) is a consistent estimator of the scale parameter τ(x) and q_p is the pth quantile of the standard Chernoff distribution. Quantiles of the Chernoff distribution are tabulated in Groeneboom and Wellner (2001). We note that τ(x) involves both λ_S(x) and λ_T(x), so one approach to estimating τ(x) would be to plug in consistent estimators of λ_S(x) and λ_T(x). Instead, we rewrite τ(x) in a form that no longer depends directly on λ_S or λ_T. In this expression, θ_n, Λ_{T,n}, and the Kaplan-Meier estimators F_{S,n}, F_{U,n}, F_{T,n}, and F_{V,n} can be substituted for their true counterparts in constructing an estimator τ_n(x) of τ(x). Hence, the only remaining challenge is to estimate the derivative of θ at x. We do this using the derivative estimator obtained by applying a local linear kernel smoother to a grid of m_n points of the estimated function, where m_n = n^{2/3}. We choose the bandwidth for the kernel smoother using cross validation (Guidoum, 2020).
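Once θ_n(x) and τ_n(x) are in hand, the Wald-type interval is a one-line computation. The sketch below (illustrative Python; the inputs are assumed to be supplied by the estimation steps above) uses the 0.975 quantile of the Chernoff distribution, 0.998181, as tabulated in Groeneboom and Wellner (2001).

```python
Q_CHERNOFF_0975 = 0.998181  # 0.975 quantile of the standard Chernoff distribution
                            # (tabulated in Groeneboom and Wellner, 2001)

def wald_interval(theta_n_x, tau_n_x, n, q=Q_CHERNOFF_0975):
    """95% Wald-type interval: theta_n(x) -/+ n^(-1/3) * tau_n(x) * q."""
    half_width = n ** (-1.0 / 3.0) * tau_n_x * q
    return theta_n_x - half_width, theta_n_x + half_width

lo, hi = wald_interval(2.0, 1.0, 1000)  # n^(-1/3) = 0.1, so half-width ~ 0.0998
```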
Sample splitting has also been shown to yield valid inference and reduced variance for estimators with n^{−1/3}-rate asymptotics, without the need to estimate additional nuisance parameters in the limit distribution (Banerjee and Wellner, 2005; Banerjee et al., 2019). To implement this method, the n observations are first split randomly into m disjoint subsets of approximately equal size. The estimator θ_{n,j} is then computed for each subset j ∈ {1, …, m}.

Numerical studies
To assess the finite-sample performance of our proposed estimator and confidence intervals, we performed the following numerical study. We simulated data from three scenarios corresponding to linear, convex, and concave θ. Defining λ(x) := 0.25 + sin²(6πx), in the linear case we set λ_S(x) = xλ(x) and λ_T(x) = λ(x), so that θ(x) = x. In the convex case, we set λ_S(x) = x²λ(x) and λ_T(x) = λ(x), so that θ(x) = x². In the concave case, we set λ_S(x) = x^{1/2}λ(x) and λ_T(x) = λ(x), so that θ(x) = x^{1/2}. Notably, λ_S(x) > 0 and λ_T(x) > 0 for all x > 0, and both are multiples of a periodic function due to the inclusion of sin². This is common in many applications where event rates follow weekly, monthly, or seasonal trends. For the censoring distributions, we set both F_U(t) and F_V(t) equal to 1 − e^{−0.1t} for 0 ≤ t < 1, 1 − e^{−0.15t} for 1 ≤ t < 2, and 1 for t ≥ 2.
Hence, the censoring distributions are mixed discrete-continuous distributions supported on [0, 2], with discrete components at 1 and 2 with probabilities 0.044 and 0.078, respectively. Finally, we set π = 0.5.
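The data-generating mechanism above can be sketched as follows. Sampling event times by numerically inverting the cumulative hazard, and the generalized-inverse sampling of the mixed censoring distribution from its displayed piecewise form, are our own illustrative choices, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(1)

def lam(x):
    # baseline periodic hazard from the simulation study
    return 0.25 + np.sin(6 * np.pi * x) ** 2

def draw_event_times(hazard, n, tmax=10.0, ngrid=20000):
    """Draw event times via T = Lambda^{-1}(E) with E ~ Exp(1), inverting the
    cumulative hazard numerically (trapezoid rule) on a fine grid."""
    x = np.linspace(0.0, tmax, ngrid)
    Lam = np.concatenate(
        ([0.0], np.cumsum(0.5 * (hazard(x[1:]) + hazard(x[:-1])) * np.diff(x))))
    e = rng.exponential(1.0, n)
    return np.interp(e, Lam, x, right=tmax)   # Lam is increasing, so interp inverts it

def draw_censoring(n):
    """Generalized-inverse sampling from the stated mixed distribution:
    F(t) = 1 - exp(-0.1 t) on [0,1), 1 - exp(-0.15 t) on [1,2), and 1 for t >= 2."""
    u = rng.uniform(0.0, 1.0, n)
    t = np.full(n, 2.0)                               # atom at 2
    m1 = u < 1 - np.exp(-0.1)
    t[m1] = -np.log(1 - u[m1]) / 0.1                  # continuous part on [0,1)
    m2 = (u >= 1 - np.exp(-0.1)) & (u < 1 - np.exp(-0.15))
    t[m2] = 1.0                                       # atom at 1
    m3 = (u >= 1 - np.exp(-0.15)) & (u < 1 - np.exp(-0.3))
    t[m3] = -np.log(1 - u[m3]) / 0.15                 # continuous part on [1,2)
    return t

# linear scenario: lambda_S(x) = x * lam(x), so theta(x) = x
n = 2000
tS = draw_event_times(lambda x: x * lam(x), n)
cS = draw_censoring(n)
yS = np.minimum(tS, cS)          # observed time
dS = (tS <= cS).astype(int)      # event indicator
```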
For each sample size n equal to 1000, 3000, 6000, and 10000, we simulated 1000 right-censored datasets for each of the three mechanisms described above. For each dataset and for each x equal to 0.005, 0.01, …, 2, we computed our proposed estimator θ_n(x), the sample splitting estimator θ̄_{n,m}(x) with m = 5 splits, and the corresponding confidence intervals defined in Section 3.4.
For comparison, we also computed an estimator and confidence intervals based on taking the ratio of kernel smoothing estimators of the individual hazard functions, which does not require or enforce monotonicity of θ (Watson and Leadbetter, 1964). For the kernel smoothing estimators of the hazard functions, we used the Epanechnikov kernel and selected the bandwidths using cross-validation. We did not compare our procedure to that of Kim et al. (2011) due to the lack of available computer code implementing their procedure.
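A rough sketch of such a ratio-of-smoothed-hazards comparator follows, in the spirit of Watson and Leadbetter (1964): the Nelson-Aalen increments are convolved with an Epanechnikov kernel and the two smoothed hazards are divided. The uncensored-data simplification and the fixed bandwidth are illustrative assumptions (the study itself used censored data and cross-validated bandwidths):

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def smoothed_hazard(times, x, h):
    """Kernel-smoothed hazard: convolve the Nelson-Aalen increments with K_h
    (all observations here are uncensored for simplicity)."""
    t = np.sort(times)
    n = len(t)
    dLam = 1.0 / (n - np.arange(n))      # Nelson-Aalen jump at each event time
    return (epanechnikov((x[:, None] - t[None, :]) / h) / h * dLam).sum(axis=1)

rng = np.random.default_rng(2)
tS = rng.exponential(0.5, 3000)          # hazard 2
tT = rng.exponential(1.0, 3000)          # hazard 1, so the true ratio is 2
x = np.linspace(0.3, 0.8, 50)
ratio = smoothed_hazard(tS, x, h=0.15) / smoothed_hazard(tT, x, h=0.15)
```

Unlike the monotone estimator, nothing constrains this ratio to be non-decreasing, and its behavior depends on the bandwidth h.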
The top panel of Figure 2 displays n^{1/3} times the absolute bias of the three estimators as a function of x. Figure 6 in Supplementary Material displays the absolute bias of the smoothing and sample splitting estimators relative to our estimator. The scaled bias of all three estimators generally decreases with sample size, which aligns with the expectation that the biases decrease faster than n^{−1/3} for x ∈ (0, 2). The absolute bias of the three estimators exhibits periodicity inherited from the periodicity of the underlying hazard functions. All three estimators exhibit large bias near x = 2, which is expected given the challenges of estimation near the boundary of the support. The bias near x = 0 is highest for all three estimators in the concave case, and the bias for x between 1 and 2 is largest in the convex case, which makes sense because this is where the derivative of θ is large.
For most values of x, our estimator has slightly smaller absolute bias than the sample splitting estimator, especially for n = 1000, which is expected because the sample splitting estimator inherits the bias of our estimator at one-fifth the sample size. The absolute bias of the smoothing-based estimator relative to that of our estimator is generally proportional to the magnitude of the second derivatives of λ_S and λ_T. The absolute bias of our estimator also generally improves relative to that of the smoothing-based estimator as x increases. We believe this is due to a combination of the monotonicity assumption and censoring. As x increases, the effective sample size decreases as a result of right-censoring, which generally increases bias. However, the monotonicity assumption of our estimator may help reduce this bias by borrowing information from earlier time points, unlike the smoothing-based estimator.
The bottom panel of Figure 2 displays n^{2/3} times the empirical variance of the three estimators as a function of x. Figure 6 in Supplementary Material displays the variance of the smoothing and sample splitting estimators relative to our estimator. The variance of our estimator is close to the theoretical limit for all values of n, except for x near 2. The empirical variance does not capture the periodic pattern of the true variance, but we expect it would at larger sample sizes. The variance of the sample splitting estimator is a constant factor smaller than the variance of our estimator, as expected based on the theory of Banerjee et al. (2019). The variance increases as a function of x fastest in the convex case, followed by the linear and concave cases. This is due to the appearance of θ(x) and θ′(x) in the scale parameter of the limit distribution established in Theorem 3. Both of these values increase fastest in the convex case. The variance of the smoothing-based estimator is greater than the variance of our estimator at the sample sizes we considered. However, the relative variance of the smoothing-based estimator improves with sample size because the variance of the smoothing-based estimator goes to zero faster than the variance of our estimator. Overall, the mean squared error of our estimator is no worse than that of the smoothing-based estimator for all values of x and sample sizes we considered (Figure 7 in Supplementary Material).

Figure 3: Empirical coverage probabilities of nominal 95% confidence intervals for the three estimators as a function of x. The rows correspond to linear, convex, and concave θ. The first column is our estimator, the second column is the sample splitting estimator, and the third column is the kernel smoothing estimator.
Figure 3 shows the coverage probability of nominal 95% confidence intervals for the three estimators. The plug-in intervals centered at our estimator have close to nominal coverage for values of x not too close to 0 or 2. For values of x close to 0, the coverage of the plug-in method is poor due to the difficulty of estimating the derivative θ′ in this region.
The sample splitting method has poor coverage for n = 1000 due to high bias, but its coverage converges to the nominal level as the sample size increases. The smoothing-based estimator has poor coverage for values of x where the second derivatives of the hazard functions are large, which is due to the bias of the smoothing-based estimator.

Analysis of treatment of pulmonary adenocarcinoma
In this section, we use the methods developed in this article to estimate the hazard ratio of progression-free survival under two treatments for pulmonary adenocarcinoma: gefitinib and carboplatin-paclitaxel.
Carboplatin-paclitaxel is a type of intravenous chemotherapy, usually administered over a three-hour period once every three weeks for approximately six cycles (Herbst et al., 2004). Like many chemotherapies, carboplatin-paclitaxel is an invasive treatment that can have severe adverse side effects. Gefitinib is a kinase inhibitor that is taken orally as a tablet once per day. Gefitinib is hence less invasive than carboplatin-paclitaxel, but it too can cause adverse side effects. We refer the reader to Mok et al. (2009) and Inoue et al. (2013) for additional details about these treatments.
We re-analyzed the results of a clinical trial comparing gefitinib and carboplatin-paclitaxel first reported in Mok et al. (2009). The cohort consisted of n = 1217 adults with stage IIIB or IV non-small-cell lung cancer with histologic features of adenocarcinoma who were nonsmokers or former light smokers and had no previous chemotherapy or biologic or immunologic therapy.
These patients were randomly assigned to gefitinib (609 patients) or carboplatin-paclitaxel (608 patients). Treatment in both groups continued until progression of the disease, development of unacceptable toxic effects, a request by the patient or physician to discontinue treatment, serious noncompliance with the protocol, or completion of six chemotherapy cycles. The event time of interest was the time from randomization to the earliest sign of disease progression or death from any cause. Additional details of the trial and cohort design can be found in Mok et al. (2009). Since the raw data from this trial are unavailable, we used the event and censoring times reconstructed by Argyropoulos and Unruh (2015) from the published Kaplan-Meier estimates.
Starting from the beginning of treatment, the 12-month estimated survival rates were 24.9% (95% CI: 21.4, 29.4) with gefitinib and 6.7% (95% CI: 4.3, 8.9) with carboplatin-paclitaxel, suggesting that gefitinib was more effective in preventing the progression of pulmonary adenocarcinoma. Mok et al. (2009) also fit a Cox proportional hazards model with treatment by gefitinib, smoking history, and gender, and obtained a hazard ratio of 0.74 (95% CI: 0.65, 0.85) corresponding to treatment with gefitinib. They concluded that gefitinib was superior to carboplatin-paclitaxel for treating pulmonary adenocarcinoma.
Although their experimental results confirmed that assignment to gefitinib yielded a higher overall 12-month survival probability, the survival curves of the two groups crossed, which suggests that the proportional hazards assumption is violated. Hence, it is of interest to estimate the hazard ratio over time to assess the time-varying effect of gefitinib relative to carboplatin-paclitaxel. The left panel of Figure 4 displays the Nelson-Aalen estimator of the cumulative hazard function for the gefitinib cohort plotted against that of the carboplatin-paclitaxel cohort, along with its GCM. This plot suggests that it is reasonable to believe that the hazard ratio function is monotone. Furthermore, prior estimates of the hazard ratio function have also suggested that it is monotone (Argyropoulos and Unruh, 2015). Here, we estimate the hazard ratio using our monotone estimator, and construct confidence intervals using the plug-in method described in Section 3.4. The right panel of Figure 4 displays the estimated hazard ratio of gefitinib versus carboplatin-paclitaxel, as well as the constant hazard ratio estimated by the proportional hazards model. The hazard ratio is only shown through month six, since the estimated curve is flat thereafter. We estimate that the hazard ratio increases to one over the span of four months, after which it increases to 1.6 (95% CI: 1.07, 2.10). Hence, we find evidence that the hazard of disease progression for patients assigned to gefitinib is lower than that of patients assigned to carboplatin-paclitaxel through four months post-randomization, but is greater after four months. This could be due to a stronger early benefit of gefitinib and a delayed effect of carboplatin-paclitaxel. Alternatively, it could be due to heterogeneous effects of carboplatin-paclitaxel relative to gefitinib. For example, frailer patients may have been more likely to progress quickly on carboplatin-paclitaxel than on gefitinib, leaving a less frail cohort with better survival prospects after four months.
Next, we show that if r is continuous on G := Supp(Φ), and R_{Γ,Φ} = Γ ∘ Φ^- is convex on Im(Φ), then r is non-decreasing on G. The idea is to compare the slopes of chords of R_{Γ,Φ} using the convexity of R_{Γ,Φ}. Let x, y ∈ G be such that x < y. Suppose there exist two sequences {z_j}_{j≥1} and {w_j}_{j≥1} such that lim_{j→∞} s_j = r(x) and lim_{j→∞} t_j = r(y), where s_j and t_j denote the slopes of the corresponding chords of R_{Γ,Φ}. If z_j ≤ w_j for all j large enough, then convexity of R_{Γ,Φ} on Im(Φ) would imply that s_j ≤ t_j for all j large enough, and hence r(x) ≤ r(y). Hence, if we can find such sequences, we have established the claim. The slopes of the chords of R_{Γ,Φ} depend on the behavior of Φ near x and y, each of which has three cases. We note that since y ∈ Supp(Φ) by assumption, exactly one of the following three situations must hold: (y1) Φ(y) > Φ(y−) and there exists p ∈ [x, y) such that Φ(p) = Φ(y−); (y2) Φ(y) > Φ(y−) but there is no p ∈ [x, y) such that Φ(p) = Φ(y−); and (y3) Φ(y) = Φ(y−). There are three analogous cases for x: (x1) Φ(x) > Φ(x−) and there exists q < x such that Φ(q) = Φ(x−); (x2) Φ(x) > Φ(x−) but there is no such q < x; and (x3) Φ(x) = Φ(x−). We proceed by defining {z_j}_{j≥1} and {w_j}_{j≥1} in each case.
In case (y1), we let w_j = p for all j. We can take p = Φ^- ∘ Φ(y−) and still have Φ(p) = Φ(y−). We know that p ∈ [x, y); if p = x then p ∈ G by assumption, and if p > x, then Φ(u) < Φ(p) for all u ∈ [x, p) since p = Φ^- ∘ Φ(y−), in which case p ∈ G as well. Hence, p ∈ G necessarily. The claimed limit then holds since Φ is by assumption flat on (p, y) and has a jump at y. In case (y2), we have Φ^- ∘ Φ(y−) = y. Thus, there exists {w_j}_{j≥1} increasing to y such that w_j ∈ (x, y) ∩ G for each j and Φ(w_j) increases to Φ(y−), and the claimed limit follows. In case (y3), since y ∈ G, there exists {w_j}_{j≥1} that either (y3a) increases to y with Φ(w_j) < Φ(y) for each j, or (y3b) decreases to y with Φ(w_j) > Φ(y) for each j. In case (y3a), we use that Φ ∘ Φ^- ∘ Φ(z) = Φ(z) for all z. By continuity of r over G, for any ε > 0, we can find m such that j ≥ m implies |r(u) − r(y)| < ε for all u ∈ [w_j, y] ∩ G. We then have |t_j − r(y)| < ε for all j ≥ m, so lim_{j→∞} t_j = r(y). If (y3b) holds, then a similar argument shows that lim_{j→∞} t_j = r(y).
Applying exactly the same reasoning to the three cases for x, we see that s_j converges to r(x).
We have now shown that there exist sequences {z_j} and {w_j} such that s_j converges to r(x) and t_j converges to r(y), where s_j and t_j are defined above. Hence, if z_j ≤ w_j for all j large enough, then convexity of R_{Γ,Φ} on Im(Φ) implies that s_j ≤ t_j for all j large enough, and hence r(x) ≤ r(y). It is clear that z_j ≤ w_j for all 16 pairings of the definitions of z_j and w_j implied by cases (x1)-(x3b) and (y1)-(y3b), except when z_j decreases to x (case x3b) and w_j = p (case y1). In this case, we note that if x = p, then Φ is flat on [x, y), so that case (x3b) cannot hold. Therefore, if z_j decreases to x and w_j = p, then p must be strictly larger than x, so that z_j < w_j for all j large enough.
It remains to show (1) ⟺ (2). By Lemma 1 with r = θ and Φ = F_T, θ is non-decreasing on Supp(F_T) if and only if the corresponding transform R_{Γ,Φ} is convex, in which case (1) ⟺ (2). We decompose R_+ into its discrete and continuous parts, and address the two parts of the integral in turn.
We first address the discrete part; the last equality in the resulting display follows because t < F_{T,+}^-(u) if and only if t ≤ F_T^-(u). We now address the continuous part of the integral. Using the fact derived above, we note that v ∈ Supp(R_+^c) implies that v ∈ Supp(F_{T,+}^-), and hence v = F_T ∘ F_{T,+}^-(v) unless v is at the left end of a flat region of F_{T,+}^-; such points form an R_+^c-measure-zero set. Similarly, the points v such that ∆R_+(v) > 0 form an R_+^c-measure-zero set. Finally, F_{T,+}^- is strictly increasing on the support of R_+^c, so the change of variables y = F_{T,+}^-(v) yields an integral of [F̄_{T,−}(y)/F̄_{S,−}(y)] (dΛ_S/dΛ_T)(y) with respect to dF_T^c(y).
Finally, we note that if F_{T,+}^-(0) > 0, then F_T^c is flat on (−∞, F_{T,+}^-(0)), and y < F_{T,+}^-(u) if and only if y ≤ F_T^-(u). Putting together the discrete and continuous parts of the integral completes the argument.

We denote by P_n the empirical distribution of O_1, …, O_n, by P the true distribution of O_i (as implied by F_S, F_T, F_U, F_V, and π), and set G_n = n^{1/2}(P_n − P). For any probability distribution Q and Q-integrable function h, we denote Qh := ∫ h dQ. We also let π_n := Σ_{i=1}^n A_i / n be the observed fraction of treated units.
Proof of Theorem 3. To prove Theorem 3, we will use Theorem 4 of Westling et al. (2020a); for convenience, we refer to Westling et al. (2020a) as WC hereafter. In the notation of WC, we have Γ_n = Λ_{S,n}, Φ_n = Λ_{T,n}, Γ_0 = Λ_S, and Φ_0 = Λ_T. To use Theorem 4 of WC, we first need to verify the decomposition of equation (2) of WC. By adding and subtracting terms, we can decompose Λ_{S,n}(x) into the functions corresponding to equation (2) of WC.

Condition (B1). We define the local difference function g_{x,u}. To verify condition (B1), we need to bound the uniform entropy of the class {g_{x,u} : |u| ≤ R} for all R small enough. We further decompose g_{x,u} = g_{x,u,S} − θ_0(x) g_{x,u,T} for g_{x,u,S} := D*_{x+u} − D*_x and g_{x,u,T} := L*_{x+u} − L*_x.
The function g_{x,u,S} can be written in an explicit form, from which we construct an envelope Ĝ_{x,R,S} for the class {g_{x,u,S} : |u| ≤ R}. We now verify that P_0 Ĝ²_{x,R,S} = O(R) as R → 0 under the stated conditions. Due to the boundedness of F̄_S and F̄_V away from zero and independent censoring, the first term is bounded up to constants C, C′ < ∞ by an expression that is O(R) as R → 0, because F_S is continuously differentiable at x with finite derivative. Next, an application of Jensen's inequality gives a bound that, as above, is O(R) as R → 0 because F_S is Lipschitz at x and F̄_S and F̄_U are both positive in a neighborhood of x. By the triangle inequality, we then have P_0 Ĝ²_{x,R,S} = O(R). For the second part of condition (B2), we note that since F̄_S and F̄_U are both positive in a neighborhood of x, Ĝ_{x,R,S} is uniformly bounded for all R small enough. Hence, for every η, for all R small enough (possibly depending on η), the event {R Ĝ_{x,R,S} > η} is empty, which implies in particular that P_0 Ĝ²_{x,R,S} 1{R Ĝ_{x,R,S} > η} = o(R). An identical analysis applies to an envelope for {g_{x,u,T} : |u| ≤ R}.

Condition (B3). This condition concerns properties of the covariance function Σ(s, t). We address each term in its expansion in turn. The second term can be simplified directly, and the third term can be written analogously. Finally, for the last term, decomposing the double integral into two regions shows that the second through fourth terms in the decomposition of P[D*_s D*_t] provided in equation (1) cancel, leaving only the first term.
By symmetry of D*_s and L*_s, the analogous identity holds for the terms involving L*.
Thus, in the notation of condition (B3) of WC, we have Σ*(s, t) = 0, H(v, w) = v, and Q can be taken to be any probability measure, since there are no covariates. Sub-conditions (B3a) and (B3d) are automatically satisfied. Sub-condition (B3b) is satisfied because A does not depend on s or t. Sub-condition (B3c) requires that v ↦ A(x, x, v, w) be continuous at v = x, which would appear to require that F_U and F_V be continuous at x. However, in the proof of Theorem 4 on pages 4 and 5 of the Supplementary Material of WC, it is actually only used that v ↦ A(x, x, v, w) possesses a right limit A(x, x, x+, w) as v approaches x from above (since α in the proof stands in for n^{−1/3}, which is positive), in which case A(x, x, x+, w) should take the place of A(x, x, x, w) in the result. This is important in our work because in many applications the censoring distributions F_U and F_V possess mass points. Hence, this weaker version of sub-condition (B3c) holds with

A(x, x, x+, w) = λ_S(x) / [π F̄_S(x) F̄_U(x)] + θ(x)² λ_T(x) / [(1 − π) F̄_T(x) F̄_V(x)] = θ(x) { λ_T(x) / [π F̄_S(x) F̄_U(x)] + λ_S(x) / [(1 − π) F̄_T(x) F̄_V(x)] }.
Condition (A4). For condition (A4), it suffices to show that E[sup_{t ≤ x+δ} |Λ_{T,n}(t) − Λ_T(t)|] = o(n^{−1/3}) for some δ > 0. We define F̄_T, F̄_{T,n}, R̄_T, and R̄_{T,n} as we did above for S, but conditioning on A = 0 instead. We use integration by parts to bound the first term in the second inequality. By assumption, R̄_T is bounded away from zero in a neighborhood of x, and as a result, R̄_{T,n} is almost surely bounded away from zero in a neighborhood of x for all n large enough. Then, for some δ > 0 and C > 0, the stated bound holds almost surely for all n large enough.
We can show that this expression is O(n^{−1/2}) using empirical process techniques similar to those used for S above.

Condition (A5). For this condition, since I_n ⊂ [0, γ_n], it suffices to show that the stratified Nelson-Aalen estimators are uniformly consistent on [0, γ_n], i.e., that ‖Λ_{S,n} − Λ_S‖_{∞,[0,γ_n]} and ‖Λ_{T,n} − Λ_T‖_{∞,[0,γ_n]} tend to zero in probability. This follows from Corollary 1.2 of Stute (1994) by the assumed lower bound on r_n.
For a function H on a domain I ⊆ ℝ taking values in the extended real line ℝ̄, we let H̄ := 1 − H. If H possesses limits from the left, then we let H_− := x ↦ H(x−) := lim_{u↑x} H(u) be the left-continuous version of H, and if H possesses limits from the right, then we let H_+ := x ↦ H(x+) := lim_{u↓x} H(u) be the right-continuous version of H. We set ∆H := H − H_−.

Theorem 1.
(1) For any random variable S, S ≥_MHR S; (2) for any S, T, and U such that S ≥_MHR T and T ≥_MHR U, it holds that S ≥_MHR U; and (3) for any strictly increasing continuous function ψ with lim_{x→−∞} ψ(x) = −∞ and lim_{x→+∞} ψ(x) = +∞, S ≥_MHR T implies that ψ(S) ≥_MHR ψ(T).
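Part (3) can be illustrated numerically. For a strictly increasing differentiable ψ, the hazard of ψ(S) at y is λ_S(ψ^{-1}(y))/ψ′(ψ^{-1}(y)), so the Jacobian factor cancels in the hazard ratio, leaving the original ratio evaluated at ψ^{-1}(y). A small sketch with illustrative hazards (the particular λ_S, λ_T, and ψ below are our own choices, not from the paper):

```python
import numpy as np

# hazards with ratio theta(x) = 2x, which is increasing
lam_S = lambda x: 2.0 * x                 # Weibull(shape 2) hazard
lam_T = lambda x: np.ones_like(x)         # unit exponential hazard

psi = lambda x: x ** 3 + x                # strictly increasing transform
dpsi = lambda x: 3.0 * x ** 2 + 1.0       # its derivative

x = np.linspace(0.1, 3.0, 100)
# hazard of psi(S) at y = psi(x) is lam_S(x) / dpsi(x), and likewise for psi(T),
# so the Jacobian 1/dpsi cancels in the ratio
ratio_transformed = (lam_S(x) / dpsi(x)) / (lam_T(x) / dpsi(x))
```

Since y = ψ(x) is increasing in x, the transformed ratio is monotone in y exactly when the original ratio is monotone in x, as Theorem 1(3) asserts.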
These estimators are averaged to obtain a pooled estimator θ̄_{n,m} = (1/m) Σ_{j=1}^m θ_{n,j}. Finally, an asymptotic (1 − α)-level confidence interval for θ(x) is given by θ̄_{n,m}(x) ± t_{1−α/2,m−1} σ_{n,m}(x)/√m, where σ_{n,m}(x) is the empirical standard deviation of the m subset estimators {θ_{n,1}(x), …, θ_{n,m}(x)} and t_{p,k} is the pth quantile of the t distribution with k degrees of freedom.
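The sample-splitting interval can be sketched as follows. The sample mean of exponential draws stands in for the pointwise subset estimator θ_{n,j}(x) (a hypothetical stand-in, purely for illustration), and 2.776 is t_{0.975,4}, the quantile needed for m = 5 splits:

```python
import numpy as np

T_975_DF4 = 2.776   # 0.975 quantile of the t distribution with m - 1 = 4 df

def split_ci(data, estimator, m=5, t_quantile=T_975_DF4):
    """Pooled sample-splitting estimate and 95% CI at a single point:
    average m disjoint-subset estimates and use their empirical sd / sqrt(m)."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(data))                   # random split into m folds
    ests = np.array([estimator(data[idx[j::m]]) for j in range(m)])
    pooled = ests.mean()
    se = ests.std(ddof=1) / np.sqrt(m)
    return pooled, pooled - t_quantile * se, pooled + t_quantile * se

# toy stand-in for theta_n(x): the sample mean of exponential draws
rng = np.random.default_rng(3)
data = rng.exponential(2.0, 5000)
est, lo, hi = split_ci(data, np.mean)
```

The t quantile with m − 1 degrees of freedom accounts for the small number of subset estimates; for a different m, the corresponding t_{1−α/2,m−1} should be used.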

Figure 2: Top: absolute bias of the three estimators scaled by n^{1/3} as a function of x. Bottom: variance of the three estimators scaled by n^{2/3} as a function of x. The rows correspond to linear, convex, and concave θ. The first column is our estimator, the second column is the sample splitting estimator, and the third column is the kernel smoothing estimator.

Figure 4: Results of the analysis of the pulmonary adenocarcinoma data. Left panel: the Nelson-Aalen estimator for the gefitinib group plotted against that of the carboplatin-paclitaxel group, along with the corresponding greatest convex minorant. Right panel: estimated hazard ratio function and 95% pointwise confidence intervals using our method and the Cox proportional hazards model.

Figure 6: Relative bias and variance of the estimators. Upper panel: the relative absolute bias of our monotone estimator over that of the smoothing estimator (left column) and sample splitting estimator (right column). Lower panel: the relative variance of our monotone estimator over that of the smoothing estimator and sample splitting estimator. The solid gray horizontal line represents a ratio of 1.

Figure 7: Relative mean squared error (MSE) of the estimators. Left column: the relative MSE of our monotone estimator over that of the smoothing estimator. Right column: the relative MSE of our monotone estimator over that of the sample splitting estimator. The solid gray horizontal line represents a ratio of 1.
and therefore F_T as well, have jumps at F_{T,+}^-(v). Hence, R_+(v) > R(v) if and only if F_S and F_T are both flat on [F_T^-(v), F_{T,+}^-(v)) and have a jump at F_{T,+}^-(v), and v