Bayesian Survival Tree Ensembles with Submodel Shrinkage

We consider Bayesian nonparametric estimation of a survival time subject to right-censoring in the presence of potentially high-dimensional predictors. We argue that several approaches, such as random survival forests and existing Bayesian nonparametric approaches, possess several drawbacks, including: computational difficulties; lack of known theoretical properties; and ineffectiveness at filtering out irrelevant predictors. We propose two models based on the Bayesian additive regression trees (BART) framework. The first, Modulated BART (MBART), is fully-nonparametric and models the failure time as the first occurrence of a non-homogeneous Poisson process. The second, CoxBART, uses a Bayesian implementation of Cox’s partial likelihood. These models are adapted to high-dimensional predictors, have default prior specifications, and require simple modifications of existing BART methods to implement. We show the effectiveness of these methods on simulated and benchmark datasets. We also establish that, for a simplified variant of MBART, the posterior distribution contracts at a near-minimax optimal rate in a high-dimensional sparse asymptotic regime.


Introduction
We consider the nonparametric conditional survival analysis problem, where our goal is to assess the impact of P predictors x = (x 1 , . . . , x P ) on the survival function S(t | x) = Pr(T > t | X = x) and the hazard function h(t | x) = − d dt log S(t | x). The time-constant predictors x may include treatments and prognostic markers in a clinical study. A popular semiparametric model in survival analysis is the Cox proportional hazards model (Cox, 1972) h(t | x) = λ(t) e x β , where λ(t) is a nonparametric baseline hazard model. Instead of this restrictive assumption, methods based on decision trees (Breiman et al., 1984) construct a partition of the predictor space X and estimate S(t | x) separately for each equivalence class. Decision trees are also used as building blocks for ensemble methods. For example, the random survival forests algorithm (Ishwaran et al., 2008) aggregates many decision trees together to obtain a flexible estimate of S(t | x). Models based on Bayesian additive regression trees (BART) have also been proposed.
While Bayesian nonparametric methods for survival regression analysis exist, we argue that they often lack the following properties we feel to be desirable: (i) An argument for using Bayesian nonparametrics is that it allows us to use a prior which centers on a specified parametric model (similar to a "prior guess"), effectively "shrinking" the fitted model towards the parametric structure while allowing the model to adapt to lack-of-fit when the parametric structure is incorrect. For example, parametric accelerated failure time (AFT) models and the semiparametric Cox models are two examples of such "prior guesses." This gives us the best of both worlds: the flexibility of a nonparametric model and (when the guess is supported by data) the stability of a (semi) parametric model. Arguments for the desirability of this property include: maintenance of interpretability when the prior guess is accurate (Müller and Mitra, 2013); increased stability of inference with small sample sizes; losing the minimal amount of predictive accuracy when the prior guess is accurate; and guarding oneself from under-fitting when the prior guess is inaccurate (Dunson, 2009).
(ii) The prior should be able to adapt to structure in the data, such as low-order interactions in the predictors, sparsity, or smoothness of the hazard as a function of time.
(iii) The posterior should be computationally tractable.
(iv) The prior should have "large support," with the posterior ideally concentrating at a near-minimax rate adaptively over a variety of function spaces.
We propose two models using the BART framework. The first approach, which we refer to as Modulated BART (MBART), is fully nonparametric and satisfies (i)-(iv) above. The MBART model sets h(t | x) = λ(t | x, θ) Φ{g(t, x)} where λ(t | x, θ) is the hazard of a (semi) parametric model parameterized by θ that we wish to shrink towards, while g(t, x) is a decision tree ensemble that controls deviations from the base model through the link function Φ : R → [0, 1]. The second approach, which we refer to as CoxBART, is based on a Bayesian interpretation of the Cox partial likelihood. CoxBART is less computationally intensive than MBART and retains the simpler interpretation of the Cox proportional hazards model; CoxBART is also a useful point of comparison for MBART because we will often shrink the MBART model towards a proportional hazards model.
We provide the first theoretical guarantees for BART survival models. We show that a simplified version of our MBART model, when combined with the smooth decision trees used by Linero and Yang (2018), adapts to sparsity in h(t | x) when x ∈ R P is high dimensional but h(t | x) depends on D P predictors. MBART also adapts to the smoothness level of h(t | x). In both cases, we obtain near-minimax rates of convergence with respect to a type of Hellinger distance.

Related Methods
There are several existing approaches to using the BART framework with survival data. Sparapani et al. (2016) developed a fully nonparametric regression method for discrete survival data. This approach can be used for continuous survival data only after discretizing the N observed survival times to a grid of N grid times, and the fitted model depends on both the number and location of the N grid grid points. The Gibbs sampler developed by Sparapani et al. (2016) has computational complexity Ω(MNN grid ) where M is the number of trees in the ensemble; this is substantially more computationally intensive than usual BART algorithms, which have complexity Ω(MN), and forces N grid to be small. It also violates (i) by not being centered on any (semi)parametric submodel. Bonato et al. (2010) proposed a variety of semiparametric models based on BART, including semiparametric accelerated failure time models of the form log T i = g(X i ) + i and a Weibull regression model. Most important for our purposes, they considered the proportional hazards model This latent variable structure is imposed only to allow for existing Gibbs samplers to be used with ω i as the response; the ω i 's can then be updated by Metropolis-Hastings during Gibbs sampling. This model is essentially a frailty model h(t | x, z) = λ(t) exp{g(x) + z} where exp(Z i ) is a log-normal frailty. Similar to the identifiability issues of the frailty distribution and marginal regression structure for semiparametric univariate survival models (Oakes, 1989;Hougaard, 2000), the distribution of ω i is not identifiable and the marginal distribution of T i given X i does not preserve the proportional hazards structure (thus g(x) fails to describe a proportional hazards relationship between T i and X i ). By contrast, our CoxBART model induces a proportional hazards model marginally. Henderson et al. (2020) introduced an AFT model with large support in the class of all AFT models log T i = g(X i ) + i by modeling the residual distribution i ∼ F as a Dirichlet process mixture; this accomplishes a similar goal as our CoxBART model by allowing for the use of BART in a large class of nonparametric survival models.
Our proposed modeling strategy is similar in spirit to recent work of Li et al. (2020) on nonparametric conditional distribution estimation. In concurrent work by the authors of this paper, the methodology is extended to the setting of interval-censored clustered survival times (Basak et al., 2020); the present work differs by incorporating targeted smoothing, centering the nonparametric model on a "prior guess," providing theoretical justification, and by introducing the CoxBART model. A similar data augmentation algorithm to the one described here is used with Gaussian processes to perform survival analysis (Fernández et al., 2016). More generally, there is a sizable literature on Bayesian nonparametric survival analysis based on Dirichlet processes and Gaussian processes. See, for example, De Iorio et al. (2009), who develop an ANOVA-DDP model to perform fully-nonparametric Bayesian survival analysis.

Outline of the Paper
In Section 2 we describe the MBART and CoxBART models. In Section 3 we propose several base models that MBART can be shrunk towards. In Section 4 we study the theoretical properties of a special case of the MBART model, establishing posterior concentration rates. In Section 5 we illustrate MBART and CoxBART on simulated data and on publicly available data on liver disease. We conclude in Section 6 with a discussion. Additional computational details, proofs of the main theorems, and auxiliary results are given in the Supplementary Material (Linero et al., 2021).

Survival Models Using Bayesian Tree Ensembles
Let (T i , C i ) denote the survival and censoring times respectively for i = 1, . . . , N. We . Throughout this paper we assume that the censoring time C i is independent of T i given X i . is constant on each equivalence class of the partition. A schematic showing a particular decision tree with the induced partition over X = [0, 1] 2 is given in Figure 1. We divide the decision tree nodes into a collection of leaf nodes ∈ L and branch nodes b ∈ B, where L consists of the nodes with no children. Associated to each branch b is a decision rule of the form [X j b ≤ C b ], while each leaf is associated to a prediction μ m .

Review of Bayesian Additive Regression Trees
We write g ∼ BART(π T , π M ) for the BART prior with prior T m iid ∼ π T and μ m iid ∼ π M given {T m }. It is standard to take π M to be Normal(0, σ 2 μ ) where σ μ ∝ M −1/2 ; this is conditionally conjugate, and ensures that Var{g(x)} = Mσ 2 μ does not depend on the number of trees M . We take π T to be the prior described by Chipman et al. (1998), which can be sampled from by initializing T m with a single root note of depth d = 0. This node is then made branch with probability γ/(1 + d) β and a leaf node otherwise; if the node is a branch node, we add its two children at depth d + 1. This process then iterates over all the nodes of depth d = 1, 2, . . . until all of the nodes at some depth are leaf nodes. We use the following prior on the splitting rules through this paper. First j b ∼ Categorical(s) for some probability vector s. Then, given j b and the values of (j b , C b ) for the ancestor nodes of b, we set is the hyperrectangle of x-values which are associated to branch b. Following Linero (2018), we set s ∼ Dirichlet(α/P, . . . , α/P ) and in our illustrations use α/(α + P ) ∼ Beta(0.5, 1). This prior for s encourages the model to concentrate on models with a small number of relevant predictors. We also review the soft Bayesian additive regression trees (SBART) prior defined by Linero and Yang (2018). Note that g ∼ BART(π T , M T ) can equivalently be expressed as where w m (x) = 1 or 0 according as x is associated with leaf of tree m or not. SBART has the form (1), but uses smooth weights w m ( is the indicator that leaf of tree m is associated to the right path of branch b. The function ψ(x; C, τ ) is the cumulative distribution function of a continuous symmetric random variable; throughout this work, we will take ψ(x; C, τ ) = expit{(x − C)/τ } where expit(x) = (1 + e −x ) −1 . We note that, in the limit as τ → 0, we revert to the usual (non-soft) decision trees. Trees which use smooth weights w m (x) are referred to as soft decision trees; see, for example, Irsoy et al. (2012). When g(x) has an SBART prior we will write g ∼ SBART(π T , π M ). Using SBART makes the function g(x) continuous in x, which leads to better theoretical and practical performance (Linero and Yang, 2018). A drawback of SBART is that it is more computationally intensive to fit.

Modulated BART
Modulated BART (MBART) models hazard function h 0 (t | x) as the random function where Tree(x; T m , M m ) is a smooth decision tree, and λ(t | x, θ) is either a parametric or semiparametric model which serves as the "best guess" at h 0 (t | x). Under (2), a random variable T can be generated by simulating a Poisson process with intensity λ(t | x, θ) on (0, ∞), thinning the points with probability Φ{g(t, x)}, and setting T to be the smallest accepted point.
We treat t different in g(t, x) in order to facilitate shrinkage towards a semiparametric proportional hazards model (See Section 3). In particular, we use the targeted smoothing framework of Starling et al. (2020). This also ensures that estimated survival functions will be smooth in time even if we replace SBART with BART for computational purposes, and removes the need to standardize T i to be supported on [0, 1] (which may be difficult due to right-censoring).
We implement targeted smoothing over t using a random Fourier series ψ (Li et al., 2020;Rahimi and Recht, 2008;Fernández et al., 2016). This approximates the ideal model proposed by Starling et al. (2020) T ) computations where N T is the unique number of observed failure times. By contrast, the random Fourier series imposes no additional burden on the original BART algorithm. For more details on using random Fourier series with BART, see Li et al. (2020).
Our preference is to choose a prior which encourages a 0 in (2) so that the resulting prior puts a high weight around the event Φ{g(t, x)} ≈ 1.

Default Prior Specification of the Thinning Process
An advantage of the BART framework is that it is straight-forward to develop a default prior which works well in practice. For our survival model we must specify π T , π M , a, the link Φ, the base model λ(t | x, θ), and the associated prior for θ. In our illustrations we take Φ(·) to be the probit link. We discuss specification of the base model in Section 3. We specify a Half-Normal(1, 1) prior for a to encourage larger values of g(t, x); as previously noted, this shrinks our model towards λ(t | x, θ). Encouraging g(t, x) to be large has the additional benefit of improving the efficiency the data augmentation scheme described below. We set σ μ ∼ Half-Cauchy(0, 1.5/ √ M ) and use the default prior for π T described in Section 2.1.

Data Augmentation
The observed data likelihood of model (2) has the form This likelihood is inconvenient to work with both because of the analytically intractable integral and because it does not cleanly allow for the use of existing Bayesian backfitting algorithms for fitting BART models. To construct a Markov chain Monte Carlo (MCMC) algorithm for MBART we use two steps of data augmentation. The first step removes the intractable integral

θ) ds is the cumulative hazard of the base model.
A variant of Proposition 1 is used by Adams et al. (2009) to fit non-homogeneous Poisson processes. For completeness, we give a simple proof of this result.
Proof. The likelihood component of (Y 1 , . . . , Y N ) is given by (3), whereas the likelihood component of the event times {W ij : i = 1, . . . , N, j = 1, . . . , J i } simulated independently from our non-homogeneous Poisson processes on (0, Multiplying these quantities and noting that the exponential terms can be combined to is the probit link, we can use the approach of Albert and Chib (1993) to simplify the likelihood further. We introduce The usual Bayesian backfitting algorithm of Chipman et al. (2010) can now be applied by treating the Z ij 's as the response. Similarly, the logistic link Φ(μ) = (1 + e −μ ) −1 or Student's T link can be implemented using the Gaussian scale-mixture representation of the logistic and Student's T distribution (Holmes and Held, 2006). Detailed algorithms are given in the Supplementary Material.

Proportional Hazards with CoxBART
We now consider the proportional hazards model is the baseline hazard and g(x) is unknown, with g(x) estimated using the Cox partial likelihood PL(g) = i:δi=1 Expression (5) arises under an improper prior for Λ(t). Consider a discrete time proportional hazards model Sinha et al. (2003) show that PL(g) is the integrated likelihood of g(x) when the parameters φ i are given an improper data-dependent prior where the t i 's are set equal to the observed values of the The conditional distribution of (φ 1 , . . . , φ N , D) given g under this model is From (6) . Integrating out φ, we obtain PL(g).
We refer to this model with g ∼ BART(π T , π M ) as CoxBART. CoxBART is a nonparametric model in the sense that g(x) is nonparametric, but we do assume proportional hazards. It is more flexible than the usual Cox model, where g(x) = x β, but is not as flexible as MBART. CoxBART is interesting both for providing a benchmark to measure MBART against and independently due to the common use of the proportional hazards assumption in practice. A related model is given by Bonato et al. (2010), however this model only satisfies the proportional hazards assumption conditional on latent variables.
Conveniently, as shown below, (6) can be combined with the log-linear BART prior of Murray (2020) to construct a Bayesian backfitting algorithm for sampling g(x) and φ.

Bayesian Backfitting for CoxBART
Let T −m and M −m denote the tree topologies and leaf node parameters for all trees except for the m th , and let x ( , m) mean that x is associated to leaf of T m . In order to implement the generalized Bayesian backfitting approach of Hill et al. ( where

Base Models for MBART
MBART can be used with essentially any base hazard λ(t | x, θ). It is ideal for λ(t | x, θ) to be a good approximation to the true hazard h 0 (t | x) for two reasons. First, a good specification of λ(t | x, θ) will encourage the model to disregard the nonparametric component, reducing the model complexity. Second, the efficiency of the MCMC algorithm depends on the sizes of the J i 's, which will be large if the base model fits poorly.
Parametric Weibull A commonly used parametric subclass of the proportional hazards model is the Weibull model λ(t | x, θ) = κ e b x t κ−1 , where θ = (κ, b) (Ibrahim et al., 2001, Section 2.2). The full conditional of θ when using Proposition 1 is given by This full conditional can be sampled using Hamiltonian Monte Carlo.
Weibull-BART A semiparametric variant of the Weibull model can be developed using the log-linear BART model of Murray (2020). This model sets Here, θ = (κ, r). Similar to CoxBART, we set η m ∼ log Gam(α η , β η ). As a default, we choose α η and β η so that E(log η m ) = 3 σ/(2 √ M ) where σ is an estimate of the standard deviation of log Y . In Section 3.1, we derive a simple Bayesian backfitting algorithm for this model.

CoxBART2
It is possible to shrink towards a semiparametric proportional hazards model using the Weibull-BART base model. In the special case where g(t, x) does not depend on x, the Weibull-BART model simplifies to h(t | x) = λ (t) e r(x) where λ (t) = κt κ−1 Φ{g(t)}, which is a Cox proportional hazards model. Because g(t) is modeled nonparametrically, the baseline hazard λ (t) has a very flexible prior. We can shrink towards g(t, x) ≡ g(t) by using a prior which encourages the tree structures to consist of only a root node. To accomplish this, we simply choose the tree parameters (γ, β) so that most trees consist only of the root node (e.g., by taking γ small). A similar approach is used by Hahn et al. (2020) in the context of causal inference in order to shrink towards a homogeneous treatment effect.
As a default, we have used Weibull-BART in all of our illustrations and simulations, with κ ≡ 1; equivalently, we have used an exponential distribution for the base model. For some discussion of possible priors for κ, see Van Niekerk et al. (2020), who recommend a penalized complexity prior.

Bayesian Backfitting for Weibull-BART
After performing the data augmentation in Proposition 1, the likelihood becomes , and define the leaf node suffi- if X i is associated to leaf of tree m. Then the likelihood of M r m holding all other quantities fixed is This is proportional to a product of log Gam(α η + N , β + E ) densities. Integrating out the η m 's, we obtain the conditional distribution where α = α η + N and β = β η + E . Similarly, the full conditional of η m is log Gam(α , β ). We can now update (T r m , M r m ) by first updating T r m using Metropolis-Hastings and then sampling M r m from its full conditional.

Theoretical Results
We now establish convergence rates for the MBART posterior. These results are similar in spirit to results of Linero and Yang (2018) for regression and Li et al. (2020) for conditional distribution estimation. We operate in the Frequentist setup, with uncoarsened data (T 1 , C 1 , X 1 ), . . . , (T n , C n , X n ) sampled iid from a joint distribution F 0 . Throughout, we make the independent censoring assumption.

Condition R (random censoring)
The true joint distribution F 0 of (T i , C i , X i ) is such that T i and C i are conditionally independent given X i .
We assume the T i 's have conditional density p 0 (t | x) on (0, ∞). The C i 's are assumed to be bounded by a constant C i ≤ C (typically the time period of the study), with conditional density f C (c | x) = p C (c | x)I(c < C) + S C (c | x) I(c = C) with respect to the sum of Lebesgue measure on (0, C) and a point mass at C; we write ν(dt) for this measure, and we will use it throughout this section. Without loss of generality we assume that C ≤ 1. The constant C represents the time of the end of the study, so that all observations with T i > C are censored. We define Y i = min(T i , C i ) and δ i = I(T i ≤ C i ). We also define Y i = min(T i , C), so that we can study the setting of fixed censoring within this framework as well. Let D n = {Y i , δ i , X i : i = 1, . . . , n} and let D n = { Y i , X i : i = 1, . . . , n}.
For simplicity we assume that the baseline hazard λ(t) is fixed, so that we need only consider a prior distribution Π on g(t, x) and its associated hyperparameters. To each u : ds is a cumulative hazard. We make the following assumption about the true hazard function h 0 (t | x), which guarantees that we can define the "true" value of g 0 (t, x) by Φ −1 (h 0 /λ).
At face value Condition H is highly restrictive, as it requires an a-priori known upper bound for h 0 (t | x). We stress that this is a minor technical convenience, as if (say) h 0 (t | x) is bounded then the theory we develop goes through with λ(t) ≡ λ with λ given an appropriate prior. Additionally, the restriction that x ∈ [0, 1] P is not restrictive, as in practice we will always perform a quantile transformation to all predictors.

Fixed Censoring
Fixed censoring occurs if T i is observed as long as T i ≤ C. We study this by conditioning on D n . Given g(t, x), the Y i 's have conditional density given by with respect to ν(dy). We measure the accuracy of the posterior distribution using the integrated Hellinger metric H(g 0 , g) defined by where F X is the true distribution of the X i 's. While not used as part of the model, integrating with respect to F X gives a natural metric by which to judge estimation accuracy. Our goal is to establish Π{H(g 0 , g) ≤ K n | D n } n→∞ −→ in F 0 -probability for a sequence n ↓ 0 and some fixed positive constant K; in this case, we say that the convergence rate of the posterior is faster than n . Under the assumption that g 0 (t, x) is an α-Hölder smooth function depending on D coordinates, the oracle minimax estimation rate when α and the relevant coordinates of (t, x) are known is n = n −α/(2α+D) . The following theorem, which is proved in the Supplementary Material, shows that we adaptively obtain this rate up-to a logarithmic term log(n) t and a variable selection term D log(P + 1)/n. This result allows P (but not D) to diverge, and permits consistent estimation even when log P grows nearly linearly with n.

Theorem 1. Suppose that Condition H, Condition L, and Condition P hold. Let
in F 0 -probability for some sufficiently large constant K.

Random Censoring
Our results for a fixed censoring time C extend in a straight-forward fashion to the case where the C i 's are bounded with C i ≤ C. Under the independent censoring assumption, the joint density of (Y i , δ i ) given X i = x and g is We now study posterior concentration with respect to the integrated Hellinger distance In addition to the covariate distribution F X , H q also depends on the distribution of C i . In the Supplementary Material we prove the following result.
Remark 1. Our proof of Theorem 1 is based on checking the sufficient conditions for posterior convergence rates given by Ghosal et al. (2000). To prove Theorem 2 we use a monotonicity property of f -divergences (Ali and Silvey, 1966) to show that the sufficient conditions used to prove Theorem 1 also suffice to prove Theorem 2; for example, the fact that the Hellinger distance is an f -divergence can be used to show that H q (g 0 , g) H(g 0 , g). This strategy works because all divergences used in Theorem 2.1 of Ghosal et al. (2000) are f -divergences, or can be made equivalent to an f -divergence after applying some linear interpolation (see Lemma S.3 in the Supplementary Material).

Remark 2. We cannot make progress on divergences like H
, which is accomplished by H 2 (g 0 , g), as well as H 2 q (g 0 , g) when S C (y | x) (the probability that an individual is not censored before the end of the study) is bounded away from 0.

Simulation Experiment
We conduct a simulation study to assess (i) the ability of MBART and CoxBART to capture nonlinear relationships and (ii) to assess what one loses when CoxBART is used when the proportional hazards assumption fails. The function f (x) = sin(πx 1 x 2 )+2(x 3 − 0.5) 2 +x 4 +0.5x 2 used in S2 and S3 is a nonlinear function described by Friedman (1991), having linear, nonlinear, and interaction effects. We consider the following models for the hazard.
• S5, ZK: T i ∼ Gam{α(x), 0.5} where α(x) = 0.5 + 0.3| 15 j=11 x j | and P = 25. This simulation is taken from Zhu and Kosorok (2012) and strongly violates the proportional hazards assumption. Covariates are simulated independently from a Uniform(0, 1) distribution, with the exception of S5 where the covariates X i are multivariate normal with mean 0 and covariance matrix V with entries V jk = (0.75) |j−k| . The function h Gam (t) used in S1 and S3 as a baseline hazard denotes the hazard function of a Gam(3, 1) random variable. For each simulation setting we took N = 500, except for the ZK setting where we took N = 300 to match Zhu and Kosorok (2012).
The purpose of S1 is to determine how much is lost by the nonparametric models when a simple semiparametric model holds. The purpose of S2 and S3 is to assess how much is lost if we use MBART instead of CoxBART when the proportional hazards assumption holds. Settings S4 and S5 are designed to assess how well MBART performs relative to other nonparametric techniques like random survival forests (Ishwaran et al., 2008), as well as how much is lost using CoxBART when the proportional hazards assumption fails.
Figure 2 displays randomly sampled survival curves for the settings S3, S4, and S5. Setting S3 obeys the proportional hazards assumption; consequently, the shape of each survival curve is similar and there is no curve crossing. Setting S5 also has no curve crossing because the Gam(α, 0.5) distribution is stochastically increasing in α; the shape of the survival curve, however, varies substantially. Depending on the value of α, both decreasing and increasing hazard functions are possible under S5. Setting S4 allows for both a variety of shapes for the survival curves and the possibility for the curves to cross. We compare the following methods.
• MBART-Light: The MBART model with mild shrinkage towards a proportional hazards model. The baseline model is an exponential BART model (i.e., Weibull-BART with κ = 1). This is the default prior described in Section 2.2.  Table 1: Results for the simulation study described in Section 5.1 for the settings S1-S5. Each entry is the integrated mean squared error in estimating the survival function. All results are relative to the best performing method, i.e., the best performing method always has the result 1.00.

Cox-Linear CoxBART RSF MBART MBART-Heavy
• MBART-Heavy: The same as the default MBART model, but with γ = 0.3 so that most trees in the ensemble a-priori do not split on any covariates. This encourages the model to resemble the CoxBART2 base model described in Section 3.
• CoxBART: The CoxBART model with the default prior.
• Cox-Linear: The semiparametric Cox proportional hazards model fit with the log-linear link h(t | x) = λ(t) exp(x β). This is fit with the coxph function in the survival package in R.
• Random Survival Forests: The random survival forests algorithm (Ishwaran et al., 2008) fit with the randomForestSRC package in R. This fits a fully-nonparametric model to the survival function, and does not invoke the proportional hazards assumption.
• Exp-BART: the exponential BART base model used with our MBART models. This is included to determine if the base model is adequate by itself.
We attempted to fit the surv.bart function in the BART package in R. Unfortunately, because we do not coarsen the time scale, the algorithm was too memory intensive.
We assess the performance of each method according to how well they estimate the conditional function S(t | x). As a measure of accuracy, we consider the average integrated squared distance between the estimated survival function S(t | x) and the true survival where F X denotes the true distribution of the predictors and T max is the maximum of all observed survival times in the sample. Because MSE cannot be computed in closed form, we approximate the integral numerically as where the dt integral is computed numerically and the X 1 , . . . , X N are heldout covariates sampled independently from F X . Results are averaged over 200 simulated datasets. We used N * = 100 held-out covariates and approximated the dt integral using a grid of size 500.
Results are given in Table 1. Except for S1, either CoxBART or MBART-Light performs the best. Predictably, CoxBART performs the best when the nonparamet- Curiously, MBART has mediocre performance under S2, but does manage to outperform CoxBART under S1 despite not making the proportional hazards assumption. The performance of Exp-BART indicates that the baseline model alone is typically not adequate, and performs very poorly in S1 and S3 where the baseline hazard is far away from an exponential.
To better understand how the presence of irrelevant predictors affects the resulting estimators, we conducted an additional simulation under S4 with the number of predictors P being increased. To ensure that the comparison with Cox-Linear is fair, we use a lasso penalty implemented with the glmnet package in R (Tibshirani, 1997). We consider only MBART-Light, as the performance of MBART-Light and MBART-Heavy is similar in Table 1. Results are given in Table 2 for P ∈ {25, 50, 100, 200, 400}. Across all settings, we see that MBART outperforms the other methods, and that all methods except for random survival forests are robust to the inclusion of irrelevant predictors. The random survival forests algorithm performs poorly as the dimensionality of the problem increases; at P = 50, random survival forests perform the same as CoxBART, but by P = 400 it performs worse than all methods except for Exp-BART. The lack of robustness of random forests to irrelevant predictors has been noted in other works (Zhu et al., 2015;Linero, 2018) as well, and does not depend on the selection of tuning parameters.
We conclude that both MBART and CoxBART are generally effective for situations when the semiparametric Cox proportional hazards model h(t | x) = λ(t) exp(x β) fails. For situations where the nonparametric proportional hazards model h(t | x) = λ(t) exp{f (x)} holds CoxBART performs best, while for situations where the nonparametric proportional hazards model fails MBART performs best. Both methods are highly robust to the inclusion of irrelevant predictors, which is not true of random survival forests.

Liver Disease Data
We analyze a publicly available dataset on time to death (scaled to have standard deviation 1) for patients of primary biliary cirrhosis (PBC), a chronic disease in which the bile ducts of the liver are slowly destroyed (available as the pbc dataset in randomForestSRC).
The survival times were subject to right censoring due to either liver transplant or survival beyond the end of the study. We select this study for re-analysis to demonstrate the capability of our methods (MBART and CoxBART) to accommodate unknown non-linear and time-varying effects of the covariates on the hazard function, because it has observed in prior analyses that at least two covariates (Bilirubin and Protime) have such effects. We first fit the usual semiparametric Cox proportional hazards model λ(t | x) = λ 0 (t)e x β using the following covariates: age, baseline bilirubin level (Bilirubin), baseline albumin level (Albumin), presence of edema, and prothrombin time (Protime). An analysis of the martingale residuals (Fleming and Harrington, 2011, Section 4.6) of the predictors shows that some covariates have non-linear and possibly time-varying effects on the hazard rate; for example, the top left panel of Figure 3 suggests a non-linear effect of bilirubin which can be accommodated by using a log transformation (top right panel). CoxBART detects these effects automatically, eliminating the need to look for appropriate transformations, and also outperforms a similar generalized additive model based on natural cubic splines. To demonstrate this, we performed 5-fold cross-validation and computed the held-out deviance −2 log PL = −2 5 k=1 log PL −k where PL −k denotes the Cox partial likelihood obtained by regressing (Y i , δ i ) in the k th fold on the estimated risk g(X i ), where CoxBART used the posterior mean of g(X i ) for g(X i ). This was replicated across 10 different splits into 5-folds and averaged over the 10 splits. Results are given in Table 3. The best performing proportional hazards model uses hand-selected transformations for the continuous predictors -in particular, taking the logarithm of the bilirubin level -but CoxBART performs nearly as well (and much better than the linear and GAM models).
While CoxBART is able to automatically find an appropriate transformation of the predictors, the Schoenfeld residuals in Figure 3 (middle left) suggest a time-varying effect of prothrombin time (Protime) which cannot be accommodated using a nonlinear transformation, implying the proportional hazards assumption does not hold. A formal test for the violation of the proportional hazards assumption using the cox.zph function in R gives a significant global test, primarily driven by the effect of protime (P -value ≈ 0.003). To show that MBART is able to accommodate this failure of the proportional hazards assumption, we use 5-fold cross validation to compare the fit of MBART to the CoxBART2 model described in Section 3; we use CoxBART2 here because it admits a density and so can be readily compared to MBART using likelihoodbased criteria. We measure predictive accuracy using held-out deviance of the data dev = −2 and S ρ(i) denote the estimated density and survival function computed on the fold which excluded observation i. This experiment was replicated on all 10 splits. MBART outperformed CoxBART2 over all 10 splits, with the average difference in deviance between the two models being dev Cox − dev M = 7.81. Hence MBART is able to obtain a better fit by weakening the proportional hazards assumption.
Interestingly, MBART and CoxBART give similar risk measures for each individual; see Figure 3 (middle right), where λ is the posterior mean of g(X i ) for the CoxBART model and Λ is the estimated cumulative hazard max(Yi) 0 λ(t | X i , θ) Φ{g(t, X i )} dt. We see that MBART detects the same risk structure as CoxBART. The bottom panels of Figure 3 show the estimated survival function and hazard function as a function of protime when all other variables are fixed at their sample medians. We see that the estimated hazard functions are clearly not proportional; in particular, individuals with high values of protime have relatively constant risk, while individuals with low values of protime have a small risk at the beginning of the study and large risk at the end. The corresponding survival function estimates for high values of protime are similar to what was observed under the ZK simulation setting, which violated the proportional hazards assumption.
We conclude that MBART is able to effectively model the PBC data. Importantly, MBART finds the same structure as CoxBART, while also being able to account for failure of the proportional hazards assumption. This allows MBART to be used without needing to check assumptions regarding proportional hazards or lack of functional form fit.

Discussion
In this paper we introduced two survival models for right-censored data using Bayesian additive regression trees. The first model is based on the observation that a response T i with hazard h(t | x) can be modeled as the first occurrence of a non-homogeneous Poisson process with intensity function h(t | x); by taking h(t | x) to correspond to a thinned Poisson process λ(t | x, θ) Φ{g(t, x)} where λ(t | x, θ) is chosen to be tractable, we were able to construct a two-layer data augmentation scheme for updating the function g(t, x). A single-layer strategy using this idea has been used for Gaussian processes (Fernández et al., 2016), but is not applicable to our model due to our use of discrete parameters and only shrinks towards marginal models λ(t). We also establish theoretically that our thinned Poisson process model obtains near-minimax optimal rates of posterior contraction in the high-dimensional setting.
The second model we introduced is a nonparametric variant of the Cox proportional hazards model, which takes h(t | x) = λ(t) exp{g(x)} with the baseline hazard function λ(t) modeled nonparametrically. Inference in this model proceeds from the Cox partial likelihood PL(g). By using a Bayesian justification of the Cox proportional hazards model (Sinha et al., 2003) we were able to both construct a tractable Gibbs sampling algorithm and obtain a nonparametric estimate of the cumulative hazard function.
An interesting avenue for future research is to develop methods for summarizing the BART posteriors for these models. A key advantage that classical semiparametric models have is that they reduce the effect of covariates to interpretable scalar parameters. One possibility in this direction is to develop tools which systematically project the BART posterior onto more interpretable models (Woody et al., 2020).

Supplementary Material
Supplementary Material for Bayesian Survival Tree Ensembles with Submodel Shrinkage (DOI: 10.1214/21-BA1285SUPP; .pdf). Contains proofs of our theoretical results and explicit algorithms for our methods. Software for MBART. Contains code which replicates our analysis of our real data example.