Approximation of Bayesian models for time-to-event data

Random measures are the key ingredient for effective nonparametric Bayesian modeling of time-to-event data. This paper focuses on priors for the hazard rate function, a popular choice being the kernel mixture with respect to a gamma random measure. Sampling schemes are usually based on approximations of the underlying random measure, both a priori and conditionally on the data. Our main goal is the quantification of approximation errors through the Wasserstein distance. Though easy to simulate, the Wasserstein distance is generally difficult to evaluate, making tractable and informative bounds essential. Here we accomplish this task on the wider class of completely random measures, yielding a measure of discrepancy between many noteworthy random measures, including the gamma, generalized gamma and beta families. By specializing these results to gamma kernel mixtures, we achieve upper and lower bounds for the Wasserstein distance between hazard rates, cumulative hazard rates and survival functions.


Introduction
One of the most attractive features of the Bayesian nonparametric approach to statistical inference is the modeling flexibility implied by priors with large support. There are several classes of priors where this property is complemented by analytical tractability, thus contributing to making Bayesian nonparametrics very popular in several applied areas. See Hjort et al. (2010) and Ghosal & van der Vaart (2017) for broad overviews. In this framework, survival analysis stands out as one of the most lively fields of application. A prominent model for exchangeable time-to-event data is the extended gamma process for hazard rates (Dykstra & Laud, 1981), which allows for continuous observables and has been further generalized to kernel mixtures in Lo & Weng (1989) and James (2005). These works paved the way for another active line of research that defines priors for the hazard rates by relaxing the dependence structure between the observables, going beyond the exchangeability assumption. For example, Pennell & Dunson (2006), De Iorio et al. (2009) and Hanson et al. (2012) model subject-specific hazards based on continuous covariates; Lijoi & Nipoti (2014) and Zhou et al. (2015) define priors for cluster-specific hazards, while Nipoti et al. (2018) account for both individual and cluster covariates simultaneously. In this work we rather focus on priors for the hazard rates of exchangeable time-to-event data.
An important feature shared by most classes of nonparametric priors is their definition in terms of random measures and transformations thereof. While there is a wealth of theoretical results that have eased their implementation in practice, sampling schemes are typically based on approximations of the underlying random measures. Nonetheless, with few exceptions (Ishwaran & James, 2001; Arbel et al., 2019; Campbell et al., 2019), there is no extensive analysis on how to judge the quality of such approximations. Consider the common situation where one is interested in making inference or sampling from a wide class of random measures C, but can only treat a subclass C' because of convenient analytical or computational properties. The restriction to C' is usually argued through density statements, typically in terms of weak convergence of random measures. In many cases this reduces to the weak convergence of one-dimensional distributions, i.e. for every μ ∈ C there exists an approximating sequence {μ_n}_{n≥1} in C' such that μ_n(A) converges weakly to μ(A) for every Borel set A. This leaves out the possibility of establishing the rate of convergence and, more importantly, provides no guidance on the choice of the approximation μ_n to use in practical implementations. Spurred by these considerations, the goal we pursue is to quantify the approximation errors by evaluating the Wasserstein distance between μ_n(A) and μ(A). Since convergence in the Wasserstein distance implies weak convergence, this has the additional advantage of strengthening most results known in the literature.
The Wasserstein distance was first defined by Gini (1914) as a simple measure of discrepancy between random variables. During the 20th century it was rediscovered and studied in many other disciplines, such as transportation theory, partial differential equations, ergodic theory and optimization. Nowadays, depending on the field of study, it is known under different names, such as the Gini distance, coupling distance, Monge-Kantorovich distance, earth mover's distance and Mallows distance; see Villani (2008), Rachev (1985) and Cifarelli & Regazzini (2017) for reviews. Indeed, one can find it scattered across the statistics literature (Mallows, 1972; Dudley, 1976; Bickel & Freedman, 1981; Chen, 1995), though only in recent years has it achieved major success, especially in probability and machine learning. For a detailed review of the uses of the Wasserstein distance in statistics see Panaretos & Zemel (2018). As for the Bayesian literature, the Wasserstein distance first appeared in Nguyen (2013) and has mainly been used to evaluate approximations of the posterior distribution and to address consistency (Nguyen, 2013; Srivastava et al., 2015; Cifarelli et al., 2016; Gao & van der Vaart, 2016; Donnet et al., 2018; Heinrich & Kahn, 2018). These works deal with the convergence of the (random) Wasserstein distance between the attained values of random probability measures. In a similar vein, though without a specific statistical motivation, Mijoule et al. (2016) examine the Wasserstein convergence rate of the empirical distribution to the prior, namely the de Finetti measure, for an exchangeable sequence of {0,1}-valued random variables. Our approach goes in a different direction: we are interested in a distance between the laws of random measures rather than a random distance between measures.
The Wasserstein distance is easy to simulate (Sriperumbudur et al., 2012) but difficult to evaluate analytically; hence, tractable bounds are needed for concrete applications. We derive them in two steps. First, we determine bounds for the Wasserstein distance between so-called completely random measures, since they act as building blocks of most popular nonparametric priors. This is carried out by relying on results in Mariucci & Reiß (2018) on Lévy processes. The techniques we develop in this first part measure the discrepancy between the laws of many noteworthy random measures, including the gamma, generalized gamma and beta families. Second, we use these bounds to quantify the divergence between hazard rate mixture models that are used to analyze time-to-event data. These are then applied to evaluate the approximation error in a posterior sampling scheme for the hazards, in multiplicative intensity models, that relies on an algorithm for extended gamma processes (Al Masry et al., 2017).
The outline of the paper is as follows. After providing some basic notions and results on the Wasserstein distance and on completely random measures in Section 2, we determine upper and lower bounds for the Wasserstein distance between one-dimensional distributions associated to completely random measures in Section 3. This is, then, specialized to the case of gamma and beta completely random measures in Section 3.2. These results are the starting point for carrying out an in-depth analysis of hazard rate mixture models driven by completely random measures. In Section 4 we obtain a quantification of the discrepancy between two hazard rate mixtures and for the associated random survival functions. Examples related to its specification with mixing gamma random measures may be found in Section 4.3. Finally, in Section 5 we apply these results to evaluate the approximation error of a sampling scheme for the posterior hazards, conditional on the data. Proofs of the main results are deferred to Section 6.

Background and preliminaries
In this first section we recall some basic notions about completely random measures and their convergence in terms of the Wasserstein distance. Let X be a Polish space with distance d X and Borel σ-algebra X . The space M X of boundedly finite measures on X endowed with the weak topology is a Polish space as well; see Daley & Vere-Jones (2002). We denote by M X the corresponding Borel σ-algebra. A random measure is a measurable function from some probability space (Ω, Σ, P) to (M X , M X ).
Definition 1. If a random measure μ is such that, for any n ≥ 2 and any collection of pairwise disjoint bounded sets {A_1, ..., A_n} in X, the random variables μ(A_1), ..., μ(A_n) are mutually independent, then it is a completely random measure (crm).
Every crm can be uniquely represented as the sum of three independent components, μ + μ̃_f + μ̃, where μ is a fixed measure on X, μ̃_f is a random measure with fixed atoms and μ̃ is a random measure without fixed atoms. See Kingman (1967). Here we focus on crms without fixed atoms and rely on the fact that their distribution is uniquely determined by a Poisson random measure. Indeed,

μ(A) = ∫_{R+ × A} s N(ds, dx),   (1)

where N is a crm on R+ × X such that N(B) has a Poisson distribution of parameter ν(B) = E N(B) for every Borel set B ∈ B(R+) ⊗ X with ν(B) < +∞. The mean measure ν on R+ × X is referred to as the Lévy intensity of μ and is such that ν(R+ × {x}) = 0 for all x ∈ X and, for all bounded A in X and ε > 0,

∫_{R+} (s ∧ ε) ν(ds × A) < +∞,   (2)

where ∧ denotes the minimum. Motivated by Bayesian nonparametric modeling, we focus on Lévy intensities ν without atoms such that: (i) the integrability condition (2) holds for every A in X; (ii) for all A in X and ε > 0, ν((0, ε] × A) = +∞. The latter corresponds to assuming μ infinitely active. Infinite activity is a necessary requirement for defining random probability measures by normalization of crms. See Lijoi & Prünster (2010) for a review. Finally, in view of (1), the probability distribution of μ can be characterized through the Laplace functional transform

E[e^{−∫_X f(x) μ(dx)}] = exp( −∫_{R+ × X} (1 − e^{−s f(x)}) ν(ds, dx) ),   (3)

for all measurable functions f : X → [0, +∞). When dealing with convergence of random measures we think of random measures in terms of probability distributions on M_X. Results on strong convergence are often too hard to establish, so that one usually deals with weak convergence (of distributions) of random measures, L(μ_n) ⇒ L(μ), where L(X) denotes the probability distribution of a random element X, which can be either finite- or infinite-dimensional. A remarkable result establishes that this is equivalent to the weak convergence of all finite-dimensional distributions L(μ_n(A_1), ..., μ_n(A_d)) ⇒ L(μ(A_1), ..., μ(A_d)), for A_1, ..., A_d ∈ X stochastic continuity sets for μ; see Theorem 11.1.VII in Daley & Vere-Jones (2007). Moreover, when dealing with crms, the weak convergence of finite-dimensional distributions is equivalent to the weak convergence of one-dimensional distributions. Thus, if d denotes a metric on the space of probability distributions on R whose convergence is stronger than weak convergence, one has that if d(L(μ_n(A)), L(μ(A))) → 0 for every A ∈ X, then μ_n converges weakly to μ. In the sequel, we choose d as the Wasserstein distance with respect to the Euclidean norm on R. See Villani (2008). In order to define this distance in its full generality, let Y be a Polish space with respect to the metric d_Y. For any pair of random elements Y_1 and Y_2 taking values in Y, let C(Y_1, Y_2) denote the Fréchet class of Y_1 and Y_2, i.e. the set of random elements (Z_1, Z_2) on the product space Y^2 such that L(Z_i) = L(Y_i) for i = 1, 2.
Definition 2. The Wasserstein distance of order p ∈ [1, +∞) between L(Y_1) and L(Y_2) is defined as

W_p(L(Y_1), L(Y_2)) = inf_{(Z_1, Z_2) ∈ C(Y_1, Y_2)} E[ d_Y(Z_1, Z_2)^p ]^{1/p}.   (4)

We focus on the case p = 1 and (Y, d_Y) = (R, | · |). We denote such a distance by W and omit reference to the law L in the notation. In view of the previous discussion on weak convergence, a major goal that we pursue is evaluating or bounding W(μ_1(A), μ_2(A)), for A in X. The results and general techniques that will be detailed in the next sections make use of some known facts on Wasserstein distances and crms, concisely recalled here. First, for any pair of random variables (X, Y),

W(X, Y) ≤ E|X| + E|Y|.   (5)

Thus the Wasserstein distance is finite when the random variables have finite mean. We will therefore focus our attention on crms whose total mass has finite mean and refer to them as crms with finite mean. By Campbell's theorem this boils down to

∫_{R+ × X} s ν(ds, dx) < +∞.   (6)

Finally, we highlight two properties of the Wasserstein distance that will be used in many proofs. Let X, Y be random variables and let F_X, F_Y denote their distribution functions. Then, by Dall'Aglio (1956),

W(X, Y) = ∫_R |F_X(x) − F_Y(x)| dx.   (7)

Moreover, if X_1, ..., X_n are independent random variables and Y_1, ..., Y_n are independent as well, then, by Bickel & Freedman (1981),

W(X_1 + ... + X_n, Y_1 + ... + Y_n) ≤ W(X_1, Y_1) + ... + W(X_n, Y_n).   (8)


Wasserstein bounds for CRMs

General result
There are situations where one is only interested in a numerical value for the Wasserstein distance: in such a case there are efficient ways to simulate it; see Sriperumbudur et al. (2012). On the other hand, one may be interested in understanding how the distance is affected by the parameters of the distributions or by meaningful functionals, such as moments. This raises the need for an analytical evaluation of the Wasserstein distance, which in general is not an easy task. The most common practice is thus to develop informative bounds and to analyze how these are affected by the choices above. In this section we express a bound for the Wasserstein distance between the one-dimensional distributions of crms in terms of their corresponding Lévy intensities. The proof is based on a compound Poisson approximation of crms.
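When a numerical value suffices, the order-one distance on the real line can also be evaluated directly from the representation W(X, Y) = ∫_R |F_X(x) − F_Y(x)| dx recalled in Section 2. The following minimal sketch (the exponential distributions and their rates are chosen purely for illustration) approximates this integral by a midpoint rule and compares it with the closed form available when the distribution functions are ordered, namely the difference of the means.

```python
import math

def w1_from_cdfs(F1, F2, lo, hi, n=200000):
    """Approximate W(X, Y) = integral of |F_X(x) - F_Y(x)| dx
    (Dall'Aglio's representation) by the midpoint rule on [lo, hi]."""
    h = (hi - lo) / n
    return sum(abs(F1(lo + (i + 0.5) * h) - F2(lo + (i + 0.5) * h))
               for i in range(n)) * h

# Exponentials with rates l1 < l2: the CDFs are ordered (F1 <= F2 pointwise),
# so the Wasserstein distance equals the difference of the means, 1/l1 - 1/l2.
l1, l2 = 0.5, 2.0
F1 = lambda x: 1.0 - math.exp(-l1 * x)
F2 = lambda x: 1.0 - math.exp(-l2 * x)

w = w1_from_cdfs(F1, F2, 0.0, 60.0)  # tail beyond 60 is negligible
exact = 1.0 / l1 - 1.0 / l2          # = 1.5
```

The quadrature recovers the closed form to high accuracy; for distributions whose CDFs cross, the same integral still gives W, while the difference of means only yields a lower bound.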
Theorem 1. Let μ_1 and μ_2 be infinitely active crms with finite mean. Then for every A ∈ X,

We observe that g_u(A) compares favorably with the upper bound in (5), since it equals zero whenever μ_1 and μ_2 are equal in distribution. We stress that this bound holds for all crms and may be evaluated through numerical integration. Nonetheless, when specializing to certain classes of crms, we manage to upper bound g_u(A) by an expression that can be evaluated exactly, as we do in Section 3.2. In particular, easily computable upper bounds are available whenever the tails of the Lévy intensities are ordered, as we prove in the next corollary. In such a case not only do we have a simple expression for g_u(A), we may also prove that the upper and lower bounds coincide, providing the exact expression of the Wasserstein distance itself.
Corollary 2. Consider the hypotheses of Theorem 1 and let A ∈ X. If the tails of ν_1(ds × A) and ν_2(ds × A) are ordered, the upper and lower bounds of Theorem 1 coincide, providing the exact expression of the Wasserstein distance.

Remark 1. The condition of Corollary 2 holds whenever there exists a dominating measure η on R+ such that the Radon-Nikodym derivatives ν_{1,A} and ν_{2,A} of ν_1(ds × A) and ν_2(ds × A) are ordered, i.e. ν_{i,A}(s) ≤ ν_{j,A}(s) for all s ∈ R+ and i ≠ j in {1, 2}. This more restrictive condition, which is however much easier to verify, holds true for many examples to be displayed in the sequel.
As underlined in Section 2, the convergence in the Wasserstein distance of μ_n(A) to μ(A), for every A ∈ X, is sufficient to guarantee the weak convergence of the sequence (μ_n)_{n≥1} and provides convergence rates. This motivates our main interest in set-wise results such as those in Theorem 1. However, one can also define a uniform distance between the laws of random measures with finite mean, by considering

d_W(μ_1, μ_2) = sup_{A ∈ X} W(μ_1(A), μ_2(A)).

Corollary 2 can be used to find the exact expression of such a distance. We focus on homogeneous crms, i.e. such that their Lévy intensity is a product measure ν(ds, dx) = ρ(s) ds α(dx). It will be shown next that d_W admits a very intuitive representation, being proportional to the total variation distance between the base measures.

Corollary 3. Let μ_i be infinitely active homogeneous crms with finite mean such that the Lévy

Examples
When the conditions of Corollary 2 do not hold, we may often find upper bounds of g_u(A) which may be evaluated exactly for specific examples of crms. In the next proposition we consider a gamma crm with rate parameter b > 0 and base measure α, whose Lévy intensity is

ν(ds, dy) = (e^{−bs}/s) 1_{(0,+∞)}(s) ds α(dy).

We use the notation μ ∼ Ga(b, α). The random measure μ is infinitely active and, if α is a finite measure on X, it has finite mean.
This result extends those in Gairing et al. (2015), who develop upper bounds for similar integrals of gamma Lévy intensities in a more restrictive framework, since they do not allow both the base measure and the scale parameter to differ between the two specifications. The bounds of Proposition 4 are informative in the sense that, the closer the parameters of the two crms, the smaller the bound on the Wasserstein distance. Moreover, when the base measures are equal on A, the upper and lower bounds coincide, providing the exact expression for the Wasserstein distance, in accordance with Corollary 2. The same holds true when b_1 = b_2.

[Figure 1: Simulated Wasserstein distance between two gamma crms and corresponding bounds. In the upper panel α_1(A) = 1, b_1 = 2, b_2 = 3 are fixed, whereas α_2(A) ranges from 0.5 to 2. In the lower one α_1(A) = 1, b_1 = 1, α_2(A) = 2 are fixed, whereas b_2 ranges from 0.5 to 2. In both plots the simulated Wasserstein distance is based on 10 samples of 1000 observations using the Python Optimal Transport (POT) package (Flamary & Courty, 2017).]
In Figure 1 we compare the simulated Wasserstein distance between two gamma crms with the upper bound in Theorem 1, which can be evaluated numerically, the ones in Proposition 4, which can be evaluated exactly, and the upper bound in (5), which is non-informative. For a wide range of parameters the bounds of Theorem 1 and Proposition 4 coincide with the Wasserstein distance. In contrast, when the Lévy intensities are not ordered, the upper and lower bounds do not coincide. The upper bound of Proposition 4 is tight whenever at least one of the two parameters is close to the corresponding parameter of the other crm (i.e. α_1(A) ≈ α_2(A) or b_1 ≈ b_2), whereas the upper bound of Theorem 1 is tight on the whole range of parameters. Moreover, they are both more informative than the upper bound in (5). The lower bound, on the other hand, is always tight and becomes non-informative when the two crms have different parameters but equal ratios α_i(A)/b_i, i.e. when they have equal mean. A different situation occurs with beta crms, where the Lévy densities corresponding to different concentration parameters and the same base measure are not ordered. We recall that μ ∼ Be(c, α) is a beta crm of concentration parameter c and base measure α if the Lévy intensity is

ν(ds, dy) = c s^{−1} (1 − s)^{c−1} 1_{(0,1)}(s) ds α(dy).

We conclude this section with an immediate application of Corollary 3 on the distance d_W between the laws of generalized gamma crms.
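The gamma case of Corollary 2 can be checked by simulation: the one-dimensional distribution of μ ∼ Ga(b, α) is μ(A) ∼ Gamma(α(A), b), and when the Lévy intensities are ordered (here, equal rates and different total base masses; the specific parameter values are arbitrary) the Wasserstein distance reduces to the difference of the means α_i(A)/b_i. A minimal Monte Carlo sketch, estimating the distance by pairing order statistics of two equally sized samples:

```python
import random

def empirical_w1(xs, ys):
    """Empirical 1-Wasserstein distance between equally sized samples:
    pair order statistics and average the absolute differences."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
a1, b1 = 1.0, 2.0   # total mass mu1(A) ~ Gamma(shape a1, rate b1)
a2, b2 = 2.0, 2.0   # same rate, larger shape: ordered Levy intensities

n = 20000
s1 = [rng.gammavariate(a1, 1.0 / b1) for _ in range(n)]  # scale = 1/rate
s2 = [rng.gammavariate(a2, 1.0 / b2) for _ in range(n)]

w_hat = empirical_w1(s1, s2)
w_exact = abs(a1 / b1 - a2 / b2)   # difference of means: 0.5
```

With unordered intensities (e.g. a1 < a2 and b1 < b2) the same estimator remains valid, but the difference of means only provides the lower bound discussed above.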

Hazard rate mixtures
Applications in survival analysis and reliability involve time-to-event data and have spurred important developments in Bayesian nonparametric modeling. Stimulating and exhaustive overviews of popular models in the area can be found in Müller et al. (2015) and in Ghosal & van der Vaart (2017). If T_1, ..., T_n are from an exchangeable sequence of time-to-event data, i.e.

T_i | P ~ P independently (i = 1, ..., n),   P ∼ Π,   (9)

the choice of Π follows from specifying a prior on the survival function t → S(t) = P((t, ∞)). This may be done directly by resorting, e.g., to neutral-to-the-right random probability measures (Doksum, 1974), or by setting a prior on the corresponding cumulative hazard function by means of, e.g., the beta process (Hjort, 1990). Alternatively, one may specify a prior on the hazard rate function if one can assume that S is almost surely continuous: in this case a convenient option is a kernel mixture model (Dykstra & Laud, 1981). For all these model specifications, one can also take into account the presence of censored observations. The most common mechanism is right-censoring, which associates to each T_i a censoring time C_i. In this case the actual observations are (X_i, Δ_i), where X_i = min(T_i, C_i) and Δ_i = 1(T_i ≤ C_i), so that T_i is observed whenever Δ_i equals 1. Here we focus on priors for the hazard rates, i.e. the instantaneous risk of failure, that are induced by kernel mixtures over a gamma crm. The model, originally proposed as a prior for increasing hazard rates in Dykstra & Laud (1981), is ideal for treating right-censored observations and has led to several interesting generalizations. Henceforth, we consider a specification that has been investigated in its full generality by James (2005). Before focusing on our main results, let us first recall some basic definitions that will also allow us to set the notation to be used throughout. If F is a cumulative distribution function on [0, +∞) and S = 1 − F the corresponding survival function, we assume F is absolutely continuous, so that one can define the hazard rate h = F'/(1 − F) and rewrite, for any t ≥ 0,

S(t) = e^{−H(t)},   H(t) = ∫_0^t h(s) ds,

where H is the cumulative hazard function. Let k : R+ × X → [0, +∞) be a measurable kernel function. If μ is a crm, with corresponding Poisson random measure N, and k is such that

lim_{t→+∞} ∫_{R+ × X} ( ∫_0^t k(u | y) du ) s ν(ds, dy) = +∞,   (10)

a prior for the hazard rates is the probability distribution of the process {h(t) | t ≥ 0} such that for any t ≥ 0,

h(t) = ∫_X k(t | y) μ(dy).   (11)

Thus, condition (10) ensures that the mean cumulative hazards go to +∞ as time increases.
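To make the mixture structure in (11) concrete, the following sketch (the atoms, weights and β are hypothetical, standing in for one realization of a discrete random measure μ) computes the hazard rate, cumulative hazard and survival function induced by the Dykstra-Laud kernel k(t | y) = β 1_{[0,t]}(y), for which K(t | y) = β (t − y)_+:

```python
import math

# Hypothetical discrete realization of the mixing measure: weight w at location y.
atoms = [(0.5, 0.8), (1.2, 0.3), (2.0, 1.1)]  # (y, w)
beta = 1.5

def hazard(t):
    # h(t) = sum_j k(t | y_j) w_j with k(t | y) = beta * 1_{[0,t]}(y)
    return sum(beta * w for y, w in atoms if y <= t)

def cum_hazard(t):
    # H(t) = sum_j K(t | y_j) w_j with K(t | y) = beta * (t - y)_+
    return sum(beta * max(t - y, 0.0) * w for y, w in atoms)

def survival(t):
    # S(t) = exp(-H(t))
    return math.exp(-cum_hazard(t))
```

With this kernel each atom of μ adds a permanent jump to the hazard at time y, so realizations of h are almost surely increasing step functions.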
We use the techniques developed in the previous sections to obtain bounds on the Wasserstein distance between the marginal hazard rates coming from different kernels and different crms. Moreover, we successfully address the same issue when considering the Wasserstein distance between cumulative hazards and survival functions.

Bounds for hazard rates
Consider two random hazard rates h_1 = {h_1(t) | t ≥ 0} and h_2 = {h_2(t) | t ≥ 0} as in (11). From a statistical standpoint, these may be seen as different prior specifications corresponding, e.g., to different mixing crms or kernels. Alternatively, h_2 may be thought of as an approximation of the actual prior h_1, and one may be interested in ascertaining the quality of such an approximation. The issue is of great interest when we need to sample from h_1, or its posterior distribution, while it is much easier to sample from h_2: in this case a bound on the error can provide effective guidance on how to fix the parameters of the approximating distribution. We first investigate how different crms and kernels impact the marginal hazards, using the Wasserstein distance as a measure. In other terms, we will focus on W(h_1(t), h_2(t)) for every t ≥ 0. The results in the previous sections provide the necessary background for obtaining the desired bounds. Before displaying these, we state a technical result. To this end, we recall that if ν is a measure on X and g : X → Y is a measurable function, the pushforward measure g_# ν on Y is defined by (g_# ν)(A) = ν(g^{−1}(A)).
Lemma 6. Let μ be a crm with Lévy intensity ν and let f : X → R+ be a measurable function. Then the random measure μ_f(dy) = f(y) μ(dy) is a crm with Lévy intensity equal to the pushforward measure ν_f = (p_f)_# ν, where p_f(s, y) = (s f(y), y). Thus, for every A ∈ X, the distribution of μ_f(A) is determined by the restriction of ν_f to R+ × A.

When ν(ds, dy) = ν(s, y) ds α(dy), by Lemma 6 with a change of variable,

ν_f(ds, dy) = (1/f(y)) ν(s/f(y), y) ds α(dy).   (12)

Thus, we will use the notation ν_f(ds, dy) = (1/f(y)) ν(ds/f(y), dy). The relevance of this change-of-measure result is apparent, since the prior specification in (11) involves a multiplicative structure with the kernel and the mixing crm. The following example deals with the gamma case.
Example 2. Consider μ ∼ Ga(b, α) and a generic kernel k. Then the random measures defined by μ_{k(t|·)}(dy) = k(t | y) μ(dy) are crms with Lévy intensity

ν_{k(t|·)}(ds, dy) = (1/s) e^{−s b / k(t | y)} ds α(dy).

Thus μ_{k(t|·)} is an extended gamma crm with scale function β(y) = k(t | y)/b and base measure α. Extended gamma crms are easily shown to be infinitely active.
Lemma 6 ensures that marginally the hazard process in (11) satisfies h(t) = μ_{k(t|·)}(X) in distribution, where μ_{k(t|·)} is a crm. In order to bound the Wasserstein distance between marginal hazards we may thus apply the results of Theorem 1 with A = X. By (12), μ_{k(t|·)} has finite mean and is infinitely active if, respectively,

∫_{R+ × X} k(t | y) s ν(ds, dy) < +∞   (13)

and

ν_{k(t|·)}((0, ε] × A) = +∞,   (14)

for every ε > 0, A ∈ X and t ≥ 0. If ν is infinitely active, (14) holds.
Theorem 7. Let h_i = {h_i(t) | t ≥ 0} be random hazard rates as in (11) with associated infinitely active crms μ_i, Lévy intensities ν_i, and kernels k_i satisfying (10) and (13), for i = 1, 2. Then the Wasserstein distance between the marginal hazard rates is finite and, for every t ≥ 0,

g_ℓ(t) ≤ W(h_1(t), h_2(t)) ≤ g_u(t),

where g_ℓ(t) = | ∫_{R+ × X} k_1(t | y) s ν_1(ds, dy) − ∫_{R+ × X} k_2(t | y) s ν_2(ds, dy) | and g_u(t) denotes the upper bound of Theorem 1 applied with A = X to the crms μ_{i,k_i(t|·)}.

Bounds for survival functions
The bounds we have derived for the hazard rates translate into bounds for the corresponding survival functions, and these are of great interest since one typically targets estimation of functionals of the survival function. First, we consider the corresponding cumulative hazard processes H = {H(t) | t ≥ 0}, defined by

H(t) = ∫_X K(t | y) μ(dy),   (15)

where K(t | y) = ∫_0^t k(u | y) du is the cumulative kernel. The random survival functions are then recovered as

S(t) = e^{−H(t)}.   (16)

Thus, the cumulative hazards can be treated as a kernel mixture as well, and an analogue of Theorem 7 is available.
Theorem 8. Let H_i = {H_i(t) | t ≥ 0} be two random cumulative hazards as in (15) with associated infinitely active crms μ_i, with Lévy intensities ν_i, and kernels k_i satisfying (10), for i = 1, 2. If the cumulative kernels K_i(t | y) = ∫_0^t k_i(u | y) du satisfy (13) with k = K_i, the Wasserstein distance between the marginal cumulative hazards is finite and, for every t ≥ 0,

g_ℓ(t) ≤ W(H_1(t), H_2(t)) ≤ g_u(t),

where g_ℓ(t) = | ∫_{R+ × X} K_1(t | y) s ν_1(ds, dy) − ∫_{R+ × X} K_2(t | y) s ν_2(ds, dy) | and g_u(t) denotes the upper bound of Theorem 1 applied with A = X to the crms μ_{i,K_i(t|·)}.

Examples
We now apply these results to kernels of the type of Dykstra & Laud (1981), k(t | y) = β(y) 1_{[0,t]}(y), which is a popular choice when one wants to model increasing hazards. In this setting X = [0, +∞). For simplicity we restrict our attention to constant functions β(y) = β, which is a common choice in applications (Dykstra & Laud, 1981; Laud et al., 1996), and to gamma crms with the same base measure α. In this scenario α may also be an infinite measure, though it must be boundedly finite. We will consider the Lebesgue measure on the positive real axis, Leb+(ds) = 1_{[0,+∞)}(s) ds, which is the base measure proposed in the original paper of Dykstra & Laud (1981) and meets the conditions of Theorem 7.
Example 3. Let μ_i ∼ Ga(b_i, Leb+) and let k_i(t | y) = β_i 1_{[0,t]}(y), with b_i, β_i > 0, for i = 1, 2. If h_1 and h_2 are the corresponding hazard rate mixtures, then for every t ≥ 0,

W(h_1(t), h_2(t)) = t | β_1/b_1 − β_2/b_2 |.

Proof. The general expression for ν_{k(t|·)} was derived in Example 2. With our choices,

ν_{k_i(t|·)}(ds, dy) = (1/s) e^{−s b_i/β_i} 1_{[0,t]}(y) ds dy,   (17)

which corresponds to the Lévy intensity of a gamma crm with parameter b_i/β_i and the restriction of Leb+ to [0, t] as base measure. Since the Lebesgue measure on a bounded set is finite, as observed in Proposition 4, (17) is infinitely active and has finite mean. Thus condition (13) holds. In order to check condition (10) on the expected cumulative hazards we first observe that for every t > 0, E(h_i(t)) = t β_i / b_i. Thus ∫_0^t E(h_i(s)) ds = t² β_i / (2 b_i) diverges as t → +∞, and condition (10) holds. The results in Theorem 7 apply and, since the densities in (17) are ordered, we easily derive the expression for W(h_1(t), h_2(t)) from the expression of E(h_i(t)), in accordance with the results in Proposition 4 for gamma crms.
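The conclusion of Example 3 can be checked by simulation: with the Dykstra-Laud kernel, h_i(t) = β_i μ_i([0, t]) and μ_i([0, t]) ∼ Gamma(t, b_i), so the marginal hazards are gamma distributed and stochastically ordered. A minimal sketch (the parameter values and the time point are arbitrary) comparing an empirical Wasserstein estimate with t |β_1/b_1 − β_2/b_2|:

```python
import random

def empirical_w1(xs, ys):
    # empirical 1-Wasserstein distance from paired order statistics
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)
t = 2.0
b1, beta1 = 2.0, 1.0
b2, beta2 = 3.0, 2.0

# h_i(t) = beta_i * mu_i([0, t]) with mu_i ~ Ga(b_i, Leb+), hence
# h_i(t) ~ Gamma(shape = t, rate = b_i / beta_i), i.e. scale = beta_i / b_i.
n = 20000
h1 = [rng.gammavariate(t, beta1 / b1) for _ in range(n)]
h2 = [rng.gammavariate(t, beta2 / b2) for _ in range(n)]

w_hat = empirical_w1(h1, h2)
w_exact = t * abs(beta1 / b1 - beta2 / b2)   # = 2 * |1/2 - 2/3| = 1/3
```

The distance grows linearly in t, consistent with the interpretation of β_i/b_i as the prior slope of the mean hazard.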
The choice of the kernel allows for great flexibility and usually depends on the type of experiment one is considering. For example, if we are dealing with the failure of objects whose material wears out in time, the assumption of increasing hazard rates appears to be the most plausible. Besides the kernel of Dykstra & Laud (1981), which leads to almost surely increasing hazard rates, one can resort to other options; more details can be found in Lo & Weng (1989), James (2003) and De Blasi et al. (2009). The choice of the kernel is typically driven by the type of data one is examining. As for the choice of the random measure, this may be dictated by specific inferential properties, but it is usually motivated by analytical tractability and prior flexibility. In this regard, the gamma crm is a popular choice. We thus pick one of the kernels above and focus on gamma kernel mixtures. One is then left with the choice of the parameter b, which heuristically quantifies the prior belief on the steepness of the hazard. Given these specifications, one may be interested in quantifying the discrepancy induced by it on the corresponding hazards. Before proceeding, we underline that the same reasoning could be applied to the base measure, but for simplicity we consider gamma crms with a given shared base measure. We point out that in all cases we achieve the exact expression for the Wasserstein distance between the hazard rates.
Example 4. Let μ_i ∼ Ga(b_i, Leb+), with b_i > 0, for i = 1, 2. Let k_1 = k_2 = k be one of the kernels (a)-(d) above. Then the Wasserstein distances between the corresponding hazard rate mixtures equal

where f_+ = max(f, 0) for any measurable function f with values in R.
Proof. Kernel (a) is very similar to the one in Example 3. The Lévy intensity (1/s) e^{−sb} 1_{[0 ∨ (t−τ), t+τ]}(y) ds dy is that of a gamma crm with parameter b and Lebesgue base measure on [0 ∨ (t − τ), t + τ].
Since the corresponding densities are ordered, the exact Wasserstein distance is available and coincides with (a'). The same is true for kernel (b). With kernel (c) one has

The corresponding densities are ordered; thus, if the conditions of Theorem 7 hold, we only need to evaluate the expected value of the hazards to derive the exact Wasserstein distance:

This also proves the finite mean condition (13). Since ∫_0^t (1 − e^{−gs}) ds = t − (1 − e^{−gt})/g diverges as t → +∞, condition (10) also holds. Finally, for kernel (d),

The mean hazard rates are

and thus condition (13) holds. Moreover,

where γ is the Euler-Mascheroni constant. This quantity diverges as log(t) for t → +∞ and thus condition (10) holds. We conclude as in the previous cases.
Next, we apply the bounds on cumulative hazards and survival functions of Theorem 8 and Theorem 9 to the case where mixtures of gamma crms are used, as in Example 3.
Example 5. Consider the prior specification in Example 3. Denote by H_i the corresponding cumulative hazard process (15) and by S_i the corresponding survival process (16), for i = 1, 2. Then for every t ≥ 0,

where

Proof. Since the Lévy densities of the crms μ_{i,K_i(t|·)} are ordered, if the conditions of Theorem 8 hold, the expression for the Wasserstein distance between the cumulative hazards derives easily from (20). Now, condition (10) on the kernels has already been checked in Example 3. Moreover, (20) proves condition (13) on the finite mean.
As for the Wasserstein distance between the survival functions, in order to apply Theorem 9 it suffices to evaluate the mean of the survival functions. This is easily done thanks to the properties of the Laplace functional of a crm (3). Specifically, E[exp(−∫ K(t | y) μ_i(dy))] is equal to

In Figure 2 a graphical representation of the upper and lower bounds for the Wasserstein distance between the corresponding survival functions is given. In particular, the distance between the survival functions lies in the gray area in the figure. The first upper bound g_{u,1} appears to be tighter for small times, while the second one g_{u,2} is more informative as time increases. This depends on the fact that in the first case we are using the bound e^{−H_1(t) ∧ H_2(t)} ≤ 1, which is effective for small values of the cumulative hazard function, i.e. for small times, while in the second one we are using e^{−H_1(t) ∧ H_2(t)} ≤ e^{−H_1(t)} + e^{−H_2(t)}, which is effective for large values of the cumulative hazard function, i.e. for large t. Moreover, we point out that the Wasserstein distance between survival functions is considerably smaller than the one between the hazard rates, which is what we expect from a modeling perspective.

Posterior Sampling Scheme
The techniques we have developed in the previous sections may be fruitfully applied to evaluate approximation errors in posterior sampling schemes. In this section we focus on the gamma kernel mixture of Dykstra & Laud (1981) and rely on the posterior analysis of James (2005). Even when the prior hazards are modeled as a gamma process (i.e. constant β), conditionally on the data and a set of latent variables, the non-atomic part of the posterior hazards is an extended gamma process. There are many methods available in the literature to sample from an extended gamma process, such as the finite-dimensional approximation of Ishwaran & James (2004), the inverse Lévy methods of Ferguson & Klass (1972) and Wolpert & Ickstadt (1998), and the series representation of Bondesson (1982), which serves as a basis for the algorithm in Laud et al. (1996). Other series representations can be found in Rosiński (2001). Recently, Al Masry et al. (2017) proposed a new algorithm based on a discretization of the scale function: in this case the extended gamma process can be approximated by a sum of gamma random increments. The construction of the discretization is not always simple but, when possible, it allows for a precise quantification of the approximation error. In Al Masry et al. (2017) the error is quantified through a bound on the L² distance. Here, we build the discrete approximation of the scale function of the posterior hazards corresponding to a gamma process prior and use the Wasserstein distance to quantify the approximation error between the induced hazard rates. Moreover, since one is usually interested in the cumulative hazards or in the survival function, we provide an estimate for their approximations as well, which yields a novel and meaningful guide for fixing the approximation error in the algorithm.
We first recall the posterior characterization of mixture hazard rate models with censored data, as achieved by James (2005). This result is suited to our case, since it concerns crm-driven mixtures under a multiplicative intensity model. In order to provide a summary description of the posterior distribution, henceforth $T_1, \dots, T_n$ are random elements from an exchangeable sequence as in (9), with $\Pi$ being the law of a random probability measure with hazard rate $\tilde h$ as in (11). Furthermore, if $n_e = \sum_{i=1}^n \Delta_i$ is the number of exact observations in the sample, we may assume without loss of generality that $\Delta_1 = \dots = \Delta_{n_e} = 1$ and, hence, the last $n - n_e$ observations are censored. The data are then given by $\{(x_j, \Delta_j)\}_{j=1}^n$. A representation of the likelihood function that is convenient for Bayesian computations is obtained by relying on a suitable augmentation that involves a collection of latent variables $Y_1, \dots, Y_{n_e}$ corresponding to the exact observations. In the augmented likelihood, $x = (x_1, \dots, x_n)$, $y_1^*, \dots, y_k^*$ are the $k \le n_e$ distinct values in $y = (y_1, \dots, y_{n_e})$, $C_j = \{r : y_r = y_j^*\}$ and $n_j = \operatorname{card}(C_j)$. The function $K_n$ is interpretable as a kernel for the cumulative hazards and, in general, accounts for different forms of censoring. For simplicity we henceforth focus on the case of right-censored observations, which yields $K_n(y) = \sum_{i=1}^n \int_0^{x_i} k(s \mid y)\, ds$. The posterior characterization we rely on is as follows.
Theorem 10 (James (2005)). Let $T_1, \dots, T_n$ be random elements from an exchangeable sequence as in (9), with $\Pi$ being the law of a random probability measure with hazard rate $\tilde h$ as in (11).
Conditional on the observed data $x = (x_1, \dots, x_n)$ and latent variables $y = (y_1, \dots, y_{n_e})$, $\tilde\mu$ equals in distribution $\tilde\mu^* = \tilde\mu^*_c + \sum_{h=1}^k J_h\, \delta_{y_h^*}$, where $\tilde\mu^*_c$ is a crm without fixed jump points and with intensity
$$\nu^*(ds, dy) = e^{-s K_n(y)}\, \nu(ds, dy) = e^{-s K_n(y)}\, \rho_y(ds)\, \eta(dy),$$
while $J_1, \dots, J_k$ are mutually independent and independent from $\tilde\mu^*_c$. For $h = 1, \dots, k$, the generic $h$-th jump $J_h$ has distribution proportional to $s^{n_h}\, e^{-s K_n(y_h^*)}\, \rho_{y_h^*}(ds)$.

In the rest of the section we focus on the case $\mathbb{X} = \mathbb{R}$, $k(t \mid y) = \beta\, 1_{[0,t]}(y)$ and $\tilde\mu$ a gamma crm with rate parameter $b$ and base measure $\alpha$, which is a typical choice in applications. Thus the non-atomic posterior crm $\tilde\mu^*_c$ has Lévy intensity
$$\nu^*(ds, dy) = \frac{e^{-s\,(b + \beta \sum_{i=1}^n (x_i - y)^+)}}{s}\, ds\, \alpha(dy).$$
It follows that $\tilde\mu^*_c$ is an extended gamma crm with base measure $\alpha$ and scale function $1/(b + \beta \sum_{i=1}^n (x_i - y)^+)$. The non-atomic posterior hazards are an extended gamma process and can thus be written as $\tilde h^*(t) = \int_0^t \beta^*(y)\, \tilde\mu(dy)$, where $\beta^*(y) = \beta/(b + \beta \sum_{i=1}^n (x_i - y)^+)$ and $\tilde\mu$ is a gamma crm with rate parameter 1 and base measure $\alpha$. Consider an interval of interest $[0, T]$, whose endpoints can be thought of as the initial and final times of the study, so that $0 < x_1 \le \dots \le x_n < T$. The algorithm proposed by Al Masry et al. (2017) to sample from $\{\tilde h^*(t) \mid t \in [0, T]\}$ is based on a piecewise constant approximation $\beta_\epsilon$ of $\beta^*$ on the interval $[0, T]$. If $\beta_\epsilon(y) = \sum_{h=0}^{n(\epsilon)} \beta_h\, 1_{(t_h, t_{h+1}]}(y)$, then for every $t \ge 0$ the approximating hazard is $\tilde h_\epsilon(t) = \sum_{h \,:\, t_h \le t} \delta_h$. The increments $\delta_h = \beta_h\, \tilde\mu((t_h, t_{h+1}])$ have a gamma distribution with scale $\beta_h$ and shape $\alpha((t_h, t_{h+1}])$. If the points $\{t_h \mid h = 1, \dots, n(\epsilon)\}$ become dense in the interval $[0, T]$ as $n(\epsilon) \to +\infty$, one samples directly from the sum of gamma random variables $\sum_{t_h \le t} \delta_h$. In order to apply this algorithm we need to build an approximating strictly positive piecewise constant function $\beta_\epsilon : [0, T] \to (0, +\infty)$ and find a reasonable criterion to fix the approximation error. We build $\beta_\epsilon$ by discretizing the reciprocal of $\beta^*$, namely $\gamma(y) = 1/\beta^*(y) = (b + \beta \sum_{i=1}^n (x_i - y)^+)/\beta$, on the grid points $t_{j,k} = x_j + k\,\epsilon/(n-j)$, for every $j = 0, \dots, n-1$ and $k = 0, \dots, [(n-j)(x_{j+1} - x_j)\,\epsilon^{-1}]$, where $x_0 = 0$ and $[x]$ denotes the integer part of $x \ge 0$, so that $n(\epsilon) = n + 1 + \sum_{i=0}^{n-1} [(n-i)(x_{i+1} - x_i)\,\epsilon^{-1}]$. Next we set $\gamma_h = \gamma(t_{h+1})$, define the corresponding piecewise constant function $\gamma_\epsilon$ and let $\beta_\epsilon = 1/\gamma_\epsilon$. Moreover, $\gamma_\epsilon(y) \ge \gamma(y)$ for every $y \in [0, T]$.
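A minimal sketch of this discretization, under the assumption that on each interval $(x_j, x_{j+1}]$ the grid spacing is $\epsilon/(n-j)$, so that the decreasing piecewise linear function $\gamma = 1/\beta^*$ varies by at most $\epsilon$ per cell (the exact indexing of the grid in the paper may differ), with illustrative data values:

```python
import numpy as np

def build_grid(x, T, eps):
    """Grid on [0, T] on which gamma = 1/beta* varies by at most eps per cell.

    gamma(y) = (b + beta * sum_i (x_i - y)_+) / beta is decreasing and
    piecewise linear with slope -(n - j) on (x_j, x_{j+1}), so spacing
    eps / (n - j) there keeps the per-cell variation of gamma within eps.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    knots = np.concatenate(([0.0], x))
    pts = [0.0]
    for j in range(n):                        # interval (knots[j], knots[j+1]]
        step = eps / (n - j)                  # |slope| = n - j there
        pts.extend(np.arange(knots[j] + step, knots[j + 1], step))
        pts.append(knots[j + 1])
    pts.append(T)                             # gamma is constant on (x_n, T]
    return np.unique(pts)

def gamma_fn(y, x, b, beta):
    y = np.atleast_1d(y)
    return (b + beta * np.clip(x[None, :] - y[:, None], 0, None).sum(axis=1)) / beta

# Toy inputs (illustrative, not from the paper).
x = np.array([0.5, 1.2, 2.0]); b, beta, T, eps = 1.0, 2.0, 3.0, 0.1
t = build_grid(x, T, eps)
# Upper step approximation: on (t_h, t_{h+1}] use gamma at the left endpoint,
# which dominates gamma since gamma is decreasing; hence beta_eps <= beta*.
gamma_eps = gamma_fn(t[:-1], x, b, beta)
beta_eps = 1.0 / gamma_eps
```

The assertions below check the two properties that the construction is meant to deliver: the step function dominates $\gamma$ cell by cell (equivalently $\beta_\epsilon \le \beta^*$), and the per-cell variation of $\gamma$ never exceeds $\epsilon$.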
From this theorem one deduces a very simple uniform bound for the discrepancy between $\beta_\epsilon$ and $\beta^*$.
For a given $\epsilon$, these results provide a constructive rule for determining an approximation of $\beta^*$, and hence an approximating hazard $\tilde h_\epsilon$. One may then wonder which value of $\epsilon$ should be specified in order to achieve a prescribed approximation error for the posterior hazards and survival functions. This is achieved in the next corollary, where we propose three different rules based on the Wasserstein distance between the hazards, the cumulative hazards and the survival functions, respectively. The result is stated under the hypotheses of Example 3, $\alpha = \mathrm{Leb}^+$, but can easily be adapted to any base measure.
Lemma 14. Let $\tilde\mu$ be a crm with Lévy intensity $\nu$ and finite mean (6). Then for every $A \in \mathcal{X}$, $\epsilon\, \nu([\epsilon, +\infty) \times A) \to 0$ as $\epsilon \to 0$.

Proof. For every $\delta > 0$ consider $\epsilon > 0$ such that $\epsilon < \delta$. Then
$$\epsilon\, \nu([\epsilon, +\infty) \times A) \le \int_\epsilon^\delta s\, \nu(ds \times A) + \epsilon\, \nu([\delta, +\infty) \times A).$$
The second integral is finite by (2), so the second term vanishes as $\epsilon \to 0$. Since this is true for every $\delta > 0$, by the absolute continuity of the integral $\epsilon\, \nu([\epsilon, +\infty) \times A) \to 0$ as $\epsilon \to 0$.
We now prove the results in Theorem 1. The lower bound $g(A)$ is easily achieved by (5) and by Campbell's theorem applied to the underlying Poisson random measures with respect to the measurable function $f(s, x) = s\, 1_A(x)$, similarly to (6). We thus concentrate on the upper bound. Since the Lévy intensities are diffuse and infinitely active, for every $A \in \mathcal{X}$ and $r > 0$ there exists $\epsilon_{i,r,A} > 0$ such that $\nu_i([\epsilon_{i,r,A}, +\infty) \times A) = r$, for $i = 1, 2$. Denoting by $N_i$ the Poisson random measure underlying $\tilde\mu_i$ as in (1), $\tilde\mu_i(A) = \int_0^{+\infty} \int_A s\, N_i(ds, dy)$.
We use the notation $J^S_{i,r,A} = \int_0^{\epsilon_{i,r,A}} \int_A s\, N_i(ds, dy)$ for the small jumps and $J^B_{i,r,A} = \int_{\epsilon_{i,r,A}}^{+\infty} \int_A s\, N_i(ds, dy)$ for the big jumps. The independence of the increments of a Poisson random measure ensures that $J^S_{i,r,A}$ and $J^B_{i,r,A}$ are independent, thus by (8)
$$W(\tilde\mu_1(A), \tilde\mu_2(A)) \le W(J^S_{1,r,A}, J^S_{2,r,A}) + W(J^B_{1,r,A}, J^B_{2,r,A}).$$
We first show that the small jumps do not play any role in the final bound. By (5), $W(J^S_{1,r,A}, J^S_{2,r,A}) \le E(J^S_{1,r,A}) + E(J^S_{2,r,A})$.
The means $E(J^S_{i,r,A}) = \int_0^{\epsilon_{i,r,A}} s\, \nu_i(ds \times A)$ are finite by (2) and thus go to zero as $r \to +\infty$ by the absolute continuity of the integral. We now focus on the big jumps. By (2), these are integrals of Poisson random measures with finite mean measure $\nu_i(ds, dy)\, 1_{[\epsilon_{i,r,A}, +\infty)}(s)$. Proposition 19.5 in Sato (1999) then ensures that $J^B_{i,r,A}$ has a compound Poisson distribution, so that $J^B_{i,r,A} \overset{d}{=} \sum_{h=1}^{N_{i,r,A}} \xi_h$, where $N_{i,r,A}$ is a Poisson random variable with intensity $r = \nu_i([\epsilon_{i,r,A}, +\infty) \times A)$ and $(\xi_h)_h$ are independent and identically distributed random variables, independent from $N_{i,r,A}$, with distribution $\rho_{i,r,A}(ds) = \frac{1}{r}\, \nu_i(ds \times A)\, 1_{[\epsilon_{i,r,A}, +\infty)}(s)$. Theorem 10 in Mariucci & Reiß (2018) deals with the Wasserstein distance between compound Poisson distributions. Since $J^B_{1,r,A}$ and $J^B_{2,r,A}$ have the same total intensity $r$ but different jump distributions $\rho_{i,r,A}$, an immediate adaptation of their result yields $W(J^B_{1,r,A}, J^B_{2,r,A}) \le r\, W(\rho_{1,r,A}, \rho_{2,r,A})$.
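The compound Poisson bound can be checked numerically on a toy example. The two jump laws below are arbitrary stand-ins with equal means (so the inequality is strict), not the truncated Lévy measures $\rho_{i,r,A}$ of the proof, and the Wasserstein distances are estimated empirically.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
r = 3.0              # common total intensity of the two compound Poissons
n_samples = 100_000

def compound_poisson(rate, jump_sampler, size, rng):
    """Sample J = sum_{h=1}^{N} xi_h with N ~ Poisson(rate), xi_h iid."""
    counts = rng.poisson(rate, size=size)
    jumps = jump_sampler(counts.sum(), rng)
    out = np.zeros(size)
    np.add.at(out, np.repeat(np.arange(size), counts), jumps)
    return out

# Toy jump laws with the same mean but different spread.
jump1 = lambda m, rng: rng.gamma(2.0, 1.0, size=m)   # mean 2, variance 2
jump2 = lambda m, rng: rng.gamma(4.0, 0.5, size=m)   # mean 2, variance 1

J1 = compound_poisson(r, jump1, n_samples, rng)
J2 = compound_poisson(r, jump2, n_samples, rng)

w_jumps = wasserstein_distance(jump1(n_samples, rng), jump2(n_samples, rng))
w_big = wasserstein_distance(J1, J2)   # should not exceed r * w_jumps
```

Since the jump distributions have equal means, summing a Poisson number of them partially averages out their difference, so the empirical distance between the compound sums sits well below the bound $r\,W(\rho_1, \rho_2)$.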
By interchanging the integrals, this is equal to the upper bound in Theorem 1.

Proof of Theorem 11
The proof relies on $\gamma(y)$ being a decreasing, continuous, piecewise linear function. We first include $x_1, \dots, x_n$ in the set $\{t_h \mid h = 1, \dots, n(\epsilon)\}$. Then, for every $i = 1, \dots, n$ we iteratively include all points $t \in (x_i, x_{i+1})$ at which the value of $\gamma$ is at distance $\epsilon$ from its value at the previous point. We easily conclude by observing that on $(x_i, x_{i+1}]$ the function $\gamma$ is linear with slope $-(n-i)$. See Figure 3.

Proof of Corollary 13
The proof is based on observing that $\tilde h_1 = \tilde h^*$ and $\tilde h_2 = \tilde h_\epsilon$ are two kernel mixture hazards with $k_1(y \mid t) = k_2(y \mid t) = 1_{[0,t]}(y)$ and $\tilde\mu_i$ extended gamma crms with scale functions $\beta_1(y) = \beta^*(y)$ and $\beta_2(y) = \beta_\epsilon(y)$, respectively, and Lebesgue base measure on the positive half-line. These kernels and Lévy intensities satisfy both the conditions of Theorem 7 and of Theorem 8. Since by construction $\beta_\epsilon(y) \le \beta^*(y)$ for every $y \in [0, +\infty)$, the Lévy densities of the hazards and of the cumulative hazards are ordered. Thus the Wasserstein distance reduces to the absolute difference of their means, which is controlled by (27); the cumulative hazards are handled similarly. Finally, the bound for the survival functions derives directly from the one on the cumulative hazards, as in Theorem 9.
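The reduction used in this proof, namely that the Wasserstein distance between two stochastically ordered random variables equals the absolute difference of their means, is easy to check numerically. The gamma laws below are illustrative stand-ins for the ordered hazards, not the paper's actual marginals.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
n = 200_000

# Same shape, ordered scales: the larger scale stochastically dominates,
# so W_1 between the two laws equals the difference of the means.
shape, scale_lo, scale_hi = 2.0, 1.0, 1.5
a = rng.gamma(shape, scale_lo, size=n)
b = rng.gamma(shape, scale_hi, size=n)

w_emp = wasserstein_distance(a, b)
mean_gap = shape * (scale_hi - scale_lo)   # |E[b] - E[a]|
```

Under stochastic ordering, $W_1 = \int |F_a - F_b| = \int (F_a - F_b) = E[b] - E[a]$, so the empirical distance should match the mean gap up to Monte Carlo error.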