On a Dirichlet Process Mixture Representation of Phase-Type Distributions

An explicit representation of phase-type distributions as an infinite mixture of Erlang distributions is introduced. The representation unveils a novel and useful connection between a class of Bayesian nonparametric mixture models and phase-type distributions. In particular, this sheds light on two active research topics: estimation techniques for phase-type distributions, and the availability of closed-form expressions for some functionals of Dirichlet process mixture models. The power of this connection is illustrated via a posterior inference algorithm to estimate phase-type distributions, avoiding some difficulties with the simulation of latent Markov jump processes commonly encountered in Bayesian inference for phase-type distributions. On the other hand, closed-form expressions for functionals of Dirichlet process mixture models are illustrated with density and renewal function estimation, related to the optimal salmon weight distribution of an aquaculture study.


Introduction
Mixture models are ubiquitous in statistics. Their study can be traced back to Pearson (1894), with excellent, up-to-date reviews by Titterington et al. (1985), McLachlan and Peel (2000) and Frühwirth-Schnatter (2006). A mixture model can be written as

f(y) = Σ_{h=1}^{N} w_h K(y | θ_h), (1)

with N ∈ {1, 2, . . . , ∞}, where K(·|θ) is a kernel density, here supported on R_+, and {θ_h, w_h}_{h≥1} are parameters satisfying Σ_{h=1}^{N} w_h = 1. Depending on the kernel and weights specification, this model can induce a dense class of densities, namely it can capture any density on R_+. When N = ∞ and the parameters are random, mixture models (1) are widely studied in Bayesian nonparametrics (see, e.g., Ghosh and Ramamoorthi, 2003; Müller and Quintana, 2004; Dunson, 2010; Müller and Mitra, 2013), with the benchmark model being the celebrated Dirichlet process mixture (DPM) model (Ferguson, 1973, 1974; Lo, 1984; Escobar and West, 1995). The DPM model can be defined as the random density model

f(y) = ∫ K(y | θ) P(dθ), (2)

driven by a Dirichlet process (DP), P ∼ DP(α, P_0), i.e. a random probability measure P = Σ_{h≥1} w_h δ_{θ_h}, where the weights {w_h}_{h≥1} and locations {θ_h}_{h≥1} are random and independent, given by the stick-breaking construction w_h = v_h Π_{l<h} (1 − v_l), with v_h iid∼ Beta(1, α), and by θ_h iid∼ P_0, respectively. Here, P_0, sometimes referred to as the baseline distribution, is assumed to be a non-atomic distribution on R_+. The Bayesian nonparametric literature offers a wide choice of other models for P, of practical interest being those falling in the general class of species sampling models (Ghosal and van der Vaart, 2017). Among other aspects, the Dirichlet process stands out within this general class as being the only one that remains tractable for atomic P_0. Though this discreteness of P_0 could be potentially relevant for our purposes below, letting P_0 be atomic is not always adequate, as it prevents posterior distributions from smoothly deviating from the prior.
Indeed, this has encouraged other proposals in the literature (Canale and Dunson, 2011; Canale and Prünster, 2017) to overcome the difficulty of modeling random probability mass functions. Hence, we keep the assumption of a non-atomic P_0. Notice that the random density (2) can also be written as

f(y) = Σ_{h≥1} w_h K(y | θ_h), (3)

and, when describing a set of iid observations {y_1, . . . , y_n} from it, is sometimes written in the hierarchical representation form

y_k | θ_k ind∼ K(y_k | θ_k), k = 1, . . . , n,
θ_k | P iid∼ P,
P ∼ DP(α, P_0).

In a seemingly unconnected direction, phase-type distributions (Neuts, 1975, 1978), sometimes abbreviated as PH-distributions, have been mainly studied in the applied probability literature; see, e.g., Bladt and Nielsen (2017) for an extensive treatment. The basic idea of phase-type distributions starts by considering a Markov jump process {X_t}_{t≥0} with state space E = {1, 2, . . . , p, p + 1}, where states 1, 2, . . . , p are transient and state p + 1 is absorbing. This process is driven by an intensity matrix of the form

Λ = ( T  t
      0  0 ),

where T ∈ T, with T denoting the space of subintensity matrices of dimension p × p. Since the row elements of an intensity matrix add up to 0 (i.e. Λ1 = 0), the space T contains all the square matrices whose row sums are non-positive and whose main-diagonal elements are negative; that is, the diagonal entries are of the form t_ii = −λ_i, where λ_i > 0 corresponds to the parameter of an exponential distribution, and the off-diagonal entries are t_ij = λ_ij ≥ 0. This amounts to saying that the process remains in state i an exponential time with rate λ_i, and then jumps to state j with transition probability p(i, j) = λ_ij / λ_i. Notice that λ_ij is the rate at which the transition from i to j occurs. Given the finite nature of this Markov chain, transition probabilities are typically represented with a matrix P with elements p(i, j), for i ≠ j, and p(i, i) = 0. The vector t = −T1 is the exit rate vector, as it contains the jump rates to the absorbing state (Bladt and Nielsen, 2017). Here, 1 is a p-dimensional column vector of ones. Now let π = (π_1, . . .
, π_p), with π_i = P(X_0 = i), be a row vector in the p-dimensional simplex S_p. A phase-type distribution with dimension p is defined as the distribution of the time until absorption of the Markov jump process {X_t}_{t≥0} with initial distribution π and subintensity matrix T, namely the distribution of the random variable Y := inf{t > 0 | X_t = p + 1}. Accordingly, we will use the notation Y ∼ PH_p(π, T) for a phase-type distribution with the embedded Markov process {X_t}_{t≥0} characterized by π and T. It is worth noting that, to keep the standard notation of the phase-type distributions literature, we treat π as a row vector.
If Y ∼ PH_p(π, T), the corresponding density and cumulative distribution function (cdf) are given by f(y) = πe^{Ty}t and F(y) = 1 − πe^{Ty}1, respectively, where e^{Ty} = Σ_{ℓ=0}^{∞} (1/ℓ!)(Ty)^ℓ and the parameter space is given by Θ = S_p × T. Just as for DPM models, this class can be dense in the space of distributions on the positive real line when the number of phases p tends to infinity (Asmussen, 2000a). Phase-type distributions are appealing in a variety of complex statistical problems. They are closed under convolutions and mixing, and they have closed-form expressions to solve problems in survival analysis (Aalen, 1995), renewal theory (Asmussen and Bladt, 1996), the computation of ruin probabilities (Asmussen, 2000b), the estimation of the Lorenz curve and Gini index (Bladt and Nielsen, 2011), and various other estimation problems (see, e.g., Bladt et al., 2016). As will become clear later, these results unveil various novel applications of DPM models.
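To make the density formula concrete, the following is a minimal numerical sketch (in Python; the helper names are ours) that evaluates f(y) = πe^{Ty}t by truncating the series e^{Ty} = Σ_ℓ (1/ℓ!)(Ty)^ℓ. This is adequate for small, well-scaled subintensity matrices; production code would use a scaling-and-squaring matrix exponential instead.

```python
def mat_vec(M, v):
    """Multiply a square matrix (list of rows) by a column vector."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def ph_density(y, pi, T, terms=200):
    """Density f(y) = pi e^{Ty} t of a PH_p(pi, T) distribution, with
    exit vector t = -T 1, using the truncated series for e^{Ty} t."""
    p = len(pi)
    t = [-sum(T[i][j] for j in range(p)) for i in range(p)]  # t = -T 1
    w = t[:]          # current series term: (Ty)^l t / l!, starting at l = 0
    acc = t[:]        # running sum of the series applied to t
    for l in range(1, terms):
        w = [x * y / l for x in mat_vec(T, w)]
        acc = [a + b for a, b in zip(acc, w)]
    return sum(pi[i] * acc[i] for i in range(p))
```

For instance, with π = (1, 0) and the bidiagonal T of an Erlang(2, 1) distribution, `ph_density` reproduces the Erlang density y e^{-y}.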
Although connections between phase-type distributions and mixture models have been described in the literature (see, e.g., Cumani, 1982; O'Cinneide, 1989; Mocanu and Commault, 1999; Lee and Lin, 2010, 2012; Zadeh and Stanford, 2016), they have not been exploited in depth. Inference for Bayesian nonparametric mixture models is nowadays relatively standard (see, e.g., Escobar and West, 1995; Neal, 2000; Ishwaran and James, 2001; Walker, 2007; Kalli et al., 2011; Miller and Harrison, 2018), whereas inference for phase-type distributions is still a challenging task (Bladt et al., 2003; Aslett, 2012). Both the classical and Bayesian approaches to phase-type inference available in the literature resort to the underlying Markov jump processes {X_t^{(k)}}_{t≥0}, k = 1, . . . , n, to write the complete likelihood, and thus estimate the parameters using Expectation-Maximization (EM) or Gibbs sampling algorithms (Bladt et al., 2003). In both approaches, it is difficult to scale up the corresponding algorithms to relatively large sample sizes n. In the EM algorithm, the Expectation step, based on the latent trajectory of the embedded jump process, must be computed for each observation in the sample. In the Gibbs algorithm, the simulation of the associated Markov jump process for each data point is required. Furthermore, a question arising in both approaches is how to determine the number of phases p. As we show below, our proposal overcomes all these difficulties.
The main purpose of this work is to reveal an appealing connection between DPM mixtures and phase-type distributions, thus mutually benefiting both research areas. With this in mind, we present an explicit representation of phase-type distributions as an infinite mixture of Erlang distributions. This new representation is derived using the corresponding Laplace transform, which admits loops of the process to the same state, and generalizes an existing result by Zadeh and Stanford (2016).
The manuscript is organized as follows. In Section 2, after presenting some background material related to infinite-dimensional phase-type distributions, also known as SPH-distributions, we present results that connect phase-type distributions with infinite mixtures of Erlang distributions. This section includes the connection with Bayesian nonparametrics, which then allows us to adapt known posterior inference techniques, shown in Section 3. An extensive Monte Carlo simulation study is included in Section 4. Section 5 illustrates the availability of a closed expression for the renewal function and density estimation with aquaculture data. Some concluding remarks are deferred to Section 6.

A SPH-distribution representation via Erlang kernels
When the number of transient states is infinite, p = ∞, PH-distributions are not necessarily proper; this defines the class of infinite-dimensional phase-type (IPH) distributions. Shi et al. (1996) give conditions under which a subset of this class, identified with the acronym SPH, contains only proper distribution functions. Let us recast their result:

Corollary 1 (Shi et al., 1996). Let (w_1, w_2, . . .) be a probability vector and, for each h = 1, 2, . . ., let F_h(t) be the cdf of a PH_{p_h}(π_h, T(p_h)) distribution, with initial distribution π_h and subintensity matrix T(p_h), where (p_h)_{h=1}^{∞} is a sequence of finite dimension values. Now assume there exists λ := sup_h max_i λ_i^{(h)} < ∞, where the λ_i^{(h)} denote the absolute diagonal elements of T(p_h). Then F(t) = Σ_{h=1}^{∞} w_h F_h(t) is the cdf of an infinite-dimensional phase-type distribution with initial distribution φ = (w_1 π_1, w_2 π_2, . . .), block-diagonal subintensity matrix W = diag(T(p_1), T(p_2), . . .), and density

f(t) = φ e^{Wt} w, where w = −W1. (6)

Such a distribution is referred to as an SPH-distribution.
Theorem 2.1 in Shi et al. (1996) establishes that SPH-distributions are proper if and only if the infinite matrix W is invertible. SPH-distributions share closure properties similar to those of PH-distributions; in fact, the class of finite PH-distributions is contained in the SPH class. In particular, if all diagonal elements of the matrix W are bounded and W is invertible, then Σ_{h=1}^{∞} w_h F_h(t) defines a proper density of the form in (6) with parameters (φ, W), as established in Shi et al. (1996, Theorem 2.2). The developments in the following sections are valid for p arbitrarily large but finite. The conditions in Shi et al. (1996) justify using infinite mixtures of Erlang distributions, which satisfy such conditions, have a phase-type representation, and can be seen as infinite-dimensional PH-distributions.
In general, PH-distributions are not identifiable (Telek and Horváth, 2007), meaning that two different sets of parameters, (π, T) and (π*, T*), may yield the same distribution. This lack of identifiability can be seen as a consequence of the observation process: there are different possible trajectories of the embedded Markov jump process that lead to the same observed absorption time. Additionally, even identifiable cases, such as the exponential distribution of rate one, can be represented as a non-identifiable PH-distribution of higher dimension p. Notice that there is a difference between the dimension and the order of a PH-distribution, the latter being the smallest dimension among all its representations. All this complicates the learning process in estimation procedures, as has been noted when fitting the Old Faithful geyser data.
As mentioned above, the literature offers some instances of connections between phase-type distributions and mixture models. However, to the best of our knowledge, such connections have not been used for Bayesian inference. Here, we use a novel characterization of PH-distributions as SPH-distributions with Erlang kernels.

Proposition 1. Let Y ∼ PH_p(π, T), where π ∈ S_p, T ∈ T, p < ∞, and denote by P the transition matrix of the embedded Markov process. Then the Laplace transform of Y can be represented as L(s) = π Σ_{k=0}^{∞} (DP)^k D(I − P)1, for s > 0, where D denotes the diagonal matrix with elements λ_i/(λ_i + s), corresponding to the Laplace transforms of exponential distributions with rates λ_i > 0, i = 1, . . . , p.
Proof. Using the same notation as in Section 1, the Laplace transform of Y, L(s) := E[e^{−sY}], can be represented through a non-homogeneous linear system of equations, as there is a positive probability that the embedded Markov process {X_t}_{t≥0} jumps to the absorbing state directly from any initial state. That is, conditioning on the first jump from state i,

L_i(s) = Σ_{j=1}^{p} (λ_i/(λ_i + s)) p_ij L_j(s) + (λ_i/(λ_i + s)) (1 − Σ_{j=1}^{p} p_ij), i = 1, . . . , p.

Here, p_ij denotes the ij-element of P and L_i(s) the Laplace transform of the absorption time started from state i. The first term of L_i corresponds to the case where the embedded Markov process {X_t}_{t≥0} jumps to state j after an exponential time in state i. The second term corresponds to the case where the process jumps directly to the absorbing state from state i.
Using the notation L := (L_1(s), . . . , L_p(s))^t, we have L = DPL + D(1 − P1), which, solving with respect to L, simplifies to L = (I − DP)^{−1} D(1 − P1). Hence, the Laplace transform of Y can be expressed as L(s) = πL = π(I − DP)^{−1} D(1 − P1) and, for s > 0, represented as L(s) = π(Σ_{k=0}^{∞} (DP)^k) D(I − P)1. This latter representation relies on the series expansion (I − DP)^{−1} = Σ_{k=0}^{∞} (DP)^k, so we have to show that ‖DP‖_∞ < 1. Using the sub-multiplicative property of the matrix norm and the fact that the norm of the sub-stochastic matrix P is at most one, we have

‖DP‖_∞ ≤ ‖D‖_∞ ‖P‖_∞ ≤ λ_max/(λ_max + s) < 1,

where the second inequality follows from max_i Σ_j |p_ij| ≤ 1, with λ_max = max{λ_1, . . . , λ_p}, using that x ↦ x/(x + s) is increasing. Hence ‖DP‖_∞ < 1, implying that (I − DP)^{−1} = Σ_{k=0}^{∞} (DP)^k. Note that the Laplace transform of phase-type distributions is also well defined for some negative values of its argument. However, having s > 0 is sufficient for the celebrated Laplace uniqueness theorem (e.g. Feller, 1971, Section XIII) to apply, and thus to ensure the coincidence in distribution of positive random variables. This will be used in the results below.
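The series representation in Proposition 1 is easy to check numerically. The sketch below (Python; the function name is ours) evaluates L(s) = π Σ_k (DP)^k D(I − P)1 by repeatedly applying DP to a vector, and can be compared against a transform known in closed form, e.g. L(s) = 2/((s+1)(s+2)) for the convolution of Exp(1) and Exp(2).

```python
def laplace_ph(s, pi, lams, P, terms=200):
    """Laplace transform L(s) = pi * sum_k (DP)^k D (I - P) 1, where
    D = diag(lam_i / (lam_i + s)); the series converges for s > 0
    since ||DP|| < 1 (Proposition 1)."""
    p = len(pi)
    d = [lams[i] / (lams[i] + s) for i in range(p)]
    cur = [d[i] * (1.0 - sum(P[i])) for i in range(p)]   # D (I - P) 1
    total = 0.0
    for _ in range(terms):
        total += sum(pi[i] * cur[i] for i in range(p))
        cur = [d[i] * sum(P[i][j] * cur[j] for j in range(p)) for i in range(p)]
    return total
```

With π = (1, 0), rates (1, 2) and P moving state 1 to state 2 (no loops), the series terminates after two terms because P is nilpotent.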
Proposition 2. Let Y ∼ PH_p(π, T), where π ∈ S_p and T = λ(P − I); that is, T is a subintensity matrix with all the elements on the main diagonal equal to −λ, with P the associated transition matrix of the embedded Markov process and λ > 0. Denote by Er(a, b) the Erlang distribution with shape a, rate b and mean a/b. Then the density of Y can be represented as

f(y) = Σ_{k=1}^{∞} α_k Er(y | k, λ), with α_k = πP^{k−1}(I − P)1,

which is an identifiable statistical model.
Proof. The proof follows directly from Proposition 1. In fact, as T = λ(P − I), then D = (λ/(λ + s)) I, where λ/(λ + s) is the Laplace transform of an exponential distribution with rate λ > 0, and hence

L(s) = Σ_{k=0}^{∞} (λ/(λ + s))^{k+1} πP^k (I − P)1,

where (λ/(λ + s))^{k+1} is the Laplace transform of an Er(k + 1, λ) distribution. Then, when the matrix T has all the elements of its diagonal equal to −λ, the phase-type distribution has an equivalent representation as an infinite mixture of Erlang distributions. The mixture is finite when P is nilpotent, because then P^q = 0 for all q ≥ p. Note that the family of Erlang kernels in the mixture has a lexicographical order; then, using Theorem 2 in Teicher (1963), one can deduce that a phase-type distribution with T = λ(P − I) and P nilpotent is an identifiable statistical model.
Proposition 3. Let Y ∼ PH_p(π, T) and denote by P* := (1/λ)T + I a transition matrix that admits loops to the same state, where λ is any arbitrary value such that λ > max{−diag(T)}. Then the density of Y can be represented as

f(y) = Σ_{k=1}^{∞} α*_k Er(y | k, λ), with α*_k = π(P*)^{k−1}(I − P*)1. (7)

Proof. First, note that P* is a sub-stochastic matrix. In fact, the diagonal elements are of the form p_ii = 1 − λ_i/λ and the off-diagonal elements are p_ij = λ_ij/λ, so each row adds up to a number less than or equal to one. On the other hand, let D* = (λ/(λ + s)) I, s > 0, and note that T = λ(P* − I). Using the result in Proposition 1 with the pair (P*, D*), and proceeding as in Proposition 2, we obtain L(s) = Σ_{k=0}^{∞} (λ/(λ + s))^{k+1} π(P*)^k (I − P*)1, which is the Laplace transform of the Erlang mixture in (7).

Example. Consider the convolution of Exp(1) and Exp(2) distributions, with Laplace transform L_Y(s) = π(sI − T)^{−1}t = 2/((s+1)(s+2)), namely the Laplace transform of the convolution of two Erlang distributions, Er(1, 1) and Er(1, 2). The equivalent representation of Proposition 3 is obtained from the transition matrix P* = (1/λ)T + I, for any λ > max{−diag(T)} = 2, with weights α*_k = π(P*)^{k−1}(I − P*)1, so that the density of Y has an equivalent representation as an infinite mixture of Erlang distributions of the form (7). The representation (π, P*) can also be expressed without loops to the same state, but with double the number of states, as Y ∼ PH_4(ν, Q). Computing the eigenvalues and eigenvectors of Q, it is immediate to show that the Laplace transform of such a representation is also given by L_Y(s) = 2/((s+1)(s+2)). This example illustrates how a phase-type distribution with representation (π, P*) can be represented without loops to the same state, at the cost of doubling the number of states.
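The Erlang-mixture weights of Proposition 3 can be computed by iterating the row vector πP* — a minimal sketch (Python; names and the choice λ = 3 are ours) for the convolution example with Laplace transform 2/((s+1)(s+2)), i.e. the convolution of Er(1, 1) and Er(1, 2):

```python
import math

def erlang_mixture_weights(pi, P, K):
    """First K weights alpha_k = pi P^{k-1} (I - P) 1, k = 1..K."""
    p = len(pi)
    exit_mass = [1.0 - sum(P[i]) for i in range(p)]  # (I - P) 1
    row = pi[:]                                      # pi P^{k-1}
    alphas = []
    for _ in range(K):
        alphas.append(sum(row[i] * exit_mass[i] for i in range(p)))
        row = [sum(row[i] * P[i][j] for i in range(p)) for j in range(p)]
    return alphas

# Convolution of Exp(1) and Exp(2): pi = (1, 0), T = [[-1, 1], [0, -2]].
# Any lam > 2 works; take lam = 3, so P* = I + T/lam admits self-loops.
lam = 3.0
T = [[-1.0, 1.0], [0.0, -2.0]]
Pstar = [[(1.0 if i == j else 0.0) + T[i][j] / lam for j in range(2)]
         for i in range(2)]
alphas = erlang_mixture_weights([1.0, 0.0], Pstar, 400)

def mixture_density(y, alphas, lam):
    """Evaluate sum_k alpha_k Er(y | k, lam)."""
    dens, term = 0.0, lam * math.exp(-lam * y)   # Er(1, lam) density at y
    for k, a in enumerate(alphas, start=1):
        dens += a * term
        term *= lam * y / k                       # Er(k+1) from Er(k)
    return dens
```

The weights sum to one and the truncated mixture matches the exact convolution density f(y) = 2(e^{-y} − e^{-2y}) to numerical precision.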
Proposition 3 states that any phase-type distribution has an equivalent representation as an infinite mixture of Erlang kernels. Hence, a natural way to represent a genuine infinite mixture distribution, without resorting to truncation, is to use an infinite-dimensional prior distribution like the Dirichlet process. Note that, given a value of λ and the number of phases p, the infinite sequence of weights α*_k, k = 1, 2, . . ., can always be recovered with a stick-breaking construction like the one used to construct the Dirichlet process; see, e.g., Bissiri and Ongaro (2014). The parameter λ somehow mimics the effect of the total mass parameter in the DP. Indeed, fitting the mixture model of Proposition 3 can be achieved by fitting a DPM model with the following hierarchical representation:

y_k | φ_k, λ ind∼ Er(⌈φ_k⌉, λ), k = 1, . . . , n,
φ_k | P iid∼ P, P ∼ DP(α, P_0),
λ ∼ Ga(a_1, a_2), (8)

with P_0 = Ga(a_0, b_0), the Gamma distribution with mean a_0/b_0, and where ⌈x⌉ denotes the least integer greater than or equal to x. Though one could think of modeling the shape parameter of the Erlang distribution above with a Dirichlet process with atomic baseline P_0, this results in poor posterior performance, as noted by Canale and Prünster (2017). Therefore, to avoid this, as well as to keep mixtures of Erlangs and benefit from the closed-form expressions for phase-type distributions, we resort to the above rounding mechanism. Indeed, this resembles the rounding function approach suggested by Canale and Dunson (2011).
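To illustrate the hierarchical model (8), the following sketch draws from its prior predictive, assuming the stick-breaking form of the DP with Beta(1, α) sticks, a Ga(a_0, b_0) base measure rounded up to an integer shape, and a Ga(a_1, a_2) prior on λ; the function name and default hyper-parameters are ours (the defaults mirror the simulation study's choices).

```python
import math
import random

def sample_prior_predictive(n, alpha=1.0, a0=2.0, b0=0.1, a1=0.1, a2=0.1, rng=None):
    """Draw n observations from the prior predictive of model (8):
    DP(alpha, Ga(a0, b0)) on the Erlang shape via the ceiling function,
    and a common rate lam ~ Ga(a1, a2)."""
    rng = rng or random.Random(0)
    lam = rng.gammavariate(a1, 1.0 / a2)   # gammavariate uses (shape, scale)
    ys, sticks, atoms, remaining = [], [], [], 1.0
    for _ in range(n):
        u = rng.random()
        acc, h = 0.0, 0
        while True:
            # lazily extend the stick-breaking weights until u is covered
            if h == len(sticks):
                v = rng.betavariate(1.0, alpha)
                sticks.append(remaining * v)
                remaining *= (1.0 - v)
                atoms.append(math.ceil(rng.gammavariate(a0, 1.0 / b0)))
            acc += sticks[h]
            if u < acc:
                shape = atoms[h]
                break
            h += 1
        # Er(shape, lam) as a sum of `shape` Exp(lam) variables
        ys.append(sum(rng.expovariate(lam) for _ in range(shape)))
    return ys
```

Because the atoms are drawn lazily, only the finitely many components actually visited are ever instantiated, mirroring the non-truncated nature of the model.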
For the sake of simplicity, model (8) has a common rate parameter λ; however, this is not a limitation, as mixtures of Erlang distributions with a common rate parameter are dense in the space of distributions supported on the positive real numbers (Tijms, 1994; Lee and Lin, 2010).

Posterior inference
Assume that we have iid observations y = (y_1, . . . , y_n) from model (8). The main difficulty in the estimation process is the infinite-dimensional nature of the parameter space; in the Bayesian nonparametric literature, dealing with this is virtually a routine problem. We begin by examining the likelihood function of the observed data,

L(y | w, φ, λ) = Π_{k=1}^{n} Σ_{h=1}^{∞} w_h Er(y_k | ⌈φ_h⌉, λ). (9)

Now, let d_k be a latent membership variable such that P[d_k = h] = w_h, for k = 1, . . . , n and h = 1, 2, . . .. This inclusion leads to a simplification of (9), which can be rewritten as

L(y, d | w, φ, λ) = Π_{k=1}^{n} Er(y_k | ⌈φ_{d_k}⌉, λ) Π_{h=1}^{∞} w_h^{n_h}, (10)

where d = (d_1, . . . , d_n) and n_h = Σ_{k=1}^{n} 1{d_k = h}, h = 1, 2, . . .. To bypass the computation of an infinite number of terms in (10), Walker (2007) introduced a set of latent slice variables {u_k}_{k=1}^{n} such that

f(y_k, d_k, u_k | w, φ, λ) = 1{u_k < w_{d_k}} Er(y_k | ⌈φ_{d_k}⌉, λ).

The advantage of this approach is that only a finite subset of the w_h's will satisfy the condition (w_h > u_k, k = 1, . . . , n). Hence, in a sampling scenario it is only necessary to sample N parameter sets, where N = max_k {N_k} and N_k is the smallest integer such that Σ_{h=1}^{N_k} w_h > 1 − u_k (Walker, 2007; Kalli et al., 2011). Namely, in a particular iteration, the infinite-dimensional parameters (w, φ) reduce to a finite number. The number of components, N, is random and its magnitude depends on the complexity of the data, influenced by aspects like the number of modes, skewness levels or outlying observations. Using the stick-breaking representation of (3), with stick lengths denoted by v_h, and adding the latent variables d_k previously defined, the complete-data likelihood for the n observations ends up being

L(y, d, u | v, φ, λ) = Π_{k=1}^{n} 1{u_k < w_{d_k}} Er(y_k | ⌈φ_{d_k}⌉, λ), with w_h = v_h Π_{l<h} (1 − v_l).

The following algorithm implements a slice Gibbs sampler based on the above likelihood. Details of the corresponding full conditionals can be found in the Supplementary Material (Ayala et al., 2021).
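The key computational point, that the slice variables induce a random but finite truncation, can be sketched as follows (Python; an illustrative helper, not the authors' mcmcErlangMix function):

```python
def truncation_level(weights, us):
    """Smallest N such that the first N stick-breaking weights sum to more
    than 1 - min(us). Beyond this N, the leftover mass is below every slice
    u_k, so no further component can satisfy w_h > u_k (Walker, 2007)."""
    target = 1.0 - min(us)
    acc = 0.0
    for n, w in enumerate(weights, start=1):
        acc += w
        if acc > target:
            return n
    return len(weights)
```

In the sampler, `weights` would be extended lazily via stick-breaking until the condition is met, so only finitely many (w_h, φ_h) pairs are ever updated per iteration.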
To learn about the precision parameter α, one can further implement the step described in Escobar and West (1995). See details in the Supplementary Material. This algorithm was implemented in an R function, mcmcErlangMix. The code is included in the Supplementary Material. It is worth emphasizing that any other valid algorithm for DPM models could be alternatively used.
Note that the mixture model induced at each iteration of Algorithm 1 is given by

f^{(r)}(y) = Σ_{h=1}^{N*} w_h^{(r)} Er(y | ⌈φ_h^{(r)}⌉, λ^{(r)}), (12)

where the superscript r denotes the iteration and N* ≤ N is the effective number of mixture components, that is, the number of different values among the φ_h^{(r)} in φ^{(r)}. Consequently, it is necessary to aggregate those weights w_h^{(r)} in w^{(r)} whose corresponding φ_h^{(r)} values coincide.

Phase-type representation
Here, we develop an algorithm to recover the parameters of the phase-type representation from model (8). The algorithm is based on the phase-type representation of mixture (12), which corresponds to an initial distribution π = (w_N, . . . , w_1) and a bidiagonal subintensity matrix T with diagonal elements −λ and superdiagonal elements λ. Note that the maximum value of h determines the number of phases, and the initial probabilities are given by the weights in reverse order. For instance, the first component in the mixture is an Erlang distribution with parameters 1 and λ, which is represented by the initial probability π_N = w_1; that is, a process that starts in state N, where it remains an exponential time, and then jumps to the absorbing state. On the other hand, the last component in the mixture is an Erlang distribution with parameters N and λ. In this case, the initial probability is π_1 = w_N; here, the process starts in state 1 and then jumps through the following states until absorption from state N.
In our case, the parameters for the r-th posterior sample are w_1^{(r)}, . . . , w_{N*}^{(r)}, φ_1^{(r)}, . . . , φ_{N*}^{(r)}, and λ^{(r)}, for r = 1, . . . , R. The number of phases is computed using the same logic as in mixture (12), that is, p = max_h ⌈φ_h^{(r)}⌉. Finally, the subintensity matrix T is constructed as a p × p bidiagonal matrix, with t_jj = −λ^{(r)} for j = 1, . . . , p, and t_{j,j+1} = λ^{(r)} for j = 1, . . . , p − 1. The estimated π has a sparse structure, which makes sense as we are representing a phase-type distribution with p < ∞ through an infinite-dimensional mixture distribution.
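The recovery step just described admits a direct implementation. A sketch (Python; the function name is ours) mapping a posterior draw of weights, rounded shapes ⌈φ_h⌉ and rate λ to the bidiagonal (π, T):

```python
def mixture_to_ph(weights, shapes, lam):
    """Build (pi, T) of the PH representation of sum_h w_h Er(shapes[h], lam):
    p = max shape; T is p x p bidiagonal with -lam on the diagonal and lam on
    the superdiagonal; a component of shape m places its weight m states
    before absorption (so the shape-1 component starts at the last state)."""
    p = max(shapes)
    pi = [0.0] * p
    for w, m in zip(weights, shapes):
        pi[p - m] += w
    T = [[0.0] * p for _ in range(p)]
    for j in range(p):
        T[j][j] = -lam
        if j + 1 < p:
            T[j][j + 1] = lam
    return pi, T
```

The resulting π is sparse whenever the posterior uses few distinct shapes, consistent with the remark above.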
It is important to emphasize that, even though the parameter space is of infinite dimension by definition, the resulting mixture estimate and corresponding phase-type representation can only be expressed as a finite-dimensional object, due to the random truncating nature of Algorithm 1.

Monte Carlo study
To assess the behavior of Algorithm 1 for model (8), a Monte Carlo (MC) study was designed to estimate six different density functions on R + . Distributions, their corresponding density functions and parameter values are shown in Table 1. It is worth noting that these distributions do not belong to the matrix-exponential distributions family, except obviously for the phase-type case.
The simulation study was structured as follows: first, we generated 100 random samples for every distribution, each with three sample sizes (n = 125, 250, 500). Subsequently, MCMC density estimates were obtained for all samples via Algorithm 1. The hyper-parameters for the base measure were fixed at a_0 = 2 and b_0 = 0.1, reflecting our lack of knowledge about their values and resulting in a prior mean for φ equal to 20 and a variance of 200. Similarly, the hyper-parameters for the parameter λ were fixed at a_1 = a_2 = 0.1, resulting in a prior mean for λ equal to 1 and a variance of 10, covering the parameter space. Finally, for the hyper-prior distribution of the precision parameter α, we assumed a_α = b_α = 1.
Subsequently, to quantitatively gauge the overall performance of the estimation process, the Mean Integrated Squared Error (MISE), E‖f_n − f‖_2^2 = E ∫ (f_n(x) − f(x))^2 dx, was estimated for every density function and sample size, over a reasonable grid covering up to two times the maximum of the sampled deviates. Here, f and f_n denote the true and estimated densities, respectively. The results for the MISE are reported in Table 2; they show an improvement in the precision of the density estimates as the sample size increases, for each distribution. Lastly, we analyzed the resulting dimension p of the equivalent phase-type representation for every replication of the Monte Carlo study. For each estimation, we computed the mode of p over the posterior samples and report it for each MC sample in Figure 2. We observe that for the distributions with the lowest MISE values (GIG, GIG mixture, and phase-type), the estimated dimension needed to accommodate the given data complexity is also low. In contrast, the required dimension for the phase-type representation is noticeably more spread out for the remaining distributions. Table 3 includes the mode of the estimated dimension p for each scenario, that is, the mode over the 100 replications. It also shows the sample average skewness, γ_1, and sample kurtosis of all generated samples of size 500. We observe that for unimodal distributions, the magnitude of the mode of p is related to negative values of γ_1; specifically, the three-parameter Weibull distributed data produced the largest estimated value of p, with 109 transient states.
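For completeness, the integrated squared error underlying the MISE can be approximated on a grid; a minimal sketch (Python; trapezoidal rule, names ours), whose value averaged over Monte Carlo replications estimates the MISE:

```python
def integrated_squared_error(f_true, f_hat, grid):
    """Approximate the integral of (f_hat(x) - f_true(x))^2 over the grid
    by the trapezoidal rule; `grid` is an increasing list of evaluation
    points covering the support of interest."""
    sq = [(f_hat(x) - f_true(x)) ** 2 for x in grid]
    return sum(0.5 * (sq[i] + sq[i + 1]) * (grid[i + 1] - grid[i])
               for i in range(len(grid) - 1))
```

A uniform grid extending to twice the sample maximum, as in the study, keeps the truncation bias of the integral negligible for light-tailed estimates.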


Comparison with other inference approaches
Here we provide two simulated examples to compare our approach with the results from the R packages mapfit and PhaseType. The mapfit package was developed by Okamura and Dohi (2015) and is based on a variation of the Expectation-Maximization algorithm. The PhaseType package (Aslett, 2012) follows the Bayesian approach proposed by Bladt et al. (2003). This latter approach resorts to the Metropolis-Hastings algorithm to simulate the underlying Markov jump processes and then performs the inference, placing a gamma prior on the elements of T and a Dirichlet prior on the initial probabilities π. We simulated n = 1,000 realizations from the phase-type distributions PH_5(π_1, T_1) and PH_10(π_2, T_2), where π_1 = (0, 0, 0, 1, 0), π_2 = (0.6, 0, 0, 0, 0, 0, 0, 0.4, 0, 0), and T_2 is a bidiagonal matrix with elements t_ii = −1.5, i = 1, . . . , 10, and t_{i,i+1} = 1.5, i = 1, . . . , 9. Figure 3 shows the estimated densities. For the distribution PH_5(π_1, T_1), which is unimodal, the mapfit package and our proposal achieved very good fits, while the performance of the PhaseType package was poor. For the distribution PH_10(π_2, T_2), our proposal gives good results, whereas mapfit and PhaseType were unable to detect the two modes. The number of phases in the mapfit and PhaseType packages must be fixed by the user; we tried different values, e.g., 10, 20, 30, and Figure 3 shows the best results we observed, corresponding to p = 30. In contrast, our inference strategy adjusts the number of phases automatically, which is an advantage. O'Cinneide (1990, 1999) demonstrates that a phase-type distribution of order p is determined by at most 2p − 1 independent parameters. The results of the mapfit and PhaseType packages are based on p^2 + p − 1 parameters, which is clearly larger than 2p − 1. In general, for highly redundant parametrizations, the performance of phase-type estimation methods is well known to be problematic.
Our inference strategy, in contrast, reduces to estimating p parameters. Overall, our method avoids the simulation of latent processes, used by most approaches available in the literature, which results in a more efficient technique.

Renewal function estimation
Within the context of renewal theory, a quantity of great interest is the expected number of renewals at time τ, U(τ) := E[η(τ)], of the associated counting process η (Feller, 1971). In the case of phase-type inter-arrival times, U(τ) has an analytical expression (Bladt and Nielsen, 2017) given by

U(τ) = τ/μ + π(I − e^{(T + tπ)τ})(1ν − T − tπ)^{−1} t, (13)

where t = −T1, μ = πT^{−2}t = −πT^{−1}1 is the mean inter-arrival time, and ν = −πT^{−1}/μ is the stationary distribution of the intensity matrix T + tπ.
Accordingly, we exemplify the computation of the renewal function for the phase-type simulated data sets described in Table 1. To that effect, we constructed the set of parameters (π, T) from the Erlang mixture parameter estimates, as specified in Section 3.1, and then computed U(τ). The resulting U(τ) estimates are shown in Figure 4. Here we can see that the computed renewal functions do not differ significantly from the true renewal function. Furthermore, we point out the ability to capture the non-linear behavior of the true renewal function, a feature not captured by simple Poisson counting processes (Winkelmann, 1995).
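Expression (13) can be evaluated with elementary linear algebra. The sketch below (pure Python; helper names ours) computes U(τ) = τ/μ + π(I − e^{(T+tπ)τ})(1ν − T − tπ)^{−1}t, the standard closed form for phase-type renewal processes, with ν = −πT^{−1}/μ; for Erlang(2, 1) inter-arrivals it reproduces the textbook result U(τ) = τ/2 − (1 − e^{−2τ})/4.

```python
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_exp(A, terms=60):
    """Truncated series for e^A; adequate for small, well-scaled matrices."""
    n = len(A)
    E = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in E]
    for l in range(1, terms):
        term = [[v / l for v in row] for row in mat_mul(term, A)]
        E = [[E[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return E

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b_ for a, b_ in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def renewal_function(tau, pi, T):
    """U(tau) for a renewal process with PH(pi, T) inter-arrival times."""
    p = len(pi)
    t = [-sum(T[i]) for i in range(p)]                 # exit rates t = -T 1
    x = solve(T, [-1.0] * p)                           # x = -T^{-1} 1
    mu = sum(pi[i] * x[i] for i in range(p))           # mean inter-arrival
    Tt = [[T[j][i] for j in range(p)] for i in range(p)]
    piTinv = solve(Tt, list(pi))                       # row vector pi T^{-1}
    nu = [-v / mu for v in piTinv]                     # stationary law of T + t pi
    Lam = [[T[i][j] + t[i] * pi[j] for j in range(p)] for i in range(p)]
    M = [[nu[j] - Lam[i][j] for j in range(p)] for i in range(p)]
    z = solve(M, t)                                    # (1 nu - T - t pi)^{-1} t
    E = mat_exp([[Lam[i][j] * tau for j in range(p)] for i in range(p)])
    w = [z[i] - sum(E[i][j] * z[j] for j in range(p)) for i in range(p)]
    return tau / mu + sum(pi[i] * w[i] for i in range(p))
```

In the application, (π, T) would be the bidiagonal representation recovered from the posterior Erlang mixture draws, and posterior samples of U(τ) follow by evaluating this function draw by draw.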

Application to real data sets
Here we compute density function and renewal function estimates for three datasets: the first two have been studied in the renewal processes literature, and the third belongs to an aquaculture study. The first dataset consists of the Old Faithful geyser eruption data. We use the more comprehensive data set, which consists of 299 eruption observations (duration and waiting time) from August 1st to August 15th, 1985 (Azzalini and Bowman, 1990).
The estimated density clearly captures the bimodal shape of the observed data (Figure 5, panel (a)), as well as the delay for times lower than 40 minutes. The renewal function U(τ) in this case presents a distinctly nonlinear pattern, reflecting the multimodal density function (Figure 5, panel (b)). The second dataset comprises the coal-mining disasters presented in Jarrett (1979) and studied by Xiao (2015). It reports the number of days between 191 successive explosions at coal mines involving ten or more men killed, covering the period from March 15th, 1851 to March 22nd, 1962. The observations exhibit an exponential-like shape with a heavy tail, which is known to be challenging to model using finite-dimensional phase-type distributions (Bladt and Rojas-Nandayapa, 2018). Nevertheless, our proposed model correctly captures this shape, as shown in Figure 6. The renewal function estimate is again non-linear. Having expression (13) at hand allows us to compute the number of expected events for different time windows. For example, by day 365 one can expect the occurrence of two (≈ 2.06) mine disasters with the characteristics of interest. By day 700, the predicted number is almost four events (≈ 3.88), which is of paramount importance for risk assessment.
Finally, we tackled an estimation problem within a fish farming setup. The main goal was to estimate the density function of the salmon weight population in a cage throughout time. Although the underlying Markov jump process in phase-type distributions represents a time until absorption, phase-type distributions are suitable for any random variable with support on the positive real numbers. The dataset contains weight measurements of sampled fish in a culture tank at day 15 (n = 243), day 34 (n = 256), day 74 (n = 195) and day 154 (n = 251). The relevance of density estimation in this context lies in the need to know the proportion of fish within a given weight range, due to the different commercial value of fish according to size. Figure 7 shows the posterior mean estimates (solid lines) and 95% credible intervals (shaded areas) for the salmon weight density by day. In the first stages of development, we can observe more variability and even two modes. The weight distribution can be explained, in part, by the vaccination effect: it is well known that a proportion of the individuals in a cage do not receive the corresponding doses due to difficulties in their capture.

Discussion and concluding remarks
In this work we demonstrated a clear connection between phase-type distributions, mixtures of Erlang distributions, and Bayesian nonparametric inference. As established in Propositions 1 to 3, any phase-type distribution has an equivalent representation as an infinite mixture of Erlang distributions, and if its associated transition matrix P is nilpotent, then the corresponding Erlang mixture is finite-dimensional. In addition, if λ_1 ≤ λ_2 ≤ . . . ≤ λ_p, the finite mixture is an identifiable statistical model.
Although some links between phase-type distributions and Erlang mixtures have been explored in the past, none of them have exploited this relationship for inference. Bayesian nonparametric methods allow us to treat infinite-dimensional statistical models and thus fit phase-type distributions using their infinite mixture model representations. Under our framework, one avoids the simulation of a latent Markov jump process for each observation when implementing a Gibbs sampler, overcoming serious identifiability and numerical problems inherent to other methods available in the literature, e.g. Bladt et al. (2003) and Aslett (2012). The significant reduction of the computational burden translates into faster and more efficient algorithms.
As a byproduct of the established connection, we are able to use well-known closed-form expressions for functionals of phase-type distributions that are not readily available from a DPM model standpoint. See, e.g., heavy-tailed data modeling in queueing theory (Greiner et al., 1999); the numerical approximations to estimate the Lorenz curve and Gini index in Hasegawa and Kozumi (2003); or the numerical inversion of the Laplace transform to compute the renewal function, when inter-arrival times follow an infinite mixture of Erlang distributions, in Xiao (2015). In particular, the renewal function is no longer an approximation but an exact analytic quantity. Therefore, we are now capable of performing inference on a counting process by analyzing its waiting times, even when their distribution does not belong to the exponential family, which has been the usual approach. This is a remarkable result, as we can study non-regular counting processes as long as their inter-arrival times are assumed to be iid phase-type random variables. The connection also makes it feasible to model over- and underdispersion, something not possible in the Poisson-exponential scenario (Cox, 1962).

Supplementary Material
Supplementary material for: On a Dirichlet process mixture representation of phase-type distributions (DOI: 10.1214/21-BA1272SUPP; .pdf). The online Supplementary Material contains the posterior inference derivations of the algorithm of Section 3, some results of the Monte Carlo simulation study of Section 4, and an R function for posterior sampling.