Predictive inference with Fleming--Viot-driven dependent Dirichlet processes

We consider predictive inference using a class of temporally dependent Dirichlet processes driven by Fleming--Viot diffusions, which have a natural bearing in Bayesian nonparametrics and make the resulting family of random probability measures amenable to analytical posterior analysis. Formulating the implied statistical model as a hidden Markov model, we fully describe the predictive distribution induced by these Fleming--Viot-driven dependent Dirichlet processes, for a sequence of observations collected at a certain time given another set of draws collected at several previous times. This is identified as a mixture of Pólya urns, whereby the observations can be values from the baseline distribution or copies of previous draws collected at the same time as in the usual Pólya urn, or can be sampled from a random subset of the data collected at previous times. We characterise the time-dependent weights of the mixture which select such subsets and discuss the asymptotic regimes. We describe the induced partition by means of a Chinese restaurant process metaphor with a conveyor belt, whereby new customers who do not sit at an occupied table open a new table by picking a dish either from the baseline distribution or from a time-varying offer available on the conveyor belt. We lay out explicit algorithms for exact and approximate posterior sampling of both observations and partitions, and illustrate our results on predictive problems with synthetic and real data.


Introduction and summary of results
Bayesian nonparametric methodology has undergone a tremendous development in the last decades, often standing out among competitors for flexibility, interpretability and computational convenience. See for example Hjort et al. (2010), Müller et al. (2015), Ghosal and van der Vaart (2017). The cornerstone of Bayesian nonparametrics is the sampling model based on the Dirichlet process (Ferguson, 1973), whereby

(1.1)    Y_i | X = x ∼ x independently, i = 1, …, n,    X ∼ Π_α.

Here, given a sampling space Y, we use X to denote a random probability measure (RPM) on Y, and observations Y_i are assumed to be independent with distribution x when X = x. We also denote by Π_α the distribution X induces on the space P(Y) of probability measures on Y, with α = θP_0, θ > 0 and P_0 a nonatomic probability measure on Y. Notable properties of the Dirichlet process are its large weak support and conjugacy, whereby the conditional RPM X, given observations Y_1, …, Y_n from (1.1), is still a Dirichlet process with updated parameter α + Σ_{i=1}^n δ_{Y_i}. The great appeal offered by the relative simplicity of the Dirichlet process boosted a number of extensions, among which some of the most successful are mixtures of Dirichlet processes (Antoniak, 1974), Dirichlet process mixtures (Lo, 1984), Pólya trees (Mauldin et al., 1992; Lavine, 1992), Pitman-Yor processes (Perman et al., 1992; Pitman and Yor, 1997), Gibbs-type random measures (Gnedin and Pitman, 2005; De Blasi et al., 2015) and normalised random measures with independent increments (Regazzini et al., 2003; Lijoi et al., 2005), to mention a few. The common thread linking all the above developments is the assumption of exchangeability of the data, equivalent to the conditional independence and identity in distribution in (1.1) by virtue of de Finetti's theorem. This can be restrictive when modelling data that are known to be generated from partially inhomogeneous sources, as for example in time series modelling or when the data are collected in subpopulations.
Such framework can be accommodated by partial exchangeability, a weaker type of dependence whereby observations in two or more groups of data are exchangeable within each group but not overall. If groups are identified by a covariate value z ∈ Z, then observations are exchangeable only if their covariates have the same value.
One of the most active lines of research in Bayesian nonparametrics in recent years aims at extending the basic paradigm (1.1) to this more general framework. Besides pioneering contributions, recent progress has stemmed from MacEachern (1999), who called a collection of RPMs {X_z, z ∈ Z} indexed by a finite-dimensional covariate z ∈ Z a dependent Dirichlet process (DDP) if each marginal measure X_z is a Dirichlet process with parameter that depends on z.
Here we focus on DDPs with temporal dependence, and replace z with t ∈ [0, ∞) representing time. Previous contributions in this framework include Dunson (2006), Caron et al. (2007), Rodriguez and ter Horst (2008), Griffin and Steel (2010), Caron and Teh (2012), Mena and Ruggiero (2016), Caron et al. (2017), Gutierrez et al. (2016), Canale and Ruggiero (2016) and Kon Kam King et al. Many proposals in this area start from the celebrated stick-breaking representation of the Dirichlet process (Sethuraman, 1994), whereby X in (1.1) is such that

(1.2)    X = Σ_{i≥1} p_i δ_{Y_i},    p_i = V_i Π_{j<i} (1 − V_j),    V_i ∼ iid Beta(1, θ),    Y_i ∼ iid P_0,

and the temporal dependence is induced by letting each V_i and/or Y_i depend on time in a way that preserves the marginal distributions. This approach has many advantages, among which: simplicity and versatility, since inducing dynamics on V_i or Y_i allows for a variety of solutions; flexibility, since under mild conditions the resulting processes have large support (cf. Barrientos et al., 2012); ease of implementation, since strategies for posterior computation based on MCMC sampling are readily available. However, the stick-breaking structure makes the analytical derivation of further posterior information, like for example characterizing the predictive distribution of the observations, often a daunting task. This typically holds for other approaches to temporal Bayesian nonparametric modelling as well. Determining explicitly such quantities would not only give a deeper insight into the model posterior properties, which otherwise remain obscure to a large extent, but also provide a further tool for direct application or as a building block in more involved dependent models, whose computational efficiency would benefit from an explicit computation.
In this paper, we provide analytical results related to the posterior predictive distribution of the observations induced by a class of temporal DDP models driven by Fleming-Viot processes. The latter are a class of diffusion processes whose marginal values are Dirichlet processes. The continuous-time dependence is a distinctive feature of our proposal, compared to the bulk of literature in the area. In particular, here we complement previous work done in Papaspiliopoulos et al. (2016), which focussed on identifying the laws of the dependent RPMs involved, by investigating the distributional properties of future observations, characterized as a mixture of Pólya urn schemes, and those of the induced partitions.
More specifically, in Section 2 we detail the statistical model we adopt, which directly extends (1.1) by assuming a hidden Markov model structure whereby observations are conditionally iid given the marginal value of a Fleming-Viot-driven DDP. We recall some key properties of this model, and include a new result on the weak support of the induced prior. In Section 3 we present our main results. Conditioning on samples, possibly of different sizes, collected at p times 0 = t_0 < · · · < t_{p−1} = T, we characterise the predictive distribution of a further sequence drawn at time T + t. This task can be seen as a dynamic counterpart of obtaining the predictive distribution of Y_{k+1} | Y_1, …, Y_k for any k ≥ 1 in (1.1), when the RPM X is integrated out, thus yielding

(1.3)    P(Y_{k+1} ∈ A | Y_1, …, Y_k) = θ/(θ + k) P_0(A) + k/(θ + k) P_k(A)

for any Borel set A of Y, where P_k denotes the empirical distribution of (Y_1, …, Y_k). In the hidden Markov model framework, we identify the predictive distribution of the DDP at time T + t as a time-dependent mixture of Pólya urn schemes. This can be thought of as being generated by a latent variable which selects a random subset of the data collected at previous times, whereby every component of the mixture is a classical posterior Pólya urn conditioned on a different subset of the past data. We characterize the mixture weights, where the temporal dependence arises, and derive an explicit expression for the correlation between observations at different time points. Furthermore, we discuss two asymptotic regimes of the predictive distribution - as the time index diverges, which recovers (1.3), and as the current sample size diverges, which links the sequence at time T + t with its de Finetti measure - and lay out explicit algorithms for exact and approximate sampling from the predictive. Next, we discuss the induced partition at time T + t and derive an algorithm for sampling from its distribution.
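For concreteness, the classical Pólya urn predictive scheme (1.3), which our results extend to the temporal setting, can be sampled sequentially. Below is a minimal Python sketch (an illustration, not code from the paper); `sample_p0` is a hypothetical stand-in for a sampler from the baseline distribution P_0.

```python
import random

def polya_urn_sample(k, theta, sample_p0, rng):
    """Draw k observations from the Polya urn predictive (1.3):
    the (i+1)st draw is a new value from P0 with probability
    theta/(theta + i), or a uniform copy of an earlier draw
    with probability i/(theta + i)."""
    draws = []
    for i in range(k):
        if rng.random() < theta / (theta + i):
            draws.append(sample_p0(rng))      # new value from the baseline P0
        else:
            draws.append(rng.choice(draws))   # copy of a previously drawn value
    return draws
```

With a diffuse P_0 and moderate θ, a long run of this scheme produces far fewer distinct values than draws, reflecting the reinforcement mechanism of the urn.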
The partition sampling process is interpreted as a Chinese restaurant with conveyor belt, whereby arriving customers who do not sit at an already occupied table open a new table by choosing a dish either from the baseline distribution P_0 or from a temporally dependent selection of dishes that run through the restaurant on a conveyor belt, which in turn depends on the past popularity of the dishes. We defer all proofs to the Supplementary Material. Finally, Section 4 illustrates the use of our results for predictive inference through synthetic data and through a dataset on the Karnofsky score related to a Hodgkin's lymphoma study.

Fleming-Viot dependent Dirichlet processes
We consider a class of dependent Dirichlet processes with continuous temporal covariate. Instead of inducing the temporal dependence through the building blocks of the stick-breaking representation (1.2), we let the dynamics of the dependent process be driven by a Fleming-Viot (FV) diffusion. FV processes have been extensively studied in relation to population genetics (see Ethier and Kurtz (1993) for a review), while their role in Bayesian nonparametrics was first pointed out in Walker et al. (2007) (see also Favaro et al., 2009). A loose but intuitive way of thinking of a FV diffusion is as being composed of infinitely many probability masses, associated with different locations in the sampling space Y, each behaving like a diffusion in the interval [0, 1], under the overall constraint that the masses sum up to 1. In addition, locations whose masses touch 0 are removed, while new locations are inserted at a rate which depends on a parameter θ > 0. As a consequence, the random measures X_t and X_s, with t ≠ s, will share some, though not all, of their support points. The transition function that characterizes a FV process admits the following natural interpretation in Bayesian nonparametrics (cf. Walker et al., 2007). Initiate the process at the RPM X_0 ∼ Π_α, and denote by D_t a time-indexed latent variable taking values in Z_+. Conditional on D_t = m ∈ Z_+, the value of the process at time t is drawn from a posterior Dirichlet law, namely

(2.1)    X_t | X_0, D_t = m, Y_1, …, Y_m ∼ Π_{α + Σ_{i=1}^m δ_{Y_i}},    Y_i | X_0 ∼ iid X_0.

Here, the realisation of the latent variable D_t determines how many atoms m are drawn from the initial state X_0, to become atoms of the posterior Dirichlet from which the arrival state is drawn. Such D_t is a pure-death process, which starts at infinity with probability one and jumps from state m to state m − 1 after an exponentially distributed waiting time with state-dependent rate λ_m = m(θ + m − 1)/2.
The transition probabilities of D_t have been computed by Griffiths (1980) and Tavaré (1984); in particular, starting from D_0 = ∞,

(2.2)    d_m(t) := P(D_t = m | D_0 = ∞) = Σ_{k≥m} e^{−k(k+θ−1)t/2} (−1)^{k−m} (2k + θ − 1)(m + θ)_{(k−1)} / (m! (k − m)!),

where (a)_{(b)} := a(a + 1) · · · (a + b − 1) denotes the ascending factorial. Here the fact that D_0 = ∞ almost surely should be understood as an entrance boundary, i.e., the process decreases from infinity at infinite speed so that at each t > 0 the value of D_t is finite. The unconditional transition of the FV process is thus obtained by integrating D_t, Y_1, …, Y_{D_t} out of (2.1), leading to

(2.3)    P_t(x, dx′) = Σ_{m≥0} d_m(t) ∫_{Y^m} Π_{α + Σ_{i=1}^m δ_{y_i}}(dx′) x(dy_1) · · · x(dy_m).

This was first found by Ethier and Griffiths (1993). It is known that Π_α is the invariant measure of P_t; hence if the initial distribution is Dirichlet, X_0 ∼ Π_α, then all marginal RPMs X_t are Dirichlet processes with the same parameter. In particular, the death process D_t determines the correlation between RPMs at different times. Indeed, a larger t implies a lower m with higher probability, hence a decreasing (on average) number of support points will be shared by the random measures X_0 and X_t as t increases. On the contrary, as t → 0 we have D_t → ∞, which in turn implies infinitely many atoms shared by X_0 and X_t, until the two RPMs eventually coincide. See Lijoi et al. (2016) for further discussion. For definiteness, we formalize the following definition.
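The pure-death structure of D_t is easy to simulate forward from a finite state, since it only involves successive exponential waiting times with the rates λ_m = m(θ + m − 1)/2 given above. A minimal Python sketch (illustrative under these assumptions, not code from the paper):

```python
import random

def death_process_level(m, theta, t, rng):
    """Level of the pure-death process at time t, started at a finite
    state m: it jumps from j to j - 1 after an Exp(lambda_j) waiting
    time, with lambda_j = j * (theta + j - 1) / 2."""
    level, elapsed = m, 0.0
    while level > 0:
        rate = level * (theta + level - 1) / 2.0
        elapsed += rng.expovariate(rate)   # waiting time before the next death
        if elapsed > t:
            break                          # no further jump within [0, t]
        level -= 1
    return level
```

Since the rates grow quadratically in the level, high levels are abandoned almost instantly, which mirrors the entrance-from-infinity behaviour described above.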
Definition 1. A Markov process {X t } t≥0 taking values in the space of atomic measures on Y is a Fleming-Viot dependent Dirichlet process with parameter α, denoted X t ∼ FV-DDP(α), if X 0 ∼ Π α and its transition function is (2.3).
Seeing a FV-DDP as a collection of RPMs, one is immediately led to wonder about the support properties of the induced prior. The weak support of a DDP indexed by an R_+-valued covariate is the smallest closed set in B{P(Y)^{R_+}} having probability one, where P(Y) is the set of probability measures on Y and B{P(Y)^{R_+}} is the Borel σ-field generated by the product topology of weak convergence. Barrientos et al. (2012) investigated these aspects for a large class of DDPs based on the stick-breaking representation of the Dirichlet process. Since no such representation is known for the FV process, our case falls outside that class. The following proposition states that a FV-DDP has full weak support, relative to the support of P_0.
In order to formalize the statistical setup, we cast the FV-DDP into a hidden Markov model framework. A hidden Markov model is a double sequence {(X_{t_n}, Y_{t_n}), n ≥ 0} where X_{t_n} is an unobserved Markov chain, called hidden or latent signal, and Y_{t_n} are conditionally independent observations given the signal. The signal can be thought of as the discrete-time sampling of a continuous-time process, and is assumed to completely specify the distributions of the observations, called emission distributions. While the literature on hidden Markov models has mainly focussed on finite-dimensional signals, infinite-dimensional cases have been previously considered in Beal et al. (2002). Here we take X_{t_n} to be a FV-DDP as in Definition 1, evaluated at p times 0 = t_0 < · · · < t_{p−1} = T. The sampling model is thus

(2.4)    Y^i_{t_n} | X_{t_n} = x ∼ iid x,    {X_t}_{t≥0} ∼ FV-DDP(α).

It follows that any two variables Y^i_{t_n}, Y^j_{t_m} are conditionally independent given X_{t_n} and X_{t_m}, with product distribution X_{t_n} × X_{t_m}.
In addition, similarly to mixing a DP with respect to its parameter measure as in Antoniak (1974), one could also consider randomizing the parameter α in (2.4), e.g. by letting α = α γ and γ ∼ π on an appropriate space.
In the following, we will denote for brevity Y_n := Y_{t_n} and Y_{0:T} := (Y_0, …, Y_{p−1}), where Y_i is the set of n_i observations collected at time t_i. We will sometimes refer to Y_{0:T} as the past values, since the inferential interest will be set at time T + t. We will also denote by (y*_1, …, y*_K) the K distinct values in Y_{0:T}, where K ≤ Σ_{i=0}^{p−1} n_i. In this framework, Papaspiliopoulos et al. (2016) showed that the conditional distribution of the RPM X_T, given Y_{0:T}, can be written as the finite mixture

(2.5)    L(X_T | Y_{0:T}) = Σ_{m∈M} w_m Π_{α + Σ_{j=1}^K m_j δ_{y*_j}}.

The weights w_m can be computed recursively as detailed in Papaspiliopoulos et al. (2016). In particular, M is a finite convex set of vectors of multiplicities m = (m_1, …, m_K) ∈ Z_+^K determined by Y_{0:T}, which identify the mixture components in (2.5) with strictly positive weight. We will call M the set of currently active indices. In particular, M is given by the points that lie between the counts of (y*_1, …, y*_K) in Y_T, which is the bottom node, and the counts of (y*_1, …, y*_K) in Y_{0:T}, which is the top node. For example, if T = 1, suppose we observe Y_0 = (y*_1, y*_2) for some values y*_1 ≠ y*_2, and Y_1 = Y_0, hence K = 2. Then the top node is (2, 2), since in Y_{0:1} there are 2 of each of (y*_1, y*_2), and the bottom node is (1, 1), which is the counts of (y*_1, y*_2) in Y_1. Cf. Figure 1. Note that observations with K = 3 distinct values would generate a 3-dimensional graph, with the origin (0, 0, 0) linked to 3 level-1 nodes (1, 0, 0), (0, 1, 0), (0, 0, 1), and so on. In general, each upper-level node is obtained by adding 1 to one of the lower-node coordinates.
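The set M of active indices is thus a lattice box that can be enumerated directly from the bottom and top nodes. A minimal Python sketch of this enumeration (illustrative; the recursive computation of the weights w_m is not shown):

```python
from itertools import product

def active_indices(bottom, top):
    """Enumerate the set M of multiplicity vectors lying between the
    bottom node (counts of the distinct values in Y_T) and the top
    node (counts in the full dataset Y_{0:T}), inclusive."""
    ranges = [range(b, u + 1) for b, u in zip(bottom, top)]
    return list(product(*ranges))
```

For the example above, `active_indices((1, 1), (2, 2))` returns the four nodes (1, 1), (1, 2), (2, 1), (2, 2), matching Figure 1.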
We note here that the presence of d_m(t) in (2.3) makes computations with FV processes in principle intractable, yielding in general infinite mixtures, corresponding to points m ∈ Z_+^K with positive weight, which are difficult to simulate from (cf. Jenkins and Spanò, 2017). It is then remarkable that, by conditioning on past data, one obtains conditional distributions for the signal given by finite mixtures as in (2.5).

Predictive distribution
In the above framework, we are primarily interested in predictive inference, which requires obtaining the predictive distribution of Y^1_{T+t}, …, Y^k_{T+t} | Y_{0:T}, that is, the marginal distribution of a k-sized sample drawn at time T + t, given data collected up to time T, when the random measures involved are integrated out. See Figure 2. Note that by virtue of the stationarity of the FV process, if X_0 ∼ Π_α, then P(Y_t ∈ A) = P_0(A) for any t ≥ 0. Note also that if one mixes model (2.4) by randomizing the parameter measure α = α_γ as mentioned above, the evaluation of the predictive distributions is of paramount importance for posterior computation. Indeed, one needs the distribution of γ | Y_{0:T}, and if for example γ has discrete support on Z_+ with probabilities {p_j, j ∈ Z_+}, then each posterior probability P(γ = j | Y_{0:T}) is proportional to p_j times the marginal likelihood of Y_{0:T} under α_j, which is in turn computed from the associated predictive distributions. Finally, define L(M) to be the set of all points in Z_+^K lying below the top node of M, i.e.,

(3.1)    L(M) := {n ∈ Z_+^K : n_j ≤ m̄_j, j = 1, …, K},    with m̄ the top node of M.

E.g., if M is given by the red nodes in Figure 1, then L(M) is given by all nodes shown in the figure.

Figure 2: The predictive problem depicted as a graphical model. The upper yellow nodes are nonobserved states of the infinite-dimensional signal, the lower green nodes are conditionally independent observed data whose distribution is determined by the signal, the light gray node is the object of interest.
Proposition 2. Assume (2.4), and let the law of X_T given data Y_{0:T} be as in (2.5), where the weights w_m have been computed recursively. Then, for any Borel set A of Y, the first observation at time T + t has distribution

(3.2)    P(Y^1_{T+t} ∈ A | Y_{0:T}) = Σ_{n∈L(M)} p_t(M, n) [ θ/(θ + |n|) P_0(A) + |n|/(θ + |n|) P_n(A) ],

and the (k + 1)st observation at time T + t, given the first k, has distribution

(3.3)    P(Y^{k+1}_{T+t} ∈ A | Y^{1:k}_{T+t}, Y_{0:T}) = Σ_{n∈L(M)} p^{(k)}_t(M, n) [ θ/(θ + |n| + k) P_0(A) + |n|/(θ + |n| + k) P_n(A) + k/(θ + |n| + k) P_k(A) ],

where

(3.4)    P_n := Σ_{j=1}^K (n_j/|n|) δ_{y*_j},    P_k := (1/k) Σ_{i=1}^k δ_{Y^i_{T+t}}.

Before discussing the details of the above statement, a heuristic read of (3.2) is that the first observation at time T + t is either a draw from the baseline distribution P_0, or a draw from a random subset of the past data points Y_{0:T}, identified by the latent variable n ∈ L(M). Given how L(M) is defined, Y_{T+t} can therefore be thought of as being drawn from a mixture of Pólya urns, each conditional on a different subset of the data, ranging from the full dataset to the empty set. Indeed, recall from Section 2 that the top node of M, hence of L(M) in (3.1), is the vector of multiplicities of the distinct values (y*_1, …, y*_K) contained in the entire dataset Y_{0:T}. The probability weights associated to each lower node n ∈ L(M) are determined by a death process on L(M) that differs from D_t in (2.2). In particular, this is a Markov process that jumps from node m to node m − e_i after an exponential amount of time with parameter m_i(θ + |m| − 1)/2, with e_i being the canonical vector in the ith direction. The weight associated with node n ∈ L(M) is then given by the probability that such death process is in n after time t, if started from any node in M. For example, if M is as in Figure 1, then the weight of the node (0, 2) is given by the probability that the death process is in (0, 2) after time t if started from any other node of M. Being a non-increasing process, the admissible starting nodes are (2, 2) and (1, 2). Figure 3 highlights these two admissible paths of the death process which land at node (0, 2).

Figure 3: With M as in Figure 1, the weight of the mixture component with index n = (0, 2), i.e., no atoms y*_1 and 2 atoms y*_2, is the sum of the probabilities of reaching node (0, 2) via the path starting from (1, 2) (left) and from (2, 2) (right).
The transition probabilities of this death process are

(3.5)    p_t(m, n) = p_{|m|,|n|}(t) HG(m − n; m, |m − n|),    n ≤ m,

where HG(i; m, |i|) is the multivariate hypergeometric probability function evaluated at i, namely

HG(i; m, |i|) = [ Π_{j=1}^{dim(m)} (m_j choose i_j) ] / (|m| choose |i|),

with dim(m) denoting the dimension of vector m, while p_{|m|,|n|}(t) is the probability of descending from level |m| to |n| in time t (see Lemma 1 in the Supplementary Material). Hence, in general, the probability of reaching node n ∈ L(M) from any node in M is

(3.6)    p_t(M, n) = Σ_{m∈M, m≥n} w_m p_t(m, n).

In conclusion, with probability p_t(M, n) the first draw at time T + t will be either from P_0, with probability θ/(θ + |n|), or a uniform sample from the subset of data identified by the multiplicity vector n.
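The multivariate hypergeometric factor in (3.5) admits a direct implementation. A minimal Python sketch (illustrative, with notation as above):

```python
from math import comb

def hypergeom_pmf(i, m):
    """Multivariate hypergeometric probability HG(i; m, |i|): the
    chance that removing |i| of the |m| balls uniformly at random
    removes exactly i_j balls of colour j, for each coordinate j."""
    num = 1
    for ij, mj in zip(i, m):
        num *= comb(mj, ij)                # prod_j C(m_j, i_j)
    return num / comb(sum(m), sum(i))      # divided by C(|m|, |i|)
```

For instance, with m = (2, 2) and two removals, the outcomes (2, 0), (1, 1) and (0, 2) have probabilities 1/6, 2/3 and 1/6, which sum to one as they must.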
Concerning the general case of the (k + 1)st observation at time T + t, trivial manipulations of (3.3) provide different interpretative angles. Rearranging the term in brackets one obtains

(θ + |n|)/(θ + |n| + k) [ θ/(θ + |n|) P_0(A) + |n|/(θ + |n|) P_n(A) ] + k/(θ + |n| + k) P_k(A),

which bears a clear structural resemblance to (1.3). Here

θ + |n|    and    (θ P_0 + |n| P_n)/(θ + |n|)

play the role of concentration parameter and baseline probability measure (i.e., the initial urn configuration), respectively. Thus (3.3) can be seen as a mixture of Pólya urns where the base measure has a randomised discrete component P_n. Unlike in (1.3), observations not drawn from the empirical measure P_k of the current sample can therefore be drawn either from P_0 or from the empirical measure P_n, where past observations are assigned multiplicities n with probability p^{(k)}_t(M, n). An alternative interpretation is obtained by developing the sum in (3.3) to obtain a single generalised Pólya urn, written in compact form as

(3.7)    P(Y^{k+1}_{T+t} ∈ A | Y^{1:k}_{T+t}, Y_{0:T}) = A_k P_0(A) + Σ_{j=1}^K C_{j,k} δ_{y*_j}(A) + (1 − A_k − Σ_{j=1}^K C_{j,k}) P_k(A),

where A is a Borel set of Y and

(3.8)    A_k := Σ_{n∈L(M)} p^{(k)}_t(M, n) θ/(θ + |n| + k),    C_{j,k} := Σ_{n∈L(M)} p^{(k)}_t(M, n) n_j/(θ + |n| + k).

In this case, the first observation is either from P_0 or a copy of a past value in Y_{0:T}, namely a draw from P_0 with probability A_0 or a copy of y*_j with probability C_{j,0}, while the (k + 1)st can also be a copy of one of the first k current observations Y^{1:k}_{T+t}. The pool of values to be copied is therefore given by past values Y_{0:T} and current, already sampled observations Y^{1:k}_{T+t}. After each draw, the weights associated to each node need to be updated according to the likelihood that the observation was generated by the associated mixture component, similarly to what is done for mixtures of Dirichlet processes. Specifically,

(3.9)    p^{(k+1)}_t(M, n) ∝ p(y^{k+1}_{T+t} | y^{1:k}_{T+t}, n) p^{(k)}_t(M, n),

where

(3.10)    p(y^{k+1}_{T+t} | y^{1:k}_{T+t}, n) := [ θ p_0(y^{k+1}_{T+t}) + Σ_{j=1}^K n_j 1{y^{k+1}_{T+t} = y*_j} + Σ_{i=1}^k 1{y^{k+1}_{T+t} = y^i_{T+t}} ] / (θ + |n| + k)

is the predictive distribution of the (k + 1)st observation given the previous k and conditional on n, and p_0 is the density of P_0 with respect to the Lebesgue or the counting measure.
As a byproduct of Proposition 2, we can evaluate the correlation between observations at different time points.
Proposition 3. For t, s > 0, let Y_t, Y_{t+s} be from (2.4). Then

Corr(Y_t, Y_{t+s}) = e^{−θs/2}/(θ + 1).

Unsurprisingly, the correlation decays to 0 as the lag s goes to infinity. Moreover,

lim_{s→0} Corr(Y_t, Y_{t+s}) = 1/(θ + 1),

which is the correlation of two observations from a DP as in (1.1).

Sampling from the predictive distribution
In order to make Proposition 2 useful in practice, we provide an explicit algorithm to sample from the predictive distribution (3.3), which can be useful per se or for approximating posterior quantities of interest. Exploiting (3.7) and the fact that (3.3) can be seen as a mixture of Pólya urns, we can see n ∈ Z K + as a latent variable whereby, given n, sampling proceeds very similarly to a usual Pólya urn.
Recalling that |n| = Σ_{j=1}^K n_j, a simple algorithm for the (k + 1)st observation would therefore be:
• sample n ∈ L(M) w.p. p^{(k)}_t(M, n);
• sample from P_0, P_n or P_k with probabilities proportional to θ, |n|, k respectively;
• update the weights p^{(k)}_t(M, ·) via (3.9).
A detailed pseudo-code description is provided in Algorithm 1.
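A single step of this exact scheme can be sketched in Python as follows (illustrative code under the above assumptions; `weights` maps nodes n to p^{(k)}_t(M, n), `ystar` lists the distinct past values, `current` the draws already made at time T + t, and `sample_p0` is a hypothetical sampler from P_0; the weight update (3.9) is omitted):

```python
import random

def next_observation(weights, ystar, current, theta, sample_p0, rng):
    """One draw from the mixture of Polya urns (3.3): pick a latent
    node n with the current mixture weights, then sample from P0, from
    the past values weighted by n, or from the current draws, with
    probabilities proportional to theta, |n| and k respectively."""
    nodes = list(weights)
    n = rng.choices(nodes, weights=[weights[v] for v in nodes])[0]
    size_n, k = sum(n), len(current)
    u = rng.random() * (theta + size_n + k)
    if u < theta:
        return sample_p0(rng)                      # new value from P0
    if u < theta + size_n:
        return rng.choices(ystar, weights=n)[0]    # copy of a past value
    return rng.choice(current)                     # copy of a current draw
```

A full sampler would alternate this step with the reweighting of `weights` via (3.9).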
A possible downside of the above sampling strategy is that when the set L(M) is large, updating all weights may be computationally demanding. Indeed, the size of the set L(M) is |L(M)| = Π_{j=1}^K (1 + m_j), where m_j is the multiplicity of y*_j in the data, which can grow considerably with the number of observations (cf. also Proposition 2.5 in Papaspiliopoulos and Ruggiero, 2014). It is however to be noted that, due to the properties of the death process that ultimately governs the time-dependent mixture weights, typically only a small portion of these will be significantly different from zero. Figure 4 illustrates this point by showing the nodes in {0, …, 50} with weight larger than 0.05 at different times, if at time 0 there is a unit mass at the node 50, when θ = 1. A deeper investigation of these aspects in a similar, but parametric, framework can be found in Kon Kam King et al.
Hence an approximate version of the above algorithm can be particularly useful to exploit this aspect.

Algorithm 1 Exact sampling of the (k + 1)st observation from (3.3)
1: Sample n ∈ L(M) w.p. p^{(k)}_t(M, n), n ∈ L(M)
2: Sample Y from P_0, P_n or P_k w.p. θ/(θ+|n|+k), |n|/(θ+|n|+k), k/(θ+|n|+k) respectively
3: Set y^{k+1}_{T+t} = Y
4: Update weights: for n ∈ L(M), set p^{(k+1)}_t(M, n) ∝ p(y^{k+1}_{T+t} | y^{1:k}_{T+t}, n) p^{(k)}_t(M, n), with p(y^{k+1}_{T+t} | y^{1:k}_{T+t}, n) as in (3.10)
5: Normalize the weights

We can therefore target a set M̃ ⊂ L(M) of nodes carrying most of the probability mass, estimated by simulating the death process with a large number of particles. The empirical frequencies of the particles' landing nodes will then provide an estimate of the weights p_t(M, n) in (3.2). Furthermore, the simulation of the multidimensional death process can be factorised into simulating a one-dimensional death process, which simply tracks the number of steps down the graph, and hypergeometric sampling for choosing the landing node within the reached level. A simple algorithm for obtaining N such draws is, for i = 1, …, N:
• draw m with probability w_m and set m = |m|;
• run a one-dimensional death process from m, and let n be the landing level after time t;
• draw n^(i) from the multivariate hypergeometric distribution HG(· ; m, n);
and return {n^(i), i = 1, …, N}. Note, in turn, that the simulation of the death process trajectories does not require evaluating the transition probabilities (3.5), which are prone to numerical instability, and can instead be straightforwardly set up in terms of successive exponential draws by repeating the following cycle: for i ≥ 1,
• draw Z_i ∼ Exp(λ_{m−i+1}), with λ_j = j(θ + j − 1)/2;
• if Σ_{j≤i} Z_j < t, decrease the current level by one; else return n = m − i + 1 and exit the cycle.
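The particle strategy just described can be condensed into a short Python sketch (an illustration under the stated assumptions, not the paper's implementation; `w` maps the nodes m ∈ M to their weights w_m):

```python
import random

def approximate_weights(w, theta, t, num_particles, rng):
    """Particle estimate of the time-t weights p_t(M, .): each particle
    starts from a node m drawn with the weights w, runs the
    one-dimensional death process for the level |m| over [0, t], then
    scatters the surviving level across coordinates by hypergeometric
    sampling; empirical landing frequencies estimate the weights."""
    counts = {}
    nodes = list(w)
    for _ in range(num_particles):
        m = rng.choices(nodes, weights=[w[v] for v in nodes])[0]
        level, elapsed = sum(m), 0.0
        while level > 0:                   # one-dimensional death process
            elapsed += rng.expovariate(level * (theta + level - 1) / 2.0)
            if elapsed > t:
                break
            level -= 1
        pool = [j for j, mj in enumerate(m) for _ in range(mj)]
        kept = rng.sample(pool, level)     # hypergeometric choice of survivors
        n = tuple(kept.count(j) for j in range(len(m)))
        counts[n] = counts.get(n, 0) + 1
    return {n: c / num_particles for n, c in counts.items()}
```

The estimated weights can then be restricted to the nodes with non-negligible frequency, yielding the reduced set M̃.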
Algorithm 2 outlines the pseudocode for sampling approximately from (3.3) according to this strategy.

Figure 5: The conveyor belt, with the current selection given by three dishes of type 1, one of type 2 and two empty slots from which previously available dishes have been removed.

Partition structure and Chinese restaurants with conveyor belt
A sample from (3.3) will clearly feature ties among the observations, since there are two discrete sources for the data, namely P_n and P_k. A fundamental task concerning sampling models with ties is to characterize the distributional properties of the induced random partition. We say that a random sample (Y_1, …, Y_n) induces the partition (n_1, …, n_K) if Σ_{i=1}^K n_i = n and grouping the observed values gives multiplicities (n_1, …, n_K). The distribution of a random partition generated by an exchangeable sequence is encoded in the so-called exchangeable partition probability function (EPPF), which for the Dirichlet process is given by (Pitman, 2006)

(3.11)    p(n_1, …, n_K) = θ^K/(θ)_{(n)} Π_{j=1}^K (n_j − 1)!,

where (θ)_{(n)} := θ(θ + 1) · · · (θ + n − 1). The sampling scheme on the space of partitions associated to the Dirichlet process is generally depicted through a Chinese restaurant process (Pitman, 2006): the first customer sits at a table and orders a dish from the menu P_0, while successive customers either sit at an existing table j, with probability proportional to its current occupancy n_j, and receive the same dish as the other occupants, or sit at an unoccupied table, with probability proportional to θ, and order from P_0. To account for random partitions induced by a FV-DDP, one can think of a conveyor belt typical of some Chinese restaurants, which delivers a non-constant selection of dishes that customers can choose to pick up. See Figure 5. In the context of (3.3), each new customer on day T + t faces a different configuration n of dishes available on the conveyor belt, determined by the weights p^{(k)}_t(M, n).
This depends on the following factors: (i) which dishes were most popular on day T: the greater the popularity, the higher their multiplicity in the nodes of M, hence the greater their average multiplicity on the conveyor on day T + t as determined by n; (ii) the removal of dishes that showed symptoms of food spoilage before the first customer arrives, as determined by the temporal component; (iii) previous customers' choices, as the kitchen readjusts the conveyor at each new customer by reinforcing the most popular dishes, as determined by the update (3.9).
Schematically, the Chinese restaurant process with conveyor belt proceeds as follows. The first customer at time T + t arrives at the restaurant, finds the configuration n on the conveyor belt, then picks a dish
• from the conveyor belt, with probability |n|/(θ + |n|), or
• from the menu P_0, with probability θ/(θ + |n|),
and sits at the first table. The kitchen then readjusts the offer on the conveyor belt based on the first customer's choice, through (3.9). The (k + 1)st customer arrives at the restaurant, finds a configuration n′ on the conveyor belt, then
• with probability m_j/(θ + |n′| + k) sits at table j and receives the same dish as the other occupants, m_j being the current table occupancy;
• otherwise picks a dish
  - from the conveyor belt, with probability |n′|/(θ + |n′| + k), or
  - from the menu P_0, with probability θ/(θ + |n′| + k),
and sits at a new table.
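The seating scheme above can be condensed into a short Python simulation (illustrative; the belt configuration is held fixed across customers, i.e. the kitchen's reweighting step (3.9) is deliberately omitted, and `sample_p0` is a hypothetical sampler from the menu P_0):

```python
import random

def conveyor_crp(num_customers, belt, theta, sample_p0, rng):
    """Chinese restaurant with conveyor belt: customer k+1 joins an
    occupied table with probability proportional to its occupancy, or
    opens a new table with a dish from the belt (prob. prop. to |n|)
    or from the menu P0 (prob. prop. to theta)."""
    tables = []                            # list of [dish, occupancy]
    for k in range(num_customers):
        belt_size = sum(belt.values())
        u = rng.random() * (theta + belt_size + k)
        acc = 0.0
        for table in tables:
            acc += table[1]
            if u < acc:
                table[1] += 1              # sit at an occupied table
                break
        else:
            if u < k + belt_size:          # dish picked from the conveyor belt
                dishes = list(belt)
                dish = rng.choices(dishes, weights=[belt[d] for d in dishes])[0]
            else:                          # new dish ordered from the menu P0
                dish = sample_p0(rng)
            tables.append([dish, 1])
    return tables
```

With an empty belt the scheme reduces to the usual Chinese restaurant process, consistent with the fact that an empty conveyor yields a classical Pólya urn.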
Note that the node 0 always has positive probability, in which case the conveyor belt is empty and (3.3) reduces to (1.3). Hence a customer facing the configuration n = 0 is entering a usual Chinese restaurant. The usual way of formally deriving the law of a random partition induced by n observations from X_{T+t} would be to compute the expectation, with respect to the law of X_{T+t} given Y_{0:T}, of the probability of all possible configurations of multiplicities (n_1, …, n_q), with q ≤ n and Σ_{h=1}^q n_h = n, irrespective of the values Y_i that generated them. This entails a considerable combinatorial complexity, particularly given by the fact that X_{T+t}, which has a representation similar to (2.5), is given by a mixture of Dirichlet processes whose base measures have partially shared discrete components.
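For comparison with the classical case, the Dirichlet-process EPPF (3.11) is immediate to evaluate. A minimal Python sketch (illustrative helper, not from the paper):

```python
from math import factorial

def dp_eppf(parts, theta):
    """EPPF of the Dirichlet process, (3.11):
    p(n_1, ..., n_K) = theta^K * prod_j (n_j - 1)! / (theta)_(n),
    with (theta)_(n) the ascending factorial of theta of order n."""
    n = sum(parts)
    rising = 1.0
    for i in range(n):
        rising *= theta + i                # ascending factorial (theta)_(n)
    num = theta ** len(parts)
    for nj in parts:
        num *= factorial(nj - 1)
    return num / rising
```

For two observations and θ = 1, the partitions (2) and (1, 1) each receive probability 1/2, as dictated by the urn (1.3).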
Alternatively, one can derive (3.11) from (1.3), better seen by rewriting P_k in terms of multiplicities of the distinct values, by assuming observations in the same group arrive sequentially, so that the first group has multiplicity n_1 with probability proportional to θ(n_1 − 1)!, the second has multiplicity n_2 with probability proportional to θ(n_2 − 1)!, and so on. Similarly, we can use the results in Proposition 2 to derive the explicit law of a partition induced by a sample from X_{T+t}. The resulting expression, given in Lemma 2 in the Supplementary Material, suffers from the combinatorial complexity due to the possibility of sampling values that start a group both from P_0 and from P_n, where n is itself random. Here instead we provide an algorithm for generating such random partitions, which can be used, for example, to study the posterior distribution of the number of clusters directly, i.e. without resorting to Proposition 2. Mimicking the argument above, at each step we need to choose whether to sample a new value, either from P_0 or from any of the P_n's, or to enlarge an existing group. From (3.8), the probability of drawing a new observation is therefore given by A_k + Σ_{i∈K} C_{i,k}, where K is the set of past observations still not present in the current sample. The probability of enlarging by one a group associated with the value y is obtained analogously from the remaining terms of (3.8). Algorithm 3 outlines the pseudocode for sampling a random partition according to this strategy.

Asymptotics
We investigate two asymptotic regimes for (3.3). The following Proposition shows that when t → ∞, the FV-DDP predictive distribution converges to the usual Pólya urn (1.3).

Proposition 4. Under the hypotheses of Proposition 2, as t → ∞ we have

P(Y^{k+1}_{T+t} ∈ · | Y^{1:k}_{T+t}, Y_{0:T}) TV−→ θ/(θ + k) P_0(·) + k/(θ + k) P_k(·),

with P_k as in (3.4).
Here TV−→ denotes convergence in total variation distance, and the statement is almost sure with respect to the probability measure induced by the FV model on the space of measure-valued temporal trajectories. A heuristic interpretation of the above result is that, when the lag between the last and the current data collection point diverges, the information given by past observations Y_{0:T} becomes obsolete, and sampling from (3.3) approximates sampling from the prior Pólya urn (1.3). This should be intuitive, as very old information, relative to the current inferential goals, should have a negligible effect.
Unsurprisingly, an analogous result can easily be proved for the distribution of the induced partition, which converges to the EPPF of the Dirichlet process as t → ∞. The proof follows similar lines to that of Proposition 4. In the conveyor belt metaphor, as t increases, all dishes on the conveyor belt are removed due to food spoilage before the next customer comes in.
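For intuition on the limiting regime, the standard Pólya urn predictive (1.3), which is what sampling from (3.3) approximates as t → ∞, can be sketched as follows; `base_sampler` is an illustrative stand-in for drawing from P_0.

```python
import random

def polya_urn(n, theta, base_sampler, seed=None):
    """Draw n values from the Polya urn predictive (1.3): the (k+1)-th
    draw is a new value from P_0 w.p. theta / (theta + k), and a copy
    of a uniformly chosen earlier draw otherwise.  In the t -> infinity
    regime of Proposition 4, the FV-DDP predictive reduces to this
    scheme, since the weights attached to past collection times vanish."""
    rng = random.Random(seed)
    draws = []
    for k in range(n):
        if rng.random() < theta / (theta + k):
            draws.append(base_sampler(rng))   # new value from P_0
        else:
            draws.append(rng.choice(draws))   # copy of a past draw
    return draws
```

With a very small θ nearly all draws are copies of the first one, while a very large θ makes almost every draw a fresh value from P_0.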
The following Proposition shows that when k → ∞ in (3.3), we recover the law of X_{T+t} given Y_0:T as the de Finetti measure.
Proposition 5. Under the hypotheses of Proposition 2, we have where P * ∼ L(X T +t |Y 0:T ).
Here P * is a random measure with the same distribution as the FV-DDP at time T + t given only the past information Y 0:T . Recall for comparison that the same type of limit for (1.3) yields where Π α is the de Finetti measure of the sequence and P * is sometimes called the directing random measure.

Illustration
We illustrate predictive inference using FV-DDPs, based on Proposition 2. Besides the usual prior specification for models based on the Dirichlet process, which concerns the choice of the total mass θ and of the baseline distribution P_0, here we can also introduce a parameter σ > 0 that controls the speed of the DDP. This acts as a time rescaling, whereby the data collection times t_i are rescaled to σt_i. This additional parameter provides extra flexibility for estimation, as it can be used to adapt the prior to the correct time scale of the underlying data generating process.
We fit the data by using an FV-DDP model as specified in (2.4), with the following prior specification. We consider two choices for P_0, a Negative Binomial with parameters (2, 0.5) and a Binomial with parameters (99, 0.3), which respectively concentrate most of their mass around small values and around the value 30. We consider a uniform prior on θ concentrated on the points {.5, 1, 1.5, . . . , 15}. A continuous prior could also be envisaged, at the cost of adding a Metropolis-Hastings step in the posterior simulation, which we avoid here for the sake of computational efficiency. Similarly, for σ we consider a uniform prior on the values {0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1.5}. The estimates are obtained by means of 500 replicates of (3.3) of 1000 observations each, using the approximate method outlined in Algorithm 2 with 10000 Monte Carlo iterates. We also compare the FV-DDP estimate with that obtained using the DDP proposed in Gutierrez et al. (2016). This is constructed from the stick-breaking representation (1.2) by letting the weights in (1.2) depend on time while keeping the locations Y_i fixed. We let the resulting DDP be the mixing measure in a time-dependent mixture of Poisson kernels, which provides additional flexibility to this model with respect to our proposal. Furthermore, we give the competitor model a considerable advantage by training it also with the data points collected at times 6 and 7, which provide information on the prediction targets, and by centering it on the Negative Binomial with parameters (2, 0.5), rather than on the above-mentioned mixture, as the former puts mass closer to where most mass of the true pmf lies. Figure 6 shows the results on one-step-ahead prediction with 15 collection times. The posterior of σ (not shown) concentrates most of its mass on the points 0.7 and 0.9, which leads to learning the correct time scale for prediction, resulting in an accurate estimate of the true pmf.
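Since the priors on θ and σ are supported on finite grids, the corresponding posterior updates amount to a simple reweighting over the grid. A minimal sketch, where `log_marginal` is a hypothetical stand-in for the FV-DDP marginal likelihood of the data at a given grid point:

```python
import math

def grid_posterior(grid, log_marginal):
    """Update a uniform prior on a finite grid of hyperparameter values
    (as done here for theta and sigma) by the marginal likelihood of
    the data; `log_marginal` is a hypothetical placeholder for the
    FV-DDP marginal likelihood evaluated at a grid point."""
    logs = {g: log_marginal(g) for g in grid}
    mx = max(logs.values())  # subtract the max to stabilise the exponentials
    weights = {g: math.exp(l - mx) for g, l in logs.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}
```

This is the discrete analogue of the Metropolis-Hastings step that a continuous prior would require.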
The credible intervals are quite wide, and a better precision may be achieved by increasing the number of time points at which the data are recorded.
We compare the previous results with those obtained by choosing σ via out-of-sample validation. This is done here using times 0 to 4 as training and time 5 as test, whereby for each σ ∈ {.0001, .001, .01, .1, 0.5, 1, 1.5} we compute the sum of absolute errors (SAE) between the FV-DDP posterior predictive mean and the true pmf. These are shown in Table 1, leading to the choice σ = .01. Figure 7 shows the results in this case for the one- and two-step-ahead predictions given only 5 data collection times. The true pmf is correctly predicted by the FV-DDP estimate even in this short horizon scenario, and the associated 95% pointwise credible intervals are significantly sharper than those in Figure 6, obtained with a longer horizon. The prediction based on the alternative DDP mixture does not correctly infer the target, leading to an associated normalised ℓ1 distance from the true pmf of 12.72% and 12.84%, compared to 4.95% and 4.90% for the FV-DDP prediction.
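The validation loop just described can be sketched as follows; `predict_pmf(sigma)` is a hypothetical stand-in for fitting the FV-DDP on the training times and returning the posterior predictive pmf at the test time, with pmfs represented as dictionaries over the support.

```python
def select_sigma(candidates, predict_pmf, true_pmf):
    """Out-of-sample choice of the speed parameter: for each candidate
    sigma, compare the posterior predictive pmf at the held-out time
    with the truth via the sum of absolute errors (SAE), and return the
    minimiser together with all scores."""
    def sae(p, q):
        # SAE over the union of the two supports
        support = set(p) | set(q)
        return sum(abs(p.get(y, 0.0) - q.get(y, 0.0)) for y in support)
    scores = {s: sae(predict_pmf(s), true_pmf) for s in candidates}
    return min(scores, key=scores.get), scores
```

In practice `true_pmf` would be replaced by the empirical pmf of the held-out data at the test time.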

Karnofsky score data
We consider the dataset hodg used in Klein and Moeschberger (1997), which contains records on the time to death or relapse and the Karnofsky score for 43 patients with a lymphoma disease. The Karnofsky score (KS) is an index attributed to individual patients, with higher values indicating a better prognosis.
In the framework of model (2.4), we take the times of death or relapse as collection times and let the KS of the survivors at each time be the data. We aim at predicting the future distribution of the KS among the patients who are still in the experiment at that time, which provides an indirect assessment of the effectiveness of the score in describing the patients' prognosis. We also include censored observations (patients leaving the experiment for reasons different from death or relapse), without having them trigger a collection time. The FV-DDP appears to be an ideal modelling tool in this framework, since it includes a probabilistic mechanism that accounts for the reduced number of observations across time points. We train the model up to 42, 108 and 406 days after the start of the experiment, and we make predictions 28, 112 and 144 days ahead, respectively. As regards the prior, we put a uniform distribution on the observed scores (note that new score values cannot appear along the experiment) and we uniformly randomise θ over {.5, 1, 1.5, . . . , 15}, analogously to Section 4.1. Given the results of the previous subsection for different approaches to selecting σ, here, after converting the lags to years, we proceed by selecting σ for each value of θ by maximising the probability that the death process makes the right number of transitions in the desired lapse of time. Some of the selected values σ_1, σ_2, σ_3 for the three different trainings, depending on θ, are shown in Table 3.

θ      .5       1        1.5      · · ·   29       29.5     30
σ_1    0.4947   0.4913   0.4885   · · ·   0.3235   0.3266   0.3228
σ_2    0.6059   0.6014   0.5696   · · ·   0.3684   0.3130   0.3361
σ_3    0.6149   0.6150   0.5789   · · ·   0.3063   0.3018   0.2901

Table 3: Choice of σ for some values of θ for the three trainings.

Figure 8 shows the three predictions of the score distribution. Coherently with intuition, as the experiment goes by, individuals with higher KS become predominant: from 70 to 230 days the predicted weight associated to a score of 90 increases by more than 10%, and similarly for 100. However, the distribution of the scores remains fairly stable, apart from the lowest values, meaning that the highest-scoring patients actually had much better prognoses, as shown by the third prediction.
These findings are consistent with the Kaplan-Meier estimate of the survival function, shown in the bottom right panel, which decreases rapidly between 70 and 230 days and flattens after that point, implying that the FV-DDP prediction adapted both to periods of quick change in the underlying distribution and to periods of relatively steady behaviour.

Discussion
We have derived the predictive distribution for the observations generated by a class of dependent Dirichlet processes driven by a Fleming-Viot diffusion model, which can be characterised as a time-dependent mixture of Pólya urns, and described the induced partition structure together with practical algorithms for exact and approximate sampling of these quantities. An upside of inducing the dynamics through a FV process is that one can implicitly exploit the rich and well understood underlying probabilistic structure in order to obtain manageable closed-form formulae for the quantities of interest. This ultimately relies on the duality with respect to Kingman's coalescent, which was first used for inferential purposes in Papaspiliopoulos and Ruggiero (2014). The approach we have described yields dependent RPMs with almost surely discrete realisations. While such a feature perfectly fits the specific illustrations we have discussed, it is not suited to drawing inferences with continuous data. An immediate and natural extension of the proposed model, which would accommodate continuous outcomes, is to consider dependent mixtures of continuous kernels, whereby the observation y from the RPM at time t becomes a latent variable acting as a parameter in a parametric kernel f(z|y). This approach would be in line with the extensive Bayesian literature on semi-parametric mixture models, which has largely used the DP or its various extensions as mixing measure. It remains, however, a non-trivial exercise to derive the corresponding formulae for prediction in this framework, which we leave for future investigation.
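The kernel extension mentioned above can be sketched in a few lines: draws y from the RPM act as latent locations of a mixture, smoothed by a parametric kernel f(z|y). The Gaussian kernel and the fixed bandwidth below are illustrative assumptions, not part of the model discussed in the paper.

```python
import math

def mixture_density(z, latent_draws, bandwidth=1.0):
    """Evaluate the density of a kernel mixture in which draws y from
    the RPM play the role of latent locations.  Here f(z|y) is taken to
    be a Gaussian kernel centred at y with a fixed bandwidth, purely
    for illustration."""
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    return sum(
        math.exp(-0.5 * ((z - y) / bandwidth) ** 2) / norm
        for y in latent_draws
    ) / len(latent_draws)
```

Averaging the kernel over the (discrete) draws yields an absolutely continuous density, which is what makes this construction suitable for continuous data.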
Lemma 2. Assume (2.4) and (2.5). Let I_1, . . . , I_q, with q ∈ N, be a partition of {1, . . . , m} and let m_j = |I_j|. Then the distribution of the partition of {1, . . . , m} induced at time T + t is with A_k, B_k and C_{i,k} as in (3.8), q̄ = min{q, K} and k_i = Σ_{j=1}^{i} m_j, while K = {1, . . . , K} and P_{i_l} is the empirical measure of the past values excluding those already observed, i.e. it is the empirical measure of the past data points indexed by the set K\{i_0, . . . , i_{l−1}}.
Proof. We want to compute P(P_m = {I_1, . . . , I_q}), where Σ_{j=1}^{q} m_j = m. For simplicity we consider the vector (k_0, k_1, k_2, . . . , k_q) = (0, m_1, m_1 + m_2, . . . , Σ_{j=1}^{q} m_j). Referring to (3.8), conditioning on the new observations, for any group m_{s+1} we have two possibilities:
• similarly to the DP case, with probability A_{k_s} the new value z* comes from P_0, and the next m_{s+1} − 1 = k_{s+1} − k_s − 1 observations will be equal to z* with the corresponding probability;
• with probability C_{i,k_s} the new value z* is y*_i, and the next m_{s+1} − 1 = k_{s+1} − k_s − 1 observations will be equal to z* with the corresponding probability.
In the second case we have to sum over all the different past observations i that were not taken into consideration in the previous groups. Then, by exchangeability, we can assume that the first j groups sample a new observation from P_n, i.e. from the past observations, while the others sample from P_0. Defining q̄ = min{q, K}, the probability of multiplicities m_1, . . . , m_q with j groups from P_n and q̄ − j from P_0, still conditioned on the new observations, is ∏_{s=0}^{j−1} Σ_{i_s ∈ K\{i_0,...,i_{s−1}}} C_{i_s,k_s}, where we have used i_0, i_1, . . . , i_j to highlight that we cannot consider the same past observation twice. The latter is in turn the same as considering the first j groups from P_n and the last q̄ − j from P_0, times the binomial coefficient (q̄ choose j). Then the thesis is obtained by summing over j and integrating out the new observations.
• Q_{t_i}, i = 1, . . . , N, is a probability measure absolutely continuous with respect to P_0.
As is well known, projecting a Dirichlet process Π_α on a partition A_0, . . . , A_k yields a k-dimensional Dirichlet density π_α with parameters (α(A_0), . . . , α(A_k)). Similarly, projecting a FV process yields a k-dimensional Wright-Fisher (WF) diffusion, which is reversible and stationary with respect to π_α (cf. Dawson (2010)). Consistently with (2.3), the transition density of the WF is given by: Then we can rewrite (5.1) as: Since B(s_{t_1}, ε) has strictly positive Lebesgue measure, we just need to show that the integrand is strictly positive for any (x_1, . . . , x_N) ∈ B(s_{t_1}, ε) × · · · × B(s_{t_N}, ε). Clearly π_α(x_1) > 0 for any x_1 ∈ B(s_{t_1}, ε). As for 1 < j ≤ N, we have: which completes the proof.

Proof of Proposition 2
Conditioning on the random measure X_{T+t} at time T + t yields where the second equality follows from the conditional independence of the observations given the signal; cf. (2.4). From (2.5), eq. (3.7) in Papaspiliopoulos et al. (2016) implies that X_{T+t} | Y_0, . . . , Y_T is a mixture of Dirichlet processes with weights p_t(M, n), whose associated predictive distribution Σ_n p_t(M, n) [θ/(θ + |n|) P_0 + |n|/(θ + |n|) P_n] is (3.2) with k = 0. When k > 0, using again (1.3) and the conjugacy property of mixtures of Dirichlet processes, the RHS of (5.2) reads
We need to compute E[Y_t Y_{t+s}] = ∫ y_t y_{t+s} P(dy_t, dy_{t+s}).