Perfect Sampling of the Posterior in the Hierarchical Pitman–Yor Process

The predictive probabilities of the hierarchical Pitman–Yor process are approximated through Monte Carlo algorithms that exploit the Chinese Restaurant Franchise (CRF) representation. However, in order to simulate the posterior distribution of the hierarchical Pitman–Yor process, a set of auxiliary variables representing the arrangement of customers at the tables of the CRF must be sampled through Markov chain Monte Carlo. This paper develops a perfect sampler for these latent variables employing ideas from the Propp–Wilson algorithm, and evaluates its average running time by extensive simulations. The simulations reveal a significant dependence of the running time on the parameters of the model, with sharp transitions. The algorithm is compared to simpler Gibbs sampling procedures, as well as a procedure for unbiased Monte Carlo estimation proposed by Glynn and Rhee. We illustrate its use with an example from microbial genomics.


Introduction
The hierarchical Pitman-Yor process was introduced in Teh et al. (2006) and Teh (2006) as a nonparametric prior model for a collection of discrete distributions with heavy tails. See Teh and Jordan (2010) for a review of hierarchical nonparametric priors. In Bayesian nonparametrics, theoretical developments and applications of the hierarchical Pitman-Yor process have been considered in language modeling (Teh, 2006; Huang and Renals, 2007; Wood et al., 2009), infinite hidden Markov modeling (Beal et al., 2002; Van Gael et al., 2008; Blunsom and Cohn, 2011), species sampling with multiple populations (Battiston et al., 2018; Bassetti et al., 2020), clustering, graphical modeling (Cremaschi et al., 2020), image segmentation (Sudderth and Jordan, 2009), and topic models (Sato and Nakagawa, 2010; Araki et al., 2012; Lindsey et al., 2012). In this paper we evaluate and compare multiple computational strategies for posterior inference under the hierarchical Pitman-Yor process prior. In particular, we discuss: i) a novel conditional Gibbs sampler; ii) an application of coupling from the past for perfect sampling; iii) the application of a more recent framework studied by Glynn and Rhee (2014) with the aim of deriving unbiased posterior estimates from Markov chain sampling. The latter approach can be viewed as an intermediate solution between Gibbs sampling and perfect simulation.

The Pitman-Yor process
The Pitman-Yor process (Pitman, 1995) is a discrete random probability measure whose distribution depends on two parameters (β, θ), with β ∈ [0, 1) and θ > 0, and a probability measure ν. The parameter β is usually referred to as the discount parameter, the parameter θ as the mass (or concentration) parameter, and the measure ν is the base measure. Among the various possible definitions, a simple and intuitive one follows from the so-called stick-breaking construction (Pitman, 2006). Let (v_i)_{i≥1} be independent random variables such that v_i has a Beta distribution with parameters (1 − β, θ + iβ), and define π_1 = v_1 and π_i = v_i ∏_{1≤j≤i−1} (1 − v_j) for i ≥ 2. Also, let (y_i)_{i≥1} be independent random variables with distribution ν. The discrete random probability measure μ = Σ_{i≥1} π_i δ_{y_i}, where δ_y is the point mass at y, is a Pitman-Yor process with discount β, mass θ and base distribution ν. For brevity we write μ ∼ PY(ν, θ, β). The Dirichlet process (Ferguson, 1973) is recovered as a special case of the Pitman-Yor process for β = 0. See Sethuraman (1994).
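As an illustration, the stick-breaking construction can be sketched in a few lines of Python. The function below is ours (not part of the paper's code) and uses the parametrization v_i ∼ Beta(1 − β, θ + iβ):

```python
import numpy as np

def py_stick_breaking(theta, beta, n_atoms, rng=None):
    """Sample the first n_atoms stick-breaking weights of a Pitman-Yor(theta, beta)
    process: v_i ~ Beta(1 - beta, theta + i*beta), pi_i = v_i * prod_{j<i}(1 - v_j)."""
    rng = np.random.default_rng() if rng is None else rng
    i = np.arange(1, n_atoms + 1)
    v = rng.beta(1.0 - beta, theta + i * beta)
    # pi_1 = v_1; subsequent weights multiply the remaining stick length.
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return pi
```

The returned weights sum to less than 1; the remaining mass belongs to the atoms not yet sampled, which is what makes retrospective sampling possible later in the paper.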
Discrete random probability measures play a fundamental role in Bayesian nonparametrics, since their laws act as nonparametric priors for discrete distributions (Lijoi and Prünster, 2010). The Pitman-Yor process is arguably one of the most popular priors. In particular, several Bayesian nonparametric models include a collection of random variables (X_1, . . . , X_n), either observed or latent, from a Pitman-Yor process. That is, X_i | μ ∼ μ independently for i = 1, . . . , n, with μ ∼ PY(ν, θ, β). Here (X_1, . . . , X_n) are the first n coordinates of an exchangeable sequence (X_i)_{i≥1} whose de Finetti measure (or prior) on the unknown distribution μ is the law of the Pitman-Yor process. Because of the discreteness of μ, a sample (X_1, . . . , X_n) from μ presents k ≤ n distinct values, labelled X*_1, X*_2, . . . , X*_k, with frequencies (n_1, . . . , n_k) such that n_i ≥ 1 and Σ_{1≤i≤k} n_i = n. Here X*_i is the i-th distinct value that appears in (X_i)_{i≥1}. For example, if X_1 = X_2 and X_3 ≠ X_1, then X*_1 = X_1 and X*_2 = X_3. The number of distinct values k increases with the sample size n; we can therefore use the notation k(n) when necessary to make the relationship explicit.
The predictive probabilities induced by the Pitman-Yor process (Pitman, 1995), i.e. the conditional distribution of X_{n+1} given (X_1, . . . , X_n), have the following explicit form for n ≥ 1:

P(X_{n+1} ∈ dx | X_1, . . . , X_n) = (θ + kβ)/(θ + n) ν(dx) + Σ_{i=1}^{k} (n_i − β)/(θ + n) δ_{X*_i}(dx).   (1.1)

Here, to simplify the presentation, we are assuming that ν is nonatomic, i.e. ν({x}) = 0 for every singleton {x}. The Chinese Restaurant Process (CRP) of Pitman (1995) gives an intuitive metaphorical description of the predictive probability (1.1).
In particular, consider a sequence of customers entering a restaurant and sitting at various tables, each table serving a single dish. Customers select their table through a reinforced urn scheme, with balls sequentially drawn. The probability of selecting a ball is proportional to its weight. Initially the restaurant is empty and the urn contains only a black ball with weight θ. Customers sit sequentially at the tables according to the following scheme. Whenever the black ball is selected, the next customer sits at a new table, and the dish served on this table is sampled from ν. In this case a new ball labelled by the table, with weight 1 − β, is added to the urn, and the weight of the black ball is increased by β. When, instead, a ball labelled by a table is selected, the next customer sits at the corresponding table, and the weight of the ball is increased by 1.
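The urn scheme above can be simulated directly. The following sketch is our own illustration, not the paper's implementation; after n customers, the black ball carries weight θ + βk and the ball of a table with n_t customers carries weight n_t − β:

```python
import numpy as np

def chinese_restaurant_process(n, theta, beta, rng=None):
    """Simulate table assignments for n customers under the Pitman-Yor urn
    scheme described in the text. Returns the list of table sizes."""
    rng = np.random.default_rng() if rng is None else rng
    tables = []  # tables[t] = number of customers at table t
    for _ in range(n):
        k = len(tables)
        # Ball weights: one per existing table (n_t - beta), plus the black
        # ball (theta + beta * k); total weight is theta + (customers so far).
        weights = np.array([c - beta for c in tables] + [theta + beta * k])
        choice = rng.choice(k + 1, p=weights / weights.sum())
        if choice == k:
            tables.append(1)      # black ball: open a new table, dish from nu
        else:
            tables[choice] += 1   # existing table gains a customer
    return tables
```

Note that the weights always sum to θ + n after n customers, matching the denominator in the predictive probability (1.1).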
The parameter β has a critical role since it tunes the rate at which new dishes are generated. Indeed, when a new dish (X_n ≠ X_i, i = 1, . . . , n − 1) is generated, a reinforcement equal to β is assigned to the continuous component ν (expression (1.1)), and it affects the probability of generating further new dishes. The larger β, the stronger the reinforcement mechanism, which is absent in the Dirichlet process (β = 0). In other words, the expected number of unique dishes k(n) increases with the value of the parameter β.

The hierarchical Pitman-Yor process
The hierarchical Pitman-Yor process is an extension of the Pitman-Yor process. It defines a nonparametric prior model for a collection of discrete distributions by means of a hierarchy of Pitman-Yor processes. Precisely, for any R ≥ 1, the hierarchical Pitman-Yor process is a collection of dependent discrete random probability measures (μ_1, . . . , μ_R) defined as

μ_r | μ ∼ PY(μ, θ, β) independently for r = 1, . . . , R,   μ ∼ PY(ν, θ_0, β_0).

The dependence among the random probability measures μ_r is induced by the common base measure μ. The hierarchical Dirichlet process (Teh et al., 2006) is recovered when β = 0 and β_0 = 0. As a generalization of the previous Bayesian nonparametric construction, let {(X_{r,1}, . . . , X_{r,n_r})}_{r=1,...,R} be samples from the hierarchical Pitman-Yor process, i.e., for i = 1, . . . , n_r and r = 1, . . . , R,

X_{r,i} | (μ_1, . . . , μ_R) ∼ μ_r independently.

By construction, conditional on the Pitman-Yor process μ, the μ_r's are independent Pitman-Yor processes. Here (X_{r,1}, . . . , X_{r,n_r}; r = 1, . . . , R) includes the first n_r coordinates, for r = 1, . . . , R, of a partially exchangeable array (X_{r,i})_{r≥1,i≥1} whose de Finetti (or prior) measure on the unknown (μ_1, . . . , μ_R) is the law of the hierarchical Pitman-Yor process.
The predictive probabilities induced by the hierarchical Pitman-Yor process can be described by means of the Chinese Restaurant Franchise (CRF), which extends the CRP. In particular, each sample (X_{r,1}, . . . , X_{r,n_r}) identifies the dishes of the n_r customers in restaurant r, for r = 1, . . . , R. Customers sitting at the same table eat the same dish and, due to the discreteness of the common base measure μ, the same dish can be served at multiple tables within the same restaurant and in different restaurants. The assignment of the n_r customers to tables follows the CRP, independently in the R restaurants. Conditionally on μ, each table in the R restaurants is assigned a dish sampled from μ. We denote by I the number of distinct dishes, i.e. distinct values in the finite array (X_{r,1}, . . . , X_{r,n_r}; r = 1, . . . , R), labelled by {X*_1, . . . , X*_I}, across the R restaurants, and by n_{r,i} ≥ 0 the number of customers in restaurant r eating dish X*_i. We can arbitrarily assign the indices i = 1, . . . , I to the I distinct dishes. Furthermore, we introduce a variable k_{r,i} for the number of tables in restaurant r serving the dish X*_i. The distribution of X_{r,n_r+1}, for any index 1 ≤ r ≤ R, conditioning on (i) the samples (X_{r,1}, . . . , X_{r,n_r}; r = 1, . . . , R), (ii) the number of tables in restaurant r occupied by the first n_r customers, k_{r,·} = Σ_i k_{r,i}, (iii) the total number of tables occupied in the CRF, k = Σ_{r=1}^{R} k_{r,·}, and (iv) the counts k_{·,i} = Σ_r k_{r,i}, can be represented as

P(X_{r,n_r+1} ∈ dx | · ) = Σ_{i=1}^{I} (n_{r,i} − β k_{r,i})/(θ + n_r) δ_{X*_i}(dx) + (θ + β k_{r,·})/(θ + n_r) [ Σ_{i=1}^{I} (k_{·,i} − β_0)/(θ_0 + k) δ_{X*_i}(dx) + (θ_0 + β_0 I)/(θ_0 + k) ν(dx) ].   (1.2)

We refer to the recent work of Bassetti et al. (2020) for a detailed study of the predictive probability (1.2) and related results. Teh et al. (2006) proposed a strategy for sampling from the posterior distribution of the hierarchical Dirichlet process and from the posterior distribution of the hierarchical Pitman-Yor process.
In particular, they proposed a collapsed Gibbs sampler, which marginalizes out the random probability measures μ and (μ_1, . . . , μ_R). Let t_{r,j} be the index of the table assigned to customer j in restaurant r. The target of the Gibbs sampler is the joint distribution of {(t_{r,1}, . . . , t_{r,n_r})}_{r=1,...,R} conditional on the observations X = {(X_{r,1}, . . . , X_{r,n_r})}_{r=1,...,R}. Teh et al. (2006) also proposed a conditional Gibbs sampler which augments the set of latent variables with the measure μ. Van Gael et al. (2008) and Papaspiliopoulos and Roberts (2008) define similar conditional samplers which augment the sample space with the random probability measures μ and (μ_1, . . . , μ_R), in the special case β = β_0 = 0.

Contributions
In this paper we discuss the problem of posterior sampling from the hierarchical Pitman-Yor process. In particular, we employ an augmentation strategy which specifies the state of the urns in the CRF. We start by describing three Gibbs sampling algorithms: two of them are based on the collapsed and conditional samplers of Teh et al. (2006), and the third is a novel Gibbs algorithm which we call doubly conditional, as it employs a larger augmentation. Then, as an alternative to Gibbs sampling, we propose a perfect sampling algorithm based on the coupling from the past method of Propp and Wilson (1996). Perfect sampling allows one to perform posterior inference and to compute Monte Carlo approximations of predictive probabilities. Finally, as an intermediate solution between the Gibbs sampling algorithms and perfect simulation, we consider the application of a procedure for unbiased posterior estimation introduced by Glynn and Rhee (2014).
We present an evaluation of the convergence of the proposed Gibbs samplers and discuss the running time of the perfect sampling procedure. While perfect sampling yields exact samples from the posterior distribution, the algorithm has a random stopping time and it may not be practical in certain situations. An extensive simulation study is presented, revealing a dependence of the running time on the parameters (θ, β) and (θ 0 , β 0 ) of the hierarchical Pitman-Yor process prior, with some parameter settings making it prohibitively time consuming to draw samples from the posterior. In contrast, the simulation-based evaluations of the doubly conditional Gibbs sampler suggest that the computing time to approximate the posterior does not vary substantially across parameterizations of the prior.
The paper is structured as follows. Section 2 contains the main contributions of the paper: three Gibbs sampling algorithms, the perfect sampling algorithm, and the procedure to compute unbiased approximations of posterior estimates. In Section 3 we present the simulation study for evaluating the convergence of the Gibbs samplers and the running time of the perfect sampling algorithm. Section 4 contains an application in microbial genomics. Section 5 concludes the paper with a discussion of related problems and open questions.

Posterior sampling from the hierarchical Pitman-Yor process
In the CRF metaphor, we have that: i) I denotes the number of dishes across the R samples X = {(X_{r,1}, . . . , X_{r,n_r})}_{r=1,...,R}; ii) n_{r,i} ≥ 0 denotes the number of customers in restaurant r eating dish i; iii) k_{r,i} ∈ {1, . . . , n_{r,i}} denotes the number of tables serving dish i in restaurant r, with k_{r,i} = 0 if n_{r,i} = 0. Moreover, we use the notation k_{·,i} = Σ_r k_{r,i}, k_{r,·} = Σ_i k_{r,i}, and k = Σ_{r,i} k_{r,i}. We denote by (x)_{n↑a} the generalized factorial of x ≥ 0 of order n ∈ N and increment a ≥ 0, that is (x)_{n↑a} = ∏_{0≤i≤n−1} (x + ai), with the proviso (x)_{0↑a} = 1.
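As a small illustration, the generalized factorial admits a direct implementation; the function name gen_factorial is ours:

```python
def gen_factorial(x, n, a):
    """Generalized factorial (x)_{n up-arrow a} = prod_{i=0}^{n-1} (x + a*i),
    with the convention (x)_{0 up-arrow a} = 1."""
    out = 1.0
    for i in range(n):
        out *= x + a * i
    return out
```

For a = 0 this reduces to the power x^n, and for (x, a) = (1, 1) it reduces to the ordinary factorial n!.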
We observe that, according to the predictive probabilities (1.2), the summaries n = {(n_{r,1}, . . . , n_{r,I})}_{r=1,...,R} and k = {(k_{r,1}, . . . , k_{r,I})}_{r=1,...,R} allow one to straightforwardly predict the dish assigned to future customers of the CRF. In particular, the joint probability in the CRF of R partitions of customers into groups with sizes n allocated at k tables is given in equation (2.1). Precisely, the expression in (2.1) is obtained by summing, for every pair (r, i), over all possible times at which new tables in restaurant r serving dish i are created. Then the only factors depending on this order are gathered in f(n_{r,i}, k_{r,i}). The triangular array f(n, s) defined in equation (2.2) satisfies the recursion in equation (2.3) for n > s, with f(n, n) = 1. This fact can be verified by induction using the definition of the function f in equation (2.2). We remark that the sampling algorithms in this section require memorizing the triangular array defined in equation (2.2), and our implementations compute it through the recursion (2.3). We wish to sample the conditional distribution p(k | n).

Gibbs sampling algorithms
Markov chain Monte Carlo algorithms for the hierarchical Dirichlet process have been designed for models in which the dishes in the CRF analogy are not observed. In particular, dishes are assumed to be latent variables which determine, for example, component membership in a mixture model. Here, we focus on a different setting where the dishes are observed, which makes it difficult to directly compare with algorithms previously proposed in the literature. However, Markov chains in both cases can be characterized as either collapsed or conditional samplers, depending on whether the algorithm augments the sample space by the latent probability measure μ or not. We first define a Gibbs sampling algorithm of each type, and then introduce a third Gibbs sampler with a larger augmentation.

Collapsed Gibbs sampler
One of the Gibbs sampling algorithms originally proposed by Teh et al. (2006) augments the sample space with dish assignments, which we assume to be observed, and a variable capturing the number of tables for each dish in each restaurant, which is equivalent to the array k defined in the previous section. In particular, each iteration of the algorithm samples the distribution p(k_{r,i} | {k_{r′,i′}; (r′, i′) ≠ (r, i)}, n) for each entry k_{r,i} in the array k. This conditional distribution is known explicitly up to a normalizing constant.

Conditional Gibbs sampler
A different algorithm discussed by Teh et al. (2006) augments the space with the latent probability measure μ. The key fact is that the distribution of μ given k can be sampled retrospectively (Papaspiliopoulos and Roberts, 2008). That is, one only needs to sample the mass assigned to the atoms corresponding to the observed dishes in the probability measure μ, which we denote by μ(1), . . . , μ(I), and this can be done through the stick-breaking representation of the Dirichlet process (Sethuraman, 1994). The same strategy can be adopted in the context of the hierarchical Pitman-Yor process. In particular, conditionally on (k, n), the vector (μ(1), . . . , μ(I)) is distributed as the product BD, where B ∼ Beta(k − Iβ_0, θ_0 + Iβ_0) and D ∼ Dirichlet(k_{·,1} − β_0, . . . , k_{·,I} − β_0) are independent. Then, conditionally on (μ, n), each row of k is independent with an explicit distribution. It is worth pointing out that, as the support of this distribution could be very large, it is convenient to sample one entry k_{r,i} at a time conditional on all the others.
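The retrospective sampling step described above can be sketched as follows. This is an illustrative implementation of the stated Beta–Dirichlet decomposition; the function name and interface are our choices:

```python
import numpy as np

def sample_mu_atoms(k_dot, theta0, beta0, rng=None):
    """Sample (mu(1), ..., mu(I)) given table counts, as in the text:
    B ~ Beta(k - I*beta0, theta0 + I*beta0),
    D ~ Dirichlet(k_{.,1} - beta0, ..., k_{.,I} - beta0),
    and (mu(1), ..., mu(I)) = B * D."""
    rng = np.random.default_rng() if rng is None else rng
    k_dot = np.asarray(k_dot, dtype=float)  # k_dot[i] = k_{.,i}, tables serving dish i
    I, k = len(k_dot), k_dot.sum()
    B = rng.beta(k - I * beta0, theta0 + I * beta0)  # total mass on observed dishes
    D = rng.dirichlet(k_dot - beta0)                 # relative masses of the dishes
    return B * D
```

The remaining mass 1 − B corresponds to atoms of μ carrying dishes not yet observed, which need not be instantiated.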

Doubly conditional Gibbs sampler
We introduce a novel Gibbs sampling algorithm, which augments the sample space by two vectors, one indexed by restaurants and one indexed by dishes. To eliminate the rising factorials in equation (2.4) for our target distribution p(k | n), we apply a standard augmentation trick for normalized completely random probability measures (Favaro and Teh, 2013). In particular, we introduce independent random variables as in equation (2.5). The random variables k have the same support in (2.5) as in (2.4). Moreover, by integrating the right-hand side of (2.5) with respect to g = (g_1, . . . , g_R) and d = (d_1, . . . , d_I), we recover the target distribution p(k | n). This line of reasoning leads to the following Gibbs sampling algorithm for the distribution q(k, d, g) and for our target distribution p(k | n). First, conditionally on k, the vectors d and g in (2.5) are Dirichlet distributed and Gamma distributed, respectively; moreover, they are independent. Second, conditionally on d and g, the variables k_{r,i} in (2.5) are independent. In particular, since the k_{r,i} are discrete variables with finite support, it is straightforward to sample their conditional distributions.

Perfect sampler
We start by introducing a collection of independent random variables (U, E, E′, G_0, G) which will be used to define multiple coupled Markov chains. Let U_{m,r,i} be independent random variables, each Uniform on (0, 1), for m ∈ Z, r = 1, . . . , R and i = 1, . . . , I. The index m will indicate time in the coupled Markov chains, while indices i and r indicate dishes and restaurants, as in previous sections. The random variables U_{m,r,i} are combined with the variables E_{m,ℓ,h} and E′_{m,ℓ,h}, which are independent and have Exponential distribution with parameter 1, for m, ℓ, h ∈ Z. Moreover, let G_{0,ℓ,m} be independent Gamma random variables with parameters (1 − β_0, 1), for ℓ, m ∈ Z, and let G_{ℓ,m} be independent Gamma random variables with parameters (θ/β, 1), for ℓ, m ∈ Z. For any a > 0, we define the distribution p_a(k), whose support is the same as in equation (2.4) and whose normalization constant is Z(a). We define an augmentation of this distribution, similar to (2.5), in equation (2.7), where i) p_G indicates the distribution of independent Gamma random variables with parameters (θ/β, 1); ii) p_{G_0} indicates the distribution of independent Gamma random variables with parameters (1 − β_0, 1).
Integrating (2.7) with respect to g and g_0 yields p_a(k). In (2.7) the array entries k_{r,i} are independent conditional on G and G_0, which leads to a natural Gibbs sampler similar to the doubly conditional algorithm of the previous section. We will apply coupling from the past to sample from p_a(k) exactly, and subsequently extend the procedure in order to sample p(k | n).
We construct a coupling of Gibbs samplers in which each Markov transition is generated from the set of random numbers (U, E, E′, G_0, G). Let φ^{g_0,g}_{a,r,i} be the inverse marginal cumulative distribution function of k_{r,i} conditional on (g_0, g). The coupled chains are defined through the recursions in equations (2.8)–(2.10), where the k^{init}_{r,i} are fixed integers. In this construction, the sequences (k^{j,m})_m, with k^{j,m} = {k^{j,m}_{r,i}; r = 1, . . . , R, i = 1, . . . , I}, defined by equations (2.8), (2.9) and (2.10), for distinct values of j, are copies of the same Gibbs Markov chain with stationary distribution p_a(k). In particular, the Markov chains are coupled through the use of common random numbers (U, E, E′, G_0, G). Furthermore, this coupling is monotone with respect to the following partial order: k ⪯ k̄ if and only if k_{r,i} ≥ k̄_{r,i} for all r = 1, . . . , R, and i = 1, . . . , I. That is, k^{j,m} ⪯ k^{j′,m} implies k^{j,m+1} ⪯ k^{j′,m+1} with probability 1.
Following the usual coupling from the past construction, we define two arrays k^{j,m} and k̄^{j,m}, for m ≥ j, through the recursive equations (2.8)–(2.10) where, in the first case, the initial state k^{init} is set to the maximum of the partial order, k^{init}_{r,i} = n_{r,i} for all r, i, and in the second case to the minimum, k^{init}_{r,i} = 1(n_{r,i} > 0) for all r, i. Theorem A.1 guarantees that lim_{j→∞} k^{−j,0} = lim_{j→∞} k̄^{−j,0} almost surely, and the distribution of this limit is p_a(k). This theorem is based on the coupling from the past approach introduced in the work of Propp and Wilson (1996). We shall denote by k^{(a)} = {k^{(a)}_{r,i}; r = 1, . . . , R, i = 1, . . . , I} the limiting random element with distribution p_a(k).
In the described construction k^{(a)} is a function of (U, E, E′, G_0, G), and the realizations are monotone in a with respect to the partial order: if a′ ≥ a, then k^{(a′)} ⪯ k^{(a)}. Consider equation (2.11): its right-hand side is strictly increasing in H and its left-hand side is a.s. monotone decreasing in H, so a solution, when it exists, is unique. Define the random variable H as the solution to this equation, when it exists, with H = −1 when it does not. Then, the distribution of k^{(E**_H)} conditional on the event that equation (2.11) has a solution matches the target posterior p(k | n). Algorithm 1 provides a schematic view of a perfect sampler for the distribution p(k | n). The pseudocode outputs one sample of the posterior distribution. Note that the inner while loop terminates when we find an integer H which satisfies equation (2.11), in which case we return the array k^{(E**_H)}, or alternatively, when we can verify that the equation has no integer solution, in which case we restart the outer loop. The routine called in Line 7 computes, for the given value of a, the arrays k^{−j,0} and k̄^{−j,0} for increasing values of j ∈ {2, 2^2, 2^3, . . . }, until they become equal and therefore converge to k^{(a)}. The computational cost of each iteration of the inner while loop is O(R I j_max), where R is the number of restaurants, I is the number of dishes, and j_max is the number of steps of the coupling required in Line 7. The next section will quantify this computational burden by simulations. Line 3 is purely schematic, as E, E′, U, G_0, G are infinite arrays. These pseudorandom numbers may be memorized or recomputed from a random seed as needed, which ensures that memory requirements do not increase with j_max.
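The doubling scheme used in Line 7 can be illustrated on a toy monotone chain. This sketch is not the hierarchical Pitman-Yor sampler itself; monotone_update merely stands in for the monotone Gibbs update, and the state space {0, . . . , N} stands in for the support of k:

```python
import numpy as np

def monotone_update(x, u, N):
    """Toy monotone transition on {0, ..., N} driven by a uniform u:
    stands in for the Gibbs update applied with common random numbers."""
    if u < 0.4 and x < N:
        return x + 1
    if u > 0.6 and x > 0:
        return x - 1
    return x

def cftp(N=10, seed=0):
    """Coupling from the past with doubling: run lower/upper chains from time
    -j with shared randomness, doubling j until they coalesce at time 0."""
    rng = np.random.default_rng(seed)
    u = []  # u[m] drives the transition into time -m; entries are reused across j
    j = 1
    while True:
        while len(u) < j:          # extend the shared randomness further into the past
            u.append(rng.uniform())
        lo, hi = 0, N              # minimal and maximal initial states
        for m in range(j - 1, -1, -1):  # from time -j forward to time 0
            lo = monotone_update(lo, u[m], N)
            hi = monotone_update(hi, u[m], N)
        if lo == hi:               # coalescence: the common value is an exact draw
            return lo
        j *= 2
```

The essential points are that the same random numbers u[m] are reused for the same time slots as j grows, and that monotonicity lets the two extreme chains bound every other trajectory.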

Unbiased estimation through Markov chain couplings
The inference objective in a Bayesian analysis is usually a posterior moment of the form h* = E_{k|n}[h(k, n)] for a function h of interest. In what follows we will assume that the function h is bounded; examples include h(k, n) = k_{r,i} and predictive probabilities in species sampling problems with multiple populations. An MCMC estimator derived from a Markov chain k_1, . . . , k_M is generally biased; the bias is a function of the starting point and the length of the chain. Importance sampling can also be applied to estimate posterior moments, but estimates will be biased if the posterior density is only known up to a constant. On the other hand, if the sequence k_1, . . . , k_M consists of perfect samples of the posterior of k given n, then the standard Monte Carlo estimator is unbiased. Glynn and Rhee (2014) proposed an alternative strategy to estimate posterior expectations without bias. Concretely, let (h_m)_{m≥0} be a sequence of estimators for h* which is asymptotically unbiased, that is, E(h_m) → h* as m → ∞. The Glynn-Rhee procedure constructs unbiased estimators from this sequence. While this yields one of the desirable properties of perfect sampling, the unbiased estimators produced by this method need not be bounded, even when the function h is, and indeed constructing bounded unbiased estimators is not always possible (Jacob and Thiery, 2015).
To define the estimator, let Δ_m for m ≥ 0 be random variables satisfying E(Δ_m) = E(h_m − h_{m−1}), with the convention that h_{−1} = 0. Let T be a random positive integer independent of (Δ_m)_{m≥0} with p(T ≥ m) > 0 for all m ≥ 0. Then, by Fubini's theorem, the estimator ĥ_GR = Σ_{m=0}^{T} Δ_m / p(T ≥ m) in equation (2.12) satisfies E(ĥ_GR) = Σ_{m≥0} E(Δ_m) = h*, provided that Σ_{m≥0} E(|Δ_m|)/p(T ≥ m) < ∞. This last condition implies that Δ_m → 0 weakly. The Glynn-Rhee estimator ĥ_GR has finite variance if the variables Δ_m tend to vanish quickly. Previous work on the outlined method includes two constructions of the sequence (Δ_m)_{m≥0}, based on Markov chain couplings. For both cases Glynn and Rhee (2014) discussed sufficient conditions to verify that ĥ_GR has finite variance. The recent work of Jacob et al. (2020) reviews and extends the discussion on these sufficient conditions. See also Theorem 1 in Rhee and Glynn (2015) for a self-contained analysis. The first construction of Glynn and Rhee (2014), given in equation (2.13), sets Δ_m as a difference of h along two chains, where (k_m)_{m≥0} and (k̄_m)_{m≥0} are identically distributed, coupled Markov chains with stationary distribution p(k | n), in which k̄_{m+1} − k_m decreases quickly with high probability as m → ∞. The requirements on this coupling are less stringent than the monotonicity required by the Propp-Wilson algorithm. Thus, there is significant flexibility in how to define it. By way of example, we propose using two copies of the doubly conditional Gibbs sampler introduced in Section 2.1. This construction is used in Section 4 to produce unbiased posterior moments in an example from microbial genomics.
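The debiasing mechanism can be illustrated on a toy example (not the hierarchical Pitman-Yor chains themselves): an estimator of the form Σ_{m=0}^{T} Δ_m / p(T ≥ m) is computed for a user-supplied sequence Δ_m and survival function p(T ≥ m). The names and interface below are ours:

```python
import numpy as np

def glynn_rhee(delta_fn, survival, rng=None):
    """Generic Glynn-Rhee style debiased estimator.

    delta_fn(m, rng) returns a realization of Delta_m;
    survival(m) returns p(T >= m), decreasing with survival(0) = 1."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample T by inversion: T >= m exactly when U < survival(m).
    U = rng.uniform()
    T = 0
    while U < survival(T + 1):
        T += 1
    return sum(delta_fn(m, rng) / survival(m) for m in range(T + 1))

# Toy sequence: h_m = sum_{j<=m} 2^{-j} -> h* = 1, so Delta_m = 2^{-m} (m >= 1).
def delta_toy(m, rng):
    return 0.0 if m == 0 else 0.5 ** m

survival_toy = lambda m: 0.5 ** m  # geometric tails for T
```

Averaging many independent copies of glynn_rhee(delta_toy, survival_toy) recovers h* = 1 without any burn-in or truncation bias.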
In particular, let k̄_0 = k_0 be a fixed array. Each step of the coupled Gibbs samplers involves sampling the variables g, d, and k (see expression (2.6)). We define a Markov coupling similar to the one used to define the perfect sampler, in which the same set of random variables is used to define two Markov chains (k_m)_{m≥0} and (k̄_m)_{m≥0}. Each transition k_{m−1} → k_m is coupled to the transition k̄_m → k̄_{m+1} for m ≥ 1, in such a way that if k_j = k̄_{j+1}, then k_m = k̄_{m+1} for all m > j. In each of these transitions, conditional sampling of g and d requires generating Gamma and Dirichlet random variables (expression (2.6)), and conditional on g and d the entries of k are generated using the inverse cumulative distribution function. In the construction of the Markov chains, at each transition time, Gamma random variables with different parameters are coupled by using shared arrays of exponential random variables; that is, we obtain Gamma distributed variables by summation of independent exponential random variables. The Dirichlet random variables are coupled similarly, by generating coupled Gamma random variables and normalizing them; here we exploit the fact that a Dirichlet vector can be sampled by normalizing independent Gamma random variables. Finally, at each transition, the discrete variables k_{r,i} and k̄_{r,i} are coupled by applying the conditional inverse cumulative distribution functions to the same Uniform(0, 1) variables.
As the support of the latent array k is finite, the coupling satisfies p(k_m = k̄_{m+1} | k_{m−1}, k̄_m) > c_1, for some c_1 > 0. The inequality follows from the fact that for any configuration of (k_{m−1}, k̄_m) the conditional probability that both k_m and k̄_{m+1} take the minimal value is strictly positive. Therefore, the coupling coalesces at a geometric rate and, as h is assumed to be bounded, we can deduce that E(|Δ_m|^2) < c_2 c_3^m, for some c_2 > 0 and c_3 ∈ (0, 1). The same argument implies that E(|Δ_{m_1} Δ_{m_2}|) < c_2 c_3^{max(m_1, m_2)}. These inequalities will be applied in the sequel to show that, for some choices of the random variable T, the estimator ĥ_GR has finite variance and the computing time has finite expectation.
The second construction in Glynn and Rhee (2014) employs a coupling from the past of Markov chain samplers for the target posterior p(k | n). To be precise, let π be a random mapping from the state space of the array k to itself, such that if k′ ∼ p(k | n) and π is independent of k′, then π(k′) ∼ p(k | n). Letting (π_m)_{m≥1} be i.i.d. copies of π, define k_m = π_1 ∘ π_2 ∘ · · · ∘ π_m(k_0), for k_0 fixed. We can define a Glynn-Rhee estimator through equation (2.12), with Δ_m as in equation (2.14). By way of example, we could use the doubly conditional Gibbs sampler (Subsection 2.1) to specify π. The random function π_m, which maps points from the support of k into the same finite set, can be specified using arrays of Gamma and uniform random variables. The definition of π_m is nearly identical to the transitions in the inner loop of Algorithm 1 (Line 6) where, given the arrays of Gamma and uniform random variables indexed by a transition time m, the transitions from one point in the support of k to the subsequent one become deterministic. In other words, arrays of Gamma and uniform random variables indexed by transition times (m = 1, 2, . . .) are used to define π_m and the grand coupling (π_1 ∘ π_2 ∘ · · · ∘ π_m; m ≥ 1).
As in the first coupling, in this second construction the conditional probability satisfies p(∩_{j≥0} {k_{m+j} = k_{m+j+1}} | k_0, . . . , k_{m−1}) > c_1 for some strictly positive c_1. This follows from the fact that, with strictly positive probability, the random function π_m can take the same minimal (or maximal) value at every point in its domain. In this case the inequality implies that E(|Δ_m|^2) < c_2 c_3^m and E(|Δ_{m_1} Δ_{m_2}|) < c_2 c_3^{max(m_1, m_2)}, for some c_2 > 0 and c_3 ∈ (0, 1).
For both of the couplings defined above, we can directly bound the variance of the estimator ĥ_GR, and for several choices of p(T ≥ m), for example polynomial tails p(T ≥ m) ∝ m^{−c_4}, the right-hand side of the bound is finite. If the tails are exponential, i.e. p(T ≥ m) ∝ c_4^m, then the constant c_4 must be large enough for the bound to be finite. As explained in Glynn and Rhee (2014), the required constant is determined by the geometric rate at which the coupling coalesces.
The simulations in Section 4 are limited to the two constructions in equations (2.13) and (2.14), both tailored to the doubly conditional Gibbs sampler in Subsection 2.1. However, the other two Gibbs samplers presented in Section 2 can be coupled in a similar way, and there is great flexibility in how to specify the dependent Markov chains, as monotonicity is not required. This is a key difference between perfect sampling and the Glynn-Rhee estimation procedure. The Propp-Wilson algorithm requires a grand coupling with monotonicity, which severely restricts its applicability, whereas the Glynn-Rhee estimator can use any coupling in which the two copies of the Markov chain tend to coalesce quickly. Our couplings are based on the divisibility property of Gamma variables and inverse integral transformations. Recent papers by Jacob et al. (2020) and Jacob and Heng (2019) propose several constructions beyond those we considered to define coupled Markov chains in popular MCMC algorithms, such as Gibbs sampling and Metropolis-Hastings. One example is a maximal coupling (see Jacob et al. (2020)) which at each transition maximizes the probability of coalescence. Jacob et al. (2020) and Rhee and Glynn (2015) discuss unbiased estimators which are similar to ĥ_GR but may have significantly lower variance. For example, Jacob et al. (2020) considered an estimator which does not use a truncation variable T independent of the sequence (Δ_m)_{m≥0}; it is defined by ĥ = Σ_{m=0}^{τ} Δ_m, where τ = inf{m : Δ_{m+j} = 0 for every j ≥ 0}. This and other modifications are discussed in more detail in Appendix C, in relation to the application in Section 4.

Simulation study
The Gibbs samplers and perfect sampler of Section 2 were implemented (code available at http://www.github.com/bacallado/hpy). This section describes numerical experiments with the aim of evaluating the efficiency of the three Gibbs samplers, and the running time of the perfect sampler, for a range of parameter settings.

Testing the correctness of the implementation
The testing procedure described by Geweke (2004) was used to verify the correctness of the algorithms and their implementation. To be precise, for a fixed setting of parameters (θ, θ₀, β, β₀), define the following routine:
1. Sample a pair (n, k′) from the CRF.
2. Sample an array k from one of the algorithms in the following list, with target distribution p(k | n): (a) 100 iterations of the collapsed Gibbs sampler, initialised at k′; (b) 100 iterations of the conditional Gibbs sampler, initialised at k′; (c) 100 iterations of the doubly conditional Gibbs sampler, initialised at k′; or (d) perfect sampling.
In each case (a-d), if the implementation is correct, the pair (k′, k) should be exchangeable. For a specific setting of the parameters, namely β₀ = β = 0.1 and θ₀ = θ = 1, we sample 200 independent copies (k′_m, k_m)_{m=1}^{200} of the pair, and test exchangeability through a permutation test with two test statistics, s₁ and s₂, defined in equation (3.1). The test was performed for both statistics and all four algorithms. Using a Bonferroni correction to maintain the familywise error below 1% for the two tests and all algorithms, the null hypothesis was not rejected in any case.
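The exchangeability check can be sketched as a paired permutation test. The code below is illustrative only: the statistic shown is a simple mean difference, not the paper's s₁ or s₂, and the function name is hypothetical.

```python
import random

def paired_permutation_test(pairs, n_perm=999, rng=None):
    # Permutation test of exchangeability for paired draws (a_m, b_m):
    # under the null, swapping the two elements of any pair leaves the
    # joint law unchanged, so the observed mean difference should look
    # typical among randomly swapped versions of the data.
    rng = rng or random.Random(0)

    def stat(ps):
        return sum(a - b for a, b in ps) / len(ps)

    observed = abs(stat(pairs))
    hits = 0
    for _ in range(n_perm):
        swapped = [(b, a) if rng.random() < 0.5 else (a, b) for a, b in pairs]
        if abs(stat(swapped)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # permutation p-value
```

A small p-value is evidence against exchangeability of the pairs, i.e. evidence of an incorrect implementation (or, in Section 3.3, of poor mixing).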

Convergence diagnostics for Gibbs samplers
Simulations of the three Gibbs samplers were run for a fixed dataset and two settings of the parameters, (θ, θ₀, β, β₀) = (1, 1, 0.1, 0.1) or (1, 1, 0.7, 0.1). We use a dataset from the human vaginal microbiome study discussed in Ravel et al. (2011). In the CRF analogy, there are 900 restaurants, 134 distinct dishes observed, and we subsample 1,000 customers per restaurant.
For each algorithm and each parameter setting, we simulate a total of 16 chains, which are initialized at the extremes of the state space of k, namely k_{r,i} = n_{r,i} and k_{r,i} = 1(n_{r,i} > 0). Each chain is 2,000 steps long. Figure 5 in the Supplementary Material (Bacallado et al., 2021) shows trace plots for two functions of k, organized by algorithm (columns) and initial state (rows), with (θ, θ₀, β, β₀) = (1, 1, 0.7, 0.1). The quantities examined were the mean of the array, (RI)^{-1} Σ_{r,i} k_{r,i}, and a specific entry, k_{758,1}, which had the largest mean across simulations. A visual inspection of the plots suggests rapid mixing for all three algorithms, and the figure is similar for the parameters (θ, θ₀, β, β₀) = (1, 1, 0.1, 0.1). In addition, two convergence diagnostics were computed using the R package CODA (Plummer et al., 2006): the autocorrelation function and the potential scale reduction factor of Gelman and Rubin (1992). In every case, the three Gibbs samplers appear to mix in just 20 iterations, with potential scale reduction factors below 1.02 with 95% confidence, and autocorrelations after 20 steps below 0.043 in absolute value.

Table 1: Parameter settings for simulation experiments. We define four grids with two parameters fixed in each case, and the other ones varying over the ranges Θ = {1, 5, 10, 15, 20} or B = {0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.55, 0.6, 0.7, 0.8, 0.9}.
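For reference, the potential scale reduction factor reported by CODA can be computed from first principles. The following is a minimal sketch of the Gelman-Rubin statistic for scalar chains, omitting CODA's degrees-of-freedom correction; the function name is illustrative.

```python
def gelman_rubin(chains):
    # chains: list of equal-length lists of scalar draws, one per chain.
    # Compares between-chain variance (B) to within-chain variance (W);
    # values near 1 indicate the chains have mixed into the same law.
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n * sum((mu - grand) ** 2 for mu in means) / (m - 1)  # between
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m              # within
    var_plus = (n - 1) / n * W + B / n   # pooled variance estimate
    return (var_plus / W) ** 0.5         # potential scale reduction
```

Chains started at overdispersed initial states, as in our simulations, make this diagnostic most informative: a factor well above 1 flags that at least one chain has not yet reached the common stationary distribution.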

Testing the doubly conditional Gibbs sampler's convergence
To further evaluate the convergence of the doubly conditional Gibbs sampler, for a range of parameter values summarised in Table 1, we apply a test similar to that of Section 3.1. For each setting of the parameters (θ, θ₀, β, β₀), we iterate the following steps:
1. Sample a pair (n, k′) from the CRF, with R = 5 populations and n_{r·} = 100 samples for each population r = 1, . . . , 5.
2. Obtain a pseudo-sample k from the distribution p(k | n) by simulating 1000 steps of the Gibbs sampler from the initial state k^init with k^init_{r,i} = 1(n_{r,i} > 0).
The difference between this routine and that of Section 3.1 is the initial state of the Markov chain. The routine was iterated 200 times to obtain pairs (k′_m, k_m)_{1≤m≤200}. If the Markov chain produced a perfect sample from the target distribution, then k′_m and k_m would be exchangeable. Thus, we test the null hypothesis of good mixing through a permutation test, using the statistics in equation (3.1), which have mean zero under the null.
With a Bonferroni correction to maintain the familywise error rate at 1% across the list of parameter values described in Table 1, we find that the test using either s₁ or s₂ does not reject the null hypothesis of good mixing.

Evaluation of perfect sampler's running time
The perfect sampler of Section 2 has a random running time. The running time of the sampler for a given simulation setup will be characterized in terms of two quantities: (i) the total number of coupled Markov transitions that had to be simulated in order to obtain a sample, and (ii) the number of attempts required until equation (2.11) is satisfied. The first quantity is linearly related to the CPU time required to obtain one sample. For each setting of the parameters, we iterate the following steps:
1. Sample an array n from the CRF.
2. Sample an array k from the distribution p(k | n) using the perfect sampler, recording the two running time statistics listed above.
This routine allows us to evaluate the running time of the algorithm in a well-specified setting; i.e. when the data n is sampled from the same model used for posterior inference.
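The mechanics behind statistic (i) can be illustrated with a generic monotone coupling-from-the-past routine. The sketch below runs on a toy reflected random walk rather than the paper's sampler, and all names are illustrative; it returns both an exact draw and the count of coupled transitions, the analogue of statistic (i).

```python
import random

def walk_update(x, u):
    # Monotone update for a reflected walk on {0, ..., 9}: with a
    # shared uniform u, the order of any two chains is preserved.
    return min(x + 1, 9) if u < 0.5 else max(x - 1, 0)

def monotone_cftp(update, bottom, top, rng):
    # Propp-Wilson coupling from the past: restart from increasingly
    # distant past times, reusing the randomness already assigned to
    # each time point, until the minimal and maximal chains coalesce
    # by time 0. Returns the exact stationary draw and the number of
    # coupled transitions simulated.
    T, transitions, seeds = 1, 0, {}
    while True:
        lo, hi = bottom, top
        for t in range(-T, 0):
            if t not in seeds:
                seeds[t] = rng.random()      # fix randomness for time t
            lo, hi = update(lo, seeds[t]), update(hi, seeds[t])
            transitions += 2
        if lo == hi:
            return lo, transitions
        T *= 2                               # go further into the past
```

The reflected walk is doubly stochastic, so its stationary law is uniform on {0, …, 9}, and the returned draws are exact samples from it; the random `transitions` count is the toy analogue of the running-time statistic we record.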
For each combination of the parameters in Table 1, the routine above was iterated for a total of 3 CPU hours. The most striking results emerged from the experiment which fixed θ₀ and β₀ and varied θ and β; they are plotted in Figure 1. Large values of β and small values of θ dramatically increase the number of coupling steps required to obtain a sample. In fact, for the range of parameters θ considered in this simulation study, the algorithm almost never output a sample when β > 0.5. In addition, the number of attempts required until equation (2.11) is satisfied also increased with large β and small θ. This is despite the fact that convergence diagnostics suggest good mixing of the doubly conditional Gibbs sampler for all the parameter values in Table 1. Figure 1 also illustrates that the average computing time of the exact sampler, for fixed values of θ and β, tends to decrease with θ₀ and slightly increases with β₀.
To explain this phenomenon we provide some heuristics. In the posterior distribution p(k | n) of equation (2.4), the dependence between the variables (k_{r,i}; 1 ≤ r ≤ R, 1 ≤ i ≤ I) may be represented by a factor graph with the potential energy function given in equation (3.2). The potentials collected in the first term each depend on a single variable k_{r,i}, so the correlation between the variables is due to the other terms. In particular, the interaction terms of the form log Γ(θ/β + k_{r·}) will be nearly linear in k_{r·} when the crucial parameter θ/β is large. Figure 2 provides a graphical illustration of this fact. Approximating these terms by linear functions of k_{r·} = Σ_i k_{r,i} breaks the dependence. Thus, one might speculate that a large value of θ/β makes computational inference of this posterior distribution easier. Now, considering the perfect sampling algorithm of Section 2, and the distribution p_a(k) in particular, the same argument implies that a large value of θ/β makes the columns of k approximately independent. This could lead to faster mixing of the conditional Gibbs sampler for p_a(k), and a faster running time for coupling from the past.
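The near-linearity claim can be checked numerically. The sketch below measures how far log Γ(c + k) deviates from the straight chord through its endpoints over k = 0, …, k_max, where c plays the role of θ/β; the function name and the particular values of c are illustrative.

```python
import math

def lgamma_nonlinearity(c, kmax):
    # Maximum deviation of f(k) = log Gamma(c + k), k = 0..kmax, from
    # the chord through its endpoints. A small value means the
    # interaction term is nearly linear in k, so the corresponding
    # factor-graph coupling between variables is weak.
    f = [math.lgamma(c + k) for k in range(kmax + 1)]
    slope = (f[-1] - f[0]) / kmax
    return max(abs(f[k] - (f[0] + slope * k)) for k in range(kmax + 1))
```

For instance, the deviation at c = 100 (large θ/β) is an order of magnitude smaller than at c = 2 over the same range of k, consistent with the heuristic that large θ/β weakens the dependence.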

Application
This section illustrates the inference methods of Sections 2 and 2.3 using data from Ravel et al. (2011) on the vaginal microbiome. The data are collected in a contingency table n whose rows represent 900 different biological samples from pregnant women, and whose columns represent 134 bacterial species. We split the table into a training set n_train, with one hundred counts per biological sample, and a test set n_test, such that n_train + n_test = n. We consider two summaries of the posterior: the missing-in-sample mass and the missing mass for one of the biological samples. To define these quantities, let (X*_1, X*_2, . . . ) be the atoms of the measure μ, and let {X*_i ; i ∈ I_r} be the set of species observed in sample r. The missing-in-sample mass for sample r is defined as the total probability of all species which have not been observed in the sample r.
The missing mass for sample r is defined as the total probability of all species which have not been observed in any of the samples. Estimators for the missing mass, such as the Good-Turing estimator, are classical in the literature on species sampling, and Bayesian nonparametric counterparts have been studied (Good, 2000; Favaro et al., 2016).
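For comparison, the classical Good-Turing estimate of the missing mass is simply the proportion of observations whose species occurs exactly once; a minimal sketch (function name illustrative):

```python
from collections import Counter

def good_turing_missing_mass(sample):
    # Good-Turing: estimate the probability of an unseen species by
    # n1 / n, where n1 is the number of species observed exactly once
    # among the n observations in the sample.
    counts = Counter(sample)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / len(sample)
```

Unlike the Bayesian estimators considered here, this estimate uses no model and no pooling of information across samples, which is part of what the hierarchical prior adds.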
We observe that the posterior expectations of the missing-in-sample mass and the missing mass are simply predictive probabilities: respectively, the probability that the next observation from the biological sample in question is a new species for the sample, or a new species overall. Given the data n_train and the latent variable k, these probabilities are available in closed form. Therefore, given perfect samples from the posterior of k given n_train, we can estimate the missing-in-sample and missing mass by averaging the corresponding predictive probabilities. Table 2 contrasts the estimates of the missing-in-sample and missing mass for different settings of the model parameters (θ, β). The estimates derived via perfect sampling are compared to those obtained through Markov chain Monte Carlo with the doubly conditional Gibbs sampler of Section 2.1, as well as unbiased estimates obtained with the method of Glynn and Rhee based on the same Gibbs sampler. In particular, we use the first construction of the unbiased estimator defined in equation (2.13), with a stopping time T = 1000 + Geometric(0.1). This choice was based on the observation that the coupled Gibbs samplers tend to coalesce before 1000 transitions, making the estimator identical with high probability to the estimator ĥ_GR2 described by Jacob et al. (2020), which does not require a truncation variable T. Other versions of the Glynn-Rhee estimator are compared in Appendix C.
The cost of computing the confidence intervals in Table 2 was different for each method, so one should not interpret the width of the intervals as a measurement of efficiency. For completeness, the cost of each method is quantified in Table 3 in Appendix B. As one would expect, perfect sampling was between 100 and 1000 times as expensive as the other two methods.
The 95% confidence intervals displayed in Table 2 for the perfect sampling and Glynn-Rhee estimators are constructed via normal approximation. In the case of β = 0.7 and θ = 1, perfect sampling becomes computationally infeasible, whereas we can still derive missing-mass estimates through the Glynn-Rhee method. In this example, we expect the Glynn-Rhee confidence intervals to have good coverage (see simulations in Appendix C).
It is possible to simulate the posterior predictive distribution given n_train by drawing k from its posterior and then simulating the CRF urn scheme. This two-step procedure was applied to sample replicates of n_test with the same row margins. Figure 3 shows the posterior distribution of the number of new species discovered, for two different settings of the model parameters. The observed statistic in n_test is marked on the plot. As expected, the parameters affect the prediction.
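The urn-scheme step can be illustrated for a single restaurant. The sketch below seats customers by the two-parameter Pitman-Yor predictive rule; it omits the coupling of restaurants through the shared base urn in the full CRF, and the function name is illustrative.

```python
import random

def pitman_yor_urn(theta, beta, n, rng):
    # Seat n customers sequentially by the Pitman-Yor predictive rule:
    # with m customers seated at K tables, a new table (species) opens
    # with probability (theta + beta * K) / (theta + m), and table j is
    # joined with probability (size_j - beta) / (theta + m).
    tables = []                              # customers per table
    for m in range(n):
        K = len(tables)
        if rng.random() < (theta + beta * K) / (theta + m):
            tables.append(1)                 # a new species is observed
        else:
            weights = [c - beta for c in tables]
            r = rng.random() * sum(weights)
            for j, w in enumerate(weights):
                r -= w
                if r <= 0:
                    tables[j] += 1
                    break
            else:                            # numerical guard
                tables[-1] += 1
    return tables
```

The number of occupied tables is the number of distinct species discovered; for β = 0 this reduces to the Dirichlet-process urn, while β > 0 produces the heavier-tailed, power-law growth that motivates the Pitman-Yor model.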

Conclusions
The hierarchical Pitman-Yor process is a popular model for dependent random probability measures. However, the behavior of MCMC methods for posterior inference is not well understood. This paper proposes three new inference algorithms, namely a Gibbs sampler, a method for unbiased estimation, and a perfect sampler, which provide different levels of reliability. It is difficult to rigorously evaluate the convergence of the Markov chain sampler. The Glynn-Rhee method eliminates the bias of traditional MCMC estimators. The perfect sampler, on the other hand, provides the strongest guarantees but has a random running time.
Availability of multiple computational methods for inference under the hierarchical Pitman-Yor process facilitates the use of the model in data analyses. Among the procedures that we tested, our simulation study suggests the doubly conditional Gibbs sampler and the Glynn-Rhee framework as effective tools for data analyses like those presented in Section 4. Repeating a data analysis with different algorithms that approximate the same posterior is a viable way to validate their implementations, including the evaluation of the number of MCMC iterations necessary to approximate the posterior and the assessment of potential bias of the MCMC estimates. A major advantage of the coupling-from-the-past sampler is that it permits straightforward comparisons between exact samples and other approaches to approximating the posterior distribution.
Also, MCMC algorithms can be easily modified, for example to include a prior distribution on the hierarchical Pitman-Yor process parameters (θ, β, θ₀, β₀), or to produce posterior inference under other variants of the hierarchical model. In these cases, as we discussed, the Glynn-Rhee methodology appears more flexible than the coupling of Markov chains for exact sampling. Indeed, this methodology allows the analyst to build on existing MCMC procedures without the restrictive monotonicity requirements of the Propp and Wilson approach (Propp and Wilson, 1996).
One advantage of the perfect sampler is that it allows us to evaluate the running time by numerical simulations. We found that there are regions of the parameter space in which the algorithm terminates quickly and others in which the computational burden to obtain posterior samples increases substantially. These regions are separated by a relatively sharp boundary. The problematic regions correspond to values of the parameters for which we expect posterior inference to be harder, based on the analytic expression of p(k | n).
We focused on applications where the dishes assigned to each customer, in the CRF analogy, are directly observed, as in the human microbiome dataset of Section 4. In many applications of the hierarchical Pitman-Yor distribution, the dishes assigned to each customer are not observed and instead represent latent variables, such as cluster membership indicators in a mixture model. The collapsed Gibbs sampler and the conditional Gibbs sampler in Section 2 were originally designed with such models in mind. Appendix D describes two examples of algorithms that leverage and adapt the doubly conditional Gibbs sampler for posterior inference with a mixture model, where the assigned dishes in the CRF representation are latent variables. Although the perfect sampler defined here could be employed within a larger Gibbs sampler, in practice this would not be efficient due to the large computational cost compared to the Markov chain sampler it is based on. On the other hand, using coupled Markov chains for unbiased estimation in the context of mixture models appears to be an attractive complement to MCMC for posterior inference, for example on the number of clusters, predictive probabilities, and other summaries of interest.

Table 2: Bayes estimators of the missing-in-sample and missing mass in the first biological sample, with 95% confidence intervals. Two parameters are fixed at θ₀ = 1 and β₀ = 0.1.