Hierarchical Species Sampling Models

This paper introduces a general class of hierarchical nonparametric prior distributions. The random probability measures are constructed by a hierarchy of generalized species sampling processes with possibly non-diffuse base measures. The proposed framework provides a general probabilistic foundation for hierarchical random measures with either atomic or mixed base measures and allows for studying their properties, such as the distribution of the marginal and total number of clusters. We show that hierarchical species sampling models have a Chinese Restaurant Franchise representation and can be used as prior distributions to undertake Bayesian nonparametric inference. We provide a method to sample from the posterior distribution together with some numerical illustrations. Our class of priors includes some new hierarchical mixture priors such as the hierarchical Gnedin measures, and other well-known prior distributions such as the hierarchical Pitman-Yor and the hierarchical normalized random measures.


Introduction
Cluster structures in multiple groups of observations can be modelled by means of hierarchical random probability measures, or hierarchical processes, that allow for heterogeneous clustering effects across groups and for sharing clusters among groups. As an effect of the heterogeneity, in these models the number of clusters in each group (marginal number of clusters) can differ, and, due to cluster sharing, the number of clusters in the entire sample (total number of clusters) can be smaller than the sum of the marginal numbers of clusters. An important example of a hierarchical random measure is the Hierarchical Dirichlet Process (HDP), introduced in the seminal paper of Teh et al. (2006). The HDP involves a simple Bayesian hierarchy where the common base measure for a set of Dirichlet processes is itself distributed according to a Dirichlet process. This means that the joint law of the random probability measures (p_1, ..., p_I) is

p_0 ∼ DP(θ_0, H_0),    p_i | p_0 iid∼ DP(θ, p_0),   i = 1, ..., I,

where DP(θ, p) denotes the Dirichlet process with base measure p and concentration parameter θ > 0. Once the joint law of (p_1, ..., p_I) has been specified, observations [ξ_{i,j}]_{i=1,...,I; j≥1} are assumed to be conditionally independent given (p_1, ..., p_I) with

ξ_{i,j} | (p_1, ..., p_I) ind∼ p_i,   i = 1, ..., I and j ≥ 1.
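As a concrete illustration, the HDP hierarchy above can be simulated with a truncated stick-breaking construction. The sketch below is not from the paper: the truncation level T, the standard normal choice for the base measure and all function names are illustrative assumptions.

```python
import numpy as np

def stick_breaking(theta, T, rng):
    # Truncated stick-breaking weights of a DP with concentration theta:
    # w_k = b_k * prod_{j<k} (1 - b_j), with b_k ~ Beta(1, theta) and b_T
    # set to 1 so that the T weights sum exactly to one.
    b = rng.beta(1.0, theta, size=T)
    b[-1] = 1.0
    return b * np.concatenate(([1.0], np.cumprod(1.0 - b[:-1])))

def sample_hdp(theta0, theta, I, T, rng):
    # Base draw p0 = sum_k w0_k * delta_{Z_k}, with atoms Z_k i.i.d. from N(0, 1).
    w0 = stick_breaking(theta0, T, rng)
    atoms0 = rng.normal(size=T)
    # Group measures p_i ~ DP(theta, p0): their atoms are resampled from p0,
    # which creates ties across groups (the shared clusters of the hierarchy).
    groups = []
    for _ in range(I):
        wi = stick_breaking(theta, T, rng)
        atoms_i = rng.choice(atoms0, size=T, p=w0)
        groups.append((wi, atoms_i))
    return (w0, atoms0), groups
```

Sampling an observation then amounts to drawing an atom from p_i with the corresponding weight; ties among sampled atoms, within and across groups, define the cluster structure.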
Hierarchical processes are widely used as prior distributions in Bayesian nonparametric inference (see Teh and Jordan (2010) and references therein), by assuming that the ξ_{i,j} are latent variables describing the clustering structure of the data, and that the observations in the i-th group, Y_{i,j}, are conditionally independent given ξ_{i,j} with

Y_{i,j} | ξ_{i,j} ind∼ f(·|ξ_{i,j}),

where f is a suitable kernel density.
In this paper, we introduce a new class of hierarchical random probability measures, called Hierarchical Species Sampling Model (HSSM), based on a hierarchy of species sampling models.
A Species Sampling random probability (SSrp) is defined as

p = Σ_{j≥1} q_j δ_{Z_j},   (1.2)

where (Z_j)_{j≥1} and (q_j)_{j≥1} are stochastically independent sequences, the atoms Z_j are i.i.d. with common distribution H_0 (base measure), and the non-negative weights q_j ≥ 0 sum to one almost surely. By Kingman's theory on exchangeable partitions, any random sequence of non-negative weights such that Σ_{j≥1} q_j ≤ 1 can be associated with an exchangeable random partition of the integers (Π_n)_{n≥1}. Moreover, the law of an exchangeable random partition (Π_n)_{n≥1} is completely described by an exchangeable partition probability function (EPPF) q_0. Hence the law of the measure p defined in (1.2) is parametrized by q_0 and H_0, and it will be denoted by SSrp(q_0, H_0).
The proposed framework provides a general probabilistic foundation of both existing and novel hierarchical random measures, and relies on a convenient parametrization of the hierarchical process in terms of two EPPFs and a base measure. Our HSSM class includes the HDP; its generalizations given by the Hierarchical Pitman-Yor process (HPYP), see Teh (2006), Du et al. (2010), Lim et al. (2016) and Camerlenghi et al. (2017); and the hierarchical normalized random measures with independent increments (HNRMI), first studied in Camerlenghi et al. (2018) and Camerlenghi et al. (2019), and more recently in Argiento et al. (2019). Among the novel measures, we study hierarchical generalizations of the Gnedin process (Gnedin (2010)) and of finite mixture processes (e.g., Miller and Harrison (2018)), as well as asymmetric hierarchical constructions with p_0 and p_i of different types (Du et al. (2010)). Another motivation for studying HSSMs relies on the introduction of non-diffuse base measures (e.g., the spike-and-slab prior of George and McCulloch (1993)), now widely used in Bayesian parametric (e.g., Castillo et al. (2015) and Rockova and George (2018)) and nonparametric (e.g., Kim et al. (2009), Canale et al. (2017)) inference.
We show that the arrays of observations from HSSMs have a Chinese Restaurant Franchise representation, which is appealing for applications to Bayesian nonparametrics, since it sheds light on the clustering mechanism of the observations and suggests a simple and general sampling algorithm for posterior computations. The sampler can be used under both assumptions of diffuse and non-diffuse (e.g., spike-and-slab) base measure, whenever the EPPFs q_0 and q are known explicitly.
By exploiting the properties of species sampling sequences, we are able to provide the finite-sample distribution of the number of clusters for each group of observations and of the total number of clusters for the hierarchy. We provide some new asymptotic results when the number of observations goes to infinity, thus extending to our general class of processes the asymptotic approximations given in Pitman (2006) and Camerlenghi et al. (2019) for species sampling and hierarchical normalized random measures, respectively.
The paper is organized as follows. Section 2 introduces exchangeable random partitions, generalized species sampling sequences and species sampling random probability measures. Section 3 defines hierarchical species sampling models and shows some useful properties for the applications to Bayesian nonparametric inference. Section 4 gives finite-sample and asymptotic distributions of the number of clusters under both assumptions of diffuse and non-diffuse base measure. A general Gibbs sampler for hierarchical species sampling mixtures is established in Section 5. Section 6 presents some simulation studies and a real data application.

Background Material
Our Hierarchical Species Sampling Models build on exchangeable random partitions and related processes, such as species sampling sequences and species sampling random probability measures. We review some of their definitions and properties, which will be used in the rest of the paper. The supplementary material (Bassetti et al., 2019a) provides further details, examples and some new results under the assumption of non-diffuse base measure.
A partition π_n = {π_{1,n}, ..., π_{k,n}} of [n] := {1, ..., n} is an unordered collection of disjoint non-empty subsets (blocks) whose union is [n], and we denote by |π_{c,n}| the number of elements of the block c = 1, ..., k. We denote with P_n the collection of all partitions of [n] and, given a partition, we list its blocks in ascending order of their smallest element. In other words, a partition π_n ∈ P_n is coded with blocks in order of appearance.
A random partition of N is a sequence of random partitions, Π = (Π_n)_n, such that each element Π_n takes values in P_n and the restriction of Π_n to [m], m < n, is Π_m (consistency property). A random partition of N is said to be exchangeable if for every n the distribution of Π_n is invariant under the action of all permutations of [n] (acting on Π_n in the natural way).
Exchangeable random partitions are characterized by the fact that their distribution depends on Π_n only through its block sizes. A random partition of N is exchangeable if and only if its distribution can be written in terms of an exchangeable partition probability function (EPPF). An EPPF is a symmetric function q defined on integer compositions (n_1, ..., n_k), with Σ_{i=1}^k n_i = n, that satisfies the addition rule

q(n_1, ..., n_k) = Σ_{j=1}^k q(n_1, ..., n_j + 1, ..., n_k) + q(n_1, ..., n_k, 1)

(see Pitman (2006)). If (Π_n)_n is an exchangeable random partition of N, there exists an EPPF q such that, for every n and every π_n ∈ P_n,

P(Π_n = π_n) = q(|π_{1,n}|, ..., |π_{k,n}|),

where k = |π_n|. In other words, q(n_1, ..., n_k) corresponds to the probability that Π_n is equal to any given partition of [n] with k distinct blocks and block frequencies (n_1, ..., n_k).
Given an EPPF q, one deduces the corresponding sequence of predictive distributions. Starting with Π_1 = {1}, given Π_n = π_n (with |π_n| = k and block sizes (n_1, ..., n_k)), the conditional probability of adding a new block (containing n + 1) to Π_n is

ν_n = q(n_1, ..., n_k, 1) / q(n_1, ..., n_k),

while the conditional probability of adding n + 1 to the c-th block of Π_n (for c = 1, ..., k) is

ω_{n,c} = q(n_1, ..., n_c + 1, ..., n_k) / q(n_1, ..., n_k).

An important class of exchangeable random partitions is that of Gibbs-type partitions, introduced in Gnedin and Pitman (2005) and characterized by the EPPF

q(n_1, ..., n_k) = V_{n,k} ∏_{c=1}^k (1 − σ)_{n_c − 1},

where (x)_n = x(x+1)...(x+n−1) is the rising factorial (or Pochhammer polynomial), σ < 1 and the V_{n,k} are positive real numbers such that V_{1,1} = 1 and

V_{n,k} = (n − σk) V_{n+1,k} + V_{n+1,k+1}.   (2.4)
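The predictive rules above can be used to grow a partition sequentially. The sketch below does this for the Pitman-Yor case, a Gibbs-type example whose predictive weights are (n_c − σ)/(n + θ) for joining block c and (θ + kσ)/(n + θ) for opening a new block; the function name and interface are our own illustrative choices.

```python
import numpy as np

def sample_py_partition(n, sigma, theta, rng):
    """Grow a Pitman-Yor(sigma, theta) partition of [n] via its predictive rules.

    With m elements seated in k blocks, block c is joined with probability
    (n_c - sigma)/(m + theta), and a new block is opened with probability
    (theta + k*sigma)/(m + theta)."""
    assert 0.0 <= sigma < 1.0 and theta > -sigma
    block_sizes = [1]          # element 1 starts the first block
    labels = [0]               # block label of each element, in order of appearance
    for m in range(1, n):
        k = len(block_sizes)
        probs = np.array([nc - sigma for nc in block_sizes] + [theta + k * sigma])
        probs /= (m + theta)   # the weights sum exactly to one
        c = rng.choice(k + 1, p=probs)
        if c == k:
            block_sizes.append(1)
        else:
            block_sizes[c] += 1
        labels.append(c)
    return block_sizes, labels
```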

Species Sampling Models with General Base Measure
Kingman's theory of random partitions sets up a one-to-one correspondence (Kingman's correspondence) between EPPFs and distributions for decreasing sequences of random variables (q↓_k)_k with q↓_i ≥ 0 and Σ_i q↓_i ≤ 1 almost surely, by using the notion of the random partition induced by a sequence of random variables. Let us recall that a sequence of random variables (ζ_n)_n induces a random partition on N by the equivalence classes i ∼ j if and only if ζ_i = ζ_j. If Σ_i q↓_i = 1 a.s., then Kingman's correspondence between the EPPF and (q↓_j)_j can be defined as follows. Let (U_j)_j be an i.i.d. sequence of uniform random variables on (0, 1), independent from (q↓_j)_j, and let Π be the random partition induced by a sequence (θ_n)_n of conditionally i.i.d. random variables from Σ_{j≥1} q_j δ_{U_j}, where (q_j)_j is any (possibly random) permutation of (q↓_j)_j. Then the EPPF in Kingman's correspondence is the EPPF of Π. In point of fact, one can prove that

q(n_1, ..., n_k) = Σ_{(j_1, ..., j_k)} E[ ∏_{c=1}^k q_{j_c}^{n_c} ],   (2.5)

where (j_1, ..., j_k) ranges over all ordered k-tuples of distinct positive integers. See Equation (2.14) in Pitman (2006).
A Species Sampling random probability with parameters q and H, in symbols p ∼ SSrp(q, H), is a random distribution

p = Σ_{j≥1} q_j δ_{Z_j},   (2.6)

where the (Z_j)_j are i.i.d. random variables on a Polish space X with possibly non-diffuse common distribution H, and the EPPF q is given in (2.5). Such random probability measures are sometimes called species sampling models. In this parametrization, q takes into account only the law of (q↓_j)_j, while H describes the law of the Z_j's. If H is diffuse, a sequence (ξ_n)_n sampled from p in (2.6), i.e. with the ξ_n conditionally i.i.d. (given p) with law p ∼ SSrp(q, H), is a Species Sampling Sequence as defined by Pitman (1996) (Proposition 13 in Pitman (1996)), and the EPPF of the partition induced by (ξ_n)_n is exactly q. On the contrary, when H is not diffuse, (ξ_n)_n is not a Species Sampling Sequence in the sense of Pitman (1996) and the EPPF of the induced partition is not q. Nevertheless, as shown in the next proposition, there exist an augmented space X × (0, 1) and a latent partition related to (ξ_n)_n with EPPF q.
Hereafter, for a general base measure H, we refer to (ξ_n)_n as a generalized species sampling sequence, gSSS(q, H).
Proposition 1. Let (U_j)_j be an i.i.d. sequence of uniform random variables on (0, 1), (Z_j)_j an i.i.d. sequence with possibly non-diffuse common distribution H, and (q_j)_j a sequence of non-negative weights with Σ_j q_j = 1 a.s. Assume that all the previous elements are independent, and let (ζ_n)_n := (ξ_n, θ_n)_n be a sequence of random variables, with values in X × (0, 1), conditionally i.i.d. from p given

p = Σ_{j≥1} q_j δ_{(Z_j, U_j)}.   (2.7)

Then the EPPF of the partition induced by (ζ_n)_n is the q given in (2.5), and (ξ_n)_n is a gSSS(q, H).
From the previous proposition, it follows that the partition induced by (ζ_n)_n is in general finer than the partition induced by (ξ_n)_n, with equality if H is diffuse. This result is essential in order to properly define and study hierarchical models of type (1.3), since the random measure p_0 in (1.3) is almost surely discrete and hence not diffuse. Further properties of the gSSS are proved in the supplementary material (Bassetti et al., 2019a), whereas further results are available in Sangalli (2006) for normalized random measures with independent increments. These properties are relevant to understanding the implications of mixed base measures for Bayesian nonparametrics, especially for hierarchical prior constructions.
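The coarsening described here is easy to visualize numerically: with a spike-and-slab base measure, distinct latent atoms (Z_j, U_j) can share the same observable value Z_j, so blocks of the latent partition merge in the observed one. A minimal sketch, in which the 0.5 spike weight, the Dirichlet weight law and the sample sizes are arbitrary illustrative choices:

```python
import numpy as np

def induced_partition(values):
    # Label each value by the order of first appearance of its equivalence class.
    seen, labels = {}, []
    for v in values:
        labels.append(seen.setdefault(v, len(seen)))
    return labels

rng = np.random.default_rng(2)
T = 200
# Spike-and-slab base measure H = 0.5*delta_0 + 0.5*N(0,1), with latent marks U_j
Z = np.where(rng.random(T) < 0.5, 0.0, rng.normal(size=T))
U = rng.random(T)
q = rng.dirichlet(np.ones(T))                  # a generic random weight vector
idx = rng.choice(T, size=100, p=q)             # conditionally i.i.d. draws from p
latent = induced_partition([(Z[j], U[j]) for j in idx])    # partition of the pairs
observed = induced_partition([Z[j] for j in idx])          # partition of xi alone
# Latent blocks whose atoms share the spike value 0 coalesce in the observed partition,
# so the observed partition has at most as many blocks as the latent one.
assert max(observed) <= max(latent)
```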

Hierarchical Species Sampling Models
We introduce hierarchical species sampling models (HSSMs), provide some examples and derive relevant properties.

HSSM Definition and Examples
In the following definition a hierarchy of species sampling random probabilities is used to build hierarchical species sampling models.

Definition 1. Let q and q_0 be two EPPFs and H_0 a probability distribution on the Polish space X. A Hierarchical Species Sampling Model, HSSM(q, q_0, H_0), of parameters (q, q_0, H_0) is a vector of random probability measures (p_0, p_1, ..., p_I) such that

p_0 ∼ SSrp(q_0, H_0),    p_i | p_0 iid∼ SSrp(q, p_0),   i = 1, ..., I.

An array [ξ_{i,j}]_{i=1,...,I, j≥1} is sampled from HSSM(q, q_0, H_0) if its elements are conditionally independent random variables given (p_1, ..., p_I) with ξ_{i,j} | (p_1, ..., p_I) ind∼ p_i, where i = 1, ..., I and j ≥ 1. By de Finetti's representation theorem, it follows that the array [ξ_{i,j}]_{i=1,...,I, j≥1} is partially exchangeable (in the sense of de Finetti).
Definition 1 is general and provides a probabilistic foundation for a wide class of hierarchical random models. The properties of the SSrp and of the gSSS guarantee that the hierarchical random measures in Definition 1 are well defined also for non-diffuse (e.g., atomic or mixed) base measures H_0.
The HSSM class in Definition 1 includes well-known (e.g., Teh et al. (2006), Teh (2006), Bacallado et al. (2017)) and new hierarchical processes, as shown in the following examples. We assume that the reader is familiar with basic nonparametric prior processes; a brief account of these topics is included in the supplementary material (Bassetti et al., 2019a).
Example 2 (Hierarchical homogeneous normalized random measures). Hierarchical homogeneous Normalized Random Measures (HNRMI), introduced in Camerlenghi et al. (2019), are defined by

p_0 ∼ NRMI(θ_0, η_0, H_0),    p_i | p_0 iid∼ NRMI(θ, η, p_0),   i = 1, ..., I,

where NRMI(θ, η, H) denotes a normalized homogeneous random measure with parameters (θ, η, H), θ > 0, η a Lévy measure on R_+ (absolutely continuous with respect to the Lebesgue measure) and H a measure on X. A NRMI is a SSrp, and hence HNRMIs are HSSMs.
Our class of HSSMs includes new hierarchical processes, such as hierarchical mixtures of finite mixture processes (MFMP) and combinations of finite mixture processes and PYPs.
A Hierarchical MFMP is defined analogously, with parameters σ_i and weight distributions ρ^(i) (i = 0, 1, ...). As a special case, when |σ_i| = 1 and for a suitable choice of ρ^(i) (i = 0, 1, ...), one obtains the Hierarchical Gnedin Process, which is a hierarchical extension of the Gnedin process. For further details see Examples S.2 and S.3 in the supplementary material (Bassetti et al., 2019a).

HSSM and Chinese Restaurant Franchising Representation
The next proposition gives the marginal law of an array sampled from a HSSM. When π_n = {π_{1,n}, ..., π_{k,n}} is a partition of [n] and q an EPPF, we will write q(π_n) for q(|π_{1,n}|, ..., |π_{k,n}|).

Proposition 2. Let [ξ_{i,j}]_{i=1,...,I, j≥1} be sampled from HSSM(q, q_0, H_0). Then, for every vector of integer numbers (n_1, ..., n_I) and every collection of Borel sets {A_{i,j}}, the joint law of the array has an explicit product form in terms of q, q_0 and H_0.

Starting from Proposition 2, we show that an array sampled from a HSSM has a Chinese Restaurant Franchise representation. Such a representation is very useful because it leads to a generative interpretation of the nonparametric priors in the HSSM class, and naturally allows for posterior simulation procedures (see Section 5).
In the Chinese Restaurant Franchise metaphor, observations are attributed to "customers", identified by the indices (i, j), and groups are described as "restaurants" (i = 1, ..., I). In each "restaurant", "customers" are clustered according to "tables", which are then clustered at the second hierarchy level by means of "dishes". Observations are clustered across restaurants at the second level of the clustering process, when dishes are associated to tables. One can think that the first customer sitting at each table chooses a dish from a common menu, and this dish is shared by all other customers who join the same table afterwards.
The first level of the clustering process, acting within each group, is driven by independent random partitions Π^(1), ..., Π^(I) with EPPF q. The second level, acting between groups, is driven by a random partition Π^(0) with EPPF q_0.
Given integer numbers n_1, ..., n_I, we introduce the set of observations O := {ξ_{i,j} : j = 1, ..., n_i; i = 1, ..., I} and denote with C_j(Π) the random index of the block of the random partition Π that contains j.

Theorem 1. If [ξ_{i,j}]_{i=1,...,I, j≥1} is a sample from a HSSM(q, q_0, H_0), then O and {φ_{d*_{i,j}} : j = 1, ..., n_i; i = 1, ..., I} have the same law, where (φ_n)_n is an i.i.d. sequence with common distribution H_0, Π^(1), ..., Π^(I) are i.i.d. exchangeable partitions with EPPF q, and Π^(0) is an exchangeable partition with EPPF q_0. All the previous random variables are independent.
The construction in Theorem 1 can be summarized by the following hierarchical structure where, following the Chinese Restaurant Franchise metaphor (see Figure 1), c_{i,j} is the table at which the j-th "customer" of "restaurant" i sits, d_{i,c} is the index of the "dish" served at table c in restaurant i, and d*_{i,j} is the index of the "dish" served to the j-th customer of the i-th restaurant.

Figure 1: Illustration of the HSSM(q, q_0, H_0) clustering process given in Theorem 1. We assume two groups (restaurants), I = 2, with n_1 = 6 and n_2 = 4 observations (customers). Top-left: samples (dishes) φ_n from the non-diffuse base measure; dishes have the same colour and line type if they take the same values. Mid-left: indexes D(i, c) (from 1 to 7 in lexicographical order) of the tables which share the same dish; boxes represent the blocks of the random partition at the top of the hierarchy. Bottom-left: observations (customers) allocated by c_{i,j} to each table (circles) in the group-specific random partitions. Top-right: table lexicographical ordering and dishes assigned to the tables by the top-level partition. Bottom-right: observation clustering implied by the joint table and dish allocation d*_{i,j}.
A special case of Theorem 1 has been independently proved in Proposition 2 of Argiento et al. (2019) for HNRMIs. Theorem 1 can also be used to describe the array O in a recursive way. Having in mind the Chinese Restaurant Franchise, we shall denote with n_{icd} the number of customers in restaurant i seated at table c and being served dish d, and with m_{id} the number of tables in restaurant i serving dish d. We denote marginal counts with dots. Thus, n_{i•d} is the number of customers in restaurant i being served dish d, m_{i•} is the number of tables in restaurant i, n_{i••} is the number of customers in restaurant i (i.e., the n_i observations), and m_{••} is the total number of tables.
Finally, let ω_{n,k} and ν_n be the weights of the predictive distribution of the random partitions Π^(i) (i = 1, ..., I) with EPPF q (see Section 2.1), and let ω̄_{n,k} and ν̄_n be the weights of the predictive distribution of the random partition Π^(0), defined analogously by using q_0 in place of q. We can sample {ξ_{i,j}} sequentially, starting from ξ_{1,1} = φ_1 ∼ H_0 and then iterating, for i = 1, ..., I, the following steps (S1)-(S3): each new customer of restaurant i either sits at an existing table c (with probability proportional to the corresponding weight ω, in which case we set c_{it} = c and leave m_{i•} unchanged) or opens a new table (with probability proportional to ν); each new table is served either a dish already on the menu (with probability proportional to the corresponding weight ω̄) or a new dish drawn from H_0 (with probability proportional to ν̄), updating the counts n_{icd} and m_{id} accordingly.

Remark 1. The Chinese Restaurant Franchise representation and the Pólya urn sampler in (S1)-(S3) are deduced directly from the latent partition representation given in Theorem 1, with no additional assumptions on H_0 and without resorting to the expression of the distribution of the partition induced by the observations. This expression can be derived for HSSMs as a side result of our combinatorial framework, and includes Theorems 3 and 4 of Camerlenghi et al. (2019) as special cases when the HSSM is a HNRMI. Since the derivation of this law is not a central result of the paper, it is given in the supplementary material (Bassetti et al., 2019a).
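For the HDP special case, where the predictive weights reduce to the familiar Chinese-restaurant probabilities at both levels, the scheme (S1)-(S3) can be sketched as follows. This is a simplified illustration under that assumption; the function name and interface are our own.

```python
import numpy as np

def crf_sample(n_per_group, theta0, theta1, rng):
    """Chinese Restaurant Franchise for the HDP(theta0, theta1) special case.

    Tables within restaurant i follow a CRP(theta1) over customers; dishes
    follow a CRP(theta0) over tables, franchise-wide. Returns, per restaurant,
    the dish index served to each customer."""
    m_d = []                 # m_d[d]: number of tables serving dish d (franchise-wide)
    dishes = []              # per restaurant, dish index of each customer
    for n_i in n_per_group:
        table_counts = []    # customers per table in this restaurant
        table_dish = []      # dish served at each table
        served = []
        for n in range(n_i):
            probs = np.array(table_counts + [theta1], dtype=float) / (n + theta1)
            t = rng.choice(len(table_counts) + 1, p=probs)
            if t == len(table_counts):           # open a new table ...
                total_tables = sum(m_d)
                dprobs = np.array(m_d + [theta0], dtype=float) / (total_tables + theta0)
                d = rng.choice(len(m_d) + 1, p=dprobs)
                if d == len(m_d):                # ... possibly with a brand-new dish
                    m_d.append(0)
                m_d[d] += 1
                table_counts.append(0)
                table_dish.append(d)
            table_counts[t] += 1
            served.append(table_dish[t])
        dishes.append(served)
    return dishes
```

Customers sharing a dish, within or across restaurants, belong to the same cluster; the counts n_{icd} and m_{id} of the main text are recoverable from the returned indices.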

Cluster Sizes Distributions
We study the distribution of the number of clusters in each group of observations (i.e., the number of distinct dishes served in restaurant i), as well as the global number of clusters (i.e., the total number of distinct dishes in the restaurant franchise).
Let us introduce a time index t to describe the customers' arrival process. At time t = 1, 2, ... and for each group i, O_{it} is the observation set and n_i(t) is the number of elements in O_{it}, i.e. the number of observations in group i at time t. The collection of all the group sizes is denoted by n(t) := (n_1(t), ..., n_I(t)). If n_i(t) = t for every i, each group has one new observation between t − 1 and t, and hence the total number of observations at time t is It. Different sampling rates can be assumed within our framework. For example, n_i(t) = t b_i for suitable integers b_i describes an asymmetric sampling scheme in which groups have different arrival rates b_i.
We find the exact finite-sample distribution of the number of clusters for given n(t) and n_i(t) when t < ∞. Some properties, such as the prior mean and variance, are discussed in order to provide some guidelines for setting HSSM parameters in the applications. We present some new asymptotic results when the number of observations goes to infinity, such that n(t) diverges to +∞ as t goes to +∞. The results extend existing asymptotic approximations for species sampling (Pitman (2006)) and for hierarchical normalized random measures (Camerlenghi et al. (2019)) to the general class of HSSMs. Finally, we provide a numerical study of the approximation accuracy.

Distribution of the Cluster Size Under the Prior
For every i = 1, ..., I and every t, we define the table and dish counts K_{i,t}, K_t, D_{i,t} and D_t of the latent hierarchy. By Theorem 1, for every fixed t, the laws of K_{i,t} and K_t are the same as those of the number of "active tables" in "restaurant" i and of the total number of "active tables" in the whole franchise, respectively. Analogously, the laws of D_{i,t} and D_t are the same as the laws of the number of dishes served in restaurant i and in the whole franchise, respectively. If H_0 is diffuse, then D_t and the number of distinct clusters in O_t have the same law, and also D_{i,t} and the number of clusters in group i follow the same law.
The distributions of D_t and D_{i,t} are derived in the following Proposition 3. For every n ≥ 1 and k = 1, ..., n, we define q_n(k) := P(|Π_n| = k), and analogously q_{0,n}(k) for a partition with EPPF q_0.
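For the Dirichlet process (the Gibbs-type case σ = 0), q_n(k) is available explicitly as q_n(k) = |s(n, k)| θ^k / (θ)_n, which makes it easy to check finite-sample quantities numerically. A sketch under these assumptions (the function name is illustrative):

```python
import numpy as np
from math import prod

def dp_block_distribution(n, theta):
    """q_n(k) = P(|Pi_n| = k) for a Dirichlet-process partition:
    q_n(k) = |s(n,k)| * theta^k / (theta)_n, with |s(n,k)| the unsigned
    Stirling numbers of the first kind."""
    # Tabulate |s(m,k)| via the recurrence |s(m+1,k)| = m*|s(m,k)| + |s(m,k-1)|
    s = np.zeros((n + 1, n + 1))
    s[0, 0] = 1.0
    for m in range(n):
        for k in range(m + 2):
            s[m + 1, k] = m * s[m, k] + (s[m, k - 1] if k > 0 else 0.0)
    rising = prod(theta + i for i in range(n))     # (theta)_n rising factorial
    return np.array([s[n, k] * theta**k / rising for k in range(n + 1)])
```

For θ = 1 the implied prior mean of the number of blocks equals the harmonic number Σ_{i=0}^{n−1} 1/(1 + i), a useful sanity check when calibrating parameters.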
One of the advantages of our framework is that the gSSS properties allow us to easily derive the distribution of the number of clusters when H_0 is not diffuse. Indeed, it can be deduced by considering the possible coalescences of latent clusters (due to ties in the i.i.d. sequence (φ_n)_n of Theorem 1) forming a true cluster. Let us denote with D̃_t and D̃_{i,t} the number of distinct clusters in O_t and O_{it}, respectively. The assumption of atomic base measures behind the HDP and the HPYP has been used in many studies, and some of its theoretical and computational implications have been investigated (e.g., see Nguyen (2016) and Sohn and Xing (2009)), whereas the implications of the use of mixed base measures are not yet well studied, especially in hierarchical constructions. In the following we state some new results for the case of a spike-and-slab base measure.
The probability distribution of D̃_t has the same expression as above, with D_t in place of D_{i,t} and n_t in place of n_{i,t}. Moreover, E[D̃_{i,t}] is available in closed form, and E[D̃_t] has an analogous expression with D_{i,t} replaced by D_t.
For a Gibbs-type EPPF with σ > 0, using results in Gnedin and Pitman (2005), we get

q_n(k) = V_{n,k} S_σ(n, k),

where V_{n,k} satisfies the partial difference equation in (2.4) and S_σ(n, k) is a generalized Stirling number of the first kind, defined as

S_σ(n, k) = (1 / (σ^k k!)) Σ_{i=0}^k (−1)^i binom(k, i) (−iσ)_n

for σ ≠ 0, and S_0(n, k) = |s(n, k)| for σ = 0, where |s(n, k)| is the unsigned Stirling number of the first kind; see Pitman (2006). See De Blasi et al. (2015) for an up-to-date review of Gibbs-type prior processes.
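The generalized Stirling numbers can be tabulated with the triangular recurrence S_σ(n+1, k) = S_σ(n, k−1) + (n − σk) S_σ(n, k), S_σ(0, 0) = 1, which for σ = 0 reduces to the recurrence of the unsigned Stirling numbers of the first kind (see Pitman (2006)). A sketch, with an illustrative function name:

```python
import numpy as np

def gen_stirling(n, sigma):
    """Tabulate S_sigma(m, k) for 0 <= k <= m <= n via the recurrence
    S_sigma(m+1, k) = S_sigma(m, k-1) + (m - sigma*k) * S_sigma(m, k),
    with S_sigma(0, 0) = 1. For sigma = 0 this yields the unsigned
    Stirling numbers of the first kind |s(m, k)|."""
    S = np.zeros((n + 1, n + 1))
    S[0, 0] = 1.0
    for m in range(n):
        for k in range(m + 2):
            S[m + 1, k] = (S[m, k - 1] if k > 0 else 0.0) + (m - sigma * k) * S[m, k]
    return S
```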
For the hierarchical PY process the distribution q_n(k) has the closed-form expression

q_n(k) = ( ∏_{i=1}^{k−1} (θ + iσ) / (θ + 1)_{n−1} ) S_σ(n, k),

while for the Gnedin model (Gnedin, 2010) the distribution q_n(k) is also available in closed form. In the supplementary material (Bassetti et al., 2019b), we provide a graphical illustration of the prior distributions presented above and a sensitivity analysis with respect to the prior parameters.

Asymptotic Distribution of the Cluster Size
An exchangeable random partition (Π_n)_{n≥1} has asymptotic diversity S if

|Π_n| / c_n → S almost surely as n → +∞,   (4.2)

for a positive random variable S and a suitable normalizing sequence (c_n)_{n≥1}. Asymptotic diversity generalizes the notion of σ-diversity; see Definition 3.10 in Pitman (2006).
In the following propositions, we use the (marginal) limiting behaviour (4.2) of the random partitions Π^(0) and Π^(i), with normalizing sequences a_n and b_n respectively, for suitable diverging sequences a_n and b_n. Moreover, we assume that a_n = n^{σ_0} L_0(n) and b_n = n^{σ_1} L_1(n), with σ_i ≥ 0 and L_i a slowly varying function, i = 0, 1.

Remark 2. Part (ii) extends to HSSMs with different group sizes, n_i(t), the results in Theorem 7 of Camerlenghi et al. (2019) for HNRMIs with groups of equal size. Both parts (i) and (ii) provide deterministic scalings of the diversities, in the spirit of Pitman (2006), and differently from Camerlenghi et al. (2019), where a random scaling is obtained.
Remark 3. Combining Propositions 4 and 6, one can obtain similar asymptotic results also for D̃_{i,t} and D̃_t. For instance, one can prove that, under the same assumptions of Proposition 4 and with a diffuse component H_C (as in the spike-and-slab case), for t → +∞ the variables D̃_{i,t} and D̃_t have the same limiting behaviour as D_{i,t} and D_t. The second general result describes the asymptotic behaviour of D_{i,t} and D_t in the presence of random partitions for which c_n = 1 for every n.
Proposition 7. If |Π^(i)_n| converges a.s. to a positive random variable K_i as n → +∞, then the limiting distributions of the cluster counts can be characterized for every k ≥ 1. Starting from Propositions 6 and 7, analytic expressions for the asymptotic distributions of D_{i,t} and D_t can be deduced for some special HSSMs.
As an example, consider the HGP and the HPYGP in Examples 3 and 4. If (Π_n)_n is a Gnedin partition, then |Π_n| converges almost surely to a random variable K (see Gnedin (2010) and Example S.3 in the supplementary material (Bassetti et al., 2019a)), and the asymptotic behaviour of the number of clusters can be derived from Proposition 7, as stated below.

Corollary 2 gives, among others, the limiting behaviour (ii) for HPYDP(θ_0, σ_0; θ_1) with σ_0 > 0, and (iv) for HDP(θ_0, θ_1). In Figure 2, we compare exact and asymptotic values (see Proposition 3 and Corollary 2, respectively) of the expected marginal number of clusters for the HSSMs in the PY family: HDP(θ_0; θ_1), HDPYP(θ_0; σ_1, θ_1), HPYP(σ_0, θ_0; σ_1, θ_1) and HPYDP(θ_0, σ_0; θ_1) (different rows of Figure 2). For each HSSM we consider n_i(t) increasing from 1 to 500 and different parameter settings (different columns and lines). For the HDP, the exact value (dashed lines) is well approximated by the asymptotic one (solid lines) for all sample sizes n_i(t) and different values of θ_i (gray and black lines in the left and right plots of panel (i)). For the HPYP, the results in panel (ii) show that there are larger differences when θ_i, i = 0, 1, are large and σ_0 and σ_1 are close to zero (left plot). The approximation is good for small θ_i (right plot) and improves slowly with increasing n_i(t) for smaller σ_i (gray lines in the right plot). In panels (iii) and (iv), for the HDPYP and HPYDP, there exist parameter settings where the asymptotic approximation is not satisfactory and does not improve when n_i(t) increases.
Our numerical results point out that the asymptotic approximation for both the PY and HPY families lacks accuracy for some parameter settings. Thus, the exact formula for the number of clusters should be used in applications when calibrating the parameters of the process.

Chinese Restaurant Franchise Sampler
Random measures and hierarchical random measures are widely used in Bayesian nonparametric inference (see Hjort et al. (2010) for an introduction) as prior distributions for the parameters of a given density function. In this context, a further stage is added to the hierarchical structure of Equation (3.7), involving an observation model

Y_{i,j} | ξ_{i,j} ind∼ f(·|ξ_{i,j}),

where f is a suitable kernel density.
The resulting model is an infinite mixture, which is the object of the Bayesian inference. In this framework, the posterior distribution is usually not tractable and Gibbs sampling is used to approximate the posterior quantities of interest. There are two main classes of samplers for posterior approximation in Bayesian nonparametrics: marginal (see Escobar (1994) and Escobar and West (1995)) and conditional (Walker (2007), Papaspiliopoulos and Roberts (2008), Kalli et al. (2011)) samplers; see also Favaro and Teh (2013) for an up-to-date review. In this section, we extend the marginal sampler for HDP mixtures (see Teh et al. (2006), Teh (2006) and Teh and Jordan (2010)) to our general class of HSSMs. We present the sampler for the case where kernel and base measure are conjugate. When this assumption is not satisfied, our sampling method can be easily modified following the auxiliary variable samplers of Neal (2000) and Favaro and Teh (2013).

Figure 2: Exact (dashed lines) and asymptotic (solid lines) expected marginal number of clusters E(D_{i,t}) when n_i(t) = 1, ..., 500 for different HSSMs.
Following the notation in Section 3.2, we consider the data structure {Y_{i,j}, c_{i,j} : i = 1, ..., I, j = 1, ..., n_i}. Denote with the superscript ¬ij the counts and sets in which customer j in restaurant i is removed and, analogously, with ¬ic the counts and sets in which all the customers at table c of restaurant i are removed. We denote with p(X) the density of the random variable X.
The proposed Gibbs sampler simulates iteratively the elements of c and d from their full conditional distributions, where the latent variables φ_d are integrated out analytically. In sampling the latent variable c, we need to sample jointly [c, d*] and, since d is a function of [c, d*], this also gives a sample for d. In order to improve the mixing, we re-sample d given c in a second step. In summary, the sampler iterates for i = 1, ..., I according to the following steps: (i) sample [c_{i,j}, d*_{i,j}] from its full conditional distribution (see Equation (S.32) in the supplementary material (Bassetti et al., 2019a)), for j = 1, ..., n_{i••}; (ii) (re)-sample d_{i,c} from p(d_{i,c} | Y, c, d^{¬ic}) (see Equation (S.34) in the supplementary material (Bassetti et al., 2019a)), for c = 1, ..., m_{i•}.
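To convey the structure of a marginal (collapsed) update, the sketch below implements a single-restaurant, non-hierarchical analogue of step (i) for a DP normal mixture with known unit variances and N(0, 1) base measure, in the style of Algorithm 3 of Neal (2000). It omits the table/dish bookkeeping of the full HSSM sampler, and every constant is an illustrative assumption.

```python
import numpy as np

def log_marginal(y, mu0=0.0, tau2=1.0, s2=1.0):
    # log predictive density of y ~ N(mu, s2) with mu ~ N(mu0, tau2) integrated
    # out, i.e. y ~ N(mu0, s2 + tau2) by normal-normal conjugacy.
    v = s2 + tau2
    return -0.5 * (np.log(2.0 * np.pi * v) + (y - mu0) ** 2 / v)

def gibbs_sweep(y, z, theta, rng):
    # One marginal Gibbs sweep for a DP(theta) normal mixture: remove observation
    # j, then reseat it at cluster c with weight n_c * p(y_j | data in c), or at
    # a new cluster with weight theta * p(y_j), cluster means integrated out.
    n = len(y)
    for j in range(n):
        keep = np.delete(np.arange(n), j)
        z_j = z[keep]
        labels = np.unique(z_j)
        logw = []
        for c in labels:
            yc = y[keep][z_j == c]
            nc = len(yc)
            tau2_c = 1.0 / (1.0 + nc)       # posterior variance of mu (s2 = tau2 = 1)
            mu_c = tau2_c * yc.sum()        # posterior mean of mu
            logw.append(np.log(nc) + log_marginal(y[j], mu_c, tau2_c))
        logw.append(np.log(theta) + log_marginal(y[j]))
        w = np.exp(np.array(logw) - max(logw))
        w /= w.sum()
        pick = rng.choice(len(w), p=w)
        z[j] = labels[pick] if pick < len(labels) else z.max() + 1
    return z
```

In the full HSSM sampler, the CRP weights n_c and θ above are replaced by the general predictive weights ω and ν of the two EPPFs, and the dish reallocation step (ii) is added.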
A detailed description of the Gibbs sampler is given in the supplementary material (Bassetti et al., 2019a).

Simulation Experiments
We compare some of the HSSMs described in Section 3 on synthetic data generated under different assumptions on the true model. In the first experimental setting, we consider three groups of observations sampled from three-component normal mixtures with common mixture components, but different mixture probabilities, for instance

Y_{i,j} iid∼ 0.3N(−5, 1) + 0.3N(0, 1) + 0.4N(5, 1),   j = 1, ..., 100.
The parameters of the different prior processes are chosen such that the marginal expected number of clusters is E(D_{i,t}) = 5 and its variance is between 1.97 and 3.53, assuming n_i(t) = n_i = 50 with t = 1, for i = 1, ..., 3.
In the second and third experimental settings, we consider ten groups of observations from two- and three-component normal mixtures, respectively, with one common component across groups. In the second experiment, the group sizes n_i(t) increase from 5 to 100, with t = 1 and i = 1, ..., 10. In the third setting, we assume a smaller weight for the common component and a larger number of group-specific components. The parameters of the prior processes are chosen such that the marginal expected value is E(D_{i,t}) = 10 and the variance is between 4.37 and 6.53, assuming n_i(t) = 20 with t = 1, for i = 1, ..., 10.
For each setting, we generate 50 independent datasets and run the marginal sampler described in Section 5 for 6,000 iterations to approximate the posterior predictive distribution and the posterior distribution of the clustering variables c and d. We discard the first 1,000 iterations of each run. All inferences are averaged over the 50 independent runs.
We compare the models by evaluating their co-clustering errors and predictive abilities (see Favaro and Teh (2013) and Dahl (2006)). We denote by d(m) = (d_1(m), . . ., d_n(m)) the vector of allocation variables for all the observations, sampled at the Gibbs iteration m = 1, . . ., M, where M is the number of Gibbs iterations. The co-clustering matrix of posterior pairwise probabilities of joint classification is estimated by

P̂_{lk} = (1/M) Σ_{m=1}^{M} δ_{d_l(m)}(d_k(m)).

Let d_0 be the true value of the allocation vector d. The co-clustering error can be measured as the average L_1 distance between the true pairwise co-clustering matrix, δ_{d_{0l}}(d_{0k}), and the estimated co-clustering probability matrix, P̂_{lk}, i.e.

CN = (2 / (n(n − 1))) Σ_{1 ≤ l < k ≤ n} |δ_{d_{0l}}(d_{0k}) − P̂_{lk}|.

An alternative measure can be defined by using the Hamming norm and the thresholded co-clustering matrix, I(P̂_{lk} > 0.5):

CN* = (2 / (n(n − 1))) Σ_{1 ≤ l < k ≤ n} |δ_{d_{0l}}(d_{0k}) − I(P̂_{lk} > 0.5)|.

Both accuracy measures, CN and CN*, attain 0 in the absence of co-clustering error and 1 when the co-clustering is completely mispredicted.
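The estimator P̂ and the two accuracy measures can be computed directly from the stored Gibbs draws of the allocation vector; the sketch below follows the definitions above, with function names of our own choosing.

```python
import numpy as np

def coclustering(draws):
    """Estimate the pairwise co-clustering matrix from M draws of the
    allocation vector d: entry (l, k) is the fraction of draws with d_l == d_k."""
    draws = np.asarray(draws)            # shape (M, n)
    M, n = draws.shape
    P = np.zeros((n, n))
    for d in draws:
        P += (d[:, None] == d[None, :]).astype(float)
    return P / M

def cn_errors(d0, P):
    """Average L1 co-clustering error CN and its thresholded variant CN*,
    averaged over the n(n-1)/2 distinct pairs l < k."""
    d0 = np.asarray(d0)
    true = (d0[:, None] == d0[None, :]).astype(float)
    iu = np.triu_indices(len(d0), k=1)   # indices of the pairs l < k
    cn = np.abs(true[iu] - P[iu]).mean()
    cn_star = np.abs(true[iu] - (P[iu] > 0.5)).mean()
    return cn, cn_star
```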
The L_1 distance between the true group-specific density of a new observation, f(Y_{i,n_i+1}), and the corresponding posterior predictive density, p(Y_{i,n_i+1}|Y), can be used to define the predictive score SC. Finally, we consider the posterior median, q̂_{0.5}(D), and variance, V̂(D), of the total number of clusters D.
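The predictive score requires an L_1 distance between two univariate densities, which can be approximated numerically; the integration bounds, grid size and helper names below are illustrative choices of ours.

```python
import math

def l1_distance(f, g, lo=-10.0, hi=10.0, m=20001):
    """Approximate the L1 distance between two densities on [lo, hi]
    with the trapezoidal rule on a grid of m points."""
    h = (hi - lo) / (m - 1)
    total = 0.0
    for j in range(m):
        x = lo + j * h
        w = 0.5 if j in (0, m - 1) else 1.0   # trapezoidal endpoint weights
        total += w * abs(f(x) - g(x)) * h
    return total

def normpdf(x, mu, sd):
    """Density of a N(mu, sd^2) random variable."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
```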
The results in Table 1 point to similar co-clustering accuracy across HSSMs and experiments. In the first and second experimental settings, the HPYP and HDPYP have significantly smaller co-clustering errors, CN and CN*. As regards the predictive score SC, the seven HSSMs behave similarly in the three-restaurant experiment (panel (a)), whereas in the two-component experiment the HDPYP performs slightly better than the other HSSMs. In the presence of large heterogeneity across restaurants (third setting), the HGPYP performs best according to both the co-clustering norm and the predictive score. A comparison between the HPYP and the HGPYP shows that these results do not depend on the number of observations and can be explained by a better fit of the tails and dispersion of the group-specific densities provided by the HGPYP. For illustrative purposes, Figure 3 compares the log-predictive scores of the two models for an increasing number of observations. In the first setting, the posterior number of clusters, q̂_{0.5}(D), for all the HSSMs (panel (a) in Table 1) is close to the true value of 3 mixture components. When the number of restaurants increases (second and third settings), the HPYP tends to introduce extra clusters, leading to a larger posterior median and variance of the number of clusters (q̂_{0.5}(D) and V̂(D) in Table 1). Conversely, the HGPYP has a smaller dispersion of the number of clusters than the HPYP.
The results for the third experiment suggest that the HGPYP performs better when the groups of observations are heterogeneous. Moreover, as the number of observations increases, the HGPYP provides a consistent estimate of the true number of components (Figure 3). In conclusion, our experiments indicate that using the Pitman-Yor process at some stage of the hierarchy may lead to better accuracy. The HDPYP performed reasonably well in all our experiments, in line with previous findings on hierarchical Dirichlet and Pitman-Yor processes for topic models (see Du et al. (2010)). Using the Gnedin process at the top of the hierarchy may also lead to better accuracy when the groups of observations are heterogeneous. Moreover, when the researcher is interested in a consistent estimate of the number of components, the HGPYP should be preferred. Further details and results are in the supplementary material (Bassetti et al., 2019b).

Real Data Application
Bayesian nonparametrics is used in economic time series modelling to capture observation clustering effects (e.g., see Hirano, 2002; Griffin and Steel, 2011; Bassetti et al., 2014; Kalli and Griffin, 2018; Billio et al., 2019). In this paper, we consider the industrial production index, an important indicator of macroeconomic activity used in business cycle analysis (see Stock and Watson (2002)). One of the most relevant issues in this field concerns the classification of observations, allowing for different parameter values in periods (called regimes) of recession and expansion.
The data have been previously analysed by Bassetti et al. (2014) and contain the seasonally and working-day adjusted industrial production indexes (IPI) at a monthly frequency from April 1971 to January 2011 for both the United States (US) and the European Union (EU). We generate autoregressive-filtered IPI quarterly growth rates by taking the residuals of a vector autoregressive model of order 4.
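The autoregressive filtering step can be reproduced, for instance, by a least-squares VAR fit; the following sketch (numpy only, function name ours) returns the residuals that would serve as filtered growth rates.

```python
import numpy as np

def var_residuals(Y, p=4):
    """Residuals of a VAR(p) fitted by least squares with an intercept.
    Y has shape (T, k); returns residuals of shape (T - p, k)."""
    T, k = Y.shape
    # regressor matrix: intercept plus p lags of all k series
    X = np.hstack([np.ones((T - p, 1))] +
                  [Y[p - l - 1:T - l - 1] for l in range(p)])
    B, *_ = np.linalg.lstsq(X, Y[p:], rcond=None)
    return Y[p:] - X @ B
```

The intercept column guarantees that the residuals are (numerically) mean zero, so the filtered series is centred before the nonparametric analysis.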
We follow a Bayesian nonparametric approach based on an HSSM prior for the estimation of the number of regimes or structural breaks. Based on the simulation results, we focus on the HPYP, with hyperparameters (θ_0, σ_0) = (1.2, 0.2) and (θ_1, σ_1) = (2, 0.2), and on the HGPYP, with hyperparameters (γ_0, ζ_0) = (14.7, 130) and (θ_1, σ_1) = (2, 0.23), such that the prior mean of the number of clusters is 5. The main results of the nonparametric inference can be summarized through the implied data clustering (panel (a) of Figure 4) and the marginal, total and common posterior number of clusters (panel (b)).
One of the most striking features of the co-clustering matrix is the vertical and horizontal black lines in the first and second blocks of the minor diagonal. They correspond to observations of a single country which belong to the same cluster, that is, to the same phase of the business cycle.
Another feature that motivates the use of HSSMs is given by the black horizontal and vertical lines in the two main diagonal blocks. They correspond to observations of the two countries allocated to common clusters. The posterior distribution of the total number of clusters (panel b.1) suggests that at least three clusters should be used in a joint modelling of the US and EU business cycles. The larger dispersion of the marginal number of clusters for the EU (b.3) with respect to the US (b.2) confirms the evidence in Bassetti et al. (2014) of larger heterogeneity in the EU cycle. Finally, we find evidence (panel b.4) of common clusters of observations between the EU and US business cycles.

Supplementary Material
Supplementary Material A to "Hierarchical Species Sampling Models" (DOI: 10.1214/19-BA1168SUPPA; .pdf). This document contains the derivations of the results of the paper and a detailed analysis of generalized species sampling (with a general base measure). It also describes the Chinese Restaurant Franchise Sampler for Hierarchical Species Sampling Mixtures.

Proposition 4. Let H̄_0(d|k) (for 1 ≤ d ≤ k) be the probability of observing exactly d distinct values in the vector (φ_1, . . ., φ_k), where the φ_n are i.i.d. from H_0. Then,

P{D̂_{i,t} = d} = Σ_{k=d}^{n_i(t)} H̄_0(d|k) P{D_{i,t} = k}, for d = 1, . . ., n_i(t).

The distribution of D̂_t has the same expression as above, with D_t in place of D_{i,t} and n(t) in place of n_i(t). If H_0 is diffuse, then P{D̂_{i,t} = d} = P{D_{i,t} = d} and P{D̂_t = d} = P{D_t = d}, for every d ≥ 1.
(i = 0, . . ., I), to obtain the asymptotic distribution of D_{i,t} and D_t assuming c_n = n^σ L(n), with L slowly varying. The first general result deals with HSSMs where Π_n = Π_n^{(i)} satisfies (4.2) for every i = 1, . . ., I and c_n → +∞, and hence the number of clusters |Π_n^{(i)}| diverges to +∞.

Proposition 6. Assume that Π^{(0)} and Π^{(i)} (for i = 1, . . ., I) are independent exchangeable random partitions such that |Π_n^{(0)}|/a_n (respectively |Π_n^{(i)}|/b_n for i = 1, . . ., I) converges almost surely to a strictly positive random variable D^{(0)}.

Let c = [c_i : i ∈ J], with c_i = [c_{i,j} : j = 1, . . ., n_{i••}], d = [d_{i,c} : i ∈ J, c = 1, . . ., m_{i•}] and φ = [φ_d : d ∈ D], and, with a slight abuse of notation, write [c, d] ∼ HSSM to denote the distribution of the labels [c, d] obtained from an HSSM as in (3.7). If we define d*_{i,j} = d_{i,c_{i,j}} and d* = [d*_{i,j} : i ∈ J, j = 1, . . ., n_{i••}], then [c, d] and [c, d*] contain the same amount of information: indeed, d* is a function of d and c, while d is a function of d* and c. From now on, we denote by Y = [Y_{i,j} : i ∈ J, j = 1, . . ., n_{i••}] the set of observations. If f and H are conjugate, the Chinese Restaurant Franchise Sampler of Teh et al. (2006) can be generalized, and a new sampler can be obtained for our class of models.

Figure 3: Top-left: log-predictive score for the right tail (above the 97.5% quantile of the true distribution). Top-right: posterior mean as the number of customers increases, for the HGPYP (solid) and the HPYP (dashed). Bottom: posterior number of clusters for the HPYP (left) and the HGPYP (right). In this setting the true number of clusters is 11.

Figure 4: (a) Co-clustering matrix for the US (bottom-left block) and EU (top-right block) business cycles, and cross-co-clustering (main diagonal blocks) between the US and the EU for the HPYP. (b) Posterior number of clusters: total (b.1), marginal for the US (b.2) and the EU (b.3), and common (b.4) for the HPYP (solid line) and the HGPYP (dashed line).
If ξ_{i,t} is sampled from G_{i,t}, we set ξ*_{i,c} = φ_d, let d_{i,c} = d for the chosen d, and increment m_{•c} by one. If ξ_{i,t} is sampled from H_0, then we increment D by one and set φ_D = ξ_{i,t}, ξ*_{i,c} = ξ_{i,t} and d_{i,c} = D. In both cases, we increment m_{••} by one. (S3) Having sampled ξ_{i,t} with t = n_{i••} in the previous step, set c_{i,j}, for i ∈ J and j = 1, . . ., n_{i••}, d_{i,c}, for i ∈ J and c = 1, . . ., m_{i•}, and φ_d, for d ∈ D, where Y_{i,j} is the j-th observation in the i-th group, n_{i••} = n_i is the total number of observations in the i-th group, and J = {1, . . ., I} is the set of group indexes. The latent variable c_{i,j} denotes the table at which the j-th "customer" of "restaurant" i sits, and d_{i,c} the index of the "dish" served at table c in restaurant i. The random variables φ_d are the "dishes", and D = {d : d = d_{i,c} for some i ∈ J and c ∈ {1, . . ., m_{i•}}} is the set of indexes of the served dishes. Assuming that the distribution H of the atoms φ_d has density h and the observations Y_{i,j} have kernel density f(·|·), our hierarchical infinite mixture model is

Y_{i,j} | c, d, φ ind∼ f(· | φ_{d_{i,c_{i,j}}}), i ∈ J, j = 1, . . ., n_{i••}.

Table 1: Model accuracy for seven HSSMs.