The semi-hierarchical Dirichlet Process and its application to clustering homogeneous distributions

Assessing homogeneity of distributions is an old problem that has received considerable attention, especially in the nonparametric Bayesian literature. To this effect, we propose the semi-hierarchical Dirichlet process, a novel hierarchical prior that extends the hierarchical Dirichlet process of Teh et al. (2006) and that avoids the degeneracy issues of nested processes recently described by Camerlenghi et al. (2019a). We go beyond the simple yes/no answer to the homogeneity question and embed the proposed prior in a random partition model; this procedure allows us to give a more comprehensive response to the above question and, in fact, to find groups of populations that are internally homogeneous when I ≥ 2 such populations are considered. We study theoretical properties of the semi-hierarchical Dirichlet process and of the Bayes factor for the homogeneity test when I = 2. Extensive simulation studies and applications to educational data are also discussed.


Introduction
The study and development of random probability measures in models that take into account data that are not fully exchangeable has sparked considerable interest in the Bayesian nonparametric literature. We consider here the notion of partial exchangeability in the sense of de Finetti (see de Finetti, 1938; Diaconis, 1988), which straightforwardly generalizes the notion of an exchangeable sequence of random variables to the case of invariance under a restricted class of permutations. See also Camerlenghi et al. (2017) and references therein. In particular, our focus is on assessing whether two or more populations (or groups) of random variables can be considered exchangeable rather than partially exchangeable, that is, whether or not they arose from a common population/random distribution.
Partial exchangeability can thus be conceptualized as invariance of the joint law of the data under the class of all permutations acting on the indices within each of the samples. Here and from now on, the distribution of a random element y is denoted by L(y).
The previous setting can be immediately extended to the case of I different populations or groups. By de Finetti's representation theorem (see the proof in Regazzini, 1991), partial exchangeability for the array of I sequences of random variables (y_{11}, y_{12}, ...; y_{21}, y_{22}, ...; ...; y_{I1}, y_{I2}, ...) is equivalent to

$$P(y_{ij} \in A_{ij},\ j = 1, \dots, N_i,\ i = 1, \dots, I) = \int_{\mathcal{P}_Y^I} \prod_{i=1}^{I} \prod_{j=1}^{N_i} p_i(A_{ij}) \, Q(dp_1, \dots, dp_I),$$

for any N_1, ..., N_I ≥ 1 and Borel sets {A_ij} for j = 1, ..., N_i and i = 1, ..., I. In this case, de Finetti's measure Q is defined on the I-fold product space $\mathcal{P}_Y^I = \mathcal{P}_Y \times \mathcal{P}_Y \times \cdots \times \mathcal{P}_Y$, and (p_1, p_2, ..., p_I) ∼ Q. The whole joint sequence of random variables is exchangeable if and only if Q gives probability 1 to the measurable set $S = \{(p_1, p_2, \dots, p_I) \in \mathcal{P}_Y^I : p_1 = p_2 = \cdots = p_I\}$.
Hence, partial exchangeability of data from different groups (or related studies) is a convenient context to analyze departures from exchangeability. While homogeneity of groups here amounts to full exchangeability, departures from this case may follow different directions, including independence of the population distributions p_1, p_2, ..., p_I. However, it could be interesting to investigate other types of departures from exchangeability beyond independence. The main goal here is to build a prior Q for (p_1, p_2, ..., p_I) that is able to capture a wider range of behaviors, not restricting the analysis to assessing equality or independence among p_1, p_2, ..., p_I. In the simplest case of I = 2, we just compare two distributions/populations, but we aim here at extending this notion to I > 2 groups. In particular, we address the following issue: if the answer to the question of homogeneity across all these groups is negative, a natural question immediately arises, namely, can we assess the existence of homogeneity within certain subsets of populations? In other words, we would like to find clusters of internally homogeneous populations.
Vectors of dependent random distributions appeared first in Cifarelli and Regazzini (1978), but it was in MacEachern (1999) where a large class of dependent Dirichlet processes was introduced, incorporating dependence on covariates through the atoms and/or weights of the stick-breaking representation. Following this line, De Iorio et al. (2004) proposed an ANOVA-type dependence for the atoms. These last two papers have generated an intense stream of research which is not our focus here. For a review of such constructions, see Quintana et al. (2020).
Our approach instead constructs a prior that explicitly considers a departure from exchangeability. Other authors have considered similar problems. Müller et al. (2004) and Lijoi et al. (2014a) constructed priors for the population distributions by contaminating them with a common component. See also Hatjispyros et al. (2011), Hatjispyros et al. (2016) and Hatjispyros et al. (2018) for related models with increasing levels of generalization. Several references focus on testing homogeneity across groups of observations. Ma and Wong (2011) and Soriano and Ma (2017) propose the coupling optional Pólya tree prior, which jointly generates two dependent random distributions through a random-partition-and-assignment procedure similar to Pólya trees. The former paper considers the testing problem from a global point of view, while the latter takes a local perspective on the two-sample hypothesis, detecting high-resolution local differences. Bhattacharya and Dunson (2012) propose a Dirichlet process (DP) mixture model for testing whether there is a difference in distributions between groups of observations on a manifold. Both Chen and Hanson (2014) and Holmes et al. (2015) consider the two-sample testing problem, using a Pólya tree prior for the common distribution under the null, while the model for the alternative hypothesis assumes that the two population distributions are independent draws from the same Pólya tree prior. Their approaches differ in the way they specify the Pólya tree prior. Gutiérrez et al. (2019) consider a related problem, proposing a Bayesian nonparametric strategy to test for differences between a control group and several treatment regimes. Pereira et al. (2020) extend this idea to testing equality of distributions of paired samples, with a model for the joint distribution of both samples defined as a mixture of DPs with a spike-and-slab prior specification for its base measure.
Another traditional (and fruitful) approach for modeling data arising from a collection of groups or related studies involves the construction of hierarchical random prior probability measures. One of the first such examples in the BNP literature is the well-known hierarchical DP mixture introduced in Teh et al. (2006). Generalizations beyond the DP case are currently an active area of research, as witnessed by a series of recent papers dealing with various such hierarchical constructions; these include Camerlenghi et al. (2019b), Argiento et al. (2019) and Bassetti et al. (2019). See the discussion below.
Our first contribution is the introduction of a novel class of nonparametric priors that, just as discussed in Camerlenghi et al. (2019a), avoids the degeneracy issue of the nested Dirichlet process (NDP) of Rodríguez et al. (2008) that arises from the presence of shared atoms across populations. Indeed, Camerlenghi et al. (2019a) showed that under the NDP, if two populations share at least one common latent variable in the mixture model, then the model identifies the corresponding distributions as completely equal. To overcome the degeneracy issue, they resort to a latent nested construction in terms of normalized random measures that adds a shared random measure to draws from the NDP. Instead, we use a variation of the hierarchical DP (HDP), that we term the semi-HDP, but where the baseline distribution is itself a mixture of a DP and a non-atomic measure. We will show that this procedure solves the degeneracy problem as well. While relying on a different model, Lijoi et al. (2020a) also propose to build on the HDP, combining it with the NDP, to overcome the degeneracy issue of nested processes.
Our second contribution is that the proposed model overcomes some of the practical and applied limitations of the latent nested approach by Camerlenghi et al. (2019a). As pointed out in Beraha and Guglielmi (2019), the latent nested approach becomes computationally burdensome in the case of I > 2 populations. In contrast, implementing posterior inference for the semi-HDP prior does not require restrictions on I. We discuss in detail how to carry out posterior inference in the context of hierarchical models based on the semi-HDP.
A third contribution of this article is that we combine the proposed semi-HDP prior with a random partition model that allows different populations to be grouped in clusters that are internally homogeneous, i.e. arising from the same distribution. See an early discussion of this idea in the context of contingency tables in Quintana (1998). The far more general extension we aim for here is also useful from the applied viewpoint of finding out which, if any, of the I populations are internally homogeneous when homogeneity of the whole set does not hold. For the purpose of assessing global exchangeability, one may resort to discrepancy measures (Gelman et al., 1996); see also Catalano et al. (2021). In our approach, homogeneity corresponds to a point-null hypothesis about a discrete vector parameter, as we adopt a "larger" model for the alternative hypothesis within which homogeneity is nested. We discuss the specific case of adopting Bayes factors for the proposed test within the partial exchangeability framework. We show that the Bayes factor for this test is immediately available, and derive some of its theoretical properties.
The rest of this article is organized as follows. Section 2 gives some additional background that is relevant for later developments, presents the semi-HDP prior (Section 2.2) and, in particular, describes a "food court of Chinese restaurants with private and shared areas" metaphor (Section 2.3). Section 3 studies several theoretical properties of the semi-HDP, such as its support, moments, the corresponding partially exchangeable partition probability function (in a particular case) and, especially, how the degeneracy issue is overcome under this setting. Section 3.3 specializes the discussion to the related issue of testing homogeneity when I = 2 populations are present, and we study properties of the Bayes factor for this test. Section 4 describes a computational strategy to implement posterior inference for the class of hierarchical models based on our proposed semi-HDP prior. Extensive simulations, with I = 2, 4 and 100 populations, are presented in Section 5. An application to an educational data set is discussed in Section 6. The article concludes with a discussion in Section 7. An appendix collects the proofs of the theoretical results and a discussion of consistency for the Bayes factor in the case of I = 2 homogeneous populations. Code for posterior inference has been implemented in C++ and is available as part of the BayesMix library (https://github.com/bayesmix-dev/bayesmix).

Assessing Exchangeability within a Partially Exchangeable Framework
While exchangeability can be explored in greater generality, for clarity of exposition we set up our discussion in the context of continuous univariate responses; extensions to, e.g., multivariate responses can be straightforwardly accommodated in our framework.

A common home for exchangeability and partial exchangeability
A flexible nonparametric model for each group can be constructed by assuming a mixture, where the mixing group-specific distribution G_i is a random discrete probability measure (r.p.m.), i.e.

$$p_i(y) = \int_\Theta k(y \mid \theta)\, G_i(d\theta), \qquad i = 1, \dots, I, \tag{1}$$

where k(· | θ) is a density in Y for any θ ∈ Θ, and G_i is, for example, a DP on Θ. Note that, with a little abuse of notation, p_i in (1) and in the rest of the paper denotes the conditional population density of group i (before, p_i represented the population distribution of group i in de Finetti's theorem). In what follows, we will always assume that the parameter space Θ is contained in R^p for some positive integer p, endowed with its Borel σ-field B(Θ). Using the well-known alternative representation of the mixture in terms of latent variables, the previous expression is equivalent to assuming that for any i,

$$\theta_{ij} \mid G_i \overset{iid}{\sim} G_i, \qquad y_{ij} \mid \theta_{ij} \overset{ind}{\sim} k(\cdot \mid \theta_{ij}), \qquad j = 1, \dots, N_i. \tag{2}$$

In this case, partial exchangeability of the observations (y_ij)_ij is equivalent to partial exchangeability of the latent variables (θ_ij)_ij. Hence exchangeability of the observations (y_ij)_ij is equivalent to the statement G_1 = G_2 = ... = G_I with probability one.
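To fix ideas on the latent-variable representation above, the following sketch draws one group's data from a DP mixture via the Pólya-urn (Chinese restaurant) representation of the group-specific mixing measure; the Gaussian kernel and the N(0, 4) base measure are illustrative choices, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dp_mixture_group(n, alpha=1.0):
    """Draw n observations y_i1, ..., y_in from a DP mixture using the
    Polya-urn representation of the group-specific mixing measure G_i.
    Illustrative choices: Gaussian kernel k(y | theta) = N(y | theta, 1)
    and a N(0, 4) base measure for the atoms."""
    thetas = []
    for j in range(n):
        # reuse a previously drawn latent value with prob j / (j + alpha),
        # otherwise draw a fresh atom from the base measure
        if thetas and rng.uniform() < j / (j + alpha):
            thetas.append(thetas[rng.integers(j)])
        else:
            thetas.append(rng.normal(0.0, 2.0))
    thetas = np.asarray(thetas)
    return rng.normal(thetas, 1.0), thetas  # y_ij | theta_ij ~ k(. | theta_ij)

y, theta = sample_dp_mixture_group(200)
```

Because the latent θ's are resampled from past values with positive probability, the 200 draws share far fewer than 200 unique atoms, which is exactly the ties structure exploited below.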
In the next subsection we develop one of the main contributions of this paper, namely, the construction of a prior distribution π(G_1, ..., G_I) such that there is positive prior probability that G_1 = G_2 = ... = G_I, while avoiding the degeneracy issues discussed in Camerlenghi et al. (2019a) that would arise if we assumed that (G_1, ..., G_I) were distributed as the NDP of Rodríguez et al. (2008). Briefly, (G_1, ..., G_I) is distributed as the NDP if

$$G_i \mid \tilde{Q} \overset{iid}{\sim} \tilde{Q}, \qquad \tilde{Q} = \sum_{j \ge 1} \pi_j \delta_{G^*_j}, \qquad G^*_j = \sum_{h \ge 1} w_{jh} \delta_{\theta^*_{jh}},$$

i.e., the independent atoms $G^*_j$ of $\tilde{Q}$ are all drawn from a DP on Θ, with locations $\theta^*_{jh} \overset{iid}{\sim} G_{00}$, a probability measure on Θ, and α, γ > 0. The weights (π_j)_j and (w_{jh})_h, j = 1, 2, ..., are independently obtained from the usual stick-breaking construction, with parameters α and γ, respectively. Here $D_{\gamma G_{00}}$ denotes the Dirichlet measure, i.e. the distribution of a r.p.m. that is a DP with measure parameter γG_00. However, nesting discrete random probability measures produces degeneracy to the exchangeable case. As mentioned in Section 1, Camerlenghi et al. (2019a) showed that the posterior distribution degenerates to the exchangeable case whenever a shared component is detected, i.e., the NDP does not allow for sharing clusters among non-homogeneous populations. The problem is shown to affect any construction that uses nesting, and not just the NDP.
To overcome the degeneracy issue, while retaining flexibility, Camerlenghi et al. (2019a) proposed the so-called latent nested nonparametric priors. These models involve a shared random measure that is added to the draws from a nested random measure, hence accommodating shared atoms. See also the discussion by Beraha and Guglielmi (2019). There are two key ideas in their model: (i) nesting discrete random probability measures, as in the case of the NDP, and (ii) contaminating the population distributions with a common component, as in Müller et al. (2004) and Lijoi et al. (2014b). The contamination yields dependence among the population-specific random probability measures and avoids the degeneracy issue pointed out by the authors, while the nesting accounts for testing homogeneity in multiple-sample problems. Their approach, however, becomes computationally burdensome in the case of I > 2 populations, and it is not clear how to extend their construction to allow for the desired additional analysis, i.e. assessing which, if any, of the I populations are internally homogeneous when homogeneity of the whole set does not hold.

The Model
We present now a hierarchical model that allows us to assess homogeneity, while avoiding the undesired degeneracy issues, and which further enables us to construct a grouping of populations that are internally homogeneous. To do so we create a hierarchical representation of distributions that emulates the behavior arising from an exchangeable partition probability function (EPPF; Pitman, 2006) such as the Pólya urn. The main difference with previous proposals to overcome degeneracy is that we now allow for different populations to arise from the same distribution, while simultaneously incorporating an additional mechanism for populations to explicitly differ from each other. Denote [I] = {1, ..., I}. A partition S_1, ..., S_k of [I] can be described by cluster assignment indicators c = (c_1, ..., c_I) with c_i = ℓ iff i ∈ S_ℓ. Assume this partition arises from a given EPPF. We introduce the following model for the latent variables in a mixture model such as (2). Let y_i := (y_{i1}, ..., y_{iN_i}), for i = 1, ..., I. We assume that y_1, ..., y_I, given all the population distributions F_1, ..., F_I and c, are independent, with

$$y_{ij} \mid c, F_1, \dots, F_I \overset{ind}{\sim} \int_\Theta k(\cdot \mid \theta)\, F_{c_i}(d\theta), \tag{3}$$
$$(c_1, \dots, c_I) \sim \pi_c, \tag{4}$$
$$F_1, \dots, F_I \mid G \overset{iid}{\sim} \mathrm{DP}(\alpha P), \tag{5}$$
$$P = \kappa G_0 + (1 - \kappa) G, \tag{6}$$
$$G \sim \mathrm{DP}(\gamma G_{00}), \tag{7}$$

where α, γ > 0 and κ ∈ [0, 1]. Thus the role of the population mixing distribution G_i in (1), or, equivalently, in (2), is now played by F_{c_i}. Observe that F_1, ..., F_I in (5) play a role similar to the cluster-specific parameters in more standard mixture models. Consider for example a case where I = 4 and c = (1, 2, 3, 1). Under the above setting, F_1, F_2, F_3 define a model for three different distributions, so that populations 1 and 4 share a common mixing distribution, and F_4 is never employed. Equation (5) means that, conditionally on G, each F_k is an independent draw from a DP prior with mean parameter P (and total mass α), i.e. F_k is a discrete r.p.m.
on Θ ⊂ R^p for some positive integer p, with

$$F_k = \sum_{h \ge 1} w_{kh}\, \delta_{\theta^*_{kh}},$$

where for any k the weights are independently generated from a stick-breaking process: $w_{k1} = \beta_{k1}$ and $w_{kh} = \beta_{kh} \prod_{j < h} (1 - \beta_{kj})$ for h = 2, 3, ..., with $\beta_{kh} \overset{iid}{\sim} \mathrm{Beta}(1, \alpha)$. We assume the centering measure P in (6) to be a draw G from a DP prior with centering measure G_00, contaminated with a fixed probability measure G_0. Both G_0 and G_00 are assumed to be absolutely continuous (and hence non-atomic) probability measures defined on (Θ, B(Θ)).
Specifically, $G = \sum_{h \ge 1} p_h \delta_{\tau_h}$, where the weights (p_h)_h follow a stick-breaking construction with parameter γ, and $\tau_h \overset{iid}{\sim} G_{00}$ are independent atom locations. The model definition is completed by specifying π_c(c_1, ..., c_I). We assume that the c_i's are (conditionally) i.i.d. draws from a categorical distribution on [I] with weights ω = (ω_1, ..., ω_I), where the elements of ω are non-negative and constrained to add up to 1. A convenient prior for ω is a finite-dimensional Dirichlet distribution with parameter η = (η_1, ..., η_I). Observe that the distributions F_{c_1}, ..., F_{c_I} allow us to cluster populations, so that there are at most I clusters and consequently F_1, ..., F_I are all of the cluster distributions that ever need to be considered.
We note several immediate yet interesting properties of the model. First, note that if κ = 1 in (6), then all the atoms and weights in the representation of the F_i's are independent and different with probability one, since the beta distribution and G_0 are absolutely continuous. If κ = 0, then our prior (5)-(7) coincides with the hierarchical Dirichlet process in Teh et al. (2006). Since $G = \sum_{h \ge 1} p_h \delta_{\tau_h}$, then, with positive probability, we have $\theta^*_{kh} = \theta^*_{k'm} = \tau_\ell$ for k ≠ k', i.e. all the F_k's share the same atoms in the stick-breaking representation of G. However, even when κ = 0, F_k ≠ F_j with probability one, as the weights {w_kh}_h and {w_jh}_h are different, since they are built from (conditionally) independent stick-breaking priors. This is precisely the feature that allows us to circumvent the degeneracy problem. Second, our model introduces a vector parameter c, which assists in selecting each population distribution from the finite set F_1, ..., F_I, in turn assumed to arise from the semi-HDP prior (5)-(7). The former allows two different populations to have the same distribution (or mixing measure) with positive probability, while the latter allows us to overcome the degeneracy issue while retaining exchangeability with positive probability. Indeed, as noted above, F_i and F_j may share atoms. The atoms in common arise from the atomicity of the base measure, and we let the atomic component of the base measure be a draw from a DP. The result is a very flexible model that, on one hand, is particularly well-suited for problems such as density estimation, and on the other, can be used to construct clusters of the I populations, as desired.
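A generative sketch of the semi-HDP via truncated stick-breaking may help fix ideas; here G_0 and G_00 are both taken standard normal and the truncation level is arbitrary, so this is an approximation for illustration only, not the sampler used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking(conc, trunc):
    """Truncated stick-breaking weights, renormalized to sum to one."""
    betas = rng.beta(1.0, conc, trunc)
    w = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return w / w.sum()

def sample_semi_hdp(I=4, kappa=0.5, alpha=1.0, gamma=1.0, trunc=50):
    """Approximate draw of (F_1, ..., F_I) under the semi-HDP:
    G ~ DP(gamma G_00), P = kappa G_0 + (1 - kappa) G,
    F_k | G ~iid DP(alpha P).  G_0 = G_00 = N(0, 1) for illustration."""
    p = stick_breaking(gamma, trunc)      # weights of G
    tau = rng.normal(size=trunc)          # atoms of G, tau_h ~iid G_00
    Fs = []
    for _ in range(I):
        w = stick_breaking(alpha, trunc)
        # each atom of F_k is a fresh draw from G_0 w.p. kappa,
        # and a draw from the discrete measure G w.p. 1 - kappa
        fresh = rng.uniform(size=trunc) < kappa
        atoms = np.where(fresh, rng.normal(size=trunc),
                         tau[rng.choice(trunc, size=trunc, p=p)])
        Fs.append((w, atoms))
    return Fs, (p, tau)

Fs, (p, tau) = sample_semi_hdp()
```

With kappa = 0 all atoms of every F_k come from the same discrete G (the HDP case, so atoms are shared but the weights still differ), while kappa = 1 makes the F_k's independent draws centered on the non-atomic G_0.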

A restaurant representation
To better understand the cluster allocation under model (3)-(7), we rewrite (3) by introducing the latent variables {θ_ij} as follows:

$$y_{ij} \mid \theta_{ij} \overset{ind}{\sim} k(\cdot \mid \theta_{ij}), \tag{9}$$
$$\theta_{ij} \mid c, F_1, \dots, F_I \overset{iid}{\sim} F_{c_i}, \tag{10}$$

with $\{\theta_{i\ell}\}_\ell \perp \{\theta_{jm}\}_m$ for i ≠ j, conditionally on F_1, ..., F_I. We first derive the conditional law of the θ_ij's under (9)-(10), and (4)-(6), given G. All customers of group i enter restaurant r (such that c_i = r). If group i is the first group entering restaurant r, then the usual Chinese restaurant metaphor applies. Instead, let us imagine that group i is the last group entering restaurant r among those such that c_m = r. Upon entering the restaurant, the customer is presented with the usual Chinese restaurant process (CRP), so that

$$\theta_{ij} \mid \{\text{previously seated } \theta\text{'s}\}, G \sim \sum_{\ell=1}^{H_r} \frac{n_{r\ell}}{n_{r\cdot} + \alpha}\, \delta_{\theta^*_{r\ell}} + \frac{\alpha}{n_{r\cdot} + \alpha}\, P, \tag{11}$$

that is, the CRP when considering all the groups entering restaurant r as a single group.
Here H_r denotes the number of tables in restaurant r, and n_{rℓ} is the number of customers who entered restaurant r and are sitting at table ℓ. Moreover, note that $\theta^*_{r\ell} \mid G \overset{iid}{\sim} P$, so that, as in the HDP, there might be ties among the θ*_{rℓ} also when keeping r fixed. This is an important observation: the fact that there might also be ties for different values r ≠ r' is exactly what lets us avoid the degeneracy to the exchangeable case. Note that (11) holds also for θ_{i1}, i.e. the first customer in group i. In the following, we will use clusters or tables interchangeably. However, note that, unlike traditional CRPs, the number of clusters does not coincide with the number of unique values in a sample. This point is clarified in Argiento et al. (2019), who introduce the notion of ℓ-cluster, which is essentially the table in our restaurant metaphor.
Observe from (11) that when a new cluster is created, its label is sampled from P. In practice, we augment the parameter space with a new binary latent variable for each cluster, namely h_{rℓ}, with $h_{r\ell} \overset{iid}{\sim} \mathrm{Bernoulli}(\kappa)$, so that

$$\theta^*_{r\ell} \mid h_{r\ell} = 1 \sim G_0, \qquad \theta^*_{r\ell} \mid h_{r\ell} = 0 \sim G.$$

Upon conditioning on {h_{rℓ}} it is straightforward to integrate out G. Indeed, we can write the joint distribution of {θ*_{rℓ}, ∀r ∀ℓ}, conditional on {h_{rℓ}}, as the product of an i.i.d. sample from G_0 (the tables with h_{rℓ} = 1) and a conditionally i.i.d. sample from G (the tables with h_{rℓ} = 0, given all the h_{rℓ}'s and G), so that, integrating out G, we can write

$$\theta^*_{r\ell} \mid \{\theta^*_{r'\ell'} : h_{r'\ell'} = 0\} \sim \sum_{k=1}^{H_0} \frac{m_{\cdot k}}{m_{\cdot \cdot} + \gamma}\, \delta_{\tau_k} + \frac{\gamma}{m_{\cdot \cdot} + \gamma}\, G_{00}, \tag{12}$$

with $\tau_k \overset{iid}{\sim} G_{00}$, where H_0 denotes the number of tables in the common area in Figure 1, and m_{rk} denotes the cardinality of the set {θ*_{rℓ} : θ*_{rℓ} = τ_k}. The dot subindex denotes summation over the corresponding subindex values. Hence, conditioning on all the (r, ℓ) such that h_{rℓ} = 0, with r corresponding to a non-empty restaurant, we recover the Chinese restaurant franchise (CRF) that describes the HDP.
We can describe the previously discussed clustering structure in terms of a restaurant metaphor as the "food court of Chinese restaurants with private and shared areas". Here, the θ*_{rℓ} correspond to the tables and the θ_ij to the customers. Moreover, a dish is associated with each table; dishes are represented by the various θ*_{rℓ}'s. There is one big common area where tables are shared among all the restaurants, and I additional "private" small rooms, one per restaurant, as seen in Figure 1. The common area accommodates tables arising from the HDP, i.e. those tables whose dishes are the $\tau_k \overset{iid}{\sim} G_{00}$, while the small rooms host those tables, associated to non-empty restaurants, such that $\theta^*_{r\ell} \mid h_{r\ell} = 1 \overset{iid}{\sim} G_0$. All the customers of group i enter restaurant r (such that c_i = r). Upon entering the restaurant, a customer is presented with a menu. The H_r dishes in the menu are the θ*_{rℓ}'s, and because $\theta^*_{r\ell} \overset{iid}{\sim} P$, there might be repeated dishes; see (11). The customer either chooses one of the dishes in the menu, with probability proportional to the number of customers who entered the same restaurant and chose that dish, or a new dish (not yet included in the menu) with probability proportional to α; again, see (11). If the latter option is chosen, with probability κ a new table is created in the restaurant-specific area, H_r is incremented by one, and a new dish θ*_{r,H_r+1} is drawn from G_0. With probability 1 − κ instead, the customer is directed to the shared area, where (s)he chooses to sit at one of the occupied tables with probability proportional to m_{·k}, i.e. the number of items in the menus (from all the restaurants) that are equal to dish τ_k, or sits at a new table with probability proportional to γ, as seen from (12). We point out that the choice of table in this case is made without any knowledge of which restaurant the dishes came from.
Moreover, if the customer chooses to sit at a new table, we increment H_0 by one and draw τ_{H_0+1} ∼ G_00; we also increment H_r by one and set θ*_{r,H_r+1} = τ_{H_0+1}. Observe that in the original CRF metaphor, it is not the tables that are shared across restaurants, but rather the dishes. In our metaphor instead, we group together all the tables corresponding to the same τ_h and place them in the shared area. This is somewhat reminiscent of the direct sampler scheme for the HDP. Nevertheless, observe that the bookkeeping of the m_{rk}'s is still needed. To exemplify this, in Figure 1 we report a "zoom" on a particular shared table, showing that the θ*'s associated to that table are still present in our metaphor, but can be collapsed into a single shared table when it is convenient.

Figure 1. An illustration with c = (1, 4, 1, 4), so that groups one and three enter restaurant R1 while groups two and four enter restaurant R4. In the common area two tables are represented, τ_1 and τ_2. "Zooming" into τ_2 shows that there are three different θ*'s associated to the value τ_2, namely θ*_12, θ*_13 and θ*_43. The first two originate from R1, showing that it is possible to have ties among the θ*'s even inside the same restaurant, while the table labeled θ*_43 shows that it is possible to have ties across different restaurants.
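The seating scheme just described can be simulated directly; the sketch below tracks, per restaurant, the table counts n_rℓ and, in the shared area, the menu counts m_·k. Dish values are placeholder N(0, 1) draws standing in for G_0 and G_00, and all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def food_court(group_sizes, c, alpha=1.0, gamma=1.0, kappa=0.3):
    """Sequential table assignment in the 'food court of Chinese restaurants
    with private and shared areas'. Returns per-restaurant tables as
    [n_rl, dish, shared_index-or-None] and shared-area dishes as [m_.k, tau_k]."""
    tables = {}   # restaurant r -> list of tables
    shared = []   # shared-area dishes
    for i, n_i in enumerate(group_sizes):
        r = c[i]                    # all customers of group i enter restaurant r
        tabs = tables.setdefault(r, [])
        for _ in range(n_i):
            # join table l w.p. prop. to n_rl, or open a new table w.p. prop. to alpha
            probs = np.array([t[0] for t in tabs] + [alpha], dtype=float)
            pick = rng.choice(len(probs), p=probs / probs.sum())
            if pick < len(tabs):
                tabs[pick][0] += 1
            elif rng.uniform() < kappa:
                tabs.append([1, rng.normal(), None])   # private-area table, dish ~ G_0
            else:
                # shared area: dish tau_k w.p. prop. to m_.k, new dish w.p. prop. to gamma
                sp = np.array([s[0] for s in shared] + [gamma], dtype=float)
                k = rng.choice(len(sp), p=sp / sp.sum())
                if k == len(shared):
                    shared.append([0, rng.normal()])   # new shared dish ~ G_00
                shared[k][0] += 1                      # one more menu item equal to tau_k
                tabs.append([1, shared[k][1], k])
    return tables, shared

tables, shared = food_court([30, 30, 30], c=[1, 1, 2])
```

In this toy run, groups one and two enter restaurant 1 and group three enters restaurant 2; tables in different restaurants may nevertheless point to the same shared dish τ_k, which is the mechanism that avoids degeneracy.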

Theoretical properties of the semi-HDP prior
Here we develop additional properties of the proposed prior model. In particular, we study the topological support of the semi-HDP and show how exactly the degeneracy issue is resolved by studying the induced joint random partition model on the I populations.

Support and moments
An essential requirement of nonparametric priors is that they should have large topological support; see Ferguson (1973). Let us denote by π_G the probability measure on $\mathcal{P}_\Theta^I$ corresponding to the prior distribution π(G_1, ..., G_I) of the random vector (G_1, ..., G_I) specified in (4)-(7), with G_i = F_{c_i}; see (1). We show here that the prior probability measure π_G has full weak support, i.e. given any point g = (g_1, ..., g_I) in $\mathcal{P}_\Theta^I$, π_G gives positive mass to any weak neighborhood U(g; ε) of g, of radius ε > 0.

Proposition 3.1 For any $g \in \mathcal{P}_\Theta^I$ and any ε > 0, π_G(U(g; ε)) > 0.

Proof: see the Appendix, Section A.
It is straightforward to show that, in the case where π_c(c_1, ..., c_I) is exchangeable and P(c_i = ℓ) = ω_ℓ for ℓ = 1, ..., I, model (3)-(7) becomes, after marginalizing with respect to c,

$$p(y_{i1}, \dots, y_{iN_i} \mid F_1, \dots, F_I) = \sum_{\ell=1}^{I} \omega_\ell \prod_{j=1}^{N_i} \int_\Theta k(y_{ij} \mid \theta)\, F_\ell(d\theta),$$

independently across i. In this case, the conditional marginal distribution of each observation can be expressed as a finite mixture of mixtures of the density k(· | θ) with respect to each of the random measures F_1, ..., F_I, i.e. a finite mixture of Bayesian nonparametric mixtures. We have mentioned above that in the case in which G_00 = G_0 in Equations (6)-(7), the marginal expectation of F_i is G_0; that is, for each A ∈ B(Θ), E[F_i(A)] = G_0(A) for any i. In this case, the covariance between F_1 and F_2 is also available in closed form; see the Appendix, Section A, for the proof of these formulas. Note that, in the case of hierarchical normalized completely random measures, and hence in the HDP, the covariance between F_1 and F_2 depends exclusively on the intensity of the random measure governing G (in the case of the DP the dependence is on γ); for instance, see Argiento et al. (2019), Equation (5) in the Supplementary Material. Instead, in the semi-HDP, an additional parameter can be used to tune such covariance: the weight κ. Indeed, as κ approaches 1, the two measures become more and more uncorrelated, the limiting case being full independence as discussed at the end of Section 2.2. In the Appendix, Section A, we also report an expression for the higher moments of F_i(A) for any i.
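The role of κ in tuning the dependence between F_1 and F_2 can also be checked numerically; the following Monte Carlo sketch (truncated stick-breaking, G_0 = G_00 = N(0, 1), all settings illustrative) estimates Corr(F_1(A), F_2(A)) for a fixed set A.

```python
import numpy as np

rng = np.random.default_rng(3)

def sb_weights(conc, trunc):
    """Truncated, renormalized stick-breaking weights."""
    b = rng.beta(1.0, conc, trunc)
    w = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
    return w / w.sum()

def corr_F1_F2(kappa, A=(-1.0, 0.0), alpha=1.0, gamma=1.0,
               trunc=100, reps=400):
    """Monte Carlo estimate of Corr(F_1(A), F_2(A)) under a truncated
    semi-HDP with G_0 = G_00 = N(0, 1); purely illustrative."""
    f = np.empty((reps, 2))
    for t in range(reps):
        p = sb_weights(gamma, trunc)          # G ~ DP(gamma G_00), truncated
        tau = rng.normal(size=trunc)
        for s in range(2):                    # F_1, F_2 | G ~iid DP(alpha P)
            w = sb_weights(alpha, trunc)
            fresh = rng.uniform(size=trunc) < kappa
            atoms = np.where(fresh, rng.normal(size=trunc),
                             tau[rng.choice(trunc, size=trunc, p=p)])
            f[t, s] = w[(atoms > A[0]) & (atoms < A[1])].sum()
    return np.corrcoef(f[:, 0], f[:, 1])[0, 1]
```

Running this with κ = 0 (the HDP case) gives a markedly larger correlation than with κ close to 1, in line with the discussion above.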

Degeneracy and marginal law
We now formalize the intuition given in Section 2.3 and show that our model, as defined in (3)-(7), does not incur the degeneracy issue described by Camerlenghi et al. (2019a). The degeneracy of a nested nonparametric model refers to the following situation: if there are shared values (or atoms in the corresponding mixture model) across any two populations, then the posterior of these population/random probabilities degenerates, forcing homogeneity across the corresponding samples. See also the discussion in Beraha and Guglielmi (2019). From the food court metaphor described above, it is straightforward to see that degeneracy is avoided if two customers sit at the same table (of the common area) with positive probability, conditioning on the event that they entered from two different restaurants.
To get a more in-depth look at these issues, we follow Camerlenghi et al. (2019a) and study properties of the partially exchangeable partition probability function (pEPPF) induced by our model, which we consider in the special case of I = 2. Consider a sample θ = (θ_1, θ_2) of size N = N_1 + N_2 from model (10), together with (4)-(7) for I = 2 populations; let k = k_1 + k_2 + k_0 be the number of unique values in the samples, with k_1 (k_2) unique values specific to group 1 (2) and k_0 shared between the groups. Call n_i the frequencies of the k_i unique values in group i and q_i the frequencies of the k_0 shared values in group i; this is the same notation as in Camerlenghi et al. (2019a), Section 2.2, to which we refer for the formal definition of the pEPPF $\Pi^{(N)}_k(n_1, n_2, q_1, q_2)$.

Proposition 3.2 Let κ in (6) be equal to 1 and let π_1 = P(c_1 = c_2). Then the pEPPF $\Pi^{(N)}_k(n_1, n_2, q_1, q_2)$ can be expressed as

$$\Pi^{(N)}_k(n_1, n_2, q_1, q_2) = \pi_1\, \Phi^{(N)}_k(n_1, n_2, q_1 + q_2) + (1 - \pi_1)\, \mathbb{1}(k_0 = 0)\, \Phi^{(N_1)}_{k_1}(n_1)\, \Phi^{(N_2)}_{k_2}(n_2), \tag{13}$$

where $\Phi^{(N)}_k$ is the EPPF of the fully exchangeable case, and $\Phi^{(N_i)}_{k_i}$ is the marginal EPPF for the individual group i. Proof: see the Appendix, Section A.
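For concreteness, when each Φ above is the EPPF of a DP it has the familiar closed form $\Phi^{(N)}_k(m_1, \dots, m_k) = \alpha^k \prod_{j=1}^k (m_j - 1)! / (\alpha)_N$, with $(\alpha)_N$ the ascending factorial; a small illustrative helper evaluates it in log space:

```python
import math

def dp_log_eppf(counts, alpha=1.0):
    """log EPPF of a DP(alpha) partition with block sizes `counts`:
    log[ alpha^k * prod_j (m_j - 1)! / (alpha)_N ],
    where (alpha)_N = alpha (alpha + 1) ... (alpha + N - 1)."""
    N, k = sum(counts), len(counts)
    out = k * math.log(alpha) + sum(math.lgamma(m) for m in counts)
    out -= math.lgamma(alpha + N) - math.lgamma(alpha)
    return out

# sanity check: for N = 2 and alpha = 1, the partitions {1}{2} and {1,2}
# each have probability 1/2
p_split = math.exp(dp_log_eppf([1, 1]))
p_merge = math.exp(dp_log_eppf([2]))
```

Such a helper makes it easy to evaluate both the fully exchangeable component and the product of group-wise components appearing in Proposition 3.2 on any observed configuration of frequencies.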
This result shows that a suitable prior for κ requires assigning zero probability to the event κ = 1. The assumption in (8) trivially satisfies this requirement. Finally, we consider the marginal law of a sequence of vectors (θ_1, ..., θ_I), with θ_ℓ = (θ_{ℓ1}, ..., θ_{ℓN_ℓ}), from model (3)-(7). Let us first derive the marginal law conditioning on c, as the full marginal law will be the mixture of these conditional laws over all the possible values of c.

Proposition 3.3 The marginal law L(dθ_1, ..., dθ_I | c) of the sequence of vectors (θ_1, ..., θ_I), given c, factorizes into the eppf of the table assignments in the food court, times the law of the dishes: the private-area dishes are i.i.d. from G_0, while the shared-area dishes are i.i.d. from G_00; we label the resulting expression (14). Here, $\{\theta^*_\ell\}_{\ell=1}^L = \{\theta^*_{11}, \dots, \theta^*_{I H_I}\}$ is the sequence representing all the tables in the process, obtained by concatenating the tables in each restaurant. Moreover, R(c) is the number of unique values in c, i.e. the number of non-empty restaurants, $n_{r_i}$ is the vector of ℓ-cluster sizes for restaurant $r_i$, $m_{r_i}$ is the vector of the cluster sizes of the $\theta^*_\ell$ such that $h_\ell = 0$, and the $\theta^{**}_k$ are the unique values among such $\theta^*_\ell$, where "eppf" denotes the distribution of the partition induced by the table assignment procedure in the food court of Chinese restaurants described in Section 2.3. Proof: see the Appendix, Section A.
The marginal law of (θ_1, ..., θ_I) is then

$$\mathcal{L}(d\theta_1, \dots, d\theta_I) = \sum_{c \in [I]^I} \mathcal{L}(d\theta_1, \dots, d\theta_I \mid c)\, \pi_c(c),$$

where L(dθ_1, ..., dθ_I | c) is given in (14). Observe that in Proposition 3.2 we denoted by Φ the EPPF, while in (14) we use the notation "eppf". This is to stress that these objects are inherently different: Φ is the EPPF of the partition of unique values in the sample, while eppf here is the EPPF of the tables, or ℓ-clusters, induced by the table assignment procedure described in Section 2.3. Hence, from a sample θ one can recover n_1, n_2, q_1, q_2 in (13) but not $n_{r_i}$ in (14).

Some results on the Bayes factor for testing homogeneity
We now consider testing for homogeneity within the proposed partial exchangeability framework. As a byproduct of the assumed model, the corresponding Bayes factor is immediately available. For example, if one wanted to test whether populations i and j were homogeneous, it would suffice to compute the Bayes factor for the test, which can be straightforwardly estimated from the output of the posterior simulation algorithm presented later on. Note that these "pairwise" homogeneity tests are not the only objects of interest that we can tackle within our framework; it is possible to test any configuration of c against an alternative. These tests admit an equivalent representation in terms of a model selection problem; for example, in the case of I = 2 populations, we can rewrite (15), for i = 1 and j = 2, as a model selection test of M_1 against M_2. In this case BF_12 := BF_12(y_11, . . . , y_1N_1, y_21, . . . , y_2N_2) = m_{M_1}(y_11, . . . , y_1N_1, y_21, . . . , y_2N_2) / m_{M_2}(y_11, . . . , y_1N_1, y_21, . . . , y_2N_2), where m_{M_i} denotes the marginal law of the data under model M_i, i = 1, 2, defined above. Asymptotic properties of Bayes factors have been discussed by several authors. We refer to Walker et al. (2004) and Ghosal et al. (2008) for a more detailed discussion, and to Chib and Kuffner (2016) for a recent survey on the topic. Chatterjee et al. (2020) is a recent and solid contribution on the almost sure convergence of the Bayes factor in a general set-up that includes dependent data, i.e. beyond the usual i.i.d. context. In words, our approach can be described as follows. When the data are assumed to be exchangeable, we assume that both samples are generated i.i.d. from a distribution P_0 with density p_0.
If the data are instead assumed to be partially exchangeable, then we consider the first population to be generated i.i.d. from a certain P_0 with density p_0, while the second one is generated from Q_0 with density q_0, with P_0 ≠ Q_0 and independence across populations. The Bayes factor for comparing M_1 against M_2 is thus consistent if it diverges when M_1 holds and vanishes when M_2 holds; the two scenarios must be checked separately. In the latter case, consistency of the Bayes factor can be proved by arguing that only model M_2 satisfies the so-called Kullback-Leibler property, so that consistency is ensured by the theory in Walker et al. (2004). We summarize this result in the following proposition.
Observe that, of the nine conditions B1-B9, conditions B1-B3, B7 and B9 involve regularity conditions on the kernel k(·|θ). These are satisfied if, for example, the kernel is univariate Gaussian with parameters θ = (µ, σ²). Conditions B4-B6 involve regularity of the true data-generating density and are usually satisfied in practice. Condition B8 requires that the mixing measure has full weak support, which was already proved in Proposition 3.1.
On the other hand, when p 0 = q 0 , consistency of the Bayes factor would require BF 12 → +∞. This is a result we have not been able to prove so far. The Appendix, Section B discusses the relevant issues arising when trying to prove the consistency in this setting; we just report here that the key missing condition is an upper bound of the prior mass of M 2 . The lack of such bounds for general nonparametric models is well known in the literature, and not specific to our case, as it is shared, for instance, by Bhattacharya and Dunson (2012) and Tokdar and Martin (2019). In both cases, the authors were able to prove the consistency under the alternative hypothesis but not under the null. For a discussion on the "necessity" of these bounds in nonparametric models, see Tokdar and Martin (2019).
In light of the previous consistency result for the non-homogeneous case, our recommendation for carrying out the homogeneity test is to decide in favor of H_0 whenever the posterior of (c_i, c_j) does not strongly concentrate on c_i ≠ c_j. As Section 5 shows, in our simulated data experiments this choice consistently identifies the right homogeneity structure among populations. See also the discussion in Section 7.
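For illustration, both P(c_i = c_j | data) and the Bayes factor in (15) can be estimated directly from the MCMC draws of c. The following is a minimal sketch; the function names and the posterior-odds/prior-odds construction of the Bayes factor estimate are our own illustrative choices, not part of the model specification.

```python
def posterior_prob_equal(c_samples, i, j):
    """Monte Carlo estimate of P(c_i = c_j | data) from MCMC draws of c."""
    hits = sum(1 for c in c_samples if c[i] == c[j])
    return hits / len(c_samples)

def bayes_factor_homogeneity(c_samples, i, j, prior_prob_equal):
    """BF for H0: c_i = c_j versus H1: c_i != c_j, computed as posterior odds
    divided by prior odds (assumes both probabilities lie strictly in (0, 1))."""
    p = posterior_prob_equal(c_samples, i, j)
    return (p / (1.0 - p)) / (prior_prob_equal / (1.0 - prior_prob_equal))
```

For instance, with prior probability 1/2 on homogeneity, a posterior probability of 0.75 yields an estimated Bayes factor of 3 in favor of H_0.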

Posterior Simulation
We illustrate an MCMC sampler based on the restaurant representation derived in Section 2.3. The random measures {F i } i and G are marginalized out for all the updates except for the case of c, for which we use a result from Pitman (1996) to sample from the full conditional of each F i , truncating the infinite stick-breaking sum adaptively; see below.
We refer to this algorithm as marginal. We also note that, by truncating all the stick-breaking infinite sums a priori to a fixed number of atoms, we can derive a blocked Gibbs sampler as in Ishwaran and James (2001). However, in our applications the blocked Gibbs sampler was significantly slower, both in reaching convergence to the stationary distribution and in completing a single iteration of the MCMC update. Hence, we describe and use only the marginal algorithm.
We follow the notation introduced in Section 2.3. The state of our MCMC sampler consists of: the restaurant tables {θ*_{rℓ}}; the tables in the common area {τ_h}; a set of binary variables {h_{rℓ}}, indicating whether each table is "located" in the restaurant-specific or in the common area; a set of discrete shared-table allocation variables t_{rℓ}, one for each θ*_{rℓ}, such that θ*_{rℓ} = τ_k iff t_{rℓ} = k and h_{rℓ} = 0; the categorical variables c_i, indicating the restaurant for each population; κ ∈ (0, 1); and the table allocation variables s_ij, one for each observation, such that θ_ij = θ*_{rh} iff c_i = r and s_ij = h. We also denote by H_0 and H_r the number of occupied tables in the shared area and in restaurant r, respectively, while m_{rk} indicates the number of customers who entered from restaurant r and sit at table k of the common area. We use the dot notation for marginal counts; for example, n_{r·} indicates the number of customers who entered restaurant r. We summarize the Gibbs sampling scheme next.
• Sample the cluster allocation variables s_ij using the Chinese restaurant process, where the predictive weight of a new table is p(y_ij | s_{−ij}, rest) = κ ∫_Θ k(y_ij | θ) G_0(dθ) + (1 − κ) ∫_Θ k(y_ij | θ) G(dθ), and where the notation x_{−ij} means that observation y_ij is removed from the calculations involving the variable x.
If s_ij = s_new, a new table is created. The associated value θ*_{r,s_new} is sampled from G_0 with probability κ or from G with probability 1 − κ, as described in Section 2.3. The corresponding latent variables h_{r,s_new} and t_{r,s_new} are set accordingly. When sampling from (12), a new table in the shared area might be created; in that case, t_{r,s_new} is set to H_0 + 1.
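The seating rule in this step can be sketched as follows. All function names are illustrative assumptions; moreover, for conjugate kernels the integral against G_0 is available in closed form, but here both marginal terms are approximated by plain Monte Carlo for simplicity, with a univariate Gaussian kernel.

```python
import math

def normal_pdf(y, mu, sigma):
    """Univariate Gaussian kernel k(y | theta) with theta = (mu, sigma)."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def table_probs(y, tables, counts, alpha, kappa, g0_draw, gbar_draw, n_mc=200):
    """Seating probabilities for observation y in one restaurant: an occupied
    table gets weight n_rh * k(y | theta*_rh); the new-table weight is
    alpha * (kappa * E_G0[k(y|.)] + (1 - kappa) * E_Gbar[k(y|.)]), with the
    two expectations approximated by Monte Carlo over draws of theta."""
    w = [n * normal_pdf(y, *th) for th, n in zip(tables, counts)]
    m0 = sum(normal_pdf(y, *g0_draw()) for _ in range(n_mc)) / n_mc
    m1 = sum(normal_pdf(y, *gbar_draw()) for _ in range(n_mc)) / n_mc
    w.append(alpha * (kappa * m0 + (1.0 - kappa) * m1))
    tot = sum(w)
    return [x / tot for x in w]
```

The last entry of the returned vector is the probability of opening a new table, which is then assigned to the restaurant-specific or the shared area as described above.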
• Sample the table allocation variables t_{rℓ} as in the HDP, where the notation x_{−rℓ} means that table θ*_{rℓ}, including all the associated observations, is entirely removed from the calculations involving variable x. If k = k_new, a new table is created in the shared area; the allocation variables s_ij are left unchanged.
• Sample the cluster values from their full conditionals, where the product (*) is over all the index couples (i, j) such that c_i = r, s_ij = ℓ, h_{rℓ} = 0 and θ*_{rℓ} = τ_k. Observe that h_{rℓ} = 0 means that θ*_{rℓ} = τ_k for some k; hence, in this case, θ*_{rℓ} is purely symbolic and we do not need to sample a value for it.
• Sample each h_{rℓ} independently from its full conditional, where, as in (18), the notation x_{−rℓ} means that table θ*_{rℓ}, including all its associated observations, is removed from the calculations involving variable x. Observe that, while in the update of the cluster values all the θ*_{rℓ} referring to the same τ_k were updated at once, here we move the tables one by one.
where I[·] denotes the indicator function.
• Sample each c_i in c = (c_1, . . . , c_I) independently from (19). If the new value of c_i differs from the previous one, then, following (16), all the observations y_i1, . . . , y_iN_i are reallocated to the new restaurant.
Note that the update in (19) involves the previously marginalized random probability measures F_1, . . . , F_I. Thus, before performing this update, we need to draw the F_i's from their corresponding full conditional distributions. It follows from Corollary 20 in Pitman (1996) that the conditional distribution of F_r given c, n_r, θ*_r, κ, and G coincides with the distribution of π_{r0} F'_r + ∑_{h=1}^{H_r} π_{rh} δ_{θ*_{rh}}, where (π_{r0}, π_{r1}, . . . , π_{rH_r}) ∼ Dirichlet(α, n_{r1}, . . . , n_{rH_r}) and F'_r | G ∼ D_{αP}, with P = κG_0 + (1 − κ)G. This result was employed in Taddy et al. (2012) to quantify posterior uncertainty of functionals of a Dirichlet process, and also in Canale et al. (2019) to derive an alternative MCMC scheme for mixture models. Similarly, the conditional distribution of G given τ and m coincides with the distribution of π_0 G' + ∑_{h=1}^{H_0} π_h δ_{τ_h}, where (π_0, π_1, . . . , π_{H_0}) ∼ Dirichlet(γ, m_{·1}, . . . , m_{·H_0}) and G' ∼ D_{γG_00}. In practice, we draw each F'_r by truncating the infinite stick-breaking sum. Note that we do not need to set the truncation level a priori. Instead, we can specify an upper bound on the error introduced by the truncation and set the level adaptively. In fact, as a straightforward consequence of Theorem 1 in Ishwaran and James (2002), the total variation distance between F'_r and its approximation with M atoms is bounded by ε_M = 1 − ∑_{h=1}^M w_{rh} (see also Theorem 2 in Lijoi et al., 2020b). The error induced on F_r is then bounded by π_{r0} ε_M. Note that simulating the atoms of F'_r involves the discrete measure G. However, we only need to draw a finite number of samples from it, and not its full trajectory, so that no truncation is necessary for G. For ease of bookkeeping, we employ retrospective sampling (Papaspiliopoulos and Roberts, 2008) to simulate the atoms; alternatively, the classical CRP representation can be used. In our experiments, because ∑_{h=1}^{H_r} n_{rh} ≫ α, we have π_{r0} ≪ ∑_{h=1}^{H_r} π_{rh} ≈ 1. Thus, choosing a truncation level M = 10 always produced an error on F_i lower than 10^{−4} (henceforth fixed as the truncation error threshold). Furthermore, we are often not even required to draw samples from G.
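The adaptive truncation rule can be sketched as follows: sticks are broken until the leftover mass, which bounds the total variation error ε_M, falls below a chosen tolerance. The function name is illustrative.

```python
import random

def truncated_dp_weights(alpha, eps):
    """Stick-breaking weights of a DP(alpha) drawn until the leftover stick
    mass (which bounds the truncation error in total variation) falls below
    eps. Returns the weights and the final leftover mass."""
    weights, remaining = [], 1.0
    while remaining >= eps:
        v = random.betavariate(1.0, alpha)  # stick-breaking proportion
        weights.append(remaining * v)
        remaining *= (1.0 - v)
    return weights, remaining
```

The number of atoms is thus random but guaranteed to achieve the requested error bound, instead of being fixed a priori as in the blocked Gibbs sampler.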
Of the aforementioned steps, the bottleneck is the update of c, because for each c_i we are required to evaluate the densities of N_i points in I mixtures. If N_i = N for all i, the computational cost of this step is O(NI²), which can be extremely demanding for large values of I. We can mitigate the computational burden by replacing this Gibbs step with a Metropolis-within-Gibbs step, in the same spirit as the Metropolised Carlin and Chib algorithm proposed in Dellaportas et al. (2002). At each step we propose a move from c^{(ℓ)}_i = r to c^{(ℓ+1)}_i = m with a certain probability p_i(m | r). The transition is then accepted with the usual Metropolis-Hastings rule, i.e. the new update becomes:
• Propose a candidate m by sampling from p_i(· | r).
• Accept the move with probability q = min{1, ·}, given by the usual Metropolis-Hastings ratio.
We call this alternative sampling scheme the Metropolised sampler. The key point is that if evaluating the proposal p_i(·|·) has a negligible cost, the computational cost of this step is O(2NI), as for each data point we need to evaluate only two mixtures: the one corresponding to the current state F_r and the one corresponding to the proposed state F_m. Of course, the efficiency and mixing of the Markov chain will depend on a suitable choice of the transition probabilities p_i(·|·); some possible alternatives are discussed in Section 5. When, at the end of an iteration, a cluster is left unallocated (or empty), the probability of assigning an observation to that cluster will be zero for all subsequent steps. As is standard in the literature, we employ a relabeling step that gets rid of all the unused clusters. However, this relabeling step is slightly more complicated here, since there are two different types of clusters: those arising from G_0 and those arising from G. Details of the relabeling procedure are discussed in the Appendix, Section C.
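A minimal sketch of the Metropolis-within-Gibbs move for a single c_i follows. For simplicity it assumes a uniform prior on c, so that only the likelihood and proposal terms enter the acceptance ratio; all names are illustrative.

```python
import math
import random

def metropolised_c_update(i, c, log_lik, proposal_probs, rng=random):
    """One Metropolis-within-Gibbs move for c_i. log_lik(i, r) is the log
    likelihood of all observations of population i under restaurant r's
    mixture; proposal_probs(i, r) returns the proposal p_i(. | r) as a list
    over restaurants. A uniform prior on c is assumed, so the acceptance
    ratio reduces to likelihood and proposal terms."""
    r = c[i]
    p = proposal_probs(i, r)
    m = rng.choices(range(len(p)), weights=p)[0]
    if m == r:
        return c
    log_q = (log_lik(i, m) - log_lik(i, r)
             + math.log(proposal_probs(i, m)[r]) - math.log(p[m]))
    if rng.random() < math.exp(min(0.0, log_q)):
        c = list(c)
        c[i] = m
    return c
```

Note that only the two mixtures indexed by r and m are evaluated, which is the source of the O(2NI) cost mentioned above.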

Use of pseudopriors
The sampling scheme described above presents a major issue that could severely impact mixing. Consider as an example the case I = 2: if, at iteration k, the state jumps to c_1 = c_2 = 1, then all the tables of the second restaurant are erased from the state, because no observation is assigned to them anymore. Switching back to c_1 ≠ c_2 would then require that the approximation of F_2 sampled from its prior distribution give sufficiently high likelihood to either y_1 or y_2, an extremely unlikely event in practice.
To overcome this issue, we make use of pseudopriors as in Carlin and Chib (1995): whenever a random measure F_r in (F_1, . . . , F_I) is not associated with any group, we sample the part of the state corresponding to that measure (the atoms {θ*_{rℓ}} and the numbers of customers {n_{rℓ}} in that restaurant) from its pseudoprior. From the computational point of view, this is accomplished by first running a preliminary MCMC simulation where the c_i's are fixed as c_i = i, and collecting the samples. Then, in the actual MCMC simulation, whenever restaurant r is empty we change the state by choosing at random one of the previous samples obtained with fixed c_i's. Note that this use of pseudopriors does not alter the stationary distribution of the MCMC chain. Furthermore, the way pseudopriors are collected and sampled from is completely arbitrary, and our proposed solution works well in practice. Other valid options include approximations based on preliminary chain runs, as discussed in Carlin and Chib (1995).
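Computationally, the pseudoprior mechanism amounts to keeping a bank of restaurant states collected during the preliminary run and swapping one in whenever a restaurant empties. A sketch, with purely illustrative data structures:

```python
import random

def refresh_empty_restaurants(state, occupied, pseudoprior_bank, rng=random):
    """Overwrite the state of any restaurant r with no population assigned
    to it (no c_i = r) with a draw from a bank of states collected in a
    preliminary run where the c_i's were kept fixed at c_i = i. As in
    Carlin and Chib (1995), this leaves the stationary distribution unchanged."""
    for r in range(len(state)):
        if r not in occupied:
            state[r] = rng.choice(pseudoprior_bank[r])
    return state
```

This keeps "empty" mixing measures in regions of high likelihood for their own population, so that switching back to a non-homogeneous configuration remains possible.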
Section 5 below contains extensive simulation studies that show that the proposed model can be used to efficiently estimate densities for each population. We also tried the case of a large number of populations, e.g. I = 100 without any significant loss of performance.

Two populations
We first focus on the special case of I = 2 populations. Consider generating data as follows: each population is a mixture of two normal components. This is the same example considered in Camerlenghi et al. (2019a). Table 1 summarizes the parameters used to generate the data. Note that these three scenarios cover both the fully exchangeable case across the two populations (Scenario I) and partial exchangeability between them (Scenarios II and III). For each case, we simulated N_1 = N_2 = 100 observations per group (independently).

             (µ_1, σ_1)   (µ_2, σ_2)    (µ_3, σ_3)   (µ_4, σ_4)   w_1   w_2
Scenario I   (0.0, 1.0)   (5.0, 1.0)    (0.0, 1.0)   (5.0, 1.0)   0.5   0.5
Scenario II  (5.0, 0.6)   (10.0, 0.6)   (5.0, 0.6)   (0.0, 0.6)   0.9   0.1
Scenario III (0.0, 1.0)   (5.0, 1.0)    (0.0, 1.0)   (5.0, 1.0)   0.8   0.2

Table 1: Parameters of the simulated datasets.

Table 2 reports the posterior probabilities of the two populations being identified as equal for the three scenarios. We can see that our model recovers the ground truth. Moreover, Figure 2 shows the density estimates for Scenarios I and III, i.e. the posterior mean of the density evaluated on a fixed grid of points, together with pointwise 95% posterior credible intervals at each point x in the grid; credible intervals are obtained by approximating the F_i's as discussed in Section 4. We can see that in both cases the locations and scales of the populations are recovered perfectly, while the weights of the mixture components appear slightly more precise in Scenario I than in Scenario III. Comparing the Bayes factors in Table 2 with those in Camerlenghi et al. (2019a) (5.86, 0.0 and 0.54 for the three scenarios, respectively), we see that both models are able to correctly assess homogeneity.
However, the Bayes factors obtained under our model tend to assume more extreme values than those from Camerlenghi et al. (2019a). Figure 3 shows the posterior distribution of the number of shared and private unique values (reconstructed from the cluster allocation variables s_ij and the table allocation variables t_{rℓ}) in Scenario II, when either κ ∼ Beta(2, 2) or κ = 1. In the latter case we also find P(c_1 = c_2 | data) = 0, but the shared component between groups one and two is not recovered, due to the degeneracy issue described in Proposition 3.2.
As the central point of our model is to allow different random measures to share at least one atom, we test this scenario in more detail. To do so, we simulate 50 different datasets from (20), by selecting µ_1, µ_2, µ_4 i.i.d. from N(0, 10) and σ²_1, σ²_2, σ²_4 i.i.d. from an inverse-gamma(2, 2) distribution, w_1 ∼ Beta(1, 1), and setting µ_3 = µ_1, σ²_3 = σ²_1, w_2 = w_1. In this way we create 50 independent scenarios where the two populations share exactly one component and give it the same weight. Figure 4 reports the scatter plot of the estimated posterior probabilities of c_1 = c_2 obtained from the MCMC samples. It is clear that our model recovers the right scenario most of the time. Out of 50 examples, P(c_1 = c_2 | data) exceeds 0.5 in only four; a visual inspection of the plots of the true densities shows that in those cases the two populations were indeed very similar.
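The generating mechanism of Table 1, under our reading of (20) as the mixture w_1 N(µ_1, σ_1²) + (1 − w_1) N(µ_2, σ_2²) for population 1 and w_2 N(µ_3, σ_3²) + (1 − w_2) N(µ_4, σ_4²) for population 2, can be sketched as:

```python
import random

# Table 1 parameters: (mu1, s1), (mu2, s2), (mu3, s3), (mu4, s4), w1, w2.
SCENARIOS = {
    "I":   ((0.0, 1.0), (5.0, 1.0), (0.0, 1.0), (5.0, 1.0), 0.5, 0.5),
    "II":  ((5.0, 0.6), (10.0, 0.6), (5.0, 0.6), (0.0, 0.6), 0.9, 0.1),
    "III": ((0.0, 1.0), (5.0, 1.0), (0.0, 1.0), (5.0, 1.0), 0.8, 0.2),
}

def simulate(scenario, n=100, rng=random):
    """Draw n observations per population from the two-component normal
    mixtures of the given scenario (mixture structure as stated above)."""
    c1, c2, c3, c4, w1, w2 = SCENARIOS[scenario]
    y1 = [rng.gauss(*(c1 if rng.random() < w1 else c2)) for _ in range(n)]
    y2 = [rng.gauss(*(c3 if rng.random() < w2 else c4)) for _ in range(n)]
    return y1, y2
```

Under Scenario I the two mixtures coincide, so the populations are exchangeable; Scenarios II and III generate partially exchangeable data.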

More than two populations
We now extend the simulation study to scenarios with more than two populations. We consider three simulated datasets with four populations each and different clustering structures at the population level. In particular, we use the same scenarios as in Gutiérrez et al. (2019), and simulate N_i = 100 points for each population i = 1, 2, 3, 4. By SN(ξ, ω, α) in Scenario IV we mean the skew-normal distribution with location ξ, scale ω and shape α; in this case, the mean of the distribution equals ξ + ωδ√(2/π), where δ = α/√(1 + α²). Note that we focus on a different problem than the one discussed by Gutiérrez et al. (2019), who considered testing multiple treatments against a control: they were concerned with testing the hypothesis of equality in distribution between data coming from different treatments, y_j (j = 2, 3, 4 in these scenarios), and data coming from a control group, y_1. Our goal, instead, is to cluster these populations based on their distributions.
Observe that the prior chosen for c does not translate directly into a distribution on the partition ρ, as it is affected by so-called label switching. Thus, in order to summarize our inference, we post-process our chains and transform the samples c^{(1)}, . . . , c^{(M)} of c into samples ρ^{(1)}, . . . , ρ^{(M)} of ρ. For example, c^{(ℓ)} = (1, 1, 1, 3) and c^{(ℓ')} = (2, 2, 2, 1) both induce the same partition {{1, 2, 3}, {4}}. The posterior probabilities of the true clusters, P(ρ_i = ρ^true_i | data), are estimated using the transformed MCMC samples and equal 0.75, 0.95 and 0.99 for the three scenarios, respectively. Figure 5 shows the posterior distribution of ρ, and Figure 6 reports the density estimates for each group, in Scenario IV. Observe that the posterior mode is at ρ^true_4, but significant mass is also given to the configuration {{1}, {2, 3}, {4}}. We believe that this behavior is mainly due to our use of pseudopriors, as they make the transitions between these states fairly smooth. On the other hand, in Scenario V, where the posterior mass on the true cluster is close to 1, such transitions happen very rarely, and the posterior distribution, not shown here, is completely concentrated on ρ^true_5. Our insight is that the pseudopriors make a transition between two states, say c^{(j)} = (1, 1, 3, 4) and c^{(j+1)} = (1, 2, 3, 4) (or vice versa), more likely when the mixing distributions of populations one and two are the same.
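The post-processing that maps label vectors c to partitions ρ, removing label switching, can be sketched as follows (population indices are 0-based here):

```python
def to_partition(c):
    """Map a label vector c to the set partition it induces on the
    populations, removing label switching: blocks are listed by their
    smallest element, so relabeled chains yield identical partitions."""
    blocks = {}
    for idx, label in enumerate(c):
        blocks.setdefault(label, []).append(idx)
    return sorted(blocks.values())
```

Applying `to_partition` to every MCMC draw of c yields the samples of ρ used to estimate posterior partition probabilities.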
We compared the performance of the Metropolised algorithm against the full Gibbs move for the update of c, computing the effective sample size (ESS) of the number of population-level clusters (i.e. the number of unique values in c) over CPU time. We consider two choices for the proposal distribution p_i(m | r): the discrete uniform over {1, . . . , I}, and a discrete alternative with weights based on d²(F_r, F_m), the squared L² distance between the Gaussian mixture represented by F_r and that represented by F_m, which are sampled as discussed in Section 4. Let p_r = ∑_{i=1}^{H_r} w_{ri} N(µ_{ri}, σ²_{ri}) and p_m = ∑_{j=1}^{H_m} w_{mj} N(µ_{mj}, σ²_{mj}) be the mixture densities associated with the mixing measures F_r and F_m, respectively; then the cross term in d²(F_r, F_m) is ∑_{i=1}^{H_r} ∑_{j=1}^{H_m} w_{ri} w_{mj} ∫ N(y; µ_{ri}, σ²_{ri}) N(y; µ_{mj}, σ²_{mj}) dy.
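The squared L² distance between two Gaussian mixtures is available in closed form, since ∫ N(y; a, s²) N(y; b, t²) dy = N(a; b, s² + t²). A sketch (function names are ours):

```python
import math

def norm_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def cross(w1, mus1, vars1, w2, mus2, vars2):
    """Closed-form integral of the product of two Gaussian mixtures,
    using int N(y; a, s^2) N(y; b, t^2) dy = N(a; b, s^2 + t^2)."""
    return sum(wa * wb * norm_pdf(ma, mb, va + vb)
               for wa, ma, va in zip(w1, mus1, vars1)
               for wb, mb, vb in zip(w2, mus2, vars2))

def l2_sq(w1, mus1, vars1, w2, mus2, vars2):
    """Squared L2 distance int (p_r - p_m)^2 dy between two Gaussian mixtures."""
    return (cross(w1, mus1, vars1, w1, mus1, vars1)
            + cross(w2, mus2, vars2, w2, mus2, vars2)
            - 2.0 * cross(w1, mus1, vars1, w2, mus2, vars2))
```

This makes the informed proposal cheap to evaluate: its cost depends only on the numbers of atoms H_r and H_m, not on the sample sizes.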
See the Appendix, Section A, for the proof of Equations (22)-(23). Results for data as in Scenario IV show that the best efficiency is obtained using the full Gibbs update, with an ESS per second of 57.1. The Metropolised sampler with proposal as in (21) comes second, yielding an ESS per second of 34.1, while the Metropolised sampler with uniform proposal is the worst performer, with an ESS per second of 12.8. Hence, even when the number of groups is moderate, the good performance of the Metropolised sampler with an informed proposal is clear. Preliminary analysis showed that the Metropolised sampler outperforms the full Gibbs one as the number of groups increases.
Finally, we test how our algorithm performs when the number of populations increases significantly. We do so by generating 100 populations in Scenario VII. To compute posterior inference, we run the Metropolised sampler with proposal (21). To give a rough idea of the computational costs associated with this large simulated dataset, we report that running the full Gibbs sampler would have required more than 24 hours on a 32-core machine (having parallelized all the computations that can be safely parallelized), while the Metropolised sampler ran in less than 3 hours on a 6-core laptop. As a summary of the posterior distribution of the random partition ρ_100, we compute the posterior similarity matrix [P(c_i = c_j | data)]_{i,j=1}^I. Estimates of these probabilities are straightforward to obtain from the output of the MCMC algorithm. Figure 7 shows the posterior similarity matrix as well as the density estimates of five different populations. It is clear that the clustering structure of the populations is recovered perfectly and that the density estimates are coherent with the true ones.
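The posterior similarity matrix is estimated as the fraction of MCMC draws in which c_i = c_j; a sketch:

```python
def similarity_matrix(c_samples, I):
    """Estimate [P(c_i = c_j | data)] for i, j = 0, ..., I-1 as the fraction
    of MCMC draws of c in which populations i and j share the same label."""
    S = [[0.0] * I for _ in range(I)]
    for c in c_samples:
        for i in range(I):
            for j in range(I):
                if c[i] == c[j]:
                    S[i][j] += 1.0
    M = len(c_samples)
    return [[s / M for s in row] for row in S]
```

The resulting matrix is symmetric with unit diagonal and is invariant to label switching, since only the events c_i = c_j enter the computation.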

A note on the mixing
One aspect of the inference presented so far, clear from all the simulated scenarios, is that the posterior simulation of c, and hence of the partition ρ, usually stabilizes around one particular value and then moves very rarely. This could be interpreted as a mixing issue of the MCMC chain. However, notice that once the "true" partition of the populations is identified, it is extremely unlikely to move from that state, as can be seen directly from Equation (19). Indeed, moving from one state to another modifies the likelihood of an entire population. In particular, moving from a state where c_i = c_j, for two populations i and j that are actually homogeneous, to a state where c_i ≠ c_j is an extremely unlikely move.
To further illustrate the point, consider for ease of explanation the case of Simulation Scenario I, where both populations are the same, and suppose that at a certain MCMC iteration we impute c_1 = c_2 = 1. In order for the chain to jump to c_2 = 2, the "empty" mixing distribution F_2 must be sampled in such a way as to give a reasonably high likelihood to all the data from the second population, y_21, . . . , y_2N_2; again, see (19). If one did not make use of pseudopriors, this would mean that F_2 would be sampled from the prior, making this transition virtually impossible. But even using pseudopriors, the transition remains quite unlikely. Indeed, once c_1 = c_2 = 1, we estimate F_1 using data from the two homogeneous groups, hence obtaining a much better estimate than one would get when c_1 ≠ c_2.
Nevertheless, in none of the simulation scenarios we tried has this problem prevented the posterior simulation algorithm from identifying the correct partition of populations, as defined in these scenarios. In particular, we computed point estimates of ρ by minimizing the posterior expectation of Binder's loss (Binder, 1978) under equal misclassification costs and of the variation of information loss (Wade and Ghahramani, 2018). In all the examples proposed, the "true" partition was correctly detected by both estimates.

Chilean grades dataset
The School of Mathematics at Pontificia Universidad Católica de Chile teaches many undergraduate courses to students from virtually all fields. When the number of students exceeds a pre-established maximum quota, several sections are formed and the courses are taught in parallel. There is a high degree of preparation in such cases, so as to guarantee that the sections cover the same material and are coordinated to function as virtual copies of each other: only the instructor changes across sections, while all course materials are the same, including exams, homework, assignments, projects, etc., and there is a shared team of graders common to all the parallel sections. According to the rules, every student gets a final grade on a scale from 1.0 to 7.0, using one decimal place, where 4.0 is the minimum passing grade. We consider here the specific case of a version of Calculus II, taught in parallel to three different sections (A, B and C) in a recent semester. Our main goal is to examine instructor effectiveness by comparing the distributions of the final grades obtained by each of the three populations (sections). The sample sizes of these populations are 76, 65 and 50, respectively.
A possible way to model these data would be to employ a truncated normal distribution as the kernel in (2). However, since our primary interest is to investigate the homogeneity of the underlying distributions and not to perform density estimation, we decided first to add a small amount of zero-mean Gaussian noise with variance 0.1 to the data (i.e. "jittering"), and then to standardize the whole dataset by letting y_new_ij = (y_ij − ȳ)/s_y, where ȳ = (∑_ij y_ij)/(∑_i N_i) and s²_y = (∑_ij (y_ij − ȳ)²)/((∑_i N_i) − 1) are the global sample mean and variance, respectively. In the sequel, the index i = 1, 2, 3 denotes sections A, B and C, respectively, as described above. Figure 8 reports density estimates for all groups (i.e. posterior density means evaluated on a fixed grid of points and pointwise 95% posterior credible intervals at each point x in the grid), as well as the posterior distribution of the random partition ρ, obtained from the posterior distribution of c after removing label switching in a post-processing step (see also Section 5.2). From Figure 8 we see that the posterior distribution of ρ gives high probability to the case of the three groups being all different, as well as to the case where the first and third groups are homogeneous but different from the second one. This is in accordance with a visual analysis of the observed and estimated densities.
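The preprocessing step described above (jittering with N(0, 0.1) noise followed by global standardization) can be sketched as:

```python
import math
import random

def preprocess(groups, noise_var=0.1, rng=random):
    """Jitter each grade with zero-mean Gaussian noise of variance noise_var,
    then standardize the pooled dataset using the global sample mean and the
    (unbiased) global sample variance, as in the text."""
    jittered = [[y + rng.gauss(0.0, math.sqrt(noise_var)) for y in g] for g in groups]
    flat = [y for g in jittered for y in g]
    n = len(flat)
    ybar = sum(flat) / n
    s = math.sqrt(sum((y - ybar) ** 2 for y in flat) / (n - 1))
    return [[(y - ybar) / s for y in g] for g in jittered]
```

After this transformation the pooled data have mean zero and unit sample variance, while the group structure is preserved.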
We considered several functionals of the random population distribution F_{c_i} (see (3)) for i = 1, 2, 3. Recall that, according to the notation in (1), F_{c_i} = G_i. First of all, we consider the mean and variance functionals of the random density p_i(y) = ∫_Θ k(y|θ)F_{c_i}(dθ) = ∫_Θ k(y|θ)G_i(dθ), for each i = 1, 2, 3; these are functionals of the random probability F_{c_i} = G_i. Moreover, since Figure 8 seems to suggest that the three groups differ mainly in their asymmetry, we considered two more functionals of G_i, i.e. two indicators of skewness: Pearson's moment coefficient of skewness sk, and the measure of skewness with respect to the mode, γ_M, proposed by Arnold and Groeneveld (1995). Pearson's moment coefficient of skewness of a random variable T is defined as sk = E[((T − E(T))/√Var(T))³], while the measure of skewness with respect to the mode is γ_M = 1 − 2F_T(M_T), where M_T is the mode of T and F_T denotes its distribution function. The last functional of G_i we consider is the probability, under the density p_i(y) = ∫_Θ k(y|θ)G_i(dθ), of getting a passing grade (≥ 4.0 before standardization), that is P_4i = ∫_{(4−ȳ)/s_y}^{+∞} p_i(y) dy.

Table 3: Posterior means of the functionals µ_i, . . . , P_4i of the population density p_i for Sections A (i = 1), B (i = 2) and C (i = 3) in the Chilean grades dataset. All the functionals refer to the standardized data {y_new_ij = (y_ij − ȳ)/s_y}.

Table 3 shows the posterior means of the functionals µ_i, σ²_i (mean and variance functionals), sk_i, γ_Mi and P_4i of p_i, for i = 1, 2, 3. To be clear, the posterior mean of the mean functional µ_1 is computed as (1/M) ∑_{ℓ=1}^{M} µ_1^{(ℓ)}, where M is the MCMC sample size and the superscript (ℓ) attached to a random variable denotes its value at the ℓ-th MCMC iteration.
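Given a density evaluated on an equally spaced grid, the mean, variance, Pearson skewness and passing probability can be approximated by simple Riemann sums; a sketch (the renormalization for grid truncation is our own choice, and the function name is illustrative):

```python
def density_functionals(grid, dens, pass_cut):
    """Riemann-sum approximations of the mean, variance, Pearson moment
    skewness and P(Y >= pass_cut) of a density evaluated on an equally
    spaced grid; the density is renormalized to correct for grid truncation."""
    h = grid[1] - grid[0]
    mass = sum(dens) * h
    d = [p / mass for p in dens]  # renormalized density values
    mu = sum(x * p for x, p in zip(grid, d)) * h
    var = sum((x - mu) ** 2 * p for x, p in zip(grid, d)) * h
    sk = sum((x - mu) ** 3 * p for x, p in zip(grid, d)) * h / var ** 1.5
    p_pass = sum(p for x, p in zip(grid, d) if x >= pass_cut) * h
    return mu, var, sk, p_pass
```

Applying this to each MCMC draw of p_i and averaging over iterations gives the posterior means reported in Table 3.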
In agreement with the posterior distribution of the partition ρ, for all the functionals considered we observed close values for sections A and C, while both differ significantly from the values for section B. In summary, we conclude that section B presents a heavier right tail than sections A and C, and is hence characterized by a higher mean and a larger spread across the range. Section B also shows a larger (estimated) value of P_4, i.e. students in section B are more likely to pass the exam than their colleagues from the other sections. This seems to suggest that a higher concentration of good students (with high grades) was present in section B, compared to A and C, possibly combined with a higher effectiveness of the instructor of this section. We also computed the pairwise L¹ distances between the estimated densities of the populations. If p̂_i denotes the estimated density (posterior mean of p_i evaluated on a grid of points) for each population, we found d(p̂_A, p̂_B) = 0.56, d(p̂_A, p̂_C) = 0.15 and d(p̂_B, p̂_C) = 0.44. This confirms once again that the estimated densities for sections A and C are closer to each other than either is to section B.
To end the analysis, we show in Figure 9 the estimated pairs of densities (p_i, p_ℓ), i ≠ ℓ, i, ℓ = 1, 2, 3, i.e. the posterior mean of (p_i, p_ℓ) evaluated on a fixed grid in R². While sections A and C look independent (central panel in Figure 9), the (posterior) propensity of section B to get higher grades is confirmed in the left and right panels of Figure 9.

Discussion
Motivated by the traditional problem of testing homogeneity across I different groups or populations, we have presented a model that is able not only to address this problem but also to perform a cluster analysis of the groups. The model is built on a prior for the population distributions that we termed the semi-hierarchical Dirichlet process; it was shown to have good properties and to perform well in synthetic and real data examples, including the case of I = 100 groups. One of the driving features of our proposal was to solve the degeneracy limitation of nested constructions pointed out by Camerlenghi et al. (2019a). The crucial aspect of the semi-HDP that solves this problem was described using the metaphor of a food court of Chinese restaurants with common and private dining areas. The hierarchical construction introduces a random partition at the population level, which allows for the identification of possible clusters of internally homogeneous groups.
Our examples focus on unidimensional data, though extensions to multivariate responses can be straightforwardly accommodated in our framework. However, scaling with respect to the data dimension is not a property we claim to have; in fact, this limitation is shared by any type of hierarchical mixture model.
We studied support properties of the semi-HDP and also the posterior asymptotic behavior of the Bayes factor for the homogeneity test when I = 2, as posed within the proposed hierarchical construction. We showed that the Bayes factor has the appropriate asymptotic behavior under the alternative hypothesis of partial exchangeability, but a final answer under the assumption of truly exchangeable data is still pending. This lack of asymptotic guarantees is not at all specific to our case. In fact, the situation is common to all model selection problems in which the hypotheses are not well separated and at least one of the two models under comparison is "truly" nonparametric, as, for instance, in Bhattacharya and Dunson (2012) and Tokdar and Martin (2019). Indeed, as discussed in Tokdar and Martin (2019), it is not even clear whether in such cases the need for an upper bound on the prior mass under the more complex model is a natural requirement or rather a technical one. More generally, the intuition about Bayes factors (at least in parametric cases) is that they tend to favor the more parsimonious model. In the particular context described in Section 3.3, model M_1 can be regarded as a degenerate case of model M_2, even though they are "equally complicated". Here the above intuition evaporates: technically, embedding one model in the other still means one infinite-dimensional model contained in another infinite-dimensional model, and it is probably meaningless to ask which model is "simpler". Under this scenario, exploratory use of discrepancy measures, such as those discussed in Gelman et al. (1996), may offer some guidance.
In the simulation studies presented, our model always recovers the true latent clustering among groups, thus providing empirical evidence in favor of using our model to perform homogeneity tests; we also provide some practical suggestions for when the actual interest lies in making this decision. Our insight is that, in order to prove asymptotic consistency of the Bayes factor, one should introduce explicit separation between the competing hypotheses. One possible way to accomplish this goal is, for example, to introduce some kind of repulsion among the mixing measures F_i in the model. This point will be the focus of future work.

A Proofs
Proof of Proposition 3.1. Consider I = 2 for ease of exposition. We aim at showing that, under suitable choices of G_0 and G_00, the vector of random probability measures (G_1, G_2), where G_i = F_{c_i}, has full support on P_Θ × P_Θ.
This means that for every couple of distributions (g_1, g_2) ∈ P_Θ × P_Θ, every weak neighborhood W_1 × W_2 of (g_1, g_2) receives non-null probability. Hence, since we are assuming that π_c(l, m) > 0 for all l, m, it is sufficient to show that π_{F_1,F_2}, the measure associated to the semi-HDP prior with I = 2, has full weak support.
In the following, with a slight abuse of notation, we denote by π_{F_1,F_2 | G}(W_1 × W_2) the measure associated to the semi-HDP prior, conditional on a particular value of G. We distinguish three cases: κ = 1, 0 < κ < 1 and κ = 0. The case κ = 1 is trivial, since F_1 and F_2 are marginally independently distributed according to a Dirichlet process prior, so that π_{F_1,F_2}(W_1 × W_2) = D_{α G_0}(W_1) D_{α G_0}(W_2) > 0 as long as G_0 has full support on Θ (see, for example, Ghosal and Van der Vaart, 2017). Secondly, consider 0 < κ < 1; we show that as long as G_0 has full support, π_{F_1,F_2} has full support as well, regardless of the properties of G_00. We have

    π_{F_1,F_2}(W_1 × W_2) = ∫ D_{α P}(W_1) D_{α P}(W_2) L(dG),    P = κ G_0 + (1 − κ) G.    (24)

Now observe that if G_0 has full support, then P = κ G_0 + (1 − κ) G also has full support, for any value of G. Hence, by the properties of the Dirichlet process, we get π_{F_1,F_2}(W_1 × W_2) > 0, since the integrand in (24) is bounded away from zero.
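For intuition, the stick-breaking representation makes the role of κ transparent: each atom of F_i comes from G_0 with probability κ and from the common measure G with probability 1 − κ. The following is a minimal truncated sketch under illustrative choices (G_0 = G_00 = N(0, 1); the function names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, n_atoms, rng):
    """Truncated stick-breaking weights of a Dirichlet process."""
    betas = rng.beta(1.0, alpha, size=n_atoms)
    w = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return w / w.sum()  # renormalize the truncated tail

def sample_semi_hdp_mixing(kappa, alpha, gamma, n_atoms=500):
    """One draw F_i | G ~ DP(alpha, kappa*G0 + (1 - kappa)*G),
    with G ~ DP(gamma, G00) and G0 = G00 = N(0, 1) for illustration."""
    g_w = stick_breaking(gamma, n_atoms, rng)      # weights of the common G
    g_atoms = rng.normal(size=n_atoms)             # atoms of G, from G00
    w = stick_breaking(alpha, n_atoms, rng)        # weights of F_i
    from_g0 = rng.random(n_atoms) < kappa          # source flag for each atom
    shared = g_atoms[rng.choice(n_atoms, size=n_atoms, p=g_w)]
    atoms = np.where(from_g0, rng.normal(size=n_atoms), shared)
    return atoms, w

atoms, w = sample_semi_hdp_mixing(kappa=0.5, alpha=1.0, gamma=1.0)
```

With κ strictly between 0 and 1, a draw mixes fresh atoms from G_0 (almost surely new) with atoms of G shared across groups, which is exactly why this case inherits full support from G_0.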
The case κ = 0 is more delicate and requires additional work. We follow the path outlined in De Blasi et al. (2013), extending it to our hierarchical case. In the following, let S_m denote the (m − 1)-dimensional simplex, i.e. S_m = {(p_1, …, p_m) : p_j ≥ 0, Σ_{j=1}^m p_j = 1}.
Let d_w denote the Prokhorov metric on P_Θ, which, as is well known, metrizes the topology of weak convergence on P_Θ. Moreover, since Θ is separable, (P_Θ, d_w) is separable as well, and the set of discrete measures with a finite number of point masses is dense in P_Θ.
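This density claim can be checked empirically: the empirical measure of an iid sample is a finitely supported discrete measure and converges weakly to the sampling distribution. A small sketch of our own, using the Kolmogorov distance as a computable proxy (convergence in this distance implies weak convergence):

```python
import math
import numpy as np

rng = np.random.default_rng(42)

def norm_cdf(x):
    """CDF of the standard normal, used here as the target distribution g."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def kolmogorov_dist(sample):
    """sup_x |F_n(x) - Phi(x)| between the empirical CDF and N(0, 1)."""
    xs = np.sort(sample)
    n = len(xs)
    cdf = np.array([norm_cdf(x) for x in xs])
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)   # right limits of F_n
    d_minus = np.max(cdf - np.arange(0, n) / n)      # left limits of F_n
    return max(d_plus, d_minus)

d_small = kolmogorov_dist(rng.normal(size=50))     # crude discrete approximation
d_large = kolmogorov_dist(rng.normal(size=5000))   # much finer one
```

The distance shrinks as the number of point masses grows, in line with the density of finitely supported measures in P_Θ.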
Hence, for any (g_1, g_2) and any ε > 0, there exist two discrete measures F_{p^(i),x^(i)} = Σ_{j=1}^{k_i} p^(i)_j δ_{x^(i)_j}, with weights p^(i) ∈ S_{k_i} and points x^(i) ∈ Θ^{k_i}, such that d_w(F_{p^(i),x^(i)}, g_i) < ε/2, i = 1, 2. The difficulty when κ = 0 is that, conditionally on G, the measure D_{α G} does not have full weak support: indeed, its support is concentrated on the measures having the same atoms as G. The proof proceeds as follows. We start by defining weak neighborhoods W_i of F_{p^(i),x^(i)} by looking at neighborhoods U_i of their weights p^(i) and V_i of their atoms x^(i). Secondly, we join these neighborhoods. If G(ω) belongs to this union (and this occurs with positive probability), we guarantee that the atoms of both F_1 and F_2, which are shared with G, are suited to approximate both F_{p^(i),x^(i)}, i = 1, 2. Hence, by the properties of the Dirichlet process, one gets the support property.
More in detail, define the sets V_i = ∪_{j=1}^{k_i} {x ∈ Θ : |x − x^(i)_j| < δ}, i = 1, 2, and let V = V_1 ∪ V_2. Then we operate a change of index by concatenating x^(1) and x^(2) into x* = [x^(1), x^(2)], so that V can be characterized as V = ∪_{j=1}^{k_1+k_2} {x ∈ Θ : |x − x*_j| < δ}. Secondly, define p* by concatenating p^(1) and p^(2), p* = [p^(1), p^(2)], and let

    U_1(η) = {p ∈ S_{k_1+k_2} s.t. |p_j − p*_j| < η for j = 1, …, k_1, |p_j − 0| < η elsewhere},
    U_2(η) = {p ∈ S_{k_1+k_2} s.t. |p_j − p*_j| < η for j = k_1 + 1, …, k_1 + k_2, |p_j − 0| < η elsewhere}.

Finally, define the following neighborhoods:

    W_i := {Σ_{j=1}^{k_1+k_2} p_j δ_{x_j} for any p ∈ U_i and any x_1, …, x_{k_1+k_2} ∈ V}, i = 1, 2,
    W_0 := {Σ_{j=1}^{k_1+k_2} p_j δ_{x_j} for any p ∈ S_{k_1+k_2} and any x_1, …, x_{k_1+k_2} ∈ V}.

This means that the V_i are the neighborhoods of the atoms x^(i) that are well suited to approximate F_{p^(i),x^(i)}, and V is their union. The sets U_i, i = 1, 2, instead, are related to the weights of F_{p^(i),x^(i)}. In particular, each U_i is constructed so as to approximate well p^(i) (a vector in S_{k_i}) with a vector of weights in S_{k_1+k_2}. This is necessary because if G has support points in V, so will the draws F_1 and F_2 from D_{α G}. However, by assigning a negligible weight in U_1 to the atoms x^(2), and vice-versa for the atoms x^(1) in U_2, we guarantee that the probability measures in W_i constitute a weak neighborhood of F_{p^(i),x^(i)} for each i = 1, 2.
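Numerically, the padding of p^(1) and p^(2) into S_{k_1+k_2} looks as follows (a toy sketch of our own, with hypothetical weight vectors; `pad_weights` is not a name from the paper):

```python
import numpy as np

def pad_weights(p, k_total, offset):
    """Embed p in the (k_total - 1)-simplex: put (almost) all mass on
    coordinates offset..offset+len(p)-1 and a negligible eps elsewhere."""
    eps = 1e-9
    out = np.full(k_total, eps)
    out[offset:offset + len(p)] = p
    return out / out.sum()   # renormalize so the result is in the simplex

p1 = np.array([0.3, 0.7])          # weights of F_{p^(1), x^(1)}, k1 = 2
p2 = np.array([0.2, 0.5, 0.3])     # weights of F_{p^(2), x^(2)}, k2 = 3
k = len(p1) + len(p2)

q1 = pad_weights(p1, k, offset=0)        # a point of U_1(eta)
q2 = pad_weights(p2, k, offset=len(p1))  # a point of U_2(eta)
```

Both padded vectors live in the same simplex S_{k_1+k_2}, yet q1 is close to p^(1) on its own coordinates and nearly zero on the others, mirroring the construction of U_1 and U_2.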
From De Blasi et al. (2013), it is sufficient to show that π_{F_1,F_2}(W_1 × W_2) > 0, since for appropriate choices of η and δ one has d_w(F_1, g_1) + d_w(F_2, g_2) < ε for all F_1 ∈ W_1 and F_2 ∈ W_2. Hence

    π_{F_1,F_2}(W_1 × W_2) = ∫ D_{α G}(W_1) D_{α G}(W_2) L(dG) ≥ ∫_{W_0} D_{α G}(W_1) D_{α G}(W_2) L(dG).

Now observe that for any G ∈ W_0 we have D_{α G}(W_i) > 0. This follows again from the properties of the Dirichlet process: for any value of G(ω) ∈ W_0, the set W_i contains a non-empty weak neighborhood of measures whose support is contained in the support of the base measure G, and the Dirichlet process gives positive probability to such neighborhoods. □

Proof of Covariance of the semi-HDP.
The last equality follows because G is a Dirichlet process. □

Higher order moments.
To compute higher order moments, we make use of a result from Argiento et al. (2019). Let F_1 | P ~ D_{α P} as in (5)-(7); then one has, for any set A ∈ B(Θ),

    E[F_1(A)^n | P] = Σ_{k=1}^n Pr(K_n = k) P(A)^k,

where K_n is the random variable representing the number of clusters in a sample of size n; see (15) in Argiento et al. (2019). If, as in our case, the base measure is not absolutely continuous, the term "clusters" might be misleading, as they do not coincide with the unique values in the sample, but rather with the tables in the Chinese restaurant process. In the following we refer to cluster or table interchangeably. Hence, recalling that P = κ G_0 + (1 − κ) G, we have

    E[F_1(A)^n] = Σ_{k=1}^n Pr(K_n = k) Σ_{h=0}^k (k choose h) κ^{k−h} (1 − κ)^h G_0(A)^{k−h} E[G(A)^h],

with E[G(A)^h] = Σ_{j=1}^h Pr(K_h = j) G_00(A)^j for h ≥ 1, by the same argument, where K_h is the number of clusters from a sample of size h from the DP G. Moreover, if we assume G_0 = G_00, we get

    E[F_1(A)^n] = Σ_{k=1}^n Pr(K_n = k) Σ_{h=0}^k (k choose h) κ^{k−h} (1 − κ)^h Σ_{j=0}^h Pr(K_h = j) G_0(A)^{k−h+j},

with the convention Pr(K_0 = 0) = 1.
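The moment formulas above depend on Pr(K_n = k), the distribution of the number of tables in a Chinese restaurant process. As a minimal sketch (the function name is ours, and this is the standard one-customer-at-a-time recursion rather than the formula in Argiento et al., 2019), this pmf can be computed as:

```python
import numpy as np

def table_count_pmf(n, alpha):
    """Pr(K_n = k) for k = 1, ..., n: distribution of the number of
    tables after n customers in a CRP with concentration alpha."""
    pmf = np.zeros(n + 1)   # pmf[k] = Pr(K_i = k) for the current i
    pmf[1] = 1.0            # the first customer opens the first table
    for i in range(2, n + 1):
        p_new = alpha / (alpha + i - 1)   # prob. customer i opens a new table
        new = np.zeros(n + 1)
        new[1:] = pmf[1:] * (1 - p_new) + pmf[:-1] * p_new
        pmf = new
    return pmf[1:]

pmf = table_count_pmf(10, 1.0)
# sanity check: E[K_n] = sum_{i=1}^n alpha / (alpha + i - 1)
mean_k = float((pmf * np.arange(1, 11)).sum())
```

The mean of the computed pmf matches the well-known expectation Σ_{i=1}^n α/(α + i − 1), which for α = 1 is the harmonic number H_n.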

Proof of Proposition 3.2.
Denoting by τ_j the unique values shared between θ_1 and θ_2, and by θ*_{ij} the unique values in sample θ_i that are specific to group i (i.e. not shared), the pEPPF, given c, can be written as in (23) of Camerlenghi et al. (2019a). Marginalizing out c, we obtain

    Π^{(N)}_k(n_1, n_2, q_1, q_2) = Σ_{l,m} π_c(c = (l, m)) Π^{(N)}_k(n_1, n_2, q_1, q_2 | c = (l, m)).
The cases c = (1, 1) and c = (2, 2) are easily handled, as they correspond to full exchangeability, and the EPPF for those cases is already available. Hence, let us consider the case c = (1, 2); the case c = (2, 1) is identical because the F_i's are iid.
Observe that, conditionally on G, the pEPPF factorizes over the restaurants, where ρ_{r_i} denotes the partition induced by the clusters in restaurant r_i; we use the same definition of cluster as in Argiento et al. (2019). We underline that {θ*_{r_i j}, j = 1, …, H_{r_i}} are not the unique values in the sample, since the base measure is atomic. Hence the resulting expression involves, for each restaurant, the product ∏_{j=1}^{H_{r_i}} P(dθ*_{r_i j}), integrated with respect to L(dG).
Now observe that the values {θ*_{rj} : r = 1, …, R, j = 1, …, H_r} are all iid from P, so there is no need for the division into restaurants anymore. We can thus stack all the vectors θ*_r together and apply a change of indices (r, j) → l, so that these values are represented by (θ*_1, …, θ*_L). Now, as done in Section 2.2, we introduce a set of latent variables h = (h_1, …, h_L), with h_l iid ~ Bernoulli(κ), indicating whether θ*_l is drawn from G_0 (h_l = 1) or from G (h_l = 0). Here η is the partition of {θ*_l : l = 1, …, L and h_l = 0}, i.e. the partition of Σ_{l=1}^L (1 − h_l) objects arising from the Dirichlet process G, {θ**_k} are the unique values among {θ*_l : l = 1, …, L and h_l = 0}, and p(h) = ∏_{l=1}^L κ^{h_l} (1 − κ)^{1−h_l} is the joint distribution of h. □

Proof of Proposition 3.4. Model M_2 defines a prior Π_2 on the space of densities (p, q) ∈ P_Y × P_Y. On the other hand, model M_1 defines a prior on P_Y. However, by embedding P_Y in the product space P_Y × P_Y via the mapping p → (p, p), we can also consider the prior Π_1 induced by model M_1 as a measure on (a subset of) P_Y × P_Y. Showing that Π_2 satisfies the Kullback-Leibler property is a straightforward application of Theorem 3 in Wu and Ghosal (2008), under the same set of assumptions on the kernel k(·|θ) and on p_0 and q_0, which we do not report here. Notice that these assumptions are satisfied when k(·|θ) is the univariate Gaussian kernel parametrized by mean and scale, under standard regularity conditions on p_0 and q_0. We now turn our attention to Π_1. Clearly, Π_1 does not have the Kullback-Leibler property in the larger space P_Y × P_Y, since it gives positive mass only to the set {(p, q) ∈ P_Y × P_Y : p = q}. Consequently, if p_0 ≠ q_0, for small enough δ one has Π_1((p, q) : D_KL((p, q), (p_0, q_0)) < δ) = 0, thus proving that Π_1 does not have the Kullback-Leibler property.
In summary, under the same assumptions on p_0, q_0 and the kernel k(·|θ) as in Ghosal et al. (2008), and assuming p_0 ≠ q_0, we are comparing a model (M_2) that has the Kullback-Leibler property against one (M_1) that does not. Theorem 1 in Walker et al. (2004) then ensures Bayes factor consistency. □

Proof of Equation (22). Let p_r = Σ_{i=1}^{H_r} w_{ri} N(μ_{ri}, σ²_{ri}) and p_m = Σ_{j=1}^{H_m} w_{mj} N(μ_{mj}, σ²_{mj}) be the mixture densities associated to the mixing measures F_r and F_m respectively. Observe that both H_r and H_m are finite here, as F_r and F_m have been approximated as shown in the description of the Gibbs sampler in Section 4. Then

    d²(F_r, F_m) = ||p_r − p_m||²_{L_2} = ∫ (p_r(y) − p_m(y))² dy
                 = ∫ ( Σ_{i=1}^{H_r} w_{ri} N(y; μ_{ri}, σ²_{ri}) − Σ_{j=1}^{H_m} w_{mj} N(y; μ_{mj}, σ²_{mj}) )² dy.

For any value of y, the integrand expands to

    ( Σ_{i=1}^{H_r} w_{ri} N(y; μ_{ri}, σ²_{ri}) )² + ( Σ_{j=1}^{H_m} w_{mj} N(y; μ_{mj}, σ²_{mj}) )² − 2 Σ_{i=1}^{H_r} w_{ri} N(y; μ_{ri}, σ²_{ri}) Σ_{j=1}^{H_m} w_{mj} N(y; μ_{mj}, σ²_{mj}).

Each term on the right-hand side can be expressed as a product of two summations, say (Σ_i a_i)(Σ_j b_j) = Σ_{i,j} a_i b_j; when {a_i} and {b_j} coincide, this further reduces to Σ_{i,i'} a_i a_{i'}. Each resulting integral is then available in closed form, since ∫ N(y; μ_1, σ²_1) N(y; μ_2, σ²_2) dy = N(μ_1; μ_2, σ²_1 + σ²_2).
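The closed form of Equation (22) is easy to implement and to check against numerical integration. A minimal sketch (function names are ours; it relies only on the Gaussian product identity ∫ N(y; μ_1, σ²_1) N(y; μ_2, σ²_2) dy = N(μ_1; μ_2, σ²_1 + σ²_2)):

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def cross_term(w_a, mu_a, var_a, w_b, mu_b, var_b):
    """sum_{i,j} w_a[i] w_b[j] int N(y; mu_a[i], var_a[i]) N(y; mu_b[j], var_b[j]) dy,
    via the identity int N(y; m1, v1) N(y; m2, v2) dy = N(m1; m2, v1 + v2)."""
    total = 0.0
    for wi, mi, vi in zip(w_a, mu_a, var_a):
        for wj, mj, vj in zip(w_b, mu_b, var_b):
            total += wi * wj * normal_pdf(mi, mj, vi + vj)
    return total

def l2_sq_distance(mix_r, mix_m):
    """Squared L2 distance between two Gaussian mixture densities;
    each mixture is a triple (weights, means, variances)."""
    return (cross_term(*mix_r, *mix_r)
            + cross_term(*mix_m, *mix_m)
            - 2.0 * cross_term(*mix_r, *mix_m))

mix_r = ([0.4, 0.6], [0.0, 3.0], [1.0, 0.5])   # hypothetical mixture p_r
mix_m = ([1.0], [0.0], [1.0])                  # hypothetical mixture p_m
d = l2_sq_distance(mix_r, mix_m)
```

The three `cross_term` calls correspond exactly to the three terms in the expansion of the squared integrand above.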

B Discussion of Bayes Factor consistency in the homogeneous case
When p_0 = q_0, consistency of the Bayes factor would require BF_12 → +∞. This is a result we have not been able to prove so far, but it is worth pointing out the following relevant issues. To begin with, note that both models M_1 and M_2 have the Kullback-Leibler property. Several papers discuss this case, for example Corollary 3.1 in Ghosal et al. (2008), Section 5 in Chib and Kuffner (2016) and Corollary 3 in Chatterjee et al. (2020), the latter in the general setting of dependent data. For more specific applications, refer also to Tokdar and Martin (2019), where the focus is on testing Gaussianity of the data against a Dirichlet process mixture alternative, Mcvinish et al. (2009) for goodness-of-fit tests using mixtures of triangular distributions, and Bhattacharya and Dunson (2012) for data distributed over non-Euclidean manifolds. As pointed out in Tokdar and Martin (2019), the hypotheses of Corollary 3.1 in Ghosal et al. (2008) are usually difficult to verify, since they require a lower bound on the prior mass that Π_2 assigns to neighborhoods of (p_0, p_0) ∈ P_Y × P_Y. To the best of our knowledge, bounds of this kind have been derived only for the very special mixtures in Mcvinish et al. (2009). Similarly, the approach of Chib and Kuffner (2016) would require knowledge of such lower bounds too (see, for instance, their Assumption 3). Corollary 3 in Chatterjee et al. (2020) does not apply to our case either, because one of their main assumptions requires that both models specify a population distribution (i.e. a likelihood) with density w.r.t. a common σ-finite measure, together with the true distribution of the data. In our case, M_1 specifies random probability measures that are absolutely continuous w.r.t. the Lebesgue measure on R, while under model M_2 the random probability measures have density w.r.t. the Lebesgue measure on R².

C Relabeling step
In the following, we adopt a slightly different notation to simplify the pseudocode. Figure 11 depicts the state at a particular iteration. We denote by ψ_{rh} the atoms in restaurant r arising from G_0 and by τ_h the atoms arising from G_00. Observe how in restaurant 1 the value τ_2 appears more than once.
In our implementation, the state is composed of ψ_r, τ (i.e. all the unique values of the atoms) and the indicator variables {t_{rl}} and {h_{rl}}, which let us reconstruct the value of θ*_{rl}.
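As a toy illustration of this bookkeeping (names and values are ours, not the actual implementation), θ*_{rl} can be reconstructed from the indicators as follows; mirroring the remark about Figure 11, the shared atom tau[1] (i.e. τ_2) appears twice in restaurant 1:

```python
def reconstruct_atom(r, l, t, h, psi, tau):
    """Rebuild theta*_{rl}: h[r][l] == 1 means a restaurant-specific atom
    psi[r][t[r][l]] (from G0); h[r][l] == 0 means a shared atom
    tau[t[r][l]] (from G00)."""
    return psi[r][t[r][l]] if h[r][l] == 1 else tau[t[r][l]]

# toy state: two restaurants, private atoms psi, shared atoms tau
psi = [[1.5, -0.2], [0.7]]
tau = [10.0, 20.0]
t = [[0, 1, 1], [0, 0]]    # per-observation atom indices t_{rl}
h = [[1, 0, 0], [1, 0]]    # h_{rl}: 1 = private atom, 0 = shared atom

thetas_r1 = [reconstruct_atom(0, l, t, h, psi, tau) for l in range(3)]
# thetas_r1 == [1.5, 20.0, 20.0]: the shared atom tau[1] appears twice
```

Storing only the unique atoms plus the two indicator arrays keeps the state compact while making every θ*_{rl} recoverable in constant time.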