Generalized mixtures of finite mixtures and telescoping sampling

Within a Bayesian framework, a comprehensive investigation of mixtures of finite mixtures (MFMs), i.e., finite mixtures with a prior on the number of components, is performed. This model class has applications in model-based clustering as well as for semi-parametric density estimation and requires suitable prior specifications and inference methods to exploit its full potential. We contribute by considering a generalized class of MFMs where the hyperparameter $\gamma_K$ of a symmetric Dirichlet prior on the weight distribution depends on the number of components. We show that this model class may be regarded as a Bayesian non-parametric mixture outside the class of Gibbs-type priors. We emphasize the distinction between the number of components $K$ of a mixture and the number of clusters $K_+$, i.e., the number of filled components given the data. In the MFM model, $K_+$ is a random variable and its prior depends on the prior on $K$ and on the hyperparameter $\gamma_K$. We employ a flexible prior distribution for the number of components $K$ and derive the corresponding prior on the number of clusters $K_+$ for generalized MFMs. For posterior inference, we propose the novel telescoping sampler which allows Bayesian inference for mixtures with arbitrary component distributions without resorting to reversible jump Markov chain Monte Carlo (MCMC) methods. The telescoping sampler explicitly samples the number of components, but otherwise requires only the usual MCMC steps of a finite mixture model. The ease of its application using different component distributions is demonstrated on several data sets.


1 Introduction
The present paper contributes to Bayesian mixture analysis where the number of components K is unknown and a prior on K is specified. This class of mixtures of finite mixtures (MFMs) has a long tradition in Bayesian mixture modeling (Richardson and Green, 1997; Nobile, 2004; McCullagh and Yang, 2008) and has gained recent attention from Miller and Harrison (2018), Geng et al. (2019), and Xie and Xu (2020), among others.
Previously considered MFMs differ with respect to prior specifications on K and the component weights. We combine the different approaches into a generalized MFM model specification. We base our considerations on the crucial distinction between the number of components K in the mixture distribution and the number of clusters K_+ in the data, which is defined as the number of "filled" mixture components used to generate the observed data. This fundamental distinction between K and K_+ has always been prevalent in Bayesian non-parametric (BNP) mixture analysis, see, e.g., the recent work by Argiento and De Iorio (2019). In applied finite mixture analysis, however, it is still common to assume that K and K_+ are the same entity, despite earlier work by Nobile (2004), McCullagh and Yang (2008), and, more recently, Miller and Harrison (2018).
Dirichlet process mixtures (DPMs) are the most popular BNP mixture approach. Their focus naturally lies on inference on the number of clusters, with K fixed at +∞. For DPMs, the number of clusters grows as K_+ ∼ α log(N) as the number of observations N increases. Doubts about the usefulness of DPMs for clustering have been voiced for many years and, indeed, Miller and Harrison (2013) proved inconsistency of DPMs for the number of clusters even in the simple case of univariate Gaussian mixtures with unit variances. As a two-parameter alternative to DPMs, Pitman-Yor process mixtures were introduced in the BNP literature by Pitman and Yor (1997). Malsiner-Walli et al. (2016, 2017) introduced sparse finite mixtures (SFMs) in the context of applied finite mixture analysis. As shown by De Blasi et al. (2015), both model classes are closely connected. SFMs choose a fixed, clearly overfitting value of K in the spirit of Rousseau and Mengersen (2011) and a symmetric Dirichlet prior on the weight distribution with a very small hyperparameter γ_K. Whereas K is fixed, this choice allows the number of clusters K_+ to be a random variable taking values smaller than K. However, the larger K, the smaller γ_K has to be, motivating the "dynamic" SFM introduced in Frühwirth-Schnatter and Malsiner-Walli (2019), where γ_K = α/K was chosen with α a hyperparameter independent of K.
The class of generalized MFMs we introduce in this paper is a finite mixture model with a prior on K, where the hyperparameter γ_K may change as a function of K. We consider two special cases of this specification. The static MFM uses a fixed value γ_K ≡ γ. The dynamic MFM uses γ_K = α/K and can be regarded as a dynamic SFM with a prior on K. This MFM specification, considered previously in McCullagh and Yang (2008), is less common in applied finite mixture analysis than the static MFM. McCullagh and Yang (2008) conjecture that the static and dynamic versions of the MFM are quite different. We shed light on this by investigating the exchangeable partition probability function (EPPF), i.e., the prior induced on the random partition of the data (Pitman, 1995) by the generalized MFM, and by discussing its specific form for static and dynamic MFMs. As shown in the seminal work by Gnedin and Pitman (2006), the static MFM considered in Richardson and Green (1997) and Miller and Harrison (2018) is equivalent to a BNP mixture with a Gibbs-type prior on the random partitions. Based on the EPPF of the generalized MFM, we show that the static MFM is the only mixture within this class that induces a Gibbs-type prior. Any specification where the hyperparameter γ_K varies with K leads to a BNP mixture beyond Gibbs-type priors. We focus on the dynamic MFM where γ_K = α/K is inversely proportional to the number of components and show that it converges to a DPM with concentration parameter α if the prior p(K) puts all mass on +∞. Hence, while staying within the finite mixture framework, the dynamic MFM can be regarded as a "natural generalization" of the celebrated Dirichlet process prior beyond the class of Gibbs-type priors.
A tremendous challenge for Bayesian mixtures with an unknown number of components is practical statistical inference. To this aim, Richardson and Green (1997) introduced reversible jump Markov chain Monte Carlo (RJMCMC) for static MFMs with univariate Gaussian components. Exploiting the fact that static MFMs correspond to Gibbs-type priors, Miller and Harrison (2018) introduced sampling techniques from BNP statistics to finite mixture analysis. Applying the Chinese restaurant process (CRP) sampler of Jain and Neal (2004, 2007), they sample the partitions and, in this way, the number of clusters K_+, and infer the number of components K in a post-processing step by linking the distribution of K to the distribution of K_+.
In this paper, we introduce a novel MCMC algorithm for generalized MFMs, called the telescoping sampler, which updates the number of clusters K_+ and the number of components K simultaneously during sampling without resorting to RJMCMC. As opposed to the CRP sampler, the telescoping sampler also works outside the class of Gibbs-type priors. Sampling K only depends on the current partition of the data and is independent of the component parameters. This makes our sampler a generic inference tool for finite mixture models with an unknown number of components which can be applied to arbitrary mixture families. Our sampler is easily implemented, for instance, for multivariate Gaussian mixtures with an unknown number of components, and thus provides an attractive alternative to RJMCMC, which is challenging to tune in higher dimensions, see, e.g., Dellaportas and Papageorgiou (2006).
The paper is structured as follows. In Section 2, we present the generalized MFM model and derive the EPPF. Section 3 proposes the beta-negative-binomial distribution as a prior on K and derives the prior on the number of clusters K_+ for a generalized MFM. Section 4 discusses connections between applied finite mixture analysis based on MFMs and BNP mixtures. Our novel MCMC sampler is presented in Section 5 and benchmarked against RJMCMC and the CRP sampler in Section 6. Additionally, MFMs with various uni- and multivariate component densities are applied both to artificial and real data of varying dimension and sample size. Section 7 concludes.
2 Generalized mixtures of finite mixture models
Model (2.1) contains the finite mixture model with a prior on the number of components K studied by Richardson and Green (1997) and Miller and Harrison (2018), who termed this model a mixture of finite mixtures (MFM), as the special case where γ_K ≡ γ. As noted by Miller and Harrison (2018), assuming the same γ for all K is a "genuine restriction" which considerably simplifies the derivation of the implied partition distribution, a crucial ingredient to their inference algorithm. McCullagh and Yang (2008) extend this "static" MFM with constant γ by specifying a "dynamic" MFM where γ_K = α/K is inversely proportional to K and depends on a hyperparameter α, i.e., η_K|K, α ∼ D_K(α/K).
For a given K, K_+ is defined as the number of components that generated the data, i.e., $K_+ = \sum_{k=1}^{K} I\{N_k > 0\}$, where $N_k = \#\{i : S_i = k\}$ counts the observations generated by component k. In the following we refer to K_+ as the number of clusters. Including a prior p(K) makes both K_+ and K random a priori. As opposed to the common perception that for a finite mixture K_+ given K is deterministic and equal to K, we show in Section 3 that the sequence of hyperparameters γ = {γ_K} has a crucial impact on the induced prior of the data partitions and the number of clusters K_+. For a static MFM with γ = 1 (Richardson and Green, 1997; Miller and Harrison, 2018), e.g., the prior expected number of clusters, E(K_+|N, γ = 1), is indeed close to E(K) for many priors p(K) with finite mean, even for small N. However, letting γ_K decrease with increasing K induces randomness in the prior distribution of K_+ given K, allowing for a gap between K_+ and K for a wide range of α and N values.
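The gap between K_+ and K described here is easy to see by simulation. The following sketch is our own illustration, not part of the paper's algorithms: for a fixed K it draws a weight vector from the symmetric Dirichlet prior, allocates N observations multinomially, and counts the filled components, once with a static-style choice γ_K = 1 and once with the dynamic choice γ_K = α/K. The function name `simulate_K_plus` is ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_K_plus(K, gamma_K, N, draws=2000):
    """Monte Carlo draws of the number of filled components K_+ given K:
    eta ~ D_K(gamma_K), then N observations are allocated multinomially."""
    eta = rng.dirichlet(np.full(K, gamma_K), size=draws)
    counts = np.array([rng.multinomial(N, w) for w in eta])
    return (counts > 0).sum(axis=1)

K, N, alpha = 10, 100, 1.0
kp_static = simulate_K_plus(K, 1.0, N)         # gamma_K = 1 (static-style)
kp_dynamic = simulate_K_plus(K, alpha / K, N)  # gamma_K = alpha/K (dynamic)
```

Under γ_K = 1 almost all of the K = 10 components are filled on average, while under γ_K = α/K far fewer are, illustrating the prior gap between K_+ and K induced by letting γ_K decrease with K.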
Under model (2.1), the joint distribution of the data y = (y_1, ..., y_N) has a representation as a countably infinite mixture of finite mixtures with K components, given in (2.2). The type of mixtures summed over in (2.2) varies with the prior parameter γ_K of the component weights. Using a symmetric Dirichlet prior, a priori the mean of the component weights given K equals a vector of dimension K with all values 1/K. However, the variance decreases with increasing hyperparameter γ_K and thus more prior mass is assigned to balanced weight distributions. Conversely, the variance increases and the component weights a priori become more unbalanced with decreasing values of γ_K. For a static MFM with γ_K ≡ γ, mixtures of a similar type are combined. For a dynamic MFM with γ_K = α/K, mixtures favoring different component size distributions are combined: standard mixture models with balanced components, which emerge for small K, are mixed with SFMs for moderate K and finally, as K goes to infinity, with DPMs favoring extremely unbalanced component sizes. As will be shown in Section 2.2, the dynamic prior on the component weights increases the flexibility of the prior induced on the partitions and K_+ and leads outside the family of Gibbs-type priors. Moreover, a hyperprior on α, to be discussed in Section 4.3, achieves additional adaptivity of the induced prior on the partitions to the data at hand.

2.2 The EPPF and the prior distribution of cluster sizes
The MFM model (2.1) induces through the latent indicators S = (S_1, ..., S_N) a random partition C = {C_1, ..., C_{K_+}} of the N data points into K_+ clusters, where each cluster C_j contains all observations generated by the same mixture component, i.e., S_i = S_{i'} for all y_i, y_{i'} ∈ C_j, see Lau and Green (2007). In the tradition of Pitman (1995), we derive in Theorem 2.1 the prior partition probability function p(C|N, γ) of a generalized MFM for a given sequence γ = {γ_K} and discuss static MFMs with γ_K ≡ γ and dynamic MFMs with γ_K = α/K as special cases. In addition, we derive the prior distribution p(N_1, ..., N_{K_+}|N, γ) of the labeled cluster sizes N_j = card(C_j), where the K_+ clusters in C are arranged in some exchangeable random order and we assign label 1 to the first cluster, label 2 to the second cluster, and so forth (Pitman, 2006).¹

Theorem 2.1. For a generalized MFM with proper prior p(K) and η_K|K, γ ∼ D_K(γ_K), the probability mass function p(C|N, γ) of the set partition C = {C_1, ..., C_{K_+}} and the prior distribution p(N_1, ..., N_{K_+}|N, γ) of the labeled cluster sizes are given by:

$p(\mathcal{C}|N, \gamma) = \sum_{K=1}^{\infty} p(K)\, p(\mathcal{C}|N, K, \gamma_K)$,  (2.3)

where the conditional EPPF given K is

$p(\mathcal{C}|N, K, \gamma_K) = \frac{K!}{(K - K_+)!} \frac{\Gamma(\gamma_K K)}{\Gamma(\gamma_K K + N)} \prod_{j=1}^{K_+} \frac{\Gamma(N_j + \gamma_K)}{\Gamma(\gamma_K)}$,  (2.4)

and

$p(N_1, \ldots, N_{K_+}|N, \gamma) = \frac{N!}{K_+! \prod_{j=1}^{K_+} N_j!}\, p(\mathcal{C}|N, \gamma)$.  (2.5)

¹One such order is arrangement in order of appearance (Pitman, 1996), where the first observation y_1 belongs to the first cluster and, for each j = 2, ..., K_+, the first observation not assigned to $\cup_{\ell=1}^{j-1} C_\ell$ belongs to cluster C_j. However, any other exchangeable random ordering will do.

Being a symmetric function of the cluster sizes (N_1, ..., N_{K_+}), p(C|N, γ) is an EPPF (Pitman, 1995) and defines an exchangeable random partition of the N data points for the class of generalized MFMs. The EPPF is instrumental for understanding the mathematical properties of the implied partitions and is a main object of interest in BNP mixtures, see, e.g., Lijoi and Prünster (2010).
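The EPPF of Theorem 2.1 can be evaluated directly by truncating the sum over K. The sketch below is our own illustration, assuming the Dirichlet-multinomial form of the conditional EPPF (2.4); the function names and the representation of the prior p(K) as a finite dictionary are ours.

```python
import math

def log_falling_factorial(K, k):
    """log of K!/(K - k)!; -inf when k > K (partition unattainable)."""
    if k > K:
        return -math.inf
    return math.lgamma(K + 1) - math.lgamma(K - k + 1)

def mfm_eppf(sizes, pK, gamma_of_K):
    """EPPF p(C|N, gamma) of a generalized MFM for cluster sizes `sizes`;
    pK maps K to its prior weight and gamma_of_K maps K to gamma_K."""
    N, k = sum(sizes), len(sizes)
    total = 0.0
    for K, pk in pK.items():
        lff = log_falling_factorial(K, k)
        if lff == -math.inf:
            continue  # more clusters than components is impossible
        g = gamma_of_K(K)
        total += pk * math.exp(
            lff + math.lgamma(g * K) - math.lgamma(g * K + N)
            + sum(math.lgamma(n + g) - math.lgamma(g) for n in sizes))
    return total
```

As a sanity check, with p(K) = I{K = 2} and γ ≡ 1 the probabilities of the five set partitions of N = 3 observations sum to one: one partition with sizes (3), three with sizes (2, 1), and one with sizes (1, 1, 1), the last receiving probability zero.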
An important class of BNP mixture models are mixtures relying on Gibbs-type random probability measures, or Gibbs-type priors, introduced in the seminal work by Gnedin and Pitman (2006). They are considered the most natural generalization of DPMs as they allow better control of the clustering behavior, see the excellent work of De Blasi et al. (2015). Under a Gibbs-type prior, the EPPF takes a specific product form, which allows us to study the EPPF of a generalized MFM in this regard. Relying on Gnedin and Pitman (2006), Gnedin (2010) and De Blasi et al. (2013), among others, Miller and Harrison (2018) show that a static MFM induces a Gibbs-type prior on the partitions. Indeed, for γ_K ≡ γ the EPPF in (2.3) takes the following product form:

$p(\mathcal{C}|N, \gamma) = V^{\gamma}_{N,K_+} \prod_{j=1}^{K_+} \frac{\Gamma(N_j + \gamma)}{\Gamma(\gamma)}$,  (2.7)

where $V^{\gamma}_{N,k} = \sum_{K=1}^{\infty} p(K)\, V^{K,\gamma}_{N,k}$ sums the quantity $V^{K,\gamma_K}_{N,k} = \frac{K!}{(K-k)!} \frac{\Gamma(\gamma_K K)}{\Gamma(\gamma_K K + N)}$ defined in (2.6) over the prior p(K) and satisfies a recursion (2.8) for k = 1, ..., N − 1 (see Appendix A in the supplementary material for a proof).² However, for a generalized MFM with γ_K depending on K, we obtain a mixture model with a partition structure beyond Gibbs-type priors. For a dynamic MFM, we establish in Theorem 2.2 that the EPPF p(C|N, α) can be expressed explicitly in relation to a DPM with precision parameter α, for which the EPPF is given by the Ewens distribution:

$p_{DP}(\mathcal{C}|N, \alpha) = \frac{\alpha^{K_+}\, \Gamma(\alpha)}{\Gamma(N + \alpha)} \prod_{j=1}^{K_+} \Gamma(N_j)$.  (2.9)

²Note that the normalization $\tilde{V}^{\gamma}_{N,k} = \gamma^k V^{\gamma}_{N,k}$ is needed to represent (2.7) as the common EPPF of a Gibbs-type prior: $p(\mathcal{C}|N, \gamma) = \tilde{V}^{\gamma}_{N,K_+} \prod_{j=1}^{K_+} (1 + \gamma)_{N_j - 1}$.

Figure 1: The implicit prior p(K_+|N, γ = 1) on the number of clusters K_+ for the static MFM under the uniform prior K ∼ U{1, 30} for various data sizes, N = 20, 100, 1000.
Theorem 2.2. For a dynamic MFM with γ_K = α/K, the EPPF p(C|N, α) can be expressed as

$p(\mathcal{C}|N, \alpha) = \sum_{K=1}^{\infty} p(K)\, R^{K,\alpha}_{N,K_+}\, p_{DP}(\mathcal{C}|N, \alpha)$,

where p_DP(C|N, α) is the probability mass function (pmf) of the Ewens distribution and the factor R^{K,α}_{N,K_+} depends on the vector N of induced cluster sizes (N_1, ..., N_{K_+}).
It follows from Theorem 2.2 that dynamic MFMs can be regarded as a "natural generalization" of the celebrated Dirichlet process prior beyond the class of Gibbs-type priors. Theorems 2.1 and 2.2 (which are proven in Appendix A) are exploited further in Section 3 to derive the induced prior on the number of clusters p(K_+|N, γ) and in Section 4 to investigate in more depth the connections between applied finite mixture analysis based on MFMs and commonly used BNP mixtures.
3 The prior distributions of K and K_+

This section proposes a suitable choice for p(K) and derives the implicit prior of K_+ as a function of p(K), the hyperparameters γ, and N for a generalized MFM.

3.1 Choosing the prior on the number of components K
In their seminal paper, Richardson and Green (1997) suggest a uniform prior K ∼ U{1, K_max} for a static MFM with γ_K ≡ 1. However, depending on N, the induced prior on K_+ might be surprisingly informative and far from a uniform distribution. Figure 1 shows the implied prior p(K_+|N, γ = 1) for a static MFM under the prior K ∼ U{1, 30} for various data sizes (N = 20, 100, 1000). Evidently, the prior mode depends on N and only for larger N does an approximately uniform prior result. Nobile (2004) shows that, as an alternative to the uniform prior, any proper prior p(K) which satisfies p(K) > 0 for all K ∈ N can be adopted. While most discrete probability distributions include zero, in a mixture context the prior p(K) has to exclude zero. This is often achieved by truncating the pmf at one, e.g., Nobile and Fearnside (2007) use the Poisson distribution K ∼ P(1) restricted to {1, 2, ..., K_max}. However, it is more convenient to work with a translated prior K − 1 ∼ p_t, where the pmf p(K) = p_t(K − 1) is obtained by evaluating the translated pmf at K − 1, as hierarchical priors are more easily introduced for translated priors. We propose a translated prior where K − 1 ∼ BNB(α_λ, a_π, b_π) follows the beta-negative-binomial (BNB) distribution, a hierarchical generalization of the Poisson, the geometric, and the negative-binomial distribution. The corresponding pmf is given by:

$p(K = k) = \frac{\Gamma(\alpha_\lambda + k - 1)}{\Gamma(\alpha_\lambda)\,\Gamma(k)} \frac{B(\alpha_\lambda + a_\pi,\, k - 1 + b_\pi)}{B(a_\pi, b_\pi)}, \qquad k = 1, 2, \ldots$  (3.1)

Appendix B provides the hierarchical derivation of the prior and illustrates its shape for various parameter values. For a_π > 1, the expectation E(K) = 1 + α_λ b_π/(a_π − 1) is finite. Prior (3.1) generalizes the prior derived by Cerquetti (2010) for the Gnedin-Fisher model and the prior derived by Grazian et al. (2020) from loss-based considerations, which can be regarded as a BNB(1, b_π, a_π) prior. In their applications, Grazian et al. (2020) apply the BNB(1, 1, 1) prior, which has no finite moments.
The three parameters α_λ, a_π, and b_π of the BNB(α_λ, a_π, b_π) prior allow simultaneous control over the expectation and the tails of p(K), the implied prior p(K_+|N, γ), and its expectation E(K_+|N, γ). Priors p(K) with finite expectation imply that E(K_+|N, γ) is finite, even for increasing N. In a clustering context, we propose to use the prior K − 1 ∼ BNB(1, 4, 3) with E(K) = 2. The induced prior p(K_+|N, γ) is investigated in more detail in Section 3.2 and differs considerably from previous choices such as the geometric or the uniform distribution. The BNB(1, 4, 3) prior leads to a weakly informative prior on K_+ which is concentrated on a moderate number of clusters and exhibits fat tails to ensure that a high number of clusters may also be estimated.
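The proposed BNB prior and its expectation are easy to check numerically. The following sketch is our own helper (not the paper's code); the truncation at K = 2000 is an assumption justified by the power-law tail of the BNB distribution.

```python
import math

def log_beta_fn(a, b):
    """log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bnb_pmf(y, r, a, b):
    """pmf of the beta-negative-binomial BNB(r, a, b) at y = 0, 1, 2, ..."""
    return math.exp(math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
                    + log_beta_fn(a + r, b + y) - log_beta_fn(a, b))

# translated prior K - 1 ~ BNB(1, 4, 3); truncation at K = 2000 is harmless
# because the pmf decays like a power law of order 5
pK = {k: bnb_pmf(k - 1, 1.0, 4.0, 3.0) for k in range(1, 2001)}
total = sum(pK.values())
mean = sum(k * p for k, p in pK.items())
```

The truncated pmf sums to one to numerical accuracy and reproduces the stated prior mean E(K) = 1 + α_λ b_π/(a_π − 1) = 2.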

3.2 The induced prior on the number of clusters K_+
In applied mixture analysis, we often aim at partitions of the data with a finite, but a priori random, number of clusters K_+. Since the number of components K is random a priori for an MFM, this induces K_+ to be random as well, but the induced prior p(K_+|N) on K_+ does not necessarily coincide with the prior p(K) for a finite number of observations N. The induced prior p(K_+|N) has been derived earlier for various mixture models. For DPMs, Antoniak (1974) provides the prior of K_+ as

$p_{DP}(K_+ = k|N, \alpha) = |s(N, k)|\, \alpha^k\, \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}$,

where |s(N, k)| is the unsigned Stirling number of the first kind. Nobile (2004, Proposition 4.2) gives the prior on K_+ for a standard finite mixture, while Gnedin and Pitman (2006) derive p(K_+|N) for Gibbs-type priors.
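Antoniak's prior on K_+ can be computed with the standard triangular recursion for the unsigned Stirling numbers of the first kind. This is a generic sketch of ours (not Algorithm 1 of the paper); for α = 1 the prior mean equals the harmonic sum, which the closing check exploits.

```python
import math

def dpm_prior_K_plus(N, alpha):
    """Antoniak's prior p(K_+ = k | N, alpha) for a DPM, via the
    recursion |s(n+1, k)| = n|s(n, k)| + |s(n, k-1)| for the unsigned
    Stirling numbers of the first kind."""
    s = [[0.0] * (N + 1) for _ in range(N + 1)]
    s[0][0] = 1.0
    for n in range(N):
        for k in range(1, n + 2):
            s[n + 1][k] = n * s[n][k] + s[n][k - 1]
    log_norm = math.lgamma(alpha) - math.lgamma(alpha + N)
    return [s[N][k] * alpha ** k * math.exp(log_norm) for k in range(N + 1)]

p = dpm_prior_K_plus(10, 1.0)  # p[k] = p(K_+ = k | N = 10, alpha = 1)
```

The returned probabilities sum to one, and their mean matches the known DPM expectation $E(K_+|N, \alpha) = \sum_{i=1}^{N} \alpha/(\alpha + i - 1)$.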
Building on this literature, we derive the prior p(K_+|N, γ) for generalized MFMs under arbitrary priors p(K). One way to obtain this prior is summing the EPPF (2.3) over all partitions C with K_+ = k clusters, which yields an expression in terms of the generalized Stirling numbers of the second kind. Alternatively, Theorem 3.1 derives p(K_+|N, γ) from the prior of the labeled cluster sizes p(N_1, ..., N_{K_+}|N, γ) given in (2.5).
Theorem 3.1. For a generalized MFM with priors p(K) and η_K|K, γ ∼ D_K(γ_K), the prior of the number of clusters K_+ conditional on the sample size N is given for k = 1, 2, ..., N by:

$p(K_+ = k|N, \gamma) = \sum_{K=k}^{\infty} p(K)\, V^{K,\gamma_K}_{N,k}\, C^{K,\gamma_K}_{N,k}$,  (3.3)

where, for each K, $V^{K,\gamma_K}_{N,k}$ has been defined in (2.6) and $C^{K,\gamma_K}_{N,k}$ is given by summation over the labeled cluster sizes (N_1, ..., N_k):

$C^{K,\gamma_K}_{N,k} = \sum_{\substack{N_1, \ldots, N_k > 0 \\ N_1 + \cdots + N_k = N}} \frac{N!}{k! \prod_{j=1}^{k} N_j!} \prod_{j=1}^{k} \frac{\Gamma(N_j + \gamma_K)}{\Gamma(\gamma_K)}$.  (3.4)

By matching (3.2) and (3.3), we find that $C^{K,\gamma_K}_{N,k}$ is related to the generalized Stirling numbers of the second kind. We found it convenient to compute $C^{K,\gamma_K}_{N,k}$ recursively through Algorithm 1. The recursion is straightforward to implement and scales well for large N, see Greve et al. (2020), Greve (2021), and Appendix A for mathematical derivations.
For a dynamic MFM with γ_K = α/K, $C^{K,\gamma_K}_{N,k}$ can be written as $C^{K,\alpha}_{N,k}$, depending on K and α. Putting all prior mass on K = +∞ yields a way to compute p_DP(K_+|N, α) for a DPM, where $C^{\infty}_{N,K_+}$ is independent of α and obtained through recursion (3.6) with w_n = 1/n. For a static MFM, Theorem 3.1 simplifies: p(K_+ = k|N, γ) is determined by $V^{\gamma}_{N,k}$, computed recursively from (2.8), and by $C^{K,\gamma_K}_{N,k}$, which reduces to $C^{\gamma}_{N,k}$ independent of K and can be obtained in a single recursion from (3.6). Using, again, the normalization $\tilde{V}^{\gamma}_{N,k} = \gamma^k V^{\gamma}_{N,k}$, this prior is a special case of the prior given in Gnedin and Pitman (2006) for Gibbs-type priors. Finally, putting all prior mass on K = K_f yields the result of Nobile (2004, Proposition 4.2) for a standard finite mixture. For illustration, Figure 2 shows the impact of various priors p(K) on the induced prior p(K_+|N, γ) for static MFMs with γ = 1 (top row) and dynamic MFMs with α = 1 (bottom row). The priors p(K) in the three columns are the translated beta-negative-binomial prior K − 1 ∼ BNB(1, 4, 3) with E(K) = 2 suggested in Section 3.1, the prior K − 1 ∼ Geo(0.1) with E(K) = 10 suggested by Miller and Harrison (2018), and the uniform prior K ∼ U{1, 30} with E(K) = 15.5 used by Richardson and Green (1997).
For static MFMs, p(K_+) ≈ p(K) for all three priors for values of K_+ and K between one and ten. In contrast, for a dynamic MFM, p(K_+) and p(K) are only close for the BNB prior, which has a small mean value. For the priors p(K) with larger mean values, p(K_+) differs considerably from p(K), with mass being pulled towards smaller values of K_+. The corresponding posteriors of K and K_+ obtained under these priors for the famous Galaxy data are shown in Figure 6 in Section 6.2.
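For small N, the induced prior p(K_+|N, γ) of a static MFM can also be checked without the recursion of Algorithm 1 by brute-force summation of the EPPF over all labeled cluster size configurations. The sketch below is such a verification device of ours, not the scalable recursion the paper recommends; all function names are illustrative.

```python
import math

def compositions(n, k):
    """All ordered k-tuples of positive integers summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(1, n - k + 2):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def static_eppf(sizes, pK, gamma):
    """EPPF of a static MFM (gamma_K = gamma) for given cluster sizes,
    truncating the sum over K to the support of the dict pK."""
    N, k = sum(sizes), len(sizes)
    total = 0.0
    for K, pk in pK.items():
        if k > K:
            continue  # more clusters than components is impossible
        total += pk * math.exp(
            math.lgamma(K + 1) - math.lgamma(K - k + 1)
            + math.lgamma(gamma * K) - math.lgamma(gamma * K + N)
            + sum(math.lgamma(n + gamma) - math.lgamma(gamma) for n in sizes))
    return total

def prior_K_plus(N, pK, gamma):
    """Brute-force p(K_+ = k | N, gamma): sum the EPPF over all set
    partitions, grouped by ordered cluster sizes (verification only)."""
    out = {}
    for k in range(1, N + 1):
        tot = 0.0
        for sizes in compositions(N, k):
            n_assign = math.factorial(N)
            for n in sizes:
                n_assign //= math.factorial(n)  # multinomial coefficient
            tot += n_assign * static_eppf(sizes, pK, gamma)
        out[k] = tot / math.factorial(k)  # clusters are unordered
    return out

probs = prior_K_plus(8, {k: 0.2 for k in range(1, 6)}, 1.0)
```

For N = 8 under K ∼ U{1, 5} and γ = 1 the resulting probabilities sum to one, vanish for K_+ above the prior support of K, and yield E(K_+|N) strictly below E(K) = 3.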

4 Bridging finite mixture analysis and BNP mixtures

4.1 Connecting SFMs, MFMs and BNP mixtures
Generalized MFMs extend both Dirichlet process mixtures (DPMs) and sparse finite mixtures (SFMs). By allowing the number of components K to be finite and random, MFMs provide notably more flexibility in the prior distribution on the partition space than DPMs and SFMs, similar to popular BNP mixtures (De Blasi et al., 2015). SFMs result as that special case of MFMs where p(K) = I{K = K_f} puts all prior mass on a fixed number of components K_f. It follows from Theorem 2.2 and earlier work by Ishwaran and Zarepour (2000) that the prior distribution imposed on the partition space by an SFM lacks flexibility with increasing K_f and approaches the Ewens distribution (2.9) as γ_{K_f} = α/K_f approaches 0. This implies that SFMs do not easily deal with situations with many, well-balanced clusters, a behavior that is also observed for DPMs. By considering K as an additional second parameter following a prior p(K), the dynamic MFM emerges as a more flexible family than an SFM with K = K_f fixed. Dynamic MFMs can also be regarded as a more flexible extension of a DPM. Since $R^{K,\alpha}_{N,K_+}$ in Theorem 2.2 converges to 1 as K increases, putting all prior mass on K = +∞ yields the Ewens distribution as limiting case. Thus, DPMs result as the limiting case of a dynamic MFM where the prior p(K) increasingly concentrates all prior mass at K = +∞.
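The convergence of the SFM partition prior to the Ewens distribution as K_f grows can be checked numerically. The sketch below is our own illustration: it evaluates the SFM EPPF (conditional on K = K_f, with γ_{K_f} = α/K_f) and the Ewens pmf for a fixed partition; the function names are ours.

```python
import math

def log_sfm_eppf(sizes, K, alpha):
    """log EPPF of an SFM with K components and gamma_K = alpha/K, given K."""
    N, k = sum(sizes), len(sizes)
    g = alpha / K
    return (math.lgamma(K + 1) - math.lgamma(K - k + 1)
            + math.lgamma(alpha) - math.lgamma(alpha + N)  # Gamma(gK) = Gamma(alpha)
            + sum(math.lgamma(n + g) - math.lgamma(g) for n in sizes))

def log_ewens(sizes, alpha):
    """log pmf of the Ewens distribution, the EPPF of a DPM."""
    N, k = sum(sizes), len(sizes)
    return (k * math.log(alpha) + math.lgamma(alpha) - math.lgamma(alpha + N)
            + sum(math.lgamma(n) for n in sizes))

sizes, alpha = (3, 2, 1), 1.0
gaps = [abs(math.exp(log_sfm_eppf(sizes, K, alpha)) - math.exp(log_ewens(sizes, alpha)))
        for K in (10, 100, 1000)]
```

The absolute gap between the two EPPF values shrinks roughly by an order of magnitude per decade of K_f, in line with the limiting argument above.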
Several close connections between MFMs and Pitman-Yor process mixtures (PYMs) deserve to be mentioned. In Bayesian non-parametrics, mixtures based on the Pitman-Yor prior PY(σ, θ) with σ ∈ [0, 1), θ > −σ (Pitman and Yor, 1997) are a commonly used two-parameter alternative to DPMs, which are the special case where σ = 0 and θ = α. There exists a second family of PYMs, where σ < 0 and θ = K|σ| with K ∈ N a natural number, see Gnedin (2010) and De Blasi et al. (2015). In the corresponding stick-breaking representation, stick v_K = 1 a.s. Hence, this prior yields a mixture with infinitely many components, of which only K have non-zero weights, with the symmetric Dirichlet distribution D_K(|σ|) acting as prior. Furthermore, at most K components can be populated. The EPPF of a PYM (with K known) reads:

$p(\mathcal{C}|N) = \frac{\prod_{j=1}^{K_+ - 1} (\theta + j\sigma)}{(\theta + 1)_{N-1}} \prod_{j=1}^{K_+} (1 - \sigma)_{N_j - 1}$.

By matching EPPFs (and using Γ(1 − σ) = |σ|Γ(|σ|)), it is evident that a finite mixture with K known and γ_K > 0 is equivalent to a mixture with a PY(−γ_K, Kγ_K) prior, as proven in Gnedin and Pitman (2006). This equivalence of SFMs and PYMs provides a theoretical explanation of the empirical finding that SFMs can lead to more sensible cluster solutions than DPMs, see, e.g., Frühwirth-Schnatter and Malsiner-Walli (2019). Even more interesting connections to BNP mixtures arise for MFMs, where K is random. As pointed out by Miller and Harrison (2014) and proven much earlier by Gnedin and Pitman (2006), for a static MFM, the dual BNP mixture is a Gibbs-type prior which arises from mixing a PY(−γ, Kγ) prior over the concentration parameter θ_K = Kγ, while the reinforcement parameter σ = −γ is fixed. The Fisher-Gnedin model studied in Gnedin (2010) is equivalent to a static MFM with γ = 1 and K − 1 ∼ P(λ). The static MFM is also a special case of the class of mixtures based on normalized independent finite point processes recently introduced by Argiento and De Iorio (2019).
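The equivalence between a finite mixture with K components and a PY(−γ_K, Kγ_K) prior can be verified by matching EPPFs numerically, as in the following sketch (our own helper names; the PY EPPF is the standard Pitman-Yor product form).

```python
import math

def log_finite_mixture_eppf(sizes, K, gamma):
    """log EPPF of a finite mixture with K components and D_K(gamma) weights."""
    N, k = sum(sizes), len(sizes)
    return (math.lgamma(K + 1) - math.lgamma(K - k + 1)
            + math.lgamma(gamma * K) - math.lgamma(gamma * K + N)
            + sum(math.lgamma(n + gamma) - math.lgamma(gamma) for n in sizes))

def log_py_eppf(sizes, sigma, theta):
    """log EPPF of a PY(sigma, theta) prior in its product form."""
    N, k = sum(sizes), len(sizes)
    return (sum(math.log(theta + j * sigma) for j in range(1, k))
            - (math.lgamma(theta + N) - math.lgamma(theta + 1))  # (theta + 1)_{N-1}
            + sum(math.lgamma(n - sigma) - math.lgamma(1 - sigma) for n in sizes))
```

For any partition with at most K clusters, `log_finite_mixture_eppf(sizes, K, gamma)` agrees with `log_py_eppf(sizes, -gamma, K * gamma)` to numerical precision.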
On the other hand, for a dynamic MFM, the prior partition distribution of the dual BNP mixture lies outside of the family of Gibbs-type priors, as it arises from mixing a PY(−α/K, α) prior over the reinforcement parameter σ_K = −α/K, while the concentration parameter θ = α is fixed, see also the discussion in De Blasi et al. (2015). As shown in Pitman (1996), a system of predictive distributions emerges from the EPPF, quantifying the probability that a new observation y_{N+1} belongs to any of the K_+ = k existing clusters in C = {C_1, ..., C_k} or creates a partition C_new = {C_1, ..., C_k, C_{k+1}} with a new cluster C_{k+1} of size N_{k+1} = 1. For a dynamic MFM, the prior probability to introduce a new cluster for y_{N+1} is derived in Appendix A. This probability (bounded by the predictive probability α/(α + N) of a DPM) not only depends on N and the current number of clusters K_+, which characterizes Gibbs-type priors (De Blasi et al., 2013), but also on the occupation numbers N_1, ..., N_{K_+}. This confirms once more that dynamic MFMs, while staying within the finite mixture framework, are an example of a general random partition prior (De Blasi et al., 2015).

4.2 Comparing static and dynamic MFMs and DPMs
In the following we compare the induced priors on the number of clusters and the partitions for static and dynamic MFMs and DPMs in more detail and investigate the influence of the prior on K and, respectively, the hyperparameters γ and α.
Regarding the prior on the number of clusters K_+, a fundamental question is whether an MFM allows K_+ to differ from K a priori, as is the case for DPMs (where K = ∞). To gain further understanding, we plot in Figure 3 the expectation of the induced prior p(K_+|N, γ) as a function of γ (for static MFMs) and α (for DPMs and dynamic MFMs) for N = 100 under various priors p(K). For both classes of MFMs, the gap between the expected number of clusters, E(K_+|N, γ), and the expected number of components, E(K), decreases for increasing γ or α. However, for dynamic MFMs the decrease is much slower and, even as α increases, a considerable gap remains between E(K_+|N, γ) and E(K). This is the effect of linking γ to K through γ_K = α/K, thus avoiding that K_+ increases too quickly as K increases. This implies that the influence of the prior on K on the induced prior on K_+ is attenuated for an extended range of α values.
As emphasized by Green and Richardson (2001), beyond the induced prior on K_+, the conditional EPPF, induced for a given number of clusters K_+ = k, is important for comparing mixture models. This prior allows a deeper understanding of the impact of choosing γ for MFMs on the partition distribution. For a DPM, the conditional distribution of the labeled cluster sizes can be expressed using Theorem 3.1 as

$p_{DP}(N_1, \ldots, N_k|K_+ = k, N) = \frac{N!}{k!\, |s(N, k)|} \prod_{j=1}^{k} \frac{1}{N_j}$,

and is known to be highly unbalanced (Antoniak, 1974), favoring partitions with some small values N_j due to the factors 1/N_j, j = 1, ..., k (Miller and Harrison, 2018). However, being independent of α, the conditional EPPF cannot be made more flexible for a DPM. In contrast, for a static MFM, the conditional EPPF depends on γ,

$p(N_1, \ldots, N_k|K_+ = k, N, \gamma) = \frac{N!}{k!\, C^{\gamma}_{N,k}} \prod_{j=1}^{k} \frac{\Gamma(N_j + \gamma)}{N_j!\, \Gamma(\gamma)}$.  (4.4)

For γ = 1, the uniform distribution over all partitions of N data points into K_+ = k clusters results. Varying the hyperparameter γ introduces flexibility in the conditional EPPF for a static MFM: decreasing γ favors more unequal allocations, increasing γ favors partitions with more equal allocations. The conditional EPPF of a dynamic MFM is obtained by dividing (2.5) by (3.3); it depends both on α and p(K), whereas the conditional EPPF of a static MFM is independent of p(K). Thus, having a second parameter K, dynamic MFMs are more flexible than static MFMs regarding the conditional EPPF.
Overall, in comparison to DPMs, static and dynamic MFMs induce more flexible prior structures both on the prior of the number of clusters and on the partition distribution, see Greve et al. (2020) for a detailed further investigation. Additional flexibility is achieved by adjusting the hyperparameters γ and α to suit the data. In Section 4.3, a hyperprior on α is suggested to achieve adaptivity of the induced prior on the partitions to the data at hand. A static MFM can also be combined with a prior on γ, rather than a fixed value such as γ = 1.

4.3 Choosing the prior on α for dynamic MFMs
For a dynamic MFM the parameter α plays a crucial role for the prior distribution induced on the number of clusters and the partitions. On the one hand, the prior should have positive mass close to zero to allow a priori for a single-cluster solution, which corresponds to homogeneity. On the other hand, fat tails should allow a priori for larger values of K_+ and partitions with balanced cluster sizes.
The DPM literature suggests a Gamma distribution α ∼ G(a, b) (e.g., Escobar and West, 1995; Jara et al., 2007). If a = b ≪ 1, the expectation of α is one, while the variance is large, leading to a vague prior on α. For DPMs this induces a very informative prior on the number of clusters which is concentrated on 1 and +∞ (see Dorazio, 2009; Murugiah and Sweeting, 2012). For dynamic MFMs, such a prior would, given its mode at zero, strongly favor homogeneity and fail for data with balanced cluster sizes. Instead, we propose to use the F-distribution α ∼ F(ν_l, ν_r). Its two parameters allow one to control the behavior of the prior close to zero and in the tail independently. Choosing ν_r small gives fat tails. For a finite mean value, given by ν_r/(ν_r − 2), but no higher moments, we specify 2 < ν_r ≤ 3. Choosing a small value for ν_l allows independent control over the prior probability of homogeneity. Since the mode is given by (ν_l − 2)ν_r/(ν_l(ν_r + 2)), choosing ν_l > 2 avoids a spike at 0. In our empirical analysis, we use α ∼ F(6, 3).
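A quick numerical check of the proposed F(6, 3) prior (a pure-stdlib sketch of ours, not the paper's code): the density vanishes at zero since ν_l > 2, and its mode sits at (ν_l − 2)ν_r/(ν_l(ν_r + 2)) = 0.4.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F(d1, d2) distribution."""
    if x <= 0:
        return 0.0
    log_beta = math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    return math.exp((d1 / 2) * math.log(d1 / d2) + (d1 / 2 - 1) * math.log(x)
                    - ((d1 + d2) / 2) * math.log(1 + d1 * x / d2) - log_beta)

# alpha ~ F(6, 3): no spike at zero (nu_l > 2), mode at 4 * 3 / (6 * 5) = 0.4
grid = [i / 1000 for i in range(1, 10000)]
mode = max(grid, key=lambda x: f_pdf(x, 6, 3))
```

A grid search over the density confirms the stated mode, while the slowly decaying right tail reflects that only the first moment of F(6, 3) is finite.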

5 Inference algorithm: Telescoping sampling
A novel sampling method called telescoping sampling is introduced for a Bayesian analysis of finite mixtures with an unknown number of components, which is related to, but also fundamentally different from, RJMCMC (Richardson and Green, 1997) and the CRP sampler (Jain and Neal, 2004, 2007) applied in Miller and Harrison (2018).
Similar to Jain and Neal (2004, 2007), the telescoping sampler is a transdimensional Gibbs sampler which exploits the EPPF of an MFM given in (2.3). However, we do not work with the marginal EPPF p(C|N, γ), as Miller and Harrison (2018) do, but use a second level of data augmentation where we introduce the unknown number of components K, in addition to the partition C, as a latent variable. This allows the telescoping sampler to be applied outside the framework of Gibbs-type priors. We explicitly include K in the sampling scheme as in Richardson and Green (1997). However, rather than using RJMCMC, K is sampled conditional on C from the conditional posterior

$p(K|\mathcal{C}, \gamma_K) \propto p(\mathcal{C}|N, K, \gamma_K)\, p(K)$,  (5.1)

which is obtained by combining the conditional EPPF p(C|N, K, γ_K) provided in (2.4) with the prior p(K). While Miller and Harrison (2018) use (5.1) for static MFMs to infer K in a post-processing step, the telescoping (TS) sampler integrates (5.1) into a transdimensional Gibbs sampler for generalized MFMs and samples K and the partition C (including K_+) in different blocks. Since K ≥ K_+ by definition, the number of empty components K − K_+ varies over the iterations of the sampler, taking the value zero or a larger value. The difference between K and K_+ behaves similarly to a telescope which can be stretched or pulled together; hence the name of the sampler. Full details of the TS sampler are provided for dynamic MFMs in Algorithm 2. The TS sampler can be applied with minor modifications to static MFMs (see Algorithm 3 in Appendix C). In both cases, the hyperparameter ω = α or, respectively, ω = γ is assumed to be unknown.
Very conveniently, due to the conditional independence between the parameters θ_k in the (non-empty) clusters and the number of components K, given the partition C, K is sampled from the conditional posterior p(K|C, γ_K) given in (5.1) without any reference to the specific component distribution. Hence, the TS sampler is straightforward to implement and very generic, since the conditional posterior p(K|C, γ_K) does not depend on the component parameters. This makes our sampler a most generic, easily implemented algorithm for finite mixture models with simultaneous inference on the unknown number of components and the unknown number of clusters for a wide range of component models. It greatly simplifies the application of MFMs in new contexts, allowing for arbitrary component distributions and extensions with hierarchical priors. In contrast, the challenge of designing good moves for RJMCMC is legendary. But also for CRP samplers (which are confined to static MFMs), the creation of new clusters requires knowledge of the marginal likelihood, which depends on the chosen mixture family and might be difficult to work out for more complex mixtures.
More specifically, the TS sampler is a partially marginalized sampler, moving back and forth between sampling from the mixture posterior distribution p(K, S, η_K, θ_1, …, θ_K, φ, ω|y), which lives in the augmented parameter space of the mixture distribution, and sampling from the collapsed posterior p(K, C, θ_1, …, θ_{K_+}, φ, ω|y), which lives in the set partition space and is marginalized with respect to the parameters of the empty components, the weight distribution η_K and all allocations S that induce the same set partition C. The full mixture posterior is proportional to

p(K, S, η_K, θ_1, …, θ_K, φ, ω|y) ∝ p(K) p(ω) p(η_K|K, ω) p(φ) ∏_{k=1}^{K} p(θ_k|φ) ∏_{k=1}^{K_+} η_k^{N_k} p(y^{[k]}|θ_k),    (5.4)

where y^{[k]} are the N_k > 0 observations in cluster C_k of the partition C = {C_1, …, C_{K_+}} implied by S (after reordering such that the K_+ non-empty clusters appear first).

Algorithm 2 Telescoping sampling for a dynamic MFM.
1. Update the partition C:
   (a) Sample the allocations S_i, i = 1, …, N, from Pr{S_i = k|η_K, θ_1, …, θ_K, y_i} ∝ η_k p(y_i|θ_k).
   (b) Determine N_k, k = 1, …, K, and K_+, and reorder the components such that the first K_+ components are non-empty.
2. Conditional on C:
   (a) Sample the parameters θ_k of the K_+ filled components from p(θ_k|C, φ, y), k = 1, …, K_+.
   (b) Sample the hyperparameter φ (if any) conditional on K_+ and θ_1, …, θ_{K_+} from p(φ|θ_1, …, θ_{K_+}).
3. Conditional on C, draw new values of K and α:
   (a) Sample K from the conditional posterior p(K|C, α).
   (b) Sample α using a random walk Metropolis-Hastings step.
4. Conditional on K, S, α and φ, add K − K_+ empty components and update η_K:
   (a) If K > K_+, then add K − K_+ empty components (i.e., N_k = 0 for k = K_+ + 1, …, K) and sample θ_k|φ from the prior p(θ_k|φ) for k = K_+ + 1, …, K.
   (b) Sample η_K|K, α, S ∼ D(e_1, …, e_K), where e_k = α/K + N_k.

The posterior (5.4) lends itself to the conditional sampling Step 1 of the TS sampler, which is a standard step for finite mixtures with K known. The TS sampler is related to conditional samplers for infinite mixtures insofar as all indicators S are sampled jointly due to the conditional independence of S given η_K, θ_1, …, θ_K, y. As opposed to this, the CRP sampler applied in Miller and Harrison (2018) is a single-move sampler updating the allocation of each observation one at a time.
Integrating (5.4) with respect to the weight distribution η_K, the parameters θ_k of the empty components and all allocations S that induce the same partition C yields (after suitable relabeling) the collapsed posterior which lives in the set partition space:

p(K, C, θ_1, …, θ_{K_+}, φ, ω|y) ∝ p(K) p(ω) p(C|N, K, γ_K) p(φ) ∏_{k=1}^{K_+} p(θ_k|φ) p(y^{[k]}|θ_k).    (5.5)

We see in (5.5) that updating the parameters θ_1, …, θ_{K_+} and φ (Step 2) can be performed independently of updating K and the hyperparameter ω (Step 3). It should be noted that the conditional posterior p(K|C, ω) of K given C that results from (5.5) is identical to (5.1), verifying the validity of Step 3(a) (or 3(a*)) in our partially marginalized sampler. In practice, Step 3(a) (or 3(a*)) is implemented by considering an upper bound K_max for K and sampling K from a multinomial distribution over {K_+, …, K_max}, with success probabilities proportional to the unnormalized posterior probability of K. In the following empirical analyses we use a maximum value of K_max = 100.
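Step 3(a) can be sketched as follows. This is a minimal illustration, not the authors' implementation, assuming the conditional EPPF takes the standard Dirichlet-multinomial form p(C|N, K, γ) = K!/(K − K_+)! · Γ(γK)/Γ(γK + N) · ∏_k Γ(N_k + γ)/Γ(γ) (cf. (2.4)); the function names are ours, and γ_K = α/K encodes the dynamic MFM.

```python
import math
import random

def log_eppf_conditional(cluster_sizes, K, gamma):
    """log p(C | N, K, gamma) under a symmetric Dirichlet(gamma) weight prior:
    K!/(K - K_+)! * Gamma(gamma K)/Gamma(gamma K + N) * prod_k Gamma(N_k + gamma)/Gamma(gamma)."""
    N, K_plus = sum(cluster_sizes), len(cluster_sizes)
    if K < K_plus:
        return float("-inf")
    out = math.lgamma(K + 1) - math.lgamma(K - K_plus + 1)
    out += math.lgamma(gamma * K) - math.lgamma(gamma * K + N)
    out += sum(math.lgamma(n + gamma) - math.lgamma(gamma) for n in cluster_sizes)
    return out

def sample_K(cluster_sizes, log_prior_K, alpha, K_max=100, rng=random):
    """Draw K from p(K | C, alpha) over {K_+, ..., K_max} for a dynamic MFM
    with gamma_K = alpha / K (Step 3(a)), via a log-sum-exp-stabilized multinomial."""
    supp = range(len(cluster_sizes), K_max + 1)
    logp = [log_prior_K(K) + log_eppf_conditional(cluster_sizes, K, alpha / K)
            for K in supp]
    m = max(logp)
    w = [math.exp(l - m) for l in logp]
    u = rng.random() * sum(w)
    for K, wk in zip(supp, w):
        u -= wk
        if u <= 0.0:
            return K
    return K_max
```

For a static MFM the same routine applies with a fixed γ in place of α/K.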
The sampler returns to conditional sampling from the full mixture posterior in Step 4(b) (or 4(b*)) by sampling the parameters of the empty components conditional on φ and sampling the weight distribution η_K from the conventional Dirichlet posterior distribution. Using the stick-breaking representation of a finite mixture, with the sticks following v_k|K, γ_K ∼ B(γ_K, (K − k)γ_K), Step 4(b) (or 4(b*)) can be rewritten in terms of sampling the sticks from a generalized Dirichlet distribution; see, e.g., Algorithm 1 of Frühwirth-Schnatter and Malsiner-Walli (2019).
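The conventional Dirichlet draw of Step 4(b) can be sketched via normalized Gamma variates; this is a minimal illustration (the helper name is ours), where `counts` carries N_k for all K components, including the zeros of the K − K_+ empty ones.

```python
import random

def sample_weights(counts, gamma_K, rng=random.Random(0)):
    """Draw eta_K | K, S ~ Dirichlet(e_1, ..., e_K) with e_k = gamma_K + N_k,
    using the standard construction via normalized Gamma(e_k, 1) variates."""
    g = [rng.gammavariate(gamma_K + n, 1.0) for n in counts]
    total = sum(g)
    return [x / total for x in g]
```

For a dynamic MFM one sets gamma_K = α/K, matching e_k = α/K + N_k in Algorithm 2.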
In order to learn the component parameters, a hierarchical prior structure is introduced in the Bayesian mixture model (2.1). Basically, in Step 2(b) of the TS sampler, any hierarchical prior p(φ) on the model parameters can be used. For other samplers, such as the allocation sampler (Nobile and Fearnside, 2007), the prior p(φ) has to be conditionally conjugate to easily integrate out the component parameters θ k . A specific feature of the TS sampler is that the hyperparameters φ are learned in Step 2(b) only from the K + filled components and that the parameters of the K − K + empty components are sampled subsequently in Step 4(a) from the conditional prior p(θ k |φ) for k = K + + 1, . . . , K. In this way, the parameters of the filled components inform the parameters of the empty components. In our opinion, this is an elegant way to handle hierarchical priors for component parameters in a dimension changing framework.
The TS sampler allows for a varying, but conditionally finite model dimension K. Truncation, however, does not result from slice sampling (Kalli et al., 2011), a popular method for DPMs to turn the infinite mixture into a conditionally finite one. The TS sampler adds and deletes components as follows.
Step 3(a) is a birth move, where new components are created if a value K > K_+ is sampled. These components are empty, since we leave the filled components in the partition C unchanged. Observations are allocated to these empty components during the subsequent sweep of the sampler in Step 1(a). Components can only disappear if they are emptied in the allocation Step 1(a). Hence, for the TS sampler to work well, the tail probability ∑_{K > K_+} p(K|C, ω) must not be too small, as this probability controls how many empty components are added in Step 3(a) (or 3(a*)). The more p(K|C, ω) concentrates at K_+, the more likely mixing over K_+ and K will be poor for the TS sampler. This is true both for static and dynamic MFMs.
Finally, we allow the hyperparameter of the weight distribution, either α or γ, to be an unknown parameter estimated from the data under a hyperprior. This hyperparameter is updated in Step 3(b) (or 3(b*)), which is the only updating step where a random walk Metropolis-Hastings step is employed.
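A generic random walk Metropolis-Hastings move on the log scale for Step 3(b) might look as follows. This is a sketch under our own naming: `log_target` stands for the unnormalized log posterior of α (or γ), i.e., the EPPF term combined with the hyperprior, and is assumed to be supplied by the caller.

```python
import math
import random

def mh_step_log_scale(value, log_target, step=0.5, rng=random):
    """One random-walk Metropolis-Hastings update of a positive parameter
    (alpha or gamma) on the log scale; log(prop/value) is the Jacobian
    correction for the log transform."""
    prop = value * math.exp(rng.gauss(0.0, step))
    log_ratio = log_target(prop) - log_target(value) + math.log(prop / value)
    if math.log(rng.random()) < log_ratio:
        return prop
    return value
```

Working on the log scale keeps the proposal positive and makes the step size scale-free, which is convenient since plausible values of α span several orders of magnitude.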
6 Empirical demonstrations

6.1 Benchmarking the telescoping sampler

We compare the performance of the TS sampler to two other samplers previously proposed to fit a static MFM with univariate Gaussian components, namely, reversible jump MCMC (RJ; Richardson and Green, 1997) and the Jain-Neal split-merge algorithm (JN; Jain and Neal, 2004, 2007; Miller and Harrison, 2018). In contrast to the TS sampler, where in each iteration both K and K_+ are updated, the RJ sampler only samples K while K_+ is calculated a posteriori from the sampled allocations, and the JN sampler only samples the partitions and thus K_+, whereas the posterior of K is reconstructed in a post-processing step (see Miller and Harrison, 2018, Equation (3.7)).
For this comparison we consider the well-known Galaxy data (Roeder, 1990), a small data set of N = 82 velocity measurements of galaxies from six well-separated sections of space, and fit univariate Gaussian mixtures, y_i|S_i = k ∼ N(μ_k, σ_k^2), with K unknown. Priors are chosen as in Richardson and Green (1997): p(K) is the uniform distribution U{1, 30}, η_K|K ∼ D_K(γ_K) with γ_K ≡ 1 (i.e., a uniform prior on the weights), μ_k ∼ N(m, R^2), σ_k^2 ∼ G^{-1}(2, C_0), and C_0 ∼ G(0.2, 10/R^2), where m and R are the midpoint and the length of the observation interval. These priors are imposed for the sake of comparability with previous results; they are neither motivated by modeling considerations nor selected to favor the TS sampler.
Results were obtained for the RJ sampler using the Nmix software provided by Peter Green and for the JN sampler using the implementation of Miller and Harrison (2018). Each sampler was run for 1,000,000 iterations without thinning, after discarding the first 10,000 iterations, using 100 different initializations. Table 1 summarizes the posterior p(K_+|y), averaged over all 100 runs, for all three samplers (see Appendix D.1 for more detailed results). The posteriors estimated by the three samplers are very similar, indicating that the TS sampler provides suitable draws from this posterior distribution.
The performance of the three samplers is compared by inspecting the number of clusters K_+ as well as the number of components K obtained over the MCMC iterations, where available. For this comparison, we use a simulated data set with a data generating process similar to the Galaxy data set. We draw N = 1000 observations from a three-component univariate Gaussian mixture (see Figure D.1 in Appendix D.1), specify priors on the component parameters as used in Richardson and Green (1997) for the Galaxy data set, and fit a static MFM with γ = 0.1. The smaller value of the Dirichlet parameter increases the gap between the priors on K and K_+ and thus improves the mixing of the TS and RJ samplers. Each sampler is run for 100,000 iterations without thinning; the first 10% of iterations are omitted as burn-in.

[Figure 4. Trace plots of K (gray) and K_+ (black) for the TS, RJ and JN samplers; priors as in Richardson and Green (1997).]

Figure 4 shows a combined trace plot of K_+ and K (if available) for each of the three samplers using the first 5,000 iterations after omitting the burn-in. In each trace plot the black line shows how the number of clusters K_+ induced by the sampled partitions varies over the iterations. For the TS and RJ samplers, the gray lines additionally show how the number of components K varies. For the TS sampler, K is sampled given K_+, while for the RJ sampler K changes if components are split or combined or due to a birth or death of an empty component. This difference is clearly visible in the trace plots, with poorer mixing in K for the RJ relative to the TS sampler.
We assess the efficiency of the three samplers by estimating auto-correlation functions (ACFs) for the sampled K + and K values (if available) and visualizing them in Figure 5. Regarding K + , the efficiency is rather comparable over the three samplers, with slight advantages for JN followed by TS and RJ being the least efficient. Comparing the ACFs for K clearly confirms that TS outperforms RJ.
The performance comparison indicates that TS is competitive with the other samplers, while providing the advantage of being easily adjusted and immediately applicable for mixtures with other component distributions or models. Note, however, that an appropriate choice of γ_K has an impact on the efficiency of the sampler: a too large value of γ_K prevents empty components from being created, while too small values induce many superfluous empty components.

Sensitivity to the prior choice on the number of components
In the following we use the TS sampler to investigate how the posteriors of K and K_+ vary under different prior specifications p(K, γ_K) for the Galaxy data set. Although this data set is very popular in the clustering literature, there is no consensus on the number of clusters in the sample; see, for instance, Aitkin (2001), Grün et al. (2021) and the discussion in Appendix D.2. In contrast to these previous Bayesian analyses, we keep the priors on the component parameters fixed to those specified by Richardson and Green (1997) for all analyses. In this way, the impact of the priors on K and the component weights can be investigated without confounding these effects with those of different prior specifications on the component parameters. We consider the static and dynamic MFM with the same priors p(K) and γ_K as specified in Figure 2, i.e., K − 1 ∼ BNB(1, 4, 3), K − 1 ∼ Geo(0.1), and K ∼ U{1, 30}, with γ = 1 for the static MFM and α = 1 for the dynamic MFM.
In the top row of Figure 6, the posteriors of K and K_+ are reported for the static MFM with γ_K ≡ 1. The posteriors p(K_+|y) and p(K|y) are very similar to each other regardless of the p(K) specified. In contrast, for the dynamic prior γ_K = 1/K, shown in the middle row, the posteriors p(K|y) and p(K_+|y) differ considerably. While the posterior p(K|y) becomes flatter compared to fixed γ = 1, most of the posterior mass of p(K_+|y) concentrates on K_+ equal to 3, 4 or 5, which are reasonable values for the number of clusters in this data set. Comparing the posteriors of K_+ and K to the corresponding priors in Figure 2 indicates that the posteriors are strongly influenced by the prior distributions. For example, the flat prior on K_+ induced by the uniform distribution and γ_K = 1 (see Figure 2) overestimates the number of clusters in this small data set. In contrast, a sparse prior on K_+ in combination with a dynamic MFM favors a sparse estimate of the number of clusters also a posteriori; see, e.g., the posterior p(K_+|y) for the BNB(1, 4, 3) prior, where three clusters are estimated. Under the hyperprior α ∼ F(6, 3), the posterior of K_+ looks rather similar to the one obtained when fixing α = 1, see Figure 6 at the bottom. However, if the shrinkage prior α ∼ G(1, 20) is specified, the posterior of K_+ becomes completely independent of both the prior and posterior of K; see Appendix D.2, where also results for other specifications of K and the weights are reported, in particular a FM, a SFM and a DPM model. Figure 6 shows that, depending on the prior on K and on whether a static or dynamic MFM is specified, the posterior mode of p(K_+|y) varies. This highlights the impact of the implicitly specified prior on K_+ on its posterior. This especially applies to the Galaxy data set, which contains only N = 82 observations and has no clear cluster structure.
If, in contrast, there is considerable information in the data, the posteriors of K + for different prior specifications p(K) coincide, as can be seen in the next section when analyzing the Thyroid data set.

Changing the clustering kernel
We use the TS sampler to fit dynamic MFMs with different component distributions, i.e., the multivariate Gaussian distribution and the latent class model for multivariate categorical data. This demonstrates how easily the TS sampler can be used to fit a MFM regardless of the component distributions. For K we use the same priors p(K) as in the previous section. It will turn out that a prior specification for K where E(K) is small and the tails are not too light, in combination with the dynamic prior γ K = α/K on the component weights and α ∼ F(6, 3) gives good clustering results.
The final partition is obtained by identifying the model through the post-processing procedure suggested by Frühwirth-Schnatter (2006) and applied in Malsiner-Walli et al. (2016, 2017). First, the number of clusters K̂_+ is estimated by the mode of the posterior p(K_+|y). Then, for all posterior draws where K_+^(m) = K̂_+, the component parameters are clustered in the point process representation into K̂_+ clusters using k-means clustering. A unique labeling of the draws is obtained and used to reorder all draws, including the sampled allocations. The final partition of the data is then determined by the maximum a posteriori (MAP) estimate of the relabeled cluster allocations.
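For univariate component means, the relabeling step can be sketched as follows. This is a simplified illustration with our own helper names (a tiny 1-D Lloyd's algorithm); the actual analysis clusters the draws in the full point process representation of all component parameters.

```python
import random
import statistics

def kmeans_1d(xs, k, iters=50):
    """Tiny 1-D Lloyd's algorithm; centers initialized at spread-out quantiles."""
    xs_sorted = sorted(xs)
    centers = [xs_sorted[int((j + 0.5) * len(xs) / k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[j].append(x)
        centers = [statistics.mean(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def relabel_draws(draws, k):
    """Cluster the pooled component means into k groups and relabel each draw;
    draws whose components do not map one-to-one onto the k cluster labels
    (i.e., no permutation) are discarded."""
    centers = kmeans_1d([m for d in draws for m in d], k)
    kept = []
    for d in draws:
        labels = [min(range(k), key=lambda c: abs(m - centers[c])) for m in d]
        if sorted(labels) == list(range(k)):  # exactly one component per cluster
            relabeled = [None] * k
            for lab, m in zip(labels, d):
                relabeled[lab] = m
            kept.append(relabeled)
    return centers, kept
```

After relabeling, posterior summaries (and the MAP partition) can be computed cluster by cluster without label switching.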

Multivariate Gaussian mixtures: Thyroid data
The Thyroid data are a benchmark data set for multivariate normal mixtures included in the R package mclust (Scrucca et al., 2016). It consists of five laboratory test variables and a categorical variable indicating the operation diagnosis (with three potential values) for 215 patients. A dynamic MFM with multivariate normal component densities is fitted using a simplified version of the priors proposed in Malsiner-Walli et al. (2016) for the component parameters (for details see Appendix D.3.1). As can be seen in the left-hand column of Table 2, for all priors on K the mode of the posterior of K_+ lies at three, even for the uniform prior. The posterior mode of K is also three, indicating that empty components were rarely sampled. For the K − 1 ∼ BNB(1, 4, 3) prior, the final partition obtained through the MAP estimate consists of three clusters with 28, 37 and 150 patients. The ARI of this partition with the known operation diagnosis is 0.88, which is equal to the ARI of the mclust solution. Overall these results suggest that, if the data are informative regarding a specific cluster structure, the clustering result is not susceptible to the prior specification of p(K).

[Table 2. Results for the Thyroid data and the Fear data (Stern et al., 1994) under dynamic MFMs with α ∼ F(6, 3). The posteriors of K_+ and K are summarized by their modes, followed by the 1st and 3rd quartiles.]

Latent class analysis: Fear data
Table 2 shows that the prior K − 1 ∼ BNB(1, 4, 3) selects K̂_+ = 2, confirming the theoretically expected number of clusters. The geometric prior with E(K) = 10 and the truncated uniform prior, however, overestimate the number of clusters, with the mode of K_+ at 4 and 6, respectively. The results obtained when identifying the MCMC output from a dynamic MFM with K − 1 ∼ BNB(1, 4, 3) and α ∼ F(6, 3) indicate that the two classes have rather different profiles regarding the occurrence probabilities of the categories (see Appendix D.3.2), which coincides with the findings in Stern et al. (1994).

Investigating the telescoping sampler with artificial data
We perform a simulation study with artificial data to investigate how the TS sampler performs depending on the sample size N, the dimension r and the number of clusters K_+. In addition, we vary the priors p(K, γ_K), considering static and dynamic MFMs, and in particular include the suggested priors K − 1 ∼ BNB(1, 4, 3) and α ∼ F(6, 3). We sample 100 data sets from a multivariate normal mixture with eight equally sized components, varying the dimension (r = 2, 8, 12) and increasing the sample size (N = 400, 4000, 10000), combining higher values of the dimension r with larger sample sizes N. A detailed description of the data generating processes of the simulated data as well as the specified priors p(K) and Dirichlet parameters γ and α is given in Appendix D.4.
Results are visualized in a bubble plot in Figure 7. The area of the bubbles is proportional to the percentage of data sets for which the number of clusters K_+ indicated on the y-axis is estimated. The results show how the influence of the prior p(K, γ_K) decreases as the information in the sample increases. If the information is weak, i.e., for N = 400 and r = 2, the prior specifications on K and on γ_K have considerable impact on the clustering result (first column of Figure 7). The estimated number of clusters K_+ tends to be lower for the Poisson prior regardless of the prior imposed on γ_K. While the Poisson prior with λ = 1 induces the same prior mean E(K) = 2 as the BNB(1, 4, 3) prior, it also has light tails. Thus, the fatter tails of the BNB(1, 4, 3) prior make it possible to estimate the number of clusters in the data correctly despite its sparsity-inducing properties. Regarding the prior on the Dirichlet parameter γ_K, the results for the static MFM clearly indicate that the estimated number of clusters decreases with decreasing values of γ. In the dynamic case, using α ∼ F(6, 3) gives more reliable results than the other specifications for α regardless of the prior on K. In contrast, the influence of the sparsity-inducing prior α ∼ G(1, 20) is clearly visible across all priors on K, leading to as few as four estimated clusters instead of eight. Overall, the results for the combination K − 1 ∼ BNB(1, 4, 3) and α ∼ F(6, 3) confirm the suitability of this prior specification for determining the number of clusters in a Bayesian cluster analysis. For N = 4000 the estimated number of data clusters K̂_+ is equal to eight for nearly all data sets regardless of the prior specification. Results are similar for N = 10000.

Concluding remarks
Being a finite mixture model where the number of components is unknown, the MFM model has a long tradition in Bayesian mixture analysis. Building on this tradition, a key aspect of our work is to explicitly distinguish between the number of components K in the mixture distribution and the number of clusters K + in the partition of the data, corresponding to non-empty components given the data.
With this fundamental distinction in mind, we contribute to MFMs both from a methodological as well as a computational perspective. Traditionally, the hyperparameter γ of a symmetric Dirichlet prior on the component weights is a fixed value, often equal to one. In this paper, we investigate in detail a more general MFM specification which defines the hyperparameter γ_K of the symmetric Dirichlet prior dynamically, in dependence on K. We provide theoretical results that characterize how this specification of a dynamic symmetric Dirichlet prior on the component weights influences the induced prior on the number of clusters and the partition structure. While a static MFM with fixed γ corresponds to a Bayesian non-parametric mixture within the class of Gibbs-type priors, our dynamic version where γ_K depends on K leads to a more flexible mixture outside the class of Gibbs-type priors.
Regarding posterior inference, we introduce the novel telescoping (TS) sampler which is a trans-dimensional Gibbs sampler that simultaneously infers the posterior on the number of components K and the number of clusters K + . As illustrated, for instance, for multivariate Gaussian mixtures, the TS sampler can be easily implemented for any kind of component model or distribution. Based on the TS sampler, in future work many different kinds of mixture models can be easily fitted to cluster different types of data which require the use of specific component distributions and models. Future work should also investigate the potential to improve the computational efficiency of the TS sampler, e.g., by reducing the computational burden due to the empty components.

Supplementary material for: "Generalized mixtures of finite mixtures and telescoping sampling"
A Mathematical derivations

Proof of Theorem 2.1. Let S = (S_1, …, S_N) be the collection of all component indicators which, for a given K, associate each observation y_i with the component that generated this data point (see model (2.1)). For any MFM with prior η_K|K, γ_K ∼ D_K(γ_K), the marginal prior p(S|K, γ_K) for a fixed K is obtained by integrating out the weights η_K:

p(S|K, γ_K) = Γ(γ_K K)/Γ(γ_K K + N) ∏_{k=1}^{K} Γ(N_k + γ_K)/Γ(γ_K).    (A.1)

If we define K_+ as the number of all occupied components with N_k > 0 and reorder the components such that the non-empty components appear first, with N_1, …, N_{K_+} being the corresponding occupation numbers, then S defines a set partition C = {C_1, …, C_{K_+}} of the data indices {1, …, N} with N_j = card(C_j). There are

K!/(K_+!(K − K_+)!) · K_+! = K!/(K − K_+)!

assignment vectors S that define the same partition C, where the first factor accounts for choosing K_+ among the K components, which are labeled {1, …, K_+}, while the second factor accounts for all possibilities to relabel these K_+ (non-empty) components. Multiplying (A.1) by this number yields p(C|N, K, γ_K) as given in (2.4), where V^{K,γ_K}_{N,K_+} is defined as in (2.6). Averaging p(C|N, K, γ_K) over the prior p(K) yields the probability mass function (pmf) p(C|N, γ) given in (2.3).

For any K_+ = 1, 2, …, N, consider the cluster sizes (N_1, …, N_{K_+}) of the K_+ non-empty clusters which are labeled {1, …, K_+}. For this labeling, there are K!/(K_+!(K − K_+)!) ways to choose K_+ non-empty among the K components and N!/(N_1! ⋯ N_{K_+}!) different ways to assign N observations into clusters of sizes N_1, …, N_{K_+}. Multiplying (A.1) by this number and averaging over the prior p(K) yields the prior of the labeled cluster sizes given in (2.5).

Derivation of (2.8). Using Γ(γK + N) = Γ(γK + N + 1)/(γK + N) and splitting γK + N = (γk + N) + γ(K − k) yields (2.8).

Proof of Theorem 2.2. For a dynamic MFM, γ_K = α/K, and the corresponding EPPF follows from (2.3). Using Γ(α/K) = (K/α) Γ(1 + α/K), the EPPF can be rearranged accordingly.
Therefore, p(C|N, α) can be expressed as in (2.10).

Proof of Theorem 3.1 and Algorithm 1. The marginal prior Pr{K_+ = k|N, γ} is obtained by aggregating the prior pmf p(N_1, …, N_k|N, γ) of the labeled cluster sizes (N_1, …, N_k) of a partition with k non-empty clusters, given in (2.5), over all cluster sizes N_1, …, N_k such that N_1 + … + N_k = N. The resulting prior Pr{K_+ = k|N, γ} can be represented as a mixture over K, where Pr{K_+ = k|N, K, γ_K} is the prior of K_+ for a fixed number of components K, and the prior uncertainty with respect to K is integrated out. This proves Theorem 3.1. The number of terms in C^{K,γ_K}_{N,k} is the number of partitions of N into k integer summands with regard to order. Algorithm 1 is based on the following recursion to compute C^{K,γ_K}_{N,k} for k ∈ {1, 2, …}. We write C^{K,γ_K}_{N,k} in terms of the weights w_{N_j}, where

w_n = Γ(n + γ_K)/Γ(n + 1), n = 1, …, N.
Since the cluster sizes are labeled, this sum can be written recursively for k ≥ 2, where C^{K,γ_K}_{ñ,k̃} is defined for k̃ ∈ {1, …, K} and ñ = 1, …, N. This recursion is equivalent to the following recursive system: if we define the matrices W_k for all k ∈ {2, 3, …}, then we obtain from (A.6) that C^{K,γ_K}_{N,k} is equal to the first element of the vector c_{K,k} for all k ∈ {1, 2, …}. W_1 takes the form given in Algorithm 1 and W_k is obtained from W_{k−1} for all k ∈ {2, 3, …} by deleting the first row and the first column.
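The recursion in Algorithm 1 computes Pr{K_+ = k|N, K, γ_K} exactly; as an independent sanity check, the same quantity can be approximated by brute-force simulation from the generative model. This is our own sketch, not part of the paper's algorithm: draw η ∼ D_K(γ_K), allocate N observations, and count the non-empty components.

```python
import random

def kplus_prior_mc(N, K, gamma, draws=2000, rng=random.Random(42)):
    """Monte Carlo estimate of Pr{K_+ = k | N, K, gamma}, k = 0, ..., K:
    simulate eta ~ Dirichlet_K(gamma), allocate N observations by inverse-cdf
    sampling, and tally the number of filled components."""
    counts = [0] * (K + 1)
    for _ in range(draws):
        g = [rng.gammavariate(gamma, 1.0) for _ in range(K)]
        s = sum(g)
        eta = [x / s for x in g]
        filled = set()
        for _ in range(N):
            u, acc = rng.random(), 0.0
            for k, w in enumerate(eta):
                acc += w
                if u <= acc:
                    filled.add(k)
                    break
            else:
                filled.add(K - 1)  # guard against floating-point rounding
        counts[len(filled)] += 1
    return [c / draws for c in counts]
```

Comparing these estimates against Algorithm 1 for moderate N and K is a cheap way to validate an implementation of the recursion.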
To make this relation more evident, the predictive probability is expressed as in (4.1).

B The beta-negative-binomial distribution
The beta-negative-binomial (BNB) distribution is a hierarchical generalization of the Poisson, the geometric and the negative-binomial distribution. This can be derived in the following way: The starting point is the translated Poisson distribution K − 1 ∼ P (λ) introduced by Miller and Harrison (2018) with a fixed value of λ which also determines the prior mean E(K − 1) = λ. A typical choice is λ = 1, but this choice might be influential and it appears promising to consider hierarchical priors.
Assuming the Gamma prior λ ∼ G(α_λ, β) on λ leads to the translated negative-binomial distribution K − 1 ∼ NegBin(α_λ, β). For α_λ = 1, this distribution reduces to the translated geometric distribution K − 1 ∼ Geo(π) with success probability π = β/(1 + β), modeling the number of failures before the first success. The pmf of the negative-binomial distribution can be combined with the hierarchical prior π ∼ B(a_π, b_π) on π = β/(1 + β). Marginally, this yields the translated BNB distribution K − 1 ∼ BNB(α_λ, a_π, b_π). Table A.1 gives an overview of the beta-negative-binomial (BNB) distribution and its special cases given by the Poisson, negative-binomial and geometric distributions. The translated pmf is provided as well as the prior means E(K − 1) and E(K). The different shapes the BNB distribution can take for various values of its parameters are illustrated in Figure B.1.
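The BNB pmf can be evaluated on the log scale as follows. This sketch assumes the standard parameterization P(X = k) = Γ(r + k)/(k! Γ(r)) · B(a + r, b + k)/B(a, b) with mean r·b/(a − 1) for a > 1, so that K − 1 ∼ BNB(1, 4, 3) has E(K − 1) = 1, i.e., E(K) = 2; the function name is ours.

```python
import math

def bnb_logpmf(k, r, a, b):
    """log pmf of the beta-negative-binomial BNB(r, a, b) at k = 0, 1, 2, ...;
    for a translated prior, K - 1 ~ BNB(r, a, b) puts this mass on K = k + 1.
    Uses lgamma throughout, since the Beta-function ratios over- and underflow
    quickly for large k."""
    return (math.lgamma(r + k) - math.lgamma(k + 1) - math.lgamma(r)
            + math.lgamma(a + r) + math.lgamma(b + k) - math.lgamma(a + r + b + k)
            - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)))
```

The polynomial tail of order k^{-(a+1)} (here k^{-5} for a = 4) is what distinguishes the BNB prior from the light-tailed Poisson prior with the same mean.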

C Inference algorithm: Telescoping sampling
In the following we provide more details on using telescoping sampling for MCMC estimation of MFMs. Algorithm 2 can be easily modified for static MFMs, as outlined in Algorithm 3.

Starting values and burn-in. We define starting values in Algorithms 2 and 3, respectively, in the following way. k-means or k-modes clustering is used to split the data into K_0 initial clusters, where K_0 clearly overfits the number of clusters, e.g., K_0 = 10 or 15 if about 5 clusters are expected. The cluster centers returned by k-means or k-modes serve as initial values for the component means. In case the component distributions have a variance parameter independent of the mean, e.g., for Gaussian distributions, sufficiently large values are specified to encourage merging of the components in the first classification steps. The component weights are initialized with uniform weights. We run Algorithms 2 and 3, respectively, for M_0 + M iterations and discard the draws from the first M_0 iterations as burn-in. In general only a rather small number of burn-in iterations (e.g., 1,000) is required to reach a region of the parameter space with high posterior values, while many iterations (e.g., 100,000) need to be recorded in order to explore these regions sufficiently well. Convergence of the MCMC sampler is assessed by inspecting trace plots of the number of clusters K_+ or the component weights.

Details on Step 1(b). To reorder the components, determine the indices {i_1, …, i_{K_+}} ⊂ {1, …, K} of the K_+ non-empty components and let i_{K_++1}, …, i_K be the remaining indices corresponding to the K − K_+ empty components. Note that {i_1, …, i_K} is not unique, but the algorithm is invariant to the specific choice. Given {i_1, …, i_K}, the cluster sizes, the component parameters and the component weights are reordered accordingly.

[Algorithm 3. Telescoping sampling for a static MFM.]
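A sketch of this reordering (the helper name is ours; any permutation that places the filled components first is valid, and the sampler is invariant to the choice):

```python
def reorder(counts, thetas, weights):
    """Step 1(b)-style reordering: permute components so that the K_+
    non-empty ones (counts[k] > 0) come first, applying the same index
    permutation to cluster sizes, component parameters and weights."""
    idx = ([k for k, n in enumerate(counts) if n > 0]
           + [k for k, n in enumerate(counts) if n == 0])
    return ([counts[k] for k in idx],
            [thetas[k] for k in idx],
            [weights[k] for k in idx])
```

Keeping the three containers aligned under one index permutation is all that is needed; the relative order within the filled and within the empty groups is arbitrary.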

Details on Step 3(a). In Step 3(a), Γ(α/K) = (K/α) Γ(1 + α/K) is used to evaluate the posterior (5.3), increasing the numerical stability for large values of K or small values of α, respectively.

D Empirical demonstrations -Details and additional results

D.1 Benchmarking the telescoping sampler
Tables D.1 and D.2 provide additional details on the results obtained in Section 6.1 when using the telescoping sampler (TS), RJMCMC (RJ) and the Jain-Neal sampler (JN) to fit a static MFM to the Galaxy data set using the priors suggested in Richardson and Green (1997).

D.2 Sensitivity to the prior choice on the number of components
The Galaxy data set has been used numerous times in the literature to illustrate the use of Bayesian methods for fitting a mixture model with univariate Gaussian components, in particular to address the issue of the number of components and clusters. Aitkin (2001) compares the results obtained in Escobar and West (1995), Richardson and Green (1997) and three further Bayesian analyses, and points out that the posterior probabilities for K obtained in the different analyses are rather diffuse over the range 4-9, except for one analysis which concludes that the number of components is almost certainly three. The five Bayesian analyses differed not only with respect to the prior specifications on K and γ_K, but also in the priors specified for the component parameters. Aitkin (2001) also compares the Bayesian results to those obtained using a maximum likelihood analysis, which shows strong evidence for 3 or 4 mixture components, depending on whether equal or unequal variances between the components are assumed. In addition to the static and dynamic MFMs considered in Section 6.2, a finite mixture model (FM), a sparse finite mixture model (SFM) and a Dirichlet process mixture (DPM) model are fitted. While all three modeling approaches might be seen as special cases of MFMs, they differ in the specification of K and the Dirichlet parameter. The FM uses a fixed value of K and γ = 1, inducing a priori a uniform distribution on the component weights. The SFM combines a fixed value of K with a fixed small value of γ, which induces a priori empty components. Figure D.2 visualizes the priors for K and K_+ for the FM, SFM and DPM. The fixed value for K is equal to 10 and the number of observations is N = 82.
We use for the FM γ = 1 (corresponding to α = 10 for a dynamic MFM), for the SFM γ = 0.01 (corresponding to α = 0.1 for a dynamic MFM) and for the DPM α = 1.

[Figure D.3. Posterior of K_+ (solid red lines, circles) and K (dashed blue lines, triangles) for a standard finite mixture with K = 10, γ = 1 (left), a sparse finite mixture with K = 10, γ = 0.01 (middle) and a DPM with α = 1 (right).]

For the FM and SFM the prior on K has a degenerate distribution putting all prior mass at 10, whereas the DPM puts all mass at K = ∞. The FM with γ = 1 implies a mode at 9 for the prior on K_+, while also putting considerable mass on K_+ = 8 and K_+ = 10. Clearly a value of γ = 1 is not sufficiently large to ensure that all components are filled. For the SFM, the prior on K_+ has its mode at 1 and decreases quickly, putting some mass on K_+ = 2 but negligible mass on higher values of K_+. The DPM prior for K_+ is a unimodal distribution with mode at 4-5, but essentially no mass assigned to K_+ = 1 or to K_+ = 10 and beyond.
The posterior distributions resulting from the priors in Figure D.2 are shown in Figure D.3 when fitting the corresponding finite mixture and DPM specifications to the Galaxy data set. While K (being fixed) remains unchanged, the differences in the prior distributions for K+ are also reflected in different posteriors for K+. The FM obtains a fine-grained approximation of the data density, with in general 8-9 components being filled. The SFM obtains an approximation where with high probability only 4-5 components are filled, with some probability also assigned to 3 or 6 filled components. The approximation with the DPM specification is the sparsest, with a mode at K+ = 3 and most mass assigned to the values 3-5.

[Table caption fragment: ... for the cluster-specific success probabilities after model identification for K − 1 ∼ BNB(1, 4, 3) and a dynamic MFM with α ∼ F(6, 3) and a uniform Dirichlet prior on the component parameters.]
If the shrinkage prior α ∼ G(1, 20) is specified, the posterior of K+ becomes completely independent of both the prior and the posterior of K, as can be seen in Figure D.4. The component means are assigned the prior µ_k ∼ N(b_0, B_0), with b_0 = median(y) and B_0 = Diag(R_1^2, . . . , R_r^2), where R_j is the range of the data in dimension j and r = 5. For the component covariance matrices the hierarchical prior Σ_k^{-1} ∼ W(c_0, C_0), C_0 ∼ W(g_0, G_0) is assumed, where c_0 = 2.5 + (r − 1)/2, g_0 = 0.5 + (r − 1)/2 and G_0 = (100 g_0 / c_0) Diag(1/R_1^2, . . . , 1/R_r^2). Note that the same priors on the component parameters are used in the simulation study with artificial data in Section 6.4, where multivariate Gaussian mixtures are also fitted. Table D.3 summarizes the cluster-specific parameter estimates obtained for a dynamic MFM model after model identification. The dynamic MFM was fitted with the following prior specifications: K − 1 ∼ BNB(1, 4, 3), α ∼ F(6, 3) and uniform Dirichlet priors on the component parameters. Model identification is performed by first selecting the mode of the posterior of K+ as a suitable number of clusters.
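The hyperparameter choices above can be transcribed directly into code. The following sketch (the function name is ours) constructs the data-dependent hyperparameters for an N × r data matrix y.

```python
import numpy as np

def gaussian_component_priors(y):
    """Hyperparameters of the priors on the component means mu_k and the
    inverse covariance matrices Sigma_k^{-1}, following the data-dependent
    specification above (y is an N x r data matrix)."""
    N, r = y.shape
    R = y.max(axis=0) - y.min(axis=0)   # range of the data per dimension
    b0 = np.median(y, axis=0)           # prior mean of mu_k
    B0 = np.diag(R ** 2)                # prior covariance of mu_k
    c0 = 2.5 + (r - 1) / 2              # dof of W(c0, C0) for Sigma_k^{-1}
    g0 = 0.5 + (r - 1) / 2              # dof of W(g0, G0) for C0
    G0 = (100 * g0 / c0) * np.diag(1.0 / R ** 2)
    return b0, B0, c0, g0, G0
```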

D.3.2 Latent class analysis: Fear data
In the following, only the MCMC draws are considered where the number of filled components equals the estimated number of clusters K+. Unique labels are assigned by clustering the component parameters of the filled components in the point process representation. Only those MCMC draws are retained where the cluster labels assigned to the component parameters of the filled components of the same draw form a permutation of the numbers 1 to the estimated number of clusters. The posterior distributions of the cluster-specific parameters obtained in this way are summarized in Table D.3 by the posterior mean and standard deviation. Note that the categories can be interpreted as scores, with higher scores indicating a stronger behavior. Whereas children belonging to class 2 are more likely to have higher scores in all three variables, children in class 1 simultaneously show less motor activity, crying behavior and fear. This clustering result coincides with both the results reported in Frühwirth-Schnatter and Malsiner-Walli (2019) and the psychological theory behind the experiments, according to which all three behavioral variables are regulated by the same physiological mechanism; see Stern et al. (1994) for more details.
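The identification procedure just described can be sketched as follows. This is a simplified stand-alone version under our own naming: the clustering step is a minimal k-means with farthest-point initialization, whereas in general any clustering method in the point process representation may be used.

```python
import numpy as np

def kmeans(X, K, iters=50):
    # minimal Lloyd's algorithm with farthest-point initialization;
    # sufficient for well-separated component parameters
    centers = [X[0]]
    for _ in range(K - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def identify(draws, K_hat):
    """Keep the draws whose number of filled components equals K_hat,
    cluster the pooled component parameters, and retain only the draws
    whose labels form a permutation of 0, ..., K_hat - 1; retained draws
    are relabeled accordingly (hypothetical helper, names are ours)."""
    kept = [d for d in draws if len(d) == K_hat]
    pooled = np.vstack(kept)
    labels = kmeans(pooled, K_hat)
    identified, i = [], 0
    for d in kept:
        lab = labels[i:i + len(d)]
        i += len(d)
        if np.array_equal(np.sort(lab), np.arange(K_hat)):
            identified.append(d[np.argsort(lab)])  # reorder by cluster label
    return identified
```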

D.4 Investigating the telescoping sampler with artificial data
For the simulation study in Section 6.4, we draw artificial data from finite mixtures of multivariate Gaussian distributions with eight components. The component weights are set to be equal, i.e., η_k = 1/8 for all k = 1, . . . , 8. The mean vectors of the components are determined in the following way. The four values {2, 6, 10, 14} in one dimension are combined with the two values {0, 5} in a second dimension through a full factorial design to define eight different two-dimensional mean vectors. To obtain the mean vectors for higher-dimensional data (where r is an even number), the two dimensions are replicated, but also divided by the square root of the number of replicates to ensure that the Euclidean distance between the mean vectors remains the same. The variance-covariance matrices of the component distributions are all equal to the identity matrix. For drawing the artificial data, the number of dimensions r and the sample size N are varied using the following settings: (N = 400, r = 2), (N = 4000, r = 8) and (N = 10000, r = 12). For each setting 100 data sets are drawn. For illustration, an example data set with N = 400 and r = 2 is shown in Figure D.5. As prior on K, we use the beta-negative-binomial distribution BNB(1, 4, 3) for K − 1 (see Section 3.1), the Poisson distribution with λ = 1 for K − 1 (similar to Nobile and Fearnside, 2007), BNB(1, 1, 1) for K − 1 (as suggested by Grazian et al., 2020), Geo(0.1) for K − 1 (as suggested by Miller and Harrison, 2018) and U{1, 30} for K (as suggested by Richardson and Green, 1997). For the Dirichlet parameter γ_K we consider different priors for static as well as dynamic MFMs. For the static MFM, where γ_K ≡ γ, we use γ ∈ {1, 1/log(N), 0.01}: γ = 1 corresponds to the value used in Richardson and Green (1997) and Miller and Harrison (2018), and γ = 0.01 induces a sparse solution as suggested by Malsiner-Walli et al. (2016).
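The data-generating process can be sketched as follows (function names are ours). Replicating the two base dimensions m = r/2 times scales Euclidean distances by √m, hence the division by √m to keep the distances between mean vectors constant across dimensions.

```python
import numpy as np

def component_means(r):
    """Mean vectors of the 8 Gaussian components for dimension r (r even).
    A 4 x 2 full factorial design of {2, 6, 10, 14} x {0, 5} gives eight
    2-d means; for r > 2 the two dimensions are replicated and divided by
    sqrt(m), m = r/2, so that pairwise Euclidean distances are preserved."""
    base = np.array([[a, b] for a in (2.0, 6.0, 10.0, 14.0) for b in (0.0, 5.0)])
    m = r // 2                          # number of replicates
    return np.tile(base, (1, m)) / np.sqrt(m)

def draw_data(N, r, rng):
    """One artificial data set: equal weights, identity covariances."""
    means = component_means(r)
    z = rng.integers(8, size=N)         # component labels, eta_k = 1/8
    return means[z] + rng.standard_normal((N, r))
```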
The specification γ = 1/log(N) lets γ decrease inversely proportionally to the log of the sample size N; for the sample sizes considered, log(N) varies only moderately, taking values between 6.0 and 9.2. For the dynamic MFM with γ_K = α/K, we consider the fixed value α = 1 as well as settings where a prior on α is assumed: in addition to the prior α ∼ F(6, 3) (see Section 4.3), we consider α ∼ G(2, 4) (see Escobar and West, 1995). MCMC sampling is performed using the TS sampler. The sampler is initialized with 15 filled components and run for 10,000 burn-in iterations; 100,000 iterations are recorded without any thinning. The number of data clusters is estimated by the mode of the posterior of the number of clusters. Applying the TS sampler to fit static and dynamic MFMs is straightforward, whereas the RJMCMC and JN implementations, used in Section 6.1 as benchmarks for the TS sampler, would require major changes to be applicable to this simulation setup, where multivariate data and hierarchical priors are considered.
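For concreteness, the static Dirichlet parameters considered can be tabulated as follows (labels are ours). For a dynamic MFM with γ_K = α/K, a static γ at a given K corresponds to α = Kγ, e.g., γ = 1 with K = 10 corresponds to α = 10.

```python
import math

def gamma_static(N):
    """The three static Dirichlet parameters considered, for sample size N."""
    return {"unit": 1.0, "log": 1.0 / math.log(N), "sparse": 0.01}

for N in (400, 4000, 10000):
    # gamma = 1/log(N) varies only moderately over the sample sizes used
    print(N, round(gamma_static(N)["log"], 3))
```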
In this simulation study the true data-generating process is included in the fitted model. For larger sample sizes, we would thus expect the sampler to concentrate on the part of the parameter space coinciding with the true data-generating process. The results indicate that the TS sampler succeeds in converging during burn-in to the part of the parameter space where the estimated number of clusters K+ corresponds to the true number of clusters. Note that the TS sampler is initialized with 15 filled components, which implies that during burn-in filled components are merged and emptied. Overall, the results demonstrate that the TS sampler can be successfully applied in Bayesian cluster analysis to data with a clear clustering structure for sample sizes up to 10,000 and dimensions up to 12.