Infinite Mixtures of Infinite Factor Analysers

Factor-analytic Gaussian mixture models are often employed as a model-based approach to clustering high-dimensional data. Typically, the numbers of clusters and latent factors must be specified in advance of model fitting, and remain fixed. The pair which optimises some model selection criterion is then chosen. For computational reasons, models in which the number of latent factors differ across clusters are rarely considered. Here the infinite mixture of infinite factor analysers (IMIFA) model is introduced. IMIFA employs a Pitman-Yor process prior to facilitate automatic inference of the number of clusters using the stick-breaking construction and a slice sampler. Furthermore, IMIFA employs multiplicative gamma process shrinkage priors to allow cluster-specific numbers of factors, automatically inferred via an adaptive Gibbs sampler. IMIFA is presented as the flagship of a family of factor-analytic mixture models, providing flexible approaches to clustering high-dimensional data. Applications to a benchmark data set, metabolomic spectral data, and a manifold learning handwritten digit example illustrate the IMIFA model and its advantageous features. These include obviating the need for model selection criteria, reducing the computational burden associated with the search of the model space, improving clustering performance by allowing cluster-specific numbers of factors, and quantifying uncertainty in the numbers of clusters and cluster-specific factors.


Introduction
In cases where the number of variables p is comparable to or greater than the number of observations N, many clustering techniques tend to perform poorly or be intractable. Factor analysis (FA; Knott & Bartholomew, 1999) is a well-known approach to parsimoniously modelling data. Bai & Li (2012) outline some computational difficulties which arise when N ≪ p. Model-based clustering methods which rely on latent factor models have long been successfully utilised to cluster high-dimensional data. Ghahramani & Hinton (1996) propose a mixture of factor analysers model (MFA) with cluster-specific parsimonious covariance matrices and estimate it via an expectation-maximisation algorithm. Estimation of MFA models has also been considered in a Bayesian framework (Diebolt & Robert, 1994; Richardson & Green, 1997). McNicholas & Murphy (2008) develop a suite of similar parsimonious Gaussian mixture models. Other related developments in this area include Baek et al. (2010) and Viroli (2010), among others.

The IMIFA Model Family
The hierarchy of the IMIFA family of models is delineated herein, beginning with a review of extant methodologies, followed by the introduction of novel sub-models, and concluding with the flagship IMIFA model. Prior specifications, Markov chain Monte Carlo (MCMC) inferential procedures, approaches to posterior predictive model checking, and model-specific implementation issues that arise in practice are addressed.

Mixtures of Factor Analysers
Mixtures of factor analysers are Gaussian latent variable models often used for clustering high-dimensional data. For each of G clusters in these finite mixtures, the cluster-specific FA model in cluster g is given by x_i − µ_g = Λ_g η_i + ε_ig. The observed feature vector x_i = (x_i1, . . . , x_ip)′ with mean µ_g and covariance matrix Σ_g is assumed to linearly depend on a q-vector (q ≪ p) of latent common factor scores η_i and additional sources of variation called specific factors ε_ig. It is assumed that η_i has a q-variate Gaussian distribution N_q(0, I_q), where I_q denotes the q × q identity matrix, and that ε_ig ∼ N_p(0, Ψ_g), where Ψ_g is a diagonal matrix with non-zero elements ψ_1g, . . . , ψ_pg known as uniquenesses. Here, Λ_g denotes the p × q factor loadings matrix of cluster g and notably q = 0 is permitted.
To facilitate estimation, a latent cluster indicator vector z_i = (z_i1, . . . , z_iG)′ is introduced such that z_ig = 1 if observation i belongs to cluster g and z_ig = 0 otherwise. Hence, z_i has a Mult(1, π) distribution where π = (π_1, . . . , π_G)′ are the cluster mixing proportions which sum to 1. A symmetric uniform Dirichlet prior π ∼ Dir_G(α), with α = (α, . . . , α) and α = 1, is assumed. Upon marginalising out z_i and η_i, MFA yields a parsimonious finite mixture with covariance structure f(x_i) = Σ_{g=1}^{G} π_g N_p(x_i; µ_g, Λ_g Λ_g′ + Ψ_g), where N_p(x_i; ·, ·) denotes the density of a p-variate Gaussian distribution evaluated at x_i and θ_g = {µ_g, Λ_g, Ψ_g} are the cluster-specific FA parameters for which inference is straightforward under a Gibbs sampling scheme. Imposing constraints on Ψ_g (McNicholas & Murphy, 2008) and/or fixing π_g = 1/G ∀ g may be useful in some settings.
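To make the marginal covariance structure Σ_g = Λ_g Λ_g′ + Ψ_g concrete, the finite mixture density can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation; all names, dimensions, and parameter draws below are arbitrary:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
p, q, G = 5, 2, 3  # illustrative dimensions

# Cluster-specific FA parameters: loadings, uniquenesses, means, and weights.
Lambdas = [rng.normal(size=(p, q)) for _ in range(G)]
Psis = [np.diag(rng.uniform(0.5, 1.5, size=p)) for _ in range(G)]
mus = [rng.normal(size=p) for _ in range(G)]
pis = np.full(G, 1.0 / G)  # equal mixing proportions, summing to 1

def mfa_density(x):
    """Marginal MFA density: sum_g pi_g * N_p(x; mu_g, Lambda_g Lambda_g' + Psi_g)."""
    return sum(
        pi * multivariate_normal.pdf(x, mean=mu, cov=L @ L.T + Psi)
        for pi, mu, L, Psi in zip(pis, mus, Lambdas, Psis)
    )

x = rng.normal(size=p)
print(mfa_density(x) > 0)  # True: a valid (positive) density value
```

Each component covariance uses only p(q + 1) free parameters rather than p(p + 1)/2, which is the source of the parsimony noted above.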

Prior Specification and Practical Issues
The conditionally conjugate nature of the various prior distributions detailed below facilitates MCMC sampling via straightforward Gibbs updates. A multivariate Gaussian prior is assumed for the factor loadings of variable j across the q factors of cluster g. Similarly, a diffuse multivariate Gaussian prior is assumed for the component means, where µ is the overall sample mean and the scalar ϕ controls the level of diffusion. An inverse gamma prior ψ_jg ∼ IG(α_0, β_j) is assumed for the uniquenesses of variable j in cluster g. Guided by Frühwirth-Schnatter & Lopes (2010), hyperparameters are chosen to ensure ψ_jg is bounded away from 0, thereby avoiding Heywood problems. With a sufficiently large shape α_0, variable-specific scales are derived from the sample precision matrix Ω = S^{−1} via β_j = (α_0 − 1)/Ω_jj. However, when N/p is close to or less than 1, or when S^{−1} is otherwise unavailable, Ω is replaced by a ridge-type estimator Ω̃ = (β_0 + N/2)(β_0 I_p + ½ Σ_{i=1}^{N} (x_i − x̄)(x_i − x̄)′)^{−1}, which combines the Wishart prior S^{−1} ∼ W_p(β_0, β_0 I_p) with the sample information, where β_0 is a hyperparameter (Frühwirth-Schnatter & Lopes, 2018). For unstandardised data, this estimator is constructed for the inverse correlation matrix and then appropriately scaled using the diagonal entries of S (Wang et al., 2015). When the variances are roughly balanced, constraining Ψ_g to ψ_g I_p, and/or using β_j = β = (α_0 − 1)/max(diag(S)), provides additional parsimony. Notably, the isotropic constraint provides the link between factor analysis and probabilistic principal component analysis (Tipping & Bishop, 1999). The rotational invariance property which makes FA models non-identifiable is well known: a covariance matrix Σ cannot be uniquely factored as ΛΛ′ + Ψ when q > 1, since the loadings are only identified up to orthogonal rotation. Though identifiability of Λ is not strictly necessary for the purposes of clustering or inferring Σ, addressing the identifiability problem offline using the parameter expanded approach of Ghosh & Dunson (2008) in tandem with Procrustean methods, as in McParland et al. (2014) and Aßmann et al. (2016), yields interpretable posterior summaries. Another practical issue is the label switching phenomenon (Frühwirth-Schnatter, 2010) which is addressed offline using the cost-minimising permutation given by the square assignment algorithm (Carpaneto & Toth, 1980). Finally, optimal FA and MFA models are chosen using the BIC-MCMC criterion (Frühwirth-Schnatter, 2011) where necessary in what follows.
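A minimal sketch of this hyperparameter choice follows, assuming the ridge-type reconstruction described above; the function name and the illustrative values of α_0 and β_0 are hypothetical, not the paper's defaults:

```python
import numpy as np

def uniqueness_scales(X, alpha0=2.5, beta0=3.0):
    """Variable-specific IG scale hyperparameters beta_j = (alpha0 - 1) / Omega_jj,
    where Omega is the sample precision matrix. When the sample covariance cannot
    be inverted, fall back to a ridge-type (Wishart posterior mean) estimator.
    alpha0 and beta0 are illustrative hyperparameter values (assumptions)."""
    N, p = X.shape
    S = np.cov(X, rowvar=False)
    try:
        Omega = np.linalg.inv(S)
    except np.linalg.LinAlgError:
        # Ridge-type estimator combining the Wishart prior with the sample info.
        Xc = X - X.mean(axis=0)
        Omega = (beta0 + N / 2) * np.linalg.inv(beta0 * np.eye(p) + 0.5 * Xc.T @ Xc)
    return (alpha0 - 1) / np.diag(Omega)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
print(uniqueness_scales(X))  # four positive IG scales
```

Because β_j is inversely proportional to the j-th diagonal of the precision matrix, variables that are well explained by the others receive smaller prior uniqueness scales.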

Mixtures of Infinite Factor Analysers
To overcome the requirement to specify q, infinite factor analysis (IFA) models are employed (Bhattacharya & Dunson, 2011). The IFA model is a factor analysis model which assumes a multiplicative gamma process (MGP) shrinkage prior on the loadings matrix. This prior allows the degree of shrinkage towards zero to increase as the column index k → ∞, mitigating against the factor splitting phenomenon. Here the IFA model is generalised to the mixture setting, leading to the novel mixture of infinite factor analysers (MIFA) model. Under MIFA, the MGP prior is placed on each parameter expanded Λ_g matrix with no restrictions on its entries, thereby making the induced prior on Σ_g invariant to the ordering of the variables. The MGP prior is conditionally conjugate, facilitating block Gibbs updates of the loadings and hence rapid mixing. Thus, the MGP prior in mixture settings is given by λ_jkg | φ_jkg, τ_kg, σ_g ∼ N(0, φ_jkg^{−1} τ_kg^{−1} σ_g^{−1}), with φ_jkg ∼ Ga(ν_1, ν_2), τ_kg = ∏_{h=1}^{k} δ_hg, δ_1g ∼ Ga(α_1, β_1), and δ_hg ∼ Ga(α_2, β_2) for h ≥ 2, where τ_kg is a column shrinkage parameter for the k-th column in the g-th cluster's loadings matrix Λ_g ∀ k = 1, . . . , ∞, and Ga(α, β) denotes the gamma distribution with mean α/β. The role of the local shrinkage parameters φ_1kg, . . . , φ_pkg for the p elements in column k of Λ_g is to favour sparsity while also preserving the signal of non-zero loadings. Lastly, the cluster shrinkage parameter σ_g reflects the belief that the degree of shrinkage is cluster-specific. A schematic illustration of the MGP prior is given in Figure 1; note that loadings can shrink arbitrarily close, but not exactly, to zero. Bhattacharya & Dunson (2011) fix β_1 = β_2 = 1 and recommend that α_2 > β_2. However, Durante (2017) elaborates on the cumulative shrinkage properties and roles played by hyperparameters, showing in particular that α_2 > β_2 + 1 is necessary in order to have column-specific variances τ_kg^{−1} that decrease in expectation with growing k.
It is also recommended that α 2 be moderately large relative to α 1 (to ensure that the cumulative shrinkage property for which the prior was developed holds) and to avoid excessively high values for α 1 (to avoid over-shrinking to increasingly low-dimensional factorisations). While Bhattacharya & Dunson (2011) assume Ga(ν, ν) priors for the local shrinkage parameters, here more general settings are used to allow control over prior non-informativity. In the spirit of Durante (2017), the expectation ν 2 /(ν 1 − 1) of the induced inverse gamma prior on φ −1 jkg is suggested to be ≤ 1 to induce sparsity on average. Furthermore, following the guidelines of Durante (2017), it is generally advisable that all MGP hyperparameters are chosen such that the first two moments of the associated hyperprior are defined, as this leads to superior performance in terms of the expected deviation between the true and estimated covariance matrices. In the mixture setting, α 1 and α 2 may need to be higher than the values suggested by Durante (2017) to enforce a greater degree of shrinkage in clusters with few units; this aspect is highlighted in simulation studies in Appendix B.
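The cumulative shrinkage behaviour is easy to verify by simulation. The following sketch (assuming β_1 = β_2 = 1 as in Bhattacharya & Dunson, 2011, with otherwise arbitrary hyperparameter values) estimates the column variances E[τ_kg^{−1}] by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(42)

def mgp_column_variances(q=10, alpha1=2.0, alpha2=4.0, n_draws=5000):
    """Simulate tau_k = prod_{h<=k} delta_h under the MGP prior (beta_1 = beta_2 = 1)
    and return Monte Carlo estimates of the column variances E[1/tau_k]."""
    delta = np.empty((n_draws, q))
    delta[:, 0] = rng.gamma(alpha1, 1.0, size=n_draws)        # delta_1 ~ Ga(alpha1, 1)
    delta[:, 1:] = rng.gamma(alpha2, 1.0, size=(n_draws, q - 1))  # delta_h ~ Ga(alpha2, 1)
    tau = np.cumprod(delta, axis=1)                           # cumulative products
    return (1.0 / tau).mean(axis=0)

v = mgp_column_variances()
print(v)  # with alpha2 > beta2 + 1 = 2, variances shrink roughly geometrically in k
```

Here E[1/τ_k] = (α_1 − 1)^{−1} (α_2 − 1)^{−(k−1)}, so with α_2 = 4 each additional column shrinks the prior variance by a factor of roughly 3.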

The Adaptive Gibbs Sampler
An adaptive Gibbs sampler (AGS) is employed when performing inference for MIFA. This dynamically shrinks the loadings matrices (and the infinite scores matrix η) to have finite numbers of columns, by selecting the number of 'active' factors. This practically facilitates posterior computation while closely approximating the IFA model, without requiring specification of Q = (q_1, . . . , q_G)′. However, a strategy is required for choosing appropriate truncation levels, q_g, that strike a balance between missing important factors and wasting computational effort. For computational reasons, a conservatively high upper bound is used, such that q_g = min(⌊3 ln(p)⌋, N − 1, p − 1) ∀ g. The number of factors in each Λ_g is then adaptively tuned as the MCMC chain progresses. Adaptation can be made to occur only after the burn-in period, in order to ensure the true posterior distribution is being sampled from before truncating the loadings matrices. At the t-th iteration, adaptation occurs with probability p(t) = exp(−b_0 − b_1 t), with b_0 ≥ 0 and b_1 > 0 chosen so that adaptation occurs often at the beginning of the chain but then decreases exponentially fast in frequency. Here b_0 = 0.1 and b_1 = 5 × 10^{−5} are used. With probability p(t), loadings columns having some pre-specified proportion ς of elements within a small neighbourhood ε of zero are monitored. If there are no such columns, an additional column is added by simulation from the MGP prior. Otherwise redundant columns are discarded and the AGS proceeds with all parameters corresponding to non-redundant columns retained. Choice of ς and ε can be delicate, as there is an implicit trade-off between these two fixed tuning parameters; smaller ς and larger ε speed up the algorithm by favouring the discarding of factors during the adaptation step, and vice versa. Typically, ς should be kept close to 1 and ε should be kept small, relative to the scale of the data.
Here, ς = ⌊0.7p⌋/p and ε = 0.1 are found to strike an appropriate balance. The dimension of the matrix η of factor scores at a given iteration is set to N × q⋆ = N × max(Q^(t)); rows corresponding to observations currently assigned to a cluster with fewer latent factors than q⋆ are padded with zeros. Notably, here q_g may shrink to 0, thus allowing diagonal covariance structure within a component. If this occurs, the decision to simulate a new column is based on a binary trial with probability 1 − ς, as there are no loadings columns to monitor.
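A single adaptation step of the AGS might be sketched as follows. This is a simplified illustration: a standard Gaussian draw stands in for simulation from the MGP prior, and the function name and tuning values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)

def adapt_loadings(Lambda, t, b0=0.1, b1=5e-5, varsigma=0.99, eps=0.1):
    """One adaptation step (illustrative sketch): with probability
    p(t) = exp(-b0 - b1*t), drop loadings columns whose proportion of elements
    within eps of zero is at least varsigma; if none qualify, append a column
    (here drawn N(0, 1) as a stand-in for a draw from the MGP prior)."""
    if rng.uniform() >= np.exp(-b0 - b1 * t):
        return Lambda  # no adaptation at this iteration
    near_zero = (np.abs(Lambda) < eps).mean(axis=0)  # per-column proportion near 0
    redundant = near_zero >= varsigma
    if redundant.any():
        return Lambda[:, ~redundant]                 # discard redundant columns
    return np.column_stack([Lambda, rng.normal(size=Lambda.shape[0])])

# Two informative columns plus one exactly-zero (redundant) column:
Lambda = np.column_stack([rng.normal(size=(20, 2)), np.zeros((20, 1))])
Lambda = adapt_loadings(Lambda, t=1)
print(Lambda.shape)  # the zero column is dropped if adaptation fires
```

In the full sampler, the scores matrix and all MGP parameters attached to a discarded column would be dropped (or added) in tandem, as described above.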
The number of active factors in each cluster for each retained posterior sample can be used to construct a bar chart approximation to the posterior distribution of q_g. The posterior mode is used to estimate each q_g, with credible intervals quantifying uncertainty. Another strategy, which circumvents the need to pre-specify ς and ε, is to forego adaptation (provided the computational burden of doing so is tolerable) and estimate q_g from the number of non-redundant columns in the posterior mean loadings matrices. However, this approach is not considered further here.
In any case, the main advantages of MIFA are that different clusters can be modelled by different numbers of factors and that the model search is reduced to one for G only, as q_g is estimated automatically during model fitting. Here, for MIFA models, the optimal G is chosen via the BICM (BIC-Monte Carlo) criterion proposed by Raftery et al. (2007), with BICM = 2 l̄ − 2 s_l² (ln(N) − 1), where l̄ and s_l² are the sample mean and sample variance, respectively, of the log-likelihood values calculated for each retained posterior sample. This criterion is particularly useful in the context of nonparametric models where the number of free parameters is difficult to quantify, though we caution that it may be biased in favour of G = 1 models, under which the log-likelihoods tend to exhibit less variability, and that a large number of posterior samples is required to ensure stable estimation of s_l².
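The BICM is straightforward to compute from the retained log-likelihood values; a sketch with simulated, purely illustrative log-likelihoods:

```python
import numpy as np

def bicm(loglik_samples, N):
    """BICM of Raftery et al. (2007): 2*lbar - 2*s2*(ln N - 1), where lbar and s2
    are the sample mean and variance of the retained log-likelihood values."""
    lbar = np.mean(loglik_samples)
    s2 = np.var(loglik_samples, ddof=1)
    return 2 * lbar - 2 * s2 * (np.log(N) - 1)

rng = np.random.default_rng(3)
ll = rng.normal(loc=-500.0, scale=2.0, size=4000)  # illustrative log-likelihoods
print(bicm(ll, N=572))
```

Note how the criterion penalises variability in the log-likelihood (via s_l²) in place of an explicit parameter count, which is what makes it usable for the nonparametric models considered here; it also makes clear why unstable estimates of s_l² from short chains are problematic.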

Other Infinite Factor Models
This work offers an extension of the MGP prior and its related AGS routine to the mixture modelling context. Wang et al. (2016) develop a related model employing a multiplicative exponential process prior. Other nonparametric approaches to inferring the number of factors include Knowles & Ghahramani (2007), in which a two-parameter Indian Buffet Process (IBP) prior is assumed on an infinite binary matrix underlying the factor scores, thus selecting features of interest, with associated standard Gaussian weights. A closely related approach using the beta process (BP) is provided by Paisley & Carin (2009). In Knowles & Ghahramani (2011) and Ročková & George (2016), an IBP prior is instead assumed for sparsifying the loadings. These models assume a single sparse infinite factor model for the whole data set. However, embedding them in a mixture modelling setting, similar to the IMIFA framework, is intuitively feasible. Indeed, Chen et al. (2010) employ the BP prior, coupled with a Dirichlet process prior, to perform clustering in a manifold learning setting. While the BP and IBP priors achieve exact sparsity, which may be advantageous in certain applications, the MGP prior has a weaker notion of sparsity by virtue of cumulatively shrinking an infinite series arbitrarily close to zero, thereby preserving small signals. The block updates of each row of Λ g facilitated by the MGP prior and parameter expansion mean the AGS approach is a simpler, more computationally efficient alternative to the BP and IBP priors.

Overfitted Mixtures of (Infinite) Factor Analysers
While MIFA obviates the need to pre-specify Q, the issue of model choice is not yet fully resolved. Overfitted mixtures (Rousseau & Mengersen, 2011;van Havre et al., 2015) are one means of extending MIFA; indeed Papastamoulis (2018) proposes an overfitted mixture of factor analysers (OMFA), albeit with finite factors. Here, the overfitted mixture of infinite factor analysers (OMIFA) model is introduced.
In overfitted mixtures the symmetric Dirichlet prior on π plays an important role. Estimation is approached by initially overfitting the number of clusters expected to be present. Small values of the hyperparameter α encourage emptying out excess components in the posterior distribution; the uniform prior with α = 1 is rather indifferent in this respect. The sampler is initialised with a conservatively high number of components, G⋆ = min(max(⌈3 ln(N)⌉, 25), N − 1), though this may be too high if it is close to N. While G = G⋆ remains fixed throughout the MCMC chain, the number of non-empty clusters G_0 is recorded at each iteration of the sampler via the indicator function, i.e. G_0 = Σ_{g=1}^{G⋆} 1(N_g > 0), where N_g is the number of observations currently in cluster g. The true G is estimated by Ĝ, the G_0 value visited most often. Cluster-specific inference is conducted only on samples corresponding to those visits. For the OMIFA model, the AGS is modified to handle empty components: the MGP-related parameters are simulated from the relevant priors and each corresponding Λ_g matrix is restricted to having q⋆ factors, i.e. the same number of columns currently in the matrix of factor scores η, either by truncation or by padding with zeros, as required.

Infinite Mixtures of (Infinite) Factor Analysers
Embedding MFA and MIFA in an infinite mixture setting leads, respectively, to the infinite mixture of finite factor analysers model (IMFA) and the flagship infinite mixture of infinite factor analysers model (IMIFA). These models employ a nonparametric Pitman-Yor process (PYP) prior which is easily incorporated into the MCMC sampling scheme.
The PYP is a stochastic process whose draws are discrete probability measures, whereby H ∼ PYP(α, d, H 0 ) denotes a PYP probability distribution H, with base distribution H 0 interpreted as the mean of the PYP, discount parameter d ∈ [0, 1), and concentration parameter α > −d.
For the PYP mixture model IMFA and the PYP-MGP mixture model IMIFA, H_0 comes from the factor-analytic mixture (1). The stick-breaking representation of the PYP (Pitman, 1996) is used as a prior process for generating the mixing proportions in (2). This construction views {π_1, π_2, . . .} as pieces of a unit-length stick that is sequentially broken in an infinite process, with stick-breaking proportions Υ = {υ_1, υ_2, . . .}, summarised as υ_g ∼ Beta(1 − d, α + gd), π_g = υ_g ∏_{l<g} (1 − υ_l), and H = Σ_{g=1}^{∞} π_g δ_{θ_g}, where δ_{θ_g} is the Dirac delta centred at θ_g, such that draws are composed of a sum of infinitely many point masses. The PYP reduces to the DP when d = 0, in which case mass shifts to the right with increasing dispersion as α increases, implying an a priori larger number of components. However, some important distributional features fundamentally differ when d ≠ 0 (De Blasi et al., 2015). The PYP exhibits heavier tail behaviour and allows the stick-breaking distribution to vary according to the component index g, without sacrificing much in the way of tractability.
In particular, increasing d values have the effect of flattening the prior, controlling its degree of non-informativity. Slice sampling (Walker, 2007; Kalli et al., 2011) is used here to yield samples from the PYP by adaptively truncating the number of components needed to be sampled at each iteration. By introducing an auxiliary variable u_i > 0 which preserves the marginal distribution of the data, and denoting by ξ = {ξ_1, ξ_2, . . .} an infinite sequence of positive quantities which sum to 1, the joint density of (x, u) is given by f(x, u | θ, ξ) = Σ_{g=1}^{∞} π_g Unif(u; 0, ξ_g) f(x | θ_g). Since only a finite number of ξ_g are greater than u, the conditional density of x | u can be written as a finite mixture with G⋆ = max_{i ∈ {1,...,N}} (|A_ξ(u_i)|) 'active' components at each iteration, where | · | denotes cardinality and A_ξ(u) = {g : u < ξ_g}. Though G is infinite in theory, G⋆ can be at most equal to N. Thus, the infinite mixture of (infinite) factor analysers models can be sampled from.
Typical implementations of the slice sampler arise when ξ g = π g (Walker, 2007) but independent slice-efficient sampling (Kalli et al., 2011) allows for a deterministic decreasing sequence, e.g. geometric decay, given by ξ g = (1 − ρ) ρ g−1 where ρ ∈ [0, 1) is a fixed value to be chosen with care. Higher values generally lead to better mixing but longer run-times, as the average cardinality of A ξ (u) increases, and vice versa. Setting ρ = 0.75, in line with the recommendations of Kalli et al. (2011), appears to strike an appropriate balance in the applications considered here.
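The stick-breaking construction and the geometric slice truncation can be sketched as follows; the truncation level, hyperparameter values, and slice variables below are illustrative, and the function names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(11)

def pyp_stick_breaking(alpha=0.5, d=0.25, trunc=200):
    """Stick-breaking weights of a PYP(alpha, d): v_g ~ Beta(1 - d, alpha + g*d)
    and pi_g = v_g * prod_{l<g} (1 - v_l), truncated at `trunc` sticks."""
    g = np.arange(1, trunc + 1)
    v = rng.beta(1 - d, alpha + g * d)
    pi = v * np.concatenate(([1.0], np.cumprod(1 - v[:-1])))
    return pi

def active_components(u, rho=0.75):
    """Independent slice-efficient sampler with geometric xi_g = (1 - rho) rho^(g-1):
    the number of active components is max_i |{g : u_i < xi_g}|."""
    # u < (1 - rho) * rho^(g-1)  <=>  g - 1 < log(u / (1 - rho)) / log(rho)
    card = np.ceil(np.log(u / (1 - rho)) / np.log(rho))
    return int(np.clip(card, 0, None).max())

pi = pyp_stick_breaking()
u = rng.uniform(0, 0.05, size=100)  # illustrative slice variables
print(pi.sum() < 1, active_components(u) >= 1)  # True True
```

Smaller slice variables u_i activate more components (|A_ξ(u_i)| grows like log u_i under geometric decay), which is the mechanism behind the mixing/run-time trade-off in the choice of ρ noted above.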

Inference for Infinite Mixtures of Factor Analysers Models
For clarity, what follows focuses on the IMIFA model, where inference proceeds via the independent slice-efficient sampler with geometric decay. Inference for other models in the IMIFA family is closely related. The joint density of the IMIFA model is the product of the slice-sampler representation of the data density given above, the Beta stick-breaking densities (involving the Beta function B(·)), and f(θ), the product of the previously defined collection of conditionally conjugate priors with additional layers for hyperparameters. Only the parameters of the G⋆ active components are sampled at each iteration. The algorithm is initialised with the same G⋆ value detailed in Section 2.3, typically above the anticipated number to which the algorithm will converge, in the spirit of Hastie et al. (2014). Here, however, G⋆ can theoretically exceed this value. For computational reasons, a finite upper limit is placed on G⋆, with max(G⋆, min(N − 1, 50)) found to be sufficiently large. However, G⋆ is only regarded as a set of proposals as to where to allocate observations; as in Section 2.3, it is the subset of non-empty clusters G_0 that is of inferential interest. Bayesian approaches to clustering are known to be sensitive to initial cluster allocations. While starting values for z_i can be obtained by any means, model-based agglomerative hierarchical clustering (Scrucca et al., 2016) is used here. Though this is fast and intuitive, given that IMIFA models are initialised at a conservatively high number of components which are then merged as the sampler proceeds, heavily imbalanced initial cluster sizes are cautioned against. By extension, initial cluster means and mixing proportions are computed empirically. Other parameter starting values are simulated from their relevant prior distributions. The adaptive inferential algorithm for IMIFA then proceeds mostly via Gibbs updates. For those which are multivariate Gaussian, using the Cholesky factor of the covariance matrices and employing block updates speeds up the algorithm (Rue & Held, 2005).
The allocations z i are sampled in a fast, numerically stable fashion, using the Gumbel-Max trick (Yellott, 1977). Finally, state spaces for applications of IMIFA to real data can be highly multimodal with well-separated regions of high posterior probability coexisting, corresponding to clusterings with different numbers of components. Thus, label switching moves (Papaspiliopoulos & Roberts, 2008) are incorporated in order to improve mixing. Details of the Gibbs updates, Gumbel-Max trick, and label switching moves are provided in Appendix A.
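The Gumbel-Max trick samples allocations directly from unnormalised log probabilities, avoiding underflow when the log posterior weights are large and negative; a sketch with hypothetical weights:

```python
import numpy as np

rng = np.random.default_rng(5)

def gumbel_max_sample(log_probs):
    """Sample an index with probability proportional to exp(log_probs) by adding
    standard Gumbel noise and taking the argmax; this works directly on
    unnormalised log probabilities, so nothing need be exponentiated."""
    gumbel = -np.log(-np.log(rng.uniform(size=log_probs.shape)))
    return int(np.argmax(log_probs + gumbel))

# Unnormalised log allocation weights for one observation (illustrative values
# too negative to exponentiate directly without shifting):
log_w = np.array([-1200.0, -1195.0, -1201.5])
counts = np.bincount([gumbel_max_sample(log_w) for _ in range(20000)],
                     minlength=3)
print(counts.argmax())  # component 1 carries almost all the mass (e^5 x the next)
```

Naively computing exp(−1200) would underflow to zero for every component; the Gumbel-Max construction sidesteps the normalisation entirely, which is the numerical stability property referred to above.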

Assessing Model Fit and Mixing
As is good statistical practice, posterior predictive model checking (Gelman et al., 2004) is employed. Sampled model parameters from the MCMC chain are used to generate replicate data from the posterior predictive distribution. Valid posterior samples, after conditioning on Ĝ, are those for which max(Q^(t)) ≥ max(q̂_1, . . . , q̂_Ĝ), such that the dimension of the estimated scores matrix η is preserved. To assess model fit, histograms of the modelled data X are compared to histograms of the replicate data in a global sense using the Posterior Predictive Reconstruction Error (PPRE), calculated as follows: 1. Gather the histogram bin counts of each variable in X into the h × p matrix H, where h is the maximum number of bins across all variables and H is padded with zeros as required.
2. Generate replicate data sets X^(r), r = 1, . . . , R, from the posterior predictive distribution.
3. Create a similar matrix of histogram bin counts H^(r) for each X^(r) using the same breakpoints with which H was constructed (with endpoint bins extended to ±∞).

4. Compute the Frobenius norm ‖·‖_F between H and H^(r), standardising to the 0–1 scale using the triangle inequality: PPRE^(r) = ‖H − H^(r)‖_F / (‖H‖_F + ‖H^(r)‖_F). The distribution of PPRE values can be visualised using boxplots and summarised by the median, with credible intervals quantifying uncertainty. This discrepancy measure is well-suited to assessing model adequacy for mixtures of multivariate data: it accounts for inherent multimodality and gives a global quantitative measure of agreement between the distributions of the observed variables and their posterior predictive counterparts. Convergence of the MCMC chains is assessed using the potential scale reduction factor (PSRF; Brooks & Gelman, 1998; Plummer et al., 2006). Random allocations of the initial cluster labels, resulting in different draws from the relevant priors for parameter initialisation, are used to construct the multiple overdispersed chains required. The MAP labels of each chain are matched to the main chain prior to computing the diagnostics; Λ_g matrices are also rotated to a common template for each cluster. Good convergence is indicated by upper PSRF 95% confidence interval limits close to 1; this is a stricter requirement than the PSRF values themselves being near 1.
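The PPRE steps above can be sketched as follows. This is an illustrative implementation: the bin counts, data, and simulated stand-ins for the posterior predictive replicates are all arbitrary, and equal bin counts per variable are used so no zero-padding of H is needed:

```python
import numpy as np

rng = np.random.default_rng(9)

def ppre(X, X_reps, n_bins=20):
    """Posterior Predictive Reconstruction Error (sketch): per-variable histogram
    counts of the observed data X are compared with counts of each replicate,
    binned on the same breakpoints (endpoint bins extended to +/- infinity),
    via a Frobenius norm standardised to [0, 1] by the triangle inequality."""
    p = X.shape[1]
    edges = [np.histogram_bin_edges(X[:, j], bins=n_bins) for j in range(p)]
    H = np.column_stack([np.histogram(X[:, j], bins=edges[j])[0] for j in range(p)])
    errors = []
    for Xr in X_reps:
        Hr = np.column_stack([
            np.histogram(Xr[:, j],
                         bins=np.concatenate(([-np.inf], edges[j][1:-1], [np.inf])))[0]
            for j in range(p)
        ])
        errors.append(np.linalg.norm(H - Hr)
                      / (np.linalg.norm(H) + np.linalg.norm(Hr)))
    return np.array(errors)

X = rng.normal(size=(200, 3))
reps = [rng.normal(size=(200, 3)) for _ in range(5)]  # stand-ins for replicates
vals = ppre(X, reps)
print(np.all((vals >= 0) & (vals <= 1)))  # True: triangle inequality bounds PPRE
```

In practice the replicates would be drawn from the fitted model's posterior predictive distribution rather than simulated independently as here.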

Comparing the IMIFA Family Models
Though IMIFA and OMIFA come with the computational complexities inherent in nonparametric methods, diminishing adaptation, and extra tuning parameters, their advantages over other models in the IMIFA family are numerous: i) flexibility, in the sense that models where q_g ≠ q_{g′} can be fitted, ii) computational efficiency, in the sense that the burden is reduced relative to searching over a range of fitted MFA or MIFA models, iii) removing the need for model selection criteria, and iv) the ability to quantify the uncertainty in G and q_g. Both methods offer simpler alternatives to reversible jump MCMC (Richardson & Green, 1997) and birth-death MCMC (Stephens, 2000). Hence, among the IMIFA family, the infinite factor models are recommended over the finite factor models and the infinite and overfitted mixtures are recommended over the finite mixtures. However, the MIFA model is appropriate if one wishes to fix G but infer q_g.
While infinite mixtures are often used for density estimation, they are also employed to infer the number of components in cluster analyses (e.g. Kim et al. 2006; Xing et al. 2006; Yerebakan et al. 2014). However, Miller & Harrison (2013) raise concerns about the guarantee of posterior consistency for the number of non-empty clusters, showing the number uncovered is typically greater than or equal to the truth, often with several vanishingly small clusters inferred. These concerns highlight the need for practitioners to pay due consideration to the uncertainty in the number of clusters offered by IMIFA models. Relatedly, Frühwirth-Schnatter & Malsiner-Walli (2019) compare infinite mixtures to overfitted ('sparse finite') mixtures. They highlight that overfitted mixtures are useful for applications in which the data arise from a moderate number of clusters, even as the sample size increases, whereas infinite mixtures are suited to cases where the number of clusters also increases. However, they show that clustering results are driven less by the assumption of whether the data arose from a finite or infinite mixture than by the hyperprior on the DP parameters or the sparseness of the Dirichlet prior in the overfitted setting. Indeed, they show that overfitted and infinite mixtures yield comparable clustering performance on the observed data when these hyperpriors are matched. This matching leads to 'sparse' infinite mixtures that avoid overfitting the number of clusters. Similar behaviour is observed for the PYP prior in the applications in Section 3, where the IMIFA and OMIFA models, with matched hyperpriors, give comparable results.
The issue of choosing α can make implementing overfitted models challenging. With fixed α = γ/G⋆, the prior approximates a DP with concentration parameter γ as G⋆ tends to infinity (Green & Richardson, 2001). Here, following Frühwirth-Schnatter & Malsiner-Walli (2019), a Ga(a, bG⋆) hyperprior is assumed for α. This favours small values and allows α to be updated via Metropolis-Hastings. In the infinite mixture setting, learning the PYP parameters (which also requires Metropolis-Hastings steps) and adopting the label-switching moves enables accurate inference on G_0. A joint hyperprior p(α, d) = p(α | d) p(d) is assumed (Carmona et al., 2019), where p(α | d) = Ga(α + d; a, b); choosing a large b encourages clustering (Müller & Mitra, 2013). A spike-and-slab hyperprior d ∼ κδ_0 + (1 − κ) Beta(a′, b′) is assumed. The estimated proportion κ̂ can then be used to assess whether the data arose from a DP or a PYP at little extra computational cost. See Appendix A for further details.

Illustrative Applications
The flexibility and performance of the IMIFA model and its related model family are demonstrated below through application to benchmark and real data sets. All results are obtained through the IMIFA R package; code to reproduce many of the results is available in the associated vignette. Appendix B reports on simulation studies demonstrating the performance of IMIFA under different scenarios, including effects of the N/p ratio, the PYP parameters, imbalanced cluster sizes, unequal q_g, and the degree of loadings sparsity. Appendix C explores the robustness of IMIFA.
MCMC chains were run for 50,000 iterations, except for Section 3.3 in which 20,000 were run. The chains were thinned by retaining every 2nd sample, and the first 20% of iterations were discarded as burn-in. All computations were performed on a Dell Latitude 5491 laptop, equipped with a 6-core 2.60 GHz Intel Core i7-8850H processor and 16 GB of RAM. Where necessary, the optimal finite and infinite factor models are chosen by the BIC-MCMC and BICM criteria, respectively. Throughout, the hat symbol (ˆ) denotes the posterior mode, posterior mean, or relevant optimal value, as appropriate. Unless otherwise stated, data were mean-centred and unit-scaled and no constraints were imposed on the uniquenesses. Hyperprior specifications are detailed in Table 1. While there are many hyperparameters to select, the choices are all reasonably standard. However, poor settings may introduce additional factors or clusters to maintain flexibility and so care in specifying hyperparameters is advised.

Benchmark Data: Italian Olive Oils
The Italian olive oil data (Forina & Tiscornia, 1982; Forina et al., 1983) is often clustered using factor-analytic models, e.g. McNicholas (2010). The data detail the percentage composition of 8 fatty acids in 572 Italian olive oils, known to originate from three areas: southern and northern Italy and Sardinia. Each area is composed of different regions: southern Italy comprises north Apulia, Calabria, south Apulia, and Sicily; Sardinia is divided into inland and coastal Sardinia; and northern Italy comprises Umbria and east and west Liguria. Hence, the true number of clusters is hypothesised to correspond to either 3 areas or 9 regions. The full family of IMIFA models is fitted to the olive oil data with results detailed in Table 2. Models relying on pre-specification of finite ranges of G and/or q are based on G = 1, . . . , 9 and q = 0, . . . , 6. Clustering performance is evaluated using the adjusted Rand index (ARI; Hubert & Arabie, 1985) and the misclassification rate, compared to the 3 area labels. The α parameter is reported as its fixed value or posterior mean, as appropriate. Table 2 shows the flexibility and accuracy of the developed model family, and of the IMIFA model in particular, which has the best clustering performance. Additionally, IMIFA is the most computationally efficient model considered, among those in the IMIFA family achieving clustering, as it requires only one run. This speed advantage would be even more pronounced with larger data sets. However, methods requiring fitting of multiple models were run here in series; parallel implementations would reduce runtimes. Finally, the models allowing cluster-specific numbers of factors show improved clustering performance compared to the corresponding finite factor models in every case.
Table 2 also shows that the performance of the IMIFA model compares favourably to the best parsimonious Gaussian mixture model, fitted via the pgmm R package (McNicholas et al., 2019), and the best mixture of factor mixture analysers (MFMA) model (Viroli, 2010), evaluated with 1, . . . , 5 components in both layers. Models with zero factors were not considered in either case. IMIFA also outperforms the best constrained Gaussian mixture model fitted using mclust (Scrucca et al., 2016). These finite mixtures are fitted via maximum likelihood and use the BIC for model selection after fitting a large number of candidate models.

Table 2: Results of fitting a range of models, including the full IMIFA family, to the Italian olive oil data, detailing the number of candidate models explored, the run-time relative to the IMIFA run (approx. 782 seconds), the posterior mean or fixed value of α, the posterior mean of d, modal estimates of G and Q, and the ARI and misclassification rate as evaluated against the known area labels, under the optimal or modal model as appropriate. For example, the pgmm†,‡ row reads: 588 models explored; relative run-time 4.46; Ĝ = 5; Q̂ = (6, 6, 6, 6, 6); ARI = 0.53; misclassification rate 35.84%.

† Due to the various covariance matrix decompositions considered, the results for mclust, MFMA, and pgmm are reported for the unstandardised data, for which superior clustering performance in terms of the ARI was achieved in each case.

‡ The optimal pgmm model uses the UCU constraint on the uniquenesses (i.e. Ψg = Ψ). Among the more directly comparable unconstrained UUU models, the optimal one according to BIC has G = 6 components, each with 5 factors, and achieves an ARI of 0.43. Notably, the pgmm models chosen by BIC both have more components than the IMIFA model.
It is also notable that, within the set of IMIFA models relying on information criteria, those deemed optimal were not necessarily optimal in a clustering sense. For instance, the 4-cluster MIFA model yields an ARI of 0.94 and a misclassification rate of 6.99% with respect to the 3 area labels, despite its sub-optimal BICM. Similarly, the BICM and BIC-MCMC criteria suggest different optimal MFA models. For the IMIFA model κ̂ ≈ 0.89, suggesting similar inference would have resulted under a DP prior. Indeed, the results obtained by the OMIFA and OMFA models are similar to those of their infinite mixture counterparts, though the latter provide a better fit to the data (see Figure 5).

Table 3a tabulates the MAP clustering against the 3 area labels and suggests this solution makes geographic sense, in that northern oils are cleanly split into two sub-clusters. Cluster 1 contains all of the 323 southern Italy oils: this large cluster requires the largest number of factors (q̂1 = 6 [5, 6], with 95% credible intervals in brackets). Some of the other clusters require notably fewer (q̂2 = 3 [1, 6], q̂3 = 6 [3, 6], and q̂4 = 2 [1, 4]). Table 3b gives the confusion matrix with oils from the north labelled by their associated region(s), yielding an ARI of 0.994 and a misclassification rate of 0.52%. Figure 3 shows the uncertainty in the allocations to these clusters. Only three oils have a large probability of belonging to a cluster other than the one to which they were assigned by the IMIFA model; these oils are highlighted in red in Table 3b.

Table 3: Confusion matrices of the MAP IMIFA clustering of the Italian olive oils against (a) the known 3 area labels and (b) the new labelling in which northern Italy is split into its constituent sub-regions.

To assess sensitivity to starting values, the IMIFA model was re-fitted using multiple random initial allocations, implying also different random draws from the priors for parameter starting values.
These runs led to identical inference about G and Q and equivalent clustering performance. These overdispersed chains were used to compute the upper 95% PSRF confidence limits depicted in Figure 4, which indicate good convergence. The PPRE boxplots in Figure 5 demonstrate the superior fit of the IMIFA model (with a median PPRE of 0.10) to the olive oil data, compared to the other IMIFA family models. Histograms comparing the bin counts between the modelled and replicate data sets for each variable, under the IMIFA model, are given in Appendix D.

Spectral Metabolomic Data
IMIFA is employed to cluster spectral metabolomic data for which N ≪ p (Figure 6). The data are nuclear magnetic resonance spectra consisting of p = 189 spectral peaks from urine samples of N = 18 participants, half of whom are known to have epilepsy (Carmody & Brennan, 2010; Nyamundanda et al., 2010). Interest lies in uncovering any underlying clustering structure given the N ≪ p setting.
Data were mean-centred and Pareto scaled (van den Berg et al., 2006). Although N ≪ p, no restrictions are imposed on the uniquenesses, as the sample variances are quite imbalanced. Fitting MIFA models for G = 1, . . . , 5 is feasible as N is small. The BICM criterion chooses G = 2 as optimal, and one participant is misclassified. IMIFA, however, visits a 2-cluster model in every posterior sample and perfectly uncovers the group structure.
The modal estimates of the numbers of factors in each IMIFA cluster are q̂1 = 3 [2, 9] and q̂2 = 5 [4, 13] (see Figure 7). Cluster 1 corresponds to the control group and Cluster 2 to the epileptic participants. Figure 8 illustrates the p × q̂g posterior mean loadings matrices, based on retained samples with q̂g or more factors, after Procrustes rotation to a common template for both clusters. The sparsity and shrinkage induced by the MGP prior are apparent, as is the greater complexity in Cluster 2, given the greater variation in colour and larger number of factors. For instance, many elevated loadings are visible for chemical shift values between 8 and 10 for the first two factors in Cluster 2; this activity is not present for other factors in either cluster. In general, the distributions of the loadings within a factor exhibit narrow spread around zero, particularly for the cluster of control participants, with the exception of the regions of the spectrum corresponding to the large peaks between chemical shifts of 3 and 5 in Figure 6. IMIFA outperforms the optimal G = 3 mclust model and the optimal G = 2, q = 5 pgmm model, which have respective ARI values of 0.73 and 0.27. The clustering performance of the optimal MFMA model is identical to that of the optimal MIFA model described above. Given the N ≪ p nature of the data, spectral clustering with the Gaussian kernel (Ng et al., 2001) is also considered. The eigengap heuristic suggests G = 2 and a perfect clustering is achieved almost instantaneously. However, this approach does not characterise the uncovered clusters in an interpretable manner, nor does it provide estimates of cluster membership uncertainty as given by model-based clustering approaches such as IMIFA.
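The rotation step, aligning each retained posterior loadings sample to a common template before averaging, can be sketched with an orthogonal Procrustes solve; this is an illustrative stand-in for the package's routine, with hypothetical dimensions:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(1)
template = rng.normal(size=(10, 3))           # p x q reference loadings (hypothetical)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random orthogonal matrix
sample = template @ Q                          # a rotated posterior draw

# Find the orthogonal R minimising ||sample @ R - template||_F.
R, _ = orthogonal_procrustes(sample, template)
aligned = sample @ R
print(np.allclose(aligned, template))  # True up to numerical error
```

Because factor loadings are only identified up to rotation, averaging raw posterior samples would blur the loadings; aligning each draw to a common template first makes the posterior mean interpretable.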
The median PPRE for the IMIFA model of 0.21 [0.18, 0.24] shows good model fit, given the size and dimensionality of the data. The median PSRF upper 95% confidence limits, using three randomly initialised auxiliary chains, for the cluster means, uniquenesses, loadings, and mixing proportions, of 1.01 (0.01), 1.00 (< 0.01), 1.01 (0.08), and 1.00 (< 0.01) respectively (standard deviations in parentheses), also indicate good mixing. Notably, all chains yield the same inference about G and Q. The OMIFA model again yields the same inference, although its model fit is inferior (median PPRE = 0.26).

Handwritten Digit Data
A final illustration of IMIFA is given through its application to handwritten digit data from the United States Postal Service (USPS; Le Cun et al., 1990; Hastie et al., 2001). Here, N = 7,291 images of the digits 0, . . . , 9 are considered, taken from handwritten zip codes; the data are not balanced in terms of digit labels. Each image is a 16 × 16 grayscale grid concatenated into a p = 256-dimensional vector; data were mean-centred but not scaled. Such data are often considered in the context of manifold learning, positing that the data dimensionality is artificially high. Given N and p, fitting a range of MFA or MIFA models is practically infeasible, so results of a single IMIFA run are presented here. For these data, it is reasonable to expect the number of components to grow as the sample size grows. It is anticipated that the flexibility afforded by having cluster-specific numbers of factors will help characterise digits with different geometric features.
The IMIFA model visited a G = 21 cluster solution in all posterior samples; Table 4 cross-tabulates the MAP clustering against the known digit labels, and this solution achieves an ARI of 0.33. The median PPRE of 0.05 [0.04, 0.06] indicates good model fit. The overdispersed chains used to compute the PSRF diagnostics lead to identical inference about the number of clusters but slightly different inference about the modal numbers of cluster-specific factors; the ARI values between each resulting pair of MAP partitions all exceed 0.93. As before, good mixing is indicated by median PSRF upper 95% confidence limits for the cluster means, uniquenesses, and mixing proportions of 1.01 (0.01), 1.01 (0.01), and 1.01 (< 0.01), respectively. In computing the diagnostic for the loadings (1.14 (0.35)), only the first factor (common to all loadings matrices across all clusters in all chains) was considered, for reasons of fairness and computational resource constraints.

Table 4: Cross-tabulation of the IMIFA model's MAP clustering (rows) against true digit labels (columns) for the USPS data. Cells that are 0 are blank for clarity. Posterior means π̂g and modal estimates q̂g, with associated 95% credible intervals, are also given.

Generally, IMIFA assigns images of the same digit, albeit written differently, to different clusters. Posterior mean images for each cluster are shown in Figure 9, ordered, as is Table 4, from 0 to 9 according to the digit most frequently assigned to the related cluster. Cluster 7 and the smaller cluster 8 capture the digit 1 written in a straight and slanted fashion, respectively. Clusters 15-17 represent the digit 6 written with extended, medium, and compact loop curvature, respectively. Notably, cluster 15 requires more factors than clusters 16 and 17. A similar interpretation follows for clusters 20 and 21 (q̂20 = 2, q̂21 = 3), mostly capturing the digit 9 with small and large loops, respectively.
Cluster 19 appears to mostly represent the digit 8 and has a large number of factors (q̂19 = 6) in comparison, say, to clusters 7 and 8 (q̂7 = 2, q̂8 = 1), which capture the digit 1. This is intuitive, as 8 is a more geometrically complex digit than 1. However, some clusters appear to be diluted by the confusion of the so-called 'closed' 4, in contrast to the 'open' 4 in cluster 12, with the digits 3, 5, and 8 (cluster 19) and the digits 7 (written with a horizontal bar) and 9 (clusters 20 and 21). Many clusters capture the most common digit, 0, with differing degrees of elongation and border thickness. Of concern here is cluster 4, containing just 9 units; the fact that q̂4 = 16, the upper AGS limit, suggests that the model struggles to shrink the number of factors in poorly populated clusters. This difficulty is highlighted further in the simulation studies in Appendix B. Finally, Table 4 indicates that clusters 10, 13, and 14 also capture several other digits, which is reflected in the blurriness of the resulting posterior mean images and in q̂10, q̂13, and q̂14 being quite large. The cluster-membership uncertainties are visualised in Appendix D.

Figure 9: Posterior mean images for each cluster, ordered as in Table 4 and labelled with the modal q̂g.
It is computationally infeasible to run mclust, pgmm, or MFMA on these large data, as the space for an exhaustive model search would be too vast. For comparative purposes, a DP-BP model (Chen et al., 2010) is fitted; this approach also simultaneously assumes infinitely many components and factors. It finds 43 clusters, each with around 14 factors, and achieves an ARI of 0.32. Cross-tabulating this clustering against the 21 clusters of the IMIFA model shows that some of the DP-BP clusters are encapsulated by the larger IMIFA clusters. IMIFA is thus the more parsimonious approach and affords greater cluster-specific factor flexibility. Additionally, a finite mixture of matrix-normal distributions (Viroli, 2011) is also fitted. This approach accounts for the grid nature of the data, but is computationally infeasible for G > 15 and requires a model selection strategy; the optimal model according to BIC yields G = 12 and ARI = 0.38. While neither IMIFA nor the DP-BP model accounts for the spatial structure in the data, they demonstrate comparable performance without the need for a computationally expensive model search.

Discussion
The proposed IMIFA model is a Bayesian nonparametric approach to clustering high-dimensional data using factor-analytic mixture models. By extending the MGP prior (Bhattacharya & Dunson, 2011) to the PYP-MGP setting, the model sidesteps the fraught and computationally intensive task of determining the optimal numbers of clusters and factors using model selection criteria. Thus, the IMIFA model is recommended when fitting factor-analytic mixtures in settings where an exhaustive model search is computationally infeasible. Though IMIFA is not entirely choice-free, it achieves improved clustering results by allowing factor-analytic models of different dimensions in different clusters. If small clusters are inferred, one may wish to prune them or merge them with the larger clusters (West et al., 1994), or assess whether the small clusters are in fact of domain-specific interest. While comparable performance can be achieved by the IMIFA and OMIFA models, one may wish to fit a MIFA or OMIFA model when the expectation is that the number of clusters is fixed or unlikely to grow with N, respectively.
Future research directions are varied and plentiful. Incorporating covariates, in the spirit of Bayesian factor regression models (West, 2003;Carvalho et al., 2008), would allow for direct inclusion of the weight and urine pH covariates available with the metabolomic data, for example. Furthermore, the models could be extended to the (semi-)supervised model-based classification setting where all (or some) of the data are labelled. While constraints on the uniquenesses across variables and/or clusters are allowed, there is scope for also constraining the loadings across clusters. Though the number of factors would no longer be cluster-specific, the common number of loadings columns would be estimated in a similarly automatic fashion. However, incorporating covariance matrix constraints in the IMIFA model family problematically reintroduces the need for model selection strategies, in order to choose between them, though the BICM criterion could feasibly be used for this purpose also.
As proposed by Bhattacharya & Dunson (2011), the MGP hyperparameters could be learned via Metropolis-Hastings, and thus also be made cluster-specific. This could help combat some difficulties identified in the simulation studies in Appendix B. For example, learning those related to local shrinkage may help when loadings are notably dense, while learning those related to column shrinkage may help in settings with many small clusters, where IMIFA struggles to adaptively truncate loadings columns. In principle, a further global shrinkage parameter σ could be added to the MGP prior to borrow information across clusters, i.e. λ_jkg | . . . ∼ N(0, φ_jkg⁻¹ τ_kg⁻¹ σ⁻¹). Alternatively, the infinite factor prior of Legramanti et al. (2020) could be employed, which decouples control over the cumulative shrinkage rate and the active loadings terms. The IBP prior (Knowles & Ghahramani, 2007; Ročková & George, 2016) could also be extended to the infinite mixture setting, as per the DP-BP model of Chen et al. (2010). Thus, the IMIFA family can in fact be considered to be wider than the range of models presented here, with potential for further expansion.
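A sketch of drawing a loadings matrix from the basic MGP prior (without the additional global σ term mooted above); the hyperparameter values are illustrative, not the package defaults:

```python
import numpy as np

def draw_mgp_loadings(p, q, nu=3.0, alpha1=2.1, alpha2=3.1, rng=None):
    """Draw a p x q loadings matrix from the multiplicative gamma process
    prior of Bhattacharya & Dunson (2011): local precisions
    phi_jk ~ Ga(nu/2, nu/2) and column precisions tau_k = prod_{h<=k} delta_h,
    with delta_1 ~ Ga(alpha1, 1) and delta_h ~ Ga(alpha2, 1) for h >= 2."""
    rng = rng or np.random.default_rng()
    phi = rng.gamma(nu / 2, 2 / nu, size=(p, q))          # scale = 1/rate
    delta = np.concatenate([rng.gamma(alpha1, 1.0, 1),
                            rng.gamma(alpha2, 1.0, q - 1)])
    tau = np.cumprod(delta)                               # cumulative shrinkage
    return rng.normal(size=(p, q)) / np.sqrt(phi * tau)

L = draw_mgp_loadings(50, 8, rng=np.random.default_rng(7))
print(L.shape)  # (50, 8); later columns are increasingly shrunk towards zero
```

With alpha2 > 1, the column precisions tau_k grow stochastically with k, so higher-indexed columns concentrate near zero, which is what allows the adaptive Gibbs sampler to truncate redundant factors.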
For applied problems, a mismatch between the assumed model and the data distribution will impact inference. Miller & Harrison (2013) highlight that posterior consistency for the number of non-empty clusters in infinite mixtures is contingent on correct specification of the component distributions. While they do not discourage the use of infinite mixtures for clustering, they show that a few tiny extra clusters are typically fitted and suggest robustifying inference. If the data distribution is close to, but not exactly, a finite mixture of Gaussians, an infinite Gaussian mixture will introduce more components as the amount of data increases. Potential avenues of exploration thus include considering the IMIFA model with the heavy-tailed multivariate t-distribution. Similarly, modelling of complex component distributions can be achieved by considering the MFMA approach in the context of infinite factor models. Defining robust inference functions as in Lee & MacEachern (2014) or using nonparametric unimodal component distributions as in Rodriguez & Walker (2014) may also prove fruitful. Another means of robustifying inference is to explicitly include a noise component with zero factors, to capture outliers which depart from the component multivariate normality assumption. Finally, a 'coarsened' posterior (Miller & Dunson, 2018) could be used to address misspecification, by conditioning on the event that the model generates data close to the observed data in a distributional sense.

Appendices A Posterior Conditional Distributions:
Technical details for sampling from the IMIFA model. The structure of the Metropolis-within-Gibbs sampler used to conduct inference for the IMIFA model, and the exact forms of the required conditional distributions, are detailed below. Note that Ga(α, β) refers throughout to the gamma distribution with mean α/β. The number of observations in a component is denoted by n_g, where n = (n_1, . . . , n_G) sums to N, G is the current number of active components (of which some may be empty), and q_g is the current sample of the cluster-specific number of active factors. Algorithms for sampling other models in the IMIFA family can all be considered as special cases of what follows. The algorithm is implemented in the associated R package IMIFA (Murphy et al., 2021).
For the IMIFA and IMFA models, the sampler need only find the maximum over, and only draw Gumbel noise for, those log-probabilities for which the indicator function in (3) evaluates to 1. Sampling the parameters of the PYP for non-zero d values requires the introduction of Metropolis-Hastings steps within the Gibbs sampler. A joint hyperprior of the form p(α, d) = p(d) p(α | d) is assumed, as per Jara et al. (2010). Firstly, the hyperprior for the discount parameter d is similar to the one assumed by Carmona et al. (2019): a mixture of a point-mass at zero and a continuous beta distribution, in order to consider the DP special case with d = 0 with positive probability, i.e. d ∼ κδ0 + (1 − κ) Beta(a, b). This facilitates explicit comparison between DP models and encompassing PYP alternatives. Secondly, the hyperprior for α is given conditionally on d, s.t. (α | d) ∼ Ga(α + d; a, b), which incorporates the constraint α > −d by shifting the support of the gamma density to the interval (−d, ∞); choosing a large b value is particularly relevant as it encourages clustering (Müller & Mitra, 2013).
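The restricted Gumbel-max step mentioned above can be sketched as follows, where `active` plays the role of the slice indicator in (3) and the log-probabilities are hypothetical:

```python
import numpy as np

def gumbel_max_sample(log_probs, active, rng):
    """Draw a component index from the categorical distribution proportional
    to exp(log_probs), restricted to the active set, via the Gumbel-max
    trick: Gumbel noise is drawn only for active log-probabilities and the
    argmax of the noisy values is returned."""
    noisy = np.full(len(log_probs), -np.inf)
    idx = np.flatnonzero(active)
    noisy[idx] = np.asarray(log_probs)[idx] + rng.gumbel(size=len(idx))
    return int(np.argmax(noisy))

rng = np.random.default_rng(0)
log_probs = np.log([0.1, 0.6, 0.3])
active = np.array([True, True, False])   # slice indicator: component 3 excluded
draws = [gumbel_max_sample(log_probs, active, rng) for _ in range(5000)]
freq = np.mean(np.array(draws) == 1)
print(round(freq, 2))  # close to 0.6 / (0.1 + 0.6), i.e. about 0.86
```

Restricting both the noise draws and the maximisation to the active set is what makes the slice sampler's allocation step cheap when only a few components fall inside the slice.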
The likelihood for α and d is given by the exchangeable partition probability function induced by the PYP (Pitman, 1995), from which the required conditional posterior distributions in (4) and (5) follow. Sampling from the distributions in (4) and (5), while always respecting the support α > −d, proceeds as per Carmona et al. (2019): a Metropolis-Hastings step is implemented for the discount parameter with independent proposal distribution 0.5δ0 + 0.5 Beta(d; 1, 1), and a random walk Metropolis-Hastings step with proposal distribution α′ | α ∼ Unif(α − ζ, α + ζ) is implemented for the concentration parameter, where ζ (= 2 in our implementation) is used to control the acceptance rate. For d, the mutation rate is considered rather than the acceptance rate, whereby a move is only considered accepted if the proposal differs from the current value. However, when the DP prior is assumed, or when the sampled value of d is exactly zero under the PYP prior, α is updated according to the auxiliary variable routine of West (1992), with Gibbs updates by simulation from a weighted mixture of two gamma distributions, via

(α | χ, G0) ∼ ω_χ Ga(a + G0, b − ln χ) + (1 − ω_χ) Ga(a + G0 − 1, b − ln χ),

where G0 denotes the current number of non-empty clusters, the auxiliary variable (χ | α, G0) ∼ Beta(α + 1, N), and the mixing weights ω_χ are defined by ω_χ / (1 − ω_χ) = (a + G0 − 1) / {N (b − ln χ)}.

The complementary label switching moves of Papaspiliopoulos & Roberts (2008), which are effective at swapping similar and unequal clusters, respectively, are also incorporated. Firstly, the labels of two randomly chosen non-empty clusters g and h are swapped with probability min{1, (π_h / π_g)^(n_g − n_h)}.
Secondly, the labels of neighbouring active components l and l + 1 are swapped with probability min{1, (1 − υ_{l+1})^{n_l} / (1 − υ_l)^{n_{l+1}}}; if accepted, υ_l and υ_{l+1} are also swapped. Cluster-specific parameters are reordered accordingly after each accepted move. Finally, for updating α under the sparse finite OMIFA or OMFA models, a random walk Metropolis-Hastings step is implemented with a Gaussian proposal distribution.
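The auxiliary-variable Gibbs update for α used in the DP case, described above, can be sketched as follows; the Ga(a, b) hyperparameter values here are illustrative:

```python
import numpy as np

def update_alpha(alpha, G0, N, a=2.0, b=4.0, rng=None):
    """One Gibbs update for the DP concentration parameter alpha under an
    alpha ~ Ga(a, b) hyperprior (West, 1992): draw the auxiliary variable
    chi ~ Beta(alpha + 1, N), then sample alpha from a two-component
    mixture of gamma distributions."""
    rng = rng or np.random.default_rng()
    chi = rng.beta(alpha + 1, N)                    # auxiliary variable
    odds = (a + G0 - 1) / (N * (b - np.log(chi)))   # omega_chi / (1 - omega_chi)
    shape = a + G0 if rng.random() < odds / (1 + odds) else a + G0 - 1
    return rng.gamma(shape, 1.0 / (b - np.log(chi)))

rng = np.random.default_rng(1)
alpha, trace = 1.0, []
for _ in range(1000):
    alpha = update_alpha(alpha, G0=4, N=572, rng=rng)
    trace.append(alpha)
print(min(trace) > 0)  # alpha stays on its positive support
```

Since ln χ < 0, the rate b − ln χ exceeds b, so both mixture components are proper gamma distributions and every draw is positive.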

B Simulation Studies
The performance of the novel IMIFA model with its PYP-MGP priors, in terms of inferring both the number of clusters and the cluster-specific numbers of factors, is assessed here through simulation studies. Section B.1 explores sensitivity to the PYP parameters in a range of dimensionality scenarios, with balanced cluster sizes and a common number of factors. The simulation study in Section B.2 is more challenging; a larger number of clusters (many of which are small) are simulated for N < p data, with different numbers of cluster-specific factors (some of which are large). The final simulation study in Section B.3 mirrors the design in Section B.2, only here the true Λ g matrices used to generate the data are sparse.

B.1 Simulation Study 1
Firstly, data with G = 3 clusters and p = 50 variables are simulated with q_g = 4 ∀ g, and with π = (1/3, 1/3, 1/3) so that clusters are roughly equally sized. Other model parameters are simulated with η_i ∼ N_q(0, I_q), ψ_jg ∼ IG(2, 1), and Λ_jg ∼ N_q(0, I_q). Notably, loadings are not drawn from the MGP prior (Bhattacharya & Dunson, 2011) underpinning the IMIFA model. To ensure clusters are reasonably closely located, µ_g ∼ N_p((2g − G − 1)1, I_p). The data are then simulated according to the conditional mixture model x_i | z_ig = 1 ∼ N_p(µ_g + Λ_g η_i, Ψ_g). To evaluate performance in different settings, sample sizes less than, equal to, and greater than p are considered, i.e. N = 25, 50, and 300. Sensitivity to the PYP and DP parameters is explored firstly by assuming a DP prior with various values of α less than, equal to, and greater than 1, and by allowing α to be learned as per West (1992), and secondly by incorporating Metropolis-Hastings steps to learn both α and d, assuming a PYP prior.
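The generative scheme just described can be sketched as below; the helper is a hypothetical illustration, not the simulation code used for the study:

```python
import numpy as np

def simulate_mfa(N, p, q, G, rng=None):
    """Simulate from the conditional mixture-of-factor-analysers model
    x_i | z_ig = 1 ~ N_p(mu_g + Lambda_g eta_i, Psi_g), with dense Gaussian
    loadings (not the MGP prior) and psi_jg ~ IG(2, 1)."""
    rng = rng or np.random.default_rng()
    z = rng.integers(G, size=N)              # roughly equal mixing proportions
    # Cluster means centred at (2g - G - 1)1 for g = 1..G, drawn with unit noise.
    mu = np.stack([(2 * g - G + 1) * np.ones(p) + rng.normal(size=p)
                   for g in range(G)])       # g here is 0-indexed
    Lam = rng.normal(size=(G, p, q))
    psi = 1.0 / rng.gamma(2.0, 1.0, size=(G, p))   # inverse-gamma uniquenesses
    eta = rng.normal(size=(N, q))                  # latent factor scores
    eps = rng.normal(size=(N, p)) * np.sqrt(psi[z])
    X = mu[z] + np.einsum('npq,nq->np', Lam[z], eta) + eps
    return X, z

X, z = simulate_mfa(N=300, p=50, q=4, G=3, rng=np.random.default_rng(42))
print(X.shape)  # (300, 50)
```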
Results, provided in Table B.1, are based on 10 replicate data sets, standardised prior to model fitting, for each scenario. MCMC chains were run for 25,000 iterations, with every 2nd sample retained and the first 20% of iterations discarded as burn-in. Cluster labels were initialised using mclust (Scrucca et al., 2016), as hierarchical clustering gave poor, heavily imbalanced starting values. As the cluster-specific Λ_g and Ψ_g parameters could still induce separation among clusters, pairwise scatterplots from one randomly chosen raw replicate data set under the N > p scenario are shown in Figure B.1 to demonstrate the extent of the overlap; for visual clarity, only 5 randomly chosen variables are depicted.

Table B.1: Aggregated simulation study results for the IMIFA model under different dimensionality scenarios and settings of the concentration and discount parameters α and d (posterior mean estimates thereof in parentheses where appropriate). The modal estimates of G and associated estimates of q_g ∀ g are reported (with 95% credible intervals in brackets). Clustering performance is assessed through the average percentage error rate against the known cluster labels.

Table B.1 clearly demonstrates that the IMIFA model performs well overall for these data, exhibiting capability to uncover the structure within the simulated data sets regardless of dimensionality. The modal estimate of G is equal to the truth in all cases, with only the N < p, α = 5 scenario showing some deviation in the 95% credible interval. Perhaps surprisingly, given the closeness of the cluster means, and the equality of the clusters in terms of their mixing proportions and numbers of factors, G is never underestimated. Indeed, clustering performance is mostly perfect. Furthermore, in every case, the true value of q_g = 4 is within the limits of the associated credible intervals, which intuitively become narrower as more data accumulate.
While the modal estimates q̂_g are consistently greater than the truth throughout Table B.1, overestimation should be preferred to underestimation; a less parsimonious model which nevertheless fits well and uncovers the true clustering structure is better than one which loses information and fits poorly due to having too few factors. Recall that the loadings were drawn from a standard multivariate Gaussian, rather than the MGP prior underpinning the IMIFA model, i.e. entries in the true Λ_g matrices did not shrink with the column index, nor were the loadings sparse. Thus, there is evidence to suggest the model is liable to overestimate the number of factors when the Λ_g matrices, and by extension the cluster-specific marginal covariance matrices, are dense. This is explored further in the subsequent simulation studies.

Figure B.1: Pairwise scatterplots of 5 randomly chosen variables from a replicate data set under the N > p scenario in Table B.1, demonstrating the overlap between the 3 clusters.

B.2 Simulation Study 2
Results of a more challenging simulation study are presented in Figure B.3; here, N < p data (N = 200, p = 250) are simulated with a large number of clusters and uncommon numbers of cluster-specific factors. In particular, many of the G = 10 clusters are small (a setting often studied in Bayesian nonparametric modelling), with π = (0.25, 0.2, 0.15, 0.1, 0.05, . . . , 0.05). The numbers of factors q_1, . . . , q_G are drawn randomly from 0, . . . , min(15, n_g − 1), where the upper limit ensures that no cluster has more factors than observations. Otherwise, the same parameter settings as in Simulation Study 1 (Section B.1) were used to generate the data.
Results of fitting an IMIFA model assuming a PYP prior, allowing both α and d to be learned, and otherwise using the same sampler settings as in Section B.1 above, are given for 5 replicates of this scenario, with the π vector ordered randomly for each data set. To demonstrate the extent of the challenge these settings represent, pairwise scatterplots are again shown for 5 randomly chosen variables for the first replicate data set, in Figure B.2. Figure B.3 shows that the model over-estimates the number of clusters, though in some cases the ARI values are nonetheless quite good, as the larger clusters are generally uncovered well. However, the smaller clusters are further divided, albeit cleanly, into smaller sub-clusters with, in some cases, just 1 or 2 units inside. In these cases, the modal q̂_g estimates are close or equal to the upper limit of the adaptive Gibbs sampler (3 ln(p)), and hence greater than the corresponding estimated cluster sizes n̂_g. Thus, there is evidence that the model has difficulty in adaptively shrinking the Λ_g matrices when there are many clusters with few units.

Figure B.3: Barplots of the true number of cluster-specific factors q_g (left) and estimates q̂_g (right) for each replicate data set and corresponding fitted IMIFA model comprising Simulation Study 2. Bars are sorted in descending order of n_g and n̂_g, respectively, and labelled above with these true and estimated cluster sizes. The plots on the left are also labelled below with the cluster indices. Vertical red lines in the plots on the right show 95% credible intervals for q_g. Modal G estimates (with 95% credible intervals in brackets), ARI values, and posterior mean estimates α̂ and d̂ are given for each replicate.

B.3 Simulation Study 3
In both previous simulation studies, the true loadings were dense, having been drawn from a standard multivariate Gaussian rather than the MGP prior underpinning the model. The design of this final simulation study exactly mirrors the parameter and sampler settings used in Section B.2, with the sole exception that, as per the simulation study design in Bhattacharya & Dunson (2011), the true loadings matrices used to generate the data are sparse. Specifically, the number of non-zero loadings in each Λ_g matrix begins at p in column 1 and successively decays by 10% for each subsequent column. The locations of the zeros in each column are allocated randomly, and non-zero elements are drawn from a standard Gaussian. Again, pairwise scatterplots are shown for 5 randomly chosen variables for the first of the five replicate data sets in Figure B.4, to demonstrate the extent of the overlap between clusters. Results are presented in Figure B.5. Performance is comparable to the results of Simulation Study 2, in the sense that, again, the number of clusters is over-estimated; ARI values are nonetheless acceptable. Small clusters are divided into even smaller sub-clusters for which the model struggles to adaptively shrink the number of factors. The comparability of these results suggests that performance is being driven not by whether the loadings used to generate the data exhibit increasing levels of sparsity across columns, in line with the MGP prior underpinning the model, but by the presence of many small clusters.
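The sparse loadings construction just described (p non-zero entries in column 1, decaying by 10% per column, with zero locations chosen at random) can be sketched as:

```python
import numpy as np

def sparse_loadings(p, q, decay=0.9, rng=None):
    """Build a p x q loadings matrix whose number of non-zero entries starts
    at p in column 1 and decays by 10% per subsequent column; zero locations
    are random and non-zero entries are standard Gaussian draws."""
    rng = rng or np.random.default_rng()
    Lam = np.zeros((p, q))
    for k in range(q):
        n_nonzero = int(round(p * decay ** k))
        idx = rng.choice(p, size=n_nonzero, replace=False)
        Lam[idx, k] = rng.normal(size=n_nonzero)
    return Lam

Lam = sparse_loadings(p=250, q=6, rng=np.random.default_rng(3))
print((Lam[:, 0] != 0).sum())  # 250 non-zero entries in column 1
print((Lam[:, 5] != 0).sum())  # 148, i.e. round(250 * 0.9^5)
```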
The over-estimation of q_g in the small clusters in Simulation Studies 2 and 3 suggests that the hyperparameters α_1 and α_2 governing the MGP column shrinkage parameters may need to be higher in mixture settings, to enforce a greater degree of shrinkage: there will be fewer data in each cluster from which local and global shrinkage parameters can be learned, compared to fitting an IFA model to the full data set. Introducing Metropolis-Hastings steps to allow these hyperparameters to be cluster-specific and learned from the data, rather than fixed, may also help in this regard.

Figure B.5: Barplots of the true number of cluster-specific factors q_g (left) and estimates q̂_g (right) for each replicate data set and corresponding fitted IMIFA model comprising Simulation Study 3. Bars are sorted in descending order of n_g and n̂_g, respectively, and labelled above with these true and estimated cluster sizes. The plots on the left are also labelled below with the cluster indices. Vertical red lines in the plots on the right show 95% credible intervals for q_g. Modal G estimates (with 95% credible intervals in brackets), ARI values, and posterior mean estimates α̂ and d̂ are given for each replicate.

C Assessing Robustness of the IMIFA Model
To assess the robustness of the IMIFA model, N(0, 1) noise with no clustering information was appended separately to the rows and columns of the olive oil data set. Six new scenarios were generated, with 10, 50, and 100 extra variables and the same numbers of extra observations. Cluster validity is evaluated in Table C.1 with respect to the 4 area relabelling in Table 3b; in the case of extra observations, noise observations are labelled as though they belong to a fifth cluster. Data were mean-centred and unit-scaled only after expansion. As the number of irrelevant variables increases, the clustering structure can still be uncovered quite well; however, mixing becomes slower and there is increasing support for clusters with only one or no factors as the signal-to-noise ratio decreases. As such, variable selection, or at least data pre-processing, may still be required. As rows of noise are appended, IMIFA generally has no difficulty in assigning these observations to a cluster of their own. Interestingly, clusters corresponding to noise observations correctly require no latent factor structure.
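A sketch of the data-augmentation design (noise appended first, standardisation only afterwards); the function name, defaults, and data dimensions are hypothetical:

```python
import numpy as np

def augment_with_noise(X, extra_vars=0, extra_obs=0, rng=None):
    """Append N(0, 1) noise columns (irrelevant variables) and/or noise rows
    (uninformative observations) to a data matrix, then mean-centre and
    unit-scale only after expansion, mirroring the robustness experiment."""
    rng = rng or np.random.default_rng()
    N, p = X.shape
    if extra_vars:
        X = np.hstack([X, rng.normal(size=(N, extra_vars))])
    if extra_obs:
        X = np.vstack([X, rng.normal(size=(extra_obs, X.shape[1]))])
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.random.default_rng(0).normal(size=(572, 8))  # stand-in for the olive oil data
Xv = augment_with_noise(X, extra_vars=50)
print(Xv.shape)  # (572, 58)
```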

D Additional Results and Visualisations
In this section, some additional visualisations of the results of the illustrative applications are provided. Specifically, more details are given on the posterior predictive model fit assessment and on the observation-specific cluster membership uncertainties. All plots were produced using the associated R package IMIFA (Murphy et al., 2021). The Posterior Predictive Reconstruction Error (PPRE) has been proposed as a posterior predictive checking strategy for models in the IMIFA family. In short, this involves computing the standardised Frobenius norm of the difference between a matrix of histogram bin counts for the modelled data set and similar matrices constructed from replicate data drawn from the posterior predictive distribution. While the median PPRE value, or boxplots of the distribution of PPRE values, has been shown to yield a useful global measure of model fit in multivariate settings, the histograms themselves can also be studied on a variable-by-variable basis.
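The PPRE computation for a single posterior predictive replicate can be sketched as below. The general recipe (per-variable histogram bin counts, Frobenius norm of the difference, standardisation) follows the description above; the choice of 20 bins, the use of the observed data's bin edges for the replicate, and standardising by the Frobenius norm of the observed counts are assumptions of this sketch rather than details confirmed by the text.

```python
import numpy as np

def ppre(X_obs, X_rep, n_bins=20):
    """Sketch of the posterior predictive reconstruction error for one
    replicate: build an (n_bins x p) matrix of histogram bin counts per
    variable for the observed data and for the replicate (reusing the
    observed data's bin edges), then return the Frobenius norm of the
    difference, standardised by the norm of the observed counts."""
    p = X_obs.shape[1]
    obs = np.empty((n_bins, p))
    rep = np.empty((n_bins, p))
    for j in range(p):
        counts, edges = np.histogram(X_obs[:, j], bins=n_bins)
        obs[:, j] = counts
        rep[:, j], _ = np.histogram(X_rep[:, j], bins=edges)
    return np.linalg.norm(obs - rep) / np.linalg.norm(obs)

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 8))
X_good = rng.standard_normal((500, 8))      # well-fitting replicate
X_bad = 3.0 * rng.standard_normal((500, 8)) # poorly-fitting replicate
print(ppre(X, X_good) < ppre(X, X_bad))  # True
```

In practice the statistic is computed for many replicates, one per retained posterior draw, and it is the resulting distribution of PPRE values that is summarised by its median or by boxplots.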
In high-dimensional settings, such as the spectral metabolomic (p = 189) and USPS digits (p = 256) data sets, it is only feasible to examine the histograms for a subset of the variables. Nonetheless, the global median PPRE measures for these data sets are quite good (0.21 and 0.05, respectively). Hence, Figure D.1 shows only the histograms comparing bin counts for the p = 8 variables in the standardised Italian olive oil data, to which an IMIFA model was fitted, against corresponding counts for the replicate data under the fitted IMIFA model. The true bin counts are within the 95% credible intervals of the replicate data bin counts in the vast majority of cases, indicating good model fit: recall that this IMIFA model achieves a median PPRE of just 0.10. The IMIFA model fitted to the USPS digits data set uncovers G = 21 clusters. Regarding the uncertainty in the allocations to these clusters, the model-based nature of IMIFA facilitates estimation of the uncertainty with which observation i is assigned to its cluster via U_i = 1 − max_{g ∈ {1, …, G}} ẑ_ig, where ẑ_ig is the estimated probability that observation i belongs to cluster g. Figure D.2 shows that the observation-specific cluster membership uncertainties are generally quite low, with the mean uncertainty being just 0.02 and 92% of observations being assigned with uncertainty less than 1/G. A similar plot for the olive oil data is shown in the main text (Figure 3); uncertainties for the spectral metabolomic data are not shown, as there was no uncertainty in the assignments under the fitted IMIFA model (i.e. U_i = 0 ∀ i = 1, …, N).
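The uncertainty U_i = 1 − max_g ẑ_ig is a one-line computation given the (N × G) matrix of estimated posterior membership probabilities; a minimal sketch with made-up probabilities:

```python
import numpy as np

def cluster_uncertainty(Z):
    """Observation-specific clustering uncertainty U_i = 1 - max_g z_ig,
    given an (N x G) matrix Z of estimated posterior cluster membership
    probabilities (each row sums to 1)."""
    return 1.0 - Z.max(axis=1)

# Illustrative probabilities: one confident, one ambiguous assignment.
Z = np.array([[0.98, 0.01, 0.01],
              [0.40, 0.35, 0.25]])
U = cluster_uncertainty(Z)
print(U)  # the first observation has U_i = 0.02, the second U_i = 0.60
```

U_i ranges from 0 (assignment is certain) to 1 − 1/G (all clusters equally likely), which is why 1/G serves as a natural reference threshold in the text above.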
Figure D.2: Observation-specific cluster membership uncertainties for the USPS digits data, with observations shown in order of increasing uncertainty.