On Posterior Consistency of Bayesian Factor Models in High Dimensions

As a principled dimension reduction technique, factor models have been widely adopted in social science, economics, bioinformatics, and many other fields. However, in high-dimensional settings, conducting a "correct" Bayesian factor analysis can be subtle since it requires both a careful prescription of the prior distribution and a suitable computational strategy. In particular, we analyze the issues that arise from attempting to be "noninformative" for elements of the factor loading matrix, especially for sparse Bayesian factor models in high dimensions, and propose solutions to them. We show why adopting the orthogonal factor assumption is appropriate and can result in consistent posterior inference of the loading matrix, conditional on the true idiosyncratic variance and the allocation of nonzero elements in the true loading matrix. We also provide an efficient Gibbs sampler to conduct the full posterior inference based on the prior setup of Ročková and George (2016) and a uniform orthogonal factor assumption on the factor matrix.


Introduction
Factor models have been widely adopted in social science, economics, bioinformatics, and many other fields that need interpretable dimension reduction for their data. They serve as a formal way to encode high-dimensional observations as a linear combination of a few latent factors plus idiosyncratic errors, which accommodates intuitive interpretations and can sometimes be further validated by additional supplemental knowledge. In this article, we consider the following standard parametric formulation: each G-dimensional vector observation y_i (e.g., daily returns of ∼3000 U.S. stocks) is assumed to be linearly related to a K-dimensional vector of latent factors ω_i (e.g., 20 market factors) through a tall, skinny factor loading matrix B:

y_i = Bω_i + ε_i,  ε_i ∼ N_G(0, Σ),   (1)

and the idiosyncratic variance matrix Σ is assumed to be diagonal, as in the literature. In matrix form, we denote the observations as Y = (y_1, ..., y_n), which is a G × n matrix, and the factors as a K × n matrix Ω = (ω_1, ..., ω_n). The factors are usually assumed to be independently and normally distributed: ω_i ∼ N_K(0, I_K).
People are often interested in estimating the loading matrix B in order to gain insight into the correlation structure of the observations. Marginalizing out ω_i, we obtain the relationship [y_i | B, Σ] ∼ N_G(0, BB^T + Σ), implying that the loading matrix B is only identifiable up to a right orthogonal transformation (rotational invariance). It is thus rather difficult to pinpoint the factor loading matrix consistently, to determine the dimensionality of the latent factors, or to design efficient algorithms for a proper full Bayesian analysis of the model.
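Since this rotational invariance drives much of what follows, a minimal numerical check may help (our own illustration, not from the paper; the dimensions are arbitrary): replacing B by BR for any orthogonal R leaves the marginal covariance BB^T + Σ unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
G, K = 10, 3

# A loading matrix and a random K x K orthogonal matrix R (via QR).
B = rng.standard_normal((G, K))
R, _ = np.linalg.qr(rng.standard_normal((K, K)))

Sigma = np.eye(G)  # diagonal idiosyncratic variance

# B and BR produce the identical marginal covariance BB^T + Sigma,
# so B is identifiable at best up to right orthogonal transformations.
cov1 = B @ B.T + Sigma
cov2 = (B @ R) @ (B @ R).T + Sigma
print(np.allclose(cov1, cov2))
```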
In recent years, researchers have started to study the effects of a sparsity assumption on factor loadings, which aids the interpretability of the model and helps with model identifiability. Considerable progress has been made in the realm of sparse Bayesian factor analysis, such as Fruehwirth-Schnatter and Lopes (2018) and Ročková and George (2016), two representative approaches using hierarchical continuous or discrete spike-and-slab (SpSL) priors (i.e., a mixture of a concentrated distribution, which can be either continuous with a small variance or a point mass, and a diffuse distribution) to represent the sparsity of the factor loading matrix. Identifiability issues of sparse factor models are formally discussed in Fruehwirth-Schnatter and Lopes (2018), who also designed an efficient Markov chain Monte Carlo (MCMC) procedure to simulate from the posterior distribution of an over-parameterized sparse factor model under the discrete SpSL prior. Ročková and George (2016) proposed a sparse Bayesian factor analysis framework assuming independent (conditioned on the feature allocation) continuous SpSL priors on the loading matrix's elements, under which a fast posterior mode-detecting strategy was developed.
Motivated by Ročková and George (2016), we consider full Bayesian inference under their prior formulation. Although their simulation studies show good consistency (up to trivial rotations) of the maximum a posteriori (MAP) estimate of the loading matrix in various large-G, large-n scenarios, we found that the corresponding Wald-type consistency for the posterior distribution requires n to diverge at a faster rate than s, besides other numerical conditions on the true loading matrix that are generally required for justifying posterior contraction (Pati et al., 2014). Here s is the average number of nonzero elements in each column of the loading matrix B and is usually much smaller than G.
When s ≥ n but s is still much smaller than G, which is not unusual in practice, we observed from simulations a 'magnitude inflation' phenomenon when independent SpSL priors were employed for elements of the loading matrix. That is, posterior samples of the loading matrix are inflated in matrix norm compared to the data-generating loading matrix, and the extent of inflation is related to the variance of the slab part of the SpSL prior: the more diffuse the slab prior, the more inflation we observe.
The reason for this inflation phenomenon is not immediately obvious, since the total number of observed quantities is nG, corresponding to n observed G-dimensional vectors y_i, i = 1, ..., n, which is often much larger than s × K, the number of nonzero elements in the loading matrix. Consider a special case with K = 1, G = s, and Σ = I_G known. Then ω_i for i = 1, ..., n is a scalar, and B = (b_1, ..., b_G)^T is a G-dimensional vector. Thus, each component y_ij of y_i can be written as y_ij = ω_i b_j + e_ij, with e_ij i.i.d. ∼ N(0, 1).
Although the total number of unknown parameters in the model is G + n, the number of independent scalar observations y_ij is n × G, much larger than G + n. The model is unidentifiable because ω_i × b_j = (ω_i/c) × (b_j c) for any c ≠ 0. Requiring that the ω_i i.i.d. ∼ N(0, 1), i = 1, ..., n, can indeed alleviate the identifiability issue, but is not enough to "tie down" the b_j's in the posterior distribution if there are too many of them, which manifests itself in the inflation phenomenon. But how many is "too many"? In this simple example, there are "too many" if G ≥ n (Section 4). More generally, our later theoretical analysis shows that if the column-average number s of nonzero elements of B is no smaller than n, the inflation provably happens, although we observed empirically that the inflation already occurs when s ∼ n. An apparent remedy revealed by the above intuition and our later analysis is to further restrict the ω_i's, such as requiring that ∑_{i=1}^n ω_i² = n. More generally speaking, due to the nearly non-identifiable structure of model (1), an overdose of independent diffuse priors on loading matrix elements dilutes the signal from the data. Problems with the use of diffuse priors in Bayesian inference when observation sample sizes are small relative to the number of parameters being estimated have been noted and studied in the literature (Efron, 1973; Kass and Wasserman, 1996; Natarajan and McCulloch, 1998). This problem for Bayesian factor analysis was also noted by Ghosh and Dunson (2009), who proposed a practical solution without further theoretical investigation. The Ghosh-Dunson model allows each factor to have an unknown variance that follows an inverse-Gamma prior and imposes more restrictive standard Gaussian priors on the loading matrix's elements. But the consistency of the Ghosh-Dunson model in high-dimensional settings remains to be justified.
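The K = 1 toy model above, together with the suggested constraint ∑_{i=1}^n ω_i² = n, can be simulated directly. The following sketch is our own illustration, not the paper's experiment: it runs a two-block Gibbs sampler with a flat prior on b (whose posterior is improper when G ≥ n, so the unconstrained chain drifts to ever-larger loadings), with and without rescaling the factors onto the sphere ∑ω_i² = n.

```python
import numpy as np

rng = np.random.default_rng(1)
G, n = 200, 20            # s = G >= n: the inflation regime
b0 = np.ones(G)           # true loading vector (K = 1)
w0 = rng.standard_normal(n)
Y = np.outer(b0, w0) + rng.standard_normal((G, n))  # G x n data, Sigma = I_G

def gibbs(n_iter, rescale):
    """Gibbs sampler for the K = 1 toy model with a flat prior on b.
    If rescale=True, project the factors onto sum_i w_i^2 = n."""
    b, w = b0.copy(), w0.copy()
    for _ in range(n_iter):
        # w_i | b, y_i ~ N(b^T y_i / (1 + |b|^2), 1 / (1 + |b|^2))
        prec = 1.0 + b @ b
        w = (b @ Y) / prec + rng.standard_normal(n) / np.sqrt(prec)
        if rescale:
            w *= np.sqrt(n / (w @ w))
        # b_j | w, y_j. ~ N(w^T y_j. / |w|^2, 1 / |w|^2)   (flat prior on b_j)
        prec = w @ w
        b = (Y @ w) / prec + rng.standard_normal(G) / np.sqrt(prec)
    return np.linalg.norm(b)

r_flat = gibbs(300, rescale=False) / np.linalg.norm(b0)  # inflates, >> 1
r_cons = gibbs(300, rescale=True) / np.linalg.norm(b0)   # stays near 1
print(r_flat, r_cons)
```

The unconstrained chain multiplies the loading norm by roughly √(1 + 2/n) per sweep, while the rescaled chain stays near the truth, in line with the remedy discussed above.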
In this article, we study asymptotic behaviors of the posterior distribution when an independent SpSL prior is employed for elements of the loading matrix and a right-rotationally invariant distribution is assumed on the factor matrix Ω (i.e., Ω and ΩR follow the same distribution for every n × n orthogonal matrix R; this is different from the left-rotational invariance that makes B nonidentifiable). All consistency and convergence concepts in our work are in the frequentist (repeated-sampling) sense. Take the loading matrix for example. If for any open neighborhood N of an entry of the true loading matrix (the magnitudes of entries are of constant order), the probability for a random draw from the posterior distribution of that entry to fall in N, as a function of the data in the repeated-sampling sense, converges to 1 almost surely as n and G go to infinity, we say that the posterior inference of the loading matrix is consistent, or simply that "the posterior sample of the loading matrix converges to the truth."
We show theoretically that the observed inflation of the posterior distribution is due to the weak control of ΩΩ^T/n (more specifically, of the singular values of ΩΩ^T/n) under only the normality assumption on the factor matrix Ω. This analysis suggests a natural solution for achieving posterior consistency of the loading matrix in Ročková and George (2016)'s framework in high dimensions: employing a stronger control over ΩΩ^T/n. More concretely, we can change the normality assumption on Ω to a uniform distribution on orthogonal matrices. That is, we let Ω/√n be uniform on the Stiefel manifold (the Stiefel manifold St(K, n) is the set of all orthonormal K-frames in R^n) or, equivalently, be the first K rows of a Haar-distributed random orthogonal matrix (there exists a unique left- and right-invariant Haar measure on the set of orthogonal matrices; see Meckes (2014)). Utilizing the MAP estimate from the PXL-EM algorithm of Ročková and George (2016) as an initial value, our Gibbs sampler can efficiently sample from the full posterior distribution and shows consistent performance in data examples. This more restrictive assumption on the factor matrix allows us to conduct a full posterior analysis of the sparse Bayesian factor model under the "large s, small n" setting with justifiable posterior consistency, and thus to construct meaningful credible intervals for elements of the loading matrix as well as of the covariance matrix.
The article is structured as follows. Section 2 introduces the Bayesian factor analysis framework of Ročková and George (2016) and a corresponding basic Gibbs sampler. Under their framework, Section 3 illustrates the 'magnitude inflation' phenomenon in posterior samples of the loading matrix, and its dependence on the slab prior, through a synthetic example. Section 4 provides theoretical explanations for the inflation phenomenon. Section 5 presents our strategy for resolving the magnitude inflation with a theoretical guarantee. By revisiting the synthetic example, Section 6 numerically verifies the validity of our solution and compares it to an alternative approach, the modified Ghosh-Dunson model. Section 7 concludes with a short discussion.
2 Bayesian sparse factor model and inference

2.1 Prior settings for loading coefficient selection
In order to enhance model identifiability and interpretability, one often imposes a sparsity assumption on the loading matrix. Traditional approaches considered post-hoc rotations as well as regularization methods; see, e.g., Kaiser (1958) and Carvalho et al. (2008). By integrating these two paradigms, Ročková and George (2016) proposed a sparse Bayesian factor model framework along with a fast mode-identifying PXL-EM algorithm. In their framework, the sparsity assumption on the factor loading matrix is encoded through a hierarchical spike-and-slab prior. Let β_jk denote the (j, k)-th element of the loading matrix B. We assume that a priori the β_jk's follow a SpSL prior and are mutually independent given the hyperparameters. More precisely, we can introduce for each element a binary indicator variable γ_jk such that

π(β_jk | γ_jk) = γ_jk ψ(β_jk | λ_1) + (1 − γ_jk) ψ(β_jk | λ_0),   (2)

where ψ(β | λ) = (λ/2) exp(−λ|β|) is a Laplace distribution with λ_0 ≫ λ_1, and γ_jk | θ_k ∼ Bernoulli(θ_k) with the θ_k's following the Indian buffet process stick-breaking construction, θ_k = ∏_{l=1}^k ν_l, ν_l i.i.d. ∼ Beta(α, 1). Letting Θ = (θ_1, ..., θ_K), we note that θ_k is necessarily decreasing in k.
We call Γ = (γ_jk)_{G×K} the "feature allocation" matrix. The idiosyncratic variance matrix Σ is assumed to be diagonal with elements σ_j², and a conjugate prior is used: σ_j² ∼ Inverse-Gamma(η/2, ηε/2). Ročková and George (2016) showed in simulations that PXL-EM converges dramatically faster than the EM algorithm for finding the maximum a posteriori (MAP) estimator (i.e., the B, Σ, Θ that maximize π(B, Σ, Θ | Y)) and also demonstrated the consistency of the MAP estimator in estimating the loading matrix under the "large s, small n" setting. However, turning their method into a full Bayesian inference procedure turns out to be more subtle and challenging.
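As a concrete illustration of the SpSL prior above, the following sketch draws loadings from the Laplace spike-and-slab mixture with a fixed inclusion probability θ. This is our own illustration (the hyperparameter values are arbitrary and `spsl_draw` is a hypothetical helper, not part of any package); it checks that slab draws are far more dispersed than spike draws, since a Laplace(λ) variable has mean absolute value 1/λ.

```python
import numpy as np

rng = np.random.default_rng(2)

def spsl_draw(theta, lam0, lam1, size):
    """Draw loadings from the continuous spike-and-slab (SpSL) prior:
    beta | gamma ~ Laplace(lam1) if gamma = 1 (slab, diffuse),
                   Laplace(lam0) if gamma = 0 (spike, concentrated),
    gamma ~ Bernoulli(theta)."""
    gamma = rng.random(size) < theta
    # psi(beta | lam) = (lam/2) exp(-lam |beta|) has scale 1/lam.
    scale = np.where(gamma, 1.0 / lam1, 1.0 / lam0)
    return rng.laplace(0.0, scale), gamma

beta, gamma = spsl_draw(theta=0.1, lam0=20.0, lam1=0.1, size=100_000)
slab_mean = np.abs(beta[gamma]).mean()    # ~ 1/lam1 = 10
spike_mean = np.abs(beta[~gamma]).mean()  # ~ 1/lam0 = 0.05
print(slab_mean, spike_mean)
```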

2.2 A standard Gibbs sampling procedure
The full posterior distribution of the parameters (B, Ω, Σ, Γ, Θ) in a Bayesian factor model can be written generically as

π(B, Ω, Σ, Γ, Θ | Y) ∝ f(Y | B, Ω, Σ) p(B | Γ) p(Γ | Θ) p(Θ) p(Ω) p(Σ),   (4)

where f denotes the likelihood and p denotes the priors; Ω denotes the K × n matrix with columns ω_i, Γ denotes the G × K matrix with entries γ_jk, and Θ denotes the K-dimensional vector formed by the θ_k's. Here the observation Y is a G × n matrix with columns y_i.
A standard Gibbs sampler (Gelfand and Smith, 1990; Tanner and Wong, 1987; Liu, 2008) for sampling from the full posterior distribution (4) iteratively updates each component according to the following conditional distributions:

• Update B element-wise: the conditional density of each β_jk can be written as a mixture of two truncated normal densities, and thus can be sampled efficiently.
• Update Σ along its diagonal: σ_j² | B, Ω, Y ∼ Inverse-Gamma((η + n)/2, (ηε + |Y_j• − Ω^T B_j•|²)/2), where B_j• represents the j-th row vector of B.
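The remaining conditional updates are only summarized here; as one example, with ω_i ∼ N_K(0, I_K) a priori, the factor update takes the standard conjugate form ω_i | B, Σ, y_i ∼ N(VB^TΣ^{−1}y_i, V) with V = (B^TΣ^{−1}B + I_K)^{−1}. A generic sketch of that step (our own illustration with arbitrary dimensions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(3)
G, K, n = 50, 3, 30
B = rng.standard_normal((G, K))
sigma2 = np.ones(G)                            # diagonal of Sigma
Omega0 = rng.standard_normal((K, n))
Y = B @ Omega0 + rng.standard_normal((G, n))

# omega_i | B, Sigma, y_i ~ N(V B^T Sigma^{-1} y_i, V),
# V = (B^T Sigma^{-1} B + I_K)^{-1}, shared by all columns.
BtSi = B.T / sigma2                            # K x G: B^T Sigma^{-1}
V = np.linalg.inv(BtSi @ B + np.eye(K))        # posterior covariance
mean = V @ (BtSi @ Y)                          # K x n matrix of posterior means
Omega = mean + np.linalg.cholesky(V) @ rng.standard_normal((K, n))

# Sanity check: the posterior means track the true factors closely here.
corr = np.corrcoef(mean.ravel(), Omega0.ravel())[0, 1]
print(Omega.shape, round(corr, 2))
```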
The posterior distribution is multimodal because the likelihood function is invariant under matrix rotations (so only the sparsity prior can differentiate the modes), and the strong ties between the factor loadings and the common factors make the valleys between modes very deep. Consequently, this basic Gibbs sampler is very sticky and can only explore the neighborhood of its initial values. By initializing the sampler at an estimated mode, such as the MAP estimator from the PXL-EM algorithm, however, this sampler is a reasonable tool for exploring the local posterior behavior around the MAP. More dramatic global MCMC transition moves are needed to obtain a fully functional MCMC sampler (see Appendix A of the supplementary material).
3 The magnitude inflation phenomenon

3.1 A synthetic example
To illustrate the magnitude inflation phenomenon in high-dimensional sparse factor models, we generate a dataset from model (1) similar to that of Ročková and George (2016), which consists of n = 100 observations, G = 1956 responses, and K = 5 factors drawn from N(0, I_5). The true loading matrix is a block-diagonal matrix as shown in the leftmost sub-figure of Figure 1, where black entries correspond to 1 and blank entries correspond to 0 (thus s = 500 > n). Σ_true is set to the identity matrix. With the synthetic dataset, we use the basic Gibbs sampler from Section 2.2, with α = 1/G, η = ε = 1, λ_0 = 20, λ_1 ∈ {0.001, 0.1}, and K = 8, to explore the posterior distribution.
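The exact block layout of the true loading matrix is only shown in Figure 1; one plausible reconstruction (an assumption on our part: five overlapping blocks of 500 ones whose starting rows are 364 apart, so that the last block ends exactly at row G = 1956) can be generated as follows.

```python
import numpy as np

rng = np.random.default_rng(4)
n, G, K = 100, 1956, 5

# Overlapping blocks of 500 ones per column, start indices 364 apart:
# column k is nonzero on rows 364k, ..., 364k + 499, so s = 500 > n
# and the last block ends at row 364*4 + 500 = 1956 = G.
B0 = np.zeros((G, K))
for k in range(K):
    B0[364 * k : 364 * k + 500, k] = 1.0

Omega0 = rng.standard_normal((K, n))            # factors ~ N(0, I_5)
Y = B0 @ Omega0 + rng.standard_normal((G, n))   # Sigma_true = I_G

print((B0 != 0).sum(axis=0), Y.shape)
```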
Ten snapshots of heat-maps of |B| along a Gibbs trajectory of 100 iterations initialized at the true value are displayed in Figure 1, from which we can see that the direction of each column vector of the loading matrix is well preserved during the Gibbs iterations, whereas the absolute value of every nonzero element increases over iteration time and eventually stabilizes around a much larger value than the truth (about 4000 in our test setting with λ_1 = 0.001). As a demonstration of the inflation, Figure 2(a) displays the trace plot of log(|β_{1,1}|) with λ_1 = 0.001 and 0.1, respectively, and n = 100, which also indicates the slow convergence of the basic Gibbs sampler when a small λ_1 is used. The degree of inflation is influenced by the ratio of the number of observations n to s, the average number of nonzero elements in each column of the true factor loading matrix, as well as by the choice of independent slab priors. For example, when n is increased from 100 to 1000, the posterior samples of the loading matrix stabilize much closer to the true loading matrix.
By adding some scaling group moves (Liu and Wu, 1999; Liu and Sabatti, 2000), the convergence of the sampler towards its (inflated) stationary distribution can be greatly accelerated.

Figure 1: Heat-maps of |B| in 100 iterations from the basic Gibbs sampler. Black entries correspond to 1 and blank entries to 0. The directions of the columns of the loading matrix are well preserved throughout the Gibbs iterations.

3.2 Magnitude inflation and direction consistency
Our numerical results revealed some perplexing consequences of using independent SpSL priors for a Bayesian factor model when s ≥ n, which can be summarized as "magnitude inflation" and "direction consistency". The former means that posterior draws of the loading matrix are inflated entry-wise compared with the true loading matrix, with the inflation magnitude dependent on how diffuse the slab prior is; the latter says that the directions of the columns of posterior samples of the loading matrix still converge to the true directions as n, s → ∞. Intuitively, when the number of independent slab priors employed grows at a faster rate than the number of observations, these priors will overwhelm the signal from the data. The interesting observation is that the overdose of independent slab priors only dilutes the signal for the magnitude of the loading matrix but has little impact on the identification of its column space.
The inflation problem is a real concern in practice when posterior samples of the loading matrix are used to estimate the observation covariance structure: the low-rank part (BB^T) of the estimated covariance matrix is exaggerated to an extent depending on the selected slab prior. The traditional literature tends to dismiss the inflation problem as a consequence of not having enough observations (i.e., n too small compared to s) to guarantee posterior consistency. But this argument is inaccurate, as we show in the next sections. Furthermore, we notice that, with the same number of observations, the MAP estimator is rather precise in estimating the true loading matrix, and the directions of the columns of the loading matrix are well captured by the posterior samples, provided that the structure of the true feature allocation matrix is known, as in the synthetic example. This suggests that the data provide sufficient information for recovering the true loadings once the true feature allocation matrix is known. Thus, the magnitude inflation phenomenon must be caused by a modeling issue. In the next two sections, we provide theoretical verification of the magnitude inflation as well as a simple and provable remedy.

4 Posterior dependence on the slab prior
It is generally recognized that using an improper flat prior on elements of the loading matrix in a Bayesian factor model can be dangerous and will lead to an improper posterior distribution when G ≥ n. This is in fact not very intuitive, so we illustrate the point with a very simple example with K = 1 factor, n = 2 observations, and independent noise. Let the two vector observations be y_1 and y_2, each G-dimensional. We can write y_1 = v_1 + ε_1 and y_2 = v_2 + ε_2, with ε_i ∼ N(0, I_G), which looks very much like the standard Gaussian mean problem, with only one additional requirement: v_j = ω_j b for j = 1, 2. Here the model assumes that the factor ω_j ∼ N(0, 1), and b is a G-dimensional loading matrix (vector). Thus, marginally we have y_j | b ∼ N_G(0, bb^T + I_G). A peculiar thing is that in the standard Gaussian mean problem, if we assign flat priors to v_1 and v_2, their posterior distributions are simply N(y_1, I_G) and N(y_2, I_G), respectively, which are still proper, although they yield inadmissible estimators for v_1 and v_2 when G ≥ 3. However, with the factor model assumptions, which effectively reduce the number of parameters from 2G to G, the posterior distribution of b becomes improper if G ≥ 2 and we assign b a flat prior.
Mathematically equivalent phenomena occur even in simple univariate Gaussian mean estimation: let y ∼ N(αβ, 1). If we assume that α ∼ N(0, 1), then, under a flat prior on β, the posterior distribution of β is proportional to (β² + 1)^{−1/2} exp{−y²/(2(β² + 1))}, which is a non-integrable function and thus improper. If we instead assume a proper prior on β, its posterior distribution becomes proper, but its posterior variance relies heavily on its prior variance. A simple fix is to realize that we cannot identify both parameters simultaneously and must let α take a fixed value. These phenomena also occur for general factor models in certain settings, and our goal is to understand how these issues play out in high-dimensional factor models and whether certain intuitive remedies work both theoretically and computationally in these more complex cases.
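The dependence of the posterior spread on the prior variance in this univariate example can be checked numerically. The sketch below is our own illustration (the observed value y = 2 and the integration grid are arbitrary choices): it integrates the posterior of β under a N(0, τ²) prior on a grid and reports the posterior standard deviation as τ grows.

```python
import numpy as np

def posterior_sd(y, tau, grid=np.linspace(-200, 200, 400001)):
    """Posterior sd of beta in y ~ N(alpha*beta, 1), alpha ~ N(0, 1),
    beta ~ N(0, tau^2), via numerical integration on a grid."""
    # Marginalizing alpha: y | beta ~ N(0, beta^2 + 1).
    log_post = (-0.5 * np.log(grid**2 + 1)
                - 0.5 * y**2 / (grid**2 + 1)
                - 0.5 * grid**2 / tau**2)
    w = np.exp(log_post - log_post.max())
    w /= w.sum()
    m = (w * grid).sum()
    return np.sqrt((w * (grid - m) ** 2).sum())

# The posterior spread of beta keeps growing with the prior variance,
# reflecting the improper flat-prior limit.
sds = [posterior_sd(y=2.0, tau=t) for t in (1.0, 10.0, 100.0)]
print(sds)
```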
For the general factor model, we can similarly marginalize out the factor variables and derive the posterior distribution of the loading matrix under the flat prior:

π(B | Y, Σ) ∝ |BB^T + Σ|^{−n/2} exp{−(1/2) tr[(BB^T + Σ)^{−1} YY^T]},

where the exponential term is both upper and lower bounded by functions of Y and Σ, and the determinant term satisfies

|BB^T + Σ|^{−n/2} ≥ (||B||_F² + λ_max(Σ))^{−nK/2} λ_max(Σ)^{−(G−K)n/2},

where ||B||_F represents the Frobenius norm of B and λ_max(Σ) denotes the largest eigenvalue of Σ. When the dimension of B, which is G × K, is no smaller than n × K, π(B | Y, Σ) integrates to infinity over the complement of any bounded set in R^{G×K}, leading to an improper posterior distribution. If we impose a proper but diffuse slab prior instead of the improper flat prior on the elements of B, the posterior distribution can still be very sensitive to the variance of the slab prior, as seen in Figure 3.
To formalize this intuition for general Bayesian factor models, we provide the following theorem on the divergence of the posterior distribution of the loading matrix when a sequence of increasingly diffuse "slab" priors is used. Note that for the theorems in Section 4, we do not require Σ to be diagonal. To cover generic prior choices, we replace (2) with

π(β_jk | γ_jk) = γ_jk φ(β_jk) + (1 − γ_jk) ψ(β_jk),

where ψ denotes the spike prior density and φ denotes the slab prior density.
Theorem 4.1. Let {φ_m}_{m=1,...} be a sequence of densities such that lim_{m→∞} φ_m(β) = 0 for every β ∈ R, and suppose there exists a constant C ∈ (0, 1) such that φ_m(β) > C max_β φ_m(β) holds for every β in some non-decreasing Borel sets S_m that converge to R. Then, when s ≥ n, for any bounded set S ⊂ R^{G×K}, the posterior probability π(B ∈ S | Y, Σ, Γ, m) → 0 as m → ∞.

Theorem 4.1 partially explains the magnitude inflation and the dependence of the inflation rate on the choice of the slab prior. Let S be any fixed G × K dimensional ball. The theorem implies that the probability of a posterior sample B, conditional on Y, Σ, Γ, m, having matrix norm smaller than any constant goes to zero as we use a series of slab priors {φ_m}_{m=1,2,...} that is increasingly diffuse. In a general sense, it can also be understood as the convergence in distribution of B | Y, Σ, Γ, m towards B | Y, Σ, Γ, ∞ (the conditional posterior of B with a flat slab prior), which is a point mass at infinity when s ≥ n. For cases where B | Y, Σ, Γ, ∞ is indeed proper, e.g., when s < n or when the assumed distribution on the factors is changed, we strictly have the convergence of B | Y, Σ, Γ, m towards B | Y, Σ, Γ, ∞ in distribution, as stated in the next theorem. Therefore, if the posterior distribution of the loading matrix is proper under a flat slab prior and Bayesian consistency is justified in that situation, we have approximately the same consistency when employing a reasonably diffuse slab prior.
Theorem 4.2. Consider model (1) without the normality assumption on the factors. Let {φ_m}_{m=1,...} be a sequence of prior densities maximized at 0 such that φ_m(β)/φ_m(0) → 1 as m → ∞ for every β. Let π(B | Y, Σ, Γ, m) denote the posterior density of B under a SpSL prior on its elements, with spike density ψ and slab density φ_m, and let π(B | Y, Σ, Γ, ∞) be the one corresponding to the flat slab prior (this is appropriate since the indicator matrix Γ is conditioned on). If π(B | Y, Σ, Γ, ∞) is integrable, then B | Y, Σ, Γ, m converges to B | Y, Σ, Γ, ∞ in distribution as m → ∞.

5 Model modifications for posterior consistency
To clarify some key issues, we study the behavior of the posterior distribution of the Bayesian factor model assuming that the diagonal idiosyncratic variance matrix Σ and the feature allocation matrix Γ (for the sparse factor model) are known. In contrast to the solution provided by Ghosh and Dunson (2009), which focuses on modifying the prior, we restrict ourselves to a special class of SpSL priors for the loading matrix elements, which have a point mass at zero as the spike and a flat (limit of a sequence of increasingly diffuse distributions) slab part. This is always appropriate when considering the conditional posterior distributions given Γ. We focus on how to modify the model assumptions on the factors to achieve posterior consistency.
Notation: Let H_n denote the Haar measure (i.e., uniform distribution) on the space of n × n orthogonal matrices, and let m_n be the uniform measure on the Stiefel manifold St(K, n). Let M_i• and M_•j denote the i-th row and the j-th column of a matrix M, respectively, as column vectors, and let M_{i,j} denote the element in the i-th row and j-th column of M. M_{i1:i2} denotes the sub-matrix formed by rows i_1 to i_2, and M_{i1:i2, j1:j2} denotes the sub-matrix formed by rows i_1 to i_2 and columns j_1 to j_2. The notation M_⊥ represents an orthogonal complement (not unique) of M when M is not a square matrix; P_M(•) represents the projection mapping onto the row space of the matrix M, and P_M is the corresponding projection matrix. Let λ_max(•) and λ_min(•) denote the largest and smallest singular values of a matrix, and let λ_k(•) denote the k-th largest singular value. The L_2 norm is denoted by |•|, the Frobenius norm by ||•||_F, and the outer product by "⊗".

5.1 The basic Bayesian factor model
We show the posterior consistency of the loading matrix by first studying the posterior consistency of the factor matrix Ω (defined in Section 2.2). It is easy to see that, with a flat prior on every element of B, the joint posterior distribution of B and Ω can be written as

π(B, Ω | Y, Σ) ∝ p_Ω(Ω) ∏_{j=1}^G exp{−|Y_j• − Ω^T B_j•|²/(2σ_j²)},  B_j• | Ω, Y ind∼ N_K((ΩΩ^T)^{−1}ΩY_j•, σ_j²(ΩΩ^T)^{−1}),   (7)

where p_Ω denotes the prior distribution of Ω and "ind∼" means that the B_j•'s are mutually independent.
For this section, we no longer restrict the factors in Ω to follow the standard normal distribution, requiring only that its distribution p_Ω satisfy the following two conditions: (a) cov(ω_i) = I_K, so as to keep the marginal covariance structure of Y unchanged; (b) right rotational invariance (i.e., Ω and ΩR follow the same distribution for every n × n orthogonal matrix R). Two non-Gaussian examples are: (i) each row of Ω independently follows a uniform distribution on the √n-radius sphere; (ii) Ω/√n is uniform on the Stiefel manifold St(K, n), i.e., Ω/√n is the first K rows of a Haar-distributed n × n orthogonal random matrix. A straightforward characterization of condition (b) can be made through the LQ decomposition (the transpose of the QR decomposition). Suppose the LQ decomposition Ω = K(Ω)V(Ω) is obtained by Gram-Schmidt orthogonalization starting from the first row of Ω, resulting in a K × K lower triangular matrix K(Ω) and a K × n orthonormal matrix V(Ω). Then requirement (b) allows us to generate Ω from p_Ω by generating the pair K(Ω) and V(Ω) from two independent distributions: a marginal distribution for K(Ω) (denoted p_K) and the uniform distribution on the Stiefel manifold St(K, n) for V(Ω).
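Example (ii) and the LQ decomposition can both be realized in a few lines. The sketch below is our own illustration (note that the QR factorization of a Gaussian matrix yields a Haar-distributed frame only up to column-sign conventions, which is immaterial here): it draws Ω/√n uniformly on St(K, n) and recovers K(Ω) and V(Ω) from a QR decomposition of Ω^T.

```python
import numpy as np

rng = np.random.default_rng(5)
K, n = 3, 50

# Draw Omega/sqrt(n) uniform on the Stiefel manifold St(K, n):
# orthonormalize a Gaussian n x K matrix (QR), then transpose.
Z = rng.standard_normal((n, K))
Q, _ = np.linalg.qr(Z)              # n x K with orthonormal columns
Omega = np.sqrt(n) * Q.T            # K x n: rows orthogonal, each of norm sqrt(n)

# LQ decomposition Omega = K(Omega) V(Omega): QR of Omega^T, transposed.
Q2, R2 = np.linalg.qr(Omega.T)
Kmat, V = R2.T, Q2.T                # K(Omega) lower triangular, V orthonormal rows

print(np.allclose(Omega @ Omega.T, n * np.eye(K)),  # rows orthogonal
      np.allclose(Kmat @ V, Omega),                 # valid LQ factorization
      np.allclose(np.abs(Kmat) / np.sqrt(n), np.eye(K)))  # K(Omega)/sqrt(n) = I up to signs
```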
Using the LQ decomposition, we can rewrite expression (7) in terms of K(Ω) and V(Ω); after integrating out B, the marginal posterior factorizes into a term (9) involving K(Ω) only, with |K(Ω)K(Ω)^T|^{−G/2} as its leading factor, and a term (10) involving V(Ω) only through |P_{V(Ω)}(Y_j•)|², the squared length of the projection of Y_j• onto the row space of Ω. Therefore, K(Ω) and V(Ω) are independent a posteriori. Equation (9) implies that K(Ω) may have an improper posterior distribution, because the likelihood term |K(Ω)K(Ω)^T|^{−G/2} creates "attractors" where the determinant of K(Ω)K(Ω)^T is close to 0: with large enough G, the right-hand side of (9) explodes to infinity fast enough around those attractors to become non-integrable, leading to an improper posterior distribution for K(Ω). On the other hand, since the exponential term is bounded, the posterior distribution (10) for V(Ω) is always proper, and based on it we can further derive posterior consistency of the row vector space of Ω.

5.1.1 Consistency of the row vector space of the factor matrix
The consistency of the row vector space of Ω is intuitive from (10) in the noiseless case (i.e., Y = B_0Ω_0), since the exponential term in (10) is uniquely maximized when the row vector spaces of Ω and Ω_0 coincide. As in an annealing algorithm, the exponential term enforces growing contraction towards the maximum point (where the row spaces of Ω and Ω_0 coincide) as G increases. On the other hand, the prior measure of a neighborhood of the row vector space of Ω_0 (defined as p_Ω({Ω : ||V(Ω_0)_⊥ V(Ω)^T||_F < ε})) becomes more diffuse as n grows. Therefore, in an asymptotic regime with G, n → ∞, and under some mild conditions on the growth rates of G and n ensuring that the diffusion is slower than the contraction, the consistency of the row vector space of Ω follows, as summarized below. Detailed proofs of the lemma and theorem can be found in Sections D.3 and D.4 of the Appendix in the supplementary material.
Lemma 5.1. Let B_{0,G} be a G × K matrix, Ω_{0,n} a K × n matrix, and Σ_G a known G × G diagonal matrix. Suppose noiseless data generated as Y = B_{0,G}Ω_{0,n} are given. We, however, model the columns of Y as mutually independent draws from N_G(Bω_i, Σ_G). With a flat prior on each of B's elements and a right-rotationally invariant prior on Ω, the posterior probability that ||V(Ω_{0,n})_⊥ V(Ω)^T||_F exceeds ε admits an explicit upper bound.

Lemma 5.1 bounds the discrepancy between a V(Ω) sampled from the posterior distribution and V(Ω_{0,n}) when there is no noise in the observation Y. Since ||V(Ω_{0,n})_⊥ V(Ω)^T||²_F equals the sum of squared sines of the canonical angles between the row spaces of Ω and Ω_0, Lemma 5.1 implies the convergence of these canonical angles to 0 as n, G = s → ∞ (i.e., the Bayesian consistency of the row vector space of Ω), provided the prior measure m_n({V : ||V(Ω_{0,n})_⊥ V^T||_F < ε}) does not decay too fast; this is the technical requirement ensuring that the dilution is "covered up" by the contraction. Based on this lemma, we generalize the consistency of the row vector space of Ω to the noisy observation case under the "large s, small n" paradigm.
Definition 5.1. Let B_0 be a countable array, or a bivariate function of the form B_0(j, k), with j = 1, ..., ∞ and k = 1, ..., K; intuitively, this is an ∞ × K matrix. We say that B_0 is a regular infinite loading matrix if its columns satisfy growth conditions governed by two universal constants (stated precisely in the supplementary material).

Theorem 5.2. Suppose B_0 is a regular infinite loading matrix. Let Ω_{0,n} be a K × n matrix with linearly independent rows, and let Σ = diag(σ_1², ...) be a known infinite diagonal matrix in which every σ_j is bounded below and above by constants c_3 > 0 and c_4 < ∞, respectively. Let Y be an ∞ × n matrix whose j-th row is generated independently from N_n((B_0)_{j•}Ω_{0,n}, σ_j²I_n). For every fixed G, consider modeling the i-th column of the first G rows of Y as N_G(Bω_i, Σ_{1:G,1:G}). With a flat prior on each of B's elements and a proper right-rotationally invariant prior on Ω, we have, for a random draw Ω from its posterior distribution, almost surely (with respect to the randomness in Y) that the row space of Ω converges to that of Ω_{0,n}.

5.1.2 Posterior distribution of the loading matrix
From (10), it is clear that the data only provide information on the row vector space of V(Ω); the posterior distribution of V(Ω) conditioned on its row vector space is uniform among all K × n orthonormal matrices within that row space. Utilizing the posterior consistency of the row space provided by Theorem 5.2, we can approximate a V(Ω) drawn from its posterior by another random variable of the form OV(Ω_{0,n}), where O is a K × K uniform (Haar-distributed) random orthogonal matrix (see Appendix D.5 for details).
Let B_{0,G} denote the matrix formed by the first G rows of B_0. By plugging V(Ω) = OV(Ω_{0,n}) into the matrix form of (6), we obtain a three-term decomposition (11) for the posterior samples of BK(Ω)/√n. For considerably large n and a normal true factor matrix Ω_{0,n}, K(Ω_{0,n})/√n, as the Cholesky factor of Ω_{0,n}Ω_{0,n}^T/n, approaches the identity matrix, so the first term on the right-hand side of (11) approaches B_{0,G}O^T. Meanwhile, the second term converges in probability to 0 entry-wise as n → ∞, and the third term is centered normal (independent of O) with variance shrinking to 0 as n increases. This implies that in the n, G = s → ∞ regime, posterior samples of BK(Ω)/√n can be asymptotically expressed as the true loading matrix times a uniform random orthogonal matrix.
Remedy for achieving consistency. The posterior distributions of B and K(Ω) are coupled. A "deflation" problem of K(Ω)/ √ n occurs when the factors in Ω are assumed to be normal and n = O(G), in which case the posterior distribution of K(Ω)/ √ n can be derived in closed form by the Bartlett decomposition as: where χ ν denotes the Chi distribution with ν degrees of freedom. Posterior samples of the loading matrix, therefore, have to be inflated correspondingly. Ideally, we desire the convergence of the posterior distribution of K(Ω)/ √ n towards a point mass at the identity matrix to guarantee the posterior consistency (up to rotations) of the loading matrix, and we can indeed achieve this by imposing stronger control over the singular values of Ω through the assumption on p Ω . Such a remedy is not unique. A particularly simple strategy is to require that all factors are orthogonal and have equal norm, which implies that Ω/ √ n is uniform in the Stiefel manifold St(K, n). More discussion is deferred to Section 5.2.2.

Sparse Bayesian factor model
With a special feature allocation design, V(Ω) is identifiable, so the consistency of the row space of the factor matrix can be generalized to the consistency of V(Ω). We impose a generalized lower triangular structure (Fruehwirth-Schnatter and Lopes, 2018) on the feature allocation matrix Γ to cope with the rotational invariance problem of the loading matrix. We call Γ a generalized lower triangular matrix if the row index l k of the top nonzero entry in the k-th column (define l 0 = 1, l K+1 = G + 1) increases with k and γ jk = 1 only if j ≥ l k . Under the flat SpSL prior (a mixture of a point mass and a flat distribution) on entries of B in the sparse Bayesian factor model introduced in Section 2.1, we can derive the conditional distributions of B and Ω: Given the LQ decomposition Ω = K(Ω)V(Ω) and is the projection of Y j onto the row vector space of Ω 1:k , which is a function of V(Ω). The adoption of the generalized lower triangular structure on the feature allocation matrix ensures a separation in the likelihood (14), so that the determinant part is connected to Ω only through K(Ω) and the exponential part only through V(Ω). We thus can derive that K(Ω) and V(Ω) are independent a posteriori and that: Expression (16) gives a proper posterior for V(Ω), and in the noiseless case (i.e., Y = B 0 Ω 0 ) the density is maximized when the row vector spaces of V(Ω) 1:k and V(Ω 0 ) 1:k coincide for k = 1, • • • , K, based on which we can generalize Theorem 5.2 to the consistency (up to sign permutations) of V(Ω).
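A minimal sketch of the structural check on Γ, assuming the Fruehwirth-Schnatter and Lopes (2018) convention in which only the leading (top nonzero) row indices l 1 < • • • < l K are constrained and entries below each leading one are unrestricted (the helper name is ours):

```python
import numpy as np

def is_generalized_lower_triangular(Gamma):
    """Check that the top nonzero entry of column k sits strictly lower
    with each k (leading indices l_1 < ... < l_K); entries below the
    leading one are left unrestricted in this sketch."""
    tops = []
    for k in range(Gamma.shape[1]):
        nz = np.flatnonzero(Gamma[:, k])
        if nz.size == 0:
            return False  # an all-zero column has no leading entry
        tops.append(nz[0])
    return all(a < b for a, b in zip(tops, tops[1:]))

Gamma = np.array([[1, 0, 0],
                  [1, 1, 0],
                  [0, 1, 1],
                  [1, 0, 1]])
```

Here `Gamma` satisfies the structure (leading indices 1, 2, 3 in 1-based notation), while any column permutation that breaks the monotonicity does not.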

Consistency of V(Ω)
Definition 5.2. We let B 0 be an ∞ × K matrix with nonzero rows and Γ 0 be a binary matrix of the same shape. Γ 0 is called a generalized lower triangular feature allocation matrix of B 0 if where I is the indicator function. Furthermore, for every fixed dimension G, let π G denote the unique permutation (1,

Definition 5.3. We let B 0 be an ∞ × K matrix with nonzero rows and Γ 0 be a generalized lower triangular feature allocation matrix of B 0 . B 0,G and Γ 0,G are the two G × K matrices formed by permuting the first G rows of B 0 and Γ 0 according to π G (the j-th row of B 0 is the π G (j)-th row of B 0,G ). l 0,k is the row index of the top nonzero entry in the k-th column of the generalized lower triangular matrix Γ 0,G (define l 0,0 = 1, l 0,K+1 = G + 1). B (k) 0,G is the submatrix of B 0,G formed by rows indexed l 0,k to l 0,k+1 − 1 and columns indexed 1 to k. (B 0 , Γ 0 ) is called a regular infinite loading pair if there are two universal constants

Theorem 5.3. Let (B 0 , Γ 0 ) be a regular infinite loading pair with Γ 0 known, let Ω 0,n be a K × n matrix with linearly independent rows, and let For every fixed G, let Y 1:G denote the matrix formed by permuting the first G rows of and the model assumption represented by p Ω . The determinant term creates singularities when K(Ω) k,k = 0, and the order of these "poles" is ∼ s. When this term dominates, we observe the inflation phenomenon of posterior samples of the loading matrix. Meanwhile, the model assumption term can bound K(Ω) away from these singularities by assigning little probability measure to their neighborhoods, and it also induces the convergence of ΩΩ T /n towards the identity matrix (through requirement (a) introduced in Section 5.1). Consequently, the posterior behavior of ΩΩ T /n is influenced both by the growth rates of n and s and by the choice of the distribution p Ω . A p Ω that bounds K(Ω) away from the singularities with high probability and forces a fast convergence of ΩΩ T /n towards the identity matrix can allow a fast rate of s going to infinity compared to n while still guaranteeing the posterior consistency of the loading matrix. A simple and effective choice is to let Ω/ √ n be uniform in the Stiefel manifold St(K, n), in which case ΩΩ T /n is fixed to the identity and the posterior sample consistency of B naturally holds even when n grows rather slowly compared to s.
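A sketch of this choice (function name ours): a factor matrix with Ω/ √ n uniform on St(K, n) can be drawn by orthonormalizing a Gaussian matrix, with the usual sign correction so the draw is exactly uniform; ΩΩ T /n is then the identity by construction.

```python
import numpy as np

def uniform_stiefel_factors(K, n, seed=None):
    """Draw Omega with Omega/sqrt(n) uniform on the Stiefel manifold
    St(K, n): K orthogonal rows, each of norm sqrt(n), so that
    Omega @ Omega.T / n is exactly the identity."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, K))
    Q, R = np.linalg.qr(Z)       # Q: n x K with orthonormal columns
    Q *= np.sign(np.diag(R))     # sign fix -> exactly uniform on St(K, n)
    return np.sqrt(n) * Q.T      # K x n factor matrix

Omega = uniform_stiefel_factors(4, 200, seed=0)
```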

Modification of the Gibbs sampler
In Section 5, we justified the adoption of the orthogonal factor assumption (the factor matrix Ω scaled by 1/ √ n is uniform in the Stiefel manifold St(K, n)) because it keeps posterior samples of the loading matrix consistent under the "Large s, Small n" paradigm. To construct a Gibbs sampler under this new orthogonal factor model, we need to revise the conditional sampling step of Ω|Y, B, Σ in the basic Gibbs sampler described in Section 2.2.
Let Ω k• denote the k-th row of the factor matrix and Ω −k denote the remaining rows, all as column vectors. The conditional distribution Ω k• |Y, Ω −k , B, Σ is altered from a multivariate normal distribution to: where is the multivariate normal density function with mean and covariance matrix and p Ω −k is the uniform measure on the centered √ n-radius sphere in the orthogonal space of Ω −k .
To sample from (18), we cut this √ n-radius sphere by hyperplanes that are orthogonal to the vector Ω̂ k• and denote the collection of intersections of the sphere and the hyperplanes as {S d }, where d is the distance between the origin and the hyperplane. Essentially, the {S d } are (n−k)-dimensional spheres, and every point in the same S d has the same multivariate normal density f (•; Ω̂ k• , σ̂ 2 k I n ), so we can sample from (18) by first sampling d from its marginal distribution and then uniformly sampling an Ω k• from the sphere S d given the sampled d. Using the area formula for spheres, we can deduce the marginal distribution of d as and sample from this unimodal distribution using the Metropolis algorithm. The additional computational cost brought by the model revision comes only from the Metropolis algorithm and is almost negligible. We now revisit the synthetic example in Section 3.1 to check the consistency and robustness (against slab prior specification) of the posterior distribution of the loading matrix under the orthogonal factor model and compare it with the posterior distribution obtained using a modified Ghosh-Dunson model (details provided in Appendix C).
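The second stage, drawing uniformly from S d given d, can be sketched as follows. This is our own minimal implementation under stated assumptions: `u` is the unit vector along the projection of the conditional mean direction onto the orthogonal complement of the other factor rows (so u is already orthogonal to Ω −k ), and the other rows have norm √ n and are mutually orthogonal.

```python
import numpy as np

def sample_on_slice(Omega_minus_k, u, d, seed=None):
    """Uniform draw from S_d: vectors x with ||x|| = sqrt(n),
    x orthogonal to the rows of Omega_minus_k, and <x, u> = d."""
    rng = np.random.default_rng(seed)
    n = Omega_minus_k.shape[1]
    A = np.vstack([Omega_minus_k, u])     # span to project away
    Qb, _ = np.linalg.qr(A.T)             # n x K orthonormal basis of that span
    w = rng.standard_normal(n)
    w -= Qb @ (Qb.T @ w)                  # uniform direction in the complement
    w *= np.sqrt(n - d**2) / np.linalg.norm(w)
    return d * u + w                      # lies on S_d by construction

n = 50
Om = np.sqrt(n) * np.eye(3, n)[:2]  # two orthogonal factor rows (toy setup)
u = np.eye(3, n)[2]                 # unit direction, orthogonal to them
x = sample_on_slice(Om, u, d=3.0, seed=1)
```

Since w is an isotropic Gaussian projected onto the complement and rescaled, its direction is uniform on the sphere of radius sqrt(n − d²) inside that complement, which is exactly the uniform measure on S d .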
Figures 4 and 5 show the heat maps of |B| in 3000 iterations. We perform the PXL-EM algorithm for the first 50 iterations and then Gibbs sampling, in both the orthogonal factor model and the modified Ghosh-Dunson model, for the next 2950 iterations. Figure 6 shows the trace plots of β 1,1 and β 1,3 from the Gibbs sampler in the two competing frameworks. We can see that the auto-correlations of the MCMC samples for the orthogonal factor model are much lower than those for the modified Ghosh-Dunson model. The two frameworks also give very similar 90% credible intervals, as illustrated here and in Appendix F (with columns permuted to match). Figure 7 illustrates the posterior densities of β 1,1 and β 1,3 (estimated by averaging over the conditional posterior densities) using slab priors with varying variances. We tested λ 0 = 20 and λ 1 ∈ {0.001, 0.01, 0.1, 0.5}, and the posterior distribution shows great robustness to the choice of slab prior.
Finally, we note that, although both the orthogonal factor model and the modified Ghosh-Dunson model give very similar numerical results after appropriately adjusting the tuning parameters of the priors, our analysis rigorously justifies the consistency of the former model, while a similar theoretical study of the latter is still beyond our reach.

Discussion
A primary intention of our work is to provide an efficient posterior sampler for the Bayesian factor model in high dimensions and show its consistency. Ročková and George (2016)'s sparse Bayesian factor model framework serves as a great starting point. Their framework reduces the number of global modes in the posterior density by encoding sparsity in the prior, which can also boost the performance of MCMC algorithms for posterior simulation. Ročková and George (2016) also provide a fast posterior mode finding algorithm, which is useful for the initialization of MCMC algorithms. Our work bridges the gap between posterior mode finding and full posterior sampling in Ročková and George (2016)'s framework. We analyze the inflation problem for the posterior distribution of the loading matrix under their framework in high dimensions and propose remedies. Besides our proposed solution, i.e., enforcing a common scale and orthogonality among the factors, Bernardo et al. (2003) and Ghosh and Dunson (2009) provided another perspective, which is to reduce the dimensionality of diffuse parameters in the prior to ensure they do not overwhelm the data. Their approach allows the factors to have different variances, which follow a common diffuse inverse-Gamma prior, but restricts the elements of the loading matrix to follow a standard Gaussian a priori. In this article, we provide a further modification of their model by imposing a SpSL prior on the loading matrix's elements, which allows greater flexibility in handling sparsity in high dimensions (details in Appendix C).
We are able to show theoretically that the adoption of a strict orthogonal factor assumption can ensure posterior consistency and is robust against the prior specification of the loading matrix. Unfortunately, a similarly rigorous analysis of other models, including the Ghosh-Dunson model and its modification, still evades our vigorous attempts. Future exploration may focus on the design of dependent priors that allow easy posterior sampling, as well as the justification of posterior consistency when using such priors. Beyond the scope of factor models, these two strategies, employing hierarchical prior specifications as in the modified Ghosh-Dunson model and adding restrictions to the latent factors, may be adopted as general fixes for the concern raised by Natarajan and McCulloch (1998) that slab priors may be over-influential for the posterior.
For real data applications, the orthogonal factor model and the Gibbs sampler we proposed can be applied whenever the PXL-EM algorithm succeeds in finding the posterior mode. The choice of λ 0 and λ 1 in the SpSL prior can be determined by the same dynamic exploration process proposed in Ročková and George (2016) for the PXL-EM algorithm. Numerically, we have observed that the MAP estimate and the posterior distribution of a nonzero element in the loading matrix stabilize simultaneously in this process. Appendix E presents some numerical results in this regard, showing the application of our Gibbs sampler to the cerebrum microarray data from the AGEMAP (Atlas of Gene Expression in Mouse Aging Project) database of Zahn et al. (2007), which was previously analyzed by Ročková and George (2016) using their PXL-EM algorithm.

A Scaling group moves
To see how the posterior distribution of the loading matrix is influenced by the SpSL prior, we need to observe the sample behavior at equilibrium under different priors. Due to the strong ties between the loading matrix and the latent factors, samples inflate slowly along the basic Gibbs sampling iterations, which demonstrates the slow mixing behavior of the Gibbs sampler.
A promising way to improve Markov chain Monte Carlo (MCMC) convergence is to add a group move to the sampler. Liu and Sabatti (2000) proposed "generalized Gibbs sampling", which can be seen as a generalization of Liu and Wu (1999) for conditional sampling along the trajectories of any designed transformation group. By taking advantage of the model structure and proposing a group trajectory that can cross various significant local modes, such a group move can dramatically improve MCMC convergence. The following theorem from Liu and Sabatti (2000) characterizes how a group move should be conducted.
Theorem A.1. (Liu and Sabatti (2000)) Let π be an arbitrary distribution on a space Z, and suppose t α (z) : Z → Z is a transformation parameterized by α ∈ A. Assume there is a group structure on both A and the transformation family, and a left-Haar measure H on A. If z follows the distribution π and α is drawn from (20), then t α (z) follows the distribution π.
If π in Theorem A.1 is the full posterior distribution, then t α generated by the conditional distribution (20) gives a transformation that preserves the target distribution π. We can add this transformation after each round of Gibbs sampling to improve convergence. To design group moves that shift the loading matrix and factors jointly in the toy example, we let π be the full posterior distribution and consider transformations that scale the loading matrix and factors. For k = 1, • • • , K, we consider the following group of scale transformations: and draw α k iteratively from: We design group moves that scale each column because we observe a synchronous inflation within every column during Gibbs sampling, while changes of magnitude are encumbered by the strong connection between factors and loadings. These scaling group moves are cheap to implement since the conditional distribution of α k is univariate and unimodal. More delicate moves, such as the linear restructuring (corresponding to 'rotating' the loadings in PXL-EM) t A (B, Ω) : B, Ω → BA, A −1 Ω, can be superfluous and difficult to implement in practice.
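The scale transformation itself can be sketched as follows (names are ours): t α rescales the k-th loading column and the k-th factor row in opposite directions, so the product BΩ, and hence the likelihood, is unchanged and only the prior terms shift. Drawing α k from its univariate conditional, e.g. by a Metropolis step, is omitted here.

```python
import numpy as np

def scale_move(B, Omega, k, alpha):
    """Group move t_alpha: scale the k-th loading column by alpha and
    the k-th factor row by 1/alpha; B @ Omega is left invariant."""
    B2, O2 = B.copy(), Omega.copy()
    B2[:, k] *= alpha
    O2[k, :] /= alpha
    return B2, O2

rng = np.random.default_rng(0)
B = rng.standard_normal((10, 3))
Omega = rng.standard_normal((3, 20))
B2, O2 = scale_move(B, Omega, k=1, alpha=2.5)
```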

B Direction consistency when K = 1
We consider the simplest case of the Bayesian factor model with K = 1 factor, ω ∼ N (0, 1), under the asymptotic regime with n, G → ∞. We further assume that the idiosyncratic covariance matrix Σ is the identity matrix. Let b 0 be the true loading matrix, a G × 1 column vector whose length β G is assumed to be of order √ G. Suppose the prior distribution on b is invariant under right orthogonal transformations (e.g., N (0, σ 2 I G )); then its density is a function of |b|, denoted as f (|b|). The posterior distribution is given as: By conditioning on r ≡ |b| and letting v = b/|b|, we have where µ G is the uniform measure on unit vectors in G-dimensional space and C is a constant. With the single-factor model, we can write where b 0 is the true loading vector, β G = |b 0 |, and b * 0 is the normalized form of b 0 . Let θ be the angle between v and b 0 . Since the ω i are i.i.d. N (0, 1) random variables and the ε i are i.i.d. noise vectors, as n → ∞, where Σ = I G is the covariance matrix of the ε i . Therefore, where the Jacobian term (1 − cos 2 (θ)) (G−3)/2 appears from the transformation of the µ G measure to the Lebesgue measure. When nβ 2 G is large, we need only consider a small vicinity of 0, so that sin(θ) ≈ θ; then we have
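The qualitative conclusion above, that the fitted direction v concentrates around b * 0 when nβ 2 G is large, can be checked numerically. The sketch below is ours and uses the leading left singular vector of Y as a simple stand-in for a posterior summary of the direction:

```python
import numpy as np

rng = np.random.default_rng(0)
G, n = 200, 500
b0 = rng.standard_normal(G)
b0 *= np.sqrt(G) / np.linalg.norm(b0)       # |b_0| = sqrt(G), as assumed
omega = rng.standard_normal(n)               # latent factors, N(0, 1)
Y = np.outer(b0, omega) + rng.standard_normal((G, n))  # Sigma = I_G noise

# Leading left singular vector as a stand-in for the fitted direction v
v = np.linalg.svd(Y, full_matrices=False)[0][:, 0]
cos_theta = abs(v @ b0) / np.linalg.norm(b0)
```

With this signal strength, cos(θ) is close to 1, consistent with the concentration of the posterior of v around ±b * 0 .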

C Modified Ghosh-Dunson model
Since the magnitude inflation is associated with the overdose of independent slab priors on the loading matrix, an immediate countermeasure is to control the number of slab priors used. Ghosh and Dunson (2009) proposed to use a diffuse prior (inverse-Gamma with a large variance) for the variance of the normal factors and impose a standard Gaussian prior on elements of the loading matrix; we call this the Ghosh-Dunson model. Here we propose a modified Ghosh-Dunson model by relocating the variance parameters of the factors to the loading matrix and imposing a SpSL prior on its elements: ω i ∼ N K (0, I K ); priors: β jk = q jk r k , p(r k |λ) = ψ(r k |λ); p(q jk |γ jk , λ 0 , λ 1 ) = (1 − γ jk )ψ(q jk |λ 0 ) + γ jk ψ(q jk |λ 1 ), λ 0 ≫ λ 1 ; where β jk denotes the (j, k)-th element of B and ψ(•|λ) is the normal density with precision λ. We choose λ 0 large and λ 1 = 1.
In this framework, each loading element β jk is expressed as the product of a column-wise magnitude parameter r k and a 'normalized' loading q jk . Ghosh and Dunson (2009)'s original model corresponds to assuming γ jk ≡ θ k ≡ 1, i.e., a normal instead of a mixture normal prior for the q jk . We impose a diffuse normal prior on the r k 's and a SpSL prior on the q jk 's. With this dependent prior specification, the number of "slab parameters" is greatly reduced (all elements in each column of B share a common "slab parameter" r k ), while marginally the prior on each β jk is the same as that of the independent SpSL prior. This prior setup on the loading matrix is similar to the one in the hierarchical linear model of Jia and Xu (2007), where Ω is prescribed and the prior setup on B establishes connections between rows of the loading matrix to prevent the degeneration of the original model into multiple independent linear regressions. However, the hierarchical linear model is not subject to the inflation problem even if completely independent priors are imposed on the loading matrix, since Ω is already prescribed.
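A minimal sketch of drawing a loading matrix from this dependent prior (the function name is ours, and the fixed inclusion probability `theta` is a stand-in for the paper's prior on the γ jk ):

```python
import numpy as np

def sample_prior(G, K, lam=0.001, lam0=200.0, lam1=1.0, theta=0.1, seed=None):
    """Draw B from the modified Ghosh-Dunson prior: beta_jk = q_jk * r_k,
    with r_k ~ N(0, 1/lam) (diffuse column-wise magnitude) and q_jk a
    spike-and-slab normal mixture with precisions lam0 (spike) and lam1
    (slab).  theta is an assumed fixed slab inclusion probability."""
    rng = np.random.default_rng(seed)
    r = rng.normal(0.0, 1.0 / np.sqrt(lam), size=K)      # column magnitudes
    gamma = rng.random((G, K)) < theta                   # slab indicators
    prec = np.where(gamma, lam1, lam0)                   # per-entry precision
    q = rng.normal(0.0, 1.0 / np.sqrt(prec))             # normalized loadings
    return q * r, gamma                                  # B is G x K

B, gamma = sample_prior(100, 8, seed=0)
```

Note that all entries in column k share the single diffuse parameter r k , which is exactly how the number of "slab parameters" is reduced relative to independent SpSL priors.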
Although the dependent slab prior specification is an effective way of resolving the posterior inflation problem, the justification of posterior consistency is rather difficult under this framework. We simply provide some numerical results in Section 6 to compare the posterior distribution based on the modified Ghosh-Dunson model (21) with that resulting from our strategy of imposing the orthogonal factor assumption. The simulation is performed with α = 1/G, η = = 1, λ = 0.001, λ 0 = 200, λ 1 = 1, and K = 8 using a Gibbs sampler starting from the MAP estimate identified by the PXL-EM algorithm.

D Mathematical Proofs
D.1 Proof of Theorem 4.1

Proof. Let β 1 be the vector formed by the β jk 's with corresponding γ jk = 1 and let β 0 be the vector formed by the β jk 's with corresponding γ jk = 0.
D.6 Proof of Theorem 5.3

Theorem 5.3 is an immediate result of the following lemma and Theorem 5.2.
1. min k λ min ((Σ

Let Y = B 0,G Ω 0,n and model Y •i with N G (BΩ •i , Σ G ) for i = 1, • • • , n. Impose a point mass and flat mixture prior on the entries of B according to the feature allocation matrix Γ 0,G and assume a distribution on Ω that is invariant under right orthogonal transformations; then for a random draw Ω from its posterior distribution, Proof. We know that for f (n, G) = min k λ min ((Σ 0,G K(Ω 0,n ) 1:k ): 1'. f (n, G) goes to infinity. 2'. Let V 0 be a fixed K × n orthonormal matrix. Define two disjoint sets S 1 and S 2 as follows

until the solution path is stabilized. The solution given by PXL-EM under the final value of λ 0 approximates the MAP estimate under a flat and point mass mixture prior on loading matrix elements and is proposed as the estimator for the parameters.
The same procedure can be applied to the full posterior inference based on the orthogonal factor model. We observed a similar stabilization of the posterior distributions of every nonzero loading element when performing dynamic exploration for the orthogonal factor model, which we illustrate by applying our method to the cerebrum microarray data from the AGEMAP (Atlas of Gene Expression in Mouse Aging Project) database of Zahn et al. (2007), previously analyzed by Ročková and George (2016) using their PXL-EM algorithm. For each mouse in this dataset (5 males and 5 females, at four age periods), cerebrum microarray expression data for 8932 genes are recorded; the observations y i , i = 1, • • • , 40, for the factor model are taken to be the residuals of the expression values for each of the 8932 genes regressed on age and gender with an intercept.
We ran the posterior sampler initialized at the mode detected by the PXL-EM algorithm with λ 1 = 0.001, α = 1/G, and λ 0 gradually increasing through the sequence 12, 15, 20, 30, 40. The factor dimensionality detected by the PXL-EM algorithm is 1. In Figure 8, we demonstrate the evolution of the posterior densities of β 2873,1 and β 1,1 . The posterior distribution of β 1,1 centers at 0 and becomes more and more spiky as λ 0 grows. For the nonzero element β 2873,1 , its posterior distributions resemble normals with a relatively stable variance, and the absolute value of the mean first decreases and then increases. This change of monotonicity is caused by the alteration of γ 2873,1 from 0 to 1 in the posterior samples. The increase of λ 0 imposes more and more compression on the posterior of β j,k with γ j,k = 0, as well as inducing 0-to-1 changes in those γ elements that correspond to large β values. When γ j,k is altered to 1, the posterior distribution of β j,k is no longer influenced by the spike prior, and thus the posterior distribution stabilizes.

Figure 2: Trace plot of log(|β 1,1 |) from the Gibbs sampler with n = 100, 1000, and λ 1 = 0.001, 0.1. The sampler of β 1,1 stabilizes around a much larger value than the truth, 1. The inflation of the samples is more severe when n is smaller or the variance of the slab priors is larger.
then for any fixed finite-measure Borel set S, lim m→∞ P (B ∈ S|Y, Σ, Γ, m) = 0, where [B | Y, Σ, Γ, m] is the posterior distribution from model (1) with normally distributed factors and φ m as the slab part of the SpSL prior on loading matrix elements.

Figure 4: Heat maps of |B| in 3000 iterations of the Gibbs sampler using the orthogonal factor model.

Figure 5: Heat maps of |B| in 3000 iterations of the Gibbs sampler using the modified Ghosh-Dunson model.

Figure 6: MCMC trace plots of β 1,1 and β 1,3 , respectively, under (a) the uniform orthogonal factor model and (b) the modified Ghosh-Dunson model. The two frameworks provide similar and consistent posterior distributions, but samples from the modified Ghosh-Dunson model have higher auto-correlations.

Figure 10: Boxplot of posterior samples of the first 50 entries of idiosyncratic variances.

Figure 11: Boxplot of posterior samples of the first 50 entries of the loading vector.

Figure 12: 90% credible interval for the first 500 entries of the loading vector.

F.2 The synthetic example

Figure 15: 90% credible intervals for elements in the first five columns of the loading matrix using the modified Ghosh-Dunson model in the synthetic example.