Posterior Consistency of Factor Dimensionality in High-Dimensional Sparse Factor Models

Abstract

Factor models aim to describe the dependence structure among high-dimensional random variables in terms of a low-dimensional unobserved random vector called a factor. One of the major practical issues in applying factor models is determining the factor dimensionality. In this paper, we propose a computationally feasible nonparametric prior distribution that achieves posterior consistency of the factor dimensionality. We also derive the posterior contraction rate of the covariance matrix, which is optimal when the factor dimensionality of the true covariance matrix is bounded. We conduct numerical studies that illustrate our theoretical results.


Introduction
Factor models describe the dependence structure of a high-dimensional random vector in terms of a low-dimensional unobserved random vector called a latent factor, or simply a factor. Specifically, the (linear) factor model considered in this paper assumes that a p-dimensional random vector Y is distributed as

Y | Z = z ∼ N(Bz, σ²I), Z ∼ N(0, I), (1.1)

where B is a p × K factor loading matrix, Z is a K-dimensional factor with K < p, and σ² > 0 is a noise variance. Under this model, the marginal distribution of Y is given by Y ∼ N(0, Σ) with Σ := BBᵀ + σ²I.
That is, the distribution of Y is determined by the structured covariance matrix BBᵀ + σ²I. This decomposition of the covariance matrix leads to a substantial reduction of model complexity, and thus the factor model has been applied to a broad range of areas including high-dimensional covariance estimation (Fan et al., 2008, 2011, 2018), high-dimensional supervised learning (Fan et al., 2017; Kneip and Sarda, 2011; Silva, 2011; Stock and Watson, 2002) and multiple testing under arbitrary dependence (Fan et al., 2012, 2019; Leek and Storey, 2008), and is popularly used in various application fields such as economics, psychology and gene expression studies (e.g., Bernanke et al., 2005; Carvalho et al., 2008; Forni et al., 2003; Hochreiter et al., 2006; McCrae and John, 1992).
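To record the covariance decomposition explicitly, the marginal covariance of Y under (1.1) follows from the law of total covariance:

Cov(Y) = E[Cov(Y | Z)] + Cov(E[Y | Z]) = σ²I + Cov(BZ) = BBᵀ + σ²I,

since Cov(BZ) = B Cov(Z) Bᵀ = BBᵀ when Cov(Z) = I.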
A major practical issue in using the factor model is determining the factor dimensionality K. We need to select an appropriate factor dimensionality to optimize the bias-variance trade-off. In addition, the factor dimensionality is of intrinsic interest in practice since each factor can have a physical interpretation (e.g., the number of interacting pathways in genomics and the number of personality traits in psychology).
Frequentist approaches typically choose the factor dimensionality before estimating the loading matrix. One widely used method is to fit factor models for different values of K and select the best K based on a model selection criterion (Bai and Ng, 2002, 2007). Alternatively, the factor dimensionality can be chosen based on the eigenvalues of the empirical covariance matrix (Ahn and Horenstein, 2013; Lam and Yao, 2012; Onatski, 2010).
Various priors have been developed for Bayesian analysis of the factor model with unknown factor dimensionality. Examples are spike and slab priors with the Indian buffet process (IBP) (Chen et al., 2010; Knowles et al., 2011; Rockova and George, 2016) and shrinkage-type priors with the degree of shrinkage increasing across the column index (Bhattacharya and Dunson, 2011; Srivastava et al., 2017).
Large sample properties of the posterior distribution of the factor model have also received much attention. Pati et al. (2014) investigated the posterior contraction rate of the covariance matrix with respect to the spectral norm for a sparse factor model in which most of the entries of the factor loading matrix are zero. They showed that the derived posterior contraction rate is near-optimal in the minimax sense, up to a logarithmic factor, even when p is much larger than n. An improved contraction rate was obtained by Xie et al. (2018), and Ning (2021) proved that the variational posterior can also achieve this improved rate.
However, these studies focused on the covariance matrix rather than the factor dimensionality. Rockova and George (2016) considered a Bayesian factor model with a spike and slab prior based on the IBP proposed by Ghahramani and Griffiths (2005), and proved that the posterior probability that the factor dimensionality is bounded by a certain quantity converges to one; the posterior consistency of the factor dimensionality itself, however, remained unresolved. Gao and Zhou (2015) studied a Bayesian sparse principal component analysis (PCA) model, which is equivalent to the factor model with the constraint that the columns of the loading matrix are orthogonal to each other. They derived posterior contraction rates for the covariance matrix and the principal subspace with respect to the spectral norm and proved the posterior consistency of the rank of the covariance matrix. Due to the orthogonality of the loading matrix, the rank of the covariance matrix is equal to the factor dimensionality. However, there is no easy computational method to approximate the posterior distribution, mainly because of the orthogonality constraint on the loading matrix.
The main contribution of this paper is to propose a Bayesian factor model that is computationally tractable and at the same time achieves the posterior consistency of the factor dimensionality. For this purpose, we consider a spike and slab prior with the two-parameter IBP, which is an extension of the one-parameter IBP of Ghahramani and Griffiths (2005). Bayesian factor models with the two-parameter IBP have been studied by Chen et al. (2010) and Paisley and Carin (2009), but no theoretical study has been done. In this paper, we develop an easily implementable Markov chain Monte Carlo (MCMC) algorithm for our proposed Bayesian model, which is similar to that for the model of Ghahramani and Griffiths (2005), and prove the posterior consistency of the factor dimensionality. In addition, as a by-product, we derive a posterior contraction rate of the covariance matrix that is similar to, or sometimes better than, those of other Bayesian factor models. This paper is organized as follows. In Section 2, we introduce the assumptions on the true model and then describe the proposed prior and its properties. In Section 3, we provide posterior asymptotic results. In Section 4, we provide an MCMC algorithm for our proposed Bayesian model. In Section 5, we conduct simulation studies to supplement our asymptotic results and show the superiority of the proposed method. Real data analysis is conducted in Section 6. Concluding remarks follow in Section 7.

Notation
Let R be the set of real numbers and N the set of natural numbers. For a positive integer p, we let [p] := {1, 2, . . . , p}. For a real number x, let ⌊x⌋ denote the largest integer less than or equal to x and ⌈x⌉ the smallest integer larger than or equal to x. For two real numbers a and b, we write a ∨ b := max{a, b} and a ∧ b := min{a, b}. For two positive sequences {a_n}_{n∈N} and {b_n}_{n∈N}, we write a_n ≲ b_n if there exists a constant C > 0 such that a_n ≤ C b_n for every n ∈ N. Moreover, we write a_n ≳ b_n if b_n ≲ a_n, and a_n ≍ b_n if both a_n ≲ b_n and a_n ≳ b_n. We denote by 1(·) the indicator function. Let Γ(a) denote the gamma function and B(b, c) the beta function, where a, b and c are positive constants. Let 0 and 1 denote the vector of zeros and the vector of ones, respectively.

Assumptions and prior distribution

Assumptions
Throughout this paper, we assume that for each n ∈ N we observe n independent and identically distributed p_n-dimensional observations Y_1:n ≡ (Y_1, . . . , Y_n) from the normal distribution with mean 0 and covariance matrix Σ_0n, and we model the data using the model (1.1).
We introduce some regularity conditions on the sequence of true covariance matrices {Σ_0n}_{n∈N}. First, we consider sparse loading matrices. Since we want to deal with the high-dimensional case where p_n is much larger than n, we assume that the true loading matrix is sufficiently sparse to make the factor loading estimable. Define the class of p × k_0 matrices whose columns each have at most s nonzero elements,

B(p, k_0, s) := { B ∈ R^{p×k_0} : max_{k∈[k_0]} |supp(β_k)| ≤ s }, (2.1)

where supp(β_k) denotes the set of indices of the nonzero elements of the k-th column β_k of B. The above class of sparse loading matrices is also considered by Pati et al. (2014) and Gao and Zhou (2015). The parameter space for the true covariance matrix consists of covariance matrices of the form Σ = BBᵀ + σ²I with B ∈ B(p, k_0, s), the magnitude of B controlled by the scale parameter c, and σ² > c_0 for some universal constant c_0 > 0. Here the lower bound σ² > c_0 is introduced to prevent Σ_0n from becoming ill-conditioned.
We assume that for each sample size n ∈ N, the true covariance matrix Σ_0n belongs to the class C_0n := C(p_n, k_0n, s_n, c_n), where the sequence of classes {C_0n}_{n∈N} satisfies the following assumptions:

(A1) p_n > n;
(A2) c_n² s_n² k_0n log p_n / n → 0;
(A3) 1 ≤ k_0n ≤ p_n / 2;
(A4) λ_max(Σ_0n) ≤ c_n s_n, where λ_max(·) denotes the largest eigenvalue.
The assumption (A1) means that we consider high-dimensional models. However, our theoretical analysis can also be applied to low-dimensional cases where n > p_n; in such cases, the log p_n factors in the results are replaced with log n. The quantity c_n² s_n² k_0n log p_n / n in (A2) is equal to the square of the posterior contraction rate of the covariance matrix under our proposed prior with respect to the spectral norm. In (A3), we assume that the true model has at least one factor and that the factor dimensionality does not exceed half the number of observed variables, for technical reasons.
The assumption (A4) implies that we allow the largest eigenvalue of Σ_0n to grow with the sample size. The bound c_n s_n is mild in view of random matrix theory. Suppose that B is an s_n × k_0n random matrix whose entries are independent centered random variables with finite fourth moments. Then by Theorem 2 of Latała (2005), we have E‖BBᵀ‖ ≲ s_n. Pati et al. (2014) and Rockova and George (2016) assumed the same condition, while other works on Bayesian covariance estimation assumed the stronger condition that the largest eigenvalue of Σ_0n is bounded (Gao and Zhou, 2015; Xie et al., 2018).
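The scaling E‖BBᵀ‖ ≲ s_n is easy to check numerically; the following small sketch (ours, purely illustrative) compares the spectral norm of BBᵀ with s_n for random matrices with independent standard normal entries.

```python
import numpy as np

rng = np.random.default_rng(0)
for s_n, k_0n in [(50, 5), (200, 10), (1000, 20)]:
    B = rng.standard_normal((s_n, k_0n))    # entries iid, centered, finite 4th moment
    spec = np.linalg.norm(B @ B.T, ord=2)   # spectral norm ||B B^T||
    print(f"s_n={s_n:5d}, k_0n={k_0n:3d}, ||BB^T||={spec:9.1f}, ratio to s_n={spec / s_n:.2f}")
```

The printed ratios stay of constant order as s_n grows, consistent with the bound quoted from Latała (2005).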

Prior and its properties
Let β_jk be the (j, k) entry of the p × ∞-dimensional loading matrix B. We consider the following prior, where α > 0 and κ ≥ 0 are hyperparameters. Note that {ξ_jk} follows the stick-breaking representation of the two-parameter IBP (Teh et al., 2007). Thus, we refer to the above distribution on B as SSIBP_p(α, κ), an abbreviation of spike and slab Indian buffet process.
The (one-parameter) IBP introduced by Ghahramani and Griffiths (2005), which is the prior on {ξ_jk} with κ = 0, has been popularly used as a prior on the nonzero entries of the loading matrix. The prior distribution of Knowles et al. (2011) for B is almost the same as the SSIBP prior except that κ = 0 and the Laplace distribution in (2.3) is replaced by the normal distribution. Rockova and George (2016) used the IBP as the prior on ξ_jk, but they replaced δ_0 in the first line of (2.3) with a Laplace distribution with a very small dispersion; this replacement enables the use of a fast and scalable expectation-maximization algorithm that estimates a posterior mode. We consider the two-parameter IBP in order to achieve the posterior consistency of the factor dimensionality by choosing κ appropriately. Although Chen et al. (2010) and Paisley and Carin (2009) used the two-parameter IBP, no theoretical study has been done.
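To fix ideas, the sketch below draws a truncated loading matrix from one plausible spike-and-slab stick-breaking construction in the spirit of SSIBP_p(α, κ) and computes the induced factor dimensionality K_+(B) of Definition 2.1 below. The Beta(α, 1 + κ) stick weights are our assumption for illustration only; the exact form of (2.3) and of the two-parameter IBP stick-breaking is given in the paper and in Teh et al. (2007).

```python
import numpy as np

def draw_ssibp_like(p, K_max, alpha, kappa, rng):
    """Draw a p x K_max loading matrix from a spike-and-slab stick-breaking
    prior (illustrative form only, not necessarily the paper's exact (2.3))."""
    nu = rng.beta(alpha, 1.0 + kappa, size=K_max)   # assumed Beta(alpha, 1+kappa) sticks
    rho = np.cumprod(nu)                            # inclusion probabilities, decreasing in k
    xi = rng.random((p, K_max)) < rho               # xi_jk ~ Bernoulli(rho_k)
    slab = rng.laplace(scale=1.0, size=(p, K_max))  # Laplace(1) slab for nonzero entries
    return np.where(xi, slab, 0.0)                  # spike: point mass at zero

rng = np.random.default_rng(1)
B = draw_ssibp_like(p=100, K_max=50, alpha=1.0, kappa=5.0, rng=rng)
K_plus = int(np.sum(np.any(B != 0.0, axis=0)))      # K_+(B): number of nonzero columns
print("K_+(B) =", K_plus)
```

Larger κ pushes the inclusion probabilities ρ_k toward zero, which is the mechanism behind the choice κ = p^{1+δ} discussed below.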
In the following three subsections, we provide some theoretical properties of the SSIBP prior. All the proofs of the results in this section are deferred to Appendix A.1 in the supplementary material (Ohn and Kim, 2021).

Prior distribution of the factor dimensionality
We derive an upper bound on the tail probability of the factor dimensionality induced by the SSIBP prior. We first define the factor dimensionality of a p × ∞-dimensional loading matrix B.
Definition 2.1. For a given p × ∞ loading matrix B ≡ (β_1, β_2, . . . ), we define the factor dimensionality K_+(B) as the number of nonzero columns of B, i.e., K_+(B) := ∑_{k∈N} 1(β_k ≠ 0). The following lemma shows that the tail probability of the factor dimensionality decays exponentially as the factor dimensionality increases.

Prior distribution of row-wise sparsity
Recall that we assume column-wise maximum sparsity of the true loading matrix. In this section, for ease of mathematical exposition, we introduce a stronger notion of sparsity called row-wise sparsity.
Definition 2.2. For B ≡ (β_jh)_{j∈[p],h∈N} and a positive integer k ∈ N, we define the row-support up to the k-th column of B as supp_k(B) := {j ∈ [p] : β_jh ≠ 0 for some h ∈ [k]}. Note that for any B ∈ R^{p×∞} and any k ∈ N, supp_k(B) = ∪_{h=1}^k supp(β_h), where β_h denotes the h-th column of B.
Throughout this paper, we set κ = p^{1+δ} for a fixed constant δ > 0. This choice of the hyperparameter is intended to concentrate most of the prior mass on sufficiently sparse loading matrices, as shown in the following lemma.

Prior concentration near the true loading matrix
In this subsection, we show that the SSIBP prior puts sufficiently large mass near the truth. Since the true loading matrix B_0n ∈ R^{p_n×k_0n} depends on n, we also let the hyperparameters α = α_n and κ = κ_n depend on n. We let Π_n denote the prior distribution SSIBP_{p_n}(α_n, κ_n) × IG(a, b) on (B, σ²), where IG(a, b) denotes the inverse gamma distribution with shape parameter a > 0 and scale parameter b > 0. Unless there is confusion, we regard the loading matrix B_0n ∈ R^{p_n×k_0n} as a p_n × ∞ matrix by appending zero columns. Lemma 2.3 states that if κ_n = p_n^{1+δ} for α_n > 0 and δ > 0, then for any n ∈ N and η > 0, the prior mass of a neighborhood of B_0n is bounded below, with a universal constant C_1 > 0 depending only on δ. Moreover, if η_n is such that p_n^{-m} ≲ η_n ≲ 1 for some m > 0 and the assumption (A4) holds, an analogous bound holds with a universal constant C_2 > 0 depending only on δ.
Using Lemma 2.3, we can obtain the following prior concentration result for the covariance matrix with respect to the Frobenius norm.
Corollary 2.4 states that the prior mass of a Frobenius-norm neighborhood of Σ_0n is bounded below by α_n^{k_0n} exp(−C_1 s_n k_0n log p_n), (2.11) for some universal constant C_1 > 0 depending only on δ, a and b.
Most priors for high-dimensional sparse factor models have the lower bound exp(−C s_n k_0n log p_n) for the prior concentration (Gao and Zhou, 2015; Pati et al., 2014; Rockova and George, 2016). The lower bound in (2.11) is similar, but the key difference is that it depends explicitly on the hyperparameter α_n. For the posterior consistency of the factor dimensionality, controlling α_n is indispensable.

Asymptotic properties of the posterior distribution
In this section, we study asymptotic properties of the posterior distributions of the covariance matrix and the factor dimensionality in the sparse factor model. We denote by Π_n(· | Y_1:n) the posterior distribution induced by the prior Π_n and the data Y_1:n ≡ (Y_1, . . . , Y_n). For a given sample size n and covariance matrix Σ, we let P_Σ and E_Σ denote the probability measure and the expectation operator under the law (N(0, Σ))^n, where we suppress the dependence on n for simplicity.
The proofs of the results in this section are deferred to Appendix A.2 in the supplementary material (Ohn and Kim, 2021).

Posterior contraction rate of covariance matrix
We let {C_0n}_{n∈N} with C_0n := C(p_n, k_0n, s_n, c_n) be the sequence of classes of covariance matrices satisfying the assumptions (A1)-(A4). The next theorem gives the posterior contraction rate of the covariance matrix.
Theorem 3.1 states that E_{Σ_0n}[Π_n(‖Σ − Σ_0n‖ > Mε_n | Y_1:n)] → 0 uniformly over C_0n for any sufficiently large M > 0, where

ε_n := c_n √( k_0n (s_n log p_n + log(1/α_n)) / n ).

If we set the hyperparameter α_n so that log(1/α_n) ≲ s_n log p_n, for instance α_n = α_0 for some α_0 ∈ (0, 1) or α_n = p_n^{-1} as in Rockova and George (2016), the posterior contraction rate becomes

ε_n = c_n √( s_n k_0n log p_n / n ). (3.3)

We compare this rate with those in other related Bayesian studies. Pati et al. (2014) derived a similar posterior contraction rate with their own prior and assumptions, namely c_n √( s_n k_0n³ log p_n / n ) · √log n; we remove the k_0n √log n factor by using the improved test construction developed by Gao and Zhou (2015). Gao and Zhou (2015) obtained the posterior contraction rate √( s_n k_0n log p_n / n ) under the assumption that the largest eigenvalue of the true covariance matrix is bounded. The posterior distribution of Xie et al. (2018) and the variational posterior distribution of Ning (2021) enjoy the contraction rate √( s_n log p_n / n ) under the assumptions that both the true largest eigenvalue and the true factor dimensionality are bounded. Our posterior contraction rate in (3.3) recovers those rates.
Whether the convergence rate ε_n in (3.3) is minimax optimal is still an open problem. A closely related minimax result was obtained by Cai et al. (2015), who derived the minimax convergence rate of a covariance matrix with a row-wise sparse loading matrix, i.e., over the class C̄(p_n, k_0n, q, c_n) of covariance matrices whose loading matrices have at most q nonzero rows. Under the additional assumption that σ² = 1, Cai et al. (2015) proved a minimax rate that is faster than the posterior convergence rate in Theorem 3.1 by the factor √c_n. However, if we drop the assumption σ² = 1, we have the minimax lower bound

inf_Σ̂ sup_{Σ_0n ∈ C(p_n, k_0n, s_n, c_n)} E_{Σ_0n} ‖Σ̂ − Σ_0n‖ ≳ c_n √( s_n log(p_n/s_n) / n ). (3.6)
See Appendix A.3 in the supplementary material (Ohn and Kim, 2021) for the derivation. Note that C(p_n, k_0n, s_n, c_n) ⊂ C̄(p_n, k_0n, s_n k_0n, c_n) but C(p_n, k_0n, s_n, c_n) ⊄ C̄(p_n, k_0n, q, c_n) for any q < s_n k_0n. Thus, our posterior contraction rate ε_n in (3.3) would be expected to be (nearly) minimax optimal, which we leave as future work.

Posterior consistency of the factor dimensionality
For the posterior consistency of the factor dimensionality, we need an additional assumption on the smallest nonzero eigenvalue of the low-rank matrix B_0n B_0nᵀ. Define

C′(p, k_0, s, c, ζ) := { Σ = BBᵀ + σ²I ∈ C(p, k_0, s, c) : λ_{k_0}(BBᵀ) > ζ }

for ζ > 0, where λ_{k_0}(BBᵀ) denotes the k_0-th largest eigenvalue of BBᵀ. We assume that the true covariance matrix Σ_0n belongs to the class C′_0n := C′(p_n, k_0n, s_n, c_n, ζ_n), where the sequence of classes {C′_0n}_{n∈N} satisfies, in addition to the assumptions (A1)-(A4), the following assumption:

(A5) ζ_n > C_0 c_n √( s_n² k_0n log p_n / n ) for a sufficiently large constant C_0 > 0.
The assumption (A5) requires that ζ_n be larger than the posterior contraction rate of the covariance matrix, which is necessary to avoid underestimating the factor dimensionality k_0n. The assumption (A5) plays a similar role to the beta-min condition, which has been popularly used to prove variable selection consistency in high-dimensional regression (Castillo and van der Vaart, 2012; Martin et al., 2017).
We let {C′_0n}_{n∈N} with C′_0n := C′(p_n, k_0n, s_n, c_n, ζ_n) be the sequence of classes of covariance matrices satisfying the assumptions (A1)-(A5). The following theorem establishes the posterior consistency of the factor dimensionality: with the hyperparameters α_n = p_n^{-As_n²} for a sufficiently large constant A > 0 and κ_n = p_n^{1+δ},

E_{Σ_0n}[Π_n(K_+(B) = k_0n | Y_1:n)] → 1 (3.8)

uniformly over C′_0n. Rockova and George (2016) proved that E_{Σ_0n}[Π_n(K_+(B) > M s_n k_0n | Y_1:n)] = o(1) for sufficiently large M > 0 for our prior with κ_n = 0 (i.e., the one-parameter IBP). This result is much weaker than ours when s_n diverges. Also, Rockova and George (2016) did not consider the posterior probability of underestimating the factor dimensionality. We introduce a diverging value of κ_n and the assumption (A5) to prevent underestimation.
For the posterior consistency of the factor dimensionality, the condition (A5) plays a crucial role. When the assumption (A5) does not hold, that is, when the true covariance matrix Σ_0n belongs to the larger class C(p_n, k_0n, s_n, c_n) instead of C′_0n, we can only ensure that

E_{Σ_0n}[Π_n(k̲_n ≤ K_+(B) ≤ k_0n | Y_1:n)] → 1, (3.9)

where k̲_n := sup{k ∈ [k_0n] : λ_k(B_0n B_0nᵀ) > ζ_n} ∨ 0 with ζ_n satisfying (A5). The proof of (3.9) is given at the end of Appendix A.3 in the supplementary material (Ohn and Kim, 2021). Cai et al. (2015) proved that if the true covariance matrix is of the form B_0n B_0nᵀ + I with B_0n ∈ R^{p_n×k_0n} and |supp_{k_0n}(B_0n)| ≤ q_n, the necessary and sufficient condition for consistent estimation of the factor dimensionality is ζ_n > C √( q_n log p_n / n ) for a sufficiently large constant C > 0, under regularity conditions. Compared to the optimal lower bound of Cai et al. (2015), our lower bound on the eigengap in (A5) is larger by a √s_n factor. Moreover, for α_n as in Theorem 3.2, the posterior contraction rate of the covariance matrix becomes

ε_n = c_n √( s_n² k_0n log p_n / n ), (3.10)

which is √s_n times slower than the posterior contraction rate in (3.3). This suboptimality is mainly due to the SSIBP_{p_n}(p_n^{-As_n²}, p_n^{1+δ}) prior and is not an inherent limitation of Bayesian analysis: there exist priors that simultaneously achieve the posterior consistency of the factor dimensionality and the fast posterior contraction rate (3.3) of the covariance matrix, an example being the prior of Gao and Zhou (2015). However, unlike that prior, the SSIBP prior is computationally tractable.
Remark 1. When the true noise variance σ²_0n is known, we can derive a posterior contraction rate of the covariance matrix with respect to the Frobenius norm from Theorems 3.1 and 3.2. This is because if K_+(B) = k_0n holds, we have rank(BBᵀ − B_0n B_0nᵀ) ≤ 2k_0n and so ‖Σ − Σ_0n‖_F = ‖BBᵀ − B_0n B_0nᵀ‖_F ≤ √(2k_0n) ‖BBᵀ − B_0n B_0nᵀ‖. Hence the posterior contraction rate with respect to ‖Σ − Σ_0n‖_F is at most √k_0n ε_n = c_n s_n k_0n √( log p_n / n ).
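The norm comparison invoked in Remark 1 is the standard rank inequality: for a symmetric matrix M of rank at most r with nonzero eigenvalues λ_1, . . . , λ_r,

‖M‖_F² = ∑_{i=1}^r λ_i² ≤ r max_i λ_i² = r ‖M‖², so that ‖M‖_F ≤ √r ‖M‖.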

Bounded factor dimensionality
The suboptimality of the posterior contraction rate of the covariance matrix under the SSIBP prior can be remedied when the true factor dimensionality k_0n is bounded. Bounded factor dimensionality is not too restrictive in some situations, such as macroeconomic applications (Onatski, 2010; Li et al., 2017). We introduce the following two assumptions in place of (A3) and (A5):

(A3′) 1 ≤ k_0n ≤ K̄ for some absolute constant K̄ > 0.
(A5′) ζ_n > C_0 c_n √( s_n log p_n / n ) for a sufficiently large constant C_0 > 0.
The detailed proof of Theorem 3.3 is almost identical to those of Theorems 3.1 and 3.2 and is thus omitted.
The convergence rate in (3.11) is minimax optimal (Pati et al., 2014). That is, under the bounded factor dimensionality setup, we can attain both the posterior consistency of the factor dimensionality and the optimal convergence rate of the covariance matrix. As shown in Corollary 2.4, the prior concentration in this case is lower bounded by α_n^{k_0n} exp(−C_1 s_n log p_n) for some C_1 > 0, so the upper bound on the posterior probability E_{Σ_0n}[Π_n(K_+(B) > k_0n | Y_1:n)] is of order α_n exp(C_2 s_n log p_n) for some C_2 > 0. Therefore the choice α_n = p_n^{-As_n} with A > C_2, instead of the choice α_n = p_n^{-As_n²} in Theorem 3.2, is sufficient to make this upper bound go to zero. Moreover, the choice α_n = p_n^{-As_n} does not hurt the prior concentration, i.e., α_n^{k_0n} exp(−C_1 s_n log p_n) ≳ exp(−C_3 s_n log p_n) for some C_3 > 0, and thus the results of Theorem 3.3 can be obtained. Remark 2. A referee raised the question of whether the same results as in Theorem 3.3 hold when the upper bound K̄ of the factor dimensionality grows as n goes to infinity.
Unfortunately, the answer is negative. If k_0n ≤ K̄_n for some positive sequence {K̄_n}_{n∈N}, it can be shown from our proof that the posterior consistency of the factor dimensionality is obtained with the hyperparameter α_n ≍ p_n^{-As_nK̄_n} for sufficiently large A > 0, but this yields the posterior contraction rate c_n √( s_n k_0n K̄_n log p_n / n ) for the covariance matrix, which is larger than the optimal rate c_n √( s_n k_0n log p_n / n ) when K̄_n → ∞. That is, with the current proof technique, we cannot obtain the optimal convergence rate of the covariance matrix and the consistency of the factor dimensionality simultaneously if the factor dimensionality is not bounded.
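To make the hyperparameter calculus for the bounded case explicit (a worked version of the argument preceding Remark 2, written under our reconstruction of the displayed bounds): with α_n = p_n^{−As_n},

α_n exp(C_2 s_n log p_n) = exp(−(A − C_2) s_n log p_n) → 0 whenever A > C_2,

while the prior concentration with k_0n ≤ K̄ satisfies

α_n^{k_0n} exp(−C_1 s_n log p_n) ≥ exp(−(AK̄ + C_1) s_n log p_n) = exp(−C_3 s_n log p_n) with C_3 := AK̄ + C_1,

so overestimation is ruled out without slowing down the posterior contraction rate.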

Posterior computation
In this section, we provide an MCMC algorithm for posterior computation under the prior distribution SSIBP_p(α, κ) × IG(a, b) on (B, σ²). We report computation times of the MCMC sampler for some large-scale synthetic data sets in Appendix C.1 of the supplementary material (Ohn and Kim, 2021) to show its tractability. In addition, we provide trace, autocorrelation and partial autocorrelation plots of the posterior samples of some randomly selected parameters in Appendix C.2 of the supplementary material (Ohn and Kim, 2021) to check the convergence of the generated MCMC samples.
For notational simplicity, let K* := K_+(B) be the number of nonzero columns of the loading matrix. The detailed algorithm is as follows.

• For posterior sampling of the factor loading β_jk, we use the scale-mixture representation of the Laplace distribution: if β_jk | τ_jk ∼ N(0, τ_jk) and τ_jk ∼ Exp(1/2), then marginally β_jk ∼ Laplace(1), where Exp(1/2) denotes the exponential distribution with mean 2. In the MCMC algorithm, we introduce auxiliary scale parameters τ_jk for j ∈ [p] and k ∈ [K*] so that β_jk can be generated easily from its conditional posterior. For j ∈ [p] and k ∈ [K*], the factor loading β_jk is sampled from this conditional posterior distribution.

• For j ∈ [p] and k ∈ [K*], the auxiliary scale parameter τ_jk is sampled from GIG(1, β_jk², 1/2), where GIG(a, b, p) denotes the generalized inverse Gaussian (GIG) distribution with density f(z) ∝ z^{p−1} e^{−(az+b/z)/2}.

• For j ∈ [p], the indicator parameters (ξ_jk : k ∈ N) are sampled as follows. For k ∈ [K*], ξ_jk is sampled from its conditional posterior; proposals for new columns are accepted with an acceptance probability involving M_j := σ^{−2} β*_j (β*_j)ᵀ + I and a vector u_j.

• For i ∈ [n], the latent variable Z_i is sampled from its conditional posterior.

• The noise variance parameter σ² is sampled from its conditional posterior.
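To illustrate the scale-mixture step above, here is a minimal sketch (ours) of the τ_jk update; the GIG full conditional is derived from the stated mixture representation, and the mapping to scipy's geninvgauss parameterization is our assumption to make the sketch runnable.

```python
import numpy as np
from scipy.stats import geninvgauss

def sample_gig(p, a, b, rng):
    """Sample GIG with density f(z) ∝ z^(p-1) exp(-(a z + b / z) / 2),
    the parameterization used in the text, via scipy's geninvgauss
    (density ∝ x^(p-1) exp(-c (x + 1/x) / 2)), after rescaling."""
    return geninvgauss.rvs(p, np.sqrt(a * b), scale=np.sqrt(b / a), random_state=rng)

def update_tau(beta, rng):
    """Data-augmentation update for the auxiliary scales: under
    beta_jk | tau_jk ~ N(0, tau_jk) and tau_jk ~ Exp(1/2), the full
    conditional of tau_jk is GIG with a = 1, b = beta_jk^2, p = 1/2."""
    b = np.maximum(beta ** 2, 1e-12)  # guard against beta_jk exactly zero
    return sample_gig(0.5, 1.0, b, rng)

rng = np.random.default_rng(0)
beta = rng.laplace(size=(5, 3))  # toy current values of the nonzero loadings
tau = update_tau(beta, rng)      # one Gibbs step for all tau_jk at once
print(tau.shape)                 # (5, 3)
```

The derivation is immediate: multiplying the N(0, τ) likelihood and the Exp(1/2) prior gives a conditional density proportional to τ^{1/2−1} exp(−(τ + β_jk²/τ)/2), which is the stated GIG.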

Simulation study
We conduct numerical experiments to illustrate our theoretical findings. The posterior distribution is approximated by the MCMC algorithm presented in Section 4. For each posterior computation, we run the MCMC sampler for 3,000 draws, discard the first 500 draws as burn-in, and save every 10th sample thereafter.

Choice of hyperparameters
We compare the concentration of the posterior distributions of the factor dimensionality and the covariance matrix under various choices of the hyperparameters α_n and κ_n by simulation.
We let the sample size n vary over {100, 200, 300}. For each sample size n, we set the dimension of the data to p_n = 10n (i.e., p_n = 1000, 2000, 3000) and the column-wise maximum sparsity to s_n = 3⌊log p_n⌋ (i.e., s_n = 18, 21, 24). We consider bounded factor dimensionality cases where the true factor dimensionality k_0n is either 1 or 5. For given s_n and k_0n, the p_n × k_0n true loading matrix B_0n is generated by randomly selecting s_n nonzero elements in each column and then sampling their values from {−2, 2} at random. The noise variance σ²_0n is set to 1. Simulated data are generated from the Gaussian distribution with mean 0 and covariance matrix Σ_0n := B_0n B_0nᵀ + I.
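For concreteness, a minimal sketch of this data-generating mechanism (parameter values from the text; the function name and code organization are ours):

```python
import numpy as np

def make_data(n, p, s, k0, sigma2=1.0, rng=None):
    """Generate Y_1:n ~ N(0, B0 B0^T + sigma2 I) with s nonzero entries
    per column of B0, the nonzero values drawn at random from {-2, 2}."""
    rng = rng or np.random.default_rng()
    B0 = np.zeros((p, k0))
    for k in range(k0):
        rows = rng.choice(p, size=s, replace=False)    # random column support
        B0[rows, k] = rng.choice([-2.0, 2.0], size=s)  # nonzero loading values
    Z = rng.standard_normal((n, k0))                   # latent factors
    eps = np.sqrt(sigma2) * rng.standard_normal((n, p))
    Y = Z @ B0.T + eps   # equivalent to sampling N(0, B0 B0^T + sigma2 I)
    return Y, B0

Y, B0 = make_data(n=100, p=1000, s=18, k0=5, rng=np.random.default_rng(0))
```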
We consider two choices of α_n. The first is α_n = p_n^{-1}, which is used by Rockova and George (2016), and the other is α_n = p_n^{-s_n}, motivated by our theoretical result in Theorem 3.3. For κ_n, we consider two choices: κ_n = 0, which corresponds to the one-parameter IBP, and κ_n = p_n^{1+δ} with δ = 0.1, which is the prior considered in this paper.
We investigate two quantities: the posterior probability of correctly selecting the factor dimensionality (i.e., Π_n(K_+(B) = k_0n | Y_1:n)) and the scaled spectral norm loss of the covariance matrix (i.e., ‖Σ̂ − Σ_0n‖ / ‖Σ_0n‖, where Σ̂ denotes the posterior mean of the covariance matrix). We generate 100 independent data sets, yielding 100 measurements of these two quantities, and report their averages in Tables 1 and 2, respectively. Table 1 presents the averaged posterior probability of the factor dimensionality being equal to the true one for the four combinations of the hyperparameters α_n and κ_n. The results indicate that the choices of both α_n and κ_n matter for the posterior consistency of the factor dimensionality, but the choice of α_n is more important. For the choice α_n = p_n^{-1}, we found that the posterior distribution puts most of its mass on values of the factor dimensionality larger than the true one. Table 2 presents the averaged scaled spectral norm losses for covariance matrix estimation. The results are similar across all choices of the hyperparameters, which supports our theoretical result that the posterior contraction rates are the same for the four choices.

Comparison with other methods
In this simulation, we compare the finite-sample performance of the SSIBP prior with other methods. We use the hyperparameters α_n = p_n^{-25} and κ_n = p_n^{1.1} for the SSIBP prior. For factor dimensionality estimation, we use the posterior mode as the point estimator, and for covariance matrix estimation, we take the posterior mean as the point estimator.
For synthetic data generation, we consider the following two challenging scenarios.
Case 1 (Low sample size) Generate data with sample size n = 50 and dimension p_n = 1000. For the true covariance matrix, we consider two values of the column-wise maximum sparsity, s_n ∈ {5, 20}, and three values of the factor dimensionality, k_0n ∈ {1, 3, 5}. As in Section 5.1, nonzero loading values are sampled from {−2, 2} and the noise variance is set to 1.
Case 2 (Large noise variance) Generate data with a larger sample size n = 150 and dimension p_n = 1000, but with the noise variance set to 3, three times larger than that in Case 1. The sparsity and true factor dimensionality values are the same as in Case 1.
For each setup, we repeat the data generation 100 times and compute the point estimators of the factor dimensionality and the covariance matrix by the SSIBP prior and the competing methods for each synthetic data set. For factor dimensionality estimation, we consider the following three competitors:

• Eigenvalue threshold estimator (ET, Onatski (2010)): a frequentist estimator based on thresholding the eigenvalues of the empirical covariance matrix.

• Eigenvalue ratio estimator (ER, Ahn and Horenstein (2013); Lam and Yao (2012)): a frequentist estimator based on the ratios of consecutive eigenvalues of the empirical covariance matrix.

• Multiplicative gamma process shrinkage prior (MGPS, Bhattacharya and Dunson (2011)): the posterior mode of the sparse Bayesian factor model with the multiplicative gamma process shrinkage prior.
For covariance matrix estimation, we consider the following three competitors: the principal orthogonal complement thresholding method (POET, Fan et al. (2013)), the variational inference method for Bayesian sparse PCA (VI, Ning (2021)), and the MGPS (Bhattacharya and Dunson, 2011). For POET and VI, a user must select the factor dimensionality a priori; we supply the true factor dimensionality. We use the posterior mean of the covariance matrix as the point estimator for the MGPS.
The simulation results for factor dimensionality estimation in Cases 1 and 2 are summarized in Tables 3 and 4, respectively. The SSIBP prior performs very well in all the setups. The frequentist method ET also performs well in general, but not in the setup with s_n = 5, k_0n = 5 and σ²_0n = 3. The frequentist method ER tends to underestimate the factor dimensionality in the extremely sparse setups (i.e., s_n = 5) in both Cases 1 and 2. The Bayesian method MGPS is suboptimal in all the setups.

Table 3: Proportions of the estimated factor dimensionality over 100 simulations for Case 1. "True", "Over" and "Under" denote the proportions of correct estimation, overestimation and underestimation, respectively. "Ave" is the average of the estimated factor dimensionality.

Tables 5 and 6 present the simulation results for covariance matrix estimation in Cases 1 and 2, respectively. For each method, we report the average over 100 simulations of the scaled spectral norm loss ‖Σ̂ − Σ_0‖ / ‖Σ_0‖ for the true covariance matrix Σ_0 and the estimate Σ̂. The results show that the SSIBP prior outperforms the other competing methods. The VI method behaves reasonably well, but its overall performance is inferior to that of the SSIBP prior.

Real data analysis
In this section, we analyze the big five personality traits (Goldberg, 1990) data set, which we obtained from https://www.kaggle.com/tunguz/big-five-personality-test. This data set contains answers to 50 questions about personality on a five-level Likert scale. The 50 questions can be grouped into five groups representing the five personality traits of extraversion, agreeableness, openness, conscientiousness, and neuroticism. We therefore expect the true factor dimensionality to be 5, with each element of the factor representing one of these five personality traits. The total number of responses is 1,015,342, of which 874,434 are complete.
We repeat the following procedure 1,000 times. We first randomly select 200 responses and center them so that each question has mean 0. We then estimate the factor dimensionality by the four methods considered in Section 5.2, i.e., the eigenvalue threshold estimator (ET, Onatski (2010)), the eigenvalue ratio estimator (ER, Ahn and Horenstein (2013); Lam and Yao (2012)), the multiplicative gamma process shrinkage prior (MGPS, Bhattacharya and Dunson (2011)) and the SSIBP prior. The hyperparameters of the SSIBP prior are set to α_n = p_n^{-20} and κ_n = p_n^{1.1}. As in Section 5.2, posterior modes are used as the point estimators of the factor dimensionality for the Bayesian methods.
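A schematic of this resampling protocol is sketched below. The file name and column layout of the downloaded data are assumptions on our part, and only the ER estimator is inlined, since it is simple enough to state in a few lines; the other three estimators would be plugged in the same way.

```python
import numpy as np
import pandas as pd

def er_estimate(Y, k_max=15):
    """Eigenvalue-ratio estimate (in the spirit of Ahn and Horenstein, 2013):
    maximize the ratio of consecutive sample-covariance eigenvalues."""
    lam = np.linalg.eigvalsh(np.cov(Y.T))[::-1][: k_max + 1]
    return int(np.argmax(lam[:-1] / lam[1:]) + 1)

# Hypothetical local copy of the Kaggle data; the column layout is assumed.
answers = (pd.read_csv("big_five_responses.csv")
             .dropna()         # keep the complete responses
             .iloc[:, :50]     # the 50 Likert-scale questions
             .to_numpy(dtype=float))

rng = np.random.default_rng(0)
estimates = []
for _ in range(1000):
    idx = rng.choice(len(answers), size=200, replace=False)
    Y = answers[idx]
    Y = Y - Y.mean(axis=0)            # center each question at mean 0
    estimates.append(er_estimate(Y))  # ER shown; ET/MGPS/SSIBP analogous

print(np.bincount(estimates))         # distribution of the 1,000 estimates
```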
The estimation results for the factor dimensionality by the four methods are summarized in Figure 1. In view of the prior knowledge that the true factor dimensionality is 5, the ET and SSIBP prior perform better than the other two methods. The ER method has a large proportion of underestimation, and the MGPS always yields factor dimensionality estimates larger than 5. In Figure 2, we compare the estimated loading matrices of the two Bayesian methods. In the figure, we plot the posterior mean of the loading matrix given that the factor dimensionality is equal to the posterior mode, that is, E(B | K_+(B) = k̂, Y_1:n), where k̂ = argmax_{k∈N} Π_n(K_+(B) = k | Y_1:n). Recall that the 50 questions in the data set are grouped so that the 10 questions in each group correspond to one personality trait. The SSIBP prior captures this structure quite well, while the MGPS prior does not recover it.

Concluding remarks
In this paper, we proposed a computationally tractable prior with desirable large sample properties. The proposed prior achieves the posterior consistency of both the factor dimensionality and the induced covariance matrix. We also derived the posterior contraction rate of the covariance matrix, which is near-optimal. Moreover, if the true factor dimensionality is bounded, we obtained the posterior consistency of the factor dimensionality and the optimal convergence rate of the covariance matrix simultaneously. In addition, we provided an MCMC algorithm and showed that it works well on (moderately) large data sets.
Our MCMC algorithm is applicable without much difficulty to moderately large data but would be difficult to use for very large data. Computationally more efficient inferential methods, such as variational inference for big data, are worth pursuing. Ning (2021) proposed a variational inference algorithm for the sparse Bayesian factor model. However, the algorithm of Ning (2021) does not infer the factor dimensionality; rather, it fixes the factor dimensionality at a user-specified value. Designing a variational inference algorithm that can infer the factor dimensionality would be an interesting direction for future work.

Figure 1: Distributions of the factor dimensionality estimates over 1,000 data sets of size 200 randomly selected from the big five personality traits data.
The selection consistency of the entries of the loading matrix is also an important problem. Since our prior is devised for selecting the factor dimensionality, it may not achieve entry-level selection consistency. Once the true factor dimensionality is known, the problem is similar to the high-dimensional regression problem, except that the latent factor is random and unobservable. Thus, techniques for proving the selection consistency of high-dimensional regression models, such as those of Castillo et al. (2015), may not be directly applicable. Checking the entry-level selection consistency of our prior, or developing a prior that achieves it, would require completely new research.
The proposed prior distribution is nonadaptive in the sense that the true sparsity level s_n should be known in advance to choose the hyperparameters optimally. In practice, the hyperparameters could be selected by maximum marginal likelihood (Rousseau and Szabo, 2017), but this would be computationally demanding and is not theoretically guaranteed. It would be an interesting future work to develop a computationally tractable prior that is adaptive to the true sparsity level s_n.
In an ultra high-dimensional setup in which the dimension p_n grows exponentially in n, the number of nonzero rows s_n of the loading matrix should be of order log p_n. This order would be very small compared to the dimension p_n, because (p_n − log p_n)-many entries of the random vector Y would then be pure noise. To resolve this problem, the G-block covariance model of Bunea et al. (2020) could be an alternative, but there is no Bayesian inference method with theoretical guarantees for it. We leave this problem as future work.