Bayesian Sparse Spiked Covariance Model with a Continuous Matrix Shrinkage Prior ∗

. We propose a Bayesian methodology for estimating spiked covariance matrices with a jointly sparse structure in high dimensions. The spiked covariance matrix is reparameterized in terms of the latent factor model, where the loading matrix is equipped with a novel matrix spike-and-slab LASSO prior, which is a continuous shrinkage prior for modeling jointly sparse matrices. We establish the rate-optimal posterior contraction for the covariance matrix with respect to the spectral norm as well as that for the principal subspace with respect to the projection spectral norm loss. We also study the posterior contraction rate of the principal subspace with respect to the two-to-inﬁnity norm loss, a novel loss function measuring the distance between subspaces that is able to capture entrywise eigenvector perturbations. We show that the posterior contraction rate with respect to the two-to-inﬁnity norm loss is tighter than that with respect to the routinely used projection spectral norm loss under certain low-rank and bounded coherence conditions. In addition, a point estimator for the principal subspace is proposed with the rate-optimal risk bound with respect to the projection spectral norm loss. The numerical performance of the proposed methodology is assessed through synthetic examples and the analysis of a real-world face data example. 62H25, 62C10; secondary 62H12.


Introduction
In contemporary statistics, datasets are typically collected with high-dimensionality, where the dimension p can be significantly larger than the sample size n. For example, in genomics studies, the number of genes is typically much larger than the number of subjects (The Cancer Genome Atlas Network et al., 2012). In computer vision, the number of pixels in each image can be comparable to or exceed the number of images when the resolution of these images is relatively high (Georghiades et al., 2001;Lee et al., 2005). When dealing with such high-dimensional datasets, covariance matrix estimation plays a central role in understanding the complex structure of the data and has received significant attention in various contexts, including latent factor models (Geweke and The literature on sparse spiked covariance matrix estimation in high-dimensions from a frequentist perspective is quite rich. In Johnstone and Lu (2009), it is shown that the classical principal component analysis can fail when p n. In Cai et al. (2013) and Vu and Lei (2013), the minimax estimation of the principal subspace (i.e., the linear subspace spanned by the eigenvector matrix U) with respect to the projection Frobenius norm loss under various sparse structures on U is considered. Under the joint sparsity assumption, Cai et al. (2015) further provide minimax estimation procedures of the principal subspace with respect to the projection spectral norm loss.
In contrast, there is comparatively limited literature on Bayesian estimation of sparse spiked covariance matrices that provides theoretical guarantees. Recently, Pati et al. (2014), Gao and Zhou (2015), and Ning (2021) explore the posterior contraction rates for Bayesian estimation of sparse spiked covariance models. In particular, in Pati et al. (2014), the authors discuss the posterior contraction behavior of the covariance matrix Σ with respect to the spectral norm loss under the Dirichlet-Laplace shrinkage prior (Bhattacharya et al., 2015), but the contraction rates are sub-optimal when the number of spikes r grows with the sample size; In Gao and Zhou (2015), the authors propose a carefully designed prior on U that yields the rate-optimal posterior contraction of the principal subspace with respect to the projection Frobenius norm loss, but their approach is computationally intractable except for the posterior mean as a point estimator. Some efficient computation algorithms are developed in Ning (2021) with a spike-and-slab prior, including the expectation-maximization algorithm and the variational inference algorithm. Ning (2021) also discuss the contraction rate of the exact posterior and the variational posterior.
We propose a continuous matrix shrinkage prior, referred to as the matrix spikeand-slab LASSO prior, to model the joint sparsity of the eigenvector matrix U of the spiked covariance matrix. The matrix spike-and-slab LASSO prior is a novel continuous shrinkage prior that generalizes the classical spike-and-slab LASSO prior for vectors (Rockova, 2018;Rockova and George, 2018) to jointly sparse rectangular matrices. The major contribution of this work is two-fold: Firstly, we establish the rate-optimal posterior contraction for the entire covariance matrix Σ with respect to the spectral norm loss as well as that for the principal subspace with respect to the projection spectral norm loss; Secondly, we also focus on the two-to-infinity norm loss, a novel loss function measuring the entrywise behavior between linear subspaces. This loss function can detect entrywise perturbations of the eigenvector matrix U spanning the principal subspace. Under certain low-rank and bounded coherence conditions on U, we obtain a tighter posterior contraction rate for the principal subspace with respect to the two-toinfinity norm loss than that with respect to the projection spectral norm loss. Besides the contraction of the full posterior distribution, the Bayes procedure also leads to a point estimator for the principal subspace with a rate-optimal risk bound.
The rest of the paper is organized as follows. In Section 2, we briefly review the sparse spiked covariance models, introduce the relevant loss functions, and propose the matrix spike-and-slab LASSO prior. Section 3 elaborates on our theoretical contributions of the matrix spike-and-slab LASSO prior and the posterior contraction rates under various loss functions. The numerical performance of the proposed methodology is presented in Section 4 through synthetic examples and the analysis of a real-world computer vision dataset. Further discussion is included in Section 5.
Notations Let p and r be positive integers. We adopt the shorthand notation [p] = {1, . . . , p}. For any finite set S, we use |S| to denote the cardinality of S. The symbols and mean the inequality up to a universal constant, i.e., a b (a b, resp.) if a ≤ Cb (a ≥ Cb) for some absolute constant C > 0. We write a b if a b and a b. The p × r zero matrix is denoted by 0 p×r and the p-dimensional zero column vector is denoted by 0 p . When the dimension is clear, the zero matrix is simply denoted by 0. The p × p identity matrix is denoted by I p and the subscript p is sometimes omitted when the dimension is clear. An orthonormal r-frame in R p is a p × r matrix U with orthonormal columns, i.e., U T U = I r×r . The set of all orthonormal r-frames in R p is denoted by O(p, r). When p = r, we write O(r) = O(r, r). For a p-dimensional vector x ∈ R p , we use x j to denote its jth component, x 1 = p j=1 |x j | to denote its 1norm, x 2 to denote its 2 -norm, and x ∞ = max j∈ [p] |x j | to denote its ∞ -norm. For a symmetric square matrix Σ ∈ R p×p , we use λ k (Σ) to denote the kth-largest eigenvalue of Σ. For a matrix A ∈ R p×r , we use A j * to denote the row vector formed by the jth row of A, A * k to denote the column vector formed by the kth column of A, the lower case letter a ij to denote the (i, j)-th element of A, A F = p j=1 r k=1 a 2 jk to denote the Frobenius norm of A, A 2 = λ 1 (A T A) to denote the spectral norm of A, A 2→∞ = max x 2=1 Ax ∞ to denote the two-to-infinity norm of A, and A ∞ = max x ∞=1 Ax ∞ to denote the (matrix) infinity norm of A. We remark that the matrix infinity norm can be equivalently written as A ∞ = max j∈ [p] r k=1 |a jk |, which differs from the maximum absolute value of the entries of A. The prior and posterior distributions appearing in this paper are denoted by Π. The density of Π with respect to the underlying sigma-finite measure is denoted by π.
2 Sparse Bayesian spiked covariance models

Background and loss functions
In the spiked covariance model (1.1), the matrix Σ has the form Σ = UΛU T +σ 2 I p . We focus on the case where the leading r eigenvectors of Σ (the columns of U) are jointly sparse (Cai et al., 2015;Vu and Lei, 2013). Formally, the row support of U is defined as supp(U) = j ∈ [p] : (U j * ) T = 0 r . We say U is jointly s-sparse if |supp(U)| ≤ s. Heuristically, this assumption asserts that the signal comes from at most s features among all p features. Geometrically, the joint sparsity has the interpretation that at most s coordinates of y i generate the subspace Span{U * 1 , . . . , U * r } (Vu and Lei, 2013). We also remark that s ≥ r due to the orthonormal constraint on the columns of U.
This paper studies a Bayesian approach for estimating the covariance matrix Σ. We quantify how well the proposed methodology can estimate the entire covariance matrix Σ and the principal subspace Span{U * 1 , · · · , U * r } in a high-dimensional and jointly sparse setup. Leaving the Bayesian methodology for a moment, we first introduce some necessary background and the related loss functions for the sparse spiked covariance models in general. Throughout the paper, we write Σ 0 = U 0 Λ 0 U T 0 +σ 0 I p to be the true covariance matrix that generates the data Y = [y 1 , . . . , y n ] T from the p-dimensional multivariate Gaussian distribution N p (0 p , Σ 0 ), where Λ 0 = diag(λ 01 , · · · , λ 0r ), λ 01 ≥ . . . ≥ λ 0r > 0, and σ 2 0 > 0. The parameter space for Σ is When (s log p)/n → 0 and λ 01 , λ 0r are bounded away from 0 and +∞, Cai et al. (2015) establish the minimax rate of convergence for estimating Σ ∈ Θ(p, r, s): For the estimation of the principal subspace Span{U * 1 , . . . , U * r }, a standard loss function is the projection spectral norm loss U U T − U 0 U T 0 2 , which is equivalent to the spectral sine-theta distance between subspaces (Stewart and Sun, 1990). The corresponding minimax rate of convergence for UU T is given by Cai et al. (2015): Motivated by the recent papers Cape et al. (2019a) and Cape et al. (2019b), which presents a collection of technical tools for the analysis of entrywise eigenvector perturbation bounds, we also focus on the following two-to-infinity norm loss for estimating Span{U * 1 , . . . , U * r } in addition to the projection spectral norm loss, where W U = arg inf W∈O(r) U − U 0 W F . Here, W U corresponds to the orthogonal alignment of U 0 so that U and U 0 W U are close in the Frobenius norm sense. As pointed out in Cape et al. (2019b), the use of W U as the orthogonal alignment matrix is preferred over the alignment matrix is not analytically computable in general, whereas W U can be explicitly computed (Stewart and Sun, 1990). Specifically, if U T 0 U admits the singular value decomposition The following lemma formalizes the connection between the projection spectral norm loss and the two-to-infinity norm loss. Lemma 2.1. Let U and U 0 be two orthonormal r-frames in R p , where 2r < p. Then there exists an orthonormal 2r-frame V U in R p depending on U and U 0 , such that where W U = arg inf W∈O(r) U−U 0 W F is the Frobenius orthogonal alignment matrix.
When the projection spectral norm loss UU T − U 0 U T 0 2 is much smaller than one, Lemma 2.1 states that the two-to-infinity norm loss U − U 0 W U 2→∞ can be upper bounded by the product of the projection spectral norm loss and V U 2→∞ , where V U ∈ O(p, 2r) is an orthonormal 2r-frame in R p . In particular, under the sparse spiked covariance models in high dimensions, the number of spikes r can be much smaller than the dimension p (i.e., V U is a "tall and thin" rectangular matrix), and hence the factor V U 2→∞ can be much smaller than max V∈O(p,2r) V 2 = 1.

A continuous matrix shrinkage prior for joint sparsity
The recent decade has witnessed the development of a collection of continuous shrinkage priors that introduce sparse structures in various statistical contexts. For an incomplete list of references, see Bhattacharya et al. (2015), Pati et al. (2014), Carvalho et al. (2010), Rockova and George (2018), Rockova and George (2016), Rockova (2018), Shin et al. (2018, and Shin et al. (2020). In this section, we first illustrate the general Bayesian strategies in modeling the sparsity occurring in high-dimensional models and then elaborate on the proposed prior model. Consider a simple yet canonical sparse normal mean problem. Suppose we observe independent normal data y i ∼ N(β i , 1), i = 1, . . . , n and want to estimate the mean vector β n = (β i ) n i=1 , which is assumed to be sparse in the sense that n i=1 1(|β i | = 0) ≤ s n with the sparsity level s n = o(n) as n → ∞. To model the sparsity of β, classical Bayesian methods impose the following spike-and-slab prior on β: for any measurable set A ⊂ R, where ξ i is the indicator that β i = 0, θ ∈ (0, 1) represents the prior probability of β i being non-zero, δ 0 is the point-mass at 0 (called the "spike" distribution), and ψ(· | λ) is the density of an absolutely continuous distribution (called the "slab" distribution) with respect to the Lebesgue measure on R governed by some hyperparameter λ. Theoretical justifications for the use of the spike-and-slab prior (2.4) for the sparse normal means and the sparse Bayesian factor models have been established in Castillo and van der Vaart (2012) and Pati et al. (2014), respectively. Therein, the spike-and-slab prior (2.4) involves point-mass mixtures, which can be daunting in terms of the posterior simulations (Pati et al., 2014). To address this issue, the spike-and-slab LASSO prior (Rockova, 2018) is designed as a continuous relaxation of (2.4): where ψ(β | λ) = (λ/2) exp(−λ|β|) is the Laplace distribution with mean 0 and variance 2/λ 2 . When λ 0 λ, the spike-and-slab LASSO prior (2.5) closely resembles the spikeand-slab prior (2.4). The continuity feature of the spike-and-slab LASSO prior (2.5), in contrast to the classical spike-and-slab prior (2.4), is highly desired in high-dimensional settings in terms of the computational efficiency.
Motivated by the spike-and-slab LASSO prior, we propose a continuous matrix shrinkage prior to model the joint sparsity in the sparse spiked covariance models (1.1). The orthonormal constraint on the columns of U makes it challenging to incorporate prior distributions. Instead, we consider the following reparameterization of Σ: where B = UΛ 1/2 V T ∈ R p×r , and V ∈ O(r) is an arbitrary orthogonal matrix. Clearly, in contrast to the orthonormal constraint on U, there is no constraint on B except that rank(B) = r. Furthermore, B inherits the joint sparsity from U: For |supp(U)| = s ≥ r, there exists some permutation matrix P ∈ R p×p and U ∈ O(s, r), such that It follows directly that implying that |supp(B)| ≤ s. Therefore, working with B allows us to circumvent the orthonormal constraint while maintaining the jointly sparse structure of U.
We are now in a position to formalize the proposed continuous matrix shrinkage prior on B = [b jk ] p×r . Here we assume that the rank r of UΛU T is known. When r is unknown, one can apply the diagonal thresholding method in Cai et al. (2013) to estimate r consistently. For any α, λ > 0, let ψ α (x | λ) be the density function of the following double Gamma distribution: For each j ∈ [p], we assign the following hierarchical prior distribution on B 1 * , . . . , B p * : , and κ, λ > 0 are tuning parameters. Clearly, with probability 1 − θ, each row B j * follows a multivariate distribution with density r k=1 ψ r (b jk | λ + λ 0 ), and hence, B j * 1 ∼ Exp(λ + λ 0 ) for each j with probability 1 − θ. The exponential distribution Exp(λ + λ 0 ) with a large λ 0 shrinks the entire row B j * towards zero and therefore promotes the joint sparsity on B.
We refer to the hierarchical prior (2.7) on B as the matrix spike-and-slab LASSO prior and denote B ∼ MSSL p×r (λ, 1/p 2 , p 1+κ ). The hyperparameter λ is fixed throughout. In the single-spike case (r = 1), we observe that ψ 1 (b jk | λ) = (λ/2) exp(−λ|b jk |) reduces to the density function of the Laplace distribution, and hence the matrix spikeand-slab LASSO prior coincides with the spike-and-slab LASSO prior (Rockova, 2018). Here, 1 − θ represents the proportion of the rows B j * 's that are close to zero. Therefore, θ ∼ Beta(1, p 1+κ ) indicates that the matrix spike-and-slab LASSO prior favors a large proportion of rows of B being close to 0. We also remark that the first term r k=1 ψ r (b jk | λ + λ 0 ) closely resembles the "spike" component in the spike-and-slab distribution (2.4), whereas the second term r k=1 ψ 1 (b jk | λ) is a multivariate generalization of the "slab" component in (2.4). We complete the prior specification by imposing σ 2 ∼ IG(a σ , b σ ) for some a σ , b σ > 0 for the sake of conjugacy.
Lastly, we remark that the parameterization (2.6) of the spiked covariance models (1.1) has the following interpretation. The sampling model y i ∼ N p (0 p , Σ) can be equivalently represented in terms of the latent factor model where z i , i = 1, . . . , n, are independent and identically distributed (i.i.d.) r-dimensional latent factors, B is a p × r factor loading matrix, and ε i , i = 1, . . . , n are i.i.d. noisy vectors. Since B is also sparse by our earlier discussion, this formulation is related to the sparse Bayesian factor models presented in Bhattacharya and Dunson (2011) and Pati et al. (2014), the differences being the sparsity pattern of B and the prior specification.
In addition, the latent factor formulation (2.8) is convenient for the posterior simulation through a Markov chain Monte Carlo (MCMC) sampler, as discussed in Section 3.1 of Bhattacharya and Dunson (2011).

Theoretical properties 3.1 Properties of the matrix spike-and-slab LASSO prior
The theoretical properties of the classical spike-and-slab LASSO prior have been partially explored by Rockova (2018) and Rockova and George (2018) in the context of the sparse linear models and the sparse normal means problems, respectively. It is not clear whether the properties of the spike-and-slab LASSO priors adapt to other statistical contexts, including the sparse spiked covariance matrix models, the high-dimensional multivariate regression (Bai and Ghosh, 2018), etc. In this subsection, we present a collection of theoretical properties of the matrix spike-and-slab LASSO prior that not only are useful for deriving posterior contraction under the spiked covariance matrix models but also may be of independent interest for other statistical tasks, e.g., sparse Bayesian linear regression with multivariate response (Ning and Ghosal, 2018).
Let B ∈ R p×r be a p × r matrix and let B 0 ∈ R p×r be a jointly s-sparse p × r matrix with r ≤ s ≤ p. Here B 0 corresponds to the underlying truth. In the sparse spiked covariance matrix models, B represents the scaled eigenvector matrix UΛ 1/2 up to an orthonormal matrix in O(r), but for the sake of generality, we do not impose the statistical context in this subsection. A fundamental measure of goodness for various prior models with high dimensionality is the prior mass assignment on a small neighborhood around the true but unknown value of the parameter. This is referred to as the prior concentration in the literature of Bayes theory. Formally, we consider the prior probability of the non-centered ball { B − B 0 F < η} under the prior distribution for small values of η.
Lemma 3.1. Suppose B ∼ MSSL p×r (λ, 1/p 2 , p 1+κ ) for some fixed positive constants λ and κ, and B 0 ∈ R p×r is jointly s-sparse, where 1 ≤ r ≤ s ≤ p/2. Then for small values of η ∈ (0, 1) with η ≥ 1/p γ for some γ > 0, it holds that We next formally characterize how the matrix spike-and-slab LASSO prior promotes the joint sparsity of B using a probabilistic argument. Unlike the classical spike-andslab prior, which enables exact zeros in the mean vector with a positive probability, the matrix spike-and-slab LASSO prior is absolutely continuous with respect to the Lebesgue measure on R p×r , implying that |supp(B)| = p with prior probability one. Rather than forcing the elements of B to be exactly zero, the matrix spike-and-slab LASSO prior shrinks the elements of B toward zero. This behavior suggests the following generalization of the row support of a matrix B: For δ > 0 taken to be small, we define supp δ (B) = {j ∈ [p] : B j * 2 > δ}. Namely, supp δ (B) consists of the row indices of B whose Euclidean norms are greater than δ. Intuitively, one should expect that under the matrix spike-and-slab LASSO prior, |supp δ (B)| should be small with large probability. The following lemma formally confirms this intuition.

Posterior contraction for the sparse Bayesian spiked covariance model
We now elaborate on the posterior contraction rates for the proposed Bayesian sparse spiked covariance models with respect to various loss functions, which are the main results of this paper. We first present a collection of necessary assumptions.
Assumption 3.4 (Low-rank assumption). r log n log p.
Remark 3.1. Several remarks concerning the assumptions above are in order. Assumptions 3.1 and 3.2 are standard for the sparse spiked covariance models. Assumption 3.3 states the high-dimensional nature of the problem (namely, a large p small n scenario) and ensures the consistency with regard to the spectral norm Σ − Σ 0 2 . Assumption 3.4 is a mild low-rank condition that also appeared in Gao and Zhou (2015) and Ning (2021). Loosely speaking, it guarantees that the posterior contraction rate under the intrinsic metric of the normal covariance model is equivalent to the spectral norm. Assumption 3.5 specifies the prior distribution of Σ through the priors of B and σ 2 .
Remark 3.2. We briefly compare the posterior contraction rates obtained in Theorem 3.1 and Theorem 3.2 with some related results in the literature. In Pati et al. (2014), the authors consider the posterior contraction with respect to the spectral norm loss Σ − Σ 0 2 of the entire covariance matrix, while in Gao and Zhou (2015), the authors consider the posterior contraction with respect to the projection Frobenius norm loss UU T − U 0 U T 0 F for estimating Span{U * 1 , . . . , U * r }. Our result is similar to Ning (2021), who derive the posterior contraction rates with a spike-and-slab prior on the entries of the loading matrix B. This differs from the current work as we adopt a continuous matrix shrinkage prior. In Pati et al. (2014), the notion of sparsity is slightly different than the joint sparsity notion presented here. They assume that under the latent factor model representation (2.8), the individual supports of columns of B are not necessarily the same. When r = O(1), the assumption in Pati et al. (2014) coincides the joint sparsity and our rate n = (s log p)/n improves the rate (s log p log n)/n obtained in Pati et al. (2014) by a logarithmic factor. The assumptions in Gao and Zhou (2015) are the same as those in Pati et al. (2014), and Gao and Zhou (2015) focus on designing a prior that yields the rate-optimal posterior contraction with respect to the Frobenius norm loss of the projection matrices as well as the adaptation to the sparsity level s and the rank r. Our result in equation (3.2), which focuses on the projection spectral norm loss, serves as a complement to the rate-optimal posterior contraction for the principal subspace under the joint sparsity assumption in contrast to Gao and Zhou (2015).
The posterior contraction rate (3.2) also leads to the following risk bound for a point estimator of the principal subspace Span{U * 1 , . . . , U * r }:

Theorem 3.3. Assume the conditions in Theorem 3.1 hold. Let
be the posterior mean of the projection matrix U B U T B and U ∈ O(p, r) be the orthonormal r-frame in R p with columns being the eigenvectors corresponding to the first r-largest eigenvalues of Ω. Let M 0 , R 0 be the constants in Theorem 3.1. Then the following risk bound holds for U for sufficiently large n: To derive the posterior contraction rate for the principal subspace with respect to the two-to-infinity norm loss, we need the posterior contraction result for Σ with respect to the stronger matrix infinity norm. This contraction rate is obtained in the following lemma.
Lemma 3.4. Assume the conditions in Theorem 3.1 hold. Further assume that the eigenvector matrix U 0 exhibits the bounded coherence: U 0 2→∞ ≤ C μ r/s for some constant C μ ≥ 1, and the number of spikes r is sufficiently small in the sense that r 3 /s = O(1). Let R 0 , C 0 be the constants in Theorem 3.1. Then there exists some constant M ∞ > 0 depending on σ 0 , λ 01 , λ 0r , and the hyperparameters, such that the following posterior contraction for Σ = BB T + σ 2 I p holds for all M ≥ M ∞ when n is sufficiently large: We are now in a position to present the posterior contraction of the principal subspace with regard to the two-to-infinity norm loss in Theorem 3.4 below.
Remark 3.3. The bounded coherence condition that U 0 2→∞ ≤ C μ r/s originates from the delocalization of eigenvectors in low-rank matrix recovery (Candès and Recht, 2009). Loosely speaking, when U 0 is a tall-and-thin rectangular matrix, the bounded coherence of U 0 states that the orthonormal columns of U 0 are different from the highly localized standard basis vectors. The bounded coherence condition is also related to the pervasive assumption appearing in the econometrics and financial applications (Fan et al., 2008;Pati et al., 2014). Specifically, U 0 represents the left singular vector matrix of the factor loading matrix B 0 := U 0 Λ 1/2 0 V T under the latent factor model representation (2.8) for some V ∈ O(r). By the random matrix theory, B 0 is pervasive when the non-zero rows of B 0 are random realizations from a bounded random vector (Fan et al., 2013). By Proposition 3 in Fan et al. (2016), U 0 satisfies the bounded coherence condition when B 0 is pervasive. In addition, the assumption that r 3 /s = O(1) also appeared in the (dense) covariance estimation problem in Cape et al. (2019b).

Remark 3.4.
We also present some remarks concerning the posterior contraction with respect to the two-to-infinity norm loss U − U 0 W U 2→∞ . Cape et al. (2019b) can be coarsely upper bounded by the projection spectral norm loss UU T − U 0 U T 0 2 . This naive bound immediately yields for some appropriately selected large constant M , which is the same as (3.2). Our result (3.4) improves this rate by a factor of { r 3 /s∨ (s log p)/n}, resulting in a tighter posterior contraction rate with respect to the two-to-infinity norm loss. In particular, when r s (i.e., U 0 is a "tall and thin" rectangular matrix), the factor r 3 /s can be much smaller than 1.

Numerical examples 4.1 Synthetic examples
We evaluate the numerical performance of the proposed Bayesian method for estimating the sparse spiked covariance models via simulation studies. We set the sample size n = 100 and the number of features p = 200. The support size s of the eigenvector matrix U 0 ranges over {8, 12, 20, 40} and the number of spikes r takes values in {1, 4}. The indices of the non-zero rows of U 0 are uniformly sampled from {1, . . . , p} and we set the diagonal elements of Λ 0 to be equally spaced over the interval [10,20], with λ 01 = 20 and λ 0r = 10 if r = 4. The non-zero rows of U 0 , themselves forming an orthonormal r-frame in R s , denoted by U 0 , are taken as the left singular vector matrix of a s × r matrix L whose entries are generated from Unif(1, 2) independently.
The posterior inference is carried out using a standard Metropolis-within-Gibbs sampler. We take the first 1000 iterations of the MCMC sampler as the burn-in phase and collect the subsequent 4000 iterations as the post-burn-in samples. We set λ = 1, a σ = b σ = 1, and κ = 1 in all numerical examples. The convergence diagnostics of the MCMC chains are provided in the Supplementary Material (Xie et al., 2022). We then take the posterior mean Σ of Σ as the point estimator for Σ, and Span( U) given by Theorem 3.3 as the point estimator for the principal subspace Span{U * 1 , . . . , U * r }.
For comparison, several competitors are considered, including the sparse Bayesian factor model with the multiplicative Gamma process shrinkage prior (MGPS, Bhattacharya and Dunson, 2011), the principal orthogonal complement thresholding method (POET, Fan et al., 2013), the sparse principal component analysis (SPCA, Zou et al., 2006), and the adaptive sparse principal component analysis (ASPCA, Cai et al., 2013). In each simulation setup (i.e., each (r, s) pair), 50 independent replicates of the synthetic datasets are generated. For each synthetic dataset, we compute the point estimators Σ, U as well as those offered by the three competing approaches, the spectral norm loss ( Σ−Σ 0 2 ), the two-to-infinity norm loss ( U−U 0 W U 2→∞ ), and the projection spectral norm loss ( U U T − U 0 U T 0 2 ). We then compute the medians of these losses across the 50 replicates. The exception here is ASPCA, which only provides a point estimator for the principal subspace rather than the whole covariance matrix. We only obtain the projection spectral norm loss and the two-to-infinity norm loss for the principal subspace for ASPCA. The results are tabulated in Table 1.
The numerical results in Tables 1(a) and 1(b) indicate that the proposed Bayesian approach yields the smallest spectral norm losses for Σ and the smallest projection spectral norm losses for the subspace estimation, respectively, among others, except for ASPCA. While ASPCA outperforms the proposed Bayesian method in terms of the projection spectral norm loss for small values of s (s ∈ {8, 12}) when r = 4, its performance deteriorates rapidly as soon as the number of non-zero rows of U 0 increases (s ∈ {20, 40}) when r = 1. In terms of the two-to-infinity norm loss for the subspace estimation, Table 1(c) shows that the point estimates U using the proposed approach yield smaller losses compared to the competitors when s ∈ {8, 12, 20} for both r = 1 and r = 4, while POET is more accurate for the single-spike cases when s = 40. The comparison between the two losses for the subspace estimation is also visualized in Figure 1, suggesting that the two-to-infinity norm loss is less sensitive to the row support size s than the projection spectral norm loss as s increases.
We further evaluate the estimation performance for the principal subspace when s = 20, r = 1 and s = 40, r = 4 through a single replicate in Figures 2, 3, and 4, respectively. For the visualization of recovering U 0 across different methods, we rotate  Table 1: Loss functions for the simulation study: The spectral norm loss Σ − Σ 0 2 , the squared projection spectral norm loss U U T − U 0 U T 0 2 2 , and the squared two-toinfinity norm loss U − U 0 W U 2 2→∞ . The medians of the loss function values across 50 replicates of synthetic datasets are tabulated. MSSL stands for the sparse Bayesian spiked covariance matrix model with the matrix spike-and-slab LASSO prior; MGPS refers the sparse Bayesian factor model with the multiplicative Gamma process shrinkage prior; POET refers to the principal orthogonal complement thresholding method; SPCA refers to the sparse principal component analysis; ASPCA refers to the adaptive sparse principal component analysis.
the estimates according to the Frobenius orthogonal alignment. To be more specific, for a point estimator U obtained using a frequentist method, we first compute the orthogonal alignment matrix W U = arg inf W ∈ O(r) U − U 0 W F and then use UW T U as the estimator for U 0 . For the Bayesian method, we compute the orthogonal alignment W U B = arg inf W∈O(r) U B − U 0 W F for each posterior sample U B and then take U B W T U B as the posterior estimate for U 0 . It can clearly be seen that POET is able to capture the signal but fails to recover the row support of the principal subspace, whereas SPCA is able to recover the subspace support but is not accurate in estimating the signal. MGPS performs similarly to POET but results in wider credible intervals than those using the proposed approach. The performance of ASPCA is satisfactory when r = 4 but it severely under-estimates the number of non-zero rows of U 0 when r = 1 and results in unsatisfactory estimates.
In addition, we report the running times of the five methods using one synthetic dataset for each setup in Table 2. The proposed MSSL is faster than the other Bayesian method, MGPS. Although POET is significantly faster than the proposed approach and MGPS when r = 4, and sparse PCA outperforms all the other methods when r = 1, we will see in Section 4.2 that when the dimension p becomes large, POET and sparse PCA fail to produce results within 20 hours. ASPCA is the fastest approach among all methods considered here. However, it is not stable when r = 1, as presented in Table 1 and Figure 2. Overall, the proposed sparse Bayesian spiked covariance model is able to estimate the signals accurately and efficiently, recover the row support of U 0 , and provide better uncertainty quantification with narrower credible intervals for the synthetic datasets.

A face data example
The joint sparsity of the eigenvector matrix U is often desired in the feature extraction for some high-dimensional data. In this subsection, we illustrate how the proposed Bayesian approach is able to extract the key features through a real data example in computer vision.
We consider a subset of the Extended Yale Face Database B (Georghiades et al., 2001;Lee et al., 2005). It consists of the face images for 38 subjects, and for each subject, 64 aligned images of size 192 × 168 are taken under different illumination conditions. Here we focus on the 22nd subject and reduce the size of each image to 96 × 84 (8064 pixels in total), following the preprocessing step in She (2017). In doing so, we obtain a data matrix Y = [y 1 , . . . , y n ] T of size 64 × 8064.
In computer vision, the principal component analysis has been broadly applied to obtain the low-dimensional features, known as the eigenfaces, from high-dimensional face image data. Under the proposed Bayesian framework, we perform the posterior inference by implementing a Metropolis-within-Gibbs sampler with 1000 burn-in iterations and 4000 post-burn-in samples. The number of spikes r is estimated using the diagonal thresholding method proposed in Cai et al. (2013). For comparison, we also implement MGPS (Bhattacharya and Dunson, 2011). Instead of obtaining the eigenfaces, we focus on the extraction of the key pixels via thresholding the obtained estimated eigenvector Figure 2: Simulation performance for a single replicate with s = 20 and r = 1. The estimates are rotated to the simulation truth U 0 according to the Frobenius orthogonal alignment. The red bars in the top two panels are estimated 95% credible intervals using the proposed approach and MGPS, respectively. matrix U. Specifically, for the proposed approach, the estimate U can be computed according to Theorem 3.3, and for MGPS, U can be obtained by computing the left singular vectors of the loading matrix. The key pixels are then obtained by finding {j ∈ [8064] : U j * 1 /r > τ } for some small tolerance τ > 0. The other three competitors are unable to produce valid results for the following reasons. Neither POET nor sparse PCA was able to produce results within 20 hours. ASPCA encountered numerical instability when we attempted to implement it and reported errors. We present the sample images of the 22nd subject in the first row of Figure 5. The key pixels of sample image #1 extracted using the two models with different threshold values of τ are provided in the second and the third rows of Figure 5. We recover the pixels with higher values (corresponding to eyes, lips, and nose tips of the subject) using the proposed model and MGPS. This observation is also in accordance with the conclusion from She (2017). Nevertheless, as the threshold value τ increases, the number of key pixels captured using MGPS decreases significantly, whereas the proposed approach is more robust to the threshold value τ and maintains the key pixels that are sensitive to illumination. This phenomenon is expected since, unlike the matrix spike-and-slab LASSO prior, MGPS is not designed to model the joint sparsity and the feature extraction but rather column-specific sparsity for each individual factor loading.
Figure 5: The face data example: The first row corresponds to sample images of the 22nd subject (image number 1, 20, and 50, respectively). The second and the third rows are the key pixels of the #1 image using the proposed Bayesian approach with the matrix spike-and-slab LASSO prior (MSSL) and MGPS with different threshold values τ .

Discussion
We have shown that the two-to-infinity norm loss for the principal subspace estimation can capture the entrywise perturbations of the eigenvector matrix U in contrast to the routinely used projection spectral norm loss. A novel matrix shrinkage prior that extends the continuous spike-and-slab LASSO due to Rockova and George (2018) and Rockova (2018) has been developed. We have obtained the contraction rate of the full posterior distribution for the principal subspace under the two-to-infinity norm loss, which is sharper than the rate under the usual projection spectral norm loss, provided that U exhibits certain low-rank and bounded coherence conditions.
In future work, we intend to study whether a point estimator can be found from the posterior distribution with a risk bound that coincides with the posterior contraction rate under the two-to-infinity norm loss. In addition, it is also worth exploring the minimax-optimal rates of convergence with respect to the two-to-infinity norm loss. Throughout the paper, we assume that the number of spikes r is known. When r is unknown, a convenient approach is to estimate r using a frequentist method (e.g., the diagonal thresholding method as in Cai et al., 2013) first and then apply our Bayesian method using the estimated r. Alternatively, it is feasible to adaptively estimate r in the literature of Bayesian latent factor models (see, for example, Bhattacharya and Dunson, 2011;Gao and Zhou, 2015;Pati et al., 2014). Hence, exploring a rank-adaptive Bayesian procedure and obtain attractive theoretical properties or computation tractability could be interesting extensions as well.
The low-rank assumption (Assumption 3.4) requires that r log n log p and guarantees that the minimax rate for estimating the covariance matrix Σ under the Frobenius norm coincides with that under the spectral norm. More precisely, the minimax rate with regard to Σ − Σ 0 F is (rs + s log p)/n, whereas the minimax rate under Σ − Σ 0 2 is (s log p)/n and does not depend on the rank r. When Assumption 3.4 is violated, the two rates differ from each other, and the proof technique adopted in this work is no longer applicable to establish the rate-optimal posterior contraction under the (nonintrinsic) spectral norm (Hoffmann et al., 2015). In the recent technical report (Xie, 2021), the author partially addressed the rate-optimal posterior contraction under the spectral norm without assuming r log n log p, but the other assumptions there were more restrictive. The posterior contraction rates under non-intrinsic metrics in general high-dimensional models remain relatively underexplored.
Markov chain Monte Carlo can be computationally intensive for high-dimensional settings in general. In this paper, we applied a standard Metropolis-within-Gibbs sampler for Bayesian computation of the sparse spiked covariance models. Inspired by Ning (2021), it would be desirable to develop computationally efficient methods, such as an expectation-maximization algorithm for the maximum a posteriori estimation instead of computing the full posterior distribution (Rockova and George, 2016) or a penalized least squares method (She, 2017), and explore the underlying theoretical guarantees.

Supplementary Material
Supplementary Material for "Bayesian Sparse Spiked Covariance Model With a Continuous Matrix Shrinkage Prior" (DOI: 10.1214/21-BA1292SUPP; .pdf). The supplementary material contains the proofs of the theoretical results, additional technical results, and additional numerical results.