Maximum Pairwise Bayes Factors for Covariance Structure Testing

Hypothesis testing of structure in covariance matrices is of significant importance, but faces great challenges in high-dimensional settings. Although consistent frequentist one-sample covariance tests have been proposed, there is a lack of simple, computationally scalable, and theoretically sound Bayesian testing methods for large covariance matrices. Motivated by this gap and by the need for tests that are powerful against sparse alternatives, we propose a novel testing framework based on the maximum pairwise Bayes factor. Our initial focus is on one-sample covariance testing; the proposed test can optimally distinguish null and alternative hypotheses in a frequentist asymptotic sense. We then propose diagonal tests and a scalable covariance graph selection procedure that are shown to be consistent. A simulation study evaluates the proposed approach relative to competitors. The performance of our graph selection method is demonstrated through applications to a sonar data set.


Introduction
Consider a sample of observations from a high-dimensional normal model in which the number of variables p can grow with the sample size n, X 1 , . . . , X n | Σ n i.i.d.
where Σ n ∈ R p×p is a covariance matrix. There is often interest in inferring the structure in Σ n and in comparing different alternative covariance structures. This article focuses on this problem from a hypothesis testing perspective. Let X n = (X 1 , . . . , X n ) T ∈ R n×p be the data matrix. A one-sample covariance test can be reduced to the following simple form: H 0 : Σ n = I p versus H 1 : Σ n = I p , by noting that H 0 : Σ n = I p is equivalent to a null hypothesis H 0 : Σ n = Σ 0 for any given positive definite matrix Σ 0 by applying the linear transformation X i → Σ −1/2 0 Another important problem is testing diagonality H 0 : σ ij = 0 for any i = j versus H 1 : not H 0 , where Σ n = (σ ij ). Finally, we consider the problem of support recovery, corresponding to estimating the nonzero elements of covariance matrices.
We are interested in constructing novel Bayesian tests that are practically applicable with theoretical guarantees for the (i) one-sample covariance test, (ii) diagonality test, and (iii) support recovery of the covariance matrix. Throughout the paper, we consider the high-dimensional setting in which the number of variables p can grow to infinity as the sample size n gets larger and possibly be much larger than n. Although it is well known that assuming a restricted covariance class is necessary for consistent estimation of large covariance matrices (Johnstone and Lu;Lee and Lee;2018), in a testing context we focus on alternative hypotheses H 1 that are unconstrained. One natural possibility is to assume a conjugate inverse-Wishart prior IW p (ν n , A n ) for Σ n under H 1 .
However, in order for the resulting posterior to be proper, it is necessary to choose the degrees of freedom ν n > p − 1, suggesting an extremely informative prior in high-dimensional settings. The resulting test will certainly be highly sensitive to the choice of A n , and hence is not very useful outside of narrow applications having substantial prior information. One could instead choose a non-conjugate prior for Σ n under H 1 , but then substantial computational issues arise in attempting to estimate the Bayes factor.
From a frequentist perspective, Chen et al. (2010) and Cai and Ma (2013) suggested consistent one-sample covariance tests based on unbiased estimators of Σ n −I p 2 F , where A F = ij a 2 ij 1/2 is the Frobenius norm of a matrix A = (a ij ). Under the null hypothesis, they showed that their test statistic is asymptotically normal. The test also has power tending to one as n goes to infinity, but it requires the condition, Σ n − I p 2 F n/p → ∞ as n → ∞. This condition implies that they essentially adopted H 1 = {Σ n : Σ n − I p 2 F ≥ b n p/n} for some b n → ∞ as n → ∞ as the alternative class. Cai and Ma (2013) proved that if we consider an alternative class H 1 = {Σ n : Σ n − I p 2 F ≥ n }, the condition n ≥ b n p/n is inevitable for any level α test to have power tending to one. This excludes cases in which a finite number of the components of Σ n − I p have a magnitude p/n, although p/n can be a significant signal when p ≥ n.
The above discussion motivates us to develop hypothesis tests that are easy to implement in practice while possessing theory guarantees. In particular, we wish to construct tests that can perform well (e.g., a consistent test) even when the condition Σ n − I p 2 F n/p → ∞ fails to hold. We achieve this by proposing a novel Bayesian testing framework based on the maximum pairwise Bayes factor (mxPBF) which will be introduced in Section 2.2. The basic strategy is to focus on the pairwise difference between Σ n and I p rather than the Frobenius norm or other matrix norms. More precisely, instead of considering a usual Bayes factor based on a prior on the whole covariance matrix, we first consider the pairwise Bayes factors for each element of the matrix and combine them by taking a maximum over all possible pairs. This approach enables us to consider a different alternative class, H 1 = {Σ n : Σ n − I p 2 max ≥ C log p/n} for some constant C > 0, where A max = max i,j |a ij | for a matrix A = (a ij ). When the primary interest is not on a collection of very weak signals, but on detecting at least one meaningful signal, our test based on mxPBF is much more effective than the frequentist methods mentioned above.
The main contributions of this paper are as follows. The mxPBF is a general theoreticallsupported method with a simple implementation that can be readily used by practitioners. mxPBF is also the first Bayesian test that has been shown to be consistent in the high-dimensional setting for the one-sample or diagonal covariance testing problems. In the one-sample case, the proposed test is rate-optimal in the sense that it can distinguish the elementwise maximum norm-based alternative class H 1 = {Σ n : Σ n − I p 2 max ≥ n } from the null with the fastest rate of n , while guaranteeing consistency under the null. We also propose a scalable graph selection method for high-dimensional covariance graph models using pairwise Bayes factors. The proposed method consistently recovers the true covariance graph structure under a weaker or comparable condition to those in the existing frequentist literature.
The rest of the paper is organized as follows. Section 2 introduces the notations and definition of the mxPBF. Section 3 presents the main results of the paper: the one-sample covariance test, diagonality test and support recovery of the covariance matrix. In Section 4 the practical performance of the proposed methods is evaluated based on simulation study and real data analysis. R codes for implements of our empirical results are available at https://github.com/leekjstat/mxPBF.
Concluding remarks are given in Section 5. Proofs of our main results are included in an Appendix, with additional results in Supplemental Materials.

Notations
For any real values a and b, we denote a ∨ b as the maximum between a and b. For any positive sequences a n and b n , we denote a n b n or a n = o(b n ) if a n /b n → 0 as n → ∞. For any vector x = (x 1 , . . . , x p ) T ∈ R p , we define the vector 1 -and 2 -norm as x 1 = p j=1 |x i | and x 2 = ( p j=1 x 2 j ) 1/2 , respectively. Let C p be the set of all p × p positive definite matrices. We denote χ 2 k (λ) as the non-central chi-square distribution with degrees of freedom k and non-centrality λ ≥ 0, and let χ 2 k = χ 2 k (λ = 0). For positive real values a and b, IG(a, b) denotes the inverse gamma distribution with shape a and scale b.

Maximum Pairwise Bayes Factor (mxPBF)
In this subsection, we introduce the mxPBF focusing on the one-sample covariance test. As described before, the basic strategy is to concentrate on the pairwise difference between Σ n and I p .
LetX j ∈ R n be the jth column vector of X n . For any indices i and j, based on the joint distribution (1), the conditional distribution ofX i givenX j is where a ij ∈ R and τ ij > 0. We can view model (2) as a linear regression model given a design matrixX j . For each paired conditional model (2), we consider a testing problem If H 0,ij is true, σ ij = 0 and σ ii = 1 because a ij = σ ij /σ jj and τ 2 ij = σ ii (1 − ρ 2 ij ), where Σ n = (σ ij ) and R n = (ρ ij ) are covariance and correlation matrices, respectively. We suggest the following prior distribution under the alternative hypothesis H 1,ij in (3), where γ = c γ (n ∨ p) −α and a 0 , b 0 , c γ and α are positive constants. The induced Bayes factor is The null hypothesis in the one-sample covariance test, H 0 : Σ n = I p , is true if H 0,ij is true for all pairs (i, j) such that i = j. We aggregate the information from each pairwise Bayes factor B 10 (X i ,X j ) via the maximum pairwise Bayes factor (mxPBF), A large value for B max,10 (X n ) provides evidence supporting the alternative hypothesis. By taking a maximum, the mxPBF supports the alternative hypothesis if at least one of the pairwise Bayes factors support the alternative. A natural question is whether false positives increase as we take a maximum over more and more pairs. Indeed, we find that this is not the case, either asymptotically based on our consistency results (Theorems 3.1 and 3.3) or in finite samples based on simulations.

One-sample Covariance Test
In this subsection, we show consistency of the mxPBF defined in (5) for the one-sample covariance test H 0 : Σ n = I p versus H 1 : Σ n = I p .
Condition (A1) is required to detect a non-unit variance σ 0,ii . The condition (8) can be interpreted as a beta-min condition for |σ 0,ii − 1|. The beta-min condition gives a lower bound for nonzero parameters and is essential for model selection consistency (Castillo et al.;Martin et al.;2017). Interestingly, the rate of lower bound in (A1) is given by log(n ∨ p)/n, which has been commonly used in the variable selection literature. Condition (A2) is similar to condition (A1), which can be interpreted as a beta-min condition for |τ 2 0,ij − 1|. Condition (A3) is also the beta-min condition for off-diagonal elements of the covariance matrix. The right hand side of (10) implies that the larger τ 2 0,ij and σ 0,jj are, the larger value of σ 2 0,ij is required to consistently detect the dependency betweenX i andX j . Based on (7), the former means that larger value of residual variance τ 2 0,ij makes the inference harder. The conditional expectation in (7) can be considered as σ 0,ij /σ 1/2 0,jj · (X j /σ 1/2 0,jj ) ≡ σ 0,ij /σ 1/2 0,jj ·Z j , whereZ j ∼ N n (0, I n ). Thus, a large value of σ 0,jj makes it hard to detect nonzero σ 0,ij . In summary, conditions (A1)-(A3) imply that for some constant C > 0, which corresponds to the meaningful difference we mentioned earlier. In fact, the rate log p/n is optimal to guarantee the consistency under both hypotheses (Theorem 3.2).
Theorem 3.1 shows that the mxPBF is consistent for the one-sample covariance test even in the high-dimensional setting as long as log p ≤ 2 0 n for some small constant 0 > 0.
Theorem 3.1 Consider model (1) and the one-sample covariance testing problem H 0 : Assume that log p ≤ 2 0 n for all large n, and Σ 0 satisfies at least one of conditions (A1)-(A3) or Σ 0 = I p . Then, the mxPBF defined and under H 1 : Σ n = I p , log B max,10 (X n ) = O p log(n ∨ p) .
We first prove that the pairwise Bayes factor B 10 (X i ,X j ) is consistent on a large event E ij such To show the consistency of the mxPBF under H 0 , we need to prove that i =j P 0 (E c ij ) → 0 as n → ∞, which means that the false discovery rate converges to zero. The condition for α in Theorem 3.1 is closely related to this requirement. To show the consistency of the mxPBF under H 1 , it suffices to show P 0 (E c ij ) → 0 as n → ∞ for some index (i, j) satisfying at least one of conditions (A1)-(A3). Interestingly, the rate of convergence is similar under both hypotheses, unlike most Bayesian testing procedures with the notable exception of non-local prior based methods .
Remark Compared to existing frequentist one-sample covariance tests in Chen et al. (2010) and Cai and Ma (2013), our proposed test can detect H 1 : Σ n = I p if there is at least one meaningful suppose p ≥ n and Σ 0 = I p +τ I(i = j = 1) for some constant τ > 0, which implies Σ 0 −I p 2 F = τ 2 . Then, the mxPBF consistently detects the true H 1 . However, the powers of the frequentist tests do not tend to 1 in this case, because the condition Σ 0 − I p 2 F p/n is not met.
The next theorem shows the optimality of the alternative class which is considered in Theorem 3.1 (Conditions (A1)-(A3)). It says, when the alternative class is defined based on the element-wise maximum norm, the condition Σ 0 − I p 2 max ≥ C log p/n for some constant C > 0 is necessary for any consistent test to have power tending to one. Thus, conditions (A1)-(A3) are rate-optimal to guarantee the consistency under H 0 as well as H 1 .

Testing Diagonality
We now extend mxPBF to test diagonality of the covariance matrix: where Σ n = (σ ij ). The above hypothesis testing problem can be modularized into many pairwise independence tests for all 1 ≤ i < j ≤ p. We can adopt the mxPBF idea to aggregate the pairwise testing information from (12) for all possible pairs (i, j) such that i = j to test (11). Based on the conditional distribution (2), the null hypothesis H 0,ij in (12) is equivalent to H 0,ij : a ij = 0. We suggest the prior τ 2 ij ∼ IG(a 0 , b 0 ) under H 0,ij and prior (4) under H 1,ij , which leads to the pairwise Bayes factor We suggest using the mxPBF,B max,10 (X n ) := max i =jB for the hypothesis testing problem (11). Theorem 3.3 states the consistency of the mxPBF for testing (11) under regularity conditions. For consistency under the alternative hypothesis, we assume the following condition: (A4) There exists a pair (i, j) satisfying for constants C 1 > 0, 0 < C 3 < 1 and C 4 > 1 defined in Section 3.1.
Theorem 3.3 Consider model (1) and the diagonality testing problem (11). For a given pair (i, j) . Assume that log p ≤ 2 0 n for all large n, and Σ 0 satisfies condition (A4) or Σ 0 = I p . Then the mxPBF defined in (13) is consistent Condition (A4) is the beta-min condition for off-diagonal elements of the true covariance matrix.
It indicates that if one of the off-diagonal elements satisfies the beta-min condition (A4), the mxPBF consistently detects the true alternative hypothesis. Similar to Theorem 3.1, the condition for α is required to control the false discovery rate, and the mxPBF has similar rates of convergence under both hypotheses.
As a by-product, when pairwise independence testing (12) itself is of interest, we suggest a pairwise Bayes factorB which can be shown to be consistent. Note that decisions from tests based onB 10 (X i ,X j ) and B 10 (X j ,X i ) can differ in finite samples, which is undesirable. To obtain an order-invariant test, we suggest using the pairwise Bayes factor (14). For consistency under the alternative hypothesis, we assume for constants C 1 > 0, 0 < C 3 < 1 and C 4 > 1 defined in Section 3.1. If we substitute log n in the above condition with log(n ∨ p), it coincides with condition (A4). The proof of Corollary 3.4 is obvious from that of Theorem 3.3, thus it is omitted.
Corollary 3.4 Consider model (1) and a hypothesis testing problem (12) for a given pair (i, j) such that i = j. Suppose we use prior τ 2 ij ∼ IG(a 0 , b 0 ) under H 0,ij and prior (4) under H 1,ij with γ = c γ n −α for some positive constants a 0 , b 0 , c γ and α. Assume that either σ 0,ij = 0 or at least one of σ 0,ij and σ 0,ji satisfy (15). Then, the pairwise Bayes factorB pair,10 (X i ,X j ) is consistent under

Support Recovery of Covariance Matrices
The primary interest of this section is on the recovery of S Estimating S(Σ 0 ) corresponds to graph selection in covariance graph models (Cox and Wermuth;1993). Despite its importance, few Bayesian articles have investigated this problem. Kundu et al. (2013) proposed the regularized inverse Wishart (RIW) prior, which can be viewed as a group Lasso penalty (Yuan and Lin;2006) on the Cholesky factor. They showed the consistency of their selection procedure for the support of precision matrices when the dimension p is fixed. Recently, Gan et al.
(2018) adopted the spike-and-slab Lasso prior (Ročková and George;2016;Ročková et al.;2018) for off-diagonal entries of the precision matrix. Their proposed graph selection procedure for the precision matrix also yields selection consistency. To the best of our knowledge, in the Bayesian literature, a consistent support recovery result for covariance matrices has not been established.
To tackle this gap, we propose a scalable graph selection scheme for high-dimensional covariance matrices based on pairwise Bayes factors. By looking closer at the proof of Theorem 3.3, it reveals that each pairwise Bayes factorB pair,10 (X i ,X j ) can consistently determine whether the corresponding covariance element σ 0,ij is zero or not. Thus, we suggest using the estimated index for some constant C sel > 0. Asymptotically, any constant C sel can be used for consistent selection.
However, in practice, we suggest using the threshold C sel = 10, which corresponds to 'very strong evidence' under the Bayes factor thresholds of Kass and Raftery (1995). In the frequentist literature, Drton and Perlman (2004) and Drton and Perlman (2007) proposed selection procedures using a related idea to (16), which select a graph by multiple hypothesis testing on each edge. However, they considered only the low-dimensional setting, n ≥ p + 1.
For the consistency of S pair , we introduce the following condition for some constants 0 < C 3 < 1, C 4 > 1 and C 5 > 2: (A.ij) For a given pair (i, j) such that i = j, n .
The beta-min condition (A.ij) is almost the same with (A4) except using C 5 > 2 instead of C 1 > 0 to control the probabilities of small events on which the pairwise Bayes factor might not be consistent.
Theorem 3.5 states that (16) (2011) assumed log p = o(n 1/3 ) and min (i,j)∈S(Σ 0 ) σ 2 0,ij ≥ Cσ 0,ii σ 0,jj log p/n for some C > 0 and obtained the consistent support recovery result of the covariance matrix using adaptive thresholding. In terms of the condition on n and p, our condition, log p ≤ 2 0 n, is much weaker than the conditions used in the literature. The beta-min condition (A.ij) is similar to that in Cai and Liu (2011) and also has the same rate to that in Rothman et al. (2009) Thus, the required condition in Theorem 3.5 is weaker or comparable to the conditions used in the literature.

Simulation Study: One-sample Covariance Test
In this section, we demonstrate the performance of our one-sample covariance test in various simulation cases. The four hyperparameters were chosen as a 0 = 2, b 0 = 2, c γ = 10 and α = 8.01(1 − 1/ log n), which satisfy the sufficient conditions in Theorem 3.1. If we assume a small 0 > 0, the above choice of α asymptotically satisfies α > 8(1 + We compare our one-sample covariance test with existing frequentist tests, proposed by Cai and Ma (2013) and Srivastava et al. (2014). The test suggested by Srivastava et al. (2014) is based on estimating the squared Frobenius norm, and has a similar perspective to the test proposed by Cai and Ma (2013). The test statistic is asymptotically normal if n ≤ p δ for some 1/2 < δ < 1, but does not have any theoretical guarantee on power.
We generated 100 data sets from each hypothesis H 0 : Σ n = I p and H 1 : Σ n = I p , for various choices of n and p. We considered two structures for the alternative hypothesis H 1 : Σ n = I p . First, we chose a tridiagonal matrix Σ 0 = (σ 0,ij ) with for some signal strength constant ρ > 0 ranging from 0.1 to 0.5 by 0.025. For each value of ρ, we generated 100 data sets from the multivariate normal distribution N p (0, Σ 0 ). As a second case for Σ 0 , we let σ 0,ij = I(i = j) + ρ · I(i = 1, j = 2) + ρ · I(i = 2, j = 1), for some constant ρ > 0. Because (18) has signals at only two locations, the difference between Σ 0 and I p is sparse. mxPBF and the frequentist tests require Σ 0 −I p 2 max ≥ C log p/n and Σ 0 −I p 2 F p/n, respectively for consistency, while we have Σ 0 −I p 2 F = 2 Σ 0 −I p 2 max for (18). Hence, mxPBF is potentially much more powerful than the tests in Cai and Ma (2013) and Srivastava et al. (2014) in this setting. To check this conjecture, we increased ρ from 0.3 to 0.8 by 0.025 and generated 100 simulated data from N p (0, Σ 0 ).
We calculated receiver operating characteristic (ROC) curves to illustrate and compare the performance of the tests. For each setting, points of the ROC curves were obtained based on various thresholds and significance levels for mxPBF and frequentist tests, respectively. We tried n = 100, 200, 300 and p = 200, 500 for each setting. Figure 1 shows ROC curves based on 100 simulated data from N p (0, I p ) and 100 simulated data from N p (0, Σ 0 ) with a tridiagonal Σ 0 given in (17), when (n, p) = (100, 200) and (n, p) = (200, 500). In this setting, the tests in Cai and Ma (2013) and Srivastava et al. (2014) seem to work better than the test proposed in this paper. This is as expected since these earlier tests are based on the squared Frobenius norm Σ 0 − I p 2 F = 2(p − 1)ρ 2 , while mxPBF focuses on the maximum difference Σ 0 − I p max = ρ. Thus, when the difference between the true covariance and the point null is dense, the test based on the Frobenius norm difference would be more powerful. Figure 2 shows ROC curves based on 100 simulated data from N p (0, I p ) and 100 simulated data from N p (0, Σ 0 ) with a tridiagonal Σ 0 given in (18), when (n, p) = (100, 200) and (n, p) = (200, 500). As expected, the mxPBF is much more powerful than the frequentist tests when Σ 0 − I p is sparse. The mxPBF test works reasonably well when the signal ρ is larger than 0.5, but the frequentist tests do not work well even when ρ is 0.8. Furthermore, unlike the previous result described in Figure 1, the performance of the frequentist tests is not getting better as n and p increase, while mxPBF has better performance when (n, p) = (100, 200) than (n, p) = (200, 500). It confirms the theoretical property that the tests in Cai and Ma (2013) and Srivastava et al. (2014) require Σ 0 − I p 2 F p/n to have power tending to one. Note that Σ 0 − I p 2 F = 2ρ 2 remains the same while p/n increases as the setting is changed from (n, p) = (100, 200) to (n, p) = (200, 500).

Simulation Study: Testing Diagonality
We conducted a simulation study to illustrate the performance of the diagonality test based on the mxPBF. The hyperparameters in the mxPBF were chosen as in section 4.1 except α = 4.01(1 − 1/ log n). For each hypothesis H 0 : σ ij for any i = j and H 1 : not H 0 , 100 data sets were generated.
The two structures of Σ 0 under H 1 used in the previous section, (17) and (18), were considered.
We compare our test with frequentist tests suggested by Cai and Jiang (2011) and Lan et al. (2015). Cai and Jiang (2011) proposed a diagonality test and showed its asymptotic null distribution in the high-dimensional setting. Their test is based on the maximum of sample correlations between variables. Note that τ 2 ij,γ in the pairwise Bayes factorB 10 (X i ,X j ) is a decreasing function of the sample correlation betweenX i andX j . Thus, their test has a similar aspect to our test. Lan et al. (2015) developed a test in the regression setting and showed asymptotic normality. Their test statistic is based on the squared Frobenius norm of a sample covariance matrix, so it is expected that it will have a high power when off-diagonal entries of the true covariance matrix are dense.
The results for the case (n, p) = (200, 500) are reported in Figure 3. ROC curves of tests based on the mxPBF and proposed by Cai and Jiang (2011) are almost identical in every setting. As expected, the test suggested by Lan et al. (2015) is more powerful when Σ 0 − I p is dense, which corresponds to "Type 1" in Figure 3. The tests proposed in this paper and Cai and Jiang (2011) have less power, but also work reasonably well when the signal ρ is larger than 0.15. On the other hand, they tend to have a high power when Σ 0 − I p is sparse, which corresponds to "Type 2" in

Support Recovery using Sonar Data
To describe the practical performance of the support recovery procedure (16), S pair , we analyzed the sonar data set used by Gorman and Sejnowski (1988). The data record sonar signals collected from a rock and a metal cylinder, which were both approximately 5 feet long and embedded in the seabed. The impinging pulse sent to the targets was a wide-band linear frequency-modulated (FM) chirp (ka = 55.6). Data were collected within 10 meters of each object at various angles, spanning  (17) and (18), respectively. mxPBF, Lan and CJ represent the test proposed in this paper, Lan et al. (2015) and Cai and Jiang (2011), respectively. 180 • degrees for the rock and 90 • degrees for the metal cylinder. The rock data contains n r = 97 returned patterns, while the metal data contains n m = 111 returns. Each pattern consists of p = 60 values supported in the range [0, 1] representing the normalized frequency of energy measurements integrated over a certain period of time. We transformed the data by adding a small value and taking the logarithm to convert from [0, 1] to R support. After transformation, the data sets were centered.
For pairwise Bayes factors, the hyperparameters were set at a 0 = 2, b 0 = 2, c γ = 10 and α = 4.01(1 − 1/ log n). The threshold C sel in (16) was set at 10 as a default. Furthermore, as a pragmatic approach, we also incorporated cross-validation (CV) to select C sel . Let n be the number of observations for a given data set. We randomly divided the data 50 times into two subsamples with size n 1 = n/3 and n 2 = n − n 1 as a test set and training set, respectively. Denote I 1 and I 2 as indices for the test set and training set, respectively, thus, |I 1 | = n 1 , |I 2 | = n 2 and I 1 ∪ I 2 = {1, . . . , n}. LetŜ j (C sel ) be the estimated support for the jth column of the covariance matrix via pairwise Bayes factors, based on {X i } i∈I 2 and a given threshold C sel . We calculated the averaged mean squared error (MSE) where β jl is a least square estimate with respect to the dependent variable {X ij } i∈I 1 and covariate {X il } i∈I 1 . The threshold C sel was varied from 0 to 20 with increment 0.2, and we selected C sel which minimizes 50 −1 50 ν=1 M SE ν (C sel ), where M SE ν (C sel ) is the MSE based on the νth split. The selected C sel were 15.2 and 19 for the rock and metal data, respectively.
We compared our method with generalized thresholding estimators of Rothman et al. (2009) and Cai and Liu (2011). Both estimators are based on the sample covariance with the elementwise thresholding rule s λ (·), satisfying certain conditions, where λ is the threshold. The difference is that Rothman et al. (2009) used a universal threshold λ = δ log p/n, while Cai and Liu (2011) used an individual thresholdλ ij = δ θ ij log p/n with a data-dependentθ ij . We denote thresholding estimators proposed by Rothman et al. (2009) and Cai and Liu (2011) by Σ δ and Σ δ , respectively.
For the thresholding rule, we used the "adaptive lasso" rule, s λ (σ) = σ · max(1 − |λ/σ| η , 0) with η = 4, because it gave good support recovery results in simulation studies in Rothman et al. (2009) and Cai and Liu (2011). We adopted the CV method, described in section 4 of Cai and Liu (2011), to select the tuning parameter δ. We randomly divided the data 50 times into two subsamples with sizes n 1 = n/3 and n 2 = n − n 1 , and foundδ minimizing the objective function 50 −1 50 ν=1 S 1,ν n − Σ 2,ν δ 2 F , where S 1,ν n is the sample covariance based on subsamples with size n 1 from the νth split, and Σ 2,ν δ is the generalized thresholding estimator based on the remaining samples from the νth split. The tuning parameter δ was varied from 0 to 4 with increment 0.1. Figure 4 shows the support recovery results for the three methods described above and absolute sample correlation matrices. The first row represents the estimated supports based on S pair,10 , S pair, C sel , Σδ and Σ δ for the rock data set, where S pair,C sel denotes the estimated support via pairwise Bayes factors with the threshold C sel . One can see that S pair,10 , S pair, C sel and Σ δ give sparse support estimates, while Σδ provides a relatively dense support estimate. Note that Σ δ does not capture the small block structure in the upper right side, although it seems to be significant. It is due to relatively large values ofθ ij 's, which correspond to empirical variances of sample covariances betweenX i andX j 's. Similarly, the second row represents the estimated supports for the metal data set. Our support recovery procedure provides a sparse support estimate. On the other hand, Σδ and Σ δ give dense support recovery results. Our approach has the advantage of producing a sparser, and hence potentially more interpretable estimate of support. S pair,10 should give a sparse support estimate because we used the threshold C sel = 10, removing all but 'very strong' covariances. If one wants to have more dense support estimates, a smaller threshold C sel can be used. The first, second, third and fourth columns represent the estimated supports based on S pair,10 , S pair, C sel , Σδ and Σ δ , respectively.

Discussion
We have focused on covariance matrix structure testing in this paper, but the mxPBF idea can be easily applied to other related settings. For example, testing differences across groups in highdimensional mean vectors is an interesting possibility. When the two mean vectors are almost the same but differ only at a few locations, a mxPBF approach should have relatively high power.
Similarly, it can be applied to the high-dimensional two-sample covariance test. Two covariances from two populations may differ only in a small number of entries.
There are some possible generalizations of the pairwise Bayes factor idea. To accelerate the speed of computation, a random subsampling method can be used instead of calculating the pairwise Bayes factor for every single pair (i, j). It should be interesting to develop a suitable random subsampling scheme achieving desirable theoretical properties similar to the mxPBF. Especially when p is huge, it will effectively reduce the computational complexity. The mxPBF approach is also trivially parallelizable. Another possibility is considering alternative combining approaches to the max in merging the information from every pairwise Bayes factor. If there are many weak nonzero covariances (or signals), then the average or summation may be preferable to the maximum.
A suitable modification to learn parameters in the combining operator can potentially make the test powerful to a broad class of alternative hypotheses. 1−(n∨p) −c for some constants C > 0 and c > 2 under Case (i), and log B 10 (X i ,X j ) ≥ C log(n∨p) with probability at least 1 − (n ∨ p) −c for some constants C > 0 and c > 0 under Cases (ii)-(iv).
Then, we have if all pairs (i, j) satisfy Case (i) (i.e. H 0 : Σ n = I p holds), and if there exists at least one pair (i, j) satisfying one of Cases (ii)-(iv) (i.e. H 1 : Σ n = I p holds).
with probability at least 1 − 4(n ∨ p) −c for any constants 0 < C < α − 8(1 + Case (ii). Now assume that σ 0,ii satisfies condition (A1) for some i. Note that (21) can be expressed as We will show that for given constants C 1 > 0 and C 2 > 2 √ α + 2, On this event, we can show that (31) is larger than 4 −1 C 2 2 (1 − 2C 2 0 /3) log(n ∨ p) for all large n by the Taylor expansion of log(1 + x) and the fact that x − log x − 1 is increasing in |x − 1|. Note that (32) is positive and (33) is negligible compared to (31). Then, by (22) and (23), with probability at least 1 − 4(n ∨ p) −C 1 for some constant C and all large n, by the condition on 0 .
Now, we only need to show (34). By Lemma 1 in Laurent and Massart (2000), one can show because n τ 2 i /σ 0,ii ∼ χ 2 n . Thus, it suffices to prove which is satisfied by (A1).
because C 2 ≤ 2. The third equality follows from Lemma B.3 of Lee and Lee (2018). It gives the upper bound inf Σ∈H 1 (C ) which completes the proof.

Proof of Theorem 3.3
Proof For a given pair (i, j) such that i = j, suppose the null hypothesis is true, i.e. σ 0,ij = 0.
Note that condition (A.ij) is the same with condition (A4) except using C 5 instead of C 1 . Thus, by similar arguments used in the proof of Theorem 3.3, it is easy to check that P 0 S ij = 0, S ij (Σ 0 ) = 1 = P 0 logB pair,10 (X i ,X j ) ≤ C sel , σ 0,ij = 0 ≤ (n ∨ p) c for some constant c > 2 and all sufficiently large n. Therefore, we have proved that as n → ∞.