Two-sample tests for high-dimensional covariance matrices using both difference and ratio

Abstract: By borrowing strength from both the difference and the ratio between two sample covariance matrices, we propose three tests for the equality of two high-dimensional population covariance matrices. One test is shown to be powerful against dense alternatives, and the other two are suitable for general cases, including dense alternatives, sparse alternatives, and mixtures of the two. Based on random matrix theory, we investigate the asymptotic properties of the three tests under the null hypothesis as the sample size and the dimension tend to infinity proportionally. Limiting behaviors of the new tests are also studied under various local alternatives. Extensive simulation studies demonstrate that the proposed methods outperform, or perform comparably to, the existing tests.


Introduction
Testing whether two populations share the same covariance matrix is of fundamental interest in statistical inference, as many statistical procedures, such as linear discriminant analysis, require testing the equality of covariance matrices. Let $x_{k1}, x_{k2}, \ldots, x_{kN_k}$ be an independent and identically distributed sample from the $k$th $p$-dimensional population with mean vector $\mu_k$ and covariance matrix $\Sigma_k$, where $N_k$ is the sample size, $k = 1, 2$. Let $n_k = N_k - 1$ and $y_{n_k} = p/n_k$. The sample covariance matrix of $\Sigma_k$ is given by $S_k = n_k^{-1} \sum_{i=1}^{N_k} (x_{ki} - \bar{x}_k)(x_{ki} - \bar{x}_k)^T$, where $\bar{x}_k = N_k^{-1} \sum_{i=1}^{N_k} x_{ki}$ is the sample mean of the $k$th population, $k = 1, 2$. We are interested in testing the hypothesis
$$H_0: \Sigma_1 = \Sigma_2 \quad \text{versus} \quad H_1: \Sigma_1 \neq \Sigma_2. \tag{1.1}$$
In the conventional low-dimensional setting, where $p$ is relatively small compared with the sample sizes, this testing problem has been well studied; see, for example, [16], [8], [13], [7], [12] and [1]. In the high-dimensional setting, where $p$ is large relative to $N_1$ and $N_2$, several approaches have been proposed to address the failure of classical methods; see [4, 14, 15, 10, 5, 19, 21], among others. In particular, many existing testing procedures are built on $\mathrm{tr}(\Sigma_1 - \Sigma_2)^2$, the squared Frobenius norm of $\Sigma_1 - \Sigma_2$. For example, under the Gaussian population assumption, [14] developed an unbiased estimator of $\mathrm{tr}(\Sigma_1 - \Sigma_2)^2$ by correcting the bias of the sample statistic $T_d = \mathrm{tr}(S_1 - S_2)^2$. [11] considered a $U$-statistic for $\mathrm{tr}(\Sigma_1 - \Sigma_2)^2$ and showed that their proposed test yields desirable performance in various situations, especially when the two population matrices have dense differences. [21] also studied the limiting behavior of $T_d$ under general populations, including both Gaussian and non-Gaussian distributions.
On the other hand, the ratio measure $\mathrm{tr}(\Sigma_1 \Sigma_2^{-1} - I_p)^2$, with $I_p$ denoting the identity matrix, is a relative measure that can also characterize the discrepancy between $\Sigma_1$ and $\Sigma_2$. When $S_2$ is invertible, a natural estimator of $\mathrm{tr}(\Sigma_1 \Sigma_2^{-1} - I_p)^2$ is $T_r = \mathrm{tr}(S_1 S_2^{-1} - I_p)^2$. Unfortunately, this test statistic has seldom been studied. One likely reason is that $T_r$ involves the inverse of $S_2$, which requires that the dimension $p$ not exceed the sample size $N_2$; moreover, when $p$ is close to $N_2$, the behavior of $T_r$ is complicated. In fact, the relative performance of the tests $T_d$ and $T_r$ varies case by case: as will be shown in the simulation studies and the theoretical power analyses, neither $T_d$ nor $T_r$ dominates the other in terms of statistical power across all scenarios.
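As a concrete reference point, the two plug-in statistics take only a few lines to compute. The following is our own illustrative sketch (the sample sizes, seed, and helper names are ours, not the paper's), assuming data matrices with observations in rows and drawing both samples from $N(0, I_p)$ so that $H_0$ holds:

```python
import numpy as np

def sample_cov(X):
    """Sample covariance S_k with divisor n_k = N_k - 1; X is (N, p)."""
    return np.cov(X, rowvar=False, ddof=1)

def T_d(S1, S2):
    """Difference-based statistic tr((S1 - S2)^2)."""
    D = S1 - S2
    return np.trace(D @ D)

def T_r(S1, S2):
    """Ratio-based statistic tr((S1 S2^{-1} - I_p)^2); requires p < N_2."""
    p = S1.shape[0]
    A = S1 @ np.linalg.inv(S2) - np.eye(p)
    return np.trace(A @ A)

rng = np.random.default_rng(0)
p, N1, N2 = 5, 500, 500
X1 = rng.standard_normal((N1, p))   # both populations N(0, I_p): H0 holds
X2 = rng.standard_normal((N2, p))
S1, S2 = sample_cov(X1), sample_cov(X2)
td, tr_ = T_d(S1, S2), T_r(S1, S2)
```

Note that although $S_1 S_2^{-1} - I_p$ is not symmetric, it is similar to the symmetric matrix $S_2^{-1/2} S_1 S_2^{-1/2} - I_p$, so its eigenvalues are real and $T_r$ is nonnegative.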
This paper makes several contributions to the literature. (1) First, we investigate the relative performance of $T_d$ and $T_r$ through theoretical power analyses in the high-dimensional setting; in particular, the statistic $T_r$ has not been well studied in the literature. To adaptively borrow strength from $T_d$ and $T_r$, we develop a new testing procedure by establishing the joint limiting null distribution of $T_d$ and $T_r$ using random matrix theory. The proposed method performs nearly as well as the better of $T_d$ and $T_r$ in every scenario. (2) Since $T_d$ and $T_r$ only target dense alternatives, where there are many small differences between $\Sigma_1$ and $\Sigma_2$, they may not be powerful against sparse alternatives, where there are only a few large differences. Our second contribution is to propose two further testing procedures, combining $T_d$, $T_r$, and a maximum-norm statistic, that maintain high power for testing the equality of two high-dimensional covariance matrices under both dense and sparse alternatives. To the best of our knowledge, no existing approach to two-sample testing of high-dimensional covariance matrices combines three statistics. (3) The theoretical properties of the proposed tests based on the different combination procedures are extensively investigated in the paper, which sheds light on the pros and cons of different procedures for combining multiple statistics.
The rest of this paper is organized as follows. Section 2 establishes the joint limiting null distribution of T d and T r , and introduces three novel procedures for testing the equality of two covariance matrices. Section 3 investigates the power functions of the proposed tests under representative alternatives. Section 4 presents various simulation results to demonstrate the performance of the proposed methods. All technical details are presented in the Appendix.

Main results
Before presenting the main results, we first introduce some basic notation. The empirical spectral distribution (ESD) of a $p \times p$ non-negative definite matrix $A$ is defined as $F^A = p^{-1} \sum_{j=1}^{p} \delta_{\lambda_j}$, where $\{\lambda_j, 1 \le j \le p\}$ are the eigenvalues of $A$ and $\delta_a$ denotes the Dirac mass at point $a$. The sample covariance matrices are given by $S_k = n_k^{-1} \sum_{i=1}^{N_k} (x_{ki} - \bar{x}_k)(x_{ki} - \bar{x}_k)^T$, $k = 1, 2$, where $\bar{x}_1$ and $\bar{x}_2$ are the sample means. The two statistics of interest are $T_d = \mathrm{tr}(S_1 - S_2)^2$ and $T_r = \mathrm{tr}(S_1 S_2^{-1} - I_p)^2$, where $I_p$ is the $p \times p$ identity matrix. Next, we impose the following two assumptions, which are commonly used in random matrix theory, for studying the limiting behaviors of the considered statistics throughout the paper.
In Assumption B, we require the data dimension p to be less than n 2 so that the sample covariance matrix of the second population is invertible.

Joint limiting null distribution of T d and T r
The following theorem establishes the joint limiting null distribution of the two statistics $T_d$ and $T_r$. Let $L(x)$ denote the limiting spectral distribution (LSD) of the common covariance matrix $\Sigma_1 = \Sigma_2 = \Sigma$ under the null hypothesis in (1.1).
Theorem 2.1 shows that the joint limiting distribution of $T_d$ and $T_r$ under the null hypothesis is bivariate normal; its proof is provided in the Appendix. The marginal limiting distribution of $T_d$ is consistent with that given in Proposition 1 of [21]. Since $\sigma_{120} > 0$, $T_d$ and $T_r$ are asymptotically positively correlated.

$\frac{n_2}{(1 - y_{n_2})^3} m_{10}^2$. Based on Slutsky's theorem, we further have the corresponding result for the random vector.

Remark 2.2. Instead of simply replacing $y_1$ and $y_2$ in $\sigma_{220}$ with $y_{n_1}$ and $y_{n_2}$, respectively, the expression of $\hat{\sigma}_{220}$ contains additional terms involving $v_{i0p}$, $i = 1, 2, 3, 4$. This is because, although $v_{i0p} = o(1)$, the value of $v_{i0p}$ can be far from 0 when $p$ is small. For more details, see the proof of Theorem 2.2 in the Appendix.

Three test procedures
For testing the hypothesis (1.1), three test procedures are proposed as follows. Test 1: Let the first statistic be $T_{dr}$, constructed as the maximum absolute value of the standardized statistics $T_d$ and $T_r$. For a given test level $\alpha$, the rejection region is the event that $T_{dr}$ exceeds the critical value $t_\alpha$, which is obtained from (2.3) with $f(x_d, x_r)$ being the density of $N\!\left(0_2, \begin{pmatrix} 1 & \hat{\rho} \\ \hat{\rho} & 1 \end{pmatrix}\right)$ and $\hat{\rho} = \hat{\sigma}_{120}/\sqrt{\hat{\sigma}_{110}\hat{\sigma}_{220}}$. Test 2: $T_d$ and $T_r$ are powerful for measuring dense differences between $\Sigma_1$ and $\Sigma_2$, and so is $T_{dr}$. To enhance the power of the test procedure (2.2) against sparse alternatives, we use a theoretical result of [5]. Under Conditions (C1), (C2) (or (C2*)) and (C3) of [5] (see the Appendix) and the null hypothesis $H_0$, for any $t \in \mathbb{R}$, the displayed convergence holds, where $S_1 = (s_{1l_1l_2})_{l_1,l_2=1,\ldots,p}$ and $S_2 = (s_{2l_1l_2})_{l_1,l_2=1,\ldots,p}$. Borrowing the idea of [6] and [21], let the second test statistic be $T_{drx1}$, where $I(\cdot)$ is the indicator function, and $s(N_1, N_2, p)$ is a pre-specified threshold depending on the sample sizes $N_1, N_2$ and the dimension $p$. Specifically, with a carefully selected threshold $s(N_1, N_2, p)$, the second term in $T_{drx1}$ converges to zero under the null hypothesis, and it takes effect as soon as $T_x$ detects a strong signal. As a result, the first term $T_{dr}$ plays a dominant role, and the second term serves as a power enhancer for screening sparse disturbances between the two covariance matrices. Such a weighted statistic is therefore able to adaptively combine the information from $T_d$, $T_r$ and $T_x$. The corresponding rejection region is defined analogously, with $t_\alpha$ obtained from (2.3).
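Since the critical value $t_\alpha$ in (2.3) has no simple closed form, it can be approximated by Monte Carlo from the limiting bivariate normal law. The sketch below is our own numerical shortcut, not the paper's procedure; $\hat{\rho}$ is passed in as a known number:

```python
import numpy as np

def t_alpha(rho, alpha=0.05, n_mc=1_000_000, seed=1):
    """Monte Carlo approximation of the (1 - alpha) quantile of
    max(|Z_d|, |Z_r|), with (Z_d, Z_r) standard bivariate normal
    with correlation rho."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(n_mc)
    z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n_mc)
    m = np.maximum(np.abs(z1), np.abs(z2))
    return float(np.quantile(m, 1.0 - alpha))

t0 = t_alpha(rho=0.0)   # independent case: t solves (2*Phi(t) - 1)^2 = 0.95
t9 = t_alpha(rho=0.9)   # strong correlation pulls t_alpha toward z_{0.975}
```

With $\rho = 0$ the exact value is about 2.24, since $t$ solves $(2\Phi(t) - 1)^2 = 1 - \alpha$; as $\rho \to 1$ the quantile decreases toward $z_{0.975} \approx 1.96$, because the two standardized statistics then carry the same information.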

Test 3:
To incorporate the information of $T_{dr}$ and $T_x$, we propose another statistic $T_{drx2}$ as follows, where $c_\alpha = t_{\alpha/2}/q_{\alpha/2}$ and $t_{\alpha/2}$ satisfies equation (2.3) with $\alpha$ replaced by $\alpha/2$. The rejection region is defined analogously. The statistic $T_{drx2}$ has a similar form to $T_{dr}$, but incorporates the additional contribution of the statistic $T_x$. Under Conditions (C1), (C2) (or (C2*)) and (C3) of [5], the size of the test based on $T_{drx2}$ is asymptotically equal to or less than the significance level $\alpha$.
Theorem 2.3 states that all three proposed methods maintain the nominal test level asymptotically. According to Theorem 2.3, in this paper we take the threshold as specified; it is easy to show that this specification satisfies $s(N_1, N_2, p) - 4\log p \ge 0$. A related strategy is that of [18, 9], which is essentially a multiple-testing procedure requiring multiplicity adjustments. Since $t_{\alpha/2} > t_\alpha$, when the main signals come from $T_{dr}$, it can be shown that the power of $T_{drx1}$ is greater than that of $T_{drx2}$. On the other hand, to guarantee the asymptotic unbiasedness of $T_{drx1}$, the contribution of $T_x$ to $T_{drx1}$ is heavily penalized. As a result, $T_{drx2}$ is sometimes more powerful than $T_{drx1}$ under sparse alternatives. These results are all confirmed in the simulation studies.
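To make the screening term concrete, here is a sketch of a CLX-type maximum statistic $T_x$, with entrywise variance estimates $\hat{\theta}$ in the spirit of [5]; the exact normalization used in the paper may differ, and the threshold shown is only the lower bound $4 \log p$ required by Theorem 2.3, not necessarily the paper's exact choice of $s(N_1, N_2, p)$:

```python
import numpy as np

def max_statistic(X1, X2):
    """CLX-type T_x: maximum standardized squared difference of
    sample covariance entries (a sketch in the spirit of [5])."""
    def cov_and_theta(X):
        N, p = X.shape
        Xc = X - X.mean(axis=0)
        S = Xc.T @ Xc / N
        # theta[l1, l2] estimates Var((x_l1 - mu_l1)(x_l2 - mu_l2))
        prods = Xc[:, :, None] * Xc[:, None, :]      # shape (N, p, p)
        theta = ((prods - S) ** 2).mean(axis=0)
        return S, theta, N
    S1, th1, N1 = cov_and_theta(X1)
    S2, th2, N2 = cov_and_theta(X2)
    M = (S1 - S2) ** 2 / (th1 / N1 + th2 / N2)
    return float(M.max())

rng = np.random.default_rng(2)
p, N1, N2 = 50, 200, 200
Tx = max_statistic(rng.standard_normal((N1, p)),
                   rng.standard_normal((N2, p)))
threshold = 4 * np.log(p)    # lower bound on s(N1, N2, p) in Theorem 2.3
```

Under $H_0$, $T_x$ concentrates around $4\log p$; the indicator $I(T_x > s(N_1, N_2, p))$ therefore vanishes asymptotically under the null while firing on strong sparse signals.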

Power analysis
From Theorems 2.1-2.2, the rejection regions of the tests based on the statistics $T_d$ and $T_r$ are defined by their standardized versions exceeding $z_{1-\alpha/2}$ in absolute value, where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of $N(0,1)$.
In this section, we study the asymptotic powers of $T_d$, $T_r$, $T_{dr}$, $T_{drx1}$ and $T_{drx2}$ under the following alternative sets, where $J_p$ is a $p \times p$ matrix with all elements equal to 1, and $\theta_{kl_1l_2} = \mathrm{Var}((x_{kl_11} - \mu_{kl_1})(x_{kl_21} - \mu_{kl_2}))$ for $k = 1, 2$. The first two sets correspond to local dense alternatives, and the third includes sparse alternatives as a special case.

Power analysis in the dense alternative set Π 1
The following theorem gives the joint limiting distribution of $T_d$ and $T_r$ when $(\Sigma_1, \Sigma_2) \in \Pi_1$. (II) For the test based on $T_r$, the asymptotic power is as displayed, where $\Phi(\cdot)$ is the distribution function of $N(0,1)$. (III) For the test based on $T_{dr}$, the asymptotic power is as displayed, where $t_{\alpha/2}$ satisfies equation (3.3) with $\alpha$ replaced by $\alpha/2$.
Proposition 3.1 shows that the test based on $T_d$ suffers from low power if $(\Sigma_1, \Sigma_2) \in \Pi_1$. Since the asymptotic power functions of the tests based on $T_r$, $T_{dr}$, $T_{drx1}$ and $T_{drx2}$ are increasing in $\Delta_1$, these tests enjoy high power provided $\Delta_1$ is large enough.

Power analysis in the dense alternative set Π 2
The following theorem gives the joint limiting distribution of T d and T r when (Σ 1 , Σ 2 ) ∈ Π 2 .
Proposition 3.2 shows that the tests based on $T_d$, $T_r$, $T_{dr}$ and $T_{drx1}$ are all asymptotically unbiased if $(\Sigma_1, \Sigma_2) \in \Pi_2$ and $\beta_2 = 0$. Since $\Delta_2$ is an increasing function of $a_2$, the tests based on $T_d$, $T_{dr}$, $T_{drx1}$ and $T_{drx2}$ enjoy high power when $a_2$ is large enough. However, as $a_2$ increases, $\Delta_3$ converges to a finite limit proportional to $1/\sqrt{\sigma_{222}}$. Thus, the test based on $T_r$ suffers from low power when $\Delta_3$ is small. For example, when $y_1 = 0.5 - (2y_2 - y_2^2)/(1 - y_2)$ and $a_2 \to \infty$, the power of the test based on $T_r$ converges to $\alpha$.

Power analysis in the alternative set Π 3
The following proposition gives some results on the power functions of the tests based on $T_{drx1}$ and $T_{drx2}$ under the alternative set $\Pi_3$. Let the relevant quantities be as displayed. (II) Under Conditions (C2) or (C2*) of [5], the displayed bound holds. It can be seen from Proposition 3.3 that the power functions of the tests based on $T_{drx1}$ and $T_{drx2}$ tend to 1 if some entries of $\Sigma_d$ are large enough.

Simulation studies
We perform extensive simulation studies to examine the finite-sample performance of the proposed new tests. The observations are drawn from $x_{ki} = \Sigma_k^{1/2} w_{ki}$, where $\{w_{kli},\ k = 1, 2;\ l = 1, \cdots, p;\ i = 1, \cdots, N_k\}$ are independent and identically distributed (i.i.d.) from the Gaussian population $N(0, 1)$ or the Gamma population $\mathrm{Gamma}(4, 2) - 2$. We evaluate our proposed tests under four different scenarios for $\Sigma_1$ and $\Sigma_2$.
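The data-generating scheme above can be reproduced as follows. This is a sketch with our own helper names; the $\Sigma$ used is an illustrative equicorrelation matrix, not one of the paper's four scenarios. Note that $\mathrm{Gamma}(4, 2)$ (shape 4, rate 2) has mean 2 and variance 1, so the shifted variate has mean 0 and variance 1:

```python
import numpy as np

def draw_sample(Sigma, N, dist="gaussian", rng=None):
    """Draw N rows x_i = Sigma^{1/2} w_i, the entries of w_i being i.i.d.
    standardized variates: N(0,1) or Gamma(4, 2) - 2 (mean 0, variance 1)."""
    if rng is None:
        rng = np.random.default_rng()
    p = Sigma.shape[0]
    if dist == "gaussian":
        W = rng.standard_normal((N, p))
    else:
        # shape 4, rate 2 (scale 1/2): mean 4/2 = 2, variance 4/4 = 1
        W = rng.gamma(shape=4.0, scale=0.5, size=(N, p)) - 2.0
    vals, vecs = np.linalg.eigh(Sigma)          # symmetric square root
    root = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
    return W @ root

rng = np.random.default_rng(3)
Sigma = 0.5 * np.ones((5, 5)) + 0.5 * np.eye(5)   # equicorrelation, rho = 0.5
X = draw_sample(Sigma, N=4000, dist="gamma", rng=rng)
S = np.cov(X, rowvar=False)    # should be close to Sigma for large N
```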
The above four scenarios all satisfy the conditions listed in Theorem 2.3, with justifications provided in the Appendix. We compare our new tests with two existing tests, LC [11] and CLX [5]. The nominal significance level for all the tests is set at α = 0.05. Based on 10,000 replications under each scenario, the empirical sizes for the Gaussian and Gamma populations are reported in Tables 1-4, and the empirical powers of the considered methods are exhibited in the figures, where the sample sizes increase from (40, 80, 120) to (320, 480, 480). It is observed that all the tests maintain the nominal level for both Gaussian and Gamma populations. For power comparisons, the relative performance of the seven tests depends on the scenario. In Scenarios 1-2, the test CLX fails to detect the dense but small disturbances and yields low empirical power. The test based on T_d performs better than the test based on T_r in Scenario 1, while the test based on T_r is more powerful than the test based on T_d in Scenario 2. Since the proposed tests based on T_dr, T_drx1 and T_drx2 inherit the advantages of the tests based on T_d and T_r, all three maintain competitive power in Scenarios 1-2. Specifically, consistent with Remark 2.3, T_drx1 delivers higher power than T_drx2 in these two scenarios. Scenario 3 examines the performance of the considered tests under a sparse alternative; as expected, the tests based on T_d, T_r and T_dr fail to detect the sparse signals and have low power. By contrast, the tests based on T_drx1 and T_drx2 are adaptive and thus yield high power. Scenario 4 is a hybrid alternative with a mixture of dense and sparse signals; in this case, the proposed tests based on T_drx1 and T_drx2 outperform the other methods in terms of empirical power.
Since sparse differences exist in the last two scenarios, the test based on T_drx2 is more powerful than that based on T_drx1. In summary, the simulation studies show that the proposed three tests effectively borrow strength from each individual statistic in signal detection, and the nominal testing level is well maintained by all methods.

Appendix
The appendix includes three sections: Section A.1 gives Conditions (C1), (C2) (or (C2*)) and (C3) of [5], Section A.2 justifies the conditions of the four scenarios considered in the simulation studies, and Section A.3 provides the proofs of some theorems and propositions.
(C1). Assume that there is a positive constant $\alpha_0$ and a subset $\Upsilon \subset \{1, \ldots, p\}$ such that the displayed condition holds. Here $s_{l_2}$, $l_2 = 1, \ldots, p$, is the cardinality of the set of indices $l_1$ ($l_1 \neq l_2$) that are highly correlated with variable $l_2$ in population 1 or 2, as given by the display. In addition, there is a sequence of numbers $\Lambda_{p,r}$ such that $\mathrm{card}(\Lambda(r)) \le \Lambda_{p,r} = o(p)$ for some constant $0 < r < 1$, where $\Lambda(r)$, defined in the display, is the set of highly correlated variable indices. (C2). Assume that $\log p = o(N^{1/5})$ and $N_1 \asymp N_2$. There exist constants $\eta > 0$ and $K > 0$ satisfying the displayed moment conditions. Additionally, for some constants $\tau_1 > 0$ and $\tau_2 > 0$, the displayed condition holds. (C2*). Suppose that condition (A.1) holds, and that $N_1 \asymp N_2$ and $p \le c_1 N^{\gamma_0}$ for some $\gamma_0, c_1 > 0$. Furthermore, the displayed moment conditions hold for some constants $\eta > 0$ and $K > 0$. (C3). For any $l_1, l_2, l_3, l_4 \in \{1, 2, \cdots, p\}$, the displayed condition holds for some constants.
Mimicking the discussion in Scenario 2, we can show that s 3l2 and card(Λ 3 (r)) satisfy Condition (C1).
However, by Proposition 1 of [5], under $H_0$ and Condition (C2) or (C2*), the displayed convergence holds, where $q_\alpha$ is the $1 - \alpha$ quantile of the type I extreme value distribution. This indicates that even without Conditions (C1) and (C3), the test based on $T_x$ can effectively control the Type I error, which explains why the empirical test sizes remain satisfactory when $\{w_1, \cdots, w_p\}$ are i.i.d. from $\mathrm{Gamma}(4, 2) - 2$ and Condition (C3) is violated.
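The type I extreme value limit in [5] has, as we read it, the cdf $F(t) = \exp\{-(8\pi)^{-1/2} e^{-t/2}\}$ for the centered statistic $T_x - 4\log p + \log\log p$; treating that form as an assumption, its quantile can be inverted in closed form:

```python
import math

def extreme_quantile(alpha):
    """(1 - alpha) quantile q solving F(q) = 1 - alpha for the limit law
    F(t) = exp(-(8*pi)^(-1/2) * exp(-t/2))."""
    return -math.log(8.0 * math.pi) \
           - 2.0 * math.log(math.log(1.0 / (1.0 - alpha)))

q = extreme_quantile(0.05)   # roughly 2.72
```

Plugging $q$ back into $F$ recovers $1 - \alpha$ exactly, which is a convenient self-check of the inversion.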

A.3. Proofs of some theorems, propositions and lemmas
This section is divided into three subsections. Subsection A.3.1 gives some preparations. Subsection A.3.2 contains the proofs of theorems and propositions. Lemmas used in this paper are placed in Subsection A.3.3.

A.3.1. Preparatory works for proving Theorems 2.1, 3.1 and 3.2
The preparatory work consists of two steps, whose skeletons are as follows. Skeleton of Step 1. Define the linear combination $T = \omega_1 (T_d - \mu_0) + \omega_2 T_r$, where $\omega_1$ and $\omega_2$ are two arbitrary constants. In this step, we will obtain the displayed decomposition, where $T_A$ and $T_B$ are given in (A.14) and (A.15), and $\mu_1$ and $\mu_2$ are given in (A.12) and (A.13).
Step 1: Letting $\omega_1$ and $\omega_2$ be two arbitrary constants, define $T$ as above. Denote the noncentralized sample covariance matrices as displayed. Similarly, we also have the analogous expansions, and we rewrite $T$ accordingly.

From (A.3) and (A.78), we have
and is almost everywhere bounded. Therefore, from (A.5) and (A.6), we obtain the displayed bound. Based on the proof of Theorem 2.1 in [20], we know that the limiting distribution of $\operatorname{tr} S_2^{-1} - E\operatorname{tr} S_2^{-1}$ is normal, with asymptotic mean and variance as displayed. Using the proof of Theorem 2.1 in [20] again, we obtain the analogous result. According to the discussion following Assumptions a-b-c-d-f in Subsection A.3.3, we know that $B_2$ and $B_2^{-1}$ are almost everywhere bounded. Combining these facts yields the displayed limits. Based on Slutsky's theorem, we only need to derive the limiting distribution of $T_A + T_B$.
Step 2: We have the displayed decomposition. Based on the central limit theorem for martingale difference sequences, it suffices to consider the limits when $i, j \in \{1, 2\}$ and $p \to \infty$, where $f_1(x) = 1/x$ and $f_2(x) = 1/x^2$, and $C_1$ and $C_2$ are nonoverlapping closed contours in the complex plane enclosing the support of the LSD $F_{y_2,H}$.
After rearrangement and calculation, we obtain the displayed limits, with remainder terms of order $o(1)$.

Based on (A.81) and (A.82) in Lemma A.2 and (A.86) and (A.87) in Lemma A.3, we have
Therefore, we obtain the displayed results. Next, we consider $\mu_2$; according to (A.11) and (A.13), we have

The Proof of Theorem 2.2.
In this subsection, we prove Theorem 2.2. Under Assumptions A-B and $H_0$, denoting $\Sigma_1 = \Sigma_2 = \Sigma$ and $S = (n_1 + n_2)^{-1}(n_1 S_1 + n_2 S_2)$, from (A.3), (A.4) and (A.76), we have the displayed expansion. According to the discussion before (A.17)-(A.20), we find that $d_{10}, d_{20}, d_{30}, d_{40}$ are the limits of estimators that are linear spectral statistics of the noncentralized sample covariance matrix $B_2$ with population covariance matrix $I_p$. Therefore, based on the proof of Theorem 2.1 and Proposition A.1 of [20], for $i = 1, 2, 3, 4$, we have the displayed limits. Substituting $d_{i0p} + v_{i0p}$ for $d_{i0}$ in (A.44) for $i = 1, 2, 3, 4$ then gives the expression of $\hat{\sigma}_{220}$. Thus, we complete the proof of Theorem 2.2.
which implies that the test based on $T_{dr}$ is asymptotically of level $\alpha$.
Secondly, under the conditions of Theorem 2.1 and Conditions (C1), (C2) (or (C2*)) and (C3) of [5], we have the displayed bound, which implies that the size of the test based on $T_{drx2}$ is asymptotically equal to or less than $\alpha$. Finally, under the conditions of Theorem 2.1 and Conditions (C1), (C2) (or (C2*)) and (C3) of [5], when the threshold satisfies $s(N_1, N_2, p) - 4\log p \ge 0$, from (2.4) we have the displayed results. It follows that the test based on $T_{drx1}$ is asymptotically of level $\alpha$. Thus, we complete the proof of Theorem 2.3.
The Proof of Theorem 3.1. In this subsection, we prove Theorem 3.1. As in the proof of Theorem 2.1, because all quantities are computed under $(\Sigma_1, \Sigma_2) \in \Pi_1$, we append 1 to the subscripts of these quantities. When $(\Sigma_1, \Sigma_2) \in \Pi_1$, we have the displayed expansions; then, from (A.12), we obtain the stated expression.

T. Zou et al.
Therefore, based on Theorem 3.1 and Slutsky's theorem, we have the displayed convergence, where $\Phi(\cdot)$ is the distribution function of $N(0,1)$. Under the conditions of Theorem 3.1, based on the definition of $t_\alpha$, we have the displayed bound, and $t_\alpha$ satisfies the displayed equation. According to the definition of $T_{drx1}$, we know $T_{drx1} \ge T_{dr}$, whence the displayed inequality, which yields the stated bound. Based on the definition of $T_{drx2}$, we have the displayed result, where $t_{1\alpha/2}$ satisfies the displayed equation. The Proof of Theorem 3.2. In this subsection, we prove Theorem 3.2. As in the proof of Theorem 2.1, because all quantities are computed under $(\Sigma_1, \Sigma_2) \in \Pi_2$, we append 2 to the subscripts of these quantities. When $(\Sigma_1, \Sigma_2) \in \Pi_2$, we have $\operatorname{tr}(\Sigma_1 - \Sigma_2)^2 = a_2^2$; when $\beta_2 = 0$, from (A.12), we obtain $\mu_{12} = a_2^2 + y_{n_1} + y_{n_2} + \beta_1 y_{n_1} + o(1)$.

$\frac{2}{(1 - y_2)^8}$. Therefore, under the conditions of Theorem 3.2, based on the central limit theorem for martingale difference sequences, we have the stated convergence. Thus, we complete the proof of Theorem 3.2.

The proof of Proposition 3.2. Under Assumptions A-B and the conditions of Theorem 3.2, according to the definition of $t_\alpha$, we have the displayed bounds. Similarly to the proof of Proposition 3.1, we can obtain (IV) and (V). Thus, we complete the proof of Proposition 3.2.

A.3.3. Technical lemmas
We provide some useful lemmas in this subsection. In accordance with the notation of [3], in this subsection the dimension and sample size are represented by $n$ and $N$, respectively. The sample $x_1, x_2, \cdots, x_N$ is from an $n$-dimensional population $x$. Let $x_i = (x_{1i}, \cdots, x_{ni})^T$ and $X_n = (x_1, \cdots, x_N)$, and define $B_n$ as displayed, where $X_n^T$ denotes the transpose of $X_n$, $\Gamma$ is an invertible $n \times n$ matrix, $\Im(z)$ denotes the imaginary part of the complex number $z$, and $m_n(z)$ denotes the Stieltjes transform of $F^{B_n}$. Let $\underline{B}_n = N^{-1} X_n^T \Gamma^T \Gamma X_n$ (whose spectrum differs from that of $B_n$ by $|n - N|$ zeros). Its Stieltjes transform $\underline{m}_n(z)$ is defined analogously, where $c_n = n/N$.
• Assumption c. The spectral norm of T n = ΓΓ T is bounded and the ESD H n of T n converges weakly to a LSD H as n → ∞. • Assumption d. The spectral norm of T −1 n , the inverse of T n , is bounded.
From [20], under Assumptions a-b-c, we know that the sample covariance matrix $B_n$ has the LSD $F_{c,H}$, namely the Marčenko-Pastur distribution of index $(c, H)$, with support as displayed. Moreover, the LSD $F_{c,H}$ has a Dirac mass $1 - 1/c$ at the origin when $c > 1$.
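In the simplest case $\Gamma = I$ (so $H = \delta_1$), the Marčenko-Pastur law has support $[(1 - \sqrt{c})^2, (1 + \sqrt{c})^2]$ for $c \le 1$, and the extreme sample eigenvalues converge to these edges. A quick numerical check (our own sketch, not tied to the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(4)
n, N = 400, 1600                       # dimension n, sample size N
c = n / N                              # concentration c = 0.25 < 1
X = rng.standard_normal((n, N))
B = X @ X.T / N                        # sample covariance, population I_n
eigs = np.linalg.eigvalsh(B)
lower = (1 - np.sqrt(c)) ** 2          # MP support edges: 0.25 and 2.25
upper = (1 + np.sqrt(c)) ** 2
```

The edge fluctuations are of order $n^{-2/3}$, so for $n = 400$ the extreme eigenvalues already sit within a few hundredths of the theoretical edges.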
Define $\underline{m}_c$ to be the Stieltjes transform of the companion LSD $\underline{F}_{c,H} = (1 - c)\delta_0 + c F_{c,H}$, where $\delta_0$ is the point mass at zero. Then $\underline{m}_c$ is the unique solution in $\mathbb{C}^+$ of the displayed equation. Let $\gamma_j = (1/\sqrt{N})\Gamma x_j$, let $E_0(\cdot)$ denote expectation, and let $E_j(\cdot)$ denote the conditional expectation with respect to the $\sigma$-field generated by $\gamma_1, \cdots, \gamma_j$.

The proof of Lemma A.1 is similar to that of Lemma 1 given in the Supplementary Material of [21]. It is also worth noting that the identity in (1.15) of [3] holds. Before giving Lemma A.2, some preparatory work is needed. Let $\nu = \Im(z)$. Throughout the rest of this subsection we assume $\nu > 0$. Without loss of generality, we assume $\|T_n\| \le 1$ for all $n$, where $\|\cdot\|$ denotes the spectral norm.
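The identity from (1.15) of [3] reads $\underline{m}_n(z) = -(1 - c_n)/z + c_n m_n(z)$, which follows directly from the spectra of $B_n$ and $\underline{B}_n$ differing by $|n - N|$ zeros. A quick numerical verification, taking $\Gamma = I$ for simplicity (our own sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
n, N = 50, 80
cn = n / N
X = rng.standard_normal((n, N))
B = X @ X.T / N            # n x n sample covariance (Gamma = I here)
B_under = X.T @ X / N      # N x N companion matrix, same nonzero spectrum

def stieltjes(M, z):
    """Empirical Stieltjes transform (1/dim) * tr((M - zI)^{-1})."""
    return np.mean(1.0 / (np.linalg.eigvalsh(M) - z))

z = 1.0 + 1.0j             # any point off the real axis
lhs = stieltjes(B_under, z)
rhs = -(1.0 - cn) / z + cn * stieltjes(B, z)
```

The two sides agree to machine precision, since $\underline{B}_n$ carries exactly the eigenvalues of $B_n$ plus $N - n$ zeros, each contributing $-1/z$ to the trace.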
All three of the latter quantities are bounded in absolute value by $|z|/\nu$ [see (3.4) of [2]]. For convenience, constants appearing in inequalities will be denoted by $K$ and may take different values from one expression to the next. Assume that $A = (a_{ij})$ is an $n \times n$ matrix with bounded spectral norm; from (A.76) we obtain the displayed bound, which implies the stated estimate. Proof. According to (4.13) of [3], we have the displayed identities. Under Assumptions a-b-c-d, we know that $(zI - b_n(z)T_n)^{-1}$ and $D_1^{-1}(z)$ are bounded. Suppose $M$ is an $n \times n$ nonrandom matrix with bounded spectral norm; similarly to (4.15) and (4.16) in [3], we have the displayed bounds of order $o(1)$. Since [see [2]] $E\beta_1(z) = -zE\underline{m}_n(z) = 1 - c_n - zc_nEm_n(z)$, when $z \neq 0$ we have $b_n(z) = 1 - c_n + o(1)$. Therefore, when $z \neq 0$, we obtain the stated limit. Let $M_2 = T_n$; from (A.81), we obtain the displayed result, and hence (A.82). Next we prove (A.83). Again from (A.79), (A.80) and (A.85), we obtain the corresponding estimate when $z \neq 0$. Let $M_3 = T_n$; from (A.81) and (A.82), we can obtain (A.83). Finally, we prove (A.84). From (A.79), (A.80) and (A.85), we get

This implies that
By the same argument, based on the central limit theorem for martingale difference sequences, we consider the following sum. For the martingale difference of $\operatorname{tr}[(B_n M_2)^2]$, we have the displayed expression. Still based on the central limit theorem for martingale difference sequences, we consider the following sum. Next, the second derivatives of those terms are given as follows.