Comparison of Bayesian and Frequentist Multiplicity Correction For Testing Mutually Exclusive Hypotheses Under Data Dependence

The problem of testing mutually exclusive hypotheses with dependent test statistics is considered. Bayesian and frequentist approaches to multiplicity control are studied and compared to help gain understanding as to the effect of test statistic dependence on each approach. The Bayesian approach is shown to have excellent frequentist properties and is argued to be the most effective way of obtaining frequentist multiplicity control, without sacrificing power, when there is considerable test statistic dependence.


Introduction
Modern scientific experiments often require considering a large number of hypotheses simultaneously ([8], [10]), which has led to extensive interest in controlling for multiple testing (henceforth termed controlling for multiplicity). Many multiplicity control methods have been proposed in the frequentist literature, such as the Bonferroni procedure, which controls the family-wise error rate, and various versions of the false discovery rate (cf. [3] and [13]), which control the fraction of false discoveries among stated discoveries. The asymptotic behavior of the false discovery rate has been studied in [2].
The Bayesian approach to controlling for multiplicity operates through the prior probabilities assigned to hypotheses. For instance, in the scenario considered herein of testing mutually exclusive hypotheses (only one of n considered hypotheses can be true), one can simply assign each hypothesis prior probability 1/n and carry out the Bayesian analysis; this automatically controls for multiplicity. That multiplicity is controlled through the prior probabilities of hypotheses or models is extensively discussed in [11], [12], and [5] for a two-groups model, variable selection in linear models, and subset analysis, respectively.
One of the appeals of the Bayesian approach to multiplicity control is that it does not depend on the dependence structure of the test statistics; the Bayes procedure automatically adapts to the dependence structure through Bayes theorem, while the prior probability assignment that controls for multiplicity is unaffected by dependence. In contrast, frequentist approaches to multiplicity control are usually highly affected by test statistic dependence. For instance, the Bonferroni correction is fine if the test statistics for the hypotheses being tested are independent, but can be much too conservative (losing detection power) if the test statistics are dependent.
An interesting possibility for frequentist multiplicity control in dependence situations is thus to develop the procedure in a Bayesian fashion and verify that the procedure has sufficient control from a frequentist perspective. This has the potential of yielding optimally powered frequentist procedures for multiplicity control. There have been other papers that study the frequentist properties of Bayesian multiplicity control procedures ( [6], [9], [1]), but they have not focused on the situation of data dependence.
We investigate the potential for this program by an exhaustive analysis of the simplest multiple testing problem which exhibits test statistic dependence. The data X = (X_1, . . . , X_n) arises from the multivariate normal distribution N_n(θ, Σ), where Σ has unit diagonal entries and all off-diagonal entries equal to ρ, the common correlation between the observations. Consider testing the n hypotheses M_{i0} : θ_i = 0 versus M_{i1} : θ_i ≠ 0, but under the assumption that at most one alternative hypothesis could be true. (It is possible that no alternative is true.) Although our study of this problem is pedagogical in nature, such testing problems can arise in signal detection, when a signal could arise in one and only one of n channels and there is common background noise in all channels, leading to the equal correlation structure. We will, for convenience in exposition, use this language in referring to the situation.
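The equicorrelated model can be simulated through a shared-noise representation (cf. Lemma 9.1): X_i = θ_i + √ρ Z + √(1−ρ) Z_i. Below is a minimal sketch; the function name and parameter choices are ours, for illustration only:

```python
import numpy as np

def sample_channels(n, rho, theta=None, rng=None):
    """Draw X ~ N(theta, Sigma), Sigma = (1 - rho) I + rho 11^T, via the
    representation X_i = theta_i + sqrt(rho) Z + sqrt(1 - rho) Z_i, where Z
    is the shared background noise and the Z_i are independent N(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.zeros(n) if theta is None else np.asarray(theta, float)
    z_common = rng.standard_normal()
    z_own = rng.standard_normal(n)
    return theta + np.sqrt(rho) * z_common + np.sqrt(1.0 - rho) * z_own

# Sanity check: the empirical covariance matches (1 - rho) I + rho 11^T.
rng = np.random.default_rng(0)
rho = 0.7
draws = np.array([sample_channels(5, rho, rng=rng) for _ in range(100_000)])
emp_cov = np.cov(draws.T)
target = (1 - rho) * np.eye(5) + rho * np.ones((5, 5))
print(np.abs(emp_cov - target).max())  # small sampling error
```

The representation makes the role of ρ transparent: it is the variance share of the common background noise.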
In Section 2 we introduce two natural frequentist procedures for multiplicity control in this problem and, in Section 3, we introduce a natural Bayesian procedure. Section 4 explores a highly curious phenomenon that is encountered when ρ is near 1; when n > 2, the Bayesian procedure finds the true alternative hypothesis with near certainty, while an ad hoc frequentist procedure fails to do so. Sections 5 and 6 study the frequentist properties of the original Bayesian procedure and a Type-II MLE approach, showing that, as n → ∞, the Bayesian procedures have strong frequentist control of error. Section 7 considers the situation in which there is a data sample of growing size m for each θ i .

Frequentist Multiplicity Control
Two natural frequentist procedures are considered.

An Ad hoc Procedure
Declare channel i to have the signal if max_{1≤j≤n} |X_j| > c, where c is determined to achieve overall family-wise error control:

α = P(max_{1≤j≤n} |X_j| > c | M_0). (2.1)

Lemma 2.1. (2.1) can be expressed as

α = 1 − E[(Φ((c − √ρ Z)/√(1−ρ)) − Φ((−c − √ρ Z)/√(1−ρ)))^n],

where the expectation is with respect to Z ∼ N(0, 1) and Φ is the standard normal cdf.
Proof. By Lemma 9.1 in the Appendix, under the null model, X_i can be written as X_i = √ρ Z + √(1−ρ) Z_i, where Z and the Z_i are independent standard normal random variables. Thus, conditioning on Z,

P(max_j |X_j| ≤ c | Z) = (Φ((c − √ρ Z)/√(1−ρ)) − Φ((−c − √ρ Z)/√(1−ρ)))^n,

and taking the expectation over Z completes the proof.
• When ρ = 0, (2.1) reduces to α = 1 − (2Φ(c) − 1)^n, so that Φ(c) = (1 + (1 − α)^{1/n})/2 ≈ 1 − α/(2n); this is essentially the Bonferroni correction.
• When ρ → 1, by Lemma 9.1, all the X_i converge to the common √ρ Z and Φ(c) → 1 − α/2, so the critical region is the same as that for a single test.

The extreme effect of dependence on frequentist multiplicity correction is clear here; the correction ranges from full Bonferroni correction to no correction, as the correlation ranges from 0 to 1.
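The critical value c can be computed numerically from the conditional representation in Lemma 2.1. A sketch (function names and the choices of α and n are ours), showing the full range from Bonferroni-sized to single-test-sized thresholds:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def fwe(c, n, rho, n_grid=4001):
    """Family-wise error P(max_i |X_i| > c | M_0): for rho > 0, condition on
    the shared noise Z (as in Lemma 2.1) and integrate over z ~ N(0, 1)."""
    if rho == 0.0:
        return 1.0 - (2.0 * norm.cdf(c) - 1.0) ** n
    z = np.linspace(-8.0, 8.0, n_grid)
    dz = z[1] - z[0]
    s = np.sqrt(1.0 - rho)
    inner = norm.cdf((c - np.sqrt(rho) * z) / s) - norm.cdf((-c - np.sqrt(rho) * z) / s)
    return 1.0 - np.sum(inner ** n * norm.pdf(z)) * dz

def critical_value(alpha, n, rho):
    """Solve fwe(c) = alpha for c by root finding."""
    return brentq(lambda c: fwe(c, n, rho) - alpha, 0.1, 20.0)

alpha, n = 0.05, 50
c_indep = critical_value(alpha, n, 0.0)   # Bonferroni-sized threshold
c_dep = critical_value(alpha, n, 0.999)   # near the single-test value 1.96
print(round(c_indep, 3), round(c_dep, 3))
```

For ρ = 0 the closed form is used directly; for large ρ the computed c drops toward the single-test threshold, as the discussion above predicts.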

Likelihood Ratio Test
A more principled frequentist procedure would be the likelihood ratio test (LRT). The test statistic arising from the likelihood ratio test is

T = max_{1≤j≤n} (a x_j + b u_j)²/a, where u_j = Σ_{k≠j} x_k

and a and b are the diagonal and off-diagonal entries of Σ^{−1}; the LRT rejects the null hypothesis if T > c, where c satisfies α = P(T > c | θ_i = 0 ∀i).
Proof. Denote the covariance matrix Σ = (1 − ρ)I + ρ11^T and its inverse Σ^{−1}, whose diagonal entries are a = (1 + (n−2)ρ)/((1−ρ)(1+(n−1)ρ)) and whose off-diagonal entries are b = −ρ/((1−ρ)(1+(n−1)ρ)). Letting f(·) denote the density of X, the likelihood ratio is

LR = f(x | M_0) / max_j sup_{θ_j} f(x | M_{j1}) = exp(−max_j ((Σ^{−1}x)_j)²/(2(Σ^{−1})_{jj})).

Noting that (Σ^{−1}x)_j = a x_j + b u_j and (Σ^{−1})_{jj} = a, so that the exponent involves (1/a)(a x_j + b u_j)², it is immediate that LR is equivalent to the test statistic T. The rejection region is LR ≤ k for some k, which is clearly equivalent to T ≥ c for an appropriate critical value c.
When ρ = 0, T = max_i x_i², and the LRT reduces to the ad hoc testing procedure of the previous section. On the other hand, as ρ → 1, T is driven by max_i (x_i − x̄)²/(1 − ρ), which exhibits quite different behavior that will be discussed later.
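The statistic T can be checked numerically by computing it generically from the precision matrix, since (Σ^{−1}x)_j = a x_j + b u_j and (Σ^{−1})_{jj} = a. A sketch (the function name is ours):

```python
import numpy as np

def lrt_statistic(x, rho):
    """LRT statistic for 'exactly one nonzero mean': under M_j the profiled
    quadratic is ((P x)_j)^2 / P_jj with P = Sigma^{-1}, equal to
    (a x_j + b u_j)^2 / a; T is its maximum over j."""
    n = len(x)
    sigma = (1 - rho) * np.eye(n) + rho * np.ones((n, n))
    prec = np.linalg.inv(sigma)
    px = prec @ x
    return np.max(px**2 / np.diag(prec))

# With rho = 0 this reduces to the ad hoc statistic max_i x_i^2.
x = np.array([0.3, -2.1, 0.8, 1.1])
print(lrt_statistic(x, 0.0), np.max(x**2))
```

The ρ = 0 check confirms the reduction to the ad hoc statistic noted above.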

A Bayesian Test
On the Bayesian side, it is convenient to view this as the model selection problem of deciding between the n + 1 exclusive models

M_0 : θ = 0;  M_i : θ_i ≠ 0, θ_{(−i)} = 0, for i = 1, . . . , n,

where θ_{(−i)} is the vector of all θ_j except θ_i. The simplest prior assumption computationally is that, for the nonzero θ_i under M_i, θ_i ∼ N(0, τ²); initially we will assume τ² to be known, but later will consider it to be unknown. Then under model M_i, the marginal likelihood is m_i(x) = N_n(x; 0, Σ_i), where Σ_i = Σ + τ² e_i e_i^T. The posterior probability of M_i (that the i-th channel has the signal) is then

P(M_i | x) = P(M_i) m_i(x) / Σ_{j=0}^n P(M_j) m_j(x),

where P(M_j) is the prior probability of model M_j.
Theorem 3.1. For any ρ ∈ [0, 1) and positive integer n > 1, with P(M_0) = r and P(M_i) = (1 − r)/n, the null posterior probability is

P(M_0 | x) = [1 + ((1 − r)/(nr)) Σ_{k=1}^n (1 + τ²a)^{−1/2} exp(τ²(a x_k + b u_k)²/(2(1 + τ²a)))]^{−1},

and the posterior probability of an alternative model M_i is

P(M_i | x) = [(nr/(1 − r))(1 + τ²a)^{1/2} exp(−τ²(a x_i + b u_i)²/(2(1 + τ²a))) + Σ_{k=1}^n exp((τ²/(2(1 + τ²a)))((a x_k + b u_k)² − (a x_i + b u_i)²))]^{−1},

where a and b are the diagonal and off-diagonal entries of Σ^{−1} and u_k = Σ_{j≠k} x_j.

Proof. The posterior probability of M_i is given by (3.4), P(M_i | x) = P(M_i) m_i(x)/Σ_j P(M_j) m_j(x). The expression can be simplified by computing Σ_i^{−1}, (Σ_i^{−1} − Σ_k^{−1}), and det(Σ_i). First notice that, by the Cholesky decomposition, Σ = LL^T for some lower triangular matrix L. Since Σ_i = Σ + τ_i τ_i^T with τ_i = (0, · · · , τ, · · · , 0)^T (the i-th element is τ), the Woodbury identity gives

Σ_i^{−1} = Σ^{−1} − (τ²/(1 + τ²a)) Σ^{−1} e_i e_i^T Σ^{−1},

so that x^T Σ_i^{−1} x = x^T Σ^{−1} x − τ²(a x_i + b u_i)²/(1 + τ²a). Also, the ratio of the two determinants is det(Σ_i)/det(Σ) = 1 + τ²a. Plugging these quantities back into (3.4) completes the proof.
Remark 3.2. (9.4) gives the full expression for P (M i | x), without using a, b, and is utilized in the subsequent proofs.
Corollary 3.3. In particular, when n = 2, so that a = 1/(1 − ρ²) and b = −ρ/(1 − ρ²), the null posterior probability is

P(M_0 | x) = [1 + ((1 − r)/(2r))(1 + τ²a)^{−1/2}(exp(τ²(a x_1 + b x_2)²/(2(1 + τ²a))) + exp(τ²(a x_2 + b x_1)²/(2(1 + τ²a))))]^{−1},

and the posterior probability of the alternative M_i, i ∈ {1, 2}, follows similarly from Theorem 3.1.
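The posterior probabilities can be checked numerically by computing the marginal likelihoods m_i(x) = N(x; 0, Σ + τ² e_i e_i^T) directly, without the closed form. A sketch, with prior r on M_0 and (1 − r)/n on each alternative; the values of ρ, τ², r and x are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior_probs(x, rho, tau2, r):
    """Posterior probabilities (P(M_0|x), P(M_1|x), ..., P(M_n|x)) computed
    directly from the marginal likelihoods m_i(x) = N(x; 0, Sigma + tau2
    e_i e_i^T), with prior r on M_0 and (1 - r)/n on each alternative."""
    n = len(x)
    sigma = (1 - rho) * np.eye(n) + rho * np.ones((n, n))
    weights = [r * multivariate_normal.pdf(x, cov=sigma)]
    for i in range(n):
        cov_i = sigma.copy()
        cov_i[i, i] += tau2          # alternative M_{i+1} inflates channel i
        weights.append((1 - r) / n * multivariate_normal.pdf(x, cov=cov_i))
    w = np.array(weights)
    return w / w.sum()

# A large observation in the second channel pulls posterior mass onto M_2.
x = np.array([0.1, 4.0, -0.2, 0.3])
p = posterior_probs(x, rho=0.3, tau2=4.0, r=0.5)
print(p.round(3))   # index 0 is M_0; index 2 is M_2
```

This brute-force version is useful for verifying closed-form expressions such as those in Theorem 3.1 and Corollary 3.3.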

The situation as the correlation goes to 1
The following theorem shows the surprising result that, when the dimension is greater than 2, the Bayesian method can correctly select the true model when the correlation goes to one. In two dimensions, however, there is nonzero probability of choosing the wrong alternative model if a non-null model is true.
Proof. By Lemma 9.2, when n = 2, under M_i, as ρ → 1 the likelihood ratios for both j = i and j = (−i) (the other alternative) go to infinity at the same asymptotic rate, so the two alternatives cannot be separated. When n > 2, the true alternative model has the largest likelihood ratio (= ∞); hence, the LRT is fully powered.
From Theorems 4.1 and 4.2, when the correlation goes to 1 and the dimension is larger than 2, both the Bayesian procedure and the LRT are fully powered. This surprising behavior as the correlation goes to one can be explained by the following observations, using (9.1).
When n = 2 and ρ → 1, one can correctly identify the null model if it is true, but cannot declare which non-null model is true when an alternative holds. Note that the ad hoc frequentist test does not have this behavior. As ρ → 1, the test
• still has probability α of incorrectly rejecting a true M_0;
• still has positive probability of not detecting a signal when M_i is true.
This highlights the danger (in terms of lack of power) of using 'intuitive' procedures for multiplicity control.
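The n > 2, ρ near 1 phenomenon is easy to reproduce numerically: any signal makes one coordinate stand out from the shared background noise, and the posterior piles onto the true alternative. A sketch with illustrative parameters (the signal size θ = 3 and the placement in channel 3 are our choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

# n > 2 channels, correlation near 1, a signal in channel 3 (index 2).
rng = np.random.default_rng(1)
n, rho, tau2, r, theta = 5, 0.999, 4.0, 0.5, 3.0
sigma = (1 - rho) * np.eye(n) + rho * np.ones((n, n))
x = rng.multivariate_normal(np.zeros(n), sigma)
x[2] += theta                                   # the true alternative is M_3

# Posterior over M_0, M_1, ..., M_n from the marginal likelihoods.
weights = [r * multivariate_normal.pdf(x, cov=sigma)]
for i in range(n):
    cov_i = sigma.copy()
    cov_i[i, i] += tau2
    weights.append((1 - r) / n * multivariate_normal.pdf(x, cov=cov_i))
post = np.array(weights) / np.sum(weights)
print(post.round(4))                            # nearly all mass on M_3
```

With 1 − ρ = 0.001, the idiosyncratic noise has standard deviation about 0.03, so a deviation of size 3 in one channel is essentially unmistakable.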

Asymptotic frequentist properties of Bayesian procedures
In this section, we study the false positive probability (FPP) theoretically and numerically. We first need to obtain asymptotic posteriors.
The summation term in (5.2) simplifies, by the Law of Large Numbers and since Z ∼ N(0, 1), to the asymptotic form stated in Lemma 5.1.
Remark 5.2. Note that, by Lemma 9.2, Lemma 5.1 can be written in terms of the z_i.

Remark 5.3. Figure 1 shows the ratio of the estimated P(M_1 | x) (from Lemma 5.1) to the true probability (from Theorem 3.1) as n grows. Each plot contains 200 different ratio curves based on independent simulations with fixed ρ, P(M_0) and τ. As can be seen, the ratio goes to 1 as n grows, and the convergence rate indeed depends on the correlation.

Remark 5.4. Figure 2 gives the estimated and true posterior probability of M_1 under the assumption that the null model is true, for fixed r = ρ = 0.5 and n, but varied τ. Notice that, for fixed n, the estimated probability is closer to the true probability when τ is small but is worse for larger τ, indicating that larger n is required to obtain the same precision as τ grows.

The following theorem shows the surprising result that, as n grows when the null model is true, the posterior probability of the null model converges to its prior probability. Thus one cannot learn that the null model is true.

Theorem 5.5. As n → ∞, for ρ ∈ [0, 1), under the null model, P(M_0 | x) → P(M_0) = r in probability.

Proof. First note the following.

The summation term in the null posterior (Theorem 3.1) becomes ((1 − r)/(nr)) Σ_{k=1}^n (m_k(x)/m_0(x))(1 + o(1)) (by Lemma 9.2). Each ratio m_k(x)/m_0(x) is a Bayes factor with expectation 1 under the null, so the summation converges to (1 − r)/r by the Strong Law of Large Numbers, and hence P(M_0 | x) → [1 + (1 − r)/r]^{−1} = r.
Remark 5.6. Figure 3 shows simulations of the null posterior probability for different numbers of hypotheses and different correlations. Interestingly, by Theorem 4.1, the Bayes procedure identifies the correct model (here the null model) when n is fixed and the correlation goes to 1, resulting in higher initial posterior probability of the null model for highly correlated cases. On the other hand, by Theorem 5.5, this posterior probability converges to its prior probability regardless of the correlation. This convergence can be seen in Figure 3.
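Theorem 5.5 can also be illustrated numerically. A minimal sketch for the uncorrelated case ρ = 0, where the summand in the null posterior of Theorem 3.1 reduces to (1 + τ²)^{−1/2} exp(τ²x_i²/(2(1 + τ²))); the parameter values are illustrative:

```python
import numpy as np

# Under M_0 with rho = 0, each summand in the null posterior is a Bayes
# factor with expectation exactly 1, so the average of the summands -> 1
# and P(M_0 | x) -> r: the data cannot push the null posterior off its prior.
rng = np.random.default_rng(2)
r, tau2 = 0.5, 0.5

def null_posterior(n):
    x = rng.standard_normal(n)                        # data under M_0
    terms = np.exp(tau2 * x**2 / (2 * (1 + tau2))) / np.sqrt(1 + tau2)
    return 1.0 / (1.0 + (1 - r) / (n * r) * terms.sum())

for n in (10**2, 10**4, 10**6):
    print(n, null_posterior(n))                       # settles near r = 0.5
```

The convergence is slow because the summands are heavy-tailed; this is visible in the simulated paths.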

False positive probability
Here we focus on the major goal, to find the frequentist false positive probability under the null model of the Bayesian procedure. To begin, we must formally define the Bayesian procedure for detecting a signal.
Definition 5.7 (Bayesian detection criterion). Accept model M_i if its posterior probability P(M_i | x) is greater than a specified threshold p ∈ (0, 1). If multiple models pass this threshold, choose the one with the largest posterior probability.

Proof. Under the null model, by (5.1), P(M_i | x) ≥ p is equivalent to a lower bound on z_i², and by Fact 9.2 the probability of exceeding this bound goes to zero at a polynomial rate in n.

This is a surprising and unsettling result: a standard Bayesian procedure yields a false positive probability that goes to zero at a polynomial rate. This is much too strong error probability control from a frequentist perspective; that it happens on the Bayesian side is surely an indication that assuming τ² to be known is too strong an assumption. Hence we turn to a more flexible approach in the next section.
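The rapid decay of the FPP for fixed τ² can be seen by Monte Carlo. A sketch for ρ = 0, where the posterior has a simple closed form; the parameters (r, τ², p) are illustrative:

```python
import numpy as np

# Monte Carlo estimate of the false positive probability (FPP) of the
# detection rule in Definition 5.7, under the null model with rho = 0.
rng = np.random.default_rng(3)
r, tau2, p_thresh = 0.5, 1.0, 0.5

def false_positive(n):
    """One null-model data set; True if some alternative passes the threshold."""
    x = rng.standard_normal(n)
    terms = np.exp(tau2 * x**2 / (2 * (1 + tau2))) / np.sqrt(1 + tau2)
    denom = 1.0 + (1 - r) / (n * r) * terms.sum()
    post_alt = ((1 - r) / (n * r)) * terms / denom   # P(M_i | x) for i >= 1
    return post_alt.max() > p_thresh

fpps = {}
for n in (10, 100, 1000):
    fpps[n] = np.mean([false_positive(n) for _ in range(5000)])
    print(n, fpps[n])   # decreasing rapidly in n
```

Even at moderate n, false positives become rare events, in line with the polynomial rate discussed above.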

Adaptive choice of τ 2
To increase the frequentist power of the Bayes test, we consider adaptive choices of τ 2 . First, we consider the choice that maximizes the false positive probability. Then we consider a Type II maximum likelihood approach based on estimating τ 2 .
6.1 The adaptive τ² which maximizes FPP

Theorem 6.1. Given null model prior probability r, correlation ρ, and decision threshold p, as n → ∞, the choice of τ² that maximizes the FPP is the τ_n² given in (6.1).

Proof. Without loss of generality, assume max_i z_i² = z_1². By the model selection criterion (Definition 5.7), z_1 yields a false positive if it lands in the rejection region. Lemma 9.5 establishes that (6.1) maximizes the FPP and that, with this choice of τ_n², the rejection region becomes z_1² > 2 log n + log log n + c for a constant c depending on p and r; the asymptotic FPP then follows from the standard normal tail approximation.

So, with this adaptive choice of τ², the FPP only goes to zero at a 1/(2 log n + log log n) rate, much slower than the polynomial rate achieved for fixed τ².

Remark 6.2. Figure 4 provides the simulated (red curve) and theoretical (blue) false positive probability with respect to the number of hypotheses n. As expected, the simulated results match the theoretical prediction, the rate of convergence being around 1/(2 log n + log log n). Note that the FPP does not become extremely small even for very large n.

Figure 4: Comparison of the simulated FPP and its asymptotic approximation when p = r = 0.5 and ρ = 0, as n varies from 10 to 3 × 10⁵; τ² is the adaptive choice.

Type II maximum likelihood estimation of τ 2
The Type II maximum likelihood approach to the choice of the prior under the alternative model replaces a pre-specified τ² with the prior variance, τ̂_n², which maximizes the marginal likelihood over all possible τ²; see [4] for discussion of this approach.

Lemma 6.3. Let L̄_n(τ²) be the marginal likelihood of τ² given (x_1, ..., x_n). Then the Type II MLE, τ̂_n², can be found by maximizing L̄_n(τ²) over τ² ≥ 0.

Proof.
Noting that a, b, Σ_0 and x^T Σ_0^{−1} x do not depend on τ², the result follows.
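Numerically, the Type II MLE can be obtained by maximizing the mixture marginal likelihood over τ² (we optimize on the log scale). This is a sketch, not the paper's closed-form criterion; the mixture form with prior r on M_0 and the example data are our illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import multivariate_normal

def type2_mle_tau2(x, rho, r=0.5):
    """Maximize the mixture marginal likelihood
    L(tau2) = r m_0(x) + ((1 - r)/n) sum_i N(x; 0, Sigma + tau2 e_i e_i^T)
    numerically over tau2 > 0 (parameterized on the log scale)."""
    n = len(x)
    sigma = (1 - rho) * np.eye(n) + rho * np.ones((n, n))

    def neg_log_marglik(log_tau2):
        tau2 = np.exp(log_tau2)
        total = r * multivariate_normal.pdf(x, cov=sigma)
        for i in range(n):
            cov_i = sigma.copy()
            cov_i[i, i] += tau2
            total += (1 - r) / n * multivariate_normal.pdf(x, cov=cov_i)
        return -np.log(total)

    res = minimize_scalar(neg_log_marglik, bounds=(-10, 10), method="bounded")
    return np.exp(res.x)

# One apparent signal (the 3.5) pulls the estimated prior variance upward.
x = np.array([0.2, -0.5, 3.5, 0.1, -0.3])
tau2_hat = type2_mle_tau2(x, rho=0.2)
print(tau2_hat)
```

The estimate adapts to the size of the largest apparent signal, which is exactly the mechanism that makes the Type II MLE procedure less conservative than a fixed τ².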
Theorem 6.4 (Type II MLE false positive probability). Given null prior probability r, correlation ρ, and decision threshold p, as n → ∞, P(false positive | null model, τ̂_n²) goes to zero at a logarithmic rate determined by a constant l* characterized in the proof.

Proof. First, Lemma 9.5 shows that (6.4) provides the absolute lower bound for z_1² to be in the rejection region, namely 2 log n + log log n + c(p, r). So the rejection region, denote it by Ω, corresponding to the Type II MLE choice of τ², must be a subset of (2 log n + log log n + c(p, r), ∞). Divide this interval into

Ω_1 = (2 log n + log log n + c(p, r), 2 log n + log log n + c(p, r) + K),
Ω_2 = (2 log n + log log n + c(p, r) + K, ∞),

where K will be chosen large, but fixed. We first determine Ω ∩ Ω_1.

This is equivalent to
which, using Lemma 9.11 (which shows that l* > c(p, r)), can easily be shown to have a unique solution in Ω ∩ Ω_1 (assuming K is larger than, say, 4 log(p/((1 − p)(1 − r)))). It is also then easy to show that

Ω ∩ Ω_1 = (2 log n + log log n + l* + o(1), 2 log n + log log n + c(p, r) + K).

By Fact 9.2, P(Ω_2) is bounded by a multiple of

exp(−[2 log n + log log n + c(p, r) + K]/2)/√(2 log n + log log n + c(p, r) + K).

Since c(p, r) and l* are fixed, we can clearly choose K large enough to make this smaller than any specified ε > 0. Hence the region Ω_2 can be ignored in the computation of the FPP. (It is almost certainly part of the rejection region, but we do not know what τ̂_n² is for observations in that region and, hence, cannot say for sure.) The same argument gives

P(Ω ∩ Ω_1) = P((2 log n + log log n + l*, ∞))(1 + O(ε)).
Since ε can be made arbitrarily small, the result follows.
The Type II MLE FPP converges to 0 at a logarithmic rate in n, as did the maximal Bayesian FPP. Thus both are far less conservative than the Bayesian procedures with specified τ². Finally, it is interesting that neither of the adaptive asymptotic FPPs depends on ρ.
Remark 6.5. Figure 5 demonstrates how the threshold p (Definition 5.7) can be chosen to achieve a fixed FPP of 0.05. Because, for fixed p, the FPP goes to zero as a function of n, smaller p are needed to achieve a fixed FPP as n grows. Note that the variation in p is actually quite small over the very large range of n considered in the figure.

Remark 6.6. Figure 6 gives the value of 1

Remark 6.7. Figure 7 demonstrates how the detection power varies as the signal size increases.

Analysis as the information grows
In this section, we generalize model (3.1) to the scenario where each channel has m i.i.d. observations. The sample mean then satisfies X̄ ∼ N_n(θ, Σ/m); hence, m can be seen as the precision of X̄. More generally, we will replace 1/m by a function σ_n², where σ_n² decreases to zero as n grows. The theorem below gives the rate of decrease of σ_n² which guarantees consistency. For the i.i.d. case, consistency of all models is guaranteed only if m grows faster than log n; consistency fails if m grows slower than log n; and consistency depends on the parameter value if m is O(log n).
Theorem 7.1. Consider model (3.1), with the covariance matrix Σ replaced by σ_n² Σ.
1. When σ_n² log n → 0, consistency holds for both the null and the alternative models.
2. When σ_n² log n → ∞, consistency fails.
3. When σ_n² log n converges to a nonzero constant, consistency depends on the value of the nonzero mean.
Since d_n = o(log n), 1 − c_n = o(1), and 0 < c_n < 1, one can apply Theorem 9.7 to obtain the asymptotic analysis of (7.3), which gives the result under the null hypothesis. Under the alternative model M_j, the result follows from Theorem 3.1 by analyzing the first and last terms of the posterior in the same manner.
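Theorem 7.1 can be illustrated in the ρ = 0 case, where the null posterior has a simple closed form (the ρ = 0 specialization of Theorem 3.1 with unit variance replaced by 1/m). A sketch with illustrative parameters:

```python
import numpy as np

# With per-channel variance sigma_n^2 = 1/m, the null posterior heads
# toward 1 once m grows faster than log n, but stays near the prior r for
# fixed m = 1 (the Theorem 5.5 phenomenon).
rng = np.random.default_rng(4)
r, tau2, n = 0.5, 1.0, 2000

def null_posterior(m):
    s2 = 1.0 / m
    xbar = rng.standard_normal(n) * np.sqrt(s2)   # sample means under M_0
    terms = np.sqrt(s2 / (s2 + tau2)) * np.exp(xbar**2 * tau2 / (2 * s2 * (s2 + tau2)))
    return 1.0 / (1.0 + (1 - r) / (n * r) * terms.sum())

avg_fixed = np.mean([null_posterior(1) for _ in range(200)])   # near r = 0.5
avg_grow = np.mean([null_posterior(n) for _ in range(200)])    # near 1
print(avg_fixed, avg_grow)
```

The contrast between the two averages mirrors the trichotomy above: with growing per-channel information the null model is learned, while with fixed information the null posterior is pinned at its prior.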

Conclusions
The main purpose of this work was to gain understanding of the behavior of Bayesian procedures that control for multiple testing, under a scenario of high dependence among test statistics, where frequentist methods for multiplicity control become more difficult to implement when trying to maintain high power. In Section 4, the Bayesian procedure was shown to have unexpectedly high power as the correlation gets large, providing an illustration of the gains that can be had by approaching multiplicity control from the Bayesian side. (Bayes theorem often produces things that we could not have produced through our intuition alone.) The other main issue concerning the behavior of the Bayesian procedure is the extent to which it also exhibits desirable frequentist control. Surprising to us was that the Bayesian procedure exhibited too-strong frequentist control, with the FPP (false positive probability under the null model) going to zero at a polynomial rate, as the number n of tests grows. To a Bayesian who believed in the prior distribution that was utilized this would not be viewed as a problem, but we tend to prefer procedures that have a dual Bayesian/frequentist interpretation. To this end, adaptive versions of the Bayesian procedure were considered, and found to have FPP's going to 0 at the much slower 1/ log n rate; indeed, unless n is huge, the resulting FPPs were reasonably moderate.
A number of other surprises were also encountered, such as the fact that, as the number of tests n grows, the posterior probability of the null model converges to its prior probability. (This is actually a very general phenomenon that will be reported elsewhere.) The situation of having m i.i.d. replicate observations per channel was also considered, and it was shown that consistency holds only when m grows faster than log n.

Appendix

The proof can be found in [7]. By expanding a and b in Theorem 3.1, one obtains the following explicit form for the posterior probabilities.

Corollary 9.3. The posterior probability of any non-null model M_i is given by (9.4); alternatively, it can be written in terms of z(x), with z_i = z_i(x).

Fact 9.6 (Weak law of large numbers for triangular arrays). Under suitable moment conditions, (S_n − α_n)/β_n → 0 in probability, where S_n = X_{n,1} + · · · + X_{n,n} and α_n = Σ_{k=1}^n E[X_{n,k}].

See [7] for the proof.

Theorem 9.7. If c_n ∈ (0, 1) for all n and 1 − c_n = o(1), then (√(1 − c_n)/n) Σ_{i=1}^n exp(c_n z_i²/2) → 1 in probability.

Proof. Take X_{n,i} = exp(c_n z_i²/2) and β_n = n/√(1 − c_n) in Fact 9.6; checking the first assumption of the weak law for triangular arrays gives the result.

For the Type II MLE analysis, first note that L̄_n(τ_n²) → 0 when log n/τ_n² → ∞. As the maximum of the marginal likelihood thus exceeds the maximum over the domains log n/τ_n² → ∞ and log n/τ_n² → 0, the proof is complete.