Coarse-to-fine Multiple Testing Strategies

We analyze control of the familywise error rate (FWER) in a multiple testing scenario with a great many null hypotheses about the distribution of a high-dimensional random variable, among which only a very small fraction are false, or "active". In order to improve power relative to conservative Bonferroni bounds, we explore a coarse-to-fine procedure adapted to a situation in which the tests are partitioned into subsets, or "cells", and active hypotheses tend to cluster within cells. We develop procedures for a standard linear model with Gaussian data and for a non-parametric case based on generalized permutation testing, and demonstrate considerably higher power than Bonferroni estimates at the same FWER when the active hypotheses do cluster. The main technical difficulty arises from the correlation between the test statistics at the individual and cell levels, which increases the likelihood of a hypothesis being falsely discovered when the cell that contains it is falsely discovered (survivorship bias). This requires sharp estimates of certain quadrant probabilities when a cell is inactive.

1. Introduction. We consider a multiple testing scenario encountered in many current applications of statistics. Given a large index set V and a family (H_0(v), v ∈ V) of null hypotheses about the distribution of a high-dimensional random vector U ∈ R^d, we wish to design a procedure, basically a family of test statistics and thresholds, to estimate the subset A ⊂ V over which the null hypotheses are false. We shall refer to A as the "active set" and write Â = Â(U) for our estimator of A based on a random sample U of size n from U. The hypotheses in Â(U) (namely the ones for which the null is rejected) are referred to as "detections" or "discoveries." Naturally, the goal is to maximize the number |A ∩ Â(U)| of detected true positives while simultaneously controlling the number |A^c ∩ Â(U)| of false discoveries.
There are two widely used criteria for controlling false positives. FWER: Assume that U is defined on the probability space (Ω, P). The family-wise error rate (FWER) is FWER(Â) = P(Â(U) ∩ A^c ≠ ∅), which is the probability of making at least one false discovery. This is usually controlled using Bonferroni bounds and their refinements [9, 13, 11, 10], or using resampling methods or random permutations.

FDR: The false discovery rate (FDR) is the expected ratio between the number of false alarms |A^c ∩ Â(U)| and the number of discoveries |Â(U)| [3, 5, 4].
In many cases, including the settings in computational biology which directly motivate this work, we find |A| ≪ |V| and n ≪ d, as well as small "effect sizes." This is the case, for example, in genome-wide association studies (GWAS), where U = (Y, X_v, v ∈ V) and the dependence of the "phenotype" Y on the "genotype" (X_v, v ∈ V) is often assumed to be linear; the active set A consists of those v with non-zero coefficients, and effect size refers to the fraction of the total variance of Y explained by a particular X_v. Under these challenging circumstances, the FWER criterion is usually very conservative and power is limited; that is, the number of true positive detections is often very small (if not zero) compared to |A| (the "missing heritability"). This is why the less conservative FDR criterion is sometimes preferred: it allows for a higher number of true detections, but of course at the expense of false positives. However, there are situations, such as GWAS, in which this tradeoff is unacceptable; for example, collecting more data and doing follow-up experiments may be too labor-intensive or expensive, and therefore having even one false discovery may be deemed undesirable.
To set the stage for our proposal, suppose we are given a family T_v = T_v(U), v ∈ V, of test statistics and can assume that deviations from the null are captured by small values of T_v(U) (e.g., p-values). We make the usual assumption, easily achieved in practice, that the distribution of T_v(U) does not depend on v when v ∈ A^c, and that individual rejection regions are of the form {u ∈ U : T_v(u) ≤ θ} for a constant θ independent of v. Defining Â(U) = {v : T_v(U) ≤ θ}, the Bonferroni upper bound is FWER(Â) ≤ |V| max_{v ∈ A^c} P(T_v(U) ≤ θ). To ensure that FWER ≤ α, θ = θ_B is selected such that P(T_v(U) ≤ θ_B) ≤ α/|V| whenever v ∈ A^c. The Bonferroni bound can only be marginally improved in the general case (see, in particular, the estimator of [13], which will be referred to as Bonferroni-Holm in the rest of the paper). While alternative procedures (including permutation tests) can be designed to take advantage of correlations among tests, the bound is sharp when |V| ≫ |A| and the tests are independent.
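For concreteness, the Bonferroni selection rule just described can be sketched in a few lines of Python (the helper name and toy data are ours, for illustration only):

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0(v) whenever its p-value is at most alpha / |V|.

    The p-values play the role of the statistics T_v(U): small
    values signal a deviation from the null."""
    p = np.asarray(p_values)
    theta_B = alpha / p.size            # Bonferroni threshold
    return np.flatnonzero(p <= theta_B)

# Toy example: 10,000 hypotheses, of which the first 5 are active.
rng = np.random.default_rng(0)
p = rng.uniform(size=10_000)
p[:5] = 1e-7
detected = bonferroni_reject(p, alpha=0.05)
```

With |V| = 10,000 and α = 0.05, the per-hypothesis threshold is 5 × 10⁻⁶, illustrating how conservative the bound becomes as |V| grows.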
Coarse-to-fine Testing: Clearly, some additional assumptions or domain-specific knowledge are necessary to ameliorate the reduction in power resulting from controlling the FWER. Motivated by applications in genomics, we suppose the set V has a natural hierarchical structure. In principle, it should then be possible to gain power if the active hypotheses are not randomly distributed throughout V but rather have a tendency to cluster within cells of the hierarchy. In fact, we shall consider the simplest example, consisting of only two levels corresponding to individual hypotheses indexed by v ∈ V and a partition of V into non-overlapping subsets (g ⊂ V, g ∈ G), which we call "cells." We will propose a particular multiple testing strategy which is coarse-to-fine with respect to this structure, controls the FWER, and whose power will exceed that of the standard Bonferroni-Holm approach for typical models and realistic parameters when a minimal degree of clustering is present. It is important to note that the clustering property is not a condition for correct control of the FWER at a given level using our coarse-to-fine procedure, but only for its increased efficiency in discovering active hypotheses.
Our estimate of A is now based on two families of test statistics: {T_v(U), v ∈ V}, as above, and {T_g(U), g ∈ G}. The cell-level test T_g is designed to assume small values only when g is "active," meaning that g ∩ A ≠ ∅. Our estimator of A is now Â(U) = {v : T_{g(v)}(U) ≤ θ_G, T_v(U) ≤ θ_V}, where g(v) denotes the cell containing v. One theoretical challenge of this method is to derive a tractable method for controlling the FWER at a given level α. Evidently, this method can only outperform Bonferroni if θ_V > θ_B; otherwise, the coarse-to-fine active set is a subset of the Bonferroni discoveries. A key parameter is J, an upper bound on the number of active cells, and in the next section we will derive an FWER bound, denoted Φ, under an appropriate compound null hypothesis.
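The two-threshold estimator can likewise be sketched as follows (a minimal illustration with hypothetical helper names, assuming the statistics T_v and T_g have already been computed):

```python
import numpy as np

def coarse_to_fine_select(T_v, cell_of, T_g, theta_G, theta_V):
    """Return the indices v with T_{g(v)} <= theta_G and T_v <= theta_V.

    T_v: index-level statistics, one per v.
    cell_of: cell_of[v] is the cell g(v) containing v.
    T_g: cell-level statistics, indexed by cell."""
    T_v = np.asarray(T_v)
    cell_ok = np.array([T_g[cell_of[v]] <= theta_G for v in range(T_v.size)])
    return np.flatnonzero(cell_ok & (T_v <= theta_V))

# Small example: cell 0 survives the coarse step, cell 1 does not,
# so index 2 is screened out despite its small individual statistic.
selected = coarse_to_fine_select(
    T_v=[0.01, 0.5, 0.02, 0.9],
    cell_of=[0, 0, 1, 1],
    T_g={0: 0.03, 1: 0.8},
    theta_G=0.05, theta_V=0.05)
```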
The main results of the paper are in the ensuing analysis for different models for U. In each case, the first objective is to compute the FWER bound Φ for a given θ_G and θ_V, and the second objective is to maximize the power over all pairs (θ_G, θ_V) which satisfy Φ ≤ α. The smaller our upper bound on J, the stronger the clustering of active hypotheses in cells and the greater the gain in power compared with the Bonferroni bound. In particular, as soon as J ≪ |G|, the coarse-to-fine strategy will lead to a considerably less conservative score threshold for individual hypotheses relative to the Bonferroni estimate, and the coarse-to-fine procedure will yield an increase in power for a given FWER. Again, our assumptions about clustering are only expressed through an upper bound on J; no other assumptions about the distribution of A are made, and the FWER is controlled in all cases.
The main technical difficulty arises from the correlation between the corresponding test statistics. This must be taken into account since it increases the likelihood of an individual index v being falsely declared active when the cell g(v) that contains it is falsely discovered (survivorship bias). More specifically, we require sharp estimates of quadrant probabilities under the joint distribution of T_{g(v)}(U) and T_v(U) when g(v), the cell containing v, is inactive. All these issues will be analyzed in two cases. First, we will consider the standard linear model with Gaussian data. In this case, Φ is expressed in terms of central chi-square distributions and the power is expressed in terms of non-central chi-square distributions. The efficiency of the coarse-to-fine method in detecting active hypotheses will depend on effect sizes, both at the level of cells and of individual v, among other factors. A non-parametric procedure will then be developed in section 4 based on generalized permutation testing and invariance assumptions. Finally, we shall derive a high-confidence upper bound on J based on a martingale argument. Extensive simulations comparing the power of the coarse-to-fine and Bonferroni-Holm procedures appear throughout.
Applications and Related Work: As indicated above, our work (and some of our notation) is inspired by statistical issues arising in GWAS [7, 8, 2] and related areas in computational genomics. In the most common version of GWAS, the "genotype" of an individual is represented by the genetic states X_v at a very large family of genomic locations v ∈ V; these variations are called single nucleotide polymorphisms, or SNPs. In any given study, the objective is to find those SNPs A ⊂ V "associated" with a given "phenotype," for example a measurable trait Y such as height or blood pressure. The null hypothesis for SNP v is that Y and X_v are independent random variables, and whereas |V| may run into the millions, the set A of active variants is expected to number fewer than one hundred. (Ideally, one seeks the "causal" variants, an even smaller set, but separating correlation and causality is notoriously difficult.) Control of the FWER is the gold standard and the linear model is common. If the considered variants are confined to coding regions, then the set of genes provides a natural partition of V (and the fact that genes are organized into pathways provides a natural three-level hierarchy) [14]. Another application of large-scale multiple testing is variable filtering in high-dimensional prediction: the objective of predicting a categorical or continuous variable Y based on a family of potentially discriminating features is often facilitated by limiting a priori the set of features utilized in training Ŷ to a subset A ⊂ V, determined by testing the features one-by-one for dependence on Y and setting a significance threshold. In most applications of machine learning to artificial perception, no premium is placed on pruning A to a highly distinguished subset; indeed, the particular set of selected features is rarely examined or considered of significance. In contrast, the identities of the particular features selected and appearing in decision rules are often of keen interest in computational
genomics, e.g., in discovering cancer biomarkers, where the variables X_v represent "omics" data (e.g., gene expression) and Y codes for two possible cellular or disease phenotypes. Obtaining a "signature" Â devoid of false positives can be beneficial in understanding the underlying biology and interpreting the decision rules. In this case, the Gene Ontology (GO) [1] provides a very rich hierarchical structure, one example being the organization of genes into pathways. Indeed, building predictors to separate "driver mutations" from "passenger mutations" in cancer would appear to be a promising candidate for coarse-to-fine testing, due to the fact that drivers are known to cluster in pathways.
There is a literature on coarse-to-fine pattern recognition (see, e.g., [6] and the references therein), but the emphasis has traditionally been on computational efficiency rather than error control; computation is not considered here. Moreover, in most of this work, especially in applications to vision and speech, the emphasis is on detecting true positives (e.g., patterns of interest such as faces) at the expense of false positives. Simply "reversing" the roles of true positives and negatives is not feasible due to the loss of reasonable invariance assumptions; in effect, every pattern of interest is unique.
Finally, in [16], a hierarchical testing approach is used in the context of the FWER. However, the intention there is to improve the power of detection relative to Bonferroni-Holm methods only at the level of clusters of hypotheses; in contrast to our method, the two approaches have comparable power at the level of individual hypotheses.
Organization of the Paper: The paper is structured as follows. In section 2 we present a Bonferroni-based inequality that will be central to controlling the FWER using the coarse-to-fine method in different models. In section 3 we consider a parametric model that illustrates precisely the way we control the FWER at a fixed level and permits a power comparison between coarse-to-fine and Bonferroni-Holm. We then propose a non-parametric procedure in section 4 under general invariance assumptions. A method for estimating an upper bound on the number of active cells, and incorporating it into the testing procedure without violating the FWER constraint, is derived in section 5. Finally, some concluding remarks are made in the Discussion.
2. Coarse-to-fine framework. The finite family of null hypotheses will be denoted by (H_0(v), v ∈ V), where each H_0(v) is either true or false. We are interested in the active set of indices, A = {v ∈ V : H_0(v) = false}, and will write V_0 = A^c for the set of inactive indices. Suppose our data U takes values in U. The set Â(U) is commonly designed based on individual rejection regions Γ_v ⊂ U, with Â(U) = {v : U ∈ Γ_v}. As indicated in the previous section, in the conservative Bonferroni approach, the FWER is controlled at level α by ensuring |V| max_{v ∈ V_0} P(U ∈ Γ_v) ≤ α. If the rejection regions are designed so that this probability is independent of v whenever H_0(v) = true, then the condition boils down to P(U ∈ Γ_v) ≤ α/|V|. While there is not much one can do in the general case to improve on the Bonferroni method, it is possible to improve power if V is structured and one has prior knowledge about the way the active hypotheses are organized relative to this structure. In this paper, we consider a coarse-to-fine framework in which V is provided with a partition G, so that V = ∪_{g∈G} g, where the subsets g ⊂ V (which we will call cells) are non-overlapping. For v ∈ V, we let g(v) denote the unique cell g that contains it. The "coarse" step selects cells likely to contain active indices, and is followed by a "fine" step in which a Bonferroni or equivalent procedure is applied only to hypotheses included in the selected cells. More explicitly, we will associate a rejection region Γ_g to each g ∈ G and consider the discovery set Â(U) = {v : U ∈ Γ_{g(v)} and U ∈ Γ_v}. We will say that a cell g is active if and only if g ∩ A ≠ ∅, which we shall also express as H_0(g) = false, implicitly defining H_0(g) as the logical "and" of all H_0(v), v ∈ g. We will also consider the double null hypothesis H_00(v) = H_0(g(v)) of v belonging to an inactive cell (which obviously implies that v is inactive too), and we will let V_00 ⊂ V_0 be the set of such v's.
Let ν_G denote the size of the largest cell in G and J the number of active cells. We will develop our procedure under the assumption that J is known, or at least bounded from above. While this can actually be a plausible assumption in practice, we will relax it in section 5, in which we design a procedure to estimate a bound on J. Under these assumptions we have the following result.

Proposition 2.1. FWER(Â) ≤ |V| p_00(θ_G, θ_V) + ν_G J p_0(θ_V), where p_00(θ_G, θ_V) = P(T_{g(v)}(U) ≤ θ_G, T_v(U) ≤ θ_V) for v ∈ V_00 and p_0(θ_V) = P(T_v(U) ≤ θ_V) for v ∈ V_0.

Proof. This is just the Bonferroni bound applied to the decomposition of the false discoveries into those lying in inactive cells (at most |V| indices, each contributing p_00) and those lying in active cells (at most ν_G J inactive indices, each contributing p_0).

The sets Γ_g and Γ_v will be designed using statistics T_g(U) and T_v(U) with rejection regions {T_g ≤ θ_G} and {T_v ≤ θ_V} for some constants θ_G and θ_V, assuming that the distribution of (T_{g(v)}(U), T_v(U)) does not depend on v for v ∈ V_00, nor that of T_v(U) for v ∈ V_0. In the following sections our goal will be to design θ_G and θ_V such that this upper bound is smaller than a predetermined level α. Controlling the second term will lead to less conservative choices of the constant θ_V (compared to the Bonferroni estimate) as soon as ν_G J ≪ |V| (or J ≪ |G| if all cells have comparable sizes). Depending on the degree of clustering, the probability p_00 of false detection in the two-step procedure can be made much smaller than p_0 without harming the true detection rate, and the coarse-to-fine procedure will yield an increase in power for a given FWER. We require tight estimates of p_00, and taking into account the correlation between T_{g(v)}(U) and T_v(U) is necessary to deal with "survivorship bias."
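As an illustration of how such a bound can be used in practice, the following sketch searches a grid of thresholds for the pair satisfying the constraint with the least conservative θ_V. The functions p00 and p0 below are toy stand-ins (a product form, as under independence and uniform nulls), not the model-based bounds derived later:

```python
import numpy as np

def best_thresholds(p00, p0, V, nu_G, J, alpha, grid):
    """Among pairs (theta_G, theta_V) on the grid satisfying
        V * p00(theta_G, theta_V) + nu_G * J * p0(theta_V) <= alpha,
    return the one with the largest theta_V (i.e., the least
    conservative index-level threshold)."""
    best = None
    for tG in grid:
        for tV in grid:
            if V * p00(tG, tV) + nu_G * J * p0(tV) <= alpha:
                if best is None or tV > best[1]:
                    best = (tG, tV)
    return best

# Toy stand-ins for the two null probabilities.
p00 = lambda tG, tV: tG * tV
p0 = lambda tV: tV
thetas = best_thresholds(p00, p0, V=1000, nu_G=10, J=5,
                         alpha=0.1, grid=[1e-4, 1e-3, 1e-2, 1e-1])
```

In this toy configuration the selected index-level threshold (10⁻³) is an order of magnitude larger than the Bonferroni threshold α/|V| = 10⁻⁴, which is the mechanism by which power is gained.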

3. Model-based derivation.
3.1. Regression model. In this section, the observation is a realization of an i.i.d. family of random variables U = ((Y^k, X^k), k = 1, ..., n), where the Y's are real-valued and X^k = (X^k_v, v ∈ V) is a high-dimensional family of variables indexed by the set V. We assume that the variables X^k_v, v ∈ V, are independent and centered Gaussian with variance σ_v^2, and that Y^k = Σ_{v∈A} a_v X^k_v + ξ^k, where ξ^1, ..., ξ^n are i.i.d. centered Gaussian with variance σ^2 and a_v, v ∈ A, are unknown real coefficients. We will denote by Y the vector (Y^1, ..., Y^n) and set Ȳ = (Σ_{k=1}^n Y^k / n) 1_n, where 1_n is the vector composed of ones repeated n times. We also let σ²_{Y−g} denote the variance of Y with the contribution of the variables in g removed. Finally, we will denote by σ_Y^2 the common variance of Y^1, ..., Y^n and assume that it is known (or estimated from the observed data).
3.2. Scores. For v ∈ V, we denote by P_v the orthogonal projection on the subspace S_v spanned by the two vectors X_v and 1_n. We will also denote by P_g (g ∈ G) the orthogonal projection on the subspace S_g spanned by the vectors X_v, v ∈ g, and 1_n. The scores at the g level and at the v level will be, respectively, T_g(U) = ‖P_g Y‖² − ‖Ȳ‖² and T_v(U) = ‖P_v Y‖² − ‖Ȳ‖², suitably normalized. (The projections are simply obtained by least-squares regression of Y on X_v, v ∈ g, for P_g, and on X_v for P_v.) We now provide estimates of p_00 and p_0. Note that, because we consider residual sums of squares, we here use large values of the scores in the rejection regions (instead of small values as in the introduction and other parts of the paper), hopefully without risk of confusion.
Proposition 3.1. For all θ_G and θ_V, p_00(θ_G, θ_V) and p_0(θ_V) admit explicit upper bounds expressed in terms of G_β(x, a, b), the CDF of a β(a, b) distribution evaluated at x, and of F_k, the c.d.f. of a chi-squared distribution with k degrees of freedom.
Proof. For v ∈ V_00 and g = g(v), let I_n denote the n-dimensional identity matrix, let P̃_v be the projection on the orthogonal complement of 1_n in S_v, and let P̃_g be the projection on the orthogonal complement of S_v in S_g. Applying Cochran's theorem to P̃_g(Y/σ_{Y−g}) and P̃_v(Y/σ_{Y−g}), which are conditionally independent given (X_v, v ∈ g), reduces the problem to finding an upper bound for P(η + ζ ≥ θ_G, ζ ≥ θ_V), where η is χ²(ν_G − 1), ζ is χ²(1), and the two variables are independent. Splitting this probability according to whether ζ exceeds θ_G leads to a sum of two terms. For the first term, E[1_{θ_V ≤ ζ ≤ θ_G} P(η ≥ θ_G − ζ | ζ)], we use the following tail inequality for χ²(k) random variables: P(χ²(k) ≥ kz) ≤ (z e^{1−z})^{k/2} for any z > 1. We apply this result to k = ν_G − 1 and z = (θ_G − ζ)/(ν_G − 1). Since the density of a χ²(1) is proportional to exp(−ζ/2) ζ^{−1/2}, the term in exp(ζ/2) arising from the bound cancels in the resulting integral (expectation). A simple change of variables in the remaining integral then yields the final upper bound of the proposition. The second upper bound, for p_0(θ_V), is easily obtained, the proof being left to the reader.
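The chi-square tail inequality used in the proof, P(χ²(k) ≥ kz) ≤ (z e^{1−z})^{k/2} for z > 1, is a standard Chernoff-type bound; the following short check compares it with the exact survival function (an illustration only, relying on scipy):

```python
import math
from scipy.stats import chi2

def chi2_tail_bound(k, z):
    """Chernoff-type bound: P(chi2(k) >= k z) <= (z e^{1-z})^{k/2}, z > 1."""
    return (z * math.exp(1.0 - z)) ** (k / 2.0)

# Compare with the exact survival function on a small grid of (k, z).
for k in (1, 5, 20):
    for z in (1.5, 2.0, 4.0):
        assert chi2.sf(k * z, df=k) <= chi2_tail_bound(k, z)
```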
This leads us immediately to the following corollary: with the thresholds θ_G and θ_V, an upper bound of the FWER is given by (3). Figure 1 provides an illustration of the level curves associated with this FWER upper bound. More precisely, it illustrates the tradeoff between conservativeness at the cell level and at the individual index level. In the next section, the optimization for power will be made along these level lines. Figure 1 also provides the value of the Bonferroni-Holm threshold. For the coarse-to-fine procedure to be less conservative than the Bonferroni-Holm approach, we need the index-level threshold to be smaller, i.e., the optimal point on the level line to be chosen below the corresponding dashed line.
The derivation of (3) is based on the assumption that the cell size is fixed across all cells, which is not needed. When cell sizes vary, it is easy to generalize the previous upper bound by summing cell-dependent terms, with a θ_G that does not depend on the cell g.

3.3. Optimal thresholds. Equation (3) provides a constraint on the pair (θ_G, θ_V) to control the FWER at a given level α. We now show how to obtain "optimal" thresholds (θ*_G, θ*_V) that maximize discovery subject to this constraint. The discussion will also help in understanding how the clustering of active indices in cells improves the power of the coarse-to-fine procedure.

The conditional distribution of (‖P_g Y‖² − ‖Ȳ‖²)/σ²_{Y−g} given (X_v, v ∈ g) is a non-central chi-square χ²(ρ_g(X_v, v ∈ g), ν_g); we will work with a deterministic approximation of the non-centrality parameter, denoted ρ_g. A similar analysis at the index level yields a non-central χ²(ρ_v, 1) approximation for the individual score. We now have a simple lemma whose proof is based on the inequality of [15], valid for Z ∼ χ²(ρ, ν), which yields a lower bound on the detection probability as soon as θ ≤ ρ + ν, together with a simple lower bound for the individual term. This result can be applied, in our case, to (ρ_1, ν_1) = (nρ_g, |g|) and (ρ_2, ν_2) = (nρ_v, 1). More concretely, we fix a target effect size η (the ratio of the effect of X_v to the total variance of Y) and a target cluster size k, representing the number of active loci that we expect to find in an active cell, and we take ρ_v = η and ρ_g = kη; we then optimize the upper bound in (4) subject to the FWER constraint (3), θ_G ≤ nρ_g + |g| and θ_V ≤ nρ_v + 1, to find optimal constants (θ_G, θ_V) for this target case. This is illustrated with numerical simulations in the next section.

3.4. Simulation results (parametric case). Figure 2 compares the powers of the coarse-to-fine procedure and of the Bonferroni-Holm procedure under the parametric model described in this section.
The parameters chosen in our simulations were taken with our motivating application (GWAS) in mind. Thinking of Y as a phenotype and V as a set of SNPs, we assimilate cells g ∈ G to genes. We used |V| = 10000 and |g| = 10. The true number of active variables is 50, with a corresponding coefficient a_v = 1 for each of them, and we generate the data according to the linear model described in this section with a noise variance equal to 10. We assumed that we knew an upper bound J on the number of active cells (this assumption is relaxed in section 5). To compute the optimal thresholds, some values for ρ_g and ρ_v have to be chosen (this should not be based on observed data, since this would invalidate our FWER and power estimates). In our experiments, we optimize the upper bound on the probability for an active variable to be detected in an active cell by choosing ρ_g = 2/J and ρ_v = 1/J. This corresponds to an "almost noiseless case" in which the effect size of the "gene" is twice the effect size of the "SNP". As Figure 2 shows, the coarse-to-fine method is more powerful when the number of active indices per active cell is two or greater, confirming the intuition that the more the clustering assumption holds, the more powerful the coarse-to-fine method is compared to the Bonferroni-Holm approach.
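A data-generating sketch matching these simulation parameters may clarify the setup; the clustered placement of active indices (here five fully active 10-SNP cells) is our own illustrative choice:

```python
import numpy as np

def simulate(n=200, V=10_000, cell_size=10, n_active=50, a=1.0,
             noise_var=10.0, seed=0):
    """Draw (Y, X) from the linear model of this section.

    Active indices are packed into the first cells, so the 50 active
    SNPs fill the first five 10-SNP "genes" (maximal clustering)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, V))             # centered Gaussian SNPs
    active = np.arange(n_active)
    Y = X[:, active].sum(axis=1) * a \
        + np.sqrt(noise_var) * rng.standard_normal(n)
    cells = np.arange(V) // cell_size           # g(v) for each v
    return Y, X, active, cells

Y, X, active, cells = simulate()
```

Varying how the 50 active indices are spread over cells (from 5 fully active cells up to 50 cells with one active index each) is what controls the degree of clustering in the comparison above.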

4. Non-parametric coarse-to-fine testing.
4.1. Notation. Recall that U denotes the random variable representing all the data, taking values in U. We will build our procedure from user-defined scores, denoted ρ_v (at the locus level) and ρ_g (at the cell level), both defined on U, i.e., functions of the observed data.
Moreover, we assume that there exists a group action of some group S on U, which will be denoted (ξ, u) → ξ · u. For example, if, as in the previous section, one takes U = ((Y^k, X^k), k = 1, ..., n) with X^k = (X^k_v, v ∈ V), we will take S to be the permutation group of {1, ..., n}. To simplify the discussion, we will assume that S is finite and denote by µ the uniform probability measure on S. We note, however, that our discussion remains true if S is a compact group, µ is the right-invariant Haar probability measure on S, and ρ_v(ξ · u), ρ_g(ξ · u) are continuous in ξ.
We will also use the following well-known result.
Lemma 4.1. Let X be a random variable and let F⁻_X denote the left limit of its cumulative distribution function, i.e., F⁻_X(t) = P(X < t). Then, for t ∈ [0, 1], one has P(F⁻_X(X) ≥ 1 − t) ≤ t (with equality if F_X is continuous).
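A quick numerical check of Lemma 4.1 on a discrete distribution (where the inequality can be strict) may be helpful; the uniform distribution on five atoms below is an arbitrary example:

```python
import numpy as np

# X uniform on {0,...,4}: F^-(k) = P(X < k) = 0.2 k, an atomic c.d.f.
rng = np.random.default_rng(1)
x = rng.integers(0, 5, size=100_000)
F_minus = 0.2 * x                     # left-limit c.d.f. evaluated at X

# Lemma 4.1: P(F^-(X) >= 1 - t) <= t; strict for most t here
# because F_X has jumps.
for t in (0.1, 0.3, 0.5, 0.9):
    assert np.mean(F_minus >= 1 - t) <= t + 0.01   # Monte-Carlo slack
```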

4.2. Asymptotic resampling scores. We define the asymptotic scores at the cell and variable level by
(5) T_g(u) = µ(ξ : ρ_g(u) ≤ ρ_g(ξ · u)) and (6) T_v(u) = µ(ξ : ρ_v(u) ≤ ρ_v(ξ · u)).
T_g(U) and T_v(U) are the typical statistics used in randomized tests, estimating the proportion of scores that are higher than the observed one after randomization. For the coarse-to-fine procedure, we will need one more "conditional" statistic: for a given constant θ_G, we define T^{θ_G}_v(u) as the analogous proportion computed under µ conditioned on the event T_g(ξ · u) ≤ θ_G. We call our scores asymptotic in this section because exact expectations over µ cannot be computed in general, and can only be obtained as limits of Monte-Carlo samples; the practical finite-sample case will be handled in the next section. With this notation, we let Â(U) = {v : T_{g(v)}(U) ≤ θ_G, T_v(U) ≤ θ_V, T^{θ_G}_v(U) ≤ θ'_V}, which depends on the choice of the three constants θ_V, θ_G and θ'_V. We then have:
Theorem 4.1. For all v ∈ V_0,
(9) P(T_v(U) ≤ θ_V) ≤ θ_V,
and for all v ∈ V_00, with g = g(v),
(10) P(T_g(U) ≤ θ_G, T^{θ_G}_v(U) ≤ θ'_V) ≤ θ_G θ'_V.
This result tells us how to control the FWER for a two-level permutation test based on any scores in the (generally intractable) case in which we can exactly compute the test statistics, when we declare an index v active if and only if the three conditions defining Â(U) hold. Proof. For (9), we use a standard argument justifying randomization tests, which we provide here for completeness. If v ∈ V_0, the invariance assumption implies that ρ_v(U) is distributed like ρ_v(ξ · U) for every ξ. Letting ζ be the random variable on S defined by ζ(ξ') = ρ_v(ξ' · U), Lemma 4.1 applied to ζ proves (9).
Let us now prove (10), assuming v ∈ V_00 and letting g = g(v). We write the probability of the joint rejection event and find an upper bound for the right-hand side of the inequality. Using the invariance assumption, we have, for all ξ ∈ S, that (T_g(ξ · U), T^{θ_G}_v(ξ · U)) is distributed like (T_g(U), T^{θ_G}_v(U)); since µ is right-invariant, the corresponding averages over ξ are unchanged. Let μ̃ denote the probability µ conditioned on the event T_g(ξ · u) ≤ θ_G (u being fixed). Lemma 4.1, applied under μ̃, then implies that, for each u, μ̃(ξ : T^{θ_G}_v(ξ · u) ≤ θ'_V) ≤ θ'_V.
Applying Lemma 4.1 to the random variable ξ → ρ_g(ξ · U) for the probability distribution µ, we immediately get N^{θ_G}_g(U) ≤ θ_G, so that (10) follows. As an immediate corollary, we obtain a bound on the FWER. Remark 4.1 (Continuous approximation). Even though S is finite, it is a huge set in typical applications, and while Lemma 4.1 only provides inequalities for discrete distributions, we can safely ignore the discontinuity in practice and work as if the distributions to which we applied this lemma were continuous. Doing so, it is easy to convince oneself, by inspecting the previous proof, that our estimates become equalities (for v ∈ V_00, g = g(v)). As mentioned above, this result does not have practical interest since it requires applying all possible permutations to the data. In practice, a random subset of permutations is picked instead, and we will develop the related theory in the next section (using these inequalities as intermediary results in our proofs). 4.3. Finite resampling scores. We now replace T_g, T^{θ_G}_v and T_v with Monte-Carlo estimates and describe how the upper bounds in Theorem 4.1 need to be modified.
For a positive integer K, we let µ_K denote the K-fold product measure on S, whose realizations are K independent group elements ξ = (ξ_1, ..., ξ_K) ∈ S^K. We will use the notation P̄ and Ē to denote probability and expectation for the joint distribution of U and ξ (i.e., P̄ = P ⊗ µ_K). We will also denote by μ̂_ξ the empirical measure of ξ_1, ..., ξ_K. With this notation, we define the Monte-Carlo scores T̂_g, T̂^{θ_G}_v and T̂_v by replacing µ with μ̂_ξ in the definitions of the previous section, and state:
Theorem 4.2. Making the continuous approximation described in Remark 4.1, the following holds: for v ∈ V_0, the bound (11) holds, and for v ∈ V_00 and g = g(v), the bound (12) holds.
Corollary 4.2. The FWER for the randomized test is controlled accordingly.
Proof of Theorem 4.2. We start with (11), which is simpler and standard. Let v ∈ V_0. Conditionally to U, K T̂_v(U, ξ) follows a binomial distribution Bin(K, T_v(U)) (with T_v defined by equation (6)), so that the rejection probability can be expressed through the binomial c.d.f., where [·] denotes the integer part. The continuous approximation implies that T_v(U) is uniformly distributed, and (11) follows. We now consider (12) and take v ∈ V_00, g = g(v). Conditionally to U, K T̂_g(U, ξ) follows a binomial distribution Bin(K, T_g(U)) with success probability T_g(U), and Hoeffding's inequality [12] yields an exponential bound on the deviation of T̂_g from T_g. We now fix i_0 in {1, ..., K} and consider the corresponding term. Since this probability does not depend on which i_0 is chosen, we will estimate it, without loss of generality, for i_0 = 1, which simplifies the notation. We again use Hoeffding's inequality to obtain a deviation bound; the same upper bound applies to P̄(T̂_g(ξ_1 · U, ξ) ≥ T_g(ξ_1 · U) + ε_G | U) by taking the expectation with respect to ξ_1, and one deduces from this a bound on the probability of the complement of a good event C. On the event C, applying Lemma 4.1 to the random variable ξ → ρ_g(ξ · U) under the distribution μ̂_ξ provides the required control. We finally provide an estimate for the last term in (14). Given U, the number of successes in the relevant event follows a binomial distribution Bin(K, p(U)). We make the continuous approximation (Remark 4.1) to evaluate the resulting expectations, each of the integrals being equal to 1 (they are densities of Beta distributions). From (15), we obtain the desired bound which, combined with equations (13) and (14), completes the proof of the theorem.
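To make the Monte-Carlo scores concrete, here is a minimal sketch of the statistic T̂_v for a single index, using the absolute correlation as the score ρ_v (our illustrative choice) and an add-one correction (also our choice, so the score is never exactly zero):

```python
import numpy as np

def perm_score(Y, x, rng, K=499):
    """Monte-Carlo analogue of the resampling score for one index:
    the fraction of K randomized scores rho_v(xi . U) at least as
    large as the observed score rho_v(U)."""
    obs = abs(np.corrcoef(Y, x)[0, 1])        # rho_v: |correlation|
    hits = sum(abs(np.corrcoef(rng.permutation(Y), x)[0, 1]) >= obs
               for _ in range(K))
    return (hits + 1) / (K + 1)               # add-one correction

rng = np.random.default_rng(2)
n = 100
x = rng.standard_normal(n)
y_active = 2.0 * x + rng.standard_normal(n)   # strong association
y_null = rng.standard_normal(n)               # no association
p_active = perm_score(y_active, x, rng)
p_null = perm_score(y_null, x, rng)
```

With K = 499 permutations the smallest achievable score is 1/500, which is why K must grow with the required significance level in a multiple testing setting.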
4.4. Simulations. For the simulations, we generated data according to the parametric model described in section 3, and compared the detection rate using the non-parametric approach to the one obtained with thresholds optimized as described in section 3. We fixed the ratio between the non-parametric thresholds θ_G and θ_V to coincide with the ratio between the type I errors at the cell and index levels in the parametric case, i.e., with the ratio P(T^par_{g(v)} ≤ θ^par_G)/P(T^par_v ≤ θ^par_V) for v ∈ V_00, where T^par_v, T^par_{g(v)}, θ^par_V, θ^par_G are the scores and thresholds designed for the parametric case. Fixing θ_V/θ_G and the FWER uniquely determines the thresholds.
Figure 3 provides a comparison between the coarse-to-fine approaches (parametric and non-parametric) and Bonferroni-Holm when data is generated from the parametric model. In figures 4 and 5, the non-parametric coarse-to-fine method is compared with Bonferroni-Holm for two different index-level effect sizes. Using the notation of section 3, the cell-level effect size is defined by ρ_g and the index-level effect size by ρ_v, the fraction of the variance of Y explained at the index level. The range of index-level effect sizes we consider in these simulations is similar to the one observed in typical genome-wide association studies.

5. Estimating the number of active cells.
5.1. Notation, assumptions and starting point. We now focus on the issue of estimating the number of active cells, J, from observed data, since this number intervenes in our FWER estimates. We use a method inspired from [18, 17] for the estimation of false discovery rates, adapted to our context. Our estimation will be based on the cell statistics (T_g, g ∈ G) under the following setting.
A1. If g ∩ A = ∅ (g is inactive), then T_g is uniformly distributed.
A2. If g_1, g_2 are inactive, then T_{g_1}, T_{g_2} are independent.
These assumptions will be justified in section 5.3. We will also assume that T_g takes large values when g is active, so that, for a suitable non-conservative threshold t_0, we have P(T_g ≥ t_0) ≈ 1. To simplify the argument, we will actually make the approximation that:
A3. There exists t_0 ∈ (0, 1) such that P(T_g ≥ t_0) = 1 if g ∩ A ≠ ∅.
For t ∈ [0, 1], we define D_t = {g : T_g ≥ t}. Let D be the set of active cells and G_0 = G \ D; note that J = |D|. Then, for t ≥ t_0, A3 gives D ⊂ D_t, and A1–A2 imply E|D_t| = J + (|G| − J)(1 − t). We therefore have, for t ≥ t_0, |D_t| = J + (|G| − J)(1 − t) + Z_t, where Z_t is a centered random variable. The following proposition states that the process Z_t/√(|G| − J), for t ≥ t_0, has the covariance structure of a Brownian bridge.
We now make a Gaussian approximation for large G: Proposition 5.2 states that the vector (Z_{t_1}, ..., Z_{t_n})/√(|G| − J) converges in distribution to N(0, Γ) when |G| diverges to infinity, where Γ is the corresponding covariance matrix. Proof. Our assumptions ensure that this vector satisfies a central limit theorem conditionally to Y, with a limit N(0, Γ) that is independent of the value of U. This implies that the limit is also unconditional.
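The counting identity behind the estimator, E|D_t| = J + (|G| − J)(1 − t) for t ≥ t_0 under A1–A3 (with large values of T_g indicating activity, per A3), can be checked by simulation; all constants below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
G, J, reps, t = 2_000, 20, 500, 0.8           # arbitrary; t >= t_0

counts = []
for _ in range(reps):
    T = np.empty(G)
    T[:J] = 1.0                               # active cells: T_g >= t_0 (A3)
    T[J:] = rng.uniform(size=G - J)           # inactive cells: uniform (A1)
    counts.append(np.sum(T >= t))             # |D_t|

expected = J + (G - J) * (1 - t)              # 20 + 1980 * 0.2 = 416
```

Inverting this relation, |D_t| − (|G| − J)(1 − t) fluctuates around J, which is what the high-probability bound of the next theorem exploits.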
We are now able to present our principal result, which provides a high-probability upper bound for J.

Theorem 5.1. Let Z ∼ N(0, 1) be a randomization variable, independent of (T_g, g ∈ G). For i ∈ {1, . . . , n} and C > 0, define the quantities entering the bound; as a consequence, given ε > 0, the estimator Ĵ_ε is such that P(J > Ĵ_ε) ≤ ε (we use a randomly sampled value of Z).

Let us now prove this result.
Proof. We know from Proposition 5.2 that the vector converges in distribution to B ∼ N(0, Γ). Since min is a continuous function on R^n, we deduce the corresponding limit for the minimum. The process M_i = −B_i − t_i Z, i = 1, . . . , n, is a martingale, and exp(λM_i) is a submartingale for every λ ≥ 0. Applying Doob's inequality, then optimizing over λ, gives the desired tail bound. Solving the resulting quadratic inequality for |G| − J completes the proof.

5.2. Application to the coarse-to-fine algorithm. The previous section provided us with an estimator Ĵ = Ĵ_ε in (18) such that Ĵ > J with probability at least 1 − ε, which implies that the FWER bound computed with Ĵ in place of J holds with probability at least 1 − ε.
We previously chose the constants θ_G and θ_V by optimizing the detection rate under a well-chosen alternative hypothesis, subject to the upper bound being less than a significance level α. This was done using a deterministic upper bound on J, and cannot be directly applied with a data-based estimate of J, since this would yield data-dependent constants θ_G and θ_V, which cannot be plugged into the definition of the set Â without invalidating our estimate of the FWER. In other words, if, for a fixed number J′, one defines Â_J′ to be the discovery set obtained by optimizing θ_G and θ_V subject to |V| p_00(θ_G, θ_V) + ν_G J′ p_0(θ_V) ≤ α, our previous results imply that FWER(Â_J′) ≤ α for all J′ ≥ J, but not necessarily that FWER(Â_Ĵ) ≤ α + ε.
A simple way to address this issue is to replace Â_Ĵ with a set Ã built from the high-probability bound Ĵ_ε. Because Ã ⊂ Â_J with probability at least 1 − ε, it follows that Ã controls the FWER at level α + ε, as intended.
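The calibration step described above can be sketched as a grid search over (θ_G, θ_V) pairs satisfying the FWER bound with J replaced by an estimate. The tail functions below are toy placeholders (independent Gaussian tails), not the paper's actual p_00 and p_0, and all names and numerical settings are ours:

```python
import itertools
from math import erfc, sqrt

def gauss_sf(x):
    """Standard normal tail probability P(N(0,1) >= x)."""
    return 0.5 * erfc(x / sqrt(2.0))

# Toy placeholders for the paper's tail probabilities (NOT its p00, p0):
# treat the cell- and index-level statistics as independent standard normals.
p0 = gauss_sf
p00 = lambda theta_G, theta_V: gauss_sf(theta_G) * gauss_sf(theta_V)

def feasible_thresholds(V_size, nu_G, J_hat, alpha, grid_G, grid_V):
    """All (theta_G, theta_V) pairs whose FWER upper bound
    |V| p00(theta_G, theta_V) + nu_G * J_hat * p0(theta_V) stays below alpha."""
    return [(tg, tv) for tg, tv in itertools.product(grid_G, grid_V)
            if V_size * p00(tg, tv) + nu_G * J_hat * p0(tv) <= alpha]

pairs = feasible_thresholds(V_size=900, nu_G=30, J_hat=10, alpha=0.05,
                            grid_G=[2.0, 3.0, 4.0, 5.0],
                            grid_V=[2.0, 3.0, 4.0, 5.0])
# Among the feasible pairs one would then pick the most powerful one,
# e.g. the pair with the smallest individual-level threshold theta_V.
```

This is only the fixed-J inner step; the two-stage procedure of the text first draws Ĵ_ε and then applies this search with J_hat = Ĵ_ε, paying the extra ε in the FWER level.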
5.3. Justification of A1 and A2. We check that conditions A1 and A2 are satisfied in the two situations considered in this paper. In the example from section 3, we can take (using the same notation and introducing the c.d.f. of a chi-square distribution) the statistic given above. (Recall that P_g is the orthogonal projection on the space generated by X_g = (X_v, v ∈ g) and 1. We also let σ²_Y denote the empirical variance of Y.) Note that the conditional distribution of T_g given Y = y is always uniform over [0, 1] and therefore does not depend on y, which proves that T_g and Y are independent. Similarly, taking g_1 ≠ g_2 ∈ G_0, T_{g_1} and T_{g_2} are conditionally independent given Y (because X_{g_1} and X_{g_2} are independent). But T_{g_1} and T_{g_2} being conditionally independent given Y, with each of them independent of Y, implies that the three variables are mutually independent.
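A minimal sketch of this construction for an inactive cell: project Y onto the span of the cell's predictors (plus the intercept) and apply the chi-square c.d.f. to the squared projection norm. The sketch is ours; for simplicity it takes the noise variance as known (σ² = 1) rather than using the empirical variance of Y as in the paper, and it uses the even-degrees-of-freedom closed form of the chi-square c.d.f. to stay dependency-free:

```python
import math
import numpy as np

def chi2_cdf_even(x, df):
    """Chi-square c.d.f. for even df, via the closed Erlang form:
    P(chi2_df <= x) = 1 - exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    assert df % 2 == 0
    half = x / 2.0
    s = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return 1.0 - math.exp(-half) * s

def cell_statistic(Y, Xg):
    """T_g = F_{chi2_k}(||P_g Y||^2), with P_g the orthogonal projection
    onto span(1, X_g) and k the dimension of that span (noise variance
    taken as 1 here)."""
    n = len(Y)
    M = np.column_stack([np.ones(n), Xg])   # intercept plus cell predictors
    Q, _ = np.linalg.qr(M)                  # orthonormal basis of the span
    PgY = Q @ (Q.T @ Y)                     # projection P_g Y
    k = int(np.linalg.matrix_rank(M))
    return chi2_cdf_even(float(PgY @ PgY), k)

rng = np.random.default_rng(2)
n, p = 200, 5                               # k = p + 1 = 6 is even, as required
stats = [cell_statistic(rng.normal(size=n), rng.normal(size=(n, p)))
         for _ in range(500)]
print(round(float(np.mean(stats)), 2))      # close to 0.5 for inactive cells (A1)
```

When Y is Gaussian noise independent of X_g, ||P_g Y||² is chi-square with k degrees of freedom, so applying its c.d.f. yields a uniform statistic, as A1 requires.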
The same argument applies to the non-parametric case when (now using notation from that section) one assumes that the scores are such that ρ_g(U) = ρ_g(Y, X_g), and uses, to simplify the discussion, the statistic T_g = μ(ξ : ρ_g(Y, X_g) ≤ ρ_g(ξ(Y, X_g))), assuming, in addition, the following. Denote by X_g the space in which the random variable X_g takes its values. There exists a group S̃, a group isomorphism φ between S and S̃, and a group action of S̃ on X_g, denoted by ·̃, satisfying the two following conditions:
• The distribution of X_g is invariant under the action ·̃.
• ρ_g(ξ(Y, X_g)) = ρ_g(Y, φ(ξ) ·̃ X_g).
For example, for permutation tests, the group S̃ is simply the group of permutations S itself, the isomorphism φ is the inverse map ξ → ξ⁻¹, and the group action ·̃ is just the permutation of the observations. Finally, ρ_g can be any score that is symmetric with respect to the observations. Under these conditions, one can immediately apply Lemma 4.1 to conclude.
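When the group is too large to enumerate, the statistic T_g = μ(ξ : ρ_g(Y, X_g) ≤ ρ_g(ξ(Y, X_g))) can be approximated by sampling permutations uniformly. A Monte Carlo sketch, with function name, score, and settings that are ours; note that, as displayed, this quantity behaves like a permutation p-value (uniform for inactive cells, small for active ones), and a statistic that is instead large for active cells is obtained as one minus this value:

```python
import numpy as np

def permutation_statistic(Y, Xg, score, n_perm=500, rng=None):
    """Monte Carlo estimate of mu(xi : score(Y, Xg) <= score of permuted
    data), mu being the uniform measure on the permutation group."""
    if rng is None:
        rng = np.random.default_rng()
    observed = score(Y, Xg)
    hits = sum(score(Y[rng.permutation(len(Y))], Xg) >= observed
               for _ in range(n_perm))
    return hits / n_perm

# Example score: largest absolute inner product between Y and a column of X_g.
score = lambda y, x: np.max(np.abs(y @ x))

rng = np.random.default_rng(3)
n = 100
Xg = rng.normal(size=(n, 3))
# Inactive cell: Y independent of X_g, so the statistic is roughly uniform.
t_null = permutation_statistic(rng.normal(size=n), Xg, score, rng=rng)
# Active cell: Y driven by the first column; the observed score is rarely
# matched by permuted data, so the statistic is near 0 under this convention.
Y_active = 2.0 * Xg[:, 0] + 0.5 * rng.normal(size=n)
t_active = permutation_statistic(Y_active, Xg, score, rng=rng)
```

The uniform sampling of permutations is what Lemma 4.1's invariance argument licenses: under the null, the observed configuration is exchangeable with its permuted versions.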
6. Discussion. Given a partition of the space of hypotheses, the basic assumption which allows the coarse-to-fine multiple testing algorithm to obtain greater power than the Bonferroni-Holm approach at the same FWER level is that the distribution of the number of active hypotheses across the cells of the partition is non-uniform. The gap in performance is then roughly proportional to the degree of skewness. The test derived for the parametric model can be seen as a generalization to coarse-to-fine testing of the F-test for determining whether a set of coefficients is zero in a regression model; the testing procedure derived for the non-parametric case is a generalization of permutation tests to multi-level multiple testing.
This scenario was motivated by the situation encountered in genome-wide association studies, where the hypotheses are associated with genetic variations (e.g., SNPs), each having a location along the genome, and the cells are associated with genes. In principle, our coarse-to-fine procedure will then detect more active variants to the extent that these variants cluster in genes. Of course this extent will depend in practice on many factors, including effect sizes, the representation of the genotype (i.e., the choice of variants to explore) as well as the phenotype, and complex interactions within the genotype. In practice it may be very difficult to know anything specific about the combinatorics relating genes and variants. In some sense, "the proof is in the pudding," in that one can simply try both the standard and coarse-to-fine approaches and compare the sets of variants detected. Given tight control of the FWER, everything found is likely to be real. Indeed, the analytical bounds obtained here make this comparison possible, at least under the linear model commonly used in GWAS and under a general non-parametric model with invariance assumptions.
Looking ahead, we have only analyzed the coarse-to-fine approach in the simplest case of two levels and a true partition, i.e., non-overlapping cells. The methods for controlling the FWER in both the parametric and non-parametric cases generalize naturally to multiple levels assuming nested partitions. The analytical challenge is to generalize the coarse-to-fine approach to overlapping cells, even for two levels: while our methods for controlling the FWER remain valid, they are likely to become overly conservative if cells overlap. This case is of particular interest in applications, where genes are grouped into overlapping "pathways." For example, in "systems biology," cellular phenotypes, especially complex diseases such as cancer, are studied in the context of these pathways, and mutated genes and other abnormalities are in fact known to cluster in pathways; indeed, this is the justification for a pathway-based analysis. Hence the clustering properties may be stronger for variants or genes in pathways than for variants in genes.

Fig 1. Level curves of the upper bound on the FWER for the levels 0.2 (blue), 0.1 (green) and 0.05 (red). The horizontal dashed lines represent the thresholds at the individual level for a Bonferroni-Holm test, with corresponding colors.

Fig 2. Comparison of the power of different methods. We compare the detection rate of an individual active index for different numbers of active indices in the cell containing that index. The coarse-to-fine method is more powerful when the number of active indices is two or greater. This confirms the intuition that the more the clustering assumption holds, the more powerful the coarse-to-fine method is compared to the Bonferroni-Holm approach.

Fig 4. Comparison of the non-parametric Bonferroni-Holm and coarse-to-fine methods for an index-level effect size equal to 1/900. When the clustering assumption holds, the coarse-to-fine method outperforms the Bonferroni-Holm approach.

Fig 5. Comparison of the non-parametric Bonferroni-Holm and coarse-to-fine methods for an index-level effect size equal to 1.5/900. A larger effect size (compared to Figure 4) improves the performance of the Bonferroni-Holm method. The coarse-to-fine method performs better at levels that correspond to two or more active indices per cell.