Semi-supervised multiple testing

An important limitation of standard multiple testing procedures is that the null distribution must be known. Here, we consider a null distribution-free approach for multiple testing in the following semi-supervised setting: the user does not know the null distribution, but has at hand a sample drawn from this null distribution. In practical situations, this null training sample (NTS) can come from previous experiments, from a part of the data under test, from specific simulations, or from a sampling process. In this work, we present theoretical results that handle such a framework, with a focus on the false discovery rate (FDR) control and the Benjamini-Hochberg (BH) procedure. First, we provide upper and lower bounds for the FDR of the BH procedure based on empirical $p$-values. These bounds match when $\alpha (n+1)/m$ is an integer, where $n$ is the NTS sample size and $m$ is the number of tests. Second, we give a power analysis for that procedure, suggesting that the price to pay for ignoring the null distribution is low when $n$ is sufficiently large in comparison with $m$; namely $n\gtrsim m/(\max(1,k))$, where $k$ denotes the number of ``detectable'' alternatives. Third, to complete the picture, we also present a negative result that reveals an intrinsic phase transition for the general semi-supervised multiple testing problem and shows that the empirical BH method is optimal in the sense that its performance boundary matches this phase transition. Our theoretical properties are supported by numerical experiments, which also show that the delineated boundary is of correct order without further tuning any constant. Finally, we demonstrate that our work provides a theoretical ground for standard practice in astronomical data analysis, and in particular for the procedure proposed in \cite{Origin2020} for galaxy detection.


Background and motivating examples
Multiple testing, with emphasis on large-scale problems, is an important topic in modern statistics. Classical theory and performance guarantees heavily rely on the knowledge of the null distribution. However, in many practical situations, the null distribution is out of reach. A famous situation, described in a series of works by Efron (2004, 2007, 2008, 2009) and followed by, e.g., Schwartzman (2010); Azriel and Schwartzman (2015); Stephens (2017); Sun and Stephens (2018); Roquain and Verzelen (2020b), is the case where the null distribution is mis-specified and is empirically adjusted from the data by fitting some parametric null model (typically Gaussian). In particular, it is well known that using an erroneous null can be disastrous in terms of false discovery rate (FDR), see, e.g., Roquain and Verzelen (2020a). Related works, relying on the famous two-group model (Efron et al., 2001), propose to estimate the null distribution together with the proportion of nulls and the alternative distribution, and to plug them into the so-called local FDR values, see Efron et al. (2001) and Padilla and Bickel (2012); Heller and Yekutieli (2014) among others. The latter can in turn be used in an FDR controlling procedure, see Sun and Cai (2007, 2009); Cai and Sun (2009); Cai et al. (2019); Roquain and Verzelen (2020b); Abraham et al. (2021). The validity of such approaches, often established asymptotically in the number of tests, also requires strong model assumptions to ensure that these parameters can be correctly estimated.
Here, we consider a semi-supervised setting, with essentially no assumption on the null distribution. Instead, the user has at hand a sample, called the null training sample (NTS), of length n ≥ 1 and generated according to this unknown null. This is motivated by the three following generic situations:
• Blackbox null sampling: the exact expression of the null distribution is intractable, but a sampling machine is able to simulate according to the null distribution. In that case, the NTS is exogenous and its length n corresponds to the number of samples drawn, and so can be chosen by the user. It is nevertheless typically limited in size by computation time constraints.
• Null sample given: the null distribution is unknown, but previous experiments or experts provide a fixed number n of examples under the null. The NTS is exogenous as in the above case, but n cannot be modified by the user.
• Null sample learned from data: the null distribution is unknown, but an independent part of the same data set provides an NTS for the user. In that case, the NTS is endogenous, of a given length n that cannot be modified by the user.
The case of "blackbox null sampling" is motivated by numerous situations. Two motivations come from Astrophysics; first when a code can be used to simulate images of astrophysical sources, see e.g. Bacon et al. (2021) (their Figure 15). Second, when the NTS comes from instrumental captures that are made without the objects of interest, see e.g. Choquet et al. (2018) for the detection of exoplanetary debris disks (their Figure 5). In each of these situations, the null distribution is not accessible for the user, and only the NTS can be generated. More broadly, this case is motivated by recent advances in machine learning, especially implicit generative models, as generative adversarial networks (Goodfellow et al., 2014), or variational auto-encoders (Kingma and Welling, 2014), for which sampling is possible without knowing the underlying distribution. An illustration of the blackbox null sampling case is provided in Section A, on a toy example for which multiple likelihood ratio tests are simultaneously performed.
The case of "null sample given" is common in the machine learning context, where the learner is given a sample of "nominal patterns" but without labeled novelties. This is classically referred to as "one class classification" or "learning from positive and unlabeled examples" and we refer the reader to the work Blanchard et al. (2010) that pointed out many references in this abundant literature.
The case of "null sample learned from data" refers to the framework where it is possible to isolate part of the data to produce a sample that contains copies of the test statistics under the null, or approximately so. While it can be met in various datasets, it is motivated by a specific application in Astrophysics that is extensively developed in Section 7. It regards the detection of galaxies in the early Universe from image measurements in multiple wavelength channels. In this application, the distribution of the tests statistics under the null is unknown and it was proposed in Mary et al. (2020); Bacon et al. (2021) to estimate this distribution from a null training sample obtained from the data itself. The NTS is obtained as the population of the opposite of local minima and the whole NTS is used for testing each of the m local maxima.
In all cases, a crucial issue is to build a procedure for making discoveries that remains fully interpretable, especially when the number of tests m is large. We thus focus on building a procedure that controls the false discovery rate (FDR), that is, the expected ratio of errors among the discoveries made by the procedure (Benjamini and Hochberg, 1995). Interestingly, controlling the FDR by using a simulated NTS has similarities with the recent "knockoff" method introduced in Barber and Candès (2015), which has generated an impressive scientific production in recent years, see, e.g., Weinstein et al. (2017); Katsevich and Sabatti (2019); Barber and Candès (2019); Bates et al. (2020). Further comparisons are given in Section 1.3.
When proper p-values can be built, the classical way to control the FDR at level α is to use the Benjamini-Hochberg (BH) procedure (Benjamini and Hochberg, 1995). However, in the setting described above, the exact p-values are out of reach, so that the usual BH procedure cannot be used. In our context, we call it the oracle BH procedure, and denote it by BH* α, or BH* for short. Instead, the NTS can be used to build empirical p-values, called p̂-values for short. It is then natural to plug the p̂-values into the BH procedure, which is the procedure studied in this paper. We call it the semi-supervised BH procedure and denote it by BĤ α, or BĤ for short.
Let us already note that plugging empirically-based p-values into the BH procedure is not new and has been widely explored in the literature, especially in a Monte Carlo framework, see, e.g., Guo and Peddada (2008); Sandve et al. (2011); Gandy and Hahn (2014); Zhang et al. (2019). However, while the same null sample is used to compute all p-values in our setting, most of the existing works focus on the case where m null samples are available, that is, each test uses a different sample, often generated via a randomization process (e.g., permutations). In that case, the computational price is much higher and these works mostly aim at reducing this price. To our knowledge, the case of only one null sample has been considered only recently, see Weinstein et al. (2017); Bates et al. (2021). The computational issue can be easily solved (see Algorithm 1), and our emphasis is rather on the theoretical guarantees of the resulting BH procedure (BĤ). Further details and comparisons with existing literature are given in Section 1.3 and in Section S4.1 in the supplement.
Finally, an important point of our work will be to determine how large n should be relative to the number m of tests. Obviously, when n tends to infinity while the number m of tests is kept fixed, the situation becomes similar to the one where the null distribution is known (that is, when n = ∞). But the situation is more complex when both n and m get large simultaneously, which is typical (e.g., in our galaxy detection example, we have n ≈ m = 3.3 × 10^6). As can be guessed, the full picture also depends on the sparsity of the signal. This will be addressed in our theory through a parameter called k, which is a proxy for the number of detectable alternatives.

Contributions
Main contributions. The main contributions of the paper can be summarized as follows. First, we study the FDR of the procedure BĤ by providing upper and lower bounds (Theorem 3.1). These bounds hold in a strong sense, that is, for any couple (n, m) with n, m ≥ 1, any number of true nulls m 0, any null distribution, and any marginal distribution of the alternatives. Moreover, these bounds match and equal αm 0 /m when α(n + 1)/m is an integer. In practice, this provides a first guideline for choosing n in order to avoid over-conservativeness of the procedure.
Second, we provide a power boundary for BĤ, which puts forward the crucial role of n with respect to m: the power of BĤ is close to that of the oracle BH* if n ≳ m/α (Proposition 4.1), but is not when n ≲ m/α (Proposition 4.2). This leads to the boundary n ≍ m/α. In addition, we underline the role of the sparsity in the boundary with the following additional result. For distributions that are more favorable, in the sense that the oracle BH* is expected to make at least k true discoveries with high probability (a situation where we say that k alternatives are "detectable"), we show that the power boundary for BĤ occurs at n ≍ m/(kα). As an illustration, for k = 1, the boundary is n ≍ m/α and thus is the same as for general distributions. However, for the dense case k = m/2, the boundary reads n ≍ 1/α. This indicates that an NTS of size 1/α is enough to recover the power of the oracle in this case, which is markedly different from the case of general distributions. In particular, oracle performances can be achieved in the dense case for a constant value of n, regardless of m. Overall, this leads to a new "rule of thumb" with a transition at n = m/(α max(1, k)), which is implemented in the numerical experiments, see Section 6, and in the astrophysical example, see Section 7.
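As a worked instance of this rule of thumb, the transition point can be computed directly. This is only a sketch: the helper name is ours, and the paper's theory fixes the boundary only up to numerical constants.

```python
def nts_rule_of_thumb(m, alpha, k=1):
    """Rule-of-thumb NTS size at the power transition n = m / (alpha * max(1, k)),
    where k is a proxy for the number of "detectable" alternatives.
    Above this size, the semi-supervised BH procedure is expected to recover
    (the order of) the oracle power; below it, a power loss may occur."""
    return m / (alpha * max(1, k))


# Sparse case (k = 1): the NTS must be roughly m / alpha points.
print(nts_rule_of_thumb(m=10_000, alpha=0.1))            # ~ 100000
# Dense case (k = m / 2): a constant-size NTS suffices, regardless of m.
print(nts_rule_of_thumb(m=10_000, alpha=0.1, k=5_000))   # ~ 20
```

Note how the dense case collapses to the constant 1/(α · 1/2), matching the claim that oracle performances can be reached for n as small as a few units when many alternatives are detectable.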
Third, we show that an intrinsic phase transition occurs in the general case at n ≍ m (Corollary 5.3). The boundary n ≍ m (α being fixed) cannot be improved by another procedure: when n ≲ m, no procedure (based only on the observations and the NTS) can both control the FDR and have a power close to that of BH* (Theorem 5.1). Since BĤ does mimic the oracle when n ≳ m (Proposition 4.1), this establishes a general minimax-type optimality property for BĤ. (Note that the test statistic is fixed in our setting, so that BH* is an appropriate reference for power, see Section 8.2 for a further discussion.)

Secondary contributions. Additional secondary contributions are as follows. First, we show how BĤ can be used in the "Blackbox null sampling" setting in Appendix A. We introduce the Blackbox BH procedure, which is defined as the semi-supervised BH procedure with a preliminary step where the NTS is properly generated, see Algorithm 2. While it can be used in a very broad context, we illustrate its use for likelihood ratio tests, for which the oracle is accessible, in Section A.2 in the supplement. A comparison with local FDR type approaches is also provided in that case.
Second, we put forward the following, perhaps seemingly paradoxical, fact for FDR control under negative dependence. Even in the classical setting where the true null is known, it is better not to use the BH procedure, but instead to build an NTS artificially and to use it along with the semi-supervised procedure BĤ. This approach is referred to as the randomized BH procedure, which is studied separately in Appendix B. While the superiority of the randomized BH procedure over the usual BH procedure in terms of FDR control is shown for an admittedly restrictive dependence structure, correcting the BH procedure to accommodate negative dependencies is known to be a challenging task (see, e.g., Fithian and Lei (2020) and references therein). We think that this intriguing side result is an important proof of concept for the randomized BH procedure.
Third, extensive numerical experiments are given in Section 6 that validate and illustrate our theoretical results. In particular, they corroborate the fact that the boundary where the power of BĤ α becomes of the order of that of BH* α occurs around n = m/(kα) (without further tuning of the constant), where k is the number of "detectable" alternatives in the data. For instance, and perhaps counter-intuitively, it is shown that oracle performances can be achieved in a dense case for values of n as small as 5 or 10, regardless of m.
Fourth, a detailed application to galaxy detection is given in Section 7. Remarkably, the recent results of Bacon et al. (2021) suggest the likely discovery of an unexpected population of ultra-faint dwarf galaxies. This discovery results from a two-stage detection process, whose first stage relies on a former version of the semi-supervised Benjamini-Hochberg procedure developed in Mary et al. (2020), which provides the same output as BĤ α. Hence, the present paper provides a theoretical support to these findings, with guarantees both on the FDR and on the power.

Figure 1 summarizes the different power regimes put forward in our analysis. The line n = m/α separates two regimes: the regime where oracle performances can be reached for any distribution ("mimicking the oracle possible in general", lime green) versus the regime where no procedure can reach the oracle performances ("mimicking the oracle impossible in general", tomato + brick red). The line n = m/(αk) is the performance boundary of BĤ for favorable distributions for which at least k alternatives are detectable (in which case oracle performances can be reached in the lime green + tomato area). Note that our theory proves that these boundaries hold only up to numerical constants, whereas the numerical experiments suggest that they hold with constant 1.

Related works
Permutation-based multiple testing. A common way to generate a "null sample" from the data under test is to apply some randomization that preserves the null distribution, typically by performing permutations of individuals. While single testing using randomization is classical and can be traced back to Fisher (1935), several extensions have been proposed in the literature to accommodate multiple testing criteria, see Westfall and Young (1993); Lin (2005); Romano and Wolf (2005, 2007); Hemerik et al. (2019). In particular, an active line of research is dedicated to reducing the computation time of the BH procedure with p-values obtained from permutation-based null samples: indeed, the usual permutation-based paradigm requires generating a different "null sample" for each test, which makes the use of such a BH procedure prohibitive in that framework. Guo and Peddada (2008) adapt the number of bootstrap samples sequentially to speed up the BH procedure by using bootstrap confidence intervals for p-values. This method is further refined in Gandy and Hahn (2014), where the procedure recovers with high probability the rejection set of the BH procedure using "ideal" p-values (exhausting all permutations). Another approach is used in Sandve et al. (2011) by allocating the Monte Carlo budget (total number of Monte Carlo samples) according to the significance of the test statistics, itself extending an idea of Besag and Clifford (1991) for single testing. More recently, Zhang et al. (2019) proposed to reduce the computational burden by following a bandit approach. While all these works are based on null training samples, the crucial difference is that our setting relies on only one null sample for all tests.
The consequences are the following. First, since the complexity of the procedure proposed here (BĤ) is much smaller than that of the BH procedure with permutation-based p-values, the need for designing an efficient algorithmic strategy is far less critical than in the works mentioned above (note that our Algorithm 1 for BĤ is nevertheless efficient). Second, this advantage comes with a counterpart: in the case where the initial test statistics are independent, the permutation-based p-values are also independent, while our setting induces dependencies between the p̂-values (the same NTS is used to build all p̂-values). This makes the FDR control more difficult to obtain. Finally, the above comparison has to be moderated by the fact that randomization testing and our semi-supervised setting each come with specific mathematical assumptions: randomization testing relies on a null distributional invariance which is very different from Assumption (Exch) below. Namely, the exchangeability property of (Exch) concerns the set of "variables" (nulls of the test sample plus the null training sample), whereas in permutation testing, the exchangeability concerns the set of individuals. As a result, mathematical results derived in each framework cannot be directly compared. In particular, it is important to note that we do not pretend to address the FDR controlling problem in the permutation-based framework. Our contribution lies in another framework, which thus departs from the Monte Carlo literature mentioned above.
Earlier occurrences of BĤ. In the linear Gaussian model, Barber and Candès (2015) proposed to build test statistics (from so-called "knockoff" variables) that have a special symmetry property under the null, allowing an FDR controlling procedure to be properly calibrated. Still in that framework, a rapidly growing literature proposed further extensions and refinements of this seminal work, see, e.g., Candès et al. (2018); Barber and Candès (2019); Sarkar and Tang (2021). Here, the empirical BH procedure BĤ can be seen as an extension of Barber and Candès (2015) to the semi-supervised setting. It turns out that this procedure had already been considered by Weinstein et al. (2017), whose scope is nevertheless quite different. Very recently, it has been considered by Bates et al. (2021); Yang et al. (2021), where the NTS is called the inlier sample and the bag of nulls, respectively. These papers come with theoretical guarantees, see Remark 3.3 below. Our results have been developed independently.
Multiple comparisons to control. Multiple comparisons to control (MCC) is a long-established problem in multiple testing (Dunnett, 1955; Hsu, 1996; Finner and Strassburger, 2007; Fithian and Lei, 2020), where one typically aims at comparing several treatments to some common benchmark (control). In the MCC setting, one typically observes only one test statistic per treatment and one test statistic for the control. This would correspond to the case where the null training sample is of length n = 1, which is not the typical case considered here. Hence, to our knowledge, the connection to that part of the literature is only weak.
Other FDR controls. Our work is closely related to the task of semi-supervised novelty detection, developed in a machine learning context, where the user has at hand both a null sample and an unlabeled sample, and aims at labeling the unlabeled sample. However, the procedures developed therein are significantly different from ours: first, they adjust the test statistics by considering families of classifiers. Second, their FDR control is based on a concentration argument that adds an error term larger than n^{-1/2} + m^{-1/2} (see Proposition 12 therein) and depending on the VC dimension of the classifier class, while the FDR control in Theorem 3.1 is exact (no error term).
Finally, another closely related literature tackles the issue of learning the null distribution without a null training sample (using only the original test statistics), but assuming that the null distribution belongs to a parametric model, typically Gaussian with unknown mean and variance. While the most classical line of research is the one following the "local FDR" methodology introduced by Efron, see, e.g., Efron (2008), theoretical results have been obtained by Carpentier et al. (2021); Roquain and Verzelen (2020b). The methodology developed here, and particularly the impossibility result (Section 4.2) and the boundary phenomenon (Section 5.2), is inspired by Roquain and Verzelen (2020b). However, the setting being markedly different, several substantial adjustments are required. Also, we underline that we derive here an FDR control without remainder terms, which was not the case in Carpentier et al. (2021); Roquain and Verzelen (2020b).
Naive solutions to our problem. For completeness, let us discuss two naive solutions that can be straightforwardly used to derive a procedure with a proven FDR control in the present semi-supervised setting, and explain why they are not satisfactory. Recall that, even under independence of the test statistics, the p̂-values are not independent, which is an obstacle when designing an FDR controlling procedure that takes these p̂-values as input.
First, one solution is to use the Benjamini-Yekutieli procedure or one of its extensions, see Benjamini and Yekutieli (2001); Blanchard and Roquain (2008), which control the FDR under arbitrary dependence between the p-values, hence also when used with p̂-values. Namely, the semi-supervised Benjamini-Yekutieli procedure, denoted by BŶ α (or BŶ for short), is BĤ α/c_m, the semi-supervised BH procedure at level α/c m, where c m = 1 + 1/2 + · · · + 1/m. However, it is well known that the power loss is substantial with respect to the BH procedure, and this general fact also holds in our setting, as shown in the numerical experiments, see Section S4.1 in the supplement. In addition, Theorem 3.1 shows that under Assumption (Exch) the procedure BĤ already achieves the desired FDR control, so there is no need to use the corrected procedure BŶ.
A second naive solution, referred to as BĤ Split, is to split the NTS of size n into m null samples T 1 , . . . , T m , each of size n/m (assume for simplicity that this ratio is an integer), so that each p̂-value uses a different part of the null sample, that is, each p̂ i is computed from the null training sample T i. In that case, if the test statistics are independent, these modified p̂-values are also independent, and the BH procedure using them does control the FDR by the original result of Benjamini and Hochberg (1995). However, this drastically reduces the size of the (distinct) NTSs used to calibrate each test (n/m instead of n), which again leads to poor power, see Section S4.1.
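For concreteness, the BĤ Split baseline just described can be sketched as follows. This is a minimal NumPy illustration, assuming the conservative per-chunk correction (1 + count)/(n/m + 1); the function name bh_split is ours.

```python
import numpy as np

def bh_split(X, Y, alpha):
    """Naive BH_Split baseline: split the NTS Y into m disjoint chunks of
    size n // m, compute each empirical p-value from its own chunk (so the
    p-values are independent when the X_i's are), then run plain BH.
    Power is poor because each test only sees n // m null points."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    m, n = X.size, Y.size
    c = n // m                                 # chunk size (assumed >= 1)
    p = np.empty(m)
    for i in range(m):
        chunk = Y[i * c:(i + 1) * c]
        # conservative empirical p-value from the i-th chunk only
        p[i] = (1 + np.sum(chunk >= X[i])) / (c + 1)
    # standard BH step-up on the split p-values
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    k_hat = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    return order[:k_hat]                       # indices of rejected hypotheses
```

The granularity issue is visible immediately: with n/m null points per test, no split p-value can go below 1/(n/m + 1), which is why the power collapses compared to using the whole NTS for every test.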

Organization of the paper
The paper is organized as follows: while the model, procedures and criteria are detailed in Section 2, the FDR results are given in Section 3. Power properties of BĤ are then derived in Section 4 with upper and lower bounds, which delineate boundaries for BĤ. Extending the impossibility result below the boundary to any procedure, the result of Section 5 delivers an optimality property of BĤ and a general phase transition for the semi-supervised multiple testing problem. We then illustrate our findings with numerical experiments in Section 6, and the motivating application to astrophysical data is investigated in Section 7. We conclude and discuss several open issues related to our work in Section 8. Two by-products of our theory are presented in Appendices A and B, with the blackbox BH procedure and the randomized BH procedure, respectively. For space reasons, we have deferred some materials to a supplemental file, whose sections are numbered with the prefix "S" to avoid any confusion. In this supplement, the main proofs are given in Section S1 and Section S2 for the FDR results and the power results, respectively. Auxiliary results and proofs are postponed to Section S3, while additional numerical experiments are given in Section S4.

Setting
For n, m ≥ 1, let us observe a sample Z = (Z 1 , . . . , Z n+m ) = (Y 1 , . . . , Y n , X 1 , . . . , X m ) ∈ R^{n+m}, whose distribution is denoted by P; this distribution is the model parameter and belongs to some model P. The sample Y = (Y 1 , . . . , Y n ) is referred to as the null training sample (NTS) and is assumed to be identically distributed with marginal distribution P 0 = P 0 (P ). We denote the upper-tail function of P 0 by F 0 (t) = P(X i ≥ t), t ∈ R, i ∈ H 0 (P ), which is assumed to be continuous and decreasing on the support of P 0 . This is the only assumption made on P 0 throughout the manuscript.
The sample X = (X 1 , . . . , X m ) corresponds to the sample under test. We consider the multiple testing problem where we would like to test the i-th null hypothesis H i : "X i ∼ P 0 " (against the complementary alternative), simultaneously for 1 ≤ i ≤ m. Note that while we allow for arbitrary alternatives here, this setting is typically suitable for alternatives that make X i stochastically larger than under the null (decisions will be based upon large values of the X i 's). Classically, let us denote H 0 (P ) = {i ∈ {1, . . . , m} : X i ∼ P 0 } ⊆ {1, . . . , m} the subset corresponding to true null hypotheses and m 0 (P ) = |H 0 (P )|. Let us denote H 1 (P ) the complement of H 0 (P ) in {1, . . . , m} and m 1 (P ) = m − m 0 (P ). Often, we omit the parameter P in the notation P 0 , H 0 , H 1 , m 0 , m 1 for simplicity.
Throughout the paper, we are going to consider various dependence assumptions between the Z i 's. The simplest assumption is the following:

(Indep) the variables Y 1 , . . . , Y n and X i , i ∈ H 0 , are i.i.d. with common distribution P 0 , and independent of (X i , i ∈ H 1 ).

Note that (Indep) does not exclude dependencies between the elements of (X i , i ∈ H 1 ). We also use the following less restrictive condition:

(Exch) the variables (Y 1 , . . . , Y n , X i , i ∈ H 0 ) are exchangeable, conditionally on (X i , i ∈ H 1 ).

Hence, under (Exch), there can also be some dependencies between the elements of (Y 1 , . . . , Y n , X i , i ∈ H 0 ).

Procedures, criteria and p-values
A multiple testing procedure is a (measurable) function R = R(Z) that returns a subset of {1, . . . , m} corresponding to the indices i for which H i is rejected. For any such procedure R, the false discovery rate (FDR) of R is defined as the average of the false discovery proportion (FDP) of R under the model parameter P ∈ P, that is, FDR(P, R) = E[FDP(P, R)], where FDP(P, R) = |R ∩ H 0 (P )| / max(|R|, 1). Similarly, the true discovery rate (TDR) is defined as the average of the true discovery proportion (TDP), that is, TDR(P, R) = E[TDP(P, R)], where TDP(P, R) = |R ∩ H 1 (P )| / max(m 1 (P ), 1). Note that if m 1 (P ) = 0, then TDP(P, R) = 0 for all procedures R.
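These two proportions are straightforward to compute for a given rejection set; the following is a minimal NumPy sketch (the helper name fdp_tdp is ours).

```python
import numpy as np

def fdp_tdp(rejected, h0):
    """False and true discovery proportions of a rejection set.

    rejected : boolean array of length m (True = hypothesis i rejected)
    h0       : boolean array of length m (True = null hypothesis i is true)
    """
    n_rej = rejected.sum()
    m1 = (~h0).sum()                          # number of false nulls
    fdp = (rejected & h0).sum() / max(n_rej, 1)   # |R ∩ H0| / (|R| ∨ 1)
    tdp = (rejected & ~h0).sum() / max(m1, 1)     # 0 when m1 = 0, as in the text
    return fdp, tdp
```

The FDR and TDR are then the Monte Carlo averages of these proportions over repeated draws of Z under a given parameter P.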
In the sequel, we will focus on p-value based procedures, and we implicitly consider the situation where it is desirable to reject H i for large values of X i. If the null distribution P 0 is known, then F 0 is known and we can consider the oracle p-values p* i (Z) = F 0 (X i ), 1 ≤ i ≤ m. For i ∈ H 0 , the p-value p* i (Z) is uniformly distributed on (0, 1), and thus satisfies the super-uniformity property P(p* i (Z) ≤ u) ≤ u for all u ∈ [0, 1]. (3) As it is required to obtain valid individual tests, condition (3) is generally considered as the definition of "valid" p-values.
Since in our framework P 0 is unknown, the above oracle p-values cannot be used in practice. Instead, the null sample (Y 1 , . . . , Y n ) can be used to build the empirical p-values p i (Z) = n^{-1} Σ_{j=1}^n 1{Y j ≥ X i }, 1 ≤ i ≤ m. (4) However, the p i 's do not satisfy the necessary super-uniformity (3). For instance, for u = 0, condition (3) is violated because the event p i (Z) = 0 can occur with positive probability. Hence, using the p i 's as p-values is not appropriate, especially in a multiple testing context where under-estimating p-values can lead to an increased number of false discoveries. This phenomenon is well known, and we refer the reader to the review of Phipson and Smyth (2010) for more details on this issue (see also the references therein). A common way to correct the p i 's is to make them slightly biased upward by considering instead the conservative version (see, e.g., Davison and Hinkley, 1997), given by p̂ i (Z) = (1 + Σ_{j=1}^n 1{Y j ≥ X i }) / (n + 1), 1 ≤ i ≤ m. (5) Under (Exch), since for any i ∈ H 0 the variables X i , Y 1 , . . . , Y n are exchangeable, the p̂ i (Z)'s do satisfy the super-uniformity (3), see, e.g., Lemma 5.2 in Arlot et al. (2010). Hence, the p̂ i (Z)'s are "valid" p-values, which can in turn be plugged into multiple testing procedures.
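In code, both the naive and the conservative empirical p-values can be computed with a single sorted search over the NTS (a sketch; the function name is ours, and the conservative (count + 1)/(n + 1) correction is the one described above).

```python
import numpy as np

def empirical_p_values(X, Y, conservative=True):
    """Empirical p-values of test statistics X against a null training sample Y.

    The naive version is the fraction of Y_j >= X_i; the conservative version
    adds 1 to both numerator and denominator, which makes the p-values slightly
    biased upward and restores super-uniformity under exchangeability."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = Y.size
    # counts[i] = #{j : Y_j >= X_i}, via one sort + binary searches
    Y_sorted = np.sort(Y)
    counts = n - np.searchsorted(Y_sorted, X, side="left")
    if conservative:
        return (1 + counts) / (n + 1)
    return counts / n
```

Note that the smallest attainable conservative p̂-value is 1/(n + 1), which already hints at why n must be large enough relative to m for the BH procedure to reject anything at level α.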

BH procedures
In this work, an important class of multiple testing procedures is the class of BH-type procedures, which take as input different p-value families. The BH procedure is defined as follows: for some nominal level α ∈ (0, 1), order the p-values in increasing order p (1) ≤ · · · ≤ p (m) , let k̂ = max{k ∈ {0, 1, . . . , m} : p (k) ≤ αk/m}, where we let p (0) = 0 by convention, and reject the null hypotheses corresponding to the k̂ smallest p-values.
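The BH step-up rule just described can be sketched as follows (a standard implementation; the function name is ours).

```python
import numpy as np

def bh_procedure(p, alpha):
    """Benjamini-Hochberg step-up: reject the k-hat smallest p-values, where
    k-hat = max{k : p_(k) <= alpha * k / m} (k-hat = 0 means no rejection)."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices sorted by p-value
    thresh = alpha * np.arange(1, m + 1) / m       # step-up critical values
    below = p[order] <= thresh
    if not below.any():
        return np.array([], dtype=int)
    k_hat = np.max(np.nonzero(below)[0]) + 1       # largest k passing the test
    return order[:k_hat]                           # rejected hypothesis indices
```

Feeding it the oracle p-values gives BH* α, while feeding it the p̂-values of (5) gives the semi-supervised procedure studied in this paper.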
Definition 2.1. We consider the two following versions of the BH procedure, depending on which p-value family is given as input:
• the oracle BH procedure, denoted by BH* α , is the BH procedure using the unknown oracle p-values p* i (Z), 1 ≤ i ≤ m;
• the semi-supervised BH procedure, denoted by BĤ α , is the BH procedure using the p̂-values p̂ i (Z), 1 ≤ i ≤ m, given by (5).
Algorithm 1: BĤ α , the semi-supervised BH procedure. Importantly, the output of BĤ α can be computed quickly, even for large values of n and m, thanks to an algorithm of complexity O((n + m) log(n + m)), see Algorithm 1. This comes from the reformulation of BĤ α given in Section S1.1, which was also used by Weinstein et al. (2017) in another context. Figure 2 provides an illustration of Algorithm 1: it is a stepwise procedure that goes from the smallest values of the test statistics (right) to the largest values (left), and that stops the first time the estimated FDP falls below α. At each step, the FDP is estimated by the ratio of the number of null sample points in the left part plus one (V + 1) to the number of test statistics in the left part (K), this ratio being sample-size corrected by the factor m/(n + 1). Hence, at each step, the Y i 's serve as benchmarks to evaluate how many false discoveries are expected among the considered X i 's. Finally, while the above version of Algorithm 1 was presented for simplicity, a faster shortcut version can obviously be obtained by iterating in the loop only over the indices corresponding to the X i 's (the FDP is then computed only at the black points in Figure 2).
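The stepwise scan described above can be sketched as follows. This is not the paper's exact Algorithm 1, but an implementation of the described stopping rule with the same O((n + m) log(n + m)) cost (from the two sorts); the function name is ours, and by the step-up reformulation it is equivalent to running BH on the p̂-values of (5).

```python
import numpy as np

def semi_supervised_bh(X, Y, alpha):
    """Semi-supervised BH: reject the top-K test statistics for the largest K
    whose estimated FDP, (V + 1) / K * m / (n + 1), is <= alpha, where V is
    the number of null training points at or above the K-th largest X."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    m, n = X.size, Y.size
    order = np.argsort(-X)                    # test statistics, largest first
    Y_sorted = np.sort(Y)
    best_K = 0
    for K in range(1, m + 1):
        t = X[order[K - 1]]                   # candidate threshold: K-th largest X
        V = n - np.searchsorted(Y_sorted, t, side="left")   # #{j : Y_j >= t}
        if (V + 1) / (n + 1) * m / K <= alpha:
            best_K = K                        # step-up: keep the largest such K
    return order[:best_K]                     # indices of rejected hypotheses
```

For instance, with a strong signal in coordinates 0 and 2 and a well-separated NTS, only those two hypotheses are rejected; ties in the test statistics are ignored here, in line with the continuity assumption on P 0.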

FDR control
In this section, we study the FDR of the procedure BĤ.
Theorem 3.1. For all n, m ≥ 1 and α ∈ (0, 1), consider the semi-supervised BH procedure BĤ α at level α as defined in Definition 2.1. Then, for any parameter P satisfying (Exch), the following holds:

(m 0 /(n + 1)) ⌊α(n + 1)/m⌋ ≤ FDR(P, BĤ α ) ≤ αm 0 /m,

where ⌊x⌋ denotes the largest integer smaller than or equal to x. In particular, when α(n + 1)/m is an integer, the two bounds coincide and the FDR bound is achieved, that is, FDR(P, BĤ α ) = αm 0 /m.
The proof is given in Section S1 and is based on a super-martingale argument similar to that of Barber and Candès (2015). However, a major difference is that the underlying process is not an i.i.d. Bernoulli process, but is only exchangeable, see Lemma S2 for more details. The lower bound part is obtained by looking carefully at the remainder term in the super-martingale property. To our knowledge, this kind of refinement is new in the literature. It allows us to evaluate the sharpness of the FDR bound.
In addition, the FDR control of BH α holds under the more general condition (Exch). This is not the case for BH * α , which can violate the FDR control under that condition. Hence, Theorem 3.1 puts forward an additional robustness of BH α w.r.t. negative dependence, which is not enjoyed by BH * α . We provide an example below, see Figure 3 for an illustration.

Example 3.2 (Gaussian with maximal negative correlation). Assume that Z = (Y, X) is a centered Gaussian vector with equicorrelation ρ < 0 and variances equal to 1. Classically, since the length of Z is n + m, the condition ρ ≥ −1/(n + m − 1) is necessary for the (n + m) × (n + m) ρ-equicorrelated matrix (that is, with diagonal 1 and off-diagonal entries ρ) to be non-negative definite. For instance, the maximal negatively correlated case ρ = −1/(n + m − 1) can be easily realized as Z i = (1 − 1/(n + m)) −1/2 (W i − W̄ ), with W 1 , . . . , W n+m i.i.d. N (0, 1) and W̄ denoting the sample mean of the W i 's, 1 ≤ i ≤ n + m. For this specific distribution P of Z, we have P 0 = P 0 (P ) = N (0, 1) and H 0 = {1, . . . , m}. Also, Assumption (Exch) is satisfied, so that BH α controls the FDR at level α (with equality when α(n + 1)/m is an integer). On the other hand, it is well known that BH * α has an FDR above α in that case (see also Figure 3). Additional illustrations are given in Section 6.2 in the numerical experiments. This example is also the starting point of the randomized BH procedure developed in Appendix B.

Remark 3.4. When α(n + 1)/m is an integer, we can easily check that BH α coincides with the BH procedure applied to the naive, unbiased empirical p-values defined by (4). Hence, Theorem 3.1 implies that FDR(P, BH α ) = αm 0 /m in that case (under (Exch)). This shows that, perhaps surprisingly, the naive way to build empirical p-values eventually leads to a correct FDR control for such values of n. Simulations will show that this is not necessarily the case for other values of n, see Section 6.
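The centering construction of Example 3.2 can be verified exactly by linear algebra, since Z is a linear image of W: writing Z = A W with A = (I − J/N)/√(1 − 1/N), J the all-ones matrix and N = n + m, the covariance of Z is A Aᵀ. The snippet below (with an illustrative N; a sketch, assuming the centering-and-rescaling reconstruction stated in the example) checks unit variances and equicorrelation −1/(N − 1).

```python
import numpy as np

n, m = 5, 7                   # illustrative sizes
N = n + m
# Z_i = (W_i - Wbar) / sqrt(1 - 1/N), i.e. Z = A @ W with:
A = (np.eye(N) - np.ones((N, N)) / N) / np.sqrt(1.0 - 1.0 / N)
C = A @ A.T                   # exact covariance of Z (W ~ N(0, I_N))
print(np.allclose(np.diag(C), 1.0))            # unit variances
print(np.isclose(C[0, 1], -1.0 / (N - 1)))     # equicorrelation -1/(N-1)
```

The computation behind it: (I − J/N)(I − J/N)ᵀ = I − J/N since J² = N J, so after rescaling the diagonal is exactly 1 and every off-diagonal entry is −(1/N)/(1 − 1/N) = −1/(N − 1), the most negative admissible equicorrelation.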

Power result
Section 3 showed that BH α has an FDR smaller than or equal to that of the oracle BH * α under (Indep), see (8). Now, an important concern is to check whether the power of BH α is comparable to that of BH * α . In this section, we explore this issue under Assumption (Indep); the power comparison is established by comparing the true discovery proportions (2) of BH α and BH * α′ , for a level α′ slightly below α. In a nutshell, we establish that the TDP of BH α is larger than that of BH * α′ with probability tending to 1, for any model parameter, when n/m is large (Section 4.1), while this fails when n/m is small (Section 4.2). Together, this means that the boundary achieved by the procedure BH α is n ≍ m/α. We then present the case of particular, more favorable distributions, for which at least k alternatives are "detectable" (Section 4.3). In that case, the boundary achieved by BH α is shown to be nk ≍ m/α.
To state our results, let us finally introduce an additional notation: let P n,m denote the set of distributions P of Z satisfying (Indep). Since (Indep) holds, P belongs to this class P n,m in the semi-supervised setting presented in Section 2.1, which can thus be considered as the parameter set of the model under that assumption.
In addition, since we look at power results, we are going to focus on distributions in P n,m with at least one true alternative. We denote A n,m = {P ∈ P n,m : m 1 (P ) ≥ 1} the corresponding set.

Upper bound
The following result shows that, under (Indep), when n ≥ γm with γ large enough, the semi-supervised BH procedure at level α rejects, with high probability, at least all null hypotheses rejected by the oracle BH procedure at the slightly smaller level α(1 − η), with η small.
Then, for all n, m ≥ 1 with n ≥ γm and for all P ∈ A n,m , we have the bound (12); in particular, a simplified statement holds for all n, m ≥ 1 with n ≥ γm. Proposition 4.1 is proved in Section S2.1. It is based on a concentration argument for the empirical c.d.f. of the Y i 's, which relies on the independence assumption between the Y i 's. Note that the bound (12) is only informative if γ > γ * (α, η)/3; otherwise the right-hand side of (12) is non-positive and the bound is silent. When γ ≥ γ * (α, η), the probability is larger than or equal to 3/4. Taking γ much larger than γ * (α, η) makes the probability arbitrarily close to 1.

Lower bound
The previous section shows that the power of BH is close to that of the oracle BH procedure provided that n/m is sufficiently large. We can legitimately ask whether this condition is necessary. The following result addresses this point.
Put together, Propositions 4.1 and 4.2 establish that the semi-supervised BH procedure achieves the boundary n ≍ m/α: for n ≤ m/(4α), there exists a configuration P ∈ A n,m such that the power of BH α is less than that of the oracle BH * α(1−η) (with probability at least 1 − 2α), while for n ≫ m/α all configurations P ∈ A n,m are such that the power of BH α is larger than that of the oracle (with probability arbitrarily close to 1).

Refinement to more favorable distributions
If there are enough alternatives, with enough signal strength, we show here that the boundary achieved by BH α can be much better than n ≍ m/α. To this end, we extend Proposition 4.1 and Proposition 4.2 to a specific set of "more favorable" distributions.
In words, A n,m,k,α,β is the set of distributions such that at least k null hypotheses are false, while the probability that the procedure BH * α/2 makes at most k − 1 true discoveries is smaller than β. From an intuitive point of view, this means that the distribution contains at least k "detectable" alternatives, in the sense that they are detectable with large probability by the oracle itself (at level α/2). Now, the idea is that for a distribution P ∈ A n,m,k,α,β , the threshold of the oracle procedure is at least αk/m with large probability, so that the precision 1/n of the empirical p-values is enough to mimic the power of the oracle BH if and only if 1/n ≲ αk/m, that is, nk ≳ m. The following result proves that this informal argument is correct.
Proposition 4.3 is proved in Section S2.3. Point (i) above is an upper bound: in particular, it shows that having nk/m ≥ γ * (α, η) is enough for BH α to mimic the power of the oracle BH * α(1−η) with probability at least 1 − β − 1/4 when the underlying distribution belongs to the set A n,m,k,α,β . Interestingly, the condition nk/m ≥ γ * (α, η) can be much weaker than the previous condition n/m ≥ γ * (α, η) when k gets large. Point (ii) is a lower bound showing that the order given in the upper bound is correct. Together, (i) and (ii) ensure that the boundary achieved by BH α is nk ≍ m on the distribution set A n,m,k,α,β . In addition, when α gets small and 1/α cannot be considered as a constant, our result is able to track the dependence in α: since γ * (α, η) is of order 1/α (see (11)), the boundary reads nk ≍ m/α. This boundary turns out to be an accurate "rule of thumb" in the numerical experiments of Section 6.
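The granularity phenomenon behind these bounds is easy to reproduce numerically: empirical p-values live on the grid {1/(n + 1), . . . , 1}, so when n is too small, even arbitrarily strong signal on every test cannot be rejected. A minimal illustration (function and variable names are ours; step-up BH applied to the empirical p-values):

```python
import numpy as np

def empirical_bh_rejections(x, y, alpha):
    """Number of rejections of step-up BH on the empirical p-values
    (1 + #{j : Y_j >= X_i}) / (n + 1)."""
    n, m = len(y), len(x)
    ys = np.sort(y)
    pvals = (1.0 + (n - np.searchsorted(ys, x))) / (n + 1.0)
    ok = np.sort(pvals) <= alpha * np.arange(1, m + 1) / m
    return int(np.max(np.nonzero(ok)[0]) + 1) if ok.any() else 0

m, alpha = 100, 0.1
x = np.full(m, 10.0)       # arbitrarily strong signal on every test
y_small = np.zeros(1)      # n = 1: p-value resolution is only 1/2
y_large = np.zeros(5000)   # n = 5000: resolution 1/5001 << alpha/m
print(empirical_bh_rejections(x, y_small, alpha))  # 0: 1/2 > alpha*i/m for all i
print(empirical_bh_rejections(x, y_large, alpha))  # 100: all rejected
```

With n = 1, every empirical p-value is at least 1/2, above every BH threshold αi/m ≤ α, so nothing is ever rejected no matter how strong the signal; with n = 5000, the same test statistics are all rejected.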

Optimality
For a fixed level α, the previous results show that the semi-supervised BH procedure BH α mimics the oracle BH procedure when n ≫ m, both in terms of FDR (Theorem 3.1) and power (Proposition 4.1). However, when n ≪ m, while BH α still controls the FDR, it loses the power property (Proposition 4.2). Hence, it does not mimic the oracle in that regime. Still, this does not exclude that a different procedure, one that would use the data Z more cleverly, might be able to mimic the oracle when n ≪ m. In this section, we show that no procedure can mimic the oracle in that regime (Theorem 5.1). This exhibits a general phase transition for the problem of mimicking the oracle (Corollary 5.3) and establishes that BH α achieves this transition, which thus delineates a kind of optimality enjoyed by the semi-supervised BH procedure.
In the regime where γ is too small, the following result shows that simultaneously achieving (14) and (15) is impossible, whatever the procedure R based only on the data Z.

Phase transition
Let us elaborate further on the phase transition that we have put forward. To this end, we introduce the following definition.
This phase transition is illustrated in Figure 1 in the introduction of the paper. The transition is stated under the slightly modified form n = m/α for comparison with Section 4. This also emphasizes that the impossibility result is a worst-case analysis over the distributions P ∈ P n,m (FDR) and P ∈ A n,m (power) (suprema are taken in (17) and (18)). In particular, under the more stringent assumption P ∈ A n,m,k,α,β , mimicking the oracle becomes possible already whenever n ≳ m/(αk) (as reported in Figure 1 for k = 3 or 100).
This general phase transition is in line with the recent results of Roquain and Verzelen (2020b). Nevertheless, their setting is markedly different: it is unsupervised (no NTS) and the null distribution is assumed to belong to the Gaussian family with unknown mean and variance. The phase transition found there was at ℓ ≍ m/ log(m) (with our notation), where ℓ is a lower bound on the number of alternatives m 1 (P ) (with no signal strength assumption). Here, the situation is notably different, with a boundary that is a function of the length n of the NTS. The situation is also very different in terms of FDR control: the mimicking procedure BH provides here an FDR control both above and below the transition, while such a property is not possible in the setting of Roquain and Verzelen (2020b) (as proved in Corollary 3.3 therein).

Numerical illustrations
This section provides several numerical illustrations for the theoretical findings derived in Sections 3 and 4.

Simulation setting
While our experiments mostly focus on the two BH-type procedures BH and BH * , we also consider other competitors: the BH procedure applied to the unbiased empirical p-values defined by (4) (Section 2.2), and the "naive" procedures BY and BH Split described in Section 1.3. Also, for simplicity, the way we evaluate how the power of BH mimics that of BH * slightly departs from our theoretical study: first, we compare BH to the oracle BH * taken at the same level α (that is, η = 0 with the notation of Section 4). This makes the power mimicking more challenging. Second, to stick with the standard way of comparing procedures (for BH * , BH or their competitors), the considered power criterion is simply the TDR (2) (the average of the TDP). Unless specified otherwise, the setting is Gaussian with a null distribution P 0 = N (0, 1) and an alternative N (µ, 1), for a given value of µ > 0. Across the sections below, we make various choices of n, m and of the sparsity m 1 (number of alternatives). We sometimes fix the level α to the (unusually large) value 0.5 for better visibility of the curves and faster computation time, but the results scale accordingly for smaller values of α. Finally, the FDR (resp. TDR) curves are estimated by Monte Carlo simulations. The plots show the estimates of the FDR and TDR with two error bars: one estimating the standard deviation of the FDR (resp. TDR) estimate and one estimating the standard deviation of the FDP (resp. TDP) (these two deviations being proportional).

FDR control under the full null
The first experiment concerns the case where m 0 = m, which corresponds to the so-called "full null" configuration where there is no alternative. We consider two dependence frameworks: the independent case (all Z i 's independent) and the negatively equicorrelated case described in Example 3.2. Recall that BH is proved to control the FDR at level α in both cases (Theorem 3.1), while BH * is only proved to control the FDR at level αm 0 /m = α in the independent case. Also, the BH procedure applied to the unbiased empirical p-values defined in Section 2.2 is not proved to control the FDR, since these p-values do not satisfy the super-uniformity property (3). Figure 4 displays the obtained FDR curves for BH * (green), BH (blue) and the unbiased variant (red). The obtained results are consistent with our theoretical findings: negative correlations induce an FDR of BH * slightly above the targeted level, although this effect tends to diminish when n gets larger. This is because the negative correlation ρ = −(m + n − 1) −1 decreases (in absolute value) when n grows. As expected, BH maintains the FDR control in any case. Meanwhile, the unbiased variant fails to control the FDR in general, except for some values of n where it has the same FDR value as BH. Hence, we discard it from our plots in the sequel. Interestingly, we also display the lower bound of Theorem 3.1 in Figure 4: while it correctly lower-bounds the estimated FDR of BH for any n, it illustrates that the FDR is exactly α for n ∈ {3, 7, 11, 15, . . . }, as the theory establishes (the curves might also suggest that the lower bound is sharp for n ∈ {4, 8, 12, 16, . . . }, which is not covered by our theory). Finally, note that these results are in expectation: as shown by the shaded areas, there can be large variations for particular samples. This is inherent to the BH procedure when the number of discoveries is not large.
Figure 5 compares the performances of the procedures BH * (dark green and khaki) and BH (dark blue and cyan) in terms of FDR (dark colors) and TDP (light colors) in the dense case where m 1 = m 0 = m/2, with µ = 1 (left column) and µ = 2 (right column). Regarding the FDR first, the plots show that the FDR of BH tends to the oracle FDR (which is 0.25 = αm 0 /m = α/2 here). For a fixed value of n, the convergence is faster for smaller values of m. This is coherent with Theorem 3.1, which ensures that the FDR of BH is equal to α/2 for n = m/α − 1. On the other hand, the variance of the FDR (blue shaded area) is smaller at fixed n when m increases, because the larger sample size tends to stabilize the result.

Power study
Turning to the power results, the plots show that the power of BH also tends to that of BH * in this sparsity regime, with faster convergence for smaller values of m (at fixed n). This is well expected from the "rule of thumb" delineated in Section 4.3, which predicts that the transition occurs for n ≈ m/(max(1, k)α), where k is a lower bound on the typical number of true discoveries of the oracle. Given the displayed results, the value of k could be chosen around (2, 4, 20, 40), so that this rule would predict a transition for n occurring around (10, 5, 10, 5) (top-left, top-right, bottom-left, bottom-right). Strikingly enough, the transitions indeed occur at these points in the different TDR curves.
The sparse case where m 1 = 1 is considered in Figure 6, with µ = 1 (left column) and µ = 3 (right column) and a slightly wider range for n. Here, the oracle FDR is 0.45 for m = 10 and 0.495 for m = 100. The observations made regarding the FDR and TDR in Figure 5 are qualitatively the same. Moreover, in the sparse case, the convergence to the asymptotic regime is slower than in the dense case, and increasing m at fixed n slows down the convergence more significantly than in the dense case. This is coherent with the rule of thumb n ≈ m/(max(1, k)α), which predicts that the transition occurs around n = m/α = 2m here (only one alternative). In addition, it is apparent from the plots that the transition value predicted by this rule turns out to be particularly well adjusted, at least in this simulation setup.
Finally, Figure 7 compares the FDR and TDR of the procedures BH * and BH for larger values of m and n and α = 0.2. We fix m = 10^3 and let the size of the NTS range from n = 1 to 5 × 10^4 . In each plot, we see that the performance of BH indeed increases with n. Despite the increased signal amplitude in the sparse case, the situation is more difficult there, both in terms of convergence (which is slower) and of variance of the FDP and TDP (which are larger). Interestingly, this again corroborates the rule of thumb, which predicts a transition around n = 20 and n = 625 (for the choices k ≈ 250 and k ≈ 8) in the dense and sparse situations, respectively.

Additional experiments
Section S4 presents the following additional experiments. First, Section S4.1 presents a comparison with the naive procedures BY and BH Split ; both are shown to be over-conservative and much less powerful than BH. Second, a case study with a Student distribution, leading to similar conclusions, is presented in Section S4.2. Third, Section S4.3 is devoted to simulations for very small values of n (n = 5 or 10) with increasing values of m: it shows that BH can achieve oracle performance in the dense case, regardless of m.

Application
One of the major scientific goals of the MUSE integral field spectrograph, which is installed on one of the 8 m telescopes of the Very Large Telescope in Chile, is the detection of distant, and consequently ultra-faint, galaxies in the early Universe. MUSE delivers 3-dimensional datacubes (two spatial dimensions and one spectral dimension) composed of images taken in different wavelength channels of the visible spectrum. The values of the data samples correspond to light fluxes. Ordinary datacubes are composed of a pile of 300 × 300-pixel images at 3700 consecutive visible wavelengths, leading to more than 300 million voxels.
[Displaced figure caption (apparently from Figure 7): "... 2 log m ≈ 1.86. Right: m 1 = 10 and µ = √(2 log m) ≈ 3.72. The number of Monte Carlo simulations used for estimating the FDR and TDR is 10^3 for n < 10^3 and 10^2 otherwise. The 2σ confidence interval on the estimated FDR and TDR is plotted in magenta. In all plots the standard deviation (divided by 10) of the FDP and TDP is shown in shaded green for BH * and shaded blue for BH."]

After multiple calibration and preprocessing stages, the problem of detecting faint galaxies boils down to a typical needle-in-a-haystack problem. The haystack is the datacube, which can be considered as a discrete-valued 3-dimensional random process. This process is generated by various noise sources and by the residual perturbations of numerous bright sources. Consequently,
the statistics of the random process are poorly constrained. In this haystack, each needle (there are hundreds of them) is a small group of connected voxels, centered on the galaxy's position, in which the flux locally increases.
A dedicated detection strategy, proposed by Mary et al. (2020) and further exploited by Bacon et al. (2021), consists in considering as final test statistics the 3-dimensional local maxima of the processed datacube. In the resulting testing problem, there is one null hypothesis linked to each of the m local maxima, with m typically in the range [10 5 , 10 6 ]. If we denote by x, y, z the position of a particular local maximum, we test H 0,x,y,z : "There is no galaxy centred at position (x, y, z)", against H 1,x,y,z : "There is one galaxy centred at this position" and the considered error criterion is the FDR.
As evoked above, the distribution of the local maxima under the null hypothesis is largely unknown. To circumvent this difficulty, Mary et al. (2020) proposed to use the population of the opposite values of the local minima (say, Y i , in number n) as an independent "proxy" (an NTS) for the local maxima (say, X i , in number m). They reported numerical simulations suggesting that a procedure close to the Benjamini-Hochberg procedure using p-values computed from this NTS controls the FDR. This astrophysical application involves a common but unknown distribution P 0 under the null hypothesis and the possibility of using an NTS to improve the control of the FDR: this is clearly the setting described in Section 2.1, and in fact this application has inspired the present study. It is thus interesting to see what light the present study sheds on this initial approach.
The sample sizes considered here are n = 2.3 × 10^6 and m = 3.3 × 10^6 , so that n < m and both are large. The empirical distributions of the values of the NTS and of the test sample are shown in Figure 8, left panel. The similarity of the two distributions in the left and central parts suggests that the NTS (blue) can serve as a useful proxy for the test sample (red). The right tail of the test sample is logically heavier owing to the presence of galaxies, which tend to shift the values of the local maxima upwards.
While the procedure proposed by Mary et al. (2020) is very close to the BH procedure of Algorithm 1, it differs on the following point. Instead of using the estimate FDP = ((V + 1)/K) × m/(n + 1) (see step 4 of Algorithm 1), Mary et al. (2020) use FDP = (V /K) × N 1 /N 0 , where N 1 (resp. N 0 ) is the number of voxels of the region where the local maxima of the test sample (resp. the local minima of the NTS) are computed. Because n is large, V is large as well, and using V instead of V + 1 has no numerical impact in this regime. The normalization factors are in fact very similar as well, with m/(n + 1) ≈ 1.440 and N 1 /N 0 ≈ 1.442. In effect, it turns out that there is no numerical difference in running these two versions of Algorithm 1: in both cases, the procedure rejects exactly 105 local maxima at target FDR level α = 0.2, a situation shown in the right panel of Figure 8. The local maxima rejected in Mary et al. (2020) being the same as those rejected by BH, the discovery set inherits the properties delineated in the present work. First, the FDR control is established from Theorem 3.1, because the only assumption made (Assumption (Indep)) is likely to hold thanks to the important dimension reduction performed on the dataset when focusing on the local maxima/minima. Second, we have n ≈ m while both are large. This means that we are just at the border of the boundary identified in Section 5.2, so the theory is silent for this case. Nevertheless, the distribution of the data exhibits some minimum amount of signal, with perhaps k ≳ 50 fairly detectable alternatives. Hence, the refined upper bound given in Proposition 4.3 can also be applied: since the training-to-test ratio n/m is above the boundary, that is, n/m ≈ 1 is much larger than 1/(kα) ≲ 0.1, the power of BH should be close to that of the oracle for this data set. To conclude, the present paper illustrates that BH, together with our theoretical findings, delivers interpretable and useful results for common practice.
Meanwhile, it validates the use of the procedure proposed in Mary et al. (2020) on this particular data set.

Summary
In a nutshell, this paper evaluated how classical multiple testing methodology generalizes when the knowledge of the null distribution P 0 is replaced by examples Y 1 , . . . , Y n drawn from this null. While this situation is very frequent in practice, it has only scarcely been studied so far, and this paper contributes to filling this gap. The FDR control guarantee holds whatever n, m, with no assumption on P 0 and for any marginal alternative, with a bound αm 0 /m (achieved when α(n + 1)/m is an integer), which is similar to the result obtained in the original work of Benjamini and Hochberg (1995) in the case where P 0 is known. In addition, the power is comparable to that of the oracle when n ≳ m/(α max(1, k)), where k is a confidence lower bound on the number of true discoveries made by the oracle. This "rule of thumb" has been validated both by theory and by numerical experiments. Finally, we demonstrated that our work brings theoretical support, and thus more interpretability, to a worked-out application of recent breakthrough findings in astrophysics. In practice, our "rule of thumb" n ≳ m/(α max(1, k)) can be used as follows: if the user has no strong prior belief in a minimum number k of discoveries, choosing k = 0 is safer, which leads to the condition n ≳ m/α. By contrast, if k can be accurately guessed a priori, the less demanding condition n ≳ m/(αk) can be opted for.
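The practical recipe above amounts to a one-line heuristic. The following helper (ours, not from the paper) simply encodes it, with k = 0 as the safe default; the example values reproduce the transition points discussed for the simulation settings of Section 6.

```python
def nts_rule_of_thumb(m, alpha, k=0):
    """Heuristic NTS size above which the semi-supervised BH procedure
    should mimic the oracle: n >~ m / (alpha * max(1, k)).

    k is a prior lower confidence bound on the number of oracle discoveries;
    k = 0 is the safe default when no such prior belief is available.
    """
    return m / (alpha * max(1, k))

print(nts_rule_of_thumb(1000, 0.2, k=8))    # 625.0 (sparse setting of Figure 7)
print(nts_rule_of_thumb(1000, 0.2, k=250))  # 20.0  (dense setting of Figure 7)
print(nts_rule_of_thumb(1000, 0.2))         # 5000.0 (no prior belief, k = 0)
```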
This work also completed the picture by exhibiting an intrinsic theoretical limitation of the semi-supervised multiple testing setting when the null training sample is not populated enough: it is impossible to control the FDR while mimicking the oracle power for n ≪ m when the sparsity and the distribution of the alternative are left arbitrary. This delineates a setting-intrinsic phase transition at n ≍ m.

Optimality of BH *
In the past literature, numerous works have proposed approaches that improve, sometimes substantially, on the baseline BH procedure (such as the local FDR methods listed in the introduction). Hence, a common belief is that the BH procedure is conservative and suboptimal when controlling the FDR. This belief makes the aim of mimicking the performance of the oracle BH procedure (as in Sections 4 and 5) somewhat questionable. However, we argue that this belief is not justified when the test statistic used before applying the BH algorithm is suitably chosen, typically via a likelihood ratio or a local FDR transformation. This is shown in particular by simulations in the setting of Appendix A, where BH * improves over a local FDR method, itself well known to enjoy optimality properties (Cai et al., 2019). In a nutshell, the possible conservativeness of the BH procedure when controlling the FDR is not due to the BH algorithm per se, but rather to the test statistics used as input to this algorithm. To come back to our framework, the test statistic is assumed to be fixed once and for all in our work. Hence, given the chosen test statistic, BH * is close to optimal and the aim considered in Sections 4 and 5 makes perfect sense.

Future work
Given that the semi-supervised multiple testing setting is versatile, our work raises a number of new perspectives. For instance, in modern machine learning, this setting conveniently bypasses model assumptions on P 0 and only needs a number of null examples, which can be generated by a suitable "blackbox". Nevertheless, in order to avoid potential bias in the null training sample, this blackbox should be properly calibrated, with significant prior calibration and preprocessing steps. While building such an approach deserves an entire devoted study, we anticipate that studying the robustness of the procedure BH with respect to the NTS is a key point: what about the case where Y 1 , . . . , Y n are i.i.d. ∼ P ′ 0 with P ′ 0 ≈ P 0 ?
Another avenue for future work is to adapt recent advances in multiple testing to this semi-supervised setting. For instance, while BH is devoted to the FDR criterion, an interesting and challenging issue is to design semi-supervised counterparts suitable for other criteria, such as the FDX (Genovese and Wasserman, 2004), online FDR (Foster and Stine, 2008; Xu and Ramdas, 2021) or post hoc bounds (Genovese and Wasserman, 2006; Goeman and Solari, 2011). In particular, since the variability of the FDP of BH is increased by the NTS, considering criteria accounting for this effect seems particularly interesting. Since various dependence assumptions are used in such studies, we also expect that our main assumption (Exch) can be relaxed in some of these frameworks.
Finally, proper calibration of the individual tests sometimes requires considering hypothesis-dependent null distributions, that is, null distributions P 0,i that depend on i ∈ {1, . . . , m} (see, e.g., Sulis et al., 2017, 2020 for a concrete example). Since m null training samples should be considered in that case, this poses a complexity issue, and generalizing our results to this setting is both theoretically challenging and useful to support or improve procedures used in common practice.
Recall that an important part of the multiple testing literature is devoted to finding procedures that rigorously control the FDR at level α while maximizing the power. We emphasize that, in this framework, having an FDR equal to α + 10^{−10} (say) is not allowed: the inequality FDR ≤ α must hold under any configuration of the model, which is particularly challenging when negative dependences are possible. For instance, we refer to the very recent work of Fithian and Lei (2020) (see also references therein), which aims at modifying the BH procedure in order to control the FDR under negative dependence. The point of this section is to show that Theorem 3.1 and Example 3.2 allow us to solve this problem in a simple way for some (admittedly specific) dependence structures.
Assume that X is an m-dimensional Gaussian equicorrelated vector with individual variances equal to 1 and with known correlation ρ ∈ [−1/m, 0). Consider n = n(ρ, m) ≥ 1 the largest integer such that ρ ≥ −1/(n + m − 1), that is, n = ⌊−ρ −1 ⌋ − m + 1, and generate an n-sample Y 1 , . . . , Y n such that (Y, X) is Gaussian equicorrelated with correlation ρ. This can be done easily via Proposition S3. Then Theorem 3.1 provides that the procedure BH α controls the FDR at level α. Here, since the NTS is generated by the user, this procedure can be seen as a randomized BH procedure (randBH for short). Algorithm 3 gives the full steps to implement randBH. In addition, our rule of thumb suggests that randBH has a power comparable to that of BH when ρ ≳ −1/(m + m/(α max(1, k))), that is, for configurations close enough to the independent case.
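Proposition S3 and Algorithm 3 are not reproduced here. As a rough sketch of how such a randomized NTS could be produced, one can sample Y conditionally on the observed X under the joint equicorrelated Gaussian model, using the standard conditional Gaussian formulas. All names below are ours, and this is not claimed to be the paper's Algorithm 3.

```python
import numpy as np

def generate_nts(x, rho, rng):
    """Sample Y | X = x so that (Y, X) is centered Gaussian with unit
    variances and equicorrelation rho < 0 (requires rho >= -1/(n + m - 1)).

    n is taken as the largest integer with rho >= -1/(n + m - 1); a small
    epsilon guards the floor against floating-point rounding."""
    m = len(x)
    n = int(np.floor(-1.0 / rho + 1e-9)) - m + 1
    N = n + m
    sigma = (1 - rho) * np.eye(N) + rho * np.ones((N, N))  # joint covariance
    syy, syx, sxx = sigma[:n, :n], sigma[:n, n:], sigma[n:, n:]
    # Standard conditional Gaussian formulas for Y | X = x
    mean = syx @ np.linalg.solve(sxx, x)
    cov = syy - syx @ np.linalg.solve(sxx, syx.T)
    # cov is PSD (possibly singular at the boundary rho = -1/(N-1)):
    # sample through a symmetric square root, clipping tiny negatives
    vals, vecs = np.linalg.eigh(cov)
    root = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
    return mean + root @ rng.standard_normal(n)

rng = np.random.default_rng(0)
m, rho = 50, -1.0 / 99        # boundary case: n = 99 - 50 + 1 = 50
x = rng.standard_normal(m)    # stand-in for the observed test statistics
y = generate_nts(x, rho, rng)
```

Note that the joint equicorrelated covariance is positive semi-definite exactly when ρ ≥ −1/(N − 1) (its smallest eigenvalue is 1 + (N − 1)ρ), which is why n is taken as large as the observed correlation permits.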
Also, we would like to make a disclaimer: we do not pretend that randBH is applicable in general practice, at least in its current form, because it is tied to a too specific dependence structure. Rather, the message is that randomization (plus using p-values biased upwards) can help the BH procedure be more robust with respect to negative dependences. We think that this intriguing side result is an important proof of concept.
Finally, this phenomenon can be derived for other negative dependence structures; however, the reachable distributions of (X i , i ∈ H 0 ) should necessarily be expressible as the marginal of a larger exchangeable vector (Y 1 , . . . , Y n , X i , i ∈ H 0 ) in order to satisfy (Exch).