Online multiple testing with super-uniformity reward

Valid online inference is an important problem in contemporary multiple testing research, to which various solutions have been proposed recently. It is well known that these existing methods can suffer from a significant loss of power if the null $p$-values are conservative. In this work, we extend the previously introduced methodology to obtain more powerful procedures for the case of super-uniformly distributed $p$-values. These types of $p$-values arise in important settings, e.g., when discrete hypothesis tests are performed or when the $p$-values are weighted. To this end, we introduce the method of super-uniformity reward (SUR), which incorporates information about the individual null cumulative distribution functions. Our approach yields several new 'rewarded' procedures that offer uniform power improvements over known procedures and come with mathematical guarantees for controlling online error criteria based either on the family-wise error rate (FWER) or the marginal false discovery rate (mFDR). We illustrate the benefit of super-uniform rewarding in real-data analyses and simulation studies. While discrete tests serve as our leading example, we also show how our method can be applied to weighted $p$-values.


Background
Multiple testing is a well-established statistical paradigm for the analysis of complex and large-scale data sets, in which each hypothesis typically corresponds to a scientific question. In the classical situation, the set of hypotheses should be pre-specified before running the statistical inference. However, in contrast to this 'offline' setting, in many contemporary applications questions arise sequentially. A first instance of such a sequential application is testing a single null hypothesis repeatedly as new data are collected, as in the continuous monitoring of A/B tests in the information technology industry or in marketing research, see Kohavi et al. (2013); Johari et al. (2019) and references therein, or Howard et al. (2021) for recent developments. A second situation is when the null hypotheses are (potentially) different and arrive in a continuous stream, so that decisions have to be made one at a time and prior to the termination of the stream. This is generally referred to as the online multiple testing (OMT) framework and is the focus of this paper, see, e.g., Lark (2017); Robertson et al. (2019); Kohavi et al. (2020) for application examples. This second situation can also occur in combination with the first one to form a 'doubly-sequential' experiment (Ramdas, 2019).

Existing literature on online multiple testing
The literature aiming at control of various error rates in OMT has grown rapidly in the last few years. As a starting point, the family-wise error rate (FWER) is the probability of making at least one error among the past discoveries, and a typical aim is to control it at each time of the stream (for a formal definition of this and other error rates, see Section 2.2). Since controlling the FWER at a given level α is a strong constraint, it requires employing a conservative procedure, thus generally leading to few discoveries. The typical strategy is to distribute the initial wealth α over time, e.g., testing the i-th hypothesis at level αγ_i for a sequence {γ_i}_{i≥1} summing to 1. This approach is generally referred to as α-spending in the literature (Foster and Stine, 2008).
A less stringent criterion is the false discovery rate (FDR), which corresponds to the expected proportion of false discoveries. This versatile criterion allows many more discoveries than the FWER and has enjoyed huge success in the offline multiple testing literature, both from a theoretical and a practical point of view, since its introduction by Benjamini and Hochberg (1995). In their seminal work on OMT, Foster and Stine (2008) extended the FDR to the online setting by considering the expected proportion of errors among the past discoveries (actually, considering the marginal FDR, denoted below by mFDR, which is defined as the ratio of the expectations rather than the expectation of the ratio). The novel strategy in Foster and Stine (2008), called α-investing, is based on the idea that an mFDR controlling procedure is allowed to recover some α-wealth after each rejection, which slows down the natural decrease of the individual test levels. In subsequent papers, many further improvements of this method have been proposed: first, the α-investing rule was generalized by Aharoni and Rosset (2014), while maintaining marginal FDR control. Later, Javanmard and Montanari (2018) established the (non-marginal) FDR control of these rules, including the LORD (Levels based On Recent Discovery) procedure. Then, a uniform improvement of LORD, called LORD++, was proposed by Ramdas et al. (2017), which maintains FDR/mFDR control while extending the theory in several directions (weighting, penalties, decaying memory).
Extensions to other specific frameworks have been proposed, including rules that allow asynchronous online testing (Zrnic et al., 2021), maintain privacy (Zhang et al., 2020), and accommodate a high-dimensional regression model (Johnson et al., 2020). Other online error criteria have also been explored, such as false discovery exceedance (Javanmard and Montanari, 2018; Xu and Ramdas, 2021), post hoc false discovery proportion bounds (Katsevich and Ramdas, 2020), and confidence intervals with false coverage rate control (Weinstein and Ramdas, 2020).
Since the online framework is more constrained than the offline one, the employed procedures are generally less powerful in that context. Hence, another important branch of the literature aims at proposing improved rules that yield more discoveries: first, following the classical 'adaptive' offline strategy, procedures can be made less conservative by implicitly estimating the proportion of true null hypotheses, see the SAFFRON procedure for FDR and the adaptive-spending procedure for FWER. Second, under an assumption on the null distribution, increasing the number of discoveries is possible by 'discarding' tests with a too large p-value (Ramdas et al., 2018; Tian and Ramdas, 2019, 2021).
A power enhancement can also be obtained by combining online procedures with other methods. A natural idea is to use more sophisticated individual tests in the first place, e.g., based on multi-armed bandits (Yang et al., 2017), or on so-called 'always valid p-values', see Johari et al. (2019) and references therein. Another idea is to combine offline procedures to form 'mini-batch' rules, see Zrnic et al. (2020). Further improvements are also possible by incorporating contextual information as done by Chen and Kasiviswanathan (2020a) or by using a local FDR-like approach, see Gang et al. (2020). Lastly, performance boundaries have been derived by Chen and Arias-Castro (2021).

Super-uniformity
This paper considers OMT in the setting of super-uniformly distributed p-values (defined in detail in Section 2.1). Super-uniformity may originate from various sources. The first main example we have in mind, which has been extensively investigated in the statistical literature, is super-uniformity arising from discrete p-values (described in detail in Section 5). Additionally, we show that super-uniformity can also be used in a more indirect way as a device for dealing with online p-value weighting. In the offline setting, this is a powerful and extensively studied approach, which has, however, received little attention in the online case so far (described in detail in Section 6).
Discrete tests often arise when the tests are based on counts or contingency tables, for example:
• in clinical studies, the efficacy or safety of drugs is compared by counting patients who survive a certain period after being treated, or who experience a certain type of adverse drug reaction;
• in biology, the genotype effect on the phenotype can be tested by knocking out genes sequentially in time.
The latter case is met for instance with the data from the International Mouse Phenotyping Consortium (IMPC, see Muñoz-Fuentes et al. (2018)), which contains many categorical variables and thus is described with counts and contingency tables. While this data set is frequently used (see, e.g., Tian and Ramdas, 2021; Xu and Ramdas, 2021; Karp et al., 2017), the classical OMT procedures do not exploit the discrete nature of the tests, and it turns out that much more powerful procedures can be developed, see Section 5.3. In the literature, different solutions have been proposed for dealing with the conservatism of discrete tests, e.g., by modifying the p-values directly, either by randomization (see Habiger, 2015 and references therein), or by shrinking them to build so-called mid p-values (see Heller and Gur, 2011 and references therein). While randomized approaches possess attractive theoretical properties, they are often criticized for their lack of reproducibility (see, e.g., Berger, 1996 and Ripamonti et al., 2017). An active research area explores this phenomenon in the offline multiple testing setting, with the seminal works of Tarone (1990); Westfall and Wolfinger (1997); Gilbert (2005a) and the subsequent studies of Heyse (2011); Heller and Gur (2011); Dickhaus et al. (2012); Habiger (2015); Chen et al. (2015); Döhler (2016); Chen et al. (2018); Döhler et al. (2018); Durand et al. (2019), see also references therein. The present work shows that such an improvement is also possible in the online setting, as far as FWER or mFDR control is concerned.
Finally, weighting p-values is a well-established and popular approach for improving the performance of offline multiple testing procedures. It can be traced back to Holm (1979) and has been further developed in, e.g., Genovese et al. (2006); Wasserman and Roeder (2006); Rubin et al. (2006); Blanchard and Roquain (2008); Roquain and van de Wiel (2009); Hu et al. (2010); Zhao and Zhang (2014); Ignatiadis et al. (2016); Durand (2019); Ramdas et al. (2019), with weights that can be driven for instance by sample size, groups, or more generally by some covariates. By approaching the problem from the perspective of super-uniformity, our general method also allows seamless and flexible integration of such weighting schemes in an online context.

[Table 1: overview of the base procedures and the online error rate each one controls (FWER or mFDR), depending on the case.]

Contributions of the paper
In this paper, we propose uniform improvements of the classical base procedures listed in Table 1 and prove control of the corresponding error rates. A distinguishing feature of our work is that we assume that a (non-trivial) upper bound for the null cumulative distribution functions (c.d.f.'s), called the null bounding family, is known (see Section 2.1). By combining this information with base procedures, we construct more efficient OMT procedures (see Table 2). The key quantity involved in this construction can be interpreted as a reward (more details will be provided in Section 2.3) induced by the super-uniformity of the null bounding family. Therefore, we use the acronym SUR (Super-Uniformity Reward) to refer to these new procedures. When we use the uniform null bounding family (i.e., in the classical framework), our SUR procedures reduce to their base counterparts. Our main contributions are as follows:
• We propose two new SUR procedures for online FWER control in Section 3: the first one (ρOB) uniformly improves upon the Online Bonferroni procedure (OB), while the second (ρAOB) uniformly improves upon the adaptive spending procedure of Tian and Ramdas (2021).
• We present a general and simple way of constructing SUR procedures for any base procedure satisfying some mild conditions, see Section 3.4 for FWER and Section 4.4 for mFDR. This allows us to obtain concise proofs for all our results, which are deferred to the supplement, see Section A.
• Application to discrete data: we evaluate the performance of the new SUR procedures on discrete data, with simulated experiments (Section 5.2) and a classical real data set (Section 5.3), where each hypothesis is tested using a (discrete) Fisher exact test.
The gain in power is shown to be substantial.
• Application to p-value weighting: our new SUR procedures can be used to derive weighted online FWER and mFDR controlling procedures. The p-value weighting is carried out by rescaling the 'raw' weights in a certain way, so that the weighted p-value distributions become super-uniform and our methodology can be applied. The new online procedures are shown to outperform existing ones both on simulated and real data (Section 6).
For easier readability of the paper, a succinct overview of our work is presented in Tables 1 and 2, which list the base and SUR procedures and provide links to the definitions and results for error rate control. All our numerical experiments (simulations and application) are reproducible from the code provided in the repository https://github.com/iqm15/SUREOMT.

Relation to adaptive discarding
As Tian and Ramdas (2019) pointed out, online multiple testing procedures frequently suffer from significant power loss if the null p-values are too conservative. In Tian and Ramdas (2021) (FWER control) and Tian and Ramdas (2019) (mFDR control), the authors propose adaptive discarding (ADDIS) approaches as improved methods. In particular, one idea is to use a discarding rule that avoids testing a null when the corresponding p-value exceeds a given threshold. For the particular type of super-uniformity induced by discrete tests, we show that the discarding rule is less efficient than the SUR method, at least in the settings of Sections 5.2 and 5.3.

Setting, procedure and assumptions
Let X = (X_t, t ∈ {1, 2, . . .}) be a process composed of random variables. We denote the distribution of X by P, which is assumed to belong to some distribution set P. We consider an online testing problem where, at each time t ≥ 1, the user only observes the variable X_t and should test a new null hypothesis H_t, which corresponds to some subset of P, typically defined from the distribution of X_t. We let H_0 = H_0(P) = {t ≥ 1 : H_t is satisfied by P} denote the set of (unknown) times where the corresponding null hypothesis is true. Throughout the manuscript, we focus on decisions based upon p-values. Hence, we suppose that at each time t, we have at hand a p-value p_t = p_t(X) ∈ [0, 1] (typically depending only on X_t, although this is not necessary) for testing H_t, and we consider online multiple testing procedures based on p-value thresholding. This means that each null H_t is rejected whenever p_t(X) ≤ α_t, where α_t ∈ [0, ∞) is a non-negative threshold, called a critical value, that is allowed to depend on the past decisions. More precisely, we denote R_t = 1{p_t(X) ≤ α_t}, C_t = 1{p_t(X) ≥ λ} for all t ≥ 1 and assume that each α_t is measurable with respect to the σ-field F_{t-1} = σ(R_1, C_1, . . . , R_{t-1}, C_{t-1}). Here, λ ∈ [0, 1] is a parameter that is used for designing adaptive procedures. The particular non-adaptive case is obtained by setting λ = 0, in which case F_{t-1} = σ(R_1, . . . , R_{t-1}). In the literature, this property is referred to as predictability, see Ramdas et al.
(2017). Throughout the manuscript, an online multiple testing procedure is identified with a family A = {α_t, t ≥ 1} of such predictable critical values. Let us now state the assumptions used in what follows. First, recall the classical super-uniformity assumption:

P(p_t(X) ≤ u) ≤ u, for all u ∈ [0, 1], t ∈ H_0, and P ∈ P, (1)

which means that each test rejecting H_t when p_t(X) is smaller than or equal to u is of level u. Here, we typically consider a setting where these tests may have a more stringent level. Formally, at each time t, there is a known null function F_t : [0, 1] → [0, 1], non-decreasing and satisfying F_t(u) ≤ u, such that

P(p_t(X) ≤ u) ≤ F_t(u), for all u ∈ [0, 1], and P ∈ P with t ∈ H_0. (2)

Note that we will sometimes also consider F_t(u) for u ≥ 1, in which case it is to be understood as F_t(u ∧ 1). The family F = {F_t, t ≥ 1} will be referred to as the null bounding family. Note that (2) reduces to (1) when choosing F_t(u) = u for all u, but encompasses other cases by choosing the null bounding family differently. Typically, for discrete tests, it is well known that F_t(u) can be (much) smaller than u, see Example 2.1 for more details. Second, another important assumption is the online independence within the p-value process:

p_t(X) is independent of the past decisions F_{t-1}, for all t ∈ H_0 and P ∈ P. (3)

For instance, Assumption (3) holds in the case where p_t(X) only depends on X_t and the variables in (X_t, t ≥ 1) are all mutually independent, which means that the data are collected independently at each time.
Remark 2.1. In this manuscript, results are often based on Assumptions (2) and (3). In all these results, these two assumptions can be replaced by the weaker condition

P(p_t(X) ≤ u | F_{t-1}) ≤ F_t(u) a.s., for all u ∈ [0, 1], t ∈ H_0, and P ∈ P. (4)

When choosing the null bounding family F_t(u) = u for all u, the latter condition is sometimes referred to as SuperCoAD (super-uniformity conditionally on all discoveries), see Ramdas et al. (2017).
Throughout the paper, we investigate the following two prototypical examples of super-uniformity.
Example 2.1. Our leading example is the case where a discrete test statistic is used for inference in each individual test. Typical instances include tests for analyzing counts represented by contingency tables, such as Fisher's exact test, see Section 5.2. In discrete testing, each p-value p_t(X) has its own support S_t (known and not depending on P), which is a finite set (or, in full generality, a countable set with 0 as the only possible accumulation point). A null bounding family satisfying (2) can easily be derived by considering F_t, the right-continuous step function that jumps at each point of S_t, see Figure 2 below. Note that the support S_t depends on t, so that discrete testing also induces heterogeneity over time.
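The step-function construction of Example 2.1 can be sketched as follows. This is a minimal illustration (function name and toy support are ours, not the paper's code), assuming the standard choice F_t(u) = max{s ∈ S_t : s ≤ u} for an exact discrete test, which is indeed a right-continuous step function jumping at each point of S_t.

```python
# Sketch of a null bounding function F_t for a discrete p-value with finite support S_t,
# under the assumption F_t(u) = max{s in S_t : s <= u} (with max over the empty set = 0).

def make_discrete_F(support):
    """Return the step function F_t induced by the p-value support `support`."""
    pts = sorted(support)

    def F(u):
        u = min(u, 1.0)  # F_t(u) is read as F_t(u ∧ 1) for u >= 1, as in the text
        best = 0.0
        for s in pts:
            if s <= u:
                best = s
            else:
                break
        return best

    return F
```

For instance, with toy support {0.02, 0.1, 0.35, 1.0}, one gets F(0.05) = 0.02 and F(0.01) = 0, showing how much smaller than u the bound F_t(u) can be.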
Example 2.2. Our secondary example is p-value weighting, where we start from continuous p-values (uniform under the null) that are weighted using external a priori information in order to increase power, see Section 6.

Error rates and power
Let us define the criteria that we use to measure the quality of a given procedure A = {α_t, t ≥ 1}. For each T ≥ 1, let R(T) = {t ∈ {1, . . . , T} : p_t(X) ≤ α_t} denote the set of rejection times of the procedure A up to time T. We consider the following two classical online criteria for type I error rates:

FWER(T, A, P) = P(|R(T) ∩ H_0| ≥ 1), (5)

mFDR(T, A, P) = E[|R(T) ∩ H_0|] / E[|R(T)|], (6)

with the convention 0/0 = 0. In words, when controlling the online FWER at level α, one has the guarantee that, at each fixed time T, the probability of making at least one false discovery before time T is below α. Since FWER control does not tolerate any false discovery (with high probability), it is generally considered a stringent criterion. By contrast, when controlling the online mFDR, at each time T, the expected number of false discoveries before time T can be non-zero, but in an amount controlled by the expected number of discoveries.
While online FWER has been investigated in Tian and Ramdas (2021), online mFDR control is generally less conservative (that is, it allows more discoveries) and is widely used in an online context, see Foster and Stine (2008); Ramdas et al. (2017, 2018). The false discovery rate (FDR) is close to the mFDR: it is defined by using the expectation of the ratio, instead of the ratio of the expectations as in (6). Controlling the FDR generally requires more assumptions, while the mFDR is particularly useful in an online context (we refer the reader to Section 1.1 of Zrnic et al. (2021) for more discussion on this). For a given error rate, we aim at deriving procedures that maximize power. For any procedure A, we define the power as the expected proportion of signal the procedure can detect, that is,

Power(T, A, P) = E[|R(T) ∩ H_1|] / (|H_1 ∩ {1, . . . , T}| ∨ 1), (7)

where H_1 is the set of times of false nulls, that is, the complement of H_0 in {1, 2, . . .}. While this power notion will be used in our numerical experiments to compare procedures, our theoretical results will use a stricter comparison criterion. For two procedures A = {α_t, t ≥ 1} and A' = {α'_t, t ≥ 1}, we say that A' uniformly dominates A when α'_t ≥ α_t for all t ≥ 1 (almost surely). This implies that, almost surely, A' makes more discoveries than A, in the sense that the set of discoveries of A is contained in that of A', that is, R(T) ⊂ R'(T) for all T ≥ 1 (a.s.). In particular, this implies the same domination for the true discovery sets and thus Power(T, A, P) ≤ Power(T, A', P) for all T ≥ 1. With this terminology, we can restate the aim of this work as follows: construct valid OMT procedures that uniformly dominate their base procedures by incorporating the null bounding family F_t given in (2).
Remark 2.2. There is no consensus regarding the most adequate definition of power in the online testing literature. The concept of uniform domination that we use in this paper is much stronger than, e.g., the asymptotic power considered by Javanmard and Montanari (2018). It may, however, not be particularly appropriate if the base procedure A is chosen poorly. Since the base procedures given in Table 1 are standard in our setting, the domination criterion seems to be reasonable.

Wealth and super-uniformity reward
In the Generalized Alpha-Investing (GAI) paradigm (see Xu and Ramdas (2021) and the references given therein), the nominal level α, at which one wants to control the type I error rate, can be seen as an overall error budget, or wealth, that may be spent on testing hypotheses in the course of an online experiment. For a given OMT procedure A, it is possible to define a suitable wealth function W(T) = W(T, A, P), such that W(T) represents the wealth available at time T for further testing. As a case in point, Xu and Ramdas (2021) define the (nominal) wealth function for the online Bonferroni procedure by W_nom(T) = α − Σ_{t=1}^{T} αγ_t. Generalizing this expression to arbitrary null distributions, we obtain the 'true' or 'effective' wealth W_eff(T) = α − Σ_{t=1}^{T} F_t(αγ_t), where F_t is a null bounding function. In the super-uniform setting, assumption (2) implies W_nom(T) ≤ W_eff(T), and as the two orange curves in Figure 1 illustrate, the discrepancy can be quite large.

[Figure 1: Nominal wealth for OB (dashed orange curve), effective wealth for OB (solid orange curve) and effective wealth for ρOB (solid green curve) for the male mice from the IMPC data (see Section 5.3 for more details).]

However, while the user thinks the procedure is spending the budget over time according to the nominal wealth given by the dashed orange curve, in reality the procedure is underutilizing wealth, as the solid orange true wealth curve indicates. This unnecessarily austere spending behaviour makes the online Bonferroni procedure sub-optimal. In addition, this phenomenon extends to the other procedures and error rates listed in Table 1 as well. Our proposed solution incorporates super-uniformity so that its wealth function behaves more like the targeted nominal wealth, as depicted by the green curve in Figure 1.
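The two wealth functions above are easy to compute side by side. The following sketch (function name and toy null bounding family are ours) evaluates W_nom(T) = α − Σ_{t≤T} αγ_t and W_eff(T) = α − Σ_{t≤T} F_t(αγ_t) for a given spending sequence; since F_t(u) ≤ u, the effective curve always lies above the nominal one.

```python
# Sketch: nominal vs. effective wealth curves for online Bonferroni,
# given a spending sequence (gamma_t) and a null bounding family (F_t).

def wealth_curves(alpha, gammas, Fs):
    """Return the lists (W_nom(T))_T and (W_eff(T))_T for T = 1..len(gammas)."""
    w_nom, w_eff = [], []
    nom, eff = alpha, alpha
    for gamma_t, F_t in zip(gammas, Fs):
        nom -= alpha * gamma_t          # nominal spending at level alpha*gamma_t
        eff -= F_t(alpha * gamma_t)     # F_t(u) <= u: the true spending is smaller
        w_nom.append(nom)
        w_eff.append(eff)
    return w_nom, w_eff
```

With a strongly discrete toy family (e.g., F_t identically 0 below the support infimum), the gap between the two curves mirrors the discrepancy between the dashed and solid orange curves of Figure 1.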
For incorporating super-uniformity, we introduce the super-uniformity reward (SUR), a key quantity in our work. For any procedure A = {α_t, t ≥ 1} and null bounding family F = {F_t, t ≥ 1}, the super-uniformity reward ρ_t at time t is defined by

ρ_t = α_t − F_t(α_t). (8)

Note that (2) always implies ρ_t ≥ 0 for all t ≥ 1. In the case of discrete testing (Example 2.1), we have F_t(α_t) = 0 when α_t is below the infimum of the support S_t. This produces the maximum possible super-uniformity reward at time t, that is, ρ_t = α_t. Conversely, when α_t ∈ S_t, we have F_t(α_t) = α_t and no super-uniformity reward at time t, that is, ρ_t = 0. In general, we have ρ_t ∈ [0, α_t], its actual value depending on the discreteness of the test (that is, on the steps of F_t) and on the value of α_t. The super-uniformity reward is illustrated in Figure 2 for a single distribution F_t and value α_t. Mathematically, ρ_t is simply the difference between the nominal significance level α_t and the truly achieved significance level F_t(α_t). In terms of wealth, ρ_t can be interpreted as the fraction of the nominal significance level which the OMT procedure was unable to 'spend' due to super-uniformity. Intuitively, it seems clear that this amount can be put aside and re-allocated to the subsequent tests to increase the future critical values (α_T, T ≥ t + 1). In Sections 3 and 4, we show in detail how this can be done without sacrificing type I error control.
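The two extreme cases of the reward ρ_t = α_t − F_t(α_t) for a discrete test can be checked in a few lines. This is a minimal sketch (helper name and toy support are ours), using the step-function choice F_t(u) = max{s ∈ S_t : s ≤ u} of Example 2.1.

```python
# Sketch of the super-uniformity reward rho_t = alpha_t - F_t(alpha_t) for a discrete
# test, with F_t(u) = max{s in S_t : s <= u} (max over the empty set taken as 0).

def sur_reward(alpha_t, support):
    F_t_val = max((s for s in support if s <= min(alpha_t, 1.0)), default=0.0)
    return alpha_t - F_t_val

# alpha_t below the support infimum: full reward, rho_t = alpha_t.
# alpha_t equal to a support point:  F_t(alpha_t) = alpha_t, so rho_t = 0.
```

The general case falls in between, ρ_t ∈ [0, α_t], depending on where α_t lands relative to the steps of F_t.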

Spending sequences
As Table 1 displays, the base procedures we use are parametrized by a sequence γ = (γ_t)_{t≥1} of non-negative values, such that Σ_{t≥1} γ_t ≤ 1, which we refer to as the spending sequence. The spending sequence controls the rate at which the wealth is spent in the course of the online experiment (for instance, see (10) for the online Bonferroni procedure). However, finding suitable spending sequences is not trivial: there is a trade-off between saving wealth for large values of T and the ability to make discoveries in the not-too-distant future. Typical choices for γ in the literature are:
• γ_t ∝ t^{-q} for all t, for some q > 1, see Tian and Ramdas (2021);
• γ_t ∝ (t + 1)^{-1} log^{-q}(t + 1) for all t, for some q > 1, see Tian and Ramdas (2021);
• γ_t ∝ log(t ∨ 2)/(t e^{√(log t)}) for all t, see Javanmard and Montanari (2018).
Throughout the paper, we choose γ_t ∝ t^{-q} with q = 1.6, as suggested by previous literature. In the base procedures listed in Table 1, there are two potential sources of wealth: the initial wealth invested at T = 0, and the rejection reward that can be earned by rejections for investing procedures (i.e., mFDR controlling procedures). When one can use the super-uniformity reward as described in Section 2.3, an additional source of wealth comes into play. Indeed, our approach is to use an additional SUR spending sequence γ' to smoothly incorporate all the rewards collected up to time T when computing the new critical value α_T. This SUR spending sequence could be chosen, for instance, from one of the smoothing sequences listed above. Here, we focus on the following choice:

γ'_t = h^{-1} 1{1 ≤ t ≤ h}, (9)

where h ≥ 1 is a suitably chosen integer. Since this leads to procedures that spread rewards uniformly over a finite horizon of length h, we refer to (9), by analogy with non-parametric density estimation, as a rectangular kernel with bandwidth h. Finally, another idea, introduced by Ramdas et al. (2018); Tian and Ramdas (2021) in order to slow down the natural decay of the α_t sequence, is to consider γ_{T(t)}, where T(t) is a slowed-down clock, see (16) and (26) below. As we will see in Section 3.3 and Section 4.3, this technique can also be combined with a suitable super-uniformity reward.
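The two sequences above can be instantiated as follows. This is an illustrative sketch (function names are ours): the power-law spending sequence γ_t ∝ t^{-1.6} is normalized here over a finite horizon so that Σ_t γ_t ≤ 1 holds, whereas the paper's proportionality constant is left unspecified; the rectangular kernel puts mass 1/h on the first h lags.

```python
# Sketch of a truncated power-law spending sequence gamma_t ∝ t^{-q} and the
# rectangular-kernel SUR spending sequence gamma'_t = (1/h) 1{1 <= t <= h}.

def power_law_gamma(horizon, q=1.6):
    """gamma_1..gamma_horizon, proportional to t^{-q} and summing to 1 over the horizon."""
    raw = [t ** (-q) for t in range(1, horizon + 1)]
    total = sum(raw)
    return [r / total for r in raw]

def rectangular_gamma_prime(h):
    """The SUR spending sequence gamma'_t as a function of the lag t."""
    return lambda t: 1.0 / h if 1 <= t <= h else 0.0
```

The normalization choice only changes the overall scale of the γ_t; the trade-off discussed above (spending early versus saving wealth) is governed by q.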

Online FWER control
In this section, we aim at finding procedures A such that FWER(A, P) ≤ α for some targeted level α ∈ (0, 1). We begin with a simple application of our approach that improves the online Bonferroni procedure with a 'greedy' super-uniformity reward, and then turn to a smoother spending of the super-uniformity reward (Theorem 3.1). This approach is then applied in combination with the adaptive online procedure introduced by Tian and Ramdas (2021) (Theorem 3.2). Finally, a general result is provided (Theorem 3.3) that makes it possible to reward any procedure controlling the online FWER in some specific way. This unifies all results obtained in this section while further extending the scope of our methodology.

Warming-up: online Bonferroni procedure and a first greedy reward
For any given spending sequence γ = (γ_t)_{t≥1}, a well-known online FWER controlling procedure is the online Bonferroni procedure,

α_T^{OB} = αγ_T, T ≥ 1. (10)

It is also called the Alpha-Spending rule (Foster and Stine, 2008) in the context of online FWER control, see Tian and Ramdas (2021). It is straightforward to check that A^{OB} controls the FWER under the classical super-uniformity condition (1): by the Markov inequality, for all T ≥ 1 and P ∈ P,

FWER(T, A^{OB}, P) ≤ Σ_{t ∈ H_0, t ≤ T} P(p_t(X) ≤ αγ_t) ≤ Σ_{t≥1} αγ_t ≤ α. (11)

Let us now present the rationale behind our approach in this simple case. Assume more generally that we have at hand a null bounding family F = {F_t, t ≥ 1} satisfying (2). The above reasoning leads to the following valid bound for any procedure A = {α_t, t ≥ 1} (with deterministic α_t):

FWER(T, A, P) ≤ Σ_{t ∈ H_0, t ≤ T} F_t(α_t) ≤ Σ_{t=1}^{T} F_t(α_t), (12)

which is bounded by α when choosing

α_T = αγ_T + α_{T-1} − F_{T-1}(α_{T-1}), T ≥ 1, (13)

with the convention α_0 = F_0(α_0) = 0 (the right-hand side of (12) then telescopes to α Σ_{t=1}^{T} γ_t − ρ_T ≤ α). The latter is a recursive relation that allows us to define a new procedure A = {α_t, t ≥ 1} controlling the FWER. Since α_{T-1} − F_{T-1}(α_{T-1}) = ρ_{T-1}, where ρ_{T-1} is the super-uniformity reward (8) at time T − 1 (with the convention ρ_0 = 0), the recursion (13) can be rewritten as

α_T = αγ_T + ρ_{T-1}, T ≥ 1. (14)

In addition, from (2), we have ρ_{T-1} ≥ 0, and the critical values (14) uniformly dominate the online Bonferroni critical values (10) (the obtained critical values are in particular non-negative, thus defining a valid OMT procedure). The approach behind the critical values (14) is said here to be 'greedy', because it spends the complete super-uniformity reward ρ_{T-1} obtained at step T − 1 to increase the next critical value α_T.
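The greedy rule is a one-line recursion: at each step, add to the base level αγ_T the entire reward left over from the previous step. The sketch below (function names are ours) takes the spending sequence and null bounding family as inputs.

```python
# Sketch of the 'greedy' rewarded online Bonferroni rule:
# alpha_T = alpha*gamma_T + rho_{T-1}, with rho_t = alpha_t - F_t(alpha_t), rho_0 = 0.

def greedy_rewarded_ob(alpha, gammas, Fs):
    """Return the list of critical values alpha_1, alpha_2, ... of the greedy rule."""
    rho_prev = 0.0
    crit = []
    for gamma_t, F_t in zip(gammas, Fs):
        a_t = alpha * gamma_t + rho_prev
        rho_prev = a_t - F_t(a_t)   # full reward carried over to the very next step
        crit.append(a_t)
    return crit
```

Since ρ_{T-1} ≥ 0, each critical value is at least the online Bonferroni level αγ_T, which is the uniform domination claimed in the text.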

Smoothing out the super-uniformity reward
The greedy policy described in the previous section is not always appropriate when time is considered over a potentially large period, because the sequence of critical values might fall too abruptly. Instead, we can smooth this effect over time by distributing the reward collected at time T − 1 over all subsequent times. To formalize this idea, we introduce a SUR spending sequence (see also Section 2.4), which is defined as a non-negative sequence γ' = (γ'_t)_{t≥1} such that Σ_{t≥1} γ'_t ≤ 1. While this definition is mathematically the same as the definition of a spending sequence, the role of the SUR spending sequence is different, so we use a different name for it.
Definition 3.1. For any spending sequence γ and any SUR spending sequence γ', the online Bonferroni procedure with super-uniformity reward, denoted by A^{ρOB} = {α_t^{ρOB}, t ≥ 1}, is defined by the recursion

α_T^{ρOB} = αγ_T + Σ_{t=1}^{T-1} γ'_{T-t} ρ_t, T ≥ 1, (15)

where ρ_t = α_t^{ρOB} − F_t(α_t^{ρOB}) denotes the super-uniformity reward at time t for that procedure.
Note that taking γ' = (1, 0, . . . , 0) recovers the 'greedy' critical values (14). For the rectangular kernel SUR spending sequence given by (9), we have Σ_{t=1}^{T-1} γ'_{T-t} ρ_t = h^{-1} Σ_{t=1∨(T-h)}^{T-1} ρ_t, which we interpret as a uniform spending of the SUR reward over the last h time points. As shown in Figure 3, the corresponding sequence of critical values (green line) is more 'stable' than the one using the greedy approach (blue line), allowing for some additional discoveries (on this simulated data). The following result provides FWER control for the new rewarded critical values (15), for a general SUR spending sequence.
Theorem 3.1. Consider the setting of Section 2.1, where a null bounding family F = {F_t, t ≥ 1} satisfying (2) is at hand. For any spending sequence γ and any SUR spending sequence γ', consider the online Bonferroni procedure A^{OB} = {α_t^{OB}, t ≥ 1} given by (10), and the online Bonferroni procedure with super-uniformity reward A^{ρOB} = {α_t^{ρOB}, t ≥ 1} given by (15). Then we have FWER(A^{ρOB}, P) ≤ α for all P ∈ P, while A^{ρOB} uniformly dominates A^{OB}. This result will be a consequence of a more general result, see Section 3.4.

[Figure 3: Critical values of the greedy (blue line) and smoothed (green line) rewarded procedures on simulated data. The grey dots denote the p-value sequence (those equal to 1 are displayed at the top of the picture). The spending sequence is γ_t ∝ t^{-1.6}.]
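The ρOB recursion of Definition 3.1 generalizes the greedy rule by weighting past rewards with the SUR spending sequence. The sketch below (function names are ours) accepts any γ' given as a function of the lag; a mass-at-one γ' recovers the greedy rule, while the rectangular kernel averages the last h rewards.

```python
# Sketch of the rewarded online Bonferroni rule of Definition 3.1:
# alpha_T = alpha*gamma_T + sum_{t=1}^{T-1} gamma'_{T-t} * rho_t,
# with rho_t = alpha_t - F_t(alpha_t).

def rewarded_ob(alpha, gammas, Fs, gamma_prime):
    """Return the critical values alpha_1, ..., alpha_len(gammas) of the rho-OB rule."""
    crit, rhos = [], []
    for T, (gamma_T, F_T) in enumerate(zip(gammas, Fs), start=1):
        smoothed = sum(gamma_prime(T - t) * rhos[t - 1] for t in range(1, T))
        a_T = alpha * gamma_T + smoothed
        rhos.append(a_T - F_T(a_T))   # reward collected at time T
        crit.append(a_T)
    return crit
```

With `gamma_prime = lambda t: 1.0 if t == 1 else 0.0` this reproduces the greedy critical values; with a rectangular kernel of bandwidth h, each reward is spread uniformly over the next h steps, yielding the more stable green curve of Figure 3.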

Rewarded Adaptive Online Bonferroni
It is apparent from (11)-(12) that there is some looseness in upper-bounding Σ_{t ∈ H_0} γ_t by Σ_{t≥1} γ_t, which may lead to unnecessarily conservative procedures. We may attempt to avoid this loss of efficiency by considering a spending sequence γ satisfying the condition Σ_{t ∈ H_0} γ_t ≤ 1, which is more liberal than Σ_{t≥1} γ_t ≤ 1. In words, this means that the index t in the sequence {γ_t, t ≥ 1} should only be incremented when we are testing a hypothesis H_t with t ∈ H_0. Since H_0 is unknown, such a modification cannot be implemented directly in the γ sequence. Nevertheless, an approach proposed by Tian and Ramdas (2021) consists in replacing the unknown set H_0 by the estimate {1} ∪ {t ≥ 2 : p_{t-1} ≥ λ} for some parameter λ ∈ (0, 1), and correcting the introduced error in the thresholds α_t to maintain FWER control. More formally, we follow Tian and Ramdas (2021) by introducing the re-indexation functional T : {1, . . .} → {1, . . .} defined by

T(T) = 1 + Σ_{t=1}^{T-1} 1{p_t ≥ λ}. (16)

Since a large p-value is more likely to be linked to a true null, T(T) is used to account for the number of true nulls before time T (note that this estimate is nevertheless biased). From an intuitive point of view, T(T) slows down the time by only incrementing it when the preceding p-value is large enough. This idea leads to the adaptive online Bonferroni procedure introduced by Tian and Ramdas (2021) (called there 'Adaptive spending'), with spending sequence γ and adaptivity parameter λ ∈ [0, 1), denoted here by A^{AOB} = {α_t^{AOB}, t ≥ 1}, and given by

α_T^{AOB} = α(1 − λ)γ_{T(T)}, T ≥ 1. (17)

It recovers the standard online Bonferroni procedure when λ = 0 (because T(T) = T for T ≥ 1 in that case), but leads to different thresholds when λ > 0.
Comparing $\mathcal{A}^{AOB}$ to $\mathcal{A}^{OB}$, no procedure uniformly dominates the other. An improvement of $\mathcal{A}^{AOB}$ over $\mathcal{A}^{OB}$ is expected when there are many false null hypotheses in the data, and increasingly so if the signal occurs early in the time sequence; see the numerical experiments in Section 5.2. In addition, note that the critical value $\alpha_T^{AOB}$ depends on the data $X_1,\ldots,X_{T-1}$ and thus is random. As a result, the adaptive approach requires additional distributional assumptions compared with the online Bonferroni procedure. In Tian and Ramdas (2021), $\mathcal{A}^{AOB}$ is proved to control the FWER under (1) and (3) (actually under the slightly more general condition (4) with $F_t$ equal to the identity). Let us now use this approach in combination with the super-uniformity reward.

Definition 3.2. For any spending sequence $\gamma$, any SUR spending sequence $\gamma'$, and $\lambda\in[0,1)$, the adaptive online Bonferroni procedure with super-uniformity reward, denoted by $\mathcal{A}^{\rho AOB} = \{\alpha_t^{\rho AOB}, t\ge 1\}$, is defined by (18), where $\rho_t$ denotes the super-uniformity reward at time $t$ and $\varepsilon_{T-1}$ is an additional 'adaptive' reward term.

This class of procedures reduces to the class of procedures (15) introduced in the previous section by setting $\lambda = 0$. However, when $\lambda > 0$ the class is different, since the term $\alpha(1-\lambda)\gamma_{T(T)}$, which comes from $\alpha_T^{AOB}$, makes the threshold random. Also, the super-uniformity reward is only collected at times $t\le T-1$ where $p_t\ge\lambda$. The latter is well expected from the motivation of the adaptive approach described above: when $p_t < \lambda$, no testing is performed, so no reward can be obtained from $\rho_t$. Nevertheless, note that the additional term $\varepsilon_{T-1}$ makes it possible to collect some reward at time $T-1$ in the case where $p_{T-1} < \lambda$. Since this term only appears in the critical values of adaptive procedures, we call it the 'adaptive' reward. It is linked to the super-uniformity reward in that no adaptive reward can be obtained if no super-uniformity reward has been collected in the past. The following result shows that this approach is valid from the FWER control perspective.
Theorem 3.2 relies on a more general result (Theorem 3.3 below). Note that, contrary to Theorem 3.1, Theorem 3.2 needs an independence assumption. This was already the case without the super-uniformity reward, since it is due to the adaptive methodology that makes the critical values random. If this independence assumption holds, we show in Section 5.2 that $\mathcal{A}^{\rho AOB}$ can indeed improve on $\mathcal{A}^{\rho OB}$, while it always improves on the procedure $\mathcal{A}^{AOB}$ of Tian and Ramdas (2021) (as guaranteed by the above theorem).

Rewarded version for base FWER controlling procedures
In this section we present a general result stating that any procedure ensuring online FWER control (in a specific way) can be rewarded using super-uniformity while maintaining the FWER control.
Theorem 3.3 is proved in Section A.1. Condition (19) is essentially the same as Condition (20) derived in Tian and Ramdas (2021). It is satisfied by the online Bonferroni procedure ($\mathcal{A}^0 = \mathcal{A}^{OB}$) and the online adaptive Bonferroni procedure ($\mathcal{A}^0 = \mathcal{A}^{AOB}$). While this is obvious for $\mathcal{A}^{OB}$, the case of $\mathcal{A}^{AOB}$ requires carefully checking how the functional $T(\cdot)$ (16) slows down the time, which is done in Lemma A.3. Statement (i) of Theorem 3.3 thus proves the online FWER control for these procedures. Statement (ii) of Theorem 3.3 is our main contribution and reduces to Theorems 3.1 and 3.2 when choosing $\mathcal{A}^0 = \mathcal{A}^{OB}$ and $\mathcal{A}^0 = \mathcal{A}^{AOB}$, respectively. This recovers the rewarded procedures $\mathcal{A}^{\rho OB}$ and $\mathcal{A}^{\rho AOB}$ discussed in the previous sections: compare (20) to (15) (with $\lambda = 0$), and (20) to (18). Nevertheless, other choices for $\mathcal{A}^0$ satisfying (19) are possible. According to our general result, any such choice is compatible with our reward methodology.

imsart-generic ver. 2014/07/30 file: DMR2021_EJS.tex date: January 16, 2023

Online mFDR control
In this section, we aim at finding procedures $\mathcal{A}$ such that $\mathrm{mFDR}(\mathcal{A}, P) \le \alpha$ for some target level $\alpha\in(0,1)$. We follow the same route as for the FWER: we start with an application of the super-uniformity reward to the classical LORD++ procedure (Ramdas et al., 2017; called just LORD hereafter for short), and then turn to adaptive counterparts. Finally, we propose a general result encompassing all these cases. In this section, we follow the notation of Ramdas et al. (2017) for online mFDR control. For any procedure $\mathcal{A} = \{\alpha_t, t\ge 1\}$ and realization of the $p$-value process, let us denote by $R(T) = \sum_{t=1}^{T} \mathbb{1}\{p_t \le \alpha_t\}$ the number of rejections of the procedure up to time $T$, and by $\tau_j = \min\{T \ge 1 : R(T) \ge j\}$ the first time that the procedure makes $j$ rejections, for any $j\ge 1$.
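Both quantities are straightforward to track online; the following sketch (our own illustration, not the paper's code) computes them from a realized $p$-value and threshold sequence.

```python
def rejection_counts_and_times(pvals, alphas):
    # R[t-1]: number of rejections up to time t;
    # tau[j-1]: time of the j-th rejection (first time R reaches j)
    R, tau, count = [], [], 0
    for t, (p, a) in enumerate(zip(pvals, alphas), start=1):
        if p <= a:
            count += 1
            tau.append(t)
        R.append(count)
    return R, tau
```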

Warming up: LORD procedure and a first greedy reward
While a sufficient condition for online FWER control is $\sum_{t\ge 1}\alpha_t \le \alpha$ (see the previous section and in particular (19)), the mFDR control is ensured when $\sum_{t=1}^{T}\alpha_t \le \alpha(1\vee R(T))$ for all $T\ge 1$, as proved in Theorem 2 of Ramdas et al. (2017) (applicable, e.g., under assumptions (2) and (3)). Consequently, for each rejection we earn back wealth $\alpha$ with which we are allowed to increase the $\alpha_t$, typically by starting a new online Bonferroni critical value process. This idea is referred to as $\alpha$-investing in the literature; see Foster and Stine (2008); Aharoni and Rosset (2014); Javanmard and Montanari (2018). It leads to the LORD (Levels based On Recent Discovery) procedure (Javanmard and Montanari, 2018), with the improvement given by Ramdas et al. (2017), defined by (23), where by convention $\gamma_t = 0$ at any time $t\le 0$ and where $\gamma$ is an arbitrary spending sequence.
Note that the test level at time $T$ splits the initial $\alpha$-wealth between the cases $R(T) = 0$ and $R(T) = 1$, because the bound equals $\alpha(1\vee R(T)) = \alpha$ in both cases, so the first rejection does not provide extra room for false discoveries. The resulting additional parameter $W_0\in(0,\alpha)$ balances the initial $\alpha$-wealth between these two cases to maintain the mFDR control. The procedure $\mathcal{A}^{LORD} = \{\alpha_t^{LORD}, t\ge 1\}$ controls the mFDR under (1) and (3), because $\sum_{t=1}^{T}\alpha_t^{LORD} \le \alpha(1\vee R(T))$ (see Section A.2 for a proof). Now, let us consider our more general framework where we have at hand a null bounding family $\mathcal{F} = \{F_t, t\ge 1\}$ satisfying (2). In that case, we can prove that a sufficient condition on the critical values for mFDR control is that, almost surely, $\sum_{t=1}^{T} F_t(\alpha_t) \le \alpha(1\vee R(T))$; see the general condition (30) below. This can be achieved by adding the reward collected at the previous time to the LORD threshold, which leads to the thresholds (24), namely $\alpha_T = \alpha_T^{LORD} + \rho_{T-1}$, where $\rho_{T-1}$ is the super-uniformity reward (8) at time $T-1$ (with the convention $\rho_0 = 0$). Since $\rho_t\ge 0$ for all $t$ by (2), this procedure uniformly dominates the procedure $\mathcal{A}^{LORD}$. Furthermore, depending on the magnitude of the super-uniformity reward, this new procedure is potentially much more powerful.

[Figure caption: the spending sequence is $\gamma_t \propto t^{-1.6}$.]
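For concreteness, in the notation of Ramdas et al. (2017) the LORD++ threshold takes the form $\alpha_T^{LORD} = W_0\gamma_T + (\alpha - W_0)\gamma_{T-\tau_1} + \alpha\sum_{j\ge 2}\gamma_{T-\tau_j}$ (with $\gamma_t = 0$ for $t \le 0$); the sketch below implements this form, to be checked against the paper's display (23).

```python
def lord_threshold(big_t, tau, alpha, w0, gamma):
    """LORD++ threshold at time T, given past rejection times tau.

    gamma(t) must return 0 for t <= 0 (the paper's convention).
    """
    a = w0 * gamma(big_t)
    if tau:                   # the first rejection pays back alpha - W0
        a += (alpha - w0) * gamma(big_t - tau[0])
    for tj in tau[1:]:        # every later rejection pays back alpha
        a += alpha * gamma(big_t - tj)
    return a
```

The greedy rewarded procedure (24) then simply adds the previously collected super-uniformity reward $\rho_{T-1}$ to this value.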

Smoothing out the super-uniformity reward
As discussed for FWER control (see Section 3.2), the preliminary procedure (24) spends immediately at time $T$ all of the super-uniformity reward collected at time $T-1$. However, it is more advantageous to redistribute this reward over the subsequent times $T, T+1, \ldots$, by using a SUR spending sequence $\gamma' = (\gamma'_t)_{t\ge 1}$. This gives rise to the following more general class of online procedures.
Definition 4.1. For a spending sequence $\gamma$ and a SUR spending sequence $\gamma'$, the LORD procedure with super-uniformity reward, denoted by $\mathcal{A}^{\rho LORD} = \{\alpha_t^{\rho LORD}, t\ge 1\}$, is defined by the recursion (25), where $\alpha_T^{LORD}$ is given by (23) and $\rho_t$ denotes the super-uniformity reward at time $t$.
Figure 4 displays the critical values of the LORD procedure, and of its versions rewarded with the greedy SUR spending sequence $\gamma' = (1, 0, \ldots)$ or with the rectangular kernel SUR spending sequence ($h = 10$). First, the reward given by the $\alpha$-investing, which is possible for mFDR control, is visible at each discovery, where all critical value curves 'jump'. Second, the effect of the super-uniformity reward is visible between these jumps, and the kernel sequence is able to better smooth the critical value sequence. As a result, the corresponding procedure is likely to make more discoveries (as is the case on the simulated data presented in Figure 4). The following result (Theorem 4.1) establishes the mFDR control of this new class of rewarded procedures. It is proved in Section A.2, as a corollary of a more general result (Theorem 4.3 below). As shown in the numerical experiments (Section 5.2), the improvement of $\mathcal{A}^{\rho LORD}$ with respect to $\mathcal{A}^{LORD}$ can be substantial.
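The two SUR spending sequences compared in Figure 4 can be written down directly; the rectangular kernel form below is our reading of the description (the paper defines the kernel-based sequence in (9)).

```python
def greedy_sur():
    # gamma' = (1, 0, 0, ...): spend the whole reward at the next step
    return lambda t: 1.0 if t == 1 else 0.0

def rect_kernel_sur(h):
    # redistribute each collected reward uniformly over the next h steps
    return lambda t: 1.0 / h if 1 <= t <= h else 0.0
```

Both sequences sum to 1 over $t \ge 1$, so no collected reward is lost; the kernel version merely spreads it out, which produces the smoother critical value curves seen in the figure.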

Rewarded Adaptive LORD
In this section, we apply the re-indexation trick of the $\gamma$ sequence presented in Section 3.3 to improve the performance of the procedures $\mathcal{A}^{LORD}$ and $\mathcal{A}^{\rho LORD}$. For this, we essentially follow the reasoning used by Ramdas et al. (2018) for deriving the SAFFRON procedure, with a slight modification, as explained below. To start, let us define, for some parameter $\lambda\in[0,1)$, the functionals $T_j(\cdot)$, $j\ge 1$, with $T_0(T) = T(T)$ given by (16) by convention. From an intuitive point of view, $T_j(T)$ is like a 'stopwatch' starting after $\tau_j$ and suspended at each time $t$ for which $p_{t-1} < \lambda$. Hence, having $p_t < \lambda$ makes it possible to delay the natural dissipation of $\alpha$-wealth due to online testing. Then, the SAFFRON procedure (Ramdas et al., 2018) is defined by the threshold (27). This procedure controls the mFDR under (1) and (3), as proved by Ramdas et al. (2018). However, examining the proof in Ramdas et al. (2018), it turns out that the capping with $\lambda$ is not necessary. The capping prevents the critical values from exceeding $\lambda$, thus avoiding the situation where $p_t \ge \lambda$ while $p_t \le \alpha_t$. However, to our knowledge, the latter does not play any role in the mFDR control, and we work with the (uniformly dominating) procedure (28). With the capping (27), an mFDR control is provided in Theorem 1 of Ramdas et al. (2018).
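The 'stopwatch' functionals can be sketched as follows, assuming, consistently with the identity $T_j(T) = 0 \vee (T - \tau_j)$ at $\lambda = 0$ stated below, that $T_j(T)$ counts the times $t$ in $(\tau_j, T]$ whose preceding $p$-value clears $\lambda$; the exact definition is the paper's display.

```python
def stopwatch(big_t, tau_j, pvals, lam):
    # T_j(T): ticks once for each t in (tau_j, T] with p_{t-1} >= lambda;
    # assumes tau_j >= 1 so that the preceding p-value always exists
    return sum(1 for t in range(tau_j + 1, big_t + 1) if pvals[t - 2] >= lam)
```

With $\lambda = 0$ every tick counts, so the stopwatch simply returns $T - \tau_j$, recovering the non-adaptive clock.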
For our version (28), the mFDR control follows as a special case of Theorem 4.2 below with $F_t(u) = u$ for all $t, u$. Also note that $\mathcal{A}^{ALORD}$ reduces to $\mathcal{A}^{LORD}$ (23) when $\lambda = 0$, because $T_j(T) = 0\vee(T-\tau_j)$ in that case. Now, we generalize this method to our present framework.
Note that $\mathcal{A}^{\rho ALORD}$ reduces to $\mathcal{A}^{\rho LORD}$ (25) when $\lambda = 0$, and to $\mathcal{A}^{ALORD}$ when $F_t(u) = u$ for all $u, t$. The following result shows that this class of procedures controls the mFDR.
Theorem 4.2. Consider the setting of Section 2.1 where a null bounding family $\mathcal{F} = \{F_t, t\ge 1\}$ satisfying (2) is at hand. For any spending sequence $\gamma$ and any SUR spending sequence $\gamma'$, consider the adaptive LORD procedure $\mathcal{A}^{ALORD} = \{\alpha_t^{ALORD}, t\ge 1\}$ (28), and the adaptive LORD procedure with super-uniformity rewards $\mathcal{A}^{\rho ALORD} = \{\alpha_t^{\rho ALORD}, t\ge 1\}$ (29). Then, assuming that the model $\mathcal{P}$ is such that (3) holds, we have $\mathrm{mFDR}(\mathcal{A}^{\rho ALORD}, P) \le \alpha$ for all $P\in\mathcal{P}$, while $\mathcal{A}^{\rho ALORD}$ uniformly dominates $\mathcal{A}^{ALORD}$ and thus also the SAFFRON procedure of Ramdas et al. (2018).
Theorem 4.2 follows from Theorem 4.3 below. Let us underline that $\mathcal{A}^{\rho ALORD}$ incorporates both $\alpha$-investing and the super-uniformity reward. Thus, it is expected to be the most powerful among the procedures considered in the present paper. This is supported both by the numerical experiments of Section 5.2 and the real data analysis in Section 5.3.

Remark 4.2. Note that the critical values of ALORD and $\rho$-ALORD can exceed 1 (e.g., when all $p$-values are zero). Since the rejection decision is the same for a critical value larger than 1 or equal to 1, this may appear at first sight as wasted wealth. While this is indeed the case for ALORD, we emphasize that it is not for $\rho$-ALORD, because the super-uniformity reward makes it possible to reuse the exceeding amount of wealth engaged in $\alpha_t^{\rho ALORD}$; namely, $\rho_t = \alpha_t^{\rho ALORD} - 1$ when $\alpha_t^{\rho ALORD} \ge 1$.

Rewarded version for base mFDR controlling procedures
The following result establishes that any base online mFDR controlling procedure (of a specific type) can be rewarded with super-uniformity.
Theorem 4.3 is proved in Section A.2. Condition (30) is essentially the same as the condition found in Theorem 1 of Ramdas et al. (2018). Our main contribution is thus statement (ii), showing that the super-uniformity reward can be used with any base procedure $\mathcal{A}^0$ satisfying (30). Since the latter condition holds for the LORD procedure ($\mathcal{A}^0 = \mathcal{A}^{LORD}$) and the adaptive LORD procedure ($\mathcal{A}^0 = \mathcal{A}^{ALORD}$, see Lemma A.3), Theorem 4.3 entails Theorem 4.1 and Theorem 4.2, respectively. Finally, let us emphasize the similarity between Theorem 3.3 (FWER) and Theorem 4.3 (mFDR). Strikingly, the reward takes exactly the same form (20), which makes the range of improvement comparable for these two criteria.

SUR procedures for discrete tests
In this section, we study the performance of our newly derived SUR procedures in discrete online multiple testing problems, for simulated and real data. We defer some of the numerical results to Appendix D.

Simulation setting
We simulate $m$ experiments in which the goal is to detect differences between two groups by counting the number of successes/failures in each group. More specifically, we follow Gilbert (2005b), Heller and Gur (2011) and Döhler et al. (2018) by simulating a two-sample problem in which a vector of $m$ independent binary responses is observed for $N$ subjects in both groups. The goal is to test the $m$ null hypotheses $H_{0i}$: '$p_{1i} = p_{2i}$', $i = 1, \ldots, m$, in an online fashion, where $p_{1i}$ and $p_{2i}$ are the success probabilities for the $i$-th binary response in groups A and B, respectively. Thus, for each hypothesis $i$, the data can be summarized by a $2\times 2$ contingency table, and we use (two-sided) Fisher's exact test for testing $H_{0i}$. The $m$ hypotheses are split into three groups of sizes $m_1$, $m_2$, and $m_3$ such that $m = m_1 + m_2 + m_3$. Then, the binary responses are generated as i.i.d. Bernoulli with probability $0.01$ ($\mathcal{B}(0.01)$) at $m_1$ positions for both groups, i.i.d. $\mathcal{B}(0.10)$ at $m_2$ positions for both groups, and i.i.d. $\mathcal{B}(0.10)$ at $m_3$ positions for one group and i.i.d. $\mathcal{B}(p_3)$ at those $m_3$ positions for the other group. Thus, the null hypotheses are true for $m_1 + m_2$ positions (set $\mathcal{H}_0$), while they are false for $m_3$ positions (set $\mathcal{H}_1$). Therefore, we interpret $p_3$ as the strength of the signal, while $\pi_A = m_3/m$ corresponds to the proportion of signal. Also, $m_1$ and $m_2$ are both taken equal to $(m - m_3)/2$. In these experiments, we fix $m = 500$ and vary each one of the parameters $\mathcal{H}_1$ (Section 5.2.2), $\pi_A$ (Section 5.2.3), $N$ (Section D.1), and $p_3$ (Section D.2) while keeping the others fixed. The default values are $\pi_A = 0.3$, $N = 25$, $p_3 = 0.4$, and $\mathcal{H}_1 \subset \{1, \ldots, m\}$ chosen randomly for each simulation run. We estimate the different criteria (FWER (5), mFDR (6), power (7)) using the empirical mean over 10 000 independent simulation trials.
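This setting is easy to reproduce; the sketch below (our own illustration of the design, using only the standard library) generates one $p$-value stream, placing the signal positions contiguously for simplicity whereas the paper randomizes them by default.

```python
import math
import random

def fisher_two_sided(a, b, c, d):
    # two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    # sum of hypergeometric probabilities no larger than the observed one
    n, r1, c1 = a + b + c + d, a + b, a + c
    def prob(x):
        return math.comb(c1, x) * math.comb(n - c1, r1 - x) / math.comb(n, r1)
    p_obs = prob(a)
    lo, hi = max(0, r1 + c1 - n), min(r1, c1)
    return min(1.0, sum(prob(x) for x in range(lo, hi + 1)
                        if prob(x) <= p_obs * (1 + 1e-9)))

def simulate_stream(m=500, m3=150, n_subj=25, p3=0.4, seed=0):
    # groups of sizes m1 = m2 = (m - m3) / 2 as in the text; signal
    # positions are contiguous here (the paper randomizes them)
    rng = random.Random(seed)
    m1 = m2 = (m - m3) // 2
    pA = [0.01] * m1 + [0.10] * m2 + [0.10] * m3
    pB = [0.01] * m1 + [0.10] * m2 + [p3] * m3
    draw = lambda p: sum(rng.random() < p for _ in range(n_subj))
    pvals = []
    for qa, qb in zip(pA, pB):
        sa, sb = draw(qa), draw(qb)
        pvals.append(fisher_two_sided(sa, n_subj - sa, sb, n_subj - sb))
    return pvals
```

The heavy discreteness of these Fisher $p$-values (large atoms, many values equal to 1) is precisely what makes the super-uniformity reward substantial in this setting.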

Position of signal
We start by studying how the position of the signal can affect the performance of the procedures (it is well known to be critical; see Foster and Stine, 2008; Ramdas et al., 2017). We investigate different positioning schemes in which the signal can be clustered at the beginning of the stream, at the end, or between the two, as described in the caption of Figure 5. Consistently with our theoretical results, Figure 5 shows that all procedures control the type I error rate at level $\alpha = 0.2$. In terms of power, we can see that the rewarded procedures have greater power than the associated base procedures. More specifically, $\mathcal{A}^{\rho ALORD}$ uniformly dominates the other procedures for mFDR control, and $\mathcal{A}^{\rho AOB}$ for FWER control. The gain in power is most noticeable when the signal is not localized at the beginning of the stream (i.e., positions ME, E, and Random), for which the online testing problem is more difficult. These first results indicate that the rewarded procedures may protect against '$\alpha$-death'.

Proportion of signal
Figure 6 displays the results for $\pi_A$ varying in $\{0.1, \ldots, 1\}$. It shows that the aforementioned superiority of the rewarded procedures holds over this whole range. Also note that the SUR reward can affect the monotonicity of the power curves: while most curves increase with $\pi_A$, the power of the rewarded procedure $\mathcal{A}^{\rho OB}$ decreases. An explanation could be that when $\pi_A$ increases, the marginal counts increase, and thus the degree of discreteness decreases, providing a smaller super-uniformity reward. However, using adaptivity seems to compensate for this effect, thus providing better results.
Finally, let us mention that the additional numerical results in Section D provide qualitatively similar conclusions for all other explored parameter configurations: the SUR procedures $\mathcal{A}^{\rho AOB}$ and $\mathcal{A}^{\rho ALORD}$ always improve on, often substantially, the existing OMT procedures.

Application to IMPC data
In this section we analyse data from the International Mouse Phenotyping Consortium (IMPC), which coordinates studies on the influence of the genotype on the mouse phenotype. More precisely, scientists test the hypotheses that the knock-out of certain genes will not change certain phenotypic traits (e.g., the coat or eye color). Since the data set is constantly evolving as new genes are studied for new phenotypic traits of interest, online multiple testing is a natural approach for analysing such data; see also Tian and Ramdas (2021); Xu and Ramdas (2021). We use the data set provided by Karp et al. (2017) which includes, for each studied gene, the counts of normal and abnormal phenotypes for female and male mice (separately), thus providing two-by-two contingency tables, which can be analysed using Fisher exact tests. In this section, we investigate the genotype effect on the phenotype separately for male and female mice. The data set originally contains nearly 270 000 gene studies, but we focus on the first 30 000 genes for simplicity. We set the global level $\alpha$ to 0.2 and 0.05 for the FWER and mFDR procedures, respectively. For the procedure parameters, we follow the choices made in Section 5.1. Table 3 presents the number of discoveries for the FWER controlling procedures OB, AOB, ρOB, ρAOB (left) and for the mFDR controlling procedures LORD, ALORD, ρLORD, ρALORD (right). The results show that ignoring the discreteness of the tests causes the scientist to miss (potentially many) discoveries; using the SUR methods helps to reduce this risk. Figure 7 (FWER procedures) and Figure 8 (mFDR procedures) illustrate in more detail how the super-uniformity reward leads to more discoveries, in the case of male mice (similar findings hold for the female mice, for which the corresponding figures can be found in Section E.2). First, note that the smallest $p$-values occur at the beginning of the stream (see Figure 17 in Section E.1), so that we limit the visual analysis to the first 1500 $p$-values for clarity of exposition. For the ρOB procedure, the benefit of incorporating the super-uniformity reward is visible in the left panel of Figure 7. As expected from Figure 3, applying a rectangular kernel to these rewards yields a smooth curve. For the ρAOB procedure, presented in the right panel of Figure 7, the improvement is even stronger, but the resulting critical value curve is less smooth. This is due to the 'adaptive' reward, that is, the $\varepsilon_{T-1}$-component of our improvement, recall (18). More precisely, an explanation of this 'saw-tooth' shape is that, during a period with $p$-values smaller than $\lambda$, the adaptive reward $\varepsilon_{T-1}$ increases. Also, if this period lasts for a while (as for $500 \lesssim t \lesssim 1240$ here), the $\rho$-part of the reward vanishes and we end up with a constant gain $\varepsilon_{T-1}$, explaining the flat part of the curve, until the next $p_T \ge \lambda$ occurs. After this point, we switch from the $\varepsilon$-regime back to the $\rho$-regime, i.e., $\alpha_{T+1}^{\rho AOB} = \alpha_{T+1}^{AOB} + \gamma'_1 \rho_T$. Since typically $\gamma'_1 \rho_T \ll \alpha_{T-1}^{\rho AOB} - \alpha_{T-1}^{AOB}$, this causes the downward jump in the green curve. For the mFDR procedures presented in Figure 8, there is an additional 'rejection' reward, as described in Section 4. Note that this makes some critical values exceed 1 (both for ALORD and ρALORD), which thus cannot be displayed on the $Y$-axis scale considered in that figure. However, these values are still used in the ρALORD algorithm to compute the future critical values (see Remark 4.2). The obtained results are qualitatively similar to the FWER setting: our proposed reward makes the green curves run above the orange ones, uniformly over the considered time, hence inducing significantly more discoveries.

SUR procedures for weighted p-values
In this section, we show how our SUR approach can be easily used to construct valid online p-value weighting procedures.

Setting and benchmark procedure
Consider a standard continuous online multiple testing setting where each $p$-value is super-uniformly distributed under the null, that is, (1) holds. Assume in addition that, at each time $t$, the $p$-value $p_t$ is associated with a quantity $r_t \ge 0$, called the raw weight (as opposed to the rescaled weight defined further on), which is assumed to be measurable w.r.t. $\mathcal{F}_{t-1}$. The magnitude of $r_t$ is interpreted as the level of belief in a potential true discovery at time $t$: a large weight indicates a strong belief that the corresponding null hypothesis is false. Throughout the section, the weights $r_t$ are assumed to be available a priori and we do not discuss how to derive them (for this task, we refer to Wasserman and Roeder (2006); Rubin et al. (2006); Roquain and van de Wiel (2009); Hu et al. (2010); Zhao and Zhang (2014); Ignatiadis et al. (2016); Chen and Kasiviswanathan (2020b), among others).
While $p$-value weighting is a classical tool for improving the performance of multiple testing methods in the offline setting (see references in Section 1.3), the incorporation of weights has received little attention in the online case. The only relevant work to our knowledge is Ramdas et al. (2017) (Section 5 therein), which presents sufficient criteria for weighting procedures controlling the (m)FDR based on so-called GAI++ procedures, and also discusses the technical challenges associated with weighted online multiple testing. An explicit algorithm satisfying these criteria is used in Ramdas et al. (2017), and is detailed in Appendix C.2 for completeness. This method, which will be our benchmark procedure, works by weighting the $p$-values and adjusting for this weighting in the rejection reward.

New weighting approach
The main idea of our new approach is as follows: consider weighted $p$-values $\tilde p_t = p_t / w_t$ for some rescaled weight $w_t \in [0,1]$, which gives rise to the null bounding family $\mathcal{F} = \{F_t : u \in [0,1] \mapsto u w_t, t \ge 1\}$. Since the weights are constrained to take their values in $[0,1]$, the functions of $\mathcal{F}$ are super-uniform, that is, (2) holds. Hence, one can apply our SUR approach with respect to that family $\mathcal{F}$.
More specifically, our approach takes the null bounding family $\mathcal{F}$ into account in a simple two-step process: for each time $t$,
1. enforce super-uniformity by computing the rescaled weight $w_t = \xi_t(r_t \mid r_1, \ldots, r_{t-1})$, $t \ge 1$, for some given rescaling function $\xi_t$ valued in $[0,1]$ (see below for more details and an explicit choice);
2. apply any one of the SUR methods from Section 3 or Section 4, depending on whether FWER or mFDR control is desired.
We denote these new procedures by wX, where X stands for the name of the base procedure (either OB (10), AOB (17), LORD (23) or ALORD (28)). These procedures all come with the corresponding FWER or mFDR control (by additionally assuming (3) if needed). In particular, to the best of our knowledge, this also provides the first method for weighted online FWER control. At first sight, these SUR weighting approaches may seem ineffective, due to the conservatism induced by the rescaling step. However, this is countered in the second step by using SUR procedures that provide larger values $\alpha_t$, thanks to the super-uniformity rewards accumulated in the past. The hope is that these two effects balance out in such a way as to favor rejection of the hypotheses associated with larger (raw) weights.
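The two-step scheme can be sketched as follows; the rescaling map used here (divide by the running maximum of the raw weights) is purely illustrative and is NOT the paper's explicit choice of $\xi_t$, which is given in the text referenced above.

```python
def rescale_weight(r_t, past_raw):
    # illustrative xi_t (an assumption, not the paper's choice): divide
    # by the largest raw weight seen so far, forcing w_t into [0, 1]
    biggest = max([r_t] + list(past_raw))
    return r_t / biggest if biggest > 0 else 0.0

def weighted_null_bound(w_t):
    # F_t(u) = u * w_t, super-uniform since w_t lies in [0, 1]
    return lambda u: u * w_t
```

The functions returned by `weighted_null_bound` can then be fed, as the null bounding family, into any SUR procedure from Sections 3 and 4.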

Analysis of RNA-Seq data
We revisit an analysis of the RNA-Seq data set 'airway' using results from the Independent Hypothesis Weighting (IHW) approach (for details, see Ignatiadis et al. (2016) and the vignette accompanying its software implementation). While the original data was not collected in an online fashion, we nevertheless use it here to provide a proof of concept for weighted SUR procedures. The 'airway' data set contains data from 64 102 genes, and the corresponding (offline) weights are taken from the output of the ihw function from the Bioconductor package 'IHW'. These raw weights are then transformed into rescaled weights by using the function $\xi_t$ described in the previous section. For the procedure parameters, we use the same choices as for the analysis of the IMPC data, see Section 5.3.
Table 4 (left part) presents the results for the FWER controlling procedures OB, AOB (non-weighted), and wOB, wAOB (SUR weighted approaches). It is clear that incorporating the weights leads to more rejections, which corroborates the fact that the weights coming from Ignatiadis et al. (2016) are indeed informative.
As for mFDR control, the (non-weighted) LORD is compared to our weighted version wLORD in Table 4 (right part). As additional competitors, we also include the weighted GAI++ procedure proposed in Ramdas et al. (2017) (see Section C.2 for a detailed description), which we use either with the raw weights (denoted by wGAI$_1$) or with the rescaled weights (denoted by wGAI$_2$). As one can see, the effect of rescaling the weights is highly beneficial, and the new wLORD proposal incorporates these weights in the most efficient way.

Conclusion
Existing OMT procedures often suffer from a lack of power due to conservativeness of the $p$-values. This typically occurs for discrete test statistics, which is a common situation in data sets where testing is based upon counts. To fill this gap, we introduced new SUR versions of some classical existing procedures, which 'reward' the base procedures by spending the $\alpha$-wealth more efficiently, according to known bounds on the null cumulative distribution functions. We showed that our new SUR procedures provide rigorous control of online error criteria (FWER or mFDR) under classical assumptions, while offering a systematic power enhancement. When using discrete Fisher exact test statistics, the improvement is substantial, both for simulated and real data.
In addition, even in the standard case of uniformly distributed $p$-values, our approach allowed us to derive new weighted procedures that incorporate external covariates. This provides improvements over existing online weighting strategies.

Another viewpoint
In the discrete setting, let us consider the following constrained spending problem: at each step $t$, choose the critical value $\alpha_t$ to lie in the support $S_t$ (including 0) so that the constraint (31) holds. This solves the super-uniformity problem, because $F_t(\alpha_t) = \alpha_t$ for all $t$, while controlling the online FWER. This general principle, which we refer to as 'constrained spending strategies', can be implemented in many ways. Markedly, the SUR approach is a way to achieve this, by additionally following some reference critical values, here the online Bonferroni critical values $\alpha_t^{OB}$ (10). Indeed, the rejection decisions $p_t \le \alpha_t^{\rho OB}$ and $p_t \le \tilde\alpha_t = F_t(\alpha_t^{\rho OB})$ are almost surely identical, and we have calibrated $\alpha_t^{\rho OB}$ such that (31) holds, see (13). In other words, even if our critical values are not constrained to be in the support initially, the effective critical values $\tilde\alpha_t = F_t(\alpha_t^{\rho OB})$ that are actually used in the decision rule automatically belong to the support. Thus, our approach can be equivalently seen as a way of implementing the constrained spending strategies delineated above.
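For an exact discrete test, the null CDF satisfies $F_t(s) = s$ at every attainable $p$-value $s$ and is flat in between; under that assumption, the effective critical value $F_t(\alpha_t^{\rho OB})$ is simply the largest support point not exceeding $\alpha_t^{\rho OB}$.

```python
import bisect

def effective_level(support, alpha_t):
    # support: sorted attainable p-values of the discrete test;
    # returns F_t(alpha_t) = largest support point <= alpha_t (0 if
    # none), assuming an exact test with F_t(s) = s on the support
    i = bisect.bisect_right(support, alpha_t)
    return support[i - 1] if i > 0 else 0.0
```

Rejecting when $p_t \le$ `effective_level(support, alpha_t)` then gives the same decisions as rejecting when $p_t \le \alpha_t$, which is the equivalence used in the text above.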
Obviously, there are other ways to implement the constrained spending strategy. One instance is the delayed spending (DS) approach, which we describe in detail in Appendix B.

Future directions
While our results address several issues, they also raise new questions. First, the bandwidth of the kernel-based SUR spending sequence $\gamma'$ given by (9) has been chosen in a loose way here, but tuning the bandwidth is certainly interesting from a power enhancement perspective (see Section D.5). Also, in applications, the user may wish to select the bandwidth in a data-dependent fashion without losing control over the type I error rate. These two issues are interesting extensions for future developments. Second, while our work focuses on the marginal FDR, it would be desirable to build rewarded OMT procedures that control the (non-marginal) FDR. However, the usual proofs rely on a monotonicity property of the critical value sequence (Ramdas et al., 2017) that is difficult to satisfy here, because the super-uniformity reward naturally varies over time. Hence, deriving rewarded FDR controlling procedures is a challenging issue that is left for future investigations. Third, most of our results rely on an independence assumption, see (3). While this can be considered a mild restriction in an online framework, relaxing it or incorporating a known dependence structure in OMT is an interesting avenue.

[End of the proof of Theorem 3.3:] ... $\alpha_t$. This is done by reducing it to a statement on $\alpha_t^0$ via Lemma A.2. More precisely, with $a_T = \sum_{t=1}^{T} \gamma_t$, we have the displayed bound, where the equality above holds provided that the stated recursion holds for all $T \ge 1$. This is true by Lemma A.2 because of the expression (20) of $\alpha_t$. This concludes the proof.
Proof of Theorems 3.1 and 3.2. Theorems 3.1 and 3.2 are corollaries of Theorem 3.3, obtained by considering $\mathcal{A}^0 = \mathcal{A}^{OB}$ (with $\lambda = 0$) and $\mathcal{A}^0 = \mathcal{A}^{AOB}$, respectively. Indeed, checking (19) is straightforward for $\mathcal{A}^{OB}$ from the spending sequence definition, and follows from Lemma A.3 for $\mathcal{A}^{AOB}$.

A.2. Proofs for online mFDR control
The global proof strategy is similar to the one used for FWER: we start by proving Theorem 4.3 and then deduce Theorem 4.1 and Theorem 4.2.
Proof of Theorems 4.1 and 4.2. Theorem 4.1 and Theorem 4.2 can be derived from Theorem 4.3 with $\mathcal{A}^0 = \mathcal{A}^{LORD}$ (using $\lambda = 0$) and $\mathcal{A}^0 = \mathcal{A}^{ALORD}$, respectively, by checking (30) in both cases. First, for $\mathcal{A}^{LORD}$, the condition holds because $\tau_j \le T-1$ is equivalent to $R(T-1) \ge j$ by definition. Second, for $\mathcal{A}^{ALORD}$, we proceed similarly with the help of Lemma A.3, starting from the definition (28). Finally, by using (37) and (38), the latter quantity equals the desired bound, because $T \ge \tau_j + 1$ if and only if $R(T-1) \ge j$.

A.3. Auxiliary lemmas
The following lemma provides a tool for controlling both online FWER and mFDR.
Lemma A.1. For any procedure $\mathcal{A} = (\alpha_t, t \ge 1)$ and any $\lambda \in [0,1)$, the stated bound holds.

Proof. Recall that $\alpha_t$ is either deterministic or $\mathcal{F}_{t-1}$-measurable (in which case it is independent of $p_t(X)$ under (3)). Therefore, under the conditions of the lemma, we have in any case, for all $t \in \mathcal{H}_0$, both stated inequalities, which together entail the result.

The following representation lemma is the key tool for building the new rewarded critical values.
Lemma A.2. Let $(\alpha_t^0, t \ge 1)$ be any nonnegative sequence. Let $(\alpha_t, t \ge 1)$ be the sequence defined by the first stated recursive relation, where $a_T = \sum_{t=1}^{T} \gamma_t$, $T \ge 1$, for any real values $\gamma_t$, $p_t$, $\lambda$ and functions $F_t$. Let $(\bar\alpha_t, t \ge 1)$ be the sequence defined by the second stated recursive relation. Then we have $\alpha_t = \bar\alpha_t$ for all $t \ge 1$. Moreover, $\bar\alpha_t \ge \bar\alpha_t^0$ for all $t \ge 1$ under (2). In particular, these critical values are nonnegative.
We now establish a result for the functionals $T(\cdot)$ and $T_j(\cdot)$, $j \ge 1$, which are used by the adaptive procedures $\mathcal{A}^{AOB}$ and $\mathcal{A}^{ALORD}$, respectively.

Table 5
Number of discoveries for the SUR online Bonferroni procedure (15) (bandwidth $h = 10$) and the DS approach (39). Here $C(30\,000) = 5083$, as defined in (39). These numbers are obtained by running the procedures on the first 30 000 genes for male (second row) and female (third row) mice in the IMPC data. They suggest that the DS method could be more efficient at the very start of the stream but may suffer from conservativeness afterwards.

Procedures
To assess the behaviour of the procedures in a practical setting, we reanalyse the IMPC data from Section 5.3 using the DS procedure defined by (39) and (40) and compare it with the OB and ρOB procedures from Section 3. The results for FWER control at level α = 0.2 are displayed in Table 5 and Figure 9. As Figure 9 (right panel) shows, the rejection process {R(T), T ≥ 1} is almost identical at the very start. However, for larger T, the delayed approach makes fewer discoveries than the ρOB procedure, uniformly in time for this data set. This conservative behaviour is probably caused by the under-utilization of wealth described in Section B.3. More specifically, the non-utilized component of α = 0.2 accumulates up to time T = 1500 to approximately 0.077, so that approximately 38.5% of the budget α = 0.2 is effectively neglected. Accordingly, the wealth plot displayed in Figure 9 shows that the delayed approach manages to spend more wealth than the OB procedure, but still deviates strongly from the nominal wealth curve. Figure 10 displays the corresponding critical values.
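The 38.5% figure quoted above is simply the ratio of the accumulated alpha-gaps to the nominal level; a quick sanity check using the numbers reported in this section:

```python
alpha = 0.2     # nominal FWER level
unused = 0.077  # accumulated non-utilized wealth up to T = 1500 (reported above)
print(f"{unused / alpha:.1%}")  # prints "38.5%"
```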

B.3. Formal properties
From the definition of the DS approach, we obtain the following comparison to OB and ρOB:
• The DS approach improves OB uniformly when γ_t is nonincreasing: indeed, C(t) ≤ t, so that α^DS_t = αγ_{C(t)} ≥ αγ_t = α^OB_t.
• The DS approach does not depend on any additional tuning parameter such as the bandwidth. By contrast, choosing this parameter badly in the ρOB procedure may adversely affect its performance.
• The DS approach is another way of using the super-uniformity reward. For instance, if there is no super-uniformity reward, that is, F_t(αγ_t) = αγ_t for all t, then b_t = t and the DS procedure reduces to OB.
In addition, we have the following observations:
• Delayed start: If F_t(x) = 0 for all x < 1 and t ≤ T_0, and F_t(x) = x for t ≥ T_0 + 1, the DS procedure behaves in the most intuitive way: it yields b_1 = T_0 + 1 by (40) and α^DS_t = αγ_{t−T_0} for t ≥ T_0 + 1, which is the most natural way to proceed (just start the testing process at time T_0 + 1). By contrast, ρOB (with a rectangular kernel of bandwidth r) collects some reward in α^ρOB_t for 1 ≤ t ≤ T_0, spends the reward in the following r time points, but continues with α^ρOB_t = αγ_t for t ≥ T_0 + r + 1. Hence, delaying spends the super-uniformity reward more intuitively than ρOB in that situation. More generally, in practice, we may therefore expect DS to be more efficient at the beginning of the stream.
• Long/infinite delay: Conversely, if there exists T_0 ≥ 1 such that F_t(αγ_{C(T_0)+1}) = 0 for all t ≥ b_{T_0} + 1, then b_{T_0+1} = +∞ from (40), which in turn implies C(t) ≤ T_0 + 1. But for t ≥ b_{T_0} + 1, we have C(t) ≥ T_0 + 1 by (39). Hence, for t ≥ b_{T_0} + 1, C(t) = T_0 + 1 and the 'spending clock' freezes. On the one hand, α^DS_t = αγ_{T_0+1}, so the delaying works perfectly to improve the OB critical values. On the other hand, this effectively stops the spending of any further budget, and thus a large part of the wealth is left unspent. This is in contrast to the SUR approach, which uses a reward of an additive nature and thus always has a chance to spend the budget.
• Under-utilization of wealth: The DS method processes each sub-budget αγ_j one at a time, until the transition to the next sub-budget αγ_{j+1} is made. In most cases, however, the inequality (40) defining the transition time b_j will be strict, meaning that when we move on to the next sub-budget we will have used ∑_{t=b_{j−1}+1}^{b_j} F_t(αγ_j) < αγ_j. Thus, this method does not exhaust the available sub-budgets. Moreover, since it neglects these 'alpha-gaps', they accumulate over time, and this under-utilized wealth leads to unnecessary conservatism. Removing such gaps was precisely the primary motivation for introducing our SUR method, see Section 2.3. The most disadvantageous scenario occurs when b_t = t for all t ≤ T, so that the DS procedure reduces to the original OB procedure up to time T. As an example, consider ε ∈ (0, αγ_T) for some large T ≥ 1 and assume that the support of each p_t is given by S_t = {ε} ∪ A_t ∪ {αγ_{t−1}, 1} (with the convention αγ_0 = 1), where A_t is a finite subset of (αγ_t, αγ_{t−1}). Then F_1(αγ_1) + F_2(αγ_1) = αγ_1 + ε, hence b_1 = 1, and more generally F_t(αγ_t) + F_{t+1}(αγ_t) = αγ_t + ε for all t ≤ T, which implies b_t = t for all t ≤ T. However, we know that OB does not allow us to spend all the budget in such a discrete situation, see Figure 1. A potential remedy for the conservatism of the DS method could be to combine it with our SUR method. We describe such a hybrid approach in more detail in Section B.4.
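To fix ideas, the DS spending clock described in the bullets above can be sketched as follows. The function name and loop structure are illustrative assumptions; the transition rule follows the description of (39)-(40): the sub-budget αγ_j is used until its cumulative spend ∑_t F_t(αγ_j) would exceed αγ_j.

```python
def ds_critical_values(alpha, gamma, F, T):
    """Delayed-spending (DS) critical values (illustrative sketch of (39)-(40)).

    The sub-budget alpha*gamma_j is used until its cumulative spend
    sum_t F_t(alpha*gamma_j) would exceed alpha*gamma_j; then the
    spending clock C(t) advances to the next sub-budget.
    """
    crit = []
    j = 0        # current sub-budget index (0-based)
    spent = 0.0  # cumulative F_t within the current sub-budget
    for t in range(T):
        budget = alpha * gamma[j]
        if spent + F[t](budget) > budget:  # transition time b_j reached
            j += 1
            spent = 0.0
            budget = alpha * gamma[j]
            # since F_t(x) <= x, the fresh sub-budget cannot be
            # exceeded immediately (spent is 0 here)
        spent += F[t](budget)
        crit.append(budget)
    return crit
```

With F_t(x) = x (no super-uniformity reward), every step triggers a transition and the sketch reduces to the OB critical values αγ_t, matching the observation above; with a delayed start (F_t ≡ 0 for t ≤ T_0), the first sub-budget is held until time T_0 + 1.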
In summary, the delaying method is particularly appealing in terms of simplicity and elegance, while the primary aim of the SUR approach is efficiency.

B.4. Hybrid approach
In this section, we describe a hybrid approach combining the ideas underlying DS and SUR, in order to improve the wealth utilization of DS.
To compare the performance of the hybrid approach with the SUR and DS approaches, we use the simulation setting from Section 5.2 in the case where the signal is positioned at the beginning of the stream for each simulation run, which is the most favorable position of the signal for any procedure (see Section 5.2.2 for more details). We consider both procedures based on the uniform kernel (bandwidth h = 100) and those based on the greedy spending sequence (denoted by 'greedy').
Figure 11 shows that taking super-uniformity into account is always beneficial, regardless of the specific approach used. The base DS method performs similarly to the greedy ρOB and the greedy hybrid. In contrast, the hybrid approach based on a uniform kernel improves DS, with performance close to ρOB. Hence, we conclude that closing the alpha-gaps by smoothing with an adequate kernel can make the hybrid approach as powerful as the smoothed ρOB method. However, given the added complexity of the hybrid approach, we prefer to stick with the smoothed SUR.

For the case of weighted p-values, the following type of rule has been proposed:

R_t = 1{p_t ≤ w_t α_t},
W(t) = W(t − 1) − φ_t + R_t ψ_t,
φ_t ∈ [0, W(t − 1)],
ψ_t ≤ b_t + min(φ_t, φ_t/(w_t α_t) − 1).

Note that the latter constraints are similar to the constraints given in Section C.1 for F_t(x) = (w_t x) ∧ 1 (up to the '∧1', which makes the constraints here slightly more stringent), so that this weighting case is a particular SUR-GAI++ procedure.
For given raw weights r_t ≥ 0 (F_{t−1}-measurable), an explicit procedure, used in Ramdas et al. (2017), is obtained by choosing α_t, w_t, φ_t, ψ_t accordingly. This choice is valid because α_t ≤ W(t − 1) for all t. Indeed, W(t − 1) = W_0 + ∑_{i=1}^{t−1} (−α_i + R_i ψ_i), so that α_t ≤ W(t − 1) if and only if ∑_{i=1}^{t} α_i ≤ W_0 + ∑_{i=1}^{t−1} R_i ψ_i, which is true.
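A minimal sketch of the wealth bookkeeping for the weighted rule above. The function and its arguments are illustrative assumptions (in particular, we take φ_t = α_t, as in the explicit choice discussed here, and a constant earning bound b), while the update W(t) = W(t − 1) − φ_t + R_t ψ_t and the constraint ψ_t ≤ b_t + min(φ_t, φ_t/(w_t α_t) − 1) are as stated:

```python
def weighted_gai_wealth(p_values, weights, alphas, psis, b, W0):
    """Wealth bookkeeping for a weighted GAI++-style rule (illustrative sketch).

    Rejects when p_t <= w_t * alpha_t; pays phi_t = alpha_t from the
    wealth and earns psi_t back on each rejection. Raises if a proposed
    alpha_t exceeds the available wealth (the validity constraint
    alpha_t <= W(t-1)).
    """
    W = W0
    rejections = []
    for t, (p, w, a, psi) in enumerate(zip(p_values, weights, alphas, psis)):
        if a > W:
            raise ValueError(f"alpha_{t} = {a} exceeds wealth {W}")
        # earning constraint: psi_t <= b_t + min(phi_t, phi_t/(w_t alpha_t) - 1)
        assert psi <= b + min(a, a / (w * a) - 1)
        R = p <= w * a                 # rejection indicator R_t
        W = W - a + (psi if R else 0.0)  # W(t) = W(t-1) - phi_t + R_t psi_t
        rejections.append(R)
    return rejections, W
```

The check α_t ≤ W(t − 1) in the loop is exactly the validity condition verified in the text.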

C.3. Our ρ-LORD is a SUR-GAI++ rule
We claim here that the procedure ρ-LORD corresponds to a SUR-GAI++ rule with the choice φ_t = F_t(α_t), ψ_t = b_t, and α_t as defined in (41). To establish this, we check that all constraints given in Section C.1 are satisfied. The only non-trivial one is φ_t = F_t(α_t) ≤ W(t − 1), which we now prove. Recall that W(t) = W(t − 1) − φ_t + R_t b_t and W(0) = W_0. Hence α_1 = W_0 γ_1 ≤ W_0. Moreover, for t ≥ 2, we obtain ᾱ_t ≤ W(t − 1) for the critical value ᾱ_t defined via a_t = ∑_{i=1}^{t} γ_i. But now, ᾱ_t = α_t for all t, where α_t is defined by (41). Indeed, this can be seen from Lemma A.2, applied with λ = 0 and α^0_T being the LORD critical values.

D.2. Signal strength
Here, we vary the strength of the signal p_3 in the set {0.1, 0.2, . . ., 1}. We see that the SUR procedures dominate their base counterparts, as expected. In addition, depending on the signal strength, the gain in power can be considerable. Also note that, perhaps surprisingly, all curves exhibit a decrease in power for p_3 near 1. Since this happens even for the original OB procedure, it is not due to the super-uniformity reward, but could perhaps be caused by the behavior of the power function of multiple Fisher exact tests taken at different levels.

D.3. Local alternatives
As Figure 12 demonstrates, for a fixed value of the signal strength p_3, the detection problem becomes easier as N increases, so that all procedures attain a power of 1. In this section we are interested in a more refined analysis of the various power curves when N is large. To this end, we introduce local alternatives, i.e. we now model p_3 as a function of the sample size N. To be more specific, we take N ∈ {5, 10, . . ., 30} × 1000 and set p_3 = p_1 + 1/√N for mFDR procedures and p_3 = p_1 + 1.5/√N for FWER procedures, we fix p_1 = p_2 = 0.1, and generate simulated data as in Section 5.2. Figure 14 displays power and error rates for these data. Taking N as a (crude) proxy for discreteness, we observe that even with a low discreteness (say N ≤ 30 000) the SUR methods still provide some degree of improvement. Finally, for FWER procedures, ADDIS-spending provides the best power performance over the whole range of the experiment. This might be explained by the setting causing very conservative null p-values (i.e. very close to 1), thus allowing the discarding scheme to redistribute and spend a large part of the wealth on testing alternative hypotheses. Using the SUR method along with the discarding scheme (Tian and Ramdas, 2019, 2021) might provide an interesting avenue for further improvement, but this would define yet another class of procedures, which is outside the scope of this paper.

D.4. Adaptivity parameter
We study the choice of λ for the procedures using adaptivity. It seems that λ = 0.5 is a reasonable choice for the adaptive procedures. Finally, we study the choice of the bandwidth parameter for the rectangular kernel used by the rewarded procedures. As we can see, a smaller bandwidth provides the best performance for the mFDR-controlling rewarded procedures, whereas FWER-controlling procedures require a larger bandwidth. The choices h = 100 for FWER-controlling procedures and h = 10 for mFDR-controlling procedures seem reasonable, although not necessarily optimal.

Fig 1: Nominal wealth for OB (dashed orange curve), effective wealth for OB (solid orange curve) and effective wealth for ρOB (solid green curve) for the male mice from the IMPC data (see Section 5.3 for more details).

Fig 2: Super-uniformity reward ρ_t at time t (length of the vertical line) as defined by (8), for a given function F_t (orange step function) and a critical value α_t (triangle). The dashed line is the identity function x ∈ [0, 1] → x.

Fig 3: Sequences of critical values for Bonferroni procedures with different rewards over time 1 ≤ t ≤ T = 300 (simulated data): base Bonferroni critical values (10) (orange line), rewarded with the greedy approach (14) (blue line), and with the rectangular kernel SUR spending sequence (15) (h = 100, green line). The rug plots display the times of discoveries for each procedure in the corresponding color. The Y-axis has been transformed by y → − log(− log(y)). The grey dots denote the p-value sequence (p-values equal to 1 are displayed at the top of the picture). The spending sequence is γ_t ∝ t^{−1.6}.

Fig 4: Sequences of critical values of the LORD procedure with different rewards over time 1 ≤ t ≤ T = 300 (simulated data): base LORD critical values (23) (orange line), rewarded with the greedy approach (24) (blue line), and with the rectangular kernel SUR spending sequence (25) (h = 10, green line). The rug plots display the times of discoveries for each procedure in the corresponding color. The Y-axis has been transformed by y → − log(− log(y)). The grey dots denote the p-value sequence (p-values equal to 1 are displayed at the top of the picture). The spending sequence is γ_t ∝ t^{−1.6}.
Theorem 4.1. Consider the setting of Section 2.1, where a null bounding family F = {F_t, t ≥ 1} satisfying (2) is at hand. For any spending sequence γ and any SUR spending sequence γ′, consider the LORD procedure A^LORD = {α^LORD_t, t ≥ 1} (23) and the LORD procedure with super-uniformity rewards A^ρLORD = {α^ρLORD_t, t ≥ 1} (25). Then, assuming that the model P is such that (3) holds, we have mFDR(A^ρLORD, P) ≤ α for all P ∈ P, while A^ρLORD uniformly dominates A^LORD.

Fig 5: Power and type I error rates of the different considered OMT procedures versus positions of the signal: at the beginning (B), the end (E), half at the beginning and half in the middle of the stream (BM), half at the beginning and half at the end of the stream (BE), half in the middle and half at the end of the stream (ME), and taken uniformly at random (Random).

Fig 7: Applying online FWER-controlling procedures to the male mice IMPC data set. Left panel: p-values and critical values for OB (orange curve) and ρOB (green curve). Right panel: AOB (orange curve) and ρAOB (green curve). Representation similar to Figure 3 (Y-axis transformed by y → − log(− log(y)); p-values equal to 1 displayed at the top of the picture).

Fig 8: Applying online mFDR-controlling procedures to the male mice IMPC data set. Left panel: p-values and critical values for LORD (orange curve) and ρLORD (green curve). Right panel: ALORD (orange curve) and ρALORD (green curve). Representation similar to Figure 4 (Y-axis transformed by y → − log(− log(y)); p-values equal to 1 displayed at the top of the picture).
Fig 9: Comparison with DS. Left: nominal wealth for OB (dashed orange curve), effective wealth for OB (solid orange curve), effective wealth for ρOB (solid green curve), and effective wealth for DS (solid purple curve); plot similar to Figure 1. Right: rejection numbers, cumulated over time, for the same procedures (same color code). Both plots are computed from the male IMPC data.

Fig 10: Critical values of OB (orange), ρOB (green) and DS (purple) for the IMPC data (left panel is for male, right panel is for female).
Figure 12 illustrates results when the sample size N, i.e., the number of subjects per group, takes values in the set {25, 50, . . ., 150}. As expected, the power plots show that the detection problem becomes easier as N increases; in fact, for large N the power of all procedures converges to 1. We see that our rewarded procedures do well over the whole range of N values and improve substantially on existing OMT procedures for small and moderate values of N, including our default value N = 25.

Fig 12: Power and type I error rates of the considered procedures versus N ∈ {25, 50, . . ., 150}, the number of subjects in the groups.

Fig 15: Power and type I error rates, for the considered procedures, versus the adaptivity parameter λ.

Fig 16: Power for FWER (left) and mFDR (right) rewarded procedures versus the proportion of signal π A , for different kernel bandwidths.

E.2.
Figures 19 and 20 display the critical values of the studied online procedures when applied to the IMPC data in the case of female mice.

Fig 17: p-values for male mice in the IMPC data of Section 5.3. The left panel presents all p-values, the right panel the first 3000 p-values. The p-values have been transformed as in Figure 3.

Fig 18: p-values for female mice in the IMPC data of Section 5.3. The left panel presents all p-values, the right panel the first 3000 p-values. The p-values have been transformed as in Figure 3.

Fig 19: Same as Figure 7 but for the female mice of the IMPC data (see Section 5.3).

Fig 20: Same as Figure 8 but for the female mice of the IMPC data (see Section 5.3).

Table 2
Overview of the critical values of the rewarded procedures, denoted like the corresponding base procedures with an additional symbol "ρ" in the name. Here, α^OB_T, α^AOB_T, α^LORD_T, α^ALORD_T are the critical values of the corresponding base procedures.
• … Ramdas (2021) (AOB).
• We propose two new SUR procedures for online mFDR control in Section 4: the first one (ρLORD) uniformly improves upon the LORD++ procedures of Javanmard and Montanari (2018) and Ramdas et al. (2017) (LORD), while the second one (ρALORD) uniformly improves upon the SAFFRON procedure of Ramdas et al. (2018) (ALORD).

Table 3
Number of discoveries for FWER-controlling OMT procedures (left) and mFDR-controlling OMT procedures (right). These numbers are obtained by running the procedures on the first 30 000 genes for male (second row) and female (third row) mice in the IMPC data. Procedures: OB, ρOB, AOB, ρAOB, LORD, ρLORD, ALORD, ρALORD.
provided that (2) holds, and either (3) holds or the critical values (α_t, t ≥ 1) are deterministic.