Online multiple hypothesis testing

Modern data analysis frequently involves large-scale hypothesis testing, which naturally gives rise to the problem of maintaining control of a suitable type I error rate, such as the false discovery rate (FDR). In many biomedical and technological applications, an additional complexity is that hypotheses are tested in an online manner, one-by-one over time. However, traditional procedures that control the FDR, such as the Benjamini-Hochberg procedure, assume that all p-values are available to be tested at a single time point. To address these challenges, a new field of methodology has developed over the past 15 years showing how to control error rates for online multiple hypothesis testing. In this framework, hypotheses arrive in a stream, and at each time point the analyst decides whether to reject the current hypothesis based both on the evidence against it, and on the previous rejection decisions. In this paper, we present a comprehensive exposition of the literature on online error rate control, with a review of key theory as well as a focus on applied examples. We also provide simulation results comparing different online testing algorithms and an up-to-date overview of the many methodological extensions that have been proposed.


Introduction
Large-scale hypothesis testing is now ubiquitous in a variety of biomedical and technological applications. For example, many major technology companies perform tens of thousands of randomised controlled experiments (known as A/B tests) each year to make data-driven decisions about how to improve products [Kohavi et al., 2020]. Meanwhile, in genomics it is now routine to test hundreds of thousands of genetic variants for an association with particular phenotypic trait(s). Even in the setting of randomised controlled trials (RCTs) in medicine, there is a growing push towards the use of "overarching" trial frameworks to allow the efficient testing of multiple experimental drugs for multiple patient subpopulations.
arXiv:2208.11418v3 [stat.ME] 24 Jul 2023
Performing a large number of hypothesis tests naturally gives rise to the problem of multiple comparisons [Tukey, 1953]: given a collection of multiple hypotheses to be tested, the goal is to distinguish which hypotheses are null and non-null, while controlling a suitable error rate (see Section 1.2). This error rate is generally formed around the probability of incorrectly classifying a null hypothesis as non-null. Typically, a p-value is calculated for each hypothesis and is then used to decide whether to reject the null hypothesis. Multiple hypothesis testing is one of the core problems in statistical inference, and has led to a wide range of procedures that can be used to correct for multiplicity and ensure that a suitable error rate is controlled. In contrast, uncorrected hypothesis testing contributes to serious concerns over reproducibility, publication bias and 'p-hacking' in scientific research [Ioannidis, 2005, Head et al., 2015].
Multiplicity, as broadly understood, is naturally linked to scientific reproducibility. Goodman et al. [2016] state that "Multiplicity, combined with incomplete reporting, might be the single largest contributor to the phenomenon of nonreproducibility, or falsity, of published claims" and go on to say that "Scientific fields that routinely work with multiple hypotheses without correcting for or reporting the occurrence of multiplicity run a higher risk of nonreproducibility of results or inferences". As an example of this, Zeevi et al. [2020] recently showed that adjusting for multiplicity greatly enhances the reproducibility of results from psychology experiments. Similarly, in the context of drug development, Bretz and Westfall [2014], in a paper titled "Multiplicity and replicability: two sides of the same coin", showed that there is a close link between multiplicity and replicability in terms of the observed effect sizes of selected subgroups, with further examples given in Bretz et al. [2019].
Traditionally, multiple hypothesis testing is offline in nature, in the sense that a procedure for testing $N$ hypotheses will receive all of the corresponding p-values $(P_1, \ldots, P_N)$ at once.
Step-up and step-down multiple testing procedures (for example) require knowledge about all p-values in advance. In the offline setting, the seminal Benjamini-Hochberg (BH) procedure is the dominant method used for FDR control. However, this paradigm is often incompatible with modern data-driven decision-making processes, as demonstrated by our motivating examples in Section 1.1. Once the data are available to make a decision about a particular hypothesis, it can be desirable to take a corresponding action (e.g. to update a tech product) relatively quickly, and not to wait for the results of ongoing or future hypothesis tests. Linked with this, in many application areas it may not even be possible to know in advance how many tests in total will be performed. Moreover, the repeated application of traditional offline multiple testing procedures as the family of hypotheses grows can lead to repeatedly changing past decisions, which may be undesirable in some contexts.
What is needed, therefore, are procedures for online multiple hypothesis testing, which better take into account the nature of modern data analysis. Online multiple hypothesis testing is defined as follows: a stream of hypotheses arrives online. At each step, the analyst must decide whether to reject the current null hypothesis without having access to the number of hypotheses (potentially infinite) or any future data, but solely based on the previous decisions and the evidence against the current hypothesis.

Online and sequential testing
Online hypothesis testing has a sequential nature, in the sense that individual hypotheses (or batches of hypotheses) are tested one after the other over time. However, this is distinct from the more traditional concept of sequential testing, which refers to the testing of a single hypothesis in a sequential manner with data accumulating over time. In sequential testing, the sample size for the experiment is not fixed in advance, and the accumulating data are evaluated as they are collected to allow the experiment to be stopped adaptively, such as when statistical significance is achieved. The framework of online multiple testing can be expanded to be "doubly sequential", where the inner sequential process is a single sequential test, and the outer sequential process refers to the multiple experiments that are performed to test different hypotheses.
For each null hypothesis $H_t$, an anytime-valid p-value is a sequence of p-values $(P_{t,n})_{n \ge 1}$, where $n$ indexes the sample size in the experiment corresponding to hypothesis $H_t$, such that $\Pr(P_{t,N} \le x) \le x$ for all $x \in [0, 1]$ and any data-dependent stopping time $N$. In other words, the stopped anytime-valid p-value is a valid p-value in the classical sense, no matter how the experiment was stopped. In online multiple testing, we typically drop the second index and focus on the "outer sequential process" (across experiments/hypotheses), which means that we assume that for each hypothesis $H_t$ we have a valid p-value $P_t$, but we keep in mind that this could have been achieved by stopping an anytime-valid p-value (the "inner sequential process", corresponding to the evidence within a single experiment).
We also wish to draw a distinction between online hypothesis testing and multi-armed bandit (MAB) testing. While both frameworks allow the comparison of multiple experimental arms over time, an MAB can be considered as a single experiment in which resources are iteratively allocated to the different arms in order to adaptively trade off certain costs and benefits, and this allocation depends on the previously observed outcomes on each arm. Again, the two testing frameworks can be combined within a doubly sequential framework where there is a sequence of MAB problems over time; see Yang et al. [2017].
Since the online testing framework was first proposed by Foster and Stine [2008], a variety of procedures that control error rates for online hypothesis testing have been developed [Aharoni and Rosset, 2014, Javanmard and Montanari, 2015, Ramdas et al., 2018]. Our aim in this paper is to provide an expository overview of this literature on online error rate control, with a review of the underlying theory, key methods and applied examples.
The bulk of the literature has focused on the setting of independent hypothesis tests for provable FDR control (with slightly weaker conditions allowing for control of variants of the FDR). Another important feature of many of the algorithms presented here is that they are adaptive: when some fraction of the tests actually have the alternative hypothesis true (as evidenced by p-values), they adapt and use less conservative tests.
In the rest of this section, we give motivating examples for online multiple testing and present formal definitions of error rate control. Section 2 describes the key procedures for online error rate control in detail. Section 3 presents a simulation study of the procedures, while Section 4 presents two case studies of applying online error rate control. We describe further methodological extensions as well as future directions in Section 5. In Section 6 we provide a summary and some practical guidance, and conclude with a discussion in Section 7.

Motivating examples
We now present three motivating examples from a spectrum of 'easier' to 'harder' settings for currently available online testing algorithms, in terms of the statistical dependency between hypotheses.

A/B testing in tech companies (independent hypotheses)
The development of web applications and services in the tech industry increasingly relies on the use of randomised controlled experiments known as A/B tests. There are a number of widely used platforms now available that streamline and handle the implementation of A/B tests. A typical application is in the development of different versions of webpages. As described in Berman and Van den Bulte [2021], in this context there are two webpage variations (A and B). When an online user visits, the platform randomly assigns the visitor to one of the variations for the duration of the experiment. The platform records the actions that the visitor takes, where the monitored actions reflect the experimenter's goal(s), such as increasing visitor engagement (defined appropriately) or increasing revenue. One of the variations is designated as the baseline, and the performance of the other variation is compared to the baseline using suitable test statistics. If run correctly using anytime-valid p-values [Johari et al., 2021] and/or confidence sequences [Howard et al., 2021], the data can be continuously monitored by the experimenter, and a decision can be made at any stopping time.
Many tech companies run tens of thousands of A/B tests per year, as part of a continuous process of designing, delivering, monitoring and improving webpages and other web services.
However, there is reason to reduce the number of false alarms, which result in making changes to web products that do no better (or even perform worse) than the current iteration. Such a false alarm corresponds to an incorrect rejection of the null, and is of particular concern when the changes are potentially costly or disruptive to users. Online error rate control provides a framework for limiting false alarms while still allowing a large number of A/B tests to be performed in a flexible manner.

Platform trials (known positive dependence)
A platform trial has a single master protocol that evaluates multiple treatments across one or more patient types, and allows a potentially large number of treatments to be added during the course of the trial [Saville and Berry, 2016]. A new treatment can be added to the trial (corresponding to testing a new hypothesis) when a new experimental therapy becomes available, such as when a safe drug candidate for the disease in question is identified from a successful phase I clinical trial. Treatments are dropped from the trial after they have been formally tested for effectiveness. Such a trial could (in theory) be 'perpetual' in that new treatments can continue to enter into the trial and be tested.
[Figure 5]
In a platform trial, treatments are introduced at different time points by design. However, the trial investigators will wish to make a decision on whether a treatment is beneficial as soon as the data are ready, without waiting for results from the other treatment arms. Hence, the treatment effects are tested sequentially in an online manner, where the number of treatments to be tested in the future may be unknown. More formally, a platform trial generates a sequence of null hypotheses $(H_1, H_2, H_3, \ldots)$ which are tested sequentially. Hypothesis $H_i$ tests the value of some parameter $\theta_i$, such as an estimate of the treatment difference compared to a control arm.
The p-values generated from the platform trial described above will not be independent in general. Dependencies will primarily arise due to the shared control data that is re-used to test multiple hypotheses. A current example of a long-running platform trial is the STAMPEDE trial [James et al., 2008] for patients with locally advanced or metastatic prostate cancer, which we return to as a case study in Section 4.2.

Data repositories (unknown arbitrary dependence)
Public databases and shared data resources are becoming increasingly pervasive and important in modern biomedical research, particularly in the fields of genetics, molecular biology and routinely collected healthcare records. Some well-known examples include the 1000 Genomes Project [1000 Genomes Project Consortium et al., 2015] and the Wellcome Trust Case Control Consortium [Wellcome Trust Case Control Consortium et al., 2007]. Another example is the International Mouse Phenotyping Consortium database [Koscielny et al., 2013, Dickinson et al., 2016], which we describe as one of our case studies in Section 4. Meanwhile, the increase in routinely collected healthcare data allows evaluation of different healthcare technologies used in practice through emulation of target trials [Dickerman et al., 2019].
Multiple testing naturally occurs in this setting in two ways. Firstly, such databases can be accessed by multiple independent researchers at different times. When a researcher or research group comes up with a new hypothesis, they can fetch the relevant data from a database and perform a statistical test. Secondly, in some databases the family of hypotheses to be tested grows over time as new hypotheses are tested (e.g., corresponding to new experiments being performed that measure phenotype expression for a previously untested gene knockout). In both of these scenarios, the number of hypotheses being tested will be unknown and potentially very large, and overlapping data raise the concern of arbitrary correlation patterns between hypothesis tests. The issues such dependence causes will be considered throughout the rest of this paper.
In order to control the number or proportion of false discoveries in this context, new procedures are required that allow a researcher to decide whether to reject a current hypothesis with minimal information about previous hypotheses, and without prior knowledge of even the number of hypotheses that are going to be tested in the future.This is precisely the online multiple testing framework described earlier.

Error rates
We now formally define some error rates of interest. The basic problem setup is as follows. At each time step $t = 1, 2, \ldots$ the experimenter observes a p-value $P_t$ corresponding to testing a null hypothesis $H_t$, and must make a decision whether to reject $H_t$ before the next time step.
We assume that all p-values are valid, i.e. if the null hypothesis $H_t$ is true, then $\Pr\{P_t \le x\} \le x$ for all $x \in [0, 1]$. At time $t = 0$, the experimenter fixes the level $\alpha$ at which a suitable error rate is meant to be controlled at all times.
A general testing procedure provides a sequence of test levels $\alpha_t$ with decision rule
$$R_t = \mathbf{1}\{P_t \le \alpha_t\}, \tag{1}$$
so that $H_t$ is rejected ($R_t = 1$) if and only if $P_t \le \alpha_t$. At any time $T$, let $R(T) = \sum_{t=1}^{T} R_t$ denote the number of rejections (also known as discoveries) made so far and $V(T)$ denote the total number of falsely rejected hypotheses (also known as false discoveries).
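The false discovery proportion (FDP) up to time $T$ is defined as
$$\mathrm{FDP}(T) := \frac{V(T)}{R(T) \vee 1},$$
where $a \vee b = \max(a, b)$. The false discovery rate (FDR) is then the expectation of the FDP:
$$\mathrm{FDR}(T) := \mathbb{E}[\mathrm{FDP}(T)].$$
A commonly studied variant is the marginal FDR (mFDR):
$$\mathrm{mFDR}(T) := \frac{\mathbb{E}[V(T)]}{\mathbb{E}[R(T) \vee 1]}.$$
Another related error rate is the false discovery exceedance (FDX), which is the probability that the supremum of the FDP exceeds a predefined threshold $\epsilon$:
$$\mathrm{FDX}_{\epsilon}(T) := \Pr\Big(\sup_{0 \le t \le T} \mathrm{FDP}(t) \ge \epsilon\Big).$$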
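As a small illustration of these definitions, the following Python sketch computes the running FDP from a sequence of rejection decisions, assuming (as is only possible in simulations) that the truth of each null hypothesis is known. The function names `fdp_path` and `fdp_exceeds` are hypothetical and chosen for this example only. Averaging the final FDP over many simulated streams would give a Monte Carlo estimate of the FDR, and averaging the exceedance indicator would estimate the FDX.

```python
from typing import List, Sequence


def fdp_path(rejections: Sequence[int], is_null: Sequence[bool]) -> List[float]:
    """Return FDP(t) = V(t) / max(R(t), 1) for t = 1, ..., T.

    rejections[t-1] is R_t (1 if H_t was rejected, else 0).
    is_null[t-1] is True when H_t is a true null (known only in simulations).
    """
    if len(rejections) != len(is_null):
        raise ValueError("rejections and is_null must have the same length")
    path = []
    n_rejected = 0        # R(t): number of rejections so far
    n_false = 0           # V(t): number of false rejections so far
    for r_t, null_t in zip(rejections, is_null):
        n_rejected += int(r_t)
        n_false += int(bool(r_t) and bool(null_t))
        path.append(n_false / max(n_rejected, 1))
    return path


def fdp_exceeds(path: Sequence[float], eps: float) -> bool:
    """Indicator that the running FDP ever reaches the threshold eps."""
    return max(path, default=0.0) >= eps


# Toy stream: the third hypothesis is a true null that was rejected.
path = fdp_path([1, 0, 1, 1], [False, True, True, False])
print(path)
print(fdp_exceeds(path, 0.4))
```

Note that the supremum in the FDX can be driven by an early false discovery even when the final FDP is small, as in the toy stream above.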
We view the FDR as the central metric of interest, given its long history, widespread use in applied fields such as genetics, and intuitive interpretation. The mFDR can be a convenient theoretically tractable proxy for the FDR when it is not possible to prove FDR control for a particular algorithm and data application, as we highlight in the rest of the paper. In some settings previously explored in the literature, it has been shown empirically that the realised FDR and mFDR of online hypothesis testing algorithms are very similar (see e.g. Appendix F of Zrnic et al. [2021]), although this is not true in general (see e.g. the Supplementary material in Javanmard and Montanari [2018]). Hence, the FDR would be the default choice for most users, with the mFDR then being the pragmatic alternative error rate choice if a suitable algorithm for FDR control is not available for the particular data application in mind.
In contrast, the FDX gives a stricter guarantee about the distribution of the FDP: whereas the FDR controls the expectation of the FDP, the FDX controls the tail probability of the FDP (i.e., controlling the $(1 - \epsilon)$-quantile of the FDP distribution). Control of the FDX makes most sense in settings where the FDP can deviate significantly from its expectation, such as when the number of hypotheses to be tested is not very large, or there is significant correlation [Javanmard and Montanari, 2018]. As for the choice of $\epsilon$ for the FDX, a default choice of $\epsilon = 0.05$ or $0.10$ is one option, but a pre-hoc choice of $\epsilon$ may also be motivated on practical grounds, such as choosing $\epsilon$ based on the required sample size to achieve a desired power given control of the FDX at level $\alpha$. Another approach is to use recently proven post-hoc bounds of the FDX for online testing algorithms under independence, which allow the user to choose $\epsilon$ (and $\alpha$) freely by examining the corresponding rejections and seeing what makes most sense [Katsevich and Ramdas, 2020].
An alternative error rate to those based on the FDP is the familywise error rate (FWER), which is more commonly considered in clinical trial contexts due to the relatively small number of hypotheses and regulatory requirements. The FWER is the probability of falsely rejecting any null hypothesis:
$$\mathrm{FWER}(T) := \Pr(V(T) \ge 1).$$
The FWER and hence the FDR can be controlled at level $\alpha$ in a simple manner by using a Bonferroni-type correction, also known as alpha-spending. More precisely, we can choose significance levels $\alpha_t$ for $H_t$ such that $\sum_{t=1}^{\infty} \alpha_t = \alpha$. We reiterate that this corresponds to the setting where each nominal critical value $\alpha_t$ corresponds to testing a single hypothesis $H_t$, with the possibility of repeated testing of the same hypothesis (as in the sequential testing literature) included implicitly; see our remark in the Introduction. However, alpha-spending suffers from low statistical power, with the probability of the null hypothesis $H_t$ being rejected tending to zero as $t$ increases. This motivates the development of more sophisticated algorithms for online error rate control.
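To make alpha-spending concrete, here is a minimal Python sketch. It assumes one particular spending sequence, $\alpha_t = \alpha \cdot 6 / (\pi^2 t^2)$, chosen for this example only because $\sum_{t \ge 1} 1/t^2 = \pi^2/6$ makes the levels sum exactly to $\alpha$; the function names are hypothetical and other summable sequences could equally be used.

```python
import math
from typing import List, Sequence


def alpha_spending_levels(alpha: float, n_tests: int) -> List[float]:
    """First n_tests levels alpha_t = alpha * 6 / (pi^2 * t^2).

    The infinite sequence sums to alpha, so any finite prefix sums to less.
    """
    return [alpha * 6.0 / (math.pi ** 2 * t ** 2) for t in range(1, n_tests + 1)]


def alpha_spending_decisions(p_values: Sequence[float], alpha: float) -> List[int]:
    """Reject H_t (decision 1) when P_t <= alpha_t; levels ignore past decisions."""
    levels = alpha_spending_levels(alpha, len(p_values))
    return [int(p <= a) for p, a in zip(p_values, levels)]


levels = alpha_spending_levels(0.05, 5)
print([round(a, 5) for a in levels])           # levels shrink quickly with t
print(alpha_spending_decisions([0.01, 0.01, 0.0001], alpha=0.05))
```

The same p-value of 0.01 is rejected at $t = 1$ but not at $t = 2$, which illustrates the loss of power over time described above.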

Online error rate control methodology

Generalised alpha-investing (GAI)
The first proposals for online error rate control were based on "alpha-investing" by Foster and Stine [2008] and its generalisation (GAI) [Aharoni and Rosset, 2014]. (An alternative early line of work instead focused on extensions of gatekeeping procedures that allow for online control of the FWER or FDR for ordered hypotheses [Finos and Farcomeni, 2011, Farcomeni and Finos, 2013], but these turn out to be far less powerful in practice, so we do not discuss them further.) Any GAI rule begins with an error budget, or alpha-wealth, which is allocated to the different hypothesis tests over time. That is, there is a price to be paid each time a hypothesis is tested, which can be viewed as making an investment in the hypothesis in question. If the hypothesis is rejected, alpha-wealth is earned back, which can be viewed as a return or payout on the alpha-investment. Since the alpha-wealth can increase in this way, as long as discoveries continue to be made, hypotheses can be tested indefinitely without the test levels tending towards zero.
The intuition behind the alpha-wealth increasing after a rejection is that the denominator in the FDP increases, therefore allowing the numerator (i.e. the number of false rejections) to also increase for future hypothesis tests while still controlling the FDR.
Formally, a GAI rule produces a series of test levels $(\alpha_1, \alpha_2, \alpha_3, \ldots)$ based on which it uses (1) to produce the corresponding decisions $(R_1, R_2, R_3, \ldots)$. Of course, $\alpha_t$ must be based only on $R_1, \ldots, R_{t-1}$. At each time point $t$, the alpha-wealth $W(t)$ decreases by an amount $\phi_t$. If the hypothesis $H_t$ is rejected ($R_t = 1$), then the alpha-wealth is increased by $\psi_t$. In other words, the price $\phi_t$ is the amount paid for testing (i.e., investing in) a new hypothesis, and the payout (or return on the investment) $\psi_t$ is the amount earned if a discovery is made at that time. Hence the initial wealth is $W(0) = w_0$ and it is updated via:
$$W(t) = W(t-1) - \phi_t + R_t \psi_t.$$
Figure 2 gives a diagrammatic summary of how GAI works. The total wealth $W(t)$ must always be non-negative, and hence $\phi_t \le W(t-1)$. Additionally, there are restrictions on $\alpha_t$, $\phi_t$, $\psi_t$, namely that when a rejection is made, the payout $\psi_t$ is capped. This upper bound is there to ensure control of the FDR (and its variants).
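The wealth bookkeeping described above can be sketched in a few lines of Python. This is an illustration of the update of $W(t)$ only: the function `run_gai` and the rule `toy_rule` are hypothetical names, and the toy rule (spend one tenth of the current wealth, test at that level, and refund the price on rejection) is an assumption made for the example, with no claim that it is a recommended or optimal choice.

```python
from typing import Callable, List, Sequence, Tuple

# A rule maps (t, current wealth, past decisions) to (alpha_t, phi_t, psi_t).
Rule = Callable[[int, float, Tuple[int, ...]], Tuple[float, float, float]]


def run_gai(p_values: Sequence[float], w0: float, rule: Rule):
    """Track alpha-wealth via W(t) = W(t-1) - phi_t + R_t * psi_t."""
    wealth = w0
    decisions: List[int] = []
    wealth_path = [w0]                      # W(0), W(1), ..., W(T)
    for t, p in enumerate(p_values, start=1):
        # The rule sees only past decisions R_1, ..., R_{t-1}.
        alpha_t, phi_t, psi_t = rule(t, wealth, tuple(decisions))
        if not 0.0 <= phi_t <= wealth:
            raise ValueError("price phi_t must satisfy 0 <= phi_t <= W(t-1)")
        r_t = int(p <= alpha_t)             # decision rule R_t = 1{P_t <= alpha_t}
        wealth = wealth - phi_t + r_t * psi_t
        decisions.append(r_t)
        wealth_path.append(wealth)
    return decisions, wealth_path


def toy_rule(t: int, wealth: float, past: Tuple[int, ...]):
    """Hypothetical rule for illustration: alpha_t = phi_t = psi_t = W(t-1)/10."""
    phi_t = wealth / 10.0
    return phi_t, phi_t, phi_t


decisions, wealth_path = run_gai([0.001, 0.9, 0.001], w0=0.025, rule=toy_rule)
print(decisions)
print(wealth_path)
```

In this toy run the wealth is unchanged after each rejection (the price is refunded) and shrinks after the non-rejection, which is the basic mechanism by which GAI rules keep testing indefinitely while discoveries continue.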
Given these constraints, the user is free to choose the sequences $\alpha_t$, $\phi_t$ and $\psi_t$. As an example, the alpha-investing rule explored in Foster and Stine [2008] chooses
$$\phi_t = \frac{\alpha_t}{1 - \alpha_t} \quad \text{and} \quad \psi_t = \phi_t + \alpha.$$
The choice of $\alpha_t$, $\phi_t$ and $\psi_t$ was explored in terms of the trade-off between the sequences $\alpha_t$ and $\psi_t$ in Aharoni and Rosset [2014]. However, in this paper, we focus on the new 'statistical' perspective for constructing online algorithms that control the FDR (see the start of Section 2.2), which implicitly give choices for $\phi_t$ and $\psi_t$. As is predominantly the case in offline multiple [...]
Ramdas et al. [2017] defined a class of improved GAI algorithms, called GAI++, as follows.
Set $w_0$ so that $0 \le w_0 \le \alpha$ and choose the payout $\psi_t$ to satisfy
$$\psi_t \le \min\Big\{\phi_t + b_t,\ \frac{\phi_t}{\alpha_t} + b_t - 1\Big\},$$
where $b_t = \alpha - w_0 \mathbf{1}\{R(t-1) = 0\}$. This upper bound on the payout is different from the original GAI algorithms in order to guarantee FDR control while giving the largest possible payout for rejecting a hypothesis, with the choice of $w_0$ determining the payout received for the very first rejection (see e.g. the LORD++ algorithm in Section 2.2). Ramdas et al. [2017] show that any monotone GAI++ rule comes with the following guarantee:
Theorem 2.1. If the null p-values (i.e., the subsequence of p-values where the null hypothesis is true) are independent of all other p-values, any monotone GAI++ rule satisfies the bound
$$\mathbb{E}\left[\frac{V(T) + W(T)}{R(T) \vee 1}\right] \le \alpha \quad \text{for all } T \ge 1,$$
and hence, since $W(T) \ge 0$, $\mathrm{FDR}(T) \le \alpha$ for all $T \ge 1$.
This is in contrast to the GAI rules (including alpha-investing as proposed by Foster and Stine [2008]), which only control the mFDR.
We note that the independence assumption refers to independence between different hypotheses. The important case of sequential testing of any single hypothesis can be seamlessly incorporated through the use of anytime-valid p-values as described in the Introduction. A related framework is to use 'asynchronous' online testing, as discussed in Section 5, which gives the added flexibility of allowing hypothesis tests to overlap in time.

Algorithms for online FDR control: LORD, SAFFRON and ADDIS
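Although an "algorithmic perspective" led to the GAI procedures initially used in the online testing literature, Ramdas et al. [2017] posited a "statistical perspective" to construct procedures, which is to keep an estimate of the FDP less than $\alpha$. First, the oracle FDP is defined as
$$\mathrm{FDP}^*(t) := \frac{\sum_{j \le t,\ j \in \mathcal{H}_0} \alpha_j}{R(t) \vee 1},$$
where $\mathcal{H}_0$ denotes the (unknown) set of indices of the true null hypotheses. Since $\mathcal{H}_0$ is unknown in practice, the algorithms below instead work with observable upper bounds for $\mathrm{FDP}^*(t)$.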

LORD
The LORD algorithm was conceptualised by Javanmard and Montanari [2018], and is an instance of a monotone GAI rule. More precisely, given an infinite non-increasing sequence of positive constants $\{\gamma_t\}_{t=1}^{\infty}$ that sums to one, the test levels $\alpha_t$ for LORD are chosen as follows:
$$\alpha_t = \gamma_t w_0 + b_0 \sum_{j : \tau_j < t} \gamma_{t - \tau_j},$$
where $\tau_j$ denotes the time of the $j$-th rejection and we must have $w_0 + b_0 \le \alpha$ for FDR control to hold.
Following this, Ramdas et al. [2017] defined a simple upper bound of $\mathrm{FDP}^*(t)$:
$$\widehat{\mathrm{FDP}}_{\mathrm{LORD}}(t) := \frac{\widehat{V}(t)}{R(t) \vee 1}, \qquad \widehat{V}(t) := \sum_{j \le t} \alpha_j,$$
and showed that LORD can be viewed as an algorithm that keeps $\widehat{\mathrm{FDP}}_{\mathrm{LORD}}(t) \le \alpha$. Here $\widehat{V}(t)$ corresponds to the alpha-wealth used for testing while $\alpha R(t)$ corresponds to the total earned alpha-wealth that can be used for subsequent tests. Exploiting this view, they derived a uniform improvement of LORD, termed LORD++ (presented below). In brief, LORD++ is able to replace $b_0 = \alpha - w_0$ with the choice $b_0 = \alpha$ while still maintaining FDR control, with the catch that the smaller reward $b_0 = \alpha - w_0$ still applies to the very first rejection (see below).
Given an infinite non-increasing sequence of positive constants $\{\gamma_t\}_{t=1}^{\infty}$ that sums to one, the test levels $\alpha_t$ for LORD++ are chosen as follows:
$$\alpha_t = \gamma_t w_0 + (\alpha - w_0)\, \gamma_{t - \tau_1} \mathbf{1}\{\tau_1 < t\} + \alpha \sum_{j : \tau_j < t,\ \tau_j \ne \tau_1} \gamma_{t - \tau_j}.$$
The above formula may look daunting but it is interpretable. The first term is the fraction of the initial wealth $w_0$ that is used by the $t$-th test. The other terms are the fractions of the earnings from rejections before $t$ that are spent in the $t$-th round: LORD++ awards $\alpha - w_0$ for the first rejection and $\alpha$ for every subsequent rejection, and on receiving this reward, the method immediately allocates that reward to future rounds according to the same schedule of constants $\{\gamma_t\}$, shifted to start at the next instant. This rule ensures that LORD++ never spends more than it has earned, thus keeping $\widehat{\mathrm{FDP}}_{\mathrm{LORD++}}(t) \le \alpha$.
The intuitive reason why LORD++ cannot award $\alpha$ for the very first rejection can be seen in the definition $\mathrm{FDP}(T) = \frac{V(T)}{R(T) \vee 1}$. The denominator $R(T) \vee 1 = 1$ when the number of rejections equals zero or one, and hence only starts increasing at the second rejection. This means that the sum of $w_0$ and the first reward must be at most $\alpha$, following which $\alpha$ may be rewarded at every rejection. As for the choice of the sequence $\gamma_t$, this depends on the data application at hand, with a reasonable default choice given by $\gamma_t \propto \frac{\log(t \vee 2)}{t \exp(\sqrt{\log t})}$, which has been shown to maximise power in the Gaussian setting (i.e. where the test statistics follow a normal distribution) [Javanmard and Montanari, 2018].
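The verbal description of LORD++ (spend the initial wealth $w_0$ according to $\{\gamma_t\}$, earn $\alpha - w_0$ at the first rejection and $\alpha$ at each later one, and spread each reward over future tests with the same shifted schedule) can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a reference implementation: the function names are hypothetical, the default $w_0 = \alpha/2$ is an arbitrary choice for the example, and the default $\gamma_t$ is normalised over a finite horizon equal to the number of p-values supplied, which is a simplification of the infinite sequence used in the text.

```python
import math
from typing import List, Optional, Sequence, Tuple


def default_gamma(n: int) -> List[float]:
    """gamma_t proportional to log(max(t, 2)) / (t * exp(sqrt(log t))), t = 1..n.

    Assumption for this sketch: normalised to sum to one over the finite horizon n.
    """
    raw = [math.log(max(t, 2)) / (t * math.exp(math.sqrt(math.log(t))))
           for t in range(1, n + 1)]
    total = sum(raw)
    return [g / total for g in raw]


def lord_plus_plus(p_values: Sequence[float], alpha: float = 0.05,
                   w0: Optional[float] = None,
                   gamma: Optional[Sequence[float]] = None
                   ) -> Tuple[List[float], List[int]]:
    """Return (test levels, decisions) for a stream of p-values."""
    n = len(p_values)
    w0 = alpha / 2 if w0 is None else w0
    gamma = default_gamma(n) if gamma is None else gamma
    rejection_times: List[int] = []          # tau_1, tau_2, ... (1-indexed)
    levels: List[float] = []
    decisions: List[int] = []
    for t in range(1, n + 1):
        level = gamma[t - 1] * w0            # share of the initial wealth
        for j, tau in enumerate(rejection_times):
            reward = (alpha - w0) if j == 0 else alpha
            level += reward * gamma[t - tau - 1]   # gamma_{t - tau_j}
        reject = p_values[t - 1] <= level
        if reject:
            rejection_times.append(t)
        levels.append(level)
        decisions.append(int(reject))
    return levels, decisions


levels, decisions = lord_plus_plus([0.0001, 0.3, 0.004, 0.8, 0.002])
print(decisions)
print([round(a, 5) for a in levels])
```

Because each test level depends only on the times of earlier rejections, the levels can be computed before the current p-value is observed, which is what makes the procedure usable online.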
The manner in which $\widehat{\mathrm{FDP}}_{\mathrm{LORD++}}(t)$ is a simple upper bound on $\mathrm{FDP}^*(t)$ is reminiscent of the BH procedure for offline testing, which can be derived in a similar fashion. More precisely, suppose that one rejects all p-values below some fixed threshold $s \in (0, 1)$. The BH procedure overestimates the FDP using the quantity $\widehat{\mathrm{FDP}}_{\mathrm{BH}}(s) = \frac{n \cdot s}{|R(s)|}$, where $n$ is the number of p-values and $R(s)$ denotes the set of rejected p-values using the fixed threshold $s$. The BH procedure then rejects the set $R(\hat{s}_{\mathrm{BH}})$, where $\hat{s}_{\mathrm{BH}} = \max\{s : \widehat{\mathrm{FDP}}_{\mathrm{BH}}(s) \le \alpha\}$. This leads us to view LORD++ as the online analogue of the BH procedure.
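For comparison with the online procedures, the offline BH procedure can be sketched in Python as below. The sketch uses the standard equivalent step-up form: with sorted p-values $p_{(1)} \le \ldots \le p_{(n)}$, find the largest $k$ such that $p_{(k)} \le \alpha k / n$ and reject the $k$ smallest p-values. The function name `benjamini_hochberg` is chosen for this example.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Offline BH step-up procedure; returns a rejection flag per hypothesis."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])   # indices by ascending p
    k = 0                                                 # number of rejections
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / n:
            k = rank            # keep the largest rank that passes its threshold
    rejected = [False] * n
    for i in order[:k]:
        rejected[i] = True
    return rejected


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
print(benjamini_hochberg([0.04, 0.03, 0.035, 0.036]))
```

In the second call no p-value passes the threshold for ranks 1 or 2, yet all four are rejected because the largest passes at rank 4. This step-up behaviour requires all p-values at once, which is exactly what is unavailable in the online setting.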
Guarantees for LORD++ hold under different p-value dependencies, which we now formalise.
Define the filtration at time $t$ as $\mathcal{F}_t = \sigma(R_1, \ldots, R_t)$ (representing the collection of the observed rejections up to time $t$) and let $\alpha_t = f_t(R_1, \ldots, R_{t-1})$, where $f_t$ is a $[0, 1]$-valued function. The null p-values are said to be conditionally super-uniform if $\Pr\{P_t \le \alpha_t \mid \mathcal{F}_{t-1}\} \le \alpha_t$ for any $\mathcal{F}_{t-1}$-measurable $\alpha_t$. Armed with this definition, we have the following theorem from Ramdas et al. [2017]:
Theorem 2.2. (a) If the null p-values are conditionally super-uniform, then the condition $\widehat{\mathrm{FDP}}_{\mathrm{LORD}}(t) \le \alpha$ for all $t \ge 1$ implies that $\mathrm{mFDR}(t) \le \alpha$ for all $t \ge 1$.
(b) If the null p-values are independent of each other and of the p-values corresponding to the non-null hypotheses, and $\{\alpha_t\}$ is chosen to be a monotone function of past rejections, then the condition $\widehat{\mathrm{FDP}}_{\mathrm{LORD}}(t) \le \alpha$ for all $t \ge 1$ implies that $\mathrm{FDR}(t) \le \alpha$ for all $t \ge 1$.
Finally, in terms of theoretical power guarantees for LORD, Chen and Arias-Castro [2021] considered the setting of a (generalised) Gaussian model (see reference for further details) and showed that LORD is asymptotically optimal, in particular by being as powerful as BH to first asymptotic order.

SAFFRON
Ramdas et al. [2018] derived an adaptive version of LORD++ called SAFFRON, which is based on an estimate of the proportion of true null hypotheses. By not wasting its earnings on attempting to reject weaker signals (i.e. larger p-values), SAFFRON preserves alpha-wealth and hence can have a higher power than LORD++. To this end, we choose $\lambda \in (0, 1)$ and define the candidate p-values as those that satisfy $P_t \le \lambda$, since SAFFRON will never reject a p-value larger than $\lambda$. We also choose an infinite non-increasing sequence of positive constants $\{\gamma_t\}_{t=1}^{\infty}$ that sums to one. Reasonable default choices for these hyper-parameters are $\lambda = 0.5$ and $\gamma_t \propto t^{-1.6}$ [Ramdas et al., 2018]. The formulae for the test levels $\alpha_t$ for SAFFRON are given in Appendix A.
SAFFRON starts off with alpha-wealth $(1 - \lambda) w_0$ and does not lose any of this wealth when testing candidate p-values. Of course, this has to be done in a principled way and is accounted for in the formulation of the test levels $\alpha_t$, which intuitively helps explain the $(1 - \lambda)$ multiplicative factor (see Appendix A). It gains an alpha-wealth of $(1 - \lambda)\alpha$ for each discovery after the first. SAFFRON can make more rejections than LORD++ if there is a significant fraction of non-nulls and the signals are strong.
Similar to LORD++, SAFFRON provably controls the mFDR at all times if the null p-values are conditionally super-uniform. Also, SAFFRON controls the FDR at all times if the null p-values are independent of each other and of the non-nulls, and $\{\alpha_t\}$ is chosen to be a monotone function of $(R_1, \ldots, R_{t-1}, C_1, \ldots, C_{t-1})$, where $C_t = \mathbf{1}\{P_t \le \lambda\}$; see Ramdas et al.

ADDIS
ADDIS stands for an ADaptive algorithm that DIScards conservative nulls, and was proposed by Tian and Ramdas [2019]. ADDIS can invest alpha-wealth more effectively than LORD++ or SAFFRON by explicitly discarding the weakest signals (i.e. the largest p-values) in a principled way, which can lead to a higher power. More formally, in practice it is common to encounter conservative nulls, where a null p-value $P$ is conservative if $\Pr\{P \le x\} < x$ for all $x \in [0, 1]$.
Often nulls are uniformly conservative, which means that under the null, $\Pr\{P/c \le x \mid P \le c\} \le x$ for all $x, c \in (0, 1)$.
For example, for a one-dimensional exponential family with parameter $\theta$, when the true parameter $\theta$ is strictly smaller than $\theta_0$, the uniformly most powerful test of $H_0 : \theta \le \theta_0$ versus $H_1 : \theta > \theta_0$ will give uniformly conservative nulls [Zhao et al., 2019]. Another setting is using always-valid p-values [Johari et al., 2021] in the context of continuous monitoring for A/B testing, which will always be conservative.
In general, adaptivity (used by both SAFFRON and ADDIS) helps when there is a significant fraction of non-nulls (like 10% or 20%). Discarding (used only by ADDIS) helps when the nulls are conservative, meaning that instead of being exactly uniform, they are stochastically larger than uniform. Discarding helps even without adaptivity, and adaptivity helps without discarding. The key idea behind discarding is intuitive: if you see a p-value larger than (say) 0.5, throw it away, but if you see a p-value smaller than 0.5, then double it (to condition on selection) and pass it on to the multiple testing procedure. Roughly, if there are mostly nulls and these are uniformly distributed, this does not do much at all: the tested p-values are doubled, but only about half the p-values are tested, so the multiplicity correction is halved and the two effects cancel. However, if the nulls are stochastically much larger than uniform, then we may throw away most of the nulls in this step, eventually testing only a much smaller number of p-values (which have been doubled).
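The discarding heuristic described above (throw away large p-values, rescale the rest) can be sketched in Python as follows. This illustrates only the pre-processing idea, not the ADDIS test levels themselves, which are given in Appendix B; the function name `discard_and_rescale` and the default threshold of 0.5 are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def discard_and_rescale(p_values: Sequence[float],
                        eta: float = 0.5) -> List[Tuple[int, float]]:
    """Keep p-values with P_t <= eta and rescale them to P_t / eta.

    Returns (original index, rescaled p-value) pairs in arrival order, so the
    surviving p-values can be passed to any online testing procedure.
    With eta = 0.5, rescaling amounts to doubling the retained p-values.
    """
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must lie in (0, 1]")
    return [(i, p / eta) for i, p in enumerate(p_values) if p <= eta]


kept = discard_and_rescale([0.9, 0.1, 0.6, 0.25])
print(kept)   # the p-values 0.9 and 0.6 are discarded; 0.1 and 0.25 are doubled
```

If the nulls are exactly uniform, roughly half of them survive and each survivor is doubled, so little is gained; if the nulls are conservative, far fewer survive and the downstream procedure faces a much smaller multiplicity burden.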
In terms of formal definitions, with λ and the corresponding candidate p-values defined as for SAFFRON, we let S_t = 1{P_t ≤ η} be the indicator of H_t being selected for testing (i.e. not discarded). Hence η is the discarding threshold and must be greater than λ. We also choose an infinite non-increasing sequence of positive constants {γ_t}_{t=0}^∞ that sums to one. Reasonable default choices for these hyper-parameters are λ = 0.25, η = 0.5 and γ_t ∝ (t + 1)^(−1.6), as justified empirically in Tian and Ramdas [2019]. The formulae for the test levels α_t for ADDIS are given in Appendix B. As can be seen, ADDIS starts off with an alpha-wealth of (η − λ)w_0 and (like SAFFRON) does not lose any of this wealth when testing candidate p-values. The p-values that are greater than η do not affect the test levels for ADDIS at all, i.e. it is as if they did not exist in the sequence of p-values (reflecting the term 'discarding'). ADDIS gains an alpha-wealth of (η − λ)α for each rejection after the first.
As with LORD++ and SAFFRON, ADDIS provably controls the mFDR at all times if the null p-values are conditionally uniformly conservative. ADDIS provably controls the FDR at all times if the null p-values are independent of each other and of the non-nulls, and {α_t}_{t=1}^∞ is a monotone function of the past; see Tian and Ramdas [2019] for full details.

Monotone AI
As a comparator to the above algorithms, we also consider a version of the original AI algorithm of Foster and Stine [2008], as modified by Ramdas et al. [2017] to ensure it is a monotone rule and hence that FDR control holds. We will refer to this rule as 'monotone AI'.

Simulation studies
In this section we compare the performance of the LORD++, SAFFRON and ADDIS algorithms in terms of the FDR and statistical power. We do not aim to present an exhaustive simulation of all the algorithms currently available in the literature, but rather select a representative set of algorithms to demonstrate some key general features for the core problem of online FDR control.
To this end, we use LORD++ as a representative 'basic' online algorithm, given that it is the natural online analog of the BH procedure. We then use SAFFRON as a representative adaptive online algorithm, while ADDIS represents an adaptive online algorithm that also incorporates discarding. As an additional comparison, we also include the 'monotone AI' rule. In Section 3.3 we refer the reader to further simulation studies that have been published in the literature. First though, we briefly describe the software implementation of algorithms for online error rate control.

Software: onlineFDR package
The onlineFDR package is an open-source R package that aims to provide a comprehensive and up-to-date implementation of algorithms for online error rate control. It is freely available via Bioconductor [Robertson et al., 2021]. The package implements the LORD++, SAFFRON and ADDIS algorithms, as well as almost all of the algorithms corresponding to the further extensions of online error rate control methodology (see Section 5.1). In particular, it also provides functions for algorithms for online FWER and online FDX control. The package documentation provides a user-friendly introduction to the use of the package, and there is also a Shiny app available [Liou and Robertson, 2021] to allow users to explore algorithms for online FDR control in an interactive way without having to program. All results for the simulation and case studies in this paper were calculated using the package.

Testing with Gaussian observations
In order to examine the relative performance of the online FDR algorithms, we use a simple experimental setup of testing Gaussian means, with a total of T hypotheses. Note that although T is fixed in the simulations, none of the methods use this knowledge of T to normalise the sequence γ_t. The null hypotheses take the form H_t: µ_t ≤ 0, which are tested against the alternative µ_t > 0 for t = 1, ..., T. We observe independent observations Z_t ∼ N(µ_t, 1), which are transformed to one-sided p-values P_t = Φ(−Z_t), where Φ denotes the standard Gaussian CDF.
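The motivation for using one-sided p-values comes from A/B testing, where one wishes to detect larger effects, not smaller. The means µ_t are set according to the following mixture model:

$$\mu_t \sim \begin{cases} F_0 & \text{with probability } 1 - \pi_1, \\ F_1 & \text{with probability } \pi_1, \end{cases}$$

where F_1 ∼ N(3, 1) and F_0 is defined as below.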
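As a rough illustration of this setup, the following Python sketch draws the means from a two-component mixture and converts the Gaussian observations to one-sided p-values. It assumes that F_0 is a point mass (as in the F_0 ≡ 0 case used for Figure 3); the function names, the `null_mean` argument and the seed are hypothetical, and the paper's own results were computed with the onlineFDR R package rather than with this code.

```python
import math
import random


def normal_cdf(x):
    """Standard Gaussian CDF, written via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def simulate_p_values(T, pi1, null_mean=0.0, seed=0):
    """Simulate T one-sided p-values P_t = Phi(-Z_t) with Z_t ~ N(mu_t, 1).

    Assumption for this sketch: mu_t is drawn from N(3, 1) with probability
    pi1 and equals the constant null_mean otherwise. Hypotheses are labelled
    non-null according to the mixture component they were drawn from.
    """
    rng = random.Random(seed)
    p_values, is_non_null = [], []
    for _ in range(T):
        non_null = rng.random() < pi1
        mu = rng.gauss(3.0, 1.0) if non_null else null_mean
        z = rng.gauss(mu, 1.0)
        p_values.append(normal_cdf(-z))
        is_non_null.append(non_null)
    return p_values, is_non_null


p_values, is_non_null = simulate_p_values(T=300, pi1=0.5)
print(f"{sum(is_non_null)} of {len(p_values)} hypotheses are non-null")
```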
We use the default settings for LORD++, SAFFRON and ADDIS that are implemented in the onlineFDR package (following suggestions in the literature). For LORD++, we use the default choice of γ_t ∝ log(t ∨ 2) / (t exp(√(log t))). We also use this choice of γ_t for alpha-spending, where {α_t}_{t=1}^∞ is simply given by α_t = αγ_t. For SAFFRON we set λ = 0.5 and γ_t ∝ 1/t^1.6. Finally, for ADDIS we set λ = 0.25, η = 0.5 and γ_t ∝ 1/(t + 1)^1.6. We use the exponent 1.6 in the denominator for γ_t because this was found empirically to work well in a range of different simulation studies in the original papers. More precisely, the sequence γ_t satisfies ∑_{t=1}^∞ γ_t = 1. Figure 3 shows how the test levels {α_t}_{t=1}^∞ (displayed on the log_10 scale) evolve over time for LORD++, SAFFRON, ADDIS and monotone AI compared with uncorrected testing (where α_t ≡ α) and alpha-spending. Here, T = 300, α = 0.05, π_1 = 0.5 and we choose F_0 ≡ 0.
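To make the role of the sequence γ_t concrete, the sketch below builds a normalised power-law sequence γ_t ∝ t^(−1.6) and the corresponding alpha-spending levels α_t = αγ_t. Note that the paper uses the LORD++ sequence for alpha-spending; the power-law form is used here only because its normalising constant is easy to approximate numerically. The function name, the number of summed terms and the integral tail correction are assumptions of this sketch.

```python
def power_law_gamma(horizon, s=1.6, n_terms=200_000):
    """First `horizon` terms of gamma_t = t^(-s) / zeta(s), for s > 1.

    The infinite normalising sum zeta(s) is approximated by a partial sum
    plus an integral estimate of the tail, so the returned terms sum to
    slightly less than one (the rest of the mass lies beyond the horizon).
    """
    partial = sum(t ** (-s) for t in range(1, n_terms + 1))
    zeta_s = partial + n_terms ** (1.0 - s) / (s - 1.0)
    return [t ** (-s) / zeta_s for t in range(1, horizon + 1)]


alpha = 0.05
gamma = power_law_gamma(horizon=300)
# Alpha-spending: each test level is a fixed share of the overall alpha.
alpha_spending_levels = [alpha * g for g in gamma]

print(f"gamma_1 = {gamma[0]:.4f}, mass used by t = 300: {sum(gamma):.3f}")
```

Because the levels are fixed in advance and never replenished, alpha-spending cannot 'earn back' wealth after a rejection, unlike the online FDR algorithms.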
All of the online FDR algorithms have higher test levels than alpha-spending (apart from LORD++ briefly early on in this particular experiment). The relative difference increases with t as the online algorithms 'earn back' wealth over time, which alpha-spending cannot do. SAFFRON, ADDIS and monotone AI have higher test levels than LORD++, reflecting how they can more efficiently invest the alpha-wealth. In this setting, since µ_t = 0 under the null, the nulls are exactly uniform and so ADDIS cannot take advantage of conservative nulls. Hence the test levels of ADDIS are similar to or slightly lower than those for SAFFRON and monotone AI.
Finally, we see that SAFFRON has similar test levels to uncorrected testing, and SAFFRON, ADDIS and monotone AI can even have test levels above the nominal α.
Figure 4 compares the statistical power of LORD++, SAFFRON, ADDIS and monotone AI with that of uncorrected testing and alpha-spending, as π_1 varies from 0.01 to 0.9. Here, we define power as the expected proportion of non-null hypotheses that are rejected,

$$\text{power} = \mathbb{E}\left[\frac{\sum_{t \in \mathcal{H}_1} R_t}{|\mathcal{H}_1|}\right],$$

where $\mathcal{H}_1$ denotes the index set of the non-null hypotheses and $R_t = 1$ if H_t is rejected (and 0 otherwise). We also include the standard Benjamini-Hochberg (BH) procedure as an additional comparison. We stress that BH is an offline procedure and so could not be used for online testing in practice. In our simulation, we
Starting with alpha-spending, as expected the power is very low (< 0.2) for all π_1. LORD++ has substantial power gains compared with alpha-spending (as long as π_1 is not close to zero) and this advantage increases with π_1. However, LORD++ has substantially lower power than BH for all values of π_1. As expected, SAFFRON performs better as the fraction of non-nulls π_1 increases, with a higher power than LORD++ for π_1 > 0.05, BH for π_1 > 0.5 and even uncorrected testing for π_1 > 0.7. Since F_0 ∼ N(−0.5, 0.1), almost all the means for the null hypotheses will be negative, i.e. we are in a setting with conservative nulls. Hence, as expected, ADDIS outperforms SAFFRON in terms of power (except for very high values of π_1). ADDIS also has a higher power than BH for π_1 > 0.2 and uncorrected testing for π_1 > 0.6. Finally, in this setting, SAFFRON performs very similarly to the monotone AI algorithm in terms of power.
In Appendix C (Figure 6) we show the corresponding FDR for all of the algorithms considered. We see that uncorrected testing can have substantial inflation of the FDR, with the FDR inflated above the nominal α = 0.05 level for π_1 < 0.3. The FDR reaches as high as 0.65 when π_1 = 0.01. All other algorithms control the FDR below the nominal 0.05 level, as expected.
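The two performance measures reported in this section can be computed for a single simulated run as in the sketch below; averaging the returned values over many independent runs gives Monte Carlo estimates of the FDR and the power. The function name and the toy inputs are hypothetical, and the convention FDP = 0 when nothing is rejected is the usual one.

```python
def fdp_and_power(rejected, is_non_null):
    """Empirical false discovery proportion and power for one run.

    rejected[t] is True if hypothesis t was rejected; is_non_null[t] is True
    if hypothesis t is truly non-null. The FDP is defined as 0 when there
    are no rejections.
    """
    n_rejected = sum(rejected)
    false_rejections = sum(r and not h for r, h in zip(rejected, is_non_null))
    true_rejections = sum(r and h for r, h in zip(rejected, is_non_null))
    n_non_null = sum(is_non_null)
    fdp = false_rejections / max(n_rejected, 1)
    power = true_rejections / n_non_null if n_non_null else float("nan")
    return fdp, power


# Toy example: three rejections, one of which is a true null.
rejected = [True, False, True, True, False]
is_non_null = [True, True, False, True, False]
fdp, power = fdp_and_power(rejected, is_non_null)
print(f"FDP = {fdp:.3f}, power = {power:.3f}")
```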

Observations from other simulations
Here, we summarise a few take-home messages for LORD, SAFFRON and ADDIS from simulation results already found in the literature. Javanmard and Montanari [2018] investigated the effect of the ordering of the hypotheses for online testing rules, including LORD. In some applications, hypotheses can be ordered using side information, such that those that are most likely to be rejected come first. With this favourable ordering, the statistical power of LORD can substantially increase as long as π_1 is not too large (since ordering is less relevant in that case). Similar findings for LORD++, SAFFRON and ADDIS in the context of platform trials can be found in Robertson et al. [2023], which also looked at the adversarial setting where hypotheses happen to be ordered so that those most likely to be rejected come last, resulting in lower power. Ramdas et al. [2018] considered the impact on LORD++ and SAFFRON of choosing sequences of the form γ_t ∝ t^(−s), where the parameter s > 1 controls the 'aggressiveness' of the procedure (since the larger the value of s, the more the alpha-wealth is concentrated at small values of t). For Gaussian alternatives, the simulation results suggested that less aggressive sequences are to be preferred in terms of increased power for SAFFRON and LORD++.
Meanwhile, Tian and Ramdas [2019] showed that ADDIS can match the power of SAFFRON when the nulls are not conservative (i.e. uniform nulls). The power advantage of ADDIS over LORD++ and SAFFRON increases the more conservative the nulls are, i.e. the more negative the means for the null hypotheses are (in the Gaussian setting).
The theory presented in Section 2 for provable FDR control requires null p-values to be independent of one another (with a weaker condition sufficing for mFDR control). Robertson et al. [2023] explored the performance of online testing rules, including LORD++, SAFFRON and ADDIS, in the setting of platform trials with a common control, which induces positive correlations between the p-values for testing concurrent arms. There was no evidence of FDR inflation for these algorithms under a range of assumed treatment effects and overlap of control data. Robertson and Wason [2018] considered the setting where the test statistics are assumed to come from a multivariate normal distribution where the covariance matrix has ones along the diagonal and off-diagonal entries equal to ±0.5. There was no evidence of FDR inflation when using LORD++ under a range of non-null distributions. However, with a two-sided test under a Gaussian alternative, the SAFFRON procedure had an inflated FDR for smaller values of π_1. This inflation persisted and even increased as T increased from 100 to 1000. For further discussion on handling dependent p-values, we refer the reader to the end of Section 5.2.
The simulation studies in Robertson and Wason [2018] also highlighted the value of using 'bounded' versions of online testing algorithms. This requires setting an a priori upper bound M on the number of hypotheses to be tested, so that γ_t = 0 for t > M, which allows setting γ_t = 1/M for t ≤ M (for example). The bounded versions have a uniformly higher power than the versions presented in Section 2 (with the default choices of γ_t given in Section 3.2), which assume no upper bound on T, and empirically a substantial gain can be observed for small T (i.e. T < 100). Finally, another general observation is that the power advantages of online testing algorithms compared with alpha-spending increase as T increases. Indeed, when T is small and π_1 is low, online testing algorithms may no longer be competitive in terms of power.
We return to this issue in Section 5.2.
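As a small illustration of the 'bounded' idea, the sketch below builds the uniform sequence γ_t = 1/M for t ≤ M and the resulting bounded alpha-spending levels, which coincide with a Bonferroni correction at level α/M. The function names and the values of α and M are chosen for illustration only; bounded versions of LORD++, SAFFRON and ADDIS combine such a sequence with their own wealth updates, which are not reproduced here.

```python
def bounded_gamma(M):
    """Uniform sequence gamma_t = 1/M for t = 1, ..., M (zero afterwards)."""
    if M < 1:
        raise ValueError("M must be a positive integer")
    return [1.0 / M] * M


def bounded_alpha_spending(alpha, M):
    """Bounded alpha-spending levels alpha * gamma_t, i.e. Bonferroni at alpha/M."""
    return [alpha * g for g in bounded_gamma(M)]


levels = bounded_alpha_spending(alpha=0.05, M=20)
print(f"each of the {len(levels)} tests is run at level {levels[0]:.4f}")
```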

Case studies

4.1 IMPC dataset
Our first case study uses high-throughput phenotypic data from the International Mouse Phenotyping Consortium (IMPC) data repository, which aims to generate and phenotypically characterize knockout mutant strains for every protein-coding gene in the mouse [Koscielny et al., 2013]. The IMPC database is an example of a growing dataset mentioned in Section 1.1, since the family of hypotheses is constantly growing as new knockout mouse lines are generated and phenotyping data are uploaded to the data repository.
We focus on the analysis of IMPC data performed by Karp et al. [2017], who looked at the influence of sex in mammalian phenotypic traits in both wildtype and mutants. As part of their analysis, Karp et al. analysed the role of sex as a modifier of the genotype effect (for continuous traits) using a two-stage pipeline. Stage 1 tested the role of genotype using a likelihood ratio test comparing models (a) and (c). Similarly, stage 2 tested the role of sex using a likelihood ratio test comparing models (a) and (b).
The above procedure resulted in two sets of N = 172 328 distinct p-values, ordered by the date of the corresponding genomic assay. Note that these p-values will not be independent, due to positive and negative associations between different genes (caused for instance by linkage disequilibrium). In addition, multiple variables are being measured for the same gene, and these can be aspects of the same phenotype or be biologically correlated.
Table 1 below shows the number of traits that had a statistically significant genotype effect or were classed as having a statistically significant sexual dimorphism (SD) using LORD++, SAFFRON, ADDIS and monotone AI. As a comparison, we include the results from alpha-spending, BH and uncorrected testing. The 'Fixed Threshold' procedure is the fixed significance threshold of 0.0001 used in practice for the IMPC pipeline. Starting first with the results for the genotype data, the online testing algorithms make two to three times as many rejections as fixed testing. ADDIS and SAFFRON in turn make substantially more rejections than LORD++, with an increase of almost 50% and 70%, respectively.

ADDIS makes a similar number of rejections to BH, but SAFFRON makes noticeably more rejections than both BH and ADDIS for these data. For the SD data, again the online testing algorithms make substantially more rejections than fixed testing, but the relative increase is much smaller. ADDIS and SAFFRON make a very similar number of rejections, about 50% more than the number of rejections for LORD++. Finally, for these data we see that monotone AI makes substantially fewer rejections than either ADDIS or SAFFRON.

Platform trial: STAMPEDE
Our second case study is the ongoing STAMPEDE (Systemic Therapy in Advancing or Metastatic Prostate Cancer) platform trial, which evaluates the effect of systemic therapies for prostate cancer on overall survival [James et al., 2008]. The trial started with 5 experimental treatment arms (B–F), and compared these with the control arm A, which was standard-of-care (SOC) hormone therapy. Figure 5 shows a schematic of the treatment comparisons that have already been reported from STAMPEDE. Two additional experimental arms (G and H) were added to the trial in 2011 and 2013, respectively.
Table 2 shows the reported p-values (unadjusted for multiplicity) when comparing the experimental arms with the control (arm A), as given in James et al. [2016, 2017], Mason et al. [2017], Parker et al. [2018]. The dashed lines denote the four 'batches' present in the trial, where a batch corresponds to multiple hypotheses being available to be tested at the same time, as reflected in Figure 5. Following Robertson et al. [2023], we apply the online testing algorithms to these observed p-values, keeping the alphabetical ordering of p-values within the batches. We set the upper bound on the number of treatments M = 20 (i.e. twice as many arms as have already entered the STAMPEDE trial as of the end of 2021), and use the bounded versions of alpha-spending (i.e. a Bonferroni correction at level α/M), LORD++, SAFFRON and ADDIS. Table 3 shows which of the hypotheses corresponding to each experimental arm can be rejected at level α = 0.05, as well as the current significance level α_8 that would be used to test the next experimental treatment after the 7 already evaluated in the trial.
Table 3: Hypotheses rejected and current significance level α_8 of different algorithms using the results of the STAMPEDE trial, with the ordering as in Table 2.
Uncorrected testing rejects the hypotheses corresponding to three experimental arms (C, E, G), and has by far the highest value of α_8. Both SAFFRON and the BH procedure reject hypotheses C and G, and SAFFRON has a substantially higher value of α_8 than the other online testing algorithms. ADDIS and alpha-spending only reject hypothesis G, and have similar α_8. Finally, LORD++ does not reject any hypotheses and its value of α_8 is substantially lower than that of any of the other algorithms. For further discussion and results, see Robertson et al. [2023].

5 Extensions and future directions

5.1 Further extensions

5.1.1 Prior weights, penalty weights and decaying memory
Ramdas et al. [2017] proposed a number of extensions that apply to the class of GAI++ algorithms, including LORD++. Firstly, they showed how to incorporate certain types of prior information about the different hypotheses as expressed through prior weights w_t and penalty weights u_t. Prior weights allow the experimenter to exploit domain knowledge about which hypotheses are more likely to be non-null. By assigning a higher prior weight w_t > 1 to a hypothesis, the algorithm will have a higher chance of rejecting H_t. Meanwhile, penalty weights express the different importance attached to the hypotheses being tested, with u_t > 1 indicating a more impactful or important test. Importantly, both w_t and u_t are allowed to depend on past rejections in this framework. Ramdas et al. [2017] proposed doubly-weighted GAI++ rules that provably control the penalty-weighted FDR when using both prior and penalty weights under independence. Recently, Chen and Kasiviswanathan [2020] showed how to exploit contextual information associated with each hypothesis to re-weight the testing levels in an online manner, leading to increased power while controlling the FDR.
The second proposal of Ramdas et al. [2017] dealt with problems of 'piggybacking' and 'alpha-death'. Piggybacking happens when a substantial number of rejections are made so that the online testing algorithms earn and accumulate enough alpha-wealth to reject later hypotheses at much less stringent thresholds (hence the later tests 'piggyback' on the success of earlier tests). This can lead to a spike in the FDR locally in time, even though the FDR over all time is controlled. Meanwhile, alpha-death occurs when there is a long stretch of null hypotheses, so that online testing algorithms make (almost) no rejections and lose nearly all of their alpha-wealth. Subsequently, the algorithm may have essentially no power, unless a non-null hypothesis with an extremely strong signal (small p-value) is observed. Ramdas et al. [2017] proposed the decaying memory FDR (mem-FDR), which pays more attention to recent discoveries through a user-defined discount factor δ ∈ (0, 1] and thus smoothly forgets the past. They then proposed GAI++ rules that control the mem-FDR (which can also include penalty weights) under independence. In addition, they showed how to allow the algorithm to abstain from testing in order to recover from alpha-death.
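To give a feel for how a discount factor 'forgets' the past, the sketch below computes a decayed false discovery proportion in which a rejection made k steps ago is weighted by δ^k. This is one plausible reading of the decaying memory idea, written without penalty weights; the function name and toy data are hypothetical, and Ramdas et al. [2017] should be consulted for the exact definition of the mem-FDR, which is the expectation of such a ratio.

```python
def decayed_fdp(rejected, is_null, delta):
    """Decayed false discovery proportion with discount factor delta in (0, 1].

    Both the false-rejection count and the total rejection count are
    discounted by delta at every step, so older rejections carry less
    weight. With delta = 1 this reduces to the ordinary FDP.
    """
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    decayed_false = 0.0
    decayed_total = 0.0
    for rej, null in zip(rejected, is_null):
        decayed_false = delta * decayed_false + (1.0 if rej and null else 0.0)
        decayed_total = delta * decayed_total + (1.0 if rej else 0.0)
    return decayed_false / decayed_total if decayed_total > 0 else 0.0


# Toy stream: an early false rejection followed by later true rejections.
rejected = [True, True, False, True]
is_null = [True, False, False, False]
print(decayed_fdp(rejected, is_null, delta=1.0))  # ordinary FDP
print(decayed_fdp(rejected, is_null, delta=0.5))  # early mistake is down-weighted
```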

Local dependence -asynchronous and batched testing
In most of the literature mentioned so far, an implicit assumption is that each hypothesis test can only start when the previous test has finished (the synchronous setting, where synchronous refers to synchronising the start and end time of hypothesis tests). In reality, experimentation is "doubly sequential" as in Figure 1, where it is common to have hypothesis tests that overlap in time, and where each test may itself be run sequentially (the asynchronous setting). One natural adjustment for asynchronous testing is to use an online FDR algorithm whenever each test finishes (that is, whichever test is the t-th one to finish, test it at level α_t). However, this would only assign α_t at the end of a hypothesis test, which would not be appropriate for sequential hypothesis testing and multi-arm bandit approaches that typically require specification of the target type I error level in advance because it is an important component of their stopping rule.
Hence, the testing levels must be specified at the start of a hypothesis test. The asynchronous setting also means that potentially arbitrary dependence between some p-values must be considered. Indeed, hypothesis tests that are being conducted concurrently are often likely to be dependent, since they may use the same or highly correlated data during their overlap.
To address these challenges of asynchronous testing, Zrnic et al. [2021] derived asynchronous versions of LORD++ and SAFFRON that output test levels α_t dynamically at the beginning of the t-th test, such that, despite arbitrary local dependence and regardless of the decision times for each hypothesis, the mFDR is controlled at level α. These procedures achieve this goal both at all fixed times t, as well as at certain adaptively chosen stopping times. Tian and Ramdas [2019] also showed how to derive asynchronous versions of ADDIS. Note that in order to account for the uncertainty about the tests in progress, the test levels assigned by asynchronous online procedures will often be more conservative. Thus, there is a trade-off in that although asynchronous procedures take less time to perform a given number of tests, they can be less powerful than their synchronous counterparts. Zrnic et al. [2020] considered the related setting of online batched testing, where a potentially infinite number of batches of hypotheses are tested over time (see Section 4.2 for an example).
To this end, they introduced online, FDR-preserving versions of the most widely used offline algorithms, namely the BH procedure and Storey's improvement of the BH method [Storey, 2002]. These online "mini-batched" testing algorithms interpolate between online and offline methodology, thus trading off the best of both worlds. When there is only one batch, the algorithms recover the BH (or Storey-BH) procedure. On the other hand, when all batches are of size one, the algorithms recover the LORD++ (or SAFFRON) procedure. These algorithms control the FDR under independence (an algorithm valid under positive dependence was also derived), and have a higher power than the fully online testing algorithms. Further, since they consist of compositions of offline FDR algorithms, they imply FDR control over each constituent batch, and not just over the whole sequence of tests.
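For reference, the offline building block that the batched algorithms reduce to when there is a single batch is the standard BH step-up procedure, sketched below in Python. This is the classical offline procedure only, not the online batched algorithm of Zrnic et al. [2020] (which additionally sets the level for each batch adaptively); the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Standard (offline) Benjamini-Hochberg step-up procedure.

    Finds the largest rank k such that the k-th smallest p-value is at most
    alpha * k / n, and rejects the k hypotheses with the smallest p-values.
    Returns a list of booleans in the original order.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / n:
            k_max = rank
    rejected = [False] * n
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# One batch of five hypothetical p-values tested at level 0.05.
batch = [0.001, 0.045, 0.012, 0.5, 0.025]
print(benjamini_hochberg(batch, alpha=0.05))
```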

A Bayesian approach
Gang et al. [2021] developed a new class of structure-adaptive sequential testing (SAST) rules for online FDR control, which instead of being based on p-values, are based upon estimates of the conditional local FDR (Clfdr; Cai and Sun [2009]), which can optimally adapt to important
utilise its wealth and increase its power as a result. Finally, the authors show that supLORD also controls the mFDR and FDR at both fixed times and stopping times. Hence, supLORD provides the first guarantee for online FDR control at stopping times (LORD++, SAFFRON and ADDIS only control the mFDR at stopping times and not the FDR).

Retesting of hypotheses
One feature of online testing algorithms that has not been explicitly pointed out in the literature is the option of retesting hypotheses (i.e. using the same p-value again later in the testing sequence, when the alpha-wealth may be higher). Crucially however, the choice of whether to retest must be made without using knowledge of the p-value itself, but only that it was not rejected (e.g. that it is greater than 0.01). In this example, under the null the p-value will still be conditionally uniform in [0.01, 1]. Since this implies that the assumption of conditional superuniformity under the null still holds, the same p-value can be used for retesting. In practice, retesting could happen within an automated testing setting for example, perhaps with additional prior information. Meanwhile, Fisher [2022] proposed a framework for online testing where each hypothesis requires an immediate preliminary decision, which allows the analyst to update that decision until a preset deadline while controlling the FDR.
ADDIS-spending, and showed that the power gains can be substantial when the discreteness is high (e.g. the counts in the contingency table are moderate).

Incorporating experimental costs
Cook et al. [2022] considered the setting of online multiple hypothesis testing where the cost of data collection (i.e. the cost of conducting an experiment) is not negligible. They proposed an extension of the GAI framework to take into account the cost of data collection, the choice of sample size for each experiment, as well as prior beliefs about the probability of rejection. The proposed methods ensure control of the mFDR and perform particularly well in settings where the aim is to make the most of a limited budget of tests to achieve the highest possible power.

5.1.9 Post-hoc FDP bounds
Katsevich and Ramdas [2020] proposed a class of simultaneous FDP bounds that apply to a variety of settings, including online testing. These bounds are finite-sample and have a simple closed form. The results can be used as a diagnostic tool for FDR procedures: after running an FDR procedure, one can obtain a valid 1 − α confidence bound on the FDP of the resulting rejection set. Since the guarantees are post hoc, they apply to any sequence of rejections produced by any online algorithm, whether or not it was designed for FDR or FDP control.

Online control of the False Coverage Rate
Finally, Weinstein and Ramdas [2020] considered the problem of constructing confidence intervals (CIs) that are valid for online hypothesis testing. In particular, they focus on control of the false coverage rate (FCR), which is the expected ratio of the number of constructed CIs that fail to cover their respective parameters to the total number of constructed CIs. In the online hypothesis testing framework they considered, at each step the investigator observes independent data that are informative about the parameter of interest θ_t, and must immediately decide whether or not to report a CI for θ_t. If a CI is reported for θ_t, then the aim is to ensure that the reported CIs have FCR ≤ α at all times T. For further details of the proposed algorithms and their theoretical guarantees, see Weinstein and Ramdas [2020].

Current shortcomings and future directions
Online testing for small numbers of hypotheses
Online testing algorithms are most powerful in settings where there are a large number (i.e. T > 1000) of hypotheses that will eventually be tested. Thus the biggest advantage will likely be in settings such as A/B testing in large tech companies or in large-scale biological data repositories. However, while platform trials provide a framework for a trial to continue indefinitely in theory, in practice they will typically evaluate a maximum number of interventions in the low tens. Hence, there is a need for investigation of optimal online testing procedures when the maximum number of hypotheses to be tested is relatively low and when the correlation between hypotheses is known (e.g. because of a shared control arm). Separately, there is scope to further improve the power of online testing algorithms when combined with sequential testing of the individual hypotheses, for example by exploiting the fact that pre-specified group-sequential stopping boundaries may be used in a platform trial setting (see Zehetmayer et al. [2021] for a recent proposal along these lines).

Managing incentives across sponsors or products
There are some additional challenges in using online control methods in platform trials or in the IT industry. If different sponsors (e.g. pharmaceutical companies) are supporting a platform trial, they might be reluctant to have their intervention tested at a notably more stringent level than those of other sponsors: the most powerful overall procedure may not be acceptable to individual sponsors, and this tension may be difficult to reconcile. Similarly, if a large IT company requires that experiments run across various products must all be subjected to oversight in the form of a common online FDR controlling procedure that acts across products, then it may be hard to convince individual product teams that their tests must be subject to a level determined by the results of experiments by other groups.

Optimal choices of parameters for online algorithms
As seen in Section 2, LORD++, SAFFRON and ADDIS depend on the choice of the initial wealth w_0 as well as the sequence {γ_t}. Further work could look at optimal choices of these parameters, given assumptions about the distribution of non-null p-values. Exploring data-adaptive choices of time-varying sequences {λ_t}_{t=1}^∞ (for SAFFRON and ADDIS) and {η_t}_{t=1}^∞ (for ADDIS) with a provable power increase would be another fruitful area of research. Future work could also look at optimal choices for the parameters for the other algorithms in Section 5.1.

Online batched testing
A number of open questions remain regarding the proposals of Zrnic et al. [2020] for online batched testing. First, the framework could be extended to allow for asynchronous online batch testing, using the ideas of Zrnic et al. [2021]. Second, it should be possible to derive online batched versions based on the offline counterpart of ADDIS, which would gain power in the presence of conservative nulls. Third, an open question is how to determine the trade-off between the chosen batch size and power in online batched testing.

Online error rate control under dependence
One major shortcoming of almost all of the proposed online testing algorithms is their reliance on the assumption of independence of the null p-values for provable FDR control, which is unlikely to always be the case in real data applications. However, in terms of online FDR control under dependence, there have only been limited proposals in the literature. Zrnic et al. [2021] showed that the LOND algorithm by Javanmard and Montanari [2015] controls the FDR under positive dependence. In the online setting, arbitrary dependence of p-values across all time is a rather pessimistic and unrealistic assumption, and thus in the asynchronous setting, Zrnic et al. [2021] introduced the concept of arbitrary local dependence and showed that online algorithms can be modified to control the mFDR even with such dependence. See also Fisher [2021], who showed further results for control of the FDR under positive dependence in the minibatch setting. Finally, Zrnic et al. [2020] showed how to control the FDR in the online batched setting under positive dependence. Future work could explore how to construct more powerful online algorithms under different forms of dependence, including when the correlation structure is known or estimated.

Summary and practical guidance
Table 4 gives a summary of the leading online testing methods discussed in this paper, comparing their assumptions as well as general pros and cons. In terms of practical guidance, we offer the following general suggestions:
• A fundamental consideration is which type I error rate is most suitable to control given the experimental context and goals. As alluded to in Section 1.2, this choice may be driven by the anticipated number of hypotheses to be tested, data dependencies and/or regulatory concerns.
• Given the type I error rate that the user wishes to control, there may be a variety of online testing algorithms to choose from. A key consideration is the assumptions around the p-value dependencies, as shown in Table 4. Algorithms that make stronger assumptions (i.e., assuming independence) will be more powerful, but this can come at the cost of inflated type I error rates if these assumptions do not hold. In practice, it may be difficult to anticipate or estimate the data dependencies in an experiment. In some settings, such as a platform trial with a common control, the correlation structure can be derived analytically. Otherwise, with enough data one can try to estimate the correlation empirically.
• The planned timing of hypothesis tests combined with the use of sequential testing may motivate the use of asynchronous or batched versions of online testing algorithms, as discussed in Section 5.1.This links with the issue of the ordering of the hypothesis tests themselves: in some settings the ordering will be out of the analyst's control, while in others it may be possible to use prior information about the probability of rejection to potentially gain power either implicitly by ordering the hypotheses (so that those that are a-priori more likely to be rejected are tested first) or by using prior weights (see Section 5.1).In the batch setting (i.e., where the multiple hypotheses are available to be tested simultaneously) then the batched algorithms presented in Section 5 are recommended to achieve the best of both worlds of offline and online testing.
• As mentioned above, the setting of a small number of hypotheses (< 1000) is a challenging one for online testing.Thus the biggest advantage in terms of power will be seen in settings with large-scale hypothesis testing, such as A/B testing.If at some point it becomes known that the number of hypothesis tests will be bounded by a finite number M then it would make sense to maximise power by ensuring that the alpha-wealth is completely used up by the end of the M -th hypothesis test.
• In general, simulation studies remain valuable for assessing the performance of an online testing algorithm given the experimental context and goals, particularly for evaluating power, as well as type I error rate considerations under departures from independence.
We note that in terms of computational scalability, the algorithms presented in this paper all scale linearly with the number of hypotheses tested.
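To illustrate the suggestion above about using up the available error budget when the number of tests is bounded by a known M, the following minimal sketch uses alpha-spending, the simplest case: the test levels α_t = α γ_t are chosen so that they sum to exactly α over the M tests. The function name `finite_horizon_levels` and the power-law weights t^(-1.6) are assumptions for illustration, not a recommendation from this paper; any non-negative weights could be renormalised in the same way.

```python
import numpy as np

def finite_horizon_levels(M, alpha=0.05, exponent=1.6):
    """Alpha-spending levels that sum to exactly alpha over M tests."""
    t = np.arange(1, M + 1, dtype=float)
    weights = t ** (-exponent)        # assumed decay; front-loads the budget
    gamma = weights / weights.sum()   # renormalise so that sum_{t<=M} gamma_t = 1
    return alpha * gamma              # alpha_t = alpha * gamma_t

levels = finite_horizon_levels(M=100)
# Hypothesis t is rejected if its p-value is at most levels[t - 1].
```

Without the renormalisation step, an infinite sequence truncated at M would leave part of the budget unspent, which is the loss of power that the guidance above aims to avoid.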

Discussion
Online error rate control methodology provides a powerful and flexible framework for large-scale hypothesis testing that takes into account the temporal nature of modern data analysis. Over the past 15 years since this framework was first proposed, there have been many proposed improvements and extensions, which better reflect the nature of real-world data and expand the scope of potential applications. In particular, continuous progress has been made towards increasing the statistical power of online testing algorithms, so that they can match (and in some cases even exceed) the power of traditional offline algorithms. The issue of accounting for dependent p-values remains open, although progress has been made here too.
As the methodology becomes increasingly mature, the next natural step is to see application of online testing algorithms in practice. To this end, and as seen in Section 4, there have been a number of papers that are specifically focused on application examples, including in the context of growing data repositories [Robertson et al., 2019], anomaly detection in time series [Rebjock et al., 2021], platform trials [Robertson et al., 2023] and RNAseq data [Liou et al., 2023]. Further work may be required to explore and solve practical challenges that may arise in different application settings. Finally, the provision of software and training will also be key to promoting the use of online error rate control in practice. The onlineFDR package we described earlier is a key step in that regard, but software tuned to specific applications may also be desirable.
and $S_t = \sum_{i<t} 1\{P_i \le \eta\}$, $\tau_j^* = \sum_{i \le \tau_j} 1\{P_i \le \eta\}$. See Tian and Ramdas [2019] for an alternative formulation of ADDIS where p-values greater than $\eta$ are explicitly discarded, and the extension to a sequence $\{\eta_t\}_{t=1}^{\infty}$.
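The two counters in the preceding line can be computed from a cumulative sum. The sketch below assumes that S_t counts the p-values before time t that are at most η, and that τ*_j counts the p-values up to and including the j-th rejection time τ_j that are at most η. The function name `addis_counters`, the 1-indexed rejection times and the default η = 0.5 are illustrative assumptions; the sketch computes only these counters, not the full ADDIS test levels.

```python
import numpy as np

def addis_counters(pvals, rejection_times, eta=0.5):
    """Return (S, tau_star), where S[t-1] = S_t and tau_star[j-1] = tau*_j."""
    pvals = np.asarray(pvals, dtype=float)
    kept = (pvals <= eta).astype(int)               # 1{P_i <= eta}
    # cum[k] = sum_{i <= k} 1{P_i <= eta}, with cum[0] = 0
    cum = np.concatenate(([0], np.cumsum(kept)))
    S = cum[:-1]                                    # S_t = sum_{i < t} 1{P_i <= eta}
    tau = np.asarray(rejection_times, dtype=int)    # 1-indexed rejection times
    tau_star = cum[tau]                             # sum_{i <= tau_j} 1{P_i <= eta}
    return S, tau_star

# Hypothetical inputs, for illustration only.
S, tau_star = addis_counters([0.01, 0.8, 0.3, 0.6, 0.02], rejection_times=[1, 5])
```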

C Simulation study
Figure 6 shows the FDR of LORD++, SAFFRON, ADDIS and monotone AI compared with uncorrected testing, the BH procedure and alpha-spending, using the simulation set-up described in Section 3.2.

Figure 1 gives a diagrammatic representation of online multiple testing, where different hypotheses (corresponding to experiments) are tested over time (corresponding to the collection of data samples). As discussed above, each experiment could itself be a sequential experiment or take the form of an MAB.

Figure 1: An abstract online multiple testing framework. As time passes (left to right), new experiments testing different hypotheses are started and stopped, in a possibly indefinite manner. Each horizontal line represents a new experiment/hypothesis, and the length represents the number of samples collected. Decisions about each hypothesis must be made as soon as the corresponding experiment ends.
in Section 4.2 gives a diagram of an example platform trial showing what this looks like.

Figure 4: Power of LORD++, SAFFRON, ADDIS and monotone AI compared with uncorrected testing, the BH procedure and alpha-spending as the proportion of non-nulls $\pi_1$ varies. We set $T = 1000$ and $\alpha = 0.05$. Results are based on $10^4$ simulation replicates.

5.1.7 Discrete test statistics
Döhler et al. [2021] focused on the setting where the null p-values are conservative due to the discreteness of the test statistics, i.e. where the individual tests are based on counts or contingency tables. The authors proposed uniform improvements of LORD++, SAFFRON and

Figure 6: FDR of LORD++, SAFFRON, ADDIS and monotone AI compared with uncorrected testing, the BH procedure and alpha-spending as the proportion of non-nulls $\pi_1$ varies. The solid red horizontal line gives the target level of $\alpha = 0.05$. We set $T = 1000$ and results are based on $10^4$ simulation replicates.
Diagrammatic representation of generalised alpha-investing (GAI), showing how the wealth $W(t)$ at time $t$ changes depending on whether the hypothesis $H_t$ is rejected (i.e. whether the corresponding p-value P
Figure adapted from Xu and Ramdas [2022].
testing, we often use monotone decision rules for $\alpha_t$ considered as a function of ($R_1$
∨ 1, where $H_0$ denotes the set of true null hypotheses.
If we can keep $\mathrm{FDP}^*(t) \le \alpha$ at all times $t$, then (depending on the dependence assumptions on the p-values) we can prove that $\mathrm{mFDR}(t) \le \alpha$ or $\mathrm{FDR}(t) \le \alpha$. This technique has been used to derive the LORD, SAFFRON and ADDIS algorithms (see below), by designing different estimates $\mathrm{FDP}_{\mathrm{LORD}}(t)$, $\mathrm{FDP}_{\mathrm{SAFFRON}}(t)$, $\mathrm{FDP}_{\mathrm{ADDIS}}(t)$.

Table 1: Number of rejections made by online FDR algorithms and various comparators using the IMPC datasets. SD = Sexual Dimorphism.

Table 4: Summary of leading methods for online error rate control, giving dependence assumptions and pros & cons.
+ Provable FDR control for positive dependence (the 'PRDS' assumption)
- Substantially lower power than the algorithms above
Error rate: FDX. Algorithm: supLORD. Assumptions: null p-values are conditionally superuniform.
+ Also controls the mFDR and FDR at both fixed times and stopping times
+ User may choose the number of rejections after which we begin controlling FDX in exchange for more power
- Unclear how robust to departures from conditional superuniformity