Optional Stopping with Bayes Factors: a categorization and extension of folklore results, with an application to invariant situations

It is often claimed that Bayesian methods, in particular Bayes factor methods for hypothesis testing, can deal with optional stopping. We first give an overview, using only the most elementary probability theory, of three different mathematical meanings that various authors give to this claim: stopping rule independence, posterior calibration and (semi-)frequentist robustness to optional stopping. We then prove theorems to the effect that, while their practical implications are sometimes debatable, these claims do indeed hold in a general measure-theoretic setting. The novelty here is that we allow for nonintegrable measures based on improper priors, which leads to particularly strong results for the practically important case of models satisfying a group invariance (such as location or scale). When equipped with the right Haar prior, calibration and semi-frequentist robustness to optional stopping hold uniformly, irrespective of the value of the underlying nuisance parameter, as long as the stopping rule satisfies a certain intuitive property.


Introduction
In recent years, a surprising number of scientific results have failed to hold up to continued scrutiny. Part of this 'replicability crisis' may be caused by practices that ignore the assumptions of traditional (frequentist) statistical methods (John et al., 2012). One of these assumptions is that the experimental protocol should be completely determined upfront. In practice, researchers often adjust the protocol due to unforeseen circumstances, or collect data until a point has been proven. This practice, which is referred to as optional stopping, can cause true hypotheses to be wrongly rejected much more often than these statistical methods promise.
Bayes factor hypothesis testing has long been advocated as an alternative to traditional testing that can resolve several of its problems; in particular, it was claimed early on that Bayesian methods continue to be valid under optional stopping (Lindley, 1957; Edwards et al., 1963). In particular, the latter paper claims that (with Bayesian methods) "it is entirely appropriate to collect data until a point has been proven or disproven, or until the data collector runs out of time, money, or patience". In light of the replicability crisis, such claims have received much renewed interest (Wagenmakers, 2007; Rouder, 2014; Schönbrodt et al., 2017; Yu et al., 2013; Sanborn and Hills, 2013). But what do they mean mathematically? It turns out that different authors mean quite different things by 'Bayesian methods handle optional stopping'; moreover, such claims are often shown to hold only in an informal sense, or in restricted contexts. Thus, the first goal of the present paper is to give a systematic overview and formalization of such claims in a simple, expository setting; the second goal is to extend the reach of such claims to more general settings, for which they have never been formally verified.

Overview and Most Important Result
In Section 2, we give a systematic overview of what we identified to be the three main mathematical senses in which Bayes factor methods can handle optional stopping, which we call τ-independence, calibration, and (semi-)frequentist robustness. We first do this in a setting chosen to be as simple as possible (finite sample spaces and strictly positive probabilities), allowing for straightforward statements and proofs of results.
In Section 3, we extend the statements and results to a much more general setting allowing for a wide range of sample spaces and measures, including measures based on improper priors. These are priors that are not integrable, thus not defining standard probability distributions over parameters, and as such they cause technical complications. Such priors are indispensable within the recently popularized default Bayes factors for common hypothesis tests (Rouder et al., 2009; Jamil et al., 2016). In Section 4, we provide stronger results for the case in which both models satisfy the same group invariance. Most (but not all) default Bayes factor settings concern such situations; prominent examples are Jeffreys' Bayesian one- and two-sample t-tests, going back to Jeffreys (1961), in which the models are location and location-scale families, respectively. Many more examples are given by Berger and various collaborators in a sequence of papers (Berger et al., 1998; Dass and Berger, 2003; Bayarri et al., 2012), who give compelling arguments for using the (typically improper) right Haar prior on the nuisance parameters in such situations; for example, in Jeffreys' one-sample t-test, one puts a right Haar prior on the variance. Haar priors and group invariant models were studied extensively by Eaton (1989), Andersson (1982) and Wijsman (1990), on whose results this paper depends considerably. When nuisance parameters (shared by both $H_0$ and $H_1$) are of the right form and the right Haar prior is used, we can strengthen the results of Section 3: they now hold uniformly for all possible values of the nuisance parameters, rather than in the marginal, 'on average' sense we consider in Section 3. However, and this is our most important insight, we cannot take arbitrary stopping rules if we want to handle optional stopping in this strong sense: the stopping rules have to satisfy a certain intuitive condition, which will hold in many but not all practical cases. A rule such as 'stop as soon as the Bayes factor is $\geq 20$' is allowed, but a rule (in Jeffreys' one-sample t-test) such as 'stop as soon as $\sum_i x_i^2 \geq 20$' is not. The paper ends with an Appendix containing all longer mathematical proofs.
Remarks. Our analysis is restricted to Bayesian testing and model selection using the Bayes factor method; we do not make any claims about other types of Bayesian inference. Some of the results we present were already known, at least in simple settings; we refer in each case to the first appearance in the literature that we are aware of. The main mathematical novelties in the paper are the results on optional stopping in the general case with improper priors and in the group invariance case. The main difficulties here are that (a) for fixed sample sizes, at least with continuous-valued data, the Bayes factor usually has a distribution with full support, i.e. its density is strictly positive on the positive reals, whereas with variable stopping times, the support of its distribution may have 'gaps' at which its density is zero or very nearly zero; and (b), as indicated, we need certain restrictions on the stopping times in order for the results to be valid.
Finally, as an important caveat, we point out that the idea that Bayesian methods can handle optional stopping has also been criticized, for example by Yu et al. (2013), Sanborn and Hills (2013), and also by ourselves (De Heide and Grünwald, 2018). There are two main issues: first, in many practical situations, many Bayesian statisticians use priors that are themselves dependent on parts of the data and/or the sampling plan and stopping time. Examples are the Jeffreys prior with the multinomial model and the Gunel-Dickey default priors for 2x2 contingency tables advocated by Jamil et al. (2016). With such priors, final results evidently depend on the stopping rule employed; none of the results below continue to hold for such priors. The second issue is that all mathematical theorems below are just that, mathematical theorems. For them to have implications for practice, one needs to make additional assumptions which sometimes may not be warranted. In particular, the second sense of 'handling optional stopping', calibration, relies on an analysis that holds under a Bayes marginal distribution, assigning probabilities that are really average (expected) probabilities with expectations taken over a prior. Yet many if not most priors used in practice (such as Cauchy priors over a location parameter) are of a 'default' or 'pragmatic' nature and are not really believed by the statistician, making the practical meaning of such an expectation questionable; De Heide and Grünwald (2018) discuss the issue at length.

The Simple Case
Consider a finite set $\mathcal{X}$ and a sample space $\Omega := \mathcal{X}^T$, where $T$ is some very large (but in this section, still finite) integer. One observes a sample $x^\tau \equiv (x_1, \ldots, x_\tau)$, which is an initial segment of $(x_1, \ldots, x_T) \in \mathcal{X}^T$. In the simplest case, $\tau = n$ is a sample size that is fixed in advance; but, more generally, $\tau$ is a stopping time defined by some stopping rule (which may or may not be known to the data analyst), defined formally below.
We consider a hypothesis testing scenario where we wish to distinguish between a null hypothesis $H_0$ and an alternative hypothesis $H_1$. Both $H_0$ and $H_1$ are sets of distributions on $\Omega$, and they are each represented by unique probability distributions $\bar P_0$ and $\bar P_1$ respectively. Usually, these are taken to be Bayesian marginal distributions, defined as follows. First one writes, for both $k \in \{0, 1\}$, $H_k = \{P_{\theta|k} : \theta \in \Theta_k\}$ with 'parameter spaces' $\Theta_k$; one then defines or assumes some prior probability distributions $\pi_0$ and $\pi_1$ on $\Theta_0$ and $\Theta_1$, respectively. The Bayesian marginal probability distributions are then the corresponding marginal distributions, i.e. for any set $A \subset \Omega$ they satisfy:

$$\bar P_k(A) = \int_{\Theta_k} P_{\theta|k}(A) \, d\pi_k(\theta), \qquad k \in \{0, 1\}. \tag{1}$$

For now we also further assume that for every $n \leq T$ and every $x^n \in \mathcal{X}^n$, $\bar P_0(X^n = x^n) > 0$ and $\bar P_1(X^n = x^n) > 0$ (full support), where here as below we use random variable notation, $X^n = x^n$ denoting the event $\{x^n\} \subset \Omega$. We note that there exist approaches to testing and model choice, such as testing by nonnegative martingales (Shafer et al., 2011; van der Pas and Grünwald, 2018) and minimum description length (Barron et al., 1998; Grünwald, 2007), in which $\bar P_0$ and $\bar P_1$ may be defined in different (yet related) ways. Several of the results below extend to general $\bar P_0$ and $\bar P_1$; we return to this point at the end of the paper, in Section 5. In all cases, we further assume that we have determined an additional probability mass function $\pi$ on $\{H_0, H_1\}$, indicating the prior probabilities of the hypotheses. The evidence in favour of $H_1$ relative to $H_0$ given data $x^\tau$ is now measured either by the Bayes factor or the posterior odds. We now give the standard definition of these quantities for the case that $\tau = n$, i.e., that the sample size is fixed in advance. First, noting that all conditioning below is on events of strictly positive probability, by Bayes' theorem, we can write for any $A \subset \Omega$:

$$\frac{\pi(H_1 \mid A)}{\pi(H_0 \mid A)} = \frac{\bar P_1(A)}{\bar P_0(A)} \cdot \frac{\pi(H_1)}{\pi(H_0)}, \tag{2}$$

where here as in the remainder of the paper we use the symbol $\pi$ to denote not just prior, but also posterior distributions on $\{H_0, H_1\}$. In the case that we observe $x^n$ for fixed $n$, the event $A$ is of the form $X^n = x^n$. Plugging this into (2), the left-hand side becomes the standard definition of the posterior odds, and the first factor on the right is called the Bayes factor.
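To make the definitions above concrete, here is a minimal numerical sketch (our illustration, not part of the original text) in perhaps the simplest instance: $\mathcal{X} = \{0, 1\}$, a singleton null hypothesis $\theta = 1/2$, and $H_1$ the full Bernoulli model with a uniform prior, for which the Bayes marginal in (1) has a closed form.

```python
from math import comb

# Bernoulli example: H0 is the singleton theta = 1/2; H1 puts a uniform
# prior on theta in [0,1], so the Bayes marginal (1) is a Beta integral.

def marginal_H0(x):
    """P0-bar(X^n = x^n): every length-n sequence has probability 2^-n."""
    return 0.5 ** len(x)

def marginal_H1(x):
    """P1-bar(X^n = x^n) = integral of theta^k (1-theta)^(n-k) dtheta
    = B(k+1, n-k+1) = 1 / ((n+1) * C(n,k)), with k the number of ones."""
    n, k = len(x), sum(x)
    return 1.0 / ((n + 1) * comb(n, k))

def posterior_odds(x, prior_odds=1.0):
    """Posterior odds as in (2): Bayes factor times prior odds."""
    return prior_odds * marginal_H1(x) / marginal_H0(x)

xn = [1, 1, 0, 1, 1, 1]
print(posterior_odds(xn))  # beta(x^n) with equal prior odds: approx. 1.52
```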

First Sense of Handling Optional Stopping: τ -Independence
Now, in reality we do not necessarily observe $X^n = x^n$ for fixed $n$, but rather $X^\tau = x^\tau$, where $\tau$ is a stopping time that may itself depend on (past) data (and that in some cases may in fact be unknown to us). This may be defined in terms of a stopping rule $f: \bigcup_{i=1}^{T} \mathcal{X}^i \to \{\text{stop}, \text{continue}\}$; the stopping time $\tau$ is then defined as the random variable which, for any sample $x_1, \ldots, x_T$, outputs the smallest $n$ such that $f(x_1, \ldots, x_n) = \text{stop}$. For any given stopping time $\tau$, any $1 \leq n \leq T$ and sequence of data $x^n = (x_1, \ldots, x_n)$, we say that $x^n$ is compatible with $\tau$ if it satisfies $X^n = x^n \Rightarrow \tau = n$. We let $\mathcal{X}^\tau \subset \bigcup_{i=1}^{T} \mathcal{X}^i$ be the set of all sequences compatible with $\tau$. Observations take the form $X^\tau = x^\tau$, which is equivalent to the event $X^n = x^n; \tau = n$ for some $n$ and some $x^n \in \mathcal{X}^n$ which of necessity must be compatible with $\tau$. We can thus instantiate (2) to

$$\frac{\pi(H_1 \mid X^n = x^n, \tau = n)}{\pi(H_0 \mid X^n = x^n, \tau = n)} = \frac{\bar P_1(X^n = x^n \mid \tau = n)}{\bar P_0(X^n = x^n \mid \tau = n)} \cdot \frac{\bar P_1(\tau = n)}{\bar P_0(\tau = n)} \cdot \frac{\pi(H_1)}{\pi(H_0)}, \tag{3}$$

where we note that, by definition, $\bar P_j(X^n = x^n \mid \tau = n) = P(X^n = x^n \mid H_j, \tau = n)$. Using Bayes' theorem again, and noting that $\bar P_j(\tau = n \mid X^n = x^n) = 1$ because $x^n$ is compatible with $\tau$, the conditional probability on the right can be further rewritten as:

$$\bar P_j(X^n = x^n \mid \tau = n) = \frac{\bar P_j(\tau = n \mid X^n = x^n) \cdot \bar P_j(X^n = x^n)}{\bar P_j(\tau = n)} = \frac{\bar P_j(X^n = x^n)}{\bar P_j(\tau = n)}. \tag{4}$$

Combining (3) and (4) we get:

$$\frac{\pi(H_1 \mid X^n = x^n, \tau = n)}{\pi(H_0 \mid X^n = x^n, \tau = n)} = \frac{\bar P_1(X^n = x^n)}{\bar P_0(X^n = x^n)} \cdot \frac{\pi(H_1)}{\pi(H_0)} =: \gamma(x^n) = \beta(x^n) \cdot \frac{\pi(H_1)}{\pi(H_0)}, \tag{5}$$

where we introduce the notation $\gamma(x^n)$ for the posterior odds and $\beta(x^n)$ for the Bayes factor based on sample $x^n$, calculated as if $n$ were fixed in advance.
We see that the stopping rule plays no role in the expression on the right. Thus, we have shown that, for any two stopping times $\tau_1$ and $\tau_2$ that are both compatible with some observed $x^n$, the posterior odds one arrives at will be the same irrespective of whether $x^n$ came to be observed because $\tau_1$ was used or because $\tau_2$ was used. We say that the posterior odds do not depend on the stopping time $\tau$ and call this property τ-independence. Incidentally, this also justifies writing the posterior odds as $\gamma(x^n)$, a function of $x^n$ alone, without referring to the stopping time $\tau$.
The fact that the posterior odds given $x^n$ do not depend on the stopping rule is the first (and still relatively weak) sense in which Bayesian methods handle optional stopping; it was perhaps first noted by Lindley (1957); another early source is Edwards et al. (1963). Lindley gave an (informal) proof in the context of specific parametric models; in Section 3.1 we show that the result indeed remains true for general σ-finite $\bar P_0$ and $\bar P_1$. We note once again that the result only holds if the priors $\pi_0$ and $\pi_1$ themselves do not depend on $\tau$, an assumption that is violated in many default Bayesian methods.
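The derivation (3)-(5) can be checked by brute-force enumeration. The sketch below (ours; the stopping rule is an arbitrary choice for illustration) computes the posterior odds by explicitly summing over the event $\{X^n = x^n, \tau = n\}$, and confirms they coincide with the fixed-$n$ odds $\gamma(x^n)$, in the Bernoulli setting of the previous listing.

```python
from math import comb
from itertools import product

# tau-independence by exhaustive enumeration: Omega = {0,1}^T with
# H0: theta = 1/2 and H1: theta ~ Uniform[0,1], equal prior odds.

def m0(x): return 0.5 ** len(x)
def m1(x):
    n, k = len(x), sum(x)
    return 1.0 / ((n + 1) * comb(n, k))

def stop(x):   # example rule: stop when |#ones - #zeros| >= 2, or at n = 6
    return abs(2 * sum(x) - len(x)) >= 2 or len(x) == 6

def tau(seq):  # first n at which the rule fires
    return next(n for n in range(1, len(seq) + 1) if stop(seq[:n]))

T, xn = 6, (1, 1)   # x^n is compatible with tau: the rule first fires at n = 2
n = len(xn)
p = {0: 0.0, 1: 0.0}
for seq in product([0, 1], repeat=T):       # sum P_j(X^n = x^n ; tau = n)
    if seq[:n] == xn and tau(seq) == n:
        p[0] += m0(seq)
        p[1] += m1(seq)

print(p[1] / p[0])        # posterior odds computed through the stopping event
print(m1(xn) / m0(xn))    # gamma(x^n) as if n were fixed: identical (4/3)
```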

Second Sense of Handling Optional Stopping: Calibration
An alternative definition of handling optional stopping was introduced by Rouder (2014). Rouder calls $\gamma(x^n)$ the nominal posterior odds calculated from an obtained sample $x^n$, and defines the observed posterior odds as the posterior odds given the nominal odds. Rouder first notes that, at least if the sample size is fixed in advance to $n$, one expects these odds to be equal. For instance, if an obtained sample yields nominal posterior odds of 3-to-1 in favor of the alternative hypothesis, then it must be 3 times as likely that the sample was generated by the alternative probability measure. In the terminology of De Heide and Grünwald (2018), Bayes is calibrated for a fixed sample size $n$. Rouder then goes on to note that, if $n$ is determined by an arbitrary stopping time $\tau$ (based for example on optional stopping), then the odds will still be equal; in this sense, Bayesian testing is well-behaved in the calibration sense irrespective of the stopping rule/time. Formally, the requirement that the nominal and observed posterior odds be equal leads us to define the calibration hypothesis, which postulates that

$$c = \frac{\pi(H_1 \mid \gamma = c)}{\pi(H_0 \mid \gamma = c)}$$

holds for any $c > 0$ that has non-zero probability. For simplicity, for now we only consider the case with equal prior odds for $H_0$ and $H_1$, so that $\gamma(x^n) = \beta(x^n)$. Then the calibration hypothesis says that, for arbitrary stopping time $\tau$, for every $c$ such that $\beta(x^\tau) = c$ for some $x^\tau \in \mathcal{X}^\tau$, one has

$$\frac{\bar P_1(\beta(X^\tau) = c)}{\bar P_0(\beta(X^\tau) = c)} = c. \tag{6}$$

In the present simple setting, this hypothesis is easily shown to hold, because we can write:

$$\bar P_1(\beta(X^\tau) = c) = \sum_{x^\tau \in \mathcal{X}^\tau : \beta(x^\tau) = c} \bar P_1(x^\tau) = \sum_{x^\tau \in \mathcal{X}^\tau : \beta(x^\tau) = c} c \cdot \bar P_0(x^\tau) = c \cdot \bar P_0(\beta(X^\tau) = c).$$

Rouder noticed that the calibration hypothesis should hold as a mathematical theorem, without giving an explicit proof; he demonstrated it by computer simulation in a simple parametric setting. Deng et al. (2016) gave a proof for a somewhat more extended setting, yet still with proper priors. In Section 3.2 we show that a version of the calibration hypothesis continues to hold for general measures based on improper priors, and in Section 4.4 we extend this further to strong calibration for group invariance settings as discussed below.
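A Rouder-style simulation (our sketch, with arbitrarily chosen stopping thresholds) that checks (6) empirically: hypotheses are drawn with equal prior odds, data are generated accordingly under an optional-stopping rule, and the observed odds among runs whose stopped Bayes factor lands near a value $c$ are compared with $c$ itself.

```python
import numpy as np
from math import comb
rng = np.random.default_rng(1)

# Calibration check: H0: theta = 1/2, H1: theta ~ Uniform[0,1], equal prior
# odds; stop once the Bayes factor leaves (1/3, 3), or at n = 50.

def bayes_factor(k, n):
    return (1.0 / ((n + 1) * comb(n, k))) / 0.5 ** n

records = []
for _ in range(100_000):
    h1 = rng.random() < 0.5                # pi(H1) = 1/2
    theta = rng.random() if h1 else 0.5    # draw from the chosen hypothesis
    k = 0
    for n in range(1, 51):
        k += rng.random() < theta
        b = bayes_factor(k, n)
        if b >= 3 or b <= 1 / 3 or n == 50:
            break
    records.append((b, h1))

# observed posterior odds among runs with stopped Bayes factor near c
for c in (1 / 3, 1.0, 3.0):
    sel = [h for b, h in records if abs(b - c) <= 0.1 * c]
    n1 = sum(sel)
    if 0 < n1 < len(sel):
        print(f"c = {c:.2f}: observed odds {n1 / (len(sel) - n1):.2f}")
```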
We note that this result, too, relies on the priors themselves not depending on the stopping time, an assumption which is violated in several standard default Bayes factor settings. We also note that, if one thinks of one's priors in a default sense (they are practical but not necessarily fully believed), then the practical implications of calibration are limited, as shown experimentally by De Heide and Grünwald (2018). One would really like a stronger form of calibration in which (6) holds under a whole range of distributions in $H_0$ and $H_1$, rather than in terms of $\bar P_0$ and $\bar P_1$, which average over a prior that perhaps does not reflect one's beliefs fully. For the case that $H_0$ and $H_1$ share a nuisance parameter $g$ taking values in some set $G$, one can define this strong calibration hypothesis as stating that, for all $c$ with $\beta(x^\tau) = c$ for some $x^\tau \in \mathcal{X}^\tau$, and all $g \in G$:

$$\frac{\bar P_1(\beta(X^\tau) = c \mid g)}{\bar P_0(\beta(X^\tau) = c \mid g)} = c, \tag{7}$$

where $\beta$ is still defined as above; in particular, when calculating $\beta$ one does not condition on the parameter having the value $g$, but when assessing its likelihood as in (7) one does. De Heide and Grünwald (2018) show that the strong calibration hypothesis certainly does not hold for general parameters, but they also show by simulations that it does hold in the practically important case with group invariance and right Haar priors. In Section 4.4 we show that in such cases, one can indeed prove that a version of (7) holds.

Third Sense of Handling Optional Stopping: (Semi-) Frequentist
In classical, Neyman-Pearson style null hypothesis testing, the main concern is limiting the false positive rate of a hypothesis test. If this false positive rate is bounded above by some α > 0, then a null hypothesis significance test (NHST) is said to have significance level α, and if the significance level is independent of the stopping rule used, we say that the test is robust under frequentist optional stopping.

Definition 1. A function $S: \bigcup_{i=m}^{T} \mathcal{X}^i \to \{0, 1\}$ is said to be a frequentist sequential test with significance level $\alpha$ and minimal sample size $m$ that is robust under optional stopping relative to $H_0$ if for all $P \in H_0$:

$$P(\exists n,\, m \leq n \leq T : S(X^n) = 1) \leq \alpha,$$

i.e. the probability that there is an $n$ at which $S(X^n) = 1$ ('the test rejects $H_0$ when given sample $X^n$') is bounded by $\alpha$.
In our present setting, we can take $m = 0$ (larger $m$ become important in Section 3.3), so $n$ runs from 1 to $T$, and it is easy to show that, for any $0 \leq \alpha \leq 1$, we have

$$\bar P_0(\exists n,\, 1 \leq n \leq T : \beta(X^n) \geq 1/\alpha) \leq \alpha. \tag{8}$$

Proof. For any fixed $\alpha$, let $\tau$ be the smallest $n$ such that $\beta(X^n) \geq 1/\alpha$, with $\tau = T$ if no such $n$ exists. Then $\tau$ is a stopping time, $X^\tau$ is a random variable, and the probability in (8) is equal to the $\bar P_0$-probability that $\beta(X^\tau) \geq 1/\alpha$, which by Markov's inequality (using $E_{\bar P_0}[\beta(X^\tau)] \leq 1$) is bounded by $\alpha$.
It follows that, if $H_0$ is a singleton, then the sequential test $S$ that rejects $H_0$ (outputs $S(X^n) = 1$) whenever $\beta(x^n) \geq 1/\alpha$ is a frequentist sequential test with significance level $\alpha$ that is robust under optional stopping.
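A simulation sketch (ours) of this guarantee in the Bernoulli setting above: sampling under the singleton null and stopping, aggressively, the moment $\beta(x^n) \geq 1/\alpha$, the rejection frequency stays below $\alpha$.

```python
import numpy as np
from math import comb
rng = np.random.default_rng(7)

# Type-I error of "reject H0 as soon as beta(x^n) >= 1/alpha" under the
# singleton null theta = 1/2, with optional stopping up to n = 200.

alpha = 0.05
def bayes_factor(k, n):
    return (1.0 / ((n + 1) * comb(n, k))) / 0.5 ** n

trials, rejections = 20_000, 0
for _ in range(trials):
    k = 0
    for n in range(1, 201):
        k += rng.random() < 0.5
        if bayes_factor(k, n) >= 1 / alpha:
            rejections += 1
            break
print(rejections / trials, "<= alpha =", alpha)   # Markov bound (8) holds
```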
The fact that Bayes factor testing with singleton $H_0$ handles optional stopping in this frequentist way was noted by Edwards et al. (1963) and also emphasized by Good (1991), among many others. If $H_0$ is not a singleton, then (8) still holds, so the Bayes factor still handles optional stopping in a mixed frequentist (Type I-error) and Bayesian (marginalizing over the prior within $H_0$) sense. While, from a frequentist perspective, one may not consider this to be fully satisfactory, as argued by Bayarri et al. (2016), this semi-frequentist sense of optional stopping, when applied to $H_1$ rather than $H_0$, sometimes does correspond to what many frequentists consider acceptable in practice.
Yet, in the practically important group invariance case, we can once again show that, if the nuisance parameters in $H_0$ are of the right form and equipped with the right Haar prior, then the Bayes factor is truly robust to optional stopping in the above frequentist sense, i.e. (8) will hold for all $P \in H_0$ and not just 'on average'. While this is hinted at in several papers (e.g. Bayarri et al. (2016); Dass and Berger (2003)), it seems never to have been proven formally; we provide a proof in Section 4.5.

The General Case
Let $(\Omega, \mathcal{F})$ be a measurable space. Fix some $m \geq 0$ and consider a sequence of functions $X_{m+1}, X_{m+2}, \ldots$ on $\Omega$, so that each $X_n$, $n > m$, takes values in some fixed set ('outcome space') $\mathcal{X}$ with associated σ-algebra $\Sigma$. When working with proper priors we invariably take $m = 0$, and then we define $X^n := (X_1, X_2, \ldots, X_n)$ and we let $\Sigma^{(n)}$ be the $n$-fold product algebra of $\Sigma$. When working with improper priors it turns out to be useful (more explanation further below) to take $m > 0$ and define an initial sample random variable $X^{(m)}$ on $\Omega$, taking values in some set $\mathcal{X}_m \subseteq \mathcal{X}^m$ with associated σ-algebra $\Sigma^{(m)}$. In that case we set, for $n \geq m$, $X^n := (X^{(m)}, X_{m+1}, \ldots, X_n)$, taking values in $\mathcal{X}_n := \mathcal{X}_m \times \mathcal{X}^{n-m}$ with associated product σ-algebra $\Sigma^{(n)}$. In either case, we let $\mathcal{F}_n$ be the σ-algebra (relative to $\Omega$) generated by $(X^n, \Sigma^{(n)})$. Then $(\mathcal{F}_n)_{n = m, m+1, \ldots}$ is a filtration relative to $\mathcal{F}$, and if we equip $(\Omega, \mathcal{F})$ with a distribution $P$, then $X^{(m)}, X_{m+1}, X_{m+2}, \ldots$ becomes a random process adapted to $\mathcal{F}$. A stopping time is now generalized to be a function $\tau: \Omega \to \{m+1, m+2, \ldots\} \cup \{\infty\}$ such that for each $n > m$, the event $\{\tau = n\}$ is $\mathcal{F}_n$-measurable; note that we only consider stopping after $m$ initial outcomes. Again, for a given stopping time $\tau$ and sequence of data $x^n = (x_1, \ldots, x_n)$, we say that $x^n$ is compatible with $\tau$ if it satisfies $X^n = x^n \Rightarrow \tau = n$, i.e. $\{\omega \in \Omega : X^n(\omega) = x^n\} \subset \{\omega \in \Omega : \tau(\omega) = n\}$.
$H_0$ and $H_1$ are now sets of probability distributions on $(\Omega, \mathcal{F})$. Again one writes $H_j = \{P_{\theta|j} : \theta \in \Theta_j\}$, where now the parameter sets $\Theta_j$ (which could themselves be infinite-dimensional) are themselves equipped with suitable σ-algebras $\Sigma(\Theta_j)$.
We will still represent both $H_0$ and $H_1$ by unique measures $\bar P_0$ and $\bar P_1$ respectively, which we now allow to be based on (1) with improper priors $\pi_0$ and $\pi_1$ that may be infinite measures; as a result, $\bar P_0$ and $\bar P_1$ are positive real measures that may themselves be infinite. We also allow $\mathcal{X}$ to be a general (in particular uncountable) set. Both nonintegrability and uncountability cause complications, but these can be overcome if suitable Radon-Nikodym derivatives exist. To ensure this, we fix a reference measure $\rho$ on $(\Omega, \mathcal{F})$, with marginals $\rho^{(n)}$ for $X^n$ (the superscript $(n)$ here and below denotes the marginal distribution of $X^n$), and we will assume that for all $n \geq \max\{m, 1\}$, for all $k \in \{0, 1\}$ and $\theta \in \Theta_k$, the measures $\bar P_k^{(n)}$, $P_{\theta|k}^{(n)}$ and $\rho^{(n)}$ are all mutually absolutely continuous: we can simply take $\rho^{(n)} = \bar P_0^{(n)}$, but in practice it is often possible and convenient to take $\rho$ such that $\rho^{(n)}$ is the Lebesgue measure on $\mathbb{R}^n$, which is why we explicitly introduce $\rho$ here.
The absolute continuity conditions guarantee that all required Radon-Nikodym derivatives exist. Finally, we assume that the conditional measures $\bar P_k^{(n)}(\cdot \mid X^{(m)} = x^m)$ are proper probability measures for all $n > m$ and all $x^m \in \mathcal{X}_m$. This final requirement is the reason why we sometimes need to consider $m > 0$ and nonstandard sample spaces $\mathcal{X}_n$ in the first place: in practice, one usually starts with the standard setting of an $(\Omega, \mathcal{F})$ where $m = 0$ and all $X_i$ have the same status. In all practical situations with improper priors $\pi_0$ and/or $\pi_1$ that we know of, there is a smallest finite $j$ and a set $\mathcal{X}^\bullet \subset \mathcal{X}^j$ that has measure 0 under all probability distributions in $H_0 \cup H_1$, such that, restricted to the sample space $\mathcal{X}^j \setminus \mathcal{X}^\bullet$, the measures $\bar P_1^{(j)}$ and $\bar P_0^{(j)}$ are σ-finite and mutually absolutely continuous, and the posteriors $\pi_k(\Theta_k \mid x^j)$ (as defined in the standard manner in (11) below) are proper probability measures. One then sets $m$ to equal this $j$, and sets $\mathcal{X}_m := \mathcal{X}^m \setminus \mathcal{X}^\bullet$, and the required properness will be guaranteed. Our initial sample $X^{(m)}$ is a variation of what is called (for example, by Bayarri et al. (2012)) a minimal sample. Yet, the sample size of a standard minimal sample is itself a random quantity; by restricting $\mathcal{X}^m$ to $\mathcal{X}_m$, we can take its sample size $m$ to be constant rather than random, which will greatly simplify the treatment of optional stopping with group invariance; see Examples 1 and 2 below.
We henceforth refer to the setting now defined (with $m$ and initial space $\mathcal{X}_m$ satisfying the requirements above) as the general case.
We need an analogue of (5) for this general case. If $\bar P_0$ and $\bar P_1$ are probability measures, then there is still a standard definition of conditional probability distributions $\pi(H \mid \mathcal{A})$ in terms of conditional expectation for any given σ-algebra $\mathcal{A}$; based on this, we can derive the required analogue in two steps. First, we consider the case that $\tau \equiv n$ for some $n > m$; we know in advance that we observe $X^n$ for a fixed $n$: the appropriate $\mathcal{A}$ is then $\mathcal{F}_n$, $\pi(H \mid \mathcal{A})(\omega)$ is determined by $X^n(\omega)$, hence can be written as $\pi(H \mid X^n)$, and a straightforward calculation gives that

$$\frac{\pi(H_1 \mid X^n)}{\pi(H_0 \mid X^n)} = \frac{(d\bar P_1^{(n)}/d\rho^{(n)})(X^n)}{(d\bar P_0^{(n)}/d\rho^{(n)})(X^n)} \cdot \frac{\pi(H_1)}{\pi(H_0)}, \tag{9}$$

where $(d\bar P_1^{(n)}/d\rho^{(n)})$ and $(d\bar P_0^{(n)}/d\rho^{(n)})$ are versions of the Radon-Nikodym derivatives defined relative to $\rho^{(n)}$. The second step is now to follow exactly the same steps as in the derivation of (5), replacing $\beta(X^n)$ by the ratio of Radon-Nikodym derivatives in (9) wherever appropriate (we omit the details). This yields, for any $n$ such that $\rho(\tau = n) > 0$, and for $\rho^{(n)}$-almost every $x^n$ that is compatible with $\tau$:

$$\frac{\pi(H_1 \mid X^n = x^n, \tau = n)}{\pi(H_0 \mid X^n = x^n, \tau = n)} = \frac{(d\bar P_1^{(n)}/d\rho^{(n)})(x^n)}{(d\bar P_0^{(n)}/d\rho^{(n)})(x^n)} \cdot \frac{\pi(H_1)}{\pi(H_0)} =: \gamma(x^n) =: \beta(x^n) \cdot \frac{\pi(H_1)}{\pi(H_0)}, \tag{10}$$

where here as below, for $n \geq m$, we abbreviate $\pi(H_k \mid X^n = x^n)$ to $\pi(H_k \mid x^n)$.
The above expression for the posterior is valid if $\bar P_0$ and $\bar P_1$ are probability measures; we will simply take it as the definition of the Bayes factor for the general case. Again this coincides with standard usage for the improper prior case. In particular, let us define the conditional posteriors and Bayes factors given $X^{(m)} = x^m$ in the standard manner, by the formal application of Bayes' rule: for $k = 0, 1$ and measurable $\Theta'_k \subset \Theta_k$,

$$\pi_k(\Theta'_k \mid x^m) := \frac{\int_{\Theta'_k} (dP_{\theta|k}^{(m)}/d\rho^{(m)})(x^m) \, d\pi_k(\theta)}{\int_{\Theta_k} (dP_{\theta|k}^{(m)}/d\rho^{(m)})(x^m) \, d\pi_k(\theta)}, \tag{11}$$

and, for $\mathcal{F}$-measurable $A$,

$$\bar P_k(A \mid X^{(m)} = x^m) := \int_{\Theta_k} P_{\theta|k}(A \mid X^{(m)} = x^m) \, d\pi_k(\theta \mid x^m), \tag{12}$$

where $P_{\theta|k}(A \mid X^{(m)} = x^m)$ is defined as the value that (a version of) the conditional probability $P_{\theta|k}(A \mid \mathcal{F}_m)$ takes when $X^{(m)} = x^m$, and is thus defined up to a set of $\rho^{(m)}$-measure 0.
With these definitions, it is straightforward to derive the following coherence property, which automatically holds if the priors are proper, and which in combination with (10) expresses that first updating on $x^m$ and then on $x_{m+1}, \ldots, x_n$ has the same result as updating based on the full $x_1, \ldots, x_n$ at once:

$$\gamma(x^n) = \frac{(d\bar P_1^{(n)}(\cdot \mid x^m)/d\rho^{(n)})(x^n)}{(d\bar P_0^{(n)}(\cdot \mid x^m)/d\rho^{(n)})(x^n)} \cdot \frac{\pi(H_1 \mid x^m)}{\pi(H_0 \mid x^m)}. \tag{13}$$

τ -independence, general case
The general version of the claim that the posterior odds do not depend on the specific stopping rule that was used is now immediate, since the expression (10) for the Bayes factor does not depend on the stopping time τ .

Calibration, general case
We will now show that the calibration hypothesis continues to hold in our general setting. From here onward, we make the further reasonable assumption that for every $x^m \in \mathcal{X}_m$, $\bar P_0(\tau = \infty \mid x^m) = \bar P_1(\tau = \infty \mid x^m) = 0$ (the stopping time is almost surely finite), and we define $T_\tau := \{n \in \mathbb{N}_{>0} : \bar P_0(\tau = n) > 0\}$.
To prepare further, let $\{B_j : j \in T_\tau\}$ be any collection of positive random variables such that for each $j \in T_\tau$, $B_j$ is $\mathcal{F}_j$-measurable. We can define the stopped random variable $B_\tau$ as

$$B_\tau := \sum_{j \in T_\tau} \mathbb{1}_{\{\tau = j\}} B_j, \tag{14}$$

where we note that, under this definition, $B_\tau$ is well-defined even if $E_{\bar P_0}[\tau] = \infty$.
For any probability measure $P$ on $(\Omega, \mathcal{F})$, we can define the induced measure of the stopped random variable on the positive real line:

$$P^{[B_\tau]}(A) := P(B_\tau \in A), \qquad A \in \mathcal{B}(\mathbb{R}_{>0}), \tag{15}$$

where $\mathcal{B}(\mathbb{R}_{>0})$ denotes the Borel σ-algebra of $\mathbb{R}_{>0}$; we use this definition both under the null and under the alternative hypothesis. Note that, when we refer to $P^{[B_n]}$, this is identical to $P^{[B_\tau]}$ for the stopping time $\tau$ which on all of $\Omega$ stops at $n$. The following lemma is crucial for passing from fixed-sample-size to stopping-rule based results.
Lemma 1. Let $T_\tau$ and $\{B_n : n \in T_\tau\}$ be as above. Consider two probability measures $P_0$ and $P_1$ on $(\Omega, \mathcal{F})$. Suppose that, for some constant $c > 0$ and all $n \in T_\tau$, the following fixed-sample-size calibration property holds: for $P_0^{[B_n]}(\cdot \mid \tau = n)$-almost every $b$,

$$\frac{dP_1^{[B_n]}(\cdot \mid \tau = n)}{dP_0^{[B_n]}(\cdot \mid \tau = n)}(b) = c \cdot b \cdot \frac{P_0(\tau = n)}{P_1(\tau = n)}. \tag{16}$$

Then we have, for $P_0^{[B_\tau]}$-almost every $b$:

$$\frac{dP_1^{[B_\tau]}}{dP_0^{[B_\tau]}}(b) = c \cdot b. \tag{17}$$

The proof is in Appendix A.
In this subsection we apply this lemma to the measures $\bar P_k(\cdot \mid x^m)$ for arbitrary fixed $x^m \in \mathcal{X}_m$, with their induced measures $\bar P_k^{[\gamma_\tau]}(\cdot \mid x^m)$ for the stopped posterior odds $\gamma_\tau$. Formally, the posterior odds $\gamma_n$ as defined in (10) constitute a random variable for each $n$, and, under our mutual absolute continuity assumption for $\bar P_0$ and $\bar P_1$, $\gamma_n$ can be directly written as

$$\gamma_n = \frac{d\bar P_1^{(n)}}{d\bar P_0^{(n)}}(X^n) \cdot \frac{\pi(H_1)}{\pi(H_0)}.$$

Since, by definition, the measures $\bar P_k(\cdot \mid x^m)$ are probability measures, the Radon-Nikodym derivatives in (16) and (17) are well-defined.

Lemma 2. We have for all $x^m \in \mathcal{X}_m$ and all $n > m$ with $\bar P_0(\tau = n \mid x^m) > 0$: for $\bar P_0^{[\gamma_n]}(\cdot \mid x^m, \tau = n)$-almost every $b$,

$$\frac{d\bar P_1^{[\gamma_n]}(\cdot \mid x^m, \tau = n)}{d\bar P_0^{[\gamma_n]}(\cdot \mid x^m, \tau = n)}(b) = b \cdot \frac{\pi(H_0 \mid x^m)}{\pi(H_1 \mid x^m)} \cdot \frac{\bar P_0(\tau = n \mid x^m)}{\bar P_1(\tau = n \mid x^m)}. \tag{18}$$

Combining the two lemmas now immediately gives (19) below, and combining further with (13) and (10) gives (20):

Corollary 3. In the setting considered above, we have for all $x^m \in \mathcal{X}_m$: for $\bar P_0^{[\gamma_\tau]}(\cdot \mid x^m)$-almost every $b$,

$$\frac{d\bar P_1^{[\gamma_\tau]}(\cdot \mid x^m)}{d\bar P_0^{[\gamma_\tau]}(\cdot \mid x^m)}(b) = b \cdot \frac{\pi(H_0 \mid x^m)}{\pi(H_1 \mid x^m)}, \tag{19}$$

and also

$$\frac{\pi(H_1 \mid x^m, \gamma_\tau = b)}{\pi(H_0 \mid x^m, \gamma_\tau = b)} = b. \tag{20}$$

In words, the posterior odds remain calibrated under any stopping rule $\tau$ which stops almost surely at times $m < \tau < \infty$.
For discrete and strictly positive measures with prior odds $\pi(H_1)/\pi(H_0) = 1$, we always have $m = 0$, and (19) then reduces to the calibration hypothesis (6) of Section 2.2. Note also that (19) and (20) hold $\bar P_0(\cdot \mid x^m)$-almost surely, and hence also $\bar P_1(\cdot \mid x^m)$-almost surely, because the two measures are assumed mutually absolutely continuous.

(Semi-) Frequentist Optional Stopping
In this section we consider our general setting as in the beginning of Section 3.2, i.e. with the added assumption that the stopping time is a.s. finite, and with $T_\tau := \{j \in \mathbb{N}_{>0} : \bar P_0(\tau = j) > 0\}$. From here onward we shall further simplify slightly by assuming that the prior on $H_0$ and $H_1$ is equal, so that the posterior odds (given $x^n$ and possibly $\tau = n$) equal the Bayes factor $\beta_n$.
Consider any initial sample $x^m \in \mathcal{X}_m$ and let $\bar P_0(\cdot \mid x^m)$ and $\bar P_1(\cdot \mid x^m)$ be the conditional Bayes marginal distributions as defined in (12); throughout this section, the Bayes factors $\beta_j$ are computed relative to these conditional marginals, i.e. $\beta_j := (d\bar P_1^{(j)}(\cdot \mid x^m)/d\bar P_0^{(j)}(\cdot \mid x^m))(X^j)$. We first note that, by Markov's inequality, for any nonnegative random variable $Z$ on $\Omega$ with, for all $x^m \in \mathcal{X}_m$, $E_{\bar P_0 \mid x^m}[Z] \leq 1$, we must have, for $0 \leq \alpha \leq 1$:

$$\bar P_0(Z \geq 1/\alpha \mid x^m) \leq \alpha.$$

Proposition 4. Let $\tau$ be any stopping rule satisfying our requirements. The stopped Bayes factor $\beta_\tau$ as defined by (14) (with $\beta_j$ in the role of $B_j$) is a random variable that satisfies, for all $x^m \in \mathcal{X}_m$, $E_{\bar P_0 \mid x^m}[\beta_\tau] \leq 1$, so that, by the reasoning above,

$$\bar P_0(\beta_\tau \geq 1/\alpha \mid x^m) \leq \alpha.$$

Proof. The following implications are all immediate:

$$E_{\bar P_0 \mid x^m}[\beta_\tau] = \sum_{n \in T_\tau} \int_{\{\tau = n\}} \beta_n \, d\bar P_0(\cdot \mid x^m) = \sum_{n \in T_\tau} \int_{\{\tau = n\}} \frac{d\bar P_1^{(n)}(\cdot \mid x^m)}{d\bar P_0^{(n)}(\cdot \mid x^m)} \, d\bar P_0(\cdot \mid x^m) = \sum_{n \in T_\tau} \bar P_1(\tau = n \mid x^m) \leq 1.$$

The desired result now follows by plugging in a particular stopping rule: let $S: \bigcup_{i=m}^{\infty} \mathcal{X}^i \to \{0, 1\}$ be the frequentist sequential test defined by setting, for all $n > m$ and $x^n \in \mathcal{X}_n$: $S(x^n) = 1$ iff $\beta_n \geq 1/\alpha$.

Corollary 5. Let $t^* \in \{m+1, m+2, \ldots\} \cup \{\infty\}$ be the smallest $t > m$ for which $\beta_t^{-1} \leq \alpha$. Then for arbitrarily large $T$, when applied to the stopping rule $\tau := \min\{T, t^*\}$, we find that

$$\bar P_0(\exists n \in \{m+1, \ldots, T\} : S(X^n) = 1 \mid x^m) = \bar P_0(\exists n \in \{m+1, \ldots, T\} : \beta_n \geq 1/\alpha \mid x^m) = \bar P_0(\beta_\tau \geq 1/\alpha \mid x^m) \leq \alpha.$$

The corollary implies that the test $S$ is robust under optional stopping in the frequentist sense relative to $H_0$ (Definition 1). Note that, just as in the simple case, the setting is really just 'semi-frequentist' whenever $H_0$ is not a singleton.

Optional stopping with group invariance
Whenever the null hypothesis is composite, the previous results only hold under the marginal distribution $\bar P_0$ or, in the case of improper priors, under $\bar P_0(\cdot \mid X^{(m)} = x^m)$.

Background for fixed sample sizes
Here we prepare for our results by providing some general background on invariant priors for Bayes factors with fixed sample size $n$ on models with nuisance parameters that admit a group structure, introducing the right Haar measure, the corresponding Bayes marginals, and (maximal) invariants. We use these results in Section 4.2 to derive Lemma 7, which gives us a strong version of calibration for fixed $n$. The setting is extended to variable stopping times in Section 4.3, and then Lemma 7 is used in this extended setting to obtain our strong optional stopping results in Sections 4.4 and 4.5.
For now, we assume a sample space $\mathcal{X}_n$ that is locally compact and Hausdorff, and that is a subset of some product space $\mathcal{X}^n$ where $\mathcal{X}$ is itself locally compact and Hausdorff. This requirement is met, for example, when $\mathcal{X} = \mathbb{R}$ and $\mathcal{X}_n = \mathcal{X}^n$. In practice, the space $\mathcal{X}_n$ is invariably a subset of $\mathcal{X}^n$ that excludes some 'singular' outcomes that have measure 0 under all hypotheses involved. We associate $\mathcal{X}_n$ with its Borel σ-algebra, which we denote by $\mathcal{F}_n$. Observations are denoted by the random vector $X^n = (X_1, \ldots, X_n) \in \mathcal{X}_n$. We thus consider outcomes of fixed sample size, denoting these as $x^n \in \mathcal{X}_n$, returning to the case with stopping times in Sections 4.4 and 4.5.
We start with some group-theoretical preliminaries; for more details, see e.g. Eaton (1989), Wijsman (1990), Andersson (1982).

Definition 2 (Eaton (1989), Definition 2.1). Let $G$ be a group of measurable one-to-one transformations on $\mathcal{X}$ with identity $e$. Let $Y$ be a set. A function $F: Y \times G \to Y$ satisfying

1. $F(y, e) = y$ for all $y \in Y$;
2. $F(y, g_1 g_2) = F(F(y, g_1), g_2)$ for all $g_1, g_2 \in G$ and $y \in Y$;

specifies $G$ acting on the right of $Y$.
In practice, $F$ is omitted: we will write $y \cdot g$ for a group element $g$ acting on the right of $y \in Y$. For a subset $A \subseteq Y$, we write $A \cdot g := \{a \cdot g \mid a \in A\}$. From now on, we let $G$ be a locally compact group that acts topologically and properly on the right of $\mathcal{X}_n$.
Let $P_{0,e}$ and $P_{1,e}$ (notation to become clear below) be two arbitrary probability distributions on $\mathcal{X}_n$ that are mutually absolutely continuous. We will now generate hypothesis classes $H_0$ and $H_1$, both sets of distributions on $\mathcal{X}_n$ with parameter space $G$, starting from $P_{0,e}$ and $P_{1,e}$. The group action of $G$ on $\mathcal{X}_n$ induces a group action on these measures, defined by

$$P_{k,g}(A) := P_{k,e}(A \cdot g^{-1}) \tag{21}$$

for any set $A \in \mathcal{F}_n$, $k = 0, 1$. When applied to $A = \mathcal{X}_n$, we get $P_{k,g}(A) = 1$ for all $g \in G$, whence we have created two sets of probability measures parameterized by $g$, i.e.,

$$H_k = \{P_{k,g} : g \in G\}, \qquad k = 0, 1. \tag{22}$$

In this context, $g \in G$ can typically be viewed as a nuisance parameter, i.e. a parameter that is not directly of interest, but needs to be accounted for in the analysis. This is illustrated in Example 1 and Example 2 below. The examples also illustrate how to extend this setting to cases where there are more parameters than just $g \in G$ in either $H_0$ or $H_1$. We extend the whole setup to our general setting with non-fixed $n$ in Section 4.4.
We use the right Haar measure for $G$ as a prior to define the Bayes marginals.

Definition 3 (Conway (2013)). Let $G$ be a locally compact group with Borel σ-algebra $\mathcal{B}(G)$. A right Haar measure $\nu$ on $G$ is a nonzero regular Borel measure satisfying $\nu(Ag) = \nu(A)$ for all $g \in G$ and all $A \in \mathcal{B}(G)$; it is unique up to a positive multiplicative constant.

The Bayes marginals are then defined to be:

$$\bar P_k(A) := \int_G P_{k,g}(A) \, d\nu(g) \tag{23}$$

for $k = 0, 1$ and $A \in \mathcal{F}_n$. Typically, the right Haar measure is improper, so that the Bayes marginals $\bar P_k$ are not integrable. Yet, in all cases of interest, (a) they are still σ-finite, and (b) $\bar P_0$, $\bar P_1$ and all distributions $P_{k,g}$, $k = 0, 1$, $g \in G$, are mutually absolutely continuous; we will henceforth assume that (a) and (b) are the case.

Example 1.
Consider the one-sample t-test as described by Rouder et al. (2009), going back to Jeffreys (1961). The test considers normally distributed data with unknown standard deviation. The test is meant to answer the question whether the data have mean $\mu = 0$ (the null hypothesis) or some other mean (the alternative hypothesis). Below, all densities are taken with respect to Lebesgue measure. Following Rouder et al. (2009), a Cauchy prior density, denoted by $\pi_\delta(\delta)$, is placed on the effect size $\delta = \mu/\sigma$. The unknown standard deviation is a nuisance parameter and is equipped with the right Haar prior with density $\pi_\sigma(\sigma) = 1/\sigma$ under both hypotheses. Abbreviating, for a general measure $P$ on $\mathcal{X}_n$, $(dP/d\lambda)$ (the density of the distribution $P$ relative to Lebesgue measure on $\mathbb{R}^n$) to $p$, this gives the following marginal densities on $n$ outcomes:

$$\bar p_0(x^n) = \int_0^\infty \frac{1}{\sigma} \prod_{i=1}^n \phi_\sigma(x_i) \, d\sigma, \qquad \bar p_1(x^n) = \int_0^\infty \int_{-\infty}^{\infty} \frac{1}{\sigma} \, \pi_\delta(\delta) \prod_{i=1}^n \phi_\sigma(x_i - \delta\sigma) \, d\delta \, d\sigma,$$

where $\phi_\sigma$ denotes the density of a normal distribution with mean zero and variance $\sigma^2$. Normally, this is viewed as a test between $H_0 = \{P_{0,\sigma} : \sigma \in \mathbb{R}_{>0}\}$ and $H_1' = \{P_{1,\sigma,\delta} : \sigma \in \mathbb{R}_{>0}, \delta \in \mathbb{R}\}$, but we can obviously also view it as a test between $H_0$ and $H_1 = \{P_{1,\sigma} : \sigma \in \mathbb{R}_{>0}\}$, i.e. we integrate out the parameter $\delta$. Then $H_0$ and $H_1$ will be of the form (22) needed to state our results. In all standard invariant settings, one can reduce $H_0$ and $H_1$ to this form in the same way, by integrating out all other parameters, which is possible as long as we equip those parameters with proper priors, which is invariably done in practice.
The group for the nuisance parameter $\sigma$ is the group of scale transformations $G = \{c : c \in \mathbb{R}_{>0}\}$, with multiplication as the group operation. We thus let the sample space be $\mathcal{X}_n = \mathbb{R}^n \setminus \{(0, \ldots, 0)\}$: we remove the measure-zero set $\{(0, \ldots, 0)\}$, such that the group action is proper on the sample space. The group action is defined by $x^n \cdot c = c\, x^n$ for $x^n \in \mathcal{X}_n$, $c \in G$. Take a $g \in G$. The measures $P_{k,g}$ defined by (21) are given by the densities $p_{0,\sigma}(x^n) = \prod_{i=1}^n \phi_\sigma(x_i)$ and $p_{1,\sigma}(x^n) = \int \pi_\delta(\delta) \prod_{i=1}^n \phi_\sigma(x_i - \delta\sigma) \, d\delta$, with $\sigma$ replaced by $g$.
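For concreteness, a small numerical sketch (ours, with an arbitrary data vector; not code from the paper) that evaluates $\bar p_0$, $\bar p_1$ and hence the Bayes factor of Example 1 by direct quadrature; for anything beyond illustration, a more careful integration scheme would be advisable.

```python
import numpy as np
from scipy import integrate, stats

# One-sample t-test Bayes factor of Example 1: right Haar prior 1/sigma on
# the scale under both hypotheses, standard Cauchy prior on delta = mu/sigma.

x = np.array([0.8, 1.2, 0.3, 1.7, 0.9])   # arbitrary example data

def lik(sigma, delta):
    """Product density of x under N(delta * sigma, sigma^2)."""
    return np.prod(stats.norm.pdf(x, loc=delta * sigma, scale=sigma))

# p0-bar(x^n): integrate out sigma against the improper Haar prior 1/sigma
p0, _ = integrate.quad(lambda s: lik(s, 0.0) / s, 0, np.inf)

# p1-bar(x^n): additionally integrate delta against the Cauchy prior density
p1, _ = integrate.dblquad(
    lambda d, s: lik(s, d) * stats.cauchy.pdf(d) / s,
    0, np.inf,                             # outer variable: sigma
    lambda s: -np.inf, lambda s: np.inf)   # inner variable: delta
print("Bayes factor beta(x^n) =", p1 / p0)
```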
We now turn to the main ingredient that will be needed to obtain results on optional stopping: the quotient σ-algebra.
Definition 4 (Eaton (1989), Chapter 2). A group $G$ acting on the right of a set $Y$ induces an equivalence relation: $y_1 \sim y_2$ if and only if there exists $g \in G$ such that $y_1 = y_2 \cdot g$. This equivalence relation partitions the space into orbits $O_y = \{y \cdot g \mid g \in G\}$, the collection of which is called the quotient space $Y/G$. There exists a map, the natural projection, from $Y$ to the quotient space, defined by $\varphi_Y: Y \to Y/G: y \mapsto \{y \cdot g \mid g \in G\}$, which we use to define the quotient σ-algebra on $\mathcal{X}_n$:

$$\mathcal{G}_n := \{\varphi_{\mathcal{X}_n}^{-1}(S) : S \subseteq \mathcal{X}_n/G \text{ such that } \varphi_{\mathcal{X}_n}^{-1}(S) \in \mathcal{F}_n\},$$

i.e. the σ-algebra of measurable sets that are unions of orbits.

Definition 5 (Eaton (1989), Chapter 2). A random element $U_n$ on $\mathcal{X}_n$ is invariant if for all $g \in G$ and $x^n \in \mathcal{X}_n$: $U_n(x^n) = U_n(x^n \cdot g)$. The random element $U_n$ is maximal invariant if $U_n$ is invariant and, for all $x^n, y^n \in \mathcal{X}_n$, $U_n(x^n) = U_n(y^n)$ implies $x^n = y^n \cdot g$ for some $g \in G$.
Thus, $U_n$ is maximal invariant if and only if $U_n$ is constant on each orbit and takes different values on different orbits; $\varphi_{\mathcal{X}_n}$ is thus an example of a maximal invariant. Note that any maximal invariant is $\mathcal{G}_n$-measurable. The importance of this quotient σ-algebra $\mathcal{G}_n$ is the following evident fact:

Proposition 6. For fixed $k \in \{0, 1\}$, every invariant $U_n$ has the same distribution under all $P_{k,g}$, $g \in G$.
Chapter 2 of Eaton (1989) provides several methods and examples of how to construct a concrete maximal invariant, including the first two given below. Since $\beta_n$ is invariant under the group action of $G$ (see below), $\beta_n$ is an example of an invariant, although not necessarily of a maximal invariant.
Example 1 (continued). Consider the setting of the one-sample t-test as described above in Example 1. A maximal invariant for $x^n \in \mathcal{X}_n$ is $U_n(x^n) = (x_1/|x_1|, x_2/|x_1|, \ldots, x_n/|x_1|)$.
Example 2. A second example, with a group invariance structure on two parameters, is the setting of the two-sample t-test with the right Haar prior (which here coincides with Jeffreys' prior) $\pi(\mu, \sigma) = 1/\sigma$ (see Rouder et al. (2009) for details): the group is $G = \{(a, b) : a > 0, b \in \mathbb{R}\}$. Let the sample space be $\mathcal{X}_n = \mathbb{R}^n \setminus \mathrm{span}(e_n)$, where $e_n$ denotes a vector of ones of length $n$ (this is to exclude the measure-zero line on which $s(x^n)$ is zero), and define the group action by $x^n \cdot (a, b) = a x^n + b e_n$ for $x^n \in \mathcal{X}_n$. Then (Eaton (1989), Example 2.15) a maximal invariant for $x^n \in \mathcal{X}_n$ is $U_n(x^n) = (x^n - \bar{x} e_n)/s(x^n)$, where $\bar{x}$ is the sample mean and $s(x^n) = \left(\sum_{i=1}^n (x_i - \bar{x})^2\right)^{1/2}$. However, we can also construct a maximal invariant similar to the one in Example 1, which gives a special status to an initial sample: for instance (outside the measure-zero set where $x_1 = x_2$),

$$U_n(x^n) = \left(\frac{x_2 - x_1}{|x_2 - x_1|}, \frac{x_3 - x_1}{|x_2 - x_1|}, \ldots, \frac{x_n - x_1}{|x_2 - x_1|}\right).$$
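A quick numerical check (ours) that the maximal invariant of Example 2 is indeed unchanged under the location-scale action:

```python
import numpy as np
rng = np.random.default_rng(0)

# Invariance check for U_n(x^n) = (x^n - mean * e_n) / s(x^n) from Example 2
# under the group action x^n . (a, b) = a x^n + b e_n with a > 0.

def U(x):
    return (x - x.mean()) / np.sqrt(((x - x.mean()) ** 2).sum())

x = rng.normal(size=6)
a, b = 2.7, -1.3                          # an arbitrary group element
print(np.allclose(U(x), U(a * x + b)))    # True: U_n is constant on orbits
```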

Relatively Invariant Measures and Calibration for Fixed n
Let $U_n$ be a maximal invariant, taking values in some measurable space $(\mathcal{U}_n, \Sigma(\mathcal{U}_n))$. Although we have given more concrete examples above, it follows from the results of Andersson (1982) that, in case we do not know how to construct a $U_n$, we can always take $U_n = \varphi_{\mathcal{X}_n}$. Since we assume mutual absolute continuity, the Radon-Nikodym derivative $dP_{1,g}^{[U_n]}/dP_{0,g}^{[U_n]}$ exists.

Theorem (Berger et al., 1998, Theorem 2.1). Under our previous definitions of and assumptions on $G$, $P_{k,g}$, $\bar P_k$, let $\beta(x^n) := (d\bar P_1/d\bar P_0)(x^n)$ be the Bayes factor based on $x^n$. Let $U_n$ be a maximal invariant as above, with (adopting the notation of (15)) marginal measures $P_{k,g}^{[U_n]}$, for $k = 0, 1$ and $g \in G$. There exists a version of the Radon-Nikodym derivative such that we have, for all $g \in G$ and all $x^n \in \mathcal{X}_n$:

$$\frac{dP_{1,g}^{[U_n]}}{dP_{0,g}^{[U_n]}}(U_n(x^n)) = \beta(x^n). \tag{25}$$

As a first consequence of the theorem above, we note (as did Berger et al. (1998)) that the Bayes factor $\beta_n := \beta(X^n)$ is $\mathcal{G}_n$-measurable (it is constant on orbits); thus, for each fixed $k \in \{0, 1\}$, it has the same distribution under $P_{k,g}$ for all $g \in G$. The theorem also implies the following crucial lemma, whose proof is in Appendix A:

Lemma 7. Let $V_n$ be a $\mathcal{G}_n$-measurable random variable taking values in $\{0, 1\}$, and let $g \in G$ be such that $P_{0,g}(V_n = 1) > 0$. Then, for $P_{0,g}^{[\beta_n]}(\cdot \mid V_n = 1)$-almost every $b$:

$$\frac{dP_{1,g}^{[\beta_n]}(\cdot \mid V_n = 1)}{dP_{0,g}^{[\beta_n]}(\cdot \mid V_n = 1)}(b) = b \cdot \frac{P_{0,g}(V_n = 1)}{P_{1,g}(V_n = 1)}.$$

Extending to Our General Setting with Non-Fixed Sample Sizes
We start with the same setting as above: a group $G$ on sample space $\mathcal{X}_n \subset \mathcal{X}^n$ that acts topologically and properly on the right of $\mathcal{X}_n$; two distributions $P_{0,e}$ and $P_{1,e}$ on $(\mathcal{X}_n, \mathcal{F}_n)$ that are used to generate $H_0$ and $H_1$, and Bayes marginal measures $\bar P_0$ and $\bar P_1$ based on the right Haar measure, which are both σ-finite. We now denote $H_k$ as $H_k^{(n)}$, $P_{k,e}$ as $P_{k,e}^{(n)}$ and $\bar P_k$ as $\bar P_k^{(n)}$, to make the dependence on the sample size explicit; as before, $\bar P_0^{(n)}$ and $\bar P_1^{(n)}$ are mutually absolutely continuous.
We now extend this setting to our general random process setting as specified in the beginning of Section 3.2 by further assuming that, for the same group $G$ and some $m > 0$, the above setting is defined for each $n \geq m$. To connect the $H_k^{(n)}$ for all these $n$, we further assume that there exists a subset $\mathcal{X}_m \subset \mathcal{X}^m$ that has measure 1 under $P_{k,e}^{(m)}$ (and hence under all $P_{k,g}^{(m)}$) such that for all $n \geq m$:

1. We can write $\mathcal{X}_n = \{x^n \in \mathcal{X}^n : (x_1, \ldots, x_m) \in \mathcal{X}_m\}$.
2. For all $x^n \in \mathcal{X}_n$, the posterior $\nu \mid x^n$ based on the right Haar measure $\nu$ is proper.
3. The probability measures $P_{k,e}^{(n)}$ and $P_{k,e}^{(n+1)}$ satisfy Kolmogorov's compatibility condition for a random process.
4. The group action $\cdot$ on the measures $P_{k,e}^{(n)}$ and $P_{k,e}^{(n+1)}$ is compatible, i.e. for every $n \geq m$, every $A \in \mathcal{F}_n$ and every $g \in G$: $(A \cdot g) \times \mathcal{X} = (A \times \mathcal{X}) \cdot g$, so that the action on an $n$-cylinder agrees with the action on its $(n+1)$-dimensional extension.

Requirement 4 simply imposes the condition that the group action considered is the same for all $n \in \mathbb{N}$. As a consequence of 3 and 4, the probability measures $P_{k,g}^{(n)}$ and $P_{k,g}^{(n+1)}$ satisfy Kolmogorov's compatibility condition for all $g \in G$, $k \in \{0, 1\}$, which means that there exists a probability measure $P_{k,g}$ on $(\Omega, \mathcal{F})$ (under which $X^{(m)}, X_{m+1}, X_{m+2}, \ldots$ is a random process), defined as in the beginning of Section 3, whose marginals for $n \geq m$ coincide with $P_{k,g}^{(n)}$; similarly, there exist measures $\bar P_0$ and $\bar P_1$ on $(\Omega, \mathcal{F})$ whose marginals for $n \geq m$ coincide with $\bar P_0^{(n)}$ and $\bar P_1^{(n)}$. Our setup here is closely related to that of Berger et al. (1998) and Bayarri et al. (2012). In fact, our initial sample $x^m \in \mathcal{X}_m$ is a variation of what they call a minimal sample; by excluding 'singular' outcomes from $\mathcal{X}^m$ to ensure that the group acts properly on $\mathcal{X}_m$, we can guarantee that the initial sample is of fixed size; the size of the minimal sample can be larger, on a set of measure 0 under all $P \in H_0 \cup H_1$, e.g. if, in Example 2, $X_1 = X_2$. We chose to ensure a fixed size $m$ since it makes the extension to random processes considerably easier.
Just as in the case with a fixed $n$, if we start with large (or even nonparametric) hypotheses $H_k' = \{P_{\theta'|k} : \theta' \in \Theta_k'\}$ which we want to equip with a nuisance parameter $g$, we can still view this as an instance of the present setting by first taking a proper prior density $\pi(\theta')$ on $\theta'$, setting $P_{k,e}$ equal to the corresponding Bayes marginal, i.e.

$$P_{k,e} := \int_{\Theta_k'} P_{\theta'|k} \, d\pi(\theta'), \tag{27}$$

and defining $P_{k,g}^{(n)}$ for each $n$ (and hence $P_{k,g}$) via (21). We can then view $H_k$ equivalently as $\{P_{k,g} : g \in G\}$, with $\bar P_k$ the Bayes marginal based on the right Haar prior (and $\Theta_k$ in (1) equal to $G$); or as

$$H_k = \{P_{\theta', g \mid k} : \theta' \in \Theta_k', g \in G\}, \tag{28}$$

with now $\bar P_k$ the Bayes marginal based on the product prior of $\pi(\theta')$ and the right Haar prior. This was illustrated (with $\theta'$ equal to the effect size) for the one-sample Bayesian t-test in Example 1.

Strong Calibration
Consider the setting, definitions and assumptions of the previous subsection, with the additional assumptions and definitions made in the beginning of Section 3.3, in particular the assumption of an a.s. finite stopping time and equal prior odds. We will now show a strong calibration theorem for the Bayes factors $\beta_n = (d\bar P_1^{(n)}/d\bar P_0^{(n)})(X^n)$, defined in terms of the Bayes marginals $\bar P_0$ and $\bar P_1$ with the right Haar prior. Thus $\beta_\tau$ is defined as in (14), with $\beta_j$ in the role of $B_j$.
Theorem 8 (Strong calibration under optional stopping). Let $\tau$ be a stopping time satisfying our requirements, such that additionally, for each $n > m$, the event $\{\tau = n\}$ is $\mathcal{G}_n$-measurable. Then, adopting the notation of (15), for all $g \in G$ and $P_{0,g}^{[\beta_\tau]}$-almost every $b > 0$, we have:

$$\frac{dP_{1,g}^{[\beta_\tau]}}{dP_{0,g}^{[\beta_\tau]}}(b) = b.$$

That means that the posterior odds remain calibrated under every stopping rule $\tau$ adapted to the quotient space filtration $\mathcal{G}_m, \mathcal{G}_{m+1}, \ldots$, under all $P_{0,g}$.
Proof. Fix some $g \in G$. We first apply Lemma 7 with $V_n = \mathbb{1}_{\{\tau = n\}}$, which gives that the premise (16) of Lemma 1 holds with $c = 1$ and $\beta_n$ in the role of $B_n$ (it is here that we need that $\{\tau = n\}$ is $\mathcal{G}_n$-measurable; otherwise we could not apply Lemma 7 with the required definition of $V_n$). We can now use Lemma 1 with $P_{0,g}$ in the role of $P_0$ and $P_{1,g}$ in the role of $P_1$ to reach the desired conclusion for the chosen $g$. Since this works for all $g \in G$, the result follows.

Example 1, Continued: Admissible and Inadmissible Stopping Rules
We obtain strong calibration for the one-sample t-test with respect to the nuisance parameter $\sigma$ (see Example 1 above) when the stopping rule is adapted to the quotient filtration $\mathcal{G}_m, \mathcal{G}_{m+1}, \ldots$. Under each $P_{k,g} \in H_k$, the Bayes factors $\beta_m, \beta_{m+1}, \ldots$ define a random process on $\Omega$ such that each $\beta_n$ is $\mathcal{G}_n$-measurable. This means that a stopping time defined in terms of a rule such as 'stop at the smallest $t$ at which $\beta_t > 20$, or at $t = 10^6$' is allowed in the result above. If the stopping rule is a function of a sequence of maximal invariants, like $x_1/|x_1|, x_2/|x_1|, \ldots$, it is adapted to the $\mathcal{G}_n$ as well and we can again apply the result above. On the other hand, this requirement is violated, for example, by a stopping rule that stops when $\sum_{i=1}^{j} x_i^2$ exceeds some fixed value, since such a stopping rule explicitly depends on the scale of the sampled data.
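A small sketch (ours) contrasting the two kinds of rules just discussed: rule A depends on the data only through a scale-invariant statistic and is therefore adapted to the quotient filtration; rule B thresholds $\sum_i x_i^2$ and is not.

```python
import numpy as np
rng = np.random.default_rng(3)

# Admissible vs. inadmissible stopping rules under the scale group of
# Example 1. Rule A uses the (scale-free) running t-statistic; rule B
# stops when sum(x_i^2) exceeds a threshold, which depends on the scale.

def tau_A(x):   # scale-invariant: unchanged when x is multiplied by c > 0
    for n in range(2, len(x) + 1):
        if abs(x[:n].mean()) / x[:n].std(ddof=1) > 1.5:
            return n
    return len(x)

def tau_B(x):   # not invariant: depends on the scale of the data
    for n in range(1, len(x) + 1):
        if (x[:n] ** 2).sum() > 20:
            return n
    return len(x)

x = rng.normal(loc=1.0, scale=1.0, size=30)
for c in (1.0, 0.1, 10.0):               # rescale the same data
    print(c, tau_A(c * x), tau_B(c * x))
# tau_A is unchanged under rescaling; tau_B's stopping point changes with c.
```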

Frequentist optional stopping
The special case of the following result for the one-sample Bayesian t-test was proven in the master's thesis (Hendriksen, 2017). Here we extend the result to general group invariances.
Theorem 9 (Frequentist optional stopping for composite null hypotheses with group invariance). Under the same conditions as in Section 4.4, let $\tau$ be a stopping time such that, for each $n > m$, the event $\{\tau = n\}$ is $\mathcal{G}_n$-measurable. Then, adopting the notation of (15), for all $g \in G$, the stopped Bayes factor satisfies

$$E_{P_{0,g}}[\beta_\tau] = \int_{\mathbb{R}_{>0}} c \, dP_{0,g}^{[\beta_\tau]}(c) = 1,$$

so that, by the reasoning above Proposition 4, we have for all $g \in G$: $P_{0,g}(1/\beta_\tau \leq \alpha) \leq \alpha$.

Proof. We have

$$\int_{\mathbb{R}_{>0}} c \, dP_{0,g}^{[\beta_\tau]}(c) = \int_{\mathbb{R}_{>0}} dP_{1,g}^{[\beta_\tau]}(c) = 1,$$

where the first equality follows directly from Theorem 8 and the final equality follows because $P_{1,g}$ is a probability measure, integrating to 1.
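A Monte Carlo sketch (ours; slow, since it performs numerical integration inside a simulation loop) of what Theorem 9 buys us in Example 1: with the stopping rule 'stop as soon as $\beta_n \geq 1/\alpha$' (which is $\mathcal{G}_n$-measurable), the type-I error stays below $\alpha$ for an arbitrary true scale $\sigma$, not merely on average over the Haar prior.

```python
import numpy as np
from scipy import integrate, stats
rng = np.random.default_rng(5)

# One-sample t-test (Example 1): under the null mu = 0 with arbitrary sigma,
# reject as soon as beta_n >= 1/alpha, monitoring after each observation.

alpha, sigma, n_max, trials = 0.1, 3.7, 15, 200   # sigma arbitrary

def bf(x):
    lik = lambda s, d: np.prod(stats.norm.pdf(x, d * s, s))
    p0, _ = integrate.quad(lambda s: lik(s, 0.0) / s, 0, np.inf)
    p1, _ = integrate.dblquad(lambda d, s: lik(s, d) * stats.cauchy.pdf(d) / s,
                              0, np.inf,
                              lambda s: -np.inf, lambda s: np.inf)
    return p1 / p0

rejections = 0
for _ in range(trials):
    x = rng.normal(0.0, sigma, size=n_max)
    for n in range(2, n_max + 1):      # start after an initial sample
        if bf(x[:n]) >= 1 / alpha:
            rejections += 1
            break
print(rejections / trials, "<= alpha =", alpha)
```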

Concluding Remarks
We have identified three senses of 'handling optional stopping': τ-independence, calibration and semi-frequentist robustness. We extended the corresponding definitions and results to general sample spaces with potentially improper priors. For the special case of models $H_0$ and $H_1$ sharing a nuisance parameter with a group invariant structure, we showed stronger versions of calibration and semi-frequentist robustness to optional stopping. As a final remark, it is worth noting that, as is immediate from the proofs, all our group-invariance results continue to hold in the setting with $H_k$ as in (28), with the definition of the Bayes marginal $P_{k,e}$ relative to $\theta'$ as in (27) replaced by a probability measure on $(\Omega, \mathcal{F})$ that is not necessarily of the Bayes marginal form. The results work for any probability measure; in particular, one can take the alternatives for the Bayes marginal with proper prior that are considered in the minimum description length and sequential prediction literature (Barron et al., 1998; Grünwald, 2007) under the name of universal distribution relative to $\{P_{\theta'} : \theta' \in \Theta'\}$; examples include the prequential, normalized maximum likelihood or 'switch' distributions considered by van der Pas and Grünwald (2018).

Appendix A: Proofs

Proof. [of Lemma 1] Let $A$ be any Borel subset of $\mathbb{R}_{>0}$. Using the definition (14) of $B_\tau$, we can write:

$$P_1^{[B_\tau]}(A) = \sum_{n \in T_\tau} P_1(\tau = n) \, P_1^{[B_n]}(A \mid \tau = n) \overset{(*)}{=} \sum_{n \in T_\tau} P_1(\tau = n) \cdot c \cdot \frac{P_0(\tau = n)}{P_1(\tau = n)} \int_A b \, dP_0^{[B_n]}(b \mid \tau = n) = c \int_A b \, dP_0^{[B_\tau]}(b),$$

where $(*)$ follows because of our fixed-$n$ calibration assumption (16). We have shown that the function $g$ defined by $g(t) = c \cdot t$ is the Radon-Nikodym derivative $\frac{dP_1^{[B_\tau]}}{dP_0^{[B_\tau]}}$.

Proof. [of Lemma 2] Let $A$ be any Borel subset of $\mathbb{R}_{>0}$, and first consider the case $m = 0$. We have:

$$\bar P_1(\gamma_n \in A, \tau = n) = \int \mathbb{1}_{\{\gamma_n \in A,\, \tau = n\}} \, d\bar P_1 \overset{(*)}{=} \frac{\pi(H_0)}{\pi(H_1)} \int \mathbb{1}_{\{\gamma_n \in A,\, \tau = n\}} \, \gamma_n \, d\bar P_0 = \frac{\pi(H_0)}{\pi(H_1)} \, \bar P_0(\tau = n) \int_A b \, d\bar P_0^{[\gamma_n]}(b \mid \tau = n),$$

where $(*)$ follows from $d\bar P_1^{(n)}/d\bar P_0^{(n)} = \gamma_n \cdot \pi(H_0)/\pi(H_1)$, a consequence of (10) and of (3), which can be verified to be still valid in our generalized setting. Dividing both sides by $\bar P_1(\tau = n)$ gives

$$\bar P_1^{[\gamma_n]}(A \mid \tau = n) = \int_A b \cdot \frac{\pi(H_0) \, \bar P_0(\tau = n)}{\pi(H_1) \, \bar P_1(\tau = n)} \, d\bar P_0^{[\gamma_n]}(b \mid \tau = n).$$

The case $m > 0$ follows in exactly the same way, by shifting the data by $m$ places (so that the new $x_1$ becomes what was $x_{m+1}$), treating, for $k = 0, 1$, $\pi(H_k \mid x^m)$ as the priors for this shifted-data problem, and then applying the above with $m = 0$. We have shown that the Radon-Nikodym derivative $\frac{d\bar P_1^{[\gamma_n]}(\cdot \mid x^m, \tau = n)}{d\bar P_0^{[\gamma_n]}(\cdot \mid x^m, \tau = n)}$ at $\gamma_n = b$ is given by $b \cdot \frac{\bar P_0(\tau = n \mid x^m)\, \pi(H_0 \mid x^m)}{\bar P_1(\tau = n \mid x^m)\, \pi(H_1 \mid x^m)}$, which is what we had to show.
Proof. [of Lemma 7] Let $A'$ denote the event $V_n = 1$ and let $A \subset \mathbb{R}_{>0}$ be a Borel measurable set. We can write $\beta_n$ as a function of the maximal invariant: $\beta_n = \tilde\beta_n(U_n)$. With this notation, we have:

$$P_{1,g}(\beta_n \in A, A') = \int \mathbb{1}_{\{\tilde\beta_n(U_n) \in A\}} \mathbb{1}_{A'} \, dP_{1,g} \overset{(29)}{=} \int \mathbb{1}_{\{\tilde\beta_n(U_n) \in A\}} \mathbb{1}_{A'} \, \tilde\beta_n(U_n) \, dP_{0,g} \overset{(30)}{=} P_{0,g}(A') \int_A b \, dP_{0,g}^{[\beta_n]}(b \mid A'),$$

where step (29) holds because the integrand is $\mathcal{G}_n$-measurable and, by (25), $\beta_n$ is a version of the Radon-Nikodym derivative $dP_{1,g}^{[U_n]}/dP_{0,g}^{[U_n]}$, and (30) follows by rewriting the integral in terms of the induced measure (15). Dividing both sides by $P_{1,g}(A')$, we have shown that $b \mapsto b \cdot \frac{P_{0,g}(V_n = 1)}{P_{1,g}(V_n = 1)}$ is equal to the Radon-Nikodym derivative $\frac{dP_{1,g}^{[\beta_n]}(\cdot \mid V_n = 1)}{dP_{0,g}^{[\beta_n]}(\cdot \mid V_n = 1)}$, which is what we had to prove.