Practical Valid Inferences for the Two-Sample Binomial Problem

Consider comparing two independent binomial responses. Our interest is whether the two binomial parameters are different, and if different, which is larger, and if larger, by how much. This apparently simple problem was addressed by Fisher in the 1930's, and has been the subject of many review papers since then. Yet there continues to be new work on this issue and no consensus solution. Previous reviews have focused primarily on testing and the properties of validity and power, or primarily on confidence intervals, their coverage, and expected length. Here we evaluate both together. For example, we consider whether a p-value and its matching confidence interval are compatible, meaning that the p-value rejects at level $\alpha$ if and only if the $1-\alpha$ confidence interval excludes all null parameter values. For focus, we only examine non-asymptotic inferences, so that most of the p-values and confidence intervals are valid (i.e., exact) by construction. Within this focus, we review different methods emphasizing many of the properties and interpretational aspects we desire from applied frequentist inference: validity, accuracy, good power, equivariance, compatibility, coherence, causal interpretation, and parameterization and direction of effect. We show that no one method can meet all the desirable properties and give recommendations based on which properties are given more importance.


Introduction
Suppose we observe two independent binomial variates with parameters (n_1, θ_1) and (n_2, θ_2). One question we might have is: are θ_1 and θ_2 equal or not? If we reject the null hypothesis of equality (or even if we do not), we typically want to estimate how much larger one θ parameter is than the other. To answer these two questions, the frequentist typically presents an estimate of the effect, a confidence interval (CI) for that effect, and a p-value for testing that there is no effect. For such a simple problem, one might think that by now there would be a consensus method for testing and creating confidence intervals. But this is not so. New methods continue to be developed for this problem [see e.g., 41,61,62,21,27], and for the closely related problem of causal inference from a two-sample randomized experiment with binary responses [51,14]. Many review papers on this problem focus on testing alone [see 43,52] or on confidence intervals alone [see 49,58,16]. Here we focus on both together.
We limit the scope of this paper by not considering asymptotic methods or other approximations. Many review papers and books [see e.g., 49,43,16,50] cover and compare those methods. Sometimes those approximations are closed-form expressions and can be useful for deriving simple sample size formulas, or when the test is applied many times, such as in genomics. But often the approximations are unnecessary with modern computers. Non-asymptotic methods are often called exact, but in this paper we reserve the term exact for non-asymptotic methods that are valid, meaning tests that control the type I error rate and confidence intervals that cover the parameter with at least the nominal probability. See Section 2.2 for further discussion of the term exact. An important class of non-asymptotic tests that are not valid is the mid-p methods (Section 10), which are sometimes called quasi-exact [30]; they are included in our review because for confidence intervals we sometimes want average coverage close to the nominal value instead of guaranteed coverage, which on average will be conservative.
Here is an outline of our paper. In Section 2, we begin by contrasting the two-sample binomial problem with the two-sample difference in normal distributions with the same variance, for which there is an accepted solution: the two-sample t-test. This allows us to define inferential properties of interest as well as highlight why there is not one accepted solution to the two-sample binomial problem. Newcombe [50] takes a similar approach. In Section 3 we discuss the choice of effect measure (e.g., difference in binomial parameters, ratio of parameters, or odds ratio of parameters). In Section 4 we define a frequentist triple as a parameter estimator, an associated confidence interval procedure, and a p-value function. We then formally discuss some properties of triples, such as whether the confidence interval and p-value match and are compatible, and whether directional inferences may be made from the triple. The idea of matched triples is discussed in Hirji [29, p. 77] in a less formal way as a "unified report". Our review says very little about parameter estimators, and mostly focuses on properties of p-values and confidence intervals and on the compatibility of p-values with confidence intervals. Our discussion of directional inferences is motivated by the three-decision rule of Neyman [see e.g., 26]. We describe methods for defining valid one-sided decision rules in Sections 5 (unconditional methods) and 6 (conditional methods), including the associated p-values and confidence intervals. Much of Sections 5 and 6 was thoroughly reviewed in [52] but is included in this paper for completeness; however, Section 5.3 presents some new ideas on the informativeness of ordering functions. In Section 7 we review the melded confidence intervals of Fay, Proschan and Brittain [21], which are compatible with the p-values of the one-sided conditional method (i.e., Fisher's exact test). In Section 8 we discuss equivalence and non-inferiority studies. In Section 9 we discuss non-central confidence intervals and associated tests, with a new focus on the relationship of these intervals to directional inferences. In Section 10 we discuss mid-p methods, which are non-asymptotic methods that relax the validity requirement in order to achieve better accuracy. In Section 11 we discuss the computational aspects of the various methods. In Section 12 we review some recent work on causality and the two-sample binomial problem, and relate those results to the rest of this paper. In Section 13 we discuss power and efficiency of methods (see that section for references to some important papers on those topics), including some new calculations. In Section 14 we give our final recommendations.

Frequentist Inferences
We define a frequentist triple (or just a triple) as an estimator of a parameter of interest, a confidence interval, and a p-value function. This approach allows us to compare different triples by examining not just properties of each component (e.g., comparing powers of different p-value functions or expected lengths of different confidence intervals), but also properties of the triples as a whole. For example, within a triple we examine inferential agreement between the p-value function and the confidence interval procedure. Additionally, we examine what directional inferential statements we can make from the triple, such as stating that θ_2 is significantly larger than θ_1, and at what significance level.
Although in some different statistical settings (e.g., two-sample normal problem) the standard triple will automatically give inferential agreement between p-values and confidence intervals as well as automatically give directional inferential statements, in the two-sample binomial problem those inferential properties are not automatic. Thus, before discussing the binomial problem, we first review the two-sample problem with normally distributed responses with the same variance. We consider the latter problem first, because there is some consensus that one triple (the difference in means, and the confidence interval and p-value associated with the t-test) is appropriate for this problem. In the normal case, this t-test triple meets some regularity properties that lead to inferences that are intuitive and easy to understand. Because these properties form the basis for a certain statistical intuition about how frequentist inferences ought to be, and because the example uses normal distributional assumptions, we call these properties the "normal intuition". We will show later how the normal intuition breaks down for the two-sample binomial problem, although many of the properties may approximately hold for large samples.

Background and Notation
Consider a general frequentist problem, where we observe data, x, and denote its random variable as X. Assume some probability model for X that depends on a parameter vector θ, but we are interested in a function of θ that returns a scalar, b(θ) = β. We partition the possible values of θ into two sets, the null hypothesis space, Θ 0 , and the alternative hypothesis space, Θ 1 .
In this paper, except for Section 8, we consider only three classes of partitions, where the null and alternative spaces are defined by β and separated by a value β_0 on the boundary between the null and alternative hypothesis spaces. The first of these three classes is the two-sided hypotheses,

H_0: β = β_0 versus H_1: β ≠ β_0,

which can be equivalently written as Θ_0 = {θ : b(θ) = β_0} versus Θ_1 = {θ : b(θ) ≠ β_0}. The other two classes are the one-sided hypotheses:

Alternative is Less:    H_0: β ≥ β_0 versus H_1: β < β_0;
Alternative is Greater: H_0: β ≤ β_0 versus H_1: β > β_0.

Let p(x, Θ_0) be a p-value associated with the null hypothesis space, Θ_0. Typically, we assume a class of hypotheses and write (with a slight abuse of notation) p(x, β_0) as a p-value associated with the null hypothesis indexed by β_0. We reject the null hypothesis at level α if p(x, β_0) ≤ α. Following Berger and Boos [5], we define a p-value procedure as valid if

P_θ[p(X, Θ_0) ≤ α] ≤ α for all α ∈ (0, 1) and all θ ∈ Θ_0.

(Ripamonti et al. [52] call a valid p-value procedure a guaranteed p-value.) The term exact is often used to describe tests that give valid p-values, but be aware that the term 'exact' is used in at least 4 different ways in the literature: (i) exact can mean methods not based on asymptotic or other approximations [see 29, p.450], (ii) exact can mean valid methods [see 30,43,21], (iii) exact can mean that the size is equal to the significance level (only possible with randomized tests for discrete data) [15], or (iv) exact can mean that the p-values are the smallest possible within a class of valid p-value procedures [52, equation 2.5]. In this review, we use the term exact only in the sense of (i) and (ii). We use this term because it is part of accepted names of tests or classes of tests, such as Fisher's exact test and unconditional exact tests, and in these cases "exact" can always be interpreted in the sense of (i) and (ii), and in some cases (iv), but never (iii). Following Röhmel [54], we define a p-value procedure as coherent if for every pair of null hypothesis spaces with Θ_0^* ⊆ Θ_0 we have p(x, Θ_0^*) ≤ p(x, Θ_0). For the classes of hypotheses above, we can invert the p-value function to get its associated 100(1 − α)% confidence region,

C(x, 1 − α) = {β_0 : p(x, β_0) > α}.     (2.2)

We define a confidence region as valid if it is guaranteed to have at least nominal coverage for every θ (and hence every b(θ) = β); in other words,

P_θ[b(θ) ∈ C(X, 1 − α)] ≥ 1 − α for all θ.

Often we use asymptotic methods to create p-values and confidence intervals that are not valid for finite samples, but approach validity as the sample size gets large. In this paper, we only consider non-asymptotic methods, and all are valid except the mid-p methods described in Section 10.
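To make the inversion in equation 2.2 concrete, here is a minimal R sketch (ours, for illustration only); the function name pfun and the one-sample binomial example are hypothetical stand-ins for any valid p-value procedure.

# Illustration only: invert a p-value function over a grid of null values to
# obtain the associated 100(1 - alpha)% confidence region (equation 2.2).
confidence_region <- function(x, pfun, beta0_grid, alpha = 0.05) {
  pvals <- sapply(beta0_grid, function(b0) pfun(x, b0))
  beta0_grid[pvals > alpha]   # null values that are NOT rejected at level alpha
}
# Hypothetical example: a one-sample binomial p-value function from binom.test
pfun_binom <- function(x, p0) binom.test(x$successes, x$n, p = p0)$p.value
x <- list(successes = 7, n = 20)
region <- confidence_region(x, pfun_binom, beta0_grid = seq(0.001, 0.999, by = 0.001))
range(region)   # smallest interval containing the confidence region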

Standard Frequentist Inference: Normal Intuition
Consider the two-sample problem, where the ath group has n_a independent and normally distributed responses, with mean µ_a and variance σ^2, for a = 1, 2. Let θ = [µ_1, µ_2, σ], and suppose we are interested in β = b(θ) = µ_2 − µ_1. The t-test is valid for testing the null that β = β_0, and it is the uniformly most powerful (UMP) unbiased test [39, p. 160] for this problem. UMP unbiasedness means that among the class of unbiased tests for this problem (i.e., tests for which the power at each parameter value in the alternative space is at least as large as the power at every parameter value in the null space), the t-test is the most powerful test no matter which θ ∈ Θ_1 the power is evaluated at.
We study this case first to define a "normal intuition" about frequentist inferences. This normal intuition is a series of properties that, if not met, conflict with many statisticians' intuitive feeling of how p-values and confidence regions ought to work. Here are those properties, which are met by the triple: the difference in sample means, β̂; the two-sided p-value from the t-test, p; and the 100(1 − α)% confidence interval on β associated with that p-value, (L, U).
• Reproducibility: Application of the method by two independent statisticians to the same data always gives the same results (as opposed to randomized tests).
• Confidence region is an interval: The confidence region created from p through equation 2.2 is an interval, meaning it can be written as (L, U) with all values within the interval belonging to the confidence region.
• Compatible Inferences: p ≤ α if and only if the (1 − α) confidence interval does not contain β_0.
• Accuracy (of coverage): Taken over repeated applications, the probability that the 100(1 − α)% confidence interval procedure includes β is equal to (1 − α) for all values of θ such that b(θ) = β.
• Centrality (of CI): The 100(1 − α)% CI is a central one, meaning P[L > β] ≤ α/2 and P[U < β] ≤ α/2.
• One-sided p-value from Two-sided p-value: Half of the two-sided p-value can be interpreted as a one-sided p-value in the apparent direction of the effect. For example, if β̂ > β_0 then we can reject H_0: β ≤ β_0 at level p/2.
• Directional Coherence (of p-value): The t-test method has "directional coherence", where we have expanded the definition of coherence of one-sided p-values to two-sided p-values with an estimate. Call a two-sided p-value function directionally coherent if the p-values are decreasing as β_0 gets farther from β̂. In other words, directionally coherent two-sided p-values have p(x, β_0^*) ≤ p(x, β_0) when either β_0^* < β_0 < β̂ or β̂ < β_0 < β_0^*. A two-sided p-value with this property can be interpreted as a coherent one-sided p-value in the appropriate direction. For example, if β̂ > β_0 then we can reject H_0: β ≤ β_0 at level p. (And for the t-test p-value, we can also reject at a level of p/2.)
• Monotonicity (of power): As the sample size increases, there is an increase in power under any probability model with parameter values in the alternative hypothesis space.
• Nestedness (of CIs): If we had used a larger confidence level, (1 − α^*) > (1 − α), then the 100(1 − α^*)% confidence interval, (L^*, U^*), would completely contain the 100(1 − α)% one, (L, U); in other words, L^* ≤ L < U ≤ U^*.
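As a concrete illustration of this triple, the short R sketch below (ours; simulated data) computes the t-test estimate, 95% CI, and two-sided p-value, and checks the compatible-inferences property at β_0 = 0.

# Illustration only: the t-test triple for the two-sample normal problem.
set.seed(1)
y1 <- rnorm(12, mean = 0.0, sd = 1)   # group 1
y2 <- rnorm(15, mean = 0.8, sd = 1)   # group 2
fit <- t.test(y2, y1, var.equal = TRUE, conf.level = 0.95)
beta_hat <- unname(fit$estimate[1] - fit$estimate[2])   # difference in sample means
ci <- fit$conf.int                                      # 95% CI for beta
p  <- fit$p.value                                       # two-sided p-value for beta0 = 0
# Compatible inferences: reject at the 0.05 level if and only if 0 is outside the CI
(p <= 0.05) == (0 < ci[1] | 0 > ci[2])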

Two-Sample Binomial: Failure of Normal Intuition
Now we turn to the two-sample binomial problem, where X_1 ∼ Binomial(n_1, θ_1) and independently X_2 ∼ Binomial(n_2, θ_2). Here the parameter of interest is typically one of three functions of θ = [θ_1, θ_2]: the difference (β_d = θ_2 − θ_1), the ratio (β_r = θ_2/θ_1), or the odds ratio (β_or = {θ_2(1 − θ_1)}/{θ_1(1 − θ_2)}). In this problem many of the normal intuition properties can fail; here are examples of those failures.
• Confidence region is not an interval: For some two-sided p-value functions (for example, the two-sided Fisher's exact test p-value inverted over null values of β_or), the resulting confidence region is not an interval and includes values of β_or both larger and smaller than 1. The cause of this behaviour is the lack of unimodality of the p-value function; see Figure 1.
• Incompatible inferences: If the confidence region is not an interval, we can create a valid CI by using the interval that covers the whole confidence region. But this will not give compatible inferences with the p-value function. Returning to the Fisher's exact test confidence region example, we can create a 95% confidence interval by "filling in the hole" as (0.177, 1.014) to create the matching confidence interval [see Section 4.1 or Ref. 6]. In this case, the two-sided p-value rejects the null that β_or = 1 at the 0.05 level, but the matching 95% confidence interval includes β_or = 1. This issue is different from the incompatible inferences that often occur from using different methods to calculate p-values and confidence intervals, which can be quite prevalent in this application. For example, the default for R (fisher.test in base R, version 3.5.1) and SAS (exact option in Proc Freq, version 9.4) uses the Fisher-Irwin two-sided p-value, but calculates the two-sided confidence interval on β_or by inverting two one-sided Fisher exact p-values [see e.g., 18,19].
• Imperfect Accuracy of Coverage: Because of discreteness, the valid confidence interval must have coverage larger than the nominal level for some values of θ, in order to ensure validity for all values of θ. Remember, the term "exact" is often used to mean valid (see Section 2.2), so an "exact" confidence interval may have coverage greater than the nominal level and not, as the term might imply, coverage exactly equal to the nominal level. Section 10 discusses relaxing the requirement of validity in order to have coverage closer to the nominal level "on average": slightly greater than nominal for some parameter values and slightly less for others.
• Non-Centrality of Confidence Interval: Although central (1 − α) CIs for the binomial problem are important, much has been written on non-central intervals. Agresti and Min [1] showed that by inverting certain two-sided tests, we get shorter confidence intervals than central ones. For the difference in proportions, this strategy often uses an unconditional exact (i.e., valid) version of a two-sided score test [see 16]. For x_1/n_1 = 5/9 and x_2/n_2 = 7/7, the difference in proportions is β̂_d = 0.444, the 95% confidence interval using this method is (0.005, 0.749), and the associated two-sided exact p-value for testing β_d = 0 is p = 0.0496. Because the 95% confidence interval is based on inverting a two-sided test, we cannot use p/2 = 0.0248 as a one-sided p-value showing that β_d > 0 at the 0.025 level. In fact, to ensure validity, we can only use the two-sided p-value as an upper bound on that one-sided p-value.
• Non-monotonicity of power: Continuing with the previous example (x_1/n_1 = 5/9 and x_2/n_2 = 7/7 using the unconditional exact two-sided score test), if we add one more observation to group 2 the two-sided p-value increases regardless of whether the extra observation is a failure (giving x_2/n_2 = 7/8 and p = 0.172) or a success (giving x_2/n_2 = 8/8 and p = 0.0510) [this example comes from 60]. Thus, it is not surprising that the power to reject at the two-sided 0.05 level when θ_1 = 0.4 and θ_2 = 0.9 is higher for n_1 = 9, n_2 = 7 (power = 61.9%) than for n_1 = 9, n_2 = 8 (power = 53.7%). Power non-monotonicity can also exist for common one-sided tests. Using a one-sided Fisher's exact test at the 0.025 level, the power to reject H_0: β_or = 1 when θ_1 = 0.01 and θ_2 = 0.80 is 71.7% when n_1 = n_2 = 5, but 63.2% when n_1 = n_2 = 6.
• Non-nesting Confidence Intervals: Wang [61] proposed a method for constructing the smallest one-sided confidence interval for the difference of two proportions. Consider x_1/n_1 = 2/7 and x_2/n_2 = 2/5. The lower one-sided 95% interval on the difference, β_d, is (−0.467, 1), but the 96% interval by the same method is (−0.442, 1). See Figure 2.
• Non-Coherence: For testing non-inferiority on a difference in proportions, Chan and Zhang [10] recommend the exact unconditional test based on the score test. Röhmel [54] gives the following illustrative example: the proportion of failures on control is x_1/n_1 = 130/248 and on new treatment is x_2/n_2 = 76/170, with the failure rate slightly lower on new treatment, β̂_d = −0.077. If we want to show that H_1: β_d < 0.025 the p-value is p = 0.0226, but if we want to show an even less stringent margin, H_1: β_d < 0.026, the p-value non-intuitively increases to p = 0.0239 (see Figure 3).
For the two-sample binomial problem, many attempts to increase power or get the smallest expected length CI result in violations of some of these "normal intuition" properties.
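The lack of unimodality behind the first two failures can be seen directly by plotting a two-sided Fisher's exact p-value against the null odds ratio, as in the following R sketch (ours; the 2×2 table below is an arbitrary illustration, not the data behind Figure 1).

# Illustration only: two-sided Fisher's exact p-value as a function of the null
# odds ratio beta_0.  For some data the p-value function is not unimodal, so the
# inverted confidence region need not be an interval.
tab <- matrix(c(7, 2, 3, 8), nrow = 2)   # hypothetical 2x2 table
or_grid <- exp(seq(log(0.2), log(200), length.out = 400))
pvals <- sapply(or_grid, function(b0) fisher.test(tab, or = b0)$p.value)
plot(or_grid, pvals, type = "l", log = "x",
     xlab = "null odds ratio", ylab = "two-sided p-value")
abline(h = 0.05, lty = 2)   # the 0.05 confidence region is where the curve lies above this line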

Choosing the Effect Measure
Choosing the effect measure depends on the application, so we examine a real application to discuss the issues. Coulibaly et al. [12] studied a parasite called Mansonella perstans that infects people in parts of Africa. The usual drugs that kill other similar parasites had not been effective at killing M. perstans. Coulibaly et al. [12] realized that in this case there was a symbiotic bacterium, Wolbachia, that helped the M. perstans survive. They suspected that giving a common antibiotic, doxycycline, to kill the bacteria might in fact help cure the patient of M. perstans. To study this, some patients were randomized to the treatment group (received doxycycline) and some to the control group (received no treatment). There are issues of missing data that we ignore for simplicity. The results are that at 12 months x_2 = 67 out of n_2 = 69 subjects who received doxycycline had cleared the M. perstans from their blood, while only x_1 = 10 out of n_1 = 63 who got no treatment cleared the parasite. There are several reasonable choices for how to measure the effect: the difference in clearance rates, the ratio of clearance rates, the ratio of failure rates, and the odds ratio of clearance rates. Although the choice is often dominated by what is most natural to the intended audience, there are some statistical issues related to this choice.
Without loss of generality, we define the effect measures as measuring how much larger θ_2 is than θ_1. The opposite effect can be measured by switching group labels. But we could also simultaneously switch group labels and switch the responses. If the effect remains the same after this double switching, we say that the measure has symmetry equivariance. The measures β_d and β_or have symmetry equivariance; however, β_r does not have it, as we demonstrate with the example. Let θ̂_2 = 67/69 ≈ 0.97 and θ̂_1 = 10/63 ≈ 0.16. An estimate of the rate ratio for success (cleared parasites at 12 months) is θ̂_2/θ̂_1 ≈ 6.12. The rate ratio is often called the relative risk, but in this case the "risk" is the risk of getting cured. A different expression of the same data would be to measure the ratio of the rates of failure (those still having detectable parasites at 12 months). Let θ̂^F_2 = 2/69 ≈ 0.03 and θ̂^F_1 = 53/63 ≈ 0.84; then an estimate of the relative risk of failure is θ̂^F_1/θ̂^F_2 ≈ 29.0. In this latter case the control group looks about 29 times worse than the treatment group, while if we look at the rate ratio for success the treatment group looks only about 6 times better than the control group. So how many times better treatment is than control depends on which way we measure risk. This is a violation of symmetry equivariance. Despite this, the rate ratio is often used because it is easy to understand [see e.g., 12], or because it has become the parameter of choice within a field so that its use facilitates comparisons between studies.

The difference has symmetry equivariance. If we measure the difference in rates of disease rather than the difference in rates of cure we get exactly the negative of the difference, as we might expect. Similar to the relative risk, the difference is often used because it is easy to understand. Additionally, the sample difference in rates is always defined, unlike the ratio, which is undefined when θ̂_1 = θ̂_2 = 0.

Figure 4 gives plots of the three statistics using θ̂_2 and θ̂_1 with n_1 = n_2 = 8. The plots go from dark blue (θ̂_2 is larger) to white (θ̂_1 = θ̂_2) to dark red (θ̂_1 is larger), with black denoting indeterminate. Because of the indeterminate black areas, the ordering of the sample space for the ratio and odds ratio is not straightforward (see Section 5.3). The ordering of the measures on the parameters themselves would give a continuous version of Figure 4, and the black regions would reduce to points at (θ_1, θ_2) = (0, 0) or (1, 1). The bottom panels show the lack of symmetry equivariance for β_r. Comparing the panel for β_or with the two different ratio panels, we see that the lower left hand corner of the β_or panel is similar to the lower left hand corner of the panel for β̂_r = θ̂_2/θ̂_1: for small θ values, β̂_or is a good approximation to β̂_r. Similarly, for both θ values close to 1, β̂_or is a good approximation to the failure ratio θ̂^F_1/θ̂^F_2.

The odds ratio is the most complicated of the three measures, but it has some nice properties. It is very important for the case-control design used to study rare diseases, because the odds ratio of disease given exposure is equal to the odds ratio of exposure given disease [see 8]. Also, for performing regression on binary observations, logistic regression allows linear predictors to be used to model the log odds, and effects of binary covariates can be expressed as odds ratios.
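The computations behind these numbers are elementary; the following R sketch (ours) reproduces the three effect measure estimates for the doxycycline example and shows the symmetry-equivariance issue for the ratio.

# Illustration only: effect measures for the M. perstans example.
x1 <- 10; n1 <- 63   # control: cleared parasites at 12 months
x2 <- 67; n2 <- 69   # doxycycline: cleared parasites at 12 months
t1 <- x1 / n1; t2 <- x2 / n2
beta_d  <- t2 - t1                             # difference in clearance rates
beta_r  <- t2 / t1                             # ratio of clearance rates (about 6.1)
beta_or <- (t2 / (1 - t2)) / (t1 / (1 - t1))   # odds ratio of clearance
# Double switch (swap group labels and swap success/failure): beta_d and beta_or
# are unchanged, but the ratio changes from about 6.1 to about 29.
beta_r_switched <- (1 - t1) / (1 - t2)         # ratio of failure rates, control vs doxycycline
c(beta_d, beta_r, beta_or, beta_r_switched)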
An advantage of the odds ratio for the two-sample binomial case is that by conditioning on the total number of successes in both groups, the probability distribution reduces to a noncentral hypergeometric distribution which is a function of β or . This is discussed more in Section 6.

Defining a Matched Triple
Once we choose an effect measure, we choose an appropriate triple (an estimator, confidence interval, and p-value function) for inferences. Deciding on the estimator is not the focus of this paper. We will not specify the estimator except to require that it is within the confidence interval. We focus mostly on choosing the CI and p-value function. Except in Section 10, we only consider triples that are valid (i.e., the CI and p-value are both valid) and reproducible. Because we require reproducibility, the triple based on the UMP unbiased (and randomized) test is not allowed. Although one could define a triple where the p-value function and confidence interval are derived from different procedures, for focus we will not consider those kinds of triples in this paper. We define a matched triple as one where the confidence interval is derived from the p-value function or vice versa. A matched triple is slightly different from a compatible triple (i.e., a triple where the p-value function and the confidence interval procedure are compatible). For example, as we have shown in Section 2.4, for some p-value functions it is not possible to get a compatible confidence interval. Loosely speaking, if we start with a valid p-value function, the matching CI is the valid CI that gives compatible inferences as much as is possible, and vice versa if we start with a valid CI.
Here is a precise definition of a matched triple. If we start with p(x, β_0), an associated confidence region is given by equation 2.2, and the matching CI is the smallest interval that contains that confidence region. In other words, if the confidence region has holes in it, then those holes are "filled in". On the other hand, if we start with (L, U) = CI(x, 1 − α), then the matching p-value function is the smallest α such that β_0 is outside CI(x, 1 − a) for all a ≥ α; more precisely,

p(x, β_0) = inf { α : β_0 ∉ CI(x, 1 − a) for all a ≥ α }.

Theorem 4.1 (stated formally and proved in the Appendix) connects these definitions to compatibility: it says we must have nested CIs and coherent p-values in order to have compatible inferences. These ideas are best understood graphically. Figure 1 shows lack of directional coherence; for every β_0 there is only one p-value, and the two-sided p-value function is not unimodal. Similarly, Figure 3 shows lack of coherence. Figure 2 shows non-nestedness; for every α there is only one lower limit, and the lower limit is not a monotonic function of the level.
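A rough computational sketch of the two matching constructions (ours; the functions pfun and cifun are hypothetical placeholders for any valid p-value procedure or nested CI procedure) is:

# Illustration only: matching CI from a p-value function (fill in the holes) and
# matching p-value from a family of nested CIs, both on finite grids.
matching_ci <- function(x, pfun, beta0_grid, alpha = 0.05) {
  keep <- sapply(beta0_grid, function(b0) pfun(x, b0)) > alpha
  range(beta0_grid[keep])   # smallest interval containing the confidence region
}
matching_pvalue <- function(x, cifun, beta0, alpha_grid = seq(0.001, 0.999, by = 0.001)) {
  outside <- sapply(alpha_grid, function(a) { ci <- cifun(x, 1 - a); beta0 < ci[1] || beta0 > ci[2] })
  ok <- rev(cumprod(rev(outside))) == 1   # TRUE where beta0 is outside CI(x, 1 - a) for ALL a >= alpha
  if (any(ok)) min(alpha_grid[ok]) else 1
}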

Directional Inferences
Typically, if a researcher finds a significant result from the two-sided p-value for testing β = β_0, they almost always are interested in interpreting the result in terms of whether β > β_0 or β < β_0. In other words, the two-sided hypothesis test is often treated as a three-decision rule: (1) fail to reject β = β_0, (2) reject β = β_0 and conclude β > β_0, or (3) reject β = β_0 and conclude β < β_0. If the two-sided p-value has directional coherence and we reject H_0: β = β_0 at level α, we can additionally reject at level α either H_0: β ≤ β_0 (when β̂ > β_0) or H_0: β ≥ β_0 (when β̂ < β_0). Consider comparing two triples that both have compatible inferences, one with a central CI, and one with a non-central CI. For the non-central triple (i.e., the one with the non-central CI) the associated two-sided hypothesis test may be slightly more powerful, but if the non-central triple is also applied to a subsequent one-sided hypothesis (as in the three-decision rule), it can be quite a bit less powerful than the central one. To see this, start with a nested central CI, say (L, U), and pair it with its matching two-sided p-value, say p_C. By Theorem 4.1, this means that whenever the 100(1 − α)% CI excludes β_0 then p_C ≤ α, and we can reject H_0: β = β_0 at level α. After rejecting the two-sided hypothesis at level α, we can reject one of the one-sided hypotheses at level α/2; if β_0 < L we reject H_0: β ≤ β_0, while if β_0 > U we reject H_0: β ≥ β_0. A non-central CI does not allow one-sided rejections at the α/2 level. Freedman [26] discusses this issue in terms of clinical trials and, using these arguments as well as some Bayesian motivation, recommends performing two one-sided tests at the α/2 level, which is another way of describing the use of central CI methods for three-decision rules.
In summary, if we desire directional inferences, and we want to compare the power to detect a one-sided effect in a fair way (i.e., both methods bound the one-sided type I error rates of the three decision rule at the same level), then we need to compare a method with a two-sided p-value and its matching 100(1 − 2α)% non-central CI, with a pair of one-sided p-values and its matching 100(1 − α)% central CI. This means that when comparing expected lengths of CIs, if directionality of effect is important, we should compare the expected length of a 100(1 − 2α)% non-central CI with the expected length of a 100(1 − α)% central CI. Because directionality is usually important, our default recommendation is to use central confidence intervals and perform three-sided inferences as described above.

Basic Procedure for Defining p-values
Suppose larger θ is better. We want to know if treatment 2 is better than treatment 1 (θ_2 > θ_1), and if so by how much. Let T(x) be a function of the data, where larger values of T(x) indicate that treatment 2 is better than treatment 1, and T(X) is defined for all possible values of X. For example, a simple T(x) is the difference in observed proportions (see Figure 4, upper left). For this section and the next (Section 5.2), we require that T is a function of x only. Later, in Section 5.5, T may depend on α, and in Section 5.6, T may depend on β_0. Barnard [4] outlined convexity conditions which ensure that larger values of T suggest treatment 2 is better. Barnard's convexity (BC) conditions are that T([x_1, x_2]) is non-increasing in x_1 for fixed x_2 and non-decreasing in x_2 for fixed x_1. Even within functions that satisfy the BC conditions, there are many choices.
In later sections we explore the choice of T further, but for now imagine the simple ordering function T(x) = θ̂_2 − θ̂_1 plotted in Figure 4 (upper left panel), which meets the BC conditions. Once we have decided on the ordering function, T, we can create valid unconditional one-sided p-values: p_U for testing the null H_U0 (defined as H_0: β ≥ β_0) and p_L for testing H_L0 (H_0: β ≤ β_0), using

p_U(x, β_0) = sup_{θ: b(θ) ≥ β_0} P_θ[T(X) ≤ T(x)]   and   p_L(x, β_0) = sup_{θ: b(θ) ≤ β_0} P_θ[T(X) ≥ T(x)].     (5.2)

These p-values are valid because the supremum is taken over the whole null hypothesis space, so that P_θ[p(X, β_0) ≤ α] ≤ α for all α and all θ ∈ Θ_0. Inverting the two one-sided p-values gives one-sided confidence limits; combining the two 100(1 − α/2)% one-sided limits gives the 100(1 − α)% central confidence interval (equation 5.3), and the corresponding central two-sided p-value is p_C(x, β_0) = min{1, 2 min(p_L(x, β_0), p_U(x, β_0))}. Lloyd and Kabaila [42] and Wang [61] show, among other results, that the lower and upper one-sided confidence limits formed this way retain a logical ordering analogous to Barnard's convexity conditions.

Barnard [4] proposed a test he called the CSM test, based on an ordering function that starts from the most extreme point and adds points one at a time, choosing the point with the lowest valid unconditional p-value among those that meet the BC conditions and the symmetry equivariance condition. The p-value function used could be the p_C related to testing θ_2 = θ_1, e.g., p_C(x, 0) for testing β_d = 0. Additionally, Barnard [3] outlined the general exact unconditional test, and those tests are sometimes referred to as "Barnard's test" [see e.g., 59, 13], but we do not use that terminology to avoid confusion with Barnard's CSM test. Röhmel and Kieser [55] discussed one-sided exact unconditional tests using Barnard's CSM p-value ordering, except with more ties broken to get higher power, an idea discussed in the next section.
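As a concrete (if crude) illustration of equation 5.2, the R sketch below (ours) computes p_L(x, 0) for H_0: β_d ≤ 0 using T(x) = difference in sample proportions; it assumes, as discussed in Section 11, that for a BC ordering the supremum can be taken on the boundary θ_1 = θ_2, and it approximates that supremum by a grid search.

# Illustration only: exact unconditional one-sided p-value p_L(x, 0) for
# H_0: beta_d <= 0, ordering by the difference in sample proportions.
pL_uncond <- function(x1, n1, x2, n2, theta_grid = seq(0.0001, 0.9999, length.out = 1000)) {
  t_obs <- x2 / n2 - x1 / n1
  grid  <- expand.grid(y1 = 0:n1, y2 = 0:n2)            # whole sample space
  extreme <- (grid$y2 / n2 - grid$y1 / n1) >= t_obs     # as or more extreme than observed
  max(sapply(theta_grid, function(th)
    sum(dbinom(grid$y1, n1, th) * dbinom(grid$y2, n2, th) * extreme)))
}
pL_uncond(5, 9, 7, 7)   # data from Section 2.4 (x1/n1 = 5/9, x2/n2 = 7/7)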
Martín Andrés, Sánchez Quevedo and Silva Mato [44] proposed a good all-purpose ordering, which is to base the ordering on the one-sided mid-p-value from Fisher's exact test (see equation 10.1). We explore the power properties of this ordering in Section 13. Alternatively, the ordering can be tailored to a specific application. For example, Gabriel et al. [27] proposed an ordering to optimize power for certain types of animal experiments where θ_1, the parameter for the control group, is expected to be nearly 1.

Improving Power by Breaking Ties: Refinement of Ordering Functions
One important way to improve the power of some unconditional exact tests based on a function T is to break any ties that exist in the ordering function.
If T is an ordering function with ties, and T^* is an ordering function that gives the same ordering as T at all the untied values and additionally breaks some ties, then we say T^* is a refinement of T. The unconditional exact p-values formed with T^* are always less than or equal to those formed with T [see 56, p. 158]. Similarly, one-sided exact unconditional lower confidence limits formed using T^* are always at least as large as the ones formed using T [37,61]. We next describe one specific refinement, or tie-breaking algorithm, for the difference in proportions, which as far as we are aware has not been specifically described in the literature and has not been available in software (although there are some closely related methods). We can order within each set of tied values using the Wald statistic for β̂_d, i.e., ordering by

Z(x) = β̂_d / √( θ̂_1(1 − θ̂_1)/n_1 + θ̂_2(1 − θ̂_2)/n_2 ).

This leaves the ties at β̂_d = 0, but otherwise defines points estimated with more precision as more extreme, where extreme means further away from zero. Not all of the ties at values with β̂_d ≠ 0 are broken. For example, consider the ties at β̂_d = 5/8 that happen at the x values [0, 5], [1, 6], [2, 7], and [3, 8], for n_1 = n_2 = 8. This method still leaves tied the two pairs of points, {[0, 5], [3, 8]} and {[1, 6], [2, 7]}. These remaining ties, we argue, should remain tied in order for the ordering to retain symmetry equivariance. Note that this suggested ordering is similar, but not equivalent, to ordering the entire sample space by Z(x) [as was studied in 47].
If we break the ties in this way, then the BC conditions are still met, because only at the boundaries (where the ties are broken according to the BC conditions) do the ties occur at two points x a and x b with x a1 = x b1 or x a2 = x b2 . All of the other ties will not have any x a1 = x b1 or x a2 = x b2 so they can be broken in any manner and the overall ordering function, T * , will meet the BC conditions. This is important for computation (see Section 11). Further, the proposed T * (tie-breaking on difference in proportions) does not depend on α or β 0 like some score test based methods (see Sections 5.5 and 5.6) so avoids problems with nesting and coherence.
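A rough R sketch of this refinement (ours; it ignores floating-point issues when detecting ties) orders the sample space first by the difference in proportions and then, within ties, by the Wald statistic Z(x), leaving the symmetric pairs tied:

# Illustration only: refined ordering T* for beta_d (primary key: difference in
# proportions; secondary key: Wald Z).
refined_order <- function(n1, n2) {
  grid <- expand.grid(y1 = 0:n1, y2 = 0:n2)
  p1 <- grid$y1 / n1; p2 <- grid$y2 / n2
  bd <- p2 - p1
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  z  <- ifelse(se > 0, bd / se, sign(bd) * Inf)   # degenerate variance: +/- Inf
  z[se == 0 & bd == 0] <- 0                       # e.g., [0,0] and [n1,n2] stay at 0
  # lexicographic ranks: points equal on both keys remain tied
  grid$Tstar <- rank(bd, ties.method = "min") + rank(z, ties.method = "min") / (nrow(grid) + 1)
  grid[order(grid$Tstar), ]
}
head(refined_order(8, 8), 10)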

Ordering Functions for Ratio and Odds Ratio
Performing exact unconditional tests on β r or β or is not straightforward. We consider β r first since it is simpler. One problem is that if we observe x = [0, 0], this could occur with high probability if the true ratio was 100 or if it was 1/100 as long as both θ 1 and θ 2 were very small. So if T (x) is designed so that larger values suggest θ 2 > θ 1 , it is not clear how to define T ([0, 0]) if our interest is in β r .
Since x = [0, 0] gives us no information about β r , we must deal with that point in a special way; we set the p-value at x = [0, 0] to 1 for tests of β r regardless of the null hypothesis. This means that x = [0, 0] is placed "deepest" within the null. Following equations 5.2, this implies T ([0, 0]) can be thought of as the largest value when calculating p U (x, β 0 ) and the smallest value when calculating p L (x, β 0 ). A similar issue applies to the odds ratio, except in that case in addition to x = [0, 0], the point x = [n 1 , n 2 ] also has no information about β or .
For clarity, we rewrite equations 5.2 applied to all three parameters. Let X_I denote the set of X values with information about β. Then if x ∉ X_I, set p_U(x, β_0) = p_L(x, β_0) = 1; otherwise let

p_U(x, β_0) = sup_{θ: b(θ) ≥ β_0} P_θ[T(X) ≤ T(x) and X ∈ X_I],

and analogously,

p_L(x, β_0) = sup_{θ: b(θ) ≤ β_0} P_θ[T(X) ≥ T(x) and X ∈ X_I].

Since we never reject when x ∉ X_I, these definitions give valid p-values, and additionally when x ∉ X_I we do not need to define T(x). The simple ordering function given by the estimate of β_r or β_or (even when using a tie-breaking ordering similar to what was done for β_d) is not very powerful (see Section 13) and is not recommended. Typically, we order using a score function (see Section 5.6) since it gives more reasonable power.

Other Improvements: E+M and Berger-Boos
Another method to apparently improve the ordering statistic for any efficacy parameter (difference, ratio, or odds ratio) is the estimated and maximized (E+M) p-value [41]. In this method, we replace an ordering statistic, T, with T^*, where T^* is an estimated p-value when testing H_L0 (or the negative of the estimated p-value when testing H_U0). We estimate the p-value by plugging in θ̂_0 instead of taking the supremum of θ under the null, where θ̂_0 is the maximum likelihood estimator of θ ∈ Θ_0. For example, the approximation for p_L in expression 5.2 uses p̂_L(x, β_0) = P_{θ̂_0}[T(X) ≤ T(x)]. Then we "maximize" using T^*(x) = p̂_L(x, β_0) instead of T as the ordering function; that is, we calculate the exact unconditional p-value using expression 5.2 by taking the supremum. Lloyd [41] studied this method and observed that when T^* (the approximate p-value) is used as the ordering statistic, the resulting exact unconditional p-value is generally smaller than the exact unconditional p-value based on T. The process can be repeated (replace T^* by its approximate p-value), but the additional reduction appears to be minimal.
Berger and Boos [5] introduced a popular adjustment that tends to reduce exact unconditional p-values. Instead of taking the supremum over the entire null hypothesis parameter space, take the supremum only over C_γ, a 100(1 − γ)% confidence set for θ restricted to be in the null space, and then add γ to ensure validity. This is usually done by re-expressing the parameter space (θ_1, θ_2) as (β, ψ), where ψ is a nuisance parameter, then defining C_γ as the intersection of Θ_0 and the set of θ values with ψ in its 100(1 − γ)% confidence interval. A Berger-Boos version of p_U of expression 5.2 uses

p_{Uγ}(x, β_0) = γ + sup_{θ ∈ C_γ} P_θ[T(X) ≤ T(x)].

This is not optimal, since we may be able to improve it by using p_{Uγ}(x, β_0) as an ordering function. Nevertheless, it usually provides some reduction in p-values [see e.g., 41].
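A minimal R sketch of the Berger-Boos adjustment (ours) for H_0: β_d ≤ 0 follows; it assumes the supremum can be restricted to the boundary θ_1 = θ_2 = θ and uses a 100(1 − γ)% Clopper-Pearson interval for the common θ, computed from the pooled data, as C_γ.

# Illustration only: Berger-Boos version of the one-sided unconditional p-value
# for H_0: beta_d <= 0, ordering by the difference in sample proportions.
pL_bb <- function(x1, n1, x2, n2, gamma = 1e-4, grid_size = 1000) {
  t_obs <- x2 / n2 - x1 / n1
  grid  <- expand.grid(y1 = 0:n1, y2 = 0:n2)
  extreme <- (grid$y2 / n2 - grid$y1 / n1) >= t_obs
  cg <- binom.test(x1 + x2, n1 + n2, conf.level = 1 - gamma)$conf.int   # Clopper-Pearson for the nuisance theta
  sup_p <- max(sapply(seq(cg[1], cg[2], length.out = grid_size), function(th)
    sum(dbinom(grid$y1, n1, th) * dbinom(grid$y2, n2, th) * extreme)))
  min(1, sup_p + gamma)   # add gamma to keep the p-value valid
}
pL_bb(5, 9, 7, 7)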

Ordering Functions That Depend on Significance Level
Kabaila and Lloyd [36] showed that for the one-sided 100(1 − α/2)% exact unconditional upper confidence limit, the ordering function, T , that maximizes the asymptotic efficiency is an approximate 100(1 − α/2)% one-sided upper confidence limit itself. This means that you would use a different ordering function for the upper and lower limit, and in fact would use a different ordering function for different confidence levels.
Wang [61] and Wang and Shan [62] also proposed an ordering function to give the smallest CI, and the calculation of the ordering function itself is iterative and quite involved, similar to the CSM test of Barnard [4]. The precise definition of the ordering is notationally cumbersome, but the idea is roughly as follows. Consider the lower 100(1 − α/2)% one-sided limit. Start from the most extreme point x = [0, n 2 ]. Then add points one at a time, picking the point, x a , that gives the largest L(x a , 1 − α/2) and belongs to the set of closest neighboring points with the already included points, where closest neighbor is defined in terms of the BC conditions. The algorithm ensures that the lower limit function meets the BC conditions. Because each added L(x) value is as large as possible, this ordering ensures that if the resulting ordering function T gives the finest partition (there are no ties), then any valid 100(1 − α/2)% one-sided lower limit that meets the BC conditions and uses T for ordering, say L * , has L * (x) ≤ L(x) for all x [see 61,62].
Although we obtain this optimality property, the price is that the ordering function depends on α. Thus, we can have different ordering functions for different α, which can lead to non-nestedness (see Figure 2).

Ordering Functions That Depend on Hypothesis Space Boundaries
Basing the ordering statistic on a score test can increase the power over using simple Wald-type Z statistics [see 9]. Although this increased power has been shown in several simulation studies, it is not clear whether the increase in power is due to the fewer ties for the score test or to some other difference between the ordering statistics. A problem with the score statistic is that the induced ordering may change with β_0, since score statistics use β_0 in their calculation, whereas most other test statistics do not. This can produce non-coherence, as was shown in Section 2.4 and Figure 3.
Although the exact unconditional p-values and confidence intervals of this section can be powerful, they are more difficult to calculate than the exact conditional ones described in the next two sections: Section 6 for p-values, and Section 7 for compatible confidence intervals.

One-Sided Conditional Exact Tests
Yates [65] argues that conditioning on the total number of failures is the proper strategy for this problem, and most of the discussants of the paper agreed with this (including Barnard, who first suggested the unconditional approach). One of the main reasons that others had recommended the unconditional approach is an overemphasis on the fixed significance level and the resulting power: at a fixed level, the unconditional tests have more power because their sample space has more values and hence is less discrete. Yates [65] argues (in his Section 9) that over-reliance on the nominal significance level is not a good reason to prefer the unconditional test, and that p-values should be reported instead of accept/reject decisions. Yates [65] also argues that we should condition on the total number of events (X_1 + X_2), because that statistic is approximately ancillary to the effects of interest. Recent reviews [e.g., 43] have emphasized power arguments, and we review the choice of test from that perspective in Section 13. Historically, conditional tests have been important because of their much smaller computational burden compared to unconditional tests. The computational burden for unconditional tests has become less important, although for some applications it may be a non-trivial concern (e.g., big data applications with small sample sizes but very many covariates being tested).
For the unconditional one-sided exact method, to calculate p-values we need to take the supremum of the probability that T(X) is more extreme than the observed T(x) over the parameter space Θ_0 (see e.g., equation 5.2). This is a difficult calculation (see Section 11). An alternative method is to condition on the sum s = x_1 + x_2 and calculate the conditional probability. The resulting conditional distribution is the extended hypergeometric distribution [34], also called Fisher's noncentral hypergeometric distribution [25], which depends only on β_or. Additionally, because s is fixed we can write the ordering function in terms of X_2 only. In fact, essentially the only ordering function that makes sense and meets the BC conditions is X_2 itself (ordering on n_1 − X_1 is equivalent). So this simplifies the calculations if the effect measure is β_or. For example, for testing H_0: β_or ≥ β_0 use

p_Uc(x, β_0) = sup_{β_or ≥ β_0} P_{β_or}[X_2 ≤ x_2 | X_1 + X_2 = s] = P_{β_0}[X_2 ≤ x_2 | X_1 + X_2 = s],     (6.1)

where the last step comes because the conditional distribution is monotone in β_or [48]. The other conditional one-sided p-value, p_Lc, is calculated similarly, except with the inequality reversed. These conditional p-values for testing H_0: β_or = 1 (or equivalently H_0: θ_1 = θ_2) are Fisher's exact one-sided p-values. We calculate the central confidence intervals on β_or using equation 5.3, except using the conditional exact one-sided p-values instead of the unconditional ones. Now consider the other measures, β_d and β_r. At the boundary of equality, the one-sided hypotheses are equivalent. For example, the following three null hypotheses give equivalent Θ_0: (odds ratio) H_0U: β_or ≥ 1, (ratio) H_0U: β_r ≥ 1, and (difference) H_0U: β_d ≥ 0; analogously for the other one-sided hypotheses. But for boundaries not representing equality, Θ_0 changes depending on the effect measure, and the simplification of the p-value calculation works only for the odds ratio. For example, for the difference in proportions (i.e., β = β_d) there is no simplification analogous to equation 6.1. Figure 5 shows that the exact one-sided conditional confidence limit on β_d is not efficient, because the conditional distribution depends only on β_or. The upper 100(1 − α/2)% one-sided limit for β_d, say U_d, based on the upper limit for β_or, say U_or, is obtained by maximizing θ_2 − θ_1 over all (θ_1, θ_2) pairs with odds ratio equal to U_or [see 57, Section 2]. There are better ways to get confidence intervals on β_d and β_r that provide compatible inferences with the one-sided p-values with β_0 representing θ_1 = θ_2. We show these in the next section.
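These conditional quantities are available in base R; the sketch below (ours) uses fisher.test, whose or argument sets the null odds ratio β_0 and whose reported confidence interval is, as noted in Section 2.4, the central one obtained by inverting the two one-sided exact p-values.

# Illustration only: exact conditional (Fisher) one-sided p-values and the
# central CI on beta_or for the doxycycline example.
# Rows = groups (doxycycline, control); columns = (cleared, not cleared).
tab <- matrix(c(67, 10, 2, 53), nrow = 2,
              dimnames = list(c("doxy", "control"), c("cleared", "not cleared")))
p_Uc <- fisher.test(tab, or = 1, alternative = "less")$p.value      # tests H_0: beta_or >= 1
p_Lc <- fisher.test(tab, or = 1, alternative = "greater")$p.value   # tests H_0: beta_or <= 1
central_p  <- min(1, 2 * min(p_Uc, p_Lc))   # central two-sided p-value
central_ci <- fisher.test(tab)$conf.int     # central 95% CI on beta_or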

Melded Confidence Intervals
Fay, Proschan and Brittain [21] developed melded confidence intervals, a general method for creating confidence intervals for the two-sample case that is closely related to the confidence distribution (CD) approach [63]. Roughly, the 100(1 − α)% melded confidence interval is a central confidence interval that takes the middle 100(1 − α)% of a function of random variables, each created from one-sided confidence intervals. Let L_{θa}(x, 1 − α/2) and U_{θa}(x, 1 − α/2) be exact nested 100(1 − α/2)% one-sided confidence limits for θ_a, for a = 1, 2. The lower and upper CD random variables for group a are W_{La} = L_{θa}(x, A_{a1}) and W_{Ua} = U_{θa}(x, A_{a2}), where the A_{ai} are independent uniform random variables. This gives W_{La} ∼ Beta(x_a, n_a − x_a + 1) with expectation x_a/(n_a + 1), and W_{Ua} ∼ Beta(x_a + 1, n_a − x_a) with expectation (x_a + 1)/(n_a + 1), where using limits of the parameters going to zero we define Beta(0, n + 1) as a point mass at 0 and Beta(n + 1, 0) as a point mass at 1. If the responses were normally distributed, then the lower and upper CD random variables would be identical, but for the binomial case (and for discrete random variables in general) the lower and upper CD random variables (CD-RVs) are different: the lower CD-RV is stochastically smaller than the upper CD-RV. To get a melded confidence interval on b(θ), Fay, Proschan and Brittain [21] require that b(θ) is a monotonic function of the parameters, such that β = b(θ) is increasing in θ_2 and decreasing in θ_1. For the binomial problem all three parameters (β_d, β_r and β_or) meet the monotonicity requirements. Then the 100(1 − α)% (two-sided) melded confidence interval is

( q( b([W_{U1}, W_{L2}]), α/2 ),  q( b([W_{L1}, W_{U2}]), 1 − α/2 ) ),

where q(Y, a) is the ath quantile of a random variable Y. The interval is designed conservatively by using [W_{U1}, W_{L2}] for the lower limit, but [W_{L1}, W_{U2}] for the upper limit. Fay, Proschan and Brittain [21] conjectured that if the one-sample confidence interval procedures are valid, central, and nested, and β = b(θ) is monotonic within each parameter, then the melded confidence interval is valid, nested and central. Some mathematical results, simulations in several situations, and extensive numeric calculations in the binomial case supported this conjecture; a rigorous proof of the conjecture is still needed. Let p_Um(x, β_0) and p_Lm(x, β_0) be the one-sided melded p-values, the p-values that match with the one-sided melded confidence limits. For the binomial case, Fay, Proschan and Brittain [21] showed that the one-sided melded p-values equal the exact one-sided conditional p-values when testing a null whose boundary β_0 corresponds to θ_1 = θ_2. For example, for testing H_0: β_d ≥ 0, we have p_Um(x, 0) = p_Uc(x, 0), and for testing H_0: β_r ≥ 1, we have p_Um(x, 1) = p_Uc(x, 1). Because the melded confidence intervals are nested, by Theorem 4.1 the melded confidence intervals are compatible with the p-values from the one-sided Fisher's exact test.
The melded CIs for β_or are very close to the exact conditional ones, but the melded CIs for β_d are more efficient (lower limits are larger, and upper limits are smaller) than the exact conditional ones (see Figure 6).
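The melded interval for β_d is easy to approximate by simulating the CD random variables; the R sketch below (ours) is a Monte Carlo approximation only, whereas the exact2x2 package computes the interval by numerical integration.

# Illustration only: Monte Carlo approximation to the 100(1 - alpha)% melded CI
# for beta_d = theta_2 - theta_1.
meld_ci_mc <- function(x1, n1, x2, n2, alpha = 0.05, B = 1e5) {
  WL1 <- if (x1 == 0)  rep(0, B) else rbeta(B, x1, n1 - x1 + 1)       # lower CD-RV, group 1
  WU1 <- if (x1 == n1) rep(1, B) else rbeta(B, x1 + 1, n1 - x1)       # upper CD-RV, group 1
  WL2 <- if (x2 == 0)  rep(0, B) else rbeta(B, x2, n2 - x2 + 1)       # lower CD-RV, group 2
  WU2 <- if (x2 == n2) rep(1, B) else rbeta(B, x2 + 1, n2 - x2)       # upper CD-RV, group 2
  c(lower = unname(quantile(WL2 - WU1, alpha / 2)),                   # conservative pairing for the lower limit
    upper = unname(quantile(WU2 - WL1, 1 - alpha / 2)))               # conservative pairing for the upper limit
}
set.seed(1)
meld_ci_mc(10, 63, 67, 69)   # doxycycline example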

Noninferiority and Equivalence Hypotheses
Two other types of hypotheses are noninferiority and equivalence hypotheses. Suppose we are comparing two treatments, and larger β means that the new treatment is better than the standard one. Let β_0 denote the value of β when θ_1 = θ_2, and define an equivalence region (β_ML, β_MU) with β_ML < β_0 < β_MU. For β ∈ (β_ML, β_0), although the standard treatment is better, the difference between the two is not substantial enough to be of practical importance. When we reject the one-sided hypothesis H_0: β ≤ β_ML versus H_1: β > β_ML, we declare the new treatment noninferior. This is just an "alternative is greater" one-sided hypothesis already discussed in Section 2.2. The equivalence hypothesis, however, is qualitatively different: the null is that β lies outside the equivalence region (β ≤ β_ML or β ≥ β_MU) and the alternative is that β_ML < β < β_MU. Just as the two-sided hypothesis is often treated as a three-decision rule (see Section 4.3), the statement after testing an equivalence hypothesis may be more expansive than either reject or fail to reject non-equivalence. Let (L, U) be a valid central nested 100(1 − α)% CI. Then we can make the following declarations based on the relationship between (L, U) and β_ML and β_MU:
• if β_ML < L < U < β_MU declare equivalence at level α,
• if β_ML < L < β_MU < U declare noninferiority at level α/2,
• if β_ML < β_MU < L < U declare (substantial) superiority at level α/2, or
• if L < U < β_ML < β_MU declare (substantial) inferiority at level α/2.

[Figure 6 caption: Lower and upper limits of 95% central confidence intervals by the exact conditional method and the melding method, for simulated data where n_a is uniform on 1 to 100 and x_a is uniform on 0 to n_a, with 1000 replications. Calculations used the exact2x2 R package for the melded confidence limits and fisher.test from the stats package for the exact conditional limits. The limits for β_or agree well, except for some extreme data (e.g., x_1/n_1 = 1/68 and x_2/n_2 = 57/61), perhaps caused by numeric issues in the computation, while the limits for β_d show that the melded intervals are shorter (lower limits are larger, upper limits are smaller).]
The last three statements are valid because of the centrality [see e.g., 28, for a similar statement].

Non-central Confidence Intervals and Associated Tests
Let T_ts(x) ≡ T_ts(x, α, β_0) be an ordering function for testing the two-sided null H_0: β = β_0, with smaller values suggesting β is further from the null. Then we can create exact unconditional two-sided p-values using

p_ts(x, β_0) = sup_{θ: b(θ) = β_0} P_θ[T_ts(X) ≤ T_ts(x)],     (9.1)

and exact conditional two-sided p-values using the analogous conditional probability, P_{β_0}[T_ts(X) ≤ T_ts(x) | X_1 + X_2 = s], which simplifies to a sum of f(j) over the values j in the conditional support with T_ts([s − j, j]) ≤ T_ts(x), where f is the probability mass function of the extended hypergeometric distribution with parameter β_or = β_0. If we order by the null conditional probability of the observed result (so that less probable results are more extreme), then the associated exact conditional p-value is the usual Fisher's exact test, which we call the Fisher-Irwin test, since it was proposed by Irwin [33] and to distinguish it from the central Fisher's exact test created by doubling the minimum of the one-sided Fisher's exact p-values. Using Fisher's exact p-values (either the Fisher-Irwin or the central version) as an ordering function in an unconditional exact test gives a version of Boschloo's test. Boschloo [7] showed that using the Fisher-Irwin p-values in this way gives a test that is uniformly more powerful than the Fisher-Irwin test. This superiority in power holds for both one-sided tests and central tests [see 43].
Blaker [6] studied non-central confidence sets that are always subsets of the central confidence sets in one-parameter distributions. To translate into this problem, we consider only the conditional distribution based on S = s and β = β_or. Start with T(x) = x_2, a one-sided ordering function for the conditional problem (see Section 6). Define the two conditional tail probabilities

F(x, β_0) = P_{β_0}[X_2 ≤ x_2 | S = s]  and  G(x, β_0) = P_{β_0}[X_2 ≥ x_2 | S = s].

Let the two-sided ordering function be T_ts(x, β_0) = min{F(x, β_0), G(x, β_0)}. Then the two-sided p-value is p_ts(x, β_0) as in equation 9.1, but computed from the conditional distribution, and the associated 100(1 − α)% confidence region is

C_ts(x, 1 − α) = {β_0 : p_ts(x, β_0) > α}.

Blaker [6] showed that this gives smaller confidence sets than the central CIs. Specifically, C_ts(x, 1 − α) ⊂ C_c(x, 1 − α), where C_c is the exact conditional central CI using the one-sided ordering function T(x) = x_2. Let the 100(1 − α)% matching confidence interval to p_ts be the smallest interval that contains C_ts.
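At β_0 = 1 the conditional distribution is the ordinary hypergeometric, so a Blaker-style conditional p-value can be sketched with base R alone (ours; for general β_0 the extended hypergeometric pmf would be needed):

# Illustration only: Blaker-style conditional two-sided p-value for H_0: beta_or = 1.
blaker_p <- function(x1, n1, x2, n2) {
  s <- x1 + x2
  k <- max(0, s - n1):min(n2, s)                       # support of X_2 given S = s
  f <- dhyper(k, n2, n1, s)                            # P[X_2 = k | S = s] when beta_or = 1
  tail_min <- pmin(cumsum(f), rev(cumsum(rev(f))))     # min(lower tail, upper tail) at each k
  sum(f[tail_min <= tail_min[k == x2]])                # total probability of points at least as extreme
}
blaker_p(10, 63, 67, 69)   # doxycycline example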
Agresti and Min [1] showed that if one wants to create two-sided CIs with shorter expected length, it is generally better to invert p-values from two-sided hypothesis tests that are not central. This makes sense because centrality is a restriction, and two-sided tests without that restriction will leave room for improving expected CI length. For the two-sample binomial problem, basing T ts (x, β 0 ) on score tests gives good expected CI length; see Chan and Zhang [10] for β d and Agresti and Min [2] for β or . Despite this apparent improvement, if directional inferences are needed then central confidence intervals are recommended (see Section 4.3).

Mid-p Methods: Improving Accuracy by Giving Up Validity
The mid-p value is a modification of a p-value for discrete data. Instead of calculating the probability of observing equal or more extreme responses, the mid-p value is 0.5 times the probability of equality plus the probability of more extreme responses. For example, the conditional exact p-value of equation 6.1 becomes

p_Uc,mid(x, β_0) = P_{β_0}[X_2 < x_2 | X_1 + X_2 = s] + 0.5 P_{β_0}[X_2 = x_2 | X_1 + X_2 = s].     (10.1)

Hwang and Yang [31] gave some optimality criteria for the mid-p approach applied to one-parameter situations, which applies to the conditional test using β_or since the conditional probability is completely described by the β_or parameter alone. For one-sided or two-sided hypothesis tests, they consider the loss given by the squared error between the p-value function and an indicator that β ∈ {b(θ) : θ ∈ Θ_0}, and show that for all β ∈ {b(θ) : θ ∈ Θ_1} the expected loss of the mid-p value is less than or equal to (and for β = β_0 strictly less than) the expected loss from any randomized exact p-value function (Theorems 3.3 and 4.3; see also Yang, Lee and Hwang [64]). Fellows [22] showed minimaxity under squared error loss and linear loss, and also showed that, of all non-randomized ordered decision rules, the mid-p version is the only one that has expectation 1/2 under the point null.
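For example, a one-sided mid-p value corresponding to equation 10.1 at β_0 = 1 can be computed from the hypergeometric distribution, as in this R sketch (ours):

# Illustration only: one-sided mid-p value from Fisher's exact test for H_0: beta_or >= 1.
fisher_midp_U <- function(x1, n1, x2, n2) {
  s <- x1 + x2
  phyper(x2 - 1, n2, n1, s) + 0.5 * dhyper(x2, n2, n1, s)   # P[X_2 < x_2 | S = s] + 0.5 P[X_2 = x_2 | S = s]
}
fisher_midp_U(9, 12, 4, 12)   # hypothetical data: 9/12 in group 1 versus 4/12 in group 2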

Computational Issues
Overall, the conditional p-values are much easier to calculate than the unconditional ones, since they do not require taking the supremum over the null space. The melded confidence intervals allow matching CIs to conditional tests of θ 1 = θ 2 , and are very quick to calculate, since they use numeric integration. There may be some precision issues in the numeric integration for extreme data sets.
The main computational speed issues are mostly with respect to the unconditional tests, since they require computing the supremum. Röhmel and Mansmann [56, p. 161] showed that for ordering statistics, T, that meet the BC conditions, the supremum in the p-value calculation is attained on the boundary between the hypotheses; for example,

p_L(x, β_0) = sup_{θ: b(θ) ≤ β_0} P_θ[T(X) ≥ T(x)] = sup_{θ: b(θ) = β_0} P_θ[T(X) ≥ T(x)].

For example, the score statistic on β_d [17] has been shown to follow the BC conditions for fixed β_0 [54]. Further, if T meets the BC conditions and does not depend on β_0, then Theorem 3.1 of Kabaila [35] shows that the exact unconditional one-sided p-values based on T are either nonincreasing (for p_U(x, β_0)) or nondecreasing (for p_L(x, β_0)) in β_0 for fixed x. This property means that for these p-values, the associated 100(1 − α/2)% one-sided confidence limits can be easily calculated by finding the value β_0 where the p-value equals α/2.
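The last point is easy to illustrate: when the one-sided p-value is monotone in β_0, the confidence limit is a root of p(x, β_0) = α/2 and can be found with a one-dimensional root finder, as in this R sketch (ours, using the conditional p-value for β_or so that everything runs in base R).

# Illustration only: a one-sided confidence limit found by root-finding on a
# p-value that is monotone in beta_0 (here the exact conditional p-value for
# H_0: beta_or <= beta_0, so the root is the lower limit of the central 95% CI).
tab <- matrix(c(67, 10, 2, 53), nrow = 2)   # doxycycline example (rows = groups, cols = cleared / not)
pL <- function(b0) fisher.test(tab, or = b0, alternative = "greater")$p.value
lower <- uniroot(function(b0) pL(b0) - 0.025, interval = c(1e-6, 1e6))$root
c(lower, fisher.test(tab)$conf.int[1])      # should agree closely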
Calculation using Barnard's CSM p-value ordering can be very slow, because determining the ordering itself requires p-value calculations. Röhmel and Kieser [55] discussed one-sided exact unconditional tests using Barnard's CSM p-value ordering, except breaking ties in a manner that does not worry about symmetry equivariance. Their additional contribution was to not worry about the exact ordering for very small p-values, which can speed up the calculations substantially. Table 1 gives a review of the different methods, their properties of centrality and compatible inferences, as well as an approximate ranking of computational speed and power. The last column gives some software availability for the methods; it is not a comprehensive list, and only considers SAS 9.4, R (with packages), and StatXact 11.

Connection to Causal Inference
Suppose there is a population of interest with N individuals. The jth individual has two potential binary outcomes of interest: Y j (1) would be the outcome if the individual were to get treatment 1, and Y j (2) would be the outcome if the individual were to get treatment 2. Let Y j = [Y j (1), Y j (2)]. Then there are 4 types of individuals with respect to these potential outcomes: those with Y j = [0, 0], Y j = [1, 1], Y j = [1, 0], and Y j = [0, 1]. Let the number of individuals of each of the 4 types be, respectively, N 00 , N 11 , N 10 , and N 01 . Let θ 1 = (N 11 + N 10 )/N and θ 2 = (N 11 + N 01 )/N . Presenting the data this way implies that the treatment one subject receives does not affect the responses of other subjects, and that there is only one version of each treatment [this is the stable unit treatment value assumption, see e.g., 32, Section 1.6].
Consider the following type of study.
Step 1: Define the study population as a simple random sample of size n = n 1 + n 2 from the population of interest (of size N ). Let i 1 , . . . , i n be the indices for the individuals in the study population.
Step 2: Randomly assign n 1 of the study subjects to treatment 1, and n 2 to treatment 2. Let w i h be the treatment assigned to the hth individual in the study.
Step 3: Apply the assigned treatments and observe the responses; for the hth individual in the study, observe Y i h (w i h ). If we treat N as infinite, then we can treat X 1 ∼ Binomial(n 1 , θ 1 ) and, independently, X 2 ∼ Binomial(n 2 , θ 2 ). Further, the parameters β d , β r and β or have a causal interpretation. For example, β d in this situation is called the average causal difference (or average causal effect). Thus, all the previous results can be interpreted as causal inferences.
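The finite-population setup above is easy to mimic in a small simulation; the sketch below (counts and names are ours, purely illustrative) builds a population from the four potential-outcome types, computes θ 1 , θ 2 and β d , and then carries out Steps 1-3 once.

```r
## Illustrative potential-outcomes bookkeeping (all numbers are made up).
set.seed(2)
N00 <- 400; N11 <- 300; N10 <- 100; N01 <- 200   # counts of the 4 types
N  <- N00 + N11 + N10 + N01
Y1 <- c(rep(0, N00), rep(1, N11), rep(1, N10), rep(0, N01))   # Y_j(1)
Y2 <- c(rep(0, N00), rep(1, N11), rep(0, N10), rep(1, N01))   # Y_j(2)

theta1 <- mean(Y1)            # (N11 + N10) / N
theta2 <- mean(Y2)            # (N11 + N01) / N
beta_d <- theta2 - theta1     # average causal difference

## Steps 1-3: simple random sample of n = n1 + n2, then randomize treatment.
n1 <- 25; n2 <- 25
study <- sample(N, n1 + n2)                  # Step 1: study population indices
trt   <- sample(rep(c(1, 2), c(n1, n2)))     # Step 2: random treatment assignment
x1 <- sum(Y1[study[trt == 1]])               # Step 3: observed responses, treatment 1
x2 <- sum(Y2[study[trt == 2]])               #         and treatment 2
c(theta1 = theta1, theta2 = theta2, beta_d = beta_d, x1 = x1, x2 = x2)
```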
Randomized clinical trials typically use a convenience study population defined by inclusion criteria that reflect ethical risks to study subjects and other practical considerations, and they rarely if ever take a simple random sample from the population of interest (i.e., they rarely do Step 1). Because of this, some suggest basing causal inferences on study specific parameters that are defined only for the individuals included in the study [53,51,40,14]. Let the individuals selected (not necessarily randomly) for inclusion into the jth study be i 1j , . . . , i nj . Let N 00j , N 11j , N 10j and N 01j be the number of individuals in that study of each of the 4 types of potential outcomes. Then the study specific parameters of interest are θ 1j = (N 11j + N 10j )/n and θ 2j = (N 11j + N 01j )/n. The finite population average causal difference for the jth study is β dj = θ 2j − θ 1j . If individuals have been randomized to treatment, then we can get confidence intervals for the study specific parameters (such as β dj and the related parameters for ratios, β rj , and odds ratios, β orj ) using only assumptions about the randomization. This is called randomization inference or Neymanian inference [51,40,14].
Scientifically, we are usually interested in two aspects of the study [see e.g., 38,20]. First, is there a treatment effect on the study population itself (internal inferences)? And second, is there a similar treatment effect on the population of interest (external inferences)? The advantage of randomization inference is that it requires no assumptions about how the study sample was obtained in order to make valid internal inferences. The disadvantage is that those inferences are study specific (e.g., inferences about β dj ). Alternatively, we can make the convenience assumption that the study population acts like a simple random sample from the population of interest, and use our study data to make inferences about the population parameters (e.g., β d ). This has the advantage that our inferences are more generally applicable, but the disadvantage that we have essentially assumed away the problem of generalizing the study specific inference to the external population of interest. For more discussion of these issues see Robins [53] (for observational studies) and Imbens and Rubin [32, Chapter 6] (for randomized experiments).

Power and Efficiency Comparisons
A comprehensive simulation or calculation comparing the different methods with respect to power or efficiency is beyond the scope of this review. Here we review a few of the best of those types of papers and add an example and a couple of graphical calculation results to supplement the previous literature on the topic. In essence this section gives some detailed justification for the rough power/efficiency classifications listed in Table 1.
In general, conditional tests (e.g., Fisher's exact tests) are less powerful than the best of the unconditional tests, because the latter are less discrete [43]. Martín Andrés and Silva Mato [46] provide a very comprehensive power comparison of several valid unconditional tests (including tests based on an ordering function of the difference in sample proportions, or on test-based ordering functions related to Fisher's exact p-value, the unpooled Z test, or Barnard's CSM test). They only considered ordering functions that do not depend on α or β 0 (since they only consider power to show θ 2 > θ 1 [i.e., with β 0 = 0 for the difference or β 0 = 1 for the ratio or odds ratio], the ordering functions automatically do not depend on β 0 ). Martín Andrés and Silva Mato [46] based power comparisons on expected power, assuming (θ 1 , θ 2 ) is bivariate uniformly distributed. They found that Barnard's CSM test was the most powerful on average, and that ordering by either the unpooled Z statistic for the difference in sample proportions or by Fisher's exact p-values (i.e., a Boschloo-type test) gave the next best power. Martín Andrés and Silva Mato [46] did not include a pooled Z test, but Mehrotra, Chan and Berger [47] did, and they showed that the pooled Z test can have much better power with unequal sample sizes. So in general we recommend ordering by the pooled Z instead of the unpooled Z. Since Barnard's CSM test is difficult to calculate, Martín Andrés, Sánchez Quevedo and Silva Mato [45] compared many approximations to it. They concluded that the mid-p Fisher's p-value was the best approximation to the CSM test, although it could be conservative for very small samples. Hirji, Tan and Elashoff [30] did extensive calculations of the type I error rate for the exact conditional mid-p one-sided and two-sided (Fisher-Irwin-type) tests. They found that, out of 3125 sample size and parameter situations (all with θ 1 = θ 2 ), typically 90-95% of both types of mid-p p-values, when used to test at a 5% significance level, had type I error rates less than or equal to 5%. Further, Lydersen, Fagerland and Laake [43] stated that the mid-p version of the Fisher-Irwin test approximates the Fisher-Boschloo test well, and the latter test (or the exact unconditional test based on Pearson's chi-squared statistic) was their recommendation.
For confidence intervals, we focus on two papers. Chan and Zhang [10] compared unconditional confidence intervals based on estimates of, or tests on, the difference: the difference in proportions, the unpooled Z statistic, the score statistic (which they called the δ-Projected Z statistic), and the likelihood ratio statistic. They tried all of them with and without the Berger and Boos [5] adjustment. They showed that the score statistic with no adjustment generally gave the shortest expected confidence interval length. Santner et al. [58] did a very comprehensive set of calculations for β d confidence intervals, calculating the expected coverage and confidence interval length over a 100×100 grid of values of (θ 1 , θ 2 ). They compared three valid methods and two approximate methods, including the unconditional method based on a two-sided score test, the unconditional method based on two one-sided score tests, and an approximate method of Coe and Tamhane [11]. The results show that, of the valid methods, the unconditional method based on the two-sided score test statistic had the lowest expected length, while the central unconditional method based on two one-sided score tests had larger expected length. However, if directional inferences are important, then the proper comparison is the former method using 100(1−2α)% intervals against the latter method using 100(1 − α)% intervals (see Section 4.3). Further, the score tests may lack coherence (see Figure 3). Santner et al. [58] ended up recommending the approximate method of Coe and Tamhane [11], which had shorter expected confidence interval length and gave coverage above the nominal level except in fewer than 0.6% of the cases. Fagerland, Lydersen and Laake [16] also recommend, for small samples, the exact unconditional confidence intervals with the two-sided score test statistic as the ordering function, and they mention using one-sided tests if direction is important.
We now compare the score tests to other tests not included in the previous simulations. Among the unconditional tests applied to β r and β or , ordering based on score tests or on one-sided mid-p Fisher's exact p-values [44] performs much better than ordering by the estimates with tie breaks as in Section 5.3. For example, with n 1 = n 2 = 20, θ 1 = 0.4, θ 2 = 0.8, and a one-sided 0.025 significance level, the power is 73% for the score-based or mid-p Fisher-based tests of both β r and β or , but it is very small for the test that orders by estimates with tie breaks (power ≈ 0 for β r and power ≈ 1% for β or ). We get some increase in power for the latter tests when we use the Berger and Boos adjustment with γ = 10 −6 (power is 11% for β r and 16% for β or ). In contrast, for β d in that example all three methods of ordering, with or without the Berger-Boos adjustment, give 73% power.
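Powers such as these are obtained by complete enumeration: for fixed (θ 1 , θ 2 ), the power of a level-α test is the sum of the rejection indicator over all possible 2×2 outcomes, weighted by the product binomial probabilities. The R sketch below (our own code and naming) illustrates the calculation with the one-sided conditional Fisher's exact test from base R, which will typically be somewhat less powerful than the unconditional tests quoted above.

```r
## Exact power by enumeration for a given p-value function at level alpha.
exact_power <- function(pfun, n1, n2, theta1, theta2, alpha = 0.025) {
  pow <- 0
  for (x1 in 0:n1) for (x2 in 0:n2) {
    if (pfun(x1, n1, x2, n2) <= alpha) {
      pow <- pow + dbinom(x1, n1, theta1) * dbinom(x2, n2, theta2)
    }
  }
  pow
}

## One-sided Fisher's exact p-value (alternative theta2 > theta1) from base R;
## with columns as groups, tab[1,1] = x1, so "less" tests theta1 < theta2.
pfun_fisher <- function(x1, n1, x2, n2) {
  tab <- matrix(c(x1, n1 - x1, x2, n2 - x2), nrow = 2)
  fisher.test(tab, alternative = "less")$p.value
}

exact_power(pfun_fisher, n1 = 20, n2 = 20, theta1 = 0.4, theta2 = 0.8)
```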
In Figure 7 we compare powers of the two-sided 0.05 level central tests of β d = 0. Powers are calculated on a 99 × 99 grid of values of (θ 1 , θ 2 ). We plot the difference in powers between all pairs of three tests: two unconditional exact tests (one based on the score test for the difference in proportions, and one based on the difference in proportions with a tie break) and the conditional test (the central Fisher's exact test). We find, as expected, that the unconditional tests do better, and that the simple method with a tie break does well when the sample sizes are not equal [see e.g., 47, for a different set of simulations showing a similar result for the two-sided test].
In Figure 8 we compare the unconditional exact tests ordered by score statistics (on either β d = 0, β or = 1, or β r = 1) with the unconditional exact tests ordered by the mid p-values from the one-sided Fisher's exact test. We find that the latter tests are generally more powerful.

Figure 7 caption: Comparison of powers for testing θ 1 = θ 2 using central tests at the two-sided 0.05 level. The three tests compared are "score" = unconditional exact test based on the score test of the difference in proportions, "simple TB" = unconditional exact test based on the difference in proportions using a simple tie-break (see Section 5.2), and "Fisher" = tests based on the central Fisher's exact test. For columns labeled Test 1 vs Test 2, the result is the power of Test 1 minus the power of Test 2, so that positive values (pink and gray) indicate that Test 1 is more powerful. White indicates that the powers are within 0.025 of each other.

Recommendations
There are many ways to perform frequentist inferences on the two-sample binomial problem, and we focused our extensive review on valid inferences and highlighted practical properties of tests. To summarize, we give a few recommendations.
1. We should almost always use central confidence intervals together with either a central p-value or the minimum of the one-sided p-values. Although non-central two-sided CIs can slightly decrease expected CI length, that advantage comes at a cost in terms of allowable one-sided inferences. Since after rejecting a two-sided test we usually care about the direction of effect, non-central CIs are not routinely recommended.
2. It is usually not worthwhile to maximize power or minimize the expected length of the confidence interval: the gains come at the price of an increased computational burden and lead to incoherent p-values and non-nested CIs.
3. For fast calculations, use the one-sided conditional exact tests and the melded confidence intervals.
4. For more power, use the unconditional one-sided valid p-values and the associated central CIs. For inferences on β d we can order by the difference in sample proportions, breaking ties while maintaining the BC conditions, and not letting the ordering function depend on β 0 or α. This ensures monotonicity of the p-values as a function of β 0 , allowing relatively fast calculations and avoiding incoherence and non-nestedness. For inferences on β r and β or , the simple ordering function with a tie break will have much smaller power than the score method or ordering based on one-sided mid-p Fisher's exact p-values. The score method introduces problems with incoherence or non-nestedness, while the mid-p Fisher p-value ordering does not. Because the latter method only uses the mid p-values for ordering within the exact unconditional test framework, the resulting p-values are valid. Further, for inferences on β d , the mid-p ordering meets the BC conditions and is relatively fast to calculate.
5. If validity is not vital, then the mid-p conditional tests are a good approximation to the more powerful of the unconditional exact tests. Additionally, in a large proportion of null situations (θ 1 = θ 2 ), the mid-p conditional tests still have type I error rates less than the nominal value.

Notes for Table 1: * Approximate computation speed: 1=fast, 2=moderate, 3=slow. ** Approximate power/efficiency: 1=higher power/shorter CI, . . ., 5=lower power/larger CI. *** Software (not comprehensive; only R, SAS 9.4, and StatXact 11 were considered): R packages are available at https://cran.r-project.org/; for SAS the methods are available in PROC FREQ using the exact option. The value "both" denotes that there are versions with and without the property, and "?" denotes that it is not clear whether the matching confidence intervals are compatible with the p-values because confidence intervals have not been studied with that test (although it is likely the method will not be compatible because it is similar to the smallest CI method).
Appendix: Proof of Theorem 4.1
Proof of statement 1, (Compatible Inferences) ⇒ (C I = C): If the confidence region associated with a p-value is not an interval, then there must be an α and β 0 such that p(x, β 0 ) ≤ α and β 0 ∈ C I (x, 1 − α), which contradicts compatible inferences; therefore C I (x, 1 − α) = C(x, 1 − α). (C I = C) ⇒ (Compatible Inferences): If the confidence region associated with the p-value is the matching confidence interval, then the inferences are compatible by definition (equation 2.2).
Proof of statement 2, (Compatible Inferences) ⇒ (Nested CI): We show the contrapositive. If a method has non-nested CIs, then there exist some α 1 < α 2 and some β 0 such that β 0 ∉ C I (x, 1 − α 1 ) and β 0 ∈ C I (x, 1 − α 2 ). If the method had compatible inferences, then p(x, β 0 ) ≡ p ≤ α 1 and p > α 2 . This leads to the contradiction p ≤ α 1 < α 2 < p, so the method must not have compatible inferences, and we have proven the result.
Proof of statement 3, (Compatible Inferences) ⇒ (Coherence): From statement 2, compatible inferences imply nested CIs. For one-sided p-values, compatible inferences with nested CIs imply that the p-values are non-decreasing as the null space expands (e.g., as β 0 gets larger when H 0 : β ≤ β 0 ), and hence are coherent by definition. For two-sided p-values, because of compatible inferences and nested CIs, the p-values are non-decreasing as 1 − α decreases. This is directional coherence by definition.