What does the proof of Birnbaum's theorem prove?

Birnbaum's theorem, that the sufficiency and conditionality principles entail the likelihood principle, has engendered a great deal of controversy and discussion since the publication of the result in 1962. In particular, many have raised doubts as to the validity of this result. Typically these doubts are concerned with the validity of the principles of sufficiency and conditionality as expressed by Birnbaum. Technically it would seem, however, that the proof itself is sound. In this paper we use set theory to formalize the context in which the result is proved and show that in fact Birnbaum's theorem is incorrectly stated as a key hypothesis is left out of the statement. When this hypothesis is added, we see that sufficiency is irrelevant, and that the result is dependent on a well-known flaw in conditionality that renders the result almost vacuous.


Introduction
A result presented in Birnbaum (1962), and referred to as Birnbaum's theorem, is very well-known in statistics. This result says that a statistician who accepts both the sufficiency S and conditionality C principles must also accept the likelihood principle L and conversely. The result has always been controversial primarily because it implies that a frequentist statistician who accepts S and C is forced to ignore the repeated sampling properties of any inferential procedures they use. Given that both S and C seem quite natural to many frequentist statisticians while L does not, the result is highly paradoxical.
Various concerns have been raised about the proof of the result. For example, Durbin (1970) argued that the theorem fails to hold whenever C is restricted by requiring that any ancillaries used must be functions of a minimal sufficient statistic. Kalbfleisch (1975) argued that C should only be applicable when the value of the ancillary statistic used to condition is actually a part of the experimental make-up. This is called the weak conditionality principle. In Evans, Fraser and Monette (1986) it is argued that Birnbaum's theorem, and a similar result that accepting C alone is equivalent to accepting L, are invalid because the specific uses of S and C in proving these results can be seen to be based on flaws in their formulations. For example, Birnbaum's theorem requires a use of S and C where the information discarded by S as irrelevant, which is the primary motivation for S, is exactly the information used by C to condition on and so identifies the discarded information as highly relevant. As such S and C contradict each other. We note that this is precisely what Durbin's restriction on the ancillaries avoids. Furthermore, the result that C alone implies L can be seen to depend on the lack of a unique maximal ancillary which can be viewed as an essential flaw in C. Also, see Holm (1985), Barndorff-Nielsen (1995) and Helland (1995) for various concerns about the formulation of the theorem. Mayo (2010) argues that, in the context of a repeated sampling formulation for statistics, we cannot simultaneously have S and C true, as when S is true then C is false and when C is true then S is false. Gandenberger (2012) offers up a proof that avoids some of the objections raised by others.
Many of these reservations are essentially with the hypotheses of the theorem and suggest that Birnbaum's theorem should be rejected because the hypotheses are either not acceptable or have been misapplied. It is the purpose of this paper to provide a careful set-theoretic formulation of the context of the theorem. When this is done we see that there is a hypothesis that needs to be formally acknowledged as part of the statement of Birnbaum's theorem. With this addition, the force of the result is lost and the paradox disappears. The same conclusions apply to the result that C is equivalent to L and, in fact, this is really the only result, as S is redundant in Birnbaum's theorem when the additional hypothesis is formally acknowledged.
For our discussion it is important that we stick as closely as possible to Birnbaum's formulation. To discuss the proof, however, we have to make certain aspects of Birnbaum's argument mathematically precise that are somewhat vague in his paper. It is always possible then that someone will argue that we have done this in a way that is not true to Birnbaum's intention. We note, however, that this is accomplished in a very simple and direct way. If there is another precise formulation that makes the theorem true, then it is necessary for a critic of how we do this to provide that alternative.
A basic step missing in Birnbaum (1962) was to formulate the principles as relations on the set I of all model and data combinations. So I is the set of all inference bases $(E, x)$, where $E = (X_E, \{f_{E,\theta} : \theta \in \Theta_E\})$ is a model, $\{f_{E,\theta} : \theta \in \Theta_E\}$ is a collection of probability density functions on $X_E$, with respect to some support measure $\mu_E$ on $X_E$, and $x \in X_E$ is the observed data. We will ignore all measure-theoretic considerations as they are not essential for any of the arguments. If the reader is concerned by this, then we note that the collection of models where $X_E$ and $\Theta_E$ are finite and $\mu_E$ is counting measure is rich enough to produce the paradoxical result. So in general we can consider our discussion restricted to the case where $X_E$ and $\Theta_E$ are finite. It is our view that infinite sets and continuous probability measures are not necessary for the development of the basic principles of statistics. Rather, the use of infinite sets and continuity represents approximations to a finite reality, and appropriate restrictions must be employed on such quantities so that we are not misled by purely mathematical considerations. In spite of our restrictions, most of our development applies equally well under very general circumstances.
We note that expressing the principles as relations was part of Evans, Fraser and Monette (1986); this is taken further here. In Section 2 we discuss the meaning and use of relations generally. In Section 3 we apply our discussion of relations to Birnbaum's theorem. In Section 4 we draw some conclusions.

Relations
A relation R with domain D is a subset R ⊂ D × D. Saying (x, y) ∈ R means that the objects x and y have a property in common. For example, suppose D is the set of students enrolled at a specific university at a specific point in time. Let $R_1$ be defined by $(x, y) \in R_1$ whenever x and y are students in the same class. Let $R_2$ be defined by $(x, y) \in R_2$ whenever x and y have taken a course from the same professor. A relation R is reflexive if (x, x) ∈ R for all x ∈ D, symmetric if (x, y) ∈ R implies (y, x) ∈ R, and transitive if (x, y) ∈ R and (y, z) ∈ R imply (x, z) ∈ R. If a relation R is reflexive, symmetric and transitive, then R is called an equivalence relation. Clearly $R_1$ is an equivalence relation and, while $R_2$ is reflexive and symmetric, it is not typically transitive and so is not an equivalence relation. While (x, y) ∈ R implies that x and y are related, perhaps by the possession of some property, when R is an equivalence relation this implies that x and y possess the property to the same degree. We say that relation R on D implies relation R′ on D whenever R ⊂ R′. Clearly we have that $R_1 \subset R_2$.
If R is a relation on D, then the equivalence relation $\bar{R}$ generated by R is the smallest equivalence relation containing R. We see that $\bar{R}$ is the intersection of all equivalence relations on D containing R. Also we have that

$$\bar{R} = \{(x, y) : \exists\, n \ge 1,\ x_1, \ldots, x_n \in D \text{ with } x_1 = x,\ x_n = y \text{ and, for each } i < n,\ (x_i, x_{i+1}) \in R \text{ or } (x_{i+1}, x_i) \in R\}. \qquad (1)$$

It is not always clear that $\bar{R}$ has a meaningful interpretation, at least as it relates to the property being expressed by R. For example, $\bar{R}_2$ is somewhat more difficult to interpret and surely goes beyond the idea that $R_2$ is perhaps trying to express, namely, that two students were directly influenced by the same professor. In fact, it is entirely possible that $\bar{R}_2 = D \times D$. As another example, suppose that D = {2, 3, 4, . . .} and (x, y) ∈ R when x and y have a common factor bigger than 1. Then R is reflexive and symmetric but not transitive. If x, y ∈ D then (x, xy) ∈ R and (xy, y) ∈ R, so $\bar{R} = D \times D$ and $\bar{R}$ is saying nothing. It seems that each situation, where we extend a relation R to an equivalence relation, must be examined to see whether or not this extension has any meaningful content for the application.
Now suppose we have relations $R_1$ and $R_2$ on D and consider the relation $R_1 \cup R_2$. The following result is relevant to our discussion in Section 3.
Lemma 1. If $R_1$ and $R_2$ are relations on D, then $\overline{R_1 \cup R_2} = \overline{\bar{R}_1 \cup \bar{R}_2}$.
This says that the equivalence relation generated by the union of relations is equal to the equivalence relation generated by the union of the corresponding generated equivalence relations. Furthermore, it is clear that the union of equivalence relations is not in general an equivalence relation.
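The ideas of this section can be checked on small finite domains. The following is a minimal sketch, not part of the original development: the helper names `equivalence_closure` and `is_transitive`, the finite subset `D` of {2, 3, 4, . . .} and the toy domain `D2` are all assumptions chosen for illustration. The subset `D` is picked so that the products linking coprime elements are present, since a truncated domain need not contain xy.

```python
from itertools import product
from math import gcd


def equivalence_closure(domain, relation):
    """Return the smallest equivalence relation on `domain` containing `relation`."""
    parent = {x: x for x in domain}

    def find(x):
        # Follow parent links to the representative of x's class.
        while parent[x] != x:
            x = parent[x]
        return x

    for x, y in relation:
        parent[find(x)] = find(y)  # merge the classes of x and y

    # Two elements are equivalent exactly when they share a representative.
    return {(x, y) for x, y in product(domain, repeat=2) if find(x) == find(y)}


def is_transitive(relation):
    """Check that (x, y) and (y, w) in the relation always give (x, w)."""
    return all((x, w) in relation
               for (x, y) in relation
               for (z, w) in relation
               if y == z)


# Common-factor relation on a finite subset of {2, 3, 4, ...} (illustrative choice).
D = [2, 3, 5, 6, 10, 15]
R = {(x, y) for x, y in product(D, repeat=2) if gcd(x, y) > 1}
R_bar = equivalence_closure(D, R)
everything = set(product(D, repeat=2))

# The union of generated equivalence relations versus the closure of the union.
D2 = [1, 2, 3, 4]
R1, R2 = {(1, 2)}, {(2, 3)}
union_of_closures = equivalence_closure(D2, R1) | equivalence_closure(D2, R2)
lhs = equivalence_closure(D2, R1 | R2)
rhs = equivalence_closure(D2, union_of_closures)

print("R transitive:", is_transitive(R))
print("closure of R is D x D:", R_bar == everything)
print("union of closures transitive:", is_transitive(union_of_closures))
print("closure of union equals closure of union of closures:", lhs == rhs)
```

In this sketch the common-factor relation is not transitive, yet its generated equivalence relation relates every pair in `D`, which is the sense in which the generated relation can end up saying nothing.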

Statistical Relations and Principles
We define a statistical relation to be a relation on I and a statistical principle to be an equivalence relation on I. The idea behind a statistical principle, as used here, is that equivalent inference bases contain the same amount of statistical information about the unknown θ. We make no attempt to give a precise definition of what statistical information means. Birnbaum (1962) identified two inference bases $I_1, I_2 \in I$ as containing the same amount of statistical information via the notation $Ev(I_1) = Ev(I_2)$. We consider several statistical relations.
The likelihood relation L on I is defined by $(I_1, I_2) \in L$ whenever $\Theta_{E_1} = \Theta_{E_2}$ and there exists c > 0 such that $f_{E_1,\theta}(x_1) = c f_{E_2,\theta}(x_2)$ for every θ. We have the following obvious result.
Lemma 2. L is a statistical principle.
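In the finite case, membership in the likelihood relation can be checked numerically. The sketch below is illustrative only: the function name `likelihood_related`, the three-point parameter space, and the binomial versus negative binomial comparison (a standard textbook illustration, not an example taken from this paper) are all assumptions. Each likelihood is represented as a dictionary mapping θ to the density of the observed data.

```python
from math import comb, isclose


def likelihood_related(lik1, lik2, rel_tol=1e-9):
    """Check whether lik1(theta) = c * lik2(theta) for a single c > 0 and every theta."""
    if lik1.keys() != lik2.keys():
        return False  # the parameter spaces must agree
    c = None
    for theta in lik1:
        a, b = lik1[theta], lik2[theta]
        if a == 0 or b == 0:
            if a != b:
                return False  # a zero must be matched by a zero
            continue
        ratio = a / b
        if c is None:
            c = ratio
        elif not isclose(ratio, c, rel_tol=rel_tol):
            return False  # the ratio varies with theta
    return True


thetas = [0.2, 0.5, 0.8]  # hypothetical finite parameter space

# 3 successes in 12 Bernoulli trials with the number of trials fixed in advance.
binomial = {t: comb(12, 3) * t**3 * (1 - t)**9 for t in thetas}
# Sampling until the 3rd success, which happens to arrive on trial 12.
negative_binomial = {t: comb(11, 2) * t**3 * (1 - t)**9 for t in thetas}
# 4 successes in 12 trials: a genuinely different likelihood.
other = {t: comb(12, 4) * t**4 * (1 - t)**8 for t in thetas}

print(likelihood_related(binomial, negative_binomial))
print(likelihood_related(binomial, other))
```

The first pair differs only by the constant factor 220/55 = 4, so the two inference bases are related by L even though the sampling models differ; the second pair is not related.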
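Actually the likelihood principle does not completely express the idea that two inference bases with the same likelihood function contain the same amount of statistical information. For this we need another statistical relation. We define the invariance relation G by $(I_1, I_2) \in G$ whenever there exist 1-1, onto, smooth functions $g : X_{E_1} \to X_{E_2}$ and $h : \Theta_{E_1} \to \Theta_{E_2}$, with $g(x_1) = x_2$, such that $f_{E_2,h(\theta)}(g(x)) = f_{E_1,\theta}(x) J_g(x)$ for every $x \in X_{E_1}$ and every $\theta \in \Theta_{E_1}$, where $J_g(x) = (\det(\partial g(x)/\partial x))^{-1}$, with $J_g(x) = 1$ in the discrete case. We have the following result.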
Lemma 3. G is a statistical principle.
Now consider the equivalence relation $\overline{L \cup G}$. If $(I_1, I_2) \in L$ and $(I_2, I_3) \in G$, then, for some constant c > 0 and mappings g and h, $f_{E_1,\theta}(x_1) = c f_{E_2,\theta}(x_2) = c f_{E_3,h(\theta)}(g(x_2)) J_g(x_2)^{-1} = c f_{E_3,h(\theta)}(x_3) J_g(x_2)^{-1}$ for every θ and so, after relabelling, $I_1$ and $I_3$ have proportional likelihoods. Similarly, if $(I_1, I_2) \in G$ and $(I_2, I_3) \in L$, then again, after relabelling, $I_1$ and $I_3$ have proportional likelihoods. So $(I_1, I_2) \in \overline{L \cup G}$ just expresses the fact that $I_1$ and $I_2$ have proportional likelihoods, perhaps after relabelling the data and the parameter. In this case we can state clearly what the equivalence relation $\overline{L \cup G}$ expresses and the generated equivalence relation makes sense. We do not need $\overline{L \cup G}$, however, for a discussion of Birnbaum's result.
The sufficiency relation S is defined by $(I_1, I_2) \in S$ whenever $\Theta_{E_1} = \Theta_{E_2}$ and there exist minimal sufficient statistics $m_1$ for $E_1$ and $m_2$ for $E_2$ such that the marginal models induced by the $m_i$ are the same and $m_1(x_1) = m_2(x_2)$. We have the following result.
Lemma 4. S is a statistical principle and S ⊂ L.
Proof: Clearly S is reflexive and symmetric and S ⊂ L. Suppose $(I_1, I_2) \in S$ via the minimal sufficient statistics $m_1$ and $m_2$ and $(I_2, I_3) \in S$ via the minimal sufficient statistics $m'_2$ and $m_3$. Since any two minimal sufficient statistics are 1-1 functions of each other, there exists a 1-1 function h such that $m'_2 = h \circ m_2$. Then $(I_1, I_3) \in S$ via the minimal sufficient statistics $h \circ m_1$ and $m_3$.
Obviously we have the result that $(I_1, I_2) \in S$ whenever $I_2$ can be obtained from $I_1$ via a sufficient statistic or conversely. Furthermore, it makes sense to extend S to S ∪ G.
If we are going to say that $(I_1, I_2) \in C$ means that $I_1$ and $I_2$ contain an equivalent amount of information under C, then we are forced to expand C to $\bar{C}$ so that it is an equivalence relation. But this implies that the two inference bases $I_2$ and $I_3$ presented in the proof of Lemma 5 contain an equivalent amount of information and yet they are not directly related via C. Rather they are related only because they are conditional models obtained from a supermodel that has two essentially different maximal ancillaries.
Saying that such models contain an equivalent amount of statistical information is clearly a substantial generalization of C. Note that, for the example in the proof of Lemma 5, when (1, 1) is observed, the MLE is $\hat{\theta}(1, 1) = 1$. To measure the accuracy of this estimate we can compute the conditional probabilities based on the two inference bases; these differ, and so the accuracy of $\hat{\theta}$ is quite different depending on whether we use $I_2$ or $I_3$. It seems unlikely that we would interpret these inference bases as containing an equivalent amount of information in a frequentist formulation of statistics. As noted in Section 2, there is no reason why we have to accept the equivalences given by a generated equivalence relation unless we are certain that this equivalence relation expresses the essence of the basic relation. It seems clear that there is a problem with the assertion that $(I_1, I_2) \in \bar{C}$ means that $I_1$ and $I_2$ contain an equivalent amount of information without further justification.
We now follow a development similar to that found in Evans, Fraser and Monette (1986) to prove the following result.
Theorem 6. $C \subset L \subset \bar{C}$, and so $\bar{C} = L$.
The proof that $L \subset \bar{C}$ relies on discreteness. This requirement was weakened in Evans, Fraser and Monette (1986) and even further weakened in Jang (2011). We now show that Birnbaum's proof actually establishes the following result.
Theorem 7. $S \cup C \subset L \subset \overline{S \cup C}$.
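Note that Birnbaum's proof only proves the containments with no equalities, but we have the following result.
Theorem 8. $S \cup C \subset L \subset \overline{S \cup C}$, where the first containment is proper and the second is exact.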
To prove that the second containment is exact we have, using (1), that $(I_1, I_2) \in \overline{S \cup C}$ implies that $I_1$ and $I_2$ give rise to proportional likelihoods, as this is true for each element of S ∪ C, and so $\overline{S \cup C} \subset L$.
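The step from the chain characterization of a generated equivalence relation to proportional likelihoods can be spelled out as follows. This is a sketch only: the intermediate inference bases $J_i = (E^{(i)}, y_i)$ and the constants $c_i$ are notation introduced here for illustration, and the argument relies on every pair in S ∪ C having proportional likelihoods, as asserted in the text.

```latex
% A finite chain joins I_1 to I_2 inside the generated equivalence relation.
(I_1, I_2) \in \overline{S \cup C}
\implies \exists\, J_0 = I_1,\ J_1, \ldots, J_n = I_2
\text{ with } (J_{i-1}, J_i) \in S \cup C \text{ or } (J_i, J_{i-1}) \in S \cup C,
\quad i = 1, \ldots, n.

% Each link has proportional likelihoods, since S \cup C \subset L.
f_{E^{(i-1)},\theta}(y_{i-1}) = c_i\, f_{E^{(i)},\theta}(y_i),
\qquad c_i > 0, \text{ for every } \theta.

% Multiplying the links gives proportionality for the endpoints.
f_{E^{(0)},\theta}(y_0) = \Bigl(\prod_{i=1}^{n} c_i\Bigr) f_{E^{(n)},\theta}(y_n)
\text{ for every } \theta
\quad\Longrightarrow\quad (I_1, I_2) \in L.
```

The product of finitely many positive constants is again a positive constant that does not depend on θ, which is all the likelihood relation requires.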
So we do not have, as usually stated for Birnbaum's theorem, that S and C are together equivalent to L, but we do have that $\overline{S \cup C}$ is equivalent to L. Acceptance of $\overline{S \cup C}$ is not entailed, however, by acceptance of both S and C, as we have to examine the additional relationships added to S ∪ C to see if they make sense. If one wishes to say that acceptance of S and C implies the acceptance of $\overline{S \cup C}$, then a compelling argument is required for these additions and this seems unlikely. From the example of the proof of Theorem 8 we can see that acceptance of $\overline{S \cup C}$ is indeed equivalent to acceptance of L.
From Theorems 6 and 7 we have the following corollary.
Corollary 9. $S \cup C \subset \bar{C} = L$ where the first containment is proper. Furthermore, $S \subset \bar{C}$ and this containment is proper.
A direct proof that $S \subset \bar{C}$ has been derived by Jang (2011). It is interesting to note that Corollary 9 shows that the existence of S in the modified statement of Birnbaum's theorem, where we require that we accept all the equivalences generated by S and C, is irrelevant as it is not required. This is a reassuring result as it is unlikely that S is defective but it is almost certain that C is defective, at least as currently stated. Also we have the following result.
Lemma 10. $\overline{C \cup G} = \overline{L \cup G}$.
Proof: This is immediate from Lemma 1 and Lemma 3.
This says that the equivalences obtained by combining invariance under relabelling with conditionality are the same as the equivalences obtained by combining invariance under relabelling with likelihood.
As with the proof of Birnbaum's theorem, the proof that C = L provided in Evans, Fraser and Monette (1986) is really a proof that $\bar{C} = L$. This can be seen from the proof of Theorem 6. So accepting the relation C is not really equivalent to accepting L unless we agree that the additional elements of $\bar{C}$ make sense. This is essentially equivalent to saying that it doesn't matter which maximal ancillary we condition on. It is unlikely that this is acceptable to most frequentist statisticians, as illustrated by the discussion concerning the example in Lemma 5.
As noted in Durbin (1970), requiring that any ancillaries used in an application of C be functions of a minimal sufficient statistic voids Birnbaum's proof, as the ancillary statistic used in the proof of Theorem 7 is not a function of the sufficient statistic used in the proof. It is not clear, however, what this restriction does to the result $\bar{C} = L$, but we note that there are situations where there exist nonunique maximal ancillaries which are functions of the minimal sufficient statistic. In these circumstances we would still be forced to conclude the equivalence of inference bases derived by conditioning on the different maximal ancillaries if we reasoned as in Evans, Fraser and Monette (1986). Of course, we are arguing here that the result requires the statement of an additional hypothesis.

Conclusions
We have shown that the proof in Birnbaum (1962) did not prove that S and C lead to L. Rather the proof establishes that $\overline{S \cup C} = L$ and this is something quite different. The statement of Birnbaum's theorem in prose should have been: if we accept the relation S and we accept the relation C and we accept all the equivalences generated by S and C together, then this is equivalent to accepting L. The essential flaw in Birnbaum's theorem lies in excluding this last hypothesis from the statement of the theorem. The same qualification applies to the result proved in Evans, Fraser and Monette (1986) where the statement of the theorem should have been: if we accept the relation C and we accept all the equivalences generated by C, then this is equivalent to accepting L.
The way out of the difficulties posed by Birnbaum's theorem, and the result relating C and L, is to acknowledge that additional hypotheses are required for the results to hold. Certainly these results seem to lose their impact when they are correctly stated and we realize that an equivalence relation generated by a relation is not necessarily meaningful. It is necessary to provide an argument as to why the generated equivalence relation captures the essence of the relation that generates it and it is not at all clear how to do this in these cases.
As we have noted, the essential result in all of this is $\bar{C} = L$ and this has some content, albeit somewhat minor. Furthermore, the proof of this result is based on a defect in C, namely, it is not an equivalence relation due to the general nonexistence of unique maximal ancillaries. As such it is hard to accept C as stated as any kind of characterization of statistical evidence. Given the intuitive appeal of this relation in some simple examples, however, resolving the difficulties with C still poses a major challenge for a frequentist theory of statistics.