Statistical Science

Interval Estimation for a Binomial Proportion

Lawrence D. Brown, T. Tony Cai, and Anirban DasGupta

Full-text: Open access

Abstract

We revisit the problem of interval estimation of a binomial proportion. The erratic behavior of the coverage probability of the standard Wald confidence interval has previously been remarked on in the literature (Blyth and Still, Agresti and Coull, Santner and others). We begin by showing that the chaotic coverage properties of the Wald interval are far more persistent than is appreciated. Furthermore, common textbook prescriptions regarding its safety are misleading and defective in several respects and cannot be trusted.
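
The coverage probability discussed above can be computed exactly rather than simulated, because X takes only n + 1 values: add up the Binomial(n, p) probabilities of those x whose interval contains p. Below is a minimal sketch, assuming Python 3.9+ and only the standard library. The helper names `wald_interval` and `coverage` are our own and do not come from the paper.

```python
from math import comb, sqrt
from statistics import NormalDist


def wald_interval(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Standard Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def coverage(interval, n: int, p: float, alpha: float = 0.05) -> float:
    """Exact coverage: total Binomial(n, p) mass of the x whose interval contains p."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = interval(x, n, alpha)
        if lo <= p <= hi:
            total += comb(n, x) * p**x * (1 - p) ** (n - x)
    return total


if __name__ == "__main__":
    # Exact coverage of the nominal 95% Wald interval for n = 50 on a grid of p.
    for p in (0.01, 0.05, 0.10, 0.20, 0.30, 0.50):
        print(f"p = {p:.2f}  coverage = {coverage(wald_interval, 50, p):.4f}")
```

As a hand check, with n = 5 and p = 0.5 the 95% Wald interval contains 0.5 exactly when 1 ≤ X ≤ 4, so its coverage is 30/32 = 0.9375. At x = 0 the interval collapses to the single point 0, which is one source of its poor behavior near the boundary.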

This leads us to consideration of alternative intervals. A number of natural alternatives are presented, each with its motivation and context. Each interval is examined for its coverage probability and its length. Based on this analysis, we recommend the Wilson interval or the equal-tailed Jeffreys prior interval for small n and the interval suggested in Agresti and Coull for larger n. We also provide an additional frequentist justification for use of the Jeffreys interval.
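
The three recommended intervals take only a few lines each. The sketch below assumes Python 3.9+ and the standard library only. Because `statistics` has no Beta quantile function, the Jeffreys limits use a hand-rolled regularized incomplete beta function (the usual continued fraction evaluated by the modified Lentz method) that is then inverted by bisection. All function names are our own. The Jeffreys boundary convention (lower limit 0 at x = 0, upper limit 1 at x = n) is our assumption about how the interval is modified at the endpoints.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

_TINY = 1e-300  # guard against division by zero in the continued fraction


def _z(alpha: float) -> float:
    return NormalDist().inv_cdf(1 - alpha / 2)


def wilson_interval(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Score interval: the set of p with |x/n - p| <= z * sqrt(p * (1 - p) / n)."""
    z = _z(alpha)
    p_hat = x / n
    center = (x + z * z / 2) / (n + z * z)
    half = z * sqrt(n) / (n + z * z) * sqrt(p_hat * (1 - p_hat) + z * z / (4 * n))
    return center - half, center + half


def agresti_coull_interval(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Wald-type interval centered at p_tilde = (x + z^2/2) / (n + z^2)."""
    z = _z(alpha)
    n_tilde = n + z * z
    p_tilde = (x + z * z / 2) / n_tilde
    half = z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half


def _betacf(a: float, b: float, x: float, max_iter: int = 300, eps: float = 1e-15) -> float:
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > _TINY else _TINY)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        even = m * (b - m) * x / ((qam + m2) * (a + m2))
        odd = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        for coef in (even, odd):
            d = 1.0 + coef * d
            d = 1.0 / (d if abs(d) > _TINY else _TINY)
            c = 1.0 + coef / c
            c = c if abs(c) > _TINY else _TINY
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x: float, a: float, b: float) -> float:
    """Regularized incomplete beta function I_x(a, b), i.e. the Beta(a, b) CDF."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def beta_ppf(q: float, a: float, b: float, iters: int = 100) -> float:
    """Beta(a, b) quantile by bisection, using that beta_cdf is increasing in x."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def jeffreys_interval(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Equal-tailed Beta(x + 1/2, n - x + 1/2) posterior interval; the lower limit
    is set to 0 when x = 0 and the upper limit to 1 when x = n (assumed convention)."""
    a, b = x + 0.5, n - x + 0.5
    lo = 0.0 if x == 0 else beta_ppf(alpha / 2, a, b)
    hi = 1.0 if x == n else beta_ppf(1 - alpha / 2, a, b)
    return lo, hi


if __name__ == "__main__":
    for name, f in (("Wilson", wilson_interval), ("Agresti-Coull", agresti_coull_interval),
                    ("Jeffreys", jeffreys_interval)):
        lo, hi = f(3, 20)
        print(f"{name:14s} x=3, n=20: ({lo:.4f}, {hi:.4f})")
```

Where SciPy is available, `scipy.stats.beta.ppf` could replace `beta_ppf`. The hand-rolled version exists only to keep the sketch dependency-free.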

Article information

Source
Statist. Sci. Volume 16, Issue 2 (2001), 101-133.

Dates
First available in Project Euclid: 24 December 2001

Permanent link to this document
http://projecteuclid.org/euclid.ss/1009213286

Digital Object Identifier
doi:10.1214/ss/1009213286

Mathematical Reviews number (MathSciNet)
MR1861069

Zentralblatt MATH identifier
02068924

Citation

Brown, Lawrence D.; Cai, T. Tony; DasGupta, Anirban. Interval Estimation for a Binomial Proportion. Statistical Science 16 (2001), no. 2, 101-133. doi:10.1214/ss/1009213286. http://projecteuclid.org/euclid.ss/1009213286.


References

  • Abramowitz, M. and Stegun, I. A. (1970). Handbook of Mathematical Functions. Dover, New York.
  • Agresti, A. and Coull, B. A. (1998). Approximate is better than "exact" for interval estimation of binomial proportions. Amer. Statist. 52 119-126.
  • Anscombe, F. J. (1948). The transformation of Poisson, binomial and negative binomial data. Biometrika 35 246-254.
  • Anscombe, F. J. (1956). On estimating binomial response relations. Biometrika 43 461-464.
  • Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer, New York.
  • Berry, D. A. (1996). Statistics: A Bayesian Perspective. Wadsworth, Belmont, CA.
  • Bickel, P. and Doksum, K. (1977). Mathematical Statistics. Prentice-Hall, Englewood Cliffs, NJ.
  • Blyth, C. R. and Still, H. A. (1983). Binomial confidence intervals. J. Amer. Statist. Assoc. 78 108-116.
  • Brown, L. D., Cai, T. and DasGupta, A. (1999). Confidence intervals for a binomial proportion and asymptotic expansions. Ann. Statist. To appear.
  • Brown, L. D., Cai, T. and DasGupta, A. (2000). Interval estimation in discrete exponential family. Technical report, Dept. Statistics, Univ. Pennsylvania.
  • Casella, G. (1986). Refining binomial confidence intervals. Canad. J. Statist. 14 113-129.
  • Casella, G. andBerger, R. L. (1990). Statistical Inference. Wadsworth & Brooks/Cole, Belmont, CA.
  • Clopper, C. J. and Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26 404-413.
  • Cox, D. R. and Snell, E. J. (1989). Analysis of Binary Data, 2nd ed. Chapman and Hall, London.
  • Cressie, N. (1980). A finely tuned continuity correction. Ann. Inst. Statist. Math. 30 435-442.
  • Ghosh, B. K. (1979). A comparison of some approximate confidence intervals for the binomial parameter. J. Amer. Statist. Assoc. 74 894-900.
  • Hall, P. (1982). Improving the normal approximation when constructing one-sided confidence intervals for binomial or Poisson parameters. Biometrika 69 647-652.
  • Lehmann, E. L. (1999). Elements of Large-Sample Theory. Springer, New York.
  • Newcombe, R. G. (1998). Two-sided confidence intervals for the single proportion; comparison of several methods. Statistics in Medicine 17 857-872.
  • Rao, C. R. (1973). Linear Statistical Inference and Its Applications. Wiley, New York.
  • Samuels, M. L. and Witmer, J. W. (1999). Statistics for the Life Sciences, 2nd ed. Prentice Hall, Englewood Cliffs, NJ.
  • Santner, T. J. (1998). A note on teaching binomial confidence intervals. Teaching Statistics 20 20-23.
  • Santner, T. J. and Duffy, D. E. (1989). The Statistical Analysis of Discrete Data. Springer, Berlin.
  • Stone, C. J. (1995). A Course in Probability and Statistics. Duxbury, Belmont, CA.
  • Strawderman, R. L. and Wells, M. T. (1998). Approximately exact inference for the common odds ratio in several 2 × 2 tables (with discussion). J. Amer. Statist. Assoc. 93 1294-1320.
  • Tamhane, A. C. and Dunlop, D. D. (2000). Statistics and Data Analysis from Elementary to Intermediate. Prentice Hall, Englewood Cliffs, NJ.
  • Vollset, S. E. (1993). Confidence intervals for a binomial proportion. Statistics in Medicine 12 809-824.
  • Wasserman, L. (1991). An inferential interpretation of default priors. Technical report, Carnegie-Mellon Univ.
  • Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. J. Amer. Statist. Assoc. 22 209-212.
  • Oliver (2000). But this does not account for the effects of discreteness, and as BCD point out, guidelines in terms of p are not verifiable. For elementary course teaching there is no obvious alternative (such as t methods) for smaller n, so we think it is sensible to teach a single method that behaves reasonably well for all n, as do the Wilson, Jeffreys and Agresti-Coull intervals.
  • (See Vollset, 1993). Presumably some other boundary modification will result in a happy medium. In a letter to the editor about Agresti and Coull (1998), Rindskopf (2000) argued in favor of the logit interval partly because of its connection with logit modeling. We have not used this method for teaching in elementary courses, since logit intervals do not extend to intervals for the difference of proportions and (like $CI_W$ and $CI_J$) they are rather complex for that level. For practical use and for teaching in more advanced courses, some statisticians may prefer the likelihood ratio interval, since conceptually it is simple and the method also applies in a general model-building framework. An advantage compared to the Wald approach is its invariance to the choice of scale, so that, for instance, the same interval results from both the original scale and the logit scale. BCD do not say much about this interval, since it is harder to compute. However, it is easy to obtain with standard statistical software (e.g., in SAS, using the LRCI option in PROC GENMOD for a model containing only an intercept term and assuming a binomial response with logit or identity link function). Graphs in Vollset (1993) suggest that the boundary-modified likelihood ratio interval also behaves reasonably well, although conservative for p near 0 and 1. For elementary course teaching, a disadvantage of all such intervals using boundary modifications is that making exceptions from a general, simple recipe distracts students from the simple concept of taking the estimate plus and minus a normal score multiple of a standard error. (Of course, this concept is not sufficient for serious statistical work, but some oversimplification and compromise is necessary at that level.) Even with $CI_{AC}$, instructors may find it preferable to give a recipe with the same number of added pseudo observations for all $\alpha$, instead of $z_{\alpha/2}^2/2$. Reasonably good performance seems to result, especially for small $\alpha$, from the value $4 \approx z_{0.025}^2$ used in the 95% $CI_{AC}$ interval (i.e., the "add two successes and two failures" interval). Agresti and Caffo (2000) discussed this and showed that adding four pseudo observations also dramatically improves the Wald two-sample interval for comparing proportions, although again at the cost of rather severe conservativeness when both parameters are near 0 or near 1.
  • 1954; Blyth and Still, 1983). Finally, we are curious about the implications of the BCD results in a more general setting. How much does their message about the effects of discreteness and basing interval estimation on the Jeffreys prior or the score test rather than the Wald test extend to parameters in other discrete distributions and to two-sample comparisons? We have seen that interval estimation of the Poisson parameter benefits from inverting the score test rather than the Wald test on the count scale (Agresti and Coull, 1998). One would not think there could be anything new to say about the Wald confidence interval for a proportion, an inferential method that must be one of the most frequently used since Laplace (1812, page 283). Likewise, the confidence interval for a proportion based on the Jeffreys prior has received attention in various forms for some time. For instance, R. A. Fisher (1956, pages 63-70) showed the similarity of a Bayesian analysis with Jeffreys prior to his fiducial approach, in a discussion that was generally critical of the confidence interval method but grudgingly admitted of limits obtained by a test inversion such as the Clopper-Pearson method, "though they fall short in logical content of the limits found by the fiducial argument, and with which they have often been confused, they do fulfil some of the desiderata of statistical inferences." Congratulations to the authors for brilliantly casting new light on the performance of these old and established methods.
  • lished in Datta and Ghosh (1996). Thus a uniform prior for $\arcsin(\theta^{1/2})$, where $\theta$ is the binomial proportion, leads to Jeffreys' Beta(1/2, 1/2) prior for $\theta$. When $\theta$ is the Poisson parameter, the uniform prior for $\theta^{1/2}$ leads to Jeffreys' prior $\theta^{-1/2}$ for $\theta$. In a more formal set-up, let $X_1, \ldots, X_n$ be iid conditional on some real-valued $\theta$. Let $\theta^{\pi}_{1-\alpha}(X_1, \ldots, X_n)$ denote a posterior $(1-\alpha)$th quantile for $\theta$ under the prior $\pi$. Then $\pi$ is said to be a first-order probability matching prior if $P_\theta\{\theta \le \theta^{\pi}_{1-\alpha}(X_1, \ldots, X_n)\} = 1 - \alpha + o(n^{-1/2})$. (1)
  • as derived in Tibshirani (1989). Here $h(\cdot)$ is an arbitrary function differentiable in its arguments. In general, matching priors have a long success story in providing frequentist confidence intervals, especially in complex problems, for example, the Behrens-Fisher or the common mean estimation problems where frequentist methods run into difficulty. Though asymptotic, the matching property seems to hold for small and moderate sample sizes as well, for many important statistical problems. One such example is Garvan and Ghosh (1997) where such priors were found for general dispersion models as given in Jorgensen (1997). It may be worthwhile developing these priors in the presence of nuisance parameters for other discrete cases as well, for example when the parameter of interest is the difference of two binomial proportions, or the log-odds ratio in a 2 × 2 contingency table. Having argued so strongly in favor of matching priors, I wonder, though, whether there is any special need for such priors in this particular problem of binomial proportions. It appears that any Beta$(a, a)$ prior will do well in this case. As noted in this paper, by shrinking the MLE $X/n$ toward the prior mean 1/2, one achieves a better centering for the construction of confidence intervals. The two diametrically opposite priors Beta(2, 2) (symmetric concave with maximum at 1/2, which provides the Agresti-Coull interval) and Jeffreys prior Beta(1/2, 1/2) (symmetric convex with minimum at 1/2) seem to be equally good for recentering. Indeed, I wonder whether any Beta$(\alpha, \beta)$ prior which shrinks the MLE toward the prior mean $\alpha/(\alpha + \beta)$ becomes appropriate for recentering. The problem of construction of confidence intervals for binomial proportions occurs in first courses in statistics as well as in day-to-day consulting. While I am strongly in favor of replacing Wald intervals by the new ones for the latter, I am not quite sure how easy it will be to motivate these new intervals for the former. The notion of shrinking can be explained adequately only to a few strong students in introductory statistics courses. One possible solution for the classroom may be to bring in the notion of continuity correction and somewhat heuristically ask students to work with $(X + \tfrac{1}{2}, n - X + \tfrac{1}{2})$ instead of $(X, n - X)$. In this way, one centers around $(X + \tfrac{1}{2})/(n + 1)$, à la Jeffreys prior.
  • interval (e.g., Theorem 1 of Ghosh, 1979). My first set of comments concerns the specific binomial problem that the authors address and then the implications of their work for other important discrete data confidence interval problems. The results in Ghosh (1979) complement the calculations of Brown, Cai and DasGupta (BCD) by pointing out that the Wald interval is "too long" in addition to being centered at the "wrong" value (the MLE as opposed to a Bayesian point estimate such as is used by the Agresti-Coull interval). His Table 3 lists the probability that the Wald interval is longer than the Wilson interval for a central set of p values (from 0.20 to 0.80) and a range of sample sizes n from 20 to 200. Perhaps surprisingly, in view of its inferior coverage characteristics, the Wald interval tends to be longer than the Wilson interval with very high probability. Hence the Wald interval is both too long and centered at the wrong place. This is a dramatic effect of the skewness that BCD mention. When discussing any system of intervals, one is concerned with the consistency of the answers given by the interval across multiple uses by a single researcher or by groups of users. Formally, this is the reason why various symmetry properties are required of confidence intervals. For example, in the present case, requiring that the interval $(L(X), U(X))$ for $p$ satisfy the symmetry property $(L(x), U(x)) = (1 - U(n - x), 1 - L(n - x))$ for $x \in \{0, \ldots, n\}$ (1) shows that investigators who reverse their definitions of success and failure will
  • Pearson (1934). For the binomial problem, Blyth and Still (1983) constructed a set of confidence intervals by selecting among size-$\alpha$ acceptance regions those that possessed additional symmetry properties and were "small" (leading to short confidence intervals). For example, they desired that the interval should "move to the right" as x increases when n is fixed and should "move to the left" as n increases when x is fixed. They also asked that their system of intervals increase monotonically in the coverage probability for fixed x and n, in the sense that the higher nominal coverage interval contain the lower nominal coverage interval. In addition to being less intuitive to unsophisticated statistical consumers, systems of confidence intervals formed by inversion of acceptance regions also have two other handicaps that have hindered their rise in popularity. First, they typically require that the confidence interval (essentially) be constructed for all possible outcomes, rather than merely the response of interest. Second, their rather brute-force character means that a specialized computer program must be written to produce the acceptance sets and their inversion (the intervals).
  • Gupta (2001). This article also discusses a number of additional issues and presents further analytical calculations, including a Pearson tilting similar to the chi-square tilts advised in Hall (1983). Corcoran and Mehta's Figure 2 compares average length of three of our proposals with Blyth-Still and with their likelihood ratio procedure. We note first that their LR procedure is not the same as ours. Theirs is based on numerically computed exact percentiles of the fixed-sample likelihood ratio statistic. We suspect this is roughly equivalent to adjustment of the chi-squared percentile by a Bartlett correction. Ours is based on the traditional asymptotic chi-squared formula for the distribution of the likelihood ratio statistic. Consequently, their procedure has conservative coverage, whereas ours has coverage fluctuating around the nominal value. They assert that the difference in expected length is "negligible." How much difference qualifies as negligible is an arguable, subjective evaluation. But we note that in their plot their intervals can be on average about 8% or 10% longer than Jeffreys or Wilson intervals, respectively. This seems to us a nonnegligible difference. Actually, we suspect their preference for their LR and BSC intervals rests primarily on their overriding preference for conservativity in coverage whereas, as we have discussed above, our intervals are designed to attain approximately the desired nominal value. D. Santner proposes an interesting variant of the original Blyth-Still proposal. As we understand it, he suggests producing nominal $\gamma\%$ intervals by constructing the $\gamma'\%$ Blyth-Still intervals, with $\gamma'\%$ chosen so that the average coverage of the resulting intervals is approximately the nominal value, $\gamma\%$. The coverage plot for this procedure compares well with that for Wilson or Jeffreys in our Figure 5. Perhaps the expected interval length for this procedure also compares well, although Santner does not say so. However, we still do not favor his proposal. It is conceptually more complicated and requires a specially designed computer program, particularly if one wishes to compute $\gamma'\%$ with any degree of accuracy. It thus fails with respect to the criterion of scientific parsimony in relation to other proposals that appear to have at least competitive performance characteristics. E. Casella suggests the possibility of performing a continuity correction on the score statistic prior to constructing a confidence interval. We do not agree with this proposal from any perspective. These "continuity-corrected Wilson" intervals have extremely conservative coverage properties, though they may not in principle be guaranteed to be everywhere conservative. But even if one's goal, unlike ours, is to produce conservative intervals, these intervals will be very inefficient at their nominal level relative to Blyth-Still or even Clopper-Pearson. In Figure 1 below, we plot the coverage of the Wilson interval with and without a continuity correction for n = 25 and $\alpha = 0.05$, and the corresponding expected lengths. It seems clear that the loss in precision more than neutralizes the improvements in coverage and that the nominal coverage of 95% is misleading from any perspective.
  • Agresti, A. and Caffo, B. (2000). Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. Amer. Statist. 54. To appear.
  • Aitkin, M., Anderson, D., Francis, B. and Hinde, J. (1989). Statistical Modelling in GLIM. Oxford Univ. Press.
  • Boos, D. D. and Hughes-Oliver, J. M. (2000). How large does n have to be for Z and t intervals? Amer. Statist. 54 121-128.
  • Brown, L. D., Cai, T. and DasGupta, A. (2000a). Confidence intervals for a binomial proportion and asymptotic expansions. Ann. Statist. To appear.
  • Brown, L. D., Cai, T. and DasGupta, A. (2000b). Interval estimation in exponential families. Technical report, Dept. Statistics, Univ. Pennsylvania.
  • Brown, L. D. and Li, X. (2001). Confidence intervals for the difference of two binomial proportions. Unpublished manuscript.
  • Cai, T. (2001). One-sided confidence intervals and hypothesis testing in discrete distributions. Preprint.
  • Coe, P. R. and Tamhane, A. C. (1993). Exact repeated confidence intervals for Bernoulli parameters in a group sequential clinical trial. Controlled Clinical Trials 14 19-29.
  • Cox, D. R. and Reid, N. (1987). Orthogonal parameters and approximate conditional inference (with discussion). J. Roy. Statist. Soc. Ser. B 49 113-147.
  • DasGupta, A. (2001). Some further results in the binomial interval estimation problem. Preprint.
  • Datta, G. S. and Ghosh, M. (1996). On the invariance of noninformative priors. Ann. Statist. 24 141-159.
  • Duffy, D. and Santner, T. J. (1987). Confidence intervals for a binomial parameter based on multistage tests. Biometrics 43 81-94.
  • Fisher, R. A. (1956). Statistical Methods for Scientific Inference. Oliver and Boyd, Edinburgh.
  • Gart, J. J. (1966). Alternative analyses of contingency tables. J. Roy. Statist. Soc. Ser. B 28 164-179.
  • Garvan, C. W. and Ghosh, M. (1997). Noninformative priors for dispersion models. Biometrika 84 976-982.
  • Ghosh, J. K. (1994). Higher Order Asymptotics. IMS, Hayward, CA.
  • Hall, P. (1983). Chi-squared approximations to the distribution of a sum of independent random variables. Ann. Statist. 11 1028-1036.
  • Jennison, C. and Turnbull, B. W. (1983). Confidence intervals for a binomial parameter following a multistage test with application to MIL-STD 105D and medical trials. Technometrics 25 49-58.
  • Jorgensen, B. (1997). The Theory of Dispersion Models. CRC Chapman and Hall, London.
  • Laplace, P. S. (1812). Théorie Analytique des Probabilités. Courcier, Paris.
  • Larson, H. J. (1974). Introduction to Probability Theory and Statistical Inference, 2nd ed. Wiley, New York.
  • Pratt, J. W. (1961). Length of confidence intervals. J. Amer. Statist. Assoc. 56 549-567.
  • Rindskopf, D. (2000). Letter to the editor. Amer. Statist. 54 88.
  • Rubin, D. B. and Schenker, N. (1987). Logit-based interval estimation for binomial data using the Jeffreys prior. Sociological Methodology 17 131-144.
  • Sterne, T. E. (1954). Some remarks on confidence or fiducial limits. Biometrika 41 275-278.
  • Tibshirani, R. (1989). Noninformative priors for one parameter of many. Biometrika 76 604-608.
  • Welch, B. L. and Peers, H. W. (1963). On formulae for confidence points based on integrals of weighted likelihoods. J. Roy. Statist. Soc. Ser. B 25 318-329.
  • Yamagami, S. and Santner, T. J. (1993). Invariant small sample confidence intervals for the difference of two success probabilities. Comm. Statist. Simul. Comput. 22 33-59.
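
The rejoinder excerpt above distinguishes Corcoran and Mehta's exactly calibrated LR procedure from BCD's, which uses the traditional asymptotic chi-squared cutoff. A minimal sketch of that asymptotic version follows, assuming Python 3.9+ and the standard library: the interval is the set of p whose likelihood ratio statistic does not exceed the chi-square(1) quantile, which equals $z_{\alpha/2}^2$, and each limit is found by bisection. The names `loglik` and `lr_interval` are hypothetical. The SAS route mentioned in an earlier excerpt (the LRCI option in PROC GENMOD) is not reproduced here.

```python
from math import exp, log, log1p
from statistics import NormalDist


def loglik(p: float, x: int, n: int) -> float:
    """Binomial log-likelihood up to a constant, taking 0 * log(0) as 0."""
    value = 0.0
    if x > 0:
        value += x * log(p)
    if x < n:
        value += (n - x) * log1p(-p)
    return value


def lr_interval(x: int, n: int, alpha: float = 0.05, iters: int = 200) -> tuple[float, float]:
    """Set of p with 2 * (loglik(p_hat) - loglik(p)) <= chi-square(1) quantile (= z^2)."""
    crit = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    p_hat = x / n
    l_max = loglik(p_hat, x, n)

    def deviance(p: float) -> float:
        return 2.0 * (l_max - loglik(p, x, n))

    lower = 0.0
    if x > 0:  # deviance falls from +inf near p = 0 to 0 at p_hat
        a, b = 0.0, p_hat
        for _ in range(iters):
            mid = (a + b) / 2
            if deviance(mid) > crit:
                a = mid
            else:
                b = mid
        lower = b
    upper = 1.0
    if x < n:  # deviance rises from 0 at p_hat toward +inf near p = 1
        a, b = p_hat, 1.0
        for _ in range(iters):
            mid = (a + b) / 2
            if deviance(mid) > crit:
                b = mid
            else:
                a = mid
        upper = a
    return lower, upper


if __name__ == "__main__":
    n, crit = 10, NormalDist().inv_cdf(0.975) ** 2
    # For x = 0 the upper limit solves -2 n log(1 - p) = z^2 in closed form.
    print(lr_interval(0, n)[1], 1 - exp(-crit / (2 * n)))
    print(lr_interval(3, n))
```

The x = 0 case makes a convenient check: there the statistic reduces to $-2n\log(1-p)$, so the upper limit is $1 - e^{-z^2/(2n)}$.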
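
Two discussion excerpts above connect recentering with symmetric Beta priors. Under a Beta(a, a) prior the posterior mean is $(X + a)/(n + 2a)$, which pulls $X/n$ toward 1/2. The choice a = 1/2 gives the Jeffreys center $(X + 1/2)/(n + 1)$, and a = 2 gives $(X + 2)/(n + 4)$, the center of the "add two successes and two failures" interval. That interval approximates the 95% Agresti-Coull interval because $z_{0.025}^2 \approx 3.84 \approx 4$. A minimal sketch with hypothetical helper names, assuming Python 3.9+:

```python
from math import sqrt
from statistics import NormalDist


def beta_posterior_mean(x: int, n: int, a: float) -> float:
    """Posterior mean of p under a symmetric Beta(a, a) prior; shrinks x/n toward 1/2."""
    return (x + a) / (n + 2 * a)


def add_four_interval(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Wald formula applied after adding two successes and two failures."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_tilde = beta_posterior_mean(x, n, 2.0)  # (x + 2) / (n + 4)
    half = z * sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return p_tilde - half, p_tilde + half


if __name__ == "__main__":
    x, n = 0, 10
    print("MLE          ", x / n)
    print("Jeffreys mean", beta_posterior_mean(x, n, 0.5))  # (x + 1/2) / (n + 1)
    print("Beta(2, 2)   ", beta_posterior_mean(x, n, 2.0))  # (x + 2) / (n + 4)
    print("add-four 95% ", add_four_interval(x, n))
```

The recipe is the same for every x and n, which is the teaching advantage the excerpts stress. The price is a fixed 4 in place of the $\alpha$-dependent $z_{\alpha/2}^2$.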
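
The Poisson remark in a discussion excerpt above, that inverting the score test beats the Wald test on the count scale, can be checked with the same exact-coverage approach. For a single count X with mean μ, the Wald interval is $X \pm z\sqrt{X}$. Inverting $|X - \mu| \le z\sqrt{\mu}$ gives the roots of $(X - \mu)^2 = z^2\mu$, namely $X + z^2/2 \pm z\sqrt{X + z^2/4}$. A minimal sketch, assuming Python 3.9+. The function names and the truncation point of the infinite sum are our own choices.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def poisson_pmf(k: int, mu: float) -> float:
    return exp(k * log(mu) - mu - lgamma(k + 1))


def wald_poisson(x: int, z: float) -> tuple[float, float]:
    half = z * sqrt(x)
    return x - half, x + half


def score_poisson(x: int, z: float) -> tuple[float, float]:
    """Roots of (x - mu)^2 = z^2 * mu, i.e. the inverted score test."""
    center = x + z * z / 2
    half = z * sqrt(x + z * z / 4)
    return center - half, center + half


def poisson_coverage(interval, mu: float, alpha: float = 0.05) -> float:
    """Coverage at mu > 0, truncating the Poisson sum far into the upper tail."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    k_max = int(mu + 20 * sqrt(mu) + 30)
    total = 0.0
    for k in range(k_max + 1):
        lo, hi = interval(k, z)
        if lo <= mu <= hi:
            total += poisson_pmf(k, mu)
    return total


if __name__ == "__main__":
    for mu in (0.5, 2.0, 10.0):
        print(mu, poisson_coverage(wald_poisson, mu), poisson_coverage(score_poisson, mu))
```

At μ = 0.5, for example, the Wald interval misses μ whenever X = 0 (probability $e^{-0.5} \approx 0.61$) and covers it only for 1 ≤ X ≤ 4, about 0.39 in total. The score interval covers μ for X ∈ {0, 1}, about 0.91.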
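
The finding attributed above to Ghosh (1979), that the Wald interval is usually longer than the Wilson interval, can be made explicit. The two half-widths are $z\sqrt{\hat p(1-\hat p)/n}$ and $\frac{z}{n+z^2}\sqrt{n\hat p(1-\hat p) + z^2/4}$. Squaring and comparing them shows that the Wald interval is longer exactly when $\hat p(1-\hat p) > n/(4(2n+z^2))$, a threshold close to 1/8. This short derivation is ours, not taken from the paper or the discussion. The sketch below, with hypothetical helper names and assuming Python 3.9+, computes the probability of that event under Binomial(n, p).

```python
from math import comb, sqrt
from statistics import NormalDist


def wald_length(x: int, n: int, z: float) -> float:
    p_hat = x / n
    return 2 * z * sqrt(p_hat * (1 - p_hat) / n)


def wilson_length(x: int, n: int, z: float) -> float:
    p_hat = x / n
    return 2 * z / (n + z * z) * sqrt(n * p_hat * (1 - p_hat) + z * z / 4)


def prob_wald_longer(n: int, p: float, alpha: float = 0.05) -> float:
    """P(Wald interval is longer than Wilson interval) when X ~ Binomial(n, p)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return sum(
        comb(n, x) * p**x * (1 - p) ** (n - x)
        for x in range(n + 1)
        if wald_length(x, n, z) > wilson_length(x, n, z)
    )


if __name__ == "__main__":
    for n in (20, 50, 200):
        print(n, [round(prob_wald_longer(n, p), 4) for p in (0.2, 0.5, 0.8)])
```

Since the event depends only on whether $\hat p$ falls roughly inside (0.15, 0.85), its probability approaches 1 for central p as n grows. This is consistent with the excerpt's description of Ghosh's Table 3.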
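
Casella's suggestion, as described in the rejoinder excerpt above, applies a continuity correction to the score statistic before inverting it. The rejoinder does not give a formula. The sketch assumes the common form that inverts $|X - np| - 1/2 \le z\sqrt{np(1-p)}$, which amounts to computing the Wilson lower limit at x − 1/2 and the Wilson upper limit at x + 1/2, with L = 0 at x = 0 and U = 1 at x = n. It then compares exact coverage and expected length with the plain Wilson interval at n = 25 and α = 0.05, the setting of the rejoinder's Figure 1. It assumes Python 3.9+, and all function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist


def _score_roots(x: float, n: int, z: float) -> tuple[float, float]:
    """Both roots in p of (x - n p)^2 = z^2 n p (1 - p); x may be fractional."""
    center = (x + z * z / 2) / (n + z * z)
    half = z * sqrt(x * (n - x) / n + z * z / 4) / (n + z * z)
    return center - half, center + half


def wilson(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    return _score_roots(x, n, NormalDist().inv_cdf(1 - alpha / 2))


def wilson_cc(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Invert |x - n p| - 1/2 <= z sqrt(n p (1 - p)): use x - 1/2 for the lower limit and
    x + 1/2 for the upper limit, with L = 0 at x = 0 and U = 1 at x = n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower = 0.0 if x == 0 else _score_roots(x - 0.5, n, z)[0]
    upper = 1.0 if x == n else _score_roots(x + 0.5, n, z)[1]
    return lower, upper


def coverage_and_length(interval, n: int, p: float, alpha: float = 0.05) -> tuple[float, float]:
    """Exact coverage probability and expected length at p."""
    cov = length = 0.0
    for x in range(n + 1):
        w = comb(n, x) * p**x * (1 - p) ** (n - x)
        lo, hi = interval(x, n, alpha)
        length += w * (hi - lo)
        if lo <= p <= hi:
            cov += w
    return cov, length


if __name__ == "__main__":
    n = 25  # alpha = 0.05 throughout
    for p in (0.05, 0.1, 0.2, 0.3, 0.5):
        (cw, lw), (cc, lc) = (coverage_and_length(f, n, p) for f in (wilson, wilson_cc))
        print(f"p={p:.2f}  Wilson: cov={cw:.3f} len={lw:.3f}  CC: cov={cc:.3f} len={lc:.3f}")
```

Because the Wilson limits increase with x, the corrected interval always contains the uncorrected one. It therefore gains coverage at every p and pays for it in expected length, which is the trade-off the rejoinder argues is not worth making.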