Statistical Science

A Paradox from Randomization-Based Causal Inference

Peng Ding

Under the potential outcomes framework, causal effects are defined as comparisons between potential outcomes under treatment and control. To infer causal effects from randomized experiments, Neyman proposed testing the null hypothesis of zero average causal effect (Neyman’s null), and Fisher proposed testing the null hypothesis of zero individual causal effect (Fisher’s null). Although the subtle difference between Neyman’s null and Fisher’s null has caused much controversy and confusion among both theoretical and practical statisticians, a careful comparison between the two approaches has been lacking in the literature for more than eighty years. We fill this historical gap by comparing them theoretically and highlighting an intriguing paradox that previous researchers have not recognized. Logically, Fisher’s null implies Neyman’s null. It is therefore surprising that, in actual completely randomized experiments, rejection of Neyman’s null does not imply rejection of Fisher’s null in many realistic situations, including the case of a constant causal effect. Furthermore, we show that this paradox also exists in other commonly used experiments, such as stratified experiments, matched-pair experiments and factorial experiments. Asymptotic analyses, numerical examples and real data examples all support this surprising phenomenon. Besides its historical and theoretical importance, this paradox also has useful practical implications for modern researchers.
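
The two tests contrasted in the abstract can be illustrated with a minimal Python sketch. This is not the article's code. It assumes a completely randomized experiment with a numeric outcome and uses the difference in means as the statistic for both tests: a normal approximation with the variance estimator s1^2/n1 + s0^2/n0 for Neyman's null, and a Monte Carlo approximation to the randomization distribution for Fisher's null. The function names, the number of draws, and the example data are all hypothetical.

```python
import math
import random
import statistics


def _mean(values):
    return sum(values) / len(values)


def neyman_p_value(treated, control):
    """Two-sided p-value for Neyman's null (zero average causal effect).

    Assumption of this sketch: normal approximation with the variance
    estimator s1^2/n1 + s0^2/n0 for the difference in means.
    """
    diff = _mean(treated) - _mean(control)
    var = (statistics.variance(treated) / len(treated)
           + statistics.variance(control) / len(control))
    if var == 0:
        # Degenerate case: no within-group variation at all.
        return 1.0 if diff == 0 else 0.0
    z = diff / math.sqrt(var)
    # Two-sided standard normal tail probability.
    return math.erfc(abs(z) / math.sqrt(2))


def fisher_p_value(treated, control, draws=2000, seed=0):
    """Monte Carlo p-value for Fisher's null (zero individual causal effects).

    Under Fisher's null every unit's outcome is unaffected by treatment,
    so the labels can be re-randomized while the outcomes stay fixed.
    """
    rng = random.Random(seed)
    pooled = list(treated) + list(control)
    n1 = len(treated)
    observed = abs(_mean(treated) - _mean(control))
    hits = 0
    for _ in range(draws):
        rng.shuffle(pooled)  # one hypothetical re-randomization
        diff = _mean(pooled[:n1]) - _mean(pooled[n1:])
        if abs(diff) >= observed - 1e-12:  # tolerance for float ties
            hits += 1
    return hits / draws


if __name__ == "__main__":
    # Hypothetical outcomes, for illustration only.
    treated = [4.1, 5.3, 6.8, 2.9, 7.4, 5.0]
    control = [3.2, 3.9, 4.4, 3.6, 4.1, 3.8, 4.0, 3.7]
    print("Neyman p-value:", neyman_p_value(treated, control))
    print("Fisher p-value:", fisher_p_value(treated, control))
```

Running both functions on the same data and comparing each p-value with a chosen significance level shows whether the two tests agree for that data set. The example above does not by itself demonstrate the paradox described in the abstract.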

Article information

Statist. Sci. Volume 32, Number 3 (2017), 331-345.

First available in Project Euclid: 1 September 2017

Keywords: Average null hypothesis; Fisher randomization test; potential outcome; randomized experiment; repeated sampling property; sharp null hypothesis


Ding, Peng. A Paradox from Randomization-Based Causal Inference. Statist. Sci. 32 (2017), no. 3, 331–345. doi:10.1214/16-STS571.

Supplemental materials

  • Supplementary Material. Appendix A.1 gives two useful lemmas for randomized experiments. Appendix A.2 gives the proofs of all the theorems and corollaries in the main text. Appendix A.3 comments on regression-based causal inference and establishes a new connection between Rao’s score test and the Fisher randomization test (FRT). Appendix A.4 gives more details on how Figure 2 in the main text was generated. Appendix A.5 discusses the behavior of the FRT using the Kolmogorov–Smirnov and Wilcoxon–Mann–Whitney statistics.