Bayesian Analysis

Posterior Predictive p-Values with Fisher Randomization Tests in Noncompliance Settings: Test Statistics vs Discrepancy Measures

Laura Forastiere, Fabrizia Mealli, and Luke Miratrix

Advance publication

This article is in its final form and can be cited using the date of online publication and the DOI.

Full-text: Open access


In randomized experiments with noncompliance, one might wish to focus on compliers rather than on the overall sample. In this vein, Rubin (1998) argued that testing for the complier average causal effect and averaging permutation-based p-values over the posterior distribution of the compliance types could increase power as compared to general intent-to-treat tests. The general scheme is a repeated two-step process: impute missing compliance types and conduct a permutation test with the completed data. In this paper, we explore this idea further, comparing the use of discrepancy measures, which depend on unknown but imputed parameters, to classical test statistics and contrasting different approaches for imputing the unknown compliance types. We also examine consequences of model misspecification in the imputation step, and discuss to what extent this additional modeling undercuts the advantage of permutation tests being model independent. We find that, especially for discrepancy measures, modeling choices can impact both power and validity. In particular, imputing missing compliance types under the null can radically reduce power, but not doing so can jeopardize validity. Fortunately, using covariates predictive of compliance type in the imputation can mitigate these problems. We also compare this overall approach to Bayesian model-based tests, that is, tests that are directly derived from posterior credible intervals, under both correct and incorrect model specification.
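
The repeated two-step scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on simplifying assumptions that are ours: one-sided noncompliance (control units cannot take the treatment, so the only types are compliers and never-takers), a Beta(1, 1) prior on the complier share, and an imputation step that ignores outcomes and covariates. The function names `cace_stat` and `averaged_frt_p_value` are hypothetical.

```python
import numpy as np


def cace_stat(z, g, y):
    """Difference in mean outcomes between arms among (imputed) compliers."""
    treated_compliers = (g == 1) & (z == 1)
    control_compliers = (g == 1) & (z == 0)
    if not treated_compliers.any() or not control_compliers.any():
        return 0.0  # statistic undefined for this split; treat as no evidence
    return y[treated_compliers].mean() - y[control_compliers].mean()


def averaged_frt_p_value(z, d, y, n_imputations=50, n_permutations=200, seed=0):
    """Average permutation p-values over imputed compliance types.

    z: assignment (0/1), d: treatment received (0/1), y: outcome.
    Assumes one-sided noncompliance, so d == 0 whenever z == 0.
    """
    rng = np.random.default_rng(seed)
    z = np.asarray(z)
    d = np.asarray(d)
    y = np.asarray(y, dtype=float)
    treated = z == 1
    n_compliers = int(d[treated].sum())
    n_never_takers = int(treated.sum()) - n_compliers

    p_values = []
    for _ in range(n_imputations):
        # Step 1: draw the complier share from its Beta posterior (flat prior,
        # an assumption) and impute the unobserved types in the control arm.
        share = rng.beta(1 + n_compliers, 1 + n_never_takers)
        g = np.where(treated, d, rng.binomial(1, share, size=z.size))

        # Step 2: permutation test on the completed data. Under the sharp
        # null, outcomes and compliance types are fixed; only z is permuted.
        observed = abs(cace_stat(z, g, y))
        exceed = sum(
            abs(cace_stat(rng.permutation(z), g, y)) >= observed
            for _ in range(n_permutations)
        )
        p_values.append((1 + exceed) / (1 + n_permutations))

    return float(np.mean(p_values))
```

Because the imputation here uses neither outcomes nor covariates, it sidesteps exactly the modeling choices the paper studies (imputing under the null versus not, and using covariates predictive of compliance type), so it should be read only as an outline of the control flow.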

Article information

Bayesian Anal. (2017), 21 pages.

First available in Project Euclid: 17 August 2017

Digital Object Identifier: doi:10.1214/17-BA1062

Keywords: posterior predictive p-values (PPPV), permutation testing, noncompliance, principal stratification, complier average causal effects (CACE)

Creative Commons Attribution 4.0 International License.


Forastiere, Laura; Mealli, Fabrizia; Miratrix, Luke. Posterior Predictive $p$-Values with Fisher Randomization Tests in Noncompliance Settings: Test Statistics vs Discrepancy Measures. Bayesian Anal., advance publication, 17 August 2017. doi:10.1214/17-BA1062.


  • Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables (with discussion). Journal of the American Statistical Association, 91, 444–472.
  • Fisher, R. A. (1925). Statistical Methods for Research Workers. 1st ed. Oliver and Boyd, Edinburgh.
  • Fisher, R. A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture of Great Britain, 33, 503–513.
  • Fisher, R. A. (1935). The Design of Experiments. Oliver and Boyd, Edinburgh.
  • Frangakis, C. E. & Rubin, D. B. (2002). Principal stratification in causal inference. Biometrics, 58, 21–29.
  • Gelman, A., Meng, X. L., & Stern, H. S. (1996). Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statistica Sinica, 6, 733–807.
  • Gelman, A. (2013). Two simple examples for understanding posterior $p$-values whose distributions are far from uniform. Electronic Journal of Statistics, 7, 2595–2602.
  • Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society B, 29(1), 83–100.
  • Imbens, G. W. & Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62, 467–476.
  • Meng, X. L. (1994a). Posterior predictive $p$-values. Annals of Statistics, 22, 1142–1160.
  • Meng, X. L. (1994b). Multiple-imputation inferences under uncongeniality. Statistical Science, 4, 538–573.
  • Neyman, J. (1923). On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Roczniki Nauk Rolniczych Tom X [in Polish]; translated in Statistical Science, 5, 465–480.
  • Neyman, J. (1934). On two different aspects of the representative method: The method of stratified sampling and the method of purposive selection (with discussion). Journal of the Royal Statistical Society, 97, 558–625.
  • Robins, J. M., van der Vaart, A., & Ventura, V. (2000). Asymptotic distribution of p values in composite null models. Journal of the American Statistical Association, 95, 1143–1156.
  • Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688–701.
  • Rubin, D. B. (1978). Bayesian inference for causal effects. Annals of Statistics, 6, 34–58.
  • Rubin, D. B. (1980). Comment on "Randomization Analysis of Experimental Data: The Fisher Randomization Test" by D. Basu. Journal of the American Statistical Association, 75, 591–593.
  • Rubin, D. B. (1981). Estimation in parallel randomized experiments. Journal of Educational Statistics, 6(4), 377–401.
  • Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics, 12(4), 1151–1172.
  • Rubin, D. B. (1996a). Discussion of “Posterior predictive $p$-values?” by Gelman, A., Meng, X. L., & Stern, H. Statistica Sinica, 6, 787–792.
  • Rubin, D. B. (1996b). Multiple imputation after 18+ years (with discussion). Journal of the American Statistical Association, 91, 473–520.
  • Rubin, D. B. (1998). More powerful randomization-based $p$-values in double-blind trials with non-compliance. Statistics in Medicine, 17(3), 371–385.
  • Tanner, M. A. & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82, 528–550.