Annals of Applied Statistics

Statistical methods for replicability assessment

Kenneth Hung and William Fithian



Large-scale replication studies like the Reproducibility Project: Psychology (RP:P) provide invaluable systematic data on scientific replicability, but most analyses and interpretations of the data fail to agree on the definition of “replicability” and disentangle the inexorable consequences of known selection bias from competing explanations. We discuss three concrete definitions of replicability based on: (1) whether published findings about the signs of effects are mostly correct, (2) how effective replication studies are in reproducing whatever true effect size was present in the original experiment and (3) whether true effect sizes tend to diminish in replication. We apply techniques from multiple testing and postselection inference to develop new methods that answer these questions while explicitly accounting for selection bias. Our analyses suggest that the RP:P dataset is largely consistent with publication bias due to selection of significant effects. The methods in this paper make no distributional assumptions about the true effect sizes.
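As a toy illustration of why selection of significant effects can, on its own, make effects appear to shrink in replication, the sketch below simulates paired original and replication test statistics that share the same true effect, and "publishes" only the originals that cross a two-sided significance threshold. This is not the authors' method (Supplement B states that their analyses were run in R). The function name `simulate`, the unit standard error, the 1.96 cutoff and the true effect of 1.0 are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate(true_effect=1.0, n_studies=20000, z_crit=1.96, seed=0):
    """Return (mean published original z, mean matching replication z).

    Assumptions (illustrative only): each study yields a normal z-statistic
    with unit standard error, and an original study is published only if
    |z| > z_crit. Replications are never selected on their own outcome.
    """
    rng = random.Random(seed)
    published_orig, published_rep = [], []
    for _ in range(n_studies):
        z_orig = rng.gauss(true_effect, 1.0)
        z_rep = rng.gauss(true_effect, 1.0)  # same true effect as the original
        if abs(z_orig) > z_crit:  # hypothetical publication filter
            published_orig.append(z_orig)
            published_rep.append(z_rep)
    return statistics.mean(published_orig), statistics.mean(published_rep)


if __name__ == "__main__":
    orig_mean, rep_mean = simulate()
    print(f"mean published original z: {orig_mean:.2f}")
    print(f"mean replication z:        {rep_mean:.2f}")
```

In this setup the true effect is identical in both studies, so any gap between the two averages comes from the publication filter alone. An analysis that ignores the filter would misread that gap as a decline in the true effect.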

Article information

Ann. Appl. Stat., Volume 14, Number 3 (2020), 1063-1087.

Received: April 2019
Revised: February 2020
First available in Project Euclid: 18 September 2020

Keywords: Replicability; multiple testing; postselection inference; publication bias; meta-analysis


Hung, Kenneth; Fithian, William. Statistical methods for replicability assessment. Ann. Appl. Stat. 14 (2020), no. 3, 1063--1087. doi:10.1214/20-AOAS1336.




Supplemental materials

  • Supplement A: Supplement to “Statistical methods for replicability assessment”. We detail considerations made for individual studies, as well as evaluate our approximation of $t$-distributions by normal distributions.
  • Supplement B: Code for “Statistical methods for replicability assessment”. We generated the figures, performed the analyses on the RP:P dataset and ran relevant simulations in R.