The Annals of Applied Statistics

False discovery rates in somatic mutation studies of cancer

Lorenzo Trippa and Giovanni Parmigiani

Full-text: Open access

Abstract

The purpose of cancer genome sequencing studies is to determine the nature and types of alterations present in a typical cancer and to discover genes mutated at high frequencies. In this article we discuss statistical methods for the analysis of somatic mutation frequency data generated in these studies. We place special emphasis on a two-stage study design introduced by Sjöblom et al. [Science 314 (2006) 268–274]. In this context, we describe and compare statistical methods for constructing scores that can be used to prioritize candidate genes for further investigation and to assess the statistical significance of the candidates thus identified. Controversy has surrounded the reliability of the false discovery rate estimates provided by the approximations used in early cancer genome studies. To address this controversy, we develop a semiparametric Bayesian model that provides an accurate fit to the data. We use this model to generate a large collection of realistic scenarios, and evaluate alternative approaches on this collection. Our assessment is impartial, in that the model used for generating data is not used by any of the approaches compared, and objective, in that the scenarios are generated by a model that fits the data. Our results quantify the conservative control of the false discovery rate with the Benjamini and Hochberg method compared to the empirical Bayes approach and the multiple testing method proposed in Storey [J. R. Stat. Soc. Ser. B Stat. Methodol. 64 (2002) 479–498]. Simulation results also show a negligible departure from the target false discovery rate for the methodology used in Sjöblom et al. [Science 314 (2006) 268–274].
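As a rough illustration of why the Benjamini and Hochberg method is described as conservative relative to the Storey (2002) approach, the following minimal sketch computes both sets of adjusted values from a vector of gene-level p-values. It is not the authors' code and does not reproduce the semiparametric Bayesian model or the two-stage design; the function names (`bh_adjust`, `storey_pi0`, `storey_qvalues`) and the fixed tuning value `lam=0.5` are assumptions made for illustration only.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg step-up adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # p_(i) * m / i for the sorted p-values
    ranked = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity, working down from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted

def storey_pi0(pvals, lam=0.5):
    """Storey-type estimate of the proportion of true nulls at a fixed
    lambda (lam is an illustrative choice, not a value from the paper)."""
    p = np.asarray(pvals, dtype=float)
    return min(1.0, float(np.mean(p > lam)) / (1.0 - lam))

def storey_qvalues(pvals, lam=0.5):
    """q-values obtained by scaling the BH adjustment by the pi0 estimate."""
    return storey_pi0(pvals, lam) * bh_adjust(pvals)

if __name__ == "__main__":
    # Hypothetical p-values, for demonstration only.
    p = [0.005, 0.5, 0.01, 0.9]
    print("BH adjusted:", bh_adjust(p))
    print("pi0 estimate:", storey_pi0(p))
    print("Storey q-values:", storey_qvalues(p))
```

Because the estimated null proportion is at most 1, the q-values in this sketch are never larger than the corresponding Benjamini and Hochberg adjusted p-values, which is one way to see the conservativeness that the abstract refers to.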

Article information

Source
Ann. Appl. Stat., Volume 5, Number 2B (2011), 1360-1378.

Dates
First available in Project Euclid: 13 July 2011

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1310562724

Digital Object Identifier
doi:10.1214/10-AOAS438

Mathematical Reviews number (MathSciNet)
MR2849777

Zentralblatt MATH identifier
05961668

Keywords
Cancer genome studies; genome-wide studies; false discovery rate; multiple hypothesis testing; somatic mutations

Citation

Trippa, Lorenzo; Parmigiani, Giovanni. False discovery rates in somatic mutation studies of cancer. Ann. Appl. Stat. 5 (2011), no. 2B, 1360–1378. doi:10.1214/10-AOAS438. https://projecteuclid.org/euclid.aoas/1310562724



References

  • Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300.
  • Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29 1165–1188.
  • Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Pólya urn schemes. Ann. Statist. 1 353–355.
  • The Cancer Genome Atlas project. (2008). Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 455 1061–1068.
  • Cheng, C. and Pounds, S. (2007). False discovery rate paradigms for statistical analyses of microarray gene expression data. Bioinformation 1 436–446.
  • Dudoit, S., Gilbert, H. and van der Laan, M. J. (2008). Resampling-based empirical Bayes multiple testing procedures for controlling generalized tail probability and expected value error rates: Focus on the false discovery rate and simulation study. Biom. J. 50 716–744.
  • Dudoit, S., Shaffer, J. P. and Boldrick, J. C. (2003). Multiple hypothesis testing in microarray experiments. Statist. Sci. 18 71–103.
  • Dunson, D. B. (2010). Nonparametric Bayes applications to biostatistics. In Bayesian Nonparametrics. Camb. Ser. Stat. Probab. Math. 223–273. Cambridge Univ. Press, Cambridge.
  • Efron, B. (2003). Robbins, empirical Bayes and microarrays. Ann. Statist. 31 366–378.
  • Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96 1151–1160.
  • Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc. 90 577–588.
  • Farcomeni, A. (2008). A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion. Stat. Methods Med. Res. 17 347–388.
  • Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 209–230.
  • Forrest, W. F. and Cavet, G. (2007). Comment on “The consensus coding sequences of human breast and colorectal cancers.” Science 317 1500 (author reply 1500).
  • Getz, G., Höfling, H., Mesirov, J. P., Golub, T. R., Meyerson, M., Tibshirani, R. and Lander, E. S. (2007). Comment on “The consensus coding sequences of human breast and colorectal cancers.” Science 317 (5844) 1500.
  • Greenman, C., Wooster, R., Futreal, P. A., Stratton, M. R. and Easton, D. F. (2006). Statistical analysis of pathogenicity of somatic mutations in cancer. Genetics 173 2187–2198.
  • Jones, S., Zhang, X., Parsons, D. W., Lin, J. C. H., Leary, R. J., Angenendt, P., Mankoo, P., Carter, H., Kamiyama, H., Jimeno, A., Hong, S. M., Fu, B., Lin, M. T., Calhoun, E. S., Kamiyama, M., Walter, K., Nikolskaya, T., Nikolsky, Y., Hartigan, J., Smith, D. R., Hidalgo, M., Leach, S. D., Klein, A. P., Jaffee, E. M., Goggins, M., Maitra, A., Iacobuzio-Donahue, C., Eshleman, J. R., Kern, S. E., Hruban, R. H., Karchin, R., Papadopoulos, N., Parmigiani, G., Vogelstein, B., Velculescu, V. E. and Kinzler, K. W. (2008). Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science 321 1801–1806.
  • Kraft, P. (2006). Efficient two-stage genome-wide association designs based on false positive report probabilities. Pac. Symp. Biocomput. 523–534.
  • Müller, P. and Quintana, F. A. (2004). Nonparametric Bayesian data analysis. Statist. Sci. 19 95–110.
  • Parmigiani, G., Lin, J., Boca, S. M., Sjöblom, T., Jones, S., Wood, L. D., Parsons, D. W., Barber, T., Buckhaults, P., Markowitz, S. D., Park, B. H., Bachman, K. E., Papadopoulos, N., Vogelstein, B., Kinzler, K. W. and Velculescu, V. E. (2007a). Response to comments on “The consensus coding sequences of human breast and colorectal cancers.” Science 317 1500.
  • Parmigiani, G., Lin, J., Boca, S., Sjöblom, T., Kinzler, K., Velculescu, V. and Vogelstein, B. (2007b). Statistical methods for the analysis of cancer genome sequencing. Working Paper 126. Johns Hopkins Univ., Dept. Biostatistics Working Papers. Available at http://www.bepress.com/jhubiostat/paper126.
  • Parmigiani, G., Boca, S., Lin, J., Kinzler, K. and Velculescu, V. (2009). Design and analysis issues in genome-wide somatic mutation studies of cancer. Genomics 93 17–21.
  • Parsons, D. W., Jones, S., Zhang, X., Lin, J. C. H., Leary, R. J., Angenendt, P., Mankoo, P., Carter, H., Siu, I. M., Gallia, G. L., Olivi, A., McLendon, R., Rasheed, B. A., Keir, S., Nikolskaya, T., Nikolsky, Y., Busam, D. A., Tekleab, H., Diaz, L. A., Hartigan, J., Smith, D. R., Strausberg, R. L., Marie, S. K. N., Shinjo, S. M. O., Yan, H., Riggins, G. J., Bigner, D. D., Karchin, R., Papadopoulos, N., Parmigiani, G., Vogelstein, B., Velculescu, V. E. and Kinzler, K. W. (2008). An integrated genomic analysis of human glioblastoma multiforme. Science 321 1807–1812.
  • Rubin, A. F. and Green, P. (2007). Comment on “The consensus coding sequences of human breast and colorectal cancers.” Science 317 1500.
  • Satagopan, J. M. and Elston, R. C. (2003). Optimal two-stage genotyping in population-based association studies. Genet. Epidemiol. 25 149–157.
  • Satagopan, J. M., Venkatraman, E. S. and Begg, C. B. (2004). Two-stage designs for gene-disease association studies with sample size constraints. Biometrics 60 589–597.
  • Satagopan, J. M., Verbel, D. A., Venkatraman, E. S., Offit, K. E. and Begg, C. B. (2002). Two-stage designs for gene-disease association studies. Biometrics 58 163–170.
  • Sjöblom, T., Jones, S., Wood, L. D., Parsons, D. W., Lin, J., Barber, T. D., Mandelker, D., Leary, R. J., Ptak, J., Silliman, N., Szabo, S., Buckhaults, P., Farrell, C., Meeh, P., Markowitz, S. D., Willis, J., Dawson, D., Willson, J. K. V., Gazdar, A. F., Hartigan, J., Wu, L., Liu, C., Parmigiani, G., Park, B. H., Bachman, K. E., Papadopoulos, N., Vogelstein, B., Kinzler, K. W. and Velculescu, V. E. (2006). The consensus coding sequences of human breast and colorectal cancers. Science 314 268–274.
  • Skol, A. D., Scott, L. J., Abecasis, G. R. and Boehnke, M. (2006). Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat. Genet. 38 209–213.
  • Smith, B. (2007). boa: An R package for MCMC output convergence assessment and posterior inference. Journal of Statistical Software 21 1–37.
  • Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 479–498.
  • Stratton, M. R., Campbell, P. J. and Futreal, P. A. (2009). The cancer genome. Nature 458 719–724.
  • Venturini, S., Dominici, F. and Parmigiani, G. (2008). Gamma shape mixtures for heavy-tailed distributions. Ann. Appl. Statist. 2 756–776.
  • Wang, H. and Stram, D. O. (2006). Optimal two-stage genome-wide association designs based on false discovery rate. Comput. Statist. Data Anal. 51 457–465.
  • Wood, L. D., Parsons, D. W., Jones, S., Lin, J., Sjöblom, T., Leary, R. J., Shen, D., Boca, S. M., Barber, T., Ptak, J., Silliman, N., Szabo, S., Dezso, Z., Ustyanksky, V., Nikolskaya, T., Nikolsky, Y., Karchin, R., Wilson, P. A., Kaminker, J. S., Zhang, Z., Croshaw, R., Willis, J., Dawson, D., Shipitsin, M., Willson, J. K. V., Sukumar, S., Polyak, K., Park, B. H., Pethiyagoda, C. L., Pant, P. V. K., Ballinger, D. G., Sparks, A. B., Hartigan, J., Smith, D. R., Suh, E., Papadopoulos, N., Buckhaults, P., Markowitz, S. D., Parmigiani, G., Kinzler, K. W., Velculescu, V. E. and Vogelstein, B. (2007). The genomic landscapes of human breast and colorectal cancers. Science 318 1108–1113.