The Annals of Applied Statistics

Detecting mutations in mixed sample sequencing data using empirical Bayes

Omkar Muralidharan, Georges Natsoulis, John Bell, Hanlee Ji, and Nancy R. Zhang

Abstract

We develop statistically based methods to detect single nucleotide DNA mutations in next-generation sequencing data. Sequencing generates counts of the number of times each base was observed at hundreds of thousands to billions of genome positions in each sample. Using these counts to detect mutations is challenging because mutations may have very low prevalence and sequencing error rates vary dramatically by genome position. The discreteness of sequencing data also creates a difficult multiple testing problem: current false discovery rate methods are designed for continuous data, and work poorly, if at all, on discrete data.

We show that a simple randomization technique lets us use continuous false discovery rate methods on discrete data. Our approach is a useful way to estimate false discovery rates for any collection of discrete test statistics, and is hence not limited to sequencing data. We then use an empirical Bayes model to capture different sources of variation in sequencing error rates. The resulting method outperforms existing detection approaches on example data sets.
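As a rough illustration of how randomization can make discrete test statistics usable with continuous false discovery rate methods, the sketch below uses the standard randomized p-value construction: the observed point mass is split with an independent uniform draw, so the p-value is exactly uniform under the null. The abstract does not spell out the authors' construction, so treat this as an assumption rather than the paper's method. The binomial null, the known error rate `p0`, and all function names are hypothetical choices made for the example; the Benjamini-Hochberg step-up rule stands in for any continuous FDR procedure.

```python
import math
import random


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def randomized_pvalue(x, n, p0, rng):
    """Right-tailed randomized p-value for an observed count x.

    Assumes (for illustration only) that the null is X ~ Binomial(n, p0).
    Returns P(X > x) + U * P(X = x) with U ~ Uniform(0, 1), which is
    uniformly distributed on (0, 1) when the null holds.
    """
    tail_above = sum(binom_pmf(k, n, p0) for k in range(x + 1, n + 1))
    return tail_above + rng.random() * binom_pmf(x, n, p0)


def benjamini_hochberg(pvals, q):
    """Indices rejected by the Benjamini-Hochberg step-up rule at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            cutoff = rank  # largest rank whose p-value passes its threshold
    return sorted(order[:cutoff])


def simulate_counts(n_positions, depth, rate, rng):
    """Draw error counts at each position as sums of Bernoulli(rate) reads."""
    return [
        sum(rng.random() < rate for _ in range(depth))
        for _ in range(n_positions)
    ]


if __name__ == "__main__":
    rng = random.Random(1)
    depth, p0 = 50, 0.02  # hypothetical read depth and error rate
    counts = simulate_counts(500, depth, p0, rng)
    pvals = [randomized_pvalue(x, depth, p0, rng) for x in counts]
    print("mean null p-value:", sum(pvals) / len(pvals))
    print("rejections at q = 0.1:", benjamini_hochberg(pvals, 0.1))
```

Without the uniform draw, the p-values at a depth of 50 can take only a handful of distinct values, which is the kind of discreteness the abstract says causes trouble for continuous methods. In real sequencing data the error rate varies by position, so a single fixed `p0` is a simplification.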

Article information

Source
Ann. Appl. Stat. Volume 6, Number 3 (2012), 1047-1067.

Dates
First available: 31 August 2012

Permanent link to this document
http://projecteuclid.org/euclid.aoas/1346418573

Digital Object Identifier
doi:10.1214/12-AOAS538

Zentralblatt MATH identifier
06096521

Mathematical Reviews number (MathSciNet)
MR3012520

Citation

Muralidharan, Omkar; Natsoulis, Georges; Bell, John; Ji, Hanlee; Zhang, Nancy R. Detecting mutations in mixed sample sequencing data using empirical Bayes. The Annals of Applied Statistics 6 (2012), no. 3, 1047–1067. doi:10.1214/12-AOAS538. http://projecteuclid.org/euclid.aoas/1346418573.

