The Annals of Applied Statistics

An empirical Bayes testing procedure for detecting variants in analysis of next generation sequencing data

Zhigen Zhao, Wei Wang, and Zhi Wei

Abstract

Because of its decreasing cost and high digital resolution, next-generation sequencing (NGS) is expected to replace traditional hybridization-based microarray technology. In genetic studies, the first step in analyzing NGS data is often to identify genomic variants among the sequenced samples. Several statistical models and tests have been developed for variant calling in NGS studies. The existing approaches, however, are based on either conventional Bayesian or frequentist methods, which cannot address multiplicity and testing efficiency simultaneously. In this paper, we derive an optimal empirical Bayes testing procedure for detecting variants in NGS studies. We use the empirical Bayes technique to exploit information shared across the many testing sites in NGS data. We prove that our testing procedure is valid and optimal in the sense that it rejects the maximum number of nonnulls while controlling the Bayesian false discovery rate at a given nominal level. We show through both simulation studies and real data analysis that the procedure can greatly improve testing efficiency over existing frequentist approaches, which fail to pool and use information across the multiple testing sites.
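
The optimality criterion in the abstract (reject as many sites as possible while holding the Bayesian false discovery rate at a nominal level) can be illustrated with a generic step-up rule on posterior null probabilities. The sketch below is not the procedure derived in the paper. It assumes that a posterior probability of being null has already been estimated for each site, and the function name and the example values are hypothetical.

```python
from typing import List


def reject_by_posterior_null(post_null: List[float], alpha: float) -> List[int]:
    """Return the indices of rejected sites, most significant first.

    post_null[i] is an assumed, already estimated posterior probability
    that site i is null (no variant). Sites are taken in increasing order
    of this probability and rejected for as long as the running average
    of the rejected probabilities stays at or below alpha. Because the
    values are sorted, the running average never decreases, so stopping
    at the first violation yields the largest admissible rejection set.
    """
    order = sorted(range(len(post_null)), key=lambda i: post_null[i])
    rejected: List[int] = []
    running_sum = 0.0
    for k, i in enumerate(order, start=1):
        running_sum += post_null[i]
        if running_sum / k > alpha:
            break
        rejected.append(i)
    return rejected


if __name__ == "__main__":
    # Hypothetical posterior null probabilities for five sites.
    example = [0.01, 0.9, 0.03, 0.2, 0.6]
    print(reject_by_posterior_null(example, alpha=0.1))
```

The running average of the rejected posterior null probabilities serves as an estimate of the Bayesian false discovery rate of the rejection set, which is why the rule stops as soon as that average exceeds the nominal level.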

Article information

Source
Ann. Appl. Stat. Volume 7, Number 4 (2013), 2229-2248.

Dates
First available in Project Euclid: 23 December 2013

Permanent link to this document
http://projecteuclid.org/euclid.aoas/1387823317

Digital Object Identifier
doi:10.1214/13-AOAS660

Mathematical Reviews number (MathSciNet)
MR3161720

Zentralblatt MATH identifier
1283.62011

Keywords
Variant call; next-generation sequencing; Bayesian FDR; multiplicity control; optimality

Citation

Zhao, Zhigen; Wang, Wei; Wei, Zhi. An empirical Bayes testing procedure for detecting variants in analysis of next generation sequencing data. Ann. Appl. Stat. 7 (2013), no. 4, 2229–2248. doi:10.1214/13-AOAS660. http://projecteuclid.org/euclid.aoas/1387823317.

Supplemental materials

  • Supplementary material: Supplement to “An empirical Bayes testing procedure for detecting variants in analysis of next generation sequencing data”. This file contains the technical proof of the theorems.