Statistical Science

Scalable Genomics with R and Bioconductor

Michael Lawrence and Martin Morgan

Full-text: Open access

Abstract

This paper reviews strategies for solving problems encountered when analyzing large genomic data sets and describes the implementation of those strategies in R by packages from the Bioconductor project. We treat the scalable processing, summarization and visualization of big genomic data. The general ideas are well established and include restrictive queries, compression, iteration and parallel computing. We demonstrate the strategies by applying Bioconductor packages to the detection and analysis of genetic variants from a whole genome sequencing experiment.

Article information

Source
Statist. Sci. Volume 29, Number 2 (2014), 214-226.

Dates
First available in Project Euclid: 18 August 2014

Permanent link to this document
https://projecteuclid.org/euclid.ss/1408368572

Digital Object Identifier
doi:10.1214/14-STS476

Mathematical Reviews number (MathSciNet)
MR3264533

Zentralblatt MATH identifier
1332.62009

Keywords
R; Bioconductor; genomics; biology; big data

Citation

Lawrence, Michael; Morgan, Martin. Scalable Genomics with R and Bioconductor. Statist. Sci. 29 (2014), no. 2, 214–226. doi:10.1214/14-STS476. https://projecteuclid.org/euclid.ss/1408368572


