Annals of Applied Statistics

Subsampling methods for genomic inference

Peter J. Bickel, Nathan Boley, James B. Brown, Haiyan Huang, and Nancy R. Zhang

Full-text: Open access


Large-scale statistical analysis of data sets associated with genome sequences plays an important role in modern biology. A key component of such statistical analyses is the computation of p-values and confidence bounds for statistics defined on the genome. Currently such computation is commonly achieved through ad hoc simulation measures. The method of randomization, which is at the heart of these simulation procedures, can significantly affect the resulting statistical conclusions. Most simulation schemes introduce a variety of hidden assumptions regarding the nature of the randomness in the data, resulting in a failure to capture biologically meaningful relationships. To address the need for a method of assessing the significance of observations within large-scale genomic studies, where observations often have a complex dependency structure, we propose a unified solution built upon a data subsampling approach. We propose a piecewise stationary model for genome sequences and show that the subsampling approach gives correct answers under this model. We illustrate the method with three simulation studies and two real data examples.
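As a rough illustration of the subsampling idea named in the abstract, the sketch below computes a confidence interval for the overlap between two binary feature tracks by recomputing the statistic on every contiguous block. It is not the authors' Genome Structure Correction procedure: it is the generic subsampling construction in the style of Politis and Romano (1994), it assumes a single stationary stretch of sequence and omits the segmentation step of the piecewise stationary model, and the function names, block length, and toy data are all hypothetical.

```python
import math
import random


def overlap_fraction(x, y):
    """Fraction of positions covered by both 0/1 feature tracks."""
    return sum(a & b for a, b in zip(x, y)) / len(x)


def subsampling_ci(x, y, block_len, alpha=0.05):
    """Subsampling confidence interval for the overlap fraction.

    Generic subsampling over all contiguous blocks of length block_len.
    Assumption: the tracks form one stationary stretch (no segmentation).
    """
    n = len(x)
    t_full = overlap_fraction(x, y)
    # Centered and scaled block statistics approximate the sampling
    # distribution of sqrt(n) * (t_full - true value).
    roots = []
    for start in range(n - block_len + 1):
        end = start + block_len
        t_block = overlap_fraction(x[start:end], y[start:end])
        roots.append(math.sqrt(block_len) * (t_block - t_full))
    roots.sort()
    last = len(roots) - 1
    q_lo = roots[math.floor(alpha / 2 * last)]
    q_hi = roots[math.ceil((1 - alpha / 2) * last)]
    scale = math.sqrt(n)
    # Invert the root: upper quantile gives the lower bound and vice versa.
    return t_full - q_hi / scale, t_full - q_lo / scale


def markov_track(n, p_stay, rng):
    """Toy 0/1 track with runs: keep the previous state with prob. p_stay."""
    state = rng.randint(0, 1)
    track = []
    for _ in range(n):
        if rng.random() > p_stay:
            state = 1 - state
        track.append(state)
    return track


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy example is repeatable
    x = markov_track(2000, 0.95, rng)
    y = markov_track(2000, 0.95, rng)
    print(overlap_fraction(x, y), subsampling_ci(x, y, block_len=100))
```

Because each block is a contiguous stretch of the original tracks, local dependence between neighboring positions is preserved inside the block, which is what a position-by-position shuffle would destroy. The block length is a tuning choice in this sketch and would need to be large relative to the range of dependence but small relative to the sequence length.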

Article information

Ann. Appl. Stat., Volume 4, Number 4 (2010), 1660-1697.

First available in Project Euclid: 4 January 2011


Keywords: Genome Structure Correction (GSC); subsampling; piecewise stationary model; segmentation-block bootstrap; feature overlap


Bickel, Peter J.; Boley, Nathan; Brown, James B.; Huang, Haiyan; Zhang, Nancy R. Subsampling methods for genomic inference. Ann. Appl. Stat. 4 (2010), no. 4, 1660–1697. doi:10.1214/10-AOAS363.




Supplemental materials

  • Supplementary material: Some theorems in subsampling methods for genomic inference. The supplementary material provides proofs of the theorems presented in the main text.