Source: Ann. Appl. Stat.
Volume 4, Number 4
Large-scale statistical analysis of data sets associated with
genome sequences plays an important role in modern biology. A
key component of such statistical analyses is the computation of
p-values and confidence bounds for statistics
defined on the genome. Currently such computation is commonly
achieved through ad hoc simulation measures. The method of
randomization, which is at the heart of these simulation
procedures, can significantly affect the resulting statistical
conclusions. Most simulation schemes introduce a variety of
hidden assumptions regarding the nature of the randomness in the
data, resulting in a failure to capture biologically meaningful
relationships. To address the need for a method of assessing the
significance of observations within large-scale genomic studies,
where there often exists a complex dependency structure between
observations, we propose a unified solution built upon a data
subsampling approach. We introduce a piecewise stationary model
for genome sequences and show that the subsampling approach
gives correct answers under this model. We illustrate the method
on three simulation studies and two real data examples.
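
As an illustration of the general idea of subsampling for dependent data, the following Python sketch builds a confidence interval for the mean of a sequence from the means of all contiguous blocks of a fixed length. It is a generic block subsampling scheme, not the procedure developed in the paper: it assumes a single stationary segment rather than the piecewise stationary model, a square-root-of-n convergence rate, and a user-chosen block length. The function name `block_subsample_ci` and its parameters are hypothetical.

```python
import numpy as np


def block_subsample_ci(x, block_len, alpha=0.05):
    """Block subsampling confidence interval for the mean of a sequence.

    Assumptions (not taken from the paper): one stationary segment,
    sqrt(n) convergence rate, block length supplied by the caller.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if not 1 <= block_len < n:
        raise ValueError("block_len must satisfy 1 <= block_len < len(x)")

    theta_n = x.mean()  # estimate computed on the full sequence

    # Means of all n - block_len + 1 contiguous blocks, via cumulative sums.
    csum = np.concatenate(([0.0], np.cumsum(x)))
    block_means = (csum[block_len:] - csum[:-block_len]) / block_len

    # Rescaled deviations of block estimates from the full-sample estimate;
    # their empirical distribution stands in for that of
    # sqrt(n) * (theta_n - theta).
    roots = np.sqrt(block_len) * (block_means - theta_n)
    lo_q, hi_q = np.quantile(roots, [alpha / 2, 1 - alpha / 2])

    lower = theta_n - hi_q / np.sqrt(n)
    upper = theta_n - lo_q / np.sqrt(n)
    return float(lower), float(upper)


# Usage on a simulated AR(1) sequence with true mean 0 (illustrative data only).
rng = np.random.default_rng(0)
noise = rng.normal(size=2000)
series = np.empty(2000)
series[0] = noise[0]
for t in range(1, 2000):
    series[t] = 0.5 * series[t - 1] + noise[t]

lower, upper = block_subsample_ci(series, block_len=100)
print(f"95% subsampling interval for the mean: [{lower:.3f}, {upper:.3f}]")
```

Because the blocks are contiguous, each one preserves the local dependence between neighbouring observations, which is what a scheme that shuffles positions independently would destroy. The block length here is fixed by hand; choosing it well is a separate problem that this sketch does not address.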