Statistical methods for DNA sequence segmentation



Statistical Science

Jerome V. Braun and Hans-Georg Müller

Source: Statist. Sci. Volume 13, Number 2 (1998), 142-162.

Abstract

This article examines methods, issues and controversies that have arisen over the last decade in the effort to organize sequences of DNA base information into homogeneous segments. An array of different models and techniques have been considered and applied. We demonstrate that most approaches can be embedded into a suitable version of the multiple change-point problem, and we review the various methods in this light. We also propose and discuss a promising local segmentation method, namely, the application of split local polynomial fitting. The genome of bacteriophage $\lambda$ serves as an example sequence throughout the paper.

Keywords: Statistical genetics; change-point; hidden Markov chain; patchiness; quasideviance; split local polynomials; chromosome banding; bacteriophage $\lambda$

Full-text: Open access

Links and Identifiers

Permanent link to this document: http://projecteuclid.org/euclid.ss/1028905933
Digital Object Identifier: doi:10.1214/ss/1028905933
Mathematical Reviews number (MathSciNet): MR1661506
Zentralblatt MATH identifier: 0960.62121

2009 © Institute of Mathematical Statistics