Statistical Science

Statistical methods for DNA sequence segmentation

Jerome V. Braun and Hans-Georg Müller


This article examines methods, issues and controversies that have arisen over the last decade in the effort to organize sequences of DNA base information into homogeneous segments. An array of different models and techniques has been considered and applied. We demonstrate that most approaches can be embedded into a suitable version of the multiple change-point problem, and we review the various methods in this light. We also propose and discuss a promising local segmentation method, namely, the application of split local polynomial fitting. The genome of bacteriophage $\lambda$ serves as an example sequence throughout the paper.
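As a rough illustration of what the change-point formulation means for base sequences, the following sketch scans a sequence for the single split that best separates two segments of differing G+C content. It is not the authors' method (in particular, it is not split local polynomial fitting). It assumes an independent Bernoulli model for the G+C indicator, and the function names `bernoulli_loglik` and `best_single_change_point` are hypothetical, introduced here only for illustration.

```python
import math


def bernoulli_loglik(successes, n):
    """Maximized Bernoulli log-likelihood for `successes` ones in `n` trials."""
    if n == 0 or successes == 0 or successes == n:
        return 0.0  # degenerate estimate p = 0 or p = 1 gives log-likelihood 0
    p = successes / n
    return successes * math.log(p) + (n - successes) * math.log(1 - p)


def best_single_change_point(seq, bases="GC"):
    """Return (k, gain) for the best split of seq into seq[:k] and seq[k:].

    `gain` is the increase in maximized log-likelihood of the two-segment
    model over the one-segment model. Returns (None, 0.0) when no split
    improves on a single homogeneous segment.
    """
    # Assumption: each position is coded 1 if the base is G or C, else 0.
    x = [1 if b in bases else 0 for b in seq.upper()]
    n, total = len(x), sum(x)
    base = bernoulli_loglik(total, n)
    best_k, best_gain, left = None, 0.0, 0
    for k in range(1, n):
        left += x[k - 1]  # number of ones in the first k positions
        gain = (bernoulli_loglik(left, k)
                + bernoulli_loglik(total - left, n - k)
                - base)
        if gain > best_gain:
            best_k, best_gain = k, gain
    return best_k, best_gain


if __name__ == "__main__":
    # Toy sequence: an A+T-rich stretch followed by a G+C-rich stretch.
    toy = "AT" * 10 + "GC" * 10
    print(best_single_change_point(toy))
```

Applying such a scan recursively to each resulting segment gives a simple binary segmentation heuristic for the multiple change-point problem. A practical analysis would also need a criterion for deciding when the gain is large enough to accept a split.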

Article information

Statist. Sci. Volume 13, Number 2 (1998), 142-162.

First available in Project Euclid: 9 August 2002


Braun, Jerome V.; Müller, Hans-Georg. Statistical methods for DNA sequence segmentation. Statist. Sci. 13 (1998), no. 2, 142--162. doi:10.1214/ss/1028905933.

