The Annals of Applied Statistics

Distributions associated with general runs and patterns in hidden Markov models

John A. D. Aston and Donald E. K. Martin

Full-text: Open access


This paper gives a method for computing distributions associated with patterns in the state sequence of a hidden Markov model, conditional on observing all or part of the observation sequence. Probabilities are computed for very general classes of patterns (competing patterns and generalized later patterns), so the theory includes as special cases results for a large class of widely applicable problems. The unobserved state sequence is assumed to be Markovian with a general order of dependence. An auxiliary Markov chain is associated with the state sequence and is used to simplify the computations. Two examples illustrate the methodology. The first mainly illustrates the basic steps in applying the theory; the second is a more detailed application to DNA sequences and shows that the methods can be adapted to include restrictions related to biological knowledge.
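The auxiliary-chain idea in the abstract can be sketched for its simplest special case. The Python below is an illustrative sketch, not the authors' algorithm: it assumes a first-order hidden chain and a single pattern (a run of k consecutive visits to one state), and leaves out competing patterns, generalized later patterns and higher-order dependence. The function name `prob_run_given_obs` and all parameter values are hypothetical. The hidden state is augmented with the current run length, and an ordinary forward recursion is run over the augmented chain, with run length k treated as absorbing.

```python
import numpy as np

def prob_run_given_obs(init, trans, emit, obs, target, k):
    """P(hidden sequence contains a run of k consecutive `target` states | obs).

    Assumes a first-order HMM, k >= 1, and observations with positive likelihood.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    n_states = trans.shape[0]
    # alpha[s, r]: weight of (current state s, current run length r).
    # r == k is absorbing and means the run has already occurred.
    alpha = np.zeros((n_states, k + 1))
    for s in range(n_states):
        r = 1 if s == target else 0
        alpha[s, r] += init[s] * emit[s, obs[0]]
    for y in obs[1:]:
        new = np.zeros_like(alpha)
        for s_prev in range(n_states):
            for r_prev in range(k + 1):
                mass = alpha[s_prev, r_prev]
                if mass == 0.0:
                    continue
                for s in range(n_states):
                    if r_prev == k:
                        r = k            # pattern already seen: stay absorbed
                    elif s == target:
                        r = r_prev + 1   # run continues
                    else:
                        r = 0            # run is broken
                    new[s, r] += mass * trans[s_prev, s] * emit[s, y]
        # Rescale to avoid underflow; the final ratio is unaffected.
        alpha = new / new.sum()
    return alpha[:, k].sum() / alpha.sum()

# Toy two-state example (all numbers are made up for illustration).
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.2, 0.8]])
emit = np.array([[0.9, 0.1], [0.3, 0.7]])   # emit[state, symbol]
obs = [0, 1, 1, 0, 1]
p = prob_run_given_obs(init, trans, emit, obs, target=1, k=2)
print(f"P(run of two 1s in the hidden states | obs) = {p:.4f}")
```

The augmented chain has (number of states) × (k + 1) states, so the cost grows linearly in the sequence length rather than exponentially, which is the point of imbedding the pattern in a Markov chain.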

Article information

Ann. Appl. Stat. Volume 1, Number 2 (2007), 585-611.

First available in Project Euclid: 30 November 2007

Keywords: Competing patterns; CpG islands; finite Markov chain imbedding; generalized later patterns; higher-order hidden Markov models; sooner/later waiting time distributions


Aston, John A. D.; Martin, Donald E. K. Distributions associated with general runs and patterns in hidden Markov models. Ann. Appl. Stat. 1 (2007), no. 2, 585–611. doi:10.1214/07-AOAS125.



