Bernoulli, Volume 14, Number 1 (2008), 180–206.

The adjusted Viterbi training for hidden Markov models

Jüri Lember and Alexey Koloydenko



The EM procedure is a principal tool for parameter estimation in hidden Markov models (HMMs). In practice, however, applications often replace EM with Viterbi extraction, or training (VT). VT is computationally less intensive and more stable, and has greater intuitive appeal, but VT estimation is biased and does not satisfy the following fixed-point property: hypothetically, given an infinitely large sample and initialized to the true parameters, VT will generally move away from these initial values. We propose adjusted Viterbi training (VA), a new method that restores the fixed-point property and thus alleviates the overall imprecision of the VT estimators, while preserving the computational advantages of the baseline VT algorithm. Simulations elsewhere have shown that VA appreciably improves the precision of estimation in both the special case of mixture models and more general HMMs. However, being entirely analytic, the VA correction relies on infinite Viterbi alignments and the associated limiting probability distributions. While these are explicit in the mixture case, the existence of the limiting measures is not obvious for more general HMMs. This paper proves that, under certain mild conditions, the required limiting distributions for general HMMs do exist.
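To make the VT bias concrete, here is a minimal sketch (our own illustration, not from the paper; all names and the setup are hypothetical) of Viterbi training in the simplest special case mentioned above, a two-component Gaussian mixture with unit variances and equal weights. Each observation is hard-assigned to its most likely component, and each mean is re-estimated as the average of its assigned points. Initialized at the true means, a single step already drifts away from them, which is exactly the missing fixed-point property:

```python
import numpy as np

def viterbi_training_step(x, mu):
    """One VT step: hard-assign each point to its nearest mean (the Viterbi
    assignment under unit-variance Gaussians), then re-estimate each mean
    as the average of its assigned points."""
    z = np.abs(x[:, None] - mu[None, :]).argmin(axis=1)
    return np.array([x[z == k].mean() if np.any(z == k) else mu[k]
                     for k in range(len(mu))])

rng = np.random.default_rng(0)
true_mu = np.array([-1.0, 1.0])

# Large sample from the true equal-weight mixture N(-1, 1) and N(1, 1).
comp = rng.integers(0, 2, size=200_000)
x = rng.normal(loc=true_mu[comp], scale=1.0)

# Initialize VT at the true parameters; one step already moves away from them,
# since hard assignment truncates each component at 0 and biases the means
# outward (to roughly +/-1.17 here rather than +/-1.0).
mu1 = viterbi_training_step(x, true_mu)
print(mu1)
```

Adjusted Viterbi training, as the abstract describes, computes such a bias analytically, via the limiting distributions of the Viterbi alignments, and corrects the VT estimators accordingly.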

Article information


First available in Project Euclid: 8 February 2008


Keywords: Baum–Welch; bias; computational efficiency; consistency; EM; hidden Markov models; maximum likelihood; parameter estimation; Viterbi extraction; Viterbi training


Lember, Jüri; Koloydenko, Alexey. The adjusted Viterbi training for hidden Markov models. Bernoulli 14 (2008), no. 1, 180–206. doi:10.3150/07-BEJ105.


