Divergence rates of Markov order estimators and their application to statistical estimation of stationary ergodic processes

Zsolt Talata


Stationary ergodic processes with finite alphabets are estimated by finite memory processes from a sample, an $n$-length realization of the process, where the memory depth of the estimator process is also estimated from the sample using penalized maximum likelihood (PML). Under some assumptions on the continuity rate and the assumption of non-nullness, a rate of convergence in $\bar{d}$-distance is obtained, with explicit constants. The result requires an analysis of the divergence of PML Markov order estimators for not necessarily finite memory processes. This divergence problem is investigated in more generality for three information criteria: the Bayesian information criterion with generalized penalty term yielding the PML, and the normalized maximum likelihood and the Krichevsky–Trofimov code lengths. Lower and upper bounds on the estimated order are obtained. The notion of consistent Markov order estimation is generalized for infinite memory processes using the concept of oracle order estimates, and generalized consistency of the PML Markov order estimator is presented.
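
To make the penalized maximum likelihood (PML) order estimator in the abstract concrete, the following is a minimal sketch, not the paper's exact definition. It assumes the usual form of the criterion: the negative maximized log-likelihood of an order-$k$ Markov model plus a penalty proportional to the number of free parameters, $(|A|-1)|A|^k$, with the classical BIC weight $\frac{1}{2}\log n$ as default. The function names `max_log_likelihood` and `pml_order`, and the `max_order` cap on candidate orders, are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def max_log_likelihood(x, k):
    """Maximized log-likelihood of x[k:] given x[:k] under an order-k Markov model."""
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(k, len(x)):
        context = tuple(x[i - k:i])  # the k symbols preceding position i
        context_counts[context] += 1
        transition_counts[(context, x[i])] += 1
    # Plug in the empirical conditional probabilities N(context, a) / N(context).
    return sum(
        count * math.log(count / context_counts[context])
        for (context, _symbol), count in transition_counts.items()
    )


def pml_order(x, alphabet_size, max_order, penalty=None):
    """Return the order k in 0..max_order minimizing the penalized score.

    Assumed score: -log ML_k(x) + (|A| - 1) * |A|**k * penalty.
    The default penalty (1/2) * log(n) is the classical BIC choice; other
    penalty weights can be passed in explicitly.
    """
    n = len(x)
    if penalty is None:
        penalty = 0.5 * math.log(n)
    best_order, best_score = 0, math.inf
    for k in range(max_order + 1):
        free_parameters = (alphabet_size - 1) * alphabet_size ** k
        score = -max_log_likelihood(x, k) + free_parameters * penalty
        if score < best_score:  # strict inequality keeps the smallest minimizer
            best_order, best_score = k, score
    return best_order
```

For instance, on a strictly alternating binary sample such as `[0, 1] * 200`, calling `pml_order(sample, alphabet_size=2, max_order=3)` selects order 1: the order-1 model already fits the sample perfectly, and higher orders only add penalty. For a process with infinite memory no finite order is "true", which is why the abstract studies how fast the estimated order grows with $n$.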

Article information

Bernoulli, Volume 19, Number 3 (2013), 846-885.

First available in Project Euclid: 26 June 2013

Keywords: finite memory estimator; infinite memory; information criteria; Markov approximation; minimum description length; oracle inequalities; penalized maximum likelihood; rate of convergence


Talata, Zsolt. Divergence rates of Markov order estimators and their application to statistical estimation of stationary ergodic processes. Bernoulli 19 (2013), no. 3, 846–885. doi:10.3150/12-BEJ468.

