Penalized estimate of the number of states in Gaussian linear AR with Markov regime

We deal with the estimation of the number of regimes in a linear Gaussian autoregressive process with Markov regime (AR-MR). Estimating the number of regimes in this type of series amounts to determining the number of states of the hidden Markov chain controlling the process. We propose a method based on penalized maximum likelihood estimation and establish its strong (almost sure) consistency without assuming a priori bounds on the number of states.


Introduction
Our aim in this article is to establish consistency criteria for the penalized maximum likelihood method for estimating the number of states of the hidden Markov chain in an AR-MR process. We show strong consistency of an estimator of the number of states in an autoregressive process with Markov regime when the regression functions are linear and the noise is Gaussian.
Autoregressive processes with Markov regime can be viewed as a combination of hidden Markov models (HMM) with threshold regression models. They were introduced in an econometric context by Goldfeld and Quandt [12] and have been quite popular in the literature ever since Hamilton [13] employed them in the analysis of the gross domestic product of the USA with two regimes: one of contraction and one of expansion.
When the number of states of the hidden Markov chain is known a priori, the estimation problems can be solved, in principle, with techniques based on maximum likelihood estimation (see McDonald and Zucchini [17] and Cappé et al. [2]). But in many applications a key problem is to determine the number of states so that the data are adequately described while a compromise is maintained between goodness of fit and the ability of the model to generalize. The problem of estimating the number of states of the hidden Markov chain in an AR-MR process is a typical example of a nested family of models: a model with m states can also be seen as a model with m+1 states. The problem of model selection is thus essentially that of determining the smallest model containing the distribution that generated the data. In many instances the estimation of the model will depend on how identifiability affects the model and not on the specification of the correct model.
A first approach to determining the dimension of the model is a statistical test based on the likelihood ratio (see Dacunha-Castelle and Duflo [5], p. 227). For the estimation of the number of states of a hidden Markov chain, the likelihood ratio test fails because the regularity assumptions do not hold. In particular, the model is not identifiable, some parameters do not appear under the null hypothesis, and the information matrix is singular. As a result the asymptotic distribution of the likelihood ratio is not χ². As an alternative, one can construct generalized likelihood ratio tests that hold under non-standard conditions. For the problem of determining the number of states in AR-MR processes, Hansen [14] has proposed a test that works under loss of identifiability, but implementing it requires computing p-values in an approximate way; this leads to computationally heavy calculations producing approximate p-values that underestimate the real ones. Garcia [8] has advanced computationally more attractive alternatives which, however, lack the technical rigor of Hansen's approach.
For HMM models the likelihood ratio statistic is not bounded. Gassiat and Keribin [11] studied it and showed that it diverges to infinity. The growth rate of the likelihood ratio as the number of parameters increases is related to the complexity of the model. This leads us to consider penalized estimators of the likelihood function that compensate for the lack of comparability between models of different dimensions. The specification of small penalties depends on determining the rate of divergence to infinity of the likelihood ratio, but as far as we know this is still an open problem for HMM models where the data take values in infinite sets.
In general, penalized likelihood criteria are obtained through approximations to the Kullback-Leibler divergence. Among others we find the very popular Akaike information criterion (AIC) and the Bayesian information criterion (BIC). These have been used by several authors in applications of HMM models; however, as McDonald and Zucchini [17] point out, those authors have made no reference to their validity.
We shall distinguish two cases, according to whether or not the observed variables take values in a finite set. For HMM models with data belonging to a finite set much work has been done, starting with Finesso's presentation of the problem [7], where he establishes the strong consistency of the penalized estimator of the number of states assuming that the actual number of states belongs to a bounded set of integers. Liu and Narayan [16], under the same restriction, introduce a strongly consistent estimator based on statistical mixtures of the Krichevsky-Trofimov type, which allow the likelihood to be normalized so as to control its growth as the number of states increases. Studying the efficiency of their estimator, they show that the probability of underestimation decreases at an exponential rate in the sample size, whereas the probability of overestimation does not exceed a third-degree polynomial in the sample size. Building on this work, Gassiat and Boucheron [10] introduced considerable advances: they proved the strong consistency of the penalized estimator without assuming a priori upper bounds on the number of states; in addition, they showed that the probabilities of underestimation and of overestimation both fall at an exponential rate in the sample size. For AR-MR processes with observations in a finite set, the techniques introduced by Gassiat and Boucheron were used by Chambaz and Matias [4] to simultaneously establish consistency for the number of states of the hidden chain and the memory of the observed process.
For the non-finite case in HMM models, Rydén [19] has shown consistency of a penalized likelihood estimator which in the limit does not underestimate the number of states. Dortet-Bernadet [6] has shown that under certain regularity conditions Rydén's estimator is indeed consistent. Gassiat [9], studying a penalized estimator of the marginal likelihood, concludes that it is consistent in probability for the actual number of states. This technique was extended by Olteanu and Rynkiewicz [18] to select the number of regression functions in processes where the regime is controlled by an independent sequence. In the same work the authors indicate that the penalized marginal likelihood criterion cannot be applied directly to AR-MR processes. Smith et al. [21] have advanced a new information criterion to approximate the Kullback-Leibler divergence and select the number of states and the variables in AR-MR models. This criterion imposes a penalty that reduces overestimation of the number of states. Following the work on finite alphabets in Refs. [7,16,10], Chambaz et al. [3] have shown strong consistency of penalized and Bayesian estimators of the number of states in HMM models with observations in infinite (discrete and continuous) sets; they worked with conditionally Poisson and Gaussian distributions. As in the previous works, no a priori bounds are assumed on the number of states.
Following Chambaz et al. [3], we prove a mixture-type inequality (see Section 2.1) that allows us to normalize the likelihood; in addition we prove in Section 3, without assuming a priori bounds on the actual number of states of the hidden Markov chain, that the penalized estimator does not underestimate. To show that the penalized estimator does not overestimate the number of states, we use an approach that works well for nested models and is based on the equicontinuity of the likelihood function. We point out that our results are obtained for the linear case and that they can easily be generalized to the nonlinear case if a sublinearity hypothesis such as the one required by Yao and Attali [22] holds, retaining the assumption of Gaussian noise.

Definitions and introductory comments
A linear autoregressive process with Markov regime (AR-MR) is defined by

Y_k = α_{X_k} Y_{k-1} + b_{X_k} + σ_{X_k} e_k,  k ≥ 1,

where {e_n} are i.i.d. random variables, σ_i^2 is the variance of the model in each regime and σ^2 = (σ_1^2, ..., σ_m^2). The sequence {X_n} is a homogeneous Markov chain with state space {1, ..., m}. We denote by A = (a_ij) its transition matrix. We assume that:

S1 The Markov chain {X_n} is positive recurrent. Hence it has an invariant distribution, which we denote by λ = (λ_1, ..., λ_m).
S2 Y_0, the Markov chain {X_n} and the sequence {e_n} are mutually independent.
S3 The e_n have a Gaussian distribution N(0, 1).
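For illustration, the definition above can be turned into a simulator. The sketch below assumes the reconstructed model form Y_k = α_{X_k} Y_{k-1} + b_{X_k} + σ_{X_k} e_k with e_k ~ N(0, 1), starts the chain from its invariant distribution (S1), and takes Y_0 = 0 independent of chain and noise (S2); the function name and interface are ours, not the paper's.

```python
import numpy as np

def simulate_ar_mr(n, A, alpha, b, sigma, seed=None):
    """Simulate an AR-MR path (a sketch of the model reconstructed from the text)."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, float)
    m = A.shape[0]
    # invariant distribution lambda of the chain (exists by S1):
    # left eigenvector of A for eigenvalue 1, normalized to sum to 1
    w, v = np.linalg.eig(A.T)
    lam = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    lam = lam / lam.sum()
    x = rng.choice(m, p=lam)
    y = 0.0  # Y_0, independent of the chain and of the noise (S2)
    ys, xs = np.empty(n), np.empty(n, dtype=int)
    for k in range(n):
        x = rng.choice(m, p=A[x])                       # move the regime
        y = alpha[x] * y + b[x] + sigma[x] * rng.standard_normal()
        ys[k], xs[k] = y, x
    return ys, xs
```

Note that {Y_n} alone is not Markov; the pair (Y_n, X_n) returned here is.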
• The symbol 1_B(x) denotes the indicator function, which takes the value 1 if x ∈ B and 0 otherwise.
• Distributions and densities are denoted by p.
• The symbol n_ij denotes the number of transitions from i to j in n steps.
• The symbol 1_i denotes an n_i-dimensional column vector with 1 in all of its positions.

The process {Y_n} is in general not a Markov chain, but the associated process {(Y_n, X_n)} is a Markov chain with state space R × {1, ..., m}. In what follows we introduce some properties, used throughout this work, related to the likelihood function of the present model.

The likelihood function
We consider the conditional distribution p_ψ(Y_1^n | Y_0 = y_0) as the likelihood function for a set of observations Y_0^n = y_0^n and parameter ψ. By the total probability rule, the total likelihood function of the model is given by: Using the above notation we may represent the AR-MR process defined by Eq. (2.1) by means of its m linear models, one for each 1 ≤ i ≤ m. Thus, the distribution of Y_0^n conditional on x_1^n is written as We assume that the prior distribution p(ψ) on Ψ satisfies where A_i denotes the i-th row of A. Due to (3.2) we will consider the prior distribution for (θ, σ^2) belonging to a Gaussian-Gamma family (see Broemeling [1], §1, p. 3); this means, for each i = 1, ..., m:

H1 θ_1, ..., θ_m are independent with
H2 σ_1^2, ..., σ_m^2 are independent with Inverse-Gamma distribution
H3 A_1, ..., A_m are independent with A_i ∼ D(e_i), where D denotes a Dirichlet density with parameter vector (1/2, ..., 1/2).

The related mixture statistic is defined by The main result of this section is the comparison between the likelihood function and the mixture statistic.
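As a concrete illustration of the conditional likelihood p_ψ(Y_1^n | Y_0 = y_0), the sketch below computes it by the standard forward (filtering) recursion for hidden Markov models rather than by summing over all state sequences. It assumes linear Gaussian emissions N(α_i y_{k-1} + b_i, σ_i^2) and takes X_1 distributed according to a supplied law `lam` (using the invariant distribution here is our assumption about the initial law):

```python
import numpy as np

def ar_mr_loglik(y, y0, A, lam, alpha, b, sigma):
    """Forward recursion for log p_psi(Y_1^n | Y_0 = y0) (a sketch)."""
    A = np.asarray(A, float)
    f = np.asarray(lam, dtype=float)   # filtering weights for X_k
    y_prev = y0
    ll = 0.0
    for k, yk in enumerate(y):
        if k > 0:
            f = f @ A                  # predict: law of X_k given the past
        z = (yk - alpha * y_prev - b) / sigma
        dens = np.exp(-0.5 * z * z) / (np.sqrt(2.0 * np.pi) * sigma)
        f = f * dens                   # update with the Gaussian emission
        s = f.sum()
        ll += np.log(s)                # accumulate log-likelihood increments
        f = f / s                      # rescale to avoid underflow
        y_prev = yk
    return ll
```

The recursion costs O(n m^2), against the O(m^n) cost of the explicit sum over state sequences that appears in the display above.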
Theorem 3.1. For each m ≥ 1, the prior distribution p(ψ) satisfies the inequality where

Penalized estimation of the number of states
The purpose of this section is to advance an estimation method based on penalized maximum likelihood in order to select the number of states m of the hidden Markov chain {X_n}. For every integer m ≥ 1 we consider the sets Ψ_m, and M = ∪_{m≥1} Ψ_m is the family of all models (with the convention Ψ_0 = ∅). We define the number of states m_0 through the property

Remark (Identifiability): We assume that for the true model Ψ_{m_0} the vector components {(α_i, b_i, σ_i)}_{i=1}^{m_0} are all different; thus, for every n, there exists a point at which the corresponding conditional densities are different. Therefore, in agreement with Remark 2.10 of Krishnamurthy and Yin [15], the model is identifiable in the following sense: if K stands for the Kullback-Leibler divergence and K(ψ, ψ_{m_0}) = 0, then ψ = ψ_{m_0}. As a result, identifiability implies that m_0, defined by Eq. (4.1), is unique.
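At the level of a single Gaussian emission, the implication "K(ψ, ψ_{m_0}) = 0 implies ψ = ψ_{m_0}" can be illustrated with the closed-form Kullback-Leibler divergence between two normal distributions. This is only a static sketch of the mechanism, not the process-level divergence used in the text:

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    """K( N(mu1, s1^2) || N(mu2, s2^2) ) in closed form.

    The divergence is nonnegative and vanishes exactly when
    (mu1, s1) == (mu2, s2), which is the one-regime analogue of
    the identifiability statement in the Remark.
    """
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * s2 ** 2) - 0.5
```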
Let pen(n, m) be a penalty term given by a positive function, increasing in n and m. We define the penalized maximum likelihood (PML) estimator of m_0 as

We say that m(n) overestimates the number of states m_0 if m(n) > m_0, and that it underestimates the number of states if m(n) < m_0.
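The PML rule of Eq. (4.2) can be sketched as follows: maximize, over m, the maximized log-likelihood on Ψ_m minus the penalty. The BIC-type penalty and the parameter-dimension count below are our illustrative assumptions, not a choice made by the paper:

```python
import math

def select_m_pml(logliks, n, pen):
    """Penalized maximum likelihood choice of m (a sketch of Eq. (4.2)).

    logliks: dict mapping m -> sup over Psi_m of log p_psi(Y_1^n | Y_0).
    pen(n, m): positive penalty, increasing in n and m.
    """
    return max(logliks, key=lambda m: logliks[m] - pen(n, m))

def bic_type_pen(n, m):
    # Illustrative BIC-type penalty: half the parameter dimension times log n.
    # The count m*(m-1) + 3*m (free transition entries plus
    # (alpha_i, b_i, sigma_i^2) per regime) is our assumption.
    dim = m * (m - 1) + 3 * m
    return 0.5 * dim * math.log(n)
```

Because the models are nested, the unpenalized maximized likelihood is nondecreasing in m; the penalty is what makes the maximizer finite and well defined.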
In the following theorem we prove that the PML estimator of m_0 does not overestimate the number of states. In order to prove this theorem, the following two lemmas are needed.

Lemma 4.1 (Finesso [7]). Assume (S1-S6). The sequence of functions f_n(ψ) = (1/n) log p_ψ(Y_1^n | Y_0 = y_0) is equicontinuous a.s.-P_{ψ_0}.
The following result is a usual one in the context of order selection for a nested family of models; see [2], §15, pp. 577-578. For HMM models similar results are given, for example, in [10,3].

Lemma 4.2. Assume (S1-S6); then we have:

In the following theorem we prove that the estimator m(n) does not underestimate the number of states m_0.

Proofs
Proof of Theorem 3.1.
We observe that Hence, the theorem can be proved by finding constants C_1, C_2 such that: Thus, taking into account equations (5.1) and (3.1). Let us evaluate q_m(x_1^n) following the proof given in the Appendix of Ref. [16]. Consider (5.4). The right-hand side of equation (5.4) does not exceed Gassiat and Boucheron [10] showed that .
As a result, the evaluation of the mixture, upon integration over the variables θ and σ^2, is: Now, setting u_0 → 0 and v_0 → 0 (which means that in the limit we consider a priori distributions for σ^2 which are non-informative, although improper), and again conditioning on Y_1^n = y_1^n and x_1^n, since the model is both linear and Gaussian, the ML estimators for 1 ≤ i ≤ m are and the right-hand side of the inequality satisfies We arrive at the following expression for the density quotient: Taking logarithms of both sides of the inequality, we have: Let us notice that the right-hand side of the former inequality satisfies the following bounds: For term T_1 we have For the remaining terms we write Making use both of convexity and the Ergodic Theorem we see that the following relation is satisfied a.s.
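The ML estimators conditioned on the regime sequence that appear above have a simple structure: once x_1^n is fixed, the model splits into m ordinary linear Gaussian regressions, so the conditional ML estimators are per-regime least squares. The sketch below (our illustration, with the states treated as observed) makes this explicit:

```python
import numpy as np

def regime_ml_estimates(y, y0, x, m):
    """Conditional ML estimates per regime, with the states x_1^n given.

    For each regime i, regress Y_k on (Y_{k-1}, 1) over the indices with
    x_k = i, and use the ML (not bias-corrected) variance estimate.
    Returns a list of (alpha_hat, b_hat, sigma2_hat) per regime.
    """
    y = np.asarray(y, float)
    x = np.asarray(x)
    y_lag = np.concatenate(([y0], y[:-1]))
    est = []
    for i in range(m):
        idx = x == i
        W = np.column_stack([y_lag[idx], np.ones(idx.sum())])  # design matrix
        theta, *_ = np.linalg.lstsq(W, y[idx], rcond=None)     # least squares
        resid = y[idx] - W @ theta
        est.append((theta[0], theta[1], float(np.mean(resid ** 2))))
    return est
```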
Substituting the calculated bounds yields the result.

Proof of Lemma 4.1.

We work directly with the extended Markov chain
) and let ψ, ψ ′ ∈ Ψ.We prove that for each ε > 0 there exists a δ(ε) > 0 such that:

Complete likelihood is written as
from which it follows that The right-hand side of inequality (5.6) can be bounded in the following way:
• Since n_ij/n ≤ 1, n_i/n ≤ 1 and the parameters a_ij, a′_ij, σ_i^2, σ′_i^2 are bounded below (by S6), there exists a constant C_1 such that the terms T_1 and T_2 of (5.6) are upper bounded by C_1 ‖ψ − ψ′‖.
• Due to the compactness of the parameter space (S6), there exists a constant C_2 such that the term T_3 of (5.6) is upper bounded by C_2 ‖σ^2 − σ′^2‖ (1/n) Σ_{k=1}^n Y_k. The stability condition (S4) and the existence of the moments of e_1 (S3) imply (see Yao and Attali [22]), by the Ergodic Theorem, that the terms of the form (1/n) Σ_{k=1}^n g(Y_k) are controlled. Hence
• By the same compactness argument (S5 and S6) we have and again, by the Ergodic Theorem, the right side of the above inequality is upper bounded by C_4 ‖ψ − ψ′‖ a.s.
• By the Cauchy-Schwarz-Bunyakovsky inequality Now, the norm of the symmetric matrix W_i^t W_i is given by the absolute value of its largest real eigenvalue, which in the present case is Thus, the last term of (5.6) is smaller than C_5 ‖ψ − ψ′‖.
We thus reach the conclusion that there exists a constant C such that and this implies that h_n is an equicontinuous sequence. In order to return to {Y_n} we note that from which we have and then, summing over x_1^n, it follows that

Proof of Lemma 4.2.
The first part follows from Proposition 2.9 of [15].
We use the fact that, since ψ ∈ Ψ_m, according to Lemma 4.2 there exists 1 ≤ i ≤ I such that log p_ψm < nε + log p_ψi; hence it follows from (5. = 0, from which it follows that

Proof of Theorem 4.2.

Let us define the set
We note that and where (a) is a consequence of the definition (4.2) of the PML estimator and (b) of Theorem 3.1 (see Searle [20], §2, pp. 49-53).
On the other hand, substituting in r, we have: = o n 2 2τ 2 a.s.
Since we notice that n_k/n → λ_k a.s. (S1), we have proven that
In what follows we consider the ratio Y_{I_k}^t P_k Y_{I_k} / Y_{I_k}^t B_k Y_{I_k}. Conditioning on Y_1^n = y_1^n and x_1^n, since the model is both linear and Gaussian (3.2), Y_{I_k}^t P_k Y_{I_k} has a noncentral χ²(n_k, γ) distribution, where γ = (1/2) θ_k^t W^t P_k W θ_k is the non-centrality parameter; further, we assume that P_k has full rank. Moreover, χ²(n_k, γ) can be approximated by a χ²_r having the same mean and the same variance, with r = (n_k + 2γ)²/(n_k + 4γ). For the denominator, if we assume B_k to have full rank, then Y_{I_k}^t B_k Y_{I_k} is distributed as χ²
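The moment matching behind the chi-square approximation above can be checked numerically. The sketch below reads the formula r = (n_k + 2γ)²/(n_k + 4γ) as a Patnaik-type approximation of a noncentral chi-square with noncentrality 2γ by a scaled central chi-square c·χ²_r; the noncentrality convention and the scaling constant c are our reading, chosen because they make the mean and variance match exactly:

```python
def noncentral_moments(nk, gamma):
    # Noncentral chi-square with nk d.f. and noncentrality 2*gamma:
    # mean nk + 2*gamma, variance 2*(nk + 4*gamma).
    return nk + 2.0 * gamma, 2.0 * (nk + 4.0 * gamma)

def scaled_central_moments(nk, gamma):
    # Patnaik-type approximation c * chi2_r with
    # r = (nk + 2g)^2 / (nk + 4g) as in the text and c = (nk + 4g)/(nk + 2g).
    r = (nk + 2.0 * gamma) ** 2 / (nk + 4.0 * gamma)
    c = (nk + 4.0 * gamma) / (nk + 2.0 * gamma)
    return c * r, 2.0 * c ** 2 * r
```

With these choices c·r = nk + 2γ and 2c²r = 2(nk + 4γ), so both moments agree with the noncentral distribution.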