General-order observation-driven models: ergodicity and consistency of the maximum likelihood estimator

The class of observation-driven models (ODMs) includes many models of non-linear time series which, in a fashion similar to, yet different from, hidden Markov models (HMMs), involve hidden variables. Interestingly, in contrast to most HMMs, ODMs enjoy likelihoods that can be computed exactly with a computational complexity of the same order as the number of observations, making maximum likelihood estimation the privileged approach for statistical inference in these models. A celebrated example of general-order ODMs is the GARCH$(p,q)$ model, for which ergodicity and inference have been studied extensively. However, little is known about more general models, in particular integer-valued ones, such as the log-linear Poisson GARCH or the NBIN-GARCH of order $(p,q)$, for which most of the existing results seem restricted to the case $p=q=1$. Here we fill this gap and derive ergodicity conditions for general ODMs. The consistency and the asymptotic normality of the maximum likelihood estimator (MLE) can then be derived using the method already developed for first-order ODMs.

ODMs have the nice feature that the computation of the associated (conditional) likelihood and of its derivatives is easy, so that parameter estimation is relatively simple and prediction, which is a prime objective in many time series applications, is straightforward. However, it turns out that the asymptotic properties of the maximum likelihood estimator (MLE) for this class can be cumbersome to establish, except when they can be derived using computations specific to the studied model (the GARCH(1, 1) case being one of the most celebrated examples). The literature concerning the asymptotic theory of the MLE when the observed variable has a Poisson distribution includes [23,21,24] and [46]. For the more general case where the model belongs to the class of one-parameter exponential ODMs, such as the Bernoulli, the exponential, the negative binomial (with known shape parameter) and the Poisson autoregressive models, the consistency and the asymptotic normality of the MLE have been derived in [11]. However, the one-parameter exponential family is inadequate to deal with models such as multi-parametric, mixture or multivariate ODMs (the negative binomial with all parameters unknown and mixture Poisson ODMs are examples of this case). A more general consistency result has been obtained recently in [13], where the observed process may admit various conditional distributions. This result has later been extended and refined in [16]. However, most of the results obtained so far have been derived only within the framework of GARCH(1, 1)-type, or first-order, ODMs. To our knowledge, little is known for GARCH(p, q)-type, that is, higher-order, discrete ODMs; this is highlighted as a remaining unsolved problem in [44].
Here, following others (e.g. [42,29]), we consider a general class of ODMs capable of accounting for several lagged variables of both the hidden and the observation processes. We develop a theory for the class of general-order ODMs parallel to that of the GARCH(p, q) family and, in particular, we investigate the two following problems.
a) Provide a complete set of conditions for general-order ODMs implying that the process is ergodic.
b) Prove the consistency of the MLE. Under the assumption of a well-specified model, this can be treated in two separate sub-problems:
1) prove that the MLE is equivalence-class consistent;
2) characterize the set of identifiable parameters.
In principle, the general-order model can be treated by embedding it into a first-order one and by applying the results obtained e.g. in [13,16] to the embedded model. Yet the particular form of the embedded model does not fit the usual assumptions tailored for standard first-order ODMs. This is why, as pointed out in [44], solving Problem a) is not only a formal extension of available results. To obtain Theorem 8, where Problem a) is addressed, we derive conditions by taking advantage of the asymptotic behavior of iterated versions of the kernels involved. Incidentally, this also allows us to improve known conditions for some first-order models, as explained in Remark 2 below. Once the ergodicity of the model is proved, we can solve Sub-problem b) 1). This is done in Theorem 10, which is obtained almost for free from the embedding in the ODM(1, 1) case. Sub-problem b) 2) is much more involved and has been addressed in [17].
To demonstrate the generality, applicability and efficiency of our approach, we apply our results to three specific integer-valued ODMs, namely, the log-linear Poisson GARCH(p, q) model, the negative binomial integer-valued GARCH, or NBIN-GARCH(p, q), model, and the Poisson AutoRegressive model with eXogenous variables, known as the PARX model. To the best of our knowledge, the stationarity and ergodicity as well as the asymptotic properties of the MLE for the general log-linear Poisson GARCH(p, q) and NBIN-GARCH(p, q) models have not been derived so far. For the PARX(p, q) model, which can be considered as a vector-valued ODM in our study, such results are available in [1], but our approach leads to significantly different assumptions, as will be shown in Section 2.3.3. Numerical experiments involving the log-linear Poisson GARCH(p, q) and the NBIN-GARCH(p, q) models can be found in [40, Section 5.5] for a set of earthquake data from the Earthquake Hazards Program [43].
The paper is structured as follows. Definitions used throughout the paper are introduced in Section 2, where we also state our results on three specific examples. In Section 3, we present our main results on the ergodicity and consistency of the MLE for general order ODMs. Finally, Section 4 contains the postponed proofs and we gather some independent useful lemmas in Section 5.

2. Definitions and Examples
2.1. Observation-driven model of order (p, q). Throughout the paper we use the notation u_{ℓ:m} := (u_ℓ, . . . , u_m) for ℓ ≤ m, with the convention that u_{ℓ:m} is the empty sequence if ℓ > m, so that, for instance, (x_{0:(−1)}, y) = y. The observation-driven time series model can formally be defined as follows.
Definition 1 (General-order ODM and (V)LODM). Let (X, 𝒳), (Y, 𝒴) and (U, 𝒰) be measurable spaces, respectively called the latent space, the observation space and the reduced observation space. Let (Θ, ∆) be a compact metric space, called the parameter space. Let Υ be a measurable function from (Y, 𝒴) to (U, 𝒰), called the admissible mapping. Let {(x_{1:p}, u_{1:q}) ↦ ψ̃^θ_{u_{1:q}}(x_{1:p}) : θ ∈ Θ} be a family of measurable functions from (X^p × U^q, 𝒳^{⊗p} ⊗ 𝒰^{⊗q}) to (X, 𝒳), called the reduced link functions, and let {G^θ : θ ∈ Θ} be a family of probability kernels on X × 𝒴, called the observation kernels.
(i) A time series {Y_k : k ∈ Z_{≥0}} valued in Y is said to follow an observation-driven model of order (p, q) (shortened as ODM(p, q)) if, for some X-valued process {X_k : k > −p}, we have, for all k ∈ Z_{≥0},
    Y_k | F_k ∼ G^θ(X_k; ·),
    X_{k+1} = ψ̃^θ_{U_{(k−q+1):k}}(X_{(k−p+1):k}) with U_k = Υ(Y_k),   (2.1)
where F_k := σ(X_{(−p+1):k}, Y_{(−q+1):(k−1)}).
(ii) We further say that this model is a vector linearly observation-driven model of order (p, q, p′, q′) (shortened as VLODM(p, q, p′, q′)) if, moreover, for some p′, q′ ∈ Z_{>0}, X is a closed subset of R^{p′}, U ⊂ R^{q′} and, for all x = x_{0:(p−1)} ∈ X^p, u = u_{0:(q−1)} ∈ U^q and θ ∈ Θ,
    ψ̃^θ_u(x) = ω(θ) + Σ_{k=1}^{p} A_k(θ) x_{p−k} + Σ_{k=1}^{q} B_k(θ) u_{q−k},   (2.2)
for some mappings ω, A_{1:p} and B_{1:q} defined on Θ and valued in R^{p′}, R^{p′×p′p} and R^{p′×q′q}, respectively. In the case where p′ = q′ = 1, the VLODM(p, q, p′, q′) is simply called a linearly observation-driven model of order (p, q) (shortened as LODM(p, q)).

Remark 1. Let us comment briefly on this definition.
(1) The standard definition of an observation-driven model does not include the admissible mapping Υ and, indeed, we can define the same model without Υ by replacing the second equation in (2.1) by
    X_{k+1} = ψ^θ_{Y_{(k−q+1):k}}(X_{(k−p+1):k}),
where {(x_{1:p}, y_{1:q}) ↦ ψ^θ_{y_{1:q}}(x_{1:p}) : θ ∈ Θ} is a family of measurable functions from (X^p × Y^q, 𝒳^{⊗p} ⊗ 𝒴^{⊗q}) to (X, 𝒳), called the (non-reduced) link functions, and defined by
    ψ^θ_{y_{1:q}}(x_{1:p}) = ψ̃^θ_{Υ^{⊗q}(y_{1:q})}(x_{1:p}).
However, by inserting the mapping Υ, we introduce some flexibility, which is useful for describing various ODMs with the same reduced link function ψ̃^θ. For instance, all LODMs and VLODMs use the form of reduced link function in (2.2), although they may use various mappings Υ. This is the case for the GARCH, log-linear GARCH and NBIN-GARCH, see below, but also for the bivariate integer-valued GARCH model [9].
(2) When p = q = 1, the ODM(p, q) defined by (2.1) collapses to the (first-order) ODM considered in [13] and [16]. Note also that if p ≠ q, setting r := max(p, q), the ODM(p, q) can be embedded in an ODM(r, r), but this requires augmenting the parameter dimension, which might impact the identifiability of the model.
(3) Note that in our definition, Υ does not depend on θ. In some cases, one can simplify the reduced link function ψ̃ by letting Υ depend on the unknown parameter. To derive identifiability conditions or to prove the consistency of the MLE, the dependence of the distribution on the unknown parameter θ is crucial. In contrast, proving that the process is ergodic is always done for a given θ, and it is thus possible to use our set of conditions to prove ergodicity with Υ depending on θ, hence reasoning for a given θ and a given Υ. Moreover, our set of conditions (namely Conditions (A-3), (A-4), (A-5), (A-6), (A-7) and (A-8) in Theorem 8) depends on Υ only through the image space U, which can usually be taken to be the same even in situations where Υ may depend on θ.
The GARCH model has been extensively studied, see, for example, [5,25,26,34,27] and the references therein. Many other examples have been derived in the class of LODMs. An important feature of the GARCH is that x ↦ G^θ(x; ·) maps into a family of distributions parameterized by the scale parameter √x, which is often expressed by replacing the first line in (2.1) by an equation of the form Y_k = √X_k ε_k, with the assumption that {ε_n : n ∈ Z} is i.i.d. Such a simple multiplicative formula no longer holds when the observations Y_k are integers, which seriously complicates the theoretical analysis of such models, as explained in [44]. Our results apply to the GARCH case but do not improve the existing results there. The real interest of our approach lies in being able to address general-order integer-valued ODMs, for which existing results are scarce.
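To fix ideas on this multiplicative structure, here is a minimal simulation sketch of a GARCH(p, q) viewed as an LODM with Υ(y) = y². The lag ordering (a on lagged hidden values, b on lagged squared observations), the initial values and all names are illustrative assumptions, not notation taken from the cited references.

```python
import numpy as np

def simulate_garch(omega, a, b, n, seed=0):
    """Simulate Y_k = sqrt(X_k) * eps_k, eps_k i.i.d. N(0, 1), with
    X_{k+1} = omega + sum_i a_i X_{k+1-i} + sum_j b_j Y_{k+1-j}^2,
    i.e. an LODM(p, q) with Upsilon(y) = y^2 (illustrative convention)."""
    rng = np.random.default_rng(seed)
    p, q = len(a), len(b)
    xs, us = [omega] * p, [0.0] * (q - 1)      # arbitrary initial values
    y_out = []
    for _ in range(n):
        y_k = np.sqrt(xs[-1]) * rng.standard_normal()
        us.append(y_k ** 2)                    # u_k = Upsilon(y_k)
        xs.append(omega + np.dot(a, xs[-p:][::-1])   # a_1 weights X_k, ...
                  + np.dot(b, us[-q:][::-1]))        # b_1 weights Y_k^2, ...
        y_out.append(y_k)
    return np.array(y_out)

y = simulate_garch(0.1, [0.8], [0.1], n=1000)
```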
Consider an ODM as in Definition 1. Then, for all k ≥ 0, the conditional distribution of (Y_k, X_{k+1}) given F_k only depends on Z_k, where we defined
    Z_k := (X_{(k−p+1):k}, U_{(k−q+1):(k−1)}) ∈ Z := X^p × U^{q−1}.   (2.4)
For all θ ∈ Θ and all probability distributions ξ on (Z, 𝒵), we denote by P^θ_ξ the distribution of {X_k, Y_{k′} : k > −p, k′ > −q} satisfying (2.1) and Z_0 ∼ ξ, with the usual notation P^θ_z in the case where ξ is a Dirac mass at z ∈ Z. We replace P by E to denote the corresponding expectations.
We always assume that the observation kernel is dominated by a σ-finite measure ν on (Y, 𝒴), that is, for all θ ∈ Θ, there exists a measurable function g^θ : X × Y → R_{≥0}, written as (x, y) ↦ g^θ(x; y), such that for all x ∈ X, g^θ(x; ·) is the density of G^θ(x; ·) with respect to ν. Then the inference about the model parameter is classically performed by relying on the likelihood of the observations (Y_0, . . . , Y_n) given Z_0. For all z ∈ Z, the corresponding conditional density function p^θ(y_{0:n}|z) with respect to ν^{⊗(n+1)} under parameter θ ∈ Θ is given by
    p^θ(y_{0:n}|z) = ∏_{k=0}^{n} g^θ(x_k; y_k),
where, setting u_k = Υ(y_k) for k = 0, . . . , n, the sequence x_{0:n} is defined through the initial conditions
    x_{(−p+1):0} = Π_{1:p}(z),   u_{(−q+1):(−1)} = Π_{(p+1):(p+q−1)}(z),
and the recursion equations
    x_{k+1} = ψ̃^θ_{u_{(k−q+1):k}}(x_{(k−p+1):k}),   k ≥ 0,   (2.7)
where, throughout the paper, for all j ∈ {1, . . . , p + q − 1}, we denote by Π_j(z) the j-th entry of z ∈ Z. Note that x_{k+1} only depends on z and y_{0:k} for all k ≥ 0. Throughout the paper, we use the notation: for all n ≥ 1, y_{0:(n−1)} ∈ Y^n and z ∈ Z,
    ψ^θ_{y_{0:(n−1)}}(z) := x_n, with x_n defined by (2.7).   (2.8)
Note that for n = 0, y_{0:(−1)} is the empty sequence and (2.8) is replaced by ψ^θ_∅(z) = x_0 = Π_p(z). Given the initial condition Z_0 = z^{(i)}, the (conditional) maximum likelihood estimator θ̂_{z^{(i)},n} of the parameter θ is thus defined by
    θ̂_{z^{(i)},n} ∈ argmax_{θ∈Θ} L^θ_{z^{(i)},n},   (2.9)
where, for all z^{(i)} ∈ Z,
    L^θ_{z^{(i)},n} := n^{−1} ln p^θ(Y_{0:n}|z^{(i)}).   (2.10)
Note that since {X_n : n ∈ Z} is unobserved, to compute the conditional likelihood we use some arbitrary initial values for X_{(−p+1):0} (the first p entries of z^{(i)}). We also use arbitrary values for the last q − 1 entries of z^{(i)}. Note that we index our set of observations as Y_{0:n} (hence assuming 1 + n observations to compute L^θ_{z^{(i)},n}). The iterated function ψ^θ_{Y_{0:k}} can be cumbersome to write but is very easy to compute in practice using the recursion (2.7). Moreover, the same kind of recursion holds for its derivatives with respect to θ, allowing one to apply gradient steps to locally maximize the likelihood.
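To make the recursion concrete, here is a minimal Python sketch of the normalized conditional log-likelihood (2.10) computed through (2.7); the helper names (link, log_g, upsilon) and the example parameter values are ours, chosen for illustration only.

```python
import math

def conditional_log_likelihood(theta, y, x_init, u_init, link, log_g, upsilon):
    """n^{-1} log p^theta(y_{0:n} | z): iterate the recursion (2.7) from the
    arbitrary initial values z = (x_init, u_init) and sum the log observation
    densities log g^theta(x_k; y_k)."""
    p, q = len(x_init), len(u_init) + 1
    xs, us = list(x_init), list(u_init)   # x_{(-p+1):0} and u_{(-q+1):(-1)}
    ll = 0.0
    for y_k in y:
        ll += log_g(theta, xs[-1], y_k)
        us.append(upsilon(y_k))                      # u_k = Upsilon(y_k)
        xs.append(link(theta, xs[-p:], us[-q:]))     # x_{k+1} via (2.7)
    return ll / len(y)

# e.g. for a log-linear Poisson GARCH(1, 1) with theta = (omega, a1, b1):
loglik = conditional_log_likelihood(
    (0.1, 0.5, 0.2), [2, 0, 1, 3], [0.0], [],
    link=lambda th, x, u: th[0] + th[1] * x[0] + th[2] * u[0],
    log_g=lambda th, x, y: y * x - math.exp(x) - math.lgamma(y + 1),
    upsilon=lambda y: math.log1p(y))
```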
In this contribution, we investigate the convergence of θ̂_{z^{(i)},n} as n → ∞ for some (well-chosen) value of z^{(i)}, under the assumption that the model is well specified and the observations are in the steady state. The first problem to solve in this line of results is thus to show the following.
(A-1) For all θ ∈ Θ, the Markov chain {(Z_k, Y_k) : k ∈ Z_{≥0}} admits a unique stationary distribution.
This ergodic property is the cornerstone for making statistical inference theory work, and we provide simple general conditions for it in Section 3.2. We now introduce the notation that will allow us to refer to the stationary distribution of the model throughout the paper.
Definition 3. If (A-1) holds, then
a) P^θ denotes the distribution on ((X × Y)^Z, (𝒳 ⊗ 𝒴)^{⊗Z}) of the stationary solution of (2.1) extended to k ∈ Z, with F_k = σ(X_{−∞:k}, Y_{−∞:(k−1)});
b) P̃^θ denotes the projection of P^θ on the component Y^Z.
We also use the symbols E^θ and Ẽ^θ to denote the expectations corresponding to P^θ and P̃^θ, respectively. We further denote by π^θ_X and π^θ_Y the marginal distributions of X_0 and Y_0 under P^θ, on (X, 𝒳) and (Y, 𝒴), respectively. As a byproduct of the proof of (A-1), one usually obtains a function V_X : X → R_{≥0} of interest, common to all θ ∈ Θ, such that the following property holds for the stationary distribution (see Section 3.2).
(A-2) For all θ ∈ Θ, π^θ_X(V_X) < ∞.
It is here stated as an assumption for convenience. Note also that, in the following, for V : X → R_{≥0} and f : X → R, we denote the V-norm of f by
    |f|_V := sup_{x∈X} |f(x)| / V(x),
with the convention 0/0 = 0, and we write f ≲ V if |f|_V < ∞. With this notation, under (A-2), for any f : X → R such that f ≲ V_X, π^θ_X(|f|) < ∞ holds and, similarly, since π^θ_Y = π^θ_X G^θ as a consequence of (2.1), π^θ_Y(|f|) < ∞ for any f : Y → R such that G^θ|f| ≲ V_X.
Equivalence-class consistency and identifiability. For all θ, θ′ ∈ Θ, we write θ ∼ θ′ if and only if P̃^θ = P̃^{θ′}. This defines an equivalence relation on the parameter set Θ and the corresponding equivalence class of θ is denoted by [θ] := {θ′ ∈ Θ : θ ∼ θ′}.
The equivalence relation ∼ was introduced by [32] as an alternative to the classical identifiability condition. Namely, we say that θ̂_{z^{(i)},n} is equivalence-class consistent at the true parameter θ⋆ if
    lim_{n→∞} ∆(θ̂_{z^{(i)},n}, [θ⋆]) = 0,   P̃^{θ⋆}-a.s.   (2.11)
Recall that ∆ is the metric endowing the parameter space Θ. Therefore, in (2.11), ∆(θ̂_{z^{(i)},n}, [θ⋆]) is the distance between the MLE and the set of parameters having the same stationary distribution as the true parameter, and the convergence is considered under this common stationary distribution. Identifiability can then be treated as a separate problem, which consists in determining all the parameters θ⋆ for which [θ⋆] reduces to the singleton {θ⋆}, so that equivalence-class consistency becomes the usual consistency at, and only at, these parameters. The identifiability problem is treated in [17, Proposition 11 and Theorem 12] for general-order ODMs, where many cases of interest are specifically detailed. We will use in particular [17, Theorem 17 and Section 5.5].

2.3. Three examples.
To motivate our general results, we first present three models of interest, namely the log-linear Poisson GARCH(p, q), the NBIN-GARCH(p, q) and the PARX(p, q) models. To the best of our knowledge, the stationarity and ergodicity as well as the asymptotic properties of the MLE for the general log-linear Poisson GARCH(p, q) and NBIN-GARCH(p, q) models have not been derived so far. Such results are available for the PARX model in [1], but our approach leads to significantly different assumptions, as will be shown in Section 2.3.3.
Once the ergodicity of the model is established, we can investigate the consistency of the MLE. This is done by first establishing the equivalence-class consistency of the MLE; we then only need to add an identifiability condition to get the consistency of the MLE, see Theorems 4, 5 and 7 hereafter. Such results pave the way for investigating the asymptotic normality. This is done in [40, Proposition 5.4].
2.3.1. Log-linear Poisson GARCH model.
Example 1. The log-linear Poisson GARCH(p, q) model, parameterized by θ = (ω, a_{1:p}, b_{1:q}) ∈ Θ ⊂ R^{1+p+q}, is an LODM(p, q) with affine reduced link function of the form (2.2) with coefficients given by
    ω(θ) = ω,   A_k(θ) = a_k,   B_k(θ) = b_k,   (2.12)
with observation space Y = Z_{≥0}, hidden variables space X = R, admissible mapping Υ(y) = ln(1 + y), and observation kernel G^θ(x; ·) defined as the Poisson distribution with mean e^x.
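A minimal simulation sketch of this model follows; the initial values and parameter values are arbitrary illustrations, and the coefficient ordering assumes the affine form (2.2) written above.

```python
import numpy as np

def simulate_loglin_pgarch(omega, a, b, n, seed=0):
    """Simulate Y_k | X_k ~ Poisson(exp(X_k)) with
    X_{k+1} = omega + sum_i a_i X_{k+1-i} + sum_j b_j log(1 + Y_{k+1-j})."""
    rng = np.random.default_rng(seed)
    p, q = len(a), len(b)
    xs, us = [0.0] * p, [0.0] * (q - 1)    # arbitrary initial values
    y_out = []
    for _ in range(n):
        y_k = rng.poisson(np.exp(xs[-1]))
        us.append(np.log1p(y_k))                     # u_k = log(1 + y_k)
        xs.append(omega + np.dot(a, xs[-p:][::-1])   # a_1 weights X_k, ...
                  + np.dot(b, us[-q:][::-1]))        # b_1 weights u_k, ...
        y_out.append(y_k)
    return np.array(y_out)

y = simulate_loglin_pgarch(0.1, [0.5], [0.2], n=1000)
```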
Our condition for showing the ergodicity of general-order log-linear Poisson GARCH models requires the following definition. For all x = x_{(1−(p∨q)):0} ∈ R^{p∨q}, m ∈ Z_{≥0} and w = w_{(−q+1):m} ∈ {0, 1}^{q+m}, define ψ^θ_w(x) as the value x_{m+1} obtained by the recursion (2.13). We can now state our result on general-order log-linear Poisson GARCH models.
The proof is postponed to Section 4.4.
Remark 2. Let us provide some insights about Condition (2.14).
(1) Using the two possible constant sequences w (w_k = 0 for all k, or w_k = 1 for all k) in (2.13), we easily see that (2.14) implies a_{1:p} ∈ S_p and a_{1:(p∨q)} + b_{1:(p∨q)} ∈ S_{p∨q}, where we used the usual conventions a_k = 0 for p < k ≤ q and b_k = 0 for q < k ≤ p, and where, for any positive integer r,
    S_r := {c_{1:r} ∈ R^r : 1 − Σ_{k=1}^{r} c_k z^k ≠ 0 for all z ∈ C with |z| ≤ 1}.   (2.17)
(2) A sufficient condition to have (2.14) is (2.18). Indeed, defining ρ as the left-hand side of (2.18), we clearly have, for all m > p ∨ q, w ∈ {0, 1}^{q+m} and x ∈ R^{p∨q}, a geometric bound on the corresponding iterates.
(3) The first iteration of (2.13) gives a bound valid for all w ∈ {0, 1}^q and x ∈ [−1, 1]^{p∨q}; hence a sufficient condition to have (2.18) (and thus (2.14)) is (2.19).
(4) When p = q = 1, by Points (1) and (3) above, Condition (2.14) is equivalent to having |a_1| < 1 and |a_1 + b_1| < 1. This condition is weaker than the one derived in [13], where |b_1| < 1 is also imposed.

2.3.2. NBIN-GARCH model.
Our next example is the NBIN-GARCH(p, q), which is defined as follows in our setting.
Example 2. The NBIN-GARCH(p, q) model, parameterized by θ = (ω, a_{1:p}, b_{1:q}, r), is an LODM(p, q) with affine reduced link function of the form (2.2) with coefficients given by (2.12), observation space Y = Z_{≥0}, hidden variables space X = R_{≥0}, admissible mapping Υ(y) = y, and observation kernels G^θ(x; ·) defined as the negative binomial distribution with shape parameter r > 0 and mean r x, that is, for all y ∈ Z_{≥0},
    G^θ(x; {y}) = Γ(r + y) / (y! Γ(r)) (1/(1 + x))^r (x/(1 + x))^y.
We now state our result for the general-order NBIN-GARCH model.
The proof is postponed to Section 4.5.
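As for the previous example, a minimal simulation sketch can help fix ideas; the parameterization below (negative binomial with shape r and success probability 1/(1 + x), hence mean r x) matches the definition above, while the initial values and parameter values are arbitrary.

```python
import numpy as np

def simulate_nbin_garch(omega, a, b, r, n, seed=0):
    """Simulate Y_k | X_k ~ NegBin(r, 1/(1+X_k)) (mean r*X_k) with
    X_{k+1} = omega + sum_i a_i X_{k+1-i} + sum_j b_j Y_{k+1-j};
    omega > 0 and a_i, b_j >= 0 keep X_k positive."""
    rng = np.random.default_rng(seed)
    p, q = len(a), len(b)
    xs, ys = [omega] * p, [0] * (q - 1)   # arbitrary nonnegative initial values
    y_out = []
    for _ in range(n):
        y_k = int(rng.negative_binomial(r, 1.0 / (1.0 + xs[-1])))
        ys.append(y_k)
        xs.append(omega + np.dot(a, xs[-p:][::-1]) + np.dot(b, ys[-q:][::-1]))
        y_out.append(y_k)
    return np.array(y_out)
```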

Remark 3.
Clearly, under the setting of Example 2, if Eq. (2.1) has a stationary solution such that µ = ∫ x π_X(dx) < ∞, then, taking the expectation on both sides of the second equation in (2.1) and using that ∫ y π_Y(dy) = r ∫ x π_X(dx), Condition (2.20) must hold, in which case we get
    µ = ω / (1 − Σ_{k=1}^{p} a_k − r Σ_{k=1}^{q} b_k).
Hence (2.20) is in fact necessary and sufficient to get a stationary solution admitting a finite first moment, as was already observed in [48, Theorem 1], although the ergodicity is not proven in this reference. However, we believe that, similarly to the classical GARCH(p, q) processes, we can find stationary solutions to Eq. (2.1) in the case where (2.20) does not hold. This is left for future work.
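As a quick numerical sanity check of this first-moment formula, one can compare the empirical mean of a long simulated path, using the simulate_nbin_garch sketch above, with r µ; the parameter values are arbitrary illustrations.

```python
# Empirical check of the stationary mean (illustrative parameter values):
omega, a, b, r = 0.5, [0.2], [0.1], 2.0
y = simulate_nbin_garch(omega, a, b, r, n=200_000)
mu = omega / (1 - sum(a) - r * sum(b))     # stationary mean of X
print(y.mean(), r * mu)                    # both should be close to 1.67
```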

2.3.3. The PARX model. The PARX model is similar to the standard INGARCH(p, q) model but includes additional exogenous variables in the linear link function generating the hidden variables. Following [1], these exogenous variables are assumed to satisfy some Markov dynamic of order 1, independently of the observations and of the hidden variables (see their Assumption 1). This leads to the following definition.
Definition 6 (PARX model). Let d, p, q and r be positive integers. Let f_i be a measurable function from R^r to R_{≥0} for all i ∈ {1, . . . , d}. Let P(x) denote the Poisson distribution with parameter x and L be a given Markov kernel on R^r × B(R^r). We define the Poisson AutoRegressive model with eXogenous variables, shortly written as PARX(p, q), as a class of processes parameterized by θ = (ω, a_{1:p}, b_{1:q}, c_{1:d}), and satisfying
    Y_k | F_k ∼ P(X_k),
    X_{k+1} = ω + Σ_{i=1}^{p} a_i X_{k+1−i} + Σ_{j=1}^{q} b_j Y_{k+1−j} + Σ_{l=1}^{d} c_l f_l(ξ_k),
where {ξ_k} is a Markov chain with kernel L.
Remark 4. Note that our a_1, . . . , a_p and b_1, . . . , b_q correspond to β_1, . . . , β_q and α_1, . . . , α_p of [1]. Here we allowed r to be different from d, whereas in [1] it is assumed that d = r and both are denoted by d_x. Since the exogenous variables are observed, we can recast the PARX model into a VLODM(p, q) by including the exogenous variables into the observation Y_t and the hidden variable X_t. Namely, we can formally see the above PARX model as a special case of VLODM as follows.
Example 3 (PARX model as a VLODM). Consider a PARX model as in Definition 6. Including the exogenous variables in both the observation and the hidden variable, it can be seen as a VLODM(p, q).
We end this section by stating our result for the general PARX model. To establish the ergodicity of the PARX model, we need some assumptions on the dynamic of the exogenous variables through the Markov kernel L. Namely, we consider the following assumptions on the kernel L.
(L-1) The Markov kernel L admits a kernel density ℓ with respect to some measure ν_L on the Borel sets of R^r, where ‖·‖ below denotes the Euclidean norm on R^r.
(L-2) The Markov kernel L is weak Feller (Lf is bounded and continuous for any f that is bounded and continuous).
(L-3) There exist a probability measure π_L on the Borel sets of R^r and a measurable drift function V_L such that the corresponding drift conditions hold.
(L-4) There exists a constant M ≥ 1 such that, for all (ξ, ξ′) ∈ R^{2r}, the following assertions hold for L, its kernel density ℓ introduced in (L-1) and its drift function V_L introduced in (L-3).
We can now state our main result for the PARX model.

Theorem 7. Consider the PARX(p, q) model of Definition 6, seen as the VLODM of Example 3. Suppose that (L-1)–(L-4) hold and that, for all θ ∈ Θ,
    Σ_{k=1}^{p} a_k + Σ_{k=1}^{q} b_k < 1.   (2.22)
Then the following assertions hold.
The proof is postponed to Section 4.6. Let us briefly comment on our assumptions.
Remark 5. Contrary to the previous example, the PARX model requires an additional Markov kernel L and therefore additional assumptions on this kernel. We briefly comment on them hereafter and compare our assumptions to those in [1].
(i) Assumptions (L-1)–(L-3) are classical assumptions on Markov kernels, ensuring stability and ergodicity. In [1, Assumption 2], a different approach is used, relying instead on a first-order contraction of the random iterative functions defining the Markov transition. Their assumption would typically imply our (L-3) with V_L(ξ) = 1 + ‖ξ‖^s for some s ≥ 1, their Assumption 3(ii) implying our (L-3)(iii).
(ii) Our Assumption (L-4) is inherited from our approach for proving ergodicity using the embedding of the PARX model into a VLODM detailed in Example 3. It is not clear to us whether it can be omitted for proving ergodicity. This question is left for future work. Note that we provide another formulation of this assumption in Remark 10, see Section 4.6.
There is no equivalent of our Assumption (L-4) in [1], as their technique for proving ergodicity is different. However, although we have the same ergodicity condition (2.22) as them, see their Assumption 3(i), they require an additional condition, Assumption 3(iii), which we do not need. This condition involves both the coefficients a_i and b_i and the contraction constant ρ used in their stability assumption for the Markov transition of the covariates. In contrast, our conditions on the parameters a_i and b_i of the model, merely imposed through (2.22), are completely separate from our assumptions (L-1)–(L-4) on the dynamics of the covariates.
(iii) Assumptions (L-5) and (L-6) are not required to get ergodicity; they are needed to establish the equivalence-class consistency of the MLE and the identifiability of the model. Like our Assumption (L-4) for proving ergodicity, Assumption (L-5) is inherited from the embedding of the PARX model into a VLODM, here used for proving the equivalence-class consistency. Assumption (L-6) is a mild and natural identifiability assumption. It basically says that the covariates f_1(ξ_k), . . . , f_d(ξ_k) are not linearly related conditionally on ξ_{k−1}. If they were, it would suggest using a smaller set of covariates. The identifiability condition of [1] is different, as it involves a condition both on the covariate distribution and on the parameters a_1, . . . , a_p and b_1, . . . , b_q, see their Assumption 5.
To conclude this section, let us carefully examine the simple case where the covariates follow a Gaussian linear dynamic and see, in this specific case, how our assumptions compare to those of [1]. More precisely, assume that L is defined by the following equation on the exogenous variables:
    ξ_t = ℵ ξ_{t−1} + σ η_t,
where ℵ is an r × r matrix with spectral radius ρ(ℵ) ∈ (0, 1), σ > 0, and {η_t} is i.i.d. N(0, I_r). Then Assumption (L-1) holds with
    ℓ(ξ; ξ′) = (2πσ²)^{−r/2} exp(−‖ξ′ − ℵξ‖² / (2σ²))
and ν_L being the Lebesgue measure on R^r. It is straightforward to check (L-2) and (L-3) with V_L(ξ) = e^{λ‖ξ‖} for any λ > 0.
Let us now check that Assumption (L-4) holds. First note that, for all ξ, ξ′ ∈ R^r such that ‖ξ − ξ′‖ ≤ ε, the required bound holds with a positive constant M_0 only depending on ‖ℵ‖, σ and ε. Finally, the bound extends to all ξ, ξ′ ∈ R^r, and Assumption (L-4) holds.
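Putting the pieces of this example together, here is a minimal simulation sketch of a PARX(p, q) with the Gaussian linear covariate dynamic just described; the intensity recursion follows Definition 6, while the function names and parameter values are illustrative assumptions.

```python
import numpy as np

def simulate_parx(omega, a, b, c, A, sigma, f, n, seed=0):
    """Simulate a PARX(p, q) with Gaussian AR(1) covariates:
    xi_k = A xi_{k-1} + sigma * eta_k, eta_k ~ N(0, I_r),
    X_{k+1} = omega + sum_i a_i X_{k+1-i} + sum_j b_j Y_{k+1-j} + c . f(xi_k),
    Y_k | X_k ~ Poisson(X_k)."""
    rng = np.random.default_rng(seed)
    p, q, r = len(a), len(b), A.shape[0]
    xs, ys = [omega] * p, [0] * (q - 1)     # arbitrary initial values
    xi = np.zeros(r)
    y_out = []
    for _ in range(n):
        y_k = rng.poisson(xs[-1])
        ys.append(y_k)
        xi = A @ xi + sigma * rng.standard_normal(r)   # exogenous dynamic
        xs.append(omega + np.dot(a, xs[-p:][::-1]) + np.dot(b, ys[-q:][::-1])
                  + np.dot(c, f(xi)))
        y_out.append(y_k)
    return np.array(y_out)

# Spectral radius of A below 1 and nonnegative f keep the intensity stable:
y = simulate_parx(0.5, [0.3], [0.2], [0.1], np.array([[0.7]]), 1.0,
                  lambda xi: xi ** 2, n=1000)
```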

3.1. Preliminaries.
In the well-specified setting, a general result on the consistency of the MLE for a class of first-order ODMs has been obtained in [13]. Let us briefly describe the approach used to establish the convergence of the MLE θ̂_{z^{(i)},n} in this reference and in the present contribution for higher-order ODMs. Let θ⋆ ∈ Θ denote the true parameter. The consistency of the MLE is obtained through the following steps.
Step 1 Find sufficient conditions for the ergodic property (A-1) of the model. Then the convergence of the MLE to θ⋆ is studied under P̃^{θ⋆} as defined in Definition 3.
Step 2 Establish that, as the number of observations n → ∞, the normalized log-likelihood L^θ_{z^{(i)},n} defined in (2.10), for some well-chosen z^{(i)} ∈ X^p, can be approximated by n^{−1} Σ_{k=1}^{n} ln p^θ(Y_k | Y_{−∞:(k−1)}), where p^θ(·|·) is a P̃^{θ⋆}-a.s. finite real-valued measurable function defined on (Y^Z, 𝒴^{⊗Z}). To define p^θ(·|·), we set, for all y_{−∞:0} ∈ Y^{Z_{≤0}} and y ∈ Y, p^θ(y | y_{−∞:0}) as the corresponding limit, whenever this limit is well defined.
Step 3 By (A-1), the observed process {Y_k : k ∈ Z} is ergodic under P̃^{θ⋆}; provided that a suitable moment condition holds under Ẽ^{θ⋆}, it then follows that the normalized log-likelihood converges to Ẽ^{θ⋆}[ln p^θ(Y_1 | Y_{−∞:0})].
Step 4 Using an additional argument (similar to that in [37]), deduce that the MLE θ̂_{z^{(i)},n} defined by (2.9) eventually lies in any given neighborhood of the set Θ⋆ defined in (3.2), which only depends on θ⋆, establishing that ∆(θ̂_{z^{(i)},n}, Θ⋆) → 0, where ∆ is the metric endowing the parameter space Θ.
Step 5 Establish that Θ⋆ defined in (3.2) reduces to the equivalence class [θ⋆], which yields the equivalence-class consistency (2.11).
Step 6 Establish that [θ⋆] reduces to the singleton {θ⋆} to obtain the usual consistency of the MLE.
In [15], we proved a general result for partially observed Markov chains, which include first-order ODMs, in order to carry out Step 5, see Theorem 1 in this reference. Finally, Step 6 is often carried out using particular means adapted to the precise model considered.
We present in Section 3.2 the conditions that we use to prove ergodicity (Step 1) and, in Section 3.3, we adapt the conditions already used in [16,15] to carry out Step 2 to Step 5, from first-order models to higher-order ODMs.
Using the embedding described in Section 4.1, all the steps from Step 1 to Step 5 can in principle be obtained by applying the existing results to the first order ODMs in which the original higher order model is embedded. This approach is indeed successful, up to some straightforward adaptation, for Step 2 to Step 5. Ergodicity in Step 1 requires a deeper analysis that constitutes the main part of this contribution. As for Step 6, it is treated in [17].

3.2. Ergodicity. In this section, we provide conditions that yield stationarity and ergodicity of the Markov chain {(Z_k, Y_k) : k ∈ Z_{≥0}}, that is, we check (A-1) and (A-2). We set θ to be an arbitrary value in Θ and, since these are "for all θ" conditions, to save space and alleviate the notational burden, we drop the superscript θ from, for example, G^θ and ψ^θ, and write G and ψ instead.
Ergodicity of Markov chains is usually studied using ϕ-irreducibility. This approach is well known to be quite efficient when dealing with fully dominated models; see [35]. The picture is quite different for integer-valued observation-driven models, where other tools need to be invoked; see [21,13,16] for ODMs(1, 1). Here we extend these results to general-order ODMs(p, q). Let us now introduce our list of assumptions. They will be further commented once they are all listed.
We first need some metric on the space Z and assume the following.
(A-3) The σ-fields 𝒳 and 𝒰 are Borel σ-fields, respectively associated with (X, δ_X) and (U, δ_U), both assumed to be complete and separable metric spaces.
For an LODM(p, q), the following condition, often referred to as the invertibility condition (see [41]), is classically assumed.
(I-1) For all θ ∈ Θ, A_{1:p}(θ) ∈ S_p,
where S_p is defined in (2.17). For an ODM(p, q) with a possibly non-linear link function, (I-1) is replaced by a uniform contraction condition on the iterates of the link function, see (A-4) below. In order to write this condition in this more general case, recall that, for any finite Y-valued sequence y, ψ^θ_y is defined by (2.8) with the recursion (2.7). Next, we may rewrite these iterates directly in terms of u_{0:(n−1)} instead of y_{0:(n−1)}. Namely, we can define
    ψ̃^θ_{u_{0:(n−1)}}(z) := x_n, with x_n defined by (2.7),   (3.4)
so that ψ^θ_{y_{0:(n−1)}}(z) = ψ̃^θ_{Υ^{⊗n}(y_{0:(n−1)})}(z) for all z ∈ Z and y_{0:(n−1)} ∈ Y^n. Now define, for all n ∈ Z_{>0}, the Lipschitz constant for ψ̃^θ_u, uniform over u ∈ U^n,
    Lip^θ_n := sup { δ_X(ψ̃^θ_u(z), ψ̃^θ_u(z′)) / δ_Z(z, z′) : u ∈ U^n, (z, z′) ∈ Z², z ≠ z′ },   (3.5)
where we set, for all v = (z, z′) ∈ Z²,
    δ_Z(v) := max_{1≤j≤p} δ_X(Π_j(z), Π_j(z′)) ∨ max_{p<j≤p+q−1} δ_U(Π_j(z), Π_j(z′)).   (3.6)
We use the following assumption on a general link function.
(A-4) For all θ ∈ Θ, we have Lip^θ_1 < ∞ and Lip^θ_n → 0 as n → ∞.
(For an affine link function, a numerical illustration of this condition is sketched after Remark 6 below.) The following assumption is mainly related to the observation kernel G and relies on the metrics introduced in (A-3) and on the iterates of the link functions defined in (3.4).
(A-5) The space (X, δ_X) is locally compact and, if q > 1, so is (U, δ_U). For all x ∈ X, there exists δ > 0 such that (3.7) holds. Moreover, one of the two following assertions holds.
(A-6) There exist measurable functions V_X : X → R_{≥0} and V_U : U → R_{≥0} such that (3.8) and (3.9) hold. The following condition is used to show the existence of a reachable point.
(A-7) The conditional density of G with respect to ν satisfies g(x; y) > 0 for all (x, y) ∈ X × Y, and one of the two following assertions holds.
(a) There exists y_0 ∈ Y such that ν({y_0}) > 0.
The last assumption is used to show the uniqueness of the invariant probability measure through a coupling argument. It requires the following definition. Under (A-8)(i) below, for any initial distribution ξ on (Z², 𝒵^{⊗2}), let Ê_ξ denote the associated expectation operator.
(A-8) There exist measurable functions W_X : X² → R_{≥0}, W_U : U² → R_{≥0} and a Markov kernel Ḡ on X² × 𝒴, dominated by ν with kernel density ḡ(x, x′; y), such that, setting W_Y = W_U ∘ Υ, we have Ḡ W_Y ≲ W_X and the three following assertions hold.
(i) For all (x, x′) ∈ X² and y ∈ Y, ḡ(x, x′; y) ≤ g(x; y) ∧ g(x′; y).
(ii) The function W_X is symmetric on X², W_X(x, ·) is locally bounded for all x ∈ X, and W_U is locally bounded on U.
one of the two following assertions holds.
Remark 6. (1) In many examples, (3.12) is satisfied with
    ḡ(x, x′; y) = α(x, x′) g(φ(x, x′); y),
where φ is a measurable function from X² to X, in which case Ḡ W_Y should be replaced by (G W_Y) ∘ φ in (A-8).
(3) If q = 1, the terms depending on ℓ in both (3.9) and (3.13) vanish. We can take V_U = W_U = 0 without loss of generality in this case.
(4) Recall that a kernel is strong (resp. weak) Feller if it maps any bounded measurable (resp. bounded continuous) function to a bounded continuous function. By Scheffé's lemma, a sufficient condition for G to be weak Feller is that x ↦ g(x; y) is continuous on X for all y ∈ Y. But then (3.7) gives that G is also strong Feller by dominated convergence.
(5) Note that Condition (3.7) holds when G(x; ·) is taken from an exponential family with natural parameter depending continuously on x and valued within the open set of natural parameters, in which case G is also strong Feller. We are in this situation for both Examples 1 and 2.
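To illustrate (A-4) in the affine case (2.2) with p′ = q′ = 1: the iterates then depend linearly on the initial hidden values, so Lip^θ_n behaves, up to the choice of metric, like the operator norm of the n-th power of the companion matrix of a_{1:p}, which vanishes geometrically exactly when a_{1:p} ∈ S_p. A minimal numerical check with illustrative values:

```python
import numpy as np

def companion(a):
    """Companion matrix of x_{k+1} = sum_i a_i x_{k+1-i} (+ terms in u)."""
    p = len(a)
    A = np.zeros((p, p))
    A[0, :] = a
    if p > 1:
        A[1:, :-1] = np.eye(p - 1)
    return A

A = companion([0.5, 0.3])
print(max(abs(np.linalg.eigvals(A))) < 1)   # equivalent to a_{1:p} in S_p
print([np.linalg.norm(np.linalg.matrix_power(A, n), 2) for n in (1, 5, 20)])
```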
We can now state the main ergodicity result.
Theorem 8. Let P_z be defined as P^θ_z in Definition 1.
For convenience, we postpone this proof to Section 4.2.
The following lemma provides a general way for constructing the instrumental functions α and φ that appear in (A-8) and in Remark 6-(1). The proof can be easily adapted from [16, Lemma 1] and is thus omitted.
Lemma 9. Suppose that X = C^S for some measurable space (S, 𝒮) and C ⊆ R. Thus, for all x ∈ X, we write x = (x_s)_{s∈S}, where x_s ∈ C for all s ∈ S. Suppose moreover that, for all x = (x_s)_{s∈S} ∈ X, we can express the conditional density g(x; ·) as a mixture of densities of the form j(x_s) h(x_s; ·) over s ∈ S. This means that, for all t ∈ C, y ↦ j(t) h(t; y) is a density with respect to ν and there exists a probability measure µ on (S, 𝒮) such that
    g(x; y) = ∫ j(x_s) h(x_s; y) µ(ds),   y ∈ Y.
We moreover assume that h takes nonnegative values and that one of the two following assumptions, (H'-1) or (H'-2), holds.

3.3. Convergence of the MLE.
Once the ergodicity of the model is established, one can derive the asymptotic behavior of the MLE, provided some regularity and moment conditions hold to go through Step 2 to Step 5, as described in Section 3.1. These steps are carried out using [16, Theorem 1] and [15, Theorem 3], written for general ODMs(1, 1). The adaptation to higher-order ODMs(p, q) will follow easily from the embedding of Section 4.1. We consider the following assumptions, the last of which uses V_X as introduced in Definition 3 under Assumptions (A-1) and (A-2).
(B-3) There exist x^{(i)}_1 ∈ X, u_1 ∈ U, a closed set X_1 ⊆ X, C ≥ 0, h : R_+ → R_+ and a measurable function φ̄ : Y → R_{≥0} such that the following assertions (i)–(vi) hold with z^{(i)} = (x^{(i)}_1, . . . , x^{(i)}_1, u_1, . . . , u_1) ∈ X^p × U^{q−1}.
Remark 8. In the case where the observations are discrete, one usually takes ν to be the counting measure on the at most countable space Y. In this case, g^θ(x; y) ∈ [0, 1] for all θ, x and y, and Condition (B-3)(ii) trivially holds whatever X_1 is.
We have the following result, whose proof is postponed to Section 4.3.

4.1. Embedding into an observation-driven model of order (1, 1).
We further denote the successive composition of Ψ^θ_{y_0}, Ψ^θ_{y_1}, . . . , and Ψ^θ_{y_k} by
    Ψ^θ_{y_{0:k}} = Ψ^θ_{y_k} ∘ Ψ^θ_{y_{k−1}} ∘ · · · ∘ Ψ^θ_{y_0}.   (4.3)
Note in particular that ψ^θ_{y_{0:k}}, defined by (2.8) with the recursion (2.7), can be written as
    ψ^θ_{y_{0:k}} = Π_p ∘ Ψ^θ_{y_{0:k}}.   (4.4)
Conversely, we have, for all k ≥ 0 and y_{0:k} ∈ Y^{k+1},
    Ψ^θ_{y_{0:k}}(z) = (ψ^θ_{y_{0:(k−p+1)}}(z), . . . , ψ^θ_{y_{0:k}}(z), u_{(k−q+2):k}),   (4.5)
where we set u_j = Π_{p+q+j}(z) for −q < j ≤ −1 and u_j = Υ(y_j) for 0 ≤ j ≤ k, and use the convention ψ^θ_{y_{0:(j−1)}}(z) = Π_{p+j}(z) for −p < j ≤ 0. By letting Z_k = (X_{(k−p+1):k}, U_{(k−q+1):(k−1)}) (see (2.4)), Model (2.1) can be rewritten as, for all k ∈ Z_{≥0},
    Y_k | F_k ∼ G^θ(Π_p(Z_k); ·),
    Z_{k+1} = Ψ^θ_{Y_k}(Z_k).   (4.6)
By this representation, the ODM(p, q) is thus embedded in an ODM(1, 1). This in principle allows us to apply the results obtained for the class of ODMs(1, 1) in [16] to the broader class of ODMs(p, q). Not all but some of the conditions written above for an ODM(p, q) indeed easily translate to the embedded ODM(1, 1). Take for instance (A-4). By (3.5), (3.6) and (4.5), we have, for all n ∈ Z_{>0}, using the convention Lip^θ_m = 1 for m ≤ 0, a corresponding bound for the embedded iterates. Hence the same assumption (A-4) will hold for the embedded ODM(1, 1). As an ODM(1, 1), the bivariate process {(Z_k, Y_k) : k ∈ Z_{≥0}} is a Markov chain on the space (Z × Y, 𝒵 ⊗ 𝒴) with transition kernel K^θ satisfying, for all (z, y) ∈ Z × Y, A ∈ 𝒵 and B ∈ 𝒴,
    K^θ((z, y); A × B) = 𝟙_A(Ψ^θ_y(z)) G^θ(Π_p(Ψ^θ_y(z)); B).
Remark 9. Note that (A-1) is equivalent to saying that the transition kernel K^θ of the complete chain admits a unique invariant probability measure π^θ on Z × Y. Moreover, the resulting π^θ_X and π^θ_Y can be obtained by projecting π^θ on any of its X components and its Y component, respectively.
Note also that, by itself, the process {Z_k : k ∈ Z_{≥0}} is a Markov chain on (Z, 𝒵) with transition kernel R^θ defined by setting, for all z ∈ Z and A ∈ 𝒵,
    R^θ(z; A) = ∫ 𝟙_A(Ψ^θ_y(z)) G^θ(Π_p(z); dy).   (4.9)
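The embedding is mechanical to implement; the following sketch illustrates one transition z ↦ Ψ^θ_y(z), with link and upsilon standing for the generic reduced link function and admissible mapping, and the tuple layout of (2.4) (these names are ours, for illustration only).

```python
def embed_step(z, y, p, q, theta, link, upsilon):
    """One transition of the embedded ODM(1, 1): the state z stacks the
    last p hidden values and the last q-1 reduced observations, and
    Psi^theta_y maps (z, y) to the next state."""
    xs, us = list(z[:p]), list(z[p:])            # len(us) == q - 1
    u = upsilon(y)
    x_next = link(theta, xs, us + [u])           # x_{k+1} via (2.7)
    return tuple(xs[1:] + [x_next] + us[1:] + ([u] if q > 1 else []))
```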

4.2. Proof of Theorem 8. The scheme of proof of this theorem is somewhat similar to that of [16, Theorem 2], which is dedicated to the ergodicity of ODM(1, 1) processes. The main difference is that we need to rely on assumptions involving iterates of kernels, such as (E-2), (A-4) and (A-8)(iv) below, to be compared with their counterparts (A-4), (A-7) and (A-8)(iv) in [16, Theorem 2]. Using the embedding of Section 4.1, we will use the following conditions, directly applying to the kernel R of this embedding.
Based on these conditions, we can rely on the two following results. The existence of an invariant probability measure for R is given by the following result (Lemma 11). Obviously, we have π̄R = π̄, which shows that R admits an invariant probability distribution π̄. Now let M > 0. Then, by Jensen's inequality, we have, for all n ∈ Z_{≥0},
    R^n(V ∧ M) ≤ (R^n V) ∧ M ≤ (λ^n V + β/(1 − λ)) ∧ M.
Letting n → ∞, we then obtain π̄(V ∧ M) ≤ (β/(1 − λ)) ∧ M. Finally, by the monotone convergence theorem, letting M → ∞, we get π̄V < ∞.

Proposition 12. Assume (E-1), (E-3) and (E-4). Then the Markov kernel R admits at most one invariant probability measure.
Proof. This is extracted from the uniqueness part of the proof of [13, Theorem 6], see their Section 3. Note that our Condition (E-3) corresponds to their condition (A2) and our Condition (E-4) to their condition (A3) (their α, Q, Q̄ and Q♯ being our ᾱ, R, R̄ and Ř).
Hence it is now clear that the conclusion of Theorem 8 holds if we can apply both Lemma 11 and Proposition 12. This is done according to the following successive steps.
Step 2 Prove (E-2): this is done in Lemma 13 using (A-3), (A-5), (A-6) and the fact that, for all y ∈ Y, ψ_y is continuous on Z, as a consequence of Lip_1 < ∞ in (A-4).
Step 5 Provide an explicit construction for R̄ and Ř satisfying (4.12) and (4.13) in (E-4): this is done in Lemma 15 using (A-8)(i).
Step 6 Finally, we need to establish the additional properties of this Ř required in (E-4), namely (4.14) and (4.15). This will be done in the final part of this section using the additional Lemma 16.
Let us start with Step 2.

Lemma 13. If, for all y ∈ Y, ψ_y is continuous on Z, then (A-5) and (A-6) imply (E-2).

Proof. We first show that R is weak Feller; hence R^m is too, for any m ≥ 1. Let f : Z → R be bounded and continuous and set f̄(z, y) := f(Ψ_y(z)). Further define F̄ : Z × X → R by setting, for all z ∈ Z and x ∈ X,
    F̄(z, x) := ∫ f̄(z, y) g(x; y) ν(dy).
Hence, with these definitions, we have, for all z ∈ Z, Rf(z) = F̄(z, Π_p(z)), and it is now sufficient to show that F̄ is continuous. We write, for all z, z′ ∈ Z and x, x′ ∈ X,
    F̄(z, x) − F̄(z′, x′) = A(z, z′, x′) + B(z, x, x′),
with A(z, z′, x′) := ∫ (f̄(z, y) − f̄(z′, y)) g(x′; y) ν(dy) and B(z, x, x′) := ∫ f̄(z, y) (g(x; y) − g(x′; y)) ν(dy). Since z ↦ f̄(z, y) is continuous for all y, we have A(z, z′, x′) → 0 as (z′, x′) → (z, x) by (3.7) and dominated convergence. We have B(z, x, x′) → 0 as x′ → x, as a consequence of (A-5)(a) or of (A-5)(b). Hence F̄ is continuous and we have proved that R^m is weak Feller for all m ∈ Z_{>0}.
We now proceed with Step 5.
In order to achieve Step 6, we rely on the following result, which is an adaptation of [13, Lemma 9].
Lemma 16. Suppose that (4.21) holds and that W : Z² → R_{≥0} satisfies the assumptions above. Then (4.14) and (4.15) hold.
Proof of (ii). We apply Theorem 10. We have already shown that (A-4), (A-1) and (A-2) hold in the proof of Assertion (i), with V_X(x) = e^{τ|x|} for any τ > 0. Assumptions (B-1) and (B-2) obviously hold for the log-linear Poisson GARCH model (see Remark 7 for the second one). It now only remains to show that, using V_X as above, (B-3) is also satisfied. We set X_1 = X, which trivially satisfies (B-3)(i). Condition (B-3)(ii) is also immediate (see Remark 8). Next, we look for adequate φ̄, h and C ≥ 0 so that (iii), (iv), (v) and (vi) hold in (B-3). We have the required bounds for all θ ∈ Θ and (x, y) ∈ X × Y. We thus set h(u) = e^{x^{(i)}_1} u, C = 1 and φ̄(y) = A + B y for some adequate nonnegative A and B, so that (iii) (using Remark 7 and Υ(y) = ln(1 + y) ≤ y), (iv) and (v) follow. Then we have G^θ φ̄(x) ≤ A + B e^x ≲ V_X, provided that we chose τ ≥ 1. This gives (vi) and thus (B-3) holds true, which concludes the proof.
We start with (A-6), with V_X(x) = x. We can further set V_Y(y) = y since we then have G V_Y(x) = r V_X(x), so that |G V_Y|_{V_X} = r. With these definitions, (3.9) leads to the required drift function. On the other hand, we have that, for all z = (x_{(−p+1):0}, u_{(−q+1):(−1)}) ∈ Z = R^{p+q−1}_{≥0} and all n = 1, 2, . . . ,
    E_z[X_n] = ω + Σ_k a_k E_z[X_{n−k}] + Σ_k b_k E_z[Y_{n−k}],
where, in the right-hand side of this equation, if n − k ≤ 0, E_z[X_{n−k}] should be replaced by x_{n−k} and, if n − k ≤ −1, E_z[Y_{n−k}] should be replaced by u_{n−k}. By definition of G, E_z[Y_{n−k}] = r E_z[X_{n−k}]. Hence, denoting x_n = E_z[X_n], we get that the sequence {x_k : k ∈ Z_{≥0}} satisfies the recursion
    x_n = ω + Σ_{k=1}^{p∨q} c_k x_{n−k},
where here c_k = a_k + r b_k ≥ 0 for k = 1, . . . , p ∧ q, c_k = a_k if q < k ≤ p and c_k = r b_k if p < k ≤ q. Moreover, we can clearly find a constant C only depending on θ such that ‖x_{1:(p∨q)}‖_∞ ≤ C (1 + ‖z‖_∞). Then, applying Lemma 20 and (4.32), we get that there exist C′ > 0 and ρ ∈ (0, 1) such that, for all n ≥ 1,
    x_n ≤ ω/P(1) + C′ ρ^n (1 + ‖z‖_∞).
Condition (A-6) follows. We conclude with the proof of (A-8). Let us apply Lemma 9 with C = (0, ∞) = X and S = {1}, µ being the Dirac mass at point 1, j(x) = (1 + x)^{−r} and h(x; y) = Γ(r + y)/(y! Γ(r)) (x/(1 + x))^y, which satisfies (H'-1). This leads to α and φ satisfying (A-8)(i) and (3.14). Next, for all (x, x′) ∈ X², we have the required bound, and we thus obtain (A-8)(iii) by setting W_X(x, x′) = 1 ∨ r. All the other conditions of (A-8) are trivially satisfied in this case, taking W_U ≡ 1.
Next, we look for adequate φ̄, h and C ≥ 0 so that (iii), (iv), (v) and (vi) hold in (B-3). For all θ ∈ Θ, x, x′ ∈ X_1 and y ∈ Z_{≥0}, we have the required bounds. We set C = 0, h(s) = s and φ̄(y) = A + B y for some adequate nonnegative A and B, so that (iii) (using Remark 7 and Υ(y) = y), (iv) and (v) follow. Then we have G^θ ln^+ φ̄(x) ≤ A + B r x ≲ V_X. This gives (vi) and thus (B-3) holds true, which concludes the proof.
Proof of (iii). The proof of this point is similar to the proof of Theorem 4(iii) in Section 4.4, except that Condition (4.30) is replaced by
    ∫ ln^+(|y|) π^θ_Y(dy) < ∞.   (4.33)
We already checked that G^θ V_Y ≲ V_X with V_Y(y) = y, implying ∫ y π^θ_Y(dy) < ∞. Hence (4.33) holds and the proof is concluded.
4.6. Proof of Theorem 7. The following remark about Assumption (L-4) will be useful.
We now turn to checking (A-6). Using (L-3)(iv) with n = 1 and h = V_L, we get the corresponding drift bound. To complete the proof of (A-6), it only remains to check (3.8). Recall that in this model (see (2.21)), using (L-3), we have the corresponding moment bound. Defining c_i as in Assertion (ii) of Lemma 21, we write this inequality as a linear recursion of the form treated in Lemma 20. Combining (4.36) and (4.37) yields that, for any ε > 0, we can find M large enough such that, with V defined by (3.9), the required bound holds for all n ∈ Z_{>0} and u_n ∈ R_{≥0}. Thus, since ϱ < 1, the bound is uniform for n large enough. It follows that, for any ε > 0, the corresponding lim sup is controlled, where, in the second inequality, we used Lemma 20 with Assertion (ii) of Lemma 21. We thus get (3.8).
We have thus checked all the assumptions of Theorem 8 for the VLODM of Example 3 and this concludes the proof of Assertion (i).

5. Useful Lemmas
The following result is used in the proof of Theorem 10. It is proven in [18, Lemma 19].
Lemma 17. Assumption (A-4) implies that, for all θ ∈ Θ, there exist C > 0 and ρ ∈ (0, 1) such that Lip^θ_n ≤ C ρ^n for all n ∈ Z_{>0}.
A byproduct of Lemma 17 is the following result, which is used in the proof of Lemma 14.
Lemma 18. Suppose that (A-3) and (A-4) hold. Let θ ∈ Θ and u_0 ∈ U, and set u_k = u_0 for all k ∈ Z_{≥0}. Then there exists x_∞ ∈ X such that, for all z ∈ Z, ψ̃^θ_{u_{0:n}}(z) converges to x_∞ as n → ∞ in (X, δ_X).
The two following lemmas are straightforward and their proofs are thus omitted. They are used in the proofs of Theorem 5 and Theorem 7, respectively.
Lemma 20. Let r be a positive integer and (ω, c_{1:r}) ∈ R^{1+r}. Let {x_k : k ∈ Z_{≥0}} be a sequence satisfying
    x_n = ω + Σ_{k=1}^{r} c_k x_{n−k},   n ≥ r.
Suppose that the polynomial P(z) = 1 − Σ_{k=1}^{r} c_k z^k has no roots in {z ∈ C : |z| ≤ 1}. Then there exist ρ < 1 and C > 0 such that, for all n ∈ Z_{≥0},
    |x_n − ω/P(1)| ≤ C ρ^n (1 + max(|x_0|, . . . , |x_{r−1}|)).
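Lemma 20 is straightforward to check numerically; the sketch below tests the root condition on P via numpy (for nonnegative c_k, Σ_k c_k < 1 implies the condition, as used for the NBIN-GARCH mean recursion with c_k = a_k + r b_k).

```python
import numpy as np

def recursion_converges(c):
    """Check the root condition of Lemma 20: P(z) = 1 - sum_k c_k z^k
    must have no root in the closed unit disk."""
    # np.roots takes coefficients from highest degree down to the constant
    coeffs = [-ck for ck in c[::-1]] + [1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(recursion_converges([0.3, 0.2]))   # True: sum(c) < 1 with c_k >= 0
```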