Observation-driven models for discrete-valued time series

Statistical inference for discrete-valued time series has not reached the same level of development as traditional methods for time series generated by continuous random variables. Some relevant models exist, but the lack of a homogeneous framework raises critical issues. For instance, it is not trivial to explore whether models are nested, and it is quite arduous to derive stochastic properties which simultaneously hold across different specifications. In this paper, inference for a general class of first-order observation-driven models for discrete-valued processes is developed. Stochastic properties such as stationarity and ergodicity are derived under easy-to-check conditions, which can be applied directly to all the models encompassed in the class and to every distribution satisfying mild moment conditions. Consistency and asymptotic normality of quasi-maximum likelihood estimators are established, with a focus on the exponential family. Finite sample properties and the use of information criteria for model selection are investigated through Monte Carlo studies. An empirical application to count data is discussed, concerning a test-bed time series on the spread of an infection. MSC2020 subject classifications: Primary 62M20, 62F12; secondary 62M10, 62J12.


Introduction
The analysis of time series that are generated by continuous random variables has a long tradition in statistics and dates back, in the parametric setting, to [42] and [41], who introduced the concept of autoregression, a dynamic model for the conditional mean of a stochastic process. In the same period, [37] defined moving average processes as linear combinations of uncorrelated random variables capable of capturing cyclical fluctuations. It was only in the seventies, with the formalization by [4] of the class of ARMA models, that autoregressive (AR) and moving average (MA) processes found their popularity and became widely fitted to real data. The merit of Box and Jenkins's work was the specification of a unified class of processes, generalizing ARMA models to account for non-stationarity, seasonality and exogenous regressors, as well as the systematic treatment of all the sub-models belonging to the class, which led to the development of well-established inferential procedures.
The development of parametric models for count and binary data has not enjoyed the same popularity, partly because linear processes are related to second order stationarity, which fully characterizes Gaussian time series. For discrete data, the concept of autocovariance needs to be adapted [38] and the Wold representation has no direct interpretation; see the discussion in the recent handbook edited by [11]. Since the AR- and MA-like models first introduced by [43] and [27], there have been some relevant specifications, such as the generalized ARMA (GARMA) by [3] and their martingalized version, the M-GARMA by [44], as well as the generalized linear ARMA (GLARMA) by [9]. An interesting class of autoregression models for count data has been proposed by [20] and [22], inspired by the generalized linear transformation of [30]. Integer-valued time series with extreme observations have been dealt with by [25], based on the beta-Negative Binomial distribution.
As recently acknowledged by [10], the analysis of discrete-valued time series would benefit from the specification of a unified framework able to encompass most of the models available in the literature. As a matter of fact, it is not trivial to explore whether models are nested, and, consequently, to derive stochastic properties that simultaneously hold across models. In addition, model comparison becomes crucial when direct relationships among different models are unknown. The lack of a unified framework and, consequently, of systematic analysis is also in contrast with the growing attention paid, in recent years, to high dimensional data sets involving dynamic binary and count data, in different contexts, such as number of clicks or amount of intra-day stock transactions [12,1]. Attempts in this direction have been made by [14] who provide a theoretical formulation which is useful in principle but less effective when the aim is to implement and adapt models for real applications. Indeed, the quite general framework developed by [14] encompasses several models for which stochastic and inferential properties have been previously derived in the literature, but at the price of conditions that are extremely complicated to verify in practice for each model and distribution.
To summarize the main results developed in the literature, on the side of the stochastic properties, [29] develop notable results about strict stationarity and ergodicity for the specific case of GARMA and Poisson Threshold autoregressive models, using the theory of Markov chains. Conversely, conditions holding for several models but requiring restrictive assumptions are discussed in [31], based on contraction conditions, and in [16], based on the weak dependence approach. [20] and [22] develop results on ergodicity employing a perturbation approach, which is specifically suited to the case of count data following a Poisson distribution. Similar results are discussed in [5] under the assumption of a Negative Binomial distribution as the data generating process. An interesting extension to categorical time series has been recently proposed by [23].
As far as inference is concerned, the properties of the maximum likelihood estimator (MLE) and quasi-MLE (QMLE) have been studied for some subsets of discrete-valued models. [14] prove the consistency of the MLE and QMLE for the general framework they propose. Asymptotic normality, in the same setting, is later discussed by [15]. Comparable results have been derived by [12], based on the approach developed by [31], and by [1] for the specific case of the Poisson distribution. However, the conditions needed to verify the properties of the MLE and QMLE are far from immediate.

This paper introduces a general class of observation-driven models for discrete-valued stochastic processes that encompasses the existing models in the literature and includes novel specifications. In the terminology of [7], observation-driven models are designed for time-varying parameters whose dynamics are functions of the past observations only and are not driven by an idiosyncratic noise term. Essentially, we specify a dynamic model for the conditional mean of a density, or mass function for discrete-valued time series, which does not necessarily belong to the exponential family. This generality allows one to estimate alternative models designed to capture the past effects of the conditional mean itself, of the lagged discrete-valued process and of error-type components.
The stochastic theory and the likelihood inference are developed for first order models in the class (i.e. with one lag of autoregression), through an extension of the theory of [29] as far as stationarity and ergodicity are concerned, and of [14] and [15] for the asymptotic properties of likelihood estimators. In addition to the results that apply to novel models, we derive several new methodological results for existing models that were not yet available in the literature, such as strict stationarity and ergodicity of first order GLARMA models and ergodicity of M-GARMA models for discrete distributions.
In summary, a general modelling framework is introduced which aims (i) to provide a unified specification for a broad class of discrete-valued time series where relevant instances represent special cases, (ii) to provide direct relationships among different models which belong to the framework but are not necessarily nested within each other, (iii) to derive the stochastic properties for first order models which hold simultaneously for the entire class (strict stationarity and ergodicity), (iv) to implement QMLE inference that also allows us to define model selection criteria across different, and not nested, models, (v) to derive the asymptotic properties of QMLE, and (vi) to make all the models encompassed in the framework fully applicable in practice. With the focus on model comparison, models included in the general framework are applied for the analysis of a test-bed time series in count data analysis, on the spread of an infection, namely Escherichia coli, in the German region of North Rhine-Westphalia.

The general framework
Let {Y_t}_{t∈T} be a stationary stochastic process defined on the probability space (Ω, F, P), and let F_{t−1} denote the information available up to time t − 1. We specify a class of observation-driven models where the conditional density or mass function of Y_t, depending on a time-varying parameter μ_t = E(Y_t|F_{t−1}), is a member of the one-parameter exponential family

q(Y_t|F_{t−1}) = exp{Y_t f(X_t) − A(X_t) + d(Y_t)},   (1)

where the dynamics of the density (or mass) function q(Y_t|F_{t−1}) are captured by the parameter μ_t or, equivalently, by X_t. The time-varying parameter μ_t is related to the process X_t by a twice-differentiable, one-to-one monotonic function g(·), called the link function, with X_t = g(μ_t). The functions A(·) (log-partition) and d(·) define the particular distribution [30]. The mapping f(·) is a twice-differentiable bijective function, chosen according to the model of interest. The dynamics of the time-varying parameter are specified as

g(μ_t) = Z_t^T α + Σ_{j=1}^p φ_j h(Y_{t−j}) + Σ_{j=1}^k γ_j g(μ_{t−j}) + Σ_{j=1}^q θ_j ε_{t−j}.   (2)

Note that the process {Y_t}_{t∈T} is observed whereas {μ_t}_{t∈T} is not. However, from equation (2), it can be shown, by backward substitution, that the process {μ_t}_{t∈T} is a deterministic function of the past F_{t−1}, and this is the reason why we refer to "observation-driven models". The function h(Y_t) is called the "data-link function", since it is applied to the process Y_t, whereas ḡ(μ_t) is called the "mean-link function", since it is applied only to the conditional mean, unlike the link function g(·) which, in principle, can be applied to any parameter or moment of the probability distribution. Both h(·) and ḡ(·) are twice-differentiable, one-to-one monotonic functions whose shape depends on the specific model (2) and on the distribution of interest in equation (1).
The vector Z_t = [1, Z_{1t}, …, Z_{st}]^T in equation (2) is a vector of covariates and α is the corresponding coefficient vector of conformable dimension. The parameters φ_j measure an autoregressive-like effect of the observations; the parameters γ_j capture the dependence of the process on its whole past memory (since μ_{t−j} depends on the past observations Y_{t−j−1}, …); finally, the parameters θ_j represent the analogue of a moving average component, since the last term of (2) can be defined as a prediction error,

ε_t = [h(Y_t) − ḡ(μ_t)] / ν_t,   (3)

where the process {ν_t}_{t∈T} is some scaling sequence, typically: (i) ν_t = σ_t (Pearson residuals), (ii) ν_t = σ_t^2 (score-type residuals), (iii) ν_t = 1 (no scaling). Note that whenever the mean-link function is selected as the conditional expectation of the data-link function, ḡ(μ_t) = E[h(Y_t)|F_{t−1}], and ν_t equals the conditional standard deviation of h(Y_t), the residuals in equation (3) form a white noise (WN) sequence with unit variance. Each exponential family in the form (1) can be re-parametrized in the canonical form

q(Y_t|F_{t−1}) = exp{Y_t Q_t − Ā(Q_t) + d(Y_t)},   (4)

where the sequence Q_t = f(X_t) is called the canonical parameter, the function f̃(·) = (f ∘ g)(·) is referred to as the canonical link function, and Ā(·) is a re-parametrization of A(·) with respect to Q_t. It is known that for the exponential family (4) the conditional mean is μ_t = Ā′(Q_t). If g is the canonical link function, then f̃ ≡ g and the following simplification occurs: f(X_t) = X_t, so Q_t = X_t = g(μ_t), which gives again the distribution (1), with f(X_t) = X_t, so that (1) and (4) are exactly the same. Clearly, the moments become μ_t = Ā′(X_t) and σ_t^2 = Ā″(X_t). The function f(·) allows us to introduce non-canonical shapes for g(·), thus adding flexibility to the model. We provide some examples to clarify the nature of the framework.

Example 1
In the setting (1)-(2), the Poisson distribution is obtained with g(μ_t) = log(μ_t), f(X_t) = X_t, A(X_t) = exp(X_t) and d(Y_t) = log(1/Y_t!). All the derivatives of A(X_t) = exp(X_t) equal μ_t. However, this definition is based on the equivalence g ≡ f̃, which is the canonical link; hence equation (2) becomes a log-linear model on the response log(μ_t). It is possible to model (2) with a different shape of g(·); for example, one may be interested in a linear model for the parameter of the Poisson, μ_t; then g(μ_t) = μ_t and clearly g ≠ f̃. In this case, the Poisson distribution is reconstructed from (1) by setting f(X_t) = log(X_t) = log(μ_t), A(X_t) = X_t = μ_t and d(Y_t) = log(1/Y_t!). Again, knowing that the inverse of the canonical link is f̃^{−1}(·) = exp(·), the conditional expectation is E(Y_t|F_{t−1}) = Ā′(Q_t) = exp(Q_t) = μ_t, the first and second derivatives of Ā(Q_t) being μ_t and σ_t^2, respectively.
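The two parametrizations in Example 1 can be verified numerically. The following sketch (ours, not from the paper) checks that the Poisson pmf coincides with the exponential-family form (1), q(y) = exp{y f(x) − A(x) + d(y)}, under both the canonical log link and the non-canonical identity link:

```python
import math

def poisson_pmf(y, mu):
    return mu**y * math.exp(-mu) / math.factorial(y)

def expfam_canonical(y, mu):
    # canonical link: x = g(mu) = log(mu), f(x) = x, A(x) = exp(x), d(y) = -log(y!)
    x = math.log(mu)
    return math.exp(y * x - math.exp(x) - math.lgamma(y + 1))

def expfam_identity(y, mu):
    # non-canonical identity link: x = g(mu) = mu, f(x) = log(x), A(x) = x
    x = mu
    return math.exp(y * math.log(x) - x - math.lgamma(y + 1))

mu = 3.7  # illustrative value
checks = [abs(poisson_pmf(y, mu) - expfam_canonical(y, mu)) < 1e-12 and
          abs(poisson_pmf(y, mu) - expfam_identity(y, mu)) < 1e-12
          for y in range(10)]
```

Both representations reproduce the same probabilities, which is the content of the remark that (1) and (4) coincide under the canonical link.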

Related models
One of the most frequently used specifications in the area of discrete-valued time series is the Generalized Autoregressive Moving Average model, GARMA [3]. Here, the distribution of the process is usually assumed to be the one-parameter exponential family (1). From equation (2), the GARMA model is obtained when k = 0, by setting g ≡ ḡ ≡ h and ν_t = 1, so that

g(μ_t) = Z_t^T α + Σ_{j=1}^p φ_j g(Y_{t−j}) + Σ_{j=1}^q θ_j [g(Y_{t−j}) − g(μ_{t−j})],   (5)

where α = (1 − Σ_{j=1}^p φ_j B^j) β, β is a vector of constants and B is the lag operator. By rearranging the constant in terms of β we obtain equation (3) of [3].
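As an illustration, a GARMA(1,1) recursion with Poisson responses and log link can be simulated as in the following minimal sketch (ours; parameter values are illustrative). As in [3], Y* = max(Y, c) with a small threshold c keeps the log transform finite at zero counts:

```python
import math, random

random.seed(1)

def rpois(mu):
    # inverse-cdf Poisson sampler, adequate for moderate mu
    u, p, k = random.random(), math.exp(-mu), 0
    cdf = p
    while u > cdf:
        k += 1
        p *= mu / k
        cdf += p
    return k

alpha, phi, theta, c = 0.3, 0.4, 0.2, 0.5   # illustrative values
x = alpha                                   # x_t = g(mu_t) = log(mu_t)
ys = []
for _ in range(500):
    mu = math.exp(x)
    y = rpois(mu)
    ys.append(y)
    gy = math.log(max(y, c))                # g(Y*_t), Y* = max(Y, c)
    # GARMA(1,1): g(mu_t) = alpha + phi*g(Y*_{t-1}) + theta*[g(Y*_{t-1}) - g(mu_{t-1})]
    x = alpha + phi * gy + theta * (gy - x)
```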
A suitable extension of the GARMA model (5), the martingalized GARMA (M-GARMA), has recently been introduced by [44]; it is derived from (2) by setting k = 0, h ≡ g, ν_t = 1 and by choosing the mean-link function as the conditional expectation of the data-link function, ḡ(μ_t) = E[g(Y_t)|F_{t−1}], so that

g(μ_t) = Z_t^T α + Σ_{j=1}^p φ_j g(Y_{t−j}) + Σ_{j=1}^q θ_j [g(Y_{t−j}) − ḡ(μ_{t−j})].   (6)

The relevant feature of the model is that it allows the residuals ε_t to be a martingale difference sequence, i.e. E(ε_t|F_{t−1}) = 0.
Another similar model has been developed by [36], [34] and [9] under the name of GLARMA model; here again the distribution is the exponential family (1). We can write the GLARMA model from (2) by setting p = 0 and h and ḡ as the identity, so that

g(μ_t) = Z_t^T α + Σ_{j=1}^k γ_j g(μ_{t−j}) + Σ_{j=1}^{q̄} θ_j ε_{t−j},   (7)

where α = (1 − Σ_{j=1}^k γ_j B^j) β. Here q̄ = max(k, q) and θ_j = γ_j + τ_j for j = 1, …, q̄, where τ_j are free parameters. The formulation of the constant term in equation (7) as a function of β is equivalent to equation (13) in [17], the alternative definition of the GLARMA model originally introduced in [9]. Note that here, if ν_t = σ_t, then the prediction error ε_t = (Y_t − μ_t)/ν_t is a white noise process with unit variance.
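A GLARMA(1,1) recursion is easily simulated; the sketch below (ours) uses a Bernoulli response with logit link and Pearson scaling ν_t = σ_t as one illustrative instance of the exponential family (1), with made-up parameter values:

```python
import math, random

random.seed(2)
alpha, gamma, theta = 0.1, 0.6, 0.3   # illustrative values
x, eps = alpha, 0.0                   # x_t = g(mu_t) = logit(mu_t)
ys, mus = [], []
for _ in range(500):
    mu = 1.0 / (1.0 + math.exp(-x))   # inverse logit
    y = 1 if random.random() < mu else 0
    ys.append(y); mus.append(mu)
    # Pearson residual: eps_t = (Y_t - mu_t)/sigma_t, sigma_t^2 = mu_t*(1 - mu_t)
    eps = (y - mu) / math.sqrt(mu * (1.0 - mu))
    # GLARMA(1,1): g(mu_t) = alpha + gamma*g(mu_{t-1}) + theta*eps_{t-1}
    x = alpha + gamma * x + theta * eps
```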
Another promising stream of literature is due to [20], who introduced Poisson autoregression, henceforth Pois AR, which is obtained when (1) is Pois(μ_t), with f(X_t) = log(X_t), and, in equation (2), q = 0 and g ≡ h as the identity:

μ_t = Z_t^T α + Σ_{j=1}^p φ_j Y_{t−j} + Σ_{j=1}^k γ_j μ_{t−j}.   (8)

The parameters in equation (8) are constrained to the positive real line. A variant of (8) is the log-linear Poisson autoregression, henceforth Pois log-AR [22], which is obtained from (2) with g(μ_t) = log(μ_t) and h(Y_t) = log(Y_t + 1):

log(μ_t) = Z_t^T α + Σ_{j=1}^p φ_j log(Y_{t−j} + 1) + Σ_{j=1}^k γ_j log(μ_{t−j}).   (9)

For Poisson data, the GARMA model (5) with identity or log links corresponds to a constrained Poisson autoregression where γ_j = −θ_j and φ_j is replaced by φ_j + θ_j, in equations (8) or (9). A model like (9) could also be used for Negative Binomial data, by rewriting the distribution in terms of the expected value parameter μ_t [5]:

P(Y_t = y | F_{t−1}) = [Γ(ν + y) / (Γ(ν) y!)] (ν/(ν + μ_t))^ν (μ_t/(ν + μ_t))^y,   (10)

where ν is the dispersion parameter (if integer, it is also known as the number of failures) and the usual probability parameter would be p_t = ν/(ν + μ_t). The distribution (10) with model (9) is obtained from the distribution (1), by setting the non-canonical link g(μ_t) = log(μ_t) and suitable choices of f(·), A(·) and d(·). The Binary Autoregressive Moving Average (BARMA) model [27, 38], introduced for Binomial data, is obtained when (1) is Bin(a, μ_t), where a is known and the probability parameter is p_t = μ_t/a, and, in (2), γ = 0, h is the identity (so that ḡ(μ_t) reduces to μ_t) and c = 0. Then

g(μ_t) = Z_t^T α + Σ_{j=1}^p φ_j Y_{t−j} + Σ_{j=1}^q θ_j (Y_{t−j} − μ_{t−j}).   (11)

Even though this model is designed for use with Binomial distributions, so that typically g is the logit or the probit, in general the link function g can be any suitable function. Although the class of models (1)-(14) is quite general, all the link functions involved are required to be continuous and bijective. Hence, the framework presented in the current contribution does not include, for example, the Poisson threshold model, as defined in [29, eq. 9] and [14, Sec. 1.2.1].
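The correspondence noted above between the log-link GARMA and a constrained log-linear autoregression (γ_j = −θ_j, φ_j replaced by φ_j + θ_j) can be checked numerically. The sketch below (ours) feeds both first-order recursions the same arbitrary data-link values h_t, so no sampling is involved:

```python
alpha, phi, theta = 0.2, 0.4, 0.3          # illustrative values
h = [0.5, 1.2, 0.0, 0.8, 2.1, 0.3, 1.7]    # arbitrary common h(Y*_{t-1}) values

x_garma, x_ar = 0.0, 0.0                   # common starting value x_0
same = True
for ht in h:
    # GARMA(1,1): x_t = alpha + phi*h_t + theta*(h_t - x_{t-1})
    x_garma = alpha + phi * ht + theta * (ht - x_garma)
    # constrained autoregression: phi' = phi + theta, gamma' = -theta
    x_ar = alpha + (phi + theta) * ht + (-theta) * x_ar
    same = same and abs(x_garma - x_ar) < 1e-12
```

The two trajectories coincide term by term, confirming that the MA component of the GARMA model can be absorbed into a feedback term with the stated constraint.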

New model specifications
Other models of potential interest, which are not explicitly specified in the existing literature, are instead encompassed in the framework (1)-(2). We discuss a class of glink-ARMA models. As a relevant instance, consider the log-ARMA model

log(μ_t) = Z_t^T α + Σ_{j=1}^p φ_j log(Y_{t−j} + 1) + Σ_{j=1}^k γ_j log(μ_{t−j}) + Σ_{j=1}^q θ_j ε_{t−j}.   (12)

The model (12) detects the autoregressive effect of the past lags of Y_t and also accounts for a long past feedback effect, via lags of μ_t; then a white noise prediction error, ε_t = [log(Y_t + 1) − ḡ(μ_t)]/ν_t, with E(ε_t) = 0 and V(ε_t) = 1, is added to the functional transformation of the data. The same model (12), when (1) is Bin(a, μ_t), is recovered by setting the non-canonical link X_t = g(μ_t) = log(μ_t) and Q_t = log(p_t/(1 − p_t)). Along the same lines, a logit-ARMA model (13) can be specified for Binomial data as a combination of the BARMA model from [27] and an autoregressive component, obtained by replacing the log link in (12) with the logit link, where, in equation (1), the distribution is Bin(a, μ_t). A similar model can also be specified by replacing the logit function with the probit link function.
The usefulness of the specifications (12)-(13) can mainly be exploited when a closed form expression is available for the conditional expectation ḡ(μ_t) (and possibly for the standard deviation ν_t), as is the case for some continuous distributions; for a comprehensive discussion of the closed form solutions see [44]. In the case of Binomial or Poisson data, though, such closed forms are not available and it seems reasonable to use an approximation based on the Taylor expansion around the mean μ_t, such as ḡ(μ_t) = E[log(Y_t + 1)|F_{t−1}] ≈ log(μ_t + 1). This would reduce models (12)-(13) to an interesting reparametrized version of the log-AR model described in equation (9).
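The quality of the Taylor-type approximation mentioned above can be checked by simulation. In the sketch below (ours), for Poisson(μ) data the first-order approximation of E[log(Y + 1)] is log(μ + 1), and the second-order expansion adds the correction −σ²/(2(μ + 1)²), with σ² = μ for the Poisson:

```python
import math, random

random.seed(3)

def rpois(mu):
    # inverse-cdf Poisson sampler, adequate for moderate mu
    u, p, k = random.random(), math.exp(-mu), 0
    cdf = p
    while u > cdf:
        k += 1
        p *= mu / k
        cdf += p
    return k

mu = 5.0  # illustrative value
mc = sum(math.log(rpois(mu) + 1) for _ in range(20000)) / 20000
first_order = math.log(mu + 1)
second_order = first_order - mu / (2 * (mu + 1) ** 2)
err1, err2 = abs(mc - first_order), abs(mc - second_order)
```

The second-order expansion is markedly closer to the Monte Carlo value than the plain log(μ + 1) plug-in, illustrating why the approximation is reasonable for moderate counts.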
Despite the wide use of the Poisson model for count data and the default Negative Binomial alternative to account for over-dispersion, both choices fail when the data present under-dispersion or an excess of zero observations. For instance, the use of generalized Poisson distributions is quite popular, as is the use of alternative link functions when the canonical ones are not appropriate on scientific grounds; see [35] for a discussion. Choosing the link function in non-linear models is a relevant issue in the case of over-dispersed and under-dispersed count data, as explicitly debated in [30]. Generalized approaches to accommodate specific data structures may benefit from a flexible specification of glink-ARMA type models.

Stochastic properties
In the following, we shall focus our attention on the baseline model in the class, obtained by setting k = p = q = 1 in equation (2), with no covariates (Z_t^T α = α) and unitary scaling sequence, ν_t = 1 for t ∈ T:

g(μ_t) = α + φ h(Y*_{t−1}) + γ g(μ_{t−1}) + θ [h(Y*_{t−1}) − ḡ(μ_{t−1})],   (14)

where the function Y*_t modifies the values of Y_t to lie within the domain of h(·). Establishing stochastic properties and inferential results for model (2) is typically challenging beyond order one; see, for example, [14, Sec. 4]. Indeed, in the vast majority of the contributions on observation-driven models for integer-valued data the theoretical results are derived for first order models. A remarkable exception is the simpler linear INGARCH model [18]. For the same model, Poisson quasi-maximum likelihood inference is discussed in [1]. In [29, pg. 813], the authors outline a way to extend the ergodicity results for the GARMA(1,1) model to the general lag order (p, q), but in fact such an extension is possible only after perturbing the original model with a stochastic noise, which is equivalent to making a correction for continuity on the original integer-valued process {Y_t}. For these reasons, the results of the present contribution are derived for the framework (14).
In Remark 1 we discuss an extension which includes non-unitary scaling sequences. In addition, although the inclusion of time-varying covariates Z_t in model (14) may render it, in general, non-stationary [29, pg. 810], the addition of a non-time-varying random vector Z to (14) leaves all the results of the present paper unaffected. Note that in the first order observation-driven model (14) the series μ_t can be determined recursively by knowing only the starting point μ_0 and the observations Y_0, …, Y_{t−1}. This major simplification is lost for models with lag order greater than one.

Stationarity and ergodicity
The proof of the stability conditions is established for the baseline model (14) by showing the ergodicity of a first order Markov chain process. In the present section the random process {Y_t} is not required to be distributed according to the exponential family (1), but only to satisfy relatively mild moment conditions, i.e. assumptions (A1)-(A2) below. Define μ_0 = μ, g(μ) = x and ḡ(μ) = ḡ(g^{−1}(x)) = g̃(x), where g̃(·) ≡ ḡ ∘ g^{−1}(·). In order to deal with the different possible domains of the process {μ_t}, we consider three separate cases, according to whether μ_t takes values on the whole real line, on the positive half-line, or on a bounded interval. Define P^t(x, A), for A ∈ F_X, to be the t-step transition probability with initial state X_0 = x. Consider the following assumptions: (A1) μ_t = E(Y_t|F_{t−1}); (A2) a mild moment condition on the conditional distribution of Y_t; (A3) g and h are bijective and increasing, with additional shape requirements depending on the domain of μ_t; (A4) a condition on the transition kernel of the chain. Theorem 1. Let {Y_t}_{t∈T} satisfy (A1)-(A2), with the process {μ_t}_{t∈T} specified as in (14), and assume that (A3)-(A4) hold. Then {μ_t}_{t∈T} has a unique stationary distribution. This implies that {Y_t}_{t∈T} is strict-sense stationary and ergodic.
The proof and some preliminary lemmata are postponed to the Appendix. It can immediately be seen that in the special case where Y_t is distributed according to (1), Assumption (A1) automatically holds, because μ_t = E(Y_t|F_{t−1}). For model (14), the σ-algebra generated by μ_t is a subset of F_{t−1} and, by the properties of conditional expectations, Assumption (A2) is a relatively mild moment condition, generally satisfied for the usual discrete distributions encompassed in (1), such as, e.g., the Poisson and Binomial distributions; see also [29]. Assumption (A3) involves several different conditions depending on three factors: (i) the specific model employed; (ii) the selected distribution; and (iii) the chosen link functions. To illustrate the implications of (A3) for existing models, we consider two representative examples. When the GARMA model (5) is applied (with p = q = 1 and no covariates) to Binomial data, we have ḡ(μ) = g(μ), so that we fall in case (A3).1. The domain of the observations involves (A3).1.(c), since y ∈ (0, a) and μ ∈ (0, a), such that g(·) : (0, a) → R, where g(y) = h(y). The monotonicity conditions on the shape of the link functions g and h in (A3) are quite standard. For instance, the logit link function g(μ) = log(μ/(a − μ)) is bijective and increasing. As γ = 0 in model (5), for the Binomial GARMA model the stationarity condition |θ| < 1 is obtained from (A3).
Many other results in the same spirit as Examples 3-4 can be obtained from Theorem 1, by selecting different combinations of model, distribution and link function, i.e. aspects (i)-(iii) previously discussed. Section 3.2 analyses the impact of the assumptions of Theorem 1 on the models introduced in Section 2.
As far as Assumption (A4) is concerned, though it might not be immediate to verify, it can usually be replaced, for the integer-valued distributions encompassed in (1), with an alternative condition which is easier to check: (A5) the distribution in (1) is Poisson, Negative Binomial or Binomial (with known number of trials/failures), and g^{−1}(·) is Lipschitz.
The equivalence of (A4) and (A5) has been proved in [29] for the Poisson and Binomial distributions; the proof for the Negative Binomial is reported in the Appendix. The required Lipschitz continuity of g^{−1}(·) is easily met for the usual link functions (e.g. logit, identity). However, there are exceptions, like the log link function. The modified log link function [29, eq. 12], employed in Example 4, provides a viable alternative to avoid the problem.

Remark 1
Consider equation (14) with a non-unitary scaling sequence, ν_t = σ(μ_t), i.e.

g(μ_t) = α + φ h(Y*_{t−1}) + γ g(μ_{t−1}) + θ ε_{t−1},   (15)

where ε_t, as in equation (3), is a white noise with unit variance. Under the conditions of the following corollary, the scaling sequence does not affect the stationarity conditions.
Corollary 1. Theorem 1 still holds true when (14) is replaced by (15) if the function σ(·) is: 1. increasing for μ_t ∈ R^+ and decreasing for μ_t ∈ R^−; 2. increasing for μ_t ∈ R^+; 3. monotone with respect to μ_t; depending on the domain of μ_t. Corollary 1 follows from Theorem 1 in a way that is not straightforward to prove; an outline of the proof is therefore reported in the Appendix. Subsequent corollaries are given without proof. The conditions on ν_t are, in general, widely satisfied. For example, if Y_t belongs to the exponential family in (4), then σ²(μ) = A″(X_t) = (g^{−1})′(g(μ)), where g is increasing by assumption; σ²(μ) is increasing as long as (g^{−1})′ is increasing, which holds as long as g is concave (g^{−1} is convex), the case for μ > 0. By contrast, σ²(μ) is decreasing if (g^{−1})′ is decreasing, which happens when g is convex: this is the case for μ < 0, which is what was required.

Remark 2
It is worth noting that Assumption (A3) guarantees the existence of the r-th moment of all link functions (h, g, ḡ) provided that E|Y_t|^r < ∞, for some integer r. Indeed, if Y_t ∈ R, then μ_t ∈ R (Case 1) and, since h is monotone increasing, concave in R^+ and convex in R^−, it is not hard to show that |h(y)| ≤ a_0 + a_1|y| for all y ∈ R and for some non-negative constants a_0, a_1. Analogous inequalities hold for generic integers r > 1, by the binomial theorem. Further details can be found in the proof of Lemma 2, reported in the Appendix. The same arguments apply to g, since |g(μ)| ≤ a_0 + a_1|μ| for all μ ∈ R; in the remaining cases the random variables involved are bounded.

Stochastic properties for relevant encompassed models
The results obtained in the previous section can be applied to specific models belonging to the unified framework (14) or (15), and in particular to the first order version of the novel models introduced in Section 2.2. We also specifically derive the stochastic properties of the related models discussed in Section 2.1, since for some of them the stochastic properties have not been fully addressed in the literature. Consider the one lag model (14).
As a proof of the coherence of our findings, it is worth noting that, when γ = 0 and g ≡ h ≡ ḡ, Theorem 1 reduces to Theorem 5 in [29], providing results for the first order GARMA model. We now derive the stochastic properties for the first order BARMA model (11), namely

g(μ_t) = α + φ Y_{t−1} + θ (Y_{t−1} − μ_{t−1}).   (17)

Corollary 2. Suppose that {Y_t}_{t∈T} is distributed as Bin(n, μ_t) with fixed number of trials n, the link function g : (0, a) → R is bijective and increasing, g^{−1} is Lipschitz with constant not greater than 1 and |θ| < 1. Then the process {μ_t}_{t∈T} defined in (17) has a unique stationary distribution. Hence, the process {Y_t}_{t∈T} is strictly stationary and ergodic.
Note that for the Binomial distribution (A1)-(A2) hold. Here, the conditions (A3) and (A5) on g and g^{−1} are clearly satisfied for the usual link functions, like the logit.
To the best of our knowledge, no results are available for strict stationarity in the GLARMA model, apart from the simplest case k = 0, q = 1 [9, 17]. Define the GLARMA(1,1) model as

g(μ_t) = α + γ g(μ_{t−1}) + θ ε_{t−1},  with ε_t = (Y_t − μ_t)/ν_t.   (18)

Corollary 3. The process {μ_t}_{t∈T} defined in (18) has a unique stationary distribution, and {Y_t}_{t∈T} is strictly stationary and ergodic, if (A2) holds and: 1. g is bijective and increasing; 2. g^{−1} is Lipschitz and |γ| + |θ| < 1. In the GLARMA model the conditional distribution of {Y_t}_{t∈T} belongs to the exponential family (1), hence (A1) holds. Instead, (A3) and (A5) reduce to conditions 1 and 2, which are widely satisfied for the usual link functions. In practical applications, the condition on the coefficients of the model is required to establish its stationarity. The proof of stationarity for the one lag M-GARMA model from (6) given in [44] only holds for continuous variables. We generalize the result by deriving the conditions for stationarity also in the case of discrete variables; they are shown to be equivalent to those available for the GARMA model. This is reasonable, since the former is a special case of the latter. We now move to strict stationarity and ergodicity results for some of the novel models presented in Section 2.2.
We discuss the result for model (8), with no covariates and p = k = 1. Note that the components of the model are all non-negative, ḡ is not present, and θ = 0. Indeed, conditions (A3)1.(b) and (A3)2.(b) in Theorem 1 coincide, providing the following conclusion.

Corollary 4. Suppose that {Y_t}_{t∈T} is distributed as Pois(μ_t), with α > 0, φ, γ ≥ 0 and φ + γ < 1. Then the process {μ_t}_{t∈T} defined in (8) has a unique stationary distribution. Hence, the process {Y_t}_{t∈T} is strictly stationary and ergodic.
These results are not new in the literature [20, 16]. An analogous conclusion holds for the log-ARMA model.

Corollary 5. Suppose that {Y_t}_{t∈T} satisfies (A2) and (A4), g̃(x) is Lipschitz with constant L ≤ 1 and |γ| + |θ| < 1. Then the process {μ_t}_{t∈T} defined in (12), with k = p = q = 1, has a unique stationary distribution. Hence, the process {Y_t}_{t∈T} is strictly stationary and ergodic.
Assumption (A1) is met for the distribution (1). The condition (A3) on the shape of the link function holds here, as g(μ) = log(μ). However, the Lipschitz continuity of g̃(·) and the condition (A4) are required, since g^{−1}(·) does not satisfy (A5). Note that the stationarity and ergodicity of the model (9), with no covariates and p = k = 1, are established as a special case of Corollary 5, with q = θ = 0.

Corollary 6
Suppose that {Y_t}_{t∈T} is distributed as Bin(n, μ_t) with fixed number of trials n, g̃(x) is Lipschitz with constant L ≤ 1 and |γ| + |θ| < 1. Then the process {μ_t}_{t∈T} defined in (13), with k = p = q = 1, has a unique stationary distribution. Hence, the process {Y_t}_{t∈T} is strictly stationary and ergodic.
For Binomial distributions (A1)-(A2) hold, and the conditions (A3) and (A5) are satisfied for the logit link function. Due to space constraints, we do not show other examples. However, based on the theoretical results developed for this flexible framework, stationarity and ergodicity can be directly established for a wide class of models under several discrete distributions.
For better readability, Table 1 summarizes the stationarity and ergodicity conditions found in this section for the parameters of already existing models, so that the results obtained from Theorem 1 can be compared with the ones previously available in the literature. The last column reports the papers where the previous conditions were established. For the log-AR model, the conditions determined by [14] are different and less restrictive than ours, because the authors employed a different approach; however, when the coefficients are positive the two conditions are equivalent. All the other results found in the current contribution are new or coincide with the ones already existing in the literature.

Quasi-maximum likelihood inference
The aim of this section is to establish the asymptotic theory of the quasi-maximum likelihood estimator of the parameter ρ = (α, γ, φ, θ). More precisely, we develop asymptotic results in the following three cases: (i) misspecified MLE, where misspecification occurs in the distribution (1) and/or in the model (2); (ii) QMLE, where misspecification occurs in the distribution (1); (iii) correctly specified MLE. Strong consistency is derived in all three cases; asymptotic normality is derived for the QMLE and the MLE. Finite sample properties are explored through an extensive simulation study, as well as the performance of information criteria for model selection. Tables with detailed numerical results are postponed to the Appendix.

Asymptotic properties
Assume that the variables in the process {Y_n}_{n∈Z} are integer-valued. Let (Λ, d) be a compact metric set of parameters, with a suitable metric d(·), and consider ᾱ, δ̄ ∈ R^+. Then Λ = {ρ = (α, γ, φ, θ) ∈ R^4 : |α| ≤ ᾱ, |δ| = |φ + θ| ≤ δ̄} is the parameter set. We make explicit the dependence of the conditional distribution (1) on the mean process by using the notation q(y_t|F_{t−1}) = q(X_t; y_t). Let g^ρ_{Y_{−∞:t}} be a stationary ergodic random process, not necessarily equal to the process X_t = g(μ_t) in (14); its sample counterpart is denoted by g^ρ_{y_{1:t−1}}(x), where x is the starting value of the chain g^ρ_·. The notation g^ρ_{y_{s:t}}(x) = g^ρ_{y_t} ∘ g^ρ_{y_{t−1}} ∘ ··· ∘ g^ρ_{y_s}(x), s ≤ t, is the so-called Iterated Random Function (IRF); see [13]. It is worth noting that, in the special case of a correctly specified model, X_0 = g^ρ_{Y_{−∞:0}} and equation (19) reduces exactly to the process in equation (14). Let us define the log-likelihood function

L_n(ρ) = Σ_{t=1}^n log q(g^ρ_{y_{1:t−1}}(x); y_t),   (20)

whose associated maximum likelihood estimator is

ρ̂_n = argmax_{ρ∈Λ} L_n(ρ).   (21)

Under suitable regularity assumptions, the (quasi) likelihood function converges to a limit criterion, where the almost sure limit is taken under the stationary distribution of {Y_t}_{t∈T}. The proof is in the Appendix. Next, the special case of correctly specified MLE is treated. Let us denote by Λ_0 the interior of the set Λ.
We need to show that $P = \{\rho_*\}$. The proof is postponed to the Appendix. The asymptotic consistency of the QMLE is now established.
where $x_*$ is the maximizer of the function $\int P(x_*, dy)\,\log q(x; y)$.
In practice, $\mu_* = A'(x_*)$ states that the mean function has to be correctly specified regardless of the true data-generating process. The proof is analogous to that of Theorem 3 and follows directly. It is next shown that the estimator in (22) is asymptotically normally distributed for the model (14).

Theorem 4
Assume that Corollary 7 and (H2) hold. Moreover, assume that $J(\rho_*)$ is non-singular. Then the QMLE in (22) is asymptotically normal. The proof relies on the argument of [15, Thm. 4.2] and follows the approach and notation used in the proof of Theorem 2; it is therefore postponed to the Appendix. Note that, for the correctly specified MLE, equation (21) is the exact log-likelihood and $J(\rho_*) = I(\rho_*)$ in Theorem 4, providing standard ML inference.
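In the usual sandwich form of quasi-likelihood theory, with $J$ the expected negative Hessian and $I$ the variance of the score, the limit law reads as below; this display is a standard reconstruction and should be checked against the paper's exact statement.

```latex
\sqrt{n}\,\big(\hat\rho_n - \rho_*\big) \xrightarrow{\;d\;}
N\!\Big(0,\; J(\rho_*)^{-1}\, I(\rho_*)\, J(\rho_*)^{-1}\Big),
\qquad
J(\rho_*) = -\,\mathbb{E}\big[\partial^2_{\rho\rho^{\top}} \log q_{\rho}(X_t; Y_t)\big],
\quad
I(\rho_*) = \mathbb{E}\big[\partial_{\rho} \log q_{\rho}\;\partial_{\rho^{\top}} \log q_{\rho}\big].
```

For a correctly specified model, the information matrix equality gives $J(\rho_*) = I(\rho_*)$ and the asymptotic variance collapses to $I(\rho_*)^{-1}$.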

Finite sample properties
Finite sample properties of the MLE and QMLE are explored through a simulation study which considers some of the models illustrated in Section 2.1. Tables with the details of the numerical results are reported in Section B of the Appendix. All the results are based on S = 1,000 replications, with different configurations of the parameters and increasing sample sizes n ∈ {200, 500, 1,000}. A correctly specified MLE has been estimated on data coming from Bernoulli or Poisson distributions, across several models. For the QMLE, data are generated from a Geometric distribution, with a Poisson distribution fitted instead, for the GARMA and log-AR models. For Poisson and Geometric data, the log-link $g(\mu_t) = \log(\mu_t)$ is employed; for the Bernoulli data, the logit link $g(\mu_t) = \log\{\mu_t/(1 - \mu_t)\}$ is specified. For all the models involved, the mean of the estimators approaches the true value, for both the well-specified MLE and the QMLE. Some convergence problems arise for the BARMA model, but the standard error and the bias still shrink as n increases; this gives evidence of convergence, although at a slower rate. Turning to asymptotic normality, evidence of normality emerges from the Kolmogorov-Smirnov (KS) test, even when the sample size is small. The outcomes are in line with those of [15]. These results are coherent with the theory presented so far.
In summary, Table A-1 in the Appendix reports the estimation results for the GLARMA model when the data come from a Bernoulli distribution. The estimates tend to be closer to the true parameter values as the sample size increases, which confirms the consistency of the estimators; consequently, the bias is also reduced. Moreover, the estimates are significant at the usual levels and the true parameter values fall within the confidence intervals. The KS tests do not reject the normality of the estimators, even with a small sample size. The same comments hold true for all the combinations of parameters employed. Similar results are obtained in Tables A-2 and A-3, which show the outcome of simulations for the GARMA and log-AR models, respectively, performed on data generated from the Geometric distribution in (10), but with a Poisson distribution fitted instead (QMLE). The GARMA model seems to be more accurate in approximating the true values, but some problems with the KS test are found when a non-stationary region of the parameter space, ρ = (0.5, 0.4, 1.2), is investigated. By contrast, the log-AR model could not be estimated in non-stationary regions of the parameter space.
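The Monte Carlo design above can be mimicked in a few lines. The sketch below simulates a Poisson log-AR-type process and recovers one parameter by a grid-search quasi-likelihood; the recursion, link, and grid are illustrative assumptions, not the paper's exact simulation setup.

```python
import math
import random

def simulate_logar(n, alpha, gamma, phi, seed=0):
    """Simulate an illustrative first-order Poisson log-AR path:
    x_t = alpha + gamma*x_{t-1} + phi*log(1 + y_{t-1}),
    Y_t | F_{t-1} ~ Poisson(exp(x_t))."""
    rng = random.Random(seed)
    x, ys = 0.0, []
    for _ in range(n):
        mu = math.exp(x)
        # Poisson draw by CDF inversion (adequate for the moderate means here)
        u, k, p = rng.random(), 0, math.exp(-mu)
        cdf = p
        while u > cdf and k < 1000:
            k += 1
            p *= mu / k
            cdf += p
        ys.append(k)
        x = alpha + gamma * x + phi * math.log(1 + k)
    return ys

def pois_loglik(y, alpha, gamma, phi):
    """Poisson quasi-log-likelihood under the same recursion."""
    x, ll = 0.0, 0.0
    for yt in y:
        ll += yt * x - math.exp(x) - math.lgamma(yt + 1)
        x = alpha + gamma * x + phi * math.log(1 + yt)
    return ll

def profile_qmle_phi(y, alpha, gamma, grid):
    """Grid-search QMLE of phi with (alpha, gamma) held fixed."""
    return max(grid, key=lambda p: pois_loglik(y, alpha, gamma, p))
```

Running `profile_qmle_phi` on paths of increasing length shows the estimate concentrating around the true φ, the same consistency pattern reported in Tables A-1 to A-3.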

Model selection
A crucial aspect in empirical applications is model selection. In likelihood inference, model selection is typically carried out via information criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). To assess the effectiveness of the AIC and BIC in selecting the most appropriate model for the data at hand, we carry out an extensive simulation study with the competing first-order models log-AR, GARMA, and GLARMA for Poisson data. The last two are also estimated, together with the BARMA model, for Binomial data. In short, when the sample size n is small, the selection can perform poorly for some models, but when n is sufficiently large (n ≥ 500), the criteria select the true data-generating model with high probability for all the models.
We simulate the first-order log-AR, GARMA, and GLARMA models for $Y_t|\mathcal{F}_{t-1}$ distributed according to $Pois(\mu_t)$, with (α, φ, θ, γ) = (0.2, 0.4, 0.2, 0.3), number of repetitions S = 1,000 and sample sizes n ∈ {250, 500, 1,000}. The same is done by generating data from the first-order BARMA, GARMA and GLARMA models, with $Bin(5, p_t)$, $p_t = \mu_t/a$ and $g(\mu_t) = \log\{\mu_t/(a - \mu_t)\}$. For the GARMA model, $g(y^*_t) = \log\{y^*_t/(5 - y^*_t)\}$, with $y^*_t = \min(\max(y_t, c), 5 - c)$ and c = 0.1, whereas, in the GLARMA model, $s_t = \sqrt{5 p_t (1 - p_t)}$. For each distribution, we generate S times a vector of data of length n from one model; the generated data are then employed in the estimation of all three models.
The Akaike and Bayesian information criteria are computed for each model. Finally, the frequency of correct selection over the S repetitions is computed as the percentage of times the information criterion selected the model actually employed to generate the data. The same procedure is replicated for all the models. The results for the AIC are summarized in Table 2 (results for the BIC are identical).
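The selection step above reduces to comparing penalized likelihoods. A minimal sketch, with hypothetical maximized log-likelihood values and parameter counts (the numbers below are invented for illustration only):

```python
import math

def aic(loglik, n_par):
    """Akaike information criterion: -2*loglik + 2*(number of parameters)."""
    return -2.0 * loglik + 2.0 * n_par

def bic(loglik, n_par, n_obs):
    """Bayesian information criterion: -2*loglik + log(n)*(number of parameters)."""
    return -2.0 * loglik + math.log(n_obs) * n_par

# hypothetical maximized log-likelihoods: model -> (loglik, number of parameters)
fits = {"log-AR": (-612.4, 3), "GARMA": (-605.1, 4), "GLARMA": (-607.9, 4)}
best = min(fits, key=lambda m: aic(*fits[m]))
```

Repeating this comparison over the S simulated data sets and counting how often `best` matches the generating model gives the selection frequencies reported in Table 2.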
For the Poisson distribution, the results are excellent for the GARMA and GLARMA models. The log-AR model shows a slower convergence towards the right model, but it reaches a satisfactory result as n increases. The same holds, in the case of Binomial data, for the BARMA and GLARMA models. Finally, the GARMA model also works very well for the Binomial distribution.

Application to disease cases of Escherichia coli in North Rhine-Westphalia
We consider a test-bed time series, namely the weekly number of reported disease cases caused by Escherichia coli in the state of North Rhine-Westphalia (Germany) from January 2001 to May 2013. The data can be found in the R package tscount [28]. The time series has length n = 646 and is plotted in Figure 1, together with its sample autocorrelation function (ACF). There is a temporal correlation which spreads over several lags, with a greater magnitude compared to the data set in the previous example. The slow decay of the ACF suggests the use of a feedback mechanism.
For the data-generating process we assume both the Poisson and the Negative Binomial (NB) distribution in equation (10), where ν > 0 is the dispersion parameter and $\mu_t$ is the conditional expectation. Indeed, equation (10) is defined in terms of the mean rather than of the probability parameter $p_t = \nu/(\nu + \mu_t)$, and it accounts for overdispersion in the data, as $V(Y_t|\mathcal{F}_{t-1}) = \mu_t(1 + \mu_t/\nu) \ge \mu_t$. We fit the log-AR, GARMA and GLARMA models, where $y^*_t = \max\{y_t, c\}$ with c = 0.1 and $\varepsilon_t = (y_{t-1} - \mu_{t-1})/s_{t-1}$. Different values of 0 < c < 1 do not affect the estimates, while $s_t$ is the square root of the conditional variance: $s_t = \sqrt{\mu_t}$ for the Poisson distribution and $s_t = \sqrt{\mu_t(1 + \mu_t/\nu)}$ for the NB. In this likelihood-based framework, model selection is based on information criteria, such as the AIC and BIC. The Quasi Information Criterion (QIC) introduced by [32] is also employed: it is a generalization of the AIC which takes into account the use of a working quasi-likelihood instead of the true likelihood, and it coincides with the AIC in the case of well-specified models. QMLE has been carried out; the log-likelihood function of the Poisson and NB distributions is maximized using a standard optimizer in R based on the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. The score functions are written in terms of the predictor $x_t = \log \mu_t$, and the solution of the non-linear system of equations $\chi_n(\rho) = 0$, if it exists, provides the QMLE of ρ (denoted by $\hat\rho$). In NB models, estimation of ν is also required. The moment estimator proposed in [6] is used, where $\hat\mu_t = \mu_t(\hat\rho)$ is the estimator from the Poisson model. Then, with $\nu = \hat\nu$, we estimate the NB model and obtain new estimates of $\hat\mu_t$, plug them into (23), obtain a new value of $\hat\nu$, and repeat the procedure until a certain tolerance value is reached. The standard errors are computed from the "sandwich" estimator in Theorem 4, with each quantity replaced by its sample counterpart. The results of the analysis are summarized in Table 3.
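The dispersion step of the two-stage procedure can be sketched as follows. The explicit Pearson-statistic form used here is a common moment-type choice and is an assumption for the sketch; the paper uses the estimator of its reference [6], whose exact display is not reproduced.

```python
def estimate_nu(y, mu, n_params=4, lo=1e-3, hi=1e6, iters=200):
    """Moment-type estimator of the NB dispersion parameter nu: find nu
    such that the Pearson statistic
        sum_t (y_t - mu_t)^2 / (mu_t + mu_t**2 / nu)
    equals n - n_params.  (Illustrative form; the paper uses the
    estimator of its reference [6].)"""
    target = len(y) - n_params
    def pearson(nu):
        return sum((yt - mt) ** 2 / (mt + mt * mt / nu)
                   for yt, mt in zip(y, mu))
    # pearson(nu) is increasing in nu (the NB variance shrinks as nu grows),
    # so simple bisection applies
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if pearson(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

In the empirical analysis this estimator alternates with re-estimation of ρ: fit the Poisson model, compute ν̂ from its fitted means, refit the NB model, update ν̂, and stop once the change falls below a tolerance.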
For the log-AR, GARMA and GLARMA models, the whole set of parameters is significant at the 5% level. The parameter $\hat\nu$ is generally around 10. All the information criteria select the NB GLARMA model as the best, in a goodness-of-fit sense. We then assess the adequacy of the fit, checked through the behaviour of the standardized Pearson residuals, taking the empirical version $\hat e_t$ computed from the estimated quantities. If the model is correctly specified, the residuals should form a white noise sequence with constant variance. This can be seen from the ACF, which in our case appears uncorrelated. [8] introduced a non-randomized version of the Probability Integral Transform (PIT) for discrete data, built from the conditional cumulative distribution function, where $P_t(\cdot)$ is the cumulative distribution function (CDF) at time t (in our case Poisson or NB). If the model is correct, the PIT (24) will appear to be the cumulative distribution function of a Uniform(0, 1). The PIT (24) is computed for each realization of the time series $y_t$, t = 1, ..., n, and for values u = j/J, j = 1, ..., J, where J is the number of bins (usually equal to 10 or 20); then its mean $\bar F(j/J) = n^{-1}\sum_{t=1}^{n} F(j/J\,|\,y_t)$ is taken. The outcomes are probability mass functions, obtained in terms of the differences $\bar F(j/J) - \bar F((j-1)/J)$; Figure 2 is a representative plot. The NB seems to be more appropriate for the data, as its PITs are quite close to Uniform(0, 1). Another check can be performed by using the probability and marginal calibration (mc), as defined in [24]. It compares the average of the selected CDFs, $\bar P(x) = n^{-1}\sum_{t=1}^{n} P_t(x)$, against the average of the empirical CDF, $\bar G(x) = n^{-1}\sum_{t=1}^{n} \mathbb{1}(y_t \le x)$. The mc is plotted in Figure 1 for the log-AR and GLARMA models; in the other models the results are similar.
Both distributions show good concordance with the empirical distribution, but the NB appears to perform better than the Poisson, especially at the larger quantiles.
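The non-randomized PIT admits a compact implementation: for each observation, the predictive CDF evaluated at $y_t - 1$ and at $y_t$ determines a piecewise-linear PIT in u, and averaging over t gives the diagnostic curve. The function names below are illustrative; this is a sketch of the construction in (24), not the paper's code.

```python
def pit_value(u, p_lower, p_upper):
    """Non-randomized PIT for a discrete outcome: p_lower = P_t(y_t - 1)
    and p_upper = P_t(y_t) are the predictive CDF just below and at the
    observed count (take p_lower = 0 when y_t = 0)."""
    if u <= p_lower:
        return 0.0
    if u >= p_upper:
        return 1.0
    return (u - p_lower) / (p_upper - p_lower)

def mean_pit(us, lowers, uppers):
    """Average PIT over the sample at each u; under a correct model the
    curve is close to the identity, i.e. a Uniform(0,1) CDF."""
    n = len(lowers)
    return [sum(pit_value(u, lo, up) for lo, up in zip(lowers, uppers)) / n
            for u in us]
```

Evaluating `mean_pit` at u = j/J and differencing adjacent values yields the histogram-like plots of Figure 2.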
In order to assess the predictive power, we refer to the concept of sharpness of the predictive distribution defined in [24]. It can be measured by average quantities related to the predictive distribution, which take the form $n^{-1}\sum_{t=1}^{n} d(P_t(y_t))$, where d(·) is a scoring rule. We adopt the scoring rules usually employed in the literature: the logarithmic score (logs), $-\log p_t(y_t)$, where $p_t(\cdot)$ is the probability mass function at time t; the quadratic score (qs), $-2p_t(y_t) + \|p_t\|^2$, where $\|p_t\|^2 = \sum_{k=0}^{\infty} p_t^2(k)$; the spherical score (sphs), $-p_t(y_t)/\|p_t\|$; and the ranked probability score (rps), $\sum_{k=0}^{\infty} \{P_t(k) - \mathbb{1}(y_t \le k)\}^2$. Numerical results for each model are collected in Table 4. The NB GLARMA model provides the best predictive performance for all the scoring rules analysed, and it is ultimately chosen, since it has also been selected by the information criteria.
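The four scoring rules have a direct implementation for a discrete predictive distribution. The sketch below assumes a Poisson predictive pmf and truncates the infinite sums at kmax (an approximation); lower values are better for all four rules.

```python
import math

def pois_pmf(k, mu):
    """Poisson probability mass function."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def scores(y, mu, kmax=200):
    """Average logarithmic (logs), quadratic (qs), spherical (sphs) and
    ranked probability (rps) scores for Poisson predictive distributions
    with means mu; infinite sums are truncated at kmax.  Lower is better."""
    n = len(y)
    logs = qs = sphs = rps = 0.0
    for yt, mt in zip(y, mu):
        p = [pois_pmf(k, mt) for k in range(kmax + 1)]
        norm2 = sum(pk * pk for pk in p)          # ||p_t||^2
        logs += -math.log(p[yt])
        qs += -2.0 * p[yt] + norm2
        sphs += -p[yt] / math.sqrt(norm2)
        cdf = 0.0
        for k in range(kmax + 1):                 # rps contribution
            cdf += p[k]
            rps += (cdf - (1.0 if yt <= k else 0.0)) ** 2
    keys = ("logs", "qs", "sphs", "rps")
    return dict(zip(keys, (logs / n, qs / n, sphs / n, rps / n)))
```

Swapping `pois_pmf` for an NB pmf gives the NB columns of Table 4; comparing the averages across models reproduces the ranking discussed in the text.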

Discussion
We developed statistical inference for a first-order class of models for discrete time series which encompasses known models as well as new models of potential interest for the analysis of integer-valued time series. Stability conditions have been derived for the models in the class and for a large family of probability distributions satisfying mild moment conditions. Consistency and asymptotic normality of the quasi-maximum likelihood estimators have also been established, with the focus on the exponential family. The results on stochastic and inferential properties make any model belonging to the class fully applicable in practice. An interesting extension of the present study may concern suitable specifications for multinomial data. Indeed, equation (1) describes an exponential family dealing only with one time-varying parameter $\mu_t$. However, there are cases where several dynamic parameters characterize the distribution. Concerning discrete-valued data, this is the case of categorical random variables. An extension of the general framework to several parameters in the context of exponential families may be considered following the results on multinomial logistic models in [23], where exogenous covariates are also considered.
Another extension of potential interest may be the specification of multivariate discrete models. Recently, [21] established the multivariate discrete-valued extension of the (mixed) Poisson autoregression models (8)-(9). In line with such developments, the extension to a multivariate setting may represent a challenging research advance, though complications may arise both at the modelling and at the inferential stage. For instance, as far as the stochastic properties are concerned, the coupling condition employed in this paper to show stationarity and ergodicity of the model (Lemma 4 in the Appendix) does not apply to multivariate processes. A possible direction to solve the problem may be based on the perturbation approach, as described in [21]. In addition, the choice of a suitable multivariate version of the discrete probability mass function is nontrivial; although several alternatives have been proposed in the literature (see the recent review in [19, Sec. 2]), it remains a challenging problem. As a matter of fact, multivariate discrete probability mass functions have complicated closed forms and the associated likelihood inference is both theoretically and computationally cumbersome. Furthermore, in many cases, multivariate probability mass functions imply restricted models, of limited use in applications: see the discussion in [26] and [10]. A viable alternative may be the specification of the joint distribution of the integer vector $\{Y_t\}$ by a copula approach, as described in [21, Sec. 2].
Along the lines traced in this discussion, we expect that the specification of this broad class of models will provide useful enhancements for studying the dynamic behaviour of count and binary data.

A.1. Preliminary Lemmata for the Proof of Theorem 1
The proof of Theorem 1 is based on the following preliminary lemmata, stated with the same notation as the theorem. First, a small set (Lemma 1) and a drift condition (Lemma 2) are established for the Markov chain $X_t = g(\mu_t)$; after that, the weak Feller property is established for the chain (Lemma 3), which proves the existence of a stationary distribution for $\{X_t\}_{t\in T}$. Then, the asymptotic strong Feller condition is verified (Lemma 4). Finally, the existence of a reachable point is shown (Lemma 5) and, by combining all these results, the uniqueness of the stationary distribution of the chain is proven. For definitions and properties invoked in the lemmata, see [2, pp. 54-55].
Let $E_x(\cdot)$ denote the expectation under the probability $P_x(\cdot)$ induced on the path space of the chain $\{X_t\}_{t\in T}$ when the initial state $X_0$ is deterministically equal to x. Consider the following drift condition, for all $x \in S$: $E_x[V(X_1)] \le \lambda V(x) + b\,\mathbb{1}_A(x)$, where $V : S \to [1, \infty)$, $\lambda \in (0,1)$, $b < \infty$ and $A \subset S$. We first prove that it is possible to select a set A which is small. Given $X_0 = x$ and $\mu_0 = \mu = g^{-1}(x)$, we can write $\bar g(\mu) = \bar g(g^{-1}(x)) = (\bar g \circ g^{-1})(x) = \tilde g(x)$, where the composite function $\tilde g$ is still monotonic (and invertible), as a composition of monotonic functions. Then, with probability at least 3/4, $X_1 \ge \min\{b(a_1(M)), b(a_2(M))\} - |\gamma| M - |\theta||g(M)|$.

Lemma 1 Define a set A as follows,
where $a^{*}$ denotes the operator $*$ applied to $a$. This shows that A is a small set, by (A1)-(A3), in the fashion of [29, p. 812]; we omit the details. Proof Consider the small set A in Lemma 1. We shall consider two states, x > M and x < −M, each one, in turn, facing three different cases, according to the domain of the mean parameter $\mu_t$, as described in Section 3.1. Set A as defined in Lemma 1 and take V(x) = |x|. We only prove the case where x > M and the mean process $\mu_t \in \mathbb{R}$ (Case 1), as the other cases can be dealt with in a similar manner; the interested reader can find fully detailed proofs in [2, pp. 55-59]. We assume, without loss of generality, that h(0) = 0, since replacing h(y) with h(y) − h(0) simply changes the value of α. In this case, we assume that h is concave on $\mathbb{R}_+$ and convex on $\mathbb{R}_-$, so that there are constants $a_0, a_1 \ge 0$ such that $|h(y)| \le a_0 + a_1|y|$ for all y; the same assumptions hold for g. From equation (A-2), we need to show the corresponding bound. When $h(\mu) \le g(\mu)$, this holds from a result in [29, Sec. A.7] by replacing g(·) with h(·). Instead, when $h(\mu) > g(\mu)$, the result is unchanged by applying the inequality $h(\mu) = g(\mu + \delta) \le g(\mu) + g(\delta)$, with δ > 0, by the concavity of the functions involved on the same domain. Next, we show that the remaining term is "small" relative to the linear term x. Since h(0) = 0 and h is monotonically increasing, for x > M, the bound in [29, eq. 23] applies; using the Markov inequality stated in [29, eq. 14], for any fixed ε ∈ (0, 1) and x > M, the required bound follows. If $\bar g \equiv h = g$, equation (A-5) reduces to $Ch(\mu)/\mu^{2+\delta-r}$. Recall that, for y > 0, $a_0 + a_1 y \ge h(y)$, so that $a_0 + a_1\mu \ge h(\mu)$. Hence $\mu \ge (h(\mu) - a_0)/a_1$, and (A-4) is bounded by an increasing function, being a composition of increasing functions, and is therefore bounded by a constant; consequently, the above bound applies here.
If $\bar g \equiv g = h$, we define (A-5) as $g(\mu)(C_1\mu^{r} + C_2)/(\varepsilon^{2+\delta}\mu^{2+\delta}) + h(\mu)(C_1\mu^{r} + C_2)/(\varepsilon^{2+\delta}\mu^{2+\delta}) = Cx/\mu^{2+\delta-r} + Ch(\mu)/\mu^{2+\delta-r}$, which converges to 0 as x → ∞. It only remains to show that the last term is "small". When $\bar g \equiv h$, this is straightforward by replacing g(·) with h(·) in [29, p. 826], establishing the existence of the constants $C_1, C_2$ by the concavity of h(·), for μ > 0 when x > M [29, p. 824]. Combining this result with (A-2) and (A-3), we have that, for all x large enough and for any ε ∈ (0, 1), there is some constant for which the drift condition is satisfied; this gives the final result. When x < −M and $\mu_t \in \mathbb{R}$, the previous proof holds directly by symmetry. In the other cases, according to the states discussed above, similar conditions are found; for convenience, the results, organized according to the domain of x and $\mu_t$, are reported below.
Lemma 5 If (A3) holds, then a reachable point $x_0$ exists for the chain (14).
Here $x_t$ denotes the sample counterpart. Firstly, consider the case in which $\bar g \equiv g$, and put h(0) = 0 (which simply changes the value of the constant α); equation (14) can then be rewritten accordingly. Let us consider the case $Y^*_t = 0$ for t = 1, ..., n; hence (A-8) applies. Let $x \in \mathbb{R}$ and let C be an open set containing x. Then, by setting $x_0 = x$, for all t ≥ 1 it can immediately be seen that $\bar g(\mu_t) = 0$ for t = 1, ..., n, and (A-8) still holds, with γ instead of δ, as it follows from (A3) that |γ| < 1. When $\bar g \equiv h \neq g$, we consider the case $Y_t = c$ for t = 1, ..., n, so that $\mu_t = c$ and $Y^*_t = c$ for t = 1, ..., n; finally, set h(c) = 0 and (A-8) will be valid again, with γ instead of δ.

A.2. Proof of Theorem 1
Theorem 1 follows directly from Lemmata 1-5. More precisely, if (A1)-(A2) and (A3) hold, the process $\{X_t\}_{t\in T}$ has at least one stationary distribution; the result is obtained by Lemmata 1-3 and Theorem 2 in [40]. Moreover, if (A1)-(A4) hold, the stationary distribution of the process $\{X_t\}_{t\in T}$ is unique; this is immediate by Lemma 4, Lemma 5 and Theorem 3 in [29]. Finally, by Proposition 8 in [14], the stationarity of $\{Y_t\}_{t\in T}$ follows directly from the uniqueness of the stationary distribution of $\{X_t\}_{t\in T}$; this completes the proof.

A.3. Proof of Corollary 1
Let us define $\nu_0 = \nu(\mu_0) = \nu(\mu) = \nu$ and set $g(\mu) = x$. Note that ν is the standard deviation σ(μ) of $h(Y_0)$, which is constant with respect to x (and hence with respect to μ). For this reason, when $x \in A$ and $\mu \in \mathbb{R}$ (Case 1), the result of Lemma 2 holds here unchanged. When x > M, if ν is increasing with respect to μ, then as x → ∞ (μ → ∞) ν either goes to infinity as well (so that 1/ν → 0, and is therefore bounded for x > M) or converges to a specific constant. In both cases the proof of Lemma 2 still holds with a modification of the constants C. The same (with signs inverted) holds for x < −M, provided that ν is decreasing with respect to μ. Case 2, when $x \in [-M, \infty)$, holds as above, by setting, without loss of generality, h(c) = 0, since replacing h(y) with h(y) − h(c) simply changes the value of α; the same assumptions hold for g. When x < −M, we have $0 < \mu = g^{-1}(x) < g^{-1}(0) = c$; here ν is only required to be monotone with respect to μ: indeed, if it is decreasing, σ(μ) > σ(c) = ξ, whereas, if it is increasing, σ(μ) > σ(0) = ξ, and the resulting bound, up to the term |γ||x|, provides the same stationarity condition obtained in the absence of the scaling sequence. For Case 3 we have 0 < μ < a, and ν is again required to be monotone: if it is increasing, σ(μ) > σ(0) = δ, whereas, if it is decreasing, σ(μ) > σ(a) = δ; this again provides the same stationarity condition. Hence, Lemma 2 also holds for the chain (15).
As far as the Feller properties are concerned, it is easy to see that the weak Feller condition is satisfied since, in general, $\sigma^2(\mu)$ is continuous in μ (and hence in x); thus, Lemma 3 holds. In order to prove Theorem 1, the asymptotic strong Feller property remains to be verified. Define $\tilde Y_0 = h(Y_0)$ and $\tilde\mu = \bar g(\mu)$. We compute the scaling sequence from the first-order Taylor expansion.
which has finite expectation, and is therefore finite according to (H1). In fact, $h(Y^*_t)$ is stationary and $|h(Y_0)| \le a_0 + a_1|Y_0|$, for Case 1.
almost surely, so that (A-16) tends to 0 as t → ∞, according to [14, Lem. 34], (H1) and equation (A-17). The mean value theorem allows us to rewrite equation (A-15) in terms of the derivative $\partial g^{\rho}_{Y_{1:t}}$, which tends to 0 as t → ∞ for the same reason as (A-14), provided the following expectation is finite. The first term of (A-19), $E\log|\partial g^{\rho}_{Y_{1:t}}|$, is finite since, by (H2), the expectation of (A-18) is finite. The bound for the second term of (A-19) follows from the mean value theorem.
which is finite, as the expectations of (A-12) and (A-13) are, by (H1). The same results as in (A-15) and (A-16) apply similarly for f(·) and are thus omitted. Hence, (A7)-(i) is proved. We now move to (A7)-(ii) and consider $K_\rho(g^{\rho}_{Y_{1:t-1}}(x), Y_t)$. The proof is shown for a single derivative; the proof of the others is immediate.

Appendix B: Simulation results for finite sample properties
In this section, the numerical results concerning the finite sample properties discussed in Section 4.2 are presented. Table A-1 summarizes the estimation results for the GLARMA model when the data come from a Bernoulli distribution. Tables A-2 and A-3 show the outcome of simulations for the GARMA and log-AR models performed on data generated from the Geometric distribution in (10), but with a Poisson distribution fitted instead (QMLE). The first row of each table reports the true parameter values; the following two rows show the mean of the estimated parameters, obtained by averaging the results over all simulations, along with the corresponding standard error. The subsequent two rows present the lower and upper limits of the confidence interval for the estimated mean. Finally, the last two rows report the bias of the mean and the p-value of the Kolmogorov-Smirnov (KS) test for normality of the standardized MLE/QMLE estimates obtained from the simulations.