Stationarity of Generalized Autoregressive Moving Average Models

,


Introduction
Stationarity is a fundamental concept in time series modeling, capturing the idea that the future is expected to behave like the past; this assumption is inherent in any attempt to forecast the future.Many time series models are created by combining nonstationary effects such as trends, covariate effects, and seasonality with a stochastic process that is known or believed to be stationary.Alternatively, they can be defined by partial sums or other transformations of a stationary process.The properties of statistical estimators for particular models are then established via the relationship to the stationary process; this includes consistency of parameter estimators and of standard error estimators (Brockwell and Davis 1991, Chap. 7-8).
However, (strict) stationarity can be nontrivial to establish, and many time series models currently in use are based on processes for which it has not been proven.Strict stationarity (henceforth, "stationarity") of a stochastic process {X n } n∈Z means that the distribution of the random vector (X n , X n+1 . . ., X n+j ) does not depend on n, for any j ≥ 0 (Billingsley 1995, p.494).Sometimes weak stationarity (constant, finite first and second moments of the process {X n } n∈Z ) is proven instead, or simulations are used to argue for stationarity.
The most common approach to establishing strict stationarity and ergodicity (defined as in Billingsley 1995, p.494) is via application of Lyapunov function methods (also known as drift conditions) to a Markov chain that is related to the time series model.However, Lyapunov function methods are almost always used in conjunction with an assumption of ϕ-irreducibility, which can be violated by discrete-valued observation-driven time series models.Such models are important since (due to the simplicity of evaluating the likelihood function) they are typically the best option for modeling very long count-or binary-valued time series.
We address this challenge to show for the first time stationarity of several models designed for discrete-valued time series, including the class of Generalized Autoregressive Moving Average (GARMA) models (Benjamin et al., 2003).We do this from two perspectives.First, we show stationarity and ergodicity of a perturbed version of the model.Such ergodicity results for a perturbed model can be used in some cases to show asymptotic normality of parameter estimators for the original model (Fokianos et al., 2009;Fokianos andTjostheim, 2011, 2010).We show that the perturbed and original models are closely related in the sense that the perturbed model yields (finite-sample) parameter estimates that are arbitrarily close to those from the original model.This result is given under weak conditions that encompass nearly all discrete-valued time series models, and utilizes the fact that the perturbed model is ϕ-irreducible.It implies that a researcher can choose to use the perturbed model without substantially affecting (finite-sample) parameter estimates, in order to get strong theoretical properties associated with stationarity and ergodicity, such as consistent estimation of the mean and lagged covariances of the process, and more generally the expectation of any integrable function (Billingsley 1995, p.495).However, our result does not immediately imply stationarity properties for the original model.Indeed, it is not clear that one can always use such perturbation results to show asymptotic normality of estimators in the original model, and when it is possible model-specific approaches may be needed.
Instead of pursuing such a model-specific approach we maintain attention on a broad class of models and investigate a second approach.We show that the original models have a unique stationary distribution (so are strictly stationary when initialized in that distribution), by using Feller properties of the chain.To our knowledge Feller properties have not previously been used to show uniqueness of the stationary distribution for discrete-valued time series models (although they have been used to show existence of a stationary distribution; Davis, Dunsmuir, and Streett 2003).This approach is very general and has the potential to be used for showing that wide classes of discrete-valued models have unique stationary distributions.
Our stationarity results for the original models potentially form the foundation for showing consistency and asymptotic normality of parameter estimates in the GARMA model class in its general form.Unlike the results for the perturbed model, consistent estimation of the expectation of integrable functions is not immediate from our stationarity results for the unperturbed model; we leave this to future work.
GARMA models generalize autoregressive moving average models to exponential-family distributions, naturally handling count-and binary-valued data among others.They can also be seen as an extension of generalized linear models to time series data.The numerous applications of these models include predicting numbers of births (Léon and Tsai, 1998), modeling poliomyelitis cases (Benjamin et al., 2003), and predicting valley fever incidence (Talamantes et al., 2007).The main stationarity result that currently exists for GARMA models is weak stationarity in the case of an identity link function; unfortunately this excludes the most popular of the count-valued models (Benjamin et al., 2003).Zeger and Qaqish (1988) have also used a connection to branching processes to show stationarity for a special case of the purely autoregressive Poisson log-link GARMA model.The stationarity of particular models related to Poisson GARMA has also been addressed by Davis et al. (2003) (log link case) and Ferland et al. (2006) (linear link case).
In Section 2 we describe Lyapunov function methods, give our justification for using these methods on perturbed discrete-valued time series models, and give background on the use of Feller properties to show stationarity.In Section 3 we illustrate the perturbation approach by applying it to a specific count-valued threshold model, and in Section 4 we use the perturbation method to prove stationarity for the class of perturbed GARMA models.In Section 5 we show stationarity of the original GARMA and count-valued threshold models using Feller properties.

Stationarity of Observation-Driven Models
For a real-valued process where n ≤ m.An observationdriven time series model for {Y n } n∈N has the form: for functions h θ,n parameterized by θ and some density function ψ ν (typically with respect to counting or Lebesgue measure) that can depend on both time-invariant parameters ν and the time-dependent quantities µ n (Zeger and Qaqish, 1988;Davis et al., 2003;Ferland et al., 2006).Observation-driven models are desirable because the likelihood function for the parameter vector (θ, ν) can be evaluated explicitly.The alternative class of parameter-driven models (Cox, 1981;Zeger, 1988), by contrast, incorporates latent random innovations which typically make explicit evaluation of the likelihood function impossible, so that one must resort to approximate inference or computationally intensive Monte Carlo integration over the latent process (Chan and Ledolter, 1995;Durbin and Koopman, 2000;Jung et al., 2006).These methods do not scale well to very long time series, so observation-driven models are typically the better option in this case.Observation-driven models are usually constructed via a Markov-p structure for µ n , meaning that for n ≥ p for some function g θ and for fixed initial values µ 0:p−1 .This structure implies that the vector µ n−p:n−1 forms the state of a Markov chain indexed by n.Billingsley (1995) shows that its transformation {Y n } is also stationary and ergodic.

Stationarity of a Perturbed Process
We will give an approach based on a perturbed version of the discrete-valued model, and justify its use.First we describe the use of drift conditions to show stationarity and ergodicity of ϕ-irreducible processes.For a general Markov chain X = {X n } n∈N on state space S with σ-algebra F define T n (x, A) = Pr(X n ∈ A|X 0 = x) for A ∈ F to be the n-step transition probability starting from state X 0 = x.The appropriate notion of irreducibility when dealing with a general state space is that of ϕ-irreducibility, since general state space Markov chains may never visit the same point twice.
Definition 1.A Markov chain X is ϕ-irreducible if there exists a nontrivial measure ϕ on F such that, whenever ϕ(A) > 0, T n (x, A) > 0 for some n = n(x, A) ≥ 1, for all x ∈ S.
The notion of aperiodicity in general state space chains is the same as that seen in countable state space chains, namely that one cannot decompose the state space into a finite partition of sets where the chain moves successively from one set to the next in sequence, with probability one.For a more precise definition, see Meyn and Tweedie (1993), Sec. 5.4.We need one more definition before we can present drift conditions.Definition 2. A set A ∈ F is called a small set if there exists an m ≥ 1, a nontrivial measure ν on F, and a λ > 0 such that for all x ∈ A and all C ∈ F, T m (x, C) ≥ λν(C).
Small sets are a fundamental tool in the analysis of general state space Markov chains because, among other things, they allow one to apply regenerative arguments to the analysis of a chain's long-run behavior.Regenerative theory is indeed the fundamental tool behind the following result, which is a special case of Theorem 14.0.1 in Meyn and Tweedie (1993).Let E x (•) denote the expectation under the probability P x (•) induced on the path space of the chain when the initial state X 0 is deterministically x.
Theorem 1. (Drift Conditions): Suppose that X = {X n } n∈N is ϕ-irreducible on S. Let A ⊂ S be small, and suppose that there exist b ∈ (0, ∞), > 0, and a function Then X is positive Harris recurrent.
The function V is called a Lyapunov function or energy function.The condition (4) is known as a drift condition, in that for x / ∈ A, the expected energy V drifts towards zero by at least .The indicator function in (4) asserts that from a state x ∈ A, any energy increase is bounded (in expectation).
Positive Harris recurrent chains possess a unique stationary probability distribution π.If X 0 is distributed according to π then the chain X is a stationary process.If the chain is also aperiodic then X is ergodic, in which case if the chain is initialized according to some other distribution, then the distribution of X n will converge to π as n → ∞.
Hence, the drift condition (4), together with aperiodicity, establishes ergodicity.A stronger form of ergodicity, called geometric ergodicity, arises if (4) is replaced by the condition for some β ∈ (0, 1) and some V : S → [1, ∞) (note the change in the range of V ).Indeed, (5) implies (4).Either of these criteria are sufficient for our purposes.
A problem can occur, however, when we attempt to apply this method for proving stationarity to an observation-driven time series model given by ( 1) and (3): the Markov chain {µ n−p:n−1 } n≥p may not be ϕ-irreducible.This occurs, for instance, whenever Y n can only take a countable set of values and the state space of µ n−p:n−1 is R p .Then, given a particular initial vector µ 0:p−1 the set of possible values for µ n is countable.In fact, for any fixed initialization µ 0:p−1 there is a countable set A ⊂ R p such that ∞ n=p Pr(µ n−p:n−1 ∈ A|µ 0:p−1 ) = 1, and distinct initial vectors µ 0:p−1 can have distinct sets A. For a simpler example of a Markov chain with the same property, consider the stochastic recursion defined by X n = [X n−1 + Y n ] mod 1 where {Y n } n≥1 are i.i.d.discrete random variables on the rationals and x mod 1 is the fractional part of x.If X 0 is rational, then so is X n for all n ≥ 1, while if X 0 is irrational then so is X n for all n ≥ 1.Also, the set of states that can be reached from any fixed X 0 is countable.
However, by adding small real-valued perturbations to a discrete-valued time series model one can obtain a ϕ-irreducible process.We do this by returning to the most general framework (1) and (2), and replacing h θ,n with a function of two inputs: where the Z i iid ∼ φ are random perturbations having density function φ (typically with respect to Lebesgue measure), σ > 0 is a scale factor associated with the perturbation, and h θ,n (•, σZ 0:n−1 ) is a continuous function of Z 0:n−1 such that h θ,n (y, 0) = h θ,n (y) for any y.The value µ (σ) 0 is a fixed constant that we take to be independent of σ, so that µ (σ) 0 = µ 0 .When the perturbed model is constructed to be ϕ-irreducible, one can then apply drift conditions to prove its stationarity.
We will show that using the perturbed instead of the original model has an arbitrarily small effect on the (finite-sample) parameter estimates.We do this by proving that the likelihood of the parameter vector η = (θ, ν) calculated using (6) converges uniformly to the likelihood calculated using (2) as σ → 0.More precisely, the joint density of the observations Y = Y (σ) 0:n and first n perturbations Z = Z 0:n−1 , conditional on the parameter vector η, the perturbation scale σ, and the initial value µ 0 , is where µ k (σZ) is the value of µ (σ) k induced by the perturbation vector σZ through (6), with µ 0 (σZ) = µ 0 .The likelihood function for the parameter vector η implied by the perturbed model is the marginal density of Y integrating over Z, i.e., Here we have placed a subscript σ on the likelihood function to emphasize its dependence on σ.Let the likelihood function without the perturbations be denoted by L, so that Theorem 2. Under regularity conditions (a) & (b) below, the likelihood function L σ based on the perturbed model ( 6) converges uniformly on any compact set K to the likelihood function L based on the original model (2), i.e., for any fixed sequence of observations y 0:n and conditional on the initial value µ 0 .So if L is continuous in η and has a finite number of local maxima and a unique global maximum on K, the maximum-likelihood estimate of η based on L σ converges to that based on L. Also, Bayesian inferences based on L σ converge to those based on L, in the sense that the posterior probability of any measurable set A using likelihood L σ (and restricting to a compact set) converges to that using L.

Regularity Conditions:
(a) For any fixed y the function ψ ν (y; µ) is bounded and Lipschitz continuous in µ, uniformly in η ∈ K.
(b) For each n, µ n (σZ) is Lipschitz in some bounded neighborhood of zero, uniformly in η ∈ K.
Assumption (a) holds, e.g., for ψ ν (y; µ) equal to a Poisson or binomial density with mean µ, or a negative binomial density with mean µ and precision parameter ν.As we will see for several models, µ n (σZ) can easily be constructed to satisfy (b).Theorem 2 is proven in Appendix A.1.
Theorem 2 says that one can choose to use the perturbed model (with fixed and sufficiently small perturbation scale σ) instead of the original model, without significantly affecting finitesample parameter estimates, in order to get the strong theoretical properties associated with stationarity and ergodicity.These include consistent estimation of the mean and lagged covariances of the process.Although we have shown that the perturbed and original models are closely related, and although one can use drift conditions to show stationarity and ergodicity properties of the perturbed model, this approach does not yield stationarity and ergodicity properties for the original model.This is due to the substantial technical difficulty associated with interchanging the limits σ → 0 and n → ∞.Theorem 2 addresses the case of a fixed number of observations n, as σ → 0, while consistency of parameter estimation for the perturbed model is a statement about n → ∞ for fixed σ.Due to this limitation of the perturbation approach we separately address stationarity of the original process, as outlined in Section 2.2.

Stationarity of the Original Process
When the chain {µ n−p:n−1 } n≥p associated with the observation-driven time-series model is not ϕ-irreducible we will see that one can instead use Feller properties to prove that it has a unique stationary distribution.We address existence of a stationarity distribution first, then uniqueness of that distribution.
In the absence of ϕ-irreducibility, the "weak Feller" condition can be combined with a drift condition to show existence of a stationary distribution.A chain evolving on a complete separable metric space S is said to be "weak Feller" if the transition kernel T (x, •) satisfies T (x, •) ⇒ T (y, •) as x → y, for any y ∈ S and where ⇒ indicates convergence in distribution; see, e.g., Section 6.1.1 of Meyn and Tweedie (1993) and Theorem 25.8 (i) and (ii) of Billingsley (1995).
Theorem 3. (Tweedie, 1988) Suppose that S is a locally compact complete separable metric space with F the Borel σ-field on S, and that the Markov chain {X n } n∈N with transition kernel T is weak Feller.Let A ∈ F be compact, and suppose that there exist b ∈ (0, ∞), > 0, and a function V : S → [0, ∞) such that for all x ∈ S the drift condition (4) holds.Then there exists a stationary distribution for T .
Uniqueness of the stationary distribution can be established using the "asymptotic strong Feller" property, defined as follows (Hairer and Mattingly, 2006).Let S be a Polish (complete, separable, metrizable) space.A "totally separating system of metrics {d n } n∈N for S" is a set of metrics such that for any x, y ∈ S with x = y, the value d n (x, y) is nondecreasing in n and lim n→∞ d n (x, y) = 1.A metric d on S implies the following distance between probability measures µ 1 and µ 2 : where is the minimal Lipschitz constant for φ with respect to d.Using these definitions, a chain is asymptotically strong Feller if, for every fixed x ∈ S, there is a totally separating system of metrics {d n } for S and a sequence t n > 0 such that where B(x, γ) is the open ball of radius γ centered at x, as measured using some metric defining the topology of S.
Then we have the following result, which is an extension of results in Hairer and Mattingly (2006) and Hairer (2008).A "reachable" point x ∈ S means that for all open sets A containing x, ∞ n=1 T n (y, A) > 0 for all y ∈ S (Meyn and Tweedie, 1993, p. 135).
Theorem 4. Suppose that S is a Polish space and the Markov chain {X n } n∈N with transition kernel T is asymptotically strong Feller.If there is a reachable point x ∈ S then T can have at most one stationary distribution.
The results in Hairer (2008) require an "accessible" point, which is stronger than a reachable point.Theorem 4 is proven in Appendix A.2.

A Poisson Threshold Model
Our first example is a Poisson threshold model with identity link function that we have found useful in our own applications (Matteson et al., 2011).The model is defined as where the threshold boundaries satisfy 0 < L < U < ∞.To ensure positivity of µ n we assume ω, α, β > 0, (α + γ) > 0, and (β + η) > 0. Additionally we take η ≤ 0 and γ ≥ 0, so that when Y n−1 is outside the range (L, U ) the mean process µ n is more adaptive, i.e. puts more weight on Y n−1 and less on µ n−1 .
We will show that a perturbed version of the model {Y n } n∈N is stationary and ergodic under the restriction (α + β + γ + η) < 1.This can be proven via extension of results in Fokianos et al. (2009) for a non-threshold linear model.However, a much simpler proof is as follows.First, incorporate perturbations Z n iid ∼ Uniform(0, 1) as in Theorem 2: The regularity conditions for Theorem 2 hold since ψ ν is the Poisson density and µ n and take the state space of the Markov chain for any M > 0, and define m to be the smallest positive integer such that ] and where T is the transition kernel of the Markov chain X.
in Definition 2 then establishes A as a small set.A similar argument can be used to show ϕ-irreducibility and aperiodicity.
Taking the energy function In particular, for some |ν| < 1 and for M large enough.So E x V (X 1 ) has geometric drift for x ∈ A. Although the range of V is [0, ∞) here, we can easily replace V by Ṽ (x) = x + 1 to get the range [1, ∞).So the chain {µ n } n∈N is geometrically ergodic, and thus stationary for an appropriate initial distribution for µ

Generalized Autoregressive Moving Average Models
Generalized Autoregressive Moving Average (GARMA) models are a generalization of autoregressive moving average models to exponential-family distributions, allowing direct treatment of binary and count-valued data, among others.GARMA models were stated in their most general form by Benjamin et al. (2003), based on earlier work by Zeger and Qaqish (1988) and Li (1994).Showing stationarity for GARMA models is harder than for the linear models that have been the subject of most previous studies (Bougerol and Picard, 1992;Ferland et al., 2006;Fokianos et al., 2009), since a small change in the transformed mean can correspond to a very large change on the scale of the observations, causing instability.
We write GARMA models in the following form (Benjamin et al., 2003): for some real-valued link function g, where Y * n is some mapping of Y n to the domain of g, and where ψ ν is a density function with respect to some measure on R (typically Lebesgue or counting measure), parameterized by ν.The second and third terms of the model ( 10) are the autoregressive and moving-average terms, respectively.This model is more general than the class of models developed in Benjamin et al. (2003) in the sense that we do not assume that ψ ν is in the exponential family.However, we do assume that E(Y n |µ n ) = µ n , and we assume a bound on the (2 + δ) moment of Y n in terms of |µ n |, for some δ > 0. We will see that our conditions are satisfied by many standard choices such as the Poisson and binomial GARMA models.In practice when applying the GARMA model covariates are often included, and multiple lags can be allowed in the autocorrelation and moving-average terms, yielding the more general model: However, for simplicity we focus on the case p, q = 1; we discuss how one might extend our results to p > 1 and q > 1 at the end of Sec.4.1.Since the covariates are time-dependent, the model ( 11) is in general nonstationary, and interest is in proving stationarity in the absence of covariates, i.e.where W n β = γ as in (10).We handle three separate cases: Case 1: ψ ν (µ) is defined for any µ ∈ R. In this case the domain of g is R and we take Y * n = Y n .Case 2: ψ ν (µ) is defined for only µ ∈ R + (or µ on any one-sided open interval by analogy).In this case the domain of g is R + and we take Y * n = max{Y n , c} for some c > 0. Case 3: ψ ν (µ) is defined for only µ ∈ (0, a) where a > 0 (or any bounded open interval by analogy).In this case the domain of g is (0, a) and we take Y * n = min [max(Y n , c), (a − c)] for some c ∈ (0, a/2).
Valid link functions g are bijective and monotonic (WLOG, increasing).Choices for Case 2 include the log link, which is the most commonly used, and the link, parameterized by α > 0, which has the property that g(µ) ≈ µ for large µ.Benjamin et al. (2003) also suggest an unmodified identity link function g(µ) = µ for Case 2; however, this requires strong restrictions on the parameters in order to guarantee that µ n ≥ 0, so we do not address this or other cases of non-surjective link functions.Examples of valid link functions for Cases 1 and 3 are the identity and logit functions, respectively.
In this section we obtain ergodicity and stationarity results for the perturbed model: where Z n iid ∼ N (0, 1), for any σ > 0. Stationarity of the original model ( 10) is addressed in Section 5.
For the perturbed model we have the following stationarity results.
Theorem 5.The process {µ n } n∈N specified by the perturbed GARMA model ( 13) is an ergodic Markov chain and thus stationary for an appropriate initial distribution for µ (σ) 0 , under the conditions below.This implies that the perturbed GARMA model {Y (σ) n } n∈N is stationary and ergodic when µ (σ) 0 is initialized appropriately.The conditions are: • g is bijective, increasing, and n )} n∈N process, by applying Prop. 1 of Meitz and Saikkonen (2008).
The following popular models are special cases of Theorem 5: Corollary 6. Suppose that conditional on µ n .Then the perturbed GARMA model is ergodic and stationary given an appropriate initial distribution for µ (σ) 0 , provided that |ρ|, |θ| < 1 and the link function g is bijective, increasing, and concave.This is satisfied, for instance, by the log link and the modified identity link (12).Theorem 2 applies with no further restrictions.
Proof.If X is Poisson with mean µ then where the inequality can be seen by considering the cases λ ≤ 1 and λ > 1 separately.Thus we can take δ = 2 and r = 2. Theorem 2 applies here, shown as follows.The Poisson density satisfies regularity condition (a).Also, X n = g(µ (σ) n ) is linear in Z 0:n−1 and g −1 (•) is Lipschitz on any compact set (due to the concavity of g), implying that µ n = g −1 (X n ) is Lipschitz in Z 0:n−1 , uniformly on any compact subset of the parameter space (γ, ρ, θ) ∈ R 3 .
Corollary 7. Suppose that conditional on µ and fixed number of trials a. Then the perturbed GARMA model is ergodic and thus stationary for an appropriate initial distribution for µ (σ) 0 , provided that |θ| < 1 and g is bijective and increasing (e.g. the logit link).If g −1 is locally Lipschitz then Theorem 2 also holds.
The local Lipschitz condition on g −1 is satisfied for the logit and probit link functions, and in the case where g is differentiable holds as long as the derivative of g is nowhere zero.
Proof.The 2 + δ moment condition holds by taking δ = 0.5 and r = 0: Theorem 2 applies here, by verifying the regularity conditions as for Corr.6.Unlike the case of Corr.6, g −1 is not automatically locally Lipschitz, which is why Corr.7 explicitly makes this assumption.

Proof of Theorem 5
Define X n = g(µ n ); we will prove Theorem 5 by showing that the Markov chain X = {X n } n∈N with transition kernel T on state space R is ϕ-irreducible, aperiodic, and positive Harris recurrent with a geometric drift condition.Aperiodicity and ϕ-irreducibility are immediate since the Markov transition kernel has a (normal mixture) density that is positive on the whole real line.
Next, define the set A = [−M, M ] for some constant M > 0 to be chosen later; we will show that A is small, taking m = 1 and ν to be the uniform distribution on A in Definition 2. Let x = X 0 and write µ = g −1 (x).For any y > 0 Markov's inequality then gives In particular, for y = [4(d Then with probability at least 3/4, where a * is the operator * applied to a (e.g. a * = max{a, c} for Case 2).Then it is easy to see that ∃λ > 0 such that T (x, •) ≥ λν(•) for all x ∈ A.
Next we use the small set A to prove a drift condition.Taking the energy function V (x) = |x|, we have the following results.First we give the drift condition for x ∈ A: Proposition 8. Cases 1-3: There is some constant Then we give the drift condition for x / ∈ A, handling the cases x < −M and x > M separately: Proposition 9. Cases 2-3: There is some constant Case 1: For any ∈ (0, 1) there is some constant K 2 < ∞ such that for M large enough, E x V (X 1 ) ≤ (|ρ| + )V (x) + K 2 for all x < −M .
These results have the following intuition for Case 2: Prop.9 shows that for very negative X n−1 , |θ| controls the rate of drift, while Prop. 10 shows that for large positive X n−1 , |ρ| controls the rate of drift.The former result is due to the fact that for very negative values of X n−1 the autoregressive term in ( 13) is a constant, ρ(g(c) − γ), so the moving-average term dominates.The latter result is due to the fact that for large positive X n−1 , the distribution of Y For a perturbed version of the GARMA model with multiple lags (11), it may be possible to show geometric ergodicity of the multivariate Markov chain with state vector µ (σ) (n−max{p,q}+1):n .Again this could be done by finding a small set and energy function such that a drift condition holds, subject to appropriate restrictions on the parameters (ρ 1 , . . ., ρ p ) and (θ 1 , . . ., θ q ).

Stationarity of the Original Model
In this section we will show existence and uniqueness of the stationary distribution for the Poisson threshold model and the class of GARMA models.These results potentially form the foundation for broadly showing consistency and asymptotic normality of maximum likelihood estimators in these models.

The Poisson Threshold Model
We will illustrate the use of Feller properties to show that a discrete-valued time series model has a unique stationary distribution.For the Poisson threshold model ( 9) we first show existence of a stationary distribution.
Proof.We use Theorem 3. The space S = [ w 1−β−η , ∞) is a locally compact complete separable metric space with Borel σ-field.Let Y 0 (x) and µ 1 (x) denote the random variables Y 0 and µ 1 conditioned on µ 0 = x.Since Y 0 (x) = Pois(x) we have that Y 0 (x) converges in distribution to Y 0 (y) as x → y for any y ∈ S. Therefore µ 1 (x) converges in distribution to µ 1 (y) as x → y, proving that the chain {µ n } n∈N is weak Feller.The set A defined in Section 3 is compact, and a drift condition for this set is shown in that Section (the proof of the drift condition is valid in the case σ = 0).By Theorem 3 the chain {µ n } n∈N has a stationary distribution.
Next we show uniqueness of the stationary distribution.
Lemma 12.The chain {µ n } n∈N defined by ( 9) has at most one stationary distribution, provided that β < 1.
Proof.The space S is a Polish space.The point x = w 1−β−η is reachable, shown as follows.For any initial state µ 0 = y and any m the probability that Y n = 0 for all n = 0, . . ., m is strictly positive.So for any open set A containing x we can choose m large enough that µ m ∈ A with positive probability; therefore x is reachable.
The process {µ n } n∈N is asymptotically strong Feller; the proof is given in Appendix A.5, under the condition β < 1. Theorem 4 then implies that the process {µ n } has at most one stationary distribution.
Putting together Lemmas 11 and 12 we find that: Corollary 13.The mean process {µ n } n∈N of the Poisson threshold model defined by ( 9), under the restrictions (α + β + γ + η) < 1 and β < 1, has a unique stationary distribution π.When µ 0 is initialized according to π the Poisson threshold model {Y n } n∈N is strictly stationary.

The GARMA Model
First we show existence of a stationary distribution for the GARMA model (10) by using the weak Feller property.Let Y 0 (x) denote the random variable Y 0 conditioned on µ 0 = x.
Theorem 14.The process {µ n } n∈N specified by the GARMA model ( 10) has a stationary distribution, and thus is stationary for an appropriate initial distribution for µ 0 , under the conditions below.This implies that the GARMA model {Y n } n∈N is stationary when µ 0 is initialized appropriately.The conditions are: • g is bijective, increasing, and Case 1: g : R → R is concave on R + and convex on R − , and |ρ| < 1 Case 2: g : R + → R is concave on R + , and |ρ|, |θ| < 1 Case 3: |θ| < 1; no additional conditions on g : (0, a) → R.
Proof.We apply Theorem 3 to the chain {g(µ n )} n∈N to show that it has a stationary distribution; this implies the same result for the chain {µ n } n∈N .The state space S = R of {g(µ n )} n∈N is a locally compact complete separable metric space with Borel σ-field.A drift condition for {g(µ n )} n∈N is given in the proof of Theorem 5, for the compact set A = [−M, M ] (the proof of that drift condition holds when σ = 0).All that remains is to show that the chain {g(µ n )} n∈N is weak Feller.Let X n = g(µ n ).For X 0 = x we have that showing the weak Feller property.
Next we show uniqueness of the stationary distribution, using the asymptotic strong Feller property.We will assume that the distribution π z (•) of g(Y * n ) conditional on g(µ n ) = z varies smoothly and not too quickly as a function of z.By this we mean that π z (•) has the Lipschitz property sup Theorem 15.Suppose that the conditions of Thm. 14 and the Lipschitz condition (15) hold, and that there is some x ∈ R that is in the support of Y 0 for all values of µ 0 .Then there is a unique stationary distribution for {µ n } n∈N .
This result is proven in Appendix A.3.
The following two results give two classes of examples where Theorem 15 may be applied.The proofs of these results may be found in Appendix A.4.
The condition that g −1 be Lipschitz is equivalent to requiring that for y > x, g(y) − g(x) ≥ ζ(y −x) for some ζ > 0, and this condition in turn is equivalent to requiring that g be bounded away from zero when g is differentiable.The link function (12) satisfies this condition, while the log link does not.
We suspect that the Lipschitz condition in Theorem 15 can be weakened to a local Lipschitz condition, based on the fact that local Lipschitz is equivalent to Lipschitz on a compact space, and the fact that although the state space of g(µ n ) is not compact, we have a drift condition for the process {g(µ n )} n∈N which (informally) ensures that the chain stays in a limited part of the space.With the weaker local Lipschitz condition, Proposition 16 could be extended to link functions like the log link.
Proposition 17. Suppose that conditional on µ n , Y n is binomial with fixed number of trials a and mean µ n , the link function g : (0, a) → R is bijective and increasing, g −1 is Lipschitz, |θ| < 1 and c ∈ (0, 1).Then the process {µ n } n∈N defined in (10) has a unique stationary distribution π.Hence, when µ 0 is initialized according to π, the process {Y n } n∈N is strictly stationary.

A Appendix: Proofs
A.1 Proof of Theorem 2 Fixing y 0:n and letting Z = Z 0:n−1 be the perturbations, where the expectation is taken over Z, the data being fixed.Then we have where ).We will show that the supremum inside the expectation in ( 16) converges to 0 almost surely (in Z) as σ → 0; then bounded convergence implies that the expectation ( 16) itself converges to 0 as σ → 0, proving Thm. 2.
By assumption the function ψ ν (y; µ) is Lipschitz continuous in µ, and µ k (•) is Lipschitz continuous in some bounded neighborhood C of 0, uniformly in η ∈ K.In other words, there exists a finite constant L k such that, for any z, z ∈ C, Finally, we apply the usual telescoping-sum argument to conclude that the function By regularity condition (a), The fact that β k (•) is Lipschitz uniformly in η ∈ K for each k = 0, 1, . . ., n then ensures that n k=0 β k (•) is Lipschitz on C, uniformly in η ∈ K as desired.
Choose n large enough that for all w ∈ B(g(x * ), δ) we have To show that {Z n } n∈N is asymptotically strong Feller we will use the sequence of metrics d n defined by By Example 3.2 (1) in Hairer and Mattingly (2006) this is a totally separating system of metrics.We will also define t n = n.
An interesting property of the distance metric ( 7) is that if we take d(x, y) = 1 {x =y} then we get the total variation distance between the probability measures µ 1 , µ 2 .This is because in this case taking the supremum over {φ : Lip d φ = 1} is equivalent to taking the supremum over {φ : φ(x) ∈ [0, 1] ∀x ∈ R}.An analogous result is true for our choice of distance d n , that when taking the supremum over {φ : Lip dn φ = 1} it is sufficient to consider φ such that φ(x) ∈ [0, 1] ∀x ∈ R and Lip dn φ = 1.
Let Y n (z) and Z n (z) indicate the random variables Y n and Z n conditioned on Z 0 = z.By (15) we have π z (•) − π w (•) T V < B|z − w|.Using Proposition 3(g) of Roberts and Rosenthal (2004) ) then we can continue to "couple" the chains in the above way.Notice that the probability that the chains couple for all times 0, 1, . . . is at least 1 − B|z − w| ∞ n=0 |θ| n = 1 − |z−w|B 1−|θ| .Consider the distance T n (z, •)−T n (w, •) dn ; we will bound this by conditioning on whether or not the chains couple for all time.If they couple for all time, then |Z n (z) − Z n (w)| = |θ| n |z −w|.Due to this fact and the fact that it is sufficient to consider φ such that φ(x) ∈ [0, 1] for all x ∈ R, which converges to 0 as γ → 0. Therefore the process {Z n } n∈N is asymptotically strong Feller.

A.4 Proof of Propositions 16 and 17
In view of Corollary 6 it suffices to verify the two conditions stated in Theorem 15 to prove Proposition 16.
Zero is in the support of Y 0 for all values of µ 0 .To establish the Lipschitz condition (15), we use coupling theory as follows.Let Y n (z) denote Y n conditional on g(µ n ) = z.Suppose that z and w are two values of g(µ n ).Then the total variation distance between g(Y * n (z)) and g(Y * n (w)) is since g is invertible, and we can recover Y n from Y * n since c ∈ (0, 1).The coupling inequality, e.g., Thorisson (1995), ensures that The key point is that the joint distribution of X and Y is arbitrary.We choose X and Y in such a way that we can bound P (X = Y ) and therefore obtain a bound on the total-variation distance between X and Y .When this bound is Lipschitz, we then have the desired property.
So, suppose z > w.Let Y n (w) be Poisson distributed with mean g −1 (w).Let ξ be a Poisson random variable, independent of Y n (w), with mean g −1 (z)−g −1 (w), and set Y n (z) = Y n (w)+ξ.Then Let ζ be the Lipschitz constant for g −1 .Then ( 18) is bounded above by which is Lipschitz, with Lipschitz constant ζ, and this completes the proof of Proposition 16.
The proof of Proposition 17 follows exactly the same lines as Proposition 16 except for the coupling used.To this end, suppose that z > w and let Y n (z) be binomially distributed with parameters a (number of trials) and g −1 (z)/a (probability of success).Let Y n (w) be conditionally binomially distributed with parameters Y n (z) and g −1 (w)/g −1 (z), conditional on Y n (w).Then Y n (w) is (marginally) binomially distributed with mean g −1 (w), and .
The moment generating function of a binomial random variable X ∼ Bin(a, p) with a trials and probability p of success is E(e tX ) = (pe t + 1 − p) a .Taking e t = g −1 (w)/g −1 (z) gives

A.5 Proof that Model (9) is Asymptotically Strong Feller
The proof is nearly identical to that for the GARMA model given in Appendix A.3.However, it requires that 1 > max{β + η, β} = β.The necessary Lipschitz property referred to in that proof holds for the Poisson threshold model (9) since this model uses the identity link function.
To give more detail, let Z n = µ n and let π z (•) be the distribution of Y n conditional on Z n = z, i.e. π z = Pois(z).The proof of Prop.16 then implies that the Lipschitz condition (15) holds.As in Appendix A.3, use the system of metrics d n defined in (17) and define t n = n.Let Y n (z) and µ n (z) indicate the random variables Y n and µ n conditioned on µ 0 = z.We have π z (•)−π w (•) T V < B|z−w|.So we can construct Y 0 (z) and Y 0 (w) in such a way that they have the correct marginal distributions π z and π w , and Pr(Y 0 the probability that the chains "couple" in this way for all time is at least 1 − B|w−z| 1−β .The rest of the argument from Sec A.3 holds unchanged.

A.6 Proof of Proposition 8, Case 1
For readability we make the dependence of Y (σ) 0 on σ implicit.Recall that µ = g −1 (x), and assume WLOG that g(0) = 0, since replacing g(y) with h(y) = g(y) − g(0) simply changes the value of γ.Due to the fact that g is concave on R + and convex on R − , there are constants a 0 , a 1 ≥ 0 such that |g(y)| ≤ a 0 + a 1 |y| for all y.Using these facts, equation ( 13), and the triangle inequality, we can bound E x V (X 1 ) as follows, where d i denote bounded (in µ) constants for each i ≥ 3: By the triangle and Jensen's inequalities, A.7 Proof of Propositions 9 and 10, Case 1 We will prove Prop. 10 for Case 1; Prop.9 for Case 1 then holds by symmetry.We will show that for large x, the autoregressive part of the GARMA model dominates and the movingaverage portion of the model is negligible.In the bound (20), the autoregressive part of the model is captured by |ρ|E x |g(Y 0 )|, while the moving-average part corresponds to the term |θ|E x |g(Y 0 ) − x|.Since g(0) = 0 and g is monotonic increasing, for all x large enough by Jensen's inequality.Now, µ = g −1 (x) > 0 for x > 0, so using ( 14) as x → ∞.Thus, from ( 22), for any given > 0, there exists M > 0 so that for x > M , where the second inequality is due to concavity of g on R + .Next we show that the term E x |g(Y 0 ) − x| in ( 20) is "small" relative to the linear (in x) term: Proposition 18.There is some constant d 13 such that 2+δ)   for all x large enough.
A.8 Proof of Proposition 8 and Proposition 9, Case 2 Assume WLOG that g(c) = 0, since replacing g(y) with h(y) = g(y) − g(c) simply changes the value of γ.Since g(c) = 0, g(Y * 0 ) ≥ 0 is nonnegative for any Y * 0 .Also, due to the concavity of g, there is some a 1 > 0 such that g(y) ≤ a 1 y for all y ∈ R + .Using these facts, equation ( 13), and the triangle inequality, we can bound E x V (X 1 ) as follows: where µ = g −1 (x), implying that This is sufficient to get a uniform bound on E x V (X 1 ) for x ∈ [−M, M ], proving Prop.8.It also proves Prop.9 by showing that for x < −M , E x V (X 1 ) ≤ d 20 +|θ| |x|, since µ = g −1 (x) ≤ g −1 (0) on this set.
A.9 Proof of Proposition 10, Case 2 Using Jensen's inequality and the fact that P x (Y 0 < c) x→∞ −→ 0, for all x large enough Using a similar argument to (23) above, we see that the last two terms in the argument of g converge to 0 as x → ∞.Hence, for any > 0 we can find M > 0 so that, for all x > M , where d 21 is the slope of a subgradient of g at g −1 (M ).
Combining this with (25), there exists M > 0 such that for x > M , It remains to show that the final term in this expression is small relative to the linear (in V (x)) term as x → ∞.This follows in almost identical fashion to the proof of this result in Case 1.We omit the details.
proving the result.
(σ) 0 .As shown in Section 2, this implies that the time series model {Y (σ) n } n∈N is also stationary and ergodic.Stationarity of the original process {Y n } n∈N is addressed in Section 5.
)≤ b 1 (x)(d 1 µ r + d 2 ) 1/(2+δ) ≤ b 1 (x)(d 9 µ r/(2+δ) + d 10 ) (triangle inequality) = d 9 b 1 (x)µ r/(2+δ) + d 10 b 1 (x) } n∈N .In particular, showing that {µ n−p:n−1 } n≥p is ϕ-irreducible, aperiodic and positive Harris recurrent (defined below) implies that it has a unique stationary distribution π, and that if µ 0:p−1 ∼ π then {µ n−p:n−1 } n≥p is a stationary and ergodic process.That {Y n } n∈N is also stationary and ergodic is seen as follows.Conditional on {µ n } n∈N , the Y n are independent across n and each Y n has a distribution that is a function of only µ n:n+p (since Y n ∼ ψ ν (µ n ) and since the values µ n+1:n+p depend on Y n ).Therefore there is a deterministic function f such that one can simulate {Y n } conditional on {µ n } by: (a) generating an i.i.d.sequence of Uniform(0, 1) random variables U n , and (b) setting Y n = f (µ n:n+p , U n ).The multivariate process {(µ n−p:n−1 , U n )} n≥p is stationary and ergodic, and so Thm.36.4 of In this case it is sometimes possible to prove stationarity and ergodicity of {Y n } n∈N by first showing these properties for the multivariate Markov chain {µ n−p:n−1 } n≥p , then "lifting" the results back to the time series model {Y n