Nonasymptotic control of the MLE for misspecified nonparametric hidden Markov models

We study the problem of estimating an unknown time process distribution using nonparametric hidden Markov models in the misspecified setting, that is, when the true distribution of the process may not come from a hidden Markov model. We show that when the true distribution is exponentially mixing and satisfies a forgetting assumption, the maximum likelihood estimator recovers the best approximation of the true distribution within the model. We prove a finite sample bound on the resulting error and show that it is optimal in the minimax sense, up to logarithmic factors, when the model is well specified.


Introduction
Let (Y_1, ..., Y_n) be a sample following some unknown distribution P*. The maximum likelihood estimator can be formalized as follows: let {P_θ}_{θ∈Θ}, the model, be a family of possible distributions; pick a distribution P_θ̂ in the model which maximizes the likelihood of the observed sample.
In many situations, the true distribution may not belong to the model at hand: this is the so-called misspecified setting. One would like the estimator to give sensible results even in this setting. This can be done by showing that the estimated distribution converges to the best approximation of the true distribution within the model. The goal of this paper is to establish a finite sample bound on the error of the maximum likelihood estimator for a large class of true distributions and a large class of nonparametric hidden Markov models.
In this paper, we consider maximum likelihood estimators (shortened MLE) based on model selection among finite state space hidden Markov models (shortened HMM). A finite state space hidden Markov model is a stochastic process (X_t, Y_t)_t where only the observations (Y_t)_t are observed, such that the process (X_t)_t is a Markov chain taking values in a finite space and such that the variables Y_s are independent conditionally on (X_t)_t, with a distribution depending only on the corresponding X_s. The parameters of a HMM (X_t, Y_t)_t are the initial distribution and the transition matrix of (X_t)_t and the distributions of Y_s conditionally on X_s.
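The generative mechanism just described can be sketched in a few lines of code (an illustrative simulation; the two-state chain and Gaussian emissions below are made-up choices, not taken from the paper):

```python
import numpy as np

# Illustrative simulation of a finite state space HMM; the two-state chain
# and Gaussian emissions below are made-up choices, not from the paper.

rng = np.random.default_rng(0)

K = 2                                   # number of hidden states
pi = np.array([0.5, 0.5])               # initial distribution of X_1
Q = np.array([[0.9, 0.1],               # transition matrix of (X_t)_t
              [0.2, 0.8]])
means = np.array([-1.0, 2.0])           # emission parameter for each state

def simulate_hmm(n):
    """Draw (X_1..X_n, Y_1..Y_n): X is Markov, Y_t depends only on X_t."""
    X = np.empty(n, dtype=int)
    X[0] = rng.choice(K, p=pi)
    for t in range(1, n):
        X[t] = rng.choice(K, p=Q[X[t - 1]])
    Y = rng.normal(loc=means[X], scale=1.0)   # conditionally independent draws
    return X, Y

X, Y = simulate_hmm(1000)
```

Although each Y_t depends only on X_t, marginalizing out the hidden chain makes (Y_t)_t non-Markovian, which is exactly the dependence structure discussed above.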
HMMs have been widely used in practice, for instance in climatology (Lambert et al., 2003), ecology (Boyd et al., 2014), voice activity detection and speech recognition (Couvreur and Couvreur, 2000; Lefèvre, 2003) and biology (Yau et al., 2011; Volant et al., 2014), among others. One of their advantages is their ability to account for complex dependencies between the observations: despite the seemingly simple structure of these models, the fact that the process (X_t)_t is hidden makes the process (Y_t)_t non-Markovian.
Up to now, most theoretical work in the literature has focused on well-specified and parametric HMMs, where a smooth parametrization by a subset of R^d is available; see for instance Baum and Petrie (1966) for discrete state and observation spaces, Leroux (1992) for general observation spaces, and Douc and Matias (2001) and Douc et al. (2011) for general state and observation spaces. Asymptotic properties for misspecified models have been studied recently by Mevel and Finesso (2004) for consistency and asymptotic normality in finite state space HMMs and by Douc and Moulines (2012) for consistency in HMMs with general state space. Let us also mention Pouzo et al. (2016), who studied a generalization of hidden Markov models in a semi-misspecified setting. All these results focus on parametric models.
Few results are available on nonparametric HMMs, and all of them focus on the well-specified setting. Alexandrovich et al. (2016) prove consistency of a nonparametric maximum likelihood estimator based on finite state space hidden Markov models with nonparametric mixtures of parametric densities. Vernet (2015a,b) studies the posterior consistency and concentration rates of a Bayesian nonparametric maximum likelihood estimator. Other methods have also been considered, such as spectral estimators (Anandkumar et al., 2015b). To the best of our knowledge, there has been no result on convergence rates or finite sample error of the nonparametric maximum likelihood estimator, even in the well-specified setting.
The main result of this paper is an oracle inequality that holds as soon as the models have controlled tails. This bound is optimal when the true distribution is a HMM taking values in R. Let us give some details about this result.
Let us start with an overview of the assumptions on the true distribution P*. The first assumption is that the observed process is strongly mixing. Strong mixing assumptions can be seen as a strengthened version of ergodicity. They have been widely used to extend results on independent observations to dependent processes; see for instance Bradley (2005) and Dedecker et al. (2007) for surveys on strong mixing and weak dependence conditions. The second assumption is that the process forgets its past exponentially fast. For hidden Markov models, this forgetting property is closely related to the exponential stability of the optimal filter, see for instance Le Gland and Mevel (2000); Gerencsér et al. (2007); Douc et al. (2004, 2009). The last assumption is that the likelihood of the true process has sub-polynomial tails. None of these assumptions is specific to HMMs, thus making our result applicable to the misspecified setting.

To approximate a large class of true distributions, we consider nonparametric HMMs, where the parameters are not described by a finite dimensional space. For instance, one may consider HMMs with an arbitrary number of states and arbitrary emission distributions. Computing a maximizer of the likelihood directly in a nonparametric model may be hard or result in overfitting. The model selection approach offers a way to circumvent this problem. It consists in considering a countable family of parametric sets (S_M)_{M∈M}, the models, and selecting one of them. The larger the union of all models, the more distributions can be approximated. Several criteria can be used to select the model, such as the bootstrap, cross validation (see for instance Arlot and Celisse (2010)) or penalization (see for instance Massart (2007)). We use a penalized criterion, which consists in maximizing the function

(S, θ ∈ S) ↦ (1/n) log p_θ(Y_1, ..., Y_n) − pen_n(S),

where p_θ is the density of (Y_1, ..., Y_n) under the parameter θ and the penalty pen_n only depends on the model S and the number of observations n. Assume that the emission distributions of the HMMs (that is, the distributions of the observations conditionally on the hidden states) are absolutely continuous with respect to some known probability measure, and call emission densities their densities with respect to this measure. The tail assumption ensures that the emission densities have sub-polynomial tails, uniformly over all emission densities γ in the models, with a bound driven by a function n ↦ D(n). For instance, this assumption holds when all densities are upper bounded by e^{D(n)}. A key remark at this point is the dependency of D(n) on n: we allow the models to depend on the sample size. Typically, taking a larger sample makes it possible to consider larger models. A good choice is to take D(n) proportional to log n.
To stabilize the log-likelihood, we modify the models in the following way. First, only keep HMMs whose transition matrix is entrywise lower bounded by a positive function n ↦ σ_−(n). We show that taking this lower bound equal to (log n)^{−1} is a safe choice. Then, replace the emission densities γ by a convex combination of the original emission densities and of the dominating measure λ, with a weight that decreases polynomially with the sample size. In other words, replace γ by (1 − n^{−a})γ + n^{−a}λ for some a > 0. Taking a > 1 ensures that the component λ is asymptotically negligible. Any a > 0 works, but the constants of the oracle inequality depend on it.
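As a concrete (hypothetical) illustration of this stabilization step, the sketch below lower-bounds the transition matrix by mixing it with the uniform distribution, which is one simple way to enforce an entrywise floor, and mixes each emission density with the dominating measure λ with weight n^{−a}:

```python
import numpy as np

# Illustrative sketch (not the paper's construction) of the two steps:
# (i) floor the transition matrix at sigma_-(n) = 1/log(n) by mixing it
#     with the uniform distribution (requires K * sigma_-(n) <= 1),
# (ii) mix each emission density with the dominating measure lambda
#     (density 1 with respect to itself) with weight n^{-a}.

def stabilize(Q, gamma, n, a=2.0):
    """Return (Q', gamma') with all entries of Q' >= 1/log(n) and
    gamma'_x = (1 - n^-a) gamma_x + n^-a."""
    K = Q.shape[0]
    sigma = 1.0 / np.log(n)
    Qp = (1.0 - K * sigma) * Q + sigma        # rows still sum to 1
    eps = n ** (-a)
    gamma_p = [lambda y, g=g: (1.0 - eps) * g(y) + eps for g in gamma]
    return Qp, gamma_p

Q = np.array([[1.0, 0.0],
              [0.5, 0.5]])                    # some entries are zero
gamma = [lambda y: np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)]
Qp, gamma_p = stabilize(Q, gamma, n=1000)
```

After this transform, every emission density is lower bounded by n^{−a}, which is what makes the log-likelihood uniformly controllable.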
A simplified version of our main result (Theorem 8) is the following oracle inequality: for all α ≥ 1, there exist constants A and n_0 such that if the penalty is large enough, the penalized maximum likelihood estimator θ̂_n satisfies, for all t ≥ 1, η ∈ (0, 1) and n ≥ n_0, with probability larger than 1 − e^{−t} − n^{−α}, an inequality of the form

K(θ̂_n) ≤ (1 + η) inf_{K,M} ( inf_{θ∈S_{K,M,n}} K(θ) + pen_n(K, M) ) + residual term,

where K(θ) can be seen as a Kullback-Leibler divergence between the distributions P* and P_θ. In other words, the estimator recovers the best approximation of the true distribution within the model, up to the penalty and the residual term.
In the case where the true distribution is a HMM, it is possible to quantify the approximation error inf_{θ∈S} K(θ). Using the results of Kruijer et al. (2010), we show that the above oracle inequality is optimal in the minimax sense, up to logarithmic factors, for real-valued HMMs, see Corollary 12. This is done by taking HMMs whose emission densities are mixtures of exponential power distributions, which include Gaussian mixtures as a special case.
The paper is organized as follows. We detail the framework of the article in Section 2. In particular, Section 2.3 describes the assumptions on the true distribution, Section 2.4 presents the assumptions on the model and Section 2.5 introduces the Kullback-Leibler criterion used in the oracle inequality. Our main results are stated in Section 3. Section 3.1 contains the oracle inequality and Section 3.2 shows how it can be used to prove minimax adaptivity for real-valued HMMs. Section 4 lists some perspectives for this work.
One may wish to relax our assumptions depending on the setting. For instance, one could want to change the dependency of the functions B(n) and σ − (n) on n, change the tail conditions or the rate of forgetting. We give an overview of the key steps of the proof of our oracle inequality in Section 5 to make it easier to adapt our result.
Some proofs are postponed to the appendices. Appendix A contains the proof of the minimax adaptivity result and Appendix B contains the proof of the main technical lemma of Section 5.

Notations and assumptions
We will use the following notations:
• a ∨ b is the maximum of a and b, and a ∧ b their minimum;
• L²(A, A, µ) is the set of measurable and square integrable functions defined on the measured space (A, A, µ). We write L²(A, µ) when the sigma-field is not ambiguous;
• log is the natural logarithm, that is, the inverse of the exponential function exp.

Hidden Markov models
Finite state space hidden Markov models (HMM in short) are stochastic processes (X_t, Y_t)_{t≥1} with the following properties. The hidden state process (X_t)_t is a Markov chain taking values in a finite set X (the state space). We denote by K the cardinality of X, and by π and Q the initial distribution and transition matrix of (X_t)_t respectively. The observation process (Y_t)_t takes values in a Polish space Y (the observation space) endowed with a Borel probability measure λ. The observations Y_t are independent conditionally on (X_t)_t, with a distribution depending only on X_t. In the following, we assume that the distribution of Y_t conditionally on {X_t = x} is absolutely continuous with respect to λ with density γ_x. We call γ = (γ_x)_{x∈X} the emission densities.
Therefore, the parameters of a HMM are its number of hidden states K, its initial distribution π (the distribution of X_1), its transition matrix Q and its emission densities γ. When appropriate, we write p_{(K,π,Q,γ)} for the density of the process with respect to the dominating measure under the parameters (K, π, Q, γ). For a sequence of observations Y_1^n, we denote by l_n(K, π, Q, γ) the associated log-likelihood under the parameters (K, π, Q, γ), defined by l_n(K, π, Q, γ) = log p_{(K,π,Q,γ)}(Y_1^n).

We denote by P* the true (and unknown) distribution of the process (Y_t)_t, by E* the expectation under P*, by p* the density of P* with respect to the dominating measure and by l*_n the log-likelihood of the observations under P*. Let us stress that this distribution may not be generated by a finite state space HMM.
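The log-likelihood l_n(K, π, Q, γ) can be computed with the standard forward recursion; the sketch below is a generic, numerically stable implementation for illustration (not code from the paper):

```python
import numpy as np

# Generic, numerically stable forward recursion for the HMM log-likelihood
# l_n(K, pi, Q, gamma) = log p_{(K,pi,Q,gamma)}(Y_1^n). This is the standard
# forward algorithm, sketched here for illustration.

def log_likelihood(pi, Q, emission, Y):
    """pi: initial distribution (K,), Q: transition matrix (K, K),
    emission: function y -> vector of the K emission density values at y.
    Returns log p(Y_1, ..., Y_n)."""
    log_alpha = np.log(pi) + np.log(emission(Y[0]))   # forward variable
    for y in Y[1:]:
        m = log_alpha.max()                           # log-sum-exp trick
        log_alpha = m + np.log(np.exp(log_alpha - m) @ Q) + np.log(emission(y))
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())

# one-state sanity check: the recursion reduces to an i.i.d. log-likelihood
Y = np.array([0.0, 1.0])
normal = lambda y: np.array([np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)])
iid = log_likelihood(np.array([1.0]), np.array([[1.0]]), normal, Y)
```

With K = 1 the recursion reduces to Σ_t log γ_1(Y_t), which gives an easy sanity check of the implementation.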

The model selection estimator
Let (S_{K,M,n})_{K∈N*, M∈M} be a family of parametric models such that for all K ∈ N* and M ∈ M, the parameters (K, π, Q, γ) ∈ S_{K,M,n} correspond to HMMs with K hidden states. Note that the models S_{K,M,n} may depend on the number of observations n. Let us describe two ways to construct such models.
Mixture densities. Let {f_ξ}_{ξ∈Ξ} be a parametric family of probability densities indexed by Ξ ⊂ R^d. Let M ⊂ N*. We choose S_{K,M,n} to be the set of parameters (K, π, Q, γ) such that Q and π are uniformly lower bounded by (log n)^{−1} and, for all x ∈ [K], γ_x is a mixture of M densities of the family {f_ξ}_{ξ∈Ξ}.

L² densities. Let (E_M)_{M∈M} be a family of finite dimensional subspaces of L²(Y, λ). We choose S_{K,M,n} to be the set of parameters (K, π, Q, γ) such that Q and π are uniformly lower bounded by (log n)^{−1} and, for all x ∈ [K], γ_x is a probability density such that γ_x = g ∨ 0 for a function g ∈ E_M with ‖g‖₂ ≤ n.
In both cases, we took a lower bound on the coefficients of the transition matrix Q that tends to zero as the number of observations grows. This makes it possible to estimate parameters for which some coefficients of the transition matrix are small or zero. We prove in Theorem 8 that (log n)^{−1} is a good choice in general.
Since the true distribution does not necessarily correspond to a parameter of S_{K,M,n}, taking a larger model S_{K,M,n} will reduce the bias of the estimator (K, π̂_{K,M,n}, Q̂_{K,M,n}, γ̂_{K,M,n}). However, larger models will make the estimation more difficult, resulting in a larger variance. This means one has to perform a bias-variance tradeoff to select a model with a reasonable size. To do so, we select a number of states K̂_n among a set of integers K_n and a model index M̂_n among a set of indices M_n such that the penalized log-likelihood is maximal:

(K̂_n, M̂_n) ∈ argmax_{K∈K_n, M∈M_n} { (1/n) l_n(K, π̂_{K,M,n}, Q̂_{K,M,n}, γ̂_{K,M,n}) − pen_n(K, M) }

for some penalty pen_n to be chosen. We use the following notations.
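Schematically, the selection step looks as follows (hypothetical helper structure: `fitted` maps each pair (K, M) to the maximized log-likelihood of that model; the penalty below is a simplified stand-in proportional to the model dimension, not the penalty of Theorem 8):

```python
import numpy as np

# Schematic version of the selection rule above. `fitted` is a hypothetical
# structure mapping each pair (K, M) to the maximized log-likelihood l_n of
# model S_{K,M,n}; the penalty is a simplified illustrative choice.

def select_model(fitted, n, pen):
    """Return the (K, M) maximizing l_n / n - pen(K, M, n)."""
    return max(fitted, key=lambda km: fitted[km] / n - pen(*km, n))

pen = lambda K, M, n: (K * M + K**2) * np.log(n) / n   # dimension ~ KM + K^2
fitted = {(1, 1): -1450.0, (2, 2): -1380.0, (3, 5): -1378.0}
best = select_model(fitted, n=1000, pen=pen)
```

Note how the penalty arbitrates the bias-variance tradeoff: the largest model (3, 5) has the highest likelihood but is not selected because its dimension term dominates the likelihood gain.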
• S_n := ∪_{K∈K_n, M∈M_n} S_{K,M,n} is the set of all parameters involved in the construction of the maximum likelihood estimator;
• S^{(γ)}_{K,M,n} = {γ | (K, π, Q, γ) ∈ S_{K,M,n}} is the set of density vectors from the model S_{K,M,n}. S^{(γ)}_n is defined in the same way.

Assumptions on the true distribution
In this section, we introduce the assumptions on the true distribution of the process (Y_t)_{t≥1}. We assume that (Y_t)_{t≥1} is stationary, so that one can extend it into a process (Y_t)_{t∈Z}.

Forgetting and mixing
Let us state the two assumptions on the dependency of the process (Y t ) t .
[A⋆forgetting] There exist two constants C* > 0 and ρ* ∈ (0, 1) such that for all i ∈ Z and all k, k′ ∈ N*,

|log p*(Y_i | Y_{i−k}^{i−1}) − log p*(Y_i | Y_{i−k′}^{i−1})| ≤ C* ρ*^{(k∧k′)−1}   P*-almost surely.

For the mixing assumption, let us recall the definition of the ρ-mixing coefficient. Let (Ω, F, P) be a probability space and let A ⊂ F and B ⊂ F be two sigma-fields. Let

ρ(A, B) = sup { |Corr(f, g)| , f ∈ L²(Ω, A, P), g ∈ L²(Ω, B, P) }.

The ρ-mixing coefficient of (Y_t)_t is defined by

ρ_mix(n) = sup_{i∈Z} ρ( σ(Y_t, t ≤ i), σ(Y_t, t ≥ i + n) ).

[A⋆mixing] There exist two constants c* > 0 and n* ∈ N* such that for all n ≥ n*, ρ_mix(n) ≤ 4 e^{−c* n}.
[A⋆forgetting] ensures that the process forgets its initial distribution exponentially fast. This assumption is especially useful for truncating the dependencies in the likelihood.
[A⋆mixing] is a usual mixing assumption and is used to obtain Bernstein-like concentration inequalities. Note that [A⋆mixing] implies that the process (Y t ) t 1 is ergodic.
Even if [A⋆forgetting] is analogous to a ψ-mixing condition (see Bradley (2005) for a survey on mixing conditions) and is proved using the same tool as [A⋆mixing] in hidden Markov models, namely the geometric ergodicity of the hidden state process, these two assumptions are different in general. For instance, a Markov chain always satisfies [A⋆forgetting] but not necessarily [A⋆mixing]. Conversely, there exist processes satisfying [A⋆mixing] but not [A⋆forgetting].
Lemma 1 Assume that (Y_t)_t is generated by a HMM with a compact metric state space X (not necessarily finite) endowed with a Borel probability measure µ. Write Q* for its transition kernel and assume that Q* admits a density with respect to µ that is uniformly lower bounded and upper bounded by positive and finite constants σ*_− and σ*_+. Write (γ*_x)_{x∈X} for its emission densities and assume that they satisfy ∫ γ*_x(y) µ(dx) ∈ (0, +∞) for all y ∈ Y. Then [A⋆forgetting] and [A⋆mixing] hold with n* = 1 and constants depending only on σ*_− and σ*_+.
Proof This lemma follows from the geometric ergodicity of the HMM.
For [A⋆forgetting], the Doeblin condition implies a geometric contraction of the predictive distributions: for all distributions π and π′ on X, the filters started from π and π′ merge exponentially fast, which yields the forgetting inequality. For [A⋆mixing], let A ∈ σ(Y_t, t ≥ k) and B ∈ σ(Y_t, t ≤ 0) be such that P*(B) > 0. Taking π the stationary distribution of (X_t)_t and π′ the distribution of X_0 conditionally on B in the above contraction yields a bound on the φ-mixing coefficient (see Bradley (2005) for the definition of the φ-mixing coefficient and its relation to the ρ-mixing coefficient). One can check that the choice of c* and n* makes it possible to obtain [A⋆mixing] from this inequality.
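The contraction used in this proof is easy to observe numerically: two filters computed from different initial distributions on the same observations merge exponentially fast when the transition matrix satisfies a Doeblin condition. A made-up two-state example:

```python
import numpy as np

# Made-up two-state example illustrating the contraction used in the proof:
# two optimal filters started from different initial distributions, run on
# the same observations, merge geometrically when Q has a Doeblin lower bound.

rng = np.random.default_rng(2)
Q = np.array([[0.6, 0.4],
              [0.3, 0.7]])                    # all entries >= 0.3
dens = lambda y: np.array([np.exp(-(y - 1)**2 / 2),
                           np.exp(-(y + 1)**2 / 2)]) / np.sqrt(2 * np.pi)

def filter_step(mu, y):
    """Bayes update of the predictive distribution of the hidden state."""
    w = mu * dens(y)                          # correction by the likelihood
    return (w / w.sum()) @ Q                  # prediction of the next state

Y = rng.normal(size=200)                      # any observation sequence works
mu1, mu2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
gaps = []
for y in Y:
    mu1, mu2 = filter_step(mu1, y), filter_step(mu2, y)
    gaps.append(np.abs(mu1 - mu2).sum())      # L1 gap between the two filters
```

The gap typically falls below machine precision long before the end of the sequence, which is the exponential forgetting that Lemma 1 quantifies.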

Extreme values of the true density
We need to control the probability that the true density takes extreme values.
[A⋆tail] There exist two constants B* ≥ 1 and q ∈ [0, 1] such that for all i ∈ Z, k ∈ N and u ≥ 1,

P*( |log p*(Y_i | Y_{i−k}^{i−1})| ≥ B* u^q ) ≤ e^{−u}.

In practice, only two values of q are of interest. The case q = 0 occurs when the densities are lower and upper bounded by positive and finite constants. If the densities are not bounded, then q = 1 works in most cases and corresponds to subpolynomial tails. Indeed, the lower bound on log p*(Y_i | Y_{i−k}^{i−1}) is always true when taking q = 1 and B* = 1 by definition of the density p*, resulting in the following equivalent assumption:

[A⋆tail'] There exists a constant B* ≥ 1 such that for all i ∈ Z, k ∈ N and u ≥ 1, P*( log p*(Y_i | Y_{i−k}^{i−1}) ≥ B* u ) ≤ e^{−u}.

This can be obtained from Markov's inequality under a moment assumption, as shown in the following lemma.
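The fact that the lower tail requires no assumption can be checked numerically: for any density g with respect to a probability measure, P(g(Y) ≤ e^{−u}) = ∫ 1{g ≤ e^{−u}} g dλ ≤ e^{−u}. A Monte Carlo illustration with made-up distributions (Y standard Gaussian, λ standard Cauchy):

```python
import numpy as np

# Numerical illustration of the remark above: for any density g with respect
# to a probability measure lambda, P(g(Y) <= e^{-u}) <= e^{-u} automatically.
# Here Y ~ N(0, 1) and lambda is the standard Cauchy distribution, so
# g(y) = phi(y) / G_lambda(y) (an illustrative, made-up choice).

rng = np.random.default_rng(3)
Y = rng.normal(size=200000)
phi = np.exp(-Y**2 / 2) / np.sqrt(2 * np.pi)
g = phi * np.pi * (1 + Y**2)            # density of N(0,1) w.r.t. Cauchy

checks = []
for u in range(1, 6):
    freq = np.mean(g <= np.exp(-u))     # Monte Carlo estimate of the tail
    checks.append(freq <= np.exp(-u))
```

The empirical tail probabilities sit far below the bound e^{−u}, in agreement with the fact that only the upper tail of log p* carries an actual assumption.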

Model assumptions
We now state the assumptions on the models. Let us recall that the distribution of the observed process is not assumed to belong to one of these models. Consider a family of models (S K,M,n ) K∈N * ,M ∈M,n∈N * such that for each K, M and n, the elements of S K,M,n are of the form (K, π, Q, γ) where π is a probability density on [K], Q is a transition matrix on [K] and γ is a vector of K probability densities on Y with respect to λ.

Transition kernel
We need the following assumption on the transition matrices and initial distributions of S_n.

[Aergodic] There exists a positive function n ↦ σ_−(n) such that for all (K, π, Q, γ) ∈ S_n, all coefficients of Q and π are lower bounded by σ_−(n).
[Aergodic] is standard in maximum likelihood estimation. It ensures that the process forgets its past exponentially fast, which implies that the difference between the log-likelihood (1/n) l_n and its limit converges to zero with rate 1/n in supremum norm.

Tail of the emission densities
When (K, π, Q, γ) ∈ S_n, [Aergodic] implies that under the parameters (K, π, Q, γ), for all x ∈ [K], the probability to jump to state x at time t is at least σ_−(n), whatever the past may be. This implies that the density p_{(K,π,Q,γ)}(Y_t | Y_1^{t−1}) is lower bounded by σ_−(n) Σ_x γ_x(Y_t). Furthermore, it is upper bounded by Σ_x γ_x(Y_t). Thus, it is enough to bound this quantity to control p_{(K,π,Q,γ)} without having to handle the time dependency.
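This sandwich bound is straightforward to verify numerically with the forward filter; the example below uses made-up parameters with all transition entries at least σ_− = 0.15:

```python
import numpy as np

# Made-up numerical check of the sandwich bound above: when every entry of Q
# is at least sigma_-, the predictive distribution of the hidden state has
# entries >= sigma_-, hence
#   sigma_- * sum_x gamma_x(y) <= p(y | past) <= sum_x gamma_x(y).

rng = np.random.default_rng(1)
sigma_minus = 0.15
Q = np.array([[0.70, 0.30],
              [0.15, 0.85]])                  # all entries >= sigma_minus
pi = np.array([0.6, 0.4])
dens = lambda y: np.array([np.exp(-(y + 1)**2 / 2),
                           np.exp(-(y - 1)**2 / 2)]) / np.sqrt(2 * np.pi)

Y = rng.normal(size=50)
pred = pi                                     # predictive law of X_t given the past
ok = True
for y in Y:
    g = dens(y)
    p_cond = pred @ g                         # p(Y_t | Y_1^{t-1})
    ok &= bool(sigma_minus * g.sum() - 1e-12 <= p_cond <= g.sum() + 1e-12)
    pred = (pred * g) @ Q / p_cond            # filter, then predict X_{t+1}
```

The bound holds at every step because each column of Q sums the posterior against entries that are at least σ_−, so the prediction can never put less than σ_− mass on any state.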
We need to control the tails of Σ_x γ_x like we did for log p* in order to get nonasymptotic bounds. This is the purpose of the following assumption.
[Atail] There exist two constants q ∈ [0, 1] and B(n) ≥ 1 such that for all u ≥ 1 and all γ ∈ S^{(γ)}_n,

P*( |log Σ_x γ_x(Y_1)| ≥ B(n) u^q ) ≤ e^{−u}.

This assumption is often easy to check in practice, as shown in the following lemma.
Lemma 3 Assume that one of the two following assumptions holds:
1. (subpolynomial tails) there exists D(n) ≥ 1 such that for all u ≥ 1 and all γ ∈ S^{(γ)}_n, P*( Σ_x γ_x(Y_1) ≥ e^{D(n)u} ) ≤ e^{−u};
2. (bounded densities) there exists D(n) ≥ 1 such that all densities γ ∈ S^{(γ)}_n are upper bounded by e^{D(n)}.
Consider a new model where all γ are replaced by γ′ = (1 − n^{−a})γ + n^{−a} for a fixed constant a > 0. Then [Atail] holds for this new model with q = 1 (resp. q = 0 under the second assumption) and B(n) = D(n) ∨ (a log n).
Changing the densities as in the lemma amounts to adding a mixture component (with weight n^{−a} and distribution λ) to the emission densities to make sure that they are uniformly lower bounded. We shall see in the following that if a ≥ 1, then this additional component does not change the approximation properties of the models, see the proof of Corollary 12. This is in agreement with the fact that this component is asymptotically never observed as soon as a > 1.

Complexity of the approximation spaces
The following assumption means that as far as the bracketing entropy is concerned, the set of emission densities of the model S K,M,n (without taking the hidden state into account) behaves like a parametric model with dimension m M .
[Aentropy] There exist a function (M, K, D, n) ↦ C_aux ≥ 1 and a sequence (m_M)_{M∈M} ∈ N^M such that for all δ > 0, M, K and D,

N( S^{(γ)}_{K,M,n}, d_∞, δ ) ≤ C_aux(M, K, D, n) (D/δ)^{m_M},

where d_∞ is the distance associated with the supremum norm and N(A, d, ε) is the smallest number of brackets of size ε for the distance d needed to cover A. Let us recall that the bracket [a; b] is the set of functions f such that a(·) ≤ f(·) ≤ b(·), and that the size of the bracket [a; b] is d(a, b).
Note that we allow the models to depend on the sample size n, which can make C aux grow to infinity with n. To control the growth of the models, we use the following assumption.
[Agrowth] There exist ζ > 0 and n_growth such that for all n ≥ n_growth,

sup_{K,M : K ≤ n, m_M ≤ n} log C_aux(M, K, B(n)(log n)^q, n) ≤ n^ζ.
A typical way to check [Aentropy] is to use a parametrization of the emission densities, for instance a Lipschitz map [−1, 1]^{m_M} → S^{(γ)}_{K,M,n}. In this case, C_aux depends on the Lipschitz constant of the parametrization. An example of this approach is given in Section 3.2 for mixtures of exponential power distributions.

Limit and properties of the log-likelihood
In this section, we focus on the convergence of the log-likelihood. First, we recall results from Barron (1985) and Leroux (1992) that show the existence of its limit in a general setting. Then, we show how to control the difference between the log-likelihood and its limit using the assumptions from the previous sections.

Convergence of the log-likelihood
The first result comes from Barron (1985) and shows that the true log-likelihood converges almost surely with no assumption other than the ergodicity of the process (Y_t)_{t≥1}.
Lemma 4 (Barron (1985)) Assume that the process (Y_t)_{t≥1} is ergodic. Then there exists a quantity l* > −∞ such that (1/n) l*_n −→ l* P*-almost surely.

The second result follows from Theorem 2 of Leroux (1992). A careful reading of his proof shows that one can relax his assumptions to get the following lemma. Note that the definition of l_n extends naturally to the case where γ is not a vector of probability densities, or even a vector of integrable functions with respect to λ, through the formula

l_n(K, π, Q, γ) = log Σ_{x_1,...,x_n ∈ [K]} π(x_1) Π_{i=1}^{n−1} Q(x_i, x_{i+1}) Π_{i=1}^{n} γ_{x_i}(Y_i).

Lemma 5 (Leroux (1992)) Let K be a positive integer, γ a vector of K nonnegative and measurable functions, Q a transition matrix of size K and π a probability measure on [K]. Assume that the process (Y_t)_{t≥1} is ergodic and that E*[(log γ_x(Y_1))_+] < +∞ for all x ∈ [K]. Then:
1. There exists a quantity l(K, Q, γ) < +∞ which does not depend on π such that lim sup_n (1/n) l_n(K, π, Q, γ) ≤ l(K, Q, γ) P*-almost surely;
2. If in addition inf_{x∈[K]} π(x) > 0, then (1/n) l_n(K, π, Q, γ) −→ l(K, Q, γ) P*-almost surely; under an additional integrability assumption, the almost sure convergence also holds in L¹(P*).

When appropriate, we define K(K, Q, γ) by K(K, Q, γ) := l* − l(K, Q, γ). Note that when γ is a vector of probability densities, K(K, Q, γ) ≥ 0 since it is the limit of a sequence of Kullback-Leibler divergences: under the assumptions of Lemma 5, if inf_{x∈[K]} π(x) > 0,

K(K, Q, γ) = lim_n (1/n) KL( P*_{Y_1^n} ‖ P_{(K,π,Q,γ), Y_1^n} ),

where P*_{Y_1^n} and P_{(K,π,Q,γ), Y_1^n} denote the distributions of Y_1^n under P* and under the parameters (K, π, Q, γ) respectively.

Approximation of the limit
The following lemma controls the difference between the log-likelihood and its limit. When [A⋆forgetting] (resp. [Aergodic]) holds, the log-density of Y_i conditionally on the previous observations converges exponentially fast to what can be seen as the log-density of Y_i conditionally on the whole past, that is log p*(Y_i | Y_{−∞}^{i−1}) (resp. log p_{(K,Q,γ)}(Y_i | Y_{−∞}^{i−1})). Strictly speaking, we define the limits of the log-densities, L*_{i,∞} and L_{i,∞}(K, Q, γ), which can be seen respectively as log p*(Y_i | Y_{−∞}^{i−1}) and log p_{(K,Q,γ)}(Y_i | Y_{−∞}^{i−1}), where p_{(K,Q,γ)} is the density of a stationary HMM with parameters (K, Q, γ). When µ is the stationary distribution of the Markov chain under the parameter (K, Q, γ), we write L_{i,k}(K, Q, γ) for L_{i,k}(K, µ, Q, γ).
The third point follows from the ergodicity of (Y t ) t 1 under [A⋆mixing], from the integrability of L i,∞ and L * i,∞ under [Atail] and [A⋆tail] and from Lemmas 4 and 5.
Note that under the assumptions of point 3 of Lemma 6, one has K(K, Q, γ) = E*[L*_{1,∞} − L_{1,∞}(K, Q, γ)] (recall that γ is a vector of probability densities in this case), or with some notation abuses:

K(K, Q, γ) = E*[ log ( p*(Y_1 | Y_{−∞}^{0}) / p_{(K,Q,γ)}(Y_1 | Y_{−∞}^{0}) ) ].

Thus, K(K, Q, γ) can be seen as a Kullback-Leibler divergence that measures the difference between the distribution of Y_1 conditionally on the whole past under the parameter (K, Q, γ) and under the true distribution. It can be seen as the prediction error under the parameter (K, Q, γ).
In the particular case where the true distribution of (Y t ) t is a finite state space hidden Markov model, K characterizes the true parameters, up to permutation of the hidden states, provided the emission densities are all distinct and the transition matrix is invertible, as shown in the following result.
Lemma 7 (Alexandrovich et al. (2016), Theorem 5) Assume (Y_t)_t is generated by a finite state space HMM with parameters (K*, π*, Q*, γ*). Assume Q* is invertible and ergodic and that the emission densities (γ*_x)_x are all distinct. Then for all K ∈ N*, for all transition matrices Q of size K and for all K-tuples of probability densities γ, one has K(K, Q, γ) ≥ 0, with equality if and only if the parameters (K, Q, γ) generate the same distribution as the true parameters.

Oracle inequality for the prediction error
The following theorem states an oracle inequality on the prediction error of our estimator. It shows that with high probability, our estimator performs as well as the best model of the class in terms of Kullback-Leibler divergence, up to a multiplicative constant and up to an additive term decreasing as (log n)^{7+q}/n, provided the penalty is large enough. Let (K̂, π̂, Q̂, γ̂) = (K̂, π̂_{K̂,M̂,n}, Q̂_{K̂,M̂,n}, γ̂_{K̂,M̂,n}) be the nonparametric maximum likelihood estimator.
Then there exist constants A and C_pen depending only on α, C_σ, C_B, n* and c*, and a constant n_0 depending only on α, C_σ and C_B, such that for all n ≥ n_0, for all t ≥ 1 and for all η ∈ (0, 1], with probability at least 1 − e^{−t} − 2n^{−α}, the oracle inequality holds as soon as the penalty is large enough.

The proof of this theorem is presented in Section 5. Its structure and main steps are detailed in Section 5.1, and the proofs of these steps are gathered in Section 5.2.
Note that this theorem is not specific to one choice of the parametric models S_{K,M,n}: one can choose the type of model that best suits the densities one wants to estimate. In the following section, we use mixture models to estimate densities when Y is unbounded. If Y were compact, we could use L² spaces and this oracle inequality would still hold.
The powers of log n in the term (log n)^{7+q} come from:
• the limitation of the dependency to the log n most recent observations, which induces a factor (log n)²;
• the dependency of σ_−(n) and B(n) on n, each of them responsible for a factor (log n)²;
• truncating the emission densities (possible thanks to assumptions [Atail] and [A⋆tail]), which induces a factor (log n)^{2q};
• the use of a Bernstein inequality for exponentially α-mixing processes, which introduces a factor (log n)² compared to a Bernstein inequality for independent variables. However, together with the previous point (the truncation of the emission densities), these two points only induce a factor (log n)^{1+q}.
In the term (log n)² log log n of the penalty, a factor log n comes from the limitation of the dependency and a factor log n log log n from σ_−(n). Finally, the term (log n)^{3+q} in the penalty comes from the dependency of B(n) on n, from truncating the emission densities and from using a Bernstein inequality for exponentially α-mixing processes.

Minimax adaptive estimation using location-scale mixtures
In this section, we show that the oracle inequality of Theorem 8 makes it possible to construct an estimator that is adaptive and minimax up to logarithmic factors when the observations are generated by a finite state space hidden Markov model. To do so, we consider models whose emission densities are finite mixtures of exponential power distributions, and use an approximation result by Kruijer et al. (2010). Assume that (Y_t)_{t≥1} is generated by a stationary HMM with parameters (K*, Q*, γ*), which we call the true parameters. We consider the case Y = R endowed with the probability measure λ with density G_λ : y ↦ (π(1 + y²))^{−1} with respect to the Lebesgue measure. In order to quantify the approximation error by location-scale mixtures, we use the following assumptions from Kruijer et al. (2010).
(C1) (smoothness) There exist a polynomial L and a constant R > 0 such that, writing r for the largest integer smaller than β, each (γ*_x G_λ) is r times differentiable with derivatives controlled by L in a β-Hölder sense on neighborhoods of size R.
(C2) (tails) There exist positive constants c and τ controlling the tail decay of (γ*_x G_λ).
(C3) (monotonicity) (γ*_x G_λ) is positive and there exist y_m < y_M such that (γ*_x G_λ) is nondecreasing on (−∞, y_m) and nonincreasing on (y_M, +∞).
All these assumptions refer to the functions (γ * x G λ ), which are the densities of the emission distributions with respect to the Lebesgue measure. Hence, the choice of the dominating measure λ does not matter as far as regularity conditions are concerned.
Note that Kruijer et al. (2010) only assumed (C3) outside of a compact set. However, since the regularity assumption (C1) implies that (γ * x G λ ) is continuous, one can assume (C3) for all y without loss of generality.
It is important to note that even though we require some regularity on the emission densities, for instance through the polynomial L and the constants β and τ , we do not need to know them to construct our estimator, thus making it adaptive.
We consider the following models. Let p ≥ 2 be an even integer and let ψ be the density of the exponential power distribution with shape p, ψ : y ↦ e^{−|y|^p} / (2Γ(1 + 1/p)). Let M = N*. We take S_{K,M,n} as the set of parameters (K, π, Q, γ) such that Q and π are uniformly lower bounded by (log n)^{−1} and each γ_x is the density with respect to λ of a mixture of λ (with weight n^{−2}) and of M translations and dilatations of ψ. In other words, the emission densities are mixtures of λ (with weight n^{−2}) and of M translations and dilatations of ψ.
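For concreteness, here is a sketch of one such emission density as a function on R (taken with respect to the Lebesgue measure; the mixture weights, locations and scales are made up, ψ is assumed to be the standard exponential power density and λ the standard Cauchy distribution):

```python
import numpy as np
from math import gamma as Gamma

# Sketch of one emission density of these models, as a density on R with
# respect to the Lebesgue measure. The weights, locations and scales are
# made up; psi is taken as the standard exponential power density and
# lambda as the standard Cauchy distribution (density G_lambda).

p = 4                                          # even integer, p >= 2
psi = lambda y: np.exp(-np.abs(y) ** p) / (2 * Gamma(1 + 1 / p))
G_lambda = lambda y: 1.0 / (np.pi * (1 + y ** 2))

def emission_density(y, weights, locs, scales, n):
    """(1 - n^-2) * sum_j w_j psi((y - mu_j)/s_j)/s_j + n^-2 * G_lambda(y)."""
    mix = sum(w * psi((y - m) / s) / s
              for w, m, s in zip(weights, locs, scales))
    return (1 - n ** -2) * mix + n ** -2 * G_lambda(y)

# sanity check: the density integrates to ~1 (Riemann sum on a fine grid)
ys = np.linspace(-30.0, 30.0, 400001)
vals = emission_density(ys, [0.3, 0.7], [-1.0, 2.0], [0.5, 1.0], n=100)
total = vals.sum() * (ys[1] - ys[0])
```

Gaussian mixtures correspond to the special case p = 2, up to the normalization of ψ.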
The second point follows from the fact that the densities γ * x are uniformly bounded under (C3) and by taking δ large enough in Lemma 2.
[Aergodic] holds by definition of the models. See Section A.1.1 for the proof of the last two points.
The results of this section remain the same when the weight of λ in the emission densities of S K,M,n is allowed to be larger than n −2 instead of being exactly n −2 .
Corollary 12 (Minimax adaptive estimation rates) Assume (C1)-(C4) hold. Also assume that inf Q* > 0. Then there exists a constant C > 0 such that for all M ≥ 3 and n ≥ M, the approximation error of the models is controlled polynomially in M. Hence, using Theorem 8 with pen_n(K, M) = (KM + K²)(log n)^{15}/n, there exists a constant C such that almost surely, there exists a (random) n_0 from which the estimator converges at the minimax rate, up to logarithmic factors.

Proof  See Section A.1.3.
This result shows that our estimator reaches the minimax rate of convergence proved by Maugis-Rabusseau and Michel (2013) for density estimation in Hellinger distance, up to logarithmic factors. Since estimating a density is the same thing as estimating a one-state HMM, this means that our result is adaptive and minimax up to logarithmic factors when K* = 1. As far as we know, whether increasing the number of states improves the minimax rates of convergence is still an open problem. It seems reasonable to think that it does not, which would imply that our estimator is in general adaptive and minimax.

Perspectives
The main result of this paper is a guarantee that maximum likelihood estimators based on nonparametric hidden Markov models give sensible results even in the misspecified setting, and that their error can be controlled nonasymptotically. Two properties of both the models and the true distributions are at the core of this result: a mixing property and a forgetting property, which can be seen as a local dependence property.
These two properties are not specific to hidden Markov models. Therefore, it is likely that our result can be generalized to many other models and distributions. To name a few, one could consider hidden Markov models with continuous state space as studied in Douc and Matias (2001) or Douc et al. (2011), or more generally partially observed Markov models, see for instance Douc et al. (2016) and references therein. Special cases of partially observed Markov models are HMMs with autoregressive properties (Douc et al., 2004) and models with time inhomogeneous Markov regimes (Pouzo et al., 2016). One could also consider hidden Markov fields (Kunsch et al., 1995) and graphical models to generalize to distributions more general than time processes.
Another interesting approach is to consider other forgetting and mixing assumptions. For instance, Le Gland and Mevel (2000) state a more general version of the forgetting assumption where the constant is replaced by an almost surely finite random variable, and Gerencsér et al. (2007) give conditions under which the moments of this random variable are finite. Other mixing and weak dependence conditions have also been introduced in the literature with the hope of describing more general processes, see for instance Dedecker et al. (2007).

Overview of the proof
By definition of (K̂, π̂, Q̂, γ̂), one has for all K a basic penalized-likelihood comparison inequality. Now, assume that with high probability, a deviation bound of the form (2) holds for all K, M and (K, π, Q, γ) ∈ S_{K,M,n}, for some constant η ∈ (0, 1/2), some penalty pen_n and some residual term R_n. The above inequality leads to

(1 − η) K(K̂, Q̂, γ̂) ≤ (1 + η) K(K, Q_{K,M}, γ_{K,M}) + 2 pen_n(K, M) + 2 R_n,

and the oracle inequality follows by noticing that (1 + η)/(1 − η) ≤ 1 + 4η and 1/(1 − η) ≤ 2 when η ∈ (0, 1/2). Let us now prove equation (2). The following remark will be useful in our proofs: one has for all k, µ and (K, π, Q, γ) ∈ S_n, and finally for all k, k′ ∈ N*, all probability distributions µ, µ′ and all (K, π, Q, γ), a uniform forgetting bound. Approximate ν(K, π, Q, γ) by the deviation of the truncated functionals t^{(D)}_{(K,Q,γ)}, where D > 0 and their supremum norm is bounded by 2D + log(1/σ_−(n)) thanks to equation (3). Considering these functions t^{(D)}_{(K,Q,γ)} has two advantages. The first is to limit the time dependency to an interval of length k, which makes it possible to use the forgetting property of the process (Y_t)_{t∈Z}. The second is to consider bounded functionals of this process, for which one can get Bernstein-like concentration inequalities. The error of this approximation is given by the following lemma.

Lemma 13 Assume that σ_−(n) ≤ (1 − ρ*)/(2 − ρ*) ∧ 1/(1 + C*). Then for all u ≥ 1, with probability greater than 1 − 2ne^{−u}, the approximation error is controlled for all (K, π, Q, γ) ∈ S_n. Proof in Section 5.2.2.
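The two elementary bounds invoked at the end of this derivation can be checked directly. For η ∈ (0, 1/2),

```latex
\frac{1+\eta}{1-\eta} \le 1+4\eta
\;\Longleftrightarrow\;
1+\eta \le (1+4\eta)(1-\eta) = 1+3\eta-4\eta^2
\;\Longleftrightarrow\;
0 \le 2\eta(1-2\eta),
```

which holds on (0, 1/2]; and 1/(1 − η) ≤ 2 is immediate since 1 − η ≥ 1/2.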
The following theorem is our main technical result. It shows that ν̂_k(t^{(B(n)u^q)}_{(K,Q,γ)}) can be controlled uniformly over all models with high probability.
Theorem 14 Assume [Aergodic], [Aentropy] and [A⋆mixing]. Also assume that there exists n₁ such that for all n ≥ n₁ and all K ≤ n and M such that m ≤ M ≤ n, the required entropy bound holds. Let (w_M)_{M∈M} be a sequence of positive numbers such that Σ_M e^{−w_M} ≤ e − 1. Then there exist constants C_pen and A depending on n* and c* and a numerical constant n₀ such that for all ǫ > 0 and n ≥ n₁ ∨ n₀, the following holds.
Let pen_n be a function satisfying, for all K ≤ n and M such that m ≤ M ≤ n, the lower bound above. Then for all s > 0, with probability larger than 1 − e^{−s}, the deviation bound holds for all K ≤ n ∧ 1/(2σ_−(n)), all M such that m ≤ M ≤ n and all (K, π, Q, γ) ∈ S_{K,M,n}. Proof in Section B.
The last step is to control the variance term E*[t^{(D)}_{(K,Q,γ)}(Z₀)²]. For all k satisfying the condition of Lemma 15, one has for all D > 0, v ≥ log n and (K, π, Q, γ) ∈ S_n the bound stated there. Now that the main lemmas have been stated, let us show how the assumptions of Theorem 8 lead to the desired oracle inequality.
Let C_σ and C_B be two positive constants defining σ_−(n) and B(n). Let α ≥ 0. In order to have ne^{−u} ≤ n^{−α}, take u = (1 + α) log n.
Note that u ≥ 1 for all n ≥ 3. The assumptions on v and k are v ≥ log n and k ≥ (1/σ_−(n)) log n + 2 log(1/σ_−(n)) (note that the assumption on k entails ρ^{k−1} ≤ (1 − ρ)/n). Thus, there exists an integer n₀ depending on C_σ such that if n ≥ n₀, these assumptions hold for k = ⌈(2/C_σ)(log n)²⌉ and v = log n.
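The choice u = (1 + α) log n above gives exactly the desired polynomial probability:

```latex
n\,e^{-u} \;=\; n\,e^{-(1+\alpha)\log n} \;=\; n \cdot n^{-(1+\alpha)} \;=\; n^{-\alpha}.
```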

Upper bounds for the moments of the tails
Let W be a nonnegative random variable such that for all u ≥ 0, P*(W ≥ u^q) = e^{−u} (if q > 0; otherwise W = 0). Assumption [Atail] implies that there exists a coupling of W and sup_{γ∈S^{(γ)}_n} |b_γ(Y₁)| such that on the event {sup_{γ∈S^{(γ)}_n} |b_γ(Y₁)| ≥ B(n)}, one has sup_{γ∈S^{(γ)}_n} |b_γ(Y₁)| ≤ B(n)W P*-almost surely. Therefore, controlling the moments of W is enough to control the moments of sup_{γ∈S^{(γ)}_n} |b_γ(Y₁)|. Lemma 16 For all u ≥ 1, the stated bounds hold; likewise, by integration by parts, one gets the moment bound, which is enough to conclude.
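Under the tail condition on W stated above, its moments can be computed explicitly by the change of variable t = u^q. This generic computation is an illustration of the integration-by-parts argument, not the exact statement of Lemma 16 (which is elided here):

```latex
\mathbb{E}^*[W^m]
\;=\; \int_0^\infty m\,t^{m-1}\,\mathbb{P}^*(W \ge t)\,dt
\;=\; \int_0^\infty m\,u^{q(m-1)}\,e^{-u}\,q\,u^{q-1}\,du
\;=\; qm\,\Gamma(qm) \;=\; \Gamma(qm+1).
```

In particular all moments of W are finite, which is what is needed to control the moments of sup_{γ∈S^{(γ)}_n} |b_γ(Y₁)|.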

Proof of Lemma 13 (approximating the likelihood)
Using Lemma 6 and [A⋆forgetting], one gets the stated comparison. We restrict ourselves to the event where sup_{γ∈S^{(γ)}_n} |b_γ(Y₁)|/B(n) and |L*_{0,k}|/B* can be upper bounded by the random variable W defined in Section 5.2.1, which means that for all u ≥ 1, the deviation bound holds as soon as B(n) ≥ B*, which concludes the proof.

Proof of Lemma 15 (controlling the variance residual)
Lemma 17 holds, where H(P, Q) is the Hellinger distance between P and Q, for any τ > 0. Let v be a real number such that 2B(n)v^q = log(1/λ) − log(1/σ_−(n)); then [Atail] and [A⋆tail] imply that sup_{γ′∈S^{(γ)}_n} |b_{γ′}(Y₀)|/B(n) and |L*_{0,∞}|/B* can be upper bounded by the random variable W defined in Section 5.2.1, which means that the deviation bound holds for all v ≥ 1 as soon as B(n) ≥ B*, by taking τ = 1/3. Therefore, for all v ≥ 1 such that the real number λ defined by 2B(n)v^q = log(1/λ) − log(1/σ_−(n)) satisfies λ ≤ e^{−4} (i.e. 2B(n)v^q ≥ 4 − log(1/σ_−(n)), which holds as soon as v ≥ 1 and σ_−(n) ≤ e^{−1}), one concludes using that the Kullback-Leibler divergence is lower bounded by the squared Hellinger distance, and the lemma is proved by dividing both sides by 3(2B(n)v^q + log(1/σ_−(n))).

The next step is the control of the difference between V(K, Q, γ) and E*[t^{(D)}_{(K,Q,γ)}(Z₀)²], using Lemma 6, equation (4), Lemma 16, B(n) ≥ B* and the condition on σ_−(n) (which implies ρ ≤ ρ* and 1/(1 − ρ) ≤ C*). Therefore, under the assumptions of Lemma 17, the announced control holds. Let us take k ≥ −(log n)/(log ρ) + log(1 − ρ)/(log ρ) + 1 and v ≥ log n, so that ρ^{k−1} ≤ (1 − ρ)/n. Therefore, the condition on k holds as soon as stated, using that log log x ≤ (log x)/e for all x > 1 and that e(1 − 1/e) ≥ 1. Therefore, for all k satisfying equation (9), for all D > 0 and v ≥ log n, the bound holds, which concludes the proof.

Moreover, for all y ∈ Y and γ ∈ S^{(γ)}_n, where we recall that the maximum is taken over µ ∈ [−n, n] and s ∈ [1/M, 1]. By the choice of K_n and M_n, one also has K ≤ n and m ≤ M ≤ n, i.e. M ≤ n/2. If y ∈ [−n, n],

b_γ(y) ≤ log n + 0 ∨ log(1 + y²) + log M + log π ≤ log n + 0 ∨ log(1 + n²) + log(n/2) + log π ≤ 4 log n + log(πe/2) ≤ 5 log n

as soon as n ≥ 5. Otherwise, one can take y ≥ n, and then

b_γ(y) ≤ log n + 0 ∨ (−(y − n)^p + log(1 + y²) + log M + log π) ≤ log n + 0 ∨ (−(y − n)^p + log(1 + 2(y − n)² + 2n²) + log M + log π) ≤ log n + 0 ∨ (−Y^p + log(1 + 2Y²) + log(2n²) + log(n/2) + log π)

by writing Y = y − n and using that log(a + b) ≤ log a + log b when a, b ≥ 1.
Since max_{Y≥0}(−Y^p + log(1 + 2Y²)) ≤ log 3 as soon as p ≥ 2, one gets b_γ(y) ≤ 4 log n + log(3π) ≤ 5 log n as soon as n ≥ 10.
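The bound max_{Y≥0}(−Y^p + log(1 + 2Y²)) ≤ log 3 for p ≥ 2 can be verified by splitting at Y = 1:

```latex
0 \le Y \le 1:\quad -Y^p + \log(1+2Y^2) \;\le\; \log 3,
\qquad
Y \ge 1:\quad -Y^p + \log(1+2Y^2) \;\le\; -Y^2 + \log(3Y^2) \;=\; \log 3 - (Y^2 - 2\log Y) \;\le\; \log 3,
```

where the last step uses that Y² − 2 log Y ≥ 1 for Y ≥ 1 (its value at Y = 1 is 1 and its derivative 2Y − 2/Y is nonnegative there).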

Checking [Aentropy] and [Hgrowth]
Let us first assume that there exists a constant L_p such that the function (µ, s) ↦ s^{−1}ψ(s^{−1}(y − µ)) is L_p-Lipschitz for all y (where the origin space is endowed with the supremum norm). Then a bracket covering of size ǫ of ([−n, n] × [1/M, 1])^M provides a bracket covering of {γ(·|x)}_{γ∈S^{(γ)}_n, x∈[K]} of size L_p ǫ. Since there exists a bracket covering of size ǫ of [−n, n] × [1/M, 1] for the supremum norm with fewer than ((4n/ǫ) ∨ 1)² brackets, one gets [Aentropy] by taking C_aux = 4L_p n and m_M = 2M.
Let us now check that this constant L_p exists. Bounding the partial derivatives ∂/∂µ and ∂/∂s of (µ, s) ↦ s^{−1}ψ(s^{−1}(y − µ)) over µ ∈ [−n, n] and s ∈ [1/M, 1] shows that, as soon as p ≥ 2, one can take L_p = pn², which corresponds to C_aux = 4pn³. With this C_aux, checking [Hgrowth] is straightforward for all ζ > 0.

A.1.2 Proof of Lemma 11 (approximation rates)
Let F(y) = e^{−c|y|^τ}. Lemma 4 of Kruijer et al. (2010) ensures that there exist c′ > 0 and H ≥ 6β + 4p such that for all x ∈ [K*] and u > 0, there exists a mixture g_{u,x} with O(u^{−1}|log u|^{p/τ}) components, each with density (1/u)ψ((· − µ)/u) with respect to the Lebesgue measure for some µ ∈ {y | F(y) ≥ c′u^H}, such that g_{u,x} approximates the emission density γ*_x. Take s = u|log u|^{−1−p/τ}. When |µ| ≥ s^{−1}, one has F(µ) ≤ exp(−cs^{−τ}) = o(c′s^H). Thus, for s small enough, all translation parameters µ belong to [−s^{−1}, s^{−1}]. Moreover, by definition of s, the mixture g_{u,x} contains fewer than s^{−1} components when s is small enough. Finally, we use the resulting approximation bound. Taking s^{−1} = M and g_{M,x} = g_{u,x} concludes the proof.

A.1.3 Proof of Corollary 12 (minimax adaptive estimation rate)
Denote by h the Hellinger distance, defined by h(p, q)² = E_P[(√(q/p) − 1)²] for all probability densities p and q associated with probability measures P and Q. Let H(K, Q, γ) be the Hellinger distance between the distributions of Y₁ conditionally on Y_{−∞}^0 under the true distribution and under the parameters (K, Q, γ) (see Lemma 6 for the definition of these conditional distributions).
The following lemma shows that the Kullback-Leibler divergence and the Hellinger distance are equivalent up to a logarithmic factor and a small additive term.
Then there exists a constant n₁ depending on C_B and C_σ such that for all n ≥ n₁ ∨ exp(B*/C_B), the stated comparison holds for all (K, Q, γ) ∈ S_n. Proof The lower bound comes from the fact that the square of the Hellinger distance is smaller than the Kullback-Leibler divergence. For the upper bound, we use Lemma 4 of Shen et al. (2013): for all v ≥ 4 and for all probability measures P and Q with densities p and q,

KL(p ‖ q) ≤ h²(p, q)(1 + 2v) + 2 E_P[(log(p/q)) 1{log(p/q) ≥ v}].

Apply this inequality to the conditional densities associated with the true distribution and with (K, Q, γ). Then log(p/q) ≤ |b_γ| + |L*_{1,∞}| + log(1/σ_−(n)), and the indicator can be bounded accordingly. Taking v = 2C_B(log n)², one gets that there exists n₁ depending only on C_B and C_σ such that for all n ≥ n₁, v + log(1/σ_−(n)) ≤ (C_B log n)² and 1 + 2v ≤ 5C_B(log n)², so that the first term is controlled. Note that [A⋆tail] also holds for L*_{1,∞} using the uniform convergence of Lemma 6. This implies that P*(|L*_{1,∞}| ≥ C_B(log n)²) ≤ exp(−log n) = n^{−1}, since C_B log n ≥ B* for n ≥ exp(B*/C_B). Likewise, [Atail] implies that P*(|b_γ| ≥ C_B(log n)²) ≤ n^{−1}. The last expectation of the above equation can be written in terms of a = |L*_{1,∞}| and b = |b_γ|. Then, note that the first resulting term is at most 6C_B(log n)²/n, using C_B log n ≥ B* and Lemma 16, and the second is controlled by [Atail]. Likewise for the remaining terms, so that finally the upper bound follows, which concludes the proof.
Let M ∈ N * . Let g M,x be the approximating densities given by Lemma 11 and write γ M,x = n −2 + (1 − n −2 )g M,x for all x ∈ [K * ]. The following lemma controls the error H(K * , Q * , (γ M,x ) x ) coming from the approximation of the densities.
Lemma 20 Assume σ_−(n) ≤ inf Q*; then the stated bound holds. Thus, one needs to control the expectation of the second term. Since p_x and p*_x belong to [σ_−(n), 1] by the lower bound on their transition matrices, one has the pointwise comparison. The following equation follows from a careful reading of the proof of Proposition 2.1 of De Castro et al. (2017), noticing that the roles of γ* and γ_M are symmetrical in their proof.
Therefore, using the Cauchy-Schwarz inequality, one gets the stated bound, which concludes the proof of the lemma.
One has for all x the stated control. Therefore, since σ_−(n) = C_σ(log n)^{−1} and (1 − ρ)^{−1} ≤ (σ_−(n))^{−1}, there exists a constant C such that for all n ≥ 3, the approximation error is bounded as announced, by definition of the densities g_{M,x}. The choice of penalty verifies the lower bound of Theorem 8. Thus, the oracle inequality of Theorem 8 with η = 1, α = 2 and t = 2 log n entails that for n large enough and for any sequence (M_n)_n such that M_n ≤ n/2 for all n, the risk bound holds. Taking M_n ∼ n^{1/(2β+1)}(log n)^{(2β(1+p/τ)−6)/(2β+1)}, one gets the announced rate.

Appendix B. Proof of the control of ν̂_k (Theorem 14)
Let us give an overview of the proof of the control of ν̂_k. The first step of the proof is to obtain a Bernstein inequality on ν̂_k(t) for a single function t. This is done using the mixing properties of the process (Y_i)_i and by noticing that ν̂_k(t) is the deviation of an empirical mean.
The second step is to transform the inequality on one function t into an inequality on the supremum over all functions t belonging to a given class. This step involves the bracketing entropy of the aforementioned class. The control of this entropy is where the shape of the penalty appears.
At this stage, one is able to upper bound the supremum of ν̂_k(t^{(D)}_{(K,Q,γ)}) over all parameters (K, π, Q, γ) ∈ S_{K,M,n}. However, this upper bound is of order n^{−1/2} (up to logarithmic factors), which is suboptimal. The third step of the proof gets rid of the n^{−1/2} term by considering the supremum processes W_{K,M,n}. In the rest of this section, we omit the dependency of σ_−, B, W_{K,M}, x_{K,M} and S_{K,M} on n in the notations. We also introduce the notation θ = (K, π, Q, γ) for (K, π, Q, γ) ∈ S_n to shorten notations. Given θ ∈ S_n, we write π_θ, Q_θ and γ_θ for its components.

B.1 Concentration inequality
First, let us introduce some notations. Let D > 0, K ≥ 1, M ∈ M and k ≥ 1. For all σ > 0, define the sets B_σ. Let N(A, d, ǫ) denote the minimal cardinality of a covering of A by brackets of size ǫ for the semi-distance d, that is, by sets [t₁, t₂] = {t : Y^k → R, t₁(·) ≤ t(·) ≤ t₂(·)} such that d(t₁, t₂) ≤ ǫ. H(A, d, ·) is called the bracketing entropy of A for the semi-distance d.
The first step of the proof is to obtain a Bernstein inequality for the deviations of a single function t^{(D)} evaluated along the process (Z_i)_i.
Theorem 21 Assume [A⋆mixing] holds. Then there exists a constant C mix depending on c * and n * such that the following holds.
Let t be a real-valued, measurable, bounded function on Y^{k+1}. Let V = E*[t²(Z₀)]. Then for all λ ∈ (0, 1/(C_mix(n* + k + 1)‖t‖_∞(log n)²)) and for all n ∈ N, the stated concentration inequality holds. The following result is a Bernstein inequality for exponentially α-mixing processes.
Lemma 22 (Merlevède et al. (2009), Theorem 2) Let (A_i)_{i≥1} be a stationary sequence of centered real-valued random variables such that ‖A₁‖_∞ ≤ M and whose α-mixing coefficients satisfy, for a certain c > 0, an exponential decay condition. Then there exist positive constants C₁ and C₂ depending on c such that the deviation bound holds for all n ≥ 2 and all λ in the admissible range. Assumption [A⋆mixing] implies that the α-mixing coefficients of (Y_i)_i satisfy α_mix(n) ≤ e^{−c*n} for all n ≥ n*, since 4α_mix(n) ≤ ρ_mix(n) (see for instance Bradley (2005)). However, this is not enough to apply the previous result: one needs the inequality to hold for all n (and not only for n larger than some constant) and for the process (Z_i)_i. To do so, we partition the process (Z_i)_i into several processes for which the above result applies, and then gather the inequalities.
Consider the processes (Z_{i(n*+k+1)+j})_i with α-mixing coefficients α_{Z,j}(n). By construction, they satisfy α_{Z,j}(n) ≤ e^{−c*n*n} for all n ≥ 1 and j ∈ {1, . . . , n* + k + 1}. Applying Lemma 22, one gets that there exist two positive constants C₁ and C₂ depending on c* and n* such that the corresponding inequality holds for all functions t, all λ ∈ (0, 1/(C₁M(log n)²)) and all n ∈ N. Gathering the inequalities for the n* + k + 1 processes using the generalized Hölder inequality E[A₁ ⋯ A_k] ≤ ∏_{i=1}^k (E A_i^k)^{1/k}, valid for any positive integer k and any positive random variables (A_i)_{1≤i≤k}, one gets the result, which concludes the proof.
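The mixing rate claimed for the subsampled processes can be checked as follows. This is a sketch of the index computation: since Z_i = (Y_i, . . . , Y_{i+k}), two variables of the subsequence (Z_{i(n*+k+1)+j})_i at lag n involve observation windows separated by at least

```latex
n(n^* + k + 1) - k \;=\; n\,n^* + n(k+1) - k \;\ge\; n\,n^*
\qquad (n \ge 1)
```

observations, whence α_{Z,j}(n) ≤ α_mix(n n*) ≤ e^{−c* n* n}.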
The following result follows mutatis mutandis from the proof of Theorem 6.8 of Massart (2007) using the previous theorem.
Lemma 23 Assume [A⋆mixing] holds. Then there exists a constant C * 1 depending on n * and c * such that the following holds.
Let T be a class of real-valued, measurable functions on Y^{k+1} such that T is separable for the supremum norm. Also assume that there exist positive numbers σ and b such that for all t ∈ T, ‖t‖_∞ ≤ b and E*[t²(Z₀)] ≤ σ², and assume that N(T, d_k, δ) is finite for all δ > 0.
Then for all measurable set A such that P * (A) > 0:

Now, by taking T = {t^{(D)}_θ | θ ∈ B_σ} and b = 2D + log(1/σ_−), one gets the following lemma from Lemma 4.23 and Lemma 2.4 of Massart (2007). Lemma 24 Assume that there exist a function ϕ and constants C and σ_{K,M} such that x ↦ ϕ(x)/x is nonincreasing and equation (10) holds. Then for all x_{K,M} ≥ σ_{K,M} and z > 0, the stated bound holds with probability greater than 1 − e^{−z}. The two remaining steps are the control of the bracketing entropy, which will lead to equation (10) (see Section B.2), and the choice of the parameters x_{K,M} and z (see Section B.3).

B.2.1 Reduction of the set
For all θ ∈ S_{K,M}, let g_θ = (g_{θ,x})_{x∈[K]}, with components defined as above. In order to control the bracketing entropy of {t^{(D)}_θ | θ ∈ B_σ}, we control the bracketing entropy of the set G := {g_θ | θ ∈ S_{K,M}} for the distance d. Remark 25 In the rest of Section B.2, we always assume that equation (12) holds, since if this is not the case, then t_{θ′}(y_k) = 0. This means that only the y_k satisfying equation (12) are relevant for the construction of the brackets.
For all θ ∈ S_{K,M}, one has the decomposition above, so that the stated comparison holds for all θ ∈ S_{K,M}, using that σ_− ≤ 1/4 and |log a − log b| ≤ |a − b|/(a ∧ b). Therefore, the entropy is controlled by Ñ, where Ñ is the minimal cardinality of a bracket covering of G such that all brackets [a, b] satisfy the required size constraint.
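The elementary bound |log a − log b| ≤ |a − b|/(a ∧ b) used here is the mean value theorem applied to the logarithm: for a, b > 0,

```latex
|\log a - \log b|
\;=\; \left|\int_{a\wedge b}^{a\vee b}\frac{dt}{t}\right|
\;\le\; \frac{(a\vee b)-(a\wedge b)}{a\wedge b}
\;=\; \frac{|a-b|}{a\wedge b}.
```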

B.2.2 Decomposition into simple sets
The aim of this section is to prove the following lemma.
where d_∞ is the distance of the supremum norm and where γ_θ denotes the function (x, y) ↦ γ_θ(y|x). Let: • [a, b] be a bracket of {π_θ}_{θ∈S_{K,M}} of size ǫ for the supremum norm; • [p, q] be a bracket of {Q_θ}_{θ∈S_{K,M}} of size ǫ for the supremum norm; • [u, v] be a bracket of {γ_θ}_{θ∈S_{K,M}} of size ǫe^{−D} for the supremum norm.
Without loss of generality, one can assume that these brackets satisfy the stated inequalities, since all elements of {π_θ}_{θ∈S_{K,M}} and {Q_θ}_{θ∈S_{K,M}} satisfy them. One can also assume that there exists θ ∈ S_{K,M} such that π_θ ∈ [a, b], Q_θ ∈ [p, q] and γ_θ ∈ [u, v]. Under this assumption, all brackets that we construct are nonempty, and for all y ∈ Y,

e^{−D}(1 − Kǫ) ≤ Σ_x u(y|x) ≤ Σ_x v(y|x) ≤ e^D + Kǫe^{−D}.

Using the approach of Appendix A of De Castro et al. (2017), one can write g_{θ,x} as the following product of matrices.

Now, let the intermediate quantities be defined as above. Proof Using shortened notations, one has the claimed comparison, which gives the desired result. The proof of the second case is similar. Lemma 28 Assume ǫ is small enough; then ‖f_{i|k}(x, ·) − g_{i|k}(x, ·)‖₁ ≤ (5/2)ǫ. Proof With shortened notations, one has the stated bound, using that σ_− ≤ a ≤ b ≤ 1.
Then [A, B] is a bracket of G, and this construction yields a bracket covering of G.