Efficient Bayesian estimation and use of cut posterior in semiparametric hidden Markov models

We consider the problem of estimation in Hidden Markov models with finite state space and nonparametric emission distributions. Efficient estimators for the transition matrix are exhibited, and a semiparametric Bernstein-von Mises result is deduced. Following from this, we propose a modular approach using the cut posterior to jointly estimate the transition matrix and the emission densities. We derive a general theorem on contraction rates for this approach. We then show how this result may be applied to obtain a contraction rate result for the emission densities in our setting; a key intermediate step is an inversion inequality relating $L^1$ distance between the marginal densities to $L^1$ distance between the emissions. Finally, a contraction result for the smoothing probabilities is shown, which avoids the common approach of sample splitting. Simulations are provided which demonstrate both the theory and the ease of its implementation.


Introduction
Hidden Markov models (HMMs) are a broad and widely used class of statistical models, enjoying applications in a variety of settings where observed data is linked to some ordered process, for which an assumption of independently distributed data would be both inappropriate and uninformative. Specific applications include modelling of weather [3,36], circadian rhythms [35], animal behaviour [18,39], finance [44], information retrieval [48,57], biomolecular dynamics [32], genomics [63] and speech recognition [53].
In this paper, we consider inference in finite state space HMMs. Such models are characterised by an unobserved (latent) Markov chain (X t ) t≥1 taking values in [R] = {1, . . . , R} with R < ∞, evolving according to a transition matrix Q. Conditionally on X t = j, Y t ∼ F j where F j is the emission distribution with associated density f j . These models generalise independent mixture models, which are obtained as a special case when the X t are independent and identically distributed. Here we assume that R is known so that the parameters in such models are then Q and F = (F 1 , . . . , F R ).
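The data generating process just described is straightforward to simulate. The following sketch (with an illustrative two-state chain and Gaussian emissions, none of which are taken from the paper) may help fix ideas.

```python
import numpy as np

def simulate_hmm(n, Q, sample_emission, rng):
    """Simulate (X_1:n, Y_1:n) from a stationary finite state space HMM.

    Q                : (R, R) transition matrix, rows summing to 1
    sample_emission  : sample_emission(r, rng) draws Y_t given X_t = r
    """
    R = Q.shape[0]
    # stationary distribution p_Q: left eigenvector of Q for eigenvalue 1
    evals, evecs = np.linalg.eig(Q.T)
    p = np.real(evecs[:, np.argmax(np.real(evals))])
    p = p / p.sum()
    X = np.empty(n, dtype=int)
    X[0] = rng.choice(R, p=p)
    for t in range(1, n):
        X[t] = rng.choice(R, p=Q[X[t - 1]])
    Y = np.array([sample_emission(X[t], rng) for t in range(n)])
    return X, Y

rng = np.random.default_rng(0)
Q = np.array([[0.8, 0.2], [0.3, 0.7]])   # illustrative 2-state transition matrix
means = np.array([-1.0, 2.0])            # illustrative Gaussian emission means
X, Y = simulate_hmm(500, Q, lambda r, g: g.normal(means[r], 1.0), rng)
```

Swapping `sample_emission` for any other conditional sampler gives the general semiparametric setting, since nothing above depends on the emissions being Gaussian.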
Most of the work on HMMs considers parametric models, where the emissions are assumed to admit densities in a parametric class $\{f_\theta : \theta \in \Theta\}$ where $\Theta \subset \mathbb{R}^d$ for some $d < \infty$; see for instance [11,20,23]. However, such an assumption leads to inference which is strongly influenced by the choice of the parametric family $\{f_\theta : \theta \in \Theta\}$. This problem has often been discussed in the literature, see for instance [53], [21] or [63], especially but not solely in relation to clustering or segmentation. In the seminal paper [24], the authors show identifiability under weak assumptions on the emissions, provided the transition matrix is of full rank, paving the way for estimation of semiparametric models. In [4], it is shown further that the number R of hidden states may be identified. In [7,34] and [16], frequentist estimators of Q and $f = (f_1, \dots, f_R)$ respectively have been proposed using spectral methods, showing in particular that Q can be estimated at the rate $1/\sqrt{n}$. Frequentist estimation of the emission densities has also been addressed using penalised least squares approaches [15] and spectral methods [16]. However, no results exist on the asymptotic distribution of frequentist estimators of Q, nor on efficient estimation of Q.
Although Bayesian nonparametric estimation methods have been considered in practice in hidden Markov models, see for instance [63] or [21], little is known about their theoretical properties. While [60,61] established posterior consistency under general conditions on the prior and refined the analysis to derive posterior concentration rates on the marginal density of successive observations, $g_{Q,F}$, no results exist regarding the properties of Bayesian procedures when seeking to recover the parameters Q and F, or other functionals of (Q, F) which are often of interest in HMMs. For instance, when interest lies in clustering or segmentation, the quantities of interest are the smoothing probabilities, i.e. the conditional distribution of the latent states given the observations. In [16], the authors obtain rates of convergence for frequentist estimators of the smoothing probabilities, but their result requires either sup-norm convergence of the estimator $\hat{f}$ or splitting the data into two parts, with estimation of f based on one part of the data and estimation of the smoothing probabilities based on the other part of the observations. In this paper we intend to bridge this gap, concentrating on Bayesian semiparametric methods while also exhibiting non-Bayesian, semiparametric efficient estimators of Q.
We first construct a family of priors $\Pi_1$ on (Q, F) which we show in Theorem 2 leads to an asymptotically normal posterior distribution for $\sqrt{n}(Q - \hat{Q})$, of variance V detailed therein. Here $\hat{Q}$ is a frequentist estimator, exhibited in Theorem 1, for which $\sqrt{n}(\hat{Q} - Q^*)$ converges in distribution to $\mathcal{N}(0, V)$, with $Q^*$ being the true parameter under which the data is assumed to be generated. Consequently, Bayesian estimates associated to such posteriors, such as the posterior mean, enjoy parametric convergence rates to $Q^*$, and importantly credible regions for Q are asymptotically confidence regions. We then refine this construction to obtain a scheme for which V is the optimal variance for semiparametric estimation of Q, namely the inverse efficient Fisher information, in Theorem 3. Semiparametric Bernstein-von Mises properties are highly nontrivial results and correspondingly sparse in the literature; moreover, in [13,14,22,54] a number of counterexamples are exhibited where semiparametric Bernstein-von Mises results do not hold. Since this property is crucial to ensure that credible regions are also confidence regions and thus robust to the choice of the prior distribution, it is important to study and verify it.
Our approach to obtaining these results follows the ideas of [28], extending their work on mixtures to the more complex HMM setting. In particular, the construction of $\hat{Q}$ (together with the prior on (Q, F)) is based on a parametric approximation to the nonparametric emission model, with the property that estimation in this model with appropriately 'coarsened' data leads to a well-specified parametric model for estimation of Q. Once we reduce to such a parametric model, we have asymptotic normality of the corresponding MLE as in [10] as well as a Bernstein-von Mises theorem for the posterior as in [17], although the intuition that our 'coarsened' data is less informative translates to an inefficient asymptotic variance V. We then show that we can construct efficient estimators of Q by an appropriate choice of coarsening, similar to the approach of [28]. The proof techniques in our context are however significantly more complex, since the notions of Fisher information and score functions are much less tractable in hidden Markov models (see for instance [19]).
The prior distributions $\Pi_1(dQ, dF)$ considered above, which lead to the Bernstein-von Mises property of the posterior on Q, rely on a crude modelling of the emission densities and are therefore not well behaved as far as the estimation of F is concerned. To overcome this problem, we adapt in Section 4 the cut posterior approach (see [9,12,37,43,52]) to our semiparametric setting. Cut posteriors were originally proposed in the context of modular approaches to modelling, where a model is assembled from a number of constituent models, each with its own parameters $\theta_i$ and data $Y_i$. In the usual construction, the cut posterior has the effect of 'cutting feedback' of one of the (less reliable) data sources $Y_i$ on the other parameters (associated to more reliable data).
Our approach, though also modular, departs from this setting in that we use a single data source but wish to choose different priors for different parts of the parameter. This means that we consider a prior Π 2 (dF|Q), then combine Π 1 (dQ|Y 1:n ) with the conditional posterior Π 2 (dF|Q, Y 1:n ) to produce a joint distribution. In this way, we construct a distribution over the parameters which is not a proper Bayesian posterior, but which simultaneously satisfies a Bernstein-von Mises theorem for the posterior marginal on Q, and is well concentrated on the emission distributions and other functionals such as the smoothing probabilities. Through this construction, we manage to combine the "best of both worlds" in the estimation of the parametric and nonparametric parts. We believe that this idea could be used more generally in other semiparametric models.
As previously mentioned, the existing posterior concentration results in semiparametric HMMs cover only the marginal density $g_{Q,F}$ of a fixed number of consecutive observations, see [60]. A key step in obtaining posterior contraction rates on the emission distributions F is an inversion inequality allowing us to deduce $L^1$ concentration of the posterior distribution on the emission distributions from concentration (at the same rate) of the marginal distribution of the observations. This is established in Theorem 4, from which we derive contraction rates for the cut posterior on f in Theorem 5. This inversion inequality is of independent interest and can be used outside the framework of Bayesian inference. We finally show in Theorem 6 that these results lead to posterior concentration of the smoothing distributions, which are the conditional distributions of the hidden states given the observations, building on [16] but refining the analysis so that we require neither a sup-norm contraction rate on the emissions f nor a splitting of the data set into two parts.
Organisation of the paper The paper is organised as follows. In Section 2, we introduce the model and the notions involved, together with a general strategy for inference on Q and f based on the cut posterior approach. In Section 3, we study the estimation of Q, proving asymptotic normality of the posterior and asymptotic efficiency. In Section 4, the cut posterior approach is studied, and posterior contraction rates for f are derived together with posterior contraction rates for the smoothing probabilities. Theoretical results from both sections are then illustrated in Section 5. The most novel proofs for both of these sections are presented in Section 6, with more standard arguments, along with further details of the simulations, deferred to the supplementary material [50]. All sections, theorems, propositions, lemmas, assumptions, algorithms and equations presented in the supplement are labelled with the prefix S.

Inference in semiparametric finite state space HMMs
In this section we present our strategy to estimate Q and F = (F 1 , . . . , F R ), based on a modular approach which consists of first constructing the marginal posterior distribution of Q based on a first prior Π 1 on (Q, F), and then combining it with the conditional posterior distribution of F given Q and Y 1:n based on a different prior Π 2 (dF|Q).

Model and notation
Hidden Markov models (HMMs) are latent variable models in which one observes $Y_{1:n} = (Y_t)_{1 \le t \le n}$, whose distribution is modelled via latent (unobserved) variables $X_{1:n} = (X_t)_{1 \le t \le n} \in [R]^n$ forming a Markov chain. In this work we consider finite state space HMMs:
$$ Y_t \mid X_{1:n} \sim F_{X_t} \ \text{(conditionally independently across } t\text{)}, \qquad (X_t)_{t \le n} \sim \mathrm{MC}(Q), \qquad P_{Q,F}(X_t = s \mid X_{t-1} = r) = Q_{rs}, \quad r, s \le R, \tag{1}$$
and the number of states R is assumed to be known throughout the paper.
The parameters are then the transition matrix Q = (Q rs ) r,s≤R of the Markov chain and the emission distributions F = {F r } r≤R , which represent the conditional distribution of the observations given the latent states. For a transition matrix Q we denote by p Q its invariant distribution (when it exists).
[Figure: visual representation of the data generating process of an HMM.]
Throughout the paper we denote by $F^* = \{F^*_r\}_{r \le R}$ and $Q^*$ respectively the true emission distributions and the true transition matrix. The aim is to make inference on $F^*$, $Q^*$, and some functionals of these parameters, using likelihood-based methods and in particular Bayesian methods. We assume that the distributions $F^*_j$, $j = 1, \dots, R$, are absolutely continuous with respect to some measure $\lambda$ on $\mathcal{Y} \subset \mathbb{R}^d$, $d \ge 1$, and we denote by $f^*_1, \dots, f^*_R$ their corresponding densities. When the latent states $(X_t)_{t \ge 1}$ are independent and identically distributed on $[R]$, the parameters are not identifiable unless some strong assumptions are made on the $F_j$'s, see [5]. However, in [24], it is proved that under weak assumptions on the data generating process, both Q and F are identifiable up to label swapping (or label switching). More precisely, let $P_{Q,F}$ denote the law of the observations under the parameters (Q, F). Under assumptions of ergodicity, $p_Q$ exists and is unique; for example, this holds under the assumption $Q_{ij} > 0$ for all $i, j \in [R]$, see for instance equation (1) in [26]. Denote by $\mathcal{Q} = \Delta_R^R \subset [0,1]^{R \times R}$ the R-fold product of the $(R-1)$-dimensional simplex and let $\mathcal{P}$ be the set of probability measures on $\mathbb{R}^d$. Consider the following assumption:
• (i) The latent chain $(X_t)_{t \ge 1}$ has true transition matrix $Q^* = (Q^*_{ij})$ satisfying $\det Q^* \neq 0$.
• (ii) The true emission distributions $\{F^*_r\}_{r=1}^R$ are linearly independent.
Then from [24], if Assumption 1 holds, for any $Q \in \mathcal{Q}$ and any $F_1, \dots, F_R \in \mathcal{P}$, if $P_{Q,F} = P_{Q^*,F^*}$ then $Q = Q^*$ and $F = F^*$ up to label swapping. By "up to label swapping" we precisely mean that there exists a permutation $\tau$ of $[R]$ such that $\tau Q = Q^*$ and $\tau F = F^*$, where $\tau Q = (Q_{\tau(r),\tau(s)},\ r, s \in [R])$ and $\tau F = (F_{\tau(1)}, \dots, F_{\tau(R)})$. The requirement for such a permutation $\tau$ is unavoidable, since the labelling of the hidden states is fundamentally arbitrary.
Correspondingly, the results which follow will always be given up to label swapping. In a slight abuse of notation, we will sometimes interchange F and f = (f 1 , . . . , f R ), the latter being the densities of F with respect to some measure λ in a dominated model.
The likelihood associated to model (1), when the $(F_r)_{r \in [R]}$ are dominated by a measure $\lambda$ with densities $(f_r)_{r \in [R]}$, is then given by
$$ g_n(Y_{1:n}; Q, f) = \sum_{x_{1:n} \in [R]^n} p_Q(x_1) \prod_{t=2}^n Q_{x_{t-1} x_t} \prod_{t=1}^n f_{x_t}(Y_t). \tag{3}$$
The extension to initial distributions different from the stationary one is straightforward, under the exponential forgetting of the Markov chain, which holds under our assumptions below; see Section S5.1.
If $\Pi$ is a prior on $\mathcal{Q} \times \mathcal{F}^R$, where $\mathcal{F} = \{f : \mathbb{R}^d \to \mathbb{R}_+ : \int f(y)\, d\lambda(y) = 1\}$ is the set of densities on $\mathbb{R}^d$ with respect to $\lambda$, then the Bayesian posterior $\Pi(\cdot \mid Y_{1:n})$ is defined as follows: for any Borel subset A of $\mathcal{Q} \times \mathcal{F}^R$,
$$ \Pi(A \mid Y_{1:n}) = \frac{\int_A g_n(Y_{1:n}; Q, f)\, d\Pi(Q, f)}{\int g_n(Y_{1:n}; Q, f)\, d\Pi(Q, f)}. \tag{4}$$
This posterior is well defined as soon as $p_Q$ exists almost surely with respect to the prior, which holds for instance when $\Pi(\min_{i,j} Q_{ij} > 0) = 1$, since this implies that the transition matrix is ergodic $\Pi$-almost surely by the earlier remarks. Throughout the paper we consider the parametrization of Q given by $\tilde{Q} = (Q_{rs},\ r \in [R],\ s \in [R-1])$, so that for each r we have $Q_{rR} = 1 - \sum_{s<R} Q_{rs}$. Hence, specification of the matrix Q amounts to specification of the $R \times (R-1)$ matrix $\tilde{Q} \in \tilde{\mathcal{Q}}$ for which $\sum_{s=1}^{R-1} Q_{rs} \le 1$ for all r, and we will identify Q with $\tilde{Q}$ (and $\mathcal{Q}$ with $\tilde{\mathcal{Q}}$) when making statements about asymptotic distributions.
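Although the likelihood above sums over $R^n$ latent paths, it can be evaluated in $O(nR^2)$ operations by the standard forward recursion. A minimal sketch (not part of the paper's implementation), with a log-sum-exp rescaling for numerical stability:

```python
import numpy as np

def hmm_log_likelihood(Y, Q, log_emission, p0):
    """Forward algorithm: evaluates log g_n(Y_1:n; Q, f) of the HMM
    likelihood in O(n R^2), where log_emission(y) returns the vector
    (log f_r(y))_{r <= R} and p0 is the initial (e.g. stationary) law."""
    log_alpha = np.log(p0) + log_emission(Y[0])
    for y in Y[1:]:
        m = log_alpha.max()
        # alpha_new[s] = sum_r alpha[r] * Q[r, s] * f_s(y), in log scale
        log_alpha = m + np.log(np.exp(log_alpha - m) @ Q) + log_emission(y)
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())

# Toy check on a 2-state chain with emissions on {0, 1} (values illustrative)
Q = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # B[r, y] = f_r(y)
p0 = np.array([4 / 7, 3 / 7])            # stationary distribution of Q
ll = hmm_log_likelihood([0, 1, 0], Q, lambda y: np.log(B[:, y]), p0)
```

The same recursion, applied with multinomial emissions, evaluates the coarsened-data likelihoods used later in the paper.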
It will be helpful to regard the stationary HMM $(X_t, Y_t)_{t \ge 1}$ as embedded in a two-sided stationary process $(X_t, Y_t)_{t \in \mathbb{Z}}$, in which the $(X_t)_{t \in \mathbb{Z}}$ and the $(Y_t)_{t \le 0}$ are not observed. Such an extension can be defined by considering the time reversal of the latent chain.
We use P * to denote the joint law of the variables (X t , Y t ) t∈Z under the stationary distribution associated with the parameters Q * , F * . Estimators are then understood to be random variables which are measurable with respect to σ((Y t ) 1≤t≤n ).

Cut posterior inference: a general strategy for joint inference on Q and f

In Section 3.1 below we construct a family of prior distributions $\Pi_1$ on $\mathcal{Q} \times \mathcal{F}^R$ such that the associated marginal posterior distribution of Q satisfies a Bernstein-von Mises theorem centred at an asymptotically normal regular estimator $\hat{Q}$: with probability going to 1 under $P^*$,
$$ \big\| \Pi_1\big(\sqrt{n}(Q - \hat{Q}) \in \cdot \mid Y_{1:n}\big) - \mathcal{N}(0, V) \big\|_{TV} = o_{P^*}(1), \quad\text{and}\quad \min_{\tau \in S_R} \sqrt{n}\,(\tau \hat{Q} - Q^*) \Rightarrow^* \mathcal{N}(0, V), \tag{5}$$
where $\Rightarrow^*$ denotes convergence in distribution under $P^*$ and where $S_R$ is the set of permutations of $[R]$.
However, the choice of $\Pi_1$ which we require for this control over Q leads to a posterior distribution on F which is badly behaved, see Section 3. In order to jointly estimate Q and F well, we propose the following cut posterior approach. Consider a second conditional prior distribution $\Pi_2(dF \mid Q)$, which leads to a conditional posterior distribution of F given $Q, Y_{1:n}$ of the form
$$ \Pi_2(dF \mid Q, Y_{1:n}) = \frac{g_n(Y_{1:n}; Q, f)\, \Pi_2(dF \mid Q)}{\int g_n(Y_{1:n}; Q, f')\, \Pi_2(dF' \mid Q)}. $$
The cut posterior is then defined as the probability distribution on $\mathcal{Q} \times \mathcal{F}^R$ given by
$$ d\Pi^{cut}\big((Q, F) \mid Y_{1:n}\big) = d\Pi_1(Q \mid Y_{1:n})\, d\Pi_2(F \mid Q, Y_{1:n}). $$
Note that if $\Pi_2(dF \mid Q) = \Pi_1(dF \mid Q)$ then $d\Pi^{cut}((Q, F) \mid Y_{1:n})$ is a proper posterior distribution and is equal to $d\Pi_1((Q, F) \mid Y_{1:n})$. The motivation behind the use of the cut posterior $d\Pi^{cut}((Q, F) \mid Y_{1:n})$ is to keep the good behaviour of $\Pi_1(dQ \mid Y_{1:n})$ while retaining flexibility in the modelling of F, ensuring that the posterior distribution over both F and Q (and functionals of the parameters) is well behaved.
Adapting the proof techniques of [29] on posterior concentration rates to cut posteriors, we derive in Section 4 contraction rates for cut posteriors in terms of the $L^1$ norm of $g_{Q,F} - g_{Q^*,F^*}$, where $g_{Q,F}$ is the density of $P_{Q,F}$ with respect to the dominating measure $\lambda$. That is, we show that under suitable conditions and for a suitable choice of $\epsilon_n = o(1)$, $\Pi^{cut}(\|g_{Q,F} - g_{Q^*,F^*}\|_1 \le \epsilon_n \mid Y_{1:n}) = 1 + o_{P^*}(1)$. To derive cut posterior contraction rates in terms of the $L^1$ norm of $f - f^*$, we prove in Section 4.2 an inversion inequality of the form
$$ \min_{\tau \in S_R} \sum_{r=1}^R \|f_{\tau(r)} - f^*_r\|_1 \lesssim \|g_{Q,F} - g_{Q^*,F^*}\|_1. $$
We also derive cut posterior contraction rates for the smoothing probabilities $(p_{Q,F}(X_i = \cdot \mid Y_{1:n}))_{i=1,\dots,n}$ in Section 4.3. In contrast with [16], our concentration rates for the smoothing probabilities require neither splitting the data into two groups nor a control of $f_r - f^*_r$ in sup-norm. We can avoid these difficulties thanks to the Bayesian approach, as explained in Section 4.3.
In our implementation of the cut posterior, we adopt a nested MCMC approach of the kind detailed in [12] and [37], see Section 5 for details.
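The nested scheme can be summarised as: draw Q from its marginal posterior, then, for each such Q, run an inner chain in F targeting the conditional posterior, with no feedback from F into the draws of Q. A schematic with hypothetical stand-in kernels (`draw_Q` and `inner_kernel` are placeholders, not the samplers used in Section 5):

```python
import numpy as np

def sample_cut_posterior(draw_Q, inner_kernel, init_F, n_outer, n_inner, rng):
    """Nested sampling sketch for the cut posterior.

    draw_Q(rng)             -- one draw from Pi_1(dQ | Y_1:n) (outer stage)
    inner_kernel(F, Q, rng) -- one MCMC step targeting Pi_2(dF | Q, Y_1:n)

    For each outer draw of Q, a fresh inner chain in F is run for n_inner
    steps and only its final state is kept. No information about F flows
    back into the draws of Q: the feedback is 'cut'.
    """
    samples = []
    for _ in range(n_outer):
        Q = draw_Q(rng)
        F = init_F
        for _ in range(n_inner):
            F = inner_kernel(F, Q, rng)
        samples.append((Q, F))
    return samples

# Toy illustration with placeholder kernels (purely for demonstration):
rng = np.random.default_rng(1)
draws = sample_cut_posterior(
    lambda g: g.beta(2, 2),                                # stand-in for Pi_1(dQ | Y)
    lambda F, Q, g: 0.5 * F + 0.5 * Q + 0.1 * g.normal(),  # stand-in inner kernel
    0.0, 100, 20, rng)
```

The cost of the inner chains is the price paid for cutting feedback; in practice `n_inner` is tuned so that the inner chain has approximately reached its stationary distribution.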
In the following section we present Π 1 and show that the associated marginal posterior distribution Π 1 (Q|Y 1:n ) is asymptotically normal in the sense of Equation (5).

Semiparametric estimation of Q: Bernstein-von Mises property and efficient estimation
The prior $\Pi_1$ is based on a simple histogram prior on $f_1, \dots, f_R$. For the sake of simplicity we present the case of univariate data; the multivariate case can be treated similarly. Without loss of generality we assume that the observations belong to $[0,1]$: if $\mathcal{Y} = \mathbb{R}$, we can transform the data to $[0,1]$ via a $C^1$ diffeomorphism (such as $\phi(x) = e^x/(1+e^x)$) prior to the analysis. The prior relies on a partition of $[0,1]$ into finitely many intervals $\{I_1, I_2, \dots\}$, and transforming the data is equivalent to constructing a prior based on the corresponding partition $\{\phi^{-1}(I_1), \phi^{-1}(I_2), \dots\}$ of $\mathbb{R}$. The construction of $\Pi_1$ is very similar to that considered in [28].
For $M \in \mathbb{N}$, let $\mathcal{I}_M = \{I_1, \dots, I_{\kappa_M}\}$ be a partition of $[0,1]$ into $\kappa_M$ bins, with $\kappa_{(\cdot)} : \mathbb{N} \to \mathbb{N}$ a strictly increasing sequence. Given $\mathcal{I}_M$, we consider the model of piecewise constant densities as the set $\mathcal{F}_M$ of densities with respect to Lebesgue measure of the form
$$ f_\omega(y) = \sum_{m=1}^{\kappa_M} \frac{\omega_m}{|I_m|} \mathbf{1}_{I_m}(y), \qquad \omega_m \ge 0, \quad \sum_{m=1}^{\kappa_M} \omega_m = 1, \tag{8}$$
where $|I_m|$ denotes the Lebesgue measure of $I_m$. We could for instance consider a sequence of dyadic partitions with $\kappa_M = 2^M$; such partitions are admissible for sufficiently large M in the sense detailed below, and are used for the empirical investigation of Section 5.
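In the dyadic case the densities of (8) are particularly simple to evaluate, since every bin has width $2^{-M}$. A small illustrative sketch (the weights $\omega$ are chosen arbitrarily, not fitted to data):

```python
import numpy as np

def dyadic_bin_index(y, M):
    """Index m(y) of the bin containing y in the dyadic partition of
    [0, 1] into 2^M bins of equal width."""
    return np.minimum((np.asarray(y) * 2 ** M).astype(int), 2 ** M - 1)

def piecewise_constant_density(y, omega, M):
    """Evaluate a density of the form (8) on the dyadic partition: all
    bins have width |I_m| = 2^{-M}, so f_omega(y) = omega_{m(y)} * 2^M."""
    return omega[dyadic_bin_index(y, M)] * 2 ** M

omega = np.array([0.1, 0.2, 0.3, 0.4])   # M = 2: kappa_M = 4 arbitrary weights
f = piecewise_constant_density(np.array([0.05, 0.3, 0.6, 0.99]), omega, 2)
# f -> [0.4, 0.8, 1.2, 1.6]; the weights sum to 1, so f integrates to 1
```

Refining M doubles the number of bins while keeping the parametrization by probability weights, which is what makes a Dirichlet-type prior on $\omega$ convenient.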
The parameters for this model are then Q and $\omega^{(M)} = (\omega_{mr})_{m=1,\dots,\kappa_M}^{r=1,\dots,R}$, the latter of which varies in the set
$$ \Omega_M = \Big\{ \omega^{(M)} \in [0,1]^{\kappa_M \times R} : \ \sum_{m=1}^{\kappa_M} \omega_{mr} = 1 \ \text{for each } r \in [R] \Big\}. $$
Through (8), we identify each $\omega^{(M)} \in \Omega_M$ with a vector of emission densities $f_{\omega^{(M)}} \in \mathcal{F}_M^R$, and thus a prior distribution $\Pi_M$ over the parameter space $\mathcal{Q} \times \Omega_M$ is identified with a prior distribution $\Pi$ over $\mathcal{Q} \times \mathcal{F}_M^R$. The corresponding posterior distribution is denoted $\Pi_M(\cdot \mid Y_{1:n})$ and is defined through (4).
Throughout this section we write $\omega_r := (\omega_{mr},\ m \le \kappa_M)$ and we denote by $\mathcal{M}(\mathcal{I}_M)$ the hidden Markov model associated with densities of the form (8).
A key argument used in [24] to identify $Q^*$ from $g_{Q^*,f^*}$ is to find a partition $\mathcal{I}_M$, for some $M > 0$, such that the $\kappa_M \times R$ matrix $\big(F^*_r(I_m)\big)_{m \le \kappa_M,\, r \le R}$ has full rank R. We call such a partition an admissible partition.
Remark 1. Note that although we use piecewise constant functions to model the emission densities, we do not assume that the $f^*_r$, $r \in [R]$, are piecewise constant, and the simplified models $\mathcal{M}(\mathcal{I}_M)$ are not meant to provide a good approximation of the emission densities. However, interestingly, as far as the parameter Q is concerned, the likelihood induced by such a model is not mis-specified but corresponds rather to a coarsening of the data: it is the likelihood associated to the observations $Y_t^{(M)} = (\mathbf{1}_{Y_t \in I_m})_{m \le \kappa_M}$, and for such observations the simplified model leads to a well-specified likelihood. Note also that although we are modelling densities with respect to Lebesgue measure, we do not require the $F^*_r$ to have densities with respect to Lebesgue measure, since the quantities of importance are the probabilities $F^*_r(I_m)$, $m \le \kappa_M$, $r \in [R]$. This particular coarsening was introduced in the context of mixtures by [28]. It is not at all obvious how other types of coarsening can be found in order to use valid parametric models, in the sense that they are well-specified models for the coarsened data and the parameter of interest Q.
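The coarsening described in Remark 1 amounts to replacing each observation by its vector of bin indicators. A minimal sketch (bin edges illustrative):

```python
import numpy as np

def coarsen(Y, edges):
    """Coarsened data in the sense of Remark 1: replace each Y_t by the
    one-hot vector (1{Y_t in I_m})_{m <= kappa_M} of bin indicators."""
    m = np.searchsorted(edges, Y, side='right') - 1
    m = np.clip(m, 0, len(edges) - 2)   # send Y_t = 1 into the last bin
    out = np.zeros((len(Y), len(edges) - 1))
    out[np.arange(len(Y)), m] = 1.0
    return out

C = coarsen(np.array([0.1, 0.7, 1.0]), np.array([0.0, 0.5, 1.0]))
# C -> [[1, 0], [0, 1], [0, 1]]
```

On these one-hot summaries the histogram model is a well-specified multinomial-emission HMM, whatever the true emission distributions.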

Asymptotic normality and Bernstein-von Mises
In this section we study the asymptotic behaviour of the marginal posterior distribution of Q under model $\mathcal{M}(\mathcal{I}_M)$, together with the asymptotic normality of the maximum likelihood estimator $\hat{Q}_{n,M}$ in this model. As mentioned in Remark 1, the likelihood associated to model $\mathcal{M}(\mathcal{I}_M)$ is given by
$$ g_{n,M}\big(Y_{1:n}; Q, \omega^{(M)}\big) = \sum_{x_{1:n} \in [R]^n} p_Q(x_1) \prod_{t=2}^n Q_{x_{t-1} x_t} \prod_{t=1}^n \prod_{m=1}^{\kappa_M} \Big(\frac{\omega_{m x_t}}{|I_m|}\Big)^{\mathbf{1}_{Y_t \in I_m}}, $$
with the abuse of notation that $Y_t$ is identified with its coarsened version $Y_t^{(M)} = (\mathbf{1}_{Y_t \in I_m})_{m \le \kappa_M}$. In other words, our likelihood becomes that of a hidden Markov model with finite state space and multinomial emission distributions, and under $P^*$, $(Y_t^{(M)})_{t \le n}$ is generated from such a model with emission parameters $\omega^*_{mr} = F^*_r(I_m)$. Asymptotic normality of the maximum likelihood estimator (MLE) of parametric finite state space hidden Markov models was considered for instance in [10], who showed that the MLE is asymptotically normal with covariance matrix given by the inverse of the Fisher information matrix, which is the limiting covariance matrix of the score statistics; see Lemma 1 of [10].
The following theorem demonstrates that asymptotically normal, parametric-rate estimators of the transition matrix exist, although such estimators may not have optimal asymptotic variance.
The main difficulty in the proof of Theorem 1 is showing that the Fisher information matrix $J_M$ is invertible.
Proof. The model $\mathcal{M}(\mathcal{I}_M)$ is a regular parametric HMM; hence, using Theorem 1 of [10], we establish asymptotic normality of the MLE for the parameter $\theta = (Q, \omega^{(M)})$ as soon as $J_M$ is invertible. We prove invertibility of $J_M$ in Section S3.1; projecting onto the Q coordinates then gives the result.
To derive the Bernstein-von Mises theorem associated to model $\mathcal{M}(\mathcal{I}_M)$, we need the following assumption on the prior distributions on Q and $\omega^{(M)}$: Assumption 2.
• The prior Π Q on Q has positive and continuous density on Q.
• The prior on ω (M ) has positive and continuous density with respect to Lebesgue measure.
Theorem 2. Let $M > 0$, let $\mathcal{I}_M$ be an admissible partition for $f^*$, and let $\Pi_M$ be a prior satisfying Assumption 2. Assume that (i) of Assumption 1 is satisfied, that the transition matrix $Q^*$ is irreducible and aperiodic, and that $\omega^*_{mr} > 0$ for all $m \in [\kappa_M]$ and all $r \in [R]$. We then have (up to label swapping)
$$ \big\| \Pi_M\big(\sqrt{n}(Q - \hat{Q}_{n,M}) \in \cdot \mid Y_{1:n}\big) - \mathcal{N}\big(0, \tilde{J}_M^{-1}\big) \big\|_{TV} = o_{P^*}(1). $$
As with the proof of Theorem 1, parametric results apply as soon as the Fisher information matrix is seen to be invertible.
Proof. Inspecting the proof of Theorem 1 of [10], we see (for $\theta = (Q, \omega^{(M)})$, $J_M = J_M(\theta^*)$ the Fisher information and $\ell_n = \log g_{n,M}$ the log-likelihood) that
$$ \sqrt{n}\,(\hat{\theta}_{n,M} - \theta^*) = J_M^{-1} \frac{1}{\sqrt{n}} \nabla_\theta \ell_n(\theta^*) + o_{P^*}(1), $$
which up to $o_{P^*}(1)$ is equal to the $T_n$ of [17] at which their Bernstein-von Mises result (Theorem 2.1) is centred. This result then implies the total variation convergence (in probability) of the posterior to the given normal distribution, by considering the marginal posterior on (the free entries of) Q. We remark further that the MLE $\hat{\theta}_{n,M}$ is regular, which follows from the characterisations of the Fisher information in Lemmas 1 and 2 of [10], the expansion of the MLE above, and an application of Le Cam's third lemma (Example 6.7 of [58]) along the lines of the proof of Lemma 8.14 in [58].
An interesting feature of Theorems 1 and 2 is that they essentially only require that $\mathcal{I}_M$ is an admissible partition for $F^*$. For a given partition, this is an assumption on $F^*$, and indeed the choice of the partition is important. Note, however, that under Assumption 1 and if, for instance, the (Lebesgue) densities $f^*_r$ are positive and continuous, then for any sequence of embedded partitions $(\mathcal{I}_M)_M$ with radius going to 0 there exists an M such that $\mathcal{I}_M$ is admissible for $F^*$. More discussion is provided in Section 5.
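In practice, admissibility of a candidate partition can be probed numerically by checking the rank of the matrix of bin probabilities $(F^*_r(I_m))_{m,r}$. The sketch below (with illustrative cdfs, not taken from the paper) does this when the emission cdfs are known or estimated:

```python
import numpy as np

def bin_probabilities(cdfs, edges):
    """Matrix (F_r(I_m))_{m, r}: mass that each emission puts in each bin.

    cdfs  : list of R cumulative distribution functions on [0, 1]
    edges : bin edges 0 = e_0 < e_1 < ... < e_{kappa_M} = 1
    """
    E = np.asarray(edges, dtype=float)
    return np.column_stack([F(E[1:]) - F(E[:-1]) for F in cdfs])

def is_admissible(cdfs, edges, tol=1e-10):
    """The partition is admissible when (F_r(I_m))_{m, r} has full column rank R."""
    B = bin_probabilities(cdfs, edges)
    return bool(np.linalg.matrix_rank(B, tol=tol) == B.shape[1])

# A Uniform and a Beta(2, 1) emission (cdfs x and x^2) are already
# separated by the two-bin partition {[0, 1/2), [1/2, 1]}:
admissible = is_admissible([lambda x: x, lambda x: x ** 2], [0.0, 0.5, 1.0])  # True
```

Identical (or linearly dependent) bin-probability columns make the matrix rank-deficient, in which case a finer partition is needed.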
Combining Theorems 1 and 2, we see that credible regions for Q based on the posterior associated to $\Pi_M$ are also asymptotic confidence regions. Their size may not be optimal, however, even asymptotically. To ensure that such credible regions have optimal size while being asymptotic confidence regions, we would require that $\tilde{J}_M^{-1}$ be the best possible (asymptotic) variance, but this is not true in general: while $\tilde{J}_M^{-1}$ is the efficient covariance matrix for the estimation of Q in model $\mathcal{M}(\mathcal{I}_M)$, it is not necessarily the semiparametric efficient covariance matrix for the estimation of Q in model (3). The existence of an efficient estimator of Q in the semiparametric hidden Markov model with likelihood (3) has not been established, although the fact that $\sqrt{n}$-convergent estimators of Q exist in the literature (see Section S3.1) indicates that semiparametric efficient estimation of Q should be possible.
In the following section we construct an efficient estimator for Q and a prior leading to an efficient Bernstein-von Mises theorem.

Efficient estimation
In [28], in the context of semiparametric mixture models with finite state space, the authors also consider the prior model (8) for the emission distributions. They derive for fixed M a Bernstein-von Mises theorem similar to Theorem 2, and they show that if we let $M = M_n$ go to infinity sufficiently slowly as $n \to \infty$, and if the corresponding partitions $\mathcal{I}_{M_n}$ are embedded with radius going to 0, then $\tilde{J}_{M_n}$ converges to the efficient semiparametric Fisher information matrix for Q and the estimator $\hat{Q}_{n,M_n}$ is efficient.
In this section we prove a similar result; the proof, however, is significantly more involved than in the case of mixture models. To study efficient semiparametric estimation of Q in the semiparametric HMM, we follow the approach of [47]. We first prove the LAN expansion in local submodels, which allows us to describe the tangent space, and then we prove that $\tilde{J}_M$ converges to $\tilde{J}$ as M goes to infinity, where $\tilde{J}$ is the efficient Fisher information matrix for Q. Throughout Section 3.2 we assume that the $F^*_r$ have densities $f^*_r$ with respect to Lebesgue measure on $[0,1]$ and that the following holds:

Scores and tangent space in the semiparametric model
We begin by exhibiting the LAN expansion for our model, following the framework of [47]. As is usual in semiparametric efficiency arguments (see also Chapter 25 of [58]), this involves identifying the score functions and LAN norm along one-dimensional submodels passing through the true parameter. Since these submodels are themselves parametric, this identification can be made by following the framework of [19], who considered asymptotic normality in the context of parametric HMMs. A more thorough treatment, alongside a recollection of the relevant definitions and results in [47], can be found in Section S5.2 of the supplementary material; in what follows, we aim to give an overview.
The first step is to exhibit a LAN expansion for the semiparametric model we consider. The parameter space is $\Theta = \mathcal{Q} \times \mathcal{F}^R$ and we consider the tangent space, in the sense of Definition 1 of [47], at the parameter $(Q^*, f^*)$, indexed by
$$ \mathcal{H} = \Big\{ (a, h) : a \in \mathbb{R}^{R \times (R-1)},\ h = (h_1, \dots, h_R),\ \int h_r f^*_r \, d\lambda = 0,\ \|h_r\|_\infty < \infty \Big\}, $$
where we use the parametrization $\tilde{Q}$ defined in Section 2.1. Write
$$ \theta_n(a, h) = \Big( Q^* + \frac{a}{\sqrt{n}},\ \Big( f^*_r \Big(1 + \frac{h_r}{\sqrt{n}}\Big) \Big)_{r \le R} \Big). $$
Then $\theta_n(a, h)$ is a perturbation of the parameter $\theta^* = (Q^*, f^*)$ along the path characterised by a given element $(a, h) \in \mathcal{H}$. Consider the submodel $\Theta_{a,h} = \{\theta_t(a, h) = (Q^* + ta, (f^*_r(1 + t h_r))_{r \le R}) : |t| \le \epsilon\}$, for some $\epsilon$ (depending on $a, h$) sufficiently small that $\Theta_{a,h} \subset \Theta$. Then for given $a, h$ and for sufficiently large n, the perturbed parameters $\theta_n(a, h)$ are elements of $\Theta_{a,h}$. This means that, for each $a, h$, we can make an asymptotic expansion of the log-likelihood ratio between $\theta_n(a, h)$ and $\theta^*$ by analysing this ratio in the submodel $\Theta_{a,h}$. To this end, writing $a_{rR} = -\sum_{s<R} a_{rs}$, we expand the gradient of the log-likelihood $\ell_n^{(a,h)}(t)$ in $\Theta_{a,h}$ at $t_0 = 0$, leading to the score expressions (S5.42) and (S5.43). These formulae arise as an application of the results of Section 6.1 of [19] to the (parametric) model $\Theta_{a,h}$.
We note that the contribution of the first term on the right-hand side of (S5.43) to the expression defined in (S5.42) is precisely the score function for estimation in the model $\Theta_{a,0}$, in which the emission densities are fixed and known. This contribution equals $a^T S_{Q^*}(Y_{-\infty:k})$, for $S_{Q^*}$ the score at $Q^*$ in the $R \times (R-1)$-dimensional parametric model with known emissions and unknown Q; we index its entries with $(r, s) \in [R] \times [R-1]$ for convenience, but consider $S_{Q^*}$ as a vector of length $R(R-1)$.
We then rewrite (S5.42), substituting also the expression in (S5.43). To set notation for what follows, we will denote the r-th summand of the resulting display, for a given direction $h_r$, by $H_r(h_r)$. The above discussion means that, through a Taylor expansion, we can write
$$ \ell_n\big(\theta_n(a, h)\big) - \ell_n(\theta^*) = \Delta_{n,(a,h)} - \tfrac{1}{2} \|(a, h)\|_{\mathcal{H}}^2 + o_{P^*}(1), $$
with a remainder term evaluated at some $t_1 \in (0, n^{-1/2})$. Here $\|(a, h)\|_{\mathcal{H}}^2$ is the Fisher information at $t = 0$ in the model $\Theta_{a,h}$, which is defined in [19] as the $L^2$ norm of $\Delta_{k,\infty}$; the latter is linear in $(a, h)$, and $\Delta_{n,(a,h)} \Rightarrow \mathcal{N}(0, \|(a, h)\|_{\mathcal{H}}^2)$ by the discussion directly preceding Theorem 2 in [19]. We also used above the local uniform convergence of the second derivative of the score to the Fisher information matrix at $t = 0$ in $\Theta_{a,h}$, which is guaranteed by Theorem 3 of [19].
The preceding discussion shows that our model is LAN, which is the first step towards understanding efficient estimators of a parameter of interest. In our case, the parameter of interest is $\nu_n(P_{n,\theta_n(a,h)}) = Q$, which has 'derivative' $\dot{\nu}(a, h) = a \in \mathbb{R}^p$ in the sense of [19], with $p = R(R-1)$.
Following what precedes, we apply the convolution theorem also described in [47], and originally proven in [59]. This theorem essentially states that the limiting law of a regular estimator is lower bounded by that of a Gaussian random variable whose covariance matrix is the covariance of the efficient influence function, itself characterised by the tangent space and the parameter 'derivative' $\dot{\nu}$.
In Section S5.2 we detail the application of this theorem; the arguments are similar to those used in the i.i.d. setting, see for instance Section 25 of [58]. This essentially involves identifying influence functions $\dot{\nu}_b$ for estimation of the one-dimensional functionals $b^T Q$ at $b^T Q^*$, as we vary $b \in \mathbb{R}^p$. By considering the elements for which $a = 0$, we first find that $\dot{\nu}_b$ lies in the orthogonal complement (with respect to the LAN norm on $\mathcal{H}$) of the linear span of the scores in the models $\Theta_{0,h}$, which is the span of the $H_r(h_r)$ as we vary r and $h_r \in \mathcal{H}_r$. Write A for the projection onto this space, and write
$$ \tilde{S}_{Q^*} = S_{Q^*} - A S_{Q^*} $$
for the projection of the score function $S_{Q^*}$ onto the orthogonal complement of this space, which we call the efficient score. By making further standard arguments, again detailed in Section S5.2, we find that the influence function $\dot{\nu}_b$ has variance $b^T \tilde{J}^{-1} b$, where $\tilde{J}$ is the covariance matrix of the efficient score $\tilde{S}_{Q^*}$, i.e. the efficient information matrix. By varying b, this characterises the optimal limiting covariance matrix as $\tilde{J}^{-1}$, the inverse efficient information matrix.
We show in the following section that $\tilde J_M$ converges to $\tilde J$ when $M$ goes to infinity and that, if $M_n$ is a sequence going to infinity slowly enough, $\hat Q_{M_n}$ is asymptotically efficient and the posterior distributions $\Pi_{M_n}(dQ \mid Y_{1:n})$ satisfy the efficient Bernstein-von Mises theorem.

Approximation by the models M(I M )
To prove that $\tilde J_M$ converges to $\tilde J$ we need some additional assumptions on the true generating process and on the partitions, whose radii we require to vanish. Assumption 5 is directly under the control of the practitioner, and is verified for instance by considering nested dyadic partitions, as is done in Section 5. Assumption 4 is an assumption on the data generating process, although it is very common in the context of semiparametric efficiency. Note also that, under Assumptions 4 and 5, there exists $M_0 > 0$ for which $I_{M_0}$ is admissible for $f^*$.
In Proposition 1, we show that the efficient Fisher information matrix is precisely the limit of $\tilde J_M$. To do this, we first define the efficient scores $\tilde S^{(M)}_{Q^*}$ for the histogram models, whose covariance matrices are the $\tilde J_M$, analogously to what was done in Section 3.2.1 in the context of the full semiparametric model. Recall from Remark 1 that the likelihood in the model $\mathcal M(I_M)$ corresponds to a likelihood of a hidden Markov model with multinomial emission distributions applied to the binned observations, and so we emphasize in our notation that the scores in $\mathcal M(I_M)$ depend only on these summaries. A straightforward adaptation of the earlier presentation then leads us to define the score functions for $Q$ in $\mathcal M(I_M)$, and, for $h_r \in H_{r,M}$, the nuisance scores $H_{r,M}(h_r)$. Following again the arguments of Section 3.2.1, we write $A_M$ for the projection onto the space spanned by the $H_{r,M}(h_r)$, and finally define the efficient score $\tilde S^{(M)}_{Q^*}$, whose covariance matrix is $\tilde J_M$, the efficient Fisher information matrix for the model $\mathcal M(I_M)$.
The convergence of $\tilde S^{(M)}_{Q^*}$ to $\tilde S_{Q^*}$ is then established by a martingale argument. Showing convergence of the projection operators $A_M$ to $A$ is more involved; we use a deconvolution argument which shows that boundedness of the nuisance scores $H_r(h_r)$ implies boundedness of the $h_r$ in the index space $H_r$ (and likewise for $H_{r,M}(h_r)$, $H_{r,M}$). These intermediate arguments, and the proofs of the following results, are in Section 6.
Proposition 1. $\tilde S^{(M)}_{Q^*} \to \tilde S_{Q^*}$, where the convergence is in $L^2(P^*)$. Moreover, we have $\tilde J_M \to \tilde J$ as $M \to \infty$, where $\tilde J \geq \tilde J_M$ is the efficient information for estimating $Q$ in the full data model, and is invertible.
From Proposition 1 we deduce the following result.
Theorem 3. Let $M_n \to \infty$ sufficiently slowly. Then under Assumptions 1, 3-5, $\hat Q_n$ is a $P^*$-regular estimator and satisfies (up to label-swapping) the efficient central limit theorem with asymptotic variance $\tilde J^{-1}$, where $\tilde J$ is the variance of the efficient score function, as defined in Proposition 1. In particular, $\hat Q_n$ is an efficient estimator of $Q$ in the full semiparametric model. Moreover, for $\Pi_n$ a sequence of priors placing mass on the models $\mathcal M(I_{M_n})$ respectively, and satisfying Assumption 2, we have (up to label-swapping) that the corresponding posterior distributions satisfy the efficient Bernstein-von Mises theorem.

Cut posterior contraction
In what follows, we study contraction rates for the cut posterior Π cut (·|Y 1:n ) defined in Section 2.

Concentration of marginal densities
In this section, we present Proposition 2, which controls $\Pi_{\mathrm{cut}}(\|g_{Q,f} - g_{Q^*,f^*}\|_{L^1} \leq \epsilon_n \mid Y_{1:n})$, where $g_{Q,f}$ denotes the marginal density of the observations. This result follows from Theorem S2.7, which is an adaptation of the general approach of [29] and is of independent interest; see Section S2 for full details.
For the sake of simplicity we consider a prior in the form Π 2 (df |Q) = Π 2 (df ); extension to the case where the prior Π 2 depends on Q is straightforward from Section S2. Note however that the conditional posterior on f given Q depends on Q through the likelihood.
Hence, similarly to [60], we consider the following assumptions, used to verify the Kullback-Leibler condition.
Assumption 6. Let $\epsilon_n, \tilde\epsilon_n > 0$ denote two sequences such that $\tilde\epsilon_n \leq \epsilon_n = o(1)$ and $n\tilde\epsilon_n^2 \to \infty$. Assume that the prior $\Pi_2$ on $\mathcal F^R$ satisfies the following conditions:
A. There exist $C_{\Pi_2} > 0$, depending on the choice of prior $\Pi_2$, and a sequence of sets $S_n \subset \mathcal F^R$ such that $\Pi_2(S_n) \geq \exp(-C_{\Pi_2} n\tilde\epsilon_n^2)$, and such that for all $f \in S_n$ there exist a set $S \subset \mathcal Y$ and functions $\bar f_1, \dots, \bar f_R$ satisfying, for all $1 \leq i \leq R$, the approximation bounds stated in Section S2, with the constant $C_{R,Q}$ given there.
B. For all constants $C > 0$, there exists a sequence $(\mathcal F_n)_{n\geq1}$ of subsets of $\mathcal F^R$ and a constant $C' > 0$ such that the corresponding entropy and remaining-mass conditions hold.
Proposition 2. Let $Y_{1:n}$ be observations from a finite state space HMM with transition matrix $Q^*$ and emission densities $f^* = (f^*_r)_{r=1,\dots,R}$. Grant Assumptions 1-5 and consider the cut posterior obtained by choosing $\Pi_1$ as the prior $\Pi_M$ of Theorem 2 associated to the admissible partition $I_M$, and $\Pi_2$ such that Assumption 6 is verified for suitable $\epsilon_n, \tilde\epsilon_n$ with $n\epsilon_n^2 \gtrsim \log n$. Then for any $K_n \to \infty$, up to label-swapping,
$$\Pi_{\mathrm{cut}}(\{\|g_{Q,f} - g_{Q^*,f^*}\|_{L^1} > K_n \epsilon_n\} \mid Y_{1:n}) = o_{P^*}(1).$$

Remark 2.
Given Proposition 2, we may use the results of [60] to derive posterior contraction rates for a number of priors Π 2 . We explore the case that Π 2 is a (product of) Dirichlet process mixture of normals in Section 4.4, as this is a popular choice for density estimation.
Remark 3. We could also choose Π 1 = Π 1,n equal to Π Mn for some M n → ∞ sufficiently slowly to give the refined control over the marginal cut posterior on Q as described in Theorem 3.

Concentration of emission distributions
Proposition 2 applies the general contraction result of Theorem S2.7 to obtain an estimation result for the marginal distribution of the observations. Given that we already have control over the transition matrix in this setting when using a histogram prior $\Pi_1$, it remains to establish concentration rates for the emission distributions. Theorem 4 allows us to translate a rate on a marginal distribution into a corresponding rate on the emission distributions.
Theorem 4. Let $g^{(3)}_{Q^*,f}$ be the marginal density of 3 consecutive observations under the parameters $(Q^*, f)$. Then there exists a constant $C = C(f^*, Q^*) > 0$ such that, for $\|g^{(3)}_{Q^*,f} - g^{(3)}_{Q^*,f^*}\|_{L^1}$ sufficiently small, the $L^1$ distance between the emission densities $f$ and $f^*$ is bounded by $C$ times this marginal $L^1$ distance.
Remark 4. Theorem 4 provides an inversion inequality from $L^1$ to $L^1$, which has interest in both Bayesian and frequentist estimation of emission densities. It is the first such result for the $L^1$ distance, with Theorem 6 of [15] establishing a similar inequality in the $L^2$ case. Given that the testing assumptions of Theorem S2.7 (and more generally, of results based on [29]) are much more straightforward to verify for the $L^1$ distance, our result has particular interest in the context of Bayesian (or pseudo-Bayesian) settings.
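Schematically, and in the notation of Theorem 4, the inversion inequality reads as follows (our rendering; the exact constant and the label-swapping convention are as in the statement above):

```latex
\sum_{r=1}^{R} \big\| f_r - f^*_r \big\|_{L^1}
\;\le\; C(f^*, Q^*)\,
\big\| g^{(3)}_{Q^*,f} - g^{(3)}_{Q^*,f^*} \big\|_{L^1},
```

valid whenever the right-hand side is sufficiently small.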
Using Proposition 2 and Theorem 4 together with Theorem 2, we easily deduce the following.
Theorem 5. Let Y 1:n ∼ P n * be distributed according to the HMM with parameters Q * and f * , and grant Assumptions 1-5. Let Π 1 = Π M and let Π 2 satisfy Assumption 6. Then, for n as in Proposition 2 and any K n → ∞, we have (up to label-swapping)

Concentration of smoothing distributions
When clustering data, the smoothing distribution P * (X k = x|Y 1:n ) is often of interest. Our final main result concerns recovery of these probabilities using a (cut) Bayesian approach, establishing contraction of the posterior distribution over these smoothing distributions in total variation, by combining novel arguments with the inequality given in Proposition 2.2 of [16]. We recall the notation θ = (Q, f ).
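Given a draw $(Q, f)$ from the (cut) posterior, the smoothing probabilities themselves are computed by the standard forward-backward recursions; a minimal, numerically stabilised sketch (our notation, not the paper's code) is:

```python
import numpy as np

def smoothing_probs(Y_lik, Q, pi0):
    """P(X_t = r | Y_1:n) for a finite-state HMM via forward-backward.

    Y_lik : (n, R) matrix of emission likelihoods f_r(y_t)
    Q     : (R, R) transition matrix
    pi0   : initial distribution of the hidden chain
    """
    n, R = Y_lik.shape
    alpha = np.zeros((n, R))          # filtered probabilities (normalised)
    beta = np.ones((n, R))            # backward messages (normalised)
    alpha[0] = pi0 * Y_lik[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ Q) * Y_lik[t]
        alpha[t] /= alpha[t].sum()
    for t in range(n - 2, -1, -1):
        beta[t] = Q @ (beta[t + 1] * Y_lik[t + 1])
        beta[t] /= beta[t].sum()      # rescaling for numerical stability
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

Averaging such matrices over posterior draws of $(Q, f)$ gives the cut-posterior estimate of the smoothing distribution whose contraction Theorem 6 addresses.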
Theorem 6. Grant the assumptions of Theorem 5, together with conditions (18) and (19), the latter holding for some constants $\gamma, C > 0$. Then for $\epsilon_n$ as in Proposition 2 satisfying $n\epsilon_n^3 \to 0$, and for any $K_n \to \infty$, we have (up to label-swapping) the stated contraction of the smoothing probabilities.
Remark 5. The requirement that $n\epsilon_n^3 \to 0$ is used in the proof, although we expect that this condition is not fundamental, and it may be possible to weaken it with appropriate proof techniques. It is not clear whether assumption (19) is crucial; it is, however, very weak, and is satisfied for instance under mild conditions on the true emission densities.

Example: Dirichlet process mixtures of Gaussians
In this section, we show that Assumption 6 is verified when $\Pi_2$ is a Dirichlet process mixture of Gaussians. We use the results of Section 4 of [60], in which the true densities belong to classes $\mathcal P(\beta, L, \gamma)$ and satisfy weak tail assumptions which we give in Assumption S8. Here $\beta > 0$, $L$ is a polynomial function, $\gamma > 0$, and $k_\beta = \lceil \beta \rceil - 1$ for $\lceil\cdot\rceil$ the usual ceiling operator.
The following result, which is a corollary of Theorem 4.3 of [60], shows that for prior choices satisfying Assumption S9, the conditions of Theorem 5 are verified with the rate in (20), $\epsilon_n = n^{-\beta/(2\beta+1)}(\log n)^t$, where $t > t_0 \geq 2 + 2/\gamma + \dots$
Corollary 1. Let $Y_{1:n} \sim P^n_*$ be distributed according to the HMM with parameters $Q^*$ and $f^*$, and grant Assumptions 1-5. Let $\Pi_1 = \Pi_M$, and let the prior $\Pi_2$ on $\mathcal F^R$ take the form of an $R$-fold product of Dirichlet process mixtures of Gaussians, in which the base measure $\alpha$ and variance prior $\Pi_\sigma$ satisfy Assumption S9. Suppose further that the true emission distributions satisfy Assumption S8 and that, for each $i$, $f^*_i \in \mathcal P(\beta, L, \gamma)$. Then Theorems 5 and 6 hold with $\epsilon_n$ as in (20).
Proof. The proof follows from [60], who showed that Assumption 6 is verified under the stated assumptions with $\lambda(dy)$ the Lebesgue measure, appropriately chosen $B_n$, $\tilde\epsilon_n = n^{-\beta/(2\beta+1)}(\log n)^{t_0}$ with $t_0$ as in the statement, and $\epsilon_n$ as in the statement.
The rate $n^{-\beta/(2\beta+1)}$ is minimax optimal for the classes $\mathcal P(\beta, L, \gamma)$ under iid sampling assumptions, see [46]. Although not proved here, we strongly believe the rate $\epsilon_n$ to be minimax up to $\log n$ factors, since an iid sampling assumption corresponds to estimating $f_1, \dots, f_R$ after observing $X_{1:n}, Y_{1:n}$, which is easier than estimating $f_1, \dots, f_R$ from $Y_{1:n}$ only. We thank the anonymous reviewer for pointing this out to us. Remark 6. It is also possible to verify the conditions of Proposition 2 in the case of a countable observation space, again by following the example set out in [60]; in this case Assumption 6 is verified with $\lambda(dy)$ the counting measure on $\mathbb N$, using a Dirichlet process prior on the emissions whose base measure satisfies a tail condition, and under a tail assumption on the true emissions. In this case, a parametric rate up to log factors is obtained.

Practical considerations and simulation study
In this section, we discuss the practical implications of the method and results described in the previous sections. As described in Section 2.2, we consider a cut posterior approach where $\Pi_1$ is based on a histogram prior on the emission distributions and where $\Pi_2$ is a Dirichlet process mixture of normals for the emission densities. We first describe the implementation of $\Pi_1(Q \mid Y_{1:n})$.

MCMC algorithm for Q
We first define the construction of the partition $I_M$. In Section 3 it is defined as a partition of $[0, 1]$ (or more generally of $[0,1]^d$), but using a monotonic transform this is easily generalized to a partition of $\mathbb R$ (or $\mathbb R^d$). In this section we restrict ourselves to dyadic partitions, i.e. $I_M$ consists of $\kappa_M = 2^M$ intervals of equal length. Once $I_M$ is chosen, we consider a prior on $(Q, \omega_M)$ (suppressing the $M$ henceforth), and for the sake of computational simplicity we consider a family of Dirichlet priors on the rows of $Q$ and on the histogram weights. We then use a Gibbs sampler on $(Q, \omega, X)$, in which $Q$ and $\omega$ are updated from their conditional distributions given $X, Y_{1:n}$, and the conditional distribution of $X$ given $Q, \omega, Y$ is sampled using the forward-backward algorithm (see [45] or [23]). To overcome the usual label-switching issue in mixtures and HMMs, we take the approach of Chapter 6 of [45], which deals with MCMC in the mixture setting: the authors propose relabelling relative to the posterior mode as a post-processing step (with the likelihood computed, again, by forward-backward).
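As an illustration, the forward-filtering backward-sampling step for the hidden chain can be sketched as below (a minimal sketch in our own notation; taking the stationary law of $Q$ as the initial distribution is our assumption, not spelled out in the text):

```python
import numpy as np

def ffbs(Y_bins, Q, omega, rng):
    """Forward-filtering backward-sampling of X given (Q, omega).

    Y_bins : length-n array of bin indices (histogram-coarsened observations)
    Q      : (R, R) transition matrix
    omega  : (R, K) per-state histogram weights (emission probs per bin)
    """
    n, R = len(Y_bins), Q.shape[0]
    # stationary distribution of Q, used as the initial law (our assumption)
    evals, evecs = np.linalg.eig(Q.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()
    # forward filtering: alpha[t, r] proportional to P(X_t = r | Y_1:t)
    alpha = np.zeros((n, R))
    alpha[0] = pi * omega[:, Y_bins[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ Q) * omega[:, Y_bins[t]]
        alpha[t] /= alpha[t].sum()
    # backward sampling of the hidden path
    X = np.zeros(n, dtype=int)
    X[-1] = rng.choice(R, p=alpha[-1])
    for t in range(n - 2, -1, -1):
        w = alpha[t] * Q[:, X[t + 1]]
        X[t] = rng.choice(R, p=w / w.sum())
    return X
```

Within the Gibbs sampler, this step alternates with conjugate Dirichlet updates of $Q$ and $\omega$ given the sampled path $X$.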
In our simulation study we considered $R = 2$ hidden states, with the transition matrix and emission distributions displayed above. We considered data of sizes $n = 1000, 2500, 5000, 10000$, obtained by restricting a single simulated data set of size 10000. To study the effect of $M$, we ran our MCMC algorithm targeting $\Pi_1(\cdot \mid Y_{1:n})$ with $\kappa_M = 2, 4, 8, 16, 64, 128$ bins.
We took $\gamma_{ij} = \beta_{ij} = 1$, so that independent uniform priors over the simplex were used for the rows of the transition matrix and for the histogram weights. The monotone transform we chose has a steep gradient near $y = 0$, which means that several data points end up in the middle bins when using the partition of $\mathbb R$ induced by a dyadic partition of $[0, 1]$, leading to an overcoarsening of the data. The linear interpolation in the range $|y| \leq 3$ is intended to provide more discrimination between data points lying in this interval, which contains the vast majority (approximately 97.5%) of the data.
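The induced binning of $\mathbb R$ can be sketched as follows. The paper's exact transform (with its linear interpolation on $|y| \leq 3$) is not reproduced in the text, so the logistic CDF below is a hypothetical stand-in used purely for illustration:

```python
import numpy as np

def to_unit_interval(y):
    """Hypothetical monotone map R -> (0, 1); a stand-in for the paper's
    transform (whose exact form, with linear interpolation on |y| <= 3,
    is not given here)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(y, dtype=float)))

def dyadic_bin(y, M):
    """Index in {0, ..., 2**M - 1} of the dyadic bin of [0, 1] containing
    the transformed observation; this realises the induced partition of R."""
    u = to_unit_interval(y)
    return np.minimum((u * 2 ** M).astype(int), 2 ** M - 1)
```

A transform that is too steep near $y = 0$ sends most observations to the two middle bins, which is the overcoarsening phenomenon described above.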
We ran the MCMC for 150000 iterations for each binning, discarding 10000 iterations as burn-in and retaining one in every twenty of the remaining draws, for a total of 7000 posterior draws.
The fitted distributions for Q under the priors Π 1 , with varying M , are shown in Figures 2, 3, 4 and 5 and demonstrate Theorems 2 and 3 as we detail below.
To make an additional comparison with a typical Bayesian nonparametric approach, we also fitted a model with a prior Π , which used Dirichlet priors on the transition matrix as in Π 1 but Dirichlet process mixtures of Gaussians to model the emission densities. We defer further details of Π to Section 5.2 as it is more relevant as a comparison with Π 2 , since both have a much higher computational cost in comparison to Π 1 when implementing as described in Section 5.2, as elaborated on in Section S6. In Figure 6, we compare distributions for the transition matrix under Π (·|Y 1:n ) with the distribution under Π 1 (·|Y 1:n ) for a selection of values of M , but we emphasize the large difference in computational tractability.
Even when κ M = 2, the posterior distribution under the prior Π 1 , as n increases, looks increasingly like a Gaussian, though its variance is quite large. For slightly larger values of κ M , this Gaussian shape is preserved but with lower variance, demonstrating Proposition S1.8 which states that the Fisher information grows as we refine the partition. However, we can also see that taking κ M too large leads to erroneous posterior inference, which may simply be biased (for instance when n = 5000 and we take κ M = 64) or lose its Gaussian shape entirely (for instance when n = 1000 and κ M ∈ {64, 128}). This demonstrates that the requirement of Theorem 3 that M n → ∞ sufficiently slowly has practical consequences and is not merely a theoretical artefact.
While the latter issue is easily diagnosed by eye, the former issue is somewhat more worrisome as the bias cannot be so easily identified when one does not have access to the true data generating process. In Figure 6, we see preliminary empirical evidence that this problem of bias may also be present when taking a fully Bayesian approach based on the prior Π , indicating that our approach may work better than a fully Bayes approach if M n can be tuned well. We emphasize however that this simulation study is very limited in scope, and is only intended to demonstrate our results rather than to make conclusive comparisons.
For the histogram prior $\Pi_1$, we propose the following heuristic to tune $\kappa_M$ after computing the posterior for a range of numbers of bins. For a small number of bins (say $\kappa_M = 4$), take the posterior mean as a reference estimate (say $\hat Q_0$), which should have low bias and moderate variance. For larger values of $\kappa_M$, compare $\hat Q_0$ with the $\kappa_M$-specific posterior mean and the $1-\alpha$ credible sets $C_\alpha$, for $\alpha = 0.05, 0.1$ say. We consider that $\kappa_M$ is not too large if $\hat Q_0$ lies well within the bounds of $C_\alpha$. We have added some further markings to Figure 4 which illustrate this approach.
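This heuristic is straightforward to automate; a minimal sketch (function and variable names are ours) applied to one entry of $Q$:

```python
import numpy as np

def kappa_ok(draws_by_kappa, ref_kappa=4, alpha=0.05):
    """Heuristic from the text: accept a bin count kappa if the reference
    posterior mean (computed at a small ref_kappa) lies inside the
    equal-tailed 1 - alpha credible interval of the kappa-specific draws.
    `draws_by_kappa` maps kappa -> 1-d array of posterior samples of a
    single entry of Q."""
    q0 = np.mean(draws_by_kappa[ref_kappa])  # reference estimate, hat Q_0
    ok = {}
    for kappa, draws in draws_by_kappa.items():
        lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
        ok[kappa] = bool(lo <= q0 <= hi)
    return ok
```

A $\kappa_M$ flagged `False` is suspected of the bias visible in the figures (e.g. $\kappa_M = 64$ at $n = 5000$).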
We emphasize that it is most important not to refine the partition too quickly, so even when adopting this heuristic one may wish to favour lower values of $\kappa_M$. When fitting $\Pi_2$ as discussed in Section 5.2, we used posterior draws based on one possible refinement, in which $\kappa_M = 4$ for $n = 1000$, $\kappa_M = 8$ for $n \in \{2500, 5000\}$ and $\kappa_M = 16$ for $n = 10000$; see Figure 7.
After simulating draws from the marginal posterior $\Pi_1$ on $Q$, we then use a Gibbs sampler to target the cut-posterior distribution $\Pi_2(f \mid Q, Y)$, for which the prior $\Pi_2(f \mid Q)$ is a Dirichlet process mixture as follows. For each hidden state $r$ (whose labels are fixed after the relabelling in Section 5.1), we use independent Dirichlet process mixtures of normals to model $f_r$, as in (21), with $M_0 = 1$, $\mu_c = 0$ and $\sigma_c^2 = 1$. The kernels $\varphi_\sigma$ are centred normal distributions with variance $v = \sigma^2$, and $v^{(r)}$ is equipped with an InvGamma($\alpha_\sigma, \beta_\sigma$) prior, a common choice made for computational convenience. Note that the scale parameter $\sigma$ is fixed across the $\mu$'s for each $r$. The MCMC procedure relies on the stick-breaking representation of (21), involving mixture allocation variables $s^{(r)}$. In order to approximately sample from the corresponding posterior, we replace the above prior with the Dirichlet-multinomial process (see Theorem 4.19 of [30]) with truncation level $S_{\max} = \lceil\sqrt n\,\rceil$, in which $W$ is instead sampled from a finite-dimensional Dirichlet distribution. This truncation level is suggested as a rule of thumb in the remark at the end of Section 5.2 of [30]. Further details on implementation can be found in Section S6.
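A prior draw from this truncated approximation can be sketched as follows (a sketch under our assumptions: the Dirichlet parameter is taken as $M_0/S_{\max}$ in each coordinate, and `a_sig`, `b_sig` are placeholder names for $\alpha_\sigma, \beta_\sigma$):

```python
import numpy as np

def draw_dp_mixture(n_obs, M0=1.0, mu_c=0.0, sigma_c=1.0,
                    a_sig=2.0, b_sig=1.0, rng=None):
    """One emission density drawn from the Dirichlet-multinomial
    approximation of the DP mixture prior: S_max = ceil(sqrt(n)) atoms,
    weights W ~ Dirichlet(M0/S_max, ..., M0/S_max), atom locations
    mu_s ~ N(mu_c, sigma_c^2), one shared variance v ~ InvGamma(a_sig, b_sig)."""
    rng = rng or np.random.default_rng()
    S_max = int(np.ceil(np.sqrt(n_obs)))
    W = rng.dirichlet(np.full(S_max, M0 / S_max))
    mu = rng.normal(mu_c, sigma_c, size=S_max)
    v = 1.0 / rng.gamma(a_sig, 1.0 / b_sig)  # InvGamma via reciprocal Gamma
    def density(y):
        y = np.atleast_1d(y)[:, None]
        return (W * np.exp(-(y - mu) ** 2 / (2 * v))
                / np.sqrt(2 * np.pi * v)).sum(axis=1)
    return density
```

In the Gibbs sampler, the same finite representation is updated conditionally on the allocation variables rather than redrawn from the prior.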
Nested MCMC. For the $i$th draw from the cut posterior, we would ideally sample first $Q_i$ from $\Pi_1(\cdot \mid Y)$, and then $f_i = (f_{ir})_r$ from $\Pi_2(\cdot \mid Q_i, Y)$. For the first step, we used Algorithm SA1 with burn-in and thinning. However, we encounter difficulty when simulating from $\Pi_2(\cdot \mid Q_i, Y)$: the target changes with $i$, and so an MCMC approach effectively needs to re-mix for each $i$. In order to achieve this, we run a nested MCMC approach (see [52]) in which, for each $i$, an interior chain of length $C$ is run, taking the final draw of the interior chain as our $i$th global draw.
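The nested scheme can be sketched as follows (our skeleton; `inner_step` stands in for the mixture Gibbs update, and the warm-start of each interior chain at the previous global draw is our reading of the scheme):

```python
import numpy as np

def nested_cut_sampler(Q_draws, inner_step, f_init, C=10, rng=None):
    """Nested MCMC for the cut posterior: for each draw Q_i from Pi_1(.|Y),
    run an interior chain of length C targeting Pi_2(.|Q_i, Y), keeping the
    final interior state as the i-th global draw. `inner_step(f, Q, rng)` is
    a user-supplied MCMC kernel (a placeholder for the mixture Gibbs update)."""
    rng = rng or np.random.default_rng()
    f, out = f_init, []
    for Q in Q_draws:
        for _ in range(C):  # interior chain, warm-started at the last state
            f = inner_step(f, Q, rng)
        out.append(f)
    return out
```

Because successive $Q_i$ are close (the marginal posterior on $Q$ is well localised), a short interior chain suffices in practice, as reported below.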
Concerns have been expressed about the computational cost of nested MCMC, especially when there is strong dependence between the two modules (in our case, representing the transitions and the emissions), as in Section 4.2 of [41]. However, we found little improvement beyond $C = 10$ interior iterations; see Figure 8. We expect that this is down to the well-localised posterior $\Pi_1(\cdot \mid Y)$ on $Q$, which means that the $Q_i$, and hence the targets $\Pi_2(\cdot \mid Q_i, Y)$, do not vary much from one $i$ to the next. Plots of the posterior mean when $C = 10$ are given in Figure 7. Since we used the thinned draws of $Q$ from $\Pi_1$, we ran 7000 such interior chains.
When using such a small number of iterations for the interior chain, nested MCMC is not costly compared to other cut sampling schemes (see e.g. Table 1 of [41]). We further suggest that the computational cost compared to fully Bayesian approaches with fixed targets is not as high as it may seem, given that the use of such interior chains should, at least partially, subsume the need for thinning. Indeed, running such interior chains with a fixed target for each $i$ would be precisely the same as thinning. A more detailed development of this idea can be found in Section 4.4 of the supplement to [12].
Comparison to fully Bayesian approach. As mentioned in Section 5.1, we also consider the fully Bayesian model with prior Π as a means of comparison. The prior Π independently places a Dirichlet prior on the rows of the transition matrix, as discussed beforehand for $\Pi_1$, as well as a Dirichlet process mixture prior on the emission densities, as discussed earlier in this section for $\Pi_2$.
In the implementation of Π, we ran the MCMC for 70000 iterations, discarding the first 10000 as burn-in and thinning at a rate of one in ten. This was chosen to approximately match the computational cost of 7000 iterations of the cut posterior, each with 10 interior iterations. We plot the resulting emission densities in Figure 9. In comparison to Figure 7, the pointwise bands seem to capture the ground truth more accurately. We remark however that Theorem 5 only provides guarantees on $L^1$ concentration, and so the plots will not entirely reflect the theory; $L^1$ credible sets are rather less easily visualised. We also emphasize our earlier comment from Section 5.1 that the simulation study is limited in scope, and is not intended to provide conclusive comparisons with other approaches.

Proofs of main results
We first prove Proposition 1.

Proof of Proposition 1: Convergence of the scores
Throughout this section, Assumptions 1-5 are assumed to hold.
Recall the definitions of the scores from Section 3. Define $\mathcal P$ to be the space spanned by the nuisance scores $\{H(h) : h = (h_r)_{r \leq R},\ h_r \in H_r\}$ in the full semiparametric model, and $\mathcal P_M$ to be the space spanned by the nuisance scores of the model $\mathcal M(I_M)$, with $H_r$ as in Equation (14). To prove Proposition 1, it is enough to prove that $\tilde S^{(M)}_{Q^*}$ converges to $\tilde S_{Q^*}$; this requires convergence both of the scores in $\mathcal P_M$ to scores in $\mathcal P$ and of the score functions for $Q$. A major difficulty here is to prove the convergence of $A_M S$ to $AS$ in $L^2(P^*)$ for $S \in L^2(P^*)$, because, contrary to what happens for mixture models in [28], the sets $\mathcal P_M$ are not embedded.
In Lemma 3 we prove, using a martingale argument, that the scores converge. We then prove that $\|(A - A_M)S\|_{L^2} = o(1)$ for all $S \in L^2(P^*)$. As mentioned earlier, the nontrivial part of this proof comes from the fact that the sets $\mathcal P_M$ are not embedded. Hence we first prove that the set $\mathcal P_M$ is close to a set $\tilde{\mathcal P}_M$. Since the partitions are nested, the $\tilde{\mathcal P}_M$ are nested, and reasoning as in [28] we obtain that $(\tilde A_M S)_{M \in \mathbb N}$ is Cauchy, and so converges to some $\tilde A S$. Then, arguing as in [28], we identify $\tilde A S$ with $AS$ in Lemma 5, by first arguing that $\tilde A$ is a projection onto a subspace of $\mathcal P$, and then showing that elements of $\mathcal P$ are well approximated by elements of $\tilde{\mathcal P}$, by approximating the corresponding $h_r \in L^2_0(f^*_r)$ by histograms $(h_{r,M})_{M \in \mathbb N}$. This completes the proof of Proposition 1.
Lemma 1. For any element $S \in L^2(P^*)$, $\|(A - A_M)S\|_{L^2} \to 0$.
Proof. As will appear in the proof, a key step is Lemma 2, which provides a vanishing-norm conclusion for suitable sequences $(h_M)_M$; Lemma 4 also applies to $L^2$-bounded sequences. Suppose now that, along a subsequence, $\|h_M\|_2 \to \infty$. Then, along this subsequence, the normalised directions satisfy the hypotheses of Lemma 2, and we conclude that $h_M/\|h_M\|_2 \to 0$ in $L^2$ along this subsequence, when in fact this subsequence has constant unit $L^2$ norm. We thus obtain a contradiction, and conclude that $\sup_M \|h_M\|_2$ is bounded. Then by Lemma 4, the desired convergence follows as before, using Lemma 2 and the boundedness just established, and Lemma 1 is proved.
For brevity we write $g^* = g^{(2)}_{Q^*,F^*}$, and we abuse notation by writing $y_j := y_{-j}$ for $j \in \mathbb N$, so $y_1$ is $y_{-1}$, etc. Expanding $D_1(h_M)$, and using that the relevant remainder terms vanish for any function $H$, we obtain the display (24) below.
(24) Note further that the family of functions of $y_1$ given by $y_1 \mapsto (\sum_s p^*_s Q^*_{sr} f^*_s(y_1))_{r=1}^{R}$ is linearly independent. Consider now a partition $\bar I_1, \bar I_2, \dots, \bar I_R$ of $[0,1]$ such that the matrix $F = (F^*_{rj})$ it defines has rank $R$; such a partition always exists by Lemma S3.22. Define also $D_p = \mathrm{diag}(p^*_1, \dots, p^*_R)$ (which is invertible under Assumption 3) and recall $Q^* = (Q^*_{sr})$. Then we can write $A^T = F^T D_p Q^*$, and note that $A$ also has rank $R$. Now, integrating the $y_1$ coordinate out of Equation (24), we obtain, for all $j = 1, \dots, R$, a system of relations with coefficients $B_{rj}$, and the preceding display may be rewritten in matrix form. From the definition, $(B_M)_M$ is a bounded sequence in a finite dimensional space, and so converges to some $B$ along a subsequence, and we may take limits along it. Consider now $D_2(h_M)$, which again vanishes by assumption and Lemma 6, and take the limit in $M$ along the subsequence where we just established convergence. As before, abuse notation by replacing $Y_{-k}$ by $Y_k$ for $k \in \mathbb N$. Expanding and multiplying through by joint marginals, we obtain Equation (26) in the limit. Rewriting (26) in terms of matrix products and using the linear independence of the functions $\{f^*_s(y_1)\}_{s=1,\dots,R}$, the coefficients vanish for each $s_1$. We can now repeat the argument, viewing the expression instead as a linear combination of the linearly independent functions $\{f^*_s(y_2)\}_{s=1,\dots,R}$, and then of $\{(Q^* f^*)_r(y_0)\}_{r=1,\dots,R}$, obtaining a relation for each $r, r', s_1$. In particular, setting $s_1 = r$, we obtain a relation for all $r, r'$. By Assumption 3, $Q^*_{ij} > 0$ for all $i, j$, and so we can divide through to obtain an identity involving a function $a$ depending on $r$ alone. Write $D_a$ for the diagonal matrix whose diagonal entries are the $a(r')$.
Equivalently, each column of $B$ is expressible as the integral over $\bar I_j$, with respect to $y_1$, of $D_p \bar h(y_1) = D_p B_0 f^*(y_1)$, and thus $B = D_p B_0 F$. Substituting into the previous display, and using the expression $A^T = F^T D_p Q^*$, we deduce that
$$a(r) Q^*_{rs} = -Q^*_{rs}\, a(s).$$
Taking $r = s$ and dividing through by $Q^*_{rr}$ gives $a(r) = 0$ for all $r$, hence $B_0 = 0$ and $\bar h = 0$. This means $\bar h_M(y_0) \to 0$ in $L^2(dy_0)$, and so $h_{r,M} \to 0$ in $L^2(f^*_r(y_0)dy_0)$, and hence in $L^2(P^*)$ under Assumption 4, which is what we wanted to show.
Proof. Without loss of generality we take $j = 0$. We first show that we can, uniformly in $M$, truncate the series in (16) to an arbitrary degree of accuracy; that is, the truncation error vanishes uniformly in $M$ as $J$ goes to infinity. Using Lemma S5.26, there exist $\rho < 1$ and $C > 0$ such that the relevant mixing bound holds for all $j \leq -1$ and all $M$, and similarly with $\mathcal G^0, \mathcal G^1$ in place of $\mathcal G^0_M, \mathcal G^1_M$. Therefore, for all $\varepsilon > 0$ there exists $J(\varepsilon) > 0$ for which the truncated series is within $\varepsilon$ of the full one. It now suffices to prove the convergence for each fixed $j$. Firstly, we show that, for $l = -1, 0$, the $\mathcal G^l_M$ form an increasing sequence of $\sigma$-algebras, $\mathcal G^l_M \subset \mathcal G^l_{M'}$ for $M < M'$, and that the limiting $\sigma$-algebra $\mathcal G^l_\infty := \sigma(\bigcup_{M \in \mathbb N} \mathcal G^l_M)$ is equal to $\mathcal G^l$. We suppress the dependence on $l$ since the proofs are the same. For the sake of presentation we code the coarsened observations accordingly; the claimed convergence then follows.

Proof of Theorem 4
The proof of Theorem 4 works by contradiction: if there exists no $C > 0$ such that the inequality holds for all $f$ with $\|g_{Q,f} - g_{Q,f^*}\|_{L^1}$ sufficiently small, then there exists a sequence of emission densities $f^{(n)}$ along which the ratio of the emission distance to the marginal distance diverges. For $j = 2, 3$, $r \in [R]$ and $f^{(n)}$ as above, define the normalised differences $h^n_r$ and the ratios $\Delta_{n,j}$, and write $h^n = (h^n_r : r \in [R])$. Since $\liminf_n \Delta_{n,3} = 0$, there exists a subsequence $\varphi_1(n)$ such that $\Delta_{\varphi_1(n),3} \to 0$, which implies also that $\Delta_{\varphi_1(n),2} \to 0$ since $\Delta_{n,2} \leq \Delta_{n,3}$. We have by assumption $\|g_{Q^*,f^{(n)}} - g_{Q^*,f^*}\|_{L^1} \to 0$, and by applying Lemma 7 we can find a further subsequence $\varphi_2(\varphi_1(n))$ such that $f^{(\varphi_2(\varphi_1(n)))} \to f^*$ in $L^1$. For notational convenience we write $\Delta_{n,j}$ for the terms $\Delta_{\varphi_2(\varphi_1(n)),j}$ along this subsequence, and in general any index $n$ is now interpreted as being along this subsequence. Let $P = \mathrm{diag}(p^*_1, \dots, p^*_R)$, and choose a partition $I_1, \dots, I_R$ such that $F^*$, as defined in Lemma 7, has rank $R$. Define also $H^n = (H^n_{ir})_{ir} = (\int_{I_i} h^n_r(y)\,dy)_{ir}$. We have for any integrable function $G$ and any $i \in [R]$ that
$$\int_{\mathbb R}\int_{\mathbb R} |G(y_1, y_2)|\,dy_1\,dy_2 \;\geq\; \int_{\mathbb R}\int_{I_i} |G(y_1, y_2)|\,dy_1\,dy_2 \;\geq\; \Big|\int_{\mathbb R}\int_{I_i} G(y_1, y_2)\,dy_1\,dy_2\Big|,$$
which implies in particular that $\sum_{r,s} (PQ^*)_{rs}\big(F^*_{ir} h^n_s(y_2) + H^n_{ir} f^*_s(y_2)\big) = o_{L^1}(1)$.

This gives us
The sequence $H^n$ is bounded, and so converges to some $H^0$ along a subsequence $\varphi_3(n)$, which we pass to. Then, since $F^*$ is of rank $R$, we may solve the limiting system, and so $h^n_r \to (B_1 f^*)_r$ in $L^1$ for each $r = 1, \dots, R$. Next, since $\Delta_{n,3} \to 0$ in $L^1$, replacing $h^n_r$ by its limit $(B_1 f^*)_r$ we get that the limiting identity holds for Lebesgue-almost-all $y_1, y_2, y_3$. Under Assumption 4 we have continuity of $f^*$, and so in fact it holds for all $y_1, y_2, y_3$. This expression is of the form of Equation (26) in the proof of Lemma 2, and the same proof techniques show that $B_1 = 0$, and hence $h^n = o_{L^1}(1)$. This yields the desired contradiction, as $\sum_{r=1}^R \|h^n_r\|_{L^1} = 1$ for all $n \in \mathbb N$. We conclude that no subsequence can exist for which $\Delta_{n,3} \to 0$, and hence that $\liminf_{n\to\infty} \Delta_{n,3} > 0$, which completes the proof of Theorem 4.
The proof of Theorem 4 uses the following result which we prove in Section S3.5 of the supplement.
Lemma 7. Consider a sequence of densities $f^{(n)}$ such that $g_{Q^*,f^{(n)}} \to g_{Q^*,f^*}$ in $L^1$; then $f^{(n)} \to f^*$ in $L^1(dy)$ along a subsequence.

Proof of Theorem 6
To prove Theorem 6, we make use of Lemma S3.24 which allows us to approximate the MLE by an estimator which does not depend on the observation for which we control the respective smoothing probability. Intuitively, this result says that a single observation does not influence our MLE too much.
Proof of Theorem 6. The starting point of the proof is Proposition 2.2 of [16].
where $c^*(y) = \min_{r \in [R]} \sum_{s \in [R]} Q^*_{rs} f^*_s(y) \geq \underline q \sum_s f^*_s(y)$ and $C^*$ is a constant depending only on $Q^*, f^*$. Throughout the proof we use $C^*$ for a generic constant depending only on $Q^*, f^*$. First, for any $K_n \to \infty$, defining the event $B_n$ in terms of the constant $\gamma$ of (19), Theorem 5 together with condition (19) gives $\Pi_{\mathrm{cut}}(B^c_n \mid Y_{1:n}) = o_{P^*}(1)$. Recall also that $n\epsilon_n^2 \to \infty$, so that for $z_n > 0$,
$$\Pi_{\mathrm{cut}}(\Delta_k > z_n \epsilon_n \mid Y_{1:n}) \leq \Pi_{\mathrm{cut}}(\{\Delta_k > z_n \epsilon_n\} \cap B_n \mid Y_{1:n}) + o_{P^*}(1).$$
Since $\|d\Pi_1(Q \mid Y_{1:n}) - \varphi_n(Q \mid Y_{1:n})\,dQ\|_{TV} \to 0$, where $\varphi_n(Q \mid Y_{1:n})$ is a Gaussian density centred at the MLE with variance $\tilde J^{-1}/n$, we have $\|\Pi_{\mathrm{cut}} - \hat\Pi_{\mathrm{cut}}\|_{TV} = o_{P^*}(1)$, where $\hat\Pi_{\mathrm{cut}}$ is the measure $\varphi_n(Q \mid Y_{1:n})\,dQ\,d\Pi_2(f \mid Q, Y_{1:n})$. Moreover, on $B_n$ the corresponding bound holds. To prove Theorem 6, we thus need to control the sum in (29) under the posterior distribution. In [1], the authors use either a control in sup-norm of $\hat f_s - f^*_s$, or they split their data into a training set used to estimate $f^*$ and a test set used to estimate the smoothing probabilities. We show here that with the Bayesian approach we need neither to split the data in two parts nor to obtain a sup-norm control. We believe that our technique of proof might also be useful for other approaches.
Note that
$$L(y_{\ell+1:n} \mid X_\ell = s, \theta) = \sum_{s'} Q_{ss'}\, L(y_{\ell+1:n} \mid X_{\ell+1} = s', \theta) \in \Big[\underline q^{\,2} \sum_{s'} L(y_{\ell+1:n} \mid X_{\ell+1} = s', \theta),\ \sum_{s'} L(y_{\ell+1:n} \mid X_{\ell+1} = s', \theta)\Big],$$
and also that $P_\theta(X_l = s \mid Y_{1:l-1}) \in [\underline q^{\,2}, 1]$, which together imply the bounds used below. This implies that the cut posterior mass of $A_n(z, \ell) \cap B_n$ is bounded as in (30). Applying Lemma S3.24, we also have a comparison, uniform in $\ell$, between $\Pi_{\mathrm{cut}}$ and $\Pi^{(\ell)}_{\mathrm{cut}}$. Hence, it suffices to control each summand of (30) with $\Pi_{\mathrm{cut}}$ replaced by $\Pi^{(\ell)}_{\mathrm{cut}}$. We first focus on lower bounding the denominator $D_n(\ell)$. Writing $\Omega_n = \{\sum_s f_s(y_\ell) \geq \sum_s f^*_s(y_\ell)/2\}$, a first lower bound is immediate. We will show that we can replace this bound with a bound in probability in which the integrand does not feature the indicator $1_{\Omega_n}$. This is similar in spirit to concentration results based on the framework of [29] (including our own Theorem S2.7), where a key ingredient of the proof is to lower bound the denominator by a suitable quantity. We have that

Now we can bound the probability in the integrand by
We can bound the conditional density, which in turn bounds the probability in the integrand. Thus, on this event of probability at least $1 - 4RK_n\epsilon_n$, we can bound the denominator from below, and so, on this event, together with the fact that $\int f(y_\ell \mid Y^{(-\ell)}_{1:n})\,dy_\ell \leq \sum_s \int f^*_s(y_\ell)\,dy_\ell$, as well as the bound on $\|f_s - f^*_s\|_{L^1}$ on $B_n$, we obtain a bound on $J_1$. To bound $J_2$, note that $1_{A_n(z;\ell)} \leq |f_r(y_\ell) - f^*_r(y_\ell)|/(z\epsilon_n f^*_r(y_\ell))$ up to a constant factor, and, bounding as before with $J_1$, we obtain $E^*[J_2] \leq RK_n/z$. Combining these bounds, and recalling that $T_n \leq 2[K_n\epsilon_n]^{-1}$ and $z_n \geq K_n/2$, leads to the bound (33) for some constant $c_1$ independent of $n$ and $\ell$.
Returning to the original goal of bounding (29), and having now bounded the sum over $|\ell - k| \leq T_n$, we study the sum over $|\ell - k| > T_n$. For this, we will bound the sum under $\hat\Pi_{\mathrm{cut}}$ defined previously. First note that, using $c^*(y) \geq \sum_s p^*_s f^*_s(y)$, the function $\sqrt{c^*}$ is $L^1$-bounded by some $C^*$ from condition (18). As soon as $L_n n\epsilon_n^2 \geq L_0 \log n$ with $L_0$ large enough, which is possible since $n\epsilon_n^2 \gtrsim \log n$ by assumption, we have $P^*\big(\exists\, \ell :\ c^*(y_\ell) \leq e^{-L_n n\epsilon_n^2}\big) = o(1)$, and so we can bound $c^*(y_\ell) \geq e^{-L_n n\epsilon_n^2}$ for all $\ell$. Next, we define the events $\bar A_n(\ell)$, and write the posterior mass of $\bar A_n(\ell)^c$ as a ratio, denoting the denominator by $D_n$ and the numerator by $N_n(\bar A_n(\ell)^c)$.
Using Lemma S2.9, so that $P^*[D_n < e^{-L_n n\epsilon_n^2}] \lesssim L_n^{-1}$, together with $\varphi_n(Q \mid Y_{1:n}) \lesssim n^{R(R-1)/2}$, we obtain, with probability greater than $1 - O(1/L_n)$, with $L_n$ going arbitrarily slowly to infinity,
$$\hat\Pi_{\mathrm{cut}}(B_n \cap \bar A_n(\ell)^c \mid Y_{1:n}) \leq e^{L_n n\epsilon_n^2} \int_{B_n} 1_{\bar A_n(\ell)^c}\, e^{\ell_n(\theta) - \ell_n(\theta^*)}\, d\Pi_2(f)\,\varphi_n(Q \mid Y_{1:n})\,dQ \leq C^* e^{2L_n n\epsilon_n^2}\, n^{R(R-1)/2}\, z_n \epsilon_n \sum_{|\ell - k| > T_n} \rho^{|\ell - k|},$$
where in the final bound we again use that $\log n \lesssim n\epsilon_n^2$ to consolidate the exponential. This implies that $\hat\Pi_{\mathrm{cut}}(B_n \cap \bar A_n(\ell)^c \mid Y_{1:n})$ is controlled under $P^*$, with the final bound vanishing as soon as $T_n \log(1/\rho) \geq 5 L_n n\epsilon_n^2$. We recall that we required earlier in the proof, in order to establish Equation (33), that $T_n \epsilon_n \to 0$ arbitrarily slowly. We thus use the assumption $n\epsilon_n^3 \to 0$ to choose $T_n$ so that both conditions may hold simultaneously.
Finally, combining (34) with (33) and (29), and since K n and L n can be chosen to go arbitrarily slowly to infinity and nε_n² ≳ log n, we obtain that for any z n going to infinity, Π cut (∆ k > z n ε_n |Y 1:n ) = o_{P*}(1), which completes the proof.

Conclusion and discussion
In this paper we use the cut posterior approach of [37] for inference in semiparametric models, applying it to nonparametric hidden Markov models. A difficulty with the Bayesian approach in high or infinite dimensions is that it is difficult (if not impossible) to construct priors on these complex models which allow for good simultaneous inference on a collection of parameters of interest, where good means having good frequentist properties (and thus leading to some robustness with respect to the choice of the prior). As mentioned in Section 1, a number of examples have been exhibited in the literature where reasonable priors lead to poorly behaved posterior distributions for some functionals of the parameters. We believe that the cut posterior approach is an interesting direction to pursue in order to address this general problem, and we demonstrate in the special case of semiparametric hidden Markov models that it leads to interesting properties of the posterior distribution and is computationally tractable.
Our approach is based on a very simple prior Π 1 on (Q, f), used for the estimation of Q, based on finite histograms with a small number of bins. This enjoys a Bernstein-von Mises property, so that credible regions for Q are also asymptotic confidence regions. Moreover, by choosing M large (but not too large) we obtain efficient estimators. Proving efficiency for semiparametric HMMs is nontrivial and our proof is of independent interest. Another original and important contribution of our paper is an inversion inequality (stability estimate) comparing the L 1 distance between g Q,f1 and g Q,f2 with the L 1 distance between f 1 and f 2 . Finally, another interesting contribution is our control of the error of the estimates of the smoothing probabilities using a Bayesian approach, which is based on a control under the posterior distribution of despite the double use of the data (in the posterior on f r and in the empirical distance above). It is a rather surprising result, which does not hold if f r is replaced with an estimate f̂ r constructed using y 1 , . . . , y n , unless a sup-norm bound on f̂ r − f * r is obtained. Section 5 demonstrates clearly the estimation procedures and highlights the importance of choosing a small number of bins in practical situations. There remains the open question of how to choose this number in a principled way; an interesting extension of the work would follow along the lines of the third section of [28], in which the authors prove an oracle inequality to justify a cross-validation scheme for choosing the number of bins.
Finally, the paper deals with the case where R is known. This assumption is rather common both in theory (see for instance the works of [27], [16] or [8]) and in practice (for instance in genomic applications as in [63]). The identifiability results of [4] and [24] do show that R can be identified as well, leaving open the possibility of jointly estimating R together with Q and f. We leave this direction of research for future work.

Abstract: The supplementary material contains a number of proofs, a presentation of the general contraction theory for cut posteriors, and some further details on simulations. In Section S1 we present the proofs of Theorem 3 and Proposition 2. In Section S2, we develop the general contraction theory for cut posteriors, which is a straightforward adaptation of [29] and has general interest beyond the setting of hidden Markov models. In Section S3, we collect a number of technical results. In Section S4, we document the assumptions required for the application of our contraction theory when the emissions are modelled as Dirichlet process mixtures of Gaussians. In Section S5, we gather some key results from the literature. In Section S6, we provide some further details of our computations.
All sections, theorems, propositions, lemmas, definitions, remarks, assumptions, algorithms, figures and equations presented in the supplement carry the prefix S. All other references point to the material of the main text [49].

S1. Proofs of main results
In this section we detail the proofs of Theorem 3 and Proposition 2.

S1.1. Proof of Theorem 3
The proof of Theorem 3 follows from Proposition 1 and Lemma S1.8. By the remarks at the end of the proof of Theorem 1, the MLE is regular. This means that, for all parameter sequences θ n of the form θ 0 + n^{−1/2} h with h ∈ [−H, H] for some fixed H, √n(θ^{(n)} − θ n ) converges in distribution under P_{θ n} to a variable, say Z, of fixed distribution, say µ Z . Explicitly, this means that for Borel sets A we have Here θ n is a sequence of parameters in Q × Ω M . However, any likelihood-based estimator depends on the observations only through the probabilities of each bin assignment, by considering the multinomial model with the count data. Thus, we can replace P_{θ n} with any law from the full semiparametric model for which the transition matrix Q n is the same and the functions f n have the corresponding ω M (θ n ) as the bin assignment probabilities, and the above convergence remains true.
Since M+1 is positive semidefinite.

Proof of Theorem 3. We have, by inspecting the part of the proof of Lemma S1.8 concerning regularity, that under any sequence of the form P n where µ M is Gaussian with covariance equal to J̄_M^{−1} by Theorem 1. Here d BL is the bounded Lipschitz metric, which metrizes weak convergence. By taking M n → ∞ sufficiently slowly, we have R_{M n ,n} → 0 also. Since for any t ∈ R^{R(R−1)}, (t^T J̄_{M n} t)^{−1} is decreasing by Lemma S1.8 and bounded below by (t^T J̄ t)^{−1}, the efficient variance, it converges, and the limit is equal to (t^T J̄ t)^{−1} by Proposition 1. This implies weak convergence of the measures µ_{M n} to µ, the centred Gaussian measure whose variance is J̄^{−1}, and applying the triangle inequality in the metric d BL proves the first claim.
For the Bernstein-von Mises result, we have from Theorem 2 that where R̄_{M,n} → 0 in P*-probability as n → ∞. Refining the sequence from before so that M n → ∞ sufficiently slowly that we additionally have R̄_{M n ,n} → 0, and noting that J̄_{M n} → J̄, we get

S1.2. Proof of Proposition 2
The proof of Proposition 2 is a direct consequence of the following result, which shows that Theorem S2.7 applies to semiparametric HMMs of the type described in Section 2.2. For the construction of Π 1 , a monotonic transformation is implicitly used (if necessary) in order to consider histograms on [0, 1].

Proposition S1.3. Let (Y t ) t≥1 be observations from a finite state space HMM with transition matrix Q and emission densities f = (f r ) r=1,...,R . Consider the cut posterior based on Π 1 , associated to the partition I M , and Π 2 .
(i) Under Assumptions 1-5, Assumption S7 is satisfied with T n = {Q; ‖Q − Q*‖ ≤ z n /√n}, with z n → ∞ sufficiently slowly, and φ n the restriction to T n of the Gaussian density centred at Q̂ n,M with variance J̄_M^{−1}/n.
(ii) Choosing Π 2 such that Assumption 6 is verified for suitable ε n , ε̃ n with nε_n² ≳ log n, the assumptions of Theorem S2.7 are verified for the same ε n and any K n → ∞.
Proof. (i) The idea of the proof is as follows. We know from Theorem 2 that Π 1 (·|Y 1:n ) is close in total variation distance to the normal distribution centred at the estimator Q̂. The restriction of this distribution to a ball of radius z n /√n centred at Q*, where z n → ∞ slowly, is then close in total variation to the original normal distribution with high probability as n → ∞.
Let p 1,n be the normal density with variance J̄_M^{−1}/n centred at Q̂, where Q̂ is the estimator from Theorem 2. Let p 2,n be the restriction of this normal density to {Q : ‖Q − Q*‖ ≤ z n /√n}, where z n → ∞. Fix ε > 0 and choose n large enough that P*(‖Q̂ − Q*‖ > cz n /√n) < ε, where 0 < c < 1 is a constant. Let p 3,n be the density obtained by restricting p 1,n to {Q : ‖Q − Q̂‖ ≤ (1 − c)z n /√n}. Then for sufficiently large n, we have with probability exceeding 1 − ε that the support of p 3,n is a subset of the support of p 2,n , in which case ‖p 1,n − p 2,n ‖ L¹ ≤ ‖p 1,n − p 3,n ‖ L¹ .
Since the variance of p 1,n is of order O(n^{−1}) and z n → ∞, the support of p 3,n captures all but a vanishing fraction of the mass of p 1,n as n → ∞, so ‖p 1,n − p 3,n ‖ L¹ → 0, and hence ‖p 1,n − p 2,n ‖ L¹ → 0. Thus, with probability exceeding 1 − ε, we have that ‖p 1,n − p 2,n ‖ L¹ → 0 as n → ∞. Since ε > 0 is arbitrary, we conclude ‖p 1,n − p 2,n ‖ L¹ = o_{P*}(1).
Taking φ n = p 2,n as above, for any z n → ∞, we verify Assumption S7 for any ε n with nε_n² ≳ log n.
(ii) This is a consequence of Theorem 3.1 in [60]. The choice of T n also provides verification of the hypothesis of Lemma 3.2 of [60] and establishes the required control on KL divergence described in (S2.35). To satisfy the testing assumption (S2.36), it suffices to note that for large enough n, T n is a subset of {Q :
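The total variation argument in part (i) can be illustrated numerically. The sketch below is our own one-dimensional toy illustration (the function name and the choice z_n = log n are ours): the L¹ distance between a Gaussian density and its renormalised restriction to an interval of mass m equals 2(1 − m), so with a centre at distance O(1/√n) from Q* and radius z_n/√n it vanishes as n grows.

```python
import math

def l1_to_truncation(q_hat, q_star, n, z_n, var=1.0):
    """L1 distance between the N(q_hat, var/n) density and its renormalised
    restriction to {q : |q - q_star| <= z_n / sqrt(n)}.
    If m is the Gaussian mass of that interval, the distance is 2(1 - m)."""
    sd = math.sqrt(var / n)
    cdf = lambda x: 0.5 * (1.0 + math.erf((x - q_hat) / (sd * math.sqrt(2.0))))
    m = cdf(q_star + z_n / math.sqrt(n)) - cdf(q_star - z_n / math.sqrt(n))
    return 2.0 * (1.0 - m)

# centre q_hat at distance 1/sqrt(n) from q_star, z_n = log n growing slowly:
# the distance to the truncated density vanishes as n grows
for n in (100, 10_000, 1_000_000):
    q_hat = 0.5 + 1.0 / math.sqrt(n)
    print(n, l1_to_truncation(q_hat, 0.5, n, z_n=math.log(n)))
```

This is only a scalar caricature of the matrix-valued setting of the proposition; the mechanism (fixed-rate shrinkage of the Gaussian against a slowly growing truncation radius) is the same.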

S2. General theorem for cut posterior contraction
In this section, we present a general theory for contraction of cut posteriors which is developed in the style of the usual theory for Bayesian posteriors of [29]. The main result of this section is Theorem S2.7, from which Proposition 2 follows.
Consider a general semiparametric model in which there is a finite-dimensional parameter ϑ and an infinite-dimensional parameter η. Suppose we wish to estimate the pair (ϑ, η) governing the law P n ϑ,η of a random sample Y 1:n ∈ Y n . Assume the data is generated by some true distribution P n * = P n ϑ*,η* and consider the following two models.

Model 1: a model T × F 1 on pairs (ϑ, η) such that ϑ* ∈ T, but for which we do not require that η* ∈ F 1 . Consider a joint prior Π 1 over this space, yielding a marginal posterior on ϑ given by

Model 2: a model on η, conditional on ϑ, with a prior Π 2 (·|ϑ). We denote the parameter set for η by F 2 and we assume that η* ∈ F 2 . In this model we obtain a conditional posterior distribution Π 2 (dη|Y 1:n , ϑ).
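The two-model structure above lends itself to a simple modular sampling scheme: draw ϑ from the Model-1 marginal posterior, then draw η from the Model-2 conditional posterior at that ϑ, with no feedback from Model 2 into the law of ϑ. A minimal sketch, with hypothetical toy Gaussian updates standing in for the two posterior samplers (names and numbers are ours, for shape only):

```python
import random

def cut_posterior_sample(y, n_draws, sample_theta, sample_eta_given):
    """Modular ("cut") posterior sampling: theta is drawn from the Model-1
    marginal posterior only; eta is then drawn conditionally from the
    Model-2 posterior at that theta. No feedback from Model 2 into theta."""
    draws = []
    for _ in range(n_draws):
        theta = sample_theta(y)            # stage 1: Pi_1(d theta | Y)
        eta = sample_eta_given(y, theta)   # stage 2: Pi_2(d eta | Y, theta)
        draws.append((theta, eta))
    return draws

# toy illustration with made-up conjugate-style updates
random.seed(0)
y = [0.1, 0.4, 0.3]
post_mean = sum(y) / (len(y) + 1)
draws = cut_posterior_sample(
    y, 200,
    sample_theta=lambda y: random.gauss(post_mean, 0.1),
    sample_eta_given=lambda y, th: random.gauss(th, 0.05),
)
```

In the HMM application of the paper, stage 1 is the histogram-based posterior on Q and stage 2 the conditional posterior on the emission densities; the point of the sketch is only the one-way flow of information.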
Definition S2.2. The cut posterior on (ϑ, η) is given by Write P n ϑ,η for the law of the observations Y 1:n . Define, for some ε > 0 and some loss function d(·, ·) on P ϑ,η , with B c (ε) its complement. Define also the Kullback-Leibler neighbourhoods of P n * = P n ϑ*,η* as where K(P n ϑ*,η* |P n ϑ,η ) = E n * [log dP n ϑ*,η* /dP n ϑ,η ] is the Kullback-Leibler divergence.

We now present a general theorem characterising cut-posterior contraction, in the spirit of the now classical result of [29]. Our main additional assumption in the cut setup is that we have sufficiently good control over Π 1 , similar to the kind established in Section 2 in the HMM setting, which we detail now.
Denote by π 1 (·|Y 1:n ) the marginal posterior density of ϑ with respect to some measure µ 1 on T , associated to the prior Π 1 on (ϑ, η).
Assumption S7. For all sequences z n → ∞ there exist T n ⊂ T with ϑ* ∈ T n , ε n = o(1), and a non-negative, random function φ n on T n with supp(φ n ) ⊂ T n , such that π 1 (·|Y 1:n )

Assumption S7 is mild: it holds, for instance, if φ n is a Gaussian density with mean ϑ̂ and variance i*/n for some positive semidefinite matrix i*, as soon as nε_n² ≳ log n.

Theorem S2.7. Let Assumption S7 hold with nε_n² → ∞ and assume that there exists C > 0 such that for any ϑ ∈ T n there are sets S n (ϑ) ⊂ F 2 satisfying ∪_{ϑ∈T n} {ϑ} × S n (ϑ) ⊂ V n (P n *, ε n ), inf_{ϑ∈T n} Π 2 (S n (ϑ)|ϑ) ≥ e^{−Cnε_n²}. (S2.35) Assume also that there exist L n → ∞, sieves F 2,n (ϑ) ⊂ F 2 and tests ψ n with for some K n , K > 0. Then, as n → ∞,

Remark S7. Typically K n = O(√L n ) when the tests in (S2.36) are constructed as a supremum of local tests. Given some L n → ∞ for which the conditions are satisfied, L n can be chosen to go to infinity arbitrarily slowly, and consequently K n → ∞ can also be chosen arbitrarily slowly.
The proof of Theorem S2.7, which we present below, is a rather simple adaptation of [29]. As in [30], we have simplified the common Kullback-Leibler neighbourhood assumption involving variances of the log-likelihood ratio using the technique of Lemma 6.26 therein.
Remark S8. By placing an additional assumption on neighbourhoods for higherorder Kullback-Leibler variations, the assumption on the existence of the sequence L n → ∞ can be replaced with an assumption that L n = L for a constant L > 1 + C -the proof is similar but uses a different (but standard) technique for proving the evidence lower bound. In this case we can choose K n a constant.
Proof of Theorem S2.7. Write B̄ = B c (K n ε n ) ∩ (T n × F 2 ), the subset of B c (K n ε n ) on which Π̃ is supported, and δ n = µ(T n ).
Write also Π̃(B̄|Y 1:n ) = Π̃(B̄|Y 1:n )ψ n + Π̃(B̄|Y 1:n )(1 − ψ n )1_{Ω n^c} + Π̃(B̄|Y 1:n )(1 − ψ n )1_{Ω n}, where Ω n = {sup_{ϑ∈T n} φ n (ϑ) ≤ e^{Knε_n²}}. By the testing assumption, the first term vanishes in probability, while the second term vanishes under Assumption S7. For the remaining term we have, for L̃ n → ∞ to be chosen later, that since J n (ϑ) ≤ 1. We can bound E*[I 1 ] as , by an application of Cauchy-Schwarz. Now, noting that by Lemma S2.9 and under the assumption of sufficient prior mass on the S n (ϑ), we have for any L̃ n → ∞ that under Assumption S7, by taking z n = o(L̃ n ). We bound E*[I 2 ] by Using Fubini's theorem again alongside the assumption on type II errors and on the sieves, and the deterministic bound on φ n (ϑ) over Ω n from Assumption S7, we bound what precedes by Choosing L̃ n = o(L n ), we get the required convergence.
Lemma S2.9 provides the lower bound on the denominator used in the proof of Theorem S2.7. The proof is standard, but we include it for completeness; it follows almost exactly the proof of Lemma 6.26 in [30].
Then sup as n → ∞.
By applying the logarithm to both sides and using Jensen's inequality, it suffices to show that with high probability Arguing as in [30], we obtain as n → ∞ for any M n → ∞. Since the bound does not depend on ϑ and uses only that {ϑ} × S n (ϑ) ⊂ V 0 (P 0 , n ), we obtain the desired bound on the supremum.

S3. Proofs of technical results
Section S3 is devoted to a number of technical proofs which are required for the main results but are reasonably standard in their approach.
In Section S3.1, we prove that the Fisher information matrix is invertible for general discrete state-space, discrete observation HMMs. This is necessary to apply the results of [10] and [17] in the proofs of Theorems 1 and 2.
In Section S3.2, we gather some properties of the Fisher information matrix, showing a Cramer-Rao bound for estimation in HMMs, and showing a local uniform convergence result for the expected information from n observations.
In Section S3.3, we collect technical lemmas used for the deconvolution argument which is the key part of the proof of Theorem 3.
In Section S3.4, we state a result which implies the existence of an admissible partition.
In Section S3.5, we collect technical lemmas which feature in the proofs of Theorems 4 and 6.

S3.1. Non-singularity of Fisher Information
In what follows, we establish invertibility of the Fisher Information matrix for general multinomial Hidden Markov models.
Proposition S3.4. Consider a multinomial HMM with latent states X t ∈ {1, . . . , R} and discrete observations Y t taking values in the set of basis vectors of R^κ, which we denote {e 1 , . . . , e κ }. Let Q ∈ R^{R×R} denote the transition matrix and Ω ∈ R^{κ×R} the matrix whose columns ω r = (ω mr ) m=1,...,κ are the emission probabilities for the r-th state.
Denote θ = (Q, Ω) ∈ R p the HMM parameter and write J(θ) for the Fisher Information matrix with entries given by Then, if Q, Ω have rank R, J is non-singular.
The idea of the proof is to exhibit estimators with L² risk of order 1/√n. We then show that the local asymptotic minimax result of [25] implies that the existence of such estimators guarantees a non-singular Fisher information. We use the spectral estimators proposed in [7], the control of which is established in Section S3.1.1.
Proof of Proposition S3.4. By an application of the van Trees inequality, analogously to Equation (12) in Theorem 4 of [25], we obtain for the HMM that (for p the parameter dimension) which holds for any vector U. Here J n is the joint Fisher information for n observations as in Lemma S3.17, and q is a density on R^p such that J q := ∫_{R^p} ∇q∇q^T 1_{q>0}/q dx is non-singular. Rescaling, we get that for any vector U, Taking a limit in n and applying Lemma S3.17 stated below gives that the limit inferior of the left-hand side is at least Call J c the matrix on the right-hand side, which we invert. It is indeed invertible for sufficiently large c, since J q is invertible and the set of invertible matrices is open. Denote its matrix square root by J c^{1/2}. Now suppose there exists V* such that (V*)^T J(θ*)V* = 0. Then, writing with A a fixed constant not depending on c, and taking the limit as c → ∞, we get (upper bounding also the averaging over the law q dh by the supremum over h) which contradicts the local uniform bound of Proposition S3.5.
The following proposition establishes the existence of an estimator with suitable risk, as required for the argument of Proposition S3.4.
Proposition S3.5. Let θ = (Q, Ω) be the parameter for the HMM described in Proposition S3.4 and let θ* be such that the identifiability conditions of Assumption 1 hold. Then there exists an estimator θ̂ such that, for any ε > 0 sufficiently small, we have, as n → ∞, up to label-swapping,
Proof. Let θ̂ be the spectral estimator constructed in Section S3.1.1. We have that The bound of Lemma S3.15 is valid for x ≥ 1, and so Since the bound is valid on a small neighbourhood around θ, it holds in supremum over that neighbourhood.

S3.1.1. Construction of spectral estimators
Spectral estimation of HMMs has been addressed by a number of previous works [1,2,7,8,16,34]. In [1], the authors exhibit estimators for emissions in parametric HMMs with a √n rate in probability, but we require convergence in expectation. In [16], convergence in expectation is shown, but the rate contains a logarithmic factor which is insufficient for our use in the proof of Proposition S3.4. In [2], the authors exhibit estimators for which the concentration is sub-Gaussian, which would permit integration to an in-expectation bound, but their work only concerns two-state HMMs.
We will construct estimators which are very similar to those constructed in these works. To get the convergence properties we require, we use as a starting point the sub-Gaussian concentration of the empirical estimator of E ijk = P(Y 1 = i, Y 2 = j, Y 3 = k). To extract the HMM parameters, we follow the approach set out in [7], who prove guarantees for their tensor power method, which outputs, from an estimate T̂ of a given symmetric, orthogonal tensor T, estimates of the associated eigenvalues and eigenvectors whose error is of the same order as the input error ‖T̂ − T‖.
Since T is required to be symmetric and orthogonal, we cannot directly apply their algorithm to extract estimates with appropriate concentration properties from estimates Ê of the tensor E. Instead, we construct an estimate T̂ of a symmetrised, orthogonalised version T of E, the eigenvalues and eigenvectors of which can then be extracted through the tensor power method and related back to the parameters of the HMM. This approach is similar to that taken in [6,33]; [33] consider the setting of spherical Gaussians, which requires a different symmetrisation step, but whose orthogonalisation procedure we adopt here. Our symmetrisation step is instead based on Theorem 3.6 of [7].
The construction. First define the following matrices and tensors: Next, we symmetrise about Y 3 as follows: define Ỹ 1 , Ỹ 2 by and define the corresponding quantities as above but with Y 1 , Y 2 replaced by Ỹ 1 , Ỹ 2 respectively; these matrices are then symmetric. Define estimators Ě^(12), Ě^(13) and Ě^(123) of the symmetric quantities as follows: split the sample in two and, from the first sample, produce estimates Ê^(·). This yields estimates Â, B̂ of A θ , B θ . Then define We remark that Theorem 3.6 of [7] gives us the decomposition of the symmetric tensor Ẽ where p i = P θ (X 2 = i) is taken to be the stationary distribution and µ i = P θ (Y 3 = ·|X 2 = i) = (ΩQ^T) i . In a slight abuse of notation, we suppress the dependence on θ in our notation for p i , µ i .

This justifies that Ẽ^(123)_θ is indeed a symmetric tensor. However, the µ i are generally not orthonormal, which is required for the application of Theorem 5.1 of [7] (the tensor power method). Define, for a three-way tensor T and matrices P, Q, R, Next, define Ẽ_W,θ as in Section 4.3.1 of [7] as follows: take W = W θ as the matrix of left orthonormal singular vectors of Ẽ^(12), so that W^T Ẽ^(12)_θ W = I, and define µ^W_i , which are orthonormal. Then take Now set Ŵ as the matrix of left orthonormal singular vectors of Ě^(12), with corresponding Ě_W,θ , to which we apply the tensor power method of [7]. This returns estimates µ̂^W_i , p̂ i of µ^W_i and p i . Now set µ̂ i = (1/√ν̂ i ) Ŵ† µ̂^W_i , which estimates (up to label-swapping) the columns of ΩQ^T. By repeating the above procedure but instead symmetrising about Y 2 (so defining corresponding Ỹ 1 , Ỹ 3 ), we produce estimates (up to label-swapping) of the columns of Ω. By permuting these estimates relative to some consistent estimator (such as the MLE, which is consistent without assuming invertibility of the Fisher information; see e.g. [19]), we finally produce estimates (Q̂, Ω̂) of (Q, Ω).
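As a sanity check on this construction, the following sketch (our own illustration, at the population level with κ = R = 3 and exact moments, so no sampling error and no sample splitting) carries out the whitening and a plain tensor power iteration with deflation in place of Algorithm 1 of [7], recovering the weights p_i via the map x ↦ x^{-2} applied to the extracted eigenvalues, and the vectors µ_i by de-whitening:

```python
import numpy as np

rng = np.random.default_rng(0)
R = 3

# ground truth: columns of M play the role of the mu_i, p the weights p_i
M = rng.dirichlet(np.ones(R), size=R).T          # R x R, full rank a.s.
p = np.array([0.5, 0.3, 0.2])

# population moments: E12 = sum_i p_i mu_i mu_i^T, E123 = sum_i p_i mu_i^{(x3)}
E12 = (M * p) @ M.T
E123 = np.einsum('i,ai,bi,ci->abc', p, M, M, M)

# whitening: W such that W^T E12 W = I; then T = E123(W, W, W) is symmetric
# with orthonormal components mu_i^W = sqrt(p_i) W^T mu_i
s, U = np.linalg.eigh(E12)
s, U = s[::-1][:R], U[:, ::-1][:, :R]
W = U / np.sqrt(s)
T = np.einsum('abc,ai,bj,ck->ijk', E123, W, W, W)

# tensor power iteration with deflation: eigenvalue lam_i = p_i^{-1/2}
mus, ps = [], []
for _ in range(R):
    v = rng.standard_normal(R)
    v /= np.linalg.norm(v)
    for _ in range(200):
        v = np.einsum('ijk,j,k->i', T, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)
    ps.append(lam ** -2)                         # recover weight p_i
    mus.append(lam * np.linalg.pinv(W.T) @ v)    # de-whiten to recover mu_i
    T = T - lam * np.einsum('i,j,k->ijk', v, v, v)
```

In the actual construction, the population moments are replaced by the split-sample estimates Ě^(·) and the naive power loop by Algorithm 1 of [7], which is what the concentration lemmas below control.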
In the following section, we control the error introduced at each stage of this construction.

S3.1.2. Control of spectral estimators
The control on the estimators constructed in Section S3.1.1 is ultimately proved in Lemma S3.15; this control is established through the following sequence of technical lemmas. The estimators we use are those constructed by the tensor power method in [7]. It is convenient to assume κ = R in what follows so that certain matrices are square; the extension to the general case is straightforward using SVD and low-rank approximations, but is omitted for ease of exposition.
The next lemma makes explicit the sub-Gaussian concentration of empirical tensors when the data is generated according to a geometrically ergodic Markov chain. It closely resembles Lemma 1 of [2].
Lemma S3.10 (Concentration of empirical tensors). Define the matrices and tensors Define further Then there exists a constant C = C(R, Q*) such that for all x ≥ 1, sufficiently small ε > 0, and any superscripts,

Proof. The proof is essentially the same as that of Lemma 1 in [2]. In that paper, the authors use the concentration result of [51], which provides a bound in terms of the pseudo-spectral gap γ_ps of the chain Z n = (X n , X n+1 , X n+2 , Y n , Y n+1 , Y n+2 ), together with Proposition 3.4 of the same reference, which relates γ_ps to the mixing time t_mix of the chain. Arguing as in [2], we see that t_mix is uniformly bounded on a neighbourhood of the true parameter, since the eigenvalues of the transition matrix vary continuously in its entries. For the remainder of the proof, we follow the argument of [2].
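The quantity controlled in Lemma S3.10 is the empirical frequency tensor of consecutive observation triples. A small simulation sketch (the two-state, two-symbol HMM and all numbers here are our own toy choices) compares the empirical tensor with its population counterpart:

```python
import numpy as np

rng = np.random.default_rng(1)
Q = np.array([[0.8, 0.2], [0.3, 0.7]])      # transition matrix
Omega = np.array([[0.9, 0.2],               # Omega[i, a] = P(Y = i | X = a)
                  [0.1, 0.8]])
pi = np.array([0.6, 0.4])                   # stationary distribution of Q

# population tensor E_ijk = P(Y1 = i, Y2 = j, Y3 = k) at stationarity
E = np.einsum('a,ab,bc,ia,jb,kc->ijk', pi, Q, Q, Omega, Omega, Omega)

def empirical_tensor(n):
    """Empirical frequencies of consecutive observation triples."""
    u = rng.random(n + 2)
    X = np.empty(n + 2, dtype=int)
    X[0] = 0 if u[0] < pi[0] else 1
    for t in range(1, n + 2):
        X[t] = 0 if u[t] < Q[X[t - 1], 0] else 1
    Y = (rng.random(n + 2) >= Omega[0, X]).astype(int)
    counts = np.zeros((2, 2, 2))
    np.add.at(counts, (Y[:n], Y[1:n + 1], Y[2:n + 2]), 1.0)
    return counts / n

errs = [np.abs(empirical_tensor(n) - E).max() for n in (1_000, 50_000)]
print(errs)
```

The max-norm error shrinks at roughly the 1/√n rate that the sub-Gaussian concentration quantifies, uniformly over a neighbourhood of the true parameter in the lemma.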
An important step in the application of the tensor power method is to preprocess the empirical tensors controlled in Lemma S3.10 so that they are symmetric and orthogonal. The symmetrisation step involves a linear transformation of the first two observations so that the three-way expectation is symmetric: the relevant quantities are those of Lemma S3.10 but with Y 1 , Y 2 replaced by Ỹ 1 , Ỹ 2 respectively, and these matrices are then symmetric.

Lemma S3.11. Define estimators Ě^(12), Ě^(13) and Ě^(123) of the symmetric quantities as follows: split the sample in two and, from the first sample, produce estimates Ê^(·); this yields estimates Â, B̂ of A θ , B θ . Then define Ě^(12) Then there exists a constant C̃ = C̃(R, W*) such that for all x ≥ 1, sufficiently small ε > 0 and any superscripts,

Proof. Write E for the event that Ê^(12) is invertible. Choose r such that the ball of radius r around θ* is contained in the set of θ for which the corresponding Q = Q(θ) is invertible. Then, for all θ in a ball of radius at most r/2 inside the ball on which Lemma S3.10 holds, the ball of radius r/2 centred at any Q(θ) contains only those θ' for which Q(θ') is invertible. Hence, choosing x = c√n where c = c(r), we can apply Lemma S3.10 for sufficiently large n to get that Working on E, Lipschitz continuity of matrix inversion and multiplication implies that Â and B̂ satisfy an exponential bound of the kind shown for Ê^(·), as a corollary of Lemma S3.10. This implies that Ě^(12) and Ě^(13) satisfy a corresponding bound when approximating the analogous quantities with Â, B̂ replaced by A, B. These quantities in turn approximate Ẽ^(·) by the same arguments as presented in Lemma S3.10.
For the estimates Â and B̂, we can formally take Â = B̂ = I on the event that Ê^(12) is not invertible.
We deduce that, for C̃ suitably chosen and after enlarging the constant by some factor greater than one, for sufficiently large n, any x ≥ 1 and sufficiently small ε > 0,

Lemma S3.11 thus tells us we can construct an approximation to the symmetrised tensors Ẽ^(·). In the next result, we show that we can further extend this to an approximation of a symmetrised, orthogonalised tensor, at which point the tensor power method may be applied: the parameter values extracted from this approximate tensor well approximate the parameter values associated to the true orthogonal, symmetric tensor (which are themselves related in an analytic sense to the parameters of interest). The orthogonalisation or 'whitening' step is detailed in Lemma S3.12 and follows closely the ideas of Appendix C of [33].
We are now ready to show that the concentration of Lemma S3.10 is preserved under this further reduction.

Lemma S3.12 (Whitening preserves concentration). Define Ẽ_W,θ as in Section S3.1.1. Then there exists an estimate Ŵ, with corresponding Ě^(123)_W = Ě^(123)(Ŵ, Ŵ, Ŵ), such that, for some C̃' = C̃'(R, Q*), for any x ≥ 1 and sufficiently small ε > 0,

Proof. The proof essentially follows Appendix C.6 in [33]. In particular, we take W and Ŵ to be the matrices of left orthonormal singular vectors of Ẽ^(12) and Ě^(12) respectively. Then their Lemma 12 gives that where p_min is the smallest element of the stationary distribution (which under Assumption 1 is uniformly bounded below on a ball around the true parameter) and ζ_R(Ẽ^(12)) is the R-th largest singular value of Ẽ^(12)_θ, which is also positive at θ = θ* under this assumption (and uniformly bounded below on some neighbourhood of the true parameter).

Given this tensor with good concentration properties, Theorem 5.1 of [7] gives guarantees for an algorithm which recovers the µ^W_i (and the w_i). Note that the probability of failure (their η) for the algorithm described may be tuned by the user, and so for a theoretical estimator we may consider the algorithm in the limit η → 0 (corresponding to running the algorithm for an arbitrarily large number of iterations). A direct application of the theorem then provides the following result.

Lemma S3.13 (Recovery of orthogonalised parameters). Suppose Algorithm 1 of [7] is called iteratively, as in the statement of Theorem 5.1 of the same reference. Let µ̂^W_i be the estimates of µ^W_i returned by this process and p̂ i the estimates of p i returned (by also applying x → x^{−2} to the estimated weights).
Then there exists C = C(R, Q) and a permutation τ ∈ S R such that for all x ≥ 1 and sufficiently small ε > 0 The previous result follows easily from the fact that the order of the error in the output corresponds to the order of the error in the input tensors, together with the smoothness of x → x^{−2}, so the proof is omitted. It remains to show that these estimates can be 'de-whitened' to produce estimates of the vectors of interest µ i , which we do in the following result.
Lemma S3.14 (Recovery of original parameters). Let W† be the Moore-Penrose pseudo-inverse of W^T. Then and the µ̂ i constructed as µ̂ for some τ ∈ S R and C = C(R, Q*), for all x ≥ 1 and for sufficiently small ε > 0.
Proof. The first part of the statement is given by Theorem 4.3 of [7]. The control over µ̂ i then follows from the control over Ě^(12) established in Lemma S3.11 and the Lipschitz continuity of the singular vectors and of the Moore-Penrose pseudo-inverse on a neighbourhood of the true parameter, as well as the control over ν̂ i and µ^W_i established in Lemma S3.13. The estimates produced target (up to label-swapping) the columns of ΩQ^T. By adapting the procedure to symmetrise about the second view (rather than the third), we can produce estimates (up to label-swapping) of Ω. Although the labelling is arbitrary, the permutation τ may differ in the two cases.
To alleviate this issue, we use any consistent estimator (such as the maximum-likelihood estimator, which is consistent without assumptions on the Fisher information; see Theorem 1 of [19]). Because the transition matrix Q has full rank, the µ i (and the associated vectors from symmetrisation about the second view) are well separated, and so we can fix a labelling relative to this estimator.
Lemma S3.15 (Fixing of labels). Let Ξ̂ 3 be the matrix whose columns are the µ̂ i as in Lemma S3.14 (with symmetrisation about the third view as detailed) and Ξ̂ 2 be the corresponding matrix when symmetrising about the second view.
Moreover, let Ξ̃ 3 , Ξ̃ 2 be the corresponding estimates formed from the MLE, that is, where ‖·‖ denotes the maximum of the squared norms of the columns. Define Recall θ = (Q, Ω) and define θ̂ = (Q̂, Ω̂). Then there exists τ ∈ S R (the permutation accounting for the difference between the truth and the MLE) such that, for all ε > 0 small enough, for some C = C(R, Q).
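The label-fixing step amounts to a small search over permutations: align the columns of the spectral estimate with those of a consistent reference estimator (the MLE in the lemma) by minimising the largest squared column distance. A minimal sketch (function name and toy matrices are ours):

```python
import itertools
import numpy as np

def fix_labels(Xi_hat, Xi_ref):
    """Permute the columns of Xi_hat to best match the reference Xi_ref,
    minimising the maximum squared column distance over permutations."""
    R = Xi_hat.shape[1]
    best, best_cost = None, np.inf
    for tau in itertools.permutations(range(R)):
        cost = np.max(np.sum((Xi_hat[:, list(tau)] - Xi_ref) ** 2, axis=0))
        if cost < best_cost:
            best, best_cost = tau, cost
    return Xi_hat[:, list(best)]

# well-separated columns: a small perturbation of a column permutation
# is mapped back to the reference ordering
Xi_ref = np.eye(3)
Xi_hat = Xi_ref[:, [2, 0, 1]] + 0.01
aligned = fix_labels(Xi_hat, Xi_ref)
```

Because the columns are well separated when Q has full rank, the minimising permutation is unique with high probability, which is what makes the labelling consistent.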

S3.2. Fisher Information and asymptotic lower bound
The following result gives a lower bound on the asymptotic variance of regular estimators and is used in the proof of Lemma S1.8.
Lemma S3.16. Let θ̂ n be a regular estimator in the histogram model with partition I M , with M fixed. Let Z denote the random variable whose law is the limit distribution of the scaled and centred estimates √n(θ̂ n − θ*) under P* as n → ∞. Then Cov(Z) ≥ J(θ*)^{−1}, where J(θ) is the Fisher information matrix at the parameter θ.
Proof. Consider estimation of the one-dimensional parameter λ^T θ. Recall from Lemma S3.17 that for any sequence θ n → θ we have J(θ) = lim_{n→∞} (1/n) J n (θ n ), with J n the joint Fisher information for n observations.
We follow the arguments of Gill and Levit [31]. Let π be a fixed prior density on [−1, 1] and J(π) = E[((log π)'(θ))²] its Fisher information. For a given H > 0, let π(H, n) be the rescaling of this prior to the interval A = [θ 0 − n^{−1/2}H, θ 0 + n^{−1/2}H]. Applying the van Trees inequality, we get where the left-hand expectation is taken over the joint law of the parameter and the data given the parameter (with the parameter distributed according to π(H, n)), and the expectation in the denominator is over θ having that law also. Dividing through gives Taking first n → ∞ and then H → ∞, we obtain the result, applying Lemma S3.17 to get convergence of the expectation term in the denominator to the Fisher information.
The previous result requires the following convergence property of the Fisher information matrices. We also use it in the proof of Proposition S3.1.
Lemma S3.17. Let θ n → θ. Denote by J n = J n^{(M)} the Fisher information for n observations in the model with M bins, given by and by J the Fisher information for the model. Then When θ n = θ the result is simply the definition of J; the interest is hence in establishing local uniform convergence.
Proof. Theorem 3 of [19] establishes the result for the observed information, a.s. under $P_\theta$. Moreover, the Fisher and Louis identities [42] show that boundedness of the complete observed information and complete scores (which take simple forms) implies boundedness of the observed information. Write $\mathcal Y = [M]^N$ and $\mu$ for the counting measure on this space, with associated densities $p_\theta(y)$. Since the observed information is bounded and $p_{\theta_n}(y) \to p_\theta(y)$ pointwise, hence in $L^1(\mu)$ by Scheffé's lemma [55], the convergence follows.
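The version of Scheffé's lemma used here states that pointwise convergence of probability densities upgrades to $L^1(\mu)$ convergence:

```latex
p_{\theta_n}(y) \xrightarrow[n\to\infty]{} p_\theta(y)\ \ \mu\text{-a.e.},
\qquad \int p_{\theta_n}\,d\mu = \int p_\theta\,d\mu = 1
\;\Longrightarrow\;
\int \big| p_{\theta_n} - p_\theta \big|\,d\mu \longrightarrow 0 .
```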

S3.3. Technical results for the deconvolution argument
Here, we collect the technical results required to prove Lemma 2.
Let us recall some notation for what follows. For $r \in \{1, \dots, R\}$ and $g \in L^2(f^*_r)$, define $H_r(g)$ as before, with the conditioning being on the sigma algebras $\mathcal G_0$ and $\mathcal G_{-1}$. Then define, for $g = (g_1, \dots, g_R)$, the corresponding sums. These are the score functions in the submodels with fixed transition matrix and emission densities varying along the path characterised by $g$.
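Concretely, along the multiplicative path $f_{r,t} = f^*_r(1 + t g)$ (the same paths used for the LAN expansion in Section S5.2) the pointwise score at $t = 0$ is $g$ itself, which is why the zero-mean condition against the emission density characterises admissible directions:

```latex
\frac{\partial}{\partial t}\Big|_{t=0}
\log\big(f^*_r(y)\,(1 + t\,g(y))\big) = g(y),
\qquad \int g(y)\, f^*_r(y)\,dy = 0 .
```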
Our first technical result allows us to eliminate certain terms when we make the main deconvolution argument. We recall $\mathcal G_l = \sigma(Y_{-\infty:l})$.
Lemma S3.18. For $g \in \mathcal H_r$ and $j < k \le 0$, we have $E^*[\,g(Y_j)\,1\{X_j = r\}\mid Y_{k:0}\,] = 0$. Proof. We first note that $E^*[\,g(Y_j)\,1\{X_j = r\}\mid Y_{k:0}\,] = E^*[\,E^*[g(Y_j)\,1\{X_j = r\}\mid X_j, Y_{k:0}]\mid Y_{k:0}\,]$ by the usual properties of conditional expectations. Given $X_j$, $Y_j$ is conditionally independent of $Y_{k:0}$, and thus the inner conditional expectation does not depend on $Y_{k:0}$. Since $X_j$ takes values in the discrete set $[R]$, we can write this conditional expectation explicitly. But since $g \in L^2(f^*_r\,dx)$ has zero expectation against the emission distribution for state $r$, the right-hand side vanishes, and hence so does its expectation given $Y_{k:0}$.
In the following section we relate the score functions $H_M$ in the binned model to score functions $H$ in the full model, where the perturbation $g$ is in the 'direction' of a histogram.
We have now reduced the problem to the case where the sums in (S3.37) are over $j > -J$. Since $\|h_M\|_{L^2} = O(1)$, to show convergence of the remaining difference between $H_{r,M}$ and $H_r$ it suffices to show that $P^*(X_j = r\mid \mathcal G_l^M) \to P^*(X_j = r\mid \mathcal G_l)$ in $L^\infty(P^*)$ for $l = 0, -1$ and $-J \le j \le 0$.
We use Lemma S5.25. Let $K > J$. We then obtain, for all $j \ge -J$ and $l = 0, -1$, $P^*(X_j = r\mid \mathcal G_l) = (1 + O(\rho^{j+K}))\,P^*(X_j = r\mid Y_{-K:l})$, and similarly $P^*(X_j = r\mid \mathcal G_l^M) = (1 + O(\rho^{j+K}))\,P^*(X_j = r\mid Y^{(M)}_{-K:l})$. By choosing $K$ sufficiently larger than $J$, we have seen that the first and third terms are small, and bounded above by $\epsilon/(9JL)$ uniformly in $j \ge -J$. It remains to bound the second term.
We write $P^*(X_j = r\mid Y^{(M)}_{-K:l})$ in terms of the histogram emission densities, and similarly for $Y_{-K:l}$. The true emissions $f^*_r$ are continuous functions on the compact set $[0,1]$, hence uniformly continuous, and so are approximated in $L^\infty([0,1])$ by the histograms $f_{\omega^*_r}$ with $\omega^*_r = (\int_{I_m} f^*_r(y)\,dy : m \in [\kappa_M])$. Since there are a fixed number of such terms, we can choose $M$ large enough that, for each $j$, the preceding display is bounded above by $\epsilon/(9JL)$. Putting everything together and using Hölder's inequality, we obtain the claimed bound. The general case follows by summing over $r$, which concludes the proof.

S3.3.2. Deconvolution argument
The next lemma essentially provides a reduction of the deconvolution argument.
Proof. For notational simplicity, for any index $j$ and $\sigma$-algebra $\mathcal G$ we will write $h_M(Y_j)P^*(X_0\mid\mathcal G) = \sum_r h_{r,M}(Y_j)\,P^*(X_0 = r\mid\mathcal G)$.
Using Lemma S3.18 with $k = 0$, we eliminate the terms which condition on $\mathcal G_0$. Using Lemma S5.26, we obtain, for a particular $\rho < 1$ which does not depend on $M$, the corresponding expansion, with the right-hand sum vanishing because $E^*(h_M(Y_j)P^*(X_j\mid\mathcal G_{J-1})) = 0$ by Lemma S3.18. The term $O(\rho^J)$ is uniform in $M$ since $\|h_M\|_{L^2(P^*)} \le 1$, and by choosing $J$ large enough this term can be made arbitrarily small. We conclude the stated convergence as $M \to \infty$. By instead applying Lemma S3.18 with $k < 0$, we can argue similarly that the analogous convergence holds for all $k \ge 0$.

S3.3.3. On the convergence of $\tilde A_M$
The proof of the following result is very similar to that of Lemma 1 in [28], which used the nestedness of the spaces of score functions for mixture models to show convergence. Proof. By the arguments of [28], for any $S \in L^2(P^*)$ the sequence $(\tilde A_M S)$ is Cauchy, as the spaces $\tilde P_M$ are nested. By completeness of $L^2(P^*)$ we then establish convergence to some element $\tilde A S \in L^2(P^*)$. Since the $\tilde A_M$ are projections, $\tilde A$ is also a projection onto its image, which is some subspace $\tilde P$. Recall that $P$ is defined as the closure of the linear span of the $H(h)$ functions, as in the statement of Lemma 1. By definition, $P$ is closed, and since $\tilde P_M \subset P$ for all $M$, we must have $\mathrm{Cl}(\bigcup_{M\in\mathbb N}\tilde P_M) \subset P$. We show that any element $H(h) \in P$ can be written as the $L^2$ limit of some sequence in $\bigcup_{M\in\mathbb N}\tilde P_M$. First choose a sequence $h_M \to h$ in $L^2(P^*)$ such that $\|h_M\|_{L^2} \le 2\|h\|_{L^2}$ for all $M$, which is possible under Assumption 5. We now wish to show that, for all $r \in [R]$, $(h_{r,M}(Y_0) - h_r(Y_0))\,P^*(X_0 = r\mid\mathcal G_0)$ vanishes in $L^2$. By Lemma S5.26, we can choose $J$ sufficiently negative that the tail contribution is small; since $\|h_{r,M}\|_{L^2} \le 2\|h_r\|_{L^2}$, $J$ can be chosen so that the above holds uniformly in $M$.
To control the finitely many remaining terms, it suffices to note the $L^\infty$ boundedness of the probabilities and the $L^2$ convergence of $h_M$ to $h$, so that for $M$ sufficiently large the remaining finite sum is bounded by $\epsilon/2$, and (S3.38) has $L^2$ norm at most $\epsilon$. Since $\epsilon$ was arbitrary, we conclude that $P \subset \tilde P$, and so the two spaces coincide and $A = \tilde A$, as required.

S3.4. Admissible partitions
The following lemma is well-known in the identifiability literature. We recall it here for completeness.

S3.5. Technical results for the proofs of Theorems 4 and 6
We employ the following lemma (Lemma 7 of the main text) in the proof of Theorem 4 to eliminate certain terms.
Lemma S3.23. Consider a sequence of emissions $f^{(n)}$ satisfying the stated conditions, where the $i = 1$ term is the unconditional likelihood. Note that (S3.40) corresponds to the log-likelihood based on observing $Y^{(-\ell)}_{1:n}$. Standard arguments for showing asymptotic normality (see e.g. [10]), using Taylor expansions of the derivative of the expression (S3.41), give $R_n := \frac{1}{n} D^2_\theta \ell_{\hat\theta}(Y_{1:n}) + J_M$, the error in approximating the Fisher information by the negative Hessian of the log-likelihood. The uniform law of large numbers for the observed information, as given in Theorem 3 of [19], together with the consistency of the MLE (see e.g. [40]), then implies that $R_n = o_{P^*}(1)$.
For the estimator $\hat\theta^{(-\ell)}$, we have for each $\ell \in L_n$, by the definition in (S3.39), the corresponding expansion. Note that the $R_n$ term does not depend on $\ell$, and vanishes in $P^*$-probability.
We will show that, for each $\ell \in L_n$, $\frac{1}{\sqrt n}\|\nabla_{\theta^*}\ell^{(-\ell)}_{\theta^*}(Y^{(-\ell)}_{1:n}) - \nabla_{\theta^*}\ell_{\theta^*}(Y_{1:n})\| > \epsilon$ with probability $o_{P^*}(|L_n|^{-1})$, so that a union bound implies that $P^*\big(\exists\, \ell \in L_n : \frac{1}{\sqrt n}\|\nabla_{\theta^*}\ell^{(-\ell)}_{\theta^*}(Y^{(-\ell)}_{1:n}) - \nabla_{\theta^*}\ell_{\theta^*}(Y_{1:n})\| > \epsilon\big) \le |L_n|\,o_{P^*}(|L_n|^{-1}) = o_{P^*}(1)$, from which the result follows. The argument is based on the exponential forgetting properties and the expansion of the score considered in [19]. For the full log-likelihood (S3.41), we have through the Fisher identity, as in [19], that the log-likelihood of $\theta$ given $Y_{1:n}$ and $X_0 = x$, denoted $l_n(\theta, X_0 = x)$, has gradient at $\theta = \theta^*$ given by a sum of terms of the form $E_{\theta^*}\big[\nabla_{\theta^*}\log(Q_{X_{i-1},X_i}\,\omega_{X_i,Y_i})\mid Y_{0:k-1}, X_0 = x\big]$, which can be seen by writing the log-likelihood as a telescoping sum and using the Fisher identity to write each term as a conditional expectation of the full likelihood. We can do the same for the likelihood without $Y_\ell$. The first sum remains unchanged compared to the expression without the missing data point. Lemma 8 of [19] shows that we may replace the $\Delta_{k,0,x}$ by $\Delta_{k,0}$, where the latter is defined by conditioning only on $Y_{0:k}$; the same argument made there shows the analogous result for the $\Delta^{(-\ell)}_{k,0,x}$ with respect to the analogously defined $\Delta^{(-\ell)}_{k,0}$. We note that, under Assumptions 3 and 4, the $\Delta_{k,0}$ and $\Delta^{(-\ell)}_{k,0}$ are bounded uniformly in $X_{(\cdot)}, Y_{(\cdot)}$, and so the contribution of these terms can be controlled for any $L_n$ to be chosen later. It remains to control the difference between the remaining contributions for suitably chosen $L_n$. Similarly to what is done in [19], we define $\Delta_{k,m}$ by conditioning instead on $Y_{m:k}$, and write $\Delta_{k,m} = E_{\theta^*}\big[\nabla_{\theta^*}\log(Q_{X_{k-1},X_k}\,\omega_{X_k,Y_k})\mid Y_{m:k}\big]$ as a telescoping sum, where $\rho_* \in (0,1)$ is a constant which depends on $\theta^*$ alone. Furthermore, their equation (20) gives a corresponding bound. Combining the above two displays in the same way as Lemma 10 of [19], we have for a suitable constant $C(Q^*, \omega^*)$ that $\|\Delta_{k,0} - \Delta_{k,\ell+1}\|_{L^2(P^*)} \le C(Q^*, \omega^*)\,\rho_*^{(k-\ell-1)/2-1}/(1 - \rho_*)$.

S4. Assumptions required for the application of Theorem 5 to Dirichlet process mixtures
In this section, we detail the assumptions required for Proposition 1. We will need the following assumptions on the behaviour of the emissions, which are assumptions (T1)-(T3) of [60].
C. For all $1 \le i \le R$, $f^*_i$ is positive and there exist $c_i > 0$ and $y^{\mathrm{low}}_i < y^{\mathrm{high}}_i$ such that $f^*_i$ is non-decreasing on $(-\infty, y^{\mathrm{low}}_i)$, bounded below by $c_i > 0$ on $(y^{\mathrm{low}}_i, y^{\mathrm{high}}_i)$, and non-increasing on $(y^{\mathrm{high}}_i, \infty)$.
We will further require the following assumptions on the choice of prior, which are Assumptions (G1) and (S1)-(S3) of [60]. These assumptions are verified by standard choices of Gaussian base measure and inverse gamma prior on the standard deviation.
Condition B on $\Pi_\sigma$ can be relaxed: essentially we need the density $\pi_\sigma$ to behave like an inverse-Gamma density near 0 and to have tails bounded by some power of $1/\sigma$ near infinity; see for instance [38].

S5. Key results from literature
In this section, we document some of the key results from other works which are used in our contributions. Section S5.1 features the forgetfulness properties of [19], while Section S5.2 expands upon the presentation of efficiency given in Section 3.2.1, recalling the relevant framework of [47].

S5.1. Forgetting of the hidden Markov chain
The following results are from [19]. Recall that $q = \min_{ij} Q^*_{ij} > 0$ under Assumption 3. The first result quantifies the exponential forgetting of the hidden chain.
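Schematically, the forgetting bound of [19] states that the conditional chain mixes geometrically: for any two initial states $x, x'$ and any observation record,

```latex
\big\| P^*\!\left(X_k \in \cdot \mid Y_{0:n},\, X_0 = x\right)
     - P^*\!\left(X_k \in \cdot \mid Y_{0:n},\, X_0 = x'\right) \big\|_{TV}
\;\le\; \rho^{\,k},
\qquad 0 \le k \le n,
```

for a constant $\rho = \rho(q) < 1$ depending only on the minorisation constant $q$; it is this uniformity in the observations that is used throughout the supplement.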
The second of these results follows from applying Equation (20) of [19] to both the known-emission model and the submodels on the emissions. It in particular ensures that the score functions we consider are well-defined.
Lemmas S5.25 and S5.26 remain true when replacing Y by Y (M ) , the coarsened data. The choice of ρ is made independently of M as it only depends on the transition matrix Q * , which is the same in all histogram models as it is in the semiparametric model.
To show the model is LAN with tangent space $H$, we seek to identify the score functions $\Delta_{n,a,h}$. Note that the log-likelihood ratio $\Lambda(\theta_n(a,h), \theta_0)$ can be approximated by considering the score in the submodel $\Theta_{a,h} = \{\theta_t = (Q^* + ta, (f^*_r(1 + t h_r) : r \in [R])),\ |t| \le \epsilon\}$, which, for fixed $a, h$ and fixed $\epsilon > 0$ sufficiently small, is a one-dimensional submodel $\Theta_{a,h} \subset \Theta$. From [19], in which the authors consider score functions in parametric HMMs, we can expand the gradient of the log-likelihood; writing $a_{rR} = -\sum_s a_{rs}$ and $Q_{r,R} = 1 - \sum_s Q_{rs}$, the summands take the form
$\sum_{r,s} 1\{X_{i-1} = r, X_i = s\}\,\frac{a_{r,s}}{Q^*_{r,s}} + \sum_{s=1}^R 1\{X_i = s\}\,h_s(Y_i).$
(S5.43)
These formulae arise as an application of the results of Section 6.1 of [19] to $\Theta_{a,h}$.
We note that the contribution of the first term of the right-hand side of (S5.43) to the expression defined in (S5.42) is precisely the score function for estimation in the model $\Theta_{a,0}$, in which the emission densities are fixed and known. We see that this is equal to $a^T S_{Q^*}(Y_{-\infty:k})$, for $S_{Q^*}$ the score at $Q^*$ in the $R \times (R-1)$-dimensional parametric model with known emissions and unknown $Q$.
Definition S5.4 (Differentiability of parameter). Let $v_n(P_{n,\theta})$ be $\mathbb R^p$-valued parameters for some $p > 0$. We say the sequence $(v_n)$ is differentiable if $\sqrt n\,(v_n(P_{n,\theta_n(a,h)}) - v_n(P_{n,0})) \to \dot v(a,h)$ for all $(a,h) \in H$, for some continuous linear map $\dot v : H \to \mathbb R^p$.
In our case, the parameter of interest will be $v_n(P_{n,\theta_n(a,h)}) = Q$, so that $p = R(R-1)$. We then obtain $\dot v(a,h) = a \in \mathbb R^p$.
We finally recall the definition of a regular estimator.
Definition S5.5 (Regular estimators). A sequence of maps $T_n = T_n(Y_1, \dots, Y_n) \in \mathbb R^p$ is said to be locally regular for $v_n$ if, under $P_{n,\theta_n(a,h)}$, we have $\sqrt n\,(T_n - v_n(P_{n,\theta_n(a,h)})) \Rightarrow Z$ as $n \to \infty$ for every $(a,h) \in H$, where $Z$ is a Borel measurable tight random element in $\mathbb R^p$ which does not depend on $(a,h) \in H$.
Following what precedes, we are ready to state the convolution theorem, also described in [47] and originally proven in [59]. As before, we slightly simplify the statement to make clear the application to our context. Here $\dot v^T b$ is the unique element of $H$ such that $(\dot v^T(b))(a,h) = \langle \dot v^T b, (a,h)\rangle_H$, where $\dot v^T : (\mathbb R^p)^\dagger \to H^\dagger$ is the adjoint of $\dot v$, and the superscript $\dagger$ indicates the dual space.
The convolution theorem essentially states that the limiting law of a regular estimator is the convolution of the laws of a Gaussian (in this case $Z_0$) and some independent random variable (in this case $W$), so that in particular the variance is lower bounded by that of $Z_0$. It also describes the variance of $Z_0$ in terms of the tangent space, its dual space, and the parameter derivative $\dot v$. In our case, we recall that $\dot v(a,h) = a$ and $p = R(R-1)$.
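In schematic form, the conclusion of the convolution theorem as used here is:

```latex
\sqrt{n}\,\big(T_n - v_n(P_{n,\theta_n(a,h)})\big)
\;\Rightarrow\; Z = Z_0 + W,
\qquad Z_0 \sim \mathcal N(0, \Sigma_*), \quad Z_0 \perp\!\!\!\perp W,
```

so that $\mathrm{Cov}(Z) = \Sigma_* + \mathrm{Cov}(W) \ge \Sigma_*$, with $\Sigma_*$ the efficient covariance determined by the tangent space and $\dot v$.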
We note that we identify the dual of $\mathbb R^p$ with itself through the bijection $b \leftrightarrow \langle b, \cdot\rangle_{\mathbb R^p}$, so $\dot v^T(b)$ is shorthand for $\dot v^T(\langle b, \cdot\rangle_{\mathbb R^p})$. The adjoint $\dot v^T$ of $\dot v$ then satisfies the defining adjoint identity for $(a,h) \in H$ and $b \in \mathbb R^p$.

S6.1. MCMC Algorithm for Π 1
In this section we describe the algorithm (Algorithm SA1) used for the simulation study in Section 5. $S_R$ denotes the symmetric group of order $R$, and we use subscripts to refer to algorithm iterations. We use the RHmm package [56] for the forward-backward algorithm in the simulation of the hidden states and computation of the log likelihood. We also use the gtools package [62] for generation of permutations and Dirichlet draws.
Algorithm SA1: Algorithm for MCMC draws targeting posterior of Q.
Input : Binned data $Y^{(M)} \in [\kappa_M]^n$, number of hidden states $R$, prior parameters $\gamma, \beta$, iterations $I$, initial hidden states value $X_{init}$
Output: List of $I$ draws of $Q$
Initialise $X_1 = X_{init}$
for $i = 1, \dots, I + b$ do
    Draw transition matrix $Q_i \sim P(Q \mid X_i)$, histogram weights $\omega_i \sim P(\omega \mid X_i, Y)$, hidden states $X_{i+1} \sim P(X \mid Y, Q_i, \omega_i)$
end
Compute $i_{MAP} = \arg\max_i \log \Pi_1(Q_i, \omega_i \mid Y)$
for $i = 1, \dots, I$ do
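As an illustration of the sampling steps inside the loop above, the following sketch implements the two conjugate Dirichlet updates and a forward-filtering backward-sampling draw of the hidden states, assuming a stationary initial distribution. This is a generic Python reimplementation (the paper's own code uses the RHmm and gtools R packages), and all function names here are our own.

```python
import numpy as np

def ffbs(y, Q, omega, rng):
    # Forward-filtering backward-sampling draw of X | Y, Q, omega
    # for binned data y with values in 0..M-1; omega is R x M.
    n, R = len(y), Q.shape[0]
    # Stationary distribution of Q used as the initial law (assumption).
    evals, evecs = np.linalg.eig(Q.T)
    pi0 = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    pi0 = pi0 / pi0.sum()
    # Forward pass: alpha[t, r] proportional to P(X_t = r | y_{1:t}).
    alpha = np.zeros((n, R))
    alpha[0] = pi0 * omega[:, y[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ Q) * omega[:, y[t]]
        alpha[t] /= alpha[t].sum()
    # Backward sampling: X_n from the final filter, then X_t | X_{t+1}.
    x = np.zeros(n, dtype=int)
    x[-1] = rng.choice(R, p=alpha[-1])
    for t in range(n - 2, -1, -1):
        w = alpha[t] * Q[:, x[t + 1]]
        x[t] = rng.choice(R, p=w / w.sum())
    return x

def gibbs_sweep(y, x, R, M, gamma, beta, rng):
    # Conjugate updates: rows of Q and histogram weights omega are
    # Dirichlet given the latent path x and the binned data y.
    trans = np.zeros((R, R))
    np.add.at(trans, (x[:-1], x[1:]), 1.0)
    Q = np.vstack([rng.dirichlet(gamma + trans[r]) for r in range(R)])
    counts = np.zeros((R, M))
    np.add.at(counts, (x, y), 1.0)
    omega = np.vstack([rng.dirichlet(beta + counts[r]) for r in range(R)])
    return Q, omega

# Small usage example with R = 2 states and M = 3 bins.
rng = np.random.default_rng(0)
y = np.array([0, 0, 1, 2, 2, 2, 0, 1])
Q0 = np.array([[0.8, 0.2], [0.3, 0.7]])
omega0 = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
x = ffbs(y, Q0, omega0, rng)
Q_new, omega_new = gibbs_sweep(y, x, 2, 3, 1.0, 1.0, rng)
```

The forward pass costs $O(nR^2)$ per sweep, consistent with the complexity discussion for the algorithms below.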

S6.2. MCMC Algorithm for Π 2
In Algorithm SA2, we detail the MCMC procedure we implement in R. Once again, we make use of the forward-backward algorithm to simulate latent states, allowing us to exploit the simple structure of the full likelihood. One key difference is that our algorithm requires allocation to a bivariate latent space (with total number of states $RS_{max}$) involving both the HMM hidden state and the Dirichlet mixture component. Since the forward-backward algorithm is $O(NK^2)$ with $K$ the number of states and $N$ the number of samples, and since we take $S_{max} = \sqrt N$, the overall implementation is $O(N^2)$.
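The complexity count can be made explicit: with $K = R\,S_{max}$ states in the bivariate latent space and $S_{max} = \sqrt N$,

```latex
N K^2 \;=\; N\,(R\,S_{max})^2 \;=\; N \cdot R^2 \cdot N \;=\; R^2 N^2 \;=\; O(N^2),
```

since the number of hidden states $R$ is fixed.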

S6.3. MCMC Algorithm for Π
The MCMC algorithm for $\Pi$ is obtained by modifying Algorithm SA2 to remove the interior loop over $c = 1, \dots, C$ and to sample a transition matrix $Q$ from its conditional distribution given the latent states, as in Algorithm SA1, at the beginning of each outer loop over $i$. As with Algorithm SA2, the complexity is $O(N^2)$, and so using it to target the marginal posterior on $Q$ is much slower than using $\Pi_1$, as discussed in Section 5.