Online Expectation Maximization based algorithms for inference in hidden Markov models

The Expectation Maximization (EM) algorithm is a versatile tool for model parameter estimation in latent data models. When processing large data sets or data streams, however, EM becomes intractable since it requires the whole data set to be available at each iteration of the algorithm. In this contribution, a new generic online EM algorithm for model parameter inference in general hidden Markov models is proposed. This new algorithm updates the parameter estimate after a block of observations is processed (online). The convergence of this new algorithm is established, and the rate of convergence is studied, showing the impact of the block size. An averaging procedure is also proposed to improve the rate of convergence. Finally, practical illustrations are presented to highlight the performance of these algorithms in comparison with other online maximum likelihood procedures.


Introduction
A hidden Markov model (HMM) is a stochastic process {X_k, Y_k}_{k≥0} in X × Y, where the state sequence {X_k}_{k≥0} is a Markov chain and where the observations {Y_k}_{k≥0} are independent conditionally on {X_k}_{k≥0}. Moreover, the conditional distribution of Y_k given the state sequence depends only on X_k. The sequence {X_k}_{k≥0} being unobservable, any statistical inference task is carried out using the observations {Y_k}_{k≥0}. HMMs are applied in a wide variety of disciplines such as financial econometrics [24], biology [7] or speech recognition [18]. In this paper, a new online EM algorithm, called Block Online EM (BOEM), is derived in Section 2 together with an averaged version. Section 3 is devoted to practical applications: the BOEM algorithm is used to perform parameter inference in HMMs where the forward smoothing recursions are available explicitly. In the case of finite state-space HMMs, the BOEM algorithm is compared to a gradient-type recursive maximum likelihood procedure and to the online EM algorithm of [3]. The convergence of the BOEM algorithm is addressed in Section 4. The BOEM algorithm is seen as a perturbation of a deterministic limiting EM algorithm, which is shown to converge to the stationary points of the limiting relative entropy (to which the true parameter belongs if the model is well specified). The perturbation is shown to vanish (in some sense) as the number of observations increases, thus implying that the BOEM algorithm inherits the asymptotic behavior of the limiting EM algorithm. Finally, in Section 5, we study the rate of convergence of the BOEM algorithm as a function of the block-size sequence. We prove that the averaged BOEM algorithm is rate-optimal when the block-size sequence grows polynomially. All the proofs are postponed to Section 6; supplementary proofs and comments are provided in [20].

The Block Online EM algorithms
Denote by {X_k, Y_k}_{k≥0} the canonical coordinate process on the measurable space ((X × Y)^N, (X ⊗ Y)^{⊗N}). For any θ ∈ Θ and any probability distribution χ on (X, X), let P^χ_θ be the probability distribution on ((X × Y)^N, (X ⊗ Y)^{⊗N}) such that {X_k, Y_k}_{k≥0} is a Markov chain with initial distribution
\[
P^{\chi}_{\theta}\big( (X_0, Y_0) \in C \big) \;=\; \int 1_C(x, y)\, g_{\theta}(x, y)\, \mu(\mathrm{d}y)\, \chi(\mathrm{d}x)
\]
and transition kernel K_θ. The expectation with respect to P^χ_θ is denoted by E^χ_θ. Throughout this paper, it is assumed that the Markov transition kernel K_θ has a unique invariant distribution π_θ (see below for further comments). For the stationary Markov chain with initial distribution π_θ, we write P_θ and E_θ instead of P^{π_θ}_θ and E^{π_θ}_θ. Note also that the stationary Markov chain {X_k, Y_k}_{k≥0} can be extended to a two-sided Markov chain {X_k, Y_k}_{k∈Z}.
It is assumed that, for any θ ∈ Θ and any x ∈ X, M_θ(x, ·) has a density m_θ(x, ·) with respect to a finite measure λ on (X, X). Define the complete data likelihood by
\[
p_{\theta}(x_{0:T}, y_{0:T}) \;\stackrel{\mathrm{def}}{=}\; g_{\theta}(x_0, y_0) \prod_{u=1}^{T} m_{\theta}(x_{u-1}, x_u)\, g_{\theta}(x_u, y_u),
\]
where, for any u ≤ s, we use the shorthand notation x_{u:s} for the sequence (x_u, …, x_s). For any probability distribution χ on (X, X), any θ ∈ Θ and any s ≤ u ≤ v ≤ t, the conditional distribution of X_{u:v} given Y_{s:t} when X_s ∼ χ, denoted by φ^χ_{θ,u:v|s:t}, is the so-called fixed-interval smoothing distribution. Given an initial distribution χ on (X, X) and T + 1 observations Y_{0:T}, the EM algorithm maximizes the so-called incomplete data log-likelihood θ ↦ ℓ^χ_{θ,T}(Y_{0:T}). The central concept of the EM algorithm is that the intermediate quantity defined by
\[
\theta \;\mapsto\; \mathbb{E}^{\chi}_{\theta'}\big[ \log p_{\theta}(X_{0:T}, Y_{0:T}) \,\big|\, Y_{0:T} \big]
\]
may be used as a surrogate for ℓ^χ_{θ,T}(Y_{0:T}) in the maximization procedure. Therefore, the EM algorithm iteratively builds a sequence {θ_n}_{n≥0} of parameter estimates following two steps: i) compute the intermediate quantity at the current estimate θ_{n−1} (E-step); ii) maximize it with respect to θ to obtain θ_n (M-step). In the sequel, it is assumed that there exist functions S, φ and ψ such that (see A1 for a more precise definition), for any (x, x′) ∈ X² and any y ∈ Y,
\[
\log\big\{ m_{\theta}(x, x')\, g_{\theta}(x', y) \big\} \;=\; \phi(\theta) + \big\langle S(x, x', y), \psi(\theta) \big\rangle .
\]
Therefore, the complete data likelihood belongs to the curved exponential family and step i) of the EM algorithm amounts to computing the conditional expectation, given Y_{0:T}, of the sufficient statistics S(X_{u−1}, X_u, Y_u), u = 1, …, T, where ⟨·, ·⟩ is the scalar product on R^d (and where the contribution of g_θ(x_0, Y_0) is omitted for brevity). It is also assumed that for any s ∈ S, where S is an appropriately defined set, the function θ ↦ φ(θ) + ⟨s, ψ(θ)⟩ has a unique maximum denoted by θ̄(s). Hence, a step of the EM algorithm writes θ_n = θ̄(s_n), where s_n is the conditional expectation of the sufficient statistics computed under θ_{n−1}.
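For reference, a schematic way to write these two steps under the exponential-family representation above is the following (our summary; in particular, the normalization of the E-step statistic by the number of observations is an assumption of this sketch):
\[
\text{E-step:}\quad s_n \;=\; \frac{1}{T}\, \mathbb{E}^{\chi}_{\theta_{n-1}}\Big[ \sum_{u=1}^{T} S(X_{u-1}, X_u, Y_u) \,\Big|\, Y_{0:T} \Big],
\qquad
\text{M-step:}\quad \theta_n \;=\; \bar\theta(s_n) \;=\; \operatorname*{arg\,max}_{\theta \in \Theta}\ \big\{ \phi(\theta) + \langle s_n, \psi(\theta) \rangle \big\} .
\]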

The Block Online EM (BOEM) algorithms
We now derive an online version of the EM algorithm. Define S^{χ,T}_τ(θ, Y) as the intermediate quantity of the EM algorithm computed with the observations Y_{T:T+τ} (see (2)). Let {τ_n}_{n≥1} be a sequence of positive integers such that lim_{n→∞} τ_n = +∞ and set T_0 def= 0 and T_n def= ∑_{k=1}^{n} τ_k; τ_n denotes the length of the n-th block. Given an initial value θ_0 ∈ Θ, the BOEM algorithm defines a sequence {θ_n}_{n≥1} by
\[
\theta_{n+1} \;=\; \bar\theta\Big( S^{\chi_n, T_n}_{\tau_{n+1}}(\theta_n, \mathbf{Y}) \Big), \qquad n \ge 0,
\]
where {χ_n}_{n≥0} is a family of probability distributions on (X, X). By analogy with the regression problem, an estimator with reduced variance can be obtained by averaging and weighting the successive estimates (see [19, 28] for a discussion of averaging procedures). Writing S_n def= S^{χ_n,T_n}_{τ_{n+1}}(θ_n, Y) for the statistic used in the update above, define Σ_0 def= 0 and, for n ≥ 1,
\[
\Sigma_n \;\stackrel{\mathrm{def}}{=}\; T_n^{-1} \sum_{j=0}^{n-1} \tau_{j+1}\, S_j .
\]
Note that this quantity can be computed iteratively and does not require storing the past statistics {S_j}_{j=0}^{n−1}. Given an initial value θ_0, the averaged BOEM algorithm defines a sequence {θ̃_n}_{n≥1} by θ̃_n = θ̄(Σ_n). The algorithm above relies on the assumption that S_n can be computed in closed form. In the HMM case, this property is satisfied only for linear Gaussian models or when the state-space is finite. In all other cases, S_n cannot be computed explicitly and will be replaced by a Monte Carlo approximation. Several Monte Carlo approximations can be used. The convergence properties of the Monte Carlo BOEM algorithms rely on the assumption that the Monte Carlo error can be controlled on each block. [21] provides examples of applications where Sequential Monte Carlo algorithms are used. Hereafter, we use the same notation {θ_n}_{n≥0} and {θ̃_n}_{n≥0} for the original BOEM algorithm and for its Monte Carlo approximation.
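To illustrate the mechanics of the BOEM recursion and of the weighted averaging of the block statistics, here is a minimal sketch in Python. It uses a deliberately trivial "model" (i.i.d. Gaussian observations with unknown mean, for which the block statistic is the block average and θ̄ is the identity); the function names and this toy model are ours and are not part of the paper.

```python
import numpy as np

def block_statistic(theta, block):
    """E-step on one block: expected sufficient statistic given the block.
    Toy model (i.i.d. Gaussian with unknown mean): the statistic is the block
    average and does not depend on theta.  In an HMM this would be a smoothed
    expectation computed by forward-backward or by SMC."""
    return np.mean(block)

def theta_bar(s):
    """M-step map: maximizer of phi(theta) + <s, psi(theta)>.
    For the toy model this is simply the identity."""
    return s

def boem(y, c=10, a=1.5, theta0=0.0):
    """Block Online EM with block sizes tau_n ~ c * n**a and
    weighted (Polyak-Ruppert-type) averaging of the block statistics."""
    theta, theta_avg = theta0, theta0
    Sigma = 0.0          # running weighted average of past block statistics
    T = 0                # number of observations processed so far
    n, start = 0, 0
    while start < len(y):
        n += 1
        tau = int(c * n ** a)                    # length of the n-th block
        block = y[start:start + tau]
        S = block_statistic(theta, block)        # E-step on the block
        theta = theta_bar(S)                     # BOEM update
        T += len(block)
        Sigma += (len(block) / T) * (S - Sigma)  # iterative weighted average
        theta_avg = theta_bar(Sigma)             # averaged BOEM update
        start += tau
    return theta, theta_avg

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obs = rng.normal(loc=0.9, scale=1.0, size=50_000)
    print(boem(obs))     # both estimates should be close to 0.9
```

The running average `Sigma` reproduces the weighted mean Σ_n above without storing past statistics, which is the point of the iterative formulation.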
Our algorithms update the parameter after processing a block of observations. Nevertheless, the intermediate quantity S_n can be either computed exactly or approximated in such a way that the observations are processed online. In this case, the intermediate quantity (or its Monte Carlo approximation) is updated online for each observation. Such an algorithm is described in [3, Section 2.2] and [9, Proposition 2.1] and can be applied either to finite state-space HMMs or to linear Gaussian models. [9] proposed a Sequential Monte Carlo approximation to compute the intermediate quantity online for more complex models (see also [21]).
The classical theory of maximum likelihood estimation often relies on the assumption that the "true" distribution of the observations belongs to the specified parametric family of distributions. In many cases, it is doubtful that this assumption is satisfied. It is therefore natural to investigate the convergence of the BOEM algorithms and to identify the possible limit for misspecified models, i.e., when the observations {Y_k}_{k≥0} come from an ergodic process which is not necessarily an HMM.

Application to inverse problems in Hidden Markov Models
In Section 3.1, the performance of the BOEM algorithm and of its averaged version is illustrated in a linear Gaussian model. In Section 3.2, the BOEM algorithm is compared to online maximum likelihood procedures in the case of finite state-space HMMs. Applications of the Monte Carlo BOEM algorithm to more complex models with Sequential Monte Carlo methods can be found in [21].
Linear Gaussian model

Both the BOEM algorithm and its averaged version converge to the true value φ = 0.9; the averaging procedure clearly reduces the variance of the estimation. We now discuss the role of {τ_n}_{n≥0}. Figure 2 displays the empirical variance when estimating φ, computed over 100 independent Monte Carlo runs, for different numbers of observations and for both the BOEM algorithm and its averaged version. We consider four polynomial rates τ_n ∼ n^b, b ∈ {1.2, 1.8, 2, 2.5}. Figure 2a shows that the choice of {τ_n}_{n≥0} has a great impact on the empirical variance of the (non-averaged) BOEM path {θ_n}_{n≥0}. To reduce this variability, a solution could consist in increasing the block sizes τ_n at a faster rate. The influence of the block-size sequence τ_n is greatly reduced with the averaging procedure, as shown in Figure 2b. We will show in Section 5 that averaging indeed improves the rate of convergence of the BOEM algorithm.

Finite state-space HMM
We consider a Gaussian mixture process with Markov dependence of the form Y_t = X_t + V_t, where {X_t}_{t≥0} is a Markov chain taking values in X def= {x_1, …, x_d}, with initial distribution χ and a d × d transition matrix m, and where {V_t}_{t≥0} are i.i.d. N(0, v) random variables, independent from {X_t}_{t≥0}; i.e., for all (x, y) ∈ X × Y,
\[
g_{\theta}(x, y) \;=\; (2\pi v)^{-1/2} \exp\big\{ -(y - x)^2 / (2v) \big\} .
\]
In the experiments below, the initial distribution is chosen as the uniform distribution on X. The statistics used to estimate θ are, for all (i, j) ∈ {1, …, d}², the smoothed occupation and transition counts together with smoothed moments of the observations. The online computation of these intermediate quantities is given in [3, Section 2.2].
At the end of each block, the new estimate is computed from these statistics, for all (i, j) ∈ {1, …, d}² (the dependence on Y, θ, χ, T and τ is dropped from the notation). We first compare the averaged BOEM algorithm to the online EM (OEM) procedure of [3] combined with Polyak-Ruppert averaging (see [28]). Note that the convergence of the OEM algorithm is still an open problem. In this case, we want to estimate the variance v and the states {x_1, …, x_d}. All the runs are started from v = 2 and from the initial states {−1, 0, 0.5, 2, 3, 4}. The algorithm in [3] follows a stochastic approximation update and depends on a step-size sequence {γ_n}_{n≥0}. It is expected that the rate of convergence in L_2 after n observations is γ_n^{1/2} (and n^{−1/2} for its averaged version); this assertion relies on classical results for stochastic approximation. We prove in Section 5 that the rate of convergence of the BOEM algorithm is n^{−b/(2(b+1))} (and n^{−1/2} for its averaged version) when τ_n ∝ n^b. Therefore, we set τ_n = n^{1.1} and γ_n = n^{−0.53}. Figure 3 displays the empirical median and the first and last quartiles for the estimation of v with both algorithms and their averaged versions as a function of the number of observations. These estimates are obtained over 100 independent Monte Carlo runs. Both the BOEM and the OEM algorithms converge to the true value of v and the averaged versions reduce the variability of the estimation. Figure 4 shows the similar behavior of both averaged algorithms for the estimation of x_1 in the same experiment. Some supplementary graphs on the estimation of the states can be found in [20, Section 4].
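To make the finite state-space case concrete, the following sketch computes the smoothed sufficient statistics of the Gaussian model above on one block by a standard (scaled) forward-backward pass and then applies an EM update of (m, x, v). This is our illustration, not the paper's implementation: the variable names, the scaling of the recursions and the toy simulation in the `__main__` block are assumptions.

```python
import numpy as np

def block_em_update(y, m, x, v, chi):
    """One block EM update for the finite-state HMM Y_t = x_{X_t} + V_t.
    y: observations of the block, m: d x d transition matrix,
    x: state values, v: noise variance, chi: initial distribution."""
    d, tau = len(x), len(y)
    # Emission densities up to the common (2*pi*v)^{-1/2} factor (it cancels).
    g = np.exp(-0.5 * (y[None, :] - x[:, None]) ** 2 / v)          # d x tau

    alpha = np.zeros((d, tau))                                      # scaled forward
    a = chi * g[:, 0]
    alpha[:, 0] = a / a.sum()
    for t in range(1, tau):
        a = (m.T @ alpha[:, t - 1]) * g[:, t]
        alpha[:, t] = a / a.sum()

    beta = np.ones((d, tau))                                        # scaled backward
    for t in range(tau - 2, -1, -1):
        b = m @ (g[:, t + 1] * beta[:, t + 1])
        beta[:, t] = b / b.sum()

    gamma = alpha * beta
    gamma /= gamma.sum(axis=0, keepdims=True)                       # P(X_t = i | y)
    trans = np.zeros((d, d))                                        # E[# i -> j | y]
    for t in range(tau - 1):
        xi = (alpha[:, t][:, None] * m) * (g[:, t + 1] * beta[:, t + 1])[None, :]
        trans += xi / xi.sum()

    occ = gamma.sum(axis=1)                                         # E[# visits to i | y]
    s1 = gamma @ y                                                  # E[sum_t Y_t 1{X_t=i} | y]
    s2 = (gamma * (y[None, :] ** 2)).sum()                          # E[sum_t Y_t^2 | y]
    m_new = trans / trans.sum(axis=1, keepdims=True)                # M-step
    x_new = s1 / occ
    v_new = (s2 - 2 * np.dot(x_new, s1) + np.dot(occ, x_new ** 2)) / tau
    return m_new, x_new, v_new

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, m_true = 2, np.array([[0.9, 0.1], [0.2, 0.8]])
    x_true, v_true = np.array([-1.0, 1.0]), 0.5
    states = [0]
    for _ in range(4999):                                           # simulate one block
        states.append(rng.choice(d, p=m_true[states[-1]]))
    y = x_true[np.array(states)] + rng.normal(scale=np.sqrt(v_true), size=5000)
    print(block_em_update(y, np.full((d, d), 1 / d),
                          np.array([-2.0, 2.0]), 2.0, np.full(d, 1 / d)))
```

In the BOEM algorithms of this section, such an update would be applied once per block, with block lengths growing as τ_n = n^{1.1}, and the averaged version would combine the block statistics as described in Section 2.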
We now compare the averaged BOEM algorithm to a recursive maximum likelihood (RML) procedure (see [23, 30]) combined with Polyak-Ruppert averaging (see [28]). We want to estimate the variance v and the transition matrix m. All the runs are started from v = 2 and from a matrix m with each entry equal to 1/d. The RML algorithm follows a stochastic approximation update and depends on a step-size sequence {γ_n}_{n≥0} which is chosen in the same way as above. Therefore, for a fair comparison, the RML algorithm (resp. the BOEM algorithm) is run with γ_n = n^{−0.53} (resp. τ_n = n^{1.1}). Figure 5 displays the empirical median and the first and last quartiles of the estimation of m(1, 1) as a function of the number of observations over 100 independent Monte Carlo runs. For both algorithms, the bias and the variance of the estimation decrease as n increases. Nevertheless, the bias and/or the variance of the averaged BOEM algorithm decrease faster than those of the averaged RML algorithm (similar graphs have been obtained for the estimation of the other entries of the matrix m and for the estimation of v; see [20, Section 4]). As a conclusion, it is advocated to use the averaged BOEM algorithm instead of the averaged RML algorithm.

Convergence of the Block Online EM algorithms

A1 (c) There exists a continuous function θ̄ : S → Θ s.t. for any s ∈ S,
\[
\bar\theta(s) \;=\; \operatorname*{arg\,max}_{\theta \in \Theta}\ \big\{ \phi(\theta) + \langle s, \psi(\theta) \rangle \big\} .
\]

A2 There exist σ_− and σ_+ s.t. for any (x, x′) ∈ X² and any θ ∈ Θ, 0 < σ_− ≤ m_θ(x, x′) ≤ σ_+.

A2, often referred to as the strong mixing condition, is commonly used to prove the forgetting property of the initial condition of the filter, see e.g. [10, 11]. This assumption holds, for example, if X is finite or for linear state-spaces with truncated Gaussian state and measurement noises. More generally, this condition holds when X is compact. Note in addition that, by [25, Theorem 16.0.2], A2 implies that the Markov kernel M_θ has a unique invariant distribution, which guarantees the existence of the unique invariant distribution π_θ for K_θ.
We now introduce assumptions on the observation process Y def= {Y_k}_{k∈Z}, which is defined on some probability space (Ω, F, P). We stress that this process is not necessarily the observation of an HMM. Let F^Y_k def= σ(Y_u, u ≤ k) and G^Y_k def= σ(Y_u, u ≥ k) be the σ-fields associated to Y. We also define the β-mixing coefficients of Y by (see [8])
\[
\beta(n) \;\stackrel{\mathrm{def}}{=}\; \sup_{k \ge 0}\ \mathbb{E}\Big[ \sup_{B \in \mathcal{G}^Y_{k+n}} \big| \mathbb{P}(B \mid \mathcal{F}^Y_k) - \mathbb{P}(B) \big| \Big] .
\]
A5 There exist c > 0 and a > 1 such that for all n ≥ 1, τ_n = c n^a.
For p > 0 and Z a random variable measurable w.r.t. the σ-algebra σ(Y_n, n ∈ Z), we write ‖Z‖_p def= E[|Z|^p]^{1/p}.

A6-(p) There exists b ≥ (a + 1)/(2a) (where a is defined in A5) such that, for any n ≥ 0, the Monte Carlo approximation of S_n, defined by (6), satisfies an L_p error bound governed by b (one possible form is sketched below).
A6 gives an L_p control of the Monte Carlo error on each block. In [15, Theorem 1], such bounds are given for Sequential Monte Carlo algorithms. Practical conditions ensuring A6 are given in [21] in the case of Sequential Monte Carlo methods.
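The displayed bound of A6-(p) is not reproduced above. One natural form it could take, consistent with the constraint b ≥ (a + 1)/(2a) and with the T_n^{−1/2} rate of Theorem 5.2, is (this is an assumption of ours, not a quotation of the paper)
\[
\mathbb{E}^{1/p}\Big[ \big| \bar S_n - S_n \big|^{p} \Big] \;\le\; \frac{C}{\tau_{n+1}^{\,b}}, \qquad n \ge 0,
\]
writing $\bar S_n$ for the Monte Carlo approximation of S_n. With τ_n = c n^a, such a bound makes the per-block Monte Carlo error of order n^{−ab} ≤ n^{−(a+1)/2}, i.e. of the same order as T_n^{−1/2}, so that the Monte Carlo noise does not degrade the rate of the averaged estimator.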

The limiting EM algorithm
In the sequel, M(X) denotes the set of all probability distributions on (X, X ).
iii) Assume in addition that A6-(p) holds. Then, for any p′ ∈ (2, p), there exists a constant C s.t. an analogous bound holds for any n ≥ 1 with S_n replaced by its Monte Carlo approximation defined by (6).
Theorem 4.1 allows us to introduce the limiting EM algorithm, defined as the deterministic iterative algorithm θ̄_n = R(θ̄_{n−1}), where R def= θ̄ ∘ S and S(θ) denotes the limiting value of the block sufficient statistics under θ (see Theorem 4.1). The limiting EM can be seen as an EM algorithm applied as if the whole trajectory Y were observed instead of Y_{0:T}. For this limiting EM, the so-called sufficient statistics depend on the observations only through the limiting mean S(θ). The stationary points of the limiting EM are defined as L def= {θ ∈ Θ : R(θ) = θ}. We show that there exists a Lyapunov function W w.r.t. the map R and the set L, i.e., a continuous function W satisfying the two conditions recalled below. For such a function, the sequence {W(θ̄_k)}_{k≥0} is nondecreasing and {θ̄_k}_{k≥0} converges to L. Define also, for any m ≥ 0, θ ∈ Θ and any probability distribution χ on (X, X), the conditional density p^χ_θ(Y_1 | Y_{−m:0}) of one observation given the m previous ones (with X_{−m} ∼ χ). By [13, Lemma 2 and Proposition 1], under A1–4, for any θ ∈ Θ, there exists a random variable log p_θ(Y_1 | Y_{−∞:0}) such that, for any probability distribution χ on (X, X), T^{−1} ℓ^χ_{θ,T}(Y) converges P-a.s. to ℓ(θ) def= E[log p_θ(Y_1 | Y_{−∞:0})], where ℓ^χ_{θ,T}(Y) is the log-likelihood defined by (3). The function θ ↦ ℓ(θ) may be interpreted as the limiting normalized log-likelihood. We consider the function W given, for all θ ∈ Θ, by this limiting log-likelihood. To identify the stationary points of the limiting EM algorithm with the stationary points of ℓ, we introduce an additional assumption (A7), which controls the gradient ∇_θ log m_θ(x, x′) + ∇_θ log g_θ(x′, y).
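For reference, the two conditions on W alluded to above are, in their usual form (our phrasing; the paper's exact statement is not reproduced here):
\[
\text{(i)}\quad \mathrm{W}(\mathrm{R}(\theta)) \;\ge\; \mathrm{W}(\theta) \ \text{ for all } \theta \in \Theta,
\qquad
\text{(ii)}\quad \mathrm{W}(\mathrm{R}(\theta)) \;=\; \mathrm{W}(\theta) \iff \theta \in \mathcal{L} .
\]
Condition (i) makes {W(θ̄_k)}_{k≥0} nondecreasing along the limiting EM iterates, and condition (ii) forces the accumulation points of the sequence into L.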
Remark 4.3. In the case where {Y_k}_{k≥0} is the observation process of the stationary HMM {(X_k, Y_k)}_{k≥0} parameterized by θ* ∈ Θ, we can build a two-sided stationary extension of this process to obtain a sequence of observations {Y_k}_{k∈Z}. Following [13, Proposition 3], the quantity ℓ(θ) can be written as the limit, as m → ∞, of E[log p_θ(Y_1 | Y_{−m:0})], where p_θ(Y_1 | Y_{−m:0}) is the conditional distribution under the stationary distribution. Since E[log p_{θ*}(Y_1 | Y_{−m:0}) − log p_θ(Y_1 | Y_{−m:0})] is the Kullback-Leibler divergence between p_{θ*}(Y_1 | Y_{−m:0}) and p_θ(Y_1 | Y_{−m:0}), we have, for any θ ∈ Θ, ℓ(θ*) − ℓ(θ) ≥ 0 and θ* is a maximizer of θ ↦ ℓ(θ). If in addition θ* lies in the interior of Θ, then θ* ∈ L.
The following theorem gives sufficient conditions for the convergence of the limiting EM algorithm and of the Monte Carlo BOEM algorithm to the set L. Theorem 4.4 is a direct application of Proposition A.1 for the limiting EM algorithm; the proof for the Monte Carlo BOEM algorithm is detailed in Section 6.3. By Sard's theorem, if W is at least d_θ times continuously differentiable (where Θ ⊂ R^{d_θ}), then W(L) has Lebesgue measure 0 and hence an empty interior.

Rate of convergence of the Block Online EM algorithms
We address the rate of convergence of the Monte Carlo BOEM algorithms to a point θ* ∈ L. It is assumed that: A8 (a) S and θ̄ are twice continuously differentiable on Θ and S, respectively.
Hereafter, for any sequence of random variables {Z_n}_{n≥0}, we use a standard O_{L_p} notation for rates of convergence in L_p. In (19), the rate is a function of the number of updates (i.e. the number of iterations of the algorithm). Theorem 5.2 shows that the averaging procedure reduces the influence of the block-size schedule: the rate of convergence is proportional to T_n^{−1/2}, i.e., to the inverse of the square root of the total number of observations up to iteration n.

Theorem 5.2. Let p > 2. Assume that A2, A3-(p), A4–5, A6-(p) and A8 hold. Then, for any p′ ∈ (2, p), the averaged estimates converge to θ* at the rate T_n^{−1/2} in L_{p′}.

Theorems 5.1 and 5.2 give the rates of convergence as a function of the number of updates, but they can also be studied as a function of the number of observations. Let {θ^int_k}_{k≥0} (resp. {θ̃^int_k}_{k≥0}) be defined, for any k ≥ 0, by θ^int_k = θ_n (resp. θ̃^int_k = θ̃_n), where n is the unique integer such that k ∈ [T_n + 1, T_{n+1}]. The sequences {θ^int_k}_{k≥0} and {θ̃^int_k}_{k≥0} are piecewise constant and their values are updated at times {T_n}_{n≥1}.
By Theorem 5.1, the rate of convergence of {θ^int_k}_{k≥0} is given (up to a multiplicative constant) by k^{−a/(2(a+1))}, where a is defined in A5. This rate is slower than k^{−1/2} and depends on the block-size sequence (through a). By contrast, by Theorem 5.2, the rate of convergence of {θ̃^int_k}_{k≥0} is given (up to a multiplicative constant) by k^{−1/2}, for any value of a. Therefore, this rate of convergence does not depend on the block-size sequence.
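A short calculation shows how the rate in the number of updates translates into the rate in the number of observations quoted here (a sketch under the block-size schedule of A5):
\[
\tau_n = c\, n^{a}
\;\Longrightarrow\;
T_n = \sum_{j=1}^{n} \tau_j \;\asymp\; \frac{c\, n^{a+1}}{a+1},
\qquad\text{hence}\qquad
n \;\asymp\; \Big( \tfrac{(a+1)\, T_n}{c} \Big)^{1/(a+1)} .
\]
An error of order n^{−a/2} after n updates (i.e. of order τ_n^{−1/2}) therefore corresponds to an error of order T_n^{−a/(2(a+1))}, that is k^{−a/(2(a+1))} after k observations, while the T_n^{−1/2} rate of the averaged estimator becomes k^{−1/2} for any value of a.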
Since Θ is compact, by A1, there exist constants C_1 and C_2 s.t. the supremum over θ ∈ Θ of this expression is bounded from above. Since χ is a probability distribution and λ is a finite measure, the continuity follows from the dominated convergence theorem.
Let us introduce the shorthand S_s(x, x′) def= S(x, x′, Y_s). Define the shift operator ϑ on Y^Z by (ϑy)_k = y_{k+1} for any k ∈ Z and, by induction, define the s-iterated shift operator by ϑ^{s+1} y = ϑ(ϑ^s y), with the convention that ϑ^0 is the identity operator. For a function h, define osc(h) def= sup_{z,z′} |h(z) − h(z′)|.

Proof of Theorem 4.1
The proof of Theorem 4.1 relies on auxiliary results about the forgetting properties of HMMs. Most of them are close to published results and their proofs are provided in the supplementary material [20, Section 3]. The main novelty is the forgetting property of the bivariate smoothing distribution.
By the Minkowski inequality combined with Lemmas A.6 and A.7, applied with q_n def= 2 v_n m_n, there exists a constant C such that the required bound holds.
Proof of Proposition 4.2 (ii). We prove that (27) is equal to zero. By definition of θ̄, R(θ) = θ and thus θ ∈ L. The converse implication is immediate from the definition of L.

Proof of Theorem 4.4
The proof of Theorem 4.4 relies on Proposition A.1 applied with T(θ) def= R(θ) and with θ_{n+1} = θ̄(S^{χ_n,T_n}_{τ_{n+1}}(θ_n, Y)). The key ingredient of this proof is the control of the L_p-mean error between the Monte Carlo Block Online EM algorithm and the limiting EM. This bound is derived in Theorem 4.1 and relies on preliminary lemmas given in Appendix A. The proof of (37) is close to the proof of [16, Proposition 11] and is postponed to the supplementary paper [20, Section 2.1].

Proof of Theorem 5.1
Define s* def= S(θ*) and consider the decomposition (28). Define {µ_n}_{n≥0} and {ρ_n}_{n≥0} s.t. µ_0 = 0 and ρ_0 = S_0 − s*. By Proposition 6.2, the first term in (28) is controlled at the stated rate; a Taylor expansion with integral remainder term gives the rate of convergence of the second term, where e_n is given by (30). This concludes the proof of Theorem 5.1, Eq. (19).
Proof. By A5 and A6-(p), the Monte Carlo error term satisfies the required bound; it is then sufficient to prove the corresponding bound for the exact statistics. Let p′ ∈ (2, p). In the sequel, C is a constant independent of n whose value may change upon each appearance. Let 1 ≤ m_n ≤ τ_{n+1} and set v_n def= τ_{n+1}/(2 m_n). By Lemma A.7 applied with q_k def= 2 v_k m_k, we obtain a decomposition into terms δ_k and ζ_k, where T_k is given by (41). We will prove below that there exists C s.t. the bounds (31) and (32) hold. We turn to the proof of (31). By the Berbee Lemma (see [29, Chapter 5]) and A4, there exist C ∈ [0, 1) and β ∈ (0, 1) s.t. for all k ≥ 1, there exists a random variable Y*_{T_k+m_k:T_{k+1}+m_k} on (Ω, F, P), independent from F^Y_{T_k}, with the same distribution as Y_{T_k+m_k:T_{k+1}+m_k}. Therefore, the Minkowski and Hölder inequalities (with a def= p/p′ and b^{−1} def= 1 − a^{−1}), combined with (33), A4, Lemma A.4 and A3-(p), yield (31).
We now prove (32). Upon noting that δ_k is F^Y_{T_{k+1}}-measurable and is a martingale increment, the Rosenthal inequality (see [17, Theorem 2.12, p. 23]) applies. By Lemma A.6 and (31), there exists C s.t. the corresponding moment bounds hold for any k ≥ 1; since 2/p′ < 1, convex inequalities yield the desired control. Hence, by (35), the term I is bounded. For any p′ ∈ (2, p), under A8, A^{−1} exists. By (29) and (36), the result now follows from Proposition 6. For any L ≥ 1, m ≥ 1 and any distribution χ on (X, X), define the quantities used in the statements below. We introduce the σ-algebra F̃_{T_n} generated by F^Y_{T_n}, given by (10), and H_{T_n}, where H_{T_n} is independent from Y (the σ-algebra H_{T_n} is generated by the random variables, independent from the observations Y, used to produce the Monte Carlo approximation of {S_{k−1}}_{k=1}^{n}). Hence, for any positive integer m and any B ∈ G^Y_{T_n+m}, since H_{T_n} is independent from B and from F^Y_{T_n}, P(B | F̃_{T_n}) = P(B | F^Y_{T_n}). Hence, the mixing coefficients defined in (11) are not modified by this enlargement. Note that θ_n is F̃_{T_n}-measurable and that the Monte Carlo approximation of S_n is F̃_{T_{n+1}}-measurable.
There exists a constant C s.t. for any distribution χ on (X, X), any m ≥ 1, any k, ℓ ≥ 0 and any Θ-valued random variables, the stated bound holds, where Δ_p def= (p − p′)/(p p′) and β is given by A4.

Figure 3: Estimation of v using the online EM and the BOEM algorithms (top) and their averaged versions (bottom). Each plot displays the empirical median (bold line) and the first and last quartiles (dotted lines) over 100 independent Monte Carlo runs with τ_n = n^{1.1} and γ_n = n^{−0.53}.

Figure 4: Estimation of x_1 using the averaged OEM and the averaged BOEM algorithms. Each plot displays the empirical median (bold line) and the first and last quartiles (dotted lines) over 100 independent Monte Carlo runs with τ_n = n^{1.1} and γ_n = n^{−0.53}. The first ten observations are omitted for better visibility.

Figure 5: Empirical median (bold line) and first and last quartiles (dotted lines) for the estimation of m(1, 1) using the averaged RML algorithm (right) and the averaged BOEM algorithm (left). The true value is m(1, 1) = 0.5 and the averaging procedure is started after 10000 observations. The first 10000 observations are not displayed for clarity.
β^{m_k/(pb)}. By the Minkowski and Jensen inequalities, the bound on the term I follows.