Robust Filtering and Propagation of Uncertainty in Hidden Markov Models

We consider the filtering of continuous-time finite-state hidden Markov models, where the rate and observation matrices depend on unknown time-dependent parameters, for which no prior or stochastic model is available. We quantify and analyze how the induced uncertainty may be propagated through time as we collect new observations, and used to simultaneously provide robust estimates of the hidden signal and to learn the unknown parameters, via techniques based on pathwise filtering and new results on the optimal control of rough differential equations.


Introduction
The filtering of hidden processes from noisy observations is an important and routine problem arising in many applications. The basic problem is to derive an optimal online estimator for an unobserved 'signal' process X evolving randomly in time, from observations of another process Y whose dynamics depend on the current state of the signal. Such stochastic filters have been derived and analyzed in various contexts, notably in the settings of linear underlying dynamics (the Kalman-Bucy filter [25,26]) and finite-state Markov processes (the Wonham filter [33]), but also in general nonlinear settings; see Bain and Crisan [5] or Crisan and Rozovskii [18] for a comprehensive exposition of nonlinear filtering.
Stochastic filters take two primary inputs, namely a stochastic model for the underlying processes X, Y , and the observed data, coming from discrete observations of the path t → Y t . The performance of a filter is naturally sensitive to the choice of stochastic model, and to the calibration of its parameters. In practice the parameters of the model are often unknown and may themselves vary in time. With the benefit of a stochastic model for the parameters (or simply a prior distribution in the case of constant parameters), one may in principle simply increase the dimension of the filter in order to simultaneously estimate the unknown parameters alongside the signal. In the absence of such a stochastic model, however, filtering alone is not sufficient, particularly when the parameters vary in time according to entirely unknown dynamics. Such general parameter uncertainty is thus a substantial and compelling problem.
The objective of the present work is to provide a theoretical framework to quantify and model how our uncertainty in the model parameters may be propagated through time, and to derive filters which provide robust estimates, both of the hidden signal process and of the unknown parameters.
We focus on the case of continuous-time finite-state hidden Markov models. The associated filters are known to be continuous with respect to their model parameters (corresponding convergence results are given in, for example, Chigansky and van Handel [12] or Guo and Yin [24]), but this by no means guarantees a satisfactory performance when the adopted parameters differ significantly from the true parameters. Uncertainty-robust filters for such systems were proposed by Borisov [7,8] via minimax filtering, whereby a best estimate is sought with respect to the worst case scenario, where 'scenarios' here are represented by probability distributions over the space of all possible parameter values.
Such minimax procedures are by now classical, designed to find the estimate which minimizes the maximum expected loss over a range of plausible models, an approach which may be traced back at least as far as Wald [32], and has been applied in various settings, principally in those with linear underlying dynamics; see for example Martin and Mintz [27], Miller and Pankov [28], Siemenikhin [29], Siemenikhin, Lebedev and Platonov [30] or Verdú and Poor [31]. Invariably, however, by focusing exclusively on the worst case scenario, such procedures do not necessarily ensure a satisfactory performance under statistically realistic scenarios, and moreover make no attempt to learn the true parameter values, or more generally to evaluate our uncertainty and how it should be updated to reflect new observations. Our approach is inspired by the discrete-time results of Cohen [13], in which the data-driven robust (DR) expectation of [14] is introduced in a filtering context as a means of computing uncertainty-robust evaluations of functionals of the signal. In particular, in [13] it is shown that such DR-expectations actually provide the only way to construct an 'expectation' which penalises uncertainty, while preserving the natural properties of monotonicity, translation equivariance and constant triviality. In short, such nonlinear expectations consider evaluating random variables under a whole family of stochastic models, which itself is a standard approach to problems of robustness, except that the DR-expectation also penalises such models according to how 'unreasonable' they are considered to be. This penalisation is linked to statistical estimation of the models themselves, specifically via the corresponding negative log-likelihood function evaluated using the observed data.
Nonlinear expectations incorporating such model penalisation were first applied to continuous-time filtering in [3], in the context of Kalman-Bucy filtering; however, there the penalisation was based only on an initial calibration, and was not updated to incorporate new observations. In both [13] and [3], the nonlinear expectation is seen to be characterised by what is essentially its convex dual, which, owing to the additive structure of the penalty, may be computed by a suitable dynamic programming principle.
In the present work we shall focus on this dual object, or 'value function', as the object which describes the propagation of our uncertainty through time as we collect new observations. We adopt the classical setting in which the unobserved signal process X is a continuous-time finite-state Markov chain, and study the uncertainty arising from both the unknown rate matrix of the chain X, and from the unknown observation matrix which determines the drift of the observation process Y . We will see how the value function in this setting may be formulated to encode our opinion of how 'reasonable' both posterior distributions and parameter values are given our observations, and how it may then be used to compute robust estimates of each.
As alluded to above, filters are typically sensitive to both uncertainty of the model parameters, and to errors in the observed data due to imprecise modelling of the observation process. One may therefore desire a filter to be 'robust' in both of these distinct senses, namely robust with respect to parameter uncertainty, and continuous with respect to the observation path t → Y_t (in some suitable topology on path space). By taking a fully pathwise approach, first 'lifting' the observation process into the space of rough paths, we will see that our resulting filters are robust in both of these senses.
The first use of rough path theory in uncertainty-robust filtering was presented in Section 4 of Allan and Cohen [2], which allowed the results of [3] to be extended, but remained in the Kalman-Bucy setting. We also highlight the results of Crisan, Diehl, Friz and Oberhauser [17] as the first use of rough paths to establish continuity of stochastic filters with respect to the (enhanced) observation path.
We highlight that the penalisation approach we adopt, formulated in terms of the 'reasonability' of parameter values, is inherently non-probabilistic, and thus does not require any prior distribution for the parameter values, nor any stochastic model for the dynamics of the parameters, which would be necessary for a purely Bayesian approach. Only an initialisation of the penalty is required, which in practice has an exponentially vanishing effect on the filter. In particular, our approach is suitable for cases in which filtering the parameter values is not feasible (or desirable due to additional nonlinearities in the underlying equations).
One of the key steps in our approach is the derivation of a pathwise optimal control problem driven by the observation path t → Y t (ω). As was demonstrated in Diehl, Friz and Gassiat [19] and subsequently in [2], rough path theory provides a convenient framework for pathwise control problems, in which they are formulated as the optimal control of a rough differential equation (RDE). In the current work we also provide new results in this direction, particularly on the regularity of the corresponding (rough) value function. A particular difficulty which we face, not encountered in [19] or [2], is that the spatial domain of our value function is actually time-dependent (and indeed rough), and thus requires delicate analysis. Moreover, and of independent interest, we also provide a new growth estimate for rough integrals depending on unconstrained parameters, which not only improves upon the corresponding estimate in [2], but in fact turns out to be sharp (see Lemma 2.5 below).
The structure of the paper is as follows. We begin in Section 2 by recalling some basic concepts from rough path theory, and presenting our main results for controlled RDEs in Theorem 2.3, the proof of which shall be postponed to the appendix. In Section 3 we shall introduce our underlying filtering problem and motivate our approach to quantifying uncertainty via model penalisation. This will then lead to an optimal control problem, and we shall establish properties of the corresponding value function in Section 4. We will then show in Section 5 that the value function satisfies a suitable rough PDE. We shall present two simple numerical examples in Section 6, and end with some brief concluding remarks in Section 7.

Rough path preliminaries

Notation
We consider a finite time interval [0, T], and write ∆_{[0,T]} := {(s, t) : 0 ≤ s ≤ t ≤ T} for the standard 2-simplex. For any path X on [0, T], we write the increment of X over the interval [s, t] as X_{s,t} := X_t − X_s, and write ‖X‖_∞ := sup_{s∈[0,T]} |X_s| for the supremum norm. We also define the following function spaces. For given vector spaces V and W, we write
• L(V; W) for the space of linear maps from V to W;
• C^n_b(V; W) for the space of n times continuously differentiable (in the Fréchet sense) functions φ : V → W such that φ and all its derivatives up to order n are uniformly bounded;
• C^{p-var}([0, T]; V) for the space of continuous paths X : [0, T] → V of finite p-variation, ‖X‖_{p,[0,T]} := (sup_P Σ_{[s,t]∈P} |X_{s,t}|^p)^{1/p} < ∞, where the supremum is taken over all finite partitions P of the interval [0, T].
defined in the Riemann-Stieltjes sense. On the other hand, for a general path Y of finite p-variation, the integral in (2.2) does not exist in the classical sense. The point, then, is that the value of this integral is postulated by the enhancement 𝕐, which in practice is often constructed using stochastic integration. For example, given a continuous semimartingale Y, for p ∈ (2, 3) one can construct an enhancement via Stratonovich integration:

𝕐^{jk}_{s,t} := ∫_s^t Y^j_{s,r} ∘ dY^k_r, j, k = 1, . . . , d, (2.3)

and the resulting lift Y = (Y, 𝕐) then defines a random rough path, so that Y(ω) ∈ V^p for almost every ω ∈ Ω. By integration by parts, the symmetric part of the enhancement is always determined by the path itself:

𝕐^{jk}_{s,t} + 𝕐^{kj}_{s,t} = Y^j_{s,t} Y^k_{s,t}.

Thus, the additional information encoded by this lift is contained in the antisymmetric part of 𝕐, which corresponds to the Lévy area of the process Y. For rough paths Y = (Y, 𝕐) and Ỹ = (Ỹ, 𝕐̃), we write

‖Y‖_p := ‖Y‖_{p,[0,T]} + ‖𝕐‖_{p/2,[0,T]}

for the (inhomogeneous) rough path norm¹, and

ρ_p(Y, Ỹ) := ‖Y − Ỹ‖_{p,[0,T]} + ‖𝕐 − 𝕐̃‖_{p/2,[0,T]} (2.4)

for the corresponding rough path distance. We will also consider the space of geometric rough paths V^{0,p}_g ⊂ V^p, defined as the closure of the canonical lifts of smooth paths with respect to the pseudometric in (2.4). For example, when Y is a semimartingale and we lift it using Stratonovich integration, as in (2.3), the resulting lift turns out to be a (random) geometric rough path. This property of being well approximated by smooth paths allows one to make sense of solutions to a wide class of rough ODEs and PDEs; we will see an example of this in Definition 5.5 below.
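To make the structure of the lift concrete, the canonical enhancement of a piecewise linear path can be computed in closed form segment by segment, and one can check numerically that its symmetric part is determined by the path increments, so that only the antisymmetric (Lévy area) part carries genuinely new information. The following Python sketch is purely illustrative (not code from the paper), using a sampled planar loop:

```python
import numpy as np

def canonical_lift(Y):
    """Second-level enhancement of the piecewise linear path through the
    sample points Y[0], ..., Y[N]:  YY[j,k] = int_0^T (Y^j_r - Y^j_0) dY^k_r."""
    YY = np.zeros((Y.shape[1], Y.shape[1]))
    for n in range(Y.shape[0] - 1):
        dY = Y[n + 1] - Y[n]
        # exact Riemann-Stieltjes integral over one linear segment
        YY += np.outer(Y[n] - Y[0], dY) + 0.5 * np.outer(dY, dY)
    return YY

t = np.linspace(0.0, 1.0, 201)
Y = np.stack([np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)], axis=1)
YY = canonical_lift(Y)
inc = Y[-1] - Y[0]
sym = 0.5 * (YY + YY.T)
# geometric identity: Sym(YY) = (increment ⊗ increment) / 2
print(np.allclose(sym, 0.5 * np.outer(inc, inc)))  # True
# antisymmetric part: Lévy area; for this closed loop it is the enclosed area
levy_area = 0.5 * (YY[0, 1] - YY[1, 0])
```

For this counterclockwise unit circle the Lévy area approximates the enclosed area π, while the symmetric part is fully redundant, matching the discussion above.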
We will sometimes write e.g. ‖Y‖_{p,[s,t]} for the p-variation of Y over the subinterval [s, t].

Rough integration
As noted above, the enhancement 𝕐 may be seen as postulating the value of the integrals ∫ Y^j dY^k. This enhancement in fact contains all the information required to define integrals against Y^j, for any integrand in the space of paths 'controlled' by (Y^1, . . . , Y^d) in the following sense.
Let Y = (Y, 𝕐) be a p-rough path for some p ∈ [2, 3). We say that a path X ∈ C^{p-var}([0, T]; R^m) is a controlled rough path (in the sense of Gubinelli [23]) if there exists a path X′ ∈ C^{p-var}([0, T]; L(R^d; R^m)), known as the Gubinelli derivative of X with respect to Y, such that the remainder term R^X, defined by R^X_{s,t} := X_{s,t} − X′_s Y_{s,t}, has finite p/2-variation. We write V^p_Y for the space of controlled rough paths (with respect to Y), which becomes a Banach space when equipped with the norm (X, X′) ↦ |X_0| + |X′_0| + ‖X′‖_{p,[0,T]} + ‖R^X‖_{p/2,[0,T]}.

Proposition 2.2 (Proposition 2.6 in [22]). Let (X, X′) ∈ V^p_Y be a controlled rough path. Then the limit

∫_0^T X_s dY_s := lim_{|P|→0} Σ_{[s,t]∈P} ( X_s Y_{s,t} + X′_s 𝕐_{s,t} )

exists, where the limit is taken over any sequence of partitions P of the interval [0, T] such that the mesh size |P| → 0. This limit (which does not depend on the choice of sequence of partitions) is called the rough integral of X against Y.
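For a smooth driver the rough integral reduces to the classical one, which gives a simple sanity check on the compensated Riemann sums. The Python sketch below (an illustration of ours, not from the paper) integrates X = Y against Y for the smooth scalar path Y_t = t², with Gubinelli derivative X′ ≡ 1 and canonical enhancement 𝕐_{s,t} = (Y_{s,t})²/2; in this case the compensated sums telescope and recover Y_T²/2 = 1/2 exactly, on any grid:

```python
import numpy as np

def rough_integral(X, Xp, Y, YY):
    """Compensated Riemann sum:  sum_k  X_k Y_{k,k+1} + X'_k YY_{k,k+1}."""
    dY = np.diff(Y)
    return float(np.sum(X[:-1] * dY + Xp[:-1] * YY))

t = np.linspace(0.0, 1.0, 11)   # a deliberately coarse grid
Y = t ** 2                      # smooth scalar path
YY = 0.5 * np.diff(Y) ** 2      # canonical enhancement: YY_{s,t} = (Y_{s,t})^2 / 2
val = rough_integral(Y, np.ones_like(Y), Y, YY)
print(val)  # telescopes exactly to Y_T^2 / 2 = 0.5
```

Without the second-order compensation term the plain left-point Riemann sum on this coarse grid would be visibly biased; the enhancement removes that bias exactly here.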

Rough differential equations
We consider the rough differential equation (RDE) given by²

dX_t = b(X_t, γ_t) dt + φ(X_t, γ_t) dY_t, t ∈ [0, T], (2.6)

driven by a rough path Y and depending on a parameter path γ.

Theorem 2.3. (i) For any γ ∈ C^{p/2-var} and any x ∈ R^m, there exists a unique X ∈ C^{p-var} such that the controlled path (X, X′) = (X, φ(X, γ)) ∈ V^p_Y solves the RDE (2.6) driven by Y with parameter γ and initial condition X_0 = x.
where the constant C depends on b, φ, ψ, p and L.

(2.9)
where the constants C ′ , C ′′ depend on b, φ, p, L and M , and C ′′ also depends on ψ.
(iv) Let p ∈ (2, 3). Suppose that Y is a continuous semimartingale on some probability space (Ω, F, P) and Y = (Y, 𝕐) is its Stratonovich lift, as in (2.3), so that Y(ω) = (Y(ω), 𝕐(ω)) ∈ V^{0,p}_g for almost every ω ∈ Ω. Then the solution of the random RDE (2.6) is indistinguishable from the solution of the corresponding Stratonovich SDE, and moreover the rough and Stratonovich integrals coincide almost surely, that is, for almost every ω ∈ Ω.
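The RDE machinery can be explored numerically even in the simplest smooth setting: a second-order (Davie-type) Euler scheme uses both the increment Y_{s,t} and the enhancement 𝕐_{s,t} at each step. The Python sketch below (our illustration, not code from the paper) applies it to the linear RDE dX = X dY with the smooth driver Y_t = t, whose canonical enhancement is 𝕐_{s,t} = (t − s)²/2, and compares against the exact solution X_1 = e:

```python
import math

def davie_step(x, dY, YY):
    # one step of the second-order scheme for dX = phi(X) dY with phi(X) = X:
    # X_next = X + phi(X) dY + Dphi(X) phi(X) YY
    return x + x * dY + x * YY

N, T = 100, 1.0
dt = T / N
x = 1.0
for _ in range(N):
    x = davie_step(x, dt, 0.5 * dt ** 2)  # smooth driver: Y_{s,t} = dt, YY = dt^2/2
print(abs(x - math.e) < 1e-3)  # True: second-order accuracy on 100 steps
```

The same step, fed with a sampled Stratonovich lift of a semimartingale instead of the smooth increments, is the standard numerical counterpart of the correspondence in part (iv).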
² Strictly speaking, in making precise sense of the product X′_s 𝕐_{s,t}, we use the natural identification of L(R^d; L(R^d; R^m)) with L(R^d ⊗ R^d; R^m).

The estimates above also hold when written in terms of the terminal values X_T, X̃_T, γ_T, γ̃_T instead of the initial values, with the same constants C′, C″ as before.
We have from part (ii) of Theorem 2.3 that the bound (2.10) holds when q = (p − 1)/2, for some constant C depending on b, φ, ψ, p, L and T. In fact, this estimate is sharp, in the sense of the following lemma.

Uncertainty in hidden Markov models

The Wonham filter
Let X be a continuous-time Markov chain taking values in the standard basis X = {e_1, . . . , e_m} of R^m. We write A_t = [a_{ij}(t)]_{m×m} for the rate matrix³ of X, so that in particular the process X_t − ∫_0^t A_s X_s ds is a càdlàg martingale. We assume that E[X_0] = π_0, for some element π_0 of the open probability simplex

S_m := { x ∈ R^m : x_i > 0 for each i = 1, . . . , m, and Σ_{i=1}^m x_i = 1 }.

We consider the problem of estimating the current state of the chain X from observations of the R^d-valued process Y = (Y^1, . . . , Y^d), with Y_0 = 0 and dynamics

dY^i_t = h^i_t · X_t dt + dB^i_t, i = 1, . . . , d, (3.1)

where h^i is an R^m-valued time-dependent vector, and B = (B^1, . . . , B^d) is a standard d-dimensional Brownian motion. In the following we will write h = (h^1, . . . , h^d) ∈ R^{m×d} for the full observation matrix, and it will also be convenient to write H^i_t := diag(h^i_t), i.e. the diagonal matrix with diagonal elements given by the vector h^i_t. We will denote by (Y_t)_{t≥0} the (completed) natural filtration generated by the observation process Y. The goal of the associated filtering problem is to determine, at each time t, the posterior distribution π_t = E[X_t | Y_t]. In the present setting, the filtering problem was resolved by Wonham [33]; see also Bain and Crisan [5, Chapter 3]. The optimal filter π is the unique continuous Y_t-adapted solution of the stochastic differential equation

dπ_t = A_t π_t dt + Σ_{i=1}^d ( H^i_t − (h^i_t · π_t) I ) π_t ( dY^i_t − (h^i_t · π_t) dt ), (3.2)

where I denotes the (m × m)-identity matrix. It will be convenient later to interpret the stochastic integral appearing in (3.2) in the sense of Stratonovich, rather than that of Itô; we refer to the resulting Stratonovich form of the equation as (3.3).
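For readers who want to experiment, the Itô form (3.2) is straightforward to discretise. The Python sketch below is illustrative only (the parameter values are arbitrary choices of ours, not from the paper): it simulates a two-state chain with constant rate and observation matrices, applies an Euler-Maruyama step to (3.2), and renormalises to keep the filter on the simplex.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 5.0, 5000
dt = T / N

A = np.array([[-1.0, 2.0],
              [1.0, -2.0]])      # rate matrix: columns sum to zero
h = np.array([2.0, -2.0])        # observation vector (d = 1)

state = 0                        # chain starts in state e_1
pi = np.array([0.5, 0.5])        # filter initialised at the uniform prior
for _ in range(N):
    # simulate the chain and the observation increment dY = h.X dt + dB
    x = np.eye(2)[state]
    dY = h @ x * dt + np.sqrt(dt) * rng.standard_normal()
    if rng.random() < -A[state, state] * dt:
        state = 1 - state
    # Euler-Maruyama step of the Wonham equation (3.2)
    m1 = h @ pi
    pi = pi + A @ pi * dt + (h - m1) * pi * (dY - m1 * dt)
    pi = np.clip(pi, 1e-12, None)
    pi /= pi.sum()               # renormalise onto the simplex

print(pi.sum(), (pi >= 0).all())
```

The clipping-and-renormalisation step is a crude but common numerical safeguard; the continuous-time equation itself preserves the simplex.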

Parameter uncertainty
We shall consider both uncertainty of the rate matrix A of the Markov chain X, and uncertainty of the observation matrix h. We recall that an (m × m)-rate matrix is a matrix A = [a_{ij}]_{m×m} such that a_{ij} ≥ 0 for all i ≠ j, and Σ_{i=1}^m a_{ij} = 0 for all j = 1, . . . , m.
Assumption 3.1. We shall write A for the set of admissible rate matrices, and H for the set of admissible observation matrices, which we shall assume to be bounded connected subsets of the (m × m)-rate matrices and of R^{m×d} respectively. Let k ≥ 1 be the total dimension of the space A × H, which we note can be at most m(m − 1) + md. For notational clarity, we assume that the elements of the space A × H may be parameterised by the elements of R^k. More precisely, we assume that there exists a bijection from R^k to A × H which belongs to the class C^3_b.
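A concrete instance of such a parametrisation is easy to write down: map each coordinate of R^k through a bounded smooth function onto an admissible range. In the sketch below, the ranges and the tanh map are our own illustrative choices (not those of the paper's Example 3.2); the code builds a 2×2 rate matrix and a scalar observation vector from a point in R² and checks the defining properties of a rate matrix.

```python
import numpy as np

def to_params(z):
    """Map z in R^2 smoothly into A x H (hypothetical bounded ranges)."""
    lam = 1.0 + 0.5 * np.tanh(z[0])   # off-diagonal rate, confined to (0.5, 1.5)
    eta = 2.0 * np.tanh(z[1])         # observation level, confined to (-2, 2)
    A = np.array([[-lam, lam],
                  [lam, -lam]])
    h = np.array([eta, -eta])
    return A, h

A, h = to_params(np.array([0.3, -1.2]))
offdiag = A[~np.eye(2, dtype=bool)]
# rate-matrix properties: nonnegative off-diagonals, columns summing to zero
print((offdiag >= 0).all(), np.allclose(A.sum(axis=0), 0.0))
```

Because tanh is smooth with all derivatives bounded, such a map is C^3_b onto its (open, bounded) image, in the spirit of Assumption 3.1.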
Example 3.2. As a simple example, one might have m = 2 and d = 1, with A and H chosen so that k = 2, and we may then parametrise A × H by R² via a suitable C^3_b mapping.

Remark 3.3. This framework includes cases in which the rate matrix A (resp. the observation matrix h) is known. In such cases we simply take the corresponding factor of A × H to be a singleton.

We suppose that the true parameter is a Lipschitz continuous path taking values in R^k. We shall therefore take as our uncertainty class the space A, where:

The setup
Our setup is the following. For each choice of parameter γ ∈ A we denote by A = (A t : t ∈ [0, T ]) and h = (h t : t ∈ [0, T ]) the rate matrix and observation matrix corresponding to γ (via the bijection in Assumption 3.1).
Let X and Y be two adapted processes on a filtered space (Ω, F, (F_t)_{t∈[0,T]}). For each γ ∈ A and initial distribution π_0 ∈ S_m, we let P^{γ,π_0} be a probability measure such that the law of (X, Y) is equal to the law of (X̃^{γ,π_0}, Ỹ^{γ,π_0}), where, on some probability space, X̃^{γ,π_0} is a Markov chain with rate matrix A and initial distribution E[X̃^{γ,π_0}_0] = π_0, and Ỹ^{γ,π_0} is a weak solution of (3.1), i.e. Ỹ^{γ,π_0} satisfies the SDE (3.1) driven by some Brownian motion B with the observation matrix h and initial value Ỹ^{γ,π_0}_0 = 0. We note in particular that the processes X and Y, as functions on Ω × [0, T], and hence also the (uncompleted) filtration σ(Y_s : s ∈ [0, t]), t ∈ [0, T], are defined independently of the choice of parameters; it is only the law of (X, Y) which varies depending on the choices of γ and π_0.
To the family of all possible parameters (γ, π 0 ) ∈ A × S m , we can naturally associate the corresponding solution π of the filtering equation ((3.2) or (3.3)). Thus, at each time t ≥ 0, we obtain in general a whole family of possible posterior distributions π t = x ∈ S m for the signal, and a family of possible values γ t = a ∈ R k for the unknown parameter γ at time t. Since we don't know which choice is the correct one, at each time t we wish to know how to decide which posterior distribution x ∈ S m and parameter value a ∈ R k is the most 'reasonable' given our observations.
At each time t > 0, and for each choice of posterior distribution x ∈ S_m and parameter value a ∈ R^k, the central question we pose is thus the following: Given our observations, and given all the possible parameter choices, how reasonable is it that we would end up with the posterior distribution π_t = x and parameter value γ_t = a at time t?
To make this question more concrete, we need a notion of the 'unreasonability' of different parameter choices. Mathematically, this notion may be represented by a 'penalty' function which, at each time t, penalises parameters according to how unreasonable we consider them to be given our observations up to time t.
Let us suppose for the moment that we have specified such a penalty function, denoted by β_t(γ, π_0 | Y_t), which assigns a penalty to each choice of the parameters γ, π_0. Then, for a particular posterior distribution x ∈ S_m and parameter value a ∈ R^k, the most reasonable parameters (γ, π_0) at time t are those which attain the minimum of the set

{ β_t(γ, π_0 | Y_t) : (γ, π_0) ∈ A × S_m, γ_t = a, π_t = x },

where π = (π_s)_{s∈[0,t]} satisfies the filtering equation with rate and observation matrices corresponding to the parameter γ.

The penalty function
We suppose that our penalty takes the form of a negative log-posterior density. That is, we take

β_t(γ, π_0 | Y_t) := − log ϑ(γ, π_0) − log L_t(γ, π_0 | Y_t), (3.4)

where ϑ and L_t(· | Y_t) denote the prior and likelihood respectively.
Remark 3.5. Since the posterior is only proportional to the product of prior and likelihood, (3.4) is correct up to an additive constant. For simplicity we will omit this constant from our analysis, conceding that our penalty function is correct up to an additive constant. This constant may be reintroduced upon numerical computation, chosen to ensure that the penalty function always takes the value zero at its minimum.
The penalty function in (3.4) is built from the log-likelihood function, a familiar object from classical statistics. Penalties based on log-likelihoods form the basis of the data-driven robust (DR) expectation of [14], which allows the level of penalisation of different parameter choices to be recursively updated through time as we collect new observations. Here we add to this an additional penalty based on our prior beliefs, which may be calibrated accordingly. We assume that the prior takes the form

− log ϑ(γ, π_0) = ∫_0^t f(π_s, γ_s, γ̇_s) ds + g(π_0, γ_0), (3.5)

where as usual π = (π_s)_{s∈[0,t]} is the posterior distribution corresponding to the parameters γ and π_0, and γ̇ is the derivative of γ. Here, the functions f : S_m × R^k × R^k → R and g : S_m × R^k → R may be calibrated to represent our prior beliefs about the plausibility of different parameter choices. In practice the function f may also be time dependent, and may even depend on our observations provided that it is Y_t-predictable. By allowing f to depend on the derivative γ̇, we can penalise parameters not only according to their value, but also according to how quickly they vary over time. For example, if we believe that the true parameter (or some component thereof) should remain fairly constant in time, then we can incorporate this belief by choosing the function f to grow very quickly relative to the magnitude of γ̇_s.
The natural choice for the likelihood L_t(· | Y_t) is the Radon-Nikodym derivative

L_t(γ, π_0 | Y_t) := dP^{γ,π_0}/dP^{γ̄,π̄_0} restricted to Y_t,

that is, the likelihood ratio of the (arbitrary) parameter choice γ, π_0, with respect to a (fixed) choice of reference parameters γ̄, π̄_0. We will now derive an explicit expression for this likelihood. Recall (from e.g. Bain and Crisan [5, Chapter 2]) that for a given choice of parameters γ, π_0, the innovation process V = (V^1, . . . , V^d), given in this setting by

V^i_t := Y^i_t − ∫_0^t h^i_s · π_s ds, i = 1, . . . , d,

is a Y_t-adapted Brownian motion under P^{γ,π_0}, and moreover that, in this setting, V generates the observation filtration (see Allinger and Mitter [4]). Writing π̄ (resp. V̄) for the posterior distribution (resp. innovation process) under the reference measure P^{γ̄,π̄_0}, we have

dV̄^i_t = dV^i_t + ( h^i_t · π_t − h̄^i_t · π̄_t ) dt.

Thus, by Girsanov's theorem (see e.g. [15, Chapter 15]), we can represent the likelihood as a stochastic exponential, namely

log L_t(γ, π_0 | Y_t) = Σ_{i=1}^d ( ∫_0^t (h^i_s · π_s − h̄^i_s · π̄_s) dV̄^i_s − ½ ∫_0^t |h^i_s · π_s − h̄^i_s · π̄_s|² ds ).

Since the reference parameters are taken to be fixed, they simply amount to an additive constant in the above expression. That is,

log L_t(γ, π_0 | Y_t) = Σ_{i=1}^d ( ∫_0^t (h^i_s · π_s) dY^i_s − ½ ∫_0^t |h^i_s · π_s|² ds ) + constant. (3.7)

As in Remark 3.5, we shall henceforth omit this constant.
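The log-likelihood lends itself to direct numerical computation: run the candidate filter along a discretised observation path and accumulate Σ_i ∫ (h^i·π) dY^i − ½ ∫ (h^i·π)² ds. The Python sketch below is an illustrative toy of ours, not code from the paper: it uses a deliberately degenerate case (m = 2, d = 1, zero rate matrix, known π_0 = e_1) in which the filter is constant and, for noise-free data generated with drift h_true, the accumulated value equals h₁ h₁^true T − ½ h₁² T in closed form.

```python
import numpy as np

def log_likelihood(h, Y, dt, A, pi0):
    """Accumulate  l = int (h.pi) dY - (1/2) int (h.pi)^2 ds  along the
    discretised path Y, running the filter with candidate parameters (A, h)."""
    pi, ll = pi0.copy(), 0.0
    for dY in np.diff(Y):
        m = h @ pi
        ll += m * dY - 0.5 * m ** 2 * dt
        pi = pi + A @ pi * dt + (h - m) * pi * (dY - m * dt)  # Wonham step
        pi = np.clip(pi, 1e-12, None)
        pi /= pi.sum()
    return ll

T, N = 1.0, 1000
dt = T / N
h_true = np.array([1.0, 0.0])
Y = h_true[0] * np.linspace(0.0, T, N + 1)   # noise-free data from state e_1
A = np.zeros((2, 2))                         # frozen chain: pi stays at e_1
pi0 = np.array([1.0, 0.0])

ll_true = log_likelihood(h_true, Y, dt, A, pi0)
ll_wrong = log_likelihood(np.array([2.0, 0.0]), Y, dt, A, pi0)
print(ll_true > ll_wrong)  # True: the true parameter is preferred
```

In this toy case ll_true ≈ 0.5 and ll_wrong ≈ 0.0, matching the closed form above; with noisy data the same loop computes the penalisation used in (3.4).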
For later convenience, we make a change of variables. Substituting (3.5) and (3.7) into (3.4), we then obtain the expression (3.8) for the penalty, where, for notational simplicity, we have introduced transformed versions of the functions f and g.

Pathwise filtering
As discussed above, we propose to evaluate the unreasonableness of different posterior distributions x ∈ S_m and parameter values a ∈ R^k by minimising the penalty β_t(γ, π_0 | Y_t) in (3.8) over all choices of the parameters γ, π_0 which would have resulted in the distribution π_t = x and value γ_t = a. Of course, in practice this optimisation should depend on the particular realisation of the observation process Y that we actually observe. Thus, we do not wish to optimise the expectation of (3.8), but rather we wish to simultaneously optimise with respect to each individual realisation of the process Y. This motivates a pathwise interpretation of the filtering equation.
We proceed as follows. We first fix a reference measure P^{γ̄,π̄_0}. We then enhance the observation process Y using Stratonovich integration, setting

𝕐^{jk}_{s,t} := ∫_s^t Y^j_{s,r} ∘ dY^k_r, j, k = 1, . . . , d,

defined under the measure P^{γ̄,π̄_0}, which we recall defines a random geometric rough path Y = (Y, 𝕐) ∈ V^{0,p}_g for any p ∈ (2, 3). Recall the Stratonovich filtering equation (3.3). For notational simplicity, we rewrite this equation in the form

dπ_t = b(π_t, γ_t) dt + ψ(π_t, γ_t) ∘ dY_t. (3.11)

We note that in general the measures P^{γ,π_0} are not necessarily equivalent on F_t (as different choices of the rate matrix A may have different patterns of zero entries), and hence the completed observation filtrations need not coincide. Nonetheless, since the enhancement is adapted to the uncompleted filtration generated by Y, the process 𝕐 coincides almost surely with the same integral defined under any other choice of measure P^{γ,π_0} (even though the corresponding completed filtrations Y_t may not agree).
Thus, defining π as the solution of the RDE

dπ_s = b(π_s, γ_s) ds + ψ(π_s, γ_s) dY_s, (3.12)

which exists by part (i) of Theorem 2.3, then, for each choice of parameters (γ, π_0), the corresponding solution π of (3.12) is indistinguishable from the solution of the Stratonovich equation (3.11) defined under P^{γ,π_0}, and moreover the rough integral ∫_0^t ψ(π_s, γ_s) dY_s coincides almost surely with the corresponding Stratonovich integral. Given a realisation of the observation process, we obtain an associated rough path Y ∈ V^{0,p}_g, and we can write the penalty corresponding to this realisation as in (3.13). As discussed above, we propose to evaluate the reasonableness of posterior distributions x ∈ S_m and parameter values a ∈ R^k by determining the most reasonable choice of the parameters γ, π_0 which would have resulted in the posterior π_t = x and parameter value γ_t = a at time t. The 'unreasonableness' of each pair (x, a) ∈ S_m × R^k is given by the functional κ : [0, T] × S_m × R^k → R ∪ {+∞},

κ(t, x, a) := inf { β_t(γ, π_0 | Y) : γ ∈ A, π_0 ∈ S_m }, (3.14)

where the infimum is taken over all γ ∈ A and π_0 ∈ S_m such that γ satisfies γ_t = a and the solution π of (3.12) takes the terminal value π_t = x.

Interpretation
At each time t, the function (x, a) → κ(t, x, a) encodes our opinion of how reasonable each posterior distribution x ∈ S m and each parameter value a ∈ R k is at time t, given our observations. Thus, the map t → κ(t, ·, ·) describes the propagation of our uncertainty through time.
Since κ(t, x, a) measures the unreasonability of posteriors and parameter values, we obtain a filter which is robust to uncertainty by simply taking the minimum of κ(t, ·, ·). That is, the most reasonable parameter values at each time are given by

a*_t ∈ argmin_{a ∈ R^k} ( inf_{x ∈ S_m} κ(t, x, a) ), (3.15)

and the most reasonable posteriors at each time may be obtained similarly.
As noted above, the penalty function β_t is defined up to an additive constant, which depends on time t but does not depend on the choice of parameters γ, π_0. Similarly, the interpretation of the function κ(t, ·, ·) is not affected by additive constants, and in practice it is therefore natural to shift the values of κ so that inf_{(x,a)} κ(t, x, a) = 0 for every t ≥ 0. In particular, given a threshold λ > 0, one can then define a set of reasonable parameter values (or similarly posteriors) by setting

R^λ_t := { a ∈ R^k : inf_{x ∈ S_m} κ(t, x, a) ≤ λ },

with the most reasonable parameter values, as in (3.15), being recovered as lim_{λ→0} R^λ_t. Analogously to Cohen [13], one can also define an associated DR-expectation by setting

E(ϕ(X_t) | Y_t) := sup_{(γ,π_0) ∈ A × S_m} ( E^{γ,π_0}[ϕ(X_t) | Y_t] − β_t(γ, π_0 | Y_t) ), (3.16)

defined for every functional ϕ : X → R. As mentioned in the introduction, such expectations allow one to compute evaluations of random variables which penalise uncertainty, whilst retaining many of the natural properties one would expect from an expectation. With π defined as above, i.e. as the solution of the rough filtering equation (3.12), it follows that Σ_{j=1}^m π^j_t ϕ(e_j) is a version of the conditional expectation E^{γ,π_0}[ϕ(X_t) | Y_t] for every choice of parameters and every functional ϕ. Choosing this version, the nonlinear expectation in (3.16) evaluated on a particular (enhanced) observation path Y|_{[0,t]} is given by the following lemma.

Lemma 3.6. For any Y ∈ V^{0,p}_g, t ∈ [0, T] and functional ϕ : X → R, we have that

E(ϕ(X_t) | Y_t) = sup_{(x,a) ∈ S_m × R^k} ( Σ_{j=1}^m x_j ϕ(e_j) − κ(t, x, a) ), (3.17)

where κ is the function defined in (3.14).
Proof. For any (x, a) ∈ S_m × R^k, we observe that taking the supremum of Σ_{j=1}^m x_j ϕ(e_j) − β_t(γ, π_0 | Y) over all parameters (γ, π_0) with γ_t = a and π_t = x yields Σ_{j=1}^m x_j ϕ(e_j) − κ(t, x, a). We then obtain (3.17) upon taking the supremum over (x, a) ∈ S_m × R^k.
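Once κ(t, ·, ·) has been computed, for instance on a grid, the robust estimate (3.15), the reasonable set R^λ and the DR-expectation of Lemma 3.6 all reduce to elementary array operations. In the Python sketch below the tabulated κ values are hypothetical placeholders (a quadratic well of our choosing, not output of the paper's scheme), used only to show the mechanics for m = 2 with φ(e₁) = 1, φ(e₂) = 0:

```python
import numpy as np

q = np.linspace(0.0, 1.0, 101)   # posteriors x = (q, 1 - q), m = 2
a = np.linspace(-1.0, 1.0, 41)   # parameter grid
Q, Aa = np.meshgrid(q, a, indexing="ij")
# hypothetical tabulated penalty kappa(t, x, a): a placeholder quadratic well
kappa = 5.0 * (Q - 0.7) ** 2 + 2.0 * (Aa - 0.2) ** 2

kappa -= kappa.min()             # shift so that inf kappa = 0
# robust estimates (3.15): minimisers of kappa
ix, ia = np.unravel_index(np.argmin(kappa), kappa.shape)
# reasonable parameter values R^lambda for lambda = 0.1
R = a[kappa.min(axis=0) <= 0.1]
# DR-expectation (3.17) of phi with phi(e1) = 1, phi(e2) = 0, so sum_j x_j phi(e_j) = q
dr = np.max(Q - kappa)
print(q[ix], a[ia], round(dr, 4))
```

Here the minimiser sits at (q, a) = (0.7, 0.2) by construction, and the DR-expectation 0.75 exceeds the 'most reasonable' evaluation 0.7, reflecting the uncertainty premium built into the sup in (3.17).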
The key insight of our approach is that the function κ, as defined in (3.14), has the form of the value function of an optimal control problem. To make this precise, we introduce a set U of admissible controls, where we now interpret γ̇^{t,a,u} = u ∈ U as a control, and π^{t,x,a,u}, γ^{t,a,u} as the state variables, which satisfy the corresponding controlled dynamics. One should take care here to exclude unphysical trajectories. That is, given a path Y and terminal condition (π_t, γ_t) = (x, a), there may exist choices of control u ∈ U for which the solution π = (π_s)_{s∈[0,t]} of (3.12) leaves the domain S_m. Such tuples (t, x, a, u) do not correspond to a physical initial value π_0, and should thus be discarded.
Since λ > 0, it is then inevitable that π 2,s < 0 for some s < t, and hence that π s / ∈ S 2 .
Although we don't obtain an initial value π 0 ∈ S m for trajectories which leave the domain, we can simply assign an infinite initial cost to all such trajectories, so that the corresponding controls are never considered when taking the infimum in (3.18).
Moreover, in general there will exist terminal conditions (t, x, a) for which every choice of control u ∈ U results in a trajectory π = (π_s)_{s∈[0,t]} which leaves the domain S_m. These are posteriors x ∈ S_m which do not correspond to any pair of parameters (γ, π_0) ∈ A × S_m, and which are therefore totally implausible given our observations up to the current time. In these cases we have simply κ(t, x, a) = ∞.

Notation 3.9. Let t ∈ [0, T] and a ∈ R^k. We denote the set of 'plausible' posteriors at time t (for which at least one physical trajectory exists) by

Q_t := { x ∈ S_m : κ(t, x, a) < ∞ } = { x ∈ S_m : ∃ u ∈ U such that π^{t,x,a,u}_0 ∈ S_m }.
Since we impose no uniform bound on the controls u ∈ U, it is easy to deduce that if κ(t, x, a) < ∞ for some a ∈ R^k, then in fact κ(t, x, a) < ∞ for all a ∈ R^k. Thus, the set Q_t defined above is independent of the choice of a ∈ R^k. We will also denote by D := { (t, x, a) ∈ [0, T] × S_m × R^k : x ∈ Q_t } the domain on which κ is finite.
It is clear that Q 0 = S m . Moreover, the domain Q t , which is easily seen to be an open subset of S m , does not depend on the choice of the functions f and g, but it does depend on the space A × H and on the realisation of the observation path Y [0,t] . The boundary t → ∂Q t therefore also inherits the roughness of Y.
Remark 3.10. The fact that the set of plausible posteriors Q t is in general a proper subset of S m should not be too surprising. In particular, in the degenerate case with no uncertainty, so that A×H is just the singleton {(A true , h true )} and the initial distribution π 0 = π true 0 is known, the set Q t reduces to the singleton {π true t }, where π true is the filter corresponding to the true parameters. Moreover, we cannot expect all posteriors to be plausible (i.e. reachable by at least one filter trajectory) without an assumption of irreducibility on the admissible rate matrices.
Remark 3.11. Although in general the domain Q t is a proper subset of S m , there are cases in which Q t = S m (so that all posteriors x ∈ S m are considered to be plausible) at every time t. We will see an example of this in Section 6.1.

An unbounded pathwise control problem
In the previous section we formulated an optimal control problem, which for convenience we restate here as the value function in (4.1). As discussed in the previous section, in (4.1) we assign an infinite initial cost to all trajectories π^{t,x,a,u} which leave the domain S_m at any time s < t. We will sometimes omit the superscripts on the state variables when no confusion is likely to occur.

Observations and assumptions
We begin with some observations. First, we note that this is a 'backward' control problem, in the sense that we prescribe a terminal condition (π t , γ t ) = (x, a) for the state trajectories and, for each choice of control u, solve the controlled dynamics backwards in time to obtain the corresponding initial value π 0 . More significantly, here we wish to perform the optimization for every fixed (enhanced) realisation Y [0,T ] of the stochastic process Y . This type of problem is known as 'pathwise stochastic control'. In fact, we have formulated our problem in terms of the optimal control of a rough differential equation, which we wish to perform for an arbitrary geometric rough path Y ∈ V 0,p g . Control problems of this type were first studied by Diehl et al. [19], and subsequently by Allan and Cohen [2].
Moreover, the control problem stated above is unbounded, in the sense that, as we will see, the value function κ(t, x, a) 'blows up' for values of x which are close to the boundary of Q t , and also for very large values of a. This is because such values of x and a are considered to be very unreasonable, and are thus assigned a very large cost. We then note that, by part (ii) of Assumption 4.2, for any ε > 0 we infer the existence of a constant C ε such that |u| ≤ εf (x, a, u) + C ε for all (x, a, u) ∈ S m × R k × R k . Choosing ε sufficiently small, we deduce that the cost functional is bounded below and, since f and g are bounded below, the same is true of κ.
Step 2. Now let ∆ be a compact subset of the domain D. By the definition of D, for each point (t̄, x̄, ā) ∈ ∆ there exists a control ū ∈ U such that π t̄,x̄,ā,ū s ∈ S m for all s ∈ [0, t̄]. Moreover, by continuity, there exists a compact subset Ξ t̄,x̄,ā of S m and an open neighbourhood O t̄,x̄,ā of (t̄, x̄, ā) such that π t,x,a,ū s ∈ Ξ t̄,x̄,ā for all (t, x, a) ∈ O t̄,x̄,ā and all s ∈ [0, t].
Since {O t̄,x̄,ā : (t̄, x̄, ā) ∈ ∆} is an open cover of the compact set ∆, there exists a finite collection of points (t̄, x̄, ā) ∈ ∆ and corresponding controls ū such that ∆ ⊂ ∪ (t̄,x̄,ā) O t̄,x̄,ā . The finite union Ξ := ∪ (t̄,x̄,ā) Ξ t̄,x̄,ā is clearly compact. Moreover, we have shown that: for any (t, x, a) ∈ ∆ there exists a control ū ∈ U from our finite collection such that π t,x,a,ū s ∈ Ξ for all s ∈ [0, t]. Since each control ū : [0, T ] → R k is bounded, the finite collection of controls specified above is uniformly bounded. Thus, there exists a compact set K ⊂ S m × R k such that, for any (t, x, a) ∈ ∆, there exists a control ū ∈ U such that (π t,x,a,ū s , γ t,a,ū s ) ∈ K for all s ∈ [0, t].
Since f and g are assumed to be continuous, they are locally bounded, and hence bounded on K. Thus, using (4.5) again, we have κ(t, x, a) ≤ ∫ t 0 f (π t,x,a,ū s , γ t,a,ū s , ū s ) ds + C + g(π t,x,a,ū 0 , γ t,a,ū 0 ) < ∞, so that κ is bounded above on ∆, and hence locally bounded above on D.

Proof. We recall the inequality (4.6), which bounds the cost from below up to some constant C. Since κ is bounded above on ∆, and since g is bounded below, we infer an upper bound on the integral ∫ t 0 f (π t,x,a,u s , γ t,a,u s , u s ) ds. Since κ is bounded above on ∆, and since f is bounded below, we similarly infer from (4.8) an upper bound on g(π t,x,a,u 0 , γ t,a,u 0 ). Since g ∈ C ↑ (S m × R k ; R), this implies the existence of a compact set Ξ ⊂ S m × R k such that, for terminal values (t, x, a) ∈ ∆, we may restrict to controls u such that (π t,x,a,u 0 , γ t,a,u 0 ) ∈ Ξ. Since both the initial and terminal values of the state variables π t,x,a,u , γ t,a,u are then restricted to the compact sets Ξ and ∆ respectively, and since we know that we may restrict to controls u such that ‖γ t,a,u ‖ p/2,[0,t] ≤ M , we conclude that the entire path s → (π t,x,a,u s , γ t,a,u s ) may be restricted to a compact set K.

Regularity of the value function
We have the following dynamic programming principle. 5 We suppress the dependency of U M,K on the point (t, x, a) in our notation.
The result of Lemma 4.7 follows by the same proof as that of Theorem 2.1 in [34, Chapter 4]. In particular, the rough integrals appearing in the controlled dynamics and value function do not cause any additional difficulty.

Proof. Let ∆ be a compact subset of D, and let (t, x, a), (t, x̃, ã) ∈ ∆. Let u ∈ U . By Corollary 4.5, we may assume that u ∈ U M,K , i.e. ‖γ t,a,u ‖ p/2,[0,t] ≤ M for some M > 0, and there exists a compact subset K ⊂ S m × R k such that (π t,x,a,u s , γ t,a,u s ) ∈ K for all (t, x, a) ∈ ∆ and s ∈ [0, t]. By Corollary 2.4, we have ‖π t,x,a,u − π t,x̃,ã,u ‖ p,[0,t] ≲ |x − x̃| + |a − ã|. (4.10) Since we have restricted to the compact set K, we may take the functions f and g to be Lipschitz in (x, a). Using (4.10) and (4.11), we then obtain the desired local Lipschitz continuity of κ in (x, a).

Proof. Let ∆ be a compact and convex subset of D. Let (r, x, a), (t, x, a) ∈ ∆ with r ≤ t. Note that then (s, x, a) ∈ ∆ for all s ∈ [r, t] by convexity. By Corollary 4.5, we may restrict to controls u ∈ U M,K , so that ‖γ t,a,u ‖ p/2,[0,t] ≤ M for some M > 0, and there exists a compact subset K ⊂ S m × R k such that (π t,x,a,u s , γ t,a,u s ) ∈ K for all (t, x, a) ∈ ∆ and s ∈ [0, t]. Similarly to the proof of Theorem 2.2 in Bardi and Da Lio [6], by Lemma 4.7 we can further restrict to controls u ∈ U M,K whose cost is comparable to that of the zero control, where 0 ∈ U denotes the zero control. We aim to bound each of the terms on the right-hand side, and similarly with u replaced by 0.
Since the path s → (π t,x,a,0 s , γ t,a,0 s ) then lives in a compact set, and since f is locally bounded, it follows that the bound (4.14) holds. By Proposition 4.8, since we have restricted to a compact set, we may take κ to be Lipschitz in (x, a). We then obtain an estimate in which we use the fact that b is uniformly bounded, and that (4.13) also holds with ψ replaced by φ. By Lemma 4.7, we can take a sequence of controls (u n ) n≥1 ⊂ U M,K which approaches the infimum in the value function. Using (4.13) and (4.17), and the fact that κ is locally Lipschitz in (x, a), we obtain a corresponding bound. Using (4.18), and taking the limit as n → ∞, we obtain (4.19), which implies that κ is continuous in t.
Step 1. We recall the inequality (4.6). Thus, as |a| → ∞, it must be the case that either ∫ t 0 f (π t,x,a,u s , γ t,a,u s , u s ) ds → ∞, or |γ t,a,u 0 | → ∞, and in the latter case it then follows from part (iii) of Assumption 4.2 that g(π t,x,a,u 0 , γ t,a,u 0 ) → ∞. Since f and g are both bounded below, it follows from (4.22) that inf (t,x) κ(t, x, a) → ∞ as |a| → ∞, i.e. that (4.21) holds for v = κ.
Step 2. Let us now assume for a contradiction that (4.20) does not hold for v = κ. We then infer the existence of a sequence ((t n , x n , a n )) n≥1 ⊂ D and a constant C 2 , such that d(x n , ∂Q t n ) → 0 as n → ∞, and κ(t n , x n , a n ) ≤ C 2 for all n ≥ 1.
By (4.22), for each n ≥ 1 there exists a control u n ∈ U such that (1/2) ∫ t n 0 f (π t n ,x n ,a n ,u n s , γ t n ,a n ,u n s , u n s ) ds − C 1 + g(π t n ,x n ,a n ,u n 0 , γ t n ,a n ,u n 0 ) < κ(t n , x n , a n ) + 1 ≤ C 2 + 1. (4.24) We have two possibilities: either there exists a subsequence (n j ) j≥1 such that (4.25) holds, or there does not. If such a subsequence does exist, then it follows from (4.23) that ∫ t n j 0 f (π t n j ,x n j ,a n j ,u n j s , γ t n j ,a n j ,u n j s , u n j s ) ds → ∞, contradicting (4.24) (since g is bounded below).
Step 3. If there does not exist a subsequence such that (4.25) holds, then it follows immediately that there exists a constant M > 0 such that ‖γ t n ,a n ,u n ‖ p/2,[0,t n ] ≤ ‖γ t n ,a n ,u n ‖ 1,[0,t n ] = ∫ t n 0 |u n s | ds ≤ M for every n ≥ 1.
We can then apply Corollary 2.4 to deduce the existence of a single constant C such that |π t n ,x n ,a n ,u n 0 − π t n ,z n ,a n ,u n 0 | ≤ ‖π t n ,x n ,a n ,u n − π t n ,z n ,a n ,u n ‖ p,[0,t n ] ≤ C|x n − z n | for any points z n ∈ Q t n .
Since terminal values of π on (or outside) the boundary ∂Q t n result in initial values outside the domain S m (by the definition of Q t n ), we may choose a terminal value z n close to x n , but also close enough to the boundary ∂Q t n to ensure that the initial value π t n ,z n ,a n ,u n 0 is arbitrarily close to the boundary ∂S m . More precisely, we choose the points (z n ) n≥1 such that |x n − z n | ≤ d(x n , ∂Q t n ), and such that the corresponding initial values satisfy d(π t n ,z n ,a n ,u n 0 , ∂S m ) → 0 as n → ∞. In particular, we then have that |π t n ,x n ,a n ,u n 0 − π t n ,z n ,a n ,u n 0 | ≤ C|x n − z n | ≤ Cd(x n , ∂Q t n ) −→ 0 as n −→ ∞.
Since d(π t n ,z n ,a n ,u n 0 , ∂S m ) → 0 as n → ∞, we can find a sequence (y n ) n≥1 ⊂ ∂S m such that |π t n ,z n ,a n ,u n 0 − y n | → 0 as n → ∞. Then d(π t n ,x n ,a n ,u n 0 , ∂S m ) ≤ |π t n ,x n ,a n ,u n 0 − y n | ≤ |π t n ,x n ,a n ,u n 0 − π t n ,z n ,a n ,u n 0 | + |π t n ,z n ,a n ,u n 0 − y n | −→ 0 as n → ∞, and hence, since g ∈ C ↑ (S m × R k ; R), we deduce that g(π t n ,x n ,a n ,u n 0 , γ t n ,a n ,u n 0 ) −→ ∞ as n −→ ∞, contradicting (4.24) (since f is bounded below).
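The variation bound used in Step 3, namely ‖γ‖ p/2,[0,t] ≤ ‖γ‖ 1,[0,t] = ∫ t 0 |u s | ds, can be checked numerically. The following sketch (the toy control path and all names are our own, purely for illustration) computes the q-variation of a sampled path, taking the supremum over partitions by dynamic programming, and verifies that for q ≥ 1 it is dominated by the 1-variation:

```python
import numpy as np

def var_q(path, q):
    """Sup over all partitions of (sum of |increments|^q)^(1/q), by dynamic programming."""
    n = len(path)
    best = np.zeros(n)  # best[j]: optimal sum of |increments|^q over partitions of [0, j]
    for j in range(1, n):
        best[j] = max(best[i] + abs(path[j] - path[i]) ** q for i in range(j))
    return best[-1] ** (1.0 / q)

rng = np.random.default_rng(0)
u = rng.standard_normal(200)                        # a toy control path u
dt = 0.01
gamma = np.concatenate([[0.0], np.cumsum(u) * dt])  # gamma_t = int_0^t u_s ds
one_var = np.sum(np.abs(np.diff(gamma)))            # 1-variation = int_0^t |u_s| ds here
q = 1.5                                             # plays the role of p/2 >= 1
assert var_q(gamma, q) <= one_var + 1e-12
```

The assertion holds for any path and any q ≥ 1: each partition's ℓ^q sum is dominated by its ℓ^1 sum, which in turn is dominated by the 1-variation over the finest partition.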

A smooth regularisation
Our main aim is to establish the function κ (recall (3.14) or (4.1)) as the solution of a rough HJ equation (namely (5.12) below). As in Diehl et al. [19], we first approximate the rough path Y by a smooth path η. Having solved the associated classical control problem, as is a standard strategy for rough ODEs and PDEs, we can define solutions to our HJ equation for genuinely rough driving paths by taking the closure of smooth paths in rough path topology. Accordingly, given a smooth path η : [0, T ] → R d , we define the approximate value function κ η as in (5.1). Recall Notations 3.9 and 4.10. Since in general the set of plausible posteriors Q t , and hence also the domain D on which the value function κ is finite, may depend on the observation path, we will correspondingly write Q η t := {x ∈ S m | κ η (t, x, a) < ∞} and D η for the associated domain, where, since η is smooth, the integral may be understood in the Riemann-Stieltjes sense. By simply replacing Y by η in the corresponding proofs, the approximate value function κ η inherits all the properties established in the previous section, namely:
• κ η is bounded below, and locally bounded above,
• κ η satisfies the dynamic programming principle, i.e. (4.9) with Y replaced by η.
Moreover, κ η is actually Lipschitz continuous in t, locally uniformly on D η . To see this, we recall the estimate (4.19) from the proof of Proposition 4.9, which, after replacing Y by η, yields a bound of the form (1 + ‖η̇‖ ∞ )‖η̇‖ ∞ |t − r|. Substituting this into (5.4), we deduce that κ η is Lipschitz in t.

A smooth HJ equation
We will return to the (rough) value function κ in Section 5.3 below. For now we will restrict our attention to the smoothed version κ η , as defined in (5.1), and introduce the associated HJ equation (5.5)-(5.6), for a smooth approximation η of Y, where as usual η̇ denotes the derivative of η.
Remark 5.1. Since the spatial variable x is confined to the simplex S m ⊂ R m , it may not seem meaningful to consider taking the gradient ∇ x . However, since the coefficients b and φ always remain directed within the simplex, the directional derivatives b · ∇ x and φ · ∇ x always exist.
In the following we consider solutions of (5.5)-(5.6) in the sense of viscosity solutions. The unfamiliar reader is referred to Barles [1] or Crandall, Ishii and Lions [16] for a detailed explanation.
Definition 5.2. We say that a continuous function v : D η → R is a viscosity subsolution (resp. supersolution) of (5.5)-(5.6) if v(0, x, a) ≤ (resp. ≥) g(x, a) for all (x, a) ∈ S m ×R k , and, for any point (t, x, a) ∈ D η with t ∈ (0, T ], for every smooth function ϕ : D η → R such that v − ϕ has a local maximum (resp. local minimum) at the point (t, x, a). We say that v is a viscosity solution if it is both a viscosity subsolution and a viscosity supersolution.
The proof of Proposition 5.3 is standard-see Theorem 2.5 in [34,Chapter 4] or Proposition 4.9 in [3] for details in analogous settings.
The result of Proposition 4.11 above shows that the (approximate) value function explodes near the boundary of the domain D η . In fact, this 'explosive boundary condition' is precisely the extra condition needed to obtain uniqueness for the corresponding HJ equation.
Theorem 5.4. The approximate value function κ η is both the minimal viscosity supersolution and the maximal viscosity subsolution of (5.5)-(5.6) in the class C ↑ (D η ; R), and is thus the unique viscosity solution of (5.5)-(5.6) in the class C ↑ (D η ; R).
Proof. We will prove the minimality of κ η among viscosity supersolutions of (5.5)-(5.6). The proof of maximality among subsolutions follows a similar argument, and is hence omitted for brevity.
Let v ∈ C ↑ (D η ; R) be another viscosity supersolution of (5.5)-(5.6), and let ∆ be a compact subset of D η . Recalling (4.5) and the fact that f is bounded below, we have ∫ t 0 f (π t,x,a,u s , γ t,a,u s , u s ) ds + ∫ t 0 ψ(π t,x,a,u s , γ t,a,u s ) dη s + g(π t,x,a,u 0 , γ t,a,u 0 ) ≥ g(π t,x,a,u 0 , γ t,a,u 0 ) − C for some constant C. Since κ η is locally bounded above (by Lemma 4.4), and hence bounded above on ∆, we infer that there exists a bound λ > 0 such that, for (t, x, a) ∈ ∆, we can restrict to controls u ∈ U such that g(π t,x,a,u 0 , γ t,a,u 0 ) < λ. (5.8) We define, for δ > 0, subdomains S m δ ⊂ S m and B δ ⊂ R k . As g ∈ C ↑ (S m × R k ; R) (by part (iii) of Assumption 4.2), there exists a δ > 0 sufficiently small such that g > λ on (S m × R k ) \ (S m δ × B δ ). In particular, by (5.8), we have that (π t,x,a,u 0 , γ t,a,u 0 ) ∈ S m δ × B δ whenever (t, x, a) ∈ ∆. Let G ∈ C ↑ (S m × R k ; R) be a locally Lipschitz function, bounded below, such that G ≤ g, with G = g on S m δ × B δ and G > λ on (S m × R k ) \ (S m δ × B δ ). We define a new value function K, similarly to κ η , but with initial cost function G. In particular, since G ≤ g, K is a viscosity subsolution of (5.5)-(5.6). Moreover, by Proposition 4.8, K is locally Lipschitz continuous in (x, a), uniformly in t ∈ [0, T ]. As noted above, we have that (π t,x,a,u 0 , γ t,a,u 0 ) ∈ S m δ × B δ for all (t, x, a) ∈ ∆, which remains true for our new initial cost G, since G > λ on (S m × R k ) \ (S m δ × B δ ). As G = g on S m δ × B δ , it follows that K = κ η on ∆. (5.9) The asymptotic growth rate of K(t, x, a) as d(x, ∂Q η t ) → 0, and as |a| → ∞, depends on the growth rate of the initial cost G. By taking G to grow sufficiently slowly, we can ensure that K is dominated asymptotically by v. More precisely, since v ∈ C ↑ (D η ; R), we can choose G so that there exists an ε > 0 for which (5.10) and (5.11) hold. Combining (5.10) and (5.11), we have that K ≤ v on the parabolic boundary of D η ε . By the standard comparison principle for viscosity (sub/super-)solutions-e.g.
Theorem 5.1 in Barles [1] 6 -applied with the subsolution K and the supersolution v, it follows that K ≤ v on D η ε . Since ∆ ⊂ D η ε , it then follows from (5.9) that κ η ≤ v on ∆. Since ∆ was arbitrary, we conclude that κ η ≤ v on the entire domain D η .

A rough HJ equation
Replacing the smooth path η by the rough path Y in (5.5)-(5.6), we obtain the rough HJ equation (5.12)-(5.13). We understand a solution to this equation in the sense of the following definition, sometimes known as a rough viscosity solution, used by Diehl et al. [19], as well as for instance by Caruana, Friz and Oberhauser [9,10,21] (see also Chapter 12 in Friz and Hairer [20]).
Definition 5.5. Given a smooth path η, we write η = (η, η (2) ) ∈ V 0,p g for its canonical lift, with η (2) defined as in (5.3). We write κ η for the unique viscosity solution of (5.5)-(5.6) in the class C ↑ (D η ; R), which by Theorem 5.4 is precisely the approximate value function, as defined in (5.1). We say that a continuous function v solves the rough HJ equation (5.12)-(5.13) if κ η n −→ v as n −→ ∞ (5.14) locally uniformly on D, whenever (η n ) n≥1 is a sequence of smooth paths such that η n → Y with respect to the p-variation rough path distance, i.e. η n ; Y p → 0 as n → ∞.
We note that if such a solution of (5.12)-(5.13) exists, then it is unique. Moreover, since the rough path Y ∈ V 0,p g is geometric, there certainly exists such a sequence of smooth paths (η n ) n≥1 .
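Definition 5.5 uses the canonical lift η = (η, η (2) ) of a smooth path, where η (2) collects the second-order iterated integrals. As a purely illustrative sketch (the particular path η(t) = (cos t, sin t), the quadrature rule and the tolerances are our own choices), one can compute η (2) numerically and verify Chen's identity, together with the symmetry (integration-by-parts) relation satisfied by geometric lifts:

```python
import numpy as np

def eta(t):
    """A smooth R^2-valued path; purely illustrative."""
    return np.vstack([np.cos(t), np.sin(t)])

def lift2(s, t, n=20000):
    """Approximate eta2[i, j] = int_s^t (eta_i(r) - eta_i(s)) d eta_j(r)."""
    r = np.linspace(s, t, n + 1)
    p = eta(r)
    centered = p - p[:, :1]                      # eta(r) - eta(s)
    avg = 0.5 * (centered[:, :-1] + centered[:, 1:])
    deta = np.diff(p, axis=1)                    # increments of eta
    return avg @ deta.T                          # 2x2 matrix of iterated integrals

s, t, u = 0.0, 0.7, 1.3
A, B, C = lift2(s, u), lift2(s, t), lift2(t, u)
# Chen's identity: eta2_{s,u} = eta2_{s,t} + eta2_{t,u} + (eta_t - eta_s) (x) (eta_u - eta_t)
cross = np.outer(eta(t)[:, 0] - eta(s)[:, 0], eta(u)[:, 0] - eta(t)[:, 0])
assert np.allclose(A, B + C + cross, atol=1e-6)
# Geometricity: the symmetric part of eta2 is half the square of the increment
delta = eta(u)[:, 0] - eta(s)[:, 0]
assert np.allclose(A + A.T, np.outer(delta, delta), atol=1e-6)
```

Both identities hold exactly for the true iterated integrals of a smooth path; the tolerances only absorb quadrature error.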
We can now state our main result.

6 The hypotheses of the theorem are satisfied since K is locally Lipschitz, and hence Lipschitz on the compact set D η ε . Strictly speaking the domain D η ε is not in general a simple Cartesian product of the spatial and temporal domains, as in the setting of Barles, but this is of no consequence, and requires only a trivial adaptation of the proof of the comparison principle.
Theorem 5.6. The value function κ, as defined in (3.14) and (4.1), solves the rough HJ equation (5.12)- (5.13) in the sense of Definition 5.5. Moreover, writing κ = κ Y , the map from V 0,p g → R given by Y → κ Y (t, x, a) is locally uniformly continuous with respect to the p-variation rough path distance, locally uniformly in (t, x, a).
Proof. We first note that κ is continuous by Propositions 4.8 and 4.9.
Let Z ∈ V p be another rough path such that Y; Z p ≤ 1. By possibly replacing L by L + 1, we may assume that |||Z||| p ≤ L. Let us write π t,x,a,u,Y (resp. π t,x,a,u,Z ) for the solution of the RDE (4.2) driven by Y (resp. Z), and write κ Y (resp. κ Z ) for the corresponding value function, as defined in (4.1), defined on the domain D = D Y (resp. D Z ).
Let ∆ be a compact subset of D Y ∩ D Z . By Corollary 4.5, we may restrict to controls u ∈ U M,K , so that ‖γ t,a,u ‖ p/2,[0,t] ≤ M for some M > 0, and there exists a compact subset K ⊂ S m × R k such that (π t,x,a,u,Y s , γ t,a,u s ) ∈ K and (π t,x,a,u,Z s , γ t,a,u s ) ∈ K for all (t, x, a) ∈ ∆ and s ∈ [0, t]. We then deduce from Corollary 2.4 the estimate (5.15), which bounds ‖π t,x,a,u,Y − π t,x,a,u,Z ‖ p,[0,t] in terms of the rough path distance between Y and Z. By part (i) of Assumption 4.2, as we have restricted the state trajectories to a compact set, we can take f and g to be Lipschitz in (x, a), uniformly in u. Then, for any (t, x, a) ∈ ∆, we obtain a corresponding estimate for |κ Y (t, x, a) − κ Z (t, x, a)|. Let (η n ) n≥1 be a sequence of smooth paths such that η n ; Y p → 0 as n → ∞. It follows from (5.15) that the domain D η n converges to D in the obvious sense as n → ∞. In particular, given a compact set ∆ ⊂ D, we then have that ∆ ⊂ D η n for all sufficiently large n. The required convergence in (5.14) then follows by taking Z = η n in the above. The stated continuity of the value function with respect to the driving rough path is also immediate from the above.

Remark 5.7. As shown in Theorem 5.6, the value function κ is continuous with respect to the enhanced observation path Y. Although this does not in general imply continuity of the minimum point of the value function, in typical situations where κ is convex and, in particular, unimodal (in the sense of having exactly one global minimum), this continuity with respect to Y will be inherited by the minimum point of κ, and hence by the most reasonable posterior and parameter value. Filters based on our approach are thus robust, both with respect to parameter uncertainty, and in the sense of continuity with respect to the (enhanced) observation path (cf. Crisan et al. [17]).
Numerical examples

Unknown rate matrix
As an example, let us take m = 2 and d = 1, so that the hidden signal X is a 2-state Markov chain taking values in X = {e 1 , e 2 }, and the observation process Y is 1-dimensional. We shall suppose that the observation vector h is known and constant, but that the rate matrix A depends on an unknown parameter λ, viz.
for some (known) ν, α > 0. In this case the dimension of the uncertain parameter is k = 1, and we adopt a smooth parametrisation of λ in terms of γ. The observation process Y has the corresponding dynamics. Our objective is to learn the unknown parameter λ t . We emphasize that here we do not assume any prior or any model for the dynamics of λ; we suppose only that it is known to lie in the interval (0, ν).
In this example we actually have Q t = S 2 ∼ = (0, 1) for all t ≥ 0, regardless of the realisation of Y [0,t] . To see this, note that for π 2,s ≃ 0 we have dπ 2,s ≃ λ s ds, and since λ takes values in (0, ν), we can take λ s sufficiently small (by choosing a suitable control u) to ensure that π 2,s > 0 for all s < t. Similarly, if π 2,s ≃ 1 we have dπ 2,s ≃ (λ s − ν) ds, and we can take λ s sufficiently close to ν to ensure that π 2,s < 1 for all s < t.
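The boundary argument above can be sanity-checked with a small simulation. Assuming a rate matrix with off-diagonal rates λ (for e 1 → e 2 ) and ν − λ (for e 2 → e 1 ), which is our reading, consistent with the boundary behaviour dπ 2,s ≃ λ s ds near 0 and ≃ (λ s − ν) ds near 1, the drift of π 2 is λ(1 − π 2 ) − (ν − λ)π 2 = λ − νπ 2 , which points inward at both boundaries for every λ ∈ (0, ν). The Euler sketch below illustrates this for the drift alone (in the standard Wonham filter the observation gain is proportional to π 2 (1 − π 2 ) and so vanishes at the boundary):

```python
# Euler scheme for the drift of pi_2 in the "unknown rate matrix" example.
# Assumed rates (our reading): e1 -> e2 at rate lam, e2 -> e1 at rate nu - lam.
nu, lam, dt, steps = 1.0, 0.3, 0.01, 2000

for start in (0.01, 0.99):
    pi2 = start
    for _ in range(steps):
        pi2 += (lam - nu * pi2) * dt   # drift b(pi2, lam) = lam - nu * pi2
        assert 0.0 < pi2 < 1.0         # the inward drift keeps pi2 inside (0, 1)
    assert abs(pi2 - lam / nu) < 1e-3  # relaxes to the equilibrium lam / nu
```

From either starting point the trajectory stays strictly inside (0, 1) and relaxes towards λ/ν, in line with the claim that no posterior in the interior is excluded.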
We solve the problem in the variables (q, γ), after a change of variables. We adopt a simple numerical approximation to solve the HJ equation (5.12)-(5.13). We first recall that the most reasonable posteriors and parameter values are attained at the minimum point of the value function κ. To obtain an efficient scheme, we linearise the controlled dynamics around this minimum point (in the variables (q, γ)) and, taking the penalty functions f , g to be quadratic, solve the resulting linear-quadratic optimization problem, linearising around the new minimum point after each time step.
We adopt the penalty functions (6.1)-(6.4), for some constants τ, δ, ε > 0. We take ν = 1, α = 1, τ = δ = 5 × 10 −2 , ε = 10 −3 , and simulate the signal X and observation process Y with the 'true' parameter path λ t . The learned value of γ t is given by (3.15), and the corresponding value of λ t is then obtained by reversing the above change of variables. The minimum point of the value function, which at each time t corresponds to the most reasonable parameter value λ t given the observations up to time t, is compared with the true value of λ t in Figure 1.
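The linearise-and-re-minimise loop described above can be caricatured as follows: with quadratic penalties, each step reduces to minimising a local quadratic model of the cost, i.e. a Newton step around the previous minimum point. The toy sketch below (the cost and its drifting minimum are invented purely for illustration; the actual scheme solves a linear-quadratic control problem at each step) tracks the minimum of a time-varying quadratic cost in this fashion:

```python
import numpy as np

def track_minimum(grad, hess, z0, times):
    """One Newton step per time step on a local quadratic model of the cost."""
    z = np.asarray(z0, dtype=float)
    out = []
    for t in times:
        z = z - np.linalg.solve(hess(t, z), grad(t, z))  # minimise the local model
        out.append(z.copy())
    return np.array(out)

# toy cost kappa(t, z) = |z - m(t)|^2 with a drifting minimum point m(t)
m = lambda t: np.array([np.sin(t), np.cos(t)])
grad = lambda t, z: 2 * (z - m(t))
hess = lambda t, z: 2 * np.eye(2)
times = np.linspace(0.0, 3.0, 301)
zs = track_minimum(grad, hess, np.zeros(2), times)
assert np.linalg.norm(zs[-1] - m(times[-1])) < 1e-8  # the tracker follows the minimum
```

For an exactly quadratic cost a single Newton step lands on the current minimum, which is why re-linearising after each time step suffices when the cost stays close to quadratic near its minimum.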

Unknown observation matrix
We now suppose that the rate matrix A is known, but that the observation matrix h depends on an unknown parameter α, viz.
for some (known) λ, µ > 0 and 0 ≤ ν 1 < ν 2 . We adopt a corresponding parametrisation, and the observation process has the associated dynamics. Note that if the jump rates λ, µ are sufficiently large then the filtering of the signal X becomes an intractable task, as the observation time between jumps is too short to detect individual jumps. However, even in this case the problem of learning the unknown parameter α remains. We stress again that we do not assume any prior or model for the dynamics of α, assuming only that it takes values in the interval (ν 1 , ν 2 ). We have dπ 2,s = (λ(1 − π 2,s ) − µπ 2,s ) ds + 2α s π 2,s (1 − π 2,s ) dY s =: b(π 2,s , γ s ) ds + φ(π 2,s , γ s ) dY s , where π = (π 1 , π 2 ) = (1 − π 2 , π 2 ). As in the previous example, we solve the problem in the variables (q, γ), after a change of variables, and we adopt the penalty functions (6.1)-(6.4). We take ν 1 = 0.2, ν 2 = 1.8, λ = µ = 5 × 10 −2 , τ = δ = 10 −2 , ε = 10 −3 , and simulate the signal X and observation process Y . The minimum point of the value function corresponds to the most reasonable value of α t , and is compared with the true parameter in Figure 2.
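The displayed dynamics for π 2 can be simulated directly with an Euler scheme. In the sketch below we assume (our reading) an observation vector h(α) = (−α, α), which matches the gain 2απ 2 (1 − π 2 ); the true parameter value, step size and seed are arbitrary choices, and for simplicity the filter is run at the true parameter:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = mu = 5e-2
alpha_true = 1.0          # hypothetical true parameter in (nu1, nu2) = (0.2, 1.8)
dt, n = 1e-3, 5000

state = 1                 # signal X in {1, 2}, i.e. {e1, e2}
pi2 = 0.5                 # filter estimate of P(X = e2)
for _ in range(n):
    # signal: jump 1 -> 2 with rate lam, 2 -> 1 with rate mu
    if state == 1 and rng.random() < lam * dt:
        state = 2
    elif state == 2 and rng.random() < mu * dt:
        state = 1
    h = alpha_true if state == 2 else -alpha_true     # assumed h(e1), h(e2)
    dY = h * dt + np.sqrt(dt) * rng.standard_normal() # observation increment
    # filter step, following the displayed dynamics for pi_2
    pi2 += (lam * (1 - pi2) - mu * pi2) * dt + 2 * alpha_true * pi2 * (1 - pi2) * dY
    assert 0.0 < pi2 < 1.0   # the gain vanishes at the boundary, so pi2 stays inside
```

Because the gain 2απ 2 (1 − π 2 ) vanishes at both endpoints, the Euler iterates remain in (0, 1) for any reasonable step size, mirroring the discussion of Q t in the previous example.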

Concluding remarks
Remark 7.1. Since our evaluation of the 'reasonability' of parameters is inherently nonprobabilistic, our framework can be immediately extended to include non-Markovian models. Suppose for instance that a parameter γ were, at each time t, a function of both t and of the entire paths of the signal X and observation process Y up to time t, i.e. γ = γ(t, X ·∧t , Y ·∧t ). Given a realisation ω ∈ Ω, this then defines a path t −→ γ t := γ(t, X ·∧t (ω), Y ·∧t (ω)) which may then be 'learned' like any other deterministic path.
Remark 7.2. In the current work we consider observations corrupted by Gaussian (Brownian) noise. More general observation noise with jumps could also be included in our theory, but would require further analysis. We expect that the recent results on càdlàg rough paths in Chevyrev and Friz [11] or Friz and Zhang [22] could provide a suitable basis for the corresponding pathwise formulation.
Remark 7.3. The setting of this paper is analogous to the dynamic-generator, DR-expectation framework of Cohen [13]. One may alternatively consider the static generator case, where the unknown parameters are assumed to be constant in time. In this case formal calculations suggest that the value function κ should satisfy an equation of the form dκ + (b · ∇ x κ − f ) dt + (φ · ∇ x κ − ψ) dY t = 0 with κ(0, ·, ·) = g, which in principle must be solved separately for each a ∈ R k . This appears to be practically inconvenient, and it remains an open problem to derive some finite approximation or alternative approach which yields a tractable solution.
On the other hand, at least formally, the constant-parameter case is recovered in the present framework by imposing an infinite cost on non-zero controls γ̇ = u.

A Rough path estimates
In this section we exhibit a direct approach to obtaining solutions to the RDE (2.6) in the controlled path setting of Gubinelli, based on the fixed point argument of Friz and Zhang [22].
The result then follows from taking the supremum over all possible partitions s 0 < s 1 < . . . < s N of the interval [0, T ].
Lemma A.2. Let φ ∈ C 3 b , γ ∈ C p/2-var , and let Y = (Y, Y) ∈ V p with |||Y||| p ≤ L for some L > 0. For any controlled path (X, X ′ ) ∈ V p Y , we have that ( ∫ · 0 φ(X r , γ r ) dY r , φ(X, γ)) ∈ V p Y , together with the associated estimates, where the constant C depends only on φ, p and L.
Writing ‖X − X̃‖ ∞ ≤ |X 0 − X̃ 0 | + ‖X − X̃‖ p and using (A.4), we then obtain a bound with some constant C 3 > 1/2 depending only on b, φ, p and L. Let δ = δ 3 := 2C 3 > 1. We take t = t 3 ≤ t 2 ∧ t̃ 2 sufficiently small such that the required smallness condition holds. Step 4. From now on we allow the multiplicative constant indicated by ≲ to also depend on ψ. We can divide the (arbitrary) time interval [0, T ] into a partition 0 = s 0 < s 1 < . . . <