Continuous-discrete smoothing of diffusions

Suppose X is a multivariate diffusion process that is observed discretely in time. At each observation time, a transformation of the state of the process is observed with noise. The smoothing problem consists of recovering the path of the process, consistent with the observations. We derive a novel Markov Chain Monte Carlo algorithm to sample from the exact smoothing distribution. The resulting algorithm is called the Backward Filtering Forward Guiding (BFFG) algorithm. We extend the algorithm to include parameter estimation. The proposed method relies on guided proposals introduced in Schauer et al. (2017). We illustrate its efficiency in a number of challenging problems.


Introduction
Suppose X is a diffusion process with dynamics governed by the stochastic differential equation (SDE) dX_t = b(t, X_t) dt + σ(t, X_t) dW_t (1.1) with prescribed initial density X_{t_0} ∼ π. Here b : R_+ × R^d → R^d and σ : R_+ × R^d → R^{d×d} are the drift and dispersion coefficient respectively. The process W is a vector-valued process in R^d consisting of independent Brownian motions. It is assumed that the required conditions for existence of a strong solution are satisfied (cf. Karatzas and Shreve (1991)). We assume observation times 0 = t_0 < t_1 < ... < t_n and observations V_0, ..., V_n, where V_i, conditional on X_{t_i} = x, has density k_i(x; ·) (1.2). This includes as a special case the linear observation scheme with additive Gaussian noise, k_i(x; v) = φ(v; L_i x, Σ_i) (1.3), where L_i is an m_i × d-matrix, Σ_i an m_i × m_i positive definite matrix and φ(x; µ, Σ) denotes the density of the N(µ, Σ)-distribution, evaluated at x. For t ≥ 0, denote the set of non-past observations by V_t, i.e. V_t = {V_i : t_i ≥ t}. With slight abuse of notation, set V_i = V_{t_i}. In this article we address two related problems: smoothing and inference. For the first one we assume that b, σ and {k_i, i = 0, ..., n} are known and aim to reconstruct the path {X_t, t ∈ [0, t_n]} based on V_0; this is given precise meaning as sampling from the conditional distribution of X on [0, t_n], conditional on V_0. Because we are dealing with "continuous dynamics-discrete observations", this setup is also referred to as continuous-discrete smoothing or hybrid smoothing. This is an interesting problem in its own right that arises naturally, for instance, in the statistical analysis of tracking systems (cf. Särkkä (2013)).
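Forward simulation of the dynamics (1.1) on a grid is typically done with the Euler-Maruyama scheme. A minimal Python sketch (the paper's own code is in Julia; the drift and dispersion below are an illustrative Ornstein-Uhlenbeck-type example, not from the paper):

```python
import numpy as np

def euler_maruyama(b, sigma, x0, t0, t1, n_steps, rng):
    """Simulate dX_t = b(t, X_t) dt + sigma(t, X_t) dW_t on [t0, t1]."""
    d = len(x0)
    dt = (t1 - t0) / n_steps
    path = np.empty((n_steps + 1, d))
    path[0] = x0
    t = t0
    for k in range(n_steps):
        dW = rng.normal(scale=np.sqrt(dt), size=d)  # Brownian increment
        path[k + 1] = path[k] + b(t, path[k]) * dt + sigma(t, path[k]) @ dW
        t += dt
    return path

# Illustrative example: drift pulling towards zero, constant dispersion.
rng = np.random.default_rng(1)
b = lambda t, x: -x
sigma = lambda t, x: 0.5 * np.eye(2)
path = euler_maruyama(b, sigma, np.array([1.0, -1.0]), 0.0, 1.0, 200, rng)
```

The same scheme, with the drift augmented by a guiding term, is used later to simulate the guided proposal.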
Additionally, smoothing constitutes a central component of Bayesian inference algorithms for discretely observed diffusions. If in (1.1) the coefficients of the diffusion X or the observation scheme {k_i, i = 0, ..., n} depend on some unknown parameters, inference can be performed by employing a data augmentation algorithm, where one iterates between (i) updating the parameter conditional on the smoothed path and (ii) continuous-discrete smoothing conditional on the value of the parameter (cf. Roberts and Stramer (2001)). The relevance of smoothing and parameter estimation for diffusions is illustrated by the problem reappearing in a wide variety of application fields; see for instance van Kampen (1981), Arnaudon et al. (2017), Apte et al. (2007), Cuenod et al. (2011) and Wilkinson (2006). There is extensive research on this topic; in Section A of the appendix we put this work further into the context of related research. The setup (1.2) includes the practically relevant case that a highly non-linear process X is only observed at the start time 0 and at a final time, say 1; see for instance Arnaudon et al. (2020) for an application in shape analysis.
1.1. Data augmentation for continuous-discrete smoothing. In this section we give intuition for the method we propose for sampling from the smoothing distribution. Assume for the moment that x_0 is known and denote x_{t_i} by x_i. Using Bayesian notation (writing π for a density), we have a recursive relation in which the first term on the right-hand side captures the dynamics of the conditional process. In the (non-Bayesian) notation used in the next subsection, π(V_i | x_i), the likelihood of x_i given the present and future observations V_i, is denoted by ρ(t_i, x_i). Up to a scaling factor, ρ can be related to a backward filtered marginal density of X. In a simplified view of our approach, we first compute the scaled filtering density backwards in time.
One can now show, using the arguments laid out in Appendix C, that the weight equals the exponential process of a change of measure, which reveals the changed drift b + σσ'∇ log ρ of the conditional (smoothed) process by Girsanov's theorem. Thus we may substitute sampling X_i | X_{i−1}, V_i by simulating a process with drift b + σσ'∇ log ρ, understood as data augmentation in the sense of Tanner and Wong (1987), and obtain a sample of the smoothing distribution of X_i (and in fact the entire smoothed segment {X_t, t_{i−1} ≤ t ≤ t_i}) in this way.
Finally, if ρ cannot be computed efficiently, we find a tractable substitute ρ̃ and correct for the difference through Monte Carlo methods: this is the technique of guided proposals, as introduced in the following section.
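In a discrete-state caricature, the two passes described above (backward computation of ρ, then forward sampling from the h-transformed dynamics) can be sketched as follows; the two-state chain, transition matrix and observation densities are hypothetical and serve only to make the recursion concrete:

```python
import numpy as np

# Hypothetical 2-state Markov chain observed with noise at times 0..n.
P = np.array([[0.9, 0.1], [0.2, 0.8]])   # transition matrix
E = np.array([[0.8, 0.2], [0.3, 0.7]])   # E[x, v] = k(v | state x)
obs = [0, 0, 1]
n = len(obs) - 1

# Backward pass: rho[i][x] = likelihood of observations i..n given state x.
rho = [None] * (n + 1)
rho[n] = E[:, obs[n]]
for i in range(n - 1, -1, -1):
    rho[i] = E[:, obs[i]] * (P @ rho[i + 1])

# Forward pass: sample from the exact smoothing distribution by
# h-transforming the forward kernel with rho (Doob's h-transform).
rng = np.random.default_rng(0)
prior = np.array([0.5, 0.5])
w = prior * rho[0]
x = rng.choice(2, p=w / w.sum())
path = [x]
for i in range(1, n + 1):
    w = P[x] * rho[i]
    x = rng.choice(2, p=w / w.sum())
    path.append(x)
```

The diffusion setting replaces the matrix-vector backward pass by backward ODEs for ρ̃ and the forward pass by simulation of a guided SDE.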
1.2. Approach. Conceptually, we follow closely the literature on guided proposals. We explain the main idea here; more technical details can be found in Appendix B.
Define ρ(0, x) = k_0(x; v_0)ρ(0+, x) (1.5). If X_{t_0} ∼ π•, with π•(x_0) = π(x_0)ρ(0, x_0) / ∫ π(x_0)ρ(0, x_0) dx_0 (1.6) being the smoothed (conditional) density of x_0, then a trajectory X constitutes a sample from the smoothing distribution. This expression was derived in Marchand (2012) (Theorem 2.3.4) in a special case of (1.3) (with Σ_j ≡ 0 for all j). In Appendix C we derive it using Doob's h-transform for the more general setting considered in this paper. As p is tractable only in very specific instances, so are ρ and π•. Hence, sampling X directly from (1.4) is generally not possible. The key idea in Schauer et al. (2017) (which addressed the simpler problem of diffusion bridge simulation) is to introduce the so-called guided proposal process, defined as the (strong) solution to the SDE (1.7), in which the guiding term is r̃(t, x) = ∇_x log ρ̃(t, x), where ρ̃(t, x) is defined in a similar manner as ρ(t, x), but with p and {k_i} replaced by tractable approximations p̃ and {k̃_i} respectively. Here p̃ can be taken to be the transition density of some auxiliary diffusion X̃ and {k̃_i} can be based on (1.3). The choice of X̃ and {k̃_i} impacts the quality with which the law of X• approximates the law of the target process. Intuitively, the drift and dispersion of X̃ should be chosen such that X̃ is similar to X in areas visited by the true conditional process. At the same time, the two functions need to be simple enough to make it possible to compute ρ̃(t, x) and r̃(t, x). Schauer et al. (2017) showed that the linear processes defined by the SDE dX̃_t = (β̃(t) + B̃(t)X̃_t) dt + σ̃(t) dW_t (1.8) give rise to a family of auxiliary diffusions that grant a generous degree of flexibility (as detailed in Section 3.2) for crafting suitable proposals X•. For instance, in (Schauer et al., 2017, Section 1.3) the authors showed how, in a setting of a highly non-linear target diffusion, a simple shift by a constant in β̃(t) may lead to a dramatic improvement of simulated proposal bridges X•, leading e.g. to lower rejection rates in Metropolis-Hastings algorithms.
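For intuition, in the scalar case with a Brownian motion as auxiliary process (σ̃ = σ constant) and a single noisy observation v = X_T + N(0, Σ), ρ̃(t, x) = φ(v; x, ã(T − t) + Σ) is available in closed form and the guided SDE can be simulated directly with Euler steps. A hedged sketch (not the paper's implementation, which uses the general ODE machinery of Section 2):

```python
import numpy as np

def simulate_guided(b, sigma, v, T, Sigma, x0, n, rng):
    """Euler scheme for a 1-d guided proposal towards one noisy observation
    v = X_T + N(0, Sigma), auxiliary process = Brownian motion, so that
    r~(t, x) = (v - x) / (a (T - t) + Sigma) with a = sigma^2."""
    dt = T / n
    a = sigma ** 2
    x = np.empty(n + 1)
    x[0] = x0
    for k in range(n):
        t = k * dt
        r = (v - x[k]) / (a * (T - t) + Sigma)     # guiding term
        x[k + 1] = x[k] + (b(t, x[k]) + a * r) * dt \
                   + sigma * rng.normal(scale=np.sqrt(dt))
    return x

# Pure Brownian motion (b = 0) guided towards v = 2 with tiny noise:
rng = np.random.default_rng(42)
xg = simulate_guided(lambda t, x: 0.0, 1.0, 2.0, 1.0, 1e-4, 0.0, 500, rng)
```

With Σ close to zero the path ends near v, recovering a diffusion bridge.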
Assume for now that x_0 is known and k_i = k̃_i, with k̃_i derived from observation scheme (1.3) and transition density p̃ derived from (1.8). We defer the general case to Section 4. Let P and P• denote the laws of the conditioned process and of X• on [0, t_n] (started at x_0) respectively. Under regularity conditions on the auxiliary process X̃ listed in Appendix B, it follows from Theorem 1 in Schauer et al. (2017) and Theorem 2.14 in Bierkens et al. (2020) that the likelihood ratio admits the closed form (1.9), involving the weight functional Ψ defined in (1.10). The factor in front of Ψ is written to indicate that in this expression ρ and ρ̃ are to be evaluated at time 0+, which is of importance when considering the case where x_0 is not known. Here, b̃ and σ̃ are the coefficients of the SDE defining X̃, ã(s) = σ̃(s)σ̃(s)', and H(s) is the negative Hessian matrix of x → log ρ̃(s, x) (which turns out not to depend on x).
Since all terms in (1.7) and (1.9) are in principle tractable, one may define an importance sampler that targets the smoothing law: proposals are drawn from X• (by forward simulating paths via some discretisation scheme for SDEs, say the Euler-Maruyama scheme) and importance weights are computed based on (1.9). As the SDE for X• is obtained from the SDE for X by superimposing a guiding term, the paths distributed according to the former law are called guided proposals. The process X• is typically used as a proposal (the origin of the terminology "guided proposal") in a "latent" path-space Metropolis-Hastings step, such as the non-centred path-space Crank-Nicolson scheme (cf. Cotter et al. (2013) and Beskos et al. (2008)).
However, our results can also be used in other Monte Carlo procedures, for example in an auxiliary particle filter. Essentially, our backward filtering step is an approximation to the backwards information filter. Various related "optimal policy finding" strategies for finding a good approximation have been proposed in the particle filtering literature, such as the iterated auxiliary particle filter (Guarniero et al. (2017)) and controlled SMC (Heng et al. (2020)). Guided proposals can also be used to create flexible variational classes for latent diffusion paths (cf. the recent work of Beskos et al. (2021)). In fact, guided proposals are agnostic to the choice of inferential algorithm, a point highlighted in the preprint van der Meulen and Schauer (2021).
Whereas the computations of r̃, H and ρ̃(0, x_0) for (1.8) are in principle tractable, it is not trivial to derive an efficient numerical scheme. The numerical methods presented in Schauer et al. (2017) are restricted to simpler cases and to diffusion bridge simulation on a single time segment. In this paper we derive an efficient, scalable scheme (both in the number of observations and in the dimension of the state space of the diffusion) using guided proposals for sampling high-dimensional, non-linear latent diffusion processes under the general observation scheme (1.2), and for estimating unknown parameters. This is illustrated in a number of challenging numerical examples.
1.3. Innovation. We derive simple systems of ordinary differential equations that can be used to compute all terms needed for the implementation of guided proposals, i.e. the terms r̃, H and ρ̃(0, x_0). These results imply that in order to sample paths of the proposal diffusion X• and embed them in the Metropolis-Hastings algorithm, only the following steps need to be performed:
• a system of ordinary differential equations, akin to the updating equations in hybrid Kalman filtering, has to be solved backwards in time (suitable systems are given in Theorems 2.4, 2.5 and 2.6);
• paths of X• need to be simulated forward in time, using standard simulation techniques based on stochastic Taylor expansions (cf. Kloeden and Platen (2013));
• the Radon-Nikodym derivative between the laws of X and X• has to be evaluated and then used in the acceptance probability of the Metropolis-Hastings algorithm.
In particular, in this scheme, forward simulating the guided proposal X• is comparable in computational effort to simulating X itself. This resolves objections about the (perceived) computational effort required to simulate guided proposals using a closed-form expression for ρ̃.
The systems of ODEs make use only of matrix addition, multiplication and inversion, and scale much better to high-dimensional problems than, for example, the approach of van der Meulen and Schauer (2017b). For the smoothing problem, the first step needs to be executed only once.
We call the resulting algorithm the Backward Filtering Forward Guiding (BFFG) algorithm because, as we will show, it combines backward filtering under the auxiliary process X̃ with forward simulation of the guiding process X•. It may also be characterised as a variant of the forward filtering-backward sampling algorithm for linear state space models, adapted to (possibly) non-linear continuous-time processes. Reversing the time direction and sampling the process forward in time is not strictly necessary, but turns out to be more convenient.
We believe the method derived in this paper has a number of attractive properties:
1. It is a computationally simple algorithm that provides a unified approach to smoothing of both hypo-elliptic and uniformly elliptic diffusions.
2. It allows for taking into account nonlinearities in the drift efficiently (by choice of the auxiliary process X̃).
3. It can deal with a large class of diffusions with state-dependent diffusion coefficient.
4. The algorithm targets the exact smoothing distribution and does not rely on Gaussian approximations.
5. It can deal with nonlinear observation schemes or non-Gaussian error densities while still sampling from the exact smoothing distribution.
Regarding exactness: we derive our algorithm in continuous time, but ultimately, in any implementation, the SDE for X• needs to be discretised. However, the mesh width for the discretisation can be controlled by the user.
1.4. Outline. In Section 2 we derive ODE systems for backward filtering, which result in efficient ways of computing r̃, H and ρ̃(0, x_0). In Section 3 we discuss strategies for choosing the auxiliary process X̃ and the auxiliary observation scheme to be used in backward filtering. The backward filtering forward guiding algorithm for diffusions is subsequently given in Section 4. In Section 5 we address applications of the proposed algorithms to problems in which the dimension of the state space of the diffusion is high. In such a setting, the backward filtering can, for example, be carried out using an ensemble Kalman filter. We illustrate our results on three challenging examples in Section 6. The appendix contains sections on related literature and a derivation of the guiding term for guided proposals, together with deferred proofs and remarks.

2 Backward filtering
To sample the guided proposal, we need to define ρ̃ as a proxy for ρ defined in (1.5). For that, we choose X̃ as in (1.8). For i = 0, ..., n, we assume k̃_i to correspond to the observation scheme (1.3), parametrised by L_i and Σ_i.
Definition 2.1. For a specified (strictly positive definite) covariance matrix P_{n+}, define k̃_{n+}(x) = φ(x; 0, P_{n+}). We define ρ̃ analogously to (1.5), with p replaced by p̃, k_i by k̃_i, and with the additional factor k̃_{n+}. Note the inclusion of k̃_{n+}, which does not appear in (1.5). It ensures regularisation, where we may intuitively think of P_{n+} = ε^{−1} I with ε > 0 small, though any choice of P_{n+} will give rise to a valid algorithm targeting the exact smoothing distribution.
Naturally, BFFG will work best if {k̃_i} and X̃ approximate {k_i} and X well, most importantly in those areas where the smoothing distribution puts its weight. We comment on good choices in Section 3. Theorems 2.4, 2.5 and 2.6 in this section give backward recursions for evaluating r̃, H and ρ̃, which are needed for the application of guided proposals. All proofs are in Section D of the appendix.
We impose
Assumption 2.2. For each i ∈ {0, ..., n}, the matrix Σ_i is strictly positive definite.
For i ∈ {1, ..., n}, let v(t) be defined as the concatenated vector of all non-past observations. In the following, we give three results characterising ρ̃: first directly as a marginal likelihood, and then in a form more suitable for implementation, as a (scaled) backward ODE for the filtering density in two parametrisations.
Theorem 2.4. Let the triplet (L(t), M†(t), µ(t)) be defined on each interval (t_{i−1}, t_i] (i = 1, ..., n) as solutions to the backward ODE systems (2.2), (2.3) and (2.4), with conditions imposed at the observation times t_i for i = 0, ..., n (2.7).
Note that the equations for M† and µ directly give the time derivative, so that the values of M†(t) and µ(t) can be found using a numerical quadrature rule such as, for example, the trapezoid rule. This theorem shows that H and r̃ can be computed by solving the systems (2.2), (2.3) and (2.4). Additionally, by computing the term |M(t)|, we may evaluate ρ̃ as well. In Theorem 2.5 below we will see that computation of this determinant can be avoided by solving another (scalar-valued) backward ODE.
The dimensions of the equations in Theorem 2.4 are larger on (t_{i−1}, t_i] than on (t_{j−1}, t_j] for all j > i. The dimension on the segment closest to zero (i.e. the leftmost segment) grows rapidly with a large number of observations. It turns out that other sets of backward differential equations can be derived whose dimensions stay constant, determined by the dimension of the state space of the diffusion, no matter the number of observations (2.8). At observation times we have, for i = 0, ..., n, the update (2.9), to be initialised as indicated. Furthermore, if we define P(t_n+) = P_{n+} and ν(t_n+) = 0, then for i = 0, ..., n the updates take the form (2.13), from which it is seen that Assumption 2.3 can be weakened to invertibility of Σ_i + L_i P(t_i+)L_i'.
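To make the covariance parametrisation of these recursions concrete: between observations one integrates ODEs for (ν(t), P(t)) backwards in time, and at each observation time one applies a Kalman-type update whose innovation covariance is the matrix Σ_i + L_i P(t_i+)L_i' appearing above. The sketch below is our own simplified Python rendering (explicit Euler steps, time-invariant coefficients), not a verbatim transcription of the theorem's ODE systems:

```python
import numpy as np

def kalman_update(nu, P, v, L, Sig):
    # Condition the backward filter on an observation v = L x + N(0, Sig);
    # S = Sig + L P L' is the matrix whose invertibility is required.
    S = Sig + L @ P @ L.T
    K = P @ L.T @ np.linalg.inv(S)
    return nu + K @ (v - L @ nu), P - K @ L @ P

def integrate_back(nu, P, B, beta, atil, dt_total, n_sub=100):
    # Explicit Euler steps, backwards in time, for the linear ODEs
    # dnu/dt = B nu + beta and dP/dt = B P + P B' - atil (sign
    # conventions are our own simplification).
    h = dt_total / n_sub
    for _ in range(n_sub):
        nu = nu - h * (B @ nu + beta)
        P = P - h * (B @ P + P @ B.T - atil)
    return nu, P

# One observation segment: update at the observation, then integrate back.
nu, P = kalman_update(np.zeros(2), np.eye(2), np.array([2.0, 2.0]),
                      np.eye(2), np.eye(2))
nu, P = integrate_back(nu, P, np.zeros((2, 2)), np.zeros(2), np.eye(2), 1.0)
```

With B̃ = 0 the backward pass simply inflates P at rate ã, reflecting the growing uncertainty further away from the observation.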
In case σ̃ and B̃ are constant, P(t) can be computed directly. Hence there is a closed-form expression for the distribution of X̃_i | (V_i, V_{i+1}), which gives ν(t_i) and P(t_i).
3 Choice of the auxiliary process and auxiliary observation scheme
Application of the tractable backward filtering updates requires choosing ({L̃_i}, {Σ̃_i}) for the observation scheme and t → (β̃(t), B̃(t), σ̃(t)) for the auxiliary process X̃.
3.1. Choice of the auxiliary observation scheme. The backward filtering formulas in Section 2 are used with an auxiliary observation scheme of the form implied by (1.3), leaving open the choice of (L̃_i, Σ̃_i) in case the actual observation scheme is of the more general form (1.2).
The simplest thing to do is to choose L̃_i and Σ̃_i such that φ(v_i; L̃_i X_{t_i}, Σ̃_i) ≈ k_i(X_{t_i}; v_i). Ultimately, the best choice of (L̃_i, Σ̃_i) is the one for which the discrepancy between P• and the smoothing law is minimal (for example, as measured by the (reverse) Kullback-Leibler divergence).
Example 3.1. Assume that, instead of (1.3), we have V_i = g_i(X_{t_i}) + η_i with g_i a nonlinear map. Consider the linearisation g_i(x) ≈ ζ_i + G_i x (with ν as in the covariance filter), where G_i is the Jacobian of g_i evaluated at ν(t_i) and ζ_i = g_i(ν(t_i)) − G_i ν(t_i). Under this linearisation, backward filtering can be carried out exactly as in our original setting by assuming to observe V_i − ζ_i instead of V_i and setting L̃_i equal to G_i. This choice of auxiliary scheme is akin to the extended Kalman filter (Algorithm 5.4 in Särkkä (2013)).
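A sketch of this linearisation step in Python; the nonlinear observation map g below is hypothetical, and the Jacobian is computed by central differences purely for illustration (in practice it may be available analytically):

```python
import numpy as np

def linearise_observation(g, nu, eps=1e-6):
    """Return (G, zeta) with g(x) ≈ G x + zeta, linearised at nu:
    G is a central-difference Jacobian and zeta = g(nu) - G nu."""
    nu = np.asarray(nu, dtype=float)
    g0 = np.atleast_1d(np.asarray(g(nu), dtype=float))
    G = np.empty((g0.size, nu.size))
    for j in range(nu.size):
        e = np.zeros(nu.size)
        e[j] = eps
        G[:, j] = (np.atleast_1d(g(nu + e)) - np.atleast_1d(g(nu - e))) / (2 * eps)
    return G, g0 - G @ nu

# Hypothetical nonlinear observation of a 2-d state.
g = lambda x: np.array([x[0] ** 2 + x[1]])
G, zeta = linearise_observation(g, np.array([1.0, 0.5]))
```

Backward filtering then proceeds with L̃_i = G and the shifted observation v_i − ζ_i.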
We refer to Chapters 5 and 6 in Särkkä (2013) for other approximations proposed in the literature that can be used as auxiliary observation scheme.
3.2. Choice of the auxiliary process X̃. The auxiliary process X̃ needs to be chosen by the user: the parameters β̃(t), B̃(t) and σ̃(t) need to satisfy matching and regularity conditions, but are otherwise free. A number of practical ways of choosing them are discussed in van der Meulen and Schauer (2017a) (Section 4.4); below, we list a number of reasonable options. Determining an optimal auxiliary law in an automated fashion deserves its own study and is the subject of ongoing research (but cf. also the considerations in Schauer et al. (2017)).
A Waive the freedom and simply take σ̃ constant, B̃(t) ≡ 0 and β̃ = 0. If X is either elliptic or hypo-elliptic with nonzero noise on each coordinate, this yields a valid algorithm. It still takes local nonlinearity into account through the presence of b in (1.7) and, in case of partial observations, is informed by multiple future observations.
B Take the auxiliary process obtained by linearising the drift at the next observation time, with B̃(t) determined by J_b(t_i, x(t_i)), where J_b denotes the Jacobian matrix of b. In case the diffusion is fully observed, x(t_i) is taken equal to X_{t_i}; otherwise it is specified by a user-chosen guess. The guess can be adaptively refined by using empirical averages of the imputed path. Otherwise put, the auxiliary process comes from linearisations of the target diffusion at the times of subsequent observations.
C The aim is to employ a first-order Taylor expansion of b around the imputed path. Of course, x(t) is unknown, but information becomes available during MCMC iterations and thus, just as in B, empirical averages of the imputed path can be used as its proxy. More specifically, we propose to first use the auxiliary process constructed by one of the preceding methods to get x, then run the algorithm for k iterations and compute the average of these k paths. Denote this average by {X̄^(0)(t)}. Next, starting at i = 1, repeat the following steps until the prescribed total number of iterations for adaptation has been reached.
(a) Set B̃^(i)(t, x) by linearising b around {X̄^(i−1)(t)}.
(b) Recompute H(t), F(t) and c(t) on [0, t_n] based on B̃^(i)(t, x).
(c) Perform k MCMC iterations based on H(t), F(t) and c(t). Compute the average of all simulated paths to obtain {X̄^(i)(t)}.
To avoid common problems with adaptive schemes, we stop the adaptation after a fixed number of steps.
D Choose σ̃ = σ, B̃(t) ≡ 0 and take β̃ nonzero, to make the pulling term itself take into account the nonlinearity of the system. To determine β̃, first obtain F(t_n+) and H(t_n+) as in (2.10). Next, proceed for i = n down to 1 using the formulas of the information filter (Theorem 2.5).
E Use a linear combination of the schemes above. This is a risk-averse, robust strategy, which aims to guard against unexpected artifacts that some of the adaptive strategies above might exhibit (say, for strategy C such behaviour could appear as a result of multimodality on path space). In Section 6 we use b̃ = w b̃^(B) + (1 − w) b̃^(C), where b̃^(B) and b̃^(C) are defined by strategies B and C respectively and w ∈ [0, 1] is a user-chosen weight. Another option (not considered here) consists of sampling w in each iteration from a Bernoulli distribution with fixed success probability.
F Finally, the auxiliary process may take values in a higher-dimensional space, with only its first d coordinates used in defining the proposal diffusion. Say we want to target a process X conditioned on the observations V_i = LX_{t_i} + η_i. Consider an auxiliary process [X̃, Y] with X̃ solving dX̃_t = (B̃X̃_t + β̃(t) + CY_t) dt + σ̃ dW_t and Y a second, independent linear process dY_t = DY_t dt + dW^Y_t, driven by an independent Brownian motion W^Y. Here B̃, C and D are matrices of suitable dimension. Then we may use an augmented sampler that targets the joint process [X_t, Y_t] conditional on the observations V_i. The augmented observation operator is given by L_aug = [L 0], the augmented drift for the target process is b_aug(t, [x, y]) = [b(t, x), Dy] and the block-diagonal dispersion coefficient is [σ 0; 0 I]; the sampler uses the augmented guided proposal with the auxiliary linear process [X̃, Y] that has drift b̃_aug(t, [x, y]) = [B̃x + Cy, Dy] and block-diagonal dispersion coefficient [σ̃ 0; 0 I]. Samples of the conditional process X are obtained by marginalisation. To give a simple example, X̃ can be taken to be a sum of a Brownian motion and an integrated Brownian motion.
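Strategies B and C both rest on a first-order Taylor expansion of the drift around a reference point. A sketch of that linearisation step; the FitzHugh-Nagumo-like drift below is a hypothetical example, and the central-difference Jacobian stands in for an analytical one:

```python
import numpy as np

def linearise_drift(b, t, xbar, eps=1e-6):
    """Return (Btil, betatil) with b(t, x) ≈ Btil x + betatil,
    linearised at the point xbar (central-difference Jacobian)."""
    xbar = np.asarray(xbar, dtype=float)
    d = xbar.size
    b0 = np.asarray(b(t, xbar), dtype=float)
    J = np.empty((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (np.asarray(b(t, xbar + e)) - np.asarray(b(t, xbar - e))) / (2 * eps)
    # b(t, x) ≈ b0 + J (x - xbar) = J x + (b0 - J xbar)
    return J, b0 - J @ xbar

# Hypothetical FitzHugh-Nagumo-like drift, linearised at the origin.
b = lambda t, x: np.array([x[0] - x[0] ** 3 - x[1], 0.5 * (x[0] - x[1])])
Btil, betatil = linearise_drift(b, 0.0, np.array([0.0, 0.0]))
```

The pair (Btil, betatil) then supplies B̃(t) and β̃(t) on the corresponding segment.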

4 Backward Filtering Forward Guiding for diffusions
In this section we present a novel Gibbs sampling algorithm for joint continuous-discrete smoothing of diffusions and parameter estimation. We first derive the posterior distribution that we target in Section 4.1. Details on the initialisation of the algorithms are presented in Section 4.5, whereas steps for updating the path, the initial state and the parameters are given in Sections 4.2, 4.3 and 4.4 respectively.
4.1. Likelihood calculations. The law P• of the guided proposal approximates the smoothing law with a tractable likelihood ratio. In the simple setting of a known starting point x_0 and k_i = k̃_i, with k̃_i derived from observation scheme (1.3), we have specified this ratio in (1.9). Dropping the requirement that k_i = k̃_i, the identity extends to bounded measurable test functions g (4.1); this follows from (1.9) upon correcting for the discrepancy between k̃_i and k_i. Now we assume that b, σ, p̃, {k̃_i}, {k_i} and π(x_0) depend on an unknown parameter θ with prior κ(θ). We add a subscript θ when we wish to highlight this dependency.
A naive approach for jointly sampling the latent path X and the parameter θ would consist of data augmentation, where one iteratively updates the latent path X, proposing from X• conditional on the parameter θ, and θ conditional on X. It is well known that this yields a reducible algorithm if unknown parameters appear in the diffusivity σ (cf. Roberts and Stramer (2001)). The key idea put forward in Golightly and Wilkinson (2008) (see also van der Meulen and Schauer (2017a) and Papaspiliopoulos et al. (2013)) is to use a noncentred parametrisation in which the law of X• is cast as a pushforward of Wiener measure. It is this approach that we adopt here as well: as we assume existence of a strong solution to the SDE (1.7), there exists a measurable map GP_θ such that X• = GP_θ(x_0, Z), where Z is a Wiener process in R^d. The process Z is referred to as the innovation process.
Below, we define Algorithms 4.1, 4.3 and 4.4. These are to be combined in a Gibbs sampler to target the joint posterior distribution of (θ, x_0, Z), specified by a proper density (4.2) with respect to the product of Lebesgue measure on (θ, x_0) and d-dimensional Wiener measure on Z. In particular, samples of the latent path are obtained from samples (θ, x_0, Z) through X = GP_θ(x_0, Z).
To see why (4.2) is true, note that if g is a bounded measurable function then, with W denoting Wiener measure and X_0, Θ random variables drawn from the prior, the resulting expression is the density of (x_0, θ) conditional on V_0. Finally, use ρ_θ(0, x_0) = k_0(x_0; v_0)ρ_θ(0+, x_0) and likewise for k̃_0 and ρ̃_θ, multiply both the numerator and denominator by k̃_0(x_0; v_0), and "absorb" the ratio k_0(x_0; v_0)/k̃_0(x_0; v_0) into the product by starting at i = 0 rather than i = 1.
4.2. Smoothing from known initial state and parameter. In this conditional update step we assume that the initial state and parameter are either known or conditioned upon. This means sampling from the smoothing distribution, which requires the likelihood ratio as specified in (4.1). Recall the definition of Ψ in (1.10). For this step, we choose a tuning parameter λ ∈ [0, 1).
Algorithm 4.1. Smoothing step for fixed θ and fixed initial state x_0. Suppose the current iterate is (θ, x_0, Z) and X = GP_θ(x_0, Z).
1. For i = n to 0: for t ∈ (t_i, t_{i+1}], backwards solve the ordinary differential equations (2.8).
2. (a) Sample independently a Wiener process W and set Z• = λZ + √(1 − λ²) W, X• = GP_θ(x_0, Z•).
(b) Accept (Z•, X•) with probability 1 ∧ Ψ(X•)/Ψ(X); otherwise retain (Z, X).
Repeating step 2 multiple times constitutes a procedure to sample from the smoothing distribution, in which case step 1 only needs to be done once. We have used the information filter from Theorem 2.5. Clearly, with straightforward modifications, it can be adjusted to the other filtering equations from Section 2.
The ODEs for H(t) and F(t) are solved on t ∈ [t_0, t_n] using time discretisation. Once these have been calculated and stored (on a grid of time points), the remainder of the algorithm consists of preconditioned Crank-Nicolson steps on the Wiener increments. After "burn in", the sample paths generated by this algorithm are from the smoothing distribution. Note that, since we assume the initial state x_0 to be known and the parameter θ to be fixed, c(t) does not, in fact, need to be computed. As we will shortly extend the presented algorithm to include uncertainty over the initial state and parameter estimation, we have already included the ODE and the update formula for c(t).
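The preconditioned Crank-Nicolson step on the increments can be sketched as follows. Here `log_psi` stands in for the log-weight log Ψ of the induced guided path (in practice obtained by forward-solving the SDE for X•); the toy run uses log_psi ≡ 0, for which the target is Wiener measure itself and every proposal is accepted:

```python
import numpy as np

def pcn_step(Z, log_psi, lam, rng):
    """One preconditioned Crank-Nicolson update on the innovations Z with
    memory parameter lam in [0, 1); Metropolis-Hastings acceptance uses
    the ratio of the path weights Psi (abstracted here as log_psi)."""
    W = rng.normal(size=Z.shape)
    Zp = lam * Z + np.sqrt(1.0 - lam ** 2) * W
    if np.log(rng.uniform()) < log_psi(Zp) - log_psi(Z):
        return Zp, True
    return Z, False

rng = np.random.default_rng(7)
Z = np.zeros(50)
acc = 0
draws = []
for _ in range(2000):
    Z, ok = pcn_step(Z, lambda z: 0.0, 0.7, rng)
    acc += ok
    draws.append(Z.copy())
draws = np.array(draws[200:])   # discard burn-in
```

A key property of pCN is that the proposal preserves the Wiener reference measure for every λ, so only the Ψ-ratio enters the acceptance probability.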
While Algorithm 4.1 is theoretically valid for any fixed value of λ, its efficiency strongly depends on the particular choice of this parameter. Instead of a fixed value of λ, we can choose it randomly at each iteration of step (3a): the acceptance probability in step (3b) is not affected by its value (cf. the discussion after Algorithm 1 in van der Meulen and Schauer (2017a)).
Remark 4.2. If the measurement error is close to zero, then the (Euler) discretisation of the guided proposal is a delicate matter, due to the behaviour of the guiding term just prior to observation times. As discussed in Section 5 of van der Meulen and Schauer (2017a), a time change and scaling of the process can alleviate discretisation errors. For completeness, we present this approach in Appendix E.
4.3. Initial state updating. We choose to update the initial state conditional on (Z, θ). The expression for the acceptance probability follows from the target density in (4.2). We choose a Markov kernel q (which may depend on θ) for proposing a new state x•_0.
An independence sampler can be obtained by taking q(x•_0 | x_0) = π̄(x•_0) for a specified nonnegative density π̄. Naturally, in case π is itself Gaussian, an obvious choice is to take π̄ = π.
4.4. Parameter updating. Choose a Markov kernel q for proposing a new value of θ.
1. Propose θ• from the proposal kernel q.
2. If (H, F, c) (defining the guiding term) depend on the parameter, then recompute the guiding term with parameter θ• using step 1 of Algorithm 4.1.
The following Gaussian conjugacy result is sometimes useful; its proof is an elementary consequence of Girsanov's theorem.
Proposition 4.5. Assume a diffusion X with drift of a form that is linear in the parameter, where both the diffusion coefficient and the initial distribution do not depend on it.
4.5. Initialisation. While any value of (θ, x_0) within the support of the prior is possible, we propose to choose an initial value θ and subsequently sample x_0 ∼ π̄ for a specified nonnegative π̄. Next, we sample a Wiener process Z on [0, t_n]. The corresponding guided proposal is obtained from X• = GP_θ(x_0, Z), from which X is initialised.
4.6. Blocking strategy. In this article, for comparisons of the effects of blocking, we always adopt a "chequerboard" updating strategy. More precisely, we say that we use "blocks of length k" (for an even k) to mean that the following strategy is adopted for updating the path:
1. Initialise X^(0:n).

Sample the last segment
Naturally, the algorithm above describes only smoothing. In order to transform it into an inference algorithm, a parameter update step as described in Algorithm 4.4 may be introduced in between steps 3 and 4, or after step 6, or both.
5 Continuous-discrete smoothing for high dimensional diffusion processes
For high-dimensional SDEs with d ≫ 1, the computation of the d × d backward filtered covariance P(t) in (2.11) becomes prohibitive in an implementation in dense matrix algebra (and likewise for the equations for its inverse). Another numerical difficulty is the computation of the likelihood accounting for the difference between a and ã in case these are unequal. In this section we do not consider the latter problem and therefore restrict our attention to the high-dimensional situation where a = ã. In this case the linear filtering step, and not the sampling step, is the numerical bottleneck. Specifically, if the process X and the observations are of the form dX_t = (B(t)X_t + F(t, X_t)) dt + σ_t dW_t, X_0 ∼ p(x_0), then a natural choice is to take B̃(t) = B(t) and σ̃(t) = σ(t).
Even if B and σ are sparse, P (t) typically is not.But if those operators that describe the drift and dispersion of X are local, then one may be able to approximate P by a sparse matrix.
Instead of stating a formal definition, we provide an illustrative example, which can be seen as a variant of the stochastic heat equation on a graph with drift. Consider the d-dimensional diffusion (X_t) for which a state x is understood as a discrete approximation to a function f : [0, 1] → R, so that, in particular, a coordinate X^(i) corresponds to an evaluation of f. More concretely, let X be defined as a solution of the corresponding SDE with σ, c > 0 and Λ the graph Laplacian of the linear graph with d vertices corresponding to the coordinates. Λ acts as a discrete approximation to the second derivative. Λ is a local operator in the sense of having a representation by a banded matrix (this corresponds to all interactions in the linear graph being limited to only those between neighbours). As a result, matrix-vector operations (matvecs) Λx can be computed efficiently in O(d) time. X is a Gauss-Markov process with stationary distribution N(0, (Λ + cI)^{−1}), as Σ = (Λ + cI)^{−1} solves the continuous-time Lyapunov equation BΣ + ΣB' + C = 0. Thus, in the stationary regime, X(t) at a fixed time t is a draw from a high-dimensional Gaussian law approximating a spatial Gaussian process on [0, 1] with continuous realisations of Hölder regularity just below 1/2. While (Λ + cI)^{−1} is not a sparse matrix, its entries decay exponentially moving away from the diagonal. By Proposition 2.7, the same holds for P(t). Good, sparse approximations are obtained by setting (Λ + cI)^{−1} or P(t) to zero outside of a fixed band.
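The off-diagonal decay and the quality of a band truncation are easy to check numerically. A Python sketch of the example above (the half-bandwidth w = 10 is an illustrative choice):

```python
import numpy as np

d, c = 100, 1.0
# Graph Laplacian of the linear (path) graph on d vertices.
Lam = 2.0 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
Lam[0, 0] = Lam[-1, -1] = 1.0
Sigma = np.linalg.inv(Lam + c * np.eye(d))

# Entries of the inverse decay roughly geometrically off the diagonal.
mid = d // 2
row = np.abs(Sigma[mid])

# Banded approximation: zero out everything beyond half-bandwidth w.
w = 10
dist = np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
band = np.where(dist <= w, Sigma, 0.0)
rel_err = np.linalg.norm(Sigma - band) / np.linalg.norm(Sigma)
```

Already for a modest bandwidth the relative (Frobenius) error of the truncation is negligible, which is what makes sparse storage of P(t) viable.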
In contrast, the forward simulation and the likelihood evaluation happen in R^d and are fast, as long as fast matvec operations with the operator H are available. This implies that the actual bottleneck of our approach is solving a continuous-discrete linear filtering problem, and here we can rely on a long line of work in this direction.
From the various possibilities we consider two: a sampling approach related to the ensemble Kalman filter and a sparsity-enforcing solver for (2.12).
5.1. Ensemble backward filter. Instead of solving (2.12), one can exploit that ν and P are the mean and covariance of a linear process run on a reversed time axis. In its SDE, dW_s can be taken to be an Itô integral; as σ̃ does not depend on space, there is no need for an Itô correction.
This can be used to obtain estimates of ν(s) and a low-rank approximation of P(s) by sampling K trajectories X^(k), with K ≪ d, and approximating P(T − s) by the empirical covariance of the ensemble. To obtain a positive definite (and invertible) estimate of P, one employs localisation and sets P̂ = Σ̂ ∘ Λ, where ∘ is the Hadamard (elementwise) product and Λ is a positive definite sparse localisation matrix (not to be confused with the graph Laplacian above) chosen to ignore long-range dependencies; see Houtekamer and Mitchell (2001). The observations are taken into account by way of the usual ensemble Kalman filter update of the ensemble members X^(k)_{T−t_i}; see Evensen (2003).
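A minimal sketch of the localisation step in Python, using a synthetic ensemble and a simple triangular taper in place of the Gaspari–Cohn tapers common in the ensemble Kalman filter literature (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 100, 20  # state dimension and ensemble size, K << d

# Hypothetical ensemble of K backward samples X^(k) in R^d.
X = rng.standard_normal((K, d))
Xc = X - X.mean(axis=0)
P_hat = Xc.T @ Xc / (K - 1)  # rank <= K - 1, hence not invertible

# Localisation: Hadamard product with a sparse positive semi-definite
# taper that zeroes out long-range dependencies (bandwidth r).
r = 5
dist = np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
taper = np.clip(1.0 - dist / r, 0.0, None)  # banded triangular taper
P_loc = P_hat * taper  # Schur product of positive semi-definite matrices
```

By the Schur product theorem the localised estimate remains positive semi-definite, and in practice the tapering also restores full rank.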
5.2. Forced sparsity. Secondly, we consider sparsity-enforcing backward integrators, in which after each backward integration step all entries of the iterate of P below a small threshold are set to zero. If P is well approximated by a sparse matrix, as illustrated for the case (5.2), then the thresholded iterate remains sparse. We explore this approach in Section 6.3.
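A sketch of one such sparsity-enforcing backward Euler step in Python/SciPy (the paper's implementation uses Julia's native CSC sparse matrices; the Lyapunov-type right-hand side, the step size h and the drop threshold eps are illustrative):

```python
import numpy as np
from scipy import sparse

def drop(P, eps=1e-8):
    """Zero all entries of P below eps in absolute value and re-sparsify."""
    P = sparse.csc_matrix(P, copy=True)
    P.data[np.abs(P.data) < eps] = 0.0
    P.eliminate_zeros()
    return P

def backward_step(P, B, a, h, eps=1e-8):
    """One Euler step, backwards in time, for dP/dt = B P + P B' + a,
    followed by the drop operation that keeps the iterate sparse."""
    dP = B @ P + P @ B.T + a
    return drop(P - h * dP, eps)
```

When P is well approximated by a banded matrix, the dropped iterates stay banded, so each iterate needs storage proportional to the bandwidth times d rather than d².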
Remark 5.1. A particularly relevant case of (5.1) is a process X obtained from a spatial discretisation of a stochastic partial differential equation taking solutions in a Hilbert space H, where A is a densely defined, positive definite operator and W is a Q-Wiener process with covariance operator Q given by a positive definite trace-class operator; see Da Prato and Zabczyk (2014).

6 Examples
In all examples, the differential equations for H(t), F(t) and c(t) have been solved using the 7th order Runge–Kutta solver of Verner (1978). The guided proposal is obtained by solving its SDE using Euler discretisation. The code is available in the folders scripts/papers of the package BridgeSDEInference.jl (Mider et al. (2019)) and scripts of the package BridgeSPDE.jl. Both packages build upon Schauer and contributors (2017) and are written in the programming language Julia (Bezanson et al. (2017)).
6.1. Lorenz system. The Lorenz system is notable for having chaotic solutions for certain parameter values and initial conditions. It is described by an SDE whose drift b is the classical Lorenz vector field and whose dispersion coefficient is constant.

Data: We initialised the process at x_0 = (1.5, −1.5, 25) and simulated it on [0, 2] with mesh-width 2·10^{-4} and parameters θ = (10, 28, 8/3), σ_0 = 3. Only the 2nd and 3rd coordinates were observed, so that L_i is a 2 × 3 matrix with first row [0, 1, 0] and second row [0, 0, 1]. We then defined two datasets. The procedures inferred the parameter θ, whereas the parameter σ_0 was assumed known. All four parameters could have been inferred simultaneously; however, fixing σ_0 made it possible to illustrate the differences in performance of the various algorithms more clearly. Indeed, a varying σ_0 locally influences the speed at which the chains on path space, as well as the parameter chains, mix, and this makes it more difficult to disentangle the contribution of the efficiency of the algorithm from that of a locally elevated or decreased volatility. We illustrate the problem of inference for the volatility coefficient on a more challenging example in the following section.
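The data-generation step can be sketched in Python as follows, assuming the classical Lorenz-63 drift with additive noise σ_0 I (the paper's implementation is in Julia; variable names are illustrative):

```python
import numpy as np

def lorenz_drift(x, th):
    # Classical Lorenz-63 vector field with parameters (theta1, theta2, theta3).
    return np.array([th[0] * (x[1] - x[0]),
                     th[1] * x[0] - x[1] - x[0] * x[2],
                     x[0] * x[1] - th[2] * x[2]])

def euler_maruyama(x0, th, sigma0, T=2.0, dt=2e-4, seed=1):
    rng = np.random.default_rng(seed)
    n = int(round(T / dt))
    xs = np.empty((n + 1, 3))
    xs[0] = x0
    for i in range(n):
        dW = rng.standard_normal(3) * np.sqrt(dt)
        xs[i + 1] = xs[i] + lorenz_drift(xs[i], th) * dt + sigma0 * dW
    return xs

xs = euler_maruyama(np.array([1.5, -1.5, 25.0]), (10.0, 28.0, 8 / 3), 3.0)

# Partial observation: only the 2nd and 3rd coordinates are seen.
L = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
v = L @ xs[-1]  # the observed linear functional of the final state
```

In the experiments the observations are additionally perturbed by Gaussian noise with covariance Σ_i, omitted here for brevity.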
Setting a Gaussian prior over θ made it possible to implement conjugate updates for θ (cf. Proposition 4.5 or van der Meulen et al. (2014)). Each block had its own persistence parameter λ_i of the preconditioned Crank–Nicolson scheme (in case of no blocking a single parameter λ was used). For each block, λ_i was tuned adaptively so as to target an acceptance probability of 0.234 for the imputation step of that block (cf. Section 3 of Roberts and Rosenthal (2009)). The updates of the initial position were always done jointly with the updates of the path in the first block (or jointly with the full path updates in case of no blocking).
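The imputation proposal underlying this step can be sketched as a preconditioned Crank–Nicolson move on the Wiener innovations; below is a minimal Python sketch (lam plays the role of the persistence parameter λ; its adaptive tuning towards an acceptance rate of 0.234 happens outside this function):

```python
import numpy as np

def pcn_propose(W, lam, rng):
    """Preconditioned Crank-Nicolson proposal on Wiener innovations:
    W' = lam * W + sqrt(1 - lam^2) * Z with Z fresh white noise.
    The move preserves the Wiener measure, so the acceptance ratio only
    involves the likelihood terms; lam -> 1 gives small local moves,
    lam = 0 gives independent proposals."""
    Z = rng.standard_normal(W.shape)
    return lam * W + np.sqrt(1.0 - lam ** 2) * Z
```

The proposed innovations are then mapped through the guided SDE solver to obtain the proposed path segment.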
We chose the auxiliary process according to strategy B, where the initial guess for x(t_i) was taken to be (Ξ_i, v_{i,1}, v_{i,2})^T with Ξ_i = 25 (i = 1, . . ., n), and where the values of Ξ_i were updated over the course of running the MCMC sampler: during the first 2500 iterations, once every 500 steps we used an empirical average of X_{1,t_i} to re-define Ξ_i. Note that this was done in all six experiments.
Additionally, in the experiment with "adaptation", we used strategy E, where the adaptation time was set to 2500 iterations and adaptation updates were done once every 500 steps. The weights were changed from, initially, (1, 0) (the first value representing the weight for strategy B, the second one for strategy C), through (0.7, 0.3) and (0.4, 0.6), up to (0.2, 0.8) at iteration 1500 (from then on the weights remained unchanged).

Results:
The results of the experiments associated with the first dataset are shown in Figure 3; they illustrate the outcome of only the first 10^4 iterations. A summary of the time-adjusted effective sample size based on the entire chain is given in Table 1.
The green dashed lines in the top-most three traceplots indicate the values taken by the diffusion trajectory that was used to produce the data. As we only have one realisation of the path and finitely many observations, the conditional distribution (given the data) is not centred at these lines; however, we can expect the samples to be close to them. The green dashed lines in the bottom three traceplots indicate the values of the parameters that were used to simulate the data.
Under the first observational setting the discrepancy between the approximate and the true guiding term is small, and thus blocking becomes a hindrance that decelerates mixing of the parameter chains and the path chain. This is clearly illustrated in the left-most column of Figure 3, which corresponds to the algorithm of van der Meulen and Schauer (2017b) with the block-length set to 2. By increasing the block-length to 8 (centre column) the mixing is improved, and by removing blocking altogether and updating the entire path at once even better mixing can be achieved (right column). The same conclusions are strongly supported by the results summarised in Table 1, where the difference in the time-adjusted effective sample size is in excess of 2 orders of magnitude. (We used a standard definition of the effective sample size, recommended by R. Neal in the panel discussion of Kass et al. (1998): ESS = n_ps / (1 + 2 Σ_{k≥1} ρ_k), with n_ps denoting the number of posterior samples and ρ_k the autocorrelation at lag k; it is implemented as the function ESS in the R programming language.)
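This definition of the effective sample size can be computed directly; below is a Python sketch in which the autocorrelation sum is truncated at the first non-positive term, a common practical choice (the R function referenced above may truncate differently):

```python
import numpy as np

def ess(chain):
    """Effective sample size n_ps / (1 + 2 * sum_k rho_k), with the sum
    truncated at the first non-positive sample autocorrelation."""
    x = np.asarray(chain, dtype=float)
    n = len(x)
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n  # lags 0..n-1
    rho = acov / acov[0]
    s = 0.0
    for k in range(1, n):
        if rho[k] <= 0:
            break
        s += rho[k]
    return n / (1.0 + 2.0 * s)
```

For an iid chain the value is close to the number of samples; for a strongly autocorrelated chain it is much smaller, which is what the time-adjusted ESS columns of Tables 1 and 2 quantify.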
In the top row of Figure 3 we plotted a thinned chain of the imputed trajectories (1 in every 100 sampled paths is shown; note that, unlike in Figures 1 and 2, the temporal component is represented only by the changing colour of the trajectories, whereas each of the three axes has a purely spatial meaning and corresponds to a coordinate of the process X). Notice how short blocks cause a slowdown in mixing on path space, effectively preventing the sampler from performing large moves; this is expected to be a particular hindrance to sampling.

The results from the first dataset clearly illustrate that removing the restriction on the block length can have a positive effect on the efficiency of sampling on path space. Nonetheless, some caution must be exercised in extrapolating those conclusions. To see this, consider the second dataset. It is qualitatively different from the first one, in that the inter-observation distance is large enough for the non-linear dynamics of the target process to become pronounced between any two observations, whereas the decreased observational noise elevates the impact of the guiding term (for this reason it is also considered a substantially more difficult problem, to which fewer numerical schemes can be applied).
The inference results obtained on the second dataset are gathered in Figure 4. In this setting, using blocking yields similar results to imputing the path in full. In fact, as summarised in Table 2, the values of taESS corresponding to the parameter chains are approximately twice as large when blocking is used (as compared to a no-blocking scheme). However, it is possible to improve the mixing of the chains even further, as depicted in the centre column of Figure 4, by designing better proposal laws. We used a scheme from strategy E simply to demonstrate the usefulness of making the coefficients of the auxiliary diffusion time-dependent. The topic of the optimal choice of the auxiliary law is beyond the scope of this paper.
Problems encountered in practice might fall anywhere on the spectrum spanned by the two datasets above, or exhibit peculiarities of their own. We therefore believe that removing the restriction on the block length, as is done in this article, is a particularly practical improvement.
For reference we also ran the algorithm on the second dataset with the choice of the simplest strategy A and reported the result in Table 2.
6.2. Prokaryotic auto-regulatory gene network. We consider a simplified model of auto-regulated protein transcription studied in Golightly and Wilkinson (2005, 2011). The model comprises the repression of the gene by protein dimers binding to certain sites on the DNA, the transcription of DNA into mRNA and the subsequent translation and protein folding, the reversible protein dimerisation, and the mRNA and protein degradation. The total number of DNA strands is assumed to be fixed: DNA + DNA·P_2 = K. It is well known that such a system admits a diffusion approximation (called the chemical Langevin equation); for the system above, the associated SDE takes the form

dX_t = S h(X_t; θ) dt + S diag(h(X_t; θ))^{1/2} dW_t,

where S is the 4 × 8 stoichiometry matrix and h the vector of hazards of the eight reactions. For this SDE X_t = (RNA_t, P_t, (P_2)_t, DNA_t) and W is an 8-dimensional Brownian motion. For more details about the model and how to derive chemical Langevin equations see Golightly and Wilkinson (2005).
We simulated three datasets using the Gillespie algorithm. In all three cases the process was started from x_0 = (8, 8, 8, 5) and we set K = 10.
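A generic Gillespie (stochastic simulation) algorithm of the kind used here can be sketched as follows; the stoichiometry matrix S and hazard vector h of the auto-regulatory network are those of Golightly and Wilkinson (2005) and are not reproduced, so the usage example below employs a simple hypothetical birth-death process instead:

```python
import numpy as np

def gillespie(x0, S, hazards, t_max, rng):
    """Generic Gillespie SSA. S is the d x R stoichiometry matrix and
    hazards(x) returns the R reaction rates at state x."""
    t, x = 0.0, np.array(x0, dtype=float)
    path = [(t, x.copy())]
    while t < t_max:
        h = hazards(x)
        h0 = h.sum()
        if h0 <= 0:  # no reaction can fire any more
            break
        t += rng.exponential(1.0 / h0)    # waiting time to the next event
        j = rng.choice(len(h), p=h / h0)  # which reaction fires
        x += S[:, j]
        path.append((t, x.copy()))
    return path

# Usage with a hypothetical birth-death system: birth at rate 2,
# death at rate 0.1 per individual.
rng = np.random.default_rng(4)
S_bd = np.array([[1.0, -1.0]])
path = gillespie([5.0], S_bd, lambda x: np.array([2.0, 0.1 * x[0]]), 50.0, rng)
```

The chemical Langevin SDE above is the diffusion approximation of exactly this jump process.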
Dataset 1: First, we reproduced the observational scheme of Golightly and Wilkinson (2011), where noisy estimates of the total number of proteins that are not attached to any DNA strand are recorded at integer times (6.1). Dataset 2: Next, in addition to the observations from (6.1), we assumed that for every 8 observations of protein counts it is possible to obtain one noisy count of RNA, i.e. that V_i includes a noisy observation of RNA_{t_i} for i ∈ {8, 16, . . ., 96} (6.2), and that for i ∈ {1, 2, . . ., 100}\{8, 16, . . ., 96}, V_i is given by (6.1).
Algorithm details: We ran the inference algorithm using guided proposals without blocking.
Note that due to the rectangular shape of the volatility coefficient it is impossible to employ guided proposals with blocking and a non-zero persistence parameter λ of the preconditioned Crank–Nicolson scheme. Consequently, one can employ the algorithm of van der Meulen and Schauer (2017b) only in the special case when the observations are made on a sufficiently dense time-grid, for which it is possible to set λ = 0.
We set the prior distribution on the starting position to X_0 ∼ N(x_0, I_{4×4}) and, following Golightly and Wilkinson (2011), the prior distribution on the logarithm of each inferred parameter to Unif([−7, 2]). We remark that the priors chosen by Golightly and Wilkinson (2011), which truncate the natural support of the parameters, might be problematic, especially for the first dataset: the data is not informative enough to narrow down the support of some of the posteriors, causing the parameter chains to hit the boundaries of the priors. Instead, using an exponential prior on each θ_i, this being the maximum entropy distribution among all continuous distributions supported on [0, ∞) with a given mean, might be a more suitable choice. For the sake of fair comparison we stick with the choice made by Golightly and Wilkinson (2011).
The MCMC sampler started off updating each log-transformed parameter separately using a Metropolis–Hastings algorithm with random walk proposals and adaptation as described in Section 3 of Roberts and Rosenthal (2009). After the first 12·10^3 steps of parameter updates we computed the empirical covariance of the parameter chains and switched to the joint proposals of Haario et al. (2001) (we used a modified version from Section 2 of Roberts and Rosenthal (2009)). We kept updating the empirical covariance matrix once every 100 steps of the algorithm. We chose the auxiliary law according to strategy B, where during the first 10^4 iterations of the Markov chain we updated our guesses for the latent positions of the path at the observation times once every 200 iterations.
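The covariance used for the joint proposals can be sketched as follows, assuming the standard Haario et al. (2001) scaling 2.38²/d of the empirical covariance plus a small regularisation (the function name and the regularisation constant are illustrative):

```python
import numpy as np

def am_proposal_cov(samples, d, eps=1e-6):
    """Adaptive Metropolis proposal covariance: (2.38^2 / d) times the
    empirical covariance of the chain so far, regularised by eps * I
    so that the proposal is always non-degenerate."""
    C = np.cov(np.asarray(samples), rowvar=False)
    return (2.38 ** 2 / d) * (C + eps * np.eye(d))
```

In the experiment this matrix would be recomputed once every 100 iterations from the log-parameter chain and plugged into a Gaussian random walk proposal.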
Results: Figure 5 summarises the results from running inference on each of the three datasets.
For each log-transformed parameter we give traceplots of the Markov chain and the resulting marginal density plots (with burn-in set to 2·10^4 iterations). The true parameter values are marked with orange dashed horizontal lines, whereas grey dotted vertical lines mark the change in the type of transition kernel used (from adaptive single-site updates to the joint updates of Haario et al. (2001)).
For dataset 1 we fixed θ_5, θ_6 and K to their true values and aimed to infer the remaining parameters. This is the same setup as in Golightly and Wilkinson (2011). The results are given in the left column of Figure 5.
Note that the marginal posteriors over log(c_3) and log(c_8) presented in Golightly and Wilkinson (2011, Figure 3) (corresponding to log(θ_3) and log(θ_8)) concentrate decisively around incorrect values of the parameters, possibly pointing to some undiagnosed problems with the methodology. In our case, the true values of the parameters that were used to generate the data lie within the support of the posterior. In particular, the posteriors over log(θ_3) and log(θ_7) do not depart much from the priors, indicating that the data is not informative enough to identify those two parameters. Indeed, an examination of the reaction equations reveals that θ_3 and θ_7 regulate the production and the degradation of the RNA respectively, and an observation scheme that provides only noisy counts of P might not provide sufficient information.
We generated the second dataset to examine the degree of improvement in identifying the true parameters that could be brought about by adding sparse observations of the RNA. We assumed that only K is known and we fixed θ_6 to an incorrect but reasonable value of θ_6 = 0.5. This was done to reproduce a realistic setting in which none of the θ parameters are known, so fixing any θ_i to its true value is impossible. By fixing θ_6 to an incorrect value and running the inference algorithm we hope to estimate the ratio θ_5/θ_6. Ideally, one would want to estimate all 8 parameters simultaneously, but the dataset is not informative enough to conduct such inference. The results are summarised in the middle column of Figure 5. Parameters θ_i, i ∈ {1, 2, 3, 4, 7, 8}, have been successfully identified, and the posterior over θ_5 concentrates around the value for which θ_5/θ_6 equals the true ratio of the parameters used to generate the data.
Naturally, the problem with the observational setting of dataset 2 is that noisy counts of RNA_t need not be easily available in practice. However, it is conceivable that it is possible to collect information about the marginal distribution of RNA. To make use of such information, at each time t ∈ {8, 16, . . ., 96}, instead of observing a true noisy count of RNA_t we could instead observe a realisation from the marginal density of RNA. We reproduce such an observational setting by randomly permuting the time-labels of the generated RNA_t, t ∈ {8, 16, . . ., 96}, resulting in dataset 3. The inference results in this setting (assuming K known and θ_6 fixed to an incorrect value of θ_6 = 0.5) are summarised in the right column of Figure 5. Some of the marginal posteriors appear flatter than the ones in the middle column, but the loss of information is not substantial and the posteriors still identify the true parameter values.
6.3. Tracking the translation of a dynamic Gaussian random field. We consider the problem of inferring a latent spatial process from noisy low frequency observations. As an example we take a few pictures from a sequence of satellite images depicting the evolution of a convective cloud system, taken 15 minutes apart. The sequence was analysed by Avenel et al. (2014) from the perspective of curve tracking. A visual inspection indicates that the system moves south-east, and we are interested in estimating the speed of the system, as well as its location and extent at intermediate times.
No attempt is made to physically model a convective system. This example is merely meant to serve as a starting point for applications in high dimensional settings. The statistical analysis is made under the minimalistic assumption that we observe a large scale phenomenon moving at constant speed through space, while the images show additional temporal and spatial high frequency phenomena that we consider as random; it is the latent, large scale evolution that we are interested in. In this way we develop a generic model to track deformations and translations in time, not bound to this particular meteorological example. The Wiener noise and the formal observation noise then capture all dynamics, observation errors and local physical phenomena which cannot be explained by our simple model.
We set this numerical experiment up as a 2-d generalisation of (5.2). It is convenient to describe the model as a discretely observed, matrix-valued Langevin process (stochastic heat equation) on a graph with a stationary distribution that shows large scale variation in the distribution of its samples. This is similar in spirit to Hartog and van Zanten (2019), but we take a more dynamical view with discrete time observations, continuous latent dynamics and a drift term. Furthermore, via forward guiding, our approach has a natural non-Gaussian extension. To avoid working with covariance tensors describing the uncertainty of a matrix-valued process, we move to the equivalent vector-valued process obtained through the application of the vec (vectorisation) operation.
The images have a single channel with values in [0, 1], so each pixel corresponds to one entry of a real m × n matrix, but we do not take the boundedness into account and model them as real-valued matrices. The matrix-valued observations are denoted by Y_0, . . ., Y_3, corresponding to the pictures with indices 33, 35, 40, 45. The chosen images are not equally spaced in time, but this poses no difficulty to our method. The last three pictures are taken 75 minutes apart, the first two pictures 30 minutes apart. We consider 75 min as 1 time unit. The original image size was 196 × 166 and it was downsampled to 65 × 55.
We assume that each observation is the latent state perturbed by noise, Y_i = X_{τ_i} + η_i, where η_i is matrix-valued Gaussian noise with covariance tensor Σ of the appropriate dimension (corresponding to the covariance matrix of vec(η_i)). We took Σ = σ²I. The latent process is a stochastic process X = (X_t) also taking values in the set of m × n real matrices, so nominally a 3575-dimensional diffusion process, solving dX_t = b_θ(t, X_t) dt + σ dW_t, with W_t an uncorrelated, matrix-valued Brownian motion, σ, ρ, c > 0, the drift b_θ specified below, and Λ the graph Laplacian tensor defined in (6.4) below.
The translation part of the drift is built from the mass transport operators ∇_{(∆i,∆j)} with (∆i, ∆j) ∈ {(−1, 0), (1, 0), (0, 1), (0, −1)}; in particular ∇_↓ = ∇_{(0,−1)} and ∇_← = ∇_{(−1,0)}. Thus b_θ(t, x) = −(σ²/2)(ρΛ + cI)x + F_θ(x), with θ considered as an unknown parameter, is of the form prescribed by Proposition 4.5. Choosing ρ and σ proportional to the image scale makes the model roughly scale invariant, which informally can be seen by considering the stationary distribution given by the Lyapunov equation under transformation of the SDE by a linear downsampling operator R : R^{m×n} → R^{m/2×n/2}, assuming n, m are even. We then chose parameters which worked well at 20 × 24 pixel resolution and scaled them to the other resolutions, 41 × 49 and 55 × 65, that we use. While the model works for the finite dimensional problem at hand, the spatial white observation noise has no infinite dimensional scaling limit, and for larger applications one would consider smoother observation noise with spatial correlation given by a non-diagonal covariance structure.
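A 2-d graph Laplacian acting on vec(X) can be assembled from two 1-d path-graph Laplacians via the Kronecker-sum identity vec(Λ_m X + X Λ_n) = (I_n ⊗ Λ_m + Λ_n ⊗ I_m) vec(X). Below is a Python/SciPy sketch of one standard construction, which may differ from the paper's (6.4) in boundary handling (sizes are illustrative):

```python
import numpy as np
from scipy import sparse

def path_laplacian(n):
    """Graph Laplacian of the path graph with n vertices (tridiagonal)."""
    main = np.full(n, 2.0)
    main[0] = main[-1] = 1.0  # boundary vertices have a single neighbour
    off = -np.ones(n - 1)
    return sparse.diags([main, off, off], [0, -1, 1])

def grid_laplacian(m, n):
    """Laplacian of the m x n grid graph, acting on vec(X) for an
    m x n matrix-valued state, via the Kronecker sum."""
    return (sparse.kron(sparse.eye(n), path_laplacian(m))
            + sparse.kron(path_laplacian(n), sparse.eye(m)))
```

For the 65 × 55 images this yields a sparse 3575 × 3575 operator whose matvecs cost O(mn), which is what keeps the forward simulation fast.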
We took the time step δt = 1/(35ρ). We assume that X_T is a priori Gaussian with mean 0.5 (a grey value) and variance 2.5I, and equip θ with an improper uniform prior.

Algorithm details:
The SDE in our model is linear, so with X̃ = X the proposal is a sample from the conditional distribution and no change of measure is needed. We solve the backward filtering equations from Theorem 2.6 for ν and P using the Euler scheme, in combination with the enforced sparsity method from Section 5.2, using Julia's native sparse matrix type (CSC).
We employed Rao-Blackwellisation to obtain an estimate of the mean of the joint posterior of parameters and latent path, where in one step the mean of X given the observations and θ is computed as the integral ∫ x p(x | v, θ) dx over the Gaussian conditional density, and in the second step the mean of θ given X is obtained using Proposition 4.5.
For the correction formulas (2.13) we opted for an implementation based on the sparse Cholesky factorisation provided in Julia.

Results:
The first 15 steps of the algorithm were run on a model downscaled to a lower resolution; for the subsequent 15 steps the resolution was increased, and the final 15 steps were run on a model with the chosen target resolution. Figure 8 shows the trace of θ_1 and θ_2 for 45 iterates of the Rao-Blackwellised chain. After the last 15 steps the final estimates of the posterior means of the parameters are obtained as θ̂_1 = 1.55 and θ̂_2 = 6.12. These are in good agreement with a visual assessment of the movement. The estimates of θ appear mostly stable across different model scales (see the small changes in the estimates after each change of resolution, at steps 15 and 30, in Figure 8). The animation in Figure 7 shows the estimated posterior mean trajectory. The incorporation of discrete time observations with continuous latent dynamics and a drift term allows the algorithm to recognise the convective system as one object even after it has moved by a substantial number of pixels.
We only report rough timings: at full resolution, the backward filtering step took about half a minute for the correction steps and about two minutes for propagating the uncertainty. Forward guiding took about half a minute and the parameter update about 1 second. (Timings are variable and depend on the current value of θ, as θ influences the sparsity.) The Kalman gain had 3% non-zero entries at drop threshold 10^{-8}.

A Related work
If the full state of the diffusion process is observed at all times {t_i, 0 ≤ i ≤ n} without noise, i.e. if L_i = I and Σ_i ≡ 0 for all i in (1.3), then the smoothing problem reduces to the sampling of n independent diffusion bridges. This problem has attracted considerable attention over the past two decades; see for instance Eraker (2001), Elerian et al. (2001), Durham and Gallant (2002), Clark (1990), Bladt et al. (2016), Beskos et al. (2008), Hairer et al. (2009), Bayer and Schoenmakers (2013), Lin et al. (2010), Beskos et al. (2006), Delyon and Hu (2006), Lindström (2012), Schauer et al. (2017), Whitaker et al. (2017) and Bierkens et al. (2020). In the general case, however, the connecting bridges cannot be sampled independently between adjacent observation times. In fact, at time t ∈ (t_{i−1}, t_i) the process X, conditioned on V_0, depends on all future conditionings V_i, . . ., V_n. To resolve this problem, subsequent simulation of bridges on overlapping intervals has been proposed by Golightly and Wilkinson (2008), Fuchs (2013) and van der Meulen and Schauer (2017b). Särkkä and Sottinen (2008) considered filtering (instead of smoothing) of diffusions under the assumption that the dispersion coefficient depends only on time. If the diffusion can be transformed to unit diffusion coefficient, then filtering can also be accomplished using the exact algorithm for simulation of diffusions, as introduced by Beskos and Roberts (2005). This algorithm forms the basis for the methods presented in Fearnhead et al. (2008) and Olsson and Ströjby (2011). Various solutions to the filtering and smoothing problem are further discussed in Särkkä and Sarmavuori (2013). Key to the algorithms proposed therein is the assumption that the distribution of X_t, conditional on the data V_0, can be approximated by a normal distribution.
Recently, Graham et al. (2019) considered the same problem as considered here, using a sampler based on constrained Hamiltonian dynamics. The basic idea is to consider the solution to the SDE as a forward map of Wiener innovations that are constrained to a manifold implied by the observations. Our approach is rather different and based on forward simulating paths that directly take all observations into account.

B Technical background on guided proposals
Previous work on guided proposals includes the following. (A) In Schauer et al. (2017) the basic idea was introduced in the setting of uniformly elliptic diffusions (i.e. the case where there exists an ε > 0 such that ‖σ(t, x)^T y‖² ≥ ε‖y‖² for all y ∈ R^d), with one segment and a "full" observation at the time of conditioning. More precisely, the aim is to simulate a diffusion conditioned on X_0 = x_0 and X_T = x_T, with 0 < T.
(B) In van der Meulen and Schauer (2017b) this was extended to the setting of two future conditionings. More precisely, in that paper it was shown how to define guided proposals to sample from a diffusion starting in X_0 = x_0 and conditioned on v_S ∼ N(L X_S, Σ_S) and X_T = x_T, where 0 < S < T.
(C) In Bierkens et al. (2020) the work of Schauer et al. (2017) was generalised in a different direction by incorporating hypo-elliptic diffusions, observed partially and without noise.
More precisely, here it is assumed that X_0 = x_0 and the conditioning is given by v = L X_T for a matrix L. In all cases, the SDE for the conditioned process is of the form (1.4). The function ρ is specific to the type of conditioning and depends on the unknown transition densities p of the diffusion X.
The results of Theorem 1 in Schauer et al. (2017) and Theorem 2.14 in Bierkens et al. (2020) (and thus the absolute continuity of the laws P and P° and the expression (1.9)) are proven under the following conditions: (i) boundedness and smoothness of the drift coefficient b and dispersion coefficient σ; (ii) matching conditions on the diffusivity (and possibly the drift). In setting (A), the matching condition is given by ã(T) = a(T, x_T), which can always be ensured.
In setting (C) the matching conditions are more subtle and more complicated. The results in Bierkens et al. (2020) strongly suggest that for the diffusivity of the auxiliary process the condition L a(T, x_T) L^T = L ã(T) L^T suffices. In the hypo-elliptic case, additionally the drift of the auxiliary process, b̃(t, x) = B̃(t)x + β̃(t), needs to satisfy the equation L b(T, x_T) = L b̃(T, x_T). These conditions are particularly important in the noiseless limit of the discrete time (partial) observations.

C Derivation of the guiding term in X
Here we explain the type of conditioning induced by the guiding term, in particular the specific form of ρ in (1.5).
Let p denote the transition density of a diffusion process Y solving an SDE with drift b and diffusivity σ on the (canonical) filtered probability space (Ω, F, {F_t}, P) with Ω = C([0, t_n], R^d). Assume the hierarchical model (C.1), with ξ_0 assumed fixed (known). Without loss of generality we take t ∈ [0, t_1). We assume ξ has density π_{0,ξ_0} and denote the density of (ξ, v), conditional on Y_t = y, by ζ_{(t,y)}(ξ, v). We apply Doob's h-transform with the corresponding h-function. It is easily verified that for any t ∈ [0, t_1), Z_t is a martingale with mean 1. Define a new probability measure Q on Ω by the change of measure dQ|_{F_t} = Z_t dP|_{F_t}, for all t ∈ [0, t_1). By Girsanov's theorem it follows that under Q the process Y acquires an additional guiding term in its drift, where W is a Brownian motion under Q. Multiplying by 1 = ζ_{(0,ξ_0)}(ξ, v)/ζ_{(0,ξ_0)}(ξ, v) and applying Fubini's theorem, we arrive at a factorisation in which ψ is defined as the resulting density of ξ. This shows that under Q, first ξ is sampled from the density ψ, and subsequently the process Y is conditioned on the event {Y_{t_1} = ξ_1, . . ., Y_{t_n} = ξ_n}. Note that sampling from ψ corresponds to sampling from the posterior of ξ in the statistical model specified by (C.1), with data v.
In this model, π 0,ξ0 can be viewed as the prior density of ξ.
To connect to the main text, simply note that Y under the measure P is X, whereas Y under the measure Q is X and that the h transform is in our case ρ.

D Proofs of theorems in Section 2 and additional remarks
In the proofs we assume observation times 0 < S < T and observations V_S ∼ N(L_S X_S, Σ_S) and V_T ∼ N(L_T X_T, Σ_T) with L_S ∈ R^{m_S×d} and L_T ∈ R^{m_T×d}; the extension to multiple future conditionings is trivial.
D.1. Proof of Theorem 2.4. Define Φ to be the solution of its defining matrix ODE and notice (upon differentiation) that the stated quantities are the solutions to the ODEs (2.2), (2.3) and (2.4) respectively. Note that µ(t) ∈ R^{m(t)} and that Υ(t), L(t), M†(t) and v(t) have similar structures when t is either in (0, S] or in (S, T]. The relations in (2.5) can be verified by evaluating L(t) for both t = S and t = S+ (and similarly for M†(t) and µ(t)).
As we assume Σ_T to be strictly positive definite, the matrix M(t) = M†(t)^{-1} exists for t ∈ [0, T] (see however Remark D.2 for the case of no noise on the observations).
The expression for r for t ∈ (0, S] follows by extending Lemma 2.5 in van der Meulen and Schauer (2017b) to the case where not necessarily Σ_T = 0 and L_T = I. The expressions for t ∈ (S, T] follow from equation (4.1) in van der Meulen and Schauer (2017b).
To derive ρ, note by Schauer et al. (2017, Section 5) that X̃ is a Gaussian process with conditional mean µ_t(s, x) = E[X̃_t | X̃_s = x] and covariance K_t(s) = Cov(X̃_s, X̃_t). Then clearly (V_S, V_T) = (L_S X̃_S + η_S, L_T X̃_T + η_T) is Gaussian as well, with mean denoted v(t) and covariance denoted Ω(t). Comparing v(t) and Ω(t) with the definitions (D.1), (D.2) and (D.3) reveals the expression for ρ.
Remark D.1. The value of Theorem 2.4 lies in the recognition of the structure of both H and r, something which was not noticed in van der Meulen and Schauer (2017b). The theorem shows that both quantities can be written in a unified way on both (0, S] and (S, T]. The key to this is the proper definition of L (including the indicator).
Remark D.2. Suppose L_T = I and Σ_T ≡ 0. For t ∈ (S, T], M(t) exists if and only if ∫_t^T Φ(T, τ)ã(τ)Φ(T, τ)^T dτ is invertible. This matrix is the controllability Gramian, and systems theory provides sufficient conditions for its invertibility. In case ã(t) is not invertible, M(t) exists if and only if the pair of functions (B̃, σ̃) is controllable on [t, T] for any t ∈ [0, T) (cf. Section 5.6 in Karatzas and Shreve (1991)).
Remark D.3. Suppose L_S = I and Σ_S = 0. This corresponds to the case where the diffusion is fully observed at time S without noise. By the Markov property, the pulling term should then only depend on v_S (which is in fact x_S) and not on v_T. To verify this, first write the relevant quantities in block form for t ∈ [0, S). Now assume L_S = I and that A is invertible. Using the formula for the inverse of a block matrix, we recover exactly Lemma 6 of Schauer et al. (2017).
Remark D.4. Suppose t ∈ (t_{i−1}, t_i] and we condition on the future incomplete observations (v_i, . . ., v_n). In that setting the definitions of Υ and L(t) are adapted accordingly.

Proof of Theorem 2.5. The expression for log ρ follows from Theorem 2.4 by simple algebra.

Backward ordinary differential equations. We derive the ordinary differential equations solved by H(t), F(t) and c(t) in a neighbourhood disjoint from {S, T}. We start with H(t). Differentiating C^{(2)}_t yields

(d/dt) C^{(2)}_t = −(1/2) tr(H(t)ã(t)). (D.9)

Combining (D.8) and (D.9) yields the ODE for c(t) stated in Theorem 2.5.

Update equations
The boundary conditions for H(T), F(T), C^(1)_T and C^(2)_T follow from the definitions, with c(T) = C^(1)_T + C^(2)_T. Now, combining the expressions for H(t), F(t) and C^(1)_t, given in (2.6), (2.7) and (D.5) respectively, with the update equations for L(S), M(S) and µ(S) at time S, given in Theorem 2.4, yields the update equations for H(S), F(S) and C^(1)_S, with c(S) = C^(1)_S + C^(2)_S.

Remark D.5. We investigate the behaviour of P(S) and ν(S) when the noise level tends to zero. Assume L_S P(S+) L_S' is invertible. Then it follows from (2.13) that the expression for P(S) is also well defined when Σ_S → 0. Moreover, when Σ_S = 0 we have
L_S P(S) = L_S P(S+) − L_S P(S+) L_S' (L_S P(S+) L_S')^{−1} L_S P(S+) = 0.
To evaluate the limiting behaviour of ν(S), we write L_S ν(S) = L_S P(S) L_S' Σ_S^{−1} v_S + L_S P(S) H(S+) ν(S+).
The second term on the right-hand side is easily seen to tend to zero as Σ_S → 0. For the limit of the first term on the right-hand side we define C = L_S P(S+) L_S' Σ_S^{−1} and rewrite the term accordingly. This implies that L_S P(S) L_S' Σ_S^{−1} → I. Combining these results we obtain that L_S ν(S) → v_S as Σ_S → 0.
In particular, if we have full observations at time S, i.e. L_S = I, then P(S) → 0 and ν(S) → v_S as Σ_S → 0. As a consequence, in the case of full observations without noise, we recover the full-observation-without-noise case (in this simpler setting, the differential equations for ν and P were derived in van der Meulen and Schauer (2017c)). However, in the general case where L_S is not of full rank, we need det Σ_S ≠ 0 to obtain the value of ν(S) from ν(S+).
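The zero-noise limit described in Remark D.5 can be illustrated numerically with a Kalman-type measurement update, written below in its textbook gain form (used here as a stand-in for the update of (ν, P) at an observation time; the numbers are illustrative, not from the paper).

```python
import numpy as np

def measurement_update(nu_plus, P_plus, L, Sigma, v):
    """Kalman-type update of (nu, P) at an observation time S:
    with S_m = L P(S+) L' + Sigma and gain K = P(S+) L' S_m^{-1},
        nu(S) = nu(S+) + K (v - L nu(S+)),
        P(S)  = P(S+) - K L P(S+).
    For L = I and Sigma -> 0 this forces nu(S) -> v and P(S) -> 0."""
    Sm = L @ P_plus @ L.T + Sigma
    K = P_plus @ L.T @ np.linalg.inv(Sm)
    nu = nu_plus + K @ (v - L @ nu_plus)
    P = P_plus - K @ L @ P_plus
    return nu, P

# Full observation (L = I) with nearly vanishing noise:
nu_plus = np.array([0.0, 0.0])
P_plus = np.array([[2.0, 0.5], [0.5, 1.0]])
v = np.array([1.0, 2.0])
nu0, P0 = measurement_update(nu_plus, P_plus, np.eye(2), 1e-10 * np.eye(2), v)
```

As the noise covariance shrinks, `nu0` converges to the observation `v` and `P0` to the zero matrix, matching the limit in Remark D.5.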

E Implementation using a time-change and scaling
If the noise level on the observation is small, care is required in the discretisation of guided proposals near the conditioning points. For this reason, a time-change and scaling was introduced in Section 5 of van der Meulen and Schauer (2017a). Here, we explain it for the following case. For a mapping f : R → R we write f_τ(s) = f(τ(s)). We propose to discretise the process U_s, defined below; here, J is defined by J(s) = τ(s) H_τ(s).
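A concrete instance of such a time-change is the quadratic map sketched below, one common choice consistent with Section 5 of van der Meulen and Schauer (2017a) (the exact form used there may differ). It maps an equidistant grid on [t_0, T] to a grid that becomes increasingly fine near the conditioning point T.

```python
import numpy as np

def tau(s, t0, T):
    """One common quadratic time-change (an assumption of this sketch):
        tau(s) = t0 + (s - t0) * (2 - (s - t0) / (T - t0)).
    It fixes the endpoints, tau(t0) = t0 and tau(T) = T, and has derivative
    2 - 2(s - t0)/(T - t0), which vanishes at s = T, so an equidistant grid
    in s is mapped to a grid whose spacing shrinks toward T."""
    return t0 + (s - t0) * (2.0 - (s - t0) / (T - t0))

s = np.linspace(0.0, 1.0, 11)    # equidistant grid on [0, 1]
grid = tau(s, 0.0, 1.0)          # transformed grid, denser near T = 1
```

The transformed grid is strictly increasing with strictly decreasing spacing, concentrating discretisation points where guided proposals need them most.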

Figure 1: Dataset 1. Figure 2: Dataset 2.

Algorithm details: Overall, six inference algorithms were run, three on each dataset. The procedures inferred the parameter θ, while the parameter σ_0 was assumed to be known. All four parameters could have been inferred simultaneously; however, fixing σ_0 made it possible to illustrate the differences in performance between the algorithms more clearly. Indeed, a varying σ_0 locally influences the speed at which both the chains on path space and the parameter chains mix, which makes it harder to disentangle the contribution of the algorithm's efficiency from that of a locally elevated or decreased volatility. We illustrate the problem of inference for the volatility coefficient on a more challenging example in the following section.

Figure 3: Lorenz system: results on dataset 1. Left column: results of a single MCMC run with blocks comprising two inter-observation intervals (as in van der Meulen and Schauer (2017b)); middle column: blocks comprising eight inter-observation intervals; right column: no blocking. Top plots: paths sampled by the MCMC samplers (1 in every 100 sampled paths); rows 2-4: marginal distribution of the coordinates of X at time t = 1.5; rows 5-7: traceplots for the updated parameters θ.
It is based on the stochastic model describing the λ repressor protein cI of phage λ in Escherichia coli introduced in Arkin et al. (1998). The simplified model is characterised by eight biomolecular reactions: the reversible repression of transcription, in which a protein dimer P_2 attaches to

Figure 4: Lorenz system: results on dataset 2. Left column: results of a single MCMC run with no blocking; middle column: blocks comprising two inter-observation intervals; right column: blocks of length 2 and adaptive tuning of the auxiliary law according to b^(aux) = w b^(7.1) + (1 − w) b^(C) (see Section 4.6 for details). Top plots: paths sampled by the MCMC samplers (1 in every 100 sampled paths); rows 2-4: marginal distribution of the coordinates of X at time t = 1; rows 5-7: traceplots for the updated parameters θ.

Figure 5: Inference results for the prokaryotic example. Each iteration refers to a single step of the parameter update. Left column: results of inference run on dataset 1; middle column: results from dataset 2; right column: results from dataset 3.
Figure 6:

Figure 7: In the pdf version, the center panel shows the posterior mean trajectory x as an animation. The left frame shows the first observation and the right frame shows the last of the four observations at the same resolution, showing that the latent convective system bridge moves smoothly from the initial to the final state. The animation is also available as a supplementary file.

Table 1: Time-adjusted effective sample size (taESS) and effective sample size (ESS) for the experiments with the first dataset of the Lorenz example. Based on chains of length 10^5. Best results are written in bold.

Table 2: Time-adjusted effective sample size (taESS) and effective sample size (ESS) for the experiments with the second dataset of the Lorenz example. Additional comparison for the experiments with the auxiliary law chosen according to strategy A. Based on chains of length 10^5. Best results are written in bold.
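For reference, a minimal sketch of how ESS figures such as those in Tables 1 and 2 can be estimated from a chain, and how taESS normalises by computing time. The estimator below (autocorrelation sum truncated at the first negative value) is our own choice; the paper's exact ESS estimator may differ.

```python
import numpy as np

def ess(x):
    """Effective sample size ESS = n / (1 + 2 * sum_k rho_k), where the
    sum over autocorrelations rho_k is truncated at the first negative
    value (initial positive sequence estimator)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    acov = np.correlate(xc, xc, mode="full")[n - 1:] / n
    rho = acov / acov[0]
    s = 0.0
    for k in range(1, n):
        if rho[k] < 0:
            break
        s += rho[k]
    return n / (1.0 + 2.0 * s)

def taess(x, runtime_seconds):
    """Time-adjusted ESS: ESS divided by computing time, so samplers with
    different per-iteration costs can be compared fairly."""
    return ess(x) / runtime_seconds
```

An i.i.d. chain has ESS close to its length, while a strongly autocorrelated chain (e.g. an AR(1) chain with coefficient 0.9) has a much smaller ESS, which is what the tables quantify across blocking strategies.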