Particle Methods for Stochastic Differential Equation Mixed Effects Models

Parameter inference for stochastic differential equation mixed effects models (SDEMEMs) is a challenging problem. Analytical solutions for these models are rarely available, which means that the likelihood is also intractable. In this case, exact inference is possible using the pseudo-marginal approach, where the intractable likelihood is replaced with a nonnegative unbiased estimate. A useful application of this idea is particle MCMC, which uses a particle filter estimate of the likelihood. While the exact posterior is targeted by these methods, a naive implementation for SDEMEMs can be highly inefficient. We develop three extensions to the naive approach which exploit specific aspects of SDEMEMs and other advances such as correlated pseudo-marginal methods. We compare these methods on real and simulated data from a tumour xenography study on mice.


Introduction
Stochastic differential equations (SDEs) are defined as ordinary differential equations (ODEs) with one or more stochastic components. SDEs allow for random variations around the mean dynamics specified by the ODE. These models can be used to capture inherent randomness in the system of interest. For repeated measures data, random effects can be used to account for between-subject variability; this gives an SDE mixed effects model (SDEMEM).
SDEMEMs are emerging as a useful class of models for biomedical and pharmacokinetic/pharmacodynamic data (Donnet et al., 2010; Donnet and Samson, 2013a; Leander et al., 2015). They have also been applied in psychology (Oravecz et al., 2011) and spatio-temporal modelling (Duan et al., 2009). However, statistical inference for these models is generally difficult. In most cases, the SDE does not have an explicit or analytical solution (transition density), making the likelihood intractable. Including random effects adds further complexity.
Parameter inference for SDEMEMs has largely focussed on maximum likelihood estimation; e.g. Picchini et al. (2010), Picchini and Ditlevsen (2011), Delattre et al. (2013) and Donnet and Samson (2013a,b). There are few Bayesian inference methods; Donnet et al. (2010) propose a Gibbs sampler coupled with an Euler-Maruyama discretisation of the intractable transition density. Whitaker et al. (2017a) take a data augmentation approach based on a diffusion bridge, which allows for non-linear dynamics between observed time points. Picchini and Forman (2019) compare results from a particle MCMC algorithm (Andrieu and Roberts, 2009;Andrieu et al., 2010) and a Bayesian synthetic likelihood approach (Wood, 2010;Price et al., 2018). They apply both methods to an SDE with known solution, and suggest an Euler-Maruyama approximation if the solution is unavailable.
It is unlikely however that any one approach to estimating SDEMEMs is optimal for all applications. Performance will depend on the complexity of the underlying SDE, the number of parameters, the number of observations for each subject, as well as the complexity of the random effects. Inference results also depend on the properties of the ODE underlying the SDEMEM, and whether it reflects important features of the data, such as monotonicity or periodicity. It has been our experience that methods that work well on simple examples can often fail badly on more complex ones. This motivates our focus on significant extensions to the pseudo-marginal approach of Picchini and Forman (2019) for SDEMEMs. Pseudo-marginal methods can overcome some limitations of data augmentation approaches because they integrate out the latent states (Stramer and Bognar, 2011;Gunawan et al., 2018b). This is especially useful when there is substantial correlation between the latent variables and the parameters of interest. Our article develops a suite of new and efficient Bayesian methods for SDEMEMs that take advantage of advances in particle methods and exploit specific aspects of SDEMEMs. We compare these methods on a model adapted from one used by Picchini and Forman (2019) to model the growth of tumour volumes in mice. We believe that the results of this comparison are of interest to the wider Bayesian community.
The rest of the paper is organised as follows. Sections 2 and 3 provide the necessary background on state space models, stochastic differential equations, particle filters and particle MCMC methods. Section 4 proposes three potential particle methods for SDEMEMs. Sections 5-7 compare these methods with the Picchini and Forman (2019) approach on simulated and real data from a tumour xenography study on mice, modelled with a monotonic growth SDEMEM. Appendix B (Botha et al., 2020) gives a second example which shows the performance of our methods on a non-monotonic data example. Section 8 concludes with a discussion of the results and possible future work. Code for our methods is available at https://github.com/imkebotha/particle-mcmc-sdemem.

State Space Models
State space models (SSMs) consist of two processes: a Markov process {X_t}_{t≥0} ⊂ X, where X_t is usually only partially observed and is often viewed as a latent process, and an observed process {Y_t}_{t≥0} ⊂ Y. The X and Y spaces are typically subsets of Euclidean space. To obtain an SSM, we assume that {(x_t, y_t); t ≥ 0} is Markov with model parameters θ. For ease of exposition, t = 0, . . . , T − 1 are assumed to be the observation times. We simplify further and assume that

p(x_{0:T−1}, y_{0:T−1} | θ) = f(x_0 | θ) g(y_0 | x_0, θ) ∏_{t=1}^{T−1} f(x_t | x_{t−1}, θ) g(y_t | x_t, θ),    (1)

where g(y_t | x_t, θ) is the observation density and f(x_t | x_{t−1}, θ) is the state transition density; π(θ) is the prior for θ. The unnormalised posterior density of the latent states and model parameters is

p(θ, x_{0:T−1} | y_{0:T−1}) ∝ π(θ) p(x_{0:T−1}, y_{0:T−1} | θ), with p(y_{0:T−1} | θ) = ∫ p(x_{0:T−1}, y_{0:T−1} | θ) dx_{0:T−1}.    (2)

The integral in (2) is usually intractable. For some models, inference is also complicated due to an intractable transition density, see e.g. the SDEs in Section 2.2. While approximate methods can be used in this case, Section 3.2 shows that exact inference is still feasible if it is possible to simulate from the transition density.

Stochastic Differential Equation Mixed Effects Models
A one-dimensional Itô process (Øksendal, 2013, p. 22) is a stochastic process {X_t}_{t≥0} satisfying

X_t = X_0 + ∫_0^t μ(X_s) ds + ∫_0^t ψ(X_s) dB_s,    (3)

where B_t = ∫_0^t dB_s is standard one-dimensional Brownian motion. The differential form of (3) gives the stochastic differential equation (SDE) governing {X_t}_{t≥0}. For simplicity, we only consider one-dimensional SDEs, but it is straightforward to extend the methods introduced in Section 4 to higher dimensions.
Given an Itô process {X_t}_{t≥0}, the general form for a one-dimensional continuous SDE parameterised by φ_X is

dX_t = μ(X_t, φ_X) dt + ψ(X_t, φ_X) dB_t,

where μ(·) is the drift, ψ(·) is the diffusion, φ_X are the fixed model parameters and {B_t}_{t≥0} is a standard Brownian motion process. This model can be extended by allowing some of the parameters to vary between the m = 1, . . . , M individuals. In this more general setting, let φ_X be the vector of fixed common parameters of the SDE, and η_m the vector of subject-specific parameters (random effects), where η_m ∼ p(φ_η). The stochastic differential equation mixed effects model (SDEMEM) is then given by

dX_{m,t} = μ(X_{m,t}, φ_X, η_m) dt + ψ(X_{m,t}, φ_X, η_m) dB_{m,t}, m = 1, . . . , M.    (4)

The solution to (4) gives the transition density of the states. If an analytical solution for the transition density is unavailable, numerical methods can be used; Section 2.3 discusses some of these.
This leads to a state-space model if the process in (4) is hidden. Let y_{m,t} ∈ {Y_{m,t}}_{t≥0} denote a noisy observation for individual m, m = 1, . . . , M at time ξ_{m,t}, t = 0, . . . , T_m − 1, where T_m is the number of observations for individual m. To simplify notation, we assume that observations are taken at the same time points for all individuals, i.e. ξ_t, t = 0, . . . , T − 1, but this restriction is unnecessary for our methods. Further dependence on ξ_t is denoted simply by t, e.g. 0 : T − 1 represents ξ_0 : ξ_{T−1}. We assume that the observation equations are given by

y_{m,t} ∼ g(y_{m,t} | x_{m,t}, σ), t = 0, . . . , T − 1, m = 1, . . . , M.    (5)

Let θ = (σ, φ_X, φ_η) be the vector of all unknown parameters in the model, y_m = y_{m,0:T−1} and x_m = x_{m,0:T−1}. The joint posterior of θ, η_{1:M} and x_{1:M} can be expressed as

p(θ, η_{1:M}, x_{1:M} | y_{1:M}) ∝ π(θ) ∏_{m=1}^M p(η_m | φ_η) p(x_m | η_m, φ_X) p(y_m | x_m, σ).

The following running example is used throughout the paper to illustrate some of the concepts and methods.

SDE Simulation
Example (SDEMEM with constant drift and diffusion). Consider the SDEMEM for a single individual,

dX_{m,t} = β_m dt + γ dB_{m,t},    (6)

where β_m is a random effect and γ is a fixed diffusion parameter.

If the SDE cannot be solved analytically, then it is necessary to use approximate methods. This section describes two common approaches for approximate simulation of SDEs: the Euler-Maruyama discretisation (EMD) and the diffusion bridge approach. Both methods simulate the SDE between discrete time points (generally corresponding to the observed times) along the entire diffusion trajectory or path, i.e. from t = 0 to T − 1. The error resulting from the discretisation is controlled using data augmentation, which introduces additional (unobserved) states between observation times.
Given a process {X_t}_{t≥0}, the time interval [ξ_t, ξ_{t+1}] between two observations is split into D subintervals, where D denotes the level of discretisation,

ξ_t = τ_0 < τ_1 < · · · < τ_D = ξ_{t+1}, with Δτ = τ_{k+1} − τ_k = (ξ_{t+1} − ξ_t)/D.

The EMD and diffusion bridges simulate over each subinterval as follows,

X_{τ_{k+1}} = X_{τ_k} + μ_DB(X_{τ_k}) Δτ + √(Ψ_DB(X_{τ_k})) ΔB_{τ_k},

where μ_DB(·) and Ψ_DB(·) are determined by the method used, and ΔB_{τ_k} = B_{τ_{k+1}} − B_{τ_k}. Since ΔB_{τ_k} ∼ N(ΔB_{τ_k}; 0, Δτ) by definition, the path is simulated by recursively applying

x_{τ_{k+1}} | x_{τ_k} ∼ N(x_{τ_k} + μ_DB(x_{τ_k}) Δτ, Ψ_DB(x_{τ_k}) Δτ).

The joint density of this approximation is

∏_{k=0}^{D−1} N(x_{τ_{k+1}}; x_{τ_k} + μ_DB(x_{τ_k}) Δτ, Ψ_DB(x_{τ_k}) Δτ).

Euler-Maruyama
The Euler-Maruyama discretisation (EMD) is the simplest method to simulate an approximate trajectory from an SDE. Assuming that the drift and diffusion coefficients are locally constant, the EMD uses the proposal

x_{τ_{k+1}} | x_{τ_k} ∼ N(x_{τ_k} + μ(x_{τ_k}, φ_X, η) Δτ, ψ²(x_{τ_k}, φ_X, η) Δτ),    (9)

which approximates the transition density f(x_{τ_{k+1}} | x_{τ_k}, φ_X, η). If the SDE has constant drift and diffusion, then the EMD gives the exact solution, i.e. the transition density.
Example (SDEMEM with constant drift and diffusion). The EMD for the SDEMEM in (6), with μ_k = β_m and v_k = γ², is

x_{τ_{k+1}} | x_{τ_k} ∼ N(x_{τ_k} + β_m Δτ, γ² Δτ),

which is the exact transition density.
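To make the recursion concrete, the following sketch simulates an approximate path over one inter-observation interval with the EMD; the function names and the constant-coefficient values (beta_m, gamma) are illustrative, not taken from the paper.

```python
import numpy as np

def euler_maruyama(x0, drift, diffusion, t0, t1, D, rng):
    """Simulate an approximate SDE path on [t0, t1] with D Euler-Maruyama steps.

    drift(x) and diffusion(x) play the roles of mu(.) and psi(.); any model
    parameters (phi_X, eta) are assumed to be baked into these callables.
    """
    dt = (t1 - t0) / D
    path = np.empty(D + 1)
    path[0] = x0
    for k in range(D):
        dB = rng.normal(0.0, np.sqrt(dt))  # Brownian increment over the subinterval
        path[k + 1] = path[k] + drift(path[k]) * dt + diffusion(path[k]) * dB
    return path

# Running example: constant drift beta_m and diffusion gamma, for which the
# EMD coincides with the exact transition density.
rng = np.random.default_rng(1)
beta_m, gamma = 0.5, 0.2
path = euler_maruyama(0.0, lambda x: beta_m, lambda x: gamma, 0.0, 1.0, 100, rng)
```

Increasing D refines the grid and reduces the discretisation error for non-constant coefficients.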

Diffusion Bridges
Simulating from the (approximate) transition density may be sub-optimal if some observations are highly informative or there is little observation noise. More effective trajectories can be obtained if the proposal for x t can be directed towards y t . This is possible using a diffusion bridge.
The modified diffusion bridge (MDB) of Durham and Gallant (2002) (see also Golightly and Wilkinson, 2008) directs x_t linearly towards y_t. The MDB is derived by approximating the joint distribution of X_{τ_{k+1}}, Y_{ξ_{t+1}} | x_{τ_k} using a multivariate normal distribution, and then conditioning on Y_{ξ_{t+1}} = y_{ξ_{t+1}}. The distribution of X_{τ_{k+1}}, Y_{ξ_{t+1}} | x_{τ_k} is obtained from the observation density (5) and the EMD (9) of X_{τ_{k+1}} | x_{τ_k}; see Appendix 1 of Golightly and Wilkinson (2008) for a more detailed derivation. Writing μ_k = μ(x_{τ_k}, φ_X, η), ψ_k² = ψ²(x_{τ_k}, φ_X, η) and Δ_k = ξ_{t+1} − τ_k, the MDB is a bridge proposal of the form

x_{τ_{k+1}} | x_{τ_k} ∼ N(x_{τ_k} + μ_MDB Δτ, Ψ_MDB Δτ), where
μ_MDB = μ_k + ψ_k² (y_{ξ_{t+1}} − (x_{τ_k} + μ_k Δ_k)) / (ψ_k² Δ_k + σ²), Ψ_MDB = ψ_k² (ψ_k² (Δ_k − Δτ) + σ²) / (ψ_k² Δ_k + σ²).

Whitaker et al. (2017b) note that the modified diffusion bridge can perform poorly when the drift coefficient μ_k is not approximately constant. To overcome this problem, they propose partitioning the SDE into a deterministic process and a residual stochastic process, such that the latter has constant drift. Rewriting the model in terms of these processes gives

X_t = ζ_t + R_t, with dζ_t = f(ζ_t) dt,    (10)

where {ζ_t} is deterministic and {R_t} is the residual stochastic process. The idea is to choose ζ_t and f(·) such that the drift of (10) is approximately constant. The simplest choice (Whitaker et al., 2017b) is to set f(·) = μ(·), so that ζ_t is the solution of the ODE given by the drift. The residual bridge is then obtained by constructing the MDB on the residual process {R_t}.
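A single MDB step can be sketched as follows; this is a minimal illustration assuming additive Gaussian observation noise with variance sigma2, with the conditional mean and variance obtained by conditioning the joint Gaussian approximation of the next state and the observation on the observed value. Names and values are illustrative.

```python
import numpy as np

def mdb_step(x, y_next, dtau, delta_k, drift, diff_var, sigma2, rng):
    """One modified diffusion bridge step towards the next observation y_next.

    dtau is the sub-interval length, delta_k the time remaining until the
    observation, diff_var(x) the squared diffusion coefficient psi(x)^2 and
    sigma2 the observation noise variance.
    """
    a, b = drift(x), diff_var(x)
    denom = b * delta_k + sigma2
    # Conditional mean: a drift step plus a linear pull towards y_next.
    mean = x + a * dtau + (b * dtau / denom) * (y_next - x - a * delta_k)
    # Conditional variance: shrinks as the observation time is approached.
    var = b * dtau * (b * (delta_k - dtau) + sigma2) / denom
    return rng.normal(mean, np.sqrt(var))

# Illustrative step towards an observation y = 1.0 one time unit away.
rng = np.random.default_rng(2)
x_new = mdb_step(0.0, 1.0, 0.1, 1.0, lambda x: 0.0, lambda x: 0.04, 0.01, rng)
```

Iterating this step over the D subintervals, with delta_k decreasing each time, produces a bridge path conditioned on the next observation.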

Particle MCMC
There are at least two possible Bayesian strategies to obtain parameter inference for θ for state-space SDEMEMs. The first is to use a Metropolis-Hastings algorithm to sample from θ | y_{1:M}. This requires the likelihood p(y_{1:M} | θ) to be analytically tractable, which is generally not the case for SDEMEMs. The second strategy involves a component-wise sampling approach, which iteratively updates θ | y_{1:M}, x_{1:M} and x_{1:M} | y_{1:M}, θ; this requires sampling the latent states, which is difficult to do efficiently. In both cases, a particle filter may be used to overcome these issues. The following sections describe particle filters, particle methods and their variants for state-space models. Section 4 covers the specific application of these methods to SDEMEMs.
Algorithm 1: The generic particle filter of Doucet and Johansen (2009).
Input: data y_{1:T}, number of particles N, static parameters θ, initial states x_0^{1:N} and the vector of random numbers u†.
Output: weighted sample {x_{1:T}^{1:N}, W_{1:T}^{1:N}} and likelihood estimate Ẑ.
Notation: we use the convention that index (n) means 'for all n ∈ {1, . . . , N}'.
† Note that the random numbers in u are used implicitly when the particles are resampled and moved (steps 3 and 4).

Particle Filters
Exact state estimation for SSMs using the Kalman filter is only possible when they are linear and Gaussian, or conditionally so. For non-linear, non-Gaussian SSMs, a particle filter can be used for simulation-consistent estimation (Gordon et al., 1993; Carpenter et al., 1999; Doucet et al., 2000; Del Moral et al., 2006; Doucet and Johansen, 2009).
Particle filters traverse a sequence of intermediary distributions towards some target distribution. We describe the generic particle filter of Doucet and Johansen (2009) (see Algorithm 1), with filtering distributions of the form

p(x_{1:t} | y_{1:t}, θ), t = 1, . . . , T.    (11)

A combination of move, reweight and resample steps is used to transition through this sequence. The move step generates values for x_t from some proposal distribution q(x_t | y_t, x_{t−1}, θ). Once moved, the N particles are reweighted according to

w_t^n = W_{t−1}^n g(y_t | x_t^n, θ) f(x_t^n | x_{t−1}^n, θ) / q(x_t^n | y_t, x_{t−1}^n, θ), W_t^n = w_t^n / Σ_{j=1}^N w_t^j.

Particles are then resampled with probabilities W_t^{1:N} for the next iteration. This is done to avoid particle impoverishment, where most of the weight is given to a few particles.
There are several resampling methods that can be used, including multinomial, stratified (Kitagawa, 1996), and more recently, the Srinivasan sampling process (Gerber et al., 2019).
The particle filter gives an unbiased estimate of the likelihood from the unnormalised weights,

p̂(y_{1:T} | θ) = ∏_{t=1}^T Σ_{n=1}^N w_t^n.    (12)

The bootstrap particle filter of Gordon et al. (1993) is a special case of the generic particle filter, where the proposal distribution is the state transition density, q(x_t | y_t, x_{t−1}, θ) = f(x_t | x_{t−1}, θ). The calculation of the weights then simplifies to w_t^n = W_{t−1}^n g(y_t | x_t^n, θ).
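The bootstrap filter and its log-likelihood estimate can be sketched as follows; the toy model, parameter values and function names are illustrative, and multinomial resampling is applied at every step for simplicity.

```python
import numpy as np

def bootstrap_filter(y, N, sim_transition, log_obs_density, x0, rng):
    """Bootstrap particle filter: particles are moved through the state
    transition density and weighted by the observation density; returns the
    log of the unbiased likelihood estimate."""
    x = np.full(N, x0, dtype=float)
    loglik = 0.0
    for t in range(len(y)):
        x = sim_transition(x, rng)                   # move
        logw = log_obs_density(y[t], x)              # reweight
        c = logw.max()
        w = np.exp(logw - c)
        loglik += c + np.log(w.mean())               # log of the unbiased increment
        x = x[rng.choice(N, size=N, p=w / w.sum())]  # multinomial resampling
    return loglik

# Toy model in the spirit of the running example: x_t = x_{t-1} + beta + gamma*z,
# y_t = x_t + sigma*eps, with illustrative parameter values.
rng = np.random.default_rng(3)
beta, gamma, sigma = 0.5, 0.2, 0.1
y = np.cumsum(beta + gamma * rng.standard_normal(20)) + sigma * rng.standard_normal(20)
ll = bootstrap_filter(
    y, 200,
    lambda x, r: x + beta + gamma * r.standard_normal(x.shape),
    lambda yt, x: -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((yt - x) / sigma) ** 2,
    0.0, rng)
```

Working on the log scale with the max-subtraction trick avoids numerical underflow when the weights are small.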

Pseudo-Marginal MCMC
Markov chain Monte Carlo (MCMC) algorithms work by constructing an ergodic Markov chain with the posterior distribution p(θ | y) as its stationary distribution. The Metropolis-Hastings (MH) algorithm is a standard MCMC method, which accepts candidate values θ* ∼ q(θ* | θ) with probability

α = min{1, [p(y | θ*) π(θ*) q(θ | θ*)] / [p(y | θ) π(θ) q(θ* | θ)]}.    (13)

If the likelihood function p(y | θ) is intractable, then the MH algorithm cannot be implemented directly; however, pseudo-marginal MCMC can be used.
The pseudo-marginal approach of Andrieu and Roberts (2009) allows for exact inference for models with intractable likelihoods. In this approach, the intractable likelihood p(y_{1:T} | θ) is replaced with a non-negative unbiased estimate within an otherwise standard MH algorithm. The likelihood estimate is written as p̂(y_{1:T} | θ, u), where u ∼ p(u) are the auxiliary variables used to construct it. Unbiasedness means that

E_u[p̂(y_{1:T} | θ, u)] = ∫ p̂(y_{1:T} | θ, u) p(u) du = p(y_{1:T} | θ).

Pseudo-marginal MCMC can therefore be defined as a standard MH algorithm on an augmented space, i.e. the space of θ augmented with the auxiliary variables u. The MCMC chain targets

p(θ, u | y_{1:T}) ∝ p̂(y_{1:T} | θ, u) p(u) π(θ),

which has the posterior p(θ | y_{1:T}) as its θ-marginal distribution, since integrating out u recovers p(y_{1:T} | θ) π(θ) by unbiasedness. The next sections describe the particle marginal Metropolis-Hastings (PMMH) and particle Gibbs (PG) algorithms proposed by Andrieu et al. (2010).

Particle Marginal Metropolis-Hastings
The PMMH algorithm is a pseudo-marginal method where the intractable likelihood is replaced by an unbiased particle filter estimate (12). As in Section 3.2, the resulting chain targets the joint density p(θ, u | y 1:T ), where u is the vector of random numbers used in the particle filter; u contains all the random numbers required to resample and move the particles (steps 3 and 4 of Algorithm 1) and its elements are usually standard uniform or standard normal.

Algorithm 2: The particle marginal Metropolis-Hastings (PMMH) algorithm.
Input: data y_{1:T}, initial values θ_0 and number of iterations I.
Output: posterior samples θ_{1:I}.
1 Initialise θ_1 = θ_0
2 Draw u ∼ p(·)
3 Run Algorithm 1 to obtain an unbiased estimate of p(y_{1:T} | θ_1, u)
4 for i = 2 to I do
5   Draw θ* ∼ q(· | θ_{i−1}) and u* ∼ p(·)
6   Run Algorithm 1 to obtain an unbiased estimate of p(y_{1:T} | θ*, u*)
7   Calculate the acceptance probability (13), with the likelihoods replaced by their estimates, and set θ_i accordingly

A drawback of the PMMH algorithm is that it can be difficult to find good proposals. Another drawback is the chain's tendency to get stuck whenever the likelihood is greatly overestimated for a particular value of θ, i.e. if p(y_{1:T} | θ) is greatly overestimated, then the acceptance probability for θ* will be small unless p(y_{1:T} | θ*) is also overestimated. This issue is mitigated by decreasing the variance of the log of the ratio of the likelihood estimates,

z = log p̂(y_{1:T} | θ*, u*) − log p̂(y_{1:T} | θ, u).    (14)

A common strategy to do this is to increase the number of particles N used in the particle filter. Since the estimates are unbiased, this increases both the precision and accuracy of the individual likelihood estimates. Sherlock et al. (2015), Pitt et al. (2012) and Doucet et al. (2015) showed that optimal performance is gained when N is chosen such that the standard deviation of the estimated log-likelihood is between 1 and 4. A superior alternative is the correlated pseudo-marginal (CPM) method of Deligiannidis et al. (2018) (see also Dahlin et al., 2015). Tran et al. (2016) introduced a variation of the CPM method called the block pseudo-marginal (BPM) approach.
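The PMMH loop above can be sketched as follows; `loglik_hat` stands in for a run of the particle filter with fresh auxiliary variables, and a Gaussian random-walk proposal with a flat prior is assumed purely for illustration.

```python
import numpy as np

def pmmh(loglik_hat, log_prior, theta0, step, n_iter, rng):
    """Particle marginal Metropolis-Hastings: a standard MH loop in which the
    intractable log-likelihood is replaced by an unbiased (particle) estimate.
    A symmetric Gaussian random-walk proposal is assumed."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    ll = loglik_hat(theta, rng)
    draws = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        theta_star = theta + step * rng.standard_normal(theta.size)
        ll_star = loglik_hat(theta_star, rng)  # fresh auxiliary variables u*
        log_alpha = ll_star + log_prior(theta_star) - ll - log_prior(theta)
        if np.log(rng.uniform()) < log_alpha:
            theta, ll = theta_star, ll_star    # accept, keeping the estimate
        draws[i] = theta
    return draws

# Toy run: a noisy Gaussian "log-likelihood estimate" stands in for Algorithm 1.
rng = np.random.default_rng(4)
noisy_ll = lambda th, r: -0.5 * float(th @ th) + 0.1 * r.standard_normal()
draws = pmmh(noisy_ll, lambda th: 0.0, [2.0], 0.5, 500, rng)
```

Note that the stored estimate `ll` is only replaced on acceptance; re-estimating it at the current θ would break the exactness of the pseudo-marginal scheme.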

Correlated Pseudo-Marginal
At any given iteration of the PMMH algorithm (Algorithm 2), the likelihood ratio is p(y 1:T | θ * , u * )/p(y 1:T | θ, u), where θ * and u * are the proposed values and θ and u are the current values. Deligiannidis et al. (2018) show that the mixing of the chain is greatly improved by correlating p(y 1:T | θ * , u * ) and p(y 1:T | θ, u). This helps to vastly reduce the variance of the log of the ratio of the estimated likelihoods (14), without having to reduce the variance of each of the individual likelihood estimates.
The correlated pseudo-marginal (CPM) approach does this by making u and u* highly correlated. Assuming the random numbers are normally distributed, Deligiannidis et al. (2018) use the Crank-Nicolson proposal

u* = ρu + √(1 − ρ²) ε, ε ∼ N(0, I),

to induce a correlation of ρ between u and u*. If the particle filter depends on non-normal random numbers, a transformation to normality is applied. Deligiannidis et al. (2018) derive some optimality results to tune the parameters of the CPM method. In particular, they use the variance of the estimated log-likelihood ratio around the mode of the posterior to tune ρ for a given number of particles N.
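The Crank-Nicolson proposal is a one-liner; this sketch checks empirically that it induces the intended correlation while leaving the standard normal distribution invariant.

```python
import numpy as np

def crank_nicolson(u, rho, rng):
    """Crank-Nicolson proposal for standard normal auxiliary variables u:
    u* = rho*u + sqrt(1 - rho^2)*eps preserves the N(0, I) distribution
    while inducing correlation rho between u and u*."""
    return rho * u + np.sqrt(1.0 - rho**2) * rng.standard_normal(u.shape)

rng = np.random.default_rng(5)
u = rng.standard_normal(200_000)
u_star = crank_nicolson(u, 0.99, rng)
corr = np.corrcoef(u, u_star)[0, 1]  # close to rho = 0.99
```

Setting rho = 0 recovers standard (uncorrelated) PMMH, so the same code covers both samplers.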
The block pseudo-marginal (BPM) approach induces correlation by updating u in blocks or subsets: the vector of random numbers u is divided into B blocks, and a single block is updated at each iteration while the remaining B − 1 are held constant. The resulting correlation between the logs of the likelihood estimates is approximately 1 − 1/B and is induced much more directly than in CPM. No assumption about the form or distribution of u is required. Tran et al. (2016) use the variance of the log-likelihood estimator to tune N for each group or block. Provided that B is sufficiently large, they derive the optimal variance for both Monte Carlo and randomised quasi-Monte Carlo (RQMC) log-likelihood estimators.
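A minimal sketch of the BPM update, with the auxiliary variables stored as a matrix of B rows (one per block); the layout is illustrative.

```python
import numpy as np

def block_update(u, rng):
    """Block pseudo-marginal update: refresh one randomly chosen block (row)
    of the auxiliary variables u and hold the remaining B - 1 blocks fixed."""
    u_star = u.copy()
    b = rng.integers(u.shape[0])                # choose one of the B blocks
    u_star[b] = rng.standard_normal(u.shape[1]) # refresh only that block
    return u_star

rng = np.random.default_rng(6)
u = rng.standard_normal((10, 50))  # B = 10 blocks of 50 random numbers each
u_star = block_update(u, rng)
changed = (u_star != u).any(axis=1).sum()  # exactly one row differs
```

With B = 10 blocks, roughly 9/10 of u is shared between successive iterations, giving the approximate correlation 1 − 1/B = 0.9 between the log-likelihood estimates.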
Relative to standard PMMH, CPM and BPM can tolerate significantly more variance in the log-likelihood estimates, so that fewer particles are needed for the chain to mix well. The resulting increase in computational efficiency typically outweighs the overhead of storing the vector of random numbers u.

Conditional Particle Filter
The particle Gibbs (PG) algorithm of Andrieu et al. (2010) requires a variation of the generic particle filter (Section 3.1) called the conditional particle filter (CPF). The CPF differs from the generic particle filter by holding a single path x_{1:T}^k invariant throughout the iterations.
The ancestral lineage B_{1:T}^k of the invariant path contains the index of each state x_t^k ∈ x_{1:T}^k relative to all other states x_t^{1:N} for t = 1, . . . , T. For example, if N = 4 and B_{1:3}^k = {4, 1, 2}, then x_{1:3}^k = (x_1^4, x_2^1, x_3^2). Once a weighted sample is obtained, a new invariant path and associated ancestral lineage may be drawn using the backwards sampling method of Whiteley (2010) and Lindsten and Schön (2012). See Algorithms 3 and 4 for more details.
Algorithm 3: The conditional particle filter.
Input: data y_{1:T}, number of particles N, initial states x_0^{1:N}, static parameters θ, invariant path x_{1:T}^k and associated ancestral lineage B_{1:T}^k.
Output: new path x_{1:T}^k and associated ancestral lineage B_{1:T}^k.
Notation: we use the convention that index (n) means 'for all n ∈ {1, . . . , N}'. The matrix A_{t−1}^n gives the parent indices of the particles at time t − 1. The relationship between the ancestral lineage and the matrix of parent indices is B_{t−1}^k = A_{t−1}^{B_t^k}.

Particle Gibbs
PMMH uses the unbiased estimate of the likelihood computed by the particle filter. In particle Gibbs, the latent states are instead updated using a conditional particle filter, i.e. x_{1:T} is approximately sampled from p(x*_{1:T} | y_{1:T}, x_{1:T}, θ) (see Algorithms 3 and 5). The parameters θ are updated using Gibbs sampling if the full conditional posterior is available, or a Metropolis-Hastings step otherwise.

Algorithm 4: Backward sampling.
Input: particles x_{1:T}^{1:N}, particle weights w_{1:T}^{1:N} and normalised particle weights W_T^{1:N}.
Output: new ancestral lineage B_{1:T}.
Notation: we use the convention that index (n) means 'for all n ∈ {1, . . . , N}'.

Since a new path x_{1:T} is simulated at each iteration, PG does not suffer from the same mixing problem as PMMH; it is significantly less sensitive to the number of particles used. PG also has the advantage that more efficient updating schemes for θ can be used, such as the Metropolis-adjusted Langevin algorithm (MALA) or Hamiltonian Monte Carlo (HMC). While PG has a number of advantages over PMMH, it is not as general, as it requires a closed-form transition density to update θ.

Algorithm 5: The particle Gibbs algorithm.

Input: data y_{1:T}, initial values θ_0, initial path x_{1:T}^0 and associated ancestral lineage B_{1:T}^0.

Methods
We are interested in parameter inference for the state-space SDEMEM described in Section 2.2. For a single individual m, with observations taken at t = 0, . . . , T − 1, the sequence of distributions (11) traversed by the particle filter (see Section 3.1) is

p(x_{m,0:t} | y_{m,0:t}, η_m, σ, φ_X), t = 0, . . . , T − 1.

This particle filter returns an estimate of p(y_m | η_m, σ, φ_X), which can be used to estimate the conditional likelihood for all individuals,

p(y_{1:M} | η_{1:M}, σ, φ_X) = ∏_{m=1}^M p(y_m | η_m, σ, φ_X).

Since the likelihood estimator for each individual is unbiased and independent, the product of the estimators is also unbiased,

E[∏_{m=1}^M p̂(y_m | η_m, σ, φ_X)] = ∏_{m=1}^M p(y_m | η_m, σ, φ_X).

If the solution of the SDE is unavailable, the transition density is approximated using the Euler-Maruyama discretisation (EMD), so the target distribution is exact only up to the discretisation error of the SDE. This error can be made arbitrarily small by increasing the level of discretisation, at the expense of increased computation (see Section 2.3).
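Because the per-individual estimators are independent, the conditional likelihood estimate is formed on the log scale by summing the per-individual log-estimates; a minimal sketch, where `run_filter` stands in for a particle filter returning log p̂(y_m | η_m, σ, φ_X).

```python
import numpy as np

def conditional_loglik(y_all, eta_all, run_filter, rng):
    """Sum independent per-individual particle-filter log-likelihood estimates;
    the product of unbiased estimators is itself unbiased."""
    return sum(run_filter(y_m, eta_m, rng) for y_m, eta_m in zip(y_all, eta_all))

# Stub filter for illustration only: returns a fixed log-estimate per individual.
rng = np.random.default_rng(7)
stub = lambda y_m, eta_m, r: -1.5
total = conditional_loglik([None] * 4, [None] * 4, stub, rng)
```

In practice the M filter runs are independent and can be executed in parallel across individuals.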

Individual-Augmentation Pseudo-Marginal
Our first method is individual-augmentation pseudo-marginal (IAPM), named for the additional auxiliary variables required to estimate the likelihood for each individual. Here, the likelihood estimate is

p̂(y_m | θ) = (1/L) Σ_{l=1}^L p̂(y_m | η_m^l, σ, φ_X) p(η_m^l | φ_η) / g(η_m^l | θ), η_m^l ∼ g(· | θ),

using the importance distribution g(η_m | θ) within a PMMH algorithm (Algorithm 2); see Algorithms 6 and 7 for more details.
The variability of p̂(y_m | θ) for a given g(η_m | θ) is controlled by the number of particles N, as well as the number of random effects draws L. The choice of importance distribution g(· | θ) has an important impact on both of these quantities. A naive choice is g(η_m | θ) = p(η_m | θ). While this simplifies the likelihood calculation, it can be very inefficient if p(η_m | y_m, θ) and p(η_m | θ) are not similar. We propose instead to use a Laplace approximation of a distribution over η_m that is proportional to p(y_m | x̃_m, θ) p(η_m | θ), where x̃_m is an approximation of x_m. We present two choices for x̃_m. The first uses the solution of the ODE given by the drift of the SDEMEM (4), dx̃_{m,t} = μ(x̃_{m,t}, φ_X, η_m) dt. The second approximates x_m with the mean of the modified diffusion bridge (see Section 2.3), with Δ_k = Δ_t = ξ_{m,t+1} − ξ_{m,t}, such that x̃_{m,t+1} is the MDB mean with x̃_{m,t} in place of x_{τ_k}. We refer to these importance distributions as Laplace-ODE and Laplace-MDB respectively.
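The IAPM estimate for one individual can be sketched as an importance sampling average computed on the log scale; the function names are illustrative stand-ins for the importance distribution, the random effects prior and a particle filter run.

```python
import numpy as np

def iapm_loglik_m(L, sample_g, log_g, log_p_eta, pf_loglik, rng):
    """IAPM log-estimate of p(y_m | theta): average, over L draws of the
    random effects from the importance distribution g, of the particle filter
    estimate weighted by p(eta | theta) / g(eta | theta)."""
    terms = np.empty(L)
    for l in range(L):
        eta = sample_g(rng)
        # log [ p_hat(y_m | eta, theta) p(eta | theta) / g(eta | theta) ]
        terms[l] = pf_loglik(eta, rng) + log_p_eta(eta) - log_g(eta)
    c = terms.max()
    return c + np.log(np.mean(np.exp(terms - c)))  # log of the average weight

# Degenerate check: with g = p (weights cancel) and a constant filter estimate,
# the estimator returns that constant exactly.
rng = np.random.default_rng(8)
est = iapm_loglik_m(50, lambda r: 0.0, lambda e: 0.0, lambda e: 0.0,
                    lambda e, r: -3.5, rng)
```

The closer g(· | θ) is to p(η_m | y_m, θ), the smaller the variance of the L weighted terms, which is the motivation for the Laplace-ODE and Laplace-MDB choices.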

Algorithm 6: The IAPM algorithm.
Input: data y_{1:M}, initial values θ_0 and number of iterations I.
Output: posterior samples θ_{1:I}.
1 Initialise θ_1 = θ_0
2 Draw u ∼ p(·)
3 Run Algorithm 7 to obtain the likelihood estimate p̂(y_{1:M} | θ_1, u)
4 for i = 2 to I do
5   Draw θ* ∼ q(· | θ_{i−1}) and u* ∼ p(·)
6   Run Algorithm 7 to obtain the likelihood estimate p̂(y_{1:M} | θ*, u*)
7   Calculate the acceptance probability (13), with the likelihoods replaced by their estimates, and set θ_i accordingly

Section 3.2 describes a correlated version of PMMH using the block pseudo-marginal approach, which can also be applied to IAPM (cIAPM). We now briefly outline how to do this. Let u = (u_RE, u_PF), where u_RE and u_PF are the random numbers used to draw the random effects and those used in the particle filter respectively. At each iteration of the chain, new random numbers for a single individual m, 1 ≤ m ≤ M, are proposed, while the rest are held constant. This induces a correlation of approximately 1 − 1/M between successive log-likelihood estimates. RQMC is straightforward to use within cIAPM as the random numbers are independent when using BPM; while correlated RQMC random numbers are possible, they are very difficult to implement effectively (see Gunawan et al., 2016).

Example (SDEMEM with constant drift and diffusion). For the SDEMEM in (6), the IAPM approximation of p(y_m | θ) with importance distribution g(η_m | θ) is obtained by averaging over L draws from g(η_m | θ), where p(y_m | η_m, σ, γ) is available in closed form since the transition and observation densities are both Gaussian.

Component-Wise Pseudo-Marginal
Our second method is component-wise pseudo-marginal (CWPM). Writing θ_X = (σ, φ_X), this method updates the parameters component-wise from their conditional distributions,

η_{1:M} | y_{1:M}, θ_X, φ_η, θ_X | y_{1:M}, η_{1:M} and φ_η | η_{1:M}, with p(φ_η | η_{1:M}) ∝ p(η_{1:M} | φ_η) π(φ_η).    (15)

A particle filter estimate of p(y_{1:M} | η_{1:M}, θ_X) is used when updating η_{1:M} and θ_X (Algorithm 1). The parameter φ_η is updated directly, however, since (15) is tractable. See Algorithm 8 for more details. This method is generally faster than IAPM as the particle filter is called 2M times per MCMC iteration (with the above configuration), instead of LM times as in IAPM. However, the CWPM chain may mix poorly if there is high correlation between η_{1:M} and θ.
A correlated version of CWPM (cCWPM) may be implemented using BPM. Again, only the random numbers for a single individual are updated at each iteration while the rest are held constant.

Example (SDEMEM with constant drift and diffusion). For the SDEMEM in (6), the parameters are updated in the blocks η_{1:M}, θ_X and φ_η.

Algorithm 8: The CWPM algorithm. At each iteration: draw η*_{1:M} and u*, run Algorithm 1 to obtain the likelihood estimate p̂(y_{1:M} | η*_{1:M}, θ_X, u*), and accept η*_{1:M} and u* with the corresponding Metropolis-Hastings probability; draw θ*_X and u*, run Algorithm 1 to obtain the likelihood estimate p̂(y_{1:M} | η^i_{1:M}, θ*_X, u*), and accept θ*_X and u* with probability

α = min{1, p̂(y_{1:M} | η^i_{1:M}, θ*_X, u*) π(θ*_X) q(θ_X | θ*_X) / [p̂(y_{1:M} | η^i_{1:M}, θ_X, u) π(θ_X) q(θ*_X | θ_X)]};

finally, update φ_η directly from its conditional distribution. For the correlated version, new random numbers are only drawn for a single block in steps 4 and 7, not the whole vector.

Mixed Particle Method
Our final method is a variation of the PMMH + PG algorithm of Gunawan et al. (2018a).
We use a combination of PMMH and PG to update the parameters η_{1:M}, σ, φ_X and φ_η, depending on the form of the full conditional distributions,

p(η_{1:M} | y_{1:M}, σ, φ_X, φ_η) ∝ p(y_{1:M} | η_{1:M}, σ, φ_X) p(η_{1:M} | φ_η),    (16)
p(φ_X | y_{1:M}, η_{1:M}, σ) ∝ p(y_{1:M} | η_{1:M}, σ, φ_X) π(φ_X),    (17)
p(σ | y_{1:M}, x_{1:M}) ∝ p(y_{1:M} | x_{1:M}, σ) π(σ),
p(φ_η | η_{1:M}) ∝ p(η_{1:M} | φ_η) π(φ_η).

At each iteration, the invariant path x_{1:M} is updated using a conditional particle filter (Algorithm 3). Where the density p(y_{1:M} | η_{1:M}, σ, φ_X) is required, i.e. in (16) and (17), a particle filter estimate is used (PMMH step). The full conditionals for σ and φ_η are tractable as they only depend on the observation density and the priors for the random effects and the fixed common parameters, all of which are known. Hence, σ and φ_η can be updated directly. It is important that the likelihood estimate is updated once a new value of σ is accepted; this must be done with the same u that was used to estimate the previous likelihood. See Algorithm 9 for more details. As with CWPM (Section 4.2), mixing of the Markov chain can be poor if there is high correlation between η_{1:M} and θ and/or x_{1:M} and σ.
Similarly to IAPM and CWPM, a correlated version of MPM (cMPM) can be implemented using BPM, where u is divided into M blocks based on the individuals m = 1, . . . , M.

Example (SDEMEM with constant drift and diffusion). For the SDEMEM in (6), the parameters are updated in the blocks η_{1:M} and φ_X (PMMH), and σ and φ_η (directly), with the latent states updated using PG.

Likelihood Estimation
Three particle MCMC methods are introduced: IAPM, CWPM and MPM; each relies on a particle filter to calculate an unbiased estimate of the intractable likelihood. Tuning parameters for this calculation are the level of discretisation D (if the exact transition density is unknown), the number of particles N and, for IAPM, the number of random effects draws L.
The likelihood estimator in PMMH is typically tuned such that the variance of the log-likelihood ratio is between 1 and 4 (Sherlock et al., 2015; Pitt et al., 2012; Doucet et al., 2015). This optimises the trade-off between statistical and computational efficiency, i.e. the number of particles versus the computation time. Tuning is usually done through experimentation at a central location of the posterior, which is often obtained using pilot runs. Deligiannidis et al. (2018) and Tran et al. (2016) also use this approach for their correlated likelihood estimators. Deligiannidis et al. (2018) use trial-and-error to tune the correlation of the random numbers such that the standard deviation is approximately 1.4. Tran et al. (2016) derive the optimal standard deviation of the log-likelihood ratio when Monte Carlo and randomised quasi-Monte Carlo random numbers are used to estimate the likelihood. For the latter, they obtain 0.82/(1 − ρ²)^{1/2}, where ρ = 1 − 1/B is the approximate correlation between the log-likelihoods and B is the number of blocks.

Algorithm 9: The mixed particle method (MPM) algorithm.

Input: data y_{1:M}, initial values η^0_{1:M}, σ^0, φ_X^0, φ_η^0, and initial path x^0_{1:M} with associated ancestral lineage.
At each iteration: run Algorithm 1 to obtain the likelihood estimate p̂(y_{1:M} | η*_{1:M}, σ, φ_X, u*) and accept η*_{1:M} and u* with the corresponding acceptance probability; accept σ* with its acceptance probability and, once a new value of σ is accepted, run Algorithm 1 (with the same u) to update the likelihood estimate; run Algorithm 1 to obtain the likelihood estimate p̂(y_{1:M} | η_{1:M}, σ, φ*_X, u*) and accept φ*_X and u* with the corresponding acceptance probability. The random effects η_{1:M} and parameters φ_X are updated using PMMH. The latent states x_{1:M} are updated using PG, and σ and φ_η are updated directly. For the correlated version, new random numbers are only drawn for a single block in steps 4 and 10, not the whole vector.

Thus, for a correlation between 0.8 and 0.99 (5 to 100 blocks), the target or optimal standard deviation is between 1.37 and 5.81. Tran et al. (2016) also use a different number of particles to estimate the likelihood for each block. The simplifying assumption that L = N is made for IAPM. The number of particles for each of our methods can then be tuned through experimentation, by selecting N such that 1.4 ≤ σ_Δ ≤ 1.8, where σ_Δ is the standard deviation of the log-likelihood ratio. A fixed value of the level of discretisation D is used throughout. In many cases, it is possible to select a reasonable value of D based only on the model. To simplify the tuning process, the same number of particles is used across all individuals; however, efficiency gains are possible if this value is allowed to vary between subjects. Table 1 shows the tuning parameters for all methods. Assumptions to simplify the tuning process are provided where available.

Table 1: Tuning parameters for all methods.

  Tuning parameter | Description                         | Simplifying assumption
  D†               | level of discretisation for the SDE | -
  N                | number of particles                 | -
  L*               | number of random effects draws      | L = N

† Only needed if the exact transition density is unknown. * IAPM only.
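The pilot-run tuning described above can be sketched as follows; the synthetic estimator, grid of candidate N values and target are illustrative only.

```python
import numpy as np

def tune_particles(loglik_hat, theta, n_grid, target_sd, n_reps, rng):
    """Pilot-run tuning: estimate the standard deviation of the log-likelihood
    estimator at a central value theta for increasing N, and return the
    smallest N whose estimated sd falls below the target."""
    sd = np.inf
    for N in n_grid:
        sd = np.std([loglik_hat(theta, N, rng) for _ in range(n_reps)], ddof=1)
        if sd <= target_sd:
            return N, sd
    return n_grid[-1], sd  # fall back to the largest N tried

# Illustration: a synthetic estimator whose sd decays as 5/sqrt(N).
rng = np.random.default_rng(9)
fake = lambda th, N, r: r.normal(0.0, 5.0 / np.sqrt(N))
N_opt, sd_hat = tune_particles(fake, None, [1, 4, 16, 64], 1.6, 200, rng)
```

In practice `loglik_hat` would be repeated runs of the particle filter at a central posterior location obtained from a pilot chain.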
It is also necessary to specify a proposal function for the particle filter and, for IAPM, the importance density. Section 2.3 describes three different ways to simulate from an SDE: the Euler-Maruyama discretisation (EMD), the modified diffusion bridge (MDB) and the residual bridge (RB). Any of these can be used to move particles within a particle filter. Section 4.1 also proposes the Laplace-ODE and Laplace-MDB importance densities for IAPM. The optimal choice of the proposal function and the importance density is problem specific and may have a large impact on the efficiency of the likelihood estimate. In general, it is possible to choose the importance density based on the proposal function, i.e. EMD + Laplace-ODE (or L-ODE) and MDB/RB + Laplace-MDB. Recall that the Laplace-ODE approximates the underlying states using the ODE specified by the drift of the SDEMEM; the feasibility of this importance density depends on how quickly the solution of the ODE can be computed. As with D, exploration of the model may indicate a sensible choice of proposal function. For CWPM and MPM, we recommend using more efficient proposals for φ_η and σ (MPM) if possible, e.g. those based on MALA or HMC.

Data
We apply our methods to real data from a tumour xenography study on mice obtained from Picchini and Forman (2019). The study had 4 treatment groups and 1 control group, and each group had 7-8 mice. Measurements were taken every Monday, Wednesday and Friday for six weeks; however, the majority of the mice were euthanized before the end of the study, once their tumour volumes exceeded 1000 cubic mm.
We focus specifically on group 5 (the control group). There are 7 mice in this group, with 2-14 observations per mouse and 34 observations in total. Only one mouse in this group survived longer than 11 days, being euthanized on day 32 of the study. Figure 1 plots this data.

Model
To fit the data, we consider an adaptation of the SDEMEM used by Picchini and Forman (2019) for unperturbed growth. There are m = 1, . . . , M subjects, with measurements taken at discrete times ξ_t, t = 1, . . . , T_m, where T_m is the number of observations for subject m. The model is given by (18), where V_{m,t} is the volume of subject m at time ξ_t. The underlying ODE model has solution V_{m,t} = v_{m,0} exp(β_m t), which is the general exponential growth model. The random effects for this model are the parameters β_m and V_{m,0}, which are assigned the prior distributions. Since both β_m and V_{m,0} are constrained to be positive, they are updated on the log scale. The observations are modelled by (19). Since the data is observed on the log scale, the transformation X_{m,t} = log(V_{m,t}) can be applied to (18) and (19) using Itô's lemma. The full model is then given by (20). The likelihood is intractable since model (20) does not have a closed-form solution for X_{m,t}. The following prior is assigned to θ = (μ_{X_0}, σ_{X_0}, μ_β, σ_β, γ, σ, ρ): p(θ) = N(μ_{X_0}; 3, 4^2) HN(σ_{X_0}; 5^2) N(μ_β; 0, 4^2) HN(σ_β; 5^2) HN(γ; 5^2) × HN(σ; 5^2) N(ρ; 1, 0.5^2), where HN(σ^2) refers to the half-normal distribution with mean zero and scale parameter σ.
Note that taking ρ = 1 gives model (7), which is the original SDEMEM used by Picchini and Forman (2019). We add the parameter ρ, which both allows for a more flexible variance and renders the transition density intractable. We test this model on the dataset introduced in Section 5.1. To ensure numerical stability when simulating from the SDE, we scaled the observation times by the maximum observed time. In addition to the real data, we also apply our methods to synthetic data simulated from model (20) using θ = (μ_{X_0}, σ_{X_0}, μ_β, σ_β, γ, σ, ρ) = (3, 1, −1, 1, 1, 0.5, 1). For the synthetic data, we assumed 1000 mice with 457 observations each; this corresponds to a measurement every hour for 19 days following the initial measurement. We used 9 subsets of this dataset, covering all combinations of 10, 100 and 1000 subjects and an observation every 24 hours (20 observations), 12 hours (39 observations) and 1 hour (457 observations). We refer to these datasets as sim(M, H), where M is the number of subjects (10, 100 or 1000) and H is the number of hours between observations (24, 12 or 1). For example, the subset of 100 subjects with an observation every 12 hours is denoted sim(100, 12), while the full dataset is denoted sim(1000, 1). When M is left blank, we refer to all datasets with the specified value of H, e.g. sim(, 1) represents sim(10, 1), sim(100, 1) and sim(1000, 1). Similarly, when H is left blank, we refer to all datasets with the specified value of M, e.g. sim(1000, ) represents sim(1000, 24), sim(1000, 12) and sim(1000, 1). The performance of our methods on these datasets indicates their scalability with respect to the density of the time series and the number of subjects. Figure 2 plots this data.
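As a quick consistency check on the dataset sizes quoted above (an illustrative Python sketch, not the study's data-processing code), thinning the hourly grid of 19 days plus the initial measurement reproduces the stated per-subject observation counts:

```python
# Hourly grid: initial measurement plus one per hour for 19 days.
hours = list(range(0, 19 * 24 + 1))   # 457 time points

def n_obs(H):
    """Observations per subject when keeping every H-th hour (H = 24, 12 or 1)."""
    return len(hours[::H])

counts = {H: n_obs(H) for H in (24, 12, 1)}
print(counts)   # {24: 20, 12: 39, 1: 457}
```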

Likelihood Estimation Results
All code is implemented in MATLAB. Vectorisation and parallelisation are applied where possible; e.g. in the particle filter we vectorise the particle operations and parallelise over the subjects. Parallelisation is only applied if the average number of observations per subject is greater than 10, and it is not used for CWPM and MPM on the sim(10, 24) dataset as it increased the computation time. For IAPM we also parallelise over the random effects draws when running the importance sampler. Our results are produced using 8 cores. Resampling is done at every iteration in the conditional particle filter, but we use adaptive resampling when estimating the likelihood (resampling when ESS < N/2). For block pseudo-marginal, we use B = min(M, 100) blocks and update them systematically.
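The adaptive resampling rule described above (resample only when ESS < N/2) can be sketched as follows. This is an illustrative Python version with multinomial resampling assumed; the paper's MATLAB implementation may use a different resampling scheme.

```python
import numpy as np

def maybe_resample(particles, logw, rng):
    """Adaptive resampling: resample only when the effective sample size
    ESS = 1 / sum(w^2) drops below N/2.

    particles: (N, ...) array; logw: unnormalised log-weights.
    Returns (particles, logw, resampled_flag).
    """
    N = len(logw)
    w = np.exp(logw - np.max(logw))       # normalise stably
    w = w / np.sum(w)
    ess = 1.0 / np.sum(w ** 2)
    if ess < N / 2:
        idx = rng.choice(N, size=N, p=w)  # multinomial resampling
        return particles[idx], np.zeros(N), True  # uniform weights after resampling
    return particles, logw, False
```

With uniform weights the ESS equals N, so no resampling occurs; with a single dominant weight the ESS approaches 1 and resampling is triggered.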
We first consider the efficiency of the likelihood estimation. For each of the three methods, all possible combinations of proposal function and importance density (IAPM) are tested. We define the naive combination as the IAPM algorithm with the prior as importance density and the Euler-Maruyama discretisation (EMD) as the proposal function in the particle filter. The naive method is the uncorrelated version of this combination.
As outlined in Section 4.4, the tuning parameters are set such that 1.4 ≤ σ_Δ ≤ 1.8. Measurements are calculated from a minimum of 1000 log-likelihood estimates at a fixed value of θ and η_{1:M} (CWPM). For the real data, we use θ = (4, 1, 2, 1, 1.6, 0.05, 1), which is obtained from a few preliminary MCMC runs (low values of N, L and D are sufficient for this). For the simulated data, we use the true value θ = (3, 1, −1, 1, 1, 0.5, 1). The random effects η_{1:M} are determined similarly, using preliminary runs for the real data and the true values for the synthetic data.
We define the level of discretisation (D) as the number of intermediate timepoints between each observation. The results seem insensitive to this value, so D is fixed at 10 for all methods. Computation is stopped if a single log-likelihood estimate takes more than 15 minutes or requires more than 150 GB of RAM. This section uses the notation 'importance density + proposal function' to refer to a particular combination of the two, e.g. prior + RB. All combinations are tuned to roughly the same statistical efficiency (based on σ_Δ), so that the most efficient method is the one with the lowest computation time. Further mention of statistical efficiency refers to the values of the tuning parameters N and L.
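The rule "choose N so that 1.4 ≤ σ_Δ ≤ 1.8" can be automated with a simple doubling search over N. The helper below is our illustrative assumption (the `loglike_hat` interface, the doubling rule and the repetition count are not from the paper):

```python
import numpy as np

def tune_num_particles(loglike_hat, N0=8, reps=200, upper=1.8, max_N=10_000):
    """Double N until the standard deviation of repeated log-likelihood
    estimates falls below the upper target. loglike_hat(N) must return one
    noisy log-likelihood estimate computed with N particles."""
    N = N0
    while True:
        sigma = np.std([loglike_hat(N) for _ in range(reps)])
        if sigma <= upper or N >= max_N:
            return N, sigma
        N *= 2

# Toy example: the estimator's variance typically scales like 1/N.
rng = np.random.default_rng(1)
N, sigma = tune_num_particles(lambda N: rng.normal(0.0, 4.0 / np.sqrt(N)))
```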

IAPM
Of the three methods, IAPM is the most difficult and time-consuming to tune. Assuming L = N simplifies the tuning process, but is sub-optimal. Depending on the implementation of the code, having a larger/smaller N or L can significantly improve the computation time.
When testing combinations, we found that the variance of the Laplace-ODE importance density approaches 0 for at least one of the random effects, such that the draws for that random effect are approximately equal. We address this by setting the covariance to a diagonal matrix of the prior variances scaled by 0.5; the altered importance density is denoted L-ODE.
Tables 3-5 summarize the log-likelihood results for all datasets. Dashed lines indicate that the computation time exceeds the 15-minute limit per likelihood estimate. All combinations exceed this limit on the sim(1000, 1) dataset. Likewise, for the sim(1000, 24) and sim(1000, 12) datasets, results can only be obtained within the time limit for the Laplace-MDB importance density. On these datasets, we find that the variance of the estimated σ_Δ is very high for the correlated versions of the prior and L-ODE combinations.
Correlating the log-likelihoods generally increases the statistical efficiency. This increase is significant on the larger datasets, as is the corresponding reduction in computation time. Interestingly, for all sim(10, ) datasets, the uncorrelated Laplace-MDB + EMD is more statistically efficient than the correlated version. This is also true for Laplace-MDB + MDB and Laplace-MDB + RB on the sim(10, 1) dataset.
Across all datasets, the Laplace-MDB importance density outperforms the prior and L-ODE in terms of overall efficiency. Of the prior and L-ODE, the latter shows the poorest performance. Results for the uncorrelated versions are only available for the real, sim(10, 24) and sim(10, 12) datasets and these are also the only datasets with L-ODE combinations that outperform the prior. Based on these results, the drift ODE may not be a good approximation of the underlying states. A large diffusion coefficient and/or measurement error can account for this.
The most efficient proposal function depends on the size of the dataset. In terms of statistical efficiency, the modified diffusion bridge and residual bridge give nearly identical results and generally outperform the Euler-Maruyama discretisation. The latter is the fastest, however, and the residual bridge is the slowest. While this has little effect on the smallest datasets, the time difference is significant on the larger ones. Correspondingly, the Euler-Maruyama discretisation gives the best results on the sim(, 1) datasets, while the diffusion bridges are more efficient for the rest.

Table 7: Log-likelihood results for the CWPM method on the sim(100, 12), sim(100, 1), sim(1000, 24) and sim(1000, 12) datasets. The highlighted rows show the combinations which give the best time. Notation: PF = proposal function used in the particle filter, Cor. = whether the likelihood estimates are correlated.
On the datasets where results for the naive combination are available, a significant increase in relative efficiency (from the naive combination to the best one) is obtained. Interestingly, this improvement is smaller on the sim(10, 1) and sim(100, 1) datasets. For the latter, the value of N is the same for the prior and Laplace-MDB importance densities. On the real data, N reduces from 200 to 4 in the uncorrelated naive case, and from 90 to 3 in the correlated one. A corresponding 2.75-fold decrease in time is observed from the correlated naive combination to the best one.

CWPM
For CWPM, it is only necessary to select a proposal function and a value for N; again, this is done through experimentation. Tables 6-7 show results for all datasets.
Dashed lines indicate that the memory limit of 150 GB of RAM per likelihood is exceeded. Due to this limit, no results could be obtained for the sim(1000, 1) dataset.
For all datasets, the correlated versions give the best results. Since the correlation induced is approximately 1 − 1/M , the relative gain in efficiency increases with the number of subjects. In contrast, the number of particles needed for the standard versions grows quickly with the size of the dataset.
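To see where the 1 − 1/M figure comes from: if the log-likelihood is a sum of M independent per-subject terms with equal variance (one block per subject) and only one block is refreshed per iteration, successive estimates share M − 1 of the M terms, giving a correlation of (M − 1)/M. A small simulation confirms this (our illustration, with a randomly chosen block for simplicity, whereas the paper updates blocks systematically):

```python
import numpy as np

rng = np.random.default_rng(2)
M = 50                 # one block per subject (illustrative)
old, new = [], []
for _ in range(20_000):
    terms = rng.normal(size=M)   # per-subject log-likelihood noise
    old.append(terms.sum())
    j = rng.integers(M)          # refresh a single block...
    terms[j] = rng.normal()      # ...keeping the other M - 1 terms fixed
    new.append(terms.sum())

corr = np.corrcoef(old, new)[0, 1]
print(corr)            # close to 1 - 1/M = 0.98
```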
As with IAPM, the most efficient proposal function depends on the number of observations per subject. MDB/RB, MDB and EMD gives the best results for the sim(, 24), sim(, 12) and sim(, 1) datasets respectively. For the latter, any benefit in statistical efficiency from the bridges is outweighed by the increase in computation time. The MDB and RB also give the best results on the real data.

MPM
This method uses the same log-likelihood estimate as CWPM, so no extra tuning is required. When N > 1, we use the same number of particles for the conditional particle filter as for the standard particle filter. When N = 1, as for the real data (see Table 6), we add an extra particle to account for the invariant path.

MCMC Results
We use the time per log-likelihood estimate from Section 6 to determine which methods to run, i.e. at most 2 seconds for IAPM, 1 second for CWPM and 0.5 seconds for MPM. The best proposal function and importance density (for IAPM) from Section 6 are also used. Where the MDB and RB proposal functions give similar results, the MDB is preferred. Due to the time constraints, the naive method (uncorrelated IAPM with prior + EMD) is only run on the real and sim(10, 24) datasets. No results are obtained for the sim(100, 1) or sim(1000, 1) datasets.
Each method is run for 100,000 iterations, starting at the same value of θ that was used in Section 6. We use random walk proposals for the parameters which cannot be updated directly, i.e. those updated with a PMMH step. In CWPM and MPM, we also use pre-conditioned MALA to update the random effects hyperparameters {μ_{X_0}, σ_{X_0}, μ_β, σ_β}, and in MPM, we use a slice sampler to update σ. Tuning parameters for these proposals include the random walk covariance (also used as the MALA pre-conditioning matrix) and the stepsize for MALA. These values are tuned through experimentation.
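A pre-conditioned MALA proposal of the kind described above can be sketched as follows; this is an illustrative Python version (function names are ours), and the MH acceptance ratio must also account for the asymmetric proposal density, computed here by `mala_logq`:

```python
import numpy as np

def mala_propose(theta, grad_logpost, step, precond, rng):
    """Pre-conditioned MALA: a Langevin drift towards higher posterior
    density plus Gaussian noise with covariance step^2 * precond (here the
    tuned random walk covariance serves as the pre-conditioning matrix)."""
    chol = np.linalg.cholesky(precond)
    mean = theta + 0.5 * step**2 * precond @ grad_logpost(theta)
    return mean + step * chol @ rng.standard_normal(theta.size)

def mala_logq(to, frm, grad_logpost, step, precond):
    """Log proposal density q(to | frm) up to a constant; the normalising
    constant cancels in the MH ratio since the covariance is fixed."""
    mean = frm + 0.5 * step**2 * precond @ grad_logpost(frm)
    diff = to - mean
    return -0.5 * diff @ np.linalg.solve(step**2 * precond, diff)
```

The MH ratio for a move θ → θ' then multiplies the posterior ratio by exp(mala_logq(θ, θ', …) − mala_logq(θ', θ, …)).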
We compare the methods based on the multivariate effective sample size (multiESS) (Vats et al., 2015) of θ and the computation time in minutes. A score for each method is calculated as the approximate rate of independent samples per minute (multiESS/time). Table 8 shows the score for each method. Table 1 in Appendix A shows the breakdown of the multiESS for each update block. Tables 2 and 3 in Appendix A show the acceptance rates (AR) for the three methods on all datasets, and Figure 1 in Appendix A shows the marginal posteriors of θ for all datasets. As expected, the marginal posteriors become more precise as the size of the dataset grows (via more subjects and/or more densely observed time series). The jagged marginal posteriors for the sim(1000, 12) dataset may be due to Monte Carlo error, as the multiESS for {γ, σ, ρ} is relatively small. There is a large increase in multiESS from IAPM to CWPM and MPM on all datasets. This is partly due to the higher multiESS for the X_0 hyperparameters (see Table 1 in Appendix A) and the more efficient proposals used for φ_η and σ (in MPM). For both the real and synthetic data, MPM gives the highest multiESS, followed by CWPM. Table 8 shows that CWPM has the largest score due to its relatively short computation time. In general, CWPM runs much faster than the other two methods.
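The multivariate ESS of Vats et al. (2015) compares the sample covariance of the chain with a batch-means estimate of its long-run covariance: multiESS = n · (det Λ / det Σ)^(1/p). The sketch below is our simplified version, not the authors' implementation:

```python
import numpy as np

def multi_ess(chain):
    """Multivariate effective sample size: n * (det(Lambda)/det(Sigma))^(1/p),
    with Lambda the sample covariance and Sigma a batch-means estimate of the
    long-run (Markov chain) covariance. chain: (n, p) array of draws."""
    n, p = chain.shape
    b = int(np.sqrt(n))                   # batch size ~ sqrt(n)
    a = n // b                            # number of batches
    batch_means = chain[:a * b].reshape(a, b, p).mean(axis=1)
    lam = np.cov(chain, rowvar=False)
    sigma = b * np.cov(batch_means, rowvar=False)
    return n * (np.linalg.det(lam) / np.linalg.det(sigma)) ** (1.0 / p)

# For an iid "chain" the multiESS should be close to the number of draws n.
rng = np.random.default_rng(4)
ess = multi_ess(rng.standard_normal((10_000, 3)))
```

For a positively autocorrelated chain, det Σ exceeds det Λ and the multiESS falls below n accordingly.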

Discussion
We introduced three methods for simulation-consistent parameter inference for state-space SDEMEMs and outlined some strategies for improving the efficiency of the likelihood estimate for these methods through the choice of importance density and proposal function. The efficiency of the calculation is generally also increased by correlating successive log-likelihood estimates. Wiqvist et al. (2019) concurrently and independently introduced a method for SDEMEMs that is very similar to our CWPM method. They propose the same update blocks for the parameters as in CWPM and give three variations of this approach, namely naive Gibbs, blocked Gibbs and a correlated PMMH method. In the first, the random numbers u are updated whenever the likelihood is estimated. In blocked Gibbs, u is updated with the random effects but kept fixed for the other parameter blocks. Lastly, their correlated PMMH method uses the approach of Deligiannidis et al. (2018) to correlate the likelihoods, i.e. by correlating the random numbers (see Section 3.2).
Our approach differs in that we use the block pseudo-marginal (BPM) method of Tran et al. (2016). For mixed effects models, BPM has a number of advantages over CPM: a) it is simple to implement; b) it induces the correlation between the log-likelihood estimates more directly; c) it is much more straightforward to use with RQMC; d) its efficient implementation only requires the random seed to be stored, which can greatly reduce the computational storage requirements. A drawback of BPM is that the correlation is limited by the number of subjects. If there are few subjects, then CPM may be more effective at inducing correlation. An attractive option in this case is to combine BPM with CPM, i.e. correlate the auxiliary variables in the current block while keeping the rest fixed. The feasibility of this approach is an area of future research. Unlike Wiqvist et al. (2019), we have not explored different strategies for updating the random numbers; the approach we use in our example in Sections 5-7 most closely follows their naive Gibbs approach.
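The suggested combination can be sketched as follows: refresh only the auxiliary variables u of the current block, optionally moving them with a Crank-Nicolson (CPM-style) step rather than drawing them fresh. This is our illustration of the idea, not an implementation from either paper:

```python
import numpy as np

def bpm_propose(u_blocks, j, rng, rho=0.0):
    """Propose new auxiliary variables for block j, keeping the others fixed.

    rho = 0 gives plain block pseudo-marginal (a fresh N(0, 1) draw for the
    block); 0 < rho < 1 additionally correlates within the block as in CPM,
    while preserving the N(0, 1) marginal of u.
    """
    proposal = [u.copy() for u in u_blocks]
    eps = rng.standard_normal(u_blocks[j].shape)
    proposal[j] = rho * u_blocks[j] + np.sqrt(1.0 - rho**2) * eps
    return proposal
```

With B = min(M, 100) blocks updated systematically, as in our implementation, j would cycle through 0, …, B − 1 across iterations.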
To further improve efficiency, we exploit bridge proposals in the particle filter rather than proposing directly from the (approximate) transition density as in the standard bootstrap filter used by Wiqvist et al. (2019). By including the IAPM and MPM methods, our paper provides a more comprehensive suite of particle methods for application to general state-space SDEMEMs. Wiqvist et al. (2019) allow the number of particles to vary between individuals, which is also straightforward to implement in our methods; see also Tran et al. (2013).
The IAPM, CWPM and MPM methods are much more efficient than the naive method; for the majority of the simulated datasets, the naive approach is computationally infeasible. The best method to use greatly depends on the model and data; IAPM is a good choice when there are few parameters, while CWPM and MPM may be preferable when there are many parameters; see Gunawan et al. (2018a). These methods are also flexible in the sense that they can be tailored to a specific model and used in combination, e.g. by integrating over a subset of the random effects using IAPM, but updating the rest using CWPM or MPM steps. Note that if IAPM is combined with MPM, then the invariant path from the conditional particle filter may be used for x m in the importance sampler. CWPM gives the best results for the example in Sections 5-7. In general, this method has the shortest computation time and is the easiest to tune; however, as noted before, care must be taken if high correlation exists between the random effects and model parameters.
Tables 1 and 2 in Section 4.4 summarize our recommendations on how to set the tuning parameters in the new sampling methods. The optimal selection of the tuning parameters is beyond the scope of our article, and is the subject of our ongoing research; we note as well that there are very few optimal results for tuning parameters in particle Metropolis within Gibbs MCMC sampling schemes.
Appendix B applies our methods to an SDEMEM based on an Ornstein-Uhlenbeck (OU) process and compares the exact posterior, i.e. one obtained without discretisation (as the OU process has a computable transition density), with the posterior obtained using the Euler-Maruyama approximation with D = 10. Both versions give the same marginal posteriors for all methods; however, the discretised versions take longer than the exact ones. CWPM gives the best results here as well.
Lastly, zero-variance control variates (Mira et al., 2013;Friel et al., 2016;South et al., 2019) may be used to further reduce the variance of any expectation estimated from the chains, e.g. the expectation of the target with respect to the auxiliary variables. Efficiency of the methods may also be increased through non-centered parameterisations of the random effects η 1:M (Papaspiliopoulos et al., 2007).