Sequential Bayesian Experimental Design for Implicit Models via Mutual Information

Bayesian experimental design (BED) is a framework that uses statistical models and decision making under uncertainty to optimise the cost and performance of a scientific experiment. Sequential BED, as opposed to static BED, considers the scenario where we can sequentially update our beliefs about the model parameters through data gathered in the experiment. A class of models of particular interest for the natural and medical sciences are implicit models, where the data generating distribution is intractable, but sampling from it is possible. Even though there has been a lot of work on static BED for implicit models in the past few years, the notoriously difficult problem of sequential BED for implicit models has barely been touched upon. We address this gap in the literature by devising a novel sequential design framework for parameter estimation that uses the Mutual Information (MI) between model parameters and simulated data as a utility function to find optimal experimental designs, which has not been done before for implicit models. Our approach uses likelihood-free inference by ratio estimation to simultaneously estimate posterior distributions and the MI. During the sequential BED procedure we utilise Bayesian optimisation to help us optimise the MI utility. We find that our framework is efficient for the various implicit models tested, yielding accurate parameter estimates after only a few iterations.


Introduction
Scientific experiments are critical to improving our perception and understanding of how the world works. Most of the time these experiments are time-consuming and expensive to perform. It is thus crucial to decide where and how to collect the necessary data to learn most about the subject of study. Bayesian experimental design attempts to solve this problem by allocating resources in an experiment using Bayesian statistics (see Ryan et al. (2016) for a comprehensive review). Roughly speaking, the aim is to find experimental design, e.g. measurement location or time, that are expected to most rapidly address the scientific aims of the experiment, mitigating the costs. The relevant scientific objectives can include, but are not limited to, model parameter estimation, prediction of future observations or comparison of competing models. In this particular paper we shall only be concerned with the objective of parameter estimation.
At the core of Bayesian experimental design is the so-called utility function, which is maximised to find the optimal design at which to perform an experiment. A popular and principled utility function for parameter estimation is the mutual information between model parameters and simulated data (Lindley, 1972). Intuitively, this metric measures the additional information we would obtain about the model parameters given some realworld observations taken at a particular design. Depending on the model, computing the mutual information can be difficult or even intractable and, as a consequence, various methods for its estimation have arisen.
Whenever new, real-world data is collected through physical experiments, the surface of the utility function tends to change, e.g collecting data with the same design would generally not yield much new information. The treatment of this change, for a single, new data point, is called myopic sequential Bayesian experimental design and is manifested through an update of the prior distribution upon observing real-world data. This stands in contrast to static Bayesian experimental design that is concerned with situations where we do not update our prior distributions when observing new data, such as when there is nearly no time, or too much time, between real-world measurements, or data has to be collected all at once. Sequential Bayesian experimental design is a well-established field for situations in which the model has a tractable likelihood function and inferring the posterior distribution is straight-forward (Ryan et al., 2016). However, there have only been few studies (e.g. Hainy et al., 2016) pertaining to the arguably more realistic situation of intractable, implicit models.
In practice, statistical models commonly have likelihood functions that are analytically unknown or intractable. This is the case for implicit models, where we cannot evaluate the likelihood but we can still sample from it. They are ubiquitous in the natural and medical sciences and therefore have widespread use. Examples include ecology (Ricker, 1954;Wood, 2010), epidemiology (Numminen et al., 2013;Corander et al., 2017), genetics (Marttinen et al., 2015;Arnold et al., 2018), cosmology (M. Schafer and Freeman, 2012;Alsing et al., 2018) and modelling particle collisions (Agostinelli et al., 2003;Sjöstrand et al., 2008). Because the likelihood function for implicit models is intractable, we are generally not able to work with the exact posterior distribution. As a result, likelihood-free inference methods have emerged to solve this issue.
In order to compute the mutual information between model parameters and simulated data however, one needs to be able to evaluate the ratio between posterior density to prior density several times which is difficult in the likelihood-free setting. This is especially challenging in the sequential framework, where the current belief distribution gets updated after every observation. In this work we propose to approximate the density ratio in mutual information directly via the Likelihood-Free Inference by Ratio Estimation (LFIRE) method of Thomas et al. (2016). We perform this in the context of sequential Bayesian experimental design, a significant extension of Kleinegesse and Gutmann (2019) that only considered the static setting.
In this paper we propose a sequential Bayesian experimental design framework for implicit models that have intractable data-generating distributions. In brief, we make the following contributions: 1. Our approach allows us to approximate the mutual information in the presence of an implicit model directly by LFIRE, without resorting to simulation-based likelihood approximations required by other approaches. At the same time, LFIRE also provides an approximation of the sequential posterior.
2. We demonstrate the efficacy of our sequential framework on examples from epidemiology and cell biology. We further showcase that previous approaches may produce experimental designs that heavily penalise multi-modal posteriors thereby introducing an undesirable bias into the scientific data gathering stage, which our approach avoids.
In Section 2 we give basic background knowledge to sequential Bayesian experimental design, mutual information and likelihood-free inference, in particular LFIRE. We then combine these concepts in Section 3 and explain our novel framework of sequential design for implicit models. We test our framework on various implicit models and present the results in Section 4. We conclude our work and discuss possible future work in Section 5.

Bayesian Experimental Design
In Bayesian experimental design the aim is to find experimental designs d that yield more informative, or useful, real-world observations than others. Furthermore, in this work we are particularly interested in finding the optimal design d * that results in the best estimation of the model parameters. At its core, this task requires defining a utility function U (d) that describes the value of performing an experiment at d ∈ D, where D defines the space of possible designs. In order to qualify as a 'fully Bayesian design', this utility has to be a functional of the posterior distribution p(θ | d, y) (Ryan et al., 2016), where θ are the model parameters and y is simulated data. The utility function is then maximised in order to find the optimal design d * , i.e. d * = arg max d∈D U (d). (2.1) The choice of utility function U (d) is thus critical, as different functions will usually lead to different optimal designs. The most suitable utilities naturally depend on the task in question, but there are a few common functions that have been used extensively in the literature. For instance, the Bayesian D-Optimality (BD-Opt) is based on the determinant of the inverse covariance matrix of the posterior distribution, 1 and is a measure of how precise the resulting posterior might be given certain designs (Ryan et al., 2016), and robust utility function is the mutual information, one of the most principled choices in Bayesian experimental design (e.g. Ryan et al., 2016).

Mutual Information
The mutual information I(θ; y|d) can be interpreted as the expected reduction in uncertainty (entropy) of the model parameters if the data y was obtained with design d. It accounts for possibly non-linear dependencies between θ and y. It is an effective metric with regards to the task of parameter estimation, as we are essentially concerned with finding the design for which the corresponding observation yields the most information about the model parameters θ. In other words, mutual information tells us how 'much' we can learn about the model parameters given the prospective data at a particular design.
Mutual information is defined as the Kullback-Leibler (KL) divergence D KL (Kullback and Leibler, 1951) between the joint distribution and the product of marginal distributions of y and θ given d, i.e.
where we have made the usual assumption that our prior belief about θ is not affected by the design, i.e. p(θ | d) = p(θ).
The mutual information can also be interpreted as the expected KL divergence between posterior p(θ | d, y) and prior p(θ) (see e.g. Ryan et al., 2016) and essentially tells us how different on average our posterior distribution is to the prior distribution. The utility that we then need to maximise in order to find the optimal design d * is thus where p(y | θ, d) is the data generating distribution, commonly referred to as the likelihood. The particular form of mutual information in (2.8) can be obtained from (2.5) by applying the product rule to p(θ, y | d). Even though mutual information is a wellstudied concept, estimating it efficiently remains an open question, especially in higher dimensions.
Assuming for now that we have optimised the utility function in (2.8) and have obtained the optimal design d * . An experimenter would then go and perform the experiment at d * and observe real-world data y * . Everything up to this point is static Bayesian experimental design. If we would like to update our optimal design in light of a real-world observation, we would have to perform sequential Bayesian experimental design, i.e. update our prior distribution and optimise the utility function again. This procedure is then repeated several times to obtain (myopic) sequentially designed experiments. We shall not aim to find non-myopic sequentially designed experiments where we would we plan ahead more than one time-step as this adds another layer of complexity.
Let k be the kth iteration of the sequential design procedure, where k = 1 corresponds to the task of finding the first optimal experimental design d * 1 yielding real-world observation y * 1 . At iteration k we then optimise the utility function U k (d) to obtain sequential optimal designs d * k , with corresponding real-world observations y * k . The utility function at iteration k ∈ {1, 2, . . . , K} depends on the set of all previous observations D k−1 = {d * 1:k−1 , y * 1:k−1 }, with D 0 = ∅, and therefore will change at every iteration. Its form stays similar to (2.8), except that the prior and posterior distributions now depend on D k−1 , i.e.
Note that we will assume that data is generated independently of previous observations, i.e. p(y | θ, d, D k−1 ) = p(y | θ, d). In certain special cases where gathering real-world observations changes the data-generating process this would not be the case.

Likelihood-Free Inference
Implicit models have intractable likelihood functions, which means that p(y | θ, d) is either too expensive to compute or there is no closed-form expression. This results in standard Bayesian inference becoming infeasible. Because of their widespread use however, it is crucial to be able to infer the parameters of implicit models. As a result, the field of likelihood-free inference has emerged. These methods leverage the fact that, by definition, implicit models allow for sampling from the data-generating distribution.
A popular likelihood-free approach is Approximate Bayesian Computation (ABC, Rubin, 1984). ABC rejection sampling (Pritchard et al., 1999), the simplest form of ABC, works by generating samples from the prior distribution over the model parameters and then using them to simulate data from the implicit model. The prior parameters that result in data that is 'close' to observed data are then accepted as samples from the ABC posterior distribution. See Sisson et al. (2018) or Lintusaari et al. (2017) for reviews on ABC.
In this paper, we make use of another approach to likelihood-free inference called Likelihood-Free Inference by Ratio Estimation (LFIRE) (Thomas et al., 2016). LFIRE uses density ratio estimation to obtain ratios r(d, y, θ) of the likelihood to marginal density and, therefore, the posterior to prior density, i.e. (2.10) The method works by estimating the ratio from data simulated from the likelihood p(y | θ, d) and data simulated from the marginal p(y | d), e.g. via logistic regression (Thomas et al., 2016). Since the prior density p(θ) is known, learning the ratio corresponds to learning the posterior, i.e. p(θ | d, y) = r(d, y, θ)p(θ). Importantly, the learned ratio yields automatically also an estimate of the mutual information in (2.9).
The LFIRE framework can be used with arbitrary models of the ratio or posterior. For simplicity, like in the simulations by Thomas et al. (2016), we here use the log-linear model where ψ(y) are some fixed summary statistics. Thomas et al. (2016) showed that this log-linear model, while simple, generalises the popular synthetic likelihood approach by Wood (2010); Price et al. (2018b). Moreover, learning the summary statistics from data, e.g. by means of neural networks, is possible too (Dinev and Gutmann, 2018). For further details on LFIRE, we refer the reader to the original paper by Thomas et al. (2016).

Sequential Mutual Information Estimation
The main aim of this work is to construct an effective sequential experimental design framework for implicit models. To do this, we have to approximate the sequential utility in (2.9) in a tractable manner. We propose to use LFIRE to estimate the intractable density ratio in (2.9) and, at the same time, obtain the posterior density. The main difference to the work of Kleinegesse and Gutmann (2019) is that they only considered static experimental design and did not have the additional complications that come with the sequential setting, such as updating the prior distribution upon observing realworld data. Our approach bears some similarities to the SMC sequential design method of Hainy et al. (2016). However, we use LFIRE for updating the posterior when new data are collected and for direct estimation of the mutual information (MI), rather than relying on simulation-based likelihood estimation. Further, unlike Hainy et al. (2016), our approach avoids the MCMC perturbation step, which requires re-processing all data seen so far.

Sequential Utility
We assume that we have already made k − 1 experiments resulting in the set of optimal designs and observations D k−1 = {d * 1:k−1 , y * 1:k−1 }, with D 0 = ∅. At iteration k of the sequential BED procedure we then set out to determine the optimal design d * k and the corresponding real-world observation y * k . To do so, we first approximate the density ratio of p(θ | d, y, D k−1 ) and p(θ | D k−1 ) by the ratio r k (d, y, θ, D k−1 ) computed by LFIRE, 2 such that We then plug this into the expression for the sequential MI utility in (2.9) and obtain We can approximate this with a Monte-Carlo sample average to obtain the estimate where y (i) ∼ p(y | d, θ (i) ) and θ (i) ∼ p(θ | D k−1 ). The above mutual information estimate U k (d) is then optimised to find the optimal design d * k and, through a realworld experiment, the corresponding observation y * k at iteration k. Two core technical difficulties in (3.4) are (1) how to obtain parameter samples θ (i) ∼ p(θ | D k−1 ) from the updated belief distribution and (2) how to compute the sequential LFIRE ratio in (3.1) given the observations D k−1 . We explain our solutions to these difficulties in Sections 3.2 to 3.4.

Updating the belief about the model parameters
For iteration k = 1 we only require samples from the prior distribution p(θ) in order to compute the MI in (3.4). We here assume that sampling from the prior is possible. For iteration k = 2, we require samples from p(θ | D 1 ), for k = 3 we require samples from p(θ | D 2 ), etc. We here describe how to obtain samples from the updated belief p(θ | D k ) after any iteration k. For that, let us first define what it means to update the belief about the model parameters. After observing real-word data y * k at optimal design d * k , we update the observation data set, i.e.
and y = y * k , the numerator in (3.1) equals p(θ | D k ), leading us to an expression for the updated belief distribution, Furthermore, we can approximate the belief distribution p(θ | D k ) after iteration k as a product of k estimated density ratios and the initial prior p(θ), Each of the density ratios r s in (3.6) are evaluated at the observations {d * s , y * s } of the relevant iteration s, but also depend on all previous observations D s−1 . We can write this product of density ratios as a weight function w k and then (3.6) becomes where we have defined the weight function w k to be according to (2.10) and w 0 (θ) = 1 ∀ θ. We use the weight function in (3.8) to obtain samples from the updated belief distribution p(θ | D k ). To do so, we first sample N initial prior samples θ (i) ∼ p(θ). After every iteration k we then obtain weights w We compute these by updating each weight w (i) k−1 by the LFIRE ratio evaluated at the observed data according to (3.8), in order to yield w (i) k . Since we can store the weights w (i) k−1 for a particular parameter, we do not need to recompute them.
To finally obtain updated belief samples, we first normalise the weights: W k . Then we sample an index from the categorical distribution, i.e. I ∼ cat({W (i) k }), and choose the initial prior sample θ (I) . Repeating this several times results in a set of parameter samples that follows p(θ | D k ). We have summarised this procedure in Algorithm 1.
Algorithm 1 Obtaining samples from the updated belief p(θ | D k ) 1: After iteration k, obtain particle set {w Sample from a categorical distribution: Choose θ (I) as a sample from the updated prior distribution 6: end for We note that approximating (3.3) directly with weighted samples from p(θ) instead of using Algorithm 1 to obtain samples from p(θ | D k−1 ) would theoretically result in a lower variance Monte-Carlo estimator. However, because we did not observe this in our simulations and Algorithm 1 had lower computation times, we opted to use Algorithm 1 instead. For more details see Appendix B.

Resampling
We can see from (3.8) that a weight w (i) k is computed by the product of LFIRE ratios given all previous observations. Less significant parameter samples, i.e. ones with a low density ratio, thus have weights that quickly decay to zero. This means that after several iterations we may be left with only a few weights w (i) k that are effectively nonzero. Eventually, only few different parameter samples θ (i) of the current particle set are chosen in the sampling scheme in Algorithm 1, which increases the Monte-Carlo error of the sequential utility in (3.4) and of the marginal samples in the sequential LFIRE procedure (see Section 3.4).
We can quantify how many effective samples we have via the effective sample size η (Kish, 1965), (3.9) If η is small, i.e. η N , then they do not cover much relevant parameter space and our Monte-Carlo approximations may become poor. Thus, if the effective sample size becomes smaller than a minimum sample size η min we need to resample our set of parameter samples; this allows us to have a set of new parameter samples that wellrepresent the current belief distribution. From a practical point of view, throughout this work we shall use the typical value of η min = N/2 (e.g. Chen, 2003;Doucet and Johansen, 2009).
We start the resampling procedure by transforming the parameter space such that all parameter samples θ (i) have values between 0 and 1, i.e. θ → θ . This ensures that different parameter dimensions have similar scales, thereby increasing the robustness of the following steps. 3 If the parameter space has boundary conditions B, through a bounded prior distribution for instance, then we transform these in the same way as the parameter space, B → B (see Appendix C). In this transformed space, we model the belief distribution p(θ | D k ) after the current iteration k as a truncated Mixture of Gaussians (MoG), i.e.
where I is the identity matrix, σ 2 is a variance parameter and W (i) k are the normalised weights. The indicator function 1 B (θ ) is 1 if θ satisfies the boundary conditions B and 0 otherwise. Note that we have one Gaussian for every parameter sample of the current particle set; each Gaussian is centred at that parameter sample θ (i) and has the same standard deviation σ. The parameter σ is typically small which means that (3.10) should be understood more in terms of smoothing than Gaussian mixture modelling.
We compute the Gaussian standard deviation by first using a k-dimensional (KD) tree (Bentley, 1975) to find the nearest neighbour NN(θ (i) ) of each parameter sample.
Let δ be the median of all the distances of a sample to its nearest neighbour, i.e. δ = median |θ (i) − NN(θ (i) )| . We then compute the standard deviation σ as a function g of δ, i.e. σ = g(δ), (3.11) where we choose g to be the square-root function in order to increase robustness to possibly large median distances. 4 In order to get a new sample from the updated belief distribution, we sample an index from a categorical distribution, i.e. I ∼ cat({W (i) k }), and obtain a parameter sample from the corresponding Gaussian N (θ ; θ (I) , Iσ 2 ). We accept this parameter sample if it satisfies the transformed boundary conditions B and reject it otherwise. Doing this a number of times yields a set of new parameter samples. We then set the weight of each of the new, resampled parameter samples as proportional to one and transform the samples back to the original parameter space, i.e. θ → θ. This procedure of resampling parameters is summarised in Algorithm 2.
Unlike the resampling step that uses MCMC in previous approaches to SMC sequential design (Hainy et al., 2016), our approach does not exactly preserve the distribution of the particles. However, crucially, it does not require re-processing all data collected to date, accelerating computation. Conceptually, our resampling method can also be understood as fitting some model to weighted samples from the prior distribution. This can be viewed as a type of kernel density estimate (KDE) formed from weighted samples. Other density estimators that allow for sampling, even fully-parametric ones, could be used as well.

Sequential LFIRE
As can be seen from (3.1), the sequential LFIRE ratios depend on previous observations. This particular dependency requires us to revise the original LFIRE method of Thomas et al. (2016) slightly. To compute the ratio r k (d, y, θ, D k−1 ) we need to sample data from the likelihood p(y | θ, d) and from the marginal p(y | d, D k−1 ). We again assume that observing data does not affect the data generating process, i.e. sampling from the likelihood remains unchanged. The marginal distribution does change upon observing data, i.e. at iteration k we have This implies that in order to obtain samples from the marginal we first have to sample from the belief distribution p(θ | D k−1 ) according to Algorithm 1. These parameter samples from the updated belief distribution are then plugged into the data generating distribution to finally obtain samples from the marginal. The rest of the LFIRE procedure remains unchanged (see Thomas et al., 2016 for more details).
Algorithm 2 Resampling via a Mixture of Gaussian model 1: After iteration k, obtain particle set {w 2: Transform the parameters to be in the unit hyper-cube, θ → θ 3: Transform the boundary conditions in the same way, B → B 4: Find the nearest neighbour of each parameter sample 5: Compute the standard deviation σ for the MoG model, according to (3.11) 6: Normalise the weights: W Sample from a categorical distribution: while not accepted do 10:

Optimisation
In all sections hitherto we have explained how to compute the sequential mutual information utility U k (d) at iteration k. We have, however, not addressed the issue of optimising the utility with respect to the designs d in order to find the optimal design d * k . While traditionally the utility has been optimised via grid search or a sampling-based approach by Müller (1999), there have been a few recent approaches using evolutionary algorithms (Price et al., 2018a) or Gaussian Processes (GP) (Overstall and Woods, 2017). The latter approaches were generally found to outperform grid search in terms of efficiency. We here choose to optimise the sequential utility using Bayesian Optimisation (BO) (Shahriari et al., 2016), as was done by Kleinegesse and Gutmann (2019), due to its flexibility and efficiency. In addition, BO smoothes out the Monte-Carlo error of our utility approximations, and may therefore help in locating the optimal design d * k as well.
BO is a popular optimisation scheme for functions that are expensive to evaluate and that potentially have unknown gradients. The general idea is to use a probabilistic surrogate model of the utility and then use a cheaper acquisition function to decide where to evaluate the utility next. We use a GP for the surrogate model with a Matérn-5/2 Kernel (Shahriari et al., 2016) and Expected Improvement (Mockus et al., 1978) for the acquisition function. These are standard choices in the BO literature, for more detail see Shahriari et al. (2016).
We summarise the previous sections by describing our framework of estimating and optimising the sequential mutual information utility in Algorithm 3.
Algorithm 3 Sequential Bayesian Exp. Design via LFIRE using BO 1: Let D 0 = ∅ 2: Sample initial parameters from prior: Calculate the effective sampling size η using (3.9) Use all new parameter samples Perform an experiment at d * k to observe some real data y * k 18: Update the belief distribution by updating the data set: For all parameter samples θ (i) , compute new weights w

Experiments
In this section we test the framework outlined in Algorithm 3 on a number of implicit models from the literature. We first consider an oscillatory toy model with a multimodal posterior distribution. We then consider the Death Model (Cook et al., 2008) and the SIR Model (Allen, 2008) from epidemiology, as well as a model of the spread of cells (Vo et al., 2015). We evaluate these models with our framework that approximates the sequential MI utility with density ratio estimation, i.e. LFIRE.

Oscillation Toy Model
This toy model describes noisy measurements of a sinusoidal, stationary waveform sin(ωt), where the design variable is the measurement time t and the experimental aim is to optimally estimate the waveform's frequency ω. The generative model is given by p(y | ω, t) = N (y; sin(ωt), σ 2 noise ), (4.1) where we set the measurement noise to σ noise = 0.1 throughout and assume that the true model parameter takes a value of ω true = 0.5. As a prior we use a uniform distribution p(ω) = U(ω; 0, π). We can obtain analytic posterior densities by using the likelihood in (4.1) and Bayes' rule, while we can obtain corresponding posterior samples by using Markov chain Monte-Carlo (MCMC) methods.
We start the sequential BED procedure for the oscillation model by sampling 1,000 parameter samples ω (i) from the prior and for each of these we then simulate data y (i) ∼ N (y; sin(ω (i) t), σ 2 noise ) at a particular measurement time t ∈ [0, 2π]. For the summary statistics in (2.11) we use subsequent powers of the simulated data, i.e. ψ(y (i) ) = [y (i) , (y (i) ) 2 , (y (i) ) 3 ] , in order to allow for a sufficiently flexible, non-linear decision boundary in the LFIRE algorithm. We use these prior samples and the corresponding simulated data to compute 1,000 LFIRE ratios and then estimate the MI utility U 1 (t) with a sample average as in (3.4). With the help of BO, we decide at which measurement time t to evaluate the utility next and then repeat until we have maximised the utility, following Algorithm 3.
We repeat the optimisation procedure above for the BD-Opt utility in (2.2) and compare it to the MI utility. Hainy et al. (2016) used this utility in sequential design targeted at implicit models, although they only tested their method on a toy model with known likelihood. The advantages of MI over BD-Opt for models with multi-modal posteriors are widely known in the explicit setting (Ryan et al., 2016). It is nonetheless useful to verify that these advantages continue to hold when approximating the MI and the posterior with LFIRE.
We show the MI utility and the BD-Opt utility used by Hainy et al. (2016), as well as their analytic counterparts, for the first iteration in Figure 1. Shown are the posterior predictive means of the GP surrogate models, the corresponding variances and the evaluations of the utilities during the BO procedure. Due to the chosen prior and the periodic nature of the oscillation model, higher design times result in posterior distributions with more modes. Multi-modality can lead to an increase of the variance. BD-Opt thus assigns little to no worth in doing experiments at late measurement times. In contrast, the MI utility has a high value at late design times when the posterior distributions tend to have multiple modes. The corresponding optimal designs are t * 1 = 2.196 and t * 1 = 1.656 for the MI utility and the BD-Opt utility, respectively. Furthermore, the behaviour of both utilities generally well matches the analytic references computed using the closed-form expression of the data-generating distribution.
After determining the optimal measurement time t * 1 , we perform the actual experiment. Here, the real-world experiment is simulated by taking a measurement of the true data generating process with ω true = 0.5 at t * 1 where we obtained y * 1 = 0.790 and y * 1 = 0.810 for the case of the MI and BD-Opt utility, respectively. We show the corresponding estimates of the posterior distributions in Figure 2. In our approach, we compute particle weights for each of the 1,000 prior samples and then obtain the posterior, or updated belief, samples according to Algorithm 1. Importantly, the BD-Opt utility uses a particle approach as well, which means that we also need to use Algorithm 1 to obtain updated belief samples; we direct the reader to Hainy et al. (2016) for more information on how the required particle weights are computed. For visualisation purposes, we then compute a Gaussian Kernel Density Estimate (KDE) from these posterior samples to obtain the posterior densities shown in Figure 2. We also show the analytic posteriors which are computed using the closed-form expression of the datagenerating distribution. We find that the posterior distributions have two modes, which is a result of the periodic behaviour of the model. We note that one mode has support for the true model parameter.
After obtaining the relevant data D 1 = {d * 1 , y * 1 } = {t * 1 , y * 1 } for the first iteration k = 1, we compute the new particle weights w (i) 1 via (3.8) for MI and via ABC likelihoods for BD-Opt (see Hainy et al., 2016), which are then used in subsequent iterations of the sequential BED procedure. Following Algorithm 3 we continue this procedure in a similar manner until iteration k = 4, although technically this could be continued until the experiment's budget is exhausted.
We show the GP models of the MI and BD-Opt utility for all four iterations in Figure 3. Both utilities change vastly between iterations. As compared to iteration 1, the MI utility has more local optima in iteration 2, although it is still overall increasing and then peaking at t ≈ 6. A pronounced local minimum occurs around t = 2.196, the optimal design of the first iteration; this is intuitive, because performing an experiment at the same experimental design may not yield much additional information at this stage due to the relatively small measurement noise. For the same reason, the BD-Opt utility has a local minimum around t = 1.656, the optimal design of the first iteration for BD-Opt. Due to large fluctuations in the estimated BD-Opt utility around the global maximum, the GP mean does not go through all nearby evaluations and has a larger variance throughout for iteration 2.
In iteration 3, the MI utility has two local minima that occur at the locations of the two previous optimal designs because, like previously, performing an experiment at the same measurement locations may not be effective. BD-Opt on the other hand steadily increases and then peaks at the upper boundary of the design domain. This occurs because, for BD-Opt, the updated belief distribution of the parameter is unimodal after iteration 2 and becomes more narrow with increasing design times; similar reasoning follows for BD-Opt in iteration 4. We observe the same behaviour for MI in iteration 4, as the updated belief distribution used to compute the MI utility becomes uni-modal after iteration 3. We had to perform resampling during iterations 2 − 4, as the effective sample size of the weights went below 50% for both the MI and the BD-Opt utility. 5 Figure 3 shows that in iteration 1 to 3, mutual information assigns worth to several areas in the design domain that Bayesian D-Optimality does not deem important. After enough data is collected and the posterior is unimodal, however, the difference between these two utilities becomes negligible and they result in the same optimal design.
For visualisation purposes, we put a KDE over updated belief samples after each iteration, obtained by means of Algorithm 1, to plot posterior densities. This is shown in Figure 4 for the sequential MI and BD-Opt utilities. After iteration 2, only mutual information results in a multi-modal belief distribution. From iteration 3 onwards, both distributions are unimodal and similarly concentrated around the true model parameter of ω true = 0.5. After 4 iterations, the mean parameter estimate using the data from the MI utility is ω = 0.503 with a 95% credibility interval of [0.481, 0.527]. Using the data from the BD-Opt utility the mean parameter is ω = 0.494 with a 95% credibility interval of [0.468, 0.516]. The 95% credibility intervals where computed using a Gaussian KDE of the parameter samples and the highest posterior density interval (HPDI) method.
Overall, in the context of the oscillation model, the mutual information and Bayesian 5 Note that for BD-Opt we use the resampling procedure provided in Hainy et al. (2016)  D-Optimality utilities yield significantly different optimal experimental designs. As opposed to MI, BD-Opt leads to optimal experimental designs that are biased to exclude multiple explanations for the inferred parameters. When enough real-world observations are made, the updated belief distributions are no longer multi-modal but collapse to unimodal distributions, at which point the utilities become similar. Additionally, for the BD-Opt utility we noticed certain numerical instabilities that resulted from taking the mean of several posterior precisions. We rectified this by taking the median of several posterior precisions, instead of taking the mean as in Hainy et al. (2016).

Death Model
The Death Model describes the stochastic decline of a population of N individuals due to some infection. Individuals change from the susceptible state S to the infected state I at an infection rate b, which is the model parameter we are trying to estimate. Each susceptible individual can get infected with a probability of p inf (t) = 1−exp(−bt) (Cook et al., 2008) at a particular time t. The aim of the Death Model is to decide at which measurement times τ to observe the infected population I(τ ) in order to optimally estimate the true infection rate b. 6 Here we assume that for each iteration of the sequential BED scheme we only have access to a new, independent stochastic process. This means that, for instance, in iteration k = 2 we could have design times before the optimal design time τ * 1 of the first iteration. The total number of individuals ∆I(t) moving from state S to state I at time t is given by a sample from a Binomial distribution, where ∆t is the step size, set to 0.01 in this work, and I(t = 0) = 0. By discretising this time series, the number of infected at time t + ∆t is given by I(t + ∆t) = I(t) + ∆I(t).
The likelihood for this model is analytically tractable (see Cook et al., 2008;Kleinegesse and Gutmann, 2019), and thus can serve as a means to validate our framework. As a prior distribution we use a truncated Normal distribution, centred at 1 with a standard deviation of 1, while the summary statistics used to compute (2.11) are subsequent powers of the number of infected, i.e. ψ(I(τ )) = [I(τ ), I(τ ) 2 , I(τ ) 3 ] . To generate realworld observations, we use a true parameter value of b true = 1.5.
We show the first iteration of the sequential mutual information utility in the left plot of Figure 5, as well as a reference MI value obtained using the tractable likelihood. 7 The MI peaks in the region around τ ≈ 1 and stays low at early and late measurement times. The average posterior for early and late τ is wider than the one for τ ≈ 1, 8 which results in a lower MI at the boundary regions. This is because at early and late measurement times the observed number of infected I(τ ) is the same for a wide range of infection rates b, i.e. either 0 or 50 (the extreme values of I(τ )). At τ ≈ 1 most values of b yield observations of I(τ ) that are between the extreme values 0 and 50, allowing us to infer the relationship between b and I(τ ) more effectively. For later iterations, the MI generally had the same form and did not change much, with optimal measurement times that were all roughly around 1 (see Appendix F for a plot with all iterations). This reduces the uncertainty in b which, in this case, outweighs the advantages of making an observation at different measurement times such as near the boundaries.
We show a KDE of the updated belief samples after each iteration, obtained by means of Algorithm 1, in the right plot of Figure 5. Even though the updated belief distribution after the first iteration has an expected value that is close to the true parameter, the corresponding credibility interval is wide. The belief distributions in the following iterations become more narrow, which is a result of having more data to estimate the model parameter. After four iterations, the posterior mean of b equals b = 1.376 with a 95% credibility interval of [1.128, 1.621] containing b true = 1.5. The credibility intervals were computed using a Gaussian KDE over posterior samples and the HPDI method.

SIR Model
The SIR Model (Allen, 2008) is an extension of the Death model and, in addition to the number of susceptibles S(t) and infected I(t), includes one more state population, the number of individuals R(t) that have recovered from the infection and cannot be infected again. Similar to the Death model, the design variable is the measurement time τ at which to observe the state populations. For this model however, we are trying to estimate two model parameters, the rate of infection β and the rate of recovery γ. Similar to the Death model, we assume that for each iteration of the sequential BED scheme we only have access to a new, independent stochastic process. At a particular time t of the time-series of state populations, let the number of individuals that get infected during an interval ∆t, i.e. change from state S(t) to state I(t), be ∆I(t). Similarly, let the number of infected that change to the recovered state be ∆R(t). We compute these two state population changes by sampling from Binomial distributions, where the probability p inf (t) of a susceptible getting infected is defined as p inf (t) = βI(t)/N , where β ∈ [0, 1] and N is the total (constant) number of individuals. The probability p rec (t) of an infected individual recovering from the disease is defined as p rec (t) = γ, where γ ∈ [0, 1]. These state population changes define the unobserved time-series of the state populations S, I and R according to We use initial conditions of S(t = 0) = N − 1, I(t = 0) = 1 and R(t = 0) = 0, where we set N to 50 and use a time-step of ∆t = 0.01 throughout. The actual time at which we do observations is again given by τ , such that the observed data is a single value for each state population, i.e. S(τ ), I(τ ) and R(τ ). We use an uninformative, uniform prior U(0, 0.5) for both model parameters β and γ to draw initial prior samples. For the summary statistics used to compute (2.11) during the LFIRE algorithm we use subsequent powers, up to 3, of I(τ ) and R(τ ), including their products. 9 The first four sequential MI utilities of the sequential BED scheme for the SIR model are shown in Figure 6. The SIR model utilities appear similar to those of the Death Model, with the main difference being that the global optima are shifted more towards lower measurement times around τ ≈ 0.5, increasing subtlety with every iteration. Similar to the Death model, early and late measurement times result in posterior distributions that are, on average, wider than those for τ ≈ 0.5. This is because at early and late τ much of the data is the same for a wide range of model parameters θ. At early τ we mostly observe S(τ ) = 49, I(τ ) = 1 and R(τ ) = 0, i.e. the initial conditions. At late measurement times there are no infected anymore, i.e. I(τ ) = 0, and we observe a fixed S(τ ) and R(τ ) that depend on the model parameters (see Appendix D for a typical time-series plot). Because the final values of S(τ ) and R(τ ) depend on the model parameters, late measurement times result in posteriors that are slightly more narrow than those for early measurement times. This is reflected in Figure 6, where the MI is higher at late τ than at early τ . At τ ≈ 0.5 we often have numbers of infected I(τ ) that are non-zero, allowing us to infer the relationship between model parameters and data more effectively. This means that the resulting posterior for these measurement times is more narrow than elsewhere, where I(τ ) is close to zero, which leads to the global MI maxima that we see in Figure 6.
Similar to the Death model, the form of the utilities for the SIR model does not change much between iterations. The utility for the first iteration appears noisier than the other ones because the parameter samples during that iteration stem from a uniform prior, which is highly uninformative and increases the Monte-Carlo error of the sample average in (3.4). The other utilities do not see this issue as the parameter samples from the updated belief distributions are less spaced out than for the first iteration. Resampling was performed according to Algorithm 2 during iterations 2 − 4, as the effective sample size always went below 50%.
The posterior densities after every iteration are shown in Figure 7 in form of KDEs computed from the posterior samples obtained according to Algorithm 1. We see that the beliefs about the model parameters become more precise after every iteration, which can be attributed due to having more data. This visualises that data acquired around measurement time τ ≈ 0.5 are providing useful information about the model parameters. After four iterations, the mean estimate of the infection rate β is β = 0.171 with a 95% credibility interval of [0.112, 0.233]. The corresponding mean estimate of the recovery rate γ is γ = 0.045 with a 95% credibility interval of [0.024, 0.068]. Similar to before, the credibility intervals were computed using a Gaussian KDE of the posterior samples and the HPDI method. The true parameters used to generate real-world observations were β true = 0.15 and γ true = 0.05, which are both contained in the credibility intervals.
The SIR model example illustrates that we can effectively use the MI utility, computed and optimised via Algorithm 3, to perform sequential BED for an implicit model, where the likelihood function is intractable.

Cell Model
The cell model (Vo et al., 2015) describes the collective spreading of cells on a scratch assay, driven by the motility and proliferation of individual cells, with particular applications in wound healing (e.g. Dale et al., 1994) and tumor growth (e.g. Swanson et al., 2003). In the context of our work, the experimental design is about deciding when to count the number of cells on the scratch assay in order to optimally estimate the cell diffusivity and proliferation rate. Price et al. (2018b) used the Cell Model before to compare the synthetic likelihood and approximate Bayesian computation (ABC) likelihood-free inference approaches to estimate these model parameters. Importantly, in their work they assumed that they had access to 144 images of a scratch assay and went on to estimate the model parameters given that an experimenter could analyse and quantify the cell spreading in all 144 images. We here wish to find out which of these images an experimenter should analyse if there is a limited experimental budget.
The (discrete) cell model starts with a grid of size 27 × 36 and 110 initial cells that are randomly placed in the upper part of the grid. This simulates wound healing, where a part of the tissue was scratched away due to an accident. At each discrete time step, every cell in the grid has a chance of moving to a neighbouring, empty grid position, which is given by the model diffusivity D. Similarly, at each discrete time every cell also has a chance to reproduce and spawn a new cell in a neighbouring, empty position, which is dictated by the model proliferation rate λ. While the model parameters of interest are the diffusivity D and proliferation rate λ, it is often easier to work with the probability of motility P m ∈ [0, 1] and probability of proliferation P p ∈ [0, 1]. 10 For a particular combination of {P m , P p } we can then simulate a time-series of grids where cells move around and reproduce. 11 In the context of BED, the discrete design variable is then the time at which to observe this grid and count the total number of cells. In reality, a human would have to physically count the number of cells under a microscope, which is time-consuming, and therefore we want to find the optimal times at which to have the experimenter make an observation. Similar to previous models, we here assume that we have access to a new, independent stochastic process for each sequential iteration.
We shall use 144 time steps as in Vo et al. (2015) and Price et al. (2018b), which means that, including the initial grid, there are 145 grids in every time-series. For the summary statistics used to compute (2.11), we use the Hamming distance between a particular grid and the initial grid, as well as the total number of cells in a particular grid. The one-dimensional design variable is discrete and can take values between 1 and 145, i.e. d ∈ {1, . . . , 145}, while the summary statistic is two-dimensional. For the model parameters we use prior distributions p(P m ) = U(P m ; 0, 1) and p(P p ) = U(P p ; 0, 0.005); we choose the true model parameters to be P m,true = 0.35 and P p,true = 0.001 as Price et al. (2018b). Because the simulation time for the cell model is significantly more expensive than the previous models we have tested, we only use 300 initial prior samples during the sequential BED algorithm, which we run up to five iterations. We note that, while decreasing the computational resources needed, this may increase the Monte-Carlo error in (3.4) and the error in the LFIRE ratio estimate.
In the top row in Figure 8 we compare the sequential MI utilities for iteration 1 (left) and 5 (right) for the cell model; see Appendix G for a plot showing utilities for all iterations. Shown are the posterior predictive means and variances of the surrogate GP model, the BO evaluations and the respective optima. Because of the discrete domain, a normal GP tends to overfit this data and therefore we have used a one-layer deep GP (see Damianou and Lawrence, 2013) as the surrogate model, which means that the GP hyperparameters are modelled by GPs as well. For each iteration there seems to be some merit in taking observations at small and large designs but not for medium-large designs, e.g. d ∼ 20 − 40. At early designs, not much proliferation will have happened (due to its small prior probability) and so one can more easily measure the effect of motility. Conversely, at large designs one can average out the effect of motility and more easily notice the effect of proliferation, as more time has elapsed. In our case, the information gain about proliferation seems to beat that about motility in iteration 1, and similarly for iterations 2 − 4 (see Appendix G). After having repeatedly made measurements at large designs, at iteration 5 it becomes more effective to measure at small designs. This is because we have decreased the uncertainty in the proliferation parameter in iterations 1 − 4 and then need a measurement at early designs to sufficiently decrease the uncertainty in the motility parameter. Note that we resampled the parameters according to Algorithm 2 before iteration 2 and 3, as the effective sample size went below 50%.
Similarly to before, we use KDE and updated belief samples obtained from Algorithm 1 to visualise the approximate posterior densities after every iteration. In the bottom row in Figure 8 we show the posterior densities obtained after iteration 1 (left) and 5 (right); see Appendix G for a plot showing posterior distributions after every iteration. After iteration 1, the updated belief distribution has a wide spread in the P m parameter and is more narrow for the P p parameter. The optimal design at iteration 1 was at the far end of the design domain at d * 1 = 145. This is a design that helps to more easily detect the effect of proliferation as opposed to motility, which is reflected in the figure. The same phenomenon occurs in subsequent iterations 2 − 4, where the optimal designs are at the far end of the design domain (see Appendix G). This results in posterior distributions that are relatively narrow for P p but wider for P m . Taking a measurement at the small design d * 5 = 2 in iteration 5, which allows us to more easily detect the effect of motility, reduces the uncertainty in the P m parameter as well, as can be seen in the bottom right plot of Figure 8. The mode of the posterior distribution after iteration 5 is close to the true parameter value of P m,true = 0.35 and P p,true = 0.001. The estimated mean of the motility parameter is P m = 0.394 with a 95% credibility interval of [0.166, 0.642], while for the proliferation parameter it is P p = 0.00150 with a 95% credibility interval of [0.00055, 0.00265]. Both credibility intervals contain the true parameter values. The credibility intervals were computed using a Gaussian KDE of the marginal posterior samples and the HPDI method.
The cell model demonstrates that we can effectively use MI in sequential BED when the forward simulations are expensive and the design domain is discrete. For this model, an experimenter might intuitively want to take observations at regular intervals but we have seen from the sequential utility functions that the expected information gain may then be sub-optimal. Sequential BED suggests that, when there is a limited budget, it is best to first take observations at high values in the design domain as the effect of proliferation dominates that of motility. Later, however, we should take observations at small designs to reduce the uncertainty in the motility parameter as well.

Conclusion
In this work we have presented a sequential BED framework for implicit models, where the data-generating distribution is intractable but sampling from it is possible. Our framework uses the mutual information (MI) between model parameters and data as the utility function, which has not been done before in the context of sequential BED for implicit models due to computational difficulties. In particular, we showed how to obtain an estimate of the MI by density ratio estimation methods and then optimise it with Bayesian optimisation. To estimate the MI in subsequent iterations, we showed how to obtain updated belief samples by using a weighted particle approach and updating weights every iteration using the computed density ratios. We devised a resampling algorithm that yields new parameter samples whenever the effective sample size of these weights went below a minimum threshold. The framework can be used to produce sequential optimal experimental designs that can guide the data-gathering phase in a scientific experiment.
We first illustrated and explained our framework on a oscillatory toy model with multi-modal posteriors and then applied it to more challenging examples from epidemiology and cell spreading. For all examples we obtained optimal experimental designs that made intuitive sense and and resulted in informative posterior distributions. For the oscillatory toy model Bayesian D-Optimality (BD-Opt), which has been used in sequential BED once before by Hainy et al. (2016). We found that, besides being less computationally expensive, MI usually led to different optimal designs than BD-Opt, due to the latter penalising multi-modality and only focussing on posterior precision.
While we have applied our framework to implicit models with low dimensionality, the theory is general and extends to models with high-dimensional designs as well, albeit being more computationally intensive. Standard Bayesian optimisation, as used in this work, becomes, however, expensive and less effective in high dimensions. One would either have to utilise recent advances in high-dimensional Bayesian optimisation or look towards alternative gradient-free optimisation schemes, such as for example approximate coordinate exchange (Overstall and Woods, 2017).
The MI utility represents the information gain of an experiment and is thus focused on obtaining accurate estimation results. However, it does not take the computational or financial cost of the different experimental designs into account. For that purpose one may want to maximise a normalised information gain instead, where we for example divide the MI by the estimated cost of running the experiment. In our paper, we focused on experimental design for parameter estimation. However, we note that the proposed framework could also be applied to experimental design for model discrimination, as well as dual-purpose model discrimination and parameter estimation, by conditioning the data-generating process on a particular model and then averaging over the model space.
where y (i) ∼ p(y | d, θ (i) ), θ (i) ∼ p(θ) and the weights w k−1 are given by with r 1 (d * 1 , y * 1 , θ, D 0 ) = r 1 (d * 1 , y * 1 , θ) according to (2.10) and w 0 (θ) = 1 ∀ θ. In theory, the estimator in (B.1) has a lower variance than the estimator in (3.4) that we have used. In practice, however, we did not observe a significant difference but noticed that the estimator in (B.1) had longer computation times. Thus, we opted to use the sampling-based approach in (3.4) instead.

Appendix C: Parameter Space Transformation
We here describe in more detail how to transform the parameter samples θ and boundary conditions B before the resampling procedure explained in Section 3.3. Let θ (i) j be the jth element of the parameter sample θ (i) . If the parameters θ j in θ have different scales, the KD-Tree algorithm produces nearest neighbours that underestimate, or overstimate, the standard deviation σ during the resampling procedure (see section 3.3). We found that we can overcome this and increase robustness by transforming all parameter samples such that their elements are bound between 0 and 1. Thus, for every element of every parameter sample we do the transformation θ where θ max j and θ min j are the maximum and minimum, respectively, of the set of parameter samples {θ for the jth element. Now consider the boundary conditions B j = [B − j , B + j ] for the jth element of the parameter θ, where B − j and B + j are the lower and upper boundary, respectively. We assume that beyond these boundaries the prior probability p(θ j ) is zero and therefore we cannot resample beyond these boundaries. Using the same θ max j and θ min j as before, we transform the boundaries as well, i.e. In order to transform the resampled parameter samples back to the original parameter space, we simply have to invert (C.1) and obtain an expression for θ (i) j .

Appendix D: Simulation Plots for all Models
We show simulations of data as a function of time for all models considered in the main text in Figure 9 (oscillation toy model), Figure 10 (death model), Figure 11 (SIR model) and Figure 12      = p(y | θ, d)p(θ) log p(y | θ, d) where y (i) ∼ p(y | d, θ (i) ), θ (i) ∼ p(θ) and θ (j) ∼ p(θ).
For the sine model we use p(y | ω, t) = N (y; sin(ωt), 0.1 2 ), i.e. (4.1), in order to compute the reference MI according to (E.3) with N = M = 1,000. For the death model we use the data-generating distribution p(S | b, τ ) = Bin(S; S 0 , exp(−b(τ − τ 0 ))), where S is the number of susceptible individuals, S 0 = 50 and τ 0 = 0 (Cook et al., 2008;Kleinegesse and Gutmann, 2019). The number of susceptibles can be computed from the number of infected individuals I by S = S 0 − I. We then compute the reference MI using (E.3) and N = M = 1,000.

Appendix F: Additional plots for the Death Model
In Figure 13 we show average posterior densities at different measurement times for the Death model. The posterior densities were approximated by using the analytic likelihood of the model and averaged over 100 observations I(τ * ) at τ * ∈ {0.1, 1.0, 4.0}. In Figure 14 we show the sequential MI utilities for all iterations of the Death model, including the GP means and variances.