Parallel Gaussian process surrogate Bayesian inference with noisy likelihood evaluations

We consider Bayesian inference when only a limited number of noisy log-likelihood evaluations can be obtained. This occurs for example when complex simulator-based statistical models are ﬁtted to data, and synthetic likelihood (SL) method is used to form the noisy log-likelihood estimates using computationally costly forward simulations. We frame the inference task as a sequential Bayesian experimental design problem, where the log-likelihood function is modelled with a hierarchical Gaussian process (GP) surrogate model, which is used to eﬃciently select additional log-likelihood evaluation locations. Motivated by recent progress in the related problem of batch Bayesian optimisation, we develop various batch-sequential design strategies which allow to run some of the potentially costly simulations in parallel. We analyse the properties of the resulting method theoretically and empirically. Experiments with several toy problems and simulation models suggest that our method is robust, highly parallelisable, and sample-eﬃcient.


Introduction
When the analytic form of the likelihood function of a statistical model is available, standard sampling techniques such as Markov Chain Monte Carlo (MCMC, see e.g. Robert and Casella 2004) can often be used for Bayesian inference. However, many models of interest in several areas of science, for example in computational biology and ecology, have an expensive-to-evaluate or intractable likelihood function which severely complicates inference. When the likelihood is intractable but forward simulation of the model is feasible, simulation-based inference methods (also called likelihood-free inference) such as approximate Bayesian computation (ABC) can be used. Unfortunately, such algorithms typically require a huge number of simulations making inference computationally costly. Examples of models with intractable likelihoods can be found in e.g. Beaumont et al. (2002); Marin et al. (2012); Lintusaari et al. (2017); Marttinen et al. (2015); Järvenpää et al. (2018) and Section 6.2 of this article.
Surrogate models, also called meta-models or emulators, such as Gaussian processes (Rasmussen and Williams, 2006) have been used extensively to calibrate deterministic computer codes, see e.g. Kennedy and O'Hagan (2001). GP surrogates have recently also been used to accelerate Bayesian inference by modelling some part of the inferential process, such as the log-likelihood function. The model allows extracting information from the simulations efficiently, and can be used e.g. to determine where additional simulations are needed. For example, Rasmussen (2003); Kandasamy et al. (2015); Sinsbeck and Nowak (2017); Drovandi et al. (2018); Wang and Li (2018); Acerbi (2018) have developed GP-based techniques to accelerate Bayesian inference when the exact likelihood or the corresponding deterministic model is tractable but expensive. Various GP surrogate techniques have been proposed also for ABC, where one can only draw samples i.e. pseudo-data from a statistical model but not evaluate the likelihood. These include Meeds and Welling (2014); Jabot et al. (2014); Wilkinson (2014); Gutmann and Corander (2016); Järvenpää et al. (2019).
In this paper we focus on GP surrogate modelling of noisy log-likelihood evaluations. Earlier works on emulating the log-likelihood function have mostly assumed exact, i.e., noiseless evaluations or the noise has not been explicitly modelled. We show that noisy evaluations cause extra challenges. Also, although not the focus of this work, we remark that one often has some control over the noise level. While our approach is applicable whenever noisy, expensive log-likelihood evaluations of a statistical model of interest are available, we mainly focus on likelihood-free inference using the synthetic likelihood method (Wood, 2010;Price et al., 2018), where the intractable log-likelihood is approximated using repeated forward simulations at each evaluation location.
Recently, Järvenpää et al. (2019) developed a Bayesian decision theoretic framework for ABC inference and considered sequential strategies (also called active learning) to select the next evaluation location for an expensive simulation model. Here we extend this framework in two ways: 1) we modify it to address the problem of Bayesian inference using noisy log-likelihood evaluations, which is different from ABC, and 2) we develop batch-sequential design strategies to efficiently parallelise the estimation of the surrogate likelihood. In earlier related works the simulation locations have been selected either sequentially (Kandasamy et al., 2015;Sinsbeck and Nowak, 2017;Wang and Li, 2018;Acerbi, 2018;Järvenpää et al., 2019) or using simple heuristics (Wilkinson, 2014;Gutmann and Corander, 2016). Batch strategies are useful when a computing cluster is available and, as we show, can substantially reduce the computation time compared to the corresponding sequential strategies. We also analyse some properties of the proposed methods theoretically, and conduct an extensive empirical comparison.
However, while BO methods can be used to accelerate likelihood-free Bayesian inference (Gutmann and Corander, 2016), they are not specifically designed for estimating the posterior (see discussion in e.g. Kandasamy et al. 2015;Järvenpää et al. 2019). Similarly to Järvenpää et al. (2019), we explicitly design our algorithms from the first principles of Bayesian decision theory to acknowledge the goal of the analysis, i.e. estimation of the posterior density. Finally, we note that GPs in conjunction with Bayesian experimental designs have also been successful in estimating level and excursion sets of expensiveto-evaluate functions, see e.g. Bect et al. (2012); Chevalier et al. (2014); Lyu et al. (2018). This paper is organised as follows. Section 2 briefly reviews ABC and the SL. Sections 3 and 4 contain the details of our GP surrogate model and posterior estimation. Batch-sequential design strategies for sample-efficient estimation of the (approximate) posterior distribution are developed in Section 5 while Section 6 contains experiments. Finally, Section 7 contains discussion and concluding remarks. Proofs, implementation details and additional experiments can be found in the supplementary material (Järvenpää et al. 2020).

ABC and the synthetic likelihood methods
Our goal is to estimate parameters θ ∈ Θ of a simulation model given observed data x ∈ X . We assume Θ is a compact subset of R d and that the prior information about feasible values of θ is coded into a (continuous) prior pdf π(θ). For simplicity we consider only continuous parameters but discrete parameters can be handled similarly. If evaluating the likelihood function π(x | θ) is feasible, the posterior distribution can be computed using Bayes' theorem π(θ | x) ∝ π(θ)π(x | θ) up to a normalisation constant and hence be used as a target distribution in MCMC. However, when the likelihood is too costly to evaluate or unavailable, standard MCMC algorithms become infeasible.
Even when the likelihood is intractable, simulating "pseudo-data" from the model, i.e., drawing samples x θ ∼ π(· | θ), is often feasible. In this case, ABC can be used for inference, see e.g. Marin et al. (2012); Turner and Van Zandt (2012); Lintusaari et al. (2017). Standard ABC techniques approximate the posterior as where 1 denotes the indicator function, ε is a tolerance parameter and Δ : X 2 → R + is the discrepancy function used to compute the similarity between the simulated data x s and the observed data x. The discrepancy is typically constructed from lowdimensional summary statistics S : where Δ : R p × R p → R + is, for example, the weighted Euclidean distance. For each proposed parameter θ, an unbiased ABC posterior estimate can be obtained by replacing the integral in (1) with a Monte Carlo sum using some N simulated pseudo-data sets x An alternative to ABC is the synthetic likelihood method (Wood, 2010;Price et al., 2018). In SL the summary statistics S(x θ ) are assumed to have a Gaussian distribution for each parameter θ, that is (2) The first approximation results from replacing the full data x with a potentially nonsufficient summary statistics S(x). The second approximation is due to the possible violations of the Gaussianity of S(x). The expectation μ θ and covariance matrix Σ θ in (2) are unknown and are estimated for each proposed parameter θ using maximum likelihood (ML): where x (i) θ ∼ π(· | θ) for i = 1, . . . , N. As investigated by Price et al. (2018), the standard Metropolis algorithm can be combined with SL. The likelihood is then computed using (2) and the ML estimates in (3) or, alternatively, using an unbiased estimate of N (S(x) | μ θ , Σ θ ) shown in Section 2.1 of Price et al. (2018) which produces an exact pseudo-marginal MCMC if the Gaussianity assumption holds. See supplementary material C for discussion on the use of different SL estimators.
The advantage of SL over ABC is that specifying suitable ABC tuning parameters such as the tolerance and the discrepancy is avoided. While the Gaussianity of the summary statistics may not hold in practice, Price et al. (2018) have found that SL is often robust to deviations from normality. SL and its extensions (Thomas et al., 2018;An et al., 2019b,a;Frazier et al., 2019) produce pointwise noisy log-likelihood evaluations because in practice the number of repeated simulations N at each point is finite. Using (pseudo-marginal) MCMC or other sampling-based techniques for inference with these noisy targets thus requires a large number of simulations. Assuming noisy log-likelihood evaluations are available, e.g. obtained by using SL, the goal of the following sections is to develop an inference algorithm that can minimise the number of evaluations needed.

Gaussian process surrogate for the noisy log-likelihood
We denote the log-likelihood or its approximation, such as the log-SL obtained as the logarithm of (2), as f (θ) log π(x | θ). We assume that we have access to noisy loglikelihood evaluations at θ i denoted by y i ∈ R for building the surrogate model and that the "noise" i.e. the numerical or sampling error in evaluating the log-likelihood is independently Gaussian distributed. Treating the noisy log-likelihood evaluations y i as "observations", our measurement model is where σ n : Θ → (0, ∞) is a (continuous) function of θ that determines the standard deviation of the observation noise and is assumed known. To justify our model in (4), we show empirically that log-SL is well approximated by a Gaussian distribution using six benchmark simulation models in the supplementary material D.1.
We place the following hierarchical GP prior for the log-likelihood function f : where k : Θ 2 → R is a covariance function and h i : Θ → R are fixed basis functions (both assumed continuous). The nuisance parameters γ in (5) are marginalised, see e.g. O' Hagan and Kingman (1978); Rasmussen and Williams (2006), to obtain the following equivalent GP prior where h(θ) ∈ R q is a column vector consisting of the basis functions h i evaluated at θ. We use basis functions of the form 1, θ i , θ 2 i . A similar GP prior has been considered in Wilkinson (2014); Gutmann and Corander (2016); Drovandi et al. (2018), however, different from those articles, we take a fully Bayesian approach and marginalise γ as in Riihimäki and Vehtari (2014). Since little initial information is typically available on the magnitude and shape of the log-likelihood, we use relatively uninformative hyperpriors so that b = 0 and B ij = 30 2 1 i=j . We assume that the log-likelihood function is smooth, and adopt the squared exponential covariance function k(θ, although other choices, such as the Matérn covariance function, are also possible. We denote the d+1 covariance function hyperparameters as φ = (σ 2 f , l 1 , . . . , l d ). For now, we assume φ is known and omit it from our notation for simplicity.
, which we call training data, our knowledge of the log-likelihood function is and R t (θ) H(θ) − H t K −1 t k t (θ). Above H t is the q × t matrix whose columns consist of basis function values evaluated at training points θ 1:t = [θ 1 , . . . , θ t ] which is itself a d × t matrix, and H(θ) is the corresponding q × 1 vector at test point θ. From now on, we denote the GP variance function as s 2 t (θ) c t (θ, θ) and the probability law of f

Estimators of the posterior from the GP surrogate
Using the GP surrogate model for the noisy log-likelihood, we here derive estimators for the posterior which can be e.g. plugged-in to a MCMC algorithm. Resulting sampling algorithms do not require further simulator runs (unlike e.g. SL-MCMC) producing potentially huge computational savings. Figure 1 demonstrates our approach. We want to use our knowledge of the log-likelihood function represented by Π f Dt to determine the optimal point estimate of the probability density function (pdf) of the posterior. 1 The uncertainty of the log-likelihood f can be propagated to the posterior distribution of the simulation model which consequently becomes a random quantity denoted as π f : The expectation of the posterior pdf π f at each parameter θ can be formally written as and the variance can be obtained similarly (assuming these quantities exist). In principle, one could sample posterior pdfs by first drawing f (i) ∼ Π f Dt (a continuous function Θ → R), then computing π(θ) exp(f (i) (θ)), and finally normalising. However, in practice this would require discretisation of the Θ-space and involves computational challenges. For this reason and similarly to Sinsbeck and Nowak (2017); Järvenpää et al. (2019), we instead take our quantity of interest to be the unnormalised posterior which follows log-Gaussian process that allows analytical computations.
Next we derive an optimal estimator for the unnormalised posteriorπ in (12) using Bayesian decision-theory. We proceed here similarly to Sinsbeck and Nowak (2017) and consider the integrated quadratic loss function l 2 (π 1 ,π 2 ) Θ (π 1 (θ) −π 2 (θ)) 2 dθ between two (unnormalised) posterior densitiesπ 1 andπ 2 . We assumeπ 1 andπ 2 are square-integrable functions in Θ, i.e.π 1 ,π 2 ∈ L 2 (Θ). The optimal Bayes estimator, denoted byπ ∈ D, is the minimiser of the expected loss, where D = L 2 (Θ) denotes the set of candidate estimators. In detail, where Tonelli theorem is used to change the order of expectation and integration and whered ∈ D = L 2 (Θ) is a candidate estimator ofπ. Equation 13 shows that the expected loss is minimised when the integrand on the second row is minimised independently for (almost) each θ ∈ Θ. It follows from the basic results of Bayesian decision theory (see e.g. Robert 2007) that the minimum is obtained whend(θ) = E f | Dt (π f (θ)), i.e., the optimal estimator is the posterior expectation. The minimum value of (13), called Figure 1: (a) GP surrogate model for the log-SL of the Ricker model of Section 6.2 when only the first parameter, θ = log(r), is varied. The black dots show the noisy log-SL evaluations and the black lines their approximate 95% confidence intervals, the grey area the 95% credible interval, and the red line the GP mean function. (b) Uncertainty of the SL. The grey area shows the 95% credible interval of the SL and the red line is the median estimate, obtained from (15). The dashed blue line shows the standard deviation of SL computed as the square root of (14).
Bayes risk, is the integrated variance Θ V f | Dt (π f (θ)) dθ. The posterior expectation and variance can be computed from the log-Normal distribution as If we instead use L 1 loss l 1 (π 1 ,π 2 ) Θ |π 1 (θ) −π 2 (θ)| dθ whereπ 1 ,π 1 ∈ L 1 (Θ), we can similarly show that the optimal point estimator is the marginal median. The median and the α-quantile q α with α ∈ (0, 1) can be computed as where Φ −1 is the quantile function of the standard normal distribution. As we show in the supplementary material A, the Bayes risk corresponding to the L 1 loss is When MCMC is used with the point estimator of the unnormalised posterior in either (14) or (15), we are in fact targeting the following mean and median based estimators of the (normalised) posterior These are obtained by simply normalising the Bayes optimal estimators of the unnormalised posterior (and, as a consequence, a guarantee of optimality for normalised posterior does not follow). Similar estimators were also considered by Stuart and Teckentrup (2018). Both are clearly valid density functions, and tractable, unlike (11). The latter, i.e., the marginal median based estimate, is equal to (10) if we replace the unknown log-likelihood function f (θ) with a GP mean function m t (θ) and neglect GP uncertainty. On the other hand, the former, i.e., the marginal mean estimate, takes into account the GP uncertainty through the variance function s 2 t (θ). These two point estimates become the same if the GP variance is negligible.

Parallel designs of simulations
In the previous section we showed how to quantify the uncertainty of the GP surrogatebased unnormalised posterior density and we derived computable and (in a certain sense) optimal point estimates of it. Next we develop Bayesian experimental design strategies to select further locations to evaluate the log-likelihood, so that the uncertainty in the unnormalised posterior decreases as fast as possible. We focus on batch strategies and denote the batch size as b ∈ {1, 2, . . .}. Before moving on, we introduce some terminology. The next batch of b evaluation locations is obtained as the solution to an optimisation problem. We call the objective function of this optimisation problem a design criterion and the resulting batch of evaluation locations as design points or just design. The complete procedure of selecting the design points is called a batch-sequential (or, when b = 1, just sequential ) strategy. 2 In this paper we focus on synchronous parallelisation where a batch of b design points is constructed at each iteration and the corresponding b simulations are simultaneously submitted to the workers. However, the "greedy" design strategies developed in Sections 5.3 and 5.5 can also be used for asynchronous parallelisation, where a new location is immediately chosen and submitted for processing, whenever any of the running simulations completes, instead of waiting all the other b − 1 simulations to finish.

Analytical expressions for the design criteria
We first derive some general results needed for efficient evaluation of the design criteria. These can be useful also for developing batch designs for other related GP-based problems such as BQ and BO. Given would affect our knowledge about the log-likelihood f and the unnormalised posteriorπ. The following Lemma is central to our analysis. It shows how the GP mean and variance functions are affected by supplementing the training data D t with a new batch of evaluations when the unknown y * is assumed to be distributed according to the posterior predictive distribution of the GP given where δ(·) is the Dirac measure and In the Lemma, m t+b (θ; θ * ) is the GP mean function at iteration t + b whose dependence on θ * is shown explicitly. Importantly, the above Lemma shows how the GP variance decreases from s 2 t (θ) to s 2 t+b (θ; θ * ) when the extra b evaluations at θ * are included, and the reduction , if the evaluation points θ * i are located far from each other, then .
This shows that the reduction of GP variance at θ, τ 2 t (θ; θ * ), factorises over the new evaluation points θ * i in θ * . Intuitively, if the test point θ is strongly correlated with some evaluation point θ * i , including the evaluation at θ * i will result in a large reduction of variance at the test point. Furthermore, the larger the noise variance σ 2 n (θ * i ) at the evaluation point θ * i is, the less the GP variance will decrease. It clearly holds that 0 ≤ τ 2 t (θ; θ * ) ≤ s 2 t (θ). Items (i-ii) of the following Lemma summarise some additional properties of the variance reduction function in (20) and (iiiiv) show two further useful identities needed later. Item (i) shows the (rather obvious) result that the order of evaluation points in θ * does not change τ 2 t (θ; θ * ) and in the following we often identify the d × b matrix θ * with a multiset whose elements are the columns of θ * although this leads to some abuse of notation. (20) for any fixed θ has the following properties.
(i) The function value is invariant to the permutation of the evaluation locations i.e. the columns of θ * .
Then were available, such that the GP variance is not exactly zero at these locations.
Figure 2 demonstrates how two new evaluations at θ * 1 and θ * 2 reduce the GP variance in two different one-dimensional examples. Specifically, Figure 2a illustrates the fact of Lemma 5.2 (iii) that the interaction between the evaluation points, represented by the term r t (θ; θ * 1 , θ * 2 ), affects the reduction of the GP variance and its effect can be either positive or negative (the variance after two new evaluations, blue line, is either below or above the yellow line, which represents the reduced variance if the interaction is neglected). In Figure 2b the new evaluation locations are far apart and the factorisation of the variance reduction in (21) holds approximately.

Batch-sequential designs
, our goal is to select the next batch of b evaluations θ * in an optimal fashion. We take a Bayesian decision theoretic approach, where θ * is selected to minimise the expected (or median) loss, where the loss measures uncertainty remaining in the unnormalised posteriorπ f when the hypothetical observations y * at locations θ * are taken into account. In the following we develop two such techniques based on two different measures of uncertainty: variance and interquartile range (IQR). Design strategies which acknowledge the impact of the next batch, but neglect the whole remaining computational budget, are often called "myopic". It is possible to formulate a non-myopic design as a dynamic programming problem, but this is computationally demanding, see e.g. Bect et al. (2012); González et al. (2016). Consequently, we focus on myopic designs which already produce highly sample-efficient and practical algorithms.

Expected integrated variance (EIV)
As our first measure of the uncertainty of the unnormalised posteriorπ f for the selection of the next batch design θ * , we select the Bayes risk under the L 2 loss. In this case, the Bayes risk is the integrated variance function whose integrand was obtained from (14). This is similar to Sinsbeck and Nowak (2017); Järvenpää et al. (2019) who, however, considered other GP surrogate models and only sequential designs. We compute the expectation over the hypothetical noisy loglikelihoods y * for any candidate design θ * , leading to the expected integrated variance design criterion (EIV). The resulting optimal strategy is a special case of stepwise uncertainty reduction technique, see e.g. Bect et al. (2012). This criterion is evaluated efficiently without numerical simulations from the GP model, using the following result.
Proposition 5.3. With the assumptions of Lemma 5.1, the expected integrated variance design criterion L v t at any candidate design

Integrated median interquartile range (IMIQR)
A sequential design strategy based on EIV worked well in the ABC scenario of Järvenpää et al. (2019), who modelled the discrepancy in (1) with a GP. In this article we instead model the log-likelihood with a GP as illustrated in Figure 1 and the goal is to minimise the uncertainty of the unnormalised posteriorπ f , which has a log-Normal distribution for a fixed θ. However, the expectation and variance can be suboptimal estimates of the central tendency and uncertainty of a heavy-tailed distribution such as log-Normal.
For example, Figure 1b shows that the standard deviation (dashed blue line) grows very rapidly at the boundaries although at the same time the credible interval clearly indicates that the probability of the log-likelihood, and consequently the likelihood, of having a non-negligible value there is vanishingly small. The mean is similarly affected in a non-intuitive way by the heavy tails: It is fairly easy to see that This means that with a sufficiently large variance of the log-likelihood s 2 t (θ), the probability that the likelihood exp(f (θ)) is greater than its own mean becomes negligible.
The above analysis suggests (and empirical results in Section 6 further confirm) that mean-based point estimates and variance-based design strategies, such as the EIV and those proposed by Gunter et al. (2014); Kandasamy et al. (2015); Sinsbeck and Nowak (2017); Järvenpää et al. (2019); Acerbi (2018), are unsuitable when log-likelihood evaluations are noisy. A reasonable alternative for the L 2 -loss used to derive the EIV is to measure the uncertainty in the posterior using the L 1 -loss, which is less affected by extreme values. As shown in Section 3, the L 1 -loss leads to the marginal median estimate for the posterior, π med t . While the L 1 -loss produces a robust median estimator that we adopt, (16) shows that the Bayes risk with L 1 loss scales as exp(s 2 t (θ)/2) since Φ(s t (θ)) ≈ 1 for large s t (θ), such that also this measure for overall uncertainty ofπ f is affected by the heavy tails of the log-Normal distribution.
We propose a new, robust design criterion for selecting the next design. In place of the variance in EIV, we use a robust measure of uncertainty, the interquartile range IQR(θ) = q 3/4 (θ) − q 1/4 (θ). The integrated IQR loss measuring the uncertainty of the posterior pdf is defined as where u Φ −1 (p u ) and sinh(z) = (exp(z)−exp(−z))/2 for z ∈ R is the hyperbolic sine, which emerges after using (15). While we use p u = 0.75, other quantiles p u ∈ (0.5, 1) are also possible. A theoretical downside of the IQR loss is that it does not formally coincide with the Bayes risk for the L 1 or L 2 loss, which correspond to the optimal point estimators of the unnormalised posterior (see Section 4).
We also use the median in place of the mean to measure the effect of the next design θ * to the loss function. That is, we use median loss decision theory (see Yu and Clarke 2011), and define the median integrated IQR loss function as The median integrated IQR loss in (29) is intractable but it can be approximated by the integrated median IQR loss (IMIQR). This approximation 3 follows by replacing the predictive distribution of y * with a point mass, i.e., π(y * | θ * , D t ) ≈ δ(m t (θ * ) − y * ). This approximation resembles the so-called kriging believer heuristic in Ginsbourger et al. (2010). The next result gives a useful formula to calculate IMIQR.
The integrand of (30) is recognised as a product of the marginal median estimate of the posterior in (15) and the function sinh(us t+b (θ; θ * )). Hence, to minimise IMIQR, the simulation locations θ * need to be chosen as a compromise between regions where the current posterior estimate is non-negligible and where the GP variance s 2 t+b (θ; θ * ) decreases efficiently when the simulations are run at θ * . Similar interpretation holds also for the EIV function in (26). However, EIV assigns significantly more weight to areas with high GP variance than IMIQR.

Joint and greedy optimisation for batch-sequential designs
We can now evaluate EIV and IMIQR design criteria for any candidate design θ * and choose θ * as the minimiser, i.e., where L t is either the EIV in (26) or IMIQR in (30). The objective function is typically smooth but multimodal so global optimisation is needed. We call (31) as "joint" optimisation which does not scale to high dimensional parameter spaces or to large batch sizes. Even if computing the design criterion is cheap as compared to the run times of typical simulation models, solving the db-dimensional global optimisation problem is often impractical as discussed in Wilson et al. (2018). Hence, we consider greedy optimisation as also used in batch BO (Ginsbourger et al., 2010;Snoek et al., 2012;Wilson et al., 2018). The greedy optimisation procedure for both EIV and IMIQR works as follows: the first point θ * 1 is chosen as in the sequential case i.e. by solving (31) This greedy approach simplifies the difficult db-dimensional optimisation into b separate d-dimensional problems, and makes it scalable as a function of b.
In general, the design found by the greedy optimisation does not equal the minimiser of the joint criterion. It follows from Lemma 5.2 (i) that both EIV and IMIQR are invariant to the order of evaluation locations in θ * but this does not hold for the greedy procedure. Bounds for the performance of greedy maximisation of a set function have been studied in literature, see e.g. Nemhauser et al. (1978); Krause et al. (2008); Bach (2013). For example, if the design criterion (when defined equivalently using a utility so that (32) becomes a maximisation problem) is submodular and non-decreasing in batch size b, then the worst-case outcome of greedy optimisation is at least 1 − 1/e ≈ 0.63 of the corresponding optimal joint value. A utility function corresponding to IMIQR defined below is not submodular but an approximation of it is weakly submodular (see e.g. Krause et al. 2008;Krause and Cevher 2010). We use this fact to derive a weaker but still useful bound.
We here consider an approximationL IQR,a which follows from the observation that sinh(us t+b (θ; θ * )) ≈ u 2 s 2 t+b (θ; θ * ) and where we had u = Φ −1 (p u ). The approximation in (33) is reasonable when s t+b (θ; θ * ) ∈ [0, 3/u] in the region where π(θ)e mt(θ) is non-negligible. For simplicity, we consider a discretised setting where the optimisation is done over a finite setΘ ⊂ Θ and define (approximate) IMIQR utility function as for θ * ∈ 2Θ. Clearly, maximisingŨ IQR,a t (θ * ) is equivalent to minimisingL IQR,a t (θ * ). The following theorem gives a bound for the greedy optimisation of the (approximate) IMIQR utility function. The proof can be found in supplementary material B.2.
Theorem 5.5. Consider the set functionŨ IQR,a t : 2Θ → R + in (34). Let θ O be a (joint) optimal solution for maximisingŨ IQR,a t (θ) over θ ⊂Θ, |θ| ≤ b. The greedy algorithm for this maximisation problem outputs a set θ G ⊂Θ satisfying Computing ε t explicitly is difficult but we expect that often ε t Ũ IQR,a t (θ O ) since r 1:t+i (θ; θ j , θ k ) given by (23) tends to be small. However, ε t may not always be small, the term b 2 ε t scales quadratically for batch size b, and the bound holds only approximately for IMIQR. This bound still suggests that, at least in some iterations of the algorithm, greedy IMIQR produces near-optimal batch locations. On the other hand, an approximation similar to (33) for EIV would be reasonable in a very limited number of situations, and experiments in supplementary material D.3 suggest that greedy EIV scales worse as a function of b compared to the corresponding greedy IMIQR strategy. Finally, we note that even when the bound is weak, new design points cannot increase the value of EIV or IMIQR loss function as shown in supplementary material B.1. Hence, the batch strategies cannot be worse than the corresponding sequential designs and, in practice, they are highly useful as is seen empirically in Section 6.

Implementation details
Using the GP surrogate model and the analysis from the previous sections, we show the resulting inference method as Algorithm 1. Some key implementation details are discussed below and further details (e.g. on handling GP hyperparameters, MCMC methods used, and optimisation of the design criteria) are given in supplementary material C. The algorithm is shown for the SL case using the IMIQR strategy, but it works similarly for EIV, heuristic designs developed in the next section and other log-likelihood estimators besides SL. The potentially expensive simulations on the lines 2-5 and 18-21 can be done in parallel. In the SL case, the simulations can be parallelised in terms of both the number of repeated simulations N and batch size b.
Algorithm 1 GP-based SL inference using IMIQR with synchronous batch design. Use MAP estimation to obtain GP hyperparameters φ using D b0+(i−1)b

10:
if joint optim then

11:
Obtain θ (i) *   Wu and Frazier (2016) and in Chevalier et al. (2014). If d ≤ 2 we discretise the parameter space Θ and approximate the integral in the resulting grid. In higher dimensions, we use self-normalised importance sampling (IS) as in Chevalier et al. (2014); Järvenpää et al. (2019). Specifically, we draw samples from the importance distribution θ (j) ∼ π q (θ) and use these as integration points to approximate where the integrand of either (26) or 30 is denoted by I t (θ; θ * ). As the proposal π q we use the current loss (see (14) and the integrand of (16)), which is a function Θ → R + , and can be interpreted as an unnormalised pdf. This is a natural choice because the current loss has a similar shape as the expected/median loss as a function of θ. We use the same proposal in the greedy optimisation in (32) although it would be also possible to adapt the proposal π q according to the pending points θ * 1:r−1 when optimising with respect to the rth point θ * r . We have assumed that the noise function σ 2 n in (4) is known. In practice, this is a valid assumption only in the noiseless case where σ 2 n (θ) = 0. As our focus is on the noisy setting, we need to estimate σ 2 n . Sometimes σ 2 n can be assumed to be an unknown constant to be determined together with the GP hyperparameters φ using MAP estimation. However, we observed that σ 2 n often depends on the magnitude of the log-likelihood (see Figure 1), making the assumption of homoscedastic noise questionable. Similarly to Wilkinson (2014), we estimate σ 2 n using the bootstrap. Specifically, with each new training data point θ i , we resample with replacement N summary vectors from the original population {S (j) i } N j=1 for 2000 times. We then compute the empirical variance of the resulting log-SL values and use it as a plug-in estimator for σ 2 n (θ i ). For EIV, IMIQR and greedy batch versions of MAXV and MAXIQR (see Section 5.5), σ 2 n needs to be also known at candidate design points θ * . Bootstrap cannot be used because the simulated summaries are only available for training data. We take a pragmatic approach and set σ n = 10 −2 at the candidate design points as if the future evaluations were almost exact. This simplification effectively reduces the occurrence of (potentially redundant) simulations at nearby points to encourage exploration. Alternatively, one could use another GP to model the bootstrapped variances or their logarithms and use the GP mean function as a point estimate for the function σ 2 n as in Ankenman et al. (2010).

Alternative heuristic designs strategies
Here we present some heuristic alternative design strategies. These are empirically compared to the more principled EIV and IMIQR strategies in Section 6. We first focus on sequential designs where b = 1.

MAXIQR:
A natural and simple approach is to evaluate where the current variance, IQR or some other suitable (local) measure of uncertainty is maximised. Such strategies are in some contexts called "uncertainty sampling". The advantage over EIV and IMIQR is cheaper computation because the effect of the candidate design point to the whole posterior is not acknowledged. Using IQR produces the design strategy which we abbreviate as MAXIQR because it evaluates at the maximiser of IQR. Taking the logarithm of (38), the MAXIQR strategy can be equivalently written as which shows a tradeoff between evaluating where the log-posterior is presumed to be large (the first two terms in (39)) and unexplored regions where the GP variance is large (the last two terms). This formula also shows an interesting connection to the upper confidence bound (UCB) criterion commonly used in BO, see e.g. Srinivas et al. (2010); Shahriari et al. (2015). The UCB acquisition function is UCB(θ) = m t (θ) + β t s t (θ), where β t is a tradeoff parameter, here automatically chosen to be β t = Φ −1 (p u ). Compared to the standard UCB, there is, however, an extra term in (39) which further penalises regions having small variance s 2 t . If the variance s 2 t is large everywhere and/or if p u is an extreme quantile, then the last term in (39) is approximately zero and the MAXIQR design criterion approximately equals UCB.

MAXV:
When the variance is used instead of IQR, we obtain a strategy This strategy is abbreviated as MAXV which, in fact, is used by Gunter et al. (2014); Kandasamy et al. (2015) in the noiseless case, and it is called "exponentiated variance" by Kandasamy et al. (2015). Taking logarithm of (40) shows that this design also features a tradeoff between large posterior and large variance, similarly to MAX-IQR.
Since these two strategies are not derived from Bayesian decision theory, it is not immediately clear how one should parallelise these inherently sequential strategies. However, it seems reasonable to use the fact the s 2 t (θ) is always reduced near the pending evaluation locations. Motivated by this and related BO techniques in Ginsbourger et al. (2010); Snoek et al. (2012); Desautels et al. (2014), we compute the median value of the design criterion with respect to the posterior predictive distribution of the pending simulations. The next locations are chosen iteratively such that, for MAXIQR, the first point θ * 1 in the batch is chosen using (38) and the rest θ 2:b by iteratively solving θ * r = arg max θ∈Θ π(θ)e mt(θ) sinh(us t+r−1 (θ; θ * 1:r−1 )), r = 2, 3, . . . , b.
MAXV is parallelised similarly but using the expected value instead of the median.
Finally, we provide some intuition to (41) and show a connection to the local penalisation method used to parallelise sequential BO designs by Gonzalez et al. (2016). Suppose we are selecting the rth point of a batch where 2 ≤ r ≤ b. Comparison of (38) and (41) shows that (41) equals the original design criterion in (38) multiplied by a weight function ω(θ; θ * 1:r−1 ) sinh(us t+r−1 (θ; θ * 1:r−1 ))/ sinh(us t (θ)). It is easy to see that ω(θ; θ * 1:r−1 ) ∈ [0, 1]. This shows that when we take the median over the log-likelihood evaluation at the pending points θ * 1:r−1 , we are implicitly making the original acquisition function smaller around the pending points and, consequently, penalising additional evaluations there. This resembles the heuristic method by Gonzalez et al. (2016), who proposed to multiply the non-negative acquisition function, such as the objective of (38), with 1≤j<r ϕ(θ; θ * j ), where ϕ(θ; θ * j ) are local penalising functions around the pending evaluation locations θ * j , when selecting the rth point θ * r of the current batch. However, one difference between these approaches is that our weight function ω takes the interactions between the pending points into account and it cannot be factorised as ω(θ, θ * 1:r−1 ) = 1≤j<r ϕ(θ; θ * j ). Also, our weight function is not a tuning parameter but follows automatically from our analysis.

Experiments
We empirically investigate the performance of the proposed algorithm with different design strategies developed in Section 5. We compare the sequential, batch, and greedy batch strategies based on EIV and IMIQR to sequential and greedy versions of MAXV (which is essentially the same as the BAPE method by Kandasamy et al. 2015) and MAXIQR. As a simple baseline we also sample design points from the prior (always uniform) and this method is abbreviated as RAND.
We report the results as figures whose y-axis shows the accuracy between the estimated and the ground truth posterior using total variation distance (TV). TV between pdfs π 1 and π 2 is defined as TV(π 1 , π 2 ) = 1/2 Θ |π 1 (θ) − π 2 (θ)| dθ and is computed using numerical integration in 2D. In higher dimensional cases we compute the average TV between the marginal posterior densities using MCMC samples. The marginal median estimator in (17) is used to obtain the point estimate for the posterior pdf. The x-axis represents the iteration i of the Algorithm 1 (unless explicitly stated otherwise) which serves as a proxy to the total wall-time when the noisy likelihood evaluations are assumed to dominate the total computational cost. We use a fixed simulation budget so that the batch-sequential methods terminate earlier than the sequential ones because they spend the evaluation budget b times faster due to the parallel computation.
We consider two sets of experiments: toy models where noisy log-likelihood evaluations are directly evaluated (Section 6.1) and real-world simulator-based statistical models where SL is used to obtain noisy log-likelihood evaluations using N repeated simulations at each proposed parameter (Section 6.2). Although in the SL case it is often possible to adjust N adaptively, we use N = 100 (unless explicitly stated otherwise) for simplicity. In the supplementary material D.4 we show that our batch methods are beneficial for SL even though in principle it would be possible to directly parallelise the N simulations themselves. For example, when 1, 000 computer cores are available for the simulations (e.g. in a high performance computing cluster), it is beneficial to use the batch strategies (e.g. N = 100 and b = 10) instead parallelising the evaluations at a single location (N = 1000 and b = 1). More elaborate analysis of resource allocation is left for future work. A MATLAB implementation of our algorithms is available at https://github.com/mjarvenpaa/parallel-GP-SL.

Noisy toy model likelihoods
We first define three 2D densities with different characteristics: a simple Gaussian density called 'Simple', a banana-shaped density 'Banana' and a bimodal density 'Bimodal'. We then construct three 6D densities so that their 2D blocks are independent and have the corresponding 2D densities as their 2D marginals. Detailed specification and illustrations can be found in supplementary material D.2. We use the same names for the 6D densities as for the corresponding 2D ones (except that 'Bimodal' is called 'Multimodal' in 6D because it has 2 3 = 8 modes). The independence assumption is not taken into account in the GP model to make the inference problem more challenging. For simplicity, σ n (θ) is assumed constant i.e. it does not depend on the magnitude of the log-likelihood and its value is obtained using MAP estimation together with other GP hyperparameters φ at each iteration i in Algorithm 1. As an initial design in 6D we generate b 0 = 20 parameters ('Simple') or b 0 = 50 ('Banana' and 'Multimodal') from uniform priors. In the 2D case we always use b 0 = 10. We use a fixed total budget of t = 620 noisy log-likelihood evaluations ('Simple') or t = 650 ('Banana' and 'Multimodal') for both sequential and batch methods in 6D.
The results with different sequential and greedy batch-sequential strategies in 6D case with batch size b = 5 are shown in Figure 3. Good posterior approximations for the Simple example are obtained earlier than for the two other models. This is a consequence of the quadratic terms in the GP prior mean function and the exact Gaussian shape of the posterior. However, more complicated posteriors are also estimated accurately although more iterations are needed to obtain reasonable approximations. The IMIQR method works clearly the best outperforming EIV and the heuristic MAXV and MAXIQR methods which either need more iterations to obtain good approximation or fail completely to reach good results. The uniform design RAND works adequately in the Simple and Banana models but often produces poor estimates for the Multimodal case. Unsurprisingly, its performance is also poor in the real-world scenarios in Section 6.2.
The batch-sequential strategies improve the convergence speed as compared to the corresponding sequential strategies in all cases of Figure 3. In particular, the greedy batch versions of MAXV and MAXIQR even outperform the corresponding sequential methods. The greedy batch strategy in these cases encourages exploration as compared to the corresponding sequential strategy and this effect counterbalances the exploitative nature of MAXV and MAXIQR. In supplementary material D.3 we compare the joint and greedy batch strategies in 2D case where joint maximisation is still feasible. Their difference is found small for IMIQR and small or moderate for EIV suggesting that the greedy strategies are in practice nearly optimal.  points near the mode of the posterior where the local measure of uncertainty they use tend to be highest. Also, in general, the sequential and batch methods produce similar designs. However, greedy MAXIQR generates more points on the boundary than the corresponding sequential strategy and the joint IMIQR produces slightly more diverse design points as the sequential and greedy batch IMIQR. In all cases, IMIQR avoids redundant evaluations on the boundaries.
We investigate the effect of batch size b in the greedy batch MAXIQR and MAXIQR algorithms in Figure 5. In general, the convergence speed of both methods scales well as a function of b and b = 10 already yields useful improvements. However, increasing b over 40 would improve the results only slightly. Greedy batch IMIQR works overall better than batch MAXIQR. The variability in the posterior approximations produced by IMIQR is small in all cases unlike for MAXIQR which occasionally produced poor approximations (not shown for clarity). In the supplementary material D.3 we compare EIV and IMIQR in 2D. These results show that the greedy IMIQR batch-sequential Figure 5: Results with greedy batch strategies and varying batch size b for the 6D toy models. For each method, the median TV computed over 50 repeated simulations is shown. The x-axis is truncated after i = 400 iterations to ease visualisation. strategy outperforms the corresponding EIV strategy although their difference is small in the corresponding sequential cases. This suggests that, even when σ n is small so that the variance in EIV serves as a reasonable measure of uncertainty and the sequential EIV works similarly to IMIQR, the greedy batch median-based IMIQR design strategy better mimics the sequential decisions than EIV.

Simulation models
We perform experiments with three benchmark problems used previously in the ABC literature. Two of these are shown here and the third one in the supplementary material D.5. While the proposed methodology is particularly useful for expensive simulation models, we however consider only relatively cheap models as this allows to repeat the computations many times with different realisations of randomness to assess the variability and robustness, and to conduct accurate comparisons to reasonable ground truth posteriors. Nevertheless, these experiments serve as examples of challenging real-world inference scenarios where the GP and SL modelling assumptions do not hold exactly. In each problem, we set the unknown parameter of the simulation model to a value used previously in the literature and generated one data set from the simulation model using this "true" parameter. The posterior used as the ground truth was computed using SL-MCMC. Multiple chains each with length 10 6 were used to ensure that the variability due to Monte Carlo error was small.

Ricker model
We first consider the Ricker model presented in Wood (2010). In this model N t denotes the number of individuals in a population at time t which evolves according to the discrete time stochastic process N t+1 = rN t exp(−N t + ε t ), for t = 1, . . . , T , where ∼ N(0, σ 2 ε ). The initial population size is N 0 = 1. It is assumed that only a noisy Given data x = (x t ) T t=1 , the goal is to infer the three parameters θ = (log(r), φ, σ ε ). We use the uniform prior (log(r), φ, σ ε ) ∼ U([3, 5]× [4, 20] × [0, 0.8]). The same 13 summary statistics as in Wood (2010); Gutmann and Corander (2016); Price et al. (2018) are used to compute log-SL evaluations. The number of repeated simulations is fixed to N = 100. The "true" parameter to be estimated is θ true = (3.8, 10, 0.3) and it is used to generate the observed data with length T = 50. The initial training data size is b 0 = 30 and the additional budget of simulations is 600 so that the total budget is 630 SL evaluations corresponding 63000 simulations. The integrals of EIV and IMIQR are approximated using IS and σ 2 n is estimated using bootstrap as described in Section 5.4. Figure 6 shows the results. We see that the EIV and MAXV strategies perform poorly. These strategies tend to evaluate where the variance of the posterior is high, although as discussed in Section 5, these do not necessarily correspond to the regions with non-negligible likelihood. In fact, the magnitude of the log-likelihood and its noise variance σ 2 n grow fast near the boundaries of the parameter space where the chaotic nature of the model also makes the log-likelihood surface irregular which further causes difficulties with GP modelling. The IMIQR method again produces the best posterior approximations which are comparable to the true SL posterior. Some examples are shown in Figure 6b-d. Also, MAXIQR method works well on average but it produces less coherent results than IMIQR which is likely the result of its exploitative nature. In addition, unexpectedly, the greedy MAXIQR method performs poorly. The probable reason is that the batch evaluations become too diverse having many evaluations in the boundary which leads to poor GP fitting and subsequently poor future designs. However, the robust batch-sequential IMIQR method with b = 5 works as expected producing useful improvement to the convergence speed as compared to sequential IM-IQR.

g-and-k model
We consider the g-and-k distribution as in Price et al. (2018). The g-and-k model is a flexible probability distribution defined via its quantile function where a, b, c, g and k are parameters and p ∈ [0, 1] is a quantile. We fix c = 0.8 and estimate the parameters θ = (a, b, g, k) using a uniform prior π(θ) = U ([2.5, 3.5] × [0.5, 1.5] × [1.5, 2.5] × [0.3, 0.7]). We use the same four summary statistics as Price et al. (2018) who fitted an auxiliary model, skew t-distribution, to the set of samples generated from (42) using maximum likelihood, and took the resulting skew t score vector at the ML estimate as the summary statistic. Although there are only 4 summary statistics, we again use N = 100. We use the same settings as for the Ricker model except that the initial design is increased to b 0 = 40 so that the total budget is 640 SL evaluations. The true value of the parameter is chosen to be θ true = (3, 1, 2, 0.5).
Overall, the results in Figure 7 are similar to those of the Ricker model. However, the larger parameter space slows down the convergence speeds initially as expected, as compared to the Ricker model. Low dimension of the summary statistic and the moderately large value N = 100 cause the log-likelihood evaluations to be quite accurate near the modal area of the likelihood (σ n (θ true ) ≈ 0.15) and we expect that smaller N might be already enough. However, using N = 100 ensures accurate variance estimates using  Figure 6. the bootstrap. While MAXIQR strategy works almost as well as IMIQR on average, it completely fails in some individual repeated experiments producing long variability intervals in Figure 7 leaving IMIQR as the only successful method.

Effect of batch size
As the last experiment, we investigate the improvements brought by the batch-sequential IMIQR strategy in the case of real-world simulation models. We use the Ricker and gand-k models from the previous subsections. The experiment details are the same except that we consider only IMIQR strategy with several batch sizes b ∈ {2, 5, 10, 20, 30}. The results in Figure 8 show that, on average, the greedy batch-sequential IMIQR with batch sizes up to 30 produces as good approximations as the corresponding sequential strategy. The convergence speed is also improved almost linearly.
However, the variability in the quality of the estimated posteriors increases with larger batch sizes when the total budget of simulations is kept fixed. While most of the repeated runs of the algorithm have converged to excellent approximations in all cases as seen in Figure 8, there were some individual runs where the algorithm did not yet converge when the budget was used. The posterior estimate at the final iteration is often quite poor in these cases. Most of these happen with Ricker model when b ≥ 20 and with g-and-k model when b = 30. However, this behaviour is not surprising: When b is large, the complete batch is constructed using the same limited information which Figure 8: Results of the greedy batch strategy IMIQR with various batch sizes b for Ricker and g-and-k models. Top row shows the median TV over 100 repeated experiments and the bottom row shows the corresponding 90% quantile.
necessarily produces occasional poor batches providing little information. Furthermore, the importance density in (37) is likely to get worse when the batch size is increased and cause the last points in the batch to be less useful. It is thus inevitable that the batch size should not be chosen too large. Nevertheless, it is seen that batch size b = 10 already produces substantial gains (especially considering that further parallelisation is often possible with respect to N ) and produces consistently accurate posterior approximations.

Discussion and conclusions
If only a limited number of noisy log-likelihood evaluations can be computed, standard techniques such as MCMC become difficult to use for Bayesian inference. To tackle the problem, we constructed a hierarchical GP surrogate model for the noisy log-likelihood and discussed properties of the resulting estimators of the (unnormalised) posterior. We developed two batch-sequential strategies (EIV and IMIQR) based on Bayesian decision theory, to (semi-)optimally select the next evaluation locations and to parallelise the costly simulations. We also considered heuristic design strategies (MAXV and MAXIQR). We provided some theoretical analysis: We derived an approximate bound for the greedy optimisation of the batch IMIQR method using the concept of weak submodularity, showed a connection between the UCB (a common BO method) and the MAXIQR strategy, and between batch MAXIQR and the local penalisation method by Gonzalez et al. (2016). The proposed methods were investigated experimentally.
The IMIQR strategy was found to be robust both to violations of the GP surrogate model assumptions and to the heavy-tails of the resulting distributions. Unlike the other design strategies, it consistently produced posterior approximations comparable to the ground truth. Greedy batch-sequential IMIQR strategy was found to be highly useful to parallelise the potentially expensive simulations. In our experiments it produced substantial, sometimes even linear, speed improvements for batch sizes b 20. We thus recommend the IMIQR strategy. In general we were able to obtain useful posterior approximations with 10, 000 to 20, 000 simulations that can be easily parallelised. This is considerably less than e.g. using (pseudo-marginal) MCMC requiring typically at least tens of thousands of iterations corresponding to millions of simulations, careful convergence assessment and tuning of the proposal density. Another important observation was that the heuristic strategies that evaluate where the current uncertainty is highest, despite their small computational cost and good performance in earlier studies with deterministic evaluations, worked poorly with noisy log-likelihood evaluations.
Similarly to other GP surrogate techniques such as BO, fitting the GP and finding the next evaluation locations by optimising the design criterion is however not free. Our unoptimised MATLAB implementations of MAXV and MAXIQR are fast, but optimisation of the EIV and IMIQR design criteria takes a couple of seconds in 2D and around 20 to 80 seconds per parameter in 3D and 4D. This means that the proposed algorithm is useful when the simulation time is several seconds or more, which is however true with many real-world simulation models. Furthermore, the quality of the posterior approximation also depends on the choice of the surrogate model. We used the same GP model in all of our experiments with no problem-specific tuning, which already produced good results. However, some problems would certainly benefit from further adjustment and incorporation of domain knowledge. For example, if the likelihood is expected to be flat, a GP prior with a constant mean function might be appropriate.
In this work, similarly to Rasmussen (2003); Wilkinson (2014); Kandasamy et al. (2015); Gutmann and Corander (2016); Drovandi et al. (2018), we built our surrogate GP model for the log-likelihood. An alternative way would be to model the summary statistics. Meeds and Welling (2014) used such an approach but they assumed that the summary statistics are independent. However, modelling the scalar-valued loglikelihood is simpler and our approach also applies as such to non-ABC scenarios with exact (but potentially expensive) log-likelihood evaluations as in Osborne et al. (2012); Kandasamy et al. (2015); Wang and Li (2018); Acerbi (2018). Gutmann and Corander (2016); Järvenpää et al. (2018Järvenpää et al. ( , 2019 modelled the discrepancy between simulated and observed data with a GP and obtained reasonable posterior approximations with only a few hundred model simulations. Here we need N repeated simulations just to compute the log-likelihood for a single parameter value, which can be seen as the price of not having to specify an explicit discrepancy measure and the ABC tolerance. We see several avenues for future research. The consistency and convergence rates of our algorithms could be investigated theoretically. Some work towards that direction has been done by Bect et al. (2019); Stuart and Teckentrup (2018). Adaptive control of the number of repeated simulations N could likely be used to further reduce the number of simulations required, possibly as in Picheny et al. (2013). Some simulation models may behave unexpectedly near the boundaries of the parameter space violating GP model assumptions as we saw with the Ricker model in Section 6. Similarly, situations where the prior is significantly more diffuse than the posterior may be unsuitable for our approach that relies on a global GP surrogate. Consequently, it would be useful to learn adaptively not only where to evaluate next but also which parameter regions to rule out completely. This could be done as in Wilkinson (2014) or possibly by adapting ideas from the constrained BO literature (Gardner et al., 2014;Sui et al., 2015).

Supplementary Material
Supplementary material of "Parallel Gaussian process surrogate Bayesian inference with noisy likelihood evaluations" (DOI: 10.1214/20-BA1200SUPP; .pdf). The supplementary material contains proofs, implementation details and additional experimental results.