Sequentially guided MCMC proposals for synthetic likelihoods and correlated synthetic likelihoods

Synthetic likelihood (SL) is a strategy for parameter inference when the likelihood function is analytically or computationally intractable. In SL, the likelihood function of the data is replaced by a multivariate Gaussian density over summary statistics of the data. SL requires the simulation of many replicate datasets at every parameter value considered by a sampling algorithm, such as Markov chain Monte Carlo (MCMC), making the method computationally intensive. We propose two strategies to alleviate this computational burden. First, we introduce an algorithm producing a proposal distribution that is sequentially tuned and made conditional to data, so that it rapidly \textit{guides} the proposed parameters towards high posterior density regions. In our experiments, a small number of iterations of our algorithm is enough to rapidly locate high density regions, which we use to initialize one or several chains that make use of off-the-shelf adaptive MCMC methods. Our "guided" approach can also potentially be used with MCMC samplers for approximate Bayesian computation (ABC). Second, we exploit strategies borrowed from the correlated pseudo-marginal MCMC literature to improve the mixing of the chains in an SL framework. Moreover, our methods enable inference for challenging case studies, when the posterior is multimodal and when the chain is initialised in low posterior probability regions of the parameter space, where standard samplers fail. To illustrate the advantages stemming from our framework we consider five benchmark examples, including estimation of parameters for a cosmological model and a stochastic model with highly non-Gaussian summary statistics.


Introduction
Synthetic likelihood (SL) is a methodology for parameter inference in stochastic models that do not admit a computationally tractable likelihood function. That is, similarly to approximate Bayesian computation (ABC), SL only requires the ability to generate synthetic datasets from a model simulator, together with statistically relevant summary statistics of the data that capture parameter-dependent variation in an adequate manner. The price to pay for this flexibility is that SL can be computationally very intensive, since it is typically embedded into a Markov chain Monte Carlo (MCMC) framework, requiring the simulation of multiple (often hundreds or thousands of) synthetic datasets at each proposed parameter. The goal of our work is twofold: (i) we introduce an algorithm that sequentially produces a proposal sampler that is made conditional to data and rapidly enables the identification of high-posterior-density regions, where MCMC chains can then be initialized using off-the-shelf methods; (ii) we introduce a way to increase the chains' mixing, by adapting methods that have recently been proposed in the literature on correlated particle filters. Hence both strategies aim at reducing the computational cost of performing Bayesian inference via SL. We show that our approaches facilitate sampling when the chains are initialised at parameter values in regions of low posterior probability, a case where SL often struggles; see the case studies in sections 6.2.1 and 6.3, where the Bayesian synthetic likelihood (BSL) of Price et al. [2018] fails when using the adaptive MCMC proposal of Haario et al. [2001]. For the case study in section 6.3, which has strongly non-Gaussian summary statistics, we show that even a version of BSL robustified to non-Gaussian summaries fails to explore the posterior surface when initialized at challenging locations, while our proposal sampler is able to quickly converge towards the high-density region.
Our proposal sampler can be beneficial with multimodal targets, to inform the researcher of the existence of multiple modes using a small number of iterations, see section 6.4. In addition, we inform the reader that for challenging problems where it is difficult to locate appropriate starting parameters, an alternative to our method is Bayesian optimization, which can be efficiently used for kickstarting SL-based posterior sampling [Gutmann and Corander, 2016], and is facilitated by the open-source ELFI software [Lintusaari et al., 2018].
SL is described in detail in Section 2, but here we first review its features with relevant references to the literature. SL was first proposed in Wood [2010] to produce inference for the parameters θ of simulator-based models with an intractable likelihood. SL replaces the analytically intractable data likelihood p(y|θ) for observed data y with the joint density of a set of summary statistics of the data s := T(y). Here T(·) is a function of the data that has to be specified by the analyst and that can be evaluated for the input y, or for simulated data y*. The SL approach is characterized by the assumption that s has a multivariate normal distribution s ∼ N(µ_θ, Σ_θ) with unknown mean µ_θ and covariance matrix Σ_θ. These can be estimated via Monte Carlo simulations of size M to obtain estimators μ̂_{M,θ}, Σ̂_{M,θ}. The resulting multivariate Gaussian likelihood p_M(s|θ) ≡ N(μ̂_{M,θ}, Σ̂_{M,θ}) can then be numerically maximised with respect to θ, to return an approximate maximum likelihood estimator [Wood, 2010]. It can also be plugged into a Metropolis-Hastings algorithm with flat priors [Wood, 2010], so that MCMC is used as a workhorse to sample from a posterior π_M(θ|s) and ultimately return the posterior mode, and hence a maximum likelihood estimator (a purely Bayesian approach is described below). The introduction of data summaries in the inference has been shown to cope well with chaotic models, where the likelihood would otherwise be difficult to optimize and the corresponding posterior surface may be difficult to explore. More generally, SL is a tool for likelihood-free inference, just like the ABC framework (see the reviews Sisson and Fan, 2011, and Karabatsos and Leisen, 2018), where the latter can be seen as a nonparametric methodology, while SL uses a parametric distributional assumption on s. SL has found applications in e.g. ecology [Wood, 2010], epidemiology [Engblom et al., 2020, Dehideniya et al., 2019] and mixed-effects modeling of tumor growth [Picchini and Forman, 2019].
For a recent generalization of the SL family of inference methods using statistical classifiers to directly target estimation of the posterior density, see Thomas et al. [2021] and Kokko et al. [2019].
While ABC is more general than SL, it can sometimes be difficult to tune and it typically suffers from the "curse of dimensionality" as the size of s increases, due to its nonparametric nature. On the other hand, the Gaussianity assumption on the summary statistics is the main limitation of SL. At the same time, due to its parametric nature, SL has been shown to perform satisfactorily on problems where dim(s) is large relative to dim(θ) [Ong et al., 2018]. Price et al. [2018] framed SL within a pseudo-marginal algorithm for Bayesian inference [Andrieu et al., 2009] and named the method Bayesian SL (BSL). They showed that when s is truly Gaussian, BSL produces MCMC samples from π(θ|s) that do not depend on the specific choice of M. However, in practice the inference algorithm does depend on the choice of M, since this value affects the mixing of the chains. Unless the underlying computer model is trivial, producing the M datasets for each θ can be a serious computational bottleneck.
In this work we design a strategy producing a proposal sampler g(·|s) that is conditional to summary statistics of the data, by exploiting the Gaussian assumption for the summary statistics in (B)SL. We call this a guided sampler, as it proposes conditionally to data. Moreover, our guided sampler is sequentially built, and we find that our "sequentially Adapted and guided proposal for SL" (named ASL) is easy to construct and adds essentially no overhead, since it exploits quantities that are anyway computed in SL. We stress the importance of rapid convergence to the bulk of the posterior: while SL may require a large M to get started, once it has approached high posterior probability regions M can be reduced substantially (in section 6.1 we are forced to start with M = 1,000, and after a few iterations we can revert to M = 10 or 50). Later we briefly discuss how the proposal sampler can also be used in an ABC-MCMC algorithm. We emphasize that our algorithm should be used to rapidly identify the high density region of the posterior, and to initialize there other algorithms that produce the actual inference (e.g. some of the several available adaptive MCMC samplers). We discuss this aspect and suggest possibilities afterwards. In section 6.4 we show how ASL can be useful with multimodal targets.
Furthermore, we correlate log-synthetic likelihoods using a "blockwise" strategy, borrowed from relatively recent advances in the pseudo-marginal MCMC literature. This is shown to considerably improve the mixing of the chains generated via SL, while not introducing correlation can lead to unsatisfactory simulations when starting parameter values reside relatively far from the representative ones. Finally, we wish to inform the reader of a further possibility for initializing SL algorithms, besides our sequentially adaptive proposal sampler: Bayesian optimization via the BOLFI method [Gutmann and Corander, 2016] available in ELFI [Lintusaari et al., 2018], an engine for likelihood-free inference.
Our paper is structured as follows: in Section 2 we introduce the synthetic likelihood approach. In Section 3 we construct the adaptive proposal distribution via ASL, and in Section 4 we construct correlated synthetic likelihoods. In Section 5 we discuss using BOLFI and ELFI as an option for SL inference. In Section 6 we discuss four benchmark simulation studies, and a fifth one is in the Supplementary Material. Code can be found at https://github.com/umbertopicchini/ASL.

Synthetic likelihood
We briefly summarize the synthetic likelihood (SL) method as proposed in Wood [2010] and, in a Bayesian context, in Price et al. [2018] (the latter is detailed in the Supplementary Material). The main goal is to produce Bayesian inference for θ, by sampling from (an approximation to) the posterior π(θ|s) ∝ p̃(s|θ)π(θ) using MCMC, where p̃(s|θ) is the density underlying the true (unknown) distribution of s. Wood [2010] proposes a parametric approximation to p̃(s|θ), placing the rather strong assumption that s ∼ N(µ_θ, Σ_θ). The reason for this assumption is that estimators for the unknown mean and covariance of the summaries, µ_θ and Σ_θ respectively, can be obtained straightforwardly via simulation, as described below. As is obvious from the notation used, µ_θ and Σ_θ depend on the unknown finite-dimensional parameter vector θ, and they are estimated by simulating M independent datasets from the assumed data-generating model, conditionally on θ. We denote the synthetic datasets simulated from the assumed model at a given θ* with y*_1, ..., y*_M. These are such that dim(y*_m) = dim(y) (m = 1, ..., M), with y denoting the observed data, and s ≡ T(y). For each dataset it is possible to construct the corresponding (possibly vector-valued) summary s*_m := T(y*_m), with dim(s*_m) = dim(s). These simulated summaries are used to construct the following estimators:

μ̂_{M,θ} = (1/M) ∑_{m=1}^{M} s*_m,    Σ̂_{M,θ} = (1/(M−1)) ∑_{m=1}^{M} (s*_m − μ̂_{M,θ})(s*_m − μ̂_{M,θ})^⊤,    (1)

with ⊤ denoting transposition. By defining p_M(s|θ) ≡ N(μ̂_{M,θ}, Σ̂_{M,θ}), the SL procedure in Wood [2010] samples from the posterior π_M(θ|s) ∝ p_M(s|θ)π(θ), see Algorithm 1. A slight modification of the original approach in Wood [2010] leads to the "Bayesian synthetic likelihood" (BSL) algorithm of Price et al. [2018], which samples from π(θ|s) when s is truly Gaussian, by introducing an unbiased approximation to a Gaussian likelihood. Apart from this, BSL is the same as Algorithm 1. See the Supplementary Material for details about BSL.
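As a concrete illustration, the estimators in (1) and the resulting Gaussian synthetic log-likelihood can be sketched as follows (a minimal sketch; `simulate` and `T` are hypothetical user-supplied callables standing in for the model simulator and the summary map):

```python
import numpy as np

def synthetic_loglik(theta, simulate, T, s_obs, M, rng):
    """Gaussian synthetic log-likelihood: estimate (mu_hat, Sigma_hat)
    from M simulated summary vectors as in (1), then evaluate the
    N(mu_hat, Sigma_hat) log-density at the observed summaries s_obs."""
    S = np.array([T(simulate(theta, rng)) for _ in range(M)])  # M x dim(s)
    mu_hat = S.mean(axis=0)
    Sigma_hat = np.cov(S, rowvar=False)      # divides by M - 1, as in (1)
    d = s_obs - mu_hat
    _, logdet = np.linalg.slogdet(Sigma_hat)
    quad = d @ np.linalg.solve(Sigma_hat, d)
    return -0.5 * (len(s_obs) * np.log(2.0 * np.pi) + logdet + quad)
```

For instance, for a toy model y ∼ N(θ, 1) with summaries (ȳ, s²_y), the synthetic log-likelihood evaluated near the data-generating θ exceeds its value at distant parameter values.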
All our numerical experiments use the BSL formulation of the inference problem. Notice that when M is too small or θ* is implausible, the estimated covariance may misbehave, e.g. it may not be positive-definite: in such cases, we attempt a "modified Cholesky factorization" of Σ̂_{M,θ*}, such as the one in Cheng and Higham [1998] (we used the Matlab implementation in Higham, 2015), or we find a "nearest symmetric positive-definite matrix" [Higham, 1988], using the function by D'Errico [2015]. When the simulator generating the M synthetic datasets is computationally demanding, Algorithm 1 is computer-intensive, as it generally needs to be run for a number of iterations R in the order of thousands. The problem is exacerbated by the possibly poor mixing of the resulting chain. The most obvious way to alleviate the problem is to reduce the variance of the estimated likelihoods by increasing M, but of course this makes the algorithm computationally more intensive. A further problem occurs when the initial θ* lies far out in the tails of the posterior. This may cause numerical problems when the initial Σ̂_{M,θ*} is ill-conditioned, possibly requiring a very large M to get the MCMC started; hence it is desirable to have the chains approach the bulk of the posterior as rapidly as possible.
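The repairs cited above are Matlab routines; in Python, a simple symmetrise-and-clip-eigenvalues fallback (an assumption of this sketch, not the authors' exact repair) can serve the same purpose of recovering a usable covariance estimate:

```python
import numpy as np

def repair_covariance(Sigma, min_eig=1e-8):
    """Symmetrise Sigma and clip its eigenvalues from below, returning a
    symmetric positive-definite matrix (a simple stand-in for the
    modified-Cholesky / nearest-SPD repairs cited in the text)."""
    Sigma = 0.5 * (Sigma + Sigma.T)           # enforce symmetry
    w, V = np.linalg.eigh(Sigma)
    return (V * np.clip(w, min_eig, None)) @ V.T

def safe_cholesky(Sigma):
    """Cholesky factor of Sigma, repairing Sigma first if it is not SPD."""
    try:
        return np.linalg.cholesky(Sigma)
    except np.linalg.LinAlgError:
        return np.linalg.cholesky(repair_covariance(Sigma))
```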
In the following we propose two strategies aimed at keeping the mixing rate of an MCMC, produced either by SL or BSL, at acceptable levels, and also at easing convergence of the chains to regions of high posterior density. The first strategy designs a specific proposal distribution g(·) for use in MCMC via synthetic likelihood: this is a "sequentially Adapted and guided proposal for Synthetic Likelihoods" (ASL for short), described in section 3. The second strategy reduces the variability in the Metropolis-Hastings ratio α by correlating successive pairs of synthetic likelihoods: this results in the "correlated synthetic likelihoods" (CSL) described in section 4.

Guided and sequentially tuned proposals for synthetic likelihoods
In section 3.1 we illustrate the main ideas of our ASL method. In section 3.2 we specialize ASL to obtain a sequence of proposal distributions {g_t}_{t=1}^T, as detailed in Algorithm 2. What we now introduce in section 3.1 will also initialize the ASL method, i.e. provide an initial g_0.

Main idea and initialization
Suppose θ*_n is a posterior draw generated by some SL procedure (i.e. the standard method from Wood, 2010, or the BSL one from Price et al., 2018) at iteration n, e.g. θ*_n ∼ π_M(θ|s). Denote with {s*_{1,n}, ..., s*_{M,n}} a set of M summaries simulated independently from the computer model, conditionally on the same θ*_n, and define s̄*_n = ∑_{m=1}^{M} s*_{m,n}/M. By the central limit theorem, for M sufficiently large s̄*_n has an approximately Gaussian distribution. Suppose we have at our disposal N pairs {θ*_n, s̄*_n}_{n=1}^N. We set d_θ = dim(θ) and d_s = dim(s); then (θ*_n, s̄*_n) is a vector of length d = d_θ + d_s. Assume for a moment that the joint vector (θ*_n, s̄*_n) is d-dimensional Gaussian-distributed, with (θ*_n, s̄*_n) ∼ N_d(m, S). We stress that this assumption is made merely to construct a proposal sampler, and does not extend to the actual distribution of (θ, s). We set a d-dimensional mean vector m ≡ (m_θ, m_s) and partition the d × d covariance matrix as

S = [ S_θ  S_θs ; S_sθ  S_s ].

We estimate m and S using the N available draws. That is, defining x_n := (θ*_n, s̄*_n), we have, same as in (1),

m̂ = (1/N) ∑_{n=1}^{N} x_n,    Ŝ = (1/(N−1)) ∑_{n=1}^{N} (x_n − m̂)(x_n − m̂)^⊤.    (2)

Once m̂ and Ŝ are obtained, it is possible to extract the corresponding entries (m̂_θ, m̂_s) and Ŝ_θ, Ŝ_s, Ŝ_sθ, Ŝ_θs. We can now use well-known formulae for conditionals of a multivariate Gaussian distribution to obtain a proposal distribution (with a slight abuse of notation) g(θ|s) ≡ N(m̂_{θ|s}, Ŝ_{θ|s}), with

m̂_{θ|s} = m̂_θ + Ŝ_θs Ŝ_s^{−1} (s − m̂_s),    (3)
Ŝ_{θ|s} = Ŝ_θ − Ŝ_θs Ŝ_s^{−1} Ŝ_sθ.    (4)

Hence a new proposal θ* can be generated as θ* ∼ g(θ|s); it is thus "guided" by the summaries of the data s, and gets updated as new posterior draws become available, as further described below. Therefore this "guided proposal" g(θ|s) can be employed in place of g(θ'|θ) in Algorithm 1, even though we only use this proposal for a limited number of iterations, as clarified below.
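Steps (2)-(4) amount to standard Gaussian conditioning on the stacked pairs, and can be sketched as follows (a minimal sketch, assuming the pairs are stored as arrays):

```python
import numpy as np

def guided_gaussian_proposal(thetas, sbars, s_obs):
    """Conditional-Gaussian proposal N(m_{theta|s}, S_{theta|s}) built
    from N stored pairs (theta_n, sbar_n), following equations (2)-(4).

    thetas: (N, d_theta) accepted draws; sbars: (N, d_s) averaged summaries.
    """
    d_theta = thetas.shape[1]
    X = np.hstack([thetas, sbars])            # x_n = (theta_n, sbar_n)
    m = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    m_th, m_s = m[:d_theta], m[d_theta:]
    S_th = S[:d_theta, :d_theta]
    S_ths = S[:d_theta, d_theta:]
    S_s = S[d_theta:, d_theta:]
    A = np.linalg.solve(S_s, s_obs - m_s)     # S_s^{-1}(s - m_s)
    mean = m_th + S_ths @ A                   # equation (3)
    cov = S_th - S_ths @ np.linalg.solve(S_s, S_ths.T)  # equation (4)
    return mean, cov
```

When the stored summaries are strongly informative about θ, the conditional mean is pulled towards parameter values whose summaries resemble s_obs, which is precisely the "guiding" effect.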
Clearly the proposal function g(θ|s) is independent of the last accepted value of θ, hence it is an "independence sampler" [Robert and Casella, 2004], except that its mean and covariance matrix are not kept constant.
The approach outlined so far is essentially step 3 in Algorithm 2 and, together with the sequential tuning in section 3.2, allows for rapid convergence of the chain towards the high posterior density region. However, this approach does not promote tail exploration. This is not really an issue, as we can let an MCMC incorporating our guided proposal sampler run for a small number of iterations (say 50 iterations, even if we use more for illustrative reasons), during which the chain displays a high acceptance rate; this is useful to collect many accepted draws that we can use to initialize other standard samplers enjoying proven ergodic properties, as detailed in section 3.2. That section also illustrates a sampler based on the multivariate Student's distribution.

Sequential approach
The construction outlined above is only the first step of our guided adaptive sampler for synthetic likelihoods (ASL) methodology, and we now detail its actual implementation in a sequential way. We define a sequence of T + 1 "rounds" over which T + 1 kernels {g_t}_{t=0}^T are sequentially constructed. In the first round (t = 0), we construct g_0 using the output of K MCMC iterations, obtained using e.g. a Gaussian random walk. We may consider these K as burn-in iterations. Once (2)-(3)-(4) are computed using the output {θ*_k, s̄*_k}_{k=1}^K of the burn-in iterations, we obtain the first adaptive distribution, denoted g_0(θ|s), as illustrated in section 3.1. We store the draws as D := {θ*_k, s̄*_k}_{k=1}^K and then employ g_0 as a proposal density in a further N MCMC iterations, after which we perform the following steps: (i) we collect the newly obtained batch of N pairs {θ*_n, s̄*_n}_{n=1}^N (where, again, θ*_n ∼ π_M(θ|s) and s̄*_n is the sample mean of the already accepted simulated summaries generated conditionally on θ*_n) and add it to the previously obtained ones as D := D ∪ {θ*_n, s̄*_n}_{n=1}^N. Then (ii), similarly to (2), we compute the sample mean m̂^{0:1} = (m̂^{0:1}_θ, m̂^{0:1}_s) and the corresponding covariance Ŝ^{0:1}, except that here m̂^{0:1} and Ŝ^{0:1} use the K + N pairs in D. (iii) We update (3)-(4) to m̂^{0:1}_{θ|s} and Ŝ^{0:1}_{θ|s}, and obtain g_1(θ|s). (iv) We use g_1(θ|s) for a further N MCMC moves, stack the new draws into D := D ∪ {θ*_n, s̄*_n}_{n=1}^N, and, using the K + 2N pairs in D, proceed as before to obtain g_2, and so on until the last batch of N iterations generated using g_T is obtained.
From the procedure we have just illustrated, the sequence of Gaussian kernels has g_t = g_t(θ|s) ≡ N(m̂^{0:t}_{θ|s}, Ŝ^{0:t}_{θ|s}), with m̂^{0:t}_{θ|s} and Ŝ^{0:t}_{θ|s} the conditional mean and covariance matrix given by

m̂^{0:t}_{θ|s} = m̂^{0:t}_θ + Ŝ^{0:t}_θs (Ŝ^{0:t}_s)^{−1} (s − m̂^{0:t}_s),    (5)
Ŝ^{0:t}_{θ|s} = Ŝ^{0:t}_θ − Ŝ^{0:t}_θs (Ŝ^{0:t}_s)^{−1} Ŝ^{0:t}_sθ.    (6)

The proposal function g_t uses all available present and past information, as these quantities are obtained using the most recent version of D, which contains information from the previous t − 1 rounds in addition to the latest batch of N draws. Compared to a standard Metropolis random walk, the additional computational effort to implement our method is negligible, as it uses trivial matrix algebra applied to quantities obtained as a by-product of the SL procedure, namely the several pairs {θ*_n, s̄*_n}. Notice that (5)-(6) reduce to m̂^{0:t}_{θ|s} ≡ m̂^{0:t}_θ and Ŝ^{0:t}_{θ|s} ≡ Ŝ^{0:t}_θ respectively as soon as m̂^{0:t}_s = s. The latter condition means that the chain is close to the bulk of the posterior and accepted parameters simulate summaries distributed around the observed s. Therefore, when the chain is far from its target, the additional terms in (5)-(6) can help guide the proposals thanks to an explicit conditioning on data.
An alternative to Gaussian proposals are multivariate Student's proposals. We build on a result found in Ding [2016], allowing us to write θ*_n ∼ g_t(θ|s), where g_t(θ|s) is now a multivariate Student's distribution with ν degrees of freedom; in this case θ*_n can be simulated using

θ*_n = m̂^{0:t}_{θ|s} + (Ŝ^{0:t}_{θ|s})^{1/2} Z_n √((ν + δ_n)/(ν + d_s)) / √(χ²_{ν+d_s}/(ν + d_s)),    (7)

with χ²_{ν+d_s} an independent draw from a Chi-squared distribution with ν + d_s degrees of freedom, δ_n = (s − m̂^{0:t}_s)^⊤(Ŝ^{0:t}_s)^{−1}(s − m̂^{0:t}_s), and Z_n a d_θ-dimensional standard multivariate Gaussian vector that we simulate at each iteration n, independent of χ²_{ν+d_s}/(ν + d_s). For simplicity, in the following we do not distinguish between the Gaussian and the Student's proposals, and the user can choose either of the two, as they are obtained from the same building blocks (2)-(6). As customary in Metropolis-Hastings, when a proposal is rejected at a generic iteration n, the last accepted pair should be stored as (θ_n, s̄_n). However, should the rejection rate be high (notice we have never incurred such a situation when sampling via ASL), the covariance Ŝ^{0:t}_θs would be computed on many identical repetitions of the same (θ, s̄)-vectors, causing numerical instabilities. Therefore, any time a rejection takes place, we can perform the following when storing the output of the n-th MCMC iteration: if proposal θ# ∼ g(θ|s) has been rejected at iteration n, resample independently M times with replacement from the last accepted set of summaries (s*_1, ..., s*_M) (produced from the last accepted θ*), to obtain the bootstrapped set (s̃*_1, ..., s̃*_M). We use the latter set to compute s̃* = ∑_{m=1}^{M} s̃*_m/M. Hence, at iteration n (and only when the proposal has been rejected), we store the pair (θ*, s̃*) in D rather than the last accepted (θ*, s̄*). This way, the averaged summaries stored in the set D still consist of averages of accepted summaries (as usual), with the benefit that when the acceptance rate is low (which anyway never occurred to us with ASL) Ŝ^{0:t}_θs is computed on a set D that has more varied information, thanks to resampling.
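A draw from the Student's proposal in (7) can be sketched as follows (a minimal sketch; the conditional moments and the Mahalanobis term δ_n are assumed precomputed from (5)-(6)):

```python
import numpy as np

def draw_student_proposal(mean_cond, cov_cond, delta_n, d_s, nu, rng):
    """One draw from the guided multivariate Student's proposal in (7):
    a Cholesky-scaled Gaussian draw divided by an independent chi-squared
    draw with nu + d_s degrees of freedom."""
    C = np.linalg.cholesky(cov_cond)             # C C' = S_{theta|s}
    Z = rng.standard_normal(mean_cond.shape[0])  # d_theta-dim standard normal
    chi2 = rng.chisquare(nu + d_s)
    scale = np.sqrt((nu + delta_n) / (nu + d_s)) / np.sqrt(chi2 / (nu + d_s))
    return mean_cond + scale * (C @ Z)
```

For large ν the chi-squared factor concentrates around 1 and the draw approaches a Gaussian proposal; small ν yields heavier tails, which can help when the current proposal moments are poor.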
This consideration is expressed in step 5 of Algorithm 2. Algorithm 2 constructs the sequence {g_t(θ|s)}_{t=1}^T for a SL procedure, and this constitutes our ASL approach.

Algorithm 2 ASL: synthetic likelihoods with a sequentially adapted and guided proposal
1: Input: K pairs {θ*_k, s̄*_k}_{k=1}^K from burn-in. Positive integers N and T. Initialize D := {θ*_k, s̄*_k}_{k=1}^K.
2: Output: θ_1, ..., θ_T. Then θ_T should be used as starting point for another adaptive MCMC algorithm.
3: Construct the proposal density g_0 using {θ*_k, s̄*_k}_{k=1}^K and (2)-(3)-(4) (and optionally propose from (7)). Set θ_0 := θ*_K.
4: for t = 1 : T do
5:   Starting at θ_{t−1}, run N MCMC iterations (SL or BSL) using g_{t−1}, producing {θ*_n, s̄*_n}_{n=1}^N. If the current proposal has been rejected at iteration n, the s̄*_n may instead be computed as s̃*_n (see main text).
6:   Set D := D ∪ {θ*_n, s̄*_n}_{n=1}^N and construct g_t using D and (5)-(6).
7:   Set θ_t := θ*_N.
8: end for
9: Return θ_1, ..., θ_T to be provided as input to another adaptive MCMC algorithm for BSL or CSL.

An advantage of ASL is that it is self-adapting. A disadvantage is that, since the adaptation results in an independence sampler, it does not explore a neighbourhood of the last accepted draw, and the N newly accepted draws obtained at stage t might not necessarily produce a rapid change in the mean and covariance of the proposal function g_{t+1} (should a rapid change actually be required for optimal exploration of the parameter space). This is why in our applications we always use N = 1. That is, the proposal distribution is updated at each iteration, immediately incorporating the information provided by the last accepted draw.
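The round structure of Algorithm 2 can be sketched as follows (a sketch under the assumption that `mcmc_step` performs one SL/BSL Metropolis-Hastings iteration with proposal `g`, and `build_proposal` implements the moment computations of section 3.1; both are hypothetical user-supplied callables):

```python
def asl_rounds(theta0, D, mcmc_step, build_proposal, N=1, T=100):
    """Sequentially adapted and guided rounds: after each batch of N
    MCMC iterations, rebuild the guided proposal from all pairs
    accumulated so far in D."""
    g = build_proposal(D)              # g_0 from the K burn-in pairs
    theta, path = theta0, []
    for t in range(1, T + 1):
        for _ in range(N):             # N = 1 in all our experiments
            theta, sbar = mcmc_step(theta, g)
            D.append((theta, sbar))    # accumulate (theta, sbar) pairs
        g = build_proposal(D)          # g_t uses all K + t*N pairs
        path.append(theta)
    return path                        # to initialise an adaptive MCMC run
```

The returned path plays the role of θ_1, ..., θ_T in Algorithm 2: its last element seeds the follow-up sampler and its sample covariance initializes the adaptation.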
As is clear from the output of Algorithm 2, we recommend using T iterations of ASL to return (i) θ_T, which is then used as the starting parameter value for a run of BSL (or CSL, see section 4) together with a standard MCMC proposal sampler; and (ii) the sequence θ_1, ..., θ_T (notice this excludes the initial burn-in of K iterations), of which we compute the sample covariance matrix; the latter is used to initiate the adaptive MCMC algorithm of Haario et al. [2001], which has proven ergodic properties (see the Supplementary Material for details on how this is performed). The above means that the inference results we report are based on draws using Haario et al. [2001] (thanks to the useful initialization via ASL). However, after the T ASL iterations, adaptive MCMC algorithms with proven ergodic properties other than Haario et al. [2001] could be used: possibilities are e.g. Andrieu and Thoms [2008] or Vihola [2012]. Moreover, an interesting use of ASL arises with multimodal targets: if several chains are run in parallel and initialised at different parameter values, the ability of ASL to rapidly "jump" to high density regions can point the researcher to the existence of multiple modes within a few iterations of ASL (this is illustrated in section 6.4).
In our experiments we use a relatively small number of burn-in iterations K (say K = 200 or 300), and when ASL is started we immediately observe a large "jump" towards the posterior mode. Importantly, rapid convergence via ASL also helps reduce the computational effort by re-tuning M: in fact, while a large value of M can be necessary when setting θ_0 in tail regions of the posterior, once the chain has converged towards the bulk of the posterior it is possible to reduce M substantially. See the g-and-k example in section 6.1, where it is necessary to start with M = 1,000, and after using ASL for a few iterations we can revert to M = 10 or 50.
Our ASL strategy is inspired by the sequential neural likelihood approach found in Papamakarios et al. [2019]. There, the N MCMC draws obtained in each of T stages sequentially approximate the likelihood function for models having an intractable likelihood, the approximation at stage t being obtained by training a neural network (NN) on the MCMC output obtained at stage t − 1. Their approach is more general (and it is aimed at approximating the likelihood, not the MCMC proposal), but has the disadvantage of requiring the construction of an NN, whose hyperparameters must then be tuned at every stage t, which of course requires domain knowledge and computational resources. Our approach is framed specifically for inference via synthetic likelihoods, which is a limitation per se, but it is completely self-tuning, with the possible exception of the burn-in iterations, where an initial covariance matrix must be provided by the user, though this is a minor issue when the number of parameters is limited.
Notice that a possible interesting application of our guided sampler could be envisioned with ABC-MCMC algorithms [Marjoram et al., 2003]. Even though ABC-MCMC is typically run by simulating a single vector of summary statistics at a given θ (though it is also possible to consider pseudo-marginal versions, as in Picchini and Everitt, 2019), nothing prevents running ASL for a few iterations in an ABC-MCMC context, by simulating multiple summaries at each θ as in SL, and then reverting to simulating a single summary vector once the chain has reached the bulk of the ABC posterior.

Correlated synthetic likelihood
Following the success of the pseudo-marginal method (PM), which returns exact Bayesian inference whenever a non-negative and unbiased estimate of an intractable likelihood is available (Beaumont, 2003, Andrieu et al., 2009, Andrieu et al., 2010), there has been much research aimed at increasing the efficiency of particle filters (or sequential Monte Carlo) for Bayesian inference in state-space models; see Schön et al. [2018] for an approachable review. A recent important addition to PM methodology, improving the acceptance rate in Metropolis-Hastings algorithms, considers inducing some correlation between the likelihoods appearing in the Metropolis-Hastings ratio. The idea underlying correlated pseudo-marginal methods (CPM), as initially proposed in Dahlin et al. [2015] and Deligiannidis et al. [2018], is that having correlated likelihoods reduces the stochastic variability in the acceptance ratio. This reduces the stickiness of the MCMC chain, which is typically due to excessively varying likelihood approximations when these are obtained using a "too small" number of Monte Carlo draws (named "particles"). In fact, while the variability of these estimates can be mitigated by increasing the number of particles, this of course has negative consequences on the computational budget. Instead, CPM strategies allow for a considerably smaller number of particles when trying to alleviate the stickiness problem; see for example Golightly et al. [2019] for applications to stochastic kinetic models, and Wiqvist et al. [2021] and Botha et al. [2021] for stochastic differential equation mixed-effects models. Interestingly, implementing CPM approaches is trivial. Deligiannidis et al. [2018] and Dahlin et al. [2015] correlate the estimated likelihoods at the proposed and current values of the model parameters by correlating the underlying standard normal random numbers used to construct the likelihood estimates, via a Crank-Nicolson proposal.
We found particular benefit with the "blocked" PM approach (BPM) of  (see also Choppala et al., 2016, for inference in state-space models), which we now describe in full generality, i.e. independently of our synthetic likelihoods approach, which is considered later.
Denote with U the vector of all "auxiliary variables", i.e. pseudo-random numbers (typically standard Gaussian or uniform) that are necessary to produce a non-negative unbiased likelihood approximation p̂(y|θ, U) at a given parameter θ for data y. Notice that U should contain the pseudo-random numbers used to "forward simulate" from the model, but it can also include other auxiliary variables, for example the pseudo-random numbers generated when performing the resampling step in sequential Monte Carlo. In the BPM approach the set U is divided into G blocks U = (U^(1), ..., U^(G)), and one of these blocks is updated jointly with θ in each MCMC iteration, as described below. Let p̂(y|θ, U^(i)) be the estimated unbiased likelihood obtained using the i-th block of random variates U^(i), i = 1, ..., G. Define the joint posterior of θ and U as

π(θ, U|y) ∝ p̂(y|θ, U) π(θ) p_U(U),    (8)

where θ and U are a priori independent and

p̂(y|θ, U) = (1/G) ∑_{i=1}^{G} p̂(y|θ, U^(i))    (9)

is the average of the G unbiased likelihood estimates, and hence also unbiased. We then update the parameters jointly with a randomly selected block U^(K) in each MCMC iteration, with Pr(K = k) = 1/G for any k = 1, ..., G. "Updating a randomly selected block" means that only for the picked block U^(k) are new pseudo-random values produced (and hence "refreshed"), while for the other blocks the variates are kept fixed at the previously accepted values. Using this scheme, the acceptance probability for a joint move from the current set (θ_c, U_c) to a proposed set (θ_p, U_p), generated using some proposal function g(θ_p|θ_c) together with the block refreshment just described, is

α = min{1, [p̂(y|θ_p, U_p) π(θ_p) g(θ_c|θ_p)] / [p̂(y|θ_c, U_c) π(θ_c) g(θ_p|θ_c)]}.    (10)

Hence, in case of proposal acceptance we update the joint vector (θ_c, U_c) := (θ_p, U_p) and move to the next iteration. The resulting chain targets (8). Notice that the random variates used to compute the likelihood at the numerator of (10) are the same as for the likelihood at the denominator except for the k-th block; hence G − 1 blocks are shared between the numerator and denominator.
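A single BPM iteration can be sketched as follows (a sketch assuming a symmetric proposal; `loglik_block`, `propose` and `log_prior` are hypothetical user-supplied callables, and each block of variates is represented by a per-block RNG seed):

```python
import numpy as np

def bpm_step(theta_c, seeds_c, ll_c, loglik_block, propose, log_prior, rng):
    """One blocked pseudo-marginal step: a randomly chosen block of
    auxiliary variates (stored as per-block RNG seeds) is refreshed
    jointly with theta; the other G - 1 blocks are reused."""
    G = len(seeds_c)
    k = rng.integers(G)                    # Pr(K = k) = 1/G
    seeds_p = seeds_c.copy()
    seeds_p[k] = int(rng.integers(2**31))  # refresh block k only
    theta_p = propose(theta_c, rng)
    lls = np.array([loglik_block(theta_p, s) for s in seeds_p])
    # log of the average of the G unbiased block estimates, eq. (9)
    ll_p = lls.max() + np.log(np.mean(np.exp(lls - lls.max())))
    log_alpha = ll_p + log_prior(theta_p) - ll_c - log_prior(theta_c)
    if np.log(rng.uniform()) < log_alpha:  # acceptance ratio (10)
        return theta_p, seeds_p, ll_p
    return theta_c, seeds_c, ll_c
```

Storing seeds rather than the variates themselves is one convenient way to keep G − 1 blocks identical between the numerator and denominator of (10) without holding all pseudo-random numbers in memory.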
Perturbing only a small fraction of the pseudo-random numbers induces beneficial correlation between subsequent pairs of likelihood estimates, as in this case the variance of α is smaller than when all entries of U are updated at each iteration. Also, we considered g(U_p|U_c) ≡ p_U(U_p^(k)), hence the simplified expression (10). The correlation between log p̂(y|θ_p, U_p) and log p̂(y|θ_c, U_c) is approximately ρ = 1 − 1/G, so the larger the number of blocks G that can be formed, the higher the correlation (at least theoretically). Also, note that the G approximations p̂(y|θ, U^(i)) can be run in parallel on multiple processors when these likelihoods are approximated using particle filters.
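The ρ = 1 − 1/G approximation is easy to check numerically: two averages of G i.i.d. terms sharing G − 1 of them have correlation (G − 1)/G. A toy check, using standard-normal placeholders for the per-block contributions:

```python
import numpy as np

rng = np.random.default_rng(1)
G, reps = 10, 20000
x = rng.standard_normal((reps, G))   # G "block" contributions per replicate
y = x.copy()
y[:, 0] = rng.standard_normal(reps)  # refresh a single block
a, b = x.mean(axis=1), y.mean(axis=1)
rho_hat = np.corrcoef(a, b)[0, 1]    # close to 1 - 1/G = 0.9
```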
We now consider synthetic likelihoods. Denote with U_j the vector of auxiliary variables employed when producing the j-th model simulation (j = 1, ..., M), and with (U_1, ..., U_M) the vector stacking the variates generated across all M model simulations. We distribute those variates across G blocks: assume for simplicity that M is a multiple of G, so that for example the i-th block U^(i) collects the pseudo-random numbers used in a subset of M/G of the M model simulations. Same as before, in each MCMC iteration we "refresh" the variates from a randomly sampled block, while the other variates are kept fixed to the previously accepted values. In our synthetic likelihood approach we do not make use of (9) and instead take p(s|θ, U) without decomposing it into a sum of G contributions. We do not compute the p(s|θ, U^(i)) separately, since we found that for each p(s|θ, U^(i)) to behave in a numerically stable way, a not-too-small number of simulations M^(i) should be devoted to each sub-likelihood term, or otherwise the corresponding estimated covariance may misbehave (e.g., it may fail to be positive definite). Therefore, in practice, we just obtain the joint p(s|θ, U), and (10) becomes
\[
\alpha = \min\left\{1, \frac{p(s \mid \theta_p, U_p)\, p(\theta_p)\, g(\theta_c \mid \theta_p)}{p(s \mid \theta_c, U_c)\, p(\theta_c)\, g(\theta_p \mid \theta_c)}\right\}, \tag{11}
\]
which we therefore call the "correlated synthetic likelihood" (CSL) approach. From the analytic point of view our correlated likelihood p(s|θ, U) is the same unbiased approximation given in Price et al. [2018] (also in Supplementary Material), hence CSL uses the BSL approach; the only difference with BSL is that the numerator and denominator of (11) have G − 1 blocks in common, while in BSL all pseudo-random numbers are refreshed at each iteration for each new likelihood.
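The variate-recycling at the heart of CSL can be sketched as follows, using a scalar summary and a deliberately trivial toy simulator s(θ, u) = θ + u (an assumption for illustration only, not the paper's model); with M = 200 simulations split into G = 100 blocks, successive synthetic log-likelihood values at the same θ share 198 of the 200 variates and are therefore strongly correlated:

```python
import math

def synthetic_loglik(s_obs, theta, U):
    """Gaussian synthetic log-likelihood with a scalar summary (sketch).

    Each variate u in U drives one model simulation; the "simulator"
    s(theta, u) = theta + u is a toy assumption for illustration only.
    """
    sims = [theta + u for u in U]
    M = len(sims)
    mu = sum(sims) / M
    var = sum((x - mu) ** 2 for x in sims) / (M - 1)
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (s_obs - mu) ** 2 / var

def refresh_one_block(U, G, rng):
    """Redraw only one randomly chosen block of variates, keeping the other
    G-1 blocks fixed (assumes G divides len(U))."""
    b = len(U) // G
    k = rng.randrange(G)
    U_new = U[:]
    U_new[k * b:(k + 1) * b] = [rng.gauss(0.0, 1.0) for _ in range(b)]
    return U_new
```

Empirically, the correlation between log-SL values sharing G − 1 blocks is close to the theoretical ρ = 1 − 1/G.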
In our experiments we show that using the acceptance criterion (11) in Algorithm 1 (regardless of the use of our ASL proposal kernel) helps convergence and also improves chain mixing. Moreover, it comes with no computational overhead compared to not using correlated synthetic likelihoods. The only potential issues are some careful extra coding and the need to store (U^(1), ..., U^(G)) in memory, which could be high-dimensional for complex model simulators.

Algorithmic initialization using BOLFI and ELFI
This section does not contain novel material, but it is useful to inform modellers using SL approaches of alternative strategies to initialize SL algorithms. We consider the case where obtaining a reasonable starting value θ_1 for θ by trial-and-error is not feasible, due to the computational cost of evaluating the SL density at many candidates for θ_1. At a minimum, we need to find a value θ_1 such that the corresponding SL density (the biased p_M or the unbiased one in the sense of Price et al., 2018) has a positive definite covariance matrix Σ̂. This is not ensured when the summaries are simulated from highly non-representative values of θ, which would result in an MCMC algorithm that halts. The issue is critical, as testing many values θ_1 can be prohibitively expensive, both because the dimension of θ can be large and because the model itself might be slow to simulate from.
An approach developed in Gutmann and Corander [2016] uses Bayesian optimization when the likelihood function is intractable but realizations from a stochastic model simulator are available, which is exactly the framework that applies to ABC and SL. The resulting method, named BOLFI (Bayesian optimization for likelihood-free inference), locates a θ that either minimizes the expected value of log ∆, where ∆ is some discrepancy between simulated and observed summary statistics, say ∆ = ‖s* − s‖ for some norm ‖·‖, or alternatively can be used to minimize the negative log-SL expression. For example, ‖·‖ could induce the Euclidean distance ((s* − s)ᵀ(s* − s))^{1/2}, or a Mahalanobis distance ((s* − s)ᵀA(s* − s))^{1/2} for some square matrix A weighting the individual contributions of the entries in s* and s. The appeal of BOLFI is that (i) it is able to rapidly focus the exploration on those regions of the parameter space where either ∆ is smaller or the SL is larger, and (ii) it is implemented in ELFI [Lintusaari et al., 2018], the Python-based open-source engine for likelihood-free inference.
Hence, when dealing with expensive simulators, BOLFI is ideally positioned to minimize the number of attempts needed to obtain a reasonable value θ_1 with which to initialize the synthetic likelihoods approach. BOLFI replaces the expensive realizations from the model simulator with a "surrogate simulator" defined by a Gaussian process (GP, Rasmussen and Williams, 2006). Simulations from the actual (expensive) simulator are used to form a collection of pairs such as (θ, log ∆); the GP is trained on the generated pairs, and the actual optimization in BOLFI only uses the computationally cheap GP surrogate. This means that the optimum returned by BOLFI does not necessarily reflect the best θ generating the observed s. It is possible to use the BOLFI optimum to initialize some other procedure within ELFI, such as Hamiltonian Monte Carlo via the NUTS algorithm of Hoffman and Gelman [2014]; however, the ELFI version of NUTS uses, again, the GP surrogate of the likelihood function. Once the BOLFI optimum is obtained, it can be used to initialise (B)SL MCMC, which still uses simulations from the true model. These may be expensive, but the chain is at least initialised at a θ that should be "good enough" to avoid a long and expensive burnin. Illustrations of BOLFI are in sections 6.1.2 and 6.2.1. A more recent contribution, exploiting GPs to predict a log-SL, is in Järvenpää et al. [2020].
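The surrogate idea behind BOLFI (train a cheap regressor on (θ, ∆) pairs from the expensive simulator, then optimize the regressor instead of the simulator) can be caricatured as follows; the k-nearest-neighbour predictor here is a crude illustrative stand-in for the GP that BOLFI/ELFI actually fit, and all names are our own:

```python
def surrogate_argmin(pairs, candidates, k=5):
    """Optimize a cheap surrogate instead of the expensive simulator (sketch).

    pairs is a list of (theta, discrepancy) tuples obtained from the
    expensive simulator; the k-nearest-neighbour average below is a crude
    stand-in for the GP surrogate that BOLFI actually fits.
    """
    def predict(theta):
        closest = sorted(pairs, key=lambda p: abs(p[0] - theta))[:k]
        return sum(d for _, d in closest) / len(closest)
    # minimize the surrogate prediction over a cheap candidate grid
    return min(candidates, key=predict)
```

Once a good region is located this way, the expensive simulator is only needed again inside the subsequent MCMC, started from the returned point.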

Simulation studies
Here follow four simulation studies. A fifth one, using a "perturbed" α-stable model built to pose a challenge to CSL, is in Supplementary Material. In all the considered examples we use N = 1, i.e. the ASL proposal kernel is updated at each iteration. Within ASL we always use a Gaussian proposal based on (5)-(6), and never the multivariate Student's one.

g-and-k distribution
The g-and-k distribution is a standard toy model for case studies having intractable likelihoods (e.g. Allingham et al., 2009, Fearnhead and Prangle, 2012), in that its simulation is straightforward, but it does not have a closed-form probability density function (pdf); therefore the likelihood is analytically intractable. However, it has been noted in Rayner and MacGillivray [2002] that one can still numerically compute the pdf, by 1) numerically inverting the quantile function to get the cumulative distribution function (cdf), and 2) numerically differentiating the cdf, using finite differences for instance. Therefore "exact" Bayesian inference (exact up to numerical discretization) is possible. This approach is implemented in the gk R package.
The g-and-k distribution is a flexibly shaped distribution that is used to model non-standard data through a small number of parameters. It is defined by its quantile function; essentially, it is possible to generate a draw Q from the distribution using the following scheme:
\[
Q = A + B\left(1 + c\,\frac{1 - \exp(-g u)}{1 + \exp(-g u)}\right)\left(1 + u^2\right)^{k} u,
\]
where u ∼ N(0, 1), A and B are location and scale parameters, and g and k are related to skewness and kurtosis. Parameter restrictions are B > 0 and k > −0.5. We assume θ = (A, B, g, k) as parameter of interest, noting that it is customary to keep c fixed to c = 0.8 (Drovandi and Pettitt, 2011, Rayner and MacGillivray, 2002). We use the summaries s(w) = (s_{A,w}, s_{B,w}, s_{g,w}, s_{k,w}) suggested in Drovandi and Pettitt [2011], where w denotes either the observed data y or the simulated data y*:
\[
s_{A,w} = P_{50,w}, \qquad s_{B,w} = P_{75,w} - P_{25,w}, \qquad s_{g,w} = \frac{P_{75,w} + P_{25,w} - 2 P_{50,w}}{s_{B,w}}, \qquad s_{k,w} = \frac{P_{87.5,w} - P_{62.5,w} + P_{37.5,w} - P_{12.5,w}}{s_{B,w}},
\]
where P_{q,w} is the qth empirical percentile of w. That is, s_{A,w} and s_{B,w} are the median and the inter-quartile range of w respectively. We use the simulation strategy outlined above to generate data y, consisting of 1,000 independent samples from the g-and-k distribution using parameters θ = (A, B, g, k) = (3, 1, 2, 0.5). We place wide uniform priors on the parameters. We run five inference attempts independently, always starting at θ_0 = (7.389, 7.389, 2.718, 1.221) and using the same data. For all experiments, M = 1,000 model simulations are produced at each proposed parameter; we found this value of M to be necessary given the starting parameters used, or we would not collect enough parameter moves. However, if we instead initialize the simulations close to the true values, then a considerably smaller M can be employed (this is discussed later), which shows that our contribution on accelerating convergence to the high posterior probability region is important. We start by running K = 200 burnin iterations, during which we advance the chain by proposing parameters using a Gaussian random walk acting on log-scale, i.e.
on log θ, with a constant diagonal covariance matrix having standard deviations (on log-scale) given by [0.025, 0.025, 0.025, 0.025] for (log A, log B, log g, log k) respectively. Given the short burnin, in the first K iterations we implement a Markov-chain-within-Metropolis strategy (MCWM, Andrieu et al., 2009) to increase the mixing of the algorithm before our sequentially guided ASL strategy starts (shortly, MCWM differs from a standard Metropolis-Hastings algorithm in that the stochastic likelihood approximation in the denominator of the Metropolis-Hastings ratio is re-evaluated at the last accepted parameter value, instead of reusing the previously accepted synthetic likelihood). Notice the use of MCWM is strictly limited to the burnin iterations, since MCWM doubles the execution time and its theoretical properties are not well understood. At iteration K + 1, our ASL Algorithm 2 starts and runs for 300 iterations (a much smaller number of iterations than 300 can be used, say 50; we chose 300 for pictorial reasons, as the effect of ASL is more easily noticed in figures). Afterwards BSL inherits the last draw accepted by ASL and reverts to the adaptive Metropolis random walk proposal of Haario et al. [2001], hereafter denoted "Haario", which is used for a further 2,800 iterations and is adapted as described in the Supplementary Material. Therefore the total length of the chain is 3,300 (K = 200 iterations, then 300 ASL iterations, then 2,800 further iterations). The covariance matrix in the adaptive proposal of "Haario" is updated every 30 iterations. The five independent inference attempts are in Figure 1a. We notice that during the burnin the chains are still quite far from the ground truth; however, as soon as ASL kicks in (iteration 201), we notice a large jump towards the true parameters.
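For reference, the g-and-k draws and the octile-type summaries used in this example can be sketched as follows (the crude index-based empirical-percentile routine is our own simplification):

```python
import math

def simulate_gk(theta, n, rng, c=0.8):
    """Draw n samples from the g-and-k distribution via its quantile
    function, pushing standard normal variates u through Q(u)."""
    A, B, g, k = theta
    draws = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        q = (A + B * (1 + c * (1 - math.exp(-g * u)) / (1 + math.exp(-g * u)))
             * (1 + u * u) ** k * u)
        draws.append(q)
    return draws

def octile_summaries(w):
    """Median, inter-quartile range and octile-based skewness/kurtosis
    summaries (crude empirical percentiles, for illustration only)."""
    ws = sorted(w)
    n = len(ws)
    P = lambda q: ws[min(n - 1, int(q / 100.0 * n))]
    sA = P(50)
    sB = P(75) - P(25)
    sg = (P(75) + P(25) - 2 * sA) / sB
    sk = (P(87.5) - P(62.5) + P(37.5) - P(12.5)) / sB
    return (sA, sB, sg, sk)
```

With ground-truth parameters (3, 1, 2, 0.5) the sample median should be close to A = 3, and the skewness summary positive, reflecting the strong right skew induced by g = 2.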
The proposal in the ASL algorithm produces a high acceptance rate: although this induces very local moves, the chain never gets stuck, and it thus provides useful information to initialize the covariance matrix in "Haario".
We now avoid using our guided ASL and run BSL using again MCWM during the burnin iterations and the adaptive "Haario" strategy for the remaining iterations, with results in Figure 1b. This shows the difficulty of running BSL when starting parameters are in the tails of the posterior: several runs completely fail, and only occasionally do they manage to reach the bulk of the posterior. Furthermore, the traces are very sticky. This is because the adaptive "Haario" proposal tunes the covariance of the Gaussian random walk sampler on the previous history of the chain, which becomes problematic if many rejections occur, as in this case the covariance shrinks, making recovery difficult. This is why the high acceptance rate of ASL, coupled with the rapid convergence towards the posterior's bulk, helps collect moves that are useful for learning the covariance matrix of the adaptive random walk proposal. We mentioned that if a starting parameter value is chosen closer to the ground-truth value, a smaller M can be employed (and recall from Figure 1b that with a standard adaptive proposal method BSL failed even with M = 1,000). As an example, we take the sample mean of the draws accepted by ASL from iteration 200 to 500, and use this sample mean to initialize BSL with the "Haario" proposal and as little as M = 50: the mixing of BSL in this case is very satisfactory (results shown in the Supplementary Material), and actually even M = 10 allows good mixing, though with slightly worse tail performance. Therefore, ASL can be really useful in enabling standard BSL to be used with considerable computational savings (5,000 iterations of BSL can be performed in under a minute when M = 50).
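The MCWM update used during the burnin can be sketched as follows; noisy_loglik stands for any fresh stochastic estimate of the synthetic log-likelihood (a hypothetical toy interface of our own, with a symmetric random walk so the proposal terms cancel):

```python
import math

def mcwm_step(theta_c, noisy_loglik, log_prior, step, rng):
    """One Markov-chain-within-Metropolis iteration (sketch).

    Unlike standard pseudo-marginal MH, the noisy log-likelihood at the
    CURRENT value is re-estimated with fresh variates at every iteration,
    instead of recycling the previously accepted estimate (this is why
    MCWM roughly doubles the execution time). noisy_loglik(theta, rng) is
    a hypothetical interface returning a fresh stochastic estimate.
    """
    theta_p = theta_c + rng.gauss(0.0, step)   # symmetric random walk
    ll_c = noisy_loglik(theta_c, rng)          # re-evaluated, not recycled
    ll_p = noisy_loglik(theta_p, rng)
    if math.log(rng.random()) < (ll_p + log_prior(theta_p)
                                 - ll_c - log_prior(theta_c)):
        return theta_p
    return theta_c
```

Re-estimating the current-value likelihood prevents the chain from sticking at a "lucky" over-estimate, at the price of two likelihood evaluations per iteration.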

Using correlated synthetic likelihood without ASL
Here we consider the correlated synthetic likelihood (CSL) approach outlined in section 4, without the use of our ASL approach for proposing parameters, to better appreciate the individual effect of using correlated likelihoods. Notice (12) immediately suggests how to implement CSL, since the u appearing in (12) can be thought of as a scalar realization of the U variate in section 4. We initialised parameters at the same starting values as in the previous experiments, across five independent inference attempts. We used CSL throughout, including the burnin phase; that is, we do not employ MCWM during the burnin. After 200 burnin iterations with fixed covariance matrix, we propose parameters using "Haario". We illustrate results obtained with G = 100 blocks, which should imply a theoretical correlation of ρ = 1 − 1/100 = 0.99 between estimated synthetic loglikelihoods, see Figure 2. The figure shows that, while two attempts failed (essentially because the 200 burnin iterations did not produce any acceptance), the remaining attempts managed to reach the ground-truth values. Recall that when using BSL without induced correlation (and employing the same "Haario" proposal sampler) we produced Figure 1b. The benefits of recycling pseudo-random variates are noticeable. Similar plots, but using G = 50, are in the Supplementary Material. The comparison between the cases G = 50 and G = 100 shows that inducing higher correlation (i.e. G = 100) allows faster convergence to ground-truth parameters; at the same time, however, the mixing is reduced due to the reuse of perhaps too many U-variates, whereas with G = 50 the chains appear to mix better.
Figure 2: g-and-k: 1,000 iterations from CSL, using G = 100 groups. The black dashed lines mark ground-truth parameters. Solid horizontal lines correspond to failed attempts.

Initialization using ELFI and BOLFI
Here we show results from the BOLFI optimizer (discussed in section 5), used to find a promising area in the posterior region and hence provide a useful starting value for SL. We use the ELFI software. In this example BOLFI uses a Gaussian process (GP) to learn the possibly complex and nonlinear relationship between log-discrepancies log ∆ and the corresponding parameters θ. We found that for this specific example, where we previously set very wide and vague priors, we could not infer the parameters using BOLFI with the LCB (lower confidence bound) acquisition function, regardless of the value set for J_1. This is because, while in previous inference attempts we used MCMC methods to explore the posterior and having very vague priors was still feasible, here having initial samples provided by very uninformative priors is not manageable. In this section we use A ∼ U(−10, 10), B ∼ U(0, 10), g ∼ U(0, 10), k ∼ U(0, 10). These priors are narrower than in previous attempts, but are still wide and uninformative enough to make this experiment interesting and challenging. Once the J_1 training samples are obtained, BOLFI starts optimizing parameters by iteratively fitting a GP and proposing points θ^(j), each attempting to reduce log ∆, j = 1, ..., J_2. We first consider J_2 = 500 and then J_2 = 800. The clouds of points in Figure 3 represent all J_1 + J_2 values of log-discrepancies log ∆ (for (J_1, J_2) = (20, 500) and (J_1, J_2) = (100, 500)) and the corresponding parameter values. The smallest values of log ∆ cluster around the ground-truth parameters, which we recall are A = 3, B = 1, g = 2, k = 0.5. The values of the optimized discrepancies are in the Supplementary Material. Even with a very small J_1 the obtained results appear very promising. Also, even though the estimates for k seem to be bounded by the lower limit we set for its prior, we can clearly notice a trend, in that smaller values of k return smaller discrepancies.
BOLFI can be an effective tool to initialize an MCMC procedure for synthetic likelihoods.
Figure 3: g-and-k: log-discrepancies for the tested parameters using BOLFI with J_1 = 20 (top) and J_1 = 100 (bottom). From left to right: plots for A, B, g and k respectively.

Supernova cosmological parameters estimation with twenty summary statistics
We present an astronomical example taken from Jennings and Madigan [2017], where the ABC algorithm by Beaumont et al. [2009] was used for likelihood-free inference. The algorithm in Beaumont et al. [2009] is a sequential Monte Carlo (SMC) sampler, hereafter denoted ABC-SMC, which propagates many parameter values ("particles") through a sequence of approximations of the posterior distribution of the parameters. The sequence of approximations depends on the sequence of tolerances ε_{1:T}, where T is the final iteration of the procedure. The different approaches used to create the series of decreasing tolerances [Beaumont et al., 2009, Del Moral et al., 2012], together with the choice of T, can lead to inefficient sampling [Simola et al., 2020]. For this reason, rather than using the ABC-SMC algorithm, we employed one of its extensions, the Adaptive ABC-PMC (hereafter aABC-PMC) found in Simola et al. [2020]. When using the aABC-PMC algorithm, both the series of decreasing tolerances and T are automatically selected, by monitoring the online behaviour of the approximations to the posterior distribution (aABC-PMC is also implemented in ELFI). Our goal is to show how synthetic likelihoods may also be used to tackle this inferential problem, and a comparison with aABC-PMC and BOLFI is presented. In Jennings and Madigan [2017] the analysis relied on the SNANA light curve analysis package [Kessler et al., 2009] and its implementation of the SALT-II light curve fitter presented in Guy et al. [2010]. A sample of 400 supernovae with redshift range z ∈ [0.5, 1.0] is simulated and then binned into 20 redshift bins. However, for this example we did not use SNANA, and data is instead simulated following the procedure in the Supplementary Material. The model describing the distance modulus as a function of redshift z, known in the astronomical literature as the Friedmann-Robertson-Walker model [Condon and Matthews, 2018], is:
\[
\mu(z) = 5 \log_{10}\left\{\frac{c\,(1+z)}{H_0}\int_0^z \frac{dz'}{E(z')}\right\} + 25, \tag{13}
\]
where
\[
E(z) = \sqrt{\Omega_m (1+z)^3 + \Omega_k (1+z)^2 + \Omega_\Lambda \exp\left(3\int_0^z d\ln(1+z')\,[1 + w(z')]\right)}.
\]
The cosmological parameters involved in (13) are five.
The first three parameters are the matter density of the universe, Ω_m, the dark energy density of the universe, Ω_Λ, and the density of radiation and relic neutrinos, Ω_k. These three parameters satisfy the constraint Ω_m + Ω_Λ + Ω_k = 1 [Genovese et al., 2009, Tripathi et al., 2017, Usmani et al., 2008]. The final two parameters are, respectively, the present value of the dark energy equation of state, w_0, and the Hubble constant, h_0. A common assumption is a flat universe, leading to Ω_k = 0, as in Tripathi et al. [2017] and Usmani et al. [2008]. As a result, (13) simplifies, and in particular E(z) can be written
\[
E(z) = \sqrt{\Omega_m (1+z)^3 + \Omega_\Lambda \exp\left(3\int_0^z d\ln(1+z')\,[1 + w(z')]\right)},
\]
where we note that Ω_Λ = 1 − Ω_m. Same as in Jennings and Madigan [2017], we work under the flat universe assumption. Concerning the Dark Energy Equation of State (EoS), w(·), we use the parametrization proposed in Chevallier and Polarski [2001] and in Linder [2003]:
\[
w(a) = w_0 + w_a (1 - a), \qquad a = \frac{1}{1+z}. \tag{14}
\]
According to (14), w is assumed linear in the scale factor a. Another common assumption is that w is constant, in which case w = w_0. We note that several parametrizations have been proposed for the EoS (see for example Huterer and Turner [2001], Wetterich [2004] and Usmani et al. [2008]). For the present example, ground-truth parameters are set as follows: Ω_m = 0.3, Ω_k = 0, w_0 = −1.0 and h_0 = 0.7. In the present study h_0 is assumed known. Similarly to Jennings and Madigan [2017], we aim at inferring the cosmological parameters θ = (Ω_m, w_0). The distance function used to compare μ with the "simulated" data μ_sim(z) is a discrepancy computed across the 20 redshift bins. We recall that the aABC-PMC algorithm in Simola et al. [2020] uses a series of automatically selected decreasing tolerances ε_{1:T}, each inducing a better approximation to the true posterior distribution as t ∈ [1, T] increases. When the stopping rule, based on the improvement between two consecutive posterior distributions, is satisfied, the procedure automatically halts.
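Under the flat-universe assumption with constant EoS w(z) ≡ w_0, the distance modulus (13) can be computed numerically as sketched below (the constant-w simplification and the trapezoidal grid are our own illustrative choices):

```python
import math

def mu_model(z, omega_m, w0, h0, n_grid=1000):
    """Distance modulus mu(z) for a flat universe with constant EoS w = w0,
    so that E(z) = sqrt(Omega_m (1+z)^3 + (1 - Omega_m)(1+z)^{3(1+w0)}).
    The luminosity distance d_L = (1+z)(c/H0) * integral of dz'/E(z') (Mpc)
    is computed with a simple trapezoidal rule.
    """
    c = 299792.458            # speed of light, km/s
    H0 = 100.0 * h0           # Hubble constant, km/s/Mpc

    def E(zz):
        return math.sqrt(omega_m * (1 + zz) ** 3
                         + (1 - omega_m) * (1 + zz) ** (3 * (1 + w0)))

    h = z / n_grid
    integral = sum(0.5 * h * (1.0 / E(i * h) + 1.0 / E((i + 1) * h))
                   for i in range(n_grid))
    d_L = (1 + z) * (c / H0) * integral   # luminosity distance in Mpc
    return 5 * math.log10(d_L) + 25
```

With the ground-truth values (Ω_m, w_0, h_0) = (0.3, −1, 0.7), μ(z) is an increasing function of z of magnitude roughly in the low forties over the redshift range considered here.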
While the ABC posterior based on ε_1 uses the prior distribution as proposal function, for t > 1 aABC-PMC uses the previous iteration's ABC posterior to produce candidates, just like regular ABC-PMC or other sequential ABC procedures. In this work, as done also by Jennings and Madigan [2017], we follow Beaumont et al. [2009] regarding the selection of the perturbation kernel, which is a Gaussian distribution centered at the selected particle and having variance equal to twice the weighted sample variance of the particles selected in the previous iteration. The specifications for the aABC-PMC algorithm are found in Simola et al. [2020]. For all experiments, we set priors Ω_m ∼ Beta(3, 3), since Ω_m must be in (0, 1), and w_0 ∼ N(−0.5, 0.5²).

Inference
We describe how to forward simulate from the model in the Supplementary Material. We take s = (μ_1, ..., μ_20) as "observed" summary statistics corresponding to the stochastic input generated as described in the Supplementary Material. Notice that in our case s is the trivial summary statistic, in that (μ_1, ..., μ_20) is the data itself. We investigate the Gaussianity of the summaries in the Supplementary Material and find that it is statistically supported, at least for summaries simulated at ground-truth parameter values; notice, however, that a different behaviour might occur at other values of θ, for example at values far from the ground truth. We found it impractical to consider M in the order of thousands; however, using a smaller value of, say, M = 100 would produce an ill-conditioned covariance matrix. To overcome this problem we found it essential to use a shrinkage estimator of Σ_{M,θ}, such as the one due to Warton [2008] and employed in a BSL context in Nott et al. [2019]. This way we managed to use as little as M = 100 model simulations. In this section we denote the BSL approach using shrinkage as "sBSL". We compare sBSL with the correlated synthetic likelihoods approach plugged into ASL, and denote this method "ACSL" (we employed shrinkage also within ACSL). We always use M = 100, and within ACSL we experiment with several numbers of blocks, namely G = 5 and G = 10. For all methods, starting parameter values are (Ω_m = 0.90, w_0 = −0.5); see the Supplementary Material for further details on the MCMC settings. We first note that sBSL is unable to move away from the starting parameter values, hence this attempt is a failure. Introducing correlation between synthetic loglikelihoods is a key feature for the success of ACSL in this case study.
Traceplots for 11,200 draws from ACSL when G = 5, 10 are in the Supplementary Material. Having G > 1 helps proposals get accepted during the burnin period, so that when ACSL starts it is provided with useful information from the burnin. The output of aABC-PMC is produced by 1,000 particles (the final tolerance automatically selected by the algorithm after T = 9 iterations is ε_9 = 30.5). For comparison with aABC-PMC inference, which is produced by a "cloud" of particles, we thin the output of the single chains of sBSL and ACSL: we take the last 10,000 draws from ACSL and sBSL and retain every 10th draw, thus obtaining 1,000 draws that are used to report inference in Table 1. We remind the reader that sBSL fails when initialised at the same starting parameters used for ACSL; therefore, to enable some comparison, we start sBSL at the ground-truth parameters (this case is denoted sBSL_truth in the table). Regarding BOLFI, posterior samples were produced by first obtaining 2,000 "acquisition points" in ELFI (over which a GP model is fitted), then 10,000 draws were produced via MCMC, and finally chains were thinned to obtain the 1,000 draws used for statistical inference. Comparisons between all methods are in Table 1 and Figure 4. Inference results for sBSL_truth, ACSL (both attempts) and BOLFI are similar; however, the ESS for BOLFI is the highest, while the ESS for sBSL_truth is much lower than for ACSL, as reported in Table 1. While for this case study the exact posterior is unknown, we can speculate that discrepancies in posterior variability between the results obtained with aABC-PMC and those obtained with the other methods can likely be explained by the use of the 20-dimensional summary statistics. For a study on the impact of the summaries' dimension in an ABC analysis we refer the reader to Blum et al. [2013].
As a further remark, it is important to note that, unlike standard BSL, ACSL could be initialised relatively far from ground-truth parameters and still return reasonable inference.
Table 1: Supernova model: posterior means (95% HPD interval) resulting from 1,000 thinned posterior draws from several methods. All chains are initialised at (Ω_m = 0.90, w_0 = −0.5), except for sBSL_truth, which is sBSL initialised at ground-truth parameters. The "NA" for sBSL means that the MCMC was unable to move away from the starting location.

Simple recruitment, boom and bust with highly skewed summaries
Here we consider an example discussed in Fasiolo et al. [2018], as it proved challenging due to the highly non-Gaussian summary statistics. The recruitment, boom and bust model is a discrete stochastic temporal model that can be used to represent the fluctuation of the population size of a certain group over time. Given the population size N_t and parameter θ = (r, κ, α, β), the next value N_{t+1} has the following distribution:
\[
N_{t+1} \sim
\begin{cases}
\mathrm{Poisson}\big(N_t(1+r)\big) + \epsilon_t, & \text{if } N_t \le \kappa,\\[2pt]
\mathrm{Binomial}(N_t, \alpha) + \epsilon_t, & \text{if } N_t > \kappa,
\end{cases}
\]
where ε_t ∼ Pois(β). The population oscillates between high and low population sizes for several cycles. True parameters are r = 0.4, κ = 50, α = 0.09 and β = 0.05, and we assume N_1 = 10 is a fixed and known constant. This value of β is considered as it gives rise to highly non-Gaussian summaries, and hence it is of interest to test our methodology in such a scenario. In fact, the smaller the value of β, the more problematic it is to use synthetic likelihoods.
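A forward simulation of this model can be sketched as follows (the Poisson sampler uses Knuth's classical method, adequate for the moderate rates arising here; all function names are our own):

```python
import math

def poisson(lam, rng):
    """Knuth's Poisson sampler (fine for the moderate rates in this model)."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate_boom_bust(theta, T, rng, N1=10):
    """Forward-simulate the recruitment, boom and bust model: Poisson
    growth below the threshold kappa, a Binomial 'bust' above it, plus
    Poisson(beta) noise, as recalled in the text above."""
    r, kappa, alpha, beta = theta
    N = [N1]
    for _ in range(T - 1):
        eps = poisson(beta, rng)                 # epsilon_t ~ Pois(beta)
        if N[-1] <= kappa:
            N.append(poisson(N[-1] * (1 + r), rng) + eps)
        else:
            # Binomial(N_t, alpha) as a sum of Bernoulli draws
            N.append(sum(rng.random() < alpha for _ in range(N[-1])) + eps)
    return N
```

With the ground-truth parameters the trajectory repeatedly grows past κ = 50 and then collapses to small values, producing the boom-and-bust cycles described above.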

An illustration of the summaries' distribution at the true parameter values is in the Supplementary Material, together with the prior specifications, the summary statistics employed and other model specifications.
We experiment with two sets of starting parameter values: set 1 has r = 0.8, κ = 65, α = 0.05, β = 0.07; set 2 has a more extreme set of values, given by r = 1, κ = 75, α = 0.02, β = 0.07. We always use M = 200. In this case study we could not experiment with the correlated synthetic likelihoods approach, since the standard generation of Poisson draws requires executing a while-loop in which uniform draws are simulated at each iteration: it is therefore not known in advance how many uniform draws need to be stored, and the implementation of correlated SL is inconvenient. When parameters are initialised in set 1, a burnin of 200 iterations aided by MCWM is considered (MCWM is not used after burnin). When initialising from set 2, we use a longer burnin of 500 iterations. During the burnin, as usual, we propose parameters using a Gaussian random walk proposal with constant diagonal covariance matrix with diagonal elements [0.005², 0.5², 0.001², 0.001²]. For ASL the burnin was followed by 300 iterations (again, this can be set much smaller) using the guided proposals approach, and then a further 1,200 iterations using "Haario". BSL was found to diverge to wrong regions of the posterior surface, with chains stuck for long periods, for both attempted sets of starting parameters. We therefore implemented the semi-parametric BSL approach, hereafter "semiBSL": semiBSL is a version of BSL robustified to the case of non-Gaussian-distributed summary statistics. However, semiBSL also failed when parameters were initialized in the tails of the posterior (i.e. when using the same starting parameters considered above for ASL), meaning that chains were unable to mix and were stuck in wrong regions; see the Supplementary Material for details. This shows that even a "robustified" version of synthetic likelihoods can be fragile to bad initializations.
Therefore, the results we report in Figure 5b for both standard BSL and semiBSL are based on chains initialized at the ground-truth parameter values. With ASL, at the end of the initial 500 burnin iterations we notice the characteristic "jump" towards the true parameter values, see Figure 5a. Therefore, ASL is able to produce inference also when initialised at parameters in the tails of the posterior surface, while BSL and semiBSL cannot, at least for this example. Traces for the failing semiBSL initialised at set 2 are in the Supplementary Material. Similarly to the supernova model, we now compute minESS values. Using 1,000 posterior draws from the run initialised at set 2, with ASL the minESS is 49. For BSL, when starting at ground truth, we have minESS = 35, and for semiBSL minESS = 41. Therefore, the values for semiBSL and ASL are quite similar, despite the fact that ASL is initialised at a less favorable location.

Exploration of a multimodal surface
We show how to run multiple chains employing ASL to rapidly alert the researcher to the existence of multiple modes, at a small computational cost. The toy model is admittedly very simple, but the experiment is expressive enough for our take-home message. We consider a likelihood consisting of a two-component Gaussian mixture x ∼ 0.5 N(μ_1, Σ_1) + 0.5 N(μ_2, Σ_2), where each component is two-dimensional. We consider 5,000 observations generated by such a mixture with ground truth μ_1 = (μ_1^(1), μ_1^(2)) = (−5, 10), μ_2 = (μ_2^(1), μ_2^(2)) = (30, 20), and covariance matrices Σ_1 and Σ_2 both having diagonal entries (4², 4²); however, Σ_1 is diagonal while Σ_2 has off-diagonal entries both equal to 12. Data are exemplified in Figure 6(a). We assume μ_1 and μ_2 are the only unknowns, and everything else is fixed to ground-truth values. We set independent Gaussian priors on the entries of μ_1 and μ_2, e.g. μ_2^(2) ∼ N(20, 2²). In our experiments, summary statistics of simulated data are the estimated means of the two mixture components, as obtained by fitting a two-component Gaussian mixture (with known covariances set to ground truth). Observed summaries are always s = (−5, 10, 30, 20), that is, the ground-truth means. We use a common strategy to get around the well-known "label-switching" issue affecting mixture models: whenever a vector (μ_1, μ_2) is proposed during MCMC, we sort its entries across the mixture components, rearranging the components so that μ_1^(1) < μ_2^(1). Since we work in the context of synthetic likelihoods, once a simulated dataset is produced at the proposed parameters, we fit a two-component Gaussian mixture to the data as previously mentioned, and the four corresponding estimated means (which are used as summary statistics) are sorted in the same way as the proposed parameters.
We use M = 10 to approximate the synthetic likelihood and design the following experiment. For a fixed dataset with observed summaries s = (−5, 10, 30, 20) we run 100 independent chains initialised at random locations. We set up a very short burnin consisting of 49 iterations where, as usual, MCWM is used and a Gaussian random walk sampler is employed, with the noise in the random walk having standard deviation 0.2 for each proposed entry in μ_1 and μ_2. This means that during burnin we intentionally induce slow exploration of the posterior surface. We show that, as soon as ASL starts, most chains quickly reach the high-density region of the posterior. Figure 6(b) shows 100 starting values randomly sampled uniformly in the 4-dimensional hypercube [−30, 50]⁴. We notice that after 49 iterations using random walk proposals the draws are still fairly close to the starting values; however, one iteration later, when ASL is initialised, a rapid jump is performed towards the high-density region. The clustering of the ASL draws should signal to the researcher the existence of more than one mode, and hence inform her of the opportunity to initialise more than one chain for a full-fledged Bayesian inference, by picking the starting values in the clusters determined by ASL.
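The sorting-based handling of label switching, together with a crude estimate of the component means, can be sketched as follows; the 2-means pass is an illustrative stand-in for the Gaussian mixture fit mentioned above, and all names are our own:

```python
def mixture_mean_summaries(data, iters=20):
    """Estimate the two component means of bivariate two-component mixture
    data with a crude 2-means pass (an illustrative stand-in for the
    Gaussian mixture fit in the text), then sort the components by their
    first coordinate, so that summaries and proposed parameters are
    rearranged consistently and label switching is avoided."""
    # initialise the centres at the points with extreme first coordinates
    c1 = min(data, key=lambda p: p[0])
    c2 = max(data, key=lambda p: p[0])
    for _ in range(iters):
        g1, g2 = [], []
        for p in data:
            d1 = (p[0] - c1[0]) ** 2 + (p[1] - c1[1]) ** 2
            d2 = (p[0] - c2[0]) ** 2 + (p[1] - c2[1]) ** 2
            (g1 if d1 <= d2 else g2).append(p)
        c1 = (sum(p[0] for p in g1) / len(g1), sum(p[1] for p in g1) / len(g1))
        c2 = (sum(p[0] for p in g2) / len(g2), sum(p[1] for p in g2) / len(g2))
    m1, m2 = sorted([c1, c2], key=lambda c: c[0])
    return m1 + m2   # (mu1_x, mu1_y, mu2_x, mu2_y), sorted by x-coordinate
```

Because both the proposed parameter vector and the estimated summary means are sorted by the same criterion, the comparison inside the synthetic likelihood is invariant to component relabelling.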

Discussion
We have introduced several ways to improve the performance of the computationally intensive synthetic likelihood framework. Firstly, we have developed a sequential strategy to learn a "guided by data" proposal distribution for SL. The resulting sequentially adaptive and guided SL sampler (ASL) helped the chain rapidly approach the ground-truth parameter values. Importantly, for two of the considered case studies (supernova cosmological parameters and recruitment boom-and-bust model), standard SL methods failed when initialized at remote parameter values and when the standard adaptive MCMC strategy by Haario et al. [2001] was employed, whereas ASL helped the chains rapidly converge to high-posterior regions (remarkably, this happened even for the markedly non-Gaussian summary statistics considered in section 6.3). In addition, we have shown how to introduce correlation between successive estimates of the synthetic likelihood, calling this approach "correlated synthetic likelihoods" (CSL). This should help reduce the variance of the Metropolis-Hastings acceptance ratio, and indeed we have noticed an increase in the mixing of the chains. We have shown how CSL can be of help when SL is initialized in the tails of the posterior, and how beneficial CSL is in terms of chain mixing. However, CSL is not a silver bullet, and it is not guaranteed to eliminate the possibility of SL getting stuck when badly initialized. However, when it can be implemented, there is no obvious reason to prefer standard SL to CSL. At worst, we conjecture that for very nonlinear transformations of the data following the construction of possibly complex summary statistics (and hence complex transformations of the pseudo-random variates), the correlation between successive likelihoods may get destroyed, effectively reducing CSL to standard SL. We have challenged CSL with a "perturbed α-stable model" (in Supplementary Material) and even in this case CSL has proven beneficial. Finally, for the g-and-k and supernova examples, we have illustrated how the problem of a difficult initialization for SL can be tackled by using a Bayesian optimization-based approach to likelihood-free inference [Gutmann and Corander, 2016], available in the ELFI software [Lintusaari et al., 2018]. However, we note that the BOLFI implementation uses the LCB (lower confidence bound) acquisition function, which can be prone to over-exploring the boundaries of the parameter space and may in some cases result in a poorly resolved surrogate model. An improved acquisition function based on expected integrated variance, introduced by Järvenpää et al. [2019], has been shown to lead to more accurate posterior approximations and is also available in ELFI, although it is typically rather expensive computationally. In summary, we believe that when a reasonable starting region for the initial θ is unknown, BOLFI can likely screen the posterior surface in search of a promising starting region much more rapidly than a random walk proposal.

Figure 6: Gaussian mixture: (a) 5,000 data points from the likelihood model and contour lines for the latter; (b) contour lines for the likelihood model; black circles are the 100 starting values for the corresponding 100 chains; magenta circles correspond to iteration 49 (the last burnin iteration) for each chain; green asterisks correspond to iteration 50, that is the first ASL iteration, for each chain.
On the other hand, when the dimension of θ is small, as is often the case in BSL applications (say ≤ 7 parameters), our approach of producing a small number of random walk proposals followed by a short run of ASL can also be computationally convenient and generally easy to implement. The steps taken in this work thus broaden the scope of usage of synthetic likelihood methods and open up new avenues for further research on improving the applicability of inference for intractable models.

Acknowledgments
We would like to thank Christopher Drovandi for useful feedback on an earlier draft of this paper.

SUPPLEMENTARY MATERIAL
Adaptive MCMC of Haario et al. (2001)

To make our work self-contained, here we give the covariance update formula for the adaptive Gaussian proposal of Haario et al. [2001], which we use throughout our paper. Say that the chain is in position $\theta_{r-1}$ at iteration $r-1$. At iteration $r$ we want to propose $\theta_r \sim \mathcal{N}(\theta_{r-1}, C_r)$, and the previous MCMC "history" is given by $(\theta_0, \ldots, \theta_{r-1})$. Then we can take
$$
C_r = \begin{cases}
C_{\mathrm{init}}, & r \leq K,\\[1ex]
\dfrac{2.4^2}{\dim(\theta)} \mathrm{cov}(\theta_0, \ldots, \theta_{r-1}) + \dfrac{2.4^2}{\dim(\theta)}\, \epsilon\, I_{\dim(\theta)}, & r > K.
\end{cases}
$$
Here $K$ is the number of burnin iterations and $C_{\mathrm{init}}$ is a user-provided covariance matrix of dimensions $\dim(\theta) \times \dim(\theta)$. $I_{\dim(\theta)}$ is the identity matrix of dimensions $\dim(\theta) \times \dim(\theta)$ and $\epsilon$ is a small positive factor (we set it to $10^{-8}$). As for $\mathrm{cov}(\cdot)$, this denotes the empirical covariance matrix applied to the chain's history. Importantly, it is often detrimental to update $C_r$ at every iteration. We typically update it every 30 or every 50 iterations (again using the entire past history when updating), and this still preserves the validity of the algorithm (see remark 3 in Haario et al., 2001). As for our experiments: before running ASL we run a small number $K$ of burnin iterations by proposing via $\theta_r \sim \mathcal{N}(\theta_{r-1}, C_{\mathrm{init}})$, $r \leq K$, where we always assume $C_{\mathrm{init}}$ to be diagonal (details are given in each case study). Afterwards we run $T$ ASL iterations. Once these have terminated, we take the (post-burnin) accepted draws from ASL, denoted $(\theta_1, \ldots, \theta_T)$, and implement the method by Haario et al. [2001] starting at $\theta_T$ and removing the burnin; that is, the very first iteration has covariance matrix computed as $\frac{2.4^2}{\dim(\theta)} \mathrm{cov}(\theta_1, \ldots, \theta_T) + \frac{2.4^2}{\dim(\theta)}\, \epsilon\, I_{\dim(\theta)}$, and afterwards the covariance keeps getting updated by incorporating newly accepted draws into the chain's history (and again, we do not update at each iteration).
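The covariance update above can be sketched in Python as follows. This is an illustrative sketch, not the paper's Matlab implementation: the Gaussian target, seed, and settings are our own choices, and the covariance is refreshed only every `update_every` iterations, as described in the text.

```python
import numpy as np

# Adaptive random-walk Metropolis-Hastings with the Haario et al. (2001)
# covariance update: C_r = (2.4^2/d) * cov(history) + (2.4^2/d) * eps * I.
def haario_cov(history, eps=1e-8):
    d = history.shape[1]
    s = 2.4 ** 2 / d
    return s * np.cov(history, rowvar=False) + s * eps * np.eye(d)

def adaptive_rw_mh(logpost, theta0, C_init, n_iter=5000, K=200, update_every=50):
    rng = np.random.default_rng(1)
    d = len(theta0)
    chain = np.empty((n_iter, d))
    theta, lp = np.asarray(theta0, float), logpost(theta0)
    C = np.asarray(C_init, float)
    for r in range(n_iter):
        if r > K and r % update_every == 0:
            C = haario_cov(chain[:r])        # use the entire past history
        prop = rng.multivariate_normal(theta, C)
        lp_prop = logpost(prop)
        if np.log(rng.random()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain[r] = theta
    return chain

# usage: sample a correlated bivariate Gaussian target
target_prec = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 2.0]]))
logpost = lambda th: -0.5 * th @ target_prec @ th
chain = adaptive_rw_mh(logpost, np.zeros(2), 0.1 * np.eye(2))
print(np.cov(chain[1000:], rowvar=False))  # roughly [[1, .8], [.8, 2]]
```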

Bayesian synthetic likelihoods
Here we provide further details regarding BSL, as found in Price et al. [2018]. A BSL procedure samples from the exact posterior $\pi(\theta|s)$ for any $M$ (note that "exact" sampling is ensured only if the distribution of $s$ is really Gaussian). The key feature exploits the idea underlying the pseudo-marginal method of Andrieu et al. [2009], where an unbiased estimator is used in place of the unknown likelihood function. Price et al. [2018] noted that plugging the estimates $\hat\mu_{M,\theta}$ and $\hat\Sigma_{M,\theta}$ into the Gaussian likelihood $p(s|\theta)$ results in a biased estimator $p_M(s|\theta)$ of $p(s|\theta)$. They suggest adopting the unbiased estimator of Ghurye et al. [1969]:
$$
\hat p(s|\theta) = (2\pi)^{-d_s/2} \frac{c(d_s, M-2)}{c(d_s, M-1)(1-1/M)^{d_s/2}} \frac{\psi\Big( (M-1)\hat\Sigma_{M,\theta} - \frac{(s-\hat\mu_{M,\theta})(s-\hat\mu_{M,\theta})^\top}{1-1/M} \Big)^{(M-d_s-3)/2}}{|(M-1)\hat\Sigma_{M,\theta}|^{(M-d_s-2)/2}}.
$$
Here $\pi$ denotes the mathematical constant (not the prior), $d_s = \dim(s)$, $M$ is assumed to satisfy $M > d_s + 3$, and for a square matrix $A$ the function $\psi(A)$ is defined as $\psi(A) = |A|$ if $A$ is positive definite and $\psi(A) = 0$ otherwise, where $|A|$ is the determinant of $A$. Finally $c(k, v) = 2^{-kv/2} \pi^{-k(k-1)/4} / \prod_{i=1}^{k} \Gamma\big(\tfrac{1}{2}(v-i+1)\big)$. We can then plug $\hat p(s|\theta)$ into algorithm 1 in place of $p_M(s|\theta)$ to obtain a chain targeting $\pi(\theta|s)$, again only if $s$ is Gaussian. This is a powerful result; however, in practice the value of $M$ does affect the numerical results, as a too low value of $M$ can reduce the mixing of the chain, since the variance of $\hat p(s|\theta)$ increases with decreasing $M$.
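The unbiased estimator above can be sketched on the log scale as follows. This is an illustrative Python sketch under the reconstruction given above (all function names are our own); simulated summaries here are synthetic stand-ins, not from any of the paper's models.

```python
import numpy as np
from math import lgamma

# Log of the unbiased Gaussian synthetic-likelihood estimator
# (Ghurye and Olkin, 1969; as used in Price et al., 2018).
def log_c(k, v):
    # log of c(k, v) = 2^{-kv/2} * pi^{-k(k-1)/4} / prod_i Gamma((v-i+1)/2)
    return (-k * v / 2) * np.log(2) - (k * (k - 1) / 4) * np.log(np.pi) \
        - sum(lgamma((v - i + 1) / 2) for i in range(1, k + 1))

def log_unbiased_sl(s, sims):
    """s: observed summaries (d,); sims: M x d simulated summaries."""
    M, d = sims.shape
    assert M > d + 3
    mu = sims.mean(axis=0)
    Sigma = np.cov(sims, rowvar=False)           # unbiased, divides by M-1
    A = (M - 1) * Sigma
    B = A - np.outer(s - mu, s - mu) / (1 - 1 / M)
    # psi(B) = |B| if positive definite, otherwise the estimator is 0
    eigs = np.linalg.eigvalsh(B)
    if eigs.min() <= 0:
        return -np.inf
    _, logdet_A = np.linalg.slogdet(A)
    log_psi = np.log(eigs).sum()
    return (-d / 2) * np.log(2 * np.pi) + log_c(d, M - 2) - log_c(d, M - 1) \
        - (d / 2) * np.log(1 - 1 / M) \
        + ((M - d - 3) / 2) * log_psi - ((M - d - 2) / 2) * logdet_A

rng = np.random.default_rng(2)
sims = rng.normal(size=(50, 3))                  # M = 50 simulated summaries
print(log_unbiased_sl(np.zeros(3), sims))
```

For large M the estimate is close to the plug-in Gaussian log-likelihood, since the unbiasedness correction vanishes as M grows.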
g-and-k model

g-and-k: BSL with standard adaptive MCMC and ASL initialization

We consider the case where the adaptive proposal of Haario et al. [2001] is used within BSL and, importantly, the starting parameter value is provided by a chain that used ASL for a few iterations. That is, we consider the draws obtained via ASL from iteration 200 to 500 in Figure 1a of the main paper, take their sample mean, and use this to initialize BSL when the "Haario" proposal is used. Thanks to the starting value provided by ASL, which is very close to ground truth, we can afford to run BSL with only M = 50. An exemplary run of 5,000 iterations can be obtained in under a minute (with pure Matlab code), with an acceptance rate of around 20%, see Figure 7.

g-and-k: further results using CSL

Figure 8 shows five inference runs for the g-and-k model when CSL is run with G = 50.

g-and-k: weighting the summaries in BOLFI
It is possible to assign weights to summary statistics so that the resulting discrepancy is $\Delta = \big( \sum_{j=1}^{d_s} (s^*_j - s_j)^2 / w_j^2 \big)^{1/2} = \big( (s^* - s)^\top A (s^* - s) \big)^{1/2}$, where $d_s = \dim(s)$ and the $w_j$ are nonnegative weights for each of the components of the summary statistics. Equivalently we may consider the Mahalanobis distance $\Delta = ((s^* - s)^\top A (s^* - s))^{1/2}$, with $A$ interpreted as some scaling matrix (say the inverse of a covariance matrix). For example we could define $A$ as the diagonal matrix $A = \mathrm{diag}(w_1^{-2}, \ldots, w_{d_s}^{-2})$. Summaries are automatically scaled when using the synthetic likelihoods approach (via the $\hat\Sigma_M$ matrix); however, this is not automatically performed in BOLFI. The reason why it is relevant to give appropriate weights to simulated and observed summaries is that entries in $s$ and $s^*$ may vary on very different scales, hence $\Delta$ might be dominated by the most variable components of $s$ and $s^*$. Therefore, prior to running BOLFI, we obtain the $w_j$'s in the following way (see also Picchini and Anderson, 2017). We simulate, say, $L = 1{,}000$ independent parameter draws from the prior, $\theta^*_l \sim \pi(\theta)$, and simulate corresponding artificial data $y^*_l \sim p(y|\theta^*_l)$, to finally obtain artificial summaries $s^*_l = T(y^*_l)$, $l = 1, \ldots, L$. We store all the simulated summaries in an $L \times d_s$ matrix. For each column of this matrix we compute some robust measure of variability. We consider the median absolute deviation (MAD), hence obtain $d_s$ MADs, $(\mathrm{MAD}_1, \ldots, \mathrm{MAD}_{d_s})$, and define $w_j := \mathrm{MAD}_j$, $j = 1, \ldots, d_s$. We then construct $A$ as described above and use BOLFI to optimize $\Delta$. The procedure we have just outlined corresponds to results denoted with weighted=yes in Table 2. Results using constant $w_j \equiv 1$ are given as weighted=no. The only times we happened to obtain a positive estimate for $k$ were in two instances using weighted summaries.
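The MAD-based weighting can be sketched as follows. This is an illustrative Python sketch: the prior-predictive summary generator below is a stand-in with coordinates on deliberately different scales, not the g-and-k model.

```python
import numpy as np

# Weighted Euclidean discrepancy with MAD weights estimated from
# prior-predictive simulations, as described in the text.
rng = np.random.default_rng(3)

def prior_predictive_summaries(L=1000, d_s=4):
    # stand-in: summaries whose coordinates live on very different scales
    return rng.normal(size=(L, d_s)) * np.array([1.0, 10.0, 100.0, 0.1])

def mad(col):
    return np.median(np.abs(col - np.median(col)))

S = prior_predictive_summaries()
w = np.array([mad(S[:, j]) for j in range(S.shape[1])])  # w_j = MAD_j

def discrepancy(s_sim, s_obs, w):
    # Delta = sqrt( sum_j (s*_j - s_j)^2 / w_j^2 )
    return np.sqrt(np.sum((s_sim - s_obs) ** 2 / w ** 2))

s_obs = np.zeros(4)
print(discrepancy(S[0], s_obs, w))
```

Without the weights, the third (largest-scale) coordinate would dominate the discrepancy; with them, each coordinate contributes on a comparable scale.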
The weighting of summary statistics is only performed when using BOLFI, not when using the SL approach (in SL, summaries are naturally weighted via the covariance matrix $\hat\Sigma_M$).

Supernova model

Simulation from the model
Here we describe how to simulate a generic dataset. The same procedure is of course used to generate both "observed data" and "simulated data". We generate $10^4$ variates $u_1, \ldots, u_{10^4}$, independently sampled from a truncated Gaussian $u_j \sim \mathcal{N}_{[0.01, 1.2]}(0.5, 0.05^2)$ ($j = 1, \ldots, 10^4$), where $\mathcal{N}_{[a,b]}(m, \sigma^2)$ denotes a Gaussian distribution with mean $m$ and variance $\sigma^2$, truncated to the interval $[a, b]$. The variates are then binned into 20 intervals of equal width (essentially the bins of a histogram constructed on the $u_j$), then the 20 centres of the bins are obtained and these centres are the "redshifts" $z_1, \ldots, z_{20}$. Then for each $z_i$ we compute the distance modulus $\mu_i$ via (13), using $(\Omega_m, w_0, h_0) = (0.3, -1, 0.7)$ ($i = 1, \ldots, 20$). Therefore, each simulation from the model requires first the generation of the 10,000 truncated Gaussians, then their binning and the calculation of the twenty $\mu_i$. Computing the latter is a computational bottleneck, as in order to compute a synthetic likelihood the procedure above has to be performed $M$ times for each new proposed value of $\theta = (\Omega_m, w_0)$.

Table 2: g-and-k: values of the optimized log-discrepancies and corresponding parameters for several values of $J_1$ and $J_2$. Ground-truth values are $A = 3$, $B = 1$, $g = 2$, $k = 0.5$. A "yes" in the weighted column implies that discrepancies are computed using weighted summary statistics.
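The truncated-Gaussian binning step can be sketched as follows. This is an illustrative Python sketch: sampling is via simple rejection (adequate here since almost no mass lies outside [0.01, 1.2]); the distance-modulus computation of equation (13) is not reproduced.

```python
import numpy as np

# Generate the 20 "redshifts": 10^4 draws from N(0.5, 0.05^2) truncated to
# [0.01, 1.2], binned into 20 equal-width intervals; return the bin centres.
rng = np.random.default_rng(4)

def truncated_normal(n, m=0.5, sd=0.05, lo=0.01, hi=1.2):
    out = np.empty(0)
    while out.size < n:                      # rejection sampling
        draw = rng.normal(m, sd, size=2 * n)
        out = np.concatenate([out, draw[(draw >= lo) & (draw <= hi)]])
    return out[:n]

def redshifts(n=10_000, n_bins=20):
    u = truncated_normal(n)
    edges = np.histogram_bin_edges(u, bins=n_bins)  # equal-width bins on the data range
    return 0.5 * (edges[:-1] + edges[1:])           # the 20 bin centres

z = redshifts()
print(z)  # 20 increasing values concentrated around 0.5
```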

Supernova model: summaries distribution
In order to check whether the synthetic likelihood methodology is suitable for conducting the analyses, the multivariate normality assumption on the employed summary statistic must be checked (see Fasiolo et al., 2018 for how to relax the assumption). We simulate independently a total of 1,000 summaries (each having dimension 20), using ground-truth parameters. A test for multivariate normality can be found in Krzanowski [2000] and is implemented in the checkNorm function from the R package synlik [Fasiolo and Wood, 2014], which additionally produces Figure 9. The test does not reject the multivariate normality assumption of the summary statistic at the 5% significance level. Furthermore, we note that the right-tail behavior in the q-q plot is not unexpected in the synthetic likelihoods context [Wood, 2010].
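A simple numerical analogue of this diagnostic (our own sketch, not the synlik checkNorm routine) is based on the fact that, under multivariate normality, the squared Mahalanobis distances of the simulated summaries are approximately chi-squared with $d_s$ degrees of freedom, so their mean should be close to $d_s$ and their variance close to $2 d_s$; the q-q plot in checkNorm compares these distances against chi-squared quantiles.

```python
import numpy as np

# Squared Mahalanobis distances of simulated summaries w.r.t. their
# sample mean and covariance; under normality these are ~ chi^2(d_s).
rng = np.random.default_rng(5)

def mahalanobis_sq(S):
    mu = S.mean(axis=0)
    inv = np.linalg.inv(np.cov(S, rowvar=False))
    D = S - mu
    return np.einsum("ij,jk,ik->i", D, inv, D)

S = rng.normal(size=(1000, 20))      # stand-in for 1,000 simulated summaries
d2 = mahalanobis_sq(S)
print(d2.mean(), d2.var())           # near 20 and near 2 * 20 = 40
```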

Supernova model: general MCMC settings and chains from ACSL
For sBSL truth and ACSL we always have a burnin of K = 200 iterations, where parameters are proposed using Gaussian random walks with a constant diagonal covariance matrix having standard deviations [0.01, 0.01] for log Ωm and w0, respectively. For ACSL the burnin is followed by 300 iterations where parameters are proposed using our guided method, and finally by a further 10,700 iterations using "Haario". The total run therefore comprises 11,200 iterations. For sBSL truth the same burnin procedure with constant covariance is used, followed by "Haario", for a total of 11,200 iterations. For both sBSL truth and ACSL the covariance matrix of the summaries used a shrinkage estimator: we considered γ = 0.95 for the shrinkage parameter, which implies a small regularization to $\hat\Sigma_{M,\theta}$, see Nott et al. [2019]. Figure 10 shows the evolution of the guided and adaptive correlated SL (ACSL) for G = 5, 10: there, only the first 2,000 (out of 11,200) iterations are shown for ease of display. The burnin iterations 1-200 use CSL with a Gaussian random walk proposal with constant covariance matrix, while iterations 201-500 use the guided ACSL proposal and the remaining iterations use adaptive MCMC via Haario et al. [2001]. Having G > 1 helps the chains move during the burnin period, so that when ACSL starts (iteration 201) it is provided with useful information from the burnin. Indeed, it is possible to notice a "jump" for Ωm, which happens precisely at iteration 201, that is when ACSL kicks in. For the interested reader: the overall time required by ELFI/BOLFI was 54 minutes, whereas BSL took 168 minutes and ACSL (G = 10) took 172 minutes. Notice this time comparison should be taken with a grain of salt, since the latter two methods have an algorithmic structure that is quite different from BOLFI; moreover, BOLFI is implemented in Python while ACSL/BSL are implemented in Matlab.

Recruitment, boom and bust model
As in Fasiolo et al. [2018], prior distributions are set to r ∼ U(0, 1), κ ∼ U(10, 80), α ∼ U(0, 1), β ∼ U(0, 1). To generate a data set, as in the cited references, we simulate values for the {N_t} process for 300 steps, then we discard the first 50 values to remove the transient phase of the process. Therefore, data are the remaining 250 values. We use essentially the same summary statistics as in the cited work, namely, for a dataset $y$, we define differences and ratios as $\mathrm{diff}_y = \{y_i - y_{i-1};\ i = 2, \ldots, 250\}$ and $\mathrm{ratio}_y = \{(y_i + 1)/(y_{i-1} + 1);\ i = 2, \ldots, 250\}$, respectively. We use the sample mean, variance, skewness and kurtosis of $y$, $\mathrm{diff}_y$ and $\mathrm{ratio}_y$ as our summary statistic, that is a total of twelve summaries. The only difference with the original summaries is that there $\mathrm{ratio}_y = \{y_i/y_{i-1};\ i = 2, \ldots, 250\}$; however, it is not rare for {N_t} to contain zeroes, and that formulation of $\mathrm{ratio}_y$ will cause numerical infelicities. As mentioned in the main paper, the boom and bust example is particularly challenging for the BSL approach due to the strong nonlinear dependence structure between the summary statistics. As an illustration, Figure 12 shows the bivariate scatterplots of 1,000 summary statistics simulated with data-generating parameters r = 0.4, κ = 50, α = 0.09 and β = 0.05. We initialize parameters at r = 1, κ = 75, α = 0.02, β = 0.07 (this was denoted "set 2" in the main paper). We run semiBSL for 2,000 iterations and use M = 200, see the main text for full details. semiBSL fails to mix properly and ultimately does not converge, see Figure 11, even though the burnin uses Markov-chain-within-Metropolis (MCWM) to ease mixing. As documented in previous literature, synthetic likelihood approaches can be fragile to bad initializations, but ASL does manage to rapidly converge and mix well, as displayed in the main text.
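The twelve summaries can be sketched as follows. This is an illustrative Python sketch: the data below are a Poisson stand-in for the {N_t} process, and plain (non-excess) kurtosis is used since the exact convention in the paper's code is not stated.

```python
import numpy as np

# Twelve boom-and-bust summaries: mean, variance, skewness and kurtosis of
# the data y, of its first differences, and of the shifted ratios
# (y_i + 1)/(y_{i-1} + 1), which stay finite when y contains zeroes.
def four_moments(v):
    m, s = v.mean(), v.std()
    z = (v - m) / s
    return np.array([m, v.var(), (z ** 3).mean(), (z ** 4).mean()])

def boom_bust_summaries(y):
    diff_y = np.diff(y)                         # y_i - y_{i-1}
    ratio_y = (y[1:] + 1.0) / (y[:-1] + 1.0)    # safe when y contains zeroes
    return np.concatenate([four_moments(y),
                           four_moments(diff_y),
                           four_moments(ratio_y)])

rng = np.random.default_rng(6)
y = rng.poisson(20, size=250).astype(float)     # stand-in for the {N_t} data
print(boom_bust_summaries(y))                   # twelve summaries
```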
α-stable and "perturbed" α-stable distributions

In this section we wish to challenge CSL by potentially disrupting the positive effect induced by correlating the synthetic loglikelihoods. To pursue this goal, we first introduce a sampling algorithm for the α-stable distribution, and then perturb it, as explained later.

Sampling from an α-stable distribution
Denote with $\mathrm{Exp}(\lambda)$ the exponential distribution with parameter $\lambda$; then steps 1-4 below illustrate the simulation of a single draw $y$ from an α-stable distribution with parameters $(\alpha, \beta, \gamma, \delta)$ [Chambers et al., 1976, Weron, 1996]. Notice, while we are not experts on the theoretical derivation of the algorithm below, it is evident that for the case $\alpha = 1$ many sources report the implementation of the formula from the highly cited paper Chambers et al. [1976] which, however, was found to contain a typo according to Weron [1996] (for $\alpha = 1$, Chambers et al., 1976 has the term $(\pi/2) w \cos(u_2)/(\pi/2 + \beta u_2)$ under the logarithm instead of $w \cos(u_2)/(\pi/2 + \beta u_2)$). We follow Weron [1996].
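The sampler can be sketched as follows. This is an illustrative Python sketch of the Chambers-Mallows-Stuck method in the form discussed above, with $u_2 \sim U(-\pi/2, \pi/2)$ and $w \sim \mathrm{Exp}(1)$; the $\alpha = 1$ branch uses the corrected logarithm term as quoted in the text, and the scale/location step (including the extra shift for $\alpha = 1$) follows the standard convention, which is our assumption here.

```python
import numpy as np

# Chambers-Mallows-Stuck sampler for alpha-stable variates.
rng = np.random.default_rng(7)

def rstable(alpha, beta, gamma=1.0, delta=0.0, size=1):
    u2 = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    if alpha != 1.0:
        B = np.arctan(beta * np.tan(np.pi * alpha / 2)) / alpha
        S = (1 + beta ** 2 * np.tan(np.pi * alpha / 2) ** 2) ** (1 / (2 * alpha))
        x = (S * np.sin(alpha * (u2 + B)) / np.cos(u2) ** (1 / alpha)
             * (np.cos(u2 - alpha * (u2 + B)) / w) ** ((1 - alpha) / alpha))
        return gamma * x + delta
    # alpha = 1 branch, with the corrected term under the logarithm
    x = (2 / np.pi) * ((np.pi / 2 + beta * u2) * np.tan(u2)
                       - beta * np.log(w * np.cos(u2) / (np.pi / 2 + beta * u2)))
    return gamma * x + (2 / np.pi) * beta * gamma * np.log(gamma) + delta

# alpha = 2 recovers a Gaussian N(delta, 2*gamma^2); alpha = 1, beta = 0 is Cauchy
print(rstable(2.0, 0.0, size=5))
```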
A perturbed α-stable model

α-stable distributions are utilized as models for heavy-tailed noise in many areas of statistics, finance and signal-processing engineering. The univariate α-stable distribution is typically specified by four parameters: α ∈ (0, 2] determining the rate of tail decay; β ∈ [−1, 1] determining the degree and sign of asymmetry (skewness); γ > 0 the scale (under some parameterizations); and the location δ. In general, α-stable models admit no closed-form expression for the density, except for the Gaussian (α = 2, β = 0) and Cauchy (α = 1, β = 0) families; however, sampling is straightforward. Properties of α-stable distributions are widely available, and the interested reader can refer for example to Peters et al. [2012] (also discussing the multivariate case). We only study the benefits induced by using correlated synthetic likelihoods (CSL) for a "perturbed" version of the univariate α-stable distribution. The perturbed model is defined so that, even though in CSL the synthetic likelihoods at two consecutive iterations use correlated pseudorandom draws, the generative models used in the two iterations are made structurally different. This is supposed to challenge the beneficial effect of using correlated likelihoods, since the correlated draws enter different generative models. As we show, also in this case CSL provides considerably increased mixing. The procedure for sampling a single draw from a (non-perturbed) α-stable distribution was given in the previous section and shows a discontinuity at α = 1.
We use this opportunity to challenge the (potential) benefits brought in by correlated synthetic likelihoods (CSL): we define a "perturbed" version of the (standard) α-stable sampler by applying the equations for $\alpha = 1$ to the case where $\alpha$ is "very close to 1", namely $\alpha \in [0.97, 1.03]$, thereby writing (everything else staying the same)
$$
\tilde y = \begin{cases}
S_{\alpha,\beta} \dfrac{\sin(\alpha(u_2 + B_{\alpha,\beta}))}{(\cos u_2)^{1/\alpha}} \Big( \dfrac{\cos(u_2 - \alpha(u_2 + B_{\alpha,\beta}))}{w} \Big)^{(1-\alpha)/\alpha}, & \alpha \notin [0.97, 1.03], \quad (19)\\[2ex]
\dfrac{2}{\pi} \Big( \big(\tfrac{\pi}{2} + \beta u_2\big) \tan u_2 - \beta \log \dfrac{w \cos u_2}{\pi/2 + \beta u_2} \Big), & \alpha \in [0.97, 1.03], \quad (20)
\end{cases}
$$
and then we say that the $y$ generated via (19)-(20) is a draw from a perturbed α-stable distribution. This way, depending on the values of α proposed during two consecutive iterations of CSL, the corresponding simulated data may not necessarily use the same equation of the given simulator (e.g. the pair (17) and (19) as opposed to the pair (18) and (20)). As an example, say that at iteration $t$ we accepted an α with a value smaller than 0.97 or larger than 1.03, and hence accepted a synthetic likelihood that was constructed on simulations from (19); however, at iteration $t + 1$ we may propose (and maybe accept) a synthetic likelihood produced from data simulated using some α ∈ [0.97, 1.03] and therefore using (20). With CSL the two synthetic likelihoods are based on correlated pseudorandom draws, but the generative models used in the two iterations of the example above would be different, thus challenging the beneficial effect of using correlated likelihoods, also because during the inference runs the proposed values of α frequently move inside and outside the region [0.97, 1.03]. Notice this challenge would not be possible with the non-perturbed α-stable model: in that case, due to floating-point representation, a proposed value of α is exactly equal to 1 with probability zero, and hence at each iteration we would always generate from the equation for α ≠ 1 in the previous section, which would not challenge CSL.
The next section describes how BSL and CSL are implemented for the perturbed α-stable model, as well as inference results.

Inference via BSL and CSL for perturbed α-stable distributions
Notice inference for α-stable distributions using synthetic likelihoods has been considered in Ong et al. [2018]. Inference is conducted on transformed parameters $(\tilde\alpha, \tilde\beta, \tilde\gamma, \tilde\delta)$ defined as
$$
\tilde\alpha = \log\frac{\alpha - 0.5}{2 - \alpha}, \qquad \tilde\beta = \log\frac{\beta + 1}{1 - \beta}, \qquad \tilde\gamma = \log \gamma, \qquad \tilde\delta = \delta,
$$
so that, when transformed back on the original scale (inside the model simulator), we are ensured that $\alpha \in (0.5, 2]$, $\beta \in [-1, 1]$ and $\gamma > 0$. We set priors on $(\tilde\alpha, \tilde\beta, \tilde\gamma, \tilde\delta)$ to be independent standard Gaussians for each parameter. The MCMC is therefore advanced on transformed parameters; however, our inference results are shown for the original ones. As for the summary statistics, same as in Peters et al. [2012] and Ong et al. [2018], we consider a point estimator of $(\alpha, \beta, \gamma)$ due to McCulloch [1986], defined as $(S_\alpha, S_\beta, S_\gamma)$ where
$$
S_\alpha = \frac{q_{95}(\cdot) - q_5(\cdot)}{q_{75}(\cdot) - q_{25}(\cdot)}, \qquad S_\beta = \frac{q_{95}(\cdot) + q_5(\cdot) - 2 q_{50}(\cdot)}{q_{95}(\cdot) - q_5(\cdot)}, \qquad S_\gamma = \frac{q_{75}(\cdot) - q_{25}(\cdot)}{\gamma_{\mathrm{true}}},
$$
with the addition of $S_\delta = \bar y$, the sample mean of the data. Here $q_i(\cdot)$ is the $i$-th empirical percentile of the submitted dataset. Notice $S_\gamma$ more generally depends on the parameter $\gamma$, which we decided to fix (but only for the purpose of computing the summary statistic $S_\gamma$) to its true value $\gamma_{\mathrm{true}} = 1$ (see below). Notice we do infer $\gamma$ in our procedure, except for plugging its true value into $S_\gamma$ for ease of computation (however, when dealing with real data, where ground truth is of course unavailable, Peters et al., 2012 suggest using the tabled estimate of $\gamma$ provided in McCulloch, 1986 in place of $\gamma_{\mathrm{true}}$). When implementing correlated synthetic likelihoods (CSL) with $M$ model simulations, we initialise a matrix of size $n \times 2M$ (where $n$ is the length of the data) whose columns come in pairs of pseudo-random uniforms $(u_1, u_2)$.
Then at each iteration we induce correlation in the synthetic likelihoods by randomly picking a block of several consecutive columns of said matrix (how many columns depends on the value of G) and sampling new pseudo-random uniforms for that block, while keeping the uniform draws in the remaining columns fixed.
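The blockwise refresh just described can be sketched as follows. This is an illustrative Python sketch; the mapping from G to the block width is our own illustrative choice (G equal-width blocks, one resampled per iteration).

```python
import numpy as np

# Keep an n x 2M matrix of uniforms and resample only one randomly chosen
# block of consecutive columns per iteration, so that synthetic likelihoods
# at consecutive iterations share most of their pseudo-random draws.
rng = np.random.default_rng(8)

def refresh_block(U, G):
    """Resample one of G column-blocks of U in place; return the block index."""
    n_cols = U.shape[1]
    width = n_cols // G                      # columns per block
    b = rng.integers(G)                      # randomly chosen block
    U[:, b * width:(b + 1) * width] = rng.random((U.shape[0], width))
    return b

n, M, G = 500, 100, 50
U = rng.random((n, 2 * M))
U_old = U.copy()
refresh_block(U, G)
frac_changed = np.mean(U != U_old)
print(frac_changed)   # 1/G = 0.02 of the entries were refreshed
```

Larger G means a narrower block is refreshed each iteration, hence stronger correlation between successive synthetic likelihoods.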
From the perturbed α-stable model we simulate an "observed" dataset consisting of 500 independent observations using α = 1.01, β = 0.5, γ = 1, δ = 0. During the inference we exclusively generate from the perturbed model, and the chain for α often traverses the two regimes (i.e. α ∈ [0.97, 1.03] and α ∉ [0.97, 1.03]). We set starting (transformed) parameter values to $(\tilde\alpha, \tilde\beta, \tilde\gamma, \tilde\delta) = (0.8, 0.95, 2, 1)$ and for both BSL and CSL we run 5,200 iterations on the perturbed model, where the first 200 are burnin iterations. When BSL is used, MCWM is employed during the burnin (and never afterwards), followed by the "Haario" adaptive proposal. With CSL, we never use MCWM; during burnin we use a fixed covariance matrix (and of course we correlate the likelihoods), and use "Haario" afterwards. Therefore here we do not use our guided ASL, to focus instead on the benefits of using CSL, which are evident as discussed below.
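The quantile-based summaries $S_\alpha$, $S_\beta$, $S_\gamma$, $S_\delta$ defined earlier can be sketched as follows. This is an illustrative Python sketch with $\gamma_{\mathrm{true}} = 1$ as in the text; the Cauchy data below are only a stand-in for a dataset submitted to the summary function.

```python
import numpy as np

# McCulloch (1986) quantile summaries, plus the sample mean for delta.
def mcculloch_summaries(y, gamma_true=1.0):
    q5, q25, q50, q75, q95 = np.percentile(y, [5, 25, 50, 75, 95])
    s_alpha = (q95 - q5) / (q75 - q25)
    s_beta = (q95 + q5 - 2 * q50) / (q95 - q5)
    s_gamma = (q75 - q25) / gamma_true
    s_delta = y.mean()
    return np.array([s_alpha, s_beta, s_gamma, s_delta])

rng = np.random.default_rng(9)
y = rng.standard_cauchy(500)      # alpha = 1, beta = 0, gamma = 1, delta = 0
print(mcculloch_summaries(y))
```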
Figures 13-14 report two independent inference attempts, conditional on the same data, when using BSL and CSL (G = 50). In all cases the "Haario" method is used to produce proposals. Figures 15-16 consider two further attempts for CSL when using G = 200, always on the same data. The benefits of using an increased value for G (and hence inducing larger correlations) are evident.

Figure 16: α-stable model: (second attempt) trace-plots when using CSL with G = 200.