Finding our Way in the Dark: Approximate MCMC for Approximate Bayesian Methods

With larger data at their disposal, scientists are emboldened to tackle complex questions that require sophisticated statistical models. It is not unusual for the latter to have likelihood functions that elude analytical formulations. Even under such adversity, when one can simulate from the sampling distribution, Bayesian analysis can be conducted using approximate methods such as Approximate Bayesian Computation (ABC) or Bayesian Synthetic Likelihood (BSL). A significant drawback of these methods is that the number of required simulations can be prohibitively large, thus severely limiting their scope. In this paper we design perturbed MCMC samplers that can be used within the ABC and BSL paradigms to significantly accelerate computation while maintaining control on computational efficiency. The proposed strategy relies on recycling samples from the chain's past. The algorithmic design is supported by a theoretical analysis while practical performance is examined via a series of simulation examples and data analyses.


Introduction
Since the early 1990s Bayesian statisticians have been able to operate largely due to the rapid development of Markov chain Monte Carlo (MCMC) sampling methods (see, for example, Craiu and Rosenthal, 2014, for a recent review). Given observed data y 0 ∈ X n with sampling density f (y 0 |θ) indexed by parameter θ ∈ R q , Bayesian inference for functions of θ relies on the characteristics of the posterior distribution π(θ|y 0 ) ∝ p(θ)f (y 0 |θ), (1.1) where p(θ) denotes the prior distribution. When the posterior (1.1) cannot be studied analytically, we rely on MCMC algorithms to generate samples from π. While traditional MCMC samplers such as Metropolis-Hastings or Hamiltonian Monte Carlo (see Brooks et al., 2011, and references therein) can sample distributions with unknown normalizing constants, they rely on the closed form of the unnormalized posterior, p(θ)f (y 0 |θ).
The framework we just described has been altered in multiple ways by the advent of large data. First, larger data tend to yield likelihood functions that are much more expensive to compute, thus exposing the liability inherent in the iterative nature of MCMC samplers. In response to this challenge, new computational methods based on divide-and-conquer (Scott et al., 2016; Wang and Dunson, 2013; Entezari et al., 2018), subsampling (Bardenet et al., 2014; Quiroz et al., 2015), pre-computation (Boland et al., 2018), or sequential (Balakrishnan et al., 2006; Maclaurin and Adams, 2015) strategies have emerged. Second, it is understood that larger data should yield answers to more complex problems. This implies the use of increasingly complex models, often to the point where the sampling distribution is no longer available in closed form.
In the absence of a tractable likelihood function, statisticians have developed approximate methods to perform Bayesian inference when, for any parameter value θ ∈ R q , data y ∼ f (y|θ) can be sampled from the model. Here we consider two alternative approaches that have been proposed and have gained considerable momentum in recent years: Approximate Bayesian Computation (ABC) (Marin et al., 2012; Baragatti and Pudlo, 2014; Sisson et al., 2018a; Drovandi, 2018) and Bayesian Synthetic Likelihood (BSL) (Wood, 2010; Drovandi et al., 2018a; Price et al., 2018). Both approaches are effective when combined with Markov chain Monte Carlo sampling schemes to produce samples from an approximation of π, and both share the need to generate many pseudo-data sets y ∼ f (y|θ). This poses serious challenges when the data are large and generating a pseudo-data set is computationally expensive. In this paper we tackle the reduction of the computational burden by recycling draws from the chain's history. This is achieved by reusing the quantities central to running the MCMC-ABC and MCMC-BSL chains, i.e. the parameter values and the corresponding pseudo-data discrepancies (for ABC) and summary statistics (for BSL). The information in past variates is incorporated using a k-Nearest-Neighbour (kNN) approach to estimate the transition kernel. The idea of using past draws and simulations to accelerate likelihood-free methods has been addressed previously; references to the relevant literature are presented in Sections 2 and 4. The alternative approaches use Gaussian Processes (GP) to borrow information from past realizations of the chain that are close to the proposed next state. However, unlike kNN, the GP model does not guarantee that the estimated likelihood approaches the true unknown likelihood.
Moreover, the kNN approach is easier to implement and computationally faster than the GP since the latter requires inversion of large-dimensional matrices and crucially depends on the choice of the covariance function. We demonstrate that we can control the approximating error introduced when perturbing the original kernel using some of the error analysis for perturbed Markov chains developed recently by Mitrophanov (2005), Johndrow et al. (2015b) and Johndrow and Mattingly (2017).
The paper is structured as follows. Section 2 briefly reviews the ABC method and Section 3 introduces the proposed MCMC algorithms for ABC. Section 4 reviews BSL sampling and extends the proposed methods to this class of approximations. The practical impact of these algorithms is evaluated via simulations in Section 5 and data analyses in the Supplementary Material (Levi and Craiu, 2020). The theoretical analysis showing control of perturbation errors in total variation norm is in Section 6. The paper closes with ideas for future work and conclusions.

Approximate Bayesian Computation
In order to illustrate the ABC sampler, let us consider the following Hidden Markov Model (HMM): X i |X i−1 = x i−1 ∼ g θ (·|x i−1 ) (transition), Y i |X i = x i ∼ f θ (·|x i ) (emission), i = 1, . . . , n. (2.1) Unless Gaussian distributions are used to specify the transition and emission laws given in (2.1), the marginal distribution P (y 1 , · · · , y n |θ) cannot be calculated in closed form. It is possible to treat the hidden random variables X i as auxiliary and sample them using Particle MCMC (PMCMC) (Andrieu et al., 2010) or ensemble MCMC (Shestopaloff and Neal, 2013). However, computations become increasingly difficult as n increases. For some financial time series models, such as Stochastic Volatility for log returns, the α-Stable distribution may be useful to model transition and/or emission probabilities (Nolan, 2003). However, stable distributions do not have closed form densities, thus rendering the particle and ensemble MCMC impossible to use. Other widely used examples where the likelihood functions cannot be expressed analytically include various network models (e.g., Kolaczyk and Csárdi, 2014) and Markov random fields (Rue and Held, 2005). For such models with intractable or computationally expensive likelihood evaluations, simulation-based algorithms such as ABC are frequently used for inference. In its simplest form, ABC is an accept/reject sampler. Given a user-defined summary statistic S(y) ∈ R p , the Accept/Reject ABC is described in Algorithm 1.
Algorithm 1 Accept/Reject ABC. We emphasize that a closed form expression for the likelihood is not needed, only the ability to generate from f (y|θ) for any θ. If S(y) is a sufficient statistic and Pr(S(y) = s 0 ) > 0, then the algorithm yields samples from the true posterior π(θ|y 0 ), where we define s 0 = S(y 0 ). Alas, the level of complexity of models where ABC is needed makes it unlikely for these two conditions to hold. In order to implement ABC under more realistic assumptions, a (small) constant ε > 0 is chosen and ζ * is accepted whenever d(S(y), s 0 ) < ε, where d(S(y), s 0 ) is a user-defined distance function. The introduction of ε > 0 and the use of non-sufficient statistics remove layers of exactness from the target distribution. The approximating distribution is π ε (θ|s 0 ) and we have lim ε↓0 π ε (θ|s 0 ) = π(θ|s 0 ). (2.2) In light of (2.2) one would like to have S(y) = y, but if the sample size of y 0 is large, then the curse of dimensionality leads to Pr(d(y, y 0 ) < ε) ≈ 0. Consequently, obtaining even a moderate number of samples using ABC becomes an unattainable goal. In almost all cases of interest, S is not a sufficient statistic, implying that some information about θ is lost. Not surprisingly, much attention has been given to finding appropriate low-dimensional summary statistics for inference (see, for example, Robert et al., 2011; Fearnhead and Prangle, 2012; Marin et al., 2014; Prangle, 2015). In this paper we assume that the summary statistic S(y) is given.
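As a concrete illustration, the accept/reject scheme of Algorithm 1 can be sketched as below. The model is a toy stand-in chosen for this sketch (not from the paper): y_i ~ Normal(θ, 1) with a Uniform(−5, 5) prior, summary statistic S(y) = mean(y), and d(S(y), s_0) = |S(y) − s_0|.

```python
import random
import statistics

# Toy setup: observed data from Normal(theta_true, 1); the summary is the
# sample mean. All modelling choices here are illustrative assumptions.
random.seed(1)
n = 50
theta_true = 1.0
y0 = [random.gauss(theta_true, 1.0) for _ in range(n)]
s0 = statistics.mean(y0)

def abc_accept_reject(num_samples, eps):
    """Draw theta from the prior, simulate pseudo-data y ~ f(y|theta),
    and keep theta whenever d(S(y), s0) < eps (Algorithm 1)."""
    kept = []
    while len(kept) < num_samples:
        theta = random.uniform(-5.0, 5.0)                 # theta ~ p(theta)
        y = [random.gauss(theta, 1.0) for _ in range(n)]  # y ~ f(y|theta)
        if abs(statistics.mean(y) - s0) < eps:            # d(S(y), s0) < eps
            kept.append(theta)
    return kept

draws = abc_accept_reject(200, eps=0.2)
```

The accepted draws concentrate around the region of the prior compatible with s_0; shrinking eps tightens the approximation at the price of far more rejections, which is exactly the inefficiency discussed next.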
In the absence of information about the model parameters, the prior and posterior distributions may be misaligned, having non-overlapping regions of mass concentration. Hence, parameter values drawn from the prior will rarely be retained, making the algorithm very inefficient. Algorithm 2 presents the ABC-MCMC algorithm of Marjoram et al. (2003) which avoids sampling from the prior and instead relies on building a chain with a Metropolis-Hastings (MH) transition kernel on the joint state space of (θ, y), with target π ε (θ, y|y 0 ) ∝ p(θ)f (y|θ)1 {δ(y 0 ,y)<ε} , where δ(y 0 , y) = d(S(y), s 0 ). Note that the goal is the marginal distribution for θ, which is: π ε (θ|y 0 ) = ∫ π ε (θ, y|y 0 ) dy ∝ ∫ p(θ)f (y|θ)1 {δ(y 0 ,y)<ε} dy = p(θ)Pr(δ(y 0 , y) < ε|θ).

Algorithm 2 ABC-MCMC (Marjoram et al., 2003).
There are alternatives to Algorithm 2. For instance, Lee et al. (2012) approximate P (δ(y 0 , y) < ε|θ) via one of its unbiased estimators, J −1 Σ J j=1 1 {δ(y 0 ,y j )<ε} , where J ≥ 1 and each y j is simulated from f (y|θ). The use of unbiased estimators of P (δ(y 0 , y) < ε|θ) when computing the MH acceptance ratio can be validated using the theory of pseudo-marginal MCMC samplers (see the seminal paper of Andrieu and Roberts, 2009). Clearly, when the probability P (δ < ε|θ) is very small, this method would require simulating a large number of δ's (or, equivalently, y's) in order to move to a new state. Other MCMC designs suitable for ABC can be found in Bornn et al. (2014).
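The unbiased "hit frequency" estimator above can be sketched as follows, on a toy stand-in model (not from the paper) in which the summary is the mean of n Normal(θ, 1) observations, so each simulated summary can be drawn directly from Normal(θ, 1/√n):

```python
import math
import random

# J^{-1} sum_j 1{delta_j < eps}: an unbiased Monte Carlo estimate of
# P(delta(y0, y) < eps | theta). Model choices are illustrative assumptions.
random.seed(2)
n, s0, eps, J = 100, 0.0, 0.3, 500

def hit_frequency(theta):
    """Average of J indicator variables, each from a fresh pseudo-data set."""
    hits = sum(
        1 for _ in range(J)
        if abs(random.gauss(theta, 1.0 / math.sqrt(n)) - s0) < eps
    )
    return hits / J
```

For θ near the truth the estimate is close to 1, while for a distant θ almost no simulation lands within eps, which is precisely why many δ's may be needed before the chain moves.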
Sequential Monte Carlo (SMC) samplers have also been successfully used for ABC (henceforth denoted ABC-SMC) (Sisson et al., 2007; Lee, 2012; Filippi et al., 2013). ABC-SMC requires a specified decreasing sequence of tolerances, ε 0 > · · · > ε J . The method of Lee (2012) uses the Particle MCMC design (Andrieu et al., 2010) in which samples are updated as the target distribution evolves with ε. More specifically, it starts by sampling θ from π ε 0 (θ|y 0 ) using Accept/Reject ABC. Subsequently, at time t + 1 all samples are sequentially updated so their distribution is π ε t+1 (θ|y 0 ) (see Lee, 2012, for a complete description). The advantage of this method is not only that it starts from a large ε, but also that it generates independent draws. A comprehensive coverage of computational techniques for ABC can be found in Sisson et al. (2018b) and references therein. We also note a general lack of guidelines concerning the selection of ε, which is unfortunate as the performance of ABC sampling depends heavily on its value. To make a fair comparison between different methods, we revise the ABC-MCMC algorithm by introducing a decreasing sequence ε 0 > · · · > ε J (J is the number of "steps"), similar to ABC-SMC, and by "learning" the transition kernel during the burn-in, as in Algorithm 3.

Algorithm 3 ABC-MCMC with ε and kernel adaptation during burn-in.

Since the choice of proposal distribution q(·|θ) can considerably influence the performance of the ABC-MCMC sampler, we consider finite adaptation during the burn-in period of length B. In addition, during the burn-in the tolerance ε also varies, starting with a higher value (which makes it easier to find the initial θ (0) value) and gradually decreasing according to a pre-determined scheme. In our implementations we use independent MH sampling or random walk Metropolis (RWM). In the former case, the proposal is Gaussian N (·|μ̂, cΣ̂) with c = 3. The RWM proposal is N (·|θ (t−1) , cΣ̂) with c = 2.38 2 /q (following Roberts et al., 1997; Roberts and Rosenthal, 2001). All the algorithms discussed so far rely on numerous generations of pseudo-data. Researchers have recognized that the latter can be computationally costly to produce, so proposals for reducing the simulation cost have been made by Wilkinson (2014), Drovandi et al. (2018b), Järvenpää et al. (2018) and Sherlock et al. (2017), among others. Järvenpää et al. (2018) utilized the simulated pairs (ζ, δ) to estimate the conditional distribution of δ|ζ using a Gaussian Process (GP) approach which allows, for a new proposal ζ * , a fast calculation of P (δ < ε|ζ * ). Instead of estimating the conditional distribution, Wilkinson (2014) uses a GP approach to link ζ and log P (δ < ε), hence to approximate the log-likelihood for any ζ * . Similar ideas can also be used for Bayesian Synthetic Likelihood sampling methods. For instance, in Drovandi et al. (2018b) a GP model is used to estimate the true likelihood or its unbiased estimate. The implementation of GP may not be universally appropriate as it assumes that the distribution of log P (δ < ε) is normal and has constant variance for all ζ, which is clearly not true when the number of pseudo-data simulations (for each ζ) is small.
Also, the final likelihood estimates (from the GP) are generally no longer unbiased, and thus theoretical justification for the proposed methods is necessary. The idea proposed in this paper is related to the work of Sherlock et al. (2017), where the authors reduce the computational burden by using delayed-acceptance MCMC. Delayed-acceptance methods involve two acceptance ratios at each MCMC iteration. In the first, the likelihood is estimated using an estimator with a small computational cost (the authors use kNN) and, conditional on the proposal being accepted in stage one, the second acceptance ratio is calculated with the computationally expensive likelihood. The advantage of this method is that there is no need to compute the expensive likelihood if the proposal is not accepted in the first stage. One obvious limitation is that the expensive log-likelihood must still be calculated each time the chain moves.
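A schematic two-stage step in the spirit of Sherlock et al. (2017) is sketched below. A flat prior and a symmetric random-walk proposal are assumed, and both densities are illustrative placeholders (an exact target and a slightly miscalibrated surrogate), not the paper's estimators.

```python
import math
import random
import statistics

def delayed_acceptance_step(theta, propose, cheap, expensive, rng=random):
    """One delayed-acceptance MH step: screen with `cheap`, then correct
    with `expensive` only if stage one accepts."""
    zeta = propose(theta)
    # Stage 1: screen using only the cheap surrogate density.
    a1 = min(1.0, cheap(zeta) / max(cheap(theta), 1e-300))
    if rng.random() >= a1:
        return theta  # rejected cheaply: no expensive evaluation needed
    # Stage 2: the surrogate ratio is inverted here so the overall kernel
    # still targets `expensive` exactly.
    a2 = min(1.0, (expensive(zeta) * cheap(theta)) /
                  max(expensive(theta) * cheap(zeta), 1e-300))
    return zeta if rng.random() < a2 else theta

# Toy run targeting a standard normal with a mildly wrong surrogate.
random.seed(3)
expensive = lambda x: math.exp(-0.5 * x * x)
cheap = lambda x: math.exp(-0.5 * (x / 1.1) ** 2)
propose = lambda x: x + random.gauss(0.0, 1.0)
chain = [0.0]
for _ in range(5000):
    chain.append(delayed_acceptance_step(chain[-1], propose, cheap, expensive))
```

Note the design trade-off visible in the code: a poor surrogate does not bias the target (stage two corrects it) but wastes expensive evaluations, while the expensive density is still evaluated on every accepted stage-one proposal.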
To accelerate ABC-MCMC we consider a different approach and propose to store and utilize past simulations (with appropriate weights) in order to speed up the calculation while keeping the resulting approximation errors under control. The objective is to approximate P (δ < ε|ζ * ) for any ζ * at every MCMC iteration using past simulated (ζ, δ) proposals, making the whole procedure computationally faster. The changes proposed perturb the chain's transition kernel and we rely on the theory developed by Mitrophanov (2005) and Johndrow et al. (2015a) to assess the approximating error for the posterior. The k-Nearest-Neighbour (kNN) method is used to integrate past observations into the transition kernel. For a large enough chain history, we can control the error between the intended stationary distribution and that of the proposed accelerated MCMC, as shown in Section 6.

Approximated ABC-MCMC (AABC-MCMC)
In this section we describe an ABC-MCMC algorithm that utilizes past simulations to significantly improve computational efficiency. As noted previously, the ABC-MCMC with threshold ε targets the density π ε (θ, y|y 0 ) ∝ p(θ)f (y|θ)1 {δ(y 0 ,y)<ε} , (3.1) where δ(y 0 , y) = d(S(y), s 0 ), y ∼ f (y|θ) and θ ∈ Θ. Denote h(θ) := P (δ(y 0 , y) < ε|θ) and note that if h were known for every θ then we could run an MH-MCMC chain with target proportional to p(θ)h(θ). Alas, h is usually unknown and unbiased estimates can be computationally expensive or statistically inefficient. We build an alternative approach that hinges on consistent estimates of h. The latter use the chain's past history, are much cheaper to compute, and require a new theoretical treatment.
To fix ideas, suppose that at time t we generate the proposal (ζ t+1 , w t+1 ) ∼ q(ζ|θ (t) )f (w|ζ), and suppose that at iteration N all the proposals ζ n , regardless of whether they were accepted or rejected, along with the corresponding distances δ n = δ(w n , y 0 ), are available for 0 ≤ n ≤ N − 1. This past history is stored in the set Z N = {(ζ n , δ n ) : 0 ≤ n ≤ N − 1}, and h is estimated by the weighted average ĥ(ζ * ) = Σ N −1 n=0 W Nn (ζ * )1 {δ n <ε} . (3.2) We discuss a couple of choices for the weight function W (·) below. Remark 1. Note that the Markovian property of the chain is violated since the acceptance probability does not depend solely on the current state, but also on past ones. We defer the theoretical considerations for dealing with adaptation in the context of perturbed Markov chains to a future communication. Below, we modify slightly the construction above while respecting the core idea.
In order to separate the samples used as proposals from those used to estimate h in (3.2), we generate at each time t two independent samples, ζ t+1 ∼ q(ζ|θ (t) ) and (ζ̄ t+1 , w̄ t+1 ) ∼ q(ζ|θ (t) )f (w|ζ). The history Z then collects the (ζ̄, δ̄) samples, while the proposal used to update the chain is the ζ sample. With this notation, (3.2) becomes ĥ(ζ * ) = Σ N n=1 W Nn (ζ * )1 {δ̄ n <ε} . Remark 2. Even when δ * is greater than ε, if there is a close neighbour of ζ * whose corresponding δ̄ is less than ε, then the estimated h(ζ * ) is not zero and the chain may move to a different state.
The weighted estimator ĥ is expected to outperform the unbiased estimator h̃(ζ * ) = K −1 Σ K j=1 1 {δ j <ε} for both small and large K. For small K, the variability in the acceptance probabilities will be reduced, while for larger K the computational costs will be much smaller without sacrificing much in terms of precision. Since the proposed weighted estimate is no longer an unbiased estimator of h(θ), a new theoretical evaluation is needed to study the effect of perturbing the transition kernel on the statistical analysis.

Algorithm 4 Approximated ABC-MCMC (AABC-MCMC).

Central to the algorithm's utility is the ability to control the total variation distance between the desired distribution of interest given in (3.1) and the modified chain's target. As will be shown in Section 6, we rely on three assumptions to ensure that the chain approximately samples from (3.1): 1) compactness of Θ; 2) uniform ergodicity of the chain using the true h; and 3) uniform convergence in probability of ĥ(θ) to h(θ) as N → ∞.
The k-Nearest-Neighbour (kNN) regression approach (Fix and Hodges, 1951; Biau and Devroye, 2015) has the property of uniform consistency (Cheng, 1984). Define K = g(N ); in our numerical experiments we have used g(N ) = √N. Without loss of generality, we relabel the elements of Z N = {(ζ̄ n , δ̄ n )} N n=1 according to the distance ‖ζ̄ n − ζ * ‖, so that (ζ̄ 1 , δ̄ 1 ) and (ζ̄ N , δ̄ N ) correspond to the smallest and largest among all distances {‖ζ̄ j − ζ * ‖ : 1 ≤ j ≤ N }, respectively. The kNN method sets W Nn (ζ * ) to zero for all n > K. For n ≤ K, we focus on the following two weighting schemes: uniform (U) weights, W Nn (ζ * ) = 1/K, and linear (L) weights, proportional to 1 − n/K for n ≤ K, so that the weight decreases from 1 to 0 as n increases from 1 to K.
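The kNN estimate of h with the two weighting schemes can be sketched as follows. The parameter is taken to be scalar for simplicity, and the normalized linear weights (proportional to 1 − n/K) are one concrete choice consistent with the description above.

```python
import math
import random

def knn_h_hat(zeta_star, history, eps, scheme="U"):
    """kNN estimate of h(zeta*) = P(delta < eps | zeta*).
    history: list of (zeta, delta) pairs from past proposals."""
    N = len(history)
    K = max(1, int(math.sqrt(N)))                 # K = sqrt(N) neighbours
    nearest = sorted(history, key=lambda pair: abs(pair[0] - zeta_star))[:K]
    if scheme == "U":                             # uniform weights 1/K
        w = [1.0 / K] * K
    else:                                         # linear, decaying toward 0
        raw = [1.0 - n / K for n in range(K)]
        total = sum(raw)
        w = [r / total for r in raw]
    # Weighted fraction of nearby past discrepancies that fall below eps.
    return sum(wn for wn, (_, delta) in zip(w, nearest) if delta < eps)

# Toy history in which delta grows with |zeta|: h_hat is large near zero.
random.seed(4)
history = []
for _ in range(400):
    z = random.uniform(-4.0, 4.0)
    history.append((z, abs(z) + random.uniform(0.0, 0.3)))
```

In the sampler, these ĥ values replace the unknown h in the MH acceptance ratio; no new pseudo-data are simulated to evaluate a proposal.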
The kNN's theoretical properties that are used to validate our sampler rely on the independence between the pairs {ζ̄ n , δ̄ n } n≥1 . Therefore, throughout the paper we use an independent proposal in the MH sampler, i.e. q(·|θ (t) ) = q(·), with q Gaussian. The entire procedure is outlined in Algorithm 4. At the end of a simulation of size M, the MCMC samples are {θ (1) , . . . , θ (M ) } and the history used for updating the chain is {(ζ̄ 1 , δ̄ 1 ), . . . , (ζ̄ M , δ̄ M )}. Note that for any N > 0, the elements in Z N are independent of the chain's history up to time N. Therefore, the transition kernel of the chain depends only on the current state, so the chain is Markovian and non-adaptive. Note also that ĥ(θ (t) ) is required in order to determine the acceptance probability at step t + 1; its value may be updated if ‖θ (t) − ζ̄ * ‖ is small enough. We could use the pseudo-marginal MCMC method (Andrieu and Roberts, 2009; Andrieu and Vihola, 2015) to update only the numerator at each iteration, but this can result in slow mixing (see Drovandi et al., 2018b) since one large, "lucky" likelihood estimate can hinder the chain's prospects of moving to a new state. Updating both log-likelihoods at each iteration can significantly improve the mixing (see, for instance, Beaumont, 2003), but may change the target distribution significantly. We adopt an approach in which both estimates are updated: the computational cost of updating the denominator is very small in this case and the procedure is theoretically sound (see Section 6). The algorithm proposed here implements "naive" kNN, which can become computationally burdensome as the chain progresses and the volume of historical data increases. Generally, it requires O(1) operations to add a new value to the history set and O(M ) operations to compute the distances between the proposal and all past draws. Clearly this is inefficient for very large M.
To reduce the number of operations, one can implement a more efficient KD-tree approach (Bentley, 1975; Friedman et al., 1977), in which past samples are stored in a multidimensional binary tree. Using the adaptive variation of the KD-tree proposed by Sherlock et al. (2017), the required number of operations is O(q log(M )) to add a new point or to search for the nearest neighbours. This can considerably speed up the proposed algorithms, but we do not pursue it in this paper.
In the next section we extend the approximate MCMC construction to Bayesian Synthetic Likelihood. In Section 5 of the Supplementary Material (Levi and Craiu, 2020) we use simulations to show that the proposed procedure generally improves the mixing of a chain.

BSL and Approximated BSL (ABSL)
An alternative way to bypass the intractability of the sampling distribution is proposed by Wood (2010). His indirect inference approach assumes that the conditional distribution of a user-defined statistic S(y) given θ is Gaussian with mean μ θ and covariance matrix Σ θ . The Synthetic Likelihood (SL) procedure assigns to each θ the likelihood SL(θ) = N (s 0 ; μ θ , Σ θ ), where as before s 0 = S(y 0 ) and N (x; μ, Σ) denotes the density of a normal with mean μ and covariance Σ evaluated at x. SL can be used for maximum likelihood estimation as in Wood (2010) or within the Bayesian paradigm as proposed by Drovandi et al. (2018a) and Price et al. (2018). The latter work proposes to sample the approximate posterior generated by the Bayesian Synthetic Likelihood (BSL) approach, π(θ|s 0 ) ∝ p(θ)N (s 0 ; μ θ , Σ θ ), using an MH sampler. Direct calculation of the acceptance probability is not possible because the conditional mean and covariance are unknown for any θ. However, both can be estimated based on m statistics (s 1 , · · · , s m ) sampled from their conditional distribution given θ. More precisely, after simulating y i ∼ f (y|θ) and setting s i = S(y i ), i = 1, · · · , m, one can estimate μ̂ θ = m −1 Σ m i=1 s i and Σ̂ θ = (m − 1) −1 Σ m i=1 (s i − μ̂ θ )(s i − μ̂ θ ) T , so that the estimated synthetic likelihood is SL(θ|y 0 ) = N (s 0 ; μ̂ θ , Σ̂ θ ). The pseudo-code in Algorithm 5 shows the steps involved in the BSL-MCMC sampler.

Algorithm 5 Bayesian Synthetic Likelihood (BSL-MCMC).

Since each MH step requires calculating the ratio of two SLs evaluated at different parameter values, one can anticipate the heavy computational load involved in running the chain for thousands of iterations, especially if sampling data y is expensive. Note that even though the estimates of the conditional mean and covariance are unbiased, the resulting estimate of the Gaussian likelihood is biased, and therefore pseudo-marginal MCMC theory is not applicable. Price et al. (2018) presented an unbiased Gaussian likelihood estimator and showed empirically that using biased and unbiased estimates generally performs similarly. They also noted that the procedure is robust with respect to m: using any m ∈ {50, . . . , 200} produces similar results. The normality assumption for the summary statistics is certainly a strong one and may not hold in practice. An et al. (2018) replaced the joint Gaussian assumption with a Gaussian copula with non-parametric marginal estimates (NONPAR-BSL). The estimation is based, as in the BSL framework, on m pseudo-data samples simulated for each θ.
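One synthetic-likelihood evaluation of the kind used inside each BSL-MCMC step can be sketched as below. The summary is univariate here for brevity (the paper's setting is the p-dimensional analogue with a full covariance matrix), and the model is a toy stand-in in which S(y) ~ Normal(θ, 1/√n):

```python
import math
import random

random.seed(5)

def synthetic_loglike(theta, s0, m=50, n=100):
    """Simulate m summaries given theta, estimate their mean and variance,
    and evaluate the Gaussian log-density at s0 (toy, univariate case)."""
    sims = [random.gauss(theta, 1.0 / math.sqrt(n)) for _ in range(m)]
    mu_hat = sum(sims) / m
    var_hat = sum((s - mu_hat) ** 2 for s in sims) / (m - 1)
    # log N(s0; mu_hat, var_hat)
    return -0.5 * (math.log(2 * math.pi * var_hat)
                   + (s0 - mu_hat) ** 2 / var_hat)
```

Each evaluation costs m fresh pseudo-data simulations, which is exactly the expense the kNN-based approach of the next subsection is designed to avoid.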
Clearly, BSL is computationally costly and requires many pseudo-data simulations to obtain Monte Carlo samples of even moderate sizes. As in the case of ABC, attempts have been made to reduce its computational cost. In addition to the ideas of Wilkinson (2014) and Drovandi et al. (2018b), which can also be used for BSL, Meeds and Welling (2014) proposed to fit a GP model to pairs (ζ, s) and to use it to approximate μ ζ and Σ ζ instead of simulating new pseudo-data sets. Since the SL approach assumes normality of each summary statistic (conditional on ζ), the GP approach is justified. However, the estimated likelihood based on the GP is no longer unbiased and, therefore, there is no guarantee that the perturbed chain converges to the true posterior distribution. It is further assumed there that the covariance matrix Σ θ is diagonal. To alleviate this strong restriction, Everitt (2017) proposed to use a bootstrap approach for the estimation of the conditional covariance and a local linear regression model for the conditional mean of the summary statistics. He also uses an SMC sampler along with the SL formulation to arrive at the final approximate posterior distribution. In this paper, instead of relying on GP or local linear regression, we implement kNN to approximate the conditional mean and covariance for any proposal ζ * . Under weak assumptions the estimate is proved to be uniformly weakly consistent. In particular, we propose to store and utilize past simulations of (ζ, s) to approximate μ ζ * and Σ ζ * for any ζ * ∈ Θ, greatly reducing the computational burden. As in the previous section, we separate the samples used to update the chain from the samples used to enrich the history of the chain. The approach can be extended to NONPAR-BSL, but we will not pursue this development here. The k-Nearest-Neighbour (kNN) method is used as a non-parametric estimation tool for the different quantities described above.
As will be shown in Section 6 with the proposed method we can control the error between the intended stationary distribution and that of the proposed accelerated MCMC.

Approximated Bayesian Synthetic Likelihood (ABSL)
If we set s 0 = S(y 0 ) and assume conditional normality for s 0 , the objective is to sample from π(θ|s 0 ) ∝ p(θ)N (s 0 ; μ θ , Σ θ ). During the MCMC run, the proposal ζ * is generated from q(·) and the history is Z N = {(ζ̄ n , s̄ n )} N n=1 . Then, for any ζ, the conditional mean and covariance of interest are estimated using past samples as weighted averages: μ̂ ζ = Σ N n=1 W Nn (ζ)s̄ n and Σ̂ ζ = Σ N n=1 W Nn (ζ)(s̄ n − μ̂ ζ )(s̄ n − μ̂ ζ ) T . The weights are functions of the distance between the proposed value and parameter values from the past, i.e. W Nn (ζ) = W (‖ζ − ζ̄ n ‖), where ‖ · ‖ is the Euclidean norm. To obtain appropriate convergence properties we use the kNN approach to calculate the weights W Nn , where only the K = √N closest values to ζ are used in the calculation of conditional means and covariances. As in the previous section, uniform (U) and linear (L) weights are used. We expect that the use of the chain's accumulated history can significantly speed up the whole procedure, since it relieves the pressure to simulate many data sets y at every step. The use of the independent Metropolis kernel ensures that Z N contains independent draws, which is required for the theoretical validation in Section 6. We will also prove that, under mild assumptions and if Θ is compact, the proposed algorithm exhibits good error control properties. In order to get a rough idea about the proposal, we perform finite adaptation using J updates of the transition kernel during the burn-in period. Section A in the Supplementary Material (Levi and Craiu, 2020) details the proposed Approximated BSL (ABSL) method. For the simulations reported in the next section, we have used c = 1.5 and J = 15 to be consistent with the ABC-related procedures.
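The kNN-weighted mean and covariance estimates above can be sketched as follows, with uniform weights, a scalar parameter ζ, and a two-dimensional summary; these simplifications are illustrative assumptions, not the paper's setting.

```python
import math
import random

def knn_mean_cov(zeta_star, history):
    """Weighted (uniform) kNN estimates of the conditional mean and
    covariance of the summary at zeta_star.
    history: list of (zeta, s) with zeta a float and s a list of length p."""
    N = len(history)
    p = len(history[0][1])
    K = max(2, int(math.sqrt(N)))                 # K = sqrt(N) neighbours
    nearest = [s for _, s in
               sorted(history, key=lambda h: abs(h[0] - zeta_star))[:K]]
    mu = [sum(s[j] for s in nearest) / K for j in range(p)]
    cov = [[sum((s[i] - mu[i]) * (s[j] - mu[j]) for s in nearest) / K
            for j in range(p)] for i in range(p)]
    return mu, cov

# Toy history: summaries concentrate around (zeta, zeta^2).
random.seed(6)
history = []
for _ in range(900):
    z = random.uniform(0.0, 2.0)
    history.append((z, [z + random.gauss(0.0, 0.05),
                        z * z + random.gauss(0.0, 0.05)]))
mu, cov = knn_mean_cov(1.0, history)
```

The estimated pair (μ̂, Σ̂) then plugs into the Gaussian synthetic likelihood N(s_0; μ̂, Σ̂) without simulating any new pseudo-data for the proposal.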
For all these models, the simulation of pseudo-data for any parameter value is simple and computationally fast, but the use of standard estimation methods can be quite challenging, especially for (R), (SVG) and (SVS). Before running the proposed algorithms, we define the discrepancy function as δ = d(S(y), s 0 ) = (S(y) − s 0 ) T A(S(y) − s 0 ), where the matrix A, along with the decreasing sequence of thresholds ε 0 > ε 1 > . . . > ε 15 , is estimated from a pilot run (details are provided in the Supplementary Material (Levi and Craiu, 2020)). Moreover, we introduce a large number L = 10 10 to restrict the prior distributions for several models in order to guarantee the compactness assumption on the parameter space. In all the examples below this assumption is satisfied. We compare the following algorithms: (SMC) standard Sequential Monte Carlo for ABC; (ABC-RW) the modified ABC-MCMC algorithm which updates ε and the random walk Metropolis transition kernel during burn-in; (ABC-IS) the modified ABC-MCMC algorithm which updates ε and the Independent Metropolis transition kernel during burn-in; (BSL-RW) modified BSL which adapts the random walk Metropolis transition kernel during burn-in; (BSL-IS) modified BSL which adapts the Independent Metropolis transition kernel during burn-in; (AABC-U) Approximated ABC-MCMC with independent proposals and uniform (U) weights; (AABC-L) Approximated ABC-MCMC with independent proposals and linear (L) weights; (ABSL-U) Approximated BSL-MCMC with independent proposals and uniform (U) weights; (ABSL-L) Approximated BSL-MCMC with independent proposals and linear (L) weights; (Exact) likelihood is computable and posterior samples are generated using an example-specific MCMC algorithm.
For SMC, 500 particles were used; the total number of iterations for ABC-RW, ABC-IS, AABC-U, AABC-L, ABSL-U and ABSL-L is 50,000, with 10,000 for burn-in. Since BSL-RW and BSL-IS are much more computationally expensive, their total number of iterations was fixed at 10,000, with 2,000 for burn-in and 50 pseudo-data simulations for every proposed parameter value (i.e. m = 50). The Exact chain was run for 5,000 iterations with 2,000 for burn-in. It must be pointed out that all approximate samplers are based on the same summary statistics, the same discrepancy function and the same ε sequence, so that they all start with the same initial conditions.
For more reliable results we compare these sampling algorithms under data set replication. In this study we set the number of replicates to R = 100, so that for each model 100 data sets were generated and each one was analyzed with the sampling methods described above. Various statistics and measures of efficiency were calculated for every model and data set. Let θ (t) rs represent posterior samples from replicate r = 1, · · · , R, iteration t = 1, · · · , M and parameter component s = 1, · · · , q, and similarly let θ̃ (t) rs denote posterior samples from an exact chain (all draws are after the burn-in period). We let θ true s denote the true parameter that generated the data. Moreover, let D rs (x) and D̃ rs (x) be the estimated density functions at replicate r = 1, · · · , R and component s = 1, · · · , q for the approximate and exact chains, respectively. Then the following quantities are defined, where Mean t (a st ) is the average of {a st } over the index t and, in a similar manner, Var t (a st ) and Cov t (a st ) represent the variance and covariance, respectively. The first three measures are useful in determining how close the posterior draws from the different samplers are to the draws generated by the exact chain (when it is available). On the other hand, the last three are standard quantities that measure how close the posterior means are to the true parameters that generated the data. To study the efficiency of the proposed algorithms we need to take into account the CPU time needed to run a chain as well as the auto-correlation properties. Define the auto-correlation time (ACT) for every parameter component and replicate of samples θ (t) rs as ACT = 1 + 2 Σ ∞ a=1 ρ a , where ρ a is the auto-correlation coefficient at lag a. In practice we sum all the lags up to the first negative correlation. Letting M − B be the number of chain iterations (after burn-in) and CPU r the total CPU time to run the whole chain for replicate r, we define ESS = (M − B)/ACT and ESS/cpu as the number of effective samples per unit of CPU time. Note that these indicators are averaged over the parameter components and replicates.
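The ACT/ESS computations can be sketched as follows, with the sum of autocorrelations truncated at the first negative one, as described above:

```python
import random

def act(chain):
    """Autocorrelation time ACT = 1 + 2 * sum_a rho_a, truncated at the
    first negative estimated autocorrelation."""
    m = len(chain)
    mean = sum(chain) / m
    var = sum((x - mean) ** 2 for x in chain) / m
    tau = 1.0
    for lag in range(1, m):
        cov = sum((chain[t] - mean) * (chain[t + lag] - mean)
                  for t in range(m - lag)) / m
        rho = cov / var
        if rho < 0.0:        # truncate at the first negative correlation
            break
        tau += 2.0 * rho
    return tau

def ess_per_cpu(chain, cpu_seconds):
    """ESS = (M - B) / ACT, reported per CPU second."""
    return (len(chain) / act(chain)) / cpu_seconds

# Sanity check: an AR(1) chain with phi = 0.9 has true ACT
# (1 + 0.9) / (1 - 0.9) = 19, so the estimate should land nearby.
random.seed(7)
x, ar1 = 0.0, []
for _ in range(5000):
    x = 0.9 * x + random.gauss(0.0, 1.0)
    ar1.append(x)
```

A chain of nearly independent draws has ACT close to 1, so ESS is close to the number of retained iterations; strong positive autocorrelation inflates ACT and deflates ESS accordingly.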
ESS can intuitively be thought of as the approximate number of "independent" samples out of the M − B draws; the higher the ESS, the more efficient the sampling algorithm. When ESS is combined with CPU time (ESS/cpu), it provides a powerful indicator of an MCMC sampler's efficiency: the sampler with the highest ESS/cpu is generally preferred, as it produces the largest number of "independent" draws per unit time. In the case of the SMC sampler, the formulas above are generally not applicable and therefore ACT and ESS are not calculated for this sampler. Instead, efficiency is assessed by examining the variance of the Monte Carlo estimators of various characteristics of the posterior distribution.
Here, we fix one data set and generate R = 100 replicates of posterior samples from each algorithm. Let A_rs denote an estimate of a population characteristic (quantile, expectation, etc.) and CPU_r the total CPU time to run the sampler for replicate r = 1, ..., R and parameter component s = 1, ..., q; the measure of efficiency V_A × cpu combines the variance of A_rs across replicates with the CPU time. Generally, a sampler with lower V_A × cpu is preferable. We consider three characteristics A: F_θ^{-1}(0.05) (A = QL, lower quantile), F_θ^{-1}(0.95) (A = QU, upper quantile) and E(θ) (A = E, expectation). Note that this method can be applied to the SMC sampler, but it is less reliable than ESS/cpu since it is based on only one posterior distribution. For fair comparisons we generate 1000 posterior samples from each sampler after the burn-in period, so that M − B = 1000 for all the samplers except SMC, for which 1000 particles are used.
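A minimal sketch of this efficiency measure follows; since the displayed formula is not shown above, taking V_A as the variance of A_rs across replicate runs, multiplied by the average CPU time, is our assumption based on the surrounding description.

```python
import numpy as np

def va_cpu(estimates, cpu_times):
    """Efficiency measure V_A x cpu: variance of a posterior characteristic
    across replicate runs of the sampler, multiplied by the average CPU time
    per run (variance x mean-CPU form is an assumption).  Lower is better."""
    estimates = np.asarray(estimates, dtype=float)  # shape (R,), one per replicate
    return estimates.var(ddof=1) * np.mean(cpu_times)

# Example for A = QL (the 0.05 posterior quantile): for each replicate r,
# estimates[r] = np.quantile(draws_r, 0.05), with cpu_times[r] the run time.
```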
Additional simulations are included in the Supplementary Material (Levi and Craiu, 2020). These assess the performance of the proposed samplers with a random walk Metropolis (RWM) kernel instead of the independent one. Such samplers are not justified by the theory developed in Section 6 but still show good performance in terms of both proximity to the true target distribution and efficiency. When the models considered have a high-dimensional parameter space, the construction of an independent proposal may be difficult, and RWM offers a viable alternative. Future work will focus on developing theoretical foundations for the RWM implementation.

Ricker's Model
Ricker's model is frequently analyzed using Synthetic Likelihood procedures (Wood, 2010; Price et al., 2018). It is a particular instance of a hidden Markov model, in which Pois(λ) denotes the Poisson distribution with mean λ and n = 100. Only the sequence y = (y_1, ..., y_n) is observed, and the first 50 values are discarded. Note that all parameters θ = (θ_1, θ_2, θ_3) are unrestricted. The prior is specified component-wise (each prior parameter is independent), where N(a, b, c, d) denotes a truncated normal distribution with mean a, variance b, and lower and upper bounds c and d, respectively. We restrict the range of θ_2 because all algorithms become unstable for θ_2 outside this interval. Note that the marginal distribution of y is not available in closed form, but the transition distribution of the hidden variables X_i | x_{i−1} and the emission probabilities Y_i | x_i are known, and hence we can run Particle MCMC (PMCMC) (Andrieu et al., 2010) or Ensemble MCMC (Shestopaloff and Neal, 2013) to sample from the posterior distribution π(θ|y_0). Here we use Particle MCMC with 100 particles. As suggested in Wood (2010) we set θ_0 = (log(3.8), 0.9, 2.3) and define the summary statistic S(y) as the 14-dimensional vector whose first components are: (C1) #{i : y_i = 0}, (C2) the average of y, ȳ, (C3:C7) sample auto-correlations at lags 1 through 5; the remaining components are described in the Supplementary Material (Levi and Craiu, 2020).
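A minimal simulation sketch follows, assuming the Ricker parameterization of Wood (2010) (the displayed model equations do not appear above, so the exact form and the mapping to θ are assumptions), together with the first seven summary-statistic components listed in the text.

```python
import numpy as np

def simulate_ricker(log_r, sigma, phi, n=100, burn=50, rng=None):
    """Ricker hidden Markov model, in the parameterization of Wood (2010)
    (an assumption here):
        x_t = x_{t-1} * exp(log_r - x_{t-1} + e_t),  e_t ~ N(0, sigma^2)
        y_t | x_t ~ Poisson(phi * x_t)
    The first `burn` observations are discarded, as in the text."""
    rng = np.random.default_rng() if rng is None else rng
    x, ys = 1.0, []
    for _ in range(n + burn):
        x = x * np.exp(log_r - x + rng.normal(0.0, sigma))
        ys.append(rng.poisson(phi * x))
    return np.array(ys[burn:], dtype=float)

def ricker_summaries(y, n_lags=5):
    """Components (C1)-(C7) of S(y): number of zeros, mean,
    and sample auto-correlations at lags 1 through n_lags."""
    y = np.asarray(y, dtype=float)
    yc = y - y.mean()
    denom = np.dot(yc, yc)  # assumes y is not constant
    acfs = [np.dot(yc[:-k], yc[k:]) / denom for k in range(1, n_lags + 1)]
    return np.array([np.sum(y == 0), y.mean(), *acfs])
```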
We show ABC-RW here instead of ABC-IS because the latter exhibits poorer performance. The main observation is that the mixing of AABC-U is much better than that of ABC-RW, with smaller auto-correlation values. ABSL-U exhibits a similar performance (the plot is included in the Supplementary Material (Levi and Craiu, 2020)). To see how close the draws from the simulation-based algorithms are to the draws from the Exact chain, the Supplementary Material (Levi and Craiu, 2020) includes a plot of the estimated approximate posterior marginal densities. A more general study, where results are averaged over 100 independent replicates, is shown in Table 1. The proposed methods clearly outperform in terms of the overall efficiency measures ESS/cpu, V_QL × cpu, V_QU × cpu and V_E × cpu. For instance, AABC-U is about 20 times more efficient than standard ABC-RW in terms of ESS/cpu and improves on the SMC sampler when the variance of the quantiles, multiplied by CPU time, is considered. ABSL-U is 6 times more efficient than BSL-RW in terms of ESS/cpu and shows considerable improvement when the variance of the quantiles and of the mean, multiplied by CPU time, is considered. At the same time, DIM, DIC, TV and MSE are generally smaller for the approximate methods. For this model, however, samplers with linear weights show a minor loss in efficiency compared to uniform weights.

Stochastic Volatility with α-Stable Errors
When analyzing stationary time series, it is frequently observed that periods of high and low volatility alternate. This phenomenon is called volatility clustering; see, for example, Lux and Marchesi (2000). One way to model such behaviour is through a Stochastic Volatility (SV) model, where the variances of the observed time series depend on hidden states that themselves form a stationary time series. The standard SV model assumes that the conditional distribution of the observed variables is Gaussian (see the Supplementary Material (Levi and Craiu, 2020) for more details). Financial time series frequently exhibit large sudden drops, raising serious doubts about the latter assumption, and heavy-tailed distributions are often suggested instead of the Gaussian for modelling financial data. We consider the family of α-Stable distributions, denoted Stab(α, β), with two parameters α ∈ (0, 2] (stability parameter) and β ∈ [−1, 1] (skew parameter). Two special cases are α = 1 and α = 2, which correspond to the Cauchy and Gaussian distributions, respectively. Note that for α < 2 the distribution has infinite variance. We define an SV model with α-Stable errors and parameter θ = (θ_1, θ_2, θ_3, θ_4)^T ∈ R^4. This model is very similar to the simple SV model, the only difference being that the emission errors follow an α-Stable distribution with unknown stability parameter and fixed skew of −1. We prefer a negatively skewed emission distribution to model large negative financial returns. As in the previous simulation example, θ_2 and θ_3 are unrestricted. The prior distributions are independent across components. We set the true parameters to θ_1 = 0.95, θ_2 = −2, θ_3 = −1, θ_4 = 1.8 and the length of the time series to n = 500. The major challenge with this model is that there are no closed-form densities for α-Stable distributions. Hence, most MCMC samplers, including PMCMC and Ensemble MCMC, cannot be used to sample from the posterior.
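The intractability concerns only the density: draws from Stab(α, β) can still be generated directly. A minimal sketch of the standard Chambers-Mallows-Stuck generator (our implementation, α ≠ 1 branch only, which covers the stability parameters relevant here):

```python
import numpy as np

def rstable(alpha, beta, size, rng=None):
    """Chambers-Mallows-Stuck sampler for a standard Stab(alpha, beta)
    random variable (alpha != 1 branch).  For alpha = 2 this reduces to
    N(0, 2) regardless of beta."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    w = rng.exponential(1.0, size)                 # unit exponential
    t = beta * np.tan(np.pi * alpha / 2)
    b = np.arctan(t) / alpha
    s = (1 + t * t) ** (1 / (2 * alpha))
    return (s * np.sin(alpha * (u + b)) / np.cos(u) ** (1 / alpha)
            * (np.cos(u - alpha * (u + b)) / w) ** ((1 - alpha) / alpha))
```

Such draws can serve as the emission errors when simulating pseudo data from the SV model for ABC or BSL.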
However, sampling data from this family of distributions is feasible, which makes the model particularly amenable to simulation-based methods such as ABC and BSL. For summary statistics we use a 7-dimensional vector, where quantile(y, τ) is defined as the τ-quantile of the sequence y. As shown in Schmitt et al. (2015) and Dette et al. (2015), the auto-correlations of quantile-based indicators can be very useful in characterizing a time series, which is why components (C5), (C6) and (C7) of the summary statistic are of this type. We focus on y^2 and its autocorrelations, since the model parameters only affect the variability of y (the auto-correlation of y is zero at any lag). Figure 2 illustrates the performance of the AABC-U, ABSL-U and ABC-RW algorithms via the trace plot, histogram and auto-correlation function for θ_2. Additional plots for all the parameters are included in the Supplementary Material (Levi and Craiu, 2020). As in the previous example, the mixing of AABC-U is much better than that of ABC-RW. Since exact sampling is not feasible in this example, we compare the samplers to SMC.
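A minimal sketch of one such quantile-indicator autocorrelation is given below; the precise component definitions used in the paper are not shown above, so the exact form here is an assumption in the spirit of Schmitt et al. (2015) and Dette et al. (2015).

```python
import numpy as np

def indicator_acf(y, tau, lag=1):
    """Lag-`lag` sample autocorrelation of the indicator series
    1{y_i <= quantile(y, tau)} -- a quantile-based time-series summary
    (assumed form of components (C5)-(C7))."""
    y = np.asarray(y, dtype=float)
    q = np.quantile(y, tau)
    z = (y <= q).astype(float)
    zc = z - z.mean()
    denom = np.dot(zc, zc)
    if denom == 0.0:        # constant indicator series: no correlation signal
        return 0.0
    return np.dot(zc[:-lag], zc[lag:]) / denom
```

A series whose exceedances of a quantile cluster in time (e.g. volatility clustering in y^2) yields a large positive value, while an i.i.d. series yields a value near zero.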
For more general conclusions, Table 2 reports results averaged over 100 data replicates. To calculate DIM, DIC and TV, the samplers are compared to SMC since exact draws cannot be obtained. Again, the efficiency measures for AABC-U, AABC-L, ABSL-U and ABSL-L show significant improvement over the benchmark algorithms. All proposed methods outperform the benchmark samplers in terms of all efficiency measures except V_QU × cpu, where SMC outperforms AABC-U. For this example, DIM, DIC and TV may be misleading since the approximate samplers are compared to another approximate sampler. Much more informative is the MSE, which is very similar across the ABC-based and BSL-based algorithms.

Theoretical Justifications
In this section we show that the novel approximate ABC MCMC and BSL samplers with independent proposals exhibit proper ergodic properties in the long run. Specifically, we show that as the number of MCMC iterations increases, the marginal distribution of {θ^(t)} converges to the appropriate posterior distribution in total variation, and the sample averages converge to the true expectations. In the next two sections we extend the work of Johndrow et al. (2015b) on perturbed MCMC, and in Section 6.3 we discuss sufficient conditions for the ergodicity of AABC and ABSL. Note that in Sections 6.1 and 6.2, ε corresponds to the discrepancy between the exact and perturbed kernels and not to the threshold in ABC-based methods.

Notation
We start by reviewing the notation. Let p(θ) and q(θ) denote the prior and the proposal distributions for θ ∈ Θ, respectively. Then, for a target proportional to p(θ)h(θ) and a proposal ζ*, the acceptance probability is

α(θ, ζ*) = min{1, [p(ζ*) h(ζ*) q(θ)] / [p(θ) h(θ) q(ζ*)]}.

This independent MH acceptance rate defines an exact transition kernel which we call P(·, ·). Suppose h(θ) is not known and is instead estimated by ĥ(θ|Z_N), where Z_N is a set of auxiliary variables.
The acceptance probability for this perturbed algorithm (more on perturbed MCMC can be found in Roberts et al., 1998; Pillai and Smith, 2014; Johndrow and Mattingly, 2017) is obtained by replacing h with its estimate:

α̂(θ, ζ*; Z_N) = min{1, [p(ζ*) ĥ(ζ*|Z_N) q(θ)] / [p(θ) ĥ(θ|Z_N) q(ζ*)]}.

Averaging over the auxiliary variables, the approximate transition kernel is P̂_N(·, ·) = E_{Z_N}[P_N(·, ·; Z_N)]. The initial goal is to show that as N → ∞ the distance between this transition kernel and the exact one converges to zero, where the distance ||P̂_N − P|| is the supremum over starting points of the total variation distance between the two transition measures. First we prove the intuitive result that, under a strong consistency assumption on ĥ(θ; Z_N), the perturbed kernel converges to the exact one.
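The perturbed acceptance step can be sketched as follows; this is a minimal illustration with our own function names, where h_hat stands in for the estimate ĥ(·|Z_N).

```python
import numpy as np

def perturbed_imh_step(theta, h_hat, log_prior, proposal_rvs, proposal_pdf, rng):
    """One perturbed independent Metropolis-Hastings step: the exact ratio
    p(z)h(z)q(theta) / (p(theta)h(theta)q(z)) with h replaced by h_hat.
    Returns (new_state, accepted)."""
    zeta = proposal_rvs(rng)  # independent proposal, does not depend on theta
    num = np.exp(log_prior(zeta)) * h_hat(zeta) * proposal_pdf(theta)
    den = np.exp(log_prior(theta)) * h_hat(theta) * proposal_pdf(zeta)
    accept = den == 0.0 or rng.uniform() < min(1.0, num / den)
    return (zeta, True) if accept else (theta, False)
```

As a sanity check, with a flat prior, h equal to a standard normal density and the same normal as proposal, the acceptance ratio is identically 1 and the chain produces i.i.d. standard normal draws.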
Next, let P_ε = {P̂_N : ||P̂_N − P|| < ε} be the collection of perturbed kernels within ε distance of the exact kernel. The main objective is to show that if the chain uses a new kernel P̂_N ∈ P_ε at every iteration, it still results in an ergodic chain with appropriate convergence guarantees. To achieve this we build on the work of Johndrow et al. (2015b) on the convergence properties of perturbed kernels, where the authors assume the same perturbed kernel (within ε distance of the exact one) at each MCMC iteration. In the next subsection we extend this result by allowing a different transition kernel from P_ε at each iteration.
In Section 6.3 we give conditions for the AABC and ABSL methods that guarantee these three assumptions are satisfied. Next, let μ be the invariant measure of the exact kernel P, and denote the perturbed Markov chain by θ^(0), θ^(1), ..., θ^(t), with the marginal distribution of the initial state θ^(0) ∼ ν = μ_0. Define the marginal distributions of θ^(t) by μ_t, t = 1, 2, ..., which satisfy μ_t = ν P̂_0 P̂_1 ⋯ P̂_t, where P̂_t ∈ P_ε, t = 1, 2, ... (note that a different perturbed transition kernel is used at each iteration) and P̂_0 is defined as the identity transition (for convenience).
First we need to examine the total variation distance between μ and the average measure (1/M) Σ_{t=1}^{M} μ_t. This distance can be made arbitrarily small for sufficiently large M and small enough ε.
To summarize, the theoretical results show that if the estimator ĥ(θ; Z_N) of the true likelihood converges uniformly in probability to h(θ) as N (the number of auxiliary variables) increases, then for any ε > 0 there exists a large C so that for N > C all perturbed kernels P̂_N are within ε distance of the exact transition kernel (Theorem 6.1). Moreover, if the MCMC is run with the perturbed kernels P̂_N at each iteration, then the average of the chain's marginal distributions is close in total variation to the target (Theorem 6.2), and the average of any bounded function of the states converges to the true expectation (Theorem 6.3). The error depends only on the number of iterations M and the chosen ε. In the next section we provide sufficient conditions for the AABC and ABSL methods to fit this framework, thus proving their validity.

Ergodicity of AABC and ABSL
For illustration we focus on AABC, but the same arguments apply to ABSL. Recall that for the AABC algorithm we define the function h(θ) = P(δ < ε | θ), where δ = δ(y, y_0) and y ∼ f(y|θ). Unfortunately h(θ) is unknown and is therefore estimated by ĥ(θ|Z_N), where Z_N contains the past generated samples saved before the Nth iteration. Given θ (the current state) and ζ* (the proposed state), we apply kNN to approximate h(θ) and h(ζ*) by calculating locally weighted averages of the indicators 1{δ_n < ε} over the past draws ζ̃_n that are close to θ or ζ*. To obtain convergence results for the proposed methods, we consider the following assumptions: (B1) Θ is a compact set.
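The kNN estimate of h just described can be sketched as follows; uniform weights over the k nearest neighbours are shown (as in the -U variants of the samplers), and the linear-weight variants would instead down-weight more distant neighbours.

```python
import numpy as np

def knn_h_hat(theta, history_params, history_indicators, k):
    """kNN estimate of h(theta) = P(delta < eps | theta): the average of the
    stored indicators 1{delta_n < eps} over the k past parameter values
    closest to theta (uniform weights; a minimal sketch)."""
    d = np.linalg.norm(np.asarray(history_params, float)
                       - np.asarray(theta, float), axis=1)
    nearest = np.argsort(d)[:k]          # indices of the k closest past draws
    return float(np.mean(np.asarray(history_indicators, float)[nearest]))
```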
The theoretical justification of the methods also relies on the following two theorems.
We can now state the main result. The proof relies on the results stated above. Theorem 6.4 guarantees that the kNN estimates converge uniformly in probability over the compact parameter space. Theorem 6.1 then yields that the perturbed kernels approach the exact kernel for a sufficiently large history Z_N. Since an independent proposal is used in the AABC/ABSL algorithms, the Doeblin condition for the exact kernel follows from Theorem 6.5. Thus, assumptions (A1)-(A3) are satisfied and the conclusions of Theorems 6.2 and 6.3 follow easily.
The theorem states that under assumptions (B1)-(B5) the proposed approximate ABC method has the necessary convergence properties, i.e., the average of the marginal distributions of the Markov chain converges in total variation to the true target distribution, and the sample average of any bounded function converges in mean square to the true expectation. Generally, (B1) is a strong assumption, but in practice the parameter space can be restricted to a closed and bounded region that satisfies it. (B2) and (B5) are immediately satisfied by construction since, for the proposed algorithm, a multivariate normal distribution is used as the independent proposal and we set K(N) = √N. The continuity of the prior density function, (B3), can generally be satisfied since it is under the analyst's control. All these assumptions are met in the simulation studies of Section 5. (B4) is harder to check since h(θ) = P(δ < ε|θ) is an unknown function, but it is plausible: there is generally no reason for h(θ) to jump or drop suddenly when θ changes by a small amount, especially when the parameter space is compact.
The main result for ABSL is similar but requires three additional assumptions about the behavior of the summary statistics; the idea of the proof parallels that of Theorem 6.6. The continuity of the expectations and covariances of the summary statistics, (B6), is another hard-to-check assumption, while (B7) and (B8) are expected to hold due to the compactness of the parameter space (B1). We therefore assume that (B1)-(B8) are met in all the simulation scenarios of Section 5 and in the analysis of the real data set in the Supplementary Material (Levi and Craiu, 2020). All proofs of the above theorems and corollaries can be found in the Supplementary Material.

Discussion and Future Work
In this paper we propose to speed up generic ABC-MCMC and BSL algorithms by reusing past simulations. This approach significantly accelerates computation and can be very useful for models where simulating a pseudo data set is computationally expensive. We have presented theoretical arguments and sufficient assumptions for the convergence of the perturbed chain. The performance of these strategies was examined via a series of simulations under different models, all of which show that the proposed methods significantly improve the mixing and efficiency of the chain. When the likelihood is intractable, a researcher can choose either AABC or ABSL for inference. If the summary statistics can reasonably be trusted to follow a multivariate normal distribution for each parameter value, we recommend ABSL over generic BSL, as it is an order of magnitude faster and does not require the selection of a threshold ε. On the other hand, when one has no particular reason to assume a Gaussian distribution for the summary statistics, AABC should be preferred.
Further work is needed to extend these ideas to more complex models. First, the current implementation assumes that the dimension of the parameter space is small or moderate, since it is well known that the kNN approach is unreliable in high dimensions due to the curse of dimensionality. Second, the size of the history set Z_N increases as the chain progresses, which can create memory issues when the number of MCMC iterations, M, is very large; in our experiments with 50,000 to 100,000 iterations we have not encountered any memory problems. Third, kNN is a non-parametric method that must calculate the distances from the proposed sample ζ* to all the samples in the history set Z_N. Not surprisingly, this becomes computationally expensive as the number of iterations (and thus the cardinality of Z_N) increases, so the advantage of the proposed methods may diminish for very long MCMC runs. To remedy this, one can implement a more efficient kNN that uses a KD-tree for faster distance computations (Sherlock et al., 2017) and/or stop adding new members to Z_N after some large N. Finally, the independent Metropolis kernel used throughout the paper may be inefficient, especially if the posterior is far from Gaussian. Random walk kernels, which show good performance in the Supplementary Material (Levi and Craiu, 2020), could be a better alternative, but they require further theoretical development. We hope that by drawing attention to the approaches advanced in this paper, we will spur the Bayesian community's interest in developing economical and adaptive strategies for approximate Bayesian computation.
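The KD-tree remedy mentioned above can be sketched as follows; this is a minimal illustration (class name and rebuild schedule are ours) in which the tree is rebuilt only periodically, so queries cost roughly O(log N) instead of the O(N) scan of plain kNN, at the price of ignoring the few points added since the last rebuild.

```python
import numpy as np
from scipy.spatial import cKDTree

class HistoryKNN:
    """Growing history of (parameter, indicator) pairs with periodic
    KD-tree rebuilds for fast nearest-neighbour averages (a sketch)."""

    def __init__(self, rebuild_every=500):
        self.points, self.vals = [], []
        self.rebuild_every = rebuild_every
        self.tree = None

    def add(self, theta, indicator):
        self.points.append(np.asarray(theta, dtype=float))
        self.vals.append(float(indicator))
        if len(self.points) % self.rebuild_every == 0:
            # Rebuild only periodically: points added since the last rebuild
            # are temporarily ignored by tree queries.
            self.tree = cKDTree(np.vstack(self.points))

    def query_mean(self, theta, k):
        theta = np.asarray(theta, dtype=float)
        if self.tree is None:  # not enough history yet: brute-force scan
            d = np.linalg.norm(np.vstack(self.points) - theta, axis=1)
            idx = np.argsort(d)[:k]
        else:
            _, idx = self.tree.query(theta, k=k)
        return float(np.mean(np.asarray(self.vals)[np.atleast_1d(idx)]))
```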