The Rate of Convergence for Approximate Bayesian Computation

Approximate Bayesian Computation (ABC) is a popular computational method for likelihood-free Bayesian inference. The term "likelihood-free" refers to problems where the likelihood is intractable to compute or estimate directly, but where it is possible to generate simulated data $X$ relatively easily given a candidate set of parameters $\theta$ simulated from a prior distribution. Parameters which generate simulated data within some tolerance $\delta$ of the observed data $x^*$ are regarded as plausible, and a collection of such $\theta$ is used to estimate the posterior distribution $\theta\,|\,X\!=\!x^*$. Suitable choice of $\delta$ is vital for ABC methods to return good approximations to $\theta$ in reasonable computational time. While ABC methods are widely used in practice, particularly in population genetics, study of the mathematical properties of ABC estimators is still in its infancy. We prove that ABC estimates converge to the exact solution under very weak assumptions and, under slightly stronger assumptions, quantify the rate of this convergence. Our results can be used to guide the choice of the tolerance parameter $\delta$.

One of the earliest articles about the ABC approach, covering applications in population genetics, is Tavaré et al. (1997). Newer developments and extensions of the method include placing ABC in an MCMC context (Marjoram et al., 2003), sequential ABC (Sisson et al., 2007), enhancing ABC with nonlinear regression models (Blum and François, 2010), agent-based modelling (Sottoriva and Tavaré, 2010), and a Gibbs sampling ABC approach (Wilkinson et al., 2011). A survey of recent developments is given by Marin et al. (2012).
ABC methods allow for inference in a Bayesian setting where the parameter θ ∈ R^p of a statistical model is assumed to be random with a given prior distribution f_θ, we have observed data X ∈ R^d from a given distribution f_{X|θ} depending on θ, and we want to use these data to draw inference about θ. In the areas where ABC methods are used, the likelihood f_{X|θ} is typically not available in an explicit form. The term "likelihood-free" is used to indicate that ABC methods do not make use of the likelihood f_{X|θ}, but only work with samples from the joint distribution of θ and X. In the context of ABC methods, the data are usually summarised using a statistic S : R^d → R^q, and analysis is then based on S(X) instead of X. The choice of S affects the accuracy of the ABC estimates: the estimates can only be expected to converge to the true posterior if S is a sufficient statistic; otherwise additional error is introduced.
The basic idea of ABC methods is to replace samples from the exact posterior distribution θ | X = x* or θ | S(X) = s* with samples from an approximating distribution like θ | S(X) ≈ s*. There are many variants of the basic ABC method available, including different implementations of the condition S(X) ≈ s*. All variants use a tolerance parameter δ which controls the trade-off between fast generation of samples (large values of δ) and accuracy of samples (small values of δ). The easiest approach to implement the condition S(X) ≈ s*, considered in this paper, is to use ‖S(X) − s*‖ ≤ δ, where ‖·‖ is some norm on R^q. Different approaches to choosing the statistic S are used; a semi-automatic approach is described in Fearnhead and Prangle (2012). In many cases considerable improvements can be achieved by choosing the norm for comparison of S(X) to s* in a problem-specific way.
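The acceptance rule ‖S(X) − s*‖ ≤ δ can be sketched as a generic rejection sampler. The function names below and the toy normal model in the usage example are our own illustration, not part of the paper's examples:

```python
import numpy as np

def abc_rejection(prior_sample, simulate, summary, s_star, delta, n, rng):
    """Basic ABC rejection sampling: keep theta when ||S(X) - s*|| <= delta."""
    accepted = []
    while len(accepted) < n:
        theta = prior_sample(rng)   # draw a candidate parameter from the prior
        x = simulate(theta, rng)    # simulate data given the candidate
        if np.linalg.norm(summary(x) - s_star) <= delta:
            accepted.append(theta)
    return np.array(accepted)

# Toy conjugate model: theta ~ N(0,1), X | theta ~ N(theta,1), S(X) = X.
# For s* = 1 the exact posterior is N(1/2, 1/2), so for small delta the
# sample mean of the accepted thetas should be close to 0.5.
rng = np.random.default_rng(0)
samples = abc_rejection(lambda r: r.normal(), lambda t, r: r.normal(t),
                        lambda x: x, s_star=1.0, delta=0.1, n=500, rng=rng)
print(samples.mean())
```

Because the true posterior of this toy model is available in closed form, it provides a convenient check of the sampler before it is applied to a genuinely likelihood-free problem.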
Despite the popularity of ABC methods, theoretical analysis is still in its infancy. The aim of this article is to provide a foundation for such analysis by providing rigorous results about the convergence properties of the ABC method. Here, we restrict discussion to the most basic variant, to set a baseline to which different ABC variants can be compared. We consider Monte Carlo estimates of posterior expectations, using the ABC samples for the estimate. Proposition 3.1 shows that such ABC estimates converge to the true value under very weak conditions; once this is established, we investigate the rate of convergence in theorem 3.2. Similar results, but in the context of estimating posterior densities rather than posterior expectations, can be found in Blum (2010) and Biau et al. (2013).
The choice of norm and of the tolerance parameter δ are major challenges in the practical application of ABC methods, but not many results are available in the literature. A numerical study of the trade-off between accuracy and computational cost, in the context of sequential ABC methods, can be found in Silk et al. (2013). Wilkinson (2013) establishes that an ABC method which accepts or rejects proposals with a probability based on the difference between the observed and proposed data converges to the correct solution under assumptions on model or measurement error.
The error of an ABC estimate is affected both by the bias of the ABC samples, controlled by the tolerance parameter δ, and by the Monte Carlo error, controlled by the number n of accepted ABC samples. One of the key ingredients in our analysis is the result, shown in lemma 3.6, that for ABC estimates the bias satisfies bias ∼ δ².
Similarly, in lemma 3.7 we show that for the ABC estimate we have cost ∼ n δ^{−q}, where q is the dimension of the observation s*. It is well known that for Monte Carlo methods the error decays, as a function of computational cost, proportionally to cost^{−1/2}, where the exponent −1/2 is independent of dimension (see e.g. Voss, 2014, section 3.2.2). In contrast, the main result of this article, theorem 3.2, shows that, under optimal choice of δ, the basic ABC method satisfies error ∼ cost^{−2/(q+4)}. Corollary 4.2 shows that this result holds whether we fix the number of accepted samples (controlling the precision of our estimates but allowing the computational cost to be random) or fix the number of proposals (allowing the number of accepted samples to be random). The former is aesthetically more satisfying, but in practice most users have a fixed computational budget and hence must fix the number of proposals. In either case, the rate of decay for the error gets worse as the dimension q increases, and even in the one-dimensional case the exponent −2/(1+4) = −2/5 is worse than the exponent −1/2 for Monte Carlo methods. Fearnhead and Prangle (2012) obtain the same exponent −2/(q+4) for the specific summary statistic S(X) = E(θ|X). For the problem of estimating the posterior density, Blum (2010) reports the slightly worse exponent −2/(q+5). The difference between Blum's results and ours is due to the fact that for kernel density estimation an additional bandwidth parameter must be considered.
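For concreteness, the error-versus-cost exponents above can be tabulated with a few lines of plain arithmetic (the range of q is arbitrary):

```python
# Error-versus-cost exponents: plain Monte Carlo achieves cost**(-1/2),
# while the basic ABC method achieves cost**(-2/(q+4)), depending on the
# dimension q of the summary statistic.
abc_exponents = {q: -2 / (q + 4) for q in range(1, 6)}
print(abc_exponents)
# q = 1 already gives -2/5 = -0.4, worse than the Monte Carlo exponent
# -0.5, and the ABC exponent approaches 0 as q grows.
```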
We continue by giving a very short introduction to the basic ABC method in section 2. The main results, proposition 3.1 and theorem 3.2, are presented in section 3, together with their proofs. Section 4 shows that our results are independent of whether the number of proposals or of accepted samples is fixed in the algorithm. Section 5 illustrates the results of this paper with the help of numerical experiments. Finally, in section 6, we consider the practical implications of our results.

Approximate Bayesian Computation
This section gives a short introduction to the basic ABC algorithm. A more complete description is, for example, given in Voss (2014, section 5.1). We describe the algorithm in the context of the following Bayesian inference problem:
• A parameter vector θ ∈ R^p is assumed to be random. Before observing any data, our belief about its value is summarised by the prior distribution f_θ.
The value of θ is unknown to us and our aim is to make inference about θ.
• The available data X ∈ R^d are assumed to be a sample from a distribution f_{X|θ}, depending on the parameter θ. Inference about θ is based on a single observed sample x* from this distribution; repeated observations can be assembled into a single vector if needed.
• In the context of ABC, the data are often summarised using a statistic S : R^d → R^q. Since X is random, S = S(X) is random with a distribution f_{S|θ} depending on θ. If a summary statistic is used, inference is based on the value s* = S(x*) instead of on the full sample x*.
Our aim is to explore the posterior distribution of θ, i.e. the conditional distribution of θ given X = x*, using Monte Carlo methods. More specifically, we aim to estimate the posterior expectations

y = E( h(θ) | X = x* )  (1)

for given test functions h : R^p → R. Expectations of this form allow us to study many relevant properties of the posterior distribution, including the posterior mean when h(θ) = θ_i with i = 1, …, p, and posterior second moments when h(θ) = θ_i θ_j with i, j = 1, …, p. If h(θ) = 1_A(θ) for a given set A ⊆ R^p, the expectation E( h(θ) | X = x* ) equals the posterior probability of the set A; for example, the CDF can be approximated by choosing A = (−∞, a] for a ∈ R.
The basic ABC method for generating approximate samples from the posterior distribution is given in the following algorithm.
Algorithm 2.1. For a given observation s* ∈ R^q and given tolerance δ > 0, ABC samples approximating the distribution θ | S = s* are random samples θ_j^{(δ)} ∈ R^p computed by the following algorithm:

1: let j ← 0
2: while j < n do
3:   generate a sample θ from the prior distribution f_θ
4:   generate a sample s from the distribution f_{S|θ}
5:   if ‖s − s*‖_A ≤ δ then
6:     let j ← j + 1
7:     output θ_j^{(δ)} ← θ
8:   end if
9: end while

The norm ‖·‖_A used in the acceptance criterion is defined by ‖s‖_A² = sᵀ A^{−1} s for all s ∈ R^q, where A is a positive definite symmetric matrix. This includes the case of the Euclidean distance ‖·‖ = ‖·‖_I for A = I. In practical applications, the matrix A is often chosen in a problem-specific way.
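The weighted norm ‖s‖_A² = sᵀA⁻¹s can be evaluated without forming A⁻¹ explicitly. A minimal sketch (the function name is ours):

```python
import numpy as np

def a_norm(s, A):
    """||s||_A = sqrt(s' A^{-1} s) for symmetric positive definite A.

    Solving the linear system A y = s instead of inverting A is cheaper
    and numerically more stable than computing A^{-1} directly.
    """
    s = np.asarray(s, dtype=float)
    return float(np.sqrt(s @ np.linalg.solve(A, s)))

print(a_norm([2.0, 0.0], np.diag([4.0, 1.0])))  # sqrt(4/4) = 1.0
print(a_norm([2.0, 0.0], np.eye(2)))            # A = I gives the Euclidean norm: 2.0
```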
Using the output of the algorithm, an estimate for the posterior expectation (1) can be obtained as

Y_n^{(δ)} = (1/n) Σ_{j=1}^n h( θ_j^{(δ)} ),  (2)

where θ_1^{(δ)}, …, θ_n^{(δ)} are computed by algorithm 2.1. Since the output of the ABC algorithm approximates the posterior distribution, the Monte Carlo estimate Y_n^{(δ)} can be used as an approximation to the posterior expectation. The ABC samples are only approximately distributed according to the posterior distribution, and thus the estimate Y_n^{(δ)} will not exactly converge to the true value y as n → ∞. The quality of the approximation can be improved by decreasing the tolerance parameter δ, but this leads at the same time to a lower acceptance probability in algorithm 2.1 and thus, ultimately, to higher computational cost for obtaining the estimate Y_n^{(δ)}. Hence, a trade-off must be made between accuracy of the results and speed of computation.
Since the algorithm is not using x* directly, but uses s* = S(x*) instead, we require S to be a sufficient statistic, so that we have E( h(θ) | X = x* ) = E( h(θ) | S(X) = s* ). If S is not sufficient, an additional error will be introduced.
For application of the ABC method, knowledge of the distributions of θ, X and S = S(X) is not required; instead we assume that we can simulate large numbers of samples of these random variables.In contrast, in our analysis we will assume that the joint distribution of θ and S has a density f S,θ and we will need to consider properties of this density in some detail.
To conclude this section, we remark that there are two different approaches to choosing the sample size used to compute the ABC estimate Y_n^{(δ)}. If we denote the number of proposals required to generate n output samples by N ≥ n, then one approach is to choose the number N of proposals as fixed; in this case the number n ≤ N of accepted proposals is random. Alternatively, for given n, the loop in the ABC algorithm could be executed until n samples are accepted, resulting in random N and fixed n. In order to avoid complications with the definition of Y_n^{(δ)} for n = 0, we follow the second approach here and defer discussion of the case of fixed N until section 4.

Results
This section presents the main results of the paper, in proposition 3.1 and theorem 3.2, followed by proofs of these results.
Throughout, we assume that the joint distribution of θ and S has a density f_{S,θ}, and we consider the marginal densities of S and θ given by

f_S(s) = ∫_{R^p} f_{S,θ}(s, t) dt for all s ∈ R^q and f_θ(t) = ∫_{R^q} f_{S,θ}(s, t) ds for all t ∈ R^p,

respectively. We also consider the conditional density of θ given S = s, defined by f_{θ|S}(t|s) = f_{S,θ}(s, t) / f_S(s), if f_S(s) > 0, and 0 otherwise.
Our aim is to study the convergence of the estimate Y_n^{(δ)} to y as n → ∞ and δ ↓ 0. The fact that ABC estimates converge to the correct value as δ ↓ 0 is widely accepted and easily shown empirically. This result is made precise in the following proposition, showing that the convergence holds under very weak conditions.

Proposition 3.1. Assume that E|h(θ)| < ∞. Then, for f_S-almost all s* ∈ R^q, we have

lim_{δ↓0} lim_{n→∞} Y_n^{(δ)} = y,

where the inner limit holds almost surely.
We note that the assumptions required in our version of this result are very modest. Since our aim is to estimate the posterior expectation of h(θ), the assumption that the prior expectation E|h(θ)| exists is reasonable. Similarly, the phrase "for f_S-almost all s* ∈ R^q" in the proposition indicates that the result holds for all s* in a set A ⊆ R^q with P( S(X) ∈ A ) = 1. Since in practice the value s* will be an observed sample of S(X), this condition forms no restriction to the applicability of the result. The proposition could be further improved by removing the assumption that the distributions of θ and S(X) have densities. While a proof of this stronger result could be given following similar lines to the proof given below, here we prefer to keep the setup consistent with what is required for the result in theorem 3.2.
Our second result, theorem 3.2, quantifies the speed of convergence of Y_n^{(δ)} to y. We consider the mean squared error MSE( Y_n^{(δ)} ) = E( (Y_n^{(δ)} − y)² ) and relate this to the computational cost of computing the estimate Y_n^{(δ)}. For the computational cost, rather than using a sophisticated model of computation, we restrict ourselves to the naïve approach of assuming that the time required to obtain the estimate is cost = a + bN, where a and b are constants and N is the number of iterations the while-loop in the algorithm has to perform until the condition j = n is met (i.e. the number of proposals required to generate n samples). To describe the asymptotic behaviour of MSE and cost, we use the following notation.
Notation. For sequences (a_n)_{n∈N} and (b_n)_{n∈N} of positive real numbers we write a_n ∼ b_n to indicate that the limit c = lim_{n→∞} a_n / b_n exists and satisfies 0 < c < ∞.
Our result about the speed of convergence requires the density of S(X) to have continuous third partial derivatives. More specifically, we will use the following technical assumptions on S.
Assumption A. The density f_S and the function s ↦ ∫_{R^p} h(t) f_{S,θ}(s, t) dt are three times continuously differentiable in a neighbourhood of s* ∈ R^q.

Theorem 3.2. Let h : R^p → R be such that E( h(θ)² ) < ∞ and let S be sufficient and satisfy assumption A. Assume Var( h(θ) | S = s* ) > 0 and C(s*) ≠ 0, where C(s*) is the constant from lemma 3.6 below. Then, for f_S-almost all s*, the following statements hold:

1. If the tolerances are chosen as δ_n = c n^{−1/4} for a constant c > 0, then MSE( Y_n^{(δ_n)} ) ∼ E(cost)^{−4/(q+4)} as n → ∞.

2. The exponent −4/(q+4) given in the preceding statement is optimal: for any sequence δ_n ↓ 0 we have MSE( Y_n^{(δ_n)} ) ≥ c′ E(cost)^{−4/(q+4)} for a constant c′ > 0 and all sufficiently large n.

The rate of convergence in the first part of theorem 3.2 should be compared to the corresponding rate for the usual Monte Carlo estimate. Since Monte Carlo estimates are unbiased, the root mean squared error for a Monte Carlo estimate is proportional to 1/√n, while the cost of generating n samples is proportional to n. Thus, for Monte Carlo estimates we have RMSE ∼ cost^{−1/2}. The corresponding exponent from theorem 3.2, obtained by taking square roots, is −2/(q+4), showing slower convergence for ABC estimates. This reduced efficiency is a consequence of the additional error introduced by the bias in the ABC estimates.
The statement of the theorem implies that, as computational effort is increased, δ should be decreased proportionally to n^{−1/4}. In this case, the error decreases proportionally to the cost with exponent r = −4/(q+4). The second part of the theorem shows that no choice of δ can lead to a better (i.e. more negative) exponent. If the tolerances δ_n are decreased in a way which leads to an exponent r′ > r, in the limit of large n such a schedule will always be inferior to the choice δ_n = c n^{−1/4}, for any constant c > 0. It is important to note that this result only applies to the limit n → ∞.
The result of theorem 3.2 describes the situation where, for a fixed problem, the tolerance δ is decreased. Caution is advisable when comparing ABC estimates for different q. Since the constant implied in the ∼-notation depends on the choice of problem, and thus on q, the dependence of the error on q for fixed δ is not in the scope of theorem 3.2. Indeed, Blum (2010) suggests that in practice ABC methods can be successfully applied for larger values of q, despite results like theorem 3.2.
Before we turn our attention to proving the results stated above, we first remark that, without loss of generality, we can assume that the acceptance criterion in the algorithm uses the Euclidean distance ‖·‖ instead of ‖·‖_A. This can be achieved by considering the modified statistic S̃(x) = A^{−1/2} S(x) and s̃* = S̃(x*) = A^{−1/2} s*, where A^{−1/2} is the positive definite symmetric square root of A^{−1}: since ‖s‖_A = ‖A^{−1/2} s‖, the ABC algorithm using S and ‖·‖_A has identical output to the ABC algorithm using S̃ and the Euclidean norm. Finally, a simple change of variables shows that the assumptions of proposition 3.1 and theorem 3.2 are satisfied for S and ‖·‖_A if and only if they are satisfied for S̃ and ‖·‖. Thus, for the proofs we will assume that the Euclidean distance is used in the algorithm.
The rest of this section contains the proofs of proposition 3.1 and theorem 3.2.In the proofs it will be convenient to use the following technical notation.
Definition 3.3. For h : R^p → R with E|h(θ)| < ∞ let

ϕ_h(s) = ∫_{R^p} h(t) f_{S,θ}(s, t) dt

for all s ∈ R^q, and

ϕ_h^{(δ)}(s*) = (1 / |B(s*, δ)|) ∫_{B(s*, δ)} ϕ_h(s) ds  (4)

for all s* ∈ R^q, where |B(s*, δ)| denotes the volume of the ball B(s*, δ).
Using the definition for h ≡ 1 we get ϕ_1 ≡ f_S, and for general h we get ϕ_h(s) = f_S(s) E( h(θ) | S = s ). Both the exact value y from (1) and the mean of the estimator Y_n^{(δ)} from (2) can be expressed in the notation of definition 3.3. This is shown in the following lemma.

Lemma 3.4. Assume that E|h(θ)| < ∞. Then, for f_S-almost all s* ∈ R^q, we have

y = E( h(θ) | S = s* ) = ϕ_h(s*) / ϕ_1(s*) and E( Y_n^{(δ)} ) = ϕ_h^{(δ)}(s*) / ϕ_1^{(δ)}(s*).

Proof. We have

E|h(θ)| = ∫_{R^q} ∫_{R^p} |h(t)| f_{S,θ}(s, t) dt ds < ∞,

and thus we know that ∫_{R^p} h(t) f_{S,θ}(s, t) dt < ∞ for almost all s ∈ R^q. Consequently, the conditional expectation E( h(θ) | S = s* ) exists for f_S-almost all s* ∈ R^q. Using Bayes' rule we get

E( h(θ) | S = s* ) = ∫_{R^p} h(t) f_{θ|S}(t | s*) dt = ϕ_h(s*) / ϕ_1(s*).

On the other hand, the samples θ_j^{(δ)} are distributed according to the conditional distribution of θ given S ∈ B(s*, δ). Thus, the density of the samples θ_j^{(δ)} can be written as

f^{(δ)}(t) = (1/Z) ∫_{B(s*, δ)} f_{S,θ}(s, t) ds,

where the normalising constant Z satisfies

Z = ∫_{B(s*, δ)} f_S(s) ds = |B(s*, δ)| ϕ_1^{(δ)}(s*),

and we get

E( Y_n^{(δ)} ) = E( h(θ_1^{(δ)}) ) = ∫_{R^p} h(t) f^{(δ)}(t) dt = ϕ_h^{(δ)}(s*) / ϕ_1^{(δ)}(s*).

This completes the proof.
Using these preparations, we can now present a proof of proposition 3.1. The inner limit follows from the strong law of large numbers, applied to the independent, identically distributed random variables h(θ_1^{(δ)}), h(θ_2^{(δ)}), …. For the convergence as δ ↓ 0, since ϕ_1 ≡ f_S ∈ L¹(R^q) and ϕ_h ∈ L¹(R^q), we can use the Lebesgue differentiation theorem (Rudin, 1987, theorem 7.7) to conclude that ϕ_h^{(δ)}(s*) → ϕ_h(s*) and ϕ_1^{(δ)}(s*) → ϕ_1(s*) as δ ↓ 0 for almost all s* ∈ R^q, and using lemma 3.4 we get

lim_{δ↓0} E( h(θ) | S ∈ B(s*, δ) ) = lim_{δ↓0} ϕ_h^{(δ)}(s*) / ϕ_1^{(δ)}(s*) = ϕ_h(s*) / ϕ_1(s*) = y

for almost all s* ∈ R^q. This completes the proof.
For later use we also state the following simple consequence of proposition 3.1.
Corollary 3.5. Assume that E( h(θ)² ) < ∞. Then, for f_S-almost all s* ∈ R^q, we have

lim_{δ↓0} Var( h(θ) | S ∈ B(s*, δ) ) = Var( h(θ) | S = s* ).

Proof. From the definition of the variance we know

Var( h(θ) | S ∈ B(s*, δ) ) = E( h(θ)² | S ∈ B(s*, δ) ) − E( h(θ) | S ∈ B(s*, δ) )².

Applying proposition 3.1 to the function h² first, we get lim_{δ↓0} E( h(θ)² | S ∈ B(s*, δ) ) = E( h(θ)² | S = s* ); applying it to h gives the corresponding limit for the second term, and combining the two limits completes the proof.

The rest of this section is devoted to a proof of theorem 3.2. We first consider the bias of the estimator Y_n^{(δ)}. As is the case for Monte Carlo estimates, the bias of the ABC estimate Y_n^{(δ)} does not depend on the sample size n. The dependence on the tolerance parameter δ is given in the following lemma. This lemma is the key ingredient in the proof of theorem 3.2.

Lemma 3.6. Assume that E|h(θ)| < ∞ and that S satisfies assumption A. Then, for f_S-almost all s* ∈ R^q, we have

bias( Y_n^{(δ)} ) = C(s*) δ² + o(δ²)

as δ ↓ 0, where the constant C(s*) is given by equation (3) and ∆ denotes the Laplace operator.
Proof. Using lemma 3.4 we can write the bias as

bias( Y_n^{(δ)} ) = E( Y_n^{(δ)} ) − y = ϕ_h^{(δ)}(s*) / ϕ_1^{(δ)}(s*) − ϕ_h(s*) / ϕ_1(s*).  (5)

To prove the lemma, we have to study the rate of convergence of the averages ϕ_h^{(δ)}(s*) to the centre value ϕ_h(s*) as δ ↓ 0. Using Taylor's formula we find

ϕ_h(s) = ϕ_h(s*) + ∇ϕ_h(s*) · (s − s*) + ½ (s − s*)ᵀ H_{ϕ_h}(s*) (s − s*) + r_3(s),

where H_{ϕ_h} denotes the Hessian of ϕ_h and the error term r_3 satisfies

|r_3(s)| ≤ c ∑_{|α|=3} sup_{u ∈ B(s*,δ)} | ∂_s^α ϕ_h(u) | ‖s − s*‖³  (6)

for all s ∈ B(s*, δ) and a constant c, where ∂_s^α denotes the partial derivative corresponding to the multi-index α. Substituting the Taylor approximation into equation (4), the gradient term integrates to zero by symmetry of the ball, and we find

ϕ_h^{(δ)}(s*) = ϕ_h(s*) + (1/|B(s*,δ)|) ∫_{B(s*,δ)} ½ (s − s*)ᵀ H_{ϕ_h}(s*) (s − s*) ds + (1/|B(s*,δ)|) ∫_{B(s*,δ)} r_3(s) ds.  (7)

Since the Hessian H_{ϕ_h}(s*) is symmetric and since the domain of integration is invariant under rotations, we can choose a basis in which H_{ϕ_h}(s*) is diagonal, such that the diagonal elements coincide with the eigenvalues λ_1, …, λ_q. Using this basis we can write the quadratic term in (7) as

(1/|B(0,δ)|) ∫_{B(0,δ)} ½ ∑_{i=1}^q λ_i u_i² du = ( tr H_{ϕ_h}(s*) / 2 ) (1/|B(0,δ)|) ∫_{B(0,δ)} u_1² du,

where tr H_{ϕ_h} = ∆ϕ_h is the trace of the Hessian. Here we used the fact that the value ∫_{B(0,δ)} u_i² du does not depend on i and thus equals the average (1/q) ∑_{j=1}^q ∫_{B(0,δ)} u_j² du. Rescaling space by a factor 1/δ and using the relation (1/|B(0,1)|) ∫_{B(0,1)} u_1² du = 1/(q+2), we find that the quadratic term equals ( ∆ϕ_h(s*) / (2(q+2)) ) δ². For the error term we can use a similar scaling argument in the bound (6) to get

(1/|B(s*,δ)|) ∫_{B(s*,δ)} |r_3(s)| ds ≤ C δ³

for some constant C. Substituting these results back into equation (7) we find

ϕ_h^{(δ)}(s*) = ϕ_h(s*) + ( ∆ϕ_h(s*) / (2(q+2)) ) δ² + O(δ³).  (8)

Using formula (8) for h ≡ 1 we also get

ϕ_1^{(δ)}(s*) = ϕ_1(s*) + ( ∆ϕ_1(s*) / (2(q+2)) ) δ² + O(δ³).

Using representation (5) of the bias, and omitting the argument s* for brevity, we can now express the bias in powers of δ:

bias( Y_n^{(δ)} ) = ( ϕ_h + ∆ϕ_h δ²/(2(q+2)) + O(δ³) ) / ( ϕ_1 + ∆ϕ_1 δ²/(2(q+2)) + O(δ³) ) − ϕ_h/ϕ_1 = (1/(2(q+2))) ( ∆ϕ_h/ϕ_1 − ϕ_h ∆ϕ_1/ϕ_1² ) δ² + o(δ²) = C(s*) δ² + o(δ²).

This completes the proof.
To prove the statement of theorem 3.2, the bias of Y_n^{(δ)} has to be balanced with the computational cost for computing Y_n^{(δ)}. When δ decreases, fewer samples will satisfy the acceptance condition ‖S(X) − s*‖ ≤ δ and the running time of the algorithm will increase. The following lemma makes this statement precise.
Lemma 3.7. Let f_S be continuous at s*. Then the expected computational cost for computing the estimate Y_n^{(δ)} satisfies

E(cost) = c_1 + ( c_2 + c_3(δ) ) n δ^{−q}

for all n ∈ N, δ > 0, where c_1 and c_2 are constants, and c_3 does not depend on n and satisfies c_3(δ) → 0 as δ ↓ 0.
Proof. The computational cost for algorithm 2.1 is of the form a + bN, where a and b are constants, and N is the random number of iterations of the loop required until n samples are accepted. The number of iterations required to generate one ABC sample is geometrically distributed with parameter

p = P( ‖S(X) − s*‖ ≤ δ ),

and thus N, being the sum of n independent geometrically distributed values, has mean E(N) = n/p. The probability p can be written as

p = ∫_{B(s*, δ)} f_S(s) ds = |B(s*, δ)| ϕ_1^{(δ)}(s*).

Since ϕ_1 = f_S is continuous at s*, we have ϕ_1^{(δ)}(s*) → ϕ_1(s*) as δ ↓ 0, and thus p = c δ^q (1 + o(1)) for some constant c. Thus, we find that the computational cost satisfies E(cost) = a + b n/p = c_1 + ( c_2 + c_3(δ) ) n δ^{−q}, and the proof is complete.
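The claim E(N) = n/p, with N a sum of n independent geometric variables, is easy to check by simulation (a sketch; the specific values of n and p are arbitrary):

```python
import numpy as np

# N = number of proposals needed for n acceptances, each proposal being
# accepted independently with probability p. N is a sum of n geometric
# variables (counting trials up to and including each success), so
# E(N) = n / p.
rng = np.random.default_rng(1)
n, p = 50, 0.1
mean_N = np.mean([rng.geometric(p, size=n).sum() for _ in range(2000)])
print(mean_N)  # close to n / p = 500
```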
Finally, the following two lemmata give the relation between the approximation error caused by the bias on the one hand and the expected computational cost on the other hand.
Lemma 3.8. Assume that E( h(θ)² ) < ∞ with Var( h(θ) | S = s* ) > 0, that S satisfies assumption A, and that the tolerances satisfy δ_n ∼ n^{−1/4}. Then, for f_S-almost all s* ∈ R^q, we have MSE( Y_n^{(δ_n)} ) ∼ E(cost)^{−4/(q+4)} as n → ∞.

Proof. By assumption, the limit D = lim_{n→∞} δ_n n^{1/4} exists and is positive. Using lemma 3.6 and corollary 3.5, we find

MSE( Y_n^{(δ_n)} ) = bias( Y_n^{(δ_n)} )² + Var( Y_n^{(δ_n)} ) = ( C(s*)² D⁴ + Var( h(θ) | S = s* ) ) (1/n) (1 + o(1)).  (9)

On the other hand, using lemma 3.7, we find E(cost) ∼ n δ_n^{−q} ∼ n^{(q+4)/4}. Finally, combining the rates for cost and error we get the result MSE( Y_n^{(δ_n)} ) ∼ 1/n ∼ E(cost)^{−4/(q+4)}. Since the constant on the right-hand side of equation (9) is non-zero, this completes the proof.
Lemma 3.8 only specifies the optimal rate for the decay of δ_n as a function of n. By inspecting the proof, we can derive the corresponding optimal constant: δ_n should be chosen as δ_n = D n^{−1/4}, where D minimises the expression in equation (9) for fixed expected cost. The optimal value of D can analytically be found to be

D = ( q Var( h(θ) | S = s* ) / ( 4 C(s*)² ) )^{1/4},

where C(s*) is the constant from lemma 3.6. The variance Var( h(θ) | S = s* ) is easily estimated, but it seems difficult to estimate C(s*) in a likelihood-free way.
The following result completes the proof of theorem 3.2 by showing that no other choice of δ_n can lead to a better rate for the error while retaining the same cost.

Lemma 3.9. For n ∈ N let δ_n ↓ 0. Assume that E( h(θ)² ) < ∞ with Var( h(θ) | S = s* ) > 0, that S satisfies assumption A and that C(s*) ≠ 0. Then, for f_S-almost all s* ∈ R^q, we have

MSE( Y_n^{(δ_n)} ) ≥ c E(cost)^{−4/(q+4)}

for a constant c > 0 and all sufficiently large n.

Proof. From lemma 3.6 we know that

bias( Y_n^{(δ_n)} )² ≥ ( C(s*)² / 2 ) δ_n⁴  (10)

for all sufficiently large n, since C(s*)² > 0. By lemma 3.7, as n → ∞, we have

E(cost)^{−4/(q+4)} ∼ ( n δ_n^{−q} )^{−4/(q+4)} = ( δ_n⁴ )^{q/(q+4)} ( 1/n )^{4/(q+4)},  (11)

where we were able to insert the constant factors into the last term because the ∼-relation does not see constants. Using Young's inequality we find

( δ_n⁴ )^{q/(q+4)} ( 1/n )^{4/(q+4)} ≤ ( q/(q+4) ) δ_n⁴ + ( 4/(q+4) ) (1/n) ≤ δ_n⁴ + 1/n,  (12)

and, since Var( h(θ) | S ∈ B(s*, δ_n) ) → Var( h(θ) | S = s* ) > 0 by corollary 3.5, combining equations (10), (11) and (12) we get

MSE( Y_n^{(δ_n)} ) = bias( Y_n^{(δ_n)} )² + (1/n) Var( h(θ) | S ∈ B(s*, δ_n) ) ≥ c ( δ_n⁴ + 1/n ) ≥ c E(cost)^{−4/(q+4)}

for some constant c > 0 and all sufficiently large n. This is the required result.
The statement of theorem 3.2 coincides with the statements of lemmata 3.8 and 3.9 and thus we have completed the proof of theorem 3.2.

Fixed Number of Proposals
So far, we have fixed the number n of accepted samples; the number N of proposals required to generate n samples is then a random variable. In this section we consider the opposite approach, where the number N of proposals is fixed and the number n of accepted samples is random. Compared to the case with a fixed number of samples, the case considered here includes two additional sources of error. First, with small probability no samples at all will be accepted, and we shall see that this introduces a small additional bias. Secondly, the randomness in the value of n introduces additional variance in the estimate.
We show that the additional error introduced is small enough not to affect the rate of convergence.The main result here is corollary 4.2, below, which shows that the results from theorem 3.2 still hold for random n.
In analogy to the definition of Y_n^{(δ)} from equation (2), we define

Ŷ_N^{(δ)} = (1/n) Σ_{j=1}^n h( θ_j^{(δ)} ) if n ≥ 1, and Ŷ_N^{(δ)} = ĉ if n = 0,

where ĉ is an arbitrary constant (e.g. zero or the prior mean) to be used if none of the proposals are accepted. For a given tolerance δ > 0, each of the N proposals is accepted with probability p = P( S(X) ∈ B(s*, δ) ), and the number of accepted samples is binomially distributed with mean E(n) = Np. In this section we consider arbitrary sequences of N and δ such that Np → ∞, and we show that asymptotically the estimators Y_{Np}^{(δ)} (using n = Np accepted samples) and Ŷ_N^{(δ)} (using N proposals) have the same error.
Proposition 4.1. Let Ŷ_N^{(δ)} and Y_{Np}^{(δ)} be as above. Then

MSE( Ŷ_N^{(δ)} ) / MSE( Y_{Np}^{(δ)} ) → 1 as Np → ∞.

Proof. We start the proof by comparing the variances of the two estimates Ŷ_N^{(δ)} and Y_{Np}^{(δ)}; the ratio of the two variances can be expanded as in equation (13), and our aim is to show that the right-hand side of (13) converges to 1.
To see this, we split the right-hand side into four different terms. First, since log(1 − p) ≤ −p, the right-most term in (13) disappears as Np → ∞. Now let ε > 0. Then we have P( n ≥ (1 + ε) Np ) → 0 as Np → ∞ by Chebyshev's inequality. For the lower tails of n/Np we use the relative entropy bound of lemma 2.1.9 in Dembo and Zeitouni (1998); since q ↦ H(q|p) is decreasing for q < p, we get P( n ≤ (1 − ε) Np ) → 0 as Np → ∞. Substituting equations (14) to (18) into (13) and letting ε ↓ 0 we finally get Var( Ŷ_N^{(δ)} ) / Var( Y_{Np}^{(δ)} ) → 1. From the previous sections we know the bias of Y_{Np}^{(δ)}, and we have already seen how the variances compare. Let ε > 0 and let Np be large enough that all o(1) terms in (19) satisfy −ε < o(1) < ε. Then the mean squared error of Ŷ_N^{(δ)} can be bounded from above and below, where the second inequality is found by maximising the relevant function for all sufficiently large Np. Using these bounds, taking the limit Np → ∞ and finally taking the limit ε ↓ 0, we find MSE( Ŷ_N^{(δ)} ) / MSE( Y_{Np}^{(δ)} ) → 1 as Np → ∞. This completes the proof.
We note that the result of proposition 4.1 does not require p to converge to 0. If p stays bounded away from 0, neither of the two mean squared errors converges to 0, but the proposition still shows that the difference between fixed N and fixed n is asymptotically negligible.
Corollary 4.2. Let the assumptions of theorem 3.2 hold and let (δ_N)_{N∈N} be a sequence with δ_N ∼ N^{−1/(q+4)}, where N denotes the (fixed) number of proposals. Then the mean squared error satisfies MSE( Ŷ_N^{(δ_N)} ) ∼ N^{−4/(q+4)} as N → ∞, and the given choice of δ_N is optimal in the sense of theorem 3.2.

Proof. Since the cost is proportional to the fixed number N of proposals, the result follows by combining proposition 4.1 with theorem 3.2.

Numerical Experiments
Therefore, the prior and posterior expectations of h(θ) are E( h(θ) ) = 0.3829 and E( h(θ) | S = s* ) = 0.3648, respectively. Similarly, the constant from lemma 3.6 can be shown to be C(s*) = 0.0323. Assumption A can be shown to hold: the density f_S is multivariate normal, so its third derivatives exist and are bounded and continuous, and the function s ↦ ∫_{R^p} h(t) f_{S,θ}(s, t) dt also has bounded and continuous third derivatives. Thus, the assumptions hold.
The programs used to perform the simulations described in this section, written using the R environment (R Core Team, 2013), are available as supplementary material.

Experiment 1
Our first experiment validates the statement about the bias given in lemma 3.6. Since we know the exact posterior expectation, the bias can be determined experimentally. For fixed δ, we generate k independent ABC estimates, each based on n accepted samples. For each of the k estimates, we calculate its difference from the true posterior expectation. We then calculate the mean and standard error of these differences to obtain a Monte Carlo estimate of the bias.
Repeating this procedure for several values of δ, we can produce a plot of the estimated bias against δ, with 95% error bars. Figure 1 shows the result of a simulation, using n = 500 samples for each ABC estimate and k = 5000 ABC estimates for each value of δ. For comparison, the plot includes the theoretically predicted asymptotic bias C(s*) δ², using the value C(s*) = 0.0323. The plot shows that for small values of δ the theoretical curve is indeed a good fit to the numerical estimates of the bias. The result of the lemma is only valid as δ ↓ 0, and indeed the plot shows a discrepancy for larger values. This discrepancy arises because for large δ only a small fraction of the proposals is rejected; as δ increases, the distribution of the ABC samples approaches the prior distribution.
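The structure of this experiment can be sketched on a toy conjugate model (not the paper's example, so the constant C(s*) differs from the value quoted above): θ ~ N(0,1), X | θ ~ N(θ,1), S(X) = X and s* = 1, for which the exact posterior is N(1/2, 1/2) and the bias can be measured directly.

```python
import numpy as np

# Sketch of experiment 1 on a toy conjugate model: the empirical bias of
# the ABC estimate of E(theta | S = s*) = 0.5 is computed for several
# tolerances; |bias| grows roughly like delta**2 for small delta.
rng = np.random.default_rng(0)

def abc_posterior_mean(delta, n):
    accepted = []
    while len(accepted) < n:
        theta = rng.normal(size=1000)        # propose in batches
        x = rng.normal(theta)                # one observation per candidate
        accepted.extend(theta[np.abs(x - 1.0) <= delta])
    return np.mean(accepted[:n])

bias = {}
for delta in (0.1, 0.4, 1.6):
    estimates = [abc_posterior_mean(delta, 200) for _ in range(50)]
    bias[delta] = np.mean(estimates) - 0.5
print(bias)
```

As in the figure described above, the bias is negligible for small δ and clearly visible for large δ, where the ABC output approaches the prior.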

Experiment 2
Our second experiment validates the statement of theorem 3.2, by numerically estimating the optimal choice of δ and the corresponding MSE.
For fixed values of expected computational cost and δ, we estimate the mean squared error by generating k different ABC estimates and taking the mean of their squared distances from the true posterior expectation. This mirrors how the bias is estimated in experiment 1. Repeating this procedure for several values of δ, the estimates of the MSE are plotted against δ.
Our aim is to determine the optimal value of δ for fixed computational cost. From lemma 3.7 we know that the expected cost is of order n δ^{−q}, and thus we choose n ∼ δ² in this example to keep the expected cost constant. From lemma 3.6 we know that bias ∼ δ². Thus, we expect the MSE for constant expected cost to be of the form

MSE(δ) = a δ⁴ + b δ^{−2}  (20)

for some constants a and b, and we fit a curve of this form to the numerically estimated values of the MSE. The result of one such simulation, using k = 500 samples for each δ, is shown in figure 2. The curve fits the data well. We estimate the optimal values of δ and MSE, given an expected computational cost, to be those at the minimum of the fitted curve. Given the good fit between the predicted form (20) of the curve and the empirical MSE values, this procedure promises to be a robust way to estimate the optimal value of δ; estimating the minimum directly from the empirical MSE values would likely require a much larger number of samples to be accurate.
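Assuming, as the derivation above suggests, that the fixed-cost MSE has the form a δ⁴ + b δ^{−2}, the fit is an ordinary least-squares problem, since the model is linear in the two features δ⁴ and δ^{−2}. A sketch on synthetic, noise-free values (the constants a and b are made up for illustration):

```python
import numpy as np

# MSE(delta) = a*delta**4 + b*delta**(-2) is linear in the features
# (delta**4, delta**(-2)), so a and b can be recovered by least squares.
# The minimiser of the fitted curve solves 4*a*d**3 = 2*b*d**(-3), i.e.
# delta* = (b / (2*a))**(1/6).
a_true, b_true = 0.02, 1e-4
deltas = np.linspace(0.2, 1.0, 20)
mse = a_true * deltas**4 + b_true * deltas**-2

X = np.column_stack([deltas**4, deltas**-2])
(a_hat, b_hat), *_ = np.linalg.lstsq(X, mse, rcond=None)
delta_opt = (b_hat / (2 * a_hat)) ** (1 / 6)
print(a_hat, b_hat, delta_opt)
```

With noisy empirical MSE values the same least-squares fit applies unchanged; only the recovered constants become estimates rather than exact values.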
Repeating the above procedure for a range of values of expected cost gives corresponding estimates for the optimal values of δ and the MSE as a function of expected cost. We expect both the optimal δ and the MSE to depend on the cost like x = A · cost^B. To validate the statements of theorem 3.2 we numerically estimate the exponent B from simulated data. The result of such a simulation is shown in figure 3. The data lie roughly on straight lines, as expected, and the table shows an excellent fit between empirical and theoretically predicted values.

Discussion
While the work of this article is mostly theoretical in nature, the results can be used to guide the design of an analysis using ABC methods. Our results can be applied directly in cases where a pilot run is used to tune the algorithm. The performance of the ABC algorithm depends both on the number n of samples accepted and on the tolerance δ. From theorem 3.2 we know that optimality is achieved by choosing δ proportional to n^{−1/4}. Consequently, if the full run is performed by increasing the number of accepted ABC samples by a factor of k, then the tolerance δ for the full run should be divided by k^{1/4}. In this case the expected running time satisfies cost ∼ n δ^{−q} ∼ n^{(q+4)/4} and, using lemma 3.7, we have error ∼ cost^{−2/(q+4)} ∼ n^{−1/2}.
There are two possible scenarios:
• If we want to reduce the root mean squared error by a factor of α, we should increase the number n of accepted ABC samples by a factor of α² and reduce the tolerance δ by a factor of (α²)^{1/4} = √α. These changes will increase the expected running time of the algorithm by a factor of α^{(q+4)/2}.
• If we plan to increase the (expected) running time by a factor of β, we should increase the number of accepted samples by a factor of β^{4/(q+4)} and divide δ by β^{1/(q+4)}. These changes will reduce the root mean squared error by a factor of β^{2/(q+4)}. These guidelines will lead to a choice of parameters which, at least asymptotically, maximises the efficiency of the analysis.
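The second scenario can be wrapped into a small helper (a sketch; the function name is ours):

```python
def rescale_for_budget(n, delta, beta, q):
    """Given a budget increase by factor beta, return the rescaled number
    of accepted samples n * beta**(4/(q+4)) and the rescaled tolerance
    delta / beta**(1/(q+4)); the RMSE then shrinks by beta**(2/(q+4))."""
    return n * beta ** (4 / (q + 4)), delta / beta ** (1 / (q + 4))

# Example: q = 4 and a 256-fold budget increase: n grows 16-fold and
# delta halves, since 256**(4/8) = 16 and 256**(1/8) = 2.
print(rescale_for_budget(1000, 0.5, 256, 4))
```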

Figure 1: Simulation results illustrating the relationship between bias and δ. Circles give the mean empirical bias from 5000 simulations for each value of δ. The error bars indicate mean ±1.96 standard errors. The solid line shows the theoretically predicted asymptotic bias from lemma 3.6.

Figure 2: The estimated MSE as a function of the tolerance δ for fixed expected cost. The fitted curve has the form given in equation (20). The location of the optimal δ is marked by the vertical line.

Figure 3: Estimated dependency of the optimal δ and of the corresponding MSE on the computational cost. The computational cost is given in arbitrary units, chosen such that the smallest sample size under consideration has cost 1. For comparison, the additional line above the fit has the gradient expected from the theoretical results.