On a Metropolis-Hastings importance sampling estimator

A classical approach for approximating expectations of functions w.r.t. partially known distributions is to compute the average of function values along a trajectory of a Metropolis-Hastings (MH) Markov chain. A key part of the MH algorithm is a suitable acceptance/rejection of a proposed state, which ensures the correct stationary distribution of the resulting Markov chain. However, the rejection of proposals causes highly correlated samples; in particular, when a state is rejected it is not taken into account any further. In contrast, we consider a MH importance sampling estimator which explicitly incorporates all proposed states generated by the MH algorithm. The estimator satisfies a strong law of large numbers as well as a central limit theorem, and, in addition, we provide an explicit mean squared error bound. Remarkably, the asymptotic variance of the MH importance sampling estimator does not involve any correlation term, in contrast to its classical counterpart. Moreover, although the analyzed estimator uses the same amount of information as the classical MH estimator, it can outperform the latter in scenarios of moderate dimension, as indicated by numerical experiments.


Introduction
Motivation. A fundamental task in computational science and statistics is the computation of expectations w.r.t. a partially unknown probability measure µ on a measurable space (G, G) determined by

	µ(dx) = Z⁻¹ ρ(x) µ0(dx),    (1)

where µ0 denotes a σ-finite reference measure on G and where the normalizing constant Z = ∫_G ρ(x) µ0(dx) ∈ (0, ∞) is typically unknown. Thus, given a function f: G → R, the goal is to compute E_µ(f) = ∫_G f(x) µ(dx) using only evaluations of f and ρ. Here, a plain Monte Carlo estimator based on independent µ-distributed random variables is, in general, infeasible due to the unknown normalizing constant Z and the fact that we only have access to function evaluations of ρ. However, a possible and very common approach is the construction of a Markov chain for approximate sampling w.r.t. µ. In particular, the well-known Metropolis-Hastings (MH) algorithm provides a general scheme for simulating a Markov chain (X_n)_{n∈N} with stationary distribution µ. Under appropriate assumptions the distribution of X_n of such a MH Markov chain converges to µ, and the classical MCMC estimator for E_µ(f) is then given by the sample average

	S_n(f) := (1/n) ∑_{k=1}^n f(X_k).    (2)

The statistical efficiency of S_n(f) depends strongly on the autocorrelation of the time series (f(X_n))_{n∈N}; in particular, a large autocorrelation diminishes the efficiency of S_n(f). An essential part of the MH algorithm is the acceptance/rejection step: given X_n = x, a sample y of Y_n ∼ P(x, ·) is drawn, where P denotes a proposal transition kernel. Only with a certain probability is this y accepted as the next state, i.e., X_{n+1} := y; otherwise it is rejected, so that X_{n+1} := x. This indicates that a potential reason for a high autocorrelation is the rejection of proposed states. Hence, the question arises whether it is possible to derive a more efficient estimator for E_µ(f) based on the potentially less correlated time series (f(Y_n))_{n∈N} determined by the sequence of proposals Y_n.

Main Result. In
this paper we consider and analyze a modification of the classical estimator from (2) of the form

	A_n(f) := ∑_{k=1}^n w(X_k, Y_k) f(Y_k) / ∑_{k=1}^n w(X_k, Y_k),

which we call the MH importance sampling estimator. The (importance) weight w is chosen in such a way that we obtain a consistent estimator. More precisely, we set

	w(x, y) := (dµ0/dP(x,·))(y) · ρ(y),

assuming the existence of the density dµ0/dP(x,·) for each x ∈ G. The appeal of the modified estimator is that it is still based on the MH algorithm and needs no additional function evaluations of ρ and f, while, after appropriate tuning in scenarios of moderate dimension, it can outperform the classical estimator, as we illustrate in a few numerical examples in Section 4. Moreover, it can be seen and studied as an importance sampling corrected MCMC estimator, or as an importance sampling estimator using an underlying MH Markov chain to provide the importance distributions. In this paper we take the first point of view and exploit the fact that the augmented MH Markov chain (X_n, Y_n)_{n∈N} inherits several desirable properties of the original MH Markov chain (X_n)_{n∈N}, such as Harris recurrence, see Lemma 10. Using those properties we prove the following results for the estimator A_n:
• Theorem 14: a strong law of large numbers (SLLN), i.e., for functions f ∈ L¹(µ) we have almost surely A_n(f) → E_µ(f) as n → ∞;
• Theorem 15: a central limit theorem (CLT), that is, for any f ∈ L²(µ) the scaled error √n(A_n(f) − E_µ(f)) converges in distribution to a mean-zero normal distribution N(0, σ²_A(f)) with asymptotic variance

	σ²_A(f) = ∫_G ∫_G |f(y) − E_µ(f)|² (dµ/dP(x,·))(y) µ(dy) µ(dx);

• Theorem 20: an explicit bound on the mean squared error E|A_n(f) − E_µ(f)|².

Here, we denote by L^p(µ), p ∈ [1, ∞), the Lebesgue space of functions f: G → R which are p-integrable w.r.t. µ. It is remarkable that the asymptotic variance σ²_A(f) in the CLT contains no covariance or correlation term. However, it does contain the density of µ w.r.t. P(x, ·), which quantifies the difference between the employed importance distribution given by the proposal transition kernel P(x, ·) and the desired distribution µ.
Related literature. Importance sampling is a well-established technique for approximating expectations, see [CMR05, Owe13] for textbook introductions, and it has recently attracted considerable attention in terms of both theory and application, see for example [APSAS17, CD18, Hin10, Sch15]. In particular, its combination with Markov chain Monte Carlo methods has been exploited by several authors. For example, Botev et al. [BLT13] use the MH algorithm in order to approximately sample from the minimum variance importance distribution. Vihola et al. [VHF18] consider general importance sampling estimators based on an underlying Markov chain, and Martino et al. [MELC16] propose a hierarchical approach where a mixture importance distribution close to µ is constructed based on the (accepted) samples X_k of the MH algorithm. Schuster and Klebanov [SK20] follow a similar idea to the latter, but rather use the proposals Y_k of the MH algorithm and their asymptotic distribution as the importance distribution. Indeed, the idea of using all proposed states generated in the MH algorithm for estimating expectations such as E_µ(f) is not new. For instance, Frenkel suggests in [Fre04, Fre06] an approximation scheme which recycles the rejected states of a MH algorithm. In the work of Delmas and Jourdain [DJ09] this method is used in a control variate variance reduction approach and analyzed in a general framework. It turns out that for the Barker algorithm the method is indeed beneficial, whereas for the MH algorithm this is not necessarily the case. In particular, an estimator similar to A_n(f) as above, but for sampling from normalized densities, was already introduced by Casella and Robert [CR96]. However, besides some numerical examples it was not studied further in [CR96]; their main focus, variance reduction of sampling methods by Rao-Blackwellization, was extended by [AP05, DR11]. In particular, the theoretical results of Douc and Robert [DR11] provide variance reduction guarantees for their MH based estimator while keeping the additional computational cost under control. In contrast, using the estimator A_n does not increase the number of function evaluations, but we also do not provide a guarantee of improvement.
Outline. First, we provide some basic preliminaries on Markov chains and the corresponding classical MCMC estimator S_n. In Section 3 we introduce the MH importance sampling estimator, study properties of the aforementioned augmented MH Markov chain (X_n, Y_n)_{n∈N}, and state the main results. In Section 4 we compare the classical MCMC estimator S_n with A_n numerically in two representative examples, and we draw some conclusions in Section 5.

Preliminaries on Markov chain Monte Carlo
Let (Ω, F, P) be a probability space. The random variables considered throughout the paper (mainly) map from this probability space to a measurable space (G, G). A (time-homogeneous) Markov chain is a sequence of random variables (X_n)_{n∈N} which satisfies, for any A ∈ G and any n ∈ N, P-almost surely

	P(X_{n+1} ∈ A | X_1, …, X_n) = K(X_n, A),

where K: G × G → [0, 1] denotes a transition kernel, i.e., K(x, ·) is a probability measure for any x ∈ G and the mapping x → K(x, A) is measurable for any A ∈ G. Our focus is on Markov chains designed for approximate sampling from the distribution µ. Such Markov chains typically have µ as their stationary distribution, i.e., their transition kernels K satisfy µK = µ, where µK(A) := ∫_G K(x, A) µ(dx) for any A ∈ G.
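For intuition, stationarity µK = µ can be checked directly in the finite-state case, where a transition kernel is simply a row-stochastic matrix. The sketch below uses made-up demo values for the entries of K; it extracts the stationary distribution as the left eigenvector of K for eigenvalue 1.

```python
import numpy as np

# A toy transition kernel K on a three-point state space, written as a
# row-stochastic matrix K[x, y] = K(x, {y}); the numbers are demo values.
K = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# mu is stationary iff mu K = mu, i.e. mu is a left eigenvector of K for
# eigenvalue 1; we extract it and normalize it to a probability vector.
eigvals, eigvecs = np.linalg.eig(K.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
mu /= mu.sum()

print(mu)                          # the stationary distribution
assert np.allclose(mu @ K, mu)     # stationarity: mu K = mu
```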

The Metropolis-Hastings algorithm
Let P : G × G → [0, 1] be a proposal transition kernel satisfying the following structural assumption.
Assumption 1. For any x ∈ G the proposal P(x, ·) possesses a density p(x, ·) w.r.t. µ0, and for any x, y ∈ G we assume that ρ(y) > 0 implies p(x, y) > 0.

This condition has some useful implications, see Proposition 2. Moreover, for example for G ⊆ R^d, G = B(G) and µ0 being the Lebesgue measure, any Gaussian proposal, such as a Gaussian or Langevin random walk, satisfies it. Assumption 1 allows us to define the finite "acceptance ratio" r(x, y) of the MH algorithm for any x, y ∈ G according to [Tie98, Section 2] by

	r(x, y) := ρ(y)p(y, x) / (ρ(x)p(x, y))  if ρ(x)p(x, y) > 0,  and  r(x, y) := 1 otherwise.
Then the MH algorithm, which provides a realization of a Markov chain (X_n)_{n∈N}, works as follows:

Algorithm 1. Assume that X_n = x; then the next state X_{n+1} is generated by the following steps:
1. Draw Y_n ∼ P(x, ·) and U ∼ Unif[0, 1] independently; call the results y and u, respectively.
2. Compute the acceptance probability α(x, y) := min{1, r(x, y)}.
3. Accept y with probability α(x, y), that is, if u < α(x, y), then set X_{n+1} := y; otherwise reject y and set X_{n+1} := x.

The Markov chain generated by the MH algorithm is called the MH Markov chain, and its transition kernel, which we also call the MH (transition) kernel, is given by

	K(x, dy) := α(x, y) P(x, dy) + δ_x(dy) ∫_G α_c(x, z) P(x, dz),    (3)

where α_c(x, y) := 1 − α(x, y). It is well known that the transition kernel K in (3) is reversible w.r.t. µ, that is, K(x, dy)µ(dx) = K(y, dx)µ(dy). In particular, this implies that µ is a stationary distribution of K.
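Algorithm 1 takes only a few lines in code. The sketch below assumes, purely for illustration, a one-dimensional standard-normal target ρ(x) = exp(−x²/2) and a symmetric Gaussian random walk proposal, for which the ratio r(x, y) reduces to ρ(y)/ρ(x):

```python
import numpy as np

rng = np.random.default_rng(0)

def rho(x):
    # unnormalized target density; a standard normal is assumed for this demo
    return np.exp(-0.5 * x**2)

def mh_step(x, s):
    """One transition of Algorithm 1 with random walk proposal P(x,.) = N(x, s^2)."""
    y = x + s * rng.standard_normal()   # 1. draw Y ~ P(x, .) (and U ~ Unif[0,1])
    r = rho(y) / rho(x)                 # 2. acceptance ratio; p is symmetric here
    if rng.uniform() < min(1.0, r):     # 3. accept with probability alpha(x, y)
        return y
    return x                            #    otherwise reject: X_{n+1} = x

n, s = 50_000, 2.4
X = np.empty(n)
x = 0.0
for k in range(n):
    x = mh_step(x, s)
    X[k] = x

print(X.mean(), X.var())   # sample averages approximate E_mu: about 0 and 1
```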

Strong law of large numbers, central limit theorem and mean squared error bound
For convergence, in particular the strong law of large numbers, we need the concepts of φ-irreducibility and Harris recurrence: a Markov chain with transition kernel K is called φ-irreducible for a σ-finite measure φ if, for every x ∈ G and every A ∈ G with φ(A) > 0, there exists an n ∈ N with K^n(x, A) > 0; it is Harris recurrent (w.r.t. φ) if, in addition, for every such A and every starting point the chain visits A infinitely often with probability one. It is proven in [Tie94, Corollary 2] that µ-irreducibility of a MH Markov chain (X_n)_{n∈N} implies Harris recurrence. Moreover, it is known that Assumption 1 ensures µ-irreducibility and, thus, Harris recurrence:

Proposition 2 ([MT96, Lemma 1.1]). Given Assumption 1, the Markov chain (X_n)_{n∈N} realized by the MH algorithm is µ-irreducible.
We recall the SLLN of the classical MCMC estimator S n (f ) given in (2) based on the concept of Harris recurrence.
Theorem 3 (SLLN for S_n, [MT93, Theorem 17.0.1]). Let (X_n)_{n∈N} be a Harris recurrent Markov chain with stationary distribution µ on G and let f ∈ L¹(µ). Then, for any initial distribution, i.e., any distribution of X_1,

	lim_{n→∞} S_n(f) = E_µ(f)  almost surely.

This theorem justifies that the classical MCMC method based on the MH algorithm yields a consistent estimator. Moreover, a central limit theorem can also be shown for S_n(f). Deriving a CLT is an important issue in studying MCMC, and many conditions implying a CLT are known; for an overview we refer to the survey paper [Jon04] and the references therein. We require some further terminology. Let K: L²(µ) → L²(µ) be the transition operator associated to the transition kernel K of a Markov chain (X_n)_{n∈N}, given by

	(Kf)(x) := ∫_G f(y) K(x, dy).

For n ≥ 2 and f ∈ L²(µ) we have (K^n f)(x) = ∫_G f(y) K^n(x, dy), where K^n is the n-step transition kernel, recursively defined by K^n(x, A) := ∫_G K^{n−1}(y, A) K(x, dy). Note that the transition operator recovers the transition kernel, namely, for n ≥ 1 we have K^n(x, A) = (K^n 1_A)(x).

We also need the concept of the asymptotic variance: let (X*_n)_{n∈N} denote a Markov chain with transition kernel K starting at stationarity, i.e., the stationary distribution µ is also the initial one. Then, for f ∈ L²(µ), the asymptotic variance of the classical MCMC estimator S_n(f) for E_µ(f) is given by

	σ²_S(f) := lim_{n→∞} (1/n) Var( ∑_{k=1}^n f(X*_k) ),    (4)

whenever the limit exists. One easily sees that the asymptotic variance admits the following representation in terms of the autocorrelation of the time series (f(X*_n))_{n∈N}, namely,

	σ²_S(f) = Var_µ(f) ( 1 + 2 ∑_{k=1}^∞ Corr(f(X*_1), f(X*_{1+k})) ),

where Var_µ(f) := E_µ((f − E_µ(f))²) denotes the variance of f w.r.t. µ and Corr(·, ·) the correlation between random variables.
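The autocorrelation representation of σ²_S(f) can be checked numerically on a two-state chain, for which the correlations Corr(f(X*_1), f(X*_{1+k})) = λ^k are available in closed form, λ being the second eigenvalue of K. The transition probabilities below are made up for the demo.

```python
import numpy as np

# Two-state chain with transition matrix K and stationary distribution pi.
a, b = 0.3, 0.2
K = np.array([[1 - a, a], [b, 1 - b]])
pi = np.array([b, a]) / (a + b)
f = np.array([0.0, 1.0])            # f = indicator of the second state

var = pi @ (f - pi @ f) ** 2        # Var_mu(f)

def corr(k):
    # Corr(f(X*_1), f(X*_{1+k})) computed from the k-step kernel K^k
    Kk = np.linalg.matrix_power(K, k)
    cov = pi @ ((f - pi @ f) * (Kk @ f - pi @ f))
    return cov / var

# autocorrelation representation: sigma_S^2 = Var(f) * (1 + 2 * sum_k corr_k)
sigma2 = var * (1 + 2 * sum(corr(k) for k in range(1, 200)))

# for a two-state chain corr_k = lam^k with lam = 1 - a - b, hence
# sigma_S^2 = Var(f) * (1 + lam) / (1 - lam)
lam = 1 - a - b
assert np.isclose(sigma2, var * (1 + lam) / (1 - lam))
```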
Theorem 4 (CLT for S_n). Let (X_n)_{n∈N} be a Harris recurrent Markov chain with transition kernel K and stationary distribution µ, and let f ∈ L²(µ). If
1. ∑_{k=1}^∞ ‖K^k f − E_µ(f)‖_{L²(µ)} < ∞, or
2. K is reversible w.r.t. µ and σ²_S(f) < ∞,
then we have for any initial distribution

	√n (S_n(f) − E_µ(f)) → N(0, σ²_S(f))  in distribution as n → ∞,

with σ²_S(f) as in (4).
The theorem is justified by the following arguments. First, by [MT96, Proposition 17.1.6] it is sufficient to have a CLT when the initial distribution is the stationary one. In that case the Markov chain is an ergodic stationary process. Under condition 1, where no reversibility is necessary, the statement then follows by arguments derived in the introduction of [MW00]. Under condition 2 the statement follows from [KV86, Corollary 1.5]. Although MH Markov chains are µ-reversible by construction, we encounter in the following a non-reversible Markov chain and derive a CLT by verifying condition 1.
The SLLN and the CLT only contain asymptotic statements, but one might be interested in explicit error bounds. For f ∈ L²(µ) the mean squared error of the classical MCMC estimator S_n(f) is given by E|S_n(f) − E_µ(f)|². Depending on the convergence properties of the underlying Markov chain, different error bounds are known, see for example [JO10, LMN13, LN11, Rud09, Rud10, Rud12]. In particular, there is a relation between the asymptotic variance σ²_S(f) and the mean squared error of S_n: if the chain starts at stationarity, then n · E|S_n(f) − E_µ(f)|² → σ²_S(f) as n → ∞, and some of the error bounds have the same asymptotic behavior, see [LMN13] and also [Rud12].

The MH importance sampling estimator
The CLT for the MCMC estimator S_n(f) shows that its statistical efficiency, determined by the asymptotic variance σ²_S(f), is diminished by a large autocorrelation of (f(X*_n))_{n∈N} or (f(X_n))_{n∈N}, respectively. A reason for a large autocorrelation is the rejection of proposed states. In particular, the sequence of proposed states (f(Y_n))_{n∈N} is potentially less correlated than the MH Markov chain itself, since no rejection is involved. For example, if a proposal kernel P on G = R^d is absolutely continuous w.r.t. the Lebesgue measure and X_n ∼ µ, then consecutive proposals are almost surely distinct, whereas consecutive states coincide with positive probability due to rejections. Thus, one may ask whether it is beneficial, in terms of a higher statistical efficiency, to consider an estimator based on (f(Y_n))_{n∈N} rather than (f(X_n))_{n∈N}. Such an estimator might be of the form

	A_n(f) = ∑_{k=1}^n w_k f(Y_k) / ∑_{k=1}^n w_k

with suitable weights w_k. The weights are needed because Y_n ∼ P(X_n, ·) does not follow the distribution µ; in fact, even if X_n ∼ µ, then Y_n ∼ µP. Hence, we need to apply an importance sampling correction in order to obtain a consistent estimator A_n(f). To this end, Assumption 1 ensures the existence of

	ρ̄(x, y) := ρ(y)/p(x, y)  if p(x, y) > 0,  and  ρ̄(x, y) := 0 otherwise.    (5)

Indeed, by the fact that p(x, y) = 0 implies ρ(y) = 0 (Assumption 1) we have

	∫_G ρ̄(x, y) P(x, dy) = ∫_G ρ(y) µ0(dy) = Z  for every x ∈ G.

Moreover, the acceptance ratio r(x, y) can be expressed only in terms of ρ̄:

	r(x, y) = ρ̄(x, y)/ρ̄(y, x)  whenever ρ̄(y, x) > 0.

As it turns out, ρ̄ provides the correct weights w_k for an estimator A_n(f), as indicated by the next result.

Proposition 5. Let Assumption 1 be satisfied. Then, for any f ∈ L¹(µ),

	∫_G ∫_G ρ̄(x, y) f(y) P(x, dy) µ(dx) = Z · E_µ(f).
Proposition 5 motivates the following estimator.
Definition 6. Let Assumption 1 be satisfied, let (X_n)_{n∈N} be a MH Markov chain, and let (Y_n)_{n∈N} denote the corresponding proposal sequence. Then, given f ∈ L¹(µ), the MH importance sampling estimator for E_µ(f) is

	A_n(f) := ∑_{k=1}^n ρ̄(X_k, Y_k) f(Y_k) / ∑_{k=1}^n ρ̄(X_k, Y_k),    (6)

with ρ̄ defined in (5).
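Definition 6 translates directly into code. The sketch below assumes, for the demo only, a one-dimensional standard-normal target and a Gaussian random walk proposal; the weights ρ̄(X_k, Y_k) = ρ(Y_k)/p(X_k, Y_k) are computed from the stored pairs (X_k, Y_k), where Y_k is the proposal drawn from X_k.

```python
import numpy as np

rng = np.random.default_rng(1)

def rho(x):
    # unnormalized target density; a standard normal is assumed for this demo
    return np.exp(-0.5 * x**2)

def p(x, y, s):
    # density p(x, y) of the proposal P(x, .) = N(x, s^2) w.r.t. Lebesgue measure
    return np.exp(-0.5 * ((y - x) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def mh_with_proposals(n, s, x0=0.0):
    """Run the MH algorithm, storing the pairs (X_k, Y_k) with Y_k ~ P(X_k, .)."""
    X = np.empty(n); Y = np.empty(n)
    x = x0
    for k in range(n):
        y = x + s * rng.standard_normal()
        X[k], Y[k] = x, y                               # record state and proposal
        if rng.uniform() < min(1.0, rho(y) / rho(x)):   # symmetric proposal
            x = y
    return X, Y

n, s = 50_000, 2.0
X, Y = mh_with_proposals(n, s)

# importance weights rho_bar(X_k, Y_k) = rho(Y_k) / p(X_k, Y_k)
w = rho(Y) / p(X, Y, s)

def A_n(f):
    # the self-normalized MH importance sampling estimator of Definition 6
    return np.sum(w * f(Y)) / np.sum(w)

print(A_n(lambda y: y), A_n(lambda y: y**2))   # approx E(x) = 0 and E(x^2) = 1
```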
Remark 7. The dependence of A_n on ρ is explicit through ρ̄, whereas the dependence of the classical estimator S_n realized with the MH algorithm on ρ is rather implicit: it appears only in the acceptance probability of the MH algorithm. In many situations the computational cost of evaluating ρ is much larger than that of evaluating f, so it seems counterintuitive not to make any further use of the value of ρ at the proposed state, which was expensive to compute.
Remark 8. The estimator A_n(f) is related to self-normalized importance sampling estimators for E_µ(f) of the form

	∑_{k=1}^n w_k f(ξ_k) / ∑_{k=1}^n w_k,

where (ξ_k)_{k∈N} is an arbitrary sequence of random variables ξ_k ∼ φ_k and where w_k = (dµ0/dφ_k)(ξ_k) ρ(ξ_k) are the corresponding importance weights. For (ξ_k)_{k∈N} = (Y_k)_{k∈N} being the proposal sequence of the MH algorithm realizing a µ-reversible Markov chain (X_k)_{k∈N}, we recover A_n(f) with φ_k = P(X_k, ·). In other words, A_n(f) can be viewed as an importance sampling estimator where the importance distributions φ_k are determined by a MH Markov chain.
Remark 9. Related to the previous remark, we highlight a recent approach similar to, but slightly different from, ours. Namely, the authors of [SK20] propose and study a self-normalized importance sampler where the importance distribution is φ_k = µP, i.e., the stationary distribution of the proposal sequence of the MH algorithm. Moreover, we remark that the particular form of the estimator A_n(f) in the case of already normalized weights appeared in [CR96, Section 5], but without any further analysis. Since self-normalization is rather inevitable in practice, we continue studying A_n(f) as in (6).

The augmented MH Markov chain and its properties
In order to analyze the MH importance sampling estimator A_n, we consider the augmented MH Markov chain (X_n, Y_n)_{n∈N} on G × G consisting of the original MH Markov chain (X_n)_{n∈N} and the associated sequence of proposals (Y_n)_{n∈N}. The transition kernel K_aug of the augmented MH Markov chain is given by

	K_aug((x, y), du dv) := δ_y(du) P(y, dv) α(x, y) + δ_x(du) P(x, dv) α_c(x, y)

for x, y ∈ G, where δ_z denotes the Dirac measure at z ∈ G. We now derive useful representations of K_aug and the MH kernel K which simplify several arguments. To this end, we define the probability measure

	ν(dx dy) := P(x, dy) µ(dx)    (7)

on (G × G, G ⊗ G) and let L²(ν) be the space of functions g: G × G → R which satisfy

	∫_G ∫_G |g(x, y)|² P(x, dy) µ(dx) < ∞.

The kernel K_aug induces a transition operator K_aug: L²(ν) → L²(ν). Furthermore, for a given proposal transition kernel P we define a linear operator P: L²(ν) → L²(µ) by

	(Pg)(x) := ∫_G g(x, y) P(x, dy).

It is easily seen that its adjoint operator P*: L²(µ) → L²(ν) is given by

	(P*f)(x, y) = f(x),

i.e., ⟨Pg, f⟩_µ = ⟨g, P*f⟩_ν, where ⟨·,·⟩_µ and ⟨·,·⟩_ν denote the inner products in L²(µ) and L²(ν), respectively. Let H be the transition kernel on G × G given by

	H((x, y), du dv) := α(x, y) δ_y(du) δ_y(dv) + α_c(x, y) δ_x(du) δ_x(dv),

and let H: L²(ν) → L²(ν) denote the associated transition operator. The following properties are useful for the subsequent analysis.
Lemma 10. With the above notation we have that
3. K = P H P* and K_aug = H P* P;
4. ν given in (7) is a stationary distribution of K_aug;
5. K^n_aug = H P* K^{n−1} P and K^n = P K^{n−1}_aug H P* for n ≥ 2.
To 2.: It is easily seen that P*P is a projection. Moreover, it is well known that the norm of an operator and that of its adjoint coincide, which yields the statement in combination with 1.

To 3.: The representations can be verified by a straightforward calculation.
To 4.: For any A, B ∈ G we have

	ν K_aug(A × B) = ∫_G ∫_G [ α(x, y) 1_A(y) P(y, B) + α_c(x, y) 1_A(x) P(x, B) ] P(x, dy) µ(dx)
	              = ∫_G ∫_G 1_A(u) P(u, B) K(x, du) µ(dx) = ∫_A P(u, B) µ(du) = ν(A × B),

where the last-but-one equality follows from the fact that µ is a stationary distribution of K. Since the Cartesian products A × B provide a generating system of G ⊗ G, the result follows by the uniqueness theorem for probability measures.
To 5.: These representations are a direct consequence of 3.
Note that statement 5 of Lemma 10 yields for n ≥ 1 and g ∈ L²(ν) that

	K^n_aug g(x, y) = α(x, y) (K^{n−1} P g)(y) + α_c(x, y) (K^{n−1} P g)(x).    (8)

Remark 11. In general, the transition kernel K_aug is not reversible w.r.t. ν. Since reversibility is equivalent to self-adjointness of the Markov operator, this can be seen from the fact that K*_aug = P* P H, which does not necessarily coincide with K_aug. For the convenience of the reader we also provide a simple example illustrating the non-reversibility. Consider a finite state space G = {1, 2} equipped with the counting measure µ0, with ρ(i) = 1/2 and P(i, j) = 1/2 for all i, j ∈ G, so that α(i, j) = 1. Then the transition matrix K_aug is given by

	K_aug((i, j), (k, ℓ)) = δ_j({k})/2

for any i, j, k, ℓ ∈ G. Here reversibility is equivalent to K_aug((i, j), (k, ℓ)) = K_aug((k, ℓ), (i, j)) for all i, j, k, ℓ ∈ G, which is not satisfied for i = j = ℓ = 1 and k = 2.

Now, using Lemma 10, we show that stability properties of the MH kernel K pass over to K_aug. The proof of the following result is adapted from [VHF18, Lemma 24].
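The finite-state example of Remark 11 can be checked mechanically. The sketch below (states relabeled 0/1 for indexing) builds the 4 × 4 matrix K_aug, verifies stationarity of ν, and confirms the non-reversibility.

```python
import numpy as np

# Finite example: G = {0, 1}, rho(i) = 1/2, P(i, j) = 1/2, hence alpha = 1 and
# K_aug((i, j), (k, l)) = 1{j == k} / 2 on the product space G x G.
states = [(i, j) for i in range(2) for j in range(2)]
K_aug = np.array([[(1.0 if j == k else 0.0) / 2 for (k, l) in states]
                  for (i, j) in states])

# nu(dx dy) = P(x, dy) mu(dx) is uniform on the four product states here
nu = np.full(4, 0.25)

assert np.allclose(nu @ K_aug, nu)      # nu is stationary for K_aug

# reversibility would mean nu(s) K_aug(s, t) == nu(t) K_aug(t, s) for all s, t,
# i.e. symmetry of diag(nu) @ K_aug -- which fails here
D = np.diag(nu)
print(np.allclose(D @ K_aug, (D @ K_aug).T))   # False: K_aug is not reversible
```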
Lemma 12. Assume that φ is a σ-finite measure on (G, G), let K denote the MH kernel as in (3), and define the measure φP(dx dy) := P(x, dy) φ(dx) on (G × G, G ⊗ G).

• If K is φ-irreducible, then K_aug is φP-irreducible.
• If K is Harris recurrent (w.r.t. φ), then K_aug is also Harris recurrent (w.r.t. φP).
Proof. For A ∈ G ⊗ G and x ∈ G define

	A_2(x) := {y ∈ G : (x, y) ∈ A}  and  A_1 := {x ∈ G : P(x, A_2(x)) > 0},

so that A_2(x) is the slice of A for fixed first component x and A_1 is the "projection" of the set A onto the first component space. For ε > 0 let

	A_1(ε) := {x ∈ G : P(x, A_2(x)) ≥ ε}.

By the use of (8) we prove the irreducibility statement: assume that A ∈ G ⊗ G with φP(A) > 0. Then φ(A_1) > 0, since otherwise φP(A) = ∫_{A_1} P(x, A_2(x)) φ(dx) is zero. By the same argument, and since A_1 = ∪_{n∈N} A_1(1/n), there exists an ε > 0 such that φ(A_1(ε)) > 0. Because of the φ-irreducibility of K, for x, y ∈ G there exist n_x, n_y ∈ N such that K^{n_x}(x, A_1(ε)) > 0 and K^{n_y}(y, A_1(ε)) > 0. Hence, if α(x, y) > 0, then

	K^{n_y+1}_aug((x, y), A) ≥ α(x, y) ∫_{A_1(ε)} P(u, A_2(u)) K^{n_y}(y, du) ≥ α(x, y) ε K^{n_y}(y, A_1(ε)) > 0.

Otherwise, if α_c(x, y) = 1, we obtain analogously K^{n_x+1}_aug((x, y), A) ≥ ε K^{n_x}(x, A_1(ε)) > 0. In other words, for (x, y) ∈ G × G we find an n ∈ N (depending on α(x, y)) such that K^n_aug((x, y), A) > 0, which proves the φP-irreducibility.

We turn to the Harris recurrence: let K be Harris recurrent w.r.t. φ and let φP(A) > 0. As above, we can conclude that there exists an ε > 0 such that φ(A_1(ε)) > 0. By φ(A_1(ε)) > 0 and the fact that (X_n)_{n∈N} is Harris recurrent w.r.t. φ, with probability one there are infinitely many distinct times (τ_k)_{k∈N} such that X_{τ_k} ∈ A_1(ε) for every k ∈ N. Note that by construction the 1_{A_2(X_{τ_k})}(Y_{τ_k}) are Bernoulli random variables with success probability of at least ε, and they are conditionally independent given (X_{τ_k})_{k∈N}. Hence, almost surely 1_{A_2(X_{τ_k})}(Y_{τ_k}) = 1 for infinitely many k, i.e., (X_n, Y_n) ∈ A infinitely often, which shows that the augmented MH Markov chain is Harris recurrent.
Remark 13. Another consequence of Lemma 10, interesting in its own right, is that geometric ergodicity is also inherited by the augmented MH Markov chain. However, since this fact is not relevant for the remainder of the paper, we postpone the discussion of geometric ergodicity and its inheritance to Appendix A.

Strong law of large numbers and central limit theorem
A consistency statement in the form of a SLLN for the MH importance sampling estimator defined in (6) is stated and proven in the following. A key argument in the proofs is the inheritance of Harris recurrence from (X_n)_{n∈N} to the augmented MH Markov chain (X_n, Y_n)_{n∈N}.
Theorem 14. Let Assumption 1 be satisfied. Then, for any initial distribution and any f ∈ L¹(µ) we have

	lim_{n→∞} A_n(f) = E_µ(f)  almost surely.

Proof. Assumption 1 implies µ-irreducibility and Harris recurrence of the MH Markov chain (X_n)_{n∈N} due to Proposition 2. This yields, due to Lemma 12, that the transition kernel K_aug is also Harris recurrent. Hence, by Theorem 3 we have for each h ∈ L¹(ν) that (1/n) ∑_{k=1}^n h(X_k, Y_k) → E_ν(h) almost surely. Define h_1(x, y) := ρ̄(x, y) f(y) and h_2(x, y) := ρ̄(x, y), and write

	A_n(f) = ( (1/n) ∑_{k=1}^n h_1(X_k, Y_k) ) / ( (1/n) ∑_{k=1}^n h_2(X_k, Y_k) ).    (9)

Thus, the numerator and denominator on the right-hand side of (9) converge a.s. to E_ν(h_1) = Z E_µ(f) and E_ν(h_2) = Z, respectively. The assertion then follows by the continuous mapping theorem and E_ν(h_1)/E_ν(h_2) = E_µ(f).

The next goal is to derive a CLT, which provides a way to quantify the asymptotic behavior of A_n. Since the augmented Markov chain (X_n, Y_n)_{n∈N} is, in general, not reversible w.r.t. ν, we aim to use condition 1 of Theorem 4.
Theorem 15. Let Assumption 1 be satisfied and assume for f ∈ L¹(µ) that

	σ²_A(f) := ∫_G ∫_G |f(y) − E_µ(f)|² (dµ/dP(x,·))(y) µ(dy) µ(dx)

is finite. Then, for any initial distribution, we have

	√n (A_n(f) − E_µ(f)) → N(0, σ²_A(f))  in distribution as n → ∞.

Proof. We frequently use the identity ρ̄(x, y) p(x, y) = ρ(y) for x, y ∈ G. Define h_3(x, y) := ρ̄(x, y)(f(y) − E_µ(f)). Note that E_ν(h_3) = 0 and h_3 ∈ L²(ν), since

	‖h_3‖²_{L²(ν)} = ∫_G ∫_G ρ̄(x, y)² |f(y) − E_µ(f)|² P(x, dy) µ(dx) = Z² σ²_A(f) < ∞.

Moreover, (P h_3)(x) = ∫_G ρ̄(x, y)(f(y) − E_µ(f)) P(x, dy) = ∫_G ρ(y)(f(y) − E_µ(f)) µ0(dy) = 0, so that with the representation (8) one obtains K^k_aug h_3 = 0 for any k ≥ 2, and by the same argument K_aug h_3 = 0. Hence, for the augmented MH Markov chain (X_n, Y_n)_{n∈N} condition 1 of Theorem 4 is satisfied for the function h_3, and by the inheritance of the Harris recurrence from K to K_aug, see Lemma 12, we get

	√n · (1/n) ∑_{k=1}^n h_3(X_k, Y_k) → N(0, ‖h_3‖²_{L²(ν)})  in distribution.

Here, by exploiting again the fact that K^k_aug h_3 = 0 for k ≥ 1, the asymptotic variance contains no correlation terms and equals ‖h_3‖²_{L²(ν)} = Z² σ²_A(f). The denominator (1/n) ∑_{k=1}^n ρ̄(X_k, Y_k) converges by Theorem 3 almost surely to Z, and since

	√n (A_n(f) − E_µ(f)) = ( √n (1/n) ∑_{k=1}^n h_3(X_k, Y_k) ) / ( (1/n) ∑_{k=1}^n ρ̄(X_k, Y_k) ),

Slutsky's theorem proves the assertion.
Remark 16. It is remarkable that the asymptotic variance σ²_A(f) of A_n(f) coincides with the asymptotic variance of the self-normalized importance sampling estimator

	∑_{k=1}^n ρ̄(X̃_k, Ỹ_k) f(Ỹ_k) / ∑_{k=1}^n ρ̄(X̃_k, Ỹ_k)  with (X̃_k, Ỹ_k) i.i.d. ∼ ν.

Here, ν denotes the stationary measure of the augmented MH Markov chain given in (7). Hence, the fact that A_n(f) is based on the, in general, dependent sequence (X_k, Y_k)_{k∈N} of the augmented MH Markov chain surprisingly does not affect its asymptotic variance.
Remark 17. Often it is of interest to estimate the asymptotic variance appearing in a CLT. For a given f ∈ L¹(µ) the corresponding quantity, given by Theorem 15, can be rewritten as

	σ²_A(f) = Z⁻² E_ν( ρ̄(X, Y)² |f(Y) − E_µ(f)|² ),  (X, Y) ∼ ν.

Given this representation of σ²_A(f), we suggest estimating it by

	σ̂²_{A,n} := n ∑_{j=1}^n ρ̄(X_j, Y_j)² ( f(Y_j) − (1/n) ∑_{i=1}^n f(X_i) )² / ( ∑_{j=1}^n ρ̄(X_j, Y_j) )²,

where (1/n) ∑_{j=1}^n f(X_j) can also be replaced by A_n(f). We now turn to a non-asymptotic analysis, where the error criterion is the mean squared error.

Mean squared error bound
In this section we provide explicit bounds for the mean squared error of A_n. These estimates are an immediate consequence of the following two lemmas, which rely on arguments similar to those in [MN07, Theorem 2] and [APSAS17, Theorem 2.1].
Then, for bounded f, i.e., ‖f‖_∞ := sup_{x∈G} |f(x)| < ∞, we have the following bound.

Proof. Observe that D(1) = Z. Further, using the fact that (a + b)² ≤ 2a² + 2b² for any a, b ∈ R gives the assertion.

Lemma 19. Assume that the initial distribution is the stationary one, that is, X_1 ∼ µ.
Then, with the notation from Lemma 18, we have the following.

Proof. Observe that

	∫_G ∫_G f(y) ρ̄(x, y) P(x, dy) µ(dx) = Z E_µ(f).

Define the centered function g_c(x, y) := ρ̄(x, y) f(y) − Z E_µ(f) for any x, y ∈ G. Exploiting the fact that the initial distribution is the stationary one, we obtain for i ≥ j the covariance of g_c(X_i, Y_i) and g_c(X_j, Y_j) in terms of K^{i−j}_aug. In the case k := i − j > 1, representation (8) together with

	(P g_c)(x) = ∫_G ρ̄(x, y) f(y) P(x, dy) − Z E_µ(f) = 0

leads to K^k_aug g_c(x, y) = 0. By similar arguments we obtain K_aug g_c(x, y) = 0. Hence, only the diagonal terms remain.

By the combination of both lemmas we derive the following theorem.
Theorem 20. Assume that the initial distribution of an augmented MH Markov chain (X_n, Y_n)_{n∈N} is the stationary one, i.e., X_1 ∼ µ. Then, for bounded f: G → R, we obtain an explicit mean squared error bound of the form E|A_n(f) − E_µ(f)|² ≤ c(f)/n.

Remark 21. Let us mention two things here. First, we assumed that the initial distribution is the stationary one. This assumption is certainly restrictive; we refer to [LMN13, Rud12] for techniques to derive explicit error bounds for more general initial distributions. Second, the factor c(f) in the estimate is an upper bound of the asymptotic variance σ²_A(f) derived in Theorem 15. We conjecture that the estimate actually holds with σ²_A(f) in place of this upper bound.

Optimal calibration of proposals
Given the explicit expression for the asymptotic variance σ²_A(f) involving the proposal kernel P, we can ask for an optimal choice of the kernel P: G × G → [0, 1] in order to minimize σ²_A(f). However, finding an optimal kernel among all admissible kernels is, in general, an infeasible task. In practice, one often considers common types of proposal kernels P = P_s with a tunable stepsize parameter s > 0 and asks for the optimal value of s. For example, given a measure µ on G ⊆ R^d, we can use the random walk proposal

	P_s(x, ·) := N(x, s² C),    (11)

where x ∈ R^d and C ∈ R^{d×d} denotes a covariance matrix, within a MH algorithm. For this proposal and the classical path average estimator S_n(f) it is widely known that a good stepsize s*_S is chosen in such a way that the average acceptance rate satisfies

	∫_G ∫_G α(x, y) P_{s*_S}(x, dy) µ(dx) ≈ 0.234.
For a justification and further details we refer to [RR01]. For the MH importance sampling estimator A_n(f) we look for an optimal stepsize parameter s*_A, optimal in the sense that it minimizes the asymptotic variance of A_n(f); thus, we ask for

	s*_A ∈ argmin_{s>0} V(s),  where V(s) := ∫_G ∫_G |f(y) − E_µ(f)|² (dµ/dP_s(x,·))(y) µ(dy) µ(dx)

and p_s(x, ·) denotes the density of P_s(x, ·) w.r.t. the reference measure µ0. If we assume that the mapping s → p_s(x, y) is differentiable for each (x, y) ∈ G × G with derivative (d/ds) p_s(x, y), then any s minimizing V(s) satisfies

	(d/ds) V(s) = 0.    (12)

By the fact that µ(dy) µ(dx) ∝ ρ̄_s(x, y) P_s(x, dy) µ(dx), where ρ̄_s(x, y) := ρ(y)/p_s(x, y), we can rewrite (12) and approximate (d/ds) V(s) by using (X_k, Y_k), k = 1, …, n, from the augmented Markov chain. Thus,

	(d/ds) V(s) ∝ − ∫_G ∫_G |f(y) − E_µ(f)|² ρ̄_s(x, y)² (d/ds) log p_s(x, y) P_s(x, dy) µ(dx)
	            ≈ − (1/n) ∑_{k=1}^n |f(Y_k) − A_n(f)|² ρ̄_s(X_k, Y_k)² (d/ds) log p_s(X_k, Y_k).

In practice we can calibrate s such that the empirical average on the right-hand side is close to zero. We demonstrate the feasibility of this approach for two common proposals.
Example 22 (Optimal calibration of the random walk MH). We consider µ0 as the Lebesgue measure on G ⊆ R^d and P_s as in (11). Thus,

	p_s(x, y) ∝ s^{−d} exp( −|y − x|²_C / (2s²) ),  where |y − x|²_C := (y − x)^⊤ C^{−1} (y − x),

and

	(d/ds) log p_s(x, y) = −d/s + |y − x|²_C / s³,

so that the necessary condition (12) can be rewritten as

	s² ≈ ( ∑_{k=1}^n w_k(s) |Y_k − X_k|²_C ) / ( d ∑_{k=1}^n w_k(s) ) =: J_f(s),  w_k(s) := |f(Y_k) − A_n(f)|² ρ̄_s(X_k, Y_k)².

In practice, we then seek an s > 0 such that for n states (X_k, Y_k) of the augmented MH Markov chain generated by the proposal P_s(x, ·) = N(x, s² C) we have s² ≈ J_f(s).

Example 23 (Optimal calibration of the MALA). Another common proposal on G = R^d is that of the Metropolis-adjusted Langevin algorithm (MALA), given by

	P_s(x, ·) := N( m_s(x), s² I_d ),  m_s(x) := x + (s²/2) ∇ log ρ(x),    (14)

where we assume that log ρ: G → R is differentiable and I_d denotes the identity matrix in R^d. The resulting proposal density is

	p_s(x, y) ∝ s^{−d} exp( −|y − m_s(x)|² / (2s²) ).

In order to compute (d/ds) log p_s(x, y) we require (d/ds) m_s(x) = s ∇ log ρ(x), which then yields

	(d/ds) log p_s(x, y) = −d/s + |y − m_s(x)|²/s³ + (y − m_s(x))^⊤ ∇ log ρ(x) / s.

Thus, in the case of MALA the necessary condition (12) is equivalent to

	s² ≈ ( ∑_{k=1}^n w_k(s) [ |Y_k − m_s(X_k)|² + s² (Y_k − m_s(X_k))^⊤ ∇ log ρ(X_k) ] ) / ( d ∑_{k=1}^n w_k(s) ),

with w_k(s) as in Example 22. Again, in practice we seek an s > 0 such that, given n states (X_k, Y_k) of the augmented MH Markov chain generated by the MALA proposal P_s in (14), s² is close to the right-hand side above.
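Besides evaluating the derivative-based criterion (12) at the generated states, a crude brute-force alternative — not the paper's procedure, and assuming a one-dimensional standard-normal target purely for the demo — is to compare stepsizes directly by the empirical error of A_n over repeated independent runs:

```python
import numpy as np

def rho(x):            # unnormalized target (standard normal, assumed for demo)
    return np.exp(-0.5 * x**2)

def p(x, y, s):        # Gaussian random walk proposal density p_s(x, y)
    return np.exp(-0.5 * ((y - x) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def A_n(n, s, rng):
    """One run of the MH importance sampling estimator for f(x) = x."""
    x, num, den = 0.0, 0.0, 0.0
    for _ in range(n):
        y = x + s * rng.standard_normal()
        w = rho(y) / p(x, y, s)          # weight rho_bar_s(x, y)
        num += w * y
        den += w
        if rng.uniform() < min(1.0, rho(y) / rho(x)):
            x = y
    return num / den

# brute-force calibration: empirical RMSE of A_n over repeated runs per stepsize
# (for this target the true value E_mu(f) = 0, so the error is the estimate
# itself; very small s can even make the weight variance blow up)
rng = np.random.default_rng(3)
grid = [0.5, 1.0, 2.0, 4.0]
rmse = {s: np.sqrt(np.mean([A_n(2_000, s, rng) ** 2 for _ in range(20)]))
        for s in grid}
print(rmse)
```

This is only a sanity check on toy problems, since it needs many independent runs and a known reference value; the empirical rule s² ≈ J_f(s) above avoids both.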

Numerical examples
We want to illustrate the benefits as well as the limitations of the MH importance sampling estimator A_n(f) on two simple but representative examples. To this end, we compare A_n(f) to the classical path average estimator S_n(f) as well as to two other established estimators that also use the proposed states Y_k generated in the MH algorithm, namely
• the waste-recycling Monte Carlo estimator, for further details see [Fre04, Fre06, DJ09], given by

	WR_n(f) := (1/n) ∑_{k=1}^n [ α(X_k, Y_k) f(Y_k) + α_c(X_k, Y_k) f(X_k) ];

• another Markov chain importance sampling estimator also based on the proposed states, see [SK20], given by

	B_n(f) := ∑_{k=1}^n w_n(X_{1:n}, Y_k) f(Y_k) / ∑_{k=1}^n w_n(X_{1:n}, Y_k),  w_n(X_{1:n}, y) := ρ(y) / ( (1/n) ∑_{j=1}^n p(X_j, y) ).
The notation X_{1:n} within B_n(f) stands for (X_1, …, X_n). In the following we provide two remarks concerning B_n and the other estimators.
Remark 24. For the convenience of the reader we justify heuristically that B_n(f) approximates E_µ(f). For this, let ν_Y = µP be the marginal distribution of the stationary probability measure ν in (7) w.r.t. its second component. Intuitively, ν_Y can be considered the asymptotic distribution of the proposed states. The empirically computed weights w_n(X_{1:n}, Y_k) in B_n(f) approximate importance sampling weights w(Y_k) ∝ (dµ/dν_Y)(Y_k) resulting from the asymptotic distribution ν_Y of the proposed states Y_k. Substituting w(Y_k) for w_n(X_{1:n}, Y_k) in B_n(f) yields a self-normalized importance sampling estimator based on the proposed states which approximates E_µ(f).
Remark 25. Since the estimators S_n(f), A_n(f), and WR_n(f) each involve only a single sum, their number of arithmetic operations, and hence their complexity, is O(n). In contrast, in the alternative Markov chain importance sampling estimator B_n(f) an additional sum appears in the computation of each weight w_n(X_{1:n}, Y_k), so that the overall number of arithmetic operations is O(n²), since we need to compute w_n(X_{1:n}, Y_k) for k = 1, …, n. To take this into account, we often compare the former three estimators to B_{√n}(f). Besides that, an optimal tuning of the proposal stepsize for the estimator B_n(f) is left open in [SK20]; the authors suggest simply using the usual calibration rule for the classical path average estimator S_n(f) from [RR01].
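For concreteness, a sketch (again assuming a one-dimensional standard-normal target for the demo) computing S_n and the waste-recycling estimator WR_n from one run of the MH algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)

def rho(x):
    # unnormalized target density; a standard normal is assumed for this demo
    return np.exp(-0.5 * x**2)

n, s = 50_000, 2.4
X = np.empty(n); Y = np.empty(n); alpha = np.empty(n)
x = 0.0
for k in range(n):
    y = x + s * rng.standard_normal()     # symmetric random walk proposal
    a = min(1.0, rho(y) / rho(x))         # acceptance probability alpha(x, y)
    X[k], Y[k], alpha[k] = x, y, a
    if rng.uniform() < a:
        x = y

f = lambda z: z**2
S  = f(X).mean()                                     # classical estimator S_n
WR = (alpha * f(Y) + (1 - alpha) * f(X)).mean()      # waste-recycling estimator

print(S, WR)   # both approximate E_mu(x^2) = 1
```

Both estimators reuse the same n evaluations of ρ; WR_n merely replaces each f(X_{k+1}) by its conditional expectation given (X_k, Y_k).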

Bayesian inference for a differential equation
We consider a boundary value problem in one spatial dimension x ∈ [0, 1] which serves as a simple model for, e.g., stationary groundwater flow. The unknown parameters u = (u_1, u_2), consisting of the log-diffusion coefficient u_1 and the Dirichlet data u_2 at the right-hand boundary x = 1, shall be inferred given noisy observations y ∈ R² of the solution p at x_1 = 0.25 and x_2 = 0.75. This inference setting has already been applied as a test case for sampling and filtering methods in [ESS15, GIHLS19, HV19]. We place a Gaussian prior on u = (u_1, u_2), namely µ0 = N(0, I_2), where I_2 denotes the identity matrix in R². The observation vector is given by y = (27.5, 79.7), and we assume additive measurement noise ε ∼ N(0, 0.01 I_2); i.e., the likelihood L(y|u) of observing y given a fixed value u ∈ R² is

	L(y|u) ∝ exp( −‖y − F(u)‖² / (2 · 0.01) ),

where ‖·‖ denotes the Euclidean norm and F: R² → R² the mapping (u_1, u_2) → (p(x_1), p(x_2)) given by the solution p of the boundary value problem. The resulting posterior measure for u given the observation y then follows the form (1) with ρ(u) := L(y|u). The negative log prior and posterior density are presented in Figure 1. For approximate sampling from the posterior µ we now apply the RW-MH algorithm and MALA, see Section 3.4, with various values of the stepsize s. We let the Markov chains run for n = 10⁴ iterations after a burn-in of n_0 = 10³ iterations. Then, we use the generated path of the (augmented) MH Markov chain in order to estimate the posterior mean E_µ(f), where f(u) := (f_1(u), f_2(u)) with f_i(u) := u_i. We approximate E_µ(f) by the various estimators discussed at the beginning of this section.

The true value E_µ(f) of the posterior mean is computed by Gauss quadrature employing 1500 Gauss-Hermite nodes in each dimension, which ensures a quadrature error smaller by an order of magnitude compared to the error of S_n(f). However, this performance comes at the price of a significantly larger complexity. If we consider the Markov chain importance sampling estimator B_{√n}(f), which has the same complexity as the other estimators S_n(f), A_n(f), and WR_n(f), we in fact observe a worse performance than the other estimators for all chosen stepsizes. Thus, in the error-vs-complexity sense the estimator A_n(f) performs best among all considered estimators if calibrated correctly.
Optimal calibration of A_n(f): Concerning the optimal stepsize for A_n(f), we present in Figure 3 a verification of the approach outlined in Section 3.4. For both MH algorithms, the random walk MH and MALA, we display in the top row the RMSE of S_n(f) and A_n(f) w.r.t. the chosen stepsizes. In the bottom row we display, for each stepsize value s, the relation of s^2 to the empirical functionals for the random walk MH and to the corresponding J_f(s) and J(s) for MALA based on (15). In Section 3.4 we derived as a necessary condition for the optimal stepsize s_* that s_*^2 ≈ J_f(s_*) for both kinds of proposals. Here, we can indeed verify this condition: the optimal s_*, which was calibrated by hand following the rule s_*^2 ≈ J_f(s_*), indeed also shows the smallest RMSE in the top row. The optimal s_*, J_f(s_*), and the corresponding RMSE are highlighted by a green marker in Figure 3. Besides that, choosing the rather "objective" functional J(s), which is independent of the particular quantity of interest f, and applying the alternative calibration rule s_*^2 ≈ J(s_*) does not yield a stepsize with minimal RMSE for A_n(f), although the alternatively calibrated stepsize and the resulting RMSE are not far off from the true optimum. In summary, Figure 3 verifies that the approach in Section 3.4 can indeed be applied in practice for finding the optimal stepsize of the MH importance sampling estimator A_n(f).
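As a minimal illustration of the two estimators compared in Figure 3, the following sketch runs a random walk MH chain on a one-dimensional toy target and evaluates both the path average S_n(f) and a self-normalized importance sampling average over the proposals. The weight ρ(Y_k)/p(X_{k-1}, Y_k), with p(x, ·) the proposal density, is an assumed form of the importance weight (chosen so that the proposal density cancels in expectation) and need not coincide with the exact definition of A_n(f) in the paper.

```python
import numpy as np

def rw_mh_with_proposals(log_rho, x0, s, n, rng):
    # Random-walk MH chain; records every proposal Y_k together with the
    # state X_{k-1} from which it was proposed.
    xs, pairs = [], []
    x, lx = x0, log_rho(x0)
    for _ in range(n):
        y = x + s * rng.standard_normal()
        pairs.append((x, y))
        ly = log_rho(y)
        if np.log(rng.uniform()) < ly - lx:   # MH acceptance/rejection step
            x, lx = y, ly
        xs.append(x)
    return np.array(xs), pairs

def path_and_is_average(log_rho, xs, pairs, s, f=lambda t: t):
    # S_n: classical path average over the accepted states.
    s_n = float(np.mean(f(xs)))
    # Self-normalized IS average over the proposals with assumed weight
    # rho(y)/p(x, y); the Gaussian normalization constant cancels.
    logw = np.array([log_rho(y) + 0.5 * ((y - x) / s) ** 2 for x, y in pairs])
    w = np.exp(logw - logw.max())
    fy = np.array([f(y) for _, y in pairs])
    a_n = float(np.sum(w * fy) / np.sum(w))
    return s_n, a_n
```

For a standard normal target both estimators approach E_µ(f) = 0; sweeping s over a grid and recording the empirical RMSE of each estimator reproduces the kind of comparison shown in the top row of Figure 3.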

Bayesian inference for probit regression (PIMA data)
The second example is a test problem for logistic regression, see, e.g., [CR17] for a discussion. Here, nine predictors x_i ∈ R^9, such as diastolic blood pressure, body mass index, or age, are fitted to the binary outcome y_i ∈ {-1, 1} for diagnosing diabetes for N = 768 members i = 1, ..., N of the Pima Indian tribe. For more details about the data we refer to [SED+88]. Following [CR17], the likelihood L(y|β) for the outcome y ∈ {-1, 1}^N of the diagnosis is modeled by the probit model

    L(y|β) = ∏_{i=1}^N Φ(y_i x_i^T β),

where Φ denotes the standard normal distribution function.

Performance of other estimators: As in Section 4.1, the waste recycling estimator WR_n(f) basically coincides with the path average estimator S_n(f) for any considered dimension d. Also for the Markov chain importance sampling estimators B_n(f) and B_{√n}(f) the performance compared to S_n(f) and A_n(f) is similar to Section 4.1, i.e., B_n(f) outperforms all other estimators, but at a higher cost, whereas its cost-equivalent version B_{√n}(f) performs worse than any other estimator. However, B_n(f) and B_{√n}(f) also seem to suffer from higher dimensions of the state space, as indicated in Table 1, i.e., their total variance relative to the total variance of S_n(f) becomes larger as d increases.
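A short sketch of the probit log-likelihood, assuming the standard form log L(y|β) = Σ_i log Φ(y_i x_i^T β) with Φ the standard normal distribution function; the variable names X, y, and beta are illustrative.

```python
import numpy as np
from math import erf, sqrt

def probit_log_likelihood(beta, X, y):
    # log L(y|beta) = sum_i log Phi(y_i * x_i^T beta) for outcomes y_i in {-1, +1};
    # Phi is the standard normal CDF, evaluated here via the error function.
    t = y * (X @ beta)
    phi = 0.5 * (1.0 + np.array([erf(ti / sqrt(2.0)) for ti in t]))
    # clip to avoid log(0) for extremely negative linear predictors
    return float(np.sum(np.log(np.clip(phi, 1e-300, None))))
```

Combined with a Gaussian prior on β, this log-likelihood yields an unnormalized log posterior density that RW-MH or MALA can target directly, exactly as ρ(u) in Section 4.1.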
Optimal calibration: We observe that for S_n(f) the total variance becomes minimal for average acceptance rates between 0.2 and 0.25. This is in accordance with the well-known asymptotic result on optimal a-priori stepsize choices, see [RR01]. The same optimal calibration holds true for the waste recycling estimator WR_n(f). For the MH importance sampling estimator A_n(f), the minimal total variance is obtained for smaller and smaller average acceptance rates as the dimension d increases. In fact, the numerical results indicate that the optimal proposal stepsize s_* for A_n(f) remains constant w.r.t. the dimension d. This is in contrast to the classical MCMC estimator, for which the optimal asymptotic a-priori stepsize s_* behaves for a product density ρ like d^{-1} for the Gaussian random walk proposal, see [RR01]. Moreover, for each dimension d the minimal total variances of A_n(f) were obtained for stepsizes satisfying the optimal calibration rules outlined in Section 3.4. Concerning the estimators B_n(f) and B_{√n}(f), we also observe that the optimal performance occurs for decreasing acceptance rates as the dimension d increases. Here, the numerical results suggest that the optimal proposal stepsize s_* even increases mildly with the dimension d.
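The reported behavior of average acceptance rates can be reproduced qualitatively with a small experiment: for a Gaussian random walk proposal targeting a standard normal product density (an illustrative target, not the probit posterior), the acceptance rate decreases as the stepsize s grows, and for a fixed s it decreases as the dimension d grows.

```python
import numpy as np

def acceptance_rate(d, s, n=5000, seed=0):
    # Empirical average acceptance rate of Gaussian random-walk MH
    # targeting the standard normal product density N(0, I_d).
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    accepted = 0
    for _ in range(n):
        y = x + s * rng.standard_normal(d)
        # log acceptance ratio for the standard normal product target
        if np.log(rng.uniform()) < 0.5 * (x @ x - y @ y):
            x = y
            accepted += 1
    return accepted / n
```

Holding s fixed while increasing d drives the acceptance rate towards zero, which illustrates why a dimension-independent optimal stepsize for A_n(f) translates into vanishing average acceptance rates.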

Conclusion
In this work we studied a MH importance sampling estimator A_n for which we showed a SLLN, a CLT, and an explicit estimate of the mean squared error. A remarkable property of this estimator is that its asymptotic variance does not contain any autocorrelation term; in fact, Corr(ρ(X_{k-1}, Y_k) f_c(Y_k), ρ(X_{j-1}, Y_j) f_c(Y_j)) = 0 for k ≠ j. This is in sharp contrast to the asymptotic variance of the classical MCMC estimator S_n, see (4). Additionally, we performed numerical experiments which indicate that the MH importance sampling estimator can outperform the classical one. This requires the correct tuning of the underlying MH Markov chain in terms of the proposal stepsize, where the estimator A_n seems to benefit from rather small average acceptance rates, in contrast to optimal scaling results for the MCMC estimator. However, in the numerical experiments we observe a decreasing efficiency of the MH importance sampling estimator for increasing dimension. Indeed, the classical MCMC estimator performs better for larger dimensions. This is very likely related to the well-known degeneration of efficiency for importance sampling in high dimensions, see for example the discussion in [APSAS17, Section 2.5.4].

A. Inheritance of Geometric Ergodicity
A transition kernel K : G × G → [0, 1] with stationary distribution µ is L^2(µ)-geometrically ergodic if there exists a constant r ∈ [0, 1) such that for all probability measures η on G with dη/dµ ∈ L^2(µ) there is C_η ∈ [0, ∞) satisfying

    d_TV(ηK^n, µ) ≤ C_η r^n,    n ∈ N,

where d_TV denotes the total variation distance. Note that if dη/dµ exists, then

    d_TV(η, µ) = (1/2) ∫_G |dη/dµ(x) − 1| µ(dx).

In addition to the exponential convergence, L^2(µ)-geometric ergodicity also yields advantages concerning the CLT for the classical MCMC estimator S_n(f) for E_µ(f).

    ∫_G g(x, y) ρ(x, y) P(x, dy) = Z ∫_G g(x, y) µ(dy),    (10)

for any x ∈ G and any g : G^2 → R for which one of the two integrals exists. Define the centered version of f by f_c(y) := f(y) − E_µ(f) and set h_3(x, y) := ρ(x, y) f_c(y) for
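A quick numerical sanity check of identity (10), assuming the weight ρ(x, y) = ρ(y)/p(x, y) with p(x, ·) the proposal density, so that the proposal density cancels under the integral; the concrete target, proposal, and test function below are illustrative choices.

```python
import numpy as np

# Check (10) by quadrature: int_G g(x,y) rho(x,y) P(x,dy) = Z int_G g(x,y) mu(dy),
# for the assumed weight rho(x,y) = rho(y)/p(x,y) and a Gaussian proposal density.
rho = lambda y: np.exp(-0.5 * y**2)        # unnormalized N(0,1) target, Z = sqrt(2*pi)
x, s = 0.7, 1.3                            # current state and proposal stepsize
p = lambda y: np.exp(-0.5 * ((y - x) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
g = lambda y: y**2                         # test function g(x, y) = y^2 with E_mu(g) = 1

ys = np.linspace(-12.0, 12.0, 200001)
integrand = g(ys) * (rho(ys) / p(ys)) * p(ys)
# trapezoidal rule for the left-hand side of (10)
lhs = float(np.sum(integrand[1:] + integrand[:-1]) * (ys[1] - ys[0]) / 2.0)
rhs = float(np.sqrt(2.0 * np.pi) * 1.0)    # Z * E_mu(g)
```

The values lhs and rhs agree up to quadrature error, which confirms numerically that under this weight every proposed state can be reweighted to target µ, regardless of the current state x.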

Figure 1: Contour plot of the normal prior and the resulting posterior density for the example of Section 4.1.

Table 1: Ratio of the total variances Var(E_n(f))/Var(S_n(f)) for various estimators E_n and dimensions d for the example of Section 4.2.