On the efficiency of adaptive MCMC algorithms

We study a class of adaptive Markov chain Monte Carlo (MCMC) processes which aim at behaving as an "optimal" target process via a learning procedure. We show, under appropriate conditions, that the adaptive process and the "optimal" (nonadaptive) MCMC algorithm share identical asymptotic properties. The special case of adaptive MCMC algorithms governed by stochastic approximation is considered in detail, and we apply our results to the adaptive Metropolis algorithm of [1]. We also propose a new class of adaptive MCMC algorithms, called quasi-perfect adaptive MCMC, which possesses appealing theoretical and practical properties, as demonstrated through numerical simulations.


Introduction
Markov chain Monte Carlo (MCMC) is a popular computational method for generating samples from virtually any distribution π defined on a space X. These samples are often used to efficiently compute expectations with respect to π by invoking some form of the law of large numbers. The method consists of simulating an ergodic Markov chain {X_n, n ≥ 0} on X with transition probability P such that π is a stationary distribution for this chain. In practice the choice of P is not unique; rather, one must choose among a family of transition probabilities {P_θ, θ ∈ Θ} for some set Θ. The problem is then that of choosing the "best" transition probability P_θ from this set, according to some well-defined criterion. For example, the efficiency of a Metropolis-Hastings algorithm depends strongly on the scaling of its proposal transition probability. In this case, the optimal scaling depends on π, the actual target distribution, and cannot be set once and for all. For more details on MCMC methods, see e.g. Gilks et al. (1996).
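The scaling phenomenon alluded to above is easy to reproduce numerically. The following sketch (a standard normal target with a Gaussian random-walk proposal of scale sigma; all concrete choices here are illustrative and not taken from the text) shows how strongly the acceptance rate of a Random Walk Metropolis sampler depends on the proposal scale:

```python
import math
import random

def rwm(n_steps, sigma, log_pi=lambda x: -0.5 * x * x, x0=0.0, seed=1):
    """Random Walk Metropolis for a one-dimensional target pi.

    Returns the chain and the empirical acceptance rate; the latter is
    highly sensitive to the proposal scale sigma, which is the scaling
    problem that motivates adaptive MCMC.
    """
    rng = random.Random(seed)
    x, accepted, chain = x0, 0, []
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, sigma)  # propose Y ~ N(x, sigma^2)
        if math.log(rng.random()) < log_pi(y) - log_pi(x):
            x, accepted = y, accepted + 1  # accept the move
        chain.append(x)
    return chain, accepted / n_steps
```

With a tiny sigma nearly every proposal is accepted but the chain barely moves; with a very large sigma almost everything is rejected. Either extreme gives a poorly mixing chain, so the "best" sigma depends on the unknown target.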
An attractive solution to this problem, which has recently received attention, consists of using so-called adaptive MCMC methods, where the transition kernel of the algorithm is sequentially tuned during the simulation in order to find the "best" θ (see e.g. Gilks et al. (1998), Haario et al. (2001), Andrieu and Robert (2001), Andrieu and Moulines (2006) and Atchade and Rosenthal (2005)). These algorithms share more or less the same structure and fit, as pointed out in Andrieu and Robert (2001), in the general framework of controlled Markov chains. More precisely, one first defines a sequence of measurable functions {ρ_n : Θ × X^{n+1} → Θ, n ≥ 0} which encodes what is meant by "best". The adaptive chain is initialized with some arbitrary but fixed values (θ_0, X_0) ∈ Θ × X. At iteration n ≥ 1, given (θ_0, X_0, ..., X_{n−1}) and θ_{n−1} = ρ_{n−1}(θ_0, X_0, ..., X_{n−1}) (with the convention that ρ_0(θ, X) = θ), X_n is generated according to P_{θ_{n−1}}(X_{n−1}, ·) and θ_n = ρ_n(θ_0, X_0, ..., X_n). Most examples currently developed in the literature rely on stochastic approximation type recursions, e.g. Haario et al. (2001), Andrieu and Robert (2001) and Atchade and Rosenthal (2005). Clearly, the random process {X_n} is in general not a Markov chain. However, with an abuse of language, we will refer to this type of process as an adaptive MCMC algorithm. Given the non-standard nature of adaptive MCMC algorithms and the aim of sampling from a given distribution π, it is natural to ask whether adaptation preserves the ergodicity enjoyed by classical MCMC algorithms. For example, denoting by ‖·‖_TV the total variation norm, do we have ‖P(X_n ∈ ·) − π‖_TV → 0 as n → ∞? The answer is "no" in general and counterexamples abound (see e.g. Andrieu and Moulines (2006), Atchade and Rosenthal (2005)). However, positive ergodicity results also exist. For example, if adaptation of θ occurs at regeneration times, then ergodicity is preserved for almost any adaptation scheme (Gilks et al. (1998)). It is also now well established that if adaptation is diminishing (for example in the sense that |θ_n − θ_{n−1}| → 0 as n → ∞), then ergodicity is preserved under mild additional assumptions (see e.g. Andrieu and Moulines (2006), Atchade and Rosenthal (2005), Rosenthal and Roberts (2007)). However, beyond ergodicity, it is still unclear how efficient adaptive MCMC algorithms are.

This paper addresses the problem of the efficiency of adaptive MCMC. We consider the case where the adaptation process {θ_n} converges (in the mean square sense, for example) to a unique nonrandom limit θ*. Let {Y_n} be the stationary Markov chain with transition kernel P_{θ*} and invariant distribution π. Under some standard assumptions, we obtain a bound on the rate of convergence in total variation norm of the distribution of (X_n, ..., X_{n+p}) towards the distribution of (Y_0, ..., Y_p) as n → ∞, for any finite integer p ≥ 0 (Theorem 2.1). This bound, which depends explicitly on the rate of convergence of θ_n to θ*, sheds some new light on adaptive MCMC processes. Theorem 2.1 implies that the process {X_n} is asymptotically stationary (in the weak convergence sense) with stationary distribution given by the distribution of {Y_n}. If θ_n converges to θ* fast enough, it follows as well from Theorem 2.1 that {X_n} is asymptotically stationary in the total variation norm sense; as a result, there exists a coupling {X̂_n, Ŷ_n} of {X_n, Y_n} and a finite coupling time T such that X̂_n = Ŷ_n for any n ≥ T. Unfortunately, as we shall see, the rates of convergence of {θ_n} towards θ* required for this latter result to hold are not realistic for current stochastic approximation based implementations of adaptive MCMC.

More precisely, we pay particular attention to the case where {θ_n} is constructed through a stochastic approximation recursion: most existing adaptive MCMC algorithms rely on this mechanism (Haario et al. (2001), Andrieu and Moulines (2006), Atchade and Rosenthal (2005)). In particular, we derive verifiable conditions that ensure the mean square convergence of θ_n to a unique limit point θ* and prove a bound on the rate of this convergence (Theorem 3.1). These results apply, for example, to the adaptive Metropolis algorithm of Haario et al. (2001) and show that the stochastic process generated by this algorithm is asymptotically stationary in the weak convergence sense, with a limit distribution that is (almost) optimal.

The paper is organized as follows. In the next section we state our main result (Theorem 2.1) and briefly discuss some of its implications. The proof of Theorem 2.1 is postponed to Section 4.1. Section 3.1 is devoted to the special case of stochastic approximation updates. We first establish Theorem 3.1, which gives the mean square convergence of θ_n to some θ* under verifiable conditions. We then apply our results to the adaptive Metropolis algorithm of Haario et al. (2001) (Proposition 3.1).
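In pseudocode, the general scheme described above (a kernel P_θ tuned on the fly by update maps ρ_n) amounts to a simple loop. In the sketch below, `sample_P` and `rho` are hypothetical placeholders for a concrete kernel and adaptation rule, and we restrict to the common special case where θ_n depends on the history only through (θ_{n−1}, X_n):

```python
import random

def adaptive_mcmc(sample_P, rho, theta0, x0, n_steps, seed=0):
    """Generic adaptive MCMC loop (illustrative skeleton).

    sample_P(rng, theta, x) draws X_{n+1} from P_theta(x, .);
    rho(theta, x) returns the updated parameter theta_{n+1}.
    The pair process {(theta_n, X_n)} is Markov, but {X_n} alone
    is in general not.
    """
    rng = random.Random(seed)
    theta, x, chain = theta0, x0, [x0]
    for _ in range(n_steps):
        x = sample_P(rng, theta, x)  # X_n ~ P_{theta_{n-1}}(X_{n-1}, .)
        theta = rho(theta, x)        # theta_n = rho_n(theta_{n-1}, X_n)
        chain.append(x)
    return chain, theta
```

Any concrete adaptive algorithm (such as the adaptive Metropolis scheme discussed later) is obtained by plugging in a specific kernel and adaptation rule.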

Statement and discussion of the results
Let (X, B(X), π) be a probability space, (Θ, |·|) a normed space and {P_θ : θ ∈ Θ} a family of transition kernels P_θ : X × B(X) → [0, 1]. We assume that for any A ∈ B(X), P_θ(x, A) is measurable as a function of (θ, x). We introduce the following classical notation. If P is a transition kernel on some measure space (E, E), for n ≥ 1 we write P^n for the transition kernel defined recursively as P^n(x, A) = ∫_E P(x, dy)P^{n−1}(y, A), with P^0(x, A) = 1_A(x), where 1_A(x) is the indicator function of the set A (which we might denote 1(A) at times). Also, if ν is a probability measure on (E, E) and f : E → R is a measurable function, we define νP(·) := ∫_E ν(dx)P(x, ·), Pf(x) := ∫_E P(x, dy)f(y) and ν(f) := ∫_E f(y)ν(dy), whenever these integrals exist. If E is a topological space, we say that E is Polish if the topology on E arises from a metric with respect to which E is complete and separable. In this case E is equipped with its Borel σ-algebra. For µ a probability measure and {µ_n} a sequence of probability measures on (E, E) with E a Polish space, we say that µ_n converges weakly to µ as n → ∞ and write µ_n ⇒ µ. For a measurable function W : E → [1, ∞) and a signed measure ν on (E, E), define the W-norm ‖ν‖_W := sup{|ν(f)| : f measurable, |f| ≤ W}. For W ≡ 1, we obtain the total variation norm, denoted ‖ν‖_TV hereafter. Similarly, for two transition kernels P_1 and P_2 we define |||P_1 − P_2|||_W := sup_{x∈E} ‖P_1(x, ·) − P_2(x, ·)‖_W / W(x). Let ρ_n : Θ × X^{n+1} → Θ be a sequence of measurable functions and define the adaptive chain {X_n} as in the introduction: given (θ_0, X_0, ..., X_{n−1}) and θ_{n−1} = ρ_{n−1}(θ_0, X_0, ..., X_{n−1}), X_n is generated according to P_{θ_{n−1}}(X_{n−1}, ·). Without any loss of generality, we shall work with the canonical version of the process {X_n} defined on (X^∞, B(X)^∞) and write P for its distribution and E for the expectation with respect to P.
We omit the dependence of P on θ_0, X_0 and {ρ_n}. Let Q_θ be the distribution on (X^∞, B(X)^∞) of a Markov chain with initial distribution π and transition kernel P_θ. When convenient, we shall write Z to denote the random process {Z_n}. We assume the following:

(A1) For any θ ∈ Θ, P_θ has invariant distribution π, and there exist a measurable function V : X → [1, ∞), constants ε > 0, λ ∈ (0, 1), b < ∞ and a probability measure ν, none depending on θ, such that the minorization condition (1) and the drift condition (2) hold.

The inequality (2) of (A1) is the so-called drift condition and (1) is the so-called (1, ε, ν)-minorization condition. These conditions have proved very effective in analyzing Markov chains. As pointed out in Andrieu and Moulines (2006), (A1) is sufficient to ensure that V-geometric ergodicity of the Markov chain holds uniformly in θ, i.e. there exist ρ ∈ (0, 1) and C ∈ (0, ∞) such that for any θ ∈ Θ, x ∈ X and n ≥ 0, ‖P_θ^n(x, ·) − π‖_V ≤ Cρ^n V(x). For a proof, see e.g. Baxendale (2005) and the references therein. Next, we assume that

(A2) there exists C ∈ (0, ∞) such that for any θ, θ′ ∈ Θ, |||P_θ − P_{θ′}|||_{V^{1/2}} ≤ C|θ − θ′|,

where V is defined in (A1).
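Drift and minorization conditions of this type are standard in the geometric-ergodicity literature; as a point of reference, their usual form is the following (a sketch of the typical statement with generic constants, not necessarily the authors' exact display):

```latex
% (1) Minorization: there exist \varepsilon > 0, a set C \in \mathcal{B}(\mathcal{X})
% and a probability measure \nu, none depending on \theta, such that
P_\theta(x, A) \;\ge\; \varepsilon\, \nu(A)
  \qquad \text{for all } x \in C,\ A \in \mathcal{B}(\mathcal{X}),\ \theta \in \Theta ;
% (2) Drift: there exist \lambda \in (0,1), b < \infty and V : \mathcal{X} \to [1,\infty) with
P_\theta V(x) \;\le\; \lambda V(x) + b\, \mathbf{1}_C(x)
  \qquad \text{for all } x \in \mathcal{X},\ \theta \in \Theta .
```

Together, (1) and (2) yield the uniform-in-θ V-geometric ergodicity bound quoted above.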
We assume that θ_n converges to some fixed element θ* ∈ Θ in the mean square sense:

(A3) There exist a deterministic sequence {α_n} of positive real numbers with α_n → 0 as n → ∞ and a fixed θ* ∈ Θ such that E[|θ_n − θ*|²] ≤ α_n for all n ≥ 0.

We assume that an optimality criterion has been defined with respect to which P_{θ*} is the best possible transition kernel. Of course, in general θ* is not known, and our objective here is to investigate how well the adaptive chain {X_n} performs with respect to the optimal chain. Let Y = {Y_n} be the stationary Markov chain on X with transition kernel P_{θ*} and initial distribution π. For n, p ≥ 0, n finite, we introduce the projection s_{n,p} : X^∞ → X^{p+1} with s_{n,p}(w_0, w_1, ...) = (w_n, ..., w_{n+p}). For p = ∞, we write s_n. If µ is a probability measure on (X^∞, B(X)^∞), define µ^{(n,p)} := µ ∘ s_{n,p}^{−1}, the image of µ by s_{n,p}. If p = ∞, we simply write µ^{(n)}. The following result is fundamental. It provides us with a comparison of the distributions of (X_n, ..., X_{n+p}) and (Y_n, ..., Y_{n+p}).
The bound in Theorem 2.1 implies that, under suitable conditions on {α_k}, any finite dimensional distribution of {s_n(X)} converges weakly to the corresponding finite dimensional distribution of Y. As a result, if X is Polish, then since weak convergence of finite dimensional marginals implies weak convergence of measures, we conclude that:

Corollary 2.1. Assume that X is Polish. Under the assumptions of Theorem 2.1, P^{(n)} converges weakly to Q_{θ*} as n → ∞.

When Σ_{i≥1} α_i < ∞, we can strengthen the conclusion of Corollary 2.1 as follows.
Corollary 2.2. In addition to the assumptions of Theorem 2.1, assume that X is Polish and that Σ_{i≥1} α_i < ∞. Then there exist a coupling (X̂, Ŷ) of (X, Y) and a finite coupling time T such that X̂_{T+n} = Ŷ_{T+n} for all n ≥ 0.

Proof.
Σ_{i≥1} α_i < ∞ implies, by Theorem 2.1, that ‖P^{(n)} − Q_{θ*}‖_TV → 0, and according to Theorem 2.1 of Goldstein (1979) on maximal coupling of random processes, this is equivalent to the existence of the asserted coupling.
1. The conclusion of Corollary 2.1 is that the adaptive chain is asymptotically stationary in the weak convergence sense, with limiting distribution equal to the distribution of the "optimal" chain. When Σ_{n≥1} α_n < ∞, Corollary 2.2 asserts that the adaptive chain is asymptotically stationary in the total variation norm sense. In the latter case, the existence of the coupling is interesting as it suggests that the adaptive chain and the optimal Markov chain are essentially the same process.
2. The condition Σ_{n≥1} α_n < ∞ cannot, in general, be removed from Corollary 2.2.
3 Adaptive chains governed by stochastic approximation

Validity of (A3) for the stochastic approximation recursion
Our main objective here is to show that (A3) holds when the family of updating equations {ρ_n} corresponds to the popular stochastic approximation procedure. We will assume for simplicity that Θ is a compact subset of the Euclidean space R^p for some positive integer p, and denote by ⟨·, ·⟩ the inner product on R^p. We assume that {θ_n} is a stochastic approximation sequence, defined as follows. Let H : Θ × X → R^p and let {γ_n} be some sequence of positive real numbers.
For n ≥ 0 we recursively define the sequence as follows: given (θ_n, X_n), generate X_{n+1} according to P_{θ_n}(X_n, ·) and compute the candidate θ̄ = θ_n + γ_{n+1}H(θ_n, X_{n+1}); if θ̄ ∈ Θ, then θ_{n+1} = θ̄ and d_{n+1} = 0, otherwise θ_{n+1} = θ_n and d_{n+1} = 1. We define F_n = σ(X_0, X_1, ..., X_n) and denote by P and E the probability and expectation of this process, again omitting the dependence of this probability and expectation on θ_0, x_0 and {γ_n}. This set-up is particularly relevant, as many recently proposed adaptive MCMC algorithms rely on stochastic approximation. An extensive literature exists on stochastic approximation algorithms (see e.g. Benveniste et al. (1990), Kushner and Yin (2003) and the references therein).
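As an illustration of this truncated recursion, the sketch below runs the update with a toy mean field H(θ, x) = x − θ, with i.i.d. draws standing in for the kernel P_θ; all concrete choices (Θ = [−10, 10], γ_n = 1/n, the Gaussian draws) are illustrative and not taken from the text:

```python
import random

def projected_sa(H, theta0, gamma, sample_P, x0, n_steps, in_Theta, seed=0):
    """Stochastic approximation with rejection of moves outside Theta.

    Candidate: theta_bar = theta_n + gamma(n+1) * H(theta_n, X_{n+1}).
    The move is kept only if theta_bar lies in Theta (d_{n+1} = 0);
    otherwise theta_{n+1} = theta_n (d_{n+1} = 1).
    """
    rng = random.Random(seed)
    theta, x = theta0, x0
    rejections = 0
    for n in range(n_steps):
        x = sample_P(rng, theta, x)          # X_{n+1} ~ P_{theta_n}(X_n, .)
        cand = theta + gamma(n + 1) * H(theta, x)
        if in_Theta(cand):
            theta = cand                     # accepted update, d_{n+1} = 0
        else:
            rejections += 1                  # truncated step, d_{n+1} = 1
    return theta, rejections
```

With H(θ, x) = x − θ and γ_n = 1/n, the recursion reduces to the running sample mean, a classical Robbins-Monro example: θ_n converges to the mean of the sampled distribution.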
In order to establish our result, we will need the following definitions and assumption.
Definition 3.1. Let f : Θ × X → R^q for some positive integer q and let W : X → [1, ∞). We say that f is W-bounded in its first argument if sup_{θ∈Θ} sup_{x∈X} |f(θ, x)|/W(x) < ∞, and we say that f is W-Lipschitz in its first argument if there exists C ∈ (0, ∞) such that |f(θ, x) − f(θ′, x)| ≤ C|θ − θ′|W(x) for all θ, θ′ ∈ Θ and x ∈ X.

In this section we will require the following additional assumption, specific to the stochastic approximation framework.

Application to the Adaptive Metropolis algorithm
In this section, we apply our results to the adaptive version of the Random Walk Metropolis (RWM) algorithm of Haario et al. (2001). We assume here that X is a compact subset of the p-dimensional (p ≥ 1) Euclidean space R^p, equipped with the Euclidean topology and the associated σ-algebra B(X). Let π be the probability measure of interest and assume that π has a bounded density (also denoted π) with respect to the Lebesgue measure on X. Let q_Σ be the density of the mean-zero normal distribution with covariance matrix Σ, q_Σ(x) = (2π)^{−p/2} det(Σ)^{−1/2} exp(−x^T Σ^{−1} x/2), where x^T is the transpose of x.
The RWM algorithm with target density π and proposal density q_Σ is the following. Given X_n, a 'proposal' Y is generated from q_Σ(X_n, ·). Then we either 'accept' Y and set X_{n+1} = Y with probability α(X_n, Y), or 'reject' Y and set X_{n+1} = X_n with probability 1 − α(X_n, Y), where α(x, y) = min(1, π(y)/π(x)). Define µ* := ∫_X x π(dx), the mean of π, Λ* := ∫_X x x^T π(dx), and Σ* := Λ* − µ*(µ*)^T, its covariance matrix. It is intuitively clear that the best performance should be obtained when Σ is proportional to Σ*. In Haario et al. (2001), an adaptive algorithm has been proposed to learn Σ* on the fly. As pointed out in Andrieu and Robert (2001), their algorithm is a particular instance of the Robbins-Monro algorithm with Markovian dynamics. We present here an equivalent alternative Robbins-Monro recursion which naturally lends itself to the application of Theorem 3.1. Let I_p be the p × p identity matrix. The algorithm we study is as follows:

Algorithm 3.1. Initialization: Choose X_0 = x_0 ∈ X, the initial point. Choose µ_0 ∈ X, an initial estimate of µ*, and Λ_0, a symmetric positive matrix, an initial estimate of Λ*, such that Λ_0 − µ_0µ_0^T is positive. Let ε > 0.

Iteration: At time n + 1 for n ≥ 0, given X_n ∈ X, µ_n ∈ X and Λ_n a symmetric positive matrix, generate X_{n+1} by one step of the RWM algorithm with proposal covariance Σ_n := Λ_n − µ_nµ_n^T + εI_p, then set µ_{n+1} = µ_n + γ_{n+1}(X_{n+1} − µ_n) and Λ_{n+1} = Λ_n + γ_{n+1}(X_{n+1}X_{n+1}^T − Λ_n).

The small matrix εI_p ensures that the covariance matrix Σ_n remains positive definite (Haario et al. (2001)). We write θ_n := (µ_n, Λ_n), θ* := (µ*, Λ*) and Σ* := Λ* − µ*(µ*)^T. Let P be the distribution of the process {X_n} on (X^∞, B(X)^∞) and E its associated expectation. As before, we omit the dependence of P on the initial values and other parameters of the algorithm (x_0, θ_0, etc.). Let also Q denote the distribution on (X^∞, B(X)^∞) of the stationary Markov chain with initial distribution π and transition kernel P_{Σ*+εI_p}. An application of Theorems 3.1 and 2.1 gives the following proposition; we omit the details.
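A one-dimensional sketch of this kind of recursion may help fix ideas. In the code below, π is a standard normal, γ_{n+1} = 1/(n+1), and X is taken to be all of R rather than a compact set; these are our illustrative choices, and the sketch is not the paper's exact algorithm:

```python
import math
import random

def adaptive_metropolis_1d(log_pi, n_steps, x0=0.0, mu0=0.0, lam0=1.0,
                           eps=0.1, seed=2):
    """One-dimensional sketch of the adaptive RWM recursion.

    mu_n and lam_n are Robbins-Monro estimates of the target's first
    and second moments; the proposal variance is
    Sigma_n = lam_n - mu_n^2 + eps, kept positive by the eps term
    (the scalar analogue of the eps * I_p regularization).
    """
    rng = random.Random(seed)
    x, mu, lam, chain = x0, mu0, lam0, []
    for n in range(n_steps):
        sigma2 = lam - mu * mu + eps              # Sigma_n
        y = x + rng.gauss(0.0, math.sqrt(sigma2))  # RWM proposal
        if math.log(rng.random()) < log_pi(y) - log_pi(x):
            x = y                                  # Metropolis accept/reject
        gamma = 1.0 / (n + 1)                      # step size gamma_{n+1}
        mu += gamma * (x - mu)                     # mu_{n+1}
        lam += gamma * (x * x - lam)               # lam_{n+1}
        chain.append(x)
    return chain, mu, lam
```

With γ_{n+1} = 1/(n+1), mu_n and lam_n are exactly the running means of the states and their squares, so lam_n − mu_n² is a nonnegative empirical variance and Sigma_n ≥ eps at every step.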
Proposition 3.1. The adaptive RWM algorithm described above is such that: (ii) for any bounded measurable f : X → R, as n → ∞, (iii) the process is weakly consistent in the sense that, and there exists C ∈ (0, ∞) such that for any finite n, p ≥ 1: Furthermore, for any integer sequence {p_n} such that p_n = o(n),

Remark 3.2. Note that in the case of this linear Robbins-Monro recursion, a more precise L² result can also be obtained directly from the martingale decomposition in Andrieu and Moulines (2006); see also Andrieu (2004) for a discussion.

Proofs
4.1 Proof of Theorem 2.1

Proof. Let {X_i} be our adaptive process and {Y_i} the homogeneous Markov chain with transition probability P_{θ*}. Throughout this section, F_n = σ(X_0, Y_0, ..., X_n, Y_n). It is sufficient to work with functions of the form f(x_0, ..., x_p) = Π_{i=0}^{p} f_i(x_i), for {f_i : X → R, |f_i| ≤ 1, i = 0, ..., p} a family of measurable functions and any p > 1. The proof relies on the following decomposition. An estimate of the inner conditional expectation term is given in Proposition 4.2 below, and the outer expectation operator is studied in Proposition 4.1. The combination of these results leads to the stated bound, hence the result.
Proposition 4.1. Assume (A1-3). Let g ∈ L_{V^{1/2}} and let {i_n} ⊂ Z_+ be such that i_n < n for all n. Then there exist ρ ∈ (0, 1) and C ∈ (0, ∞) such that for any n ≥ 1, If α_n ∝ n^{−γ} for γ > 0, then there exists C ∈ (0, ∞) such that

Proof. Let {X_n} be our adaptive process and {Y_n} be the time-homogeneous Markov chain with transition probability P_{θ*}. First we have the following decomposition. The first term is easily dealt with since, from the Markov property and by Lemma 5.1 of Andrieu et al. (2005), For the second term we introduce the following telescoping sum decomposition. Now, for j ∈ {i_n, ..., n − 1}, from the Cauchy-Schwarz inequality, and consequently, using Lemma 5.1 of Andrieu et al. (2005), we first conclude that Now, in the case where α_j ∝ j^{−γ}, we choose i_n so as to balance the two terms depending on n on the right-hand side. To that purpose we note that, and check that this choice of i_n gives ρ^{n−i_n} ∼ n^{−γ}, which concludes the proof.
Let {φ_k : X → [−1, 1], k = 0, ..., p + 1} be a family of functions defined as φ_{p+1}(x) = 1 and, for k = p, ..., 0, We have the following proposition.

Proposition 4.2. Assume (A1-3). Let {X_i} be the adaptive chain and let {f_i : X → R, |f_i| ≤ 1, i = 0, ..., p} be a family of measurable functions. Then there exists a constant C ∈ (0, ∞) such that for any n, p ∈ Z_+,

Proof. We have the following telescoping sum decomposition. For any k = 0, ..., p, using the Markov property, one has

4.2 Proof of Theorem 3.1

Proof. In what follows, C is a finite universal constant whose value might change upon each appearance. Setting ∆_n := θ_n − θ* for any n ≥ 0, we have From assumptions (A1) and (A4), and e.g. Lemma 5.1 of Andrieu et al. (2005), we deduce that, From Proposition 4.3 we have that Consequently there exists a constant C_1 such that for n ≥ 1, and we conclude using Lemma 23, p. 245, in Benveniste et al. (1990).
We first recall the following fundamental lemma, which can be found in the proof of Proposition 3 in Andrieu and Moulines (2006).