Concentration Inequalities for Gibbs Sampling under the $d_{l^2}$-Metric

The aim of this paper is to study Gibbs sampling used for computing the mean of observables with respect to a function $f$ depending on a very small number of variables. For this type of observable, working under the $d_{l^2}$-metric yields a sharp concentration estimate for the empirical mean, which in particular gives the correct concentration speed when $f$ depends on a single variable.


Introduction
Let $\mu$ be a Gibbs probability measure on $E^N$ with the dimension $N$ large, i.e.,
$$\mu(dx_1, \cdots, dx_N) = \frac{1}{Z}\, e^{-H(x_1, \cdots, x_N)} \prod_{i=1}^{N} \pi(dx_i),$$
where $H$ is the Hamiltonian, $Z$ the normalizing constant, and $\pi$ is some $\sigma$-finite reference measure on $E$. Our purpose is to study Gibbs sampling, a Markov Chain Monte Carlo method (MCMC for short) for approximating $\mu$.
Gibbs sampling is also called Glauber dynamics with systematic scan (see [6]).
Let $\mu_i(\cdot|x)$ ($x = (x_1, \cdots, x_N) \in E^N$) be the regular conditional distribution of $x_i$ knowing $(x_j, j \ne i)$ under $\mu$; it is a one-dimensional measure, easy to simulate in practice.
By iterating the one-dimensional conditional distributions $(\mu_i, i = 1, \cdots, N)$, Gibbs sampling is the time-homogeneous Markov chain $(Z_k, k = 0, 1, \cdots)$, where each $Z_k$ is the random vector on $E^N$ obtained after the dynamics has been applied sequentially to all sites. (For details see Section 2.) In [6], Dyer, Goldberg and Jerrum study the mixing time of Gibbs sampling on finite spin systems under Dobrushin uniqueness conditions. Here, in contrast, we study concentration inequalities for Gibbs sampling on a general state space under Dobrushin-type conditions, as in [17].
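To fix ideas, here is a minimal Python sketch of the systematic-scan dynamics for a toy Gaussian Gibbs measure $\mu \propto e^{-\frac{1}{2} x^{T} A x}$ on $\mathbb{R}^N$, for which each conditional law $\mu_i(\cdot|x)$ is an explicit one-dimensional Gaussian; the model, function names and parameters are illustrative only and not part of the paper.

```python
import numpy as np

def gibbs_sweep(x, A, rng):
    # One systematic scan: refresh sites 1..N in order from mu_i(.|x).
    # For mu prop. to exp(-x^T A x / 2) with A positive definite, the
    # conditional law of x_i given (x_j, j != i) is
    # N(-(1/A_ii) * sum_{j != i} A_ij x_j, 1/A_ii).
    N = len(x)
    for i in range(N):
        cond_mean = -(A[i] @ x - A[i, i] * x[i]) / A[i, i]
        x[i] = cond_mean + rng.standard_normal() / np.sqrt(A[i, i])
    return x

def gibbs_chain(x0, A, n_sweeps, rng):
    # The Gibbs sampling chain (Z_k): Z_k is the state after k full sweeps.
    x = np.array(x0, dtype=float)
    return np.array([gibbs_sweep(x, A, rng).copy() for _ in range(n_sweeps)])

rng = np.random.default_rng(0)
N = 10
A = np.eye(N) + 0.05 * (np.ones((N, N)) - np.eye(N))  # weak interactions
Z = gibbs_chain(np.zeros(N), A, n_sweeps=2000, rng=rng)
print(Z[:, 0].mean())  # empirical mean of f(x) = x_1, close to mu(f) = 0
```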

In [17], a concentration estimate was obtained for
$$\mathbb{P}_x\left(\frac{1}{n}\sum_{k=1}^{n} f(Z_k) - \mu(f) > t\right)$$
for functions $f$ that are Lipschitzian w.r.t. the $d_{l_1^N}$-metric. That result in particular yields the correct concentration speed for functions of the type $f(x) = \frac{1}{N}\sum_{i=1}^{N} g(x_i)$, but not for $f$ depending on a very small number of variables (for example $f(x) = g(x_1)$).

So the purpose of this paper is to solve the problem above, i.e., to establish a new sharp concentration estimate for $\mathbb{P}_x\left(\frac{1}{n}\sum_{k=1}^{n} f(Z_k) - \mu(f) > t\right)$ where the function $f$ depends on a small number of variables. Our method is to prove Talagrand's $T_2$-transport inequality with respect to (w.r.t. for short) the $d_{l_2^N}$-metric (see the definition (2.1) below), which is much stronger than the $T_1$-transport inequality w.r.t. the $d_{l_1^N}$-metric. The main new feature of our $T_2$-transport inequality is that it is dimension-free. As is well known, the $T_2$-transport inequality is much more difficult to establish than the $T_1$-transport inequality (see [3, 7, 8]). Technically, this obliges us to introduce a new type of Dobrushin interdependence coefficients and complicates considerably the tensorization procedure.
This paper is organized as follows. The next section contains some preliminaries about transport inequalities and Gibbs sampling. We present the main results in Section 3 and prove them in Section 4.

Transport inequality
Throughout the paper, $E$ is a Polish space with the Borel $\sigma$-field $\mathcal{B}$, and $d$ is a metric on $E$ such that $d(x, y)$ is lower semi-continuous on $E^2$ (so $d$ does not necessarily generate the topology of $E$). On the product space $E^N$, we consider the $l_p^N$ ($p = 1, 2$)-metric
$$d_{l_p^N}(x, y) := \left(\sum_{i=1}^{N} d(x_i, y_i)^p\right)^{1/p}, \qquad x, y \in E^N. \tag{2.1}$$
Later, $d_{l_p}$ is sometimes short for $d_{l_p^N}$ (or $d_{l_p^n}$) when the index $N$ (or $n$, respectively) is obvious from the context. $E^N$ is endowed with the $d_{l_p^N}$-metric unless otherwise stated.
Let $M_1(E)$ be the space of Borel probability measures on $E$, and
$$M_p^d(E) := \left\{ \nu \in M_1(E) :\ \int_E d(x, x_0)^p\, \nu(dx) < +\infty \right\}$$
($x_0 \in E$ is some fixed point; the definition above does not depend on $x_0$, by the triangle inequality). Given $\nu_1, \nu_2 \in M_p^d(E)$, the $L^p$-Wasserstein distance between $\nu_1$ and $\nu_2$ is given by
$$W_{p,d}(\nu_1, \nu_2) := \inf_{\pi} \left( \iint_{E \times E} d(x, y)^p\, \pi(dx, dy) \right)^{1/p},$$
where the infimum is taken over all probability measures $\pi$ on $E \times E$ whose marginal distributions are respectively $\nu_1$ and $\nu_2$ (such a $\pi$ is called a coupling of $\nu_1$ and $\nu_2$). When $\mu, \nu$ are probability measures, the Kullback information (or relative entropy) of $\nu$ with respect to $\mu$ is defined as
$$H(\nu|\mu) := \begin{cases} \displaystyle\int \log \frac{d\nu}{d\mu}\, d\nu, & \text{if } \nu \ll \mu,\\ +\infty, & \text{otherwise.} \end{cases}$$
One says that $\mu \in M_p^d(E)$ satisfies the transport inequality $T_p(c)$ on $(E, d)$, for a constant $c > 0$, if
$$W_{p,d}(\nu, \mu) \le \sqrt{2c\, H(\nu|\mu)} \qquad \text{for all } \nu \in M_p^d(E).$$
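As a concrete instance of these definitions, recall the classical fact, due to Talagrand, that the standard Gaussian measure $\gamma = N(0, 1)$ satisfies $T_2(1)$ on $(\mathbb{R}, |\cdot|)$:
$$W_{2, |\cdot|}(\nu, \gamma) \le \sqrt{2\, H(\nu|\gamma)} \qquad \text{for all } \nu \in M_2^{|\cdot|}(\mathbb{R}).$$
For example, if $\nu = N(m, 1)$, the translation coupling $y = x + m$ gives $W_{2,|\cdot|}(\nu, \gamma) = |m|$, while $H(\nu|\gamma) = m^2/2$, so the inequality holds with equality; this example is recalled here for orientation and is not part of the paper.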

Gibbs sampling
Let $\mu_i(\cdot|x)$ ($x = (x_1, \cdots, x_N) \in E^N$) be the given regular conditional distribution of $x_i$ knowing $(x_j, j \ne i)$ under $\mu$, and let $\tilde{\mu}_i(dy|x)$ be the lift of $\mu_i$ to $E^N$, i.e., the law on $E^N$ of the vector obtained from $x$ by replacing the $i$-th coordinate with a sample from $\mu_i(\cdot|x)$ and keeping all other coordinates fixed.
Gibbs sampling is described as follows. Given an initial point $x^0 \in E^N$, let $(X_k, k \ge 0)$ be the non-homogeneous Markov chain starting from $x^0$, defined on some probability space $(\Omega, \mathcal{F}, \mathbb{P}_{x^0})$, which at step $k$ refreshes the site $i_k := ((k-1) \bmod N) + 1$ according to $\tilde{\mu}_{i_k}(\cdot|X_{k-1})$, keeping all other coordinates fixed. The Gibbs sampling is then the time-homogeneous Markov chain $(Z_k = X_{kN},\ k = 0, 1, \cdots)$, whose transition probability is denoted by $P$.
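In kernel notation (a rephrasing of the construction above, assuming the sites are scanned in the order $1, 2, \cdots, N$), one full sweep composes the one-site kernels:
$$P(x, dy) = \big(\tilde{\mu}_1 \tilde{\mu}_2 \cdots \tilde{\mu}_N\big)(x, dy),$$
where the product of kernels acts from the left, i.e., $\tilde{\mu}_1$ is applied first.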

Main results
Throughout the paper we assume that the Dobrushin interdependence coefficients
$$c_{ij}^{(p)} := \sup_{x = y \text{ off } j} \frac{W_{p,d}\big(\mu_i(\cdot|x),\, \mu_i(\cdot|y)\big)}{d(x_j, y_j)}, \qquad p = 1, 2,$$
are all finite, where "$x = y$ off $j$" means that $x_k = y_k$ for all $k \ne j$; we write $C^{(p)} := (c_{ij}^{(p)})_{1 \le i, j \le N}$. Obviously $c_{ii}^{(p)} = 0$. Denote by $\|A\|_p$ the operator norm of a general $N$ by $N$ matrix $A$ acting as an operator from $l_p^N$ to itself. Then the well-known Dobrushin uniqueness condition (see [4, 5]) reads
$$\|C^{(1)}\|_\infty = \sup_{i} \sum_{j=1}^{N} c_{ij}^{(1)} < 1,$$
so the generalization of the Dobrushin uniqueness condition reads $\|C^{(p)}\|_p < 1$. Set $r_p := \|C^{(2)}\|_p$ for $p = 1, 2, \infty$, and consider the hypothesis

(H2): for some constant $c > 0$, every conditional distribution $\mu_i(\cdot|x)$ satisfies $T_2(c)$ on $(E, d)$.

Our main results are the following.

Theorem 3.1. Assume $r_\infty r_1 < \frac{1}{2}$ and (H2). Then for any Lipschitzian function $f$ on $E^N$ with $\|f\|_{Lip(d_{l_2^N})} \le \alpha$, the concentration inequality (3.3) holds: for all $t > 0$ and $n \ge 1$,
$$\mathbb{P}_x\left(\frac{1}{n}\sum_{k=1}^{n} f(Z_k) > \mu(f) + b_n(x) + t\right) \le \exp\left(-\frac{n t^2}{2 c\, \alpha^2}\, \theta(r_1, r_\infty)\right), \tag{3.3}$$
where $b_n(x) \ge 0$ is a bias term coming from the starting point $x$, and the factor $\theta(r_1, r_\infty) \in (0, 1]$ depends only on $r_1$ and $r_\infty$, with $\theta(0, 0) = 1$.

Remark 3.3. Recall some results from [17, Lemma 3.4 and Theorem 2.7]: assume the Dobrushin uniqueness condition and, for some constant $c > 0$, the corresponding $T_1(c)$ hypothesis on the conditional distributions. Then for any Lipschitzian function $f$ on $E^N$ with $\|f\|_{Lip(d_{l_1^N})} \le \alpha$, one has an analogous Gaussian concentration inequality for $\frac{1}{n}\sum_{k=1}^{n} f(Z_k)$, for all $t > 0$, $n \ge 1$.

Remark 3.4. By a result of Gozlan (for details see [8]), $T_2(c)$ is equivalent to dimension-free concentration on product spaces, which is one main difference between $T_2$ and $T_1$.
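For numerical intuition about the norms $\|C^{(p)}\|_p$, the following Python sketch (illustrative only; the matrix `C` is a made-up stand-in for a matrix of interdependence coefficients) computes the three operator norms used above and checks the classical interpolation bound $\|A\|_2 \le \sqrt{\|A\|_1 \|A\|_\infty}$, which links a condition of the form $r_\infty r_1 < \frac{1}{2}$ on the norms of $C^{(2)}$ to the $l_2$-contraction $\|C^{(2)}\|_2 < \frac{1}{\sqrt{2}}$.

```python
import numpy as np

def op_norms(A):
    # Operator norms of A on l_p^N: p = 1 is the max column sum,
    # p = 2 the largest singular value, p = inf the max row sum.
    n1 = np.abs(A).sum(axis=0).max()
    n2 = np.linalg.norm(A, 2)
    ninf = np.abs(A).sum(axis=1).max()
    return n1, n2, ninf

rng = np.random.default_rng(1)
N = 8
C = 0.05 * rng.random((N, N))   # made-up stand-in for (c_ij^{(2)})
np.fill_diagonal(C, 0.0)        # c_ii^{(2)} = 0

r1, r2, rinf = op_norms(C)
print(r1, r2, rinf, r1 * rinf < 0.5)
# Interpolation: ||A||_2 <= sqrt(||A||_1 * ||A||_inf), so r1 * rinf < 1/2
# forces ||C||_2 < 1/sqrt(2).
assert r2 <= np.sqrt(r1 * rinf) + 1e-12
```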
Remark 3.5. It is easy to show that the concentration inequality (3.3) is sharp: in fact, take $E^N = \mathbb{R}^N$ and $\mu = \gamma^{\otimes N}$, where $\gamma$ is the Gaussian law $N(0, 1)$. Then $(Z_k, k \ge 1)$ is an independent and identically distributed sequence, $C^{(2)} = 0$, $c = 1$, $r_1 = r_\infty = 0$, and the inequality (3.3) becomes sharp for $f(x) = x_1$: in this case, the inequality (3.3) reads
$$\mathbb{P}\left(\frac{1}{n}\sum_{k=1}^{n} Z_k^{(1)} > t\right) \le e^{-n t^2 / 2},$$
however, it is well known that
$$\lim_{n \to \infty} -\frac{1}{n} \log \mathbb{P}\left(\frac{1}{n}\sum_{k=1}^{n} Z_k^{(1)} > t\right) = \frac{t^2}{2}.$$

Next we emphasize the differences and improvements of this theorem compared with [17]: if $f$ depends only on $l$ variables and is $1$-Lipschitz in each of them w.r.t. $d$, then $\|f\|_{Lip(d_{l_2^N})} \le \sqrt{l}$, so (3.3) implies, for all $t > 0$, $n \ge 1$, a concentration bound of speed $n/l$. For sufficient conditions ensuring (H2), see Wang [9] for the log-Sobolev inequality (which is stronger than $T_2(c)$ by Otto-Villani [14]), or the very general Lyapunov function criterion for $T_2(c)$ by Cattiaux et al. [2].
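A quick numerical check of this sharpness claim (illustrative, not part of the paper): in the i.i.d. Gaussian case the empirical mean of $n$ samples is exactly $N(0, 1/n)$, so its tail can be evaluated in closed form and $-\frac{1}{n}\log \mathbb{P}(\cdot > t)$ compared with $t^2/2$.

```python
from math import erfc, log, sqrt

# The empirical mean of n i.i.d. N(0,1) samples is N(0, 1/n), hence
# P(mean > t) = (1/2) * erfc(t * sqrt(n/2)) exactly.
t = 0.5
for n in [10, 100, 1000, 5000]:
    p = 0.5 * erfc(t * sqrt(n / 2.0))
    print(n, -log(p) / n)  # approaches t^2 / 2 = 0.125 as n grows
```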

The construction of the coupling.
Given any two initial distributions $\nu_1$ and $\nu_2$ on $E^N$, we begin by constructing our coupled non-homogeneous Markov chain $(X_i, Y_i)_{i \ge 0}$, which is similar to the coupling in [17] or [13].
(For the existence of such a coupling, refer to Villani [16].) Define the partial order on $\mathbb{R}^N$ by $a \le b$ if and only if $a_i \le b_i$, $i = 1, \cdots, N$. Then, by the triangle inequality for the metric $W_{2,d}$, we obtain for every $k \in \mathbb{N}$ a coordinatewise estimate: the vector of coordinatewise $L^2$-coupling distances after one step is dominated, in the partial order above, by a matrix built from the coefficients $c_{ij}^{(2)}$ applied to the vector before the step. Since the sites are refreshed sequentially within one sweep, the relevant matrices split into strictly lower triangular and upper triangular parts (the blank entries of the matrices mean 0). Therefore, by iteration, and then by (4.2) above, the Markov property and iteration, the coupling distances of the two chains started from $x$ and $y$ are controlled by sums of the form $\sum_j (\cdot)_{ij}\, d(x_j, y_j)$.
The key is to construct an appropriate coupling of $Q$ and $P$, that is, two random sequences $Y_{[1,N]}$ and $X_{[1,N]}$ taking values in $E^N$, distributed according to $Q$ and $P$ respectively, on some probability space $(\Omega, \mathcal{F}, \mathbb{P})$. We define the joint distribution $\mathcal{L}(Y_{[1,N]}, X_{[1,N]})$ by induction as follows (the Marton coupling).
At first, the law of $(Y_1, X_1)$ is the optimal coupling of $Q(x_1 \in \cdot)$ and $P(x_1 \in \cdot)$ for $W_{2,d}$. Assume that $(Y_{[1,i-1]}, X_{[1,i-1]})$ has been constructed for some $i$; then, conditionally on $(Y_{[1,i-1]}, X_{[1,i-1]}) = (y_{[1,i-1]}, x_{[1,i-1]})$, the law of $(Y_i, X_i)$ is the optimal coupling of $Q_i(\cdot|y_{[1,i-1]})$ and $P_i(\cdot|x_{[1,i-1]})$, that is,
$$\mathbb{E}\left[ d^2(Y_i, X_i) \,\middle|\, Y_{[1,i-1]} = y_{[1,i-1]},\ X_{[1,i-1]} = x_{[1,i-1]} \right] = W_{2,d}^2\big( Q_i(\cdot|y_{[1,i-1]}),\, P_i(\cdot|x_{[1,i-1]}) \big).$$
Obviously, $Y_{[1,N]}$ and $X_{[1,N]}$ are of law $Q$ and $P$ respectively. By the triangle inequality for the $W_{2,d}$ distance, one bounds $W_{2,d}\big(Q_i(\cdot|y_{[1,i-1]}), P_i(\cdot|x_{[1,i-1]})\big)$ in terms of the already coupled coordinates; the resulting inequality is then summed over $i$, noting that the $l_2^N$-norm of a general random vector $a = (a_1, \cdots, a_N)$ obeys the corresponding triangle inequality in $L^2(\mathbb{P})$.
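The optimal couplings used at each step are straightforward to realize when $E = \mathbb{R}$ and $d = |\cdot|$: the quantile (comonotone) coupling attains $W_p$ for every $p \ge 1$. A minimal empirical sketch in Python (illustrative only, not part of the paper):

```python
import numpy as np

def quantile_coupling_w2(sample_q, sample_p):
    # Pair order statistics: the comonotone (quantile) coupling, which is
    # the optimal coupling for W_p on the real line for every p >= 1.
    yq, xp = np.sort(sample_q), np.sort(sample_p)
    w2 = np.sqrt(np.mean((yq - xp) ** 2))  # empirical W_2 distance
    return np.stack([yq, xp], axis=1), w2

rng = np.random.default_rng(3)
_, w2 = quantile_coupling_w2(rng.normal(1.0, 1.0, 10_000),
                             rng.normal(0.0, 1.0, 10_000))
print(w2)  # close to W_2(N(1,1), N(0,1)) = 1
```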
In order to prove Theorem 3.1, we need the following dependent tensorization of $T_2$ (from the result of Djellout-Guillin-Wu [3, Theorem 2.5]).
Lemma 4.3. Let $P$ be a probability measure on $E^n$, let $P_i(\cdot|x_{[1,i-1]})$ denote the regular conditional law of $x_i$ given $x_{[1,i-1]}$ under $P$ for $2 \le i \le n$, and let $P_i(\cdot|x_{[1,i-1]})$ be the distribution of $x_1$ for $i = 1$, where $x_{[1,0]}$ denotes some fixed point $x_0$ on $E$. Assume that, for some $\kappa > 0$ and $r \in [0, 1)$, every conditional law $P_i(\cdot|x_{[1,i-1]})$ satisfies $T_2(\kappa)$ on $(E, d)$, and that $x_{[1,i-1]} \mapsto P_i(\cdot|x_{[1,i-1]})$ is $r$-Lipschitz from $(E^{i-1}, d_{l_2^{i-1}})$ to $(M_2^d(E), W_{2,d})$.
Then for any probability measure $Q$ on $E^n$,
$$W_{2,\, d_{l_2^n}}(Q, P) \le \frac{\sqrt{2\kappa\, H(Q|P)}}{1 - r}.$$
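As a sanity check, in the independent case (conditional laws not depending on the past, so that one may take $r = 0$) the lemma reduces to the classical dimension-free tensorization of $T_2$: if $P = \nu^{\otimes n}$ with $\nu$ satisfying $T_2(\kappa)$ on $(E, d)$, then
$$W_{2,\, d_{l_2^n}}(Q, \nu^{\otimes n}) \le \sqrt{2\kappa\, H(Q|\nu^{\otimes n})} \qquad \text{for all } Q,$$
with the same constant $\kappa$ for every $n$.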
By Lemma 4.3 above, we can obtain the following key lemma, which can be considered as the main theoretical result of this paper. Equip $(E^N)^n$ with the metric
$$(d_{l_2})^{l_2}(\omega, \omega') := \left( \sum_{k=1}^{n} d_{l_2^N}\big(\omega_k, \omega'_k\big)^2 \right)^{1/2}, \qquad \omega, \omega' \in (E^N)^n.$$
Let $\mathbb{P}_x$ be the distribution of our Gibbs sampling $(Z_1, \cdots, Z_n)$ on $(E^N)^n$ equipped with its Borel $\sigma$-algebra, where the starting point $x \in E^N$ is arbitrary. Assume $r_\infty r_1 < \frac{1}{2}$ and (H2). Then for any probability measure $Q$ on $((E^N)^n, (d_{l_2})^{l_2})$, we have
$$W_{2,\, (d_{l_2})^{l_2}}(Q, \mathbb{P}_x) \le \sqrt{2 c'\, H(Q|\mathbb{P}_x)}$$
for some constant $c' = c'(c, r_1, r_\infty)$. In other words, $\mathbb{P}_x$ satisfies the transport inequality $T_2(c')$ w.r.t. the metric $(d_{l_2})^{l_2}$, with a constant free of the dimension $N$ and of the number of steps $n$.
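To see how such a transport inequality produces concentration of empirical means (the standard argument, sketched here under the statement above): since $W_1 \le W_2$, the measure $\mathbb{P}_x$ also satisfies $T_1(c')$ w.r.t. $(d_{l_2})^{l_2}$, hence by the Bobkov-Götze dual characterization, every function $F$ on $(E^N)^n$ that is $\beta$-Lipschitz w.r.t. $(d_{l_2})^{l_2}$ satisfies
$$\mathbb{P}_x\big(F > \mathbb{E}_{\mathbb{P}_x} F + t\big) \le e^{-t^2/(2 c' \beta^2)}, \qquad t > 0.$$
Applying this to $F(\omega) = \frac{1}{n}\sum_{k=1}^{n} f(\omega_k)$ with $\|f\|_{Lip(d_{l_2^N})} \le \alpha$, whose Lipschitz constant is $\beta = \alpha/\sqrt{n}$ by the Cauchy-Schwarz inequality, yields the speed $e^{-n t^2/(2 c' \alpha^2)}$ appearing in Theorem 3.1.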