An Information-Theoretic Proof of a Finite de Finetti Theorem

A finite form of de Finetti's representation theorem is established using elementary information-theoretic tools: The distribution of the first $k$ random variables in an exchangeable binary vector of length $n\geq k$ is close to a mixture of product distributions. Closeness is measured in terms of the relative entropy and an explicit bound is provided.


Introduction
A finite sequence of random variables $(X_1, X_2, \ldots, X_n)$ is exchangeable if it has the same distribution as $(X_{\pi(1)}, X_{\pi(2)}, \ldots, X_{\pi(n)})$ for every permutation $\pi$ of $\{1, 2, \ldots, n\}$. An infinite sequence $\{X_k ; k \geq 1\}$ is exchangeable if $(X_1, X_2, \ldots, X_n)$ is exchangeable for all $n$. The celebrated representation theorem of de Finetti [8,9] states that the distribution of any infinite exchangeable sequence of binary random variables can be expressed as a mixture of the distributions corresponding to independent and identically distributed (i.i.d.) Bernoulli trials. For discussions of the role of de Finetti's theorem in connection with the foundations of Bayesian statistics and subjective probability see, e.g., [10,5] and the references therein.
Although it is easy to see via simple examples that de Finetti's theorem may fail for finite binary exchangeable sequences, for large but finite $n$ the distribution of the first $k$ random variables of an exchangeable vector of length $n$ admits an approximate de Finetti-style representation. Quantitative versions of this statement have been established by Diaconis [10] and by Diaconis and Freedman [13]. Diaconis' proof in [10] is based on a geometric interpretation of the set of exchangeable measures as a convex subset of the probability simplex.
The purpose of this note is to provide a new information-theoretic proof of a related finite version of de Finetti's theorem. For each $p \in [0,1]$, let $P_p$ denote the Bernoulli probability mass function with parameter $p$, $P_p(1) = 1 - P_p(0) = p$, and write
$$D(P\|Q) = \sum_{x \in B} P(x) \log\frac{P(x)}{Q(x)}$$
for the relative entropy (or Kullback-Leibler divergence) between two probability mass functions $P, Q$ on the same discrete set $B$; throughout, '$\log$' denotes the natural logarithm.
Theorem. Let $n \geq 2$. If the binary random variables $(X_1, X_2, \ldots, X_n)$ are exchangeable, then there is a probability measure $\mu$ on $[0,1]$ such that, for every $1 \leq k \leq n$, the relative entropy between the probability mass function $Q_k$ of $(X_1, X_2, \ldots, X_k)$ and the mixture $M_{k,\mu} := \int_0^1 P_p^k \, d\mu(p)$, where $P_p^k$ denotes the p.m.f. of $k$ i.i.d. Bernoulli($p$) random variables, satisfies:
$$D(Q_k \| M_{k,\mu}) \leq \frac{5k^2}{n-k} \log n. \qquad (1)$$
By Pinsker's inequality [7,19], $\|P - Q\|^2 \leq 2 D(P\|Q)$, the theorem also implies that
$$\|Q_k - M_{k,\mu}\| \leq k \sqrt{\frac{10 \log n}{n-k}}, \qquad (2)$$
where $\|P - Q\| := 2 \sup_B |P(B) - Q(B)|$ denotes the total variation distance between $P$ and $Q$. This bound is suboptimal in that, as shown by Diaconis and Freedman [13], the correct rate with respect to the total variation distance in (2) is $O(k/n)$. On the other hand, (1) gives an explicit bound for the stronger notion of relative entropy 'distance.' Our primary motivation is not to obtain optimal rates, but rather to illustrate how elementary information-theoretic ideas can be used to provide an alternative proof strategy for de Finetti's theorem, following a long series of works developing this point of view, including information-theoretic proofs of Markov chain convergence [21,16], the central limit theorem [4,2], Poisson and compound Poisson approximation [18,3], and the Hewitt-Savage 0-1 law [20].
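As a quick numerical sanity check of the statement (not part of the proof), the following Python sketch builds a small exchangeable vector as a two-point mixture of i.i.d. Bernoulli strings; the parameters $0.2$, $0.7$ and the sizes $n = 12$, $k = 3$ are arbitrary illustrative choices, not values from the text. It computes $Q_k$, the mixing measure $\mu$ from the proof (the law of $N_{1,n}/n$), the mixture $M_{k,\mu}$, and compares $D(Q_k\|M_{k,\mu})$ with the bound in (1).

```python
import itertools
import math

# A small exchangeable example (hypothetical parameters): first pick p from
# {0.2, 0.7} with equal probability, then flip n i.i.d. coins of bias p.
n, k = 12, 3
ps, weights = [0.2, 0.7], [0.5, 0.5]

def prob_xn(x):
    """P(X_1^n = x) under the two-point mixture of i.i.d. Bernoulli strings."""
    s = sum(x)
    return sum(w * p**s * (1 - p)**(n - s) for p, w in zip(ps, weights))

# Q_k: marginal p.m.f. of the first k coordinates.
Qk = {}
for x in itertools.product([0, 1], repeat=n):
    Qk[x[:k]] = Qk.get(x[:k], 0.0) + prob_xn(x)

# mu: the law of N_{1,n}/n, indexed here by the number of ones l.
mu = [0.0] * (n + 1)
for x in itertools.product([0, 1], repeat=n):
    mu[sum(x)] += prob_xn(x)

# M_{k,mu}: mixture of i.i.d. Bernoulli(l/n) strings of length k under mu.
def Mk(xk):
    s = sum(xk)
    return sum(mu[l] * (l / n)**s * (1 - l / n)**(k - s) for l in range(n + 1))

D = sum(q * math.log(q / Mk(xk)) for xk, q in Qk.items() if q > 0)
bound = 5 * k * k * math.log(n) / (n - k)
print(D, bound)  # D should fall well below the bound of (1) here
```

Note that $Q_k$ here is itself a mixture of product distributions, yet $D(Q_k\|M_{k,\mu})$ is nonzero: the measure $\mu$ produced by the proof is the discrete law of $N_{1,n}/n$, not the original two-point mixing measure.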
Before turning to the proof, we mention that there are numerous generalisations and extensions of de Finetti's classical theorem and its finite version along different directions; see, e.g., [11] and the references therein. The classical de Finetti representation theorem has been shown to hold for exchangeable processes with values in much more general spaces than {0, 1} [15], and for mixtures of Markov chains [12]. Recently, an elementary proof of de Finetti's theorem for the binary case was given in [17], a more analytic proof appeared in [1], and connections with category theory were drawn in [14].

Proof of the finite de Finetti theorem
We first need to introduce some notation. Let $n \geq 2$ be fixed. For any $1 \leq i \leq j \leq n$, write $X_i^j$ for the block of random variables $X_i^j = (X_i, X_{i+1}, \ldots, X_j)$. Denote by $N_{i,j}$ the number of 1s in $X_i^j$, so that $N_{i,j} = \sum_{k=i}^j X_k$, and for every $0 \leq \ell \leq n$ write $A_\ell$ for the event $\{N_{1,n} = \ell\}$. The main step of the proof is the estimate in the lemma below, which gives a bound on the degree of dependence between $X_i$ and $X_{i+1}^k$, conditional on $A_\ell$. This bound is expressed in terms of the mutual information. Let $(X, Y)$ be two discrete random variables with joint probability mass function (p.m.f.) $P_{XY}$ and marginal p.m.f.s $P_X$ and $P_Y$, respectively. Recall that the entropy $H(X)$ of $X$, often viewed as a measure of the inherent "randomness" of $X$ [6], is defined as
$$H(X) = H(P_X) = -\sum_x P_X(x) \log P_X(x),$$
where the sum is over all possible values of $X$ with nonzero probability. Similarly, the conditional entropy of $Y$ given $X$ is
$$H(Y|X) = -\sum_{x,y} P_{XY}(x,y) \log P_{Y|X}(y|x),$$
where $P_{Y|X}(y|x) = P_{XY}(x,y)/P_X(x)$.
The mutual information between $X$ and $Y$ is $I(X;Y) = H(Y) - H(Y|X)$, and it can also be expressed as
$$I(X;Y) = D(P_{XY} \| P_X P_Y) = \sum_{x,y} P_{XY}(x,y) \log\frac{P_{XY}(x,y)}{P_X(x) P_Y(y)}.$$
For any event $A$, we write $I(X;Y|A)$ for the mutual information between $X$ and $Y$ when all relevant p.m.f.s are conditioned on $A$.
From the definition, an obvious interpretation of $I(X;Y)$ is as a measure of the amount of "common randomness" in $X$ and $Y$. Additionally, since $I(X;Y)$ is always nonnegative, and equal to zero if and only if $X$ and $Y$ are independent, the mutual information can be viewed as a universal, nonlinear measure of dependence between $X$ and $Y$. See [6] for standard properties of the entropy, relative entropy and mutual information.
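Since these identities are used repeatedly below, here is a minimal Python illustration (the joint p.m.f. is a made-up example, not taken from the text) checking that $H(Y) - H(Y|X)$ coincides with the relative entropy between $P_{XY}$ and the product of its marginals, and is nonnegative.

```python
import math

# A hypothetical joint p.m.f. of (X, Y) on {0,1}^2, used only to illustrate
# the identities I(X;Y) = H(Y) - H(Y|X) = D(P_XY || P_X x P_Y).
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

PX = {x: sum(p for (a, b), p in P.items() if a == x) for x in (0, 1)}
PY = {y: sum(p for (a, b), p in P.items() if b == y) for y in (0, 1)}

def H(Q):
    """Entropy of a p.m.f. given as a dict of probabilities (natural log)."""
    return -sum(q * math.log(q) for q in Q.values() if q > 0)

# Conditional entropy H(Y|X) = -sum_{x,y} P(x,y) log P(y|x).
HYgX = -sum(p * math.log(p / PX[x]) for (x, y), p in P.items() if p > 0)

I1 = H(PY) - HYgX
I2 = sum(p * math.log(p / (PX[x] * PY[y])) for (x, y), p in P.items() if p > 0)
print(I1, I2)  # the two expressions agree
```

For this joint p.m.f. the mutual information is strictly positive, reflecting the dependence between the two coordinates.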
Finally, we record an elementary bound that will be used in the proof of the lemma. Write $h(p) = -p \log p - (1-p)\log(1-p)$, $p \in [0,1]$, for the binary entropy function. Since $|h'(p)| = |\log((1-p)/p)| \leq \log(n-1)$ for $p \in [1/n, (n-1)/n]$, a simple Taylor expansion gives:
$$|h(p) - h(q)| \leq |p - q| \log n \quad \text{for all } p, q \in [1/n, (n-1)/n]. \qquad (3)$$

Lemma. For all $1 \leq k \leq n$, all $1 \leq i \leq k-1$, and any $0 \leq \ell \leq n$:
$$I(X_i ; X_{i+1}^k \mid A_\ell) \leq \frac{5k}{n-k} \log n.$$

Proof. We assume without loss of generality that $k \leq n/2$, for otherwise the result is trivially true since the mutual information in the statement is always no greater than 1. Also, if $\ell = 0$ or $n$ the conditional mutual information is zero and the result is again trivially true. Let $Q_n$ denote the p.m.f. of $X_1^n$. By exchangeability, conditional on $A_\ell$, all sequences in $\{0,1\}^n$ with exactly $\ell$ 1s have the same probability under $Q_n$, so $X_1^n$ conditional on $A_\ell$ is uniformly distributed among all such sequences. This implies that for all $1 \leq k \leq n/2$, $1 \leq i \leq k-1$, and $1 \leq \ell \leq n-1$,
$$\Pr(X_i = 1 \mid A_\ell) = \frac{\ell}{n} \quad \text{and} \quad \Pr(X_i = 1 \mid X_{i+1}^k, A_\ell) = \frac{\ell - N_{i+1,k}}{n - (k-i)}.$$
For the mutual information we have:
$$I(X_i ; X_{i+1}^k \mid A_\ell) = H(X_i \mid A_\ell) - H(X_i \mid X_{i+1}^k, A_\ell)$$
$$= E\left[\left(h\Big(\frac{\ell}{n}\Big) - h\Big(\frac{\ell - N_{i+1,k}}{n-k+i}\Big)\right) \mathbb{1}\{\ell + k - i - n + 1 \leq N_{i+1,k} < \ell\}\right]$$
$$\quad + \; h\Big(\frac{\ell}{n}\Big) \Pr(N_{i+1,k} \geq \ell \mid A_\ell) \; + \; h\Big(\frac{\ell}{n}\Big) \Pr(N_{i+1,k} < \ell + k - i - n + 1 \mid A_\ell), \qquad (4)$$
where the expectation is with respect to the conditional distribution of $N_{i+1,k}$ given $A_\ell$; on the second event $\ell - N_{i+1,k} = 0$ and on the third $(\ell - N_{i+1,k})/(n-k+i) = 1$, so in both cases the corresponding binary entropy vanishes. If the probability in the third term above is nonzero, then necessarily $\ell \geq n - k + 1$ and thus, using $n \geq 2k$, $h(\ell/n) \leq \frac{2k}{n-k} \log n$. On the other hand, if $\ell + k - i - n + 1 \leq N_{i+1,k} < \ell$, then both $\ell/n$ and $(\ell - N_{i+1,k})/(n - (k-i))$ are between $1/n$ and $(n-1)/n$, so from (3) the first term in (4) is bounded above by
$$E\left[\left|\frac{\ell}{n} - \frac{\ell - N_{i+1,k}}{n-k+i}\right|\right] \log n \leq \frac{2k}{n-k} \log n.$$
Finally, by Markov's inequality, the probability in the second term is no more than $k/n$, while the binary entropy is always bounded above by 1. Combining these three estimates yields
$$I(X_i ; X_{i+1}^k \mid A_\ell) \leq \frac{2k}{n-k} \log n + \frac{k}{n} + \frac{2k}{n-k} \log n \leq \frac{5k}{n-k} \log n.$$
The result follows.
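To see the lemma in action, the following brute-force Python check computes $I(X_i ; X_{i+1}^k \mid A_\ell)$ exactly from the uniform distribution over binary strings with $\ell$ ones, and compares it to the bound; the values $n = 10$, $k = 4$, $i = 1$, $\ell = 4$ are arbitrary small choices for illustration, and the bound is quite loose for them.

```python
import itertools
import math

# Conditional on A_l, X_1^n is uniform over binary strings with exactly l ones.
n, k, i, l = 10, 4, 1, 4  # hypothetical small values, chosen for illustration
seqs = [x for x in itertools.product([0, 1], repeat=n) if sum(x) == l]
w = 1.0 / len(seqs)

# Joint p.m.f. of (X_i, X_{i+1}^k) under the uniform conditional law.
joint = {}
for x in seqs:
    key = (x[i - 1], x[i:k])  # X_i and the block X_{i+1}^k (1-based indices)
    joint[key] = joint.get(key, 0.0) + w

PX, PB = {}, {}
for (a, b), p in joint.items():
    PX[a] = PX.get(a, 0.0) + p
    PB[b] = PB.get(b, 0.0) + p

# I(X_i; X_{i+1}^k | A_l) via the relative-entropy form of mutual information.
I = sum(p * math.log(p / (PX[a] * PB[b])) for (a, b), p in joint.items())
bound = 5 * k * math.log(n) / (n - k)
print(I, bound)  # I is far below the (loose) bound for these sizes
```

The computed mutual information is small but strictly positive, reflecting the weak dependence induced by conditioning on the total number of ones (sampling without replacement).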
We are now ready to prove the theorem. By the bound in the lemma,
$$\sum_{i=1}^{k-1} I(X_i ; X_{i+1}^k \mid A_\ell) \leq \frac{5k(k-1)}{n-k} \log n \leq \frac{5k^2}{n-k} \log n.$$
Also, by definition of the mutual information, using the obvious notation $H(X|A)$ for the entropy of the conditional p.m.f. of $X$ given $A$,
$$\sum_{i=1}^{k-1} I(X_i ; X_{i+1}^k \mid A_\ell) = \sum_{j=1}^{k} H(X_j \mid A_\ell) - H(X_1^k \mid A_\ell) = D\big(Q_k(\cdot \mid A_\ell) \,\big\|\, P_{\ell/n}^k\big),$$
where the last equality holds because, conditional on $A_\ell$, each $X_j$ has p.m.f. $P_{\ell/n}$. Finally, writing $\mu$ for the distribution of $\ell/n = (1/n)\sum_{i=1}^n X_i$ on $\{0, 1/n, 2/n, \ldots, 1\}$, averaging both sides with respect to $\ell$, and using the joint convexity of the relative entropy, yields the claimed result.
Remarks. The mixing measure $\mu = \mu_n$ in the theorem is completely characterised in the proof as the distribution of $(1/n)\sum_{i=1}^n X_i$, and it is the same for all $k$. Moreover, if $\{X_n ; n \geq 1\}$ is an infinite exchangeable sequence then it is also stationary, so by the ergodic theorem $(1/n)\sum_{i=1}^n X_i$ converges a.s. to some random variable $X$, and the $\mu_n$ converge weakly to the law, say $\mu$, of $X$. For fixed $k$, since $P_p^k$ is a bounded and continuous function of $p \in [0,1]$, we have, for any $x_1^k \in \{0,1\}^k$,
$$M_{k,\mu_n}(x_1^k) = \int P_p^k(x_1^k)\, d\mu_n(p) \to M_{k,\mu}(x_1^k) = \int P_p^k(x_1^k)\, d\mu(p),$$
and by our theorem, $\|Q_k - M_{k,\mu_n}\| = O(\sqrt{(\log n)/n})$. Therefore, we can conclude that, for each $k \geq 1$,
$$Q_k(x_1^k) = \int_0^1 P_p^k(x_1^k)\, d\mu(p), \qquad x_1^k \in \{0,1\}^k,$$
and thus recover de Finetti's classical representation theorem. Finally, we note that the argument used in the proof of the lemma, as well as the proof of our theorem, can easily be extended to provide corresponding results for exchangeable vectors taking values in any finite set. But as the constants involved become quite cumbersome, and our main motivation is to illustrate the connection with information-theoretic ideas (rather than to obtain the most general possible results), we have chosen to restrict attention to the binary case.