Exact sampling for intractable probability distributions via a Bernoulli factory

Many applications in the field of statistics require Markov chain Monte Carlo methods. Determining appropriate starting values and run lengths can be both analytically and empirically challenging. A desire to overcome these problems has led to the development of exact, or perfect, sampling algorithms which convert a Markov chain into an algorithm that produces i.i.d. samples from the stationary distribution. Unfortunately, very few of these algorithms have been developed for the distributions that arise in statistical applications, which typically have uncountable support. Here we study an exact sampling algorithm using a geometrically ergodic Markov chain on a general state space. Our work provides a significant reduction to the number of input draws necessary for the Bernoulli factory, which enables exact sampling via a rejection sampling approach. We illustrate the algorithm on a univariate Metropolis-Hastings sampler and a bivariate Gibbs sampler, which provide a proof of concept and insight into hyper-parameter selection. Finally, we illustrate the algorithm on a Bayesian version of the one-way random effects model with data from a styrene exposure study.


Introduction
Suppose we want to explore a probability distribution π defined on X.Further suppose π is intractable in the sense that direct (i.i.d.) sampling is unavailable.In this setting, the Markov chain Monte Carlo (MCMC) method can be a useful tool since it is often straightforward to construct and simulate an ergodic Markov chain that has π as its stationary distribution (Chen et al., 2000;Liu, 2001;Robert and Casella, 1999).The two main drawbacks of MCMC relative to direct sampling from π are (i) the difficulty in ascertaining how long the Markov chain needs to be run before it gets "close" to π (see, e.g., Jones and Hobert, 2001), and (ii) the difficulty in deriving and calculating asymptotically valid standard errors for the ergodic averages that are used to approximate intractable expectations under π (see, e.g., Flegal et al., 2008).
A desire to overcome these problems has led to the development of clever techniques using a Markov chain to create an algorithm that produces i.i.d.draws from the stationary distribution (e.g., Craiu and Meng, 2011;Green and Murdoch, 1999;Huber, 2004;Propp and Wilson, 1996;Wilson, 2000).Unfortunately, very few of these so-called perfect sampling algorithms have been developed for the distributions that arise in realistic statistical applications, which typically have uncountable support.Asmussen et al. (1992) and Blanchet and Meng (2005) provide one such algorithm applicable to Markov chains on general state spaces.The main assumption necessary is that the chain satisfies a one-step minorization condition.As we describe later, under this condition the stationary distribution admits a mixture representation, suggesting the following two-step sampling approach: sample the discrete distribution corresponding to the mixture weights, then sample the selected mixture component.
This approach has never been successfully implemented, however, it has been used to obtain approximate draws from π, see for example Blanchet and Thomas (2007), Hobert et al. (2006) and Hobert and Robert (2004).The difficult part is drawing from the discrete distribution corresponding to the mixture weights, which is done via a rejection sampling approach.In this paper, we provide solutions to a number of practical problems and illustrate the algorithm on three examples.This requires overcoming two challenges: (i) obtaining a dominating proposal distribution and (ii) generating a Bernoulli variate to decide whether a proposed draw is accepted or not.While (ii) might seem trivial, the challenge is that we are unable to (exactly) compute the success probability for this Bernoulli variate.
A solution to (i) requires identification of a bounding (proposal) probability mass function for the target mass function.Previously, Blanchet and Meng (2005) proposed an upper bound on moments associated with the target mass function.Blanchet and Thomas (2007) used output from a preliminary run of the Markov chain to construct an approximate upper bound.We provide an explicit bound for Markov chains satisfying a geometric drift condition using results from Roberts and Tweedie (1999).
A solution to (ii) will determine whether to accept a proposed draw.The decision is made by generating a Bernoulli random variable with success probability that involves the ratio between the target and proposal mass functions.In our case, the target mass function is unknown, apparently making this step impossible.However, one can still generate such a Bernoulli variate, using a socalled Bernoulli factory (Keane and O'Brien, 1994).Briefly, a Bernoulli factory is an algorithm that outputs a Bernoulli variate with success probability f (p), from i.i.d.Bernoulli(p) variates, when f is known but p is unknown.
The rejection sampling approach requires a Bernoulli factory algorithm for f (p) = ap and a ∈ (1, ∞).Nacu and Peres (2005) and Latuszynski et al. (2011) provide an algorithm when a = 2, but their algorithms are computationally demanding and scale poorly for a ∈ (1, ∞).For example when p ∈ (0, .4),one requires at least 65, 536 Bernoulli(p) random variables to generate one Bernoulli(2p) variate.In this paper, we provide an algorithm for any a ∈ (1, ∞) that reduces the computational time substantially.For example when p ∈ (0, .4),we can obtain a Bernoulli(2p) variate with only 256 Bernoulli(p) random variables.This is an important reduction because the Bernoulli factory accounts for much of the computational time in the exact sampling algorithm.Section 3 contains a full description the Bernoulli factory and our modification.
Our solutions to (i) and (ii) yield an exact sampling algorithm for π.The algorithm is suitable even for intractable distributions on general state spaces: that is, for distributions that typically arise in statistical applications.This is an important extension, since very few existing algorithms apply to general state spaces, but it is limited in the sense that one must be able to establish a drift and associated minorization condition for the underlying Markov chain.The current algorithm can be computationally demanding, however we have successfully implemented it in three examples.
Our first example considers a univariate Metropolis-Hastings sampler for which we obtain 1000 i.i.d.draws.The second example considers a slightly more complicated bivariate Gibbs sampler where we again obtain 1000 i.i.d.draws.These two examples could be considered toy examples in the sense that i.i.d.observations are available for each.However, they provide insights into the performance and hyper-parameter selection of the algorithm.
Our final example considers a Bayesian version of the classical one-way random effects model that is widely used to analyze data.We illustrate the exact sampling algorithm, using data from a styrene exposure study, to obtain 20 i.i.d.draws.This is the first successful implementation of an exact sampling algorithm for a model of this type.Our analysis considers a balanced design and requires development of a suitable drift condition, which improves upon the existing drift constants of Tan and Hobert (2009).
These examples give hope for exact sampling algorithms for general state space Markov chains in more complicated settings.Even if we are unable to obtain multiple draws in these settings, a single exact draw will alleviate the need for burn-in entirely.
The rest of this paper is organized as follows.Section 2 provides the mixture representation of π, details the rejection sampling approach and bounds the tail probabilities of the proposal distribution.Section 3 introduces the Bernoulli factory and proposes a new target function that speeds up the algorithm significantly.Section 4 gives the full exact sampling algorithm.Sections 5 and 6 implement the algorithm for two toy examples and a Bayesian version of a one-way random effects model, respectively.Finally, Section 7 discusses our implementation and provides some general recommendations to practitioners.

Exact sampling via a mixture distribution
Suppose we want to explore the intractable probability measure π(dx) defined on the measurable space (X, B(X)).Let P : X × B(X) → [0, 1] be a Markov transition function and let X = {X n } ∞ n=0 denote the corresponding Markov chain.Then for x ∈ X and a measurable set A, Assume that π is an invariant measure for the chain; i.e., π(A) = X P (x, A) π(dx) for all measurable A. Assume further that X satisfies the usual regularity conditions, which are irreducibility, aperiodicity and positive Harris recurrence.For definitions, see Meyn and Tweedie (1993) and Roberts and Rosenthal (2004).Finally, assume we are able to simulate the chain; that is, given X n = x, we have the ability to draw from P (x, •).
The exact sampling algorithm considered here utilizes a mixture representation for π (Asmussen et al., 1992;Hobert et al., 2006;Hobert and Robert, 2004), however, we must first develop the split chain.The main assumption necessary is that X satisfies a one-step minorization condition, i.e. there exists a function s : X → [0, 1] satisfying X s(x) π(dx) > 0 and some probability measure Q(dy) on (X, B(X)) such that, (1) Given X satisfies the one-step minorization condition at (1), then P can be decomposed as where and define R(x, dy) = 0 if s(x) = 1.It is helpful to think of (2) as a mixture of two Markov transition functions with probabilities s(x) and 1 − s(x).Equation (2) shows that it is possible to simulate X n+1 given X n = x as follows: Flip a coin (independently) that comes up heads with probability s(x).If the coin is a head, take This decomposition has several important applications in MCMC.Indeed, it can be used to perform regenerative simulation (Hobert et al., 2002;Mykland et al., 1995) and to derive computable bounds on the convergence rate of X (Lund and Tweedie, 1996;Roberts and Tweedie, 1999;Rosenthal, 1995Rosenthal, , 2002)).Now consider a new Markov chain that actually includes the coin flips mentioned above.Let Then, conditional on X n+1 = x , δ n+1 ∼ Bernoulli(s(x )).This chain is called the split chain.Equation (2) implies that, marginally, the sequence of X n values in the split chain has the same overall probability law as the original Markov chain X (Nummelin, 1984, Chapter 4).Note that, if δ n = 1, then the distribution of (X n+1 , δ n+1 ) does not depend on x.
Remark 1.We can avoid drawing from Q(•) entirely by changing the order slightly.Given x is the current state, we can simply generate X i+1 ∼ P (x, •) in the usual manner and then generate δ i |X i , X i+1 with where q(•) and k(•|x) are the densities corresponding to Q(•) and P (Nummelin, 1984, p. 62).

A mixture representation of π
The reason for introducing the split chain is that it possesses an accessible atom, X×{1}.Indeed, each time the set X×{1} is entered, the split chain stochastically restarts itself (because the next X n has distribution Q).Let N = {1, 2, 3, . . .} and (X 0 , δ 0 ) ∈ X × {1}, then define the first return time to the atom as Our assumptions about P imply that E(τ ) < ∞ and hence the sequence {p n } ∞ n=1 defined by is nonnegative, nonincreasing, and sums to one.Let T denote a discrete random variable on N with Pr(T = n) = p n ; also let Q n be the conditional distribution of X n given that the split chain does not return to the atom before time n.Thus for any n ∈ N and any measurable ) Then π can be written as the following mixture of the Q n values: Remark 2. The representation at (4) can be obtained from results in Asmussen et al. (1992) by applying their methods to the split chain.Alternatively, Hobert and Robert (2004) obtain the representation when s(x) has the specific form εI C (x) and Hobert et al. (2006) obtain the representation with the more general minorization shown here.
The representation at (4) offers an alternative sampling scheme for π.First, make a random draw from the set {Q 1 , Q 2 , Q 3 , . . .} according to the probabilities p 1 , p 2 , p 3 , . . .and then make an independent random draw from the chosen Q n .
Drawing from Q n is simple even when n ≥ 2. Indeed, just repeatedly simulate n iterations of the split chain, and accept X n the first time that δ 1 = • • • = δ n−1 = 0.The challenging part of this recipe is drawing from the set {Q 1 , Q 2 , Q 3 , . . .}, i.e. simulating a random variable T , since the p n values are not computable.Hobert et al. (2006) and Hobert and Robert (2004) approximate the p n values, which yields approximate draws from T and thus π.In this paper, we obtain exact draws from T that result in exact draws from π.
Remark 3. Let K|T = n be the number of simulations of the split chain before we get a draw from Q n , conditional on T = n from Step 1 above.Then K|T = n is geometric with mean E(K|T = n) = 1/P (τ ≥ n).Unfortunately, Blanchet and Meng (2005) show E(K) = ∞, and justifiably argue (4) should not be used for multiple replications.This presents a major challenge in the applicability of our algorithm and others that can be similarly expressed, some of which are discussed in the next section.

Rejection sampler for T
There is one case where simulating T is simple (Hobert et al., 2006;Hobert and Robert, 2004).Suppose that in the minorization condition (1), s(x) ≡ ε > 0 for all x ∈ X (implying the Markov chain is uniformly ergodic) and consider the procedure for simulating the split chain with this constant s.In particular, note that the coin flip determining whether δ n is 0 or 1 does not depend on x, and it follows that the number of steps until the first return to the accessible atom has a geometric distribution.Indeed, Pr(τ ≥ n) = (1 − ε) n−1 .Hence, E(τ ) = 1/ε and so T also has a geometric distribution.Therefore it is easy to make exact draws from π. Hobert and Robert (2004) show this exact sampling algorithm is equivalent to Murdoch and Green's (1998) Multigamma Coupler and to Wilson's (2000) Read-Once algorithm.It is interesting that (4) can be used to reconstruct perfect sampling algorithms based on coupling from the past despite the fact that its derivation involves no backward simulation arguments.Of course, this exact sampling algorithm will be useless from a practical standpoint if ε is too small.
Unfortunately, in statistical inference problems, the MCMC algorithms are usually driven by Markov chains that are not uniformly ergodic and, hence, cannot satisfy (1) with a constant s.Moreover, there is no efficient method to simulate T where s is non-constant.(When s is non-constant, the distribution of τ is complex and its mass function is not available in closed form.Hence, the mass function of T is also unknown, which precludes direct simulation of T .)Therefore we must resort to indirect methods of simulating T .
Fortunately, simulating τ is trivial-indeed, one can simply run the split chain and count the number of steps until it returns to X × {1}.Because this provides an unlimited supply of i.i.d.copies of τ , we can use a rejection sampling approach (Asmussen et al., 1992;Blanchet and Meng, 2005) where M is a finite, positive constant.Consider a rejection sampler with candidate mass function d(•)/D.Thus which justifies the following rejection sampler.
Rejection sampler for simulating T : Unfortunately, the standard method of simulating B (by computing a Pr(τ ≥ n) and comparing it to an independent Uniform(0, 1) random variable) is not available to us because the mass function of τ is unavailable in closed form.However, we may draw B without knowing the value of Pr(τ ≥ n) using a supply of i.i.d.copies of τ .This is the basis of our exact sampling approach.
Suppose a ∈ (0, 1] and let p = Pr(τ ≥ n), then there exists a simple solution to generate B ∼ Bernoulli(ap), which Fill (1998) calls "engineering a coin flip".Indeed simulate a single τ and define That is, B ∼ Bernoulli(ap), obtained by simulating a single τ and a single Bernoulli V .When a ∈ (1, ∞), we will obtain B ∼ Bernoulli(ap) via the Bernoulli factory described in Section 3.For now assume such a simulation is possible, then what remains to establish is a computable tail probability bound, i.e. the sequence d(n) and the constant M .Blanchet and Meng (2005) bound the moments of τ , however they do not explicitly determine computable values M and d(n).Fortunately, Pr(τ ≥ n) can be bounded above by a known constant times a known geometric mass function if X satisfies a geometric drift and associated one-step minorization conditions.

Tail probability bound
We will say a drift condition holds if there exists some function V : X → [1, ∞), some 0 < λ < 1 and some b < ∞, such that (5) In addition, we require the associated one-step minorization condition as follows; assume that s(x) is bounded below by ε on C and that Hobert and Robert (2004) provide the following bound on Pr(τ ≥ n) based results in Roberts and Tweedie (1999) where Having established the inequality in (7), we next detail how to generate B ∼ Bernoulli(ap) when a > 1.

Bernoulli factory
Given a sequence W = {W n } n≥1 of i.i.d.Bernoulli(p) random variables, where p is unknown, a Bernoulli factory is an algorithm that simulates a random variable B ∼ Bernoulli(f (p)), where f is a known function.For the exact sampling algorithm, we require a Bernoulli factory where f (p) = ap.This idea arrose in Asmussen et al. (1992) when proposing an exact sampling algorithm for general regenerative processes.
Consider f : S → [0, 1], where S ⊂ (0, 1).Keane and O'Brien (1994) show is it possible to simulate a random variable B ∼ Bernoulli(f (p)) for all p ∈ S if and only if f is constant, or f is continuous and satisfies, for some n ≥ 1, While Keane and O'Brien (1994) develop the necessary and sufficient conditions on f , they do not provide a detailed description of an algorithm.Nacu and Peres (2005) suggest a constructive algorithm via Bernstein polynomials for fast simulation, i.e. the number of input Bernoulli(p) variates needed for the algorithm has exponentially bounded tails.However, we find no practical implementation since it requires dealing with sets of exponential size.Our approach is based on the recent work of Latuszynski et al. (2011), which avoids keeping track of large sets by introducing a single auxiliary random variable.
The general approach, in the formulation of Latuszynski et al. (2011), is to construct two random approximations to f (p), denoted U n and L n , which depend on W 1 , W 2 , . . ., W n and satisfy The decision to continue sampling or output a zero or a one in the Bernoulli factory is made using an auxiliary Uniform(0, 1) variable.
Remark 4. The almost sure monotonicity requirement in ( 9) is typically difficult to attain and thus Latuszynski et al. (2011) relax it by using super/submartingales instead.

Modified target function
For the rejection sampling approach to simulating T , we have the ability to simulate W by setting W i = I(τ i ≥ n) for i ≥ 1.We require a single B ∼ Bernoulli(ap), where a = 1/[M d(n)] is a known constant such that a > 0. The outcome B determines if we accept or reject the proposed value.For a ∈ (0, 1] we use the simple solution in Section 2 and for a ∈ (1, ∞) we use the Bernoulli factory.
Unfortunately, the function f (p) = ap on (0, 1/a) does not satisfy (8) and cannot be simulated via the Bernoulli factory.However, when restricted to f (p) = min{ap, 1 − ω} for ω > 0, such a simulation is possible.Nacu and Peres (2005) and Latuszynski et al. (2011) provide a detailed algorithm for a = 2 and 0 < ω < 1/4.Their construction requires a minimum of 65,536 input variables (see Table 1) before the requirement U n ≤ 1 at (9) is met.This is due to the fact that f (p) = min{2p, 1 − ω} is not differentiable and the Bernstein polynomials can approximate general Lipschitz functions at a rate of 1/ √ n (see part (i) of Lemma 6 from Nacu and Peres, 2005).However, when the target function f is twice differentiable, the rate increases to 1/n (see part (ii) of the same Lemma).
This suggests the number of Bernoulli(p) input variates required may decrease significantly by using a twice differentiable f .With this in mind, we propose extending ap smoothly from 0, 1−ω a to [0, 1].Fix δ < ω and consider the following function which is bounded by δ and twice differentiable, with e .(F is related to the Gauss error function (erf), though it can be simply calculated from a standard normal distribution function.) Then define our target function f as In other words, we have extended ap such that f defined at (10 1. Simulate G 0 ∼ Uniform(0, 1).
Moreover the probability that it needs N > n iterations equals a 2 /nδ √ 2e.
Proof.See Appendix A.
Remark 5. Note that the probability Algorithm I needs N > n iterations is independent of the unknown value of p. Hence the number of Bernoulli(p) variates required will also be independent of p. Algorithm I provides a constructive algorithm for a ∈ (1, ∞) and reduces the number of input variables by a factor of over 100, which we demonstrate in the following example.This is a critical improvement since the Bernoulli factory accounts for most of the computational demands of the exact sampling algorithm.

Bernoulli factory example
Consider generating 10,000 Bernoulli(ap) variates for various values of a while setting p = 0.01, ω = 1/5 and δ = 1/6.Table 1 displays the minimum number of Bernoulli(p) variates required, along with the observed mean and standard deviation for the count of Bernoulli(p) variates used to generate 10,000 Bernoulli(ap) variates.We can see for a = 2 the minimum and observed mean have been reduced substantially when comparing our target function with the Nacu and Peres ( 2005) target (N&P).The reduction in observed input Bernoulli(p) variates represents a 120 times reduction in computational time.The N&P implementation always stops the simulation at the minimum, and hence the standard deviation of the count is 0. Table 1 also shows the input variates required increases as a increases.Simulations for other p values (not shown) provide very similar results.
Exact sampling algorithm for π: The algorithm requires selection of ε, C, λ, and κ from a range of possible values (depending on the drift and minorization).Further, β ∈ (1, β * ) must be selected depending on the previously selected parameters.Each selection impacts the algorithm performance and we suggest investigation of different settings for a given example.Our examples in Sections 5 and 6 discuss hyperparameter selection and provide further recommendations.

Toy examples
This section contains two toy examples in the sense that we can obtain i.i.d.samples for each, hence there is no practical reason for considering MCMC algorithms.The purpose is to gain insights into the exact sampler and study its performance.

Metropolis-Hastings example
This section illustrates the exact sampling algorithm for a Metropolis-Hastings sampler.Suppose that X = [0, ∞) and π(dx) = f X (x) dx where f X (x) = e −x I(x ≥ 0).Consider the function V (x) = e cx for some c > 0 and suppose we use a Metropolis sampler with a symmetric proposal density g( e −c(y−x) + e (c−1)(y−x) + 1 − e −(y−x) g(y|x)dy.
Selecting g to be the uniform density g(y|x) When combined, ( 11) and ( 12) obtain a drift condition for the Metropolis-Hastings sampler considered.However, the selection of the constants c and γ is crucial to obtaining a reasonable computation time.The constants β and M , which are described in Section 2.3, also depend on c and γ.Based on our example in Section 3, our strategy is to maximize β, which in turn results in small values for a = 1/[M d(n)κ].Figure 1(a) displays a contour map of β * for c < 0.3 and γ < 10.Based on this plot, we select c = 0.028 and γ = 4, resulting in β = 1.0243 (as seen below).Evaluating the integrals in ( 11) and ( 12) gives the drift condition where λ = 0.977, b = 0.1 and the small set C = [0, 4].The interval [0, γ] is indeed a small set, since, when x ∈ [0, γ], This establishes the necessary minorization condition P (x, dy) ≥ s(x)Q(dy) where The remaining numerical elements required for the geometric probability bound are: .09197 and J = (A − ε)/λ = 0.99283 < 1.Then β = 1/λ = 1.0243.Finally, the Bernoulli factory hyper-parameters are κ = 5/4 and δ = 1/6 resulting in ω = 0.2.Following Mykland et al. (1995), we can now simulate the split chain as follows: (1) draw X n+1 from g(•|X n ) and accept it with probability min 1, f (Xn+1) f (Xn)

;
(2) if the candidate in step (1) is rejected, set δ n = 0, otherwise, generate δ n as a Bernoulli variate with success probability given by .
Using the exact sampling algorithm, we generated 1000 i.i.d.Exp(1) random variates.Figure 1(b) shows a Q-Q plot of the observed draws versus the theoretical quantiles of the Exp(1) distribution.During our simulations none of the proposed T * values which required the use of the Bernoulli factory were accepted.This is due to the fact that our Metropolis-Hastings sampler regenerates very fast, roughly in about 20 moves.Thus for large T * , when the Bernoulli factory is necessary, the probability Pr(τ > T * ) is negligible and the Bernoulli factory outputs a zero.Improvements via modified drift and minorization or hyper-parameter selection may improve this situation.

Gibbs example
Suppose Y i |µ, θ ∼ N(µ, θ) independently for i = 1, . . ., m where m ≥ 3 and assume the standard invariant prior ν(µ, θ) ∝ θ −1/2 .The resulting posterior density is where s 2 is the usual biased sample variance.It is easy to see the full conditional densities, f (µ|θ, y) and f (θ|µ, y), are given by µ|θ, y ∼ N(ȳ, θ/m) and θ|µ, y ∼ IG((m − 1)/2, m s 2 + (ȳ − µ) 2 /2), hence a Gibbs sampler is appropriate.(We say W ∼ IG(α, β) if its density is proportional to w −(α+1) e −β/w I(w > 0).)We consider the Gibbs sampler that updates θ then µ; that is, letting x = (θ , µ ) denote the current state and x = (θ, µ) denote the future state, the transition looks like (θ , µ ) → (θ, µ ) → (θ, µ).This Markov chain then has state space X = R + × R and transition density Appendix B provides a drift and minorization condition using a small set in the form  the best results.Finally, the Bernoulli factory hyper-parameters are κ = 5/4 and δ = 1/6 resulting in ω = 0.2.Table 2 summarizes the resulting constants given the hyper-parameter choices.Notice the Bernoulli factory will be necessary for proposed values greater than or equal to 10, that is with probability 0.067.The constants for values greater than 20 are not listed.However, these values are of interest since the minimum number of observed τ values becomes extremely large.While values in this range are uncommon, P (T * > 20) ≈ 0.002, they occur with enough frequency to slow down the algorithm substantially.
The exact sampling algorithm was used to generate 1000 exact draws from the posterior density at (13).Generating the 1000 draws required approximately 35 hours of computational time, about 2 minutes per draw.A total of 27,665 T * values were proposed of which 69 (0.25%) were greater than 20.Implementing the Bernoulli factory required 1.52e9 τ values, or 1.52e6 τ values per exact draw.However, most of the computational time, and necessary τ values, were used for a small number of proposed T * values.Similar to the Metropolis-Hastings example, none of the accepted T * values were from the Bernoulli factory.
The largest proposed T * value was 37 with a ≈ 3848 requiring 1.07e9 τ values (of which 0 were ≥ 37) to implement the Bernoulli factory.This value alone accounted for about 70% of the total number of τ values, and hence about 70% of the total computational time.It should be noted that the largest accepted T * value was 9, so proposals unlikely to be accepted account for most of the computational demands.Removing only the largest proposal, the remaining 999 draws required approximately 10 hours of computation, or about 35 seconds per draw.
Alternatively for this example, we can sequentially sample from (13) to obtain i.i.d.draws (Flegal et al., 2008).Figure 2 compares the 1000 exact draws to 1000 i.i.d.draws from a sequential sampler using a Q-Q plot for both µ and θ.We can see from these plots that the exact sampling algorithm is indeed working well.

Bayesian random effects model
This section considers a Bayesian version of the one-way random effects model given by  where the random effects φ i are i.i.d.N (µ, σ 2 φ ) and independently the errors ζ ij are i.i.d.N (0, σ 2 e ).Thus (µ, σ 2 φ , σ 2 e ) is the unknown parameter.Bayesian analysis using this model requires specifying a prior distribution, for which we consider the family of inverse gamma priors where α 1 , α 2 , β 1 and β 2 are hyper-parameters.If we let y = {y ij } and φ = {φ i } denote the vectors of observed data and random effects respectively, then the posterior density is as follows where For ease of exposition, we will suppress the dependency on the data y and define the usual summary statistics: ȳi We consider a block Gibbs sampler that updates θ = (σ 2 φ , σ 2 e ) then ξ = (µ, φ), that is (θ , ξ ) → (θ, ξ ) → (θ, ξ).The necessary full conditionals can be obtained via manipulation of (15).That is, f (θ|ξ ) is the product of two inverse gammas such that ) is multivariate normal density whose parameters are given in Tan and Hobert (2009).This Markov chain then has state space . Implementation of the exact sampling algorithm requires a drift and associated minorization condition as at ( 5) and ( 6).Hobert and Geyer (1998), Jones andHobert (2001, 2004) and Tan and Hobert (2009) analyze variations of the proposed block Gibbs sampler, however none obtain sufficient constants for a practical implementation of our algorithm.To this end, the following theorem improves upon the drift constants of Tan and Hobert (2009) for a balanced design while using a simplified version of their drift function.
Then there exists K ≥ 1, δ 1 > 0 and δ 2 > 0 such that λ * < 1 and (5) holds.That is, for any λ ∈ (λ * , 1), The drift condition still holds if we increase the small set to ≤ d} , for which Appendix C provides the associated minorization condition.

Styrene exposure dataset
We will implement the exact sampling algorithm using the styrene exposure dataset from Lyles et al. (1997) analyzed previously by Jones and Hobert (2001) and Tan and Hobert (2009).The data, summarized in Table 3, is from a balanced design such that m i = m = 3 for i = 1, . . ., q, q = 13, and M = mq = 39.
Using these settings β * = 1.000092 and we choose β = 1.000083 resulting in M = 10.19413.We again used Bernoulli factory hyper-parameters of κ = 5/4 and δ = 1/6 resulting in ω = 0.2.In this case, the Bernoulli factory is necessary for proposals greater than 30,706, approximately 8% of proposed values.It is extremely likely the output from the Bernoulli factory will be zero since a sample of 1000 i.i.d.τ values yielded only a maximum of 1278.
The exact sampling algorithm was run until we obtained a 20 i.i.d.draws from the posterior at (15) which took 31,887 proposed T * values and 2.61e8 τ values for the Bernoulli factory.The accepted T * values and i.i.d.θ values are listed in Table 4. Notice the maximum accepted T * was 287, which is well below

Discussion
This paper describes an exact sampling algorithm using a geometrically ergodic Markov chain on a general state space.The algorithm is applicable for any Markov chain where one can establish a drift and associated minorization with computable constants.The limitation of the method is that the simulation time may be prohibitive.Blanchet and Thomas (2007) implement an approximate version using the Bayesian probit regression example from van Dyk and Meng ( 2001) with regeneration settings provided by Roy and Hobert (2007).This example is ill-suited using the proposed algorithm because of computational limitations related to the Bernoulli factory and in obtaining a practical ε.Specifically, we found (in simpler examples) obtaining a single draw from π sometimes required millions of i.i.d.τ variates.Unfortunately, even using non-constant s(x), the probit example requires about 14,000 Markov chain draws per τ (Flegal and Jones, 2010).Hence obtaining a single draw from π would require an obscene number of draws from X. Implementation for more complicated Markov chains, such as this, likely requires further improvements, or a lot of patience.
Careful analysis of the Markov chain sampler is necessary to find useful drift and minorization constants.Most research establishing drift and minorization is undertaken to prove geometric ergodicity, in which case the obtained constants are of secondary importance.However, performance of the exact sampling algorithm is heavily dependent on these constants.Improving them may be enough to obtain exact samples in many settings.
Alternatively, the speed of the overall algorithm would improve if one could find a bound using non-constant s(x) or a sharper bound with ε.The current bound at (7) could potentially be modified upon by only considering specific models, or specific classes of models.
Finally, one could obtain further improvements to the Bernoulli factory since it requires most of the necessary τ variates.Our work has already obtained a 100 times reduction in computational time.However there may be further improvements available for the Bernstein polynomial coefficients, modifications to Algorithm 4 of Latuszynski et al. (2011) or an entirely different method to estimate f .Hyper-parameter settings also impact performance and could be investigated further.
Utah between June 28 and July 2, 2009.We are especially grateful to Jim Hobert who proposed this research problem.We would also like to thank fellow conference participants Adam Guetz, Xia Hua, Wai Liu, Yufei Liu, and Vivek Roy.Finally, we are grateful to the anonymous referees, anonymous associate editor, Galin Jones, Ioannis Kosmidis and Krzysztof Latuszyński for their constructive comments and helpful discussions, which resulted in many improvements.Latuszynski et al. (2011) and Lemma 6 of Nacu and Peres (2005) prove existence of an algorithm that simulates f (p) if (i) f has second derivative f which is continuous and (ii) the coefficients a and b satisfy Condition (i) is clearly satisfied by construction, so it remains to check condition (ii).Since the coefficients a and b are defined through f , inequalities ( 17) and (18) will be checked using the properties of f .Recall b(n, k) = a(n, k) + C/(2n), the inequalities ( 17) and ( 18) above can be re-expressed (Nacu and Peres, 2005) as where X is a hypergeometric random variable, with parameters (2n, k, n).Using the definition of a, and the fact that f is concave, the first inequality is a direct application of Jensen's inequality.The second part is a straight forward application of Lemma 6 from Nacu and Peres (2005) and the properties of the hypergeometric distribution.
Finally, the probability the algorithm needs N > n follows directly from definitions of the coefficients a and b and Theorem 2.5 of Latuszynski et al. (2011).minorization and Rosenthal (1995)-type drift condition for m ≥ 5, and hence prove the associated Markov chain is geometrically ergodic.Using their argument we show a Roberts-and-Tweedie-type drift condition at (5) (Roberts andTweedie, 1999, 2001) holds using the function V 2 .Conditional independence (based on the update order, see Jones and Hobert, 2001) yields and hence then the drift condition at (5) is satisfied, that is It is easy to see from ( 19) that

B.2. Minorization condition
Now we establish the associated minorization condition at (6) using a similar argument to Jones and Hobert (2001).
Notice the minorization condition holds for any d > 0.
For the exact sampling algorithm, we can see from ( 21) and the definitions of C and b that
Simulating the split chain requires evaluation of (3) similar to the calculation in Appendix B, 2n) using f at (10), then we can state Algorithm 4 ofLatuszynski et al. (2011) with our modification.Algorithm I.
, and call the result n.Set a = 1/ [M d(n)κ].2. If a ≤ 1, draw a single τ random variable.Let W = I(τ ≥ n) and independently draw V ∼ Bernoulli (a).Set B = V W . 3. If a > 1,use the Bernoulli factory to obtain B, a single Bernoulli random variable with success probability p = a Pr(τ ≥ n). 4. If B = 1, accept T * = n; if B = 0, return to Step 1. 5. Make a draw from Q n (•).

Fig 2 .
Fig 2. Q-Q plots comparing 1000 exact draws to 1000 i.i.d.draws from a sequential sampler.

.
The points σ * φ and σ * e are the intersection points of the two inverse gamma densities determined byσ * = b 1 − b 2 a(log b 1 − log b 2 )where a, b 1 and b 2 are the parameters of the two inverse gamma distributions.

Table 1
Comparison of count of input Bernoulli(p) variates to implement the Bernoulli factory.

Table 2
Summary constants for Gibbs sampler example with β = 1.35.The row min refers to the minimum number of observed τ s to implement the Bernoulli factory and p(n) = P (T * = n).

Table 3
Styrene exposure data summary statistics.

Table 4
List of 20 i.i.d.θ draws from the posterior at (15) with the accepted T * values.ourobserved maximum of 1278 from 1000 i.i.d.τ values.Hence, drawing from Q n was easy and almost all of the simulation time was used for the Bernoulli factory.Obtaining the 20 i.i.d.draws required 24 days of computational time utilizing six processors in parallel (equating to 144 days on a single processor).