Zero Bias Enhanced Stein Couplings

The Stein couplings of Chen and Roellin (2010) vastly expanded the range of applications to which coupling constructions in Stein's method for normal approximation could be applied, and subsumed both Stein's classical exchangeable pair and the size bias coupling. A further simple generalization includes zero bias couplings, and also allows for situations where the coupling is not exact. The zero bias versions result in bounds for which the often tedious computation of the variance of a conditional expectation is not required. An application to the lightbulb process shows that even though the method may be simple to apply, it can yield improvements over previous results that had achieved bounds with optimal rates and small, explicit constants.

The method proceeds by the use of a characterizing equation for some target distribution which is to serve as an approximation to that of a random variable W of interest.
In the case of the normal, W has the N(µ, σ²) distribution if and only if

E[(W − µ)f(W)] = σ² E[f′(W)] for all f ∈ F, (1.1)

where here, and in like displays that follow in this section, we implicitly take F to be the class of functions for which the quantities written exist, which in particular will always include the collection of infinitely differentiable functions with compact support.
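As a quick numerical sanity check of this characterization (an illustration only, not part of the formal development; the test function f(w) = sin(w) and the parameter values are arbitrary choices), one may verify the identity by quadrature against the normal density:

```python
import numpy as np

# Check E[(W - mu) f(W)] = sigma^2 E[f'(W)] for W ~ N(mu, sigma^2)
# with f(w) = sin(w), by trapezoidal quadrature on a fine grid.
mu, sigma = 1.5, 2.0
w = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 400001)
density = np.exp(-((w - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# np.trapz was renamed np.trapezoid in NumPy 2.0; support both
trapz = getattr(np, "trapezoid", None) or getattr(np, "trapz")
lhs = trapz((w - mu) * np.sin(w) * density, w)   # E[(W - mu) f(W)]
rhs = sigma**2 * trapz(np.cos(w) * density, w)   # sigma^2 E[f'(W)]
assert abs(lhs - rhs) < 1e-5
```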
Next, given any collection of functions H, we define the pseudo-metric

d_H(X, Y) = sup_{h∈H} |Eh(Y) − Eh(X)|, (1.2)

which, for example, produces the Wasserstein metric d(·, ·) and the Kolmogorov metric d_K(·, ·) by taking H to be the class Lip₁ = {h : |h(y) − h(x)| ≤ |y − x|} of 1-Lipschitz functions, and the class of indicators {1(· ≤ z) : z ∈ R}, respectively. Replacing h by the solution f_h of the Stein equation

f′(w) − wf(w) = h(w) − Eh(Z), (1.3)

where Z denotes a standard normal variable, and taking the supremum over h ∈ H, yields

d_H(W, Z) = sup_{h∈H} |E[f′_h(W) − Wf_h(W)]|. (1.4)

To put this expression to real use requires bounds on the solution f_h, such as on the magnitude of its derivatives. Though the form (1.4) may seem at first glance to be more complicated and more difficult to handle than (1.2), the former involves the distribution of only a single random variable. It is perhaps for this reason that one of the best known ways to handle the right hand side of (1.4) is through coupling constructions. An advance in understanding coupling methods was achieved in [6] by the introduction of the family of Stein couplings, of which many previously known couplings, such as the exchangeable pair and size bias couplings, are special cases.
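The Kolmogorov metric of (1.2) can be computed explicitly when one of the two laws is empirical; the following sketch (illustrative only, with an arbitrary sample built from standardized coin flips) uses the fact that for a step function the supremum is attained at the jump points:

```python
import math
import random

def normal_cdf(x):
    # Phi(x) via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def kolmogorov_distance(sample):
    # d_K(F_n, Phi) = sup_x |F_n(x) - Phi(x)|; for the empirical CDF F_n
    # it suffices to check the value just before and at each jump.
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        phi = normal_cdf(x)
        d = max(d, abs((i + 1) / n - phi), abs(i / n - phi))
    return d

random.seed(0)
# A standardized sum of 400 fair coin flips is close to standard normal
n_flips = 400
sample = [
    (sum(random.randint(0, 1) for _ in range(n_flips)) - n_flips / 2)
    / math.sqrt(n_flips / 4)
    for _ in range(2000)
]
dk = kolmogorov_distance(sample)
assert dk < 0.1
```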
We say the triple of random variables (W′′, W, G) is a Stein coupling when

E[Gf(W′′) − Gf(W)] = E[(W − µ)f(W)] for all f ∈ F. (1.5)

Example 1.1. If (W′′, W) is an exchangeable pair with µ = 0 satisfying the linearity condition E[W′′ | W] = (1 − λ)W for some λ ∈ (0, 1), then taking G = (W′′ − W)/(2λ) we obtain Stein's classical exchangeable pair, which satisfies (1.5). Likewise, a size bias coupling can be seen as a Stein coupling by taking the triple to be (W′′, W, G) = (W^s, W, µ) when the size biased variable W^s is defined on the same space as W.
An inventive coupling distinct from the ones just described is Coupling 2A in [6]; altogether these demonstrate the large range of possibilities covered.
One key step in computing bounds in the Kolmogorov or Wasserstein metric to the normal using Stein couplings typically requires bounding a variance of a conditional expectation; for the exchangeable pair and size bias couplings described above, one needs to respectively bound

Var(E[(W′′ − W)² | W]) and Var(E[W^s − W | W]); (1.9)

see, for instance, the terms r₁, r₂ and r₃ in Theorem 2.1 of [6] for the Kolmogorov bound, or Theorem 3.20 of [29] for the Wasserstein. However, these terms are often difficult, tedious or even intractable to compute. Bounds using size bias in [18] for the lightbulb problem, described in detail below, required a delicate and highly detailed analysis involving the eigenvalues of a certain Markov chain of multinomial type. On the other hand, the zero bias coupling [16] (see also [5]) can yield Wasserstein and Kolmogorov bounds without requiring cumbersome computation. We recall that for a mean µ random variable W with variance σ² ∈ (0, ∞), we say W* has the W-zero bias distribution on a space with some measure Q when

E[(W − µ)f(W)] = σ² E_Q[f′(W*)] for all f ∈ F, (1.10)

where we have subtracted the mean as in [7], and consequently have W* =_d (W − µ)* + µ. In the canonical case where µ = 0 and σ² = 1, we obtain the bound (see [13], [14])

d(W, Z) ≤ 2d(W*, W)

in the Wasserstein distance d between W and the standard normal Z. The right hand side above can be upper bounded by 2E|W* − W| for any coupling of (W*, W), in contrast to (1.9). Kolmogorov bounds to the normal can be obtained without computation from any almost sure bound on |W* − W|, see [5]. Theorem 2.3 below provides bounds that generalize results previously obtained in both these metrics in the framework of zero bias enhanced Stein couplings, or zbest for short, which includes both Stein and zero bias couplings as special cases.
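For a concrete instance of (1.10): for a centered Bernoulli(p) variable taking the values −p and 1 − p, the zero bias distribution is known to be uniform on (−p, 1 − p). The following exact check (illustrative; the choices p = 1/3 and f(w) = w³ are arbitrary) verifies the identity with rational arithmetic:

```python
from fractions import Fraction

# For W in {-p, 1-p} with P(W = 1-p) = p, verify that W* ~ Uniform(-p, 1-p)
# satisfies E[W f(W)] = sigma^2 E[f'(W*)] for f(w) = w^3.
p = Fraction(1, 3)
q = 1 - p
sigma2 = p * q  # variance of Bernoulli(p)

# E[W f(W)] = E[W^4] over the two-point distribution
lhs = p * q**4 + q * p**4

# E[f'(W*)] = E[3 (W*)^2]: the integral of 3u^2 over [-p, 1-p],
# an interval of length one
e_fprime = q**3 + p**3
rhs = sigma2 * e_fprime

assert lhs == rhs
```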
For a random variable W with mean µ and variance σ² ∈ (0, ∞) on a probability space with measure P (implicit in the expectation E), we say the triple (W′′, W′, G) of random variables is a zbest coupling with respect to a measure Q when

E_Q[Gf(W′′) − Gf(W′)] = E[(W − µ)f(W)] for all f ∈ F. (1.11)

Clearly, zbest couplings include Stein couplings as a special case, though in contrast to (1.5), W′ is not constrained to be equal to, nor even to have the same distribution as, the random variable W. Enlarging the set of possibilities in this way leads to some additional useful flexibility, including generalizations of zero bias that likewise avoid the computation of quantities such as (1.9).
Naturally, nothing being for free, the work that would otherwise be required for the computation of (1.9) must be paid for in another manner. In order to explain what is needed we give more detail on how a useful 'second order' zbest coupling construction proceeds. Lemma 2.1 in Section 2 shows how we may begin with some initial zbest coupling, say, for instance, a standard exchangeable pair, and by changing the measure to some judicious P † , be naturally led to a new zbest coupling that may possess favorable properties. However, this new coupling satisfies (1.11) with Q = P † , and for Theorem 2.3 to apply, it must be constructed on the original space. That is, once the distribution of this new, advantageous triple is specified under P † , we need to construct a triple (W ‡ , W † , G † ) on the original space so that L P (W ‡ , W † , G † ) = L P † (W ′′ , W ′ , G) where L denotes distribution, or law, on a space with underlying measure as indicated; the understanding of the procedure and the notation may be furthered upon reviewing Example 2.4, where we treat an independent sum of Bernoulli variables.
In general, the final bound in Theorem 2.3 depends on a number of choices. Indeed, this is so even when restricting to the use of Stein couplings, while for zbest one also has some influence when making the change of measure. Overall though, natural choices do seem to arise for given applications.
We apply the methods developed here for the lightbulb problem first analyzed in [28] and further in [18]. This instance demonstrates that the approach considered here may lead to easier computation of Kolmogorov and Wasserstein bounds which may also be superior to those produced by methods that require computation of quantities such as those in (1.9).
In Section 2 we show how to use one zbest coupling to produce another by a change of measure, obtain bounds in the Wasserstein and Kolmogorov metrics, and give an introductory example. The lightbulb problem application can be found in Section 3.

Zbest couplings
Taking our variables as in (1.11), let

W* = UW′′ + (1 − U)W′ and D = W′′ − W′, (2.1)

where U is a standard uniform variable independent of W′′, W′ and G. For a smooth function f for which the following expectations exist, and arbitrarily defining any 0/0 expressions, integrating over the uniform variable for the last equality we obtain

E[f′(W*) | W′′, W′, G] = ∫₀¹ f′(uW′′ + (1 − u)W′) du = (f(W′′) − f(W′))/D.

Multiplying by DG, recalling (1.11) and taking Q expectation on both sides above results in the identity

E_Q[DGf′(W*)] = E_Q[G(f(W′′) − f(W′))] = E[(W − µ)f(W)], (2.2)

and taking f(w) = w in (2.2) yields

σ² = E_Q[DG], (2.3)

showing in particular that Q(DG > 0) > 0. When DG = σ² Q-a.s. then (2.2) recovers (1.10), and hence W* has the W-zero bias distribution. The important message of Lemma 2.1 below is that when D and G in a given zbest coupling satisfy DG ≥ 0 Q-a.s., then by biasing via the Radon–Nikodym factor DG/E_Q[DG] to obtain the measure P† one produces a new zbest coupling with DG = σ² P†-a.s., and therefore one that yields a variable with the W-zero bias distribution. Further, if the non-negativity condition is not satisfied, one may nevertheless be able to construct a variable with a distribution close to that of the W-zero bias, and achieve a bound to the normal that has an additional term; see Theorem 2.3.
We illustrate the construction of the zero bias distribution as described in Lemma 2.1 in a few simple examples. This construction was previously understood only in the special case of the Stein exchangeable pair (W′′, W′) as described in Example 1.1, where it was shown that (1.5) holds with µ = 0 and G = (W′′ − W′)/(2λ). In particular, DG = (W′′ − W′)²/(2λ) is non-negative, and non-trivial by (2.3), hence one may construct a new measure P† by

dP† = ((W′′ − W′)²/(2λσ²)) dP.

Lemma 2.1 shows that this construction produces a new zbest coupling that yields a variable with the zero bias distribution via the interpolation (2.1). A different, and rather simple, special case is produced by taking a random variable W with mean µ = 0, noting that (W, 0, W) trivially satisfies (1.11) with Q = P and that DG = W² is non-negative. In this case, as is known, and follows from Lemma 2.1, by taking dP† = (w²/σ²) dP, via (2.1) one finds that

W* = U W̄ (2.4)

has the W-zero bias distribution, where W̄, with the square biased law L(W̄) = L_{P†}(W), is independent of U.
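The square bias construction (2.4) can be checked exactly for any mean zero variable with finite support; the sketch below (an illustration under that assumption, with an arbitrary three-point distribution) verifies E[Wf(W)] = σ²E[f′(W*)] for f(w) = w³:

```python
from fractions import Fraction

# Square bias W via dP†/dP = w^2/sigma^2, then W* = U * Wbar with
# Wbar ~ P† independent of U ~ Uniform(0,1), as in (2.4).
support = [Fraction(-1), Fraction(0), Fraction(2)]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

mean = sum(x * p for x, p in zip(support, probs))
assert mean == 0  # mean zero is required for the trivial coupling (W, 0, W)
sigma2 = sum(x * x * p for x, p in zip(support, probs))

# Square-biased pmf: pbar_k = x_k^2 p_k / sigma^2
pbar = [x * x * p / sigma2 for x, p in zip(support, probs)]
assert sum(pbar) == 1

# LHS: E[W f(W)] = E[W^4] for f(w) = w^3
lhs = sum(x**4 * p for x, p in zip(support, probs))

# RHS: sigma^2 E[3 (U Wbar)^2] = sigma^2 * 3 * E[U^2] * E[Wbar^2],
# with E[U^2] = 1/3 for U ~ Uniform(0,1)
rhs = sigma2 * 3 * Fraction(1, 3) * sum(x * x * pb for x, pb in zip(support, pbar))

assert lhs == rhs
```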
Lemma 2.1. Let (W′′, W′, G) be a zbest coupling with respect to a measure Q, and let R be a non-negative random variable satisfying

E_Q[R] = 1 and Q(R = 0, DG ≠ 0) = 0. (2.5)

Then (W′′, W′, G/R) is a zbest coupling with respect to the measure

dP† = R dQ.

If in addition

DG ≥ 0 Q-a.s., (2.6)

then one may choose R = DG/σ², and the resulting W* of (2.1) has the W-zero bias distribution under P†.
Proof: As P†(R = 0) = 0, the ratio G/R is well defined under P†, and changing measure we have

E_{P†}[(G/R)(f(W′′) − f(W′))] = E_Q[R(G/R)(f(W′′) − f(W′))] = E_Q[G(f(W′′) − f(W′))] = E[(W − µ)f(W)],

where we have used (1.11) for the final equality, and for the penultimate, via (2.5) and the fact that f(W′′) − f(W′) = 0 on {D = 0}, that R(G/R)(f(W′′) − f(W′)) = G(f(W′′) − f(W′)) Q-a.s. For the final claim, by (2.6) the variable R = DG/σ² is non-negative, integrates to 1 under Q by (2.3), and satisfies (2.5), as {R = 0} = {DG = 0}. The first claim gives that (W′′, W′, σ²/D) is a zbest coupling under P†, and the last claim now follows from (2.2) with Q = P†, as there DG = σ² P†-a.s. ✷

Remark 2.2. When (W′′, W, G) = (W^s, W, µ) arises from a monotone size bias coupling, so that D = W^s − W ≥ 0, Lemma 2.1 applies with R = µ(W^s − W)/σ², and the change of measure

dP† = (µ(W^s − W)/σ²) dP (2.7)

yields W* with the W-zero bias distribution as in (1.10).
For normal approximation, the following theorem yields bounds that imply existing results in [13] and [5] for zero bias couplings in both the Wasserstein and Kolmogorov metrics, as in that case the term E|1 − GD| vanishes; the additional generality is gained by not requiring that W* have the W-zero bias distribution exactly. The Wasserstein bound is given in (2.8), and the Kolmogorov bound (2.9) holds when |W* − W| ≤ δ for some δ ≥ 0, where the constant multiplying δ in the bound is less than 2.03.
where we have applied the mean value theorem to obtain the first term, and the bounds |f′′_h| ≤ 2 and |f′_h| ≤ √(2/π) from (2.13) of Lemma 2.4 in [5]. Taking the supremum on the left hand side over h ∈ Lip₁ yields the result.
For the Kolmogorov bound, given an arbitrary z ∈ R, by Lemma 2.2 in [5] the unique bounded solution f_z of the Stein equation (1.3) for h = 1(· ≤ z) may be written in terms of Φ, the cumulative distribution function of the standard normal; we set f′_z(z) = lim_{w↑z} f′_z(w) so that the first equality in (1.3) holds for all w ∈ R. Taking expectation, applying (2.2), and using that |f′_z(w)| ≤ 1 for all real w, z via (2.8) of Lemma 2.3 of [5] in the final inequality, bounds the first term. For the second term, we apply the bound E|W| ≤ √(EW²) = 1. Substituting into (2.11) produces a lower bound on P(W ≤ z) − Φ(z). Proving an analogous upper bound by the same reasoning, and then taking the supremum over z ∈ R, yields the claim (2.9).

Example 2.4. Let Y be the sum of independent Bernoulli variables X₁, …, X_n. To size bias Y, one selects a random index I and, when X_I = 1, leaves the configuration unchanged; in the second case one sets the I-th indicator X_I equal to 1 and constructs the remaining variables according to their conditional distribution given that updated value.
As the Bernoullis here are independent, and independent of I, for i ∈ [n] we have

P(X = e, I = i) = P(X = e)P(I = i). (2.14)

We can construct a vector of indicators X′′ on the same space as X and I according to the recipe suggested by (2.13), whose coordinate sum will be the Y-size biased variable Y′′, by specifying that

P(X′′ = e′′, X = e, I = i) = P(X = e)P(I = i)1(e′′_j = e_j, j ≠ i, e′′_i = 1), (2.15)

as, again by independence, the distribution of {X_j, j ≠ i} is unaffected by conditioning on X_i. Hence, the coupling amounts to simply replacing X_I by 1 no matter its previous value when forming X′′, and when doing so we obtain

Y′′ − Y = 1 − X_I. (2.16)

As Y′′ ≥ Y the coupling is monotone, and following Remark 2.2 we form the P† distribution in (2.7) by changing the P measure in (2.15) according to (2.16). Hence, sampling (X, I) by (2.14) and setting X‡_j = X†_j = X_j, j ≠ I, X†_I = 0, X‡_I = 1 yields a coupling of (X‡, X†) and X such that L_P(X‡, X†) = L_{P†}(X′′, X). Recalling that U is independent of all other variables and taking expectation, the inequalities in (2.12) now follow by noting that the final terms of the bounds of Theorem 2.3 vanish in the zero bias case, and that (aY)* = aY* for all a ≠ 0 (see [5]).
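The size bias recipe just described can be verified exactly by enumeration; in the sketch below (illustrative, with arbitrary success probabilities) the index I is chosen with P(I = i) proportional to p_i, the standard recipe, which reduces to the uniform choice when the probabilities are equal:

```python
from fractions import Fraction
from itertools import product

# Check the size bias identity E[Y f(Y)] = mu E[f(Y'')] for an independent
# Bernoulli sum, where Y'' is obtained by setting the I-th indicator to 1,
# enumerating all 2^n outcomes; f(y) = y^2 is an arbitrary test function.
p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)]
n = len(p)
mu = sum(p)

f = lambda y: y * y

lhs = Fraction(0)   # E[Y f(Y)]
rhs = Fraction(0)   # E[f(Y'')] under the coupling
for e in product([0, 1], repeat=n):
    prob_e = Fraction(1)
    for ei, pi in zip(e, p):
        prob_e *= pi if ei == 1 else 1 - pi
    y = sum(e)
    lhs += prob_e * y * f(y)
    for i in range(n):
        y_sb = y - e[i] + 1          # replace X_I by 1, as in (2.16)
        rhs += prob_e * (p[i] / mu) * f(y_sb)

assert lhs == mu * rhs
```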

The Lightbulb Process
The lightbulb problem was first considered in [28] as a model for the behavior of skin receptors subject to a medication released by dermal patches, and subsequently studied in [18]. In the model, in a sequence of n stages, n skin receptors, which here we will imagine as lightbulbs, are toggled from one state to the other upon absorbing a pharmaceutical. Initially all n lightbulbs are turned off. Then, at stage r ∈ [n], a set of r lightbulbs, selected uniformly from all subsets of [n] of size r and independently of previous stages, has its state toggled. The random variable Y of interest in the pharmaceutical application counts the number of lightbulbs that are turned on after stage n is complete. The restriction that the number of stages and bulbs are equal is not essential, and more general frameworks are discussed in [28] and [18], such as where in stage r the number of toggled lightbulbs is some given s_r ∈ [n]; these variations can be analyzed as below.
More formally, consider X ∈ {0, 1}^{n×n}, a matrix of Bernoulli variables, called here a configuration, whose components we refer to as toggle variables. The initial state of the bulbs is given deterministically with all bulbs off. For stages r ∈ [n] the components of X have the interpretation that X_ri = 1 if the status of bulb i is changed at stage r, and X_ri = 0 otherwise; thus X is uniform over the set E of configurations whose r-th row sums to r, that is,

P(X = e) = ∏_{r∈[n]} C(n, r)⁻¹ for e ∈ E, (3.1)

with C(n, r) denoting the binomial coefficient. The toggle variables at stage r that form the vector (X_r1, …, X_rn) are clearly exchangeable, and the marginal distributions of the components X_ri are Bernoulli with success probability r/n. For bulbs i ∈ [n], the variables

X_i = (∑_{r∈[n]} X_ri) mod 2 and Y = ∑_{i∈[n]} X_i

are the indicator that bulb i is on at the terminal time n, and the total number of bulbs on at that time, respectively. In what follows we let, say, Y′ and Y′′ be computed from configurations X′ and X′′ as Y is from X.
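The dynamics just described are straightforward to simulate; the sketch below (illustrative only; the function names are our own) draws the process and checks two facts consistent with the discussion in this section, namely that Y = 1 deterministically when n = 2 and that E[Y] = n/2:

```python
import random

# Simulate the lightbulb process: at stage r a uniformly chosen set of
# r bulbs has its state toggled; Y counts bulbs on after stage n.
def lightbulb_Y(n, rng):
    state = [0] * n
    for r in range(1, n + 1):
        for i in rng.sample(range(n), r):
            state[i] ^= 1  # toggle bulb i
    return sum(state)

rng = random.Random(1)
# For n = 2: stage 1 toggles one bulb (which stage 2 toggles back off),
# stage 2 toggles both, so exactly one bulb ends on: Y = 1 = n/2 always.
assert all(lightbulb_Y(2, rng) == 1 for _ in range(100))

# Monte Carlo check that E[Y] = n/2 (here n = 4, seeded)
draws = [lightbulb_Y(4, rng) for _ in range(20000)]
mean_Y = sum(draws) / len(draws)
assert abs(mean_Y - 2.0) < 0.05
```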
In [18] a normal approximation to Y was achieved by constructing Y ′′ on the same space as Y with the Y -size bias distribution, and using that construction [12] also obtains a concentration bound. After reviewing that construction we apply Lemma 2.1 to create another zbest coupling in a way similar to the process followed in Example 2.4.
Given a configuration X ∈ E, for a stage r ∈ [n] and indices {i, j} ⊂ [n], let X^{r,i↔j} be the configuration with components

(X^{r,i↔j})_{sk} = X_{rj} if (s, k) = (r, i), X_{ri} if (s, k) = (r, j), and X_{sk} otherwise; (3.3)

that is, the new configuration is the same as X, but with the toggle variables X_ri and X_rj interchanged.
We take n ≥ 4 and even for simplicity; Remark 3.3 describes how the odd case may be handled. The size biased coupling (Y′′, Y) was constructed in [18] as follows. First, sample X according to (3.1). Given X, sample I uniformly from [n], and given X and I, let J be an index chosen uniformly at random over the set of n/2 indices {j ∈ [n] : X_{n/2,j} ≠ X_{n/2,I}}. Finally, let X′′ = X if X_I = 1 and X^{n/2,I↔J} otherwise. Formally, with e_i = (∑_r e_ri) mod 2 and e, e′′ ∈ E, the joint distribution P(X′′ = e′′, X = e, I = i, J = j) is specified in (3.4). In [18], via a verification of (2.13), it was shown that Y′′, the number of lightbulbs that are on in the terminal stage in configuration X′′, has the Y-size bias distribution and satisfies

Y′′ − Y = 2 · 1{X_I = 0, X_J = 0} = 2 · 1{Y′′ ≠ Y}. (3.5)

Referring the reader to [18] for a proof of the size bias property, we show (3.5). Indeed, if X_I = 1 then Y′′ = Y and the three quantities above are all zero. If X_I = 0 and X_J = 1 then all quantities are again zero, as interchanging the toggles of bulbs I and J in stage n/2 will result in a configuration X′′ in which X′′_I = 1, X′′_J = 0, thus leaving Y′′ with the same value as Y. In the remaining case X_I = 0, X_J = 0, and all quantities above equal 2, as the exchange toggles the final state of both bulbs I and J, turning both from off to on and increasing Y by 2.
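Identity (3.5) holds on every sample path, which makes it easy to confirm by simulation; the following sketch (illustrative; our own function names, with n = 6) performs the stage-n/2 interchange and checks the identity on each draw:

```python
import random

# Pathwise check of (3.5) for the lightbulb size bias interchange:
# when bulb I is off, swap the stage-n/2 toggles of I and J, where J is
# uniform over indices whose stage-n/2 toggle differs from that of I.
def sample_config(n, rng):
    # Row r of X has exactly r ones, placed uniformly, as in (3.1)
    X = [[0] * n for _ in range(n)]
    for r in range(1, n + 1):
        for i in rng.sample(range(n), r):
            X[r - 1][i] = 1
    return X

def bulb_states(X):
    n = len(X)
    return [sum(X[r][i] for r in range(n)) % 2 for i in range(n)]

n = 6
rng = random.Random(7)
for _ in range(500):
    X = sample_config(n, rng)
    states = bulb_states(X)
    mid = n // 2 - 1  # 0-based row index of stage n/2
    I = rng.randrange(n)
    # stage n/2 has exactly n/2 ones, so this set always has n/2 elements
    opposite = [j for j in range(n) if X[mid][j] != X[mid][I]]
    assert len(opposite) == n // 2
    J = rng.choice(opposite)
    X2 = [row[:] for row in X]
    if states[I] == 0:
        X2[mid][I], X2[mid][J] = X2[mid][J], X2[mid][I]
    Y, Y2 = sum(states), sum(bulb_states(X2))
    # Identity (3.5): Y'' - Y = 2 * 1{X_I = 0, X_J = 0}
    assert Y2 - Y == 2 * (states[I] == 0 and states[J] == 0)
```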
Identity (3.5) demonstrates that this size bias coupling is monotone, and thus Remark 2.2 is in force. To continue, one needs to construct the biased version of the original pair according to P† in (2.7), which, making use of the first equality in (3.5), is seen to be given by

P†(X′′ = e′′, X′ = e′, I = i, J = j) ∝ 1(e′_{n/2,i} ≠ e′_{n/2,j}, e′_i = 0, e′_j = 0, e′′ = (e′)^{n/2,i↔j}), (3.6)

where in the last equality we have applied (3.3) and the definition of η in (3.4). Integrating out the distribution of X′′, we find the marginal of X′ under P† satisfies

P†(X′ = f, I = i, J = j) ∝ 1(f_{n/2,i} ≠ f_{n/2,j}, f_i = 0, f_j = 0). (3.7)

Lemma 3.1. There exists a coupling of X and (X‡, X†), having distributions specified by (3.4) and (3.6), respectively, that satisfies

|Y† − Y| ≤ 2 and Y‡ − Y† = 2. (3.8)

Proof: One easily verifies that P(X† ∈ E) = 1 and that each φ_ab is an involution mapping between {e ∈ E : e_I = a, e_J = b} and {e ∈ E : e_I = 0, e_J = 0}. Moreover, setting X† = φ_{X_I,X_J}(X), for any f ∈ E,

P(X† = f, I = i, J = j) = ∑_{a,b∈{0,1}} P(X = φ_ab(f), X_i = a, X_j = b, I = i, J = j).

Note that on the event {I = i, J = j} we have X_{n/2,i} ≠ X_{n/2,j}, hence the probability in the a, b-th summand is zero unless (φ_ab(f))_{n/2,i} ≠ (φ_ab(f))_{n/2,j}, which is equivalent to the condition that f_{n/2,i} ≠ f_{n/2,j}. Similarly, for this probability to be non-zero we must have (φ_ab(f))_i = a, (φ_ab(f))_j = b, which implies that f_i = 0, f_j = 0. For f ∈ E such that f_{n/2,i} ≠ f_{n/2,j} and f_i = 0, f_j = 0, the restriction in the summand that X_i = a, X_j = b is redundant, and as φ_ab(f) ∈ E and the probability P(X = e, I = i, J = j) is constant over its support by (3.4), we obtain

P(X† = f, I = i, J = j) ∝ 1(f_{n/2,i} ≠ f_{n/2,j}, f_i = 0, f_j = 0),

in agreement with (3.7). That is, the constructed X† has the desired distribution, and setting X‡ = φ_11(X†) it is immediate that the pair (X‡, X†) has joint distribution (3.6).
Lastly, as each mapping φ ab transposes at most two toggles, the final states of X and X † can differ for at most two lightbulbs, thus verifying the first claim of (3.8). The second claim follows directly from (3.6).

✷
The mean µ and variance σ² of Y are given by µ = n/2 and by the expressions in (3), (4) and (5) of [18], where the first equality follows from a simple symmetry argument and the variance formula from Section 2.3 of [28]; the quantity λ_n there is denoted λ_{n,2,n}. Note that when n is even, as the terms in the product for s and n − s are equal,

λ_n = 1 − 4(n/2)²/(n(n − 1)),

and hence σ² ≤ n/4. Applying the reasoning that proves (1.6) of [8] demonstrates that the order 1/σ in the Kolmogorov bound in [18], and in (3.9), is unimprovable.
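The values µ = n/2 and σ² ≤ n/4 can be confirmed by exact enumeration for small n; the sketch below (illustrative, n = 4, with our own function names) enumerates all configurations with rational arithmetic:

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb

# Exact mean and variance of Y for the lightbulb process with n = 4,
# enumerating all prod_r C(n, r) = 96 equally structured configurations.
n = 4

def stage_rows(r):
    # all stage-r toggle vectors with exactly r ones
    return [tuple(1 if i in ones else 0 for i in range(n))
            for ones in combinations(range(n), r)]

mean = Fraction(0)
second = Fraction(0)
for config in product(*(stage_rows(r) for r in range(1, n + 1))):
    prob = Fraction(1)
    for r in range(1, n + 1):
        prob /= comb(n, r)  # rows are independent and uniform, as in (3.1)
    # Y = number of bulbs with an odd number of toggles
    y = sum(sum(row[i] for row in config) % 2 for i in range(n))
    mean += prob * y
    second += prob * y * y

var = second - mean * mean
assert mean == Fraction(n, 2)
assert var <= Fraction(n, 4)
```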
Hence the bound produced here is superior for all values over which the former was valid.

Remark 3.3.
The odd case, say n = 2m + 1, can be handled using randomization as in [18]. In particular, one constructs a configuration of toggle variables giving rise to an intermediate variable V, close to Y and having favorable symmetry properties for size biasing, by randomly adding, or removing, a toggle variable in the two middle stages m and m + 1, respectively. To size bias V, with I the random index as in the even case, when V_I = 0 a randomization is applied to the toggle variable of bulb I in one of the two middle stages.
If that does not succeed in changing the state (the event F = 0), then an interchange of the toggle variables V I and V J , such as achieved in (3.3), is performed in the chosen middle stage. This gives rise to an additional term 1(V I = 0, F = 1) for the difference (3.5) (see (46) of [18]), that here produces our change of measure. That term, and one accounting for the difference between V and Y , produce only a small additional term in the final bound; note the difference between the even and odd case results in Theorem 1 of [18].