Stein's Method and Descents after Riffle Shuffles

Berestycki and Durrett used techniques from random graph theory to prove that the distance to the identity after iterating the random transposition shuffle undergoes a transition from Poisson to normal behavior. This paper establishes an analogous result for distance after iterates of riffle shuffles or iterates of riffle shuffles and cuts. The analysis uses different tools: Stein's method and generating functions. A useful technique which emerges is that of making a problem more tractable by adding extra symmetry, then using Stein's method to exploit the symmetry in the modified problem, and from this deducing information about the original problem.


Introduction
Given a group G and a generating set T, one defines the distance of an element g as the smallest number m so that g can be written as a product t_1 · · · t_m of elements of T. A natural question is to fix m and study the distribution of the distance of an element g which is the product of m generators chosen at random from T. Clearly the distance of g is at most m, and it is of interest to know when m is a reliable approximation for the distance.
One motivation for this question comes from the study of genome rearrangement (see the survey [Du] for details). Here G is the group of 2^n n! signed permutations on n symbols, and the generating set consists of the reversals ρ(i, j). A reversal ρ(i, j) applied to a signed permutation reverses the order of the symbols in positions i through j and flips their signs. For instance the reversal ρ(2, 4) would send 4 1 −5 −2 3 to 4 2 5 −1 3. Hannenhalli and Pevzner [HaPe] found an exact combinatorial formula for the distance of a signed permutation, and Bourque and Pevzner [BoPe] performed simulations suggesting the hypothesis that after r random reversals, r is a good approximation to the distance provided that r/n < .4. While this remains unproved, the paper [BeDu] obtained definitive results for a simpler model, where G is S_n (the symmetric group consisting of all n! unsigned permutations) and the generating set consists of all n(n − 1)/2 transpositions. Then the distance of π is simply n minus the number of cycles of π. Letting D_r be the distance after r iterations of the random transposition walk, they showed that D_{cn/2} ∼ u(c)n, where u is an explicit function satisfying u(c) = c/2 for c ≤ 1 and u(c) < c/2 for c > 1. They also described the fluctuations of D_{cn/2} about its mean in each of three regimes (subcritical, where the fluctuations are Poisson; critical; and supercritical, where the fluctuations are normal). They exploit a connection between the transposition walk and random graphs (about which an enormous amount is known).
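Since a transposition either merges two cycles or splits one, each step of the walk changes n minus the number of cycles by at most one, so D_r ≤ r always. This is easy to experiment with; a short Python sketch (our own illustration, not from [BeDu]):

```python
import random

def cycle_count(perm):
    """Number of cycles of a permutation given in one-line form (0-indexed)."""
    n, seen, cycles = len(perm), [False] * len(perm), 0
    for i in range(n):
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = perm[j]
    return cycles

def random_transposition_walk(n, r, rng=random.Random(0)):
    """Apply r uniformly random transpositions to the identity in S_n
    and return the distance D_r = n - (number of cycles)."""
    perm = list(range(n))
    for _ in range(r):
        i, j = rng.sample(range(n), 2)  # two distinct positions
        perm[i], perm[j] = perm[j], perm[i]
    return n - cycle_count(perm)
```

For r much smaller than n/2, almost every transposition touches two previously untouched symbols, so D_r is close to r, in line with u(c) = c/2 for c ≤ 1.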
Another case considered in the literature is where G = S_n and T consists of the set of n − 1 adjacent transpositions [EES]. Then the distance of a permutation π is the number of pairs (i, j) such that 1 ≤ i < j ≤ n and π(i) > π(j). They give upper and lower bounds for the expected distance after r iterations, and show that it can be much less than r.
In this paper we study the case G = S_n, where the generating set T consists of the identity and all permutations whose inverse has exactly one descent.
Here we use the terminology that a permutation π has a descent at position i (where 1 ≤ i ≤ n − 1) if π(i) > π(i + 1). Note that unlike the examples of the previous paragraphs, this generating set is not symmetric (i.e. the inverse of a generator need not be a generator). It is technically convenient to assign probability (n + 1)/2^n to the identity and probability 1/2^n to each of the other generators, so that, as described in the next paragraph, we obtain the Gilbert-Shannon-Reeds model of riffle shuffling. Riffle shuffles are mathematically ubiquitous (see [Di] for an overview of connections to dynamical systems, Lie theory and much else), and the gambler's question of looking at a deck of cards and estimating how many shuffles have occurred is in perfect analogy to the biologist's question of looking at a signed permutation and estimating how many reversals have occurred.
Riffle shuffling proceeds as follows. Given a deck of n cards, one cuts it into 2 packets, with the probability of respective pile sizes j, n − j given by binom(n, j)/2^n. Then cards are dropped from the packets with probability proportional to packet size at the given time; thus if the current packet sizes are A_1, A_2, the next card is dropped from packet i with probability A_i/(A_1 + A_2). Bayer and Diaconis [BayDi] prove the fundamental result that after r riffle shuffles, the probability of obtaining the permutation π^{−1} is binom(2^r + n − 1 − d(π), n)/2^{rn}. Here d(π) denotes the number of descents of π, that is |{i : 1 ≤ i ≤ n − 1, π(i) > π(i + 1)}|. For instance the permutation 3 1 4 2 5 has two descents. From the Bayer-Diaconis formula it is clear that the distance of a permutation π^{−1} is simply ⌈log_2(d(π) + 1)⌉. Thus the study of distance for riffle shuffles is the study of the distribution of d(π) under this measure. Since a permutation has at most n − 1 descents, the formula for distance also shows that the diameter of S_n with the generating set T is ⌈log_2(n)⌉, much smaller than the diameter in the biological examples, which is polynomial in n.
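The GSR dynamics just described are straightforward to simulate. The sketch below (function names are ours) also illustrates the connection to the generating set: one shuffle of a sorted deck always yields a permutation whose inverse has at most one descent, since the shuffled deck is an interleaving of two increasing blocks of values.

```python
import random

def gsr_shuffle(deck, rng=random.Random(1)):
    """One Gilbert-Shannon-Reeds riffle shuffle: cut binomially, then drop
    cards with probability proportional to current packet sizes."""
    n = len(deck)
    j = sum(rng.random() < 0.5 for _ in range(n))  # Binomial(n, 1/2) cut
    left, right = deck[:j], deck[j:]
    out = []
    while left or right:
        if rng.random() < len(left) / (len(left) + len(right)):
            out.append(left.pop(0))
        else:
            out.append(right.pop(0))
    return out

def descents(perm):
    return sum(perm[i] > perm[i + 1] for i in range(len(perm) - 1))

def inverse(perm):
    """Inverse of a 0-indexed permutation in one-line form."""
    inv = [0] * len(perm)
    for pos, val in enumerate(perm):
        inv[val] = pos
    return inv
```

The deck after one shuffle consists of the positions of the values 0, ..., j − 1 in increasing order interleaved with those of j, ..., n − 1 in increasing order, so its inverse is a concatenation of two increasing runs.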
More generally, for k and n integers, we let R_{k,n} denote the measure on S_n which chooses π with probability binom(k + n − 1 − d(π), n)/k^n, and study the number of descents. First let us review what is known. As k → ∞, the distribution R_{k,n} tends to the uniform distribution on S_n. It is well known ([CKSS], [T]) that for n ≥ 2 the number of descents has mean (n − 1)/2 and variance (n + 1)/12, and that (d(π) − (n − 1)/2)/√((n + 1)/12) is asymptotically normal. Aldous [A] proved that (3/2) log_2(n) riffle shuffles are necessary and sufficient to be close to the uniform distribution on S_n. Bayer and Diaconis [BayDi] give more refined asymptotics, proving that for k = 2^c n^{3/2} with c a real number, the total variation distance to uniformity converges to an explicit function of c. Motivated by this result, Mann [Ma] proved that if k = an^{3/2}, with a fixed, then the number of descents has mean (n − 1)/2 + O(n^{1/2}) and variance n/12 + O(n^{1/2}), and is asymptotically normal as n → ∞. He deduces this from Tanny's local limit theorem for d(π) under the uniform distribution [T] and from the formula for R_{k,n}.
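The classical mean (n − 1)/2 and variance (n + 1)/12 of the number of descents under the uniform distribution can be confirmed by exhaustive enumeration for small n; a quick sketch in exact arithmetic:

```python
from itertools import permutations
from fractions import Fraction

def descents(pi):
    return sum(pi[i] > pi[i + 1] for i in range(len(pi) - 1))

def uniform_descent_moments(n):
    """Exact mean and variance of d(pi) under the uniform distribution on S_n."""
    counts = [descents(pi) for pi in permutations(range(n))]
    total = Fraction(len(counts))
    mean = sum(counts) / total
    var = sum(Fraction(c * c) for c in counts) / total - mean**2
    return mean, var
```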
We prove two new results concerning the distribution of d(π) under the measure R_{k,n}. First, we complement the above results on normal approximation by using Stein's method to upper bound the total variation distance between the distribution of k − 1 − d(π) and a Poisson variable with mean k/(n + 1); our bound shows the approximation to be good when k/n is small. Second, in the regime k = αn we use generating functions to give very precise asymptotic estimates for the mean and variance of d(π): the errors in our estimates are at most C_α/n and A_α/n respectively, where C_α, A_α are constants depending on α (and are universal constants for α ≥ 1).
The regime k = αn is natural since it corresponds to having the number of iterations of the same order as the diameter of the generating set; this is also where the interesting phenomena of [BeDu] occurred. Mann [Ma] had exact expressions for the mean and variance (which we derive another way) but only obtained the asymptotics described earlier, when k = an^{3/2} with a fixed. The main technical point and effort of the paper [SGO] on information loss in card shuffling was to obtain asymptotics for the mean and variance of d(π) under the measure R_{k,n}. They used the method of indicator variables to study the case k ≥ n2^{ω(n)} with ω(n) → ∞. This yields no information about the regime of interest to us (k = αn with α fixed). Moreover the bounds stated in the previous paragraph are typically sharper than their bounds. To see this, note that one can assume k = O(n^{3/2}), since as mentioned earlier, for larger k the distribution R_{k,n} is essentially the uniform distribution. Assuming that k > n, our error term for the mean is O(1/n) and our error term for the variance is O(1), whereas their formula for the variance has the asymptotic error O(n^3/k) + O(n). Numerical evidence (requested by the referee) also suggests that our approximations for the mean and variance are quite good; for instance, for an n = 52 card deck, one obtains a table by plugging small values of k into the asymptotic estimates.

Next let us describe the technique we use to study the distribution of d(π) under R_{k,n}, as we believe this to be as interesting as the result itself. To apply Stein's method to study a statistic W, one often uses an exchangeable pair (W, W′) of random variables (this means that the distribution of (W, W′) is the same as that of (W′, W)) such that the conditional expectation E(W′|W) is approximately (1 − λ)W for some λ. Typically to construct such a pair one would use a Markov chain on S_n which is reversible with respect to the measure R_{k,n}, choose π from R_{k,n}, let π′ be obtained from
π by one step in the chain, and finally set (W, W′) = (W(π), W(π′)). For the problem at hand this does not seem easy. Thus we modify the problem; instead of considering the measure R_{k,n}, we consider a measure C_{k,n} which chooses a permutation π with probability binom(k + n − 1 − c(π), n − 1)/(n k^{n−1}), where c(π) is the number of cyclic descents of π (defined in Section 2). This probability measure was introduced in [Fu1], and C_{2^r,n}(π) gives the chance of obtaining π^{−1} after first cutting the deck at a uniformly chosen random position and then performing r iterations of a riffle shuffle. The advantage of working with C_{k,n} is that it has a natural symmetry which leaves it invariant, since performing two consecutive cuts at random positions is the same as performing a single cut. As will be explained in Section 2, this symmetry leads to an exchangeable pair (d, d′) with a very convenient formula for E_C(d′ − d | π). We obtain a Poisson approximation theorem for d under the measure C_{k,n}. Although the measures R_{k,n} and C_{k,n} are not close when k/n is small (a main result of [Fu3] is that the total variation distance between them is bounded away from 0 in this regime), the distribution of d + 1 under R_{k,n−1} coincides with the distribution of c under C_{k,n}. This implies a Poisson approximation theorem for the original problem of interest.

Incidentally, it is proved in [Fu1] that r iterations of "cut and then riffle shuffle" yields exactly the same distribution as performing a single cut and then iterating r riffle shuffles. Thus the chance of π^{−1} after r iterations of "cut and then riffle shuffle" is C_{2^r,n}(π), which implies that the distance of π after the "cut and then riffle shuffle" process is ⌈log_2(c(π))⌉. Thus the study of distance for the "cut and then riffle shuffle" procedure is equivalent to the study of c under the distribution C_{k,n}. But as mentioned in the last paragraph, we will prove that this is the same as the distribution of d + 1 under R_{k,n−1}. Hence the theory of distance for "cut and then riffle shuffle" is equivalent to the theory for riffle shuffles, and we shall say nothing more about it.
The reader may wonder why we don't apply our exchangeable pair for normal approximation. Most theorems for Stein's method for normal approximation assume that the pair (W, W′) satisfies the property E(W′|W) = (1 − λ)W for some λ. In our case this only approximately holds; that is, E(W′|W) = (1 − λ)W + G(W), where G(W) is small. There are normal approximation results in the literature ([RR], [Ch]) for dealing with this situation, but they require that E|G(W)|/λ goes to 0. Using interesting properties of Eulerian numbers, we show that even for the uniform distribution (the k → ∞ limit of C_{k,n}) the quantity E|G(W)|/λ is bounded away from 0. Finding a version of Stein's method which allows normal approximation for our exchangeable pair (even for the uniform distribution) is an important open problem. Incidentally, for the special case of the uniform distribution, it is possible to prove a central limit theorem for d(π) by Stein's method [Fu2], using a different exchangeable pair.
Having described the main motivations and ideas of the paper, we describe its organization. Section 2 defines an exchangeable pair to be used in the study of d(π) under the measure C_{k,n}, and develops a number of its properties. It also gives closed formulas (but not asymptotics) for the mean and variance of d(π), by relating them to the mean and variance of c(π), and computing the latter using generating functions. Section 3 uses the exchangeable pair of Section 2 to prove a Poisson approximation theorem for k − d(π) under the measure C_{k,n} (the bounds are valid for all integer values of k and n but are informative only when k/n is small). It then shows how to deduce from this a Poisson approximation theorem for k − 1 − d(π) under the measure R_{k,n}. Section 4 gives asymptotics for the mean and variance of c(π) under C_{k,n} for k/n > 1/(2π) (and so also for d(π) under C_{k,n} and R_{k,n}). It then explores further properties of the exchangeable pair which are related to normal approximation. Finally, it gives a quick algorithm for sampling from R_{k,n}, which should be useful in empirically investigating the nature of the transition from Poisson to normal behavior.

The exchangeable pair, mean, and variance
This section constructs an exchangeable pair (d, d′) for the measure C_{k,n} and develops some of its properties. Throughout we let E_C denote expectation with respect to C_{k,n}. We relate E_C(d) and E_C(d^2) to E_C(c) and E_C(c^2), and then use generating functions to find expressions (whose asymptotics will be studied later) for E_C(c) and E_C(c^2).
To begin let us construct an exchangeable pair (d, d′). We represent permutations π in two line form. Thus the permutation represented by

i : 1 2 3 4 5 6 7
π(i) : 6 4 1 5 3 2 7

sends 1 to 6, 2 to 4, and so on. One constructs a permutation π′ by choosing uniformly at random one of the n cyclic shifts of the symbols in the bottom row of the two line form of π. For instance with probability 1/7 one obtains the permutation π′ which is represented by

i : 1 2 3 4 5 6 7
π′(i) : 5 3 2 7 6 4 1.
An essential point is that if π is chosen from the measure C_{k,n}, then so is π′; note that this would not be so for the measure R_{k,n}. Thus if one chooses π from C_{k,n}, defines π′ as above, and sets (d, d′) = (d(π), d(π′)), the pair (d, d′) is exchangeable. Recall that π is said to have a cyclic descent at position j if either 1 ≤ j ≤ n − 1 and π(j) > π(j + 1), or j = n and π(n) > π(1). It is helpful to define random variables χ_j(π) (1 ≤ j ≤ n) where χ_j(π) = 1 if π has a cyclic descent at position j and χ_j(π) = 0 if π does not have a cyclic descent at position j. We let I denote the indicator function of an event. We also use the standard notation that if Y is a random variable, E(Y|A) is the conditional expectation of Y given A.
Proof. Note that d′ = d + 1 occurs only if π has a cyclic descent at n, and that then it occurs with probability (n − 1 − d)/n. Note also that d′ = d − 1 occurs only if π does not have a cyclic descent at n, and that it then occurs with probability d/n. To summarize,

E_C(d′ − d | π) = ((n − 1 − d)/n) I_{χ_n(π)=1} − (d/n) I_{χ_n(π)=0}.

As a corollary, we obtain Corollary 2.2.
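Both facts used in this proof can be checked by enumeration: the number of cyclic descents is invariant under cyclic shifts of the bottom row, and averaging d over the n shifts of π gives E_C(d′ − d | π) = ((n − 1 − d)/n) I_{χ_n(π)=1} − (d/n) I_{χ_n(π)=0}, which simplifies to ((n − 1)/n)χ_n(π) − d(π)/n. A sketch of this check (our own, over all of S_4 and S_5):

```python
from itertools import permutations
from fractions import Fraction

def descents(w):
    return sum(w[i] > w[i + 1] for i in range(len(w) - 1))

def cyclic_descents(w):
    n = len(w)
    return sum(w[i] > w[(i + 1) % n] for i in range(n))

def shifts(w):
    """All n cyclic shifts of the bottom row of the two-line form."""
    n = len(w)
    return [w[s:] + w[:s] for s in range(n)]

def check(n):
    for pi in permutations(range(1, n + 1)):
        c, d = cyclic_descents(pi), descents(pi)
        chi_n = 1 if pi[-1] > pi[0] else 0
        shifted = shifts(pi)
        # The number of cyclic descents is invariant under every cyclic shift.
        assert all(cyclic_descents(s) == c for s in shifted)
        # E(d' - d | pi) matches the formula from the proof of Lemma 2.1.
        avg = Fraction(sum(descents(s) for s in shifted) - n * d, n)
        assert avg == Fraction((n - 1) * chi_n, n) - Fraction(d, n)
```

Invariance of c also explains why d′ ∈ {d − 1, d, d + 1}: writing d(σ) = c(σ) − χ_n(σ), a shift can only change the indicator χ_n.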
Lemma 2.3 will be helpful at several points in this paper.
Proof. Observe the identity below. As a consequence, we obtain Corollary 2.4. Proof.
where the final equality is Lemma 2.3. This is equivalent to the statement of the corollary.
Next we use generating functions to compute E_C(c) and E_C(c^2). For this some lemmas are useful.
Lemma 2.5. ([Fu1]) For n > 1, the number of elements of S_n with i cyclic descents is equal to n multiplied by the number of elements of S_{n−1} with i − 1 descents.
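Lemma 2.5 is easy to confirm by brute force for small n; a sketch:

```python
from itertools import permutations
from collections import Counter

def descents(w):
    return sum(w[i] > w[i + 1] for i in range(len(w) - 1))

def cyclic_descents(w):
    n = len(w)
    return sum(w[i] > w[(i + 1) % n] for i in range(n))

def check_lemma(n):
    """Compare cyclic descent counts in S_n with descent counts in S_{n-1}."""
    cyc = Counter(cyclic_descents(p) for p in permutations(range(n)))
    des = Counter(descents(p) for p in permutations(range(n - 1)))
    for i in range(1, n):
        assert cyc[i] == n * des[i - 1]
```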
Proof. Given Lemma 2.5, the result now follows from the well known generating function for descents (e.g. [FoS]). Proposition 2.7 gives a closed formula for E_C(c).
Proposition 2.7. For n > 1,

Proof. Multiplying the equation of Lemma 2.6 by (1 − t)^n and then differentiating with respect to t, one obtains an equation in which, after multiplying both sides by t(1 − t)^n/(n k^{n−1}), the coefficient of t^k on the left hand side is precisely the expected value of c under the measure C_{k,n}. The proposition now follows by computing the coefficient of t^k on the right hand side.
By a similar argument, one obtains an exact expression for E_C(c^2).
Proposition 2.8. For n > 1,

Proof. From the proof of Proposition 2.7, we know the relevant identity. Differentiate with respect to t, multiply both sides by t(1 − t)^n/(n k^{n−1}), and take the coefficient of t^k. On the left hand side one gets E_C(c^2). On the right hand side one obtains the corresponding coefficient, and after elementary simplifications the result follows.

Poisson regime
A main result of this section is a Stein's method proof that for k much smaller than n, the random variable X(π) := k − d(π) under the measure C_{k,n} is approximately Poisson with mean λ := k/n. Then we show how this can be used to deduce Poisson limits for k − c(π) under the measure C_{k,n} and for k − 1 − d(π) under the measure R_{k,n}.
To begin we recall Stein's method for Poisson approximation. A book length treatment of Stein's method for Poisson approximation is [BarHJ], but that book emphasizes the coupling approach. We prefer to work from first principles along the lines of Stein's original formulation as presented in [St]. A very recent survey of this approach is the paper [ChDiMe].
Throughout we use the exchangeable pair (X, X′), where π and π′ are as in Section 2, X(π) = k − d(π), X′ = X(π′), and the underlying probability measure is C_{k,n}. Let P_λ denote probability under the Poisson distribution of mean λ, and as usual let P_C denote probability with respect to the measure C_{k,n}. Let A be a subset of Z_+, the set of non-negative integers. Stein's method is based on a "Stein equation" whose terms we now specify, in the special case of interest to us.
(1) The function g = g_{λ,A} : Z_+ → R is constructed to solve the equation

λg(j + 1) − jg(j) = I_{j∈A} − P_λ(A), j ≥ 0,

where g(0) is taken to be 0. We also need the following lemma, which bounds certain quantities related to g.

(2) The map T_λ sends real valued functions on Z_+ to real valued functions on Z_+ and is defined by T_λ g(j) = λg(j + 1) − jg(j).

(5) Finally (and this is where one has to make a careful choice), the map α is a map from real valued functions on Z_+ to antisymmetric real valued functions on S_n × S_n. In the proof of Theorem 3.3 we will specify which α we use.

In order to approximate X by a Poisson(λ) random variable, it will be useful to approximate the mean of the random variable c(π)/n by λ. This is accomplished in the next lemma, the second part of which is not needed in the sequel.

Proof. For the first assertion, note the formula of Proposition 2.7. The second assertion follows since the formula for C_{k,n} forces c ≤ k with probability 1.
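Behind the construction of g lies the standard characterization that X is Poisson(λ) precisely when E[λf(X + 1) − Xf(X)] = 0 for all bounded f. The sketch below (our own illustration) checks this identity numerically and builds g_{λ,A} from its standard closed form (the forward recursion, while exact in principle, is numerically unstable in floating point):

```python
import math

def poisson_pmf(lam, j):
    return math.exp(-lam) * lam**j / math.factorial(j)

def stein_expectation(lam, f, cutoff=80):
    """E[lam*f(X+1) - X*f(X)] for X ~ Poisson(lam), truncated at cutoff terms."""
    return sum(poisson_pmf(lam, j) * (lam * f(j + 1) - j * f(j))
               for j in range(cutoff))

def stein_solution(lam, A, max_j=12):
    """Solution g of lam*g(j+1) - j*g(j) = I_{j in A} - P_lam(A) with g(0) = 0,
    via the closed form g(j+1) = j! e^lam lam^{-(j+1)} *
    (P_lam(A intersect {0..j}) - P_lam(A) P_lam({0..j}))."""
    p_A = sum(poisson_pmf(lam, j) for j in A)
    g, cum, cum_A = [0.0], 0.0, 0.0
    for j in range(max_j):
        cum += poisson_pmf(lam, j)
        if j in A:
            cum_A += poisson_pmf(lam, j)
        g.append(math.factorial(j) * math.exp(lam) / lam ** (j + 1)
                 * (cum_A - p_A * cum))
    return g
```

The characterizing identity holds because E[Xf(X)] = λE[f(X + 1)] for a Poisson variable, by shifting the summation index in the expectation.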
Theorem 3.3. Let λ = k/n where k, n are positive integers. Then for any A ⊆ Z_+,

Proof. Since A and λ are fixed, throughout the proof the function g_{λ,A} is denoted by g. We specify the map α to be used in the Stein equation: given a real valued function f on Z_+, we define αf as below; note that, as required, this is an antisymmetric function on S_n × S_n. Then one computes that Tαg is a function on S_n to which the reasoning of Lemma 2.1 applies. Since iT_λ g(π) = λg(X + 1) − Xg(X) and I_{χ_n(π)=0} = 1 − I_{χ_n(π)=1}, one concludes that the discrepancy is the expectation of a sum of three terms in square brackets. Thus to complete the proof, for each of the three terms in square brackets, we bound the expectation under the measure C_{k,n}. Lemma 3.1 and part 1 of Lemma 3.2 handle the first term. For the second term in square brackets, one argues as for the first term to get an upper bound of nk(1 − 1/k)^n. To bound the expectation of the third term in square brackets, note the nonnegativity of c(π)I_{χ_n(π)=1} and apply Lemma 3.1. By the explicit formula for C_{k,n}, it follows that c(π) ≤ k with probability 1.
Hence this is at most (k/n)^2.
To conclude this section, we show how Theorem 3.3 can be used to deduce Poisson approximations for the two statistics we really care about: k − c(π) under the measure C_{k,n} and k − 1 − d(π) under the measure R_{k,n}.

Proof. Observe the identity for each l ≥ 0, which implies a bound for each l. Summing over l ≥ 0 gives the claim, where the final inequality is Proposition 2.7 (or also follows since c ≤ k with probability 1). The result follows.
Corollary 3.5. Let λ = k/n where k, n are positive integers. Then for any A ⊆ Z_+,

Proof. This is immediate from Theorem 3.3 and Proposition 3.4.
Proposition 3.6 shows that the distribution of d(π) + 1 under the measure R_{k,n} is exactly the same as the distribution of c(π) under the measure C_{k,n+1}.

Proposition 3.6. For any r ≥ 0, the probability that d = r under R_{k,n} equals the probability that c = r + 1 under C_{k,n+1}.

Proof. Note from the formula for R_{k,n} that for any r, the probability of r descents is binom(k + n − 1 − r, n)/k^n multiplied by the number of permutations in S_n with r descents. Similarly, from the formula for C_{k,n+1} one sees that for any r, the probability of r + 1 cyclic descents under this measure is binom(k + n − 1 − r, n)/((n + 1)k^n) multiplied by the number of permutations in S_{n+1} with r + 1 cyclic descents. The result follows from Lemma 2.5.
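Assuming the binomial formulas for the two measures used in this proof, namely R_{k,n}(π) = binom(k + n − 1 − d(π), n)/k^n and C_{k,n}(π) = binom(k + n − 1 − c(π), n − 1)/(n k^{n−1}) (both formulas are our reading of the text; both normalize to 1 by Worpitzky's identity), Proposition 3.6 can be spot-checked by enumeration:

```python
from itertools import permutations
from fractions import Fraction
from math import comb

def descents(w):
    return sum(w[i] > w[i + 1] for i in range(len(w) - 1))

def cyclic_descents(w):
    return sum(w[i] > w[(i + 1) % len(w)] for i in range(len(w)))

def dist_d_R(k, n):
    """Distribution of d under the assumed formula R_{k,n}(pi) = C(k+n-1-d, n)/k^n."""
    dist = {}
    for pi in permutations(range(n)):
        d = descents(pi)
        dist[d] = dist.get(d, Fraction(0)) + Fraction(comb(k + n - 1 - d, n), k**n)
    return dist

def dist_c_C(k, n):
    """Distribution of c under the assumed formula
    C_{k,n}(pi) = C(k+n-1-c, n-1)/(n k^{n-1})."""
    dist = {}
    for pi in permutations(range(n)):
        c = cyclic_descents(pi)
        dist[c] = dist.get(c, Fraction(0)) + Fraction(comb(k + n - 1 - c, n - 1),
                                                      n * k ** (n - 1))
    return dist
```

Note that the binomial coefficient vanishes when d(π) > k − 1 (respectively c(π) > k), which is the support restriction used repeatedly in the text.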
Corollary 3.7. Let λ = k/(n + 1) where k, n are positive integers. Then for any A ⊆ Z_+,

Proof. This is immediate from Corollary 3.5 and Proposition 3.6.

Other regimes
This section is organized into three subsections. Subsection 4.1 gives good asymptotic results for the mean and variance of c(π) under C_{k,n} (and so also for d(π) under R_{k,n}). Subsection 4.2 develops further properties of the exchangeable pair (d, d′) under C_{k,n} which are relevant to normal approximation. Subsection 4.3 gives a speedy algorithm for sampling from the measures C_{k,n} and R_{k,n}.
4.1. Asymptotics of mean and variance. This subsection derives sharp estimates for the mean and variance of c(π) under the measure C_{k,n} when k/n ≥ 1/(2π). Since by Proposition 3.6 the distribution of d(π) under R_{k,n} is the same as the distribution of c(π) − 1 under C_{k,n+1}, one immediately obtains results (which we stated in the introduction) for the mean and variance of d(π) under R_{k,n}. We also remark that Corollaries 2.2 and 2.4 imply results for d(π) under C_{k,n}.
Throughout we will use information about the Bernoulli numbers B_n. They are defined by the generating function f(z) = Σ_{t≥0} B_t z^t/t! = z/(e^z − 1). The zero at 0 in the denominator of f(z) cancels with the zero of z, so f(z) is analytic for |z| < 2π but has first order poles at z = ±2πi, ±4πi, · · · . We also use the notation that (n)_t denotes n(n − 1) · · · (n − t + 1) and that (n)_0 = 1.
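Multiplying f(z) by e^z − 1 and equating coefficients gives the recurrence Σ_{j=0}^{t} binom(t + 1, j) B_j = 0 for t ≥ 1, from which the Bernoulli numbers can be generated in exact arithmetic; a sketch (which also spot-checks, for small t, the bound |B_{2t}| ≤ 8√(πt)(t/(πe))^{2t} quoted later in this section):

```python
from fractions import Fraction
from math import comb, pi, e, sqrt

def bernoulli(t_max):
    """B_0, ..., B_{t_max} from sum_{j<=t} C(t+1, j) B_j = 0 for t >= 1,
    which follows from f(z)(e^z - 1) = z."""
    B = [Fraction(1)]
    for t in range(1, t_max + 1):
        B.append(-sum(comb(t + 1, j) * B[j] for j in range(t)) / (t + 1))
    return B
```

This convention gives B_1 = −1/2, and B_t = 0 for odd t ≥ 3, consistent with the poles of f at ±2πi, ±4πi, · · · .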
To see the connection with Bernoulli numbers, Lemma 4.1 shows how to write E_C(c) in terms of them.
Proof. This follows from Proposition 2.7 and the expansion of partial power sums in [GR]. Lemmas 4.2 and 4.3 give two elementary estimates.
Lemma 4.2. For 0 ≤ t ≤ n,

Proof. For t = 0, 1 the result is clear, so suppose that t ≥ 2. One first establishes the upper bound, and then the lower bound; the last inequality needed for the lower bound is on page 103 of [HLP].
where C_{l,α} is a constant depending on l and α (and if α ≥ 1 the constant depends only on l).
Proof. Recall that B_t vanishes for t ≥ 3 odd and that there is a bound |B_{2t}| ≤ 8√(πt)(t/(πe))^{2t} for t ≥ 1 [Le]. Combining this with Stirling's bound t! ≥ √(2πt)(t/e)^t e^{1/(12t+1)} [Fe], one obtains a bound with a universal constant C. The ratio test shows that Σ_{t=0}^∞ (1 + t)^l/(2πα)^t converges for 2πα > 1 (and moreover is at most a constant depending on l if α ≥ 1).
We also require a result about Bernoulli numbers.
Proof. We use the generating function f(z) = Σ_{t≥0} B_t z^t/t! = z/(e^z − 1) for the Bernoulli numbers, which as mentioned earlier is analytic for |z| < 2π. For the first assertion simply set z = 1/α. For the second assertion, one computes (z^2/2)(d/dz)(d/dz)f(z) and evaluates it at z = 1/α. For the third equation one differentiates f(z) with respect to z and then sets z = 1/α. For the fourth equation one differentiates f(z) three times with respect to z, then multiplies by z^2/2 and sets z = 1/α. Next we give an estimate for E_C(c).
Proof. By Lemma 4.1 and Lemma 4.2, one obtains the main term together with an error term. From Lemma 4.3, the error term is at most C_α/n, where C_α is a constant depending on α (and which is independent of α for α ≥ 1). The result now follows from parts 1 and 2 of Lemma 4.4. Proposition 4.6 estimates the variance of c(π).
Proof. From Proposition 2.8 and the expansion of partial power sums in [GR], Lemma 4.2 bounds the absolute value of the difference between E_C(c^2) and its main term. Next it is necessary to bound the five summands in the error term. Lemma 4.3 shows that the first four summands in the error term are at most a constant depending on α (or a universal constant if α ≥ 1). Since B_t vanishes for t ≥ 3 odd, |B_{2t}| ≤ 8√(πt)(t/(πe))^{2t} [Le], and 2πα > 1, the fifth summand in the error term goes to 0, and in particular is bounded by a universal constant.
The result now follows by combining the above observations with Lemma 4.4 and Proposition 4.5.

The sampling algorithm builds a random permutation one symbol at a time: given the current permutation π in S_n, the symbol n + 1 is inserted after a position j (0 ≤ j ≤ n) according to the following probabilities. (1) If either j = n, or 1 ≤ j ≤ n − 1 and π(j) > π(j + 1), the chance of inserting n + 1 after j is (k + n − d(π))/(k(n + 1)). (2) If either j = 0, or π(j) < π(j + 1) and 1 ≤ j ≤ n − 1, the chance of inserting n + 1 after j is (k − d(π) − 1)/(k(n + 1)). Then after running the algorithm for n − 1 steps, the distribution obtained on S_n is precisely R_{k,n}.
Proof. First note that the transition probabilities of the algorithm sum to 1, since (d(π) + 1)(k + n − d(π)) + (n − d(π))(k − d(π) − 1) = k(n + 1).
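The insertion probabilities are simple enough to implement directly. In the sketch below, the case-(1) probability (k + n − d(π))/(k(n + 1)) is our reconstruction, forced by requiring that for each π the transition probabilities sum to 1, and the target formula R_{k,n}(π) = binom(k + n − 1 − d(π), n)/k^n is likewise our reading of the text; the check compares the algorithm's exact output distribution with that formula for small k and n.

```python
import random
from fractions import Fraction
from math import comb

def descents(w):
    return sum(w[i] > w[i + 1] for i in range(len(w) - 1))

def insertion_probs(pi, k):
    """Chance of inserting the symbol n+1 after position j = 0, ..., n of pi."""
    n, d = len(pi), descents(pi)
    probs = []
    for j in range(n + 1):
        if j == n or (1 <= j <= n - 1 and pi[j - 1] > pi[j]):
            probs.append(Fraction(k + n - d, k * (n + 1)))   # case (1): reconstructed
        else:
            probs.append(Fraction(k - d - 1, k * (n + 1)))   # case (2): as stated
    return probs

def sample_R(k, n, rng=random.Random(0)):
    """Grow a sample from R_{k,n}, inserting symbols 2, ..., n one at a time."""
    pi = [1]
    for m in range(1, n):
        w = [float(p) for p in insertion_probs(pi, k)]
        j = rng.choices(range(m + 1), weights=w)[0]
        pi = pi[:j] + [m + 1] + pi[j:]
    return pi

def exact_dist(k, n):
    """Exact distribution of the algorithm's output on S_n, by dynamic programming."""
    dist = {(1,): Fraction(1)}
    for m in range(1, n):
        new = {}
        for pi, p in dist.items():
            for j, q in enumerate(insertion_probs(list(pi), k)):
                tau = pi[:j] + (m + 1,) + pi[j:]
                new[tau] = new.get(tau, Fraction(0)) + p * q
        dist = new
    return dist
```

Each step requires only O(n) work, which is what makes the algorithm speedy; note also that case (2) has probability 0 once d(π) = k − 1, so the descent number never exceeds k − 1, matching the support of R_{k,n}.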
(3) The map i sends real valued functions on Z_+ to real valued functions on S_n, the symmetric group. It is defined by (if)[π] = f(X(π)). (4) The map T is a map from the set of real valued antisymmetric functions on S_n × S_n to the set of real valued functions on S_n. It is defined by Tf[π] = E_C(f(π, π′)|π). (Since the pair (π, π′) is exchangeable, E_C(Tf) = 0, which is crucial for the proof of the Stein equation.)

4.2. Further properties of the exchangeable pair.

This subsection develops further properties of the exchangeable pair (d, d′) from Section 2. Since we are interested in central limit theorems, it is natural to instead study (W, W′) where W = (d − E_C(d))/√(Var_C(d)) and W′ = (d′ − E_C(d))/√(Var_C(d)). Note that from Lemma 2.1 one knows E_C(W′ − W|π). In what follows we study E_C(W′ − W|W), which is typically used in normal approximation by Stein's method.

Proposition 4.7.

E_C(W′ − W | d = r) = (1/√(Var_C(d))) ((P_C(c = r + 1)/P_C(d = r)) · (r + 1)(n − 1)/n^2 − r/n).

Proof. Since d is a function of π,

E_C(W′ − W | d = r) = Σ_a a P_C(W′ − W = a, d = r)/P_C(d = r)
= Σ_a Σ_{π : d(π)=r} a P_C(W′ − W = a, π)/P_C(d = r)
= Σ_{π : d(π)=r} (P_C(π)/P_C(d = r)) Σ_a a P_C(W′ − W = a, π)/P_C(π)
= Σ_{π : d(π)=r} (P_C(π)/P_C(d = r)) E_C(W′ − W | π).