Exponential and Laplace approximation for occupation statistics of branching random walk

We study occupancy counts for the critical nearest-neighbor branching random walk on the d-dimensional lattice, conditioned on non-extinction. For d > 3, Lalley and Zheng [4] showed that the properly scaled joint distribution of the number of sites occupied by j generation-n particles, j = 1, 2, . . ., converges in distribution as n goes to infinity, to a deterministic multiple of a single exponential random variable. The limiting exponential variable can be understood as the classical Yaglom limit of the total population size of generation n. Here we study the second order fluctuations around this limit, first, by providing a rate of convergence in the Wasserstein metric that holds for all d > 3, and second, by showing that for d > 7, the weak limit of the scaled joint differences between the number of occupancy-j sites and appropriate multiples of the total population size converge in the Wasserstein metric to a multivariate symmetric Laplace distribution. We also provide a rate of convergence for this latter result.


Introduction
Branching random walk (BRW) is a fundamental mathematical model of a population evolving in time and space, which has been intensely studied for more than 50 years due to its connection to population genetics and superprocesses; see, e.g., [1,Chapter 9] and references. Among this literature, the most relevant to our study is [4,Theorem 5], which states that the exponential distribution arises asymptotically for certain occupation statistics of a critical BRW conditioned on non-extinction. Their result is closely related to the classical theorem of [13], which says that the distribution of the size of a critical Galton-Watson process, properly scaled and conditioned on non-extinction, converges to the exponential distribution. Yaglom's theorem has a large related literature of embellishments and extensions, e.g., [5] and [2] give elegant probabilistic proofs, and [7] gives a rate of convergence using Stein's method.
We now define the nearest neighbor critical BRW on the d-dimensional integer lattice. At each time step n = 1, 2, . . ., every particle generates an independent number of offspring having distribution X with E[X] = 1, Var(X) = σ 2 < ∞, and each offspring moves to a site randomly chosen from the 2d + 1 sites having distance less than or equal to 1 from the site of its parent. We say that a site has multiplicity j in the nth generation if there are exactly j particles from the nth generation at that site. Starting the process from a single particle at the origin, let Z n be the number of particles in nth generation, and let M n (j) be the total number of multiplicity j sites in the nth generation. Lalley and Zheng [4,Theorem 5] showed that, when d 3, there are constants κ 1 , κ 2 . . . with j 1 jκ j = 1 such that, as n → ∞, L Z n n , M n (1) n , M n (2) n , . . . Z n > 0 → L (1, κ 1 , κ 2 , . . .)Z , (1.1) where Z ∼ Exp(σ 2 /2) is exponential rate σ 2 /2, and the convergence is with respect to the product topology (which is the same as convergence of finite dimensional distributions).
We study the second order fluctuations in this limit, working with the finite dimensional distributions of (1.1) in the L 1 -Wasserstein metric. More precisely, let · 1 be the L 1 -norm on R r , let H r = {h : R r → R : |h(x) − h(y)| x − y 1 for every x, y ∈ R r } be the set of L 1 -Lipschitz continuous functions with constant 1, and define  Our first main result is as follows. Below and throughout the paper, we use c to represent constants that do not depend on n, but possibly L (X), the dimension d, and the length of the vector r, and can differ from line to line. We also disregard the pathological case where Var(X) = 0. Before discussing the ideas behind the proofs of these two theorems, we make a few remarks. The limiting covariance matrix Σ is a constant multiple (Var(X)/2) of the limit of the (unconditional) covariance matrix of (M n (j)−Z n E[M n (j)]) r j=1 , given by Lemma 2.9 below. We are only able to show the limit exists, and cannot exclude the possibility that Σ is degenerate or even zero. Where Theorem 1.2 applies, it implies a rate of convergence of n −1/2 in Theorem 1.1. However, we present the results in this way, as it is conceptually natural, and simplifies the presentation of the proofs. An interesting open question from our study is what is the minimal dimension for which the convergence in Theorem 1.2 occurs?
The assumption that d 7 stems from Lemma 2.9, giving the behavior of the covariance matrix. It may be possible to sharpen some estimates, e.g., (2.21), but there are others that may be sharp and still require d 7, for example, the upper bound of (2.23), which must be o(1) (as n → ∞) for our arguments to go through.
We now turn to a discussion of the proofs, where a key result is the following rate of convergence for Yaglom's theorem from [7].
With this result in hand, the basic intuition behind Theorems 1.1 and 1.2 is that the number of multiplicity j sites in generation n is approximately a sum of a random number of conditionally independent random variables. The number of summands is Z m , for a well-chosen m < n, and a given summand represents the contribution from only descendants of a single generation m particle. If m is large, then Theorem 1.3 implies that Z m will be roughly exponential with large mean; hence approximately geometrically distributed; and, if d 3, the summands approximate the true variable because the random walk is transient, and most low occupancy sites consist of particles descended from exactly one individual in generation m; see Lemma 2.5. Thus the vector of Theorem 1.1 is approximately a geometric sum with small parameter, which, by Rényi's Theorem for geometric sums, is close to its mean times an exponential. For Theorem 1.2, the idea is similar, but the summands need to be centered for the Laplace limit to arise. A first thought is to subtract the mean of the summands, but in fact we must subtract the mean times a variable with mean one that is highly correlated to the summand to get the correct scaling; see Lemma 2.9. In order to obtain rates of convergence for the Laplace distribution, we prove a general approximation result for random sums, Theorem 2.8 below. The approach of [4] used to obtain (1.1) uses a similar idea, but the conditioning is different to ours, and does not seem amenable to obtaining the error bounds necessary for Theorems 1.1 and 1.2. Here we use couplings via an explicit construction of (M n (j)|Z n > 0), which is an elaboration of [5], along with Theorem 1.3, to evaluate the bounds necessary to obtain Theorems 1.1 and 1.2. We also use the explicit representation in a novel way to compare two different conditionings appearing in our argument; see Lemma 2.3. The use of the Wasserstein metric is essential in our argument, even if Kolmogorov bounds are the eventual goal (via standard smoothing arguments).
The organization of the paper is as follows. In the next section we provide constructions and lemmas used to prove Theorems 1.1 and 1.2. In Section 2.1 we state and prove our general Laplace approximation result, and then apply it to prove Theorem 1.2. Section 3 gives some auxiliary multivariate normal approximation results that are adapted to our setting, and used in the proofs.

Constructions, moment bounds and proofs
To prove Theorems 1.1 and 1.2, we first need to relate L (Z m |Z n > 0) to L (Z m |Z m > 0) (in Lemma 2.3 below). We use the size-biased tree construction from [5].
Size-biased tree construction. Assume that the tree is labeled and ordered, so if w and v are vertices in the tree from the same generation and w is to the left of v, then the offspring of w is to the left of the offspring of v, too. Start in Generation 0 with one vertex v 0 and let it have a number of offspring distributed according to a size-biased version X s of X, so that Pick one of the offspring of v 0 uniformly at random and call it v 1 . To each of the siblings of v 1 , attach an independent Galton-Watson branching process with offspring distribution X. For v 1 proceed as for v 0 , that is, give it a size-biased number of offspring, pick one at uniformly at random, call it v 2 , attach independent Galton-Watson branching process to the siblings of v 2 and so on. For 1 j n, denote by L n,j and R n,j the number of particles in generation n of this tree that are descendants of the siblings of v j to the left and right (excluding v j ). This gives rise to the size-biased tree.
From this construction we define another tree T n as follows. For 1 j n, let R n,j be independent random variables with L (R n,j ) = L (R n,j |L n,j = 0).
Start with a single "marked" particle in generation 0, represented as the root vertex of T n , and give this particle R n,1 offspring. Then choose the leftmost offspring of the marked particle as the generation 1 marked particle, and give it R n,2 offspring. To continue, the generation j marked particle is the leftmost offspring of the marked particle in Generation j − 1, and has R n,j+1 offspring. In addition, every non-marked particle has descendants according to an independent Galton-Watson tree with offspring distribution L (X). Let T n be the tree generated in this way to generation n. The key fact from [5, Theorem C(i)], is that the distribution of T n is the same as the entire tree created from an ordinary Galton-Watson process with offspring distribution X conditional on non-extinction up to Generation n. Moreover, this tree and its marked particles can be closely coupled to the size-biased tree and the v j "spine" offspring. Now, let A k,j = {L k,j = 0} and let X m,j , X m,j be random variables that are independent of each other and the size-biased tree constructed above, such that Before proving the lemma, we state a key result for controlling the conditioning on non-extinction is the following second order version of "Kolmogorov's estimate" found in [11,Display between (5) and (6)].
where the first inequality is a union bound, the equality is by independence of lineages, the second inequality is because L (L (i) m,j ) = L (Z m−j ), and the last is by Lemma 2.2. To bound further, we have that, conditional on X j , I j , using the independence of lineages, where we have used Lemma 2.2. The function g(x) = xa x is 1-Lipschitz on (0, ∞) for any a < 1, so we can apply Theorem 1.3 and use the fact that L (L where we have used Lemma 2.2, and we now choose K so that c log(n−m) 2 /(n−m) < 1/2 whenever n − m > K. Using this bound and combining the last three displays, we have Now noting that given I j and X j , R m,j is independent of A m,j and A n,j , we easily find A union bound implies Combining this with (2.2) shows (ii). For (iii), we show that for k = m, n, which easily implies the result. To show (2.3), we use the following correlation inequality: if f is non-decreasing and g is non-increasing, then Cov(f (X), g(X)) 0. Note that where the first line is obvious and the second is because of conditional independence.
But we can couple I j to X j in such a way that it is non-decreasing with X j , and thus the correlation inequality implies Combining the last two displays implies (2.3) for k = m. For k = n, using similar ideas, and the second factor decreases with L m,j and so Finally, But again we can couple (X j , I j ) such that I j is non-decreasing in X j , and thus and then (iv) easily follows from (iii) and To continue, we need a lemma giving some moment information for variables in the BRW.
Lemma 2.4. Assume the definitions and constructions above, and let Y n;m denote the number of particles in generation n of the BRW that occupy a site with another generation n particle that has a different ancestor at generation m. We have the following: , n = 1, 2, . . . has a limit and the bound follows since We give a construction of the critical BRW, building from the size-biased tree construction.

BRW construction.
To construct M n (j) conditional on Z n > 0, first generate T n from the size-bias tree section above. Since this tree is distributed as the Galton-Watson tree given Z n > 0, we construct the conditional BRW by attaching a random direction to each offspring, chosen uniformly and independently from the 2d + 1 available directions for the nearest-neighbor random walk. It is obvious that this "modified" BRW process has the same distribution as the original conditioned on non-extinction to generation n. For the modified process, letẐ k denote the size of generation k andM n (j) be the number of multiplicity j sites in generation n, then, in particular, we have L Z m , Z n , (M n (j)) r j=1 |Z n > 0 = L Ẑ m ,Ẑ n , (M n (j)) r j=1 .

Modified BRW construction.
A key to our approach is the following lemma that shows the cost of replacingM n (i) by a sum of a random sum of conditionally independent variables. GivenẐ m , for i = 2, . . . ,Ẑ m , let Z i n,m be the number of generation n offspring of the ith particle in generation m of the modified BRW construction; here the labelling is left to right (so particle 1 is always the marked particle), and note these are distributed as the sizes of the (n − m)th generations of i.i.d. Galton-Watson trees with offspring distribution L (X). Let also M i n,m (j) be the number of sites having exactly j generation-n descendants from the generation m particle labeled i in the critical BRW construction above, where the counts ignore particles descended from other generation m particles at those sites. Also let (Z 1 n,m , M 1 n,m (j) be an independent copy of (Z n−m , M n−m (j)). Note that givenẐ m , M 1 n,m (j), . . . , MẐ m n,m (j) are i.i.d. Lemma 2.5. For the variables described above and m < n, Proof. The differences between the two variables are (i) multiplicity j sites with more than 1 ancestor from generation m, (ii) multiplicity k > j sites with exactly j particles descended from some single generation m particle, (iii) the number of multiplicity j sites with only descendants of the first particle of generation m, and (iv) M 1 n,m (j Before proving Theorem 1.1, we state and prove a simple lemma. Lemma 2.6. For any nonnegative random variable Y on the same space as (Z j ) 0 j n and m < n, we have Proof. Using Kolmogorov's approximation, where the hat-couplings are those in the BRW description above and where recall the Z i n,m are the number of generation n offspring of the ith particle in generation m of the modified BRW construction, which are distributed as the sizes of the (n − m)th generations of i.i.d. Galton-Watson trees with offspring distribution L (X).

Laplace distribution approximation
The centered multivariate symmetric Laplace distribution is a cousin to the Gaussian distribution that arises in a number of contexts and applications; see [3] for a book length treatment of this distribution. The r-dimensional distribution is denoted SL r (Σ), where the parameter Σ is an r × r positive definite matrix. In general, its law is the same as that of √ EZ, where E ∼ Exp(1) and Z is a centered multivariate normal vector with covariance matrix Σ. The covariance matrix of SL r (Σ) is Σ, which can thus be thought of as a scaling parameter. The characteristic function is evidently  (1),

|x| and the multivariate density is given in [3, (5.2.2)] in terms of modified
Bessel functions of the 3rd kind.
The symmetric Laplace distribution arises as the limit of a geometric sum. More precisely, we have the following theorem, which is elementary, using, for example, characteristic functions. Theorem 2.7. Let N p ∼ Geo(p) be independent of X 1 , X 2 , . . ., which are i.i.d. rdimensional random vectors having mean zero and covariance matrix Σ. Then as p → 0, Here we provide a rate of convergence to a generalization of Theorem 2.7 in a metric amenable to our setting; see also [8] for a related result when r = 1. Theorem 2.8. Let M 1 be a random variable with mean µ > 1, independent of X 1 , X 2 , . . ., which are i.i.d. r-dimensional random vectors with zero mean, covariance matrix Σ = (Σ ij ), and finite third moments. Then there is a constant C r depending only on r such that Proof. Let E ∼ Exp(1), N ∼ Geo(µ −1 ) and Z = Z Σ be a centered multivariate normal vector with covariance matrix Σ, with the three variables independent and independent of M and the X i . Then, since L ( We use below that if X, Y, Z are random elements defined on the same space, then Conditioning on M , applying (2.10), and using independence, we find that where we have used Jensen's inequality in the last line.
To bound (2.8), we use (2.10) and the general fact that for fixed z ∈ R d , and random Now use the dual definition of Wasserstein distance, see, for example [9] or [12], to choose a coupling between M and N such that d W (L (M ), L (N )) = E |M − N | . Using Thus, conditioning on Z, using (2.10) and independence, we have that (2.12) implies where the last inequality is because To bound (2.9), we again use (2.10), (2.12), and independence, to find since |1 + 1/(µ log(1 − µ))| 1 for µ > 1. Finally, Then the limits (2.17) Remark 2.10. As a check on the limiting constant and linear growth of the fourth moment of (M n (j) − µ n (j)Z n ) given by (2.17), Theorem 1.2 suggests (assuming appropriate uniform integrability) that for E ∼ Exp(1) and Z j ∼ N(0, Σ jj ), The left hand side of (2.18) is equal to be defined as follows: The first (respectively, second) coordinate is the number of sites with j (respectively, k) particles in generation n descended from particle i in generation 1, and the third coordinate is the number of offspring in generation n descended from generation 1. Note that, given Z 1 , these are i.i.d. copies of (M n−1 (j), M n−1 (k), Z n−1 ). Now write The first term above is the main contribution, and the second term is a small error. For the first term, Similarly, Finally, cn and, from the proof of [4, Proposition 21], that |µ n ( ) − µ n−1 ( )| n −d/2 , we can collect the work above to find To bound the errors, first note (2.20) Similarly, and the same inequality holds with |e (2) n (k)| on the left hand side. For α > 0, to be chosen later, we bound Since d 7, n 1 n 1−d/3 < ∞, and therefore (A n (j, k)) n 1 is a Cauchy sequence; denote its limit by Σ jk , and observe that To prove the second assertion, we fix j and drop it from the notation, e.g., writing M n for M n (j). We follow the strategy above, but now there are higher moments, which we denote by where the second equality follows, similar to the argument above, from where a k is 1, −4, or 6 as appropriate. Bounding these similar to (2.20) and (2.21) above, using that E[X 5+ 18/(d−6) ] < ∞, we have that for α > 0 and β = 18/(d − 6) + 1, n ]n −β(1+α) + n 3(1+α)−d/2 ) c(n 3−βα + n 3(1+α)−d/2 ).
Choosing α = (d − 6)/6 − ε, for ε > 0 small enough that αβ > 3, and noting that d 7, we have for δ = min{αβ − 3, 3ε} > 0. Thus we find the fourth moment (2.22) is equal to (2.24) for δ = min{δ, 3 − d/2} > 0. As before, we expand the random sums in the expectations above and then simplify. We cover in detail only the middle term, which is the most involved, and just write the final expressions for the other terms. Write Σ {i,j,k} for sums over distinct indices. For the middle term, For a quick parity check of this formula, note that for non-negative integer z,

Similar arguments shows
Plugging these into (2.24), we find (it's easiest to compute the coefficients for each of σ 2 , γ 3 , γ 4 ; the last two are zero) that Therefore, where Σ n = (A n−m (j, k)) j,k with A n−m (j, k) = Cov( M 1 n,m (j), M 1 n,m (k)). Using the coupling definition of Wasserstein distance and Lemma 2.5, we can bound (2.25) by noting Similarly, (2.26) is bounded from where we used Lemma 2.4 in the second inequality, and part (iii) of Lemma 2.1 in the second to last. Noting our Wasserstein distance is with respect to L 1 distance, summing over j and k, and using the inequalities above shows that ( Multiplying this by the n −1/6 factor coming from the powers of µ in the bound from Theorem 2.8 gives a term of order n − 2d−9 6(2d+1) . For the remaining (nontrivial) term, the triangle inequality implies cn − 2d−9 6(2d+1) , and putting these bounds into Theorem 2.8 implies that (2.28) is bounded by cn − 2d−9 6(2d+1) .
Finally, we bound (2.29). Using the representation L ( √ EZ) = SL r (Cov(Z)) for E distributed as an exponential with rate one, independent of Z, an r-dimensional multivariate normal, we apply Lemma 3.3 below and Lemma 2.9, and noting that d 7, cn − 2d−9 6(2d+1) , and combining the bounds above yields the theorem.

CLT with error
In this section we prove a multivariate CLT with Wasserstein error for sums of i.i.d. variables that is adapted to our setting. The proof is relatively standard using Stein's method, with the complications that we are working in the Wasserstein (rather than smoother test function) metric, and that we do not demand the covariance matrix be non-singular.
In what follows, denote by |·| the Euclidean L 2 -norm. For a k-times differential Clearly, for any vectors a 1 , . . . , a k , x ∈ R r , we have r i1,...,ir=1 a 1,i1 · · · a k,i k ∂ k f (x) ∂x i1 · · · ∂x ir |a 1 | · · · |a k |M k (f ). Var X 1 = Σ = (Σ uv ) 1 u,v r . Let W = n −1/2 n i=1 X i , and let Z have a standard multivariate normal distribution, and let Z Σ = Σ 1/2 Z. Then, for any differentiable function Proof. We first replace h by h ε , which is defined as where Z has a standard multivariate normal distribution, independent of all else.
Applying (3.1) to the quantity inside the second expectation, we have r u,v,w=1 Now, let Y 1 be an independent copy of Y 1 , and note that we have Σ uv = n E(Y 1,u Y 1,v ).
Applying again (3.1) to the quantity inside the second expectation, we have r u,v,w=1 Subtracting one from the other, it follows that yields the final bound.
We also have the following easy corollary to fit our setup above. .
Proof. By equivalence of norms in R r , there is a constant q r depending only on r such that for any a ∈ R r , q −1 r |a| a 1 q r |a|. Therefore, for any a, x ∈ R r , r i=1 a i |a| ∂h(x) ∂x i a 1 |a| q r , and E |X| 3 q 3 r E X 3 1 . The result now follows from Theorem 3.1.
Finally, we state a simple lemma used to compare centered multivariate normal distributions.

Lemma 3.3.
Let Σ and Σ be two non-negative semi-definite (r × r) matrices for r 1. Let X = (X 1 , . . . , X r ), respectively Y = (Y 1 , . . . , Y r ), be a centered multivariate random normal vector with covariance matrix Σ, respectively Σ . Then for some constant C that only depends on r.
Proof. Using Stein's identity for the multivariate normal, for any twice-differentiable function f , we have The result now follows by using the smoothing argument and Stein's method as in the proof of Theorem 3.1; we omit the details.