Simple conditions for convergence of sequential Monte Carlo genealogies with applications

Sequential Monte Carlo algorithms are popular methods for approximating integrals in problems such as non-linear filtering and smoothing. Their performance depends strongly on the properties of an induced genealogical process. We present simple conditions under which the limiting process, as the number of particles grows, is a time-rescaled Kingman coalescent. We establish these conditions for standard sequential Monte Carlo with a broad class of low-variance resampling schemes, as well as for conditional sequential Monte Carlo with multinomial resampling.


Introduction
Sequential Monte Carlo refers to a broad class of stochastic algorithms in which a population of particles evolves in discrete time. These algorithms are best known for their application in non-linear filtering and smoothing (Gordon et al., 1993), but can be applied generally to mean-field approximation of Feynman-Kac flows; see Del Moral (2004) for more background. These methods have found diverse applications throughout signal processing, statistics, econometrics, biology, and many other disciplines, and understanding their properties is of widespread importance; see, for example, Doucet and Johansen (2011); Fearnhead and Künsch (2018); Naesseth et al. (2019) for recent surveys.
The algorithm proceeds by iterating through two steps: a mutation step in which the positions of the particles are updated by applying some Markov kernel; and a selection step in which the particles are weighted by some potential function and resampled to form the next generation. Resampling stochastically duplicates high-weight particles and removes low-weight particles. The duplicates of a particle in the following generation are termed its offspring. A succession of resampling steps induces a genealogy, that is, a process recording the parent-offspring relationships between particles at consecutive generations.
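The mutate-weight-resample loop described above can be sketched as follows. This is an illustrative sketch only, not an algorithm from the paper: the linear-Gaussian model, the parameter names (`rho`, `sigma`, `tau`), and the function name `bootstrap_filter` are assumptions chosen for concreteness.

```python
import numpy as np

def bootstrap_filter(y, N=500, rho=0.9, sigma=1.0, tau=0.5, seed=0):
    """One illustrative SMC run: mutate with an AR(1) kernel, weight by a
    Gaussian observation density, then resample (multinomial).  Returns the
    terminal particles and the parental indices inducing the genealogy."""
    rng = np.random.default_rng(seed)
    T = len(y)
    x = rng.normal(0.0, 1.0, size=N)         # initial proposal
    ancestry = np.empty((T, N), dtype=int)   # parental indices per generation
    for t in range(T):
        x = rho * x + sigma * rng.normal(size=N)   # mutation step
        logw = -0.5 * ((y[t] - x) / tau) ** 2      # potential (log-weights)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        a = rng.choice(N, size=N, p=w)             # selection: resampling
        ancestry[t] = a
        x = x[a]                                   # offspring inherit positions
    return x, ancestry
```

The `ancestry` array records exactly the parent-offspring relationships whose backward trace forms the genealogical process studied below.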
Sequential Monte Carlo genealogies are important because they capture the phenomenon of ancestral degeneracy, which has a substantial impact on the performance of the algorithm. Due to resampling, the number of distinct ancestors whose descendants comprise the particles at the terminal time decays rapidly as the time horizon increases, a well-known consequence of which is that path-based Monte Carlo estimators typically have high variance (see, for example, Briers et al. (2010); Fearnhead and Künsch (2018)).
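Ancestral degeneracy is easy to observe numerically. The following sketch (an illustration under neutral, equal-weight multinomial resampling, not a construction from the paper) tracks how many distinct founding particles remain ancestral to the current generation; the count is necessarily non-increasing and decays over time.

```python
import numpy as np

def distinct_ancestors(N=200, T=400, seed=1):
    """Count distinct founding ancestors of the current generation under
    repeated neutral multinomial resampling (equal weights)."""
    rng = np.random.default_rng(seed)
    anc = np.arange(N)        # founding-ancestor label of each particle
    counts = []
    for _ in range(T):
        parents = rng.integers(0, N, size=N)  # equal-weight multinomial
        anc = anc[parents]                    # offspring inherit labels
        counts.append(len(np.unique(anc)))
    return counts
```

The set of surviving founder labels can only shrink at each resampling step, which is the mechanism behind the rapid ancestor decay described above.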
We provide simple sufficient conditions under which the genealogy of a sequential Monte Carlo algorithm converges to a time-rescaled n-coalescent (Kingman, 1982) in the large population limit. We require control over only the second and third moments of the marginal family size of each parent. This is a substantial improvement over previous work of Koskela et al. (2018), which requires additional control over fourth moments, including cross-terms, to obtain the same convergence result. That work also imposes a condition on the speed of convergence that is violated, for instance, by the neutral Moran model, any finite sample of which is known to have a Kingman genealogy in the large population limit (see Durrett, 2008, p. 47). Our result therefore covers a strictly broader class of algorithms, and its simple conditions can be verified more easily.
The result is known to apply to standard sequential Monte Carlo with multinomial resampling (Koskela et al., 2018, Corollary 1). We additionally prove convergence for any resampling scheme based on stochastic rounding, described in Definition 2. This includes low-variance schemes such as systematic resampling, residual resampling with stratified residuals, and the more exotic schemes proposed in Crisan and Lyons (1997) and Gerber et al. (2019). The results presented in this paper therefore provide the first a priori characterization of the genealogy of a sequential Monte Carlo algorithm for resampling schemes that are widely used in practice, offering quantitative insight into the nature of ancestral degeneracy and into the relative performance of different resampling schemes. We also show that our result applies to conditional sequential Monte Carlo, a variant that forms a building block of the particle Gibbs algorithm (Andrieu et al., 2010), in which it is important to maintain at least two distinct ancestors across a fixed time window. This is important because, for many practical statistical problems involving a high-dimensional latent state space model, particle Gibbs is one of the few algorithms able to explore parameter posteriors in reasonable time, so understanding the factors affecting its mixing time is of great interest.

Sequential Monte Carlo
Algorithm 1 describes sequential Monte Carlo with $N$ particles over a fixed time window $T$. The initial proposal distribution is $\mu$, $(K_t)_{t=1}^T$ is a sequence of Markov transition kernels with respective densities $(q_t)_{t=1}^T$, and $(g_t)_{t=0}^T$ is a sequence of potential functions. For simplicity these may be assumed to exist on a common state space that is a subspace of $\mathbb{R}^d$, although the state spaces can in general be any sequence of Polish spaces. At generation $t$, $w_t^{(1:N)}$ is the vector of particle weights, and $a_t^{(1:N)}$ is the vector of resampled parental indices. The variety of possible procedures for the resample step is an active area of research; some important examples are explored in Section 4. We take valid resampling schemes to be those where: the total number of resampled offspring is $N$; the expected number of offspring of particle $i$ conditional on the weights is $N w_t^{(i)}$; and each offspring is assigned equal weight after resampling. Our standing assumption is that, conditionally on the vector of offspring counts, the assignment of offspring indices to parents is uniform over all valid assignments. The standing assumption is a weaker condition than exchangeability (Möhle, 1998, p. 446). Any resampling scheme can be made to satisfy it by applying an additional permutation of the offspring indices after selecting the parents (see Andrieu et al., 2010, p. 290).
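The validity conditions and the permutation trick can be illustrated for multinomial resampling. This is a sketch for illustration; the function name is an assumption, and the final permutation is the device cited from Andrieu et al. (2010) for enforcing the standing assumption.

```python
import numpy as np

def multinomial_resample(w, rng):
    """Multinomial resampling followed by a uniform random permutation of
    offspring indices.  Offspring counts sum to N and have conditional
    expectation N * w_i, so the validity conditions hold."""
    N = len(w)
    counts = rng.multinomial(N, w)             # family sizes, sum to N
    parents = np.repeat(np.arange(N), counts)  # offspring -> parent map
    return rng.permutation(parents), counts
```

Averaging `counts` over repeated draws recovers the unbiasedness condition, the expected number of offspring of particle $i$ being $N w^{(i)}$.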

A limit theorem for sequential Monte Carlo genealogies
For convenience, we will henceforth measure time backwards, with the terminal particles at time 0 and the initial particles at time $T$. The forward-time process of particles replicating or dying induces a coalescent process when viewed backwards in time. The full forward-time process is Markovian, but this no longer holds after integrating out the positions of particles and their weights; thus the reverse-time process of ancestral lineages, stripped of position and weight information, is not Markovian either.
We will analyse an asymptotic regime in which the total number of particles $N \to \infty$, and consider the finite-dimensional restriction to a sample of $n \leq N$ terminal particles. The genealogy of such a sample is described by a partition-valued stochastic process $(G_t^{(n,N)})_{t \geq 0}$, in which $i$ and $j$ belong to the same block of $G_t^{(n,N)}$ if and only if terminal particles $i$ and $j$ share a common ancestor at time $t$. We will take $N \to \infty$ and show that the $n$-coalescent (Kingman, 1982) is the correct limiting object. Our limit theorem will apply after a rescaling of time in which the genealogy is viewed over an infinite time horizon; that is, $T \to \infty$.
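The partition-valued process can be constructed from stored parental indices by tracing lineages backwards: two sampled particles are in the same block at backwards time $t$ exactly when their traced ancestor labels coincide. A minimal sketch, with the representation of each partition by ancestor labels an implementation choice of ours:

```python
import numpy as np

def genealogy_partition(ancestry, sample):
    """Trace sampled terminal particles back through `ancestry` (a list of
    parent-index arrays, most recent generation last).  Returns one label
    tuple per backwards time; equal labels indicate the same block of G_t."""
    lineages = np.array(sample)
    partitions = [tuple(lineages)]          # time 0: trivial partition
    for a in ancestry[::-1]:                # walk backwards in time
        lineages = a[lineages]              # replace each lineage by its parent
        partitions.append(tuple(lineages))
    return partitions
```

The number of distinct labels is non-increasing along the trace, mirroring the coalescence of lineages.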
Definition 1. The $n$-coalescent is the homogeneous continuous-time Markov process on the set of partitions of $\{1, \ldots, n\}$ with infinitesimal generator $Q$ having entries $Q_{\xi\eta} = 1$ if $\xi \prec \eta$, $Q_{\xi\xi} = -\binom{|\xi|}{2}$, and $Q_{\xi\eta} = 0$ otherwise, where $\xi$ and $\eta$ are partitions of $\{1, \ldots, n\}$, $|\xi|$ denotes the number of blocks in $\xi$, and $\xi \prec \eta$ means that $\eta$ is obtained from $\xi$ by merging exactly one pair of blocks.
Throughout the following, falling factorials are denoted by $(a)_b = a(a-1)\cdots(a-b+1)$. We denote by $\mathcal{F}_t = \sigma(\nu_s^{(1:N)} : 1 \leq s \leq t)$ the filtration generated by offspring counts, and use the shorthand $\mathbb{E}_t(\cdot) \equiv \mathbb{E}(\cdot \mid \mathcal{F}_{t-1})$ for filtered expectations. Since time is labelled in reverse, the filtrations contain information about the future of the original system rather than the past.
A central quantity for analysing convergence of these coalescent processes is the pair merger rate, which can be interpreted as the probability that a randomly chosen pair of particles at time $t-1$ have a common ancestor at time $t$. Conditional on $\mathcal{F}_t$, this probability is
$$c_N(t) = \frac{1}{(N)_2} \sum_{i=1}^N \big(\nu_t^{(i)}\big)_2. \qquad (1)$$
In the $n$-coalescent the pair merger rate is equal to 1. Thus, as $N \to \infty$, the time scaling required to possibly obtain a Kingman coalescent limit is the generalized inverse $\tau_N(t) = \min\{s \geq 1 : \sum_{r=1}^s c_N(r) \geq t\}$. Following Möhle (1998), we exclude the case where $P\{\tau_N(t) = \infty\} > 0$ for finite $t$; this occurs, for example, with a minimum-variance resampling scheme where the potentials are constant. In the $n$-coalescent, there are almost surely no mergers involving more than two lineages at a time. As shown in Koskela et al. (2018, Lemma 1, Case 3), the conditional probability that more than two lineages merge at time $t$ admits an upper bound, which we denote $D_N(t)$; this event includes both the possibility of three or more lineages merging into one, and of two or more simultaneous pair mergers. Theorem 1, our main result, gives a simple sufficient condition controlling these merger rates to yield a Kingman limit for particle genealogies.
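The pair merger probability and the generalized-inverse time change can be computed from observed family sizes. A sketch, assuming the pair merger probability takes the standard form $\sum_i (\nu^{(i)})_2 / (N)_2$ (both offspring drawn without replacement from the same family):

```python
def pair_merger_prob(nu):
    """c_N(t): probability that a uniformly chosen pair of offspring shares
    a parent, given family sizes nu = (nu_1, ..., nu_N) summing to N."""
    N = sum(nu)
    return sum(v * (v - 1) for v in nu) / (N * (N - 1))

def tau(c, t):
    """Generalized inverse time change: the first s with c_1 + ... + c_s >= t,
    or infinity if the partial sums never reach t."""
    acc = 0.0
    for s, cs in enumerate(c, start=1):
        acc += cs
        if acc >= t:
            return s
    return float("inf")
```

When the potentials are constant and resampling has minimum variance, every $c_N(t)$ is zero and `tau` returns infinity, which is the excluded case.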
Theorem 1. Let $\nu_t^{(1:N)}$ denote the offspring numbers in an interacting particle system satisfying the standing assumption and such that, for any $N$ sufficiently large, $P\{\tau_N(t) = \infty\} = 0$ for all finite $t$. Suppose furthermore that condition (2), which controls the ratio of third to second factorial moments of the family sizes, holds for all $N$, uniformly in $t \geq 1$. Then the rescaled genealogical process $(G_{\tau_N(t)}^{(n,N)})_{t \geq 0}$ converges to Kingman's $n$-coalescent as $N \to \infty$, in the sense of finite-dimensional distributions.

Proof. Theorem 1 has the same conclusion as Koskela et al. (2018, Theorem 1), but with weaker and more tractable conditions; we show that these simpler conditions are sufficient. The conditions of Koskela et al. (2018, Theorem 1) are (3)-(6) as $N \to \infty$, for some strictly positive constant $C_{t,s}$ that does not depend on $N$. Lemmata 1-3 below show that assumptions (3)-(5) follow from (2). Lemma 4 allows us to remove condition (6) by improving upon some arguments from the proof of Koskela et al. (2018, Theorem 1), the details of which can be found in Appendix 1.
which concludes the proof.
Proof. We decompose $D_N(t)$ as the sum of two terms and consider their filtered expectations in turn. Substituting (8) into (11), and (12) into (10), then combining (9) and (13), bounds the filtered expectation of $D_N(t)$. Finally, invoking Koskela et al. (2018, Lemma 2) twice, and recalling that $\varepsilon > 0$ was arbitrary, concludes the proof.
For Lemma 4, we introduce the quantity $p_{\xi\eta}(t)$. For any fixed $n$ and $N$, $p_{\xi\eta}(t)$ is the time $t-1$ conditional transition probability of the genealogical process from $\xi$ to $\eta$, where $\xi$ and $\eta$ are partitions of $\{1, \ldots, n\}$. The transition probability is non-zero only when $\eta = \xi$ or $\eta$ can be obtained from $\xi$ by merging some blocks. Let $b_j$ ($j = 1, \ldots, |\xi|$) denote the number of blocks in $\xi$ that merged to form block $j$ of $\eta$.
for a constant $B_{|\xi|} > 0$, increasing in $|\xi|$, that does not depend on $N$.
Proof. Let $\kappa_i = |\{j : b_j = i\}|$ denote the multiplicity of mergers of size $i$, with the slight abuse of terminology that $\kappa_1$ counts non-merger events. In particular, we have that $\kappa_1 + 2\kappa_2 + \cdots + |\xi|\kappa_{|\xi|} = |\xi|$. The identity holds because the right-hand side subtracts the probabilities of all possible merger events. See Fu (2006, Eq. (11)) for the combinatorial factor, which gives the number of partitions of a sequence of length $|\xi|$ having $\kappa_j$ subsequences of length $j$ for each $j$. The omitted $k = |\xi|$ summand would correspond to the probability of an identity transition. The non-increasing ordering of $(b_1, \ldots, b_k)$ in the sum is arbitrary, but without loss of generality: choosing any ordering of the same set of merger sizes would give the same result. Firstly, we separate out the $k = |\xi| - 1$ term, which covers isolated binary mergers; in that case the only possible $b$-vector is $(2, 1, \ldots, 1)$, for which the combinatorial factor simplifies, and a multinomial expansion argument yields the corresponding bound. For the other summands, we have $\frac{|\xi|!}{\prod_{j=1}^{|\xi|} (j!)^{\kappa_j} \kappa_j!} \leq |\xi|!$ and, similarly to Koskela et al. (2018, Lemma 1, Case 3), the summand in the third term depends neither on $k$ nor on $b_1, \ldots, b_k$, and the number of terms in those sums is bounded above by $(|\xi|-2)\,\gamma_{|\xi|-2}$, where $\gamma_n$ is the number of integer partitions of $n$. By Hardy and Ramanujan (1918, Section 2), $\gamma_n < K e^{2\sqrt{2n}}/n$ for a constant $K > 0$ independent of $n$. Thus, for $|\xi| > 2$, the claimed bound holds with a constant $B_{|\xi|} > 0$ that depends on $|\xi|$ but not on $N$. When $|\xi| \leq 2$, there are no higher-order interactions and the result is immediate.
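The subexponential growth of $\gamma_n$ invoked above can be checked numerically. A sketch: $\gamma_n$ via the standard coin-change dynamic programme, compared against the bound $e^{2\sqrt{2n}}/n$ (over the range checked, the constant $K = 1$ already suffices; the function name is ours).

```python
def partition_counts(nmax):
    """gamma_n: number of integer partitions of n for n = 0, ..., nmax,
    computed by the standard dynamic programme over allowed part sizes."""
    gamma = [1] + [0] * nmax
    for part in range(1, nmax + 1):
        for n in range(part, nmax + 1):
            gamma[n] += gamma[n - part]   # extend partitions by part `part`
    return gamma
```

For example, $\gamma_5 = 7$ and $\gamma_{10} = 42$, both far below the Hardy-Ramanujan-style bound.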
Resampling via stochastic rounding

Definition 2. Given a vector $X = (X_1, \ldots, X_N)$ with non-negative entries, a stochastic rounding of $X$ is a random integer-valued vector $Y = (Y_1, \ldots, Y_N)$ such that, marginally for each $i$, $Y_i = \lceil X_i \rceil$ with probability $X_i - \lfloor X_i \rfloor$, and $Y_i = \lfloor X_i \rfloor$ otherwise.

By construction, $\mathbb{E}(Y_i) = X_i$ for each $i$. Taking $X$ to be $N$ times the vector of particle weights, we can therefore use stochastic rounding for the resample procedure in Algorithm 1, under the further constraint that $Y_1 + \cdots + Y_N = N$. Several ways to enforce this constraint have been proposed, including systematic resampling (Carpenter et al., 1999; Whitley, 1994), residual resampling with stratified or systematic residuals (Whitley, 1994), the branching system of Crisan and Lyons (1997), and the Srinivasan sampling process resampling introduced in Gerber et al. (2019).
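Systematic resampling, one of the schemes listed above, is a convenient concrete example: a single uniform draw positions $N$ evenly spaced points on $[0,1)$, and the resulting family sizes are a stochastic rounding of $N w^{(1:N)}$ that sums to $N$ by construction. A sketch (assuming normalized weights; the guard line is our defensive choice against floating-point drift):

```python
import numpy as np

def systematic_resample(w, rng):
    """Systematic resampling: counts lie in {floor(N*w_i), ceil(N*w_i)},
    have expectation N*w_i, and always sum to N."""
    N = len(w)
    u = (rng.random() + np.arange(N)) / N   # one uniform, N spaced points
    cum = np.cumsum(w)
    cum[-1] = 1.0                           # guard against floating-point drift
    parents = np.searchsorted(cum, u, side="right")
    return np.bincount(parents, minlength=N)
```

Each weight interval of length $w_i$ contains either $\lfloor N w_i \rfloor$ or $\lceil N w_i \rceil$ of the grid points, which is exactly the stochastic rounding marginal.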
Corollary 1. Consider a sequential Monte Carlo algorithm using any stochastic rounding as its resampling scheme, such that the standing assumption is satisfied. Assume that there exist constants $\varepsilon \in (0,1]$ and $a \in [1,\infty)$ such that condition (15) holds for all $x, x'$, $t$. Assume also that the weights satisfy a non-degeneracy condition involving constants $\zeta > 0$ and $\delta \in (0,1)$, whose role is discussed below. Let $(G_t^{(n,N)})_{t \geq 0}$ denote the genealogy of a random sample of $n$ terminal particles from the output of the algorithm when the total number of particles used is $N$. Then, for any fixed $n$, the time-scaled genealogy $(G_{\tau_N(t)}^{(n,N)})_{t \geq 0}$ converges to Kingman's $n$-coalescent as $N \to \infty$, in the sense of finite-dimensional distributions.
The condition on the range of weights is not strong, and can even be weakened: its role is simply to exclude systems with such well-behaved weights that non-trivial resampling occurs only finitely often over an infinite time horizon. In contrast, condition (15) is strong, but is widespread in the literature on sequential Monte Carlo, where it is often known as the strong mixing condition (Del Moral, 2004, Section 3.5.2); it is often possible to relax this assumption at the expense of considerable technical complication.
Remark 1. In a similar vein to Koskela et al. (2018, Remark 3), if we consider the weight vector as fixed, the time scale induced by stochastic rounding is slower than that induced by multinomial resampling. Details are given in Appendix 2.
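The comparison in Remark 1 can be checked numerically through the per-particle second factorial moments, which drive the pair merger rate (1). A sketch under two assumptions of ours: that the multinomial family size is Binomial$(N, w)$, giving $\mathbb{E}[(\nu)_2] = N(N-1)w^2$, and that the stochastic rounding marginal of Definition 2 gives $\mathbb{E}[(\nu)_2] = k(k-1) + 2kf$ when $Nw = k + f$ with $f \in [0,1)$.

```python
def second_factorial_moments(N, w):
    """E[(nu)_2] for one particle of weight w under multinomial resampling
    and under any stochastic rounding of N*w (marginals determine this)."""
    k, f = divmod(N * w, 1.0)              # N*w = k + f, integer and fraction
    multinomial = N * (N - 1) * w * w      # Binomial(N, w) second factorial moment
    rounding = k * (k - 1) + 2 * k * f     # floor/ceil Bernoulli rounding
    return multinomial, rounding
```

The difference is non-negative for every valid weight, consistent with stochastic rounding inducing a slower coalescent time scale than multinomial resampling.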
Remark 2. Since every stochastic rounding has the same marginal distributions and the first moment of (1) depends only on marginal family size distributions, the expected coalescence rate is the same whichever stochastic rounding is used for resampling. Thus the time-scale on which the n-coalescent is recovered is equal in expectation for every such scheme.
Proof. Using the forward-time Markov property of sequential Monte Carlo, and the associated conditional dependence graph, for each $N$ we establish a sequence of $\sigma$-algebras $(\mathcal{H}_t)$ such that $\nu_t^{(1:N)}$ is conditionally independent of $\mathcal{F}_{t-1}$ given $\mathcal{H}_t$. We can find the probability, up to a constant $C$, of $\nu_t^{(i)}$ taking each value, and then use (15) to bound these probabilities above and below. As these are the only two possibilities, we can easily find bounds on the normalized probabilities. We use that, for $A, B > 0$,
$$\frac{A}{A+B} \leq \frac{A^+}{A^+ + B^-},$$
where $A^+$ is an upper bound on $A$ and $B^-$ is a lower bound on $B$; since $x \mapsto x/(x+B)$ is increasing and $B \mapsto A/(A+B)$ is decreasing, this yields slightly tighter bounds than just bounding the numerator $A$ above and the denominator $A+B$ below. In particular, we obtain bounds on the filtered second and third factorial moments of $\nu_t^{(i)}$. Using (15) along with the form of the weights in Algorithm 1, $1/(Na^2) \leq w_t^{(i)} \leq a^2/N$ almost surely for each $i$; with the simple inequality $\lfloor x \rfloor \leq x$, this makes the moment bounds explicit. Finally, since these bounds apply uniformly in $i$, an application of the tower property bounds the ratio of interest, so that (2) is satisfied. It remains to show that, for $N$ sufficiently large, $P(\tau_N(t) = \infty) = 0$ for all finite $t$, a technicality which is verified in Lemma 5 in Appendix 3. The result then follows by applying Theorem 1.
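The elementary ratio bound used in the proof above, that an upper bound $A^+$ on $A$ and a lower bound $B^-$ on $B$ yield an upper bound on $A/(A+B)$, is easy to verify by sampling. A small sketch (the function name is ours):

```python
def ratio_bound(a_plus, b_minus):
    """Upper bound on A/(A+B) when A <= a_plus and B >= b_minus: the map
    x -> x/(x+B) is increasing in x and A/(A+B) is decreasing in B, so
    substituting both bounds preserves the inequality."""
    return a_plus / (a_plus + b_minus)
```

This is tighter than the naive bound obtained by bounding the numerator above and the denominator below separately, since the same $A^+$ appears in both places.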

Conditional sequential Monte Carlo updates
Conditional sequential Monte Carlo differs from Algorithm 1 in that one predetermined trajectory is conditioned to survive all of the resampling steps. We refer to this sequence of positions as the immortal trajectory, and the immortal particle is the particle in a particular generation that is part of the immortal trajectory. Conditional sequential Monte Carlo was introduced as a component of the particle Gibbs algorithm (Andrieu et al., 2010) but has found somewhat wider application in fields as diverse as smoothing (Jacob et al., 2020; Shestopaloff and Doucet, 2019) and optimization (see Chapter 6 of a 2015 University of Warwick PhD thesis by A. Finke). In particle Gibbs, the immortal trajectory $x^\star_{0:T}$ at each iteration is sampled from the output of the previous conditional sequential Monte Carlo run. It is therefore important that, with high probability, at least two distinct lineages survive each run so that the immortal trajectory can be updated. A single run of conditional sequential Monte Carlo with multinomial resampling is presented in Algorithm 2.
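The conditional resampling step can be sketched as follows: the immortal particle's offspring is pinned to the immortal parent, while the remaining parents are drawn as in ordinary multinomial resampling. This is an illustration under the convention, used in the proof of Corollary 2, that the immortal particle takes the first index (index 0 here, since Python is 0-based); the function name is ours.

```python
import numpy as np

def conditional_multinomial_resample(w, rng):
    """Conditional multinomial resampling: offspring 0 is assigned parent 0
    (the immortal particle survives by construction); the remaining N-1
    parents are drawn iid from Categorical(w)."""
    N = len(w)
    parents = np.empty(N, dtype=int)
    parents[0] = 0                              # immortal trajectory survives
    parents[1:] = rng.choice(N, size=N - 1, p=w)
    return parents
```

Because one parent index is deterministic, the immortal particle's family size is distributed as one plus a Binomial count, which is the source of the special-case moment identities used in the proof below.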
Algorithm 2: Conditional sequential Monte Carlo with multinomial resampling.

Although it is also possible to construct a conditional sequential Monte Carlo algorithm using a low-variance resampling scheme, here we treat only the case of multinomial resampling. We believe that the result can be extended to other resampling schemes by arguments similar to those of Corollary 1.
Corollary 2. Consider a conditional sequential Monte Carlo algorithm using multinomial resampling, such that the standing assumption is satisfied. Assume there exist constants $\varepsilon \in (0,1]$, $a \in [1,\infty)$ and a probability density $h(\cdot)$ such that condition (18) holds for all $x, x'$, $t$. Let $(G_t^{(n,N)})_{t \geq 0}$ denote the genealogy of a random sample of $n$ terminal particles from the output of the algorithm when the total number of particles used is $N$. Then, for any fixed $n$, the time-scaled genealogy $(G_{\tau_N(t)}^{(n,N)})_{t \geq 0}$ converges to Kingman's $n$-coalescent as $N \to \infty$, in the sense of finite-dimensional distributions.
Condition (18) is in a convenient form for this argument; it is slightly weaker than (15), which is sufficient to ensure it.
Proof. Define the conditioning $\sigma$-algebra $\mathcal{H}_t$ as in (16). We assume without loss of generality that the immortal particle takes index 1 in each generation; this significantly simplifies the notation, but the same argument holds if the immortal indices are taken to be $a^\star_{0:T}$ rather than $(1, \ldots, 1)$. The parental indices are conditionally independent, as in standard sequential Monte Carlo with multinomial resampling, but we must treat $i = 1$ as a special case; the conditional law on the parental indices, and hence the joint conditional law, follow. First we make an observation, which follows from a balls-in-bins coupling. Assume (18); then for any function $f : \{1, \ldots, N\}^N \to \mathbb{R}$ such that $f(a_t^{(1:N)})$ is integrable, the stated comparison holds, where the elements of $A$ are all mutually independent, independent of $\mathcal{F}_\infty$, and distributed according to the displayed laws; the vector of probabilities is given up to a constant in the argument of Categorical and Multinomial distributions, here and in the following. We use these random vectors to construct bounds that are independent of $\mathcal{F}_\infty$. We can apply (19) to obtain the lower bound using the moments of the multinomial distribution (Mosimann, 1962), along with the identity $(X+1)_2 \equiv 2(X)_1 + (X)_2$; this is then bounded further. Similarly, we derive an upper bound on the third factorial moment, this time using the identity $(X+1)_3 \equiv 3(X)_2 + (X)_3$. Altogether we upper bound the ratio so that (2) is satisfied. It remains to show that, for $N$ sufficiently large, $P\{\tau_N(t) = \infty\} = 0$ for all finite $t$, a technicality which is proved in Lemma 6 in Appendix 3. Applying Theorem 1 gives the result.

A multinomial expansion of the product on the second line yields an expression in which $\Pi_i([n])$ denotes the set of partitions of $\{1, \ldots, n\}$ into exactly $i$ blocks. Expanding the product over $\lambda$, then the product over $\mu$, and applying a further multinomial expansion, the lower bound for the $k$-step transition probability can be written as an iterated sum.
An argument completely analogous to that in Koskela et al. (2018, Appendix) shows that passing the expectation and the limit through the infinite sums is justified, whereupon the contribution of terms with $\sum_{d=1}^k (|\pi_d| + |\sigma_d|) > 0$ vanishes. To see why, follow the argument used to show that the contribution of multiple-merger trajectories vanishes in the corresponding upper bound in Koskela et al. (2018). Only the leading term then remains. Recalling Koskela et al. (2018, Eq. (11)) and applying it $k$ times in (21) yields the claimed expression.
If $w_t^{(i)} = 1$ then the claim is immediate. For the other cases, suppose $k/N \leq w_t^{(i)} < (k+1)/N$ for some $k \in \{0, \ldots, N-1\}$. Then $\lfloor N w_t^{(i)} \rfloor = k$, and $\Delta_i \geq 0$ follows by direct computation. For each $N \geq 2$, any $w_t^{(i)} \in [0,1]$ falls into one of the above cases, so for any fixed vector $w_t^{(1:N)}$ of weights we have that $\Delta_i \geq 0$ for all $i$. For $N = 1$ the result is immediate. This concludes the proof.
By a filtered version of the second Borel-Cantelli lemma (see for example Durrett, 2019, Theorem 4.3.4), this implies that $c_N(t) > 2/N^2$ for infinitely many $t$, almost surely. This ensures, for all $t < \infty$, that $P\{\exists s < \infty : \sum_{r=1}^s c_N(r) \geq t\} = 1$, which by definition of $\tau_N(t)$ is equivalent to $P\{\tau_N(t) = \infty\} = 0$.
Proof. Since $c_N(t) \in [0,1]$ almost surely and has strictly positive expectation, for any fixed $N$ the distribution of $c_N(t)$ with given expectation that maximises $P\{c_N(t) = 0 \mid \mathcal{F}_{t-1}\}$ consists of two atoms, at 0 and 1 respectively. To ensure the correct expectation, the atom at 1 should have mass $P\{c_N(t) = 1 \mid \mathcal{F}_{t-1}\} = \mathbb{E}_t\{c_N(t)\}$, which is bounded below by (20). If $c_N(t) > 0$ then $c_N(t) \geq 2/(N)_2 > 2/N^2$. Hence, in general, $P\{c_N(t) > 2/N^2 \mid \mathcal{F}_{t-1}\} \geq \mathbb{E}_t\{c_N(t)\}$ for any finite $N$. By an argument analogous to the conclusion of Lemma 5, $P\{\tau_N(t) = \infty\} = 0$ for all $t < \infty$.
Appendix 4: D-separation argument to establish conditional independence of $a_t^{(1:N)}$ and $\mathcal{F}_{t-1}$ given $\mathcal{H}_t$

Figure 1 shows part of the conditional dependence graph implied by Algorithm 1. Our aim is to find a $\sigma$-algebra $\mathcal{H}_t$ at each time $t$ that separates the ancestral process (encoded by $a_t^{(1:N)}$) from the filtration $\mathcal{F}_{t-1}$; that is, $a_t^{(1:N)}$ is conditionally independent of $\mathcal{F}_{t-1}$ given $\mathcal{H}_t$. By a D-separation argument (see Verma and Pearl, 1988), the nodes highlighted in grey suffice as the generator of $\mathcal{H}_t$: for each $t$, we take $\mathcal{H}_t = \sigma(X \ldots$. Notice that $\nu \ldots$