Truncated Simulation and Inference in Edge-Exchangeable Networks

Edge-exchangeable probabilistic network models generate edges as an i.i.d. sequence from a discrete measure, providing a simple means for statistical inference of latent network properties. The measure is often constructed using the self-product of a realization from a Bayesian nonparametric (BNP) discrete prior; but unlike in standard BNP models, the self-product measure prior is not conjugate to the likelihood, hindering the development of exact simulation and inference algorithms. Approximation via finite truncation of the discrete measure is a straightforward alternative, but incurs an unknown approximation error. In this paper, we develop methods for forward simulation and posterior inference in random self-product-measure models based on truncation, and provide theoretical guarantees on the quality of the results as a function of the truncation level. The techniques we present are general, and extend to the broader class of discrete Bayesian nonparametric models.


Introduction
Probabilistic generative models have for many years been key tools in the analysis of network data [1, 2]. Recent work in the area [3-13] has begun to incorporate the use of nonparametric discrete measures, in an effort to address the limitations of traditional models in capturing the sparsity of real large-scale networks [14]. These models construct a discrete random measure Θ (often a completely random measure, or CRM [15]) on a space Ψ, associate each atom of the measure with a vertex in the network, and then use the self-product of the measure, i.e., the measure Θ × Θ on Ψ^2, to represent the magnitude of interaction between vertices.
While the inclusion of a nonparametric measure enables capturing sparsity, it also makes both generative simulation and posterior inference via Markov chain Monte Carlo (MCMC) [16; 17, Ch. 11-12] significantly more challenging. In standard Bayesian models with discrete nonparametric measures, such as the Dirichlet process mixture model [18] or the beta process latent feature model [19], this issue is typically addressed by exploiting the conjugacy of the (normalized) completely random measure prior and the likelihood to marginalize the latent infinite discrete measure [20]. In the case of nonparametric network models, however, there is no such reprieve: the self-product of a completely random measure is not a completely random measure, and exact marginalization is typically not possible.
Another option is to truncate the discrete CRM to have finitely many atoms, and perform simulation and inference based on the truncated CRM. Exact truncation schemes based on auxiliary variables [21-23] are limited to models where certain tail probabilities can be computed exactly. Fixed truncation [24-27], on the other hand, is easy to apply to any CRM-based model; but it involves an approximation with potentially unknown error. Although the approximation error of truncated CRM models has been thoroughly studied in past work [27-31], these results apply only to generative simulation (not inference), and do not apply to the self-product CRMs that commonly appear in network models.
In this work, we provide tractable methods for both generative simulation and posterior inference of discrete Bayesian nonparametric models based on truncation, as well as rigorous theoretical analyses of the error incurred by truncation in both scenarios. In particular, our theory and methods require only the ability to compute bounds on-instead of exact evaluation of-intractable tail probabilities. Our work focuses on the case of self-product-measure-based edge-exchangeable network sequences [3,5,32,33], whose edges are simulated i.i.d. conditional on the discrete random product measure Θ × Θ, but the ideas here apply without much effort to the broader class of discrete Bayesian nonparametric models. As an intermediate step of possible independent interest, we also show that the nonzero rates generated from the rejection representation [34] of a Poisson process have the same distribution as the well-known but typically intractable inverse Lévy or Ferguson-Klass representation [35]. This provides a novel method for simulating the inverse Lévy representation, which has a wide variety of uses in applications of Poisson processes [36-38].

Completely random measures and self-products
A completely random measure (CRM) Θ on Ψ is a random measure such that for any collection of K ∈ N disjoint measurable sets A_1, ..., A_K ⊂ Ψ, the values Θ(A_1), ..., Θ(A_K) are independent random variables [15]. In this work, we focus on discrete CRMs taking the form

Θ = Σ_{k=1}^∞ θ_k δ_{ψ_k},    (1)

where δ_x is a Dirac measure on Ψ at location x ∈ Ψ (i.e., δ_x(A) = 1 if x ∈ A and 0 otherwise), and (θ_k, ψ_k)_{k=1}^∞ is a sequence of rates θ_k and labels ψ_k generated from a Poisson process on R_+ × Ψ with mean measure ν(dθ) × L(dψ). Here L is a diffuse probability measure, and ν is a σ-finite measure satisfying ν(R_+) = ∞, which guarantees that the Poisson process has countably infinitely many points almost surely. The space Ψ and distribution L will not affect our analysis; thus, as a shorthand, we write CRM(ν) for the distribution of Θ. One can construct a multidimensional measure Θ^(d) on Ψ^d, d ∈ N, from Θ defined in Eq. (1) by taking its self-product. In particular, we define

Θ^(d) := Σ_{i ∈ N^d_≠} ϑ_i δ_{(ψ_{i_1}, ..., ψ_{i_d})},  where  ϑ_i := Π_{j=1}^d θ_{i_j},    (2)

where i is a d-dimensional multi-index and N^d_≠ := {i ∈ N^d : i_j ≠ i_{j'} for all j ≠ j'} is the set of such indices with all distinct components. Note that Θ^(d) is no longer a CRM, as it does not satisfy the independence condition.
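To make the self-product construction concrete, the following minimal Python sketch computes the atom weights of a truncated Θ^(d) from a finite rate vector; the function name and the use of a finite truncation are our own illustrative choices, not part of the model definition:

```python
import itertools
import math

def self_product_weights(theta, d):
    """Weights of a (truncated) self-product measure Theta^(d): one weight
    prod_j theta[i_j] per length-d multi-index i with all-distinct components."""
    return {i: math.prod(theta[j] for j in i)
            for i in itertools.permutations(range(len(theta)), d)}

# Example: pairwise (d = 2) interaction weights for three atoms.
w = self_product_weights([0.5, 0.3, 0.2], 2)
# note that self-pairs such as (0, 0) are excluded by construction
```

Since `itertools.permutations` enumerates only index tuples with distinct components, the restriction to N^d_≠ is enforced automatically.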

Series representations
To simulate a realization Θ ∼ CRM(ν), e.g., as a first step in the simulation of a self-product measure Θ^(d), the rates θ_k and labels ψ_k may be generated in sequence using a series representation [39] of the CRM. In particular, we begin by simulating the jumps of a unit-rate homogeneous Poisson process (Γ_k)_{k=1}^∞ on R_+ in increasing order. For a given distribution g on R_+ and nonnegative measurable function τ : R_+ × R_+ → R_+, we set

θ_k = τ(U_k, Γ_k),  U_k iid∼ g,  ψ_k iid∼ L,  k ∈ N.    (3)

Depending on the particular choice of τ and g, one can construct several different series representations of a CRM [28]. For example, the inverse Lévy representation [35] has the form

θ_k = ν^←(Γ_k),  ν^←(x) := inf{y : ν([y, ∞)) ≤ x}.    (4)
In many cases, computing ν^←(x) is intractable, making it difficult to generate θ_k in this manner. Alternatively, we can generate a series of rates from CRM(ν) with the rejection representation [34], which has the form

θ_k = µ^←(Γ_k) · 1[V_k ≤ (dν/dµ)(µ^←(Γ_k))],  V_k iid∼ Unif[0, 1],    (5)

where µ is a measure on R_+ chosen such that dν/dµ ≤ 1 uniformly and µ^←(x) is easy to calculate in closed form. While there are many other sequential representations of CRMs [28], the representations in Eqs. (3) to (5) are broadly applicable and play a key role in our theoretical analysis.
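A rejection representation can be sketched in Python under an illustrative assumption: for ν(dθ) = γλ θ^{-1}(1-θ)^{λ-1} dθ on (0, 1) with λ ≥ 1, the proposal µ(dθ) = γλ θ^{-1} dθ gives dν/dµ = (1-θ)^{λ-1} ≤ 1 and µ^←(x) = exp(-x/(γλ)); this particular choice of µ is ours, made only so that everything is available in closed form:

```python
import numpy as np

def rejection_rates(K, gamma, lam, rng):
    """K iterations of the rejection representation for the illustrative
    nu(dtheta) = gamma*lam * theta^{-1} (1-theta)^{lam-1} dtheta (lam >= 1),
    with proposal mu(dtheta) = gamma*lam * theta^{-1} dtheta on (0, 1).
    Rejected proposals yield zero rates, mirroring the indicator in the text."""
    Gamma = np.cumsum(rng.exponential(size=K))
    prop = np.exp(-Gamma / (gamma * lam))                # mu^{<-}(Gamma_k)
    accept = rng.uniform(size=K) <= (1.0 - prop) ** (lam - 1.0)
    return np.where(accept, prop, 0.0)
```

Unlike the inverse Lévy rates, the output here contains zeros at rejected indices, which is precisely the practical wrinkle revisited in Section 3.3.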

Edge-exchangeable graphs
Self-product measures Θ^(d) of the form Eq. (2) with d = 2 have recently been used as priors in a number of probabilistic network models [3, 4].¹ The focus of the present work is on those models that associate each ψ_k with a vertex, each tuple ζ_i = (ψ_{i_1}, ..., ψ_{i_d}) with a (hyper)edge, and then build a growing sequence of networks by sequentially generating edges from Θ^(d) in rounds n = 1, ..., N. There are a number of choices for how to construct such a sequence. For example, in each round n we may add multiple edges via an independent likelihood process X_n ∼ LP(h, Θ^(d)) defined by

X_n := Σ_{i ∈ N^d_≠} x_{ni} δ_{ζ_i},  x_{ni} indep∼ h(· | ϑ_i),    (6)

where x_{ni} = k denotes that there were k copies of edge ζ_i added at round n, and h(· | ϑ) is a probability distribution on N ∪ {0}. For convenience, we denote the mean µ(ϑ) := Σ_{k=0}^∞ k · h(k | ϑ) and the probability of 0 under h by π(ϑ) := h(0 | ϑ). By the Slivnyak-Mecke theorem [40], if h satisfies

∫_{R_+^d} (1 − π(Π_{j=1}^d θ_j)) Π_{j=1}^d ν(dθ_j) < ∞,    (7)

then finitely many edges are added to the graph in each round. We make this assumption throughout this work. Alternatively, if

∫ min(1, θ) ν(dθ) < ∞,

then Ω := Θ^(d)(Ψ^d) < ∞, and we may add only a single edge per round n via a categorical likelihood process X_n ∼ Categorical(Θ^(d)) defined by

X_n := δ_{ζ_i}  with probability ϑ_i / Ω,  i ∈ N^d_≠.    (8)

This construction has appeared in [4], where Θ follows a Dirichlet process, which can be seen as a normalized gamma process [41]. Note that our definition of Categorical(Θ^(d)) involves simulating from the normalization; we use this definition to avoid introducing new notation for normalized processes. Using either likelihood process, the edges in the network after N rounds are

E_N := (x_i)_{i ∈ N^d_≠},  x_i := Σ_{n=1}^N x_{ni},    (9)

i.e., x_i ∈ N ∪ {0} represents the count of edge i after N rounds.

¹There are also network models based on self-product measure priors that do not generate edge-exchangeable sequences [8, 11]; it is likely that many of the techniques in the present work would extend to these models, but we leave the study of this to future work.
There are three points to note about this formulation. First, since the atom locations ζ_i are not used, we can represent the network using only its array of edge counts E_N = (x_i)_{i ∈ N^d_≠} ∈ N_d, where N_d denotes the set of integer arrays indexed by N^d_≠ with finitely many nonzero entries. Note that N_d is a countable set. Second, by construction, the distribution of E_N is invariant to reorderings of the arrival of edges, and thus the network is edge-exchangeable [3, 5-7]. Finally, note that the network E_N as formulated in Eq. (9) is in general a directed multigraph with no self-loops (due to the restriction to indices i ∈ N^d_≠ rather than N^d). Although the main theoretical results in this work are developed in this setting, we provide an additional result in Section 3.1 to translate to other common network structures (e.g., binary undirected networks).
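To make the generative process concrete, the following Python sketch simulates N rounds of an independent likelihood process over a truncated rate vector, with the Bernoulli choice h(· | ϑ) = Bern(ϑ); this assumes each ϑ_i ≤ 1, as in the beta-Bernoulli model of Section 3.4, and the function name is ours:

```python
import itertools
import numpy as np

def simulate_edge_counts(theta, N, rng):
    """Edge counts x_i after N rounds of an independent Bernoulli likelihood
    process: directed edge (j, k), j != k, appears in each round with
    probability theta[j] * theta[k] (assumed <= 1); N rounds collapse to a
    single binomial draw per edge."""
    counts = {}
    for j, k in itertools.permutations(range(len(theta)), 2):
        counts[(j, k)] = int(rng.binomial(N, theta[j] * theta[k]))
    return counts

E = simulate_edge_counts(np.array([0.9, 0.5, 0.1]), N=100, rng=np.random.default_rng(2))
```

Because the rounds are i.i.d. given Θ, the per-edge counts can be drawn directly as binomials rather than looping over rounds.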

Truncated generative simulation
In this section, we consider the tractable generative simulation of network models via truncation, and analyze the approximation error incurred in doing so as a function of K ∈ N (the truncation level) and the number of rounds of generation N ∈ N. In particular, to construct a truncated self-product measure, we first split the underlying CRM Θ into a truncation component Θ_K := Σ_{k=1}^K θ_k δ_{ψ_k} and a tail component Θ_{K+} := Σ_{k=K+1}^∞ θ_k δ_{ψ_k}, and construct the self-product Θ^(d)_K from the truncation Θ_K as in Eq. (2). Fig. 1 provides an illustration of the truncation of Θ and Θ^(2), showing that Θ^(2) can be decomposed into a sum of four parts: the self-product of the truncation, the two cross products of the truncation with the tail, and the self-product of the tail. Thus, while we only discard Θ_{K+} in truncating Θ to Θ_K, we discard three parts in truncating Θ^(2) to Θ^(2)_K; and in general, we discard 2^d − 1 parts of Θ^(d) when truncating it to Θ^(d)_K. We therefore intuitively might expect higher truncation error when approximating Θ^(d) by Θ^(d)_K than when approximating Θ by Θ_K; in Sections 3.2 and 3.3, we will show that this is indeed the case.
Once the measure is truncated, a truncated network, based on Θ^(d)_K, can be constructed in the same manner as the original network using the independent likelihood process Eq. (6) or the categorical likelihood process Eq. (8). We denote E_{N,K} = (x_{i,K})_{i ∈ N^d_≠} ∈ N_d to be the corresponding edge set of the truncated network up to round N, where x_{i,K} = 0 for any index i ∈ N^d_≠ such that some component i_j > K. We keep E_N and E_{N,K} in the same space in order to compare their distributions in Sections 3.1 to 3.3.
(Fig. 1 caption: An illustration of the difference between truncation of CRMs (d = 1) and self-product CRMs with d = 2. Intuitively, increasing d means that a higher proportion of mass is discarded in the truncation process.)

Truncation error bound
We formulate the approximation error incurred by truncation as the total variation distance between the marginal network distributions of E_N and E_{N,K}. The first step in the analysis of truncation error, provided by Lemma 3.1, is to show that this distance is bounded above by the probability that there are edges in the full network E_N involving vertices beyond the truncation level K. To this end, we denote the maximum vertex index of E_N to be

I_N := max{ i_j : i ∈ N^d_≠, x_i > 0, 1 ≤ j ≤ d },

and note that by definition, I_N ≤ K if and only if all edges in E_N fall in the truncated region.
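In code, the maximum vertex index is immediate from the edge-count array (the helper below is our own, using 1-indexed vertices to match the text):

```python
def max_vertex_index(edge_counts):
    """I_N: the largest (1-indexed) vertex index appearing in an edge with
    nonzero count; 0 for an empty network. edge_counts maps 0-indexed
    multi-indices i to counts x_i."""
    return max((j + 1 for i, x in edge_counts.items() if x > 0 for j in i),
               default=0)

# truncation at level K loses nothing for this realization iff
# max_vertex_index(edge_counts) <= K
```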
Lemma 3.1. Let Θ = Σ_{k=1}^∞ θ_k δ_{ψ_k} be a random discrete measure, and Θ_K = Σ_{k=1}^K θ_k δ_{ψ_k} be its truncation to K atoms. Let Θ^(d) be the self-product of Θ, and Θ^(d)_K be the self-product of Θ_K. Let P_N and P_{N,K} be the marginal distributions of the edge sets E_N, E_{N,K} ∈ N_d under either the independent or categorical likelihood process. Then

‖P_N − P_{N,K}‖_TV ≤ 1 − P(I_N ≤ K).

As mentioned in Section 2.3, the network E_N is in general a directed multigraph with no self-loops. However, Lemma 3.1, and the downstream truncation error bounds presented in Sections 3.2 and 3.3, also apply to any graph Ē_N = (x̄_i)_{i ∈ N^d_≠} that is a function of the original graph, Ē_N = f(E_N), such that truncation commutes with the function, i.e., Ē_{N,K} = f(E_{N,K}). For example, to obtain a truncation error bound for the common setting of undirected binary graphs, we generate the directed multigraph E_N as above and construct the undirected binary graph Ē_N via

x̄_{(j,k)} := min(1, x_{(j,k)} + x_{(k,j)}),  j < k.    (10)

Corollary 3.2 provides the precise statement of the result; note that the bound is identical to that from Lemma 3.1.
imsart-generic ver. 2014/10/16 file: ms.tex date: September 15, 2021
Corollary 3.2. In the setting of Lemma 3.1, let Ē_N ∈ N_d be the set of edges of a derived network with truncation Ē_{N,K} ∈ N_d, and denote their marginal distributions P̄_N and P̄_{N,K}. If there exists a measurable function f such that Ē_N = f(E_N) and Ē_{N,K} = f(E_{N,K}), then

‖P̄_N − P̄_{N,K}‖_TV ≤ 1 − P(I_N ≤ K).

Independent likelihood process
We now specialize Lemma 3.1 to the setting where Θ is a CRM generated by a series representation of the form Eq. (3), and the network is generated via the independent likelihood process from Eq. (6). As a first step towards a bound on the truncation error for general hypergraphs with d > 1 in Theorem 3.4, we present a simpler corollary in the case where d = 2, which is of direct interest in analyzing the truncation error of popular edge-exchangeable networks.
The proof of this result (and Theorem 3.4 below) in Appendix B follows by representing the tail of the CRM as a unit-rate Poisson process on [Γ K , ∞). Though perhaps complicated at first glance, an intuitive interpretation of the truncation error terms B K,i is provided by Fig. 1b. B K,1 corresponds to the truncation error arising from the upper right quadrant, where both vertices participating in the edge were in the discarded tail region. B K,2 is the truncation error arising from the bottom right and upper left quadrants, where one of the two vertices participating in the edge was in the truncation, and the other was in the tail. Finally, B K,3 represents the truncation error arising from edges in which one vertex was at the boundary of tail and truncation, and the other was in the tail.
Theorem 3.4 is the generalization of Corollary 3.3 from d = 2 to the general setting of arbitrary hypergraphs with d > 1. The bound is analogous to that in Corollary 3.3-with B K expressed as a sum of terms, each corresponding to whether vertices were in the tail, boundary, or truncation region-except that there are d > 1 vertices participating in each edge, resulting in more terms in the sum. Note that Theorem 3.4 also guarantees that the bound is not vacuous, and indeed converges to 0 as the truncation level K → ∞ as expected.
Theorem 3.4. In the setting of Lemma 3.1, suppose Θ is a CRM generated from the series representation Eq. (3), edges are generated from the independent likelihood process Eq. (6), and 1 < d ≤ K. Then

1 − P(I_N ≤ K) ≤ B_K := B_{K,1} + B_{K,2} + B_{K,3},    (11)

where, in the definitions of the B_{K,ℓ}, δ_· is the Dirac delta, dγ := Π_{j=1}^d dγ_j, and θ̃ := Π_{j=1}^d τ(U_j, γ_j). Furthermore, lim_{K→∞} B_K = 0. The same geometric intuition from the d = 2-dimensional case applies to the general hypergraph truncation error in Eq. (11). B_{K,1} corresponds to the error arising from the edges whose vertices all belong to the tail region Θ_{K+}. Each term in the summation in B_{K,2} corresponds to the error arising from edges that have ℓ out of d vertices belonging to the truncation Θ_K. Each term in the summation in B_{K,3} corresponds to the error arising from the edges that have ℓ − 1 out of d vertices belonging to the truncation Θ_K and have one vertex exactly on the boundary of the truncation. Note that we obtain Corollary 3.3 by taking d = 2 in Eq. (11).
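Although B_K is an analytic bound, the probability it controls, 1 − P(I_N ≤ K), can also be estimated by brute-force simulation at a truncation level much larger than K, which is useful for sanity-checking a derived bound; the callable interface below is hypothetical:

```python
def estimate_tail_probability(simulate_max_index, K, trials, rng):
    """Monte Carlo estimate of 1 - P(I_N <= K). simulate_max_index(rng) is
    assumed to simulate one network (at a truncation level far beyond K)
    and return its maximum vertex index I_N."""
    exceed = sum(simulate_max_index(rng) > K for _ in range(trials))
    return exceed / trials
```

Comparing such an estimate against B_K at several values of K gives a quick empirical check that the analytic bound is not violated.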

Categorical likelihood process
We may also specialize Lemma 3.1 to the setting where the network is generated via the single-edge-per-round categorical likelihood process in Eq. (8). However, truncation with the categorical likelihood process poses a few key challenges. From a practical angle, certain choices of series representation for generating Θ may be problematic. For instance, when using the rejection representation Eq. (5) in the typical case where µ ≠ ν, there is a nonzero probability that fewer than d of the rates θ_1, ..., θ_K are nonzero, meaning there are not enough accepted vertices in the truncation to generate a single edge. In this case, the categorical likelihood process, which must generate exactly one edge per round, is ill-defined. An additional theoretical challenge arises from the normalization of the original and truncated networks in Eq. (8), which prevents the use of the usual theoretical tools for analyzing CRMs.
Fortunately, the inverse Lévy representation provides an avenue to address both issues. The rates θ k are all guaranteed to be nonzero-meaning as long as K ≥ d, the categorical likelihood process is well-defined-and are decreasing, which enables our theoretical analysis in Appendix B.1. However, as mentioned earlier, the inverse Lévy representation is well-known to be intractable to use in most practical settings. Theorem 3.5 provides a solution: we use the rejection representation to simulate the rates θ k , but instead of simulating for iterations k = 1, . . . , K, we simulate until we obtain K nonzero rates. This is no longer a sample of a truncated rejection representation; but Theorem 3.5 shows that the first K nonzero rates have the same distribution as simulating K iterations of the inverse Lévy representation. Therefore, we can tailor the analysis of truncation error for the categorical likelihood process in Theorem 3.6 to the inverse Lévy representation, and simulate its truncation for any K using the tractable rejection representation in practice.
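The practical recipe suggested by this result (run the rejection representation until K nonzero rates are accepted) is a small modification of a rejection sampler. The sketch below makes the same illustrative assumption as earlier examples, namely ν(dθ) = γλ θ^{-1}(1-θ)^{λ-1} dθ with proposal µ(dθ) = γλ θ^{-1} dθ on (0, 1) and λ ≥ 1, so that µ^←(x) = exp(-x/(γλ)):

```python
import numpy as np

def first_K_nonzero_rates(K, gamma, lam, rng):
    """Run a rejection representation until K nonzero rates are accepted.
    Illustrative assumption: nu(dtheta) = gamma*lam*theta^{-1}(1-theta)^{lam-1},
    proposal mu(dtheta) = gamma*lam*theta^{-1} on (0, 1), lam >= 1, so that
    mu^{<-}(x) = exp(-x/(gamma*lam)). By the equivalence result, the output is
    distributed as K steps of the inverse Levy representation."""
    rates, G = [], 0.0
    while len(rates) < K:
        G += rng.exponential()                      # next unit-rate PP jump
        prop = np.exp(-G / (gamma * lam))
        if rng.uniform() <= (1.0 - prop) ** (lam - 1.0):
            rates.append(prop)
    return np.array(rates)
```

Note that the loop runs for a random number of iterations, but the accepted rates inherit the decreasing order of the proposals.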
Theorem 3.5. Let θ_1, ..., θ_K be the first K rates from the inverse Lévy representation of a CRM, and let θ̃_1, ..., θ̃_K be the first K nonzero rates from any rejection representation of the same CRM. Then

(θ̃_1, ..., θ̃_K) =_d (θ_1, ..., θ_K).

Theorem 3.6. In the setting of Lemma 3.1, suppose Θ is a CRM generated from the inverse Lévy representation Eq. (4), edges are generated from the categorical likelihood process Eq. (8), and 1 < d ≤ K. Then 1 − P(I_N ≤ K) ≤ B_K for a bound B_K depending on ν, h, N, and K, and lim_{K→∞} B_K = 0.

Examples
We now apply the results of this section to binary undirected networks (d = 2) constructed via Eq. (10) from common edge-exchangeable networks [3]. We derive the convergence rate of truncation error, and provide simulations of the expected number of edges and vertices under the infinite and truncated network.
In each simulation we use the rejection representation to simulate Θ, and run N = 10, 000 steps of network construction.
Beta-Bernoulli process network. Suppose Θ is generated from a beta process, and the network E_N is generated using the independent Bernoulli likelihood process [3]. The beta process BP(γ, λ, α) [42] with discount parameter α ∈ [0, 1), concentration parameter λ > −α, and mass parameter γ > 0 is a CRM with rate measure

ν(dθ) = γ [Γ(1 + λ) / (Γ(1 − α) Γ(λ + α))] θ^{−1−α} (1 − θ)^{λ+α−1} 1[0 ≤ θ ≤ 1] dθ.    (12)

The Bernoulli likelihood has the form

h(x | θ) = θ^x (1 − θ)^{1−x},  x ∈ {0, 1}.    (13)

To simulate the process, we use a proposal rate measure µ chosen such that dν/dµ ≤ 1 uniformly and µ^← is available in closed form. Dense network: When α = 0, the binary beta-Bernoulli graph is dense, and the rejection representation Eq. (5) of BP(γ, λ, 0) produces rates that decay geometrically in k. This implies that the truncation error of the dense binary beta-Bernoulli network converges to 0 geometrically in K. Fig. 2a corroborates this result in simulation with λ = 2 and γ = 1; it can be seen that for the dense beta-Bernoulli graph, truncated graphs with relatively low truncation level (in this case, K ≈ 50) approximate the true network model well.
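A small simulation in the spirit of this experiment counts how many of the first K vertices end up with nonzero degree after N Bernoulli rounds; this is an illustrative sketch of the quantity plotted, not the exact experimental pipeline, and the function name is ours:

```python
import itertools
import numpy as np

def n_active_vertices(theta, N, rng):
    """Simulate N rounds of the Bernoulli likelihood (directed edge (j, k)
    present in each round w.p. theta[j]*theta[k]) and count the vertices
    that participate in at least one edge."""
    deg = np.zeros(len(theta), dtype=int)
    for j, k in itertools.permutations(range(len(theta)), 2):
        if rng.binomial(N, theta[j] * theta[k]) > 0:
            deg[j] += 1
            deg[k] += 1
    return int((deg > 0).sum())
```

Running this for increasing truncation levels K (with rates drawn from a series representation) traces out how quickly the truncated network stops losing vertices.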
Sparse network: When α ∈ (0, 1), the binary beta-Bernoulli graph is sparse, and the rejection representation Eq. (5) of BP(γ, λ, α) produces rates that decay only polynomially in k. The resulting bound suggests that the truncation error for the sparse binary beta-Bernoulli network converges to 0 much more slowly than for the dense graph. Fig. 2b corroborates this result in simulation with λ = 2, γ = 1, and α = 0.6; it can be seen that for the sparse beta-Bernoulli graph, truncated graphs behave significantly differently from the true graph for moderate truncation levels. In practice, one should select an appropriately large value of K using our error bounds as guidance, and use sparse data structures to avoid undue memory burden.
Gamma-independent Poisson network. Next, consider the network with Θ generated from a gamma process, and the network E_N generated using the independent Poisson likelihood process. The gamma process ΓP(γ, λ, α) [43] with discount parameter α ∈ [0, 1), scale parameter λ > 0, and mass parameter γ > 0 has rate measure

ν(dθ) = γ [λ^{1−α} / Γ(1 − α)] θ^{−1−α} e^{−λθ} dθ.

The Poisson likelihood has the form

h(x | θ) = θ^x e^{−θ} / x!,  x ∈ N ∪ {0}.

Dense network: When α = 0, the gamma-Poisson graph is dense, and we choose a proposal measure µ for the rejection representation in Eq. (5) with dν/dµ ≤ 1 and closed-form µ^←. Again, for the dense network B_K converges to 0 geometrically, indicating that truncation may provide reasonable approximations to the original network. Fig. 3a corroborates this result in simulation with λ = 2 and γ = 1; for K ≈ 50, no vertices are discarded on average by truncation.
Sparse network: When α ∈ (0, 1), the gamma-Poisson graph is sparse, and we again choose a proposal measure µ with dν/dµ ≤ 1 and closed-form µ^← for the rejection representation in Eq. (5). For the sparse network B_K converges to 0 slowly, suggesting that the truncation error for the sparse binary gamma-Poisson graph converges more slowly than for the dense graph. Fig. 3b corroborates this result in simulation with λ = 2, γ = 1, and α = 0.6; for a moderate range of truncation levels K ≤ 100, the truncated graph behaves very differently from the true graph. Therefore, in practice, one should follow the above guidance for the beta-Bernoulli network: select a value of K using our error bounds, and avoid intractable memory requirements by using sparse data structures.

Truncated posterior inference
In this section, we develop a tractable approximate posterior inference method for network models via truncation, and analyze its approximation error. In particular, given an observed network E_N, we want to simulate from the posterior distribution of the CRM rates and the parameters of the CRM rate measure. Since exact expressions for the full conditional densities are not available, we truncate the model and run an approximate Markov chain Monte Carlo algorithm. We provide a rigorous theoretical justification for this simple approach by establishing a bound on the total variation distance between the truncated and exact posterior distributions. This in turn provides a method to select the truncation level in a principled manner. Although this section focuses on self-product-CRM-based network models, the method and theory we develop are general and extend easily to other CRM-based models, e.g., for clustering [44], feature allocation [45], and trait allocation [6]. In particular, the methodology requires only bounds on tail occupancy probabilities (e.g., in the present context, the probability that I_N ≤ K) rather than the exact evaluation of these quantities.

Truncation error bound
We begin by examining the density of the posterior distribution of the K-th rate θ_K from the inverse Lévy representation for some fixed K ∈ N, the unordered collection of rates (θ_k)_{k=1}^{K−1} such that θ_k ≥ θ_K, and the parameters σ ∈ R^m of the CRM rate measure ν_σ, given the observed set of edges E_N. We denote ν_σ(θ) to be the density of ν_σ(dθ) and p(σ) to be the prior density of σ, both with respect to the Lebesgue measure. Given these definitions, the posterior density can be expressed as a product of factors. The first group of factors corresponds to the prior of σ and (θ_k)_{k=1}^K and to the likelihood of the portion of the network involving only vertices 1, ..., K; these are straightforward to evaluate, though ν_σ([θ_K, ∞)) will occasionally require a one-dimensional numerical integration. The last factor, p(x^{K+}_{1:N} | θ_{1:K}, σ), corresponds to the likelihood of the portion of the network involving vertices beyond K, and is not generally possible to evaluate exactly.
To handle this term, suppose that K is large enough that x^{K+}_{1:N} = 0, i.e., no observed edge involves a vertex of index greater than K. Then p(x^{K+}_{1:N} | θ_{1:K}, σ) = P(I_N ≤ K | Γ_{1:K}, σ), the probability that no portion of the network involves vertices of index greater than K. Theorem 4.1 provides upper and lower bounds on this probability akin to those of Theorem 3.4 (indeed, Theorem 4.1 is an intermediate step in the proof of Theorem 3.4) that apply conditionally on the values of U_{1:K}, Γ_{1:K} rather than marginally. This theorem also makes the dependence of the bound on the rate measure parameters σ notationally explicit via the parametrized series representation components τ_σ and g_σ from Eq. (3). Finally, though Theorem 4.1 applies to general series representations, we require only the specific instantiation for the inverse Lévy representation in the present context.
Since U_{1:K} is unused in the inverse Lévy representation, in the present context we use the notation B(Γ_{1:K}, σ) for brevity. The bound in Theorem 4.1 implies that, as long as K is set large enough that both x^{K+}_{1:N} = 0 and B(Γ_{1:K}, σ) is small, the intractable factor P(I_N ≤ K | Γ_{1:K}, σ) is close to 1. Therefore, as K increases, this term should become ≈ 1 and have a negligible effect on the posterior density. We use this intuition to propose a truncated Metropolis-Hastings algorithm that sets K > I_N, ignores the p(x^{K+}_{1:N} | θ_{1:K}, σ) term in the acceptance ratio, and fixes x^{K+}_{1:N} to 0. Theorem 4.2 provides a rigorous analysis of the error involved in using the truncated sampler.

Adaptive truncated Metropolis-Hastings
Crucially, as long as one can obtain samples from the truncated posterior distribution, one can estimate the bound in Eq. (15) using sample estimates of tail probabilities of the form Π{B(Γ_{1:K}, σ) ≤ ε/N} ≥ 1 − η. Therefore, one can compute the bound Eq. (15) without needing to evaluate p(x^{K+}_{1:N} | θ_{1:K}, σ) exactly. This suggests the following adaptive truncation procedure:
1. Pick a value of K > I_N and a desired total variation error ξ ∈ (0, 1).
2. Obtain samples from the truncated posterior.
3. Use the samples to compute the total variation error guarantee Eq. (15); if it exceeds ξ, increase K and return to step 2.
In this work, we initialize K to I_N + 1. In order to decide how much to increase K by in each iteration, we use the sequential representation to extrapolate the total variation bound Eq. (15) to larger values of K without actually performing MCMC sampling. In particular, for each posterior sample, we use its hyperparameters to generate additional rates from the sequential representation. We then use these extended samples to compute the total variation error guarantee as per step 3 above. We continue to generate additional rates (typically doubling the number each time) until the predicted total variation guarantee is below the desired threshold. Finally, we use linear interpolation between the last two predicted errors to find the next truncation level K that matches the desired (log) error threshold. This fixes a new value of K; at this point, we return to step 2 above and iterate.
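The overall control flow of this adaptive procedure can be sketched as follows; `run_mcmc` and `error_bound` are hypothetical placeholders for the model-specific sampler and the Eq. (15) estimate, and the simple doubling rule stands in for the extrapolation-plus-interpolation scheme described above:

```python
def adaptive_truncation(run_mcmc, error_bound, K0, xi, max_rounds=20):
    """Grow the truncation level K until the estimated total variation
    guarantee falls below xi. run_mcmc(K) -> posterior samples at level K;
    error_bound(samples, K) -> estimated TV bound from those samples."""
    K = K0
    for _ in range(max_rounds):
        samples = run_mcmc(K)
        if error_bound(samples, K) <= xi:
            return K, samples
        K *= 2  # simple stand-in for the predictive choice of the next K
    raise RuntimeError("desired truncation error not reached")
```

The key property preserved by the sketch is that the error estimate is recomputed from fresh samples at each level, so the final returned samples come with their own certified bound.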

Experiments
In this section, we examine the properties of the proposed adaptive truncated inference algorithm for the beta-independent Bernoulli network model in Eqs. (12) and (13) with discount α ∈ (0, 1), concentration λ > 1, mass γ > 0, unordered collection of rates θ_{1:K−1}, and K-th rate θ_K from the sequential representation. In order to simplify inference, we transform each of these parameters to an unconstrained version (denoted with a u subscript). We use a Markov chain Monte Carlo algorithm that includes an exact Gibbs sampling move for γ, and separate Gaussian random-walk Metropolis-Hastings moves for α_u, λ_u, θ_{u,K}, all θ_{u,k} such that vertex k has degree 0 (jointly), and each θ_{u,k} such that vertex k has nonzero degree (individually).
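One concrete choice of unconstrained reparameterization consistent with the stated constraints (α ∈ (0, 1), λ > 1, rates in (0, 1)) is sketched below; the specific maps are our own assumption, since any smooth bijection to the real line would serve:

```python
import numpy as np

def to_unconstrained(alpha, lam, theta):
    """Map (alpha, lam, theta) to unconstrained space: logit for alpha in (0,1),
    shifted log for lam > 1, and elementwise logit for rates theta in (0,1)."""
    logit = lambda p: np.log(p) - np.log1p(-p)
    return logit(alpha), np.log(lam - 1.0), logit(np.asarray(theta))

def from_unconstrained(alpha_u, lam_u, theta_u):
    """Inverse maps; an actual MCMC implementation must also include the
    log-Jacobian of these transforms in the target density."""
    expit = lambda x: 1.0 / (1.0 + np.exp(-x))
    return expit(alpha_u), np.exp(lam_u) + 1.0, expit(np.asarray(theta_u))
```

Working in the unconstrained space lets the Gaussian random-walk proposals roam freely without ever proposing an invalid parameter value.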

Synthetic data
We first apply the model to synthetic data simulated from the generative model. We simulate a sparse network with parameters λ = 2, γ = 1, α = 0.2, and N = 10^5, and a dense network with λ = 2, γ = 1, α = 0, and N = 10^7. In both settings we set the truncation level for data generation to 500, set the desired total variation bound from Eq. (15) to 0.01, and initialize the sampler with α = 0.4, λ = 5, γ = 2, and θ generated from the sequential representation. All Metropolis-Hastings moves have proposal standard deviation 0.1 except the sparse network α_u move, which has standard deviation 0.03. Figs. 4 and 5 show histograms of 5,000 marginal posterior samples of the hyperparameters for the dense (true α = 0) and sparse (true α = 0.2) networks. In both cases, the approximate posterior in the first round of adaptation (orange histogram) does not concentrate on the true hyperparameter values, despite the relatively large number of generative rounds N. Fig. 6, which displays the truncation error and predictive adaptation procedure, shows why this is the case. In both networks, the first adaptation iteration identifies a large truncation error. After a single round of adaptation, the approximate posterior distributions (blue histograms) in Figs. 4 and 5 concentrate more on the true values as expected, and the truncation errors fall well below the desired threshold (log_10(0.01) = −2). It is worth noting that the predictive extrapolation appears to be quite conservative in these examples, especially in the dense network. This happens because the approximate posterior for the dense network (respectively, sparse network) assigns mass to higher values of α (respectively, γ) than it should, which results in larger truncation error and thus a larger predicted required truncation level.
(Fig. 6 caption: Grey dashed lines and circles visualize the predicted truncation error using rates generated from the sequential representation. The vertical dotted line shows that the algorithm selects a value of K that attempts to match the desired log_10 error threshold of −2 using the predictions.)
Real network data. Next, we apply the model to a Facebook-like social network [46]. The original source network contains a sequence of 61,734 weighted, time-stamped edges among 1,899 vertices. We preprocess the data by removing the edge weights, binning the edge sequence into 30-minute intervals, and removing the initial transient of network growth, resulting in 1,899 vertices and 10,435 edges over N = 6,427 rounds of generation. We again set a desired total variation error guarantee of 0.01 during inference. All Metropolis-Hastings moves have proposal standard deviation 0.1 except the α_u and degree-0 θ_u moves, which have standard deviations 0.04 and 0.03, respectively. We initialize the sampler to α = 0.1, γ = 2, λ = 20, and sample the rates θ from the prior sequential representation. Fig. 7 shows the posterior marginal histograms for the hyperparameters α, λ, γ in both the first iteration (orange) and the second and final iteration (blue) of truncation adaptation. The posterior distribution suggests that the network is dense (i.e., α ≈ 0). This conclusion is supported both by the close match of 100 samples from the posterior predictive distribution, shown in Figs. 8a and 8b, and by the findings of past work using this data [8]. Further, as in the synthetic examples, the truncation adaptation terminates after two iterations; but in this case the histograms do not change much between the two. This is essentially because the truncation error in the first iteration is already relatively low (≈ 0.02), leading to a reasonably accurate truncated posterior and hence accurate predictions of the truncation error at higher truncation levels.

Conclusion
In this paper, we developed methods for tractable generative simulation and posterior inference in statistical models with discrete nonparametric priors via finite truncation. We demonstrated that these approximate truncation-based approaches are sound via theoretical error analysis. In the process, we also showed that the nonzero rates of the (tractable) rejection representation of a Poisson process have the same distribution as the rates of the (typically intractable) inverse Lévy representation, yielding a new practical method for simulating the latter.

Appendix A: Equivalence between nonzero rates from a rejection representation and the inverse Lévy representation
Proof of Theorem 3.5. Let T_{k_1} denote the first nonzero element generated from the rejection representation in Eq. (5), and correspondingly let Γ_{k_1} denote the jump of the unit-rate homogeneous Poisson process on R_+ such that T_{k_1} = µ←(Γ_{k_1}), where µ is the proposal measure in the rejection representation. Let f be a bounded continuous function, and recall that, conditioned on the jump Γ_j, the preceding jumps are distributed as the order statistics of i.i.d. Unif(0, Γ_j) random variables. Using the change of variable y = µ←(u), we obtain an expression for E[f(T_{k_1})]; the same change of variable trick applies to each term.

imsart-generic ver. 2014/10/16 file: ms.tex date: September 15, 2021

Suppose that θ_1 is the first rate generated using the inverse Lévy representation. Then, making the change of variable y = ν←(γ), we obtain the same expression. Therefore, the first nonzero rate T_{k_1} from the rejection representation has the same marginal distribution as the first rate θ_1 from the inverse Lévy representation. We now employ an inductive argument. Suppose we have shown that the first M nonzero elements Ξ_M := (T_{k_1}, T_{k_2}, …, T_{k_M}) from the rejection representation have the same marginal distribution as the first M elements Θ_M := (θ_1, …, θ_M) from the inverse Lévy representation. To prove the same for M + 1 elements, it suffices to show that the conditional distribution of T_{k_{M+1}} given Ξ_M coincides with that of θ_{M+1} given Θ_M. Using steps similar to the base case, and making the change of variable y = µ←(Γ_{k_M} + u) as before, we obtain the corresponding expression in terms of ν.
Making another change of variables y = µ←(Γ_{k_M} + γ) in the original integral yields a matching expression; on the other hand, the inverse Lévy representation gives the same conditional distribution, completing the induction.

Conditioned on I_N ≤ K, f(x|Θ) = f_K(x|Θ) under both the independent and the categorical likelihood; the claim then follows by Fubini's theorem.

Proof of Theorems 3.4 and 4.1. Let I_{ℓ,K} denote the set of indices i whose first ℓ elements belong to the truncation and whose remaining d − ℓ elements belong to the tail. By Jensen's inequality, we obtain a bound; this arises by noting that I_N ≤ K if and only if, for all i involving an index i_j > K, the count of edge i is 0 after N rounds. The factor d accounts for the fact that ϑ_i = ∏_{j=1}^d θ_{i_j} is independent of the ordering of the i_j. Consider a single term Σ_{i∈I_{ℓ,K}} log π(ϑ_i) in the above sum. Since we are conditioning on U_{1:K}, Γ_{1:K}, the rates θ_{i_1}, …, θ_{i_ℓ} are fixed in the expectation, and the remaining jumps Γ_{K+1}, Γ_{K+2}, … are the ordered jumps of a unit-rate homogeneous Poisson process on [Γ_K, ∞). By the marking property of the Poisson process [47], conditioned on Γ_K, the marks are i.i.d. ∼ g. Substituting this expression into Eq. (16) yields the result of Theorem 4.1. Next, we consider the bound on the marginal probability P(I_N ≤ K). By Jensen's inequality applied to Eq. (16), and following the previous derivation, we obtain a corresponding bound. Using the fact that, conditioned on Γ_K, the jumps Γ_{1:K−1} are uniformly distributed on [0, Γ_K], and that at most one i_j can be equal to K, the first and second terms of the bound arise from the portions of the sum where all indices satisfy i_j ≠ K and where exactly one index satisfies i_j = K, respectively. To complete the result, we study the asymptotics of the marginal probability bound. It follows from Eq. (7) that I_N < ∞ almost surely.
Therefore, P(I_N ≤ K) ↑ 1 as K → ∞. It then follows from [28, Lemma B.1] and the continuous mapping theorem that the bound converges to its limit; since this sequence is monotonically increasing in K, the stated result follows.
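The equivalence between the nonzero rates of the rejection representation and the rates of the inverse Lévy representation (Theorem 3.5) can be checked empirically in a toy setting. The sketch below is ours and purely illustrative, not the paper's beta-process setting: it assumes a finite-activity target Lévy measure ν(dx) = e^{−x} dx with proposal µ(dx) = 2e^{−x} dx, so the acceptance probability dν/dµ is 1/2 and both tail inverses are available in closed form, µ←(u) = log(2/u) for u ≤ 2 and ν←(γ) = −log γ for γ ≤ 1. The first nonzero rate under both schemes should then agree in distribution.

```python
import math
import random

random.seed(1)

def rejection_rates(rng):
    """Nonzero rates from the rejection representation for the toy measures above."""
    rates, gamma = [], 0.0
    while True:
        gamma += rng.expovariate(1.0)   # next jump of a unit-rate Poisson process
        if gamma > 2.0:                 # mu's total mass is 2: no proposals beyond this
            return rates
        if rng.random() < 0.5:          # accept with probability dnu/dmu = 1/2
            rates.append(math.log(2.0 / gamma))

def inverse_levy_rates(rng):
    """Rates theta_k = nu_inverse(Gamma_k) from the inverse Levy representation."""
    rates, gamma = [], 0.0
    while True:
        gamma += rng.expovariate(1.0)
        if gamma > 1.0:                 # nu's total mass is 1
            return rates
        rates.append(-math.log(gamma))

# Compare the empirical distribution of the first nonzero rate under both schemes.
n = 20000
first_rej, first_inv = [], []
for _ in range(n):
    r = rejection_rates(random)
    if r:
        first_rej.append(r[0])
    s = inverse_levy_rates(random)
    if s:
        first_inv.append(s[0])

mean_rej = sum(first_rej) / len(first_rej)
mean_inv = sum(first_inv) / len(first_inv)
print(mean_rej, mean_inv)  # the two sample means should be close
```

A short calculation confirms the match in this toy case: under both schemes the first rate θ satisfies P(θ > x) = 1 − exp(−e^{−x}), consistent with the theorem.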

B.1. Proof of Theorem 3.6
We first state a useful result: if one perturbs the log-probabilities of a countable discrete distribution by i.i.d. Gumbel(0, 1) random variables, the arg max of the resulting sequence is a sample from that distribution.
Lemma B.1 ([28, Lemma 5.2]). Let (p_j)_{j=1}^∞ be a countable collection of positive numbers such that Σ_j p_j < ∞, and let p̄_j = p_j / Σ_k p_k. If (W_j)_{j=1}^∞ are i.i.d. Gumbel(0, 1) random variables, then argmax_{j∈N} (W_j + log p_j) exists, is unique a.s., and has distribution P(argmax_{j∈N} (W_j + log p_j) = j) = p̄_j.

Proof of Theorem 3.6. Since the N edges from the categorical likelihood process are i.i.d. categorical random variables, by Jensen's inequality we obtain an initial bound. Next, since ϑ_i = ∏_{j=1}^d θ_{i_j}, we can simulate a categorical variable with probabilities proportional to ϑ_i, i ∈ N^d_≠, by sampling d indices (J_1, …, J_d) independently from a categorical distribution with probabilities proportional to (θ_1, θ_2, …) and discarding samples where J_j = J_k for some 1 ≤ j < k ≤ d. Let θ̄_k = θ_k / Σ_j θ_j denote the normalized rates, define P_{J,K} := Σ_{j=J}^K θ̄_j, and let Q := {J_j ≠ J_k, ∀ 1 ≤ j < k ≤ d}. Since the normalized rates θ̄_k are generated from the inverse Lévy representation, they are monotone decreasing; therefore, by Jensen's inequality, the bound follows. Note that for a categorical random variable J with class probabilities θ̄_j/(1 − P_{1,d−1}), j ≥ d, the quantity P_{d,K}/(1 − P_{1,d−1}) is the probability that d ≤ J ≤ K. So by the infinite Gumbel-max sampling lemma, this probability can be expressed via an argmax, where we can replace θ̄_j with the unnormalized θ_j because the normalization does not affect the argmax. The remainder of the proof focuses on the conditional expectation given Γ_K. The cumulative distribution function and probability density function of the Gumbel(0, 1) distribution are F(t) = exp(−e^{−t}) and f(t) = e^{−t} exp(−e^{−t}), respectively. Denote Q(u, t) = exp(−ν←(u) e^{−t}). Conditioned on Γ_K, the tail Γ_{K+1}, Γ_{K+2}, … is a unit-rate homogeneous Poisson process on [Γ_K, ∞) that is independent of Γ_1, …, Γ_{K−1}. So the conditional expectation can be written in terms of the jumps Γ_k of a unit-rate homogeneous Poisson process on R_+.
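Lemma B.1 is the (infinite) Gumbel-max trick. A minimal finite-support demonstration, illustrative only: perturb each log-weight with independent Gumbel(0, 1) noise (generated as −log(−log U) for U ∼ Unif(0, 1)) and take the argmax; the resulting index frequencies recover the normalized weights, and normalization indeed does not affect the argmax.

```python
import math
import random
from collections import Counter

random.seed(0)

def gumbel_max_sample(log_weights, rng=random):
    """Sample an index j with probability proportional to exp(log_weights[j])
    by perturbing each log-weight with Gumbel(0, 1) noise and taking the argmax."""
    perturbed = [lw - math.log(-math.log(rng.random())) for lw in log_weights]
    return max(range(len(perturbed)), key=perturbed.__getitem__)

# Unnormalized weights (2, 1, 1): normalized probabilities are (0.5, 0.25, 0.25).
log_w = [math.log(w) for w in (2.0, 1.0, 1.0)]
counts = Counter(gumbel_max_sample(log_w) for _ in range(20000))
freqs = [counts[j] / 20000 for j in range(3)]
print(freqs)  # approximately [0.5, 0.25, 0.25]
```

The proof above applies the countable version of this identity to the normalized rates, which is why the unnormalized θ_j can be used directly in the argmax.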
Since (Γ_k) is a Poisson process, log ν←(Γ_K + Γ_k) + W_k is also a Poisson process with a transformed rate measure. The quantity of interest is the probability that no atom of this process is greater than x; for a Poisson process with rate measure µ(dt), this probability is exp(−∫_x^∞ µ(dt)), where the second equality comes from the fact that the inner integrand is the probability density function of a Gumbel distribution. Finally, for the categorical variable J with class probabilities θ̄_j/(1 − P_{1,d−1}), j ≥ d, it holds that P(d ≤ J ≤ K) ↑ 1 as K → ∞, and the result follows by the monotone convergence theorem.

Appendix C: Error bounds for edge-exchangeable networks

C.2. Beta-independent Bernoulli process network
Dense network When α = 0, substituting ν(dx), µ(dx), and π(θ) into Eq. (17), and noting that the integrand is symmetric about the line x_1 = x_2, we obtain an expression for the error. For any a ∈ (0, 1), we divide the integral into two parts and bound each part separately. Assume for the moment that λ ≠ 0.5. Use the fact that F_K(t) ≤ (3t/K)^K, and note that there exists a constant c > 1 such that 3γ log c = 1/c. Setting a = c^{−K} and using a first-order Taylor expansion to approximate the first term, it can be seen that the bound holds; therefore, there exists K_0 such that the bound holds for all K > K_0. If λ = 0.5, the integral can be bounded similarly by choosing the same c and setting a = c^{−K}; therefore we can again find a constant K_0 such that the bound holds for K > K_0. Next, the term B_{K,2} is bounded via the same argument: it has exactly the same form as B_{K,1}, except that the CDF of Γ_K is replaced with that of Γ_{K−1}. Therefore, it can be shown that the analogous bound holds for large K. Finally, we turn to B_{K,3}, splitting the analysis into two cases. In the first case, assuming λ ≥ 2, a direct bound follows. On the other hand, if λ < 2, we bound the integral over [0, γ] and over [γ, ∞) separately. Since 1 − e^{−x} ≥ x/2 for x ∈ [0, 1], the first piece is controlled; as K → ∞, the second term dominates the first, so when λ − 2 < 0, the corresponding inequality holds for large K. As K → ∞, B_{K,3} dominates B_{K,1} and B_{K,2}, so there exists a K_0 ∈ N such that the overall bound holds for K > K_0.

Sparse network When α > 0, we proceed as in the case α = 0 to bound B_{K,1}. Denoting t = γα^{−1}x^{−α} and applying Stirling's formula, we obtain the bound. Now consider the error bound for B_{K,2}: as shown in the example with α = 0, the same argument yields an analogous bound. Similar to B_{K,1} and B_{K,2}, we bound B_{K,3}, again splitting the analysis into two cases. First, suppose that λ + α − 2 ≥ 0; then, since α < 1, the bound follows, where the last equality is obtained from Stirling's formula.
So in this case we obtain the stated bound for large K. Asymptotically, B_{K,2} dominates B_{K,1} and B_{K,3}, so there exists K_0 ∈ N such that the overall bound holds when K > K_0.
Note that the integrand with respect to x_2 is the density function of the gamma distribution with shape α and rate λ, so the integral is less than 1. We partition the outer integral into two parts and bound them separately; setting the two terms in the brackets equal determines the choice of split point. As in the previous example with α = 0, the resulting term has the same form as B_{K,1}, and therefore admits the analogous bound. Finally, since both e^{−λx} < 1 and e^{−λµ←(Γ_K)} < 1, the last term is bounded directly.
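Two elementary inequalities are used repeatedly in the bounds above: the tail bound F_K(t) ≤ (3t/K)^K for the CDF of Γ_K (the K-th jump of a unit-rate Poisson process, i.e., a sum of K i.i.d. Exp(1) variables, so F_K is the Erlang(K, 1) CDF), and 1 − e^{−x} ≥ x/2 on [0, 1]. Both can be spot-checked numerically; the sketch below is ours and is not part of the proof.

```python
import math

def erlang_cdf(K, t):
    """F_K(t) = P(Gamma_K <= t), where Gamma_K is a sum of K i.i.d. Exp(1) jumps."""
    return 1.0 - math.exp(-t) * sum(t**j / math.factorial(j) for j in range(K))

# Check F_K(t) <= (3t/K)^K over a grid of K and t.
# The tiny slack absorbs floating-point cancellation when F_K(t) is near zero.
ok_cdf = all(
    erlang_cdf(K, t) <= (3.0 * t / K) ** K + 1e-9
    for K in (1, 2, 5, 10, 20)
    for t in (0.1, 0.5, 1.0, 2.0, 5.0, 10.0)
)

# Check 1 - e^{-x} >= x/2 on a grid covering [0, 1].
ok_exp = all(1.0 - math.exp(-x) >= x / 2.0 for x in (i / 100.0 for i in range(101)))

print(ok_cdf, ok_exp)
```

The first inequality is a Chernoff-type bound: P(Γ_K ≤ t) ≤ (t/K)^K e^{K−t} ≤ (et/K)^K ≤ (3t/K)^K, which is why the grid check passes for all t, not just small ones.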