Bootstrap estimators for the tail-index and for the count statistics of graphex processes

Abstract: Graphex processes resolve some pathologies of traditional random graph models, notably by providing models that are both projective and allow sparsity. Most of the literature on graphex processes studies them from a probabilistic point of view. Techniques for inferring the parameter of these processes – the so-called graphon – are still marginal; the exceptions are a few papers considering parametric families of graphons. Nonparametric estimation remains unconsidered. In this paper, we propose estimators for a selected choice of functionals of the graphon. Our estimators originate from the subsampling theory for graphex processes, and hence can be seen as a form of bootstrap procedure.


Introduction
Many statistical models for relational data are based on vertex-exchangeable random graphs (also called graphon models); see [45] for a review. These models are interpretable and easy to work with, but have the considerable limitation that, almost surely, they generate dense graphs; that is, the number of edges scales quadratically with the number of vertices as the graph grows. Folk wisdom holds that real-world datasets are sparse and that their degree distributions are power-law, implying that graphon-based approaches are unsatisfactory. Several model classes have been proposed to address this issue. A non-exhaustive list includes edge-exchangeable models [17], preferential attachment models [4], sparse graphon models [6,7], and graphex processes [13,52,8].
Graphex processes have been proposed as a generalization of graphon models that preserves their key properties while allowing for sparsity and degree distributions with power-law behaviour [13,52,14]. Thus, graphex processes are good candidates for modeling empirical graphs. Apart from [13,51,53,29], inference on graphex processes is still marginal in the literature, most papers studying graphex processes from a probabilistic point of view only. Efficient estimation remains challenging for these new models. The present paper intends to contribute to the study of graphex processes from a statistical viewpoint, driven by two main motivations described in section 2.
On a high level, a graphex process {G t : t ≥ 0} is a graph-valued stochastic process. Such processes are used as probabilistic models for graph observations, which are seen as the realization of a sample G t of the process at a finite but unknown size t. All graphs will be simple, and will always be considered to be finite, except on rare occasions, in which case they will be clearly stated to be infinite. The law of the process {G t : t ≥ 0} is determined by a jointly measurable symmetric function W : R 2 + → [0, 1], called the graphon [13,52]. Our goal is to do nonparametric inference on some functionals of W . The functionals we consider are the tail-index of the process (see section 4) and the rescaled injective homomorphism densities (see section 5). The motivations for these functionals are detailed in section 2, and general definitions about graphex processes are recalled in section 3. Along the way, we use our estimators to construct statistical tests for certain counts of motifs. We propose some applications of our results in section 6. All the estimators we propose are constructed using the invariance of graphex processes under the p-sampling mechanism uncovered by [53]. In particular, we use the scale invariance of certain statistics of G t to estimate the functionals of interest using p-samples of G t . For this reason, our estimators may be viewed as a form of bootstrap estimators [24,5].

Sparsity and power-law
Since graphex processes intend to improve on graphon processes in terms of sparsity and power-law behaviour, these notions need to be clarified. Sparsity is usually understood as a property of a sequence of graphs, i.e. as the growth of the number of edges e(G t ) in terms of the number of vertices v(G t ). In many data analysis situations we have just a single observation at some particular size. This begs the question: does sparsity have practical relevance in this case? Regarding the power-law behaviour of the degree distribution, there is a long historical debate about whether or not this is desirable for modeling empirical graphs. We come back to this controversy in section 4.4. In any case, it is fair to ask whether it makes sense to account for sparsity and power-law in modeling. The main issue is that we often think about sparsity and power-law as properties of the sample G t rather than properties of the population. This is misleading since, without modeling assumptions, there is no reason that the sparsity and degree distribution of G t inform us about the population.
Regarding graphex processes, both sparsity and power-law behaviour arise from a structural property of W , namely the tail-index σ ∈ [0, 1], as explained in section 3.4. Consequently, σ is a distinguished feature of the population, constituting an interesting summarizing statistic, as it explains the growth of the network and part of its structure. Hence, rather than talking about sparsity and power-law, we shall talk about the tail-index. We show in section 4 that sparse graphex samples have a readily identifiable signature which can be uncovered using the subsampling invariance described in section 3.3. This can be leveraged to visually diagnose whether a given graph can reasonably be assumed to be a sample of a sparse graphex process (see section 6.1).
Also, we note that it is of practical interest to construct graphex models that are close analogues of graphon models that have proven useful in practice; this is the approach of [51,29], and a number of models of this kind are described in [8]. One then wants to combine the efficient procedures already established for the graphon model with new procedures for estimating the additional parameters governing the sparse behaviour, such as the tail-index.

Nonparametric inference, bootstrap, and hypothesis testing
Apart from the work of [51,12,13,29], algorithms for statistical inference of graphex processes are still marginal in the literature. Moreover, even the most advanced algorithm, that of [51], assumes a parametric structure for W . Fully nonparametric estimation of W is, to the best of our knowledge, still a challenge. One possibility is to adapt the method of moments [7] to graphex processes. The main idea is that W is entirely determined by the collection of all rescaled injective homomorphism densities [7,5,21], which can in turn be estimated using the count statistics, i.e. the numbers of occurrences of certain motifs in the graph. We elaborate on this in section 5.
Besides their interest for nonparametric estimation of W , the count statistics also give nice hints about the structure of the graph. They have been used for testing equality of features of networks and finding confidence intervals for count features [42,47], and similarly to determine whether the dearth of certain motifs is statistically significant [5].
While in principle count statistics are exactly computable, in practice the problem rapidly becomes intractable for large motifs. In [5], the authors propose to use two different types of bootstrap techniques to make computations feasible. The idea is to count the motifs in a possibly small subsampled subgraph to estimate the count in the whole graph. Here we propose an alternative using p-sampling. Though for counting motifs the performance is equivalent (and thus for inference too), there are advantages when it comes to testing hypotheses, or simply if one is also interested in estimating the variance of the count statistics.
As in [5], we apply our method to construct statistical tests about the count statistics. However, in contrast with [5], our null hypothesis is a graphex process instead of a sparse graphon process. Though this might seem anecdotal, it has a major advantage: bootstrapped samples of sparse graphon processes do not have distinguished distributional properties, while p-sampling preserves the distribution of graphex processes. We exploit this invariance to construct estimators for the variances of count statistics that recycle the bootstrap samples used to estimate the counts, while [5] has to rely on the construction of ad-hoc estimators. Our estimators are simply rescaled versions of the empirical variance of the bootstrap samples, and thus straightforward to compute.

Notations
Most of our notations are taken from or inspired by [21]. We denote the vertex and edge sets of a graph G by V (G) and E(G), and the numbers of vertices and edges by v(G) := |V (G)| and e(G) := |E(G)|. Also, for every j ∈ N := {1, 2, . . .}, we let d j (G) be the number of vertices with degree j in G. We consider both labeled and unlabeled graphs; the labels will always be the integers 1, . . . , n where n is the number of vertices in the graph. A labeled graph is thus a graph with vertex set N n := {1, . . . , n} for some n ≥ 1; we let L n denote the set of the 2^{n(n−1)/2} labeled graphs on N n and let L := ∪_{n≥1} L n . An unlabeled graph can be regarded as a labeled graph where we ignore the labels; formally, we define U n , the set of unlabeled graphs of order n, as the quotient L n /≅ of labeled graphs modulo isomorphism. We let U := ∪_{n≥1} U n = L/≅, the set of all unlabeled graphs. When X t and Y t are random variables, asymptotic relations such as X t ∼ Y t , X t = O(Y t ) and X t = o(Y t ) are understood almost-surely; when the limits hold only in probability, we write X t = O p (Y t ) and X t = o p (Y t ). Inequalities up to generic constants are denoted by the symbols ≲ and ≳. The symbol E denotes expectation, and when it is needed to emphasize that {G t : t ≥ 0} is a graphex process with parameter W , we write E W .

Model
We now define the (simple) graphex process. The word simple refers to the fact that we consider undirected graphs with no loops. The model has been introduced in [13] and further studied in [52,53,51,8,9,31,14]. We recall that W : R 2 + → [0, 1] is a jointly measurable symmetric function, usually referred to as the graphon [52].
A graphex process {G t : t ≥ 0} is a graph-valued stochastic process which can be constructed as follows. Sample a unit-rate Poisson random measure V on R 2 + [see 18,19], and identify V with the collection of points (t, x) ∈ R 2 + at which V has a point mass, that is, V ≡ {(t 1 , x 1 ), (t 2 , x 2 ), . . .}. Let G̃ ∈ L be a (possibly infinite) graph with vertex set V, such that for each pair of vertices v i = (t i , x i ) and v j = (t j , x j ), there is an edge between v i and v j with probability W (x i , x j ), independently for any two i ≠ j. Let G ∈ L be the subgraph of G̃ consisting only of the vertices with degree at least one. Note that G is almost-surely a graph with countably infinitely many vertices, and that its set of edges is also countably infinite unless W is equal to 0 almost-everywhere. For each t ≥ 0, let G t ∈ U be the unlabeled induced subgraph of G consisting only of the vertices (t′, x) such that t′ ≤ t. The process {G t : t ≥ 0} is a graphex process with parameter W . For any fixed t ≥ 0, G t ∈ U is called a sample of size t. We insist that G t is unlabeled. In other words, the labels (t i , x i ) are not observed, and the statistician may think of them as latent variables.
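To make the construction concrete, here is a minimal simulation sketch. The helper names `sample_graphex` and `poisson` are ours, and the truncation of the latent space to [0, x_max] is an assumption of the sketch (x_max must be chosen so that W is negligible beyond it), not part of the model:

```python
import math
import random

def poisson(rng, lam):
    """Poisson draw: Knuth's method, with a normal approximation
    for large rates to avoid underflow of exp(-lam)."""
    if lam > 30:
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        p *= rng.random()
        k += 1
    return k - 1

def sample_graphex(W, t, x_max, seed=None):
    """Approximate sample G_t of a graphex process with graphon W.
    The unit-rate Poisson process on [0, t] x R_+ is truncated to
    [0, t] x [0, x_max].  Returns the edge set over integer vertex
    ids; isolated vertices never appear in it, matching the
    definition of G_t."""
    rng = random.Random(seed)
    n = poisson(rng, t * x_max)                 # points of the Poisson process
    xs = [rng.uniform(0.0, x_max) for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < W(xs[i], xs[j]):  # edge w.p. W(x_i, x_j)
                edges.add((i, j))
    return edges
```

For instance, with W (x, y) = exp(−(x + y)) the truncation at x_max = 6 is essentially harmless, since then μ(x) = e^{−x} puts negligible mass beyond it.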
If ∫_{R 2 +} W (x, y) dx dy < ∞, then at each t ≥ 0 the sample G t has finitely many vertices and edges almost-surely [13,52].

Remark 1.
In this paper, we only consider graphex processes with no loops. We note that in the seminal papers [13,52] graphex processes are defined with the possibility of having loops, hence the definition here is slightly less general. Indeed, the processes in this paper are the special case of [13,52] corresponding to setting W (x, x) = 0 for all x ∈ R + in their model.

Subsampling of graphex processes
Various sampling designs have been proposed in the statistical and computer science literature to derive representative samples of a given network [35,49,38,45]. Many of these sampling designs have been analyzed from the design-based sampling point of view [50,15]. It is common in the statistical network literature to oppose model-based and design-based sampling. In the context of graphex processes, [53,9] make the bridge between the two points of view by identifying the sampling design naturally associated with graphex processes.
Of particular interest, [53,9] define the p-sampling procedure. Given a graph G, a p-sample of G, written ps(G, p) in the following, is the random graph obtained from G by sampling the vertices of G independently with probability p and returning the induced subgraph in which only non-isolated vertices are kept. The following lemma, taken from [53], will be at the core of our inference technique.

Lemma 1. Let {G t : t ≥ 0} be a graphex process and p ∈ [0, 1]. Then ps(G t , p) d = G pt .

From [9] we also know the converse result: if ps(G t , p) d = G pt , then {G t : t ≥ 0} is a graphex process. This result sheds light on the sampling design associated with graphex processes: we may also view G t as a sample obtained from a population graph G t′ with t′ > t > 0 and G t = ps(G t′ , t/t′).
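The p-sampling mechanism is simple to implement; a sketch (our helper `p_sample`, with graphs represented by edge sets, so that dropping isolated vertices is implicit — a vertex isolated after sampling simply occurs in no retained edge):

```python
import random

def p_sample(edges, p, rng=random):
    """p-sampling: retain each vertex independently with probability p,
    take the induced subgraph, and keep only non-isolated vertices."""
    vertices = {v for e in edges for v in e}
    kept = {v for v in vertices if rng.random() < p}
    return {(u, v) for (u, v) in edges if u in kept and v in kept}
```

Applied to a sample of a graphex process, the output is distributed as a sample of the same process at size pt, which is the invariance exploited throughout.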

Graphex marginal and tail-index
Recently, [14] studied asymptotics for an important class of graphex models (encompassing most known examples). They derive the remarkable result that, for this model class, the sparsity and the degree distribution of the graph are governed by the tail behaviour of the graphex marginal μ(x) := ∫ 0 ^∞ W (x, y) dy. More specifically, assuming without loss of generality that μ is monotone, they show that if there are σ ∈ [0, 1] and a slowly varying function ℓ such that
μ^{−1}(x) ∼ x^{−σ} ℓ(1/x) as x → 0, (3.1)
then, under mild supplementary assumptions on W (related to our assumption 2 below), the asymptotic behaviour of v(G t ), e(G t ) and d j (G t ) is determined by σ. They characterize the four regimes listed below.
• Dense: σ = 0 and lim t→∞ ℓ(t) < ∞. In that case μ^{−1} has compact support and e(G t ) ≍ v(G t )^2 almost-surely.
• Almost-dense: σ = 0 and lim t→∞ ℓ(t) = ∞. In that case e(G t )/v(G t )^2 → 0 almost-surely, but e(G t )/v(G t )^{2−ε} → ∞ almost-surely for every ε > 0.
• Sparse with power-law: σ ∈ (0, 1). In that case, e(G t ) ≍ [v(G t )/ℓ(t)]^{2/(1+σ)} almost-surely. Moreover, for every j ∈ N, d j (G t )/v(G t ) → σΓ(j − σ)/(Γ(1 − σ) j!) almost-surely as t → ∞.
• Very sparse: σ = 1. In that case, ℓ has to go to zero sufficiently fast, and we refer to [14] for the precise asymptotics.
We call σ in equation (3.1) the tail-index of μ. Here we are mostly interested in the sparse regime with power-law, i.e. σ ∈ (0, 1). It is the most interesting regime, since the dense regime is equivalent to the classical graphon model, and the very sparse regime is an extreme case unlikely to be useful in statistical applications.

Heuristic and the p-sampling argument
Before introducing our estimator for σ and results on its risk, we review the heuristic we used to construct the estimator. Since it relies mostly on lemma 1, we shall refer to this heuristic as the p-sampling argument in what follows. The idea is to use the scale-invariance of certain statistics. From [14], we have v(G t ) ∼ t^{1+σ} ℓ(t)Γ(1 − σ) almost-surely as t → ∞, under mild conditions. Imagine we could observe {G t : t ≥ 0} at two different sizes t < t′, with t/t′ known. Then, we could asymptotically recover σ because v(G t′ )/v(G t ) → (t′/t)^{1+σ} almost-surely as t → ∞, at least if t′/t remains constant. Of course, we usually have only one sample G t with unknown t > 0. The key insight is that p-sampling allows us to effectively simulate a sample G t′ with t′ < t; intuitively, this is because ps(G t , p) d = G pt by lemma 1. Of course, here we don't have access to the marginal law of ps(G t , p), only to the conditional law of ps(G t , p) | G t . Hence some care has to be taken, but the intuition remains useful. In particular, we can asymptotically recover σ from v(G t ) and v(ps(G t , p)) for some p ∈ [0, 1], or even better, from v(G t ) and N p (G t ) := E[v(ps(G t , p)) | G t ]. It is easy to see that for any simple graph G ∈ U with no isolated vertex, N p (G) = Σ_{j≥1} d j (G) p (1 − (1 − p)^j ). In fig. 1, we illustrate on a simulated example how to infer σ from p → N 1 (G t )/N p (G t ). Indeed, log(N 1 (G t )) − log(N p (G t )) → −(1 + σ) log(p) almost-surely as t → ∞ under mild assumptions, by the discussion above and the results in [14]. Hence, σ may be estimated by determining the slope of the curve log(N 1 (G t )/N p (G t )) as a function of − log(p).
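The quantity N p (G t ) appearing above can be read as the conditional expectation E[v(ps(G t , p)) | G t ]; under that reading it admits a closed form in terms of the degree counts, since a degree-j vertex survives p-sampling non-isolated precisely when it is retained (probability p) and at least one of its j neighbours is retained (probability 1 − (1 − p)^j), two independent events. A sketch with our own helper names:

```python
from collections import Counter

def degree_counts(edges):
    """d_j(G): how many vertices of G have degree j, from an edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return Counter(deg.values())

def N_p(edges, p):
    """E[v(ps(G, p)) | G] = sum_j d_j(G) * p * (1 - (1 - p)**j):
    a degree-j vertex is retained w.p. p and keeps at least one
    retained neighbour w.p. 1 - (1 - p)**j, independently."""
    return sum(n * p * (1.0 - (1.0 - p) ** j)
               for j, n in degree_counts(edges).items())
```

Note that N_p(edges, 1.0) recovers v(G), consistent with N 1 (G t ) above.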

An estimator and its risk
In view of the heuristic of the previous section, we can construct a simple estimator for σ. For a chosen value of p ∈ (0, 1), we propose to use as an estimator for σ,
σ̂ p := log(N 1 (G t )/N p (G t ))/log(1/p) − 1.

Fig 1. On the left: G t is a sample from the GGP model of [13] with parameters (σ, τ ) = (0.2, 5) and t = 2000 (see also example 5). The sample has 26671 vertices and 293923 edges. With these parameters μ has a tail-index equal to 0.2. The red dotted line is y = (1 + σ)x, which is the expected asymptotic behaviour of log(N 1 (G t )/N p (G t )) as a function of − log p. The plain black line is y = (1 + σ̂ p )x, where σ̂ p is the estimated value of σ using p = 0.5 (represented by the blue dotted vertical line). On the right: the green curve represents the degree distribution of the same sample. In plain black (respectively dotted red) is represented d j ∝ j^{−1−σ̂ p} (respectively d j ∝ j^{−1−σ}) to illustrate that the degree distribution is also governed by the tail-index.
Notice that σ̂ p is well-defined since N 1 ≥ N p almost-surely for all 0 ≤ p ≤ 1. Also note that σ̂ p does not depend explicitly on the sample size t; this is important because t is generally considered to be unknown.
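As an illustration of how such an estimator might be computed, the sketch below estimates N p (G t ) by averaging v(ps(G t , p)) over B independent p-samples rather than using a closed form; the helper names are ours, and the formula σ̂ p = log(N 1 /N p )/log(1/p) − 1 is the slope heuristic of the previous section read as a point estimate:

```python
import math
import random

def v_nonisolated(edges):
    """Number of non-isolated vertices of a graph given as an edge list."""
    return len({v for e in edges for v in e})

def p_sample_edges(edges, p, rng):
    """One p-sample: keep vertices independently w.p. p, induce edges."""
    verts = {v for e in edges for v in e}
    kept = {v for v in verts if rng.random() < p}
    return [(u, w) for (u, w) in edges if u in kept and w in kept]

def sigma_hat_p(edges, p, B=500, seed=0):
    """sigma_hat_p = log(N_1 / N_p) / log(1/p) - 1, with N_p estimated
    by a Monte Carlo average of v(ps(G, p)) over B p-samples."""
    rng = random.Random(seed)
    n1 = v_nonisolated(edges)
    np_hat = sum(v_nonisolated(p_sample_edges(edges, p, rng))
                 for _ in range(B)) / B
    return math.log(n1 / np_hat) / math.log(1.0 / p) - 1.0
```

Note that nothing in this computation uses the unknown size t, only the observed graph and the chosen p.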
In order to compute the risk of σ̂ p , we require further assumptions on the model. Our first assumption is analogous to assumptions encountered in [52,14]. We first introduce the 2-point correlation function ν(x, y) := ∫ 0 ^∞ W (x, z)W (z, y) dz. We also make the following assumption on ν, which is classical in the literature [52,14] in order to control the variance of certain (non-linear) functionals of graphex processes, such as the number of vertices.
It is also well-known [25,14] that under equation (3.1) the asymptotic equivalence
∫ 0 ^∞ (1 − e^{−tμ(x)}) dx ∼ Γ(1 − σ) t^σ ℓ(t), t → ∞, (4.1)
holds. This relation is already enough to show consistency of σ̂ p . However, further assumptions on μ are required to bound the bias of the estimator. In the present paper, we shall assume the following.

Assumption 3. We assume that equation (3.1) holds. We furthermore assume that for all p ∈ (0, 1) there is a sequence (Γ p,t ) t>0 , vanishing as t → ∞, that controls the error terms in equation (4.1).

Assumption 3 is closely related to the second-order regular variation assumptions used in extreme value theory [20]. Its purpose is the following. The integral in equation (4.1) plays a determining role in computing the risk of σ̂ p . Equation (4.1), however, says nothing about the rate at which the integral approaches t^σ ℓ(t). Moreover, we know that ℓ(pt)/ℓ(t) → 1 as t → ∞ by the assumptions on ℓ, but again, the rate of convergence is unknown. These two rates are encapsulated in assumption 3, allowing us to quantify the risk of σ̂ p . Bounds on Γ p,t for various examples are given in section 4.3. These examples show that in many cases Γ p,t has polynomial decay.
We are now in position to state the main theorem of this section, whose proof is deferred to appendix A.

Theorem 1. Under assumptions 2 and 3, there is a constant C > 0 depending only on p and W such that for all t ≥ 1 it holds that
E W [(σ̂ p − σ)^2] ≤ C max{ Γ p,t ^2 , t^{1−2η} }.

Remark 2. The computation of the estimator requires choosing a value for p.
Theoretical guidelines indicate that any value of p bounded away from zero yields a consistent estimator, but this leaves unclear how to choose p in practice. Optimally, we could build a selection procedure that selects the "optimal" value of p, in the sense that bias and variance are suitably balanced. This would, however, require finer assumptions on the structure of the bias. We believe the gain would be marginal, as in many cases the value of σ̂ p is relatively robust to the choice of p; see figs. 1 and 4. In practice, we recommend doing a Hill plot to assess the pertinence of estimating σ and to determine a reasonable value of p, as detailed in section 6.1.

Discussion about the bias
In [14], the authors suggest estimating the tail-index using an estimator which we denote σ̂ CPR (we refer to [14] for its definition). Although the variance of σ̂ p has polynomial decay in t, the bias is proportional to Γ p,t , whose decay can be arbitrary. Hence it is not clear whether σ̂ p dominates σ̂ CPR asymptotically.
Here we present some examples where the bias ofσ p has also polynomial decay in t. We consider the examples which were originally given in [14], because they cover a large spectrum of behaviours.
Example 1. Dense graphs. Here σ = 0 with lim t→∞ ℓ(t) < ∞, and the graph is almost-surely dense. Assumption 2 is trivially satisfied with η = 1. Also, it follows from appendix C.1 that assumption 3 is satisfied with Γ p,t ∝ t^{−1} : the risk is polynomially decreasing in this example.

Example 2. Sparse, almost dense graphs without power law. Consider the function given for this regime in [14]. Then assumption 2 is trivially satisfied with η = 1 and σ = 0. Moreover, it follows from appendix C.2 that assumption 3 is satisfied with Γ p,t ∝ 1/ log(t): the risk is logarithmically decreasing in this example.

Example 3. Sparse graphs with power law, separable. For the separable graphon considered in [14], assumption 2 holds trivially with η = 1. Moreover, from appendix C.3, assumption 3 holds with Γ p,t ∝ t^{−σ} : the risk is polynomially decreasing in this example.

Example 4. Sparse graphs with power law, non-separable. For the non-separable graphon considered in [14], it is shown in [14] that assumption 2 holds with η = (1 + σ)/2. We have μ(x) = σ(1 + x)^{−1/σ}, which is, up to a constant, the same marginal as in the previous example, so that assumption 3 is verified with Γ p,t ∝ t^{−σ} : the risk is polynomially decreasing in this example.
Example 5. The GGP model. This corresponds to the model in [13]. It is shown in [14] that assumption 2 holds with η = 1. We establish in appendix C.4 that assumption 3 is satisfied with Γ p,t ∝ t^{−σ} : the risk is polynomially decreasing in this example.
Although the asymptotic behaviours of the graphs are relatively close in examples 1 and 2, the bias decay is radically different and estimation is harder in the second example. We also estimated E[(σ̂ 0.5 − σ)^2]^{1/2} and E[(σ̂ CPR − σ)^2]^{1/2} on the previous examples using 10000 Monte Carlo samples. In the simulations, we picked σ = 0.3 for the sparse separable and sparse non-separable examples, while the GGP example corresponds to a choice of σ = 0.5. The raw numbers are given in tables 1 and 2.
We picture in fig. 2 the risks of σ̂ 0.5 and σ̂ CPR as functions of t for all the simulated examples. As expected, the risk of σ̂ 0.5 has polynomial decay in every case but the almost dense example, while the risk of σ̂ CPR remains logarithmically decreasing in all examples. We note, however, that even in the situation where the risk has logarithmic decay, σ̂ 0.5 seems to perform slightly better than σ̂ CPR .

On the controversy about scale-free nature of real graphs
There is no actual consensus on the scale-free nature of empirical graphs. Here, we briefly review both sides of the debate. This is also an opportunity for us to review what σ is and what it is not.
According to the general opinion, a finite graph G is called scale-free if there are c > 0 and γ > 1 such that, for j large enough,
d j (G) ≈ c j^{−γ} . (4.2)
We left the previous definition intentionally non-rigorous to emphasize one of the origins of the controversy: the lack of consensus on the definition of scale-freeness [30,54]. Barabási and Albert [4] found that equation (4.2) seemed to hold for three empirical networks, and they attributed this property to a growth mechanism they called preferential attachment. Since then, many papers have claimed to discover new types of power-law networks and new mechanisms creating them; see for instance [44,30]. But not everybody agrees, and some authors believe that power-law degree distributions are rarely encountered in practice [32,16]. The heated debate culminated recently with [10], claiming that scale-free networks are rare after reviewing a large collection of empirical networks.
The methodology of [10] consists in testing, as a null hypothesis, a power-law model for the degree distribution against an alternative which is not power-law. On a corpus of 927 empirical networks, which they claim is representative of real networks, they found that only very few networks failed to reject the null hypothesis.
Since its first appearance as a preprint, [10] has generated a lot of activity, whether in agreement or disagreement. We share the opinion of [30,54] that the source of the debate is the lack of a rigorous definition. [54] make an effort toward a clean definition and arrive at the opposite conclusion of [10]. Another argument against [10], which seems to be the most widespread, is that scale-freeness is only well-defined in the infinite-size limit [30]. According to this view, a network is scale-free if its degree distribution approaches a power-law as the network keeps growing following the same mechanisms. In other words, instead of equation (4.2), we shall say that, for j large enough,
d j (G t )/v(G t ) ≈ c j^{−γ} as t → ∞. (4.3)
Then, γ is not defined as a property of the sample G t , but rather as a joint property of the limiting graph and the growth mechanism, or equivalently of the sampling design. Here we insist on the importance of the sampling design in the definition: if {G t : t ≥ 0} is a graphex process, then γ = 1 + σ almost-surely under mild conditions [14], though the degree distribution of G t at finite t might have a different index, as illustrated in fig. 3. Another famous example is traceroute sampling, which is known to produce samples with power-law degree distributions even when the population is not power-law [37,1]. Another example of this type is provided in section 6.3. Hence, if one is ready to accept equation (4.3) as a definition, talking about scale-freeness without knowledge of the sampling design is irrelevant.
One criticism of the definition in equation (4.3) is made on the basis that we never observe {G t : t ≥ 0} but only G t , and that we need tools adapted to finite networks. We have mixed feelings about this. On one hand, the same criticism could be made about any statistical analysis. Indeed, one of the primary goals of statistics is to infer properties of a population given a sample. In the case of i.i.d. observations, we have long accepted to idealize the population as an abstract probability distribution, and this has proven very effective in summarizing properties of the population as parameters of the distribution. When the observation is a graph, we have no reason to proceed differently (see in particular in [30] the example of the epidemic threshold). On the other hand, [30] points out the difficulty in determining the right sampling mechanism, and it is true that models of random networks are commonly misspecified. Then, it is less clear that the parameters make sense, albeit this depends on the degree of misspecification and the task to achieve.
Finally, we note that the test in [10] only involves the degree distribution, but graphs are more than just their degree distribution (see in particular section 6.2.2). For these reasons, we do not believe in the existence of a universal model that can explain all empirical graphs, and it is the work of the statistician to determine a good model for a given task; whence it is useful to study as many models as possible. This opinion seems to be shared by the authors of [10]. We add that in many cases the focus should be on achieving the task at hand with accuracy rather than trying to explain all the structure of the sample.
Besides, the assertion that graphex processes are not desirable for modeling empirical graphs because power-law graphs are rare is too simplistic: (i) no serious methodology can objectively prove this rarity (what is a representative set of empirical graphs? which definition of scale-freeness should be used?), (ii) it is the degree distribution of graphex processes in the infinite-size limit which is power-law, and (iii) there are empirical graphs for which some graphex processes have been found to explain the degree distribution well (see [13,51] and also section 6). Overall, one should not forget to look at other aspects of the modeling that might be more important than just the degree distribution.

Relation to Hill estimator
Regardless of the controversy of section 4.4, the tail-index of the degree distribution of G t is often reported as an interesting summarizing statistic [35, Section 4]. [54] plead in favor of using various estimators, as the index is notoriously hard to estimate and different estimators may give very different answers on a given sample. In our opinion, the interest of the tail-index of G t is debatable, as it might be unrelated to the index of the population, an issue we already discussed in section 4.4 and which we emphasize here. The most commonly reported value for the tail-index of the sample is based on the Hill estimator, computed from the degrees exceeding a cutoff d min > 0 that determines where the tail of the degree distribution starts. Most empirical graphs are reported to have γ̂ Hill ∈ [2, 3], while graphex processes have almost-surely index γ = 1 + σ ∈ [1, 2], at least if we rely on the definition of equation (4.3). So one might think graphex processes are unrealistic for modeling graphs with γ̂ Hill ∈ [2, 3]. Once again, we have to be careful: it is not obvious that the Hill estimator gives consistent answers for γ. Though it is known from [55] that it is consistent in the preferential attachment model, nothing is known about graphex processes. In particular, in fig. 3 we illustrate what can go wrong: on a simulated sample with true σ = 0.3 (hence γ = 1.3), our estimator gives σ̂ 0.5 = 0.3335, while a carelessly calibrated Hill estimator gives γ̂ Hill = 2.1181 ∈ [2, 3]. Once again, this shows that one has to be careful when relating the degree distribution of the sample to the idealized degree distribution of the population: the sampling design has to be taken into account when inferring properties of the population.
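For concreteness, a common variant of the Hill estimator applied to a degree sequence reads as below; this is the standard continuous-data maximum-likelihood form, and the exact variant and calibration of d min used for the figure are not reproduced here:

```python
import math

def hill_gamma(degrees, d_min):
    """Hill-type estimate of the tail-index gamma of a degree sequence:
    gamma_hat = 1 + n_tail / sum(log(d_i / d_min)) over degrees
    d_i >= d_min.  Highly sensitive to the cutoff d_min, which is the
    point of the discussion above."""
    tail = [d for d in degrees if d >= d_min]
    if not tail:
        raise ValueError("no degrees above the cutoff d_min")
    return 1.0 + len(tail) / sum(math.log(d / d_min) for d in tail)
```

Plotting such estimates against a grid of cutoffs is the Hill-plot diagnostic mentioned in remark 2.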

Count statistics of graphex processes: definition, inference, and hypothesis testing
We briefly discuss how the p-sampling argument may also be used to estimate other functionals of W . This essentially extends the work of [7,5] from graphon to graphex. In particular, using p-sampling allows us to derive a natural notion of bootstrap for networks. Using bootstrap samples, we can then estimate the count statistics of the graph (sections 5.1 and 5.2), which can in turn be used to do nonparametric inference on the (rescaled injective) homomorphism densities (section 5.3). We also use count statistics to construct tests for the numbers of occurrences of certain motifs (section 5.4). Finally, we discuss the relation of these results to other works (section 5.5).

Fig 3. In green is the degree distribution of G t , a sample from the GGP model of [13] with parameters (σ, τ ) = (0.3, 10) and t = 10000 (see also example 5).

Count statistics
Count statistics have received a lot of attention in the last few years, for multiple reasons. A first reason is that some motifs are of interest to get insights into the structure of the network. For instance, counting the number of triangles permits determining the transitivity of the graph, which is a popular measure of the tendency of the vertices to cluster together [56]. Another reason is that counts of motifs characterize the distribution of the network (see [21] and section 5.3), so that nonparametric inference on the distribution of the network is possible using counts of certain motifs [6,7,5]. Other related uses of motif counts involve hypothesis testing [5] and network comparison [5,2]. Counting motifs is notoriously hard, especially when the graph is large and the motif has a large number of vertices. Even counting simple motifs such as triangles may rapidly become intractable on large graphs [48]. Methods based on subsampling the graph have emerged to bypass this complexity. A non-exhaustive list of existing methods includes methods based on vertex sampling [5,34,2,41], on the parametric bootstrap [28,39], or on edge sampling [3,23,26,27]. Here, we take advantage of p-sampling to count motifs. We discuss in section 5.5 how this relates to existing works, and some of the advantages of p-sampling over other sampling designs.
Before going further, we shall first clarify what we mean by count statistics. For a finite graph R, we define the number of copies of R in G as the number of adjacency-preserving maps from R to G. We call this number n(R, G), the count statistic of R. (5.1)
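To fix ideas, here is a minimal brute-force computation of the count statistic n(R, G) of (5.1), on a toy graph; the adjacency-set representation and all names are ours, for illustration only.

```python
from itertools import permutations

def count_copies(R_edges, R_nodes, G_adj):
    """n(R, G): the number of injective adjacency-preserving maps from R to G.

    Brute force over all injective vertex maps, so only practical for tiny
    motifs R and small graphs G; this is the cost the bootstrap avoids.
    """
    total = 0
    for image in permutations(list(G_adj), len(R_nodes)):
        phi = dict(zip(R_nodes, image))
        if all(phi[v] in G_adj[phi[u]] for (u, v) in R_edges):
            total += 1
    return total

# Triangle motif and the complete graph K4 (adjacency sets).
triangle_edges = [(0, 1), (1, 2), (2, 0)]
K4 = {i: {j for j in range(4) if j != i} for i in range(4)}
print(count_copies(triangle_edges, [0, 1, 2], K4))  # 24, i.e. 4 triangles, 6 maps each
```

The factor 6 between maps and triangles is the automorphism count of the triangle, which is why the number of triangles equals n(R, G)/6 in section 6.2.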

Bootstrap estimators for count statistics: model-free results
We first show how to derive a bootstrap estimate for n(R, G) using p-sampling, which works for any graph G (so G is not necessarily a graphex process). The result is of interest by itself, since it provides a consistent estimator of n(R, G) regardless of the distribution of G, and that estimator is parallelizable, in contrast with the computation of n(R, G), which may be challenging for complex motifs R. Our estimator is based on the following easy proposition.

Proposition 1.
Let G ∈ U and let R ∈ L_k be a connected graph such that n(R, G) < ∞. Then E[n(R, ps(G, p)) | G] = p^k n(R, G). (5.2)
From the previous proposition, we can obtain a natural estimator of n(R, G), which we refer to as the bootstrap estimator. Indeed, for a fixed p ∈ (0, 1) to be chosen, and for some integer B ≥ 1, we let Ĝ_1, …, Ĝ_B be i.i.d. copies of ps(G, p) | G, and we estimate n(R, G) by n̂(R, G) := (B p^k)^{-1} Σ_{b=1}^B n(R, Ĝ_b). The main advantage is that the n(R, Ĝ_b) are intrinsically faster to compute than n(R, G), especially when G is sparse, and the computations can be run in parallel. Interestingly, the formula (5.2) is model-free, in the sense that it is true for any graph G that has no isolated vertices. It follows that if G has no isolated vertices, then n̂(R, G) is an unbiased estimator of n(R, G), comparable to the Horvitz-Thompson estimator constructed in [34] 7. The next proposition quantifies the risk of n̂(R, G) as an estimator of n(R, G).
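The estimator can be sketched in a few lines. The p-sampling step below (Bernoulli vertex sampling, induced subgraph, isolated vertices dropped) and the rescaling by p^k are our reading of the text; the toy graph and names are assumptions for illustration.

```python
import random
from itertools import permutations

def count_copies(R_edges, k, G_adj):
    """n(R, G) by brute force: injective adjacency-preserving maps from R
    (with vertex set {0, ..., k-1}) into G."""
    return sum(
        all(img[u] in G_adj[img[v]] for (u, v) in R_edges)
        for img in permutations(list(G_adj), k)
    )

def p_sample(G_adj, p, rng):
    """One p-sample: keep each vertex independently with probability p,
    take the induced subgraph, and drop isolated vertices."""
    kept = {v for v in G_adj if rng.random() < p}
    sub = {v: G_adj[v] & kept for v in kept}
    return {v: nb for v, nb in sub.items() if nb}

def n_hat(R_edges, k, G_adj, p, B, seed=0):
    """Bootstrap estimator: average of n(R, G_b) / p**k over B p-samples.
    The B counts are independent given G and could run in parallel."""
    rng = random.Random(seed)
    return sum(count_copies(R_edges, k, p_sample(G_adj, p, rng))
               for _ in range(B)) / (B * p ** k)

triangle = [(0, 1), (1, 2), (2, 0)]
K4 = {i: {j for j in range(4) if j != i} for i in range(4)}
print(n_hat(triangle, 3, K4, p=1.0, B=2))  # p = 1 recovers n(R, G) = 24 exactly
```

For p < 1 each bootstrap count is computed on a much smaller graph, which is where the computational gain comes from.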

Proposition 2.
Let G ∈ U and let R ∈ L_k be a connected graph such that n(R, G) < ∞. Then n̂(R, G) is an unbiased estimator of n(R, G), and var(n̂(R, G) | G) = var(n(R, ps(G, p)) | G)/(B p^{2k}). It is worth pointing out that, by the previous display, n̂(R, G) is a consistent estimator of n(R, G) as B → ∞, and this with almost no condition on G. This result is of interest by itself if one is only interested in counting motifs in G (see also section 6.2).
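The variance statement of proposition 2 is the elementary identity for an average of B terms that are i.i.d. conditionally on G; spelled out:

```latex
\operatorname{var}\big(\hat n(R,G)\,\big|\,G\big)
  = \operatorname{var}\Big(\tfrac{1}{Bp^{k}}\textstyle\sum_{b=1}^{B} n(R,\hat G_{b})\,\Big|\,G\Big)
  = \frac{1}{B^{2}p^{2k}}\sum_{b=1}^{B}\operatorname{var}\big(n(R,\hat G_{b})\,\big|\,G\big)
  = \frac{\operatorname{var}\big(n(R,\operatorname{ps}(G,p))\,\big|\,G\big)}{B\,p^{2k}}.
```

The conditional variance thus vanishes at rate 1/B, and consistency as B → ∞ follows by Chebyshev's inequality.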

Nonparametric inference
Beyond their interest in describing the structure of the network, count statistics permit estimation of the rescaled injective homomorphism densities [8]. These are distinguished functionals of W whose collection entirely determines W [21,40,8]. The rescaled injective homomorphism density of R in W, denoted h(R, W), is defined as in [8]. It follows from [21,40,8] that W can in principle be recovered, up to measure-preserving transformations, from the collection {h(R, W) : R ∈ L}. We shall only consider motifs R and processes for which this quantity is finite. Indeed, we will assume that a certain number of moments of W are finite.

Proposition 3.
Let R ∈ L_k be a connected graph. If {G_t : t ≥ 0} is a graphex process satisfying assumption 4 with N = k, then almost surely t^{-k} n(R, G_t) → h(R, W) as t → ∞. The consequence of the last proposition is that h(R, W) can be consistently estimated as t → ∞ by ĥ(R, G_t) := t^{-k} n(R, G_t), where in turn we can leverage the consistency of our bootstrap estimator of n(R, G_t) to make the computations tractable. We note that the method of moments of [7] can be adapted to estimate W from ĥ(R_1, G_t), ĥ(R_2, G_t), …, for clever choices of motifs R_1, R_2, ….

Hypothesis testing
Besides their interest for estimating W, count statistics have been used for testing equality of features of networks and finding confidence intervals for count features [42,47]. The authors of [5] give an example where they use counts of certain motifs to test whether the dearth of short cycles in a network is statistically significant. As argued in [5], the main challenge is then the problem of finding estimates of s²_t(R) := var_W(n(R, G_t)). Here we show how the p-sampling argument can be used to recycle the bootstrap samples to estimate s²_t(R), which provides a somewhat simpler estimator than in [5]. Our estimator is based on the result of the following proposition.
This suggests that s²_t(R) may be estimated by ŝ²(R), where in turn the variance in the numerator of the last display can be approximated up to arbitrary precision by recycling the bootstrap samples used for the estimation of n(R, G_t). The next proposition establishes that ŝ²(R) is consistent.

Proposition 5.
Let {G_t : t ≥ 0} be a graphex process satisfying assumption 4 with N = 4k − 2. For any connected graph R ∈ L_k, ŝ²(R)/s²_t(R) → 1 in probability as t → ∞. The fact that we are able to consistently estimate s²_t(R) permits hypothesis testing for n̄(R, W, t) := E_W[n(R, G_t)]. In particular, it is of interest to construct tests for H_0 : n̄(R, W, t) = n̄(R, W_0, t_0) against H_a : n̄(R, W, t) ≠ n̄(R, W_0, t_0); see for instance section 6.2.2 and also [5]. We show in the next proposition that asymptotically consistent testing for these hypotheses can be made on the basis of the test statistic T(R, G_t) := (n(R, G_t) − n̄(R, W_0, t_0))/ŝ(R). Proposition 6. Let {G_t : t ≥ 0} be a graphex process with graphon function W_0 satisfying assumption 4 with N = 4k. For any connected graph R ∈ L_k, T(R, G_t) converges in distribution to a standard normal as t → ∞.
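As a sketch of how such a test would be applied in practice, the following computes a standardized count statistic and a two-sided p-value under an assumed standard normal limit. The form T = (n̂ − n̄₀)/ŝ is our reading of the partly garbled display above, and we treat the reported 1.93 · 10^8 from the Flickr example of section 6.2.2 as the standard-deviation scale of the count; both are assumptions.

```python
import math

def count_z_test(n_hat, n_bar_0, s_hat):
    """Standardized count statistic T = (n_hat - n_bar_0) / s_hat and its
    two-sided p-value under an assumed standard normal limit."""
    T = (n_hat - n_bar_0) / s_hat
    p_value = math.erfc(abs(T) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|T|))
    return T, p_value

# Values from the Flickr triangle test of section 6.2.2.
T, pv = count_z_test(5.03e9, 3.75e8, 1.93e8)
print(round(T, 1), pv)  # T is about 24, so the p-value is numerically 0
```

Any T of this magnitude rejects H_0 at every usual level, matching the conclusion reached in section 6.2.2.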

Remark 4. It would also be of interest to construct a test for the weaker hypothesis H_0 : h(R, W) = h(R, W_0). It seems to us that this is a much harder problem. Indeed, it is true that ĥ(R, G_t) ∼ h(R, W_0) almost surely under H_0 by [8], and so we could imagine constructing a test based on ĥ. However, the statistical fluctuations of n(R, G_t) are exactly of the same order, which makes the asymptotic normality of ĥ(R, G_t) less evident.

Relation to other works
The work presented here relates to many of the papers cited previously. In our opinion, the two most closely related papers are [5] and [34]. We now discuss the relation between these papers and our results. In [5], the authors consider nonparametric estimation of certain functionals of the graphon in a sparse graphon model, with applications to hypothesis testing and network comparison. The sparse graphon model and the graphex model are different, though tightly related, as explained for instance in [8]. The functionals considered in [5,7] are related to expectations of counts of motifs too. We note, however, that they use a slightly different notion of counts than n(R, G_t), though this does not make a fundamental difference, and their procedures can easily be adapted to n(R, G_t). The paper [5] considers two subsampling procedures for counting motifs. Of particular interest for us is their uniform subsampling bootstrap. Given a sample G_n with n vertices, they consider the subgraph G̃_{n,m} obtained by uniformly sampling m < n vertices without replacement from G_n and returning the induced subgraph. For moderate m, this is morally equivalent to p-sampling. Sampling vertices without replacement is quite natural for a dense graphon sample, but less evident in the case of a sparse graphon sample. The issue is discussed for instance in [41]: in the classical dense graphon model G̃_{n,m} =_d G_m, which is not necessarily true in the sparse model. This is not an issue for estimating n(R, G_n), but it is if one wants to estimate var(n(R, G_n)), for instance to do hypothesis testing. Because the distribution of G̃_{n,m} has no distinguished properties, it is unclear that var(n(R, G_n)) can be estimated from var(n(R, G̃_{n,m}) | G_n); so [5] has to rely on rather complicated procedures to estimate var(n(R, G_n)).
In [34], the authors are also interested in estimating the counts of certain motifs using vertex sampling. They also consider two subsampling schemes. In particular, they consider uniform subsampling of vertices, which is almost p-sampling. The only difference is that they keep isolated vertices in the subsample while we do not. Since we only consider connected motifs, the two methods are the same, as isolated vertices cannot contribute to the motif. Consequently, our results in section 5.2 are partial restatements of [34], though they also consider a different notion of counts. We note that in [34] the authors are only interested in counting motifs. They do not make probabilistic assumptions on the graph and hence do not consider inference on the distribution of the graph.

Assessing the pertinence of sparse graphex models on some real networks
The tail-index of μ is a distinguished feature of sparse graphex processes (i.e. graphex processes satisfying assumptions 1 to 3 with σ ≠ 0). As such, it can be used to diagnose the pertinence of using such a model on some real datasets.
In particular, if G_t is a sample of a sparse graphex process, then we expect the degree distribution of G_t to be close to a power-law with index 1 + σ, and p ↦ log(N_1(G_t)/N_p(G_t)) to be closely linear with slope 1 + σ (as a function of log(1/p)), at least for p bounded away from 0 (the asymptotics being valid only in this region). Equivalently, we may look at the plot of p ↦ σ̂_p(G_t), which we call a Hill plot in analogy with the classical Hill plots used in extreme value theory to calibrate the Hill estimator [22]. If the model is correct, we expect the Hill plot to be constant, except perhaps at the boundary close to p = 0. The Hill plot is doubly interesting: it allows both for assessing the pertinence of the model and for choosing a reasonable value of p for the estimation of the tail-index (see also remark 2).
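To make the diagnostic concrete, here is a sketch of the tail-index estimator as we read it from the display above: assuming N_p(G_t) ≈ p^{1+σ} N_1(G_t), one takes σ̂_p = log(N_1/N_p)/log(1/p) − 1 and traces it against p. This functional form is our assumption, reconstructed from the text, not a verbatim formula from the paper.

```python
import math

def sigma_hat(N1, Np, p):
    """Tail-index estimate from the vertex count of the full graph (N1) and
    of a p-sample (Np), assuming N_p ~ p**(1 + sigma) * N_1."""
    return math.log(N1 / Np) / math.log(1.0 / p) - 1.0

# Synthetic Hill plot: with sigma = 0.5 the plot is flat at 0.5.
N1 = 100_000
for p in (0.25, 0.5, 0.75):
    Np = N1 * p ** 1.5              # idealized vertex count of a p-sample
    print(p, sigma_hat(N1, Np, p))  # 0.5 for every p: a flat Hill plot
```

On real data, N_p would be replaced by the (averaged) vertex count of actual p-samples, and deviations of the plot from a constant signal misspecification.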
We choose to illustrate the diagnostic on two datasets from [43], the Flickr and the Youtube datasets. These datasets are available from the KONECT database [36]. They are social network datasets where vertices are users and edges represent friendships. We note that the Flickr dataset is a directed graph, but we ignore the arrows and see it as an undirected graph. We summarize in table 3 the main statistics of these networks. These two networks are interesting in the way they have been sampled. Indeed, [43] crawled these networks by searching for the whole connected component reachable from a random set of users. The methodology may be summarized as starting from a single user and, in each step, retrieving the list of friends of a not-yet-visited user and adding these users to a list of users to visit. The process is continued until exhaustion of the list. This is known as breadth-first search (BFS) sampling. Interestingly, [43] crawled the Flickr and Youtube networks using BFS sampling every day over a long period of time (several weeks). In particular, each day they revisited all the users they had previously discovered, and in addition all the new users that were reachable from the previously known users. By choosing a reasonable set of users to start with, they have access to the whole largest connected component of these networks. Further, provided the BFS sampling goes well, their crawling mechanism may be viewed as observing snapshots of a graphex process at increasing times. We plotted side-by-side the Hill plot and the degree distribution, respectively for the Flickr and Youtube networks, in figs. 4 and 5. We plotted the degree distributions on a log-log scale, and added in red dotted lines a linear curve with slope −(1 + σ̂_p), where σ̂_p has been computed using p = 0.75. In case the model is correct, we then expect the degree distribution and the linear curve to align well. We can see in fig. 4 that the Flickr dataset exhibits some features of sparse graphex processes: its Hill plot is relatively flat with value ≈ 0.5, and the slope of its degree distribution on a log-log scale is close to −1.5. Regarding the Youtube dataset, in fig. 5, this is less evident; the plots even plead in favor of a misspecification of the sparse graphex model. This might be because the model is not reasonable, which is likely to be caused by the crawling procedure used in [43]: the data they had at their disposal make it uncertain that all users of Youtube have effectively been sampled [43, Section 4.2.4] 8.
Finally, we mention that these plots should only be used as visual diagnostics, and it is hard to draw firm conclusions about the validity of the model from them. One would in particular need a true statistical test for the degree distribution, yet it is unclear how to perform such a test at this time.

Consistent estimation of count statistics
We can also use the results of section 5.2 to count the number of motifs in a network. We illustrate this on the number of triangles in the Youtube network of the previous section. We note here that the number of triangles is equal to n(R, G_t)/6 where R is a triangle graph. We also note that the method is consistent regardless of the distribution of G_t, hence it does not matter that we found evidence that the Youtube dataset is unlikely to be a sample of a sparse graphex process. We plot in fig. 6 the estimated number of triangles and the estimated standard deviation as a function of the number of bootstrap samples used. The bootstrap samples used are small (p = 0.01), implying that on average each sample has ≈ 12 triangles, which is fast to count. Yet, with only 2000 samples (which can all be computed in parallel) the accuracy is already quite good.
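A minimal sketch of this experiment on a toy graph (not the Youtube data): count triangles exactly on each p-sample, rescale by p³, and report the mean and standard deviation across the bootstrap replicates. The representation and function names are ours.

```python
import math
import random

def triangles(G_adj):
    """Exact triangle count (= n(R, G)/6 for R a triangle): for each edge
    u < v, count common neighbours; every triangle is counted 3 times."""
    return sum(len(G_adj[u] & G_adj[v])
               for u in G_adj for v in G_adj[u] if u < v) // 3

def p_sample(G_adj, p, rng):
    kept = {v for v in G_adj if rng.random() < p}
    return {v: G_adj[v] & kept for v in kept}

def bootstrap_triangles(G_adj, p, B, seed=0):
    """Mean and sample standard deviation of the rescaled triangle counts
    over B p-samples (each count can be computed in parallel)."""
    rng = random.Random(seed)
    xs = [triangles(p_sample(G_adj, p, rng)) / p ** 3 for _ in range(B)]
    mean = sum(xs) / B
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (B - 1))
    return mean, sd

K4 = {i: {j for j in range(4) if j != i} for i in range(4)}
print(triangles(K4))                          # 4
print(bootstrap_triangles(K4, p=1.0, B=10))   # (4.0, 0.0): p = 1 is exact
```

With small p each replicate touches only a tiny fraction of the graph, which is what makes the Youtube experiment (p = 0.01, 2000 samples) cheap.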

Testing for count statistics
We have seen in section 6.1 that the degree distribution of the Flickr network was plausibly well explained as a sample from a graphex process. In particular, it was found that σ = 0.5 explains quite well the "sparsity" and the degree distribution. The current state-of-the-art algorithm for inference on graphex processes is the algorithm of [13] and its subsequent generalization in [51]. Here we focus on the simpler version of Caron and Fox [13]. Caron and Fox's algorithm is based on the GGP model previously described in example 5. Hence the model is a parametric family with parameter set {W ≡ W_{σ,τ} : σ < 1, τ > 0}. One of the desirable features of the GGP model is that the parameter σ is exactly the tail-index of μ described in section 3.4 9. Though simple, this model is able to accommodate σ ∈ [0, 1), that is, from dense to sparse with power-law. The other parameter τ acts as a cut-off for the degree distribution, controlling the separation between the power-law regime and the non-power-law regime (corresponding to larger degrees), see [13]. In particular, the GGP model with (σ, τ, t) = (0.5, 6.3, 12000) satisfies E_W[v(G_t)] ≈ 2.3 million and E_W[e(G_t)] ≈ 23 million. So, the degree distribution and the size of Flickr are well explained by the GGP model with (σ, τ, t) = (0.5, 6.3, 12000), as emphasized by table 3 and fig. 7. Due to its relative simplicity, this model enables efficient and tractable computations [13,51]. It then makes sense to determine whether other aspects of the structure of Flickr can be as well explained by the same model. Here we test whether the number of triangles observed in Flickr is compatible with the GGP model with H_0 : (σ_0, τ_0, t_0) = (0.5, 6.3, 12000). Under H_0, by numerical integration we find that n̄(R, W_0, t_0) = 3.75 · 10^8 when R is a triangle.
Using the bootstrap procedure of section 5 on the Flickr dataset (we used p = 0.1 and 2000 bootstrap samples), we estimated n̂(R, G_t) = 5.03 · 10^9 and ŝ²(G_t) = 1.93 · 10^8. This gives an observed value of the test statistic T(R, G_t) of approximately 24.2, and hence a p-value ≈ 0. Hence H_0 is rejected in this case: unsurprisingly, it does not seem reasonable that the Flickr dataset is well explained by a model as simple as the GGP model. This emphasizes the need for more advanced algorithms for inference, in particular for nonparametric estimation.

Test-train split
Model evaluation is a key component of data analysis. Often, this involves randomly splitting the available data into a test set and a training set. There are many seemingly natural ways to partition a graph, so some care is required in splitting the data. The choice of partitioning scheme may induce a significant sampling bias in the test and training sets, impeding evaluation and complicating model comparison. Consider for instance the case of independent and identically distributed observations (Y_1, …, Y_n) ∼ P^n: it is natural to split the data (Y_1, …, Y_n) by subsampling uniformly without replacement 0 ≤ m ≤ n observations within the sample. This is because the obtained subsample has the desirable property of being distributed according to P^m. We argue that in the context of graphex processes, the natural way to split the graph is precisely the p-sampling procedure described in section 3.3, because it is the only procedure leaving (up to rescaling) the distribution of the graph invariant [53,9].
To illustrate the difficulty with test-train splitting of graphs, we built two training sets from the Flickr dataset: one using the p-sampling procedure, and the other using so-called star sampling [11,35]. The star sampling we used consists of selecting a set of vertices by uniform sampling with probability p (the star centers), selecting for each of these centers all of its neighbors (the star points), and then returning the induced subgraph. We used p = 0.2 and p = 0.013… respectively, to obtain training sets with a comparable number of vertices (201633 for p-sampling and 193075 for star sampling). However, the structures of the obtained training sets are very different. The p-sample training set has 952723 edges, while the star sample has 15739418 edges. Hence the star training set is much denser, as expected. The degree distributions of the two training sets are also quite different, as illustrated in fig. 8. On the same figure, we drew a line with slope −1.5, which was found to explain well the degree distribution of the complete dataset. We see that it also fits quite well the degree distribution of the p-sampled training set. So, at least in terms of degree distribution, the p-sampling training set captures the structure of the complete dataset better.
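For illustration, here is a sketch of the two splitting schemes on an adjacency-set representation; the toy graph, the parameter names, and the choice to discard cross edges between train and test are our assumptions, not details from the paper.

```python
import random

def p_sampling_split(G_adj, p, seed=0):
    """Test-train split by p-sampling: train is the induced subgraph on a
    Bernoulli(p) vertex sample, test the induced subgraph on the remaining
    vertices (edges between the two parts are discarded)."""
    rng = random.Random(seed)
    train_v = {v for v in G_adj if rng.random() < p}
    train = {v: G_adj[v] & train_v for v in train_v}
    test = {v: G_adj[v] - train_v for v in G_adj if v not in train_v}
    return train, test

def star_sample(G_adj, p, seed=0):
    """Star sampling for comparison: Bernoulli(p) centers plus all their
    neighbours, then the induced subgraph; typically much denser."""
    rng = random.Random(seed)
    centers = {v for v in G_adj if rng.random() < p}
    kept = centers | {u for c in centers for u in G_adj[c]}
    return {v: G_adj[v] & kept for v in kept}

# A 6-cycle: every vertex ends up in exactly one of train / test.
C6 = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
train, test = p_sampling_split(C6, p=0.5, seed=1)
print(sorted(train), sorted(test))  # disjoint vertex sets covering 0..5
```

The density gap reported in the text (952723 vs 15739418 edges) reflects exactly the difference between these two schemes: star sampling keeps every edge incident to a sampled center, so high-degree vertices drag in large neighbourhoods.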
In the situation where N_{p,t} ≥ 1, we have To ease notation, we define, We now introduce functions ϕ_1, ϕ_2 : (−1, ∞) → R_+ such that for every z ≠ 0, For z = 0 the functions ϕ_1 and ϕ_2 are extended by continuity. The functions ϕ_1 and ϕ_2 are non-negative and monotonically decreasing on (−1, ∞). We can therefore write, when N_{p,t} ≥ 1, But, by Young's inequality we have Combining the Young inequality estimate with equations (A.1) and (A.2) gives Then the conclusion of theorem 1 follows from the estimates in sections A.3 and A.4.

A.3. Study of the bias term b σ,t
The estimate for the bias term follows from the expectation of N p,t for p ∈ [0, 1] given in the next proposition. Remark that from the definition of N p,t , recalling

Proposition 7.
For any p ∈ [0, 1] and t > 0 it holds, Proof. We assume here that p ≠ 1. The case p = 1 follows from similar steps, or by cautiously taking the limit p → 1 in the proof. Then we have, where the last line follows from independence and the dominated convergence theorem to take care of the infinite product. It is easily seen from the definition of B that for all i ≠ j, It follows, invoking the Slivnyak-Mecke formula [19, Chapter 13], Then the conclusion of the proposition follows by Campbell's formula [33, Section 3.2] because, It is clear from the result of the previous proposition that we have, It then follows from assumption 3 that,
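For reference, the two point-process identities invoked throughout these computations, stated for a Poisson process Π on a space 𝒳 with intensity measure λ (see [33, Section 3.2] and [19, Chapter 13]), read:

```latex
% Campbell's formula:
\mathbb{E}\Big[\sum_{x\in\Pi} f(x)\Big] \;=\; \int_{\mathcal{X}} f(x)\,\lambda(\mathrm{d}x),
% Slivnyak--Mecke formula:
\mathbb{E}\Big[\sum_{x\in\Pi} g\big(x,\,\Pi\setminus\{x\}\big)\Big]
  \;=\; \int_{\mathcal{X}} \mathbb{E}\big[g(x,\Pi)\big]\,\lambda(\mathrm{d}x).
```

Campbell's formula handles sums of a function of a single point, while Slivnyak-Mecke allows the summand to depend on the rest of the process, which is what the moment computations below require.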

A.4. Variance estimates for N p,t
We now compute the estimates for E[Z_1²] and E[Z_p²] in the proof of the main theorem. We assume here that p ≠ 1. The case p = 1 follows by cautiously taking the limit p → 1 in the proof. To shorten the coming equations, we write q := 1 − p. We then have, We call p² S_1 the first term of the right-hand side of the previous equation, and p² S_2 the second term, so that N²_{p,t} = p² S_1 + p² S_2. That is, We now compute E[S_1]. Expanding the square, By independence, and using the dominated convergence theorem to take care of the infinite product, From the definition of B we get that, Therefore, Invoking the Slivnyak-Mecke theorem [19, Chapter 13], Therefore, by Campbell's formula [33, Section 3.2] (see also the proof of proposition 7), That is, We now compute E[S_2]. We start with the expansion of the product in the expression of S_2, The following is justified by independence and the dominated convergence theorem, That is, because of equations (A.3) and (A.4), We rewrite the previous display in a slightly different form in order to be able to use the Slivnyak-Mecke theorem [19, Chapter 13],

Z. Naulet et al.
Then we can apply the Slivnyak-Mecke theorem to find that, Again, we rewrite the previous display in a more convenient form to apply the Slivnyak-Mecke theorem a second time, Applying the Slivnyak-Mecke theorem to the previous display, Using Campbell's formula [33, Section 3.2],
Hence, it remains to prove equation (B.2). Let G̃ ∈ L be the graph constructed in section 3.2, and let G̃_t ⊆ G̃ be the subgraph consisting only of vertices v_i = (t_i, x_i) with t_i ≤ t. Likewise, G_t is (in distribution) the subgraph of G̃_t consisting only of non-isolated vertices. But, since R is connected, we have n(R, G_t) =_d n(R, G̃_t). Write V = {(t_1, x_1), …}, so that we can assume wlog that V(G̃_t) = N and V(R) = N_k, and we can rewrite, Then, it is enough to establish that var(ñ_t(R, V)) ∼ c t^{2k−1} to finish the proof. We note that by equation (B.3) We let N_k : A_k² → Z_+ be such that N_k(i, i′) counts the number of distinct elements in i ∪ i′. Then, All the terms in the previous summation have finite expectation under assumption 4 if N ≥ 2k. Also, using the Slivnyak-Mecke formula, we see that the expectation of the term corresponding to m = 2k is exactly E[ñ_t(R, V)]² (so indeed it is enough to have N = 2k − 1). Similarly, the expectation of the other terms is seen to be O(t^m). It follows that, which behaves as c t^{2k−1} when t → ∞, for a constant c > 0 depending only on W and R.

B.4. Proof of proposition 5
We define for simplicity c(G_t) := var(n(R, ps(G_t, p)) | G_t). So, in view of proposition 4 and its proof, we then have, Thus it is enough to show that var(c(G_t))/E_W[c(G_t)]² = O(t^{−1}). We obtained in appendix B.3 that E_W[c(G_t)] ≍ t^{2k−1}, thus the proposition is proved if we show that var_W(c(G_t)) = O(t^{4k−3}). We use the same notations and definitions as in appendix B.3. Then, with the same arguments as in appendices B.2 and B.3, we obtain that where, It is easily seen that under assumption 4 with N ≥ 2k_1 + 2k_2 − 2, all the terms involved in equation (B.4) have finite expectations. Also, the terms corresponding to m_1 + m_2 ≤ 4k − 3 are seen to have expectation O(t^{4k−3}). But the only way to have m_1 + m_2 > 4k − 3 is to have m_1 = 2k − 1 and m_2 = 2k − 1. Hence, we deduce that, We also note that by equation (B.8) and the same arguments that led to the previous formula, var_W(C^{a_1,a_2}_{2k−1}).
To finish the proof, we note that C̃^{a_1,a_2} where Q is the connected graph with 2k − 1 vertices obtained by joining two copies R_1 and R_2 of R in such a way that vertex a_1 of R_1 and vertex a_2 of R_2 become the same vertex in Q. Hence var_W(C_{2k−1}) = O(t^{4k−3}) by equation (B.2), and var_W(c(G_t)) = O(t^{4k−3}) too.