Sparse Networks with Core-Periphery Structure

We propose a statistical model for graphs with a core-periphery structure. To do this we define a precise notion of what it means for a graph to have this structure, based on the sparsity properties of the subgraphs of core and periphery nodes. We present a class of sparse graphs with such properties, and provide methods to simulate from this class, and to perform posterior inference. We demonstrate that our model can detect core-periphery structure in simulated and real-world networks.


Introduction
Network data arises in a number of areas, including social, biological and transport networks. Modelling this type of data has become increasingly important in recent years, partly due to the greater availability of such data, and partly because we increasingly want to model complex systems with many interacting components. A central topic within this field is the design of random graph models. Various models have been proposed, building on the early work of Erdős [13]. A key goal of these models is to capture important properties of real-world graphs; in particular, a lot of effort has gone into capturing the sparsity properties of real-world networks.

Statistical network models with core-periphery structure

Definitions
In this section, we formally define what it means for graphs to be sparse and to have a core-periphery structure.
Let G = (G_α)_{α≥0} be a family of growing undirected random graphs with no isolated vertices, where α ≥ 0 is interpreted as a size parameter. Here G_α = (V_α, E_α), where V_α and E_α are the sets of vertices and edges respectively. Denote by N_α = |V_α| and N^(e)_α = |E_α| the numbers of nodes and edges in G_α, and assume N_α, N^(e)_α → ∞ almost surely as α → ∞. We first give the definition of sparsity for the family G = (G_α)_{α≥0}.

Definition 2.1 (Sparsity). We say that the graph family G = (G_α)_{α≥0} is dense if

lim inf_{α→∞} N^(e)_α / (N_α)^2 > 0 almost surely, (2.1)

and sparse if

N^(e)_α / (N_α)^2 → 0 almost surely as α → ∞. (2.2)

Definition 2.2 (Core-periphery structure). We say that a graph family G = (G_α)_{α≥0} is sparse with core-periphery structure if the graph is sparse with a dense core subgraph; that is, writing N_{α,c} and N^(e)_{α,c} for the numbers of nodes and edges of the subgraph induced by the core nodes,

N^(e)_α / (N_α)^2 → 0 and lim inf_{α→∞} N^(e)_{α,c} / (N_{α,c})^2 > 0 almost surely. (2.3)

In other words, the core corresponds to a small dense subgraph of a sparse graph, with sparse connections to the other part of the graph, which is called the periphery.
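As an illustration, the finite-graph analogues of these density ratios can be computed directly. The sketch below is a finite-size heuristic only (the definitions are asymptotic); the function names and the density threshold are our own:

```python
# Finite-size heuristic for the density ratios in the sparsity and
# core-periphery definitions. `edges` is a set of unordered node pairs
# (i, j), and `core`/`periphery` are sets of node ids. The threshold
# used to call a subgraph "dense" is an arbitrary illustrative choice.

def density(edges, nodes):
    """Number of edges among `nodes`, divided by |nodes|^2."""
    nodes = set(nodes)
    m = sum(1 for (i, j) in edges if i in nodes and j in nodes)
    return m / max(len(nodes), 1) ** 2

def looks_core_periphery(edges, core, periphery, dense_thresh=0.1):
    """Crude finite-size proxy: overall graph sparse, core subgraph dense."""
    overall = density(edges, core | periphery)
    core_d = density(edges, core)
    return core_d > dense_thresh and overall < core_d
```

For example, a triangle of core nodes with pendant periphery nodes attached to one of them passes this check, while a dense overall graph does not.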

A model for networks with core-periphery structure
Having defined what we mean by core-periphery structure, we now give a construction of a core-periphery network. Building on the model of [41], we also allow the network to exhibit community structure. Following [6], we represent a graph by the point process on the plane

Z = Σ_{i,j} z_ij δ_{(θ_i, θ_j)}, (2.4)

where z_ij = z_ji = 1 if i and j are connected, and 0 otherwise. Here, each node i is located at some point θ_i ∈ R_+ = [0, ∞). A finite graph G_α of size α > 0 is obtained by considering the restriction of Z to [0, α]^2; see [6]. As in [41], we consider that the probability of a connection between nodes i and j is given by the link function

Pr(z_ij = 1 | (w_l1, w_l2, . . . , w_lK)_{l=1,2,...}) = 1 − exp(−Σ_{k=1}^K w_ik w_jk), i ≠ j. (2.5)
Here, w_i1 ≥ 0 is the core parameter and the parameters w_ik > 0, for k = 2, . . . , K, are interpreted as the degrees of affiliation of node i to communities k = 2, . . . , K. In the particular case K = 2, we have core-periphery but no community structure. The model parameters (w_i1, w_i2, . . . , w_iK, θ_i)_{i=1,2,...} are the points of a Poisson point process on R^{K+1}_+ with mean measure ρ(dw_1, dw_2, . . . , dw_K)dθ, where ρ is a σ-finite measure concentrated on R^K_+ \ {(0, 0, . . . , 0)} which satisfies

∫_{R^K_+} min(1, Σ_{k=1}^K w_k) ρ(dw_1, dw_2, . . . , dw_K) < ∞.

As shown in [41], the resulting graph is sparse if

ρ(R^K_+) = ∞ (2.6)

and dense otherwise. [41] considered a specific class of models for ρ, based on compound completely random measures (CRMs) [17], where ρ is concentrated on (0, ∞)^K; that is, for all nodes i and communities k we have w_ik > 0. As a consequence, the sparsity pattern is homogeneous across the graph. We consider a different and more flexible construction here, where w_i1 ≥ 0 and w_ik > 0 for k = 2, . . . , K. As will be shown in Theorem 3.1, the core-periphery property can be enforced by making the following assumptions on the mean measure ρ:

∫_{(0,∞)×{0}^{K−1}} ρ(dw_1, dw_2, . . . , dw_K) = 0, (2.7)

∫_{{0}×(0,∞)^{K−1}} ρ(dw_1, dw_2, . . . , dw_K) > 0, (2.8)

0 < ∫_{(0,∞)×R^{K−1}_+} ρ(dw_1, dw_2, . . . , dw_K) < ∞. (2.9)

We identify C = {i | w_i1 > 0} as the set of core nodes, and the remaining ones P = {i | w_i1 = 0} as the periphery nodes. Assumption (2.7) ensures that we do not have a subset of core nodes disconnected from the rest of the graph. Along with Equation (2.8), it also ensures that each node has at least one strictly positive community affiliation parameter. In practice, we use a construction where all of the community affiliation parameters are strictly positive. The positivity assumptions in Equations (2.8) and (2.9) ensure that neither the core nor the periphery is empty, with probability 1.
The boundedness assumption in Equation (2.9) ensures that the subgraph of core nodes is dense, as will be shown in Theorem 3.1. Note that the overall graph may be sparse or dense, depending on whether the integral in Equation (2.8) is finite or not.
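For concreteness, a graph can be sampled from the connection probability 1 − exp(−Σ_k w_ik w_jk) as follows. This is a minimal sketch with illustrative function names; self-links are ignored here:

```python
import math
import random

# Sketch of sampling an adjacency matrix under the link function
# Pr(z_ij = 1) = 1 - exp(-sum_k w_ik * w_jk). The weight vectors passed
# in are placeholders; self-links are ignored in this sketch.

def connect_prob(w_i, w_j):
    return 1.0 - math.exp(-sum(a * b for a, b in zip(w_i, w_j)))

def sample_adjacency(weights, rng=None):
    rng = rng or random.Random(0)
    n = len(weights)
    z = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < connect_prob(weights[i], weights[j]):
                z[i][j] = z[j][i] = 1
    return z
```

Note that two periphery nodes (w_i1 = 0) can still connect with high probability through their community parameters, which is what allows high-degree periphery nodes.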
We now propose a way to construct such a measure ρ. We start with a base Lévy measure ρ_0 on (0, ∞). Let (w_i0, θ_i)_{i≥1} be the points of a Poisson point process with mean measure ρ_0(dw_0)dθ, and set

(w_i1, w_i2, . . . , w_iK) = (β_i1 w_i0, β_i2 w_i0, . . . , β_iK w_i0), (2.10)

where, given w_i0, the scores β_i1 ∈ {0, 1} and (β_i2, . . . , β_iK) ∈ (0, ∞)^{K−1}, i ≥ 1, are mutually independent, with β_i1 | w_i0 ∼ Bernoulli(1 − e^{−w_i0}) and (β_i2, . . . , β_iK) ∼ F, where F is a probability distribution on (0, ∞)^{K−1}. In other words, with probability 1 − e^{−w_i0}, w_i1 = w_i0 > 0 and i is a core node, while with probability e^{−w_i0}, w_i1 = 0 and i belongs to the periphery. The measure ρ is then given by

ρ(dw_1, dw_2, . . . , dw_K) = ∫_0^∞ [(1 − e^{−w_0}) δ_{w_0}(dw_1) + e^{−w_0} δ_0(dw_1)] F_{w_0}(dw_2, . . . , dw_K) ρ_0(dw_0), (2.11)

where F_{w_0} denotes the distribution of (β_2 w_0, . . . , β_K w_0) with (β_2, . . . , β_K) ∼ F. In this paper, we set ρ_0 to be the mean measure of the jump part of a generalized gamma process (GGP),

ρ_0(dw_0) = (1/Γ(1 − σ)) w_0^{−1−σ} e^{−τ w_0} dw_0, (2.12)

where (σ, τ) satisfy σ ∈ (0, 1), τ ≥ 0 or σ ∈ (−∞, 0], τ > 0. The GGP has been used extensively due to its flexibility, the interpretability of its parameters and its conjugacy properties [26,7]. Note crucially that, although the formulation (2.10) resembles that of the class of compound CRMs introduced by [17], in our construction the scores β_i1 are not identically distributed, and depend on the base parameter w_i0. This key difference enables the model to have different sparsity properties, as shown in Section 3.
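The score step of this construction can be sketched as follows. We stress that this is not an exact GGP simulation: the base weights w_i0 are supplied by the caller, and the function name is illustrative.

```python
import math
import random

# Sketch of the score step: given base weights w_i0 (drawn by the caller;
# this is NOT an exact GGP simulation), node i becomes a core node with
# probability 1 - exp(-w_i0), in which case w_i1 = w_i0; its community
# affiliations are w_i0 times independent Gamma(a_k, b_k) scores (b_k is
# a rate, hence the 1/b_k scale below).

def assign_weights(w0, a, b, rng=None):
    rng = rng or random.Random(1)
    nodes = []
    for w in w0:
        core = rng.random() < 1.0 - math.exp(-w)
        w1 = w if core else 0.0
        comm = [w * rng.gammavariate(ak, 1.0 / bk) for ak, bk in zip(a, b)]
        nodes.append([w1] + comm)
    return nodes
```

Nodes with large base weight w_i0 are thus almost surely in the core, while nodes with small w_i0 tend to fall in the periphery, mirroring the Bernoulli(1 − e^{−w_i0}) mechanism above.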
Of particular interest, for computational reasons, is the case where F is set to be the product of independent gamma distributions,

F(dβ_2, . . . , dβ_K) = ∏_{k=2}^K (b_k^{a_k} / Γ(a_k)) β_k^{a_k − 1} e^{−b_k β_k} dβ_k, (2.13)

where a_k > 0 and b_k > 0 are hyperparameters for k = 2, . . . , K. We simulate a network from the above model with parameters K = 2, α = 200, σ = 0.2, τ = 1, b = 0.5, a = 0.2. This corresponds to the case where we have core-periphery but no community structure. In either case (with or without community structure), we refer to our model as the Sparse Core Periphery (SparseCP) model. For comparison, we also simulate a network from the Sparse Network with Overlapping Communities (SNetOC) model of [41] with K = 2 communities and parameters α = 100, σ = 0.2, τ = 2.5, b = (0.2, 0.05), a = (0.4, 0.1). The parameters are chosen so that the graphs have similar sizes and overall degree distributions. In Figures 1(c) and 1(d), the nodes are ordered according to the community where they have the largest affiliation for the SNetOC model, and by core (w_i1 > 0) or periphery (w_i1 = 0) affiliation for the SparseCP model.

Fig 1. Comparing a core-periphery graph with no community structure simulated from the SparseCP model to a graph with two communities from the SNetOC model of [41]. Plots of (a) the graph with two communities, with nodes from each community in green and red respectively, and (b) the core-periphery graph with core nodes in green and periphery nodes in red. In each case the size of the nodes is proportional to their mean sociability. Adjacency matrices for (c) the graph with two communities and (d) the core-periphery graph. In each case the nodes are ordered by community or core membership, with the red lines partitioning the two groups. Degree distributions of (e) the subgraphs of nodes within communities 1 and 2 for the two-community graph, and (f) the subgraphs of core and periphery nodes for the core-periphery graph.
In Figures 1(e) and 1(f) we represent the empirical degree distributions of nodes within each community for SNetOC, and for the core and periphery subgraphs for the SparseCP model. In these plots, and all subsequent degree distribution plots, the degrees have been binned together in evenly spaced logarithmic bins; see e.g. [34]. Although the overall degree distributions for both models are very close (see Figure 2), the degree distributions of the subgraphs are very different: the degree distributions within each subgraph have a similar power-law behaviour for SNetOC, whereas for the SparseCP model the core exhibits a Poisson-like behaviour and the periphery a power-law behaviour.
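The logarithmic binning used in these plots can be sketched as follows (an illustrative implementation, assuming the degree sequence contains at least two distinct values):

```python
# Sketch: binning a degree sequence into evenly spaced logarithmic bins,
# as used for the degree-distribution plots (cf. [34]). Bin edges grow
# geometrically from the minimum to the maximum degree, and counts are
# normalized by bin width. Assumes min(degrees) < max(degrees).

def log_bin_degrees(degrees, n_bins=10):
    lo, hi = min(degrees), max(degrees)
    edges = [lo * (hi / lo) ** (i / n_bins) for i in range(n_bins + 1)]
    edges[-1] += 1e-9  # include the maximum degree in the last bin
    counts = [0.0] * n_bins
    for d in degrees:
        for k in range(n_bins):
            if edges[k] <= d < edges[k + 1]:
                counts[k] += 1
                break
    # normalize by bin width to get a density per bin
    return edges, [c / (edges[k + 1] - edges[k]) for k, c in enumerate(counts)]
```

Plotting the resulting densities against the bin midpoints on log-log axes smooths out the noisy tail that raw degree histograms exhibit for power-law data.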

Parameter interpretability
In summary, the different parameters in the model control the structure of the network as follows:
• w_i1 - We interpret w_i1 as a coreness parameter, with w_i1 > 0 indicating that node i is in the core.
• w_i2, . . . , w_iK - When K = 2, we can interpret w_i2 as an overall sociability parameter. Otherwise, w_i2, . . . , w_iK indicate the affiliation to each of the respective communities, and we can interpret them as sociabilities within each of these communities.
• σ - As we will see from Theorem 3.2, σ controls the overall sparsity of the network, with a higher value leading to sparser networks. It also controls the size of the core, with a larger value of σ leading to a smaller relative core size.
• τ - As we can see from the theoretical results, τ does not affect the asymptotic rates for the densities of the different regions. Its main effect is to induce an exponential tilting of large degrees in the degree distribution, as we see in Figure 18 in Appendix B.
• a_2, . . . , a_K, b_2, . . . , b_K - The hyperparameters of the community sociability parameters control the distribution F defined in Equation (2.13), which is a product of Γ(a_k, b_k) distributions. Increasing the a_k decreases the relative size of the core by changing the sizes of the w_ik relative to w_i1, while increasing the b_k increases the relative size.
• α - This is a size parameter, tuning the number of nodes and edges in the network.
In the following section we show theoretically and through simulations that these models recover the core-periphery structure defined in Definition 2.2.

Core-periphery and sparsity properties
In this section we study the asymptotic behaviour of the number of nodes N_α, the numbers of nodes in the core N_{α,c} and in the periphery N_{α,p}, together with the numbers of edges N^(e)_{α,c}, N^(e)_{α,p} and N^(e)_{α,c−p} within the core, within the periphery, and between the core and periphery; these are the key quantities needed to understand the core-periphery structure. They are defined by counting, among the non-isolated nodes i with θ_i ≤ α, those with w_i1 > 0 (core) and w_i1 = 0 (periphery), and by counting the edges of G_α with both endpoints in the core, both in the periphery, or one in each.

Theorem 3.1. Consider the graph family defined by Equations (2.4) and (2.5), with a mean measure ρ satisfying Assumptions (2.7)-(2.9). Then the subgraph of core nodes is dense. If, additionally, the integral in Equation (2.8) is infinite, then the graph is sparse overall and (2.2) holds. Otherwise, it is dense, and (2.1) holds.
We now characterize more precisely the sparsity properties of the particular model described by Equations (2.11)-(2.13) in Section 2.2.
Theorem 3.2. Consider the graph family defined by Equations (2.4) and (2.5), where the Lévy measure ρ takes the form of Equation (2.11), with a generalized gamma process base measure ρ_0, and F a product of independent gamma distributions. Assume further that σ ∈ (0, 1) and τ > 0, so that the overall graph is sparse. Then, almost surely as α tends to infinity, the number of edges in each region grows as a power β of the number of nodes, N^(e)_{α,•} ≍ (N_{α,•})^β, where

β = 2/(1+σ) in the overall graph and in the periphery region,
β = 2/(1+σ/2) in the core-periphery region,
β = 2 in the core region.

When σ < 0, β = 2 in each region. A natural consequence of the definition of core-periphery structure that we use is that the relative size of the core tends to zero, since the overall graph must be sparse. In Corollary 3.2.1 we confirm this, noting that, for σ ∈ (0, 1), N_{α,c}/N_α → 0 almost surely.
In Figure 3, we see a graphical representation of these results. From Figure 3(a) we see that the number of edges grows quadratically with the number of nodes when σ < 0 (and thus the graph is dense) and otherwise grows with power-law exponent 2/(1+σ). We see similar behaviour for the periphery and core-periphery regions, but in Figure 3(b) we see that the core region is dense for any value of σ. In Appendix B we present some more empirical results on the effects of varying the model parameters on the degree distribution, sparsity properties and core proportion.

Fig 3. Relationship between nodes and edges for varying σ for (a) the overall graph, (b) the core, (c) the core-periphery region and (d) the periphery. In each case we use K = 2, τ = 1, a = 0.2, b = 1/K and vary α to generate different sized graphs. We also plot a quadratic relationship between nodes and edges, for comparison.
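The exponents shown in plots like Figure 3 can be estimated by a least-squares fit on log-log pairs of node and edge counts from graphs of growing size; a minimal sketch:

```python
import math

# Sketch: estimating the exponent beta in N_e ~ N^beta by ordinary
# least squares on (log N, log N_e) pairs from graphs of increasing size.

def fit_exponent(node_counts, edge_counts):
    xs = [math.log(n) for n in node_counts]
    ys = [math.log(e) for e in edge_counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den  # slope of the log-log regression line
```

For a sparse graph with σ = 0.2 one would expect the fitted slope for the overall graph to be close to 2/(1 + 0.2) ≈ 1.67, and close to 2 for the core subgraph.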

Experiments
In this section, we present a series of experiments where we perform posterior inference using our model. We first introduce the MCMC algorithm we use to perform this inference. We then detail experiments performed on simulated and real data sets. Our goal here is threefold. Firstly, we show the ability of our model to capture both the core-periphery structure and the sparsity properties of networks. Secondly, we emphasise the benefits of our approach over that of [41]. Finally, considering our model purely as a core-periphery detection method, we show that it compares favourably against some other standard methods.

Posterior inference
In order to test our model, we design an MCMC algorithm to perform posterior inference, based on those of [6] and [41]. We assume that we have observed a set of connections (z_ij)_{1≤i,j≤N_α}, where N_α is the number of nodes with at least one connection. We want to infer the parameters (w_i1, . . . , w_iK)_{i=1,...,N_α}. We also want to estimate the sums of the parameters for the nodes with no connection (w_{*1}, . . . , w_{*K}), the hyperparameters φ of the mean measure ρ, and the parameter α, which is also assumed to be unknown. Thus the aim is to sample from the posterior

p((w_i1, . . . , w_iK)_{i=1,...,N_α}, (w_{*1}, . . . , w_{*K}), φ, α | (z_ij)_{1≤i,j≤N_α}).

In order to do this, we define an algorithm which combines Metropolis-Hastings (MH) and Hamiltonian Monte Carlo (HMC) updates within a Gibbs sampler. The details of the algorithm are provided in Appendix C.
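As a flavour of such a sampler, a single random-walk Metropolis-Hastings step on a positive parameter such as α might look as follows. Here `log_post` is a stand-in for the model's unnormalized log posterior, which is not reproduced here:

```python
import math
import random

# Minimal sketch of one Metropolis-Hastings step with a random-walk
# proposal on log(alpha), keeping the parameter positive. `log_post` is
# a placeholder for the model's unnormalized log posterior.

def mh_step(alpha, log_post, step=0.1, rng=None):
    rng = rng or random.Random(0)
    prop = alpha * math.exp(step * rng.gauss(0.0, 1.0))
    # proposal is log-normal, so the acceptance ratio includes the
    # Jacobian term log(prop / alpha)
    log_acc = log_post(prop) - log_post(alpha) + math.log(prop / alpha)
    return prop if rng.random() < math.exp(min(0.0, log_acc)) else alpha
```

In the full algorithm, steps of this kind for the hyperparameters would alternate with HMC updates of the node weights inside a Gibbs sweep.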

Simulated data
We start by considering simulated data. This is firstly in order to study the convergence of our MCMC algorithm. Secondly, it allows us to compare our model to that of [41] in a setting where we have complete information about the ground truth.
In order to study the convergence of our algorithm, we generate synthetic data from the SparseCP model described in Section 2, with the construction given in Section 2.2. We generate a graph with K = 2, i.e. with core-periphery structure but no community structure. We use parameters α = 200, σ = 0.2, τ = 1, b = 1/K, a = 0.2. In our case, the sampled graph has 778 nodes and 5984 edges. We fit our model to the simulated network, placing vague Γ(0.01, 0.01) priors on the unknown parameters α, 1 − σ, τ, a_k and b_k. We initialize the parameters with an initialization run (as described in Section C.4 of Appendix C) of 10 000 steps. We then run 3 parallel MCMC chains of 500 000 iterations, discarding the first 375 000 samples as burn-in. Finally, we thin the remaining 125 000 samples to give a sample size of 500. Trace plots and convergence diagnostics are given in Section D.1.1 of Appendix D.
Our model accurately recovers the mean sociability parameters (in this case, the mean w̄_i = (w_i1 + w_i2)/2 of the core and overall sociability parameters) of both high and low degree nodes, providing reasonable credible intervals in each case, as shown in Figures 4(a) and 4(b). In Figure 4(c) we see some high degree periphery nodes in particular being misclassified into the core. However, we expect that these high degree periphery nodes will be among the hardest to classify correctly. By generating 5000 graphs from the posterior predictive distribution, we also see that the model fits the empirical power-law degree distribution of the generated graph well, as shown in Figure 4(d). Here, and in other plots of the posterior predictive degree distribution, the lower bound of the 95% credible interval is equal to 0 for some degree bins; since the plot is on the log scale, we cannot properly represent the interval in this case. Instead, only the upper bound of the interval is shown, with all the space below this line also within the credible interval.
Moreover, we see that we are able to recover the classification into core and periphery very accurately. In our generated graph, 137 out of 144 nodes are classified correctly into the core, and 631 out of 634 are classified correctly into the periphery. Importantly, the core is not simply comprised of the nodes with the highest degrees. This comes from the form of the probability of connection between two nodes, 1 − exp(−Σ_{k=1}^K w_ik w_jk). Thus a node i in the periphery, and so with w_i1 = 0, can have a high probability of connection with another node j if they both have large overall sociabilities w_i2, w_j2. Thus, nodes not in the core can still have a high degree. Conversely, nodes in the core but with low overall sociabilities may have low degrees. In Figure 4(c) we see the credible intervals for the core sociability parameters, and we see that there are several high degree nodes in the periphery which are identified correctly. This means that our algorithm is not simply classifying the highest degree nodes into the core.
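The classification step described here can be sketched as follows, assuming posterior samples of the core indicators β_i1 are available (function names are illustrative):

```python
# Sketch: point classification of nodes into core/periphery from posterior
# samples of the indicators beta_i1, and the resulting accuracy against a
# known ground truth (available for simulated graphs).

def classify_core(beta_samples):
    """beta_samples: list of sample vectors, each a list of 0/1 indicators."""
    n = len(beta_samples[0])
    freq = [sum(s[i] for s in beta_samples) / len(beta_samples) for i in range(n)]
    # assign node i to the core if its posterior inclusion frequency > 1/2
    return [1 if f > 0.5 else 0 for f in freq]

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)
```

Reporting accuracy separately on the true core and true periphery, as done above (137/144 and 631/634), avoids the class imbalance hiding misclassified core nodes.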
In Appendix D, we also test our model on a graph with core-periphery and community structure. In this case, K is taken to be 4 in order to generate a graph with a core-periphery structure and 3 overlapping communities. We show that we are able to recover both the core-periphery and community structure in this case.
Comparison with the SNetOC model of [41]. Our model is an extension of that of [41], to allow the modelling of core-periphery structure as well as community structure. The question arises as to whether their SNetOC model could perform the same role as ours, with the core thought of as another community. Indeed, when examining real world networks we will study an example in which [41] identify a Hub community which is similar to the core identified by our model.
Here, we show some of the problems that can occur when trying to use the SNetOC model for core-periphery detection, and the advantages that our model can provide. We start by generating a graph from the SparseCP model with the parameters K = 2, α = 200, σ = 0.2, τ = 1, b = 1, a = 0.2. This choice of parameters leads to a larger relative core size than in our previous example. We then fit both the SparseCP model and the SNetOC model to this graph, and compare the results. We do this in two ways: firstly, via the fit of the posterior predictive degree distribution to the empirical degree distribution; and secondly, via the classification accuracy of the estimated core.
In Figure 5(a) we see the posterior predictive degree distribution for the SparseCP model, with the corresponding plot for the SNetOC model given in Figure 5(b). To compare these distributions, we can measure the distance between the posterior predictive degree distributions and the empirical degree distribution. It is well known that the Kolmogorov-Smirnov (KS) statistic has poor sensitivity to deviations from the empirical distribution that occur in the tails [30]. So, we give here the reweighted KS statistic suggested by [8], which weights the tails more strongly:

D = max_{x ≥ x_min} |S(x) − P(x)| / sqrt(P(x)(1 − P(x))), (4.2)

where S(x) is the CDF of observed degrees, P(x) is the CDF of degrees of graphs sampled from the posterior predictive distribution and x_min is the minimum value among the observed and sampled degrees. We see from the left hand side of Figure 6 that our model gives a better fit in terms of the weighted KS distance. Moreover, the SparseCP model (correctly) estimates σ to be positive, whilst the SNetOC model estimates σ to be negative. This shows one important benefit of our model, namely that it can model sparse networks with a large core well (the true core proportion in this case is around 0.3). Conversely, we see here an example where the SNetOC model fails to differentiate between a dense network and a sparse network with a dense subgraph. It is important to note that in some situations the SNetOC model correctly models the empirical degree distribution. One such example is the graph that we use to test the convergence of our MCMC sampler, in which the core is significantly smaller.
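A minimal sketch of this reweighted statistic, evaluated on CDF values S and P over a common grid of degrees at least x_min:

```python
# Sketch of the tail-reweighted KS statistic of [8]: the deviation
# |S(x) - P(x)| is divided by sqrt(P(x)(1 - P(x))), which up-weights
# both tails. S and P are lists of CDF values on a shared degree grid.

def weighted_ks(S, P):
    best = 0.0
    for s, p in zip(S, P):
        if 0.0 < p < 1.0:  # skip points where the weight is undefined
            best = max(best, abs(s - p) / (p * (1.0 - p)) ** 0.5)
    return best
```

Because the weight 1/sqrt(P(1 − P)) diverges as P approaches 0 or 1, a small deviation in either tail contributes much more to this statistic than the same deviation near the median.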
We can also compare the two models in terms of the core they identify. In the SparseCP model, the binary variable β_i1 = 1_{w_i1 > 0} indicates whether node i is in the core or not. We can then report the posterior distribution of this variable, or the point estimate β̂_i1 = arg max_{k∈{0,1}} Pr(β_i1 = k | data).
For the SNetOC model it is more difficult to perform this classification. A naive way to proceed is to assign each node to the community with the highest weight, and then identify the core as the subgraph of nodes with the highest density. However, in the SNetOC model nodes can belong to multiple communities, and a node can belong to the core, and still have a higher weight for another community. This method of classification gives poor results.
To provide a fairer comparison, we can also perform classification with the SNetOC model by assigning node i to the core if w_ic > T, where c denotes the index of the hub community and T > 0 is some threshold. The problem here is that there is no objective way to choose T without resorting to some cross-validation. In this simulated example, we can choose the threshold so that the number of nodes placed into the hub community by this classification method is equal to the number in the true core. This corresponds to a threshold of T = 0.014 in this case. If we do this, the classification accuracies of the SparseCP and SNetOC methods are very similar, and very close to 1. However, this method is not applicable for real data, where we do not know the size of the true core in advance.
Even in this "best-case" scenario for the SNetOC model where we classify based on thresholding the w ic optimally in some sense, we can still see the advantage of the SparseCP model by looking at the fit of the posterior predictive to the empirical degree distribution of the subgraph formed by the nodes in the (true) core. In Figures 5(c) and 5(d) we see that the SparseCP model is performing significantly better than the SNetOC model in this regard. This is confirmed by the weighted KS distances that we see in the right hand side of Figure 6.
When comparing the SparseCP and SNetOC model on real data sets, we will see many of the same features that we have noted here. In particular, the problems that the SNetOC model has with modelling a sparse graph with a relatively large dense core, and the problems with choosing a thresholding level to perform classification.

Real world networks
We now consider the setting of real data. There are many classical examples of networks with core-periphery structure. Our model is designed to detect this structure in sparse networks with a power-law degree distribution. We apply our method to two real-world networks, one that has a power-law degree distribution and one that does not. The proposed method performs well in the power-law setting, but also produces reasonable and interpretable results in the non-power-law case. The power-law network we consider is the USairport network of airports with at least one connection to a US airport in 2010. This network was previously considered by [41], and we compare the two methods to see the benefits of using a core-periphery model in this case. The other network we consider is the Trade network of historical international trade, for which the core-periphery structure has been previously studied [12,14]. In Table 1, we give the size of these networks, the value of K, the estimated relative size of the core and the estimated value of σ.
For the USairport data set we set K = 4. This follows from the work of [41], who found that there were four interpretable communities in the network. Three of these corresponded to geographical regions, with the final one corresponding to Hub airports. We therefore take the same choice of K to see if our model can identify the same interpretable structure, and to compare the two methods. For the Trade network we take K = 2, as in this case we do not concern ourselves with community structure, but simply look to see if it is possible to identify core-periphery structure even when the network does not have a power-law degree distribution. In each case, we assume vague Γ(0.01, 0.01) priors on the unknown parameters α, 1 − σ, τ, a_k and b_k.

US Airport network
The first real-world network we consider is the USairport network of airports with at least one connection to a US airport in 2010. Airport networks such as this have been seen to have a core-periphery structure [21,12]. Furthermore, one of the communities identified by [41] is the Hub airports: highly connected airports with no preferred location. [39] suggest that this network could more accurately be modelled by a core-periphery model. When comparing the results of our model to those of [41], we find that this is indeed the case. In order to compare both models, we take K = 4 when fitting the SparseCP model to the network, to retain the three other communities identified using the SNetOC model. We run 3 parallel MCMC chains with an initialization run of 10 000 iterations followed by 10 000 000 iterations. We then discard the first 5 000 000 samples as burn-in and thin the remaining 5 000 000 to give a sample size of 500. Trace plots and convergence diagnostics are reported in Section D.2.1 of Appendix D. Note that despite the large number of iterations, it seems that the sampler has not fully converged. Nonetheless, we observed that increasing the number of iterations does not significantly change the values of the core-periphery parameters of interest or the overall interpretation of the network. Figure 7(a) shows the adjacency matrix formed by ordering the nodes firstly by core and periphery membership, and then by their highest community sociability. From this, we can see the core-periphery and community structure identified by our model. The three communities have the same geographical interpretation as in [41], corresponding to East, West and Alaska. In Figure 8, we report the weights of a selection of nodes in the core and in the periphery. Periphery airports with large degree are generally large regional airports, with many connections within the East or West communities.
The final community corresponds to Alaskan airports, and we can see from Figure 7(a) that airports in this community generally do not have many connections to the core. If we investigate the classification into core and periphery in more detail, we see several interesting features. Firstly, looking at the airports placed in the core, we can compare these to the list of hub airports for various airline companies. We see that 38 out of 47 of these airline hub airports are placed in the core, with some interesting exceptions. Three of the airline hub airports not placed in the core are only hubs for the parcel delivery services FedEx and UPS, whilst one is only a hub for a small charter airline. The most interesting exceptions are Chicago Midway and Dallas Love Field airports, both of which are focus cities for the major airline Southwest Airlines but are not placed in the core. A possible explanation for this is that Southwest Airlines does not operate using a traditional spoke-hub model, instead utilising a point-to-point system, in which transport routes run directly between locations rather than via central hubs. This means that we do not see the presence of core airports with large numbers of connections to airports in many different locations. Connections in this network can be better explained by geographical proximity, and the focus cities are not necessarily core nodes in the sense of our model.
Secondly, we see that the SparseCP model does not simply classify by degree: the airports that we see in Figure 8(b) have degrees higher than many core nodes. As described in the simulated data study, this comes from the form of the probability of connection between two nodes, which can be high even if neither node is in the core.
If we look at the core nodes with low degrees, we can see a clear pattern. Most of these airports are major international hubs such as London Heathrow, Frankfurt International, Zurich and Tokyo Narita. Another airport that appears here is Honolulu International. In each case, it is not surprising that these airports are placed in the core, as most of their connections are long-distance flights to the major US hub airports.
Investigating the overestimation of high degree nodes more carefully, we see that for our model and that of [41], the posterior distribution on b_4, the parameter entering the gamma distribution F corresponding to the Alaskan community (the third community identified by our model), concentrates on values very close to 0 (see Figure 33 in Appendix D). This increases the posterior variance of the sociabilities w_i4, and leads to some graphs with very high degree nodes being generated when simulating from the posterior predictive distribution. A possible reason for this is that the Alaskan airports exhibit a different type of behaviour, not well explained by our model. Alaskan airports are densely connected to each other, but not to the other communities or the core nodes. Conversely, the East and West communities contain large numbers of core nodes, and also have more connections between communities.
In order to investigate this possible source of model misspecification further, we repeat our analysis of this network, taking out all of the Alaskan airports, and any airport which only has connections to Alaskan airports. In Appendix E we present the results. We see that the overestimation problem is no longer present, but overall the fit to the degree distribution is no better than for the full model. Furthermore, if we examine the nodes that are placed into the core and periphery, there is not very much difference.
Comparison with the SNetOC model of [41]. We now compare the fit of our model and the SNetOC model to the USairport network. In Figures 9(a) and 9(b), we report the 95% posterior predictive intervals for the degree distribution for both models. The SparseCP model provides a slightly better fit to the degree distribution, in particular for large degree nodes.
In Figure 10(a) we report some statistics measuring the distances between the posterior predictive degree distributions from each model and the empirical degree distribution. Firstly, we see that our model performs significantly better in terms of the reweighted KS distance defined in Equation (4.2). To identify in which tail the deviation between the two models occurs, we also report the Rényi statistics

L_2 = sup_{x_min ≤ x ≤ x_max} |S(x) − P(x)| / P(x), U_2 = sup_{x_min ≤ x ≤ x_max} |S(x) − P(x)| / (1 − P(x)),

where S(x) is the CDF of observed degrees, P(x) is the posterior predictive CDF of the degrees, and x_min and x_max are respectively the minimum and maximum observed degrees. Such statistics have been used as an alternative to the standard KS statistic, to measure departure in the tails [30]. L_2 is more sensitive to a departure for the low degrees, and U_2 for the high degrees. Both models perform similarly well for low degrees, but the fit of the SparseCP model is better for the large degree nodes. We now compare the classification accuracy of the two models. We use as ground truth for the core nodes the 47 US airports designated as hubs by airlines.
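One plausible implementation of such tail-sensitive Rényi-type statistics, with the lower-tail deviation weighted by 1/P(x) and the upper-tail deviation by 1/(1 − P(x)) (an assumption on our part, since the exact definitions are not reproduced here), is the following sketch:

```python
# Sketch of tail-weighted Renyi-type statistics on CDF values S and P
# over a common degree grid. The weightings (1/P for the lower tail,
# 1/(1 - P) for the upper tail) are an assumed form, chosen so that
# L2 is most sensitive to low-degree deviations and U2 to high-degree ones.

def renyi_lower(S, P):
    return max(abs(s - p) / p for s, p in zip(S, P) if p > 0.0)

def renyi_upper(S, P):
    return max(abs(s - p) / (1.0 - p) for s, p in zip(S, P) if p < 1.0)
```

Dividing by P(x) inflates deviations where P is small (low degrees), while dividing by 1 − P(x) inflates deviations where P is close to 1 (high degrees), matching the stated sensitivities of L_2 and U_2.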
For the SparseCP model, node i is classified into the core if the point estimate β̂_i1 = 1, and into the periphery otherwise. For the SNetOC model, node i is assigned to the core if w_ic > T, for some threshold T > 0, and to the periphery otherwise. For comparison with SNetOC, we also compute the same thresholded estimate under our model for various thresholds T (called SparseCP variable threshold). For each method and for a range of thresholds T, we plot the true positive rate against the false positive rate, and calculate the Area Under the Curve (AUC) as a measure of classification accuracy. We also compare against the naive classification method for the SNetOC model mentioned in the previous section, where a node is assigned to the hub community if its hub community weight is the highest amongst all of its community weights. We call this method SNetOC Naive. Figure 11 compares the classification accuracy from the SparseCP model (in green) to that obtained from the SNetOC model with different thresholds (in red). We see that with the right choice of threshold, the SNetOC model can slightly outperform ours, but for other thresholds it performs far worse. We also plot (in blue) the AUC obtained from our model if we do not threshold w_i1 at 0, but instead at higher values. This can be thought of as performing classification using a different loss function, which could be useful in a setting where we are more concerned with misclassifying periphery nodes as core than the reverse. For any value of T, the proposed model provides a higher accuracy than the SNetOC model. Moreover, in practice, we do not have an objective way to choose a particular threshold (without resorting to cross-validation). The naive SNetOC classification does not require a choice of threshold, however we see here that it provides a far lower classification accuracy. This suggests that our model provides a better way to identify core-periphery structure.
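The threshold sweep and AUC computation can be sketched as follows; here `scores` stands for whatever quantity is thresholded (e.g. posterior means of w_i1 for SparseCP, or hub-community weights w_ic for SNetOC):

```python
# Sketch: ROC curve and AUC for core detection by sweeping a threshold T
# over a per-node score. Sorting by score and lowering T one node at a
# time traces the ROC curve; the area is accumulated by the trapezoid
# rule. Assumes both classes are present; ties are broken arbitrarily.

def roc_auc(scores, truth):
    pairs = sorted(zip(scores, truth), reverse=True)
    pos = sum(truth)
    neg = len(truth) - pos
    tp = fp = 0
    area = 0.0
    prev_fpr = prev_tpr = 0.0
    for _, t in pairs:
        if t:
            tp += 1
        else:
            fp += 1
        fpr, tpr = fp / neg, tp / pos
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0  # trapezoid
        prev_fpr, prev_tpr = fpr, tpr
    return area
```

An AUC of 1 corresponds to a score that perfectly separates core from periphery, while 0.5 corresponds to random ordering.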
Furthermore, if we plot the degree distribution of the subgraph formed by the US airport hubs, we see that it is very far from a power law. In Figure 12 we see that the SparseCP model captures this degree distribution better than the SNetOC model.

World trade network
The next network we consider is the dyadic trade network between countries⁵. The original data details the flow of trade between pairs of countries from 1870 to 2009. In our case, we consider the single simple, undirected network obtained by aggregating the data over all the years. However, as shown by [12], when the trade data set is considered as an unweighted network, with a link between countries if there was any flow of trade between them, the core-periphery structure is quite weak. There is a strong core-periphery structure if the weighted (by volume of trade) network is considered. As our method currently cannot deal with weighted networks, we instead use a cutoff method, forming a binary network by only considering trade links over a certain volume. The cutoff was chosen to give a network with a heterogeneous degree distribution as close to a power law as possible. With a very low cutoff, the network is quite dense, with most countries connected to each other. With a high cutoff, only a small number of large countries have any connections. However, we find that the degree distribution does not follow a power law, with σ estimated to be negative. We found that this was the case even with different values of the cutoff.
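The cutoff construction described above can be sketched as follows: starting from the aggregated volume-of-trade matrix, keep only pairs whose volume exceeds the cutoff, symmetrise, and drop self-loops. The function name and input format are illustrative assumptions.

```python
import numpy as np

def binarize_by_volume(volume, cutoff):
    """Turn a weighted (volume-of-trade) adjacency matrix into a simple,
    undirected, unweighted network: keep an edge only if the aggregated
    volume between the pair exceeds `cutoff`, then symmetrise and
    remove self-loops."""
    volume = np.asarray(volume, float)
    adj = (volume > cutoff).astype(int)
    adj = np.maximum(adj, adj.T)   # undirected: keep edge if either direction qualifies
    np.fill_diagonal(adj, 0)       # simple graph: no self-loops
    return adj
```

Varying `cutoff` then sweeps between the dense low-cutoff regime and the sparse high-cutoff regime discussed above.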
In order to perform posterior inference, we run 3 MCMC chains with an initialization run of 10 000 steps and full chain lengths of 200 000. Trace plots and convergence diagnostics are reported in Section D.2.2 of Appendix D, suggesting the convergence of the MCMC sampler.
As we see in Table 1, σ is estimated to be negative in this case. This means that this network does not fit into our definition of a sparse graph as defined in Definition 2.2. Nevertheless, in Figure 13 we plot the posterior predictive distribution and we see that we can estimate the degree distribution fairly well, albeit with a large posterior predictive interval. Furthermore, we will see that we can obtain an interpretable classification of countries into core and periphery. Therefore, our model is still producing useful results despite the network not technically fitting into our framework.
As with the US Airport network, we calculate the reweighted KS distances in Figure 10(b). We again compare against the model of [41], with two communities. Here we see that both models perform similarly well. However, the advantage of our model is that it gives a discrete classification between core and periphery. We see this structure clearly from the adjacency matrix in Figure 7(b).
In Figure 14 we see the world map, coloured by the value of the mean sociability parameter in Figure 14(a), and by the value of the core sociability parameter in Figure 14(b). We see that the core consists largely of the large, developed countries that we would expect, as well as some smaller European countries. This fits with other results that have been obtained in the literature [12]. Comparing the two plots, some of the more interesting results are countries that have a relatively high core sociability compared to their mean sociability. These include countries such as Russia, and indicate countries that trade predominantly with other countries in the core. Conversely, we see other countries such as the US and China which have relatively lower core sociabilities. These are countries which trade more widely, with countries in both the core and the periphery.

Comparisons
Finally, we want to compare our method against other standard methods of core-periphery detection. In order to do this we generate simulated core-periphery networks, and run each algorithm on them. We can then calculate the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve [18] in order to determine the classification accuracy of each algorithm. In this case, we simply compare the binary core-periphery classification of each method, but we can still use the AUC as a measure of the accuracy of this binary classification. The methods that we compare against are: the algorithm of [4], the MINRES algorithm of [5] and [28], the Stochastic Block Model (SBM) of [44], the three different methods of [10], the aggregate core score method of [39] and the method of [11]. In each case, we implement the methods using the software found in the cpalgorithm Python package⁶. While the method of [39] gives continuous coreness parameters, we convert these to binary classifications in the same way that they do in their simulation study.
As we have previously noted, our focus here is on model-based approaches. The only other model-based approach amongst these is that of [44]. Our approach already provides several advantages over the rest of the models considered, such as the ability to model the degree distribution of the networks being studied. However, here we focus on the classification accuracy and how it compares to some classic and more contemporary alternatives. Specifically, we are interested in comparing the accuracy in the setting for which our model is designed: the modelling of sparse graphs with a core-periphery structure and power-law degree distribution.

Comparison on in-model simulated data
The first comparison we do uses the SparseCP model to generate power-law, core-periphery networks. Of course, we expect our model to perform very well here, and indeed we see this when we compare the AUC for each model. We vary the strength of the core-periphery structure by varying b in our model, which varies the relative sociabilities of the nodes in the core and periphery. We run 20 simulations for each value of b, adjusting the value of α to keep the number of nodes roughly equal in each case, and then measure the average classification accuracy. In Figure 15 we present the results for varying values of b. As b increases, the average relative size of the core increases from 0.12 to 0.57, whilst the average number of nodes stays roughly constant. We can see that classifying core and periphery generally becomes easier as b is increased. However, our method achieves the highest accuracy in each case.
The case b = 5 is an extreme case, with the core comprising almost 60% of the network, and in this case some of the other methods perform badly, with some methods placing all the nodes in the core in some of the simulations. Recalling our definition of core-periphery networks, we see from Figure 16 that the degree distribution is very far from a power-law, due to the large number of core nodes. Nevertheless, it is encouraging to see that in this case we are still able to accurately recover the degree distribution.
There are a few reasons why we might expect to see this difference in classification performance. The only other model-based approach, based on the SBM [44], cannot account for power-law degree distributions and tends to classify nodes according to their degree. Other non-model-based approaches have similar issues.
In an attempt to allow a fairer comparison, we also test our method against the alternatives mentioned above on another simulated network, again with a power-law degree distribution, but in this case not generated using our model.

Comparison on out-of-model simulated data
The second comparison we do uses the simulated core-periphery networks of [21]. Our algorithm is based on theirs, but differs slightly to allow us to tune how well the core and periphery can be separated by degree. We construct networks as follows:
1. Generate degrees m 1 < m 2 < . . . < m N from a power-law distribution. These are the desired degrees and can be thought of as stubs, as in the configuration model [33].
2. Place node i in the core with probability q i , where q i is a smooth, increasing function of the degree m i , with parameters κ and m min that we use to tune the model. Call the set of core nodes C H and the set of periphery nodes P H .
3. Go through each of the nodes i ∈ C H in increasing order of degree (node i has degree m i ) and, for each node i, attach its stubs to those of nodes j with m j ≥ m i , as long as the current degree of j is less than m j .
4. Attach the remaining stubs randomly and make them into edges if they do not form loops or multiple edges.
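Steps 1 and 2 of this construction can be sketched as follows. The exact form of q i is not reproduced above; here we assume the logistic form q i = 1/(1 + exp(−κ(m i − m min ))), a standard smooth approximation to a Heaviside step at m min. The power-law exponent parameter `gamma` and the Pareto-based degree sampler are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def simulate_core_assignment(n, gamma, kappa, m_min):
    """Steps 1-2 of the construction: sample desired degrees from a
    power-law-tailed distribution and assign each node to the core with
    probability q_i. We assume the logistic smooth-step form
    q_i = 1 / (1 + exp(-kappa * (m_i - m_min))); large kappa splits
    core and periphery almost exactly by degree at m_min."""
    # Step 1: power-law degrees (Pareto tail, rounded up to integers >= 1)
    m = np.sort(np.ceil(rng.pareto(gamma, n) + 1).astype(int))
    # Step 2: smooth-step core membership probabilities
    z = np.clip(kappa * (m - m_min), -700, 700)  # avoid exp overflow
    q = 1.0 / (1.0 + np.exp(-z))
    core = rng.random(n) < q
    return m, core
```

For large κ essentially all nodes with m i > m min land in the core and all with m i < m min in the periphery; decreasing κ mixes low-degree nodes into the core and high-degree nodes into the periphery, which is exactly the regime probed in the comparison below.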
The form of q i is a standard approximation to a Heaviside step function. When κ is large, the function approximates a step function with a jump at m min . In this case, the core comprises only the nodes with degrees m i ≥ m min . However, for smaller κ we have low degree nodes entering the core and high degree nodes entering the periphery. In Figure 17 we compare our method to the alternatives for varying κ. As in our previous simulation study, we calculate the area under the ROC curve, averaged over 20 realisations of networks for each value of κ. We see that when κ is large, and the core and periphery are essentially divided by degree, our method performs very well, but so do those of [44] and [39]. As we decrease κ, the other methods fail to perform as well, while ours retains a high accuracy. Therefore, we see that our method also outperforms the alternatives when applied to simulated networks that are not generated using our model. Specifically, we do particularly well when the core and periphery nodes are not split simply by degree. We saw in the US Airport example that the form of the probability of connection between nodes is what allows nodes in the periphery to have high degree, and vice versa. This explains why our model performs well here. We must also note that some of the methods we compare against are very fast, taking only seconds to produce a classification into core and periphery, whereas our sampler generally takes ∼ 5 minutes to run on a standard desktop computer. However, as discussed earlier, in our case the aim is not just to perform classification, but to estimate the parameters of a generative model.

Conclusions
In this work, we provide a precise definition of a sparse network with a core-periphery structure, based on the sparsity properties of the subgraphs of core and periphery nodes. Building on earlier work from [6, 41], we then present a class of sparse graphs with such properties. We obtain theoretical results on the sparsity properties of our model. Specifically, we see that the SparseCP model generates a core region which is dense, and a periphery region which is sparse. Theoretical results on the relative size of the core are also obtained.
We provide methods to simulate from this class of graphs, and to perform posterior inference with this class of models. We demonstrate that our approach can detect interpretable core-periphery structure in two real-world airport and trade networks, while providing a good fit to global structural properties of the networks. When restricting ourselves to simply looking at core-periphery classification accuracy, we see that it compares favourably against various alternatives when tested in the power-law setting.
A property of the SparseCP model is that the relative size of the core tends to zero as the size of the graph increases. In some applications, we may instead want to have a network which is overall dense, but with a sparse periphery region, where the relative size of the core is bounded above zero. Furthermore, whilst our model can currently accommodate the existence of multiple communities as well as a core-periphery structure, it currently cannot be used to model networks with multiple core-periphery pairs. Work has been done in the literature on the detection of multiple such pairs [25] and it could be valuable to extend our model to this setting.

A.2. Proof
A straightforward adaptation of Theorem 1 in [6] and Proposition 1 in [31] yields almost surely as α tends to infinity. It follows that α,c−p α 2 almost surely as α → ∞.
Two nodes i and j in the periphery connect with probability where (w i2 , . . . , w iK ) i≥1,wi1=0 are the points of a Poisson point process with mean ρ p (dw 2 , . . . , dw K ) = w1∈[0,∞) ρ p (dw 1 , dw 2 , . . . , dw K ). This is therefore the same model as that of [41]. We therefore have, using Proposition 4 in [41] and therefore N α,p = ω(α) and N α = ω(α). Noting that ρ p (dw 2 , . . . , dw K ) = {0}×(0,∞) K−1 ρ(dw 1 , dw 2 , . . . , dw K ) finishes the proof of Theorem (3.1). We now consider the particular case of the Lévy measure (2.11). We have where F is a product of independent gamma distributions, and ρ 0 is the jump part of a GGP with parameters σ and τ . We recognise this as the particular compound CRM model of [41], except that we now have a base measure Then ρ 0 (dw 0 ) = ∞ =⇒ ρ 0,p (dw 0 ) = ∞ for our particular choice of ρ 0 . Hence, the results of Proposition 5 of [41] tell us that Returning to the core nodes, we know that N α,c α if For the particular case of the Lévy measure (2.11), we have As before, this is the same as the particular compound CRM model of [41], except that we now have a base measure In this case ρ 0,c (dw 0 ) is the Laplace exponent ψ(1) of the base Lévy measure ρ 0 , which is finite. Hence, the results of Proposition 5 of [41] tell us that (0,∞) K ρ(dw 1 , dw 2 , . . . , dw K ) < ∞ as desired, and thus N α,c α. This completes the proof of Theorem 3.2.
In the particular case K = 2, we can give a more direct proof of the asymptotics for the periphery nodes and overall graph. We present this here for ease of understanding. In this case we have Using a slight abuse of notation, let ρ p (w 2 ) and ρ 0 (w 0 ) denote the intensity functions of the measures ρ p (dw 2 ) and ρ 0 (dw 0 ). Let f denote the pdf of a gamma random variable with parameters a and b. We have where β 2 has cdf F . Finally, using [3, Proposition 1.5.8, p. 26], we obtain and it follows from [6, Theorem 4] that which gives the desired results for the periphery nodes and the overall graph. The corresponding results for the core nodes, which tell us that N α,c α, can be found by taking K = 2 in the general proof. This then completes the proof of Theorem 3.2 in the particular case K = 2. The results of Corollary 3.2.1 both follow directly from N α,c α and the fact that N α = Ω(α^{1+σ}) for the overall graph.

Appendix B: More simulation results
We show here empirical results on the effects that changing various parameters of the model have on the degree distributions, sparsity properties and core proportion. We restrict ourselves to the K = 2 case here, for ease of visualisation.

B.1. Degree distributions
We first look at the degree distributions for the overall graph, and for the core and periphery nodes separately. The first thing we notice is that, as we expect, the core and periphery nodes have very different degree distributions. In Figure 18 we see the results for varying σ and τ . Here we see that increasing σ leads to lower degree nodes in the overall graph, as well as in both the core and periphery regions. We expect this, since we know that a larger value of σ means that the graph is more sparse. The distribution begins to look closer to a pure power law for large σ. We see further that increasing τ has little effect on the degree distribution in the core (for τ = 5 the core size is very small, leading to less interpretable results). We also see fewer nodes with a high degree, and again behaviour more closely resembling a power law. Figure 19 shows the results for varying a and b. When increasing a we see that the shape of the degree distribution for the core does not change, although the number of nodes in the core increases. Overall and in the periphery, we see little difference in the degree distribution apart from the nodes with high degree: for larger values of a there are more of these. The results for varying b are similar, except that increasing b decreases the number of nodes with a high degree.

B.2. Sparsity
We have seen how the sparsity properties in the different regions are controlled by σ. From Figures 20 and 21 we see that, as expected, changing the parameters a and b does not affect this, since the gradients of the lines in each case are the same. However, we do see that, for graphs of a fixed size, increasing a increases the density in the overall graph and in the different regions. Conversely, increasing b decreases the density in the core-periphery and periphery regions. The density in the core region is also decreased, but this effect appears to be far less significant.

B.3. Core proportion
We are finally interested in how the proportion of nodes in the core is affected by changing a and b. From Figure 22 we see that, as expected, the asymptotic rates are not affected by changing these hyperparameters. However, for fixed size graphs, increasing a decreases the relative size of the core region, whilst the opposite is true for b.

Appendix C: MCMC algorithm details
We give here the details of our MCMC algorithm, as detailed in Algorithm 1, and in particular how it differs from the algorithm of [41]. As in [41], we assume that we have observed a set of connections (z ij ) 1≤i,j≤N α , where N α is the number of nodes with at least one connection. We want to infer the parameters (w i1 , . . . , w iK ) i=1,...,N α . We also want to estimate the sums of the parameters for the nodes with no connection (w * 1 , . . . , w * K ), the hyperparameters φ of the mean intensity ρ, and the parameter α, which is also assumed to be unknown. Thus the aim is to sample from the posterior (C.2). As in [41], we introduce the latent count variables ñ ijk using the data augmentation scheme given in Section C.3. We can then define an algorithm which uses Metropolis-Hastings (MH) and Hamiltonian Monte Carlo (HMC) updates within a Gibbs sampler to perform posterior inference. At each iteration of the Gibbs sampler we update using Algorithm 1.
The key difference in our case compared to [41], as we see in Section 2.2, is that the β i1 are not identically distributed, and depend on w i0 . This gives us the separate term involving the β i1 in (C.2). However, we see that we can still follow the same algorithm, with some modifications.
In general, if ρ can be evaluated pointwise, a MH update can be used for this step; however, this will scale poorly with the number of nodes. In the compound CRM case, if f and ρ 0 are differentiable then a HMC update can be used instead. For the SNetOC model, where f is the product of independent gamma distributions and ρ 0 is the mean measure of a generalized gamma process, this differentiability condition does hold. Thus, a HMC sampler is used to update (w i0 , β i1 , . . . , β iK ) i=1,...,N α all at once, with a potential energy function U . U (w 0 , β) can be calculated from (C.2) using a simple change of variables. Furthermore, its derivatives have simple analytic forms, and so the HMC step can be computed exactly.
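As a generic illustration of the kind of move used here (not the exact sampler of [41] or ours), a single HMC update with leapfrog integration can be sketched as follows; `U` and `grad_U` stand in for the model's potential energy and its analytic gradient.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility

def hmc_step(theta, U, grad_U, eps=0.1, n_leapfrog=20):
    """One generic HMC update: refresh a Gaussian momentum, integrate
    Hamiltonian dynamics with the leapfrog scheme, and accept/reject
    with a Metropolis correction on the total Hamiltonian."""
    theta0 = theta.copy()
    p = rng.standard_normal(theta.shape)   # momentum refresh
    p0 = p.copy()
    # leapfrog: half kick, alternating drifts and kicks, half kick
    p = p - 0.5 * eps * grad_U(theta)
    for _ in range(n_leapfrog):
        theta = theta + eps * p
        p = p - eps * grad_U(theta)
    p = p + 0.5 * eps * grad_U(theta)      # undo the extra half kick
    # Metropolis accept/reject on the Hamiltonian H = U + kinetic energy
    h0 = U(theta0) + 0.5 * np.dot(p0, p0)
    h1 = U(theta) + 0.5 * np.dot(p, p)
    return theta if np.log(rng.random()) < h0 - h1 else theta0
```

In the sampler described above, the change of variables to the log scale makes the state unconstrained, and the analytic derivatives of U play the role of `grad_U`.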
In our case this strategy will not work, because β 1 is a binary variable, and so neither the log transformation nor the HMC update of the (β i1 ) i=1,...,N α can be used. We propose an alternative solution, using a Gibbs step to update β 1 separately from the others. This method also allows us to deal with the fact that the β i1 are not identically distributed, and depend on the w i0 . The two conditional distributions can be found from (C.2) as follows. Ignoring the terms not involving (w 0 , β 2 , . . . , β K ) gives the first conditional. We can simulate from this using HMC as before, except that now β 1 is considered to be constant, and we have to incorporate an extra term involving the w i0 coming from the distribution of β i1 | w i0 . The details of the exact HMC sampler are therefore very similar to the SNetOC model, and we omit them here. We then obtain the second conditional distribution. In order to simulate from this distribution, we first approximate the conditional by linearizing the exponent, where β * i1 is the sampled value of β i1 from Step 1 of the previous iteration of the overall Gibbs sampler in Algorithm 1; we thus treat β * i1 as a constant here. Using this approximation, and defining f i appropriately, we have that the conditional distribution of (β i1 ) i=1,...,N α given the rest factorizes over i, so we can simulate each of the β i1 individually. The distribution f i (β i1 ) depends on m i1 , and there are two cases.
1. If m i1 > 0 then, recalling that β i1 is a binary variable, P r(β i1 = 0) = f i (0) = 0 and so P r(β i1 = 1) = 1. This means that if the sum of the latent counts Σ j ñ ij1 from the previous iteration is not equal to 0, then β i1 will be updated to 1. We also know from (C.9) that if β i1 = 0 then ñ ij1 = 0 for all j. However, this does not lead to a problem of losing irreducibility of the Markov chain, since the reverse implication does not hold.
2. If m i1 = 0 then the conditional is a Bernoulli distribution, whose parameter is available in closed form. Thus we can update the β i1 given the rest by sampling from this Bernoulli distribution. Of course, we are conditioning on m i1 when performing this step, so we always know which case we are in. Thus, the above can be used to sample β 1 from the relevant conditional distribution.
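The two-case Gibbs update above can be sketched as follows. The Bernoulli parameter of the m i1 = 0 case is model-specific and not reproduced here, so `p_bernoulli` is a placeholder input computed elsewhere; only the case split is the point of the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed for reproducibility

def update_beta1(m1, p_bernoulli):
    """Gibbs update for the binary core indicators beta_i1.
    m1[i] is the sum of latent counts involving node i in the core
    component from the previous iteration; p_bernoulli[i] is the
    (model-specific, assumed precomputed) Bernoulli parameter for the
    m_i1 = 0 case.
    - if m_i1 > 0, the conditional puts all its mass on beta_i1 = 1;
    - if m_i1 = 0, beta_i1 is drawn from Bernoulli(p_bernoulli[i])."""
    m1 = np.asarray(m1)
    p = np.asarray(p_bernoulli, float)
    beta1 = (rng.random(len(m1)) < p).astype(int)  # stochastic case
    beta1[m1 > 0] = 1                              # deterministic case
    return beta1
```

Note that the deterministic branch overrides the stochastic draw, matching case 1 above.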

C.2. Step 2
In this step we want to sample from p((w * k ) k=1,...,K , φ, α | rest), which we know from (C.2) is proportional to an expression involving the pdf g * α (w * 1 , . . . , w * K ; φ) of the total masses. This distribution is not of a standard form, and the pdf g * α typically has no analytic expression, but its Laplace transform is available. For the SNetOC model a MH step is therefore used, with a proposal q(w̃ * 1:K , φ̃, α̃ | w * 1:K , φ, α) that factorizes, where the proposal q(w̃ * 1:K | w * 1:K , φ̃, α̃) is an exponentially tilted version of g * α . Here, λ k = w * k + 2 Σ i=1,...,N α w ik , and ψ is the Laplace exponent. In the compound CRM case, ψ takes a simple form that only involves evaluating a one-dimensional integral. The calculation of ψ in our case can be done similarly, the only difference being that we need the moment generating function of a Bernoulli distribution for the first component. The details are omitted here. Similarly, since the distribution of β 1 has no hyperparameter, the same q(φ̃ | φ) can be used in our case as in the compound CRM case of the SNetOC model, ignoring the k = 1 term.
2. Calculate the set A of edges (i, j) such that ñ ij1 = 0 for all (i, j) ∈ A ⟺ m i1 = 0 for all i ∈ I.
3. For (i, j) ∉ A, update (ñ ij1 , . . . , ñ ijK ) as normal, using the conditional distribution given z, w.
4. For (i, j) ∈ A, propose an update for (ñ ij1 , . . . , ñ ijK ) from a mixture distribution for i ≠ j, and similarly for i = j. This means that with probability p lat we set ñ ij1 to 0 and simulate ñ ij2 , . . . , ñ ijK from a truncated Poisson distribution; with probability 1 − p lat , ñ ij1 , . . . , ñ ijK are all simulated from a truncated Poisson distribution as normal.
5. Accept the proposal using the standard Metropolis-Hastings acceptance rate, where ñ = (ñ ijk ) 1≤i,j≤N α ,k=1,...,K , P is the true distribution of the ñ ijk as given in (C.9), Q is the proposal distribution detailed above, and ñ old is the value of ñ from the previous iteration of the overall sampler.
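The truncated Poisson draws used in the proposal above can be sketched as follows: independent Poisson counts conditioned on a positive total, so that an observed edge is explained by at least one latent count. The rejection-sampling implementation and the rate vector `lam` (which would come from the model's weights) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed for reproducibility

def truncated_poisson_counts(lam, max_tries=1000):
    """Draw latent counts (n_1, ..., n_K) as independent Poissons with
    rates lam = (lam_1, ..., lam_K), conditioned on a positive total.
    Simple rejection sampler; adequate unless all rates are tiny, in
    which case we fall back to a single count in the largest-rate
    component."""
    lam = np.asarray(lam, float)
    for _ in range(max_tries):
        n = rng.poisson(lam)
        if n.sum() > 0:        # accept: the zero vector is excluded
            return n
    n = np.zeros(len(lam), dtype=int)
    n[np.argmax(lam)] = 1      # fallback for vanishing rates
    return n
```

Setting the first component's rate to zero recovers the ñ ij1 = 0 branch of the mixture proposal.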
In practice, we find that this method allows for faster mixing of the algorithm.

C.4. Parameter initialization
We start with the initialization method employed by [41], which performs a short run with K = 1, using the parameter estimates obtained there as initial values for the full run. When testing our model on simulated data sets with K = 2 (i.e. with a core parameter w 1 and overall sociability parameter w 2 ) we find that when the core-periphery structure is particularly weak our estimated values of w 1 are close to the true values of w 2 , and vice versa. Similarly, when taking K > 2, we sometimes find that the estimated values of the core parameter are instead close to the values of one of the community parameters. We therefore introduce a new initialization procedure, in order to ensure that our core parameter w 1 is estimating the core-periphery structure. We first perform a short initialization run using the SNetOC model. Having done this, we use the community sociabilities from the SNetOC model in order to initialize the weights w k , k = 2, . . . , K in our model. Initializing in this way prevents the problem of w 1 approximating the sociabilities of one of the communities. As shown in Section 4, the SNetOC model is not an ideal model for identifying core-periphery structure for several reasons. However, we do see that one of the identified community sociability parameters can sometimes approximate our core parameter, and thus provide us with a way to initialize this parameter as well. In practice, we find that initializing the size of the core to be significantly larger than we believe it to be leads to better mixing of the MCMC sampler.

C.5. Approximation of the log-posterior density
The posterior probability density function, up to a normalizing constant, takes the form given in (C.2), where m ik = Σ j=1,...,N α (ñ ijk + ñ jik ) and g * α (w * 1 , . . . , w * K ; φ) is the probability density function of the vector of total masses. This is intractable due to the lack of an analytic expression for g * α . We can, however, approximate the log-posterior. Noting that the Laplace transform of g * α is available, one can integrate out the w * k and obtain an approximation in terms of the multivariate Laplace exponent ψ(t 1 , . . . , t K ), which can be evaluated numerically. We will use the approximated log-posterior density in order to evaluate the convergence of our MCMC sampler in the experiments we run. In particular, we can use it as a one-dimensional statistic from which to calculate the Gelman-Rubin convergence diagnostic [16].
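The Gelman-Rubin diagnostic on such a one-dimensional statistic can be sketched as follows, using the standard potential scale reduction factor computed from parallel chains.

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin potential scale reduction factor R-hat for a
    one-dimensional statistic (e.g. the approximate log-posterior),
    computed from m parallel chains of length n. B is the between-chain
    variance, W the within-chain variance; values near 1 (conventionally
    R-hat < 1.1) suggest convergence."""
    chains = np.asarray(chains, float)       # shape (m, n)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (n - 1) / n * W + B / n       # pooled variance estimate
    return float(np.sqrt(var_plus / W))
```

Chains exploring the same region give R-hat near (or slightly below) 1, while chains stuck in well-separated regions inflate B and hence R-hat.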

Appendix D: Additional experimental results
In this section, we provide a number of additional plots related to the experiments in Section 4. In Section 4.2, we present a selection of results from performing posterior inference on a simulated graph generated with K = 2, α = 200, σ = 0.2, τ = 1, b = 1/K, a = 0.2. Here, we present some additional results specifically regarding the output of the MCMC sampler. In Figures 23 and 24 we see the parameter trace plots and histograms.
The green lines and stars respectively correspond to the values of the model parameters used to generate the graphs, and we see that the posterior converges around the true values.
In Figure 25 we give the trace and autocorrelation plots of an approximation of the log-posterior (up to a normalizing constant). The details of the calculation of the approximate log-posterior are given in Section C.5. Here, the green line gives the value of the approximate log-posterior using the true model parameters. The log-posterior trace plot suggests that the chains have converged and the autocorrelation plot shows that the correlation of the samples decreases quickly with increasing lag.
In order to test the convergence of the chains, we calculate the Gelman-Rubin convergence diagnostic R̂ [16]. Due to the high number of parameters in the SparseCP model, we calculate a univariate statistic using the sampled values of the approximate log-posterior. Recalling that R̂ < 1.1 suggests convergence, we find that in our case R̂ = 1.03. Thus we are satisfied that the chains have indeed converged.

D.1.2. Core-periphery and community structure
Here, we test our model on a graph with core-periphery and community structure, to confirm that we can detect the presence of both. In this case, K is taken to be 4 in order to generate a graph with a core-periphery structure and 3 overlapping communities. In Figure 26(a) we see the adjacency matrix for the  simulated network, sorted into core and periphery (indicated by the red lines) and then by largest community sociability (indicated by the black lines). We see a clear global core-periphery structure and community structure. As before, we see from Figures 26(b) and 26(d) that we are able to recover the degree distribution, as well as the mean sociabilities of the high and low degree nodes.
In Figures 27 and 28 we see the trace plots and histograms for the identifiable parameters for this graph. The green lines and stars correspond to the true values of the parameters, and as before we see that the posterior distribution converges around the true values. In Figure 29 we plot the credible intervals for each of the sociabilities w 1 , . . . , w 4 , and see that we are able to recover both the core and community sociabilities.
As before, to test convergence we calculate the approximate log-posterior probability density function (up to a normalizing constant). We see from Figure 30 that the chain has converged to the true value. This is confirmed by the Gelman-Rubin diagnostic, which in this case comes out as R̂ = 1.03. However, from the autocorrelation plot we see that the mixing is not as good as in the case without community structure, with dependencies not vanishing as quickly as we would like for increasing lag. This indicates that estimation in the setting with both core-periphery and community structure can be more difficult, as we might expect.

D.1.3. No core-periphery structure
Here, we perform some additional experiments to test our model in two situations in which we do not expect it to work, in order to check that the behaviour is still sensible. Recalling that in our model the overall generated graph is still sparse, we first simulate data from the model of [6], with σ > 0. This model generates a sparse graph without a core-periphery structure. In this case, the size of the core estimated by our model tends to zero. Furthermore, we see from Figure 31(a) that we still accurately recover the degree distribution in this case.
Secondly, we test our model on a graph generated from the model of [6], but with σ < 0. In the construction of our model, the core and periphery are distinguished by the fact that the core nodes form a dense subgraph in an otherwise sparse graph. Thus, when the overall graph is dense, we do not have this distinction and our model struggles to identify any latent core-periphery structure; in this case, however, no such structure is present. We see two different types of behaviour, depending on the parameters of the model and the initial conditions.
1. σ is (correctly) estimated to be negative, and the size of the core goes to zero. This is the behaviour we see in Figure 31(b) and is the behaviour we would expect.
2. σ is (incorrectly) estimated to be positive, and the whole graph is estimated to be in the core. Although this may not seem intuitive, it still fits with our definition of the core being a dense subgraph within a sparse network.

D.2.1. US airport network
In Section 4.3.1, we present the results from performing posterior inference on the US Airport network. Here, we present some additional results specifically regarding the output of the MCMC sampler. In Figure 32 we give the trace plot and autocorrelation of the approximate log-posterior. The Gelman-Rubin statistic comes out to be R̂ = 1.25 in this case, suggesting that the chain has not converged despite the large number of iterations. We can also calculate the Gelman-Rubin statistic for the parameters α and σ. Here we obtain values of R̂ α = 1.10 and R̂ σ = 1.13. Again, these are both at or above the threshold of 1.1 which we generally believe to indicate convergence. However, the lower values here indicate that the problem of convergence may not be as bad as suggested by the log-posterior.
These results indicate that the convergence of our sampler is slow. As in [41], even running a very long chain did not lead to an improvement. This long chain took roughly 60 hours to run, translating to a time cost of around 0.02 seconds per iteration. However, it is at least encouraging to see that the different chains are exploring parts of the parameter space with similar approximate log-posterior density. In Figure 33 we report the trace plots for the identifiable parameters of the US Airport network. We can also perform some posterior predictive checks, by comparing the posterior predictive distribution of the degree standard deviation, global cluster coefficient and average local cluster coefficient to the respective empirical values. In Figure 34 we see that the global cluster coefficient is fit well, while the other statistics are slightly overestimated.

D.2.2. World trade network
In Section 4.3.2, we present the results from performing posterior inference on the World Trade network. Here, we present some additional results specifically regarding the output of the MCMC sampler. In Figure 35 we give the trace plot and autocorrelation of the approximate log-posterior. In this case, the Gelman-Rubin statistic of R̂ = 1.009 suggests convergence of the Markov chain. As before, we can also calculate the Gelman-Rubin statistic for the parameters α and σ. Here we obtain values of R̂_α = 1.01 and R̂_σ = 1.01, providing further evidence of convergence.
In Figure 36 we see the trace plots for the identifiable parameters of the World Trade network.

Appendix E: US airport network case study repeated without Alaskan airports
In Section 4.3.1, we note a problem caused by the sociability parameter for the Alaskan community. Here, we repeat our analysis having removed all the Alaskan airports, as well as any airport connected only to Alaskan airports. The relevant plots comprise Figures 37-41.
In Figure 37 we see the trace plots for the identifiable parameters. From these we see that we no longer have the problem of one of the b_k being estimated very close to 0.
In Figure 38 we see the autocorrelation plot of the approximate log-posterior density, which appears to be converging to a stable value. The Gelman-Rubin convergence diagnostic comes out to be R̂ = 1.05 in this case, indicating convergence. We also obtain values of R̂_α = 1.06 and R̂_σ = 1.03, providing further evidence of convergence.
In Figure 39 we see the posterior predictive degree plot on the left, compared to the corresponding plot for the model of [41] with three communities on the right. We see that we are no longer overestimating the high-degree nodes. We also see that the SNetOC model has estimated σ < 0 in this case, whereas for the SparseCP model we still estimate σ > 0.
If we look at the reweighted KS distances as before in Figure 40, we see that our model is still providing a better fit to the data, but not necessarily a better fit than before excluding the Alaskan airports. In this case we report the Rényi statistics L_2 and U_1, because we see that our model is overestimating the number of low-degree nodes and underestimating the number of high-degree nodes.
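For illustration, a plain (unweighted) two-sample KS distance between two degree samples can be computed as below; this sketch does not reproduce the reweighting used in our comparison:

```python
import numpy as np

def ks_distance(x, y):
    """Two-sample Kolmogorov-Smirnov distance between the empirical
    CDFs of samples x and y, e.g. the observed degrees versus the
    degrees of a graph drawn from the posterior predictive.
    """
    x, y = np.sort(x), np.sort(y)
    grid = np.concatenate([x, y])
    # Empirical CDFs of both samples, evaluated on the pooled support
    Fx = np.searchsorted(x, grid, side="right") / len(x)
    Fy = np.searchsorted(y, grid, side="right") / len(y)
    return np.abs(Fx - Fy).max()
```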
Furthermore, we see that the core and communities identified here are largely the same as before (without the "Alaska" community). In Figure 41 we plot the adjacency matrix in this case, again ordered into core and periphery, and then by highest community weight.