Finite Mixtures of ERGMs for Modeling Ensembles of Networks

. Ensembles of networks arise in many scientiﬁc ﬁelds, but there are relatively few statistical tools for inferring their generative processes, particularly in the presence of both dyadic dependence and cross-graph heterogeneity. To address this gap, we propose characterizing network ensembles via ﬁnite mixtures of exponential family random graph models (ERGMs), a class of parametric statistical models that has been successful in explicitly modeling the complex stochastic processes that govern the structure of edges in a network. Our proposed modeling framework can also be used for applications such as model-based clustering of ensembles of networks and density estimation for complex graph distributions. We develop a joint approach to estimate the number of mixture components and identify cluster-speciﬁc parameters simultaneously as well as to obtain an identi-ﬁed model under the Bayesian paradigm. Speciﬁcally, we develop a Metropolis-within-Gibbs algorithm to perform Bayesian inference, and estimate the number of mixture components using a strategy of deliberate overﬁtting with sparse priors that removes excess components during MCMC. As the true ERGM likelihood is generally intractable for model speciﬁcations with dyadic dependence terms, we consider two tractable approximations (pseudolikelihood and adjusted pseudo-likelihood) to facilitate eﬃcient statistical inference. We run simulation studies to compare the performance of these two approximations with respect to multiple metrics, showing conditions under which both are useful. We demonstrate the utility of the proposed approach using an ensemble of political co-voting networks among U.S. Senators and an ensemble of brain functional connectivity networks.


Introduction
Data involving ensembles of networks -that is, multiple independent networks -arise in various scientific fields, including sociology (Slaughter and Koehly, 2016;Stewart et al., 2019), neuroscience (Simpson et al., 2011;Obando and De Vico Fallani, 2017), molecular biology (Unhelkar et al., 2017;Grazioli et al., 2019), and political science (Moody and Mucha, 2013) among others.Typically, ensembles of networks represent the action of multiple generative processes, with different processes being prominent

Finite Mixtures of ERGMs for Modeling Ensembles of Networks
in different settings.A reasonable starting point for analysis of such data is to posit that this variation can be represented in terms of a discrete set of subpopulations, such that the networks drawn from any given subpopulation tend to be produced by similar generative processes.Given a set of potential generative models, one would then like to identify the subsets of networks drawn from a particular subpopulation, or a probabilistic mixture of multiple subpopulations.It is natural to view this as a hierarchical finite mixture problem, with the base distributions being parametric distributions on graphs.As a plausible approximation to the underlying data generating process, the hierarchical finite mixture framework provides a flexible approach for predictive modeling of ensemble of networks.If one seeks to predict graph structures drawn from a heterogeneous (super)population learned from observed data, one needs to average over the possible generative processes that might end up producing the observation that one wants to predict.Such a view is similar in spirit to model averaging techniques (Hoeting et al., 1999;Hjort and Claeskens, 2003), especially if interpreted in terms of a hierarchical problem in which we seek to predict an outcome of interest (e.g., co-voting prevalence among U.S. senators) by first predicting network structure and then predicting the behavior of a process on that network.In that setting, if it turned out that there were K types of possible network formation processes and we did not know which one ours happened to be, we would certainly want to average across the types.
There is a growing body of literature on the analysis of ensembles of networks.This includes work on discriminative analysis of networks via distance or similarity measures (e.g.Banks and Carley, 1994;Butts and Carley, 2005;Fitzhugh et al., 2015), which can be broadly viewed as mapping the ensemble of interest into some highdimensional space (e.g., the Hamming space of graphs), and then employing standard multivariate analysis techniques (e.g., hierarchical clustering, multidimensional scaling) to seek an informative low-dimensional approximation.Other approaches work with user-selected graph statistics, either directly (e.g.Pržulj, 2007;Sweet et al., 2019) or by e.g., modeling quantiles of the observed statistics relative to a reference distribution to control for size and density effects (Butts, 2011).As such, these approaches do not attempt to provide generative models for the networks within the ensemble, though they may in some cases provide generative models for summary statistics (e.g., predicting the conditional uniform graph quantile for the transitivity of a new graph drawn from the same ensemble).Another relevant work is Durante et al. (2017), where the authors propose a Bayesian nonparametric approach for modeling the population-level network based on a mixture model framework on the joint distribution of the edges through a latent space model representation.
In the category of generative models for complex networks, a common approach is to employ multilevel models with exponential random graph models (ERGMs, a general family of parametric models for networks (see, e.g.Schweinberger et al., 2020, for a review)), as base distributions.Faust and Skvoretz (2002) introduced both multivariate meta-analysis of ERGM parameters from a common model family (fit to an ensemble of graphs) and predicted conditional edge probabilities from the generative base models as tools for leveraging ERGMs to compare networks.More elaborate meta-analytic procedures and hierarchical models for population of networks were subsequently developed by, among others, Zijlstra et al. (2006); Slaughter and Koehly (2016); McFarland et al. (2014); Butts (2017), and Stewart et al. (2019).In general, those methods have either not posited a generative model for the parameters of the base distribution, as in descriptive meta-analytic approaches (which can be problematic when model interpretation and simulation from the resulting model are of interest), or not suitable for identifying subpopulations from heterogeneous data (as in hierarchical models without mixture structure).Although work such as that of Lehmann et al. (2021) enables the modeling of heterogeneity in brain functional connectivity networks, it requires that subpopulation labels be observed (which is often not the case).Therefore, joint modeling of population-level and network-level parameters where subpopulation memberships are unknown, or where the true generative process otherwise involves a mixture of graph distributions, has remained an open problem to date in the ERGM context.
In this paper, we propose using mixtures of ERGMs to model the generative process of ensembles of networks in which the subpopulation labels are not available, under the general framework of finite mixture models (McLachlan and Basford, 1988;Fraley and Raftery, 2002;Bouveyron et al., 2019).Such a formulation provides a useful probabilistic interpretation of the results and allows for convenient statistical inference; we note in particular that related approaches have proven to be efficacious for modeling structure within networks (e.g.Salter-Townshend and Murphy, 2015;Schweinberger and Handcock, 2015;Snijders and Nowicki, 1997).Recent work on using mixtures of network models with conditionally independent dyads (e.g., a priori stochastic blockmodels, the p 1 model) for modeling multiple network observations (Signorelli and Wit, 2020) can encounter difficulties when the observed networks exhibit strong dyadic dependence, which is often the case for real-world networks.We develop a joint approach to simultaneously estimate the number of mixture components and identify cluster-specific parameters, as well as to obtain an identified model under the Bayesian paradigm.Specifically, we propose a Metropolis-within-Gibbs algorithm to perform Bayesian inference, and estimate the number of mixture components using the method of Malsiner-Walli et al. (2016) that employs a deliberately overfitting model with sparsity-inducing priors over a large number of mixture components, with the excess components being "emptied" during MCMC.We also adapt the Bayesian information criterion and Deviance information criterion to enable the comparison of different specifications of ERGMs.
The remainder of this paper is structured as follows.In Section 2, we briefly introduce the exponential-family random graph models (ERGMs) and common estimation techniques.Section 3 describes the idea of finite mixtures of ERGMs, along with the estimation algorithm and the method for selecting the number of mixture components.Section 4 presents simulation studies showing that the proposed method can accurately recover the true subpopulation assignments and model parameters.Sections 5 and 6 show the results of our method applied to a political co-voting data and brain functional connectivity networks, respectively.Section 7 concludes with a discussion.

Exponential-Family Random Graph Models (ERGMs)
In recent years, ERGMs have found applications in empirical research in a wide range of scientific fields.Recent examples include the study of large friendship networks (Goodreau, 2007), genetic and metabolic networks (Saul and Filkov, 2007), disease transmission networks (Groendyke et al., 2012), conflict networks in the international system (Cranmer and Desmarais, 2011), the structure of ancient networks in various of archaeological settings (Amati et al., 2019), the structural comparison of protein structure networks (Grazioli et al., 2019), the effects of functional integration and functional segregation in brain functional connectivity networks (Simpson et al., 2011;Sinke et al., 2016;Obando and De Vico Fallani, 2017), and the impact of endogenous network effects on the formation of interhospital patient referral networks (Caimo et al., 2017).While addressing very different problems in different empirical settings, what these studies have in common is a clear methodological commitment to modeling network mechanisms directly via parametric effects, rather than just attempting to "control for" unspecified dependence among the observations (e.g., via latent structure).The ability to provide generative and interpretable models of complex network structure is an important asset of this approach, which we leverage here in the context of graph ensembles.

Definition and Estimation
Exponential-family random graph models (ERGMs) (Holland and Leinhardt, 1981;Frank and Strauss, 1986;Snijders et al., 2006;Hunter and Handcock, 2006), also known as p-star models (Wasserman and Pattison, 1996), are a family of parametric statistical models developed for explicitly modeling the complex stochastic processes that govern the formation of edges among pairs of nodes in a network.We introduce them first in the single-network case.Consider the set of nodes in the network of interest, V , and let |V | = n be its cardinality, i.e. the number of nodes in the network.We represent the network's structure via an order-n random adjacency matrix Y, in which each element takes 1 or 0 representing the presence or absence of a tie between incident nodes.Letting Y n be the set of all possible network configurations on n nodes, we write the probability mass function (pmf) of Y taking a particular configuration Y in the form of a discrete exponential family as where θ = (θ 1 , • • • , θ q ) ∈ R q is a vector of (curved) model parameters, mapped to the natural parameters by The natural parameters η may depend on the sizes of the networks and may be non-linear functions of a parameter vector θ.The user-defined sufficient statistics g : Y n → R p may incorporate fixed and known covariates X that are measured on the nodes or dyads.The sufficient statistics incorporate network features of interest that are believed to be crucial to the mechanisms giving rise to the graph set (see, e.g., Morris et al., 2008).Here, h defines the reference measure for the model family; often chosen to be the counting measure h(y) ≡ 1, ∀y ∈ Y n for unvalued graphs with fixed n, other reference measures can make more sense in different settings (see e.g.Schweinberger et al., 2020).Finally, the normalizing factor z g,h,η,X,Yn (θ) = y ∈Yn exp η(θ) g(y ; X) h(y ) ensures that (1) sums to 1 over the support Y n .Relatedly, normalizability requires that θ only take values for which z g,h,η,X,Yn (θ) is finite.
Exact maximum likelihood estimation (MLE) for ERGM parameters requires direct evaluation of the normalizing factor, which involves integrating an extremely rough (i.e., high variance) function over all possible network configurations (2 ( n 2 ) non-negative terms for an undirected network of size n).This cannot be done by brute force except for trivially small graphs (n 7), and the roughness of the underlying function precludes simple Monte Carlo strategies; thus, alternative approaches that approximate or avoid this calculation are of substantial interest.To date, the most frequently used approaches include: maximum pseudo-likelihood estimation (MPLE; Besag (1974)) adapted by Strauss and Ikeda (1990); Markov Chain Monte Carlo MLE (MCMC MLE; Geyer and Thompson (1992)) by Handcock (2003); Hunter and Handcock (2006); approximate MLE based on Stochastic approximation (SA; Robbins and Monro (1951); Pflug (1996)) by Snijders (2002); fully Bayesian inference based on exchange algorithm (Caimo and Friel, 2011); approximate Bayesian inference based on adjusted pseudolikelihood (Bouranis et al., 2017); and variational Bayesian inference based on fully adjusted pseudolikelihood (Tan and Friel, 2020).As simulation-based MLE-finding algorithms (e.g., MCMC MLE, SA) rely on good initial parameter configuration to seed their simulations, there are also some work on this aspect, including the partial stepping technique (Hummel et al., 2012) and the contrastive divergence (CD, Hinton (2002))-based techniques adapted to ERGMs by Krivitsky (2017).Despite the computational challenges, these and related strategies have made ERGM inference practical for well-posed model families (see Schweinberger et al., 2020, for a recent review).

Size-Adjusted Parameterizations
It is worth noting that the behavior of (1) across n is highly dependent on the choice of reference measure, h.In particular, the counting measure -while a mathematically convenient choice -implicitly sets the base distribution of the network to be the uniform distribution on Y n , and has the side effect of generating graphs whose densities are ceteris paribus constant in n.When network size varies, this is not always realistic: in many networks, mean degree is approximately constant in n, implying that density must scale as n −1 .To correct for this, Krivitsky et al. (2011) propose the reference measure h(y) = n −M (y) , where M is the edge count.This is equivalent to adding a sizedependent offset of − log n to the natural parameter associated with the edge count, i.e., η 1 (θ) = θ 1 − log n, where θ 1 ∈ R is a parameter that does not depend on the network size.In the present work, we employ the Krivitsky reference measure as above for the networks of varying sizes, though other size-adjusted parameterizations are also possible (e.g., Butts and Almquist, 2015;Kolaczyk and Krivitsky, 2015).

Finite Mixtures of ERGMs
We assume a population of networks (Y (1) , V (1) , X (1) ), . . ., (Y (m) , V (m) , X (m) ), where Y (i) is a graph structure on vertex set V (i) with covariate set X (i) .Our interest is in modeling m) , X (m) ), where it will be assumed that the respective graph structures are conditionally independent given the respective subpopulation membership, vertex sets, and covariates.To simplify notation, V is implicitly absorbed into X for the remainder of this paper.
We formulate the data generating mechanism as a hierarchical model and we note that Bayesian inference is a natural choice here, since (1) it is more robust to initialization and less prone to converge to local minima than maximum likelihood; (2) interval estimation is straightforward and does not rely on the assumption of approximate normality; and (3) it provides principled answers in fixed-n, m settings.
With a prespecified upper bound for the number of subpopulations (or equivalently, mixture components) K, we have the following hierarchical representation of the model (see Figure 1 for a graphical representation) τ = (τ 1 , . . ., τ K ) ∼ Dir K (α, . . ., α), P (Z i = j) = τ j for every i = 1, . . ., m, and j = 1, . . ., K, given τ , θ 1 , . . ., θ K ∼ N p (μ, Ψ) independently given μ, Ψ, ) and Ψ = 25×I p to specify a flat multivariate normal prior for θ k 's, which is a standard choice in the ERGM literature (Caimo and Friel, 2014;Bouranis et al., 2018) to reflect our belief that the real-world graph is more likely to have a density smaller than 0.5.Other mean values can be used if there is additional prior knowledge about the graph density.
With the Krivitsky reference measure, a multivariate normal prior with mean 0 for all elements reflects the prior belief that expected baseline mean degree (i.e., discounting other effects) is around 1, which is more reasonable for typical sparse networks.Again, other mean values can be used if there is additional prior knowledge about the mean degree of the graphs of interests.To identify the number of mixture components, we deliberately choose a large value for K and a small value for α (e.g., 0.01 or 0.001, as demonstrated in Malsiner-Walli et al. ( 2016)), so that the superfluous components can be eliminated during MCMC, and hence a straightforward estimator for the true number of components is given by the most frequent number of non-empty components visited.
In our numerical examples, we choose α from a set of pre-specified small values that achieves the most accurate estimate of the cluster number.It is of interest to develop efficient marginal likelihood computing methods for our proposed method in a future work and use them to guide the choice of α using an empirical Bayesian approach.
From this point onward, we focus on ERGM model specifications in canonical form (i.e., η(θ) ≡ θ) as it is a rich model family comprising most commonly-used model specifications and known to be computationally stable.For networks of equal sizes, it is natural to use counting measure h(y) ≡ 1, therefore (1) becomes where z(θ) = y ∈Yn exp θ g(y ; X) .And for networks of varying sizes, with Krivitsky reference measure, (1) becomes where the normalizing factor z n (θ) = y ∈Yn exp θ g(y ; X) − log(n) i,j y ij .

Bayesian Estimation
As noted, we perform posterior inference via MCMC.Our algorithm iterates over the model parameters (θ 1 , . . ., θ K , τ ) = (θ, τ ) with the priors given above, and the la- Where possible, we sample from the full conditional posterior distributions; otherwise we use Metropolis-Hastings steps.
The proposal distribution q(•|θ) in the Metropolis step is set by the user to achieve good performance of the algorithm.Following literature on Bayesian ERGMs (Caimo and Friel, 2011;Bouranis et al., 2017), we use a symmetric multivariate Gaussian proposal for updating θ k , k = 1, 2, . . ., K. As suggested by Frühwirth-Schnatter (2001), an additional permutation step is added to randomly permute the current labeling of the components at the end of each MCMC iteration, which ensures that the sampler explores all K! modes of the full posterior distribution and prevents the sampler from being trapped around a single posterior mode (Geweke, 2007).

Finite Mixtures of ERGMs for Modeling Ensembles of Networks
Algorithm 1 Metropolis-within-Gibbs sampler for the ERGM mixture model.
Accept θ k as θ t k with probability equal to end for 10: Random permutation of the labeling: select randomly one permutation ρ of K! possible permutations of {1, . . ., K} and substitute (5) 11: end for

Approximations to Intractable Likelihoods
The likelihood P(y (i) |X (i) ; θ) is intractable for general ERGMs with dyadic dependent terms, which poses two challenges to the proposed Algorithm -(1) updating θ k 's via Metropolis-Hastings ratio ( 8) is a doubly-intractable problem (Murray et al., 2006); (2) updating latent indicator variables Z i requires a tractable likelihood.Although the exchange algorithm (Caimo and Friel, 2011) can be used to tackle the former challenge by cancelling out the intractable normalizing factors based on auxiliary variables, it falls short of the latter due to its inability to provide an approximation to the ERGM likelihood.Importance-sampling based approximation (Koskinen, 2004(Koskinen, , 2008) ) appears to be a straightforward solution, but it can be too expensive to be practical when the cost of simulating from ERGMs is considerable, as such simulations are needed for each network observation at each MCMC iteration.Therefore, we propose to consider tractable approximations to the true ERGM likelihood (1) when fitting the proposed model.

Pseudolikelihood
The pseudolikelihood approximates the true ERGM likelihood by a product of full conditional distributions of edge variables where Δ i,j g(y; X) = g(y + ij ; X) − g(y − ij ; X) are the so-called change statistics associated with the dyad (i, j), representing the change in sufficient statistics when y ij is toggled from 0 (y − ij ) to 1 (y + ij ) with the rest of the network remaining unchanged; D denotes the set of all pairs of dyads.For directed networks, D = {(i, j)|i, j ∈ N , i = j}, while for undirected networks, D = {(i, j)|i, j ∈ N , i < j}.In the frequentist paradigm, maximizing (6) gives the so-called MPLE, which is relatively fast, algorithmically convenient, and able to provide approximate parameter estimates for even badly-specified models.
While empirical observations show that MPLE can cause bias and underestimate standard errors (Van Duijn et al., 2009) for models with strong dyadic dependence, it has been the default choice for initialization of MCMC-MLE algorithms.There is also promising work on using bootstrapped MPLE to construct confidence intervals (Schmid and Desmarais, 2017) for large and sparse networks, as the MPLE is usually close to the MLE in such cases (Desmarais and Cranmer, 2010).Similar logic has motivated the use of Bayesian bootstrap estimation based on "pseudo maximum a posteriori (MAP)" estimates using the PL approximation to the likelihood (Grazioli et al., 2019).
As the full conditionals of edge variables are tractable (see ( 6)), sampling from ERGMs can be implemented through a Gibbs sampling procedure.To improve the mixing in the typical case of ERGMs that concentrate probability mass on sparse graphs, the default simulation algorithm in statnet implements a "tie no tie"(TNT) sampler, in which the dyad is randomly selected with equal probability from the set of dyads with ties or the dyads without ties.Bouranis et al. (2018) proposed an adjusted pseudolikelihood for correcting the mode, curvature and magnitude of the pseudolikelihood, and their simulation studies show that the adjusted pseudolikelihood can provide an accurate approximation to the true likelihood in the presence of strong dyadic dependence (where pseudolikelihood falls short).Building upon adjusted pseudolikelihood approximation, Tan and Friel (2020) developed a variety of variational methods for Gaussian approximation of the posterior density and model selection, which is shown to yield comparable performance to that of the exchange algorithm.The adjusted pseudolikelihood is defined as follows,

Adjusted Pseudolikelihood
where the constant C > 0 adjusts the magnitude and φ : R p → R p is an invertible affine transformation that adjusts the mode and curvature to match the true ERGM likelihood.
In ( 8), W is a p × p upper triangular matrix, θML = arg max θ P(y|X; θ) is the maximum likelihood estimate (MLE), and θPL = arg max θ f P L (y|X; φ(θ)) is the maximum pseudolikelihood estimate (MPLE).The estimation of matrix W and scaling constant C require simulating from ERGMs at the MLE to approximate the Hessian of log P(y|X; θ) and at a sequence of points (i.e., temperatures, using the terms for path sampling (Gelman and Meng, 1998)) from the coordinate origin to the MLE to approximate the normalizing factor z( θML ) (see supplementary material (Yin et al., 2021) for more details).

Identifying the Number of Clusters
We derive the posterior distribution of the number of non-empty components, K 0 , from the MCMC output following the approach of Ishwaran et al. (2001) and Nobile et al. (2004).Letting N (t) k denote the number of observations allocated to component k at iteration t = 1, 2, . . ., T (after discarding the burn-in and thinning) and I(•) denote the indicator function, we represent the number of non-empty components at iteration t as , and therefore estimate the posterior P(K 0 = h|y) for h = 1, 2, . . ., K by the respective relative frequency.A natural point estimator for the number of clusters is then the posterior mode, K0 = arg max h P(K 0 = h|y), which is optimal under a zero-one loss.

Post MCMC Inference
Identification of the finite mixture model requires handling the label switching problem caused by the invariance of representation with respect to ordering the components.The resulting multimodality and symmetry of the posterior distribution for symmetric priors makes it difficult to conduct inference on component-specific parameters.Following Malsiner-Walli et al. ( 2016), we perform K-centroids cluster analysis with Mahalanobis distance (Leisch, 2006) on the point process representation of the MCMC draws to post-process the MCMC output and obtain a unique labeling as follows 1. Estimate the number of mixture components ( K0 ) based on the retained MCMC iterations t = 1, 2, . . ., T after discarding the burn-in and thinning using the method described in Section 3.3, that is, K0 = mode(K 2. For T retained MCMC iterations, extract only the subsequence T 0 ( T ) where the number of non-empty components is exactly equal to K0 .
3. For all T 0 iterations in the subsequence, remove the draws from empty components and arrange the remaining draws of the different components in a matrix with K0 × T 0 rows and p columns.
4. Perform K-centroid cluster analysis based on the Mahalanobis distance on K0 ×T 0 draws to group them into K0 clusters.
5. Examine the clustering results and only keep those draws where the resulting cluster membership indicator vector forms a permutation of 1, 2, . . ., K0 , which gives a subsequence of T0 ( T 0 ) draws.
6.For the remaining T0 draws, a unique labeling is achieved by resorting the draws according to the cluster membership from the K-centroid cluster analysis in step 4 and rescaling the elements by their sum in each τ t such that they sum to 1.

Choosing Between Competing Model Specifications
When multiple model specifications M j , j = 1, . . ., J are under consideration, it is important to have a fast and convenient model selection index.In particular, we illustrate two popular information criterion based choices here, the Bayesian information criterion (BIC) (Schwarz, 1978) and a version of the observed deviance information criteria (DIC) introduced by Celeux et al. (2006), which is an extension of the original DIC (Spiegelhalter et al., 2002) to latent variable models.Consider the likelihood of network Y (i) taking value y (i) under the mixture model therefore the BIC is defined as follows (10) where θMj , K0 (M j ) and p Mj corresponds to the posterior mode of θ, the estimated number of mixture components, and the number of parameters in each ERGM component, respectively.Following Celeux et al. (2006), given remaining posterior draws ) and observed ensemble of networks y = (y (1) and

Posterior Probability of Cluster Membership
We estimate the posterior marginal probability of individuals belonging to each cluster (alternately: graphs having been generated by a particular process) via Monte Carlo: where π(θ, τ |y) is the posterior distribution of θ, τ .The integral ( 12) is computationally intractable.Hence we use posterior samples It is also possible to obtain the joint probabilities for the partitions and use decision theoretic approaches (Fritsch and Ickstadt, 2009;Wade and Ghahramani, 2018) to assign cluster memberships.These are promising future directions to pursue if further enhancements of performance are needed.In our work, we focus on the marginal probability method, as we have found it to perform well in practice.

Simulation Studies
We conduct simulation studies to evaluate the performance of the proposed algorithm with pseudolikelihood (PL) and adjusted pseudolikelihood (APL) on the following metrics: identification of mixture components; parameter estimation; and posterior predictive distribution of key network characteristics.We also compare the proposed method to K-means clustering in terms of their cluster membership recovery performance based on (i) network sufficient statistics (SS), (ii) maximum likelihood estimates, and (iii) maximum pseudolikelihood estimates.In all of the simulation studies, we consider ensembles of networks simulated from mixtures of ERGMs with three components of equal sizes.

Performance Evaluation
The identification of mixture components consist of two aspects, namely, the identification of true number of components and the recovery of true cluster memberships.To examine if the proposed algorithm can consistently identify the true number of components, we consider the proportion of times the true number of components is identified across all replicates under each experiment setting.To evaluate how accurately the true cluster memberships can be recovered, we calculate the average value of Rand index (RI) (Rand, 1971) and the average value of variation of information (VI) (Meilȃ, 2007) over 50 replicates.The RI measures the proportion of concordant pairs between two data clusterings, taking value 1 when two clusterings are identical and 0 when two clusterings do not agree on any pair of points.The VI, a true metric on clusterings, yields a value of 0 for identical clusterings with higher values indicating more disparate clustering assignments.We also compare the summary statistics for post MCMC inference, including T 0 , T0 , 1 − T 0 / T , and 1 − T0 /T 0 as defined in Section 3.4 (see supplementary material for more details).
For parameter estimation, we compute the relative bias of posterior mean and frequentist coverage rates of 95% posterior credible intervals.As it is possible that the estimated number of clusters is different from the true number of components, we focus on those runs in which the true number of components is successfully identified when evaluating the parameter estimation accuracy.
We examine the posterior predictive performance of the resulting mixture models by checking the discrepancy between the distributions of several critical network summary measures calculated on the observed networks and those of simulated networks.
As there is a computational overhead associated with the calculation of adjusted pseudolikelihood, we also include a comparison of computation time.

Simulation Design
In this simulation study, we consider ensembles of networks simulated from mixtures of ERGM distributions defined on three most commonly used network sufficient statistics , geometrically weighted edgewise shared partners (GWESP).Here EP k (y) is the number of connected pairs that have exactly k common neighbors, which measures local clustering in a network.The decay parameter φ controls the relative contribution of EP k (y) to the GWESP statistic, and it is fixed at 0.25 in this case.
• g 3 (y; X) = i<j y ij 1 {Xi=Xj } , the total number of edges with endpoints sharing the same value on nodal covariate X, often known as a nodematch term.
We fix nodal covariate X to be a binary variable, and let one half of nodes take value 0, while the other half take value 1 on X.To examine the performance of the proposed approach across a range of different conditions, we run a full-factorial experiment on the following two treatments • Network size: 40, 100.
We thus have a total of 4 experimental conditions, each of which is run under pseudolikelihood (PL) and adjusted pseudolikelihood (APL) (hence there is a total of 8(= 4×2) simulation settings) for 50 repetitions.The true cluster-specific parameters are specified as to ensure that the simulated networks (i) have similar mean degree (∼ 9.9, that is, networks of size 100 have density ∼ 0.10) across different mixture components and network sizes (Figure 2); and (ii) represent three common and intuitive patterns in real-world networks.Parameter settings in the first row correspond to the case in which there is a weak triadic closure effect (edge variables are nearly independent Bernoulli draws), and parameter settings in the second row correspond to the case in which there is a strong homophily effect but an intermediate triadic closure effect, while the parameter settings in the third row correspond to the case in which there is a strong triadic closure effect but an intermediate homophily effect.To maintain this pattern, we fix the values of coefficients associated with GWESP and nodematch terms across settings with different network sizes, and only modify the coefficient of edges term to keep the mean degree value as desired.
We apply the proposed Algorithm 1 to analyze the synthetic data sets with the initial number of clusters K = 6.All MCMC chains are run for 60000 iterations with the first 60% of draws discarded as burn-in and a thinning interval of size 50 to ensure convergence.We draw random values from a discrete uniform distribution taking values in {1, . . ., K} to initialize the latent membership indicators Z 0 i , and draw random values from the Dirichlet prior to initialize the value of τ 0 .The initial values of θ are drawn independently from a uniform distribution U(−0.1, 0.1).We fix α = 0.001 on the basis of some experimentation (more details are available in the Supplementary Material).All computations in this paper are implemented in R (R Core Team, 2019), and we use software suite statnet (Handcock et al., 2008) to generate networks from ERGMs where the burn-in for the first simulated network (network size: n) from each mixture component under each replicate is set as 20n 2 with a thinning interval of 4n 2 iterations.

Identification of Mixture Components
To check the performance of the proposed inferential procedure at identifying the correct number of mixture components, we first count the frequencies of K0 over 50 repetitions across different experimental conditions and likelihood approximation methods.Table 1 shows the proportion of times the true number of mixture components is identified.We observe that the performance of PL at identifying the true number of mixture components appears to be slightly better than that of APL when the network size is large (n = 100), but the performance of adjusted pseudolikelihood (APL) is more robust across all 4 experimental conditions (accuracy > 70%) and it is apparently superior to PL when the network size is small (n = 40).
With respect to the clustering performance measured by average RI and average VI, we note that the proposed mixture model yields satisfactory performance at absolute scale (average RI > 95% for all settings except for APL under network size = 100, cluster size = 30).For K-means clustering (the number of clusters in K-means is set to the estimated number of mixture components from our mixture model in the corresponding replicate to facilitate a fair comparison), we find that K-means clustering on the sufficient statistics is uniformly inferior to K-means clustering on the parameter estimates (MPLE or MLE), and it is worth noting that the K-means clustering based on MPLE is better than that of MLE when the network size is large but worse when the network size is small.Our proposed mixture model has a clear advantage over K-means when the network size is small (n = 40) and this advantage becomes less significant as the sample size and network size increases.All these results confirm the numerical advantage of our proposed mixture model and also suggest that K-means may provide a useful initialization for our method.

Parameter Estimation
Figure 3 summarizes the relative bias of posterior mean estimate under different experimental conditions and likelihood approximation methods (PL and APL).We observe that the relative biases are small in general for both likelihood approximations, and decrease as sample size increases given the same network size.We also note that the relative biases are even smaller for large networks, which confirm the current understanding of PL as discussed in Section 3.2.Although it appears that the relative bias for the parameters associated with GWESP term estimated from APL is quite substantial at relative scale (about 20%) for the cluster "W" (i.e., cluster with weak dyadic dependent effects), it is very small at absolute scale as the true parameter value is −0.10.
Figure 4 shows the frequentist coverage of 95% posterior credible intervals under different experimental conditions and likelihood approximation methods.We notice that the posterior intervals estimated from APL are superior to those estimated from PL in terms of the coverage rates, especially for the parameters associated with the dyadic dependent term (i.e., GWESP, in this context).Also, we observed a fairly general pattern that the frequentist coverage of the posterior credible intervals yielded by APL approaches 95% as the sample size (cluster size) increases regardless of the network size.Therefore we have empirical evidence showing that the curvature correction step for APL functions well to combat the over-confidence issue caused by PL for the interval estimation of parameters associated with the dyadic dependent terms reported in Van Duijn et al. (2009).
In sum, PL can lead to good performance in terms of the point estimation but is lacking in the uncertainty quantification when the dyadic dependent effect is strong, which can be substantially mitigated, if not fully taken care of, by the use of APL.

Posterior Predictive Assessments
We consider four widely used metrics that characterize different aspects of graph structure and compare the mean values of these metrics calculated from posterior predictive distribution to those from the observed data.In practice, other metrics (such as homophily and posterior predictive p-values) can also be used based on practitioner's domain knowledge and preference.A utility function (e.g. a weighted average of metrics) may also be used to summarize the posterior predictive checks.Specifically, our selected graph-level metrics are • Mean eigenvector centrality: the eigenvector centrality (EC) is a node-level metric that measures the degree of membership of a given node in the largest core/periphery structure in the graph, and we take mean eigenvector centrality among all nodes in the graph to convert it to a graph-level metric. 1 The eigenvector centrality 1 Except in very rare cases for which the graph adjacency matrix lacks a principal eigenvalue.In such circumstances, eigenvector centrality is a signed indicator of membership in the two largest core/periphery structures (positive versus negative).is also the best one-dimensional approximation of the graph structure (in a leastsquares sense), and accuracy in reproducing it indicates the extent to which the model is able to recover the broadest structural features of the graph.
• Transitivity: a standard measure of triadic closure in network analysis (Wasserman and Faust, 1994), defined as the ratio of complete triangles to all potentially complete triangles.
• Standard deviation of degree distribution: a measure of the level of heterogeneity in degree distribution.
• Mean of inverse path length: also known as the mean of geodesic distances, a measure of the overall closeness between nodes in a graph.As the above-listed graph-level metric is defined for each graph, we construct an ensemble-level metric out of each one of them by taking the mean, which yields a total of four ensemble-level metrics.When doing posterior predictive simulations, we first select 100 draws from the posterior samples and simulate an ensemble of networks that is of the same size as the synthetic data for each selected draw, and then we compare the posterior predictive data to the distribution giving rise to the synthetic data by calculating the mean absolute relative difference for those four ensemble-level metrics across 50 repetitions under each simulation setting (see Figure 5).We note that PL and APL yield very similar posterior predictive performance when the network size is small (n = 40) but their difference becomes considerable when the network size is large (n = 100) where we note that the posterior samples estimated from APL have an advantage in producing networks that are closer to the target distribution with respect to mean eigenvector centrality and average inverse path length, whereas the posterior samples estimated from PL have an advantage in producing networks that are closer under the respective setting.The numbers in the parentheses in the "Total" and "Relative time" columns are calculated assuming the APL calculation is run in parallel on a machine with m computing cores.

Finite Mixtures of ERGMs for Modeling Ensembles of Networks
to the target distribution with respect to transitivity and degree standard deviation.More importantly, regardless of the likelihood approximation used, there is a clear trend showing that the recovery of those ensemble-level metrics are becoming better as sample size (i.e., cluster size) increases for each fixed network size.

Computation Time
Table 2 shows a comparison on the computation time for PL and APL under four different experimental conditions with MCMC settings specified in Section 4.2 (60000 total iterations, initial number of mixture components K = 6, random initial values) using a single computing core (Intel Xeon E5-2699 v4 CPU @ 2.20GHz).As both PL and APL provide tractable approximations to true ERGM likelihoods, they are comparable in terms of the MCMC sampling time T AP L MCMC ≈ T P L MCMC under the same MCMC settings.As the implementation of APL requires a computational overhead upfront as detailed in Section 3.2, the overall computation time under APL can be substantially larger than that under PL if only one computing core is available, but we note that multi-core architectures can be easily exploited to reduce the overall computation time under APL, as the calculations needed to obtain the values required for APL are independent for all networks.Therefore, given the APL calculation time as T AP L i for the i-th network, i = 1, . . ., m, and MCMC sampling time T AP L MCMC , the total computation time under APL can be reduced from computing cores are available.Therefore, the relative computation time can also be greatly reduced (as shown by the numbers in parentheses in Table 2).If more comput-edges nodematch("D-D") nodemix("D-R") gwesp, φ = 0.25  1) in the supplementary material) can also be drawn in a completely independent manner.

Application to Political Co-Voting Networks
We apply the proposed method to cluster the co-voting networks among U.S. Senators from 1867 (start year of Congress 40) to 2014 (end year of Congress 113), which were first analyzed by Moody and Mucha (2013) using modularity and role-based blockmodels.
The co-voting networks are constructed based on the roll call voting data from http:// voteview.com,which contains the voting decision of each Senator (yay, nay, or abstain) for every bill brought to Congress.2 in the co-voting network represent Senators and an edge is placed between two nodes if the corresponding Senators vote concurrently (both yay or both nay) on at least 75% of the bills to which they were both present.Here we aim at identifying subgroups of networks that appear to have similar generating characteristics within the group but different characteristics across groups.

Model Specification and Estimation
Figure 6 shows that the co-voting networks vary in structure on different years, and the party-affiliation appears to be a key factor affecting the co-voting patterns among Senators.Therefore we consider two competing ERGM specifications in Table 3, where the statistic nodemix("D-R") counts the total number of edges between Democrats and Republicans i<j y ij 1 {Xi=R,Xj =D} .We note that the interpretation of nodematch("D-D") and nodemix("D-R") should be relative to the baseline propensity of forming edges between Republicans under our model specifications.
We note that these networks vary in size (range: 69-112 3 ) and thus include an offset term to adjust for network size, e.g., the Krivitsky reference measure discussed in Section 2.2, which provides a parameterization with constant baseline expected degree.We run long MCMC chains (150000 iterations with first 60% draw discarded as burnin and a thinning interval of size 50) with K = 6 and random initial values using Figure 6: Co-voting networks of 41st, 64th, 71st, 91st, 100th and 111th Congress, which were formed in the year of 1869, 1915, 1929, 1969, 1987 and 2009, respectively.Colors indicate Senators' party affiliations, blue = Democrats(D), red = Republican(R).
the Dirichlet prior for τ and multivariate normal priors for θ k 's.We set the mean of the multivariate normal prior μ to be (1, 0, . . ., 0 p−1 ) to reflect the prior belief that the expected mean degree is larger than 1.Because the total number of cross-party edges is zero for some networks, we use the pseudolikelihood approximation for statistical inference as there is no finite maximizer of the likelihood when the observed statistics happen to lie on the relative boundary of the convex hull of possible values of the sufficient statistics (Handcock, 2003), causing issues for approximating the normalizing factor.
We calculate the DIC and BIC values for the candidate model specifications, which yields DIC(M 1 ) = 220240, DIC(M 2 ) = 188177; BIC(M 1 ) = 219869, BIC(M 2 ) = 187806; since both model selection criteria identify M 2 as the better model specification, we conduct cluster-specific analysis based on it.Such a huge gap in DIC values is indicative of the substantial importance of incorporating triadic closure effects in modeling co-voting networks.The estimated number of non-empty mixture components K0 is 4 for M 2 and we summarize the posterior mean and 95% credible intervals for cluster-specific parameters under M 2 in Table 4.We note that the size-invariant parameters for edge term (first column) can be interpreted as the log of the baseline mean degree (rather than the logit of the baseline density, as in the case of the counting measure), suggesting expected degrees varying from approximately exp(0.92)≈ 2.5 to exp(2.47)≈ 11.8 across clusters prior to consideration of other effects.
Based on these estimates, we see inhibition of cross-party ties (D-R row in Table 4) except for the first cluster and strong triadic closure (see gwesp.0.25 row in Table 4), as well as apparent qualitative distinctions between clusters.To probe the impact of the four behavioral regimes inferred from the co-voting data, we take a simulation-based approach which exploits a critical advantage of working with a fully generative model: the ability to perform "what-if" analyses that separate effects due to observed covariates from differences in structure arising from differences in generative processes.Specifically, we consider how the entire ensemble of Congressional networks would be expected to have been different, if each respective regime had governed the U.S. Congress for the entire study period.Such an analysis proceeds as follows.First, we simulate a set of posterior predictive networks for each Congress during the study period, with parameters drawn from the posterior distribution of each respective cluster.Each collection of networks can be thought of as a simulated "alternate history," in which the size and composition of each Congress were held to their real-world values but the behavioral tendencies that shaped the co-voting networks throughout the period were reflective of only one of the three clusters.Systematic differences in network structure across sets thus provide insight into the potential impact of behavioral regime, controlling for size and composition.
One important property that can be probed in this way is the expected incidence of voting coalitions, which play an important role in party politics.Here, we focus on minimal coalitions, defined as sets of three legislators who consistently vote together (i.e., triangles).Within-party coalitions can be sources of party cohesion, although they also act as blocks that can sometimes resist (and must be negotiated with by) party leaders; cross-party coalitions, by contrast, pose significant challenges to party cohesion, but can also serve as foci for sponsorship and promotion of bipartisan legislation.Both are hence significant, with distinct implications for the political landscape.To examine the coalition structures that would have been expected to occur under our three behavioral regimes, we simulate ten "alternate histories" from the posterior distributions of each cluster, calculating the realized proportions of intra-Democratic, intra-Republican, and inter-Party triangles.(That is, the counts of fully connected triads with all three members as Democrats, all three members as Republicans, or members from both parties, scaled by their maximum possible values.)Using proportions rather than raw counts ensures these metrics are normalized for network size and the distribution of party affiliations in each Congress; substantively, this choice of scaling tells us how close each party (or the cross-party cut) is to forming a perfect coalition, in which all members vote in concert.Figures 7 and 8 respectively show the realized proportion of intra-party and inter-party triangles in the simulated networks.Both figures show substantial differences in coalition structure, implying that the behavioral regimes associated with the three inferred clusters would be expected to have a meaningful impact on the political process.Specifically, we note the following: • The regime of cluster 1 is marked by the extremely strong coalition of Democrats compared to Republicans.
• The key symbol of the regime of cluster 2 is marked by the formation of very few voting coalitions, either within party or between party.
• By contrast, the regimes of cluster 3 show a much higher incidence of intra-party coalition formation, with roughly 10-20% of the potential intra-party coalitions being present.Coalition incidence differs by party, with Republicans forming more coalitions in percentage versus Democrats.Interestingly, the regime 3 also shows the highest rate of cross-party coalition formation; while the rate is very low overall, it is considerably higher than all other clusters.
Figure 8: Proportion of realized inter-party triangles in simulated networks.
• Finally, the regime of cluster 4 favors extremely high levels of intra-party cohesion, with rates approaching 50% of the maximum possible for Republicans and 70% for Democrats.As this implies, the resulting networks are also highly asymmetric, with the Democratic party expected to generate a much more cohesive coalition structure than the Republicans.Interestingly, this strong intra-party coalition formation does not exist entirely at the expense of cross-party coalitions: we find an expected rate of cross-party coalition formation that is only slightly less than that expected for networks arising under cluster 3.That said, the much higher incidence of intra-party coalition formation under cluster 4 leads inter-party coalitions to be a smaller fraction of the total coalition set than under cluster 3, potentially making them less critical to the legislative process.
Taken together, these observations suggest that the cluster 2 regime tends to generate uniformly loose voting networks with very few coalitions of any kind.These networks may resist polarization, but their high level of fragmentation may make it more difficult to assemble the sorts of alliances needed to push through controversial legislation.By contrast, the regime of cluster 3 tends to produce uniformly clustered networks with moderately high levels of coalition formation in both parties coupled with relatively high numbers of cross-party coalitions.These networks may pose particular challenges for party leaders, as they contain a mix of multiple local coalitions that must be courted for votes, "lone wolves" outside of coalitions who must be approached individually, and likely defectors whose cross-party coalitions provide a bullwark against within-party influence.The regime of cluster 4 tends to produce party-cohesive networks dominated by dense intra-party coalitions on both sides of the aisle (but with substantially higher levels of cohesion among Democratic legislators).This regime offers party leaders the greatest chance of being able to mobilize members in support of legislation, at the cost of potential legislative deadlock during periods of high inter-party conflict.Finally, the extreme behavior of Democrats in early years under the regime of cluster 1 might be partially attributed to the fact that the percentage share of Democratic legislators in the Congress is too low to allow them to be divisive (17% in 1867, 16% in 1869, 24% in 1871, 27% in 1873).
Figure 9 shows maximum probability cluster assignments over the study period, which enables us to connect the co-voting behaviors to different eras.We observe that the co-voting culture represented by cluster 1 is seen mostly at the beginning of the study period (1867, 1869, 1871, 1873 and 1877), the culture represented by cluster 2 and cluster 3 alternate in the late nineteenth and almost the entire twentieth centuries, while the culture represented by asymmetric polarization represented by cluster 4 becomes dominant after the late 1990's.

Application to Brain Functional Connectivity Networks
In this section, we apply the proposed method to model an ensemble of brain functional connectivity networks that was previously studied in Simpson et al. (2011Simpson et al. ( , 2012)).The data were collected among 10 normal subjects (5 female, average age: 27.7 years old, standard deviation: 4.7 years) who were part (Subject No. 002, 003, 005, 008, 009, 010, 012, 013, 016, 021) of a larger functional magnetic resonance imaging (fMRI) studies of age-related changes in cross-modal deactivations (Peiffer et al., 2009).The nodes in brain functional connectivity networks correspond to 90 distinct anatomical regions of interest (ROI) defined by the Automated Anatomical Labeling atlas (AAL) (Tzourio-Mazoyer et al., 2002) and these 90 regions are divided symmetrically across left and right hemispheres (see Figure 10).The edges in these functional connectivity networks were constructed by thresholding the temporal correlation coefficient matrix adjusted for motion and physiological noise (see Hayasaka and Laurienti, 2010;Simpson et al., 2011, for further details) to discard the weak and non-significant links that may represent spurious connections.The thresholds were selected at subject level to make log(n) log( d) ≈ 2.8, or equivalently d ≈ n 1/2.8 ≈ 5 for each network, where n = 90 is the total number of nodes and d is the average degree.

Model Specification
With the edge statistic controlling the baseline propensity of forming edges, Simpson et al. (2011) proposed to use geometrically weighted edgewise shared partner (GWESP) edges gwesp, φ = 0.25 gwnsp, φ = 0.25 nodematch("Hemisphere") statistic and geometrically weighted non-edgewise shared partner (GWNSP) statistic4 to capture the two most important structural features in brain functional connectivity networks, namely, functional segregation and functional integration (Rubinov and Sporns, 2010), where the functional segregation in the brain refers to the ability for specialized processing to occur within densely interconnected groups of brain regions, and the functional integration in the brain means the ability to rapidly combine specialized information from distributed brain regions.In the context of network modeling perspective, the former favors the graphs with more local clusters (local efficiency), whereas the latter requires the presence of bridging edges connecting the local clusters to make different regions in the whole brain coordinate more efficiently (global efficiency).Building upon this seminal work, Sinke et al. (2016) added a nodematch term with respect to hemisphere (i.e., total number of edges connecting nodes in the same hemisphere) to capture the additional propensity of forming edges between nodes in the same hemisphere.Following the choices in literature, we consider two competing model specifications (Table 5).
For each candidate model specification, we use adjusted pseudolikelihood and run long MCMC chains (100000 total iterations where first 60% of draws are discarded as burn-in and a thinning interval of size 100) with K = 4 and prior specified in Section 3. The estimated number of mixture components K0 is 2 for both model specifications.We calculate the DIC and BIC values for candidate model specifications and find DIC(M 1 ) = 25997.5,DIC(M 2 ) = 26873.2;BIC(M 1 ) = 25999.2,BIC(M 2 ) = 26835.2,therefore we select M 1 for subsequent analysis (see online supplementary material for the results of posterior predictive checks with respect to two key characteristics of primary interests, global efficiency and local efficiency).We note that our mixture model provides a richer framework for assessing the heterogeneity in brain functional connectivity networks than was feasible in prior work such as Simpson et al. (2011) and Sinke et al. (2016), in which each individual network was fitted to ERGMs separately followed by a direct comparison on the coefficient values across prespecified subgroups (e.g., age).Figure 11: Global efficiency and local efficiency for networks simulated from cluster 1 and cluster 2.

Results
Table 6 presents the posterior means and 95% credible intervals for cluster-specific parameters and we run simulations to compare the difference in local efficiency and global efficiency between the networks represented by these two clusters.Specifically, we select 100 draws from the retained posterior samples for each cluster and simulate a network for each selected draw, which yields 100 representative networks for each cluster.We calculate the global efficiency and local efficiency (see Rubinov and Sporns, 2010, for details on these metrics) of these simulated networks using function efficiency in R package brainGraph (Watson, 2020).Figure 11 shows that the networks represented by cluster 1 exhibits higher level of global efficiency (functional aggregation) while those represented by cluster 2 shows higher level of local efficiency (functional segregation).

Finite Mixtures of ERGMs for Modeling Ensembles of Networks
Figure 12: Posterior probability of cluster membership.The x-axis gives the subject id (age) order by age.
Figure 12 shows the posterior probability of cluster membership with subject id ordered by age.Despite the fact that several previous published analyses (Simpson et al., 2011;Sinke et al., 2016) were concerned with comparing the brain functional connectivity networks between younger and older adults, we do not a observe consistent pattern between cluster membership and age from our clustering results; this suggests that, while individual differences are present, they may not correspond to a priori obvious characteristics.The ability to discover such patterns (rather than to have to specify them ex ante) is an advantage of this approach.

Conclusion
This paper develops a mixture of ERGMs approach for modeling the generative process leading to heterogeneous network ensembles.We propose a Metropolis-within-Gibbs algorithm to perform Bayesian inference, and estimate the number of mixture components based on a deliberately overfitting model with sparse priors that allow us to eliminate excess components during MCMC.To deal with the intractable likelihoods of ERGMs with complex dependence, we consider two tractable approximations (pseudolikelihood (PL) and adjusted pseudolikelihood (APL)) to facilitate efficient statistical inference.Simulation studies show that both PL and APL are able to provide accurate point estimates of cluster-specific parameters but the latter provides better quantification of the uncertainties.With respect to clustering performance, the APL appears to be more robust in terms of identifying the true number of clusters across different network sizes and sample sizes, while the PL improves substantially when the network size is large.Although the computation time of the APL is substantially larger than that of PL if only one computing core is available, this disadvantage can be greatly reduced when multi-core architectures are available.
In our model, we choose to use small values of α to promote a small number of clusters for parsimonious estimation and to maintain the desired theoretical properties of the resulting posterior distributions.This idea has been previously explored in Malsiner-Walli et al. (2016) and a theoretical justification is given in Rousseau and Mengersen (2011).However, those two references mainly focus on regular likelihood function and it remains unclear how those results can be extended to PL and APL for complex models such as ERGM.Investigating the impact of α on the posterior distribution under the use of PL and APL is an interesting future working direction.
We apply the proposed approach to study the political co-voting networks among U.S. Senators with size-adjusted parameterizations being used to account for the difference in the size of the observed networks, and identify four clusters that represent very different co-voting patterns.After matching the clusters with temporal information, we observe one extremely asymmetric co-voting pattern under which the Democrats are extremely cohesive, one roughly symmetric co-voting pattern with very few voting coalitions and one mildly asymmetric co-voting patterns under which the Republicans are more cohesive that alternated in late nineteenth and almost the entire twentieth century, and an abrupt shift in the co-voting pattern towards the direction of political party polarization in last two decades.As the regimes of voting behavior are visibly correlated over time according to our analysis, further analysis of this data can leverage dynamic network models, which is a whole area onto itself, to capture the temporal dependency at the node level (i.e., same Senator incumbent for multiple Congresses) as well as at the graph level (i.e., neighboring Congresses are more likely to hold similar co-voting culture).
The application to brain functional connectivity networks showcases that the proposed approach can offer a more principled alternative to existing methods, which fit each network to the ERGM separately, for modeling the heterogeneity in brain functional connectivity networks.Compared to other methods in the literature, our method allows principled statistical inference for the generative processes of heterogeneous ensembles of networks.
In closing, we comment on four important directions for future research that could prove beneficial to the modeling of ensembles of networks.It is worth noting that the sizes of the US congresses between 1867 and 2014 range from 69 to 112, non-identical but broadly similar.More importantly, these size changes occur within a social system whose basic structure remains fairly similar throughout the time period.In other cases, however, large size differences may be accompanied by increasingly complex internal barriers to interaction or other additional exogenous structure that must be accounted for to obtain realistic predictions.While this additional structure is not available in the form of covariates, more sophisticated size-adjusted parameterizations may be required; reference measures or other tools facilitating "automatic" correction of such

Finite Mixtures of ERGMs for Modeling Ensembles of Networks
effects would facilitate mixture modeling in such scenarios.With respect to likelihood calculation, it is encouraging that we obtain favorable results in terms of clustering performance when the network size is large and point estimation in general in our simulation study using the easily computed pseudo-likelihood approximation, and adjusted pseudolikelihood can provide us with better uncertainty estimates and more robust clustering performance in general but at higher computational cost.Although the additional computational costs accompanied by the use of adjusted pseudolikelihood (APL) can be reduced with the use of multi-core architectures, it is of critical interest to identify the optimal settings beyond which there is diminishing return for the path sampling step involved in the APL calculation.A natural further extension of the finite mixture modeling framework could be Dirichlet Process mixtures (DPM) or mixture of finite mixtures (MFM) of ERGMs where the number of mixture components can vary depending on the incoming data size.Although computationally challenging, such an extension can provide a highly flexible-yet-interpretable density estimation framework for complex graph distributions.Last but not least, the developments of more scalable inference algorithms (e.g., variational inference) are favorable for handling data of large size.

Figure 1 :
Figure 1: Structure of the graph mixture model.Random quantities are depicted within circles, fixed quantities within rectangles; observables are shaded.

Figure 3 :
Figure 3: Relative bias.Three clusters are represented by the strength of dyadic dependent effects (S: Strong; I: Intermediate; W: Weak), which is GWESP with φ = 0.25 in this case.The solid blue horizontal line corresponds to the target level: 0.

Figure 4 :
Figure 4: Frequentist coverage of 95% posterior credible intervals.Three clusters are represented by the strength of dyadic dependent effects (S: Strong; I: Intermediate; W: Weak), which is GWESP with φ = 0.25 in this case.The solid blue horizontal line corresponds to the target level: 0.95.

Figure 5 :
Figure 5: Mean absolute relative difference across 50 repetitions under each simulation setting.

Figure 9 :
Figure 9: Maximum probability cluster assignments over study period.Colors indicate the majority party determined by the numbers of Senators in the dataset in the corresponding Congress (blue = Democratic (D), red = Republican (R)).Regimes of voting behavior are visibly correlated over time.

Figure 10 :
Figure 10: Brain connectivity networks of Subject No. 021, 003, 016, 013.Colours (green, blue) indicate the different hemispheres (left, right); coordinates of nodes are solely determined by the built-in algorithm of R package network (Butts, 2008) for the purpose of demonstration, without taking the actual 3D spatial structure into consideration.

Table 1 :
Estimation accuracy of K0 , and average RI across 50 replicates for four experimental conditions (network size, cluster size), under the proposed mixture model (bayesERGMmix), and K-means clustering on MPLE, MLE and sufficient statistics (SS).The number of clusters in K-means clustering are based on the K0 estimated from bayesERGMmix (PL or APL, whichever yields higher accuracy of K0 in the respective experimental condition).

Table 2 :
Elapsed time for four experimental conditions (network size, cluster size) measured in minutes with PL and APL as approximations to true ERGM likelihoods.The "Relative time" column shows the computation time relative to that of PL under the same MCMC setting for the respective simulation setting.The numbers in parentheses in the "APL calculation" column are max i=1,...,m T AP L i

Table 3 :
List of candidate model specifications (M 1 , M 2 ) for co-voting networks.indicatesthe corresponding term is included in the respective model.ingcores are available, we note that the computation time under APL can be further reduced by exploiting the fact that network samples needed at different temperatures (see Equation (

Table 5 :
List of candidate model specifications (M 1 , M 2 ) for brain functional connectivity networks.indicatesthe corresponding term is included in the respective model.