Exponential-Family Random Graph Models for Valued Networks

Exponential-family random graph models (ERGMs) provide a principled and flexible way to model and simulate features common in social networks, such as propensities for homophily, mutuality, and friend-of-a-friend triad closure, through choice of model terms (sufficient statistics). However, those ERGMs modeling the more complex features have, to date, been limited to binary data: presence or absence of ties. Thus, analysis of valued networks, such as those where counts, measurements, or ranks are observed, has necessitated dichotomizing them, losing information and introducing biases. In this work, we generalize ERGMs to valued networks. Focusing on modeling counts, we formulate an ERGM for networks whose ties are counts and discuss issues that arise when moving beyond the binary case. We introduce model terms that generalize and model common social network features for such data and apply these methods to a network dataset whose values are counts of interactions.


Introduction
Networks are used to represent and analyze phenomena ranging from sexual partnerships (Morris and Kretzschmar, 1997), to advice giving in an office (Lazega and Pattison, 1999), to friendship relations (Goodreau, Kitts and Morris, 2008;Newcomb, 1961), to international relations (Ward and Hoff, 2007), to scientific collaboration, and many other domains (Goldenberg et al., 2009). More often than not, the relations of interest are not strictly dichotomous in the sense that all present relations are effectively equal to each other. For example, in sexual partnership networks, some ties are short-term while others are longterm or marital; friendships and acquaintance have degrees of strength, as do international relations; and while a particular individual seeking advice might seek it from some coworkers but not others, he or she will likely do it in some specific order and weight advice of some more than others.
Network data with valued relations come in many forms. Observing messages (Freeman and Freeman, 1980;Diesner and Carley, 2005), instances of personal interaction (Bernard, Killworth andSailer, 1979-1980), or counting co-occurrences or common features of social actors (Zachary, 1977;Batagelj and Mrvar, 2006) produce relations in the form of counts. Measurements, such as duration of interaction (Wyatt, Choudhury and Bilmes, 2009) or volume of trade (Westveld and Hoff, 2011) produce relations in the form of (effectively) continuous values. Observations of states of alliance and war (Read, 1954) produce signed relationships. Sociometric surveys often produce ranks in addition to binary measures of affection (Sampson, 1968;Newcomb, 1961;Bernard, Killworth andSailer, 1979-1980;Harris et al., 2003).
Exponential-family random graph models (ERGMs) are generative models for networks which postulate an exponential family over the space of networks of interest (Holland and Leinhardt, 1981;Frank and Strauss, 1986), specified by their sufficient statistics , or, as with Frank and Strauss (1986), by their conditional independence structure leading to sufficient statistics (Besag, 1974). These sufficient statistics typically embody the features of the network of interest that are believed to be significant to the social process which had produced it, such as degree distribution (e.g., propensity towards monogamy in sexual partnership networks), homophily (i.e., "birds of a feather flock together"), and triad-closure bias (i.e., "a friend of a friend is a friend") .  A major limitation of ERGMs to date has been that they have been applied almost exclusively to binary relations: a relationship between a given actor i and a given actor j is either present or absent. This is a serious limitation: valued network data have to be dichotomized for ERGM analysis, an approach which loses information and may introduce biases. (Thomas and Blitzstein, 2011) Some extensions of ERGMs to specific forms of valued ties have been formulated: to networks with polytomous tie values, represented as a constrained three-way binary array by Robins, Pattison and Wasserman (1999) and more directly by Wyatt, Choudhury and Bilmes (2009;; to multiple binary networks by Pattison and Wasserman (1999); and the authors are also aware of some preliminary work by Handcock (2006) on ERGMs for signed network data. Rinaldo, Fienberg and Zhou (2009) discussed binary ERGMs as a special case and a motivating application of their developments in geometry of discrete exponential families.
A broad exception to this limitation has been a subfamily of ERGMs that have the property that the ties and their values are stochastically independent given the model parameters. Unlike the dependent case, the likelihoods for these models can often be expressed as generalized linear or nonlinear models, and they tend to have tractable normalizing constants, which allows them to more easily be embedded in a hierarchical framework. Thus, to represent common properties of social networks, such as actor heterogeneity, triad-closure bias, and clustering, latent class and position models have been used and extended to valued networks. (Hoff, 2005;Krivitsky et al., 2009;Mariadassou, Robin and Vacher, 2010) In this work, we generalize the ERGM framework to directly model valued networks, particularly networks with count dyad values, while retaining much of the flexibility and interpretability of binary ERGMs, including the abovedescribed property in the case when tie values are independent under the model. In Section 2, we review conventional ERGMs and describe their traits that valued ERGMs should inherit. In Section 3, we describe the framework that extends the model class to networks with counts as dyad values and discuss additional considerations that emerge when each dyad's sample space is no longer binary. In Section 4, we give some details and caveats of our implementation of these models and briefly address the issue of ERGM degeneracy as it pertains to count data. Applying ERGMs requires one to specify and interpret sufficient statistics that embody network features of interest, all the while avoiding undesirable phenomena such as ERGM degeneracy. Thus, in Section 5, we introduce and discuss statistics to represent a variety of features commonly found in social networks, as well as features specific to networks of counts. In Section 6 we use these statistics to model social forces that affect the structure of a network of counts of conversations among members of a fraternity. Finally, in Section 7, we discuss generalizing ERGMs to other types of valued data.

ERGMs for binary data
In this section, we define notation, review the (potentially curved) exponentialfamily random graph model and identify those of its properties that we wish to retain when generalizing.

Notation and binary ERGM definition
Let N be the set of actors in the network of interest, assumed known and fixed for the purposes of this paper, and let n ≡ |N | be its cardinality, or the number of actors in the network. For the purposes of this paper, let a dyad be defined as a (usually distinct) pair of actors, ordered if the network of interest is directed, unordered if not, between whom a relation of interest may exist, and let Y be the set of all dyads. More concretely, if the network of interest is directed, Y ⊆ N × N , and if it is not, Y ⊆ {{i, j} : (i, j) ∈ N × N }. In many problems, a relation of interest cannot exist between an actor and itself (e.g., a friendship network), or actors are partitioned into classes with relations only existing between classes (e.g., bipartite networks of actors attending events), in which case Y is a proper subset of N × N , excluding those pairs (i, j) between which there can be no relation of interest.
Further, let the set of possible networks of interest (the sample space of the model) Y ⊆ 2 Y , the power set of the dyads in the network. Then a network y ∈ Y, can be considered a set of ties (i, j). Again, in some problems, there may be additional constraints on Y. A common example of such constraints are degree constraints induced by the survey format (Harris et al., 2003;Goodreau, Kitts and Morris, 2008).
Using notation similar to that of Hunter and Handcock (2006) and Krivitsky, Handcock and Morris (2011), an exponential-family random graph model has the form for random network variable Y and its realization y; model parameter vector θ ∈ Θ (for parameter space Θ ⊆ R q ) and its mapping to canonical parameters η : Θ → R p ; a vector of sufficient statistics g : Y → R p , which may also depend on data x, assumed fixed and known; and a normalizing constant (in y) κ η,g : R q → R which ensures that (1) sum to 1 and thus has the value Here, we have given the most general case defined by Hunter and Handcock (2006). Usually, q = p and η(θ) ≡ θ, so the exponential family is linear. For notational simplicity, we will omit x for the remainder of this paper, as g incorporates it implicitly.

Properties of binary ERGM
2.2.1. Conditional distributions and change statistics Snijders et al. (2006), , Krivitsky, Handcock and Morris (2011), and others define change statistics or change scores, which emerge when considering the probability of a single dyad having a tie given the rest of the network and provide a convenient local interpretation of ERGMs. To summarize, define the p-vector of change statistics where y + (i, j) is the network y with edge or arc (i, j) added if absent (and unchanged if present) and y − (i, j) is the network y with edge or arc (i, j) removed if present (and unchanged if absent). Then, through cancellations, It is often the case that the form of ∆ i,j g(y) is simpler than that of g(y) both algebraically and computationally. For example, the change statistic for edge count |y| is simply 1, indicating that a unit increase in η |y| (θ) will increase the conditional log-odds of a tie by 1, while the change statistic for the number of triangles in a network is |y i ∩ y j |, the number of neighbors i and j have in common, suggesting that a positive coefficient on this statistic will increase the odds of a tie between i and j exponentially in the number of common neighbors.  and Krivitsky, Handcock and Morris (2011) offer a further discussion of change statistics and their uses, and Snijders et al. (2006) and Schweinberger (2011) use them to diagnose degeneracy in ERGMs. It would be desirable for a generalization of ERGM to valued networks to facilitate similar local interpretations. Furthermore, the conditional distribution serves as the basis for maximum pseudo-likelihood estimation (MPLE) for these models. (Strauss and Ikeda, 1990)

Relationship to logistic regression
If the model has the property of dyadic independence discussed in the Introduction, or, equivalently, the change statistic ∆ i,j g(y) is constant in y (but may vary for different (i, j)), the model trivially reduces to logistic regression. In that case, the MLE and the MPLE are equivalent. (Strauss and Ikeda, 1990) Similarly, it may be a desirable trait for valued generalizations of ERGMs to also reduce to GLM for dyad-independent choices of sufficient statistics.

ERGM for counts
We now define ERGMs for count data and discuss the issues that arise in the transition.

Model definition
Define N , n, and Y as above. Let N 0 be the set of natural numbers and 0. Here, we focus on counts with no a priori upper bound -or counts best modeled thus. Instead of defining the sample space Y as a subset of a power set, define it as Y ⊆ N Y 0 , a set of mappings that assign to each dyad (i, j) ∈ Y a count. Let y i,j = y(i, j) ∈ N 0 be the value associated with dyad (i, j).
A (potentially curved) ERGM for a random network of counts Y ∈ Y then has the pmf where the normalizing constant with η, g, and θ defined as above, and ( Barndorff-Nielsen, 1978, pp. 115-116;Brown, 1986, pp. 1-2), with Θ N being the natural parameter space if the ERGM is linear. Notably, while (3) is trivial for binary networks because their sample space is finite, for counts it can be a fairly complex constraint. For the remainder of this paper, we will focus on linear ERGMs, so unless otherwise noted, p = q and η(θ) ≡ θ.

Reference measure
In addition to the specification of the sufficient statistics g and, for curved families, mapping η of model parameters to canonical parameters, an ERGM for counts depends on the specification of the function h : Y → [0, ∞). Formally, along with the sample space, it specifies the reference measure: the distribution relative to which the exponential form is specified. For binary ERGMs, h is usually not specified explicitly, though in some ERGM applications, such as models with offsets (Krivitsky, Handcock and Morris, 2011, for example) and profile likelihood calculations of , the terms with fixed parameters are implicitly absorbed into h.
For valued network data in general, and for count data in particular, specification of h gains a great deal of importance, setting the baseline shape of the dyad distribution and constraining the parameter space. Consider a very simple p = 1 model with g(y) = ( (i,j)∈Y y i,j ), the sum of all dyad values. If h(y) = 1 (i.e., discrete uniform), the resulting family has the pmf ∼ Geometric(p = 1 − exp (θ)), with θ < 0 by (3). On the other hand, suppose that, instead, h(y) = (i,j)∈Y (y i,j !) −1 . Then, ∼ Poisson(µ = exp (θ)), with Θ N = R. The shape of the resulting distributions for a fixed mean is given in Figure 1.
The reference measure h thus determines the support and the basic shape of the ERGM distribution. For this reason, we define a geometric-reference ERGM to have the form (2) with h(y) = 1 and a Poisson-reference ERGM to have h(y) = (i,j)∈Y (y i,j !) −1 .
Note that this does not mean that any Poisson-reference ERGM will, even under dyadic independence, be dyadwise Poisson. We discuss the sufficient conditions for this in Section 5.2.1.

Inference and implementation
As exponential families, valued ERGMs, and ERGMs for counts in particular, inherit the inferential properties of discrete exponential families in general and binary ERGMs in particular, including calculation of standard errors and analysis of deviance. They also inherit the caveats. For example, the Wald test results based on standard errors depend on asymptotics which are questionable for ERGMs with complex dependence structure (Hunter and Handcock, 2006), so, in Section 6 we confirm the most important of the results using a simple Monte Carlo test: we fit a nested model without the statistic of interest and simulate its distribution under such a model. The quantile of the observed value of the statistic of interest can then be used as a more robust P -value.
At the same time, generalizing ERGMs to counts raises additional inferential issues. In particular, the infinite sample space of counts means that the con-straint (3) is not always trivially satisfied, which results in some valued ERGM specifications not fulfilling regularity conditions. We give an example of this in Section 5.2.3 and Appendix B. Additional computational issues also arise.

Computational issues
The greatest practical difficulty associated with likelihood inference on these models is usually that the normalizing constant κ h,η,g (θ) is intractable, its exact evaluation requiring integration over the sample space Y. However, the exponential-family nature of model also means that, provided a method exists to simulate realizations of networks from the model of interest given a particular θ, the methods of Geyer and Thompson (1992) for fitting exponential families with intractable normalizing constants and, more specifically, their application to ERGMs by Hunter and Handcock (2006), may be used. These methods rely on network sufficient statistics rather than networks themselves and can thus be used with little modification. More concretely, the ratio of two normalizing constants evaluated at θ ′ and θ can be expressed as so given a sample Y (1) , . . . , Y (S) from an initial guess θ, it can be estimated Another method for fitting ERGMs, taking advantage of the equivalence of the method of moments to the maximum likelihood estimator for linear exponential families, was implemented by Snijders (2002), using the algorithm by Robbins and Monro (1951) for simulated statistics to fit the model. This approach also trivially extends to valued ERGMs.
Furthermore, because the normalizing constant (if it is finite) is thus accommodated by the fitting algorithm, we may focus on the unnormalized density for the purposes of model specification and interpretation. Therefore, for the remainder of this paper, we specify our models up to proportionality, as Geyer (1999) suggests.
That (3) is not trivially satisfied for all θ ∈ R q presents an additional computational challenge: even for relatively simple network models, the natural parameter space Θ N may have a nontrivial shape. For example, even a simple geometric-reference ERGM an intersection of up to |Y| half-spaces (linear constraints). Models with complex dependence structure may have less predictable parameter spaces, and, due to the nature of the algorithm of Hunter and Handcock (2006), the only general way to detect whether a guess for θ had strayed outside of Θ N may be by diagnostics on the simulation. Bayesian inference with improper priors faces a similar problem, and addressing it in the context of ERGMs is a subject for future work. For this paper, we focus on models in which parameter spaces are provably unconstrained or have very simple constraints.
We base our implementation on the R package ergm for fitting binary ERGMs. (Handcock et al., 2012) The design of that package separates the specification of model sufficient statistics from the specification of the sample space of networks , so we implement our models by substituting in a Metropolis-Hastings sampler that implements our Y and h of interest. (A simple sampling algorithm for realizations from a Poisson-reference ERGM, optimized for zero-inflated data, is described in Appendix A.) This implementation will be incorporated into a future public release of ergm.

Model degeneracy
Application of ERGMs has long been associated with a complex of problems collectively referred to as "degeneracy". (Handcock, 2003;Rinaldo, Fienberg and Zhou, 2009;Schweinberger, 2011) Rinaldo, Fienberg and Zhou, in particular, list three specific, interrelated, phenomena: 1) when a parameter configuration -even the MLE -induces a distribution for which only a small number of possible networks have non-negligible probabilities, and these networks are often very different from each other (e.g., a sparser-than-observed graph and a complete graph) for an effectively bimodal distribution; 2) when the MLE is hard to find by the available MCMC methods; and 3) when the probability of the observed network under the MLE is relatively low -the observed network is, effectively, between the modes. This bimodality and concentration is often a consequence of the model inducing overly strong positive dependence among dyad values. For example, Snijders et al. (2006) use change statistics to show that under models with positive coefficients on triangle and k-star (k ≥ 2) counts -the classic "degenerate" ERGM terms -every tie added to the network increases the conditional odds of several other ties and does not decrease the odds of any, creating what Snijders et al. call an "avalanche" toward the complete graph, which emerges as by far the highest-probability realization. (More concretely, under a model with a triangle count with coefficient θ △ , adding a tie (i, j) increases the conditional odds of as many ties as i and j have neighbors by exp (θ △ ).) Adjusting other parameters, such as density, down to obtain the expected level of sparsity close to that of the observed graph merely induces the bimodal distribution of Phenomenon 1.
An infinite sample space makes Phenomenon 1, as such, unlikely, because the "avalanche" does not have a maximal graph in which to concentrate. However, it does not preclude excessive dependence inducing a bimodal distribution at the MLE, even if neither mode is remotely degenerate in the probabilistic sense. The observed network being between these modes, this may lead to Phenomenon 3, and, due to the nature of the estimation algorithms, such a situation may, indeed, lead to failing estimation -Phenomenon 2.
In this work, we seek to avoid this problem by constructing statistics that prevent the "avalanche" by limiting dependence or employing counterweights to reduce it. (An example of the former approach is the modeling of transitivity in Section 5.2.6, and an example of the latter is the centering in the within-actor covariance statistic developed in Section 5.2.5.) Formal diagnostics developed to date, such as those of Schweinberger (2011) do not appear to be directly applicable to models with infinite sample spaces, so we rely on MCMC diagnostics  instead.

Statistics and interpretation for count data
In this section, we develop sufficient statistics for count data to represent network features that may be of interest and discuss their interpretation. In particular, unless otherwise noted, we focus on the Poisson-reference ERGM without complex constraints: Y = N Y 0 and h(y) = (i,j)∈Y (y i,j !) −1 .

Interpretation of model parameters
The sufficient statistics of the binary ERGMs and valued ERGMs alike embody the structural properties of the network that are of interest. The tools available for interpreting them are similar as well.

Expectations of sufficient statistics
In a linear ERGM, if Θ N is an open set, then, for every k ∈ 1..p, and holding θ k ′ , k ′ = k, fixed, it is a general exponential family property that the expectation E θ;h,η,g (g k (Y )) is strictly increasing in θ k . (Barndorff-Nielsen, 1978, pp. 120-121) Thus, if the statistic g k is a measurement of some feature of interest of the network (e.g., magnitude of counts, interactions between or within a group, isolates, triadic structures), a greater value of θ k results in a distribution of networks with more of the feature measured by g k present.

Discrete change statistic and conditional distribution
Binary ERGM statistics have a "local" interpretation in the form of change statistics summarized in Section 2.2.1, and we describe similar tools for "local" interpretation of ERGMs for counts here. Define the set of networks is the set of networks such that all dyads but the focus dyad (i, j) are fixed to their values in y while (i, j) itself may vary over its possible values; and define y (i,j)=k ≡ (y ′ ∈ Y i,j (y) : y ′ i,j = k) to be the network with non-focus dyads fixed and focus dyad set to k. Then, let the discrete change statistic This statistic emerges when taking the ratio of probabilities of two networks that are identical except for a single dyad value: This may be used to assess the effect of a particular ERGM term on the decay rate of the ratios of probabilities of successive values of dyads (Shmueli et al., 2005) and on the shape of the dyadwise conditional distribution: the conditional distribution of a dyad (i, j) ∈ Y, given all other dyads (i ′ , j ′ ) ∈ Y\{(i, j)}, for an arbitrary baseline k 0 .

Model specification statistics
We now propose some specific model statistics to represent common network structural properties and distributions of counts.

Poisson modeling
We begin with statistics that produce Poisson-distributed dyads and model network phenomena that can be represented in a dyad-independent manner. As a binary ERGM reduces to a logistic regression model under dyadic independence, a Poisson-reference ERGM may reduce to a Poisson regression model.
In a Poisson-reference ERGM, the normalizing constant has a simple closed form if g(y ′ ) is linear in y ′ i,j and does not depend on any other dyads y ′ i ′ ,j ′ , (i ′ , j ′ ) = (i, j): giving a Poisson log-linear model, and ∆ 0→1 i,j g effectively becomes the covariate vector for Y i,j . (If g(y ′ ) is linear in y ′ i,j but does depend on other dyadsx i,j in (4) depends on y ′ i ′ ,j ′ but not on y ′ i,j itself -the dyad distribution is conditionally Poisson but not marginally so. An example of this arises in Section 5.2.4.) Morris, Handcock and Hunter (2008) describe many dyad-independent sufficient statistics for binary ERGMs. All of them have the general form where x i,j,k ≡ ∆ i,j g k and x i,j,k may be viewed as exogenous (to the model) covariates in a logistic regression for each tie. They could then be used to model a variety of patterns for degree heterogeneity and mixing among actors over (assumed) exogenous attributes. For example, for a uniform homophily model, x i,j,k may be an indicator of whether i and j belong to the same group. If y i,j are counts, these statistics induce a Poisson regression type model (for a Poissonreference ERGM), where the effect of a unit increase in some θ k on dyad (i, j) is to increase its expectation by a factor of exp (x i,j,k ). Krivitsky et al. (2009) use this type of model to model Slovenian periodical "co-readerships" (Batagelj and Mrvar, 2006) -numbers of readers who report reading each pair of periodicals of interest -using as exogenous covariates the class of periodical (daily, weekly, regional, etc.) and the overall readership levels of each periodical.
Curved (i.e., η(θ) = θ, p > q, and η not a linear mapping) ERGMs, in which the g satisfy (4) and dyadic independence, may induce nonlinear Poisson regression. An example of this is the likelihood component of some latent space network models, with latent space positions being treated as free parameters: the likelihoods of the hierarchical models of Hoff (2005) and Krivitsky et al. (2009) are special cases of such an ERGM, with η(θ) = (η i,j (θ)) (i,j)∈Y and g(y) = (y i,j ) (i,j)∈Y (i.e., the sufficient statistic is the network), and η i,j (θ) mapping latent space positions and other parameters contained in θ to the logarithms of dyad means (i.e., the dyadwise canonical parameters).

Zero modification
We now turn to model terms that may reshape the distribution of the counts away from Poisson. Social networks tend to be sparse, and larger networks of similar nature tend to be more sparse (Krivitsky, Handcock and Morris, 2011). If the interactions among the actors are counted, it is often the case that if two actors interact at all, they interact multiple times. This leads to dyadwise distributions that are zero-inflated relative to Poisson.
These features of sparsity can be modeled using statistics developed for binary ERGMs, applied to a network produced by thresholding the counts (at 1, for zero-modification). For example, a Poisson-reference ERGM with p = 2 and This is a parametrization of a zero-modified Poisson distribution (Lambert, 1992), though not a commonly used one, with the probability of 0 being (1 + exp (θ 2 ) (exp (exp (θ 1 )) − 1)) −1 and nonzero values being distributed (conditionally on not being 0) Poisson(µ = exp (θ 1 )), both reducing to Poisson's when θ 2 = 0. Notably, the probability of 0 decreases as θ 1 increases, rather than being solely controlled by θ 2 .

Dispersion modeling
Consider the social network of face-to-face conversations among people living in a region. A typical individual will likely not interact at all with vast majority of others, have one-time or infrequent interaction with a large number of others (e.g., with clerks or tellers), and a lot of interaction with a relatively small number of others (e.g., family, coworkers). Some of this may be accounted for by information about social roles and preexisting relationships, but if such information is not available, this leads to a highly overdispersed distribution relative to Poisson, or even zero-inflated Poisson. Overdispersed counts are often modeled using the negative binomial distribution. (McCullagh and Nelder, 1989, p. 199) However, the negative binomial distribution with an unknown dispersion parameter is not an exponential family, making it difficult to fit using our inference techniques. We thus discuss two purely exponential-family approaches for dealing with non-Poisson-dispersed interaction counts in general and overdispersed counts in particular.

Conway-Maxwell-Poisson Distribution
Conway-Maxwell-Poisson (CMP) distribution (Shmueli et al., 2005) is an exponential family for counts, able to represent both under-and overdispersion: adding a sufficient statistic of the form g CMP (y) = to a Poisson-reference ERGM otherwise fulfilling conditions for Poisson regression described in Section 5.2.1 turns a Poisson regression model for dyads into a CMP regression model. Its coefficient, θ CMP , constrained by (3) to θ CMP ≤ 1, controls the degree of dispersion: θ CMP = 0 retains the Poisson distribution; θ CMP < 0 induces underdispersion relative to Poisson, approaching the Bernoulli distribution as θ CMP → −∞; and θ CMP > 0 induces overdispersion, attaining the geometric distribution at θ CMP = 1, its most overdispersed point.
Normally, the greatest hurdle associated with using CMP is that its normalizing constant does not, in general, have a known closed form. In our case, because intractable normalizing constants are already accommodated by the methods of Section 4, using CMP requires no additional effort.
At the same time, CMP is neither regular nor steep (per Appendix B), so the properties of its estimators are not guaranteed, particularly for highly overdispersed data. We have found experimentally that counts as dispersed as geometric distribution or more so often cause the fitting methods of Section 4 to fail.
Variance-like parameters Some control over the variance can be attained by adding a statistic of the form g · a (y) = (i,j)∈Y y a i,j , a = 1. Statistics with a > 1, such as g · 2 (y) = (i,j)∈Y y 2 i,j , suffer the same problem as a Strauss point process (Kelly and Ripley, 1976): for any θ, ǫ > 0, lim y→∞ exp(θy 1+ǫ )/y! = ∞, leading to (3) constraining θ ≤ 0, able to represent only underdispersion.
Thus, we propose to model dispersion by adding a statistic of the form To the extent that the counts are Poisson-like, the square root is a variancestabilizing transformation (McCullagh and Nelder, 1989, p. 196). Then, a model with p = 2 and dyadwise sufficient statistic may be viewed as a modeling the first and second moments of √ y i,j . That the highest-order term is still on the order of y i,j guarantees that Θ N = R p -a practical advantage over CMP.
As with CMP, the normalizing constant is intractable. To explore the shape of this distribution, we fixed θ 1 at each of a range of values and found θ 2 s such that the induced distribution had the expected value of 1. We then simulated from the fit. The estimated pmf for each configuration and the comparison with the geometric distribution with the same expectation is given in Figure 2. Smaller coefficients on (6) (θ 1 ) correspond to greater dispersion, with coefficients on dyad sum (θ 2 ) increasing to compensate, and vice versa, with θ 1 = 0 corresponding to a Poisson distribution. As the dispersion increases, the mean is preserved in part by increasing Pr(Y i,j = 0) and, for sufficiently high values of y i,j , the   (7). Because Pr(Y = 0) varies greatly for different θ 1 yet can be adjusted separately by an appropriate model term, we plot the probabilities conditional on Y > 0. geometric distribution still dominates. Thus, there is a trade-off between the convenience of a model without complex constraints on the parameter space and the ability to model greater dispersion. In practice, if the substantive reasons for overdispersion are due to unaccounted-for heterogeneity, the latter might not be a serious disadvantage, and excess zeros can be compensated for by a term from Section 5.2.2.

Mutuality
Many directed networks, such as friendship nominations, exhibit mutualitythat, other things being equal, if a tie (i, j) exists, a tie (j, i) is more likely to exist as well -and binary ERGMs can model this phenomenon using a sufficient statistic g ↔ (y) = (i,j)∈Y,i<j y i,j y j,i = (i,j)∈Y,i<j min(y i,j , y j,i ), counting the number of reciprocated ties. (Holland and Leinhardt, 1981) Other sufficient statistics that can model it include g ↔ (y) = (i,j)∈Y,i<j 1 yi,j =yj,i and g ↔ (y) = (i,j)∈Y,i<j 1 yi,j =yj,i , the counts of asymmetric and symmetric dyads, respectively.  In the presence of an edge count term, these three are simply different parametrizations of the same distribution family: Nevertheless, these three different statistics suggest two major ways to generalize the terms to count data: by evaluating a product or a minimum of the values, or by evaluating their similarity or difference. We discuss them in turn.
Product It is tempting to model mutuality for count data in the same manner as for binary data, with y i,j and y j,i being values rather than indicators. For example, a simple model with overall dyad mean and reciprocity terms, with p = 2 and would have a conditional Poisson distribution: a desirable property. However, because for any c > 0, lim y→∞ exp(cy 2 )/(y!) 2 = ∞, for θ 2 > 0, representing positive mutuality, (3) is not fulfilled. (Note that the expected value of Y i,j is exponential in the value of Y j,i and vice versa. Again, a Strauss point process exhibits a similar problem. (Kelly and Ripley, 1976)) Geometric mean As with dispersion, the problem can be alleviated by using the geometric mean of y i,j and y j,i instead of their product. As in Section 5.2.3, this choice may be justified as an analog of covariance on variance-stabilized counts. This changes the shape of the distribution in ways that are difficult to interpret: if then Pr θ;h,η,g (Y i,j = y i,j |Y ∈ Y i,j (y)) ∝ exp θ 1 y i,j + (θ 2 √ y j,i ) √ y i,j /y i,j !, and, with nonzero y j,i , the probabilities of greater values of Y i,j are inflated by more. The analogy to covariance further suggests centering the statistic: Minimum An alternative generalization is to take the minimum of the two values. For example, if (9) Thus, a possible interpretation for this term is that the conditional probability for a particular value of Y i,j , y i,j is deflated by exp (θ 2 ) for every unit by which y i,j is less than y j,i . In a sense, y j,i "pulls up" y i,j to its level and vice versa.
Negative difference Generalizing the concept of similarity between y i,j and y j,i leads to a statistic of difference between their values. We negate it so that a positive coefficient value leads to greater mutuality. Then, and so the conditional probability of a particular y i,j is deflated by exp (θ 2 ) for every unit difference from y j,i , in either direction. Thus, y j,i "pulls in" y i,j and vice versa. Of course, other differences (e.g., squared difference) are also possible. We use the discrete change statistic to visualize the differences among these variants in Figure 3, plotting the θ ↔ ∆ 0→yi,j i,j g ↔ (y) summand of log Pr θ;h,η,g (Y i,j = y i,j |Y ∈ Y i,j (y)) Pr θ;h,η,g (Y i,j = 0|Y ∈ Y i,j (y)) = θ · ∆ 0→yi,j i,j g(y) Fig 3. Effect of proposed mutuality statistics (g↔) with parameter θ↔ > 0 on the distribution of Y i,j , given that Y j,i = y j,i . Whereas the min(y i,j , y j,i ) statistic deflates the probabilities of those values of y i,j that are less than y j,i , thus inflating all of those of y i,j above or equal to it, thus "pulling Y i,j up", the − |y i,j − y j,i | statistic deflates the probabilities in both directions away from y j,i , thus inflating those that are the closest, "pulling Y i,j in". √ y i,j y j,i inflates greater values of y i,j in general, inflating by more for greater √ y j,i .
for each variant. Lastly, while the conditional distributions, and hence the parameter interpretations for the minimum and the negative difference statistic, are different, models induced by (9) and (10) are also reparametrizations of each other: min(y i,j , y j,i ) = 1 2 ((y i,j + y j,i ) − |y i,j − y j,i |).

Actor heterogeneity
It is often the case that different actors in a network have different overall propensities to have ties: they are heterogeneous in their gregariousness, popularity, and/or (undirected) sociality. Some of this heterogeneity may be accounted for by exogenous covariates. For the unaccounted-for heterogeneity, two major approaches have been used: conditional, in which actor-specific parameters are added to the model to absorb its effects, and marginal, in which statistics are added that represent the effects of heterogeneity on the overall network features. Examples of the conditional approach include the very first exponentialfamily model for networks, the p 1 , which used a fixed effect for every actor (Holland and Leinhardt, 1981); and the p 2 model and latent space models, which used random effects instead (van Duijn, Snijders and Zijlstra, 2004;Hoff, 2005;Krivitsky et al., 2009;Mariadassou, Robin and Vacher, 2010). The marginal approach includes the count of k-stars for k ≥ 2 (Frank and Strauss, 1986), which, for a fixed network density, become more prevalent as heterogeneity increases, at the cost of often inducing ERGM degeneracy; alternating k-stars and geometrically weighted degree statistics (Snijders et al., 2006;Hunter and Handcock, 2006), which attempt to remedy the degeneracy of k-stars; and statistics such as the square root degree activity/popularity, which sum each actor's degree taken to 3/2 power, which also increases with greater heterogeneity, but not as rapidly as 2-stars do (Snijders, van de Bunt and Steglich, 2010), avoiding degeneracy. In the conditional approach, using fixed effects lacks parsimony and using random effects creates a problem with a doubly-intractable normalizing constant, beyond the scope of this paper, so we develop a marginal approach here. Actor heterogeneity may be viewed marginally as positive within-actor correlation among the dyad values. Following the discussion in the previous sections, we propose a form of pooled within-actor covariance of variance-stabilized dyad values, scaled to the same magnitude as the dyad sum: for Y i→ being the set of actors to who whom i may have ties (≡ {j ′ : (i, j ′ ) ∈ Y}) and √ y defined in (8). This statistic would increase with greater out-tie heterogeneity, an analogous statistic can be specified for in-tie heterogeneity, and dropping the directionality produces an undirected version of this statistic.
We have considered other variants, including the uncentered version, in which each summand in (11) is simply √ y i,j y i,k . We found that in undirected networks in particular, such a model term can induce a degeneracy-like bimodal distribution of networks. (This is likely because in undirected networks, the positive dependence is not contained within each actor, so subtracting √ y serves as a counterweight to avert the "avalanche".)

Triad-closure bias
We now turn to the question of how to represent triad-closure bias -friend-ofa-friend effects -in count data. As with mutuality, merely multiplying values of the dyads in a triad leads to a model that cannot have positive triad closure bias. In addition, ERGM sufficient statistics that take counts over triads often exhibit degeneracy. (Schweinberger, 2011) For these reasons, we describe a family of statistics that sum over dyads instead. Wyatt, Choudhury and Blimes (2010) use a generalization of the curved geometrically-weighted edgewise shared partners (GWESP) statistic (Hunter and Handcock, 2006), though it is not clear whether it is suitable for data with an infinite sample space. We thus describe a more conservative family of statistics.
One term used to model triad closure in binary dynamic networks by Snijders, van de Bunt and Steglich (2010) is the transitive ties effect, the most conservative special case of the GWESP (Hunter and Handcock, 2006) statistic. This statistic counts the number of ties (i, j) such that there exists at least one path of length 2 (two-path) between them -a third actor k such that y i,k = y k,j = 1. (Unlike the triangle count, each tie may contribute at most +1 to the statistic, no matter how many such ks exist.) One generalization of this statistic to counts is g trans. ties (y) = (i,j)∈Y min y i,j , max k∈N (min(y i,k , y k,j )) .
Intuitively, define the strength of a two-path from i to j to be the minimum of the values along the path. The statistic is then the sum over the dyads (i, j) of the minimum of the value of (i, j) and the value of the strongest two-path between them. The interpretation is thus somewhat analogous to that of the minimum mutuality statistic, with y j,i replaced by max k∈N (min(y i,k , y k,j )). The motivation for using minimum, as opposed to negative absolute difference, to combine the two-path value with the focus dyad value is that the intuitive notion of friend-of-a-friend effect that this statistic embodies suggests that while the presence of a mutual friend may increase the probability or expected value of a particular friendship (i.e., "pull it up"), it should not limit it (i.e., "pull it in") as an absolute difference would. These interpretations are somewhat oversimplified: it is just as true that a positive coefficient on this statistic results in y i,j "pulling up" the potential two-paths between i and j.
The statistic (12) is a fairly conservative one, less likely to induce excessive dependence and bimodality, at the cost of sensitivity. More generally, one may specify a triadic statistic using three functions: first, v 2-path : N 2 0 → R, how the "value" of a two-path i → j → k is computed from its constituent segments; second, v combine : R n−2 → R, how the values of the possible two-paths from i to j are combined with each other to compute the strength of the pressure on i and j to close the triad or increase their interaction; and third, v affect : N 0 × R → R how this pressure affects Y i,j . Given these, Thus, for example, one could set v combine to sum its arguments rather than take their maximum, or one can replace taking the minimum with taking a geometric mean. We illustrate the difference it makes in Section 6.

Application to interactions within a fraternity
In a series of studies in the 1970s, Bernard, Killworth andSailer (1979-1980) assessed accuracy of retrospective sociometric surveys in a number of settings, including a college fraternity whose 58 occupants had all lived there for at least three months. To record the true amounts of interaction, for several days, unobtrusive observers were sent to periodically walk through the fraternity to note students engaged in conversation. Obtaining these network data from Batagelj and Mrvar (2006), we model these observed pairwise interaction counts.
The raw distribution of counts, given in Figure 4(a), appears to be strongly overdispersed relative to Poisson, and, indeed, relative to the geometric distribution: the mean of counts is 1.9, while their standard deviation (not variance) is 3.4. At least some of this is due to actor heterogeneity: the square root of the within-actor variance of the counts is 3.1. Excluding extreme observations (all values over 20) does not make a qualitative difference. (The statistics become 1.8, 2.8, and 2.5, respectively.) Nor does there appear to be a natural place to threshold the counts to produce a binary network. (See Figure 4(b).) We thus model the baseline shape of the distribution of counts using the following terms: baseline propensity to have ties: number of dyads with nonzero value; baseline intensity of interactions: sum of dyad values; and underdispersion: the statistic (6). (We have also attempted to use CMP (via (5)) but found the process to be unstable due to the greater-than-geometric level of dispersion.) Little was recorded about the social roles of the fraternity members, so we consider the effects of endogenous social forces: actor heterogeneity: the undirected version of (11); transitivity of intensities: the statistic (12). Faust (2007), in particular, found that in many empirical networks, much of the apparent triadic effects are accounted for by variations in degree distribution and other lower-order effects. Thus, we consider four models: baseline shape only    (Hunter and Handcock, 2006).
(B), baseline with heterogeneity (BH), baseline with transitivity (BT), and all terms (BHT), to explore this concept in a valued setting. We report the model fits in Table 1. MCMC diagnostics, described by , show adequate mixing and unimodal distributions of sufficient statistics, and networks simulated from these fits have, on average, statistics equal to the observed sufficient statistics. The baseline dyadwise distribution terms are difficult to interpret, but the highly negative coefficient on underdispersion suggests a a strong degree of overdispersion, as expected. Some of this overdispersion appears to be absorbed by modeling actor heterogeneity, however. There are indications of a high degree of heterogeneity in individuals' propensity to interact, over and above that expected for even the overdispersed baseline distribution. (Monte Carlo P -val. < 0.001 based on 10,000 draws.) Without accounting for actor heterogeneity (i.e., Model BT), there appears to be a strong transitivity effect -a friend of a friend is a friend -and the Monte Carlo test confirms this with a similar P -value. However, if actor heterogeneity is accounted for, the transitivity effects vanish (simulated onesided P -val. = 0.43), suggesting that the underlying social process is better explained by a relatively small number of highly social individuals whose ties to each other and to (less social) third parties create excess transitive ties for the overall amount of interaction observed. At the same time, if, instead of using (12) as the test statistic, we use a less conservative statistic of the form (13) with v 2-path (x 1 , x 2 ) = √ x 1 x 2 (geometric mean), v combine (x 1 , . . . , x n−2 ) = n−2 k=1 x k , and v affect (x 1 , x 2 ) = √ x 1 x 2 , the effect's significance seems to increase (one-sided P -val. = 0.07). However, when we attempted to fit the model with this effect, the process exhibited the degeneracy-like bimodality. This suggests that there is a trade-off between stability and power to detect subtle effects.

Discussion
We have generalized the exponential-family random graph models to networks whose relationships are unbounded counts, explored the issues that arise when generalizing, and proposed ways to model several common network features for count data. We demonstrated our development by a study of the interaction of individual heterogeneity and friend-of-a-friend effects in a network with a hard-to-model dyadwise count distribution. This paper focused on modeling counts. More generally, one can define a valued ERGM by replacing the set of possible dyad values N 0 by a more general set S and replacing h(y) with a more general σ-finite measure space (Y, Y, P h ) with reference measure P h , then postulating a probability measure P θ;P h ,η,g with Radon-Nikodym derivative of P θ;P h ,η,g with respect to P h , dP θ;P h ,η,g dP h (y) = exp (η(θ) · g(y)) κ P h ,η,g (θ) , (Barndorff-Nielsen, 1978, pp. 115-116;Brown, 1986, pp. 1-2) with the normalizing constant κ P h ,η,g (θ) = Y exp (η(θ) · g(y)) dP h (y).
For binary and count data, and discrete data in general, P h could be specified as a function relative to the counting measure, while for continuous data, it could be defined with respect to the Lebesgue measure. Still, as with count data, the shape of this function would need to be specified. Other scenarios might call for more complex specifications of the reference measure. Some network data, such as measurements of duration of conversation (Wyatt, Choudhury and Blimes, 2010) and international trade volumes (Westveld and Hoff, 2011) are continuous measurements except for having a positive probability of two actors not conversing at all or two countries having no measured trade. Westveld and Hoff use a normal distribution to model the log-transformed trade volume, imputing 0 = log(1) for 0 observed trade volumes (all nonzero trade volumes being greater than 1 unit), and they note this issue and address it by pointing out that in their (latent-variable) model, an impact of such an outlier would be contained. Valued ERGMs may provide a more principled approach by specifying a semicontinuous P h , such as one that puts a mass of 1/2 on 0 and 1/2 on Lebesgue measure on (0, ∞).
We have also focused on data that do not impose any constraints on the sample space: Y ≡ S Y . But, some types of network data, such as those where each actor (ego) ranks the others (alters) (Newcomb, 1961, for example) can be viewed in this framework as having a constrained sample space: setting S = {1..n − 1} and constraining Y to ensure that each ego assigns a unique rank to each alter gives a sample space of permutations that could, with a counting measure, serve as the reference measure for an ERGM on rank data. These, and other applications are a subject for ongoing and future work.
This paper focuses on models for cross-sectional networks, where a single snapshot of relationship states or relationships aggregated over a time period are observed. For longitudinal data, comprising multiple snapshots of networks over the same actors over time, binary ERGMs have been used as a basis for discrete-time models for network tie evolution by Robins and Pattison (2001), Wyatt, Choudhury and Bilmes (2009;, Hanneke, Fu and Xing (2010), Krivitsky and Handcock (2010), and others. Valued ERGMs can be directly applied to the temporal ERGMs of Hanneke, Fu and Xing (2010) although their adaptation to the work of Krivitsky and Handcock (2010) may be less straightforward, especially if the benefits to interpretability of the separable models are to be retained.
In practice, networks are not always observed completely. Handcock and Gile (2010) develop an approach to ERGM inference for partially observed or sampled binary networks. It would be natural to extend this approach to valued networks and valued ERGMs.
Some methods for assessing a network model's fit, particularly MCMC diagnostics  can be used with little or no modification. Others, like the goodness-of-fit methods of Hunter, Goodreau and Handcock (2008) may require development of characteristics meaningful for valued networks. It may also be possible to extend the stability criteria of Schweinberger (2011) to models with infinite sample spaces.
value of a dyad and, occasionally (with a specified probability π 0 ), proposing a jump directly to 0. Because, as we discuss in Section 5.2.2, counts of interactions are often zero-inflated relative to Poisson, setting π 0 > 0 can be used to speed-up mixing. For highly overdispersed distributions, a Poisson kernel may be trivially replaced by a geometric or even negative-binomial kernel.
This algorithm selects the dyad on which the jump is to be proposed at random. A possible improvement to this algorithm would be to adapt to it the tie-no-tie (TNT) proposal , which optimizes sampling in sparse (zero-inflated) networks by focusing on dyads which have nonzero values.
Theorem B.1. The Conway-Maxwell-Poisson family is not regular.
Theorem B.2. The Conway-Maxwell-Poisson family is not steep.
Because the non-steep boundary corresponds to the most dispersed distribution that CMP can represent, maximum likelihood estimator properties for data which are highly overdispersed are not guaranteed.