Likelihood-based Inference for Exponential-Family Random Graph Models via Linear Programming

This article discusses the problem of determining whether a given point, or set of points, lies within the convex hull of another set of points in d dimensions. This problem arises naturally in a statistical context when using a particular approximation to the loglikelihood function for an exponential family model; in particular, we discuss the application to network models here. While the convex hull question may be solved via a simple linear program, this approach is not well known in the statistical literature. Furthermore, this article details several substantial improvements to the convex hull-testing algorithm currently implemented in the widely used ergm package for network modeling.


Monte Carlo maximum likelihood estimation for exponential families
Suppose that we observe a complex data structure Y from a sample space Y, and we believe that the process that produced Y may be captured sufficiently by the d-dimensional vector statistic g : Y → R^d. A natural probability model for such an object, one that minimises the additional assumptions made in the sense of maximising entropy, is the exponential family class of models (Jaynes, 1957). If Y is modelled via a discrete or a continuous distribution, the probability mass function or density, parametrized by the canonical parameter θ ∈ R^d (Barndorff-Nielsen, 1978, among others), has the form

  p_{θ;Y,h,g}(Y) = h(Y) exp{θᵀg(Y)} / κ_{Y,h,g}(θ).

The form is thus a product of a function h(·) of the data alone, specifying the distribution under θ = 0, the reciprocal of a function κ(·) of the parameter alone, defined below, and an exponential term capturing the interaction between the data and the parameter. The exact form of κ(θ) depends on the type of distribution:

  κ_{Y,h,g}(θ) = Σ_{y∈Y} h(y) exp{θᵀg(y)}   if Y is discrete,    (1a)
  κ_{Y,h,g}(θ) = ∫_Y h(y) exp{θᵀg(y)} dy    if Y is continuous.    (1b)

(A full measure-theoretic formulation is also possible for distributions that are neither discrete nor continuous; and a θ ∈ R^q for q ≤ d may be mapped through a vector function η : R^q → R^d.) Likelihood-based inference for exponential-family models centers on the log-likelihood function ℓ(θ) = θᵀg(Y_obs) − log κ_{Y,h,g}(θ).
Likelihood calculations can be highly computationally challenging when the sum or integral (1) is intractable (e.g., Geyer and Thompson, 1992). For the sake of brevity, we will omit "Y, h, g" from the subscripts of p_{θ;Y,h,g}(Y) and κ_{Y,h,g}(θ) for the remainder of this paper, unless they differ from these defaults.
Most commonly, exponential families are used as a tool for deriving inferential properties of families that belong to this class, albeit with a different parametrization. For a common example, the canonical parameter of the Normal(µ, σ 2 ) distribution in its exponential family form is θ = [µσ −2 , σ −2 ], which is far less convenient and interpretable.
However, there are domains in which an exponential family model is specified directly through its sufficient statistics, often with the help of the Hammersley-Clifford Theorem (Besag, 1974, among others). These include the Strauss spatial point processes (Strauss, 1975), the Conway-Maxwell-Poisson model for count data (Shmueli et al., 2005), and exponential-family random graph models (ERGMs) (Holland and Leinhardt, 1981;Frank and Strauss, 1986, for original derivations) for networks and other relational data. While our development is motivated by and focuses on ERGMs, it applies to other scenarios involving exponential-family models with intractable normalizing constants because the methods we discuss operate on the sufficient statistic g(Y ) rather than on the original data structure.
Consider a graph Y on n vertices, with the vertices being labelled 1, . . . , n. We will focus on binary undirected graphs with no self-loops and no further constraints, a discrete distribution. Thus, we can express the sample space as Y = 2^{ {i,j} : i, j ∈ {1,...,n}, i ≠ j }, the power set of the set of all distinct unordered pairs of vertex indices.
An exponential family on such a sample space is called an exponential-family random graph model (ERGM). Substantively, elements of g(Y) then represent features of the graph, e.g., the number of edges or other structures, or connections between exogenously defined groups, whose prevalence is hypothesised to influence the relative likelihoods of different graphs.
For example, an edge count statistic would, through its corresponding parameter, control the relative probabilities of sparser and denser graphs, and in turn the expected density of the graph, conditional on other statistics. Some graph statistics, the most familiar being the number of triangles, or "friend of a friend is a friend" configurations, induce stochastic dependence among the edges. The triangle statistic in particular is problematic for reasons reviewed by a number of authors (e.g., Section 3.1 of Schweinberger et al., 2020) and has been largely superseded by statistics proposed by Snijders et al. (2006) and others.
For our special case, κ(θ) has the form (1a), a summation over all possible graphs. For the population of binary, undirected, vertex-labelled graphs with no self-loops, the cardinality of Y is 2^(n choose 2), a number exponential in the square of the number of vertices. Thus, even for small networks, this sum is intractable. For example, for n = 10, summation over |Y| = 2^45 ≈ 3.5 × 10^13 elements is required, far too many to compute by "brute force." Under some choices of g(·), the summation (1a) simplifies, but for most interesting models, those involving complex dependence among relationships in the network represented by the graph, maximization of ℓ(θ) can be an enormous computational challenge.
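A quick sanity check on these magnitudes: the count of binary, undirected, vertex-labelled graphs on n vertices can be computed directly. The following small Python snippet is our own illustration, not part of any ergm workflow:

```python
from math import comb

def num_graphs(n):
    """Number of binary, undirected, vertex-labelled graphs
    on n vertices with no self-loops: 2^(n choose 2)."""
    return 2 ** comb(n, 2)

# For n = 10 there are 2^45 = 35,184,372,088,832 graphs,
# roughly 3.5e13, far too many to enumerate.
print(num_graphs(10))
```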
Various authors have proposed methods for approximately maximizing the log-likelihood function. For instance, Snijders (2002) introduced a Robbins-Monro algorithm (Robbins and Monro, 1951) based on the fact that when θ̂ denotes the maximum likelihood estimator (MLE), E_{θ̂}[g(Y)] = g(Y_obs). Alternatively, the MCMC MLE idea of Geyer and Thompson (1992) was adapted to the ERGM framework by, among others, Hummel et al. (2012). This idea is based on the fact that the log-likelihood-ratio satisfies

  λ(θ, θ₀) = ℓ(θ) − ℓ(θ₀) = (θ − θ₀)ᵀg(Y_obs) − log E_{θ₀}[exp{(θ − θ₀)ᵀg(Y)}],    (2)

suggesting that if we sample r networks Y₁, . . . , Y_r from the approximate distribution p_{θ₀}(·) via MCMC, we can employ the estimator

  λ̂(θ, θ₀) = (θ − θ₀)ᵀg(Y_obs) − log[(1/r) Σ_{i=1}^r exp{(θ − θ₀)ᵀg(Y_i)}]    (3)

as an approximation to λ(θ, θ₀). In some cases, the network may be partially unobserved, i.e., Y_obs is not a complete network in Y. Letting Y(Y_obs) denote the set of all networks in Y that coincide with Y_obs wherever Y_obs is observed, we may generalize (2) as in Handcock and Gile (2010) by writing

  λ(θ, θ₀) = log[κ_{Y(Y_obs)}(θ) / κ_{Y(Y_obs)}(θ₀)] − log[κ_Y(θ) / κ_Y(θ₀)].    (4)

Equation (4) generalizes (2) because Y(Y_obs) consists of the singleton {Y_obs} when the network is fully observed. This likelihood may be approximated using the approach of Gelfand and Carlin (1993) and Geyer (1994): Draw samples Y₁, . . . , Y_r and Z₁, . . . , Z_s from Y and Y(Y_obs), respectively, via MCMC, such that the stationary distributions are p_{θ₀;Y}(·) and p_{θ₀;Y(Y_obs)}(·), respectively. The generalization of (3) is then

  λ̂(θ, θ₀) = log[(1/s) Σ_{j=1}^s exp{(θ − θ₀)ᵀg(Z_j)}] − log[(1/r) Σ_{i=1}^r exp{(θ − θ₀)ᵀg(Y_i)}].    (5)

It seems that (5) allows us to find an approximate maximum likelihood estimator by simply maximizing over θ. Yet there are problems with this approach, and in this paper we focus on one problem in particular: It is not always the case that (5) has a maximizer, nor even an upper bound.
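The estimator just described reduces to averages of exponentiated inner products of sampled statistics. A minimal sketch of the fully observed case in Python (our own illustration; the names and the list-based representation are not from ergm):

```python
import math

def loglik_ratio_approx(theta, theta0, g_obs, g_samples):
    """MCMC estimate of the log-likelihood ratio:
    (theta - theta0)' g(Y_obs) - log( (1/r) sum_i exp((theta - theta0)' g(Y_i)) ).
    theta, theta0, g_obs are length-d lists; g_samples is a list of length-d lists."""
    d = [t - t0 for t, t0 in zip(theta, theta0)]
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    r = len(g_samples)
    log_mean = math.log(sum(math.exp(dot(d, g)) for g in g_samples) / r)
    return dot(d, g_obs) - log_mean

# At theta = theta0 the estimate is exactly zero, as it should be.
print(loglik_ratio_approx([1.0, 2.0], [1.0, 2.0], [3.0, 4.0], [[1.0, 1.0], [2.0, 2.0]]))
```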
To characterize precisely when the approximation of (5) has a maximizer, we first define a term that will be important throughout this article: The convex hull of any set of points in d-dimensional Euclidean space is the smallest convex set containing that set, i.e., the intersection of all convex sets containing that set. According to exponential family theory (e.g., Theorem 9.13 of Barndorff-Nielsen, 1978), λ̂(θ, θ₀) in (3) has a maximizer if and only if g(Y_obs) is contained in the interior of the convex hull of {g(Y₁), . . . , g(Y_r)}. Indeed, it is straightforward to show that λ̂(θ, θ₀) in the more general expression (5) does not have a maximum if any of {g(Z₁), . . . , g(Z_s)} lies outside the convex hull of {g(Y₁), . . . , g(Y_r)}. We prove this in Section 2.
The question of determining when a point lies within the convex hull of another set of points is thus relevant when using approximations (3) and (5). The remainder of this article shows how to determine whether a given point or set of points lies inside a given convex hull using linear programming, explains how the ergm package for R exploits this fact, and then develops improvements to the algorithm, extending them to the case where a network is only partially observed.

Convex hull testing as a linear program
We introduce the terms target set and test set to refer to the set T = {g(Y₁), . . . , g(Y_r)} and the set S = {g(Z₁), . . . , g(Z_s)}, respectively. To reiterate, the convex hull of any set of d-dimensional points is the smallest convex set containing that set, and the convex hull is always closed in R^d. The interior of this convex hull will play a special role here, so we let C(T) denote the interior of the convex hull of T. If the points in T satisfy a linear constraint, C(T) is empty, since in this case the convex hull lies in an affine subspace of dimension smaller than d. We assume here that C(T) is nonempty, which in turn means that the convex hull is the closure of C(T), which we denote by C̄(T).
As mentioned near the end of Section 1, the approximation in (5) fails to admit a maximizer whenever S ⊄ C(T), which may be proved quite simply: Proof. Suppose there exists an element of S, say g(Z₁), in the open set R^d \ C̄(T). By the convexity of C(T), this means there exists a hyperplane H, that is, an affine (d − 1)-dimensional subspace, containing g(Z₁) and such that H ∩ C(T) = ∅. In other words, there exist a scalar z₀ and a d-vector z such that z₀ + zᵀg(Z₁) = 0 and z₀ + zᵀg(Y_i) < 0 for i = 1, . . . , r. Writing θ = θ₀ + tz with t > 0, we have zᵀg(Z₁) = −z₀ while zᵀg(Y_i) ≤ −z₀ − δ for some δ > 0, since the g(Y_i) are finitely many; hence λ̂(θ₀ + tz, θ₀) ≥ tδ − log s, which grows without bound as t → ∞, so (5) has no maximizer.
However, S ⊂ C(T) is merely a necessary, not sufficient, condition for (5) to have a unique maximizer. Another necessary condition is that ∂²λ̂(θ, θ₀)/∂θ∂θᵀ be negative-definite for all θ and θ₀, but it is not sufficient either, since a direction of recession (Geyer, 2009, for example) may exist. We are thus not aware of a necessary and sufficient condition when S contains more than one point. As we see below, for our purposes, the conditions that S ⊂ C(T) and that Var(T) − Var(S) is positive-definite suffice. We now demonstrate that linear programming can provide a method of testing definitively whether S ⊂ C̄(T). As a practical matter, we can apply this method to a slightly perturbed version of S in which each test point is expanded away from the centroid of T by a small amount; if the perturbed S is contained in C̄(T), then S ⊂ C(T). This approach has the added benefit of ensuring not only that S ⊂ C(T) but that each point in S is bounded away from the boundary of C(T) by a small amount, which can prevent some computational challenges in using the approximation in (5). Yet we will also show, in Section 3, that we may transform the linear program to test directly whether S ⊂ C(T).
In the particular case in which Y_obs is a fully observed network, i.e., there are no missing data, S is the singleton {g(Y_obs)}. This is the case considered by Hummel et al. (2012), who propose an approximation method that checks whether S ⊂ C(T) and then replaces g(Y_obs) by some other point contained in C(T) whenever S ⊄ C(T). In particular, the method defines a "pseudo-observation", ξ̂, as the convex combination γg(Y_obs) + (1 − γ)t̄, where t̄ is the centroid of T and γ ∈ (0, 1]. A previous implementation of the algorithm in the ergm package (Hunter et al., 2008) chooses the largest γ value such that ξ̂ ∈ C(T) via a grid search on (0, 1], then maximizes the resulting approximation to yield a new parameter vector θ̂, which then replaces θ₀ in defining the MCMC stationary distribution, and the process repeats until g(Y_obs) ∈ C(T) for two consecutive iterations. The current ergm-package implementation of the algorithm uses a bisection search method that incorporates a prior belief about the value of γ, using the same condition that ξ̂ ∈ C(T).
One downside of these algorithms is that they are all "trial-and-error" algorithms requiring multiple checks of the condition ξ̂ ∈ C(T). In this article, we show how to eliminate this "trial-and-error" approach.
We first frame the check of whether p ∈ C(T), for an arbitrary column vector p ∈ R^d, as a linear program. Let M be the (r × d)-dimensional matrix whose rows M₁, . . . , M_r are the points in the target set T; furthermore, let C(M) denote the interior of the convex hull of those rows, exactly as for C(T). Whenever p ∉ C̄(M), we may find an affine (d − 1)-dimensional subspace, which we call a separating hyperplane, that separates the points in T from the point p in the sense that p lies in one (open) half-space defined by the hyperplane and the points in T, i.e., the rows of M, all lie in the other (closed) half-space. Mathematically, this separating hyperplane is determined by the affine subspace {x ∈ R^d : z₀ + xᵀz = 0} for some scalar z₀ and d-vector z. Thus, z ∈ R^d and z₀ ∈ R determine a separating hyperplane whenever z₀ + pᵀz < 0 and z₀ + M_iᵀz ≥ 0 for i = 1, . . . , r. In other words, a hyperplane separating p from C(M) exists whenever the minimum value of z₀ + pᵀz is strictly negative, where the minimum is taken over all z₀ ∈ R and z = (z₁, . . . , z_d)ᵀ ∈ R^d such that M_i z ≥ −z₀ for i = 1, . . . , r. Notationally, we will use inequality symbols to compare vectors componentwise, so, for example, instead of "M_i z ≥ −z₀ for i = 1, . . . , r", we may write "Mz ≥ −z₀1". The existence of a separating hyperplane can thus be determined using the following linear program:

  minimize: z₀ + pᵀz
  subject to: Mz ≥ −z₀1
              −1 ≤ z ≤ 1.    (6)

If the minimum value of the objective function z₀ + pᵀz is exactly 0 (it can never be strictly positive because (z₀, z) = 0_{d+1} determines a feasible point), then p is in C̄(M), the closure of C(M). If the minimum value is strictly negative, then a separating hyperplane not containing p exists and thus p ∉ C̄(M). The bounds −1 ≤ z ≤ 1, which we call box constraints, ensure that the linear program has a finite minimum when p ∉ C̄(M): If z is unconstrained and there exists a feasible (z₀, z) with z₀ + pᵀz < 0, then (cz₀) + pᵀ(cz) < z₀ + pᵀz for any c > 1, and no finite minimum exists.
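The geometric content of LP (6), namely that a separating hyperplane exists exactly when p lies outside the closed convex hull, can be illustrated in two dimensions without an LP solver: if a separating line exists, one exists through a pair of target points. The following brute-force check is our own O(r³), two-dimensional illustration, not the general linear program:

```python
def separating_line_exists(T, p, tol=1e-12):
    """Return True if some line separates point p from the 2-D point set T,
    i.e., p lies outside the closed convex hull of T.
    Candidate lines pass through pairs of points of T."""
    # Signed area test: which side of the line through a and b does q lie on?
    side = lambda a, b, q: (b[0] - a[0]) * (q[1] - a[1]) - (b[1] - a[1]) * (q[0] - a[0])
    for a in T:
        for b in T:
            if a == b:
                continue
            if side(a, b, p) >= -tol:
                continue  # p is not strictly on the negative side of this line
            if all(side(a, b, t) >= -tol for t in T):
                return True  # all of T lies on the closed nonnegative side
    return False

# The triangle of Section 5.1 with a = 2, b = 1:
triangle = [(-1.0, 0.0), (2.0, 1.0), (1.0, -1.0)]
print(separating_line_exists(triangle, (1.0, 0.0)))  # inside the hull: False
print(separating_line_exists(triangle, (3.0, 2.0)))  # outside the hull: True
```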
The box constraints are implemented in the current ergm package version of the convex hull check; however, in Section 3 we show how to eliminate them entirely, leading to a more streamlined linear program. The current implementation of the algorithm in the ergm package uses the interface package lpSolveAPI to solve the linear program (6), which we refer to as "LP (6)" henceforth.
In case the test set S contains more than one point, we may test each point individually to determine whether each point in S is contained in C(M ).

Reformulating the single-test-point algorithm
The trial-and-error algorithm of Hummel et al. (2012), which applies only to the case where the test set S consists of the single point p, never exploits the fact that the separating hyperplane, if it exists, is the minimizer of a function. We propose to improve this algorithm by reformulating the linear program and then using the minimizers found. First, we establish a few basic facts that help us simplify the problem.
Since translating p and all points in T by the same constant d-vector does not change whether p ∈ C(M ), we assume throughout this section that all points are translated so that the origin is a point in C(M ) that we refer to as the "centroid" of M . In practice, we often take the centroid of M to be the sample mean of the points in the target set. Yet none of the results of this section rely on any particular definition of "centroid"; here, we assume only that the centroid 0 ∈ R d lies in the interior of the convex hull of the target set.
Since p may be assumed not to coincide with the centroid, for otherwise the question of whether p ∈ C(M) is answered immediately, at least one of the coordinates of p is nonzero. We assume without loss of generality that p₁ ≠ 0; otherwise, we may simply permute the coordinates of all the points without changing whether p ∈ C(M). Below, we define a simple invertible linear transformation that maps p to the unit vector e₁ = (1, 0, . . . , 0)ᵀ. Applying this transformation simplifies the original linear program, which clarifies our exposition and eases the proofs; once we have proved our results, we apply the inverse transformation to restore the original coordinates, achieving a simplification of the linear program in the process.
One important fact about any invertible linear transformation, when applied to p and M , is that it does not change whether or not p ∈ C(M ), as we now prove.
Proof. Any point in the convex hull of a set may be written as a convex combination of points in that set. Suppose p ∈ C(M). Then there exists a nonnegative a ∈ R^r such that pᵀ = aᵀM and Σ_{i=1}^r a_i = 1. Since (Rp)ᵀ = aᵀMRᵀ, we know that Rp ∈ C(MRᵀ). Conversely, if Rp ∈ C(MRᵀ), the same reasoning applied to the full-rank transformation R⁻¹ shows that p ∈ C(M).
Consider the full-rank linear transformation defined by

  R = I_d − p₁⁻¹(p − e₁)e₁ᵀ,    (7)

which maps p to the standard basis vector e₁ ∈ R^d and is invertible precisely because p₁ ≠ 0. After applying this linear transformation, we want to know whether a hyperplane exists that separates the point e₁ from the rows of MRᵀ. Such a hyperplane, if it exists, may be written as {x ∈ R^d : z₀ + xᵀz = 0} for some z₀ ∈ R and z ∈ R^d. Since the origin is an interior point of C(MRᵀ), no separating hyperplane may pass through the origin. Therefore, we may limit our search to those cases where z₀ ≠ 0, which means that we may divide the equation defining the hyperplane by z₀. Stated differently, we may take z₀ = 1 and rewrite our hyperplane as {x ∈ R^d : 1 + xᵀz = 0} without loss of generality.
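One concrete choice of such a transformation, an assumption on our part (any full-rank map sending p to e₁ and fixing the origin serves equally well), is R = I − p₁⁻¹(p − e₁)e₁ᵀ, which can be checked numerically:

```python
def transform_R(p):
    """Matrix R = I - (p - e1) e1^T / p1, a full-rank map sending p to e1
    and fixing the origin. (One concrete choice; requires p[0] != 0.)"""
    d = len(p)
    R = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for i in range(d):
        e1_i = 1.0 if i == 0 else 0.0
        R[i][0] -= (p[i] - e1_i) / p[0]  # only the first column is modified
    return R

def matvec(R, x):
    return [sum(R[i][j] * x[j] for j in range(len(x))) for i in range(len(R))]

p = [3.0, 2.0]
R = transform_R(p)
print(matvec(R, p))           # approximately e1 = [1.0, 0.0]
print(matvec(R, [0.0, 0.0]))  # the origin (our centroid) is fixed
```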
Summarizing the arguments above, we now seek a d-vector z satisfying 1 + e₁ᵀz < 0 and 1 + (MRᵀ)_i z ≥ 0 for i = 1, . . . , r. Furthermore, fixing z₀ = 1 means that no box constraints are necessary in the reformulation of the linear program designed to search for a separating hyperplane. Therefore, our new linear program, after transformation by R, becomes

  minimize: e₁ᵀz = z₁
  subject to: MRᵀz ≥ −1.    (8)

The zero vector is a feasible point in both LP (6) and LP (8). However, unlike LP (6), zero cannot be the optimum of LP (8); that is, the solution of LP (8) always defines a hyperplane, whether or not it is a separating hyperplane. We now prove this fact and show how to determine whether e₁ ∈ C(MRᵀ): Proposition 2. Let z* denote a minimizer of LP (8). Then z*₁ < 0, and e₁ ∈ C̄(MRᵀ) if and only if −1 ≤ z*₁. Furthermore, the ray with endpoint at the origin and passing through e₁ intersects the boundary of C(MRᵀ) at −(1/z*₁)e₁.
Proof. Since (MRᵀ)0 = 0 > −1 componentwise, there exists some ε > 0 such that z = −εe₁ satisfies the constraints while reducing the objective function z₁ below zero, so z*₁ must be strictly negative. Now let H = {x : 1 + xᵀz* = 0}. Since z* is a feasible point, C(MRᵀ) lies entirely in the closed positive half-space defined by H. Thus, if z*₁ < −1, then e₁ lies in the open negative half-space defined by H and so H is a separating hyperplane. Conversely, if a separating hyperplane {x : w₀ + xᵀw = 0} exists, then w₀ must be strictly positive since 0 must be separated from e₁, and thus, dividing by w₀, we must have (MRᵀ)(w/w₀) ≥ −1 and w₁/w₀ < −1.
Since z* is a minimizer, this implies z*₁ ≤ w₁/w₀ < −1. The point a = −(1/z*₁)e₁ lies on the positive x₁-axis, that is, on the ray with endpoint at the origin that passes through e₁. If a lies in the open set C(MRᵀ), then there exists ε > 0 such that (1 + ε)a also lies in C(MRᵀ); but this is impossible, since (1 + ε)aᵀz* = −(1 + ε) < −1, so (1 + ε)a violates the constraint 1 + xᵀz* ≥ 0 that must be satisfied by every point in C(MRᵀ) due to convexity. On the other hand, a must lie in C̄(MRᵀ), since otherwise there exists ε > 0 such that (1 − ε)a ∉ C̄(MRᵀ), which in turn means that there exists a hyperplane separating C(MRᵀ) from (1 − ε)a; such a hyperplane would yield a feasible point of LP (8) whose first coordinate is smaller than z*₁, contradicting the minimality of z*₁. We conclude that a = −(1/z*₁)e₁ must lie on the boundary of C(MRᵀ).
Applying the full-rank transformation R⁻¹, in order to transform back to the original coordinates, establishes the following corollary: Corollary 1. Let z* denote a minimizer of LP (8). Then p ∈ C(M) if and only if z*₁ > −1, and the ray with endpoint at the origin and passing through p intersects the boundary of C(M) at −(1/z*₁)p. This result gives a way to determine when p lies in C(M), not merely C̄(M), which is an improvement over LP (6). It also suggests that we may reconsider LP (8) after transforming by R⁻¹ back to the original coordinates. Indeed, since z in LP (8) may take any value in R^d and R has full rank, there is no loss of generality in replacing z by (Rᵀ)⁻¹z, which allows us to rewrite LP (8) as

  minimize: pᵀz
  subject to: Mz ≥ −1.    (9)

Taking z* and z** as minimizers of LP (8) and LP (9), respectively, we must have z*₁ = pᵀz**, since the minimum objective function value must be the same for the two equivalent linear programs. We conclude by Corollary 1 that p is in C(M) if and only if pᵀz** > −1 and, furthermore, that −p/(pᵀz**) is the point at which the ray from the origin through p intersects the boundary of C(M).

Duality
This section is a digression in the sense that it does not lead to improvements in the performance of the algorithms we discuss. Yet it shows how the convex hull problem leads to a particularly simple development of the well-known linear programming idea called duality. In Sections 2 and 3, we derived linear programs to determine whether a test point p is in C(M ) based on the concept of a supporting hyperplane. Here, we take a different approach, resulting in a new linear program that is known as the dual of the original.
By definition of convexity, any point in C̄(M) is a convex combination of the rows of M. That is, p ∈ C̄(M) if and only if there exists a nonnegative vector y ∈ R^r with 1ᵀy ≤ 1 and p = Mᵀy. We only require 1ᵀy ≤ 1, not 1ᵀy = 1, because the point 0 is in C(M), so a convex combination of points of M may place positive weight on 0. Rewriting 1ᵀy ≤ 1 as −1ᵀy ≥ −1, we may therefore formulate a linear program to test p ∈ C̄(M) as follows:

  maximize: −1ᵀy
  subject to: Mᵀy = p
              y ≥ 0.    (10)

If the maximum objective function value is found to be strictly less than −1, there is no convex combination of the rows of M that yields p, and we conclude that p ∉ C̄(M). According to standard linear programming theory (Vanderbei, 1997, Section 5.8), LP (9) is precisely what is known as the dual of LP (10), and vice versa. We may develop this theory as follows: If z ∈ R^d satisfies the r constraints in LP (9), we may multiply each inequality by a nonnegative real number and add all of the inequalities to obtain a valid inequality. Therefore, if y ∈ R^r is a nonnegative vector, we conclude that yᵀMz ≥ −yᵀ1. If the constraints in LP (10) are also satisfied, we may replace yᵀM by pᵀ, then minimize with respect to z and maximize with respect to y to obtain

  min_z pᵀz ≥ max_y −1ᵀy.    (11)
Inequality (11) is sometimes called the weak duality theorem. The strong duality theorem, which we do not prove here, says that equality holds in (11) (Vanderbei, 1997, Section 5.4). In our convex hull problem, strong duality means, among other things, that p is on the boundary of C(M) if and only if the maximum objective function value in LP (10) equals −1, since we proved this fact for the dual linear program in Section 3.
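Duality is easy to check numerically on the three-point example of Section 5.1 (our own illustration, with hand-derived values rather than an LP solver): for the triangle with rows (−1, 0), (a, 1), (b, −1) and p = e₁, the primal optimum of LP (9) is −2/(a + b), and the dual-feasible choice y = (0, 1/(a + b), 1/(a + b)) attains the same objective value, so strong duality holds here as well.

```python
def dual_objective(y):
    """Objective -1'y of LP (10)."""
    return -sum(y)

def primal_optimum(a, b):
    """Closed-form optimum of LP (9) for the triangle example: -2/(a+b)."""
    return -2.0 / (a + b)

def dual_feasible_point(a, b):
    """y >= 0 with M'y = p = e1: equal weight 1/(a+b) on rows (a,1) and (b,-1)."""
    return [0.0, 1.0 / (a + b), 1.0 / (a + b)]

a, b = 2.0, 1.0
y = dual_feasible_point(a, b)
# Check M'y = p componentwise for M with rows (-1,0), (a,1), (b,-1):
coord1 = -1.0 * y[0] + a * y[1] + b * y[2]      # should equal 1
coord2 = 0.0 * y[0] + 1.0 * y[1] - 1.0 * y[2]   # should equal 0
print(coord1, coord2, dual_objective(y), primal_optimum(a, b))
```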

Applications and Benchmarks
To illustrate the results of Section 3, we first consider problems in 2 dimensions, since they are easy to visualize. Such low-dimensional examples are also helpful because they sometimes lend themselves to closed-form solutions of linear programs. The benchmarks and examples below use the following R (R Core Team, 2021) package versions: ergm 4.1.6804, Rglpk 0.6.4, and lpSolveAPI 5.5.2.0.17.7.

A Three-point Example
Let us consider a particularly simple example in which the target set consists of the 2-vectors (−1, 0)ᵀ, (a, 1)ᵀ, and (b, −1)ᵀ for a > 0 and b > 0. This choice guarantees that the convex hull is a triangle containing the origin, so we shall consider the origin to be the centroid in this example, regardless of the actual value of the mean of the three points.
When the test point p equals (1, 0)ᵀ, we are in a situation that could arise after transforming by R. This leads to LP (8), which may be solved in closed form because in this case the constraints become

  −z₁ ≥ −1,  az₁ + z₂ ≥ −1,  and  bz₁ − z₂ ≥ −1,

or equivalently z₁ ≤ 1, z₁ ≥ −1/a − z₂/a, and z₁ ≥ −1/b + z₂/b. Since the lines z₁ = −1/a − z₂/a and z₁ = −1/b + z₂/b, viewed as functions of z₂, have one negative and one positive slope, the minimum of z₁ is attained at their point of intersection, i.e., where −1/a − z₂/a = −1/b + z₂/b, which implies z*₂ = (a − b)/(a + b), which in turn implies z*₁ = −2/(a + b). By simple examination, we know that (1, 0)ᵀ is interior to the convex hull exactly when the line through (a, 1)ᵀ and (b, −1)ᵀ intersects the x₁-axis at a value larger than 1. Thus, p ∈ C(M) if and only if (a + b)/2 > 1, which is equivalent to z*₁ = −2/(a + b) > −1, thus verifying the result of Corollary 1 for this simple example.
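These closed-form expressions are easy to verify numerically; the following Python snippet is our own illustration, not the paper's benchmarking code:

```python
def triangle_lp8_solution(a, b):
    """Closed-form minimizer of LP (8) for the target set
    {(-1,0), (a,1), (b,-1)} and test point e1, per the derivation above."""
    z1 = -2.0 / (a + b)
    z2 = (a - b) / (a + b)
    return z1, z2

def interior_and_boundary(a, b):
    """Corollary 1: e1 is interior iff z1* > -1; the ray through e1
    meets the hull boundary at x1 = -1/z1* = (a+b)/2."""
    z1, _ = triangle_lp8_solution(a, b)
    return z1 > -1.0, -1.0 / z1

inside, x1 = interior_and_boundary(3.0, 2.0)
print(inside, x1)  # True, and x1 is roughly 2.5 = (a+b)/2
```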
To make the example more concrete, let us take a = 2 and b = 1. Then we see in Figure 1a that p = e₁ (labeled with 0) is in the convex hull and, furthermore, that the x₁ coordinate of the point of intersection is (2 + 1)/2 = 3/2; the separating line's location does not depend on where p lies along the ray from the origin. Now, instead, let p equal (3, 2)ᵀ. This is the situation depicted in Figure 1b. The original LP (6), which includes box constraints, finds one separating line (in two dimensions, a hyperplane is a line), but the intersection of this line and the segment connecting 0 to p, labeled with the digit 1, is not optimal. A second application of LP (6) finds the optimal point labeled 2. In this case, point 2 lies on the line {x : 1 + x₁ − 3x₂ = 0}, and it is instructive that the box constraints alone prevent LP (6) from finding this solution when p = (3, 2)ᵀ but not when p = (3/2, 1)ᵀ, where the latter point is the intermediate point labeled 1 in Figure 1b.
Lastly, we illustrate the linear transformation of the previous problem by R. Figure 1c shows the problem transformed using (7), with the dotted lines and arrows showing the original coordinate system, then LP (8) is applied to find the intersection point in one iteration.

Benchmarking Rglpk against lpSolveAPI
Here, we define two simple functions that each use LP (9) to search for the point on the ray from 0 through p that intersects the boundary of C(M). Each of these functions exploits an existing R package that wraps open-source code solving linear programs: respectively, Rglpk (Theussl and Hornik, 2019) wraps the GNU Linear Programming Kit (GLPK) (Makhorin et al., 2020), and lpSolveAPI (lp_solve et al., 2020) wraps the lp_solve library (Berkelaar et al., 2021). Both implementations produce the same result, γ ≈ 0.4801, but Rglpk takes 10 seconds whereas lpSolveAPI takes 19. However, Rglpk requires GLPK to be installed separately on some platforms; thus, our suggestion for package developers is to check for availability of GLPK on the system and use Rglpk if it is installed.

Testing multiple points
As explained at the beginning of Section 2, the approximated difference of log-likelihoods in (5) requires that every g(Z i ) ∈ S be contained in C(T ) in order for the approximation to have a maximizer. We might therefore consider a strategy of checking, prior to using (5), whether g(Z i ) ∈ C(T ) for all i = 1, . . . , s using the single-test-point methods discussed earlier. This strategy has two potential drawbacks: First, the computational burden might be quite high if any of the dimension d, the number of test points s, or the number of target points r is large. Second, we must decide what to do in the case where one or more of the points in S is found to be outside C(T ). This section addresses each of these questions and then presents an illustrative example using a network dataset with missing edges.

Reducing computational burden
There are evidently many possible ways to approach the question of how to efficiently decide which test points, if any, lie outside C(T). Even if we only consider ideas for reducing the size of the set T in such a way as to have little or no influence on the answer, we might attempt to search for and eliminate target-set points that lie strictly inside C(T), since eliminating such points from T does not change C(T) at all, or that lie close to other points in T, since eliminating such points does not change C(T) very much. Here, we merely suggest a simplistic version of the first approach, as a thorough exploration of methods for reducing the size of T without altering C(T) is well beyond the scope of this article.
Whether a point in T is interior to C(T ) or whether it lies on the boundary of C(T ) has to do with that point's so-called data depth, a concept often attributed (in the two-dimensional case) to Tukey (1975). Because the deepest points are the ones that can be eliminated without changing C(T ), the sizable literature on data depth (see, e.g., the discussion by Zuo and Serfling, 2000) may be relevant to our problem. For the d = 2-dimensional case, several authors have developed efficient methods for identifying exactly the points lying on the convex hull boundary; this is a particular case of the so-called convex layers problem (Chazelle, 1985).
Among the most simplistic ideas is to use the Mahalanobis distance from the centroid, defined for any point x ∈ T as

  D(x) = {(x − x̄)ᵀΣ̂⁻¹(x − x̄)}^{1/2},

where x̄ is the centroid and Σ̂ is the sample covariance matrix, as a measure of data depth. For the |T| = 100,000-point example of Section 5.2, we also sample |S| = 5 corners of the unit cube in R^20 and try eliminating the fraction f = i/10, for i = 1, . . . , 10, of the deepest points as measured by Mahalanobis distance; moderate fractions do not change the solution appreciably. Since computing effort scales linearly with the number of points in T, this seems like a useful tool. We nonetheless recommend caution, as our experiments indicate that the fraction of points that may be discarded by their Mahalanobis distances varies considerably for different choices of d. For example, we find for d = 50 that each of the 100,000 points sampled uniformly at random in the unit cube lies on the boundary of the convex hull.
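A sketch of this screening in two dimensions, with the 2 × 2 sample covariance inverted by hand, is given below; this is our own illustration (the experiments described in the text use much larger dimension and the R implementation):

```python
def mahalanobis_screen(T, keep_fraction):
    """Rank the 2-D points of T by Mahalanobis distance from their mean
    and drop the deepest (smallest-distance) points, keeping the rest."""
    n = len(T)
    mx = sum(x for x, _ in T) / n
    my = sum(y for _, y in T) / n
    # Sample covariance entries (denominator n - 1):
    sxx = sum((x - mx) ** 2 for x, _ in T) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in T) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in T) / (n - 1)
    det = sxx * syy - sxy * sxy
    # Inverse of the 2x2 covariance, assumed nonsingular:
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det
    def d2(pt):
        dx, dy = pt[0] - mx, pt[1] - my
        return ixx * dx * dx + 2 * ixy * dx * dy + iyy * dy * dy
    ranked = sorted(T, key=d2, reverse=True)  # shallowest (largest distance) first
    keep = max(1, int(round(keep_fraction * n)))
    return ranked[:keep]

# The deep central point is dropped; the four corners survive screening.
pts = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (5.0, 5.0)]
print(mahalanobis_screen(pts, 0.8))
```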

Rescaling observed statistics
Recalling the case of a completely observed network, where S consists of the single point g(Y_obs), the strategy used by Hummel et al. (2012) is to find a point, say ξ̂, between 0 and g(Y_obs) that lies inside C(T) yet close to the boundary. By treating ξ̂ as though it were g(Y_obs) in constructing approximation (3), the approximation has a well-defined maximizer, and this maximizer is then used to generate a new target set in the next iteration of the algorithm.
In analogous fashion, when S consists of multiple points and some of them lie outside C(T), we propose to shrink each of these points toward 0 using the same scaling factor, say γ, where γ ∈ (0, 1] is chosen so that every element of S lies within C(T) after the scaling is applied. Since LP (9) yields the optimal step length for each point p, we obtain this common scaling factor by iterating through the points in S and selecting the least of the resulting step lengths. This transformation also has the effect of shifting the sample mean of the points in S to a point somewhere between 0 and the sample mean of the untransformed points in S.
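A sketch of the common-γ shrinkage in Python (our own illustration; in practice, the per-point step lengths come from solving LP (9), and the names here are hypothetical):

```python
def shrink_test_points(S, step_lengths, safety=0.9):
    """Given, for each test point, the multiplier at which the ray from the
    origin through that point crosses the hull boundary, shrink every point
    by a single factor gamma so that all scaled points are strictly interior.
    gamma = 1 when every point is already safely inside."""
    gamma = min(1.0, safety * min(step_lengths))
    return gamma, [[gamma * x for x in p] for p in S]

S = [[2.0, 0.0], [0.0, 4.0]]
steps = [1.5, 0.5]  # the second point lies outside: boundary at half its length
gamma, scaled = shrink_test_points(S, steps)
print(gamma, scaled)  # gamma = 0.9 * 0.5 = 0.45; all scaled points interior
```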

Example
This section illustrates the idea of Section 6.2 using a synthetic school friendship network based on a school community in the rural western United States. The faux.mesa.high network in the ergm package includes 205 students in grades 7 through 12. We may create a copy of this network and then randomly select 10% of the edge observations (whether the edge is present or absent) to be categorized as missing. If we now define an ERGM using a few statistics related to those originally used to create the faux.mesa.high network (details are found via help(faux.mesa.high)), we begin with an estimator that can be derived using straightforward logistic regression. We denote this maximum pseudolikelihood estimator, or MPLE, details of which may be found in Section 5.2 of Hunter et al. (2008), by theta0 or θ₀:

fmhFormula <- fmh ~ edges + nodematch("Grade") +
  gwesp(log(3 / 2), fixed = TRUE)
theta0 <- coef(ergm(fmhFormula, estimate = "MPLE"))

To construct approximation (5), we need two samples of random networks, Y₁, . . . , Y_r and Z₁, . . . , Z_s, from Y and Y(Y_obs), respectively. For the latter, we employ the constraints capability of the ergm package.
With our samples in place, we may now construct approximation (5). Here, we negate the function so that our objective is to minimize it:

l <- function(ThetaMinusTheta0, gY, gZ, scale) {
  -log(mean(exp(scale * gZ %*% ThetaMinusTheta0))) +
    log(mean(exp(gY %*% ThetaMinusTheta0)))
}

Finally, we employ the optim function in R (R Core Team, 2021) to minimize the objective function. We use a scale value of 90% of the value that places one of the test points exactly on the boundary of the convex hull, so that all of the scaled test points are interior to the convex hull. We may repeat the whole process iteratively until the entire set of test points is well within the convex hull boundary:

multipliers <- scale
NewTheta <- theta0
while (scale < 1.11) { # We want 0.9 * scale > 1 when finished
  NewTheta <- NewTheta + optim(0 * NewTheta, l,
    gY = gY, gZ = gZ, scale = 0.9 * scale
  )$par
  theta0 <- rbind(theta0, NewTheta)
  gY <- simulate(fmhFormula, coef = NewTheta, nsim = 500,
    output = "stats", control = snctrl(MCMC.interval = 1e4)
  )
  gZ <- simulate(fmhFormula, coef = NewTheta, nsim = 100,
    output = "stats", constraints = ~observed,
    control = snctrl(MCMC.interval = 1e4)
  )
  centroid <- colMeans(gY)
  gY <- sweep(gY, 2, centroid) # Translate the g(Y) statistics
  gZ <- sweep(gZ, 2, centroid) # Translate the g(Z) statistics
  scale <- Inf
  for (j in 1:nrow(gZ)) {
    scale <- min(scale, -1 / LPmod2(gY, gZ[j, ]))
  }
  multipliers <- c(multipliers, scale)
}

Figure: Pairwise scatterplots of three-dimensional network statistics generated using the MPLE, with statistics on the full sample space of networks as black dots and those on the sample space constrained to coincide with the observed network as orange dots. The constrained points are rescaled as blue points so that none lies outside the convex hull of the black points. In each plot, there are 500 black, 100 orange, and 100 blue points.

Table 1 gives successive values of θ_0 determined as maximizers of (5) after the test set points g(Z_1), . . . , g(Z_s) are rescaled to lie within the convex hull of g(Y_1), . . . , g(Y_r). The MPLE in the first row of the table is obtained using logistic regression and is often used as an initial approximation to the maximum likelihood estimator in MCMC MLE (Hunter et al., 2008). However, the MPLE fails to take the missing network observations into account and, as demonstrated by Hummel et al. (2012), it need not generate sample network statistics near the observed statistics. Figure 4 shows that the final value of θ_0 produces samples such that the whole test set lies in the interior of the convex hull of the target set, which allows (5) to be maximized to produce an approximate MLE. As recommended by Hummel et al. (2012), we might use moderate-sized samples to obtain a viable θ_0 value using the idea here, then use much larger samples once θ_0 has been found in order to improve the accuracy of approximation (5).
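The final refinement step recommended above might look like the following sketch. The larger sample sizes (5000 and 1000) are arbitrary choices of ours, not values from the text, and scale = 1 is appropriate because the test points are now interior to the convex hull:

```r
# Redraw much larger samples at the final parameter value and maximize
# approximation (5) once more, reusing the objective function l above.
gY_big <- simulate(fmhFormula, coef = NewTheta, nsim = 5000,
                   output = "stats", control = snctrl(MCMC.interval = 1e4))
gZ_big <- simulate(fmhFormula, coef = NewTheta, nsim = 1000,
                   output = "stats", constraints = ~observed,
                   control = snctrl(MCMC.interval = 1e4))
centroid <- colMeans(gY_big)
MLE <- NewTheta + optim(0 * NewTheta, l,
                        gY = sweep(gY_big, 2, centroid),
                        gZ = sweep(gZ_big, 2, centroid),
                        scale = 1)$par
```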

Discussion
This article discusses the problem of determining whether a given point, or set of points, lies within the convex hull of another set of points in d dimensions.
While this problem, along with its solution via linear programming, is known, we are not aware of any work that discusses it in the context of a statistical problem such as the one discussed here, namely, the maximization of an approximation to the loglikelihood function for an intractable exponential-family model. Here, we provide multiple improvements on the simplistic implementation of the linear programming solution to the yes-or-no question for a single test point that exists in the ergm package (Hunter et al., 2008) as a means of implementing the idea of Hummel et al. (2012): First, we eliminate the need for the "box constraints" of ergm and show how the dual linear program may be derived from first principles. Second, we render the "trial-and-error" approach obsolete by showing how to find the exact point of intersection between the convex hull boundary and the ray originating at the origin and passing through the test point. Third, we test the lpSolveAPI package currently used by ergm against the Rglpk package, finding that the latter appears to be far more efficient at solving the particular linear programs we encounter. Fourth, we discuss the statistical case of missing network observations, in which the test set may consist of multiple points, establish an important necessary condition, and suggest a method for handling this case.
In addition, we point out several ways in which this work might be extended, particularly in the case of multiple test points. For one, the question of how to streamline computations is wide open, particularly since it is not necessary to find the exact maximum scaling factor that maps each test point into the convex hull. For the purpose of approximating a maximum likelihood estimator, we seek only an upper bound on the acceptable scaling factors; indeed, in practice we want to scale all test points so that they lie strictly inside the boundary. This suggests that it might be possible to eliminate from the target set any point that is sufficiently close to another target set point without changing the needed scaling factor too much.
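As a concrete version of this thinning idea, the following sketch (our own illustration; the function name and the greedy rule are assumptions) removes target-set points lying within Euclidean distance eps of a point already retained, shrinking the linear programs before any hull tests are run:

```r
# Greedily keep a subset of the rows of Z such that every kept row is
# more than eps away (Euclidean distance) from every previously kept row.
thin_points <- function(Z, eps) {
  keep <- 1
  for (i in seq_len(nrow(Z))[-1]) {
    # Squared distances from row i to all rows kept so far
    d2 <- colSums((t(Z[keep, , drop = FALSE]) - Z[i, ])^2)
    if (min(d2) > eps^2) keep <- c(keep, i)
  }
  Z[keep, , drop = FALSE]
}

# Usage sketch: gY_thin <- thin_points(gY, eps = 0.5)
```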
We might also consider how to optimize the size of the sample chosen for the test set in the first place. For instance, if the scaling factor needed is considerably smaller than one, there might be an advantage in first sampling just a handful of points, possibly just a single point, in order to move the initial value of θ_0 closer to the true MLE before a larger sample of test points is drawn.
One thing that is clear is that the extensions described here would be much more difficult, if not impossible, to consider without the improvements to the ergm package's convex-hull testing procedure outlined in this article.