Consistency of Maximum Likelihood for Continuous-Space Network Models I

A very popular class of models for networks posits that each node is represented by a point in a continuous latent space, and that the probability of an edge between nodes is a decreasing function of the distance between them in this latent space. We study the embedding problem for these models, of recovering the latent positions from the observed graph. Assuming certain natural symmetry and smoothness properties, we establish the uniform convergence of the log-likelihood of latent positions as the number of nodes grows. A consequence is that the maximum likelihood embedding converges on the true positions in a certain information-theoretic sense. Extensions of these results, to recovering distributions in the latent space, and so distributions over arbitrarily large graphs, will be treated in the sequel.


Introduction
The statistical analysis of network data, like other sorts of statistical analysis, models the data we observe as the outcome of stochastic processes, and rests on inferring aspects of those processes from their results. It is essential that the methods of inference be consistent: as they receive more and more information, they should come closer and closer to the truth. In this paper, we address the consistency of non-parametric maximum likelihood estimation for a popular class of network models, those based on continuous latent spaces.
In these models, every node in the network corresponds to a point in a latent, continuous metric space, and the probability of an edge or tie between two nodes is a decreasing function of the distance between their points in the latent space. These models are popular because they are easily interpreted in very plausible ways, and often provide good fits to data. Moreover, they have extremely convenient mathematical and statistical properties: they lead to exchangeable, projectively-consistent distributions over graphs; the comparison of two networks reduces to comparing two clouds of points in the latent space, or even to comparing two densities therein; it is easy to simulate new networks from the estimated model for purposes of bootstrapping, etc. While the latent space has typically been taken to be a low-dimensional Euclidean space (Hoff, Raftery and Handcock, 2002), recent work has suggested that in many applications it would be better to take the space to be non-Euclidean, specifically negatively curved or hyperbolic (Krioukov et al., 2010; Asta and Shalizi, 2015).
We can estimate continuous latent space models in the sense of an embedding: given an observed graph, we wish to work backwards to the locations of the nodes in the latent space, i.e., to "embed" the graph in the latent space. The most straightforward method of embedding is a maximum likelihood estimator (MLE), treating the latent position of each node as a parameter (or vector of parameters). While it is straightforward to say that a good embedding should converge on the true coordinates as the number of nodes n → ∞, making this mathematically precise is somewhat tricky. (We would, for example, need to define a metric on the space of embeddings of graphs of different sizes.) Instead, we prove the next best thing: that for continuous latent space models of sufficient symmetry and tameness, the distribution over graphs implied by the MLE converges, in normalized Kullback-Leibler (KL) divergence, to the distribution implied by the true embedding. That is, the MLE becomes statistically indistinguishable, in its observable consequences, from the truth. This is a consequence of a result we establish along the way, about the uniform convergence of normalized log-likelihoods to their expectation values (Theorem 4); the rate of this uniform convergence upper-bounds the rate at which the MLE approaches the true embedding in KL divergence.
In the sequel, now in preparation, we combine our results about normalized log-likelihoods with the construction of a specific class of metrics on growing sequences of embeddings, to establish a more conventional, coordinate-wise notion of consistency, and the consistency of a subsequent estimator of the node density in the latent space.
Section §2 reviews background on continuous latent space models of networks. Section §3 states our main results, along with certain technical assumptions. All proofs, and a number of subsidiary results and lemmas, are deferred to Section §4. In the Appendix, we note that our results generalize to mis-specified models.

Background
In many, though not all, network data-analysis situations, we have only one network — perhaps not even all of that one network — from which we nonetheless want to draw inferences about the whole data-generating process. This clearly requires a law of large numbers or ergodic theorem to ensure that a single large sample is representative of the whole process. The network, however, is a single high-dimensional object whose every part is dependent on every other part. This is also true of time-series and spatial data, but there we can often use the fact that distant parts of the data should be nearly independent of each other. While networks often exhibit an analogous decay of dependence, networks in nature generally lack a natural, exogenous sense of distance (in the technical, geometric sense) that would explain such decay.
Continuous latent space (CLS) models are precisely generative models for networks which exhibit just such an exogenous sense of distance. Each node is represented as a location in a continuous metric space, the latent space. Conditional on the vector of all node locations, the probability of an edge between two nodes is a decreasing function of the distance between their locations, and all edges are independent. Generative models for networks in which the existence of different edges is conditionally independent given some latent quantity µ are common; however, CLS models, at least as understood in this paper, are distinguished by the particular geometric form that µ takes.
As mentioned above, the best-known CLS model for social networks is that of Hoff, Raftery and Handcock (2002), where the metric space is taken to be Euclidean, and node locations are assumed to be drawn iidly from a Gaussian distribution. In random geometric graphs (Penrose, 2003), the locations are drawn iidly from a distribution on a metric space, possibly more general than Euclidean space, and the edge probabilities are either 0 or 1, according to whether the distance falls below a threshold.
As also mentioned above, there is more recent work which indicates that for some applications it would be better to let the latent space be negatively curved, i.e., hyperbolic (Albert, DasGupta and Mobasheri, 2014; Kennedy, Narayan and Saniee, 2013; Krioukov et al., 2010). Mathematically, this is because many real networks can be naturally embedded into such spaces. More substantively, many real-world networks show highly skewed degree distributions, very short path lengths, a division into a core and peripheries where short paths between peripheral nodes "bend back" towards the core, and a hierarchical organization of clustering. If the latent space is chosen to be a certain hyperboloid, one naturally obtains graphs exhibiting all these properties (Krioukov et al., 2010).
The CLS models we have mentioned so far have presumed that node locations follow tractable, parametric families in the latent space. This is mathematically inessential — many of the results carry over perfectly well to arbitrary densities — and scientifically unmotivated. Because CLS models may need very different spaces depending on the application, we investigate the consistency of nonparametric estimation for them at a level of generality which abstracts away from many of the details of particular spaces and their metrics.
To the best of our knowledge, there are no results in the existing literature on the consistency of embedding for CLS models where edge probabilities vary continuously with distance.[1]

Geometric Network Inference
Our goal is to show that when the continuous latent space model is sufficiently smooth, and the geometry of the latent space is itself sufficiently symmetric, the maximum-likelihood embedding of a graph converges, in normalized Kullback-Leibler divergence, to the true locations of the nodes (Corollary 5). As an intermediate step, we show the uniform convergence of normalized log-likelihoods to their expectation values, at an explicit rate (Theorem 4), which also gives us the rate of KL convergence of the MLE to the truth. All proofs are postponed to §4.

Setting and Conventions
We consider only simple, undirected, unlabeled graphs; we will write a random graph as G, and will sometimes abuse notation to also write G for the adjacency matrix, so that G pq = G qp = 1 if there is an edge between nodes p and q, and = 0 otherwise.
All random graphs G in this paper have conditionally independent edges; that is, we assume that for each G there exists a random quantity µ such that G | µ has independently distributed edges. A continuous latent space model assumes that µ has a certain geometric nature, to be defined in the succeeding paragraphs.
All the metrics of metric spaces will be denoted by dist; context will make clear which metric dist is describing. Our model for generating random graphs begins with a metric measure space M, a metric space equipped with a Borel measure, and the corresponding group isom(M) of measure-preserving isometries M ≅ M. Every node is located at (equivalently, "represented by" or "labeled with") a point in M, x_i for the i-th node; the locations of the first n nodes are x_{1:n} ∈ M^n, and a countable sequence of locations will be x_{1:∞}. For each n, there is a non-increasing link function w_n : [0, ∞) → [0, 1], and nodes i and j are joined by an edge with probability w_n(dist(x_i, x_j)). By a latent space (M, w_{1:∞}), we will mean the combination of M and a sequence w_{1:∞} of link functions w_1, w_2, …. When the latent space is understood, we write graph_n(x_{1:n}) for the distribution of a random graph on n vertices located at x_{1:n}. Thus in the particular case G ∼ graph_n(x_{1:n}), we have µ = x_{1:n}.
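For concreteness, the generative model just described can be sketched in a few lines of Python. The choice of the Euclidean plane as M, the particular link function, and all parameter values here are illustrative stand-ins, not part of the model's definition:

```python
import math
import random

def dist(x, y):
    # Euclidean metric on R^2; stands in for the latent metric space M
    return math.hypot(x[0] - y[0], x[1] - y[1])

def link(d, R=1.0, T=0.5):
    # an illustrative non-increasing link function w : [0, inf) -> [0, 1]
    return 1.0 / (1.0 + math.exp((d - R) / T))

def sample_graph(xs, rng):
    # G ~ graph_n(x_{1:n}): conditional on the locations, edges are independent,
    # each present with probability w(dist(x_p, x_q))
    n = len(xs)
    G = [[0] * n for _ in range(n)]
    for p in range(n):
        for q in range(p + 1, n):
            if rng.random() < link(dist(xs[p], xs[q])):
                G[p][q] = G[q][p] = 1
    return G
```

The adjacency matrix returned is symmetric with a zero diagonal, matching the convention for G above.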
[1] Computationally-tractable and consistent embedding algorithms exist for some kinds of random geometric graphs, where edges are deterministically present between sufficiently close nodes and otherwise deterministically absent (Dani et al., 2021), but they rely crucially on deterministic links, and their statistical efficiency is unknown. Uniform consistency for variants of these sorts of CLS models, such as random dot product graphs (RDPG; Young and Scheinerman, 2007), has been well established (cf. Athreya et al., 2017); the inherently linear-algebraic methods used to develop estimators and consistency results in the RDPG setting do not seem portable to the metric setting.
It is clear that, for any φ ∈ isom(M) and every n,

graph_n(φ(x_1), …, φ(x_n)) = graph_n(x_{1:n}),

since isometries preserve all pairwise distances. Accordingly, we will use [x_{1:n}] to denote the equivalence class of n-tuples in M^n carried to x_{1:n} by isometries; the metric on M extends to these isometry classes in the natural way. We cannot hope to find x_{1:n} by observing the graph it leads to, but we can hope to identify [x_{1:n}].
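Since edge probabilities depend on the latent positions only through pairwise distances, applying any isometry φ leaves graph_n unchanged. A small numeric check of the distance-preservation underlying this, using a planar rotation as the isometry (all coordinates and the angle are illustrative):

```python
import math

def dist(x, y):
    # Euclidean metric on R^2
    return math.hypot(x[0] - y[0], x[1] - y[1])

def rotate(x, theta):
    # a rotation of the plane: an element of isom(R^2)
    c, s = math.cos(theta), math.sin(theta)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

xs = [(0.0, 0.0), (1.0, 0.5), (-2.0, 1.0)]
phi_xs = [rotate(x, 0.7) for x in xs]

# all pairwise distances, hence all edge probabilities, are unchanged
for p in range(3):
    for q in range(3):
        assert abs(dist(xs[p], xs[q]) - dist(phi_xs[p], phi_xs[q])) < 1e-12
```

This is exactly why only the isometry class [x_{1:n}], and not x_{1:n} itself, can be identified from the graph.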
Conventions. When n and m are integers with n < m, n : m will denote the set {n, n + 1, …, m − 1, m}. Unless otherwise specified, all limits are taken as n → ∞. All probabilities and expectations are taken with respect to the actual generating distribution of G.

Axioms on the generative model
We recall that a metric space M is k-homogeneous if every isometry between finite submetric spaces of M of size k extends to an isometry on M, an isometry M → M. Likewise, we call a metric space M ∞-homogeneous if every isometry between finite submetric spaces of M extends to an isometry on M. The literature usually takes "homogeneous" to mean 1-homogeneous, but sometimes to mean ∞-homogeneous. Motivating examples are Euclidean space R^d and the Poincaré halfplane H^2, described in §3.3. Almost any example of a metric space with a single "singularity" x_1, such as a "figure 8," is not 1-homogeneous: for a close enough point x_2, there are also points x_3, x_4 such that dist(x_1, x_2) = dist(x_3, x_4), but intuitively there cannot be any isometry carrying a singularity to a non-singularity. An example of a metric space that is 1-homogeneous but not ∞-homogeneous is the orientable surface of infinite genus.

Identifiability of graph distributions determined by certain CLS models is possible. We define such CLS models below.

Definition 1. A latent space (M, w_{1:∞}) is regular when:
1. M is a complete ∞-homogeneous Riemannian manifold;
2. The group of isometries on M has only finitely many connected components;
3. The function w_n is injective and smooth for each n; and
4. The sequence w_{1:∞} is logit-bounded: there is a sequence of non-negative reals v_n ∈ o(√n) such that |logit w_n(dist(x, y))| ≤ v_n for all x, y ∈ M.

Proposition 2. The metric spaces R^d and H^2 satisfy points (1) and (2) of Definition 1, with B_{R^d} = B_{H^2} = 2, where B_M denotes the number of connected components of the group of isometries on a metric space M.
Demanding that v_n ∈ o(√n) is done with an eye towards the needs of the proofs in §4. Some common examples of link functions (cf. Krioukov et al., 2010) include the following two kinds: threshold links, with w_n(d) = 1 when d ≤ R_n and w_n(d) = 0 otherwise; and Fermi (logistic) links, with w_n(d) = 1/(1 + exp((d − R_n)/T_n)). The first sort defines a graph where edges are deterministically present between sufficiently close nodes, and deterministically absent between more distant nodes. In the second sort, the T_n are fixed temperature parameters; the higher the temperature T_n, the closer the link function is to a constant probability 1/2. The determinism of the first kind violates logit-boundedness. The second kind satisfies logit-boundedness when sup_{x,y} |dist(x, y) − R_n| / T_n grows more slowly than √n. By extension, a CLS model is regular when (M, w_{1:∞}) is. The proof of the following result, a straightforward consequence of ∞-homogeneity and injectivity of the link functions, is omitted.
Theorem 3. For a regular latent space (M, w_{1:∞}) and any x_{1:n}, y_{1:n} ∈ M^n, graph_n(x_{1:n}) = graph_n(y_{1:n}) if and only if y_{1:n} ∈ [x_{1:n}].

Theorem 3 lets us identify graph distributions of the form graph_n(x_{1:n}) with isometry classes [x_{1:n}].
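The two kinds of link functions described above can be rendered in a few lines of Python; the particular values of R and T below are illustrative, and the logit identity for the Fermi link is the point of the sketch:

```python
import math

def threshold_link(d, R=1.0):
    # deterministic: edge iff d <= R; violates logit-boundedness
    return 1.0 if d <= R else 0.0

def fermi_link(d, R=1.0, T=0.5):
    # probabilistic; logit fermi_link(d) = (R - d) / T exactly
    return 1.0 / (1.0 + math.exp((d - R) / T))

def logit(w):
    # the logit transform, log(w / (1 - w))
    return math.log(w / (1.0 - w))
```

At high temperature the Fermi link flattens towards the constant 1/2, and its logit is linear in the distance, which is what makes the logit-boundedness condition easy to check for this family.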

An example in the literature
Latent spaces of the form (H^2, w_{1:∞}), where H^2 is the Poincaré halfplane, appear in the hyperbolic network models of Krioukov et al. (2010).

Estimation
Given a latent space model (M, w_{1:∞}) and an n-node graph G, the likelihood L(x_{1:n}; G) of the coordinates x_{1:n} ∈ M^n is given as the product of edge probabilities:

L(x_{1:n}; G) = ∏_{1 ≤ p < q ≤ n} w_n(dist(x_p, x_q))^{G_{pq}} (1 − w_n(dist(x_p, x_q)))^{1 − G_{pq}}.

Taking logs and dividing by the number of summands, we obtain the normalized log-likelihood ℓ(x_{1:n}; G) of the coordinates x_{1:n} ∈ M^n:

ℓ(x_{1:n}; G) = (1/(n(n−1))) ∑_{p ≠ q} [G_{pq} log w_n(dist(x_p, x_q)) + (1 − G_{pq}) log(1 − w_n(dist(x_p, x_q)))].

As usual, when there is no ambiguity about the graph G providing the data, we will suppress it as an argument, writing ℓ(x_{1:n}).
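The normalized log-likelihood can be computed directly from its definition; a minimal Python sketch, with a Euclidean stand-in for the latent metric (the distance function and link are illustrative, not part of the definition):

```python
import math

def dist(x, y):
    # Euclidean distance; stands in for the latent metric
    return math.hypot(x[0] - y[0], x[1] - y[1])

def normalized_loglik(xs, G, link):
    # ell(x_{1:n}; G): average log-probability over ordered pairs p != q
    n = len(xs)
    total = 0.0
    for p in range(n):
        for q in range(n):
            if p == q:
                continue
            w = link(dist(xs[p], xs[q]))
            total += G[p][q] * math.log(w) + (1 - G[p][q]) * math.log(1.0 - w)
    return total / (n * (n - 1))
```

A maximum-likelihood embedding would maximize this quantity over xs; here we only evaluate it.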
Taking expectations with respect to the actual distribution of a random graph G on n nodes, we define the expected normalized log-likelihood

ℓ̄(x_{1:n}) = E[ℓ(x_{1:n}; G)].

As we review in §4.2, well-known results from information theory show that −ℓ̄(x_{1:n}) can always be decomposed into the sum of two non-negative terms,

−ℓ̄(x_{1:n}) = H + D(x_{1:n}).

Here the first term, the "source entropy rate" H, captures the inherent stochasticity of the data source. The second term, the "divergence rate" D(x_{1:n}), measures the distance, or rather the normalized Kullback-Leibler divergence, between the true distribution and graph_n(x_{1:n}). Among other properties, D(x_{1:n}) controls the power of any hypothesis test to distinguish graph_n(x_{1:n}) from the true distribution. This divergence is minimized by x_{1:n} = x*_{1:n}; when the model is well-specified, D(x*_{1:n}) = 0. We are now in a position to state our main results.
Theorem 4. Suppose that the CLS model is regular. Then

sup_{x_{1:n} ∈ M^n} |ℓ(x_{1:n}) − ℓ̄(x_{1:n})| → 0 in probability.

Corollary 5. Suppose that the CLS model is regular, and that G ∼ graph_n(x*_{1:n}). Then D(X̂_{1:n}) → 0 in probability, where X̂_{1:n} is the maximum-likelihood embedding.

Proofs
This section furnishes the proofs of our main results. We can sketch the general approach as follows. We show that the expected log-likelihood achieves its maximum precisely at the true coordinates, up to isometry (Lemma 6). We then show that (in large graphs) the log-likelihood ℓ(x_{1:n}) is, with arbitrarily high probability, arbitrarily close to its expectation value for each x_{1:n} (Lemmas 9 and 10). We then extend this to uniform convergence in probability over all of M^n (Theorem 4). To do so, we need to bound the richness (pseudo-dimension (Anthony and Bartlett, 1999, §11), a continuous generalization of VC dimension) of the family of log-likelihood functions (Theorem 8), which involves the complexity of the latent space's geometry, specifically of its isometry group isom(M). Having done this, we have shown that the MLE also has close to the maximum expected log-likelihood. We emphasize this because the expected log-likelihood has a natural information-theoretic interpretation in terms of divergence from the truth (Eq. 4.3 below).

Notation
Before we dive into details, we first introduce some additional notation for our proofs.We will use G for both a (random or deterministic) graph and its adjacency matrix.
We fix the latent space as (M, w_{1:∞}). For brevity, define

λ_n(x, y) = log [w_n(dist(x, y)) / (1 − w_n(dist(x, y)))],

the logit transform of the link probability. As usual with binary observations, we can rewrite (3.6) so that the sum runs over all ordered pairs of distinct (p, q), with each summand equal to log(1 − w_n(dist(x_p, x_q))) + G_{pq} λ_n(x_p, x_q). This brings out that the only data-dependent (and hence random) part of ℓ is linear in the entries of the adjacency matrix and in the logit transform of the link-probability function. We write the class of log-likelihood functions as L_n.
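The algebraic rewriting just described is easy to check numerically; a short Python sketch comparing the two forms of a single summand (the probability values are arbitrary test points):

```python
import math

def lam(w):
    # the logit transform of a link probability w
    return math.log(w / (1.0 - w))

def summand_direct(g, w):
    # g * log(w) + (1 - g) * log(1 - w), the original form
    return g * math.log(w) + (1 - g) * math.log(1.0 - w)

def summand_rewritten(g, w):
    # log(1 - w) + g * logit(w): linear in the adjacency entry g
    return math.log(1.0 - w) + g * lam(w)

# the two forms agree for every edge value and probability
for g in (0, 1):
    for w in (0.1, 0.5, 0.9):
        assert abs(summand_direct(g, w) - summand_rewritten(g, w)) < 1e-12
```

This linearity in G_{pq} is what makes the bounded-difference argument of §4.4 go through.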

Information Theory
Recall the definition of the expected normalized log-likelihood from Eq. (4.2): taking expectations with respect to the actual distribution of a random graph G on n nodes, the expected normalized log-likelihood (the cross-entropy; Cover and Thomas 2006, ch. 2) is

ℓ̄(x_{1:n}) = E[ℓ(x_{1:n}; G)],

where the expectation is taken with respect to the random graph G (and not the random graph G conditioned on some random quantity µ making the edges independent). For notational convenience, set

π_{pq}(1) = w_n(dist(x_p, x_q)),   π_{pq}(0) = 1 − π_{pq}(1),

and similarly π*_{pq} in terms of the true coordinates x*_{1:n} (so that π*_{pq}(1) = w_n(dist(x*_p, x*_q))). Then

ℓ̄(x_{1:n}) = (1/(n(n−1))) ∑_{p ≠ q} ∑_{a ∈ {0,1}} π*_{pq}(a) log π_{pq}(a).
In information theory (Cover and Thomas, 2006, ch. 2), this quantity is known as the (normalized) cross-entropy, and we know that

−∑_{a} π*_{pq}(a) log π_{pq}(a) = H[π*_{pq}] + D(π*_{pq} ‖ π_{pq}),

as the left-hand side is the cross-entropy of the distribution π_{pq} with respect to the distribution π*_{pq}, and the right-hand side is the sum of the ordinary entropy H with the Kullback-Leibler divergence D. Since both entropy and KL divergence are additive over independent random variables (Cover and Thomas, 2006, ch. 2) like the G_{pq}, we have[3], defining H[π*] and D(π* ‖ π) in the obvious ways,

−ℓ̄(x_{1:n}) = H[π*] + D(π* ‖ π).   (4.3)

Unsurprisingly, ℓ̄ achieves its maximum at the (isometry class of) the true coordinates.[4]

Lemma 6. For ∞-homogeneous M and G ∼ graph_n(x*_{1:n}), ℓ̄(x_{1:n}) attains its supremum exactly when x_{1:n} ∈ [x*_{1:n}].

[3] The decomposition of expected log-likelihood into an entropy term, which involves only the true distribution of the data, plus a KL divergence goes back to at least Kullback (1968).

[4] The statement and proof of the following lemma presume that the model is well-specified. If the model is mis-specified, then inf_{x_{1:n}} D(π* ‖ π) is still well-defined, and still defines the value of the supremum of ℓ̄. The pseudo-true parameter value would be one which actually attained the infimum of the divergence (White, 1994). This, in turn, would be the projection of π* onto the manifold of distributions generated by the model (Amari et al., 1987). All later invocations of Lemma 6 could be replaced by the assumption merely that this pseudo-truth is well-defined.
Proof. Letting H and D respectively denote entropy and KL divergence as in (4.3), D(π* ‖ π) ≥ 0, with equality if and only if π* = π. Therefore the divergence-minimizing π must be the distribution over graphs generated by some x_{1:n} ∈ [x*_{1:n}], and conversely any parameter vector in that isometry class minimizes the divergence. The lemma follows from (4.3). □
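The per-edge cross-entropy decomposition used in (4.3) can be checked numerically for Bernoulli edge indicators; the probability values below are arbitrary test points:

```python
import math

def entropy(p):
    # H of a Bernoulli(p) edge indicator, in nats
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def kl(p, q):
    # D(Bernoulli(p) || Bernoulli(q))
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def cross_entropy(p, q):
    # -sum_a p(a) log q(a) for Bernoulli distributions
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

p_star, p = 0.3, 0.6
# cross-entropy = entropy + KL divergence, and KL is non-negative,
# vanishing exactly when the two distributions coincide
assert abs(cross_entropy(p_star, p) - (entropy(p_star) + kl(p_star, p))) < 1e-12
assert kl(p_star, p) >= 0.0
assert abs(kl(p_star, p_star)) < 1e-12
```

Summing such per-edge identities over the independent G_{pq} is exactly how (4.3) arises.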

Geometric Complexity of Continuous Spaces
For various adjacency matrices G_1, G_2, etc., let us abbreviate ℓ(x_{1:n}; G_i) as ℓ_i(x_{1:n}) (following Anthony and Bartlett 1999, p. 91). Let us pick r different adjacency matrices G_1, …, G_r, and set ψ(x_{1:n}) = (ℓ_1(x_{1:n}), …, ℓ_r(x_{1:n})). We will be concerned with the geometry of the level sets of ψ, i.e., the sets ψ^{-1}(c) for c ∈ R^r. We say that a function ψ : M^n → R^r has fibers with uniform bound B on the number of path-components if, for each c ∈ R^r, ψ^{-1}(c) has at most B path-components, i.e., equivalence classes of points under the relation that two points are equivalent when some path in ψ^{-1}(c) connects them.
Proposition 7. Suppose that all functions in L_n are jointly continuous in their d parameters almost everywhere, and that L_n has fibers with uniform bound B on the number of path-components. Then the growth function of L_n, i.e., the maximum number of ways that m ≥ d data points G_1, …, G_m could be dichotomized by thresholded functions from L_n, is at most

Π_{L_n}(m) ≤ B (2em/d)^d.   (4.4)

Thus the pseudo-dimension of L_n is at most 2 log₂ B + 2d log₂(2/ln 2).
Proof. The inequality (4.4) is a simplification of Theorem 7.6 of Anthony and Bartlett (1999, p. 91), which allows for sets defined by k-term Boolean combinations of thresholded functions from L_n. (That is, the quoted bound is the theorem's with k = 1.) Moreover, while Theorem 7.6 of Anthony and Bartlett (1999) assumes that all functions in L_n are C^d, the proof (op. cit., §7.4) only requires continuity in the simplified setting k = 1.
For any class of sets with VC dimension v < ∞, the growth function is polynomial in m, Π(m) ≤ (em/v)^v (Anthony and Bartlett, 1999, Theorem 3.7, p. 40); conversely, if Π(m) < 2^m for some m, then the class of sets has VC dimension at most m. Since Eq. 4.4 shows that Π(m) grows only polynomially in m, the VC dimension must be finite. Comparing the O((m/d)^d) rate of Eq. 4.4 to the O((m/v)^v) generic VC rate suggests v = O(d), but it is desirable, for later purposes, to have a more exact result.
To do so, we find the least m at which Eq. 4.4 falls strictly below 2^m. Taking logarithms, it suffices that

log₂ B + d log₂(2em/d) < m,

and this in turn is implied by

m ≥ 2 log₂ B + 2d log₂(2/ln 2),

so this quantity is an upper bound on the VC dimension of the subgraphs of L_n, and hence on the pseudo-dimension of L_n. □
Next we bound the complexity of log-likelihoods for certain latent spaces.
Theorem 8. For a regular latent space (M, w_{1:∞}), the pseudo-dimension of L_n is at most

2 log₂ B_M + 2 n dim M log₂(2/ln 2),   (4.9)

where B_M is the number of path-components of the isometry group isom(M).
Proof. Since (M, w_{1:∞}) is smooth, every function in L_n is C^∞ in all of its n dim M continuous parameters, so in applying Proposition 7 we may set d = n dim M. Define φ(−; G) to be the function M^n → R^{n²} sending a tuple x_{1:n} to the vector whose (pq)-th coordinate, for 1 ≤ p, q ≤ n, is log(1 − w_n(dist(x_p, x_q))) + G_{pq} λ_n(x_p, x_q), and define T : R^{n²} → R to be the linear functional averaging the coordinates of its argument. Note that each ℓ(−; G) ∈ L_n satisfies ℓ(−; G) = T ∘ φ(−; G). The preimage T^{-1}(c) of a point under T, a linear transformation, is either empty or a (connected and convex) affine subspace of R^{n²}. The function φ(−; G) has fibers with uniform bound B_M on the number of path-components: by injectivity of w_n, φ(x_{1:n}; G) = φ(y_{1:n}; G) exactly when dist(x_p, x_q) = dist(y_p, y_q) for all p, q, i.e. (by ∞-homogeneity), exactly when y_{1:n} ∈ [x_{1:n}]; each fiber is thus the continuous image of isom(M) under the map g ↦ g(x_{1:n}), so CC(φ^{-1}(c)) ≤ CC(isom(M)) = B_M, where CC(X) denotes the number of path-components of a space X. Thus each ℓ(−; G) ∈ L_n has fibers with uniform bound B_M on the number of path-components. The hypotheses of Proposition 7 being satisfied, (4.9) follows from Proposition 7. □

Pointwise Convergence of Log-Likelihoods
Lemma 9. Suppose that all of the edges in G are conditionally independent given some random variable µ. Then for any x_{1:n} and any ǫ > 0,

Pr(|ℓ(x_{1:n}; G) − E[ℓ(x_{1:n}; G) | µ]| ≥ ǫ) ≤ 2 exp(−ǫ² n²(n−1)² / (2 ∑_{p<q} λ_n(x_p, x_q)²)).

In particular, this holds when G ∼ graph_n(x*_{1:n}) or G ∼ graph_n(f).

Proof. Changing a single G_pq, but leaving the rest the same, changes ℓ(x_{1:n}; G) by (2/(n(n−1))) λ_n(x_p, x_q). The G_pq, for p < q, are all independent given µ. We may thus appeal to the bounded-difference (McDiarmid) inequality (Boucheron, Lugosi and Massart, 2013, Theorem 6.2, p. 171): if f is a function of independent random variables, and changing the k-th variable changes f by at most c_k, then

Pr(|f − E f| ≥ ǫ) ≤ 2 exp(−2ǫ² / ∑_k c_k²).

In the present case, c_pq = (2/(n(n−1))) |λ_n(x_p, x_q)|. Thus

∑_{p<q} c_pq² = (2/(n(n−1)))² ∑_{p<q} λ_n(x_p, x_q)²,

and so

Pr(|ℓ(x_{1:n}; G) − E[ℓ(x_{1:n}; G) | µ]| ≥ ǫ | µ) ≤ 2 exp(−ǫ² n²(n−1)² / (2 ∑_{p<q} λ_n(x_p, x_q)²)).

Since the unconditional deviation probability is just the expected value of the conditional probability, which has the same upper bound regardless of µ, the result follows (cf. Shalizi and Kontorovich 2013, Theorem 2). Finally, note that all edges in graph_n(x*_{1:n}) are unconditionally independent, while those in graph_n(f) are conditionally independent given X_{1:n}, which plays the role of µ. □
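The bounded-difference constant in the proof above can be verified numerically: flipping a single edge changes the normalized log-likelihood by exactly 2λ_n(x_p, x_q)/(n(n−1)). A Python sketch, with the Euclidean plane and a logistic link standing in as illustrative choices:

```python
import math

def dist(x, y):
    return math.hypot(x[0] - y[0], x[1] - y[1])

def link(d, R=1.0, T=0.5):
    # an illustrative smooth, strictly decreasing link function
    return 1.0 / (1.0 + math.exp((d - R) / T))

def lam(x, y):
    # the logit transform of the link probability
    w = link(dist(x, y))
    return math.log(w / (1.0 - w))

def ell(xs, G):
    # normalized log-likelihood, summed over ordered pairs p != q
    n = len(xs)
    s = 0.0
    for p in range(n):
        for q in range(n):
            if p != q:
                w = link(dist(xs[p], xs[q]))
                s += G[p][q] * math.log(w) + (1 - G[p][q]) * math.log(1.0 - w)
    return s / (n * (n - 1))

xs = [(0.0, 0.0), (0.5, 0.0), (0.0, 1.0), (2.0, 2.0)]
n = len(xs)
G0 = [[0] * n for _ in range(n)]
G1 = [row[:] for row in G0]
G1[0][1] = G1[1][0] = 1  # flip exactly one edge

change = ell(xs, G1) - ell(xs, G0)
# the change equals 2 * lambda_n(x_0, x_1) / (n(n-1)), as in the proof
assert abs(change - 2 * lam(xs[0], xs[1]) / (n * (n - 1))) < 1e-12
```

Since the two summands (0,1) and (1,0) each change by λ_n, the total shift is 2λ_n/(n(n−1)), which is the c_pq entering McDiarmid's inequality.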
This lemma appears to give exponential concentration at an O(n⁴) rate, but the denominator of the rate itself contains n(n−1)/2 = O(n²) terms, so the overall rate is only O(n²). For this to be useful, we need some control over the terms in the denominator.
Lemma 10. If −v_n ≤ λ_n(x_p, x_q) ≤ v_n for all p, q, then for any x_{1:n} and ǫ > 0,

Pr(|ℓ(x_{1:n}; G) − E[ℓ(x_{1:n}; G) | µ]| ≥ ǫ) ≤ 2 exp(−ǫ² n(n−1) / v_n²).

Proof. Under the hypothesis, ∑_{p<q} λ_n(x_p, x_q)² ≤ n(n−1)v_n²/2, and the result follows from Lemma 9. □

Uniform Convergence of Log-Likelihoods
Lemmas 9 and 10 show that, with high probability, ℓ(x_{1:n}) is close to its expectation value ℓ̄(x_{1:n}) for any given parameter vector x_{1:n}. However, we need to show that the MLE X̂_{1:n} has an expected log-likelihood close to the optimal value. We shall do this by showing that, uniformly over M^n, ℓ(x_{1:n}) is close to ℓ̄(x_{1:n}) with high probability. That is, we will show that

sup_{x_{1:n} ∈ M^n} |ℓ(x_{1:n}) − ℓ̄(x_{1:n})| → 0 in probability.   (4.15)

This is a stronger conclusion than even that of Lemma 10: since M is a continuous space, even if each parameter vector has a likelihood exponentially close to its expected value, there are uncountably many parameter vectors. Thus, for all we know so far, uncountably many of them might simultaneously show large deviations, and continue to do so no matter how much data we have. We will thus need to show that likelihoods at different parameter values cannot fluctuate independently, but rather are mutually constraining, and so eventually force uniform convergence.
If there were only finitely many allowed parameter vectors, we could combine Lemma 10 with a union bound to deduce (4.15). With an infinite space, we need to bound the covering number of L_n. To recall, the L¹ covering number N_1(ǫ, F, m) of a class F of functions, at scale ǫ and m points, is the cardinality of the smallest set of functions f_j ∈ F which guarantees that, for any choice of points a_1, …, a_m and any f ∈ F, there is some f_j with (1/m) ∑_{i=1}^m |f(a_i) − f_j(a_i)| ≤ ǫ (this definition can be straightforwardly shown to be equivalent to that of Anthony and Bartlett (1999)). Typically, as in Anthony and Bartlett (1999, Theorem 17.1, p. 241), a uniform concentration inequality takes the form of a covering number at a reduced scale multiplying an individual deviation probability (4.17) of the kind supplied by Lemma 10. In turn, Anthony and Bartlett (1999, Theorem 18.4, p. 251) show that the L¹ covering number N_1(ǫ, F, m) of a class F of functions with finite pseudo-dimension v at scale ǫ and m observations is bounded:

N_1(ǫ, F, m) ≤ e(v + 1)(2e/ǫ)^v.

In our setting, we have m = 1. (That is, we observe one high-dimensional sample; notice that the bound is independent of m, so this hardly matters.) It thus remains to bound the pseudo-dimension of L_n. This involves a rather technical geometric argument, ultimately turning on the group structure of the isometries of (M, dist). It may be summed up in the existence of a constant B_M, which is 2 for any Euclidean space, and (as it happens) also 2 for H². This matter was handled in §4.3.

Proof of Theorem 4
By assumption, there exists a sequence v_1, v_2, … of non-negative reals, with v_n ∈ o(√n), such that |λ_n(x_p, x_q)| ≤ v_n for all n and all p, q. Presume for the moment that we know the L¹ covering number of L_n is at most N_1(ǫ, L_n, 1). Then

Pr(sup_{x_{1:n} ∈ M^n} |ℓ(x_{1:n}) − ℓ̄(x_{1:n})| > ǫ) ≤ 4 N_1(ǫ/8, L_n, 2) exp(−c ǫ² n(n−1)/v_n²)   (4.19)

for some constant c > 0. The proof is entirely parallel to that of Theorem 17.1 in Anthony and Bartlett (1999, p. 241), except for using Lemma 10 in place of Hoeffding's inequality, and so is omitted. Now, by Proposition 2, B_M = 2, and therefore, by Theorem 8, the pseudo-dimension of L_n is at most 2 log₂ B_M + 2n dim M log₂(2/ln 2). The L¹ covering number of L_n is thus exponentially bounded in O(n log 1/ǫ); specifically (Anthony and Bartlett, 1999, Theorem 18.4, p. 251), log N_1(ǫ/8, L_n, 2) = O(n log(1/ǫ)). Meanwhile, the exponent in the deviation factor is of order ǫ² n(n−1)/v_n², which grows strictly faster than n, since v_n ∈ o(√n) by our regularity assumption. For fixed ǫ, then, the uniform deviation probability in (4.19) is exponentially small, and we have convergence in probability to zero. □
Remark 1: In applying the theorems from Anthony and Bartlett (1999), remember that we have only one sample (m = 1), which is however of growing (O(n²)) dimension, with a more slowly growing (O(n)) number of parameters.
Remark 2: From the proof of the theorem, we see that if v_n² grows slowly enough, the sum of the deviation probabilities tends to a finite limit. Convergence in probability could then be converted to almost-sure convergence by means of the Borel-Cantelli lemma, if the graphs at different n can all be placed on a common probability space. Doing so, however, raises some subtle issues we prefer not to address here (cf. Shalizi and Rinaldo, 2013).

Proof of Corollary 5
We adapt a very standard pattern of argument used to prove oracle inequalities in learning theory. This begins with Lemma 6, that ℓ̄(x*_{1:n}) ≥ ℓ̄(X̂_{1:n}). Writing X̂_{1:n} for the MLE,

D(X̂_{1:n}) = ℓ̄(x*_{1:n}) − ℓ̄(X̂_{1:n})   (4.21)
≤ ℓ̄(x*_{1:n}) − ℓ(x*_{1:n}) + ℓ(X̂_{1:n}) − ℓ̄(X̂_{1:n})   (4.22)
≤ |ℓ(x*_{1:n}) − ℓ̄(x*_{1:n})| + |ℓ(X̂_{1:n}) − ℓ̄(X̂_{1:n})|   (4.23)
≤ 2 sup_{x_{1:n}} |ℓ(x_{1:n}) − ℓ̄(x_{1:n})| → 0 in probability,

where in Eq. 4.22 we use the trivial fact that, since X̂_{1:n} maximizes the likelihood, ℓ(x*_{1:n}) ≤ ℓ(X̂_{1:n}), and the last line invokes Theorem 4. □

Conclusion
We have formulated and proven a notion of convergence for non-parametric maximum likelihood estimators of graphs generated from continuous latent space models, under some mild assumptions on the generative models. Traditional convergence results for statistical estimators rest on a kind of ergodicity, or long-term mixing, across multiple, independent samples. Here, the size of a single sampled network plays the role of the number of samples in traditional formulations of consistency; continuous latent space models turn out to provide the necessary ergodicity through conditional independence. Our main results hold even when the generative models are mis-specified, i.e., when we fix a latent space but the generating graph distributions are not defined in terms of that space, under some additional assumptions (Appendix A). A consequent notion of consistency, which we save for future work, requires some formalization of what we mean by convergence of estimates, i.e., sequences of coordinates of varying sizes. A proof of such a consistency result will likely require some adaptation of the standard technical tools for deducing convergence of extremal estimators from convergence of random objective functions (e.g., van der Vaart, 1998).

Appendix A: Mis-specified models

Our consistency results extend from well-specified to certain mis-specified models.
We still assume the existence of a latent space (M, w_{1:∞}) as before, but now suppose that sample graphs are drawn not from a distribution of the form graph_n(x_{1:n}) but from some arbitrary distribution over graphs on n nodes. The only assumption we make about such random graphs G in this section, as before, is that there exists some random variable µ such that the edges of G are conditionally independent given µ. (When G is drawn from a CLS model, µ can be taken to be the random latent coordinates of the nodes of G.) We call a sequence G_1, G_2, … of random graphs almost-specified if there exists x*_{1:∞} ∈ M^∞ such that, for all sufficiently large n, ℓ̄(x_{1:n}) achieves its maximum exactly when x_{1:n} ∈ [x*_{1:n}]. For such an almost-specified model, x*_{1:∞} plays the role of the true coordinates, and the assumption of being almost-specified plays the role of Lemma 6 (e.g., in all proofs); we call such an x*_{1:∞} the pseudo-coordinates of the almost-specified model. Consequently, we can restate our main results at the following level of generality.

Theorem 11. For an almost-specified model with pseudo-coordinates x*_{1:∞} and a compact, regular latent space (M, w_{1:∞}),

sup_{x_{1:n}} |ℓ(x_{1:n}) − ℓ̄(x_{1:n})| → 0 in probability.