Perfect Clustering for Stochastic Blockmodel Graphs via Adjacency Spectral Embedding

Vertex clustering in a stochastic blockmodel graph has wide applicability and has been the subject of extensive research. In this paper, we provide a short proof that the adjacency spectral embedding can be used to obtain perfect clustering for the stochastic blockmodel and the degree-corrected stochastic blockmodel. We also show an analogous result for the more general random dot product graph model.

1. Introduction. In many problems arising in the natural sciences, technology, business and politics, it is crucial to understand the specific connections among the objects under study: for example, the interactions between members of a political party; the firing of synapses in a neuronal network; or citation patterns in reference literature. Mathematically, these objects and their connections are modelled as graphs, and a common goal is to find clusters of similar vertices within any graph.
The stochastic blockmodel and its extensions are key tools for modeling communities and groups in graphs [5, 6]. Given a stochastic blockmodel graph, one way to cluster the vertices is to estimate the block memberships of all vertices. Many techniques have been proposed for estimating the block memberships, including maximization of modularity [8], likelihood maximization [2, 4], Bayesian methods [12] and spectral techniques [11, 13]. In this paper, we focus on the latter, particularly on clustering the vertices using the adjacency spectral embedding.
The authors of [3] have shown that likelihood-based techniques can be employed to achieve error-free vertex clustering with probability tending to one as n, the number of vertices, tends to infinity. For clustering using spectral methods, however, the best existing results assert only that with probability tending to one, at most O(log(n)) vertices will be misclustered. We provide here the necessary theory to bring the error estimates for these two methods in line, showing that spectral clustering will have asymptotically perfect performance for the stochastic blockmodel.
Our main result is Theorem 5, which gives a bound, dependent on model assumptions, on the probability that a mean square error clustering of the adjacency spectral embedding will be error-free. We also discuss the asymptotic implications of these results. To prove this theorem, we prove two key lemmas about a more general model. Consequently, we are able to prove an analogous result for clustering in the degree-corrected stochastic blockmodel [7]. Finally, we prove a more general result that spectral clustering of random dot product graphs is strongly universally consistent in the sense of [9].

2. Setting and Main Theorem. In our present setting, we deviate slightly from the standard characterization of the stochastic blockmodel and focus, instead, on a particular latent position parametrization of the model, namely the random dot product graph [17]. Random dot product graphs are a convenient theoretical tool, and their spectral properties are well understood. This setting replaces the non-geometric construction, in which each block is associated with a categorical label, with a geometric construction in which each block is associated with a point in Euclidean space.

Definition 1 (Random Dot Product Graph (RDPG)). Let X ∈ R^{n×d} be a matrix whose rows X_i satisfy X_i^⊺ X_j ∈ [0, 1] for all i, j ∈ [n]. A random adjacency matrix A is said to be an instance of a random dot product graph (RDPG) with latent positions X, written A ∼ RDPG(X), if A is symmetric and hollow and, conditional on X, the entries {A_ij : i < j} are independent with A_ij ∼ Bernoulli(X_i^⊺ X_j).

Remark. In general we will denote the rows of an n × d matrix M by M_i.
For an RDPG, we also define P = XX^⊺. Let S ∈ R^{d×d} be the diagonal matrix whose entries are the d largest eigenvalues of P, ordered so that S_11 ≥ S_22 ≥ ⋯ ≥ S_dd > 0, and let V ∈ R^{n×d} be the matrix whose columns are the corresponding orthonormal eigenvectors. In particular, we shall assume throughout this paper that the non-zero eigenvalues of P are distinct, i.e., the inequalities above are strict.
It easily follows that there exists an orthonormal W ∈ R^{d×d} such that V S^{1/2} = XW. We thus suppose that X = V S^{1/2}. This assumption does not lead to any loss of generality since, for the inference tasks in this paper, we will be performing clustering using the mean square error criterion, which is invariant under orthogonal transformations. Similarly, let Ŝ ∈ R^{d×d} be the diagonal matrix of the d largest-magnitude eigenvalues of A and let V̂ ∈ R^{n×d} be the matrix of corresponding orthonormal eigenvectors. We define X̂ = V̂ Ŝ^{1/2} to be the adjacency spectral embedding (ASE) of A.
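As a concrete illustration, the ASE is straightforward to compute; the following is a minimal NumPy sketch (the function name and the toy rank-one example are ours, and the embedding dimension d is assumed known):

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """Compute the n x d adjacency spectral embedding V_hat @ S_hat^{1/2}.

    S_hat holds the d largest-magnitude eigenvalues of the symmetric
    matrix A; V_hat holds the corresponding orthonormal eigenvectors.
    As in the text, the retained eigenvalues are assumed positive.
    """
    evals, evecs = np.linalg.eigh(A)            # eigenvalues in ascending order
    top = np.argsort(np.abs(evals))[::-1][:d]   # indices of top-d by magnitude
    return evecs[:, top] * np.sqrt(np.abs(evals[top]))

# Sanity check on a noiseless rank-one P = X X^T with X_i = 0.5 for all i:
# embedding P itself recovers X exactly (up to sign).
n, d = 6, 1
X = np.full((n, d), 0.5)
P = X @ X.T
Xhat = adjacency_spectral_embedding(P, d)
```

Embedding the noiseless P rather than a sampled A isolates the linear-algebraic step; the sign ambiguity in the recovered column is the orthogonal non-identifiability discussed above.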
We shall assume, for ease of exposition, that the diagonal entries of Ŝ are positive. As will be seen later, e.g., in Lemma 7, this assumption is justified in the context of random dot product graphs due to the concentration of the eigenvalues of A around those of P.
We now define the stochastic blockmodel in terms of random dot product graphs.
Definition 2 ((Positive Semidefinite) Stochastic Blockmodel (SBM)). We say an RDPG is an SBM with K blocks if the number of distinct rows in X is K. In this case, we define the block membership function τ ∶ [n] ↦ [K] to be a function such that τ(i) = τ(j) if and only if X_i = X_j. For each k ∈ [K], let n_k be the number of vertices i such that τ(i) = k.
Remark. Our definition of an SBM is somewhat atypical: SBMs are traditionally defined by the K × K matrix of probabilities of adjacencies between vertices in each of the blocks. The two definitions coincide provided this matrix is positive semidefinite.
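Under this definition, sampling an SBM graph amounts to sampling an RDPG whose latent position matrix has K distinct rows. A minimal sketch (the block positions, block sizes, and helper name are illustrative choices of ours; the chosen positions make P = XX^⊺ entrywise in [0, 1]):

```python
import numpy as np

def sample_rdpg(X, rng):
    """Sample a symmetric, hollow adjacency matrix with
    A_ij ~ Bernoulli((X X^T)_ij) independently for i < j."""
    P = X @ X.T
    upper = np.triu(rng.random(P.shape) < P, k=1)  # strict upper triangle
    A = upper.astype(float)
    return A + A.T                                  # symmetric, zero diagonal

rng = np.random.default_rng(0)
positions = np.array([[0.8, 0.1],    # latent position shared by block 1
                      [0.1, 0.8]])   # latent position shared by block 2
tau = np.repeat([0, 1], 50)          # block membership function
X = positions[tau]                   # n x d latent positions, K = 2 distinct rows
A = sample_rdpg(X, rng)
```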
Next, we introduce mean square error clustering, the criterion optimized by K-means clustering.
Definition 3 (Mean Square Error (MSE) Clustering). The MSE clustering of the rows of X̂ into K blocks returns a matrix Ĉ ∈ R^{n×d}, with at most K distinct rows, minimizing ∥Ĉ − X̂∥_F; the distinct rows of Ĉ are the optimal cluster centroids for the MSE criterion. The associated block assignment is the function τ̂ ∶ [n] ↦ [K] which satisfies τ̂(i) = τ̂(j) if and only if Ĉ_i = Ĉ_j, where Ĉ_i is the i-th row of Ĉ.
We note some basic properties of α → β operator norms, defined by ∥M∥_{α→β} = sup_{∥x∥_α = 1} ∥Mx∥_β. In particular, ∥M∥_{2→∞} = max_i ∥M_i∥_2. In addition, for any matrices A and B whose product AB is well-defined, ∥ABx∥_β ≤ ∥A∥_{α′→β} ∥Bx∥_{α′} by the definition of the ∥⋅∥_{α′→β} norm, and this implies ∥AB∥_{α→β} ≤ ∥A∥_{α′→β} ∥B∥_{α→α′}. We now state a technical lemma in which we bound the maximum difference between the rows of X̂ and the rows of an orthogonal transformation of X.
Lemma 4. Let ∆ = max_i ∑_j P_ij, so that ∆ is the maximum of the row sums of P, and let γn be the minimum gap among the distinct eigenvalues of P. Suppose 0 < η < 1/2 is given such that γn ≥ 4√(∆ log(n/η)). Then, with probability at least 1 − 2η, one has ∥X̂ − X∥_{2→∞} ≤ β, where β = β(d, n, η, γ) is an explicit function of the model parameters. Lemma 4 gives far greater control of the errors than previous results that were derived for the Frobenius norm ∥X̂ − X∥_F; the latter bounds do not allow fine control of the errors in the individual rows of X̂. Lemma 4, on the other hand, provides exactly this control and, as such, vastly improves the bounds on the error rate of MSE clustering of X̂.
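The contrast drawn here is easy to see numerically: since ∥M∥_{2→∞} = max_i ∥M_i∥_2 while ∥M∥_F aggregates all rows, an error matrix whose rows are uniformly small has a small 2 → ∞ norm but a Frobenius norm that grows with n. A quick check (the error matrix below is an arbitrary illustration):

```python
import numpy as np

def two_to_infty_norm(M):
    """||M||_{2->infinity}: the maximum Euclidean norm of a row of M."""
    return float(np.max(np.linalg.norm(M, axis=1)))

n, d = 400, 3
rng = np.random.default_rng(1)
E = rng.standard_normal((n, d)) / n   # uniformly small row-wise errors

row_bound = two_to_infty_norm(E)      # fine, per-row control
frob = float(np.linalg.norm(E))       # coarse, aggregated control
# The 2->infinity norm is always dominated by the Frobenius norm ...
dominated = row_bound <= frob
# ... but is far smaller when the errors are spread across many rows.
ratio = frob / row_bound
```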
Our main theorem is the following result on the probability that mean square error clustering of the rows of X̂ is error-free.
Theorem 5. Let A ∼ RDPG(X) be an SBM with K blocks and block membership function τ, and suppose η ∈ (0, 1/2). Denote the bound on ∥X̂ − X∥_{2→∞} in Lemma 4 as β = β(d, n, η, γ). Let r > 0 be such that ∥X_i − X_j∥_2 > 4r for all i, j ∈ [n] with X_i ≠ X_j, and let (Ĉ, τ̂) be the optimal MSE clustering of the rows of X̂ into K clusters. Let S_K denote the symmetric group on K elements, whose elements π we regard as permutations of the block labels. Finally, let n_min = min_{k∈[K]} n_k be the smallest block size. If r > β√(n/n_min) and γn > 4√(∆ log(n/η)), then with probability at least 1 − 2η there exists π ∈ S_K such that τ̂(i) = π(τ(i)) for all i ∈ [n]; that is, the MSE clustering is error-free.

We remark that these conditions are quite natural: we require that the distinct rows of X have some minimum separation, and that this separation is not too small compared to the smallest block size. Our assumption on γ ensures a large enough gap in the eigenvalues to use Lemma 4. We note that Lemma 4 is applicable to the sparse setting, i.e., the setting wherein the average degrees of the vertices are of order ω(log^k n) for some k ≥ 2. Finally, we observe that the theorem has both finite-sample and asymptotic implications. In particular, under these model assumptions, for any finite n, the theorem gives a lower bound on the probability of perfect clustering. We do not assert that, in the finite-sample case, perfect clustering occurs with probability one; indeed, this is easy to refute. Nevertheless, we can choose η = n^{−c} for some constant c ≥ 2, in which case the probability of perfect clustering approaches one as n tends to infinity.
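A small simulation illustrates the theorem's conclusion. The sketch below samples a two-block SBM, embeds the graph, and clusters the embedded rows; Lloyd's algorithm with a fixed seed stands in for an exact MSE minimizer, and all parameter values are illustrative choices of ours. In this well-separated regime the recovered labels are expected to agree with τ up to a permutation of the block labels.

```python
import numpy as np

def ase(A, d):
    """Adjacency spectral embedding: top-d eigenpairs of A by magnitude."""
    evals, evecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(evals))[::-1][:d]
    return evecs[:, top] * np.sqrt(np.abs(evals[top]))

def lloyd_kmeans(Z, K, rng, n_iter=50):
    """Plain Lloyd's algorithm as a stand-in for the exact MSE clustering."""
    C = Z[rng.choice(len(Z), size=K, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        labels = np.argmin(((Z[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([Z[labels == k].mean(0) if np.any(labels == k) else C[k]
                      for k in range(K)])
    return labels

rng = np.random.default_rng(2)
positions = np.array([[0.8, 0.1], [0.1, 0.8]])
tau = np.repeat([0, 1], 100)                   # true block memberships
X = positions[tau]
P = X @ X.T
A = np.triu((rng.random(P.shape) < P).astype(float), 1)
A = A + A.T                                    # sampled SBM adjacency matrix

labels = lloyd_kmeans(ase(A, 2), 2, rng)
# agreement with tau, maximized over the two possible label permutations
agreement = max(np.mean(labels == tau), np.mean(labels != tau))
```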
3. Proof of Theorem 5. Before we prove Theorem 5, we first collect a sequence of useful bounds from [13, 15, 16]. We then prove two key lemmas. Chief among the collected bounds is the following concentration inequality for the adjacency matrix.

Proposition 6. Let A ∼ RDPG(X) and suppose η ∈ (0, 1/2). Then, with probability at least 1 − η, ∥A − P∥_{2→2} ≤ 2√(∆ log(n/η)).
The next two lemmas from [1] are essential to our argument.
Lemma 7. In the setting of Proposition 6, if the events in Proposition 6 occur, then by Weyl's inequality the eigenvalues of A concentrate around those of P: |Ŝ_ii − S_ii| ≤ ∥A − P∥_{2→2} ≤ 2√(∆ log(n/η)) for all i ∈ [d]. In particular, under the eigengap assumption of Lemma 4, the diagonal entries of Ŝ are positive.

Lemma 8. In the setting of Proposition 6, if the events in Proposition 6 occur, then the estimated eigenvectors V̂ are correspondingly close to the true eigenvectors V, in the sense of a Davis–Kahan-type bound.

We then have the following bound.

Lemma 9. In the setting of Proposition 6, if the events in Proposition 6 occur, then ∥X̂ − A V̂ Ŝ^{−1/2}∥_F satisfies a bound of the same order as β.

Proof. Let E = A − V̂ Ŝ V̂^⊺ and denote by Z the quantity A V̂ Ŝ^{−1/2}. Since A V̂ = V̂ Ŝ + E V̂, we have X̂ − Z = −E V̂ Ŝ^{−1/2}, and hence ∥X̂ − Z∥_F ≤ C_1 C_2, where C_1 and C_2 are given by C_1 = ∥E∥_{2→2} and C_2 = ∥V̂ Ŝ^{−1/2}∥_F. By Proposition 6 and our assumption that γn ≥ 4√(∆ log(n/η)), Lemma 7 bounds the diagonal entries of Ŝ below by γn/2, so that C_2 ≤ √(2d/(γn)). Similarly, we have that ∥E∥_{2→2} ≤ 2√(∆ log(n/η)), from which the desired bound follows.
We now prove Lemma 4 by dominating the ∥⋅∥_{2→∞} norm by the Frobenius norm and applying Hoeffding's inequality.
Proof of Lemma 4. Since X = P V S^{−1/2}, we can add and subtract the matrices A V̂ Ŝ^{−1/2} and A V S^{−1/2} to rewrite X̂ − X as a sum of three terms. Lemma 9 bounds the first term in terms of the Frobenius norm, which is an upper bound for the 2 → ∞ norm. For the second term, the factor Ŝ^{−1/2} − S^{−1/2} is controlled entrywise by the eigenvalue concentration of Lemma 7, as both Ŝ and S are diagonal matrices.

We now bound the third term. Let Z_ij denote the (i, j)-th entry, and Z_i the i-th row, of the n × d matrix (A − P)V. Observe that Z_ij = ∑_k (A_ik − P_ik) V_kj is a sum of independent, mean-zero, bounded random variables, and ∑_k V_kj² = 1. Therefore, Hoeffding's inequality implies P(|Z_ij| ≥ √(2 log(2nd/η))) ≤ η/(nd). Since there are nd entries Z_ij, a simple union bound ensures that, with probability at least 1 − η, every |Z_ij| ≤ √(2 log(2nd/η)), and consequently that max_i ∥Z_i∥_2 ≤ √(2d log(2nd/η)). The third term can therefore be bounded, with probability at least 1 − η, by a quantity of order √(d log(2nd/η)/(γn)). Combining the bounds for the above three terms yields Lemma 4.
Proof of Theorem 5. We assume that the event in Lemma 4 occurs, so that ∥X̂ − X∥_{2→∞} ≤ β, and show that this implies the result. Since Ĉ minimizes ∥Ĉ − X̂∥_F over matrices with at most K distinct rows, and X is one such matrix, we have ∥Ĉ − X̂∥_F ≤ ∥X − X̂∥_F ≤ β√n. Let B_1, B_2, . . ., B_K be L_2-balls with radii 2r around the K distinct rows of X. By the assumptions in Theorem 5, these balls are disjoint. Suppose now that there exists k ∈ [K] such that B_k does not contain any rows of Ĉ. Then ∥Ĉ − X∥_F > 2r√(n_min), as no row of Ĉ is within 2r of a row of X in B_k, of which there are at least n_min. This implies that ∥Ĉ − X̂∥_F ≥ ∥Ĉ − X∥_F − ∥X − X̂∥_F > 2r√(n_min) − β√n > β√n, a contradiction. Therefore, ∥Ĉ − X∥_{2→∞} ≤ 2r. Hence, by the pigeonhole principle, each ball B_k contains precisely one distinct row of Ĉ.
If X_i = X_j, then both Ĉ_i and Ĉ_j are elements of B_{τ(i)}, and since there is exactly one distinct row of Ĉ in each ball, Ĉ_i = Ĉ_j. Conversely, if X_i ≠ X_j, then Ĉ_i and Ĉ_j lie in the disjoint balls B_{τ(i)} and B_{τ(j)} for distinct indices τ(i), τ(j) ∈ [K], implying that Ĉ_i ≠ Ĉ_j. Thus, X_i = X_j if and only if Ĉ_i = Ĉ_j, proving the theorem.

4. Degree Corrected SBM. In this section we extend our results to the degree-corrected SBM [7].
Definition 10 (Degree Corrected Stochastic Blockmodel (DCSBM)). We say an RDPG is a DCSBM with K blocks if there exist K unit vectors y_1, . . ., y_K ∈ R^d such that for each i ∈ [n], there exists k ∈ [K] and c_i ∈ (0, 1) such that X_i = c_i y_k.
For this model, we introduce Y ∈ R^{n×d} to be such that Y_i = y_{τ(i)}, so each row of Y has unit L_2-norm. Similarly, let Ŷ = diag(X̂ X̂^⊺)^{−1/2} X̂, where diag(⋅) denotes the operation of setting all the off-diagonal elements of its argument to 0; in other words, Ŷ is obtained from X̂ by projecting each row onto the unit sphere.

Lemma 11. In the setting of Lemma 4, let Ỹ and Ŷ be obtained by projecting the rows of X and X̂, respectively, onto the unit sphere. Then, on the event of Lemma 4, ∥Ŷ − Ỹ∥_{2→∞} ≤ 2β/c_min, where c_min = min_i c_i is the smallest scaling factor. This again leads to the following theorem about the probability of error-free clustering.
Theorem 12. Suppose A ∼ RDPG(X) is a DCSBM with K blocks and block membership function τ, and suppose η ∈ (0, 1/2). Let y_1, . . ., y_K be the K unit vectors for the DCSBM and let c_min denote the smallest scaling factor. Let γ, β be as in Theorem 5. Suppose r > 0 is such that ∥y_k − y_l∥_2 > 4r for all distinct k, l ∈ [K], and let (Ĉ, τ̂) be the optimal MSE clustering of the rows of Ŷ, the projection of X̂ onto the unit sphere, into K clusters. Finally, let n_min = min_{k∈[K]} n_k be the smallest block size. If r > (2β/c_min)√(n/n_min) and γn > 4√(∆ log(n/η)), then with probability at least 1 − 2η the MSE clustering is error-free, i.e., there exists π ∈ S_K such that τ̂(i) = π(τ(i)) for all i ∈ [n]. The proof of this theorem follows mutatis mutandis from the proof of Theorem 5.
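The projection step that distinguishes the DCSBM pipeline from the SBM one is just a row normalization, under which the degree factors c_i cancel. A minimal sketch (the unit vectors and scaling factors below are arbitrary illustrations):

```python
import numpy as np

def project_rows_to_sphere(M):
    """Row-normalize M; equivalently diag(M M^T)^{-1/2} M, as in the text."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

y = np.array([[1.0, 0.0],            # unit vector y_1
              [0.6, 0.8]])           # unit vector y_2
c = np.array([0.3, 0.9, 0.5, 0.7])   # vertex-specific degree factors c_i
tau = np.array([0, 0, 1, 1])         # block memberships
X = c[:, None] * y[tau]              # DCSBM latent positions X_i = c_i y_{tau(i)}
Y = project_rows_to_sphere(X)        # rows collapse back onto y_{tau(i)}
```

After projection, vertices in the same block share an identical row regardless of their degree factors, which is exactly why clustering is performed on Ŷ rather than on X̂.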

5. Strong universal consistency. The problem of strong consistency of K-means clustering is considered in [9]. Specifically, suppose that {X_1, . . ., X_n} is a sample of independent observations from some common compactly supported distribution F on R^d. Denote by F_n the empirical distribution of the {X_i}, let C be a set containing K or fewer points, and, for a continuous nondecreasing function φ with φ(0) = 0, define Φ(C, F) = ∫ min_{c∈C} φ(∥x − c∥) dF(x). The problem of K-means mean square error clustering given {X_1, . . ., X_n} can then be viewed as the minimization of Φ(C, F_n) for φ(r) = r² over all sets C containing K or fewer elements. The strong consistency of K-means clustering corresponds then to the following statement.
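Concretely, with φ(r) = r², Φ(C, F_n) is the average squared distance from a sample point to its nearest element of C. A direct computation (the sample points and candidate centers are arbitrary):

```python
import numpy as np

def empirical_risk(C, samples):
    """Phi(C, F_n) with phi(r) = r^2: mean over the samples of the
    squared distance to the nearest element of C."""
    sq_dists = ((samples[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return float(sq_dists.min(axis=1).mean())

samples = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
C = np.array([[1.0, 0.0], [10.0, 0.0]])
risk = empirical_risk(C, samples)    # (1 + 1 + 0) / 3
```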
Theorem 13 ([9]). Suppose that for each k = 1, . . ., K, there is a unique set C*_k of k or fewer points minimizing Φ(C, F) over all sets C of k or fewer points. For any given {X_1, . . ., X_n}, denote by C̄_n a minimizer of Φ(C, F_n) over all sets C containing K or fewer elements. Then C̄_n → C*_K almost surely and Φ(C̄_n, F_n) → Φ(C*_K, F) almost surely.

We now state the counterpart to Theorem 13 for the RDPG setting.

Theorem 14. Let A ∼ RDPG(X), where the latent positions are sampled independently from some common compactly supported distribution F. Let F̂_n be the empirical distribution of the {X̂_i}_{i=1}^n. Denote by Ĉ_n a minimizer of Φ(C, F̂_n) over all sets C containing K or fewer elements. Then, provided that the conditions in Theorem 13 hold for F, Ĉ_n → C*_K almost surely and, furthermore, Φ(Ĉ_n, F̂_n) → Φ(C*_K, F) almost surely.

Proof. We can suppose, without loss of generality, that F is a distribution on a totally bounded set, say Ω. Let G denote the family of functions of the form g_C(x) = min_{c∈C} φ(∥x − c∥), where C ranges over all subsets of Ω containing K or fewer points. The theorem is equivalent to showing that sup_{g∈G} |∫ g dF̂_n − ∫ g dF_n| → 0 almost surely. Writing this difference as (1/n) ∑_i (g(X̂_i) − g(X_i)) and letting s_i denote the i-th summand, each |s_i| is controlled by ∥X̂_i − X_i∥_2 ≤ ∥X̂ − X∥_{2→∞}, since φ is uniformly continuous on the bounded set Ω. By Lemma 4, taking η = n^{−2} and applying the Borel–Cantelli lemma, ∥X̂ − X∥_{2→∞} → 0 almost surely, and hence the supremum tends to 0, as desired.

6. Discussion. Lemma 4 bounds the 2 → ∞ norm of the difference between X̂ and a linearly transformed X. In our proof, we exploit the fact that the clustering criterion is relatively insensitive to such linear transformations.

If we are interested in asymptotics, we can take η to be n^{−c} for some c, and so β = O(d∆³ log(n)/(γn)^{7/2}). We can also consider the special case in which the constants are fixed in n and n_min = Θ(n), whereupon the conditions of Theorem 5 are all satisfied for n sufficiently large. For example, if we suppose there are K positions ξ_1, . . ., ξ_K and P[X_i = ξ_k] = π_k for some π_k > 0, i.e., a mixture of point masses, then this is a stochastic blockmodel with independent, identically distributed block memberships across vertices; it is not hard to show that the number of errors converges almost surely to zero.

Indeed, one can construct many examples where perfect performance is achieved asymptotically. We will not detail all regimes explicitly, but rather note that this theorem can be applied to settings with a growing number of blocks, which may impact d, γ and n_min, as well as to moderately sparse regimes, which impact γ.

The DCSBM is more flexible than the SBM, but still has key properties useful in modeling group structures in graphs. In [10], the authors provide complementary results for spectral analysis of the DCSBM without requiring lower bounds on the degrees; however, in turn, they obtain less-than-perfect clustering. Our results are the first to show that, depending on model parameters, the probability of perfect clustering tends to one as the number of vertices tends to infinity. The keys to the easy extension of these results to more general models are Lemmas 4 and 9, stated here in the RDPG setting. The generality of this setting renders these lemmas applicable to a number of other models and inference tasks. For example, the above results can be used to show that the K-nearest-neighbor classifier on the adjacency spectral embedding is a strongly universally consistent vertex classifier, thereby extending the results of [14].