Estimating the number of communities by spectral methods

Community detection is a fundamental problem in network analysis with many methods available to estimate communities. Most of these methods assume that the number of communities is known, which is often not the case in practice. We study a simple and very fast method for estimating the number of communities based on the spectral properties of certain graph operators, such as the non-backtracking matrix and the Bethe Hessian matrix. We show that the method performs well under several models and a wide range of parameters, and is guaranteed to be consistent under several asymptotic regimes. We compare this method to several existing methods for estimating the number of communities and show that it is both more accurate and more computationally efficient.


Introduction
The problem of clustering similar objects into groups is a fundamental problem in data analysis. In network analysis, it is known as community detection ([34,3,10,4]). Given a network, which consists of a set of nodes and a set of edges between them, the goal of community detection is to cluster the nodes into groups (communities) so that nodes in the same community share a similar connectivity pattern.
One of the simplest ways of modeling a community structure is the stochastic block model (SBM), proposed by [17]. Given the number of communities $K$, $n$ node labels $c_i$ are drawn independently from a multinomial distribution with parameter $\pi = (\pi_1, \dots, \pi_K)$. The edges between pairs of nodes $(i, j)$ are then drawn independently from a Bernoulli distribution with parameter $P_{c_i c_j}$ and collected in the $n \times n$ adjacency matrix $A$, with $A_{ij} = 1$ if nodes $i$ and $j$ are connected by an edge, and $0$ otherwise. A limitation of the stochastic block model is that all nodes in the same community are equivalent and follow the same degree distribution, whereas many real networks contain a small number of high-degree nodes, the so-called hubs. To address this limitation, [19] proposed the degree-corrected stochastic block model (DCSBM). It assigns a degree parameter $\theta_i$ to each node $i$, and edges between nodes are drawn independently with probabilities $\theta_i \theta_j P_{c_i c_j}$. The community detection task is to recover the labels $c_i$ given the adjacency matrix $A$.
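The two generative models described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function name `sample_dcsbm`, its arguments, the zero-based labels, and the clipping of probabilities to [0, 1] are all choices made here for the example. Passing `theta=None` gives the plain SBM.

```python
import numpy as np

def sample_dcsbm(n, pi, P, theta=None, seed=0):
    """Draw labels c and a symmetric 0/1 adjacency matrix A.

    Hypothetical helper: theta=None corresponds to the SBM,
    otherwise edge probabilities are theta_i * theta_j * P[c_i, c_j].
    """
    rng = np.random.default_rng(seed)
    pi = np.asarray(pi, dtype=float)
    P = np.asarray(P, dtype=float)
    # Labels c_i are drawn independently with probabilities pi (zero-based).
    c = rng.choice(len(pi), size=n, p=pi)
    theta = np.ones(n) if theta is None else np.asarray(theta, dtype=float)
    # Edge probability for every pair (i, j); clipped as a safeguard.
    probs = np.clip(np.outer(theta, theta) * P[np.ix_(c, c)], 0.0, 1.0)
    # Independent Bernoulli draws for i < j, then symmetrize (no self-loops).
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(int)
    return c, A
```

For instance, `sample_dcsbm(200, [0.5, 0.5], [[0.10, 0.02], [0.02, 0.10]])` draws a two-community SBM in which within-community edges are five times as likely as between-community edges.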
A large number of methods have been proposed for finding the underlying community structure ([28,33,3,10,37,12,4,20,41,30,38]). Most of these methods require the number of communities $K$ as input, but in practice $K$ is often unknown. To address this problem, a few likelihood-based methods have been proposed to estimate $K$ under either the SBM or the DCSBM ([14,21,35,39,43]). These methods use BIC-type criteria for choosing the number of communities from a set of possible values. This requires computing the likelihood, using either MCMC or the variational method, both of which are computationally very challenging for large networks. A different approach based on the distribution of leading eigenvalues of an appropriately scaled version of the adjacency matrix was proposed by [9,23]. Under the SBM, distributions of the leading eigenvalues converge to the Tracy-Widom distribution; this fact is used to determine $K$ through a sequence of hypothesis tests. Since the rate of convergence is slow for relatively sparse networks, a bootstrap correction procedure was employed, which also leads to a high computational cost. Cross-validation approaches were proposed by [13] and [24]. While they have good properties under the SBM and the DCSBM, they require estimating communities on many random network splits, and are computationally costly.
To the best of our knowledge, all existing methods are either restricted to a specific model or computationally intensive. In this paper we study a fast and reliable method that uses spectral properties of either the Bethe Hessian or the non-backtracking matrix. Under a simple SBM in the sparse regime, these matrices have been used to recover the community structure ([20,38,11]). It was observed in the physics literature that the informative eigenvalues (i.e., those corresponding to eigenvectors which encode the community structure) of these matrices are well separated from the bulk and can be used to estimate the number of communities, but the properties of this estimator have never been investigated, either theoretically or empirically. We show that the number of "informative" (to be defined explicitly below) eigenvalues of these matrices directly estimates the number of communities, and the estimate performs well under different network models and over a wide range of parameter values, outperforming existing methods designed specifically for estimating $K$ under either SBM or DCSBM. This method is extremely computationally efficient, since all it requires is computing a few leading eigenvalues of just one typically sparse matrix, and to the best of our knowledge, is by far the fastest available accurate method for estimating the number of communities.
Several new methods for estimating the number of communities $K$ have been developed concurrently with the present paper. For example, [36] use a variant of the Chinese restaurant process to generate community assignments, which automatically yields a choice of $K$; this method is implemented via a Monte Carlo sampling scheme, which is computationally intensive. A method based on semi-definite programming, another very computationally intensive technique, was derived and proved to be consistent for assortative networks by [44]. Improving on [43], the authors of [18] proposed a corrected version of the BIC criterion of [43] to correct for under-estimation. More recently, [26] combined spectral clustering with binary segmentation to derive a new estimate of $K$. Compared to all these new methods, the estimators based on the Bethe Hessian or non-backtracking matrices that we study are still the most computationally efficient, arguably the simplest, and competitive on estimation accuracy (see [26] for some numerical comparisons). The theoretical analysis of the Bethe Hessian and the non-backtracking matrices we provide in this paper explains this performance and covers a wide range of settings, including sparse, dense, assortative and disassortative networks; no other method is known to be applicable under a wider range of settings, and most are narrower.

Preliminaries
Recall that $A$ is the $n \times n$ symmetric network adjacency matrix. Let $d_i = \sum_{j=1}^{n} A_{ij}$ be the degree of node $i$. Treating $A$ as a random matrix, let $\mathbb{E}A$ be the expectation of $A$, and let $d = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}d_i$ be the average expected node degree.
2.1. The non-backtracking matrix. Let $m$ be the number of edges in an undirected network, $2m = \sum_{i,j=1}^{n} A_{ij}$. To construct the non-backtracking matrix, we represent the edge between node $i$ and node $j$ by two directed edges, one from $i$ to $j$ and the other from $j$ to $i$. The $2m \times 2m$ matrix $B$, indexed by these directed edges, is defined by
$$B_{i \to j,\, k \to l} = \begin{cases} 1 & \text{if } j = k \text{ and } i \neq l, \\ 0 & \text{otherwise.} \end{cases}$$
It is well-known [5,20] that the spectrum of $B$ consists of $\pm 1$ and the eigenvalues of a $2n \times 2n$ matrix
$$\tilde{B} = \begin{pmatrix} 0_n & D - I_n \\ -I_n & A \end{pmatrix}. \qquad (2.1)$$
Here $0_n$ is the $n \times n$ matrix of all zeros, $I_n$ is the $n \times n$ identity matrix, and $D$ is the $n \times n$ diagonal matrix with degrees $d_i$ on the diagonal. It was observed by [20] that if a network has $K$ communities then the $K$ largest (in absolute value) eigenvalues of $B$ are real-valued and well separated from the bulk, which is contained in a circle of radius $\|B\|^{1/2}$. We refer to these $K$ eigenvalues as informative eigenvalues of $B$. It was also shown by [20] that the spectral norm of the non-backtracking matrix is approximated by
$$\tilde{d} = \Big(\sum_{i=1}^{n} d_i\Big)^{-1} \sum_{i=1}^{n} d_i^2 - 1. \qquad (2.2)$$
For a special case of a sparse SBM with a bounded expected node degree, [11] proved that the leading eigenvalues of $B$ concentrate around non-zero eigenvalues of $\mathbb{E}A$ and the bulk is contained in a circle of radius $\|B\|^{1/2}$, and used the corresponding leading eigenvectors to recover the community labels. The spectrum of $B$ for denser Erdős-Rényi graphs was later analyzed in [42]. In particular, if $d \gg n^{5/6}$, then every eigenvalue of $(d - 1)^{-1/2} B$ is within a vanishing distance from a limiting spectrum supported on the unit circle of the complex plane. In Theorem A.1 below we extend this result to much sparser and more general random graphs and require only that $d \gg \log n$.
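The relation between $B$ and the smaller $2n \times 2n$ matrix can be checked numerically on a toy graph. The sketch below is illustrative only: the function names are hypothetical, the matrices are built densely (fine for small graphs, wasteful for large ones), and the $2n \times 2n$ matrix is taken in the standard block form $\begin{pmatrix} 0 & D - I \\ -I & A \end{pmatrix}$, which is an assumption about the display that did not survive in this copy of the text.

```python
import numpy as np

def nonbacktracking_matrix(A):
    """Dense 2m x 2m non-backtracking matrix of a simple undirected graph."""
    A = np.asarray(A)
    # Every undirected edge {i, j} shows up as two directed edges i->j and j->i.
    src, dst = np.nonzero(A)
    # B[(i->j), (k->l)] = 1 exactly when j == k and i != l.
    B = (dst[:, None] == src[None, :]) & (src[:, None] != dst[None, :])
    return B.astype(float)

def reduced_matrix(A):
    """The 2n x 2n block matrix [[0, D - I], [-I, A]] (assumed form)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))
    I = np.eye(n)
    return np.block([[np.zeros((n, n)), D - I], [-I, A]])

# Toy example: the complete graph on 4 nodes (m = 6, so B is 12 x 12).
K4 = np.ones((4, 4)) - np.eye(4)
rho_B = np.max(np.abs(np.linalg.eigvals(nonbacktracking_matrix(K4))))
rho_M = np.max(np.abs(np.linalg.eigvals(reduced_matrix(K4))))
```

On this example both spectral radii agree, which is what the statement above predicts; the extra eigenvalues of $B$ are the trivial $\pm 1$.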

2.2. The Bethe Hessian matrix. The Bethe Hessian matrix is defined by
$$H(r) = (r^2 - 1) I_n - rA + D,$$
where $r \in \mathbb{R}$ is a parameter. In graph theory, the determinant of $H(r)$ is the Ihara-Bass formula for the graph zeta function. It vanishes if $r$ is an eigenvalue of the non-backtracking matrix [16,6,5]. The Bethe Hessian was used for community detection by [38]. Under the SBM, they argued that the best choice of $r$ is $r_c = \pm\sqrt{d}$, depending on whether the network is assortative or disassortative; for a more general network, they take $r_c = \pm\|B\|^{1/2}$. For assortative sparse networks with $K$ communities and a bounded $d$, they empirically showed that the $K$ eigenvalues of $H(r_c)$ whose corresponding eigenvectors encode the community structure are negative, while the bulk of the eigenvalues of $H(r_c)$ are positive. Thus, the number of negative eigenvalues of $H(r_c)$ corresponds to the number of communities. In Theorem 4.3 below, we prove that this method is indeed consistent for graphs with $d \gg \log n$.
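A small numerical sketch may help make the two facts above concrete: the determinant of the Bethe Hessian vanishes at an eigenvalue of the non-backtracking matrix, and the negative eigenvalues count the communities. The code assumes the standard form $H(r) = (r^2 - 1)I - rA + D$; the helper names and the toy graphs are hypothetical choices made for illustration.

```python
import numpy as np

def bethe_hessian(A, r):
    """H(r) = (r^2 - 1) I - r A + D for a symmetric 0/1 adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))
    return (r**2 - 1.0) * np.eye(n) - r * A + D

def count_negative_eigenvalues(A, r):
    """Number of negative eigenvalues of the (symmetric) matrix H(r)."""
    return int(np.sum(np.linalg.eigvalsh(bethe_hessian(A, r)) < 0))

# Complete graph K4: r = 2 is an eigenvalue of its non-backtracking matrix,
# so det H(2) should vanish up to rounding error.
K4 = np.ones((4, 4)) - np.eye(4)
det_at_2 = np.linalg.det(bethe_hessian(K4, 2.0))

# Two 5-cliques joined by a single edge: a toy graph with two communities.
clique = np.ones((5, 5)) - np.eye(5)
G = np.kron(np.eye(2), clique)
G[0, 5] = G[5, 0] = 1.0
r_avg = np.sqrt(G.sum(axis=1).mean())  # square root of the average degree
k_neg = count_negative_eigenvalues(G, r_avg)
```

Because $H(r)$ is symmetric, `eigvalsh` can be used, which is part of why the Bethe Hessian is cheaper to work with than the non-symmetric matrix $B$.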

Spectral estimates of the number of communities
The spectral properties of the non-backtracking and the Bethe Hessian matrices lead to natural estimates of the number of communities, but they have not been previously considered in this context. We next outline several spectral methods to determine the number of communities $K$. They are based on simple counts of eigenvalues of either the non-backtracking matrix or the Bethe Hessian matrix, and therefore do not require any adjustment for different models such as SBM or DCSBM. We list them in Table 1, and proceed to explain the motivation for each one.

Table 1. Spectral methods for estimating the number of communities. Columns: Method; Parameter; Estimated number of communities $\hat{K}$.
3.1. Estimating $K$ from the non-backtracking matrix. As we will show in Theorems 4.1 and 4.2 under the SBM, the informative eigenvalues of the non-backtracking matrix are real-valued and separated from the bulk of radius $\|B\|^{1/2}$. Therefore we can estimate $K$ by counting the number of real eigenvalues of $B$ that are at least $\|B\|^{1/2}$. We denote this method by NB (for non-backtracking). As shown by Theorem 4.2 and numerical results in Section 5, this estimate of $K$ also works under much more general models with low-rank structure such as DCSBM. When the network is balanced (communities have similar sizes and edge densities), NB performs well; however, the accuracy of NB drops if the communities are unbalanced in either size or edge density. Since $B$ is not symmetric, computing the eigenvalues of $B$ is slightly more demanding than computing those of the Bethe Hessian matrix for large networks.
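The NB count can be sketched as follows. Several choices here are assumptions rather than the authors' implementation: the eigenvalues are computed from the $2n \times 2n$ block matrix $\begin{pmatrix} 0 & D - I \\ -I & A \end{pmatrix}$ instead of $B$ itself (they share all eigenvalues other than $\pm 1$), $\|B\|$ is interpreted as the largest eigenvalue modulus, and an eigenvalue is treated as real when its imaginary part is below a small tolerance `tol`. The function name is hypothetical.

```python
import numpy as np

def estimate_k_nb(A, tol=1e-8):
    """Count real eigenvalues at least as large as the bulk radius.

    Sketch of the NB estimator; dense linear algebra, small graphs only.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))
    I = np.eye(n)
    # 2n x 2n matrix whose spectrum, together with +-1, gives that of B.
    M = np.block([[np.zeros((n, n)), D - I], [-I, A]])
    eig = np.linalg.eigvals(M)
    # Bulk radius ||B||^{1/2}, with ||B|| taken as the largest modulus (assumption).
    radius = np.sqrt(np.max(np.abs(eig)))
    is_real = np.abs(eig.imag) < tol
    return int(np.sum(is_real & (eig.real >= radius)))

# Toy usage: two disjoint 5-cliques, i.e. two perfectly separated communities.
clique = np.ones((5, 5)) - np.eye(5)
two_cliques = np.kron(np.eye(2), clique)
k_hat = estimate_k_nb(two_cliques)
```

For large sparse networks one would replace the dense `eigvals` call with an iterative solver that returns only a few leading eigenvalues; that substitution is not shown here.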
3.2. Estimating $K$ from the Bethe Hessian matrix. The number of communities corresponds to the number of negative eigenvalues of $H(r)$; the challenge is in choosing an appropriate value of $r$. It was argued by [38] that when $r = \|B\|^{1/2}$, the informative eigenvalues of $H(r)$ are negative, while the bulk are positive; by [20], $\|B\|$ can be approximated by $\tilde{d}$ from (2.2). Following these results, we first choose $r$ to be $r_m = \tilde{d}^{1/2}$ and call the corresponding method BHm. Simulations show that using $r = r_m$ and $r = \|B\|^{1/2}$ produces similar results; we choose $r = r_m$ because computing $r_m$ is less demanding than computing $\|B\|^{1/2}$. We also consider $r_a = \sqrt{(d_1 + \cdots + d_n)/n}$, the square root of the average observed degree, which was proposed by [38] for recovering the community structure under the SBM; we call the corresponding method BHa. We have found that when the network is balanced, NB, BHm and BHa perform similarly; when the network is unbalanced, BHa produces better results.
Both BHm and BHa tend to underestimate the number of communities, especially when the network is unbalanced. In that setting, some informative eigenvalues of $H(r)$ become positive, although they may still be far from the bulk. Based on this observation, we correct BHm and BHa by also using positive eigenvalues of $H(r)$ that are much closer to zero than to the bulk. Namely, we sort the eigenvalues of $H(r)$ in non-increasing order $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ and estimate $K$ by
$$\hat{K} = \max\{k : t\lambda_{n-k+1} \leq \lambda_{n-k}\}, \qquad (3.1)$$
where $t > 0$ is a tuning parameter. Note that if $\lambda_{n-k+1} < 0$ and $t \geq 1$, then $t\lambda_{n-k+1} \leq \lambda_{n-k}$; therefore the number of negative eigenvalues of $H(r)$ is always upper bounded by $\hat{K}$. Heuristically, if the bulk follows the semi-circular law and $\lambda_{n-k} \geq 0$ is given, then the probability that $0 \leq \lambda_{n-k+1} \leq \lambda_{n-k}/t$ is less than $1/t$. When $1/t$ is sufficiently small, we may suspect that $\lambda_{n-k+1}$ is an informative eigenvalue. In practice we find that $t \in [4, 6]$ works well; we will set $t = 5$ for all computations in this paper. Simulations show that $\hat{K}$ performs well, especially for unbalanced networks. The resulting methods are denoted by BHmc and BHac, respectively. We will also use BH to refer to all the methods that use the Bethe Hessian matrix. For a summary of these methods, see Table 1.
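The four Bethe Hessian variants can be sketched in one short function. This is a hedged illustration, not the authors' code. It assumes $r_m = \big(\sum_i d_i^2 / \sum_i d_i - 1\big)^{1/2}$ and $r_a = \big(\sum_i d_i / n\big)^{1/2}$, and it assumes the correction rule $\hat{K} = \max\{k : t\lambda_{n-k+1} \leq \lambda_{n-k}\}$ for eigenvalues sorted in non-increasing order; these formulas are reconstructed from the surrounding description and should be checked against the published version. The function name and arguments are hypothetical.

```python
import numpy as np

def estimate_k_bh(A, method="a", corrected=True, t=5.0):
    """Sketch of BHm / BHa (corrected=False) and BHmc / BHac (corrected=True)."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    n = len(deg)
    if method == "a":
        r = np.sqrt(deg.mean())                           # assumed r_a
    elif method == "m":
        r = np.sqrt((deg**2).sum() / deg.sum() - 1.0)     # assumed r_m
    else:
        raise ValueError("method must be 'a' or 'm'")
    H = (r**2 - 1.0) * np.eye(n) - r * A + np.diag(deg)
    # Non-increasing order: lam[0] >= lam[1] >= ... >= lam[n-1].
    lam = np.sort(np.linalg.eigvalsh(H))[::-1]
    if not corrected:
        return int(np.sum(lam < 0))
    # 1-based lambda_{n-k+1} is lam[n-k]; 1-based lambda_{n-k} is lam[n-k-1].
    ks = [k for k in range(1, n) if t * lam[n - k] <= lam[n - k - 1]]
    return max(ks) if ks else 0

# Toy usage: two disjoint 5-cliques.
clique = np.ones((5, 5)) - np.eye(5)
two_cliques = np.kron(np.eye(2), clique)
estimates = {(m, c): estimate_k_bh(two_cliques, m, c)
             for m in ("a", "m") for c in (False, True)}
```

On a perfectly separated toy graph all four variants agree; the correction is intended to matter only when an informative eigenvalue has drifted slightly above zero.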

Consistency
The consistency of the non-backtracking matrix based method (NB) for estimating the number of communities in the sparse regime under the stochastic block model follows directly from Theorem 4 of [11]. We state this consistency result here for completeness. The proof given by [11] is combinatorial in nature and this approach unfortunately does not extend to any other regimes or to the Bethe Hessian matrix.
Theorem 4.1 (Consistency in the sparse regime). Consider a stochastic block model with $\pi = (\pi_1, \dots, \pi_K)$ and $P = (P_{kl}) = \frac{1}{n} P^{(0)}$ for some fixed $K \times K$ symmetric matrix $P^{(0)}$. Assume that $(\mathrm{diag}(\pi) P)^r$ has positive entries for some positive integer $r$. Further, assume that $\mathbb{E}(d_i) = d > 1$ for all $i$, and all $K$ non-zero eigenvalues of $P$ are greater than $\sqrt{d}$. Then with probability tending to one as $n \to \infty$, the number of real eigenvalues of $B$ that are at least $\|B\|^{1/2}$ is equal to $K$.
To better understand the condition on the eigenvalues of $P$, consider the simple model $G(n, \frac{a}{n}, \frac{b}{n})$. This model assumes that there are two communities of equal sizes and nodes are connected with probability $a/n$ if they are in the same community, and $b/n$ otherwise. Since the two non-zero eigenvalues of $P$ are $(a + b)/2$ and $(a - b)/2$, the condition on the eigenvalues of $P$ is $(a - b)^2 > 2(a + b)$. This matches the phase transition condition for detectability in the sparse regime [29,31,27].
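To make the step from the eigenvalue condition to the stated inequality explicit, here is a short worked derivation. It uses the fact that in this two-community model every node has expected degree $d = (a + b)/2$, and it assumes $a > b$; the numerical values of $a$ and $b$ at the end are chosen purely for illustration.

```latex
\[
\frac{a-b}{2} > \sqrt{d} = \sqrt{\frac{a+b}{2}}
\;\Longleftrightarrow\;
\frac{(a-b)^2}{4} > \frac{a+b}{2}
\;\Longleftrightarrow\;
(a-b)^2 > 2(a+b).
\]
% Illustration: a = 5, b = 1 gives 16 > 12, so the condition holds;
%               a = 3, b = 1 gives 4 < 8, so the condition fails.
```

The larger eigenvalue $(a + b)/2 = d$ exceeds $\sqrt{d}$ automatically because $d > 1$, so only the smaller eigenvalue imposes a constraint.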
Next, we prove the consistency of the proposed methods in the denser regime $d \gg \log n$, sometimes referred to as semi-dense in contrast to the dense regime of $d = O(n)$. For this regime, we make the following assumptions.

Assumption 4.1. All nodes have the same expected degree satisfying

Assumption 4.3. The expected degree $d$ in Assumption 4.1 satisfies

Following [11], we assume in Assumption 4.1 that all nodes have the same expected degree. This corresponds perhaps to the most challenging setting, where expected degrees alone do not contain information about the latent structure of interest. As in [11] and [42], this assumption allows us to simplify our analysis of the non-backtracking matrix considerably. If some communities have different expected degrees, we can first use node degrees to identify them, divide the network into sub-networks of similar expected node degrees, and apply our results on the sub-networks. Note that for the degree-corrected stochastic block model, if the underlying stochastic block model satisfies this assumption and the degree parameters are drawn from the same distribution, then the degree-corrected stochastic block model itself will also satisfy the assumption.
The lower bound on λ

To compute this estimator in practice, we simply set $\varepsilon = 0$ and estimate $d$ with the average observed degree $\hat{d} = (d_1 + \cdots + d_n)/n$.

The key result for proving Theorem 4.2 is Theorem A.1 in Appendix A, which establishes a connection between the spectra of the non-backtracking and adjacency matrices, and may also be of independent interest. Theorem A.1 is a significant improvement on Theorem 1.5 in [42], which only considers the Erdős-Rényi model and requires a much stronger condition, $d \gg n^{5/6}$ instead of $d \gg \log n$.

For the Bethe Hessian, no formal results have been previously established. We show in the following theorem that both the BHm and BHa methods produce consistent estimators of $K = \mathrm{rank}(\mathbb{E}A)$, provided that the following stronger version of Assumption 4.2 holds.
Note that Assumption 4.2 allows networks to be disassortative, meaning probabilities of connections between communities are higher than within communities, in which case the eigenvalues of E A may be negative.In contrast, Assumption 4.4 requires all eigenvalues of E A to be non-negative.Again in practice, we set ε = 0 to compute the estimator.

Numerical results
In this section, we briefly compare the empirical accuracy of estimating the number of communities by using the non-backtracking matrix (NB), and all the versions based on the Bethe Hessian matrix (BHm, BHmc, BHa, and BHac), described in Section 3.1 and Section 3.2. We compare them with two other methods representative of approaches in the literature to estimating the number of communities in networks: the network cross-validation method (NCV) proposed by [13] and a likelihood-based BIC-type method (VLH, for variational likelihood) proposed by [43]. We use NCVbm and NCVdc to denote the versions of the NCV method specifically designed for the SBM and the DCSBM, respectively; VLH is only designed to work under the SBM, so it is not included in the DCSBM comparisons. To make comparisons with VLH computationally feasible, instead of using the variational method to estimate the posterior of the community labels as done by [43], we first estimate the node labels by the pseudo-likelihood method proposed by [4] and then compute the posterior following [43]. In small-scale simulations where both approaches are computationally feasible (results omitted) we found that substituting pseudo-likelihood for the variational method has very little effect on the estimate of $K$. The tuning parameter of VLH is set to one following [43]. We do not include the method of [9] in these comparisons due to its high computational cost. Note that our theoretical analysis assumes for simplicity that all expected node degrees are equal (Theorems 4.1, 4.2 and 4.3); however, we allow different expected node degrees in simulations. In this section, $d = \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}d_i$ denotes the average expected node degree.

Synthetic networks.
To generate synthetic networks, we fix the labels $c \in \{1, \dots, K\}^n$ so that $c_i = k$ if $n\pi_{k-1} + 1 \leq i < n\pi_k$, where $\pi_0 = 0$. The label matrix $Z \in \mathbb{R}^{n \times K}$, given by $Z_{ik} = 1(c_i = k)$, encodes $c$ by representing each node's label with a row of $K$ elements, exactly one of which is equal to 1, and the rest are equal to 0. Let $P$ be a $K \times K$ matrix with the diagonal $w = (w_1, \dots, w_K)$ and off-diagonal entries $\beta$, and $M = Z P Z^T$. Under the stochastic block model, we generate entries of $A$ using the edge probability matrix $\mathbb{E}(A) = \rho_n M$; the average degree $d$ is controlled by $\rho_n$. The parameter $w$ controls the relative edge densities within communities, and $\beta$ controls the out-in probability ratio. Smaller values of $\beta$ and larger values of $d$ make the problem easier. For the DCSBM, we generate the degree parameters $\theta_i$ from a distribution that takes two values, $P(\theta = 1) = 1 - \gamma$ and $P(\theta = 0.2) = \gamma$. Parameter $\gamma$ controls the fraction of "hubs", the high-degree nodes allowed under the DCSBM, and setting $\gamma = 0$ gives back the regular SBM. Given $\theta = (\theta_1, \dots, \theta_n)$, the edges are generated independently with probabilities $\mathbb{E}(A) = \rho_n \, \mathrm{diag}(\theta) M \, \mathrm{diag}(\theta)$, where $\mathrm{diag}(\theta)$ is a diagonal matrix with the $\theta_i$'s on the diagonal.
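The simulation design just described can be sketched as follows. The function name, the equal-proportion default for $\pi$, and in particular the way $\rho_n$ is calibrated (so that the row sums of $\mathbb{E}(A)$ average to the target degree, diagonal included) are assumptions made for this illustration; the text does not specify how $\rho_n$ was set.

```python
import numpy as np

def synthetic_network(n=1200, K=4, w=None, beta=0.2, avg_degree=15.0,
                      gamma=0.0, pi=None, seed=0):
    """Return labels c, adjacency A and edge-probability matrix EA (sketch)."""
    rng = np.random.default_rng(seed)
    w = np.ones(K) if w is None else np.asarray(w, dtype=float)
    pi = np.full(K, 1.0 / K) if pi is None else np.asarray(pi, dtype=float)
    # Fixed, contiguous labels with community proportions pi (zero-based).
    cuts = np.cumsum(pi) * n
    c = np.minimum(np.searchsorted(cuts, np.arange(n), side="right"), K - 1)
    Z = np.eye(K)[c]                                   # n x K label matrix
    P = np.full((K, K), beta) + np.diag(w - beta)      # diagonal w, off-diagonal beta
    M = Z @ P @ Z.T
    # Degree parameters: 1 with probability 1 - gamma, 0.2 with probability gamma.
    theta = rng.choice([1.0, 0.2], size=n, p=[1.0 - gamma, gamma])
    W = np.outer(theta, theta) * M
    rho = avg_degree * n / W.sum()                     # assumed calibration of rho_n
    EA = np.clip(rho * W, 0.0, 1.0)
    upper = np.triu(rng.random((n, n)) < EA, k=1)      # independent edges for i < j
    A = (upper | upper.T).astype(int)
    return c, A, EA
```

For example, `synthetic_network(K=6, w=[1, 1, 1, 1, 2, 3], gamma=0.9)` corresponds to an unbalanced DCSBM configuration of the kind used in the experiments below.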
The number of nodes is set to n = 1200, the out-in probability ratio β = 0.2, and we vary the average degree d, weights w, and community sizes determined by the vector π.We consider three different values for the number of communities, K = 2, 4, and 6.For each setting, we generate 200 replications of the network and record the accuracy, defined as the fraction of times a method correctly estimates the true number of communities K.
The methods NCV and VLH require a pre-specified set of K values to choose from; we use the set {1, 2, ..., 8} for synthetic networks and {1, 2, ..., 15} for real-world networks.
We start by varying the average degree d, which controls the overall difficulty of the problem, while keeping community sizes equal.Figure 1 shows the performance of all methods for the balanced community density case, w i = 1 for all 1 ≤ i ≤ K. Figure 2 shows the unbalanced case, with w = (1, 2) for K = 2, w = (1, 1, 2, 3) for K = 4, and w = (1, 1, 1, 1, 2, 3) for K = 6.In every figure, the top row corresponds to the SBM (γ = 0) and the bottom row to the DCSBM (γ = 0.9, meaning 10% of nodes are hubs).
In general, we see that when everything is balanced (Figure 1), all spectral methods perform fairly similarly and outperform both cross-validation (NCV) and the BIC-type criterion (VLH).Also, for larger K and especially under DCSBM, the corrected versions are somewhat better than the uncorrected ones, and the best Bethe Hessian methods are better than the non-backtracking estimator.
For networks with equal size communities but different edge densities within communities (Figure 2), cross-validation performs poorly, but VLH relatively improves. For larger $K$ the spectral methods are also distinguishable, with all BH methods dominating NB, and corrected versions providing improvement. Overall, BHac is the best spectral method, with VLH comparable for the SBM in this case. The BHac method is the best overall for DCSBM where VLH is not applicable.

Communities of different sizes present a challenge for community detection methods in general, and the presence of relatively small communities makes the problem of estimating $K$ difficult. To test the sensitivity of all the methods to this factor, we change the proportions of nodes falling into each community, setting $\pi_1 = r/K$, $\pi_K = (2 - r)/K$, and $\pi_i = 1/K$ for $2 \leq i \leq K - 1$, and varying $r$ in the range [0.  the community sizes become more similar, and are all equal when $r = 1$. Figure 3 shows the performance of all methods as a function of $r$. The top row corresponds to the SBM ($\gamma = 0$), the bottom row to the DCSBM ($\gamma = 0.9$), and the within-community edge density parameters $w_i = 1$ for all $1 \leq i \leq K$. Here we see that VLH is less sensitive to $r$ than the spectral methods, but unfortunately it is not available under the DCSBM. Cross-validation is still dominated by spectral methods except for very small values of $r$, where all methods perform poorly. The corrections still provide a slight improvement for Bethe Hessian based methods, although all spectral methods perform fairly similarly in this case.

In all plots, $w_i = 1$ for $1 \leq i \leq K$; the average degrees are $\lambda_n = 10$ (left), 15 (middle), and 20 (right).

5.2. Real world networks. Finally, we apply the proposed methods to several popular network datasets which come with the "ground truth" node labels and the corresponding number of communities. We note that the network structure itself can indicate a different number of communities than those given in the ground truth, since those are typically derived from one specific node attribute and there may be other communities or subcommunities corresponding to different attributes. However, these ground truth labels still provide a reasonable baseline against which to compare estimators. The college football network [15] represents 115 US college football teams and the games they played in 2000. The "ground truth" communities are the 12 conferences that the teams belong to. The political books network [32], compiled around 2004, consists of 105 books about US politics; an edge means the two books were "frequently purchased together" on Amazon. The $K = 3$ communities are "conservative", "liberal", or "neutral", labelled manually based on contents. The dolphin network [25] is a social network of 62 dolphins, with edges representing social interactions, and $K = 2$ communities are based on a split which happened after one dolphin left the group. Similarly, the karate club network [45] is a social network of 34 members of a karate club, with edges representing friendships, and $K = 2$ communities based on a split following a dispute. Finally, the political blogs network [2], collected around 2004, consists of blogs about US politics, with edges representing web links, and $K = 2$ communities are "conservative" and "liberal", based on manual labelling. For this dataset, as is commonly done in the literature, we only consider its largest connected component of 1222 nodes.

Table 2 shows the estimated number of communities in these networks. All spectral methods estimate the correct number of communities for dolphins and the karate club, and do a reasonable job for the college football and political books data. For political blogs, all methods but NCV and VLH estimate a much larger number of communities, suggesting the estimates correspond to smaller sub-communities with more uniform degree distributions that have been previously detected by other authors. We also found that the VLH method was highly dependent on the tuning parameter, and the estimates by NCVbm and NCVdc varied noticeably from run to run due to their use of random partitions.

Discussion
The numerical experiments suggest that the spectral methods provide extremely fast and reliable estimates of the number of communities $K$ for balanced networks, with the Bethe Hessian based method with the threshold choice $r_a$ and the correction described in (3.1) the best choice in most scenarios. With communities of significantly different sizes, they tend to underestimate $K$ by combining small communities together, which seems to be an intrinsic limitation of spectral methods. This suggests that their estimates can be used as a lower bound on $K$ and a starting point for a more elaborate and computationally demanding likelihood-based method like VLH, in the same way that spectral clustering can be used to initialize a more sophisticated community detection method. Having a small set of plausible values of $K$ to focus on can significantly reduce the computational cost and improve the accuracy of estimating the number of communities.
For semi-dense networks, we show in Theorems 4.2 and 4.3 that estimating the number of communities is possible below the exact recovery threshold.For example, under Determining the exact condition under which estimating the number of communities is possible is an interesting and challenging question and we leave it for future research.
Following [42], we will work with the following rescaled conjugation of the nonbacktracking matrix B defined in (2.1) (which has the same eigenvalues as B/ √ α where α = d − 1) The key result for proving Theorem 4.2 is Theorem A.1 below, which establishes a connection between spectra of H + E and H.The spectrum of H is closely related to the spectrum of the adjacency matrix, and is discussed in Section A.1.
To prove Theorem A.1, we only need a crude bound on $\|A - \mathbb{E}A\|$ that is known to hold for very general graph models, including SBM, DCSBM and inhomogeneous Erdős-Rényi models [22]. For clarity, we put this bound in Assumption A.1 below. We will replace it with a sharper bound in Theorem A.2 to prove Theorem 4.2.
Assumption A.1.With probability at least 1 − 1/n, the following inequality holds It is easy to see that Assumption A.1 implies E = O(1/ √ d) with high probability while [42] shows that H is diagonalizable as follows.
A.1.Spectrum of H. Denote by v 1 , ..., v n and λ 1 , λ 2 , ..., λ n eigenvectors and corresponding eigenvalues of A/ √ α ordered so that For each i, H has two eigenvalues µ 2i−1 and µ 2i that are solutions of equation µ 2 − λ i µ + 1 = 0, that is The corresponding left (unit) eigenvectors of H are and their inner product is The corresponding right eigenvectors of H are proportional to with inner product Note that x 2i−1 and x 2i are not unit vectors.Their squared norms are It is convenient to not normalize x 2i−1 and x 2i because H admits the decomposition Note that from the formulas above we have The space C 2n can be decomposed as a direct sum of orthogonal two-dimensional subspaces span{x 2i−1 , x 2i } = span{y 2i−1 , y 2i }, which are invariant under the action of H.Moreover, the orthogonal projection onto span{x 2i−1 , x 2i } is given by A.2. Spectrum of H + E. The main difficulty of analyzing the spectrum of H + E is that H and E are not symmetric so standard Weyl's inequalities do not apply even though E is small.Wang and Wood [42] use the Bauer-Fike theorem instead and show that for Erdos-Renyi random graphs, the perturbation of E is negligible if the average degree is at least of order n 5/6 .This strong assumption is likely an artifact of their proof because the Bauer-Fike bound is often not tight.In fact, by a direct and more careful analysis we show in the following theorem that the spectrum of H + E is close to the spectrum of H for much sparser graphs.
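As a worked illustration of the quadratic equation in Section A.1, the two roots attached to each $\lambda_i$ can be written out with the quadratic formula. Nothing below goes beyond that formula; in particular, which root is labelled $\mu_{2i-1}$ and which $\mu_{2i}$ is left unspecified here.

```latex
\[
\mu^2 - \lambda_i \mu + 1 = 0
\;\Longrightarrow\;
\mu = \frac{\lambda_i \pm \sqrt{\lambda_i^2 - 4}}{2},
\qquad
\mu_{2i-1}\,\mu_{2i} = 1,
\qquad
\mu_{2i-1} + \mu_{2i} = \lambda_i .
\]
% Case |lambda_i| <= 2: the roots are complex conjugates on the unit circle,
\[
\mu = \frac{\lambda_i \pm \mathrm{i}\sqrt{4 - \lambda_i^2}}{2},
\qquad
|\mu|^2 = \frac{\lambda_i^2 + (4 - \lambda_i^2)}{4} = 1 .
\]
% Case |lambda_i| > 2: both roots are real and reciprocal to each other,
% so exactly one of them lies outside the unit circle.
```

This is the mechanism behind the separation used throughout: eigenvalues of $A/\sqrt{\alpha}$ with magnitude at most 2 map to the unit circle (the bulk), while eigenvalues with magnitude larger than 2 produce a real eigenvalue of $H$ outside it.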

Theorem A.1 (Connection between spectra of non-backtracking and adjacency matrices).
There exists a constant $C > 0$ such that the following holds. Consider random graphs satisfying Assumptions 4.1 and A.1. Then with probability at least $1 - 1/n$, for each eigenvalue $\beta$ of $H + E$, there exists an eigenvalue $\mu$ of $H$ such that $|\beta - \mu| \leq C d^{-1/8}$.

For proving Theorem 4.2, we replace Assumption A.1 with the following sharper bound on $\|A - \mathbb{E}A\|$, which holds under stronger assumptions. This bound follows directly from [7] and [40]; see also [42].
Theorem A.2 (Concentration of adjacency matrix).There exists a constant C 1 , C 2 > 0 such that the following holds.Assume that Then with probability at least 1 − 1/n, we have We are now ready to prove Theorem 4.2.
Proof of Theorem 4.2.Let λ 1 (E A), ..., λ K (E A) be the nonzero eigenvalues of E A and Then by Weyl's inequality and Theorem A.2, with probability at least 1 − 1/n we have Similarly, for i ≥ K + 1 we have Theorem A.1 and the continuity of eigenvalues with respect to small perturbation then imply that for 1 ≤ i ≤ K, To show that the K largest eigenvalues in magnitude of B are real, we use the following deterministic inclusion bound for the spectrum of B; see [5,Theorem 3.7].Let d min ≥ 2 and d max be the minimal and maximal degrees of a graph.Then the spectrum of B satisfies In our setting, we bound d max using standard Bernstein's inequality: with probability at least 1 − 1/n, Since all complex eigenvalues of B are contained in a circle of radius at most √ d max − 1, the K largest eigenvalues of B in magnitude, which are outside the circle of radius (1 + ε) √ d, must be real.The proof is complete.
The rest of this section is devoted to proving Theorem A.1.Besides the facts listed in Section A.1, we need the following elementary lemmas, the proofs of which are postponed until the end of this section.
Lemma A.3.Let x, y, v be unit vectors with | x, y | ≤ 1 − ε for some ε ∈ [0, 1], v ∈ span{x, y} and a, b ∈ C be any complex numbers.Then We are now ready to prove Theorem A.1.
Proof of Theorem A.1.Denote by P i the orthogonal projection onto span{x 2i−1 , x 2i }.Let u be a unit eigenvector of H + E with corresponding eigenvalue β and u i = P i u/ P i u .Note first that u = This allows us to write Eu as follows: Note that the terms in above sum belong to orthogonal subspaces of C 2n .Therefore where T i denotes the first factor of the corresponding term in the sum.
Let ε ∈ (0, 1/4) be a small number to be chosen later.Consider first the eigenvalues λ i with magnitude not close to 2, namely those satisfying ||λ i | − 2| > ε.From (A.5) and 1 − 2ε.By multiplying x2i with a complex number of magnitude one if necessary, we may assume that u i , x2i ≥ 1 − 2ε for i ∈ I, and consequently We are now ready to show that β is close to an eigenvalue of H.By (A.18), (A.14), (A.17), the fact that β and µ 2i are bounded for i ∈ I, and triangle inequality we have Together with (A.14) this implies Finally, it follows from (A.11), (A.13) and (A.19) that if β is an eigenvalue of H + E then there exists an eigenvalue µ of H such that and therefore the proof is complete.
Proof of Lemma A.3.We prove the first inequality: To prove the second inequality, denote z = x−y and w = x+y.Then z, w are perpendicular and x = (z + w)/2, y = (w − z)/2.Therefore Note that the restriction of zz * + ww * on span{x, y} is a positive definite matrix with eigenvalues z 2 and w 2 because z and w are perpendicular.By the first inequality The proof is complete.
Proof of Lemma A.5.Since x2i−1 = x 2i−1 −1 x 2i−1 and y 2i form an orthonormal basis of W i , it is enough to bound H x2i−1 and Hy 2i .Note that the restriction H i of H on W i has the formula For the more involved calculation of Hy 2i we will repeatedly use identities which follow directly from the formulas of µ 2i−1 and µ 2i in (A.2).

Figure 1. The accuracy of estimating $K$ as a function of the average degree. All communities have equal sizes, and $w_i = 1$ for all $1 \leq i \leq K$.

Appendix B. Proof of Theorem 4.3

Proof of Theorem 4.3.
We first rewrite the Bethe Hessian as follows:H(r) = (r 2 − 1)I − r(A − E A) + D − r Ā =: Ĥ(r) − r E A.We show that eigenvalues of Ĥ(r) are non-negative and are of smaller order than non-zero eigenvalues of r E A. This in turn implies that K eigenvalues of H(r) are negative while the rest are positive.By Theorem A.1, with probability at least 1 − 1/n we haveA − E A ≤ 2 √ d + C log n. (B.1)To bound the node degrees, we use the standard Bernstein's inequality: with probability at least 1 − 1/n,D − E D ≤ C d log n, |r 2 − (1 + ε) 2 d| ≤ C d log n. (B.2)For square matrices X, Y we use X Y to signify that X − Y is positive semidefinite.Then by (B.1), (B.2) and Assumption 4.2, we haveĤ(r) (r 2 − 1) − r 2 √ d + C log n + (1 + ε) 2 d − C d log n I r − (2ε + ε 2 )d − C d log n I 0 (B.3) because ε = C log n/d.For a subspace U ⊆ R n , we denote by dim(U ) the dimension of U , and by U ⊥ the orthogonal complement of U .Also, let col(E A) be the column space of E A. Using the Courant min-max principle (see e.g.[8, Corollary III.1.2])and (B.3), we haveρ n−K (H(r)) = max dim(U )=n−K min x∈U, x =1 H(r)x, x ≥ min x∈col(E A) ⊥ , x =1 H(r)x, x ≥ 0.Therefore the n − K largest eigenvalues of H(r) are non-negative.It remains to show that the K smallest eigenvalues of H(r) are negative.From (B.1), (B.2), and a triangle inequality, we have Ĥ(r) ≤ 4d + C d log n. (B.4)On the other hand, from (B.2) and Assumption 4.2 we get λ K (r E A) ≥ (1 + ε) √ d 4 √ d + C log n ≥ 4d + C d log n. 
(B.5)

In comparison, exact community recovery under $G(n, \frac{a}{n}, \frac{b}{n})$ with known number of communities requires $(a - b)^2 > 2(a + b + 2\sqrt{ab}) \log n$ (see e.g. [1, Theorem 13]).

Assumption 4.3 guarantees a sharp bound on $\|A - \mathbb{E}A\|$, which is established by [7]. We use this bound in the proofs of Theorem 4.2 and Theorem 4.3 below. For the Erdős-Rényi model, Assumption 4.3 is equivalent to $d \leq n^{2/13}$. It is unclear if this condition can be removed from the result of [7] and consequently from Theorem 4.2 and Theorem 4.3.

Theorem 4.2 (Consistency of NB based method in the semi-dense regime). Consider random graphs that satisfy Assumptions 4.1, 4.2 and 4.3. Then with probability at least $1 - 1/n$, the non-backtracking matrix has exactly $K$ real eigenvalues with magnitude at least $(1 + \varepsilon)\sqrt{d}$.

According to Theorem 4.2, the $K$ informative eigenvalues of the non-backtracking matrix are separated from the bulk by a circle of radius $(1 + \varepsilon)\sqrt{d}$, where $\varepsilon$ is vanishing for $d \gg \log n$. Unlike in Theorem 4.1, $K$ is allowed to depend on $n$ in Theorem 4.2.

Table 2. Estimates of the number of communities in real-world networks.