Classification and estimation in the Stochastic Block Model based on the empirical degrees

The Stochastic Block Model (Holland et al., 1983) is a mixture model for heterogeneous network data. Unlike the usual statistical framework, new nodes give additional information about the previous ones in this model. Thereby the distribution of the degrees concentrates in points conditionally on the node class. We show under a mild assumption that classification, estimation and model selection can actually be achieved with no more than the empirical degree data. We provide an algorithm able to process very large networks and consistent estimators based on it. In particular, we prove a bound of the probability of misclassification of at least one node, including when the number of classes grows.


Introduction
Strong attention has recently been paid to network models in many domains such as social sciences, biology or computer science. Networks are used to represent pairwise interactions between entities. For example, sociologists are interested in observing friendships, calls and collaboration between people, companies or countries. Genomicists wonder which gene regulates which other. But the most famous examples are undoubtedly the Internet, where data traffic involves millions of routers or computers, and the World Wide Web, containing millions of pages connected by hyperlinks. Many other examples of real-world networks are treated empirically in Albert and Barabási (2002), and the book of Faust and Wasserman (1994) gives a general introduction to the mathematical modelling of networks, and especially to graph theory.
One of the main features expected from graph models is inhomogeneity. Some articles, e.g. Bollobás et al. (2007) or Van Der Hofstad (2009), address this question. In the Erdős-Rényi model introduced by Erdős and Rényi (1959) and Gilbert (1959), all nodes play the same role, while most real-world networks are definitely not homogeneous.
In this paper, we are interested in the Stochastic Blockmodel (SBM), introduced by Holland et al. (1983) and inspired by Holland and Leinhardt (1981) and Fienberg and Wasserman (1981). This model assumes discrete inhomogeneity in the underlying social structure of the observed population: n nodes are split into Q homogeneous classes, called blocks, or more generally clusters. The distribution of the edge between two nodes is then assumed to depend only on the blocks to which they belong. Thereby, within each class, all nodes have the same connection behavior: they are said to be structurally equivalent (Lorrain and White, 1971). When the class assignment is known, the social structure can be visualized through the meta-graph (Picard et al., 2009), which emphasizes the role of each class. However, the block structure is assumed to be unobserved, or latent. Thus the assignment Z and the model parameters must be estimated a posteriori from the observed graph X, which is a real challenge, especially in large networks.
Our main purpose in this paper is to present a consistent inference method under SBM which can, above all, process very large graphs. Snijders and Nowicki (1997) have proposed a maximum likelihood estimate based on the EM algorithm for very small graphs with Q = 2 blocks. They have also proposed a Bayesian approach based on Gibbs sampling for larger graphs (hundreds of nodes), which they have extended to arbitrary block numbers in Nowicki and Snijders (2001). However, the usual techniques enable the processing of only relatively small graphs, because they suffer severely from graph intricacy. In particular, the EM algorithm deals with the conditional distribution of the labels Z given the observations X, whose dependency graph is actually a clique in the case of SBM (see paragraph 5.1 in Daudin et al. (2008)). Inspired by Wainwright and Jordan (2008), Daudin et al. (2008) have developed approximate methods using variational techniques in the context of SBM. From a physical point of view, the variational paradigm amounts to a mean-field approximation, see Jaakkola (2000). Thus thousands of nodes can be processed with this variational EM algorithm. Lastly, Celisse et al. (2011) prove the variational method to be consistent precisely under SBM.
All previous methods treat both classification and parameter estimation directly and at the same time: the two are alternately updated at each step of EM-based algorithms. Yet those tasks are actually not symmetrical, and moreover the estimators are quite simple when Z is known. The classification, which has remained the main pitfall thus far, can be completed first, and the latent assignment Z then replaced with this classification by plug-in in order to estimate the parameters.
Searching for clusters in a graph is computationally difficult and has different meanings. Many algorithms, especially coming from physics and computer science, aim at detecting highly connected clusters, which are self-defined as optimizing some objective function, see Lancichinetti et al. (2009) and Girvan and Newman (2002). In contrast, the blocks under SBM have a model-based definition and do not necessarily have many inner connections (see examples in Daudin et al. (2008)). Therefore, most algorithms designed for community detection are generally not suitable in this context. Bickel and Chen (2009), Choi et al. (2010), Celisse et al. (2011) and Rohe et al. (2010) prove that it is asymptotically possible to uncover the latent structure Z of the graph. In this work, we additionally show under a mild assumption that it is possible to do so using only the degree data instead of the whole graph X. As a consequence, we can work with n variables instead of $n^2$, which makes classification computations much faster. The basic reason why so little information is needed, compared with other models with latent structure, is specific to SBM: the number of observed variables $(X_{ij})_{1 \le i,j \le n}$ grows faster than the number of latent variables Z, therefore even marginal distributions of X concentrate very fast. Our algorithm actually extends the procedure introduced by Snijders and Nowicki (1997) when Q = 2. Like Bickel and Chen (2009), we provide probabilistic bounds for the occurrence of at least one error. Moreover we take the random assignment into account, even when the number of classes Q increases and the average degree vanishes. Related results are given in Choi et al. (2010) and Rohe et al. (2010). Nevertheless their bounds concern the rate of misclassified nodes instead, and do not prevent the number of errors from growing to infinity as fast as the square root of n for instance. They also require the assignment Z to be fixed.
The paper is organized as follows. In Section 2, we present the model and fix some notation. Above all, a concentration property of the degree distribution is stated in paragraph 2.2, which will be very useful in proving the consistency of the method mentioned above. The classification algorithm (called LG) and the main results are presented in this section as well. In particular, Theorem 2.2 provides a bound on the error probability and Proposition 2.2.1 gives some convergence rates when the number of classes is allowed to grow. The consistency proof of the LG algorithm is provided in Section 3. Section 4 is devoted to deriving simple plug-in estimators of the parameters, whose consistency is also demonstrated. A simulation study in Section 5 illustrates the behavior of the LG algorithm, which is discussed afterwards. In Section 6, the model and the algorithm are studied in more detail. As an application, it is finally proved that the right number Q of blocks of the model can likewise be found asymptotically. That completes the method relying on degrees alone.

Model
We first recall the SBM. For all integers n ≥ 1, [n] denotes the set {1, . . ., n}. An undirected binary graph with n nodes is defined by the pair ([n], X), where X is a symmetric binary square matrix of size n, called the adjacency matrix of the graph. Let Q ≥ 1 be the number of blocks and let $\alpha = (\alpha_1, \dots, \alpha_Q)$ be the vector of the block proportions in the whole population.
• The labels $Z = (Z_i)_{i \in [n]}$ are independent, and $P(Z_i = q) = \alpha_q$ for all $q \in [Q]$.
• Conditionally on the labels Z, the variables $\{X_{ij}, i, j \in [n]\}$ are independent Bernoulli variables. Conditionally on $\{Z_i = q, Z_j = r\}$, the parameter of $X_{ij}$ is $\pi_{qr}$.
$\pi_{qr}$ is the connection probability between any q-labeled node and any r-labeled node. Denoting by $\pi = (\pi_{qr})_{q,r \in [Q]}$ the connection matrix, the parameters of the model are (α, π). This model will be denoted by G(n, π, α). Note that in the sequel n will often be dropped from the notation for the sake of simplicity. As is classical in mixture models, the block labeling is not identifiable: the content of the blocks remains unchanged when the labels are permuted. But equivalence classes are identifiable as soon as n ≥ 2Q, see Celisse et al. (2011).
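The two-stage definition of the model can be illustrated by a small simulation. The following is only a sketch under stated assumptions: the function name `sample_sbm`, the seed handling and the example parameters are hypothetical, labels are coded 0, ..., Q − 1, and the diagonal of X is left at zero (no self-loops), which the text does not state explicitly.

```python
import random

def sample_sbm(n, alpha, pi, seed=0):
    """Draw labels z and a symmetric 0/1 adjacency matrix x from an SBM.

    alpha: list of Q block proportions; pi: Q x Q symmetric connection matrix.
    """
    rng = random.Random(seed)
    Q = len(alpha)
    # Labels: independent draws with P(z_i = q) = alpha[q].
    z = rng.choices(range(Q), weights=alpha, k=n)
    # Edges: independent Bernoulli(pi[z_i][z_j]) for i < j, mirrored for symmetry.
    x = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < pi[z[i]][z[j]]:
                x[i][j] = x[j][i] = 1
    return z, x

# Hypothetical parameters, chosen only for illustration.
alpha = [0.5, 0.5]
pi = [[0.1, 0.3], [0.3, 0.8]]
z, x = sample_sbm(200, alpha, pi)
```

The double loop is quadratic in n, which is unavoidable when every pair of nodes carries its own Bernoulli variable; it is meant for small illustrative graphs only.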

Degree distribution
For all i ∈ [n], let $D_i^n = \sum_{j \neq i} X_{ij}$ be the degree of node i, that is, the number of neighbors of this node.
Conditionally on $\{Z_i = q\}$, $D_i^n$ is a sum of independent Bernoulli variables with parameter $\bar\pi_q = \sum_{r \in [Q]} \alpha_r \pi_{qr}$. The degree sequence $(D_i^n)_{i \in [n]}$ is therefore a sample of a mixture of binomial distributed random variables. These variables are correlated, so we are outside the validity range of the usual algorithms for mixtures such as EM. But any pair of nodes shares only one edge, and the degrees are consequently not heavily correlated, so using the EM algorithm would still make sense for practical purposes. Nevertheless we have chosen a faster one-step algorithm, unlike EM, which is iterative.

A concentration inequality for binomial random variables
The following inequality will be useful throughout the article. It accounts in particular for the fast concentration of the degree distribution, and is a straightforward consequence of Hoeffding's inequality for bounded variables.
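For reference, the standard two-sided Hoeffding bound for a binomial variable reads as follows. That this is the exact form of the inequality labelled (CCT) in the rest of the paper is an assumption; it is consistent with the exponential terms of the form $e^{-2nt^2}$ appearing in later proofs.

```latex
% Standard Hoeffding bound for X ~ B(n, p); assumed here to be the form meant by (CCT).
\[
  \mathbb{P}\left( \left| \frac{X}{n} - p \right| \ge t \right)
  \;\le\; 2\, e^{-2 n t^{2}}
  \qquad \text{for all } t > 0 .
\]
```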

Concentration property of the normalized degrees
Define the normalized degree $T_i^n$ of node i ∈ [n] as its degree $D_i^n$ rescaled to lie in [0, 1]. The normalized degrees cluster around their average conditionally on the node class when n is increasing, according to (CCT). Hence normalized degrees corresponding to q-labeled nodes gather around $\bar\pi_q$. Consequently, in the degree distribution, nodes from different classes split up into groups centered around the $\bar\pi_q$, provided that all conditional averages $(\bar\pi_q)_{q \in [Q]}$ are different. From now on, we will assume that they are:
(A) for all q ≠ r, $\bar\pi_q \neq \bar\pi_r$.
Note that, if it is known that two classes have the same conditional average, it is possible to resort to the concentration of another marginal distribution: the distribution of the number of common neighbors for each pair of nodes. Refer to Appendix B.
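A minimal sketch of these two quantities, the conditional averages and the normalized degrees, is given below. The function names and the toy graph are hypothetical, and dividing the degree by n − 1 (the number of possible neighbors) is an assumption about the normalization.

```python
def conditional_means(alpha, pi):
    """pi_bar[q] = sum_r alpha[r] * pi[q][r], the mean connection probability of class q."""
    return [sum(a * p for a, p in zip(alpha, row)) for row in pi]

def normalized_degrees(x):
    """T_i = D_i / (n - 1), where D_i counts the neighbors of node i (diagonal ignored).

    Dividing by n - 1 is an assumption; it maps every degree into [0, 1].
    """
    n = len(x)
    return [sum(x[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]

# Hypothetical parameters: the two classes have distinct conditional averages.
alpha = [0.5, 0.5]
pi = [[0.1, 0.3], [0.3, 0.8]]
pi_bar = conditional_means(alpha, pi)   # approximately [0.2, 0.55]

# Toy graph: a triangle on nodes 0, 1, 2 plus an isolated node 3.
x = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 0, 0]]
t = normalized_degrees(x)
```

In a graph drawn from the model, the entries of `t` belonging to class q would be expected to lie close to `pi_bar[q]` once n is large.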

Largest Gaps Algorithm
Because of the concentration, a larger gap is expected between normalized degrees of nodes from different classes than between those of nodes from the same class. The following algorithm relies on this remark. It consists in building Q blocks by finding the Q − 1 largest intervals formed by two consecutive normalized degrees. $(T_{(i)})_{i \in [n]}$ denotes the sequence of the normalized degrees sorted in increasing order.

Algorithm
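• Sort the sequence of the normalized degrees in increasing order: $T_{(1)} \le \dots \le T_{(n)}$.
• Calculate every gap between consecutive normalized degrees: $T_{(i+1)} - T_{(i)}$ for all $i \in [n-1]$.
• Select the Q − 1 largest gaps.
• Form the Q estimated classes from the groups of nodes delimited by these gaps.
This algorithm has all the qualities mentioned in the Introduction and makes good use of the concentration, which makes the consistency easy to prove.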
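The procedure described in this section (sort the normalized degrees, compute consecutive gaps, cut at the Q − 1 largest ones) can be sketched as follows. The function name `largest_gaps`, the 0-based labels and the tie-breaking rule (earliest gap first) are assumptions of this sketch, not part of the original description.

```python
def largest_gaps(t, Q):
    """Split nodes into Q groups at the Q - 1 largest gaps between sorted values of t.

    Returns a list of labels in {0, ..., Q - 1}, assigned in order of increasing t.
    Assumes 1 <= Q <= len(t).
    """
    n = len(t)
    # Node indices sorted by increasing normalized degree.
    order = sorted(range(n), key=lambda i: t[i])
    # gaps[k] is the distance between the k-th and (k+1)-th sorted values.
    gaps = [t[order[k + 1]] - t[order[k]] for k in range(n - 1)]
    # Sorted positions after which a new group starts (the Q - 1 largest gaps).
    cuts = set(sorted(range(n - 1), key=lambda k: gaps[k], reverse=True)[:Q - 1])
    labels = [0] * n
    current = 0
    for k, i in enumerate(order):
        labels[i] = current
        if k in cuts:
            current += 1
    return labels

# Hypothetical normalized degrees gathering around 0.2 and 0.55.
t = [0.21, 0.56, 0.19, 0.54, 0.20, 0.57]
labels = largest_gaps(t, 2)
```

Apart from computing the degrees, the cost is dominated by the two sorts, which is in line with the quasilinear running time claimed for the sorting step.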
Whereas variational EM algorithms run as many quadratic steps as needed to reach convergence and classical spectral clustering runs in cubic time, this algorithm is especially fast. Indeed the sorting runs in quasilinear time and, although the computation of the degrees is quadratic, this is a very basic operation which is very quickly performed. Note that Condon and Karp (2001) gave an algorithm running in linear time and consistent under SBM (called the planted partition model in that paper), but provided that the weights of the blocks are equal.
Nevertheless this algorithm seems to be relatively naive, because it takes every normalized degree into account and each one carries the same weight, even if it is isolated and not statistically representative. In the worst case, a single outlying point is enough to mislead the algorithm and make the classification wrong for a majority of nodes, especially at small graph sizes.

Main results
The true (respectively estimated) partition of [n] into classes is denoted by $\{C_q\}_{q \in [Q]}$ (resp. $\{\hat C_q\}_{q \in [Q]}$) and the cardinality of the true q-labeled class by $N_q^n$ (resp. of the estimated one by $\hat N_q^n$). We expect the estimated partition to be almost surely the true partition when n is large enough. Define $E_n$ as the event "The LG algorithm makes at least one mistake", that is, the event that the estimated partition differs from the true one.
Definition 2. The characteristic minimal gap (or separability) δ of the model is the smallest distance between two distinct conditional averages: $\delta = \min_{q \neq r} |\bar\pi_q - \bar\pi_r|$.
Finally, let $\alpha_0 = \min_{q \in [Q]} \alpha_q$ denote the smallest proportion of the model. The classification is harder for small values of $\alpha_0$.
Section 3 contains the proof of Theorem 2.2. The most important parameter is δ: the smaller it is, the harder the separation between the classes is, and so the larger n must be to retrieve the true partition.
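The two parameters that drive the difficulty of the classification can be computed directly from the model. This sketch assumes that δ is the smallest distance between two distinct conditional averages and that α_0 is the smallest block proportion; the function name and the numerical values are hypothetical.

```python
def separability(pi_bar):
    """delta: smallest distance between two distinct conditional averages (needs Q >= 2)."""
    s = sorted(pi_bar)
    # After sorting, the minimum over all pairs is attained by two neighbors.
    return min(b - a for a, b in zip(s, s[1:]))

# Hypothetical three-class model.
alpha = [0.2, 0.3, 0.5]
pi = [[0.1, 0.2, 0.3],
      [0.2, 0.4, 0.5],
      [0.3, 0.5, 0.9]]
pi_bar = [sum(a * p for a, p in zip(alpha, row)) for row in pi]

delta = separability(pi_bar)   # minimal gap between conditional averages
alpha_0 = min(alpha)           # smallest block proportion
```

A value of `delta` equal to zero means that two classes share the same conditional average, in which case degrees alone cannot separate them.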

Convergence rates
In order to derive orders of magnitude of n to achieve convergence in Theorem 2.2, we choose another asymptotic framework, in this paragraph only, where the parameters are functions of n. Consistency no longer means convergence under the distribution of G(n, α, π), but under $G(n, \alpha^n, \pi^n)$, with $\alpha^n = (\alpha_1^n, \dots, \alpha_{Q_n}^n)$ and $\pi^n = (\pi_{qr}^n)_{1 \le q,r \le Q_n}$. We assume that:
Proposition 2.2.1. The inference method with LG algorithm is still consistent under the following assumptions:
Proof. Assumption (a) implies that there exists $C > 2\sqrt{2}$ such that for n large enough:
Secondly (A) requires $(Q_n - 1)\delta_n \le 1$ as a necessary condition. Hence, applying the first inequality:
According to Assumption (c), there exists C > 1 such that for n large enough:
Large graphs are more and more sparse as n increases, which results in a decrease in the connectivity defined by $\bar\pi^n = E_{\alpha^n, \pi^n}(T_1^n)$.
Proposition 2.2.2. The LG algorithm is still consistent in the following cases: • while ln n n
We sketch the proof with the following inequality, which estimates the connectivity of the sparsest model:

Consistency proof of the LG algorithm

An ideal event for the algorithm
The LG algorithm delivers the true partition in particular when none of the classes is empty and the spreading of the normalized degrees is small compared with the minimal gap δ. $A_n$ denotes the event "No true class is empty", that is $A_n = \bigcap_{q \in [Q]} \{N_q^n \neq 0\}$.
Definition 3. We call maximal intraclass distance (or spreading) the random variable $d_n$ defined by: $d_n = \max_{q \in [Q]} \max_{i \in C_q} |T_i^n - \bar\pi_q|$. This is the maximal distance between the normalized degree of a node and its own conditional mean, over all nodes and all classes. It basically measures the within-class spreading of the normalized degrees.
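Proposition 3.0.3. Under Assumption (A), the following inclusion holds for all ε > 0: $A_n \cap \{d_n \le \frac{\delta}{4+\varepsilon}\} \subset \overline{E_n}$.
Proof. On the event $A_n \cap \{d_n \le \frac{\delta}{4+\varepsilon}\}$:
• If nodes i and j have label q, then: $|T_i - T_j| \le |T_i - \bar\pi_q| + |\bar\pi_q - T_j| \le 2d_n \le \frac{2\delta}{4+\varepsilon}$.
• Conversely, if they have different labels, respectively q and r, then: $|T_i - T_j| \ge |\bar\pi_q - \bar\pi_r| - 2d_n \ge \delta - \frac{2\delta}{4+\varepsilon} > \frac{2\delta}{4+\varepsilon}$.
As a conclusion of this alternative, i and j are in the same class if and only if $|T_i - T_j| \le \frac{2\delta}{4+\varepsilon}$. Notice moreover that there exist exactly Q − 1 intervals among the set $([T_{(i)}, T_{(i+1)}[)_{i \in [n-1]}$ of length strictly greater than $\frac{2\delta}{4+\varepsilon}$ on this event. Hence the Q − 1 largest intervals lie between groups of normalized degrees from different classes, whereas all others lie between degrees of the same class. In this case the algorithm returns the true partition.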
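The ideal event of this proposition can be checked numerically when the true labels and the conditional averages are known, as in a simulation. The sketch below is only illustrative: the function names, the 0-based labels and the sample values are hypothetical, and δ is taken to be the smallest distance between two conditional averages.

```python
def max_intraclass_distance(t, z, pi_bar):
    """d_n: largest distance between a normalized degree and the mean of its own class."""
    return max(abs(ti - pi_bar[zi]) for ti, zi in zip(t, z))

def on_ideal_event(t, z, pi_bar, eps=1.0):
    """Check that no class is empty and that d_n <= delta / (4 + eps)."""
    Q = len(pi_bar)
    no_empty_class = set(z) == set(range(Q))
    s = sorted(pi_bar)
    delta = min(b - a for a, b in zip(s, s[1:]))
    return no_empty_class and max_intraclass_distance(t, z, pi_bar) <= delta / (4 + eps)

t = [0.21, 0.56, 0.19, 0.54, 0.20, 0.57]   # hypothetical normalized degrees
z = [0, 1, 0, 1, 0, 1]                      # hypothetical true labels
pi_bar = [0.20, 0.55]                       # hypothetical conditional averages

d_n = max_intraclass_distance(t, z, pi_bar)
ideal = on_ideal_event(t, z, pi_bar)
```

On this event, two nodes share a class exactly when their normalized degrees differ by at most 2δ/(4 + ε), which is the property the Largest Gaps procedure exploits.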

Bound of the probability of large spreading
In this paragraph we shall show that the dispersion $d_n$ converges to 0 thanks to the subgaussian tail of the binomial distributions. This is a key result of this article, because all others require controlling the dispersion.
Proposition 3.0.4. For all t > 0:
Proof. It consists in conditioning on the class of each node, in order to apply the concentration inequality (CCT), and in a union bound. Since $D_i^n \sim B(n, \bar\pi_q)$ conditionally on $\{Z_i = q\}$, (CCT) gives the inequality (1):
Hence:
Remark. Furthermore $d_n$ converges almost surely to 0 because the upper bound is summable, by a usual consequence of the Borel-Cantelli lemma.

Bound of the error probability (proof of Theorem 2.2)
Thanks to the bound on the probability of large spreading, one can easily conclude that the ideal event $A_n \cap \{d_n \le \frac{\delta}{4+\varepsilon}\}$ is very likely for n large enough and for all ε > 0:
Proof. First we have $A_n \cap \{d_n \le \frac{\delta}{4+\varepsilon}\} \subset \overline{E_n}$ according to Proposition 3.0.3, hence:
On the one hand, Proposition 3.0.4 implies that:
On the other hand, $\overline{A_n}$ corresponds to "There exists an empty class". For all q ∈ [Q], $N_q \sim B(n, \alpha_q)$, hence:
Once both previous inequalities have been put together, we have an upper bound on $P(E_n)$ which depends on ε. The limit of the upper bound when ε tends to zero yields the bound of the Theorem.

Consistency of the plug-in estimators
If the true classes were known, the usual moment estimators would be enough to estimate (α, π). Indeed the empirical proportions estimate α and the connection frequencies estimate the connection probabilities. We first prove that if we knew the classes, we would obtain a consistent estimate. However, those variables are not observed but latent. That is why we plug the partition delivered by any consistent classification algorithm into these estimators. Notice that the argument does not depend on the choice of the consistent algorithm.
Notations. For all q, r in [Q], $C_{qr}$ denotes $C_q \times C_r$, and $N_{qr}$ its cardinality. If q ≠ r, $N_{qr} = N_q N_r$, and if q = r, $N_{qq} = \frac{N_q (N_q - 1)}{2}$. We define the following estimators: Recall that all of these variables are hidden thus far.
Proof. For all q ∈ [Q], $N_q$ is the sum of n independent Bernoulli random variables with parameter $\alpha_q$. Applying the concentration inequality directly, we get for all t > 0 and q ∈ [Q]:
Applying the concentration inequality (CCT) conditionally on $N_{qr}$ and then taking the expectation, we get for all t > 0:
Define:
Let $(r_n)$ be a non-negative sequence tending to infinity. We split the support of the expectation into two pieces, depending on the values of $N_{qr}$. On the one hand, the exponential term inside the expectation is bounded on the first piece of the support by a deterministic sequence. On the other hand, the probability of the second piece, $\{|N_{qr} - \alpha_{qr} n^2| > r_n\}$, is accurately controlled by using the concentration inequality derived from (CCT) in Appendix A.
In order to have a vanishing bound (B), we just have to choose $(r_n)$ such that: For example, $r_n = n^{7/4}$, hence: Then we conclude with a union bound: Finally we conclude for all parameters:

Estimation with hidden classes
We now assume that we have a partition of the nodes $\{\hat C_q\}_q$ returned by any classification algorithm. The estimators $\hat\alpha$ and $\hat\pi$ are defined by plug-in with the estimated partition $\{\hat C_q\}_q$ instead of the true one $\{C_q\}_q$. If the classification is right, then the estimators with hat and with tilde are equal.
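The plug-in step can be sketched as below, under the assumption that the estimators are the empirical class proportions and the empirical connection frequencies, with the pair counts $N_q N_r$ between classes and $N_q(N_q - 1)/2$ within a class. The function name, the 0-based labels and the toy graph are hypothetical.

```python
def plug_in_estimates(x, labels, Q):
    """Empirical proportions and connection frequencies for a given partition.

    x: symmetric 0/1 adjacency matrix; labels: class in {0, ..., Q - 1} of each node.
    """
    n = len(labels)
    counts = [labels.count(q) for q in range(Q)]
    alpha_hat = [c / n for c in counts]
    # edges[q][r]: number of edges between class q and class r (unordered pairs).
    edges = [[0] * Q for _ in range(Q)]
    for i in range(n):
        for j in range(i + 1, n):
            q, r = labels[i], labels[j]
            edges[q][r] += x[i][j]
            if q != r:
                edges[r][q] += x[i][j]
    pi_hat = [[float("nan")] * Q for _ in range(Q)]
    for q in range(Q):
        for r in range(Q):
            # Number of node pairs: N_q * N_r between classes, N_q (N_q - 1) / 2 within.
            pairs = counts[q] * counts[r] if q != r else counts[q] * (counts[q] - 1) // 2
            if pairs > 0:
                pi_hat[q][r] = edges[q][r] / pairs
    return alpha_hat, pi_hat

# Toy graph with edges 0-1, 0-2 and 1-3, and a hypothetical partition.
x = [[0, 1, 1, 0],
     [1, 0, 0, 1],
     [1, 0, 0, 0],
     [0, 1, 0, 0]]
alpha_hat, pi_hat = plug_in_estimates(x, [0, 0, 1, 1], 2)
```

Entries of `pi_hat` are left as NaN when a class is too small to contain any pair, since no frequency can be computed in that case.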
Proof. For all t > 0, let On the event $\overline{E_n}$, the equality $(\hat\alpha, \hat\pi) = (\tilde\alpha, \tilde\pi)$ holds, hence: The first term converges to 0 according to Theorem 4.1 and the second one as well, provided the algorithm is consistent (see Theorem 2.2).

Conclusions
The previous paragraphs did not depend on the algorithm chosen. Now putting together the results of the previous section and the results concerning the LG algorithm, we get: Note that the estimation procedure requires larger graphs to achieve consistency than does the classification procedure with the LG algorithm alone. This is basically due to the variability of the empirical proportions. Since the upper bound is summable, a usual consequence of the Borel-Cantelli lemma implies the strong consistency of these estimators.
Discussion. We now consider the asymptotic framework $G(n, \alpha^n, \pi^n)$, as we already did in paragraph 2.4. The previous bound is most useful when $\lim \alpha_0^n > 0$, and hence $\lim Q_n < +\infty$, because it then yields strong consistency for example. If we only want consistency, we can change the bound so that the convergence rates of $(\alpha_0^n)$ and $(Q_n)$ are better in our asymptotic framework.
Proposition 4.3.1. The inference method with LG algorithm is still consistent under Assumptions (a), (b) and (d), where
Proof. First of all, we consider the bound (B) in the proof of Theorem 4.1, and this time we take $r_n = \sqrt{4 n^3 \ln n}$, so that it yields the following bound:
Assumption (b) is sufficient to show the convergence of $2Qe^{-2nt^2}$ and 2Q 2 n 4 n. It remains to prove that Assumptions (b) and (d) are sufficient for the remaining term of the bound (B'). Assumption (d) implies that there exists $C > \sqrt{2}$ such that for n large enough:
It is easily deduced from this that the first term of the bound (B') converges to zero. Moreover, note that the convergence of this term implies the convergence of $Q_n (1 - \alpha_0^n)^n$ as well. Recall that Assumption (a) implies the convergence of $2e^{-\frac{1}{8} n \delta_n^2}$. As a conclusion, the consistency holds.

Simulation study
Our main purpose in this study is to figure out how the LG algorithm behaves in practice and, above all, to check whether the bounds of Theorem 2.2 are pessimistic or not. The empirical frequency of the graphs with no error would be of great interest, because that is the quantity the bound concerns. But this frequency actually has no smooth evolution: it suddenly shifts from 0 to almost 1. We shall therefore use two types of error rates, a global one and one for each class, so as to examine more accurately the results given by the algorithm.

Simulation design
The parameters used in the simulation are: The evolutions of the classification error rates and the estimators with respect to the number of nodes n are averaged over 1000 graphs drawn from G(n, α, π) and displayed from 1000 to 60000 nodes.
First of all, the global error rate $g_n$ is defined as the proportion of node pairs (i, j) that are either classified in distinct classes whereas their true labels are identical, or classified together whereas their true labels are different. That is, denoting by $\hat Z$ the label vector returned by the LG algorithm: Secondly, we also propose error rates for each class. Define $I_q$, resp. $M_q$, the rate of intruders (or false positive rate) in the class q predicted by the algorithm, resp. the rate of missing nodes of the true class q (or false negative rate): The algorithm gives labels to the nodes in order of increasing degree. Indeed the true labels are expected to be sorted this way, because $\bar\pi_1 < \bar\pi_2 < \bar\pi_3$. This partially solves the label switching problem which arises when trying to identify the true labels instead of the equivalence classes.

Results
The evolution is quite satisfactory because the error rate completely vanishes from n = 45000 nodes, which is even earlier than expected from the bound of Theorem 2.2. Indeed this bound predicted that the probability of at least one error would not be less than 0.05 earlier than n = 300000. The bound seems to be pessimistic, basically because of the union bound used in the proof of Proposition 3.0.4. After a dramatic decrease at the beginning, the evolution encounters a slight stagnation between n = 10000 and n = 20000 nodes. An interpretation of this transitional phase can be given with the error rates for each class. The first class is much better detected even at low graph sizes, unlike class 2 and class 3. Indeed it is sufficient that the maximal intraclass distance $d_n$ is less than $(\bar\pi_2 - \bar\pi_1)/4$ to detect this class, whereas the other two are not supposed to be separated before $d_n$ is less than $(\bar\pi_3 - \bar\pi_2)/4$, according to our previous study. That is the reason why the global error rate dramatically decreases until reaching n = 10000 nodes, and why it does not vanish before reaching n = 25000. Note that the bounds of Proposition 3.0.4 and Theorem 2.2 had not predicted this before reaching n = 50000 and n = 264000 respectively. In short, as long as the tails of the normalized degree distribution are overlapping, the classes are mixed and cannot be properly detected. The curves show in particular that many nodes of class 2 seem to be caught by class 3. Indeed there are many intruders from class 2 in class 3. The missing nodes of class 1 are likely caught by class 2. As a consequence, the proportions of classes 1 and 2 are underestimated in the transitional phase, whereas the proportion of class 3 is overestimated. The inversion of classes 2 and 3 is shown again on graphic 4.1, as on 3.1.
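The pairwise definition of the global error rate used in this simulation study can be sketched as follows. Normalizing by the total number of unordered pairs, n(n − 1)/2, is an assumption, as are the function name and the example labels.

```python
def global_error_rate(z_true, z_hat):
    """Proportion of node pairs on which the two labelings disagree.

    A pair (i, j) counts as an error when it is split by the algorithm although
    the true labels are equal, or grouped although the true labels differ.
    """
    n = len(z_true)
    bad = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_true = z_true[i] == z_true[j]
            same_hat = z_hat[i] == z_hat[j]
            bad += same_true != same_hat
    return bad / (n * (n - 1) / 2)

g_swapped = global_error_rate([0, 0, 1, 1], [1, 1, 0, 0])   # labels permuted only
g_one_off = global_error_rate([0, 0, 1, 1], [0, 1, 1, 1])   # node 1 misplaced
```

Because it compares pairs rather than labels, this rate is unchanged when the estimated labels are permuted, which sidesteps the label switching problem.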

Model selection
Up to this section, the number of classes was supposed to be known and was an input parameter of the LG algorithm. Our main purpose hereafter is to examine more accurately the sequence of the gaps sorted in increasing order, and then the sequence of the intervals between the means of the groups given by the LG algorithm, depending on the number of classes Q selected for the model. As an application of this study, we finally show that degrees are likewise sufficient to asymptotically select the right number of classes.

Study of the gap sequence
We will use the same notations as in the last section. Moreover $Q_0$ denotes the true number of classes, and Q the current input parameter of the LG algorithm. We will often use the event $B_n = A_n \cap \{d_n \le \frac{\delta}{5}\}$, where no class is empty and the dispersion $d_n$ is so small that the $Q_0 - 1$ largest intervals separate the true classes (see Proposition 3.0.3 with ε = 1). On this event, two normalized degrees belong to the same class if and only if their distance is at most $2d_n$.
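Let $(G_q^n)_{q \in [n-1]}$ be the sequence of the distances between consecutive normalized degrees $(T_{(i+1)} - T_{(i)})_{i \in [n-1]}$, sorted in decreasing order. Similarly, let $(\gamma_q)_{q \in [Q_0 - 1]}$ be the sequence of the distances between consecutive conditional averages $(\bar\pi_{(q+1)} - \bar\pi_{(q)})_{q \in [Q_0 - 1]}$, sorted in decreasing order. This is called the sequence of the theoretical gaps. The following theorem states that the largest empirical gaps converge to the corresponding theoretical gaps, which reinforces our intuition about the model.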
Refer to Appendix C for the proof. One can easily see that the only gap (among the $Q_0 - 1$ largest) lying between $\bar\pi_{(q)}$ and $\bar\pi_{(q+1)}$ converges to $\bar\pi_{(q+1)} - \bar\pi_{(q)}$. However, the index of this interval is random and depends on n. This interesting but technical problem is solved in the second part of the proof. For the moment we provide a weaker version of this theorem, the proof of which is much simpler. Its conclusion is sufficient for our purposes.
Theorem 6.2. For all $q < Q_0$, $\liminf_{n \to +\infty} G_q > 0$.
Proof. If $q < Q_0$: on the event $B_n$, the $Q_0 - 1$ largest intervals necessarily lie between normalized degrees from different classes. There exists $i \in C_r$ and $j \in C_s$, where As the upper bound is summable, according to the Borel-Cantelli lemma, Therefore $\liminf_{n \to +\infty} G_q \ge \frac{3}{5}\delta > 0$ almost surely.
All further gaps lie between degrees of nodes of the same class and then converge to zero. The next theorem gives an estimate of the convergence rate.
Theorem 6.3. For all β ∈ ]0, 1[, the triangular array converges uniformly w.r.t. q and a.s. to zero when n tends to infinity.
Proof. First of all, recall that for all n, Therefore we can just prove that n , and the uniform convergence will follow.
On the event $B_n$, the $Q_0 - 1$ largest intervals lie between normalized degrees from different classes. The next intervals lie between degrees from the same class, and the distance to their corresponding conditional mean is at most $d_n$.

Study of the intervals between estimated classes
By distances between estimated classes, we mean distances between the empirical averages of the normalized degrees of each class provided by the LG algorithm. Define $m_q$ to be the average of the normalized degrees of the q-labeled class estimated by the algorithm: The sequence of the gaps between consecutive averages $(m_{(q+1)} - m_{(q)})_{q \in [Q-1]}$ is sorted in order of decreasing length, just as the sequence of the gaps $(T_{(i+1)} - T_{(i)})_{i \in [n-1]}$ is in the previous paragraph. This new sequence is denoted by $(H_q)_{q \in [Q-1]}$. Of course it depends on the current Q, whereas $(G_q)_q$ does not. When $Q = Q_0$, $H_q$ and $G_q$ are very close for all $q \le Q_0 - 1$. On the contrary, when $Q < Q_0$, some of the $(H_q)_q$ stretch over several classes and include more than one of the $G_q$. As a result, there is at least one q such that $H_q$ asymptotically differs from $G_q$.
Theorem 6.4.
Proof. Let $(J_q)_{q \in [Q_0 - 1]}$ be the $Q_0 - 1$ largest intervals between consecutive normalized degrees, hence for all q, $|J_q| = G_q$. Define also
The union of $J_0, J_1, \dots, J_{Q-1}, J_Q$ partially covers the interval [0, 1[. These intervals are separated and the distance between the bounds of consecutive intervals is at most $2d_n$. As a result:
Subtracting the right-hand side (which actually equals 1), we deduce from both previous inequalities that:
The first assertion follows directly from this inequality; for all t > 0:
Case $Q < Q_0$. Subtracting the right-hand side from the second inequality only yields this time:
But as shown in Theorem 6.2, the lower limit of $G_q$ is positive for all $q \le Q_0 - 1$. A fortiori, the second assertion of Theorem 6.4 holds as well.

Application to model selection
The summed differences $\sum_{q=1}^{Q-1} (H_q - G_q)$ examined in the last paragraph have an interesting property regarding model selection: when Q is the right number of classes, the sum converges to zero, and when Q is too small, it converges to a positive value, because one of the $H_q$ does not match $G_q$. Thus this quantity measures the risk of underestimating the number of classes.
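The summed differences can be computed from the normalized degrees alone. The following is a hedged sketch: the function name `gap_statistic` and the sample values are hypothetical, and ties between gaps are broken in favor of the earliest one, which the text does not specify.

```python
def gap_statistic(t, Q):
    """Sum over q < Q of (H_q - G_q) for the Largest Gaps partition into Q groups.

    G: the Q - 1 largest gaps between consecutive sorted values of t.
    H: the gaps between consecutive group averages, both sorted decreasingly.
    """
    s = sorted(t)
    gaps = [b - a for a, b in zip(s, s[1:])]
    # Sorted positions of the Q - 1 largest gaps, in increasing position order.
    cut_pos = sorted(sorted(range(len(gaps)), key=lambda k: gaps[k], reverse=True)[:Q - 1])
    G = sorted((gaps[k] for k in cut_pos), reverse=True)
    # Group boundaries in the sorted sequence, then the average of each group.
    bounds = [0] + [k + 1 for k in cut_pos] + [len(s)]
    means = [sum(s[a:b]) / (b - a) for a, b in zip(bounds, bounds[1:])]
    H = sorted((m2 - m1 for m1, m2 in zip(means, means[1:])), reverse=True)
    return sum(h - g for h, g in zip(H, G))

# Hypothetical normalized degrees forming three well separated groups.
t = [0.19, 0.20, 0.21, 0.54, 0.56, 0.57, 0.95, 0.96]
stat_right = gap_statistic(t, 3)      # Q equal to the number of visible groups
stat_too_small = gap_statistic(t, 2)  # Q too small: one H_q spans two groups
```

On this toy sequence the statistic is markedly larger for Q = 2 than for Q = 3, in line with the behavior described above; how to turn it into a selection rule with a threshold is not covered by this sketch.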
provide good initialization values for other algorithms which depend severely on them.
Above all, the LG algorithm performs every task using the degree data alone.As a conclusion, the degree data asymptotically includes the information needed to achieve all of the statistical inference in this model.

A Concentration inequality for products of binomial distributed variables
Proposition A.0.1. Let X (respectively Y) be a sum of n independent Bernoulli distributed variables with parameter p (respectively q). Then for all t > 0:
Proof.
The last line is obtained by applying the usual concentration inequality (CCT) to both X and Y .
With a similar proof, we prove that for all t ∈]0, 1/4]:

B Separation of mixed classes
Suppose that there are Q classes and $\bar\pi_q = \bar\pi_r$ for some q and r. For the sake of simplicity, all other conditional averages are assumed to be pairwise distinct.
The LG algorithm is supposed to have been previously applied to the graph with the input parameter Q − 1. Let us point out that the Q − 1 groups returned by LG are asymptotically the true classes, except classes q and r, which are mixed together in one group of nodes, denoted by M. We shall briefly explain a procedure to separate this group, using the concentration of some additional binomial variables, namely the number of common neighbors of each pair of nodes.
Notation. Define α the diagonal matrix the diagonal coefficients of which are $(\alpha_q)_{q \in [Q]}$ and the bilinear map on $R^Q$: $\langle x, y \rangle_\alpha = \sum_{l \in [Q]} \alpha_l x_l y_l$, which is a scalar product as soon as $\alpha_q$ is positive for all q. $\|\cdot\|_\alpha$ denotes the associated norm.
For all pairs of nodes (i, j) ∈ M × M, define $D_{ij} = \sum_{k \notin \{i, j\}} Y_{ijk}$, where $Y_{ijk} = X_{ik} X_{jk}$. $Y_{ijk}$ is a Bernoulli distributed variable that equals one if and only if i and j are both connected to k. Its parameter depends on the classes of nodes i and j:
• If i and j both belong to the q-labeled class, the parameter is $\|\pi_q\|_\alpha^2$, where $\pi_q$ is the row vector $(\pi_{ql})_l$. Symmetrically, if they both belong to the r-labeled class, the parameter is $\|\pi_r\|_\alpha^2$.
• Otherwise, if they belong to the distinct classes q ≠ r, the parameter is $\langle \pi_q, \pi_r \rangle_\alpha$.
The behavior of the new variables $D_{ij}$ looks like that of the degrees; they once more quickly concentrate around their average value as a consequence of the concentration of binomial variables. There are three groups of node pairs, concentrating around $\|\pi_q\|_\alpha^2$, $\|\pi_r\|_\alpha^2$, or $\langle \pi_q, \pi_r \rangle_\alpha$. Suppose that $\|\pi_q\|_\alpha \le \|\pi_r\|_\alpha$. Applying the Cauchy-Schwarz inequality, $\langle \pi_q, \pi_r \rangle_\alpha \le \|\pi_q\|_\alpha \|\pi_r\|_\alpha \le \|\pi_r\|_\alpha^2$. The case of equality in the Cauchy-Schwarz inequality cannot arise; if it did, then $\pi_q$ and $\pi_r$ would be collinear vectors. Denoting by c the constant of collinearity, we would get $\pi_q = c\pi_r$. But $\bar\pi_q$ and $\bar\pi_r$ are assumed to be equal in this section; hence c = 1, and $\pi_q$ and $\pi_r$ would be equal. This is not allowed by the model for identifiability reasons. The inequality is therefore strict, which in particular implies: $\langle \pi_q, \pi_r \rangle_\alpha < \|\pi_r\|_\alpha^2$. The furthest group to the right on the real line consequently contains only pairs of nodes of the same membership, which is sufficient to solve the mixing problem. We just have to extract this group from the other two by using the LG algorithm with Q = 2 as input parameter. Define W as the set of the pairs which are in this group, and F as the set of nodes which are involved in those pairs. Let K be the graph defined by (F, W). There are three cases:
• If $\langle \pi_q, \pi_r \rangle_\alpha \le \|\pi_q\|_\alpha^2 \le \|\pi_r\|_\alpha^2$ and $\|\pi_q\|_\alpha^2 - \langle \pi_q, \pi_r \rangle_\alpha < \|\pi_r\|_\alpha^2 - \|\pi_q\|_\alpha^2$, K asymptotically forms one clique composed of all nodes from the r-labeled class. Hence we deduce that the remaining nodes are from the q-labeled class.
• If $\langle \pi_q, \pi_r \rangle_\alpha < \|\pi_q\|_\alpha^2 \le \|\pi_r\|_\alpha^2$ and $\|\pi_q\|_\alpha^2 - \langle \pi_q, \pi_r \rangle_\alpha > \|\pi_r\|_\alpha^2 - \|\pi_q\|_\alpha^2$, then the graph K has asymptotically two cliques: one formed by the nodes of class q and the other one by the nodes of class r. If equality holds in the second inequality, there is either one clique as in the first case or two, depending on the selected gap.
• If $\|\pi_q\|_\alpha^2 < \langle \pi_q, \pi_r \rangle_\alpha < \|\pi_r\|_\alpha^2$, the gap between $\|\pi_q\|_\alpha^2$ and $\langle \pi_q, \pi_r \rangle_\alpha$ is necessarily strictly shorter than the one between $\langle \pi_q, \pi_r \rangle_\alpha$ and $\|\pi_r\|_\alpha^2$. Indeed this amounts to saying that $\|\pi_q - \pi_r\|_\alpha^2 > 0$. Thus K asymptotically forms one clique again.
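The two ingredients of this appendix, the number of common neighbors of a pair of nodes and the α-weighted scalar product giving its conditional mean, can be sketched as follows. The function names, the toy graph and the numerical rows are hypothetical, and excluding i and j themselves from the count is an assumption.

```python
def common_neighbors(x, i, j):
    """Number of nodes k, other than i and j, connected to both i and j."""
    n = len(x)
    return sum(x[i][k] * x[j][k] for k in range(n) if k != i and k != j)

def inner_alpha(u, v, alpha):
    """Weighted scalar product sum_l alpha[l] * u[l] * v[l]."""
    return sum(a * ul * vl for a, ul, vl in zip(alpha, u, v))

# Toy graph: a triangle on nodes 0, 1, 2, plus node 3 attached to node 0.
x = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
c12 = common_neighbors(x, 1, 2)   # nodes 1 and 2 share neighbor 0
c03 = common_neighbors(x, 0, 3)   # nodes 0 and 3 share no neighbor

# Hypothetical rows of the connection matrix for two classes q and r.
alpha = [0.5, 0.5]
pi_q, pi_r = [0.1, 0.3], [0.3, 0.8]
centers = (inner_alpha(pi_q, pi_q, alpha),   # pairs inside class q
           inner_alpha(pi_q, pi_r, alpha),   # pairs across the two classes
           inner_alpha(pi_r, pi_r, alpha))   # pairs inside class r
```

These example rows only illustrate the computation of the three concentration points; they are not chosen to satisfy the equal conditional averages assumed in this appendix.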
C Proof of Theorem 6.1
Let us define $(J_i)_{i \in [n]}$ the sequence of the intervals $[T_{(i)}, T_{(i+1)}[$ sorted in order of decreasing length, hence for all i ∈ [n], $|J_i| = G_i$. We suppose hereafter that the sequence $(\bar\pi_q)_q$ is sorted in increasing order:
Proof. On the event $B_n$, among the $Q_0 - 1$ largest intervals, we can associate with each $\bar\pi_q$ the only one lying between $\bar\pi_q$ and $\bar\pi_{q+1}$, namely the only $J_i$ with $i \in [Q_0 - 1]$ such that $J_i \cap ]\bar\pi_q, \bar\pi_{q+1}[ \neq \emptyset$. S(q) denotes the index in $[Q_0 - 1]$ corresponding to this unique interval.
Moreover, s(q) denotes one of the indexes $s \in [Q_0 - 1]$ such that $\gamma_s = \bar\pi_{q+1} - \bar\pi_q$, chosen so that s is injective. Let us point out that S is a random permutation whereas s is deterministic. In order to simplify notations, we silently make the deterministic index change r = s(q). Thereby $(\gamma_q)_q$ still denotes the sequence $(\gamma_{s(q)})_q$, and S the permutation $S \circ s^{-1}$.
If $(u_i)_{1 \le i \le m}$ is a sequence, we write $i \sim_u j$ if and only if $u_i = u_j$. $\sim_u$ is an equivalence relation. Applying Lemma C.1, stated and proved afterwards, if $d_n \le \eta$, there exists $r \sim_\gamma q$ such that q = S(r). Notice furthermore that the sequence $(\gamma_q)_{q \in [Q_0 - 1]}$ is constant on the $\sim_\gamma$-equivalence classes. The term $|G_q - \gamma_q|$ is necessarily in the sum $\sum_{r \sim_\gamma q} |G_{S(r)} - \gamma_r|$. Finally, define ) + $2e^{-2n\eta^2}$ according to (3).
Lemma C.1. Let $(u_i)_{1 \le i \le m}$, $(v_i)_{1 \le i \le m}$ be two real decreasing sequences. Let p be the number of $\sim_u$-equivalence classes and σ a permutation of {1, . . ., m}.
Proof. Since u is decreasing, the $\sim_u$-equivalence classes are just sets of consecutive integers. Define recursively $(r_i)_{1 \le i \le p}$, the increasing sequence of the indexes j at which the value of $u_j$ changes:
• Let $r_1 = 1$.
• For i ≥ 1, let $r_{i+1}$ be the smallest integer $j > r_i$ such that
The construction of $(r_i)_i$ implies that for all $j < r_i$, all $r_i \le l < r_{i+1}$ and all $k \ge r_{i+1}$: $u_j < u_k < u_l$, and furthermore $v_{\sigma(j)} < v_{\sigma(k)} < v_{\sigma(l)}$ as well. As v decreases, $\sigma(\{r_i, \dots, r_{i+1} - 1\}) = \{r_i, \dots, r_{i+1} - 1\}$. The result follows directly from this.

Figure 2: Evolution of the average global error rate $g_n$ as a function of the graph size

Figure 3: Error rates $I_q^n$ and $M_q^n$