Consistency and Asymptotic Normality of Stochastic Block Models Estimators from Sampled Data

Statistical analysis of networks is an active research area, and the literature contains many papers on network models and their statistical analysis. However, very few papers deal with missing data in network analysis, although in practice networks are often observed with missing values. In this paper we focus on the Stochastic Block Model with valued edges and consider an MCAR setting by assuming that every dyad (pair of nodes) is sampled identically and independently of the others with probability $\rho>0$. We prove that the maximum likelihood estimator and its variational approximation are consistent and asymptotically normal in the presence of missing data as soon as the sampling probability $\rho$ satisfies $\rho\gg\log(n)/n$.


Introduction
For the last decade, statistical network analysis has been a very active research topic, and the statistical modeling of networks has found many applications, for example in social sciences and biology; see Aicher et al. [2014], Barbillon et al. [2015], Mariadassou et al. [2010], Wasserman and Faust [1994] and Zachary [1977].
Many random graph models have been widely studied, either from a theoretical or an empirical point of view. The first model studied was the Erdős-Rényi model [Erdős and Rényi, 1959], which assumes that each pair of nodes (dyad) is connected independently of the others with the same probability. This model assumes homogeneity of all nodes across the network. In order to alleviate this constraint, many families of models have been introduced. Most are endowed with a latent structure [reviewed in Matias and Robin, 2014] to capture heterogeneity across nodes. Among those, the Stochastic Block Model [in short SBM, see Frank and Harary, 1982, Holland et al., 1983] is one of the oldest and most studied, as it is highly flexible and can capture a large variety of structures (affiliation, hub, bipartite and many others). In order to estimate this model, Bayesian approaches were first proposed [Snijders and Nowicki, 1997, Nowicki and Snijders, 2001] but have been superseded by variational methods [Daudin et al., 2008, Latouche et al., 2012]. The former class of approaches is exact but lacks the computational efficiency and scalability that the latter offers.
Establishing theoretical guarantees for maximum likelihood estimators (in short MLE) and variational methods in the binary SBM is not an easy task and has been widely studied. In Celisse et al. [2012], consistency of MLE and variational estimates is proven, but asymptotic normality requires that the estimators converge at rate at least $n^{-1}$, which is not proven in the paper, although some results were available for some particular cases (affiliation for example). Ambroise and Matias [2012] tackle the specific case of the affiliation model with equal group proportions and prove the consistency and asymptotic normality of parameter estimates. Bickel et al. [2013] extend those results to arbitrary binary SBM graphs and improve on Celisse et al. [2012] by removing the condition on the convergence rate. Following along the path of Bickel et al. [2013], Brault et al. [2017] extended consistency and asymptotic normality of the estimators (MLE and variational) to weighted Latent Block Models where the weight distribution belongs to a regular one-dimensional exponential family. In particular, considering unbounded edge values invalidates several parts of the proofs for binary graphs and requires substantial adaptations and additional results, notably concentration inequalities for sums of unbounded, non-Gaussian random variables.
Some results are also available for the related semi-parametric problem of assignment reconstruction. Mariadassou and Matias [2015] show that the conditional distribution of the (latent) assignments converges to a degenerate distribution, and Rohe et al. [2010] prove that, when the data are generated according to an SBM, spectral methods are consistent. Choi et al. [2012] extend those results to settings where the density of the graph goes to 0 as $\Omega(\log^{\alpha}(n)/n)$ (for $\alpha$ large enough) and/or the number of groups goes to $+\infty$ as $\sqrt{n}$. Finally, Wang and Bickel [2017] and Hu et al. [2017] also show that model selection for the number of groups is consistent for dense graphs. They suggest using a penalized likelihood criterion with a penalty of the form $\frac{k(k+1)}{2}\log(n) + \lambda n \log(k)$, where $\lambda$ is a tuning parameter.
In this paper we consider a simple setting with a fixed number of groups and fixed density, but with weighted edges and missing values. In most network studies, there is a strong asymmetry between the presence of an edge and its absence: the lack of proof that an edge exists is taken as proof that the edge does not exist, and edges with uncertain status are considered as nonexistent in the graph. This is the strategy adopted in most sparse asymptotic settings, where the density of edges goes to 0 asymptotically [Bickel et al., 2013]. We adopt a different point of view, where edges with uncertain status are considered as missing rather than absent, and their missing nature is explicitly accounted for. We use the framework of Rubin [1976] and its application to network data, see Kolaczyk [2009] and Handcock and Gile [2010], for parameter inference in the presence of missing values, and more specifically its application to the SBM [Tabouy et al., 2019]. We prove that, in the MCAR setting where each dyad is missing independently and with the same probability, the MLE and variational estimates are still consistent and asymptotically normal.
imsart-generic ver. 2014/10/16 file: SBM-MCAR.tex date: April 3, 2019
The article is organized as follows. We first present the model and missing data theory applied to our context, with some examples of sampling designs. We then posit some definitions and discuss the assumptions required for our results in Section 2. In Section 3 we establish asymptotic normality for the complete-observed model (i.e. the observed SBM where latent variables are known). Section 4 contains the main result of this paper and states that the observed likelihood behaves like the complete-observed likelihood (i.e. the joint likelihood of the observed data and latent variables) close to its maximum. The proof is sketched in Section 5. Consequences for the MLE and variational estimator, as well as a comparison to existing results, are discussed in Section 6. Technical lemmas and details of the proofs are available in the appendices.

Stochastic Block Model
In the SBM, nodes from a set $\mathcal{N} = \{1, \dots, n\}$ are distributed among a set $\mathcal{Q} = \{1, \dots, Q\}$ of hidden blocks that model the latent structure of the graph. The block memberships are encoded by $(z_i, i \in \mathcal{N})$, where the $z_i$ are independent random variables with prior probabilities $\alpha = (\alpha_1, \dots, \alpha_Q)$, such that $P(z_i = q) = \alpha_q$ for all $q \in \mathcal{Q}$. The value $y_{ij}$ of any dyad $(i, j)$ in $\mathcal{D} = \mathcal{N} \times \mathcal{N}$, with $i \neq j$, only depends on the blocks $i$ and $j$ belong to. The variables $(y_{ij})$ are thus independent conditionally on the $(z_i)$:
$y_{ij} \mid z_i = q, z_j = \ell \sim \phi(\cdot, \pi_{q\ell})$.
In the following, $y = (y_{ij})_{(i,j) \in \mathcal{D}}$ is the $n \times n$ adjacency matrix of the random graph and $z = (z_1, \dots, z_n)$ the $n$-vector of the latent blocks. With a slight abuse of notation, we associate to $z_i$ a binary vector $(z_{i1}, \dots, z_{iQ})$ such that $z_i = q \Leftrightarrow z_{iq} = 1, z_{i\ell} = 0$ for all $\ell \neq q$. In this case $z$ is an $n \times Q$ matrix.
We note the complete parameter set as $\theta = (\alpha, \pi) \in \Theta$, where $\Theta$ stands for the parameter space. When performing inference from data, we note $\theta^\star = (\alpha^\star, \pi^\star)$ the true parameter set, i.e. the parameter values used to generate the data, and $z^\star$ the true (and usually unobserved) memberships of nodes. For any $z$, we also note:
• $z_{+q} = \sum_i z_{iq}$ the size of block $q$ for membership $z$
• $z^\star_{+q}$ its counterpart for $z^\star$.
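As an illustration of the generative mechanism described above, the following minimal Python sketch draws block memberships and a valued adjacency matrix. It assumes Poisson-valued edges, which is only one possible emission law; the helper names `sample_poisson` and `simulate_sbm`, the parametrisation by block-wise Poisson means `lam`, and the numerical values are illustrative choices and not part of the model definition.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for small means)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_sbm(n, alpha, lam, seed=0):
    """Draw block labels z and a Poisson-valued adjacency matrix y.

    alpha: block proportions (length Q).
    lam[q][l]: Poisson mean of a dyad going from block q to block l (assumption).
    Diagonal entries stay at None because dyads require i != j.
    """
    rng = random.Random(seed)
    Q = len(alpha)
    z = rng.choices(range(Q), weights=alpha, k=n)
    y = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                y[i][j] = sample_poisson(rng, lam[z[i]][z[j]])
    return z, y


z, y = simulate_sbm(n=6, alpha=[0.5, 0.5], lam=[[3.0, 0.5], [0.5, 2.0]])
```

Conditionally on the labels, every off-diagonal entry is drawn independently, which is exactly the conditional independence property stated for the $(y_{ij})$.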

Missing data for SBM
Regarding SBM inference, a missing value corresponds to a missing entry in the adjacency matrix $y$, typically denoted by NA. We rely on the $n \times n$ sampling matrix $r$ to record the missing state of each entry: $r_{ij} = 1$ if $y_{ij}$ is observed and $r_{ij} = 0$ otherwise. As a shortcut, we use $y^o = \{y_{ij} : r_{ij} = 1\}$ and $y^m = \{y_{ij} : r_{ij} = 0\}$ to respectively denote the observed and missing dyads. The sampling design is the description of the stochastic process that generates $r$. It is assumed that the network exists before the sampling design acts upon it. The design is fully characterized by the conditional distribution $p_\psi(r|y)$, whose parameters $\psi$ are such that $\theta$ and $\psi$ live in a product space $\Theta \times \Psi$. In this paper we focus on a specific type of missingness, called missing completely at random (MCAR), for which $p_\psi(r|y) = p_\psi(r)$, and leave aside more complex forms of dependency such as missing at random (MAR) and not missing at random (NMAR).
We then follow the framework of Rubin [1976] and Tabouy et al. [2019] for missing data and define the joint probability density function as (2.2)
Property 2.1. According to Equation (2.2), if the sampling design is MCAR, then maximising $p_{\theta,\psi}(y^o, z, r)$ or $p_{\theta,\psi}(y^o, r)$ in $\theta$ is equivalent to maximising $p_\theta(y^o)$ in $\theta$. This corresponds to the notion of ignorability defined in Rubin [1976].

Sampling design examples
We present here some examples of sampling designs to illustrate differences between notions of MCAR, MAR and NMAR.
Definition 2.2 (Random dyad sampling). Each dyad $(i, j) \in \mathcal{D}$ has the same probability $P(r_{ij} = 1) = \rho$ of being observed, independently of the others. This design is MCAR.
Definition 2.3 (Random node sampling). Random node sampling consists in selecting a set of nodes, each independently with probability $\rho$, and then observing the corresponding rows and columns of the matrix $y$.
The major point in both examples is that the probability of observing a dyad ($\rho$ in random dyad sampling and $1 - (1 - \rho)^2$ in random node sampling) does not depend on its value. In contrast, the following dyad-centered sampling design, adapted to binary networks, is NMAR since the probability of observing a dyad depends on its value:
Definition 2.4 (Double standard sampling). Each dyad $(i, j) \in \mathcal{D}$ is observed, independently of other dyads, with a probability depending on its value: $P(r_{ij} = 1 | y_{ij} = 0) = \rho_0$ and $P(r_{ij} = 1 | y_{ij} = 1) = \rho_1$.
For non-binary networks, specifying the sampling design is more involved and requires defining the sampling density for every possible value of $y_{ij}$, e.g. $(P(r_{ij} = 1 | y_{ij} = k))_{k \in \mathbb{N}}$ for Poisson-valued edges.
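The three designs of Definitions 2.2 to 2.4 can be sketched in a few lines of Python. This is only an illustration: the function names, the 0/1 encoding of $r$ as nested lists, and the choice $n = 200$, $\rho = 0.3$ are assumptions made for the example. The empirical observed fractions should be close to $\rho$ for random dyad sampling and to $1 - (1 - \rho)^2$ for random node sampling.

```python
import random


def random_dyad_sampling(n, rho, rng):
    """MCAR: each dyad (i, j), i != j, is observed independently with probability rho."""
    return [[int(i != j and rng.random() < rho) for j in range(n)] for i in range(n)]


def random_node_sampling(n, rho, rng):
    """Select nodes with probability rho; observe the rows and columns of selected nodes."""
    picked = [rng.random() < rho for _ in range(n)]
    return [[int(i != j and (picked[i] or picked[j])) for j in range(n)] for i in range(n)]


def double_standard_sampling(y, rho0, rho1, rng):
    """NMAR for binary y: the observation probability depends on the dyad value."""
    n = len(y)
    return [[int(i != j and rng.random() < (rho1 if y[i][j] == 1 else rho0))
             for j in range(n)] for i in range(n)]


def observed_fraction(r):
    """Fraction of the n(n-1) dyads that are observed."""
    n = len(r)
    return sum(map(sum, r)) / (n * (n - 1))


rng = random.Random(1)
n, rho = 200, 0.3
frac_dyad = observed_fraction(random_dyad_sampling(n, rho, rng))
frac_node = observed_fraction(random_node_sampling(n, rho, rng))
```

In the node-centered design the entries of $r$ sharing a row or a column are dependent, whereas they are independent in the dyad-centered designs.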

Observed-likelihoods
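When the labels are known, the complete-observed log-likelihood is given by:
$\log p(y^o, z; \theta) = \sum_{i} \sum_{q} z_{iq} \log \alpha_q + \sum_{i \neq j} r_{ij} \sum_{q, \ell} z_{iq} z_{j\ell} \log \phi(y_{ij}, \pi_{q\ell})$. (2.3)
But the labels are usually unobserved, and the observed log-likelihood is obtained by integration over all memberships:
$\log p(y^o; \theta) = \log \sum_{z} p(y^o, z; \theta)$. (2.4)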
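The two log-likelihoods can be made concrete on a toy example. The sketch below assumes Poisson emissions written in canonical form, $\log \phi(y, \pi) = y\pi - e^{\pi} - \log(y!)$; the function names and the toy data are hypothetical. The observed log-likelihood is computed by brute-force enumeration of the $Q^n$ assignments, which is only feasible for very small $n$ and is meant to illustrate the sum over memberships, not to be used in practice.

```python
import itertools
import math


def log_phi(y, pi):
    """Poisson log-density in canonical form: y*pi - exp(pi) - log(y!)."""
    return y * pi - math.exp(pi) - math.lgamma(y + 1)


def complete_observed_loglik(y, r, z, alpha, pi):
    """log p(y_obs, z; theta): label term plus emission terms of observed dyads only."""
    n = len(z)
    ll = sum(math.log(alpha[q]) for q in z)
    for i in range(n):
        for j in range(n):
            if i != j and r[i][j] == 1:
                ll += log_phi(y[i][j], pi[z[i]][z[j]])
    return ll


def observed_loglik(y, r, alpha, pi):
    """log p(y_obs; theta) by enumeration of all assignments, with a log-sum-exp."""
    n, Q = len(y), len(alpha)
    terms = [complete_observed_loglik(y, r, z, alpha, pi)
             for z in itertools.product(range(Q), repeat=n)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


y = [[0, 2, 0, 1], [3, 0, 0, 0], [0, 1, 0, 4], [0, 0, 5, 0]]
r = [[0, 1, 0, 1], [1, 0, 1, 0], [1, 1, 0, 1], [0, 1, 1, 0]]
alpha = [0.6, 0.4]
pi = [[math.log(2.0), math.log(0.5)], [math.log(0.5), math.log(3.0)]]
ll_obs = observed_loglik(y, r, alpha, pi)
ll_comp = complete_observed_loglik(y, r, [0, 0, 1, 1], alpha, pi)
```

Because unobserved dyads simply drop out of the product, the same code handles any sampling matrix $r$; relabelling the blocks in both `alpha` and `pi` leaves `ll_obs` unchanged.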

Models and Assumptions
We focus here on parametric models where $\phi$ belongs to a regular one-dimensional exponential family in canonical form:
$\phi(y, \pi) = b(y) \exp(\pi y - \psi(\pi))$, (2.5)
where $\pi$ belongs to the space $A$, so that $\phi(\cdot, \pi)$ is well defined for all $\pi \in A$.
Classical properties of exponential families ensure that $\psi$ is convex and infinitely differentiable on $\mathring{A}$, and that $(\psi')^{-1}$ is well defined on $\psi'(\mathring{A})$. Furthermore, when $y_\pi \sim \phi(\cdot, \pi)$, $E[y_\pi] = \psi'(\pi)$ and $V[y_\pi] = \psi''(\pi)$.
In the following, we assume that missing data are produced according to a random dyad sampling with parameter $\rho > 0$.
The previous assumptions are standard. Assumption $A_1$ ensures that the group proportions and the sampling parameter are bounded away from 0 and 1, so that no group disappears when $n$ goes to infinity. It also ensures that $\pi$ is bounded away from the boundaries of $A$. This is essential for the subexponential properties of Propositions 2.8 and 2.9. $A_2$ and $A_3$ are necessary for identifiability purposes: the model is trivially not identifiable if the map $\pi \mapsto \phi(\cdot, \pi)$ is not injective. $A_4$ states the identifiability of SBM parameters under random dyad sampling. Note that, combined with $A_3$, it implies that all columns and all rows of $\pi^\star$ are distinct, and therefore that no two groups have identical connectivity profiles. In the following, we consider that $Q$, the number of classes (or groups), is known.

Identifiability
Since $r$ is independent of $y$, the identifiability of the SBM with emission law in a one-dimensional exponential family under random dyad sampling can be established in two steps: first for the sampling parameter $\rho$, and then for the SBM parameters $\theta^\star = (\alpha^\star, \pi^\star)$ given $\rho$.
Proposition 2.5. The sampling parameter $\rho > 0$ of random dyad sampling is identifiable w.r.t. the sampling distribution.
Proof. See Tabouy et al. [2019]. The proof does not depend on $y$ being binary and also holds for $y$ distributed as in Eq. (2.5).
Proposition 2.6. Let $n \geq 2Q$ and assume that for any $1 \leq q \leq Q$, $\rho > 0$, $\pi^\star_q > 0$ and that the coordinates of $\alpha^\star \psi'(\pi^\star)$, where $\psi'$ is applied component-wise, are pairwise distinct. Then, under random dyad sampling, SBM parameters are identifiable w.r.t. the distribution of the observed part of the SBM up to label switching.
Proof. The proof is nearly identical to the one written in Tabouy et al. [2019], itself inspired by Celisse et al. [2012], for the binary SBM under random dyad sampling. Substituting $E[y_{ij} | z_i = q]$ for $s_q$ in the proof ensures that $\alpha^\star$ is identifiable. Finally, the fact that $(\psi')^{-1}$ is a one-to-one map ensures that $\pi^\star$ is identifiable.
Note that asymptotically, the assumption n ≥ 2Q is always satisfied since Q is fixed and n grows to infinity.

Subexponential variables
Remark 2.7. Since we restricted $\pi$ to a bounded subset of $\mathring{A}$, the variance of $y_\pi$ is bounded away from 0 and $+\infty$. We note (2.6)
Similarly, since $\pi$ belongs to a bounded subset of an open interval, there exists a constant $\kappa > 0$ such that
Proposition 2.9. Consider $x = y_\pi r_{ij} + \lambda r_{ij}$ (we recall that $r_{ij} \sim \mathcal{B}(\rho)$), with $r_{ij}$ independent of $y_\pi$ and $\lambda \in \mathbb{R}$ bounded. There are non-negative numbers $\nu$ and $b$ such that $x$ is subexponential with parameters $(\nu^2, b^{-1})$.

Symmetry
We now introduce the concepts of assignments and parameter symmetries, that must be accounted for when studying the asymptotic properties of the MLE.
Complications stemming from symmetries are related to, but not equivalent to, the problem of label switching in mixture models.
Definition 2.10 (permutation). Let $s$ be a permutation on $\{1, \dots, Q\}$. If $A$ is a matrix with $Q$ columns and $n$ rows, we define $A^s$ as the matrix obtained by permuting the columns of $A$ according to $s$, i.e. for any row $i$ and column
Definition 2.11 (equivalence). We define the following equivalence relationships:
• Two assignments $z$ and $z'$ are equivalent, noted $\sim$, if they are equal up to label permutation, i.e. there exists a permutation $s$ such that $z' = z^s$.
• Two parameters $\theta$ and $\theta'$ are equivalent, noted $\sim$, if they are equal up to label permutation, i.e. there exists a permutation $s$ such that $(\alpha^s, \pi^s) = (\alpha', \pi')$.
• $(\theta, z)$ and $(\theta', z')$ are equivalent, noted $\sim$, if they are equal up to label permutation on $\pi$ and $z$, i.e. there exists a permutation $s$ such that $(\pi^s, z^s) = (\pi', z')$. This is label switching.
$\theta$ exhibits symmetry if it exhibits symmetry for at least one non-trivial permutation $s$.
Finally, the set of permutations for which $\theta$ exhibits symmetry is noted $\mathrm{Sym}(\theta)$.
Remark 2.13. The set of parameters that exhibit symmetry is a manifold of null Lebesgue measure in $\Theta$. The notion of symmetry allows us to deal with a notion of non-identifiability of the class labels that is subtler than and different from label switching. More precisely, label switching is when:
In particular, in label switching, $z$ and $z^s$ have the same likelihood but under equivalent yet different parameters $\theta^s$. In contrast, in the presence of symmetry, multiple assignments can have exactly the same likelihood under $\theta$.
The issue of symmetry forces us to use a notion of distance between assignments that is invariant to label permutation.
Definition 2.14 (distance). We define the following distance, up to equivalence, between configurations $z$ and $z^\star$:
where, for all matrix $z$, we use the Hamming norm $\|\cdot\|_0$ defined by
Definition 2.15 (Set of local assignments). We note $S(z^\star, r)$ the set of configurations that have a representative (for $\sim$) within relative radius $r$ of $z^\star$:
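A permutation-invariant distance of this kind can be sketched as follows. The sketch works on label vectors rather than on $n \times Q$ indicator matrices and counts the number of nodes whose labels disagree, which is an assumption about the normalisation of the Hamming norm; the function names are hypothetical. The minimum over all $Q!$ label permutations is computed by enumeration, which is fine for the small fixed $Q$ considered here.

```python
import itertools


def hamming(z, z_star):
    """Number of nodes whose labels differ between two label vectors."""
    return sum(a != b for a, b in zip(z, z_star))


def distance_up_to_equivalence(z, z_star, Q):
    """Minimum over label permutations s of the Hamming distance between s(z) and z_star."""
    return min(hamming([s[q] for q in z], z_star)
               for s in itertools.permutations(range(Q)))


def in_local_set(z, z_star, Q, radius):
    """True if some relabelling of z disagrees with z_star on at most radius * n nodes."""
    return distance_up_to_equivalence(z, z_star, Q) <= radius * len(z_star)


z_star = [0, 0, 1, 1, 2, 2]
z_swapped = [1, 1, 0, 0, 2, 2]  # same partition, labels 0 and 1 exchanged
z_near = [1, 1, 0, 2, 2, 2]     # one node misplaced after relabelling
```

Here `z_swapped` is at plain Hamming distance 4 from `z_star` but at distance 0 up to equivalence, which is the behaviour required of a distance that ignores label switching.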

Other definitions
We finally introduce a few useful notions that will be instrumental in the proofs. The first is that of "regular" assignments, for which each group has "enough" nodes:
Definition 2.16 (c-regular assignments). Let $z \in \mathcal{Z}$. For any $c > 0$, we say that $z$ is c-regular if
$\min_q z_{+q} \geq cn$. (2.7)
Class distinctness $\delta(\pi)$ captures the differences between groups: lower values of $\delta(\pi)$ mean that at least two classes are very similar. $\delta(\pi)$ is intrinsically linked to the convergence rate of several estimates.
Definition 2.17 (class distinctness). For $\theta = (\alpha, \pi) \in \Theta$, we define:
the Kullback divergence between $\phi(\cdot, \pi)$ and $\phi(\cdot, \pi')$, when $\phi$ comes from an exponential family.
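Both notions are easy to evaluate numerically. The sketch below checks c-regularity of a label vector and computes the Kullback divergence between two members of a canonical exponential family through the standard identity $KL(\pi, \pi') = \psi'(\pi)(\pi - \pi') - \psi(\pi) + \psi(\pi')$. The function names are hypothetical and the Poisson case ($\psi(\pi) = e^{\pi}$, $\pi = \log \lambda$) is used only as an example; the class distinctness $\delta(\pi)$ itself is not computed here.

```python
import math


def is_c_regular(z, Q, c):
    """Check min_q z_{+q} >= c * n for a label vector z with labels in 0..Q-1."""
    n = len(z)
    sizes = [sum(1 for label in z if label == q) for q in range(Q)]
    return min(sizes) >= c * n


def kl_exponential_family(pi, pi_prime, psi, psi_prime):
    """KL(phi(., pi) || phi(., pi')) = psi'(pi) (pi - pi') - psi(pi) + psi(pi')."""
    return psi_prime(pi) * (pi - pi_prime) - psi(pi) + psi(pi_prime)


# Poisson case: psi(pi) = exp(pi) with pi = log(mean), so psi' = psi.
lam, lam_prime = 2.0, 5.0
kl = kl_exponential_family(math.log(lam), math.log(lam_prime), math.exp, math.exp)
```

For the Poisson family the value agrees with the familiar expression $\lambda \log(\lambda / \lambda') - \lambda + \lambda'$.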

Complete-observed Model
In the following we study the asymptotic properties of the complete-observed data model, i.e. when the true assignment z ⋆ is known.
Proposition 3.1. Under random dyad sampling, define $N_i = \sum_{j} R_{ij}$ and $\Omega_{0,n} = \{\forall i \in \{1, \dots, n\}, N_i \geq 1\}$, the event that every node has at least one observed dyad. Then
Proof. This proposition is a direct consequence of the Borel-Cantelli lemma. Details are available in appendix A.
Remark 3.2. This result shows that, with high probability, the network has no unobserved node. In the remainder, we work conditionally on $\Omega_{0,n}$.
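A quick numerical check gives some intuition for why unobserved nodes are rare. Assuming that $N_i$ counts the observed dyads in row $i$, a given node has no observed dyad with probability $(1 - \rho)^{n-1}$, and a union bound over the $n$ nodes gives $n(1 - \rho)^{n-1}$. The sketch below evaluates this bound for sampling rates just above and just below the $\log(n)/n$ scale; the function name and the constants 3 and 0.5 are arbitrary illustrative choices.

```python
import math


def union_bound_unobserved(n, rho):
    """Upper bound n * (1 - rho)**(n - 1) on P(some node has no observed dyad in its row)."""
    return n * (1.0 - rho) ** (n - 1)


for n in (100, 1000, 10000):
    rho_high = 3.0 * math.log(n) / n   # above the log(n)/n scale: bound vanishes
    rho_low = 0.5 * math.log(n) / n    # below the log(n)/n scale: bound is vacuous
    print(n, union_bound_unobserved(n, rho_high), union_bound_unobserved(n, rho_low))
```

With the larger rate the bound decays roughly like $n^{-2}$, while with the smaller rate it exceeds 1 and carries no information.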
Let $\hat{\theta}_c = (\hat{\alpha}, \hat{\pi})$ be the MLE of $\theta$ in the complete-observed data model. Simple manipulations of Equation (2.3) yield:
Since there are missing values in the adjacency matrix, we need the following technical lemma to prove asymptotic normality of the $\hat{\pi}_{q\ell}$'s in the complete data model.
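When the labels are known, maximising the complete-observed log-likelihood of a canonical exponential family leads to empirical block proportions for $\alpha$ and to $(\psi')^{-1}$ applied to the block-wise mean of the observed dyads for $\pi$. The following sketch implements these standard closed forms under the assumption of Poisson emissions, for which $(\psi')^{-1} = \log$; the function name and the toy data are hypothetical.

```python
import math


def complete_observed_mle(y, r, z, Q, inv_psi_prime=math.log):
    """Closed-form estimates when labels z are known (Poisson: (psi')^{-1} = log).

    alpha_hat[q] = z_{+q} / n.
    pi_hat[q][l] = (psi')^{-1}(mean of the observed y_ij with z_i = q, z_j = l).
    An entry of pi_hat is None when the block pair has no observed dyad or a zero mean.
    """
    n = len(z)
    alpha_hat = [sum(1 for label in z if label == q) / n for q in range(Q)]
    total = [[0.0] * Q for _ in range(Q)]
    count = [[0] * Q for _ in range(Q)]
    for i in range(n):
        for j in range(n):
            if i != j and r[i][j] == 1:
                total[z[i]][z[j]] += y[i][j]
                count[z[i]][z[j]] += 1
    pi_hat = [[inv_psi_prime(total[q][l] / count[q][l])
               if count[q][l] > 0 and total[q][l] > 0 else None
               for l in range(Q)] for q in range(Q)]
    return alpha_hat, pi_hat


y = [[0, 2, 1, 0], [4, 0, 0, 1], [1, 0, 0, 6], [0, 2, 3, 0]]
r = [[0, 0, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]]  # dyad (0, 1) is missing
alpha_hat, pi_hat = complete_observed_mle(y, r, z=[0, 0, 1, 1], Q=2)
```

The denominators are the numbers of observed dyads per block pair, which are random under random dyad sampling; this is why a dedicated concentration result on these counts is needed.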
Proof. The proof of this lemma is based on Hoeffding's decomposition for U-statistics and on the proof of Hoeffding's concentration inequality. Details are postponed to appendix A.
Then the estimates $\hat{\pi}_{q\ell}(z^\star)$ are independent and asymptotically Gaussian with limit distribution:
Proof. The proof is postponed to appendix A. The first part is a direct application of the central limit theorem for i.i.d. variables and the second part relies on a variant of the central limit theorem for random sums of random variables.
Proposition 3.5 (Local asymptotic normality). Let $L^\star_{co}$ be the complete log-likelihood function defined on $\Theta$ by $L^\star_{co}(\alpha, \pi) = \log p(y^o, z^\star; \theta)$. For any $s$, $t$ and $u$ in a compact set, we have:
where $\odot$ denotes the Hadamard product of two matrices (element-wise product) and $\Sigma_{\alpha^\star}$ and $\Sigma_{\pi^\star}$ are defined in Proposition 3.4. $Y_{\alpha^\star}$ is asymptotically Gaussian with zero mean and variance matrix $\Sigma_{\alpha^\star}$. $Y_{\pi^\star}$ is a random matrix with independent entries that are asymptotically Gaussian with zero mean and variance
Proof. This result is based on a Taylor expansion of $L^\star_{co}$ in a neighborhood of $(\alpha^\star, \pi^\star)$. Details are available in appendix A.

Main Result
Our main result compares the observed likelihood ratio $p(y^o; \theta)/p(y^o; \theta^\star)$ with the complete likelihood ratio $p(y^o, z^\star; \theta')/p(y^o, z^\star; \theta^\star)$ to show that they have the same argmax. To ease the comparison, we work only on the high-probability set $\Omega_1$ of c/2-regular configurations, i.e. those that have $\Omega(n)$ nodes in each group, as defined in Section 2.
Proposition 4.1. Define $\mathcal{Z}_1$ as the subset of $\mathcal{Z}$ made of c/2-regular assignments, with $c$ defined in assumption $H_1$. Note $\Omega_1$ the event $\{z^\star \in \mathcal{Z}_1\}$, then:
Proof. This proposition is a consequence of Hoeffding's inequality. See appendix A for more details.
We can now state our main result:
Theorem 4.2 (complete-observed). Assume that $A_1$ to $A_4$ with random-dyad sampling hold for the Stochastic Block Model of known order with $n \times n$ observations coming from a univariate exponential family, and define $\#\mathrm{Sym}(\theta)$ as the cardinality of the set of permutations $s$ for which $\theta = (\alpha, \pi)$ exhibits symmetry. Then, for $n$ tending to infinity, the observed likelihood ratio behaves like the complete likelihood ratio, up to a bounded multiplicative factor:
where the $o_P$ is uniform over all $\theta \in \Theta$.
The maximum over all θ ′ that are equivalent to θ stems from the fact that because of label-switching, θ is only identifiable up to its ∼-equivalence class from the observed likelihood, whereas it is completely identifiable from the complete likelihood.The multiplicative factor arises from the fact that equivalent assignments have exactly the same complete likelihood and contribute equally to the observed likelihood.
Corollary 4.3. If $\Theta$ contains only parameters with no symmetry:
where the $o_P$ is uniform over all $\Theta$.

Proof Sketch
The proof of Theorem 4.2 relies on controlling deviations of the log-likelihood ratios from their expectations. We introduce a few notations for those quantities.
Remark 5.3. Note the absence of the random variable $r$ in $\bar{y}_{q\ell}(z)$.
Since LR(θ, z) ≤ Λ(z), the profile ratio is useful to remove the dependency on θ and reduce the study to a series of problems depending only on z.The following propositions show when those quantities reach their maximum values and what the corresponding values are.
Proofs of Propositions 5.2, 5.4, 5.6 and 5.5 are postponed to Appendix B.

High level view of the proof
The proof proceeds with an examination of the asymptotic behavior of LR on three types of configurations that partition $\mathcal{Z}$:
1. global control: for $z$ such that $\Lambda(z) = \Omega(-n^2)$, Proposition 5.7 proves a large deviation behavior and shows that $LR = -\Omega_P(n^2)$. In turn, those assignments contribute a $o_P$ of $p(y^o, z^\star; \theta^\star)$ to the sum (Proposition 5.8).
2. local control: a small deviation result (Proposition 5.9) is needed to show that the combined contribution of assignments close to but not equivalent to $z^\star$ is also a $o_P$ of $p(y^o, z^\star; \theta^\star)$ (Proposition 5.10).
3. equivalent assignments: Proposition 5.11 examines which of the remaining assignments, all equivalent to $z^\star$, contribute to the sum.
These results are presented in Section 5.3 and their proofs are postponed to Appendix B. They are then put together in Section 5.4 to prove our main result. The remainder of the section is devoted to the asymptotics of the ML and variational estimators as a consequence of the main result.

Local Control
Proposition 5.9 (small deviations LR). Conditionally upon $\Omega_1$,
The next proposition uses Propositions 5.9 and 5.6 to show that the combined contribution to the observed likelihood of assignments close to $z^\star$ is also a $o_P$ of $p(z^\star, y^o; \theta^\star)$:
Proposition 5.10 (contribution of local assignments). With the previous notations and $C$ the positive constant defined in Proposition 5.5:

Equivalent assignments
It remains to study the contribution of equivalent assignments.
Proposition 5.11 (contribution of equivalent assignments). For all $\theta \in \Theta$, we have
where the $o_P$ is uniform in $\theta$.

Proof of the main result
Proof. We work conditionally on $\Omega_1$. Choose $z^\star \in \mathcal{Z}_1$ and a sequence $t_n$ decreasing to 0 but satisfying $\rho n t_n / \log(n) \to +\infty$. According to Proposition 5.8,
Since $t_n$ decreases to 0, it gets smaller than $C$ (used in Proposition 5.10) for $n$ large enough. At this point, Proposition 5.10 ensures that:
And therefore the observed likelihood ratio reduces as:
And Proposition 5.11 allows us to conclude.

Variational and Maximum Likelihood Estimates
This section is devoted to the asymptotics of the ML and variational estimators in the incomplete data model as a consequence of the main result, Theorem 4.2. Note that, with high probability, ML and variational estimators have no symmetry since the set $\{\theta : \#\mathrm{Sym}(\theta) > 1\}$ is a manifold of null Lebesgue measure in $\Theta$.

ML estimator
The asymptotic behavior of the maximum likelihood estimator in the incomplete data model is a direct consequence of Theorem 4.2 and Proposition 3.5.
Corollary 6.1 (Asymptotic behavior of $\hat{\theta}_{MLE}$). Denote by $\hat{\theta}_{MLE}$ the maximum likelihood estimator and use the notations of Proposition 3.4. There exist permutations $s$ of $\{1, \dots, Q\}$ such that
Hence, the maximum likelihood estimator for the SBM under random dyad sampling is consistent and asymptotically normal, with the same behavior as the maximum likelihood estimator in the complete data model. The proof is postponed to appendix B.10.

Variational estimator
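Due to the complex dependency structure of the observations, the maximum likelihood estimator of the SBM is not numerically tractable, even with the Expectation Maximisation algorithm. In practice, a variational approximation is often used [see Daudin et al., 2008]: for any joint distribution $Q \in \mathcal{Q}$ on $\mathcal{Z}$, a lower bound of $L(\theta)$ is given by
$J(Q, \theta) = L(\theta) - KL(Q, p(\cdot | y^o; \theta)) = E_Q[\log p(y^o, z; \theta)] + H(Q)$,
where $H(Q) = -E_Q[\log Q(z)]$ is the entropy of $Q$. Choosing $\mathcal{Q}$ to be the set of product distributions, such that for all $z$
$Q(z) = \prod_i Q(z_i) = \prod_i \prod_q Q(z_i = q)^{z_{iq}}$,
allows us to obtain tractable expressions of $J(Q, \theta)$. The variational estimate $\hat{\theta}_{var}$ of $\theta$ is defined as
$\hat{\theta}_{var} \in \arg\max_{\theta \in \Theta} \max_{Q \in \mathcal{Q}} J(Q, \theta)$.
The following corollary states that $\hat{\theta}_{var}$ has the same asymptotic properties as $\hat{\theta}_{MLE}$ and $\hat{\theta}_{MC}$; in particular, it is consistent and asymptotically normal. The proof is very similar to the proof of Corollary 6.1 and is postponed to appendix B.10.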
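The lower-bound property of the variational criterion can be checked numerically on a toy example. The sketch below assumes Poisson emissions in canonical form and a product distribution $Q$ parametrised by a matrix `tau` with `tau[i][q]` standing for $Q(z_i = q)$; these names, the function names and the toy data are hypothetical. The exact observed log-likelihood is obtained by enumeration, using the fact that the criterion evaluated at a degenerate $Q$ equals the complete-observed log-likelihood of the corresponding assignment.

```python
import itertools
import math


def log_phi(y, pi):
    """Poisson log-density in canonical form: y*pi - exp(pi) - log(y!)."""
    return y * pi - math.exp(pi) - math.lgamma(y + 1)


def elbo(y, r, tau, alpha, pi):
    """J(Q, theta) = E_Q[log p(y_obs, z; theta)] + entropy(Q), Q = prod_i Multinomial(1, tau[i])."""
    n, Q = len(tau), len(alpha)
    value = 0.0
    for i in range(n):
        for q in range(Q):
            if tau[i][q] > 0:  # convention 0 * log(0) = 0
                value += tau[i][q] * (math.log(alpha[q]) - math.log(tau[i][q]))
    for i in range(n):
        for j in range(n):
            if i != j and r[i][j] == 1:
                for q in range(Q):
                    for l in range(Q):
                        value += tau[i][q] * tau[j][l] * log_phi(y[i][j], pi[q][l])
    return value


def observed_loglik(y, r, alpha, pi):
    """Exact log p(y_obs; theta) by enumerating assignments (tiny n only)."""
    n, Q = len(y), len(alpha)
    terms = []
    for z in itertools.product(range(Q), repeat=n):
        tau = [[1.0 if q == z[i] else 0.0 for q in range(Q)] for i in range(n)]
        terms.append(elbo(y, r, tau, alpha, pi))  # degenerate Q: equals log p(y_obs, z; theta)
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


y = [[0, 2, 0, 1], [3, 0, 0, 0], [0, 1, 0, 4], [0, 0, 5, 0]]
r = [[0, 1, 0, 1], [1, 0, 1, 0], [1, 1, 0, 1], [0, 1, 1, 0]]
alpha = [0.6, 0.4]
pi = [[math.log(2.0), math.log(0.5)], [math.log(0.5), math.log(3.0)]]
tau_uniform = [[0.5, 0.5] for _ in range(4)]
j_uniform = elbo(y, r, tau_uniform, alpha, pi)
ll_obs = observed_loglik(y, r, alpha, pi)
```

Whatever the choice of `tau`, the value of `elbo` never exceeds `ll_obs`; the variational estimate is obtained in practice by alternately maximising the criterion in `tau` and in the parameters.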

Discussion
Close examination of the different proofs, especially that of Prop. 5.10, reveals that the quantities driving convergence of the estimates are $\rho n \delta(\pi^\star)$, which must go to $+\infty$ with $n$ to ensure validity of Prop. 5.8, and $\rho n t_n \delta(\pi^\star)$, which must be larger than $\log(n)$ while $t_n \to 0$, to ensure validity of Prop. 5.10. Both conditions are met as soon as $\rho \gg \log(n)/n$, allowing for a large fraction of missing edges. Note that this limiting rate for missingness is the same as the one found for graph density in sparse settings to achieve consistency and local asymptotic normality of $\hat{\theta}$ [Bickel et al., 2013].
In this paper, we focused on data sampled according to random dyad sampling. However, as described in section 2.3, there are many other ways to sample a network. In the case of node-centered sampling designs, like random node sampling, the main difficulty in proving consistency and asymptotic normality is the dependency between the $r_{ij}$ variables. Indeed, in random node sampling, the variable $r_{i_0 j_0}$ depends on all $r_{i j_0}$ and $r_{i_0 j}$ (for all $i, j \in \mathcal{N}$). As a consequence, many results proved in this paper are not valid under random node sampling. NMAR sampling designs raise problems of their own: each design requires its own estimation procedure [Tabouy et al., 2019] and therefore its own analysis. For example, even parameter estimation under the double standard sampling for binary networks mentioned in section 2.3 is still an unsolved problem: numerical experiments suggest that $\theta = (\pi, \alpha)$ and $\psi = (\rho_0, \rho_1)$ are jointly identifiable but there is no formal proof.

A.2. Proof of lemma 3.3
Proof. Note that $E[r_{ij} z_{iq} z_{j\ell}] = \rho \alpha_q \alpha_\ell$ and define $q^{q,\ell}_{i,j} = r_{ij} z_{iq} z_{j\ell} - \rho \alpha_q \alpha_\ell$. By Hoeffding's decomposition for U-statistics (see Hoeffding [1948])
where, for each permutation $\sigma \in S$, the corresponding term is a sum of independent r.v. Then, for $\gamma > 0$, by Jensen's inequality and Hoeffding's lemma about bounded r.v.
Finally, using the same argument as in the proof of Hoeffding's inequality allows us to conclude.
A.5. Proof of proposition 4.1
Proof. In regular configurations, each group has $\Omega(n)$ members, where $u_n = \Omega(n)$ if there exist two constants $a, b > 0$ such that, for $n$ large enough, $an \leq u_n \leq bn$. c/2-regular assignments, with $c$ defined in Assumption $H_1$, have high $P_{\theta^\star}$-probability in the space of all assignments, uniformly over all $\theta^\star \in \Theta$. Each $z_{+q}$ is a sum of $n$ i.i.d. Bernoulli r.v. with parameter $\alpha_q \geq \alpha_{\min} \geq c$. A simple Hoeffding bound shows that
$P(z_{+q} \leq nc/2) \leq \exp(-2n(\alpha_q - c/2)^2) \leq \exp(-nc^2/2)$.
Taking a union bound over the $Q$ values of $q$ leads to Proposition 4.1.
Proof. First of all we will prove equation 5.3, where $Z_i = q \Leftrightarrow z_{iq} = 1$. Note that the dyads $(i, j)$ for which $z_{iq} z_{j\ell} = 0$ do not contribute to either of the two terms of the ratio. Computing this expectation is then equivalent to computing an expectation of the general form
Finally,
B.2. Proof of proposition 5.4
Proof. Define $\nu(y, \pi) = y\pi - \psi(\pi)$. For $y$ fixed, $\nu(y, \pi)$ is maximized at $\pi = (\psi')^{-1}(y)$. Manipulations yield
which is maximized at
B.3. Proof of Proposition 5.6 (maximum of ELR and Λ)
Proof. We condition on $z^\star$ and prove Equation (5.5):
If $z^\star$ is regular, and for $n > 2/c$, all the rows of IR(z) have at least one positive element and we can apply Lemma 3.2 of Bickel et al. [2013] to characterize the maximum for ELR.
The separation and local behavior of G around $z^\star$ are a direct consequence of Proposition 5.5.

B.4. Proof of Proposition 5.5 (Local upper bound for Λ)
Proof. We work conditionally on $z^\star$. The principle of the proof relies on the extension of Λ to a continuous subspace of $M_Q([0, 1])$, in which the confusion matrix is naturally embedded. The regularity assumption allows us to work on a subspace that is bounded away from the borders of $M_Q([0, 1])$. The proof then proceeds by computing the gradient of Λ at and around its argmax and using those gradients to control the local behavior of Λ around its argmax. The local behavior allows us in turn to show that Λ is well-separated.
Note that Λ only depends on $z$ through IR(z). We can therefore extend it to matrices $U \in \mathcal{U}_c$, where $\mathcal{U}_c$ is the subset of matrices of $M_Q([0, 1])$ with each row sum higher than c/2.
and 1 is the $Q \times Q$ matrix filled with 1. Confusion matrices IR(z) satisfy IR(z)1I = $\alpha(z^\star)$, with 1I = $(1, \dots, 1)^T$ a vector only containing 1 values, and are obviously in $\mathcal{U}_c$ as soon as $z^\star$ is c/2-regular. The maps $f_{q,q',\ell,\ell'} : U \mapsto KL(\pi^\star_{q\ell}, \pi_{q\ell}(U))$ are twice differentiable with second derivatives bounded over $\mathcal{U}_c$ and therefore so is Λ(U). Tedious but straightforward computations show that the derivative of Λ at $D_\alpha := \mathrm{Diag}(\alpha(z^\star))$ is:
$A(z^\star)$ is the matrix-derivative of $-\Lambda/n^2$ at $D_\alpha$. Since $z^\star$ is c/2-regular and by definition of $\delta(\pi^\star)$, $A(z^\star)_{qq'} \geq c\rho\delta(\pi^\star)$ if $q \neq q'$ and $A(z^\star)_{qq} = 0$ for all $q$. By boundedness of the second derivative, there exists $C > 0$ such that for all $D_\alpha$ and all $H \in B(D_\alpha, C)$, we have:
$U - D_\alpha$ has nonnegative off-diagonal coefficients and negative diagonal coefficients. Furthermore, the coefficients of $U$ and $D_\alpha$ sum up to 1 and $\mathrm{Tr}(D_\alpha) = 1$. By Taylor expansion, there exists $H$ also in
To conclude the proof, assume without loss of generality that $z \in S(z^\star, C)$ achieves the $\|\cdot\|_{0,\sim}$ norm (i.e. it is the closest to $z^\star$ in its representative class).
According to Proposition B.3 of Brault et al. [2017],
In particular, for all $\varepsilon_n < \nu b$
We can then remove the conditioning and take a union bound.
B.6. Proof of Proposition 5.8 (contribution of far away assignments)
Proof. Conditionally on $z^\star$, we know from Proposition 5.6 that Λ is maximal in $z^\star$ and its equivalence class. Choose $0 < t_n$ decreasing to 0 but satisfying → +∞. According to 5.6 (iii), for all $z \notin S(z$
By Proposition 5.7, and with our choice of $\varepsilon_n$, with probability higher than
where the second line comes from inequality (B.1), the third from the global control studied in Proposition 5.7 and the definition of $\varepsilon_n$, the fourth from the definition of $p(y^o, z^\star; \theta^\star)$, the fifth from the bounds on $\alpha^\star$ and the last from
In addition, with our choice of $t_n$, we have $\varepsilon_n \gg \log(n)/n$ so that the series $\sum_n \Delta^1_n(\varepsilon_n)$ converges and:
B.7. Proof of Proposition 5.9 (local convergence LR)
Proof. We work conditionally on $z^\star \in \mathcal{Z}_1$. Choose $\varepsilon \leq \kappa\sigma^2$ small. Assignments $z$ at $\|\cdot\|_{0,\sim}$-distance less than c/4 of $z^\star$ are c/4-regular. According to Proposition B.1 of Brault et al. [2017], $y_{q\ell}$ and $\bar{y}_{q\ell}$ are at distance at most $\varepsilon$ with probability
where Λ(z) = E[Λ(z) | $z^\star$]. Manipulation of Λ, Λ and Λ yield
Concerning the first term. The function $f$ is twice differentiable on $\mathring{A}$ with
By Proposition B.1 (adapted for SBM) of Brault et al. [2017], $(y_{q\ell} - \bar{y}_{q\ell})^2 = O_P(1/n^2)$ where the $O_P$ is uniform in $z$ and does not depend on $z^\star$. Similarly, $\bar{y}_{q\ell}$ is a convex combination of the $S^\star_{q\ell} = \psi'(\pi^\star_{q\ell})$, therefore,
Note that:
and $y_{q\ell} - \bar{y}_{q\ell} = o_P(1)$. Therefore
The remaining term writes
and is also $o_P(\|z - z^\star\|_{0,\sim}/n)$ uniformly in $z$ and $z^\star \in \Omega_1$ by Proposition C.2.
Concerning the second term. For all $q, \ell$, defining
Using the following notations
we are able to write
where the second equality is the sum of independent random variables. Note that:
Concerning the third term. Using arguments developed previously leads to the same conclusion as before:
As a conclusion, writing
where the first line comes from the definition of Λ, the second line from Proposition 5.6 and the third from Proposition 5.9. Thanks to Proposition D.1, we also know that:
We may prove the corollary by contradiction. Note first that unless $\Theta$ is constrained and with high probability, $\hat{\theta}_{MLE}$ and $\hat{\theta}(z^\star)$ exhibit no symmetries. Indeed, equalities like $y_{q\ell} = y_{q',\ell'}$ have vanishingly small probabilities of being simultaneously true when $y_{ij}$ is discrete, and even null when $y_{ij}$ is continuous.
As this inequality is true for every couple $z$, we have in particular:
Again unless $\Theta$ is constrained, $\hat{\theta}_{VAR}$ exhibits no symmetries with high probability and the same proof by contradiction as in appendix B.10 gives the result.
where the first inequality comes from the definition of α(z) and the second from Lemma B.6 of Brault et al. [2017] and the fact that z ⋆ and z are c/4-regular.
Assume that A 1 to A 4 with random-dyad sampling hold for the Stochastic Block Model of known order with n × n observations coming from an univariate exponential family and define # Sym(θ) as the