Estimation of dense stochastic block models visited by random walks

We are interested in recovering information on a stochastic block model from the subgraph discovered by an exploring random walk. Stochastic block models correspond to populations structured into a finite number of types, where two individuals are connected by an edge independently of the other pairs and with a probability depending on their types. We consider here the dense case, where the random network can be approximated by a graphon. This problem is motivated by the study of chain-referral surveys, where each interviewee provides information on her/his contacts in the social network. First, we write the likelihood of the subgraph discovered by the random walk: biases appear, since hubs and majority types are more likely to be sampled. Even in the case where the types are observed, the maximum likelihood estimator is no longer explicit. When the types of the vertices are unobserved, we use an SAEM algorithm to maximize the likelihood. Second, we propose a different estimation strategy using recent results by Athreya and Röllin. It consists in de-biasing the maximum likelihood estimator proposed in Daudin et al., which ignores the biases.


Introduction
A way to infer a random structure such as the graph of a social network, and to discover its properties, is to explore it with random walks (e.g. [27]). This mathematical idea can be put into practice to reveal hidden populations, such as drug users, by using referral chain sampling, where each new person provides information on her/his contacts: see for example snowball sampling [15] or the 'respondent-driven sampling' (RDS) introduced by Heckathorn [16] (see also the PhD thesis of the second author [33]). These methods were first used to estimate the size of the hidden population or to infer population means, under the assumption that a subject's network degree determines her/his probability of being sampled, see Volz and Heckathorn [34] (see also [22]). Because the inclusion probability of a subject is complicated to compute, due to the dependencies associated with the graph and the fact that the sampling should in practice be without replacement, a substantial numerical literature on the subject has followed (see e.g. [13,14,26]). Gile [12] proposed an improved estimator for population means taking into account the without-replacement sampling, and Rohe [28] established critical thresholds for the design effects. Because of privacy restrictions, the social-network information is usually only a tree, as each interviewee has been 'invited' into the survey by a previously interviewed subject.
Crawford, Wu and Heimer [10] use a Bayesian approach to integrate over the missing edges between recruited individuals. It appears that the information gathered in chain-referral surveys can also be used to estimate the social network itself, or at least properties associated with its topology. Recent surveys make it possible to gather connectivity information for recruited members: see for example Rolls et al. [29] and Jauffret-Roustide et al. [31]. Interviewees are asked for a description of their contacts, and for a first name or a nickname. This information allows a partial reconstruction of the social network, yielding a subgraph that is not a tree. It is then natural to wonder how much information on the total graph can be recovered from the observation of the subgraph obtained by the chain-referral sampling. Of course, biases have been emphasized, as individuals of high degree (hubs) are sampled with higher probability and 'common profiles' are much more likely to be discovered (e.g. [20]). This motivates the present paper. To fix the framework of study, we consider a particular class of random graphs, namely the Stochastic Block Models (SBM), which are popular models for social networks (see [17] and the review [1]). For this parametric model, inferring the distribution of the random graph boils down to a finite-dimensional parameter estimation. Also, for simplification, we consider here a model of random walk on the continuous version of the SBM graph, namely the SBM graphon, introduced in the next paragraph. Two estimation strategies are considered in this paper. First, we establish the likelihood of a random walk exploring this structure, which accounts for the sampling biases. Two cases are classically considered, depending on whether the types of the visited nodes are observed or not. Even in the case of a complete observation, the maximum likelihood estimator has no explicit form. When the types of the vertices are unobserved, we adapt the Stochastic Approximation
Expectation-Maximization algorithm (SAEM) introduced in [7,21]. Second, we propose a new estimator using recent theoretical probabilistic results by Athreya and Röllin [3], who compute an exact formula for the bias. We provide a consistent estimator in the case of complete observations, and a de-biasing strategy for the usual maximum likelihood estimator of Daudin et al. [11] in the case where the types of the explored nodes are unknown.
We consider as a toy model a Stochastic Block Model graphon with Q classes. Graphons, considered here as symmetric integrable functions from [0, 1]² to R, can be seen as limits of dense graphs (see e.g. [23]). Recall that SBM graphs are a generalization of Erdős–Rényi graphs, where each node i is characterized by a type Z_i ∈ {1, . . ., Q}, with Q the number of different possible values. The random variables (r.v.) Z_i are assumed independent and identically distributed (i.i.d.) with P(Z_i = q) = α_q > 0. Each pair of nodes {i, j} is connected independently, with a probability π_{Z_i,Z_j} ∈ (0, 1) that depends only on the types. Because the graph is non-oriented, the matrix with entries π_qr is symmetric (π_qr = π_rq). Thus, for a given Q, the distributions of SBM graphs are parameterized by the vector θ = (α_q, π_qr ; q, r ∈ {1, . . ., Q}).
When the number of vertices of the graph tends to infinity, it is known that the dense graph converges to a limiting continuous object called a graphon, see e.g. [5,6,23]. Let us recall the definition of the SBM graphon.
For the sequel, we introduce the partition of [0, 1] defined by

I_q = [A_{q−1}, A_q),    (1.1)

where for q ∈ {1, . . ., Q}, A_q = Σ_{k=1}^{q} α_k, with A_0 = 0 by convention. The SBM graphon κ_θ, associated with the parameter θ = (α_q, π_qr ; q, r ∈ {1, . . ., Q}), is the function from [0, 1]² to [0, 1] defined as follows:

κ_θ(x, y) = π_qr  if (x, y) ∈ I_q × I_r.    (1.2)

Heuristically, we can see [0, 1] as a continuum of vertices, and the graphon is the limit of the expectation of the adjacency matrix of the graph, in the sense that κ_θ(x, y) measures the probability of connection between x and y.
We consider a random walk on the graphon κ_θ, i.e. the process X = (X_m)_{m≥1} with values in [0, 1] and transition kernel:

P(X_{m+1} ∈ dy | X_m = x) = (κ_θ(x, y) / ∫_0^1 κ_θ(x, z) dz) dy.

This random walk is the analogue of the classical random walk on a graph that jumps from a vertex to one of its neighbouring vertices chosen uniformly at random. One simplification brought by studying the random walk on the graphon lies in the facts that (i) nodes are almost surely visited only once, so the random walk does not return to previously explored nodes, and (ii) the Markov chain cannot get stuck, as a self-avoiding random walk on a discrete graph would.
From the exploration of this random walk, we can construct a subgraph of the 'nodes' visited. Assume that we observe n steps of the random walk, i.e. X^(n) = (X_1, . . ., X_n). The associated path (up to its nth step) is a subgraph (chain) H_n, with vertices {1, . . ., n} and edges {i, i + 1} for i ∈ {1, . . ., n − 1}. This chain is completed by sampling independently edges between vertices that are not already connected, with a probability depending on their types. We denote by (Y_ij ; i, j ∈ {1, . . ., n}) the adjacency matrix of the resulting graph, i.e. Y_ij = 1 if and only if i ∼_{G_n} j. Because the graph is non-oriented, we have Y_ij = Y_ji. Moreover, notice that by construction, we always have Y_{i,i+1} = 1 for i ∈ {1, . . ., n − 1}. Following the notation of Athreya and Röllin [3], we denote by G_n = G(H_n, κ_θ) the completed graph, whose edges are as follows. Let i and j be two vertices.
• If there is an edge between i and j in H_n (i ∼_{H_n} j), then there is also an edge between these nodes in G_n: i ∼_{G_n} j.
• If there is no edge between i and j in H_n, we connect i and j in G_n with probability κ_θ(X_i, X_j).
This subgraph G_n is the RDS graph. Notice that the random walk and the subgraph G_n can be defined for general graphons and not only for SBM graphons (see [3]).
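As an illustration, the sampling mechanism above (type-to-type transitions of the walk, then completion of the chain H_n into G_n) can be simulated. The following minimal sketch is ours (the function name `simulate_rds` and the 0-based types are our conventions); the transition law P(next class = r | current class = q) ∝ α_r π_qr is the one implied by the kernel above for an SBM graphon:

```python
import numpy as np

def simulate_rds(alpha, pi, n, rng):
    """Simulate n steps of the graphon random walk and the completed RDS graph G_n.

    alpha: class weights (length Q); pi: symmetric Q x Q connection matrix.
    Types are 0-based here.  Returns positions X, types Z and adjacency Y.
    """
    alpha = np.asarray(alpha, dtype=float)
    pi = np.asarray(pi, dtype=float)
    A = np.concatenate(([0.0], np.cumsum(alpha)))  # interval endpoints A_0, ..., A_Q
    Z = np.empty(n, dtype=int)
    X = np.empty(n)
    # initial vertex: type drawn from alpha, position uniform in its interval
    Z[0] = rng.choice(len(alpha), p=alpha)
    X[0] = rng.uniform(A[Z[0]], A[Z[0] + 1])
    for m in range(1, n):
        # transition of the walk: next class r with probability prop. to alpha_r * pi_{q,r}
        w = alpha * pi[Z[m - 1]]
        Z[m] = rng.choice(len(alpha), p=w / w.sum())
        X[m] = rng.uniform(A[Z[m]], A[Z[m] + 1])
    # completion of the chain H_n: chain edges, plus independent edges with prob pi_{Z_i,Z_j}
    Y = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if j == i + 1 or rng.random() < pi[Z[i], Z[j]]:
                Y[i, j] = Y[j, i] = 1
    return X, Z, Y
```

By construction, the output always contains the chain edges Y_{i,i+1} = 1 and a symmetric adjacency matrix.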
In the rest of the paper, we assume that this is the model generating our data and that the observation corresponds to a realization of G_n. The complete data consist in:
• the positions X_1, . . ., X_n of the successive vertices visited,
• their types Z_1, . . ., Z_n,
• the adjacency matrix (Y_ij)_{i,j∈{1,...,n}}.
We will consider the case where (i) all these elements are observed, and the cases where only partial information is available: (ii) the adjacency matrix (Y_ij)_{i,j∈{1,...,n}} and the positions X_i of the vertices are observed, but not the Z_i's (notice that in this case, some information on the types Z_i can still be recovered, since the latter depend on the X_i's); (iii) only the adjacency matrix is observed. Our purpose is to estimate θ = (α_q, π_qr ; q, r ∈ {1, . . ., Q}) using the subgraph G_n. In the literature, the estimation of SBM graphs has been extensively studied, but often in a framework where the number of nodes is known. In particular, variational EM approaches have been used in many cases where the types are unknown, see [11,30,24]. The estimation of SBM graphs when the total population size is unknown, and when we only have a subgraph obtained by a chain-referral method, has not, to our knowledge, been studied. We develop in this paper two approaches that we compare in a final numerical section (Section 5).
For the first approach, it is possible to write the likelihood of G_n. Here, because the graph is explored through an RDS random walk, our likelihood differs from the likelihoods in the papers cited above: it accounts both for the transitions of the random walk and for the connectivity of vertices given their types. We study in Section 3 the maximum likelihood estimator (MLE) in our setting for both cases, when the node types are observed (Section 3.1) or not (Section 3.2). Even when the observation is complete, the maximum likelihood estimator does not have an explicit form. When the types are unknown, we adapt to our likelihood the variational EM approach of [11]. The second approach, developed in Section 4, is inspired by the recent work of Athreya and Röllin [3]. These authors showed that when we observe the random walk sufficiently long (n → +∞), the sequence of graphs (G(H_n, κ_θ))_{n≥1} converges to a biased version of the graphon κ_θ. Based on their probabilistic result, a natural estimator of the biased graphon turns out to be the MLE in the 'classical' case studied by [11]. Starting from this estimator, which is not consistent in our case, we propose a new consistent estimator of θ. We first detail the estimation for the case of complete observations (Section 4.1) and then extend the variational EM of the first approach to this case (Section 4.2). Another possibility, which does not use the information on the X_i's, is developed in Section 4.2.2.

Probabilistic setting
In this section, we give some important properties of the RDS Markov chain X^(n), in particular on its long-term behaviour. Then we explain the biases that appear when estimating the graphon κ_θ from the RDS subgraph G_n.

Exploration by a random walk
Assumption 1. In all the paper, we consider the graphon κ_θ of an SBM graph (see (1.2)) and we assume that κ_θ is connected, i.e. that for every measurable subset A ⊂ [0, 1] such that its Lebesgue measure |A| ∈ (0, 1),

∫_A ∫_{[0,1]\A} κ_θ(x, y) dx dy > 0.    (2.1)

Let us now introduce some notations: for q ∈ {1, . . ., Q},

π̄_q = Σ_{r=1}^{Q} α_r π_qr   and   π̄ = Σ_{q=1}^{Q} α_q π̄_q.    (2.2)

The quantity π̄_q corresponds to the mean connectivity of a node of class q, and π̄ corresponds to the mean connectivity of a node chosen uniformly in [0, 1].
Proposition 2.1. Under Assumption 1, the random walk X = (X_n)_{n≥1} admits a unique invariant probability measure

m(dx) = (∫_0^1 κ_θ(x, y) dy / ∫_0^1 ∫_0^1 κ_θ(u, v) du dv) dx.    (2.3)

The general proof is given in [3, Prop. 4.1], but for the case of SBM graphons the result is easy to prove.
From expression (2.3), we see that for q ∈ {1, . . ., Q}, the measure of the class q with respect to m(dx) is:

α̃_q := m(I_q) = α_q π̄_q / π̄.    (2.4)

So, if π̄_q > π̄, then α̃_q > α_q, and the stationary measure m(dx) puts more weight than the Lebesgue measure on the interval I_q, which has a larger than average connectivity. If π̄_1 = · · · = π̄_Q = π̄ are all equal, we have α̃_q = α_q for all q ∈ {1, . . ., Q}, and m(dx) is the uniform measure on [0, 1] by (2.3). Otherwise, we expect biases in how the graphon κ_θ is discovered by G_n.
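The biased weights α̃_q = α_q π̄_q / π̄ of (2.4) are immediate to compute from (α, π); a minimal sketch (the function name is ours):

```python
import numpy as np

def biased_weights(alpha, pi):
    """Stationary class weights alpha_tilde_q = alpha_q * pibar_q / pibar, as in (2.4)."""
    alpha = np.asarray(alpha, dtype=float)
    pi = np.asarray(pi, dtype=float)
    pibar_q = pi @ alpha        # mean connectivity of a node of class q
    pibar = alpha @ pibar_q     # mean connectivity of a uniformly chosen node
    return alpha * pibar_q / pibar
```

When all mean connectivities π̄_q coincide (e.g. the Erdős–Rényi case), the function returns α unchanged, illustrating the absence of bias.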

Convergence of dense graphs
We are interested in the case where n → +∞. Then, the (dense) RDS graph G_n may converge to a graphon, and it is natural to compare the possible limit to the graphon κ_θ on which the random walk moves. Let us briefly recall some topological facts. We refer the interested reader to [23].
Let us first give some notations. For an integer n, we write [[1, n]] = {1, . . ., n}. For a graph G, E(G) denotes the set of edges of G, and i ∼_G j means that {i, j} ∈ E(G). For a graph F with k vertices, we can define the density of F in G by:

t(F, G) = (1/n^k) #{φ : [[1, k]] → [[1, n]] such that {i, j} ∈ E(F) ⇒ {φ(i), φ(j)} ∈ E(G)}.    (2.5)

This notion of subgraph density can be generalized to a graphon κ by:

t(F, κ) = ∫_{[0,1]^k} ∏_{{i,j}∈E(F)} κ(x_i, x_j) dx_1 · · · dx_k.    (2.6)

Let F denote the class of isomorphism classes of finite graphs and let (F_i)_{i≥1} be a particular enumeration of F. Then, the distance between two graphs G and G′ is:

d_sub(G, G′) = Σ_{i≥1} 2^{−i} |t(F_i, G) − t(F_i, G′)|.    (2.7)

The convergence of large graphs to graphons can be expressed with this distance [23, Chapter 11].
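For small motifs F, the subgraph density t(F, G) can be evaluated by brute-force enumeration of the n^k maps; a naive sketch (the function name is ours, and the enumeration is O(n^k), so only suitable for small F and G):

```python
import numpy as np
from itertools import product

def subgraph_density(F_edges, k, Y):
    """Density t(F, G): fraction of maps phi from [[1, k]] to the vertices of G
    sending every edge of F to an edge of G (naive enumeration)."""
    n = len(Y)
    count = 0
    for phi in product(range(n), repeat=k):
        if all(Y[phi[a], phi[b]] for a, b in F_edges):
            count += 1
    return count / n ** k
```

For instance, on the triangle K_3, the edge density is 6/9 = 2/3 and the triangle density is 6/27.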

Biases in the discovery of κ θ
Let us denote by Γ the cumulative distribution function of m(dx):

Γ(x) = m([0, x]),  x ∈ [0, 1].    (2.8)

Athreya and Röllin [3] have proved that the graphon discovered by the RDS is biased:

Proposition 2.2 (Corollary 2.2 [3]). Under Assumption 1, we have:

lim_{n→+∞} d_sub(G_n, κ̃_θ) = 0  a.s.,

where the generalised inverse of Γ is Γ^{−1}(u) = inf{x ∈ [0, 1] : Γ(x) ≥ u}, and where for all x, y ∈ [0, 1],

κ̃_θ(x, y) = κ_θ(Γ^{−1}(x), Γ^{−1}(y)).    (2.9)

This proposition, which holds not only for SBM graphons but also in more general cases developed in [3], says that the topology of the subgraph discovered by the RDS is biased compared with the true underlying structure κ_θ, because the random walk is more likely to visit the nodes with high degrees (hubs) and the frequent types.
In the case of an SBM graphon parameterized by θ = (α_q, π_qr ; q, r ∈ {1, . . ., Q}), and under Assumption 1, Γ is a one-to-one map and Γ^{−1} is its usual inverse function: it is here the piecewise affine function that maps the interval [Ã_{q−1}, Ã_q) to [A_{q−1}, A_q), where Ã_q = Σ_{k=1}^{q} α̃_k (with Ã_0 = 0). We have here, with the notation (1.2):

κ̃_θ = κ_{θ̃},  where θ̃ = (α̃_q, π_qr ; q, r ∈ {1, . . ., Q})

and where the α̃_q are defined in (2.4). For SBM graphons, there is no bias when κ̃_θ = κ_θ, i.e. when for all q ∈ {1, . . ., Q}, α̃_q = α_q.
Example 2.3. When Q = 2, the parameter is θ = (α, π_11, π_12, π_22) and the graphon is given by:

κ_θ(x, y) = π_11 1_{x,y<α} + π_12 (1_{x<α≤y} + 1_{y<α≤x}) + π_22 1_{x,y≥α}.    (2.12)

This function is represented in Fig. 1. The invariant probability measure is given by (2.3). As a result (see Fig. 1), the biased graphon κ̃_θ corresponds to the SBM graphon (2.12) where the weight of class 1 is changed from α to

α̃ = α(απ_11 + (1 − α)π_12) / (α²π_11 + 2α(1 − α)π_12 + (1 − α)²π_22).    (2.13)

In this particular case, it can be seen that α̃ = α if and only if απ_11 + (1 − α)π_12 = απ_12 + (1 − α)π_22. This is satisfied for example when π_11 = π_12 = π_22 (Erdős–Rényi) or when α = 1/2 and π_11 = π_22 (both types are symmetric).

Empirical cumulative distribution
As seen in the previous paragraph, the bias linked with the discovery of the graphon κ_θ by the RDS subgraph G_n is expressed in terms of the cumulative distribution function Γ of the stationary distribution m of X^(n). In the sequel, the empirical cumulative distribution function of m will be useful, and we recall here some facts:

Γ_n(x) = (1/n) Σ_{i=1}^{n} 1_{X_i ≤ x}.    (2.14)

Lemma 2.4. Γ_n and Γ_n^{−1} converge a.s. uniformly to Γ and Γ^{−1} respectively.

Proof. The almost sure pointwise convergence of Γ_n to Γ is a consequence of the ergodic theorem. Then, the a.s. uniform convergence is obtained by the Glivenko–Cantelli theorem. Let us prove the uniform convergence of Γ_n^{−1} to Γ^{−1}. Because all the α_q's are positive, Γ is a nondecreasing and piecewise affine bijection, and the inverse bijection Γ^{−1} is also nondecreasing and piecewise affine. Let ε > 0 and let n_0 ∈ N be sufficiently large so that for all n ≥ n_0, sup_x |Γ_n(x) − Γ(x)| ≤ ε. Because the jumps of Γ_n are a.s. of size 1/n, we necessarily have that Γ_n^{−1} is uniformly close to Γ^{−1}, which proves the uniform convergence of Γ_n^{−1} to Γ^{−1}.
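The empirical distribution function (2.14) and its generalised inverse are straightforward to compute from the visited positions; a small sketch (the function names are ours):

```python
import numpy as np

def Gamma_n(x, X):
    """Empirical cdf of the visited positions X_1, ..., X_n at point x, as in (2.14)."""
    Xs = np.sort(np.asarray(X))
    return np.searchsorted(Xs, x, side="right") / len(Xs)

def Gamma_n_inv(u, X):
    """Generalised inverse: smallest x with Gamma_n(x) >= u, i.e. the order
    statistic of rank ceil(u * n)."""
    Xs = np.sort(np.asarray(X))
    n = len(Xs)
    k = int(np.ceil(u * n))
    return Xs[min(max(k, 1), n) - 1]
```

The generalised inverse returns the order statistic X_(k) with k = ⌈un⌉, clamped to {1, . . ., n}.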

Likelihood estimation
In this section, we write the likelihood of G_n and compute the MLE of the parameters θ in Section 3.1, when we have complete observations. Here our likelihood is specific to the RDS exploration. The MLE does not have an explicit formula, and we explain how to compute it numerically. Then, in Section 3.2, we study the case where the types Z_i are unobserved. Notice that the estimation in this Section 3 makes use only of the connectivity information carried by the random variables Y_ij. The estimators here do not depend on the positions X_i. The types Z may be known or unobserved.
Let us introduce some notations. We denote by N_q^n, q ∈ {1, . . ., Q}, the number of vertices of type q sampled by the Markov chain. For q, r ∈ {1, . . ., Q}, we also denote by N_n^{q↔r} (resp. N_n^{q↮r}) the number of pairs of vertices of types (q, r) that are connected (resp. not connected).

Complete observations
Assume that we observe the explored nodes discovered by the RDS, with their classes and connections: (Z_i)_{i∈{1,...,n}} and (Y_ij)_{i,j∈{1,...,n}}.

Proposition 3.1. Up to the term corresponding to the initial type Z_1, the likelihood of the complete observations is:

L_n(Z, Y; θ) = ∏_{m=2}^{n} (α_{Z_m} π_{Z_{m−1},Z_m} / Σ_{r=1}^{Q} α_r π_{Z_{m−1},r}) × ∏_{j>i+1} π_{Z_i,Z_j}^{Y_ij} (1 − π_{Z_i,Z_j})^{1−Y_ij},    (3.1)

which can be rewritten in terms of the counts N_q^n, N_n^{q↔r} and N_n^{q↮r} as (3.2). Notice that in the above formula, the notation π_{qq′} is a shortcut for π_{min(q,q′),max(q,q′)}.
Proof. We have that the likelihood factorizes into two products: the first product corresponds to the likelihood of the types sampled along the Markov chain, and the second product corresponds to the likelihood of the edges between vertices that are not visited successively by the Markov chain. Because the graph is non-oriented, it is sufficient to consider i < j (recall that Y_{i,i+1} = 1 by construction). Finally, rewriting the above likelihood using N_q^n, N_n^{q↔r} and N_n^{q↮r}, we obtain (3.2), which provides the announced result.
Proposition 3.2. The MLE θ̂ = (α̂_q, π̂_qr ; 1 ≤ q ≤ r ≤ Q) is the solution of the system of equations (3.4) obtained by cancelling the gradient of the log-likelihood under the constraint Σ_{q=1}^{Q} α̂_q = 1.

Proof. The log-likelihood of the observations is log L_n(Z, Y; θ), obtained from (3.1): the transition terms contribute log(α_{Z_m} π_{Z_{m−1},Z_m}) − log(Σ_r α_r π_{Z_{m−1},r}), and the edge terms contribute Y_ij log π_{Z_i,Z_j} + (1 − Y_ij) log(1 − π_{Z_i,Z_j}). When we optimize log L with respect to the parameters, under the constraint that Σ_{q=1}^{Q} α_q = 1, we obtain, after computation of the Lagrangian, the following system. First, the estimator θ̂ = (α̂_q, π̂_qr ; 1 ≤ q ≤ r ≤ Q) satisfies the constraint. Second, the remaining equations of the system give (3.4) for all 1 ≤ q ≤ r ≤ Q. In the sequel, the example with Q = 2 will be developed.
The identifiability of the model where the sampling of nodes is i.i.d. is a result of Allman et al. [2, Theorem 7]. In our case, the consistency of π̂_qr is obtained by Van der Vaart [32, Th. 5.7]. Indeed, the sequence of log-likelihoods renormalized by 1/n² converges to a limit when n → +∞, and this limit admits a local maximum around the true parameters (π_qr). For the parameters α_q, it is trickier. Techniques developed by Célisse et al. [9], which are based on explicit expressions of the estimators, cannot be followed here. We can rewrite the likelihood of the Y_ij's as a mixture, given the probability of the Z_i's, but the latter are not independent, which complicates the computation. This is left for further research.
Remark 3.3. When the graph is completely observed, and not only through the sampling from a Markov chain, the classical likelihood, as obtained in Daudin et al. [11], is:

L(Z, Y; θ) = ∏_{i=1}^{n} α_{Z_i} × ∏_{i<j} π_{Z_i,Z_j}^{Y_ij} (1 − π_{Z_i,Z_j})^{1−Y_ij}.    (3.7)

The difference between (3.7) and (3.2) is the first product, which corresponds to the likelihood of the node types. In the classical case, these types are chosen independently, whereas here they are discovered by the successive states of the Markov chain. In this classical case, the MLE has an explicit formula:

α̂_q = N_q^n / n,  π̂_qr = N_n^{q↔r} / (N_q^n N_r^n) for q ≠ r,  π̂_qq = 2N_n^{q↔q} / (N_q^n (N_q^n − 1)).    (3.8)

Here, for the likelihood (3.1), the MLE which solves (3.4) is no longer explicit. Let us briefly discuss the case of two classes (Q = 2). The parameter is then θ = (α, π_11, π_12, π_22), and we denote by θ̂ = (α̂, π̂_11, π̂_12, π̂_22) its estimator. Beware that the parameter π_12 appears in the transition terms of both types. The estimator θ̂ is then the solution of the system (3.9)–(3.12) obtained by cancelling the gradient of the log-likelihood. This system of equations is non-linear and cannot be simplified further, and it admits no explicit solution. An algorithm for computing a particular solution in the case Q = 2 is given in Section 3.3.1 of the PhD thesis [33]. In our case, we use the nlm function of R to solve the system (3.9)–(3.12) numerically and obtain approximate values for the MLE θ̂. For the numerical simulations, we refer the reader to Section 5.
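Since the MLE solves a non-linear system, one can equivalently maximize the log-likelihood numerically (the paper uses R's nlm; the following is an analogous sketch in Python with scipy, our function names, for Q = 2). It assumes the type-chain transition law P(r | q) ∝ α_r π_qr from the likelihood above, and uses a sigmoid reparameterization to keep all parameters in (0, 1):

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, Z, Y):
    """Negative RDS log-likelihood, Q = 2, complete observations (0-based types).

    params = logits of (alpha_1, pi_11, pi_12, pi_22), so all four stay in (0, 1).
    """
    a, p11, p12, p22 = 1.0 / (1.0 + np.exp(-np.asarray(params, dtype=float)))
    alpha = np.array([a, 1.0 - a])
    pi = np.array([[p11, p12], [p12, p22]])
    n = len(Z)
    ll = np.log(alpha[Z[0]])  # initial type (a convention of this sketch)
    # transitions of the type chain: P(r | q) = alpha_r pi_{q,r} / sum_s alpha_s pi_{q,s}
    for m in range(1, n):
        q, r = Z[m - 1], Z[m]
        ll += np.log(alpha[r] * pi[q, r]) - np.log(alpha @ pi[q])
    # independent Bernoulli edges between non-consecutive pairs
    for i in range(n):
        for j in range(i + 2, n):
            p = pi[Z[i], Z[j]]
            ll += Y[i, j] * np.log(p) + (1 - Y[i, j]) * np.log(1.0 - p)
    return -ll

def mle_rds(Z, Y):
    """Numerical MLE, analogous to the paper's use of R's nlm."""
    res = minimize(neg_loglik, np.zeros(4), args=(Z, Y), method="Nelder-Mead")
    return 1.0 / (1.0 + np.exp(-res.x))  # back to (alpha, pi_11, pi_12, pi_22)
```

The derivative-free Nelder–Mead method is chosen here for simplicity; gradient-based methods would need the (lengthy) derivatives of the transition terms.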

Incomplete observations: SAEM Algorithm
Here, we assume that the types Z = (Z_i)_{i=1,...,n} are unobserved. In this case, the likelihood of the observed data (Y_ij)_{i,j∈{1,...,n}} is obtained by summing the complete-data likelihood (3.2) over all the possible values of the unobserved variables Z:

L(Y; θ) = Σ_{z∈{1,...,Q}^n} L(z, Y; θ).    (3.13)

Unfortunately, this sum is not tractable, and it is classical to use the Expectation-Maximization (EM) algorithm to compute the maximum likelihood. Here we use an SAEM algorithm (see [7,21]) with the variational approximation of the conditional distribution of Z given Y introduced in [11], and adapt their methods to our setting with the likelihood (3.1).
Let us sum up the EM algorithm (see e.g. [7,8,21]). The observed data are the Markov chain X^(n), the connections (Y_ij ; i, j ∈ X^(n)) and the number of blocks Q. Given the value θ^(k−1) at the (k − 1)th iteration of the EM, at the kth step we compute the conditional expectation of the complete-data log-likelihood given X, Y for the current fit, and maximize it to obtain θ^(k). Here there is no explicit expression for this conditional expectation, because the exact distribution of Z given X, Y is unknown, and thus we need to approximate it numerically by using an SAEM algorithm [7,21], proceeding as follows.

The SAEM algorithm
Given the parameters θ^(k−1) = (α^(k−1), π^(k−1)) of the (k − 1)th iteration, the kth iteration of the SAEM is as follows.

Step 1: Choosing the appropriate Z^(k)
- Simulate a candidate Z^c following the proposal distribution q_{θ^(k−1)}(· | Z^(k−1)). The choice of the proposal distribution is discussed in Section 3.2.2, where we use a variational approach.
- Calculate the acceptance probability ω.
- Accept the candidate Z^c with probability ω: set Z^(k) = Z^c with probability ω, and Z^(k) = Z^(k−1) otherwise.

Step 2: Stochastic approximation
Update the quantity

Q_k(θ) = Q_{k−1}(θ) + s_k (log L(Z^(k), Y; θ) − Q_{k−1}(θ)),

with the initialization Q_0(θ) = 0, where (s_k)_{k∈N} is a positive decreasing step-size sequence satisfying Σ_k s_k = +∞ and Σ_k s_k² < +∞.

Step 3: Maximization
Choose θ^(k) to be the value of θ that maximizes Q_k(θ).

Kuhn and Lavielle studied the convergence of the sequence θ^(k) in [21]. In the particular case of SBM, and for the incomplete likelihood based on (3.7), the consistency of EM and variational methods has been studied by Célisse et al. [9], and asymptotic normality has been established by Bickel et al. [4]. The likelihood considered here differs, and these results cannot be directly applied, but a study along these lines could be investigated.
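The three steps above can be sketched as a generic skeleton (ours, with user-supplied callables; a symmetric proposal is assumed, so the Metropolis acceptance ratio reduces to a plain likelihood ratio, and the step sizes s_k = 1/k satisfy the two summability conditions):

```python
import numpy as np

def saem(loglik, propose, z0, theta0, maximize_step, K, rng):
    """Generic SAEM skeleton (a sketch; all callables are user-supplied).

    Step 1: Metropolis update of the latent Z (symmetric proposal assumed).
    Step 2: stochastic approximation of the expected complete log-likelihood.
    Step 3: maximization, delegated to `maximize_step`.
    """
    z, theta = z0, theta0
    Qk = 0.0
    for k in range(1, K + 1):
        zc = propose(z, theta, rng)                 # Step 1: candidate Z^c
        log_ratio = loglik(zc, theta) - loglik(z, theta)
        if np.log(rng.random()) < log_ratio:        # accept with prob. min(1, ratio)
            z = zc
        s_k = 1.0 / k                               # sum s_k = +inf, sum s_k^2 < +inf
        Qk = Qk + s_k * (loglik(z, theta) - Qk)     # Step 2: stochastic approximation
        theta = maximize_step(z, theta)             # Step 3: maximization
    return theta, z, Qk
```

In this toy form, `Qk` tracks a scalar; in the actual algorithm the stochastic approximation is carried out on the function θ ↦ Q_k(θ) (or on its sufficient statistics).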

Variational approach
For the proposal distribution q_{θ^(k−1)}(· | Z^(k−1)) of Z^(k), we follow Daudin et al. [11], who use a variational approach. Let us recall the main idea of this approach. The general strategy has been described in Jordan et al. [19] or Jaakkola [18].
Recall the likelihood L(Y; θ) of the incomplete data (3.13). The idea of the variational approach is to replace the log-likelihood by a lower bound:

J(R_{Y,θ}) = log L(Y; θ) − KL(R_{Y,θ}, L(Z | Y, θ)),

where KL(µ, ν) := ∫ log(dµ/dν) dµ is the Kullback–Leibler divergence between the distributions µ and ν, and where R_{Y,θ}(Z) is an approximation of the conditional distribution L(Z | Y, θ). When R_{Y,θ} is a good approximation of L(Z | Y, θ), J(R_{Y,θ}) is very close to log L(Y; θ).
Here, Z takes discrete values in {1, . . ., Q}. Following [11], we restrict to distributions R_{Y,θ} that belong to the family of multinomial probability distributions parameterized by τ = (τ_1, . . ., τ_Q), as approximate conditional distributions of Z given Y and θ. These multinomial distributions assume independence of the Z_i's conditionally on Y, which makes the computations tractable. If we look for the parameter τ that maximizes (3.17), we hence obtain the best approximation of L(Z | Y, θ) among these multinomial distributions. We choose the latter as the proposal distribution for Z in Step 1 of the SAEM algorithm.

Proposal distribution for the Step 1 of SAEM
For the sake of simplicity, we treat here the case Q = 2, but the generalization is straightforward. Using the previous results, we can now detail Step 1 of the SAEM algorithm. Given the parameters θ^(k−1), the types Z^(k−1) and the data (Y_ij ; i, j ∈ [[1, n]]), we proceed as follows.
Step 1: We compute the parameters τ as in Proposition 3.5. The parameters in (3.25) are given by θ^(k−1), and the terms b(Y_ij, π) are computed with the types Z^(k−1).
Step 2: We simulate a candidate Z^c ∈ {1, 2}^n for Z such that Z^c_i − 1 follows the law Ber(τ_i). Recall that the candidate is accepted with the probability ω of Step 1 of the SAEM, computed from the complete likelihood with respect to (α, π, Z, Y).

Estimation via biased graphon and 'classical likelihood'
In Section 3, the MLE was computed, but it has no explicit formula in the case of RDS exploration. We thus investigate other estimators. The most natural one is the graphon estimator corresponding to (3.8). It turns out that we can study the asymptotic bias of this estimator thanks to the result of Athreya and Röllin [3]. First, in Section 4.1, we provide a two-step estimator in the case where everything is observed. This new estimator is explicit: we compute the estimator (3.8) of Daudin et al. [11], and then correct the weights of the classes according to the formula of Athreya and Röllin (see (4.2)).
Then, in Section 4.2, when the Z_i's are unobserved, we propose an SAEM estimator based on the one introduced above. Here, we need knowledge of the positions X_i's of the Markov chain X^(n) when the Z_i's are missing. Notice however that (i) the knowledge of the X_i's gives partial knowledge of the types Z_i's, since the latter are determined from the X_i's once the intervals I_q are given, and (ii) the likelihood function (3.1) depends on the X_i's only through the Z_i's.
Definition 4.1. The estimator θ̂ of θ is defined in two steps.
First step: we estimate θ̃ = (α̃, π). A natural estimator is the classical MLE obtained when assuming that there are no biases. Let us therefore define:

λ_q^n := N_q^n / n,  π̂_qr^n := N_n^{q↔r} / (N_q^n N_r^n) for q ≠ r,  and  π̂_qq^n := 2N_n^{q↔q} / (N_q^n (N_q^n − 1)).    (4.1)

Second step: we correct the estimator θ̃ to obtain θ̂. In particular, we specify an estimator α̂_q of α_q obtained by correcting the estimator λ_q^n of α̃_q. For this, we set for q ∈ {1, . . ., Q}, Λ_q^n = Σ_{k=1}^{q} λ_k^n and define

α̂_q^n := Γ_n^{−1}(Λ_q^n) − Γ_n^{−1}(Λ_{q−1}^n),    (4.2)

where Γ_n is the empirical cumulative distribution function of the X_i's, see (2.14).
Proposition 4.2. Under Assumption 1, (i) for all q, r ∈ {1, . . ., Q}, λ_q^n is a consistent estimator of α̃_q, and π̂_qr^n is a consistent estimator of π_qr; (ii) α̂_q^n is a consistent estimator of α_q, where we recall the notations of (1.1) and (2.4).
In the special case Q = 2, an explicit estimator of α can be deduced from (4.2). The proof of Proposition 4.2 is done in the next section (Section 4.1.1).
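The two-step procedure (raw frequencies λ_q^n, then correction through the empirical quantile function of the X_i's) can be sketched as follows; the correction formula used here, α̂_q = Γ_n^{−1}(Λ_q^n) − Γ_n^{−1}(Λ_{q−1}^n), is our reading of (4.2), and the function name is ours:

```python
import numpy as np

def debias_alpha(X, Z, Q):
    """Two-step sketch: raw class frequencies lambda_q^n, then correction through
    the empirical quantile function of the visited positions X."""
    X = np.asarray(X)
    Z = np.asarray(Z)
    n = len(X)
    lam = np.array([(Z == q).mean() for q in range(Q)])  # biased weights lambda_q^n
    Lam = np.concatenate(([0.0], np.cumsum(lam)))        # cumulative Lambda_q^n
    Xs = np.sort(X)

    def gamma_inv(u):
        # empirical quantile: smallest order statistic with empirical cdf >= u
        if u <= 0.0:
            return 0.0
        k = int(np.ceil(u * n))
        return Xs[min(max(k, 1), n) - 1]

    return np.diff([gamma_inv(u) for u in Lam])
```

On synthetic data where class 1 occupies [0, 0.5) but is over-sampled, the correction recovers a weight close to 0.5 despite the biased raw frequency.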
We can go a little further: we indeed have two empirical approximations of the limiting graphon κ̃_θ: the graph G_n (which converges to κ̃_θ by the result of Athreya and Röllin) and the graphon χ_n associated with θ̃ and defined below (whose convergence remains to be proved). The following result shows that these two approximations are asymptotically equal, providing as a result the convergence of χ_n. It is proved in Section 4.1.2.
Proposition 4.3. The graphon associated with the estimator (λ_q^n, π̂_qr^n ; q, r ∈ {1, . . ., Q}) is defined as:

χ_n(x, y) = π̂_qr^n  if (x, y) ∈ J_q^n × J_r^n,

with J_q^n = [Λ_{q−1}^n, Λ_q^n), where the Λ_q^n are defined above (4.2). We have under Assumption 1 that: (i) λ_q^n and π̂_qr^n converge in probability when n → +∞; (ii) the limit of the empirical graphon χ_n is the biased graphon κ̃_θ.

Let us consider point (i) of Proposition 4.2. The limit for λ_q^n follows from the ergodic theorem. Indeed, we can write λ_q^n = (1/n) Σ_{i=1}^{n} 1_{X_i ∈ I_q}, and the ergodic theorem for the Markov chain (X_n)_n says that this quantity converges a.s. to m(I_q) = α̃_q. It remains to prove that π̂_qr^n is a consistent estimator of π_qr. Recall that the subgraph G_n is constructed from the Markov chain X^(n), and that each pair of non-consecutive vertices X_i and X_j is connected with probability κ_θ(X_i, X_j), depending on their types and independently of the other edges. Let us focus on the number of edges N_n^{q↔r}: two cases have to be distinguished.
Case 1, q ≠ r: The number of edges of types (q, r) is N_n^{q↔r} = Σ_{i<j} 1_{i∼_{G_n} j} 1_{(X_i,X_j)∈(I_q×I_r)∪(I_r×I_q)}. By the ergodic theorem for the Markov chain X^(n), the renormalized counts converge. Since lim_{n→+∞} λ_q^n = α̃_q > 0 in probability, there exists a constant c > 0 such that c ≤ inf_{q∈{1,...,Q}} α̃_q, and hence the first term in the right-hand side of (4.8) converges to 0 in probability.
Consider now the second term in the right-hand side of (4.8). Let us define the function f as the renormalized number of edges between the classes q and r. Since lim_{n→+∞} λ_q^n = α̃_q > 0 in probability, for fixed ε > 0, the second and the third terms on the right-hand side of (4.9) tend to zero as n tends to infinity. It remains to treat the first term. When one edge is changed, the value of f changes by at most 1/n². Applying McDiarmid's concentration inequality [25] to the function f, we obtain the convergence to 0 in probability as n → ∞. This finishes the proof for Case 1.
Case 2, q = r: The proof follows by similar arguments, with a few modifications because the expression of N_n^{q↔q} = Σ_{i<j} 1_{i∼_{G_n} j} 1_{X_i∈I_q, X_j∈I_q} is slightly different. The first term on the right-hand side of (4.10) converges in probability to 0 as in Case 1. For the second term on the right-hand side of (4.10), we define the function f as in Case 1. For a fixed ε > 0, as in Case 1, the second and the third terms on the right-hand side of the above inequality are negligible. Applying McDiarmid's concentration inequality to f, noting that when one edge of G_n is changed the value of f changes by at most 1/n², we conclude as in Case 1. For point (ii), the first term in the right-hand side of the decomposition is treated by point (i), and the second term is Proposition 2.2, shown in [3, Corollary 2.2].
Let us now consider point (i). For the sake of simplicity, we assume for the proof that there are two classes of vertices in the graph, i.e. Q = 2. The proof can be generalized to general Q by following the same steps. Our notations for the parameters are simplified as λ_1^n =: λ^n and lim_{n→+∞} λ_1^n =: α̃ = Γ(α).
Our purpose is to prove a convergence of graphons for the distance d_sub introduced in (2.7), using the densities (2.5). If F is an edge (meaning that F = K_2, the complete graph on 2 vertices), then the density of F in χ_n is Σ_{q,r} λ_q^n λ_r^n π̂_qr^n.
In the general case, where F is a graph with k vertices, a similar expansion holds. Let us first consider the case where F is an edge. By the law of large numbers, and using (4.4), whose proof does not depend on Proposition 4.3, the four terms (corresponding to the type configurations (1,1), (1,2), (2,1) and (2,2)) converge to zero.
In the general case, proceeding in a similar way leads to a bound: since the quantities π̂_qr^n and the indicators 1_{Z_i=q} are bounded by 1, there exists a constant c(k) such that the first and second terms in the right-hand side are bounded by c(k)/n. The third term is treated as in the case of an edge, by summing over the type configurations 1 ≤ q_1, . . ., q_k ≤ Q and the edges of F.

The estimator (3.8), obtained by neglecting the bias coming from the sampling scheme, can be corrected by using the inverse of the cumulative distribution function Γ of m. When the types are unobserved, we proceed in the same way. We assume here that the types Z_i are unobserved, but we need the observation of the marks X_i; otherwise, no de-biasing is possible, since the cumulative distribution function Γ cannot be estimated. We detail this estimation procedure in the case Q = 2 for the sake of simplicity, but the generalization is straightforward.
Step 1: First, we perform an estimation of the SBM neglecting the sampling biases.
• For the proposal distribution of the types Z^c, it is simpler here, since we assume that the X_i's are known. Assume that we are at step k and that we have the parameters θ^(k−1). We initialize the types by attributing type 1 to the vertices with X_i ≤ λ^(0) and type 2 to the others. At each step, the threshold λ is modified by following a random walk: a Gaussian increment (mean 0 and variance s²) is added. All the X_i's smaller than the new threshold are given the type Z_i = 1, and the others the type Z_i = 2. This Step 1 corresponds to a variational EM for the classical likelihood (3.7), for which consistency and asymptotic normality have been established by Célisse et al. [9] and Bickel et al. [4].
Step 2: We estimate the cumulative distribution function Γ_n (see (2.14)) and deduce from (4.2) the estimator α̂_1^n of α_1. This provides the estimator of κ_θ. When both the X_i's and the Z_i's are unobserved, it is not possible to compute the empirical cumulative distribution function Γ_n any more. Thus, Equation (4.2) can no longer be used to obtain an estimator of α_q from an estimator of α̃_q.
As pointed out by an anonymous Referee, from (2.2) and (2.4), we can write the relation in vectorial form, where ⊗ is the Kronecker product of two vectors. Then an estimator of the vector α = (α_1, . . . , α_Q) can be obtained by solving the equation:

(α^T π α) λ − α ⊙ (π α) = 0,    (4.15)

where ⊙ denotes the componentwise product.
For Q = 2: In this case, under the constraint α_1 + α_2 = 1, equation (4.15) reduces to a single scalar equation, which leads to a quadratic equation in α_1. Solving this second-order equation yields two solutions (4.17). These solutions can be computed numerically.
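These computations can be checked numerically. The sketch below assumes that (4.15) has the moment form α_1 (π α)_1 = λ_1 α^T π α with α_2 = 1 − α_1 (our reading of the equation; the function name is ours), expands it into a quadratic in α_1, and keeps the roots lying in [0, 1]:

```python
import numpy as np

def solve_alpha1(pi, lam1):
    """Real roots in [0, 1] of the quadratic obtained from
    alpha_1 (pi alpha)_1 = lam1 * alpha^T pi alpha, with alpha_2 = 1 - alpha_1."""
    p11, p12, p22 = pi[0, 0], pi[0, 1], pi[1, 1]
    # collect the coefficients of the quadratic in alpha_1
    c2 = (p11 - p12) - lam1 * (p11 - 2.0 * p12 + p22)
    c1 = p12 - lam1 * (2.0 * p12 - 2.0 * p22)
    c0 = -lam1 * p22
    roots = np.roots([c2, c1, c0])
    real = roots[np.abs(roots.imag) < 1e-10].real
    return real[(real >= 0.0) & (real <= 1.0)]

# parameters of the simulation study: alpha_1 = 2/3, pi_11 = 0.7, pi_12 = 0.4, pi_22 = 0.8
pi = np.array([[0.7, 0.4], [0.4, 0.8]])
alpha = np.array([2.0 / 3.0, 1.0 / 3.0])
lam1 = alpha[0] * (pi @ alpha)[0] / (alpha @ pi @ alpha)  # exact stationary weight
sols = solve_alpha1(pi, lam1)  # recovers alpha_1 = 2/3; the other root falls outside [0, 1]
```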
For Q ≥ 3: Equation (4.15) is written as: (α^T π α) λ − α ⊙ (π α) = 0. Consider the function g defined on the simplex S as the squared norm of this residual; it leads to solving the optimization problem min_{x∈S} g(x).
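A minimal numerical sketch for Q ≥ 3, assuming g is the squared Euclidean norm of the residual of (4.15) and using a generic constrained optimizer (scipy's SLSQP) over the simplex; the paper does not prescribe a particular solver, and the parameter values below are purely illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def g(x, pi, lam):
    """Squared norm of the residual (x^T pi x) lam - x * (pi x)."""
    r = (x @ pi @ x) * lam - x * (pi @ x)
    return float(r @ r)

def solve_alpha(pi, lam):
    """Minimize g over the probability simplex, starting from the uniform point."""
    Q = len(lam)
    res = minimize(
        g, np.full(Q, 1.0 / Q), args=(pi, lam),
        bounds=[(0.0, 1.0)] * Q,
        constraints=({"type": "eq", "fun": lambda x: x.sum() - 1.0},),
    )
    return res.x

# toy example with Q = 3 (illustrative parameter values)
pi = np.array([[0.7, 0.4, 0.2], [0.4, 0.8, 0.3], [0.2, 0.3, 0.6]])
alpha = np.array([0.5, 0.3, 0.2])
lam = alpha * (pi @ alpha) / (alpha @ pi @ alpha)  # exact stationary weights
alpha_hat = solve_alpha(pi, lam)
```

By construction g(α) = 0 when λ is computed from the true α, so the minimizer recovers a root of (4.15) lying in the simplex.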
The RDS graphs consist of n = 50 vertices.
We proceed to the five estimations presented in this paper:
• Maximum likelihood on complete data: the algorithm of Section 3.1 for complete observations, assuming that the types Z_i ∈ {1, 2} are observed.
• SAEM: the algorithm of Section 3.2.1 when the types Z_i are unobserved.
The SAEM is based on an iteration on k and we perform K = 200 iterations.
• De-biased graphon: the computation of the estimators given in Proposition 4.2, assuming complete observations.
• De-biased graphon with SAEM: again, we use an SAEM algorithm for the likelihood (3.7), and then use the same de-biasing technique as in item 3 above (see Section 4.2). Again, we use K = 200 SAEM iterations.
• De-biased graphon by solving the algebraic equation (2.4) for the α_q's.
We proceed to a Monte-Carlo study of the estimators' distributions. We simulate 200 RDS graphs and, for each of them, apply the five estimation strategies. The empirical distributions of the estimators are represented in Fig. 2, which allows us to estimate the associated mean squared errors (MSE) for each method; see Table 1.
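The Monte-Carlo procedure used here (simulate many data sets, estimate on each, average the squared errors) can be summarized in a few lines; the estimator and simulator below are placeholders of ours, not the paper's RDS pipeline:

```python
import numpy as np

def monte_carlo_mse(estimator, simulate, truth, n_rep=200, seed=0):
    """Empirical mean squared error of `estimator` over `n_rep`
    independently simulated data sets."""
    rng = np.random.default_rng(seed)
    estimates = np.array([estimator(simulate(rng)) for _ in range(n_rep)])
    return float(np.mean((estimates - truth) ** 2))

# toy check: MSE of the sample mean of 50 Uniform(0, 1) draws,
# whose theoretical value is Var/n = (1/12)/50
mse = monte_carlo_mse(np.mean, lambda rng: rng.uniform(size=50), truth=0.5)
```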
Unsurprisingly, for the maximum likelihood estimation, the estimation is better when we have complete observations (compare columns 1 and 2). Note that the use of the SAEM algorithm could be accelerated, as discussed in the conclusion. For the graphon de-biasing, the methods with incomplete observations perform well, sometimes equally to the methods with complete observations. When the types Z_i are not observed, we achieve better MSEs with the de-biasing of the classical SAEM method of Daudin et al. Notice first that columns 2 and 4 of Table 1 are not completely equivalent, since the de-biasing methods of Section 4 necessitate the knowledge of the positions X_i of the Markov chain, whereas the likelihood (3.1) necessitates only the connections Y_ij and the types Z_i's. Second, the updating of the types in the SAEM algorithm is easier in Section 4.2 when the X_i's are known, since it amounts to choosing the threshold that separates types 1 and 2. Finally, the SAEM algorithm on the classical likelihood (3.7) seems to converge more easily than for the likelihood (3.1).

Conclusion
Five statistical methods are studied in this paper for estimating SBM parameters from a subgraph obtained by the exploration of the graphon with a Markov chain:
• Two methods built on the maximum likelihood.
- The first one is the classical maximum likelihood estimator on the complete data, and necessitates the observation of the types Z_i's and of the edges Y_ij's of G_n. See Section 3.1.
-The second method is an SAEM estimation procedure that can be used when only the connectivities Y ij 's are observed.
• Three methods built on the de-biasing formula of Athreya and Röllin [3].
- The first one is on the complete data, and necessitates the observation of the positions X_i's, the types Z_i's and the edges Y_ij's of G_n. See Section 4.1.
- The second method is a variation starting from the SAEM estimation procedure of Daudin et al. in the absence of sampling bias. The latter estimation can be used when only the connectivities Y_ij's are observed, but the de-biasing using the cumulative distribution function Γ needs information on the positions X_i's (though not the complete knowledge of the types Z_i's).
-The last one solves an algebraic equation satisfied by the α q 's and obtained from (2.4).This method does not require the knowledge of the X i 's but only of the Y ij 's.
This is a toy model for estimating random networks from chain-referral sampling techniques, in which sampling biases arise. The first two methods compute the maximum likelihood estimator when the types of the nodes are known or unknown. On simulations, it appears that the SAEM algorithm used when the types are unobserved is not very robust and yields relatively large MSEs. However, the relatively rough SAEM algorithm that we use here might be improved by using Metropolis-Hastings and Gibbs algorithms with a refined exploration of the state space of the Z_i's.
An alternative approach takes advantage of recent results by Athreya and Röllin [3]: this allows us to correct the classical SBM estimators that would be used if one ignored the sampling biases. These methods provide good estimators, but they rely on the precise knowledge of the Markov chain exploring the SBM graphon (in particular the positions X_i's), which is not always available.

Fig 2. Estimation on complete data for a graph of n = 60 vertices with Q = 2 classes and parameters α_1 = 2/3, π_11 = 0.7, π_12 = π_21 = 0.4 and π_22 = 0.8. 500 such graphs are simulated and the empirical distributions of the estimators are represented here, with the true parameters as a red line. On each graph: the MLE with complete observation (Section 3.1) is in continuous black line, the SAEM estimator (Section 3.2) in blue dashed line, the graphon estimator with complete observation (Section 4.1) in dash-dotted pink line, the graphon estimator with incomplete observation and SAEM algorithm (Section 4.2.1) in yellow dotted line, and the graphon estimator with incomplete observation and algebraic equations (Section 4.2.2) in brown long-dashed line. (a): estimator of α, (b): estimator of π_11, (c): estimator of π_12, (d): estimator of π_22.
almost surely as n tends to infinity. Thus, point (i) is proved.

4.1.2. Proof of Proposition 4.3: Limit of χ_n
Because t(F, G_n) and t(F, χ_n) are bounded independently of n, this provides the announced result.

4.2. Incomplete observations and graphon de-biasing

4.2.1. Case where Z_i is unobserved but X_i is
In Proposition 4.2, it is shown that the 'classical' SBM estimator (4.13)

4.2.2. Case where both X_i and Z_i are unobserved