Sparse and Smooth: improved guarantees for Spectral Clustering in the Dynamic Stochastic Block Model

In this paper, we analyse classical variants of the Spectral Clustering (SC) algorithm in the Dynamic Stochastic Block Model (DSBM). Existing results show that, in the relatively sparse case where the expected degree grows logarithmically with the number of nodes, guarantees in the static case can be extended to the dynamic case, and yield improved error bounds when the DSBM is sufficiently smooth in time, that is, when the communities do not change too much between two time steps. We improve over these results by drawing a new link between the sparsity and the smoothness of the DSBM: the smoother the DSBM is, the sparser it can be, while still guaranteeing consistent recovery. In particular, a mild condition on the smoothness allows us to treat the sparse case with bounded degree. We also extend these guarantees to the normalized Laplacian, and as a by-product of our analysis, we obtain, to our knowledge, the best spectral concentration bound available for the normalized Laplacian of matrices with independent Bernoulli entries.


Introduction
In recent years, the study of dynamic networks has emerged as a topic of great interest to model complex phenomena that evolve with time, such as interactions in social networks, the spread of infectious diseases or opinions, or information packets in computer networks. In light of this, many random graph models, traditionally static (non-dynamic), have been extended to the dynamic case, see [16,20] for reviews. One of the most popular uses of dynamic networks consists in detecting and tracking communities of well-connected nodes, for instance users of a social network [49,47,44], a task also known as clustering. In this context, the classical Stochastic Block Model (SBM) [19], in which intra- and inter-community edges are drawn independently with some prescribed probabilities, is a natural starting point. In the sparse case, a major line of work, initiated by statistical physics arguments, is the characterization of regimes of parameters in which there exists (or not) an algorithm that asymptotically performs better than random guessing. However, this approach does not cover the classic SC algorithm (which will generally fail [21]), and the case where the number of communities K is larger than 2 is still largely open. In the dynamic case, a conjecture on the detectability threshold is given in [15]. In parallel, other works study the sparse case by regularizing the adjacency matrix or normalized Laplacian of the graph before the SC algorithm [23,24,7].
In [25], Lei and Rinaldo provide strong, non-asymptotic consistency guarantees for the classic SC algorithm on the adjacency matrix (without regularization) in the relatively sparse case $\alpha_n \gtrsim \frac{\log n}{n}$, showing that the proportion of misclassified nodes tends to 0 with a probability that goes to 1 as the number of nodes $n$ increases. Their recovery results are valid for any $K$, potentially growing slowly with $n$. In [37], Pensky and Zhang extend this analysis to a particular DSBM, referred to as the "deterministic" DSBM in the sequel, for the SC algorithm applied to the adjacency matrix smoothed over a finite window. In this case, another key quantity is the temporal smoothness of the model $\varepsilon_n$, that is, the proportion of nodes that may change community between two time steps (the smaller $\varepsilon_n$, the smoother the model). They showed that, in the relatively sparse case, if the model is sufficiently smooth, that is, $\varepsilon_n = o\left(\frac{1}{\log n}\right)$, then the error bound of the static case can be improved. However, their analysis still takes place in the relatively sparse case, even when $\varepsilon_n$ is very low. In [4,35], the authors consider constant cluster memberships ($\varepsilon_n = 0$) but changing probabilities of connection.

Contributions
In this paper, we follow the analyses of [25] and [37] and significantly extend them in several ways.
- Our main contribution is to draw a new link between the sparsity $\alpha_n$ and the smoothness $\varepsilon_n$ in the analysis of the DSBM: we show that the smoother the model, the sparser it can be, while still guaranteeing consistency. In particular, a mildly strengthened condition $\varepsilon_n \sim \frac{1}{\log^2 n}$ allows us to give consistency guarantees in the sparse case $\alpha_n \sim \frac{1}{n}$.
- We extend the result to the normalized Laplacian. As a by-product, in the static case, we obtain, to our knowledge, the best spectral concentration bound available (of order $\frac{1}{\sqrt{\log n}}$) in the relatively sparse case $\alpha_n \sim \frac{\log n}{n}$.
- We also improve the rate of the error bounds with respect to the number of communities $K$ when the probabilities of connection between communities decrease with $K$, in both the static [25] and dynamic [37] cases.
- Finally, we extend our results to the Markov DSBM introduced in [49], and to the SC algorithm with an "exponentially smoothed" matrix, used in [8,45] and appropriate in a streaming computing framework.
We also define the (diagonal) degree matrix $D(A)$ by $D(A)_{ii} = \sum_j A_{ij}$. For any symmetric matrix $A$ such that $\sum_j A_{ij} \neq 0$ for all $i$, the normalized Laplacian $L(A)$ is defined as $L(A) = D(A)^{-\frac{1}{2}} A D(A)^{-\frac{1}{2}}$. We note that, typically in the literature [10], the normalized Laplacian is defined as the symmetric matrix $\mathrm{Id} - D(A)^{-\frac{1}{2}} A D(A)^{-\frac{1}{2}}$. However, SC is mainly concerned with the eigenvectors of the Laplacian, which are the same for both variants.
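To fix ideas, this variant of the normalized Laplacian, and its relation to the $\mathrm{Id} - D^{-1/2} A D^{-1/2}$ variant, can be sketched in a few lines of NumPy (a minimal illustration, not part of the paper's code; `normalized_laplacian` is a hypothetical helper name):

```python
import numpy as np

def normalized_laplacian(A):
    # L(A) = D(A)^{-1/2} A D(A)^{-1/2}; requires all degrees to be nonzero
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return A * np.outer(d_inv_sqrt, d_inv_sqrt)

A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
L = normalized_laplacian(A)
mu, V = np.linalg.eigh(L)                # eigenpairs of L(A)
mu2, V2 = np.linalg.eigh(np.eye(3) - L)  # eigenpairs of Id - L(A)
# if L v = mu v, then (Id - L) v = (1 - mu) v: the two variants share
# eigenvectors, with eigenvalues mapped to 1 - mu
```

The last two lines make the claim of the text concrete: the two conventions have the same eigenvectors, with eigenvalues mapped by $\mu \mapsto 1 - \mu$.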
Stochastic Block Model. Let us start by introducing the classical static SBM. We adopt the following notations: $n$ is the number of nodes and $K$ the number of communities. Each node belongs to exactly one community. We denote by $\Theta \in \mathcal{M}_{n,K} \subset \{0,1\}^{n\times K}$ the 0-1 matrix representing the memberships of the nodes, where for each node $i$, $\Theta_{ik} = 1$ indicates that it belongs to the $k$th community, and is 0 otherwise, and $\mathcal{M}_{n,K}$ is the set of all such membership matrices. Given $\Theta$, for $i < j$, we have $A_{ij} \sim \mathrm{Ber}\left((\Theta B \Theta^\top)_{ij}\right)$, where $B \in [0,1]^{K\times K}$ is a symmetric connectivity matrix, and $\mathrm{Ber}(p)$ indicates a Bernoulli random variable with parameter $p$. Finally, we define $P = \Theta B \Theta^\top \in \mathbb{R}^{n\times n}$ the matrix storing the probabilities of connection between two nodes off its diagonal, and we have $\mathbb{E}(A) = P - \mathrm{diag}(P)$ since the diagonal of $A$ is zero. Typically, $B$ has high diagonal terms and low off-diagonal terms. We will consider $B$ of the form $B = \alpha_n B_0$ for some $\alpha_n \in (0,1)$ and $B_0 \in [0,1]^{K\times K}$ whose elements are denoted by $(B_0)_{k\ell}$. It is known that the rate of $\alpha_n$ when $n \to \infty$ is the main key quantity when analyzing the properties of random graphs. Typical settings include $\alpha_n \sim 1$ (dense graphs), $\alpha_n \sim \frac{1}{n}$ (sparse graphs), or middle grounds such as $\alpha_n \sim \frac{\log n}{n}$, usually referred to as "relatively sparse" graphs. As we will see, strong guarantees of consistency can be given in the relatively sparse case, while the sparse case is hard to analyze and only partially understood.
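The generative model just described can be sketched as a minimal NumPy sampler (hypothetical helper names; the triangular masking simply enforces symmetry and a zero diagonal):

```python
import numpy as np

def sample_sbm(theta, B, rng):
    # theta: (n,) community labels in {0..K-1}; B: (K,K) connectivity matrix
    # A_ij ~ Ber((Theta B Theta^T)_ij) independently for i < j, zero diagonal
    P = B[np.ix_(theta, theta)]            # P = Theta B Theta^T
    n = len(theta)
    U = rng.random((n, n))
    A = (np.triu(U, 1) < np.triu(P, 1)).astype(int)
    return A + A.T                         # symmetrize

# B = alpha_n * B0 with B0 having 1 on the diagonal and tau outside
rng = np.random.default_rng(0)
n, K, alpha_n, tau = 200, 2, 0.2, 0.1
B0 = (1 - tau) * np.eye(K) + tau * np.ones((K, K))
theta = np.repeat(np.arange(K), n // K)
A = sample_sbm(theta, alpha_n * B0, rng)
```

The sampler draws only the strictly upper triangle and mirrors it, matching the convention that $A$ is symmetric with no self-loops.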
For some maximum and minimum community sizes $n_{\max} \gtrsim \frac{n}{K}$ and $n_{\min} \lesssim \frac{n}{K}$, we define the set of admissible community sizes $\mathcal{N}$ as the set of sizes $(n_1, \ldots, n_K)$ such that $n_{\min} \le n_k \le n_{\max}$ for all $k$, and define $\bar n_{\max} \stackrel{\text{def.}}{=} \max_k \sum_\ell (B_0)_{k\ell}\, n_\ell$ and $\bar n_{\min} \stackrel{\text{def.}}{=} \min_k \sum_\ell (B_0)_{k\ell}\, n_\ell$. These quantities are such that the expected degrees will be comprised between $\alpha_n \bar n_{\min}$ and $\alpha_n \bar n_{\max}$. For simplicity, we will sometimes express our results with $B_0$ equal to: $B_0 = (1-\tau)\,\mathrm{Id}_K + \tau\,\mathbf{1}\mathbf{1}^\top$. (3) In other words, $B$ contains $\alpha_n$ on its diagonal and $\tau\alpha_n$ outside. For this expression of $B_0$, we have $\bar n_{\max} = (1-\tau)n_{\max} + \tau n$, and similarly for $\bar n_{\min}$. Interestingly, in the case of balanced communities $n_{\max}, n_{\min} \sim \frac{n}{K}$, we then have $\bar n_{\max}, \bar n_{\min} \sim (1-\tau)\frac{n}{K} + \tau n$.

Dynamic SBM. The Dynamic SBM (DSBM) is a random model generating an adjacency matrix $A_0, \ldots, A_t$ at each time step. Each $A_t$ is generated according to a classical SBM with constant number of nodes $n$, number of communities $K$ and connectivity matrix $B$, but changing node memberships $\Theta_t$. Note that several works consider a changing number of nodes [46] or a changing connectivity matrix [37], but for simplicity we assume that they are constant in time here. Note that, using simple triangle inequalities as in [37], it would not be difficult to integrate a changing connectivity matrix $B$ into our results. We will consider two potential models on the $\Theta_t$.
- The simplest one, adopted in [37], is to consider that $\Theta_0, \ldots, \Theta_t$ are deterministic variables. In this case, we will assume that only a number $s \le n$ of nodes change communities between each time step $t-1$ and $t$, and denote by $\varepsilon_n = s/n$ this relative proportion of nodes. We will also assume that at all time steps, the community sizes are comprised between some $n_{\min}$ and $n_{\max}$, which will typically be of the order of $n/K$ for balanced communities. As a shorthand, we will simply refer to this model as the deterministic DSBM (keeping in mind that the $A_t$ are still random).
- In the second model, similar to [49], we assume that the node memberships follow a Markov chain, such that between two time steps, each node has a probability $1 - \varepsilon_n$ of staying in the same community, and $\varepsilon_n$ of moving to any other community, chosen uniformly at random among the $K-1$ other communities. Then, conditionally on the $\Theta_t$, the $A_t$ are drawn independently according to an SBM. The global model is thus a Hidden Markov Model (HMM). We will simply refer to this case as the Markov DSBM. Note that, in this case, it is rather difficult to quantify, in a non-asymptotic manner, the probability of having bounded community sizes globally holding for all time steps. Hence $\bar n_{\max}, \bar n_{\min}$ will not intervene in our analysis of this case.
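One step of the Markov membership dynamics can be sketched as follows (a hypothetical helper, assuming, as described above, that a moving node picks one of the $K-1$ other communities uniformly):

```python
import numpy as np

def markov_step(theta, K, eps, rng):
    # each node keeps its community with probability 1 - eps, and otherwise
    # moves uniformly to one of the K - 1 other communities
    n = len(theta)
    move = rng.random(n) < eps
    shift = rng.integers(1, K, size=n)      # shift in {1, ..., K-1}
    new_theta = theta.copy()
    new_theta[move] = (theta[move] + shift[move]) % K
    return new_theta

rng = np.random.default_rng(1)
K, n, eps = 3, 1000, 0.05
theta = rng.integers(0, K, size=n)
theta_next = markov_step(theta, K, eps, rng)
changed = np.mean(theta != theta_next)      # close to eps in expectation
```

The modular shift guarantees that a moving node never lands back in its own community, so the expected fraction of changed nodes is exactly $\varepsilon_n$.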
Goal and error measure. The goal of a clustering algorithm is to give an estimator $\hat\Theta$ of the node memberships $\Theta$, up to permutation of the community labels. We consider the following measure of discrepancy between $\Theta$ and an estimator $\hat\Theta$ [25]: $E(\hat\Theta, \Theta) \stackrel{\text{def.}}{=} \frac{1}{n}\min_{J \in \mathcal{P}_K}\|\hat\Theta J - \Theta\|_0$, where $\mathcal{P}_K$ is the set of permutation matrices of $[K]$ and $\|\cdot\|_0$ counts the number of non-zero elements of a matrix. While other error measures are possible, as we will see, one can generally relate them to a spectral concentration property, which will be the main focus of this paper. In the dynamic case, a possible goal is to estimate $\Theta_1, \ldots, \Theta_t$ for all time steps simultaneously [45,37]. Here we consider a slightly different goal: at a given time step $t$, we seek to estimate $\Theta_t$ with the best precision possible, by exploiting past data. In general, this will give rise to methods that are computationally lighter than simultaneous estimation of all the $\Theta_t$'s, and more amenable to streaming computing, where one maintains an estimator without having to keep all past data in memory. Naturally, such methods could be applied independently at each time step to produce estimators of all the $\Theta_t$'s, but this is not the primary goal here.
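Such a permutation-invariant error is straightforward to compute for small $K$ by brute force over the $K!$ label permutations (a hypothetical helper, stated here in terms of label vectors rather than membership matrices):

```python
import numpy as np
from itertools import permutations

def misclassification(theta_hat, theta, K):
    # fraction of misclassified nodes, minimized over label permutations
    # (brute force over K! permutations; fine for small K); up to a factor 2,
    # this matches the ||.||_0-based measure, since each misclassified node
    # makes two entries of Theta_hat J - Theta differ
    n = len(theta)
    best = n
    for perm in permutations(range(K)):
        relabeled = np.array(perm)[theta_hat]
        best = min(best, int(np.sum(relabeled != theta)))
    return best / n

theta     = np.array([0, 0, 1, 1, 2, 2])
theta_hat = np.array([2, 2, 0, 0, 1, 1])  # same clustering, permuted labels
```

Here `misclassification(theta_hat, theta, 3)` is 0: the two labelings describe the same partition.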
Spectral Clustering (SC) algorithm. Spectral Clustering [32] is nowadays one of the leading methods to identify communities in an unsupervised setting. The basic idea is to solve the K-means problem [28] on the $K$ leading eigenvectors $E_K$ of either the adjacency matrix or the (normalized) Laplacian. Solving the K-means problem, i.e., obtaining $(\hat\Theta, \hat C) \in \arg\min_{\Theta \in \mathcal{M}_{n,K},\, C \in \mathbb{R}^{K\times K}} \|\Theta C - E_K\|_F^2$, is known to be NP-hard, but several approximation algorithms, such as [22], are known to produce $(1+\delta)$-approximate solutions $(\hat\Theta, \hat C)$ in polynomial time. The SC algorithm is summarized in Algorithm 1.
In the dynamic case, a typical approach to exploit past data is to replace the adjacency matrix $A_t$ with a version "smoothed" in time $A^{\mathrm{smooth}}_t$, and feed either $\hat P = A^{\mathrm{smooth}}_t$ or the corresponding Laplacian $\hat L = L(A^{\mathrm{smooth}}_t)$ to the classical SC algorithm. In [37], the authors consider the smoothed adjacency matrix defined as an average over the last $r$ values: $A^{\mathrm{unif}}_t = \frac{1}{r}\sum_{k=0}^{r-1} A_{t-k}$. (6) Note that, in the original paper, the authors sometimes consider non-uniform weights due to potential changes in time of the connectivity matrix $B_t$, but in our case we consider a fixed $B$, and thus uniform weights $\frac{1}{r}$. In this paper, we will also consider the "exponentially smoothed" estimator proposed by [8,9,48], which is computed recursively as: $A^{\exp}_t = \lambda A_t + (1-\lambda) A^{\exp}_{t-1}$, (7) for some "forgetting factor" $\lambda \in (0,1]$, and $A^{\exp}_0 = A_0$. Compared to the uniform estimator (6), this kind of estimator is somewhat more amenable to streaming and online computing, since only the current $A^{\exp}_t$ needs to be stored in memory instead of the last $r$ values $A_t, A_{t-1}, \ldots, A_{t-r+1}$. Note however that $A^{\exp}_t$ may be denser than a typical adjacency matrix, so the memory gain is sometimes mitigated depending on the case. Note that in fact $A^{\mathrm{unif}}_t$ and $A^{\exp}_t$ can both be written as a weighted sum $A^{\mathrm{smooth}}_t = \sum_{k=0}^{t} \beta_k A_{t-k}$, and our results are actually valid for any weights $\beta_k$ satisfying some assumptions, see eq. (25). Uniform and exponential weights are the most commonly used in practice. In Fig. 1, we illustrate the performance of the SC algorithm on a synthetic DSBM example. As expected [39], the normalized Laplacian $L(A^{\exp}_t)$ generally performs better than $A^{\exp}_t$. Interestingly, the optimal forgetting factor $\lambda$ is slightly different from one to the other, and the normalized Laplacian reaches a higher performance altogether. We then compare $A^{\mathrm{unif}}_t$ and $A^{\exp}_t$. As we will see in the sequel, taking $r \sim \frac{1}{\lambda}$ often results in the same performance for both estimators.
However, a clear advantage of the exponential estimator is that it is not limited to discrete window sizes, but has a continuous forgetting factor. As such, A exp t with the optimal λ often reaches a better performance than A unif t with the optimal r.
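The equivalence between the recursion (7) and the weighted-sum form can be checked numerically (a minimal sketch with hypothetical helper names):

```python
import numpy as np

def exp_smooth(As, lam):
    # recursive estimator A^exp_t = lam * A_t + (1 - lam) * A^exp_{t-1},
    # with A^exp_0 = A_0: only the current smoothed matrix is kept in memory
    S = As[0].astype(float)
    for A in As[1:]:
        S = lam * A + (1 - lam) * S
    return S

# equivalently, A^exp_t = sum_k beta_k A_{t-k} with beta_k = lam (1-lam)^k
# for k < t, and beta_t = (1 - lam)^t
rng = np.random.default_rng(2)
As = [rng.integers(0, 2, size=(5, 5)) for _ in range(4)]
lam, t = 0.3, 3
S_rec = exp_smooth(As, lam)
beta = [lam * (1 - lam) ** k for k in range(t)] + [(1 - lam) ** t]
S_sum = sum(b * As[t - k] for k, b in enumerate(beta))
```

The two computations agree entry-wise, and the weights sum to one, as required by the general framework of Section 6.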

From Spectral Clustering to spectral norm concentration
As described in [25], a key quantity for analyzing the SC algorithm is the concentration of the adjacency matrix around its expectation in spectral norm. As a first contribution, we prove the following lemma, which is a generalisation of this result to the normalized Laplacian.

Lemma 1. Let $P = \Theta B \Theta^\top$ correspond to some SBM with $K$ communities, where $n_{\max}$, $n'_{\max}$ and $n_{\min}$ are respectively the largest, second-largest and smallest community sizes. Assume $B = \alpha_n B_0$ for any $B_0$ with smallest eigenvalue $\gamma$. Let $\hat P$ be an estimator of $P$, and $\hat\Theta$ be the output of Algorithm 1 on $\hat P$ with a $(1+\delta)$-approximate k-means algorithm. Then $E(\hat\Theta, \Theta) \lesssim (1+\delta)\,\frac{K n'_{\max}}{\gamma^2 \alpha_n^2 n_{\min}^2 n}\,\|\hat P - P\|^2$. (8) Similarly, if $\hat L$ is an estimator of $L(P)$ and $\hat\Theta$ is the output of Algorithm 1 on $\hat L$, it holds that $E(\hat\Theta, \Theta) \lesssim (1+\delta)\,\frac{K n'_{\max}\bar n_{\max}^2}{\gamma^2 n_{\min}^2 n}\,\|\hat L - L(P)\|^2$. (9) When $B_0$ is defined as (3), we have $\gamma = 1 - \tau$.
The proof of this lemma is deferred to Appendix A.1. The first bound (8) was proved in [25]; we extend it to the Laplacian case. Note that $\hat L$ could be an estimator of $L(P)$ without being of the form $\hat L = L(M)$ for some matrix $M$.
Using this lemma, in the static SBM case, the goal is to find estimators $\hat P$ or $\hat L$ that concentrate around $P$ or $L(P)$ in spectral norm. In the dynamic case, where the goal is to estimate the communities at a particular time $t$, we seek the best estimators for $P_t$ or $L(P_t)$. As outlined in the previous section, we will consider smoothed versions of the adjacency matrix $A^{\mathrm{smooth}}_t$, and prove concentration of $A^{\mathrm{smooth}}_t$ around $P_t$, and of $L(A^{\mathrm{smooth}}_t)$ around $L(P_t)$.

Remark 1.
Assuming that all community sizes are of the order of $\frac{n}{K}$ and that $\tau$ is fixed, the error in the adjacency case (8) scales as $\frac{K^2}{n^2\alpha_n^2}\|\hat P - P\|^2$, and in the normalized Laplacian case the error (9) scales as $K^2\|\hat L - L(P)\|^2$. Also note that, when $\bar n_{\max} \sim \frac{n}{K}$, the error (9) is of order $\|\hat L - L(P)\|^2$. This does not explicitly depend on $\alpha_n$ or $K$; however, these quantities will naturally appear in the concentration of the Laplacian.
The next sections will therefore be devoted to analyzing the spectral concentration rates of the various estimators. Table 1 summarizes our results and compares them with previous works. As we will see in the next section, our main contribution is to weaken the hypothesis on the sparsity $\alpha_n$ and relate it to the smoothness $\varepsilon_n$ of the DSBM. We also provide the best bound available for the normalized Laplacian in the static case, and the first bound in the dynamic case.
In Figure 2, we illustrate numerically the spectral concentration of $A^{\exp}_t$ and $L(A^{\exp}_t)$, and their actual clustering performance, with respect to the forgetting factor $\lambda$. We see that there is a slight discrepancy between the $\lambda$ that minimizes the spectral bound and the one that yields the best clustering result. As we will see in the next sections, the $\lambda$ that minimizes the spectral error is theoretically of the order of $\sqrt{\alpha_n n \varepsilon_n}$. This rate is indeed verified numerically for the spectral error; however, the actual best clustering performance deviates slightly. This indicates that spectral norm concentration probably does not yield sharp bounds on the performance of k-means and SC. We consider this to be an open, difficult question, as many analyses of k-means rely on spectral properties. This is left for future investigations.

Spectral concentration of the adjacency matrix
We start by recalling the result of [25] in the static case and prove an interesting improvement in some cases; then we examine the result for the DSBM of [37] and improve over it.

Static case
In their landmark paper [25], Lei and Rinaldo analyze the relatively sparse case $\alpha_n \gtrsim \frac{\log n}{n}$ and show that, with probability at least $1 - n^{-\nu}$ for some $\nu > 0$, the adjacency matrix concentrates as $\|A - P\| \lesssim \sqrt{n\alpha_n}$. (10) Therefore, by Lemma 1, using $A$ as an estimator for $P$ in an SC algorithm leads to an error $E(\hat\Theta, \Theta) \lesssim \frac{K^2}{n\alpha_n}$. As a minor contribution, we remark that it is not hard to prove the following proposition, which improves over their result in the regime where $\alpha_n$ is slightly larger than $\frac{\log n}{n}$ and $B_0$ is defined as (3).

Proposition 1. Consider a static SBM where $B_0$ is defined as (3), assume that the community sizes $n_1, \ldots, n_K$ are comprised between $n_{\min}$ and $n_{\max}$, and that $\alpha_n \gtrsim \frac{\log n}{n_{\min}}$ and $\tau\alpha_n \gtrsim \frac{\log n}{n}$.
Then, for all $\nu > 0$, there exists a constant $C_\nu$ such that, with probability at least $1 - n^{-\nu}$, $\|A - P\| \le C_\nu\left(\sqrt{\alpha_n n_{\max}} + \sqrt{\tau\alpha_n n}\right)$.

Proof. Denote by $S_1, \ldots, S_K \subset [n]$ the subsets of indices of each community, and assume without loss of generality that the nodes are ordered such that the $S_k$ are consecutive in $[n]$. Denote by $A_k \in \{0,1\}^{n_k\times n_k}$ the adjacency matrix of the subgraph of nodes from the $k$th community; note that by our assumption on $B$, within each community the probabilities of connection are all equal to $\alpha_n$. Denote by $\bar A \in \{0,1\}^{n\times n}$ the block matrix containing the $A_k$ on its diagonal of blocks, and similarly $\bar P$, and decompose $\|A - P\| \le \|\bar A - \bar P\| + \|(A - \bar A) - (P - \bar P)\|$, where $\|\bar A - \bar P\| = \max_k\|A_k - P_k\|$, the equality being valid because $\bar A - \bar P$ is a block diagonal matrix. From Lei and Rinaldo's result above, for each $k$, if $\alpha_n \gtrsim \frac{\log n_k}{n_k}$, then with probability at least $1 - n_k^{-\nu}$ we have $\|A_k - P_k\| \lesssim \sqrt{\alpha_n n_k} \le \sqrt{\alpha_n n_{\max}}$. For the second term, we note that $A - \bar A$ is an adjacency matrix generated by the SBM corresponding to $P - \bar P$, whose maximal probability is $\tau\alpha_n$. Hence, if $\tau\alpha_n \gtrsim \frac{\log n}{n}$, then with probability $1 - n^{-\nu}$ we have $\|(A - \bar A) - (P - \bar P)\| \lesssim \sqrt{\tau n\alpha_n}$. We conclude with a union bound.
This proposition provides a better error rate than [25] when $\tau$ goes to 0 with $K$, at the price of requiring a higher $\alpha_n$. For instance, when the community sizes are balanced $n_k \sim \frac{n}{K}$, and we have $\tau \sim \frac{1}{K}$ and $\alpha_n \sim \frac{K\log n}{n}$, Lei and Rinaldo's rate (10) yields $E(\hat\Theta, \Theta) \lesssim \frac{K}{\log n}$, which converges only for $K = o(\log n)$, while using Proposition 1 we get $E(\hat\Theta, \Theta) \lesssim \frac{1}{\log n}$. The latter does not depend on $K$, which brings a strict improvement compared to (10). Recall however that $\tau$ and $\alpha_n$ do depend on $K$ here, and that there must be a $\nu > 0$ such that $K^{\nu+1} n^{-\nu} \to 0$ to obtain a probability rate that goes to 1.
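The block decomposition at the heart of the proof is easy to check numerically: the triangle-inequality split holds exactly, and the spectral norm of the block-diagonal part is the maximum of the per-block norms. This is an illustrative sketch, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(3)

# two balanced communities, B = alpha ((1 - tau) Id + tau 11^T)
n_k, K, alpha, tau = 100, 2, 0.3, 0.1
n = n_k * K
theta = np.repeat(np.arange(K), n_k)
B = alpha * ((1 - tau) * np.eye(K) + tau * np.ones((K, K)))
P = B[np.ix_(theta, theta)]
U = rng.random((n, n))
A = (np.triu(U, 1) < np.triu(P, 1)).astype(float)
A = A + A.T

# block-diagonal part (within-community entries) vs the rest
mask = (theta[:, None] == theta[None, :]).astype(float)
Abar, Pbar = A * mask, P * mask
spec = lambda M: np.linalg.norm(M, 2)   # spectral norm
lhs = spec(A - P)
rhs = spec(Abar - Pbar) + spec((A - Abar) - (P - Pbar))
```

By construction `lhs <= rhs` (triangle inequality), and `spec(Abar - Pbar)` equals the largest of the two within-community block norms.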

Dynamic case
In [37], Pensky and Zhang analyze the dynamic case with Lei and Rinaldo's proof technique. They consider the deterministic DSBM model in the relatively sparse case $\alpha_n \gtrsim \frac{\log n}{n}$ and the uniform estimator (6). Defining a factor $\rho^{(PZ)}_n \stackrel{\text{def.}}{=} \sqrt{\alpha_n n \varepsilon_n}$, they show that, for an optimal choice of window size $r \sim \frac{1}{\rho^{(PZ)}_n}$, $\|A^{\mathrm{unif}}_t - P_t\| \lesssim \sqrt{n\alpha_n\rho^{(PZ)}_n}$, (14) which is valid for $t$ sufficiently large to avoid degenerate situations, in conditions similar to that of Theorem 1 below. In particular, the concentration is better than the static rate (10) if $\rho^{(PZ)}_n \to 0$; in the relatively sparse case, this corresponds to $\varepsilon_n = o\left(\frac{1}{\log n}\right)$. (15) In other words, there is an improvement if we assume sufficient smoothness in time, which then leads to a better error rate $E(\hat\Theta, \Theta) \lesssim \frac{K^2\rho^{(PZ)}_n}{\alpha_n n}$ when using

N. Keriven and S. Vaiter
$A^{\mathrm{unif}}_t$ in the SC algorithm. Note that, with this proof technique, a constant smoothness $\varepsilon_n \sim 1$ does not improve the error rate (see Remark 2).
We remark that, despite the assumption on the smoothness and the availability of more data, the result above still assumes the relatively sparse case. However, with sufficient smoothness, it should be possible to weaken the hypothesis on the sparsity $\alpha_n$: intuitively, if more data is available in which the communities are almost the same as at the present time step, the density of edges should not need to be as large. We address this in the following theorem, which is the central contribution of this paper.

Theorem 1 (Adjacency matrix, deterministic DSBM). Consider the deterministic DSBM, and define $\rho_n \stackrel{\text{def.}}{=} \sqrt{\alpha_n\bar n_{\max}\varepsilon_n}$. Assume that $\frac{\alpha_n}{\rho_n} \gtrsim \frac{\log n}{n}$. (17) Consider either the uniform estimator (6) with window size $r \sim \frac{1}{\rho_n}$, or the exponential estimator (7) with forgetting factor $\lambda \sim \rho_n$. Then, for all $\nu > 0$, there is a constant $C_\nu$ such that, for $t$ larger than some $t_{\min}$, with probability at least $1 - n^{-\nu}$, $\|A^{\mathrm{smooth}}_t - P_t\| \le C_\nu\sqrt{n\alpha_n\rho_n}$. (18)
This result is proved in Section 6.2. In this theorem, we improve over [37] in several ways. First, we improve $\rho^{(PZ)}_n$ to $\rho_n$ by replacing $n$ with $\bar n_{\max} \le n$. In the case where the $(B_0)_{k\ell}$ stay bounded, for instance if $B_0$ is defined as (3) with $\tau \sim \frac{1}{K}$, we have $\bar n_{\max} \sim \frac{n}{K}$, and this improves the bound (18) compared to (14). We also extend the result to the exponential estimator with the right choice of forgetting factor. In fact, it can be seen in the proof that this result is valid for a more general class of estimators based on weighted averages, to which the uniform and exponential estimators belong, see Section 6.
More importantly, the main feature of our result is the weaker condition (17), which relates the sparsity and the smoothness of the DSBM. Strikingly, if $\varepsilon_n \lesssim \frac{1}{\log^2 n}$, (19) which is a slight strengthening of (15), then our result is valid in the sparse regime $\alpha_n \sim \frac{1}{n}$, which is a significant improvement compared to previous works. In any case, if we have exactly $\frac{\alpha_n}{\rho_n} \sim \frac{\log n}{n}$, then as previously Lemma 1 yields that $E(\hat\Theta, \Theta) \to 0$ when $K = o(\sqrt{\log n})$. At the limit, when $\varepsilon_n \to 0$ (and the number of steps $t_{\min}$ grows accordingly), the results stay valid for even sparser graphs $\alpha_n \to 0$, as long as (17) holds.
Proposition 2 (Adjacency matrix, Markov DSBM). Consider the Markov DSBM, define $\bar\varepsilon_n \stackrel{\text{def.}}{=} \max(\varepsilon_n, \log n/n)$ and $\bar\rho_n \stackrel{\text{def.}}{=} \sqrt{\alpha_n n\bar\varepsilon_n}$, and consider the estimators of Theorem 1 with $\rho_n$ replaced by $\bar\rho_n$. If $\frac{\alpha_n}{\bar\rho_n} \gtrsim \frac{\log n}{n}$, (20) then with probability at least $1 - n^{-\nu}$, it holds that $\|A^{\mathrm{smooth}}_t - P_t\| \lesssim \sqrt{n\alpha_n\bar\rho_n}$. In this case, "with probability at least $1 - n^{-\nu}$" refers to the joint probability on both the $A_t$ and the $\Theta_t$.
The above result (proved in Section 6.2.3) shows that the Markov DSBM yields the exact same error bounds as the deterministic DSBM model, but since we do not assume a maximal community size here, $\bar n_{\max}$ is replaced with $n$. Furthermore, the rate of the deterministic case is reached when $\varepsilon_n \gtrsim \log n/n$, or, in other words, when $n$ is sufficiently large as $\varepsilon_n$ gets small. Indeed, as $\varepsilon_n$ gets small, concentration bounds on the Markov chain need to hold over an increasingly large number of steps, and as a result $n$ needs to grow accordingly. Nevertheless, the condition $\varepsilon_n \gtrsim \log n/n$ is much weaker than the rate (19) for instance, such that the sparse regime with sufficient smoothness remains valid. In this particular Markov model, it may still be possible to improve these results in the limit $\varepsilon_n \to 0$, which we leave for future investigations.

Remark 2.
As already observed in [37], with this proof technique, a constant $\varepsilon_n$, or in other words, a fraction of changing nodes $s$ that grows linearly with $n$, does not result in an improvement of the rate of the error bounds compared to the static case. Following the statistical physics approach in the sparse static case [21,31,1], a conjecture on the detectability threshold in the sparse case with $\varepsilon_n \sim 1$ has been formulated in [15], but the proof is still open. Note that, as mentioned before, even in the static case, this analysis does not cover the classic SC algorithm, or the case $K > 2$.

Spectral concentration of the normalized Laplacian
As mentioned in the introduction, the spectral concentration of the normalized Laplacian has been less studied than the adjacency matrix, even in the static case. Many works study the asymptotic spectral convergence of the normalized Laplacian in the dense case [43], but few examine non-asymptotic bounds.

Static case
Among the few existing bounds, [34] proves a concentration in $O(1)$ in the relatively sparse case, and [38] proves a concentration in Frobenius norm but with the stronger condition $\alpha_n \gtrsim \frac{1}{\sqrt{\log n}}$. An important corollary of our study of the dynamic case is to significantly improve over these results and obtain, to our knowledge, the best bound available in the relatively sparse case. We state the following proposition for any Bernoulli matrix (not necessarily an SBM).

Proposition 3 (Normalized Laplacian, static case). Let $A$ be a symmetric matrix with independent entries $a_{ij} \sim \mathrm{Ber}(p_{ij})$. Assume $p_{ij} \le \alpha_n$, that there are $\bar n_{\min}, \bar n_{\max}$ such that for all $i$, $\alpha_n\bar n_{\min} \le \sum_j p_{ij} \le \alpha_n\bar n_{\max}$, and let $\mu_B = \frac{\bar n_{\max}}{\bar n_{\min}}$. For all $\nu > 0$, there are constants $C_\nu, C'_\nu$ such that: if $\alpha_n \ge C_\nu\frac{\log n}{\bar n_{\min}}$, (21) then with probability at least $1 - n^{-\nu}$ we have $\|L(A) - L(P)\| \le C'_\nu\sqrt{\frac{\mu_B}{\alpha_n\bar n_{\min}}}$.

Proof. This is a direct consequence of Theorem 4 in Section 6.

In other words, when $\bar n_{\min} \sim n$ (for instance when all the $p_{ij}/\alpha_n$ are bounded below), then in the relatively sparse case the spectral concentration of the normalized Laplacian is in $\frac{1}{\sqrt{\log n}}$, which is a strict improvement over existing bounds.

Let us comment a bit on the condition (21). Since $\bar n_{\min} \le n$, it is stronger than the relatively sparse case. The attentive reader will also remark the subtle interplay of the quantifiers with the rate $\nu$: in the analysis of the adjacency matrix in the previous section, any multiplicative constant between $\alpha_n$ and $\frac{\log n}{n}$ was acceptable, and the rate $\nu$ only forced a multiplicative constant $C_\nu$ in the final error bound. Here, the rate $\nu$ also imposes a multiplicative constant $C_\nu$ in the sparsity hypothesis, which is essential to avoid having too many nodes with small degrees [25,7].

Dynamic case
To our knowledge, the normalized Laplacian in the DSBM has never been studied theoretically. Our result is the following.
Theorem 2 (Normalized Laplacian, deterministic DSBM). Consider the deterministic DSBM under the assumptions of Theorem 1, with $\mu_B = \frac{\bar n_{\max}}{\bar n_{\min}}$. If $\alpha_n \ge C_\nu\,\rho_n\frac{\log n}{\bar n_{\min}}$, (22) then with probability at least $1 - n^{-\nu}$, it holds that $\|L(A^{\mathrm{smooth}}_t) - L(P_t)\| \le C'_\nu\sqrt{\frac{\mu_B\rho_n}{\alpha_n\bar n_{\min}}}$. This result is proved in Section 6.3. In the case of balanced communities, the result of Theorem 2 combined with Lemma 1 yields the same error rate as in the case of the adjacency matrix with Theorem 1 and Lemma 1, even in terms of $K$ when $\bar n_{\min}, \bar n_{\max} \sim \frac{n}{K}$. Thus all the observations of the previous section remain valid, including the fact that $A^{\mathrm{smooth}}_t$ may belong to a more general class of estimators (see Section 6). However, in the Laplacian case, the condition (22) is slightly stronger than (17), similar to the static case. In practice though, it is well known that the normalized Laplacian generally performs better (Fig. 1). This is not fully explained by our theory and is left for future work. Finally, note that we only provide the result in the deterministic DSBM case; indeed, the use of the normalized Laplacian requires the existence of a minimal community size $\bar n_{\min}$, which is not always guaranteed in the Markov DSBM.

Proofs
In this section, we provide the proof of our main results, largely inspired by [25] and [37]. The technical computations are given in appendix. Despite some similarity with [25] and [37], we strove to make the proofs self-contained.

Preliminaries
We place ourselves at a particular time $t$. Both estimators $A^{\mathrm{unif}}_t$ and $A^{\exp}_t$ can be written as a weighted sum $A^{\mathrm{smooth}}_t = \sum_{k=0}^{t}\beta_k A_{t-k}$, (24) where $\beta_0 = \ldots = \beta_{r-1} = \frac{1}{r}$ and $\beta_k = 0$ for $k \ge r$ in the uniform case, and $\beta_k = \lambda(1-\lambda)^k$ for $k < t$ and $\beta_t = (1-\lambda)^t$ in the exponential case. As we will see, our results will be valid for any estimator of the form (24), with weights $\beta_k \ge 0$ that satisfy: there are constants $\beta_{\max}, C_\beta, C'_\beta > 0$ such that $\sum_k\beta_k = 1, \quad \beta_k \le \beta_{\max}, \quad \sum_k\beta_k^2 \le C_\beta\,\beta_{\max}, \quad \sum_k\beta_k\min\left(1,\sqrt{k\varepsilon_n}\right) \le C'_\beta\sqrt{\varepsilon_n/\beta_{\max}}. \quad (25)$ In words, the weights must naturally sum to 1 and be bounded; the sum of their squares must be small; and they must decrease faster than the rate $\sqrt{k}$ at which the past communities $\Theta_{t-k}$ deviate from $\Theta_t$. It is not difficult to show that the uniform and exponential estimators satisfy these conditions.

Lemma 2. The weights in the uniform estimator (6) satisfy (25) with $\beta_{\max} = \frac{1}{r}$, and for $t$ large enough, the weights in the exponential estimator (7) satisfy (25) with $\beta_{\max} = \lambda$, $C_\beta = \frac{3}{2}$, $C'_\beta = 2$.

Proof. The computations are trivial in the uniform case, where the last condition is implied by the stronger property $\sum_k\beta_k\sqrt{k} \lesssim \sqrt{r} = \sqrt{1/\beta_{\max}}$. In the exponential case, we have $\beta_{\max} = \lambda$, and $\sum_k\beta_k^2 = \lambda^2\sum_{k<t}(1-\lambda)^{2k} + (1-\lambda)^{2t} \le \frac{\lambda}{2-\lambda} + (1-\lambda)^{2t} \le \frac{3}{2}\lambda$, where the last inequality is valid since $t \ge \frac{\log\beta_{\max}}{2\log(1-\beta_{\max})}$, and thus $C_\beta = \frac{3}{2}$. Next, we bound $\sum_k\beta_k\min(1,\sqrt{k\varepsilon_n})$; since $t \ge \frac{\log(\varepsilon_n/\beta_{\max})}{\log(1-\beta_{\max})}$, we obtain the desired inequality with $C'_\beta = 2$.
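The conditions on the weights (summing to one, bounded by $\beta_{\max}$, squared sum of order $\beta_{\max}$, and the smoothness condition involving $\min(1,\sqrt{k\varepsilon_n})$) can be checked numerically for both families. This is an illustrative sketch with hypothetical helper names and generous constants:

```python
import numpy as np

def uniform_weights(r, t):
    b = np.zeros(t + 1)
    b[:r] = 1.0 / r
    return b

def exp_weights(lam, t):
    b = lam * (1 - lam) ** np.arange(t + 1)
    b[t] = (1 - lam) ** t          # last weight absorbs the remaining mass
    return b

eps_n, t = 1e-3, 2000
for beta, beta_max in [(uniform_weights(50, t), 1 / 50),
                       (exp_weights(0.02, t), 0.02)]:
    assert abs(beta.sum() - 1) < 1e-9            # weights sum to one
    assert beta.max() <= beta_max + 1e-12        # bounded by beta_max
    assert (beta ** 2).sum() <= 2 * beta_max     # sum of squares O(beta_max)
    k = np.arange(t + 1)
    lhs = (beta * np.minimum(1, np.sqrt(k * eps_n))).sum()
    assert lhs <= 2 * np.sqrt(eps_n / beta_max)  # smoothness condition
```

Both families pass with room to spare, matching the claim that uniform and exponential weights fit the general framework.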

Concentration of adjacency matrix: proof of Theorem 1
For an estimator of the form (24), our goal is to bound $\|A^{\mathrm{smooth}}_t - P_t\|$. We define $P^{\mathrm{smooth}}_t \stackrel{\text{def.}}{=} \sum_{k=0}^{t}\beta_k P_{t-k}$, and divide the error in two terms: $\|A^{\mathrm{smooth}}_t - P_t\| \le \underbrace{\|A^{\mathrm{smooth}}_t - P^{\mathrm{smooth}}_t\|}_{E_1} + \underbrace{\|P^{\mathrm{smooth}}_t - P_t\|}_{E_2}$. (26) The first error term corresponds to the difference between $A^{\mathrm{smooth}}_t$ and its expectation (up to the diagonal terms). Intuitively, it decreases when the amount of smoothing increases, that is, when $r$ increases or $\lambda$ gets close to 0, since the sum of matrices is taken over more values. The second term is the difference between the smoothed matrix of probabilities of connection and its value at time $t$. This time, it will increase when the amount of smoothing increases, since the past communities will be increasingly present in $P^{\mathrm{smooth}}_t$. Once we have the two bounds, we can balance them to obtain an optimal value for $r$ or $\lambda$, respectively $\frac{1}{\rho_n}$ and $\rho_n$.

Bound on the first term
The first bound will be handled by the following general concentration theorem. This is where we are able to weaken the hypothesis on the sparsity.

Theorem 3. Let $A_1, \ldots, A_t \in \{0,1\}^{n\times n}$ be $t$ symmetric Bernoulli matrices whose elements $a^{(k)}_{ij} \sim \mathrm{Ber}(p^{(k)}_{ij})$ are independent random variables. Assume $p^{(k)}_{ij} \le \alpha_n$. Consider non-negative weights $\beta_k$ that satisfy (25). Denoting $A = \sum_{k=0}^{t}\beta_k A_{t-k}$ and $P = \mathbb{E}(A)$, there is a universal constant $C$ such that for all $c > 0$ we have $\mathbb{P}\left(\|A - P\| \ge C(1+c)\sqrt{n\alpha_n\beta_{\max}}\right) \le e^{-c'\frac{n\alpha_n}{\beta_{\max}}}$, for some $c' > 0$ depending on $c$.

This theorem is proved in Appendix A.2. Its proof is heavily inspired by [25] and [37]: the spectral norm is expressed as a maximization problem over the sphere, and for each point of the sphere the obtained sum is divided into so-called "light" terms, for which Bernstein's concentration inequality is sufficient, and more problematic "heavy" terms, which require a more involved concentration method. We obtain our weaker sparsity hypothesis in a small but crucial part of this second step, the so-called bounded degree lemma.
Lemma 3 (Bounded degree). In the setting of Theorem 3, let $d_i \stackrel{\text{def.}}{=} \sum_j\sum_k\beta_k a^{(k)}_{ij}$ denote the smoothed degrees, so that $\mathbb{E}d_i \le n\alpha_n$. For all $c > 0$, $\mathbb{P}\left(\max_i d_i \ge (1+c)\,n\alpha_n\right) \le n\exp\left(-\frac{c^2}{2(C_\beta + c/3)}\cdot\frac{n\alpha_n}{\beta_{\max}}\right)$.

Proof. We use Bernstein's inequality. For any fixed $i$, the variables $Y_{jk} = \beta_k a^{(k)}_{ij}$ are independent, bounded by $\beta_{\max}$, and satisfy $\sum_{j,k}\mathrm{Var}(Y_{jk}) \le n\alpha_n\sum_k\beta_k^2 \le C_\beta\,n\alpha_n\beta_{\max}$. Therefore, applying Bernstein's inequality, we have $\mathbb{P}\left(d_i \ge (1+c)n\alpha_n\right) \le \exp\left(-\frac{(cn\alpha_n)^2}{2(C_\beta n\alpha_n\beta_{\max} + \beta_{\max}cn\alpha_n/3)}\right) = \exp\left(-\frac{c^2}{2(C_\beta + c/3)}\cdot\frac{n\alpha_n}{\beta_{\max}}\right)$. Applying a union bound over the nodes $i$ proves the result.
In the static case [25], where $\beta_{\max} = 1$, the bounded degree lemma is exactly where the relative sparsity hypothesis $\alpha_n \gtrsim \frac{\log n}{n}$ is needed, otherwise the probability of failure diverges. In the dynamic case, we see that $\beta_{\max}$ (which we will ultimately set at $\rho_n$) intervenes and gives our final hypothesis on sparsity and smoothness.
Applying Theorem 3, we obtain that, for any fixed $\Theta_0, \ldots, \Theta_t$, if $\frac{n\alpha_n}{\beta_{\max}} \gtrsim \log n$, then for any $\nu > 0$ there is a constant $C_\nu$ such that, with probability at least $1 - n^{-\nu}$, $\|A^{\mathrm{smooth}}_t - \mathbb{E}(A^{\mathrm{smooth}}_t)\| \le C_\nu\sqrt{n\alpha_n\beta_{\max}}$. Since in all considered cases we will have $\beta_{\max} \gtrsim 1/n$, and $\mathbb{E}(A^{\mathrm{smooth}}_t)$ differs from $P^{\mathrm{smooth}}_t$ only on the diagonal, the second term is negligible, and we obtain $E_1 = \|A^{\mathrm{smooth}}_t - P^{\mathrm{smooth}}_t\| \le C_\nu\sqrt{n\alpha_n\beta_{\max}}$. (28)

Second term
The second error term in (26) is handled slightly differently in the deterministic and Markov DSBM, even if the final bound is the same.

Lemma 4. Under the deterministic DSBM, it holds that $E_2 = \|P^{\mathrm{smooth}}_t - P_t\| \le \sqrt{2}\,C'_\beta\,\alpha_n\sqrt{n\bar n_{\max}\cdot\frac{\varepsilon_n}{\beta_{\max}}}$. (29)

Proof. Since the weights sum to 1, we decompose $\|P^{\mathrm{smooth}}_t - P_t\| \le \sum_k\beta_k\|P_{t-k} - P_t\| \le \sum_k\beta_k\|P_{t-k} - P_t\|_F$, where $\|\cdot\|_F$ is the Frobenius norm. Consider $P = \Theta B\Theta^\top$ and $P' = \Theta' B(\Theta')^\top$ two probability matrices such that there is a set $S$ of nodes that have changed communities. We have then: $\|P - P'\|_F^2 \le 2\sum_{i\in S}\sum_j\max\left(p_{ij}^2, p'^2_{ij}\right)$. Since $\sum_j p_{ij}^2 \le \alpha_n^2\bar n_{\max}$ and at most $ks$ nodes have changed community between $P_t$ and $P_{t-k}$, with a maximum of $n$ nodes, we have $\|P_{t-k} - P_t\|_F^2 \le 2\alpha_n^2\bar n_{\max}\min(n, ks) = 2\alpha_n^2 n\bar n_{\max}\min(1, k\varepsilon_n)$. (30)
Using the hypothesis that we have made on $\sum_k\beta_k\min(1,\sqrt{k\varepsilon_n})$, we obtain the desired bound.
At the end of the day, combining (26), (28) and (29), for both the deterministic and Markov DSBM models we obtain, with the desired probability: $\|A^{\mathrm{smooth}}_t - P_t\| \lesssim \sqrt{n\alpha_n\beta_{\max}} + \alpha_n\sqrt{n\bar n_{\max}\cdot\frac{\varepsilon_n}{\beta_{\max}}}$. As expected, $E_1$ decreases and $E_2$ increases when $\beta_{\max}$ decreases. A simple function study shows that the sum of the errors is minimized for $\beta_{\max} = \rho_n$, which concludes the proof of Theorem 1.
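For reference, the balancing step can be written out explicitly (a sketch with the multiplicative constants and the $C'_\beta$ factors omitted):

```latex
% Sum of the two error terms as a function of beta_max:
\[
  f(\beta_{\max}) \;=\; \underbrace{\sqrt{n\alpha_n\beta_{\max}}}_{E_1}
    \;+\; \underbrace{\alpha_n\sqrt{\tfrac{n\bar n_{\max}\varepsilon_n}{\beta_{\max}}}}_{E_2}.
\]
% For f(b) = a sqrt(b) + c / sqrt(b), setting f'(b) = 0 gives b = c / a:
\[
  \beta_{\max}
  \;=\; \frac{\alpha_n\sqrt{n\bar n_{\max}\varepsilon_n}}{\sqrt{n\alpha_n}}
  \;=\; \sqrt{\alpha_n\bar n_{\max}\varepsilon_n}
  \;=\; \rho_n,
  \qquad
  f(\rho_n) \;\asymp\; \sqrt{n\alpha_n\rho_n}.
\]
```

At the optimum, the two terms are of the same order, which recovers the rate $\sqrt{n\alpha_n\rho_n}$.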

Markov DSBM
Since the bound on the first term (28) is valid for any fixed $\Theta_k$, and the $A_k$ are conditionally independent given the $\Theta_k$, by the law of total probability it is also valid with joint probability at least $1 - n^{-\nu}$ on both the $A_k$ and $\Theta_k$ in the Markov DSBM model. For the bound on the second term, we show that (30) is still valid with high probability, replacing $\bar n_{\max}$ with $n$.

Lemma 5. Consider the Markov DSBM model. With probability at least $1 - n^{-\nu}$, for all $k$, the number of nodes whose community differs between $\Theta_{t-k}$ and $\Theta_t$ is at most of the order of $nk\bar\varepsilon_n$, where $\bar\varepsilon_n \stackrel{\text{def.}}{=} \max(\varepsilon_n, \log n/n)$.

Hence, if (20) is satisfied, we obtain that with probability at least $1 - n^{-\nu}$, (30) is satisfied for all $k$, with $n$ instead of $\bar n_{\max}$ and $\bar\varepsilon_n$ instead of $\varepsilon_n$. Using the rest of the proof of Lemma 4, (29) is valid in the Markov DSBM model, with $n$ instead of $\bar n_{\max}$ and $\bar\varepsilon_n$ instead of $\varepsilon_n$. The rest of the proof is the same as in the deterministic case.

Concentration of Laplacian: proof of Theorem 2
A crucial part of handling the normalized Laplacian is to lower-bound the degrees of the nodes, since we later manipulate the inverse of the degree matrix. Under our hypotheses, the minimal expected degree is of the order of $\alpha_n\bar n_{\min}$, so we need to bound the deviation of the degrees with respect to this quantity. We revisit the bounded degree lemma.
Proof. We follow the same proof as for Lemma 3, but remark that $\sum_{k,j} \mathrm{Var}(Y_{jk}) \le \alpha_n \bar n_{\max}$ for all $i$. Therefore, by Bernstein's inequality, we have the stated deviation bound. Applying a union bound over the nodes $i$ proves the result.
To lower-bound $d_i$, we use Lemma 6 with $0 < c < 1$, for instance $c = \frac{1}{2}$. The sparsity hypothesis (22) in the theorem comes directly from this: it uses $\bar n_{\min}$ instead of $n$, and the multiplicative constant $C_\nu$ actually depends on the desired concentration rate $\nu$, unlike the previous case of the adjacency matrix, where $\nu$ could be obtained by adjusting $c$ in Lemma 3. Let us now turn to the proof of the theorem.
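To illustrate the regime in which this degree bound operates, one can simulate a snapshot with expected degree a few multiples of $\log n$ and check that almost all degrees stay within a factor $1 \pm c$ of their expectation for $c = 1/2$ (a sketch with homogeneous $p_{ij}$, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
alpha = 5 * np.log(n) / n                 # "relatively sparse": expected degree ~ 5 log n
A = rng.random((n, n)) < alpha
A = np.triu(A, 1)
A = (A + A.T).astype(float)               # symmetric adjacency, empty diagonal
d = A.sum(axis=1)
d_exp = alpha * (n - 1)                   # expected degree, identical for every node here

c = 0.5
frac_bad = np.mean(np.abs(d - d_exp) > c * d_exp)
print(d.min() > 0, frac_bad < 0.05)
```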
As before, we divide the bound into two parts: the first is handled with a general concentration theorem. Let $A_1, \dots, A_t \in \{0,1\}^{n \times n}$ be $t$ symmetric Bernoulli matrices whose elements $a^{(k)}_{ij}$ are independent random variables with parameters $p^{(k)}_{ij} \le \alpha_n$, and assume that there are $\bar n_{\min}, \bar n_{\max}$ such that for all $i$, $\alpha_n \bar n_{\min} \le \sum_j p_{ij} \le \alpha_n \bar n_{\max}$. Then there is a universal constant $C$ such that for all $c > 0$ the stated concentration bound holds. The proof is in Appendix A.3. Similarly to the adjacency matrix case, we thus obtain the first bound, and by Lemma 11, the second term is negligible since $E(A^{\mathrm{smooth}}_t)$ and $P^{\mathrm{smooth}}_t$ only differ on their diagonal, which is of the order of $\alpha_n$. The second bound is handled in the same way as for the adjacency matrix in the deterministic case.
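The remark that the diagonal difference is negligible can be checked directly: an adjacency matrix has no self-loops, so $E(A)$ and $P$ differ exactly by $\mathrm{diag}(p_{ii})$, and the spectral norm of a diagonal matrix is its largest entry, hence at most $\alpha_n$ (toy numbers below):

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha = 500, 0.05                      # illustrative values
P = alpha * rng.random((n, n))
P = (P + P.T) / 2                         # symmetric probability matrix, p_ij <= alpha
EA = P.copy()
np.fill_diagonal(EA, 0.0)                 # E(A): no self-loops, so zero diagonal

gap = np.linalg.norm(EA - P, 2)           # spectral norm of the diagonal difference
assert abs(gap - np.diag(P).max()) < 1e-10
print(gap <= alpha)  # True
```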

Lemma 7. Under the deterministic DSBM, we have
The proof is in Appendix A.4. At the end of the day, we obtain the final bound, which is minimized for the same choice of $\beta_{\max} \sim \rho_n$.

Conclusion and outlooks
In the DSBM, it should come as no surprise that a model that is very smooth should not need to be as dense as when dealing with a single snapshot. Our analysis is the first to show this, for classic SC, in a non-asymptotic manner. Under a slightly stronger condition on the smoothness than that in [37], we showed that strong consistency guarantees can be obtained even in the sparse case. We extended the results to the normalized Laplacian and, although we obtain the same final error rate as for the adjacency matrix, our analysis also yields, to our knowledge, the best non-asymptotic spectral concentration bound for the normalized Laplacian of Bernoulli matrices with independent edges. In this theoretical paper, we did not discuss how to select in practice the various parameters of the algorithms, such as the number of communities K, the forgetting factor λ, or the window size r. This is left for future investigations, as well as the analysis of varying K, n, or B. As we mentioned in Remark 2, an outstanding conjecture about the sparse case and ε_n ∼ 1 is formulated in [15]. Finally, our new spectral concentration bound for the normalized Laplacian, which shows that $\|L(A) - L(P)\| \to 0$ in the relatively sparse case, may have consequences in other asymptotic analyses of the spectral convergence of the normalized Laplacian [43,40,26].

A.1. Proof of Lemma 1
From [25, Section 5.4], for any matrix $M \in \mathbb{R}^{K \times K}$ and $Q = \Theta M \Theta^\top$, given an estimator $\hat Q$ that we feed to the SC algorithm, it holds that the clustering error is controlled in terms of $\|\hat Q - Q\|$ and $\gamma_M$, where $\gamma_M$ is the smallest eigenvalue of $M$. When using the adjacency matrix $\hat P = A$ to estimate the probability matrix $P = \Theta B \Theta^\top$, we have $B = \alpha_n B_0$ and $\gamma_M = \alpha_n \gamma$, which gives us (8). When $B_0$ is defined as in (3), we have $\gamma = 1 - \tau$.
In the Laplacian case, for a node $i \le n$ belonging to a community $k \le K$, the expected degree only depends on $k$; hence the Laplacian of the probability matrix $L(P)$ can be written as:

A.2. Proof of Theorem 3
The proof is heavily inspired by [25]. Define $P_k = E(A_k)$, $W_k = A_k - P_k$ with elements $w^{(k)}_{ij}$, and their respective smoothed versions $A = \sum_{k=0}^{t} \beta_k A_{t-k}$, $P = E(A)$, $W = A - P$, with elements $a_{ij}$, $p_{ij}$, $w_{ij}$. Denote by $S$ the Euclidean ball in $\mathbb{R}^n$ of radius 1. The proof strategy of [25] is to define a grid $T$ and simply note that (Lemma 2.1 in the supplementary of [25]): Hence we must bound this last quantity. To do this, for each given $(x, y)$ in $T$, we divide their indices into "light" pairs $L(x,y)$ and "heavy" pairs $H(x,y)$, and bound each of these two terms separately.

A.2.1. Bounding the light pairs
To bound the light pairs, Bernstein's concentration inequality is sufficient.

Lemma 8 (Bounding the light pairs). We have, for all constants $c > 0$:

Proof. The proof is immediate by applying Bernstein's inequality. Take any $(x, y) \in T$ and denote $C = \sqrt{\alpha_n/(\beta_{\max} n)}$. Define $u_{ij} = x_i y_j 1_{(i,j) \in L(x,y)} + x_j y_i 1_{(j,i) \in L(x,y)}$ (which is necessary because the edges $(i,j)$ and $(j,i)$ are not independent). The summands are independent random variables $Y_{ijk}$ such that $E Y_{ijk} = 0$ and $|Y_{ijk}| \le 2C\beta_{\max}$, since $|u_{ij}| \le 2C$, $\beta_k \le \beta_{\max}$ and $|w^{(t-k)}_{ij}| \le 1$, and the total variance is controlled as well. Hence, applying Bernstein's inequality, we obtain the desired tail bound. Then, we use the fact that $|T| \le e^{n \log(14)}$ (see the proof of Lemma 3.1 in [25]) and a union bound to conclude.
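As an illustration of the tool being used (not of the paper's specific constants), Bernstein's inequality $P(|S| \ge t) \le 2\exp\big(-t^2/(2(\sigma^2 + Mt/3))\big)$ for a sum of centered variables bounded by $M$ can be compared against simulation:

```python
import numpy as np

rng = np.random.default_rng(5)
trials, n, p = 5000, 1000, 0.05           # illustrative values
M = 1.0                                   # each centered Bernoulli is bounded by 1
X = (rng.random((trials, n)) < p) - p     # centered Bernoulli(p) variables
S = X.sum(axis=1)

sigma2 = n * p * (1 - p)                  # variance of the sum
t = 3 * np.sqrt(sigma2)
empirical = np.mean(np.abs(S) >= t)
bernstein = 2 * np.exp(-t ** 2 / (2 * (sigma2 + M * t / 3)))
print(empirical <= bernstein)  # True
```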

A.2.2. Bounding the heavy pairs
To bound the heavy pairs, two main lemmas are required: the so-called bounded degree lemma (Lemma 3) and the bounded discrepancy lemma, presented below. As mentioned before, the bounded degree lemma is key in improving the sparsity hypothesis, despite the simplicity of its proof. The bounded discrepancy lemma is closer to its original proof in [25], which we reproduce here for completeness. Of course, by symmetry it is also valid for $|J| \le |I|$ with the same probability (inverting the roles of $I$ and $J$ in the bounds).
To prove it, we need the following Lemma.
Lemma 10 (Adapted from Lemma 9 in [37]). Let $X_1, \dots, X_n$ be independent variables satisfying the stated assumptions. Then, for all $t \ge 7$, we have the stated tail bound.

Proof. For some $\lambda > 0$ to be fixed later, we have $E(e^{\lambda w_i X_i}) = p_i e^{\lambda w_i (1 - p_i)} + (1 - p_i) e^{-\lambda w_i p_i}$. Hence, for $t \ge 7$ and $\lambda = \log(1+t)/w_{\max}$, the bound follows. Since $\log(1+t) \le t$ and $\sum_i w_i p_i \le \mu$, we conclude. The lemma above is slightly stronger than Bernstein's inequality in this particular case: with Bernstein, we would have obtained an exponent in $O(t)$ instead of $O(t \log(1+t))$. Now we can prove the bounded discrepancy lemma.
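The gain can be quantified with elementary arithmetic: an exponent $\mu t \log(1+t)$ eventually dominates any linear exponent $c\mu t$, so the lemma's tail is strictly smaller for large deviations (illustrative $\mu$ and $c$, not from the paper):

```python
import math

mu, c = 1.0, 3.0                                # illustrative constants
for t in [10, 100, 1000]:
    lemma_exponent = mu * t * math.log(1 + t)   # exponent t*log(1+t) from the lemma
    bernstein_exponent = c * mu * t             # a Bernstein-type linear exponent c*t
    if math.log(1 + t) > c:
        # beyond this point the lemma's tail exp(-lemma_exponent) is strictly smaller
        assert lemma_exponent > bernstein_exponent
print(math.log(1 + 1000) > c)  # True: the log factor eventually beats any fixed constant
```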
Proof of Lemma 9. We assume that the bounded degree property (Lemma 3) holds, which implies that for all $I, J$, it holds that: Thus we now consider the pairs $I, J$ where both have size less than $n/e$, and such that $|I| \le |J|$ without loss of generality. For such a given pair $I, J$, we decompose the edge count, where the sum over $(i,j)$ counts each distinct edge between $I$ and $J$ only once, with $Y_{ijk}$ the corresponding centered variables.

We can now prove the bound on the heavy pairs; that is, we want to prove with high probability: Since $p_{ij} \le \alpha_n$ and by definition of the heavy pairs, for all $x, y \in T$: Hence our goal is now to bound $\sup_{x,y \in T} \sum_{(i,j) \in H(x,y)} x_i y_j a_{ij}$. We will show that, when the bounded degree and bounded discrepancy properties hold, this sum is bounded for all $x, y$. From now on, we assume that these properties hold, and consider any $x, y \in T$.

Let us define sets of indices $I_s, J_t$ over which we bound $x_i$ and $y_j$ uniformly, and replace the sum of the $a_{ij}$ over these sets by $e(I_s, J_t)$. Since we consider heavy pairs, we need only consider indices $(s,t)$ such that $2^{s+t} \ge 16\, \alpha_n n/\beta_{\max}$, and we define $C_n = \alpha_n n / \beta_{\max}$ for convenience. Moreover, we have $\sum_{i \in I_s, j \in J_t} a_{ij} \le 2 e(I_s, J_t)$, since each edge appears at most twice in the sum. Hence, we have:

We now introduce more notation. We denote $\mu_s = 4^s |I_s|/n$, $\nu_t = 4^t |J_t|/n$, $\gamma_{st} = e(I_s, J_t)/(\alpha_n |I_s| |J_t|)$, and $\sigma_{st} = \gamma_{st} C_n 2^{-(s+t)}$. We reformulate (35) as: $\sum_{(i,j) \in H(x,y)} x_i y_j a_{ij} \lesssim \sum_{(s,t) : 2^{s+t} \ge 16 C_n} \mu_s \nu_t \sigma_{st}$. Our goal is therefore to show that $\sum_{s,t} \mu_s \nu_t \sigma_{st} \lesssim 1$. For this, we will make extensive use of the fact that $\mu_s \le 16 \sum_{i \in I_s} x_i^2$, and therefore $\sum_s \mu_s \le 16$, and similarly $\sum_t \nu_t \le 16$.
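The bookkeeping facts $\mu_s \le 16 \sum_{i \in I_s} x_i^2$ and $\sum_s \mu_s \le 16$ can be checked numerically for any unit vector. The check below assumes the dyadic level sets $I_s = \{i : 2^{s-2}/\sqrt n < |x_i| \le 2^{s-1}/\sqrt n\}$, which is an assumption of this sketch rather than the paper's exact definition:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                    # any unit vector

total = 0.0
for s in range(0, 40):                    # levels high enough to cover |x_i| <= 1
    lo = 2.0 ** (s - 2) / np.sqrt(n)
    hi = 2.0 ** (s - 1) / np.sqrt(n)
    I = (np.abs(x) > lo) & (np.abs(x) <= hi)   # assumed dyadic level set I_s
    mu_s = 4.0 ** s * I.sum() / n
    # for i in I_s, x_i^2 > 4^(s-2)/n, so mu_s <= 16 * sum of x_i^2 over I_s
    assert mu_s <= 16 * (x[I] ** 2).sum() + 1e-12
    total += mu_s
print(total <= 16)  # True, since sum_s mu_s <= 16 ||x||^2 = 16
```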
Following the original proof of [25], let $\mathcal{C} = \{(s,t) : 2^{s+t} \ge 16 C_n, |I_s| \le |J_t|\}$, divided into six subsets. Similarly, we define $\bar{\mathcal{C}} = \{(s,t) : 2^{s+t} \ge 16 C_n, |I_s| \ge |J_t|\}$ and the $\bar{\mathcal{C}}_i$ in the same way, by inverting the roles of $\mu_s$ and $\nu_t$. We write the proof for $\mathcal{C}$; the other case is strictly symmetric. Our goal is to prove that each of the sums $\sum_{(s,t) \in \mathcal{C}_i} \mu_s \nu_t \sigma_{st}$ is bounded by a constant.
$$P\left(\left|\sum_{i,j} x_i y_j p_{ij} \delta_i\right| \ge c\, \bar n_{\max} \sqrt{n \beta_{\max}}\, \alpha_n^{3/2}\right) \le 2 \exp\left(-\frac{c^2/2}{2 C_\beta + 2c/3} \cdot n\right).$$
Using a union bound over $T$, we can conclude.

A.4. Additional proofs
Proof of Lemma 5. Recall that we defined $\bar\varepsilon_n \stackrel{\mathrm{def.}}{=} \max(\varepsilon_n, \log n/n)$. For any $k$, we have $p^{t-k}_{ij} = p^t_{ij}$ if $z^{t-k}_i = z^t_i$ and $z^{t-k}_j = z^t_j$, that is, if the nodes have not changed communities. Thus we write the corresponding decomposition. The indicator events are independent and each occurs with probability $(1 - \varepsilon_n)^k$. By Hoeffding's inequality, for some $\delta > 0$ that we will fix later, with probability at least $1 - 2e^{-2\delta^2 n}$, using $1 - x^k = (1-x)(1 + x + \dots + x^{k-1}) \le (1-x)k$ for $|x| \le 1$ and $\varepsilon_n \le \bar\varepsilon_n$, we get $\|P_{t-k} - P_t\|_F^2 \le 2\alpha_n^2 n^2 (\min(1, k\varepsilon_n) + \delta)$. Then we choose $\delta \sim \bar\varepsilon_n \min(1, k\varepsilon_n)$, and use a union bound for $k = 1$ to $k \sim 1/\varepsilon_n$ (since beyond that we have $\min(1, k\varepsilon_n) = 1$) to conclude.
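The elementary inequality $1 - x^k \le k(1-x)$ on $[0,1]$, and its consequence that a node has changed communities within $k$ steps with probability at most $\min(1, k\varepsilon_n)$, can be verified directly (illustrative values):

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 101)
for k in (1, 2, 5, 20):
    # telescoping: 1 - x^k = (1 - x)(1 + x + ... + x^{k-1}) <= k (1 - x) on [0, 1]
    assert np.all(1 - xs ** k <= k * (1 - xs) + 1e-12)

eps, k = 0.01, 7                          # illustrative epsilon_n and lag k
changed_prob = 1 - (1 - eps) ** k         # prob. a node changed at least once in k steps
print(changed_prob <= min(1.0, k * eps))  # True
```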

N. Keriven and S. Vaiter
From Lemma 4, we have $\|P_t - P^{\mathrm{smooth}}_t\| \le C_\beta\, \alpha_n \sqrt{n \bar n_{\max}\, \varepsilon_n / \beta_{\max}}$. For the second term, we define $D_{t-k}$ as the diagonal degree matrix associated with $P_{t-k}$, such that the corresponding identity holds. Denoting by $\bar p_{ij}$ the elements of $P^{\mathrm{smooth}}_t$, recall that $\sum_j \bar p_{ij}^2 \le \alpha_n^2 \bar n_{\max}$, and using $(a+b)^2 \le 2(a^2 + b^2)$ we have the stated bound, where $A^{\frac{1}{2}}$ indicates the element-wise square root. Repeating the proof of Lemma 4, for two SBM connection probability matrices $P$ and $P'$ between which only the nodes belonging to a set $S$ have changed community, we obtain a bound of order $\sqrt{n} \min(1, \sqrt{k\varepsilon_n})$. We conclude using the hypothesis on $\sum_k \beta_k \min(1, \sqrt{k\varepsilon_n})$.

A.5. Technical Lemma
Lemma 11. Let $A, P \in \mathbb{R}^{n \times n}$ be symmetric matrices containing non-negative elements, assume that $d_i = \sum_j a_{ij}$ and $d^P_i = \sum_j p_{ij}$ are strictly positive, and define $D = \mathrm{diag}(d_i)$, $D_P = \mathrm{diag}(d^P_i)$. Then, for $x, y \in S$, we have the stated bound in terms of $d_{\min}$. Since the $p_{ij}$ are non-negative, the maximum over $x, y \in S$ is necessarily reached when every term in the sum is non-negative, by choosing $y_j \ge 0$ and $x_i$ with the same sign as $d^P_i - d_i$. Hence, $\sup_{x,y \in S} \sum_{ij} x_i y_j p_{ij} (d^P_i - d_i) = \sup_{x,y \in S} \sum_{ij} |x_i y_j p_{ij} (d^P_i - d_i)|$. Using this property, the remaining supremum can be bounded, which concludes the proof.