Markov Random Geometric Graph (MRGG): A Growth Model for Temporal Dynamic Networks

We introduce Markov Random Geometric Graphs (MRGGs), a growth model for temporal dynamic networks. It is based on a Markovian latent space dynamic: consecutive latent points are sampled on the Euclidean sphere using an unknown Markov kernel, and two nodes are connected with a probability depending on an unknown function of their latent geodesic distance. More precisely, at each time stamp $k$ we add a latent point $X_k$ sampled by jumping from the previous one $X_{k-1}$ in a direction $Y_k$ chosen uniformly and with a length $r_k$ drawn from an unknown distribution called the latitude function. The connection probability between each pair of nodes is given by the envelope function evaluated at the distance between the two latent points. We provide theoretical guarantees for the non-parametric estimation of the latitude and the envelope functions. We propose an efficient algorithm that achieves these non-parametric estimation tasks based on an ad-hoc Hierarchical Agglomerative Clustering approach. As a by-product, we show how MRGGs can be used to detect dependence structure in growing graphs and to solve link prediction problems.


Introduction
In Random Geometric Graphs (RGGs), nodes are sampled independently in a latent space $\mathbb{R}^d$. Two nodes are connected if their distance is smaller than a threshold. A thorough probabilistic study of RGGs can be found in [26]. RGGs have been widely studied recently thanks to their ability to provide a powerful modeling tool for networks with spatial structure. We can mention applications in bioinformatics [16] or in the analysis of social media [17]. One main feature is to uncover a hidden representation of the nodes through the latent space and to model interactions by the relative positions of the latent points. Furthermore, node interactions may evolve with time. In some applications, this evolution is driven by the arrival of new nodes, as in online collection growth [22], online social network growth [3,19], or outbreak modeling [31] for instance: the network grows as more nodes enter. Other time evolution modelings have been studied; we refer to [28] for a review.
A natural extension of RGGs consists in accounting for this time evolution. In [12], the expected length of connectivity and disconnectivity periods of the Dynamic Random Geometric Graph is studied: each node chooses at random an angle in [0, 2π) and makes a constant step size move in that direction. In [29], a random walk model for RGGs on the hypercube is studied, where at each time step a vertex is either appended to or deleted from the graph. Their model falls into the class of Geometric Markovian Random Graphs, which are defined in full generality in [7].
As far as we know, there is no extension of RGGs to a growth model for temporal dynamic networks. For the first time, we consider a Markovian dynamic on the latent space where each new latent point is drawn from the latest latent point through some Markov kernel to be estimated. This work was supported by grants from Région Ile-de-France.

[Figure 1: Graphical model of the MRGG model: Markovian dynamics on the Euclidean sphere where we jump from $X_k$ onto $X_{k+1}$. $Y_k$ encodes the direction of the jump while $r_k$ encodes its length, see (2).]
Estimation of graphons in RGGs: the Euclidean sphere case. Random graphs with a latent space can be defined using a graphon, cf. [23]. A graphon is a kernel function that defines the edge distribution. In [30], Tang et al. prove that a spectral method can recover the matrix formed by the graphon evaluated at the latent points, up to an orthogonal transformation, assuming that the graphon is a positive semi-definite (PSD) kernel. Going further, algorithms have been designed to estimate graphons, as in [20], which provides sharp rates for the Stochastic Block Model (SBM). Recently, the paper [9] provided a non-parametric algorithm to estimate RGGs on Euclidean spheres without the PSD assumption.
We present here RGGs on the Euclidean sphere. Given $n$ points $X_1, X_2, \dots, X_n$ on the Euclidean sphere $S^{d-1}$, we set an edge between nodes $i$ and $j$ (where $i, j \in [n]$, $i \neq j$) independently with probability $p(\langle X_i, X_j\rangle)$. The unknown function $p : [-1, 1] \to [0, 1]$ is called the envelope function. This RGG is a graphon model with a symmetric kernel $W$ given by $W(x, y) = p(\langle x, y\rangle)$. Once the latent points are given, draw the random undirected adjacency matrix $A$ by $A_{i,j} \sim \mathrm{Ber}\big(p(\langle X_i, X_j\rangle)\big)$, $i < j$, with the Bernoulli random variables drawn independently (set zero on the diagonal and complete by symmetry), and set
$$T_n := \frac{1}{n}\Big(p(\langle X_i, X_j\rangle)\Big)_{i,j\in[n]} \quad\text{and}\quad \hat T_n := \frac{A}{n}. \tag{1}$$
We do not observe the latent points and we have to estimate the envelope $p$ from $A$ only. A standard strategy is to remark that $\hat T_n$ is a random perturbation of $T_n$ and to dig into $\hat T_n$ to uncover $p$.
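To fix ideas, here is a minimal sketch (in Python, not from the paper; the threshold envelope and all names are illustrative) of how such a static RGG is sampled and how $\hat T_n$ is formed:

```python
# Minimal sketch: sample a static RGG on S^{d-1} with i.i.d. uniform latent
# points, then form \hat T_n = A / n. The envelope below is illustrative.
import numpy as np

def sample_rgg(n, d, p, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)    # uniform points on S^{d-1}
    gram = X @ X.T                                   # pairwise <X_i, X_j>
    U = rng.uniform(size=(n, n))
    A = np.triu((U < p(gram)).astype(int), 1)        # A_ij ~ Ber(p(<X_i, X_j>)), i < j
    return A + A.T, X                                # symmetrize, zero diagonal

A, X = sample_rgg(n=200, d=3, p=lambda t: (t >= 0.2).astype(float))
T_hat = A / A.shape[0]                               # \hat T_n = A / n
```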
One important feature of this model is that the interactions between nodes are captured by a simple object: the envelope function $p$. The envelope summarises how individuals connect with each other given their latent positions. Standard examples [6] are given by $p_\tau(t) = \mathbb{1}_{\{t \ge \tau\}}$, where one connects two points as soon as their geodesic distance is below some threshold. The non-parametric estimation of $p$ is given by [9], where the authors assume that the latent points $X_i$ are independently and uniformly distributed on the sphere, which will not be the case in this work.
A new growth model: the latent Markovian dynamic. Consider RGGs where latent points are sampled with Markovian jumps; the graphical model under consideration can be found in Figure 1. Namely, we sample $n$ points $X_1, X_2, \dots, X_n$ on the Euclidean sphere $S^{d-1}$ using a Markovian dynamic. We start by sampling $X_1$ uniformly at random on $S^{d-1}$. Then, for any $i \in \{2, \dots, n\}$, we sample
• a unit vector $Y_i \in S^{d-1}$ uniformly, orthogonal to $X_{i-1}$;
• a real $r_i \in [-1, 1]$ distributed according to the latitude function $f$;
then $X_i$ is defined by
$$X_i := r_i\, X_{i-1} + \sqrt{1 - r_i^2}\; Y_i. \tag{2}$$

[Figure 2: We built a graph of 1500 nodes sampled on the sphere $S^2$ using the envelope $p^{(1)}$ and latitude $f^{(1)}$ (dotted orange curves) defined in Section 5 by Eq. (11). The estimated envelope is thresholded to get a function in [0, 1] and the estimated latitude function is normalized to integrate to 1 (plain blue lines).]
This dynamic can be pictured as follows. Consider that $X_{i-1}$ is the north pole; then choose uniformly a direction (i.e., a longitude) and, in an independent manner, randomly move along the latitudes (the longitude being fixed by the previous step). The geodesic distance $\gamma_i$ drawn on the latitudes satisfies $\gamma_i = \arccos(r_i)$, where the random variable $r_i = \langle X_i, X_{i-1}\rangle$ has density $f$. The resulting model will be referred to as the Markov Random Geometric Graph (MRGG) and is described in Figure 1.
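The following sketch (illustrative Python with a toy latitude sampler; not the paper's code) implements one Markovian jump of Eq. (2):

```python
# One step of the latent dynamic of Eq. (2): draw Y_i uniformly on the equator
# orthogonal to X_{i-1} and r_i from the latitude density f, then jump.
import numpy as np

def markov_jump(x_prev, sample_latitude, rng):
    d = x_prev.shape[0]
    w = rng.standard_normal(d)
    w -= (w @ x_prev) * x_prev              # remove the component along X_{i-1}
    y = w / np.linalg.norm(w)               # Y_i uniform, orthogonal to X_{i-1}
    r = sample_latitude(rng)                # r_i = <X_i, X_{i-1}> with density f
    return r * x_prev + np.sqrt(1.0 - r ** 2) * y

rng = np.random.default_rng(0)
x = rng.standard_normal(3); x /= np.linalg.norm(x)      # X_1 uniform on S^2
traj = [x]
for _ in range(10):                          # toy latitude: rescaled Beta(2, 2)
    traj.append(markov_jump(traj[-1], lambda g: 2 * g.beta(2, 2) - 1, rng))
```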
Temporal Dynamic Networks: MRGG estimation strategy. Few growth models exist for temporal dynamic network modeling; see [28] for a review. In our model, we add one node at a time, making a Markovian jump from the previous latent position. This results in the observation of $(A_{i,j})_{1\le j\le i-1}$ at time $T = i$, as pictured in Figure 1. Namely, we observe how a new node connects to the previous ones. For such a dynamic, we aim at estimating the model, namely the envelope $p$ and the latitude $f$. These functions capture, through simple functions on $\Omega = [-1, 1]$, the range of interaction of nodes (represented by $p$) and the dynamic of the jumps in latent space (represented by $f$), where, on the abscissa $\Omega$, values $r = \langle X_i, X_j\rangle$ near 1 correspond to close points $X_i \approx X_j$, while values close to −1 correspond to antipodal points $X_i \approx -X_j$. These functions may be non-parametric. From snapshots of the graph at different time steps, can we recover the envelope and latitude functions? We prove that it is possible under mild conditions on the Markovian dynamic of the latent points; our approach is summed up in Figure 3.
Define $\lambda(T_n) := (\lambda_1, \dots, \lambda_n)$ and resp. $\lambda(\hat T_n) := (\hat\lambda_1, \dots, \hat\lambda_n)$ the spectrum of $T_n$ and resp. $\hat T_n$, see (1). Building clusters from $\lambda(\hat T_n)$, Algorithm 1 (SCCHEi) estimates the spectrum of the envelope $p$, while Algorithm 3 [1] (HEiC, cf. Section F) extracts $d$ eigenvectors of $\hat T_n$ to uncover the Gram matrix of the latent positions. Both can then be used to estimate the unknown functions of our model (cf. Figure 2).
Previous works. The latent space approach to model network dynamics has already been studied in a large span of recent works. Most of them focus on block models with dynamic generalizations, covering discrete dynamic evolution via hidden Markov models (cf. [24]) or continuous time analysis via the extended Kalman filter (cf. [32]). [33] and [11] use a Gamma Markov process to model evolving mixed membership in graphs, using respectively the Bernoulli-Poisson link function and the logistic function to generate edges from the latent space representation. While the above-mentioned papers consider community-based random graphs of fixed size where edges and communities change through time, the geometric viewpoint has been investigated in [6]. For the first time, we introduce the latitude function and non-parametric estimations of the envelope and latitude, using new results on the concentration of kernel matrices with dependent variables.
Outline. Sections 2 and 3 present the estimation method together with new theoretical results under the Markovian dynamic. These new results are an operator norm control for random matrices and a U-statistics control under Markovian dynamic, presented in the Appendix in Section H and Section G respectively. The adaptive envelope estimate is built from a size constrained clustering (Algorithm 1) tuned by the slope heuristic of Eq. (7), and the latitude function estimate (cf. Section 3.1) is derived from estimates of the latent distances $r_i$. Our method can handle random graphs with logarithmic node degree growth (i.e., the newcomer at time $T = n$ connects to of the order of $\log n$ previous nodes), referred to as relatively sparse models; see Section 4. Sections 5 and 6 investigate synthetic data experiments. We propose heuristics to solve link prediction problems and to test for a Markovian dynamic. In a last section (Section 7), we dig deeper into the analysis of our methods by studying their behaviour under model misspecification or under slow mixing conditions. We conclude by presenting final remarks and future research directions. At the end of Section 7, we provide with Figure 14 a synthetic overview of the estimation methods of this paper.
Notations. Consider a dimension $d \ge 3$. Denote by $\|\cdot\|_2$ (resp. $\langle\cdot,\cdot\rangle$) the Euclidean norm (resp. inner product) on $\mathbb{R}^d$. Consider the unit sphere $S^{d-1} := \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$ and denote by $\pi$ the uniform probability measure on $S^{d-1}$. For any $x, y \in \mathbb{R}$, $x \wedge y := \min(x, y)$ and $x \vee y := \max(x, y)$. Given two sequences $x, y$ of reals (completing finite sequences by zeros) such that $\sum_i x_i^2 + y_i^2 < \infty$, we define the $\ell_2$ rearrangement distance $\delta_2(x, y)$ as
$$\delta_2^2(x, y) := \inf_{\sigma \in \mathfrak{S}}\ \sum_i \big(x_i - y_{\sigma(i)}\big)^2,$$
where $\mathfrak{S}$ is the set of permutations of the integers with finite support. This pseudo-distance is useful to compare two spectra.
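For finite spectra, $\delta_2$ can be computed exactly: after completing both sequences by zeros, the squared cost $\sum_i (x_i - y_{\sigma(i)})^2$ is minimized by matching both padded sequences in sorted order (rearrangement inequality). A small illustrative helper (not from the paper's code):

```python
# \delta_2 rearrangement distance between two finite spectra (zero-padded).
import numpy as np

def delta2(x, y):
    m = len(x) + len(y)                       # enough zero-padding for both
    xp = np.sort(np.concatenate([x, np.zeros(m - len(x))]))
    yp = np.sort(np.concatenate([y, np.zeros(m - len(y))]))
    return float(np.sqrt(np.sum((xp - yp) ** 2)))   # sorted matching is optimal

print(delta2(np.array([1.0, -0.5]), np.array([0.9, -0.4, 0.05])))
```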

Nonparametric estimation of the envelope function
One can associate with $W(x, y) = p(\langle x, y\rangle)$ the integral operator $T_W$ defined for any $g \in L^2(S^{d-1})$ by
$$(T_W g)(x) := \int_{S^{d-1}} W(x, y)\, g(y)\, \pi(dy),$$
where $\pi$ is the uniform probability measure on $S^{d-1}$. The operator $T_W$ is Hilbert-Schmidt and has a countable number of bounded eigenvalues $\lambda_k^*$ with zero as the only accumulation point. The eigenfunctions of $T_W$ have the remarkable property that they do not depend on $p$ (cf. Section C): they are given by the spherical harmonics.
We finally introduce, for any resolution level $R \in \mathbb{N}$, the truncated graphon $W_R$, which is obtained from $W$ by keeping only the $R$ first eigenvalues.
Weighted Sobolev space. The space $Z_{w_\beta}^s((-1, 1))$ with regularity $s > 0$ is defined as the set of functions of $L^2((-1,1), w_\beta)$ whose coefficients in the Gegenbauer polynomial basis decay at polynomial rate $s$; we refer to [9] for the precise definition.

Integral operator spectrum estimation with dependent variables
One key result is a new control of U-statistics with latent Markov variables (cf. Section G), and it makes use of a Talagrand-type concentration inequality for Markov chains. We follow the hypotheses on the Markov chain $(X_i)_{i\ge1}$ made in [10]. Namely, we work under the following assumption.

Assumption A
The latitude function $f$ is such that $\|f\|_\infty < \infty$ and makes the chain $(X_i)_{i\ge1}$ uniformly ergodic.
Under Assumption A, we prove in Section B that the unique stationary distribution of the Markov chain $(X_i)_{i\ge1}$ is the uniform probability measure on $S^{d-1}$, denoted $\pi$. Theorem 1 is a theoretical guarantee for a random matrix approximation of the spectrum of an integral operator with dependent latent variables. Theorem 5 in Section H gives explicitly the constants hidden in the big $O$ below, which depend on the absolute spectral gap of the Markov chain $(X_i)_{i\ge1}$ (cf. Definition 11).
Theorem 1. We consider that Assumption A holds and we assume the envelope $p$ has regularity $s > 0$. Then it holds, with high probability,
$$\delta_2\big(\lambda(T_W), \lambda(T_n)\big) = O\Big(\big(\tfrac{\log^2 n}{n}\big)^{\frac{s}{2s+d-1}}\Big).$$
Using this preliminary result and the near optimal error bound for the operator norm of random matrices from [4], we obtain, with high probability,
$$\delta_2\big(\lambda(T_W), \lambda^{R_{opt}}(\hat T_n)\big) = O\Big(\big(\tfrac{\log^2 n}{n}\big)^{\frac{s}{2s+d-1}}\Big),$$
with $\lambda^{R_{opt}}(\hat T_n) = (\hat\lambda_1, \dots, \hat\lambda_{R_{opt}}, 0, 0, \dots)$ and $R_{opt} = \big(n/\log^2(n)\big)^{\frac{1}{2s+d-1}}$, where $\hat\lambda_1, \dots, \hat\lambda_n$ are the eigenvalues of $\hat T_n$ sorted in decreasing order of magnitude.
Remark. In Theorem 1 and Theorem 4, note that we recover, up to a log factor, the minimax rate of non-parametric estimation of $s$-regular functions on a space of (Riemannian) dimension $d − 1$. Even with i.i.d. latent variables, it is still an open question whether this rate is the minimax rate of non-parametric estimation of RGGs.
Eq. (3) shows that one could use an approximation of $(p_k^*)_{k\ge1}$ to estimate the envelope $p$, and Theorem 1 states that we can recover $(p_k^*)_{k\ge1}$ up to a permutation. In most cases, the problem of finding such a permutation is NP-hard, and we introduce in the next section an efficient algorithm to fix this issue.

Size Constrained Clustering Algorithm
Note that the spectrum of $T_W$ is given by $(p_l^*)_{l\ge0}$, where $p_l^*$ has multiplicity $d_l$. In order to recover the envelope $p$, we build clusters from the eigenvalues of $\hat T_n$ while respecting the dimension $d_l$ of each eigenspace of $T_W$. In [9], an algorithm is proposed that tests all permutations of $\{0, \dots, R\}$ for a given maximal resolution $R$. To bypass the high computational cost of such an approach, we propose an efficient method based on the tree built by Hierarchical Agglomerative Clustering (HAC). In the following, for any $\nu_1, \dots, \nu_n \in \mathbb{R}$, we denote by $\mathrm{HAC}(\{\nu_1, \dots, \nu_n\}, d_c)$ the tree built by a HAC on the real values $\nu_1, \dots, \nu_n$ using the complete linkage function $d_c$ defined by $\forall A, B \subset \mathbb{R},\ d_c(A, B) = \max_{a\in A}\max_{b\in B} \|a - b\|_2$. Algorithm 1 describes our approach.
Given some resolution level $R \in \mathbb{N}$, our estimator $\hat p_R$ of the envelope function $p$ is obtained from the clustering of the eigenvalues returned by the SCCHEi algorithm as follows.
5: for d ∈ dims do
6:   Search for a cluster of size d in tree as close as possible to the root.
7:   if such a cluster C exists then Update(dims, tree, C, d).
8: end for
9: for d ∈ dims do
10:   Search for the group in tree with a size larger than d and as close as possible to d.
11:   if such a group C exists then Update(dims, tree, C, d) else Go to line 3.
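The sketch below (Python, built on SciPy's complete-linkage HAC; a loose illustration of the size-constraint idea rather than a faithful transcription of Algorithm 1) clusters the leading eigenvalues while matching the spherical-harmonic multiplicities $d_l$:

```python
# Loose sketch of SCCHEi: complete-linkage HAC on the leading eigenvalues, then
# scan cuts of the tree and greedily match cluster sizes to the d_l's.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.special import comb

def multiplicities(R, d):
    """Dimensions d_l of the spherical harmonic spaces of S^{d-1}, l = 0..R."""
    return [int(comb(d + l - 1, l) - comb(d + l - 3, l - 2)) for l in range(R + 1)]

def scchei_sketch(eigvals, R, d):
    dims = multiplicities(R, d)
    k = sum(dims)
    lam = eigvals[np.argsort(-np.abs(eigvals))][:k]   # k largest |eigenvalues|
    tree = linkage(lam.reshape(-1, 1), method="complete")
    for n_clusters in range(R + 1, k + 1):            # from coarse cuts to fine
        labels = fcluster(tree, t=n_clusters, criterion="maxclust")
        ids, sizes = np.unique(labels, return_counts=True)
        estimates, used = {}, set()
        for l, d_l in sorted(enumerate(dims), key=lambda t: -t[1]):
            match = [c for c, s in zip(ids, sizes) if s == d_l and c not in used]
            if not match:
                break
            used.add(match[0])
            estimates[l] = lam[labels == match[0]].mean()   # estimate of p*_l
        if len(estimates) == len(dims):
            return [estimates[l] for l in range(R + 1)]
    return None  # no cut of the tree matches all multiplicities exactly
```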

Theoretical guarantees
Let us recall that for any resolution level $R \ge 0$, $\lambda^R(\hat T_n) := (\hat\lambda_1, \dots, \hat\lambda_R, 0, 0, \dots)$, where $\hat\lambda_1, \dots, \hat\lambda_n$ are the eigenvalues of $\hat T_n$ sorted in decreasing order of magnitude. We reorder the eigenvalues $\hat\lambda_1, \dots, \hat\lambda_R$, and in the following we consider that $\lambda^R(\hat T_n)_1 \ge \dots \ge \lambda^R(\hat T_n)_R$.
Theorem 2. Let us consider some resolution level $R \in \mathbb{N}$. We keep the assumptions of Theorem 1. We recall that we consider $\lambda^R(\hat T_n)_1 \ge \dots \ge \lambda^R(\hat T_n)_R$. Then, for $n$ large enough, the clusters $\hat d_0, \dots, \hat d_R$ obtained from the SCCHEi algorithm achieve the $\delta_2$ distance between $\lambda(T_{W_R})$ and $\lambda^R(\hat T_n)$.
Proof of Theorem 2. Let us denote by $\Delta_G$ the smallest gap between two distinct eigenvalues $p_k^*$, $0 \le k \le R$. For any $g \in (0, \Delta_G/4)$, the proof of Theorem 1 (cf. Section H) ensures that, for $n$ large enough, each eigenvalue $\lambda^R(\hat T_n)_i$ lies within distance $g$ of the eigenvalue of $T_{W_R}$ it estimates. The proof of Theorem 2 relies on the following two Lemmas, whose proofs are postponed to Section D.
Lemma 1. There exists a permutation $\sigma^*$ achieving the $\delta_2$ distance between $\lambda(T_{W_R})$ and $\lambda^R(\hat T_n)$. Moreover, the induced allocation function $f^*$, which maps each index $i$ to the eigenvalue $p_k^*$ assigned to $\lambda^R(\hat T_n)_i$ by $\sigma^*$, is non-increasing.

Lemma 2.
We keep the assumptions and notations of Lemma 1. A clustering $(\hat d_k)_{0\le k\le R}$ at depth $R$ in the tree of the HAC algorithm applied to $\lambda^R(\hat T_n)$ is said to be of type (⋆) if it groups together exactly the eigenvalues allocated to a common value of $f^*$. Then the HAC algorithm with complete linkage applied to $\lambda^R(\hat T_n)$ reaches (after $\tilde R − R − 1$ iterations, where $\tilde R := \sum_{k=0}^{R} d_k$) a state $(\hat d_k)_{0\le k\le R}$ of type (⋆). As a consequence, the SCCHEi algorithm returns this clustering.
Theorem 2 directly follows from the conclusion of Lemma 2, since we get a chain of equalities showing that the clusters achieve the $\delta_2$ distance: the first equality comes from the conclusion of Lemma 2, the second one comes from the definition of $f^*$ in Lemma 1, and the last one comes from the choice of $\sigma^*$ in Lemma 1.
Theorem 2 ensures that, under appropriate conditions, the SCCHEi leads to a clustering of the eigenvalues of the adjacency matrix that achieves the $\delta_2$ distance between $\lambda(T_{W_R})$ and $\lambda^R(\hat T_n)$. Nevertheless, this is not a sufficient condition to ensure that the $L^2$ error between the true envelope function and our plug-in estimator (cf. Eq. (4)) goes to 0 as $n \to +\infty$. This is due to identifiability issues coming from the $\delta_2$ metric, as already mentioned in [9, Section 3.6], where the authors present the following example. Consider the case $d = 3$, which implies $\beta = 1/2$, $d_k = 2k + 1$, $c_k = \sqrt{2k+1}$. For $\mu > 0$, one can build two envelopes $p_a$ and $p_b$ whose associated spectra are equal up to a permutation, hence indistinguishable in the $\delta_2$ metric, although $\|p_a − p_b\|_2 = \mu\sqrt{24} > 0$. Nevertheless, we can obtain a theoretical guarantee on the $L^2$ error between the true envelope function and our plug-in estimate using Theorem 2 if we consider additional conditions on the eigenvalues $(p_k^*)_{k\ge0}$.

Theorem 3.
Assume that the envelope function $p$ is a polynomial of degree $D \in \mathbb{N}$, i.e., $p_k^* = 0$ for any $k > D$ and $p_D^* \neq 0$. Assume also that all nonzero $p_k^*$ for $k \in \{0, \dots, D\}$ are distinct and that $R \ge D$. Then, for $n$ large enough, it holds with probability at least $1 − n^{-8}$ that
$$\|p - \hat p_R\|_2 \le c\,\delta_2\big(\lambda(T_W), \lambda^R(\hat T_n)\big),$$
where $c > 0$ is a universal numerical constant.

Remarks.
• The question of whether the problem of estimating p is NP-hard was still completely open. Theorem 3 brings a first partial answer to this question by showing that p can be estimated in polynomial time in the case where p is a polynomial with all non-zero eigenvalues distinct.
• The proof of Theorem 3 is strictly analogous to that of [9, Proposition 9]. In a nutshell, considering that the envelope function $p$ is a polynomial with all non-zero eigenvalues $p_k^*$ distinct ensures (since $R \ge D$) that the $\delta_2$ distance between spectra coincides with the $L^2$ norm of the difference between $p$ and its estimate built from the permutation $\sigma^*$ defined in Lemma 1. Since we proved that, for $n$ large enough, the clusters returned by the SCCHEi algorithm correspond to an allocation given by $f^*$, we deduce that the $L^2$ norm between $p$ and our plug-in estimate $\hat p_R$ is equal to the $\delta_2$ distance between spectra. The result then follows directly from Theorem 1.

Adaptation: Slope heuristic as model selection of Resolution
A data-driven choice of the model size $R$ can be done by the slope heuristic; see [2] for a nice review. One main idea of the slope heuristic is to penalize the empirical risk by $\kappa\,\mathrm{pen}(R)$ and to calibrate $\kappa > 0$. If the sequence $(\mathrm{pen}(R))_R$ is equivalent to the sequence of variances of the population risk of the empirical risk minimizer (ERM) as the model size $R$ grows, then, penalizing the empirical risk (as done in Eq. (7)), one may ultimately uncover an empirical version of the U-shaped curve of the population risk. Hence, minimizing it, one builds a model size $\hat R$ that balances between bias (under-fitting regime) and variance (over-fitting regime). First, note that the empirical risk is given by the intra-class variance below.

Definition 1.
For any output $(\hat d_0, \dots, \hat d_R, \hat\Lambda)$ of the algorithm SCCHEi, the thresholded intra-class variance is defined as the sum, over the clusters $\hat d_k$, of the squared deviations of the eigenvalues of the cluster from their cluster mean, and the estimates $(\hat p_k)_{k\ge0}$ of the eigenvalues $(p_k^*)_{k\ge0}$ are given by these cluster means, see Eq. (6). Second, as underlined in the proof of Theorem 1 (see Theorem 5 in Section H), the variance of our estimator scales linearly in $R$.
Hence, we apply the algorithm SCCHEi for $R$ varying from 0 to $R_{max}$ (with $R_{max} := \max\{R \ge 0 : \tilde R \le n\}$) to compute the thresholded intra-class variance (see Definition 1) and, given some $\kappa > 0$, we select the resolution $\hat R(\kappa)$ minimizing the penalized criterion of Eq. (7). The hyper-parameter $\kappa$ controlling the bias-variance trade-off is set to $2\kappa_0$, where $\kappa_0$ is the value of $\kappa > 0$ leading to the "largest jump" of the function $\kappa \mapsto \hat R(\kappa)$. Once $\hat R := \hat R(2\kappa_0)$ has been computed, we approximate the envelope function $p$ using Eq. (6) (see Eq. (20) for the closed form). We denote this estimator $\hat p$, and with the notations of Eq. (4) it holds $\hat p = \hat p_{\hat R}$. In Appendix E, we describe this slope heuristic on a concrete example, and our results can be reproduced using the notebook Experiments 1 in the Appendix.
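A hedged sketch of the "largest jump" calibration, assuming the map $\kappa \mapsto \hat R(\kappa)$ has already been computed on a grid (the grid and toy values below are illustrative):

```python
# Pick kappa_0 at the largest jump of kappa -> R(kappa), return 2 * kappa_0.
import numpy as np

def calibrate_kappa(kappas, R_of_kappa):
    kappas = np.asarray(kappas, dtype=float)       # assumed sorted increasingly
    R_vals = np.asarray(R_of_kappa, dtype=float)
    jumps = np.abs(np.diff(R_vals))                # jump sizes of R(kappa)
    return 2.0 * kappas[np.argmax(jumps)]          # 2 * kappa_0

kappas = np.logspace(-5, -2, 50)
R_vals = [12 if k < 10 ** -3.9 else 3 for k in kappas]   # toy R(kappa) profile
print(calibrate_kappa(kappas, R_vals))
```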

Our approach to estimate the latitude function in a nutshell
In Theorem 4 (see below), we show that we are able to consistently estimate the pairwise distances encoded by the Gram matrix $G^* := \frac{1}{n}\big(\langle X_i, X_j\rangle\big)_{i,j\in[n]}$. Taking the diagonal just above the main diagonal (referred to as the superdiagonal) of $\hat G$ (an estimate of the matrix $G^*$ to be specified), we get estimates of the i.i.d. random variables $(\langle X_i, X_{i-1}\rangle)_{2\le i\le n} = (r_i)_{2\le i\le n}$ sampled from $f$. Using $(\hat r_i)_{2\le i\le n}$, the superdiagonal of $n\hat G$, we can build a kernel density estimator of the latitude function $f$. In the following, we describe the algorithm used to build our estimator $\hat G$ with theoretical guarantees.

Spectral gap condition and Gram matrix estimation
The Gegenbauer polynomial of degree one is given by $G_1^\beta(t) = 2\beta t$. We will prove that for $n$ large enough there exists a matrix $V \in \mathbb{R}^{n\times d}$, where each column is an eigenvector of $\hat T_n$, such that $\hat G := \frac{1}{d} V V^\top$ approximates $G^*$ well, in the sense that the Frobenius norm $\|G^* − \hat G\|_F$ converges to 0. To choose the $d$ eigenvectors of the matrix $\hat T_n$ that we will use to build the matrix $V$, we need the following spectral gap condition
$$\Delta^* := \min_{k\in\mathbb{N},\ k\neq 1} |p_1^* − p_k^*| > 0. \tag{9}$$
This condition will allow us to apply Davis-Kahan type inequalities. Now, thanks to Theorem 1, we know that the spectrum of the matrix $\hat T_n$ converges towards the spectrum of the integral operator $T_W$. Then, based on Eq. (8), one can naturally expect that, extracting the $d$ eigenvectors of the matrix $\hat T_n$ related to the eigenvalues that converge towards $p_1^*$, we can approximate the Gram matrix $G^*$ of the latent positions. Theorem 4 proves that this intuition is true with high probability under the spectral gap condition (9). The algorithm HEiC [1] (cf. Section F for a presentation) aims at identifying the above-mentioned $d$ eigenvectors of the matrix $\hat T_n$ to build our estimate of the Gram matrix $G^*$.
Theorem 4. We consider that Assumption A holds, we assume $\Delta^* > 0$, and we assume that the graphon $W$ has regularity $s > 0$. We denote by $V \in \mathbb{R}^{n\times d}$ the matrix whose columns are the $d$ eigenvectors of $\hat T_n$ associated with the eigenvalues returned by the algorithm HEiC, and we define $\hat G := \frac{1}{d} V V^\top$. Then, for $n$ large enough and for some constant $D > 0$, it holds with probability at least $1 − 5/n^2$ that
$$\|G^* − \hat G\|_F \le D\,\Big(\tfrac{\log^2 n}{n}\Big)^{\frac{s}{2s+d-1}}.$$
Based on Theorem 4, we propose a kernel density approach to estimate the latitude function $f$ based on the superdiagonal of the matrix $\hat G$, namely $(\hat r_i := n\hat G_{i-1,i})_{i\in\{2,\dots,n\}}$. In the following, we denote by $\hat f$ this estimator.
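A minimal sketch of this pipeline, assuming the indices of the $d$ relevant eigenvectors (those clustering around $p_1^*$, as identified by HEiC) are given; `idx` and the Gaussian KDE choice are assumptions of this illustration:

```python
# Estimate \hat G = V V^T / d from d chosen eigenvectors of \hat T_n, read the
# superdiagonal \hat r_i = n * \hat G_{i-1,i}, and fit a kernel density for f.
import numpy as np
from scipy.stats import gaussian_kde

def estimate_latitude(T_hat, idx, d):
    n = T_hat.shape[0]
    _, vecs = np.linalg.eigh(T_hat)
    V = vecs[:, idx]                               # n x d eigenvector matrix
    G_hat = (V @ V.T) / d                          # \hat G = V V^T / d
    r_hat = np.clip(n * np.diag(G_hat, 1), -1, 1)  # \hat r_i ~ <X_i, X_{i-1}>
    return r_hat, gaussian_kde(r_hat)              # samples and estimator \hat f
```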

Relatively Sparse Regime
Although we have dealt so far with the so-called dense regime (i.e., when the expected number of neighbors of each node scales linearly with $n$), our results may be generalized to the relatively sparse model connecting nodes $i$ and $j$ with probability $\zeta_n\, p(\langle X_i, X_j\rangle)$, where $\zeta_n \in (0, 1]$ may go to zero as $n$ grows. In the relatively sparse model, one can show, following the proof of Theorem 1, that the resolution should be chosen as $R = \big(\frac{n\zeta_n}{(1+\zeta_n)\log^2 n}\big)^{\frac{1}{2s+d-1}}$. Specifying that $\lambda^* = (p_0^*, p_1^*, \dots, p_1^*, p_2^*, \dots)$ and $\hat T_n = A/(n\zeta_n)$, Theorem 1 becomes, for a graphon with regularity $s > 0$, the same bound with $n/\log^2 n$ replaced by $n\zeta_n/((1+\zeta_n)\log^2 n)$. Figure 4 illustrates the estimation of the latitude and the envelope functions in some relatively sparse regimes.

Experiments
In the following, we test our methods using different envelope and latitude functions. Note that a common choice of connection functions in RGGs is the family of Rayleigh fading activation functions, parametrized by $\zeta, \eta > 0$. Any Rayleigh function with parameters $(\zeta, \eta)$ corresponds to an envelope function through the identity $\|x − y\|_2^2 = 2(1 − \langle x, y\rangle)$ valid on the sphere. Let us also denote, for any $\alpha, \beta > 0$, by $g(\cdot; \alpha, \beta)$ the density of the Beta distribution $\mathcal{B}(\alpha, \beta)$ with parameters $(\alpha, \beta)$. In this paper, we study the numerical results of our methods considering the envelope and latitude functions defined in Eq. (11). Note that considering the latitude function $f^{(2)}$ (resp. $f^{(3)}$) is equivalent to considering that one fourth of the squared Euclidean distance between consecutive latent positions is distributed as $Z \sim \mathcal{B}(1, 3)$ (resp. $Z \sim \mathcal{B}(2, 2)$). With Figures 5, 6 and 7, we present the results of our experiments for the three different settings described in Eq. (11). In each case, we work with a latent dimension $d = 4$ and we show: 1. the estimates of the envelope and latitude functions obtained with our adaptive procedure working on a graph of 1500 nodes (see Figures (a) and (b));

2. the corresponding clustering obtained by the SCCHEi algorithm for the resolution level $\hat R$ determined by the slope heuristic (see Figures (c)).
Blue crosses represent the $\tilde R$ eigenvalues of $\hat T_n$ with the largest magnitude, which are used to form clusters corresponding to the $\hat R + 1$ first spherical harmonic spaces. The red plus signs are the estimated eigenvalues $(\hat p_k)_{0\le k\le \hat R}$ (plotted with multiplicity) defined from the clustering given by our algorithm SCCHEi (see Eq. (6)). These results show that SCCHEi achieves a relevant clustering of the eigenvalues of $\hat T_n$, which allows us to recover the envelope function.
3. the errors between the estimated functions and the true ones, in $\delta_2$ metric and in $L^2$ norm, for different sizes of graphs (see Figures (d) and (e)). We notice that a significant decrease of the $\delta_2$ distance between spectra does not necessarily mean that the $L^2$ norm between the estimated and the true envelope functions shrinks accordingly; we refer in particular to Figures 5 and 7. The identifiability issue highlighted in Section 2.3 is one possible explanation of this phenomenon. Nevertheless, these experiments show that both the $\delta_2$ and $L^2$ errors on our estimates of the envelope and the latitude functions decrease as the size of the graph gets larger. Let us also recall that Theorem 3 ensures that the $L^2$ error on our estimate of the envelope function goes to zero as $n$ grows when $p$ has a finite number of non-zero eigenvalues that are all distinct.

[Figure 6: Results for d = 4, the envelope $p^{(2)}$ and the latitude $f^{(2)}$ of Eq. (11).]

Applications
In this section, we apply the MRGG model to link prediction and hypothesis testing in order to demonstrate the usefulness of our approach as well as the estimation procedure.

Markovian Dynamic Testing
As a first application of our model, we propose a hypothesis test to statistically distinguish between an independent sampling of the latent positions and a Markovian dynamic. The null hypothesis is $H_0$: nodes are independently and uniformly distributed on the sphere (i.e., no Markovian dynamic). Our test is based on the estimate $\hat f$ of the latitude, and thus the null can be rephrased as $H_0 : f = f_0$, where $f_0$ is the latitude function of the uniform law; the dynamic is then an i.i.d. dynamic.

[Figure: results for the envelope $p^{(1)}$ and latitude $f^{(1)}$ of Eq. (11).]
We run our algorithm to estimate the latitude, from which we sample a batch to compute the $\chi^2$-test statistic. We see that for graphs of size larger than 1,000, the rejection rate is almost 1 under the alternative (the Type II error is almost zero): the test is very powerful.
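A sketch of the test under illustrative choices (the binning and SciPy's chisquare are assumptions of this example; the null density below is the one-dimensional marginal of a uniform point on $S^{d-1}$):

```python
# Chi-square goodness of fit of the estimated latitudes against the null
# f_0(r) proportional to (1 - r^2)^((d-3)/2) (uniform i.i.d. latent points).
import numpy as np
from scipy.integrate import quad
from scipy.stats import chisquare

def markov_dynamic_test(r_hat, d, n_bins=10):
    f0 = lambda r: (1.0 - r ** 2) ** ((d - 3) / 2.0)
    Z = quad(f0, -1, 1)[0]                               # normalizing constant
    edges = np.linspace(-1, 1, n_bins + 1)
    observed, _ = np.histogram(r_hat, bins=edges)
    probs = [quad(f0, a, b)[0] / Z for a, b in zip(edges[:-1], edges[1:])]
    expected = len(r_hat) * np.array(probs)              # expected counts under H_0
    return chisquare(observed, expected)                 # (statistic, p-value)
```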

Link Prediction
Suppose that we observe a graph with n nodes. Link prediction is the task that consists in estimating the probability of connection between a given node of the graph and the upcoming node.

Bayes Link Prediction
We propose to show the usefulness of our model by solving a link prediction problem. Let us recall that we do not estimate the latent positions but only the pairwise distances (an embedding step is not necessary for our purpose). Denoting by $\mathrm{proj}_{X_n^\perp}(\cdot)$ the orthogonal projection onto the orthogonal complement of $\mathrm{Span}(X_n)$, the decomposition of $\langle X_i, X_{n+1}\rangle$ given by
$$\langle X_i, X_{n+1}\rangle = r_{n+1}\,\langle X_i, X_n\rangle + \sqrt{1 - r_{n+1}^2}\;\big\langle \mathrm{proj}_{X_n^\perp}(X_i),\, Y_{n+1}\big\rangle \tag{12}$$
shows that latent distances are enough for link prediction. Indeed, it can be achieved using a forward step of our Markovian dynamic, giving the posterior probability (cf. Definition 2) $\eta_i(D_{1:n})$ defined by
$$\eta_i(D_{1:n}) = \int\!\!\int p\Big(r\,\langle X_i, X_n\rangle + \sqrt{1-r^2}\,\sqrt{1-\langle X_i, X_n\rangle^2}\;u\Big)\, f(r)\, \frac{w_{\frac{d-3}{2}}(u)}{\|w_{\frac{d-3}{2}}\|_1}\, dr\, du, \tag{13}$$
where $w_{\frac{d-3}{2}}(u) = (1-u^2)^{\frac{d-4}{2}}$ is, up to normalization, the density of a one-dimensional marginal of a uniform point on $S^{d-2}$.

Definition 2. (Posterior probability function)
The posterior probability function $\eta$ is defined, for any latent pairwise distances $D_{1:n}$, by $\eta_i(D_{1:n}) := \mathbb{P}(A_{i,n+1} = 1 \mid D_{1:n})$, where $A_{i,n+1}$ is a random variable that equals 1 if there is an edge between nodes $i$ and $n + 1$, and is zero otherwise.
We consider a classifier g (cf. Definition 3) and an algorithm that, given some latent pairwise distances D 1:n , estimates A i,n+1 by putting an edge between nodes X i and X n+1 if g i (D 1:n ) is 1.

Definition 3. A classifier is a function which associates to any pairwise distances $D_{1:n}$ a vector of labels $(g_i(D_{1:n}))_{i\in[n]} \in \{0, 1\}^n$.
The risk of this algorithm is, as in binary classification,
$$\ell(g, D_{1:n}) := \frac{1}{n}\sum_{i=1}^{n} \mathbb{P}\big(g_i(D_{1:n}) \neq A_{i,n+1} \,\big|\, D_{1:n}\big), \tag{14}$$
where we used the independence between $A_{i,n+1}$ and $g_i(D_{1:n})$ conditionally on $D_{1:n}$. Pushing this analogy further, we can define the classification error of a classifier $g$ by $L(g) = \mathbb{E}[\ell(g, D_{1:n})]$. Proposition 1 shows that the Bayes classifier (introduced in Definition 4) is optimal for the risk defined in Eq. (14).

Proposition 1. (Optimality of the Bayes classifier for the risk )
We keep the notations of Definitions 2 and 4. For any classifier $g$, it holds for all $i \in [n]$ that
$$\mathbb{P}\big(g_i(D_{1:n}) \neq A_{i,n+1} \mid D_{1:n}\big) \ge \mathbb{P}\big(g_i^*(D_{1:n}) \neq A_{i,n+1} \mid D_{1:n}\big),$$
which immediately implies that $\ell(g, D_{1:n}) \ge \ell(g^*, D_{1:n})$ and therefore $L(g) \ge L(g^*)$.

Heuristic for Link Prediction
One natural method to approximate the Bayes classifier from the previous section is to use the plug-in approach. This leads to the MRGG classifier introduced in Definition 5.

Definition 5. (The MRGG classifier)
For any $n$ and any $i \in [n]$, we define $\hat\eta_i(D_{1:n})$ as the plug-in version of Eq. (13), where $p$ and $f$ are replaced by $\hat p$ and $\hat f$ (the estimates of the envelope and latitude functions obtained with our method) and the distances by $\hat r := n\hat G$ (Eq. (15)). The MRGG classifier is defined by thresholding $\hat\eta_i(D_{1:n})$ at 1/2, in analogy with the Bayes classifier of Definition 4.

[Figure 9: Left: link predictions between the future node $X_{n+1}$ and the 10 first nodes $X_1, \dots, X_{10}$. Right: comparison between the risk (defined in Eq. (14)) of the MRGG classifier, the random classifier, and the optimal Bayes classifier.]
To illustrate our approach, we work with a graph of 1500 nodes with $d = 4$, and we consider the envelope and latitude functions defined in Eq. (11). The plots in the left column of Figure 9 show that we are able to recover the probabilities of connection of the nodes already present in the graph with the coming node $X_{n+1}$. Using the decomposition of $\langle X_i, X_{n+1}\rangle$ given by Eq. (12), orange crosses are computed using Eq. (13). Green plus signs are computed similarly, replacing $p$ and $f$ by their estimates $\hat p$ and $\hat f$ following Eq. (15). Blue stars are computed using Eq. (13) by replacing $f$ by $w_\beta/\|w_\beta\|_1$, which implicitly supposes that the points are sampled uniformly on the sphere.
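For completeness, here is a Monte Carlo sketch of the plug-in posterior of Definition 5 (illustrative Python; `p_hat` and `sample_f_hat` stand for our estimate $\hat p$ and a sampler of $\hat f$, both assumptions of this example):

```python
# \hat eta_i: average \hat p over jumps r ~ \hat f and a uniform direction on
# the equator S^{d-2}; only g_i = <X_i, X_n> is needed, not the positions.
import numpy as np

def eta_hat(g_i, p_hat, sample_f_hat, d, n_mc=10_000, seed=0):
    rng = np.random.default_rng(seed)
    r = sample_f_hat(n_mc, rng)                   # r ~ \hat f
    w = rng.standard_normal((n_mc, d - 1))        # uniform point on S^{d-2} ...
    u = w[:, 0] / np.linalg.norm(w, axis=1)       # ... keep one coordinate
    inner = r * g_i + np.sqrt(1 - r ** 2) * np.sqrt(max(1 - g_i ** 2, 0.0)) * u
    return float(p_hat(np.clip(inner, -1, 1)).mean())   # plug-in edge probability
```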
With the plots in the right column of Figure 9, we compare the risk of the random classifier (whose guess $g_i(D_{1:n})$ is a Bernoulli random variable with parameter given by the ratio of observed edges to those of the complete graph) with the risk of the MRGG classifier (cf. Definition 5). These figures show that for a small number of nodes, the risk estimate provided by the MRGG classifier can be significantly far from that of the Bayes classifier. However, when the number of nodes gets larger, the MRGG classifier gives results similar to those of the optimal Bayes classifier. This risk estimate can be significantly smaller than that of the random classifier (see for example the plots corresponding to the envelope $p^{(2)}$ and the latitude $f^{(2)}$).

Discussion
In this section, we push the investigation of the performance of our estimation methods further. In Section 7.1 we study the robustness of our methods under model misspecification, before inspecting the influence of the mixing time of the Markov chain $(X_i)_{i\ge1}$ on the estimation error in Section 7.2.
On a more theoretical side, we show that when replacing the complete linkage by the Ward distance in the SCCHEi algorithm, Theorem 2 might no longer hold. We conclude with some remarks and by highlighting future research directions.

Robustness to model misspecification
We consider a mixture model for the sampling scheme of the latent positions. We fix some $\varepsilon \in (0, 1)$ and we draw $X_1$ randomly on the sphere. Then, at time step $i \ge 2$, the point $X_i$ is sampled as follows (a code sketch is given after the discussion below): • with probability $1 − \varepsilon$, $X_i$ is drawn following the Markovian dynamic described in Section 1 (based on $X_{i-1}$);
• with probability $\varepsilon$, $X_i$ is drawn uniformly on the sphere. Figures 10 and 11 show the numerical results obtained under this misspecified model. We consider the hypothesis testing question presented in Section 6.1 with the same settings, namely $d = 3$ and the envelope and latitude functions $p^{(1)}$ and $f^{(1)}$ of Eq. (11). We can see that when $\varepsilon = 0$, the power of our test is 1 and we always reject the null hypothesis (uniform sampling of the latent positions) under the alternative. On the contrary, when $\varepsilon = 1$, the points are sampled uniformly on the sphere and we obtain a power of the order of the level of our test (i.e. 5%), as expected. The larger the sample size $n$ is, the larger $\varepsilon$ can be chosen while keeping a large power. In the case where $n = 1500$, one can afford to sample 75% of the latent positions uniformly (and the rest using our Markovian sampling scheme) while keeping a power equal to 1. Figure 11 shows that the larger $\varepsilon$ is, the closer the estimated latitude function is to $w_\beta/\|w_\beta\|_1 \equiv 1/2$ (since $d = 3$), which corresponds to the density of a one-dimensional marginal of a uniform random point on $S^{d-1}$.
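For reference, a sketch of the misspecified sampling scheme described above (reusing the `markov_jump` sketch from Section 1; illustrative code):

```python
# Mixture sampling: with probability eps restart uniformly on the sphere,
# otherwise make a Markovian jump from the previous latent position.
import numpy as np

def sample_mixture_path(n, d, eps, markov_jump, sample_latitude, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d); x /= np.linalg.norm(x)
    path = [x]
    for _ in range(n - 1):
        if rng.uniform() < eps:                            # uniform restart
            x = rng.standard_normal(d); x /= np.linalg.norm(x)
        else:                                              # Markovian jump
            x = markov_jump(path[-1], sample_latitude, rng)
        path.append(x)
    return np.array(path)
```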

Influence of mixing time on estimation error
In order to assert that the dependence of the latent variables has an influence on the estimation of the unknown functions of our model, we would need a minimax lower bound. The derivation of such a minimax result is still an open problem, even in the independent setting (cf. [9]). Nevertheless, by making explicit the constants involved in the concentration inequalities, we can show that the mixing time of the latent Markovian dynamic affects our bound on the $\delta_2$ error between spectra. For any $r^* \in (−1, 1)$, let us consider the latitude function
$$f_{r^*}(r) \propto (1 − r^2)^{\frac{d-3}{2}}\; \mathbb{1}_{r \in [r^*, 1]}.$$
Note that the Markov transition kernel $P$ of the chain $(X_i)_{i\ge1}$ using this latitude function is the one that, starting from a point $x \in S^{d-1}$, samples uniformly a point in the set $\{z \in S^{d-1} \mid \|x − z\|_2^2 \le 2(1 − r^*)\}$. In particular, when $r^* = −1$, we recover the uniform distribution on the sphere. It is clear that the closer $r^*$ is to one, the larger the mixing time of the chain. One can show that for any $r^* \in (−1, 1)$ the chain is uniformly ergodic, by proving that there exist an integer $m \ge 1$, a constant $\delta_m > 0$ and a probability measure $\nu$ such that Eq. (16) holds, considering for example $\nu = \pi$ the uniform distribution on the sphere. It is straightforward to show that the smallest integer $m(r^*) \ge 1$ satisfying Eq. (16) is larger than $\frac{2}{1 − r^*}$: indeed, the latitude function $f_{r^*}$ only allows jumps of bounded size at each time step, while the two poles of $S^{d-1}$ are at Euclidean distance 2 from each other. Taking a closer look at the constants involved in the concentration inequality from [10] (cf. [10, Section 3.1.1]), we get that the constant $C$ in our bound satisfies $C \ge m(r^*)^2\, \tau(r^*)^2\, \|f_{r^*}\|_\infty$, where $\tau(r^*) \ge 1$ is the Orlicz norm of some regeneration time. This lower bound is increasing in $r^*$ and diverges to $+\infty$ when $r^* \to 1^-$. Hence, the closer $r^*$ is to one, the slower the chain mixes, and the poorer our bound is. Figure 12 presents the results of simulations using the latitude function $f_{r^*}$ and the envelope function $p : t \mapsto \mathbb{1}_{t \ge 0}$. We compute the $L^2$ error between the true and the estimated envelope functions (respectively the true and the estimated latitude functions). When $r^*$ gets closer to 1, the chain mixes slowly and we need to increase the sample size to prevent the $L^2$ errors from blowing up. Graphs have been generated with a latent dimension $d = 3$, sampling the latent positions with our isotropic procedure and latitude function $f_{r^*}$.

Choice of the clustering algorithm for the SCCHEi
The SCCHEi algorithm relies on the clustering of the eigenvalues of the adjacency matrix provided by the HAC with complete linkage. In this section, we motivate the use of the HAC algorithm with complete linkage by showing that the theoretical results from Section 2.3 could be much more involved to establish with another clustering procedure. Indeed, if one considered for example the HAC with the Ward distance, the theoretical result obtained for the correctness of the SCCHEi algorithm (cf. Theorems 2 and 3) would likely no longer be true (even if the sample size $n$ is chosen arbitrarily large). Let us show this on a simple example.
Applying the HAC algorithm (with the Ward distance) to the eigenvalues $(\hat\lambda_1, \dots, \hat\lambda_{\tilde R})$, the state reached after $\tilde R − 4 = 1 + d + d_2 − 4$ iterations of the HAC procedure consists of the cluster $\mathcal{C}_0 := \{\hat\lambda_1\}$ together with three other clusters $\mathcal{C}_1$, $\mathcal{C}_2$ and $\mathcal{C}_3$. Hence, in order to understand which clusters will be merged at the next step of the HAC algorithm, we compute the Ward distance between the different clusters.
Let us recall that for two finite non-empty sets $S, S' \subset \mathbb{R}$ with respective cardinalities $|S|$ and $|S'|$, the Ward distance between $S$ and $S'$ is given by
$$d_W(S, S') = \frac{|S|\,|S'|}{|S| + |S'|}\,\big(\mu_S − \mu_{S'}\big)^2,$$
where $\mu_S$ denotes the mean of $S$. Computing the Ward distances between clusters, we deduce that all Ward distances between pairs of clusters scale at least linearly with $d$, except the Ward distances between $\mathcal{C}_0$ and the other three clusters $\mathcal{C}_1$, $\mathcal{C}_2$ and $\mathcal{C}_3$. Indeed, for any $i \in \{1, 2, 3\}$, $d_W(\mathcal{C}_0, \mathcal{C}_i)$ remains bounded independently of the latent dimension $d$. We deduce that for any $g \in (0, \Delta_G/4)$, which can be chosen arbitrarily small, we can choose $d$ large enough to ensure that Eq. (17) holds, and thus the clusters merged between depths 4 and 3 from the root of the HAC's tree will not be $\mathcal{C}_2$ and $\mathcal{C}_3$.
This means that the state obtained at depth 3 from the root is not of type (⋆) (in the sense defined in Lemma 2). While this is not a sufficient condition to state that the SCCHEi will fail to recover the correct clusters, this example shows that the use of the Ward distance can lead to some unexpected clustering of the eigenvalues. Our example proves that, using the HAC algorithm with the Ward distance, the result of Lemma 2 does not hold anymore. Namely, regardless of how large the sample size is chosen, there are situations (in particular for a large latent dimension) where the states of type (⋆) (cf. Lemma 2) are never reached in the HAC tree with the Ward distance. Hence, obtaining a theoretical guarantee for the clustering provided by the SCCHEi in this framework may be impossible, or at least much more involved.

Concluding remarks

Estimation of the latent dimension
The proposed methods implicitly assume that the latent dimension $d$ is known. [1] proved that the latent dimension $d$ can be easily recovered in practice for $n$ large enough, provided that the spectral gap condition (9) holds. In the following, we briefly describe their approach. Given some matrix $\hat T_n$ as input and some set $\mathcal{D}$ of candidates for the dimension $d$ (typically $\mathcal{D} = \{2, 3, \dots, d_{max}\}$), apply the algorithm HEiC (cf. Algorithm 3 in Section F) for any $d_c \in \mathcal{D}$ and store the returned value $gap(d_c)$. Let us recall that $gap(d_c)$ corresponds to the largest gap between a bulk of $d_c$ eigenvalues of $\hat T_n$ and the rest of the spectrum (see the definition of Gap 1 in Section F for details). Once we have computed the different gaps, we pick the candidate $d_c$ that led to the largest one. Given the guarantees provided by Proposition 4, the previously described procedure will find the correct dimension with high probability (on the event defined with the notations of Proposition 4), if the true dimension of the latent space belongs to the candidate set $\mathcal{D}$.
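A minimal sketch of this selection rule (`heic_gap` is a placeholder for the gap computed by Algorithm 3; all names are illustrative):

```python
# Pick the candidate dimension whose HEiC gap is the largest.
def select_dimension(T_hat, candidates, heic_gap):
    gaps = {d_c: heic_gap(T_hat, d_c) for d_c in candidates}
    return max(gaps, key=gaps.get)
```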

Future research directions
Our work encourages the development of growth models for random graphs, and in particular the derivation of similar results for MRGGs with other latent spaces. It would also be desirable to extend our methods to more complex Markovian samplings of the latent positions, typically ones that are not isotropic. Our work leaves open the question of a theoretical guarantee for the estimation of the latitude function. While we proved (with Theorem 4) that we can consistently estimate the Gram matrix of the latent positions in Frobenius norm, this is not sufficient to ensure that our kernel density estimator is consistent, since we cannot ensure that $\frac{1}{n-1}\sum_{i=2}^{n} (r_i − \hat r_i)^2$ tends to 0 as $n$ goes to $+\infty$. Deriving a theoretical result regarding the estimation of the latitude function seems challenging, and we believe that it would require significantly different proof techniques.

Guidelines for the Appendix

Sections A to C: Basic definitions and complements
In Section A we recall basic definitions on Markov chains, which are required for Section B, where we describe some properties satisfied by the Markov chain $(X_i)_{i\ge1}$. Section C provides complementary results on harmonic analysis on $S^{d-1}$ which will be useful for our proofs.

Sections D to F: Algorithms and Experiments
In Section D, we give the proofs of Lemma 1 and Lemma 2. They are the cornerstones of the proof of Theorem 2, which provides a theoretical guarantee for the correctness of the algorithm SCCHEi. Section E describes precisely the slope heuristic used to perform the adaptive selection of the model dimension $\hat R$. Section F provides a complete description of the HEiC algorithm used to extract the $d$ eigenvectors of the adjacency matrix that are used to estimate the Gram matrix of the latent positions.

Sections G to I: Proofs of theoretical results

Thereafter, we dig into the most theoretical part of the Appendix. In Section G, we discuss the assumptions we made on the Markov chain $(X_i)_{i\ge1}$. Section G is also dedicated to the presentation of a concentration result for a particular U-statistic of the Markov chain $(X_i)_{i\ge1}$ that is an essential element of the proof of Theorem 1, which is provided in Section H. Finally, the proof of Theorem 4 can be found in Section I.

A Definitions for general Markov chains
We consider a state space E and a sigma-algebra Σ on E which is a standard Borel space. We denote by (X i ) i≥1 a time homogeneous Markov chain on the state space (E, Σ) with transition kernel P.

A.1 Ergodic and reversible Markov chains

Definition 6. [27, Section 3.2] (ϕ-irreducible Markov chains)
The Markov chain $(X_i)_{i\ge1}$ is said to be $\varphi$-irreducible if there exists a non-zero $\sigma$-finite measure $\varphi$ on $E$ such that for all $A \in \Sigma$ with $\varphi(A) > 0$ and for all $x \in E$, there exists a positive integer $n = n(x, A)$ such that $P^n(x, A) > 0$ (where $P^n(x, \cdot)$ denotes the distribution of $X_{n+1}$ conditioned on $X_1 = x$).

Definition 9. [27, section 3.3] and [25, Chapter 16] (Uniform ergodicity)
The Markov chain $(X_i)_{i\ge1}$ is said to be uniformly ergodic if there exist an invariant distribution $\pi$ and constants $0 < \rho < 1$ and $L > 0$ such that
$$\sup_{x\in E}\ \big\|P^n(x, \cdot) − \pi\big\|_{TV} \le L\,\rho^n, \quad n \ge 1.$$
Equivalently, the Markov chain $(X_i)_{i\ge1}$ is uniformly ergodic if the whole space $S^{d-1}$ is a small set, namely if there exist an integer $m \ge 1$, $\delta_m > 0$ and a probability measure $\nu$ such that
$$\forall x \in S^{d-1},\ \forall A \in \Sigma, \quad P^m(x, A) \ge \delta_m\, \nu(A). \tag{16}$$
Remark. A geometrically or uniformly ergodic Markov chain admits a unique invariant distribution and is aperiodic.

A.2 Spectral gap
This section is largely inspired by [13]. Let us consider that the Markov chain $(X_i)_{i\ge1}$ admits a unique invariant distribution $\pi$ on $S^{d-1}$.

Definition 11. (Spectral gap) A reversible Markov operator $P$ admits an absolute spectral gap $1 − \lambda^* > 0$, where $\lambda^* := \sup\{\|Ph\|_\pi : h \in L_0^2(\pi),\ \|h\|_\pi = 1\}$.
The next result provides a connection between spectral gap and geometric ergodicity for reversible Markov chains.

B Properties of the Markov chain
In the following, we denote by $\lambda_{Leb} \equiv \lambda_{Leb,d}$ the Lebesgue measure on $S^{d-1}$ and by $\lambda_{Leb,d-1}$ the Lebesgue measure on $S^{d-2}$. Let $P$ be the Markov operator of the Markov chain $(X_i)_{i\ge1}$. By abuse of notation, we will also denote by $P(x, \cdot)$ the density of the measure $P(x, dz)$ with respect to $\lambda_{Leb}(dz)$. For any $x, z \in S^{d-1}$, we denote by $R_x^z \in \mathbb{R}^{d\times d}$ a rotation matrix sending $x$ to $z$ (i.e., $R_x^z x = z$) and keeping $\mathrm{Span}(x, z)^\perp$ fixed. In the following, we denote $e_d := (0, 0, \dots, 0, 1) \in \mathbb{R}^d$.

B.1 Invariant distribution and reversibility for the Markov chain
Reversibility of the Markov chain (X i ) i≥1 .

Lemma 3.
For all $x, z \in S^{d-1}$, $P(x, z) = P(z, x) = P(e_d, R_x^{e_d} z)$.
Proof of Lemma 3. Using our model described in Section 2, we get $X_2 = r X_1 + \sqrt{1 − r^2}\, Y$ where, conditionally on $X_1$, $Y$ is uniformly sampled on the unit sphere of $X_1^\perp$. In the following, we denote by $\stackrel{(d)}{=}$ equality in distribution. Conditionally on $X_1$, using Cochran's theorem, we know that $\hat W − \langle\hat W, e_d\rangle e_d$ is a centered normal vector with covariance matrix the orthogonal projection matrix onto the space $\mathrm{Span}(e_d)^\perp$, leading to a representation of $Y$ through $\tilde Y \sim \mathcal{N}(0, I_{d-1})$. Using Lemma 4, we conclude that, conditionally on $X_1$, the corresponding random vector is again a standard centered Gaussian vector, because this distribution is invariant by rotation. Here $R_{X_1}^{X_2}$ is the rotation that sends $X_1$ to $X_2$ keeping the other dimensions fixed. Let us denote $a_1 := X_1$, $a_2 := \frac{X_2 − r X_1}{\|X_2 − r X_1\|_2}$, and complete the linearly independent family $(a_1, a_2)$ into an orthonormal basis $a := (a_1, a_2, \dots, a_d)$ of $\mathbb{R}^d$. Writing the matrix of $R_{X_1}^{X_2}$ in this basis and going back to Eq. (18), we deduce the first equality of Lemma 3, since $\tilde W = −W$ is also a standard centered Gaussian vector in $\mathbb{R}^d$. Based on Eq. (19), the second equality follows in the same way, which proves that $P(x, z) = P(e_d, R_x^{e_d} z)$.

Proposition 3. The uniform distribution on the sphere $S^{d-1}$ is a stationary distribution of the Markov chain $(X_i)_{i\ge1}$.
Proof of Proposition 3. Let us consider $z \in S^{d-1}$. Using the reversibility given by Lemma 3, integrating $P(x, z)$ over $x$ against $\pi$ gives back the uniform density at $z$, which proves that the uniform distribution on the sphere is a stationary distribution of the Markov chain.

B.2 Ergodicity of the Markov chain
Our results hold under the condition that the Markov chain (X i ) i≥1 is uniformly ergodic (cf. Assumption A). In this section, we provide a sufficient condition on the latitude function f for uniform ergodicity to hold.

Lemma 5.
We consider that f is bounded away from zero. Then, the Markov chain (X i ) i≥1 is π-irreducible and aperiodic.

Lemma 6.
We consider that f is bounded away from zero. Then the Markov chain (X i ) i≥1 is uniformly ergodic.
Proof of Lemmas 5 and 6. Taking for $\pi$ the uniform distribution on $S^{d-1}$, we get that for any $x \in S^{d-1}$ and any $A \subset S^{d-1}$ with $\pi(A) > 0$, $P(x, A) > 0$, since $\pi$ is invariant by rotation and $f$ is bounded away from zero. The same lower bound, being uniform in $x$, gives the small set condition (16), hence uniform ergodicity.

B.3 Computation of the absolute spectral gap of the Markov chain
Thanks to Proposition 2 (in Appendix A), we know that if f is such that (X i ) i≥1 is uniformly ergodic, the Markov chain has an absolute spectral gap (cf. Definition 11). In the following, we show that this absolute spectral gap is equal to 1.
Keeping the notations of Appendix A, let us consider $h \in L_0^2(\pi)$ such that $\|h\|_\pi = 1$. Then, using the rotational invariance of $\pi$ and the fact that $h \in L_0^2(\pi)$, the quantity involved in Definition 11 vanishes. Hence, the Markov chain $(X_i)_{i\ge1}$ has an absolute spectral gap equal to 1.

C Complement on Harmonic Analysis on the sphere
This section completes the brief introduction to harmonic analysis on the sphere $S^{d-1}$ provided in Section 2. We will need in our proofs the following result, which states that fixing one variable of $(W − W_R)^2$ and integrating with respect to the other one against the uniform measure on $S^{d-1}$ gives $\|W − W_R\|_2^2$.

Lemma 7.
For any $x \in S^{d-1}$, $\int_{S^{d-1}} (W − W_R)^2(x, y)\, \pi(dy) = \|W − W_R\|_2^2$, where $\pi$ is the uniform measure on $S^{d-1}$.
Proof of Lemma 7.
D.1 Proof of Lemma 1

• Let us recall that the function $f^*$ is the allocation function defined in Lemma 1. Note that for any $1 \le i \le \tilde R$, $\sigma^*(i) \le \tilde R$, thanks to the previous paragraph. We denote by $p_{(0)}^* \ge \dots \ge p_{(R)}^*$ the ordered sequence of $p_0^*, \dots, p_R^*$, and $d_{(k)}$ is the multiplicity of the eigenvalue $p_{(k)}^*$ of the operator $T_W$. We show that $f^*$ is such that $f^*(1) = \dots = f^*(d_{(0)}) = p_{(0)}^*$, $f^*(d_{(0)} + 1) = \dots = f^*(d_{(0)} + d_{(1)}) = p_{(1)}^*$, and so on. This is equivalent to saying that the function $f^*$ is non-increasing. If this were not true, there would exist two indices at which $f^*$ increases, and swapping the corresponding assignments would strictly decrease the cost achieved by $\sigma^*$. Since we chose $g$ such that $\Delta_G > 4g$, the resulting inequality is absurd. This concludes the proof.

D.2 Proof of Lemma 2
We prove our result by induction. In the following, we say that an intermediate state of the HAC algorithm is valid if it is still possible to reach a state of type (⋆) in the next iterations of the algorithm. Stated otherwise, a state is valid if there do not exist $1 \le i \neq j \le \tilde R$ such that $f^*(i) \neq f^*(j)$ with $\lambda^R(\hat T_n)_i$ and $\lambda^R(\hat T_n)_j$ in the same cluster. It is obvious that the initial state of the HAC algorithm is valid, since all eigenvalues are alone in their respective clusters.
Suppose now that we are at iteration $2 \le t \le \tilde R − R − 2$ of the HAC algorithm and that our procedure is valid until step $t$. We are sure that we did not reach a state of type (⋆) before step $t$, because only the state at depth $R$ from the root of the HAC's tree contains exactly $R + 1$ clusters. For any cluster $S$ formed at step $t$ by the HAC algorithm, we denote by abuse of notation $f^*(S) := f^*(i)$ for any $i$ such that $\lambda^R(\hat T_n)_i \in S$ (which is licit since step $t$ is valid). By contradiction, assume that the algorithm does not make a valid merging at step $t + 1$. This means that the two clusters $S_a$ and $S_b$ merged at step $t + 1$ are such that $f^*(S_a) \neq f^*(S_b)$. Since at step $t$ we did not reach a state of type (⋆), there are two clusters $S_i$ and $S_j$ with $i \neq j$ such that $f^*(S_i) = f^*(S_j)$.
Moreover, for any $\lambda^R(\hat T_n)_a \in S_a$ and $\lambda^R(\hat T_n)_b \in S_b$, the complete linkage distance between $S_i$ and $S_j$ is smaller than the one between $S_a$ and $S_b$. This is a contradiction, since at step $t$ the HAC algorithm merges the two clusters with the smallest complete linkage distance. Hence, the algorithm performs a valid merging at step $t + 1$. We proved that a state of type (⋆) is reached by the HAC algorithm with complete linkage at iteration $\tilde R − R − 1$. Since $d \ge 3$, it holds $d_0 < d_1 < d_2 < \dots$, and since the SCCHEi starts by selecting the cluster of size $d_0$ in the tree as close as possible to the root, we get $\hat d_0 = d_0$. Continuing the process of the "for" loop in the SCCHEi algorithm, the SCCHEi algorithm then selects the cluster of size $d_1$ in the remaining tree (where we removed all eigenvalues of $\hat d_0$ from the tree of the HAC). Hence, the SCCHEi algorithm sets $\hat d_1 = d_1$. Following this procedure, it is straightforward to see that the SCCHEi returns the partition of type (⋆).

E Slope heuristic
We propose a detailed analysis of the slope heuristic described in Section 2.4 on simulated data using $d = 3$, the envelope function $p^{(1)}$ and the latitude function $f^{(1)}$ presented in Eq. (11). We recall that $\hat R(\kappa)$ represents the optimal value of $R$ minimizing the penalized criterion of Eq. (7) for a given hyperparameter $\kappa$. Figure 15 shows the evolution of $\hat R(\kappa)$ with respect to $\kappa$, which is sampled on a log scale. $\tilde R(\kappa)$ is the dimension of the space of spherical harmonics with degree at most $\hat R(\kappa)$. Our slope heuristic consists in choosing the value $\kappa_0$ leading to the largest jump of the function $\kappa \mapsto \hat R(\kappa)$. In our case, Figure 15 shows that $\kappa_0 = 10^{-3.9}$. As described in Section 2.2, the resolution level $\hat R$ selected to cluster the eigenvalues of the matrix $\hat T_n$ is given by $\hat R(2\kappa_0)$.

G Concentration inequality for U-statistics with Markov chains
In this section, we present a recent concentration inequality for U-statistics of the Markov chain $(X_i)_{i\ge1}$ from [10], which is a key result to prove Theorem 1. In the first subsection, we recall the assumptions made on the Markovian dynamic, namely Assumption A.

G.1 Assumptions and notations for the Markov chain
Let us recall that Assumption A states that the latitude function $f$ is such that $\|f\|_\infty < \infty$ and makes the chain $(X_i)_{i\ge1}$ uniformly ergodic. Assumption A guarantees in particular that there exists $\delta_m > 0$ such that the small set condition (16) holds for some probability measure $\nu$ (e.g. the uniform measure $\pi$ on the sphere). In Section B.2, we provide a sufficient condition on the latitude function $f$ ensuring the uniform ergodicity of the chain, with associated constants $L > 0$ and $0 < \rho < 1$ (cf. Definition 9). In Section B.3, we explain why Assumption A ensures that the Markov chain $(X_i)_{i\ge1}$ has a spectral gap, and we show that this spectral gap is equal to 1.

G.2 Concentration inequality of U-statistic for Markov chain
One key result to prove Theorem 1 is the concentration of the U-statistic
$$U_n := \frac{1}{n(n-1)}\sum_{i \neq j} (W − W_R)^2(X_i, X_j).$$
Note that $\|W − W_R\|_2^2$ corresponds to the expectation of the kernel $(W − W_R)^2(\cdot, \cdot)$ under the uniform distribution on $S^{d-1}$, which is known to be the unique invariant distribution $\pi$ of the Markov chain $(X_i)_{i\ge1}$ (cf. Appendix B). More precisely, for any $x \in S^{d-1}$, it holds that $\int_{S^{d-1}} (W − W_R)^2(x, y)\,\pi(dy) = \|W − W_R\|_2^2$; see Lemma 7 for a proof. Applying [10, Theorem 2] in our framework leads to the following result.
Lemma 8. Let us consider $\gamma \in (0, 1)$ satisfying $\log(e\log(n)/\gamma) \le n$. Then it holds, with probability at least $1 − \gamma$, that the U-statistic above deviates from $\|W − W_R\|_2^2$ by at most $M\sqrt{\frac{\log n}{n}\,\log(e\log(n)/\gamma)}$, where $M > 0$ only depends on constants related to the Markov chain $(X_i)_{i\ge1}$.

H Proof of Theorem 1
The proof of Theorem 1 mainly lies in the following result which is proved in Section H.1. Coupling the convergence of the spectrum of the matrix of probability T n with a concentration result on the spectral norm of random matrices with independent entries (cf. [4]), we show the convergence in metric δ 2 of the spectrum of T n towards the spectrum of the integral operator T W .
Using Theorem 5, it holds $\mathbb{P}(\Omega(\gamma)) \ge 1 − \gamma$. On this event, the deviation term scales as $\sqrt{\frac{\log n}{n}\log(e\log(n)/\gamma)}$, up to a constant $c > 0$ that depends neither on $R$, $d$ nor $n$. Since the bias term is controlled by some constant $C(p, s, d) > 0$ (depending only on $p$, $s$ and $d$) times a power of $R$, choosing $\gamma = 1/n^2$ yields a bound with a constant $D > 0$ independent of $n$ and $R$. Choosing $R = (n/\log^2(n))^{\frac{1}{2s+d-1}}$ and using Eq. (30), Eq. (25) becomes the bound stated in Theorem 5, where $D$ is a constant that depends neither on $n$ nor on $R$.
Second part of the proof of Theorem 1. Let us recall that, in the statement of Theorem 1, $\lambda^{R_{opt}}(\hat T_n)$ is the sequence of the $R_{opt}$ first eigenvalues (sorted in decreasing absolute value) of the matrix $\hat T_n$, where $R_{opt}$ is the value of the parameter $R$ leading to the optimal bias-variance trade-off, namely $\lambda^{R_{opt}}(\hat T_n) = (\hat\lambda_1, \dots, \hat\lambda_{R_{opt}}, 0, 0, \dots)$.
From the computations in the first part of the proof, we know that $R_{opt} = (n/\log^2(n))^{\frac{1}{2s+d-1}}$. This corresponds to the situation where we choose $R$ optimally, and in practice it is possible to approximate this best model dimension using e.g. the slope heuristic. Therefore, $\delta_2\big(\lambda(T_W), \lambda^{R_{opt}}(\hat T_n)\big)$ is the quantity of interest, since it represents the distance between the eigenvalues used to build our estimates $(\hat p_k)_k$ and the true spectrum of the envelope function $p$. Since $\tilde R \asymp R^{d-1}$ for all integers $R \ge 0$, we have $\tilde R_{opt} \asymp (n/\log^2 n)^{\frac{d-1}{2s+d-1}}$. We deduce that for $n$ large enough $2\tilde R_{opt} \le n$, and using [9, Proposition 15] we obtain a control of $\delta_2\big(\lambda(T_{W_{R_{opt}}}), \lambda^{R_{opt}}(\hat T_n)\big)$, where $\lambda(T_{W_{R_{opt}}}) = (\lambda_1^*, \dots, \lambda_{\tilde R_{opt}}^*, 0, 0, \dots)$. Let us consider $\gamma \in (0, 1)$. Using Theorem 5, we know that with probability at least $1 − \gamma$, for $n$ large enough, the spectra of $T_n$ and $T_W$ are close in $\delta_2$. Using Eq. (23), Eq. (26) and the fact that $\tilde R \asymp R^{d-1}$, it holds with probability at least $1 − 1/n^2$ that this distance is bounded as announced, where $c > 0$ is a numerical constant and $M > 0$ depends on constants related to the Markov chain $(X_i)_{i\ge1}$ (see Theorem 5 for details). Moreover, the remaining bias term is controlled using Eq. (23). Finally, using the concentration of the spectral norm of random matrices with independent entries from [4], there exists a universal constant $C_0 > 0$ such that, conditionally on $(X_i)_{i\ge1}$, it holds with probability at least $1 − 1/n^2$ that $\|\hat T_n − T_n\|$ is controlled. Using again $\tilde R \asymp R^{d-1}$, this implies that for $n$ large enough, conditionally on $(X_i)_{i\ge1}$, with probability at least $1 − 1/n^2$, the corresponding $\delta_2$ term is bounded, where $D > 0$ is a numerical constant. From Eq. (27), we deduce that $\mathbb{P}(\Omega) \ge 1 − 2/n^2$, where the event $\Omega$ is the intersection of the two previous events. Using the triangle inequality, Eq. (28) and Eq. (29) lead to the bound of Theorem 1, which concludes the proof.

H.1 Proof of Theorem 5
We follow the same sketch of proof as in [9]. Let $R \ge 1$ and define the intermediate kernel matrices as in [9]. We point out the equality between the spectra of the operator $T_{W_R}$ and the matrix $K_R$. Using the SVD of $X_{R,n}$, one can also easily prove that $\lambda(T_{R,n}) = \lambda(T_{R,n}^*)$. We deduce the corresponding bound with the Hoffman-Wielandt inequality. Using equation (4.8) in [21, p. 127] and again the Hoffman-Wielandt inequality, we obtain that for any $\gamma \in (0, 1)$ with $\log(e\log(n)/\gamma) \le n/(13\tilde R)$, it holds with probability at least $1 − \gamma$ that the corresponding term is bounded, where $M > 0$ depends only on constants related to the Markov chain $(X_i)_{i\ge1}$. Now remark that $\|p − p_R\|_2 \le \|p\|_2$, because $p_R$ is the orthogonal projection of $p$ and $|p| \le 1$. We deduce a bound involving $\ln(e/\gamma)$ and $M\,\|p − p_R\|_\infty \sqrt{\frac{\log n}{n}}\,\big(\log(e\log(n)/\gamma)\big)^{1/2}$.

H.2 Proof of Lemma 9
Observe that $nE_{R,n}$ can be written as a sum of centered terms. By definition of the spectral norm of a Hermitian matrix, $\|E_{R,n}\| = \sup_{x} |x^\top E_{R,n} x|$. We use a covering set argument based on the following Lemma. We consider the set $Q$ given by Lemma 11 with $D = d$ and $\epsilon_0 \in (0, 1/2)$. Let us define $x_0$ such that $|x_0^\top E_{R,n} x_0| = \|E_{R,n}\|$ and $q_0 \in Q$ such that $\|x_0 − q_0\|_2 \le \epsilon_0$. Then, by the triangle inequality, $\big||x_0^\top E_{R,n} x_0| − |q_0^\top E_{R,n} q_0|\big| \le |x_0^\top E_{R,n} x_0 − q_0^\top E_{R,n} q_0|$, and the right-hand side is controlled in terms of $\epsilon_0\,\|E_{R,n}\|$. Hence, it suffices to control $\max_{q\in Q} |q^\top E_{R,n} q|$. We introduce, for any $q \in Q$, the function $F_q$. Let us consider $t > 0$. We want to apply Bernstein's inequality for Markov chains from [18, Theorem 1.1]. In the following, we denote by $\mathbb{E}_\pi[\cdot]$ the expectation with respect to the measure $\pi$. Using that the Markov chain $(X_i)_{i\ge1}$ has an absolute spectral gap equal to 1 (cf. Section B.3), we get from [18, Eq. (1.6)] that
$$\mathbb{P}\big(|F_q(X)| \ge t\big) = \mathbb{P}\big(|q^\top E_{R,n}\, q| \ge t\big) \le 2\exp\Big(\frac{−n t^2}{4(\tilde R − 1) + 10(\tilde R − 1)t}\Big),$$
which leads, by a union bound over $Q$, to the following: if $25^2 \ln(2/\alpha)\,\tilde R \le n$, it holds with probability at least $1 − \alpha$ that
$$\max_{q\in Q} |q^\top E_{R,n}\, q| \le 16\sqrt{\frac{\tilde R}{n}\,\ln(2/\alpha)}.$$
Here $\Lambda = \mathrm{Diag}(\hat\lambda_1, \dots, \hat\lambda_n)$ contains the eigenvalues of the matrix $\hat T_n$, and $\Lambda_R = (p_0^*, p_1^*, \dots, p_1^*, \dots, p_R^*, \dots, p_R^*, 0, \dots, 0) \in \mathbb{R}^n$, where each $p_k^*$ has multiplicity $d_k$. Then we denote by $V \in \mathbb{R}^{n\times d}$ (resp. $V_R$) the matrix formed by columns $1, \dots, d$ of the matrix $\tilde V$ (resp. $\tilde V_R$). The matrix $V^* \in \mathbb{R}^{n\times d}$ is the orthonormal matrix with $i$-th row $\frac{1}{\sqrt{n}}\big(Y_{1,1}(X_i), \dots, Y_{1,d}(X_i)\big)$. The matrices $G^*$, $\hat G$, $G_R$ and $G_{proj}^*$ are defined as follows: $G_{proj}^*$ is the projection matrix onto the column span of the matrix $V^*$. Using the triangle inequality, we decompose $\|G^* − \hat G\|_F$ into three terms, bounded in the following three steps.
Step 1: Bounding $\|\hat G − G_R\|_F$. Since the columns of the matrices $V$ and $V_R$ correspond respectively to eigenvectors of the matrices $\hat T_n$ and $T_{R,n}$, applying the Davis-Kahan sin-Theta theorem (cf. Theorem 6) gives that there exists an orthogonal matrix $O \in \mathbb{R}^{d\times d}$ such that $\|V − V_R O\|_F$ is controlled by the perturbation divided by $\Delta := \min_{k\in\{0,2,3,\dots,R\}} |p_1^* − p_k^*| \ge \Delta^* = \min_{k\in\mathbb{N},\,k\neq1} |p_1^* − p_k^*|$. Using Lemma 12 and $c_1 = \frac{d}{d-2}$, together with the proof of Theorem 1, we get that, with probability at least $1 − 1/n^2$, $\|\hat G − G_R\|_F$ is bounded by a constant $C > 0$ times the rate of Theorem 1.
Step 2: Bounding $\|G^* − G_{proj}^*\|_F$. To bound $\|G^* − G_{proj}^*\|_F$, we first apply Lemma 13 with $B = V^*$. Using a proof rigorously analogous to that of Lemma 9, it holds with probability at least $1 − \gamma$, for $n$ large enough, that the corresponding term is controlled. Choosing $\gamma = 1/n^2$, it holds with probability at least $1 − 1/n^2$ that $\|G^* − G_{proj}^*\|_F \le C\sqrt{\frac{\log n}{n}}$, where $C > 0$ is a universal constant.
Step 3: Bounding $\|G_{proj}^* − G_R\|_F$. We proceed exactly as in [1], but we provide the proof here for completeness. Since $G_{proj}^*$ and $G_R$ are projectors, we have, using for example [5, p. 202], a control of $\|G_{proj}^* − G_R\|_F$. We use Theorem 7 with $E = G_{proj}^*$, $F = G_R^\perp$, $B = T_{R,n}$ and $A = T_{R,n} + H$, where $H = \bar X_{R,n} K_R \bar X_{R,n}^\top − X_{R,n} K_R X_{R,n}^\top$, and where the columns of the matrix $\bar X_{R,n}$ are obtained by a Gram-Schmidt orthonormalization of the columns of $X_{R,n}$. Hence there exists a matrix $L$ such that $\bar X_{R,n} = X_{R,n}(L^{-1})^\top$. This matrix $L$ is such that a Cholesky decomposition of $X_{R,n}^\top X_{R,n}$ reads $L L^\top$.
where the last inequality comes from Lemma 14. From the previous remarks on the matrix $L$, and using the notations of the proof of Theorem 5 (provided in Section H.1), we directly get that $\|L^{-1}L^{-\top} − \mathrm{Id}_{\tilde R}\|$ is controlled by $\|X_{R,n}^\top X_{R,n} − \mathrm{Id}_{\tilde R}\| = \|E_{R,n}\|$.
Since $\tilde R \asymp R^{d-1}$ and $R = (n/\log^2 n)^{\frac{1}{2s+d-1}}$, we obtain, using Eqs. (31), (32), (34) and (35), that with probability at least $1 − 1/n^2$, $\|G_{proj}^* − G_R\|_F$ is bounded as announced, where $C_d > 0$ is a constant that may depend on $d$ and on constants related to the Markov chain $(X_i)_{i\ge1}$.

Conclusion.
We proved that on the event $\Omega$, it holds with probability at least $1 − 3/n^2$ that $\|G^* − \hat G\|_F$ is bounded by a constant $D_1 > 0$ (that depends on $\Delta^*$, $d$ and on constants related to the Markov chain $(X_i)_{i\ge1}$) times the announced rate. Moreover, Eq. (39) from the proof of Proposition 4 gives a further control on the same event. Using the concentration result from [4] on the spectral norm of a centered random matrix with independent entries, we get that there exists some constant $D_2 > 0$ such that, with probability at least $1 − 1/n^2$, it holds $\|G − \hat G\|_F \le D_2\sqrt{\frac{\log n}{n}}$.
matrices in $\mathbb{R}^{n\times d}$ with columns $(v_r, v_{r+1}, \dots, v_s)$ and $(\hat v_r, \hat v_{r+1}, \dots, \hat v_s)$ respectively, such that $\Sigma v_j = \lambda_j v_j$ and $\hat\Sigma \hat v_j = \hat\lambda_j \hat v_j$. Then there exists an orthogonal matrix $\hat O \in \mathbb{R}^{d\times d}$ such that $\|\hat V \hat O − V\|_F$ is controlled by the size of the perturbation $\|\hat\Sigma − \Sigma\|$ relative to the eigengap.

Lemma 14. (Ostrowski's inequality)
Let $A \in \mathbb{R}^{n\times n}$ be a Hermitian matrix and let $S \in \mathbb{R}^{n\times n}$ be a general matrix. Then $\|S A S^\top − A\|_F \le \|A\|_F \times \|S^\top S − \mathrm{Id}_n\|$.

J Proof of Proposition 1
Notice that for any $i \in [n]$,
$$\mathbb{P}\big(g_i(D_{1:n}) \neq A_{i,n+1}\big) = \mathbb{E}\big[\mathbb{1}_{g_i(D_{1:n}) \neq A_{i,n+1}}\big] = \mathbb{E}\Big[\mathbb{E}\big[\mathbb{1}_{g_i(D_{1:n}) \neq A_{i,n+1}} \mid D_{1:n}\big]\Big],$$
and the result follows from the conditional independence of $A_{i,n+1}$ and $g_i(D_{1:n})$ given $D_{1:n}$.