Localization in 1D non-parametric latent space models from pairwise affinities

We consider the problem of estimating latent positions in a one-dimensional torus from pairwise affinities. The observed affinity between a pair of items is modeled as a noisy observation of a function $f(x^*_{i},x^*_{j})$ of the latent positions $x^*_{i},x^*_{j}$ of the two items on the torus. The affinity function $f$ is unknown, and it is only assumed to fulfill some shape constraints ensuring that $f(x,y)$ is large when the distance between $x$ and $y$ is small, and vice-versa. This non-parametric modeling offers good flexibility for fitting data. We introduce an estimation procedure that provably localizes all the latent positions with a maximum error of the order of $\sqrt{\log(n)/n}$, with high probability. This rate is proven to be minimax optimal. A computationally efficient variant of the procedure is also analyzed under some more restrictive assumptions. Our general results can be instantiated to the problem of statistical seriation, leading to new bounds for the maximum error in the ordering.


1D latent localization problem
We consider the 1D latent localization problem, where we seek to recover the 1D latent positions of $n$ objects from pairwise similarity measurements. Such problems arise in archeology for relative dating of objects or graves [34], in 2D-tomography for angular synchronization [10,37], in bioinformatics for reads alignment in de novo sequencing [32], in computer science for time synchronization in distributed networks [14,22], or in matchmaking problems [6]. The data are collected as an $n \times n$ symmetric matrix $[A_{ij}]_{1\le i,j\le n}$, called the affinity matrix, which provides similarity measurements between pairs of objects. These similarity measurements $A_{ij}$ can be real-valued scores, or they can be binary pieces of information, as when the matrix $A$ encodes a network structure.
In 1D latent space models [24], the affinity matrix is assumed to be sampled as follows. The distribution is parametrized by a 1D metric space $(\mathcal X, d)$, some (possibly random) latent positions $(x^*_1, \dots, x^*_n) \in \mathcal X^n$ and an affinity function $f : \mathcal X \times \mathcal X \to \mathbb R$. Then, conditionally on $(x^*_1, \dots, x^*_n)$, the upper-diagonal entries $A_{ij}$ of the affinity matrix are sampled independently, with conditional mean $\mathbb E[A_{ij} \mid x^*_1, \dots, x^*_n] = f(x^*_i, x^*_j)$. The affinity $f(x^*_i, x^*_j)$ is typically assumed to decrease as the metric distance $d(x^*_i, x^*_j)$ increases. In particular, close points $x^*_i$ and $x^*_j$ share a high affinity, whereas distant points share a small affinity. These latent space models encompass many classical models, as exemplified in the next paragraphs.
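To fix ideas, here is a minimal simulation sketch of such a latent space model on the circle (the function names and the link function $g$ are ours, chosen for illustration; the model anticipates Example 1 below):

```python
import numpy as np

def sample_affinity_matrix(n, g, seed=None):
    """Sample latent positions uniformly on the circle and draw a binary
    affinity matrix with conditional mean g(d(x_i, x_j))."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)      # latent arguments
    diff = np.abs(theta[:, None] - theta[None, :])
    dist = np.minimum(diff, 2.0 * np.pi - diff)        # geodesic distance
    upper = rng.random((n, n)) < g(dist)               # Bernoulli draws
    A = np.triu(upper, k=1).astype(int)
    return A + A.T, theta                              # symmetric, zero diagonal

# Example with a non-increasing link function g(t) = max(0, 1 - t/pi):
A, theta = sample_affinity_matrix(200, lambda t: np.clip(1 - t / np.pi, 0, 1), seed=0)
```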
Example 1: Random Geometric Graph [21,30,13,11]. We observe a random graph with $n$ nodes labelled by $\{1, \dots, n\}$. The graph is encoded into an adjacency matrix $A \in \{0,1\}^{n\times n}$, by setting $A_{ij} = 1$ if there is an edge between nodes $i$ and $j$, and $A_{ij} = 0$ otherwise. Let $\mathcal C$ denote the unit sphere in $\mathbb R^2$ endowed with the geodesic distance $d$. In the circular random geometric graph model, the edges are sampled independently, with probability $\mathbb P[A_{ij} = 1] = g(d(x^*_i, x^*_j))$, where $g : [0,\pi] \to [0,1]$ is a non-increasing function and $x^*_1, \dots, x^*_n \in \mathcal C$ are the latent positions of the nodes on the sphere. This random graph model is therefore an instance of 1D latent space model where $\mathcal X = \mathcal C$ and $f(x,y) = g(d(x,y))$.

Example 2: Graphons and $f$-Random Graphs [12,27]. The class of $f$-random graph models, also called graphon models, encompasses all the distributions on random graphs that are invariant by permutation of the nodes. It is parametrized by the set of measurable functions $f : [0,1]\times[0,1] \to [0,1]$. The adjacency matrix $A$ of the graph is sampled as follows. First, $n$ latent positions $x^*_1, \dots, x^*_n$ are sampled i.i.d. uniformly on $[0,1]$. Then, conditionally on $x^*_1, \dots, x^*_n$, the edges are sampled independently, with conditional probability $\mathbb P[A_{ij} = 1 \mid x^*_1, \dots, x^*_n] = f(x^*_i, x^*_j)$. The $f$-random graph model is then an instance of 1D latent space model where $A_{ij} \in \{0,1\}$, and $\mathcal X = [0,1]$. Unless some additional constraints are imposed on the shape of $f$, the affinity $f(x^*_i, x^*_j)$ may vary arbitrarily with the distance $|x^*_i - x^*_j|$.

Example 3: R-Matrices and Statistical Seriation. A Robinson matrix (R-matrix) is any symmetric matrix $B \in \mathbb R^{n\times n}$ whose entries decrease when moving away from the diagonal, i.e. such that $B_{i,j} \ge B_{i+1,j}$ and $B_{i,j} \ge B_{i,j-1}$, for all $1 \le j \le i \le n$. A matrix $F$ is called a pre-R matrix when there exists a permutation $\sigma \in \Sigma_n$ of $\{1, \dots, n\}$ such that $F_\sigma = [F_{\sigma(i),\sigma(j)}]_{i,j}$ is an R-matrix.
The noisy seriation problem [15] amounts to finding, from a noisy observation of a pre-R matrix $F$, a permutation $\sigma^*$ such that $F_{\sigma^*}$ is an R-matrix. This problem appears in genomic sequencing [20], in interval graph identification [16], and in envelope reduction for sparse matrices [5]. This problem can be recast in the latent space terminology using $\mathcal X = \{1, \dots, n\}$, $x^*_i = \sigma^*(i)$ and the affinity function $f(x^*_i, x^*_j) = F_{\sigma^*(i),\sigma^*(j)}$. Since $F_{\sigma^*}$ is an R-matrix, the affinity $f(x^*_i, x^*_j)$ is decreasing with the distance $|x^*_i - x^*_j|$.
Example 4: Toroidal R-Matrices and Toroidal Seriation. Consider the set $\{1, \dots, n\}$ as a torus with the corresponding distance $d(i,j) = \min(|j-i|, n - |j-i|)$ for any $1 \le i, j \le n$. A toroidal R-matrix is any symmetric matrix $B$ whose entries decrease when moving away from the diagonal with respect to the toroidal distance: $B_{i,j} \ge B_{i+1,j}$ when $d(i,j) < d(i+1,j)$ and $B_{i,j} \ge B_{i,j+1}$ when $d(i,j) < d(i,j+1)$. As in Example 3 above, a pre-toroidal R-matrix is defined as a permutation of a toroidal R-matrix, and the statistical seriation model is defined analogously [33]. Again, we can recast this model as a latent space model on $\mathcal X = \{1, \dots, n\}$ endowed with the toroidal distance. Alternatively, we can also rewrite it as a latent space model on the regular grid $\mathcal C_n$ of the unit sphere $\mathcal C$ corresponding to the $n$-th roots of unity, endowed with the geodesic distance on $\mathcal C$.
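For concreteness, the toroidal R-property of Example 4 can be checked numerically by walking away from the diagonal along both arcs of the torus; the following sketch (with our own helper names, and 0-based indices) illustrates the definition:

```python
import numpy as np

def toroidal_distance(i, j, n):
    """d(i, j) = min(|j - i|, n - |j - i|) on the torus {0, ..., n-1}."""
    return np.minimum(np.abs(j - i), n - np.abs(j - i))

def is_toroidal_r_matrix(B, tol=1e-12):
    """Check that the entries of the symmetric matrix B are non-increasing
    when moving away from the diagonal along either arc of the torus."""
    n = B.shape[0]
    for i in range(n):
        cw = [B[i, (i + k) % n] for k in range(1, n // 2 + 1)]   # clockwise arc
        ccw = [B[i, (i - k) % n] for k in range(1, n // 2 + 1)]  # counter-clockwise arc
        if np.any(np.diff(cw) > tol) or np.any(np.diff(ccw) > tol):
            return False
    return True
```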
In the following, we assume that we observe a symmetric matrix $[A_{ij}]_{1\le i,j\le n}$ of pairwise affinity measurements, with $A_{ii} = 0$ (by convention) and
$$A_{ij} = f(x^*_i, x^*_j) + E_{ij}, \qquad 1 \le i < j \le n,$$
where (i) $x^*_1, \dots, x^*_n$ are $n$ unobserved latent positions spread on the unit sphere $\mathcal C$ in $\mathbb R^2$, (ii) $f : \mathcal C \times \mathcal C \to [0,1]$ is unobserved, symmetric, decreasing with the geodesic distance $d(x,y)$, and (iii) $[E_{ij}]_{1\le i<j\le n}$ are some independent sub-Gaussian random variables.
This non-parametric framework is very flexible for fitting pairwise affinity data. It encompasses the circular random geometric graph model (Example 1) and the toroidal statistical seriation model (Example 4).
Our overall goal is to recover the $n$-tuple of latent positions $x^* = (x^*_1, \dots, x^*_n) \in \mathcal C^n$, with high confidence, simultaneously for all individual positions $x^*_i$. As a global error of an estimator $\hat x$, say $d_2(\hat x, x^*)^2 = \sum_{i=1}^n d(\hat x_i, x^*_i)^2$, provides limited information on each individual error $d(\hat x_i, x^*_i)$, we focus instead on the maximum error
$$d_\infty(\hat x, x^*) = \max_{i=1,\dots,n} d(\hat x_i, x^*_i). \qquad (1)$$
We propose some estimators $\hat x$ achieving, with high probability, a maximum error $d_\infty(\hat x, x^*)$ of the order of $\sqrt{\log(n)/n}$, under the assumptions that the latent positions $x^*_1, \dots, x^*_n$ are sufficiently spread on $\mathcal C$ and some shape conditions relative to the decrease of $f(x,y)$ with $d(x,y)$. The $\sqrt{\log(n)/n}$ rate of estimation is shown to be optimal. To the best of our knowledge, these are the first optimal results on the maximum error $d_\infty(\hat x, x^*)$ in latent space models with unknown and non-parametric affinity function $f$.

Our contribution
As explained above, our overall goal is to recover the $n$-tuple of latent positions $x^* = (x^*_1, \dots, x^*_n)$ with a control on the maximum error (1). Unfortunately, this program cannot be carried out literally, as the latent positions are not identifiable from the distribution of the data. Indeed, for any bijective map $\varphi : \mathcal C \to \mathcal C$, we have $f(x,y) = f\circ\varphi^{-1}(\varphi(x), \varphi(y))$ for all $x, y \in \mathcal C$, with the notation $f\circ\varphi^{-1}(x,y) := f(\varphi^{-1}(x), \varphi^{-1}(y))$. Even if we enforced some strong shape constraints, like $f(x,y) = 1 - \alpha d(x,y)$ with $\alpha > 0$, since $f(x,y) = f(Qx, Qy)$ for any orthogonal transformation $Q$ of $\mathcal C$, the distribution of the data would still be invariant by orthogonal transformation of the latent positions. Hence, we face a delicate identifiability issue, which is fully explained and tackled in Section 2.2. Informally, our remedy is to provide some estimators $\hat x$ which are, under some assumptions, at distance $d_\infty(\hat x, x^*) = O(\sqrt{\log(n)/n})$ from some specific representative $x^*$ of the latent positions.
Our shape assumption (ii) on the affinity function $f$ ensures that the matrix $[f(x_i, x_j)]_{i,j=1,\dots,n}$ is (approximately) a toroidal pre-R matrix. We observe that the constant function $f(x,y) = 1$ fulfills assumption (ii), and that for this specific function there is no hope to recover any information on the latent positions, even in the noiseless case. To circumvent this issue, we introduce a bi-Lipschitz assumption, detailed in Section 2.1, constraining the decay of $f$ with $d$. In the specific case of the random geometric graph model with $g$ continuously differentiable, this condition merely amounts to requiring $g'(x)$ to be bounded away from $0$.
Our estimation procedures proceed in two main stages: (1) we start with an initial localization with a global control in the $d_1(x,y) := \sum_{i=1}^n d(x_i, y_i)$ distance; (2) then, for each point, we refine this first estimator to get a control in the $d_\infty$ distance. In order to avoid intricate statistical dependencies between the two stages, we use a sample splitting scheme ensuring that, at the second stage, the refinement uses data independent from those used at the first stage.
Let $S \subset \{1, \dots, n\}$ be a subset of indices sampled uniformly at random, and $\bar S = \{1, \dots, n\} \setminus S$. At the second step, the refined estimator $\hat x^{(2)}_{\bar S}$ of $x^*_{\bar S} := (x^*_i)_{i\in\bar S}$ can take as input any initial estimator $\hat x^{(1)}_S$ of $x^*_S = (x^*_i)_{i\in S}$. This second step has a polynomial computational complexity and, under appropriate assumptions, it fulfills, with high probability, $d_\infty(\hat x^{(2)}_{\bar S}, x^*_{\bar S}) = O(\sqrt{\log(n)/n})$ for some specific representative $x^*$ of the latent positions, provided that the initial estimator satisfies the control $d_1(\hat x^{(1)}_S, x^*_S) = O(\sqrt{n\log(n)})$. We propose two estimators fulfilling this requirement: (a) a first one, which requires no additional assumption, but which has a superpolynomial computational complexity; (b) a second one, adapted from [33], which has a polynomial computational complexity, but for which we prove an $O(\sqrt{n\log(n)})$ control only for a class of random geometric graphs.
Repeating the sampling of $S$ and merging the resulting estimators, we then get an estimator $\hat x$ achieving, with high probability and under appropriate assumptions, $d_\infty(\hat x, x^*) = O(\sqrt{\log(n)/n})$ for a specific representative $x^*$ of the latent positions. A matching lower bound is also derived, proving the optimality of the $\sqrt{\log(n)/n}$ rate. The significance of the improvement offered by the refinement step, and the impact of the sample splitting on the localization error, are investigated numerically.

Related work
In the last decade, the analysis of interaction data has given rise to numerous works in machine learning and statistics. Most of these works handle cases where the affinity function $f$ is either known or belongs to a known parametric model. There is a long-standing debate on the validity of such a rigid modeling [4]. Our modeling assumptions, with only shape constraints on $f$, offer a more flexible setting to fit data.
Latent points estimation in random geometric graphs. Random geometric graphs have attracted a lot of attention as a simple model for wireless communications or the internet [21,30]. In the most classical setting, an edge is drawn between $i$ and $j$ when $\|x^*_i - x^*_j\| \le r$, for some $r > 0$. The problem of estimating the latent positions $x^*_1, \dots, x^*_n$ in a square of $\mathbb R^2$ has been tackled by [13]. Compared to us, they consider the noiseless setting, where $E_{ij} = 0$, and the affinity map $f$ belongs to a 1-dimensional parametric model. The problem of latent position localization has also been investigated in the random dot-product graph [38,28,2], where, conditionally on the latent positions, the entries $A_{ij}$ of the adjacency matrix are independent Bernoulli random variables with mean $f(x^*_i, x^*_j) = \langle x^*_i, x^*_j\rangle$, where $\langle\cdot,\cdot\rangle$ is the Euclidean scalar product in $\mathbb R^d$. In this case, the function $f$ is known and the results are of an asymptotic nature.
Phase synchronization problems. The phase synchronization problem [37] amounts to estimating unknown angles $\theta_1, \dots, \theta_n$ from noisy measurements of $\theta_i - \theta_j$ mod $2\pi$. A version of this problem is when we seek to retrieve $x^*_1, \dots, x^*_n$, which are spread on the unit complex sphere $\mathcal C_1 = \{x \in \mathbb C : |x| = 1\}$, from noisy observations of $x^*_i \overline{x^*_j} = e^{\iota(\theta_i - \theta_j)}$. In this model, some minimax $\ell_2$-bounds on the localization error have been obtained by [19], without assumptions on the latent positions $x^*_1, \dots, x^*_n$. Such a model is close to the pairwise affinity model on $\mathcal C$. The main differences compared to our setting are that we focus on $d_\infty$-bounds, with an unknown function $f$, and that the function $e^{\iota(\theta_i - \theta_j)}$ does not have the affinity shape (ii).
Skills estimation in the Bradley-Terry model. In the Bradley-Terry model [6], the observations $A_{ij}$ are independent Bernoulli outcomes with mean $f(x^*_i, x^*_j) = \sigma(x^*_i - x^*_j)$, where $x^*_i \in \mathbb R$ represents the skill of individual $i$ and $\sigma(x) = e^x/(1+e^x)$ is the sigmoid function. The estimation of the skills by a spectral algorithm, or two-step variants of it, has received a lot of attention recently [29,9,8]. In particular, building on the structure of the problem, rate-minimax $\ell_\infty$-bounds have been derived for the spectral algorithm in the Bradley-Terry model when the skills belong to a compact set, possibly with missing-at-random observations. The Bradley-Terry model is a special instance of the 1D latent space model. Compared to our setting, the function $f$ is known and it does not fulfill the affinity properties (for example, it is not symmetric).
Seriation from pairwise affinity. Given a pre-R matrix $F$, the seriation problem seeks to find the latent order $\sigma^*$ such that $F_{\sigma^*}$ is an R-matrix. For this noiseless version of Example 3, efficient algorithms have been proposed using convex optimization [15] or spectral methods [3]. The exact seriation problem has been solved for toroidal R-matrices in the noiseless case [33], by using a spectral algorithm. A perturbation analysis has also been sketched in [33]. As a byproduct of our analysis, we provide some more explicit recovery bounds in our specific setting with noisy observations. Closer to our contribution, Janssen and Smith [25] observe a noisy version of a pre-R matrix and, under some assumptions on the affinity function $f$, learn a permutation approximately sorting the items. Although their assumptions on $f$ are not directly comparable to ours, the localization rates are (up to logarithmic factors) comparable to ours. We refer to the discussion below Corollary 3.4 for more details.
Two-step methods for latent space models. Our work is related to the global-to-local estimation strategy, that was originally introduced in Stochastic Block Models and more generally in clustering analysis [26,17,43], for the purpose of deriving sharp recovery bounds with polynomial time procedures. The general idea is to build upon an initial estimator that satisfies a certain (weak) consistency condition, and then apply greedy-type procedures (e.g. Lloyd's algorithm) to obtain minimax recovery bounds. This approach turned out to be fruitful in various latent space problems with discrete structure [7,18] and our procedure can be interpreted as one instance of this strategy in a non-parametric setting with a continuous latent space.

Notation and organization of the paper
Notation: In the sequel, $C, C', C'' > 0$ denote numerical constants that may change from line to line. For two functions or sequences $x$ and $y$, we write $x \lesssim y$ (resp. $x \gtrsim y$) when, for some numerical constant $C > 0$, we have $x \le Cy$ (resp. $x \ge Cy$). The maximum (resp. minimum) of $x$ and $y$ is denoted by $x \vee y$ (resp. $x \wedge y$). For any $x > 0$, we write $\lfloor x\rfloor$ for its integer part, and $[x]$ for the set of integers $[x] = \{1, \dots, \lfloor x\rfloor\}$. For $q \ge 1$, the entry-wise $\ell_q$ norm of a matrix $F = (f_{ij})$ is denoted by $\|F\|_q = (\sum_{ij} |f_{ij}|^q)^{1/q}$, the $i$-th row of $F$ is denoted by $F_i$, and the Frobenius scalar product between two matrices $F$ and $G$ is denoted by $\langle F, G\rangle$. Let $\Sigma_n$ be the collection of permutations of $[n]$. For any permutation $\sigma \in \Sigma_n$, and for any $n$-tuple $x$, we define $x_\sigma$ as the permuted $n$-tuple $x_\sigma = (x_{\sigma(1)}, \dots, x_{\sigma(n)})$.
Assimilating points in the unit sphere $\mathcal C$ of $\mathbb R^2$ to complex numbers with unit norm, we can represent $x \in \mathcal C$ by $x = e^{\iota \bar x}$, with $\bar x \in [0, 2\pi)$. We henceforth call the real number $\bar x$ the argument of $x \in \mathcal C$. The geodesic distance $d(x,y)$ on $\mathcal C$ can be conveniently defined in terms of the arguments of $x$ and $y$: $d(x,y) = \min(|\bar x - \bar y|, 2\pi - |\bar x - \bar y|)$. For any positive integer $k$, we define the regular grid $\mathcal C_k = \{1, e^{\iota 2\pi/k}, \dots, e^{\iota 2\pi(k-1)/k}\}$, which plays an important role in our analysis and algorithms. We denote by $\mathbb O$ the orthogonal group of $\mathbb R^2$, made of rotations and reflections, and for any $n$-tuple $x = (x_1, \dots, x_n) \in \mathcal C^n$ and any $Q \in \mathbb O$, we define $Qx := (Qx_1, \dots, Qx_n)$. For two subsets $S, S'$ of $\{1, \dots, n\}$, a matrix $A \in \mathbb R^{n\times n}$ and an $n$-tuple $x$, we write $A_{SS'} = [A_{ij}]_{i\in S, j\in S'}$ and $x_S = (x_i)_{i\in S}$. More generally, we denote by $x_S \in \mathcal C^S$ an $|S|$-tuple indexed by $S$. The complement of a set $S$ is denoted by $\bar S$.
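The following helpers, with names of our own choosing, implement the geodesic distance, the maximum error $d_\infty$ of (1), and the regular grid $\mathcal C_k$, all through the arguments of the points:

```python
import numpy as np

def geodesic(a, b):
    """Geodesic distance on C between points given by their arguments."""
    diff = np.abs(a - b) % (2.0 * np.pi)
    return np.minimum(diff, 2.0 * np.pi - diff)

def d_infty(x_args, y_args):
    """Maximum error max_i d(x_i, y_i) between two n-tuples of arguments."""
    return np.max(geodesic(x_args, y_args))

def regular_grid(k):
    """Arguments of the regular grid C_k = {1, e^{i 2pi/k}, ..., e^{i 2pi(k-1)/k}}."""
    return 2.0 * np.pi * np.arange(k) / k
```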
Organization: In Section 2, we describe the statistical setting and we discuss thoroughly the identifiability issues. The main embedding procedure, called Localize-and-Refine, is presented in Section 3. A spectral variant of this procedure is introduced in Section 4, with an application to geometric models. The minimax lower bound is stated in Section 5. In Section 6, we investigate numerically the usefulness of the sample splitting, and the significance of the improvement offered by the refinement step. We summarize our findings and discuss an open problem in Section 7. All the proofs are postponed to the appendices.

Statistical setting
We observe a realization of a symmetric random matrix $A \in \mathbb R^{n\times n}$, whose values on the diagonal are $A_{ii} = 0$. We denote by $F_{ij} = \mathbb E[A_{ij}]$ the mean value of $A_{ij}$ and by $E_{ij} = A_{ij} - F_{ij}$ the centered random fluctuation. We assume that $A$ has been generated by a latent space model on $\mathcal C$: there exist an $n$-tuple $x^* = (x^*_1, \dots, x^*_n) \in \mathcal C^n$ and a function $f : \mathcal C \times \mathcal C \to [0,1]$ such that
$$A_{ij} = f(x^*_i, x^*_j) + E_{ij}, \quad \text{for all } i \ne j. \qquad (3)$$
Both the function $f$ and the latent positions $x^*_1, \dots, x^*_n$ are unknown. We emphasize that the latent positions $x^* = (x^*_1, \dots, x^*_n)$ are assumed to be fixed¹, and we denote by $\mathbb P_{(x^*, f)}$ the distribution of $A$. Let us describe our assumptions on the spreading of the latent positions $x^*_1, \dots, x^*_n$, the shape of $f$, and the random fluctuations $E_{ij}$.
Spreading of the latent positions. We have in mind that the latent positions are well spread over the unit sphere. We do not strictly enforce this condition, but our error bounds depend on how far the latent positions are from a regular position on $\mathcal C$. More precisely, let us denote by $\Pi_n$ the set of regular positions on the unit sphere,
$$\Pi_n = \big\{ x = (e^{\iota 2\pi\sigma(j)/n})_{1\le j\le n} : \sigma \in \Sigma_n \big\}.$$
Our results involve the $d_\infty$-distance of the $n$-tuple of latent positions $x^*$ to the set $\Pi_n$ of regular positions,
$$d_\infty(x^*, \Pi_n) = \min_{y\in\Pi_n} d_\infty(x^*, y), \qquad (4)$$
with $d_\infty(x^*, y)$ defined in (1).
Bi-Lipschitz shape of $f$. As explained in the introduction, we have in mind that $f(x,y)$ decreases with the distance $d(x,y)$. Since there is no hope to recover the latent positions $x^*$ when the function $f$ is flat, we impose a minimal decrease of $f(x,y)$ with the distance $d(x,y)$. We also require some Lipschitz continuity of $f$ for our analysis. These two conditions on $f$ are enforced by the bi-Lipschitz condition described below.
Definition 2.1. Bi-Lipschitz functions. For any fixed constants $c_e \ge 0$ and $0 < c_l \le c_L$, we define $\mathrm{BL}[c_l, c_L, c_e]$ as the set made of all functions $f : \mathcal C^2 \to [0,1]$ that are symmetric (i.e. $f(x,y) = f(y,x)$ for all $x, y \in \mathcal C$) and that satisfy the two following conditions for all $x, y, y' \in \mathcal C$ with $d(x,y) \le d(x,y')$, where $\varepsilon_n = c_e\sqrt{\log(n)/n}$:
$$f(x,y) - f(x,y') \le c_L\,(d(x,y') - d(x,y)) + \varepsilon_n, \qquad (5)$$
$$f(x,y) - f(x,y') \ge c_l\,(d(x,y') - d(x,y)) - \varepsilon_n. \qquad (6)$$
When $c_e = 0$, Condition (5) enforces Lipschitz continuity and Condition (6) enforces a minimal decrease of $f(x,y)$ with $d(x,y)$. In the geometric case where $f(x,y) = g(d(x,y))$ with $g$ differentiable, these conditions with $c_e = 0$ amount to $-c_L \le g' \le -c_l$. For $c_e > 0$, the term $\varepsilon_n$ in (5)-(6) can be interpreted as a possible small relaxation of a strict bi-Lipschitz property. In the remainder of the paper, we will assume that $f \in \mathrm{BL}[c_l, c_L, c_e]$ for some $c_e \ge 0$ and $0 < c_l \le c_L$.
SubGaussian errors. We assume that the entries $E_{ij}$, for $1 \le i < j \le n$, of the noise matrix are independent and follow a subGaussian(1) distribution. It means that, for any matrix $B \in \mathbb R^{n\times n}$ and $t \ge 0$, we have
$$\mathbb P\Big[\Big|\sum_{i<j} B_{ij} E_{ij}\Big| \ge t\Big] \le 2\exp\Big(-\frac{t^2}{2\sum_{i<j} B_{ij}^2}\Big). \qquad (7)$$
Since centered random variables taking values in $[-1,1]$ have a subGaussian(1) distribution, this setting encompasses the case where $A \in \{0,1\}^{n\times n}$ is the adjacency matrix of a random graph whose distribution belongs to a latent space model on $\mathcal C$.
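The last claim follows from Hoeffding's lemma: a centered random variable $E_{ij}$ taking values in $[a,b]$ satisfies, for all $\lambda \in \mathbb R$,
$$\mathbb E\big[e^{\lambda E_{ij}}\big] \le \exp\Big(\frac{\lambda^2(b-a)^2}{8}\Big) = \exp\Big(\frac{\lambda^2}{2}\Big) \quad \text{for } [a,b] = [-1,1],$$
which is the subGaussian(1) moment condition.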
To keep the notation and the presentation simple, we assume henceforth that the sample size $n$ is a multiple of 4, and we denote by $n_0$ the integer $n_0 = n/4$.

Identifiability issues
Our overall goal is to estimate the latent positions $x^*$. Yet, in general, these latent positions are not identifiable from the distribution of $A$. Indeed, for any bijective $\varphi : \mathcal C \to \mathcal C$, we have $F_{ij} = f(x^*_i, x^*_j) = f\circ\varphi^{-1}(\varphi(x^*_i), \varphi(x^*_j))$, so that the couples $(x^*, f)$ and $(\varphi(x^*), f\circ\varphi^{-1})$ induce the same matrix $F$. Hence, it is not possible to recover $x^*$ from $F$ and, unless $x^*$ is identifiable from the distribution of the noise $E$, the $n$-tuple of latent positions $x^*$ is not identifiable. Worse, $F$ can be represented by many different couples $(x, f)$. Hence, we face a serious identifiability issue. However, with the premise that the latent positions are well spread on $\mathcal C$, we can give a sensible meaning to our estimation objective. We explain progressively the issues that we face, in order to clarify the problem.
As a warm-up, let us assume in this paragraph that $f$ is known. Even in this favorable case, there might exist some bijective $\varphi : \mathcal C \to \mathcal C$ leaving $f$ invariant, i.e. such that $f(\varphi(x), \varphi(y)) = f(x,y)$ for all $x, y \in \mathcal C$; this is in particular the case of any orthogonal transformation $\varphi = Q \in \mathbb O$ when $f$ is geometric. In this last case, unless $x^*$ is identifiable from the distribution of the noise $E$, the best that we can hope is to consistently estimate $x^*$ in terms of the quasi-distance $\min_{Q\in\mathbb O} d_\infty(\hat x, Qx^*)$, where $Qx := (Qx_1, \dots, Qx_n)$.
Let us come back to our setting where $f$ is unknown. We define $\mathcal R[F, c_l, c_L, c_e]$ (or simply $\mathcal R[F]$) as the set of representations of $F$ by $n$-tuples in $\mathcal C^n$ and bi-Lipschitz functions:
$$\mathcal R[F, c_l, c_L, c_e] = \big\{ (x, f) \in \mathcal C^n \times \mathrm{BL}[c_l, c_L, c_e] : F_{ij} = f(x_i, x_j) \text{ for all } i \ne j \big\}. \qquad (8)$$
Are all the elements in $\mathcal R[F, c_l, c_L, c_e]$ of the form $(Qx^*, f\circ Q^{-1})$, as in the case discussed above?
Let us first focus on the case where the latent positions of the representations are regular, i.e. belong to $\Pi_n$: in this case, with $c_e = 0$, any two representations coincide up to an orthogonal transformation (see the second statement of Proposition 2.3 below). This property breaks down when we move away from regular latent positions in $\Pi_n$. Indeed, the next proposition shows that we can have $\min_{Q\in\mathbb O} d_\infty(x, Qx')$ large, even when $x \in \Pi_n$ and $x'$ is well spread on $\mathcal C$. More precisely, this property is shown for an explicit $n$-tuple $x'$ such that any $z \in \mathcal C$ is at a distance at most $3\pi/n$ from one of the $x'_i$; such an $n$-tuple $x'$ is well spread on $\mathcal C$.
The representation $(x', f')$ can be obtained from $(x, f)$ by slightly stretching and contracting some pieces of the sphere $\mathcal C$. The detailed proof of this proposition is postponed to Appendix C. This result shows that the set of latent positions $x'$ in $\mathcal R[F, (3\pi)^{-1}, \pi^{-1}, 0]$ is much richer than the set $\{Qx : Q \in \mathbb O\}$ of orthogonal transformations of $x$, since it includes some latent positions $x'$ at constant $d_\infty$-distance from this set. Yet, the next proposition shows a reverse control when the latent positions are close to $\Pi_n$.

Proposition 2.3. Let $F$ be such that $\mathcal R[F, c_l, c_L, c_e]$ is non-empty, for some $c_e \ge 0$ and $0 < c_l \le c_L$. Then, there exists a constant $C_{lLe} > 0$, depending only on $c_e$, $c_l$ and $c_L$, such that, for any $(x, f), (x', f') \in \mathcal R[F, c_l, c_L, c_e]$,
$$\min_{Q\in\mathbb O} d_\infty(x, Qx') \le C_{lLe}\Big( d_\infty(x, \Pi_n) + d_\infty(x', \Pi_n) + \sqrt{\log(n)/n} \Big). \qquad (9)$$
If, in addition, $c_e = 0$ and $x, x' \in \Pi_n$, then there exists $Q \in \mathbb O$ such that $x = Qx'$.

The proof of (9) can be found in Appendix C.2, whereas the proof of the second statement can be found in Appendix A. This result shows that if $x$ and $x'$ are close to $\Pi_n$, then $x$ is close to an orthogonal transformation of $x'$. Hence, when we restrict to representations $(x, f)$ and $(x', f')$ with latent positions $x$ and $x'$ close to $\Pi_n$, the identifiability issue becomes smoother. Accordingly, to contain the phenomenon described in Proposition 2.2, we will focus henceforth on the representations $(x, f)$ which are the closest to $\Pi_n$.
Problem formulation. Let us explain our estimation strategy, in light of the above discussion. In order to circumvent the identifiability issues, we focus on the representations $(x, f) \in \mathcal R[F, c_l, c_L, c_e]$ whose latent positions are the closest to $\Pi_n$, i.e. we focus on the following set² of representations:
$$\mathcal R_{\Pi_n}[F, c_l, c_L, c_e] = \mathop{\mathrm{argmin}}_{(x,f)\in\mathcal R[F, c_l, c_L, c_e]} d_\infty(x, \Pi_n). \qquad (10)$$
Our goal is then to build an estimator $\hat x$, not depending on $c_l$, $c_L$ and $c_e$, such that, with high probability,
$$d_\infty(\hat x, x^*) \le C_{lLe}\Big( d_\infty(x^*, \Pi_n) + \sqrt{\log(n)/n} \Big)$$
for some representation $(x^*, f) \in \mathcal R_{\Pi_n}[F, c_l, c_L, c_e]$, and some constant $C_{lLe} > 0$ depending only on $c_e$, $c_l$ and $c_L$. For such an estimator, under the premise that the latent positions are well spread on $\mathcal C$, the maximum error $d_\infty(\hat x, x^*)$ is of order $\sqrt{\log(n)/n}$. As a side remark, we notice that, combining such a bound with (9), we obtain that, for any $(x, f) \in \mathcal R_{\Pi_n}[F, c_l, c_L, c_e]$, we have, with high probability, $\min_{Q\in\mathbb O} d_\infty(\hat x, Qx) \le C'_{lLe}( d_\infty(x, \Pi_n) + \sqrt{\log(n)/n})$.

Localize-and-Refine algorithm
The overall strategy for estimating an $n$-tuple of latent positions in $\mathcal R_{\Pi_n}[F]$ is to start with a first estimator $\hat x^{(1)}$ with a control in the $d_1(x,y) = \sum_i d(x_i, y_i)$ distance, and then to refine the estimation of each point $x^*_i$. In order to avoid complex statistical dependencies between the two steps, we use a sample splitting of the data. We sample $S \subset \{1, \dots, n\}$ with cardinality $|S| = n_0 = n/4$ uniformly at random, and set $\bar S = \{1, \dots, n\} \setminus S$. The first estimator $\hat x^{(1)}_S$ is based on the submatrix $A_{SS}$, while the refined estimator $\hat x^{(2)}_{\bar S}$ is based on $\hat x^{(1)}_S$ and the matrix $A_{\bar S S}$. This scheme avoids statistical dependence between $\hat x^{(1)}_S$ and $A_{\bar S S}$. The estimator $\hat x^{(2)}_{\bar S}$ provides a localization for points indexed by $\bar S$, with an error bound in $d_\infty$-norm. In order to localize all the points, we repeat the process and the final estimator is obtained by carefully merging the estimations. These three steps are precisely described in Sections 3.2-3.4, after the statement of our main results.

Main result
In this section, we consider the following setting for the data.
Setting 1. Assume that the matrix $A$ of observations is given by (3), with $x^* \in \mathcal C^n$ and $f \in \mathrm{BL}[c_l, c_L, c_e]$, for some $c_e \ge 0$ and $0 < c_l \le c_L$ (the set of bi-Lipschitz functions $\mathrm{BL}[c_l, c_L, c_e]$ is introduced in Definition 2.1). Assume also that the noise matrix $E$ follows the subGaussian errors assumption (7).
In this setting, the estimator (21) described in the next subsections fulfills the following risk bound.
Theorem 3.1. Assume that the data are generated according to Setting 1 above. Then, there exists a constant $C_{lLe} > 0$ depending only on $c_e$, $c_l$ and $c_L$ such that, with probability at least $1 - 5/n^2$, there exists a representation $(x, f)$ in the set $\mathcal R_{\Pi_n}[F, c_l, c_L, c_e]$ defined in (10) such that the estimator (21) fulfills
$$d_\infty(\hat x, x) \le C_{lLe}\Big( \min_{(x',f')\in\mathcal R[F, c_l, c_L, c_e]} d_\infty(x', \Pi_n) + \sqrt{\log(n)/n} \Big), \qquad (11)$$
with $\mathcal R[F, c_l, c_L, c_e]$ defined by (8).
We emphasize that the estimator (21) has no tuning parameter. In particular, it does not depend on the unknown constants $c_e \ge 0$ and $0 < c_l \le c_L$. The first term in the right-hand side of (11) can be assimilated to a bias term, which stems from the bias of the estimator (21) towards regular positions on $\mathcal C$. The second term in the right-hand side of (11) is a variance-type term. We observe that, if there exists $c_a > 0$ such that
$$\min_{(x',f')\in\mathcal R[F, c_l, c_L, c_e]} d_\infty(x', \Pi_n) \le c_a\sqrt{\log(n)/n}, \qquad (12)$$
then $d_\infty(\hat x, x) = O(\sqrt{\log(n)/n})$ with high probability. Such a setting arises for example when the latent positions have been sampled uniformly on the sphere; see Corollary 3.3. This $\sqrt{\log(n)/n}$ rate is shown to be minimax optimal in Section 5.
We also give, in Theorem 3.2, an explicit control on the error $\min_{Q\in\mathbb O} d_\infty(\hat x, Qx)$ for any representative $(x, f) \in \mathcal R[F, c_l, c_L, c_e]$. We emphasize that this statement holds for any representative $(x, f) \in \mathcal R[F, c_l, c_L, c_e]$. Before moving to the description of the estimator (21), let us give two important instantiations of Theorems 3.1 and 3.2.
Latent model with uniform sampling on $\mathcal C$. Let us consider the latent model $F_{ij} = f(x^*_i, x^*_j)$ when the positions $x^*_1, \dots, x^*_n$ have been sampled independently and uniformly on $\mathcal C$, as in the graphon model. In this case, Assumption (12) is satisfied with high probability; see Appendix B.5 for a proof. We then derive the next result from Theorem 3.2.

Corollary 3.3. Assume that the latent positions $x^*_1, \dots, x^*_n$ have been sampled i.i.d. uniformly on $\mathcal C$ and that $f \in \mathrm{BL}[c_l, c_L, c_e]$. Then, with probability higher than $1 - 7/n^2$, we have
$$\min_{Q\in\mathbb O} d_\infty(\hat x, Qx^*) \le C_{lLe}\sqrt{\log(n)/n},$$
for some constant $C_{lLe} > 0$ depending only on $c_e$, $c_l$ and $c_L$.
Toroidal seriation. Let us consider the toroidal seriation problem introduced in Example 4 of Section 1.1. In this setting, the set $[n]$ is considered as a torus, endowed with the torus distance $d(i,j) = \min(|j-i|, |n+i-j|)$ for any $1 \le i \le j \le n$, and the matrix $F$ is a pre-toroidal R-matrix. Let $\sigma^* \in \Sigma_n$ be a permutation such that $[F_{\sigma^*(i)\sigma^*(j)}]_{i,j}$ is a toroidal R-matrix. Our goal is to estimate $\sigma^*$ from the noisy observation $A = F + E$. As explained in the introduction, we can recast this problem as a localization problem in a latent space model on the regular grid $\mathcal C_n$. Assimilating points on the sphere $\mathcal C$ to unit-norm complex numbers, we define the vector $x^* \in \Pi_n$ by $x^*_j = \exp(\iota 2\pi\sigma^*(j)/n)$, and we define the affinity function $f_n$ on $\mathcal C_n \times \mathcal C_n$ by $f_n(x^*_i, x^*_j) = F_{ij}$. The problem of estimating $\sigma^*$ then amounts to estimating $x^*$ in the latent space model $A_{ij} = f_n(x^*_i, x^*_j) + E_{ij}$ on $\mathcal C_n$. We can apply our estimator (21) and get an estimation $\hat x \in \mathcal C^n$. From this estimation, we can derive the map $\hat\sigma : [n] \to [n]$ by setting $\hat\sigma_i = \lceil n\bar{\hat x}_i/2\pi\rceil$ for $i = 1, \dots, n$, where $\bar{\hat x}_i \in (0, 2\pi]$ is the argument of $\hat x_i$ and $\lceil z\rceil$ is the upper integer part of $z$. While the map $\hat\sigma$ may not be a permutation in $\Sigma_n$, it is an estimation of $\sigma^*$, and we can translate Theorem 3.2 into an $\ell_\infty$-error between the two.

Corollary 3.4. Assume that $[F_{\sigma^*(i)\sigma^*(j)}]_{i,j=1,\dots,n}$ fulfills the bi-Lipschitz conditions (5)-(6) with respect to the torus distance, rescaled by the factor $2\pi/n$. Then, there exists a constant $C_{lLe} > 0$, depending only on $c_e$, $c_l$ and $c_L$, such that, with probability at least $1 - 5/n^2$, we have
$$\min_{\pi\in\Gamma_n} \max_{i\in[n]} |\hat\sigma(i) - \pi\circ\sigma^*(i)| \le C_{lLe}\sqrt{n\log(n)}, \qquad (14)$$
where $\Gamma_n$ is the subgroup of permutations of $\{1, \dots, n\}$ generated by the circular and reverse permutations.
To prove (14), we extend $f_n$, defined on $\mathcal C_n \times \mathcal C_n$, to a function $f : \mathcal C \times \mathcal C \to \mathbb R$ belonging to $\mathrm{BL}[c_l, c_L, c_e]$, and apply Theorem 3.2. The minimum over $\Gamma_n$ in the left-hand side of (14) cannot be avoided, since $\sigma^*$ is identifiable from $F$ only up to permutations in $\Gamma_n$. Furthermore, the $\sqrt{n\log(n)}$ rate for the toroidal seriation problem can be shown to be minimax in the above set-up, by combining Theorem 5.1 and the correspondence between the $n$-tuples $x^* \in \Pi_n$ and the permutations $\sigma^*$ of $[n]$.
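As an illustration, the rounding map $\hat\sigma$ of Corollary 3.4 takes one line; here is a sketch with our own names (the final clip is a safety guard for arguments rounded to 0):

```python
import numpy as np

def positions_to_permutation(xhat_args, n):
    """Map estimated arguments in (0, 2*pi] to sigma_hat(i) = ceil(n * arg / (2*pi))."""
    sigma_hat = np.ceil(n * xhat_args / (2.0 * np.pi)).astype(int)
    return np.clip(sigma_hat, 1, n)
```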
In a recent work, Janssen and Smith [25] consider the related seriation problem for R-matrices (Example 3 in the introduction), in a geometric setting where $F_{\sigma^*_i,\sigma^*_j} = g(|i-j|/n)$ for some unknown permutation $\sigma^*$ and some unknown function $g$. Hence, in addition to being a pre-R matrix, $F$ is also a Toeplitz matrix. Under additional assumptions on the squared matrix $(F_{\sigma^*_i,\sigma^*_j})^2$, they establish that an algorithm based on a thresholded version of the squared matrix of observations $A^2$ achieves, with high probability, an error bound on $\min_{\pi\in\Gamma'_n}\max_{i\in[n]}|\hat\sigma(i) - \pi\circ\sigma^*(i)|$ similar to (14), where $\Gamma'_n$ gathers the identity and the reverse permutation. Their assumptions are not comparable to ours, but their rates are similar (up to log factors). Our results then complement this work, by providing another set of conditions on $F$ under which the hidden permutation $\sigma^*$ can be recovered at the rate $\sqrt{n\log(n)}$.
Organization of Section 3. The description of the estimator (21) is organized as follows. The refinement step producing $\hat x^{(2)}_{\bar S}$ is described in Section 3.2. This step can take as input any initial estimator $\hat x^{(1)}_S$ based on $A_{SS}$ and taking values in $\Pi_{|S|}$. When this initial estimator fulfills, with high probability, the $d_1$-control
$$d_1(\hat x^{(1)}_S, x_S) \le C\sqrt{n\log(n)} \qquad (15)$$
for some $(x, f) \in \mathcal R_{\Pi_n}[F]$, then the refined estimator $\hat x^{(2)}_{\bar S}$ is shown to fulfill, with high probability, a $d_\infty$-control of order $\sqrt{\log(n)/n}$. In order to get an estimator satisfying the risk bound (11), we then need an initial estimator $\hat x^{(1)}_S$ fulfilling (15). Such an estimator is provided in Section 3.3. A computationally efficient alternative, based on the spectral decomposition of $A$, is proposed in Section 4. To get an estimator of the whole $n$-tuple of latent positions, the data splitting is repeated and a final merging step is needed to build $\hat x$. This final step is described in Section 3.4.

Step 2: refined estimation
We start by describing the refinement step, which converts an initial estimator with an error bound in $d_1$-distance into a refined estimator with an error bound in $d_\infty$-distance. For a subset $S \subset \{1, \dots, n\}$ of cardinality $|S| = n_0 = n/4$, the refinement step takes as input any initial estimator $\hat x^{(1)}_S$ of $x^*_S$ based on $A_{SS}$ and taking values in $\Pi_{n_0}$. It outputs an estimator $\hat x^{(2)}_{\bar S}$ of $x^*_{\bar S}$. We denote by $D(z, \hat x^{(1)}_S) = [d(z, \hat x^{(1)}_j)]_{j\in S}$ the vector of distances between $z \in \mathcal C$ and the components of the $n_0$-tuple $\hat x^{(1)}_S$. The refined estimator $\hat x^{(2)}_{\bar S}$ is obtained by solving, for each $i \in \bar S$,
$$\hat x^{(2)}_i \in \mathop{\mathrm{argmin}}_{z\in\mathcal C_{n_0}} \sum_{j\in S} A_{ij}\, d(z, \hat x^{(1)}_j), \qquad (16)$$
where $\mathcal C_{n_0} = \{1, e^{\iota 2\pi/n_0}, \dots, e^{\iota 2\pi(n_0-1)/n_0}\}$ is the regular grid of cardinality $n_0$ on $\mathcal C$. The principle underlying the definition (16) is that $d(\hat x^{(2)}_i, \hat x^{(1)}_j)$ should be small when $A_{ij}$ is large, and vice-versa. Hence, in the noiseless case where the entries $A_{ij}$ decrease with $d(x^*_i, x^*_j)$, the criterion $z \mapsto \sum_{j\in S} A_{ij}\, d(z, x^*_j)$ is minimized at $x^*_i$; see Appendix A for details. Since $A$ is a noisy version of such a matrix, and since $d_1(\hat x^{(1)}_S, x^*_S)$ is small, we expect $\hat x^{(2)}_{\bar S}$ to be close to $x^*_{\bar S}$. The next proposition quantifies this statement, by providing a uniform error bound for $\hat x^{(2)}_{\bar S}$ in terms of the error bound $d_1(\hat x^{(1)}_S, Qx^*_S)$ for the initial estimator.
The right-hand side of (17) is made of three terms. The first one is the uniform approximation error of $x^*_S$ by $\Pi_{n_0}$, as defined in (4). It is a bias-type term which stems from the fact that we aim at estimating positions that are almost evenly spaced. The second term accounts for the error of the preliminary estimator $\hat x^{(1)}_S$ in $d_1$-distance, and the last one is a variance-type term. The proof of Proposition 3.5 can be found in Appendix B.
The time complexity for computing $\hat x^{(2)}_i$ is linear in $n_0$, and the computation can be parallelized over $i \in \bar S$ to obtain $\hat x^{(2)}_{\bar S}$. To decrease the time complexity, it is possible to restrict the minimization (16) to $z \in \mathcal C_{\sqrt{n_0}}$ instead of $z \in \mathcal C_{n_0}$. In that case, the proof of (17) still holds, with different constants.
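A minimal sketch of the refinement step (16), with our own function and variable names, representing all positions by their arguments:

```python
import numpy as np

def refine(A, S, Sbar, xhat1_args):
    """Refinement step (16): for each i in Sbar, return the grid point of C_{n0}
    minimizing sum_{j in S} A[i, j] * d(z, xhat1_j).
    S and Sbar are arrays of indices; xhat1_args are the arguments of xhat1_S."""
    n0 = len(S)
    grid = 2.0 * np.pi * np.arange(n0) / n0                  # arguments of C_{n0}
    diff = np.abs(grid[:, None] - xhat1_args[None, :]) % (2.0 * np.pi)
    D = np.minimum(diff, 2.0 * np.pi - diff)                 # n0 x n0 distances d(z, xhat1_j)
    scores = A[np.ix_(Sbar, S)] @ D.T                        # criterion of (16) on the grid
    return grid[np.argmin(scores, axis=1)]                   # refined arguments, one per i
```

Each row of `scores` evaluates the criterion of (16) on the whole grid, so the cost per point is indeed linear in $n_0$.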

Step 1: initial localization
In view of the above Proposition 3.5, we seek to build an initial estimator $\hat x^{(1)}_S$ fulfilling (15) for some $x_S = Qx^*_S$, with $Q \in \mathbb O$. Such an estimator can be obtained by solving
$$\hat x^{(1)}_S \in \mathop{\mathrm{argmin}}_{x_S\in\Pi_{n_0}} \sum_{i,j\in S} A_{ij}\, d(x_i, x_j). \qquad (18)$$
The estimator $\hat x^{(1)}_S$ is chosen in such a way that the distance $d(\hat x_i, \hat x_j)$ is small when $F_{ij}$ is large, and conversely, it should be large when $F_{ij}$ is small. To grasp the principle underlying the definition (18), let us look at the noiseless geometric affine setting, where the observations are $A_{ij} = 1 - d(x^*_i, x^*_j)/\pi$, and where the positions $x^*_S$ are evenly spread, i.e. $x^*_S \in \Pi_{n_0}$. Then, since $\sum_{i,j\in S} d(x_i, x_j)$ does not depend on the choice of $x_S \in \Pi_{n_0}$, one readily checks that minimizing the criterion in (18) amounts to maximizing $\sum_{i,j\in S} d(x^*_i, x^*_j)\, d(x_i, x_j)$, whose maximum is achieved at all $x_S = Qx^*_S$ with $Q$ any orthogonal transformation preserving $\Pi_{n_0}$. In other words, the estimator (18) exactly recovers, up to distance preserving transformations, the positions in this ideal setting.
The next proposition establishes a d 1 -bound with the flavor of (15) for the estimator (18).
Proposition 3.6. Assume that the data are generated according to Setting 1, and let $S \subset \{1, \dots, n\}$ be a subset of cardinality $n_0 = n/4$. Then, there exists a constant $C_{lLe}$, depending only on $c_l$, $c_L$ and $c_e$, such that the estimator $\hat x^{(1)}_S$ defined in (18) fulfills
$$\min_{Q\in\mathbb O} d_1(\hat x^{(1)}_S, Qx^*_S) \le C_{lLe}\Big( n\, d_\infty(x^*_S, \Pi_{n_0}) + \sqrt{n\log(n)} \Big), \qquad (19)$$
with probability higher than $1 - 1/n^2$.
To prove Proposition 3.6 (in Appendix B.3), we first establish that $\|D(x^*_S) - D(\hat x^{(1)}_S)\|_2$ is small, meaning that the pairwise distances between the estimated positions $\{\hat x^{(1)}_i : i \in S\}$ are close to the pairwise distances between the true positions $\{x^*_i : i \in S\}$. Then, relying on a recent result on matrix perturbation from [1], we deduce that $\|\hat x^{(1)}_S - Qx^*_S\|_2$ is small, where $Q$ is a distance preserving transformation and where we consider here $\hat x^{(1)}_S$ and $Qx^*_S$ as $2 \times n_0$ matrices. The bound (19) then follows by connecting the Euclidean distance in $\mathbb R^2$ to the $d_1$-distance.
From a computational point of view, the minimization problem (18) is an instance of the Quadratic Assignment Problem, which is known to be NP-hard and even hard to approximate [31,35]. In Section 4, we propose a computationally efficient alternative to (18), and we provide theoretical guarantees for this alternative under additional model assumptions.

Final merging step
For a given subset $S \subset \{1, \dots, n\}$ of cardinality $n_0 = n/4$, combining the initial estimator (18) with the refined localisation (16), we get an estimator $\hat x^{(2)}_{\bar S}$ with an error bound on $d_\infty(\hat x^{(2)}_{\bar S}, Qx^*_{\bar S})$ for some orthogonal transformation $Q \in \mathbb O$. In order to get an estimation of all the latent positions, we repeat the process by sampling $S' \subset \bar S$ of cardinality $n_0 = n/4$ and by computing $\hat x^{(1)}_{S'}$ and $\hat x^{(2)\prime}_{\bar S'}$ as in (18) and (16). We then get an estimator with an error bound on $d_\infty(\hat x^{(2)\prime}_{\bar S'}, Q'x^*_{\bar S'})$ for some $Q' \in \mathbb O$. Since $S \cap S' = \emptyset$, we have $\bar S \cup \bar S' = \{1, \dots, n\}$, so the two refined estimators together cover all the positions. In order to get a final estimator $\hat x$, we still have to deal with the fact that we may have $Q \ne Q'$, and hence the trivial merge of the two estimators may not be a good one. Hence, we need to synchronize the estimators $\hat x^{(2)}_{\bar S}$ and $\hat x^{(2)\prime}_{\bar S'}$ on the overlap $\bar S \cap \bar S'$, by solving
$$\hat Q \in \mathop{\mathrm{argmin}}_{Q\in\mathbb O} \sum_{i\in\bar S\cap\bar S'} d\big(\hat x^{(2)}_i, Q\hat x^{(2)\prime}_i\big), \qquad (20)$$
and by defining the final estimator as $\hat x_i = \hat x^{(2)}_i$ for $i \in \bar S$ and $\hat x_i = \hat Q\hat x^{(2)\prime}_i$ for $i \in S$. Putting pieces together, we then get the following estimation procedure.

Localize-and-Refine algorithm (21):
A) First localization. 1. Sample $S \subset \{1, \dots, n\}$ of cardinality $n_0$ uniformly at random. 2. Compute $\hat x^{(1)}_S$ by (18) and $\hat x^{(2)}_{\bar S}$ by (16).
B) Second localization. 1. Sample $S' \subset \bar S$ of cardinality $n_0$ uniformly at random. 2. Compute $\hat x^{(1)}_{S'}$ by (18) and $\hat x^{(2)\prime}_{\bar S'}$ by (16).
C) Merging the two localizations. 1. Compute $\hat Q$ by solving (20). 2. Output: $\hat x \in \mathcal C^n$ defined by $\hat x_i = \hat x^{(2)}_i$ for $i \in \bar S$, and $\hat x_i = \hat Q\hat x^{(2)\prime}_i$ for $i \in S$.

The minimization problem (20) can be solved efficiently. For example, we can observe that the minimum is achieved at some $\hat Q \in \mathbb O$ preserving $\mathcal C_{n_0}$, and since there are at most $2n_0$ such orthogonal transformations, we can enumerate them. We refer to Section 6 for details.
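A sketch of the merging step C, solving (20) by enumerating the $2n_0$ orthogonal transformations preserving $\mathcal C_{n_0}$ ($n_0$ rotations and $n_0$ reflections); the implementation choices below are ours:

```python
import numpy as np

def geodesic(a, b):
    diff = np.abs(a - b) % (2.0 * np.pi)
    return np.minimum(diff, 2.0 * np.pi - diff)

def synchronize(x_args, y_args, n0):
    """Solve (20): among the 2*n0 transformations preserving C_{n0}, return the
    one minimizing the total geodesic distance between x and the transformed y.
    x_args and y_args are the arguments of the two estimators on the overlap."""
    best, best_cost = None, np.inf
    for k in range(n0):
        s = 2.0 * np.pi * k / n0
        for transform in (lambda a, s=s: (a + s) % (2.0 * np.pi),   # rotation by s
                          lambda a, s=s: (s - a) % (2.0 * np.pi)):  # reflection
            cost = geodesic(x_args, transform(y_args)).sum()
            if cost < best_cost:
                best, best_cost = transform, cost
    return best  # callable to be applied to the whole second estimator
```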

Spectral Localization in the geometric latent model
The computation of the initial localization $\hat x^{(1)}_S$ requires to minimize (18) over $\Pi_{n_0}$, which is an instance of the Quadratic Assignment Problem (QAP), known to be NP-hard and hard to approximate [31,35]. A spectral relaxation of the QAP has been shown to be successful for reordering a pre-(toroidal)-R matrix [3,33], and hence for solving the noiseless seriation problem for R-matrices. The vanilla spectral algorithm proposed in [3] takes as input any symmetric matrix $M \in \mathbb R^{N\times N}$ and outputs $N$ points in $\mathbb R^2$.
Vanilla Spectral Algorithm (VSA)
Input: a symmetric matrix $M \in \mathbb R^{N\times N}$.
Compute: two orthonormal eigenvectors $\hat u, \hat v \in \mathbb R^N$ associated with the second and third eigenvalues $\hat\lambda_1$ and $\hat\lambda_2$ of $M$.
Output: the $N$ points $\hat x^{\mathrm{VSA}}_i = (\hat u_i, \hat v_i) \in \mathbb R^2$, for $i = 1, \dots, N$.

In this section, we adapt this vanilla spectral algorithm in order to get a computationally efficient initial estimator $\hat x^{(1)}_S$. In Section 4.1, we describe the estimator $\hat x^{(1)}_S$ and provide some error bounds in $d_\infty$-distance for the Localize-and-Refine algorithm based on $\hat x^{(1)}_S$. The main difference compared to Section 3 is that our theory is limited to cases where the function $f$ is geometric, that is,
$$f(x, y) = g(d(x, y)) \quad \text{for all } x, y \in \mathcal C, \qquad (23)$$
for some $g : [0, \pi] \to [0, 1]$. In Section 4.2, we complement this result by providing an error bound in $\ell_1$-norm for the vanilla spectral algorithm (VSA) in the geometric case.
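A direct implementation of VSA reads as follows (a sketch; we take "second and third eigenvalues" to mean the eigenvalues ranked in decreasing order, which is our reading of the algorithm):

```python
import numpy as np

def vanilla_spectral(M):
    """Vanilla Spectral Algorithm: embed the N items in R^2 using orthonormal
    eigenvectors associated with the second and third largest eigenvalues of M."""
    eigval, eigvec = np.linalg.eigh(M)            # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]              # re-rank in decreasing order
    u_hat = eigvec[:, order[1]]
    v_hat = eigvec[:, order[2]]
    return np.column_stack([u_hat, v_hat])        # N points in R^2
```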

Spectral Localization algorithm
We observe that the output $\hat x^{\mathrm{VSA}}_S$ of the vanilla spectral algorithm applied to $A_{SS}$ does not belong to $\Pi_{n_0}$, nor even to $\mathcal C_{n_0}$. Hence, we need an additional approximation step in order to get an estimator $\hat x^{(1)}_S$ that can be plugged into our Localize-and-Refine procedure (16). In the description of the algorithm below, we identify points on the circle $\mathcal C$ with unit-norm complex numbers. Besides, for such a point $z$, we write $\|z\|_1$ for its $\ell_1$ norm in $\mathbb R^2$.

Spectral Localization (LS)
The next theorem provides an error bound in $d_\infty$-distance for the Localize-and-Refine procedure (21), when we replace the initial estimator (18) by the output $\hat x^{(1)}_S$ of the Spectral Localization algorithm. This bound involves the two spectral gaps $\Delta_1 = \lambda_1 - \lambda_2$ and $\Delta_2 = \lambda_3 - \lambda_4$, where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$ denote the ordered eigenvalues of the signal matrix $F$.
Theorem 4.1. Assume that the data are generated according to Setting 1, that $f$ is a geometric function (23), that the latent positions fulfill the spreading condition (24), and that the spectral gaps satisfy $\Delta_1 \wedge \Delta_2 \ge c_b n$. Then, there exists a constant $C_{lLeab} > 0$ depending only on $c_e$, $c_l$, $c_L$, $c_a$ and $c_b$, such that, with probability at least $1 - 9/n^2$, the spectral Localize-and-Refine procedure (21), with $\hat x^{(1)}_S$ given by the Spectral Localization algorithm, fulfills $\min_{Q\in\mathbb O} d_\infty(\hat x, Qx^*) \le C_{lLeab}\sqrt{\log(n)/n}$.

Similarly as in Theorems 3.1 and 3.2, we estimate the latent positions at the optimal $\sqrt{\log(n)/n}$ rate in $d_\infty$-distance, but under the additional assumptions that $f$ is a geometric function (23) and that $F$ fulfills the spectral gap condition $\Delta_1 \wedge \Delta_2 \ge c_b n$. The proof of Theorem 4.1 is given in Appendix D. The proof mainly relies on controlling the $\ell_1$-norm between $\hat x^{\mathrm{VSA}}$ and $Qx^*$. This result, which has its own interest, is presented in Proposition 4.4, in Section 4.2. Below, we exhibit two cases where the spectral gap condition $\Delta_1 \wedge \Delta_2 \ge c_b n$ holds.
Example: geometric model with Fourier gaps. The eigenvalues of $F$ are closely related to the discrete Fourier transform of $g$, so that we can bound the spectral gaps $\Delta_1$ and $\Delta_2$ in terms of Fourier coefficients. More precisely, the function $f$ is given by $f(x,y) = g(d(x,y))$, with $g$ defined on $[0,\pi]$. One can extend $g$ to $[0, 2\pi)$ by setting $g(x) = g(2\pi - x)$ for any $x \in (\pi, 2\pi)$. Then, for any integer $n$, the discrete Fourier transform of $(g(j\tfrac{2\pi}{n}) : j = 0, \dots, n-1)$ is defined by
$$F_{k,n}(g) = \sum_{j=0}^{n-1} g\Big(\frac{2\pi j}{n}\Big)\cos\Big(\frac{2\pi jk}{n}\Big), \qquad k = 0, \dots, n-1.$$
The following lemma bounds the spectral gaps $\Delta_1$ and $\Delta_2$ in terms of the gaps between the Fourier coefficients.
Lemma 4.2. Assume that $F_{ij} = f(x^*_i, x^*_j)$, with $f$ a geometric function $f = g \circ d$, and $x^*$ fulfilling (24). Let us set $\Phi_1 = F_{0,n}(g) - F_{1,n}(g)$ and $\Phi_2 = \min_{j=2,\dots,\lfloor n/2\rfloor} F_{1,n}(g) - F_{j,n}(g)$. Then, there exists a constant $C_{lLea} > 0$, depending only on $c_e$, $c_l$, $c_L$ and $c_a$, such that
$$\Delta_1 \ge \Phi_1 - C_{lLea}\sqrt{n\log(n)} \quad\text{and}\quad \Delta_2 \ge \Phi_2 - C_{lLea}\sqrt{n\log(n)}.$$
Hence, Theorem 4.1 still holds when we replace the spectral gap condition $\Delta_1 \wedge \Delta_2 \ge c_b n$ by the Fourier gap condition $\Phi_1 \wedge \Phi_2 \ge c_b n$. So, when the first discrete Fourier coefficients of $g$ are well separated from the other coefficients, the spectral version of the Localize-and-Refine algorithm estimates, in polynomial time, the latent positions at the optimal $\sqrt{\log(n)/n}$ rate in $d_\infty$-distance. Below, we give an example where the Fourier coefficients can be explicitly computed and where $\Phi_1$, $\Phi_2$ are proportional to $n$.
Example: affine geometric model. As a simple instantiation of Theorem 4.1 and Lemma 4.2, let us consider the geometric function $f(x,y) = 1 - d(x,y)/(2\pi)$. The corresponding univariate function $g(z) = 1 - z/(2\pi)$ is affine, and its discrete Fourier coefficients can be computed explicitly in terms of trigonometric functions. In Appendix D.6, we prove that $\Phi_1 \wedge \Phi_2 \ge c_b n$ for some numerical constant $c_b > 0$. We then get the next corollary of Theorem 4.1.
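As a sanity check, the Fourier gaps of the affine model can be evaluated numerically from the coefficients $F_{k,n}(g)$ defined above (a sketch with our own names; the printed ratios stabilize, consistent with $\Phi_1 \wedge \Phi_2 \ge c_b n$):

```python
import numpy as np

def dft_coeffs(g, n):
    """Coefficients F_{k,n}(g) = sum_j g(2*pi*j/n) * cos(2*pi*j*k/n), using the
    symmetric extension g(x) = g(2*pi - x) on (pi, 2*pi)."""
    t = 2.0 * np.pi * np.arange(n) / n
    gt = np.where(t <= np.pi, g(t), g(2.0 * np.pi - t))
    return np.array([(gt * np.cos(k * t)).sum() for k in range(n)])

for n in (100, 400, 1600):
    F = dft_coeffs(lambda z: 1.0 - z / (2.0 * np.pi), n)   # affine model
    phi1 = F[0] - F[1]
    phi2 = np.min(F[1] - F[2 : n // 2 + 1])
    print(n, phi1 / n, phi2 / n)                           # roughly constant ratios
```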
Corollary 4.3. Assume that $F_{ij} = 1 - d(x^*_i, x^*_j)/(2\pi)$ and that the latent positions $x^* \in \mathcal C^n$ fulfill (24). Then, there exist constants $C_a$ and $C'_a$, depending only on $c_a$, such that, for all $n \ge C'_a$, with probability higher than $1 - 9/n^2$, the spectral Localize-and-Refine procedure (21), with $\hat x^{(1)}_S$ given by the Spectral Localization algorithm, fulfills $\min_{Q\in\mathbb O} d_\infty(\hat x, Qx^*) \le C_a\sqrt{\log(n)/n}$. Theorem 5.1 in the next section shows that this estimation rate is optimal.

$\ell_1$-bound for the vanilla spectral algorithm
As a byproduct of our analysis, we provide an $\ell_1$-bound for the estimation of the latent positions with the vanilla spectral algorithm (VSA), in the geometric latent model. Recanati et al. [33] have already shown that VSA succeeds in recovering the hidden permutation in the noiseless seriation problem with R-matrices. We extend their work to the geometric latent model on $\mathcal C$.

Starting from the noisy observation $A = F + E$, we apply VSA to the whole matrix $A$ and get an estimation $\hat x^{\mathrm{VSA}} \in \mathbb R^{2\times n}$ of $x^*$. The next proposition provides a bound in terms of the $\ell_1$-distance and of the spectral gaps $\Delta_1$ and $\Delta_2$.

Proposition 4.4. Let $f = g \circ d$ be a bi-Lipschitz geometric function. Assume that the latent positions $x^*$ fulfill Assumption (24), with $c_a > 0$. Then, there exist two constants $C_{lLea}$ and $C'_{lLea}$, depending only on $c_l$, $c_L$, $c_e$ and $c_a$, such that, if $\Delta_1 \wedge \Delta_2 \ge C'_{lLea}\sqrt{n\log(n)}$, then, with probability at least $1 - 1/n^2$, the vanilla spectral estimator $\hat x^{\mathrm{VSA}}$ satisfies
$$\min_{Q\in\mathbb O} \|\hat x^{\mathrm{VSA}} - Qx^*\|_1 \le C_{lLea}\,\frac{n\sqrt{n\log(n)}}{\Delta_1 \wedge \Delta_2}.$$
Proposition 4.4 is proved in Appendix D.7. It provides an $\ell_1$-localization bound depending on the spectral gaps $\Delta_1 \wedge \Delta_2$ of the signal matrix $F$. Since there are only $n$ positions to be estimated in the bounded space $\mathcal C$, this bound is uninformative when the spectral gaps $\Delta_1 \wedge \Delta_2$ are smaller than $\sqrt{n\log(n)}$. Conversely, when the spectral gaps are of the order of $n$, we get an $\ell_1$-bound of the desired scaling $\sqrt{n\log(n)}$. Proposition 4.4 is based on the fact that the signal matrix $F$ is well approximated by a circulant and circular-R matrix, which benefits from nice spectral properties; see Appendix D.4. This type of R-matrices was already studied in [33] to derive some error bounds on the reconstruction of positions (see Proposition D in [33]). Here, Proposition 4.4 extends their result by providing some explicit bounds in the stochastic setting, and also by considering more general signals $F$, which are not assumed to be exact circulant and circular R-matrices.

Minimax lower bound
In this section, we prove that the $\sqrt{\log(n)/n}$ rate in Theorem 3.1 is minimax optimal. Let us consider the observation model $A = F + E$, where we assume that the entries $\{A_{ij} : i < j\}$ follow independent Bernoulli distributions with parameters $f(x^*_i, x^*_j)$. We focus on this specific case of sub-Gaussian distributions in the lower bound, as we have in mind random graph applications. We emphasize that the same lower bound holds for Gaussian noise.
To prove the lower bound, we consider the simpler setting where $f_0$ is known to the statistician, and is an affine function of $d$: $f_0(x,y) = \tfrac{1}{2} - d(x,y)/(4\pi)$. This function $f_0$ corresponds to a geometric latent model as discussed in the introduction, and it satisfies the bi-Lipschitz assumption (5)-(6) for $c_e = 0$ and $c_l = c_L = (4\pi)^{-1}$. In this simple scenario, the latent positions are identifiable up to the orthogonal transformations in $\mathbb O$, so we derive a lower bound in terms of the quasi-metric $\min_{Q\in\mathbb O} d_\infty(\hat x, Qx^*)$.

Theorem 5.1. There exist two positive constants $C, C'$ such that, for any $n \ge C'$, we have the lower bound
$$\inf_{\hat x}\ \sup_{x^*\in\Pi_n}\ \mathbb E_{(x^*, f_0)}\Big[\min_{Q\in\mathbb O} d_\infty(\hat x, Qx^*)\Big] \ge C\sqrt{\log(n)/n},$$
where the infimum holds over all $\sigma(A)$-measurable functions $\hat x$.
The proof of Theorem 5.1 is given in Appendix C.3. The lower bound is taken over the collection of $n$-tuples $x^* \in \Pi_n$, which is a subclass of the class considered in our upper bounds (since all $x^* \in \Pi_n$ satisfy the condition (12) for any $c_a \ge 0$). The lower bound matches the upper bound in Theorem 3.2 up to some multiplicative constants. Therefore, it implies the optimality of the $\sqrt{\log(n)/n}$ estimation rate of our estimator (in the minimax sense). The fact that the lower bound holds even for a known affinity function entails that the rate $\sqrt{\log(n)/n}$ is not driven by the (absence of) knowledge of the affinity function in our setting. Moreover, since the affine function $f_0$ satisfies the bi-Lipschitz assumption (5)-(6) for $c_e = 0$, i.e. $f_0 \in \mathrm{BL}[(4\pi)^{-1}, (4\pi)^{-1}, 0]$, this entails that the rate $\sqrt{\log(n)/n}$ is not due to the slack $\varepsilon_n = c_e\sqrt{\log(n)/n}$ in the bi-Lipschitz assumption. In fact, we precisely allow this slack $c_e\sqrt{\log(n)/n}$ in (5)-(6) because this generalization does not worsen the estimation rate compared to pure bi-Lipschitz functions ($c_e = 0$). Finally, since the set of $n$-tuples $x^* \in \Pi_n$ is in correspondence with the set of permutations $\sigma^*$ of $[n]$, Theorem 5.1 ensures that the bound (14) is rate-optimal for the bi-Lipschitz seriation problem.

Optimal rate
In Figure 1, we plot the ratio of the maximum error $d_\infty$ of the Localize-and-Refine algorithm to the optimal rate $v_{\mathrm{opt}} = \sqrt{\log(n)/n}$. For each sample size $n = 100, 200, 300, 400$, a dot represents the average of 50 ratios $r_1, \dots, r_{50}$ obtained on 50 independent data sets. Each data set has been generated as in the model (3), with the three following specifications. The latent points $x^*_1, \dots, x^*_n$ are sampled independently and uniformly on $\mathcal C$. The affinity function is the affine geometric function $f(x,y) = 1 - d(x,y)/(2\pi)$. The entries $E_{ij}$, $1 \le i < j \le n$, of the noise matrix are independent Gaussian random variables, with a standard deviation either equal to 0.1 (green curve) or to 0.5 (red curve). One can observe in Figure 1 that the (averaged) ratio for $\sigma = 0.1$ is (approximately) constant and equal to 1, while for $\sigma = 0.5$ it decreases from 7 to 5. In other words, the maximum error of the Localize-and-Refine algorithm follows a $\sqrt{\log(n)/n}$ rate, up to a multiplicative constant $C \in [\tfrac12, 8]$, for sample sizes $n \ge 100$. This corroborates the conclusion of our theoretical findings (upper bound of Corollary 3.3 and lower bound of Theorem 5.1) that the Localize-and-Refine algorithm achieves the optimal $\sqrt{\log(n)/n}$ rate up to a multiplicative constant $C$ that is bounded away from zero and bounded from above. An interesting question (for future research) would be to understand the dependence of $C$ on the problem parameters. Figure 1 indeed shows that $C$ behaves differently when $\sigma = 0.1$ or $\sigma = 0.5$, and that $C$ varies with $n$.

Usefulness of data splitting? Of the refined estimation step 2?
In this section, we investigate two questions relative to the empirical performance of the Localize-and-Refine algorithm with the initial localization $\hat x^{(1)}_S$ given by the Spectral Localization algorithm: (i) is the data splitting useful in practice? (ii) does the refinement step improve the localization? In each of Figures 2 and 3, we compare two algorithms, presenting boxplots of their localization errors in $d_\infty$-distance. Each boxplot represents the distribution of 50 errors made on 50 samplings of the data matrix $A$. Each data matrix $A$ is generated as in the model (3), with the three following specifications. The latent points $x^*_1, \dots, x^*_n$ are sampled independently and uniformly on $\mathcal C$. The affinity function is the affine geometric function $f(x,y) = 1 - d(x,y)/(2\pi)$. In the noise matrix, the entries $E_{ij}$, $1 \le i < j \le n$, are independent Gaussian random variables, with a standard deviation either equal to 0.1 (top line) or 0.5 (bottom line). The same protocol is used in Figure 4, except that we measure the localization error in the $d_1$-distance, scaled by $1/n$.

Question (i):
We use a data splitting scheme in the Localize-and-Refine algorithm in order to ensure independence between the data used in the two steps. This independence was convenient to prove theoretical guarantees (such as Theorem 4.1). Yet, data splitting makes the initial localization run on an $(n/4) \times (n/4)$ data matrix, instead of the whole $n \times n$ matrix, which is expected to enlarge the variance of this initial localization by a factor 4. So, one can wonder whether the splitting is necessary and useful in practice. To answer this question, we illustrate in Figure 2 the difference between the performances of the Localize-and-Refine algorithm and the homologous procedure without data splitting (the former is plotted in red, the latter in blue). One can observe that the $d_\infty$-localization error is much smaller for the procedure without splitting. A plausible explanation for the good performance without data splitting is that the statistical dependence between the steps 1 and 2 of the algorithm is negligible for large $n$, rendering the data splitting useless. Indeed, in the no-splitting version of the algorithm, step 1 uses $O(n^2)$ observations to release a first localization $\hat x^{(1)}$ of the $n$ positions; then step 2 refines the estimation of a position $x_i$ using $n$ observations, which only represent a fraction $O(1/n)$ of the observations used in step 1. Hence, the dependence between $\hat x^{(1)}$ in step 1 and the $n$ observations in step 2 could be sufficiently small not to require a data splitting. Accordingly, we recommend the version of the Localize-and-Refine algorithm without data splitting for practical use. As a future direction of research, it would be interesting to investigate the theoretical performance of the algorithm without data splitting, in order to bridge the gap between the theory and the practice.

Question (ii):
The strategy of the Localize-and-Refine algorithm is to get an initial localization with a controlled $d_1$-error and then to refine the localization in order to ensure a control in the $d_\infty$-metric. A natural question is whether the refinement step 2 improves the initial localization obtained by the Spectral algorithm in step 1. We investigate this question numerically in Figure 3, by comparing the $d_\infty$-error of the Spectral Localization procedure (plotted in red) and of the Localize-and-Refine algorithm without data splitting (in blue). One can observe contrasting results, depending on the standard deviation of the noise. When the standard deviation is 0.5 (bottom line), the second step offers no significant improvement in the $d_\infty$-localization error. Conversely, when the standard deviation is 0.1 (top line), the $d_\infty$-error is significantly improved by the refinement step. This suggests that, to be useful, the refinement step requires a precise enough initial localization. We complement Figure 3 with Figure 4, which displays the errors in $d_1$-distance (scaled by $1/n$ for a better comparison) instead of the $d_\infty$-distance, though this loss function is not our main concern in this paper. In Figure 4, we observe a behavior in the $(d_1/n)$-metric very similar to the behavior in the $d_\infty$-metric displayed in Figure 3. In light of the numerical performance of the Spectral Localization in Figure 3, an interesting open question is whether we can prove theoretical guarantees on the $d_\infty$-localization error of this procedure.

Discussion
Relying on observations of pairwise affinities in a latent space model, we studied the problem of uniformly localizing positions $x^* = (x^*_1, \dots, x^*_n)$ on the unit sphere $\mathcal C \subset \mathbb R^2$. Under bi-Lipschitz assumptions on the affinity function, we established the rate $\sqrt{\log(n)/n}$ for the uniform localization of balanced $n$-tuples $x^* \in \Pi_n$. We also proved that non-trivial estimation is still possible when the latent points do not form a balanced $n$-tuple ($x^* \notin \Pi_n$), at the price of an additional bias $d_\infty(x^*, \Pi_n)$. This bias remains small compared to the $\sqrt{\log(n)/n}$ rate when the points have been sampled uniformly at random on $\mathcal C$.
We also analyzed a spectral embedding alternative in Section 4, which benefits from a polynomial-time complexity. When the function $f$ is geometric and when the associated Fourier coefficients are suitably separated, this spectral method achieves the optimal rate $\sqrt{\log(n)/n}$ for uniform localization. Yet, the spectral embedding takes advantage of the Toeplitz R-matrix structure. Since this structure disappears in the general case of bi-Lipschitz functions, there is no apparent reason for the spectral algorithm to work over the whole class of bi-Lipschitz functions.
As our non-polynomial-time algorithm is based on an instance of the Quadratic Assignment Problem, which is known to be NP-hard and even hard to approximate, the existence of polynomial-time algorithms achieving the $\sqrt{\log(n)/n}$ rate over the whole class of bi-Lipschitz functions remains an open question.
The latent positions are not identifiable when $f$ is unknown, and our main hypothesis is that there exists a representation $(x, f)$ with $x$ close to $\Pi_n$ and $f$ bi-Lipschitz. We use as reference the regular distribution $\Pi_n$, since regular and uniform distributions are the ones that appear in classical models like graphons, $f$-random graphs, or statistical seriation. Our algorithms build on this hypothesis, and consequently the bias $\min_{(x,f)\in\mathcal R(c_l, c_L, c_e)} d_\infty(x, \Pi_n)$ appears in our bounds, where the minimum is over the set $\mathcal R(c_l, c_L, c_e)$ of bi-Lipschitz representatives $(x, f)$. This minimum leaves room to handle situations where the latent positions do not match the regular grid $\Pi_n$ but are only more or less evenly spread. For instance, the minimal bias is zero for some representations $(x, f)$ with $x$ as far apart from $\Pi_n$ as $d_\infty(x, \Pi_n) \ge \pi/8$ (Proposition 2.2). Yet, there are many practical situations, where the affinity matrix is clustered, that we cannot handle. In the case where the affinity matrix is clustered ($f$ bi-Lipschitz, but the $x_i$ clustered), the problem becomes a clustering problem rather than a seriation problem, and our algorithms are not suited for clustering data. The question of handling simultaneously clustering and seriation is very interesting, but it is beyond the scope of this paper.
In this manuscript, we focused our attention on symmetric pairwise affinity functions $f$. However, other one-dimensional localization models, such as the Bradley-Terry model or more general ranking problems, do not satisfy the symmetry assumption. Still, we hope that our general two-step approach can leverage other structural assumptions. In ranking, a natural counterpart of our model (3) is the so-called SST model introduced by [36], which is defined as follows. We observe $A_{ij} = f(x^*_i, x^*_j) + E_{ij}$, where $f : [0,1]^2 \to [0,1]$ is non-decreasing with respect to $x$, non-increasing with respect to $y$, and satisfies the skew-symmetry assumption, that is $f(x,y) = 1 - f(y,x)$. In this setting, $f(x^*_i, x^*_j)$ stands for the probability that player $i$ wins a game against player $j$. Note that the latent space is now $[0,1]$ and not the torus $\mathcal C$ anymore. Although our methodology does not apply verbatim, we could adapt the Localize-and-Refine procedure to the latent space $[0,1]$. To exploit the bi-isotonic and skew-symmetric assumptions, the refinement estimator (16) could for instance be replaced by a counterpart defined on the regular grid $G_{n_0} = \{\tfrac{1}{n_0}, \tfrac{2}{n_0}, \dots, 1\}$, taking as input a suitable first-step estimator $\hat x^{(1)}_S$. We expect that, with a suitable initialization $\hat x^{(1)}_S$ and under bi-Lipschitz assumptions, the resulting procedure achieves near-optimal localization rates. This is an interesting direction for future research.

Appendix A: Some intuition on our analysis
To get some intuition on the rationale behind our analysis, we single out the next lemma, which is a cornerstone of the analysis, at least in the simplified situation where $c_e = 0$ and where the latent positions belong to $\Pi_n$. Then, we discuss some of its consequences in simplified versions of our problem.
Writing $d_{\sigma_j} = D_{\sigma_j} - D_{\sigma_{j-1}}$ and rearranging the sums, via an Abel transformation in the penultimate step, yields the claimed inequality. The proof of Lemma A.1 is complete. □

Let us discuss some immediate consequences of the above lemma for our problem. Consider the case where the entries $A_{ij}$ of the matrix $A$ decrease with $d(x^*_i, x^*_j)$ for some $x^* \in \Pi_n$. For a fixed $i$, let $\tau$ be a permutation of $[n]$ such that $(d(x^*_i, x^*_{\tau(j)}))_{j=1,\ldots,n}$ is ranked in increasing order. Set $a_j = A_{i\tau(j)}$ and $d_j = d(x^*_i, x^*_{\tau(j)})$. Since the entries $A_{ij}$ decrease with $d(x^*_i, x^*_j)$, the sequences $(a_j)_{j=1,\ldots,n}$ and $(d_j)_{j=1,\ldots,n}$ fulfill the conditions of Lemma A.1 with $\alpha = \varepsilon = 0$. Pick $k \in [n]$ and denote by $\sigma_k$ the permutation of $[n]$ such that $d(x^*_k, x^*_{\tau(j)}) = d(x^*_i, x^*_{\tau(\sigma_k(j))}) = d_{\sigma_k(j)}$; this is possible because $x^* \in \Pi_n$. We notice that $\sigma_i = \mathrm{Id}$. Then, Lemma A.1 ensures that $\sum_j A_{ij}\, d(x^*_k, x^*_j) \geq \sum_j A_{ij}\, d(x^*_i, x^*_j)$ for every $k$, so that the criterion $z \mapsto \sum_j A_{ij}\, d(z, x^*_j)$ is minimized at $z = x^*_i$. This justifies that the criterion underlying the refined estimator (16) is able to recover the true latent positions, at least in an idealized setting where the observations are noiseless, the entries $A_{ij}$ are decreasing with $d(x^*_i, x^*_j)$, and the true latent positions $x^*_j$ are plugged into (16) instead of the initial estimator $\hat{x}^{(1)}$.
When, in addition, we have a lower Lipschitz condition, then, applying Lemma A.1, we can lower bound the difference between the criterion evaluated at $z$ and at $x^*_i$. In particular, we observe that, for all $z \in \mathcal{C}_n$, the sum $\sum_j A_{ij}\, d(z, x^*_j)$ locally increases when $z$ moves away from $x^*_i$. In the general case, where $c_e > 0$ and the observations are noisy, the criterion does not satisfy a simple local quadratic lower bound, and we need to rely on finer arguments than Lemma A.1; see e.g. the proofs of Lemmas B.3 and B.16.
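The following small numerical check (a toy model of our own, not taken from the paper) illustrates the local-increase property: in a noiseless setting with decreasing affinities and regular latent positions, the criterion $z \mapsto \sum_j A_{ij}\, d(z, x^*_j)$ is minimized at $z = x^*_i$.

import numpy as np

def geo(a, b):
    # Geodesic distance on the torus R/(2 pi).
    delta = np.abs(a - b) % (2 * np.pi)
    return np.minimum(delta, 2 * np.pi - delta)

n = 64
x = 2 * np.pi * np.arange(n) / n              # latent positions on the regular grid
A = 1 - geo(x[:, None], x[None, :]) / np.pi   # affinities decrease with distance

i = 17
criterion = np.array([np.sum(A[i] * geo(z, x)) for z in x])
assert criterion.argmin() == i                # the true position minimizes the criterion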
Finally, we sketch the proof of the second result of Proposition 2.3 in the specific case where $c_e = 0$. Consider any two representations $(x, f)$ and $(x', f')$. Since both $x$ and $x'$ belong to $\Pi_n$, for any fixed $i$ the vectors $(d(x_i, x_j))_j$ and $(d(x'_i, x'_j))_j$ are equal up to a permutation of the entries. As $c_e = 0$, the lower Lipschitz condition (6) ensures that $(F_{i,j})_{j=1,\ldots,n}$ is decreasing both with respect to $d(x_i, x_j)$ and with respect to $d(x'_i, x'_j)$. We now show that this implies $x = Qx'$ for some $Q \in \mathcal{O}$. Denote by $\sigma$ the permutation of $[n]$ such that $\sigma(1) = 1$ and the arguments $x_j$ satisfy $x_{\sigma(i+1)} = x_{\sigma(i)} + 2\pi/n$. As a consequence, we have $d(x'_{\sigma(i)}, x'_{\sigma(i+1)}) = d(x_{\sigma(i)}, x_{\sigma(i+1)}) = 2\pi/n$. This implies that either $x'_{\sigma(i+1)} = x'_{\sigma(i)} + 2\pi/n$ for all $i$, or $x'_{\sigma(i+1)} = x'_{\sigma(i)} - 2\pi/n$ for all $i$. In the former case, one easily sees that $x = Qx'$, where $Q$ is the rotation satisfying $x_1 = Qx'_1$, whereas in the latter case, $x = Qx'$, where $Q$ is the reflection satisfying $x_1 = Qx'_1$.

Appendix B: Proofs of main results
Recall that $n = 4n_0$. Given an orthogonal transformation $Q \in \mathcal{O}$, we define the $d_1$-loss relative to $Q$ as the sum of the pointwise distances $d(\hat{x}_i, Qx^*_i)$. Before proving Proposition 3.5, we study, as a warm-up, the simpler situation where all the latent positions $x^*_i$ are elements of the regular grid $\mathcal{C}_{n_0}$ and where the vector $x^*_S$ (composed of $n_0$ coordinates of $x^*$) belongs to $\Pi_{n_0}$. In this case, $d_\infty(x^*_S, \Pi_{n_0}) = 0$.

Lemma B.1. In addition to the assumptions listed in Proposition 3.5, assume that $x^* \in \mathcal{C}_{n_0}^n$ and $x^*_S \in \Pi_{n_0}$. Then, for any $Q \in \mathcal{O}$ and any $i \in \bar{S}$, the estimator (16) satisfies, with probability at least $1 - 1/n^3$, the bound of Proposition 3.5 for the single index $i$.
Taking a union bound over the $n - n_0$ indices $i \in \bar{S}$, we obtain the bound uniformly over these indices, with probability higher than $1 - 1/n^2$. This is exactly the conclusion of Proposition 3.5 in the special case where $x^* \in \mathcal{C}_{n_0}^n$ and $x^*_S \in \Pi_{n_0}$. The proof of Proposition 3.5 for general $x^*$ follows the same scheme as that of Lemma B.1, but requires some slight refinements. We first prove Lemma B.1 before turning to the general case.

B.1. Proof of Lemma B.1
First, we claim that it suffices to restrict our attention to transformations $Q \in \mathcal{O}$ that leave $\mathcal{C}_{n_0}$ invariant. Indeed, for a general $Q$, there exists an orthogonal transformation $Q'$, leaving $\mathcal{C}_{n_0}$ invariant, such that $\max_{z \in \mathcal{C}_{n_0}} d(Qz, Q'z) \lesssim 1/n_0$. Replacing $Q$ by $Q'$ in the statement of Lemma B.1 only entails an additional term of order $1/n_0$, which is negligible compared to the term $\sqrt{\log(n)/n}$.
Let $i \in \bar{S}$. In the next two lemmas, we bound the quantity $L_i$ from above and from below. We recall that $F_{i,S}$ is the vector $(f(x^*_i, x^*_j))_{j \in S}$.

Lemma B.2. With probability at least $1 - 1/n^3$, the upper bound on $L_i$ holds for some constant $C_{L,e} > 0$.
Lemma B.3 provides the matching lower bound on $L_i$, valid for some numerical constant $C > 0$ and for all $n$ larger than a quantity $C_{l,e}$.
These two lemmas imply that, for $n$ large enough, the claimed bound on $d(\hat{x}^{(2)}_i, Qx^*_i)$ follows.

Proof of Lemma B.2. Since $|S| = n_0$, we can assume that $S = [n_0]$ for ease of exposition. Let $i \in \bar{S}$. First, we decompose $L_i$ into a main term and a noise term. The regular grid $\mathcal{C}_{n_0}$ is invariant by $Q$, and $x^*_S$ belongs to $\Pi_{n_0}$; besides, $\hat{x}^{(1)}_S$ belongs to $\Pi_{n_0}$ as well. Hence, we can reorder the sum in $L_i$ using a suitable permutation of $S$. To alleviate the notation, we write $\hat{z}_i := Q^{-1}\hat{x}^{(2)}_i$. Gathering Lemma B.4 with the definition of $L_i$ leads us to the decomposition (29). The orthogonal transformation $Q$ preserves the distances, hence the last term of (29) equals a noise inner product. To handle this term, we come back to the definition (16) of $\hat{x}^{(2)}_i$, which yields a control in terms of the inner product $\langle E_{i,S}, D(\hat{z}_i, \hat{x}^{(1)}_S)\rangle$.

This inner product depends on $\hat{x}^{(2)}_i$, which belongs to $\mathcal{C}_{n_0}$. This is why we simultaneously control the expression $\langle E_{i,S}, D(z, \hat{x}^{(1)}_S)\rangle$ for all $z \in \mathcal{C}_{n_0}$. For each $z$, this expression is distributed as a mean-zero sub-Gaussian random variable with norm at most $C\|D(z, \hat{x}^{(1)}_S)\|_2$. Applying a union bound over all $z \in \mathcal{C}_{n_0}$ controls these inner products uniformly by a multiple of $\sqrt{\log(n)}\,\|D(z, \hat{x}^{(1)}_S)\|_2$, with probability higher than $1 - 1/n^3$. Invoking the triangular inequality for the distance $d$, we then bound $\|D(Qx^*_i, \hat{x}^{(1)}_S)\|_2$. It follows that, with probability at least $1 - 1/n^3$, the desired upper bound holds.
By the triangular inequality, we also have $|d(\hat{z}_i, z) - d(\hat{z}_i, z')| \leq d(z, z')$, which allows us to replace $\hat{x}^{(1)}_S$ by $x^*_S(Q)$ at the price of an additive $d_1(\hat{x}^{(1)}_S, x^*_S(Q))$ term, on top of the $c_e\sqrt{n\log(n)}$ term. This concludes the proof of Lemma B.2. □

Proof of Lemma B.3. Since $|S| = n_0$, we can assume that $S = [n_0]$ for ease of exposition. Let $i \in \bar{S}$ and denote $\hat{z}_i := Q^{-1}\hat{x}^{(2)}_i$. Since $d(x^*_i, \hat{z}_i) \leq \pi$, we can assume without loss of generality that the arguments satisfy $x^*_i = 0$ and $\hat{z}_i \in [0, \pi]$, so that $d(x^*_i, \hat{z}_i) = \hat{z}_i$. If $\hat{z}_i = 0$, Lemma B.3 is trivial; we therefore assume in the following that $\hat{z}_i \in (0, \pi]$. Below, we introduce a partition of $[n_0]$ according to the relative positions of $x^*_j$, $x^*_i$ and $\hat{z}_i$; this partition is depicted in Figure 5. Although $I_s$ stands for a subset of indices, with a slight abuse of notation we still write $|I_s|$ for the length of the corresponding interval in $\mathbb{R}/(2\pi)$; for instance, $|I_1| := |x^*_i - \hat{z}_i|$. We decompose $L_i$ according to this partition of indices, $L_i = \sum_s L_i^{(s)}$, where $L_i^{(s)}$ is the restriction of $L_i$ to the set $I_s$; in particular, if $\hat{z}_i = \pi$, then the intervals $I_2$ and $I_4$ are empty. Next, we heavily rely on the fact that the elements of $x^*_S$ are evenly spaced on the sphere, that is, $\{x^*_i : i \in [n_0]\} = \mathcal{C}_{n_0}$, which holds true since we have assumed $x^*_S \in \Pi_{n_0}$. Using the symmetry of the set $\mathcal{C}_{n_0}$, we establish below (Lemma B.5) a lower bound for the sum $L_i^{(2)} + L_i^{(4)}$. As for $L_i^{(1)}$ (resp. $L_i^{(3)}$), we rely on the symmetry of $I_1$ (resp. $I_3$) around the point of $\mathcal{C}$ whose argument is $(x^*_i + \hat{z}_i)/2$ (resp. $(x^*_i + \hat{z}_i)/2 + \pi$).

Lemma B.6. For some numerical constant $C > 0$, the corresponding lower bound on $L_i^{(1)}$ holds.

Since $|I_1| + |I_4| = \pi$ by definition, gathering Lemmas B.5 and B.6 yields the desired bound on $L_i$. □

Proof of Lemma B.5. In Figure 5, we can see how the difference of the distances to $x^*_i$ and to $\hat{z}_i$ behaves on $I_2$ and $I_4$. Let $\phi$ denote the reflection with respect to the line going through the two points of $\mathcal{C}$ of arguments
$$a = \frac{x^*_i + \hat{z}_i}{2} \qquad \text{and} \qquad b = \frac{(x^*_i + \pi) + (\hat{z}_i + \pi)}{2}.$$
As can be checked in Figure 5, for any $l \in I_2$, we have $\phi(x^*_j) = x^*_l$ for some $j \in I_4$. Hence, the terms of $L_i^{(2)}$ and $L_i^{(4)}$ can be paired through $\phi$. To lower bound the difference within each pair, we invoke the bi-Lipschitz condition (6), using that $x^*_j$ is closer to $x^*_i$ than $\phi(x^*_j)$; see again Figure 5. Moreover, since $x^*_S$ is evenly spaced, the number of indices $j$ in $I_4$ is larger than $n_0|I_4|/(2\pi)$. This leads to a lower bound on $L_i^{(2)} + L_i^{(4)}$ proportional to $d(\hat{x}^{(2)}_i, Qx^*_i)$, which concludes the proof. □
Proof of Lemma B.6. From Figure 5, we see how the difference of distances behaves for $j \in I_1$. For $\alpha \in (0, 1)$, write $I_1^{(\alpha)}$ for the corresponding sub-interval of $I_1$. In particular, for all $j \in I_1^{(1/2)}$, the terms partially compensate with the terms for $j$ outside $I_1^{(1/2)}$, where $\phi$ is the reflection introduced in the proof of Lemma B.5. For any $j \in I_1^{(1/2)}$, it then follows from the bi-Lipschitz condition (6) that the paired differences are bounded from below. To control (31), we split the interval $I_1$ into two parts; Claims B.7 and B.8 below provide the two corresponding bounds. Gathering these two claims leads to a lower bound on $L_i^{(1)}$ proportional to $d(\hat{x}^{(2)}_i, Qx^*_i)$. By symmetry, the term $L_i^{(3)}$ obeys the same bound, which concludes the proof of Lemma B.6. □

Proof of Claim B.7. For simplicity, the superscript of $x^*$ is dropped in this proof, and $x^*$ is simply denoted by $x$. By definition of $\phi$, the paired points can be compared explicitly. Since the number of indices falling in an arc is proportional to its length, and since the length of the relevant arc is at most $c_l^{-1}\varepsilon_n$, the claim follows. □

Proof of Claim B.8. Again, for convenience, $x^*$ is denoted by $x$. Since all the terms in the sum are nonnegative, we can restrict the sum to a suitable subset of indices, for which the summands are bounded from below, for some numerical constant $C > 0$. Combining these estimates concludes the proof. □

B.2. Proof of Proposition 3.5
Let $x^{**}_S$ be a best approximation of $x^*_S$ in $\Pi_{n_0}$. As in the proof of Lemma B.1, we restrict our attention to orthogonal transformations $Q \in \mathcal{O}$ that leave $\mathcal{C}_{n_0}$ invariant. Fix $i$ in $\bar{S}$. To prove Proposition 3.5, it suffices to establish variants of Lemmas B.2 and B.3 with $\bar{L}_i$ instead of $L_i$, where in the definition of $\bar{L}_i$ the vector $x^*_S$ has been replaced by $x^{**}_S$.

Lemma B.9. With probability at least $1 - 1/n^3$, the corresponding upper bound on $\bar{L}_i$ holds.

Lemma B.10. For $n$ large enough, the corresponding lower bound on $\bar{L}_i$ holds.

These two lemmas enforce the desired estimate with probability higher than $1 - 1/n^3$. Indeed, assume that $d(\hat{x}^{(2)}_i, Qx^*_i) \geq C'_{l,L,e}\,[\,d_\infty(x^*_S, \Pi_{n_0}) + \sqrt{\log(n)/n}\,]$, where $C'_{l,L,e}$ is large enough. Then, Lemma B.10 implies that $\bar{L}_i \gtrsim_{c_l, c_L, c_e} n\, d^2(\hat{x}^{(2)}_i, Qx^*_i)$, which, together with Lemma B.9, forces $d(\hat{x}^{(2)}_i, Qx^*_i)$ to be small. In any case, we conclude that $d(\hat{x}^{(2)}_i, Qx^*_i) \lesssim d_\infty(x^*_S, \Pi_{n_0}) + \tfrac{1}{n}\, d_1(\hat{x}^{(1)}_S, x^*_S(Q)) + \sqrt{\log(n)/n}$, with probability higher than $1 - 1/n^3$. Taking the minimum over all $Q \in \mathcal{O}$ that leave $\mathcal{C}_{n_0}$ invariant and a union bound over all $i \in \bar{S}$ leads to Proposition 3.5.

B.2.1. Proof of Lemma B.9
To ease the exposition, we assume that $S = [n_0]$. Fix $i \in \bar{S}$. We start from the definition of $\bar{L}_i$. In order to come back to the setting of Lemma B.1, we replace $x^*_S$ by $x^{**}_S$, using the bi-Lipschitz condition (5). By the triangular inequality, $d(x^*_j, x^{**}_j) \leq d_\infty(x^*_S, \Pi_{n_0})$, and we define $r_n = n\, d_\infty(x^*_S, \Pi_{n_0}) + \sqrt{n\log(n)}$. Since $x^{**}_j$ now runs over $\Pi_{n_0}$, we can replace, as in the proof of Lemma B.2, the sum over the $x^{**}_j$ by a sum over the $Q^{-1}\hat{x}^{(1)}_j$ using a suitable permutation. The remainder of the proof follows the same lines as for Lemma B.2, up to small differences; still, we provide some details for the sake of completeness. As in that proof, we write $\hat{z}_i = Q^{-1}\hat{x}^{(2)}_i$. We first apply Lemma B.4 and simplify the resulting expression. In Lemma B.2, we had $Qx^*_i \in \mathcal{C}_{n_0}$, so that we could use the definition of $\hat{x}^{(2)}_i$ to control $\langle E_{i,S}, D(\cdot, \hat{x}^{(1)}_S)\rangle$. Unfortunately, $Qx^*_i$ does not necessarily belong to $\mathcal{C}_{n_0}$ anymore. To handle this minor issue, we replace $x^*_i$ by the closest element $y^*_i$ in $\mathcal{C}_{n_0}$; it satisfies $d(x^*_i, y^*_i) \leq 2\pi/n_0$ and $Qy^*_i \in \mathcal{C}_{n_0}$. Since $|d(Qy^*_i, x) - d(Qx^*_i, x)| \leq 2\pi/n_0$ for any $x$ and $|f(x, y)| \leq 1$, the resulting additional error term is controlled with probability higher than $1 - 1/(2n^3)$. Putting everything together, we obtain the claimed bound with probability higher than $1 - 1/(2n^3)$. To conclude, it suffices to control the right-hand side exactly as in the end of the proof of Lemma B.2, except that we now work with probability $1 - 1/(2n^3)$.

B.2.2. Proof of Lemma B.10
Fix $i \in \bar{S}$ and define $y^*_i \in \mathcal{C}_{n_0}$ as a closest point to $x^*_i$ in $\mathcal{C}_{n_0}$. We introduce a quantity $L'_i$ which has the same properties as the $L_i$ used in Lemma B.1, since each point involved in the expression of $L'_i$ is an element of $\mathcal{C}_{n_0}$ and the sum runs over a vector $x^{**}_S$ in $\Pi_{n_0}$. This allows us to invoke Lemma B.3, from the proof of Lemma B.1, to lower bound $L'_i$ for $n$ large enough. By definition of $y^*_i$, we know that $d(y^*_i, x^*_i) \leq 2\pi/n_0$. Hence, by the triangular inequality, $d(\hat{x}^{(2)}_i, Qy^*_i) \geq d(\hat{x}^{(2)}_i, Qx^*_i) - 2\pi/n_0$, and we derive a lower bound in terms of $d(\hat{x}^{(2)}_i, Qx^*_i)$. Next, we rely on the following lemma to replace $L'_i$ by $\bar{L}_i$.

Lemma B.11. The quantities $\bar{L}_i$ and $L'_i$ differ by at most an additive error controlled by $r_n$.

Gathering these two bounds completes the proof of Lemma B.10. □

Proof of Lemma B.11. From the definition of $y^*_i$ and the triangular inequality, we can compare the distances appearing in $\bar{L}_i$ and $L'_i$. This allows us to introduce an intermediate quantity $L''_i$ satisfying $|\bar{L}_i - L''_i| \leq 2\pi$. Besides, we deduce from the bi-Lipschitz condition (5) and the triangular inequality a control of $|L''_i - L'_i|$. All in all, combining these bounds, the result follows. □

B.3.1. Main Arguments
Assume that $S = [n_0]$ for ease of presentation. In this proof, we interpret $\hat{x}^{(1)}_S$ and $x^*_S$ both as vectors in $\mathcal{C}^{n_0}$ and as matrices of size $2 \times n_0$. We recall that $\|\cdot\|_q$ refers to the entry-wise $\ell_q$ norm for matrices. We shall establish that the estimator $\hat{x}^{(1)}_S$ is such that its matrix of pairwise distances is close to that of $x^*_S$; in other words, the distances between $\hat{x}^{(1)}_1, \ldots, \hat{x}^{(1)}_{n_0}$ are close to the respective distances between $x^*_1, \ldots, x^*_{n_0}$. Then, relying on a recent matrix perturbation result from [1], we deduce that, up to an orthogonal transformation, $\hat{x}^{(1)}_S$ and $x^*_S$ are close. Let us first state this perturbation result. Given any $p \times 2$ matrix $M$ with real coefficients, we denote its transpose by $M^T$, its Moore-Penrose pseudo-inverse by $M^\dagger$, and its operator norm by $\|M\|_{\mathrm{op}}$. In this proof, the transformations $Q \in \mathcal{O}$ are interpreted as orthogonal matrices of size $2 \times 2$.
In order to invoke the above proposition with $M = (\hat{x}^{(1)}_S)^T$ and $N = (x^{**}_S)^T$ in $\mathbb{R}^{n_0 \times 2}$, we need to check that the condition $2\nu\|N^\dagger\|^2_{\mathrm{op}} \leq 1$ is fulfilled. First, to bound the term $\|N^\dagger\|^2_{\mathrm{op}}$, we work out the singular values of $N$; see e.g. (76) for a proof. The following lemma (Lemma B.13) bounds $\nu$. Hence, with probability higher than $1 - 1/n^2$, we obtain the desired control. If $d_\infty(x^*_S, \Pi_{n_0}) + \sqrt{\log(n)/n} \geq 1/C'_{l,L,e}$, the conclusion of Proposition 3.6 obviously holds, since $\min_{Q \in \mathcal{O}} d_1(\hat{x}^{(1)}_S, Qx^*_S) \leq n_0 \leq n$. Hence, it suffices to consider the case where $d_\infty(x^*_S, \Pi_{n_0}) + \sqrt{\log(n)/n} \leq 1/C'_{l,L,e}$, so that the condition of Proposition B.12 is fulfilled. Since the distances in $\mathbb{R}^2$ and $\mathcal{C}$ are equivalent, we can pass from $\|Q\hat{x}^{(1)}_S - x^{**}_S\|_1$-type bounds to $d_1$-type bounds. Together with the triangular inequality, and using again the equivalence $d(x, y) \lesssim \|x - y\|_1$ for all $x, y \in \mathcal{C}$, we conclude, and the proof of Proposition 3.6 is complete. □
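The alignment step behind Proposition 3.6 can be illustrated by the classical orthogonal Procrustes computation (a standard SVD fact, shown here as our own sketch rather than the perturbation result of [1]): given two $2 \times n_0$ configurations with similar Gram structure, the minimizing orthogonal transformation has a closed form.

import numpy as np

def procrustes(X, Y):
    # Orthogonal 2x2 matrix Q minimizing ||X - Q @ Y||_F (classical SVD solution).
    U, _, Vt = np.linalg.svd(X @ Y.T)
    return U @ Vt

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 50)
Y = np.vstack([np.cos(theta), np.sin(theta)])   # a configuration on the circle
angle = 0.7
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
X = R @ Y + 0.01 * rng.normal(size=Y.shape)     # rotated, slightly perturbed copy
Q = procrustes(X, Y)
print(np.abs(Q - R).max())                      # small: Q recovers the rotation R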

B.3.2. Proof of Lemma B.13
Both $\hat{x}^{(1)}_S$ and $x^{**}_S$ are elements of $\Pi_{n_0} \subset \mathbb{R}^{2\times n_0}$, hence they both have centered columns: $\hat{x}^{(1)}_S\mathbf{1} = x^{**}_S\mathbf{1} = 0$.

Lemma B.14. For any $Z = (z_1, \ldots, z_{n_0})$ and $Z' = (z'_1, \ldots, z'_{n_0})$ with centered columns and squared-distance matrices $D$ and $D'$, the corresponding Gram matrices differ by at most $\|D - D'\|_2$ in Frobenius norm.

For $Z = x^{**}_S$ and $Z' = \hat{x}^{(1)}_S$, it follows from Lemma B.14 that $\nu \leq \|D - D'\|_2$. Since all squared distances $D_{ij}$ and $D'_{ij}$ are at most equal to $4$, we get a bound in terms of the entry-wise $\ell_1$ norm $\|D - D'\|_1$. For any $x, y \in \mathcal{C}$, elementary geometry gives $\|x - y\|_2 = 2\sin(d(x, y)/2)$. Since the sine function is $1$-Lipschitz, differences of Euclidean distances are controlled by differences of geodesic distances. As a consequence, we mainly have to control $\|D - D'\|_1$ with high probability.

Lemma B.15. With probability at least $1 - 1/n^2$, the entry-wise difference $\|D - D'\|_1$ is suitably controlled.

Hence $\nu \leq C_{l,L,e}\,[\,n\, d_\infty(x^*_S, \Pi_{n_0}) + \sqrt{n\log(n)}\,]$, and the proof of Lemma B.13 is complete. □

Proof of Lemma B.14. Let $H = I - J/n_0$, where $I$ is the identity and $J$ the matrix of ones. Since $Z\mathbf{1} = 0$, we have $ZH = Z$, so that the Gram matrix of $Z$ can be expressed through $-\frac{1}{2}HDH$, where $D$ is the matrix of squared distances associated with $Z$. The difference of the Gram matrices is then controlled by $\|D - D'\|_2$, where the last inequality derives from the general relation $\|AB\|_2 \leq \|A\|_{\mathrm{op}}\|B\|_2$ for any matrices $A, B$, and the fact that $\|H\|_{\mathrm{op}} = 1$, because $H$ is an orthogonal projection. Lemma B.14 is proved. □
Proof of Lemma B.15. A concentration argument first provides an upper bound on $\langle F_{S,S}, D(\hat{x}^{(1)}_S) - D(x^{**}_S)\rangle$, with probability at least $1 - 1/n^2$. Conversely, we shall lower bound $\langle F_{S,S}, D(\hat{x}^{(1)}_S) - D(x^{**}_S)\rangle$. Using the bi-Lipschitz property of the function $f$, and applying the Cauchy-Schwarz inequality on $\mathbb{R}^{n_0 \times n_0}$, we arrive at (36). The following result (Lemma B.16) bounds the remaining sum; this is a key step in our proof. Had the slack constant $c_e$ been equal to zero, it would have been a consequence of Lemma A.1. Here, the proof is slightly more involved and is provided below. We conclude from (36) and Lemma B.16 the desired control of $\|D(\hat{x}^{(1)}_S) - D(x^{**}_S)\|_1$, which in turn implies the claim. Lemma B.15 is proved. □
Proof of Lemma B.16. To alleviate the notation, we introduce the quantities $\gamma_i$, so that we aim at establishing a lower bound for each $\gamma_i$ and in turn for $\gamma = \sum_{i=1}^{n_0} \gamma_i$. To simplify the arguments, we only consider the case where $n_0$ is odd, the case of even $n_0$ being almost identical.
Both $x^{**}_S$ and $\hat{x}^{(1)}_S$ belong to $\Pi_{n_0}$, and we shall heavily rely on the symmetries of $\Pi_{n_0}$. Assume without loss of generality that $i = 1$ and $x^{**}_j = e^{\iota 2\pi(j-1)/n_0}$ for all $j = 1, \ldots, n_0$. Since $\hat{x}^{(1)}_S$ also belongs to $\Pi_{n_0}$, there exists a permutation $\sigma$ of $[n_0]$ such that $\sigma(1) = 1$ and the distances to the first coordinate match those of the regular grid. Recall that we consider the case where $n_0$ is odd; besides, we can focus on $n_0 \geq 3$, since Lemma B.16 is trivial for $n_0 = 1$. Thus, there exists a surjective map $\bar{\sigma} : [n_0 - 1] \to [\lfloor n_0/2 \rfloor]$ such that $|\bar{\sigma}^{-1}(\{z\})| = 2$ for any $z \in [\lfloor n_0/2 \rfloor]$, encoding the distances $\tfrac{2\pi}{n_0}\bar{\sigma}(j-1)$ for $j = 2, \ldots, n_0$. Finally, we write $\psi_j = f(1, e^{\iota 2\pi j/n_0})$ and $\psi'_j = f(1, e^{-\iota 2\pi j/n_0})$ for $j = 1, \ldots, \lfloor n_0/2 \rfloor$. Equipped with this notation, we can rewrite $\gamma_1$. Next, we denote $a_j = \bar{\sigma}(j) - j$ and $a'_j = \bar{\sigma}(n_0 - j) - j$ for $j = 1, \ldots, \lfloor n_0/2 \rfloor$. Obviously, $\sum_{j=1}^{\lfloor n_0/2 \rfloor}(a_j + a'_j) = 0$. More generally, one easily checks that, for any positive integer $s \leq \lfloor n_0/2 \rfloor$, the partial sum $\sum_{j=1}^{s}(a_j + a'_j)$ is nonnegative. We then partition the indices according to the signs of $a_j$ and $a'_j$, defining the sets $A_-$ and $A'_-$ of indices with negative $a_j$ (resp. $a'_j$). Intuitively, we want to group indices $j$ such that $a_j > 0$ with indices $k$ such that $a_k < 0$. This can be done by recursion. First, consider the smallest index $k \in A_- \cup A'_-$; by symmetry, suppose that $a_k < 0$. Since $\sum_{j=1}^{k}(a_j + a'_j) \geq 0$, the negative mass at $k$ can be matched with positive mass at smaller indices. Iterating the construction, we obtain a decomposition into coefficients $b_{j,k,t}$ where all the $b_{j,k,1}$, $b_{j,k,2}$, $b'_{j,k,1}$, and $b'_{j,k,2}$ are nonnegative and vanish for $k < j$. Since they are positive only when $k \geq j$, we can use the bi-Lipschitz condition (6) to lower bound $\psi_j - \psi_k$, and we obtain similarly the same lower bound for $\psi'_j - \psi_k$, $\psi_j - \psi'_k$, and $\psi'_j - \psi'_k$. Coming back to the expression of $\gamma_j$ and to the definition of the $b_{j,k,t}$ with $t = 1, 2$, working out the two resulting expressions by symmetry and by definition of $\sigma$ and $\bar{\sigma}$, and putting everything together, we obtain the claimed lower bound. Lemma B.16 is proved. □

B.4. Proof of Theorems 3.1 and 3.2
In this section, we prove Theorem 3.2. Theorem 3.1 then follows directly from this result.

B.4.1. Main arguments
Recall that $n = 4n_0$. For $n_0 \leq 3$, the bound of Theorem 3.2 is trivially true; assume henceforth that $n_0 \geq 4$. In Step 1 of the main procedure, it follows from Propositions 3.5 and 3.6 that the output based on $S$ satisfies the uniform bound (38) with probability higher than $1 - 2/n^2$. Similarly, the output based on $S'$ satisfies (39) with probability higher than $1 - 2/n^2$. In (38) and (39), we shall prove that the bias terms $d_\infty(x^*_S, \Pi_{n_0})$ and $d_\infty(x^*_{S'}, \Pi_{n_0})$ are of the same order as $d_\infty(x^*, \Pi_n)$, up to an additional error of the order of $\sqrt{\log(n)/n}$; see Lemma B.17 below.
Lemma B.17. Assume that $n_0 \geq 4$ and fix $x^* \in \mathcal{C}^n$. There exists an event of probability higher than $1 - 1/n^2$ on which the biases over $S$, $S'$, and $S \cap S'$ are controlled by $d_\infty(x^*, \Pi_n)$ up to an additive term of order $\sqrt{\log(n)/n}$.

Thus, by a union bound, the inequalities (40) and (41) hold together with probability at least $1 - 5/n^2$. Since the final estimator $\hat{x}$ concatenates the two partial estimates, realigned through $Q$, we deduce from (40) the desired bound on $\bar{S}$. To prove Theorem 3.2, it suffices to show the counterpart of this bound on $S$, namely (42). By the triangle inequality and the definition of $\hat{x}$, together with (41) and the inclusion $S \subset S'$, we obtain (43). In view of (42) and (43), it remains to prove (44), a bound on $\max_{y \in \mathcal{C}} d(Q_1 y, QQ_2 y)$. Before considering this maximum, we control the quantity $d_\infty(Q_1 x^*_{S\cap S'}, QQ_2 x^*_{S\cap S'})$, which will turn out to be instrumental. By the triangular inequality and the definition of $Q$, together with (40) and (41), we arrive at (45). Let us now come back to proving (44). Since the orthogonal group of the plane is only made of rotations and reflections, we consider two cases. Case 1: $Q_1^{-1}QQ_2$ is a rotation. Then $d(Q_1 y, QQ_2 y)$ does not depend on $y$; in particular, $\max_{y \in \mathcal{C}} d(Q_1 y, QQ_2 y) = d_\infty(Q_1 x^*_{S\cap S'}, QQ_2 x^*_{S\cap S'})$, and (44) is a consequence of (45). Case 2: $Q_1^{-1}QQ_2$ is a reflection. Then $\max_{y \in \mathcal{C}} d(Q_1 y, QQ_2 y) = \pi$.
In that case, (45) implies that the points of $x^*_{S\cap S'}$ belong to two arcs of length $\pi/4$ that are (individually) symmetric around the axis of the reflection $Q_1^{-1}QQ_2$. It follows that $d_\infty(x^*_{S\cap S'}, \Pi_{2n_0}) \geq \pi/8$ as soon as $2n_0 \geq 4$, that is, $n \geq 8$. Indeed, if $d_\infty(x^*_{S\cap S'}, \Pi_{2n_0}) < \pi/8$ and $2n_0 \geq 4$, then any point of $\mathcal{C}$ would be at distance less than $3\pi/8$ from $x^*_{S\cap S'}$, which is impossible because the points of $x^*_{S\cap S'}$ belong to these two arcs of length $\pi/4$. Since Lemma B.17 ensures that $d_\infty(x^*_{S\cap S'}, \Pi_{2n_0})$ is of the same order as $d_\infty(x^*, \Pi_n)$, the latter is of the order of a constant, and (44) is obviously valid.

B.4.2. Proof of Lemma B.17
We claim that it suffices to restrict our attention to the case where $x^* = (x^*_1, \ldots, x^*_n)$ consists of $n$ distinct points. Indeed, for general points $x^*_1, \ldots, x^*_n$ in $\mathcal{C}$, there exist points $y_1, \ldots, y_n$ that are all distinct and satisfy $d(y_j, x^*_j) \leq 1/n$ for all $j \in [n]$. Replacing $x^*_1, \ldots, x^*_n$ by $y_1, \ldots, y_n$ in the statement of Lemma B.17 only entails an additional term $1/n$, which is negligible compared to the term $\sqrt{\log(n)/n}$. For any $k \in [n]$ and any vector $x \in \mathcal{C}^k$, we introduce a new quantity that is equivalent to $d_\infty(x, \Pi_k)$ but easier to handle. For any interval $I \subset \mathbb{R}/(2\pi)$, we write $N_I(x)$ for the number of coordinates of $x$ that lie in the interval $I$, i.e. the number of $i \in [k]$ such that $x_i \in I$. We then define the quantity $V_I(x)$ as
$$V_I(x) := N_I(x) - \frac{k|I|}{2\pi}.$$
Remark that, for a uniform $k$-sample of $\mathcal{C}$, the fraction $k|I|/(2\pi)$ would be the expected number of points in $I$. The next lemma shows that the supremum $\sup_I V_I(x)$ is equivalent to $k\, d_\infty(x, \Pi_k)$. We denote by $\mathcal{I}$ the set of all closed intervals $I \subset \mathbb{R}/(2\pi)$.
Lemma B.18. For any integer $k \in [n]$ and any vector $x = (x_1, \ldots, x_k)$ of $k$ distinct points of $\mathcal{C}$, the supremum $\sup_{I \in \mathcal{I}} |V_I(x)|$ and the quantity $k\, d_\infty(x, \Pi_k)$ are equivalent up to numerical constants.

Thus, to prove Lemma B.17, it is enough to show that, for $T = S$, $S'$, and $S \cap S'$, the supremum $\sup_{I \in \mathcal{I}} |V_I(x^*_T)|$ is suitably small with high probability. The next lemma states a uniform concentration bound for $V_I$.
Lemma B.19. Consider any integer $n > 4$ and any integer $k < n$. Fix any $x \in \mathcal{C}^n$. Sampling uniformly at random $k$ coordinates of $x$ without replacement, write $x^{(k)} \in \mathcal{C}^k$ for the resulting vector. Then, with probability higher than $1 - 1/n^3$, the supremum $\sup_{I \in \mathcal{I}} |V_I(x^{(k)})|$ is controlled.

Since the marginal distributions of $S$, $S'$, and $S \cap S'$ are uniform, we can apply Lemma B.19 to $x^*_S$, $x^*_{S'}$, and $x^*_{S\cap S'}$, and the conclusion of the lemma holds with probability higher than $1 - 3/n^3$, which is higher than $1 - 1/n^2$. □

Proof of Lemma B.19. We start with a fixed interval $I \in \mathcal{I}$. Since $N_I(x^{(k)})$ is a hypergeometric random variable with parameters $(k, \frac{N_I(x)}{n}, n)$, we can invoke Hoeffding's inequality (82) for hypergeometric distributions and get the concentration bound (49). We combine (49) with the deterministic relation between $V_I(x^{(k)})$ and $V_I(x)$ to conclude (50). In order to extend (50) to all intervals $I \in \mathcal{I}$, we use an $\varepsilon$-net approach with a subcollection $\mathcal{I}_n(x)$ of $\mathcal{I}$. Let $\mathcal{I}_n(x)$ be the collection of all intervals $I_n = [a_n, b_n]$ where $a_n, b_n \in \{x_1, \ldots, x_n\} \cup \mathcal{C}_n$, i.e. $a_n, b_n$ are either coordinates of $x$ or elements of the $n$-regular grid $\{2\pi i/n;\ i \in [n]\}$. We then apply (50) together with a union bound over all intervals $I \in \mathcal{I}_n(x)$. Since $|\mathcal{I}_n(x)| \leq (2n)^2 \leq n^3$, we obtain the uniform bound (51) over $\mathcal{I}_n(x)$, with probability higher than $1 - 1/n^3$. To obtain (51) for all $I \in \mathcal{I}$, we observe that, for any $I \in \mathcal{I}$, there exists $I_n \in \mathcal{I}_n(x)$ such that $I$ and $I_n$ differ by two closed intervals $I^{(l)}$ and $I^{(r)}$ of $I \setminus I_n$, whose lengths are smaller than $2\pi/n$ and which satisfy $N_{I^{(l)}}(x) = N_{I^{(r)}}(x) = 0$. In particular, $N_{I^{(l)}}(x^{(k)}) = N_{I^{(r)}}(x^{(k)}) = 0$. Since the same decomposition holds for $V_I(x)$, the error of approximating $I$ by $I_n$ is at most of order one. Together with (51), we obtain the claimed bound for all $I \in \mathcal{I}$, with probability higher than $1 - 1/n^3$. Lemma B.19 is proved. □

Proof of Lemma B.18. We first prove the upper bound. Recall that, for a vector $x = (x_1, x_2, \ldots, x_k) \in \mathcal{C}^k$, we say that $x$ is ordered if these points are consecutive when one walks on the sphere in the trigonometric direction. Without loss of generality and for ease of exposition, we assume that the identity permutation is a latent order, that is, $x_1, \ldots, x_k$ is ordered.
We define $x^{**}_S = (x^{**}_1, \ldots, x^{**}_k)$, a vector of $\Pi_k$, as follows. The first point $x^{**}_1 \in \mathcal{C}_k$ is a closest point to $x_1$ with respect to $d$, and the other points $x^{**}_{j+1}$ are the elements of $\mathcal{C}_k$ obtained by moving forward along the grid. Fix any $i \in \{2, \ldots, k\}$ and consider the intervals $I_i = [x_1, x_i]$ and $I'_i = [x^{**}_1, x^{**}_i]$. Observe that $N_{I_i}(x) = i$, since $x_1, \ldots, x_k$ are ordered and all distinct. Hence, $k|I_i|$ is within $2\pi\sup_{I \in \mathcal{I}}|V_I|$ of $2\pi i$. Besides, by construction of the $x^{**}_j$'s, we have $d(x_1, x^{**}_1) \leq 2\pi/k$ and $|[x^{**}_1, x^{**}_i]| = 2\pi(i-1)/k$, so that $k\,\big||I'_i| - 2\pi i\big| \leq 4\pi$. We then deduce from (54) and the triangular inequality that $k\,\big||I_i| - |I'_i|\big| \leq 2\pi\sup_{I\in\mathcal{I}}|V_I| + 4\pi$. Coming back to (53), taking the supremum over all $i \in \{2, \ldots, k\}$, and noting that $d(x_1, x^{**}_1) \leq 2\pi/k$, leads us to the upper bound with $x^{**} \in \Pi_k$. Finally, we take the minimum over $\Pi_k$ to get the desired bound.
We now turn to the lower bound $k\, d_\infty(x, \Pi_k) \lesssim \sup_{I\in\mathcal{I}}|V_I(x)|$. Consider any interval $I$ and any $x^{**} \in \Pi_k$. Since the entries of $x^{**}$ are regularly spaced on $\mathcal{C}$, it follows that $|N_I(x^{**}) - k|I|/(2\pi)| \leq 1$, so that $|V_I(x)| \leq |N_I(x) - N_I(x^{**})| + 1$. Now, assume that $N_I(x) > N_I(x^{**})$. We claim that $\sup_{j : x_j \in I} d(x_j, x^{**}_j)$ is at least of the order of $(N_I(x) - N_I(x^{**}))/k$. Otherwise, the set of the $x^{**}_j$ with $j$ satisfying $x_j \in I$ would be included in an interval of size too small to contain $N_I(x)$ equi-spaced points, a contradiction. If $N_I(x) < N_I(x^{**})$, we simply consider the complementary interval $\bar{I}$, which satisfies $N_{\bar{I}}(x) > N_{\bar{I}}(x^{**})$. Putting everything together, taking the infimum over $x^{**}$ and the supremum over $I$ leads to the desired result. □
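As a numerical illustration of Lemma B.18 (our own toy check), the discrepancy $\sup_{I}|V_I(x)|$ can be computed by scanning intervals with endpoints at the data points, mirroring the net $\mathcal{I}_n(x)$ of the proof of Lemma B.19; for near-regular configurations it is small, in line with its equivalence with $k\, d_\infty(x, \Pi_k)$.

import numpy as np

def sup_V(x):
    # sup over arcs [xs_i, xs_j] of |N_I(x) - k |I| / (2 pi)|.
    k = len(x)
    xs = np.sort(np.mod(x, 2 * np.pi))
    best = 0.0
    for i in range(k):
        for j in range(k):
            length = (xs[j] - xs[i]) % (2 * np.pi)
            count = (j - i) % k + 1      # sorted points falling in the arc
            best = max(best, abs(count - k * length / (2 * np.pi)))
    return best

rng = np.random.default_rng(1)
k = 100
x = 2 * np.pi * np.arange(k) / k + rng.normal(0, 0.01, k)   # near-regular sample
print(sup_V(x))   # small, of order k * d_inf(x, Pi_k)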

B.5. Proof of Corollary 3.3
Theorem 3.2 ensures that, conditionally on $x^*$, the localization bound holds with probability at least $1 - 5/n^2$. Thus, it suffices to show that, with probability at least $1 - 2/n^2$, one has $d_\infty(x^*, \Pi_n) \leq C\sqrt{\log(n)/n}$ for some $C > 0$. We shall rely on the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. Indeed, the arguments $x^*_1, \ldots, x^*_n$ are independent and uniformly distributed on $[0, 2\pi)$, and any interval $I$ of the torus $\mathbb{R}/(2\pi)$ can be represented as a union of at most two intervals of $[0, 2\pi)$. For any interval $I$, we denote by $|I|$ its length and by $N_I(x^*)$ the number of points $x^*_i$ whose argument lies in $I$. Then, we deduce from the DKW inequality a deviation bound valid for any $t > 0$, and we choose $t = \sqrt{\log(n)/n}$ to control $\sup_{I\in\mathcal{I}}|N_I(x^*) - n|I|/(2\pi)|$, where $\mathcal{I}$ stands for the set of intervals on the torus $\mathbb{R}/(2\pi)$. Besides, by Lemma B.18, this supremum is equivalent to $n\, d_\infty(x^*, \Pi_n)$. The last two displays lead to the desired result. □

C.1. Proof of Proposition 2.2

For simplicity, we assume that $n/8$ is an integer in the rest of the example, and we write $n = 8n_1$. The construction of $f'$ mainly amounts to contracting the function $f$ in some regions and dilating it in others, which allows contracting and dilating the positions $x$. Consider a partition of the latent space $\mathcal{C} = \mathcal{C}_1 \cup \mathcal{C}_2 \cup \mathcal{C}_3$ into three arcs $\mathcal{C}_1 = (x_n, x_{n_1}] = (0, \pi/4]$, $\mathcal{C}_2 = (x_{n_1}, x_{4n_1}] = (\pi/4, \pi]$ and $\mathcal{C}_3 = (x_{4n_1}, x_n] = (\pi, 2\pi]$. For $x$ and $y$ belonging to $\mathcal{C}_1$, define $f'_1(x, y) = 1 - d(x, y)/\pi$. For $k = 1, \ldots, 2n_1$, define $x'_k = e^{\iota k\pi/n}$ and let $x'_n = x_n = 1$. In other words, we contract the positions $x_k$ for $k = 1, \ldots, 2n_1$. Although we have not yet completely defined $x'$, we can already certify that $\min_{Q\in\mathcal{O}} d_\infty(x', Qx) \geq \pi/8$. For $x$ and $y$ in $\mathcal{C}_2$, we define $f'_2(x, y) = 1 - d(x, y)/(3\pi)$. For $k = 1, \ldots, 2n_1$, we set $x'_{k+2n_1} = e^{\iota\pi/4}e^{\iota k3\pi/n}$. Again, observe that $f'_2(x'_i, x'_j) = f(x_i, x_j)$ for all integers $i, j$ in $[2n_1 + 1, 4n_1]$. Finally, for $x$ and $y$ in $\mathcal{C}_3$, set $f'_3(x, y) = f(x, y)$, and let $x'_k = x_k$ for all integers $k = 4n_1 + 1, \ldots, n - 1$.
It remains to deal with the situations where the pairs of points lie in different parts of the partition $\mathcal{C}_1 \cup \mathcal{C}_2 \cup \mathcal{C}_3$. In the case where $x \in \mathcal{C}_1$ and $y \in \mathcal{C}_2$, define $f'_{1-2}(x, y) = f'_1(x, e^{\iota\pi/4}) + f'_2(e^{\iota\pi/4}, y) - 1$. For all integers $i \in [0, 2n_1]$ and $j \in [2n_1, 4n_1]$, we have already seen that $f'_1(x'_i, e^{\iota\pi/4}) = f(x_i, e^{\iota\pi/2})$, and the analogous identity holds for $f'_2$. In the case where $x \in \mathcal{C}_1$ and $y \in \mathcal{C}_3$, we define $f'_{1-3}$ similarly. Since $f$ admits similar decompositions, one deduces from the above that $f'(x'_i, x'_j) = f(x_i, x_j)$ for the corresponding pairs. The remaining cases can be handled in the same manner. Finally, we define the symmetric function $f'$ by patching these pieces together. As a consequence, $(x', f')$ belongs to $\mathcal{R}[F, (3\pi)^{-1}, \pi^{-1}, 0]$. One easily checks that $x' \in S_{ev}$, and the result follows. □

C.2. Proof of (9) in Proposition 2.3

We show in the paragraph below that (9) is a consequence of the proof of Theorem 3.2 in the noiseless case ($E = 0$), after application of the triangular inequality. Indeed, since the noise is equal to zero, the conclusion of Theorem 3.2 is deterministic (and not only with high probability), so it can be used to prove deterministic inequalities such as (9). By doing so, we establish (9) via our localization algorithm (Theorem 3.2), although (9) is an approximation result (independent of any algorithm) which could be proved directly.
Consider any two representations $(x, f)$ and $(x', f')$ in $\mathcal{R}[F, c_l, c_L, c_e]$ and apply our Localize-and-Refine procedure to the noiseless observations $A = F$. The conclusion of Theorem 3.2 applies to both $x$ and $x'$, so that the procedure localizes both configurations up to the stated error. Hence, (9) follows from the triangular inequality. □

C.3. Proof of Theorem 5.1
We establish the lower bound $\sqrt{\log(n)/n}$ in the particular setting where the observations $A_{ij}$ are independent Bernoulli random variables of parameters $F_{ij} = f_0(x_i, x_j)$, for a specific function $f_0$ defined in (55), with $x = (x_1, \ldots, x_n) \in \Pi_n$. The corresponding probability distribution is denoted by $\mathbb{P}_{(x, f_0)}$. This minimax lower bound is based on Fano's method, as stated below. For two configurations $x$ and $x'$ in $\Pi_n$, we denote the Kullback-Leibler divergence of $\mathbb{P}_{(x,f_0)}$ and $\mathbb{P}_{(x',f_0)}$ by $KL(\mathbb{P}_{(x,f_0)} \,\|\, \mathbb{P}_{(x',f_0)})$. Besides, we consider the quasi-metric $\rho(x, x') = \min_{Q\in\mathcal{O}} d_\infty(x, Qx')$. Given a radius $\delta > 0$ and a subset $S' \subset \Pi_n$, the packing number $\mathcal{M}(\delta, S', \rho)$ is defined as the largest number of points in $S'$ that are at quasi-distance at least $\delta$ away from each other. Below, we state a specific version of Fano's lemma.

Lemma C.1 then lower bounds the maximal risk of any estimator $\hat{x}$ at any radius $\delta > 0$. In view of this lemma, we mainly have to choose a suitable subset $S'$, control its Kullback-Leibler diameter, and get a sharp lower bound on its packing number. The main difficulty stems from the fact that the loss function $\rho(x, y) = \min_{Q\in\mathcal{O}} d_\infty(x, Qy)$ is a minimum over a collection of orthogonal transformations; it is therefore challenging to derive a tight lower bound for this loss.
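For reference, a standard form of Fano's lemma in this setting reads as follows (a classical statement; the paper's Lemma C.1 may differ in constants). If $x^{(1)}, \ldots, x^{(M)}$ form a $\delta$-packing of $S'$ for the quasi-metric $\rho$, then
\[
\inf_{\hat{x}} \; \max_{s \in [M]} \; \mathbb{P}_{(x^{(s)}, f_0)}\Big[\rho(\hat{x}, x^{(s)}) \geq \tfrac{\delta}{2}\Big]
\;\geq\; 1 - \frac{d_{KL}(S') + \log 2}{\log M},
\]
where $d_{KL}(S') := \max_{s,t} KL(\mathbb{P}_{(x^{(s)},f_0)} \,\|\, \mathbb{P}_{(x^{(t)},f_0)})$ is the Kullback-Leibler diameter of $S'$. With $M \geq n/2$ and $d_{KL}(S') \leq \log(n)/4$, as established below, the right-hand side stays bounded away from zero for large $n$.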
Let $k := C'\sqrt{n\log(n)}$, for a small enough constant $C' \in (0, 1]$ that will be set later. Define $n/2$ vectors $x^{(s)} \in \Pi_n$, $s = 1, \ldots, n/2$, as follows: for each $s \in [n/2]$, the vector $x^{(s)}$ is chosen so that the set $S' := \{x^{(1)}, \ldots, x^{(n/2)}\}$ is a $\delta_n$-packing for $\rho$, which in turn ensures that the packing number $\mathcal{M}(\delta_n, S', \rho)$ at radius $\delta_n := \pi k/n$ satisfies $\mathcal{M}(\delta_n, S', \rho) \geq n/2$. To upper bound the KL diameter of $S'$, we use the following claim, whose proof is postponed to the end of the section.
Claim C.2. For any $x, x' \in \mathcal{C}^n$, the divergence $KL(\mathbb{P}_{(x,f_0)} \,\|\, \mathbb{P}_{(x',f_0)})$ is bounded by a constant multiple of $\sum_{i,j}(F_{ij} - F'_{ij})^2$.

Together with the definition (55) of $f_0$, this yields a bound on $d_{KL}(S')$ involving a numerical constant $C$. Then, choosing the constant $C'$ in the definition of $k$ such that $C' = (2\sqrt{C})^{-1}$ leads to $d_{KL}(S') \leq \log(n)/4$. Applying Lemma C.1 to this set $S'$, we arrive at the claimed lower bound as soon as $n$ is large enough. Theorem 5.1 is proved. □

Proof of Claim C.2. By definition of the Kullback-Leibler divergence, and since $\log(t) \leq t - 1$ for all $t > 0$, the divergence is bounded by a sum of squared differences $(F_{ij} - F'_{ij})^2$ weighted by $1/(F'_{ij}(1 - F'_{ij}))$, where the second inequality follows from the fact that $1/4 \leq F'_{ij} \leq 3/4$. □

We now turn to the proof of Theorem 4.1, which relies on the following lemma.

Lemma D.3. Let $x^*_S \in \mathcal{C}^{n_0}$. For any input $x_S \in \mathbb{R}^{2\times n_0}$, UA returns a vector $\bar{x}_S \in \Pi_{n_0}$ whose $d_1$-error is controlled by that of the input, up to a constant factor and an additive constant.

By (58)-(60) and Lemma D.3, the projection $\bar{x}^{(1)}_S$ satisfies
$$\min_{Q\in\mathcal{O}} d_1(\bar{x}^{(1)}_S, Qx^*_S) + 1 \leq C_{l,L,e,a,b}\sqrt{n\log(n)} \qquad (61)$$
with probability higher than $1 - 3/n^2$.
Finally, we plug $\bar{x}^{(1)}_S$ into the criterion (16) to localize the remaining points. In other words, we compute, for $i \in \bar{S}$, the refined position estimates $[\hat{x}^{(2)}_i]_{i\in\bar{S}}$. As a direct consequence of (61) and Proposition 3.5 (and the equivalence between the $\ell_1$ norm in $\mathbb{R}^2$ and the distance $d_1$ on the sphere $\mathcal{C}$), we arrive at a uniform bound which holds with probability higher than $1 - 4/n^2$. As in Section 3.4, we finally rely on a cross-validation scheme to estimate and realign all the positions. This straightforwardly allows us to uniformly localize, with probability higher than $1 - 9/n^2$, all positions $x^*_i$ within an error of the order of $\sqrt{\log(n)/n}$. This concludes the proof of Theorem 4.1.

Let $x^{*(4)}$ denote the vector of size $n$ built by replicating each coordinate of $x^*_S$ four times; it is close to $x^*$. Let us show first that the spectra of $F_{SS}$ and $F^{(4)}$ are almost the same. By construction of $F^{(4)}$, each of the $n_0$ eigenvectors of $F_{SS}$ can be transformed into an eigenvector of $F^{(4)}$ by replicating four times the coordinates of these vectors. Besides, the rank of $F^{(4)}$ is the same as that of $F_{SS}$. We deduce that all non-zero eigenvalues of $F^{(4)}$ are eigenvalues of $2F_{SS}$. Formally, this identifies the non-zero eigenvalues $\lambda'_0 \geq \ldots \geq \lambda'_{n-1}$ of $F^{(4)}$ with the eigenvalues $\lambda^{*(S)}_0 \geq \ldots \geq \lambda^{*(S)}_{n_0-1}$ of $F_{SS}$, up to this common factor. We then show that the spectra of $F^{(4)}$ and $F$ are close. By (58), there is probability higher than $1 - 1/n^2$ that $d_\infty(x^*_S, \Pi_{n_0}) \leq C_a\sqrt{\log(n)/n}$. Hence, one can readily check that $x^{*(4)}$ is close to $\Pi_n$. Furthermore, Assumption (24) of the theorem ensures that $x^*$ is close to $\Pi_n$ as well. Since (any) two elements of $\Pi_n$ are equal up to a permutation of their indices $[n]$, we deduce that there exists a permutation $\sigma$ of $[n]$ satisfying $d_\infty(x^{*(4)}, x^*_\sigma) \leq C_a\sqrt{\log(n)/n}$. Combining this with the bi-Lipschitz condition (5), we deduce that the entries of $F^{(4)}$ and $F_\sigma$ are uniformly close. We are now ready to control the difference between the spectra of $F$ and $F^{(4)}$. Recalling that $\lambda^*_0 \geq \ldots \geq \lambda^*_{n-1}$ denote the eigenvalues of $F$, and since $F_\sigma$ has the same eigenvalues as $F$, it follows from Weyl's inequality (see e.g. [39, page 45]) that $|\lambda'_i - \lambda^*_i| \leq C_{aL}\sqrt{n\log(n)}$ for all $i = 0, \ldots, n - 1$, for some constant $C_{aL}$ depending only on $c_a$ and $c_L$. Gathering (63)-(64), we conclude that the following implication holds: if $\Delta_1 := \lambda^*_0 - \lambda^*_1$ and $\Delta_2 := \lambda^*_2 - \lambda^*_3$ satisfy $\Delta_1 \wedge \Delta_2 \geq c_b n$, then the corresponding gap $\Delta$ of $F_{SS}$ satisfies $\Delta \geq \frac{c_b}{2}n - C_{aL}\sqrt{n\log(n)}$, which is larger than $c_b n/4$ as soon as $n \geq n_{ab}$, for $n_{ab}$ the smallest integer satisfying $c_b n \geq 4C_{aL}\sqrt{n\log(n)}$. □
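The replication step can be illustrated numerically (our own sanity check, assuming plain entry replication; under this convention the non-zero eigenvalues scale by the replication count, while the paper's normalization of $F^{(4)}$ may differ by a constant factor): duplicating each index pads the spectrum with zeros and rescales the non-zero eigenvalues.

import numpy as np

n0 = 16
x = 2 * np.pi * np.arange(n0) / n0
gap = np.abs(x[:, None] - x[None, :])
F_SS = 1 - np.minimum(gap, 2 * np.pi - gap) / np.pi   # toy geometric affinity

F4 = np.kron(F_SS, np.ones((4, 4)))   # each entry replicated in a 4x4 block
ev_small = np.linalg.eigvalsh(F_SS)
expected = np.sort(np.concatenate([4 * ev_small, np.zeros(3 * n0)]))
print(np.allclose(np.sort(np.linalg.eigvalsh(F4)), expected))   # True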
The bound of Lemma D.3 trivially holds for $n_0 \leq 3$; henceforth, we assume that $n_0 \geq 4$. The next lemma is a key element in the proof: it states that reordering two vectors is almost optimal for minimizing their $d_1$ distance. This result is fairly classical for real vectors. Here, as the vectors $v$ and $u$ take their values in $\mathcal{C}$, the proof is slightly more involved.
Lemma D.4. Consider any $v \in \mathcal{C}^{n_0}$ and $u \in \Pi_{n_0}$. Provided that $n_0 \geq 4$, there exists a reordering $u_\sigma$ of $u$, ordered consistently with $v$, such that $d_1(v, u_\sigma) \lesssim d_1(v, u)$.

Recall that $S = \{1, \ldots, n_0\}$ for ease of exposition. We shall prove the following statement, which implies Lemma D.3: for any $x^*_S \in \mathcal{C}^{n_0}$, any input $x_S \in \mathbb{R}^{2\times n_0}$, and any $Q \in \mathcal{O}$, the vector $\bar{x}_S$ of $\Pi_{n_0}$ fulfills (65). UA computes in Step 1 the projection $z_S = [x_i/\|x_i\|_2]_{i\in S}$ of the input $x_S$ onto $\mathcal{C}^{n_0}$. Given the projection $z_S$, UA picks in Step 3 a vector $\bar{x}_S \in \Pi_{n_0}(z_S)$ that has the smallest $\ell_1$-error. It follows from these definitions and the equivalence between the distance $d$ on $\mathcal{C}$ and the $\ell_1$-norm in $\mathbb{R}^2$ that the $d_1$-error of $\bar{x}_S$ is controlled by the $\ell_1$-error of $z_S$. Gathering this bound with Lemma D.4, it suffices to exhibit some $u \in \Pi_{n_0}$ whose $\ell_1$ distance to the projection $z_S$ is small. This is precisely the purpose of the next lemma.
Lemma D.5. Consider any matrix $Q \in \mathcal{O}$. There exists $y \in \Pi_{n_0}$ whose $\ell_1$ distance to $Qx^*_S$ is suitably small.

We conclude as follows. By the triangular inequality, and since the definition of a projection, together with the equivalence between the $\ell_1$-norm and the Euclidean norm in $\mathbb{R}^2$, ensures that $z_S$ is at least as close to $x_S$ as the element $Qx^*_S$ of $\mathcal{C}^{n_0}$ is, the last three displays allow us to conclude, which gives (65) using the triangle inequality again.

Proof of Lemma D.5. Let $x^{**}_S \in \Pi_{n_0}$ be a closest approximation of $x^*_S$ in $\Pi_{n_0}$. The triangular inequality gives a first bound, where the last inequality comes from the definition of a projection and the equivalence between the $\ell_1$-norm and the Euclidean norm in $\mathbb{R}^2$. By the triangular inequality again, and since an orthogonal transformation preserves the distances, we obtain a second bound,
where we use again the equivalence between the distance $d$ on $\mathcal{C}$ and the $\ell_1$-norm in $\mathbb{R}^2$. Putting everything together, we conclude. Although $x^{**}_S$ belongs to $\Pi_{n_0}$, this is not necessarily the case for $Qx^{**}_S$. Nevertheless, it is easy to check that there exists some $Q' \in \mathcal{O}$ such that $Q'x^{**}_S \in \Pi_{n_0}$ and $\|Q'x^{**}_S - Qx^{**}_S\|_1 \lesssim 1$. Setting $y := Q'x^{**}_S \in \Pi_{n_0}$ concludes the proof. □

Proof of Lemma D.4. Fix any vector $v \in \mathcal{C}^{n_0}$ and any $u \in \Pi_{n_0}$. Let $\tau$ be a permutation ordering the coordinates of $v$ on the unit sphere, meaning that $v_{\tau(1)}, \ldots, v_{\tau(n_0)}$ is ordered. For simplicity and without loss of generality, assume that $\tau$ is the identity. Since $u \in \Pi_{n_0}$, it suffices to prove the existence of a permutation $\sigma$ of $[n_0]$ such that $u_\sigma$ is ordered and $d_1(v, u_\sigma) \lesssim d_1(v, u)$. Define the set of 'bad' indices $B = \{i : d(u_i, v_i) \geq \pi/16\}$. If the cardinality of $B$ is larger than $n_0/2$, then $d_1(u, v) \geq n_0\pi/32$ and any ordered vector $u_\sigma$ satisfies $d_1(v, u_\sigma) \leq n_0\pi \leq 32\,d_1(v, u)$. Hence, we assume henceforth that $|B| \leq n_0/2$. First, we focus on the set of 'good' indices $G = [n_0]\setminus B$. We establish the following claim at the end of the proof.
Claim D.6. There exists a permutation $\sigma$ of $G$ such that the sequence $(u_{\sigma(j)})_{j\in G}$ is ordered and $\sum_{j\in G} d(v_j, u_{\sigma(j)}) \leq \sum_{j\in G} d(v_j, u_j)$.

Hence, it is possible to order the restriction of $u$ to $G$ without increasing the sum of the distances. It remains to extend $\sigma$ into a permutation of $[n_0]$. We iteratively add the elements of $B$ into $\sigma$. Consider any $i \in B$. Let $k$ and $l$ be the two consecutive (modulo $n_0$) elements of $G$ such that $u_i$ belongs to the interval $[u_{\sigma(k)}, u_{\sigma(l)})$ of the torus $\mathbb{R}/(2\pi)$. Let $r$ and $s$ be the two consecutive elements of $G$ such that $i \in (r, s)$ (where we work modulo $n_0$). Then, we define the permutation $\sigma'$ of $G \cup \{i\}$ as follows.
If $(r, s) = (k, l)$, then we take $\sigma'(j) = \sigma(j)$ for $j \in G$ and $\sigma'(i) = i$. One readily checks that the sequence $(u_{\sigma'(j)})_{j\in G\cup\{i\}}$ is ordered and that the corresponding sum of distances remains controlled. Otherwise, we set $\sigma'(i) = \sigma(s)$ and $\sigma'(k) = i$. For $j \in G$, let $\mathrm{succ}_G(j)$ be the successor of $j$ in $G$, that is, the smallest index $j' \in G$ which is larger than $j$ (modulo $n_0$). For any $j \in G$ in the segment $[s, k)$, we set $\sigma'(j) = \sigma(\mathrm{succ}_G(j))$. Besides, we set $\sigma'(j) = \sigma(j)$ for all $j \in G$ in the segment $[l, r]$. In other words, we have shifted all the elements in the segment $[s, k]$ so as to include $i$ in the permutation $\sigma'$. It follows from this definition that the sequence $(u_{\sigma'(j)})_{j\in G\cup\{i\}}$ is ordered. By the triangular inequality, the additional cost is controlled, where we used in the third line that $\sum_{j\in G\cap[s,k)} d(u_{\sigma(j)}, u_{\sigma(\mathrm{succ}_G(j))}) \leq 2\pi$; indeed, the sequence $(u_{\sigma(j)})_{j\in G\cap[s,k)}$ is ordered on the sphere, and this sum is therefore equal to the length of the arc $[u_{\sigma(s)}, u_{\sigma(k)}]$.
By a straightforward induction, we build a permutation $\sigma$ on $[n_0]$ such that $(u_{\sigma(j)})_{j\in[n_0]}$ is ordered and $d_1(v, u_\sigma) \lesssim d_1(v, u)$, where we used Markov's inequality in the last step. We have shown the desired result. □

Proof of Claim D.6. Without loss of generality, we assume in the proof that $B = \emptyset$, so that we build a permutation $\sigma$ of $[n_0]$. Since $B = \emptyset$, $u \in \Pi_{n_0}$ satisfies $d_\infty(u, v) \leq \pi/16$. We shall iteratively build a permutation $\sigma$ such that $u_\sigma$ is ordered. Let us first partition the one-dimensional torus $\mathbb{R}/(2\pi)$ into three parts $D_1, D_2, D_3$, with suitable enlargements $D'_s$. Note that the diameter of $D'_s$ is smaller than $2\pi/3 + \pi/8 < \pi$. We decompose $[n_0]$ accordingly into index sets $I_1, I_2, I_3$. For $s = 1, 2, 3$, let $\sigma_s$ denote the permutation of $I_s$ such that the sequence $(u_{\sigma_s(i)})$ is ordered when $i$ runs over $I_s$. Since the diameter of $D'_s$ is at most $\pi$, the sequence $(u_{\sigma_s(i)})$ in $D'_s$ is isometric to a monotone sequence of points in $[0, \pi] \subset \mathbb{R}$ endowed with the absolute value distance. The same holds for the ordered sequence $(v_i)$ in $D'_s$. Next, we use the following classical property.

Claim D.7. Let $l \geq 1$ be an integer and $a, b$ be two monotonic vectors of $\mathbb{R}^l$, that is, $a_1 \leq a_2 \leq \ldots \leq a_l$ and $b_1 \leq b_2 \leq \ldots \leq b_l$. Then, for any permutation $\tau$ of the indices $\{1, \ldots, l\}$, we have $\sum_i |a_i - b_i| \leq \sum_i |a_i - b_{\tau(i)}|$.

It follows that, for $s = 1, 2, 3$, ordering within $I_s$ does not increase the sum of the distances. Let $\sigma$ be the permutation such that $\sigma(i) = \sigma_s(i)$ if $i \in I_s$. Obviously, we have $d_1(v, u_\sigma) \leq d_1(v, u)$. Besides, $u_\sigma$ is ordered except possibly at the indices $J_s = \{i : u_{\sigma(i)} \in [(s-1)\tfrac{2\pi}{3} - \tfrac{\pi}{16};\ (s-1)\tfrac{2\pi}{3} + \tfrac{\pi}{16}]\}$ with $s = 1, 2, 3$. Since $\max_i d(u_{\sigma(i)}, v_i) \leq \tfrac{\pi}{16}$ by the second part of the above argument, all the $u_{\sigma(i)}$ and $v_i$ with $i \in J_s$ belong to an interval of length smaller than $\pi$. Hence, we can build as previously permutations $\sigma'_s$ of $J_s$ that make $(u_{\sigma'_s(\sigma(i))})$ ordered on $J_s$ and such that $\sum_{i\in J_s} d(v_i, u_{\sigma'_s(\sigma(i))}) \leq \sum_{i\in J_s} d(v_i, u_{\sigma(i)})$. □
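Claim D.7 is the classical rearrangement fact for the $\ell_1$ cost; a brute-force check on small vectors (our own illustration) confirms that the monotone matching is optimal.

import itertools
import numpy as np

rng = np.random.default_rng(2)
a = np.sort(rng.uniform(0, np.pi, 6))
b = np.sort(rng.uniform(0, np.pi, 6))
identity_cost = np.sum(np.abs(a - b))
# The identity matching of two sorted vectors minimizes the l1 matching cost.
assert all(identity_cost <= np.sum(np.abs(a - b[list(tau)])) + 1e-12
           for tau in itertools.permutations(range(6)))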

D.4. Proof of Proposition D.1
Under the extra assumption that $f$ is geometric, i.e. that $f$ satisfies (23), we will show that the estimation error of the spectral algorithm, measured in an $\ell_1$-type norm, is bounded as stated in Proposition D.1. The proof consists in approximating the signal $F_{SS}$ by a circulant and circular R-matrix (Definition D.8), whose spectrum is known (Lemma D.9) and provides information on the latent positions $x^*_S$. The difference between the spectra of $F_{SS}$ and $A_{SS}$ is then bounded using the Davis-Kahan perturbation bound.
Definition D.8. For any integer $n \geq 1$, a symmetric matrix $M \in \mathbb{R}^{n\times n}$ is circulant if there exists a vector $a$ of size $n$ such that $M_{ij} = a_{|i-j|}$ and $a_k = a_{n-k}$ for all $k = 1, \ldots, n-1$.
Moreover, $M$ is a circulant and circular R-matrix if, in addition, the sequence $(a_j)_{0 \leq j \leq \lfloor n/2\rfloor}$ is non-increasing.
The spectrum of circulant matrices is well known (see [23] and the references therein), which allows one to easily deduce the spectrum of symmetric circulant matrices; see Proposition C.4 in [33]. For clarity, we recall this result below, with a small correction on the first coordinate of the eigenvector $v^{(m)}$.

Lemma D.9 (spectrum of symmetric circulant matrices). Let $M \in \mathbb{R}^{n\times n}$ be any symmetric circulant matrix associated to a vector $a$ (as above). Its eigenvalues are the real Fourier coefficients $\alpha_0, \ldots, \alpha_p$ of $a$, where each $\alpha_m$, for $m = 1, \ldots, p-1$, is associated with the two eigenvectors in (67). The eigenvalue $\alpha_p$ is associated with $u^{(p)} = (1, -1, \ldots, 1, -1)$. For $m = 0$, $\alpha_0$ has multiplicity $1$ and is associated with $u^{(0)} = (1, \ldots, 1)$.
If the vector $a$ has nonnegative entries, then $\alpha_0$ is obviously the largest eigenvalue. The next lemma ensures that, for circular R-matrices, $\alpha_1$ is the second largest eigenvalue. Its proof can be found in [33, Proposition C.5].
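Lemma D.9 can be verified numerically: for a symmetric circulant matrix built as in Definition D.8, the eigenvalues coincide with the (real) discrete Fourier transform of its first row. The sketch below uses an assumed decreasing sequence $a$ of our choosing.

import numpy as np

n = 12
k = np.arange(n)
a = np.exp(-np.minimum(k, n - k))       # a_k = a_{n-k}, non-increasing on [0, n/2]
M = a[np.abs(k[:, None] - k[None, :])]  # symmetric circulant: M_ij = a_|i-j|

fourier = np.real(np.fft.fft(a))        # alpha_m = sum_k a_k cos(2 pi k m / n)
eig = np.linalg.eigvalsh(M)
print(np.allclose(np.sort(eig), np.sort(fourier)))   # True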
From these spectral properties, we deduce the desired bound via the Cauchy-Schwarz inequality.
The bounds (68) can then be combined with the previous estimates. Finally, by (69) and the equivalence between the distance $d$ on $\mathcal{C}$ and the $\ell_1$-norm in $\mathbb{R}^2$, we have $\min_{Q\in\mathcal{O}}\|x^*_S - Qx^{**}_S\|_1 \leq C_a\sqrt{n\log(n)}$. Since $\lambda_0 \leq n$ (all the entries of $F_{SS}$ belong to $[0, 1]$), we then conclude by the triangular inequality.

D.6. Proof of Corollary 4.3 (spectral gap for affine functions)
We will show that, for $n$ large enough, the gaps in the Fourier coefficients, $\Phi_1 := F_{0,n}(g) - F_{1,n}(g)$ and $\Phi_2 := \min_{j=2,\ldots,\lfloor n/2\rfloor} F_{1,n}(g) - F_{j,n}(g)$, are at least of the order of $n$. Corollary 4.3 will then follow directly from Theorem 4.1 and Lemma 4.2.
Recall the definition of the $m$-th Fourier coefficient $F_{m,n}(g)$. For simplicity, we only consider the case where $n$ is odd, the case of even $n$ being similar. Let $n = 2p + 1$ with $p \geq 2$. Using the fact that $g(x) = 1 - x/(2\pi)$, we compute the coefficients explicitly. Taking $x_m = \frac{2\pi m}{n}$, the first term of the numerator is equal to zero since $\sin((p + \tfrac{1}{2})x_m) = \sin(\pi m) = 0$.
Since $\frac{p+1}{2}x_m = \frac{m\pi}{2} + \frac{m\pi}{2n}$, the resulting expression takes one value when $m$ is odd and another when $m$ is even.
In other words, each eigenvalue of even index is upper bounded by the previous eigenvalue. In light of this, we only need to prove that, for $n$ large enough, the gaps $\alpha_0 - \alpha_1$ and $\alpha_1 - \alpha_3$ are of order $n$. From (80), we deduce that $\alpha_0$ is equivalent to $3n/4$. Besides, we deduce from the explicit form of $\alpha_m$ in the general case that $\alpha_1$, $\alpha_2$, and $\alpha_3$ are respectively equivalent to $n/\pi^2$, $1/(4n)$, and $n/(9\pi^2)$. This completes the proof.

For the first inequality of the proposition, the proof is the same as for Proposition D.1, after replacing $\hat{x}^{VSA}_S$, $x^*_S$, $A_{SS}$ respectively by $\hat{x}^{VSA}$, $x^*$, $A$. The second inequality of the proposition follows from Lemma 4.2. □
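Returning to the gap computation, the stated asymptotics can be checked numerically (our own toy verification, under the assumed convention $\alpha_m = \sum_k g(d_k)\cos(2\pi km/n)$ with $g(x) = 1 - x/(2\pi)$): $\alpha_0 \sim 3n/4$, $\alpha_1 \sim n/\pi^2$, and $\alpha_3 \sim n/(9\pi^2)$.

import numpy as np

n = 4001                                  # odd, as in the proof
k = np.arange(n)
d = np.minimum(2 * np.pi * k / n, 2 * np.pi * (n - k) / n)   # geodesic distances
a = 1 - d / (2 * np.pi)                   # a_k = g(d_k) with g(x) = 1 - x/(2 pi)
alpha = np.real(np.fft.fft(a))
print(alpha[0] / n)                       # approx 3/4
print(alpha[1] * np.pi ** 2 / n)          # approx 1
print(alpha[3] * 9 * np.pi ** 2 / n)      # approx 1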