Density estimation on an unknown submanifold

We investigate density estimation from a $n$-sample in the Euclidean space $\mathbb R^D$, when the data is supported by an unknown submanifold $M$ of possibly unknown dimension $d<D$ under a reach condition. We study nonparametric kernel methods for pointwise and integrated loss, with data-driven bandwidths that incorporate some learning of the geometry via a local dimension estimator. When $f$ has H\"older smoothness $\beta$ and $M$ has regularity $\alpha$ in a sense to be defined, our estimator achieves the rate $n^{-\alpha \wedge \beta/(2\alpha \wedge \beta+d)}$ and does not depend on the ambient dimension $D$ and is asymptotically minimax for $\alpha \geq \beta$. Following Lepski's principle, a bandwidth selection rule is shown to achieve smoothness adaptation. We also investigate the case $\alpha \leq \beta$: by estimating in some sense the underlying geometry of $M$, we establish in dimension $d=1$ that the minimax rate is $n^{-\beta/(2\beta+1)}$ proving in particular that it does not depend on the regularity of $M$. Finally, a numerical implementation is conducted on some case studies in order to confirm the practical feasibility of our estimators.


Motivation
Suppose we observe a n-drawn (X 1 , . . . , X n ) distributed on an Euclidean space R D according to some density function f . We wish to recover f at some point arbitrary point x ∈ R D nonparametrically. If the smoothness of f at x measured in a strong sense is of order β -for instance by a Hölder condition or with a prescribed number of derivatives -then the optimal (minimax) rate for recovering f (x) is of order n −β (2β+D) and is achieved by kernel or projection methods, see e.g. the classical textbooks Devroye and Györfi (1985); Silverman (1986) or (Tsybakov, 2008, Sec. 1.2-1.3). Extension to data-driven bandwidths (Bowman, 1984;Chiu, 1991) offers the possibly to adapt to unknown smoothness, see Lepski, 2008, 2014;Goldenshluger et al., 2011) for a modern mathematical formulation. In many situations however, the dimension D of the ambient space is large, hitherto disqualifying such methods for pratical applications. Opposite to the curse of dimensionality, a broad guiding principle in practice is that the observations (X 1 , . . . , X n ) actually live on smaller dimensional structures and that the effective dimension of the problem is smaller if one can take advantage of the geometry of the data (Fefferman et al., 2016). This classical paradigm probably goes back to a conjecture of (Stone, 1982) that paved the way to the study of the celebrated single-index model in nonparametric regression, where a structural assumption is put in the form f (x) = g(⟨ϑ, x⟩), where ⟨⋅, ⋅⟩ is the scalar product on R D , for some unknown univariate function g ∶ R → R and direction ϑ ∈ R D . Under appropriate assumptions, the minimax rate of convergence for recovering f (x) with smoothness β drops to n −β (2β+1) and does not depend on the ambient dimension D, see e.g. (Gaïffas and Lecué, 2007;Lepski and Serdyukova, 2014) and the references therein. Also, in the search for significant variables, one postulates that f only depends on d < D coordinates, leading to the structural assumption f (x 1 , . . . , x D ) = F (x i 1 , . . . x i d ) for some unknown function F ∶ R d → R and {i 1 , . . . , i d } ⊂ {1, . . . , D}. In an analogous setting, the minimax rate of convergence becomes n −β (2β+d) and this is also of a smaller order of magnitude than n −β (2β+D) , see (Hoffmann and Lepski, 2002) in the white noise model. The next logical step is to assume that the data (X 1 , . . . , X n ) live on a d-dimensional submanifold M of the ambient space R D . When the manifold is known prior to the experiment, nonparametric density estimation dates back to (Devroye and Györfi, 1985) when M is the circle, and on a homogeneous Riemannian manifold by (Hendriks, 1990), see also Pelletier (2005). Several results are known for specific geometric structures like the sphere or the torus involved in many applied situations: inverse problems for cosmological data (Kerkyacharian et al., 2011;Kim and Koo, 2002;Kim et al., 2009), in geology (Hall et al., 1987) or flow calculation in fluid mechanics (Eugeciouglu and Srinivasan, 2000). For genuine compact homogeneous Riemannian manifolds, a general setting for smoothness adaptive density estimation and inference has recently been considered by Kerkyacharian et al. (2012), or even in more abstract metric spaces in Cleanthous et al. (2018). A common strategy adapts conventional nonparametric tools like projection or kernel methods to the underlying geometry, via the spectral analysis of the Beltrami-Laplace operator on M . Under appropriate assumptions, this leads to exact or approximate eigenbases (spherical harmonics for the sphere, needlets and so on) or properly modified kernel methods, according to the Riemannian metric on M .
If the submanifold M itself is unknown, getting closer in spirit to a dimension reduction approach, the situation becomes drastically different: M hence its geometry is unknown, and considered as a nuisance parameter. In order to recover the density f , one has to understand the minimal geometry of M that must be learned from the data and how this geometry affects the optimal reconstruction of f . This is the topic of the paper.

Main results
We construct a class of compact smooth submanifolds of dimension d of the Euclidean space R D , without boundaries, that constitute generic models for the unknown support of the target density f that we wish to reconstruct under a reach condition. The reach is a somehow unavoidable notion in manifold reconstruction that goes back to Federer (1959): it is a geometric invariant that quantifies both local curvature conditions and how tightly the submanifold folds on itself. It is related to the scale at which the sampling rate n can effectively recover the geometry of the submanifold, see Section 2.2 below. We consider regular manifolds M with reach bounded below that satisfy the following property: M admits a local parametrization at every point x ∈ M by its tangent space T x M , and this parametrization is regular enough. A natural candidate is given by the exponential map exp x ∶ T x M → M ⊂ R D . More specifically, for some regularity parameter α ≥ 1, we require a certain uniform bound for the α-fold differential of the exponential map to hold, quantifying in some sense the regularity of the parametrization in a minimax spirit. Our approach is close but slightly more restrictive than Aamari and Levrard (2019, Def. 1) that consider arbitrary parametrizations among those close to the inverse of the projection onto tangent spaces. Given a density function f ∶ M → [0, ∞) with respect to the volume measure on M , we have a natural extension of smoothness spaces on M by requiring that f ○ exp x ∶ T x M → R is a smooth map in any reasonable sense, see for instance Triebel (1987) for the characterisation of function spaces on a Riemannian manifold.
Our main result is that in order to reconstruct f (x) efficiently at a point x ∈ R D when f lives on an unknown smooth submanifold, it is sufficient to consider estimators of the form where K ∶ R D → R is a kernel verifying certain assumptions andd(x) =d(x, X 1 , . . . , X n ) is an estimator of the local dimension of the support of f in the vicinity of x based on a scaling estimator as introduced in Farahmand et al. (2007). We prove in Theorem 1 that following a classical bias-variance trade-off for the bandwidth h, the rate n −α∧β (2α∧β+d) is achievable for pointwise and global loss functions (in L p (M )-norm for p ≥ 1) when the dimension of M is d, irrespectively of the ambient dimension D. In particular, it is noteworthy that in terms of manifold learning, only the dimension of M needs to be estimated. When α ≥ β, we also have a lower bound (Theorem 2) showing that our result is minimax. Moreover, by implementing Lepski's principle (Lepskii, 1992), we are able to construct a data driven bandwidthĥ =ĥ(x, X 1 . . . , X n ) that achieves in Theorem 3 the rate n −α∧β (2α∧β+d) up to a logarithmic term -unavoidable in the case of pointwise loss due to the Lepski-Low phenomenon (Lepskiȋ, 1990;Low, 1992). When the dimension d is known, the estimator (1) has already been investigated in squared-error norm in Ozakin and Gray (2009) for a fixed manifold M and smoothness β = 2.
A remaining question at this stage is to understand how the regularity of M can affect the minimax rates of convergence for smooth functions, i.e. when α ≤ β. We restrict our attention to the one-dimensional case d = 1. When M is known, Pelletier (2005) studied an estimator of the form where K ∶ R → R is a radial kernel, d M is the intrinsic Riemannian distance on M and the correction term ϑ x (X i ) is the volume density function on M (Besse, 1978, p. 154) that accounts for the value of the density of the volume measure at X i in normal coordinates around x, taking into account how the submanifold curves around X i . By establishing in Lemma 3 that ϑ x is constant (and identically equal to one) when d = 1, we have another estimator by simply learning the geometry of M via its intrinsic distance d M in (2). This can be done by efficiently estimating d M thanks to the Isomap method as coined by Tenenbaum et al. (2000). We construct an estimator that achieves in Theorem 5 the rate n −β (2β+1) , therefore establishing that in dimension d = 1 at least, the regularity of the manifold M does not affect the minimax rate for estimating f even when M is unknown. However, the volume density function ϑ x is not constant as soon as d ≥ 2 and obtaining a global picture in higher dimensions remains open.

Organisation of the paper
In Section 2, we provide with all the necessary material and notation from classical geometry for the unfamiliar reader, as well as the essential notions we need from manifold learning. We elaborate in particular on the reach of a subset of the Euclidean space in Section 2.2 and on our statistical model for density estimation on an unknown manifold in Section 2.3. Section 3 describes in details our estimator and its minimax properties in Theorem 1 and Theorem 2 as well as its adaptation properties with respect to an unknown smoothness (Theorem 3). For clarity, we first consider the case when the dimension d and regularity α of the submanifold M and the smoothness β of the density are known (Section 3.3). We next treat adaptation with respect to α and β (Section 3.4). Finally, we consider adaptation with respect to β and d in Section 3.5 and construct an estimator in Theorem 4 that adapts to the smoothness of f and the dimension of M simultaneously. Section 4 focuses on the case on one-dimensional submanifolds, where we prove that the regularity of the underlying geometry does not affect the minimax rates of convergence for estimating f pointwise (Theorem 5 We recall some basic notions of geometry of submanifolds of the Euclidean space R D for the unfamiliar reader. We borrow material from the classical textbooks Gallot et al. (2004) and Lee (2006). We endow R D with its usual Euclidean product and norm, respectively denoted by ⟨⋅, ⋅⟩ and ⋅ . We denote by B D (x, r) the open ball of R D of center x and radius r.

Smooth submanifolds and tangent spaces
A smooth submanifold M of R D is a connected subset of R D that verifies (Gallot et al., 2004, Prp 1.3 p.3) that for any point x ∈ M , there exists an open neighbourhood of x in R D and a smooth map (called a parametrization) φ ∶ Ω → R D defined in a neighbourhood Ω of 0 in R d for some d ≥ 1 and such that φ(0) = x; dφ(0) is injective; and φ is a homeomorphism from Ω onto U ∩ M . The integer d, called the dimension of M , is the same for all local parametrizations of M . A vector v ∈ R D is said to be tangent to M at point x ∈ M if there exists a smooth curve γ ∶ I → M defined in a neighbourhood of 0 in R and such that γ(0) = x andγ(0) = v. The set of all tangent vectors at point x ∈ M is called the tangent space and is denoted by T x M . It is a subspace of R D of dimension d and it verifies that for any local parametrization φ of M at x, we have et al., 2004, Thm. 1.22 and 1.23 p.13). We denote by T x M ⊥ the orthonormal complement of T x M in M , also called the normal space at point x.

Geodesics and the exponential map
A smooth curve γ ∶ I → M defined on an open set I of R is called a geodesic if at any time t ∈ I, the accelerationγ(t) lies in the normal space T γ(t) M ⊥ (Gallot et al., 2004, Ex 2.80 b. p.81). In particular, the speed t ↦ γ(t) of a geodesic is constant. For any x ∈ M and any v ∈ T x M , there exists a unique maximal geodesic γ x,v defined in an open neighbourhood of 0 such that γ x,v (0) = x andγ x,v (0) = v (Gallot et al., 2004, Cor 2.85 p.85). The set of vectors v ∈ T x M such that γ x,v (1) is well defined is a neighbourhood of 0 in T x M . We define on this neighbourhood the map called the exponential map. It is smooth and there always exists > 0 such that its restriction to the ball B d (0, ) in T x M defines a local parametrization of M at point x (Gallot et al., 2004, Cor 2.89 p.86). The supremum of all for which the latter holds true is called the injectivity radius at point x.
The uniqueness of γ x,v readily yields that exp x (tv) = γ x,v (t), for any t ∈ R for which both expressions are well-defined, in particular in a neighbourhood of 0. Letting t go to 0 shows that d exp x (0) = Id TxM and that d 2 exp by II x and is called the second fundamental form of M at x -see Gallot et al. (2004, Def 5.1 p. 246) for a more general definition. The quantity II x (v ⊗ v) encapsulates the way the submanifold M curves around x in the direction v. For instance, if M is flat around x, then the geodesics originating from x are locally straight lines and thus II x = 0.

Length of curves and the Riemannian distance
We say that a curve γ ∶ a, b] such that on any segment [a i , a i+1 ], γ is smooth and its derivative does not vanish. For such a curve, we define its length as For any two points x, y ∈ M , we define the Riemannian distance d M (x, y) between x and y as the infimum of the length of all the admissible curves that join x to y. One can show that d M is well-defined on M × M (recall that we only consider connected submanifolds) and that it is indeed a distance. It endows M with a topology that coincides with the one induced by the ambient space, see Lee (2006, Lem 6.2 p.94) and Arias-Castro and Le Gouic (2019, Lem 3). An admissible curve γ that reaches the infimum in the definition of d M is called minimizing. We know that every minimizing curve is a geodesic of M , up to a reparametrization (Lee, 2006, Thm 6.6 p.100). Conversely, if y is in exp x (B d (0, )) with smaller than the injectivity radius at x, then the geodesic joining x and y is minimising (Lee, 2006, Prp 6.10). This means that for any When M is a closed subset of R D , the maximal geodesics γ x,v are defined on the whole line R, and the exponential maps are defined on the whole tangent spaces. This is (one side of) the Hopf-Rinow theorem (Lee, 2006, Thm 6.13 p.108).

The volume measure
We need a canonical way to define a uniform measure µ M on M . This can easily be done locally around x ∈ M by pushing forward the Lebesgue measure on T x M through exp x ∶ B TxM (0, ) → M with smaller than the injectivity radius at x. In order to take into account the geometry of M around x, we add a correction term, called the volume density function (Besse, 1978, p. 154). For a continuous function ψ ∶ M → R with support in exp x (B d (0, )), we define and where (e 1 , . . . , e d ) is an arbitrary orthonormal basis of T x M . The value of det g x ij (v) represents the volume of the parallelepiped in T exp x (v) M generated by the images of (e 1 , . . . , e d ) via d exp x (v). We can then define µ M (ψ) for any continuous bounded ψ ∶ M → R by means of partition of unity (Gallot et al., 2004

The reach of a subset in an Euclidean space
One of the main concern when dealing with observations sampled from a geometrically structured probability measure is to determine the suitable scale at which one should look at the data. Indeed, given finite-sized point cloud in R D , there are infinitely many submanifolds that interpolate the point cloud, see Figure 1 for an illustration. A popular notion of regularity for a subset of the Euclidean space is the reach, introduced by Federer (1959). Definition 1. Let K be a compact subset of R D . The reach τ K of K is the supremum of all r ≥ 0 such that the orthogonal projection pr K on K is well-defined on the r-neighbourhood K r of K, When M is a compact submanifold of R D , the reach τ M quantifies two geometric invariants: locally, it measures how curved the manifold is, and globally, it measures how close it is to intersect itself (the so-called bottleneck effect). See Figure 2 for an illustration of the phenomenon. A reach condition, meaning that the reach of the manifold under study is bounded below, is usually necessary in order to obtain minimax inference results in manifold learning. These include: homology inference Balakrishnan et al. (2012); Niyogi et al. (2008), curvature (Aamari and Levrard, 2019) and reach estimation itself  as well as manifold estimation Aamari and Levrard (2019); Genovese et al. (2012). We follow this approach, and start with citing a few useful properties related to the reach. Proposition 1. (Niyogi et al., 2008, Prp. 6.1) Let M be a compact smooth submanifold of R D . Then, for any x ∈ M , we have II x op ≤ 1 τ M .
Since II x is the differential of order two of the mapping exp x , Proposition 1 has several convenient implications. First, it gives a uniform lower bound for the injectivity radiuses of M . This is a corollary of Alexander and Bishop (2006, Thm 1.3), as explained in Aamari and Levrard (2019, Lem A.1). For the second one (Right), the reach is equal to τ M because it is close to self intersecting (a bottleneck effect). The blue area represents the tubular neighbourhood over which the orthonormal projection on each manifold is well-defined.
Proposition 2. Let M be a compact smooth submanifold of R D . Then, for any x ∈ M , the exponential map exp Proposition 1 also yields nice bounds on how well the Euclidean distance on R D approximates the Riemannian distance d M on M × M .
Proposition 3. (Niyogi et al., 2008, Prp. 6.3) For any compact submanifold M of R D and any Proposition 3 allows in turn to compare the volume measure µ M to the Lebesgue measure on its tangent spaces.
Lemma 1. For any d-dimensional compact smooth submanifold M of R D , for any x ∈ M and any η ≤ τ M 2, we have, for all h ≤ η Proof. This result already appears in (Aamari, 2017, Lem III.23) but we prove it here to make constants explicit. Let us denote by Leb the Lebesgue measure on T x M . Using (Aamari, 2017, Prp III.22.v), we know that, as long as These inclusions combined with the last inequalities yield the result.

A statistical model for sampling on a unknown manifold
Our statistical model is characterised by two quantities: the regularity of its support and the regularity of the density defined on this support. The support belongs to a class of submanifolds M , for which we need to fix some kind of canonical parametrization. This is what Aamari and Levrard (2019) propose by asking the support M to admit a local parametrization at all point x ∈ M by T x M , and that this parametrization is close to being the inverse of the projection over this tangent space. We consider here a somewhat stronger condition: among all the local parametrizations of a manifold by its tangent spaces, we impose a constraint on the exponential map up to its differentials of order k.
Definition 2. Let 1 ≤ d < D be integers and τ > 0 be a positive number. We let C d (τ ) define the set of submanifolds M of R D satisfying the following properties: For a real number α ≥ 1 and a positive real number L > 0, we define C d,α (τ, L) of as the set of M ∈ C d (τ ) that fulfill the additional condition: (iv) For any x ∈ M the restriction of the exponential map exp x to the ball B TxM (0, τ 2) is (α + 1)-Hölder with coefficient L. Namely, setting k = ⌈α⌉ and δ = α Condition (iv) implies a bound on derivatives of exp x up to order k. Indeed, we know that for with L j depending on τ , L and α only. This comes from Lemma 5. In that sense, the model of Definition 2 is close to the one proposed by Aamari and Levrard (2019).
In the following, we consider a class C of submanifolds of R D , and a class F of real-valued functions defined on R d .
Definition 3. Given C and F, the statistical model P(C, F) consists of all probability measures P defined on R D (endowed with its Borel sigma-field) such that: (ii) The Radon-Nikodym derivative f = dP dµ M expressed in normal coordinates belongs to F. (In other words, there exists is a version of f such that for every x ∈ M , f ○ exp x belongs to F.) For a model Σ of the form P(C, F) and any x ∈ R D , we define the submodel Σ(x) as P(C(x), F) where C(x) is the subset of all the submanifolds of C containing x.
Some remarks: 1) Definition 3 models the set of all probability measures with support in C and densities in F. 2) Point (ii) of Definition 3 is a natural way to define smoothness in a Hölder sense over Riemannian manifolds, see for instance Triebel (1987). Of course, (ii) in Definition 3 is meaningful only if the class F is left invariant by the canonical action of the orthogonal group O(R d ) so that the claim f ○exp x ∈ F does not depend on an arbitrary orthonormal identification we make between T x M and R d . This will be the case for all the functional classes F we will consider in the paper. According to Definition 3, a given probability measure P ∈ P(C, F) defines in turn a pair of submanifold and density (M P , f P ) abbreviated by (M, f ) when no confusion may arise. If M P is not uniquely determined by P , we simply pick an arbitrary element M among all the manifolds satisfying supp P ⊂ M . One classical case where M is uniquely determined is when the class F consists of densities bounded away from 0, in which case M = supp P exactly.
We will further restrict our study to the following functional classes: Definition 4. For a real number β ≥ 1 and constants r, R > 0, we let F β (a, b, r, R) denote the set of real-valued functions g on R d satisfying: 3 Main results

Methodology
Let Σ = P(C, F) according to the construction given in Definition 3 above. Given x ∈ R D and an n-sample X 1 , . . . , X n arising from a distribution P ∈ Σ(x), we measure the accuracy of an estimator and its associated maximal pointwise risk at x: where E P denotes expectation w.r.t. P . Alternatively, we may consider the global L p -risk Since now the global L p -risk does not depend on x, it makes sense to evaluate it for any P ∈ Σ and define its maximal version provided we have nice mapping x ↦f (x), meaning that the restrictionf M off can be seen as a measurable estimator with value in L p (M, µ M ) for all M ∈ C.
Remark 1. Since we measure smoothness in a strong Hölder sense, it will be sufficient to restrict our study to (3). Indeed, whenever a result holds for some Σ(x) for (3) uniformly in x, it readily carries over to Σ and (4) thanks to the inequality that is always meaningful for classes Σ of distributions P supported on compact submanifolds, an assumption in force in the following.
In this setting, we further investigate the construction of sharp upper bounds for models of the form Σ α,β (x) for Σ α,β = P(C α , F β ) where C α = C d,α (τ, L) and F β = F β (a, b, r, R) and for fixed values of d, τ, L, R, a, b and r, and unknown smoothness parameters α, β.

Kernel estimation
Classical nonparametric density estimation methods are based on kernel smoothing (Parzen, 1962;Silverman, 1986). In this section, we combine kernel density estimation with the minimal geometric features needed in order to recover efficiently their density. In our context, it turns out that it suffices to know the dimension of the underlying manifold M where the data sit. We first work with a fixed (or known) dimension 1 ≤ d < D for M , and thus we further drop d in (most of) the notation. We refer to Section 3.5 for the case of an unknown d.
Let K ∶ R D → R be a smooth function vanishing outside the unit ball B D (0, 1). Given an n-sample (X 1 , . . . , X n ) drawn from a distribution P on R D , we are interested in the behaviour of the kernel estimatorf Notice the normalisation in h that depends on d and not on D, as one would set for a classical kernel estimator in R D . Our main result is thatf h (x) behaves well when P is supported on a d-dimensional submanifold of R D .
We need some notations. Let Σ sd = P C d (τ ), B(0, b) as in Section 3.1. This is quite a large model that contains in particular all the smoothness classes Σ α,β . For P ∈ Σ sd (x) with associated pair of submanifold and density (M, f ), we let that correspond respectively to the mean, bias and stochastic deviation of the estimatorf h (x). We also introduce the quantity where ζ d is the volume of the unit ball B d (0, 1) in R d . The quantity v(h) will prove to be a good majorant of the stochastic deviations off h (x). The usual bias-stochastic decomposition off h (x) leads to for the pointwise risk (3) and for the global risk (4). We study each term separately.
Behaviour of the stochastic term Proposition 4. Let p ≥ 1. There exists a constant c p > 0 depending on p only such that or any P ∈ Σ sd (x): and for any P ∈ Σ sd Remark 2. The last bound is uniform in M as soon as vol M is uniformly bounded. This is the case if we restrict further the class Σ sd to P

Behaviour of the bias term
We need a certain property for the kernel K. More precisely: One way to obtain Assumption 1 is to set Λ is a smooth kernel, supported on the unit ball of the ambient space R D that satisfies Assumption 1. In the following, we pick an arbitrary kernel K such that Assumption 1 is satisfied.
with G j (P, x) ≲ 1 and R h (P, x) ≲ h α∧β where the symbol ≲ means inequality up to a constant that depends on K, α, τ, L, β, r, b and R.
Starting from a kernel K satisfying Assumption 1, we recursively define a sequence of smooth kernels (K (d, ) ) ≥1 , simply denoted by K ( ) in this section, with support in B D (0, 1) as follows (see Figure 3). For x ∈ R D , we put Proposition 5. Let be a positive integer and let 1 ≤ α, β ≤ be real numbers. Let K ( ) be the kernel defined in (8) starting from a kernel K satisfying Assumption 1. Then, for any P ∈ Σ α,β (x) and any bandwidth 0 < h < (r ∧ τ ) 4, the estimatorf h (x) defined as in (5) for the pointwise bias, up to a constant depending on K, , α, τ, L, β, r, b and R and also up to the same constant as in (9).
The second estimate is a direct consequence of the first one: a closer inspection at the proof of Lemma 2 reveals that the estimates are indeed uniform in x ∈ M .

Minimax rates of convergence
For ≥ 1, we let K ( ) be defined as in (8), originating from a kernel K satisfying Assumption 1.
Theorem 1. Let α, β ≥ 1and p ≥ 1. Let be an integer greater than α ∧ β. For the estimatorf h (x) defined in (5) constructed with K ( ) and bandwidth h = n −1 (2α∧β+d) , we have, for n large enough (depending on τ, r, K and d) up to a constant that depends on p, K, , α, τ, L, β, r, b and R and that does not depend on x. If moreover a > 0, the same result holds for the global risk up to a multiplication of the constant in (10) by a factor a −1 p .
Proof. Combining Proposition 4 and Proposition 5, as soon as h ≤ (τ ∧ r) 4, we obtain, for any P ∈ Σ α,β (x), up to a constant that depends on p, K, , α, τ, L, β, r, b and R and that does not depend on x. Setting h = n −1 (2α∧β+d) readily yields (10), while the uniformity in x ∈ M P enables us to integrate (10) and obtain the result for the integrated risk.
This rate is optimal if the regularity of M dominates the smoothness of f : Theorem 2. For any real numbers α, β ≥ 1 and any x ∈ R D , we have, for L, b and 1 a large enough (depending on τ ) lim inf where C ⋆ only depends on τ and R. The infimum is taken among all estimators F of f constructed from a drawn X 1 , . . . , X n with distribution P ∈ Σ α,β (x).
Some remarks: 1) the order ≥ 1 of the kernel plays the role of a cancellation order as in the Euclidean case: the higher α and β, the bigger we need to choose , resulting in oscillations for K . 2) We have a minimax rate of convergence n −α∧β (2α∧β+d) that is similar to the Euclidean case, provided the underlying manifold M is regular enough, namely α ≥ β, which probably covers most cases of interests. This restriction played by α together with the reach condition comes as a price for recovering f when its support M is unknown. We will see in Section 4 below that for d = 1, it is possible to remove the term α and achieve the rate n −β (2β+1) . This however requires extra effort on learning the geometry of M . 3) As mentioned earlier, a reach condition τ > 0 is usually necessary in minimax geometric inference when the manifold is unknown (Aamari and Levrard, 2019;Balakrishnan et al., 2012;Genovese et al., 2012;Kim et al., 2016;Niyogi et al., 2008). Although we do not have a proof here, it is likely that having τ > 0 is necessary in order to recover f in a minimax setting.

Smoothness adaptation
We implement Lepski's algorithm, following closely Lepski et al. (1997) in order to automatically select the bandwidth from the data (X 1 , . . . , X n ). We know from Section 3.2 that the optimal bandwidth on Σ α,β (x) is of the form n −1 (2α∧β)+d . Hence, without prior knowledge of the value of α and β, we can restrict our search for a bandwidth in a bounded interval of the form [h − , h + ] for some arbitrary h + ∈ (0, 1] and consider the set This lower bound is smaller than the optimal bandwidth n −1 (2α∧β+d) on Σ α,β (x) for n large, and is such that v(h) ≤ 2 2v (nh d ) for all h ≥ h − . For h, η ∈ H, we introduce the following quantities: where D 1 and D 2 are positive constants (to be specified). For h ∈ H we define the subset of bandwidths H(h) = {η ∈ H, η ≤ h} The selection rule for h is the following: and we finally consider the estimatorf Theorem 3. In the same setting as Theorem 1 forf defined in (12) with h + = 1, for any D 2 > p ≥ 1, and any D 1 > 0, we have, for n large enough (depending on p, d, D 1 , D 2 , K, , α, τ, L, β, r, b and R) up to a constant that depends on p, d, D 1 , D 2 , K, , α, τ, L, β, r, b, R and that does not depend on x.
Some remarks: 1) Theorem 3 provides us with a classical smoothness adaptation result in the spirit of (Lepski et al., 1997): the estimatorf as the same performance as the estimatorf h selected with the optimal bandwidth n −1 (2α∧β+d) , up to a logarithmic factor on each model Σ α,β (x) without the prior knowledge of α ∧ β over the range [1, ].
2) The extra logarithmic term is the unavoidable payment for the Lepski-Low phenomenon Lepskiȋ (1990); Low (1992) when recovering a function in pointwise or in a uniform loss. 3) We were unable to obtain oracle inequalities in the spirit of the Goldenshluger-Lepski method, see Lepski, 2008, 2014;Goldenshluger et al., 2011), due to the non-Euclidean character of the support of f : our route goes along the more classical approach of Lepski et al. (1997). Obtaining oracle inequalities in this framework remains an open question.

Simultaneous smoothness and dimension adaptation
We know consider the case when the dimension d of M is unknown. We write again Σ d α,β = P C d,α , F β with C d,α = C d,α (τ, L) and F β = F β (r, a, b, R), and Σ d sd = P (C d , F) with C d = C d (τ ) and F = B(0, b). We work with fixed values of τ, L, r, a, b and R, and we impose this time that a > 0. For any 1 ≤ d < D and h > 0, we define where Λ and λ d have been introduced in Section 3.2. We then define the estimator We write K ( ) (d; ⋅) for the kernel K (d, ) (d, ⋅) defined by recursion in (8) starting from the kernel K(d; ⋅). We redefine all the quantities introduced before as now depending on d. Namely, for h, η > 0, and a given family of kernel where D 1 and D 2 are constants, unspecified at this stage. We also definê with H d (h) = {η ∈ H d , η ≤ h}. We assume that we have an estimatord of the dimension d of M with values in {1, . . . , D}. We require the following property Assumption 2. For every integer 1 ≤ d < D, all real numbers α, β ≥ 1, and for all P ∈ Σ d α,β (x), we have P d ≠ d ≲ n −3p up to a constant that depends on all the parameters of Σ d α,β (x).
There are various way to define such an estimator, see Farahmand et al. (2007) or even Kim et al. (2016) where an estimator with super-exponential minimax rate on a wide class of probability measures is constructed. For sake of completeness and simplicity, we will mildly adapt the work of Farahmand et al. (2007) to our setting.

The special case of one-dimensional submanifolds
We address the question whether the regularity α of M is a necessary component in the minimax rate of convergence for estimating f at a point x ∈ R D . In the specific case of one-dimensional manifolds, i.e. closed curves in an Euclidean space, we have an answer. We restrict our attention to the model Σ 1 β = P(C 1 (τ ), F β ) i.e. we consider the case d = 1. In this section, we do not have any restriction on the regularity of the submanifold or curve M , except for its reach that we assume bigger than τ > 0. We have the somehow surprising result: Theorem 5. For any β ≥ 1 and any p ≥ 1, there exists an estimatorf * explicitly constructed in (15) below such that, for n large enough (depending on p, τ, β, r and a), up to a constant that depends on , p, τ, a, b, R and that does not depend on x.
In the specific case of dimension 1, any M in C 1 (τ ) is a closed smooth injective curve that can be parametrised by a unit-speed path γ M ∶ [0, L M ] → R D with γ M (0) = γ M (L M ) and with L M = vol M being the length of the curve. In that case, the volume density function takes a trivial form.
Lemma 3. For M ∈ C 1 (τ ), for any x ∈ M and any v ∈ T x M , we have det g x (v) = 1.
Proof. Let γ ∶ [0, L M ] → M be a unit speed parametrization of M and extend γ to a smooth function on R by L M -periodicity. Suppose without loss of generality that γ(0) = x. For any t ∈ R, there is a canonical identification between T γ(t) M and R through the map v ↦ ⟨γ(t), v⟩. With such an identification, we can write that for s ∈ R ≃ T x M , exp x (s) = γ(s) because γ is unit-speed. We thus have d exp x (s)[h] = hγ(s) for any h ∈ T γ(s) M ≃ R. It follows that det g x (s) = d exp x (s)[1] 2 = γ(s) 2 = 1 and this completes the proof.
Thanks to Lemma 3, the estimator proposed by Pelletier (2005) takes a simpler form, which we will try to take advantage of. Indeed, in the representation (2) of Pelletier (2005), only d M remains unknown. We now show how to efficiently estimate d M thanks to the Isomap method as coined by Tenenbaum et al. (2000). The analysis of this algorithm essentially comes from Bernstein et al. (2000) and is pursued in Arias-Castro and Le Gouic (2019), but the bounds obtained there are manifold dependent. We thus propose a slight modification of their proofs in order to obtain uniform controls over C 1 (τ ), and make use of the simplifications coming from the dimension 1. Indeed, for d = 1, we have the following simple and explicit formula for the intrinsic distance on M : The Isomap method can be described as follows: let > 0, and letĜ (x) be the -neighbourhood graph built upon the data (X 1 , . . . , X n ) and x, where x is the point where we evaluate the density function. Any two vertices p, q inĜ (x) are adjacent if and only if p−q ≤ . For a path inĜ (x) (a sequence of adjacent vertices) s = (p 0 , . . . , p m ), we define its length as L s = p 1 −p 0 +⋅ ⋅ ⋅+ p m −p m−1 . The distance between two vertices p, q of the graph is then defined aŝ and we set this distance to ∞ if p and q are not connected. We are now ready to describe our estimatorsf * (x). For any h, > 0, we set for some K ∶ R → R. Notice that the kernel K ( ) (1; ⋅) ∶ R D → R defined in Section 3.5 only depends on the norm of its argument, so it can be put in the form K ( ) (1; y) = K ( ) (1; y ) with K ( ) (1; ⋅) denoting thus (with a slight abuse of notation) both functions starting from either R or R D .
Theorem 6. The estimatorf * h , defined above for using kernel K ( ) (1; ⋅) defined in Section 3.5, verifies that for any P ∈ Σ 1 β (x) with β ≤ , and any p ≥ 1, for n large enough (depending on p, τ and a) and h ≤ (τ ∧ r) 4, we have up to a constant that depends on , p, τ, b, R and that does not depend on x.

Numerical illustration
In this section we propose a few simulations to illustrate the results presented above. For the sake of visualisation, we focus on two typical examples of submanifold of R D , namely non-isometric embeddings of the flat circle T 1 = R Z and of the flat torus T 2 = T 1 × T 1 . In particular, these embeddings will be chosen in such way that their images, as submanifolds of R D , are not homogeneous compact Riemannian manifolds, so that the work of Kerkyacharian et al. (2012) for instance cannot be of use here.
For a given embedding Φ ∶ N → M ⊂ R D where N is either T 1 of T 2 , we construct absolutely continuous probabilities on M by pushing forward probability densities of N w.r.t. their volume measure. Indeed, if P = g ⋅ µ N , the push-forward measure Q = Φ * P has density f with respect to µ M given by where the determinant is taken in an orthonormal basis of T Φ −1 (x) N and T x M , so that, if Φ is chosen smooth enough, f has the same regularity as g. If Φ is an embedding of T 1 , we simply have det dΦ(y) = Φ ′ (y) for all y ∈ T 1 . If now Φ maps T 2 to M , we have det dΦ ( where (e 1 , e 2 ) is an orthonormal basis of R 2 ≃ T y T 2 .

An example of a density supported by a one-dimensional submanifold
Let s ≥ 1 be a regularity parameter. We define the following function on [0, 1] The function g s is positive and ∫ 1 0 g s (t)ds = 1; it defines a probability density over the unit interval. Also, because all derivatives of g at 0 and 1 coincide up to order s, but not at order s+1, the function g s also defines a probability density on T 1 that is s-times differentiable at 0, but not s + 1-times. See Figure 4 for a few plots of the functions g s . We next consider the parametric curve (2πωt), sin(2πt) + a sin(2πωt)) . Short computations show that Φ is indeed an embedding as soon as aω < 1, in which case M = Φ(T 1 ) is indeed a smooth compact submanifold of R 2 . For the rest of this section, we set a = 1 8 and ω = 6. See Figure 5 for a plot of M with these parameters. We are interested in estimating the density f s with respect to dµ M of the push-forward measure Φ * g s ⋅ µ T 1 , at point x = (1 + a, 0) ∈ M . We et P s = Φ * g s ⋅µ T 1 . We use formula (16) to compute f s (x): Our aim here is to provide an empirical measure for the convergence of the risk n ↦ R (p) n [f h , P ](x) when h is tuned optimally (in an oracle way). We pick p = 2. Our numerical procedure is detailed in Algorithm 1 below, and the numerical results are presented in Figure 6.

An example of a density supported by a two-dimensional submanifold
We consider a non-isometric embedding of the flat torus T 2 . We first construct a density function. For and integer s ≥ 1, define G s ∶ (t, u) ∈ [0, 1] 2 ↦ g s (t)g s (u) where g s is defined as in (18). Obviously, G s defines a density function on T 2 that is s times differentiable at (0, 0) ∈ T 2 , but not s + 1 times. See Figure 7 for a plot of G s .
Algorithm 1 MSE rate of convergence estimation 1: Fix arbitrary integers s ≥ 1 and ≥ s. 2: Set a grid of increasing number of points n = (n 1 , . . . , n k ) ∈ N k . 3: for n i ∈ n do 4: Sample n i points independently of P s (using an acceptance-rejection method),

7:
Repeat the three previous steps N times,

8:
Average the errors to get a Monte-Carlo approximationR n i [f h , P s ](x) 2 . 9: end for 10: Perform a linear regression on the curve log n i ↦ logR (2) n i [f h , P s ](x) 2 . 11: return The coefficient of the linear regression. Figure 6: Plot of the empirical mean square error (blue) for a density supported by a onedimensional submanifold case with parameters s = 2, = 4 (Left) and s = 10, = 1 (Right). We use a log-regular grid n of 20 points ranging from 100 to 10 4 . Each experiment is repeated N = 500 times. We next consider the parametric surface for some a, b, ω ∈ R. In the remaining of the section, we set a = 1 8, b = 3 and ω = 5. We show that Φ indeed defines an embedding. See Figure 8 for a plot of the submanifold M = Φ(T 2 ). For an integer s ≥ 1, we denote by F s the density of the push forward measure Q s = Φ * G s ⋅ µ T 2 with respect to the volume measure µ M . Let x = (b + 1, a, 0) be the image of 0 by Φ. Simple calculations show that the differential of Φ at 0 evaluated at e 1 = (1, 0) ∈ T 0 T 2 and e 2 = (0, 1) ∈ T 0 T 2 is equal to respectively dΦ(0)[e 1 ] = 2π(aω, 0, 1) and dΦ(0)[e 2 ] = 2π(0, b + 1, aω). Hence formula (17) yields det dΦ(0) = (2π) 2 1 + a 2 ω 2 (b + 1) 2 + a 2 ω 2 − a 2 ω 2 1 2 (20)  and we obtain In the same way as in the previous section, we aim at providing an empirical measure for the rate of convergence of the risk R (p) n [f h , P s ] when h is suitably tuned with respect to n and s. This is done using again Algorithm 1. The results are presented in Figure 9.

Adaptation
In this section we estimate a density when its regularity is unknown, contrary to the previous simulation where the regularity parameter s is pugged in the bandwidth choice n −1 (2s+d) . This is performed using Lepski's method presented in Section 3.4. The MSE rate is computed using Algorithm 1, for both the one-dimensional and the two-dimensional synthetic datasets.
For the adaptive estimation on the two-dimensionnal manifold, we observed that the corrective term det dΦ(0) computed in (20) made the resulting density F s (x) very small, while the function ψ Figure 9: Plot of the empirical mean square error (blue) for a 2-dimensional submanifold with parameters s = 2, = 4 (Left) and s = 4, = 4 (Right). We use a log-regular grid n of 20 points ranging from 10 3 to 10 5 . Each experiment is repeated N = 500 times. defined at (11) and used to tune the bandwidth soared dramatically because of the retained value of v d = 4 d ζ d K (d, ) 2 ∞ b, so that the values off h and ψ(h, ⋅) are not of the same order anymore at this scale (using maximum 10 6 observations). To circumvent this effect, we use a different kind of density g s,λ defined as for some λ ≥ 1 and with C s = 1 − 1 (2s + 2) − 1 (2s + 6). We plot the function g s,λ below in Figure 10 for some value of s and λ = 2. Again, g s,λ defines a probability density on T 1 that is s-times differentiable at 0 but not s + 1times. Like before, we consider the density G s,λ ∶ (t, u) ↦ g s,λ (t)g s,λ (u) on the torus T 2 and the push-forwarded probability measure Φ * G s,λ ⋅ µ T 2 which has density F s,λ with respect to µ M . For λ = 10, we find that F s,λ (x) ≃ 1 for most value of s, so that we set v = 1 in the definition of ψ for the sake of the estimation procedure. We have no theoretical guarantee that such a method work but we recover nonetheless the right rate in the estimation of the value of the density, see Figure 11. The numerical results are presented in Table 1.  Table 1: This table sums up the results of the simulation carried on in this section. In light blue is the expected rate for the mean square error. The column Linear regression contains the outcome of Algorithm 1 for the given set of parameters. We then compute the relative error between this coefficient and the expected rate in the last column.
6 Proofs 6.1 Proofs of Section 3.2 We set K h (x) = h −d K(x h) and start with bounding the variance of K h (X−x) when X is distributed according to P ∈ Σ sd . Let first observe that Lemma 4. If P ∈ Σ sd then, for all x ∈ M P and all h ≤ τ 2, with ζ d being the volume of the unit ball in R d .
Using Bernstein inequality (Boucheron et al., 2013, Thm. 2.10 p.37), for any P ∈ Σ sd , any x ∈ M P , and any t > 0, we infer where P is a short-hand notation for the distribution P ⊗n of the n-sample X 1 , . . . , X n taken under P . The bound (21) is the main ingredient needed to bound the L p -norm of the stochastic deviation off h .
Proof of Proposition 4. We denote by y + = max{y, 0} the positive part of a real number y. We start with The first term has the right order. For the second one, we make use of (21) to infer which ends the proof.
The proof of Lemma 2 partly relies on the following technical lemma.
Lemma 5. Let β ≥ 1 be a real number and let g ∶ R d → R verifying that g ∞ ≤ b and that the restriction of g to B d (0, r) is β-Hölder, meaning that for some R ≥ 0 with k = ⌈β − 1⌉ and δ = β − k. Then there exists a constant C (depending on β, r, b and R, and depending on β when r = ∞) such that, for all 1 ≤ j ≤ k, Proof. Let v ∈ B d (0, r 2). Since g is β-Hölder on B d (0, r), we know that there exists a function R v such that, for any z such that v + z ∈ B(0, r), we have Let h = (2bk! R) 1 β , and z 0 ∈ R d be unit-norm. Pick a 1 , . . . , a k ∈ (0, 1) all distincts and small enough such that ha k z 0 ∈ B(0, r 2) for all k (if r = ∞, then we can pick the a i independently from R, b and β). Introducing the vectors of R k we have Y = V X with V being the Vandermonde matrix associated with the real numbers (a 1 , . . . , a k ).
The former being invertible, we have X ≤ V −1 op Y and thus, for Substituing the value of h and noticing that the former inequality holds for every unit-norm vector z 0 , we can conclude.
Proof of Lemma 2. We set B h = B D (x, h). Since τ 4 is smaller than the injectivity radius of exp x (see Proposition 2) we can write . We can thus write the following expansion, valid for all v ∈ exp −1 x B h and all w ∈ T x M , with C 1 depending on α, τ and L, C 2 depending on β, r, b and R (see Lemma 5), and C 3 depending on K. Since now we know that g with C 4 depending on α, τ and L. Making the change of variable v = hw in (22), we get with G j corresponding to the integration of the j-th order terms in the expansion around 0 of the In particular G j can be written as a sum of terms of the type where ψ and φ are monomials in w satisfying m deg φ + deg φ = j, with coefficients bounded by constants depending on α, τ, L, β, r, b and R (again, use Lemma 5 to bound the derivatives). Since now B TxM (0, 1) ⊂ 1 h exp −1 x B h , and since d j K is zero outside of B(0, 1), we have that G j can actually be written G j (h, P, x) = h j G(P, x) with G(P, x) ≤ C for some C depending on K, α, τ, L, β, r, b and R. Similar reasoning leads to R h (P, x) ≤ Ch γ with C depending again on K, α, τ, L, β, r, b and R. To conclude, it remains to compute G 0 (P, x). Looking at the zero-th order terms in the expansions (23) to (26), we find that where we used Assumption 1. The proof of Lemma 2 is complete.
Proof of Proposition 5. For a positive integer ≥ 1, let f ( ) h (P, x) be the mean of the estimator f h (x) computed using K ( ) . Let γ = α ∧ β and k = ⌈γ − 1⌉. We recursively prove on 1 ≤ < ∞ the following identity where R ( ) h (P, x) ≤ C ( ) h γ for some constant C ( ) depending on τ, , L, R, b, β and r. The initialisation step = 1 has been proven in Lemma 2. Let now 1 ≤ ≤ k. By linearity of f h (P, x) with respect to K, we have f Since 2 −1 h ≤ h, we can use our induction hypothesis (27) and find We conclude noticing that 2 1−j − 1 = 0 for j = , and setting G ending the induction by setting C ( +1) = 3C ( ) . When ≥ k + 1, the induction step is trivial.

Proofs of Section 3.3
We go along a classical line of arguments, thanks to a Bayesian two-point inequality by means of Le Cam's lemma (Yu, 1997, Lem. 1), restated here in our context. For two probability measures P 1 , P 2 , we write TV(P 1 , P 2 ) = sup A P 1 (A) − P 2 (A) for their variational distance and H 2 (P 1 , P 2 ) = ∫ √ dP 1 − √ dP 2 2 for their (squared) Hellinger distance.
Proof of Theorem 2. Suppose without loss of generality that x = 0 and consider a smooth submanifold M of R d+1 ⊂ R D that contains the disk B d (0, 1) ⊂ R d with reach is greater than τ , see Figure 12 for a diagram of such an M . By smoothness and compacity of M , there exists L * (depending on τ ) such that M ∈ C d,α (τ, L * ). Let P be the uniform probability measure over M , with density f ∶ x ↦ 1 vol M . We have P ∈ Σ α,β (0) as long as L * ≤ L and a ≤ 1 vol M ≤ b an assumption we make from now on. For 0 We pick G such that f δ ∈ F β for small enough δ, depending on r and τ . Such a G can be chosen to depend on R only. For δ small enough (depending on r and τ ), we thus have P δ ∈ Σ α,β (0) as well. By Lemma 6, we infer inf so that it remains to compute H 2 (P, P δ ). We have the following bound with C = vol M ∫ B(0,1) G(x) 2 dx depending on τ and R only. Taking γ = (1 (C ∨ 1)n) 1 (2β+d) we obtain, for large enough n (depending on τ ) with C * = (C ∨ 1) −1 2 depending on τ and R.
6.3 Proofs of Section 3.4 Lemma 7. For any p ≥ 1, P ∈ Σ sd (x), and D 2 > p, we have up to a constant depending on p, D 1 and D 2 , with Proof. We fix P ∈ Σ α,β (x) and writeĥ and h * forĥ(x) and h * (P, x) respectively. Let We start with bounding R A . Firstly, Next, by definition ofĥ and A, we have By definition of h * , we also have f h * (P, x) − f (x) ≤ D 1 v(h * )λ(h * ). Finally, using Proposition 4 holds as well. Putting all three inequalities together yields We now turn to R A c . Notice that for any h ∈ H(h * ), we have We can thus write whereξ 2h,η (P, x) =ξ 2h (P, x)−ξ η (P, x), and where we used the triangle inequality and the definition of h * . Now, we have 2D 1 v(h * )λ(h * ) ≤ 2D 1 v(2h)λ(2h) since 2h ≤ h * and by definition of ψ(2h, η), For (30) we use the fact that λ(η) ≥ 1 and Bernstein's inequality on the random variableξ 2h,η (P, x) for (31). Noticing now that λ(η) 2 ≥ dD 2 log(h + η), we further obtain For any h ∈ H, h < h * , we thus get the following bound, using Cauchy-Schwarz inequality We plan to sum over h ∈H(h * ) the RHS of (32). Notice first that

Moreover, for any
for any h ≤ h * . This enables us to bound the following sum where we used that D 2 > p. Putting all these estimates together, using that h * h + ≤ 1 and λ(h * ) ≥ 1, we eventually obtain In conclusion (R Proof of Theorem 3. Let P ∈ Σ α,β (x) and leth = (ω log n n) 1 (2γ+d) with γ = α ∧ β and for some constant ω to be specified later. By Proposition 5 we know that for n large enough (depending on ω, α, β, d) such thath ≤ (r ∧ τ ) 4, we have f η (P, x) − f (x) ≤ C 1 η γ for all η ≤h with C 1 depending on K, , α, τ, L, β, r, b and R. Moreover, we also have for n large enough (depending on ω), and thereforeh ≤ h * (P, x). By Lemma 7 this implies where C 2 depends on p, D 1 and D 2 . But using that bothh ≥ h − and λ(h) 2 = dD 2 log(1 h ) for n large enough (depending on ω, d, K and D 2 ), we also obtain This last estimate yields for n large enough depending on ω, α, β, d, K and D 2 , which completes the proof.
6.4 Proofs of Section 3.5 Proof of Proposition 6. Let P ∈ Σ d α,β (x) and η > 0. Assume thatP η > 0. We have We first consider the determinist term. For η ≤ τ 2, we have, writing r η = ξ(η τ )η and using Lemma 1, Using the fact that f ∈ F β , and by Lemma 5, we know that for η ≤ r 4, there exists R 1 > 0 (depending on β, r, b and R such that We thus obtain Using these two inequalities, and the fact that r η η → 1 as η → 0, we find that P 2η (2 d P η ) − 1 ≲ η up to a constant that depends on R 1 , τ and a, for η small enough (depending on R 1 , τ and a as well). For the other terms, a simple use of Hoeffding's inequality yields for any η, > 0, P P η − P η > ≤ 2 exp(−2n 2 ).
Proof of Theorem 4. By triangle inequality, for any P ∈ Σ d α,β (x), we write The first term in the right-hand side has the right order thanks to Theorem 3. For the second one, Cauchy-Schwarz inequality yields .
Using that f (x) ≤ b and up to a constant that depend on D, K and , we infer R (p) n fĥ (d, ⋅), P (x) ≲ log n n α∧β (2α∧β+d) + n −3 2 (n + b) for n large enough depending on p, D 1 , D 2 , K, , α, τ, L, β, r, b and R, so that the result indeed holds up to a constant depending on p, D, D 1 , D 2 , K, , α, τ, L, β, r, b and R.

Proofs of Section 4
We writeV = {X 1 , . . . , X n }∪{x} for the vertices ofĜ (x) andη = sup p∈M d(p,V ). For small enougĥ η we have thatĜ (x) is connected, therefore the distanced is well-defined onV . We have in that case a good reverse control of d M byd , as shown in the next two lemmata.
Proof of Lemma 8. We can suppose without loss of generality that the shortest path between p and q is given by γ ∶ [0, ] → R D with = d M (p, q) ≤ L M 2. We let δ = (4⌊ ⌋) and N = 4⌊ ⌋. Notice that 4 ≤ δ ≤ 2. Let us define p j = γ(jδ), so that p 0 = p and p N = q. Sinceη ≤ 16, for every 1 ≤ j ≤ N − 1, there exists among our verticesV a point denoted byp j such that p j −p j ≤ 16. We sett j ∈ [0, L M ] for its coordinate, namelyp j = γ(t j ).
Proof of Lemma 9. Following the proof of (Arias-Castro and Le Gouic, 2019, Lem. 5) if there exists δ > 0 such that x − y ≤ δ implies d M (x, y) ≤ πτ for all x, y ∈ M , then we must have that for any x, y ∈ M verifying x − y ≤ δ, d M (x, y) ≤ x − y 1 + π 2 48τ 2 x − y 2 .
In view of Lemma 8 and Lemma 9, we want to tune so that it is the smallest possible and so that 16η ≤ holds with high probability. This is achieved for of order log n n.
Proof. Let δ > 0, and let N = ⌊L M δ⌋. We split [0, L M ] into N intervals I 1 , . . . , I N of length L M N . We denote A the event for which each I j contains at least one coordinate among those of the sample of observations (X 1 , . . . , X n ). On A, we haveη ≤ L M N ≤ 2δ. Moreover, P(A) = 1 − P (∃j, γ(I j ) contains no observation) Using that N ≤ L M δ and that L M ≤ 1 a we infer P (η ≤ 2δ) ≥ 1 − 1 aδ (1 − aδ) n ≥ 1 − e −aδn aδ .
Proof of Theorem 6. We write K ( ) in short for K ( ) (1; ⋅). Let A = {16η ≤ }. By triangle inequality, R On A, we have, for n large enough (depending on p, a and τ ) such that ≤ τ 2 holds, d (X i , x) − d M (X i , x) ≤ C 1 2 with C 1 depending on τ only. This is infered by Lemmas 8 and 9. We deduce that, on this event, It follows that with B * h andξ * h denoting the bias and stochastic deviation of estimatorĝ * h (x). Following the same arguments as in proof of Proposition 4, we have E[ ξ * h (x) p ] ≤ c p v(h) p with c p depending only on p. For the bias term, as soon as h ≤ (τ ∧ r) 4, we have Since now f ○ exp x ∈ F β with ≥ β, we know that all the term in the development of B * h (P, x) up to order ⌈β − 1⌉ cancels. We deduce B * h (x) ≤ C 2 h β with C 2 depending on and R only. For the other term R A c , we write ĝ * h (x) − f (x) ≤ K ( ) ∞ h + b, so that, according to Lemma 10, with C 3 depending on and b. Putting all these estimates together yields the result.