Adaptive warped kernel estimation for nonparametric regression with circular responses

In this paper, we deal with nonparametric regression for circular data, meaning that observations are represented by points lying on the unit circle. We propose a kernel estimation procedure with data-driven selection of the bandwidth parameter. For this purpose, we use a warping strategy combined with a Goldenshluger-Lepski type estimator. To study the optimality of our methodology, we consider the minimax setting and prove, by establishing upper and lower bounds, that our procedure is nearly optimal on anisotropic Hölder classes of functions for pointwise estimation. The obtained rates also reveal the specific nature of regression for circular responses. Finally, a numerical study is conducted, illustrating the good performance of our approach.


Introduction
Directional statistics is the branch of statistics which deals with observations that are directions. In this paper, we will consider more specifically circular data, which arise whenever a periodic scale is used to measure observations. These data are represented by points lying on the unit circle of R^2, denoted in the sequel by S^1. Circular data are collected in many research fields, for example in ecology (animal orientations), earth sciences (wind and ocean current directions, cross-bed orientations, to name a few), medicine (circadian rhythm), forensics (crime incidence) or social science (clock or calendar effects). Various comprehensive surveys on statistical methods for circular data can be found in Mardia and Jupp [18], Jammalamadaka and SenGupta [12], Ley and Verdebout [17], and recent advances are collected in Pewsey and García-Portugués [21]. Note that the term circular data is also used to distinguish them from data supported on the real line R (or some subset of it), which henceforth are referred to as linear data.
In the present work, we focus on a nonparametric regression model with a circular response and linear predictor. Assume that we have an independent identically distributed (i.i.d. in the sequel) sample {(X_i, Θ_i)}_{i=1}^n distributed as (X, Θ), where Θ is a circular random variable, i.e. Θ ∈ S^1, and X is a random variable with density f_X supported on R. We assume that the cumulative distribution function of X, denoted F_X, is known. We also assume that F_X is invertible on R, meaning that f_X is positive on R and F_X(R) = (0, 1). We aim at estimating a function m which captures the dependence structure between the predictors X_i and the observations Θ_i. For our setting, described in Section 2.1, the regression function m is derived in Equation (3).
Regression with a circular response and linear covariates has first and mostly been explored from a parametric point of view. Pioneering contributions are due to Gould [11], Johnson and Wehrly [13] or Fisher and Lee [8]. The latter proposed the most popular link function (namely 2 arctan) to model the conditional mean. Major difficulties of such link-based models involve, among others, computational drawbacks in estimating the parameters, as identified by Presnell et al. [24], who in turn suggested a spherically projected multivariate linear model as an alternative. Since then, numerous parametric approaches have been proposed; we refer the reader to the references in Pewsey and García-Portugués [21]. In order to get a more flexible approach, the nonparametric paradigm has been considered, first in the pioneering work by Di Marzio et al. [19] and more recently in Meilán-Vila et al. [20] for the multivariate setting. Surprisingly enough, the nonparametric point of view has only been considered in very few papers. Note that, contrary to all the works aforementioned, which classically focus on the conditional mean (which is our goal as well), Alonso-Pena and Crujeiras [1] proposed a nonparametric multimodal regression method for estimating the conditional density when, for instance, the latter is highly skewed or multimodal. Estimation procedures developed in [19] or [20] consist in estimating the arctangent function of the ratio of the trigonometric moments of Θ (more details about this approach are given in the next section, as it is the starting point of our procedure). More precisely, in the case of pointwise estimation and covariates supported on [0, 1], Di Marzio et al.
[19] investigated the performances of Nadaraya-Watson and local linear polynomial estimators. Theoretically, for regression functions that are twice continuously differentiable, they obtained expressions for the asymptotic bias and variance. Their proofs are based on a linearization of the arctangent function using Taylor expansions, but no sharp controls of the remainder terms in the expansions are obtained; actually, obtaining such controls would be very tedious with an approach based on Taylor expansions. As for the more recent work of Meilán-Vila et al. [20], they studied the multivariate setting [0, 1]^d with the same estimators and proof techniques. In both papers, neither rates of convergence nor adaptation are obtained, and cross-validation is used to select the kernel bandwidth in practice. By adaptation, we mean that the estimators do not require the specification of the regularity of the regression function, which is crucial from a practical point of view. In view of this, we were motivated to fill the gap in the literature. Our goal is twofold: obtaining optimal rates of convergence for predictors supported on R and adaptation for estimating the regression function m.
To achieve this, we propose a new strategy based on concentration inequalities along with warping methods.
Our contributions. Under the assumption that the cumulative distribution function (c.d.f.) of the design X is known and invertible, the warping method used in this paper consists in introducing the auxiliary function g := m ∘ F_X^{-1}, with F_X^{-1} the inverse of F_X. We then use classical kernel rules to estimate the function g in the specific framework of circular data. Our procedure requires the selection of two bandwidths. Fully data-driven selection of the bandwidths is performed by using a Goldenshluger-Lepski type procedure [10]. Then, theoretical performances are studied. We consider the minimax setting and prove, by establishing upper and lower bounds, that our procedure is nearly optimal on anisotropic Hölder classes of functions for pointwise estimation. These results are stated in Theorems 3.6 and 3.11, respectively. Then, we conduct a numerical study whose goal is twofold. We first investigate the best tuning parameters of our procedure. Once tuned, our estimates are used on artificial data and compared to other classical methods. The numerical study reveals the good performances of our methodology.
Plan. In Section 2, we explain how to take into account the circular nature of the response and then propose our data-driven kernel estimator of the regression function m, based on the warping strategy and the Goldenshluger-Lepski bandwidth selection rule. Section 3 contains the theoretical results. Section 4 presents numerical results, including simulations. Finally, all the proofs are deferred to Section 5.
Notations. It is necessary to equip the reader with some notations. In the sequel, a point on S^1 will not be represented as a two-dimensional vector w = (w_2, w_1) with unit Euclidean norm but as an angle θ = atan2(w_1, w_2), defined as follows.

Definition 1.1. The function atan2 : R^2 \ {(0, 0)} → (−π, π] is defined, for any (w_1, w_2) ≠ (0, 0), by

atan2(w_1, w_2) = arctan(w_1/w_2) if w_2 > 0;  arctan(w_1/w_2) + π if w_2 < 0 and w_1 ≥ 0;  arctan(w_1/w_2) − π if w_2 < 0 and w_1 < 0;  π/2 if w_2 = 0 and w_1 > 0;  −π/2 if w_2 = 0 and w_1 < 0.

In this definition, one has arbitrarily fixed the origin of S^1 at (1, 0) and uses the anti-clockwise direction as positive. Thus, a circular random variable can be represented as an angle over [−π, π). Observe that atan2(0, 0) is not defined. Hereafter, ‖·‖_{L^1(R)} and ‖·‖_{L^2(R)} respectively denote the L^1 and L^2 norms on R with respect to the Lebesgue measure:

‖f‖_{L^1(R)} := ∫_R |f(y)| dy,   ‖f‖²_{L^2(R)} := ∫_R f(y)² dy.

The L^∞ norm is defined by ‖f‖_∞ = sup_{y∈R} |f(y)|. Moreover, we denote by * the classical convolution product, defined for functions f, g by f * g(x) := ∫_R f(x − y) g(y) dy, for x ∈ R. Finally, for α ∈ R, [α]_+ := max{α; 0}, and for β > 0, ⌊β⌋ denotes the largest integer strictly smaller than β.
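The convention of Definition 1.1 matches the two-argument arctangent available in standard libraries, which takes the sine component w_1 first and the cosine component w_2 second. A minimal illustrative sketch:

```python
import math

def angle_of(w2, w1):
    """Angle in (-pi, pi] of the point (w2, w1) on the unit circle,
    with the origin of S^1 fixed at (1, 0) and anti-clockwise positive;
    math.atan2 takes the sine component w1 first, as in Definition 1.1."""
    return math.atan2(w1, w2)

# the four cardinal points of S^1
print(angle_of(1.0, 0.0))   # 0.0
print(angle_of(0.0, 1.0))   # pi/2 = 1.5707963...
print(angle_of(-1.0, 0.0))  # pi   = 3.1415926...
print(angle_of(0.0, -1.0))  # -pi/2
```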

The estimation procedure
After recalling the framework of circular data in Section 2.1, Section 2.2 is devoted to the construction of an estimator of m(x), at a given point x ∈ R which will be fixed throughout this paper, using warped kernel methods. Then, Section 2.3 presents a data-driven procedure for bandwidth selection using the Goldenshluger-Lepski methodology.

The framework of circular data
There is no doubt that, due to their periodic nature, circular data are fundamentally different from linear ones, and thus need specific tools. To measure the closeness between two angles θ_1 and θ_2, we do not consider the natural distance but we focus on d_c, with

d_c(θ_1, θ_2) := 1 − cos(θ_1 − θ_2),

which is extensively used in the literature of directional statistics (see for instance Section 2 in the seminal monograph by Mardia and Jupp [18], Section 3.2.1 of [17], [19] or [20]). Note that the divergence d_c corresponds (up to a factor 2) to the usual squared Euclidean norm in R^2. Indeed, the angles θ_1 and θ_2 determine the corresponding points (cos θ_1, sin θ_1) and (cos θ_2, sin θ_2) respectively on the unit circle S^1. Then, the usual squared Euclidean norm in R^2 reads

‖(cos θ_1, sin θ_1) − (cos θ_2, sin θ_2)‖² = 2(1 − cos(θ_1 − θ_2)) = 2 d_c(θ_1, θ_2).

Hence, √d_c is a distance on [−π, π) and we naturally look for a measurable function m such that

E[d_c(Θ, m(X))] = min_f E[d_c(Θ, f(X))],   (1)

where the minimum is taken over [−π, π)-valued functions f that are measurable with respect to the σ-algebra generated by X. It is interesting to notice that the minimization problem (1) is directly linked to the definition of the Fréchet mean on the circle (see Charlier [6]). Furthermore, in the literature of directional statistics, the problem of finding such a regression function m(X) as defined in (1) has already been considered to solve the circular regression problem (see [9] and [20]). Now let us work conditionally on X. For x ∈ R, let

m_1(x) := E[sin Θ | X = x],   m_2(x) := E[cos Θ | X = x].   (2)

Moreover, write, for an arbitrary function f : R → [−π, π) and some angle γ(x),

E[d_c(Θ, f(x)) | X = x] = 1 − m_1(x) sin f(x) − m_2(x) cos f(x) = 1 − √(m_1²(x) + m_2²(x)) cos(f(x) − γ(x)).

Observe that γ(x) = atan2(m_1(x), m_2(x)). Thus, we have

min_f E[d_c(Θ, f(x)) | X = x] = 1 − √(m_1²(x) + m_2²(x)).

Finally, the minimum in the minimization problem (1) is achieved for f = γ. In conclusion, the circular nature of the response is taken into account by the arctangent of the ratio of the conditional expectations of the sine and cosine components of Θ given X, and we tackle the problem by estimating the function

m(x) := atan2(m_1(x), m_2(x)),   (3)

with m_1 and m_2 defined in (2).
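The minimization above can be checked numerically: for a finite sample of angles, the minimizer of f ↦ (1/n) Σ_i d_c(θ_i, f) coincides with atan2 of the empirical sine and cosine means. A small sketch, with d_c taken as 1 − cos(θ_1 − θ_2) (consistent with the squared Euclidean norm computation above, up to the factor 2) and an arbitrary sample of angles:

```python
import math

def d_c(t1, t2):
    # circular divergence d_c(t1, t2) = 1 - cos(t1 - t2)
    return 1.0 - math.cos(t1 - t2)

thetas = [0.1, 0.4, -0.3, 2.0, 1.2]   # arbitrary sample of angles in [-pi, pi)
n = len(thetas)
m1 = sum(math.sin(t) for t in thetas) / n   # empirical sine mean
m2 = sum(math.cos(t) for t in thetas) / n   # empirical cosine mean
gamma = math.atan2(m1, m2)                  # closed-form minimiser

def risk(f):
    return sum(d_c(t, f) for t in thetas) / n

# brute-force check on a fine grid of candidate angles
grid = [-math.pi + 2 * math.pi * k / 20000 for k in range(20000)]
best = min(grid, key=risk)
print(abs(best - gamma) < 1e-3)  # True
```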
Remark 2.1. Observe that if m_1(x) = m_2(x) = 0, then m(x) is not defined. This occurs if and only if

φ_1(f(·|x)) := ∫_{−π}^{π} e^{iθ} f(θ|x) dθ = 0,

where f(·|x) denotes the conditional density of Θ given X = x. Note that φ_1(f(·|x)), the first trigonometric moment of this conditional density, plays a specific role in the literature of directional statistics. See for instance Section 3.4.2 of [18].
In the sequel, we estimate the circular regression function m as defined in (3) under the condition

m_1²(x) + m_2²(x) > 0.   (4)

We set ζ := (ζ_i)_{i=1,...,n} the vector of errors, so that

Θ_i = (m(X_i) + ζ_i) mod 2π,  i = 1, . . . , n.   (5)

Our estimation methodology is based on a warping strategy.

Warping strategy
The popular Nadaraya-Watson (NW) methodology provides a natural estimator of m of the form

m̃_h(x) = atan2(m̃_{1,h}(x), m̃_{2,h}(x)),  with  m̃_{1,h}(x) = Σ_{i=1}^n sin(Θ_i) K_h(x − X_i) / Σ_{i=1}^n K_h(x − X_i)  and  m̃_{2,h}(x) defined analogously with cosines,

for some bandwidth h > 0. However, on the one hand, the denominator, which can be small, may lead to some instability. On the other hand, as adaptive estimation requires the data-driven selection of the bandwidth, the ratio form of the NW estimate indicates that we should select two bandwidths: one for the numerator and one for the denominator. Consequently, considering NW estimators for m_1 and m_2 involves four bandwidths. This makes the study of these estimators quite intricate.
Recalling that g := m ∘ F_X^{-1}, with F_X^{-1} the inverse of F_X, warping methods then boil down to first estimating g, by ĝ say, and then estimating the regression function of interest m by ĝ ∘ F_X. To deal with regression with random design, the warping strategy has been applied for instance by Kerkyacharian and Picard [14], Pham Ngoc [22], Chagny [3] and Chagny et al. [5]. Among the advantages of this method, let us mention that a warped kernel estimator does not involve a ratio, which strengthens its stability whatever the design distribution, even when the design is inhomogeneous. In our framework, in order to construct an estimator for the regression function m, we first estimate m_1 and m_2 (see (3)). Consequently, we introduce two auxiliary functions g_1, g_2 : (0, 1) → R defined by

g_1(u) := m_1(F_X^{-1}(u)),   g_2(u) := m_2(F_X^{-1}(u)),

so that m_1 = g_1 ∘ F_X and m_2 = g_2 ∘ F_X; we then have, for u ∈ (0, 1),

g(u) = atan2(g_1(u), g_2(u)).

Our fully data-driven approach is based on the selection of two bandwidths that adapt automatically to the unknown smoothness of the functions g_1 and g_2. Now, we propose to adapt the strategy developed in the linear case by Chagny et al. in [5]. The warping device is based on the transformation F_X(X_i) of the data X_i, i = 1, . . . , n. We first define the kernels considered in our framework: a kernel is an integrable function K : R → R such that ∫_R K(y) dy = 1, and we set K_h(·) := K(·/h)/h for h > 0.
Then, for u ∈ (0, 1), we estimate g_1(u) and g_2(u) by

ĝ_{1,h_1}(u) := (1/n) Σ_{i=1}^n sin(Θ_i) K_{h_1}(u − F_X(X_i))  and  ĝ_{2,h_2}(u) := (1/n) Σ_{i=1}^n cos(Θ_i) K_{h_2}(u − F_X(X_i)),   (6)

respectively, where h_1, h_2 > 0 are the bandwidths of the kernels K_{h_1}(·) and K_{h_2}(·). Thus, we estimate g by

ĝ_h(u) := atan2(ĝ_{1,h_1}(u), ĝ_{2,h_2}(u)),

where we denote h := (h_1, h_2). Moreover, as a consequence, for x ∈ R, the estimators of m_1 and m_2 are m̂_{1,h_1}(x) := ĝ_{1,h_1}(F_X(x)) and m̂_{2,h_2}(x) := ĝ_{2,h_2}(F_X(x)). Using m = g ∘ F_X = atan2(m_1, m_2), we then obtain an estimator of m(x) at x ∈ R by setting

m̂_h(x) := atan2(m̂_{1,h_1}(x), m̂_{2,h_2}(x)) = ĝ_h(F_X(x)).
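The construction above can be sketched in a few lines. The following is a minimal illustration under illustrative assumptions (an Epanechnikov kernel, a toy regression function m(x) = 2 arctan(x) with Gaussian design and small Gaussian noise, none of which are the paper's simulation models):

```python
import math, random

def K(u):  # Epanechnikov kernel
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def warped_estimate(x, X, Theta, F_X, h1, h2):
    """m_hat_h(x) = atan2(g1_hat(u_x), g2_hat(u_x)) with u_x = F_X(x),
    following the warped kernel construction (6); F_X is the known design c.d.f."""
    n = len(X)
    ux = F_X(x)
    g1 = sum(math.sin(t) * K((ux - F_X(xi)) / h1) for xi, t in zip(X, Theta)) / (n * h1)
    g2 = sum(math.cos(t) * K((ux - F_X(xi)) / h2) for xi, t in zip(X, Theta)) / (n * h2)
    return math.atan2(g1, g2)

# toy model (illustrative only): Theta_i = (m(X_i) + noise) mod 2*pi
random.seed(0)
m = lambda x: 2.0 * math.atan(x)
F = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal c.d.f.
X = [random.gauss(0.0, 1.0) for _ in range(2000)]
Theta = [((m(xi) + random.gauss(0.0, 0.1) + math.pi) % (2.0 * math.pi)) - math.pi for xi in X]
est = warped_estimate(0.5, X, Theta, F, h1=0.1, h2=0.1)
print(abs(est - m(0.5)))  # small: the estimate is close to m(0.5) = 2*arctan(0.5)
```

Note that no ratio of random quantities appears: each component is a plain kernel average of the warped data, which is the stability advantage mentioned above.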

Bandwidth selection
We study the pointwise risk of the estimator m̂_h(x) associated with the divergence d_c. The expression of the risk is then

E[d_c(m̂_h(x), m(x))].

We first focus on the estimator ĝ_h of g, by studying the adaptive choice of bandwidths belonging to a convenient grid H_n. To define the latter, we assume that the kernel K satisfies supp(K) ⊆ [−A, A] for some A > 0, and we take h_max a constant such that F_X(x) − A·h_max > 0 and F_X(x) + A·h_max < 1. Then, we set the grid H_n of admissible bandwidths accordingly (see (10)).

Nguyen et al./Adaptive warped kernel estimation for nonparametric circular regression

Remark 2.3. Observe that the condition on h_max is satisfied for n large enough if h_max depends on n and goes to 0 (even slowly) when n → +∞.
We have Card(H_n) ≲ n/log n. In the sequel, we apply the method proposed by Goldenshluger and Lepski in [10] to select an optimal value for the bandwidths h_1 and h_2 automatically. Let j ∈ {1, 2}. For h_j ∈ H_n and v ∈ (0, 1), we set

A_j(h_j, v) := sup_{h'_j ∈ H_n} [ (ĝ_{j,h_j,h'_j}(v) − ĝ_{j,h'_j}(v))² − V_j(n, h'_j) ]_+ ,   (11)

with V_j(n, h_j) := c_{0,j} log n/(n h_j), c_{0,j} > 0 a tuning parameter, and ĝ_{j,h_j,h'_j} := K_{h'_j} * ĝ_{j,h_j}, so that ĝ_{j,h_j,h'_j}(v) = ĝ_{j,h'_j,h_j}(v). Then, a data-driven choice of the bandwidth h_j is performed as follows:

ĥ_j := argmin_{h_j ∈ H_n} { A_j(h_j, F_X(x)) + V_j(n, h_j) }.   (12)

Observe that our bandwidth selection rule depends on x. The criterion (12) is inspired from [10], in order to mimic the optimal "bias-variance" trade-off in the pointwise quadratic decomposition. It is common to use V_j(n, h_j) to provide an upper bound for the variance term V(h_j, v) (see Section 5.2), whereas the more involved task of the Goldenshluger-Lepski method is to provide an estimate of the bias term by comparing pair-by-pair several estimators. In our framework, the bias term corresponds to K_{h_j} * g_j(v) − g_j(v) (see (23)), so it is natural to estimate it by an estimator of the form K_{h_j} * ĝ_{j,h'_j}(v) − ĝ_{j,h'_j}(v). Thus, the estimator of the bias term is A_j(h_j, v), defined in (11), where the second term V_j(n, h'_j) controls the fluctuations of the first term. Now, we define the kernel estimator of g(v) with data-driven bandwidths as follows:

ĝ_ĥ(v) := atan2(ĝ_{1,ĥ_1}(v), ĝ_{2,ĥ_2}(v)),   (13)

where we denote ĥ := (ĥ_1, ĥ_2). We finally define the adaptive estimator of m(x) by

m̂_ĥ(x) := ĝ_ĥ(F_X(x)).   (14)
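The pair-by-pair comparison behind the Goldenshluger-Lepski rule can be sketched for a single component (here ĝ_1); the exact form of A_j and V_j is the one given in (11)-(12), but the grid, constants and toy data below are illustrative, and the double convolution K_h * K_{h'} is computed by numerical integration:

```python
import math, random

def K(u):  # Epanechnikov kernel
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def conv_K(u, h, hp, res=100):
    # numerical value of (K_h * K_hp)(u); symmetric in h and hp
    step = 2.0 * hp / res
    return sum(K((u - t) / h) / h * K(t / hp) / hp
               for t in (-hp + (k + 0.5) * step for k in range(res))) * step

def gl_select(ux, U, S, H, c0):
    """Goldenshluger-Lepski choice of h for one component,
    g_hat(u, h) = (1/(n h)) sum_i S_i K((u - U_i)/h),
    with A(h) = sup_{h'} [(g_hat_{h,h'} - g_hat_{h'})^2 - V(h')]_+ and argmin A + V."""
    n = len(U)
    V = lambda h: c0 * math.log(n) / (n * h)
    g = lambda h: sum(s * K((ux - u) / h) for u, s in zip(U, S)) / (n * h)
    gg = lambda h, hp: sum(s * conv_K(ux - u, h, hp) for u, s in zip(U, S)) / n
    A = lambda h: max(max((gg(h, hp) - g(hp)) ** 2 - V(hp), 0.0) for hp in H)
    return min(H, key=lambda h: A(h) + V(h))

# toy warped data: U_i = F_X(X_i) uniform on (0, 1), S_i = sin(Theta_i)
random.seed(1)
n = 300
U = [random.random() for _ in range(n)]
S = [math.sin(2.0 * math.atan(u) + random.gauss(0.0, 0.2)) for u in U]
H = [0.05, 0.1, 0.2, 0.4]
h_hat = gl_select(0.5, U, S, H, c0=0.04)
print(h_hat in H)  # True
```

The tuning constant c0 = 0.04 anticipates the value retained in the numerical study of Section 4.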

Minimax rates of convergence
The minimax approach is a framework that shows the optimality of an estimate among all possible estimates. The minimax pointwise quadratic risk of the estimator ĝ_ĥ = atan2(ĝ_{1,ĥ_1}, ĝ_{2,ĥ_2}) will be derived from the following control of the pointwise quadratic risks of ĝ_{1,ĥ_1} and ĝ_{2,ĥ_2}.
Proposition 3.1. Consider the collection of bandwidths H_n defined in (10). Let j ∈ {1, 2} and q ≥ 1, and assume that min{c_{0,1}; c_{0,2}} ≥ 16(2 + q)²(1 + ‖K‖_{L^1(R)})². Then, with probability larger than 1 − 4n^{−q},

|ĝ_{j,ĥ_j}(u_x) − g_j(u_x)| ≤ C inf_{h_j ∈ H_n} { ‖g_j − K_{h_j} * g_j‖_∞ + √(V_j(n, h_j)) },

for a constant C > 0, where u_x := F_X(x). The proof of Proposition 3.1 is given in Section 5.4.1. Roughly speaking, in view of the results of Section 5.1, the right-hand side of the inequality stated in Proposition 3.1 may be viewed as the bias-variance decomposition of the pointwise quadratic risk of the best warped kernel estimate, up to a logarithmic term.
Remark 3.2. Examining the proof of Proposition 3.1, the "uniform bias" comes from Inequality (33). This control can be refined, and the term ‖g_j − K_{h_j} * g_j‖_∞ can be replaced by sup_{u : |u − u_x| ≤ 2A·h_max} |g_j(u) − K_{h_j} * g_j(u)|. Observe that the size of this neighborhood of u_x goes to 0 if h_max → 0.
Since the function atan2(w_1, w_2) is undefined when w_1 = w_2 = 0, it is reasonable to consider the following assumption.

Assumption 3.3. g_1²(u_x) + g_2²(u_x) > 0.

In the minimax setting, we need some assumptions on the regularity of g_1 and g_2. Thus, we introduce the following Hölder classes, which are adapted to local estimation.

Definition 3.4. Let β > 0 and L > 0. The Hölder class H(β, L) is the set of functions f : (0, 1) → R such that f admits derivatives up to the order ⌊β⌋ and, for any (y, y') ∈ (0, 1)²,

|f^{(⌊β⌋)}(y) − f^{(⌊β⌋)}(y')| ≤ L |y − y'|^{β−⌊β⌋}.

We also consider the following assumption on the kernel K.

Assumption 3.5. There exists L ∈ N such that, for any k ∈ {1, . . . , L}, ∫_R y^k K(y) dy = 0; C_{K,L} denotes the associated constant appearing in the bias bound of Lemma 5.1.

Now, we obtain an upper bound for the pointwise risk of our final estimator m̂_ĥ at x, defined in (14).

Theorem 3.6. Suppose that g_j belongs to H(β_j, L_j) for j ∈ {1, 2}, that Assumption 3.3 holds and that K satisfies Assumption 3.5 with an index L ≥ max{β_1; β_2}. Then,

E[d_c(m̂_ĥ(x), m(x))] ≤ C ( ψ_n(β_1) + ψ_n(β_2) + n^{−δ} ),

where ψ_n(β) := (log n/n)^{2β/(2β+1)}, δ is defined in (15) and C is a constant depending on β_1, β_2, L_1, L_2, c_{0,1}, c_{0,2} and K.
A proof of Theorem 3.6 is given in Section 5.4.2. Observe that if β_1 = β_2 = β, then we obtain the rate ψ_n(β) = (log n/n)^{2β/(2β+1)}, which is the optimal rate for adaptive univariate regression function estimation under the pointwise risk (see e.g. Section 2 in [2]). Note that the logarithmic term appearing in the rate of convergence is expected, since we deal with pointwise adaptive estimation. For further details, we refer the reader to Lepski [15] and Lepski and Spokoiny [16], who have highlighted and discussed this fact for the Gaussian white noise model.

Remark 3.7. Eventually, to obtain a fully computable estimator, we replace the c.d.f. F_X by its natural estimate F̂_n(y) := n^{-1} Σ_{i=1}^n 1_{X_i ≤ y}. Following arguments of [4], this replacement should not change the final rate of convergence of our nonparametric estimator.
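The plug-in of Remark 3.7 is straightforward to implement. A small sketch, checking that the empirical c.d.f. of a Gaussian design approaches the true c.d.f. Φ (the tolerance 0.05 and the evaluation points are illustrative choices):

```python
import math, random

def F_n(y, X):
    # empirical c.d.f. F_n(y) = n^{-1} * #{i : X_i <= y}  (Remark 3.7)
    return sum(xi <= y for xi in X) / len(X)

random.seed(2)
X = [random.gauss(0.0, 1.0) for _ in range(5000)]
Phi = lambda y: 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))  # true c.d.f.
err = max(abs(F_n(y, X) - Phi(y)) for y in [-2, -1, 0, 1, 2])
print(err < 0.05)  # True
```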

Minimax lower bounds
To establish minimax lower bounds, we assume that the ζ_i's are centered i.i.d. random angles, independent of the X_i's. We also assume that Model (5) satisfies the following assumption.
Assumption 3.8. The design points X_i are i.i.d. random variables with density f_X(·) on [0, 1] such that there exists µ_0 < ∞ with f_X(t) ≤ µ_0 for all t ∈ [0, 1], and the errors ζ_i have common density p_ζ(·) on S^1 with respect to the Lebesgue measure on S^1, verifying condition (16): there exist p_* < ∞ and y_0 > 0 such that, for any |y| ≤ y_0, the Kullback-Leibler divergence between p_ζ(· − y) and p_ζ(·) is bounded by p_* y².

The subsequent minimax lower bound is based on a reduction scheme relying on some well-chosen probability distributions. The closeness between the associated models is measured by the Kullback-Leibler divergence and is controlled by using Assumption 3.8. In the sequel, the function m belongs to the class Σ(β, L), defined as the set of functions f : [0, 1] → S^1 such that the derivative f^{(l)}, l = ⌊β⌋, exists and satisfies the Hölder condition |f^{(⌊β⌋)}(y) − f^{(⌊β⌋)}(y')| ≤ L|y − y'|^{β−⌊β⌋}. The von Mises density f_vM(µ,κ), with c(κ) its normalizing constant, satisfies condition (16); this is proved in Lemma 5.7. Note that, apart from the most popular von Mises distribution, two other classical circular distributions, namely the cardioid and the wrapped Cauchy distributions (see [17]), also satisfy (16). The proofs are very similar to the von Mises case.
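The three circular distributions just mentioned can be checked numerically to be proper densities on [−π, π). A small sketch (the power-series evaluation of the modified Bessel function I_0 and the parameter values are implementation choices, not taken from the paper):

```python
import math

def bessel_i0(kappa, terms=40):
    # modified Bessel function I_0 via its power series (implementation choice)
    return sum((kappa / 2.0) ** (2 * k) / math.factorial(k) ** 2 for k in range(terms))

def f_vm(t, mu, kappa, c):        # von Mises density, c = 1/(2*pi*I_0(kappa))
    return c * math.exp(kappa * math.cos(t - mu))

def f_cardioid(t, mu, rho):       # cardioid density, |rho| <= 1/2
    return (1.0 + 2.0 * rho * math.cos(t - mu)) / (2.0 * math.pi)

def f_wc(t, mu, rho):             # wrapped Cauchy density, 0 <= rho < 1
    return (1.0 - rho ** 2) / (2.0 * math.pi * (1.0 + rho ** 2 - 2.0 * rho * math.cos(t - mu)))

c_vm = 1.0 / (2.0 * math.pi * bessel_i0(10.0))
step = 2.0 * math.pi / 20000
grid = [-math.pi + (k + 0.5) * step for k in range(20000)]
totals = [sum(f(t) for t in grid) * step
          for f in (lambda t: f_vm(t, 0.0, 10.0, c_vm),
                    lambda t: f_cardioid(t, 0.0, 0.3),
                    lambda t: f_wc(t, 0.0, 0.5))]
print([round(v, 3) for v in totals])  # [1.0, 1.0, 1.0]
```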
We obtain the following lower bound.

Theorem 3.11. Let β > 0 and L > 0. Under Assumption 3.8, we have

lim inf_{n→∞} inf_{m̂_n} sup_{m ∈ Σ(β,L)} n^{2β/(2β+1)} E[d_c(m̂_n(x), m(x))] ≥ c,

where c depends only on β, L, p_* and µ_0, and the infimum is taken over all possible estimators based on the observations (Θ_i, X_i)_{i=1,...,n}.
According to Remark 3.9, Theorem 3.11 entails that the lower bound on the minimax risk for functions m : R → S^1 such that m ∈ Σ(β, L) is n^{−2β/(2β+1)}. Now let us connect this result to the upper bound obtained in Theorem 3.6. As the function atan2 is infinitely differentiable on R^* × R^*, and if F_X is smoother than g_1 and g_2, then, writing m(x) = atan2(g_1(F_X(x)), g_2(F_X(x))), the smoothness β of m will be the minimum of the smoothness of g_1 and the smoothness of g_2. Hence, the result of Theorem 3.6 guarantees the near-optimal rate of our adaptive estimator, provided that F_X is known.

Numerical simulations
In this section, we implement some simulations to study the numerical performances of our procedure. We consider three different regression models M1, M2 and M3, the last of which has the form

M3. Θ = arccos(X^5 − 1) + 3 arcsin(…) + ζ,

where the circular error ζ is distributed according to a von Mises distribution f_vM(0,10) (see (17)) and is independent of X.
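Simulated data of this kind can be generated as follows. Since M1 and M2 are not reproduced in this extract (and the inner arcsin term of M3 is elided), the regression function below is only illustrative; the von Mises noise vM(0, 10) is drawn by plain rejection sampling, which is an implementation choice:

```python
import math, random

def rvm(kappa):
    # von Mises vM(0, kappa) draw by plain rejection sampling:
    # propose uniform on [-pi, pi], accept with ratio exp(kappa*(cos t - 1)) <= 1
    while True:
        t = random.uniform(-math.pi, math.pi)
        if random.random() <= math.exp(kappa * (math.cos(t) - 1.0)):
            return t

# generic simulated circular regression sample Theta_i = (m(X_i) + zeta_i) mod 2*pi
random.seed(3)
n = 200
m = lambda x: math.acos(x ** 5 - 1.0)   # illustrative only; maps [0, 1] into [pi/2, pi]
X = [random.random() for _ in range(n)]
Theta = [((m(x) + rvm(10.0) + math.pi) % (2.0 * math.pi)) - math.pi for x in X]
print(len(Theta))  # 200
```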
Finally, we make simulations for the general case of an unknown design distribution, i.e. the final estimators are computed using the empirical c.d.f. F̂_n of Remark 3.7 in place of F_X.
We denote R ≡ R(c 0 ).The numerical illustrations are displayed in Figure 2 for x = −2 and in Figure 3 for x = 1.25, respectively.
To further study the influence of the kernel rule, we consider the Gaussian kernel. The associated numerical illustrations are provided in Figure 4 for X ∼ U([−5, 5]) and for X ∼ N(0, 1.5). This brief numerical study shows that the choice c_{0,1} = c_{0,2} = 0.04 is convenient for each numerical scheme.

Numerical results
We now illustrate the numerical performances obtained by our methodology for models M1, M2 and M3 by using the Epanechnikov kernel. They are also compared to other approaches. A similar scheme is conducted by using the Gaussian kernel. Remember that, in the following numerical experiments, our estimate is tuned with c_{0,1} = c_{0,2} = 0.04. We first display several graphs to illustrate the numerical performances obtained by our methodology, denoted GL, by using the Epanechnikov kernel. More precisely, we display boxplots in Figures 7 and 8 summarizing our numerical results for model M1 in the case X ∼ U([−5, 5]) and in the case X ∼ N(0, 1.5), respectively. In both cases, for model M1, we estimate m(x) at x = −2 and at x = 1.25. Figure 9 shows simulations for model M2 with X ∼ N(0, 1.5), where we estimate m(x) at x = 1.05. Figure 10 shows simulations for model M3 with X ∼ U([0, 1]), where we estimate m(x) at x = 0.95. Moreover, to make a comparison with our adaptive estimator, as proposed in [19], we also compute the Nadaraya-Watson (NW) estimator m̂_NW and the local linear (LL) estimator. We finally repeat the previous numerical experiments but with the use of the Gaussian kernel: Figures 11 and 12 are the analogs of Figures 7 and 8. Figure 13a shows the numerical simulation for model M2 with X ∼ N(0, 1.5), where we estimate m(x) at x = 1.05. Figure 13b shows the numerical simulation for model M3 with X ∼ U([0, 1]), where we estimate m(x) at x = 0.95. These graphs show that the performances of our adaptive estimator associated with the Gaussian kernel are quite satisfying as well.

Proofs
Along this section, we fix x in R and we set u x := F X (x).

Preliminary results
In this section, we study several preliminary results for ĝ_{1,h_1} and ĝ_{2,h_2} defined in (6). First of all, via the warping method, we observe that

E[ĝ_{1,h_1}(u_x)] = E[sin(Θ_1) K_{h_1}(u_x − F_X(X_1))] = ∫_0^1 K_{h_1}(u_x − w) g_1(w) dw.

Then, using the choice of h_max, since u_x = F_X(x) ∈ (0, 1) = F_X(R) is fixed and K_{h_1}(u_x − w) = 0 for w ∈ (0, 1) such that |u_x − w| > A·h_max, we have that the integral can be extended to R. Thus, we obtain

E[ĝ_{1,h_1}(u_x)] = K_{h_1} * g_1(u_x),   (23)

and similarly, E[ĝ_{2,h_2}(u_x)] = K_{h_2} * g_2(u_x). We now obtain upper bounds for the bias and variance terms.
Lemma 5.1. Let j ∈ {1, 2}. Suppose that g_j belongs to H(β_j, L_j), with L_j, β_j ∈ R^*_+. Assume that the kernel K satisfies Assumption 3.5 with an index L ∈ R_+ such that L ≥ β_j. Then, for any h_j ∈ H_n,

|E[ĝ_{j,h_j}(u_x)] − g_j(u_x)| ≤ C_{K,L} L_j h_j^{β_j}   and   Var(ĝ_{j,h_j}(u_x)) ≤ ‖K‖²_{L²(R)}/(n h_j),

with C_{K,L} the constant defined in Assumption 3.5.
The proof of Lemma 5.1 is given in Section 5.2.
We introduce in the sequel several events on which we will establish some concentration results for g j,hj , j ∈ {1, 2}.
Proposition 5.3. For p ≥ 1 and h_j ∈ H_n (defined in (10)), a concentration inequality holds for ĝ_{j,h_j}(u_x) around its expectation, with probability at least 1 − 2n^{−p}. Consequently, suppose further that g_j belongs to H(β_j, L_j) with L_j, β_j ∈ R^*_+ and that the kernel K satisfies Assumption 3.5 with an index L ∈ R_+ such that L ≥ β_j; then the corresponding deviation bound around g_j(u_x) follows. The proof of Proposition 5.3 is given in Section 5.3.

Proof of Lemma 5.1
Proof. First, for the bias of ĝ_{j,h_j}(u_x) at u_x = F_X(x), using (23), we can write the bias as K_{h_j} * g_j(u_x) − g_j(u_x). A Taylor expansion of g_j at u_x up to order ⌊β_j⌋, with an intermediate point parametrized by 0 ≤ τ ≤ 1, combined with the vanishing-moments property of Assumption 3.5 and the fact that g_j ∈ H(β_j, L_j), yields the bias bound with the constant C_{K,L} defined in Assumption 3.5. For the variance of ĝ_{j,h_j}(u_x), one bounds the variance of a single summand, which gives the announced bound. This concludes the proof of Lemma 5.1.

Proof of Proposition 5.3
We shall use the following version of the Bernstein inequality (see [7, Lemma 2]).

Lemma 5.4. Let T_1, . . . , T_n be i.i.d. random variables with Var(T_1) ≤ V and |T_1| ≤ b, where V and b are two positive deterministic constants. Then, for any δ > 0,

P( |(1/n) Σ_{i=1}^n (T_i − E[T_1])| ≥ √(2Vδ/n) + bδ/(3n) ) ≤ 2 e^{−δ}.

Now, we can start to prove Proposition 5.3.
Corollary 5.5. Let j ∈ {1, 2}. Under the assumptions of Proposition 3.1, for all h_j, h'_j ∈ H_n, for all p ≥ 1 and for u_x = F_X(x), a concentration inequality holds for ĝ_{j,h_j,h'_j}(u_x) around its expectation.

Proof of Corollary 5.5. We define the random variables given by the summands of ĝ_{j,h_j,h'_j}(u_x). Using arguments similar to those of the proof of Proposition 5.3, we obtain, with probability greater than 1 − 2n^{−p}, the announced deviation bound. This concludes the proof of Corollary 5.5. Now, we can start to prove Proposition 3.1.
Proof of Proposition 3.1. We follow the strategy proposed in [7, Theorem 2]. The target is to find an upper bound for |ĝ_{j,ĥ_j}(u_x) − g_j(u_x)|. Let h_j ∈ H_n be fixed. We consider the following decomposition:

By the definition of A_2(h_j, u_x), we obtain a first bound, and similarly, by the definition of A_2(ĥ_j, u_x), a second one. Therefore, by using the definition of ĥ_j, and combining these bounds, we obtain the key decomposition. Now, to study A_2(h_j, u_x), we can write, for h'_j ∈ H_n, E[ĝ_{j,h'_j}(u_x)] = K_{h'_j} * g_j(u_x), as well as E[ĝ_{j,h_j,h'_j}(u_x)] = E[K_{h'_j} * ĝ_{j,h_j}(u_x)] = K_{h'_j} * K_{h_j} * g_j(u_x). Thus,

However, a uniform bound holds for any h'_j ∈ H_n. Hence, incorporating this bound in the definition of A_2(h_j, u_x), we obtain a decomposition into two suprema. From Corollary 5.5, applied for h_j, h'_j ∈ H_n, together with a union bound over the grid, using Card(H_n) ≤ n, the corresponding set has probability greater than 1 − 4n^{2−p}. Now, choose p = 2 + q, so that c_{0,j} ≥ 16(2 + q)²(1 + ‖K‖_{L^1(R)})² suffices. Thus, we obtain that P(A_2) > 1 − 4n^{−q}.
Combining inequalities (32) and (34), we obtain, on A_2, the intermediate bound, and therefore, on A_2, we finally obtain the announced inequality. This concludes the proof of Proposition 3.1.

Proof of Theorem 3.6
First, we establish a concentration result for ĝ_{1,ĥ_1}(u_x) and ĝ_{2,ĥ_2}(u_x) as follows.
In the sequel, we consider j ∈ {1, 2}.

Corollary 5.6. Suppose that g_j belongs to H(β_j, L_j) for β_j, L_j > 0. Then, under the assumptions of Proposition 3.1, for q ≥ 1, there exists a constant C_j, depending on β_j, L_j, c_{0,j} and K, such that, for n large enough, with probability greater than 1 − 4n^{−q},

|ĝ_{j,ĥ_j}(u_x) − g_j(u_x)| ≤ C_j (log n/n)^{β_j/(2β_j+1)}.

Proof of Corollary 5.6. Since g_j belongs to H(β_j, L_j), from Lemma 5.1, we can bound the bias term. From Proposition 3.1, this implies that, with probability greater than 1 − 4n^{−q}, one gets a bias-variance bound for any h_j ∈ H_n (inequality (36)). In (36), we take h_j so that h_j^{-1} is an integer and h_j is of order (log(n)/n)^{1/(2β_j+1)}.
Since h_max = (log n)^{-1} and 1/(2β_j + 1) < 1, h_j ∈ H_n for n large enough. As a result, we obtain, with probability greater than 1 − 4n^{−q}, the announced inequality with a constant C_j (depending on β_j, L_j, c_{0,j} and K). This concludes the proof of Corollary 5.6. Now, we start to prove Theorem 3.6.
We now distinguish three cases. Case 1: |g_1(u_x)| > 0 and |g_2(u_x)| > 0. For n large enough, on the event E_{1,n}(u_x, ĥ_1) ∩ E_{2,n}(u_x, ĥ_2), we get ĝ_{2,ĥ_2}(u_x)·g_2(u_x) > 0 and ĝ_{1,ĥ_1}(u_x)·g_1(u_x) > 0. Then, using the 1-Lipschitz continuity of arctan, we bound the first term in (37), and, in the same way, the second term in (37). Therefore, the announced control holds on the event E_{1,n}(u_x, ĥ_1) ∩ E_{2,n}(u_x, ĥ_2). On the other hand, on the complementary event, using the fact that |atan2(w_1, w_2)| ≤ π for any (w_1, w_2), we simply obtain an upper bound from Corollary 5.6. For q ≥ 1, the term n^{−q} is negligible in comparison with the main rate, and we conclude as for the first case.
- If g_2(u_x) < 0, then, as previously, on the event E_{2,n}(u_x, ĥ_2) ∩ E_{1,n}(u_x, ĥ_1), for n large enough, ĝ_{2,ĥ_2}(u_x) < 0, and we conclude as for the first case.

Case 3: |g_1(u_x)| > 0. We use the identity, valid for x ∈ R^*, arctan(x) + arctan(1/x) = (π/2)·sign(x), distinguishing the cases according to the sign of ĝ_{2,ĥ_2}(u_x). We conclude by using the second case, since ĝ_{1,ĥ_1}(u_x) (resp. g_1(u_x)) and ĝ_{2,ĥ_2}(u_x) (resp. g_2(u_x)) play symmetric roles.
Note that under Assumption 3.3, the case g 1 (u x ) = g 2 (u x ) = 0 cannot occur.

Proof of Theorem 3.11
Before tackling the proof of Theorem 3.11, the next lemma shows that the von Mises density satisfies condition (16).
where inf ψ denotes the infimum over all tests ψ taking values in {0, 1}.
We have used that √d_c is a true distance on S^1, so that it satisfies the triangle inequality. Now let us fix the X_i's. The minimum average probability of error p_{e,1} is defined as in [25, page 116] (see (40)). There exists n_0 such that, for all n > n_0, L h_n^β K_max ≤ y_0, where K_max = max_t |K(t)|. Using (40) and (16), we bound the conditional Kullback-Leibler divergence. Now, taking the expectation and using that the density of the X_i's is bounded by µ_0, we get a bound on E_{X_1,...,X_n}[K(P_{m_0}, P_{m_1})] of order p_* µ_0 L² K²_max n h_n^{2β+1}. For α < 2 log(2), since h_n = c_0 n^{−1/(2β+1)}, a suitable choice of c_0 yields E_{X_1,...,X_n}[K(P_{m_0}, P_{m_1})] ≤ α. As in Lemma 2.10 of [25], we introduce the function H(t) = −t log(t) − (1 − t) log(1 − t) for t ∈ (0, 1), with H(0) = H(1) = 0. Inequality (2.70) of [25] then provides a lower bound whose right-hand side is a positive constant. This concludes the proof of Theorem 3.11.

Conclusion
Considering nonparametric regression for circular data, we derive minimax convergence rates and prove near-optimality properties of our kernel estimate, combined with a warping strategy, on anisotropic Hölder classes of functions for pointwise estimation. The bandwidth parameters are selected by a data-driven Goldenshluger-Lepski type procedure. After tuning the hyperparameters of our estimate, we show that it remains very competitive with respect to existing methods.
As a natural extension, it would be very challenging to investigate our regression problem with a response on the sphere S^2 or, more generally, on the unit hypersphere S^{d−1}. The case of predictors X ∈ S^{d−1} and a response Θ ∈ S^{d−1} has been tackled in [9]. The spherical context is of course more complicated than the circular one, and the arctangent approach used here is not easily generalizable to the spherical setting. In [9], Di Marzio et al. proposed a local constant estimator by smoothing each component of the response. Once again, no rates of convergence were obtained. Hence, in a future work, a first task would be to obtain convergence rates and then investigate the adaptation issue.

Remark 3.9. For two classes of functions D and D' such that D ⊂ D', a lower bound for the minimax rate of convergence over D will also be a lower bound for the minimax rate over D'. Hence, this justifies the restriction of the study of the lower bound to circular functions m defined on [0, 1].

Remark 3.10. The classical von Mises distribution with location parameter µ ∈ [−π, π) and concentration parameter κ > 0 has density f_vM(µ,κ) defined for any θ ∈ [−π, π) by

f_vM(µ,κ)(θ) = c(κ) e^{κ cos(θ−µ)},

with c(κ) the normalizing constant. In the simulations, we consider either the Gaussian kernel defined by y ↦ K(y) = (1/√(2π)) e^{−y²/2}, or the Epanechnikov kernel defined by y ↦ K(y) = (3/4)(1 − y²) 1_{|y|≤1}.

Fig 1: Illustration of model M1 with n = 200 for two different density functions of the design. Simulated data (Θ_i)_{i=1}^n are displayed as green points. The red curve represents the regression function m, while the blue vertical line displays the point x = −2 and the orange vertical line displays the point x = 1.25, where we aim at estimating m(x).

Fig 2: Model M1. Plot of the Monte Carlo estimation of the function c_0 ∈ G_{c_0} ↦ R(c_0), based on 50 runs, for x = −2, the Gaussian and the uniform designs, and by using the Epanechnikov kernel, for n ∈ {100; 200; 500; 1000}, corresponding to the lines in blue, green, violet and orange, respectively. The red vertical line displays the point c_0 = 0.04.

Fig 4: Model M1. Plot of the Monte Carlo estimation of the function c_0 ∈ G_{c_0} ↦ R(c_0), based on 50 runs, for x = −2, the Gaussian and the uniform designs, and by using the Gaussian kernel, for n ∈ {100; 200; 500; 1000}, corresponding to the lines in blue, green, violet and orange, respectively.

Fig 7: Model M1. Boxplots of the estimated risk with 50 runs for the GL, NW and LL methodologies (from left-hand side to right-hand side on each plot (a) and (b)), for n = 200 and X ∼ U([−5, 5]), by using the Epanechnikov kernel.

Fig 8: Model M1. Boxplots of the estimated risk with 50 runs for the GL, NW and LL methodologies (from left-hand side to right-hand side on each plot (a) and (b)), for n = 200 and X ∼ N(0, 1.5), by using the Epanechnikov kernel.

Fig 9: (a): Model M2. Simulated data (Θ_i)_{i=1}^n of model M2 (green points) with n = 200 and X_i ∼ N(0, 1.5). The red curve represents the true regression function m, while the blue vertical line displays the point x = 1.05. (b): Boxplots of the estimated risk with 50 runs for the GL, NW and LL methodologies (from left-hand side to right-hand side), by using the Epanechnikov kernel.

Fig 10: (a): Model M3. Simulated data (Θ_i)_{i=1}^n of model M3 (green points) with n = 200 and X_i ∼ U([0, 1]). The red curve represents the true regression function m, while the blue vertical line displays the point x = 0.95. (b): Boxplots of the estimated risk with 50 runs for the GL, NW and LL methodologies (from left-hand side to right-hand side), by using the Epanechnikov kernel.

Fig 12: Model M1. Boxplots of the estimated risk with 50 runs for the GL, NW and LL methodologies (from left-hand side to right-hand side on each plot (a) and (b)), for n = 200 and X ∼ N(0, 1.5), by using the Gaussian kernel.