Semiparametric inference for mixtures of circular data

We consider a sample $X_1, \ldots, X_n$ of data on the circle $S^1$, whose distribution is a two-component mixture. Denoting by R and Q two rotations on $S^1$, the density of the $X_i$'s is assumed to be $g(x) = p f(R^{-1}x) + (1-p) f(Q^{-1}x)$, where $p \in (0, 1)$ and f is an unknown density on the circle. In this paper we estimate both the parametric part $\theta = (p, R, Q)$ and the nonparametric part f. The specific problems of identifiability on the circle are studied. A consistent estimator of $\theta$ is introduced and its asymptotic normality is proved. We propose a Fourier-based estimator of f with a penalized criterion to choose the resolution level. We show that our adaptive estimator is optimal from the oracle and minimax points of view when the density belongs to a Sobolev ball. Our method is illustrated by numerical simulations.


Introduction
Circular data are collected when the topic of interest is a direction or a time of day. These particular data appear in many applications: earth sciences (e.g. wind directions), medicine (e.g. circadian rhythm), ecology (e.g. animal movements), forensics (crime incidence). Several surveys on statistical methods for circular data are available: Mardia and Jupp (2000), Jammalamadaka and Sen-Gupta (2001), Ley and Verdebout (2017) or, more recently, Pewsey and García-Portugués (2021). In the present work, we consider a mixture model with two components equal up to a rotation. We observe a sample $X_1, \ldots, X_n$ of data on $S^1$ with density:

$g(x) = p_0 f(R_0^{-1}x) + (1-p_0) f(Q_0^{-1}x) = p_0 f(x-\alpha_0) + (1-p_0) f(x-\beta_0), \quad x \in S^1. \qquad (1)$
In the right-hand side we have identified $f : S^1 \to \mathbb{R}$ and its periodized version on $\mathbb{R}$. Here $R_0$ and $Q_0$ are two unknown rotations of the circle: $R_0$ is the rotation with angle $\alpha_0$ and $Q_0$ is the rotation with angle $\beta_0$. The aim is to estimate both $\theta_0 = (p_0, \alpha_0, \beta_0)$ and the nonparametric part f. Bimodal circular data are commonly encountered in many scientific fields, for instance in climatology, animal orientation or earth sciences. For the analysis of wind directions, see Hernández-Sánchez and Scarpa (2012), and for animal orientations, the dragonflies data set presented in Batschelet (1981). In geosciences, one can cite the cross-bed orientations data set obtained in the middle Mississippian Salem Limestone of central Indiana, presented by the Sedimentation Seminar (Sedimentation Seminar (1966)). Last but not least, the paper of Lark, Clifford and Waters (2014) analyzes some geological data sets and, for some of them, clearly favours a two-component mixture of von Mises distributions.
Mixture models for describing multimodal circular data date back to Pearson (1894) and have been largely used since then. An important case in the literature is the mixture of two von Mises distributions, which has been explored in numerous works; let us cite among others the papers by Bartels (1984), Spurr (1981) or Chen, Li and Fu (2008). From a practical point of view, algorithms have also been proposed to deal with mixtures of two von Mises distributions, including maximum likelihood algorithms by Jones and James (1969) or a characteristic-function-based procedure by Spurr and Koutbeiy (1991). Note that on the unit hypersphere, Banerjee et al. (2005) investigated clustering methods for mixtures of von Mises-Fisher distributions. In our framework, we shall not assume any parametric form for the density and hence the model is said to be semiparametric. To the best of our knowledge, this is the first work devoted to the study of the semiparametric mixture model for circular data. This semiparametric model is more complex and intricate than the usual parametric one encountered in the circular literature. In the spherical case, Kim and Koo (2000) studied the general mixture framework for a location parameter but assuming that the nonparametric part f is known. On the real line, this semiparametric model has been studied by Bordes, Mottelet and Vandekerkhove (2006), Hunter, Wang and Hettmansperger (2007), Butucea and Vandekerkhove (2014) or Gassiat and Rousseau (2016) for dependent latent variables. For the multivariate case, see for instance Hall and Zhou (2003), Hall et al. (2005), Gassiat, Rousseau and Vernet (2018), Hohmann and Holzmann (2013). When dealing with the specific case of one of the two components being parametric, one refers to the work by Ma and Yao (2015) and references therein.
Note that we can rewrite model (1) as $X_i = Y_i + \varepsilon_i \ (\mathrm{mod}\ 2\pi)$, where $Y_i$ has density f and $\varepsilon_i$ is a Bernoulli angle, equal to $\alpha_0$ with probability $p_0$ and to $\beta_0$ otherwise. Accordingly, model (1) can be viewed as a circular convolution model with unknown noise operator ε. The circular convolution model has been studied by Goldenshluger (2002) in the case of a known noise operator, whereas Johannes and Schwarz (2013) dealt with an unknown error distribution but had at their disposal an independent sample of the noise to estimate the latter. It is worth pointing out that Goldenshluger (2002) and Johannes and Schwarz (2013) made the usual assumptions on the decay of the Fourier coefficients of the density of ε, whereas in model (1) the Fourier coefficients are not decreasing. Identifiability questions are at the heart of the theory of mixture models and the circular context is no exception. Thus, our first task is to study the identifiability of the model. From a mathematical point of view, the topology of the circle makes the problem very different from the linear case. In the circular parametric case, Fraser, Hsu and Walker (1981) obtained identifiability results for the von Mises distributions, which were extended in Kent (1983) to generalized von Mises distributions, while Holzmann, Munk and Stratmann (2004) focused on wrapped distributions, basing their analysis on the Fourier coefficients. Here, the Fourier coefficients turn out to be very useful as well, but the nonparametric paradigm makes the study quite different and intricate. Our identifiability results are obtained under mild assumptions on the Fourier coefficients. We require that the coefficients are real, which can be related to the usual symmetry assumption in mixture models (see for instance Hunter, Wang and Hettmansperger (2007)), and we only impose that the first 4 coefficients do not vanish. Interestingly enough, some non-intuitive phenomena appear. A striking case occurs when the angles $\alpha_0$ and $\beta_0$ differ by $2\pi/3$: model (1) is then nonidentifiable, which is quite surprising at first sight.
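To fix ideas, here is a minimal simulation sketch of model (1) viewed as a circular convolution; the function names and the von Mises choice for f are ours and purely illustrative (in the model f is unknown):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, p, alpha, beta, sample_f):
    """Draw X_i = Y_i + eps_i (mod 2*pi), where Y_i ~ f and eps_i is a
    Bernoulli angle equal to alpha with probability p and beta otherwise."""
    y = sample_f(n)                                   # latent sample from f
    eps = np.where(rng.random(n) < p, alpha, beta)    # Bernoulli angle
    return np.mod(y + eps, 2 * np.pi)

def sample_vonmises(n, kappa=2.0, mu=0.0):
    # illustrative choice of f, symmetric about 0 as required in Section 2
    return np.mod(rng.vonmises(mu, kappa, size=n), 2 * np.pi)

x = sample_mixture(1000, p=0.25, alpha=np.pi / 8, beta=2 * np.pi / 3,
                   sample_f=sample_vonmises)
```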
Once the identifiability of the model is obtained, we resort to a contrast function in the line of Butucea and Vandekerkhove (2014) to estimate the Euclidean parameter $\theta_0$. In that regard, we prove the consistency of our estimator and an asymptotic normality result. Thereafter, for the estimation of the nonparametric part, a penalized empirical risk estimation method is used. The estimator of the density turns out to be adaptive (meaning that it does not require the specification of the unknown smoothness parameter), a property which had not been reached so far for this semiparametric model, even in the linear case. The procedure devised is hence relevant for practical purposes. We prove an oracle inequality, and minimax rates are achieved by our estimator over Sobolev regularity classes. Finally, a numerical section shows the good performance of the whole estimation procedure.
The paper is organized as follows. Section 2 is devoted to the identifiability of the model. Section 3 tackles the estimation of the parameter θ 0 whereas Section 4 focuses on the estimation of the nonparametric part. Finally Section 5 presents numerical implementations of our procedure. Proofs are gathered in Section 6.

Identifiability
In this section, to keep the notation as light and clear as possible, we drop the subscript 0 in the parameters. For any function g and any angle α, denote $g_\alpha(x) := g(x - \alpha)$. For any complex number a, $\bar{a}$ is the complex conjugate of a. For any integrable function $\varphi : S^1 \to \mathbb{R}$ and any $l \in \mathbb{Z}$, we denote by $\varphi_l$ the Fourier coefficient $\varphi_l = \int_{S^1} \varphi(x) e^{-ilx} \frac{dx}{2\pi}$. Note also that we use the notation f and f′ for two densities, where f′ is not the derivative of f.
Let us now study the identifiability of our model (1), where the data have density $p f(x - \alpha) + (1 - p) f(x - \beta)$. First, it is obvious that if p = 0, α is not identifiable, and if p = 1, β is not identifiable. In the same way, p is not identifiable if α = β. Moreover, as explained in Hunter, Wang and Hettmansperger (2007) for a translation mixture on the real line, the case p = 1/2 has to be avoided. Indeed, denoting by g a density and setting for instance $f(x) = \frac12 g(x-1) + \frac12 g(x+1)$ and $f'(x) = \frac12 g(x-2) + \frac12 g(x+2)$, we have $\frac12 f(x-2) + \frac12 f(x+2) = \frac12 f'(x-1) + \frac12 f'(x+1)$. In addition, it is well known that, in such a mixture model, (p, α, β) cannot be distinguished from (1 − p, β, α): it is the so-called label switching problem. So we will assume that p ∈ (0, 1/2) (for mixtures on R it is alternatively assumed that α < β, but ordering angles is less relevant). Now let us study the specific problems of identifiability on the circle, which do not appear on R. First, if f is the uniform density, the model is not identifiable, so we have to exclude this case. Another case to exclude is that of δ-periodic functions: indeed, in this case $f_\alpha = f_{\alpha+\delta}$. These functions have the property that $f_l = 0$ for all $l \notin (2\pi/\delta)\mathbb{Z}$. So we will require that the Fourier coefficients of f do not vanish too much. Here we will assume that, for all l ∈ {1, 2, 3, 4}, $f_l \neq 0$ and $f_l = \bar{f_l}$.
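To make the p = 1/2 counterexample explicit, a direct computation (ours, with the densities just introduced) gives
\[
\tfrac12 f(x-2) + \tfrac12 f(x+2)
= \tfrac14\bigl(g(x-3) + g(x-1) + g(x+1) + g(x+3)\bigr)
= \tfrac12 f'(x-1) + \tfrac12 f'(x+1),
\]
so two different pairs of density and translation angles produce the same observed density when p = 1/2.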
This last assumption can be related to the symmetry of f. Indeed, if f is symmetric about zero then all its Fourier coefficients are real. Symmetry is a usual assumption in this mixture context, to distinguish among the translations of f: for any δ ∈ R, $p f_\alpha + (1-p) f_\beta = p (f_\delta)_{\alpha-\delta} + (1-p)(f_\delta)_{\beta-\delta}$. More precisely, Hunter, Wang and Hettmansperger (2007) show that symmetry is a necessary and sufficient condition for identifiability of the mixture model on R. In the circular framework, it is natural to work with Fourier coefficients rather than with the Fourier transform as on R. A lot of circular densities have real Fourier coefficients, provided that their location parameter is μ = 0: for example the Jones-Pewsey density, which includes the cardioid, the wrapped Cauchy density and the von Mises density. Here we require the assumption only for the first 4 Fourier coefficients of f (due to our proof), which is milder than symmetry. Let us now state our identifiability result under these assumptions. Note that Holzmann, Munk and Stratmann (2004) studied the identifiability of this model when f belongs to a parametric scale family of densities, but here we face a nonparametric problem concerning f.

Theorem 1. Assume that θ = (p, α, β) and θ′ = (p′, α′, β′) belong to $(0, 1/2) \times [0, 2\pi)^2$ with α ≠ β and α′ ≠ β′, that f and f′ have non-zero real Fourier coefficients $f_l$ and $f'_l$ for l ∈ {1, 2, 3, 4}, and that $p f_\alpha + (1-p) f_\beta = p' f'_{\alpha'} + (1-p') f'_{\beta'}$. Then
1. either (p′, α′, β′) = (p, α, β) and f′ = f,
2. or (p′, α′, β′) = (p, α + π, β + π) and f′ = $f_\pi$,
3. or, if β − α = π (mod 2π), f′ is a linear combination of f and $f_\pi$, and either (α′, β′) = (α, β) or (α′, β′) = (β, α),
4. or β − α = ±2π/3 (mod 2π); this case, in which neither θ nor f is identifiable, is discussed below.

Case 2 arises from a specific feature of circular distributions: if f is symmetric with respect to 0 then it is symmetric with respect to π. Unlike the real case, a symmetry assumption does not exclude the case f′(x) = f(x − π). To bypass this we could assume for instance $f_1 > 0$. Indeed, for each l ∈ Z, $(f_\pi)_l = (-1)^l f_l$, so the Fourier coefficients of f and $f_\pi$ have opposite signs for any odd l. With this assumption, we recover, among f and $f_\pi$, the one with positive first Fourier coefficient, i.e. with positive mean resultant length. Nevertheless, our estimation procedure begins with the parametric part, so that this assumption concerning only the nonparametric part would not allow us to distinguish α from α + π in this first parametric estimation step. That is why we rather choose to assume that α and β, taken modulo π, belong to [0, π).
Case 3 concerns bipolar data, since α and β are diametrically opposed (separated by π radians). In this case α and β are identifiable, but p and f are not. Indeed, for any density f and any 0 < p ≤ p′ < 1/2, we can find q ∈ (0, 1] such that $f' = q f + (1 - q) f_\pi$ verifies $p f_\alpha + (1-p) f_\beta = p' f'_\alpha + (1-p') f'_\beta$. Thus our result demonstrates that bimodal data sets with opposite modes lead to non-identifiability issues, and this highlights a fundamental issue in considering too large a class of possible densities.
Let us now discuss case 4, which is the most curious (we shall only comment on the first sub-case (a), the other being similar). Let us define $\tilde f$ as a suitable combination of f and its translates, depending on p. This function is symmetric if f is symmetric, satisfies $\int_{S^1} \tilde f = 1$ and may be positive for some values of p (depending on f): see Figure 1. Then $\tilde f_{\pi/3}$ and $\tilde f_\pi$ can both be written as mixtures of translates of f, so that a mixture of $\tilde f_\pi$ and $\tilde f_{\pi/3}$ coincides with a mixture of the form (1) with different parameters. In such a particular case, we cannot identify θ nor f. However, this happens only when β − α = ±2π/3. So, to exclude these cases, we will now assume β ≠ α (mod 2π/3).
Finally, we shall assume that f belongs to a class F of densities whose Fourier coefficients $f_l$, l ∈ {1, 2, 3, 4}, are non-zero real numbers.


Parametric part

Recall that the Fourier coefficients of the density g of the observations satisfy $g_l = M_l(\theta_0) f_l$, where $\theta_0 = (p_0, \alpha_0, \beta_0)$ is the true parameter and $M_l(\theta) = p e^{-il\alpha} + (1-p) e^{-il\beta}$; in particular $g_0 = M_0(\theta_0) f_0 = 1/(2\pi)$. Using that the $f_l$ are non-zero real numbers, this invites us to consider a contrast function of the form $S(\theta) = \sum_l J_l(\theta)^2$, where the quantities $J_l(\theta)$ are built from the $g_l$ and $M_l(\theta)$ and all vanish at θ = $\theta_0$. The empirical counterpart $\tilde S_n(\theta)$ of S(θ) is obtained by replacing each $J_l(\theta)$ by an empirical average of variables $Z_l^k(\theta)$, k = 1, ..., n, computed from the observations. Next, we consider a slightly modified version $S_n(\theta)$ of $\tilde S_n(\theta)$, obtained by removing the diagonal terms. Since $E(Z_l^k(\theta)) = J_l(\theta)$, $S_n(\theta)$ is an unbiased estimator of S(θ). The estimator of the parametric part is then defined as $\hat\theta_n \in \arg\min_{\theta\in\Theta} S_n(\theta)$.
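As an illustration of this estimation step, here is a minimal sketch (ours) where we take $J_l(\theta) = \mathrm{Im}(\hat g_l / M_l(\theta))$, a natural choice that vanishes at $\theta_0$ since $g_l/M_l(\theta_0) = f_l$ is real; the exact index set, weights and the de-diagonalized U-statistic version $S_n$ used in the paper are omitted here:

```python
import numpy as np
from scipy.optimize import minimize

def M(l, theta):
    # M_l(theta) = p * exp(-i l alpha) + (1 - p) * exp(-i l beta)
    p, alpha, beta = theta
    return p * np.exp(-1j * l * alpha) + (1 - p) * np.exp(-1j * l * beta)

def empirical_fourier(x, l):
    # hat g_l = (1 / (2 pi n)) * sum_k exp(-i l X_k)
    return np.mean(np.exp(-1j * l * x)) / (2 * np.pi)

def contrast(theta, x, ls=(1, 2, 3, 4)):
    # illustrative contrast: sum_l Im(hat g_l / M_l(theta))^2
    return sum(np.imag(empirical_fourier(x, l) / M(l, theta)) ** 2 for l in ls)

def estimate_theta(x, n_starts=10, p_max=0.45, seed=0):
    # multi-start constrained minimization over p in (0, p_max), angles in [0, pi)
    rng = np.random.default_rng(seed)
    bounds = [(1e-3, p_max), (0.0, np.pi), (0.0, np.pi)]
    best = None
    for _ in range(n_starts):
        theta0 = np.array([rng.uniform(a, b) for a, b in bounds])
        res = minimize(contrast, theta0, args=(x,), bounds=bounds)
        if best is None or res.fun < best.fun:
            best = res
    return best.x
```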
For this estimator we can prove the following consistency result.

Theorem 4. Under the assumptions of Section 2, if Θ is a compact set containing $\theta_0$, then $\hat\theta_n$ converges in probability to $\theta_0$ or to $\theta_0 + \pi := (p_0, \alpha_0 + \pi, \beta_0 + \pi)$.
Proof. Θ is a compact set and S is continuous. Lemma 13 ensures that $S_n$ is Lipschitz, hence uniformly continuous, and Proposition 14 ensures that, for all θ, $|S_n(\theta) - S(\theta)|$ tends to 0 in probability. It is then sufficient to apply a classical lemma (Lemma 15) to conclude. See the details in Section 6.2.

From now on, we assume that Θ is a compact set included in $(0, 1/2) \times [0, \pi) \times [0, \pi)$, as in Assumption 4. Then $\theta_0 + \pi$ is excluded and, under Assumption 4, $\hat\theta_n \to \theta_0$ in probability. Moreover, this estimator is asymptotically normal. We denote by $\dot\varphi(\theta)$ the gradient of any function φ with respect to θ = (p, α, β), by $\ddot\varphi(\theta)$ its Hessian matrix, and for any matrix A we denote by $A^\top$ its transpose.

Theorem 5. Consider Θ a compact set included in $(0, 1/2) \times [0, \pi) \times [0, \pi)$ containing $\theta_0$, and assume that the matrix $A := \ddot S(\theta_0)$ is invertible. Then $\sqrt{n}(\hat\theta_n - \theta_0)$ converges in distribution to a centered Gaussian vector with covariance matrix $\Sigma = A^{-1} V A^{-1}$, where V is the asymptotic covariance matrix of $\sqrt{n}\,\dot S_n(\theta_0)$.
The proof can be found in Section 6.3. Note that A can be estimated by $\ddot S_n(\hat\theta_n)$ and V by its empirical counterpart given in Proposition 16 (see details in Section 6.4). Thus we can estimate the covariance matrix Σ and deduce an asymptotic confidence region. We also prove the following result on the quadratic risk of the estimator $\hat\theta_n$, which is useful in the sequel (see Section 6.5 for a proof).

Proposition 6.
Under the assumptions of Theorem 5, there exists a numerical constant K such that, for all $\theta_0 \in \Theta$ and for all n ≥ 1, $E\|\hat\theta_n - \theta_0\|^2 \le K/n$, where the norm is the Euclidean norm in $\mathbb{R}^3$.

Nonparametric part
Let us now estimate the nonparametric part. We shall use the following norm: for any function φ, we denote $\|\varphi\|_2 = \big(\frac{1}{2\pi}\int_{S^1}\varphi^2(x)\,dx\big)^{1/2}$. Recall that for all $l \in \mathbb{Z}$, $g_l = M_l(\theta_0) f_l$, where g is the density of the observations $X_k$ and $g_l$ its Fourier coefficient. Then $f_l = g_l / M_l(\theta_0)$. Nevertheless, this division by $M_l(\theta_0)$ requires us to impose a new assumption. We assume that there exists $P \in (0, 1/2)$ such that $0 < p < P$ for any p such that $(p, \alpha, \beta) \in \Theta$, i.e.

Assumption 5. Θ is a compact set included in $(0, P) \times [0, \pi) \times [0, \pi)$, with $0 < P < 1/2$.
Under this assumption, |M l (θ)| is always bounded from below by 1 − 2P .
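This bound is a one-line check: with $M_l(\theta) = p e^{-il\alpha} + (1-p) e^{-il\beta}$ as recovered from the model, the reverse triangle inequality gives, for every $l \in \mathbb{Z}$ and every θ satisfying Assumption 5,
\[
|M_l(\theta)| \;\ge\; (1 - p) - p \;=\; 1 - 2p \;\ge\; 1 - 2P \;>\; 0.
\]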
If $\hat\theta = \hat\theta_n$ is the previous estimator of the parametric part, we set the plug-in estimator of the Fourier coefficients: $\hat f_l = \hat g_l / M_l(\hat\theta_n)$, where $\hat g_l = \frac{1}{2\pi n}\sum_{k=1}^n e^{-ilX_k}$ is the empirical Fourier coefficient of g. Finally, for L an integer, set $\hat f_L(x) = \sum_{|l|\le L} \hat f_l\, e^{ilx}$. To measure the performance of this estimator, we use the Parseval equality to write $E\|\hat f_L - f\|_2^2 = \sum_{|l|>L} |f_l|^2 + E\sum_{|l|\le L} |\hat f_l - f_l|^2$, which is the classical bias-variance decomposition. Moreover, it is possible to prove that the variance term satisfies $E\sum_{|l|\le L} |\hat f_l - f_l|^2 \le C\,(2L+1)/n$, for a constant C depending on P and f. To control the bias term we recall the definition of the Sobolev ellipsoid: $W_s(R) = \{f : \sum_{l\in\mathbb{Z}} |l|^{2s} |f_l|^2 \le R^2\}$. For such a smooth f, the risk of the estimator $\hat f_L$ is then bounded in the following way: $E\|\hat f_L - f\|_2^2 \le R^2 L^{-2s} + C\,(2L+1)/n$. It is clear that an optimal value for L is of order $n^{1/(2s+1)}$, but this value is unknown. We rather choose a data-driven method to select L. We introduce a classical minimization of a penalized empirical risk. Set $\hat L = \arg\min_{L\in\mathcal{L}} \big\{ -\sum_{|l|\le L} |\hat f_l|^2 + \lambda\,\frac{2L+1}{n} \big\}$, where $\mathcal{L}$ is a finite set of resolution levels and λ a constant to be specified later; the final estimator is $\hat f_{\hat L}$. The next theorem states an oracle inequality which highlights the bias-variance decomposition of the quadratic risk and justifies our estimation procedure.
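Here is a minimal sketch (ours) of the plug-in projection estimator and of the penalized selection rule, reusing the illustrative helper M(l, theta) from the earlier sketch; the penalty shape λ(2L+1)/n follows the slope-heuristic description of Section 5 and the exact constants in the paper may differ:

```python
import numpy as np

def fourier_coeff_estimates(x, L, theta_hat):
    """Plug-in estimates: hat f_l = hat g_l / M_l(hat theta) for |l| <= L."""
    ls = np.arange(-L, L + 1)
    g_hat = np.array([np.mean(np.exp(-1j * l * x)) / (2 * np.pi) for l in ls])
    m_hat = np.array([M(l, theta_hat) for l in ls])  # M from the earlier sketch
    return ls, g_hat / m_hat

def f_hat_L(grid, x, L, theta_hat):
    """Projection estimator: hat f_L(t) = sum_{|l|<=L} hat f_l * e^{i l t}."""
    ls, f_hat = fourier_coeff_estimates(x, L, theta_hat)
    return np.real(np.exp(1j * np.outer(grid, ls)) @ f_hat)

def select_L(x, theta_hat, L_max, lam):
    """Penalized criterion: minimize -sum_{|l|<=L} |hat f_l|^2 + lam*(2L+1)/n."""
    n = len(x)
    crit = []
    for L in range(L_max + 1):
        _, f_hat = fourier_coeff_estimates(x, L, theta_hat)
        crit.append(-np.sum(np.abs(f_hat) ** 2) + lam * (2 * L + 1) / n)
    return int(np.argmin(crit))
```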

Theorem 7. Assume Assumption 1 and Assumption 5, and assume that f belongs to the Sobolev ellipsoid $W_s(R)$ for some s, R > 0. Then, for λ large enough (depending only on P), the adaptive estimator $\hat f_{\hat L}$ satisfies an oracle-type inequality: up to multiplicative constants not depending on f or n, its quadratic risk $E\|\hat f_{\hat L} - f\|_2^2$ is bounded by the best trade-off over $L \in \mathcal{L}$ between the bias term $\sum_{|l|>L} |f_l|^2$ and the penalty term $\lambda(2L+1)/n$, plus a remainder of order 1/n.
As a consequence, our estimator has a quadratic risk of order $n^{-2s/(2s+1)}$. Regarding the lower bound, note that for any estimator $\tilde f_n$ the maximal risk over all (θ, f) is bounded from below by the maximal risk over f for some arbitrary fixed θ ∈ Θ, so that the problem is reduced to a purely nonparametric lower bound. In the case of direct observations this quantity is lower bounded by $C n^{-2s/(2s+1)}$ (see Theorem 11 and its proof in Baldi et al. (2009) for the circle $S^1$). We can use this proof to obtain the lower bound in our mixture case. Indeed, for any densities $f_1$ and $f_2$, if $g_1$ and $g_2$ are the associated densities of our observations, then the Kullback-Leibler divergence verifies $KL(g_1, g_2) \le KL(f_1, f_2)$, and our estimator is minimax optimal.
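For completeness, here is a sketch of the convexity argument behind this reduction (our computation): writing $g_j(x) = p f_j(x-\alpha) + (1-p) f_j(x-\beta)$ for j = 1, 2, the joint convexity of the Kullback-Leibler divergence and its invariance under translations give
\[
KL(g_1, g_2) \;\le\; p\,KL\bigl(f_1(\cdot-\alpha), f_2(\cdot-\alpha)\bigr) + (1-p)\,KL\bigl(f_1(\cdot-\beta), f_2(\cdot-\beta)\bigr) \;=\; KL(f_1, f_2),
\]
so any pair of test densities used for the lower bound with direct observations remains at least as hard to distinguish in the mixture model.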
Remark 1. Note that the penalty only depends on P, which is a safety margin away from 1/2 that can be chosen by the statistician. For the practical choice of the penalty, see Section 5.
Finally, note that some densities may be supersmooth, in the following sense: $\sum_{l\in\mathbb{Z}} |f_l|^2 e^{2b|l|^r} \le R^2$ for some b > 0 and r > 0. In this case, the quadratic bias is bounded by $R^2 \exp(-2bL^r)$, which gives the following fast rate of convergence: $E\|\hat f_{\hat L} - f\|_2^2 \le C\,(\log n)^{1/r}/n$.
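To see where this rate comes from, a heuristic computation (ours, using the bias bound above and a variance term of order (2L+1)/n) balances the two terms: choosing $L^* = \big(\frac{\log n}{2b}\big)^{1/r}$ gives
\[
R^2 e^{-2b (L^*)^r} = \frac{R^2}{n}, \qquad \frac{2L^*+1}{n} \asymp \frac{(\log n)^{1/r}}{n},
\]
so the quadratic risk is of order $(\log n)^{1/r}/n$, which is almost the parametric rate.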

Numerical results
All computations are performed with Matlab software and the Optimization Toolbox.
We shall implement our statistical procedure to estimate both the parameter $\theta_0$ and the density f. We consider three popular circular densities, namely the von Mises, the wrapped Cauchy and the wrapped normal densities. We recall their expressions (see Ley and Verdebout (2017)). The von Mises density is given by $f_{vM}(x) = \frac{1}{2\pi I_0(\kappa)}\exp(\kappa \cos(x - \mu))$, with κ ≥ 0 and $I_0(\kappa)$ the modified Bessel function of the first kind and of order 0. The wrapped Cauchy distribution has density $f_{wC}(x) = \frac{1}{2\pi}\,\frac{1 - \gamma^2}{1 + \gamma^2 - 2\gamma \cos(x - \mu)}$, with 0 ≤ γ ≤ 1. The wrapped normal density can be written $f_{wN}(x) = \frac{1}{2\pi}\big(1 + 2\sum_{k=1}^{+\infty} \rho^{k^2} \cos(k(x - \mu))\big)$; for more clarity, we set $\sigma^2 =: -2\log(\rho)$, hence we have 0 ≤ ρ ≤ 1. All these densities are characterized by a concentration parameter κ, γ or ρ and a location parameter μ. Recall that the values κ = 0, γ = 0 and ρ = 0 correspond to the uniform density on the circle. To meet the symmetry assumptions of Theorem 1, we consider in the sequel that the location parameter is set to μ = 0.
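For reference, a short Python sketch (ours) of these three densities, with illustrative parameter values:

```python
import numpy as np
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def von_mises_pdf(x, mu=0.0, kappa=2.0):
    return np.exp(kappa * np.cos(x - mu)) / (2 * np.pi * i0(kappa))

def wrapped_cauchy_pdf(x, mu=0.0, gamma=0.5):
    return (1 - gamma**2) / (2 * np.pi * (1 + gamma**2 - 2 * gamma * np.cos(x - mu)))

def wrapped_normal_pdf(x, mu=0.0, rho=0.7, kmax=50):
    # truncated Fourier series; rho = exp(-sigma^2 / 2)
    k = np.arange(1, kmax + 1)
    series = np.sum(rho ** (k**2) * np.cos(np.outer(np.atleast_1d(x) - mu, k)), axis=1)
    return (1 + 2 * series) / (2 * np.pi)
```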
First, let us focus on the parametric part. We set $\theta_0 = (p_0, \alpha_0, \beta_0) = (\frac14, \frac{\pi}{8}, \frac{2\pi}{3})$. Obtaining the estimate $\hat\theta_n$ of $\theta_0$ requires solving a nonlinear minimization problem. To this end, we resort to the function fmincon of the Matlab Optimization Toolbox. The function fmincon finds a constrained minimum of a function of several variables. Two parameters have to be specified: the domain over which the minimum is searched and an initial value. For more stability and to avoid possible local minima, we run the procedure with 10 initial values uniformly drawn on the search domain. The final estimator $\hat\theta_n$ corresponds to the minimum value of the empirical contrast $S_n(\theta)$ over the 10 trials.

Table 1 gathers the mean squared errors of our estimation procedure. When analyzing Table 1, one clearly sees that increasing the number of observations noticeably improves the performance. As expected, von Mises densities with a smaller concentration parameter are more difficult to estimate. Nonetheless, the overall performance is satisfying. Table 2 displays the performance of the method-of-moments estimation procedure developed by Spurr and Koutbeiy (1991) to handle the problem of estimating the parameters in mixtures of von Mises distributions. To compare the two methods fairly, Table 3 gives the Spurr and Koutbeiy (1991) performance when estimating on the same domain as ours. At closer inspection, the Spurr and Koutbeiy (1991) method seems to behave better for estimating the angles $\alpha_0$ and $\beta_0$, while our method may appear more competitive for estimating $p_0$. It is worth noticing that the method of Spurr and Koutbeiy (1991) is completely parametric and takes advantage of the knowledge of the distributions. In this regard, our semiparametric procedure is competitive with a parametric method. Figure 2 illustrates the asymptotic normality of our estimator $\hat\theta_n$ stated in Theorem 5.

Now, let us turn to the nonparametric estimation part, namely the estimation of the density f. The estimator of f is $\hat f_{\hat L}$ (see Theorem 7). It requires the computation of a data-driven resolution level $\hat L$, which involves a tuning parameter λ. To select a proper λ, we follow the data-driven slope estimation approach due to Birgé and Massart (see Birgé and Massart (2001) and Birgé and Massart (2007)); an overview of its use in practice is presented in Baudry, Maugis and Michel (2012). To implement the slope heuristics method, one plots, for L = 0 to $L_{max}$, the couples of points $\big(\frac{2L+1}{n}, \sum_{l=-L}^{L} |\hat f_l|^2\big)$. For L ≥ $L_0$, one should observe a linear behaviour (see Figure 3). Then, once the slope, say a, is estimated by a linear regression method, one eventually takes $\hat\lambda = 2a$, and the final resolution level is $\hat L = \arg\min_{L\in\mathcal{L}} \big\{ -\sum_{|l|\le L} |\hat f_l|^2 + \hat\lambda\,\frac{2L+1}{n} \big\}$. Finally, Figure 4 shows reconstructions of the density f and of the mixture density g as well. The estimates are accurate.
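A minimal sketch (ours) of the slope heuristic step, reusing the illustrative fourier_coeff_estimates helper from the previous sketch; the choice of the fitting range is an assumption for illustration:

```python
import numpy as np

def slope_heuristic_lambda(x, theta_hat, L_max, L_fit_min):
    """Estimate lambda: regress sum_{|l|<=L} |hat f_l|^2 on (2L+1)/n for
    L >= L_fit_min (the linear part of the plot), then take hat lambda = 2*slope."""
    n = len(x)
    dims, contrasts = [], []
    for L in range(L_max + 1):
        _, f_hat = fourier_coeff_estimates(x, L, theta_hat)
        dims.append((2 * L + 1) / n)
        contrasts.append(np.sum(np.abs(f_hat) ** 2))
    dims, contrasts = np.array(dims), np.array(contrasts)
    slope, _ = np.polyfit(dims[L_fit_min:], contrasts[L_fit_min:], 1)
    return 2 * slope

# usage: lam = slope_heuristic_lambda(x, theta_hat, L_max=30, L_fit_min=10)
#        L_hat = select_L(x, theta_hat, L_max=30, lam=lam)
```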

Remark 2.
Note that in the two exceptional cases, when $p_0 = 0$ or when f is the uniform density, our procedure performs well. Indeed, if $p_0 = 0$, our method yields $\hat\alpha = \hat\beta$ and thus retrieves that there is only one component in the mixture. When f is the uniform density, our algorithm selects $\hat L = 0$, which yields the uniform distribution.

Proof of Theorem 1 (identifiability)
Then, our assumptions on f and f′ entail the following. Let us now study the consequences of this fact. Denote by $\gamma_1, \gamma_2, \gamma_3, \gamma_4$ the 4 angles and by $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ the associated weights in (0, 1). With this notation we obtain a system of equations, which is studied in Lemmas 8 and 9 below. Let us now reason with the representatives of the $\gamma_k$ in (−π, π]. Lemma 9 says that the possible values for the $\gamma_k$'s are 0, π, γ, −γ, for some γ ∈ (0, π). Note that here (7) holds, and then the $\gamma_k$'s take at least 2 different values: either 4 different values; or $\gamma_2 = \gamma_3$ and the others distinct; or $\gamma_1 = \gamma_4$ and the others distinct; or $\gamma_2 = \gamma_3$ and $\gamma_1 = \gamma_4$.

Proof. First observe that, since
Now, let us study the various cases that make this quantity vanish. For the first case, note that if three of the $\gamma_k$ are multiples of π, say $\gamma_{i_1} = \gamma_{i_2} = \gamma_{i_3} = 0 \ (\mathrm{mod}\ \pi)$, then equation (8) becomes $\lambda_{i_4} \sin(l\gamma_{i_4}) = 0$ and the last angle is also null modulo π.
Proof. Point 1 is straightforward. 2. Let us start with $Z_l^k(\theta)$. We recall that $Z_l^k(\theta) = \frac{e^{ilX_k}}{2\pi M_l(\theta)}$, which gives the announced bound. 3. We differentiate $Z_l^k(\theta)$ with respect to θ, and we obtain the same bound for $\dot J_l(\theta)$. 4. We differentiate once more, and we bound $\|\ddot J_l(\theta)\|_F$ in the same way. This ends the proof of the lemma.

Lemma 12.
There exists a numerical positive constant C such that the following inequalities hold.

We also have
Proof. We use Taylor expansions at first order and then apply the same bounding techniques as in Lemma 11.

Proof of Lemma 13. We will write C for a numerical constant that may change from line to line. Let us start with point 1. We recall that $S(\theta) = \sum_l J_l(\theta)^2$. Let θ and θ′ in $\tilde\Theta$. As $\tilde\Theta$ is a convex set, we get, thanks to the mean value theorem, a bound with $\theta_u$ lying on the line connecting θ to θ′, and we conclude using Lemma 11.
Let us shift to point 2. Due to the mean value theorem, we have a similar expansion, with $\theta_u$ lying on the line connecting θ to θ′. Then, using 1. and 2. of Lemma 11, we get a bound proportional to $\sum_l (1 + |l|)$ times $\|\theta - \theta'\|$, hence bounded by $C\|\theta - \theta'\|$, which ends the proof of the second point. Concerning point 3, using Taylor expansions and Lemmas 11 and 12, we obtain a bound involving $\sum_l (1 + |l| + l^2 + |l|^3)$.

Proposition 14.
There exists a positive constant C such that, for all θ ∈ Θ, $E\big[(S_n(\theta) - S(\theta))^2\big] \le C/n$.

Proof. The definitions of $S_n$ and S provide the decomposition $S_n(\theta) - S(\theta) = T_n + V_n$. Note that $E(Z_l^k(\theta) - J_l(\theta)) = 0$, which entails $E[T_n V_n] = 0$. Then we bound $E[T_n^2]$ using Lemma 11. We focus now on $V_n$: we bound $E[V_n^2]$ in the same way,

using Lemma 11 again.
Theorem 4 is finally proved using the following lemma, its assumptions being ensured by Proposition 10, Lemma 13 and Proposition 14.
Lemma 15. Assume that Θ is a compact set and let $S : \Theta \to \mathbb{R}$ be a continuous function. Assume that $S(\theta) > S(\theta_0) = S(\theta_0')$ for all $\theta \notin \{\theta_0, \theta_0'\}$, where $\theta_0, \theta_0' \in \Theta$. Let $S_n : \Theta \to \mathbb{R}$ be a function which is uniformly continuous and such that, for all θ, $|S_n(\theta) - S(\theta)|$ tends to 0 in probability. Let $\hat\theta_n$ be a point such that $S_n(\hat\theta_n) = \inf_\Theta S_n$. Then $\hat\theta_n \to \theta_0$ or $\theta_0'$ in probability.
This is a classical result in the theory of minimum contrast estimators when $\theta_0' = \theta_0$ (see van der Vaart (1998) or Dacunha-Castelle and Duflo (1986)). We reproduce the proof since it is slightly adapted to the case of two argmins.
Proof. Let ε > 0 and let B be the union of the open ball with center $\theta_0$ and radius ε and the open ball with center $\theta_0'$ and radius ε. Since S is continuous and $B^c \subset \Theta$ is a compact set, there exists $\theta_\varepsilon \in B^c$ such that $S(\theta_\varepsilon) = \inf_{B^c} S$. Using the assumption, since $\theta_\varepsilon \neq \theta_0$ and $\theta_\varepsilon \neq \theta_0'$, we have $\delta := S(\theta_\varepsilon) - S(\theta_0) > 0$. Since $S_n$ is uniformly continuous, there exists α > 0 such that $\|\theta - \theta'\| < \alpha$ implies $|S_n(\theta) - S_n(\theta')| \le \delta/2$. Moreover, $B^c$ is a compact set, hence there exists a finite set $(\theta_i)_{1 \le i \le I}$ of points of $B^c$ such that the balls of radius α centered at the $\theta_i$ cover $B^c$; set $\Delta_n := \max\big(|S_n(\theta_0) - S(\theta_0)|,\ \max_{1 \le i \le I} |S_n(\theta_i) - S(\theta_i)|\big)$. The assumption ensures that $\Delta_n$ tends to 0 in probability. Let $\theta \in B^c$. There exists $1 \le i \le I$ such that $\|\theta - \theta_i\| < \alpha$, and then $|S_n(\theta) - S_n(\theta_i)| \le \delta/2$. Thus $S_n(\theta) \ge S(\theta_i) - \Delta_n - \delta/2 \ge S(\theta_\varepsilon) - \Delta_n - \delta/2 = S(\theta_0) + \delta/2 - \Delta_n$. Now, if $\|\hat\theta_n - \theta_0\| \ge \varepsilon$ and $\|\hat\theta_n - \theta_0'\| \ge \varepsilon$, then $\hat\theta_n \in B^c$ and $\inf_{\theta\in\Theta} S_n(\theta) = S_n(\hat\theta_n) = \inf_{\theta\in B^c} S_n(\theta)$.
In particular, $\inf_{\theta\in B^c} S_n(\theta) \le S_n(\theta_0) \le S(\theta_0) + \Delta_n$, so that $\Delta_n \ge \delta/4$, and the probability of this event tends to 0 since $\Delta_n$ tends to 0 in probability.

Proof of Theorem 5 (asymptotic normality)
Taylor's theorem and the definition of $\hat\theta_n$ give $0 = \dot S_n(\hat\theta_n) = \dot S_n(\theta_0) + \ddot S_n(\theta_n^*)(\hat\theta_n - \theta_0)$, where $\theta_n^*$ lies in the line segment with extremities $\theta_0$ and $\hat\theta_n$. Equivalently, we have $\ddot S_n(\theta_n^*)\,\sqrt{n}(\hat\theta_n - \theta_0) = -\sqrt{n}\,\dot S_n(\theta_0)$.

Step 1 - Let us prove that $\sqrt{n}\,\dot S_n(\theta_0)$ converges in distribution to $\mathcal{N}(0, V)$. We remind by Lemma 3 that $J_l(\theta_0) = 0$. Hence we can break down $\dot S_n(\theta_0)$ in the following way: $\dot S_n(\theta_0) = A_n + B_n$. Note that $A_n$ and $B_n$ are centered variables. Let us show that $\sqrt{n} A_n = o_P(1)$. Note that the variables $W_{jk}$, $k < j$, are centered and uncorrelated. Then, using Lemma 11, there exists C > 0 such that $E\|\sqrt{n} A_n\|^2 \le C/n$. Finally, invoking the Markov inequality, we have that $\sqrt{n} A_n = o_P(1)$. We can write $\sqrt{n} B_n$ as a normalized sum of the i.i.d. centered variables $U_k(\theta_0)$. Invoking the central limit theorem, we have that $\sqrt{n} B_n$ converges in distribution to $\mathcal{N}(0, V)$, where V/4 is the covariance matrix of $U_1(\theta_0)$, equal to $E(U_1(\theta_0) U_1(\theta_0)^\top)$.
Step 2 - Let us prove that $\ddot S_n(\theta_n^*)$ tends to $\ddot S(\theta_0)$ in probability. We write the decomposition $\ddot S_n(\theta_n^*) - \ddot S(\theta_0) = \big(\ddot S_n(\theta_n^*) - \ddot S_n(\theta_0)\big) + \big(\ddot S_n(\theta_0) - E\ddot S_n(\theta_0)\big)$. Due to the Lipschitz property of $\ddot S_n$ stated in Lemma 13, the first term satisfies $\|\ddot S_n(\theta_n^*) - \ddot S_n(\theta_0)\|_F \le C\|\theta_n^* - \theta_0\|$, which tends to 0 in probability. Last, let us focus on the term $\|\ddot S_n(\theta_0) - E\ddot S_n(\theta_0)\|_F$. We recall the expression of $\ddot S_n(\theta_0)$; from now on, we drop the indices l and $\theta_0$ to simplify the notation. We center the variables in order to obtain uncorrelatedness. Using the weak law of large numbers for uncorrelated centered variables, we obtain that $\|\ddot S_n(\theta_0) - E\ddot S_n(\theta_0)\|_F \xrightarrow{P} 0$, which completes Step 2. Finally, it is sufficient to apply Slutsky's lemma to obtain the theorem.
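To summarize how Steps 1 and 2 combine (our rephrasing, assuming $A := \ddot S(\theta_0)$ is invertible): since $\dot S_n(\hat\theta_n) = 0$,
\[
\sqrt{n}\,(\hat\theta_n - \theta_0) \;=\; -\bigl[\ddot S_n(\theta_n^*)\bigr]^{-1}\sqrt{n}\,\dot S_n(\theta_0)
\;\xrightarrow{d}\; \mathcal{N}\bigl(0,\; A^{-1} V A^{-1}\bigr),
\]
by Step 1 ($\sqrt{n}\,\dot S_n(\theta_0) \to \mathcal{N}(0, V)$), Step 2 ($\ddot S_n(\theta_n^*) \to A$ in probability) and Slutsky's lemma.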

Proposition 16. Consider the notation and assumptions of Theorem 5 and let $\hat V_n$ be the empirical counterpart of V built from the variables $U_k$ evaluated at $\hat\theta_n$. Then $\hat V_n$ tends almost surely toward V when n tends to +∞.
Thus we obtain a consistent estimator of V (which allows us to estimate the covariance matrix Σ). Nevertheless, this estimator is biased. Notice that we could also prove (with some additional technicalities in the following proof, about the uniform convergence in k) that the analogous quantity tends almost surely toward V. However, we lose the "unbiased" property when replacing $\theta_0$ by $\hat\theta_n$.

Proof of Proposition 16
where the convergence is almost sure. Moreover, by the strong law of large numbers, we also have an almost sure convergence, where we have dropped the $\theta_0$ for the sake of simplicity. This convergence is uniform in k in the following sense: there exists a set of probability 1 on which, for any ε > 0, there exists N ≥ 1 such that the bound holds for all n ≥ N and for all 1 ≤ k ≤ n. Then we use the following lemma: "If $v_{nk} \to v_k$ uniformly, with $(v_{nk})$ and $(v_k)$ bounded, and if $n^{-1}\sum_{k=1}^n v_k \to v$, then $n^{-1}\sum_{k=1}^n v_{nk} \to v$." To prove this lemma, notice that, for a given positive ε and for n large enough, $\big|n^{-1}\sum_{k=1}^n v_{nk} - n^{-1}\sum_{k=1}^n v_k\big| \le \varepsilon$. That provides the announced convergence, where the convergence is almost sure. Here all the $Z_k$'s depend on $\theta_0$, but we can use the consistency of $\hat\theta_n$ to finally assert the claimed convergence.

Proof of Proposition 6
We use the proof of Theorem 5. We have seen that $\ddot S_n(\theta_n^*)(\hat\theta_n - \theta_0) = -\dot S_n(\theta_0)$, with $\theta_n^*$ in the line segment with extremities $\theta_0$ and $\hat\theta_n$. Recall that $\dot S_n(\theta_0) = A_n + B_n$, where $B_n$ is an average of the i.i.d. centered variables $U_k(\theta_0)$ built from $\dot J_l(\theta_0)$ and $Z_l^k(\theta_0)$. Note that the $U_k(\theta_0)$'s are i.i.d. and centered, so that, using Lemma 11, $\mathrm{Var}(U_{1j}(\theta_0)) \le c_1$ for each coordinate j, where $c_1$ is a numerical constant. Thus $E[B_{n,j}^2] \le 4c_1/n$ for each coordinate j. In the same way, the variables $W_{jk}$, $k < j$, are centered and uncorrelated, and also bounded, so that $E\|A_n\|^2$ is bounded similarly.
In the proof of Theorem 5, we noted that $\ddot S_n(\theta_n^*)$ tends to $\ddot S(\theta_0)$ in probability. Actually, we can prove that the convergence is almost sure. Indeed, the strong law of large numbers holds for uncorrelated variables if their second moments have a common bound (see e.g. Chung and Zhong (2001)), so that $\ddot S_n(\theta_0) \to \ddot S(\theta_0)$ almost surely. Since $\ddot S_n$ is continuous, it is sufficient to show that the convergence of $\hat\theta_n$ towards $\theta_0$ is almost sure, and this will imply that $\ddot S_n(\theta_n^*)$ converges almost surely towards $\ddot S_n(\theta_0)$ (recall that $\theta_n^*$ lies in the line segment with extremities $\theta_0$ and $\hat\theta_n$). To do this, remark first that $S_n(\theta) - S(\theta) \to 0$ almost surely, by the strong law of large numbers for uncorrelated variables again (see the decomposition of $S_n - S$ in the proof of Proposition 14). Now, we come back to the proof of Lemma 15 (in the case of a unique minimum $\theta_0$), with this new assumption that $S_n(\theta)$ tends to $S(\theta)$ almost surely. The proof shows that for any ε > 0 there exist δ(ε) > 0 and $\Delta_n(\varepsilon)$, which tends to 0 almost surely, such that $\{\|\hat\theta_n - \theta_0\| \ge \varepsilon\} \subset \{\Delta_n(\varepsilon) \ge \delta(\varepsilon)/4\}$. Consider now the event on which $\Delta_n(1/p) \to 0$ for every integer p ≥ 1. This set has probability 1 and, on this set, for any ε > 0, taking p ≥ 1/ε, there exists N ≥ 1 such that for any n ≥ N, $\Delta_n(1/p) < \delta(1/p)/4$, and then $\|\hat\theta_n - \theta_0\| < 1/p \le \varepsilon$.
Lemma 17. Let λ > 0 and let $\mathcal{L}$ be a finite set of resolution levels, and define $\hat L$ as above. Then, for all 0 < γ < 1/2, the following bound holds.

Proof. We recall that the dot product $\langle f, g\rangle$ means $\frac{1}{2\pi}\int f(x)g(x)\,dx$ and that $\|\cdot\|_2$ is the associated norm. Usual Fourier analysis gives, for any L, a decomposition where we denote by $f^L$ the sequence in $\mathbb{C}^{\mathbb{Z}}$ such that $(f^L)_l = f_l$ if $-L \le l \le L$ and 0 otherwise. Now let L be an arbitrary resolution level in $\mathcal{L}$. Using the definition of $\hat L$, we compare the penalized criteria at L and $\hat L$. But, denoting by $\|\cdot\|$ the natural norm of $\ell^2(\mathbb{C}^{\mathbb{Z}})$, we can bound the cross term, where $L \vee \hat L = \max(L, \hat L)$. Thus $|\nu_n|^2 \le 3|\nu_{n,1}|^2 + 3|\nu_{n,2}|^2 + 3|\nu_{n,3}|^2$ and, if $\kappa_1 = \kappa/3$, we obtain a bound where $a_+ = \max(a, 0)$ denotes the positive part of a.

Control of $\nu_{n,3}$. First note that, using the Schwarz inequality (which is also valid for l = 0 since $M_0(\theta_0) = M_0(\theta) = 1$), we can bound $\nu_{n,3}$.